{"id":13586140,"url":"https://github.com/vignif/crawler-google-scholar","last_synced_at":"2025-04-07T14:33:42.785Z","repository":{"id":137237165,"uuid":"209286213","full_name":"vignif/crawler-google-scholar","owner":"vignif","description":"This bot crawls and downloads statistics and pictures from google scholar's researchers.","archived":false,"fork":false,"pushed_at":"2023-04-27T13:28:58.000Z","size":71,"stargazers_count":17,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-02-14T21:23:34.559Z","etag":null,"topics":["crawler","downloading-statistics","google-scholar","indexes","statistics"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vignif.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-18T10:56:54.000Z","updated_at":"2024-08-01T16:32:08.058Z","dependencies_parsed_at":null,"dependency_job_id":"2bed6bdc-14ed-4568-bace-4b3f9dd45534","html_url":"https://github.com/vignif/crawler-google-scholar","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vignif%2Fcrawler-google-scholar","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vignif%2Fcrawler-google-scholar/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vignif%2Fcrawler-google-scholar/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vignif%2F
crawler-google-scholar/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vignif","download_url":"https://codeload.github.com/vignif/crawler-google-scholar/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223285095,"owners_count":17119832,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","downloading-statistics","google-scholar","indexes","statistics"],"created_at":"2024-08-01T15:05:21.054Z","updated_at":"2025-04-07T14:33:42.778Z","avatar_url":"https://github.com/vignif.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# crawler-google-scholar ![](spider.png)\n\nThis repo presents an automatic way of downloading statistics for a set of researchers or professors from Google Scholar.\nGiven a list of researchers as [name surname], it retrieves data from Google Scholar such as {# of publications, h-index, i10-index and others}.\n\nThe project scholarly (https://pypi.org/project/scholarly/) does something similar in a far more structured way, but I wanted to find out a bit more about HTTP requests and their implications.\nCrawling the web is time-consuming, and the number of requests a server accepts is limited and has to be respected!\nOne way to keep the system from sitting idle while the web server responds is to let multiple tasks run simultaneously.\nThe scripts presented here show different ways of getting the same set of information.\n\n## The scripts\n\n`get_stats_serial.py` waits until each task (load 
webpage of researcher X) is completed, and only then proceeds to the next author (Y). This simple approach takes time that grows linearly with the number of researchers, O(N), so it is practical only while the list of researchers is short.\n\n`get_stats_coroutine.py` does not wait for researcher X to finish downloading and requests the next ones right away.\n\nA suitable sleep delay must be set inside each file to avoid rejection by the server. If information is requested too fast, the server will always answer with an [Error 429 Too Many Requests].\n\n### Performance\n| Script      | Downloaded info per second |\n| ----------- | ----------- |\n| `get_stats_serial.py`      | 0.7       |\n| `get_stats_coroutine.py`   | 0.05        |\n\n## Use\nThe input must be an .xlsx file with two columns [surname, name]\n- Run **get_stats_coroutine.py**\n- Output: **stats.txt**\n\nRecommended script for downloading stats: **get_stats_coroutine.py**\n\nRecommended script for downloading profile images: **get_picts.py**\n\n*Image courtesy of icons8.com*\n\n## License\n\n[![License](http://img.shields.io/:license-mit-blue.svg?style=flat-square)](http://badges.mit-license.org)\n\n**[MIT license](http://opensource.org/licenses/mit-license.php)**\n- Copyright 2023 © Francesco Vigni\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvignif%2Fcrawler-google-scholar","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvignif%2Fcrawler-google-scholar","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvignif%2Fcrawler-google-scholar/lists"}