{"id":16560830,"url":"https://github.com/shobrook/git-pull","last_synced_at":"2025-03-23T13:32:44.952Z","repository":{"id":57434773,"uuid":"324093199","full_name":"shobrook/git-pull","owner":"shobrook","description":"Parallelized web scraper for Github","archived":false,"fork":false,"pushed_at":"2021-01-03T03:42:06.000Z","size":7871,"stargazers_count":18,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-18T20:54:21.090Z","etag":null,"topics":["github","github-api","github-scraper","parallel","scraper","web-scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shobrook.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-24T07:19:23.000Z","updated_at":"2024-12-30T19:04:07.000Z","dependencies_parsed_at":"2022-09-04T15:23:48.983Z","dependency_job_id":null,"html_url":"https://github.com/shobrook/git-pull","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fgit-pull","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fgit-pull/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fgit-pull/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fgit-pull/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shobrook","download_url":"https://codeload.github.com/shobrook/git-pull/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://githu
b.com","kind":"github","repositories_count":245108329,"owners_count":20562024,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["github","github-api","github-scraper","parallel","scraper","web-scraper"],"created_at":"2024-10-11T20:30:09.400Z","updated_at":"2025-03-23T13:32:41.733Z","avatar_url":"https://github.com/shobrook.png","language":"Python","readme":"# git-pull\n\n**git-pull** is a web scraper for Github. You can use it to scrape –– or, if you will, _pull_ –– data from a Github profile, repo, or file. It's parallelized and designed for anyone who wants to avoid using the Github API (e.g. due to the rate limit). Using it is very simple:\n\n```python\nfrom git_pull import GithubProfile\n\ngh = GithubProfile(\"shobrook\")\ngh.scrape_follower_count() # \u003e\u003e\u003e 168\n```\n\nNote that **git-pull** is _not_ a perfect replacement for the Github API. There's some stuff that it can't scrape (yet), like a repo's commit history or release count.\n\n## Installation\n\nYou can install **git-pull** with `pip`:\n\n```bash\n$ pip install git-pull\n```\n\n## Usage\n\n**git-pull** provides three objects –– `GithubProfile`, `Repo`, and `File` –– each with methods for scraping data. Below are descriptions and usage examples for each object.\n\n#### `GithubProfile(username, num_threads=cpu_count(), scrape_everything=False)`\n\nThis is the master object for scraping data from a Github profile. 
All it requires is the username of the Github user, and from there you can scrape social info for that user and their repos.\n\n**Parameters:**\n\n* **`username` _(str)_:** Github username\n* **`num_threads` _(int, optional (default=multiprocessing.cpu_count()))_:** Number of threads to allocate for splitting up scraping work; default is # of cores in your machine's CPU\n* **`scrape_everything` _(bool, optional (default=False))_:** If `True`, does a \"deep scrape\" and scrapes all social info and repo data for the user (i.e. it calls all the scraper methods listed below and stores the results in properties of the object); if `False`, you have to call individual scraper methods to get the data you want\n\n**Methods:**\n\n* **`scrape_name() -\u003e str`:** Returns the name of the Github user\n* **`scrape_avatar() -\u003e str`:** Returns a URL for the user's profile picture\n* **`scrape_follower_count() -\u003e int`:** Returns the number of followers the user has\n* **`scrape_contribution_graph() -\u003e dict`:** Returns the contribution history for the user as a map of dates (as strings) to commit counts\n* **`scrape_location() -\u003e str`:** Returns the user's location, if available\n* **`scrape_personal_site() -\u003e str`:** Returns the URL of the user's website, if available\n* **`scrape_workplace() -\u003e str`:** Returns the name of the user's workplace, if available\n* **`scrape_repos(scrape_everything=False) -\u003e list`:** Returns list of `Repo` objects for each of the user's repos (both source and forked); if `scrape_everything=True`, then a \"deep scrape\" is performed for each repo\n* **`scrape_repo(repo_name, scrape_everything=False) -\u003e Repo`:** Returns a single `Repo` object for a given repo that the user owns\n\n**Example:**\n\n```python\nfrom git_pull import GithubProfile\n\n# If scrape_everything=True, then all scraped data is stored in object\n# properties\ngh = GithubProfile(\"shobrook\", scrape_everything=True)\ngh.name # \u003e\u003e\u003e 
\"Jonathan Shobrook\"\ngh.avatar # \u003e\u003e\u003e \"https://avatars1.githubusercontent.com/u/18684735?s=460\u0026u=60f797085eb69d8bba4aba80078ad29bce78551a\u0026v=4\"\ngh.repos # \u003e\u003e\u003e [Repo(\"git-pull\"), Repo(\"saplings\"), ...]\n\n# If scrape_everything=False, individual scraper methods have to be called, each\n# of which both returns the scraped data and stores it in the object properties\ngh = GithubProfile(\"shobrook\", scrape_everything=False)\ngh.name # \u003e\u003e\u003e ''\ngh.scrape_name() # \u003e\u003e\u003e \"Jonathan Shobrook\"\ngh.name # \u003e\u003e\u003e \"Jonathan Shobrook\"\n```\n\n#### `Repo(name, owner, num_threads=cpu_count(), scrape_everything=False)`\n\nUse this object for scraping data from a Github repo.\n\n**Parameters:**\n\n* **`name` _(str)_:** Name of the repo to be scraped\n* **`owner` _(str)_:** Username of the owner of the repo\n* **`num_threads` _(int, (optional, default=multiprocessing.cpu_count()))_:** Number of threads to allocate for splitting up scraping work; default is # of cores in your machine's CPU\n* **`scrape_everything` _(bool, (optional, default=False))_:** If `True`, scrapes all metadata for the repo and scrapes files; if `False`, you have to call individual scraper methods to get the data you want\n\n**Methods:**\n\n* **`scrape_topics() -\u003e list`:** Returns list of topics for the repo\n* **`scrape_star_count() -\u003e int`:** Returns number of stars the repo has\n* **`scrape_fork_count() -\u003e int`:** Returns number of times the repo has been forked\n* **`scrape_fork_status() -\u003e bool`:** Returns whether or not the repo is a fork of another one\n* **`scrape_files(scrape_everything=False) -\u003e list`:** Returns a list of `File` objects, each representing a file in the repo; files that aren't programs or documentation files (e.g. 
boilerplate) are not scraped\n* **`scrape_file(file_path, file_type=None, scrape_everything=False) -\u003e File`:** Returns a `File` object given a file path\n\n**Example:**\n\n```python\nfrom git_pull import Repo\n\nrepo = Repo(\"git-pull\", \"shobrook\", scrape_everything=True)\nrepo.topics # \u003e\u003e\u003e [\"web-scraper\", \"github\", \"github-api\", \"parallel\", \"scraper\"]\nrepo.fork_status # \u003e\u003e\u003e False\n```\n\n#### `File(path, repo, owner, scrape_everything=False)`\n\nUse this object for scraping data from a single file inside a Github repo.\n\n**Parameters:**\n\n* **`path` _(str)_:** Path of the file relative to the repo root\n* **`repo` _(str)_:** Name of the repo containing the file\n* **`owner` _(str)_:** Username of the repo's owner\n* **`scrape_everything` _(bool, optional (default=False))_:** If `True`, scrapes the blame history for the file and the file type (i.e. calls the methods listed below)\n\n**Methods:**\n\n* **`scrape_blames() -\u003e dict`:** Returns the blame history for a file as a map of usernames (i.e. 
contributors) to `{\"line_nums\": [1, 2, ...], \"committers\": [...]}` dictionaries, where `\"line_nums\"` is a list of line numbers the user wrote and `\"committers\"` is a list of usernames of contributors the user pair programmed with, if any\n\n**Example:**\n\n```python\nfrom git_pull import File\n\nfile = File(\"git_pull/git_pull.py\", \"git-pull\", \"shobrook\", scrape_everything=True)\nfile.blames # \u003e\u003e\u003e {\"shobrook\": {\"line_nums\": [1, 2, ...], \"committers\": []}}\nfile.raw_url # \u003e\u003e\u003e \"https://raw.githubusercontent.com/shobrook/git-pull/master/git_pull/git_pull.py\"\nfile.type # \u003e\u003e\u003e \"Python\"\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshobrook%2Fgit-pull","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshobrook%2Fgit-pull","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshobrook%2Fgit-pull/lists"}