{"id":22173507,"url":"https://github.com/jlumbroso/basic-git-scraper-template","last_synced_at":"2025-10-12T19:30:47.773Z","repository":{"id":87028287,"uuid":"526398893","full_name":"jlumbroso/basic-git-scraper-template","owner":"jlumbroso","description":"🔬 Starter template for automating web scrapers using GitHub Actions workflows to incrementally commit data to Git 📈 Includes sample script, scheduling, dependency installation, output to CSV/JSON, and ethics guide 🤖 Customizable for diverse sites and use cases!","archived":false,"fork":false,"pushed_at":"2024-03-03T18:39:59.000Z","size":456,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T18:50:14.083Z","etag":null,"topics":["git-scraping","github-template","template","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jlumbroso.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-18T23:08:53.000Z","updated_at":"2025-02-22T22:09:43.000Z","dependencies_parsed_at":"2024-12-02T07:44:10.972Z","dependency_job_id":null,"html_url":"https://github.com/jlumbroso/basic-git-scraper-template","commit_stats":null,"previous_names":[],"tags_count":0,"template":true,"template_full_name":null,"purl":"pkg:github/jlumbroso/basic-git-scraper-template","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Fbasic-git-scraper-template","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbros
o%2Fbasic-git-scraper-template/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Fbasic-git-scraper-template/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Fbasic-git-scraper-template/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jlumbroso","download_url":"https://codeload.github.com/jlumbroso/basic-git-scraper-template/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jlumbroso%2Fbasic-git-scraper-template/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279012668,"owners_count":26085159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["git-scraping","github-template","template","web-scraping"],"created_at":"2024-12-02T07:33:52.312Z","updated_at":"2025-10-12T19:30:47.412Z","avatar_url":"https://github.com/jlumbroso.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Basic Git Scraper Template\n\nThis template provides a starting point for **git scraping**—the technique of scraping data from websites and automatically committing it to a Git repository using workflows, [coined by Simon 
Willison](https://simonwillison.net/2020/Oct/9/git-scraping/).\n\nGit scraping helps create an audit trail capturing snapshots of data over time. It leverages Git's version control and a continuous integration system's scheduling capabilities to regularly scrape sites and save data without needing to manage servers.\n\nThe key benefit is automating web scrapers to run on a schedule with little overhead. The scraped data is stored incrementally, so you can review historical changes. This enables use cases like monitoring prices, tracking content updates, building research datasets, and more. Because these resources are available for virtually free, the technique suits a wide range of projects.\n\nTools like GitHub Actions, GitLab CI and others make git scraping adaptable to diverse sites and data needs. The scraping logic just needs to output data in serialized formats like CSV or JSON, which then get committed back to Git. This makes the data easily consumable downstream for analysis and visualization.\n\nThis template includes a sample workflow to demonstrate the core git scraping capabilities. Read on to learn how to customize it!\n\n## Overview\n\nThe workflow defined in `.github/workflows/scrape.yaml` runs on a defined schedule to:\n\n1. Check out the code\n2. Set up the Python environment\n3. Install dependencies via Pipenv\n4. Run the Python script `script.py` to scrape data\n5. 
Commit any updated data files to the Git repository\n\n## Scheduling\n\nThe workflow schedule is configured with [cron syntax](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#schedule) to run:\n\n- Every day at 8PM UTC\n\nThis once-daily schedule is a good rule of thumb: it is generally respectful of the target website and places no measurable burden on the site's resources.\n\nYou can use [crontab.guru](https://crontab.guru/) to generate your own cron schedule.\n\n## Python Libraries\n\nThe main libraries used are:\n\n- [`bs4`](https://www.crummy.com/software/BeautifulSoup/) - BeautifulSoup for parsing HTML\n- [`requests`](https://requests.readthedocs.io/en/latest/) - Making HTTP requests to scrape web pages\n- [`loguru`](https://github.com/Delgan/loguru) - Logging errors and run info\n- [`pytz`](https://github.com/stub42/pytz) - Handling datetimes and timezones\n- [`waybackpy`](https://github.com/akamhy/waybackpy/) - Scraping web archives (optional)\n\n## Getting Started\n\nTo adapt this for your own scraping project:\n\n- Use [this template to create your own repository](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template#creating-a-repository-from-a-template)\n- Modify `script.py` to scrape different sites and data points:\n  - Modify the request URL\n  - Parse the HTML with BeautifulSoup to extract relevant data\n  - Process and output the scraped data as CSV, JSON, etc.\n- Update the workflow schedule as needed\n- Output and commit the scraped data to CSV, JSON or other formats\n- Add any additional libraries you need to the `Pipfile`\n- Update this `README.md` with project specifics\n\nFeel free to use this as a starter kit for your Python web scraping projects!\n\n## Setting Up a Local Development Environment\n\nIt is recommended to use a version manager together with virtual environments for local development of Python 
projects.\n\n**asdf** is a version manager that allows you to easily install and manage multiple versions of languages and runtimes like Python. This is useful so you can upgrade or downgrade Python versions without interfering with your system Python.\n\n**Pipenv** creates a **virtual environment** for your project to isolate its dependencies from other projects. This allows you to install packages safely without impacting globally installed packages that other tools or apps may rely on. The virtual environment also enables reproducible builds across different systems.\n\nBelow we detail how to set up these environments to develop this template scraper project locally.\n\n### Setting Up a Python Environment\n\nOnce you have installed `asdf`, you can install the Python plugin with:\n\n```bash\nasdf plugin add python\n```\n\nThen you can install the latest version of Python with:\n\n```bash\nasdf install python latest\n```\n\nAfter that, you can install `pipenv` with:\n\n```bash\npip install pipenv\n```\n\n### Installing Project Dependencies\n\nYou can then install the dependencies with:\n\n```bash\npipenv install --dev\n```\n\nThis will create a virtual environment and install the dependencies from the `Pipfile`. The `--dev` flag will also install the development dependencies, which include `ipykernel` for Jupyter Notebook support.\n\n### Running the Script\n\nYou can then run the script to try it out with:\n\n```bash\npipenv run python script.py\n```\n\n## Licensing\n\nThis software is distributed under the terms of the MIT License. You have the freedom to use, modify, distribute, and sell it for any purpose. 
However, you must include the original copyright notice and the permission notice found in the LICENSE file in all copies or substantial portions of the software.\n\nYou can [read more about the MIT license](https://choosealicense.com/licenses/mit/), and [compare different open-source licenses at `choosealicense.com`](https://choosealicense.com/licenses/).\n\n## Some Ethical Guidelines to Consider\n\nWeb scraping is a powerful tool for gathering data, and its [legality has been upheld in some cases](https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn).\n\nBut it is important to use it responsibly and ethically. Here are some guidelines to consider:\n\n1. Review the website's Terms of Service and [`robots.txt`](https://en.wikipedia.org/wiki/robots.txt) file to understand allowances and restrictions for automated scraping before starting.\n\n2. Avoid scraping copyrighted content verbatim without permission. Summarizing is safer. Use data judiciously under \"fair use\" principles.\n\n3. Do not enable illegal or fraudulent uses of scraped data, and be mindful of security and privacy.\n\n4. Check that your scraping activity does not overload or harm the website's servers. Scale activity gradually.\n\n5. Reflect on whether scraping could unintentionally reveal private user or organizational information from the site.\n\n6. Consider if scraped data could negatively impact the website's value or business model.\n\n7. Assess if decisions made using the data could contribute to bias, discrimination or unfair profiling.\n\n8. Validate the quality of scraped data, and recognize the limitations in relevance and accuracy inherent in web data.\n\n9. Document your scraping process thoroughly for replicability, transparency and accountability.\n\n10. 
Continuously re-evaluate your scraping program against applicable laws and ethical principles.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjlumbroso%2Fbasic-git-scraper-template","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjlumbroso%2Fbasic-git-scraper-template","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjlumbroso%2Fbasic-git-scraper-template/lists"}