{"id":16894715,"url":"https://github.com/ddbourgin/news-scrapers","last_synced_at":"2026-05-11T16:34:03.111Z","repository":{"id":134549290,"uuid":"74650413","full_name":"ddbourgin/news-scrapers","owner":"ddbourgin","description":"Simple scrapers for news articles from WaPo, NYT, Buzzfeed, NPR","archived":false,"fork":false,"pushed_at":"2016-11-29T21:10:28.000Z","size":17,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-20T10:19:22.045Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ddbourgin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-11-24T07:54:40.000Z","updated_at":"2023-08-24T03:56:14.000Z","dependencies_parsed_at":"2023-06-18T00:15:23.821Z","dependency_job_id":null,"html_url":"https://github.com/ddbourgin/news-scrapers","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ddbourgin/news-scrapers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddbourgin%2Fnews-scrapers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddbourgin%2Fnews-scrapers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddbourgin%2Fnews-scrapers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddbourgin%2Fnews-scrapers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ddbourgin",
"download_url":"https://codeload.github.com/ddbourgin/news-scrapers/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ddbourgin%2Fnews-scrapers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32903398,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-10T13:40:02.631Z","status":"online","status_checked_at":"2026-05-11T02:00:05.975Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T17:19:42.326Z","updated_at":"2026-05-11T16:34:03.092Z","avatar_url":"https://github.com/ddbourgin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Installation\nThe scrapers use [PhantomJS](http://phantomjs.org/) to render the Javascript in some of the search pages. If you already use node.js, you can install PhantomJS via npm:\n\n```bash\nnpm install phantomjs-prebuilt\n```\n\nAlternatively, you can install it using Homebrew on OSX:\n```bash\nbrew update\nbrew install phantomjs\n```\n\nOr just download the Linux/OSX/Windows/FreeBSD binaries [here](http://phantomjs.org/download.html).\n\nOnce you've installed PhantomJS, clone this repo and install the Python dependencies using pip:\n\n```bash\ngit clone https://github.com/ddbourgin/news-scrapers.git\ncd news-scrapers\npip install -r requirements.txt\n```\n\n## Usage\nEach scraper can be run from the command-line. 
To see the available arguments, run `python \u003cscraper_file\u003e.py -h`. You can also run the scrapers in tandem using the provided `scrape.sh` shell script.\n\nScraping occurs in two phases. In the first phase, the scraper compiles a list of article hyperlinks based on the user query and saves them in a newline-delimited text file in the `./links` directory. In the second phase, the scraper loops over each link identified during phase 1 and extracts the article text, saving the final scraped article collection in a JSON file in the `./scraped_json` directory. The output JSON has the following format:\n\n```json\n{\n    \"articles\": [\n        {\n            \"author\": [\"Netochka Nezvanova\"],\n            \"before_election\": false,\n            \"description\": \"Article 1 lede\",\n            \"publishedAt\": \"2016-11-18T00:00:00+00:00\",\n            \"text\": \"This is the article text.\",\n            \"title\": \"Article 1 Title\",\n            \"url\": \"http://www.nytimes.com/2016/11/18/us/article-1.html\",\n            \"urlToImage\": null\n        },\n        {\n            \"author\": [\"Rudolph Lingens\", \"Luther Blissett\"],\n            \"before_election\": true,\n            \"description\": \"Article 2 lede\",\n            \"publishedAt\": \"2016-11-05T00:02:00+00:00\",\n            \"text\": \"This is some more article text.\",\n            \"title\": \"Article 2 Title\",\n            \"url\": \"http://www.nytimes.com/2016/11/5/article-2.html\",\n            \"urlToImage\": null\n        }\n    ],\n    \"from_last\": \"30 days\",\n    \"pagerange\": [1, 5],\n    \"query\":\"my search query\",\n    \"source\":\"new-york-times\",\n    
\"status\":\"ok\"\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fddbourgin%2Fnews-scrapers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fddbourgin%2Fnews-scrapers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fddbourgin%2Fnews-scrapers/lists"}