https://github.com/ddbourgin/news-scrapers
Simple scrapers for news articles from WaPo, NYT, Buzzfeed, NPR
https://github.com/ddbourgin/news-scrapers
Last synced: about 1 month ago
JSON representation
Simple scrapers for news articles from WaPo, NYT, Buzzfeed, NPR
- Host: GitHub
- URL: https://github.com/ddbourgin/news-scrapers
- Owner: ddbourgin
- Created: 2016-11-24T07:54:40.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-11-29T21:10:28.000Z (over 9 years ago)
- Last Synced: 2025-03-20T10:19:22.045Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## Installation
The scrapers use [PhantomJS](http://phantomjs.org/) to render the Javascript in some of the search pages. If you already use node.js, you can install PhantomJS via npm:
```bash
npm install phantomjs-prebuilt
```
Alternatively, you can install it using Homebrew on OSX:
```bash
brew update
brew install phantomjs
```
Or just download the Linux/OSX/Windows/FreeBSD binaries [here](http://phantomjs.org/download.html).
Once you've installed PhantomJS, clone this repo and install the Python dependencies using pip:
```bash
git clone https://github.com/ddbourgin/news-scrapers.git
cd news-scrapers
pip install -r requirements.txt
```
## Usage
Each scraper can be run from the command-line. To see the available arguments, run `python .py -h`. You can also run the scrapers in tandem using the provided `scrape.sh` shell script.
Scraping occurs in two phases. In the first phase, the scraper compiles a list of article hyperlinks based on the user query and saves them in newline-delimited text file in the `./links` directory. In the second phase the scraper loops over each link identified during phase 1 and extracts the article text, saving the final scraped article collection in a JSON file in the `./scraped_json` directory. The output JSON has the the following format:
```json
{
"articles": [
{
"author": ["Netochka Nezvanova"],
"before_election": false,
"description": "Article 1 lede",
"publishedAt": "2016-11-18T00:00:00+00:00",
"text": "This is the article text.",
"title": "Article 1 Title",
"url": "http://www.nytimes.com/2016/11/18/us/article-1.html",
"urlToImage": null
},
{
"author": ["Rudolph Lingens", "Luther Blissett"],
"before_election": true,
"description": "Article 2 lede",
"publishedAt": "2016-11-05T00:02:00+00:00",
"text": "This is some more article text.",
"title": "Article 2 Title",
"url": "http://www.nytimes.com/2016/11/5/article-2.html",
"urlToImage": null
},
],
"from_last": "30 days",
"pagerange": [1, 5],
"query":"my search query",
"source":"new-york-times",
"status":"ok"
}
```