https://github.com/njraladdin/newspapers-com-scraper
A Node.js scraper for extracting article data from Newspapers.com based on keywords, dates, and locations.
https://github.com/njraladdin/newspapers-com-scraper
archive data newspapers scraper scraper-api scraping
Last synced: 10 months ago
JSON representation
A Node.js scraper for extracting article data from Newspapers.com based on keywords, dates, and locations.
- Host: GitHub
- URL: https://github.com/njraladdin/newspapers-com-scraper
- Owner: njraladdin
- Created: 2024-09-27T12:58:56.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-26T17:30:00.000Z (about 1 year ago)
- Last Synced: 2025-04-03T14:39:16.602Z (10 months ago)
- Topics: archive, data, newspapers, scraper, scraper-api, scraping
- Language: JavaScript
- Homepage: https://www.npmjs.com/package/newspapers-com-scraper
- Size: 160 KB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Newspaper.com Scraper
[](https://badge.fury.io/js/newspapers-com-scraper)
[](https://opensource.org/licenses/MIT)
[](https://www.npmjs.com/package/newspapers-com-scraper)
A Node.js scraper for extracting article data from Newspapers.com based on keywords, dates, and locations.
```javascript
const scraper = new NewspaperScraper();
await scraper.retrieve({
keyword: "elon musk twitter",
limit: 500,
dateRange: [2020, 2024],
location: "us"
});
```
## What it Does
Searches Newspapers.com and extracts:
- Newspaper title
- Page number and URL
- Publication date
- Location
- Number of keyword matches on each page
[Sample Output](https://docs.google.com/spreadsheets/d/1uq366pyEfolITFZ9X507ogsQjssx_pL1bp1pGPIPtt4/edit?gid=0#gid=0)
## Requirements
- Node.js 14+
- Google Chrome browser
- GEONODE.com account (optional, for proxy support)
## Installation
```bash
npm install newspapers-com-scraper
```
## Basic Usage
```javascript
const NewspaperScraper = require('newspapers-com-scraper');
async function main() {
const scraper = new NewspaperScraper();
// Listen for articles
scraper.on('article', (article) => {
console.log(`Found: ${article.title} (${article.date})`);
});
await scraper.retrieve({
keyword: "elon musk twitter", // Required: search term
limit: 500, // Optional: limit total results
dateRange: [2020, 2024], // Optional: date range
location: "us" // Optional: location code
});
}
```
## Events
The scraper emits three types of events:
```javascript
// 1. Article found
scraper.on('article', (article) => {
console.log(article);
// {
// title: "The Daily News",
// pageNumber: 4,
// date: "2023-05-15",
// location: "New York, NY",
// keywordMatches: 3,
// url: "https://www.newspapers.com/image/12345678/"
// }
});
// 2. Progress update
scraper.on('progress', (progress) => {
console.log(progress);
// {
// current: 5, // Current page
// total: 20, // Total pages
// percentage: 25.0, // Progress percentage
// stats: {
// timeElapsed: 45.2, // Total seconds
// avgPageTime: 9.04 // Avg seconds per page
// }
// }
});
// 3. Scraping complete
scraper.on('complete', (stats) => {
console.log(stats);
// {
// timeElapsed: 180.5, // Total seconds
// pageTimes: [8.2, 9.1] // Time per page
// }
});
```
## Advanced Configuration
Full configuration example:
```javascript
const scraper = new NewspaperScraper({
// Scraping settings
concurrentPages: 2, // Pages to scrape in parallel
resultsPerPage: 50, // Results per page (max 50)
maxConcurrentRequests: 10, // Max parallel requests
// Browser settings
browser: {
headless: false, // Show browser
userAgent: 'Mozilla/5.0...',
executablePath: '/path/to/chrome' // Optional
},
// Proxy settings (optional)
proxy: {
enabled: false,
host: 'proxy.host',
port: 9008,
username: 'user',
password: 'pass'
},
// Logging
logger: {
level: 'info', // 'error' | 'warn' | 'info' | 'debug' | 'silent'
custom: null // Custom logger
}
});
await scraper.retrieve({
keyword: "elon musk twitter",
limit: 500,
dateRange: [2020, 2024],
location: "us"
});
// If using proxy, set up .env:
// PROXY_HOST=your_geonode_proxy_host
// PROXY_USER=your_geonode_username
// PROXY_PASS=your_geonode_password
```
See `examples/main.js` for a complete working example.