https://github.com/njraladdin/newspapers-com-scraper

# Newspapers.com Scraper

[![npm version](https://badge.fury.io/js/newspapers-com-scraper.svg)](https://badge.fury.io/js/newspapers-com-scraper)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![npm total downloads](https://img.shields.io/npm/dt/newspapers-com-scraper.svg)](https://www.npmjs.com/package/newspapers-com-scraper)

A Node.js scraper for extracting article data from Newspapers.com based on keywords, dates, and locations.

```javascript
const scraper = new NewspaperScraper();

await scraper.retrieve({
    keyword: "elon musk twitter",
    limit: 500,
    dateRange: [2020, 2024],
    location: "us"
});
```

## What it Does

Searches Newspapers.com and extracts:

- Newspaper title
- Page number and URL
- Publication date
- Location
- Number of keyword matches on each page

[Sample Output](https://docs.google.com/spreadsheets/d/1uq366pyEfolITFZ9X507ogsQjssx_pL1bp1pGPIPtt4/edit?gid=0#gid=0)
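
Since the sample output is a spreadsheet, one way to get a similar file is to write the collected articles out as CSV. The following is a minimal sketch, not part of the package: `articlesToCsv` and `escapeCsv` are hypothetical helpers, and the column names are assumed from the example `article` payload in the Events section.

```javascript
// Column order follows the article fields shown in the Events section.
const COLUMNS = ['title', 'pageNumber', 'date', 'location', 'keywordMatches', 'url'];

// Quote a value when it contains a comma, a double quote, or a line break,
// doubling any embedded quotes (the usual CSV convention).
function escapeCsv(value) {
    const text = value === undefined || value === null ? '' : String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Turn an array of article objects into CSV text with a header row.
function articlesToCsv(articles) {
    const header = COLUMNS.join(',');
    const rows = articles.map((article) =>
        COLUMNS.map((column) => escapeCsv(article[column])).join(',')
    );
    return [header, ...rows].join('\n');
}

// Sample record copied from the Events section of this README.
const sampleArticles = [{
    title: 'The Daily News',
    pageNumber: 4,
    date: '2023-05-15',
    location: 'New York, NY',
    keywordMatches: 3,
    url: 'https://www.newspapers.com/image/12345678/'
}];

console.log(articlesToCsv(sampleArticles));
```

The location value contains a comma, so it is the one field that ends up quoted in the output.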

## Requirements

- Node.js 14+
- Google Chrome browser
- GEONODE.com account (optional, for proxy support)

## Installation

```bash
npm install newspapers-com-scraper
```

## Basic Usage

```javascript
const NewspaperScraper = require('newspapers-com-scraper');

async function main() {
    const scraper = new NewspaperScraper();

    // Listen for articles as they are found
    scraper.on('article', (article) => {
        console.log(`Found: ${article.title} (${article.date})`);
    });

    await scraper.retrieve({
        keyword: "elon musk twitter", // Required: search term
        limit: 500,                   // Optional: limit total results
        dateRange: [2020, 2024],      // Optional: date range
        location: "us"                // Optional: location code
    });
}

// Run the example and report any failure
main().catch(console.error);
```
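
If you want the results as an array rather than handling them one at a time, the listener can be wrapped in a small helper. This is a sketch under two assumptions that the README does not confirm: that `retrieve()` resolves only after every `article` event has fired, and that the scraper is a standard Node.js `EventEmitter` (so `off` is available). `collectArticles` is a hypothetical helper, and `FakeScraper` is a stand-in that lets the example run without Chrome or network access.

```javascript
const { EventEmitter } = require('events');

// Collect every emitted article and return the array once retrieve() settles.
async function collectArticles(scraper, query) {
    const articles = [];
    const onArticle = (article) => articles.push(article);
    scraper.on('article', onArticle);
    try {
        await scraper.retrieve(query);
    } finally {
        // Detach the listener so repeated runs do not double-count.
        scraper.off('article', onArticle);
    }
    return articles;
}

// Hypothetical stand-in for NewspaperScraper: emits two made-up articles.
class FakeScraper extends EventEmitter {
    async retrieve({ keyword }) {
        this.emit('article', { title: 'The Daily News', date: '2023-05-15', keyword });
        this.emit('article', { title: 'Evening Post', date: '2023-05-16', keyword });
    }
}

collectArticles(new FakeScraper(), { keyword: 'elon musk twitter' })
    .then((articles) => console.log(`Collected ${articles.length} articles`));
```

With the real package, you would pass a `NewspaperScraper` instance in place of `FakeScraper`.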

## Events

The scraper emits three types of events:

```javascript
// 1. Article found
scraper.on('article', (article) => {
    console.log(article);
    // {
    //   title: "The Daily News",
    //   pageNumber: 4,
    //   date: "2023-05-15",
    //   location: "New York, NY",
    //   keywordMatches: 3,
    //   url: "https://www.newspapers.com/image/12345678/"
    // }
});

// 2. Progress update
scraper.on('progress', (progress) => {
    console.log(progress);
    // {
    //   current: 5,          // Current page
    //   total: 20,           // Total pages
    //   percentage: 25.0,    // Progress percentage
    //   stats: {
    //     timeElapsed: 45.2, // Total seconds
    //     avgPageTime: 9.04  // Avg seconds per page
    //   }
    // }
});

// 3. Scraping complete
scraper.on('complete', (stats) => {
    console.log(stats);
    // {
    //   timeElapsed: 180.5,     // Total seconds
    //   pageTimes: [8.2, 9.1]   // Time per page
    // }
});
```
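
The `progress` payload carries enough information to estimate the time remaining. The sketch below multiplies the pages left by the average page time. `estimateRemainingSeconds` and `formatProgress` are hypothetical helpers, and the calculation assumes `current` counts pages already finished, which the README does not state explicitly.

```javascript
// Estimate the seconds remaining: pages left times average seconds per page.
function estimateRemainingSeconds(progress) {
    const pagesLeft = Math.max(progress.total - progress.current, 0);
    return pagesLeft * progress.stats.avgPageTime;
}

// Build a one-line status string suitable for logging.
function formatProgress(progress) {
    const eta = estimateRemainingSeconds(progress).toFixed(1);
    return `Page ${progress.current}/${progress.total} ` +
        `(${progress.percentage.toFixed(1)}%), about ${eta}s remaining`;
}

// Sample payload copied from the progress example above.
const sampleProgress = {
    current: 5,
    total: 20,
    percentage: 25.0,
    stats: { timeElapsed: 45.2, avgPageTime: 9.04 }
};

console.log(formatProgress(sampleProgress));
```

In practice the formatter would be called from inside the handler, for example `scraper.on('progress', (p) => console.log(formatProgress(p)))`.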

## Advanced Configuration

Full configuration example:

```javascript
const scraper = new NewspaperScraper({
    // Scraping settings
    concurrentPages: 2,        // Pages to scrape in parallel
    resultsPerPage: 50,        // Results per page (max 50)
    maxConcurrentRequests: 10, // Max parallel requests

    // Browser settings
    browser: {
        headless: false, // Show browser
        userAgent: 'Mozilla/5.0...',
        executablePath: '/path/to/chrome' // Optional
    },

    // Proxy settings (optional)
    proxy: {
        enabled: false,
        host: 'proxy.host',
        port: 9008,
        username: 'user',
        password: 'pass'
    },

    // Logging
    logger: {
        level: 'info', // 'error' | 'warn' | 'info' | 'debug' | 'silent'
        custom: null   // Custom logger
    }
});

await scraper.retrieve({
    keyword: "elon musk twitter",
    limit: 500,
    dateRange: [2020, 2024],
    location: "us"
});

// If using proxy, set up .env:
// PROXY_HOST=your_geonode_proxy_host
// PROXY_USER=your_geonode_username
// PROXY_PASS=your_geonode_password
```
See `examples/main.js` for a complete working example.
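
The README does not say whether the package reads the `.env` values itself. If you prefer to pass them explicitly, a helper along these lines could build the `proxy` option. This is a sketch: `proxyConfigFromEnv` is a hypothetical function, and `PROXY_PORT` is an assumed extra variable (not listed above) that falls back to the `9008` used in the configuration example.

```javascript
// Build the `proxy` option from environment variables.
// PROXY_HOST, PROXY_USER and PROXY_PASS come from the .env comment above;
// PROXY_PORT is a hypothetical addition.
function proxyConfigFromEnv(env = process.env) {
    const { PROXY_HOST, PROXY_USER, PROXY_PASS, PROXY_PORT } = env;
    // Enable the proxy only when all three credentials are present.
    const enabled = Boolean(PROXY_HOST && PROXY_USER && PROXY_PASS);
    return {
        enabled,
        host: PROXY_HOST || '',
        port: Number(PROXY_PORT) || 9008,
        username: PROXY_USER || '',
        password: PROXY_PASS || ''
    };
}

const proxy = proxyConfigFromEnv({
    PROXY_HOST: 'proxy.host',
    PROXY_USER: 'user',
    PROXY_PASS: 'pass'
});
console.log(`Proxy enabled: ${proxy.enabled}`);
```

The result could then be passed as `new NewspaperScraper({ proxy })`. To load a `.env` file into `process.env` first, a loader such as the `dotenv` package is one option.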