https://github.com/lkstrp/newspaper-scraper
The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!
- Host: GitHub
- URL: https://github.com/lkstrp/newspaper-scraper
- Owner: lkstrp
- License: MIT
- Created: 2023-03-02T19:57:42.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-05-17T03:04:49.000Z (over 1 year ago)
- Last Synced: 2024-10-31T22:51:42.663Z (about 2 months ago)
- Topics: news, newspaper, nlp, parser, scraper
- Language: Python
- Size: 77.1 KB
- Stars: 20
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Newspaper-Scraper
##### The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!
[PyPI](https://pypi.org/project/newspaper-scraper/)

## Intro
While tools like [newspaper3k](https://newspaper.readthedocs.io/en/latest/) and [goose3](https://github.com/goose3/goose3) can extract articles from news websites, they require a direct article URL to fetch older articles and do not support paywalled content. This package aims to solve both issues by providing a unified interface for indexing, extracting, and processing newspaper articles.
1. Indexing: Index articles from a newspaper website, using the [beautifulsoup](https://beautiful-soup-4.readthedocs.io/en/latest/) package for public articles and [selenium](https://selenium-python.readthedocs.io/) for paywalled content.
2. Extraction: Extract article content using the [goose3](https://github.com/goose3/goose3) package.
3. Processing: Process articles for NLP features using the [spaCy](https://spacy.io/) package.
The indexing functionality is based on a dedicated file for each newspaper. A few newspapers are already supported, but it is easy to add new ones.
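The idea behind a per-newspaper file can be sketched with a small, self-contained example. Note that the class, the URL prefix heuristic, and the sample HTML below are all hypothetical illustrations, not the package's actual interface; a real indexing file would fetch live archive pages and typically use beautifulsoup or selenium instead of the standard library parser used here:

```python
from html.parser import HTMLParser

class ArticleLinkParser(HTMLParser):
    """Collect hrefs that look like article URLs (hypothetical prefix heuristic)."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Keep only anchors whose href starts with the newspaper's article prefix.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self.links.append(href)

# Index a (static) archive page: feed its HTML and read back the article links.
parser = ArticleLinkParser(prefix="https://example.com/politik/")
parser.feed(
    '<a href="https://example.com/politik/article-1.html">Article</a>'
    '<a href="https://example.com/impressum">Imprint</a>'
)
print(parser.links)  # only the article link survives the prefix filter
```

A new newspaper would then mostly amount to supplying the right archive URLs and link-filtering rules.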
### Supported Newspapers
| Newspaper | Country | Time span | Number of articles |
|-----------|---------|-----------|--------------------|
| [Der Spiegel](https://www.spiegel.de/) | Germany | Since 2000 | tbd |
| [Die Welt](https://www.welt.de/) | Germany | Since 2000 | tbd |
| [Bild](https://www.bild.de/) | Germany | Since 2006 | tbd |
| [Die Zeit](https://www.zeit.de/) | Germany | Since 1946 | tbd |
| [Handelsblatt](https://www.handelsblatt.com/) | Germany | Since 2003 | tbd |
| [Der Tagesspiegel](https://www.tagesspiegel.de/) | Germany | Since 2000 | tbd |
| [Süddeutsche Zeitung](https://www.sueddeutsche.de/) | Germany | Since 2001 | tbd |

## Setup
It is recommended to install the package in a dedicated Python environment.
To install the package via pip, run the following command:
```bash
pip install newspaper-scraper
```
To also include the NLP processing functionality (via [spaCy](https://spacy.io/)), run the following command (quoted so that shells such as zsh do not interpret the square brackets):
```bash
pip install "newspaper-scraper[nlp]"
```
## Usage
To index, extract and process all public and premium articles from [Der Spiegel](https://www.spiegel.de/), published in August 2021, run the following code:
```python
import newspaper_scraper as nps
from credentials import username, password  # your own module holding the login data

with nps.Spiegel(db_file='articles.db') as news:
    news.index_articles_by_date_range('2021-08-01', '2021-08-31')
    news.scrape_public_articles()
    news.scrape_premium_articles(username=username, password=password)
    news.nlp()
```
This will create a SQLite database file called `articles.db` in the current working directory. The database contains the following tables:
- `tblArticlesIndexed`: Contains all indexed articles with their scraping/processing status and whether they are public or premium content.
- `tblArticlesScraped`: Contains metadata for all parsed articles, provided by goose3.
- `tblArticlesProcessed`: Contains NLP features of the cleaned article text, provided by spaCy.
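Once the pipeline has run, the tables can be inspected with Python's built-in `sqlite3` module. The snippet below uses an in-memory stand-in for `articles.db` with a hypothetical minimal schema (the real column layout is defined by the package), just to show the query pattern:

```python
import sqlite3

# Stand-in for articles.db; the column names here are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tblArticlesIndexed (
    url TEXT PRIMARY KEY, is_premium INTEGER, scraped INTEGER)""")
con.executemany(
    "INSERT INTO tblArticlesIndexed VALUES (?, ?, ?)",
    [("https://www.spiegel.de/a1", 0, 1),   # public, already scraped
     ("https://www.spiegel.de/a2", 1, 0)],  # premium, still pending
)

# Count indexed articles that still await scraping.
pending = con.execute(
    "SELECT COUNT(*) FROM tblArticlesIndexed WHERE scraped = 0"
).fetchone()[0]
print(pending)  # 1
```

Against the real `articles.db`, the same pattern applies with `sqlite3.connect('articles.db')` and the actual column names.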