https://github.com/lkstrp/newspaper-scraper
The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!
- Host: GitHub
- URL: https://github.com/lkstrp/newspaper-scraper
- Owner: lkstrp
- License: MIT
- Created: 2023-03-02T19:57:42.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-05-17T03:04:49.000Z (over 1 year ago)
- Last Synced: 2024-10-31T22:51:42.663Z (about 2 months ago)
- Topics: news, newspaper, nlp, parser, scraper
- Language: Python
- Size: 77.1 KB
- Stars: 20
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Newspaper-Scraper
##### The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!
[PyPI](https://pypi.org/project/newspaper-scraper/)

## Intro
While tools like [newspaper3k](https://newspaper.readthedocs.io/en/latest/) and [goose3](https://github.com/goose3/goose3) can extract articles from news websites, they require a direct article URL to fetch older articles and do not support paywalled content. This package aims to solve both issues by providing a unified interface for indexing, extracting, and processing newspaper articles.
1. Indexing: Index articles from a newspaper website, using the [beautifulsoup](https://beautiful-soup-4.readthedocs.io/en/latest/) package for public articles and [selenium](https://selenium-python.readthedocs.io/) for paywalled content.
2. Extraction: Extract article content using the [goose3](https://github.com/goose3/goose3) package.
3. Processing: Process articles for NLP features using the [spaCy](https://spacy.io/) package.
The indexing functionality is based on a dedicated file for each newspaper. A few newspapers are already supported, but it is easy to add new ones.
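The idea behind a per-newspaper file can be sketched with a small, self-contained example. Note that the class, the URL prefix heuristic, and the sample HTML below are all hypothetical illustrations, not the package's actual interface; a real indexing file would fetch live archive pages and typically use beautifulsoup or selenium instead of the standard library parser used here:

```python
from html.parser import HTMLParser

class ArticleLinkParser(HTMLParser):
    """Collect hrefs that look like article URLs (hypothetical prefix heuristic)."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Keep only anchors whose href starts with the newspaper's article prefix.
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(self.prefix):
                self.links.append(href)

# Index a (static) archive page: feed its HTML and read back the article links.
parser = ArticleLinkParser(prefix="https://example.com/politik/")
parser.feed(
    '<a href="https://example.com/politik/article-1.html">Article</a>'
    '<a href="https://example.com/impressum">Imprint</a>'
)
print(parser.links)  # only the article link survives the prefix filter
```

A new newspaper would then mostly amount to supplying the right archive URLs and link-filtering rules.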
### Supported Newspapers
| Newspaper | Country | Time span | Number of articles |
|-----------|---------|-----------|--------------------|
| [Der Spiegel](https://www.spiegel.de/) | Germany | Since 2000 | tbd |
| [Die Welt](https://www.welt.de/) | Germany | Since 2000 | tbd |
| [Bild](https://www.bild.de/) | Germany | Since 2006 | tbd |
| [Die Zeit](https://www.zeit.de/) | Germany | Since 1946 | tbd |
| [Handelsblatt](https://www.handelsblatt.com/) | Germany | Since 2003 | tbd |
| [Der Tagesspiegel](https://www.tagesspiegel.de/) | Germany | Since 2000 | tbd |
| [Süddeutsche Zeitung](https://www.sueddeutsche.de/) | Germany | Since 2001 | tbd |

## Setup
It is recommended to install the package in a dedicated Python environment.
To install the package via pip, run the following command:
```bash
pip install newspaper-scraper
```
To also include the NLP processing functionality (via [spaCy](https://spacy.io/)), run the following command (quoted so that shells such as zsh do not interpret the square brackets):
```bash
pip install "newspaper-scraper[nlp]"
```
## Usage
To index, extract and process all public and premium articles from [Der Spiegel](https://www.spiegel.de/), published in August 2021, run the following code:
```python
import newspaper_scraper as nps
from credentials import username, password  # your own module holding the login data

with nps.Spiegel(db_file='articles.db') as news:
    news.index_articles_by_date_range('2021-08-01', '2021-08-31')
    news.scrape_public_articles()
    news.scrape_premium_articles(username=username, password=password)
    news.nlp()
```
This will create a SQLite database file called `articles.db` in the current working directory. The database contains the following tables:
- `tblArticlesIndexed`: Contains all indexed articles with their scraping/processing status and whether they are public or premium content.
- `tblArticlesScraped`: Contains metadata for all parsed articles, provided by goose3.
- `tblArticlesProcessed`: Contains NLP features of the cleaned article text, provided by spaCy.
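Once the pipeline has run, the tables can be inspected with Python's built-in `sqlite3` module. The snippet below uses an in-memory stand-in for `articles.db` with a hypothetical minimal schema (the real column layout is defined by the package), just to show the query pattern:

```python
import sqlite3

# Stand-in for articles.db; the column names here are illustrative only.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tblArticlesIndexed (
    url TEXT PRIMARY KEY, is_premium INTEGER, scraped INTEGER)""")
con.executemany(
    "INSERT INTO tblArticlesIndexed VALUES (?, ?, ?)",
    [("https://www.spiegel.de/a1", 0, 1),   # public, already scraped
     ("https://www.spiegel.de/a2", 1, 0)],  # premium, still pending
)

# Count indexed articles that still await scraping.
pending = con.execute(
    "SELECT COUNT(*) FROM tblArticlesIndexed WHERE scraped = 0"
).fetchone()[0]
print(pending)  # 1
```

Against the real `articles.db`, the same pattern applies with `sqlite3.connect('articles.db')` and the actual column names.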