https://github.com/gambolputty/newscorpus
A Python scraping module that extracts text from articles found in RSS feeds. Uses SQLite as its database.
corpus crawler news newsarticles scraper
Last synced: 5 months ago
- Host: GitHub
- URL: https://github.com/gambolputty/newscorpus
- Owner: gambolputty
- License: agpl-3.0
- Created: 2020-01-02T18:22:58.000Z (over 6 years ago)
- Default Branch: main
- Last Pushed: 2024-05-03T21:51:05.000Z (about 2 years ago)
- Last Synced: 2024-05-03T22:46:10.952Z (about 2 years ago)
- Topics: corpus, crawler, news, newsarticles, scraper
- Language: Python
- Homepage:
- Size: 145 KB
- Stars: 16
- Watchers: 1
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# Newscorpus 📰🐍
Takes a list of RSS feeds, downloads found articles, processes them and stores the result in a SQLite database.
This project uses [Trafilatura](https://github.com/adbar/trafilatura) to extract text from HTML pages and [feedparser](https://github.com/kurtmckee/feedparser) to parse RSS feeds.
## Installation
This project uses [Poetry](https://python-poetry.org/) to manage dependencies. Make sure you have it installed.
### Via Poetry
```bash
poetry add "git+https://github.com/gambolputty/newscorpus.git"
```
### Via clone
```bash
# Clone this repository
git clone git@github.com:gambolputty/newscorpus.git
# Install dependencies with poetry
cd newscorpus
poetry install
```
## Configuration
Copy the [example sources file](sources.example.json) and edit it to your liking.
```bash
cp sources.example.json sources.json
```
It is expected to be in the following format:
```json
[
  {
    "id": 0,
    "name": "Example",
    "url": "https://example.com/rss"
  },
  ...
]
```
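Because the scraper reads this file as JSON, a trailing comma or a missing key will cause trouble at startup. The following is a minimal sketch of a pre-flight check, not part of Newscorpus itself: `load_sources` is a hypothetical helper, and the requirement that `id` values be unique is an assumption rather than something the project documents.

```python
import json
from pathlib import Path

# Keys shown in the documented sources format.
REQUIRED_KEYS = {"id", "name", "url"}


def load_sources(path="sources.json"):
    """Load a sources file and check each entry for the documented keys.

    Hypothetical helper for illustration; not part of the newscorpus API.
    """
    # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON,
    # for example on a trailing comma.
    sources = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(sources, list):
        raise ValueError("sources file must contain a JSON array")

    seen_ids = set()
    for entry in sources:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"source {entry!r} is missing keys: {sorted(missing)}")
        # Assumption: ids are meant to be unique across sources.
        if entry["id"] in seen_ids:
            raise ValueError(f"duplicate source id: {entry['id']}")
        seen_ids.add(entry["id"])
    return sources
```

Calling `load_sources("sources.json")` before starting a scrape surfaces formatting problems early instead of partway through a run.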
## Usage
### Starting the scraper (CLI)
To start the scraping process run:
```bash
poetry run scrape [OPTIONS]
```
#### Options (optional)
| Option | Default | Description |
|--------------------|-----------------------------------|------------------------------------------------------------------------------|
| --src-path | `sources.json` | Path to a `sources.json`-file. |
| --db-path | `newscorpus.db` | Path to the SQLite database to use. |
| --debug | _none_ (flag) | Show debug information. |
| --workers | `4` | Number of download workers. |
| --keep | `2` | Don't save articles older than n days. |
| --min-length | `350` | Don't process articles with a text length smaller than x characters. |
| --help | _none_ (flag) | Show help menu. |
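To make the `--keep` and `--min-length` descriptions concrete, here is a minimal sketch of the filtering they describe. It is an illustration only, not the project's actual implementation: `should_store` is a hypothetical function, and the exact boundary behavior (whether an article exactly `n` days old or exactly `x` characters long is kept) is an assumption.

```python
from datetime import datetime, timedelta, timezone


def should_store(published_at, text, keep_days=2, min_length=350, now=None):
    """Return True if an article passes both filters.

    Hypothetical illustration of --keep and --min-length; defaults mirror
    the documented CLI defaults.
    """
    now = now or datetime.now(timezone.utc)
    # --keep: skip articles older than keep_days days.
    if now - published_at > timedelta(days=keep_days):
        return False
    # --min-length: skip articles whose text is shorter than min_length.
    return len(text) >= min_length
```

With the defaults, an article published three days ago is rejected regardless of length, and a 349-character article is rejected regardless of age.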
### Accessing the database
Access the database within your Python script:
```python
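from newscorpus.database import Database

# Opens the default database file; pass `path` to use another one.
db = Database()

# Print the title, publication date and text of every stored article.
for article in db.iter_articles():
    print(article.title)
    print(article.published_at)
    print(article.text)
    print()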
```
Arguments to `iter_articles()` are the same as for `rows_where()` in [sqlite-utils](https://sqlite-utils.datasette.io/) ([Docs](https://sqlite-utils.datasette.io/en/stable/python-api.html#listing-rows), [Reference](https://sqlite-utils.datasette.io/en/stable/reference.html#sqlite_utils.db.Queryable.rows_where)).
The `Database` class takes an optional `path` argument to specify the path to the database file.
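Since `iter_articles()` accepts the `rows_where()` arguments, filtering, ordering and limiting can be pushed into the query. The sketch below uses the `where`, `where_args`, `order_by` and `limit` parameters from sqlite-utils. It assumes that the database column is named `published_at` (matching the attribute shown above) and that dates are stored in a format that sorts chronologically; neither is confirmed by this README. `latest_titles` is a hypothetical helper.

```python
def latest_titles(db, since, limit=10):
    """Return titles of up to `limit` articles published on or after `since`.

    Hypothetical helper. `db` is expected to be a newscorpus `Database`.
    Assumes a `published_at` column whose values sort chronologically.
    """
    articles = db.iter_articles(
        where="published_at >= :since",  # named parameter, filled from where_args
        where_args={"since": since},
        order_by="published_at desc",    # newest first
        limit=limit,
    )
    return [article.title for article in articles]


# Usage (requires newscorpus to be installed and a populated database):
#   from newscorpus.database import Database
#   db = Database(path="newscorpus.db")
#   print(latest_titles(db, "2024-05-01"))
```

Passing the value through `where_args` rather than formatting it into the `where` string lets SQLite handle quoting.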
## Acknowledgements
- [IFG-Ticker](https://github.com/beyondopen/ifg-ticker) for some sources
## License
[GNU AFFERO GENERAL PUBLIC LICENSE](LICENSE)