https://github.com/gambolputty/newscorpus

A Python scraping module that extracts text from articles found in RSS feeds. Uses SQLite as its database.

Host: GitHub
URL: https://github.com/gambolputty/newscorpus
Owner: gambolputty
License: agpl-3.0
Created: 2020-01-02T18:22:58.000Z (over 6 years ago)
Default Branch: main
Last Pushed: 2024-05-03T21:51:05.000Z (about 2 years ago)
Last Synced: 2024-05-03T22:46:10.952Z (about 2 years ago)
Topics: corpus, crawler, news, newsarticles, scraper
Language: Python
Homepage:
Size: 145 KB
Stars: 16
Watchers: 1
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# Newscorpus 📰🐍

Takes a list of RSS feeds, downloads found articles, processes them and stores the result in a SQLite database.

This project uses [Trafilatura](https://github.com/adbar/trafilatura) to extract text from HTML pages and [feedparser](https://github.com/kurtmckee/feedparser) to parse RSS feeds.

## Installation

This project uses [Poetry](https://python-poetry.org/) to manage dependencies. Make sure you have it installed.

### Via Poetry

```bash
poetry add "git+https://github.com/gambolputty/newscorpus.git"
```

### Via clone

```bash
# Clone this repository
git clone git@github.com:gambolputty/newscorpus.git

# Install dependencies with poetry
cd newscorpus
poetry install
```

## Configuration

Copy the [example sources file](sources.example.json) and edit it to your liking.

```bash
cp sources.example.json sources.json
```

It is expected to be in the following format:

```json
[
  {
    "id": 0,
    "name": "Example",
    "url": "https://example.com/rss"
  },
  ...
]
```
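
Before starting a long scrape it can help to check the file against this format. The following is a minimal sketch, not part of newscorpus: `load_sources` is a hypothetical helper, the required keys are taken from the example above, and the uniqueness check on `id` is an assumption about how the ids are meant to be used.

```python
import json
from pathlib import Path

# Keys taken from the example entry above.
REQUIRED_KEYS = {"id", "name", "url"}


def load_sources(path="sources.json"):
    """Load a sources file and check that every entry matches the example format."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError("sources file must contain a JSON array")

    seen_ids = set()
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not a JSON object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {index} is missing {sorted(missing)}")
        # Assumption: each source is expected to have its own id.
        if entry["id"] in seen_ids:
            raise ValueError(f"duplicate id {entry['id']!r} in entry {index}")
        seen_ids.add(entry["id"])
    return entries
```

Calling `load_sources("sources.json")` returns the parsed list, or raises `ValueError` naming the first entry that does not fit. Note that `json.loads` also rejects trailing commas and the literal `...` placeholder, so both have to be removed from a real file.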

## Usage

### Starting the scraper (CLI)

To start the scraping process run:

```bash
poetry run scrape [OPTIONS]
```

#### Options

| Option         | Default         | Description                                                      |
|----------------|-----------------|------------------------------------------------------------------|
| `--src-path`   | `sources.json`  | Path to a `sources.json` file.                                   |
| `--db-path`    | `newscorpus.db` | Path to the SQLite database to use.                              |
| `--debug`      | _none_ (flag)   | Show debug information.                                          |
| `--workers`    | `4`             | Number of download workers.                                      |
| `--keep`       | `2`             | Don't save articles older than n days.                           |
| `--min-length` | `350`           | Don't process articles whose text is shorter than x characters.  |
| `--help`       | _none_ (flag)   | Show help menu.                                                  |
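
To make the interaction of `--keep` and `--min-length` concrete, here is a small sketch of the two checks as the table describes them. It is an illustration only: `should_store` is a hypothetical function, not the scraper's actual code, and how newscorpus treats the exact boundary values or timezones is an assumption.

```python
from datetime import datetime, timedelta, timezone


def should_store(published_at, text, keep_days=2, min_length=350, now=None):
    """Return True if an article passes both the age check and the length check."""
    now = now or datetime.now(timezone.utc)
    # --keep: skip articles published more than `keep_days` days ago.
    if now - published_at > timedelta(days=keep_days):
        return False
    # --min-length: skip articles whose extracted text is too short.
    if len(text) < min_length:
        return False
    return True
```

With the defaults, a 400-character article published yesterday would be kept, while the same article published three days ago, or a 100-character article published today, would be skipped.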

### Accessing the database

Access the database within your Python script:

```python
from newscorpus.database import Database

db = Database()

for article in db.iter_articles():
    print(article.title)
    print(article.published_at)
    print(article.text)
    print()
```

Arguments to `iter_articles()` are the same as for `rows_where()` in [sqlite-utils](https://sqlite-utils.datasette.io/) ([Docs](https://sqlite-utils.datasette.io/en/stable/python-api.html#listing-rows), [Reference](https://sqlite-utils.datasette.io/en/stable/reference.html#sqlite_utils.db.Queryable.rows_where)).

The `Database` class takes an optional `path` argument to specify the path to the database file.
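
Since `iter_articles()` accepts the `rows_where()` arguments, a filtered query would presumably look like `db.iter_articles(where="published_at >= ?", where_args=["2024-05-02"], order_by="published_at desc", limit=10)`. The sketch below shows what those arguments mean using only the standard library, so it runs without newscorpus installed. The table name `articles`, the three columns, and the ISO 8601 string format for `published_at` are assumptions for illustration; the real schema may differ.

```python
import sqlite3


def rows_where(conn, table, where=None, where_args=None, order_by=None, limit=None):
    """Yield rows as dicts, roughly mirroring the sqlite-utils rows_where() arguments."""
    sql = f"select * from [{table}]"
    if where:
        sql += f" where {where}"
    if order_by:
        sql += f" order by {order_by}"
    if limit is not None:
        sql += f" limit {int(limit)}"
    # where_args fills the "?" placeholders in the where clause.
    cursor = conn.execute(sql, where_args or [])
    columns = [column[0] for column in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


# Hypothetical in-memory table standing in for the real database file.
conn = sqlite3.connect(":memory:")
conn.execute("create table articles (title text, published_at text, text text)")
conn.executemany(
    "insert into articles values (?, ?, ?)",
    [
        ("Older story", "2024-05-01T08:00:00", "Body of the older story."),
        ("Newer story", "2024-05-03T09:30:00", "Body of the newer story."),
    ],
)

recent = list(
    rows_where(
        conn,
        "articles",
        where="published_at >= ?",
        where_args=["2024-05-02"],
        order_by="published_at desc",
        limit=10,
    )
)
print([row["title"] for row in recent])
```

Passing values through `where_args` rather than formatting them into the `where` string keeps the query parameterized, which is the same pattern sqlite-utils documents for `rows_where()`.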

## Acknowledgements

- [IFG-Ticker](https://github.com/beyondopen/ifg-ticker) for some sources

## License

[GNU AFFERO GENERAL PUBLIC LICENSE](LICENSE)
