{"id":37773616,"url":"https://github.com/gambolputty/newscorpus","last_synced_at":"2026-01-16T14:56:39.058Z","repository":{"id":54553438,"uuid":"231436695","full_name":"gambolputty/newscorpus","owner":"gambolputty","description":"A Python scraping module, that extracts text from articles found in RSS feeds. Uses SQLite as database.","archived":false,"fork":false,"pushed_at":"2024-05-03T21:51:05.000Z","size":148,"stargazers_count":16,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-05-03T22:46:10.952Z","etag":null,"topics":["corpus","crawler","news","newsarticles","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gambolputty.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-02T18:22:58.000Z","updated_at":"2024-02-14T16:09:13.000Z","dependencies_parsed_at":"2023-12-26T20:43:34.386Z","dependency_job_id":"e7be9cf8-2b56-499f-80a7-8cc2ff9dfafb","html_url":"https://github.com/gambolputty/newscorpus","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/gambolputty/newscorpus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gambolputty%2Fnewscorpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gambolputty%2Fnewscorpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gambolputty%2Fnewscorpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/gambolputty%2Fnewscorpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gambolputty","download_url":"https://codeload.github.com/gambolputty/newscorpus/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gambolputty%2Fnewscorpus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28479405,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus","crawler","news","newsarticles","scraper"],"created_at":"2026-01-16T14:56:38.364Z","updated_at":"2026-01-16T14:56:39.033Z","avatar_url":"https://github.com/gambolputty.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Newscorpus 📰🐍\n\u003c!-- Description of this project --\u003e\nTakes a list of RSS feeds, downloads found articles, processes them and stores the result in a SQLite database.\n\nThis project uses [Trafilatura](https://github.com/adbar/trafilatura) to extract text from HTML pages and [feedparser](https://github.com/kurtmckee/feedparser) to parse RSS feeds.\n\n\n## Installation\nThis project uses [Poetry](https://python-poetry.org/) to manage dependencies. 
Make sure you have it installed.\n\n### Via Poetry\n```bash\npoetry add \"git+https://github.com/gambolputty/newscorpus.git\"\n```\n\n### Via clone\n```bash\n# Clone this repository\ngit clone git@github.com:gambolputty/newscorpus.git\n\n# Install dependencies with poetry\ncd newscorpus\npoetry install\n```\n\n## Configuration\nCopy the [example sources file](sources.example.json) and edit it to your liking.\n```bash\ncp sources.example.json sources.json\n```\nIt is expected to be in the following format:\n```json\n[\n  {\n    \"id\": 0,\n    \"name\": \"Example\",\n    \"url\": \"https://example.com/rss\"\n  },\n  ...\n]\n```\n\n## Usage\n\n### Starting the scraper (CLI)\nTo start the scraping process, run:\n```bash\npoetry run scrape [OPTIONS]\n```\n\n#### Options (optional)\n\n| Option             | Default                           | Description                                                                  |\n|--------------------|-----------------------------------|------------------------------------------------------------------------------|\n| --src-path         | `sources.json`                    | Path to a `sources.json` file.                                               |\n| --db-path          | `newscorpus.db`                   | Path to the SQLite database to use.                                          |\n| --debug            | _none_ (flag)                     | Show debug information.                                                      |\n| --workers          | `4`                               | Number of download workers.                                                  |\n| --keep             | `2`                               | Don't save articles older than n days.                                       |\n| --min-length       | `350`                             | Don't process articles whose text is shorter than x characters.              |\n| --help             | _none_ (flag)                     | Show help menu.    
                                                          |\n\n### Accessing the database\nAccess the database within your Python script:\n```python\nfrom newscorpus.database import Database\n\ndb = Database()\n\nfor article in db.iter_articles():\n    print(article.title)\n    print(article.published_at)\n    print(article.text)\n    print()\n```\nArguments to `iter_articles()` are the same as for `rows_where()` in [sqlite-utils](https://sqlite-utils.datasette.io/) ([Docs](https://sqlite-utils.datasette.io/en/stable/python-api.html#listing-rows), [Reference](https://sqlite-utils.datasette.io/en/stable/reference.html#sqlite_utils.db.Queryable.rows_where)).\n\nThe `Database` class takes an optional `path` argument to specify the path to the database file.\n\n## Acknowledgements\n- [IFG-Ticker](https://github.com/beyondopen/ifg-ticker) for some sources\n\n## License\n[GNU AFFERO GENERAL PUBLIC LICENSE](LICENSE)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgambolputty%2Fnewscorpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgambolputty%2Fnewscorpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgambolputty%2Fnewscorpus/lists"}