{"id":17201573,"url":"https://github.com/lkstrp/newspaper-scraper","last_synced_at":"2025-04-13T19:53:19.676Z","repository":{"id":158301860,"uuid":"608818712","full_name":"lkstrp/newspaper-scraper","owner":"lkstrp","description":"The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!","archived":false,"fork":false,"pushed_at":"2023-05-17T03:04:49.000Z","size":79,"stargazers_count":22,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-10T12:38:58.071Z","etag":null,"topics":["news","newspaper","nlp","parser","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lkstrp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-03-02T19:57:42.000Z","updated_at":"2025-01-20T21:06:30.000Z","dependencies_parsed_at":"2023-11-23T22:32:30.335Z","dependency_job_id":"81b886b0-344f-49cb-8e3d-2a90387a3ec4","html_url":"https://github.com/lkstrp/newspaper-scraper","commit_stats":{"total_commits":35,"total_committers":1,"mean_commits":35.0,"dds":0.0,"last_synced_commit":"4cf9e611ab8f43945750d033b14aa048090addbb"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lkstrp%2Fnewspaper-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lkstrp%2Fnewspaper-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lkstrp%2Fnewspaper-scraper/releases","manifests_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lkstrp%2Fnewspaper-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lkstrp","download_url":"https://codeload.github.com/lkstrp/newspaper-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248774159,"owners_count":21159526,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["news","newspaper","nlp","parser","scraper"],"created_at":"2024-10-15T02:12:11.814Z","updated_at":"2025-04-13T19:53:19.655Z","avatar_url":"https://github.com/lkstrp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Newspaper-Scraper  \n  \n##### The all-in-one Python package for seamless newspaper article indexing, scraping, and processing – supports public and premium content!\n\n[\u003cimg src=\"https://img.shields.io/pypi/v/newspaper-scraper.svg\"\u003e](https://pypi.org/project/newspaper-scraper/)\n[\u003cimg src=\"https://img.shields.io/pypi/l/newspaper-scraper.svg\"\u003e](https://pypi.org/project/newspaper-scraper/)\n\u003cimg src=\"https://static.pepy.tech/badge/newspaper-scraper\"\u003e\n\n## Intro  \nWhile tools like [newspaper3k](https://newspaper.readthedocs.io/en/latest/) and [goose3](https://github.com/goose3/goose3) can extract articles from news websites, they need a dedicated article URL for each older article and do not support paywalled content. This package aims to solve these issues by providing a unified interface for indexing, extracting and processing articles from newspapers.  
\n1. Indexing: Index articles from a newspaper website using the [beautifulsoup](https://beautiful-soup-4.readthedocs.io/en/latest/) package for public articles and [selenium](https://selenium-python.readthedocs.io/) for paywalled content.  \n2. Extraction: Extract article content using the [goose3](https://github.com/goose3/goose3) package.  \n3. Processing: Process articles for NLP features using the [spaCy](https://spacy.io/) package.  \n  \nThe indexing functionality is based on a dedicated file for each newspaper. A few newspapers are already supported, but it is easy to add new ones.  \n  \n### Supported Newspapers  \n| Logo | Newspaper                                        | Country | Time span  | Number of articles |  \n| ----------------------------------------------------------------------------------------------------------------------------------------------- |--------------------------------------------------| ------- |------------| --------------- |  \n| \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/4/48/Der_Spiegel_2022_logo.svg/640px-Der_Spiegel_2022_logo.svg.png\" height=\"70\"\u003e | [Der Spiegel](https://www.spiegel.de/)           | Germany | Since 2000 | tbd |  \n| \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/0/0a/Die_Welt_Logo_2015.png\" height=\"70\"\u003e | [Die Welt](https://www.welt.de/)                 | Germany | Since 2000 | tbd |  \n| \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/e/e3/Logo_BILD.svg/1920px-Logo_BILD.svg.png\" height=\"70\"\u003e | [Bild](https://www.bild.de/)                     | Germany | Since 2006 | tbd |  \n| \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/6/60/Die_Zeit-Logo-Bremen.svg\" height=\"70\"\u003e | [Die Zeit](https://www.zeit.de/)                 | Germany | Since 1946 | tbd |   \n| \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/f/ff/Handelsblatt_201x_logo.svg/2880px-Handelsblatt_201x_logo.svg.png\" 
height=\"70\"\u003e | [Handelsblatt](https://www.handelsblatt.com/)    | Germany | Since 2003 | tbd | \n| \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Tagesspiegel_%282022-11-29%29.svg/2880px-Tagesspiegel_%282022-11-29%29.svg.png\" height=\"70\"\u003e | [Der Tagesspiegel](https://www.tagesspiegel.de/) | Germany | Since 2000 | tbd |\n| \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/S%C3%BCddeutsche_Zeitung_Logo.svg/2880px-S%C3%BCddeutsche_Zeitung_Logo.svg.png\" height=\"70\"\u003e | [Süddeutsche Zeitung](https://www.sueddeutsche.de/)    | Germany | Since 2001 | tbd |\n\n## Setup  \nIt is recommended to install the package in a dedicated Python environment.  \nTo install the package via pip, run the following command:  \n  \n```bash  \npip install newspaper-scraper\n```  \n  \nTo also include the NLP extraction functionality (via [spaCy](https://spacy.io/)), run the following command:  \n  \n```bash  \npip install newspaper-scraper[nlp]\n```  \n  \n## Usage  \nTo index, extract and process all public and premium articles from [Der Spiegel](https://www.spiegel.de/) published in August 2021, run the following code:  \n  \n```python  \nimport newspaper_scraper as nps  \nfrom credentials import username, password  \n  \nwith nps.Spiegel(db_file='articles.db') as news:\n    news.index_articles_by_date_range('2021-08-01', '2021-08-31')  \n    news.scrape_public_articles()\n    news.scrape_premium_articles(username=username, password=password)  \n    news.nlp()\n```  \n  \nThis will create an SQLite database file called `articles.db` in the current working directory. The database contains the following tables:  \n- `tblArticlesIndexed`: Contains all indexed articles with their scraping/processing status and whether they are public or premium content.  \n- `tblArticlesScraped`: Contains metadata for all parsed articles, provided by goose3.  
\n- `tblArticlesProcessed`: Contains NLP features of the cleaned article text, provided by spaCy.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flkstrp%2Fnewspaper-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flkstrp%2Fnewspaper-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flkstrp%2Fnewspaper-scraper/lists"}