{"id":18556593,"url":"https://github.com/valkryst/rss_news_scraper","last_synced_at":"2025-04-10T01:31:19.585Z","repository":{"id":182358055,"uuid":"668371703","full_name":"Valkryst/RSS_News_Scraper","owner":"Valkryst","description":"A script for building a local database of news articles by crawling RSS feeds and downloading the articles.","archived":true,"fork":false,"pushed_at":"2025-03-15T15:45:46.000Z","size":45,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-24T14:09:54.582Z","etag":null,"topics":["news","news-aggregator","rss","rss-scraper","scraper","scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Valkryst.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"Valkryst"}},"created_at":"2023-07-19T16:41:11.000Z","updated_at":"2025-03-15T15:54:34.000Z","dependencies_parsed_at":"2023-10-16T22:31:56.219Z","dependency_job_id":"8bd2d713-92af-4fd3-89d7-b32e8a21b23d","html_url":"https://github.com/Valkryst/RSS_News_Scraper","commit_stats":null,"previous_names":["valkryst/rss_news_scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valkryst%2FRSS_News_Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valkryst%2FRSS_News_Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valkryst%2FRSS_News_Scraper
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valkryst%2FRSS_News_Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Valkryst","download_url":"https://codeload.github.com/Valkryst/RSS_News_Scraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248140324,"owners_count":21054279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["news","news-aggregator","rss","rss-scraper","scraper","scraping"],"created_at":"2024-11-06T21:32:07.230Z","updated_at":"2025-04-10T01:31:14.572Z","avatar_url":"https://github.com/Valkryst.png","language":"Python","funding_links":["https://github.com/sponsors/Valkryst"],"categories":[],"sub_categories":[],"readme":"This script can be used to build a local database of news articles by crawling RSS feeds and downloading the articles.\n\n* It can be run periodically (e.g. via [cron](https://en.wikipedia.org/wiki/Cron)) to keep the database up to date.\n* It stores the articles in a local folder, in a simple format that can be easily loaded into memory.\n* It does not allow multiple instances to run at the same time.\n* It employs a cache to avoid downloading the same articles multiple times.\n* It uses simple methods to avoid being blacklisted by websites, though no guarantees are made.\n\n## Usage\n\n1. Download `main.py` and `requirements.txt` into a folder. Preferably, create a new folder for this purpose.\n2. Install the required packages by running `pip install -r requirements.txt`.\n3. 
Create an `rss_urls.txt` file and enter the RSS feed URLs you want to crawl. Each URL should be on a new line.\n4. Run `python main.py` to start the crawler, or create a cron job to run it periodically.\n\n## Output\n\nThe crawler will create a `data` folder and store the crawled data in it. Articles are grouped by publication date and stored in `data/articles`, in files named `YYYY-MM-DD.pkl`.\n\nThe `.pkl` files can be loaded as follows:\n\n```python\nimport os\nimport pickle\n\narticles = {}\n\nfor file in os.listdir('data/articles'):\n    with open(os.path.join('data/articles', file), 'rb') as f:\n        articles[file.replace('.pkl', '')] = pickle.load(f)\n```\n\nEach entry in the `articles` dict uses the following format:\n\n```python\n'YYYY-MM-DD': [\n    {\n        'url': 'example.com/articles/1',\n        'title': 'Lorem ipsum dolor sit amet',\n        'body': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.'\n    },\n    {\n        'url': 'example.com/articles/2',\n        'title': 'Lorem ipsum dolor sit amet',\n        'body': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.'\n    }\n    # ...\n]\n```\n\n## Logging\n\nYou can control the logging level by setting the `LOG_LEVEL` environment variable. The default level is `WARNING` and the list of levels can be found [here](https://docs.python.org/3/howto/logging.html).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvalkryst%2Frss_news_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvalkryst%2Frss_news_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvalkryst%2Frss_news_scraper/lists"}