{"id":23426753,"url":"https://github.com/ivbeg/newsworker","last_synced_at":"2026-03-06T15:03:44.474Z","repository":{"id":56757025,"uuid":"141791202","full_name":"ivbeg/newsworker","owner":"ivbeg","description":"Advanced news feeds extractor and finder library. Helps to automatically extract news from websites without RSS/ATOM feeds","archived":false,"fork":false,"pushed_at":"2025-11-26T05:31:31.000Z","size":63,"stargazers_count":80,"open_issues_count":14,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-11-29T06:49:44.572Z","etag":null,"topics":["news","news-scraper","python","rss","rss-feed-parser","scraper"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ivbeg.png","metadata":{"files":{"readme":"README.rst","changelog":"HISTORY.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-07-21T07:31:36.000Z","updated_at":"2025-11-26T05:31:34.000Z","dependencies_parsed_at":"2022-08-16T01:50:33.355Z","dependency_job_id":null,"html_url":"https://github.com/ivbeg/newsworker","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ivbeg/newsworker","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivbeg%2Fnewsworker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivbeg%2Fnewsworker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivbeg%2Fnewsworker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivbeg%2Fnewsworker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ivbeg","download_url":"ht
tps://codeload.github.com/ivbeg/newsworker/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ivbeg%2Fnewsworker/sbom","scorecard":{"id":498368,"data":{"date":"2025-08-11","repo":{"name":"github.com/ivbeg/newsworker","commit":"ae42ea0bfcda19152ee92771a47a35b3f5013307"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.7,"checks":[{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Code-Review","score":0,"reason":"Found 0/9 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security 
policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"12 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2020-231 / GHSA-g8q7-xv52-hf9f","Warn: Project is vulnerable to: PYSEC-2011-18 / GHSA-3mwg-gp5g-fv3q","Warn: Project is vulnerable to: PYSEC-2012-14 / GHSA-hjf3-r7gw-9rwg","Warn: Project is vulnerable to: PYSEC-2011-19","Warn: Project is vulnerable to: PYSEC-2011-20","Warn: Project is vulnerable to: PYSEC-2011-21","Warn: Project is 
vulnerable to: GHSA-55x5-fj6c-h6m8","Warn: Project is vulnerable to: PYSEC-2014-9 / GHSA-57qw-cc2g-pv5p","Warn: Project is vulnerable to: PYSEC-2021-19 / GHSA-jq4v-f5q6-mjqq","Warn: Project is vulnerable to: GHSA-pgww-xf46-h92r","Warn: Project is vulnerable to: PYSEC-2022-230 / GHSA-wrxv-2j5q-m38w","Warn: Project is vulnerable to: PYSEC-2018-12 / GHSA-xp26-p53h-6h2p"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}}]},"last_synced_at":"2025-08-19T21:08:20.828Z","repository_id":56757025,"created_at":"2025-08-19T21:08:20.829Z","updated_at":"2025-08-19T21:08:20.829Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30182686,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T14:42:24.748Z","status":"ssl_error","status_checked_at":"2026-03-06T14:42:14.925Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["news","news-scraper","python","rss","rss-feed-parser","scraper"],"created_at":"2024-12-23T06:01:06.987Z","updated_at":"2026-03-06T15:03:44.466Z","avatar_url":"https://github.com/ivbeg.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"===================================================================\nnewsworker -- advanced automatic news extractor using HTML scraping\n===================================================================\n\n\n.. image:: https://img.shields.io/pypi/v/newsworker.svg?style=flat-square\n    :target: https://pypi.python.org/pypi/newsworker\n    :alt: pypi version\n\n.. image:: https://readthedocs.org/projects/newsworker/badge/?version=latest\n    :target: http://newsworker.readthedocs.org/en/latest/?badge=latest\n    :alt: Documentation Status\n\n\n\n`newsworker` is a Python 3 library that extracts feeds from HTML pages. It's useful when you need to subscribe to news\nfrom a website that doesn't publish RSS/ATOM feeds and you don't want to use page change monitoring tools, since they are not\nso accurate.\n\nThe idea behind the algorithm is simple. Most news pages show a date and time next to each news item.\nThese dates could look like \"2017-09-27\" or \"1 jul 2016\", or take many other forms. 
The first step is to find all dates;\nthe second is to tell a date that merely belongs to the webpage itself from a date that sits in an area of the page dedicated to news.\n\nThis tool finds news by locating news blocks on an HTML page and parsing them for further use.\n\n\nUsage examples\n---------------\n\nExtract news from HTML pages of the Bulgarian government website and the EIB website\n\n    \u003e\u003e\u003e from newsworker.extractor import FeedExtractor\n    \u003e\u003e\u003e f = FeedExtractor(filtered_text_length=150)\n    \u003e\u003e\u003e feed, session = f.get_feed(url=\"http://government.bg/bg/prestsentar/novini\")\n    \u003e\u003e\u003e feed\n    ...\n\n\n    \u003e\u003e\u003e feed, session = f.get_feed(url=\"http://www.eib.org/en/index.htm?lang=en\")\n    \u003e\u003e\u003e feed\n    {'title': 'European Investment Bank (EIB)', 'language': 'en', 'link': 'http://www.eib.org/en/index.htm?lang=en', 'description': 'European Investment Bank (EIB)', 'items': [{'title': 'Blockchain Challenge: coders at the EIB', 'description': 'Blockchain Challenge: coders at the EIB', 'pubdate': datetime.datetime(2018, 6, 18, 0, 0), 'unique_id': 'f9d359f76118076c5331ffec3cdb82eb', 'raw_html': b'\u003cdiv class=\"first-column col-xs-12 col-sm-12 col-md-8 col-lg-8 no-padding-left-right\"\u003e\u003cdiv class=\"video-box no-padding-left-right\"\u003e\u003ca class=\"video-youtube\" href=\"https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1\"\u003e\u003cdiv class=\"img-item-1\" style=\"background-image:url(\\'/img/movies/blockchain-video-hp.png\\');\"\u003e\u003cspan class=\"video-icon\"\u003e\u003cimg height=\"100\" src=\"/img/site/play.png\" width=\"100\"/\u003e\u003c/span\u003e\u003cdiv class=\"video-container\"\u003e\u003cdiv class=\"left-box col-lg-8 col-xs-12\"\u003e\u003cdiv class=\"video-date-time\"\u003e\u003csmall\u003e18/06/2018\u003c/small\u003e\u003cspan class=\"space-separator\"\u003e | 
\u003c/span\u003e\u003csmall\u003e02:12\u003c/small\u003e\u003c/div\u003e\u003cdiv class=\"video-title col-xs-12 col-lg-12 no-padding-left-right\"\u003eBlockchain Challenge: coders at the EIB\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e\u003c/a\u003e\u003c/div\u003e\u003c/div\u003e', 'extra': {'links': ['https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1'], 'images': ['http://www.eib.org/img/site/play.png']}, 'link': 'https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1'}, {'title': 'A brighter life for Kenyan women', 'description': 'Jujuy Verde â€“ new horizons for women waste-pickers in Argentina', 'pubdate': datetime.datetime(2018, 6, 5, 0, 0), 'unique_id': '9caef61535352d2734d122c0e092b011', 'raw_html': b'\u003cdiv class=\"second-column col-xs-12 col-sm-12 col-md-4 col-lg-4 no-padding-left-right\"\u003e\u003cdiv class=\"video-box no-padding-left-right\"\u003e\u003ca class=\"video-youtube  fancybox.iframe\" href=\"https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1\"\u003e\u003cdiv class=\"img-item-2\" style=\"background-image:url(\\'/img/kenya-dlight-video-hp.png\\');\"\u003e\u003cspan class=\"video-icon\"\u003e\u003cimg height=\"100\" src=\"/img/site/play.png\" width=\"100\"/\u003e\u003c/span\u003e\u003cdiv class=\"video-container\"\u003e\u003cdiv class=\"left-box col-lg-8 col-xs-12\"\u003e\u003cdiv class=\"video-date-time\"\u003e\u003csmall\u003e04/06/2018\u003c/small\u003e\u003cspan class=\"space-separator\"\u003e | \u003c/span\u003e\u003csmall\u003e01:32\u003c/small\u003e\u003c/div\u003e\u003cdiv class=\"video-title col-xs-12 col-lg-12 no-padding-left-right\"\u003eA brighter life for Kenyan women\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e\u003c/a\u003e\u003c/div\u003e\u003cdiv class=\"video-box no-padding-left-right\"\u003e\u003ca class=\"video-youtube fancybox.iframe\" href=\"https://www.youtube.com/watch?v=d-btxsYT9hI?autoplay=1\"\u003e\u003cdiv class=\"img-item-3\" 
style=\"background-image:url(\\'/img/jujuy-video-hp.png\\');\"\u003e\u003cspan class=\"video-icon\"\u003e\u003cimg height=\"100\" src=\"/img/site/play.png\" width=\"100\"/\u003e\u003c/span\u003e\u003cdiv class=\"video-container\"\u003e\u003cdiv class=\"left-box col-lg-8 col-xs-12\"\u003e\u003cdiv class=\"video-date-time\"\u003e\u003csmall\u003e05/06/2018\u003c/small\u003e\u003cspan class=\"space-separator\"\u003e | \u003c/span\u003e\u003csmall\u003e03:12\u003c/small\u003e\u003c/div\u003e\u003cdiv class=\"video-title col-xs-12 col-lg-12 no-padding-left-right\"\u003eJujuy Verde \\xc3\\xa2\\xe2\\x82\\xac\\xe2\\x80\\x9c new horizons for women waste-pickers in Argentina\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e\u003c/a\u003e\u003c/div\u003e\u003c/div\u003e', 'extra': {'links': ['https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1', 'https://www.youtube.com/watch?v=d-btxsYT9hI?autoplay=1'], 'images': ['http://www.eib.org/img/site/play.png']}, 'link': 'https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1'}], 'cache': {'pats': ['dt:date:date_1']}}\n\nReuse cached patterns to speed up further news extraction. 
It can greatly improve page parsing speed, since it reduces the number of date comparisons by up to 100x\n(2-3 date patterns instead of 350 patterns)\n    \u003e\u003e\u003e pats = feed['cache']['pats']\n    \u003e\u003e\u003e feed, session = f.get_feed(url=\"http://www.eib.org/en/index.htm?lang=en\", cached_p=pats)\n\nChange user agent if needed\n    \u003e\u003e\u003e USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'\n    \u003e\u003e\u003e feed, session = f.get_feed(url=\"http://www.eib.org/en/index.htm?lang=en\", user_agent=USER_AGENT)\n\n\nInitialize the feed finder\n    \u003e\u003e\u003e from newsworker.finder import FeedsFinder\n    \u003e\u003e\u003e f = FeedsFinder()\nTry to find feeds on a page that publishes none; the result is empty\n    \u003e\u003e\u003e feeds = f.find_feeds('http://government.bg/bg/prestsentar/novini')\n    {'url': 'http://government.bg/bg/prestsentar/novini', 'items': []}\n\nAdding the \"extractrss\" param launches FeedExtractor when no feed is found\n    \u003e\u003e\u003e feeds = f.find_feeds('http://government.bg/bg/prestsentar/novini', extractrss=True)\n    \u003e\u003e\u003e feeds\n    {'url': 'http://government.bg/bg/prestsentar/novini', 'items': [{'feedtype': 'html', 'title': 'Министерски съвет :: Новини', 'num_entries': 12, 'url': 'http://government.bg/bg/prestsentar/novini'}]}\n\nFind all feeds and get more info about them. 
With \"noverify=False\" each feed is parsed\n    \u003e\u003e\u003e feeds = f.find_feeds('https://www.dta.gov.au/news/', noverify=False)\n    \u003e\u003e\u003e feeds\n    {'url': 'https://www.dta.gov.au/news/', 'items': [{'title': 'Digital Transformation Agency', 'url': 'https://www.dta.gov.au/feed.xml', 'feedtype': 'rss', 'num_entries': 10}]}\n\nAdding \"include_entries=True\" returns feeds and all parsed feed entries\n    \u003e\u003e\u003e feeds = f.find_feeds('https://www.dta.gov.au/news/', noverify=False, include_entries=True)\n    \u003e\u003e\u003e feeds\n\n\n\nDocumentation\n=============\n\nDocumentation is built automatically and can be found on\n`Read the Docs \u003chttps://qddate.readthedocs.org/en/latest/\u003e`_.\n\n\nFeatures\n========\n\n* Identifies news blocks on webpages using date patterns. More than 348 date patterns are supported. Uses `qddate \u003chttps://github.com/ivbeg/qddate\u003e`_\n* Extremely fast, uses pyparsing to identify dates on webpages\n* Includes a function that finds feeds on an HTML page and, if no feed is found, extracts news instead\n\nLimitations\n===========\n\n* Not all language-specific dates are supported\n* Right-aligned dates like \"Published - 27-01-2018\" are not supported. Adding them is not hard, but it greatly increases the false acceptance rate.\n* Some news pages have no dates next to their urls or texts. These pages are not supported yet\n\nSpeed optimization\n==================\n\n* `qddate \u003chttps://github.com/ivbeg/qddate\u003e`_ date parsing lib was created for this algorithm. 
Right now pattern matching is really fast\n* date patterns can be cached to speed up parsing of the same website\n* feed finder works fast without verification of feeds, but slows down if verification is enabled\n\n\nTODO\n====\n* Support more date formats and improve qddate lib\n* Support news pages without dates\n\nUsage\n=====\n\nThe easiest way is to use the `newsworker.FeedExtractor \u003c#newsworker.FeedExtractor\u003e`_ class\nand its `get_feed` function.\n\n.. automodule:: newsworker.extractor\n   :members: FeedExtractor\n.. automodule:: newsworker.finder\n   :members: FeedsFinder\n\n\nDependencies\n============\n\n`newsworker` relies on the following libraries:\n\n  * qddate_ is a module for date processing\n.. _qddate: https://pypi.python.org/pypi/qddate\n\n  * pyparsing_ is a module for advanced text processing.\n.. _pyparsing: https://pypi.python.org/pypi/pyparsing\n\n  * lxml_ is a module for xml parsing.\n.. _lxml: https://pypi.python.org/pypi/lxml\n\n\nSupported language-specific dates\n==================================\n* Bulgarian\n* Czech\n* English\n* French\n* German\n* Portuguese\n* Russian\n* Spanish\n\nThanks\n======\nI wrote this news extraction code in 2008 and later only updated it several times, migrating from regular expressions\nto pyparsing. The initial project was split into the qddate date parsing lib and newsworker, which is intended for news identification\non html pages.\n\nFeel free to send questions to ivan@begtin.tech\n\n.. 
image:: https://badges.gitter.im/newsworker/Lobby.svg\n   :alt: Join the chat at https://gitter.im/newsworker/Lobby\n   :target: https://gitter.im/newsworker/Lobby?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivbeg%2Fnewsworker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fivbeg%2Fnewsworker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fivbeg%2Fnewsworker/lists"}