{"id":13006018,"url":"https://github.com/flairNLP/fundus","last_synced_at":"2025-03-04T15:31:10.909Z","repository":{"id":176731647,"uuid":"558916769","full_name":"flairNLP/fundus","owner":"flairNLP","description":"A very simple news crawler with a funny name","archived":false,"fork":false,"pushed_at":"2024-10-29T22:39:11.000Z","size":18610,"stargazers_count":288,"open_issues_count":34,"forks_count":74,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-10-30T00:46:02.556Z","etag":null,"topics":["cc-news","commoncrawl","corpus","crawler","news-crawler","news-scraping","nlp","python","rss","scraper","sitemap","text-extraction","web-corpus","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/flairNLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"docs/supported_publishers.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-28T15:34:58.000Z","updated_at":"2024-10-28T11:26:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"bca233fc-f5dd-4e07-a5fb-f4477725f548","html_url":"https://github.com/flairNLP/fundus","commit_stats":null,"previous_names":["flairnlp/fundus"],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flairNLP%2Ffundus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flairNLP%2Ffundus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/flairNLP%2Ffundus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/flairNLP%2Ffundus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/flairNLP","download_url":"https://codeload.github.com/flairNLP/fundus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241644548,"owners_count":19996178,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cc-news","commoncrawl","corpus","crawler","news-crawler","news-scraping","nlp","python","rss","scraper","sitemap","text-extraction","web-corpus","web-scraping"],"created_at":"2024-07-24T00:29:10.106Z","updated_at":"2025-03-04T15:31:10.903Z","avatar_url":"https://github.com/flairNLP.png","language":"Python","funding_links":[],"categories":["网络服务"],"sub_categories":["网络爬虫"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_darkmode_with_font_and_clear_space.svg\"\u003e\n    \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg\"\u003e\n    \u003cimg src=\"https://github.com/flairNLP/fundus/blob/master/resources/logo/svg/logo_lightmode_with_font_and_clear_space.svg\" alt=\"Logo\" width=\"50%\" height=\"50%\"\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003eA very simple \u003cb\u003enews crawler\u003c/b\u003e in Python.\nDeveloped at \u003ca 
href=\"https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/\"\u003eHumboldt University of Berlin\u003c/a\u003e.\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://pypi.org/project/fundus/\"\u003e\u003cimg alt=\"PyPi version\" src=\"https://badge.fury.io/py/fundus.svg\"\u003e\u003c/a\u003e\n\u003cimg alt=\"python\" src=\"https://img.shields.io/badge/python-3.8-blue\"\u003e\n\u003cimg alt=\"Static Badge\" src=\"https://img.shields.io/badge/license-MIT-green\"\u003e\n\u003cimg alt=\"Publisher Coverage\" src=\"https://img.shields.io/endpoint?url=https://gist.githubusercontent.com/dobbersc/ca0ae056b05cbfeaf30fa42f84ddf458/raw/fundus_publisher_coverage.json\"\u003e\n\u003c/p\u003e\n\u003cdiv align=\"center\"\u003e\n\u003chr\u003e\n\n[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md) | [Paper](https://aclanthology.org/2024.acl-demos.29/)\n\n\u003c/div\u003e\n\n\n---\n\nFundus is:\n\n* **A static news crawler.** \n  Fundus lets you crawl online news articles with only a few lines of Python code!\n  Be it from live websites or the CC-NEWS dataset.\n\n* **An open-source Python package.**\n  Fundus is built on the idea of building something together. 
\n  We welcome your contribution to  help Fundus [grow](docs/how_to_contribute.md)!\n\n\u003chr\u003e\n\n## Quick Start\n\nTo install from pip, simply do:\n\n```\npip install fundus\n```\n\nFundus requires Python 3.8+.\n\n\n## Example 1: Crawl a bunch of English-language news articles\n\nLet's use Fundus to crawl 2 articles from publishers based in the US.\n\n```python\nfrom fundus import PublisherCollection, Crawler\n\n# initialize the crawler for news publishers based in the US\ncrawler = Crawler(PublisherCollection.us)\n\n# crawl 2 articles and print\nfor article in crawler.crawl(max_articles=2):\n    print(article)\n```\n\nThat's already it!\n\nIf you run this code, it should print out something like this:\n\n```console\nFundus-Article including 1 image(s):\n- Title: \"Feinstein's Return Not Enough for Confirmation of Controversial New [...]\"\n- Text:  \"89-year-old California senator arrived hour late to Judiciary Committee hearing\n          to advance President Biden's stalled nominations  Democrats [...]\"\n- URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/\n- From:   The Washington Free Beacon (2023-05-11 18:41)\n\nFundus-Article including 3 image(s):\n- Title: \"Northwestern student government freezes College Republicans funding over [...]\"\n- Text:  \"Student government at Northwestern University in Illinois \"indefinitely\" froze\n          the funds of the university's chapter of College Republicans [...]\"\n- URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community\n- From:   Fox News (2023-05-09 14:37)\n```\n\nThis printout tells you that you successfully crawled two articles!\n\nFor each article, the printout details:\n- the number of images included in the article\n- the \"Title\" of the article, i.e. its headline \n- the \"Text\", i.e. 
the main article body text\n- the \"URL\" from which it was crawled\n- the news source it is \"From\"\n\n\n## Example 2: Crawl a specific news source\n\nMaybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:\n\n```python\nfrom fundus import PublisherCollection, Crawler\n\n# initialize the crawler for The New Yorker\ncrawler = Crawler(PublisherCollection.us.TheNewYorker)\n\n# crawl 2 articles and print\nfor article in crawler.crawl(max_articles=2):\n    print(article)\n```\n\n## Example 3: Crawl 1 Million articles\n\nTo crawl such a vast amount of data, Fundus relies on the `CommonCrawl` web archive, in particular the news crawl `CC-NEWS`.\nIf you're not familiar with [`CommonCrawl`](https://commoncrawl.org/) or [`CC-NEWS`](https://commoncrawl.org/blog/news-dataset-available), check out their websites.\nSimply import our `CCNewsCrawler` and make sure to check out our [tutorial](docs/2_crawl_from_cc_news.md) beforehand.\n\n````python\nfrom fundus import PublisherCollection, CCNewsCrawler\n\n# initialize the crawler using all publishers supported by fundus\ncrawler = CCNewsCrawler(*PublisherCollection)\n\n# crawl 1 million articles and print\nfor article in crawler.crawl(max_articles=1000000):\n  print(article)\n````\n\n**_Note_**: By default, the crawler utilizes all available CPU cores on your system. \nFor optimal performance, we recommend manually setting the number of processes using the `processes` parameter. 
\nA good rule of thumb is to allocate `one process per 200 Mbps of bandwidth`.\nThis can vary depending on core speed.\n\n**_Note_**: The crawl above took ~7 hours using the entire `PublisherCollection` on a machine with a 1000 Mbps connection, Core i9-13905H, 64 GB RAM, Windows 11, and without printing the articles.\nThe estimated time can vary substantially depending on the publishers used and the available bandwidth.\nAdditionally, not all publishers are included in the `CC-NEWS` crawl (especially US-based publishers).\nFor large corpus creation, one can also use the regular crawler restricted to sitemaps only, which requires significantly less bandwidth.\n\n````python\nfrom fundus import PublisherCollection, Crawler, Sitemap\n\n# initialize a crawler for US/UK-based publishers and restrict to Sitemaps only\ncrawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])\n\n# crawl 1 million articles and print\nfor article in crawler.crawl(max_articles=1000000):\n  print(article)\n````\n\n\n## Example 4: Crawl some images\n\nBy default, Fundus tries to parse the images included in every crawled article.\nLet's crawl an article and print out the images for some more details.\n\n```python\nfrom fundus import PublisherCollection, Crawler\n\n# initialize the crawler for The LA Times\ncrawler = Crawler(PublisherCollection.us.LATimes)\n\n# crawl 1 article and print the images\nfor article in crawler.crawl(max_articles=1):\n    for image in article.images:\n        print(image)\n```\n\nFor [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:\n\n```console\nFundus-Article Cover-Image:\n-URL:\t\t\t 
'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'\n-Description:\t         'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'\n-Caption:\t\t 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'\n-Authors:\t\t ['Abbie Parr / Associated Press']\n-Versions:\t\t [320x213, 568x379, 768x512, 1024x683, 1200x800]\n\nFundus-Article Image:\n-URL:\t\t\t 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'\n-Description:\t         'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'\n-Caption:\t\t 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'\n-Authors:\t\t ['Abbie Parr / Associated Press']\n-Versions:\t\t [320x213, 568x379, 768x512, 1024x683, 1200x800]\n\nFundus-Article Image:\n-URL:\t\t\t 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'\n-Description:\t         'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'\n-Caption:\t\t 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com 
Arena on Dec. 8.'\n-Authors:\t\t ['Gina Ferazzi / Los Angeles Times']\n-Versions:\t\t [320x218, 568x387, 768x524, 1024x698, 1200x818]\n```\n\nFor each image, the printout details:\n- The cover image designation (if applicable).\n- The URL for the highest-resolution version of the image.\n- A description of the image.\n- The image's caption.\n- The name of the copyright holder.\n- A list of all available versions of the image.\n\n\n## Tutorials\n\nWe provide **quick tutorials** to get you started with the library:\n\n1. [**Tutorial 1: How to crawl news with Fundus**](docs/1_getting_started.md)\n2. [**Tutorial 2: How to crawl articles from CC-NEWS**](docs/2_crawl_from_cc_news.md)\n3. [**Tutorial 3: The Article Class**](docs/3_the_article_class.md)\n4. [**Tutorial 4: How to filter articles**](docs/4_how_to_filter_articles.md)\n5. [**Tutorial 5: Advanced topics**](docs/5_advanced_topics.md)\n6. [**Tutorial 6: Logging**](docs/6_logging.md)\n\nIf you wish to contribute check out these tutorials:\n1. [**How to contribute**](docs/how_to_contribute.md)\n2. [**How to add a publisher**](docs/how_to_add_a_publisher.md)\n\n## Currently Supported News Sources\n\nYou can find the publishers currently supported [**here**](/docs/supported_publishers.md).\n\nAlso: **Adding a new publisher is easy - consider contributing to the project!**\n\n## Evaluation Benchmark\n\nCheck out our evaluation [benchmark](https://github.com/dobbersc/fundus-evaluation).\n\nThe following table summarizes the overall performance of Fundus and evaluated scrapers in terms of averaged ROUGE-LSum precision, recall and F1-score and their standard deviation. 
The table is sorted in descending order over the F1-score:\n\n| **Scraper**                                                                                                     | **Precision**             | **Recall**                | **F1-Score**              | **Version** |\n|-----------------------------------------------------------------------------------------------------------------|:--------------------------|---------------------------|---------------------------|-------------|\n| [Fundus](https://github.com/flairNLP/fundus)                                                                    | **99.89**\u003csub\u003e±0.57\u003c/sub\u003e | 96.75\u003csub\u003e±12.75\u003c/sub\u003e    | **97.69**\u003csub\u003e±9.75\u003c/sub\u003e | 0.4.1       |\n| [Trafilatura](https://github.com/adbar/trafilatura)                                                             | 93.91\u003csub\u003e±12.89\u003c/sub\u003e    | 96.85\u003csub\u003e±15.69\u003c/sub\u003e    | 93.62\u003csub\u003e±16.73\u003c/sub\u003e    | 1.12.0      |\n| [news-please](https://github.com/fhamborg/news-please)                                                          | 97.95\u003csub\u003e±10.08\u003c/sub\u003e    | 91.89\u003csub\u003e±16.15\u003c/sub\u003e    | 93.39\u003csub\u003e±14.52\u003c/sub\u003e    | 1.6.13      |\n| [BTE](https://github.com/dobbersc/fundus-evaluation/blob/master/src/fundus_evaluation/scrapers/bte.py)          | 81.09\u003csub\u003e±19.41\u003c/sub\u003e    | **98.23**\u003csub\u003e±8.61\u003c/sub\u003e | 87.14\u003csub\u003e±15.48\u003c/sub\u003e    | /           |\n| [jusText](https://github.com/miso-belica/jusText)                                                               | 86.51\u003csub\u003e±18.92\u003c/sub\u003e    | 90.23\u003csub\u003e±20.61\u003c/sub\u003e    | 86.96\u003csub\u003e±19.76\u003c/sub\u003e    | 3.0.1       |\n| [BoilerNet](https://github.com/dobbersc/fundus-evaluation/tree/master/src/fundus_evaluation/scrapers/boilernet) | 
85.96\u003csub\u003e±18.55\u003c/sub\u003e    | 91.21\u003csub\u003e±19.15\u003c/sub\u003e    | 86.52\u003csub\u003e±18.03\u003c/sub\u003e    | /           |\n| [Boilerpipe](https://github.com/kohlschutter/boilerpipe)                                                        | 82.89\u003csub\u003e±20.65\u003c/sub\u003e    | 82.11\u003csub\u003e±29.99\u003c/sub\u003e    | 79.90\u003csub\u003e±25.86\u003c/sub\u003e    | 1.3.0       |\n\n## Cite\n\nPlease cite the following [paper](https://aclanthology.org/2024.acl-demos.29/) when using Fundus or building upon our work:\n\n```bibtex\n@inproceedings{dallabetta-etal-2024-fundus,\n    title = \"Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions\",\n    author = \"Dallabetta, Max  and\n      Dobberstein, Conrad  and\n      Breiding, Adrian  and\n      Akbik, Alan\",\n    editor = \"Cao, Yixin  and\n      Feng, Yang  and\n      Xiong, Deyi\",\n    booktitle = \"Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)\",\n    month = aug,\n    year = \"2024\",\n    address = \"Bangkok, Thailand\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2024.acl-demos.29\",\n    pages = \"305--314\",\n}\n```\n\n## Contact\n\nPlease email your questions or comments to [**Max Dallabetta**](mailto:max.dallabetta@googlemail.com?subject=[GitHub]%20Fundus)\n\n## Contributing\n\nThanks for your interest in contributing! 
There are many ways to get involved;\nstart with our [contributor guidelines](docs/how_to_contribute.md) and then\ncheck these [open issues](https://github.com/flairNLP/fundus/issues) for specific tasks.\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FflairNLP%2Ffundus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FflairNLP%2Ffundus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FflairNLP%2Ffundus/lists"}