Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/adbar/trafilatura
- Owner: adbar
- License: apache-2.0
- Created: 2019-04-08T11:38:48.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-06-11T10:40:24.000Z (6 months ago)
- Last Synced: 2024-06-11T20:20:20.887Z (6 months ago)
- Topics: article-extractor, corpus, corpus-builder, corpus-tools, crawler, html-to-markdown, html2text, news, news-aggregator, news-crawler, nlp, readability, rss-feed, scraping, tei, text-cleaning, text-extraction, text-mining, text-preprocessing, web-scraping
- Language: Python
- Homepage: https://trafilatura.readthedocs.io
- Size: 32.5 MB
- Stars: 3,090
- Watchers: 30
- Forks: 232
- Open Issues: 61
Metadata Files:
- Readme: README.md
- Changelog: HISTORY.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Citation: CITATION.cff
Awesome Lists containing this project
- awesome-github-repos - adbar/trafilatura - Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML (Python)
- awesome-rainmana - adbar/trafilatura - Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML (Python)
- awesome-hacking-lists - adbar/trafilatura - Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML (Python)
- project-awesome - adbar/trafilatura - Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML (Python)
- StarryDivineSky - adbar/trafilatura
- jimsghstars - adbar/trafilatura - Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML (Python)
- best-of-web-python - GitHub (19% open · ⏱️ 06.06.2024) (Web Scraping & Crawling)