Projects in Awesome Lists tagged with article-extractor
A curated list of projects in awesome lists tagged with article-extractor.
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 24 Dec 2025
https://github.com/extractus/article-extractor
Extract the main article from a given URL with Node.js
article article-extractor article-parser crawler extract nodejs readability scraper
Last synced: 27 Apr 2025
https://github.com/scotteh/php-goose
Readability / HTML Content / Article Extractor & Web Scraping library written in PHP
article article-extractor autoloader composer php php-goose readability scraper
Last synced: 05 Oct 2025
https://github.com/hipstermojo/paperoni
An article extractor in Rust
article-extractor readability rust
Last synced: 07 Apr 2025
https://github.com/artiomn/markdown_articles_tool
Parse a markdown article, download its images, and replace the image URLs with local paths
article article-extracting article-extractor articles downloader html image-manipulation images markdown markdown-articles markdown-converter markdown-parser markdown-to-html markdown-to-pdf md pdf python-library toolset
Last synced: 09 Apr 2025
https://github.com/fterh/sneakpeek
Reddit bot to preview and post hyperlinks as comments
article-extractor news-articles preview reddit reddit-bot
Last synced: 08 Jul 2025
https://github.com/web64/laravel-nlp
Laravel wrapper for common NLP tasks
article-extractor entity-extraction language-detection laravel-package nlp sentiment-analysis
Last synced: 24 Jan 2026
https://github.com/inaridiy/webforai
An ESM-native HTML-to-Markdown library with useful utilities, designed to be simple and lightweight.
article-extractor extractor html-to-markdown html2markdown html2md html2text readability scraping text-mining
Last synced: 06 Apr 2025
https://github.com/johnbumgarner/newshound
This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.
article-extracting article-extractor data-extraction data-mining data-science datascience news news-aggregator news-crawler newspaper-crawler python-newspaper python3 text-mining web-scraping webscraping
Last synced: 14 Jan 2026
https://github.com/woojubb/html-article-extractor
A web page content extractor
article-extracting article-extractor crawler crawling extraction extractor
Last synced: 24 Dec 2025
https://github.com/sanidhyy/ai-summarizer
Modern OpenAI GPT-4 Article Summarizer
ai article-extractor artificial-intelligence chatgpt css gpt-4 html javascript js machine-learning react reactjs tailwindcss
Last synced: 13 Apr 2025
https://github.com/metalwarrior665/actor-article-extractor-smart
Combines Apify's crawling system and article parsing with the unfluff library.
actor apify article-extractor scraper web-scraper
Last synced: 03 Sep 2025
https://github.com/bharathvaj-ganesan/artixtractor
Extract articles and blog posts from websites such as medium.com, inc42.com, etc.
article-extractor hacktoberfest nodejs
Last synced: 19 Jun 2025
https://github.com/gadzan/generatoc
Automatically generate a table of contents from the headings of an HTML document
article-extractor html-document ssr toc typescript
Last synced: 23 May 2026
https://github.com/andythefactory/article-extraction-dataset
Article title, authors, date and body extraction dataset.
article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 27 Jan 2026
https://github.com/mccallofthewild/alexandrias-revenge
🔥The bold new archive that can’t be burned, bulldozed or battering-rammed #PoweredByArweave
archive article-extractor arweave blockchain webarchive
Last synced: 21 Apr 2025
https://github.com/robinmillford/cortex-ai-multi-model-insights-hub
Cortex AI: Multi-Model Insights Hub is an advanced platform that leverages cutting-edge AI to empower your research, analysis, and data exploration. By integrating multiple Large Language Models (LLMs) with a sophisticated Retrieve-and-Generate (RAG) system
article-extractor chatbot data-analysis data-visualization deepseek-chat deepseek-r1 llama3 llm pdf-document-processor rag streamlit-webapp summarizer vector-database
Last synced: 28 Oct 2025
https://github.com/hemantwasthere/ai-sumz
Simplify your reading with Summarizer, an open-source article summarizer that transforms lengthy articles into clear and concise summaries
article-extractor rapidapi react redux-toolkit tailwindcss vite
Last synced: 12 Apr 2026
https://github.com/sters/extract-content
ExtractContent for PHP 7. A tool for extracting web articles.
article-extractor extract-content php
Last synced: 16 Jan 2026
https://github.com/jujulis18/smartdata_tracker
Scraping tool designed to cleanly extract the content of online articles (blogs, news, publications). It automates the collection of text data, cleans the content (removing tags, advertisements, etc.), and supports structured export for later analysis (NLP, summarization, monitoring, etc.).
agent article-extractor google-cloud-platform mistral-api ner python scraping streamlit veille
Last synced: 31 Jan 2026
https://github.com/sahilg28/artisumm
Artisumm is a tool that delivers concise and accurate article summaries for quick information digestion.
article-extractor article-summarizer article-summary fronted-development javascript rapidapi reactjs tailwindcss webdevelopment
Last synced: 15 Apr 2026
https://github.com/amcelo13/openai_url_sumz
article-extractor localstorage openai rapidapi reactjs tailwindcss
Last synced: 10 Jun 2025
https://github.com/ryshaal/google-scholar-scraping
Google Scholar Scraper
article-extractor scraping-websites
Last synced: 24 Aug 2025
https://github.com/n-ce/ap-navigator
"Technology for the actual growth of man."
acharya-prashant api article-extractor books education navigator spirituality tour vedanta
Last synced: 10 Feb 2026
https://github.com/arachnio/arachnio4py
Arachnio client library for Python 3.10+
arachnio article-extractor data-extraction news-scraping text-extraction web-scraping web-scraping-python
Last synced: 14 Jan 2026
https://github.com/reineimi/html-article-editor
A simple HTML article (rich text) editor
article-extractor article-generator contenteditable html html5 rich-text rich-text-editor
Last synced: 21 Jan 2026
https://github.com/RobinMillford/Cortex-AI-Multi-Model-Insights-Hub
This project creates a Retrieve-and-Generate (RAG) powered chatbot for summarizing and interacting with articles. The system processes articles provided as PDFs or URLs, extracts text, splits the content into chunks, generates embeddings, and stores them in a vector database
article-extractor chatbot llama3 llm pdf-document-processor rag streamlit summarizer vector-database
Last synced: 11 Oct 2025
https://github.com/parthapray/pdf_text_extraction_json_section_subsection
This repo contains code for extracting PDF text to JSON, capturing section number, section title, section body content, and footnotes
article-extractor document extraction json pdf pymupdf-fitz regex text
Last synced: 10 May 2026