Projects in Awesome Lists tagged with content-extraction
A curated list of projects in awesome lists tagged with content-extraction.
https://github.com/firecrawl/firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.
batch-processing claude content-extraction data-collection firecrawl firecrawl-ai javascript-rendering llm-tools mcp mcp-server model-context-protocol search-api web-crawler web-scraping
Last synced: 07 Apr 2026
https://github.com/currentslab/extractnet
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
author-extraction content-extraction date-extraction machine-learning news news-articles news-extraction news-extractor python text-cleaning text-mining web-scraping webscraping
Last synced: 17 Mar 2026
https://github.com/pinkpixel-dev/web-scout-mcp
A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.
ai-assistant ai-tools cheerio content-extraction crawler duckduckgo duckduckgo-search google-search mcp mcp-server web-content web-crawler web-scraper web-scraping web-search web-search-agent
Last synced: 06 Mar 2026
https://github.com/mvasilkov/readability2
Readability2 converts HTML to plain text.
content-extraction html javascript plaintext readability
Last synced: 16 Apr 2025
https://github.com/graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
claude content-extraction content-ingestion data-collection llm-tools mcp-server model-context-protocol search-api unstructured-data web-crawler web-scraping
Last synced: 12 Oct 2025
https://github.com/gregors/boilerpipe-ruby
Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles
boilerpipe boilerpipe-algorithm content-extraction news webscraping
Last synced: 10 Jun 2025
https://github.com/developer0hye/anytomd-rs
Pure Rust document-to-Markdown converter for LLM workflows (DOCX, PPTX, XLSX, HTML, CSV, JSON, XML, images).
anytomd content-extraction converter csv docx html image-extraction json llm markdown pptx rust text-processing xlsx xml
Last synced: 31 May 2026
https://github.com/spences10/mcp-jinaai-reader
Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader
content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping
Last synced: 15 Apr 2025
https://github.com/gdamdam/sumo
Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more
automatic-summarization content-extraction entity-recognition nlp nltk semantic-analysis sentence-extraction
Last synced: 08 Oct 2025
https://github.com/tuffstuff9/nextjs-pdf-parser
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
content-extraction filepond nextjs nextjs-pdf nextjs-pdf-parse nextjs-pdf-parser nextjs-pdf-parsing pdf-parse pdf-parser pdf-parsing pdf-upload pdf2json react-pdf react-pdf-parser
Last synced: 17 Jan 2026
https://github.com/zoharbabin/web-researcher-mcp
Verifies citations, flags retractions, audits bibliographies, and searches the web. Your AI research assistant that cites real sources and stays honest. Works with Claude, Cursor, any MCP client.
ai ai-agent anti-hallucination bibliography citation-verification claude claude-code claude-desktop content-extraction cursor fact-checking go golang llm mcp mcp-server model-context-protocol research web-scraping web-search
Last synced: 14 Jun 2026
https://github.com/jocmp/mercury-parser
Extract meaningful content from the chaos of a web page
article-parser content-extraction html-parser javascript mercury-parser nodejs readability reader-mode rss web-scraping
Last synced: 02 Apr 2026
https://github.com/rithulkamesh/docproc
Document Intelligence Platform: extract, refine, and query documents with vision LLMs and config-driven RAG.
content-extraction data-extraction document-analysis document-parsing equation-detection layout-analysis machine-learning mathematical-symbols ocr pdf-processing pdf-text-extraction python region-detection text-classification text-extraction
Last synced: 02 Apr 2026
https://github.com/heliolj/youtube-transcript-copier
Chrome extension to copy YouTube transcripts with AI-friendly features
accessibility-tools browser-extension chatgpt-tools chrome-extension clipboard-manager content-extraction i18n javascript llm-tools productivity-tools transcript-copier youtube-api youtube-extension youtube-transcript
Last synced: 28 Apr 2026
https://github.com/youdotcom-oss/agent-skills
Agent Skills for integrating You.com capabilities into agentic workflows and AI development tools - guided integrations for Claude, OpenAI, Vercel AI SDK, and Teams.ai
agent-skills ai-agents ai-integration anthropic bash-agents claude-agent-sdk cli-tools content-extraction developer-tools enterprise-integration livecrawl mcp-server openai-agents-sdk openclaw python teams-ai typescript vercel-ai-sdk web-search youdotcom
Last synced: 23 Feb 2026
https://github.com/solrikk/datadigger
DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.
business-intelligence content-extraction data-analysis data-collection data-extraction data-mining go golang-api html-parser marketing-tools metadata-extraction research-tools seo-tools web-application web-crawling web-scraping web-tools
Last synced: 15 Apr 2025
https://github.com/leroyanders/acrticle-scrapper
This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…
article-parser content-creation-tools content-extraction data-archiving html-to-markdown-converter image-downloading markdown-conversion metadata-extraction python web-scraping
Last synced: 02 Sep 2025
https://github.com/sbstnerhrdt/node-readability
Simple node server to extract relevant content from website source code using Mozilla's Readability.js
content-extraction docker node redability
Last synced: 05 May 2026
https://github.com/mukul975/mcp-web-scrape
mcp-web-scrape: Clean, cache-aware web content fetcher for AI agents. Fetch any URL, extract readable content, return Markdown/JSON with citations. Fast caching, robots.txt compliant, Markdown-ready output, works with ChatGPT/Claude Desktop.
agent ai api cache chatgpt citations claude content-extraction llm markdown mcp model-context-protocol nodejs scraper sse stdio typescript web-crawler web-scraping
Last synced: 30 Apr 2026
https://github.com/dotcommander/defuddle
Go library and CLI for extracting web page content: articles, metadata, and clean text from any URL
cli content-extraction defuddle go html-parser markdown web-scraping
Last synced: 31 May 2026
https://github.com/vakharwalad23/mark-minion
The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.
ai-powered cloudflare-worker content-extraction document-processing markdown-conversion puppeteer tweets-extraction typescript web-scraping
Last synced: 18 Jun 2025
https://github.com/khamel83/argus
Multi-provider web search broker for AI agents with budget-aware routing and 5,000+ free monthly queries.
ai-agents brave-search cli content-extraction duckduckgo fastapi llm-tools mcp mcp-registry mcp-server python search-api search-broker searxng tavily web-search
Last synced: 24 Apr 2026
https://github.com/pinkpixel-dev/prysm
Prysm is a blazing-smart Puppeteer-based web scraper that doesn't just extract - it understands structure. Capable of scraping virtually any website with intelligent content detection and 14 specialized scroll strategies that adapt to different page layouts, Prysm excels at extracting content that other scrapers miss.
api cloudflare-bypass content-extraction data-extraction headless-browser headless-browsers javascript nodejs pagination puppeteer web-automation web-scraper web-scraping
Last synced: 19 Apr 2026
https://github.com/amirthfultehrani/youtube-transcript-copier
A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.
accessibility automation browser-extension clipboard content-extraction data-extraction greasemonkey helper javascript productivity tampermonkey text-extraction tool transcript userscript utilities video violentmonkey web youtube
Last synced: 13 Apr 2026
https://github.com/baughmann/tikara
The metadata and text content extractor for almost every file type.
apache-tika content-extraction document-parsing document-processing docx image-to-text java language-detection llm metadata metadata-extraction ml natural-language-processing ocr pdf-to-text retrieval-augmented-generation text-extraction text-mining
Last synced: 16 Feb 2026
https://github.com/0x4d44/readex
HTML main-content extraction for Rust: ports of Mozilla Readability, Trafilatura, and htmldate.
article-extraction boilerplate-removal content-extraction html html-parser htmldate metadata-extraction readability rust text-extraction trafilatura web-scraping
Last synced: 12 Jun 2026
https://github.com/helebest/holo-epub-reader
EPUB parser for OpenClaw that converts books into LLM-ready Markdown blocks with image extraction, validation, and CI/CD quality gates.
cli-tool content-extraction ebook epub llm markdown openclaw python
Last synced: 06 Apr 2026
https://github.com/po4yka/bite-size-reader
Telegram bot for bite-sized content summaries: scrapes articles/YouTube/channels, summarizes via LLM, serves a Carbon web UI and mobile API. Self-hostable.
article-summarizer channel-digest content-extraction mcp-server openrouter scraper-chain telegram-bot vector-search web-scraping
Last synced: 08 Apr 2026
https://github.com/cyanheads/jinaai-mcp-server
A Model Context Protocol (MCP) server that provides intelligent web reading capabilities using the Jina AI Reader API. It extracts clean, LLM-ready content from any URL.
agent content-extraction jina jinaai llm mcp mcp-server modelcontextprotocol web-scraping
Last synced: 20 Jan 2026
https://github.com/praveenc/fetchv2-mcp-server
A robust MCP server for fetching and extracting web content using Trafilatura. Optimized for AI agents with clean markdown output.
ai-agents content-extraction markdown mcp mcp-server model-context-protocol web-fetching web-scraper
Last synced: 13 Jan 2026
https://github.com/simonpierreboucher/crawler
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration
Last synced: 30 Mar 2025
https://github.com/herrkaefer/anything2md
Python package and CLI for converting documents to Markdown using Cloudflare Workers AI toMarkdown.
automation cli cloudflare cloudflare-workers cloudflare-workers-ai content-extraction developer-tools document-conversion document-to-markdown docx-to-markdown file-conversion image-to-markdown markdown markdown-converter ocr pdf pdf-to-markdown python workers-ai xlsx-to-markdown
Last synced: 11 Mar 2026
https://github.com/j0hanz/fetch-url-mcp
A web content fetcher MCP server that converts HTML to clean, AI- and human-readable Markdown.
content-extraction fetch fetch-api html-to-markdown llm-context mcp mcp-server model-context-protocol readable-format typescript web-fetch web-fetching webscraping
Last synced: 01 Apr 2026
https://github.com/justserpapi/web-markdown
JustSerpAPI Crawl Webpage Markdown API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.
content-extraction google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api justserpapi markdown-api python serp-api web-crawling web-markdown-api web-scraping
Last synced: 08 Jun 2026
https://github.com/harrydulaney/news-feed-scraper
Configurable and schedulable web scraping tool. Used to extract raw article content and metadata for aggregated news feeds.
content-extraction java-web-scraper news-feed news-feed-provider newsscraper scraper scraperapi web-automation webscraper
Last synced: 10 Jun 2026