Projects in Awesome Lists tagged with content-extraction

https://github.com/firecrawl/firecrawl-mcp-server

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai javascript-rendering llm-tools mcp mcp-server model-context-protocol search-api web-crawler web-scraping

Last synced: 07 Apr 2026

https://github.com/currentslab/extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

author-extraction content-extraction date-extraction machine-learning news news-articles news-extraction news-extractor python text-cleaning text-mining web-scraping webscraping

Last synced: 17 Mar 2026

https://github.com/pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

ai-assistant ai-tools cheerio content-extraction crawler duckduckgo duckduckgo-search google-search mcp mcp-server web-content web-crawler web-scraper web-scraping web-search web-search-agent

Last synced: 06 Mar 2026

https://github.com/mvasilkov/readability2

Readability2 converts HTML to plain text.

content-extraction html javascript plaintext readability

Last synced: 16 Apr 2025

https://github.com/graphlit/graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

claude content-extraction content-ingestion data-collection llm-tools mcp-server model-context-protocol search-api unstructured-data web-crawler web-scraping

Last synced: 12 Oct 2025

https://github.com/gregors/boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

boilerpipe boilerpipe-algorithm content-extraction news webscraping

Last synced: 10 Jun 2025

https://github.com/developer0hye/anytomd-rs

Pure Rust document-to-Markdown converter for LLM workflows (DOCX, PPTX, XLSX, HTML, CSV, JSON, XML, images).

anytomd content-extraction converter csv docx html image-extraction json llm markdown pptx rust text-processing xlsx xml

Last synced: 31 May 2026

https://github.com/spences10/mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping

Last synced: 15 Apr 2025

https://github.com/gdamdam/sumo

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

automatic-summarization content-extraction entity-recognition nlp nltk semantic-analysis sentence-extraction

Last synced: 08 Oct 2025

https://github.com/tuffstuff9/nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

content-extraction filepond nextjs nextjs-pdf nextjs-pdf-parse nextjs-pdf-parser nextjs-pdf-parsing pdf-parse pdf-parser pdf-parsing pdf-upload pdf2json react-pdf react-pdf-parser

Last synced: 17 Jan 2026

https://github.com/zoharbabin/web-researcher-mcp

Verifies citations, flags retractions, audits bibliographies — and searches the web. Your AI research assistant that cites real sources and stays honest. Works with Claude, Cursor, any MCP client.

ai ai-agent anti-hallucination bibliography citation-verification claude claude-code claude-desktop content-extraction cursor fact-checking go golang llm mcp mcp-server model-context-protocol research web-scraping web-search

Last synced: 14 Jun 2026

https://github.com/jocmp/mercury-parser

Extract meaningful content from the chaos of a web page

article-parser content-extraction html-parser javascript mercury-parser nodejs readability reader-mode rss web-scraping

Last synced: 02 Apr 2026

https://github.com/rithulkamesh/docproc

Document Intelligence Platform — Extract, refine, and query documents with vision LLMs and config-driven RAG.

content-extraction data-extraction document-analysis document-parsing equation-detection layout-analysis machine-learning mathematical-symbols ocr pdf-processing pdf-text-extraction python region-detection text-classification text-extraction

Last synced: 02 Apr 2026

https://github.com/heliolj/youtube-transcript-copier

Chrome extension to copy YouTube transcripts with AI-friendly features

accessibility-tools browser-extension chatgpt-tools chrome-extension clipboard-manager content-extraction i18n javascript llm-tools productivity-tools transcript-copier youtube-api youtube-extension youtube-transcript

Last synced: 28 Apr 2026

https://github.com/youdotcom-oss/agent-skills

Agent Skills for integrating You.com capabilities into agentic workflows and AI development tools - guided integrations for Claude, OpenAI, Vercel AI SDK, and Teams.ai

agent-skills ai-agents ai-integration anthropic bash-agents claude-agent-sdk cli-tools content-extraction developer-tools enterprise-integration livecrawl mcp-server openai-agents-sdk openclaw python teams-ai typescript vercel-ai-sdk web-search youdotcom

Last synced: 23 Feb 2026

https://github.com/solrikk/datadigger

DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.

business-intelligence content-extraction data-analysis data-collection data-extraction data-mining go golang-api html-parser marketing-tools metadata-extraction research-tools seo-tools web-application web-crawling web-scraping web-tools

Last synced: 15 Apr 2025

https://github.com/leroyanders/acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

article-parser content-creation-tools content-extraction data-archiving html-to-markdown-converter image-downloading markdown-conversion metadata-extraction python web-scraping

Last synced: 02 Sep 2025

https://github.com/sbstnerhrdt/node-readability

Simple node server to extract relevant content from website source code using Mozilla's Readability.js

content-extraction docker node redability

Last synced: 05 May 2026

https://github.com/mukul975/mcp-web-scrape

🚀 mcp-web-scrape: Clean, cache-aware web content fetcher for AI agents. Fetch any URL → extract readable content → return Markdown/JSON with citations. ⚡ Fast caching, 🤝 robots.txt compliant, 📝 Markdown-ready output, works with ChatGPT/Claude Desktop.

agent ai api cache chatgpt citations claude content-extraction llm markdown mcp model-context-protocol nodejs scraper sse stdio typescript web-crawler web-scraping

Last synced: 30 Apr 2026

https://github.com/dotcommander/defuddle

Go library and CLI for extracting web page content — articles, metadata, and clean text from any URL

cli content-extraction defuddle go html-parser markdown web-scraping

Last synced: 31 May 2026

https://github.com/vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

ai-powered cloudflare-worker content-extraction document-processing markdown-conversion puppeteer tweets-extraction typescript web-scraping

Last synced: 18 Jun 2025

https://github.com/khamel83/argus

Multi-provider web search broker for AI agents with budget-aware routing and 5,000+ free monthly queries.

ai-agents brave-search cli content-extraction duckduckgo fastapi llm-tools mcp mcp-registry mcp-server python search-api search-broker searxng tavily web-search

Last synced: 24 Apr 2026

https://github.com/pinkpixel-dev/prysm

Prysm is a blazing-smart Puppeteer-based web scraper that doesn't just extract - it understands structure. Capable of scraping virtually any website with intelligent content detection and 14 specialized scroll strategies that adapt to different page layouts, Prysm excels at extracting content that other scrapers miss.

api cloudflare-bypass content-extraction data-extraction headless-browser headless-browsers javascript nodejs pagination puppeteer web-automation web-scraper web-scraping

Last synced: 19 Apr 2026

https://github.com/amirthfultehrani/youtube-transcript-copier

A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

accessibility automation browser-extension clipboard content-extraction data-extraction greasemonkey helper javascript productivity tampermonkey text-extraction tool transcript userscript utilities video violentmonkey web youtube

Last synced: 13 Apr 2026

https://github.com/baughmann/tikara

The metadata and text content extractor for almost every file type.

apache-tika content-extraction document-parsing document-processing docx image-to-text java language-detection llm metadata metadata-extraction ml natural-language-processing ocr pdf-to-text retrieval-augmented-generation text-extraction text-mining

Last synced: 16 Feb 2026

https://github.com/0x4d44/readex

HTML main-content extraction for Rust — ports of Mozilla Readability, Trafilatura, and htmldate.

article-extraction boilerplate-removal content-extraction html html-parser htmldate metadata-extraction readability rust text-extraction trafilatura web-scraping

Last synced: 12 Jun 2026

https://github.com/helebest/holo-epub-reader

EPUB parser for OpenClaw that converts books into LLM-ready Markdown blocks with image extraction, validation, and CI/CD quality gates.

cli-tool content-extraction ebook epub llm markdown openclaw python

Last synced: 06 Apr 2026

https://github.com/po4yka/bite-size-reader

Telegram bot for bite-sized content summaries — scrapes articles/YouTube/channels, summarizes via LLM, serves a Carbon web UI and mobile API. Self-hostable.

article-summarizer channel-digest content-extraction mcp-server openrouter scraper-chain telegram-bot vector-search web-scraping

Last synced: 08 Apr 2026

https://github.com/cyanheads/jinaai-mcp-server

A Model Context Protocol (MCP) server that provides intelligent web reading capabilities using the Jina AI Reader API. It extracts clean, LLM-ready content from any URL.

agent content-extraction jina jinaai llm mcp mcp-server modelcontextprotocol web-scraping

Last synced: 20 Jan 2026

https://github.com/praveenc/fetchv2-mcp-server

A robust MCP server for fetching and extracting web content using Trafilatura. Optimized for AI agents with clean markdown output.

ai-agents content-extraction markdown mcp mcp-server model-context-protocol web-fetching web-scraper

Last synced: 13 Jan 2026

https://github.com/simonpierreboucher/crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration

Last synced: 30 Mar 2025

https://github.com/herrkaefer/anything2md

Python package and CLI for converting documents to Markdown using Cloudflare Workers AI toMarkdown.

automation cli cloudflare cloudflare-workers cloudflare-workers-ai content-extraction developer-tools document-conversion document-to-markdown docx-to-markdown file-conversion image-to-markdown markdown markdown-converter ocr pdf pdf-to-markdown python workers-ai xlsx-to-markdown

Last synced: 11 Mar 2026

https://github.com/j0hanz/fetch-url-mcp

A web content fetcher MCP server that converts HTML to clean, AI and human readable markdown.

content-extraction fetch fetch-api html-to-markdown llm-context mcp mcp-server model-context-protocol readable-format typescript web-fetch web-fetching webscraping

Last synced: 01 Apr 2026

https://github.com/justserpapi/web-markdown

JustSerpAPI Crawl Webpage Markdown API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.

content-extraction google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api justserpapi markdown-api python serp-api web-crawling web-markdown-api web-scraping

Last synced: 08 Jun 2026

https://github.com/harrydulaney/news-feed-scraper

Configurable and schedulable web scraping tool. Used to extract raw article content and metadata for aggregated news feeds.

content-extraction java-web-scraper news-feed news-feed-provider newsscraper scraper scraperapi web-automation webscraper

Last synced: 10 Jun 2026