An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with content-extraction

A curated list of projects in awesome lists tagged with content-extraction.

https://github.com/firecrawl/firecrawl-mcp-server

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

batch-processing claude content-extraction data-collection firecrawl firecrawl-ai javascript-rendering llm-tools mcp mcp-server model-context-protocol search-api web-crawler web-scraping

Last synced: 07 Apr 2026

https://github.com/currentslab/extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

author-extraction content-extraction date-extraction machine-learning news news-articles news-extraction news-extractor python text-cleaning text-mining web-scraping webscraping

Last synced: 17 Mar 2026

https://github.com/pinkpixel-dev/web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

ai-assistant ai-tools cheerio content-extraction crawler duckduckgo duckduckgo-search google-search mcp mcp-server web-content web-crawler web-scraper web-scraping web-search web-search-agent

Last synced: 06 Mar 2026

https://github.com/mvasilkov/readability2

Readability2 converts HTML to plain text.

content-extraction html javascript plaintext readability

Last synced: 16 Apr 2025

https://github.com/gregors/boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

boilerpipe boilerpipe-algorithm content-extraction news webscraping

Last synced: 10 Jun 2025

https://github.com/developer0hye/anytomd-rs

Pure Rust document-to-Markdown converter for LLM workflows (DOCX, PPTX, XLSX, HTML, CSV, JSON, XML, images).

anytomd content-extraction converter csv docx html image-extraction json llm markdown pptx rust text-processing xlsx xml

Last synced: 31 May 2026

https://github.com/spences10/mcp-jinaai-reader

πŸ” Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping

Last synced: 15 Apr 2025

https://github.com/gdamdam/sumo

Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more

automatic-summarization content-extraction entity-recognition nlp nltk semantic-analysis sentence-extraction

Last synced: 08 Oct 2025

https://github.com/tuffstuff9/nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

content-extraction filepond nextjs nextjs-pdf nextjs-pdf-parse nextjs-pdf-parser nextjs-pdf-parsing pdf-parse pdf-parser pdf-parsing pdf-upload pdf2json react-pdf react-pdf-parser

Last synced: 17 Jan 2026

https://github.com/zoharbabin/web-researcher-mcp

Verifies citations, flags retractions, audits bibliographies, and searches the web. Your AI research assistant that cites real sources and stays honest. Works with Claude, Cursor, any MCP client.

ai ai-agent anti-hallucination bibliography citation-verification claude claude-code claude-desktop content-extraction cursor fact-checking go golang llm mcp mcp-server model-context-protocol research web-scraping web-search

Last synced: 14 Jun 2026

https://github.com/youdotcom-oss/agent-skills

Agent Skills for integrating You.com capabilities into agentic workflows and AI development tools - guided integrations for Claude, OpenAI, Vercel AI SDK, and Teams.ai

agent-skills ai-agents ai-integration anthropic bash-agents claude-agent-sdk cli-tools content-extraction developer-tools enterprise-integration livecrawl mcp-server openai-agents-sdk openclaw python teams-ai typescript vercel-ai-sdk web-search youdotcom

Last synced: 23 Feb 2026

https://github.com/leroyanders/acrticle-scrapper

This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

article-parser content-creation-tools content-extraction data-archiving html-to-markdown-converter image-downloading markdown-conversion metadata-extraction python web-scraping

Last synced: 02 Sep 2025

https://github.com/sbstnerhrdt/node-readability

Simple Node.js server to extract relevant content from website source code using Mozilla's Readability.js

content-extraction docker node redability

Last synced: 05 May 2026

https://github.com/mukul975/mcp-web-scrape

🚀 mcp-web-scrape: Clean, cache-aware web content fetcher for AI agents. Fetch any URL → extract readable content → return Markdown/JSON with citations. ⚡ Fast caching, 🤝 robots.txt compliant, 📝 Markdown-ready output, works with ChatGPT/Claude Desktop.

agent ai api cache chatgpt citations claude content-extraction llm markdown mcp model-context-protocol nodejs scraper sse stdio typescript web-crawler web-scraping

Last synced: 30 Apr 2026

https://github.com/dotcommander/defuddle

Go library and CLI for extracting web page content: articles, metadata, and clean text from any URL

cli content-extraction defuddle go html-parser markdown web-scraping

Last synced: 31 May 2026

https://github.com/vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

ai-powered cloudflare-worker content-extraction document-processing markdown-conversion puppeteer tweets-extraction typescript web-scraping

Last synced: 18 Jun 2025

https://github.com/khamel83/argus

Multi-provider web search broker for AI agents with budget-aware routing and 5,000+ free monthly queries.

ai-agents brave-search cli content-extraction duckduckgo fastapi llm-tools mcp mcp-registry mcp-server python search-api search-broker searxng tavily web-search

Last synced: 24 Apr 2026

https://github.com/pinkpixel-dev/prysm

Prysm is a blazing-smart Puppeteer-based web scraper that doesn't just extract - it understands structure. Capable of scraping virtually any website with intelligent content detection and 14 specialized scroll strategies that adapt to different page layouts, Prysm excels at extracting content that other scrapers miss.

api cloudflare-bypass content-extraction data-extraction headless-browser headless-browsers javascript nodejs pagination puppeteer web-automation web-scraper web-scraping

Last synced: 19 Apr 2026

https://github.com/0x4d44/readex

HTML main-content extraction for Rust: ports of Mozilla Readability, Trafilatura, and htmldate.

article-extraction boilerplate-removal content-extraction html html-parser htmldate metadata-extraction readability rust text-extraction trafilatura web-scraping

Last synced: 12 Jun 2026

https://github.com/helebest/holo-epub-reader

EPUB parser for OpenClaw that converts books into LLM-ready Markdown blocks with image extraction, validation, and CI/CD quality gates.

cli-tool content-extraction ebook epub llm markdown openclaw python

Last synced: 06 Apr 2026

https://github.com/po4yka/bite-size-reader

Telegram bot for bite-sized content summaries: scrapes articles/YouTube/channels, summarizes via LLM, serves a Carbon web UI and mobile API. Self-hostable.

article-summarizer channel-digest content-extraction mcp-server openrouter scraper-chain telegram-bot vector-search web-scraping

Last synced: 08 Apr 2026

https://github.com/cyanheads/jinaai-mcp-server

A Model Context Protocol (MCP) server that provides intelligent web reading capabilities using the Jina AI Reader API. It extracts clean, LLM-ready content from any URL.

agent content-extraction jina jinaai llm mcp mcp-server modelcontextprotocol web-scraping

Last synced: 20 Jan 2026

https://github.com/praveenc/fetchv2-mcp-server

A robust MCP server for fetching and extracting web content using Trafilatura. Optimized for AI agents with clean markdown output.

ai-agents content-extraction markdown mcp mcp-server model-context-protocol web-fetching web-scraper

Last synced: 13 Jan 2026

https://github.com/simonpierreboucher/crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration

Last synced: 30 Mar 2025

https://github.com/j0hanz/fetch-url-mcp

A web content fetcher MCP server that converts HTML to clean, AI- and human-readable Markdown.

content-extraction fetch fetch-api html-to-markdown llm-context mcp mcp-server model-context-protocol readable-format typescript web-fetch web-fetching webscraping

Last synced: 01 Apr 2026

https://github.com/justserpapi/web-markdown

JustSerpAPI Crawl Webpage Markdown API Python SDK examples, with related Google Search API, Google Lens API, Google Maps API, Google News API, Google Shopping API, Google Scholar API, Google Finance API, Google Trends API, Google Jobs API, Google Patents API, Google Hotels API, and Web APIs.

content-extraction google-finance-api google-hotels-api google-jobs-api google-lens-api google-maps-api google-news-api google-patents-api google-scholar-api google-search-api google-shopping-api google-trends-api justserpapi markdown-api python serp-api web-crawling web-markdown-api web-scraping

Last synced: 08 Jun 2026

https://github.com/harrydulaney/news-feed-scraper

Configurable and schedulable web scraping tool. Used to extract raw article content and metadata for aggregated news feeds.

content-extraction java-web-scraper news-feed news-feed-provider newsscraper scraper scraperapi web-automation webscraper

Last synced: 10 Jun 2026