An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-extraction

A curated list of projects in awesome lists tagged with data-extraction.

https://github.com/getmaxun/maxun

🔥 Open Source No Code Web Data Extraction Platform • Turn Websites To APIs & Spreadsheets With No-Code Robots In Minutes 🔥

agents api automation browser browser-automation data-extraction no-code no-code-web-scraper playwright robotic-process-automation rpa scraper self-hosted web-agent web-automation web-scraper web-scraping web-scraping-agent webscraping website-to-api

Last synced: 23 Jan 2026

https://github.com/zipstack/unstract

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

api-deployments data-extraction document-processing etl-pipelines open-source-data-pipeline unstructured-data-extraction

Last synced: 13 May 2026

https://github.com/vi3k6i5/flashtext

Extract Keywords from sentence or Replace keywords in sentences.

data-extraction keyword-extraction nlp search-in-text word2vec

Last synced: 13 May 2025

https://github.com/D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 13 May 2025

https://github.com/jonathanlink/pdflayouttextstripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful, for instance, for extracting the content of a table in a PDF file. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

data-extraction extract java layout pdf pdfbox text

Last synced: 15 May 2025

https://github.com/raznem/parsera

Lightweight library for scraping websites with LLMs

ai ai-scraping data-extraction llm opensource playwright python scraping webscraping

Last synced: 11 Apr 2025

https://github.com/thinh-vu/vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

data-extraction quantitative-analysis quantitative-finance quantitative-trading stock-market stock-screener

Last synced: 14 May 2025

https://github.com/eclaire-labs/eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

ai ai-assistant automation bookmark-manager bookmarks data-extraction document-processing llm local-first note-taking ocr on-device-ai open-source personal-knowledge-management privacy rest-api self-hosted task-management web-archiving

Last synced: 16 Jan 2026

https://github.com/yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction

Last synced: 13 May 2026

https://github.com/a-maliarov/amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

amazon amazon-captcha amazon-scraper amazoncaptcha captcha captcha-solver data-extraction pillow python3 training-data

Last synced: 26 Mar 2025

https://github.com/0xMassi/webclaw

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.

ai ai-agents ai-scraping cli crawler data-extraction html-to-markdown llm markdown mcp mcp-server rust scraper self-hosted tls-fingerprinting web-crawler web-extraction web-scraper web-scraping webscraping

Last synced: 04 Apr 2026

https://github.com/molybdenum-99/infoboxer

Wikipedia information extraction library

data-extraction mediawiki wikipedia

Last synced: 05 Apr 2025

https://github.com/mrshu/github-statuses

The "Missing GitHub Status Page" -- a Flat Data attempt at historically documenting GitHub statuses

data-extraction flat-data github ner open-data status status-page uptime

Last synced: 09 Apr 2026

https://github.com/dilawar/plotdigitizer

A Python utility to digitize plots.

data-extraction digitization image-processing python3

Last synced: 06 Apr 2025

https://github.com/ScrapeGraphAI/scrapecraft

🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.

ai automation data-extraction docker fastapi hacktoberfest langgraph python react scrapegraphai typescript web-scraping webscraping

Last synced: 25 Aug 2025

https://github.com/nfx/go-htmltable

Structured HTML table data extraction from URLs in Go that has almost no external dependencies

data-extraction go go-generics html

Last synced: 05 Apr 2025

https://github.com/dav009/flash

Go keyword extraction/replacement data structure using tries instead of regexes

data-extraction go golang search text text-search trie

Last synced: 30 Apr 2025

https://github.com/tech-engine/goscrapy

GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.

data-extraction go-scrapy golang goscraper scrapy spider web-crawler webscraper webscrapping

Last synced: 18 Jan 2026

https://github.com/danburzo/hred

Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.

cli data-extraction html json xml

Last synced: 02 Apr 2025

https://github.com/us/crw

Fast, lightweight Firecrawl alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/v1/scrape, /v1/crawl, /v1/search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

ai ai-agents crawler data-extraction docker firecrawl firecrawl-alternative html-to-markdown llm markdown mcp mcp-server rust scraping-api self-hosted tavily-alternative web-crawler web-scraper web-scraping web-search-api

Last synced: 09 May 2026

https://github.com/html-extract/hext

Domain-specific language for extracting structured data from HTML documents

cpp data-extraction dsl html html-extraction node php python ruby scraping

Last synced: 15 Apr 2025

https://github.com/xquik-dev/x-twitter-scraper

X (Twitter) data platform skill for AI coding agents. 122 REST API endpoints, 2 MCP tools, 23 extraction types, HMAC webhooks. Reads from $0.00015/call - 33x cheaper than the official X API. Works with Claude Code, Cursor, Codex, Copilot, Windsurf & 40+ agents.

ai-agent automation cheap-api claude-code codex cursor data-extraction giveaway mcp mcp-server monitoring pay-per-use rest-api scraper skills social-media twitter twitter-api webhooks x-api

Last synced: 10 May 2026

https://github.com/duriantaco/jonq

Query JSON with SQL-like syntax. A readable jq alternative that generates pure jq under the hood. Table, CSV, YAML output. Interactive REPL. Pipes from curl, streams NDJSON logs.

cli command-line-tools csv data-extraction jq jq-alternative json json-parser json-processor json-query log-analysis ndjson python sql yaml

Last synced: 29 Apr 2026

https://github.com/extralit/extralit

Fast and accurate systemic data extraction with LLM assistance

data-extraction literature-review llm

Last synced: 14 Jan 2026

https://github.com/mhucka/taupe

Taupe takes a downloaded Twitter archive ZIP file, extracts the URLs corresponding to tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can use with other software tools.

archives comma-separated-values csv data-extraction markdown twitter twitter-archive twitter-archives url

Last synced: 14 Dec 2025

https://github.com/johnbumgarner/newshound

This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.

article-extracting article-extractor data-extraction data-mining data-science datascience news news-aggregator news-crawler newspaper-crawler python-newspaper python3 text-mining web-scraping webscraping

Last synced: 14 Jan 2026

https://github.com/rubydamodar/protext-analyzer

ProText Analyzer is a powerful tool for extracting insights from text. It conducts sentiment analysis, categorizing content as positive, negative, or neutral, while also assessing readability and linguistic complexity. Ideal for businesses and researchers, it enhances understanding of textual data.

complex-word-definition data-cleaning-techniques data-extraction linguistic-complexity metrics-explained readability-analysis sentiment-analysis syllable-counting-methodology tokenization-process

Last synced: 17 Jul 2025

https://github.com/linw1995/data_extractor

Combine XPath, CSS Selectors and JSONPath for Web data extracting.

css-selectors data-extraction data-extractor jsonpath xpath

Last synced: 30 Jan 2026

https://github.com/Xquik-dev/tweetclaw

Post tweets, reply, like, retweet, follow, DM & more from OpenClaw. Full X/Twitter automation via Xquik — 120 endpoints, reads from $0.00015/call (66x cheaper than official X API). 2 tools, 2 commands, background event poller.

ai-agent automation cheap-api data-extraction giveaway mcp-server openclaw openclaw-plugin pay-per-use skills social-media tweet tweetclaw twitter twitter-api twitter-automation x x-api xquik

Last synced: 26 May 2026

https://github.com/imranr98/wealthsimpleton

A Python script that scrapes your Wealthsimple activity history and saves the data in a JSON file.

data-extraction data-ownership export python selenium selenium-webdriver wealthsimple web web-scraping

Last synced: 14 May 2025

https://github.com/biraj21/web-wanderer

A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.

data-extraction multithreading python web-crawler webcrawler

Last synced: 12 Jan 2026

https://github.com/shdev/phpflashtext

Extract Keywords from sentence or Replace keywords in sentences. @ https://github.com/vi3k6i5/flashtext

data-analysis data-extraction flashtext keyword-extraction nlp php search-in-text string-manipulation string-matching word2vec

Last synced: 12 Jan 2026

https://github.com/quantumbytestudios/githubuserdataextractor

GitHubUserDataExtractor is a cross-platform Python tool designed to extract and display public GitHub user data both in the terminal and through a visual HTML dashboard. It provides a streamlined way to fetch a user’s profile, recent activity, and contribution statistics using GitHub’s REST API and external visualization services.

data-extraction data-extractor hack hack-tool hack-tools hacker-scripts hacker-tool hacking linux-tools python-tools tools

Last synced: 31 Jul 2025

https://github.com/Fabiopf02/ofx-data-extractor

A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser

banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx

Last synced: 11 Sep 2025

https://github.com/arkutils/arkutils-website

The source for the arkutils website, home of a few Ark: Survival Evolved and Ascended tools.

ark-survival-ascended ark-survival-evolved ark-survivial data-extraction game-tool

Last synced: 23 Jan 2026

https://github.com/masurii/fbscrapeideas

Modern CLI tool for scraping & analyzing Facebook groups using Playwright & Gemini AI. Features self-healing selectors, session security, and local offline analysis.

academic-research ai cli data-extraction data-mining facebook-scraper gemini-api idea-generation nlp python selenium text-analysis

Last synced: 28 Apr 2026

https://github.com/robert-mcdermott/ollama-batch-cluster

Large Scale Batch Processing with Ollama

data-extraction gpu hpc-cluster llm ollama

Last synced: 06 Apr 2026

https://github.com/webmiddle/webmiddle

Node.js framework for modular web scraping and data extraction

data-extraction framework jsx jsx-components modular nodejs web-scraping

Last synced: 29 Oct 2025

https://github.com/attogram/justrefs

Just Refs - extract just the references and related topics from any page on the English Wikipedia

data-extraction information-extraction wikipedia wikipedia-api wikipedia-scraper wikipedia-viewer

Last synced: 14 Apr 2025

https://github.com/aryanvbw/exif

ExifTool is a powerful command-line tool that can be used to extract and edit metadata in a wide range of media files, including images, audio, and video. Metadata is information that is stored within a file that describes the file’s content or other attributes.

aryan-technologies aryanshop aryanvbw data-extraction image-metadata image-processing images-hacking information-gathering powered-by-aryan-technologies vivek

Last synced: 24 Oct 2025

https://github.com/u-c4n/u-transkript

U-Transkript is a powerful Python library for automatically extracting transcripts (subtitles) from YouTube videos and translating them into various languages using Google Gemini AI. It supports 50+ languages, offers flexible output formats (TXT, JSON, XML), and features an easy-to-use, chainable API. Ideal for education, research, content creation

ai data-extraction python subtitles transcript translation youtube youtube-api

Last synced: 01 Jul 2025

https://github.com/petrpatek/airbnb-scraper

Apify public actor for scraping Airbnb homes.

airbnb airbnb-api apify crawler data-extraction scrape

Last synced: 20 Mar 2025

https://github.com/chaitanyarahalkar/financial-info-extractor

Extract financial information in CSV format for companies compliant with the NSE

beautifulsoup csv-parser data-extraction data-scraping financial-data financial-services python selenium

Last synced: 17 Aug 2025

https://github.com/jakubjafra/stellaris-map-generation

Extracts geopolitical data from Stellaris save game files

data-extraction game-files game-modding stellaris stellaris-map-generation

Last synced: 13 May 2025

https://github.com/bluishwu/treeclip

TreeClip is a Chrome extension that offers flexible text selection methods (similar selection, point selection, box selection, text search) and combines them with hierarchy navigation, inner-element selection, hierarchy binding, and custom output formats to enhance your efficiency in copying information from web pages.

bulk-copy bulk-operation chrome-extension copy-paste data-extraction element-selection html text-selection treeclip web-tools

Last synced: 13 May 2025

https://github.com/rririanto/unstructured-demo-streamlit

Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own GPT and LLM projects using Unstructured.io via streamlit

ai data data-extraction gpt unstructured unstructured-data

Last synced: 09 Apr 2025

https://github.com/bisaloo/xlcutter

Parse Batches of 'xlsx' Files Based on a Template

data-extraction excel non-rectangular-data r r-package tidy-data

Last synced: 12 May 2025

https://github.com/beautifulmoon211/onthemarket-scraping

Web scraping tool used to extract real estate information from OnTheMarket.com, a leading property portal in the United Kingdom.

cheerio data-extraction onthemarket onthemarket-scraper real-estate requests typescript web-scraper

Last synced: 13 Jun 2025

https://github.com/ksm26/function-calling-and-data-extraction-with-llms

Master the techniques of function-calling and structured data extraction with LLMs. Learn to enhance LLM capabilities, integrate web services, and build practical applications for real-world data usability.

advanced-workflows ai-integration custom-functionality customer-service-transcripts data-analysis data-extraction end-to-end-applications function-calling llms natural-language-processing openapi practical-implementation structured-data web-services-integration

Last synced: 01 May 2026

https://github.com/geniuszly/genpythondoxing

GenPythonDoxing is a demo version of a Python-based tool designed for gathering publicly available information about email addresses, usernames, IP addresses, and Minecraft nicknames. It utilizes various APIs and web scraping techniques to collect data, providing a comprehensive view of online footprints.

cyber-investigation data-extraction data-mining dox doxing doxing-methods genpythondoxing information-gathering osint python python-doxing python-doxing-tool pythondoxing security-research

Last synced: 13 Apr 2025

https://github.com/blalop/bbva2pandas

Extract the data from your BBVA monthly statements

bank bank-account bbva data-extraction extracted-data pandas

Last synced: 28 Apr 2025

https://github.com/venkat-0706/amazon-webscraper

An Amazon web scraper extracts product data like prices, reviews, and ratings using tools like BeautifulSoup or Scrapy, aiding in market research while adhering to ethical and legal guidelines.

api-and-data-parsing automation beautifulsoup data-extraction ethical-scraping python-programming webscraping

Last synced: 26 Jun 2025

https://github.com/ExceptionRegret/Kryfto

The open-source web-browsing backend for AI agents & workflow engines. Ships a 42-tool MCP server for Claude Code/Cursor/Codex, a full REST API for n8n/Zapier/Make, federated multi-engine search, anti-bot stealth, and enterprise infrastructure (Postgres, Redis, BullMQ, MinIO). Self-host for $5/mo flat

ai-agents anti-detection claude-code codex cursor data-extraction developer-tools fastapi headless-browser mcp mcp-server n8n open-source playwright redis search-engine self-hosted stealth web-scraping workflow-automation

Last synced: 03 Apr 2026

https://github.com/os-climate/crrf-det

A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.

annotation data-extraction layout-analysis pdf table-extraction

Last synced: 12 Apr 2025

https://github.com/davidumoru/scryer

Transform web data into actionable knowledge

content-parsing data-extraction gemini-api google-gemini web-scraping

Last synced: 13 Aug 2025