Projects in Awesome Lists tagged with data-extraction
A curated list of projects in awesome lists tagged with data-extraction.
https://github.com/getmaxun/maxun
🔥 Open Source No Code Web Data Extraction Platform • Turn Websites To APIs & Spreadsheets With No-Code Robots In Minutes 🔥
agents api automation browser browser-automation data-extraction no-code no-code-web-scraper playwright robotic-process-automation rpa scraper self-hosted web-agent web-automation web-scraper web-scraping web-scraping-agent webscraping website-to-api
Last synced: 23 Jan 2026
https://github.com/zipstack/unstract
LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows
api-deployments data-extraction document-processing etl-pipelines open-source-data-pipeline unstructured-data-extraction
Last synced: 13 May 2026
https://github.com/vi3k6i5/flashtext
Extract Keywords from sentence or Replace keywords in sentences.
data-extraction keyword-extraction nlp search-in-text word2vec
Last synced: 13 May 2025
https://github.com/D4Vinci/Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 13 May 2025
https://github.com/d4vinci/scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 15 Feb 2026
https://github.com/brightdata/brightdata-mcp
A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.
ai-agents ai-integrations anti-bot-detection browser-automation data-collection data-extraction llm mcp mcp-server modelcontextprotocol scraping scraping-tools structured-data web-crawling web-data web-scraping
Last synced: 16 Jan 2026
https://github.com/saifyxpro/headlessx
A lightweight, self-hosted headless browser automation platform. Designed as an alternative to Browserless, built for speed, privacy, and scalability.
automation automation-api automation-platform browser-automation browser-testing browserless chrome-headless chromedriver container-automation data-extraction headless headless-chrome headless-service playwright playwright-automation puppeteer scraping-service web-automation web-scraping
Last synced: 31 Jan 2026
https://github.com/jonathanlink/pdflayouttextstripper
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).
data-extraction extract java layout pdf pdfbox text
Last synced: 15 May 2025
https://github.com/JonathanLink/PDFLayoutTextStripper
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).
data-extraction extract java layout pdf pdfbox text
Last synced: 15 Mar 2025
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/raznem/parsera
Lightweight library for scraping web-sites with LLMs
ai ai-scraping data-extraction llm opensource playwright python scraping webscraping
Last synced: 11 Apr 2025
https://github.com/thinh-vu/vnstock
A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone
data-extraction quantitative-analysis quantitative-finance quantitative-trading stock-market stock-screener
Last synced: 14 May 2025
https://github.com/eclaire-labs/eclaire
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
ai ai-assistant automation bookmark-manager bookmarks data-extraction document-processing llm local-first note-taking ocr on-device-ai open-source personal-knowledge-management privacy rest-api self-hosted task-management web-archiving
Last synced: 16 Jan 2026
https://github.com/yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction
Last synced: 13 May 2026
https://github.com/polyrabbit/hacker-news-digest
📰 Let ChatGPT Summarize Hacker News for You
chatgpt chatgpt-api crawler data-extraction extract-summaries hacker-news hacker-news-digest hacker-news-reader machine-learning news-aggregator openai openai-api python rss spider
Last synced: 15 May 2025
https://github.com/adrienjoly/npm-pdfreader
🚜 Parse text and tables from PDF files.
data-extraction javascript parse-tables parsing pdf-converter pdf-reader rule-based-parsing tabular-data
Last synced: 14 May 2025
https://github.com/a-maliarov/amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
amazon amazon-captcha amazon-scraper amazoncaptcha captcha captcha-solver data-extraction pillow python3 training-data
Last synced: 26 Mar 2025
https://github.com/0xMassi/webclaw
Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.
ai ai-agents ai-scraping cli crawler data-extraction html-to-markdown llm markdown mcp mcp-server rust scraper self-hosted tls-fingerprinting web-crawler web-extraction web-scraper web-scraping webscraping
Last synced: 04 Apr 2026
https://github.com/shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
ai contract-analysis data-extraction document-intelligence docx docx2md docx2txt generative-ai legaltech llm llm-extraction llm-framework llm-pipeline llms nlp prompt-engineering text-analysis unstructured-data
Last synced: 13 May 2025
https://github.com/py-pdf/benchmarks
Benchmarking PDF libraries
benchmark data-extraction mupdf pdf poppler-utils pypdf2 text-extraction
Last synced: 28 Jul 2025
https://github.com/serpapi/clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
automation command-line command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing email-scraper open-source ruby rubygem serp social-media-scraper web-crawling webscraping
Last synced: 06 Apr 2025
https://github.com/molybdenum-99/infoboxer
Wikipedia information extraction library
data-extraction mediawiki wikipedia
Last synced: 05 Apr 2025
https://github.com/bzsanti/oxidizePdf
A PDF library for Rust
crates-io data-extraction digital-signatures document-processing encryption invoice ocr pdf pdf-generation pdf-library pdf-manipulation pdf-parser pdf-reader pdfa rust rust-library table-extraction text-extraction
Last synced: 29 Apr 2026
https://github.com/sypht-team/sypht-python-client
A python client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser pdf-parser python python3 python3-library receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-python-client
Last synced: 11 Jul 2025
https://github.com/mrshu/github-statuses
The "Missing GitHub Status Page" -- a Flat Data attempt at historically documenting GitHub statuses
data-extraction flat-data github ner open-data status status-page uptime
Last synced: 09 Apr 2026
https://github.com/dilawar/plotdigitizer
A Python utility to digitize plots.
data-extraction digitization image-processing python3
Last synced: 06 Apr 2025
https://github.com/ScrapeGraphAI/scrapecraft
🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.
ai automation data-extraction docker fastapi hacktoberfest langgraph python react scrapegraphai typescript web-scraping webscraping
Last synced: 25 Aug 2025
https://github.com/nfx/go-htmltable
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
data-extraction go go-generics html
Last synced: 05 Apr 2025
https://github.com/villagecomputing/superpipe
Superpipe - optimized LLM pipelines for structured data
classification data-extraction data-labeling llm llm-evaluation llm-optimization structured-data
Last synced: 04 Apr 2026
https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision
Line segmentation algorithm for Google Vision API.
data-extraction google-vision invoice proposed-algorithm segmentation
Last synced: 25 Jun 2025
https://github.com/dav009/flash
Golang Keyword extraction/replacement Datastructure using Tries instead of regexes
data-extraction go golang search text text-search trie
Last synced: 30 Apr 2025
https://github.com/tech-engine/goscrapy
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
data-extraction go-scrapy golang goscraper scrapy spider web-crawler webscraper webscrapping
Last synced: 18 Jan 2026
https://github.com/sypht-team/sypht-java-client
A Java client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-retrieval information-retrieval-engine invoice invoice-parser java java8 pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-java-client
Last synced: 10 Apr 2025
https://github.com/danburzo/hred
Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.
cli data-extraction html json xml
Last synced: 02 Apr 2025
https://github.com/us/crw
Fast, lightweight Firecrawl alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/v1/scrape, /v1/crawl, /v1/search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.
ai ai-agents crawler data-extraction docker firecrawl firecrawl-alternative html-to-markdown llm markdown mcp mcp-server rust scraping-api self-hosted tavily-alternative web-crawler web-scraper web-scraping web-search-api
Last synced: 09 May 2026
https://github.com/html-extract/hext
Domain-specific language for extracting structured data from HTML documents
cpp data-extraction dsl html html-extraction node php python ruby scraping
Last synced: 15 Apr 2025
https://github.com/xquik-dev/x-twitter-scraper
X (Twitter) data platform skill for AI coding agents. 122 REST API endpoints, 2 MCP tools, 23 extraction types, HMAC webhooks. Reads from $0.00015/call - 33x cheaper than the official X API. Works with Claude Code, Cursor, Codex, Copilot, Windsurf & 40+ agents.
ai-agent automation cheap-api claude-code codex cursor data-extraction giveaway mcp mcp-server monitoring pay-per-use rest-api scraper skills social-media twitter twitter-api webhooks x-api
Last synced: 10 May 2026
https://github.com/stabrise/spark-pdf
PDF DataSource for Apache Spark
big-data data-engineering data-extraction data-science ocr ocr-recognition pdf pdf-document pdf-document-processor spark spark-datasource tesseract tesseract-ocr
Last synced: 09 Apr 2025
https://github.com/articdive/articdata
Collection of data extracted from Minecraft.
data data-extraction data-mining java json mc minecraft minecraft-data minecraft-server minecraft-servers registry
Last synced: 17 May 2025
https://github.com/duriantaco/jonq
Query JSON with SQL-like syntax. A readable jq alternative that generates pure jq under the hood. Table, CSV, YAML output. Interactive REPL. Pipes from curl, streams NDJSON logs.
cli command-line-tools csv data-extraction jq jq-alternative json json-parser json-processor json-query log-analysis ndjson python sql yaml
Last synced: 29 Apr 2026
https://github.com/serpapi/google-search-results-java
Google Search Results JAVA API via SerpApi
data-extraction data-scraping java java-api json serp-api serpapi web-scraping webscraping
Last synced: 09 Jul 2025
https://github.com/Articdive/ArticData
Collection of data extracted from Minecraft.
data data-extraction data-mining java json mc minecraft minecraft-data minecraft-server minecraft-servers registry
Last synced: 08 May 2025
https://github.com/hddevteam/smart-form-filler
AI-powered form filling and data extraction browser extension
ai-powered browser-extension chrome-extension data-extraction demo-site developer-tools form-automation form-filling github-pages gpt-integration javascript local-ai machine-learning nodejs ollama privacy-focused productivity-tools web-automation
Last synced: 14 Feb 2026
https://github.com/sypht-team/sypht-golang-client
A Golang client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields go golang golang-library golang-package invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-golang-client
Last synced: 20 Oct 2025
https://github.com/extralit/extralit
Fast and accurate systemic data extraction with LLM assistance
data-extraction literature-review llm
Last synced: 14 Jan 2026
https://github.com/mhucka/taupe
Taupe takes a downloaded Twitter archive ZIP file, extracts the URLs corresponding to tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can use with other software tools.
archives comma-separated-values csv data-extraction markdown twitter twitter-archive twitter-archives url
Last synced: 14 Dec 2025
https://github.com/johnbumgarner/newshound
This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.
article-extracting article-extractor data-extraction data-mining data-science datascience news news-aggregator news-crawler newspaper-crawler python-newspaper python3 text-mining web-scraping webscraping
Last synced: 14 Jan 2026
https://github.com/rubydamodar/protext-analyzer
ProText Analyzer is a powerful tool for extracting insights from text. It conducts sentiment analysis, categorizing content as positive, negative, or neutral, while also assessing readability and linguistic complexity. Ideal for businesses and researchers, it enhances understanding of textual data.
complex-word-definition data-cleaning-techniques data-extraction linguistic-complexity metrics-explained readability-analysis sentiment-analysis syllable-counting-methodology tokenization-process
Last synced: 17 Jul 2025
https://github.com/linw1995/data_extractor
Combine XPath, CSS Selectors and JSONPath for Web data extracting.
css-selectors data-extraction data-extractor jsonpath xpath
Last synced: 30 Jan 2026
https://github.com/gambolputty/wiktionary-de-parser
Extract data from German Wiktionary XML files.
data-extraction dewiktionary german german-language nlp wiktionary wiktionary-dump wiktionary-parser
Last synced: 14 Jan 2026
https://github.com/Xquik-dev/tweetclaw
Post tweets, reply, like, retweet, follow, DM & more from OpenClaw. Full X/Twitter automation via Xquik — 120 endpoints, reads from $0.00015/call (66x cheaper than official X API). 2 tools, 2 commands, background event poller.
ai-agent automation cheap-api data-extraction giveaway mcp-server openclaw openclaw-plugin pay-per-use skills social-media tweet tweetclaw twitter twitter-api twitter-automation x x-api xquik
Last synced: 26 May 2026
https://github.com/pim97/scrappey-wrapper-python
An API wrapper for Scrappey.com written in Python (cloudflare, datadome bypass & solver)
akamai anti-bot-api captcha captcha-solver cloudflare-anti-bot cloudflare-bypass data-extraction datadome incapsula perimetex queue-it scraping-framework scraping-library scraping-service scraping-tool shape web-data-extration web-scraping web-scraping-solution
Last synced: 12 Jan 2026
https://github.com/nextkore/smartmuv
An EVM-compatible Solidity Smart Contract Storage/Slot Analyzer and Data Extractor.
blockchain-explorer code-analysis data-exploration data-extraction ethereum ethereum-blockchain explorer migrate scanner smart-contracts solidity static-analysis storage storage-analysis tracker upgrade
Last synced: 17 Jul 2025
https://github.com/cpl/exodus
Data exfiltration using DNS
data-extraction dns dns-client dns-exfiltration dns-server exfiltration firewall-bypass security-tools
Last synced: 16 Jan 2026
https://github.com/imranr98/wealthsimpleton
A Python script that scrapes your Wealthsimple activity history and saves the data in a JSON file.
data-extraction data-ownership export python selenium selenium-webdriver wealthsimple web web-scraping
Last synced: 14 May 2025
https://github.com/biraj21/web-wanderer
A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.
data-extraction multithreading python web-crawler webcrawler
Last synced: 12 Jan 2026
https://github.com/shdev/phpflashtext
Extract Keywords from sentence or Replace keywords in sentences. @ https://github.com/vi3k6i5/flashtext
data-analysis data-extraction flashtext keyword-extraction nlp php search-in-text string-manipulation string-matching word2vec
Last synced: 12 Jan 2026
https://github.com/quantumbytestudios/githubuserdataextractor
GitHubUserDataExtractor is a cross-platform Python tool designed to extract and display public GitHub user data both in the terminal and through a visual HTML dashboard. It provides a streamlined way to fetch a user’s profile, recent activity, and contribution statistics using GitHub’s REST API and external visualization services.
data-extraction data-extractor hack hack-tool hack-tools hacker-scripts hacker-tool hacking linux-tools python-tools tools
Last synced: 31 Jul 2025
https://github.com/Fabiopf02/ofx-data-extractor
A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser
banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx
Last synced: 11 Sep 2025
https://github.com/arkutils/arkutils-website
The source for the arkutils website, home of a few Ark: Survival Evolved and Ascended tools.
ark-survival-ascended ark-survival-evolved ark-survivial data-extraction game-tool
Last synced: 23 Jan 2026
https://github.com/masurii/fbscrapeideas
Modern CLI tool for scraping & analyzing Facebook groups using Playwright & Gemini AI. Features self-healing selectors, session security, and local offline analysis.
academic-research ai cli data-extraction data-mining facebook-scraper gemini-api idea-generation nlp python selenium text-analysis
Last synced: 28 Apr 2026
https://github.com/NextKore/SmartMuv
An EVM-compatible Solidity Smart Contract Storage/Slot Analyzer and Data Extractor.
blockchain-explorer code-analysis data-exploration data-extraction ethereum ethereum-blockchain explorer migrate scanner smart-contracts solidity static-analysis storage storage-analysis tracker upgrade
Last synced: 12 Aug 2025
https://github.com/robert-mcdermott/ollama-batch-cluster
Large Scale Batch Processing with Ollama
data-extraction gpu hpc-cluster llm ollama
Last synced: 06 Apr 2026
https://github.com/webmiddle/webmiddle
Node.js framework for modular web scraping and data extraction
data-extraction framework jsx jsx-components modular nodejs web-scraping
Last synced: 29 Oct 2025
https://github.com/sypht-team/sypht-node-client
A Nodejs client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields invoice invoice-parser node node-module nodejs nodejs-client pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-node-client
Last synced: 13 Apr 2025
https://github.com/attogram/justrefs
Just Refs - extract just the references and related topics from any page on the English Wikipedia
data-extraction information-extraction wikipedia wikipedia-api wikipedia-scraper wikipedia-viewer
Last synced: 14 Apr 2025
https://github.com/aryanvbw/exif
ExifTool is a powerful command-line tool that can be used to extract and edit metadata in a wide range of media files, including images, audio, and video. Metadata is information that is stored within a file that describes the file’s content or other attributes.
aryan-technologies aryanshop aryanvbw data-extraction image-metadata image-processing images-hacking information-gathering powered-by-aryan-technologies vivek
Last synced: 24 Oct 2025
https://github.com/u-c4n/u-transkript
U-Transkript is a powerful Python library for automatically extracting transcripts (subtitles) from YouTube videos and translating them into various languages using Google Gemini AI. It supports 50+ languages, offers flexible output formats (TXT, JSON, XML), and features an easy-to-use, chainable API. Ideal for education, research, content creation
ai data-extraction python subtitles transcript translation youtube youtube-api
Last synced: 01 Jul 2025
https://github.com/irfanalidv/trustpilot_scraper
A Python library for scraping Trustpilot reviews.
beautifulsoup data-collection data-extraction etl-pipeline review-scraper text-mining trustpilot web-scraping-python
Last synced: 14 Jan 2026
https://github.com/fabiopf02/ofx-data-extractor
A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser
banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx
Last synced: 10 Jul 2025
https://github.com/petrpatek/airbnb-scraper
Apify public actor for scraping Airbnb homes.
airbnb airbnb-api apify crawler data-extraction scrape
Last synced: 20 Mar 2025
https://github.com/sypht-team/sypht-kotlin-client
A Kotlin client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser kotlin kotlin-android kotlin-library pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-kotlin-client
Last synced: 27 Jul 2025
https://github.com/kehvinbehvin/json-mcp-filter
JSON MCP server to filter only relevant data for your LLM
claude-mcp data data-extraction data-filtering json json-analysis json-filter json-mcp-server json-parser json-schema-inference json-to-typescript json-utilities large-files mcp mcp-server query type-generation
Last synced: 07 Sep 2025
https://github.com/chaitanyarahalkar/financial-info-extractor
Extract financial information in CSV format for companies compliant to the NSE
beautifulsoup csv-parser data-extraction data-scraping financial-data financial-services python selenium
Last synced: 17 Aug 2025
https://github.com/deadbits/trs
🔭 Threat report analysis via LLM and Vector DB
data-extraction detection-engineering large-language-models llm llm-prompting openai prompt-engineering summarization threat-intelligence
Last synced: 14 Apr 2025
https://github.com/jakubjafra/stellaris-map-generation
Extracts geopolitical data from Stellaris save game files
data-extraction game-files game-modding stellaris stellaris-map-generation
Last synced: 13 May 2025
https://github.com/bluishwu/treeclip
TreeClip is a Chrome extension that offers flexible text selection methods (similar-element selection, point selection, box selection, text search), combined with hierarchy navigation, inner-element selection, level binding, and custom output formats, to greatly improve your efficiency in copying information from web pages.
bulk-copy bulk-operation chrome-extension copy-paste data-extraction element-selection html text-selection treeclip web-tools
Last synced: 13 May 2025
https://github.com/rririanto/unstructured-demo-streamlit
Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own GPT and LLM projects using Unstructured.io via streamlit
ai data data-extraction gpt unstructured unstructured-data
Last synced: 09 Apr 2025
https://github.com/bisaloo/xlcutter
Parse Batches of 'xlsx' Files Based on a Template
data-extraction excel non-rectangular-data r r-package tidy-data
Last synced: 12 May 2025
https://github.com/beautifulmoon211/onthemarket-scraping
Web scraping tool used to extract real estate information from OnTheMarket.com, a leading property portal in the United Kingdom.
cheerio data-extraction onthemarket onthemarket-scraper real-estate requests typescript web-scraper
Last synced: 13 Jun 2025
https://github.com/ksm26/function-calling-and-data-extraction-with-llms
Master the techniques of function-calling and structured data extraction with LLMs. Learn to enhance LLM capabilities, integrate web services, and build practical applications for real-world data usability.
advanced-workflows ai-integration custom-functionality customer-service-transcripts data-analysis data-extraction end-to-end-applications function-calling llms natural-language-processing openapi practical-implementation structured-data web-services-integration
Last synced: 01 May 2026
https://github.com/geniuszly/genpythondoxing
GenPythonDoxing is a demo version of a Python-based tool designed for gathering publicly available information about email addresses, usernames, IP addresses, and Minecraft nicknames. It utilizes various APIs and web scraping techniques to collect data, providing a comprehensive view of online footprints.
cyber-investigation data-extraction data-mining dox doxing doxing-methods genpythondoxing information-gathering osint python python-doxing python-doxing-tool pythondoxing security-research
Last synced: 13 Apr 2025
https://github.com/blalop/bbva2pandas
Extract the data from your BBVA's monthly statements
bank bank-account bbva data-extraction extracted-data pandas
Last synced: 28 Apr 2025
https://github.com/Bisaloo/xlcutter
Parse Batches of 'xlsx' Files Based on a Template
data-extraction excel non-rectangular-data r r-package tidy-data
Last synced: 01 Apr 2025
https://github.com/aurumz-rgb/ReviewAid
AI-Driven Full-Text Screening and Data Extraction for Systematic Reviews and Evidence Synthesis
academic-research ai-assistant ai-tool data-extraction literature-review medical-research open-source python research-ai research-automation research-tool screening-tool streamlit systematic-reviews
Last synced: 19 Apr 2026
https://github.com/xarantolus/jsonextract
Go package for finding and extracting any JavaScript object (not just JSON) from an io.Reader
data-extraction javascript javascript-array javascript-extractor javascript-object javascript-object-scraper javascript-scraper javascript-web-scraper json json-extractor json-search web-json-extract web-scraping
Last synced: 11 Oct 2025
https://github.com/milahu/reverse-template-engine
Find a template of many similar HTML files
data-extraction grammar-generation grammar-generator parser-generator reverse-template reverse-template-engine schema-generation schema-generator structured-data-extraction structured-text template-generator template-induction tree-automata-induction
Last synced: 14 Apr 2025
https://github.com/desininja/voice-disorder
Data Science project. ML algorithms to detect voice disorders.
accuracy algorithms classification classification-algorithm classifier classifier-model data data-extraction data-mining data-science health machine-learning smote voice-disorder
Last synced: 30 Apr 2025
https://github.com/venkat-0706/amazon-webscraper
An Amazon web scraper extracts product data like prices, reviews, and ratings using tools like BeautifulSoup or Scrapy, aiding in market research while adhering to ethical and legal guidelines.
api-and-data-parsing automation beautifulsoup data-extraction ethical-scraping python-programming webscraping
Last synced: 26 Jun 2025
https://github.com/sypht-team/sypht-csharp-client
A C# / .NET client for the Sypht API
api-client data-extraction document-capture dot-net dotnet dotnet-cli dotnet-library extract extract-data-from-pdf extract-fields invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-csharp-client
Last synced: 13 Apr 2025
https://github.com/sypht-team/sypht-elixir-client
An Elixir client for the Sypht API https://sypht.com
api-client data-extraction document-capture elixir elixir-lang extract extract-data extract-fields information-retrieval information-retrieval-engine invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-api-elixir
Last synced: 13 Apr 2025
https://github.com/lykmapipo/python-spark-log-analysis
Python scripts to process, and analyze log files using PySpark.
apache-arrow apache-spark apache-spark-sql data-analysis data-extraction data-processing data-transformation log-analysis log-analyzer log-monitor lykmapipo pandas pyarrow pyspark python seaborn spark-ml spark-nlp sparkml-pipelines sql
Last synced: 22 Jun 2025
https://github.com/kanugurajesh/invoice
Automated Data Extraction and invoice management application
automated data-extraction expressjs gemini-api invoice-management material-ui mern-stack performance-optimization reactjs real-world-project redux-toolkit typescript vite
Last synced: 10 Oct 2025
https://github.com/jrdnbradford/readmdtable
R 📦 for reading markdown tables into tibbles
data data-analysis data-analytics data-extraction data-mining data-science markdown markdown-parser markdown-table r r-package r-programming
Last synced: 23 Oct 2025
https://github.com/rithulkamesh/docproc
Document Intelligence Platform — Extract, refine, and query documents with vision LLMs and config-driven RAG.
content-extraction data-extraction document-analysis document-parsing equation-detection layout-analysis machine-learning mathematical-symbols ocr pdf-processing pdf-text-extraction python region-detection text-classification text-extraction
Last synced: 02 Apr 2026
https://github.com/ExceptionRegret/Kryfto
The open-source web-browsing backend for AI agents & workflow engines. Ships a 42-tool MCP server for Claude Code/Cursor/Codex, a full REST API for n8n/Zapier/Make, federated multi-engine search, anti-bot stealth, and enterprise infrastructure (Postgres, Redis, BullMQ, MinIO). Self-host for $5/mo flat
ai-agents anti-detection claude-code codex cursor data-extraction developer-tools fastapi headless-browser mcp mcp-server n8n open-source playwright redis search-engine self-hosted stealth web-scraping workflow-automation
Last synced: 03 Apr 2026
https://github.com/os-climate/crrf-det
A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.
annotation data-extraction layout-analysis pdf table-extraction
Last synced: 12 Apr 2025
https://github.com/davidumoru/scryer
Transform web data into actionable knowledge
content-parsing data-extraction gemini-api google-gemini web-scraping
Last synced: 13 Aug 2025