Projects in Awesome Lists tagged with data-extraction
A curated list of projects in awesome lists tagged with data-extraction.
https://github.com/getmaxun/maxun
🔥 Open Source No Code Web Data Extraction Platform • Turn Websites To APIs & Spreadsheets With No-Code Robots In Minutes 🔥
agents api automation browser browser-automation data-extraction no-code no-code-web-scraper playwright robotic-process-automation rpa scraper self-hosted web-agent web-automation web-scraper web-scraping web-scraping-agent webscraping website-to-api
Last synced: 23 Jan 2026
https://github.com/vi3k6i5/flashtext
Extract keywords from sentences or replace keywords in sentences.
data-extraction keyword-extraction nlp search-in-text word2vec
Last synced: 13 May 2025
https://github.com/D4Vinci/Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath
Last synced: 13 May 2025
https://github.com/brightdata/brightdata-mcp
A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.
ai-agents ai-integrations anti-bot-detection browser-automation data-collection data-extraction llm mcp mcp-server modelcontextprotocol scraping scraping-tools structured-data web-crawling web-data web-scraping
Last synced: 16 Jan 2026
https://github.com/saifyxpro/headlessx
A lightweight, self-hosted headless browser automation platform. Designed as an alternative to Browserless, built for speed, privacy, and scalability.
automation automation-api automation-platform browser-automation browser-testing browserless chrome-headless chromedriver container-automation data-extraction headless headless-chrome headless-service playwright playwright-automation puppeteer scraping-service web-automation web-scraping
Last synced: 31 Jan 2026
https://github.com/jonathanlink/pdflayouttextstripper
Converts a PDF file into a text file while keeping the layout of the original PDF. Useful for extracting the content of a table in a PDF file, for instance. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).
data-extraction extract java layout pdf pdfbox text
Last synced: 15 May 2025
https://github.com/hi-primus/optimus
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark
Last synced: 14 May 2025
https://github.com/raznem/parsera
Lightweight library for scraping websites with LLMs
ai ai-scraping data-extraction llm opensource playwright python scraping webscraping
Last synced: 11 Apr 2025
https://github.com/thinh-vu/vnstock
A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone
data-extraction quantitative-analysis quantitative-finance quantitative-trading stock-market stock-screener
Last synced: 14 May 2025
https://github.com/eclaire-labs/eclaire
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
ai ai-assistant automation bookmark-manager bookmarks data-extraction document-processing llm local-first note-taking ocr on-device-ai open-source personal-knowledge-management privacy rest-api self-hosted task-management web-archiving
Last synced: 16 Jan 2026
https://github.com/polyrabbit/hacker-news-digest
📰 Let ChatGPT Summarize Hacker News for You
chatgpt chatgpt-api crawler data-extraction extract-summaries hacker-news hacker-news-digest hacker-news-reader machine-learning news-aggregator openai openai-api python rss spider
Last synced: 15 May 2025
https://github.com/adrienjoly/npm-pdfreader
🚜 Parse text and tables from PDF files.
data-extraction javascript parse-tables parsing pdf-converter pdf-reader rule-based-parsing tabular-data
Last synced: 14 May 2025
https://github.com/a-maliarov/amazoncaptcha
Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.
amazon amazon-captcha amazon-scraper amazoncaptcha captcha captcha-solver data-extraction pillow python3 training-data
Last synced: 26 Mar 2025
https://github.com/shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
ai contract-analysis data-extraction document-intelligence docx docx2md docx2txt generative-ai legaltech llm llm-extraction llm-framework llm-pipeline llms nlp prompt-engineering text-analysis unstructured-data
Last synced: 13 May 2025
https://github.com/py-pdf/benchmarks
Benchmarking PDF libraries
benchmark data-extraction mupdf pdf poppler-utils pypdf2 text-extraction
Last synced: 28 Jul 2025
https://github.com/serpapi/clauneck
A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.
automation command-line command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing email-scraper open-source ruby rubygem serp social-media-scraper web-crawling webscraping
Last synced: 06 Apr 2025
https://github.com/molybdenum-99/infoboxer
Wikipedia information extraction library
data-extraction mediawiki wikipedia
Last synced: 05 Apr 2025
https://github.com/sypht-team/sypht-python-client
A Python client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser pdf-parser python python3 python3-library receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-python-client
Last synced: 11 Jul 2025
https://github.com/yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction
Last synced: 09 Mar 2026
https://github.com/dilawar/plotdigitizer
A Python utility to digitize plots.
data-extraction digitization image-processing python3
Last synced: 06 Apr 2025
https://github.com/ScrapeGraphAI/scrapecraft
🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.
ai automation data-extraction docker fastapi hacktoberfest langgraph python react scrapegraphai typescript web-scraping webscraping
Last synced: 25 Aug 2025
https://github.com/nfx/go-htmltable
Structured HTML table data extraction from URLs in Go that has almost no external dependencies
data-extraction go go-generics html
Last synced: 05 Apr 2025
https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision
Line segmentation algorithm for Google Vision API.
data-extraction google-vision invoice proposed-algorithm segmentation
Last synced: 25 Jun 2025
https://github.com/tech-engine/goscrapy
GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.
data-extraction go-scrapy golang goscraper scrapy spider web-crawler webscraper webscrapping
Last synced: 18 Jan 2026
https://github.com/dav009/flash
Golang Keyword extraction/replacement Datastructure using Tries instead of regexes
data-extraction go golang search text text-search trie
Last synced: 30 Apr 2025
https://github.com/sypht-team/sypht-java-client
A Java client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-retrieval information-retrieval-engine invoice invoice-parser java java8 pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-java-client
Last synced: 10 Apr 2025
https://github.com/danburzo/hred
Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.
cli data-extraction html json xml
Last synced: 02 Apr 2025
https://github.com/html-extract/hext
Domain-specific language for extracting structured data from HTML documents
cpp data-extraction dsl html html-extraction node php python ruby scraping
Last synced: 15 Apr 2025
https://github.com/stabrise/spark-pdf
PDF DataSource for Apache Spark
big-data data-engineering data-extraction data-science ocr ocr-recognition pdf pdf-document pdf-document-processor spark spark-datasource tesseract tesseract-ocr
Last synced: 09 Apr 2025
https://github.com/articdive/articdata
Collection of data extracted from Minecraft.
data data-extraction data-mining java json mc minecraft minecraft-data minecraft-server minecraft-servers registry
Last synced: 17 May 2025
https://github.com/serpapi/google-search-results-java
Google Search Results Java API via SerpApi
data-extraction data-scraping java java-api json serp-api serpapi web-scraping webscraping
Last synced: 09 Jul 2025
https://github.com/hddevteam/smart-form-filler
AI-powered form filling and data extraction browser extension
ai-powered browser-extension chrome-extension data-extraction demo-site developer-tools form-automation form-filling github-pages gpt-integration javascript local-ai machine-learning nodejs ollama privacy-focused productivity-tools web-automation
Last synced: 14 Feb 2026
https://github.com/johnbumgarner/newshound
This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.
article-extracting article-extractor data-extraction data-mining data-science datascience news news-aggregator news-crawler newspaper-crawler python-newspaper python3 text-mining web-scraping webscraping
Last synced: 14 Jan 2026
https://github.com/extralit/extralit
Fast and accurate systemic data extraction with LLM assistance
data-extraction literature-review llm
Last synced: 14 Jan 2026
https://github.com/mhucka/taupe
Taupe takes a downloaded Twitter archive ZIP file, extracts the URLs corresponding to tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can use with other software tools.
archives comma-separated-values csv data-extraction markdown twitter twitter-archive twitter-archives url
Last synced: 14 Dec 2025
https://github.com/sypht-team/sypht-golang-client
A Golang client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields go golang golang-library golang-package invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-golang-client
Last synced: 20 Oct 2025
https://github.com/rubydamodar/protext-analyzer
ProText Analyzer is a powerful tool for extracting insights from text. It conducts sentiment analysis, categorizing content as positive, negative, or neutral, while also assessing readability and linguistic complexity. Ideal for businesses and researchers, it enhances understanding of textual data.
complex-word-definition data-cleaning-techniques data-extraction linguistic-complexity metrics-explained readability-analysis sentiment-analysis syllable-counting-methodology tokenization-process
Last synced: 17 Jul 2025
https://github.com/linw1995/data_extractor
Combine XPath, CSS Selectors and JSONPath for Web data extracting.
css-selectors data-extraction data-extractor jsonpath xpath
Last synced: 30 Jan 2026
https://github.com/gambolputty/wiktionary-de-parser
Extract data from German Wiktionary XML files.
data-extraction dewiktionary german german-language nlp wiktionary wiktionary-dump wiktionary-parser
Last synced: 14 Jan 2026
https://github.com/pim97/scrappey-wrapper-python
An API wrapper for Scrappey.com written in Python (cloudflare, datadome bypass & solver)
akamai anti-bot-api captcha captcha-solver cloudflare-anti-bot cloudflare-bypass data-extraction datadome incapsula perimetex queue-it scraping-framework scraping-library scraping-service scraping-tool shape web-data-extration web-scraping web-scraping-solution
Last synced: 12 Jan 2026
https://github.com/nextkore/smartmuv
An EVM-compatible Solidity Smart Contract Storage/Slot Analyzer and Data Extractor.
blockchain-explorer code-analysis data-exploration data-extraction ethereum ethereum-blockchain explorer migrate scanner smart-contracts solidity static-analysis storage storage-analysis tracker upgrade
Last synced: 17 Jul 2025
https://github.com/cpl/exodus
Data exfiltration using DNS
data-extraction dns dns-client dns-exfiltration dns-server exfiltration firewall-bypass security-tools
Last synced: 16 Jan 2026
https://github.com/imranr98/wealthsimpleton
A Python script that scrapes your Wealthsimple activity history and saves the data in a JSON file.
data-extraction data-ownership export python selenium selenium-webdriver wealthsimple web web-scraping
Last synced: 14 May 2025
https://github.com/shdev/phpflashtext
Extract keywords from sentences or replace keywords in sentences. See https://github.com/vi3k6i5/flashtext
data-analysis data-extraction flashtext keyword-extraction nlp php search-in-text string-manipulation string-matching word2vec
Last synced: 12 Jan 2026
https://github.com/biraj21/web-wanderer
A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.
data-extraction multithreading python web-crawler webcrawler
Last synced: 12 Jan 2026
https://github.com/quantumbytestudios/githubuserdataextractor
GitHubUserDataExtractor is a cross-platform Python tool designed to extract and display public GitHub user data both in the terminal and through a visual HTML dashboard. It provides a streamlined way to fetch a user’s profile, recent activity, and contribution statistics using GitHub’s REST API and external visualization services.
data-extraction data-extractor hack hack-tool hack-tools hacker-scripts hacker-tool hacking linux-tools python-tools tools
Last synced: 31 Jul 2025
https://github.com/Fabiopf02/ofx-data-extractor
A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser
banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx
Last synced: 11 Sep 2025
https://github.com/arkutils/arkutils-website
The source for the arkutils website, home of a few Ark: Survival Evolved and Ascended tools.
ark-survival-ascended ark-survival-evolved ark-survivial data-extraction game-tool
Last synced: 23 Jan 2026
https://github.com/robert-mcdermott/ollama-batch-cluster
Large Scale Batch Processing with Ollama
data-extraction gpu hpc-cluster llm ollama
Last synced: 12 Oct 2025
https://github.com/webmiddle/webmiddle
Node.js framework for modular web scraping and data extraction
data-extraction framework jsx jsx-components modular nodejs web-scraping
Last synced: 29 Oct 2025
https://github.com/aryanvbw/exif
ExifTool is a powerful command-line tool that can be used to extract and edit metadata in a wide range of media files, including images, audio, and video. Metadata is information that is stored within a file that describes the file’s content or other attributes.
aryan-technologies aryanshop aryanvbw data-extraction image-metadata image-processing images-hacking information-gathering powered-by-aryan-technologies vivek
Last synced: 24 Oct 2025
https://github.com/sypht-team/sypht-node-client
A Node.js client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields invoice invoice-parser node node-module nodejs nodejs-client pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-node-client
Last synced: 13 Apr 2025
https://github.com/attogram/justrefs
Just Refs - extract just the references and related topics from any page on the English Wikipedia
data-extraction information-extraction wikipedia wikipedia-api wikipedia-scraper wikipedia-viewer
Last synced: 14 Apr 2025
https://github.com/u-c4n/u-transkript
U-Transkript is a powerful Python library for automatically extracting transcripts (subtitles) from YouTube videos and translating them into various languages using Google Gemini AI. It supports 50+ languages, offers flexible output formats (TXT, JSON, XML), and features an easy-to-use, chainable API. Ideal for education, research, content creation
ai data-extraction python subtitles transcript translation youtube youtube-api
Last synced: 01 Jul 2025
https://github.com/irfanalidv/trustpilot_scraper
A Python library for scraping Trustpilot reviews.
beautifulsoup data-collection data-extraction etl-pipeline review-scraper text-mining trustpilot web-scraping-python
Last synced: 14 Jan 2026
https://github.com/petrpatek/airbnb-scraper
Apify public actor for scraping Airbnb homes.
airbnb airbnb-api apify crawler data-extraction scrape
Last synced: 20 Mar 2025
https://github.com/sypht-team/sypht-kotlin-client
A Kotlin client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser kotlin kotlin-android kotlin-library pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-kotlin-client
Last synced: 27 Jul 2025
https://github.com/chaitanyarahalkar/financial-info-extractor
Extract financial information in CSV format for companies compliant to the NSE
beautifulsoup csv-parser data-extraction data-scraping financial-data financial-services python selenium
Last synced: 17 Aug 2025
https://github.com/kehvinbehvin/json-mcp-filter
JSON MCP server to filter only relevant data for your LLM
claude-mcp data data-extraction data-filtering json json-analysis json-filter json-mcp-server json-parser json-schema-inference json-to-typescript json-utilities large-files mcp mcp-server query type-generation
Last synced: 07 Sep 2025
https://github.com/deadbits/trs
🔭 Threat report analysis via LLM and Vector DB
data-extraction detection-engineering large-language-models llm llm-prompting openai prompt-engineering summarization threat-intelligence
Last synced: 14 Apr 2025
https://github.com/jakubjafra/stellaris-map-generation
Extracts geopolitical data from Stellaris save game files
data-extraction game-files game-modding stellaris stellaris-map-generation
Last synced: 13 May 2025
https://github.com/bluishwu/treeclip
TreeClip is a Chrome extension that offers several flexible ways to select text on a page (similar-element selection, point selection, box selection, text search), combined with hierarchy navigation, inner-element selection, hierarchy binding, and custom output formats, to greatly improve your efficiency in copying information from web pages.
bulk-copy bulk-operation chrome-extension copy-paste data-extraction element-selection html text-selection treeclip web-tools
Last synced: 13 May 2025
https://github.com/rririanto/unstructured-demo-streamlit
Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own GPT and LLM projects using Unstructured.io via streamlit
ai data data-extraction gpt unstructured unstructured-data
Last synced: 09 Apr 2025
https://github.com/beautifulmoon211/onthemarket-scraping
Web scraping tool used to extract real estate information from OnTheMarket.com, a leading property portal in the United Kingdom.
cheerio data-extraction onthemarket onthemarket-scraper real-estate requests typescript web-scraper
Last synced: 13 Jun 2025
https://github.com/xarantolus/jsonextract
Go package for finding and extracting any JavaScript object (not just JSON) from an io.Reader
data-extraction javascript javascript-array javascript-extractor javascript-object javascript-object-scraper javascript-scraper javascript-web-scraper json json-extractor json-search web-json-extract web-scraping
Last synced: 11 Oct 2025
https://github.com/desininja/voice-disorder
Data Science project. ML algorithms to detect voice disorders.
accuracy algorithms classification classification-algorithm classifier classifier-model data data-extraction data-mining data-science health machine-learning smote voice-disorder
Last synced: 30 Apr 2025
https://github.com/blalop/bbva2pandas
Extract the data from your BBVA's monthly statements
bank bank-account bbva data-extraction extracted-data pandas
Last synced: 28 Apr 2025
https://github.com/milahu/reverse-template-engine
Find a template from many similar HTML files
data-extraction grammar-generation grammar-generator parser-generator reverse-template reverse-template-engine schema-generation schema-generator structured-data-extraction structured-text template-generator template-induction tree-automata-induction
Last synced: 14 Apr 2025
https://github.com/venkat-0706/amazon-webscraper
An Amazon web scraper extracts product data like prices, reviews, and ratings using tools like BeautifulSoup or Scrapy, aiding in market research while adhering to ethical and legal guidelines.
api-and-data-parsing automation beautifulsoup data-extraction ethical-scraping python-programming webscraping
Last synced: 26 Jun 2025
https://github.com/geniuszly/genpythondoxing
GenPythonDoxing is a demo version of a Python-based tool designed for gathering publicly available information about email addresses, usernames, IP addresses, and Minecraft nicknames. It utilizes various APIs and web scraping techniques to collect data, providing a comprehensive view of online footprints.
cyber-investigation data-extraction data-mining dox doxing doxing-methods genpythondoxing information-gathering osint python python-doxing python-doxing-tool pythondoxing security-research
Last synced: 13 Apr 2025
https://github.com/Bisaloo/xlcutter
Parse Batches of 'xlsx' Files Based on a Template
data-extraction excel non-rectangular-data r r-package tidy-data
Last synced: 01 Apr 2025
https://github.com/sypht-team/sypht-elixir-client
An Elixir client for the Sypht API https://sypht.com
api-client data-extraction document-capture elixir elixir-lang extract extract-data extract-fields information-retrieval information-retrieval-engine invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-api-elixir
Last synced: 13 Apr 2025
https://github.com/jrdnbradford/readmdtable
R 📦 for reading markdown tables into tibbles
data data-analysis data-analytics data-extraction data-mining data-science markdown markdown-parser markdown-table r r-package r-programming
Last synced: 23 Oct 2025
https://github.com/kanugurajesh/invoice
Automated Data Extraction and invoice management application
automated data-extraction expressjs gemini-api invoice-management material-ui mern-stack performance-optimization reactjs real-world-project redux-toolkit typescript vite
Last synced: 10 Oct 2025
https://github.com/lykmapipo/python-spark-log-analysis
Python scripts to process and analyze log files using PySpark.
apache-arrow apache-spark apache-spark-sql data-analysis data-extraction data-processing data-transformation log-analysis log-analyzer log-monitor lykmapipo pandas pyarrow pyspark python seaborn spark-ml spark-nlp sparkml-pipelines sql
Last synced: 22 Jun 2025
https://github.com/sypht-team/sypht-csharp-client
A C# / .NET client for the Sypht API
api-client data-extraction document-capture dot-net dotnet dotnet-cli dotnet-library extract extract-data-from-pdf extract-fields invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-csharp-client
Last synced: 13 Apr 2025
https://github.com/imranr98/instacartflation
A Python script that scrapes your Instacart order history and saves the data in a JSON file.
data-extraction data-ownership export instacart python selenium selenium-webdriver web web-scraping
Last synced: 14 May 2025
https://github.com/davidumoru/scryer
Transform web data into actionable knowledge
content-parsing data-extraction gemini-api google-gemini web-scraping
Last synced: 13 Aug 2025
https://github.com/lykmapipo/nyc-tlc-trip-data
Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset
apache-arrow apache-spark data data-engineering data-extraction data-transformation etl fsspec geopandas joblib jupyterlab lykmapipo metadata nyc nyc-taxi-dataset pandas pyarrow python s3
Last synced: 17 Sep 2025
https://github.com/os-climate/crrf-det
A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.
annotation data-extraction layout-analysis pdf table-extraction
Last synced: 12 Apr 2025
https://github.com/promptapi/scraper-py
Python package for Prompt API's Scraper API
api-marketplace api-wrapper css-selector css-selector-parser data-extraction image-scraper promptapi python3 scraper scraper-api web-scraper web-scraping
Last synced: 16 Jan 2026
https://github.com/coskundeniz/twitter-data-extractor
Twitter Data Extractor
data-extraction excel google-sheets mongodb python sqlite tweepy tweets-extraction twitter twitter-api
Last synced: 06 Mar 2025
https://github.com/kalebu/worldmeter-coronavirus-scraper
A Python program that tracks coronavirus statistics based on the Worldometer website
beautifulsoup coronavirus data-extraction data-science python-tanzania tanzania webscraping worldmeter-coronavirus-scraper
Last synced: 08 May 2025
https://github.com/sypht-team/sypht-ruby-client
A Ruby client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-retrieval invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning ruby ruby-gem ruby-library sypht sypht-api sypht-ruby-client
Last synced: 05 Mar 2025
https://github.com/ghentcdh/taulu
Taulu is a Python package designed to segment tabular data in scanned or photographed documents.
data-extraction historic-documents htr ocr segmentation tabular-data
Last synced: 18 Jul 2025
https://github.com/pepe-god/dataprophet
Extracts citizens' identity information from MySQL, creates a family network based on TC ID No., and exports it to CSV
101m 109m adres data-analysis data-extraction database-connector family-tree genealogy gsm hsys identity mysql-database python-script pyton
Last synced: 13 Jul 2025
https://github.com/amirali104/text2excel
A GUI desktop application that extracts data from a text file and writes it to an Excel or CSV file using regular expression (regex) patterns
automation csv data-extraction data-extractor data-processing excel openpyxl productivity-tool productivity-tools regex text-parsing text-processing text-to-excel tkinter tkinter-gui
Last synced: 04 Oct 2025
https://github.com/acuciureanu/js-maid
A rule-driven engine designed for seamless extraction of data from JavaScript files.
bugbounty-tool bugbountytips data-extraction javascript security-audit static-code-analyzer
Last synced: 09 Apr 2025
https://github.com/hyeonsangjeon/pdf2llm-tuning-studio
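A solution that automatically generates high-quality question-answering (QA) data from PDF documents using GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models using LoRA.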
aws bedrock claude cuda data-argumantation data-extraction distillation docker finetuning gpu llm pdf-generation pdf-text-extraction processing processing-job sagemaker text-disti unsloth unstructured
Last synced: 15 Jun 2025
https://github.com/oxylabs/how-to-scrape-wayfair
A step-by-step tutorial on extracting data from Wayfair’s product pages at scale and in real time. The guide details actionable code and considers various aspects before and during the scraping process.
data-extraction how-to parsing python wayfair wayfair-scraper web-scraping
Last synced: 27 Sep 2025
https://github.com/danhilse/web-scraper
A versatile Python-based web scraper that extracts content from single URLs or entire sitemaps, organizing data into structured text files. Features include sitemap parsing, content grouping by URL structure, and an easy-to-use command-line interface. Ideal for data extraction, content analysis, and web research tasks.
beautifulsoup cli-tool data-extraction python sitemap-parser web-scraping
Last synced: 23 Apr 2025
https://github.com/hugcis/data_journalism_extractor
A tool for extracting and integrating data from heterogeneous data sources
data-extraction data-journalism flink information-retrieval journalism
Last synced: 01 Sep 2025