Projects in Awesome Lists tagged with extract-data
A curated list of projects in awesome lists tagged with extract-data.
https://github.com/opendatalab/mineru
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 06 Jan 2026
https://github.com/opendatalab/MinerU
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 24 Mar 2025
https://github.com/pymupdf/pymupdf
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps
Last synced: 09 Sep 2025
https://github.com/bda-research/node-crawler
Web Crawler/Spider for NodeJS + server-side jQuery ;-)
cheerio crawler extract-data javascript jquery nodejs spider
Last synced: 13 May 2025
https://github.com/pymupdf/PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps
Last synced: 08 Apr 2025
https://github.com/meltano/meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
connectors data data-engineering data-pipelines dataops dataops-platform elt extract-data integration loaders meltano meltano-sdk open-source opensource pipelines singer tap taps target targets
Last synced: 12 May 2025
https://github.com/markummitchell/engauge-digitizer
Extracts data points from images of graphs
digitizer extract-data image-analysis utility
Last synced: 16 May 2025
https://github.com/elixir-crawly/crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
crawler crawling elixir erlang extract-data scraper scraping scraping-websites spider
Last synced: 11 Dec 2025
https://github.com/slotix/dataflowkit
Extract structured data from websites. Website scraping.
cdp chrome-fetcher crawling extract-data go golang golang-library headless scraper scraping scraping-websites
Last synced: 14 Mar 2025
https://github.com/danschultzer/receipt-scanner
Receipt scanner extracts information from your PDF or image receipts - built in NodeJS
extract-data extract-information ocr optical-character-recognition receipt-scanner receipts
Last synced: 07 Apr 2025
https://github.com/omkarpathak/resumeparser
A simple resume parser used for extracting information from resumes
extract-data gui parser python python3 resume-parser
Last synced: 05 Apr 2025
https://github.com/OmkarPathak/ResumeParser
A simple resume parser used for extracting information from resumes
extract-data gui parser python python3 resume-parser
Last synced: 18 Jul 2025
https://github.com/ropensci/smapr
An R package for acquisition and processing of NASA SMAP data
acquisition extract-data nasa peer-reviewed r r-package raster rstats smap-data soil-mapping soil-moisture soil-moisture-sensor
Last synced: 20 Jul 2025
https://github.com/yuanxu-li/html-table-extractor
Extract data from an HTML table
beautifulsoup crawler extract-data html html-table scraping table
Last synced: 10 Apr 2025
https://github.com/msoap/html2data
Library and CLI for extracting data from HTML via CSS selectors
cli css-selector extract-data golang homebrew html library parser scrapping
Last synced: 27 Jul 2025
https://github.com/isaacmg/fb_scraper
FBLYZE is a Facebook scraping system and analysis system.
extract-data facebook-scraper flink kafka spark tf-idf
Last synced: 10 Jul 2025
https://github.com/techcatchers/pylyrics-extractor
Get lyrics for any song in less than 2 seconds by passing in the song name (spelled correctly or misspelled) to this Python library.
extract-data lyrics-fetcher python-library search-algorithm
Last synced: 11 Apr 2025
https://github.com/Techcatchers/PyLyrics-Extractor
Get lyrics for any song in less than 2 seconds by passing in the song name (spelled correctly or misspelled) to this Python library.
extract-data lyrics-fetcher python-library search-algorithm
Last synced: 22 Jul 2025
https://github.com/asad70/insider-trading
This program extracts insider trading data from the SEC website and stores it in an Excel file for the specified time frame.
algotrading data-science extract-data insider-trading insiders tickers trading trading-strategies
Last synced: 27 Apr 2025
https://github.com/osh/gr-eventstream
gr-eventstream is a set of GNU Radio blocks for creating precisely timed events and either inserting them into or extracting them from normal data streams. It allows the definition of high-speed, time-synchronous C++ burst event handlers, as well as easy bridging to standard GNU Radio Async PDU messages with precise timing.
burst c-plus-plus event-handling extract-data extractor gnu-radio injection message-passing python radio signal-processing signaling-pathways synchronization synchronization-service synchronous timing-simulator
Last synced: 12 Apr 2025
https://github.com/ionictemplate-app/social-network-data-scraper-pro
Easily scrape 10,000+ email messages in one hour to help you quickly grow your customer base. Extracts data from LinkedIn, Facebook, Instagram, YouTube, Pinterest, and Twitter. Search by specific keywords. Ready-to-use Social Network Data Scraper software to get started instantly. Source code and install file included.
business-email business-extractor email-scraper extract-data extract-emails extractor-email google-extract scraper-address scraper-email scraper-facebook scraper-instagram scraper-linkedin scraper-name scraper-phone scraper-twitter social-media social-network social-scraper
Last synced: 03 Dec 2025
https://github.com/serhaturtis/tool-fastbatchimagecrop
A simple UI tool to batch crop images to prepare datasets from images and videos.
cropping-images dataset-generation extract-data gui image-classification machine-learning python stable-diffusion ui
Last synced: 12 May 2025
https://github.com/alienzhou/giframe
Extract the first frame of a GIF without reading the whole file; supports both the browser and Node.js 📸
decoder extract-data frame gif gif87a gif89a progressive stream-like
Last synced: 06 May 2025
https://github.com/rdlopes/webhere
HTML scraping for Objective-C.
afnetworking cocoapods extract-data gdataxml-html html ios nocilla osx scraping web xpath
Last synced: 23 Oct 2025
https://github.com/agenty/scrapingai
Build AI-powered web scraping agents with Agenty to auto-extract data from websites, capture screenshots, generate PDFs from URLs, and crawl the web.
crawler crawling datascraping extract-data scraping webscraper webscraping
Last synced: 12 Apr 2025
https://github.com/meltanolabs/tap-dbt
Singer Tap for dbt API v2 built with the Meltano SDK
dbt dbt-cloud elt extract-data meltano-sdk singer-io singer-tap
Last synced: 19 Oct 2025
https://github.com/darkskygit/chatimporter
Import chat records from your IM and store them in a single SQLite database
backup backup-tool chat chat-history extract-data
Last synced: 14 Apr 2025
https://github.com/darkskygit/ChatImporter
Import chat records from your IM and store them in a single SQLite database
backup backup-tool chat chat-history extract-data
Last synced: 18 Jul 2025
https://github.com/jehad-halahla/linux_project
A Linux lab Bash project that focuses on automation and text extraction
bash-script commands extract-data linux manual
Last synced: 10 Apr 2025
https://github.com/aidayang/mineru-oneclick
MinerU one-click launch bundle; no installation or deployment required
ai4science document-analysis extract-data layout-analysis markdown mineru ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser pdftojson pdftomarkdown python
Last synced: 12 Jul 2025
https://github.com/kormanowsky/jextract
Allows extracting data from DOM
css css-selector dom extract-data html javascript jextract jquery js selector
Last synced: 12 Apr 2025
https://github.com/sypht-team/sypht-elixir-client
An Elixir client for the Sypht API https://sypht.com
api-client data-extraction document-capture elixir elixir-lang extract extract-data extract-fields information-retrieval information-retrieval-engine invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-api-elixir
Last synced: 13 Apr 2025
https://github.com/basemax/extractword
Extract word(s) from the lines of the file.
extract extract-data extract-information extract-text extraction extractor php replace replace-text text-processing text-processor webform
Last synced: 13 Jun 2025
https://github.com/meltanolabs/tap-stackexchange
Singer tap for the StackExchange API
elt extract-data meltano-sdk singer-io singer-tap stackexchange
Last synced: 04 Aug 2025
https://github.com/apurvasijaria/googleplaystorescrape
Python module to extract Google Play store reviews and other information of any android app.
app-data extract extract-data google-play-store googleplaystore module pypi pypi-package pypi-packages python python-module scraper selenium
Last synced: 11 Apr 2025
https://github.com/walidbosso/r_data_mining
Extract knowledge from data using different techniques, including association rules, hierarchical agglomerative clustering (HAC), k-means clustering, and decision trees.
association-rule-mining association-rules clustering data-analysis data-mining data-science data-visualization decision-tree-classifier decision-trees exportation extract-data hac hierarchical-clustering k-means k-means-clustering k-means-r r-programming r-studio
Last synced: 23 Mar 2025
https://github.com/chris1111/macos-extractor
7z-archives archive bzip2 extract-data extraction zip
Last synced: 03 Mar 2025
https://github.com/sc-networks/hydrator
A pragmatic hydrator and extractor library
extract extract-data extraction hydrate hydration hydrator php php7 php8
Last synced: 19 Mar 2025
https://github.com/qwazr/extractor
A web API for text and metadata extraction
extract-data extractor metadata-extraction parse parser-auto-detection
Last synced: 03 Apr 2025
https://github.com/sandeepbalachandran/pytheract
Tool for extracting data from files.
extract-data extract-data-from-image pytesseract pytheract tesseract
Last synced: 25 Mar 2025
https://github.com/jalal246/corename
Automatically extracts packages root name for monorepos
corename extract-data extract-information extract-text extracts get-info monorepo package-development package-json package-management production read-json utility
Last synced: 26 Mar 2025
https://github.com/chetanxpro/document-ai
An app to extract structured data from a PDF document
Last synced: 11 Oct 2025
https://github.com/kamalpaneru/xtractor
Splits cells from excel sheet images and extracts data.
azure-computer-vision extract-data ruby split-cells
Last synced: 14 May 2025
https://github.com/Kamalpaneru/Xtractor
Splits cells from excel sheet images and extracts data.
azure-computer-vision extract-data ruby split-cells
Last synced: 03 May 2025
https://github.com/netodeolino/tcc
Final undergraduate project (Trabalho de Conclusão de Curso) - Information Systems, UFC
clustering data-mining extract-data jupyter-notebook python
Last synced: 18 Oct 2025
https://github.com/lmlk-seal/printext
Printext is a lightweight application that extracts text from images.
app application extract-data image-processing imagerecognition images imagetotext img2txt lightweight tesseract-ocr text tkinter-gui windows
Last synced: 15 Oct 2025
https://github.com/rainergo/uasfra-ms-knowledgegraph
Python project to read and use ESG data from XBRL-files to construct a neo4j Knowledge-Graph to be enriched with external data (Wikidata, DBPedia). An OpenAI-attached chat bot is used to query the Graph.
chatbot data-science esg extract-data knowledge-graph neo4j openai xbrl
Last synced: 25 Dec 2025
https://github.com/tamk-kol/chatbot-q-a-in-invoice-extractor-llm
The Invoice Extractor markdown is a specific format used to extract relevant information from invoices. It's a standardized way to annotate invoices with key information, making it easier to automate the extraction process.
chatbot extract-data extractor-api extractpdftext gemini-api gemini-pro gemini-pro-api gemini-pro-vision googleapi llms single-page-app
Last synced: 24 Feb 2025
https://github.com/shubhranpara/auto-filler-web
This repository contains my internship project work at Flexbox Technologies. I developed a system that automatically fills in the patient details form with patient data extracted from a PDF file.
automation docx-files extract-data faiss-vector-database flan-t5 form-filler html-css-javascript huggingface-transformers json langchain llms medical-application patient-data pdf-converter pdf-document pptx-files python-3 qa streamlit-webapp
Last synced: 02 Apr 2025
https://github.com/fuutoru/face-recognition-using-machine-learning
A repo for face recognition on 5 famous people
extract-data face-recognition famous-people
Last synced: 27 Mar 2025
https://github.com/qyfashae/extract_off_data
Extract data from offline files, e.g. emails, phone numbers, links, etc.
extract extract-data extract-emails extract-links scraping
Last synced: 02 Mar 2025
https://github.com/jmitander/jmscraper
Scrape web pages and effortlessly extract the data you need. Easy, robust, efficient, and intuitively user-friendly.
extract-data extract-media extract-metadata extractor scraping scraping-web scraping-websites webscraper webscraping website-scraper webtool
Last synced: 06 Sep 2025
https://github.com/ispyhumanfly/prowler
Query the web, extract data from the results, and transform that data into a format you can use.
ai analytics business cryptocurrency data extract-data machine-learning mining scraping web
Last synced: 06 Sep 2025
https://github.com/laskevych/vstup.edbo.gov.ua.report
Create a report of students from page data
console-application extract-data extractor government puppeteer student ukraine
Last synced: 09 Apr 2025
https://github.com/duart38/pdf-snippets
Chrome extension to extract a selected portion or section of a webpage into a PDF file
chrome-extension convert-to-pdf designer-tool extract-data extract-images imagetopdf pdf pdf-generation quality-of-life texttopdf tool webscraping website-to-pdf
Last synced: 16 Jun 2025
https://github.com/bessouat40/pdf-region-picker
A project to select only part of a PDF file. It's useful when you want to extract information with a Python library such as fitz.
data-extraction data-selection extract-data fitz javascript parsing pdf region-picker
Last synced: 06 Mar 2025
https://github.com/zedseven/urlextractor
A small tool for extracting all URLs from a blob of binary data (e.g. PDFs).
blob extract extract-data lightweight-tool url url-extractor urlextractor utility
Last synced: 06 Mar 2025
https://github.com/zeeshanahmad4/nlp--data-extraction-microsoft-word-documents-into-a-csv
extract-data nlp pdf pdf-converter pdf-document pdf-document-processor pdf-generation pdfconverter pdfcrawler pdfdata pdfextractor pdffileconversion pdfkit pdfpython pdfscraper pdftoword
Last synced: 01 Apr 2025
https://github.com/basemax/omitkeeplines
Keeping or removing some part of lines from a text with special attributes.
extract extract-data extraction extractor filter filter-line filter-lines filter-list filter-lists filtering filterlist filters keep-text line-filter lines-filter php text-keep word-count words-filter
Last synced: 03 Apr 2025
https://github.com/simplyyan/cutinfo
Go library to extract information based on references
extract-data go go-lib go-library golang string-manipulation strings
Last synced: 01 Apr 2025
https://github.com/shubhranpara/auto-filler
This repository contains my team's internship project work at Flexbox Technologies. We developed a system that automatically fills in the patient details form with patient data extracted from a PDF file.
docx extract-data faiss-vector-database flan-t5 form-filling gemma huggingface-transformers langchain llms pdf pdf-converter pptx python3 qa-automation streamlit-application
Last synced: 22 Feb 2025
https://github.com/zeynepcol/data-science-cryptocurrencies-data-analysis-forecasting
Cryptocurrency price analysis and prediction using regression models
artificial-intelligence crytpocurrency data-analysis data-mining data-preprocessing data-processing data-science data-visualization extract-data financial-analysis linear-regression lstm machine-learning regression-algorithms xgboost
Last synced: 07 Jul 2025
https://github.com/rubenslyra/vse-py
Video Subtitle Extractor (vse-py) is a Python project that extracts subtitles from videos at URLs provided by the user.
extract-data python subtitles youtube-dl
Last synced: 18 Mar 2025
https://github.com/ecrmnn/extract-index
Extract values from an array of arrays by index
array-manipulations array-processing arrays extract extract-data
Last synced: 28 Oct 2025
https://github.com/fityannugroho/idn-area-data-extractor
Extract Indonesia area data from the raw sources to csv for fityannugroho/idn-area-data
extract-data extractor idn-area
Last synced: 28 Mar 2025
https://github.com/basemax/smartfilter
Smart filtering to keep or remove characters or words in a text. (SOON)
extract extract-data extract-features extract-information extract-text extraction extractive-summarization extractor php split splitter splitting text text-analysis text-analytics text-analyzer text-mining
Last synced: 03 Apr 2025
https://github.com/mistersoandso/python-gmail-extractor
Demo project. Extract data from specific senders
bs4 extract-data gmail-api google-cloud-platform python3 scraping terminal-based
Last synced: 28 Mar 2025
https://github.com/dann-oliv/query-results-to-excel
excel extract-data postgresql python3 sql
Last synced: 21 Mar 2025
https://github.com/randomgamingdev/mc_block_color_mapper
Python scripts & libraries for generating and mapping the average colors for each of the Minecraft blocks
average average-calculator cli data data-generator documented-api extract extract-data extractor fast minecraft python3 simple small texture texture-pack textures
Last synced: 26 Dec 2025
https://github.com/lamouchi-bayrem/document_scanner
Flask web app that scans documents using OpenCV
ajax document extract-data flask ia ocr-recognition scanning-tool sql-server tailwindcss
Last synced: 12 May 2025
https://github.com/drisskhattabi6/meteo-data-mining
This repo applies data mining techniques to analyze meteorological (meteo) data. The objective is to extract meaningful insights and patterns from the data that can aid in understanding weather phenomena and predicting future weather conditions.
cart data-analysis data-mining data-visualization decision-making decision-tree extract-data extract-insights insights-analytics insights-data k-means knn machine-learning svm
Last synced: 21 Mar 2025
https://github.com/timothy-bartlett/pymupdf
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction text-processing text-shaping xps
Last synced: 17 Mar 2025
https://github.com/nostalgiccoder/readexcelfile.lib
Extracts data from a spreadsheet and outputs its contents to a '.SQL' file. Data extraction tool useful for people using SQL Server Express with no access to SSMS addon and import wizard.
c-sharp console etl excel extract-data library net-framework spreadsheet sql
Last synced: 25 Dec 2025
https://github.com/loglux-lab/ip-extractor
ip-extractor.sh uses nano to extract IP addresses. Results are stored in 'hosts', with duplicates removed. Ideal for sifting through logs and data-rich files.
bash extract-data linux-shell nano regex regular-expression
Last synced: 25 Feb 2025
https://github.com/raspi/fs2util
FreeSpace 2 util
checksum command-line-tool extract-data freespace2 game go golang
Last synced: 25 Feb 2025
https://github.com/ammaryasirnaich/pyreqify
This project is a lightweight Python module designed to generate the requirements.txt file. It streamlines dependency management by automatically extracting imported modules from Python or Jupyter files and generating their requirements.txt
dependency environment extract-data jupyter-notebooks pip project-setup python requirements-generator requirements-txt version
Last synced: 31 Jul 2025
https://github.com/spaceshaman/deckard
Extract structured data from unstructured text — no AI, just regular expressions. 🔍
data-extraction extract extract-data regex regular-expression
Last synced: 22 Aug 2025
https://github.com/zebbern/jsx
Chrome extension that collects all JavaScript (.js) links on your current webpage!
bug-bounty bug-bounty-hunting chrome-extension ctf-tools endpoints extract-data filter hackathon hacking-tools javascript js js-extract js-project links links-gatherer osint pentest
Last synced: 24 Aug 2025
https://github.com/dann-oliv/db_query_exporter
Script to access the desired database and export a spreadsheet of results for the given query.
Last synced: 27 Aug 2025
https://github.com/mistralys/x4-data-extractor
Batch file generator to extract X4 game files with the XRCatTool including DLC metadata.
extract-data unpacker x4foundations
Last synced: 30 Aug 2025
https://github.com/randomgamingdev/minecraft-asset-extractor
This repository teaches you how to extract data from Minecraft, such as texture packs and achievements, and provides tools for doing so
all-platform-supported all-platforms asset-management assets automated extract extract-data extractor fast minecraft minecraft-java minecraft-java-edition simple
Last synced: 25 Dec 2025
https://github.com/thee-unruly/optimal-character-recognition
Extracting info from documents / images
Last synced: 01 Sep 2025
https://github.com/arsalan-dev-engineer/runescape-news-scraping
RuneScape news and updates in a beautified table.
beautifulsoup beautifulsoup4 challange extract-data jagex project python requests runescape runescape3 url webscraping website
Last synced: 16 May 2025
https://github.com/athanclark/extractable-singleton
It's just a functor which has its stored value as isomorphic to Identity.
extract-data haskell singleton
Last synced: 28 Jun 2025
https://github.com/baikaresandip/node-extract-env-variables
This repo will extract the environment variables in the .env.example file of the repo.
environment environment-variables extract extract-data extraction node node-js nodejs npm scanner
Last synced: 03 Jul 2025
https://github.com/arthursilvadantas/extractjson
Web application for extracting information from a JSON file.
extract-data extract-json javascript js json
Last synced: 03 Jul 2025
https://github.com/manucabral/pysoccerdata
A Python package for extracting real-time soccer data from diverse online sources, providing essential statistics and insights.
extract-data football football-analytics football-data scraper soccer soccer-analytics soccer-data
Last synced: 27 Feb 2025
https://github.com/isogeo/doc-old-extractor
Gitbook content about the Isogeo data extractor. In sync with gitbook.com.
documentation extract-data gitbook open-data
Last synced: 11 Mar 2025
https://github.com/mmikhail2001/photo_analysis
Extracts Exif metadata from JPEG photos. Desktop application built with the C++ Qt framework.
binary-files exif extract-data jpeg oop patterns
Last synced: 16 Mar 2025
https://github.com/doarakko/japanese-company-extraction
This API extracts Japanese company names from text.
api extract-data japanese nlp python
Last synced: 07 Sep 2025
https://github.com/zuriel-hr/petojson
Extracts features from files in Portable Executable format into a JSON file
extract-data json malware-analysis portable-executable
Last synced: 08 Oct 2025
https://github.com/dann-oliv/query_results_exporter
Script to access the desired database and export a spreadsheet of results for the given query.
Last synced: 10 Oct 2025