An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with data-extraction

A curated list of projects in awesome lists tagged with data-extraction.

https://github.com/getmaxun/maxun

🔥 Open Source No Code Web Data Extraction Platform • Turn Websites To APIs & Spreadsheets With No-Code Robots In Minutes 🔥

agents api automation browser browser-automation data-extraction no-code no-code-web-scraper playwright robotic-process-automation rpa scraper self-hosted web-agent web-automation web-scraper web-scraping web-scraping-agent webscraping website-to-api

Last synced: 23 Jan 2026

https://github.com/vi3k6i5/flashtext

Extract keywords from a sentence or replace keywords in sentences.

data-extraction keyword-extraction nlp search-in-text word2vec

Last synced: 13 May 2025

https://github.com/D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 13 May 2025

https://github.com/d4vinci/scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 15 Feb 2026

https://github.com/jonathanlink/pdflayouttextstripper

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).

data-extraction extract java layout pdf pdfbox text

Last synced: 15 May 2025

https://github.com/JonathanLink/PDFLayoutTextStripper

Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf file for instance. This is a subclass of PDFTextStripper class (from the Apache PDFBox library).

data-extraction extract java layout pdf pdfbox text

Last synced: 15 Mar 2025

https://github.com/raznem/parsera

Lightweight library for scraping websites with LLMs

ai ai-scraping data-extraction llm opensource playwright python scraping webscraping

Last synced: 11 Apr 2025

https://github.com/thinh-vu/vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

data-extraction quantitative-analysis quantitative-finance quantitative-trading stock-market stock-screener

Last synced: 14 May 2025

https://github.com/eclaire-labs/eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

ai ai-assistant automation bookmark-manager bookmarks data-extraction document-processing llm local-first note-taking ocr on-device-ai open-source personal-knowledge-management privacy rest-api self-hosted task-management web-archiving

Last synced: 16 Jan 2026

https://github.com/a-maliarov/amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

amazon amazon-captcha amazon-scraper amazoncaptcha captcha captcha-solver data-extraction pillow python3 training-data

Last synced: 26 Mar 2025

https://github.com/molybdenum-99/infoboxer

Wikipedia information extraction library

data-extraction mediawiki wikipedia

Last synced: 05 Apr 2025

https://github.com/yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction

Last synced: 09 Mar 2026

https://github.com/dilawar/plotdigitizer

A Python utility to digitize plots.

data-extraction digitization image-processing python3

Last synced: 06 Apr 2025

https://github.com/ScrapeGraphAI/scrapecraft

🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.

ai automation data-extraction docker fastapi hacktoberfest langgraph python react scrapegraphai typescript web-scraping webscraping

Last synced: 25 Aug 2025

https://github.com/nfx/go-htmltable

Structured HTML table data extraction from URLs in Go that has almost no external dependencies

data-extraction go go-generics html

Last synced: 05 Apr 2025

https://github.com/tech-engine/goscrapy

GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.

data-extraction go-scrapy golang goscraper scrapy spider web-crawler webscraper webscrapping

Last synced: 18 Jan 2026

https://github.com/dav009/flash

Golang keyword extraction/replacement data structure using tries instead of regexes

data-extraction go golang search text text-search trie

Last synced: 30 Apr 2025

https://github.com/danburzo/hred

Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.

cli data-extraction html json xml

Last synced: 02 Apr 2025

https://github.com/html-extract/hext

Domain-specific language for extracting structured data from HTML documents

cpp data-extraction dsl html html-extraction node php python ruby scraping

Last synced: 15 Apr 2025

https://github.com/johnbumgarner/newshound

This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.

article-extracting article-extractor data-extraction data-mining data-science datascience news news-aggregator news-crawler newspaper-crawler python-newspaper python3 text-mining web-scraping webscraping

Last synced: 14 Jan 2026

https://github.com/extralit/extralit

Fast and accurate systemic data extraction with LLM assistance

data-extraction literature-review llm

Last synced: 14 Jan 2026

https://github.com/mhucka/taupe

Taupe takes a downloaded Twitter archive ZIP file, extracts the URLs corresponding to tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can use with other software tools.

archives comma-separated-values csv data-extraction markdown twitter twitter-archive twitter-archives url

Last synced: 14 Dec 2025

https://github.com/rubydamodar/protext-analyzer

ProText Analyzer is a powerful tool for extracting insights from text. It conducts sentiment analysis, categorizing content as positive, negative, or neutral, while also assessing readability and linguistic complexity. Ideal for businesses and researchers, it enhances understanding of textual data.

complex-word-definition data-cleaning-techniques data-extraction linguistic-complexity metrics-explained readability-analysis sentiment-analysis syllable-counting-methodology tokenization-process

Last synced: 17 Jul 2025

https://github.com/linw1995/data_extractor

Combine XPath, CSS Selectors and JSONPath for Web data extracting.

css-selectors data-extraction data-extractor jsonpath xpath

Last synced: 30 Jan 2026

https://github.com/imranr98/wealthsimpleton

A Python script that scrapes your Wealthsimple activity history and saves the data in a JSON file.

data-extraction data-ownership export python selenium selenium-webdriver wealthsimple web web-scraping

Last synced: 14 May 2025

https://github.com/shdev/phpflashtext

Extract keywords from a sentence or replace keywords in sentences. @ https://github.com/vi3k6i5/flashtext

data-analysis data-extraction flashtext keyword-extraction nlp php search-in-text string-manipulation string-matching word2vec

Last synced: 12 Jan 2026

https://github.com/biraj21/web-wanderer

A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.

data-extraction multithreading python web-crawler webcrawler

Last synced: 12 Jan 2026

https://github.com/quantumbytestudios/githubuserdataextractor

GitHubUserDataExtractor is a cross-platform Python tool designed to extract and display public GitHub user data both in the terminal and through a visual HTML dashboard. It provides a streamlined way to fetch a user’s profile, recent activity, and contribution statistics using GitHub’s REST API and external visualization services.

data-extraction data-extractor hack hack-tool hack-tools hacker-scripts hacker-tool hacking linux-tools python-tools tools

Last synced: 31 Jul 2025

https://github.com/Fabiopf02/ofx-data-extractor

A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser

banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx

Last synced: 11 Sep 2025

https://github.com/arkutils/arkutils-website

The source for the arkutils website, home of a few Ark: Survival Evolved and Ascended tools.

ark-survival-ascended ark-survival-evolved ark-survivial data-extraction game-tool

Last synced: 23 Jan 2026

https://github.com/robert-mcdermott/ollama-batch-cluster

Large Scale Batch Processing with Ollama

data-extraction gpu hpc-cluster llm ollama

Last synced: 12 Oct 2025

https://github.com/webmiddle/webmiddle

Node.js framework for modular web scraping and data extraction

data-extraction framework jsx jsx-components modular nodejs web-scraping

Last synced: 29 Oct 2025

https://github.com/aryanvbw/exif

ExifTool is a powerful command-line tool that can be used to extract and edit metadata in a wide range of media files, including images, audio, and video. Metadata is information that is stored within a file that describes the file’s content or other attributes.

aryan-technologies aryanshop aryanvbw data-extraction image-metadata image-processing images-hacking information-gathering powered-by-aryan-technologies vivek

Last synced: 24 Oct 2025

https://github.com/attogram/justrefs

Just Refs - extract just the references and related topics from any page on the English Wikipedia

data-extraction information-extraction wikipedia wikipedia-api wikipedia-scraper wikipedia-viewer

Last synced: 14 Apr 2025

https://github.com/u-c4n/u-transkript

U-Transkript is a powerful Python library for automatically extracting transcripts (subtitles) from YouTube videos and translating them into various languages using Google Gemini AI. It supports 50+ languages, offers flexible output formats (TXT, JSON, XML), and features an easy-to-use, chainable API. Ideal for education, research, content creation

ai data-extraction python subtitles transcript translation youtube youtube-api

Last synced: 01 Jul 2025

https://github.com/fabiopf02/ofx-data-extractor

A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser

banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx

Last synced: 10 Jul 2025

https://github.com/petrpatek/airbnb-scraper

Apify public actor for scraping Airbnb homes.

airbnb airbnb-api apify crawler data-extraction scrape

Last synced: 20 Mar 2025

https://github.com/chaitanyarahalkar/financial-info-extractor

Extract financial information in CSV format for companies compliant with the NSE

beautifulsoup csv-parser data-extraction data-scraping financial-data financial-services python selenium

Last synced: 17 Aug 2025

https://github.com/jakubjafra/stellaris-map-generation

Extracts geopolitical data from Stellaris save game files

data-extraction game-files game-modding stellaris stellaris-map-generation

Last synced: 13 May 2025

https://github.com/bluishwu/treeclip

TreeClip is a Chrome extension that offers several flexible ways to select page text (similar-element selection, point selection, box selection, text search), combined with hierarchy navigation, inner-element selection, hierarchy binding, and custom output formats, to make copying information from web pages much more efficient.

bulk-copy bulk-operation chrome-extension copy-paste data-extraction element-selection html text-selection treeclip web-tools

Last synced: 13 May 2025

https://github.com/rririanto/unstructured-demo-streamlit

Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own GPT and LLM projects using Unstructured.io via streamlit

ai data data-extraction gpt unstructured unstructured-data

Last synced: 09 Apr 2025

https://github.com/beautifulmoon211/onthemarket-scraping

Web scraping tool used to extract real estate information from OnTheMarket.com, a leading property portal in the United Kingdom.

cheerio data-extraction onthemarket onthemarket-scraper real-estate requests typescript web-scraper

Last synced: 13 Jun 2025

https://github.com/blalop/bbva2pandas

Extract the data from your BBVA monthly statements

bank bank-account bbva data-extraction extracted-data pandas

Last synced: 28 Apr 2025

https://github.com/venkat-0706/amazon-webscraper

An Amazon web scraper extracts product data like prices, reviews, and ratings using tools like BeautifulSoup or Scrapy, aiding in market research while adhering to ethical and legal guidelines.

api-and-data-parsing automation beautifulsoup data-extraction ethical-scraping python-programming webscraping

Last synced: 26 Jun 2025

https://github.com/geniuszly/genpythondoxing

GenPythonDoxing is a demo version of a Python-based tool designed for gathering publicly available information about email addresses, usernames, IP addresses, and Minecraft nicknames. It utilizes various APIs and web scraping techniques to collect data, providing a comprehensive view of online footprints.

cyber-investigation data-extraction data-mining dox doxing doxing-methods genpythondoxing information-gathering osint python python-doxing python-doxing-tool pythondoxing security-research

Last synced: 13 Apr 2025

https://github.com/Bisaloo/xlcutter

Parse Batches of 'xlsx' Files Based on a Template

data-extraction excel non-rectangular-data r r-package tidy-data

Last synced: 01 Apr 2025

https://github.com/bisaloo/xlcutter

Parse Batches of 'xlsx' Files Based on a Template

data-extraction excel non-rectangular-data r r-package tidy-data

Last synced: 12 May 2025

https://github.com/imranr98/instacartflation

A Python script that scrapes your Instacart order history and saves the data in a JSON file.

data-extraction data-ownership export instacart python selenium selenium-webdriver web web-scraping

Last synced: 14 May 2025

https://github.com/davidumoru/scryer

Transform web data into actionable knowledge

content-parsing data-extraction gemini-api google-gemini web-scraping

Last synced: 13 Aug 2025

https://github.com/lykmapipo/nyc-tlc-trip-data

Python scripts to download, process, and analyze the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset

apache-arrow apache-spark data data-engineering data-extraction data-transformation etl fsspec geopandas joblib jupyterlab lykmapipo metadata nyc nyc-taxi-dataset pandas pyarrow python s3

Last synced: 17 Sep 2025

https://github.com/os-climate/crrf-det

A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.

annotation data-extraction layout-analysis pdf table-extraction

Last synced: 12 Apr 2025

https://github.com/kalebu/worldmeter-coronavirus-scraper

A Python program that tracks coronavirus statistics based on the Worldometer website

beautifulsoup coronavirus data-extraction data-science python-tanzania tanzania webscraping worldmeter-coronavirus-scraper

Last synced: 08 May 2025

https://github.com/ghentcdh/taulu

Taulu is a Python package designed to segment tabular data in scanned or photographed documents.

data-extraction historic-documents htr ocr segmentation tabular-data

Last synced: 18 Jul 2025

https://github.com/pepe-god/dataprophet

Extracts citizens' identity information from MySQL, creates a family network based on TC ID No., and exports it to CSV

101m 109m adres data-analysis data-extraction database-connector family-tree genealogy gsm hsys identity mysql-database python-script pyton

Last synced: 13 Jul 2025

https://github.com/amirali104/text2excel

A GUI desktop application that can extract data from a text file and put them in an Excel or CSV file using regular expression (regex) patterns

automation csv data-extraction data-extractor data-processing excel openpyxl productivity-tool productivity-tools regex text-parsing text-processing text-to-excel tkinter tkinter-gui

Last synced: 04 Oct 2025

https://github.com/acuciureanu/js-maid

A rule-driven engine designed for seamless extraction of data from JavaScript files.

bugbounty-tool bugbountytips data-extraction javascript security-audit static-code-analyzer

Last synced: 09 Apr 2025

https://github.com/hyeonsangjeon/pdf2llm-tuning-studio

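A solution that automatically generates high-quality question-answer (QA) data from PDF documents using GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models with the LoRA technique.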

aws bedrock claude cuda data-argumantation data-extraction distillation docker finetuning gpu llm pdf-generation pdf-text-extraction processing processing-job sagemaker text-disti unsloth unstructured

Last synced: 15 Jun 2025

https://github.com/oxylabs/how-to-scrape-wayfair

A step-by-step tutorial on extracting data from Wayfair’s product pages at scale and in real time. The guide details actionable code and considers various aspects before and during the scraping process.

data-extraction how-to parsing python wayfair wayfair-scraper web-scraping

Last synced: 27 Sep 2025

https://github.com/danhilse/web-scraper

A versatile Python-based web scraper that extracts content from single URLs or entire sitemaps, organizing data into structured text files. Features include sitemap parsing, content grouping by URL structure, and an easy-to-use command-line interface. Ideal for data extraction, content analysis, and web research tasks.

beautifulsoup cli-tool data-extraction python sitemap-parser web-scraping

Last synced: 23 Apr 2025

https://github.com/hugcis/data_journalism_extractor

A tool for extracting and integrating data from heterogeneous data sources

data-extraction data-journalism flink information-retrieval journalism

Last synced: 01 Sep 2025