Projects in Awesome Lists tagged with data-extraction

https://github.com/firecrawl/firecrawl

The API to search, scrape, and interact with the web at scale. 🔥

ai ai-agents ai-crawler ai-scraping ai-search crawler data-extraction html-to-markdown llm markdown scraper scraping web-crawler web-data web-data-extraction web-scraper web-scraping web-search webscraping

Last synced: 06 Jul 2026

https://github.com/getmaxun/maxun

🔥 Open Source No Code Web Data Extraction Platform • Turn Websites To APIs & Spreadsheets With No-Code Robots In Minutes 🔥

agents api automation browser browser-automation data-extraction no-code no-code-web-scraper playwright robotic-process-automation rpa scraper self-hosted web-agent web-automation web-scraper web-scraping web-scraping-agent webscraping website-to-api

Last synced: 23 Jan 2026

https://github.com/zipstack/unstract

LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows

api-deployments data-extraction document-processing etl-pipelines open-source-data-pipeline unstructured-data-extraction

Last synced: 13 May 2026

https://github.com/vi3k6i5/flashtext

Extract keywords from sentences or replace keywords in sentences.

data-extraction keyword-extraction nlp search-in-text word2vec

Last synced: 13 May 2025

https://github.com/D4Vinci/Scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 13 May 2025

https://github.com/browser-act/skills

Browser automation CLI built for AI agents. Break through anti-bot walls, hand off to humans across platforms when stuck. Parallel multi-task execution, independent multi-session operation, isolated multi-account browsing.

ai-agents automation claude-cli claude-code claude-code-skills claude-skills codex codex-cli codex-skill cursor data-extraction no-code openclaw openclaw-cli openclaw-skill openclaw-skills web-data-extraction web-scraping web-scraping-api

Last synced: 28 Jun 2026

https://github.com/d4vinci/scrapling

🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!

ai ai-scraping automation crawler crawling crawling-python data data-extraction hacktoberfest playwright python python3 scraping selectors stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 15 Feb 2026

https://github.com/brightdata/brightdata-mcp

A powerful Model Context Protocol (MCP) server that provides an all-in-one solution for public web access.

ai-agents ai-integrations anti-bot-detection browser-automation data-collection data-extraction llm mcp mcp-server modelcontextprotocol scraping scraping-tools structured-data web-crawling web-data web-scraping

Last synced: 16 Jan 2026

https://github.com/saifyxpro/headlessx

A lightweight, self-hosted headless browser automation platform. Designed as an alternative to Browserless, built for speed, privacy, and scalability.

automation automation-api automation-platform browser-automation browser-testing browserless chrome-headless chromedriver container-automation data-extraction headless headless-chrome headless-service playwright playwright-automation puppeteer scraping-service web-automation web-scraping

Last synced: 31 Jan 2026

https://github.com/jonathanlink/pdflayouttextstripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful for extracting the content of a table in a PDF file, for instance. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

data-extraction extract java layout pdf pdfbox text

Last synced: 15 May 2025

https://github.com/JonathanLink/PDFLayoutTextStripper

Converts a PDF file into a text file while keeping the layout of the original PDF. Useful for extracting the content of a table in a PDF file, for instance. This is a subclass of the PDFTextStripper class (from the Apache PDFBox library).

data-extraction extract java layout pdf pdfbox text

Last synced: 15 Mar 2025

https://github.com/hi-primus/optimus

🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark

big-data-cleaning bigdata cudf dask dask-cudf data-analysis data-cleaner data-cleaning data-cleansing data-exploration data-extraction data-preparation data-profiling data-science data-transformation data-wrangling machine-learning pyspark spark

Last synced: 14 May 2025

https://github.com/raznem/parsera

Lightweight library for scraping websites with LLMs

ai ai-scraping data-extraction llm opensource playwright python scraping webscraping

Last synced: 11 Apr 2025

https://github.com/thinh-vu/vnstock

A beginner-friendly yet powerful Python toolkit for financial analysis and automation — built to make modern investing accessible to everyone

data-extraction quantitative-analysis quantitative-finance quantitative-trading stock-market stock-screener

Last synced: 14 May 2025

https://github.com/eclaire-labs/eclaire

Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.

ai ai-assistant automation bookmark-manager bookmarks data-extraction document-processing llm local-first note-taking ocr on-device-ai open-source personal-knowledge-management privacy rest-api self-hosted task-management web-archiving

Last synced: 16 Jan 2026

https://github.com/yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction

Last synced: 13 May 2026

https://github.com/polyrabbit/hacker-news-digest

📰 Let ChatGPT Summarize Hacker News for You

chatgpt chatgpt-api crawler data-extraction extract-summaries hacker-news hacker-news-digest hacker-news-reader machine-learning news-aggregator openai openai-api python rss spider

Last synced: 15 May 2025

https://github.com/adrienjoly/npm-pdfreader

🚜 Parse text and tables from PDF files.

data-extraction javascript parse-tables parsing pdf-converter pdf-reader rule-based-parsing tabular-data

Last synced: 14 May 2025

https://github.com/a-maliarov/amazoncaptcha

Pure Python, lightweight, Pillow-based solver for Amazon's text captcha.

amazon amazon-captcha amazon-scraper amazoncaptcha captcha captcha-solver data-extraction pillow python3 training-data

Last synced: 26 Mar 2025

https://github.com/0xMassi/webclaw

Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.

ai ai-agents ai-scraping cli crawler data-extraction html-to-markdown llm markdown mcp mcp-server rust scraper self-hosted tls-fingerprinting web-crawler web-extraction web-scraper web-scraping webscraping

Last synced: 04 Apr 2026

https://github.com/shcherbak-ai/contextgem

ContextGem: Effortless LLM extraction from documents

ai contract-analysis data-extraction document-intelligence docx docx2md docx2txt generative-ai legaltech llm llm-extraction llm-framework llm-pipeline llms nlp prompt-engineering text-analysis unstructured-data

Last synced: 13 May 2025

https://github.com/py-pdf/benchmarks

Benchmarking PDF libraries

benchmark data-extraction mupdf pdf poppler-utils pypdf2 text-extraction

Last synced: 28 Jul 2025

https://github.com/serpapi/clauneck

A tool for scraping emails, social media accounts, and much more information from websites using Google Search Results.

automation command-line command-line-tool data-extraction data-extractor email email-extract-with-proxy email-extraction email-extractor email-marketing email-scraper open-source ruby rubygem serp social-media-scraper web-crawling webscraping

Last synced: 06 Apr 2025

https://github.com/molybdenum-99/infoboxer

Wikipedia information extraction library

data-extraction mediawiki wikipedia

Last synced: 05 Apr 2025

https://github.com/bzsanti/oxidizePdf

A PDF library for Rust

crates-io data-extraction digital-signatures document-processing encryption invoice ocr pdf pdf-generation pdf-library pdf-manipulation pdf-parser pdf-reader pdfa rust rust-library table-extraction text-extraction

Last synced: 29 Apr 2026

https://github.com/sypht-team/sypht-python-client

A Python client for the Sypht API

api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser pdf-parser python python3 python3-library receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-python-client

Last synced: 11 Jul 2025

https://github.com/mrshu/github-statuses

The "Missing GitHub Status Page" -- a Flat Data attempt at historically documenting GitHub statuses

data-extraction flat-data github ner open-data status status-page uptime

Last synced: 09 Apr 2026

https://github.com/dilawar/plotdigitizer

A Python utility to digitize plots.

data-extraction digitization image-processing python3

Last synced: 06 Apr 2025

https://github.com/ScrapeGraphAI/scrapecraft

🤖 AI-powered web scraping editor with visual workflow builder. Build, test & deploy web scrapers using natural language. Powered by ScrapeGraphAI & LangGraph.

ai automation data-extraction docker fastapi hacktoberfest langgraph python react scrapegraphai typescript web-scraping webscraping

Last synced: 25 Aug 2025

https://github.com/nfx/go-htmltable

Structured HTML table data extraction from URLs in Go that has almost no external dependencies

data-extraction go go-generics html

Last synced: 05 Apr 2025

https://github.com/villagecomputing/superpipe

Superpipe - optimized LLM pipelines for structured data

classification data-extraction data-labeling llm llm-evaluation llm-optimization structured-data

Last synced: 04 Apr 2026

https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision

Line segmentation algorithm for Google Vision API.

data-extraction google-vision invoice proposed-algorithm segmentation

Last synced: 25 Jun 2025

https://github.com/tech-engine/goscrapy

GoScrapy: Harnessing Go's power for blazingly fast web scraping, inspired by Python's Scrapy framework.

data-extraction go-scrapy golang goscraper scrapy spider web-crawler webscraper webscrapping

Last synced: 18 Jan 2026

https://github.com/dav009/flash

Golang Keyword extraction/replacement Datastructure using Tries instead of regexes

data-extraction go golang search text text-search trie

Last synced: 30 Apr 2025

https://github.com/sypht-team/sypht-java-client

A Java client for the Sypht API

api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-retrieval information-retrieval-engine invoice invoice-parser java java8 pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-java-client

Last synced: 10 Apr 2025

https://github.com/danburzo/hred

Reduce HTML and XML to JSON from the command line, using an expressive query language inspired by CSS selectors.

cli data-extraction html json xml

Last synced: 02 Apr 2025

https://github.com/us/crw

Fast, lightweight Firecrawl alternative in Rust. Web scraper, crawler & search API with MCP server for AI agents. Drop-in Firecrawl-compatible API (/v1/scrape, /v1/crawl, /v1/search). 2.3x faster than Tavily, 1.5x faster than Firecrawl in 1K-URL benchmarks. 6 MB RAM, single binary. Self-host or use managed cloud.

ai ai-agents crawler data-extraction docker firecrawl firecrawl-alternative html-to-markdown llm markdown mcp mcp-server rust scraping-api self-hosted tavily-alternative web-crawler web-scraper web-scraping web-search-api

Last synced: 09 May 2026

https://github.com/wetransfer/format_parser

file metadata parsing, done cheap

aiff arw cr2 data-extraction exif file-parser flac format-parsers gif jpeg moov mp3 mpg ogg pdf png ruby tiff wav zip

Last synced: 04 Oct 2025

https://github.com/WeTransfer/format_parser

file metadata parsing, done cheap

aiff arw cr2 data-extraction exif file-parser flac format-parsers gif jpeg moov mp3 mpg ogg pdf png ruby tiff wav zip

Last synced: 16 Jul 2025

https://github.com/html-extract/hext

Domain-specific language for extracting structured data from HTML documents

cpp data-extraction dsl html html-extraction node php python ruby scraping

Last synced: 15 Apr 2025

https://github.com/xquik-dev/x-twitter-scraper

X (Twitter) data platform skill for AI coding agents. 122 REST API endpoints, 2 MCP tools, 23 extraction types, HMAC webhooks. Reads from $0.00015/call - 33x cheaper than the official X API. Works with Claude Code, Cursor, Codex, Copilot, Windsurf & 40+ agents.

ai-agent automation cheap-api claude-code codex cursor data-extraction giveaway mcp mcp-server monitoring pay-per-use rest-api scraper skills social-media twitter twitter-api webhooks x-api

Last synced: 10 May 2026

https://github.com/stabrise/spark-pdf

PDF DataSource for Apache Spark

big-data data-engineering data-extraction data-science ocr ocr-recognition pdf pdf-document pdf-document-processor spark spark-datasource tesseract tesseract-ocr

Last synced: 09 Apr 2025

https://github.com/articdive/articdata

Collection of data extracted from Minecraft.

data data-extraction data-mining java json mc minecraft minecraft-data minecraft-server minecraft-servers registry

Last synced: 17 May 2025

https://github.com/duriantaco/jonq

Query JSON with SQL-like syntax. A readable jq alternative that generates pure jq under the hood. Table, CSV, YAML output. Interactive REPL. Pipes from curl, streams NDJSON logs.

cli command-line-tools csv data-extraction jq jq-alternative json json-parser json-processor json-query log-analysis ndjson python sql yaml

Last synced: 29 Apr 2026

https://github.com/serpapi/google-search-results-java

Google Search Results Java API via SerpApi

data-extraction data-scraping java java-api json serp-api serpapi web-scraping webscraping

Last synced: 09 Jul 2025

https://github.com/Articdive/ArticData

Collection of data extracted from Minecraft.

data data-extraction data-mining java json mc minecraft minecraft-data minecraft-server minecraft-servers registry

Last synced: 08 May 2025

https://github.com/hddevteam/smart-form-filler

AI-powered form filling and data extraction browser extension

ai-powered browser-extension chrome-extension data-extraction demo-site developer-tools form-automation form-filling github-pages gpt-integration javascript local-ai machine-learning nodejs ollama privacy-focused productivity-tools web-automation

Last synced: 14 Feb 2026

https://github.com/sypht-team/sypht-golang-client

A Golang client for the Sypht API

api-client data-extraction document-capture extract extract-data-from-pdf extract-fields go golang golang-library golang-package invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-golang-client

Last synced: 20 Oct 2025

https://github.com/johnbumgarner/newshound

This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world in over 50 languages.

article-extracting article-extractor data-extraction data-mining data-science datascience news news-aggregator news-crawler newspaper-crawler python-newspaper python3 text-mining web-scraping webscraping

Last synced: 14 Jan 2026

https://github.com/extralit/extralit

Fast and accurate systemic data extraction with LLM assistance

data-extraction literature-review llm

Last synced: 14 Jan 2026

https://github.com/mhucka/taupe

Taupe takes a downloaded Twitter archive ZIP file, extracts the URLs corresponding to tweets, retweets, replies, quote tweets, and liked tweets, and outputs the results in a comma-separated values (CSV) format that you can use with other software tools.

archives comma-separated-values csv data-extraction markdown twitter twitter-archive twitter-archives url

Last synced: 14 Dec 2025

https://github.com/rubydamodar/protext-analyzer

ProText Analyzer is a powerful tool for extracting insights from text. It conducts sentiment analysis, categorizing content as positive, negative, or neutral, while also assessing readability and linguistic complexity. Ideal for businesses and researchers, it enhances understanding of textual data.

complex-word-definition data-cleaning-techniques data-extraction linguistic-complexity metrics-explained readability-analysis sentiment-analysis syllable-counting-methodology tokenization-process

Last synced: 17 Jul 2025

https://github.com/linw1995/data_extractor

Combine XPath, CSS Selectors and JSONPath for Web data extracting.

css-selectors data-extraction data-extractor jsonpath xpath

Last synced: 30 Jan 2026

https://github.com/gambolputty/wiktionary-de-parser

Extract data from German Wiktionary XML files.

data-extraction dewiktionary german german-language nlp wiktionary wiktionary-dump wiktionary-parser

Last synced: 14 Jan 2026

https://github.com/Xquik-dev/tweetclaw

Post tweets, reply, like, retweet, follow, DM & more from OpenClaw. Full X/Twitter automation via Xquik — 120 endpoints, reads from $0.00015/call (66x cheaper than official X API). 2 tools, 2 commands, background event poller.

ai-agent automation cheap-api data-extraction giveaway mcp-server openclaw openclaw-plugin pay-per-use skills social-media tweet tweetclaw twitter twitter-api twitter-automation x x-api xquik

Last synced: 26 May 2026

https://github.com/pim97/scrappey-wrapper-python

An API wrapper for Scrappey.com written in Python (cloudflare, datadome bypass & solver)

akamai anti-bot-api captcha captcha-solver cloudflare-anti-bot cloudflare-bypass data-extraction datadome incapsula perimetex queue-it scraping-framework scraping-library scraping-service scraping-tool shape web-data-extration web-scraping web-scraping-solution

Last synced: 12 Jan 2026

https://github.com/nextkore/smartmuv

An EVM-compatible Solidity Smart Contract Storage/Slot Analyzer and Data Extractor.

blockchain-explorer code-analysis data-exploration data-extraction ethereum ethereum-blockchain explorer migrate scanner smart-contracts solidity static-analysis storage storage-analysis tracker upgrade

Last synced: 17 Jul 2025

https://github.com/cpl/exodus

Data exfiltration using DNS

data-extraction dns dns-client dns-exfiltration dns-server exfiltration firewall-bypass security-tools

Last synced: 16 Jan 2026

https://github.com/imranr98/wealthsimpleton

A Python script that scrapes your Wealthsimple activity history and saves the data in a JSON file.

data-extraction data-ownership export python selenium selenium-webdriver wealthsimple web web-scraping

Last synced: 14 May 2025

https://github.com/biraj21/web-wanderer

A multi-threaded web crawler written in Python, utilizing ThreadPoolExecutor and Playwright to efficiently crawl dynamically rendered web pages and download them.

data-extraction multithreading python web-crawler webcrawler

Last synced: 12 Jan 2026

https://github.com/shdev/phpflashtext

Extract keywords from sentences or replace keywords in sentences. @ https://github.com/vi3k6i5/flashtext

data-analysis data-extraction flashtext keyword-extraction nlp php search-in-text string-manipulation string-matching word2vec

Last synced: 12 Jan 2026

https://github.com/quantumbytestudios/githubuserdataextractor

GitHubUserDataExtractor is a cross-platform Python tool designed to extract and display public GitHub user data both in the terminal and through a visual HTML dashboard. It provides a streamlined way to fetch a user’s profile, recent activity, and contribution statistics using GitHub’s REST API and external visualization services.

data-extraction data-extractor hack hack-tool hack-tools hacker-scripts hacker-tool hacking linux-tools python-tools tools

Last synced: 31 Jul 2025

https://github.com/Fabiopf02/ofx-data-extractor

A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser

banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx

Last synced: 11 Sep 2025

https://github.com/arkutils/arkutils-website

The source for the arkutils website, home of a few Ark: Survival Evolved and Ascended tools.

ark-survival-ascended ark-survival-evolved ark-survivial data-extraction game-tool

Last synced: 23 Jan 2026

https://github.com/NextKore/SmartMuv

An EVM-compatible Solidity Smart Contract Storage/Slot Analyzer and Data Extractor.

blockchain-explorer code-analysis data-exploration data-extraction ethereum ethereum-blockchain explorer migrate scanner smart-contracts solidity static-analysis storage storage-analysis tracker upgrade

Last synced: 12 Aug 2025

https://github.com/masurii/fbscrapeideas

Modern CLI tool for scraping & analyzing Facebook groups using Playwright & Gemini AI. Features self-healing selectors, session security, and local offline analysis.

academic-research ai cli data-extraction data-mining facebook-scraper gemini-api idea-generation nlp python selenium text-analysis

Last synced: 28 Apr 2026

https://github.com/robert-mcdermott/ollama-batch-cluster

Large Scale Batch Processing with Ollama

data-extraction gpu hpc-cluster llm ollama

Last synced: 06 Apr 2026

https://github.com/webmiddle/webmiddle

Node.js framework for modular web scraping and data extraction

data-extraction framework jsx jsx-components modular nodejs web-scraping

Last synced: 29 Oct 2025

https://github.com/sypht-team/sypht-node-client

A Node.js client for the Sypht API

api-client data-extraction document-capture extract extract-data-from-pdf extract-fields invoice invoice-parser node node-module nodejs nodejs-client pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-node-client

Last synced: 13 Apr 2025

https://github.com/attogram/justrefs

Just Refs - extract just the references and related topics from any page on the English Wikipedia

data-extraction information-extraction wikipedia wikipedia-api wikipedia-scraper wikipedia-viewer

Last synced: 14 Apr 2025

https://github.com/aryanvbw/exif

ExifTool is a powerful command-line tool that can be used to extract and edit metadata in a wide range of media files, including images, audio, and video. Metadata is information that is stored within a file that describes the file’s content or other attributes.

aryan-technologies aryanshop aryanvbw data-extraction image-metadata image-processing images-hacking information-gathering powered-by-aryan-technologies vivek

Last synced: 24 Oct 2025

https://github.com/u-c4n/u-transkript

U-Transkript is a powerful Python library for automatically extracting transcripts (subtitles) from YouTube videos and translating them into various languages using Google Gemini AI. It supports 50+ languages, offers flexible output formats (TXT, JSON, XML), and features an easy-to-use, chainable API. Ideal for education, research, content creation

ai data-extraction python subtitles transcript translation youtube youtube-api

Last synced: 01 Jul 2025

https://github.com/irfanalidv/trustpilot_scraper

A Python library for scraping Trustpilot reviews.

beautifulsoup data-collection data-extraction etl-pipeline review-scraper text-mining trustpilot web-scraping-python

Last synced: 14 Jan 2026

https://github.com/siveci/javdb_magnet_spider

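A Python-based automated crawler for JavDB magnet links. Uses curl_cffi to mimic browser TLS fingerprints and bypass the Cloudflare firewall. Supports scraping multi-page lists and, based on tags such as "uncensored/Chinese subtitles/HD" and on file size, automatically filters the best magnet links and exports them to a CSV file.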

crawler data-extraction javdb magnet-links python python3 scraper spider

Last synced: 06 Jun 2026

https://github.com/fabiopf02/ofx-data-extractor

A module written in TypeScript that provides a utility to extract data from an OFX file in Node.js and Browser

banking data-extraction financial no-dependencies ofx ofx-js ofx-json ofx-parser open-financial-exchange parser qfx

Last synced: 10 Jul 2025

https://github.com/petrpatek/airbnb-scraper

Apify public actor for scraping Airbnb homes.

airbnb airbnb-api apify crawler data-extraction scrape

Last synced: 20 Mar 2025

https://github.com/sypht-team/sypht-kotlin-client

A Kotlin client for the Sypht API

api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser kotlin kotlin-android kotlin-library pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-kotlin-client

Last synced: 27 Jul 2025

https://github.com/kehvinbehvin/json-mcp-filter

JSON MCP server to filter only relevant data for your LLM

claude-mcp data data-extraction data-filtering json json-analysis json-filter json-mcp-server json-parser json-schema-inference json-to-typescript json-utilities large-files mcp mcp-server query type-generation

Last synced: 07 Sep 2025

https://github.com/chaitanyarahalkar/financial-info-extractor

Extract financial information in CSV format for companies compliant to the NSE

beautifulsoup csv-parser data-extraction data-scraping financial-data financial-services python selenium

Last synced: 17 Aug 2025

https://github.com/deadbits/trs

🔭 Threat report analysis via LLM and Vector DB

data-extraction detection-engineering large-language-models llm llm-prompting openai prompt-engineering summarization threat-intelligence

Last synced: 14 Apr 2025

https://github.com/jakubjafra/stellaris-map-generation

Extracts geopolitical data from Stellaris save game files

data-extraction game-files game-modding stellaris stellaris-map-generation

Last synced: 13 May 2025

https://github.com/bluishwu/treeclip

TreeClip is a Chrome extension that offers several flexible ways to select page text (similar-element selection, point selection, box selection, text search), combined with hierarchy navigation, inner-element selection, hierarchy binding, and custom output formats, to make copying information from web pages much more efficient.

bulk-copy bulk-operation chrome-extension copy-paste data-extraction element-selection html text-selection treeclip web-tools

Last synced: 13 May 2025

https://github.com/rririanto/unstructured-demo-streamlit

Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own GPT and LLM projects using Unstructured.io via streamlit

ai data data-extraction gpt unstructured unstructured-data

Last synced: 09 Apr 2025

https://github.com/bisaloo/xlcutter

Parse Batches of 'xlsx' Files Based on a Template

data-extraction excel non-rectangular-data r r-package tidy-data

Last synced: 12 May 2025

https://github.com/beautifulmoon211/onthemarket-scraping

Web scraping tool used to extract real estate information from OnTheMarket.com, a leading property portal in the United Kingdom.

cheerio data-extraction onthemarket onthemarket-scraper real-estate requests typescript web-scraper

Last synced: 13 Jun 2025

https://github.com/ksm26/function-calling-and-data-extraction-with-llms

Master the techniques of function-calling and structured data extraction with LLMs. Learn to enhance LLM capabilities, integrate web services, and build practical applications for real-world data usability.

advanced-workflows ai-integration custom-functionality customer-service-transcripts data-analysis data-extraction end-to-end-applications function-calling llms natural-language-processing openapi practical-implementation structured-data web-services-integration

Last synced: 01 May 2026

https://github.com/geniuszly/genpythondoxing

GenPythonDoxing is a demo version of a Python-based tool designed for gathering publicly available information about email addresses, usernames, IP addresses, and Minecraft nicknames. It utilizes various APIs and web scraping techniques to collect data, providing a comprehensive view of online footprints.

cyber-investigation data-extraction data-mining dox doxing doxing-methods genpythondoxing information-gathering osint python python-doxing python-doxing-tool pythondoxing security-research

Last synced: 13 Apr 2025

https://github.com/aurumz-rgb/ReviewAid

AI-Driven Full-Text Screening and Data Extraction for Systematic Reviews and Evidence Synthesis

academic-research ai-assistant ai-tool data-extraction literature-review medical-research open-source python research-ai research-automation research-tool screening-tool streamlit systematic-reviews

Last synced: 19 Apr 2026

https://github.com/Bisaloo/xlcutter

Parse Batches of 'xlsx' Files Based on a Template

data-extraction excel non-rectangular-data r r-package tidy-data

Last synced: 01 Apr 2025

https://github.com/xarantolus/jsonextract

Go package for finding and extracting any JavaScript object (not just JSON) from an io.Reader

data-extraction javascript javascript-array javascript-extractor javascript-object javascript-object-scraper javascript-scraper javascript-web-scraper json json-extractor json-search web-json-extract web-scraping

Last synced: 11 Oct 2025

https://github.com/milahu/reverse-template-engine

Find a template for many similar HTML files

data-extraction grammar-generation grammar-generator parser-generator reverse-template reverse-template-engine schema-generation schema-generator structured-data-extraction structured-text template-generator template-induction tree-automata-induction

Last synced: 14 Apr 2025

https://github.com/blalop/bbva2pandas

Extract the data from your BBVA monthly statements

bank bank-account bbva data-extraction extracted-data pandas

Last synced: 28 Apr 2025

https://github.com/desininja/voice-disorder

Data Science project. ML algorithms to detect voice disorders.

accuracy algorithms classification classification-algorithm classifier classifier-model data data-extraction data-mining data-science health machine-learning smote voice-disorder

Last synced: 30 Apr 2025

https://github.com/venkat-0706/amazon-webscraper

An Amazon web scraper that extracts product data such as prices, reviews, and ratings using tools like BeautifulSoup or Scrapy, aiding market research while adhering to ethical and legal guidelines.

api-and-data-parsing automation beautifulsoup data-extraction ethical-scraping python-programming webscraping

Last synced: 26 Jun 2025

https://github.com/sypht-team/sypht-csharp-client

A C# / .NET client for the Sypht API

api-client data-extraction document-capture dot-net dotnet dotnet-cli dotnet-library extract extract-data-from-pdf extract-fields invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-csharp-client

Last synced: 13 Apr 2025

https://github.com/ExceptionRegret/Kryfto

The open-source web-browsing backend for AI agents & workflow engines. Ships a 42-tool MCP server for Claude Code/Cursor/Codex, a full REST API for n8n/Zapier/Make, federated multi-engine search, anti-bot stealth, and enterprise infrastructure (Postgres, Redis, BullMQ, MinIO). Self-host for $5/mo flat

ai-agents anti-detection claude-code codex cursor data-extraction developer-tools fastapi headless-browser mcp mcp-server n8n open-source playwright redis search-engine self-hosted stealth web-scraping workflow-automation