Projects in Awesome Lists tagged with extraction
A curated list of projects in awesome lists tagged with extraction.
https://github.com/axa-group/parsr
Transforms PDF, Documents and Images into Enriched Structured Data
data document extraction hacktoberfest images nlp ocr parsr pdf python typescript
Last synced: 13 May 2025
https://github.com/axa-group/Parsr
Transforms PDF, Documents and Images into Enriched Structured Data
data document extraction hacktoberfest images nlp ocr parsr pdf python typescript
Last synced: 13 Mar 2025
https://github.com/trusted-ai/adversarial-robustness-toolbox
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
adversarial-attacks adversarial-examples adversarial-machine-learning ai artificial-intelligence attack blue-team evasion extraction inference machine-learning poisoning privacy python red-team trusted-ai trustworthy-ai
Last synced: 13 May 2025
https://github.com/Trusted-AI/adversarial-robustness-toolbox
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
adversarial-attacks adversarial-examples adversarial-machine-learning ai artificial-intelligence attack blue-team evasion extraction inference machine-learning poisoning privacy python red-team trusted-ai trustworthy-ai
Last synced: 23 Mar 2025
https://github.com/google/mtail
extract internal monitoring data from application logs for collection in a timeseries database
bytecode calculator collector compiler extraction go instrumentation logs metrics monitoring mtail mtail-programs observability prometheus proxy timeseries vm
Last synced: 22 Oct 2025
https://github.com/aubio/aubio
a library for audio and music analysis
analysis annotation audio beat c extraction mfcc music onset pitch python sound tempo-tracking
Last synced: 14 May 2025
https://github.com/apache/tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
content extraction java metadata tika
Last synced: 09 Sep 2025
https://github.com/symfony/property-access
Provides functions to read and write from/to an object or array using a simple string notation
access array component extraction index injection object php property property-path reflection symfony symfony-component
Last synced: 25 Jan 2026
https://github.com/morkt/garbro
Visual Novels resource browser
audio extraction gui images reverse-engineering visual-novel
Last synced: 15 May 2025
https://github.com/onekey-sec/unblob
Extract files from any kind of container formats
archive compression extraction filesystem python
Last synced: 14 May 2025
https://github.com/dbashford/textract
node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
extract-text extraction nodejs
Last synced: 14 May 2025
https://github.com/chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
buffer covid-19 detection extraction memex mime nlp nlp-library nlp-machine-learning parse parser-interface python recognition text-extraction text-recognition tika-python tika-server tika-server-jar translation-interface usc
Last synced: 14 May 2025
https://github.com/langchain-ai/langchain-extract
🦜⛏️ Did you say you like data?
extraction extraction-data fastapi langchain langchain-python llm llms
Last synced: 16 May 2025
https://github.com/rikyoz/bit7z
A C++ static library offering a clean and simple interface to the 7-zip shared libraries.
7-zip 7z 7zip archives archives-metadata bzip2 c-plus-plus compression cpp cpp-library cross-platform encrypted-archives extraction gzip in-memory multi-volume-archives rar static-library tar zip
Last synced: 24 Feb 2026
https://github.com/Lattyware/unrpa
A program to extract files from the RPA archive format.
extraction python renpy rpa visual-novels
Last synced: 22 Jul 2025
https://github.com/philipperemy/stanford-openie-python
Stanford Open Information Extraction made simple!
extraction nlp python-wrapper stanford stanford-openie
Last synced: 16 May 2025
https://github.com/lattyware/unrpa
A program to extract files from the RPA archive format.
extraction python renpy rpa visual-novels
Last synced: 16 May 2025
https://github.com/bdbc-kg-nlp/ie-survey
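A survey of the information extraction field by the natural language processing research team at the Beihang University Advanced Innovation Center for Big Data. It covers subtasks such as entity recognition, relation extraction, and attribute extraction, and surveys both academia and industry for each subtask.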
Last synced: 22 Feb 2026
https://github.com/carlospuenteg/File-Injector
File Injector is a script that allows you to store any file in an image using steganography
extraction file file-injection file-injector files image image-manipulation image-processing injection noise numpy photography python python3 steganography storage
Last synced: 29 Mar 2025
https://github.com/rize/uritemplate
PHP URI Template (RFC 6570) supports both URI expansion & extraction
expansion extraction php rfc-6570 uri-template
Last synced: 23 May 2026
https://github.com/rize/UriTemplate
PHP URI Template (RFC 6570) supports both URI expansion & extraction
expansion extraction php rfc-6570 uri-template
Last synced: 11 Mar 2025
https://github.com/overtools/OWLib
DataTool is a program that lets you extract models, maps, and files from Overwatch.
blizzard blizzard-games blte casc csharp datatool extraction modeling ngdp overtools overwatch overwatch-2 tact
Last synced: 29 Mar 2025
https://github.com/nissl-lab/toxy
.net text extraction & export framework
dataset export extraction fileformats
Last synced: 14 May 2025
https://github.com/puddly/android-otp-extractor
Extracts OTP tokens from rooted Android devices
adb android extraction otp python totp
Last synced: 06 Apr 2025
https://github.com/nazuke/SEOMacroscope
SEO Macroscope is a website scanning tool, to check your website for broken links; including some technical SEO functionality, site scraping, Excel reporting, and more.
broken-links custom-filter duplicate-content extract-pdf-metadata extraction hreflang-checker hreflang-matrix link-checker scan-website seo seo-excel-report seo-macroscope seo-tools web-scraping webmaster
Last synced: 14 Apr 2025
https://github.com/robinst/autolink-java
Java library to extract links (URLs, email addresses) from plain text; fast, small and smart
autolink extraction java-library linkify links url
Last synced: 14 May 2025
https://github.com/thrau/jarchivelib
A simple archiving and compression library for Java
archiving compression extraction
Last synced: 04 Apr 2025
https://github.com/neelshah18/emot
Open source Emoticons and Emoji detection library: emot
detection emoji emoticons extraction python
Last synced: 13 Apr 2025
https://github.com/nazywam/autoit-ripper
Extract AutoIt scripts embedded in PE binaries
Last synced: 04 Apr 2025
https://github.com/bobld/tabula-sharp
Extract tables from PDF files (port of tabula-java)
csharp dotnet extract extract-table extracting-tables extraction extraction-engine netstandard pdf-table-extract pdf-table-extraction pdfparser pdfpig pdfs table table-extraction tabula tabula-java tabula-sharp
Last synced: 15 May 2025
https://github.com/DiegoCaraballo/Email-extractor
The main functionality is to extract all the emails from one or several URLs
email email-extractor email-marketing emails extraction python scraper scrapers scraping scraping-websites scrapper scrapping scrapy scrapy-spider spyder stractor
Last synced: 11 Jul 2025
https://github.com/assafmo/xioc
Extract indicators of compromise from text, including "escaped" ones.
command-line command-line-tool data-mining defang escaping extract extraction indicators-of-compromise ioc iocs regex regexp text-mining text-processing
Last synced: 22 Jun 2025
https://github.com/rossumai/docile
DocILE: Document Information Localization and Extraction Benchmark
benchmark document extraction information key kie kile understanding
Last synced: 12 Jan 2026
https://github.com/chatnoir-eu/chatnoir-resiliparse
A robust web archive analytics toolkit
bigdata cpp cython extraction htmlparser python warc web webarchive
Last synced: 04 Apr 2026
https://github.com/evyatarmeged/stegextract
Detect hidden files and text in images
bash capture-the-flag ctf extract-images extraction hidden-files images penetration-testing steg steganography stego
Last synced: 14 Apr 2025
https://github.com/MacPaw/XADMaster
Objective-C library for archive and file unarchiving and extraction
Last synced: 14 May 2025
https://github.com/macpaw/xadmaster
Objective-C library for archive and file unarchiving and extraction
Last synced: 21 Aug 2025
https://github.com/chrise96/3D_Ground_Segmentation
A ground segmentation algorithm for 3D point clouds based on the work described in “Fast segmentation of 3D point clouds: a paradigm on LIDAR data for Autonomous Vehicle Applications”, D. Zermas, I. Izzat and N. Papanikolopoulos, 2017. Distinguish between road and non-road points. Road surface extraction. Plane fit ground filter
cpp extraction ground ground-segmentation lastools lidar non-ground point-cloud preprocessing road-surface
Last synced: 19 Mar 2025
https://github.com/bdbc-kg-nlp/covid-19-tracker
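The research team at the Beihang University Advanced Innovation Center for Big Data collected and organized the data sources, then used natural language processing and related techniques to extract structured information from the publicly released trajectories of 4,626 confirmed patients nationwide: basic information (sex, age, place of residence, occupation, Wuhan/Hubei contact history, etc.), trajectories (time, place, means of transport, events), and relationships between patients.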
covid-19 extraction nlp tracking visualization
Last synced: 04 Mar 2026
https://github.com/philipperemy/stanford-ner-python
Stanford Named Entity Recognizer (NER) - Python Wrapper
extraction named-entity-recognition nlp python-wrapper stanford stanford-ner
Last synced: 18 Sep 2025
https://github.com/rse/extraction
Tree Extraction for JavaScript Object Graphs
dsl extraction javascript json query tree
Last synced: 19 Apr 2025
https://github.com/skblaz/rakun2
RaKUn 2.0 - A fast keyword detection algorithm
extraction information-retrieval keyphrase keyphrase-extraction keyphrases keyword-extraction keywords keywords-extraction library multilingual natural-language natural-language-processing nlp nlp-keywords-extraction nlp-library nlp-machine-learning python scalable-machine-learning unsupervised-learning
Last synced: 05 Apr 2025
https://github.com/cisco-talos/locky
analysis extraction locky malware ransom unpacker
Last synced: 09 Apr 2025
https://github.com/ckorzen/pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
arxiv benchmark evaluation extraction pdf tex text-extraction
Last synced: 11 May 2025
https://github.com/freelawproject/doctor
A microservice for document conversion at scale
document extraction ffmpeg ocr pdf
Last synced: 09 Feb 2026
https://github.com/imperialcollegelondon/pnextract
Pore network extraction from micro-CT images of porous media
Last synced: 07 Apr 2025
https://github.com/xyntopia/pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python
Last synced: 11 May 2025
https://github.com/uintdev/discord-cache-dump
Dump Discord's cache and identify files
cache cross-platform-desktop discord dump extraction golang linux macos windows
Last synced: 13 Apr 2025
https://github.com/uintdev/Discord-Cache-Dump
Dump Discord's cache and identify files
cache cross-platform-desktop discord dump extraction golang linux macos windows
Last synced: 21 Sep 2025
https://github.com/josuemtzmo/trackeddy
Tracking eddy algorithm:
eddies eddy extraction ocean oceanic-eddies
Last synced: 17 Jan 2026
https://github.com/chrisvwn/Rnightlights
R package to extract data from satellite nightlights.
data dmsp-ols extraction nightlights noaa package r satellite snpp-viirs
Last synced: 13 Jul 2025
https://github.com/aphp/edspdf
EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to use rule- or machine-learning-based approaches to classify text blocks between body and metadata.
extraction machine-learning pdf
Last synced: 24 Oct 2025
https://github.com/exoquery/decomat
Deconstructive Pattern-Matching for Kotlin
extraction kotlin kotlin-library pattern-matching scala
Last synced: 02 Sep 2025
https://github.com/borderless/unfurl
Extract rich metadata from URLs
content extraction html json-ld metadata microdata rdf rdfa scraper
Last synced: 12 Mar 2026
https://github.com/croqaz/a-extractor
Article content extraction database
database extraction readability
Last synced: 25 Apr 2025
https://github.com/shahules786/twitter-emotions
NLP tool to extract emotional phrases from tweets 🤩
docker extraction huggingface nlp pytorch sentiment
Last synced: 30 Jul 2025
https://github.com/adamyaxley/unformat
Fastest type-safe parsing library in the world for C++14 or C++17 (up to 300x faster than std::regex)
cpp14 cpp17 extraction formatting header-only parse parser parsing parsing-library string unformat
Last synced: 11 Apr 2025
https://github.com/infobyte/draytek-arsenal
Reverse Engineering and Observability toolkit for Draytek firewalls
extraction firmware modification reverse-engineering
Last synced: 22 Jul 2025
https://github.com/documentatom/documentatom
DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.
ai chunk chunking etl extraction extraction-transformation-and-loading parse parser semantic
Last synced: 31 Oct 2025
https://github.com/psolbach/metadoc
Aviation grade news article metadata extraction
extraction metadata news nlp perceptron
Last synced: 21 Aug 2025
https://github.com/webfactory/zauberlehrling
Collection of tools and ideas for splitting up big monolithic PHP applications into smaller parts.
assets composer database extraction files microservice monolith mysql packages php tables
Last synced: 13 Apr 2025
https://github.com/golift/xtractr
Go Library for Queuing and Extracting Archives: Rar, Zip, 7zip, Gz, Tar, Tgz, Bz2, Tbz2
7zip bz2 decompress extracter extraction extraction-library golang-library golang-module gzip rar rar-files tar zip zip-files
Last synced: 08 Mar 2026
https://github.com/Anonyfox/rake-js
A pure JS implementation of the Rapid Automated Keyword Extraction (RAKE) algorithm.
auto-tagging classification extraction keyword keywords rake tag tags
Last synced: 17 Jul 2025
https://github.com/anonyfox/rake-js
A pure JS implementation of the Rapid Automated Keyword Extraction (RAKE) algorithm.
auto-tagging classification extraction keyword keywords rake tag tags
Last synced: 07 Mar 2026
https://github.com/bobld/camelot-sharp
A C# library to extract tabular data from PDFs (port of camelot Python version using PdfPig).
camelot camelot-sharp csharp dotnet extract-table extracting-tables extraction extraction-engine netstandard opencv pdf-table-extract pdf-table-extraction pdfparser pdfpig pdfs table table-extraction
Last synced: 14 Jun 2025
https://github.com/Systemcluster/wrappe
Packer for creating self-contained single-binary applications from executables and directories. Distribute your application without the need for an installer, with smaller file size and faster startup than many alternatives 📦
command-line-tool compression cross-platform extraction packer rust
Last synced: 16 Jul 2025
https://github.com/agenty/browser-automation-api
Browser automation API for repetitive web-based tasks, with a friendly user interface. You can use it to scrape content or do many other things, such as capture a screenshot, generate a PDF, extract content, or execute custom Puppeteer or Playwright functions.
api browser-automation extraction nodejs pdf playwright puppeteer scraping screenshot webscraping
Last synced: 12 Apr 2025
https://github.com/molbie/outlaw
JSON mapper for macOS, iOS, tvOS, and watchOS
extraction ios json macos mapper marshal swift tvos watchos
Last synced: 21 Oct 2025
https://github.com/hbish/smex
A blazing fast CLI application that processes sitemaps in golang.
cli cross-platform csv extraction go-cli golang golang-library json seotools sitemap sitemap-extractor sitemap-parser
Last synced: 10 Jul 2025
https://github.com/woojubb/html-article-extractor
A web page content extractor
article-extracting article-extractor crawler crawling extraction extractor
Last synced: 24 Dec 2025
https://github.com/puntorigen/ti_recover
Appcelerator Titanium APK source code recovery tool
apk appcelerator decompiler extraction titanium titanium-alloy
Last synced: 09 Jul 2025
https://github.com/smx-smx/wcpex
A tool to extract Windows Manifest files that can be found in the WinSxS folder
binary delta extraction manifest-files tool wcp windows winsxs
Last synced: 14 Apr 2025
https://github.com/uditkarode/ucc
🖥 Compile and run programs through the TurboC Compiler without having to use the TurboC IDE or intricately fabricated DOS commands. Made out of frustration sometime in my high school days.
cli command-line extraction linux students turboc turbocpp ucc ucc-workspace
Last synced: 11 Apr 2025
https://github.com/esipfed/eskg
Earth Science Knowledge Graph - An Automatic Approach to Building Earth Science Knowledge Graph to Improve Data Discovery
earth-science esip esip-lab extraction knowledge-discovery knowledge-graph semantic-data semantic-web
Last synced: 13 Aug 2025
https://github.com/Ryota-Kawamura/Functions-Tools-and-Agents-with-LangChain
You’ll explore new advancements like ChatGPT’s function calling capability, and build a conversational agent using a new syntax called LangChain Expression Language (LCEL) for tasks like tagging, extraction, tool selection, and routing.
api conversational-agent extraction langchain langchain-expression-language lcel llm openai-function tagging
Last synced: 11 Sep 2025
https://github.com/esteinig/scrubby
Host depletion optimised for clinical metagenomic sequencing applications 🐼
alignment background bioinformatics depletion extraction host kraken metagenomics rust taxonomy
Last synced: 10 Apr 2025
https://github.com/dotfurther/OpenDiscoverSDK
.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.
archive csharp dotnet email embedded-objects entity-extraction extraction file-deduplication file-format-detection file-identification indexing metadata microsoft-office phi pii pii-detection pst sdk text text-extraction
Last synced: 12 Apr 2025
https://github.com/bucanero/libun7zip
A library that provides 7-Zip (.7z) archive handling and extraction on PS3, PS4, and PS Vita
7z 7zip compression-library extraction ps3 ps4lib un7zip
Last synced: 10 Apr 2025
https://github.com/kielx/anygrabber
Simplify AnyDesk log analysis by effortlessly searching, extracting, and generating reports on IP addresses and login dates.
anydesk extraction extractor grab grabber logs python
Last synced: 19 Mar 2025
https://github.com/zelon88/xpress
xPress File archiver and extractor
archive compression compression-algorithm decompression experimental extraction extractor python
Last synced: 06 May 2025
https://github.com/yeonghyeon/lung_extraction_from_cxr
Lung Extraction from Chest X-ray for Efficient Computing
computing deep-learning efficient extraction lung nih residual-networks
Last synced: 26 Apr 2025
https://github.com/decisionfacts/df-extract
DF Extract Lib
asyncio document-parser docx extraction jpeg jpg pdf png pptx python3
Last synced: 24 Apr 2025
https://github.com/rtymchyk/babel-plugin-extract-text
Babel plugin to extract strings from React components and gettext-like functions into a gettext PO file.
babel babel-plugin extraction gettext i18n internationalization js parser react translation
Last synced: 23 Aug 2025
https://github.com/hboisgibault/unicontent
Python module to extract structured metadata from URL, ISBN or DOI
doi extraction google-books isbn metadata open-graph python url
Last synced: 06 Apr 2026
https://github.com/yagoluiz/meuremedio-extracao
[PT-BR] Extraction of medicine price data made available by ANVISA
Last synced: 15 Jul 2025
https://github.com/au-cobra/coq-rust-extraction
Coq plugin for extracting Rust code
Last synced: 25 Oct 2025
https://github.com/DFKI/leechcrawler
Incremental crawling capabilities for Apache Tika. Crawl content out of e.g. file systems, http(s) sources (webcrawling) imap(s) servers or your own arbitrary data sources. LeechCrawler offers additional Tika parsers providing these crawling capabilities.
crawling extraction incremental metadata tika
Last synced: 01 Feb 2026
https://github.com/jacksongoode/nime-proceedings-analyzer
A tool for the bibliographic analysis of the NIME proceedings archive
analysis bibliometric extraction grobid nime proceedings
Last synced: 05 Apr 2025
https://github.com/lamba92/pinsir
PINSIR, or Person Identification Network Stack for Identity Recognition, is a scalable open source end to end solution for face detection and identity recognition.
comparison detection docker extraction face-detection grpc identity-recognition keras kotlin kotlin-multiplatform microservice neural-networks tensorflow
Last synced: 23 Apr 2025
https://github.com/wcampbell0x2a/librarium
Library and binaries for the reading, creating, and modification of cpio
cpio extraction firmware modification rust
Last synced: 29 Oct 2025
https://github.com/uudigitalhumanitieslab/perfectextractor
Extracting present perfects (and related forms) from parallel corpora
extraction parallel-corpus xpath
Last synced: 25 Jul 2025
https://github.com/mrodrig/deeks
Retrieve all keys and nested keys from objects and arrays of objects.
deep document extraction hacktoberfest javascript json key object parser
Last synced: 29 Jul 2025
https://github.com/pducks32/pailead
A palette generating and extraction Swift library for macOS, iOS, and Linux
extraction palette palette-library swatches swift
Last synced: 19 Feb 2026
https://github.com/andreas-aeschlimann/gabor
Demo web application for Gabor filters
extraction fft filters fourier fourier-transform gabor ifft image image-processing processing recognition transform
Last synced: 13 Mar 2025
https://github.com/valaphee/protod
Protobuf Decompiler
extract extraction extractor kotlin protobuf protobuf-definitions protobuf-java protocol-buffers
Last synced: 31 Aug 2025
https://github.com/datasciencecampus/readpyne
Toolkit for extracting relevant lines from receipts or similar image data.
dsc-projects extraction ocr receipts research
Last synced: 18 Mar 2025