An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with extraction

A curated list of projects in awesome lists tagged with extraction.

https://github.com/axa-group/parsr

Transforms PDF, Documents and Images into Enriched Structured Data

data document extraction hacktoberfest images nlp ocr parsr pdf python typescript

Last synced: 13 May 2025

https://github.com/axa-group/Parsr

Transforms PDF, Documents and Images into Enriched Structured Data

data document extraction hacktoberfest images nlp ocr parsr pdf python typescript

Last synced: 13 Mar 2025

https://github.com/trusted-ai/adversarial-robustness-toolbox

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

adversarial-attacks adversarial-examples adversarial-machine-learning ai artificial-intelligence attack blue-team evasion extraction inference machine-learning poisoning privacy python red-team trusted-ai trustworthy-ai

Last synced: 13 May 2025

https://github.com/Trusted-AI/adversarial-robustness-toolbox

Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams

adversarial-attacks adversarial-examples adversarial-machine-learning ai artificial-intelligence attack blue-team evasion extraction inference machine-learning poisoning privacy python red-team trusted-ai trustworthy-ai

Last synced: 23 Mar 2025

https://github.com/google/mtail

extract internal monitoring data from application logs for collection in a timeseries database

bytecode calculator collector compiler extraction go instrumentation logs metrics monitoring mtail mtail-programs observability prometheus proxy timeseries vm

Last synced: 22 Oct 2025

https://github.com/aubio/aubio

a library for audio and music analysis

analysis annotation audio beat c extraction mfcc music onset pitch python sound tempo-tracking

Last synced: 14 May 2025

https://github.com/apache/tika

The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).

content extraction java metadata tika

Last synced: 09 Sep 2025

https://github.com/symfony/property-access

Provides functions to read and write from/to an object or array using a simple string notation

access array component extraction index injection object php property property-path reflection symfony symfony-component

Last synced: 25 Jan 2026

https://github.com/morkt/garbro

Visual Novels resource browser

audio extraction gui images reverse-engineering visual-novel

Last synced: 15 May 2025

https://github.com/onekey-sec/unblob

Extract files from any kind of container formats

archive compression extraction filesystem python

Last synced: 14 May 2025

https://github.com/dbashford/textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

extract-text extraction nodejs

Last synced: 14 May 2025

https://github.com/chrismattmann/tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

buffer covid-19 detection extraction memex mime nlp nlp-library nlp-machine-learning parse parser-interface python recognition text-extraction text-recognition tika-python tika-server tika-server-jar translation-interface usc

Last synced: 14 May 2025

https://github.com/Lattyware/unrpa

A program to extract files from the RPA archive format.

extraction python renpy rpa visual-novels

Last synced: 22 Jul 2025

https://github.com/philipperemy/stanford-openie-python

Stanford Open Information Extraction made simple!

extraction nlp python-wrapper stanford stanford-openie

Last synced: 16 May 2025

https://github.com/lattyware/unrpa

A program to extract files from the RPA archive format.

extraction python renpy rpa visual-novels

Last synced: 16 May 2025

https://github.com/bdbc-kg-nlp/ie-survey

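A survey of the information extraction field by the natural language processing research team at Beihang University's Advanced Innovation Center for Big Data. It covers subtasks such as entity recognition, relation extraction, and attribute extraction, and surveys both academia and industry for each subtask.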

extraction nlp survey

Last synced: 22 Feb 2026

https://github.com/carlospuenteg/File-Injector

File Injector is a script that allows you to store any file in an image using steganography

extraction file file-injection file-injector files image image-manipulation image-processing injection noise numpy photography python python3 steganography storage

Last synced: 29 Mar 2025

https://github.com/rize/uritemplate

PHP URI Template (RFC 6570) supports both URI expansion & extraction

expansion extraction php rfc-6570 uri-template

Last synced: 23 May 2026

https://github.com/rize/UriTemplate

PHP URI Template (RFC 6570) supports both URI expansion & extraction

expansion extraction php rfc-6570 uri-template

Last synced: 11 Mar 2025

https://github.com/overtools/OWLib

DataTool is a program that lets you extract models, maps, and files from Overwatch.

blizzard blizzard-games blte casc csharp datatool extraction modeling ngdp overtools overwatch overwatch-2 tact

Last synced: 29 Mar 2025

https://github.com/nissl-lab/toxy

.net text extraction & export framework

dataset export extraction fileformats

Last synced: 14 May 2025

https://github.com/puddly/android-otp-extractor

Extracts OTP tokens from rooted Android devices

adb android extraction otp python totp

Last synced: 06 Apr 2025

https://github.com/nazuke/SEOMacroscope

SEO Macroscope is a website scanning tool that checks your website for broken links and includes some technical SEO functionality, site scraping, Excel reporting, and more.

broken-links custom-filter duplicate-content extract-pdf-metadata extraction hreflang-checker hreflang-matrix link-checker scan-website seo seo-excel-report seo-macroscope seo-tools web-scraping webmaster

Last synced: 14 Apr 2025

https://github.com/robinst/autolink-java

Java library to extract links (URLs, email addresses) from plain text; fast, small and smart

autolink extraction java-library linkify links url

Last synced: 14 May 2025

https://github.com/thrau/jarchivelib

A simple archiving and compression library for Java

archiving compression extraction

Last synced: 04 Apr 2025

https://github.com/neelshah18/emot

Open source Emoticons and Emoji detection library: emot

detection emoji emoticons extraction python

Last synced: 13 Apr 2025

https://github.com/nazywam/autoit-ripper

Extract AutoIt scripts embedded in PE binaries

autoit extraction malware

Last synced: 04 Apr 2025

https://github.com/DiegoCaraballo/Email-extractor

The main functionality is to extract all the emails from one or several URLs

email email-extractor email-marketing emails extraction python scraper scrapers scraping scraping-websites scrapper scrapping scrapy scrapy-spider spyder stractor

Last synced: 11 Jul 2025

https://github.com/rossumai/docile

DocILE: Document Information Localization and Extraction Benchmark

benchmark document extraction information key kie kile understanding

Last synced: 12 Jan 2026

https://github.com/chatnoir-eu/chatnoir-resiliparse

A robust web archive analytics toolkit

bigdata cpp cython extraction htmlparser python warc web webarchive

Last synced: 04 Apr 2026

https://github.com/MacPaw/XADMaster

Objective-C library for archive and file unarchiving and extraction

extraction unar unarchiver

Last synced: 14 May 2025

https://github.com/macpaw/xadmaster

Objective-C library for archive and file unarchiving and extraction

extraction unar unarchiver

Last synced: 21 Aug 2025

https://github.com/chrise96/3D_Ground_Segmentation

A ground segmentation algorithm for 3D point clouds based on the work described in “Fast segmentation of 3D point clouds: a paradigm on LIDAR data for Autonomous Vehicle Applications”, D. Zermas, I. Izzat and N. Papanikolopoulos, 2017. Distinguish between road and non-road points. Road surface extraction. Plane fit ground filter

cpp extraction ground ground-segmentation lastools lidar non-ground point-cloud preprocessing road-surface

Last synced: 19 Mar 2025

https://github.com/usc-isi-i2/etk

Extraction Toolkit

extraction

Last synced: 26 Jun 2025

https://github.com/bdbc-kg-nlp/covid-19-tracker

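The research team at Beihang University's Advanced Innovation Center for Big Data collected and organized the data sources, then used natural language processing and other techniques to extract structured information from the publicly released trajectories of 4,626 confirmed patients nationwide: basic information (gender, age, place of residence, occupation, Wuhan/Hubei contact history, etc.), trajectories (time, location, means of transport, events), and relationships between patients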

covid-19 extraction nlp tracking visualization

Last synced: 04 Mar 2026

https://github.com/philipperemy/stanford-ner-python

Stanford Named Entity Recognizer (NER) - Python Wrapper

extraction named-entity-recognition nlp python-wrapper stanford stanford-ner

Last synced: 18 Sep 2025

https://github.com/rse/extraction

Tree Extraction for JavaScript Object Graphs

dsl extraction javascript json query tree

Last synced: 19 Apr 2025

https://github.com/ckorzen/pdf-text-extraction-benchmark

A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.

arxiv benchmark evaluation extraction pdf tex text-extraction

Last synced: 11 May 2025

https://github.com/freelawproject/doctor

A microservice for document conversion at scale

document extraction ffmpeg ocr pdf

Last synced: 09 Feb 2026

https://github.com/imperialcollegelondon/pnextract

Pore network extraction from micro-CT images of porous media

extraction pore-network

Last synced: 07 Apr 2025

https://github.com/xyntopia/pydoxtools

Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python

Last synced: 11 May 2025

https://github.com/josuemtzmo/trackeddy

Tracking eddy algorithm:

eddies eddy extraction ocean oceanic-eddies

Last synced: 17 Jan 2026

https://github.com/chrisvwn/Rnightlights

R package to extract data from satellite nightlights.

data dmsp-ols extraction nightlights noaa package r satellite snpp-viirs

Last synced: 13 Jul 2025

https://github.com/aphp/edspdf

EDS-PDF is a generic, pure-Python framework for text extraction from PDF documents. It provides the machinery to use rule- or machine-learning-based approaches to classify text blocks between body and metadata.

extraction machine-learning pdf

Last synced: 24 Oct 2025

https://github.com/exoquery/decomat

Deconstructive Pattern-Matching for Kotlin

extraction kotlin kotlin-library pattern-matching scala

Last synced: 02 Sep 2025

https://github.com/borderless/unfurl

Extract rich metadata from URLs

content extraction html json-ld metadata microdata rdf rdfa scraper

Last synced: 12 Mar 2026

https://github.com/croqaz/a-extractor

Article content extraction database

database extraction readability

Last synced: 25 Apr 2025

https://github.com/shahules786/twitter-emotions

NLP tool to extract emotional phrases from tweets 🤩

docker extraction huggingface nlp pytorch sentiment

Last synced: 30 Jul 2025

https://github.com/loyd/readability.rs

Really fast readability

dom extraction html text

Last synced: 10 Apr 2025

https://github.com/adamyaxley/unformat

Fastest type-safe parsing library in the world for C++14 or C++17 (up to 300x faster than std::regex)

cpp14 cpp17 extraction formatting header-only parse parser parsing parsing-library string unformat

Last synced: 11 Apr 2025

https://github.com/infobyte/draytek-arsenal

Reverse Engineering and Observability toolkit for Draytek firewalls

extraction firmware modification reverse-engineering

Last synced: 22 Jul 2025

https://github.com/documentatom/documentatom

DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.

ai chunk chunking etl extraction extraction-transformation-and-loading parse parser semantic

Last synced: 31 Oct 2025

https://github.com/psolbach/metadoc

Aviation grade news article metadata extraction

extraction metadata news nlp perceptron

Last synced: 21 Aug 2025

https://github.com/webfactory/zauberlehrling

Collection of tools and ideas for splitting up big monolithic PHP applications into smaller parts.

assets composer database extraction files microservice monolith mysql packages php tables

Last synced: 13 Apr 2025

https://github.com/golift/xtractr

Go Library for Queuing and Extracting Archives: Rar, Zip, 7zip, Gz, Tar, Tgz, Bz2, Tbz2

7zip bz2 decompress extracter extraction extraction-library golang-library golang-module gzip rar rar-files tar zip zip-files

Last synced: 08 Mar 2026

https://github.com/Anonyfox/rake-js

A pure JS implementation of the Rapid Automated Keyword Extraction (RAKE) algorithm.

auto-tagging classification extraction keyword keywords rake tag tags

Last synced: 17 Jul 2025

https://github.com/anonyfox/rake-js

A pure JS implementation of the Rapid Automated Keyword Extraction (RAKE) algorithm.

auto-tagging classification extraction keyword keywords rake tag tags

Last synced: 07 Mar 2026

https://github.com/lysxia/coq-simple-io

IO for Gallina

coq extraction ocaml

Last synced: 27 Jul 2025

https://github.com/Systemcluster/wrappe

Packer for creating self-contained single-binary applications from executables and directories. Distribute your application without the need for an installer, with smaller file size and faster startup than many alternatives 📦

command-line-tool compression cross-platform extraction packer rust

Last synced: 16 Jul 2025

https://github.com/agenty/browser-automation-api

Browser automation API for repetitive web-based tasks, with a friendly user interface. You can use it to scrape content or do many other things, such as capture a screenshot, generate a PDF, extract content, or execute custom Puppeteer or Playwright functions.

api browser-automation extraction nodejs pdf playwright puppeteer scraping screenshot webscraping

Last synced: 12 Apr 2025

https://github.com/molbie/outlaw

JSON mapper for macOS, iOS, tvOS, and watchOS

extraction ios json macos mapper marshal swift tvos watchos

Last synced: 21 Oct 2025

https://github.com/hbish/smex

A blazing fast CLI application that processes sitemaps in golang.

cli cross-platform csv extraction go-cli golang golang-library json seotools sitemap sitemap-extractor sitemap-parser

Last synced: 10 Jul 2025

https://github.com/puntorigen/ti_recover

Appcelerator Titanium APK source code recovery tool

apk appcelerator decompiler extraction titanium titanium-alloy

Last synced: 09 Jul 2025

https://github.com/smx-smx/wcpex

A tool to extract Windows Manifest files that can be found in the WinSxS folder

binary delta extraction manifest-files tool wcp windows winsxs

Last synced: 14 Apr 2025

https://github.com/uditkarode/ucc

🖥 Compile and run programs through the TurboC Compiler without having to use the TurboC IDE or intricately fabricated DOS commands. Made out of frustration sometime in my high school days.

cli command-line extraction linux students turboc turbocpp ucc ucc-workspace

Last synced: 11 Apr 2025

https://github.com/esipfed/eskg

Earth Science Knowledge Graph - An Automatic Approach to Building Earth Science Knowledge Graph to Improve Data Discovery

earth-science esip esip-lab extraction knowledge-discovery knowledge-graph semantic-data semantic-web

Last synced: 13 Aug 2025

https://github.com/Ryota-Kawamura/Functions-Tools-and-Agents-with-LangChain

You’ll explore new advancements like ChatGPT’s function calling capability, and build a conversational agent using a new syntax called LangChain Expression Language (LCEL) for tasks like tagging, extraction, tool selection, and routing.

api conversational-agent extraction langchain langchain-expression-language lcel llm openai-function tagging

Last synced: 11 Sep 2025

https://github.com/esteinig/scrubby

Host depletion optimised for clinical metagenomic sequencing applications 🐼

alignment background bioinformatics depletion extraction host kraken metagenomics rust taxonomy

Last synced: 10 Apr 2025

https://github.com/dotfurther/OpenDiscoverSDK

.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.

archive csharp dotnet email embedded-objects entity-extraction extraction file-deduplication file-format-detection file-identification indexing metadata microsoft-office phi pii pii-detection pst sdk text text-extraction

Last synced: 12 Apr 2025

https://github.com/bucanero/libun7zip

A library that provides 7-Zip (.7z) archive handling and extraction on PS3, PS4, and PS Vita

7z 7zip compression-library extraction ps3 ps4lib un7zip

Last synced: 10 Apr 2025

https://github.com/kielx/anygrabber

Simplify AnyDesk log analysis by effortlessly searching, extracting, and generating reports on IP addresses and login dates.

anydesk extraction extractor grab grabber logs python

Last synced: 19 Mar 2025

https://github.com/yeonghyeon/lung_extraction_from_cxr

Lung Extraction from Chest X-ray for Efficient Computing

computing deep-learning efficient extraction lung nih residual-networks

Last synced: 26 Apr 2025

https://github.com/planio-gmbh/plaintext

This gem wraps command line tools to extract plain text from typical files, such as PDF and common office formats.

cv doc docx extract extraction files fulltext odt office pdf ppt pptx rtf ruby ruby-on-rails xsl xslt

Last synced: 11 Nov 2025

https://github.com/rtymchyk/babel-plugin-extract-text

Babel plugin to extract strings from React components and gettext-like functions into a gettext PO file.

babel babel-plugin extraction gettext i18n internationalization js parser react translation

Last synced: 23 Aug 2025

https://github.com/hboisgibault/unicontent

Python module to extract structured metadata from URL, ISBN or DOI

doi extraction google-books isbn metadata open-graph python url

Last synced: 06 Apr 2026

https://github.com/yagoluiz/meuremedio-extracao

[PT-BR] Extraction of drug price data made available by ANVISA

data extraction python3

Last synced: 15 Jul 2025

https://github.com/au-cobra/coq-rust-extraction

Coq plugin for extracting Rust code

coq extraction rust

Last synced: 25 Oct 2025

https://github.com/DFKI/leechcrawler

Incremental crawling capabilities for Apache Tika. Crawl content out of e.g. file systems, http(s) sources (webcrawling) imap(s) servers or your own arbitrary data sources. LeechCrawler offers additional Tika parsers providing these crawling capabilities.

crawling extraction incremental metadata tika

Last synced: 01 Feb 2026

https://github.com/jacksongoode/nime-proceedings-analyzer

A tool for the bibliographic analysis of the NIME proceedings archive

analysis bibliometric extraction grobid nime proceedings

Last synced: 05 Apr 2025

https://github.com/lamba92/pinsir

PINSIR, or Person Identification Network Stack for Identity Recognition, is a scalable open source end to end solution for face detection and identity recognition.

comparison detection docker extraction face-detection grpc identity-recognition keras kotlin kotlin-multiplatform microservice neural-networks tensorflow

Last synced: 23 Apr 2025

https://github.com/wcampbell0x2a/librarium

Library and binaries for the reading, creating, and modification of cpio

cpio extraction firmware modification rust

Last synced: 29 Oct 2025

https://github.com/uudigitalhumanitieslab/perfectextractor

Extracting present perfects (and related forms) from parallel corpora

extraction parallel-corpus xpath

Last synced: 25 Jul 2025

https://github.com/mrodrig/deeks

Retrieve all keys and nested keys from objects and arrays of objects.

deep document extraction hacktoberfest javascript json key object parser

Last synced: 29 Jul 2025

https://github.com/pducks32/pailead

A palette generating and extraction Swift library for macOS, iOS, and Linux

extraction palette palette-library swatches swift

Last synced: 19 Feb 2026

https://github.com/datasciencecampus/readpyne

Toolkit for extracting relevant lines from receipts or similar image data.

dsc-projects extraction ocr receipts research

Last synced: 18 Mar 2025