Projects in Awesome Lists tagged with extract-text
A curated list of projects in awesome lists tagged with extract-text.
https://github.com/dbashford/textract
Node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
extract-text extraction nodejs
Last synced: 14 May 2025
https://github.com/pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
extract-text language-model machine-learning ocr parsr pd3f pdf pdf-to-text pipeline python text-extraction
Last synced: 08 Apr 2026
https://github.com/ropensci/fulltext
⚠️ ARCHIVED ⚠️ Search across and get full text for OA & closed journals
crossref extract-text metadata open-access pdf r r-package rstats text-ming xml
Last synced: 15 Mar 2025
https://github.com/ropensci-archive/fulltext
⚠️ ARCHIVED ⚠️ Search across and get full text for OA & closed journals
crossref extract-text metadata open-access pdf r r-package rstats text-ming xml
Last synced: 14 Dec 2025
https://github.com/opensemanticsearch/open-semantic-etl
Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction & named entity recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elasticsearch index & linked data graph database
annotation documents elasticsearch enrichment etl extract extract-information extract-text extractor ingest ingestion-pipeline ingests-documents named-entity-recognition nlp ocr pdf python rdf solr solr-dataimporter
Last synced: 06 Apr 2025
https://github.com/kevm/tikaondotnet
Use the Java Tika text extraction library on the .NET platform
Last synced: 27 Jan 2026
https://github.com/ahmedkhemiri95/PDFs-TextExtract
Text extraction from multiple and large PDF documents.
data-science extract-text parser pdf pdf-document pdf-processing pdfminer pdfs pdfs-textextract pypdf2 python text-analytics
Last synced: 04 Apr 2025
https://github.com/lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.
cat cross-platform docx2txt extract-text go golang odt2txt pdf2txt pdftotext rtf-to-text text-extraction textextracting
Last synced: 25 Jun 2025
https://github.com/ropensci/antiword
R wrapper for the antiword utility
antiword extract-text r r-package rstats
Last synced: 17 Oct 2025
https://github.com/ApryseSDK/pdftron-document-search
Build search across multiple documents client-side in your file storage
algolia-instantsearch extract-text seach-documents search-office-text search-pdf
Last synced: 14 Mar 2025
https://github.com/aprysesdk/pdftron-document-search
Build search across multiple documents client-side in your file storage
algolia-instantsearch extract-text seach-documents search-office-text search-pdf
Last synced: 17 Aug 2025
https://github.com/lifegpc/msg-tool
Tools for exporting and importing scripts
bgi circus ethornell extract-text extractor galgame kirikiri
Last synced: 09 May 2026
https://github.com/maxim2266/ocr
A collection of tools for OCR (optical character recognition).
bash-script c extract-text linux ocr ocr-recognition tesseract
Last synced: 21 Feb 2026
https://github.com/rlayers/pawpaw
Text Processing & Segmentation Framework
extract-text hierarchical-text-segmentation information-extraction knowledge-graph lexer natural-language-processing nlp parser python query-engine query-language text-processing text-segmentation tree xml-parser xmlparser
Last synced: 08 Apr 2026
https://github.com/bhattbhavesh91/google-vision-api-for-ocr-demo
Repo containing a small demo that extracts text from images via OCR using the Google Vision API in Python
demo extract-text google-ocr google-vision google-vision-api image-ocr python
Last synced: 17 Apr 2025
https://github.com/devmehq/extract-text
Node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!
extract-text extractor ocr pdf tessaract tesseract-ocr
Last synced: 17 Jan 2026
https://github.com/greed2411/tokyo
tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.
apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing
Last synced: 07 May 2025
https://github.com/shelfio/tika-text-extract
Extract text from a document by Apache Tika
apache-tika extract-text node-module npm-package tika
Last synced: 19 Jul 2025
https://github.com/whuppi/pdf_manipulator
Cross-platform PDF manipulation for Flutter & Dart. Merge, split, render, extract text, search, sign, encrypt, validate, convert, build from scratch. Off the main thread on every platform including web.
compress dart decrypt encrypt extract-text flutter images-to-pdf merge pdf render reorder rotate rust search sign split
Last synced: 14 Jun 2026
https://github.com/rpg-maker-translation-tools/rvpacker-txt-rs-lib
Library for extracting text from RPG Maker files.
extract extract-text library rpg-maker rpg-maker-mv rpg-maker-mz rpg-maker-vx rpg-maker-vxace rpg-maker-xp rust rust-crate rust-library
Last synced: 19 Apr 2026
https://github.com/saidsef/tika-document-to-text
Apache Tika: extract text and metadata from any document format with this pre-built containerised solution. Kubernetes-ready deployment with intuitive UI, API, and text-to-speech capabilities - perfect for content indexing, analysis, and document processing workflows
docker-container document-to-text document-to-text-ui extract-text helm-chart kubernetes kubernetes-deployment nodejs python text-extraction text-to-speech
Last synced: 02 Apr 2026
https://github.com/basemax/extractword
Extract word(s) from the lines of the file.
extract extract-data extract-information extract-text extraction extractor php replace replace-text text-processing text-processor webform
Last synced: 13 Jun 2025
https://github.com/emmeryn/hocr-turtletext
A gem that parses positional text from hOCR output and provides convenience methods to find text.
extract-text gem hocr ruby-on-rails
Last synced: 08 Oct 2025
https://github.com/nanamare/ocr-android
Sample OCR using OpenCV (just a toy project), since 2017.
extract-text ocr ocr-android opencv text-mining
Last synced: 12 May 2026
https://github.com/zeeshanahmad4/nlp-pdf-minning-extracting-text-from-pdf
NLP PDF mining: extracting text from PDF
extract-text pdf pdf-converter pdf-document-processor pdf-files pdf-format pdf-text-extraction pdfcon pdfkit pdftohtml pdftoimage pdftools pdftotext python text-extraction
Last synced: 01 Apr 2025
https://github.com/majd-kontar/pdf-highlight-extractor
extract-text highlight pdf python
Last synced: 17 Apr 2026
https://github.com/defi0x1/scan-document
Extract text from images using Pytesseract
extract-text image-to-text pytesseract-ocr tesseract
Last synced: 26 Jun 2025
https://github.com/jalal246/corename
Automatically extracts packages root name for monorepos
corename extract-data extract-information extract-text extracts get-info monorepo package-development package-json package-management production read-json utility
Last synced: 26 Mar 2025
https://github.com/djsudduth/joplin-plugin-paragraph-extractor
Extract specific paragraphs out of Joplin notes using keywords, hashtags or custom tags similar to Logseq block references. Also, refresh extracted notes if source notes change.
block-references extract-text joplin joplin-plugin note-taking notes-app notetaking plugin
Last synced: 14 Feb 2026
https://github.com/kingpin707/pdf-highlight-extractor
A Python tool for extracting highlighted text from PDF files while preserving formatting attributes (headers, bold, italic) and removing unwanted line breaks and page breaks. Perfect for integrating with content management systems.
ai21labs ebook-reader extract-highlights extract-text faiss-backend highlight-color kindle kindle-clippings koreader markdown mobi pdf-converter python remarkable-tablet
Last synced: 03 May 2026
https://github.com/hstanleycrow/easyphparticleextractor
Free PHP library to extract the main content from an article post or news post, including images and HTML
extract-article extract-content extract-text extract-website extraction extractor php php-library website
Last synced: 31 Jul 2025
https://github.com/islom-pardaboyev/image_to_text_converter
A React-based web app that extracts text from images using Tesseract.js. Upload an image, and the app will process it automatically. Supports manual text extraction as well. 🚀
extract-text image-to-text react react-icons sooner tailwindcss
Last synced: 10 May 2026
https://github.com/basemax/smartfilter
Smart filtering to keep or remove characters or words in a text. (SOON)
extract extract-data extract-features extract-information extract-text extraction extractive-summarization extractor php split splitter splitting text text-analysis text-analytics text-analyzer text-mining
Last synced: 02 May 2026
https://github.com/loganlinn/copy-text-of-selected-area-shortcut
Apple Shortcut to copy text of selected area (screenshot) to clipboard
extract-text ios macos screenshot shortcuts shortcuts-app
Last synced: 18 Jun 2025