An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with extract-text

A curated list of projects in awesome lists tagged with extract-text.

https://github.com/dbashford/textract

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

extract-text extraction nodejs

Last synced: 14 May 2025

https://github.com/pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

extract-text language-model machine-learning ocr parsr pd3f pdf pdf-to-text pipeline python text-extraction

Last synced: 08 Apr 2026

https://github.com/ropensci/fulltext

⚠️ ARCHIVED ⚠️ Search across and get full text for OA & closed journals

crossref extract-text metadata open-access pdf r r-package rstats text-ming xml

Last synced: 15 Mar 2025

https://github.com/ropensci-archive/fulltext

⚠️ ARCHIVED ⚠️ Search across and get full text for OA & closed journals

crossref extract-text metadata open-access pdf r r-package rstats text-ming xml

Last synced: 14 Dec 2025

https://github.com/opensemanticsearch/open-semantic-etl

Python-based open source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (entity extraction & named entity recognition) and data enrichment (annotation) pipelines, plus an ingestor to a Solr or Elasticsearch index and a linked data graph database

annotation documents elasticsearch enrichment etl extract extract-information extract-text extractor ingest ingestion-pipeline ingests-documents named-entity-recognition nlp ocr pdf python rdf solr solr-dataimporter

Last synced: 06 Apr 2025

https://github.com/kevm/tikaondotnet

Use the Java Tika text extraction library on the .NET platform

extract-text tika

Last synced: 27 Jan 2026

https://github.com/lu4p/cat

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

cat cross-platform docx2txt extract-text go golang odt2txt pdf2txt pdftotext rtf-to-text text-extraction textextracting

Last synced: 25 Jun 2025

https://github.com/ropensci/antiword

R wrapper for antiword utility

antiword extract-text r r-package rstats

Last synced: 17 Oct 2025

https://github.com/ApryseSDK/pdftron-document-search

Build search across multiple documents client-side in your file storage

algolia-instantsearch extract-text seach-documents search-office-text search-pdf

Last synced: 14 Mar 2025

https://github.com/aprysesdk/pdftron-document-search

Build search across multiple documents client-side in your file storage

algolia-instantsearch extract-text seach-documents search-office-text search-pdf

Last synced: 17 Aug 2025

https://github.com/lifegpc/msg-tool

Tools for export and import scripts

bgi circus ethornell extract-text extractor galgame kirikiri

Last synced: 09 May 2026

https://github.com/maxim2266/ocr

A collection of tools for OCR (optical character recognition).

bash-script c extract-text linux ocr ocr-recognition tesseract

Last synced: 21 Feb 2026

https://github.com/bhattbhavesh91/google-vision-api-for-ocr-demo

A small demo that extracts text from images via OCR using the Google Vision API in Python

demo extract-text google-ocr google-vision google-vision-api image-ocr python

Last synced: 17 Apr 2025

https://github.com/devmehq/extract-text

node.js module for extracting text from html, pdf, doc, docx, xls, xlsx, csv, pptx, png, jpg, gif, rtf and more!

extract-text extractor ocr pdf tessaract tesseract-ocr

Last synced: 17 Jan 2026

https://github.com/greed2411/tokyo

tokyo, a REST API that, when given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.

apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing

Last synced: 07 May 2025

https://github.com/shelfio/tika-text-extract

Extract text from a document by Apache Tika

apache-tika extract-text node-module npm-package tika

Last synced: 19 Jul 2025

https://github.com/whuppi/pdf_manipulator

Cross-platform PDF manipulation for Flutter & Dart. Merge, split, render, extract text, search, sign, encrypt, validate, convert, build from scratch. Off the main thread on every platform including web.

compress dart decrypt encrypt extract-text flutter images-to-pdf merge pdf render reorder rotate rust search sign split

Last synced: 14 Jun 2026

https://github.com/saidsef/tika-document-to-text

Apache Tika: extract text and metadata from any document format with this pre-built containerised solution. Kubernetes-ready deployment with intuitive UI, API, and text-to-speech capabilities - perfect for content indexing, analysis, and document processing workflows

docker-container document-to-text document-to-text-ui extract-text helm-chart kubernetes kubernetes-deployment nodejs python text-extraction text-to-speech

Last synced: 02 Apr 2026

https://github.com/emmeryn/hocr-turtletext

A gem that parses positional text from hOCR output and provides convenience methods to find text.

extract-text gem hocr ruby-on-rails

Last synced: 08 Oct 2025

https://github.com/nanamare/ocr-android

Sample OCR using OpenCV (just a toy project), since 2017.

extract-text ocr ocr-android opencv text-mining

Last synced: 12 May 2026

https://github.com/defi0x1/scan-document

Extract text from an image using Pytesseract

extract-text image-to-text pytesseract-ocr tesseract

Last synced: 26 Jun 2025

https://github.com/djsudduth/joplin-plugin-paragraph-extractor

Extract specific paragraphs out of Joplin notes using keywords, hashtags or custom tags similar to Logseq block references. Also, refresh extracted notes if source notes change.

block-references extract-text joplin joplin-plugin note-taking notes-app notetaking plugin

Last synced: 14 Feb 2026

https://github.com/kingpin707/pdf-highlight-extractor

A Python tool for extracting highlighted text from PDF files while preserving formatting attributes (headers, bold, italic) and removing unwanted line breaks and page breaks. Perfect for integrating with content management systems.

ai21labs ebook-reader extract-highlights extract-text faiss-backend highlight-color kindle kindle-clippings koreader markdown mobi pdf-converter python remarkable-tablet

Last synced: 03 May 2026

https://github.com/hstanleycrow/easyphparticleextractor

Free PHP library to extract the main content from an article post or news post, including images and HTML

extract-article extract-content extract-text extract-website extraction extractor php php-library website

Last synced: 31 Jul 2025

https://github.com/islom-pardaboyev/image_to_text_converter

A React-based web app that extracts text from images using Tesseract.js. Upload an image, and the app will process it automatically. Supports manual text extraction as well. 🚀

extract-text image-to-text react react-icons sooner tailwindcss

Last synced: 10 May 2026

https://github.com/loganlinn/copy-text-of-selected-area-shortcut

Apple Shortcut to copy text of selected area (screenshot) to clipboard

extract-text ios macos screenshot shortcuts shortcuts-app

Last synced: 18 Jun 2025