An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with text-extraction

A curated list of projects in awesome lists tagged with text-extraction.

https://github.com/kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via CLI, REST API, or MCP server.

bun csharp document-intelligence elixir ffi golang java metadata-extraction node pdf-extraction pdfium php python rag ruby rust table-extraction tesseract text-extraction wasm

Last synced: 17 May 2026

https://github.com/run-llama/liteparse

A fast, helpful, and open-source document parser

document-ocr document-processing ocr ocr-recognition pdf pdf-parser text-extraction

Last synced: 30 May 2026

https://github.com/Goldziher/kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

async document-intelligence mcp metadata-extraction ocr pandoc pdf-extraction pdfium python rag table-extraction tesseract text-extraction

Last synced: 21 Oct 2025

https://github.com/goldziher/kreuzberg

A text extraction library supporting PDFs, images, office documents and more

asyncio docx ocr pdf text-extraction

Last synced: 14 May 2025

https://github.com/chrismattmann/tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

buffer covid-19 detection extraction memex mime nlp nlp-library nlp-machine-learning parse parser-interface python recognition text-extraction text-recognition tika-python tika-server tika-server-jar translation-interface usc

Last synced: 14 May 2025

https://github.com/whitelok/image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

awesome convolutional-neural-networks deep-learning deep-learning-algorithms machine-learning ocr scene-texts text-detection text-extraction text-recognition

Last synced: 20 Mar 2025

https://github.com/kreuzberg-dev/html-to-markdown

High-performance, CommonMark-compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.

hocr html html-converter markdown markdown-converter rag text-extraction text-processing

Last synced: 28 May 2026

https://github.com/yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction

Last synced: 13 May 2026

https://github.com/miso-belica/justext

Heuristic based boilerplate removal tool

html-parser html-parsing python text-extraction

Last synced: 13 Apr 2025

https://github.com/miso-belica/jusText

Heuristic based boilerplate removal tool

html-parser html-parsing python text-extraction

Last synced: 14 Mar 2025

https://github.com/unidoc/unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf pdf-files pdf-invoice pdf-library text-extraction unidoc

Last synced: 01 Apr 2025

https://github.com/ropensci/pdftools

Text Extraction, Rendering and Converting of PDF Documents

pdf-files pdf-format pdftools poppler poppler-library r r-package rstats text-extraction

Last synced: 27 Aug 2025

https://github.com/cdown/srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

command-line command-line-tool library mit-license python srt subtitle subtitle-fixer subtitle-parser subtitles subtitles-parsing text-extraction tools

Last synced: 14 May 2025

https://github.com/iamarunbrahma/vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Last synced: 13 Dec 2025

https://github.com/Shixzie/nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

go golang natural-language-processing nlp parse text text-extraction

Last synced: 14 Mar 2025

https://github.com/shixzie/nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

go golang natural-language-processing nlp parse text text-extraction

Last synced: 01 Apr 2025

https://github.com/pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

extract-text language-model machine-learning ocr parsr pd3f pdf pdf-to-text pipeline python text-extraction

Last synced: 08 Apr 2026

https://github.com/bookieio/breadability

A reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

html-extraction html-extractor html-parsing python text-extraction text-mining

Last synced: 21 Oct 2025

https://github.com/weareprestatech/hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

pdf python text-extraction text-search

Last synced: 03 Mar 2026

https://github.com/skylander86/lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

aws-lambda lambda-functions ocr pdf pdf-ocr-extraction searchable-pdfs tesseract text-extraction

Last synced: 16 Jan 2026

https://github.com/vsymbol/CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Last synced: 02 Apr 2025

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 13 Apr 2025

https://github.com/vaites/php-apache-tika

Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

apache ocr php-library text-extraction text-recognition tika

Last synced: 28 Jan 2026

https://github.com/victorqribeiro/ocr

Simple app to extract text from pictures using Tesseract

image-recognition ocr tesseract text-extraction text-recognition

Last synced: 07 Dec 2025

https://github.com/lu4p/cat

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

cat cross-platform docx2txt extract-text go golang odt2txt pdf2txt pdftotext rtf-to-text text-extraction textextracting

Last synced: 25 Jun 2025

https://github.com/gamemaker1/office-text-extractor

Yet another library to extract text from MS Office and PDF files

docx get-text ms-excel ms-office ms-powerpoint ms-word parser pdf pptx text-extraction xlsx

Last synced: 16 Mar 2026

https://github.com/iscc/mobi

Python-based software to unpack KindleGen-generated ebooks

kindle mobi text-extraction

Last synced: 17 Feb 2026

https://github.com/jonathanraiman/wikipedia_ner

📖 Labeled examples from wiki dumps in Python

dataset named-entity-recognition python text-extraction wikipedia

Last synced: 11 Jul 2025

https://github.com/iamarunbrahma/pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

document-conversion document-processing information-retrieval pdf-converter pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction

Last synced: 10 Apr 2025

https://github.com/ckorzen/pdf-text-extraction-benchmark

A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.

arxiv benchmark evaluation extraction pdf tex text-extraction

Last synced: 11 May 2025

https://github.com/abhinaba-ghosh/any-text

Get text content from any file

file-reader reader text text-extraction text-extractor

Last synced: 22 Jul 2025

https://github.com/fourdigits/wagtail_textract

Text extraction for Wagtail document search

django search tesseract text-extraction textract wagtail

Last synced: 06 Oct 2025

https://github.com/pd3f/pd3f-core

📑 Python Package to reconstruct the original continuous text from PDFs with language models

dehyphenation language-model machine-learning pd3f pdf text-extraction

Last synced: 08 Apr 2026

https://github.com/spences10/mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping

Last synced: 15 Apr 2025

https://github.com/arshad-yaseen/ocr-llm

⚡️ Fast, ultra-accurate text extraction from any image or PDF—including challenging ones—with structured markdown output powered by vision models.

llm ocr text-extraction

Last synced: 05 May 2025

https://github.com/ingmarboeschen/jatsdecoder

A text extraction and manipulation toolset for NISO-JATS coded XML files

cermine niso-jats pubmedcentral r text-extraction text-mining xml-files

Last synced: 12 Apr 2025

https://github.com/greed2411/tokyo

tokyo is a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and extracts the text 💪.

apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing

Last synced: 07 May 2025

https://github.com/Altabeh/tesseract-ocr-wrapper

This is a highly efficient Python wrapper for tesseract-ocr.

leptonica multiprocessing tesseract-ocr text-extraction xpdf

Last synced: 09 Jul 2025

https://github.com/ad-freiburg/pdftotext-plus-plus

A fast and accurate command line tool for extracting text from PDF files.

c-plus-plus cli document-analysis metadata-extraction pdf text-extraction

Last synced: 16 May 2025

https://github.com/dotfurther/OpenDiscoverSDK

.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.

archive csharp dotnet email embedded-objects entity-extraction extraction file-deduplication file-format-detection file-identification indexing metadata microsoft-office phi pii pii-detection pst sdk text text-extraction

Last synced: 12 Apr 2025

https://github.com/ssciwr/ammico

AI-based Media and Misinformation Content Analysis Tool: Analyze text and images

classification computer-vision nlp text-extraction translation

Last synced: 06 Mar 2026

https://github.com/shelfio/apache-tika-lambda-layer

AWS Lambda layer containing latest version of Apache Tika

apache-tika aws-lambda lambda-layer text-extraction

Last synced: 10 Jun 2025

https://github.com/coregx/gxpdf

GxPDF - Enterprise-grade PDF library for Go. Table extraction, text parsing, encryption, document creation.

go golang open-source pdf pdf-encryption pdf-generation pdf-library pdf-parser table-extraction text-extraction

Last synced: 02 Apr 2026

https://github.com/bmoscon/articleparse

Heuristic text extraction from news sites in Python3

analysis boilerplate-removal heuristics python text-analysis text-extraction

Last synced: 07 May 2025

https://github.com/funinkina/gnome-ocr-screenshot

A simple Python script to extract text from screenshots in the GNOME desktop environment using pytesseract.

gnome gnome-shell linux ocr screenshot text-extraction tools utility

Last synced: 30 Jul 2025

https://github.com/typo3-solr/ext-tika

A TYPO3 CMS extension that provides Apache Tika functionality

cms cms-extension file-indexing language-detection metadata php search text-extraction tika typo3 typo3-cms-extension

Last synced: 04 Apr 2025

https://github.com/andrealenzi11/py-poppleract

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents

ocr optical-character-recognition pdf-reader pdf-splitting pdf-to-text pdf2text pdftotext poppler poppleract py-poppleract tesseract tesseract-ocr text-extraction

Last synced: 26 Mar 2025

https://github.com/heussd/pdftotext-go

Extract texts + their page numbers from PDF

golang pdf text-extraction

Last synced: 14 Jan 2026

https://github.com/globality-corp/deboiler

Deboiler - Boilerplate Identification and Removal

boilerplate-identification boilerplate-removal deboiler python text-extraction

Last synced: 29 Jan 2026

https://github.com/pt-perkasa-pilar-utama/ppu-pdf

PDF utilities for extracting text from digital PDFs and converting scanned PDFs into canvas.

jsr npm pdf-canvas pdf-digital pdf-reader pdfjs rag scanned-pdf text-extraction

Last synced: 01 Aug 2025

https://github.com/manofstrong/sitescrapper

A PHP library to scrape websites from their sitemaps, extract relevant content from each webpage, and upload it to a database

keywords-extraction php scraper scraping-websites sitemap-xml text-extraction

Last synced: 13 Jan 2026

https://github.com/fisseha-estifanos/llm-api

A repository to demonstrate some of the concepts behind large language models, transformer (foundation) models, in-context learning, and prompt engineering using open source large language models like Bloom and co:here.

api bloom cohere in-context-learning llm news-score prompt-engineering text-extraction transformer

Last synced: 08 Mar 2026

https://github.com/apyhub/apyhub.js

ApyHub SDK for Node.js is a library for accessing the ApyHub APIs.

api document-generation file-conversion image-generation image-processing nodejs text-extraction

Last synced: 26 Jan 2026

https://github.com/saidsef/tika-document-to-text

Extract text and metadata from any document format with this pre-built, containerised Apache Tika solution. Kubernetes-ready deployment with an intuitive UI, API, and text-to-speech capabilities; well suited to content indexing, analysis, and document processing workflows.

docker-container document-to-text document-to-text-ui extract-text helm-chart kubernetes kubernetes-deployment nodejs python text-extraction text-to-speech

Last synced: 02 Apr 2026

https://github.com/lihanghang/tecroom

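Online summary documentation of a technology stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.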

artificial-intelligence coding data-mining data-structures deep-learning design-patterns docker java machine-learning nlp python3 summary text-extraction text-mining tools

Last synced: 09 Apr 2025

https://github.com/rajdeep2804/automated_invoice_processing

The number of types of physical documents being digitized is increasing. Medical bills, bank documents, and personal documents are examples. The objective of this repo is to implement and understand such use cases through an example of extracting text information from invoice receipts.

automation computer-vision detectron2 digitalization image-processing image-segmentation maskrcnn ocr opencv-python polygon python3 pytorch tesseract-ocr text-extraction

Last synced: 05 Oct 2025

https://github.com/kind-unes/flutter-translation-application

Flutter Android & iOS Translation Education Application. It utilizes ObjectBox as a local database and Google API for translations, and is powered by GEMINI-ULTRA for AI capabilities

ai android chatbot computer-vision dart educational-application flutter gemini-api google-translate-api hive ios langugage-recognition nosql open-source phrasebook source-code sqlite text-extraction translation

Last synced: 09 Apr 2025

https://github.com/rosette-api-community/text-embeddings-sample

A small Python example showing how to compute similarity between word embeddings returned from the Rosette API's new /text-embedding endpoint.

machine-learning natural-language-processing nlp python text-embedding text-extraction text-similarity word-similarity

Last synced: 25 May 2026

https://github.com/gatenlp/wpextract

Create datasets from WordPress sites for research or archiving

corpus crawler nlp text-extraction text-mining web-scraping wordpress

Last synced: 25 Jun 2025

https://github.com/gursv/url-summ

A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.

ai beautifulsoup chunking formatted-text huggingface-models python3 smtp star-rating streamlit text-extraction text-summarization transformers url-summarization

Last synced: 23 Apr 2025

https://github.com/atahanuz/yt2text

Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition model.

artificial-intelligence python text-extraction transcription whisper whisper-ai youtube

Last synced: 13 May 2025

https://github.com/utachicodes/pyshotter

A Python library for smart, annotated, and shareable screenshots.

annotation cross-platform ocr python screenshot sharing smart-detection text-extraction

Last synced: 05 Feb 2026

https://github.com/dataiku/dss-plugin-tesseract-ocr

Dataiku DSS plugin to perform optical character recognition (OCR) using the Tesseract engine.

dataiku dss-plugin ocr optical-character-recognition tesseract tesseract-ocr text-extraction

Last synced: 04 Apr 2026

https://github.com/rushi-balapure/pdf_2_json_extractor

A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure including headings (H1-H6) and body text, outputting clean JSON format.

cli-tool cpu-only cross-platform data-extraction document-parsing document-processing json layout-analysis nlp offline pdf pdf-extraction pdf-parser pdf-processing pdf-to-json python python-library structure-extraction text-extraction

Last synced: 21 Apr 2026

https://github.com/ceodaniyal/free-llm-image-to-text

Free OCR powered by LLMs using OpenRouter — extract text from images with no API costs. Works with image URLs and Base64 inputs using free vision-capable models.

ai-ocr api-integration computer-vision free-ai free-ocr image-processing image-to-text llm ocr openrouter python text-extraction vision-llm

Last synced: 04 May 2026

https://github.com/nishant2018/text-extraction-ocr-opencv

Text extraction is the process of automatically extracting text from images or documents. Optical Character Recognition (OCR) is a technology that enables computers to convert images of text into machine-readable text.

ocr opencv python text-extraction

Last synced: 04 May 2026

https://github.com/ceodaniyal/universal-llm-ocr

This repository contains a Python script to extract text from images using OpenAI's GPT-4 API. The script supports text extraction from both online image URLs and locally stored images (converted to base64). It ensures accurate and structured text extraction, making it a powerful tool for OCR-like tasks. The extracted text is saved to a file.

api-integration base64 gpt-4 gpt-4o gpt-4o-mini image-ocr image-processing image-to-text ocr openai python text-analysis text-extraction

Last synced: 04 May 2026

https://github.com/nbdy/prntscrngrb

prnt.sc / lightshot crawler, nudity detection and text extraction to a sqlite database

crawler nudity-detection prntsc text-extraction

Last synced: 04 Oct 2025

https://github.com/h0neyp0t-466/pen2pdf

📝 Pen2PDF: AI-powered web app to transform handwritten notes, slides, PDFs & images into editable Markdown ✏️ → export as polished PDFs 📄. Features drag & drop 📤, real-time editing ⚡, responsive UI 📱, and Google Gemini 🤖 integration. Perfect for students, creators & pros 🚀.

ai-app ai-text-extraction document-processing express file-converter google-gemini handwritten-notes javascript markdown-editor nodejs ocr pdf-converter pdf-to-markdown pdf-tools pen2pdf ppt-to-pdf react text-extraction vite web-app

Last synced: 15 Mar 2026

https://github.com/chchench/textract

Golang module for extracting text from XML-based MS Office documents

golang msoffice msoffice-parser msword text-extraction unarchive

Last synced: 13 Jan 2026

https://github.com/virajmadhu/pdf_key_matcher

Highlights the key matches between your given PDF and the description text

ats cv open-source pdf pdf-text-extraction python python-script python3 terminal-based text-compression text-extraction virajmadhu

Last synced: 01 Feb 2026

https://github.com/anyparser/anyparserjs

Anyparser TypeScript SDK for RAG/ETL Pipelines - File Content Extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/Image to Text, Audio to Text, and Website to Text.

anyparser artificial-intelligence cache-augmented-generation crawler etl-pipeline graph-rag knowledgebase langchain microsoft-office microsoft-word ms-office n8n-nodes ocr pdf-extraction rag retrieval-augmented-generation text-extraction web-crawler

Last synced: 17 Feb 2026

https://github.com/mazzasaverio/url2md4ai

Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.

html-to-markdown html-to-markdown-converter openai playwright text-extraction trafilatura

Last synced: 05 Jul 2025

https://github.com/fieldcure/fieldcure-document-parsers

Document text extraction library for DOCX, HWPX, XLSX, PPTX, and PDF. Supports OOXML math-to-LaTeX conversion, Hancom equation parsing, and IMediaDocumentParser for image extraction.

csharp document-parser docx dotnet equation-parser hwpx latex nuget pdf text-extraction

Last synced: 27 Apr 2026

https://github.com/cmhac/chat-extract

Experimental tool to extract data from screen recordings of text chats

chat-app osint text-extraction

Last synced: 02 Mar 2026

https://github.com/caesariodito/mp-assignment-automation

Mini Personal Project to Automate Assignments (Provide Insights Only)

automation chatgpt chatgpt-api homework-assignments image pdf python text-extraction

Last synced: 27 Jul 2025

https://github.com/importcjj/go-readability

Go package that cleans an HTML page for better readability.

extractor go golang html html-extractor html2text readability text text-extraction

Last synced: 14 Jan 2026

https://github.com/lightbridge-ks/radreportparser-app

A Python web app for extracting key sections from radiology report text

radiology-report regex text-extraction

Last synced: 24 Apr 2026