An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with text-extraction

A curated list of projects in awesome lists tagged with text-extraction.

https://github.com/kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js—or use via CLI, REST API, or MCP server.

document-intelligence ffi golang java metadata-extraction node pdf-extraction pdfium python rag ruby rust table-extraction tesseract text-extraction wasm

Last synced: 04 Jan 2026

https://github.com/Goldziher/kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

async document-intelligence mcp metadata-extraction ocr pandoc pdf-extraction pdfium python rag table-extraction tesseract text-extraction

Last synced: 21 Oct 2025

https://github.com/goldziher/kreuzberg

A text extraction library supporting PDFs, images, office documents and more

asyncio docx ocr pdf text-extraction

Last synced: 14 May 2025

https://github.com/chrismattmann/tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

buffer covid-19 detection extraction memex mime nlp nlp-library nlp-machine-learning parse parser-interface python recognition text-extraction text-recognition tika-python tika-server tika-server-jar translation-interface usc

Last synced: 14 May 2025

https://github.com/whitelok/image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

awesome convolutional-neural-networks deep-learning deep-learning-algorithms machine-learning ocr scene-texts text-detection text-extraction text-recognition

Last synced: 20 Mar 2025

https://github.com/miso-belica/jusText

Heuristic based boilerplate removal tool

html-parser html-parsing python text-extraction

Last synced: 14 Mar 2025

https://github.com/miso-belica/justext

Heuristic based boilerplate removal tool

html-parser html-parsing python text-extraction

Last synced: 13 Apr 2025

https://github.com/unidoc/unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf pdf-files pdf-invoice pdf-library text-extraction unidoc

Last synced: 01 Apr 2025

https://github.com/ropensci/pdftools

Text Extraction, Rendering and Converting of PDF Documents

pdf-files pdf-format pdftools poppler poppler-library r r-package rstats text-extraction

Last synced: 27 Aug 2025

https://github.com/cdown/srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

command-line command-line-tool library mit-license python srt subtitle subtitle-fixer subtitle-parser subtitles subtitles-parsing text-extraction tools

Last synced: 14 May 2025

https://github.com/iamarunbrahma/vision-parse

Parse PDFs into markdown using Vision LLMs

document-parser pdf-parser pdf-to-markdown text-extraction

Last synced: 13 Dec 2025

https://github.com/Shixzie/nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

go golang natural-language-processing nlp parse text text-extraction

Last synced: 14 Mar 2025

https://github.com/shixzie/nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

go golang natural-language-processing nlp parse text text-extraction

Last synced: 01 Apr 2025

https://github.com/pd3f/pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

extract-text language-model machine-learning ocr parsr pd3f pdf pdf-to-text pipeline python text-extraction

Last synced: 03 Apr 2025

https://github.com/bookieio/breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

html-extraction html-extractor html-parsing python text-extraction text-mining

Last synced: 21 Oct 2025

https://github.com/weareprestatech/hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

pdf python text-extraction text-search

Last synced: 01 May 2025

https://github.com/skylander86/lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

aws-lambda lambda-functions ocr pdf pdf-ocr-extraction searchable-pdfs tesseract text-extraction

Last synced: 12 Jul 2025

https://github.com/vsymbol/CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Last synced: 02 Apr 2025

https://github.com/archivesunleashed/aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives

Last synced: 13 Apr 2025

https://github.com/victorqribeiro/ocr

Simple app to extract text from pictures using Tesseract

image-recognition ocr tesseract text-extraction text-recognition

Last synced: 07 Dec 2025

https://github.com/lu4p/cat

Extract text from plaintext, .docx, .odt and .rtf files. Pure go.

cat cross-platform docx2txt extract-text go golang odt2txt pdf2txt pdftotext rtf-to-text text-extraction textextracting

Last synced: 25 Jun 2025

https://github.com/gamemaker1/office-text-extractor

Yet another library to extract text from MS Office and PDF files

docx get-text ms-excel ms-office ms-powerpoint ms-word parser pdf pptx text-extraction xlsx

Last synced: 26 Dec 2025

https://github.com/jonathanraiman/wikipedia_ner

:book: Labeled examples from wiki dumps in Python

dataset named-entity-recognition python text-extraction wikipedia

Last synced: 11 Jul 2025

https://github.com/iamarunbrahma/pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

document-conversion document-processing information-retrieval pdf-converter pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction

Last synced: 10 Apr 2025

https://github.com/ckorzen/pdf-text-extraction-benchmark

A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.

arxiv benchmark evaluation extraction pdf tex text-extraction

Last synced: 11 May 2025

https://github.com/abhinaba-ghosh/any-text

Get text content from any file

file-reader reader text text-extraction text-extractor

Last synced: 22 Jul 2025

https://github.com/fourdigits/wagtail_textract

Text extraction for Wagtail document search

django search tesseract text-extraction textract wagtail

Last synced: 06 Oct 2025

https://github.com/spences10/mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping

Last synced: 15 Apr 2025

https://github.com/arshad-yaseen/ocr-llm

⚡️ Fast, ultra-accurate text extraction from any image or PDF—including challenging ones—with structured markdown output powered by vision models.

llm ocr text-extraction

Last synced: 05 May 2025

https://github.com/ingmarboeschen/jatsdecoder

A text extraction and manipulation toolset for NISO-JATS coded XML files

cermine niso-jats pubmedcentral r text-extraction text-mining xml-files

Last synced: 12 Apr 2025

https://github.com/greed2411/tokyo

tokyo, a REST API that, given any type of document 📄, identifies the MIME type 🧐, suggests an extension 🦔, and extracts text 💪.

apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing

Last synced: 07 May 2025

https://github.com/Altabeh/tesseract-ocr-wrapper

This is a highly efficient python wrapper for tesseract-ocr.

leptonica multiprocessing tesseract-ocr text-extraction xpdf

Last synced: 09 Jul 2025

https://github.com/ad-freiburg/pdftotext-plus-plus

A fast and accurate command line tool for extracting text from PDF files.

c-plus-plus cli document-analysis metadata-extraction pdf text-extraction

Last synced: 16 May 2025

https://github.com/dotfurther/OpenDiscoverSDK

.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.

archive csharp dotnet email embedded-objects entity-extraction extraction file-deduplication file-format-detection file-identification indexing metadata microsoft-office phi pii pii-detection pst sdk text text-extraction

Last synced: 12 Apr 2025

https://github.com/shelfio/apache-tika-lambda-layer

AWS Lambda layer containing latest version of Apache Tika

apache-tika aws-lambda lambda-layer text-extraction

Last synced: 10 Jun 2025

https://github.com/bmoscon/articleparse

Heuristic text extraction from news sites in Python3

analysis boilerplate-removal heuristics python text-analysis text-extraction

Last synced: 07 May 2025

https://github.com/typo3-solr/ext-tika

A TYPO3 CMS extension that provides Apache Tika functionality

cms cms-extension file-indexing language-detection metadata php search text-extraction tika typo3 typo3-cms-extension

Last synced: 04 Apr 2025

https://github.com/andrealenzi11/py-poppleract

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents

ocr optical-character-recognition pdf-reader pdf-splitting pdf-to-text pdf2text pdftotext poppler poppleract py-poppleract tesseract tesseract-ocr text-extraction

Last synced: 26 Mar 2025

https://github.com/funinkina/gnome-ocr-screenshot

A simple Python script to extract text from screenshots in the GNOME desktop environment using pytesseract.

gnome gnome-shell linux ocr screenshot text-extraction tools utility

Last synced: 30 Jul 2025

https://github.com/pt-perkasa-pilar-utama/ppu-pdf

PDF utilities for extracting text from digital PDFs and converting scanned PDFs into canvas.

jsr npm pdf-canvas pdf-digital pdf-reader pdfjs rag scanned-pdf text-extraction

Last synced: 01 Aug 2025

https://github.com/fisseha-estifanos/llm-api

A repository to demonstrate some of the concepts behind large language models, transformer (foundation) models, in-context learning, and prompt engineering using open source large language models like Bloom and co:here.

api bloom cohere in-context-learning llm news-score prompt-engineering text-extraction transformer

Last synced: 06 Oct 2025

https://github.com/ssciwr/ammico

AI-based Media and Misinformation Content Analysis Tool: Analyze text and images

classification computer-vision nlp text-extraction translation

Last synced: 26 Jul 2025

https://github.com/kind-unes/flutter-translation-application

Flutter Android & iOS Translation Education Application. It utilizes ObjectBox as a local database and Google API for translations, and is powered by GEMINI-ULTRA for AI capabilities

ai android chatbot computer-vision dart educational-application flutter gemini-api google-translate-api hive ios langugage-recognition nosql open-source phrasebook source-code sqlite text-extraction translation

Last synced: 09 Apr 2025

https://github.com/lihanghang/tecroom

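Online summary documentation of a technology stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.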

artificial-intelligence coding data-mining data-structures deep-learning design-patterns docker java machine-learning nlp python3 summary text-extraction text-mining tools

Last synced: 09 Apr 2025

https://github.com/rajdeep2804/automated_invoice_processing

The number of types of physical documents being digitized is increasing. Medical bills, bank documents, and personal documents are examples. The objective of this repo is to implement and understand such use cases through an example of extracting text information from invoice receipts.

automation computer-vision detectron2 digitalization image-processing image-segmentation maskrcnn ocr opencv-python polygon python3 pytorch tesseract-ocr text-extraction

Last synced: 05 Oct 2025

https://github.com/gursv/url-summ

A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.

ai beautifulsoup chunking formatted-text huggingface-models python3 smtp star-rating streamlit text-extraction text-summarization transformers url-summarization

Last synced: 23 Apr 2025

https://github.com/gatenlp/wpextract

Create datasets from WordPress sites for research or archiving

corpus crawler nlp text-extraction text-mining web-scraping wordpress

Last synced: 25 Jun 2025

https://github.com/rosette-api-community/text-embeddings-sample

A little python code to show how to get similarity between word embeddings returned from the Rosette API's new /text-embedding endpoint.

machine-learning natural-language-processing nlp python text-embedding text-extraction text-similarity word-similarity

Last synced: 24 Nov 2025

https://github.com/atahanuz/yt2text

Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition model.

artificial-intelligence python text-extraction transcription whisper whisper-ai youtube

Last synced: 13 May 2025

https://github.com/virajmadhu/pdf_key_matcher

Highlights the key matches between your given PDF and the description text

ats cv open-source pdf pdf-text-extraction python python-script python3 terminal-based text-compression text-extraction virajmadhu

Last synced: 12 Apr 2025

https://github.com/nbdy/prntscrngrb

prnt.sc / lightshot crawler, nudity detection and text extraction to a sqlite database

crawler nudity-detection prntsc text-extraction

Last synced: 04 Oct 2025

https://github.com/nishant2018/text-extraction-ocr-opencv

Text extraction is the process of automatically extracting text from images or documents. Optical Character Recognition (OCR) is a technology that enables computers to convert images of text into machine-readable text.

ocr opencv python text-extraction

Last synced: 26 Feb 2025

https://github.com/caesariodito/mp-assignment-automation

Mini Personal Project to Automate Assignments (Provide Insights Only)

automation chatgpt chatgpt-api homework-assignments image pdf python text-extraction

Last synced: 27 Jul 2025

https://github.com/h0neyp0t-466/pen2pdf

"📝 Pen2PDF – AI-powered web app to transform handwritten notes, slides, PDFs & images into editable Markdown ✏️ → export as polished PDFs 📄. Features drag & drop 📤, real-time editing ⚡, responsive UI 📱, and Google Gemini 🤖 integration. Perfect for students, creators & pros 🚀."

ai-app ai-text-extraction document-processing express file-converter google-gemini handwritten-notes javascript markdown-editor nodejs ocr pdf-converter pdf-to-markdown pdf-tools pen2pdf ppt-to-pdf react text-extraction vite web-app

Last synced: 25 Sep 2025

https://github.com/mazzasaverio/url2md4ai

Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.

html-to-markdown html-to-markdown-converter openai playwright text-extraction trafilatura

Last synced: 05 Jul 2025

https://github.com/junioralive/discordbotlab

DiscordBotLab is a repository focused on hosting and managing a variety of utility-driven Discord bots.

ai automated-bots bot-development discord discord-bot google-colab langchain llama ocr quiz-bot tesseract text-extraction utility-bots

Last synced: 24 Oct 2025

https://github.com/mansurpro/docuparse

DocuParse is a high-performance tool for converting PDF documents into clean, structured Markdown files. Designed for speed and accuracy, it extracts and formats content while minimizing errors like hallucinations and repetitions.

digital-archive document-layout-analysis google-colab huggingface-transformers markdown-conversion pdf-parsing pdf-to-markdown tesseract-ocr text-extraction

Last synced: 05 Apr 2025

https://github.com/pt-perkasa-pilar-utama/ppu-pdf-headless

PDF utilities for text extraction from digital PDFs

headless pdf-digital pdf-reader pdfjs pdfjs-dist rag text-extraction

Last synced: 22 Jun 2025

https://github.com/el-mehdiri/pdf-text-extraction-api

A REST API to extract structured text, font details, and positioning from PDF files using Node.js and Python.

express font-detection nodejs ocr pdf pdfminer text-extraction

Last synced: 30 Dec 2025

https://github.com/juliandavidmr/text2locale

Extract all the texts of any project with HTML files and generate a KV (Key-Value) file, key = reference key, value = extracted text.

extract html-files i18n text-extraction

Last synced: 24 Feb 2025

https://github.com/lightbridge-ks/radreportparser

Regex-based text parser for common radiology reports

radiology-reports regex text-extraction text-mining

Last synced: 22 Mar 2025

https://github.com/paramsiddharth/pdf2text

An application that can extract editable as well as scanned text from PDF files.

hacktoberfest ocr pdf text-extraction

Last synced: 18 Sep 2025

https://github.com/anant2003jain/textextractify

TextExtractify is an AI-powered tool that extracts text from images and PDFs using both Azure OCR and EasyOCR. It offers features like multi-image upload, text entity extraction, and .docx export for premium users. Designed to streamline document processing with fast, accurate text extraction.

azure login-system ocr ocr-python pillow python3 streamlit text-extraction

Last synced: 21 Aug 2025

https://github.com/furqanhun/textnomnom-py

Extract text from PDFs, PPTs, & URLs (with OCR support). Converts PPT to PDF & handles files or folders. 🦍

automated-conversion automation cross-platform document-conversion image-text-extraction linux pdf-processing pdf-to-text ppt ppt-to-text pptx pptx-to-text text-extraction windows

Last synced: 23 Mar 2025

https://github.com/olegiv/pdf_2_md

CLI tool to convert PDFs to Markdown with NLP summaries

automation cli markdown nlp pdf pdf-to-markdown python python3 summarization text-extraction toc

Last synced: 10 Apr 2025

https://github.com/gauff/textprocessing

Text extraction, transcription, punctuation restoration, translation, summarization and text to speech from almost any file type

cli file-downloader llm ocr punctuation-restoration python summarizer text-extraction text-extraction-from-image text-processing text-to-speech transcoding transcription translator

Last synced: 24 Mar 2025

https://github.com/lightbridge-ks/radreportparser-app

A Python web app for extracting key sections from radiology report text

radiology-report regex text-extraction

Last synced: 13 Jun 2025

https://github.com/nidhish-balasubramanya/pdf-summarizer

A streamlined and efficient PDF Summarizer powered by Google's Gemini AI API. This tool allows users to upload PDFs and receive concise, AI-generated summaries instantly. Built with Streamlit for an intuitive user experience, it is ideal for students, researchers, and professionals who need quick insights from lengthy documents.

ai automation gemini-api google-ai machine-learning openai pdf-summarizer python streamlit text-extraction

Last synced: 05 Oct 2025

https://github.com/allen-reji/paddleocr-text-extraction-ml-model

Utilizes PaddleOCR and advanced image pre-processing techniques to extract product attributes from images.

amazon-ml-challenge image-processing machine-learning opencv paddleocr paddlepaddle pil text-extraction

Last synced: 23 Mar 2025

https://github.com/jameshobden/repo-to-prompt

📂 Tool to transform files & dirs into structured prompts for LLMs. 🌳 Generates file maps + extracts text. 📋 Copies to clipboard. macOS-ready with Automator support!

ai-tools automator clipboard directory-structure file-map macos prompt-generation prompt-generator python python3 text-extraction

Last synced: 13 Aug 2025

https://github.com/natylaza89/semantic-similarity-dating-app

Semantic Similarity LLM Dating App using Python 3.12, FastAPI, WebSockets, CoHere, Gemini 1.5 Flash & Embeddings

async cosine-similarity embeddings fastapi gemini-api llm mypy nlp poetry pytest python ruff text-extraction websockets

Last synced: 02 Sep 2025

https://github.com/sankeer28/url-extractor-and-downloader

Extracts multiple URLs from text, and if downloadable, downloads them into a ZIP

image-downloader text-extraction

Last synced: 01 Aug 2025

https://github.com/oeo/processor-rs

High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.

document-processing image-optimization parallel-processing rust tesseract-ocr text-extraction

Last synced: 10 Jun 2025

https://github.com/porky-chen/i18n-t

A Vue SDK for batch multi-language translation using Youdao API, auto-extracting text for i18n localization.

batch-translation i18n json localization multi-language sdk text-extraction translation vue youdao-api

Last synced: 25 Jul 2025

https://github.com/sulemansaeed73/cleansetext

CleanseText – AI Writing Assistant built with Next.js & Django REST Framework.

django-rest-framework grammar-checker nextjs redux summarization-model tailwindcss text-extraction

Last synced: 13 Jul 2025

https://github.com/rmottanet/unchainedtext

UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.

data-extraction extractor pdf-text-extraction text-extraction text-extraction-tool text-processing

Last synced: 27 Jun 2025

https://github.com/terry-li-hm/prometheus

PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude

ai-tools claude-code document-processing fastmcp mcp-server pdf-processing pdf-splitter prometheus pymupdf python text-extraction

Last synced: 03 Sep 2025