Projects in Awesome Lists tagged with text-extraction
A curated list of projects in awesome lists tagged with text-extraction.
https://github.com/miso-belica/sumy
Module for automatic summarization of text documents and HTML pages.
html-extraction html-extractor html-page lsa nlp pagerank-algorithm python reduction summarization summarizer summary sumy text-extraction textteaser
Last synced: 13 May 2025
https://github.com/kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Go, and TypeScript/Node.js—or use via CLI, REST API, or MCP server.
document-intelligence ffi golang java metadata-extraction node pdf-extraction pdfium python rag ruby rust table-extraction tesseract text-extraction wasm
Last synced: 04 Jan 2026
https://github.com/adbar/trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
article-extractor corpus corpus-builder corpus-tools crawler html-to-markdown html2text news news-aggregator news-crawler nlp readability rss-feed scraping tei text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 24 Dec 2025
https://github.com/unidoc/unipdf
Golang PDF library for creating and processing PDF files (pure go)
golang pdf pdf-compression pdf-document-processor pdf-generation pdf-generator pdf-library pdf-manipulation pdf-reader pdf-reports pdf-sign signing text-extraction
Last synced: 12 May 2025
https://github.com/Goldziher/kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
async document-intelligence mcp metadata-extraction ocr pandoc pdf-extraction pdfium python rag table-extraction tesseract text-extraction
Last synced: 21 Oct 2025
https://github.com/goldziher/kreuzberg
A text extraction library supporting PDFs, images, office documents and more
asyncio docx ocr pdf text-extraction
Last synced: 14 May 2025
https://github.com/chrismattmann/tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
buffer covid-19 detection extraction memex mime nlp nlp-library nlp-machine-learning parse parser-interface python recognition text-extraction text-recognition tika-python tika-server tika-server-jar translation-interface usc
Last synced: 14 May 2025
https://github.com/whitelok/image-text-localization-recognition
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.
awesome convolutional-neural-networks deep-learning deep-learning-algorithms machine-learning ocr scene-texts text-detection text-extraction text-recognition
Last synced: 20 Mar 2025
https://github.com/miso-belica/jusText
Heuristic based boilerplate removal tool
html-parser html-parsing python text-extraction
Last synced: 14 Mar 2025
https://github.com/miso-belica/justext
Heuristic based boilerplate removal tool
html-parser html-parsing python text-extraction
Last synced: 13 Apr 2025
https://github.com/unidoc/unidoc
This repository has moved! https://github.com/unidoc/unipdf
golang pdf pdf-files pdf-invoice pdf-library text-extraction unidoc
Last synced: 01 Apr 2025
https://github.com/ICIJ/datashare
A self-hosted search engine for documents.
datashare docker elasticsearch extract investigative-journalism named-entity-recognition text-extraction web-gui
Last synced: 15 Apr 2025
https://github.com/icij/datashare
A self-hosted search engine for documents.
datashare docker elasticsearch extract investigative-journalism named-entity-recognition text-extraction web-gui
Last synced: 19 Nov 2025
https://github.com/ropensci/pdftools
Text Extraction, Rendering and Converting of PDF Documents
pdf-files pdf-format pdftools poppler poppler-library r r-package rstats text-extraction
Last synced: 27 Aug 2025
https://github.com/cdown/srt
A simple library and set of tools for parsing, modifying, and composing SRT files.
command-line command-line-tool library mit-license python srt subtitle subtitle-fixer subtitle-parser subtitles subtitles-parsing text-extraction tools
Last synced: 14 May 2025
https://github.com/iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
document-parser pdf-parser pdf-to-markdown text-extraction
Last synced: 13 Dec 2025
https://github.com/Shixzie/nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
go golang natural-language-processing nlp parse text text-extraction
Last synced: 14 Mar 2025
https://github.com/shixzie/nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
go golang natural-language-processing nlp parse text text-extraction
Last synced: 01 Apr 2025
https://github.com/flairnlp/fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus corpus-tools crawler datasets image-classification image-extraction news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping
Last synced: 14 May 2025
https://github.com/flairNLP/fundus
A very simple news crawler with a funny name
cc-news commoncrawl corpus crawler news-crawler news-scraping nlp python rss scraper sitemap text-extraction web-corpus web-scraping
Last synced: 04 Mar 2025
https://github.com/pd3f/pd3f
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
extract-text language-model machine-learning ocr parsr pd3f pdf pdf-to-text pipeline python text-extraction
Last synced: 03 Apr 2025
https://github.com/py-pdf/benchmarks
Benchmarking PDF libraries
benchmark data-extraction mupdf pdf poppler-utils pypdf2 text-extraction
Last synced: 28 Jul 2025
https://github.com/bookieio/breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)
html-extraction html-extractor html-parsing python text-extraction text-mining
Last synced: 21 Oct 2025
https://github.com/weareprestatech/hotpdf
hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
pdf python text-extraction text-search
Last synced: 01 May 2025
https://github.com/sapienzanlp/extend
Entity Disambiguation as text extraction (ACL 2022)
acl acl2022 entity-disambiguation entity-disambiguation-models entity-linking natural-language-processing nlp pytorch text-extraction
Last synced: 03 Aug 2025
https://github.com/skylander86/lambda-text-extractor
AWS Lambda functions to extract text from various binary formats.
aws-lambda lambda-functions ocr pdf pdf-ocr-extraction searchable-pdfs tesseract text-extraction
Last synced: 12 Jul 2025
https://github.com/vsymbol/CUTIE
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
computer-vision deep-learning text-extraction
Last synced: 02 Apr 2025
https://github.com/archivesunleashed/aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
analysis apache-spark big-data big-data-analytics dataframe digital-humanities hadoop network-graphing pyspark python3 scala spark text-extraction webarchives
Last synced: 13 Apr 2025
https://github.com/victorqribeiro/ocr
Simple app to extract text from pictures using Tesseract
image-recognition ocr tesseract text-extraction text-recognition
Last synced: 07 Dec 2025
https://github.com/lu4p/cat
Extract text from plaintext, .docx, .odt and .rtf files. Pure go.
cat cross-platform docx2txt extract-text go golang odt2txt pdf2txt pdftotext rtf-to-text text-extraction textextracting
Last synced: 25 Jun 2025
https://github.com/gamemaker1/office-text-extractor
Yet another library to extract text from MS Office and PDF files
docx get-text ms-excel ms-office ms-powerpoint ms-word parser pdf pptx text-extraction xlsx
Last synced: 26 Dec 2025
https://github.com/jonathanraiman/wikipedia_ner
Labeled examples from wiki dumps in Python
dataset named-entity-recognition python text-extraction wikipedia
Last synced: 11 Jul 2025
https://github.com/iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
document-conversion document-processing information-retrieval pdf-converter pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction
Last synced: 10 Apr 2025
https://github.com/ckorzen/pdf-text-extraction-benchmark
A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.
arxiv benchmark evaluation extraction pdf tex text-extraction
Last synced: 11 May 2025
https://github.com/abhinaba-ghosh/any-text
Get text content from any file
file-reader reader text text-extraction text-extractor
Last synced: 22 Jul 2025
https://github.com/fourdigits/wagtail_textract
Text extraction for Wagtail document search
django search tesseract text-extraction textract wagtail
Last synced: 06 Oct 2025
https://github.com/goldziher/html-to-markdown
HTML to markdown converter
html-converter markdown-converter rag text-extraction text-processing
Last synced: 24 Apr 2025
https://github.com/hscspring/pnlp
NLP pre-/post-processing tools.
chinese-nlp concurrency nlp nlp-enhancer nlp-preprocess normalization preprocessing text-cleaning text-extraction text-length text-processing
Last synced: 27 Sep 2025
https://github.com/amenezes/aiopytesseract
A Python asyncio wrapper for Tesseract-OCR.
asyncio ocr optical-character-recognition pdftotext pytesseract pytesseract-ocr tesseract tesseract-ocr text-extraction
Last synced: 04 Oct 2025
https://github.com/spences10/mcp-jinaai-reader
🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader
content-extraction documentation-tool jinaai llm-tools mcp model-context-protocol text-extraction web-content web-scraping
Last synced: 15 Apr 2025
https://github.com/arshad-yaseen/ocr-llm
⚡️ Fast, ultra-accurate text extraction from any image or PDF—including challenging ones—with structured markdown output powered by vision models.
Last synced: 05 May 2025
https://github.com/ingmarboeschen/jatsdecoder
A text extraction and manipulation toolset for NISO-JATS coded XML files
cermine niso-jats pubmedcentral r text-extraction text-mining xml-files
Last synced: 12 Apr 2025
https://github.com/greed2411/tokyo
tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.
apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing
Last synced: 07 May 2025
https://github.com/Altabeh/tesseract-ocr-wrapper
This is a highly efficient python wrapper for tesseract-ocr.
leptonica multiprocessing tesseract-ocr text-extraction xpdf
Last synced: 09 Jul 2025
https://github.com/ad-freiburg/pdftotext-plus-plus
A fast and accurate command line tool for extracting text from PDF files.
c-plus-plus cli document-analysis metadata-extraction pdf text-extraction
Last synced: 16 May 2025
https://github.com/dotfurther/OpenDiscoverSDK
.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.
archive csharp dotnet email embedded-objects entity-extraction extraction file-deduplication file-format-detection file-identification indexing metadata microsoft-office phi pii pii-detection pst sdk text text-extraction
Last synced: 12 Apr 2025
https://github.com/shelfio/apache-tika-lambda-layer
AWS Lambda layer containing latest version of Apache Tika
apache-tika aws-lambda lambda-layer text-extraction
Last synced: 10 Jun 2025
https://github.com/pt-perkasa-pilar-utama/ppu-paddle-ocr
A lightweight PaddleOCR implementation in Bun/Node.js for text detection and recognition in JavaScript environments.
bun computer-vision image-processing image-text-extraction javascript-ocr node-js ocr onnx onnxruntime optical-character-recognition paddleocr paddlepaddle text-detection text-extraction text-recognition typescript-ocr
Last synced: 22 Jun 2025
https://github.com/mrgrd56/textractor-translator
Translate visual novels in real time
anime electron games javascript text-extraction textractor textractor-extension translation translator typescript visual-novel
Last synced: 04 May 2025
https://github.com/dotfurther/OpenDiscoverPlatformCaseStudy
Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly create a full-text search/eDiscovery/information governance capable demonstration application.
archive-extractor data-breach document-ingestion ediscovery file-deduplication file-format-detection file-identification full-text full-text-extraction full-text-search indexing-engine information-governance information-governance-catalog metadata personally-identifiable-information pii pii-detection ravendb text-extraction
Last synced: 12 Apr 2025
https://github.com/bmoscon/articleparse
Heuristic text extraction from news sites in Python3
analysis boilerplate-removal heuristics python text-analysis text-extraction
Last synced: 07 May 2025
https://github.com/typo3-solr/ext-tika
A TYPO3 CMS extension that provides Apache Tika functionality
cms cms-extension file-indexing language-detection metadata php search text-extraction tika typo3 typo3-cms-extension
Last synced: 04 Apr 2025
https://github.com/andrealenzi11/py-poppleract
Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents
ocr optical-character-recognition pdf-reader pdf-splitting pdf-to-text pdf2text pdftotext poppler poppleract py-poppleract tesseract tesseract-ocr text-extraction
Last synced: 26 Mar 2025
https://github.com/funinkina/gnome-ocr-screenshot
A simple Python script to extract text from screenshots in the GNOME desktop environment using pytesseract.
gnome gnome-shell linux ocr screenshot text-extraction tools utility
Last synced: 30 Jul 2025
https://github.com/pt-perkasa-pilar-utama/ppu-pdf
PDF utilities for extracting text from digital PDFs and converting scanned PDFs into canvas.
jsr npm pdf-canvas pdf-digital pdf-reader pdfjs rag scanned-pdf text-extraction
Last synced: 01 Aug 2025
https://github.com/fisseha-estifanos/llm-api
A repository to demonstrate some of the concepts behind large language models, transformer (foundation) models, in-context learning, and prompt engineering using open source large language models like Bloom and co:here.
api bloom cohere in-context-learning llm news-score prompt-engineering text-extraction transformer
Last synced: 06 Oct 2025
https://github.com/ssciwr/ammico
AI-based Media and Misinformation Content Analysis Tool: Analyze text and images
classification computer-vision nlp text-extraction translation
Last synced: 26 Jul 2025
https://github.com/andythefactory/article-extraction-dataset
Article title, authors, date and body extraction dataset.
article-extractor corpus corpus-builder corpus-tools dataset datasets html-to-markdown html2text news news-aggregator news-crawler readability scraping scraping-websites text-cleaning text-extraction text-mining text-preprocessing web-scraping
Last synced: 06 Nov 2025
https://github.com/kind-unes/flutter-translation-application
Flutter Android & iOS Translation Education Application. It utilizes ObjectBox as a local database and Google API for translations, and is powered by GEMINI-ULTRA for AI capabilities
ai android chatbot computer-vision dart educational-application flutter gemini-api google-translate-api hive ios langugage-recognition nosql open-source phrasebook source-code sqlite text-extraction translation
Last synced: 09 Apr 2025
https://github.com/lihanghang/tecroom
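Online summary documentation of a technology stack, covering programming languages, data structures and algorithms, machine learning, databases, and more.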
artificial-intelligence coding data-mining data-structures deep-learning design-patterns docker java machine-learning nlp python3 summary text-extraction text-mining tools
Last synced: 09 Apr 2025
https://github.com/rajdeep2804/automated_invoice_processing
The number of types of physical documents being digitized is increasing. Medical bills, bank documents, and personal documents are examples. The objective of this repo is to implement and understand such use cases through an example of extracting text information from invoice receipts.
automation computer-vision detectron2 digitalization image-processing image-segmentation maskrcnn ocr opencv-python polygon python3 pytorch tesseract-ocr text-extraction
Last synced: 05 Oct 2025
https://github.com/gursv/url-summ
A URL summarizer, which summarizes the content of a URL with proper formatting. It uses 'sshleifer/distilbart-cnn-12-6', which is a distilled version of the BART model, specifically optimized for text summarization tasks, including CNN summarization.
ai beautifulsoup chunking formatted-text huggingface-models python3 smtp star-rating streamlit text-extraction text-summarization transformers url-summarization
Last synced: 23 Apr 2025
https://github.com/gatenlp/wpextract
Create datasets from WordPress sites for research or archiving
corpus crawler nlp text-extraction text-mining web-scraping wordpress
Last synced: 25 Jun 2025
https://github.com/rosette-api-community/text-embeddings-sample
A little python code to show how to get similarity between word embeddings returned from the Rosette API's new /text-embedding endpoint.
machine-learning natural-language-processing nlp python text-embedding text-extraction text-similarity word-similarity
Last synced: 24 Nov 2025
https://github.com/atahanuz/yt2text
Extract text from a YouTube video in a single command, using OpenAI's Whisper speech recognition model.
artificial-intelligence python text-extraction transcription whisper whisper-ai youtube
Last synced: 13 May 2025
https://github.com/zeeshanahmad4/nlp-pdf-minning-extracting-text-from-pdf
NLP PDF mining: extracting text from PDF
extract-text pdf pdf-converter pdf-document-processor pdf-files pdf-format pdf-text-extraction pdfcon pdfkit pdftohtml pdftoimage pdftools pdftotext python text-extraction
Last synced: 01 Apr 2025
https://github.com/virajmadhu/pdf_key_matcher
Highlights the key matches between your given PDF and the description text
ats cv open-source pdf pdf-text-extraction python python-script python3 terminal-based text-compression text-extraction virajmadhu
Last synced: 12 Apr 2025
https://github.com/nbdy/prntscrngrb
prnt.sc / lightshot crawler, nudity detection and text extraction to a sqlite database
crawler nudity-detection prntsc text-extraction
Last synced: 04 Oct 2025
https://github.com/nishant2018/text-extraction-ocr-opencv
Text extraction is the process of automatically extracting text from images or documents. Optical Character Recognition (OCR) is a technology that enables computers to convert images of text into machine-readable text.
ocr opencv python text-extraction
Last synced: 26 Feb 2025
https://github.com/caesariodito/mp-assignment-automation
Mini Personal Project to Automate Assignments (Provide Insights Only)
automation chatgpt chatgpt-api homework-assignments image pdf python text-extraction
Last synced: 27 Jul 2025
https://github.com/h0neyp0t-466/pen2pdf
"📝 Pen2PDF – AI-powered web app to transform handwritten notes, slides, PDFs & images into editable Markdown ✏️ → export as polished PDFs 📄. Features drag & drop 📤, real-time editing ⚡, responsive UI 📱, and Google Gemini 🤖 integration. Perfect for students, creators & pros 🚀."
ai-app ai-text-extraction document-processing express file-converter google-gemini handwritten-notes javascript markdown-editor nodejs ocr pdf-converter pdf-to-markdown pdf-tools pen2pdf ppt-to-pdf react text-extraction vite web-app
Last synced: 25 Sep 2025
https://github.com/mazzasaverio/url2md4ai
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
html-to-markdown html-to-markdown-converter openai playwright text-extraction trafilatura
Last synced: 05 Jul 2025
https://github.com/lykmapipo/us-inaugural-addresses
Python scripts to download, process, and analyze US Inaugural Addresses
beautifulsoup4 gensim joblib lykmapipo natural-language-processing nlp nltk python python-scripts requests spacy text-analysis text-analytics text-extraction text-processing web-scraping
Last synced: 08 Apr 2025
https://github.com/junioralive/discordbotlab
DiscordBotLab is a repository focused on hosting and managing a variety of utility-driven Discord bots.
ai automated-bots bot-development discord discord-bot google-colab langchain llama ocr quiz-bot tesseract text-extraction utility-bots
Last synced: 24 Oct 2025
https://github.com/amirthfultehrani/youtube-transcript-copier
A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.
accessibility automation browser-extension clipboard content-extraction data-extraction greasemonkey helper javascript productivity tampermonkey text-extraction tool transcript userscript utilities video violentmonkey web youtube
Last synced: 17 Aug 2025
https://github.com/mansurpro/docuparse
DocuParse is a high-performance tool for converting PDF documents into clean, structured Markdown files. Designed for speed and accuracy, it extracts and formats content while minimizing errors like hallucinations and repetitions.
digital-archive document-layout-analysis google-colab huggingface-transformers markdown-conversion pdf-parsing pdf-to-markdown tesseract-ocr text-extraction
Last synced: 05 Apr 2025
https://github.com/pt-perkasa-pilar-utama/ppu-pdf-headless
PDF utilities for text extraction from digital PDFs
headless pdf-digital pdf-reader pdfjs pdfjs-dist rag text-extraction
Last synced: 22 Jun 2025
https://github.com/el-mehdiri/pdf-text-extraction-api
A REST API to extract structured text, font details, and positioning from PDF files using Node.js and Python.
express font-detection nodejs ocr pdf pdfminer text-extraction
Last synced: 30 Dec 2025
https://github.com/juliandavidmr/text2locale
Extract all the texts of any project with HTML files and generate a KV (Key-Value) file, key = reference key, value = extracted text.
extract html-files i18n text-extraction
Last synced: 24 Feb 2025
https://github.com/lightbridge-ks/radreportparser
Regex-based text parser for common radiology reports
radiology-reports regex text-extraction text-mining
Last synced: 22 Mar 2025
https://github.com/paramsiddharth/pdf2text
An application that can extract editable as well as scanned text from PDF files.
hacktoberfest ocr pdf text-extraction
Last synced: 18 Sep 2025
https://github.com/atharvaj1234/glyphify
AI-powered mobile app for converting handwritten notes and documents into editable digital text - built to boost productivity for students and educators.
ai ai-agents android android-application edtech education handwriting-recognition handwritten-character-recognition handwritten-text-recognition image-processing machine-learning mobile-app mobile-application ocr react-native react-native-app student-productivity text-extraction
Last synced: 30 Jul 2025
https://github.com/baughmann/tikara
The metadata and text content extractor for almost every file type.
apache-tika content-extraction document-parsing document-processing docx image-to-text java language-detection llm metadata metadata-extraction ml natural-language-processing ocr pdf-to-text retrieval-augmented-generation text-extraction text-mining
Last synced: 03 Oct 2025
https://github.com/anant2003jain/textextractify
TextExtractify is an AI-powered tool that extracts text from images and PDFs using both Azure OCR and EasyOCR. It offers features like multi-image upload, text entity extraction, and .docx export for premium users. Designed to streamline document processing with fast, accurate text extraction.
azure login-system ocr ocr-python pillow python3 streamlit text-extraction
Last synced: 21 Aug 2025
https://github.com/furqanhun/textnomnom-py
Extract text from PDFs, PPTs, & URLs (with OCR support). Converts PPT to PDF & handles files or folders. 🦍
automated-conversion automation cross-platform document-conversion image-text-extraction linux pdf-processing pdf-to-text ppt ppt-to-text pptx pptx-to-text text-extraction windows
Last synced: 23 Mar 2025
https://github.com/olegiv/pdf_2_md
CLI tool to convert PDFs to Markdown with NLP summaries
automation cli markdown nlp pdf pdf-to-markdown python python3 summarization text-extraction toc
Last synced: 10 Apr 2025
https://github.com/gauff/textprocessing
Text extraction, transcription, punctuation restoration, translation, summarization and text to speech from almost any file type
cli file-downloader llm ocr punctuation-restoration python summarizer text-extraction text-extraction-from-image text-processing text-to-speech transcoding transcription translator
Last synced: 24 Mar 2025
https://github.com/lightbridge-ks/radreportparser-app
A Python web app for extracting key sections from radiology report text
radiology-report regex text-extraction
Last synced: 13 Jun 2025
https://github.com/nidhish-balasubramanya/pdf-summarizer
A streamlined and efficient PDF Summarizer powered by Google's Gemini AI API. This tool allows users to upload PDFs and receive concise, AI-generated summaries instantly. Built with Streamlit for an intuitive user experience, it is ideal for students, researchers, and professionals who need quick insights from lengthy documents.
ai automation gemini-api google-ai machine-learning openai pdf-summarizer python streamlit text-extraction
Last synced: 05 Oct 2025
https://github.com/allen-reji/paddleocr-text-extraction-ml-model
Utilizes PaddleOCR and advanced image pre-processing techniques to extract product attributes from images.
amazon-ml-challenge image-processing machine-learning opencv paddleocr paddlepaddle pil text-extraction
Last synced: 23 Mar 2025
https://github.com/jameshobden/repo-to-prompt
📂 Tool to transform files & dirs into structured prompts for LLMs. 🌳 Generates file maps + extracts text. 📋 Copies to clipboard. macOS-ready with Automator support!
ai-tools automator clipboard directory-structure file-map macos prompt-generation prompt-generator python python3 text-extraction
Last synced: 13 Aug 2025
https://github.com/natylaza89/semantic-similarity-dating-app
Semantic Similarity LLM Dating App using Python 3.12, FastAPI, WebSockets, CoHere, Gemini 1.5 Flash & Embeddings
async cosine-similarity embeddings fastapi gemini-api llm mypy nlp poetry pytest python ruff text-extraction websockets
Last synced: 02 Sep 2025
https://github.com/qeeqbox/galeodes
screenshots selenium text text-extraction wrapper
Last synced: 09 Aug 2025
https://github.com/sankeer28/url-extractor-and-downloader
Extracts multiple URLs from text, and if downloadable, downloads them into a ZIP
image-downloader text-extraction
Last synced: 01 Aug 2025
https://github.com/oeo/processor-rs
High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.
document-processing image-optimization parallel-processing rust tesseract-ocr text-extraction
Last synced: 10 Jun 2025
https://github.com/porky-chen/i18n-t
A Vue SDK for batch multi-language translation using Youdao API, auto-extracting text for i18n localization.
batch-translation i18n json localization multi-language sdk text-extraction translation vue youdao-api
Last synced: 25 Jul 2025
https://github.com/sulemansaeed73/cleansetext
CleanseText – AI Writing Assistant built with Next.js & Django REST Framework.
django-rest-framework grammar-checker nextjs redux summarization-model tailwindcss text-extraction
Last synced: 13 Jul 2025
https://github.com/rmottanet/unchainedtext
UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.
data-extraction extractor pdf-text-extraction text-extraction text-extraction-tool text-processing
Last synced: 27 Jun 2025
https://github.com/terry-li-hm/prometheus
PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude
ai-tools claude-code document-processing fastmcp mcp-server pdf-processing pdf-splitter prometheus pymupdf python text-extraction
Last synced: 03 Sep 2025