https://github.com/baughmann/tikara

The metadata and text content extractor for almost every file type.

Host: GitHub
URL: https://github.com/baughmann/tikara
Owner: baughmann
License: apache-2.0
Created: 2025-01-25T03:36:34.000Z
Default Branch: master
Last Pushed: 2025-02-03T16:24:58.000Z
Last Synced: 2025-05-06T08:17:01.089Z
Topics: apache-tika, content-extraction, document-parsing, document-processing, docx, image-to-text, java, language-detection, llm, metadata, metadata-extraction, ml, natural-language-processing, ocr, pdf-to-text, retrieval-augmented-generation, text-extraction, text-mining
Language: Python
Homepage: https://baughmann.github.io/tikara/
Size: 161 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Support: SUPPORTED_MIME_TYPES.md
