https://github.com/baughmann/tikara
The metadata and text content extractor for almost every file type.
- Host: GitHub
- URL: https://github.com/baughmann/tikara
- Owner: baughmann
- License: apache-2.0
- Created: 2025-01-25T03:36:34.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2025-02-03T16:24:58.000Z (4 months ago)
- Last Synced: 2025-05-06T08:17:01.089Z (13 days ago)
- Topics: apache-tika, content-extraction, document-parsing, document-processing, docx, image-to-text, java, language-detection, llm, metadata, metadata-extraction, ml, natural-language-processing, ocr, pdf-to-text, retrieval-augmented-generation, text-extraction, text-mining
- Language: Python
- Homepage: https://baughmann.github.io/tikara/
- Size: 161 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
- Support: SUPPORTED_MIME_TYPES.md