https://github.com/ibmstreams/streamsx.document
(Incubation) This toolkit extracts text and metadata from documents in binary formats such as PDF, Word, Office, etc.
- Host: GitHub
- URL: https://github.com/ibmstreams/streamsx.document
- Owner: IBMStreams
- License: other
- Created: 2014-07-09T14:37:49.000Z (almost 12 years ago)
- Default Branch: develop
- Last Pushed: 2020-07-10T10:18:21.000Z (almost 6 years ago)
- Last Synced: 2025-07-28T00:04:34.612Z (11 months ago)
- Topics: extractor, ibm-streams, stream-processing
- Language: Java
- Homepage: http://ibmstreams.github.io/streamsx.document
- Size: 59.6 MB
- Stars: 6
- Watchers: 5
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
streamsx.document
=================
This toolkit extracts text and metadata from documents in binary formats
such as PDF, Word, Office, etc. For this purpose, the toolkit implements a DocumentSource operator.
The DocumentSource operator uses multiple third-party and open source document extraction technologies,
and can be extended with additional commercial or proprietary extractors. The operator automatically determines
the document MIME type and delegates the extraction request to the appropriate extractor plugin.
Out of the box, the toolkit provides the following extractors:
* Apache Tika – The primary extractor for binary documents such as Office documents (Word, PowerPoint, Excel), HTML files, etc.
* PDFBox – For handling Acrobat PDF files
* TrueZIP – ZIP, JAR, TAR, GZ, GZIP files and other archive files
* JUnrar – RAR files
* Plain Text – Text files of various encodings (ASCII, UTF-8, UTF-16, local encodings)
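
The detect-then-delegate flow described above can be illustrated with a minimal Java sketch. It is not the toolkit's implementation: the class name `ExtractorDispatchSketch`, its methods, the magic-byte detection, the MIME strings, and the choice to send unrecognized binary content to Apache Tika are all assumptions made for illustration. Only the extractor names come from the list above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical sketch: none of these names come from the streamsx.document toolkit.
class ExtractorDispatchSketch {

    // Assumed registry: MIME type -> name of the extractor that handles it.
    static final Map<String, String> EXTRACTORS = Map.of(
            "application/pdf", "PDFBox",
            "application/zip", "TrueZIP",
            "application/x-rar-compressed", "JUnrar",
            "text/plain", "Plain Text");

    // Guess a MIME type from the leading "magic" bytes of a document.
    static String detectMimeType(byte[] data) {
        if (startsWith(data, "%PDF")) return "application/pdf";
        if (startsWith(data, "Rar!")) return "application/x-rar-compressed";
        if (startsWith(data, "PK")) return "application/zip";
        return looksLikeText(data) ? "text/plain" : "application/octet-stream";
    }

    // True when the data begins with the ASCII bytes of the given marker.
    static boolean startsWith(byte[] data, String magic) {
        byte[] prefix = magic.getBytes(StandardCharsets.US_ASCII);
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Crude heuristic (ASCII/UTF-8 only): low control bytes mean binary content.
    static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            if ((b & 0xFF) < 0x09) return false;
        }
        return true;
    }

    // Delegate to the registered extractor; anything unregistered falls back
    // to Apache Tika (an assumption of this sketch).
    static String selectExtractor(byte[] data) {
        return EXTRACTORS.getOrDefault(detectMimeType(data), "Apache Tika");
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // prints: application/pdf -> PDFBox
        System.out.println(detectMimeType(pdf) + " -> " + selectExtractor(pdf));
    }
}
```

A real detector needs more than leading bytes. For example, newer Office files are ZIP containers and would be routed to the archive extractor by this sketch, and UTF-16 text would be misclassified as binary.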
The toolkit's home page is available at:
http://ibmstreams.github.io/streamsx.document/