Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shelfio/tika-text-extract
Extract text from a document by Apache Tika
https://github.com/shelfio/tika-text-extract
apache-tika extract-text node-module npm-package tika
Last synced: 12 days ago
JSON representation
Extract text from a document by Apache Tika
- Host: GitHub
- URL: https://github.com/shelfio/tika-text-extract
- Owner: shelfio
- Created: 2017-02-23T22:03:59.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2024-05-18T21:27:52.000Z (6 months ago)
- Last Synced: 2024-05-18T22:28:30.571Z (6 months ago)
- Topics: apache-tika, extract-text, node-module, npm-package, tika
- Language: TypeScript
- Homepage: https://www.npmjs.com/package/tika-text-extract
- Size: 318 KB
- Stars: 15
- Watchers: 21
- Forks: 4
- Open Issues: 7
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Tika Text Extract
> Extract text from any document by [Apache Tika](https://tika.apache.org/)
[![CircleCI](https://img.shields.io/circleci/project/github/vladgolubev/tika-text-extract.svg)](https://circleci.com/gh/vladgolubev/tika-text-extract)
[![npm](https://img.shields.io/npm/v/tika-text-extract.svg)](https://www.npmjs.com/package/tika-text-extract)
[![David](https://img.shields.io/david/vladgolubev/tika-text-extract.svg)](https://david-dm.org/vladgolubev/tika-text-extract)
[![npm](https://img.shields.io/npm/dm/tika-text-extract.svg)](https://github.com/vladgolubev/tika-text-extract)## What?
> The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand
> different file types (such as PPT, XLS, and PDF). All of these file types can be parsed
> through a single interface, making Tika useful for search engine indexing,
> content analysis, translation, and much more.## Why?
This was mainly built for convenience usage in AWS Lambda environment.
If you want to use Tika from node.js you are left with these options:
- Spawn a CLI - no, extremely inefficient to pay for Java startup time
- Start HTTP Server
- Use Java ?Spawning a Tika as CLI is extremely inefficient.
Using Java API from node.js is tedious.
This module starts a [Tika HTTP Server](https://wiki.apache.org/tika/TikaJAXRS) to stream files to
and return a string of extracted text.Requires `java` to be present on the system.
## Install
```bash
$ yarn add @shelf/tika-text-extract
```### Note
By default in tika-text-extract version 3 use tika-server greater than 2.
## Usage
```javascript
import {readFileSync} from 'fs';
import tte from '@shelf/tika-text-extract';await tte.startServer('/tmp/tika-server-standard-2.2.1.jar');
const testFile = readFileSync('./README.md');const extractedText = await tte.extract(testFile);
```## Execute Tika with a custom path to Java binary
```javascript
const options = {executableJavaPath: '/bin/jre/java'};await tte.startServer('/tmp/tika-server-standard-2.2.1.jar', options);
// The next command will be executed:
// /bin/jre/java -jar /tmp/tika-server-standard-2.2.1.jar -noFork
```## Execute Tika with Java version less than 9
By default, the library does not support Java versions less than 9.
In order to use it with Java 8, pass an option to `startServer` function```javascript
const options = {alignWithJava8: true};await tte.startServer('/tmp/tika-server-standard-2.2.1.jar', options);
// The next command will be executed:
// java -jar /tmp/tika-server-standard-2.2.1.jar -noFork
```## Execute Tika V1
By default in tika-text-extract version 3 use apache-tika greater than 2.
To use tika-server less than 2, pass an option to `startServer` function```javascript
const options = {useTikaV1: true};await tte.startServer('/tmp/tika-server-1.25.jar', options);
// The next command will be executed:
// /bin/jre/java --add-modules=java.xml.bind,java.activation -Duser.home=/tmp -jar /tmp/tika-server-1.25.jar
```### Note
If you don't use this option with apache-tika less than 2. You will get an error
## API
You can see debug messages by setting env var `DEBUG=tika-text-extract`
### tte.startServer(artifactPath)
Params: `artifactPath` - path to your `tika-server.jar` file.
Returns: Promise resolved when server is started. Rejects in case of error.
### tte.extract(fileInput)
Params: `fileInput` - `Buffer`, `String`, `Stream` or `Promise` of file to extract text from.
Returns: Promise resolved with extracted text.
## Publish
```sh
$ git checkout master
$ yarn version
$ yarn publish
$ git push origin master --tags
```## How to run tika-text-extract
Download Java, you can accomplish it with these commands:
```
mkdir javadocker run --rm -v "$PWD"/java:/lambda/opt lambci/yumda:2 yum install -y java-1.8.0-openjdk-headless.x86_64
```Move `java` folder inside `tika-text-extract`.
Download `tika-server` which you want to use. You can find it in an [archive](https://archive.apache.org/dist/tika/)
After that you can run this command:```
docker run --rm \
-v "$PWD":/var/task \
-v "$PWD/java":/opt/java \
-v "$PWD/tika":/../layer/tika/ \
lambci/lambda:nodejs12.x basic-usage.handler
```## License
MIT © [Shelf](https://shelf.io)