{"id":15290804,"url":"https://github.com/shelfio/tika-text-extract","last_synced_at":"2025-07-19T02:32:44.124Z","repository":{"id":36953832,"uuid":"82975994","full_name":"shelfio/tika-text-extract","owner":"shelfio","description":"Extract text from a document by Apache Tika","archived":false,"fork":false,"pushed_at":"2025-07-12T00:26:35.000Z","size":370,"stargazers_count":17,"open_issues_count":10,"forks_count":6,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-07-12T02:59:58.281Z","etag":null,"topics":["apache-tika","extract-text","node-module","npm-package","tika"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/tika-text-extract","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shelfio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-02-23T22:03:59.000Z","updated_at":"2025-06-28T08:05:30.000Z","dependencies_parsed_at":"2023-01-17T08:01:20.048Z","dependency_job_id":"2528c140-eec7-47ba-b58c-12ed24ef2157","html_url":"https://github.com/shelfio/tika-text-extract","commit_stats":{"total_commits":249,"total_committers":12,"mean_commits":20.75,"dds":0.642570281124498,"last_synced_commit":"9ff996644a11764783f152cf442d1a69c02f99e7"},"previous_names":["vladgolubev/tika-text-extract"],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/shelfio/tika-text-extract","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shelfio%2Ftika-text-extract","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sh
elfio%2Ftika-text-extract/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shelfio%2Ftika-text-extract/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shelfio%2Ftika-text-extract/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shelfio","download_url":"https://codeload.github.com/shelfio/tika-text-extract/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shelfio%2Ftika-text-extract/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265876915,"owners_count":23842956,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-tika","extract-text","node-module","npm-package","tika"],"created_at":"2024-09-30T16:09:34.697Z","updated_at":"2025-07-19T02:32:44.104Z","avatar_url":"https://github.com/shelfio.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tika Text Extract\n\n\u003e Extract text from any document by [Apache 
Tika](https://tika.apache.org/)\n\n[![CircleCI](https://img.shields.io/circleci/project/github/vladgolubev/tika-text-extract.svg)](https://circleci.com/gh/vladgolubev/tika-text-extract)\n[![npm](https://img.shields.io/npm/v/tika-text-extract.svg)](https://www.npmjs.com/package/tika-text-extract)\n[![David](https://img.shields.io/david/vladgolubev/tika-text-extract.svg)](https://david-dm.org/vladgolubev/tika-text-extract)\n[![npm](https://img.shields.io/npm/dm/tika-text-extract.svg)](https://github.com/vladgolubev/tika-text-extract)\n\n## What?\n\n\u003e The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand\n\u003e different file types (such as PPT, XLS, and PDF). All of these file types can be parsed\n\u003e through a single interface, making Tika useful for search engine indexing,\n\u003e content analysis, translation, and much more.\n\n## Why?\n\nThis was built mainly for convenient use in an AWS Lambda environment.\n\nIf you want to use Tika from Node.js, you are left with these options:\n\n- Spawn the CLI - no, paying for Java startup time is extremely inefficient\n- Start an HTTP server\n- Use the Java API\n\nSpawning Tika as a CLI is extremely inefficient.\nUsing the Java API from Node.js is tedious.\nThis module starts a [Tika HTTP Server](https://wiki.apache.org/tika/TikaJAXRS), streams files to it,\nand returns the extracted text as a string.\n\nRequires `java` to be present on the system.\n\n## Install\n\n```bash\n$ yarn add @shelf/tika-text-extract\n```\n\n### Note\n\nBy default, tika-text-extract version 3 expects tika-server version 2 or later.\n\n## Usage\n\n```javascript\nimport {readFileSync} from 'fs';\nimport tte from '@shelf/tika-text-extract';\n\nawait tte.startServer('/tmp/tika-server-standard-2.2.1.jar');\nconst testFile = readFileSync('./README.md');\n\nconst extractedText = await tte.extract(testFile);\n```\n\n## Execute Tika with a custom path to Java binary\n\n```javascript\nconst options = {executableJavaPath: '/bin/jre/java'};\n\nawait 
tte.startServer('/tmp/tika-server-standard-2.2.1.jar', options);\n// The following command will be executed:\n// /bin/jre/java -jar /tmp/tika-server-standard-2.2.1.jar -noFork\n```\n\n## Execute Tika with Java version less than 9\n\nBy default, the library does not support Java versions earlier than 9.\nTo use it with Java 8, pass the `alignWithJava8` option to the `startServer` function:\n\n```javascript\nconst options = {alignWithJava8: true};\n\nawait tte.startServer('/tmp/tika-server-standard-2.2.1.jar', options);\n// The following command will be executed:\n// java -jar /tmp/tika-server-standard-2.2.1.jar -noFork\n```\n\n## Execute Tika V1\n\nBy default, tika-text-extract version 3 expects Apache Tika version 2 or later.\nTo use a tika-server version earlier than 2, pass the `useTikaV1` option to the `startServer` function:\n\n```javascript\nconst options = {useTikaV1: true};\n\nawait tte.startServer('/tmp/tika-server-1.25.jar', options);\n// The following command will be executed:\n// /bin/jre/java --add-modules=java.xml.bind,java.activation -Duser.home=/tmp -jar /tmp/tika-server-1.25.jar\n```\n\n### Note\n\nIf you don't use this option with an Apache Tika version earlier than 2, you will get an error.\n\n## API\n\nYou can see debug messages by setting the environment variable `DEBUG=tika-text-extract`.\n\n### tte.startServer(artifactPath, options)\n\nParams: `artifactPath` - path to your `tika-server.jar` file. `options` - optional object with `executableJavaPath`, `alignWithJava8`, or `useTikaV1`, as shown in the examples above.\n\nReturns: Promise resolved when the server is started. 
Rejects in case of error.\n\n### tte.extract(fileInput)\n\nParams: `fileInput` - `Buffer`, `String`, `Stream`, or `Promise` of the file to extract text from.\n\nReturns: Promise resolved with the extracted text.\n\n## Publish\n\n```sh\n$ git checkout master\n$ yarn version\n$ yarn publish\n$ git push origin master --tags\n```\n\n## How to run tika-text-extract\n\nDownload Java. You can do this with the following commands:\n\n```\nmkdir java\n\ndocker run --rm -v \"$PWD\"/java:/lambda/opt lambci/yumda:2 yum install -y java-1.8.0-openjdk-headless.x86_64\n```\n\nMove the `java` folder inside `tika-text-extract`.\nDownload the `tika-server` version you want to use. You can find it in the [archive](https://archive.apache.org/dist/tika/).\nAfter that, you can run this command:\n\n```\ndocker run --rm \\\n-v \"$PWD\":/var/task \\\n-v \"$PWD/java\":/opt/java \\\n-v \"$PWD/tika\":/../layer/tika/ \\\nlambci/lambda:nodejs12.x basic-usage.handler\n```\n\n## License\n\nMIT © [Shelf](https://shelf.io)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshelfio%2Ftika-text-extract","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshelfio%2Ftika-text-extract","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshelfio%2Ftika-text-extract/lists"}