{"id":42560483,"url":"https://github.com/vaites/php-apache-tika","last_synced_at":"2026-01-28T20:19:02.384Z","repository":{"id":57076624,"uuid":"41619268","full_name":"vaites/php-apache-tika","owner":"vaites","description":"Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats","archived":false,"fork":false,"pushed_at":"2025-10-04T07:41:51.000Z","size":14527,"stargazers_count":117,"open_issues_count":0,"forks_count":24,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-12-08T10:45:35.958Z","etag":null,"topics":["apache","ocr","php-library","text-extraction","text-recognition","tika"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vaites.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-08-30T06:38:09.000Z","updated_at":"2025-10-30T16:20:45.000Z","dependencies_parsed_at":"2024-05-29T00:41:07.483Z","dependency_job_id":"7db66aa6-a3bc-46ea-934a-5dd8f21e21d1","html_url":"https://github.com/vaites/php-apache-tika","commit_stats":{"total_commits":325,"total_committers":8,"mean_commits":40.625,"dds":0.06153846153846154,"last_synced_commit":"dd145e4e8d4595d3bde883b7b4bd0edecaa4df22"},"previous_names":[],"tags_count":44,"template":false,"template_full_name":null,"purl":"pkg:github/vaites/php-apache-tika","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vaites%2Fphp-apache-tika","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vaites%2Fphp-apache-tika/tags","rele
ases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vaites%2Fphp-apache-tika/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vaites%2Fphp-apache-tika/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vaites","download_url":"https://codeload.github.com/vaites/php-apache-tika/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vaites%2Fphp-apache-tika/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28850552,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-28T15:15:36.453Z","status":"ssl_error","status_checked_at":"2026-01-28T15:15:13.020Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","ocr","php-library","text-extraction","text-recognition","tika"],"created_at":"2026-01-28T20:19:01.365Z","updated_at":"2026-01-28T20:19:02.371Z","avatar_url":"https://github.com/vaites.png","language":"PHP","readme":"[![Current release](https://img.shields.io/github/release/vaites/php-apache-tika.svg)](https://github.com/vaites/php-apache-tika/releases/latest)\n[![Package at Packagist](https://img.shields.io/packagist/dt/vaites/php-apache-tika.svg)](https://packagist.org/packages/vaites/php-apache-tika)\n[![Build 
status](https://img.shields.io/github/workflow/status/vaites/php-apache-tika/tests/1.x)](https://github.com/vaites/php-apache-tika/actions)\n[![Code coverage](https://img.shields.io/codecov/c/github/vaites/php-apache-tika.svg)](https://codecov.io/github/vaites/php-apache-tika)\n[![Code quality](https://img.shields.io/scrutinizer/quality/g/vaites/php-apache-tika.svg)](https://scrutinizer-ci.com/g/vaites/php-apache-tika/)\n[![Code insight](https://img.shields.io/sensiolabs/i/92852e11-8648-4d48-9698-653aee765df5.svg)](https://insight.symfony.com/projects/92852e11-8648-4d48-9698-653aee765df5)\n[![License](https://img.shields.io/github/license/vaites/php-apache-tika.svg?color=%23999999)](https://github.com/vaites/php-apache-tika/blob/master/LICENSE)\n\n# PHP Apache Tika\n\nThis tool provides [Apache Tika](https://tika.apache.org) bindings for PHP, allowing you to extract text and metadata \nfrom documents, images and other formats. \n\nThe following modes are supported:\n* **App mode**: run the app JAR via the command line interface\n* **Server mode**: make HTTP requests to the [JSR 311 network server](https://cwiki.apache.org/confluence/display/TIKA/TikaServer)\n\nServer mode is recommended because it is 5 times faster, but some shared hosts don't allow running processes in the background.\n\nAlthough the library contains a list of supported versions, any version of Apache Tika should be compatible as long as\nbackward compatibility is maintained by the Tika team. 
Therefore, it is not necessary to wait for an update of the library \nto work with the new versions of the tool.\n\n## Features\n\n* Simple class interface to Apache Tika features:\n    * Text and HTML extraction\n    * Metadata extraction\n    * OCR recognition\n* Standardized metadata for documents\n* Support for local and remote resources\n* No heavyweight library dependencies\n* Compatible with Apache Tika 1.15 or greater\n    * Tested up to 1.28.5, 2.9.4 and 3.2.2\n* Works on Linux, macOS, Windows and probably on FreeBSD\n\n## Requirements\n\n* PHP 7.3 or greater\n    * [Multibyte String support](http://php.net/manual/en/book.mbstring.php)\n    * [cURL extension](http://php.net/manual/en/book.curl.php)\n* Apache Tika 1.15 or greater\n* Oracle Java or OpenJDK \n    * Java 8 for Tika 1.19 or greater\n    * Java 7 for Tika from 1.15 to 1.18\n* [Tesseract](https://github.com/tesseract-ocr/tesseract) (optional for OCR recognition)\n\n**NOTE**: the supported PHP version will remain synced with [the latest supported by the PHP team](https://www.php.net/supported-versions.php)\n\n## Installation\n\nInstall using Composer:\n\n```bash\ncomposer require vaites/php-apache-tika\n```\n\nIf you want to use OCR you must install [Tesseract](https://github.com/tesseract-ocr/tesseract):\n\n* **Fedora/CentOS**: `sudo yum install tesseract` (use dnf instead of yum on Fedora 22 or greater)\n* **Debian/Ubuntu**: `sudo apt-get install tesseract-ocr`\n* **macOS**: `brew install tesseract` (using [Homebrew](http://brew.sh))\n* **Windows**: `scoop install tesseract` (using [Scoop](http://scoop.sh))\n\nThe library assumes the `tesseract` binary is in the path, so you can compile it yourself or install it using any other method. 
\n\n## Usage\n\nStart the Apache Tika server with [caution](http://www.openwall.com/lists/oss-security/2015/08/13/5):\n\n```bash\njava -jar tika-server-x.xx.jar\n```\n\nIf you are using a JRE instead of a JDK and have Java 9 or greater, you must run:\n\n```bash\njava --add-modules java.se.ee -jar tika-server-x.xx.jar\n```\n\nInstantiate the class, checking that the JAR exists or the server is running:\n\n```php\n$client = \\Vaites\\ApacheTika\\Client::make('localhost', 9998);           // server mode (default)\n$client = \\Vaites\\ApacheTika\\Client::make('/path/to/tika-app.jar');     // app mode \n```\n\nIf you want to use dependency injection, serialize the class or just delay the check:\n\n```php\n$client = \\Vaites\\ApacheTika\\Client::prepare('localhost', 9998);\n$client = \\Vaites\\ApacheTika\\Client::prepare('/path/to/tika-app.jar'); \n```\n\nYou can use a URL too:\n\n```php\n$client = \\Vaites\\ApacheTika\\Client::make('http://localhost:9998');\n$client = \\Vaites\\ApacheTika\\Client::prepare('http://localhost:9998');\n```\n\nUse the class to extract text from documents:\n\n```php\n$language = $client-\u003egetLanguage('/path/to/your/document');\n$metadata = $client-\u003egetMetadata('/path/to/your/document');\n\n$html = $client-\u003egetHTML('/path/to/your/document');\n$text = $client-\u003egetText('/path/to/your/document');\n```\n\nOr use it to extract text from images:\n\n```php\n$client = \\Vaites\\ApacheTika\\Client::make($host, $port);\n$metadata = $client-\u003egetMetadata('/path/to/your/image');\n\n$text = $client-\u003egetText('/path/to/your/image');\n```\n    \nYou can use a URL instead of a file path and the library will download the file and pass it to Apache Tika. 
There's \n**no need** to add `-enableUnsecureFeatures -enableFileUrl` to the command line when starting the server, as described \n[here](https://wiki.apache.org/tika/TikaJAXRS#Specifying_a_URL_Instead_of_Putting_Bytes).\n\nIf you use Apache Tika \u003e= 2.0.0, you *can* [define an HttpFetcher](https://cwiki.apache.org/confluence/display/TIKA/tika-pipes)\nand use the options `-enableUnsecureFeatures -enableFileUrl` when starting the server to make the server download remote\nfiles when passing a URL instead of a filename. In order to do so, you must set the name of the HttpFetcher using \n`$client-\u003esetFetcherName('yourFetcherName')`.\n\n### Methods\n\nHere is the full list of available methods:\n\n#### Common\n\nTika file related methods:\n\n```php\n$client-\u003egetMetadata($file);\n$client-\u003egetRecursiveMetadata($file, 'text');\n$client-\u003egetLanguage($file);\n$client-\u003egetMIME($file);\n$client-\u003egetHTML($file);\n$client-\u003egetXHTML($file); // only CLI mode\n$client-\u003egetText($file);\n$client-\u003egetMainText($file);\n```\n    \nOther Tika related methods:\n\n```php\n$client-\u003egetSupportedMIMETypes();\n$client-\u003egetIsMIMETypeSupported('application/pdf');\n$client-\u003egetAvailableDetectors();\n$client-\u003egetAvailableParsers();\n$client-\u003egetVersion();\n```\n\nEncoding methods:\n\n```php\n$client-\u003egetEncoding();\n$client-\u003esetEncoding('UTF-8');\n```\n    \nSupported versions related methods:\n\n```php\n$client-\u003egetSupportedVersions();\n$client-\u003eisVersionSupported($version);\n```\n\nSet/get a callback for sequential read of response:\n\n```php\n$client-\u003esetCallback($callback);\n$client-\u003egetCallback();\n```\n    \nSet/get the chunk size for sequential read:\n\n```php\n$client-\u003esetChunkSize($size);\n$client-\u003egetChunkSize();\n```\n    \nEnable/disable the internal remote file downloader:\n\n```php\n$client-\u003esetDownloadRemote(true);\n$client-\u003egetDownloadRemote();\n```\n\n\nSet the 
[fetcher name](https://cwiki.apache.org/confluence/display/TIKA/tika-pipes):\n\n```php\n$client-\u003esetFetcherName($fetcher); // one of FileSystemFetcher, HttpFetcher, S3Fetcher, GCSFetcher, or SolrFetcher\n$client-\u003egetFetcherName();\n```\n\n#### Command line client\n    \nSet/get JAR/Java paths (only CLI mode):\n\n```php\n$client-\u003esetPath($path);\n$client-\u003egetPath();\n\n$client-\u003esetJava($java);\n$client-\u003egetJava();\n\n$client-\u003esetJavaArgs('-JXmx4g');\n$client-\u003egetJavaArgs();\n\n$client-\u003esetEnvVars(['LANG' =\u003e 'es_ES.UTF-8']);\n$client-\u003egetEnvVars();\n```\n\n#### Web client\n    \nSet/get host properties:\n\n```php\n$client-\u003esetHost($host);\n$client-\u003egetHost();\n\n$client-\u003esetPort($port);\n$client-\u003egetPort();\n\n$client-\u003esetUrl($url);\n$client-\u003egetUrl();\n\n$client-\u003esetRetries($retries);\n$client-\u003egetRetries();\n```\n    \nSet/get [cURL client options](http://php.net/manual/en/function.curl-setopt.php):\n\n```php\n$client-\u003esetOptions($options);\n$client-\u003egetOptions();\n$client-\u003esetOption($option, $value);\n$client-\u003egetOption($option);\n```\n\nSet/get timeout:\n\n```php\n$client-\u003esetTimeout($seconds);\n$client-\u003egetTimeout();\n```\n\nSet/get HTTP headers (see [TikaServer](https://cwiki.apache.org/confluence/display/TIKA/TikaServer)):\n\n```php\n$client-\u003esetHeader('Foo', 'bar');\n$client-\u003egetHeader('Foo');\n$client-\u003esetHeaders(['Foo' =\u003e 'bar', 'Bar' =\u003e 'baz']);\n$client-\u003egetHeaders();\n```\n\nSet/get OCR languages (see [TikaOCR](https://cwiki.apache.org/confluence/display/tika/tikaocr)):\n\n```php\n$client-\u003esetOCRLanguage($language);\n$client-\u003esetOCRLanguages($languages);\n$client-\u003egetOCRLanguages();\n```\n\nSet HTTP fetcher name (for Tika \u003e= 2.0.0 only, see https://cwiki.apache.org/confluence/display/TIKA/tika-pipes):\n\n```php\n$client-\u003esetFetcherName($fetcherName);\n```\n\n### Breaking 
changes\n\nSince version 1.0 there are some breaking changes:\n\n* Apache Tika versions prior to 1.15 are not supported (use [0.x](https://github.com/vaites/php-apache-tika/tree/0.x) version for 1.14 and older)\n* The minimum PHP requirement is 7.3 (use [0.x](https://github.com/vaites/php-apache-tika/tree/0.x) version for 7.1 and older)\n* `$client-\u003egetRecursiveMetadata()` returns an array as expected\n* `Client::getSupportedVersions()` and `Client::isVersionSupported()` methods cannot be called statically\n* Values returned by `Client::getAvailableDetectors()` and `Client::getAvailableParsers()` are identical and have a new definition \n\nSee [CHANGELOG.md](CHANGELOG.md) for more details.\n\n## Troubleshooting\n\n### Empty responses or unexpected results\n\nThis library is only a _proxy_, so if you get empty responses or unexpected results the most common cause is Tika \nitself. A simple test is using the GUI to check the response:\n\n1. Run the Tika app without arguments: `java -jar tika-app-x.xx.jar` \n2. Drop your file or select it using _File -\u003e Open_\n3. Wait until the metadata appears\n4. Get the text or HTML using the _View_ menu\n\nIf the results are the same, you should look at [Tika's Jira](https://issues.apache.org/jira/projects/TIKA/issues)\nand open an issue if necessary.\n\n### Encoding\n\nBy default the returned text is encoded as UTF-8, and the `Client::setEncoding()` method lets you set the expected \nencoding. \n\n## Tests\n\nTests are designed to **cover all features for all supported versions** of Apache Tika in app mode and server mode. 
\nThere are a few samples to test against:\n\n* **sample1**: document metadata and text extraction\n* **sample2**: image metadata \n* **sample3**: text recognition\n* **sample4**: unsupported media\n* **sample5**: huge text for callbacks \n* **sample6**: remote calls \n* **sample7**: text encoding\n* **sample8**: recursive metadata\n\n## Known issues\n\nSome issues found during tests are unrelated to this library:\n\n* Apache Tika 1.17 and lower can't extract text from OCR as described in [TIKA-2509](https://issues.apache.org/jira/browse/TIKA-2509)\n* Tesseract slows down document parsing as described in [TIKA-2359](https://issues.apache.org/jira/browse/TIKA-2359)\n    \n## Integrations\n\n- [Symfony2 Bundle](https://github.com/welcoMattic/ApacheTikaBundle)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvaites%2Fphp-apache-tika","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvaites%2Fphp-apache-tika","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvaites%2Fphp-apache-tika/lists"}