{"id":28532126,"url":"https://github.com/databrickslabs/tika-ocr","last_synced_at":"2026-01-31T14:04:51.796Z","repository":{"id":65507546,"uuid":"540512137","full_name":"databrickslabs/tika-ocr","owner":"databrickslabs","description":null,"archived":false,"fork":false,"pushed_at":"2024-10-21T14:04:42.000Z","size":247,"stargazers_count":21,"open_issues_count":7,"forks_count":3,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-06-09T15:43:41.884Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rich Text Format","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databrickslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-09-23T15:48:25.000Z","updated_at":"2025-06-04T08:45:53.000Z","dependencies_parsed_at":"2024-01-04T17:26:13.295Z","dependency_job_id":"ca589596-6785-4203-b7fc-e348bc4ee814","html_url":"https://github.com/databrickslabs/tika-ocr","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/databrickslabs/tika-ocr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Ftika-ocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Ftika-ocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Ftika-ocr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Ftika-ocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/databrickslabs","download_url":"https://codeload.github.com/databrickslabs/tika-ocr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databrickslabs%2Ftika-ocr/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264091962,"owners_count":23556216,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-09T15:31:01.200Z","updated_at":"2026-01-31T14:04:51.792Z","avatar_url":"https://github.com/databrickslabs.png","language":"Rich Text Format","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tika-ocr-inputformat\n\n**Digitization of documents with Tika on Databricks** : *The volume of available data is growing by the second. \nAbout [64 zettabytes](https://www.wsj.com/articles/how-to-understand-the-data-explosion-11638979214) were created \nor copied last year, according to IDC, a technology market research firm. By 2025, this number will grow to an \nestimated [175 zettabytes](https://www.statista.com/statistics/871513/worldwide-data-created/), and it is becoming \nincreasingly granular and difficult to codify, unify, and centralize. And though more financial services institutions \n(FSIs) are talking about big data and using technology to capture more data than ever, Forrester reports that 70% of\nall data within an enterprise still goes unused for analytics. 
The open source nature of Lakehouse for Financial \nServices makes it possible for bank compliance officers, insurance underwriting agents or claim adjusters to combine \nthe latest technologies in optical character recognition (OCR) and natural language processing (NLP) in order to transform \nany financial document, in any format, into valuable data assets. The Apache Tika toolkit detects and extracts \nmetadata and text from over a thousand different file types (such as PPT, XLS, and PDF). Combined with \n[Tesseract](https://en.wikipedia.org/wiki/Tesseract_(software)), the most commonly used OCR technology, there is \nvirtually no limit to what files we can ingest, store and exploit for analytics and operational purposes.*\n\n## Project Support\nPlease note that all projects in the /databrickslabs GitHub account are provided for your exploration only, \nand are not formally supported by Databricks with Service Level Agreements (SLAs). \nThey are provided AS-IS and we do not make any guarantees of any kind. \nPlease do not submit a support ticket relating to any issues arising from the use of these projects.\n\nAny issues discovered through the use of this project should be filed as GitHub Issues on the Repo. \nThey will be reviewed as time permits, but there are no formal SLAs for support.\n\n## Building the Project\n\n```\nmvn clean install\n```\n\n## Deploying / Installing the Project\n\n```\nmvn release:prepare\n```\n\n## Releasing the Project\n\n```\nmvn release:perform\n```\n\n## Using the Project\n\nAdd the `com.databricks.labs:tika-ocr:0.1.4` Maven dependency to your Databricks runtime.\nAlternatively, compile this project with the Maven profile `shaded` enabled to generate an uber jar that you upload to \nyour Databricks runtime. 
You can now read any file, extracting text content from any file format.\n\n```python\nspark.read.format('tika').load(path_to_any_file)\n```\n\n|                path|length|    modificationTime|             content|         contentType|         contentText|     contentMetadata|\n| ------------------ | ---- | ------------------ | ------------------ | ------------------ | ------------------ | ------------------ |\n|file:/Users/antoi...| 36864|2022-08-25 14:15:...|[D0 CF 11 E0 A1 B...|  application/msword|key\\n\\nvalue\\n\\nh...|{meta:page-count ...|\n|file:/Users/antoi...| 34030|2022-08-25 14:16:...|[89 50 4E 47 0D 0...|           image/png|key\\n\\nvalue\\n\\nh...|{tiff:BitsPerSamp...|\n|file:/Users/antoi...| 26294|2022-08-25 14:13:...|[50 4B 03 04 14 0...|application/vnd.o...|\\n\\n\\nimage1.png\\...|{meta:page-count ...|\n|file:/Users/antoi...| 22805|2022-08-25 14:13:...|[25 50 44 46 2D 3...|     application/pdf|\\n \\n \\n\\n \\n\\nke...|{dc:format -\u003e app...|\n\nFor Tesseract support, make sure the Tesseract library is available on each executor. This can be achieved using a \nsimple [init script](https://docs.databricks.com/clusters/init-scripts.html) as follows:\n\n```shell\n#!/usr/bin/env bash\nsudo apt-get install -y tesseract-ocr\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabrickslabs%2Ftika-ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabrickslabs%2Ftika-ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabrickslabs%2Ftika-ocr/lists"}