{"id":25353926,"url":"https://github.com/aborroy/alf-tengine-ocr","last_synced_at":"2025-10-29T22:30:53.277Z","repository":{"id":46860759,"uuid":"229290455","full_name":"aborroy/alf-tengine-ocr","owner":"aborroy","description":"Alfresco Transformer For ACS 70+ from PDF to OCRd PDF","archived":false,"fork":false,"pushed_at":"2024-11-18T07:16:06.000Z","size":199,"stargazers_count":20,"open_issues_count":3,"forks_count":12,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-18T08:26:02.941Z","etag":null,"topics":["ocr","pdf"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aborroy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-20T15:35:56.000Z","updated_at":"2024-11-18T07:16:11.000Z","dependencies_parsed_at":"2024-11-18T08:33:31.440Z","dependency_job_id":null,"html_url":"https://github.com/aborroy/alf-tengine-ocr","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-ocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-ocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-ocr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-ocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aborroy","download_url":"https://codeload.githu
b.com/aborroy/alf-tengine-ocr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238902582,"owners_count":19549776,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","pdf"],"created_at":"2025-02-14T19:56:05.695Z","updated_at":"2025-10-29T22:30:52.026Z","avatar_url":"https://github.com/aborroy.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Alfresco Transformer from PDF to OCRd PDF\n\nThis project includes a simple Transformer for Alfresco from PDF to OCRd PDF to be used with **ACS Community 7.0+**\n\n\u003e\u003e OCR Transformation is performed by [ocrmypdf](https://ocrmypdf.readthedocs.io/en/latest/), a wrapper of [Tesseract](https://github.com/tesseract-ocr/tesseract) that includes additional features in order to improve the accuracy of the process.\n\nThe Transformer `ats-transformer-ocr` uses the new Alfresco Local Transform API, that allows to register a Spring Boot Application as a local transformation service.\n\nThe folder `embed-metadata-action` includes an Alfresco Repository Addon that enables the action `embed-metadata` in Folder Rule feature.\n\n**ACS Community 7.4 or later** requires modifying default configuration for HTTP requests timeouts. 
Increase default values (5000 ms / 5 s) to a larger value, like in the following sample that uses 500000 ms / 500 s\n\n```\nhttpclient.config.transform.socketTimeout=500000\nhttpclient.config.transform.connectionRequestTimeout=500000\nhttpclient.config.transform.connectionTimeout=500000\n```\n\n## Local testing\n\n### Build Docker Image for Alfresco OCR Transformer\n\nBuilding the Alfresco OCR Transformer Docker Image is required before running the Docker Compose template provided.\n\n```\n$ cd ats-transformer-ocr\n\n$ mvn clean package\n```\n\nMaven will create a Docker Image named `alfresco/tengine-ocr:latest`\n\n### Starting\n\n```\n$ docker run -p 8090:8090 alfresco/tengine-ocr:latest\n```\n\n### Testing\n\nA sample web page has been created in order to test the transformer is working:\n\nhttp://localhost:8090\n\n\n## Deployment with ACS Stack\n\n### Obtaining Repository Addon to enable Embed Metadata Action\n\nBefore deploying Alfresco OCR Transformer, `embed-metadata-action` Repository Addon should be built.\n\n```\n$ cd embed-metadata-action\n\n$ mvn clean package\n\n$ ls target/embed-metadata-action-1.0.0.jar\ntarget/embed-metadata-action-1.0.0.jar\n```\n\nAlternatively `embed-metadata-action-1.0.0.jar` can be download from [Releases](https://github.com/aborroy/alf-tengine-ocr/releases/download/1.0.0/embed-metadata-action-1.0.0.jar)\n\n### Deploying Repository Addon to enable Embed Metadata Action\n\nUse some of the available alternatives to deploy `embed-metadata-action-1.0.0.jar` in alfresco service, like adding the JAR to `alfresco/modules/jar` folder when using [Alfresco Docker Installer](https://github.com/alfresco/alfresco-docker-installer) tool.\n\n### Adding Alfresco OCR Transformer to Docker Compose (Local Transformer - HTTP) - Community Edition\n\nReview that the following configuration is applied to `docker-compose.yml` file.\n\n```\nservices:\n    alfresco:\n        environment:\n            JAVA_OPTS : \"\n                
-DlocalTransform.core-aio.url=http://transform-core-aio:8090/\n                -DlocalTransform.ocr.url=http://transform-ocr:8090/\n            \"\n\n    transform-core-aio:\n        image: alfresco/alfresco-transform-core-aio:2.3.10\n        mem_limit: 1536m\n        environment:\n            JAVA_OPTS: \" -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80\"\n\n    transform-ocr:\n        image: alfresco/tengine-ocr:latest\n        mem_limit: 1536m\n        environment:\n            JAVA_OPTS: \" -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80\"\n```\n\n* Include the `localTransform` URL for OCR Transformer in `alfresco` Docker Container, http://transform-ocr:8090/ by default\n* Declare the new `transform-ocr` Docker Container\n\n\u003e\u003e Remember that you need to build Docker Image for `alfresco/tengine-ocr` before running this composition\n\nStart ACS Stack from folder containing `docker-compose.yml` file.\n\n```\n$ docker-compose up --build --force-recreate\n```\n\nSample deployment is available in [docker](docker) folder.\n\n\n### Adding Alfresco OCR Transformer to Docker Compose (Async Transformer - ActiveMQ) - Enterprise Edition\n\nReview that the following configuration is applied to `docker-compose.yml` file.\n\n```\nservices:\n    alfresco:\n        environment:\n            JAVA_OPTS : \"\n              -Dlocal.transform.service.enabled=true\n              -Dtransform.service.enabled=true\n              -Dtransform.service.url=http://transform-router:8095\n              -Dsfs.url=http://shared-file-store:8099/\n            \"\n\n    transform-router:\n      image: quay.io/alfresco/alfresco-transform-router:${TRANSFORM_ROUTER_TAG}\n      environment:\n        JAVA_OPTS: \" -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80\"\n        ACTIVEMQ_URL: \"nio://activemq:61616\"\n        CORE_AIO_URL: \"http://transform-core-aio:8090\"\n        TRANSFORMER_URL_OCR: \"http://transform-ocr:8090\"\n        TRANSFORMER_QUEUE_OCR: \"ocr-engine-queue\"\n        
FILE_STORE_URL: \"http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file\"\n\n    transform-ocr:\n      image: alfresco/tengine-ocr:latest\n      mem_limit: 1536m\n      environment:\n        JAVA_OPTS: \" -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80 \n\t\t  -Docrmypdf.path=ocrmypdf -Docrmypdf.arguments=--skip-text -Dqueue.engineRequestQueue=ocr-engine-queue\n\t\t \"\n        ACTIVEMQ_URL: \"nio://activemq:61616\"\n        FILE_STORE_URL: \"http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file\"\n```\n\n* You can optionally disable `local.transform` service in `alfresco` Docker Container and enable `transform` service (asynchronous). Local Transform Service or Transform Service (supports only asynchronous requests) can be enabled or disabled independently of each other. Please keep in mind that when your deployment has Share and SOLR (think of full text indexing), or both then you'll need to have `local.transform` and `transform` service (asynchronous) enabled and running. The Repository will try to transform content using the Transform Service via the T-Router if possible and fall back to direct Local Transform Service. 
Share makes use of both, so functionality such as preview will be unavailable if `local.transform` service is disabled.\n* Add OCR Transformer configuration to `transform-router` Docker Container: URL (http://transform-ocr:8090/ by default) and Queue Name (`ocr-engine-queue` as declared in [ats-transformer-ocr/src/main/resources/application-default.yaml](ats-transformer-ocr/src/main/resources/application-default.yaml))\n* Declare the new `transform-ocr` Docker Container using the ActiveMQ and Shared File services\n\n\u003e\u003e Remember that you need to build Docker Image for `alfresco/tengine-ocr` before running this composition\n\nStart the ACS Stack from the folder containing the `docker-compose.yml` file.\n\n```\n$ docker-compose up --build --force-recreate\n```\n\nSample deployment is available in [docker-enterprise](docker-enterprise) folder.\n\n\n### Defining the OCR Rule in Alfresco Share\n\nUse your browser to open the Alfresco Share app (by default available at http://localhost:8080/share/)\n\nCreate a folder and add the following rule (`Manage Rules` folder option):\n\n* When: Items are created or enter this folder\n* If all criteria are met: Mimetype is 'Adobe PDF Document'\n* Perform Action: Embed properties as metadata in content\n\n\u003e\u003e To limit the number of parallel OCR processing threads, use the **Run rule in background** checkbox.\n\nFrom that point on, every PDF file uploaded to the folder will be OCRd. The original PDF remains as version 1.0, while the copy with the text layer is labeled version 1.1.\n\n## Customizing ocrmypdf arguments\n\nBy default, Alfresco OCR Transformer provides the following `ocrmypdf` configuration.\n\n```\n# Executable command for ocrmypdf program\nocrmypdf.path=ocrmypdf\n\n# Arguments for ocrmypdf invocation. This is the optimized option. 
\n# If --skip-text is issued, then no image processing or OCR will be performed on pages that already have text.\nocrmypdf.arguments=--skip-text\n\n# To force OCR, use the following:\nocrmypdf.arguments=--force-ocr\n```   \n\nConfiguration can be changed by using Docker environment variables from command line.\n\n```\n$ docker run -p 8090:8090 -e OCRMYPDF_ARGUMENTS='--skip-text -l eng' alfresco/tengine-ocr:latest\n```\n\nOr with the equivalent notation in `docker-compose.yml`\n\n```\ntransform-ocr:\n    image: alfresco/tengine-ocr:latest\n    mem_limit: 1536m\n    environment:\n      JAVA_OPTS: \"-XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80 -Dqueue.engineRequestQueue=ocr-engine-queue\"\n      OCRMYPDF_ARGUMENTS: \"--skip-text -l eng\"\n```\n\n## Additional contributors\n\n* Thanks to [dgradecak](https://github.com/dgradecak) for the `embed-metadata` action approach: https://github.com/aborroy/alf-tengine-ocr/pull/2\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborroy%2Falf-tengine-ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faborroy%2Falf-tengine-ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborroy%2Falf-tengine-ocr/lists"}