{"id":28260135,"url":"https://github.com/aborroy/alf-tengine-convert2md","last_synced_at":"2025-10-03T23:47:05.204Z","repository":{"id":293424247,"uuid":"983988648","full_name":"aborroy/alf-tengine-convert2md","owner":"aborroy","description":"AI‑powered Alfresco Transform Engine that converts PDF files to clean, richly‑described Markdown.","archived":false,"fork":false,"pushed_at":"2025-05-23T06:45:57.000Z","size":44,"stargazers_count":3,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-17T22:39:26.348Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aborroy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-15T08:14:57.000Z","updated_at":"2025-06-05T16:34:29.000Z","dependencies_parsed_at":"2025-05-15T09:28:28.085Z","dependency_job_id":"1b1def8c-25f8-4bcb-b36b-0c577830e304","html_url":"https://github.com/aborroy/alf-tengine-convert2md","commit_stats":null,"previous_names":["aborroy/alf-tengine-convert2md"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aborroy/alf-tengine-convert2md","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-convert2md","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-convert2md/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-convert2md/releases","manifests_url":"ht
tps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-convert2md/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aborroy","download_url":"https://codeload.github.com/aborroy/alf-tengine-convert2md/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aborroy%2Falf-tengine-convert2md/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263819241,"owners_count":23516123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-20T04:09:21.326Z","updated_at":"2025-10-03T23:47:05.184Z","avatar_url":"https://github.com/aborroy.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# alf-tengine-convert2md\n[![Build](https://img.shields.io/badge/build-Maven_3.9+-blue?logo=apachemaven)](pom.xml)\n[![Docker Compose](https://img.shields.io/badge/run-docker--compose-blue?logo=docker)](compose.yaml)\n[![License](https://img.shields.io/github/license/aborroy/alf-tengine-convert2md)](LICENSE)\n\nAI‑powered Alfresco Transform Engine that converts PDF files to clean, richly‑described Markdown.\n\n## Features\n\n| Capability      | Details                                                                                                                       |\n| --------------- | ----------------------------------------------------------------------------------------------------------------------------- |\n| PDF to Markdown | Extracts text and layout, turning each page into Markdown                               
                                      |\n| Image captions  | Uses OCR **and** an LLaVA multimodal model (via [Ollama](https://ollama.ai)) to inject descriptive `![alt]` captions in‑line  |\n| Alfresco‑ready  | Implements the Alfresco Transform Core SPI (`TransformEngine` \u0026 `CustomTransformer`)                                          |\n| Containerised   | Multi‑stage Docker build (Java 17 + Python 3.11) with health‑check                                                            |\n| Configurable    | All knobs live in `application.yml` or environment variables                                                                  |\n\n\u003e See also a similar project at https://github.com/becpg/becpg-transform-markdown\n\n## Quick start\n\n### 1. Prerequisites\n\n* Java 17 \u0026 Maven 3.9+ (for local builds)\n* Docker (for running the service)\n* An Ollama daemon serving the `llava` model at `http://localhost:11434`\n\n```bash\nollama pull llava   # once\nollama serve        # or `ollama run llava` in another shell\n```\n\n### 2. 
Run with Docker Compose\n\n```bash\ngit clone git@github.com:aborroy/alf-tengine-convert2md.git\ncd alf-tengine-convert2md\ndocker compose up --build -d\n```\n\n*The service will be reachable at [http://localhost:8090](http://localhost:8090)*\n\nTest it:\n\n```bash\ncurl -X POST \\\n     -F \"file=@/path/to/document.pdf\" \\\n     \"http://localhost:8090/transform?sourceMimetype=application/pdf\u0026targetMimetype=text/markdown\" \\\n     -o output.md\n```\n\nOptionally, the image handling mode can be set with the `image` parameter:\n\n```bash\ncurl -X POST \\\n     -F \"file=@/path/to/document.pdf\" \\\n     \"http://localhost:8090/transform?sourceMimetype=application/pdf\u0026targetMimetype=text/markdown\u0026image=described\" \\\n     -o output.md\n```\n\nAccepted values for the `image` parameter:\n\n* `placeholder`: only the position of the image is marked in the output\n* `embedded`: the image is embedded as a base64-encoded string\n* `referenced`: the image is exported in PNG format and referenced from the main exported document\n* `described`: the image description is inserted at the position of the image. This is not a native Docling option.\n\nWhen the image mode is `described`, the language of the descriptions can be set with the `language` parameter:\n\n```bash\ncurl -X POST \\\n     -F \"file=@/path/to/document.pdf\" \\\n     \"http://localhost:8090/transform?sourceMimetype=application/pdf\u0026targetMimetype=text/markdown\u0026image=described\u0026language=spanish\" \\\n     -o output.md\n```\n\n\u003e Accepted languages: english, spanish, french, german, italian, portuguese\n\n### 3. 
Build \u0026 run locally\n\n```bash\nmvn clean package -DskipTests\njava -jar target/alf-tengine-convert2md-0.8.0.jar\n```\n\nThe app listens on `:8090` by default.\n\n## Configuration\n\n| Property                                    | Default                  | Purpose                           |\n|---------------------------------------------|--------------------------|-----------------------------------|\n| `SPRING_AI_OLLAMA_BASE_URL`                 | `http://localhost:11434` | Endpoint for the Ollama REST API  |\n| `spring.servlet.multipart.max-file-size`    | `100MB`                  | Max upload size                   |\n| `spring.servlet.multipart.max-request-size` | `100MB`                  | Max request size                  |\n| `transform.image.default`                   | `placeholder`            | Default image embedding mode      |\n| `transform.language.default`                | `english`                | Default language for descriptions |\n\nEdit `src/main/resources/application.yml` or supply env vars/`‑D` flags.\n\n### Testing with the HTML Interface\n\nAfter starting the service, open the test application at [http://localhost:8090](http://localhost:8090). Use the following input values:\n\n- **file**: Upload a PDF file\n- **sourceMimetype**: `application/pdf`\n- **targetMimetype**: `text/markdown`\n- **image**: `placeholder` (leave this parameter unset to use the default image mode)\n- **language**: `english` (leave this parameter unset to use the default language)\n\nClick the **Transform** button to process the PDF file. The result is returned as a Markdown file.\n\n## Building the Docker Image\n\n### Requirements\n\n- Docker 4.30+\n\n### Building the Image\n\nFrom the project root directory, build the Docker image with:\n\n```bash\ndocker build . 
-t alf-tengine-convert2md\n```\n\nThis will create a Docker image named `alf-tengine-convert2md:latest` in your local Docker repository\n\n## Deploying with Alfresco Community 25.x\n\nEnsure your `compose.yaml` file includes the following configuration:\n\n```yaml\nservices:\n  alfresco:\n    environment:\n      JAVA_OPTS : \u003e-\n        -DlocalTransform.core-aio.url=http://transform-core-aio:8090/\n        -DlocalTransform.md.url=http://transform-md:8090/\n\n  transform-core-aio:\n    image: alfresco/alfresco-transform-core-aio:5.1.7\n\n  transform-md:\n    image: docker.io/angelborroy/alf-tengine-convert2md\n```\n\nKey Configuration Updates\n\n- Add `localTransform.md.url` to the Alfresco service (`http://transform-md:8090/` by default)\n- Define the `transform-md` service using the custom-built image\n\n*Ensure you have built the Docker image (`alf-tengine-convert2md`) before running Docker Compose*\n\n## Deploying with Alfresco Enterprise 25.x\n\nEnsure your `compose.yaml` file includes the following configuration:\n\n```yaml\nservices:\n  alfresco:\n    environment:\n      JAVA_OPTS : \u003e-\n        -Dtransform.service.enabled=true\n        -Dtransform.service.url=http://transform-router:8095\n        -Dsfs.url=http://shared-file-store:8099/\n\n  transform-router:\n    image: quay.io/alfresco/alfresco-transform-router:4.1.7\n    environment:\n      CORE_AIO_URL: \"http://transform-core-aio:8090\"\n      TRANSFORMER_URL_MD: \"http://transform-md:8090\"\n      TRANSFORMER_QUEUE_MD: \"markdown-engine-queue\"\n\n  transform-md:\n    image: docker.io/angelborroy/alf-tengine-convert2md\n    environment:\n      ACTIVEMQ_URL: \"nio://activemq:61616\"\n      FILE_STORE_URL: \u003e-\n        http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file\n```\n\nKey Configuration Updates\n\n- Register the Markdown transformer with `transform-router`\n    - URL: `http://transform-md:8090/` (default)\n    - Queue Name: `markdown-engine-queue` 
(defined in `application.yaml`)\n- Define the `transform-md` service and link it to ActiveMQ and Shared File Store services\n\n*Ensure you have built the Docker image (`alf-tengine-convert2md`) before running Docker Compose*\n\n## Internals\n\n```mermaid\ngraph TD\n    A[PDF File] --\u003e|Multipart| B[MarkdownTransformer]\n    B --\u003e C[DoclingService]\n    C --\u003e|text| D[Markdown]\n    C --\u003e|images| E(OCR + LLaVA)\n    E --\u003e D\n```\n\n* `MarkdownEngine` Declares the Alfresco Engine (`markdown`) and its capabilities\n* `MarkdownTransformer` Streams the PDF to a temp file and invokes `DoclingService`\n* `DoclingService`\n\n    * Parses PDF with [Docling](https://pypi.org/project/docling/)\n    * Describes embedded images via Spring AI using Ollama `llava`\n\n---\n\n## Build and Publish to Your Docker Registry\n\nThe image `docker.io/angelborroy/alf-tengine-convert2md` is already available on Docker Hub.\n\nIf you want to build and publish your own version of the image (e.g., to your own Docker Hub account or private registry), follow these steps:\n\n### Enable Buildx (if not already enabled)\n\n```bash\ndocker buildx create --name multiarch-builder --use\ndocker buildx inspect --bootstrap\n```\n\n### Build and Push Multi-Arch Image\n\nReplace `yourdockeruser` with your Docker Hub username or private registry path:\n\n```bash\ndocker buildx build --no-cache \\\n  --platform linux/amd64,linux/arm64 \\\n  --attest type=sbom --attest type=provenance,mode=max \\\n  --tag yourdockeruser/alf-tengine-convert2md:latest \\\n  --push .\n```\n\nThis command:\n\n* Builds the image for both `amd64` and `arm64` architectures.\n* Tags it as `yourdockeruser/alf-tengine-convert2md:latest`.\n* Pushes it to your specified registry.\n\n\u003e Make sure you're logged into your Docker registry before pushing:\n\n```bash\ndocker 
login\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborroy%2Falf-tengine-convert2md","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faborroy%2Falf-tengine-convert2md","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faborroy%2Falf-tengine-convert2md/lists"}