{"id":50215981,"url":"https://github.com/miku/doclingclient","last_synced_at":"2026-05-26T09:01:40.406Z","repository":{"id":358458185,"uuid":"1237851390","full_name":"miku/doclingclient","owner":"miku","description":"A Go docling client library and CLI","archived":false,"fork":false,"pushed_at":"2026-05-17T13:19:32.000Z","size":666,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-17T14:48:08.433Z","etag":null,"topics":["chunking","docling","document-conversion","markdown","pdf"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/miku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-13T15:10:44.000Z","updated_at":"2026-05-17T13:19:36.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/miku/doclingclient","commit_stats":null,"previous_names":["miku/doclingclient"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/miku/doclingclient","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fdoclingclient","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fdoclingclient/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fdoclingclient/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fdocling
client/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/miku","download_url":"https://codeload.github.com/miku/doclingclient/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/miku%2Fdoclingclient/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33512327,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T03:12:49.672Z","status":"ssl_error","status_checked_at":"2026-05-26T03:12:47.976Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chunking","docling","document-conversion","markdown","pdf"],"created_at":"2026-05-26T09:01:24.410Z","updated_at":"2026-05-26T09:01:40.383Z","avatar_url":"https://github.com/miku.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# doclingclient\n\nA Go [docling](https://www.docling.ai/) client library and CLI.\n[Docling](https://www.docling.ai/) is a deep learning document analysis and\nconversion project, which can also be run [as\nservice](https://github.com/docling-project/docling-serve). 
This project helps\nto decouple the document processing, which may benefit from a GPU, from the\nclient, which may be a lower spec machine.\n\n[![](static/Europeana.eu-90402-RP_T_2013_34_6_R_-98a3ebb632a8943697d57d5193f12634-s.jpeg)](https://www.europeana.eu/en/item/90402/RP_T_2013_34_6_R_)\n\n## Installation\n\n```shell\n$ go install github.com/miku/doclingclient/cmd/docli@latest\n```\n\nPackages (deb, rpm), cf.\n[releases](https://github.com/miku/doclingclient/releases). Quick start:\n\n```shell\n$ docli --server http://docling.city:5001 convert https://arxiv.org/pdf/2110.06595\n```\n\n\n## Background, Prompt\n\n[Docling serve](https://github.com/docling-project/docling-serve) supplies an [openapi](openapi.json) spec, currently using version\n3.1.0 of the standard.\n\n```\n$ jq -rc '.paths | keys[]' openapi.json\n/health\n/openapi-3.0.json\n/ready\n/v1/chunk/hierarchical/file\n/v1/chunk/hierarchical/file/async\n/v1/chunk/hierarchical/source\n/v1/chunk/hierarchical/source/async\n/v1/chunk/hybrid/file\n/v1/chunk/hybrid/file/async\n/v1/chunk/hybrid/source\n/v1/chunk/hybrid/source/async\n/v1/clear/converters\n/v1/clear/results\n/v1/convert/file\n/v1/convert/file/async\n/v1/convert/source\n/v1/convert/source/async\n/v1/memory/counts\n/v1/memory/stats\n/v1/result/{task_id}\n/v1/status/poll/{task_id}\n/version\n```\n\nUnfortunately, an SDK generated from a spec can be quite large and may have\ndownsides; cf. also [this\ncomparison](https://www.speakeasy.com/docs/sdks/languages/golang/oss-comparison-go#go-sdk-generator-options).\n\nHence, we decided to use a more manual approach. We use an LLM to build a\nsimple, mostly idiomatic client for the core functionality first. 
For docling this may\nbe just \"/v1/convert/file\" and \"/v1/convert/source\"; these already serve\nmost use cases.\n\nCreate a minimal Go library, then wrap a nice CLI around the library, so\ninteracting with the docling service becomes easy to integrate into shell\nscripts or ad-hoc human (and maybe agentic) terminal use.\n\n**Status**: Library and CLI cover synchronous conversion\n`/v1/convert/{source,file}`, synchronous chunking\n`/v1/chunk/{hybrid,hierarchical}/{source,file}`, and the `/health`, `/ready`,\nand `/version` routes. Async conversion and async chunking are not yet wrapped.\n\n## Library\n\n```go\nimport \"github.com/miku/doclingclient\"\n\nc := doclingclient.New(\"http://localhost:5001\",\n    doclingclient.WithAPIKey(\"sk-...\"),\n    doclingclient.WithTimeout(10*time.Minute),\n)\n\n// Convert a URL.\nresp, err := c.ConvertURL(ctx, \"https://arxiv.org/pdf/2206.01062\", doclingclient.ConvertOptions{})\n\n// Convert a local file (streamed multipart upload). Note the plain\n// assignment: resp and err are already declared above.\nresp, err = c.ConvertPath(ctx, \"paper.pdf\", doclingclient.ConvertOptions{\n    ToFormats: []doclingclient.OutputFormat{\n        doclingclient.FormatMD,\n        doclingclient.FormatJSON,\n    },\n    DoOCR:    doclingclient.Ptr(true),\n    Pipeline: doclingclient.PipelineStandard,\n})\n\n// A 200 response can still describe a conversion failure; check it.\nif err := resp.Err(false); err != nil {\n    log.Fatal(err)\n}\nfmt.Println(resp.Document.MarkdownContent())\n\n// Single-struct request: redirect the result with an explicit delivery\n// target. The server defaults to inbody; use PutTarget / S3Target / ZipTarget\n// on /v1/convert/source. 
The multipart /v1/convert/file endpoint only\n// supports inbody and zip, expressed as TargetTypeInBody / TargetTypeZip via\n// ProcessFileRequest.TargetType.\nresp, err = c.ProcessURL(ctx, doclingclient.ProcessURLRequest{\n    Sources: []doclingclient.Source{\n        doclingclient.NewHTTPSource(\"https://arxiv.org/pdf/2206.01062\"),\n    },\n    Target: doclingclient.PutTarget{URL: \"https://sink.example/result\"},\n})\n```\n\nThe library covers `/v1/convert/source` (URL or base64 in-body), `/v1/convert/file`\n(streamed multipart upload), and the `/health`, `/ready`, `/version` routes.\nThe struct in `types.go` is a deliberate subset of `ConvertDocumentsOptions`;\nextend it as needed for full coverage.\n\nNote on output formats: the docling-serve `OutputFormat` enum also defines\n`yaml`, `html_split_page`, and `vtt`, but the `ExportDocumentResponse` object\ndoes not carry corresponding content fields, so this library and CLI do not\nsurface them. The five exposed formats (`md`, `json`, `html`, `text`, and\n`doctags`) match what the server actually returns.\n\n## CLI\n\nA minimal command, `docli`, wraps the library. It is named to avoid collision\nwith the upstream `docling` CLI.\n\n```sh\ngo install github.com/miku/doclingclient/cmd/docli@latest\n\n# Convert a URL (default output: markdown to stdout).\ndocli convert https://arxiv.org/pdf/2206.01062 \u003e paper.md\n\n# Convert a local file as JSON.\ndocli convert --to json paper.pdf \u003e paper.json\n\n# Produce several formats at once and write them to a directory.\ndocli convert --to md,json,html --output ./out paper.pdf\n# =\u003e ./out/paper.md, ./out/paper.json, ./out/paper.html\n\n# Talk to a remote docling-serve, with auth.\nDOCLING_SERVER=https://docling.example.org \\\nDOCLING_API_KEY=sk-... 
\\\n    docli convert paper.pdf\n\n# Server checks.\ndocli health\ndocli ready\ndocli version\n```\n\n### Chunking for RAG / embeddings\n\n`docli chunk` converts a document and splits it into chunks suitable for\nfeeding into an embedding model. Output is JSONL on stdout — one chunk per\nline — which composes naturally with `jq`.\n\n```sh\n# Default hybrid chunker (tokenization-aware).\ndocli chunk paper.pdf \u003e chunks.jsonl\n\n# Pick a tokenizer and cap chunks to 512 tokens.\ndocli chunk --max-tokens 512 \\\n    --tokenizer Qwen/Qwen3-Embedding-0.6B \\\n    paper.pdf \u003e chunks.jsonl\n\n# Structural chunks (one per document element, no tokenizer).\ndocli chunk --chunker hierarchical paper.pdf \u003e chunks.jsonl\n\n# Inspect chunk lengths.\njq -r '.num_tokens // (.text | length)' \u003c chunks.jsonl | sort -n | uniq -c\n```\n\nEach chunk carries `text` (with headings/captions inlined for context),\noptional `raw_text` (with `--include-raw-text`), `num_tokens`, `headings`,\n`captions`, `page_numbers`, and `doc_items` references into the source\ndocument.\n\n#### Tokenizer choice\n\nThe hybrid chunker counts tokens to keep each chunk within a budget. That\nbudget is meaningful only relative to a specific tokenizer — and you almost\nalways want the tokenizer to match the embedding model you'll feed the chunks\ninto downstream, so chunk sizes line up with the embedder's context window.\n\ndocling-serve accepts any HuggingFace tokenizer identifier as `--tokenizer`\n(OpenAI/tiktoken tokenizers are not reachable through the server). The default\nis `sentence-transformers/all-MiniLM-L6-v2`. 
If you don't pass `--max-tokens`,\nthe cap is derived from the tokenizer's `model_max_length`.\n\nA few common picks, biased toward what shows up in docling's own examples and\ntypical RAG stacks:\n\n| Tokenizer (HuggingFace ID)                  | Max tokens | Notes                                                    |\n|---------------------------------------------|------------|----------------------------------------------------------|\n| `sentence-transformers/all-MiniLM-L6-v2`    | 256        | Default. Tiny, fast, English-only. Good baseline.        |\n| `sentence-transformers/all-mpnet-base-v2`   | 384        | Higher-quality English embeddings, still small.          |\n| `BAAI/bge-small-en-v1.5`                    | 512        | Strong small English model, widely used in RAG.          |\n| `BAAI/bge-m3`                               | 8192       | Multilingual, long-context. Good general-purpose pick.   |\n| `intfloat/multilingual-e5-large`            | 512        | Multilingual, balanced quality/size.                     |\n| `nomic-ai/nomic-embed-text-v1.5`            | 8192       | Long-context English.                                    |\n| `Qwen/Qwen3-Embedding-0.6B`                 | 32768      | Long-context, multilingual, newer.                       |\n\nRule of thumb: pick the tokenizer that ships with the embedding model you\nplan to call after `docli chunk`. Mixing them silently misaligns the token\ncount and leads to chunks that overflow (or underfill) the real embedder.\n\nThe server needs to fetch the tokenizer the first time it sees it. In\nair-gapped deployments only models already cached on the server will work.\n\n### Conversion flags (shared by `convert` and `chunk`)\n\nThese flags tune the underlying document conversion. They apply identically\nto `docli convert` and `docli chunk`. 
Numeric and boolean defaults marked\n`(auto)` are sent only when you set them explicitly, so docling-serve's own\ndefaults stay authoritative on bare invocations.\n\n| Flag                  | Default | Description                                                            |\n|-----------------------|---------|------------------------------------------------------------------------|\n| `--from`              | (auto)  | Input formats, e.g. `pdf,docx`; server autodetects if empty.           |\n| `--ocr`               | `true`  | Enable OCR.                                                            |\n| `--force-ocr`         | `false` | Force OCR over existing text.                                          |\n| `--ocr-lang`          | (auto)  | Comma-separated OCR languages, e.g. `en,de`.                           |\n| `--table-mode`        | (auto)  | `fast` or `accurate`; server default if empty.                         |\n| `--tables`            | (auto)  | Extract table structure. Sent only when explicitly set.                |\n| `--pages`             | (all)   | Page range, e.g. `1-10` or `3`.                                        |\n| `--image-export-mode` | (auto)  | `placeholder`, `embedded`, or `referenced`. Server default if empty.   |\n| `--include-images`    | (auto)  | Include extracted images. Sent only when explicitly set.               |\n| `--images-scale`      | (auto)  | Scale factor for extracted images (server default ~2.0).               |\n| `--abort-on-error`    | `false` | Abort on first error. Sent only when explicitly set.                   |\n| `--document-timeout`  | (none)  | Per-document timeout in seconds.                                       |\n| `--pdf-backend`       | (auto)  | `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4`.|\n| `--pipeline`          | (auto)  | `legacy`, `standard`, `vlm`, or `asr`. Server default if empty.        
|\n\n### `docli convert` extras\n\n| Flag              | Default | Description                                                                       |\n|-------------------|---------|-----------------------------------------------------------------------------------|\n| `--to`, `-t`      | `md`    | Output formats: `md`, `json`, `html`, `text`, `doctags`.                          |\n| `--output`, `-o`  | (none)  | Directory to write all requested formats as `\u003cbasename\u003e.\u003cext\u003e`; stdout is silent. |\n| `--status`        | `false` | Emit one status line/object to stderr after the conversion.                       |\n| `--status-format` | `text`  | `text` or `json` (see Caching below).                                             |\n| `--cache-dir`     | (XDG)   | Override the on-disk cache directory. Env: `DOCLING_CACHE_DIR`.                   |\n| `--no-cache`      | `false` | Disable the on-disk result cache.                                                 |\n\n### `docli chunk` extras\n\n| Flag                 | Default                                  | Description                                                              |\n|----------------------|------------------------------------------|--------------------------------------------------------------------------|\n| `--chunker`          | `hybrid`                                 | Chunker strategy: `hybrid` or `hierarchical`.                            |\n| `--max-tokens`       | (auto)                                   | Hybrid only. Max tokens per chunk; derived from the tokenizer if unset.  |\n| `--tokenizer`        | `sentence-transformers/all-MiniLM-L6-v2` | Hybrid only. HuggingFace tokenizer ID. See \"Tokenizer choice\" above.     |\n| `--merge-peers`      | `true`                                   | Hybrid only. Merge undersized successive chunks with the same headings.  
|\n| `--markdown-tables`  | `false`                                  | Serialize tables as Markdown instead of triplets.                        |\n| `--include-raw-text` | `false`                                  | Populate `raw_text` on each chunk alongside the contextualized `text`.   |\n| `--pretty`           | `false`                                  | Emit the full response as indented JSON instead of one chunk per line.   |\n\nNote: `docli chunk` does not cache results; each invocation re-runs the\nconversion server-side. Only `docli convert` uses the on-disk cache.\n\nGlobal flags (any subcommand): `--server`/`-s` (env `DOCLING_SERVER`),\n`--api-key`/`-K` (env `DOCLING_API_KEY`), `--tenant`/`-T` (env\n`DOCLING_TENANT_ID`).\n\n## Caching\n\n`docli convert` caches results on disk by default, so repeating a request is\nnear-instant. The cache uses the XDG spec, typically\n`~/.cache/doclingclient/`, overridable with `--cache-dir` or\n`DOCLING_CACHE_DIR`. Disable with `--no-cache`.\n\nLayout:\n\n```\n~/.cache/doclingclient/\n├── server_version.json           # /version response, refreshed every 24 h\n└── \u003c12-char-server-hash\u003e/\n    ├── server_info.json           # full server version map for this namespace\n    └── \u003cinput-hash\u003e.json.zst     # zstd-compressed ConvertResponse JSON\n```\n\nCache key fingerprints everything that affects output: source URL or local\nfile content (SHA-256), `to_formats`, OCR settings, table mode, page range,\netc. 
The server-version directory namespaces cached results, so an upstream\ndocling-serve upgrade naturally falls into a fresh namespace — old results\nstay around for diffing or can be pruned with `rm -rf\n~/.cache/doclingclient/\u003chash\u003e/`.\n\nUse `--status` to see whether a request was served fresh or from cache:\n\n```sh\n$ docli convert --status paper.pdf \u003e /dev/null\nstatus=success processing_time=12.43s source=fresh\n$ docli convert --status paper.pdf \u003e /dev/null\nstatus=success processing_time=12.43s source=cached\n```\n\nFor ad-hoc post-processing, add `--status-format json` to emit a single JSON\nobject per run to stderr (one line, suitable for `jq` or appending to a log):\n\n```sh\n$ docli convert --status --status-format json paper.pdf \u003e paper.md\n{\"status\":\"success\",\"processing_time\":12.43,\"source\":\"fresh\",\"filename\":\"paper.pdf\",\"errors\":[]}\n\n$ docli convert --status --status-format json paper.pdf 2\u003e status.jsonl \u003e paper.md\n$ jq -r '.processing_time' \u003c status.jsonl\n12.43\n```\n\n## Testing\n\n```sh\ngo test ./...\ngo test -cover ./...\n```\n\nThe library exercises its HTTP client against `httptest.Server`; no live\ndocling-serve instance is required.\n\n## Other projects\n\n* [https://github.com/iguanesolutions/go-docling](https://github.com/iguanesolutions/go-docling)\n\n## A random thought on openapi\n\n[OpenAPI](https://en.wikipedia.org/wiki/OpenAPI_Specification) was very helpful\nto get this client started, in that the LLM could inquire the\n[openapi.json](openapi.json) file for the spec. However, we did not need to use\nany of the openapi generators, of which there are [quite a\nfew](https://www.speakeasy.com/docs/sdks/languages/golang/oss-comparison-go). 
A\nmore systematic comparison of the features of the various libraries is still\noutstanding, but one could imagine a client SDK generator built from an LLM, a\nprompt, and an openapi.json file.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fdoclingclient","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmiku%2Fdoclingclient","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmiku%2Fdoclingclient/lists"}