{"id":50528340,"url":"https://github.com/ozefe/ytcc-pipeline","last_synced_at":"2026-06-03T10:03:49.297Z","repository":{"id":359229201,"uuid":"1245077602","full_name":"ozefe/ytcc-pipeline","owner":"ozefe","description":"A synchronous Python library that converts an academic-thesis PDF into a structured JSON document plus a tar bundle of cropped figures, tables, and formulas.","archived":false,"fork":false,"pushed_at":"2026-05-21T18:11:14.000Z","size":67606,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-29T03:05:19.142Z","etag":null,"topics":["ai-pipelines","formula-recognition","grobid","layout-detection","ocr","pdf-parser","table-recognition"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/ytcc-pipeline","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ozefe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-20T22:19:05.000Z","updated_at":"2026-05-21T18:12:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ozefe/ytcc-pipeline","commit_stats":null,"previous_names":["ozefe/ytcc-pipeline"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/ozefe/ytcc-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozefe%2Fytcc-pipeline","tags_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozefe%2Fytcc-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozefe%2Fytcc-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozefe%2Fytcc-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ozefe","download_url":"https://codeload.github.com/ozefe/ytcc-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ozefe%2Fytcc-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33858580,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-pipelines","formula-recognition","grobid","layout-detection","ocr","pdf-parser","table-recognition"],"created_at":"2026-06-03T10:03:48.394Z","updated_at":"2026-06-03T10:03:49.278Z","avatar_url":"https://github.com/ozefe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ytcc-pipeline\n\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/ytcc-pipeline)\n![PyPI - License](https://img.shields.io/pypi/l/ytcc-pipeline)\n![PyPI - Status](https://img.shields.io/pypi/status/ytcc-pipeline)\n![PyPI - 
Downloads](https://img.shields.io/pypi/dm/ytcc-pipeline)\n\n\u003cimg alt=\"ytcc-pipeline mascot generated by Google's Nano Banana 2\" align=\"right\" src=\".github/mascot.png\" width=\"200\" /\u003e\n\nA synchronous Python library that converts an academic-thesis PDF into a structured JSON document plus a tar bundle of cropped figures, tables, and formulas. Ships an optional FastAPI wrapper for service deployments and Docker images for four deployment profiles.\n\n## What it does\n\nGiven a PDF, the pipeline runs eight stages in fixed order:\n\n```text\nrender -\u003e metadata -\u003e layout -\u003e blocks -\u003e table -\u003e formula -\u003e reference -\u003e bundle\n```\n\n1. **render** -- decode every page to an image (`pdf-oxide`, `cfg.render_workers` processes).\n2. **metadata** -- sha256, byte size, XMP fields. Cheap; runs before model load so I/O failures surface early.\n3. **layout** -- PP-DocLayoutV3 emits one `LayoutDetection` per detected block with `label`, `bbox`, `confidence`, `reading_order`.\n4. **blocks** -- per-page dispatcher routes each detection by `Route`. Text comes from the PDF text layer (digital-born) or RapidOCR over the rendered crop (scanned). Figures, tables, and formulas are cropped and saved.\n5. **table** (opt-in) -- RapidTable SLANet+ recovers cell grids for `TABLE` blocks.\n6. **formula** (default on) -- PP-FormulaNet-L recovers LaTeX from every `FORMULA` block's crop.\n7. **reference** (opt-in) -- batched POST to an externally-managed GROBID server enriches `REFERENCE` blocks with parsed `Reference` records.\n8. **bundle** -- pack `document.json` and every saved crop into a single uncompressed tar.\n\nDisabled stages never load their model. 
Each stage emits one INFO log line on completion; skips short-circuit with `skipped reason=...`.\n\n## Features\n\n- **One public entry point.** `process_pdf(pdf_path, language=..., ...)` -- everything else is implementation detail.\n- **Auto-detected digital-born vs scanned.** The orchestrator probes the PDF's text layer once and picks the path. Override with `digital_born=True/False`.\n- **Two text-extraction paths, asymmetric workers.** `digital_born_workers` (cheap `pdf_oxide` processes) and `ocr_workers` (RapidOCR + CUDA, ~2 GiB VRAM each) are tuned independently.\n- **Bucketed formula batching.** Sorts crops by bbox area into small/medium/large buckets with per-bucket `max_new_tokens` caps. Measured 1.63x speedup over flat batching.\n- **fp16 + cv2 fast preproc on layout.** ~1.7x and ~2.3x isolated layout-stage speedups on the SafeTensors backend.\n- **Auto-DPI on the digital-born path.** Renders at 150 DPI for digital-born (layout downsamples to 800x800 anyway, `pdf_oxide` is resolution-independent), 300 DPI for scanned. Halves render wall.\n- **Injectable resident models.** Pass pre-loaded `LayoutAnalyzer`, `FormulaRecognizer`, and `TableEngine` to skip the ~5s + ~3s + ~1s reload between calls. The FastAPI service does this in `lifespan`.\n- **Frozen-dataclass schema.** `Document` / `Page` / `Block` / `Cell` / `Reference` are immutable; `dataclasses.replace` is the only rewrite path. JSON serialisation is stable across runs (modulo `uuid4` crop filenames).\n- **Streaming-first tar bundle.** `document.json` is the first archive member; consumers parse the index before the image bytes arrive. The FastAPI service ships it directly via `FileResponse`.\n- **Three-layer config.** `config.toml` (service-wide), `YTCC_*` env vars (per-field overrides), `PipelineConfig(...)` keyword args (per-call). Unknown TOML keys raise `ValueError`.\n\n## Requirements\n\n- Python 3.14+.\n- A CUDA GPU for any non-trivial throughput. 
CPU works but is neither tested nor recommended.\n- An externally-managed GROBID server when `references_enabled=true` (not bundled).\n\nThe project pins CUDA-enabled `torch` and `onnxruntime-gpu`. Replace these with CPU wheels if you target CPU-only environments; nothing else in the codebase assumes CUDA at import time.\n\n## Installation\n\n```bash\npip install ytcc-pipeline       # library only\npip install ytcc-pipeline[api]  # + FastAPI service\npip install ytcc-pipeline[dev]  # + tests + lint + typing + benchmarks (includes [api])\n```\n\n## Library quickstart\n\n```python\nfrom ytcc_pipeline import process_pdf\n\n# paper.tar: contains document.json + images/*\nbundle_path = process_pdf(\"paper.pdf\", language=\"en\")\n```\n\n`process_pdf` is **synchronous and blocking** -- internally it uses `multiprocessing` pools with the `spawn` start method, not asyncio. Call it from a thread (`asyncio.to_thread(process_pdf, ...)`) if you need to integrate with an event loop.\n\n## Service quickstart\n\n```bash\npip install ytcc-pipeline[api]\nuvicorn ytcc_pipeline.api.app:app --host 0.0.0.0 --port 8000\n```\n\nThe service reads `config.toml` at startup -- override the path with `YTCC_CONFIG=/path/to/config.toml`. One process, one GPU, one PDF at a time; concurrency is serialised on an `asyncio.Lock`.\n\n```bash\ncurl -X POST http://localhost:8000/process \\\n  -F \"pdf=@paper.pdf\" \\\n  -F \"language=en\" \\\n  -o paper.tar\n```\n\n`GET /health` reports liveness + readiness (`{\"status\":\"ok\",\"model_loaded\":true}`). `POST /process` accepts `pdf` (file), `language` (ISO 639-1), and optional `digital_born` (`true`/`false`). The response body is the tar bundle; the `X-Processing-Time` header carries the server-side wall time in seconds.\n\n\u003e [!CAUTION]\n\u003e Run **one** uvicorn worker per GPU. 
`--workers N\u003e1` multi-loads every resident model and contends for VRAM.\n\n## Docker\n\nPre-built images for four deployment profiles are published to GitHub Container Registry, each available as a slim variant (~6 GB, models fetched on first request) or a baked variant (~11 GB, models pre-downloaded). Pin to a versioned tag in production.\n\n| Tag | Profile | Use case |\n|---|---|---|\n| `:scanned[-baked]` | Mixed (digital-born + scanned) | Default; loads OCR engines |\n| `:digital-born[-baked]` | `scanned_enabled=false` | Rejects scanned PDFs with HTTP 415; saves ~12 GiB VRAM |\n| `:digital-born-a100[-baked]` | A100-tuned | Larger batches, `formula_torch_compile=true` |\n| `:text-extract[-baked]` | 48GB VRAM-tuned, formula + table OFF | Text + image + references only; ~10x faster on math-heavy theses |\n\nEach profile ships a compose file with a GROBID sidecar:\n\n```bash\ndocker compose -f docker/compose.scanned.yml up -d\ncurl -X POST http://localhost:8000/process \\\n    -F \"pdf=@paper.pdf\" \\\n    -F \"language=en\" \\\n    -o paper.tar\n```\n\nOverride config via env vars (every `PipelineConfig` field is reachable via `YTCC_\u003cUPPERCASE_FIELD\u003e`) or by mounting your own TOML over `/app/config.toml`. See `docker/README.md` for the full image matrix, build instructions, and troubleshooting.\n\n## Output format\n\nThe output is one uncompressed tar:\n\n```text\npaper.tar\n├── document.json              # the schema document\n└── images/                    # cropped block images\n    ├── 0001-image-{uuid}.png\n    ├── 0014-formula-{uuid}.png\n    ├── 0014-table-{uuid}.png\n    └── 0027-formula-MISS-{uuid}.png\n```\n\n`document.json` is the first archive member -- consumers can stream-parse it before image bytes arrive. Image filenames sort by page (1-based, zero-padded), then by layout label, then by random UUID. 
The `-MISS-` marker identifies fallback crops written when primary extraction failed.\n\n```python\nimport json, tarfile\n\nwith tarfile.open(\"paper.tar\") as tf:\n    doc = json.loads(tf.extractfile(\"document.json\").read())\n\nfor page in doc[\"pages\"]:\n    for block in page[\"blocks\"]:\n        print(block[\"reading_order\"], block[\"type\"], (block[\"text\"] or \"\")[:60])\n```\n\nThe schema mirrors the `Document` / `Page` / `Block` / `Cell` / `Reference` dataclasses. `bbox` floats are rounded to two decimals; pixel coordinates are in the **effective render DPI** (150 for digital-born, 300 for scanned), origin top-left.\n\nPer-block invariants:\n\n| Block kind | `text` | `image_path` | `miss` |\n|---|---|---|---|\n| TEXT, success | extracted text | `null` | `false` |\n| TEXT, MISS | `null` | crop (if bundled) or `null` | `true` |\n| REFERENCE, success | extracted text | `null` | `false` |\n| REFERENCE, MISS | `null` | crop (if bundled) or `null` | `true` |\n| IMAGE | `null` | crop path | `false` |\n| FORMULA, success | LaTeX | `null` (crop deleted) | `false` |\n| FORMULA, MISS | `null` | `-MISS-` marked crop | `true` |\n| TABLE, structured | `null` | crop path + `cells` / `n_rows` / `n_cols` set | `false` |\n| TABLE, image-only fallback | `null` | crop path, `cells=null` | `false` |\n\n`miss=True` always means the primary representation is unavailable. TABLE blocks never carry `miss=True` -- a structure failure degrades silently to image-only.\n\nFull schema reference: `docs/output-format.md`.\n\n## Configuration\n\nThree layers, broadest to narrowest:\n\n1. **`config.toml`** at the project root -- single source of truth for the service and benchmarks. Loaded by `load_service_config()`.\n2. **`YTCC_*` environment variables** -- per-field overrides via `PipelineConfig.from_env()`.\n3. **`PipelineConfig(...)` keyword arguments** -- explicit, per-call.\n\n\u003e [!IMPORTANT]\n\u003e The three layers don't compose automatically. 
The TOML and env vars are read only by `load_service_config()` and `PipelineConfig.from_env()`. Use `dataclasses.replace(loaded.pipeline, ...)` to layer overrides on top of a TOML-loaded config.\n\nTOML resolution walks: `path` arg -\u003e `YTCC_CONFIG` env -\u003e `./config.toml` -\u003e installed-package `config.toml` -\u003e dataclass defaults. Unknown TOML keys raise `ValueError` -- typos don't pass silently.\n\nEvery `PipelineConfig` field has a matching `YTCC_\u003cUPPERCASE_NAME\u003e` env var. Bool parser accepts `1`/`true`/`yes`/`on` (case-insensitive). Comma-list fields strip whitespace and drop empties.\n\nFull knob reference and tuning guidance: `docs/configuration.md` and `docs/performance.md`.\n\n## Digital-born vs scanned\n\nThe two paths share rendering, layout, formula recognition, table extraction, and bundling. They diverge only inside the block stage:\n\n| | Digital-born | Scanned |\n|---|---|---|\n| Text source | `pdf_oxide` text layer | `RapidOCR` over rendered crops |\n| Render DPI | 150 (default) | 300 (default) |\n| Per-worker cost | ~10 MiB RSS (one `PdfDocument` handle) | ~2 GiB VRAM (one RapidOCR engine + CUDA context) |\n| Typical wall on RTX 3090 | ~15s for 150 pages | ~5-10x that |\n\nAuto-detect samples 5 pages and checks per-page non-whitespace character counts; override with `digital_born=True/False`. Set `scanned_enabled=false` to reject scanned PDFs entirely (saves ~12 GiB VRAM; FastAPI returns HTTP 415).\n\n## References (GROBID)\n\nThe reference stage requires an externally-managed GROBID server -- the pipeline never spawns the JVM. 
Start it separately:\n\n```bash\ndocker run --rm -p 8070:8070 grobid/grobid:0.9.0\n```\n\nOr use the bundled helper which generates a citation-only config (drops startup from ~10s to ~3s, saves ~1 GiB RSS):\n\n```bash\nscripts/grobid_start.sh\nGROBID_PORT=9090 scripts/grobid_start.sh\nscripts/grobid_stop.sh\n```\n\nWhen enabled, `run_reference_stage` does one batched POST to `/api/processCitationList` per PDF. Failures (server unreachable, timeout, HTTP error, malformed XML) are logged at WARNING and the page list flows through unchanged -- references are an enrichment, not a hard requirement. The raw reference string always survives on `Block.text`.\n\n## Design principles\n\n- **Synchronous by default.** `process_pdf` is blocking. Async is layered on top in the FastAPI service via `asyncio.to_thread`. No async leakage into the core pipeline.\n- **`multiprocessing.spawn`, never `fork`.** The parent process may hold a CUDA context; `fork` corrupts it. Workers re-import their module from scratch, which is why `pdf_oxide` is imported inside the digital-born worker entry rather than at module top.\n- **One process, one GPU, one PDF at a time.** Concurrency is serialised at the `asyncio.Lock` in the service layer; the pipeline itself is sequential.\n- **Resident models are injectable, not global.** `LayoutAnalyzer`, `FormulaRecognizer`, and `TableEngine` are constructor-injected with idempotent `close()`. The FastAPI lifespan loads them once; library callers either inject manually or let the orchestrator own the per-call lifecycle.\n- **Opt-in opt-in opt-in.** Heavy stages (`table_enabled`, `references_enabled`) and slow tradeoffs (`formula_torch_compile`, `layout_fp16`) are off by default. Defaults are library-safe; production callers flip them on explicitly.\n- **Streaming-first output.** Tar over zip because tar writes sequentially without seeking back for a central directory -- the bundle can be a pipe, socket, or HTTP response body. 
`document.json` is written first so consumers parse the index before image bytes arrive.\n- **No silent failures.** Unknown TOML keys raise. MISS extractions are flagged on the block (`miss=true`) and preserve reading order + bbox. GROBID failures degrade the reference stage but never fail the pipeline.\n\n## Limitations\n\n- **Single-GPU, single-PDF concurrency.** The service serialises on a lock. Throughput scales with replicas, not workers.\n- **Python 3.14+ only.** The project uses PEP 649/749 deferred-annotation semantics and modern stdlib features. No backport path.\n- **No CPU path is supported.** CPU works but is untested and unoptimised. Production deployments need CUDA.\n- **`fork` not supported.** Mixing this library with the `fork` start method of `multiprocessing` corrupts CUDA contexts.\n- **Tuned for academic theses.** The PP-DocLayoutV3 label set and routing rules target academic documents (abstracts, references, formulas, tables). General-purpose PDFs may produce surprising layouts.\n- **GROBID is external.** The reference stage requires a separately-managed GROBID server. The pipeline never bundles or starts the JVM.\n- **Metadata output may contain unexpected XMP keys.** `pdf_info` comes directly from the PDF's XMP block, cleaned of UTF-16 BOMs and null bytes -- but anything else (encrypted PDFs, IPTC, RDF) is out of scope.\n- **Bundle filenames don't sort by reading order.** `document.json` is the authoritative reading order; image filenames sort by page + label + UUID.\n\n## Benchmarks\n\nKnob-sweep benchmarks live in the `benchmarks/` package. Each sweep varies one `PipelineConfig` field across a range of values and records per-stage wall time, process-tree CPU/RSS, device VRAM, and quality metrics (block counts, MISS counts, formulas recovered, references parsed). 
Standalone scripts cover cold-start, sustained load, API concurrency, GROBID payload scaling, and `torch.compile` amortisation.\n\n```bash\npython benchmarks/run_all.py                # every sweep (cached CSVs skipped)\npython benchmarks/run_all.py --only formula # just the sweeps matching \"formula\"\npython -m benchmarks.plot                   # generate plots from existing CSVs\n```\n\nCommitted reference results (`benchmarks/results/summary.md`, `sweeps/*.{csv,md}`, `plots/*.png`) live in git. Full catalogue: `benchmarks/README.md`.\n\n## Documentation\n\n| File | Topic |\n|---|---|\n| `docs/quickstart.md` | Install, first run, library + service modes |\n| `docs/architecture.md` | Stage-by-stage pipeline, module layout, resource lifecycle |\n| `docs/output-format.md` | Tar layout, `document.json` schema, MISS semantics |\n| `docs/configuration.md` | `PipelineConfig` knobs, TOML, env-var overrides |\n| `docs/stages.md` | Per-stage behaviour, knobs, skip / no-op semantics |\n| `docs/performance.md` | Recommended config, per-knob impact, VRAM budget, tuning checklist |\n| `docs/digital-born-vs-scanned.md` | Auto-detect heuristic, when to override, scanned-only deployments |\n| `docs/api-service.md` | FastAPI contract, lifespan, concurrency model |\n| `docs/references.md` | GROBID setup, parsed `Reference` shape, failure modes |\n| `docs/gotchas.md` | Common pitfalls, MISS handling, OOM recovery, log conventions |\n\n## Samples\n\nSix PDFs under `samples/` cover English / Turkish / Arabic, digital-born and scanned, good and bad quality. Use `904599.pdf` (English, digital-born, good) as the first sanity check -- it exercises every stage except OCR.\n\n## License\n\nMIT. 
See [`LICENSE`](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fozefe%2Fytcc-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fozefe%2Fytcc-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fozefe%2Fytcc-pipeline/lists"}