{"id":48429179,"url":"https://github.com/fieldcure/fieldcure-document-parsers","last_synced_at":"2026-04-27T05:01:28.831Z","repository":{"id":348621825,"uuid":"1193332403","full_name":"fieldcure/fieldcure-document-parsers","owner":"fieldcure","description":"Document text extraction library for DOCX, HWPX, XLSX, PPTX, and PDF. Supports OOXML math-to-LaTeX conversion, Hancom equation parsing, and IMediaDocumentParser for image extraction.","archived":false,"fork":false,"pushed_at":"2026-04-26T08:28:52.000Z","size":6191,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-26T10:16:43.066Z","etag":null,"topics":["csharp","document-parser","docx","dotnet","equation-parser","hwpx","latex","nuget","pdf","text-extraction"],"latest_commit_sha":null,"homepage":"https://www.nuget.org/packages/FieldCure.DocumentParsers","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fieldcure.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-27T05:32:19.000Z","updated_at":"2026-04-26T08:28:38.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/fieldcure/fieldcure-document-parsers","commit_stats":null,"previous_names":["fieldcure/fieldcure-document-parsers"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/fieldcure/fieldcure-document-parsers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fieldcure%2Ffieldcure-document-parsers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fieldcure%2Ffieldcure-document-parsers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fieldcure%2Ffieldcure-document-parsers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fieldcure%2Ffieldcure-document-parsers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fieldcure","download_url":"https://codeload.github.com/fieldcure/fieldcure-document-parsers/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fieldcure%2Ffieldcure-document-parsers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32323215,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T23:26:28.701Z","status":"online","status_checked_at":"2026-04-27T02:00:06.769Z","response_time":128,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csharp","document-parser","docx","dotnet","equation-parser","hwpx","latex","nuget","pdf","text-extraction"],"created_at":"2026-04-06T10:02:26.591Z","updated_at":"2026-04-27T05:01:28.744Z","avatar_url":"https://github.com/fieldcure.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FieldCure.DocumentParsers\n\n[![Core](https://img.shields.io/nuget/v/FieldCure.DocumentParsers?label=Core)](https://www.nuget.org/packages/FieldCure.DocumentParsers)\n[![Imaging](https://img.shields.io/nuget/v/FieldCure.DocumentParsers.Imaging?label=Imaging)](https://www.nuget.org/packages/FieldCure.DocumentParsers.Imaging)\n[![Ocr](https://img.shields.io/nuget/v/FieldCure.DocumentParsers.Ocr?label=Ocr)](https://www.nuget.org/packages/FieldCure.DocumentParsers.Ocr)\n[![Audio](https://img.shields.io/nuget/v/FieldCure.DocumentParsers.Audio?label=Audio)](https://www.nuget.org/packages/FieldCure.DocumentParsers.Audio)\n[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)\n\nLightweight document-to-text extraction library for .NET.\nConverts DOCX, HWPX, XLSX, PPTX, HTML, and PDF files into structured Markdown\nwith heading detection and table support — designed for LLM/RAG pipelines.\n\n## Features\n\n- **DOCX** — headings, paragraphs, tables → markdown, math equations → LaTeX, metadata → YAML front matter, footnotes, endnotes, comments, headers/footers\n- **HWPX** — Korean standard format (KS X 6101/OWPML), headings, tables → markdown, equations → LaTeX, metadata → YAML front matter, footnotes, endnotes, memos, headers/footers\n- **XLSX** — sheets → markdown tables (multi-sheet support), metadata → YAML front matter\n- **PPTX** — slide text, speaker notes, slide tables → markdown, grouped shapes, metadata → YAML front matter\n- **HTML** — readable content extraction (SmartReader) → GitHub-flavored Markdown (ReverseMarkdown)\n- **PDF** — text extraction (PdfPig, pure managed). Page image rendering and OCR are separate opt-in packages.\n- **Audio** — MP3/WAV/M4A/OGG/FLAC/WebM → timestamped Markdown transcripts via Whisper.net (separate opt-in package).\n\n## Packages\n\n| Package | Description | Native deps |\n|---------|-------------|:-----------:|\n| `FieldCure.DocumentParsers` | DOCX, HWPX, XLSX, PPTX, HTML, **PDF** (text) | — |\n| `FieldCure.DocumentParsers.Imaging` | PDF → page images (adds `IMediaDocumentParser`) | PDFium |\n| `FieldCure.DocumentParsers.Ocr` | Tesseract OCR fallback for scanned PDFs — **Windows only** | PDFium + Tesseract |\n| `FieldCure.DocumentParsers.Audio` | Audio → timestamped transcripts via Whisper.net — **Windows only** | Whisper.net + NAudio |\n\nThe core package is pure managed — no native binaries are pulled in unless you opt into Imaging, Ocr, or Audio.\n\n\u003e The Ocr package is currently **Windows only** — the bundled Tesseract 5.2.0 ships native Windows binaries only. The assembly carries `[SupportedOSPlatform(\"windows\")]`, so non-Windows consumers will see CA1416 warnings at compile time. Cross-platform OCR is on the roadmap; in the meantime use the core package directly for PDFs that have an embedded text layer (works everywhere).\n\n\u003e **Deprecated (v2.0):** `FieldCure.DocumentParsers.Pdf` (replaced by core + Imaging) and `FieldCure.DocumentParsers.Pdf.Ocr` (renamed to `.Ocr`).\n\n## Installation\n\n```bash\n# Core (DOCX, HWPX, XLSX, PPTX, HTML, PDF text)\ndotnet add package FieldCure.DocumentParsers\n\n# PDF page rendering (optional, pulls PDFium)\ndotnet add package FieldCure.DocumentParsers.Imaging\n\n# OCR fallback for scanned PDFs (optional, pulls Tesseract + PDFium)\ndotnet add package FieldCure.DocumentParsers.Ocr\n\n# Audio transcription (optional, pulls Whisper.net runtimes + NAudio)\ndotnet add package FieldCure.DocumentParsers.Audio\n```\n\n## Quick Start\n\n```csharp\nusing FieldCure.DocumentParsers;\n\n// PDF is now registered automatically — no AddPdfSupport() call needed.\nvar parser = DocumentParserFactory.GetParser(\".pdf\");\nvar text = parser!.ExtractText(File.ReadAllBytes(\"document.pdf\"));\n\n// Same API for all formats\nforeach (var ext in DocumentParserFactory.SupportedExtensions)\n    Console.WriteLine(ext);\n// .docx, .hwpx, .xlsx, .pptx, .html, .htm, .pdf\n```\n\n```csharp\n// Opt-out control for metadata, footnotes, etc.\nvar parser = new DocxParser();\nvar options = new ExtractionOptions\n{\n    IncludeMetadata = false,\n    IncludeFootnotes = false\n};\nvar text = parser.ExtractText(File.ReadAllBytes(\"report.docx\"), options);\n```\n\n```csharp\nusing FieldCure.DocumentParsers;\nusing FieldCure.DocumentParsers.Imaging;\n\n// Upgrade the factory's .pdf entry to IMediaDocumentParser (text + images).\nDocumentParserFactoryImagingExtensions.AddImagingSupport();\nvar pdf = (IMediaDocumentParser)DocumentParserFactory.GetParser(\".pdf\")!;\nvar images = pdf.ExtractImages(File.ReadAllBytes(\"document.pdf\"), dpi: 150);\n```\n\n```csharp\nusing FieldCure.DocumentParsers;\nusing FieldCure.DocumentParsers.Ocr;\n\n// Register an OCR-augmented PDF parser. Dispose the engine at shutdown.\nusing var ocr = DocumentParserFactoryOcrExtensions.AddOcrSupport();\n\n// Scanned pages are OCR'd; pages with an embedded text layer go through PdfPig.\nvar parser = DocumentParserFactory.GetParser(\".pdf\")!;\nvar text = parser.ExtractText(File.ReadAllBytes(\"scanned.pdf\"));\n```\n\n```csharp\nusing FieldCure.DocumentParsers;\nusing FieldCure.DocumentParsers.Audio;\n\n// Register audio support. Dispose the transcriber at shutdown.\nawait using var transcriber = DocumentParserFactoryAudioExtensions.AddAudioSupport();\n\nvar parser = DocumentParserFactory.GetParser(\".mp3\")!;\nvar transcript = parser.ExtractText(File.ReadAllBytes(\"meeting.mp3\"));\n```\n\n```csharp\n// Let the library pick a Whisper model size based on detected GPU/RAM/cores.\n// QualityBias.Accuracy (default) shifts up one tier — suitable for batch indexing.\nusing FieldCure.DocumentParsers.Audio;\n\nvar recommended = WhisperEnvironment.RecommendModelSize(); // e.g. WhisperModelSize.Large\nvar options = AudioExtractionOptions.Default.WithModelSize(recommended);\n\nvar probe = WhisperEnvironment.Probe();\nConsole.Error.WriteLine(\n    $\"[Audio] CUDA={probe.CudaAvailable} Vulkan={probe.VulkanAvailable} \" +\n    $\"RAM={probe.SystemRamBytes / (1024L * 1024 * 1024)}GB → {recommended}\");\n```\n\n## Custom Parser\n\nImplement `IDocumentParser` to add support for any format:\n\n```csharp\npublic class MyParser : IDocumentParser\n{\n    public IReadOnlyList\u003cstring\u003e SupportedExtensions =\u003e [\".xyz\"];\n\n    public string ExtractText(byte[] data)\n    {\n        // Your extraction logic\n        return \"extracted text\";\n    }\n}\n\n// Register\nDocumentParserFactory.Register(new MyParser());\n```\n\n## Table Output Format\n\nAll parsers convert tables to markdown format for LLM comprehension:\n\n```markdown\n| Name | Age | City |\n| --- | --- | --- |\n| Alice | 30 | Seoul |\n| Bob | 25 | Busan |\n```\n\nPipe characters inside cells are escaped as `\\|` to preserve table structure.\n\n## Limitations\n\n### DOCX\n\n| Supported | Not Yet Supported |\n|-----------|-------------------|\n| Paragraph text | Charts / SmartArt |\n| Headings (Heading1–9 style + OutlineLevel) | Images (embedded) — no OCR |\n| Tables → markdown (including nested) | Tracked changes |\n| Hyperlink text | Text boxes / shapes |\n| Math equations (OMML → LaTeX) | Legacy .doc format (use LibreOffice to convert) |\n| Numbered / bulleted lists (as text) | |\n| Multi-section documents | |\n| Metadata → YAML front matter | |\n| Footnotes / Endnotes | |\n| Comments → inline blockquote | |\n| Headers / Footers | |\n\n### HWPX\n\n| Supported | Not Yet Supported |\n|-----------|-------------------|\n| Paragraph text (hp:p) | Form fields |\n| Headings (header.xml outline levels) | Legacy .hwp format (binary, not XML) |\n| Standalone tables (hp:tbl) | |\n| Embedded tables (hp:p \u003e hp:run \u003e hp:tbl) | |\n| Math equations (hp:equation → LaTeX) | |\n| Drawing text (hp:drawText) | |\n| Multi-section documents | |\n| Table cell merging | |\n| Metadata → YAML front matter | |\n| Footnotes / Endnotes | |\n| Memos → inline blockquote | |\n| Headers / Footers | |\n\n### XLSX\n\n| Supported | Not Yet Supported |\n|-----------|-------------------|\n| Cell text values | Charts |\n| SharedString references | Pivot tables |\n| Multi-sheet (separated by headings) | Formula evaluation (values only) |\n| Empty row/cell handling | Conditional formatting info |\n| Pipe character escaping | Merged cells (partial support) |\n\n### PPTX\n\n| Supported | Not Yet Supported |\n|-----------|-------------------|\n| Slide text (all shapes) | SmartArt |\n| Title / body separation | Charts |\n| Speaker notes | Animations / transitions info |\n| Slide tables → markdown | Audio / video references |\n| Grouped shapes (text extraction) | Math equations |\n| Slide ordering | |\n| Field elements (slide numbers, dates) | |\n\n### HTML\n\n| Supported | Not Yet Supported |\n|-----------|-------------------|\n| Readable article extraction (SmartReader) | JavaScript-rendered content (SPA) |\n| GitHub-flavored Markdown output | Login-required pages |\n| Tables, headings, links preserved | Embedded media extraction |\n| Nav / ads / footer auto-removal | Non-UTF-8 encodings |\n\n### PDF\n\n| Supported | Not Yet Supported |\n|-----------|-------------------|\n| Text extraction (text-based PDF) — core package | Form field extraction |\n| Page image rendering — `Imaging` package | Digital signature info |\n| OCR fallback for scanned PDFs — `Ocr` package | PDF/A validation |\n| Multi-page documents, Unicode text | |\n| English + Korean OCR (tessdata_fast) — `Ocr` | |\n\n### Audio\n\n| Supported | Not Yet Supported |\n|-----------|-------------------|\n| MP3, WAV, M4A, OGG, FLAC, WebM — `Audio` package | Real-time microphone input |\n| Timestamped Markdown transcript | Speaker diarization |\n| Whisper ggml model cache | Video audio track extraction |\n| Custom `IAudioTranscriber` injection | Word-level timestamps |\n| Environment-aware model size recommendation | NVML-based VRAM probing (deferred to v0.2) |\n\n#### Model size selection\n\n`WhisperEnvironment.RecommendModelSize(QualityBias)` picks a Whisper model size based on the local environment. CUDA/Vulkan availability is detected via driver-shipped `nvcuda.dll` / `vulkan-1.dll`; physical RAM via `GlobalMemoryStatusEx`; logical cores via `Environment.ProcessorCount`. VRAM is intentionally not probed in v0.1 — RAM ≥ 8 GB on a GPU host is treated as sufficient, and Whisper.net's runtime fallback (CUDA → Vulkan → CPU) handles real-device mismatches.\n\nThe balanced matrix (used directly by `QualityBias.Balanced`):\n\n| Environment | Recommended model |\n|---|---|\n| GPU available, RAM ≥ 16 GB | `Large` |\n| GPU available, RAM ≥ 8 GB | `Medium` |\n| CPU only, RAM ≥ 16 GB, cores ≥ 8 | `Small` |\n| CPU only, RAM ≥ 8 GB | `Base` |\n| Otherwise | `Tiny` |\n\n`QualityBias.Accuracy` (default) shifts the recommendation one tier up — appropriate for batch indexing where transcription latency is acceptable. `QualityBias.Speed` shifts one tier down — appropriate for interactive UI flows where the user is actively waiting.\n\n## Repository Structure\n\nAll library projects multi-target `net8.0;net10.0`.\n\n```\nsrc/\n├── DocumentParsers/                     FieldCure.DocumentParsers 2.0 (net8.0 + net10.0)\n│   ├── Ooxml/                           DocxParser, PptxParser, XlsxParser\n│   ├── Hwpx/                            HwpxParser\n│   ├── Html/                            HtmlParser\n│   └── Pdf/                             PdfParser (text via PdfPig)\n├── DocumentParsers.Imaging/             FieldCure.DocumentParsers.Imaging 1.0 (net8.0 + net10.0)\n├── DocumentParsers.Ocr/                 FieldCure.DocumentParsers.Ocr 1.0 (net8.0 + net10.0)\n├── DocumentParsers.Audio/               FieldCure.DocumentParsers.Audio 0.1 (net8.0 + net10.0)\n├── DocumentParsers.Cli/                 Console tool for manual output inspection\n├── DocumentParsers.Tests/               MSTest — core + PdfParser tests\n├── DocumentParsers.Imaging.Tests/       MSTest — PdfImageRenderer tests\n├── DocumentParsers.Ocr.Tests/           MSTest — OcrPdfParser + TesseractOcrEngine tests\n└── DocumentParsers.Audio.Tests/         MSTest — Audio parser tests\n```\n\n## Build \u0026 Test\n\n```bash\ndotnet build\ndotnet test\n```\n\n## See Also\n\nPart of the [AssistStudio ecosystem](https://github.com/fieldcure/fieldcure-assiststudio#packages).\n\n## Release Notes\n\n- [FieldCure.DocumentParsers](RELEASENOTES.DocumentParsers.md)\n- [FieldCure.DocumentParsers.Imaging](RELEASENOTES.DocumentParsers.Imaging.md)\n- [FieldCure.DocumentParsers.Ocr](RELEASENOTES.DocumentParsers.Ocr.md)\n- [FieldCure.DocumentParsers.Audio](RELEASENOTES.DocumentParsers.Audio.md)\n\n## License\n\n[MIT](LICENSE) — Copyright (c) 2026 FieldCure Co., Ltd.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffieldcure%2Ffieldcure-document-parsers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffieldcure%2Ffieldcure-document-parsers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffieldcure%2Ffieldcure-document-parsers/lists"}