{"id":29705126,"url":"https://github.com/documentatom/documentatom","last_synced_at":"2025-10-31T21:23:27.042Z","repository":{"id":273262013,"uuid":"910161176","full_name":"DocumentAtom/DocumentAtom","owner":"DocumentAtom","description":"DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.","archived":false,"fork":false,"pushed_at":"2025-09-11T22:17:22.000Z","size":11301,"stargazers_count":38,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-14T03:54:27.586Z","etag":null,"topics":["ai","chunk","chunking","etl","extraction","extraction-transformation-and-loading","parse","parser","semantic"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DocumentAtom.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-30T16:36:52.000Z","updated_at":"2025-09-11T22:17:26.000Z","dependencies_parsed_at":"2025-02-07T04:51:38.593Z","dependency_job_id":"3486ca33-5400-4993-96af-98420547e7bf","html_url":"https://github.com/DocumentAtom/DocumentAtom","commit_stats":null,"previous_names":["jchristn/documentatom","documentatom/documentatom"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DocumentAtom/DocumentAtom","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocumentAtom%2FDocumentAtom","tags_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocumentAtom%2FDocumentAtom/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocumentAtom%2FDocumentAtom/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocumentAtom%2FDocumentAtom/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DocumentAtom","download_url":"https://codeload.github.com/DocumentAtom/DocumentAtom/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocumentAtom%2FDocumentAtom/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":275658802,"owners_count":25504780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-17T02:00:09.119Z","response_time":84,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","chunk","chunking","etl","extraction","extraction-transformation-and-loading","parse","parser","semantic"],"created_at":"2025-07-23T15:01:13.539Z","updated_at":"2025-10-31T21:23:27.036Z","avatar_url":"https://github.com/DocumentAtom.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://raw.githubusercontent.com/jchristn/DocumentAtom/refs/heads/main/assets/icon.png\" width=\"256\" height=\"256\"\u003e\n\n# DocumentAtom\n\nDocumentAtom provides a light, fast library for breaking input documents 
into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.\n\nDocumentAtom requires that Tesseract v5.0 be installed on the host.  This is required as certain document types can have embedded images which are parsed using OCR via Tesseract.\n\n| Package | Version | Downloads |\n|---------|---------|-----------|\n| DocumentAtom.Csv | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Csv.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Csv/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Csv.svg)](https://www.nuget.org/packages/DocumentAtom.Csv)  |\n| DocumentAtom.Excel | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Excel.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Excel/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Excel.svg)](https://www.nuget.org/packages/DocumentAtom.Excel)  |\n| DocumentAtom.Html | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Html.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Html/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Html.svg)](https://www.nuget.org/packages/DocumentAtom.Html)  |\n| DocumentAtom.Image | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Image.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Image/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Image.svg)](https://www.nuget.org/packages/DocumentAtom.Image)  |\n| DocumentAtom.Json | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Json.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Json/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Json.svg)](https://www.nuget.org/packages/DocumentAtom.Json)  |\n| DocumentAtom.Markdown | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Markdown.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Markdown/) | 
[![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Markdown.svg)](https://www.nuget.org/packages/DocumentAtom.Markdown)  |\n| DocumentAtom.Pdf | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Pdf.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Pdf/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Pdf.svg)](https://www.nuget.org/packages/DocumentAtom.Pdf)  |\n| DocumentAtom.PowerPoint | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.PowerPoint.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.PowerPoint/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.PowerPoint.svg)](https://www.nuget.org/packages/DocumentAtom.PowerPoint)  |\n| DocumentAtom.Ocr | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Ocr.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Ocr/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Ocr.svg)](https://www.nuget.org/packages/DocumentAtom.Ocr)  |\n| DocumentAtom.RichText | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.RichText.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.RichText/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.RichText.svg)](https://www.nuget.org/packages/DocumentAtom.RichText)  |\n| DocumentAtom.Text | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Text.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Text/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Text.svg)](https://www.nuget.org/packages/DocumentAtom.Text)  |\n| DocumentAtom.TypeDetection | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.TypeDetection.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.TypeDetection/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.TypeDetection.svg)](https://www.nuget.org/packages/DocumentAtom.TypeDetection)  |\n| DocumentAtom.Word | [![NuGet 
Version](https://img.shields.io/nuget/v/DocumentAtom.Word.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Word/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Word.svg)](https://www.nuget.org/packages/DocumentAtom.Word)  |\n| DocumentAtom.Xml | [![NuGet Version](https://img.shields.io/nuget/v/DocumentAtom.Xml.svg?style=flat)](https://www.nuget.org/packages/DocumentAtom.Xml/) | [![NuGet](https://img.shields.io/nuget/dt/DocumentAtom.Xml.svg)](https://www.nuget.org/packages/DocumentAtom.Xml)  |\n\n## New in v1.1.x\n\n- Hierarchical atomization (see `BuildHierarchy` in settings) - heading-based for Markdown/HTML/Word, page-based for PowerPoint\n- Support for CSV, JSON, and XML documents\n- Dependency updates and fixes\n\n## Motivation\n\nParsing documents and extracting constituent parts is one part science and one part black magic.  If you find ways to improve processing and extraction that are broadly useful, I would love your feedback on how to make this library more accurate, more useful, faster, and overall better.  My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.\n\n## Bugs, Quality, Feedback, or Enhancement Requests\n\nPlease feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.  
\n\n## Types Supported\n\nDocumentAtom supports the following input file types:\n- CSV\n- HTML\n- JSON\n- Markdown\n- Microsoft Word (.docx)\n- Microsoft Excel (.xlsx)\n- Microsoft PowerPoint (.pptx)\n- PNG images (**requires Tesseract on the host**)\n- PDF\n- Rich text (.rtf)\n- Text\n- XML\n\n## Simple Example \n\nRefer to the various `Test` projects for working examples.\n\nThe following example shows processing a Markdown (`.md`) file.\n\n```csharp\nusing DocumentAtom.Core.Atoms;\nusing DocumentAtom.Markdown;\n\n// Create a processor with default settings\nMarkdownProcessorSettings settings = new MarkdownProcessorSettings();\nMarkdownProcessor processor = new MarkdownProcessor(settings);\n\n// filename is the path to the Markdown file to process\nforeach (Atom atom in processor.Extract(filename))\n    Console.WriteLine(atom.ToString());\n```\n\n## Atom Types\n\nDocumentAtom parses input data assets into a variety of `Atom` objects.  Each `Atom` includes top-level metadata including:\n- `ParentGUID` - globally-unique identifier of the parent atom, or null\n- `GUID` - globally-unique identifier\n- `Type` - including `Text`, `Image`, `Binary`, `Table`, and `List`\n- `PageNumber` - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered\n- `Position` - the ordinal position of the `Atom`, relative to others\n- `Length` - the length of the `Atom`'s content\n- `MD5Hash` - the MD5 hash of the `Atom` content\n- `SHA1Hash` - the SHA1 hash of the `Atom` content\n- `SHA256Hash` - the SHA256 hash of the `Atom` content\n- `Quarks` - sub-atomic particles created from the `Atom` content, for instance, when chunking text\n\nThe `AtomBase` class provides the aforementioned metadata, and several type-specific `Atom`s are returned from the various processors, including:\n- `BinaryAtom` - includes a `Bytes` property\n- `DocxAtom` - includes `Text`, `HeaderLevel`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties\n- `ImageAtom` - includes `BoundingBox`, `Text`, `UnorderedList`, `OrderedList`, `Table`, 
and `Binary` properties\n- `MarkdownAtom` - includes `Formatting`, `Text`, `UnorderedList`, `OrderedList`, and `Table` properties\n- `PdfAtom` - includes `BoundingBox`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties\n- `PptxAtom` - includes `Title`, `Subtitle`, `Text`, `UnorderedList`, `OrderedList`, `Table`, and `Binary` properties\n- `TableAtom` - includes `Rows`, `Columns`, `Irregular`, and `Table` properties\n- `TextAtom` - includes `Text`\n- `XlsxAtom` - includes `SheetName`, `CellIdentifier`, `Text`, `Table`, and `Binary` properties\n\n`Table` objects inside of `Atom` objects are always presented as `SerializableDataTable` objects (see [SerializableDataTable](https://github.com/jchristn/serializabledatatable) for more information) to provide simple serialization and conversion to native `System.Data.DataTable` objects.\n\n## Underlying Libraries\n\nDocumentAtom is built on the shoulders of several libraries, without which this work would not be possible.\n\n- [CsvHelper](https://github.com/JoshClose/CsvHelper)\n- [DocumentFormat.OpenXml](https://github.com/dotnet/Open-XML-SDK)\n- [HTML Agility Pack](https://github.com/zzzprojects/html-agility-pack)\n- [PdfPig](https://github.com/UglyToad/PdfPig)\n- [RtfPipe](https://github.com/erdomke/RtfPipe)\n- [SixLabors.ImageSharp](https://github.com/SixLabors/ImageSharp)\n- [Tabula](https://github.com/BobLd/tabula-sharp)\n- [Tesseract](https://github.com/charlesw/tesseract/)\n\nEach of these libraries was integrated as a NuGet package, and no source was included or modified from these packages.\n\nMy libraries used within DocumentAtom:\n\n- [SerializableDataTable](https://github.com/jchristn/serializabledatatable)\n- [SerializationHelper](https://github.com/jchristn/serializationhelper)\n\n## RESTful API and Docker\n\nRun the `DocumentAtom.Server` project to start a RESTful server listening on `localhost:8000`.  
Modify the `documentatom.json` file to change the webserver, logging, or Tesseract settings.  Alternatively, you can pull `jchristn/documentatom` from [Docker Hub](https://hub.docker.com/repository/docker/jchristn/documentatom/general).  Refer to the `Docker` directory in the project for assets for running in Docker.\n\nRefer to the Postman collection for examples exercising the APIs.\n\n## Version History\n\nPlease refer to `CHANGELOG.md` for version history.\n\n## Thanks\n\nSpecial thanks to iconduck.com and the content authors for producing this [icon](https://iconduck.com/icons/27054/atom).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocumentatom%2Fdocumentatom","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdocumentatom%2Fdocumentatom","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocumentatom%2Fdocumentatom/lists"}