{"id":17309830,"url":"https://github.com/bobld/simple-docstrum","last_synced_at":"2025-06-23T14:33:37.483Z","repository":{"id":108794800,"uuid":"258309199","full_name":"BobLd/simple-docstrum","owner":"BobLd","description":"A step-by-step C# implementation of the Docstrum algorithm","archived":false,"fork":false,"pushed_at":"2020-12-13T16:58:33.000Z","size":920,"stargazers_count":23,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-19T04:48:59.208Z","etag":null,"topics":["csharp","docstrum","document-layout-analysis","dotnet","pdf","pdfpig"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BobLd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-04-23T19:27:24.000Z","updated_at":"2024-10-12T06:21:38.000Z","dependencies_parsed_at":"2023-06-05T05:00:08.466Z","dependency_job_id":null,"html_url":"https://github.com/BobLd/simple-docstrum","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BobLd/simple-docstrum","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2Fsimple-docstrum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2Fsimple-docstrum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2Fsimple-docstrum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2Fsimple-docstrum/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BobLd","download_url":"https://codeload.github.com/BobLd/simple-docstrum/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2Fsimple-docstrum/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261494737,"owners_count":23167187,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csharp","docstrum","document-layout-analysis","dotnet","pdf","pdfpig"],"created_at":"2024-10-15T12:32:57.236Z","updated_at":"2025-06-23T14:33:37.475Z","avatar_url":"https://github.com/BobLd.png","language":"Jupyter Notebook","readme":"A step-by-step C# implementation of the Docstrum algorithm for PDF documents\n\n### How to run C# code in Jupyter Lab / How to install .NET Interactive\nhttps://devblogs.microsoft.com/cesardelatorre/using-ml-net-in-jupyter-notebooks/\n\nhttps://devblogs.microsoft.com/dotnet/net-interactive-is-here-net-notebooks-preview-2/ (if the previous guide fails during installation)\n\n# Description\n\u003e**Version 2 is the latest update and handles rotated words/lines/paragraphs. 
This is the version implemented in PdfPig.**\n\n![rotated example](https://github.com/BobLd/DocumentLayoutAnalysis/blob/master/DocumentLayoutAnalysis/DocumentLayoutAnalysis/doc/docstrum%20example%202.png)\n\nLink to original paper: [The Document Spectrum for Page Layout Analysis](https://ieeexplore.ieee.org/document/244677) by Lawrence O'Gorman\n\nFrom [_Performance Comparison of Six Algorithms for Page Segmentation_](https://www.researchgate.net/publication/220932988_Performance_Comparison_of_Six_Algorithms_for_Page_Segmentation): The Docstrum algorithm by O'Gorman is a __bottom-up__ approach based on __nearest-neighborhood clustering__ of connected components extracted from the document image. After noise removal, the connected components are separated into two groups, one with dominant characters and another one with characters in titles and section heading, using a character size ratio factor _fd_. Then, _K_ nearest neighbors are found for each connected component. Then, text-lines are found by computing the transitive closure on within-line nearest neighbor pairings using a threshold _ft_. Finally, text-lines are merged to form text blocks using a parallel distance threshold _fpa_ and a perpendicular distance threshold _fpe_. - [wiki](https://en.wikipedia.org/wiki/Document_layout_analysis#Example_of_a_bottom_up_approach)\n\n## Variables used in structural block determination\nThe variables can be obtained by calling the `GetStructuralBlockingParameters()` function.\n\n```csharp\npublic static bool GetStructuralBlockingParameters(PdfLine i, PdfLine j, double epsilon,\n    out double angularDifference, out double normalisedOverlap, out double perpendicularDistance)\n    {\n      ...\n    }\n```\nFrom the original paper by O'Gorman:\n\u003e![Fig. 8 - Structural block determination](https://github.com/BobLd/simple-docstrum/blob/master/images/docstrum3.svg)\n\n\u003e**Fig. 8. 
Variables used in structural block determination.**\nThe two text lines, represented by segments *i* and *j*, are to be tested here to determine if they should be grouped into the same block. Their angular difference is *θ\u003csub\u003eij\u003c/sub\u003e*. The overlap length of segment *i* on segment *j* is *p\u003csub\u003ej\u003c/sub\u003e*, (and that is normalized to obtain the overlap parameter). The parallel distance between *i* and *j* is *d\u003csup\u003ea\u003c/sup\u003e\u003csub\u003eij\u003c/sub\u003e = p\u003csub\u003ej\u003c/sub\u003e* in this case. The perpendicular distance between *i* and *j* is *d\u003csup\u003ee\u003c/sup\u003e\u003csub\u003eij\u003c/sub\u003e*.\n\n# Results\n## 0.0 Open PDF document\n![raw](images/raw_v1.png)\n\n## 0.1 Extract words and preprocess\n![words](images/words_v1.png)\n\n## 1. Estimate within-line and between-line spacing\n### 1.1 Within-line (between words) spacing\n![](images/wl_dist_v1.png)\n### 1.2 Between-line spacing\n![](images/bl_dist_v1.png)\n\n## 2. Get lines\n![lines](images/lines_v1.png)\n\n## 3. Get paragraph blocks\n![paragraphs](images/paragraphs_v1.png)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbobld%2Fsimple-docstrum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbobld%2Fsimple-docstrum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbobld%2Fsimple-docstrum/lists"}