{"id":17309823,"url":"https://github.com/bobld/pdfpigsvmregionclassifier","last_synced_at":"2025-04-14T13:54:59.813Z","repository":{"id":38371309,"uuid":"227085803","full_name":"BobLd/PdfPigSvmRegionClassifier","owner":"BobLd","description":"Proof of concept of a simple SVM Region Classifier using PdfPig and Accord.Net. The objective is to classify each text block in a pdf document page as either title, text, list, table and image.","archived":false,"fork":false,"pushed_at":"2022-06-23T08:49:08.000Z","size":1180,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-06T07:38:52.356Z","etag":null,"topics":["accord-net","csharp","document-layout-analysis","machine-learning","pdf","pdf-document","pdfpig","publaynet","support-vector-machine","svm","svm-classifier","svm-training"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BobLd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-10T10:04:59.000Z","updated_at":"2023-05-18T03:41:48.000Z","dependencies_parsed_at":"2022-08-25T01:40:50.776Z","dependency_job_id":null,"html_url":"https://github.com/BobLd/PdfPigSvmRegionClassifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2FPdfPigSvmRegionClassifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2FPdfPigSvmRegionClassifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2FPdfPigSvmRegionCl
assifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BobLd%2FPdfPigSvmRegionClassifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BobLd","download_url":"https://codeload.github.com/BobLd/PdfPigSvmRegionClassifier/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248892276,"owners_count":21178805,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["accord-net","csharp","document-layout-analysis","machine-learning","pdf","pdf-document","pdfpig","publaynet","support-vector-machine","svm","svm-classifier","svm-training"],"created_at":"2024-10-15T12:32:53.685Z","updated_at":"2025-04-14T13:54:59.786Z","avatar_url":"https://github.com/BobLd.png","language":"C#","readme":"# PdfPig SVM Region Classifier\nProof of concept of a simple Support Vector Machine (SVM) region classifier using [PdfPig](https://github.com/UglyToad/PdfPig) and [Accord.Net](https://github.com/accord-net/framework/). The model was trained on a subset of the [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet#getting-data) dataset. See their license [here](https://cdla.io/permissive-1-0/).\n\nThe objective is to use machine learning to classify each text block in a pdf document page as _title_, _text_, _list_, _table_ or _image_.\n\nThe annotations from the dataset (see sample [here](https://github.com/ibm-aur-nlp/PubLayNet/blob/master/examples/samples.json)) were converted to the [PAGE](https://github.com/PRImA-Research-Lab/PAGE-XML) xml format. 
Use the [`PageXmlConverter`](https://github.com/BobLd/PdfPigSvmRegionClassifier/blob/master/PdfPigSvmRegionClassifier/PageXmlConverter.cs) class to convert the json file into PAGE xml files. Images from the dataset were not used: the features are extracted directly from the pdf documents, so you will need to download the pdf documents separately.\n\n# Labels \nFollowing the [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet) methodology, these [categories](https://github.com/ibm-aur-nlp/PubLayNet/tree/master/pre-trained-models) are available:\n\n|Label|id (svm)|\n|---:|:---:|\n|**title**|0|\n|**text**|1|\n|**list**|2|\n|**table**|3|\n|**image**|4|\n\n# Features\n## Text\n- Character count\n- Percentage of numeric characters\n- Percentage of alphabetical characters\n- Percentage of symbolic characters\n- Percentage of bullet characters\n- Average delta to average page glyph height\n\n## Paths\n- Path count\n- Percentage of Bezier curve paths\n- Percentage of horizontal paths\n- Percentage of vertical paths\n- Percentage of oblique paths\n\n## Images\n- Image count\n- Average area covered by images\n\n## Code\nSee the [`GenerateData`](https://github.com/BobLd/PdfPigSvmRegionClassifier/blob/master/PdfPigSvmRegionClassifier/GenerateData.cs) class to generate a csv file with the features from the pdf documents and their respective PageXml ground truth (one xml document per page). 
See the [`FeatureHelper`](https://github.com/BobLd/PdfPigSvmRegionClassifier/blob/master/PdfPigSvmRegionClassifier/FeatureHelper.cs) class to easily generate the feature vector from a block.\n\n# Results (in sample)\n## Accuracy\nModel accuracy = 90.898%\n\n## Normalised confusion matrix\n\n![Normalised confusion matrix](https://github.com/BobLd/PdfPigSvmRegionClassifier/blob/master/confusion%20matrix.png)\n\n## Confusion matrix\n\nReading rows as predicted labels and columns as ground-truth labels reproduces the precision and recall values reported below.\n\n| |title|text|list|table|image|\n|---:|:---:|:---:|:---:|:---:|:---:|\n|**title**|9312|1592|19|3|135|\n|**text**|1166|37136|988|820|32|\n|**list**|0|1|32|0|0|\n|**table**|0|16|4|1092|3|\n|**image**|0|0|0|0|154|\n\n## Precision, Recall and F1 score\n\n| |Precision|Recall|F1 score|\n|---|:---:|:---:|:---:|\n|**title**|0.842|0.889|0.865|\n|**text**|0.925|0.958|0.941|\n|**list**|0.970|0.031|0.059|\n|**table**|0.979|0.570|0.721|\n|**image**|1.000|0.475|0.644|\n\n## Code\nSee the [`Trainer`](https://github.com/BobLd/PdfPigSvmRegionClassifier/blob/master/PdfPigSvmRegionClassifier/Trainer.cs) class to **train** and **evaluate** the model.\nAfter training, the SVM model is saved as a gzip-compressed file.\n\n# Usage\nOnce training is finished, you can test the classification on a new pdf document by using either [DocstrumBoundingBoxes](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig/DocumentLayoutAnalysis/DocstrumBoundingBoxes.cs) or [RecursiveXYCut](https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig/DocumentLayoutAnalysis/RecursiveXYCut.cs) to generate the text blocks, and then classifying each block.\nSee [`SvmZoneClassifier`](https://github.com/BobLd/PdfPigSvmRegionClassifier/blob/master/PdfPigSvmRegionClassifier/SvmZoneClassifier.cs) for a demo implementation. 
The trained SVM model is available [here](https://github.com/BobLd/PdfPigSvmRegionClassifier/tree/master/PdfPigSvmRegionClassifier/model).\n\n# References\n- https://visualstudiomagazine.com/articles/2019/02/01/support-vector-machines.aspx\n- http://accord-framework.net/docs/html/T_Accord_MachineLearning_Performance_GridSearch_2.htm\n- https://github.com/ibm-aur-nlp/PubLayNet\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbobld%2Fpdfpigsvmregionclassifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbobld%2Fpdfpigsvmregionclassifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbobld%2Fpdfpigsvmregionclassifier/lists"}