https://github.com/guilhermestracini/poc-dotnet-extractpdfcontent

🔬 Proof of Concept of extracting content from PDF files using multiple PDF libraries

docnet dotnet dotnetcore itextsharp pdf-extractor pdf-reader pdfextraction pdfpig pdfsharp poc prdreader proof-of-concept

Last synced: 7 months ago

Host: GitHub
URL: https://github.com/guilhermestracini/poc-dotnet-extractpdfcontent
Owner: GuilhermeStracini
License: mit
Created: 2023-08-30T10:17:03.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-03-17T20:35:23.000Z (7 months ago)
Last Synced: 2025-03-17T21:27:55.526Z (7 months ago)
Topics: docnet, dotnet, dotnetcore, itextsharp, pdf-extractor, pdf-reader, pdfextraction, pdfpig, pdfsharp, poc, prdreader, proof-of-concept
Language: C#
Homepage: https://guilhermestracini.github.io/POC-dotnet-ExtractPdfContent/
Size: 199 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# PoC .NET - Extract PDF content

[![wakatime](https://wakatime.com/badge/github/GuilhermeStracini/POC-dotnet-ExtractPdfContent.svg)](https://wakatime.com/badge/github/GuilhermeStracini/POC-dotnet-ExtractPdfContent)

[![Maintainability](https://api.codeclimate.com/v1/badges/0473f6981139c13f8820/maintainability)](https://codeclimate.com/github/GuilhermeStracini/POC-dotnet-ExtractPdfContent/maintainability)

[![Test Coverage](https://api.codeclimate.com/v1/badges/0473f6981139c13f8820/test_coverage)](https://codeclimate.com/github/GuilhermeStracini/POC-dotnet-ExtractPdfContent/test_coverage)

[![CodeFactor](https://www.codefactor.io/repository/github/GuilhermeStracini/POC-dotnet-ExtractPdfContent/badge)](https://www.codefactor.io/repository/github/GuilhermeStracini/POC-dotnet-ExtractPdfContent)

[![GitHub license](https://img.shields.io/github/license/GuilhermeStracini/POC-dotnet-ExtractPdfContent)](https://github.com/GuilhermeStracini/POC-dotnet-ExtractPdfContent)

[![GitHub last commit](https://img.shields.io/github/last-commit/GuilhermeStracini/POC-dotnet-ExtractPdfContent)](https://github.com/GuilhermeStracini/POC-dotnet-ExtractPdfContent)

[![Build](https://github.com/GuilhermeStracini/POC-dotnet-ExtractPdfContent/actions/workflows/build.yml/badge.svg)](https://github.com/GuilhermeStracini/POC-dotnet-ExtractPdfContent/actions/workflows/build.yml)

[![Linting](https://github.com/GuilhermeStracini/POC-dotnet-ExtractPdfContent/actions/workflows/linter.yml/badge.svg)](https://github.com/GuilhermeStracini/POC-dotnet-ExtractPdfContent/actions/workflows/linter.yml)

🔬 Proof of Concept of extracting content from PDF files using multiple PDF libraries.

---

## Libraries

- [DocNet](https://github.com/GowenGit/docnet)

- ~~[iTextSharp.LGPLv2.Core](https://github.com/VahidN/iTextSharp.LGPLv2.Core)~~

- [PdfPig](https://github.com/UglyToad/PdfPig/)

- ~~[PdfSharpCore](https://github.com/ststeiger/PdfSharpCore)~~

Refer to this article: [Reading a PDF in C# on .NET Core](https://dev.to/eliotjones/reading-a-pdf-in-c-on-net-core-43ef)

The main goal of this PoC is to test the available options for reliably reading content from PDF files, and to find a replacement for the [iTextSharp for .NET Framework](https://www.nuget.org/packages/iTextSharp) package currently in use.

---

## Results

### ✅ ⚠️ DocNet

The results are good, though not the best among the libraries tested.

With the files tested, some extraction errors were detected. These could be corrected with a few simple regular expressions when post-processing the extracted text.
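As an illustration of that kind of post-processing, here is a minimal sketch of a regex-based cleanup step. The README does not list the actual errors found, so the three rules below (words hyphenated across line breaks, runs of spaces, excess blank lines) are assumed examples, and the `ExtractedTextCleaner` class name is hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: the rules are assumed examples, not the errors seen in this PoC.
public static class ExtractedTextCleaner
{
    // A hyphen followed by a line break between two letters, e.g. "extrac-\ntion".
    private static readonly Regex HyphenatedLineBreak =
        new Regex(@"(?<=\p{L})-\r?\n(?=\p{L})", RegexOptions.Compiled);

    // Two or more consecutive spaces or tabs.
    private static readonly Regex RepeatedSpaces =
        new Regex(@"[ \t]{2,}", RegexOptions.Compiled);

    // Three or more consecutive line breaks.
    private static readonly Regex ExtraBlankLines =
        new Regex(@"(\r?\n){3,}", RegexOptions.Compiled);

    public static string Clean(string raw)
    {
        if (string.IsNullOrEmpty(raw))
        {
            return string.Empty;
        }

        // Rejoin words that were split across lines.
        string text = HyphenatedLineBreak.Replace(raw, string.Empty);
        // Collapse runs of spaces and tabs into a single space.
        text = RepeatedSpaces.Replace(text, " ");
        // Keep at most one blank line between paragraphs.
        text = ExtraBlankLines.Replace(text, "\n\n");
        return text.Trim();
    }
}
```

Calling `ExtractedTextCleaner.Clean(text)` on each page's extracted text keeps the cleanup separate from the extraction library, so the same rules could be reused with any of the libraries listed above.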

### ❌ iTextSharp.LGPLv2.Core

Encoding issues.

A simple PDF generated by the library itself can be read, but another PDF tested produced text with encoding problems. Related issues:

- [SimpleTextExtractionStrategy ?](https://github.com/VahidN/iTextSharp.LGPLv2.Core/issues/7)

- [Encoding problem with extracted text from GhostScript generated pdf](https://github.com/VahidN/iTextSharp.LGPLv2.Core/issues/42)

### ✅ 🔝 PdfPig

The output of PdfPig was 99.999% identical to that of the old [iTextSharp](https://www.nuget.org/packages/iTextSharp) package (not the [iTextSharp Core](https://www.nuget.org/packages/iTextSharp.LGPLv2.Core/) version).

I will use PdfPig in my projects to replace the old library.
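For reference, a minimal sketch of page-by-page text extraction with PdfPig, assuming the `PdfPig` NuGet package is installed. The `PdfPigTextExtractor` class and its `ExtractText` method are hypothetical names for illustration and are not taken from this repository's code.

```csharp
using System.Text;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

// Hypothetical wrapper, not the class used in this PoC.
public static class PdfPigTextExtractor
{
    // Returns the text of every page, one page per line block.
    public static string ExtractText(string pdfPath)
    {
        var builder = new StringBuilder();

        // PdfDocument is disposable, so release the file when done.
        using (PdfDocument document = PdfDocument.Open(pdfPath))
        {
            foreach (Page page in document.GetPages())
            {
                // Page.Text holds the page's text content as a single string.
                builder.AppendLine(page.Text);
            }
        }

        return builder.ToString();
    }
}
```

Wrapping the library behind a single method like this keeps the calling code independent of the PDF library, which makes it easier to swap the old iTextSharp-based reader for PdfPig.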

### ❌ PdfSharpCore

This library doesn't support text extraction yet.
