https://github.com/bobld/tabula-sharp
Extract tables from PDF files (port of tabula-java)
- Host: GitHub
- URL: https://github.com/bobld/tabula-sharp
- Owner: BobLd
- License: mit
- Created: 2020-09-08T12:38:41.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2025-03-17T20:00:14.000Z (about 1 month ago)
- Last Synced: 2025-04-15T02:12:49.453Z (15 days ago)
- Topics: csharp, dotnet, extract, extract-table, extracting-tables, extraction, extraction-engine, netstandard, pdf-table-extract, pdf-table-extraction, pdfparser, pdfpig, pdfs, table, table-extraction, tabula, tabula-java, tabula-sharp
- Language: C#
- Homepage:
- Size: 9.33 MB
- Stars: 175
- Watchers: 9
- Forks: 27
- Open Issues: 13
Metadata Files:
- Readme: README.md
- License: LICENSE
# tabula-sharp
`tabula-sharp` is a library for extracting tables from PDF files. It is a port of [tabula-java](https://github.com/tabulapdf/tabula-java).

- Supports netstandard2.0, net462, net471, net6.0, net8.0
- No Java bindings

NuGet packages are available on the [releases](https://github.com/BobLd/tabula-sharp/releases) page and on www.nuget.org:
- [Tabula](https://www.nuget.org/packages/Tabula)
- [Tabula.Json](https://www.nuget.org/packages/Tabula.Json)
- [Tabula.Csv](https://www.nuget.org/packages/Tabula.Csv)

## Differences with tabula-java
- Uses [PdfPig](https://github.com/UglyToad/PdfPig), and not PdfBox.
- Coordinate system starts from the bottom left point (going up) of the page, and not from the top left point (going down).
- The `NurminenDetectionAlgorithm` is replaced by `SimpleNurminenDetectionAlgorithm`, because the original algorithm requires an image management library.
- Table results might differ because of the way PdfPig builds the bounding boxes of letters.

# Usage
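Because the coordinate system starts from the bottom left of the page, an area written for tabula-java (top, left, bottom, right, measured from the top left) has to be flipped vertically before it is used here. The sketch below illustrates that conversion. `AreaConverter` and `FromTopLeft` are hypothetical names, not part of tabula-sharp or PdfPig, and the code assumes an unrotated page whose origin is at (0, 0) and whose height in points is known.

```csharp
using System;

// Hypothetical helper, not part of tabula-sharp or PdfPig.
// Converts an area expressed in top-left-origin coordinates (y grows downwards)
// into bottom-left-origin coordinates (y grows upwards).
public static class AreaConverter
{
    public static (double Left, double Bottom, double Right, double Top) FromTopLeft(
        double top, double left, double bottom, double right, double pageHeight)
    {
        if (bottom < top)
        {
            throw new ArgumentException("In top-left coordinates, bottom must be >= top.");
        }

        // Horizontal positions are unchanged; vertical positions are mirrored
        // against the page height.
        return (left, pageHeight - bottom, right, pageHeight - top);
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed page height of 842 points (A4 portrait).
        var area = AreaConverter.FromTopLeft(
            top: 100, left: 50, bottom: 300, right: 400, pageHeight: 842);

        Console.WriteLine(area); // prints (50, 542, 400, 742)
    }
}
```

The four resulting values are in the order PdfPig's `PdfRectangle(x1, y1, x2, y2)` constructor expects, so they could presumably be passed on to `page.GetArea(...)` in place of a detected region.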
## Stream mode - BasicExtractionAlgorithm
```csharp
using System.Collections.Generic;
using Tabula;
using Tabula.Detectors;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
    // pages are 1-indexed
    PageArea page = ObjectExtractor.Extract(document, 1);

    // detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    IReadOnlyList<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;
}
```
## Lattice mode - SpreadsheetExtractionAlgorithm
```csharp
using System.Collections.Generic;
using Tabula;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
    // pages are 1-indexed
    PageArea page = ObjectExtractor.Extract(document, 1);

    IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
    IReadOnlyList<Table> tables = ea.Extract(page);
    var table = tables[0];
    var rows = table.Rows;
}
```

# Results
## Stream mode - BasicExtractionAlgorithm

## Lattice mode - SpreadsheetExtractionAlgorithm
