https://github.com/bobld/camelot-sharp

A C# library to extract tabular data from PDFs (port of camelot Python version using PdfPig).

Host: GitHub
URL: https://github.com/bobld/camelot-sharp
Owner: BobLd
License: mit
Created: 2020-11-17T11:17:11.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2022-02-04T16:19:16.000Z (over 3 years ago)
Last Synced: 2025-04-14T14:22:10.956Z (3 months ago)
Topics: camelot, camelot-sharp, csharp, dotnet, extract-table, extracting-tables, extraction, extraction-engine, netstandard, opencv, pdf-table-extract, pdf-table-extraction, pdfparser, pdfpig, pdfs, table, table-extraction
Language: C#
Homepage:
Size: 3.51 MB
Stars: 31
Watchers: 5
Forks: 5
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# camelot-sharp

A C# library to extract tabular data from PDFs (port of camelot Python version using PdfPig).

Original Python source code available here: [camelot-dev/camelot](https://github.com/camelot-dev/camelot).

[![Windows](https://github.com/BobLd/camelot-sharp/actions/workflows/dotnet.yml/badge.svg)](https://github.com/BobLd/camelot-sharp/actions/workflows/dotnet.yml)

NuGet packages are available on the [releases](https://github.com/BobLd/camelot-sharp/releases) page and on www.nuget.org:

- [Camelot](https://www.nuget.org/packages/Camelot)

- [Camelot.ImageProcessing.OpenCvSharp4](https://www.nuget.org/packages/Camelot.ImageProcessing.OpenCvSharp4)

# Usage

## Stream mode 

```csharp
using (PdfDocument doc = PdfDocument.Open(@"Files\foo.pdf", new ParsingOptions() { ClipPaths = true }))
{
	Stream stream = new Stream();
	var tables = stream.ExtractTables(doc.GetPage(1));
	Assert.Single(tables);
	Assert.Equal((612, 792), stream.Dimensions);
	Assert.Equal(612, stream.PdfWidth);
	Assert.Equal(792, stream.PdfHeight);
	//Assert.Equal(84, stream.HorizontalText.Count);

	var parsingReport = tables[0].ParsingReport();
	//   parsing_report = {"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}
	parsingReport["order"] = 1;
	parsingReport["page"] = 1;
}
```

## Lattice mode

```csharp
using (var doc = PdfDocument.Open(@"Files\column_span_2.pdf", new ParsingOptions() { ClipPaths = true }))
{
	var page = doc.GetPage(1);
	Lattice lattice = new Lattice(new OpenCvImageProcesser(), new BasicSystemDrawingProcessor(), line_scale: 40);
	var tables = lattice.ExtractTables(page,
		layout_kwargs: new DlaOptions[]
		{
			new DocstrumBoundingBoxes.DocstrumBoundingBoxesOptions()
			{
				WithinLineMultiplier = 2
			}
		});

	Assert.Single(tables);
	Assert.Equal(DataLatticeShiftTextLeftTop.Length, tables[0].Cells.Count);
	Assert.Equal(DataLatticeShiftTextLeftTop, tables[0].Data().Select(r => r.Select(c => c).ToArray()).ToArray());
}
```
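Both examples end with a table whose rows are read through `tables[0].Data()`. As a rough sketch of what can be done with those rows, the helper below joins them into CSV text. `TableRows` and `ToCsv` are hypothetical names that are not part of camelot-sharp, and the sketch assumes the rows can be passed as `IEnumerable<IEnumerable<string>>`, which the `Select` calls in the Lattice example suggest but do not guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of camelot-sharp.
public static class TableRows
{
	// Quote a cell only when it contains a comma, a double quote, or a line break,
	// doubling any embedded double quotes.
	private static string Escape(string cell)
	{
		cell = cell ?? string.Empty;
		bool needsQuotes = cell.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
		return needsQuotes ? "\"" + cell.Replace("\"", "\"\"") + "\"" : cell;
	}

	// 'rows' stands in for the result of tables[0].Data() (assumed shape).
	public static string ToCsv(IEnumerable<IEnumerable<string>> rows)
	{
		return string.Join(
			Environment.NewLine,
			rows.Select(row => string.Join(",", row.Select(Escape))));
	}
}
```

Under that assumption, `Console.WriteLine(TableRows.ToCsv(tables[0].Data()));` would print one line per table row.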
