https://github.com/xavctn/img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing
image-processing opencv python table-extraction
Host: GitHub
URL: https://github.com/xavctn/img2table
Owner: xavctn
License: mit
Created: 2022-03-21T10:07:19.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2024-11-11T19:13:12.000Z (3 months ago)
Last Synced: 2025-01-23T18:04:21.856Z (10 days ago)
Topics: image-processing, opencv, python, table-extraction
Language: Python
Homepage:
Size: 7.1 MB
Stars: 635
Watchers: 10
Forks: 83
Open Issues: 56
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

README

# img2table

`img2table` is a simple, easy-to-use Python library for table identification and extraction, based on [OpenCV](https://opencv.org/) image processing. It supports most common image file formats as well as PDF files.

Thanks to its design, it provides a practical and lighter alternative to neural-network-based solutions, especially for usage on CPU.

## Table of contents

* [Installation](#installation)

* [Features](#features)

* [Supported file formats](#supported-file-formats)

* [Usage](#usage)

   * [Documents](#documents)

      * [Images](#images-doc)

      * [PDF](#pdf-doc)

   * [Supported OCRs](#ocr)

   * [Table extraction](#table-extract)

   * [Excel export](#xlsx)

* [Examples](#examples)

* [Caveats / FYI](#fyi)

## Installation 

The library can be installed via pip:

* `pip install img2table`: Standard installation, supporting Tesseract
* `pip install img2table[paddle]`: For usage with Paddle OCR
* `pip install img2table[easyocr]`: For usage with EasyOCR
* `pip install img2table[surya]`: For usage with Surya OCR
* `pip install img2table[gcp]`: For usage with Google Vision OCR
* `pip install img2table[aws]`: For usage with AWS Textract OCR
* `pip install img2table[azure]`: For usage with Azure Cognitive Services OCR
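To see which optional OCR backends are already importable in the current environment, a quick check along these lines can help. This is a minimal sketch that is not part of `img2table`: the helper name `available_backends` is hypothetical, and the mapping from pip extras to import names (`paddleocr`, `easyocr`, `surya`, `boto3`) is an assumption about what each extra installs.

```python
from importlib.util import find_spec

# Assumed mapping: pip extra -> top-level module it is expected to provide.
EXTRA_TO_MODULE = {
    "paddle": "paddleocr",
    "easyocr": "easyocr",
    "surya": "surya",
    "aws": "boto3",
}


def available_backends() -> dict:
    """Return {extra name: True/False} depending on whether the module can be found."""
    return {extra: find_spec(module) is not None
            for extra, module in EXTRA_TO_MODULE.items()}


if __name__ == "__main__":
    for extra, found in available_backends().items():
        print(f"img2table[{extra}]: {'available' if found else 'missing'}")
```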

## Features 

* Table identification for images and PDF files, including bounding boxes at the table cell level

* Handling of complex table structures such as merged cells

* Handling of implicit content - see [example](/examples/Implicit.ipynb)

* Table content extraction by providing support for OCR services / tools

* Extracted tables are returned as a simple object, including a Pandas DataFrame representation

* Export extracted tables to an Excel file, preserving their original structure

## Supported file formats 

### Images 

Images are loaded using the `opencv-python` library. The supported formats are listed below.

Supported image formats:

* Windows bitmaps - `*.bmp`, `*.dib`
* JPEG files - `*.jpeg`, `*.jpg`, `*.jpe`
* JPEG 2000 files - `*.jp2`
* Portable Network Graphics - `*.png`
* WebP - `*.webp`
* Portable image format - `*.pbm`, `*.pgm`, `*.ppm`, `*.pxm`, `*.pnm`
* PFM files - `*.pfm`
* Sun rasters - `*.sr`, `*.ras`
* TIFF files - `*.tiff`, `*.tif`
* OpenEXR Image files - `*.exr`
* Radiance HDR - `*.hdr`, `*.pic`
* Raster and Vector geospatial data supported by GDAL

Source: OpenCV documentation, "Image file reading and writing"

Multi-page images are not supported.

---

### PDF 

Both native and scanned PDF files are supported.

## Usage 

### Documents 

#### Images 

Images are instantiated as follows:

```python
from img2table.document import Image

# src: path, bytes or io.BytesIO of the image
image = Image(src,
              detect_rotation=False)
```

> Parameters
>
> * `src` (str, pathlib.Path, bytes or io.BytesIO, required): Image source
> * `detect_rotation` (bool, optional, default False): Detect and correct skew/rotation of the image




The method implemented to handle skewed/rotated images supports skew angles up to 45° and is based on the publication by Huang, 2020.

When the `detect_rotation` parameter is set to `True`, image coordinates and bounding boxes returned by other methods might not correspond to the original image.

#### PDF 

PDF files are instantiated as follows:

```python
from img2table.document import PDF

# src: path, bytes or io.BytesIO of the PDF
pdf = PDF(src,
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)
```

> Parameters
>
> * `src` (str, pathlib.Path, bytes or io.BytesIO, required): PDF source
> * `pages` (list, optional, default None): List of PDF page indexes to be processed. If None, all pages are processed
> * `detect_rotation` (bool, optional, default False): Detect and correct skew/rotation of extracted images from the PDF
> * `pdf_text_extraction` (bool, optional, default True): Extract text from the PDF file for native PDFs


PDF pages are converted to images at 200 DPI for table identification.
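Because pages are rendered at 200 DPI, pixel measurements on the rendered page relate to PDF points (1/72 inch) by a simple scale factor. The sketch below assumes that bounding boxes are expressed in pixels of the 200 DPI rendering, which the sentence above suggests but does not state. The helper names are hypothetical, and the PDF convention of a bottom-left origin is not accounted for.

```python
PDF_POINTS_PER_INCH = 72   # PDF default user-space unit: 1/72 inch
RENDER_DPI = 200           # rendering resolution stated above


def pixels_to_points(value_px: float, dpi: int = RENDER_DPI) -> float:
    """Scale a pixel measurement on the rendered page to PDF points."""
    return value_px * PDF_POINTS_PER_INCH / dpi


def bbox_to_points(bbox_px, dpi: int = RENDER_DPI):
    """Convert an (x1, y1, x2, y2) pixel box to points (scaling only, no axis flip)."""
    return tuple(pixels_to_points(v, dpi) for v in bbox_px)


if __name__ == "__main__":
    print(bbox_to_points((100, 50, 400, 300)))
```

As a sanity check on the scale, a US Letter page (612 × 792 points) corresponds to about 1700 × 2200 pixels at 200 DPI.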

---

### OCR 

`img2table` provides an interface for several OCR services and tools in order to parse table content.


If possible (i.e. for native PDFs), PDF text will be extracted directly from the file and the OCR service/tool will not be called.

#### Tesseract

```python
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1,
                   lang="eng",
                   psm=11,
                   tessdata_dir="...")
```

> Parameters
>
> * `n_threads` (int, optional, default 1): Number of concurrent threads used to call Tesseract
> * `lang` (str, optional, default "eng"): Lang parameter used in Tesseract for text extraction
> * `psm` (int, optional, default 11): PSM parameter used in Tesseract, run `tesseract --help-psm` for details
> * `tessdata_dir` (str, optional, default None): Directory containing Tesseract traineddata files. If None, the `TESSDATA_PREFIX` env variable is used.


*Usage of [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) requires prior installation. Check the [documentation](https://tesseract-ocr.github.io/tessdoc/) for instructions.*

*Windows users getting environment variable errors can check this [tutorial](https://linuxhint.com/install-tesseract-windows/).*
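When Tesseract-related errors come up, it can help to check what the current process actually sees. The following is a minimal diagnostic sketch using only the standard library. It assumes the Tesseract executable is named `tesseract` and is expected on `PATH`; the helper name `tesseract_diagnostics` is hypothetical and not part of `img2table`.

```python
import os
import shutil


def tesseract_diagnostics() -> dict:
    """Report where the tesseract executable and the traineddata directory are found."""
    return {
        # Full path to the executable, or None if it is not on PATH
        "executable": shutil.which("tesseract"),
        # Value of TESSDATA_PREFIX, or None if the variable is unset
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }


if __name__ == "__main__":
    for key, value in tesseract_diagnostics().items():
        print(f"{key}: {value if value is not None else 'not found'}")
```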




#### PaddleOCR

PaddleOCR is an open-source OCR based on Deep Learning models.

On first use, the relevant language models will be downloaded.

```python
from img2table.ocr import PaddleOCR

# kw is a placeholder for additional keyword arguments
ocr = PaddleOCR(lang="en",
                kw={"kwarg": kw_value, ...})
```

> Parameters
>
> * `lang` (str, optional, default "en"): Lang parameter used in Paddle for text extraction, check the documentation for available languages
> * `kw` (dict, optional, default None): Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.





NB: For usage of PaddleOCR with GPU, the CUDA specific version of paddlepaddle-gpu has to be installed by the user manually as stated in this issue.

```bash
# Example of installation with CUDA 11.8
pip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddleocr img2table
```

If you get an error trying to run PaddleOCR on Ubuntu, please check this issue for a working solution.




#### EasyOCR

EasyOCR is an open-source OCR based on Deep Learning models.

On first use, the relevant language models will be downloaded.

```python
from img2table.ocr import EasyOCR

# kw is a placeholder for additional keyword arguments
ocr = EasyOCR(lang=["en"],
              kw={"kwarg": kw_value, ...})
```

> Parameters
>
> * `lang` (list, optional, default ["en"]): Lang parameter used in EasyOCR for text extraction, check the documentation for available languages
> * `kw` (dict, optional, default None): Dictionary containing additional keyword arguments passed to the EasyOCR Reader constructor.





#### docTR

docTR is an open-source OCR based on Deep Learning models.

*In order to be used, docTR has to be installed by the user beforehand. Installation procedures are detailed in the package documentation.*

```python
from img2table.ocr import DocTR

# kw is a placeholder for additional keyword arguments
ocr = DocTR(detect_language=False,
            kw={"kwarg": kw_value, ...})
```

> Parameters
>
> * `detect_language` (bool, optional, default False): Parameter indicating if language prediction is run on the document
> * `kw` (dict, optional, default None): Dictionary containing additional keyword arguments passed to the docTR ocr_predictor method.





#### Surya OCR

Only available for Python >= 3.10.

Surya is an open-source OCR based on Deep Learning models.

On first use, the relevant models will be downloaded.

```python
from img2table.ocr import SuryaOCR

ocr = SuryaOCR(langs=["en"])
```

> Parameters
>
> * `langs` (list, optional, default ["en"]): Lang parameter used in Surya OCR for text extraction





#### Google Vision

Authentication to GCP can be done by setting the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

If this variable is missing, an API key should be provided via the `api_key` parameter.

```python
from img2table.ocr import VisionOCR

ocr = VisionOCR(api_key="api_key", timeout=15)
```

> Parameters
>
> * `api_key` (str, optional, default None): Google Vision API key
> * `timeout` (int, optional, default 15): API requests timeout, in seconds




#### AWS Textract

When using AWS Textract, only the DetectDocumentText API is called.

Authentication to AWS can be done by passing credentials to the `TextractOCR` class.

If credentials are not provided, authentication is done using environment variables or configuration files. Check the `boto3` [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for more details.

```python
from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")
```

> Parameters
>
> * `aws_access_key_id` (str, optional, default None): AWS access key id
> * `aws_secret_access_key` (str, optional, default None): AWS secret access key
> * `aws_session_token` (str, optional, default None): AWS temporary session token
> * `region` (str, optional, default None): AWS server region




#### Azure Cognitive Services

```python
from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")
```

> Parameters
>
> * `endpoint` (str, optional, default None): Azure Cognitive Services endpoint. If None, inferred from the `COMPUTER_VISION_ENDPOINT` environment variable.
> * `subscription_key` (str, optional, default None): Azure Cognitive Services subscription key. If None, inferred from the `COMPUTER_VISION_SUBSCRIPTION_KEY` environment variable.
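The fallback to environment variables described for the Azure parameters can be pictured as follows. This is a sketch of the documented behaviour, not the library's code. `resolve_azure_settings` is a hypothetical helper, and raising an error when a value is missing is an assumption made for the example.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_azure_settings(endpoint: Optional[str] = None,
                           subscription_key: Optional[str] = None,
                           environ: Mapping[str, str] = os.environ
                           ) -> Tuple[str, str]:
    """Use explicit values when given, otherwise the documented environment variables."""
    endpoint = endpoint or environ.get("COMPUTER_VISION_ENDPOINT")
    subscription_key = subscription_key or environ.get("COMPUTER_VISION_SUBSCRIPTION_KEY")
    if not endpoint or not subscription_key:
        # Assumed behaviour for this sketch: fail early when a value is missing
        raise ValueError("Azure endpoint and subscription key are both required")
    return endpoint, subscription_key
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.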




---

### Table extraction 

Multiple tables can be extracted at once from a PDF page or an image using the `extract_tables` method of a document.

```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      implicit_columns=False,
                                      borderless_tables=False,
                                      min_confidence=50)
```

> Parameters
>
> * `ocr` (OCRInstance, optional, default None): OCR instance used to parse document text. If None, cell content will not be extracted
> * `implicit_rows` (bool, optional, default False): Boolean indicating if implicit rows should be identified - check the related example
> * `implicit_columns` (bool, optional, default False): Boolean indicating if implicit columns should be identified - check the related example
> * `borderless_tables` (bool, optional, default False): Boolean indicating if borderless tables are extracted on top of bordered tables.
> * `min_confidence` (int, optional, default 50): Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)


NB: Borderless table extraction can, by design, only extract tables with 3 or more columns.
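To give an intuition for `min_confidence`, the sketch below filters hypothetical OCR words by their confidence score. The word list and the helper `filter_words` are made up for illustration, and whether the library treats the threshold as inclusive is an assumption.

```python
from typing import List, Tuple


def filter_words(words: List[Tuple[str, int]], min_confidence: int = 50) -> List[str]:
    """Keep words whose OCR confidence (0 to 99) reaches the threshold (assumed inclusive)."""
    return [text for text, confidence in words if confidence >= min_confidence]


if __name__ == "__main__":
    # Hypothetical OCR output: (text, confidence) pairs
    ocr_words = [("Total", 96), ("l2.5O", 31), ("EUR", 50)]
    print(filter_words(ocr_words))
```

Raising the threshold drops more low-quality text, at the cost of possibly leaving some cells empty.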

#### Method return

The [`ExtractedTable`](/src/img2table/tables/objects/extraction.py#L35) class is used to model extracted tables from documents.

> Attributes
>
> * `bbox` (BBox): Table bounding box
> * `title` (str): Extracted title of the table
> * `content` (OrderedDict): Dict with row indexes as keys and lists of TableCell objects as values
> * `df` (pd.DataFrame): Pandas DataFrame representation of the table
> * `html` (str): HTML representation of the table





In order to access bounding boxes at the cell level, you can use the following code snippet:

```python
# table is an ExtractedTable object
for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        value = cell.value
```
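Cell bounding boxes can also be turned into cell dimensions with the same kind of loop. The sketch below uses small stand-in classes with the attribute names documented in this section (`bbox`, `value`, `x1`, `y1`, `x2`, `y2`) so that it runs without the library; with real results, `table.content` would be passed instead. `cell_sizes` is a hypothetical helper.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class BBox:        # stand-in for the library's bounding box object
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass
class TableCell:   # stand-in exposing the attributes used in this section
    bbox: BBox
    value: str


def cell_sizes(content) -> dict:
    """Map (row index, column index) to the (width, height) of each cell."""
    return {
        (id_row, id_col): (cell.bbox.x2 - cell.bbox.x1, cell.bbox.y2 - cell.bbox.y1)
        for id_row, row in enumerate(content.values())
        for id_col, cell in enumerate(row)
    }


if __name__ == "__main__":
    content = OrderedDict({
        0: [TableCell(BBox(0, 0, 100, 20), "Name"), TableCell(BBox(100, 0, 160, 20), "Qty")],
        1: [TableCell(BBox(0, 20, 100, 45), "Bolt"), TableCell(BBox(100, 20, 160, 45), "12")],
    })
    print(cell_sizes(content))
```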

##### Images

The `extract_tables` method of the `Image` class returns a list of `ExtractedTable` objects.

```python
output = [ExtractedTable(...), ExtractedTable(...), ...]
```

##### PDF

The `extract_tables` method of the `PDF` class returns an `OrderedDict` object with page indexes as keys and lists of `ExtractedTable` objects as values.

```python
output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
```
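For PDFs, the per-page dictionary can be summarised or flattened with plain dictionary operations. The following is a minimal sketch with placeholder strings standing in for `ExtractedTable` objects. The helpers `tables_per_page` and `flatten` are hypothetical, and the 1-based page numbers are only a display convenience on top of the 0-based page indexes.

```python
from collections import OrderedDict


def tables_per_page(output) -> dict:
    """Return {1-based page number: table count}, skipping pages without tables."""
    return {page_idx + 1: len(tables) for page_idx, tables in output.items() if tables}


def flatten(output) -> list:
    """Return (page index, table) pairs in page order."""
    return [(page_idx, table) for page_idx, tables in output.items() for table in tables]


if __name__ == "__main__":
    # Placeholder strings stand in for ExtractedTable objects
    output = OrderedDict({0: ["table A", "table B"], 1: [], 2: ["table C"]})
    print(tables_per_page(output))
    print(flatten(output))
```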

### Excel export 

Tables extracted from a document can be exported to an xlsx file. The resulting file is composed of one worksheet per extracted table.

Most method arguments are shared with the `extract_tables` method.

```python
from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of an xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            implicit_columns=False,
            borderless_tables=False,
            min_confidence=50)
```

> Parameters
>
> * `dest` (str, pathlib.Path or io.BytesIO, required): Destination for the xlsx file
> * `ocr` (OCRInstance, optional, default None): OCR instance used to parse document text. If None, cell content will not be extracted
> * `implicit_rows` (bool, optional, default False): Boolean indicating if implicit rows should be identified - check the related example
> * `implicit_columns` (bool, optional, default False): Boolean indicating if implicit columns should be identified - check the related example
> * `borderless_tables` (bool, optional, default False): Boolean indicating if borderless tables are extracted. This requires an OCR instance to be provided to the method - feature in alpha version
> * `min_confidence` (int, optional, default 50): Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)
>
> Returns
>
> If an io.BytesIO buffer is passed as the `dest` argument, it is returned containing the xlsx data
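When `dest` is an `io.BytesIO`, the returned buffer can be stored or forwarded like any other in-memory bytes. The following is a minimal sketch with a hypothetical helper `save_buffer`. The demo uses placeholder bytes rather than a real workbook so that it runs without the library.

```python
import io
import tempfile
from pathlib import Path


def save_buffer(buffer: io.BytesIO, path) -> int:
    """Write the full content of an in-memory buffer to disk and return its size."""
    data = buffer.getvalue()   # independent of the current read position
    Path(path).write_bytes(data)
    return len(data)


if __name__ == "__main__":
    demo = io.BytesIO(b"placeholder xlsx bytes")   # stands in for the returned buffer
    target = Path(tempfile.gettempdir()) / "tables_demo.xlsx"
    print(save_buffer(demo, target), "bytes written to", target)
```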

## Examples 

Several Jupyter notebooks with examples are available:

* Basic usage: generic library usage, including examples with images, PDF and OCRs
* Borderless tables: specific examples dedicated to the extraction of borderless tables
* Implicit content: illustrated effect of the `implicit_rows`/`implicit_columns` parameters of the `extract_tables` method





## Caveats / FYI 





* For table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data can be found are not returned.
* The library is tailored for usage on documents with a white/light background. Effectiveness cannot be guaranteed on other types of documents.
* Table detection using only OpenCV processing can have some limitations. If the library fails to detect tables, you may check CNN-based solutions.