{"id":14977713,"url":"https://github.com/xavctn/img2table","last_synced_at":"2025-05-14T12:12:41.616Z","repository":{"id":65216731,"uuid":"472280519","full_name":"xavctn/img2table","owner":"xavctn","description":"img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing","archived":false,"fork":false,"pushed_at":"2025-02-10T02:38:10.000Z","size":7444,"stargazers_count":693,"open_issues_count":64,"forks_count":98,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-04-11T22:59:24.886Z","etag":null,"topics":["image-processing","opencv","python","table-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xavctn.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-21T10:07:19.000Z","updated_at":"2025-04-11T12:56:15.000Z","dependencies_parsed_at":"2023-01-15T15:15:28.439Z","dependency_job_id":"5ff71236-3e3f-42e6-b8cb-5a1a17bf446e","html_url":"https://github.com/xavctn/img2table","commit_stats":{"total_commits":158,"total_committers":2,"mean_commits":79.0,"dds":0.4620253164556962,"last_synced_commit":"8eb9ca9b97670c4d850a80f3e55176b2b025e04a"},"previous_names":[],"tags_count":56,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xavctn%2Fimg2table","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xavctn%2Fimg2table/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xavctn%2Fimg2table/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xavctn%2Fimg2table/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xavctn","download_url":"https://codeload.github.com/xavctn/img2table/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254140768,"owners_count":22021220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-processing","opencv","python","table-extraction"],"created_at":"2024-09-24T13:56:11.349Z","updated_at":"2025-05-14T12:12:41.594Z","avatar_url":"https://github.com/xavctn.png","language":"Python","readme":"# img2table\n\n`img2table` is a simple, easy to use, table identification and extraction Python Library based on [OpenCV](https://opencv.org/) image \nprocessing that supports most common image file formats as well as PDF files.\n\nThanks to its design, it provides a practical and lighter alternative to Neural Networks based solutions, especially for usage on CPU.\n\n## Table of contents\n* [Installation](#installation)\n* [Features](#features)\n* [Supported file formats](#supported-file-formats)\n* [Usage](#usage)\n   * [Documents](#documents)\n      * [Images](#images-doc)\n      * [PDF](#pdf-doc)\n   * [Supported OCRs](#ocr)\n   * [Table extraction](#table-extract)\n   * [Excel export](#xlsx)\n* [Examples](#examples)\n* [Caveats / FYI](#fyi)\n\n\n## Installation \u003ca name=\"installation\"\u003e\u003c/a\u003e\nThe library can be installed via pip:\n\n\u003e \u003ccode\u003epip install img2table\u003c/code\u003e: Standard installation, supporting Tesseract\u003cbr\u003e\n\u003e \u003ccode\u003epip install img2table[paddle]\u003c/code\u003e: For usage with Paddle OCR\u003cbr\u003e\n\u003e \u003ccode\u003epip install img2table[easyocr]\u003c/code\u003e: For usage with EasyOCR\u003cbr\u003e\n\u003e \u003ccode\u003epip install img2table[surya]\u003c/code\u003e: For usage with Surya OCR\u003cbr\u003e\n\u003e \u003ccode\u003epip install img2table[gcp]\u003c/code\u003e: For usage with Google Vision OCR\u003cbr\u003e\n\u003e \u003ccode\u003epip install img2table[aws]\u003c/code\u003e: For usage with AWS Textract OCR\u003cbr\u003e\n\u003e \u003ccode\u003epip install img2table[azure]\u003c/code\u003e: For usage with Azure Cognitive Services OCR\n\n## Features \u003ca name=\"features\"\u003e\u003c/a\u003e\n\n* Table identification for images and PDF files, including bounding boxes at the table cell level\n* Handling of complex table structures such as merged cells\n* Handling of implicit content - see [example](/examples/Implicit.ipynb)\n* Table content extraction by providing support for OCR services / tools\n* Extracted tables are returned as a simple object, including a Pandas DataFrame representation\n* Export extracted tables to an Excel file, preserving their original structure\n\n## Supported file formats \u003ca name=\"supported-file-formats\"\u003e\u003c/a\u003e\n\n### Images \u003ca name=\"images-formats\"\u003e\u003c/a\u003e\n\nImages are loaded using the `opencv-python` library, supported formats are listed below.\n\n\u003cdetails\u003e\n\u003csummary\u003eSupported image formats\u003c/summary\u003e\n\u003cbr\u003e\n\n\u003cblockquote\u003e\n\u003cul\u003e\n\u003cli\u003eWindows bitmaps - \u003cem\u003e.bmp, \u003c/em\u003e.dib\u003c/li\u003e\n\u003cli\u003eJPEG files - \u003cem\u003e.jpeg, \u003c/em\u003e.jpg, *.jpe\u003c/li\u003e\n\u003cli\u003eJPEG 2000 files - *.jp2\u003c/li\u003e\n\u003cli\u003ePortable Network Graphics - *.png\u003c/li\u003e\n\u003cli\u003eWebP - *.webp\u003c/li\u003e\n\u003cli\u003ePortable image format - \u003cem\u003e.pbm, \u003c/em\u003e.pgm, \u003cem\u003e.ppm \u003c/em\u003e.pxm, *.pnm\u003c/li\u003e\n\u003cli\u003ePFM files - *.pfm\u003c/li\u003e\n\u003cli\u003eSun rasters - \u003cem\u003e.sr, \u003c/em\u003e.ras\u003c/li\u003e\n\u003cli\u003eTIFF files - \u003cem\u003e.tiff, \u003c/em\u003e.tif\u003c/li\u003e\n\u003cli\u003eOpenEXR Image files - *.exr\u003c/li\u003e\n\u003cli\u003eRadiance HDR - \u003cem\u003e.hdr, \u003c/em\u003e.pic\u003c/li\u003e\n\u003cli\u003eRaster and Vector geospatial data supported by GDAL\u003cbr\u003e\n\u003ccite\u003e\u003ca href=\"https://docs.opencv.org/4.x/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56\"\u003eOpenCV: Image file reading and writing\u003c/a\u003e\u003c/cite\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003c/blockquote\u003e\n\u003c/details\u003e\nMulti-page images are not supported.\n\n---\n\n### PDF \u003ca name=\"pdf-formats\"\u003e\u003c/a\u003e\n\nBoth native and scanned PDF files are supported.\n\n## Usage \u003ca name=\"usage\"\u003e\u003c/a\u003e\n\n### Documents \u003ca name=\"documents\"\u003e\u003c/a\u003e\n\n#### Images \u003ca name=\"images-doc\"\u003e\u003c/a\u003e\nImages are instantiated as follows :\n```python\nfrom img2table.document import Image\n\nimage = Image(src, \n              detect_rotation=False)\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003esrc : str, \u003ccode\u003epathlib.Path\u003c/code\u003e, bytes or \u003ccode\u003eio.BytesIO\u003c/code\u003e, required\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eImage source\u003c/dd\u003e\n\u003e    \u003cdt\u003edetect_rotation : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDetect and correct skew/rotation of the image\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\u003cbr\u003e\nThe implemented method to handle skewed/rotated images supports skew angles up to 45° and is\nbased on the publication by \u003ca href=\"https://www.mdpi.com/2079-9292/9/1/55\"\u003eHuang, 2020\u003c/a\u003e.\u003cbr\u003e\nSetting the \u003ccode\u003edetect_rotation\u003c/code\u003e parameter to \u003ccode\u003eTrue\u003c/code\u003e, image coordinates and bounding boxes returned by other \nmethods might not correspond to the original image.\n\n#### PDF \u003ca name=\"pdf-doc\"\u003e\u003c/a\u003e\nPDF files are instantiated as follows :\n```python\nfrom img2table.document import PDF\n\npdf = PDF(src, \n          pages=[0, 2],\n          detect_rotation=False,\n          pdf_text_extraction=True)\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003esrc : str, \u003ccode\u003epathlib.Path\u003c/code\u003e, bytes or \u003ccode\u003eio.BytesIO\u003c/code\u003e, required\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003ePDF source\u003c/dd\u003e\n\u003e    \u003cdt\u003epages : list, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eList of PDF page indexes to be processed. If None, all pages are processed\u003c/dd\u003e\n\u003e    \u003cdt\u003edetect_rotation : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDetect and correct skew/rotation of extracted images from the PDF\u003c/dd\u003e\n\u003e    \u003cdt\u003epdf_text_extraction : bool, optional, default \u003ccode\u003eTrue\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eExtract text from the PDF file for native PDFs\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\nPDF pages are converted to images with a 200 DPI for table identification.\n\n---\n\n### OCR \u003ca name=\"ocr\"\u003e\u003c/a\u003e\n\n`img2table` provides an interface for several OCR services and tools in order to parse table content.\u003cbr\u003e\nIf possible (i.e for native PDF), PDF text will be extracted directly from the file and the OCR service/tool will not be called.\n\n\u003cdetails\u003e\n\u003csummary\u003eTesseract\u003ca name=\"tesseract\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\n```python\nfrom img2table.ocr import TesseractOCR\n\nocr = TesseractOCR(n_threads=1, \n                   lang=\"eng\", \n                   psm=11,\n                   tessdata_dir=\"...\")\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003en_threads : int, optional, default \u003ccode\u003e1\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eNumber of concurrent threads used to call Tesseract\u003c/dd\u003e\n\u003e    \u003cdt\u003elang : str, optional, default \u003ccode\u003e\"eng\"\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eLang parameter used in Tesseract for text extraction\u003c/dd\u003e\n\u003e    \u003cdt\u003epsm : int, optional, default \u003ccode\u003e11\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003ePSM parameter used in Tesseract, run \u003ccode\u003etesseract --help-psm\u003c/code\u003e for details\u003c/dd\u003e\n\u003e    \u003cdt\u003etessdata_dir : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDirectory containing Tesseract traineddata files. If None, the \u003ccode\u003eTESSDATA_PREFIX\u003c/code\u003e env variable is used.\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\n\n*Usage of [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) requires prior installation. \nCheck [documentation](https://tesseract-ocr.github.io/tessdoc/) for instructions.*\n\u003cbr\u003e\n*For Windows users getting environment variable errors, you can check this [tutorial](https://linuxhint.com/install-tesseract-windows/)*\n\u003cbr\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ePaddleOCR\u003ca name=\"paddle\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\n\u003ca href=\"https://github.com/PaddlePaddle/PaddleOCR\"\u003ePaddleOCR\u003c/a\u003e is an open-source OCR based on Deep Learning models.\u003cbr\u003e\nAt first use, relevant languages models will be downloaded.\n\n```python\nfrom img2table.ocr import PaddleOCR\n\nocr = PaddleOCR(lang=\"en\",\n                kw={\"kwarg\": kw_value, ...})\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003elang : str, optional, default \u003ccode\u003e\"en\"\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eLang parameter used in Paddle for text extraction, check \u003ca href=\"https://github.com/Mushroomcat9998/PaddleOCR/blob/main/doc/doc_en/multi_languages_en.md#5-support-languages-and-abbreviations\"\u003edocumentation for available languages\u003c/a\u003e\u003c/dd\u003e\n\u003e    \u003cdt\u003ekw : dict, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDictionary containing additional keyword arguments passed to the PaddleOCR constructor.\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\n\u003cbr\u003e\n\u003cb\u003eNB:\u003c/b\u003e For usage of PaddleOCR with GPU, the CUDA specific version of paddlepaddle-gpu has to be installed by the user manually \nas stated in this \u003ca href=\"https://github.com/PaddlePaddle/PaddleOCR/issues/7993\"\u003eissue\u003c/a\u003e.\n\n```bash\n# Example of installation with CUDA 11.8\npip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html\npip install paddleocr img2table\n```\n\nIf you get an error trying to run PaddleOCR on Ubuntu,\nplease check this \u003ca href=\"https://github.com/PaddlePaddle/PaddleOCR/discussions/9989#discussioncomment-6642037\"\u003eissue\u003c/a\u003e for a working solution.\n\n\u003cbr\u003e\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eEasyOCR\u003ca name=\"easyocr\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\n\u003ca href=\"https://github.com/JaidedAI/EasyOCR\"\u003eEasyOCR\u003c/a\u003e is an open-source OCR based on Deep Learning models.\u003cbr\u003e\nAt first use, relevant languages models will be downloaded.\n\n```python\nfrom img2table.ocr import EasyOCR\n\nocr = EasyOCR(lang=[\"en\"],\n              kw={\"kwarg\": kw_value, ...})\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003elang : list, optional, default \u003ccode\u003e[\"en\"]\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eLang parameter used in EasyOCR for text extraction, check \u003ca href=\"https://www.jaided.ai/easyocr\"\u003edocumentation for available languages\u003c/a\u003e\u003c/dd\u003e\n\u003e    \u003cdt\u003ekw : dict, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDictionary containing additional keyword arguments passed to the EasyOCR \u003ccode\u003eReader\u003c/code\u003e constructor.\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\n\u003cbr\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003edocTR\u003ca name=\"docTR\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\n\u003ca href=\"https://github.com/mindee/doctr\"\u003edocTR\u003c/a\u003e is an open-source OCR based on Deep Learning models.\u003cbr\u003e\n*In order to be used, docTR has to be installed by the user beforehand. Installation procedures are detailed in\nthe package documentation*\n\n```python\nfrom img2table.ocr import DocTR\n\nocr = DocTR(detect_language=False,\n            kw={\"kwarg\": kw_value, ...})\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003edetect_language : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eParameter indicating if language prediction is run on the document\u003c/dd\u003e\n\u003e    \u003cdt\u003ekw : dict, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDictionary containing additional keyword arguments passed to the docTR \u003ccode\u003eocr_predictor\u003c/code\u003e method.\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\n\u003cbr\u003e\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eSurya OCR\u003ca name=\"surya\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\n\u003cb\u003e\u003ci\u003eOnly available for \u003ccode\u003epython \u003e= 3.10\u003c/code\u003e\u003c/i\u003e\u003c/b\u003e\u003cbr\u003e\n\u003ca href=\"https://github.com/VikParuchuri/surya\"\u003eSurya\u003c/a\u003e is an open-source OCR based on Deep Learning models.\u003cbr\u003e\nAt first use, relevant models will be downloaded.\n\n```python\nfrom img2table.ocr import SuryaOCR\n\nocr = SuryaOCR(langs=[\"en\"])\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003elangs : list, optional, default \u003ccode\u003e[\"en\"]\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eLang parameter used in Surya OCR for text extraction\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\n\u003cbr\u003e\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eGoogle Vision\u003ca name=\"vision\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\nAuthentication to GCP can be done by setting the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable.\u003cbr\u003e\nIf this variable is missing, an API key should be provided via the `api_key` parameter.\n\n```python\nfrom img2table.ocr import VisionOCR\n\nocr = VisionOCR(api_key=\"api_key\", timeout=15)\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003eapi_key : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eGoogle Vision API key\u003c/dd\u003e\n\u003e    \u003cdt\u003etimeout : int, optional, default \u003ccode\u003e15\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eAPI requests timeout, in seconds\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\u003cbr\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAWS Textract\u003ca name=\"textract\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\nWhen using AWS Textract, the DetectDocumentText API is exclusively called.\n\nAuthentication to AWS can be done by passing credentials to the `TextractOCR` class.\u003cbr\u003e\nIf credentials are not provided, authentication is done using environment variables or configuration files. \nCheck `boto3` [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html) for more details.\n\n```python\nfrom img2table.ocr import TextractOCR\n\nocr = TextractOCR(aws_access_key_id=\"***\",\n                  aws_secret_access_key=\"***\",\n                  aws_session_token=\"***\",\n                  region=\"eu-west-1\")\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003eaws_access_key_id : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eAWS access key id\u003c/dd\u003e\n\u003e    \u003cdt\u003eaws_secret_access_key : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eAWS secret access key\u003c/dd\u003e\n\u003e    \u003cdt\u003eaws_session_token : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eAWS temporary session token\u003c/dd\u003e\n\u003e    \u003cdt\u003eregion : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eAWS server region\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\u003cbr\u003e\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eAzure Cognitive Services\u003ca name=\"azure\"\u003e\u003c/a\u003e\u003c/summary\u003e\n\u003cbr\u003e\n\n```python\nfrom img2table.ocr import AzureOCR\n\nocr = AzureOCR(endpoint=\"abc.azure.com\",\n               subscription_key=\"***\")\n```\n\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003eendpoint : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eAzure Cognitive Services endpoint. If None, inferred from the \u003ccode\u003eCOMPUTER_VISION_ENDPOINT\u003c/code\u003e environment variable.\u003c/dd\u003e\n\u003e    \u003cdt\u003esubscription_key : str, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eAzure Cognitive Services subscription key. If None, inferred from the \u003ccode\u003eCOMPUTER_VISION_SUBSCRIPTION_KEY\u003c/code\u003e environment variable.\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\u003cbr\u003e\n\u003c/details\u003e\n\n---\n\n### Table extraction \u003ca name=\"table-extract\"\u003e\u003c/a\u003e\n\nMultiple tables can be extracted at once from a PDF page/ an image using the `extract_tables` method of a document.\n\n```python\nfrom img2table.ocr import TesseractOCR\nfrom img2table.document import Image\n\n# Instantiation of OCR\nocr = TesseractOCR(n_threads=1, lang=\"eng\")\n\n# Instantiation of document, either an image or a PDF\ndoc = Image(src)\n\n# Table extraction\nextracted_tables = doc.extract_tables(ocr=ocr,\n                                      implicit_rows=False,\n                                      implicit_columns=False,\n                                      borderless_tables=False,\n                                      min_confidence=50)\n```\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003eocr : OCRInstance, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eOCR instance used to parse document text. If None, cells content will not be extracted\u003c/dd\u003e\n\u003e    \u003cdt\u003eimplicit_rows : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eBoolean indicating if implicit rows should be identified - check related \u003ca href=\"/examples/Implicit.ipynb\" target=\"_self\"\u003eexample\u003c/a\u003e\u003c/dd\u003e\n\u003e    \u003cdt\u003eimplicit_columns : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eBoolean indicating if implicit columns should be identified - check related \u003ca href=\"/examples/Implicit.ipynb\" target=\"_self\"\u003eexample\u003c/a\u003e\u003c/dd\u003e\n\u003e    \u003cdt\u003eborderless_tables : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eBoolean indicating if \u003ca href=\"/examples/borderless.ipynb\" target=\"_self\"\u003eborderless tables\u003c/a\u003e are extracted \u003cb\u003eon top of\u003c/b\u003e bordered tables.\u003c/dd\u003e\n\u003e    \u003cdt\u003emin_confidence : int, optional, default \u003ccode\u003e50\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eMinimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\n\u003cb\u003eNB\u003c/b\u003e: Borderless table extraction can, by design, only extract tables with 3 or more columns.\n\n#### Method return\n\nThe [`ExtractedTable`](/src/img2table/tables/objects/extraction.py#L35) class is used to model extracted tables from documents.\n\n\u003e \u003ch4\u003eAttributes\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003ebbox : \u003ccode\u003eBBox\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eTable bounding box\u003c/dd\u003e\n\u003e    \u003cdt\u003etitle : str\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eExtracted title of the table\u003c/dd\u003e\n\u003e    \u003cdt\u003econtent : \u003ccode\u003eOrderedDict\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDict with row indexes as keys and list of \u003ccode\u003eTableCell\u003c/code\u003e objects as values\u003c/dd\u003e\n\u003e    \u003cdt\u003edf : \u003ccode\u003epd.DataFrame\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003ePandas DataFrame representation of the table\u003c/dd\u003e\n\u003e    \u003cdt\u003ehtml : \u003ccode\u003estr\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eHTML representation of the table\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\n\u003cbr\u003e\n\nIn order to access bounding boxes at the cell level, you can use the following code snippet :\n```python\nfor id_row, row in enumerate(table.content.values()):\n    for id_col, cell in enumerate(row):\n        x1 = cell.bbox.x1\n        y1 = cell.bbox.y1\n        x2 = cell.bbox.x2\n        y2 = cell.bbox.y2\n        value = cell.value\n```\n\n\u003ch5 style=\"color:grey\"\u003eImages\u003c/h5\u003e\n\n`extract_tables` method from the `Image` class returns a list of `ExtractedTable` objects. \n```Python\noutput = [ExtractedTable(...), ExtractedTable(...), ...]\n```\n\n\u003ch5 style=\"color:grey\"\u003ePDF\u003c/h5\u003e\n\n`extract_tables` method from the `PDF` class returns an `OrderedDict` object with page indexes as keys and lists of `ExtractedTable` objects. \n```Python\noutput = {\n    0: [ExtractedTable(...), ...],\n    1: [],\n    ...\n    last_page: [ExtractedTable(...), ...]\n}\n```\n\n\n### Excel export \u003ca name=\"xlsx\"\u003e\u003c/a\u003e\n\nTables extracted from a document can be exported to a xlsx file. The resulting file is composed of one worksheet per extracted table.\u003cbr\u003e\nMethod arguments are mostly common with the `extract_tables` method.\n\n```python\nfrom img2table.ocr import TesseractOCR\nfrom img2table.document import Image\n\n# Instantiation of OCR\nocr = TesseractOCR(n_threads=1, lang=\"eng\")\n\n# Instantiation of document, either an image or a PDF\ndoc = Image(src)\n\n# Extraction of tables and creation of a xlsx file containing tables\ndoc.to_xlsx(dest=dest,\n            ocr=ocr,\n            implicit_rows=False,\n            implicit_columns=False,\n            borderless_tables=False,\n            min_confidence=50)\n```\n\u003e \u003ch4\u003eParameters\u003c/h4\u003e\n\u003e\u003cdl\u003e\n\u003e    \u003cdt\u003edest : str, \u003ccode\u003epathlib.Path\u003c/code\u003e or \u003ccode\u003eio.BytesIO\u003c/code\u003e, required\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eDestination for xlsx file\u003c/dd\u003e\n\u003e    \u003cdt\u003eocr : OCRInstance, optional, default \u003ccode\u003eNone\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eOCR instance used to parse document text. If None, cells content will not be extracted\u003c/dd\u003e\n\u003e    \u003cdt\u003eimplicit_rows : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eBoolean indicating if implicit rows should be identified - check related \u003ca href=\"/examples/Implicit.ipynb\" target=\"_self\"\u003eexample\u003c/a\u003e\u003c/dd\u003e\n\u003e    \u003cdt\u003eimplicit_rows : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eBoolean indicating if implicit columns should be identified - check related \u003ca href=\"/examples/Implicit.ipynb\" target=\"_self\"\u003eexample\u003c/a\u003e\u003c/dd\u003e\n\u003e    \u003cdt\u003eborderless_tables : bool, optional, default \u003ccode\u003eFalse\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eBoolean indicating if \u003ca href=\"/examples/borderless.ipynb\" target=\"_self\"\u003eborderless tables\u003c/a\u003e are extracted. It requires to provide an OCR to the method in order to be performed - \u003cb\u003efeature in alpha version\u003c/b\u003e\u003c/dd\u003e\n\u003e    \u003cdt\u003emin_confidence : int, optional, default \u003ccode\u003e50\u003c/code\u003e\u003c/dt\u003e\n\u003e    \u003cdd style=\"font-style: italic;\"\u003eMinimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)\u003c/dd\u003e\n\u003e\u003c/dl\u003e\n\u003e \u003ch4\u003eReturns\u003c/h4\u003e\n\u003e If a \u003ccode\u003eio.BytesIO\u003c/code\u003e buffer is passed as dest arg, it is returned containing xlsx data\n\n\n\n## Examples \u003ca name=\"examples\"\u003e\u003c/a\u003e\n\nSeveral Jupyter notebooks with examples are available :\n\u003cul\u003e\n\u003cli\u003e\n\u003ca href=\"/examples/Basic_usage.ipynb\" target=\"_self\"\u003eBasic usage\u003c/a\u003e: generic library usage, including examples with images, PDF and OCRs\n\u003c/li\u003e\n\u003cli\u003e\n\u003ca href=\"/examples/borderless.ipynb\" target=\"_self\"\u003eBorderless tables\u003c/a\u003e: specific examples dedicated to the extraction of borderless tables\n\u003c/li\u003e\n\u003cli\u003e\n\u003ca href=\"/examples/Implicit.ipynb\" target=\"_self\"\u003eImplicit content\u003c/a\u003e: illustrated effect \nof the parameter \u003ccode\u003eimplicit_rows\u003c/code\u003e/\u003ccode\u003eimplicit_columns\u003c/code\u003e of the \u003ccode\u003eextract_tables\u003c/code\u003e method\n\u003c/li\u003e\n\u003c/ul\u003e\n\n## Caveats / FYI \u003ca name=\"fyi\"\u003e\u003c/a\u003e\n\n\u003cul\u003e\n\u003cli\u003e\nFor table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data \ncan be found are not returned.\n\u003c/li\u003e\n\u003cli\u003e\nThe library is tailored for usage on documents with white/light background. \nEffectiveness can not be guaranteed on other type of documents. \n\u003c/li\u003e\n\u003cli\u003e\nTable detection using only OpenCV processing can have some limitations. If the library fails to detect tables, \nyou may check CNN based solutions.\n\u003c/li\u003e\n\u003c/ul\u003e\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxavctn%2Fimg2table","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxavctn%2Fimg2table","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxavctn%2Fimg2table/lists"}