{"id":26127487,"url":"https://github.com/icaropires/pdf2dataset","last_synced_at":"2025-04-13T16:53:47.033Z","repository":{"id":57451448,"uuid":"275980625","full_name":"icaropires/pdf2dataset","owner":"icaropires","description":"Converts a whole subdirectory with a big (or small) volume of PDF documents to a dataset (pandas DataFrame) with error tracking and choice of features","archived":false,"fork":false,"pushed_at":"2020-09-20T03:29:45.000Z","size":306,"stargazers_count":19,"open_issues_count":12,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-07-04T21:55:53.320Z","etag":null,"topics":["data-science","distributed-computing","distributed-systems","ocr","pandas-dataframe","parallel","parquet","pdf","pdf2image","pdftotext","pyarrow","pytesseract","pytesseract-ocr","python","python3","ray","tesseract","tesseract-ocr"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/icaropires.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-30T02:51:29.000Z","updated_at":"2024-07-02T19:19:53.000Z","dependencies_parsed_at":"2022-09-04T10:40:36.965Z","dependency_job_id":null,"html_url":"https://github.com/icaropires/pdf2dataset","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icaropires%2Fpdf2dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icaropires%2Fpdf2dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icaropires%2Fpdf2dataset/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/icaropires%2Fpdf2dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/icaropires","download_url":"https://codeload.github.com/icaropires/pdf2dataset/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242900073,"owners_count":20203704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","distributed-computing","distributed-systems","ocr","pandas-dataframe","parallel","parquet","pdf","pdf2image","pdftotext","pyarrow","pytesseract","pytesseract-ocr","python","python3","ray","tesseract","tesseract-ocr"],"created_at":"2025-03-10T18:08:35.979Z","updated_at":"2025-03-10T18:08:36.738Z","avatar_url":"https://github.com/icaropires.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf2dataset\n\n[![pdf2dataset](https://github.com/icaropires/pdf2dataset/workflows/pdf2dataset/badge.svg?branch=master)](https://github.com/icaropires/pdf2dataset)\n[![pypi](https://img.shields.io/pypi/v/pdf2dataset.svg)](https://pypi.python.org/pypi/pdf2dataset)\n[![Maintainability](https://api.codeclimate.com/v1/badges/cbe90c3043b038f52b18/maintainability)](https://codeclimate.com/github/icaropires/pdf2dataset/maintainability)\n[![codecov](https://codecov.io/gh/icaropires/pdf2dataset/branch/master/graph/badge.svg)](https://codecov.io/gh/icaropires/pdf2dataset)\n[![pypi-stats](https://img.shields.io/pypi/dm/pdf2dataset)](https://pypistats.org/packages/pdf2dataset)\n\nConverts a whole 
subdirectory with any volume (small or huge) of PDF documents to a dataset (pandas DataFrame).\nNo need to set up any external service (no database, brokers, etc.). Just install and run it!\n\n\n## Main features\n\n* Conversion of a whole subdirectory with PDF documents into a pandas DataFrame\n* Support for parallel and distributed processing through [ray](https://github.com/ray-project/ray)\n* Extractions are performed by page, making task distribution more uniform when documents differ greatly in number of pages\n* Incremental writing of the resulting DataFrame, making it possible to process data bigger than memory\n* Error tracking of faulty documents\n* Resume interrupted processing\n* Extract text through [pdftotext](https://github.com/jalan/pdftotext)\n* Use OCR for extracting text through [pytesseract](https://github.com/madmaze/pytesseract)\n* Extract images through [pdf2image](https://github.com/Belval/pdf2image)\n* Support for implementing custom feature extraction\n* Highly customizable behavior through params\n\n\n## Installation\n\n### Install Dependencies\n\n#### Fedora\n\n``` bash\n# \"-por\" is for Portuguese; use your documents' language\n$ sudo dnf install -y gcc-c++ poppler-utils pkgconfig poppler-cpp-devel python3-devel tesseract-langpack-por\n```\n\n#### Ubuntu (or other Debian-based distributions)\n\n``` bash\n$ sudo apt update\n\n# \"-por\" is for Portuguese; use your documents' language\n$ sudo apt install -y build-essential poppler-utils libpoppler-cpp-dev pkg-config python3-dev tesseract-ocr-por\n```\n\n### Install pdf2dataset\n\n#### For usage\n\n``` bash\n$ pip3 install pdf2dataset --user  # Please, isolate the environment\n```\n\n#### For development\n\n``` bash\n# First, install poetry, clone the repository and cd into it\n$ poetry install\n```\n\n\n## Usage\n\n### Simple - CLI\n\n``` bash\n# Note: path, page and error will always be present in the resulting DataFrame\n\n# Reads all PDFs from my_pdfs_dir and saves the resulting dataframe to my_df.parquet.gzip\n$ 
pdf2dataset my_pdfs_dir my_df.parquet.gzip  # Most basic, extract all possible features\n$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=text  # Extract just text\n$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --features=image  # Extract just image\n$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --num-cpus 1  # Minimum parallelism\n$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true  # For scanned PDFs\n$ pdf2dataset my_pdfs_dir my_df.parquet.gzip --ocr true --ocr-lang eng  # For scanned documents with English text\n```\n\n### Resume processing\n\nIf processing is interrupted, just use the same output path and the\nprocessing will resume automatically. The flag `--saving-interval` (or the param `saving_interval`)\ncontrols how often the output path is updated, and therefore how often the processing is \"checkpointed\".\n\n\n### Using as a library\n\n#### Main functions\n\nThere are some helper functions to facilitate pdf2dataset usage:\n\n* **extract:** can be used analogously to the CLI\n* **extract_text**: `extract` wrapper with `features=text`\n* **extract_image**: `extract` wrapper with `features=image`\n* **image_from_bytes:** (pdf2image.utils) get a Pillow `Image` object given the image bytes\n* **image_to_bytes:** (pdf2image.utils) get the image bytes given a Pillow `Image` object\n\n#### Basic example\n``` python\nfrom pdf2dataset import extract\n\nextract('my_pdfs_dir', 'all_features.parquet.gzip')\n```\n\n#### Small data\n\nOne feature, not available in the CLI, is the custom behavior for handling small volumes of data (\"small\" here\nmeans that the extraction won't run for hours or days and won't be distributed).\n\nThe complete list of differences is:\n\n* Faster initialization (uses multiprocessing instead of ray)\n* Doesn't save processing progress\n* Distributed processing is not supported\n* Doesn't write the dataframe to disk\n* Returns the dataframe\n\n##### Example:\n``` python\nfrom pdf2dataset 
import extract_text\n\ndf = extract_text('my_pdfs_dir', small=True)\n# ...\n```\n\n#### Pass a list of file paths\n\nInstead of specifying a directory, one can specify a list of files to be processed.\n\n##### Example:\n\n``` python\nfrom pdf2dataset import extract\n\n\nmy_files = [\n    './tests/samples/single_page1.pdf',\n    './tests/samples/invalid1.pdf',\n]\n\ndf = extract(my_files, small=True)\n# ...\n```\n\n#### Pass files from memory\n\nIf you don't want to specify a directory for the documents, you can specify the tasks that\nwill be processed.\n\nThe tasks can be of the form `(document_name, document_bytes, page_number)`\nor just `(document_name, document_bytes)`. **document_name** must end with `.pdf` but\ndoesn't need to be a real file, **document_bytes** are the bytes of the PDF document, and\n**page_number** is the number of the page to process (all pages, if not specified).\n\n##### Example:\n\n``` python\nfrom pdf2dataset import extract_text\n\ntasks = [\n    ('a.pdf', a_bytes),  # Processing all pages of this document\n    ('b.pdf', b_bytes, 1),\n    ('b.pdf', b_bytes, 2),\n]\n\n# 'df' will contain results from all pages of 'a.pdf' and pages 1 and 2 of 'b.pdf'\ndf = extract_text(tasks, 'my_df.parquet.gzip', small=True)\n\n# ...\n```\n\n#### Returning a list\n\nIf you don't want to handle the DataFrame, it is possible to return a nested list with the feature values.\nThe structure of the resulting list is:\n```\nresult = List[documents]\ndocuments = List[pages]\npages = List[features]\nfeatures = List[feature]\nfeature = any\n```\n\n* `any` is any type supported by pyarrow.\n* features are ordered by the feature name (`text`, `image`, etc.)\n\n##### Example:\n\n``` python\n\u003e\u003e\u003e from pdf2dataset import extract_text\n\u003e\u003e\u003e extract_text('tests/samples', return_list=True)\n[[[None]],\n [['First page'], ['Second page'], ['Third page']],\n [['My beautiful sample!']],\n [['First page'], ['Second page'], ['Third page']],\n [['My 
beautiful sample!']]]\n```\n\n* Features with errors will have `None` as their result\n* Here, `extract_text` was used, so the only feature is `text`\n\n#### Custom Features\n\nWith version \u003e= 0.4.0, it is also possible to easily implement extraction of custom features:\n\n##### Example:\n\nThis is the structure:\n\n``` python\nfrom pdf2dataset import extract, feature, PdfExtractTask\n\n\nclass MyCustomTask(PdfExtractTask):\n\n    @feature('bool_')\n    def get_is_page_even(self):\n        return self.page % 2 == 0\n\n    @feature('binary')\n    def get_doc_first_bytes(self):\n        return self.file_bin[:10]\n\n    @feature('string', exceptions=[ValueError])\n    def get_wrong(self):\n        raise ValueError(\"There was a problem!\")\n\n\nif __name__ == '__main__':\n    df = extract('tests/samples', small=True, task_class=MyCustomTask)\n    print(df)\n\n    df.dropna(subset=['text'], inplace=True)  # Discard invalid documents\n    print(df.iloc[0].error)\n```\n\n* First print:\n```\n                         path  page doc_first_bytes  ...                  text  wrong                                              error\n0                invalid1.pdf    -1   b\"I'm invali\"  ...                  None   None  image_original:\\nTraceback (most recent call l...\n1             multi_page1.pdf     2  b'%PDF-1.5\\n%'  ...           Second page   None  wrong:\\nTraceback (most recent call last):\\n  ...\n2             multi_page1.pdf     3  b'%PDF-1.5\\n%'  ...            Third page   None  wrong:\\nTraceback (most recent call last):\\n  ...\n3   sub1/copy_multi_page1.pdf     1  b'%PDF-1.5\\n%'  ...            First page   None  wrong:\\nTraceback (most recent call last):\\n  ...\n4   sub1/copy_multi_page1.pdf     3  b'%PDF-1.5\\n%'  ...            Third page   None  wrong:\\nTraceback (most recent call last):\\n  ...\n5             multi_page1.pdf     1  b'%PDF-1.5\\n%'  ...            
First page   None  wrong:\\nTraceback (most recent call last):\\n  ...\n6  sub2/copy_single_page1.pdf     1  b'%PDF-1.5\\n%'  ...  My beautiful sample!   None  wrong:\\nTraceback (most recent call last):\\n  ...\n7   sub1/copy_multi_page1.pdf     2  b'%PDF-1.5\\n%'  ...           Second page   None  wrong:\\nTraceback (most recent call last):\\n  ...\n8            single_page1.pdf     1  b'%PDF-1.5\\n%'  ...  My beautiful sample!   None  wrong:\\nTraceback (most recent call last):\\n  ...\n\n[9 rows x 8 columns]\n```\n\n* Second print:\n```\nwrong:\nTraceback (most recent call last):\n  File \"/home/icaro/Desktop/pdf2dataset/pdf2dataset/extract_task.py\", line 32, in inner\n    result = feature_method(*args, **kwargs)\n  File \"example.py\", line 16, in get_wrong\n    raise ValueError(\"There was a problem!\")\nValueError: There was a problem!\n\n```\n\nNotes:\n* `@feature` is the decorator used to define new features.\n* The extraction method name must start with the prefix `get_` (avoids collisions with attribute names and increases readability)\n* The first argument to `@feature` must be a valid PyArrow type; the complete list is [here](https://arrow.apache.org/docs/python/api/datatypes.html)\n* The `exceptions` param specifies a list of exceptions to be recorded in the DataFrame; otherwise they are raised\n* For this example, all available features plus the custom ones are extracted\n\n\n### Results File\n\nThe resulting \"file\" is a directory with the structure specified by dask with the pyarrow engine.\nIt can be easily read with pandas or dask:\n\n#### Example with pandas\n``` python\n\u003e\u003e\u003e import pandas as pd\n\u003e\u003e\u003e df = pd.read_parquet('my_df.parquet.gzip', engine='pyarrow')\n\u003e\u003e\u003e df\n                             path  page                  text                                              error\nindex                                                                                                           \n0                single_page1.pdf    
 1  My beautiful sample!                                                   \n1       sub1/copy_multi_page1.pdf     2           Second page                                                   \n2      sub2/copy_single_page1.pdf     1  My beautiful sample!                                                   \n3       sub1/copy_multi_page1.pdf     3            Third page                                                   \n4                 multi_page1.pdf     1            First page                                                   \n5                 multi_page1.pdf     3            Third page                                                   \n6       sub1/copy_multi_page1.pdf     1            First page                                                   \n7                 multi_page1.pdf     2           Second page                                                   \n0                    invalid1.pdf    -1                        Traceback (most recent call last):\\n  File \"/h...\n```\n\nThere is no guarantee about the uniqueness or order of `index`; you might need to create a new index with\nthe whole data in memory.\n\nThe `-1` page number means that it was not even possible to parse the document.\n\n### Run on a Cluster\n\n#### Set up the Cluster\n\nFollow the ray documentation for [manual](https://docs.ray.io/en/latest/using-ray-on-a-cluster.html?setup#manual-cluster-setup) or [automatic](https://docs.ray.io/en/latest/autoscaling.html?setup#automatic-cluster-setup)\nsetup.\n\n#### Run it\n\nTo go distributed, you can run it just like locally, but using the `--address` and `--redis-password` flags to point to your cluster ([More information](https://docs.ray.io/en/latest/multiprocessing.html)).\n\nWith version \u003e= 0.2.0, only the head node needs to have access to the documents on disk.\n\n\n### CLI Help\n\n```\nusage: pdf2dataset [-h] [--features FEATURES]\n                   [--saving-interval SAVING_INTERVAL] [--ocr-lang OCR_LANG]\n                   [--ocr OCR] 
[--chunksize CHUNKSIZE]\n                   [--image-size IMAGE_SIZE] [--ocr-image-size OCR_IMAGE_SIZE]\n                   [--image-format IMAGE_FORMAT] [--num-cpus NUM_CPUS]\n                   [--address ADDRESS] [--dashboard-host DASHBOARD_HOST]\n                   [--redis-password REDIS_PASSWORD]\n                   input_dir out_file\n\nExtract text from all PDF files in a directory\n\npositional arguments:\n  input_dir             The folder to lookup for PDF files recursively\n  out_file              File to save the resultant dataframe\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --features FEATURES   Specify a comma separated list with the features you\n                        want to extract. 'path' and 'page' will always be\n                        added. Available features to add: image, page, path,\n                        text Examples: '--features=text,image' or '--\n                        features=all'\n  --saving-interval SAVING_INTERVAL\n                        Results will be persisted to results folder every\n                        saving interval of pages\n  --ocr-lang OCR_LANG   Tesseract language\n  --ocr OCR             'pytesseract' if true, else 'pdftotext'. default:\n                        false\n  --chunksize CHUNKSIZE\n                        Chunksize to use while processing pages, otherwise is\n                        calculated\n  --image-size IMAGE_SIZE\n                        If adding image feature, image will be resized to this\n                        size. Provide two integers separated by 'x'. Example:\n                        --image-size 1000x1414\n  --ocr-image-size OCR_IMAGE_SIZE\n                        The height of the image OCR will be applied. 
Width\n                        will be adjusted to keep the ratio.\n  --image-format IMAGE_FORMAT\n                        Format of the image generated from the PDF pages\n  --num-cpus NUM_CPUS   Number of cpus to use\n  --address ADDRESS     Ray address to connect\n  --dashboard-host DASHBOARD_HOST\n                        Which IP ray webui will try to listen on\n  --redis-password REDIS_PASSWORD\n                        Redis password to use to connect with ray\n```\n\n\n## Troubleshooting\n\n1. **Troubles with high memory usage**\n\n* Decrease the number of CPUs in use to reduce the level of parallelism. Test with the `--num-cpus 1` flag, then increase according to your hardware.\n\n* Use a smaller chunksize, so fewer documents are put in memory at once. Use `--chunksize 1` to have `1 * num_cpus` documents in memory at once.\n\n\n## How to Contribute\n\nJust open your [issues](https://github.com/icaropires/pdf2dataset/issues) and/or [pull requests](https://github.com/icaropires/pdf2dataset/pulls); all are welcome :smiley:!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ficaropires%2Fpdf2dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ficaropires%2Fpdf2dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ficaropires%2Fpdf2dataset/lists"}