https://github.com/strickvl/pdfsplitter

Turn PDFs into image files for use in machine learning projects
computer-vision data-science machine-learning pdf python

# pdfsplitter
> A simple way to extract and parse images for machine learning workflows.

## What is pdfsplitter?

There are lots of repeated tasks you have to perform when working with PDF files for a machine learning project. I found myself wanting a tool that could handle some of the more common parts of this. Not finding anything suitable, I built something for myself.

## Features

- downloading all the PDF files linked on a web page
- exporting a single image file for each page of a PDF
- generating statistics to get an overview of the total page count of the PDFs
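
To make the first feature concrete, here is a minimal, standard-library-only sketch of how PDF links can be collected from a page's HTML. It is an illustration of the general idea, not pdfsplitter's own implementation; the names `PdfLinkCollector` and `find_pdf_links` are hypothetical, and the sample HTML and `example.com` URL are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that end in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the URL of the page itself.
            self.links.append(urljoin(self.base_url, href))


def find_pdf_links(html: str, base_url: str) -> list:
    """Hypothetical helper: return every PDF link found in an HTML string."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Placeholder HTML standing in for a downloaded web page.
sample_html = '<a href="/docs/report.pdf">Report</a> <a href="/about">About</a>'
print(find_pdf_links(sample_html, "https://example.com/index.html"))
```

Only the link ending in `.pdf` is kept, and it is resolved to an absolute URL so it could be handed to a downloader.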

## Install

`pip install --upgrade pdfsplitter`

## How to use

The highest-level function for exporting image files from a series of PDFs is `extract_images_from_pdfs`, which takes all the PDF files inside a source directory and exports their pages as images to a destination directory. You can also specify the image file type for the exported images, as in this example:

```python
from pathlib import Path

# download_pdf_files, get_pdf_links and extract_images_from_pdfs come from
# pdfsplitter; import them from the module where your installed version
# exposes them.

source = Path("./tryout/")
destination = Path("./tryout/processed")

# download all the PDFs linked from a web page
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)

# export an image for each page of the downloaded PDFs and save them to a directory
extract_images_from_pdfs(source, destination, "jpg")
```
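
After an extraction run, it can be useful to confirm that images were actually written. The sketch below counts files with a given extension under a directory using only `pathlib`. The helper name `count_exported_images` is hypothetical and not part of pdfsplitter, and the file names in the demo are placeholders, since the naming scheme pdfsplitter uses for exported images is not described here.

```python
import tempfile
from pathlib import Path


def count_exported_images(destination: Path, extension: str = "jpg") -> int:
    """Hypothetical helper: count files with the given extension under destination."""
    return sum(1 for path in destination.rglob(f"*.{extension}") if path.is_file())


# Demo on a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "page-1.jpg").touch()
    (out / "page-2.jpg").touch()
    (out / "notes.txt").touch()
    demo_count = count_exported_images(out, "jpg")

print(demo_count)  # 2
```

In practice you would point it at your own destination directory, for example `count_exported_images(Path("./tryout/processed"), "jpg")`.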

```python
# get stats on the downloaded PDF files
display_stats(get_stats(source))
```

Stats for your PDF Files

| PageCou… | Filename | ocr_lay… | pdf_fil… | author |
| ---: | :--- | :--- | ---: | :--- |
| 27 | 2014_ACFO_Report_FINAL_REPORT.pdf | False | 236655 | Stephan… Carr |
| 3 | 7-26-2013_Determination.pdf | False | 214683 | |
| 2 | DA Determination-DCRIT Hawaii Water Wells.pdf | False | 115574 | |
| 3 | 12-18-14_Determination.pdf | False | 50925 | |
| 4 | 6-1-2012_Determination.pdf | False | 463902 | |
| 2 | 8-19-2021_Determination.pdf | False | 350438 | |
| 15 | 2012_ACFO_Report_FINAL_REPORT.pdf | False | 242305 | CarrS |
| 3 | 2-12-2014_Determination.pdf | False | 23823 | timothy… |
| 2 | DA%20Determination%20DoD%20Flights.pdf | False | 111521 | |
| 22 | 2013_ACFO_Report_FINAL_REPORT.pdf | False | 258462 | CarrS |
| 2 | 2-15-2018_Determination.pdf | False | 342195 | |
| 49 | DoDFY2020AnnualFOIA_Report.pdf | False | 1247446 | |
| 3 | 7-5-2019_Determination.pdf | False | 204453 | |
| 30 | 2017_DoD_Chief_FOIA_Officer_Report.pdf | False | 4810077 | |
| 28 | 2021_DoD_Chief_FOIA_Officer_Report.pdf | False | 1131474 | |
| 10 | 2011_DoD_Chief_FOIA_OfficerReport.pdf | False | 113387 | CarrS |
| 27 | 2018_DoD_Chief_FOIA_Officer_Report.pdf | False | 788227 | brandoct |
| 2 | 8-3-15_Determination.pdf | False | 105563 | |
| 3 | 1-21-2016_Determination.pdf | False | 122706 | |
| 2 | 12-6-2017_Determination.pdf | False | 189563 | deleonv |
| 2 | 12-18-2018_Determination.pdf | False | 153675 | |
| 30 | 2016_ACFO_Report_FINAL_REPORT.pdf | False | 1108008 | |
| 2 | 11-29-2017_Determination.pdf | False | 369290 | |
| 2 | DoD SAP IT DCRIT Determination.pdf | False | 127858 | |
| 3 | 10-19-2018_Determination.pdf | False | 70088 | JAMES HOGAN |
| 30 | 2015_ACFO_Report_FINAL_REPORT.pdf | False | 287445 | Stephan… Carr |
| 3 | 7-31-2020_Determination.pdf | False | 88447 | Dziecic… Gerald J Jr CIV OSD OGC (USA) |

TOTAL PAGECOUNT: 311
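
The reported total is simply the sum of the per-file page counts. As a quick worked check, the snippet below adds up the page counts copied from the first column of the stats output above; the variable names are illustrative only.

```python
# Page counts copied from the first column of the stats output, in row order.
page_counts = [
    27, 3, 2, 3, 4, 2, 15, 3, 2, 22, 2, 49, 3, 30,
    28, 10, 27, 2, 3, 2, 2, 30, 2, 2, 3, 30, 3,
]

total_pages = sum(page_counts)
print(f"{len(page_counts)} files, {total_pages} pages")  # 27 files, 311 pages
```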