Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/strickvl/pdfsplitter
Turn PDFs into image files for use in machine learning projects
- Host: GitHub
- URL: https://github.com/strickvl/pdfsplitter
- Owner: strickvl
- License: apache-2.0
- Created: 2021-09-18T17:00:53.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-11-30T08:55:17.000Z (about 1 year ago)
- Last Synced: 2024-11-07T18:11:51.517Z (2 months ago)
- Topics: computer-vision, data-science, machine-learning, pdf, python
- Language: Jupyter Notebook
- Homepage: https://strickvl.github.io/pdfsplitter/
- Size: 106 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# pdfsplitter
> A simple way to extract and parse images for machine learning workflows.

## What is pdfsplitter?
There are lots of repeated tasks you have to perform when working with PDF files for a machine learning project. I found myself wanting a tool that could handle some of the more common parts of this. Not finding anything suitable, I built something for myself.
## Features
- downloading all the PDF files on a web page
- exporting a single image file for each page of a PDF
- generating statistics to get an overview of the total page count of the PDFs

## Install
`pip install --upgrade pdfsplitter`
## How to use
The highest-level function for exporting image files from a series of PDFs is `extract_images_from_pdfs`, which takes all the PDF files inside a source directory and extracts the images to a destination directory. You can also specify the image file type for the exported images, as in this example:
```python
from pathlib import Path

# The pdfsplitter functions used below (get_pdf_links, download_pdf_files,
# extract_images_from_pdfs) must be imported from the installed package first.

source = Path("./tryout/")
destination = Path("./tryout/processed")

# download all the PDFs linked from a particular URL
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)

# extract all the images from the downloaded PDFs and save them to a directory
extract_images_from_pdfs(source, destination, "jpg")
```

```python
# get stats on the downloaded PDF files
display_stats(get_stats(source))
```

Stats for your PDF Files
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ PageCou… ┃ Filename ┃ ocr_lay… ┃ pdf_fil… ┃ author ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ 27 │ 2014_ACFO_Report_FINAL_REPORT.pdf │ False │ 236655 │ Stephan… │
│ │ │ │ │ Carr │
│ 3 │ 7-26-2013_Determination.pdf │ False │ 214683 │ │
│ 2 │ DA Determination-DCRIT Hawaii Water Wells.pdf │ False │ 115574 │ │
│ 3 │ 12-18-14_Determination.pdf │ False │ 50925 │ │
│ 4 │ 6-1-2012_Determination.pdf │ False │ 463902 │ │
│ 2 │ 8-19-2021_Determination.pdf │ False │ 350438 │ │
│ 15 │ 2012_ACFO_Report_FINAL_REPORT.pdf │ False │ 242305 │ CarrS │
│ 3 │ 2-12-2014_Determination.pdf │ False │ 23823 │ timothy… │
│ 2 │ DA%20Determination%20DoD%20Flights.pdf │ False │ 111521 │ │
│ 22 │ 2013_ACFO_Report_FINAL_REPORT.pdf │ False │ 258462 │ CarrS │
│ 2 │ 2-15-2018_Determination.pdf │ False │ 342195 │ │
│ 49 │ DoDFY2020AnnualFOIA_Report.pdf │ False │ 1247446 │ │
│ 3 │ 7-5-2019_Determination.pdf │ False │ 204453 │ │
│ 30 │ 2017_DoD_Chief_FOIA_Officer_Report.pdf │ False │ 4810077 │ │
│ 28 │ 2021_DoD_Chief_FOIA_Officer_Report.pdf │ False │ 1131474 │ │
│ 10 │ 2011_DoD_Chief_FOIA_OfficerReport.pdf │ False │ 113387 │ CarrS │
│ 27 │ 2018_DoD_Chief_FOIA_Officer_Report.pdf │ False │ 788227 │ brandoct │
│ 2 │ 8-3-15_Determination.pdf │ False │ 105563 │ │
│ 3 │ 1-21-2016_Determination.pdf │ False │ 122706 │ │
│ 2 │ 12-6-2017_Determination.pdf │ False │ 189563 │ deleonv │
│ 2 │ 12-18-2018_Determination.pdf │ False │ 153675 │ │
│ 30 │ 2016_ACFO_Report_FINAL_REPORT.pdf │ False │ 1108008 │ │
│ 2 │ 11-29-2017_Determination.pdf │ False │ 369290 │ │
│ 2 │ DoD SAP IT DCRIT Determination.pdf │ False │ 127858 │ │
│ 3 │ 10-19-2018_Determination.pdf │ False │ 70088 │ JAMES │
│ │ │ │ │ HOGAN │
│ 30 │ 2015_ACFO_Report_FINAL_REPORT.pdf │ False │ 287445 │ Stephan… │
│ │ │ │ │ Carr │
│ 3 │ 7-31-2020_Determination.pdf │ False │ 88447 │ Dziecic… │
│ │ │ │ │ Gerald J │
│ │ │ │ │ Jr CIV │
│ │ │ │ │ OSD OGC │
│ │ │ │ │ (USA) │
└──────────┴───────────────────────────────────────────────┴──────────┴──────────┴──────────┘

TOTAL PAGECOUNT: 311
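
For readers curious about what collecting PDF links from a web page involves, here is a minimal standard-library sketch. It is not pdfsplitter's implementation: `PdfLinkParser` and `find_pdf_links` are hypothetical names introduced for illustration, and the sketch assumes that a PDF link is simply an `<a href>` whose URL path ends in `.pdf`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Hypothetical helper: collects absolute URLs of <a href> targets ending in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check only the path, so query strings such as ?download=1 are ignored.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.links:
            self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return the de-duplicated PDF links found in an HTML string, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Usage on an in-memory HTML snippet (no network access needed).
sample_html = '<a href="/docs/report.pdf">Report</a> <a href="about.html">About</a>'
print(find_pdf_links(sample_html, "https://example.org/foia/"))
# prints ['https://example.org/docs/report.pdf']
```

Matching on the URL path is a deliberate simplification: it misses PDFs served from URLs without a `.pdf` suffix, which a more careful tool might detect by checking the response's content type.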