https://github.com/arnos-stuff/pdf-data-process
- Host: GitHub
- URL: https://github.com/arnos-stuff/pdf-data-process
- Owner: arnos-stuff
- Created: 2023-05-28T22:20:38.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2023-05-28T22:23:03.000Z (about 3 years ago)
- Last Synced: 2025-10-14T02:13:28.174Z (8 months ago)
- Language: Python
- Size: 23.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Turning a PDF dataset into CSV
On the rare occasion that researchers publish an entire dataset in PDF format, [as in the case of this research paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123920), you need a quick tool to turn that PDF into a CSV file.
## What happens under the hood
### Embedded Node.js runtime
Using the `nodejs-bin` [Python Node.js runtime](https://pypi.org/project/nodejs-bin/), we can call `npx` to execute the [pdf2json npm package](https://www.npmjs.com/package/pdf2json).
When you type
```bash
pdfdataprocess mkjson
```
the `npx` call is applied to the first PDF found in your directory. If no PDF is found there, the path you pass to the `-f` flag is used instead.
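The selection logic described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `pick_pdf` and `build_npx_command` are hypothetical, "first PDF" is assumed to mean first by file name, and the `-f`/`-o` arguments passed to `pdf2json` are assumed from its command-line interface.

```python
from pathlib import Path
from typing import List, Optional


def pick_pdf(directory: Path, fallback: Optional[Path] = None) -> Optional[Path]:
    """Return the first PDF in `directory` (sorted by name), else `fallback`.

    Hypothetical helper: the ordering rule is an assumption.
    """
    pdfs = sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return pdfs[0] if pdfs else fallback


def build_npx_command(pdf: Path, out_dir: Path) -> List[str]:
    """Build the argument list for running pdf2json through npx.

    Assumption: pdf2json accepts -f (input file) and -o (output directory).
    """
    return ["npx", "pdf2json", "-f", str(pdf), "-o", str(out_dir)]
```

The resulting list could then be handed to `subprocess.run`, or to whichever `npx` wrapper the embedded runtime exposes.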
### Parsing the JSON-PDF
Next, some highly hacky Python processing is applied to the JSON-PDF file.
The JSON-PDF is a JSON file containing the content of the PDF.
The Python script iterates through the JSON-PDF and extracts the text content of each page.
How this is done is quite simple:
- if your dataset is an actual CSV file that was printed to PDF, then each page holds lines of values: entries on the same line share the same height (`top` entry), their horizontal position (`left` entry) changes from value to value, and the `data` entries have to be rearranged into rows
- this requires a few layers of pre- and post-processing
- the final result is a CSV file with the text content of each page
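The row-reconstruction idea in the list above can be sketched roughly as follows. This is a simplified illustration, not the project's actual implementation: it assumes each text fragment has already been reduced to a dict with `top`, `left` and `data` keys, and the function names and the `ndigits` rounding tolerance are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List


def items_to_rows(items: Iterable[dict], ndigits: int = 2) -> List[List[str]]:
    """Group text fragments into rows by height, then order each row by `left`."""
    lines: Dict[float, List[dict]] = defaultdict(list)
    for item in items:
        # Fragments whose rounded `top` matches are treated as one printed line.
        lines[round(item["top"], ndigits)].append(item)

    rows = []
    for top in sorted(lines):  # top-to-bottom order on the page
        cells = sorted(lines[top], key=lambda it: it["left"])  # left-to-right
        rows.append([str(cell["data"]) for cell in cells])
    return rows


def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialise the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Rounding `top` is one simple way to tolerate small vertical jitter between fragments of the same line. Real PDFs may need the extra pre- and post-processing mentioned above, for example to handle cells that wrap across lines.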
## Usage
```bash
pdfdataprocess mkjson # outputs the json file
```
then
```bash
pdfdataprocess pjson # outputs the csv file
```