https://github.com/serhatci/data-extraction-from-pdf
A sample script that extracts text data from a PDF file, converts it to a pandas DataFrame, and saves it to a CSV file.
data-extraction pandas-data-frame pdf pdfplumber python
- Host: GitHub
- URL: https://github.com/serhatci/data-extraction-from-pdf
- Owner: serhatci
- License: mit
- Created: 2020-11-21T21:31:52.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2020-12-18T15:58:24.000Z (almost 5 years ago)
- Last Synced: 2025-01-27T06:43:46.502Z (9 months ago)
- Topics: data-extraction, pandas-data-frame, pdf, pdfplumber, python
- Language: Python
- Homepage:
- Size: 3.23 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Data extraction from a pdf file
A script that extracts text data from a PDF file, converts it to a pandas DataFrame, and saves it to a CSV file.
[CodeFactor](https://www.codefactor.io/repository/github/serhatci/data-extraction-from-pdf)
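The extract, convert, and save flow described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual `pdf_data_extractor.py`: the function names `lines_to_records` and `pdf_to_csv`, the `page`/`text` column layout, and the one-row-per-line parsing rule are all assumptions made for the example. Only `pdfplumber.open`, `pdf.pages`, `page.extract_text()`, `pandas.DataFrame`, and `DataFrame.to_csv` are real library calls.

```python
def lines_to_records(page_number, text):
    """Turn the text of one page into one record per non-empty line.

    Hypothetical parsing rule: the real script's rules depend on the
    layout of the report being processed.
    """
    records = []
    # Treat missing text (None) the same as an empty page.
    for line in (text or "").splitlines():
        line = line.strip()
        if line:
            records.append({"page": page_number, "text": line})
    return records


def pdf_to_csv(pdf_path, csv_path):
    """Extract text from every page of a PDF and save it as a CSV file."""
    # Imported here so the helper above stays usable without the
    # third-party packages installed.
    import pandas as pd
    import pdfplumber

    records = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # extract_text() may return None for a page with no text layer.
            records.extend(lines_to_records(number, page.extract_text()))

    # Fixing the column order keeps the CSV header stable even if
    # no text was extracted at all.
    frame = pd.DataFrame(records, columns=["page", "text"])
    frame.to_csv(csv_path, index=False)
    return frame
```

A call such as `pdf_to_csv("ITRCAnnualReportPdf2019.pdf", "output.csv")` would then write one CSV row per extracted line; `output.csv` is a placeholder name, not the file the script itself produces.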
## Installation
Clone the repository:

`git clone https://github.com/serhatci/data-extraction-from-pdf.git`

Install the requirements:

`pip install -r requirements.txt`

Make sure the following PDF files are in the script folder:

- ITRCAnnualReportPdf2019.pdf
- ITRCAnnualReportPdf2018.pdf

Then run the application:

`python script/pdf_data_extractor.py`

## Requirements
The script requires Python 3.7 or higher.
The following libraries should be installed:
```
pip install pdfplumber~=0.5.25
pip install pandas~=0.25.1
```

# Demonstration of extracted text from pdf file
The image below shows the format of the PDF file and the extracted data in the CSV file.
