https://github.com/monambike/pdfconverter-pdftables-to-csv

Python project that converts tables inside PDFs to CSV for convenient data manipulation. It has log and exception handling.

automation csv glob log pandas pdf pdf-converter pdf-to-csv pdf-to-excel pdf-to-text python regex tabula


![Python](https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54)
![Pandas](https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white)

# PDFConverter - Script

PDFConverter is a Python script, meant to be built into an executable file, that quickly reads a large number of tables from PDF files and converts them to CSV without requiring extensive user interaction.

You can also check the [docs](https://github.com/monambike/pdfconverter-pdftables-to-csv/tree/docs) branch or the [desktop application](https://github.com/monambike/pdfconverter-pdftables-to-csv/tree/desktop) branch, which is used for testing calls to this script.

**Example of Script call:**

```
python pdfconverter.py --ImportPath "C:\\users\\dvp10\\desktop\\EDITAL (2).pdf" --ExportPath "C:\\users\\dvp10\\desktop" --PageNumber "all"
```
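The call above uses three named arguments. As a minimal sketch of how such arguments could be received with `argparse`, the snippet below declares them with the same names. The help texts, the `required` settings, and the `"all"` default are assumptions made for illustration and are not taken from the script itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three arguments shown in the example call."""
    parser = argparse.ArgumentParser(prog="pdfconverter.py")
    # Help texts, required flags and the default below are assumptions.
    parser.add_argument("--ImportPath", required=True, help="path to read PDFs from")
    parser.add_argument("--ExportPath", required=True, help="folder that receives the CSV files")
    parser.add_argument("--PageNumber", default="all", help='pages to convert, e.g. "all"')
    return parser


# Parse an explicit argument list instead of sys.argv, just to show the result.
args = build_parser().parse_args(
    ["--ImportPath", "input.pdf", "--ExportPath", "out", "--PageNumber", "all"]
)
print(args.ImportPath, args.ExportPath, args.PageNumber)
```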

## Project Structure

![image](https://github.com/monambike/pdfconverter-pdftables-to-csv/assets/35270174/c14e73d1-4143-4134-b3da-29f57bbd6680)

# Contact

You can find me on LinkedIn at [linkedin.com/in/monambike/](https://www.linkedin.com/in/monambike/). If you want to see videos about my work, check my YouTube channel [youtube.com/@monambike_portfolio](https://www.youtube.com/@monambike_portfolio), and if you want to see my artwork, check my Instagram [instagram.com/monambike_portfolio](https://www.instagram.com/monambike_portfolio).

# License

The license for this repository is available [here](LICENSE). Please refer to the provided link for detailed information regarding the terms and conditions governing the use of this project.

## Table of Contents

- [Libraries](#libraries)
- [Formatting](#formatting)
  - [File Read Handling](#file-read-handling)
    - [Remove Double Quotes](#remove-double-quotes)
    - [Replace Semicolon](#replace-semicolon)
    - [Delete Empty Lines](#delete-empty-lines)
    - [Delete Empty Columns](#delete-empty-columns)
    - [Convert Header to Body](#convert-header-to-body)
    - [Remove Line Breaks](#remove-line-breaks)
  - [Conversion File Handling](#conversion-file-handling)
    - [Export \[withoutFormatting\]](#export-withoutformatting)
      - [Empty Data in Header](#empty-data-in-header)
      - [Line Breaks in the Middle of Data](#line-breaks-in-the-middle-of-data)
      - [Semicolon at the End of the Line](#semicolon-at-the-end-of-the-line)
      - [Space at the Beginning of the Line](#space-at-the-beginning-of-the-line)
      - [Quotes and One Column (First Check)](#quotes-and-one-column-first-check)
    - [Export \[tableWithBlankCells\]](#export-tablewithblankcells)
      - [Empty Data](#empty-data)
      - [Adjacent Double Quotes](#adjacent-double-quotes)
      - [Space After a Separator](#space-after-a-separator)
      - [Space Between Separators and Double Quotes](#space-between-separators-and-double-quotes)
      - [Quotes and One Column (Second Check)](#quotes-and-one-column-second-check)
    - [Export \[main\]](#export-main)
      - [Quotes at the Beginning](#quotes-at-the-beginning)
      - [Quotes at the End](#quotes-at-the-end)
      - [Empty Lines or Without Quotes (Second Check)](#empty-lines-or-without-quotes-second-check)
      - [Three Columns](#three-columns)
    - [Export \[fullClear\]](#export-fullclear)

## Libraries

List of libraries used for the development of the Python script:
- [**Pandas**](https://pandas.pydata.org/), for text conversion and DataFrame manipulation;
- [**Tabula**](https://tabula.technology/), for reading PDF files;
- Other standard libraries of the Python language were also used, such as [**Glob**](https://docs.python.org/3/library/glob.html) for retrieving only PDF files, [**OS**](https://docs.python.org/3/library/os.html) for system operations, [**argparse**](https://docs.python.org/3/library/argparse.html) for receiving and manipulating command-line arguments, among others.
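As an illustration of the `glob` usage mentioned in the list above, here is a minimal sketch that collects only the PDF files of a folder. The function name `find_pdf_files` is hypothetical, and the lowercase `*.pdf` pattern is an assumption (on case-sensitive file systems it will not match `.PDF`).

```python
import glob
import os


def find_pdf_files(folder: str) -> list[str]:
    """Return the PDF files located directly inside folder, sorted by name."""
    # glob.escape keeps characters such as "[" in the folder name literal.
    pattern = os.path.join(glob.escape(folder), "*.pdf")
    return sorted(glob.glob(pattern))
```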

## Formatting
The types of formatting and the files to which they are applied. Each export table below marks the point where a file is written, and that file includes all the formatting described above that point.

### File Read Handling
Formatting related to reading.

#### Remove Double Quotes
Removes all double quotes from the DataFrame to avoid future issues.

#### Replace Semicolon
Replaces all semicolons in the DataFrame with commas to avoid conflicts.

#### Delete Empty Lines
Deletes all empty rows in the DataFrame.

#### Delete Empty Columns
Deletes all empty columns in the DataFrame.

#### Convert Header to Body
Converts the header row into a regular body row to remove unnecessary and detrimental header formatting.

#### Remove Line Breaks
Removes line breaks that occur when the PDF has a very long line.
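The read-handling steps above can be sketched with pandas as follows. This is only an approximation under stated assumptions: the helper names `_map_text` and `clean_table` are hypothetical, cells are treated as text, blank cells are treated as empty, and line breaks are joined with a single space.

```python
import pandas as pd


def _map_text(df: pd.DataFrame, func) -> pd.DataFrame:
    """Apply func to every non-missing cell as text; blank results become missing."""

    def convert(value):
        if pd.isna(value):
            return None
        text = func(str(value))
        return text if text.strip() else None

    return df.apply(lambda column: column.map(convert))


def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    # Remove double quotes and replace semicolons with commas.
    df = _map_text(df, lambda text: text.replace('"', "").replace(";", ","))
    # Delete empty rows, then empty columns.
    df = df.dropna(axis=0, how="all").dropna(axis=1, how="all")
    # Convert header to body: the column labels become the first data row.
    header = pd.DataFrame([[str(label) for label in df.columns]], columns=df.columns)
    df = pd.concat([header, df], ignore_index=True)
    df.columns = range(df.shape[1])
    # Remove line breaks inside cells (joined with a space, an assumption).
    return _map_text(df, lambda text: " ".join(text.splitlines()))
```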

### Conversion File Handling
Formatting related to conversion.

#### Export \[withoutFormatting\]
Starts the first export, which is the export of the unformatted file that will be formatted later.



| EXPORT | |
| --- | --- |
| Folder Name | `withoutFormatting` |
| Folder Path | `(lattice/stream) + "\\withoutFormatting"` |
| Description | The 'withoutFormatting' file is exported at this moment without any formatting. |








##### Empty Data in Header
Removes empty data in the header.

If it is:
```
"";"Unnamed: 0";""
```
It becomes:
```
"";""
```
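A minimal sketch of this rule, assuming the empty header cells are the `"Unnamed: N"` placeholders that pandas generates for columns without a name. The function name is hypothetical, and splitting on `;` relies on semicolons inside the data having already been replaced by commas.

```python
import re

# Matches a quoted placeholder header cell such as "Unnamed: 0".
UNNAMED_CELL = re.compile(r'^"Unnamed: \d+"$')


def drop_unnamed_header_cells(header_line: str) -> str:
    """Remove the "Unnamed: N" cells from a semicolon-separated header line."""
    cells = header_line.split(";")
    return ";".join(cell for cell in cells if not UNNAMED_CELL.match(cell))


print(drop_unnamed_header_cells('"Name";"Unnamed: 0";"Age"'))
```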


##### Line Breaks in the Middle of Data
Removes line breaks if they occur in the middle of the data.

If it is:
```
""
```
It becomes:
```
""
```
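One way to sketch this step: a line break in the middle of the data leaves a cell with an opening quote but no closing quote, so lines can be merged until the number of double quotes is even again. The function name and the choice of joining with a single space are assumptions.

```python
def join_broken_lines(lines):
    """Merge lines until every double quote opened on a line is closed."""
    merged = []
    pending = ""
    for line in lines:
        pending = f"{pending} {line}" if pending else line
        # An even number of quotes means no cell is left open.
        if pending.count('"') % 2 == 0:
            merged.append(pending)
            pending = ""
    if pending:
        merged.append(pending)  # keep an unterminated tail rather than drop it
    return merged


print(join_broken_lines(['"a";"b', 'c"', '"d";"e"']))
```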


##### Semicolon at the End of the Line
Removes semicolon `';'` if it is at the end of the line.

If it is:
```
"";"";
```
It becomes:
```
"";""
```


##### Space at the Beginning of the Line
Removes leading spaces in the lines.

If it is:
```
 "";""
  "";""
   "";""
```
It becomes:
```
"";""
"";""
"";""
```
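The two line-level trims described so far (a semicolon at the end of the line and spaces at the beginning) can be sketched in one small helper. The name `trim_line` is hypothetical, and the sketch removes every trailing semicolon if there is more than one.

```python
def trim_line(line: str) -> str:
    """Strip leading spaces, then any run of trailing semicolons."""
    return line.lstrip(" ").rstrip(";")


print(trim_line('   "a";"b";'))
```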


##### Quotes and One Column (First Check)
Removes the line if it starts and ends with quotes and has one column or fewer.

If it is:
```
"";"";"";""
"";""
"";"";"";""
""
"";"";"";""
"
"
```
It becomes:
```
"";"";"";""
"";""
"";"";"";""
"";"";"";""
"
"
```
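A minimal sketch of this check, written to reproduce the example above. The function names are hypothetical, and the minimum length of two characters is an assumption made so that a line holding a single `"` is kept, as in the example.

```python
def is_single_quoted_cell(line: str) -> bool:
    """True for a line that is exactly one quoted cell, such as "" or "text"."""
    return (
        len(line) >= 2
        and line.startswith('"')
        and line.endswith('"')
        and '";"' not in line  # no separator between two quoted cells
    )


def drop_single_cell_lines(lines):
    """Keep every line except those made of a single quoted cell."""
    return [line for line in lines if not is_single_quoted_cell(line)]
```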


#### Export \[tableWithBlankCells\]
Starts the export of the file that handles the exceptional case of converting a table that has empty cells.



| EXPORT | |
| --- | --- |
| Folder Name | `tableWithBlankCells` |
| Folder Path | `(lattice/stream) + "\\tableWithBlankCells"` |
| Description | The file 'tableWithBlankCells' is exported at this moment with all the formatting applied above. |







##### Empty Data
Removes data that is empty `"";` and `;""`.

If it is:
```
"";"";"";""
"";"";"";""
"";"";"";""
```
It becomes:
```
"";"";""
"";"";""
"";"";""
```
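With real values in place of the placeholders, the rule can be sketched as dropping every cell that is exactly `""`. The function name is hypothetical, and splitting on `;` assumes that semicolons inside the data were already replaced by commas.

```python
def drop_empty_cells(line: str) -> str:
    """Remove cells that are exactly "" from a semicolon-separated line."""
    cells = line.split(";")
    return ";".join(cell for cell in cells if cell != '""')


print(drop_empty_cells('"a";"";"b";"c"'))
```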


##### Adjacent Double Quotes
Inserts a line break if there are double quotes side by side.

If it is:
```
"";"""";""
```
It becomes:
```
"";""
"";""
```
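A possible sketch of this rule, assuming empty cells were already removed, so that any remaining `""` is a closing quote directly followed by an opening quote. The function name is hypothetical.

```python
def split_adjacent_quotes(line: str) -> list[str]:
    """Break the line between a closing quote and the opening quote next to it."""
    return line.replace('""', '"\n"').split("\n")


print(split_adjacent_quotes('"a";"b""c";"d"'))
```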


##### Space After a Separator
If there is a semicolon followed by a space, it is replaced by a line break.

If it is:
```
"";""; "";""
```

It becomes:
```
"";""
"";""
```
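This replacement can be sketched with a regular expression. The function name is hypothetical, and treating several spaces after the semicolon like a single one is an assumption.

```python
import re


def split_on_spaced_separator(line: str) -> list[str]:
    """Split the line wherever a semicolon is followed by one or more spaces."""
    return re.split(r"; +", line)


print(split_on_spaced_separator('"a";"b"; "c";"d"'))
```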


##### Space Between Separators and Double Quotes
Removes the preceding content if there is a space between the separators and the quotes.

If it is:
```
"";""; "";""
```
It becomes:
```
"";""
```


##### Quotes and One Column (Second Check)
Removes the line if it starts and ends with quotes and has one column or fewer.

If it is:
```
"";"";"";""
"";""
"";"";"";""
""
"";"";"";""
"
"
```
It becomes:
```
"";"";"";""
"";""
"";"";"";""
"";"";"";""
"
"
```


#### Export \[main\]
Starts the export of the main file.



| EXPORT | |
| --- | --- |
| Folder Name | `main` |
| Folder Path | `(lattice/stream) + "\\main"` |
| Description | The file 'main' is exported at this moment with all the formatting applied above. |







##### Quotes at the Beginning
Deletes the line if it doesn't start with quotes.

If it is:
```
"";"";""
";"";""
"";"";""
```
It becomes:
```
"";"";""
"";"";""
```


##### Quotes at the End
Deletes the line if it doesn't end with quotes.

If it is:
```
"";"";""
"";"";"
"";"";""
```
It becomes:
```
"";"";""
"";"";""
```
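The two quote checks (at the beginning and at the end of the line) can be sketched as a single filter. The function name is hypothetical, and the sample values stand in for real cell contents.

```python
def keep_fully_quoted_lines(lines):
    """Keep only the lines that both start and end with a double quote."""
    return [
        line for line in lines
        if line.startswith('"') and line.endswith('"')
    ]


print(keep_fully_quoted_lines(['"a";"b"', 'x";"y"', '"c";"d', '"e";"f"']))
```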


##### Empty Lines or Without Quotes (Second Check)
Deletes lines that contain only a line break (`'\n'`) or that have no double quote anywhere.

If it is:
```

Lorem
"";"";""
"";"";""
Lorem ipsum

"";""
```
It becomes:
```
"";"";""
"";"";""
"";""
```


##### Three Columns
Writes the line only if it has at least three columns.

If it is:
```
"";"";"";""
"";""
"";"";"";""
""
"";"";"";""
"";"";""
```
It becomes:
```
"";"";"";""
"";"";"";""
"";"";"";""
"";"";""
```
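A minimal sketch of the column check, reproducing the example above. The function names and the `minimum` parameter are hypothetical, and counting columns by splitting on `;` assumes that semicolons inside the data were already replaced by commas.

```python
def column_count(line: str) -> int:
    """Count the semicolon-separated cells of a line (0 for an empty line)."""
    return len(line.split(";")) if line else 0


def keep_wide_lines(lines, minimum=3):
    """Keep only the lines that have at least `minimum` columns."""
    return [line for line in lines if column_count(line) >= minimum]
```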


#### Export \[fullClear\]
Starts the export of the main file with some stricter formatting modifications.



| EXPORT | |
| --- | --- |
| Folder Name | `fullClear` |
| Folder Path | `(lattice/stream) + "\\fullClear"` |
| Description | The file 'fullClear' is exported at this moment with all the formatting applied above. |
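The folder paths of the four exports follow the same pattern, so they could be built by one helper. The sketch below assumes that `lattice` and `stream` are folders inside the export path, one per Tabula extraction mode. The function name `export_folder` and its parameters are hypothetical.

```python
from pathlib import Path

# Folder names taken from the export tables in this document.
EXPORT_FOLDERS = ("withoutFormatting", "tableWithBlankCells", "main", "fullClear")


def export_folder(export_path: str, mode: str, stage: str) -> Path:
    """Create and return <export_path>/<mode>/<stage> for one export."""
    if mode not in ("lattice", "stream"):
        raise ValueError(f"unknown mode: {mode!r}")
    if stage not in EXPORT_FOLDERS:
        raise ValueError(f"unknown export folder: {stage!r}")
    folder = Path(export_path) / mode / stage
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Using `pathlib` instead of concatenating `"\\"` keeps the sketch independent of the operating system's path separator.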