https://github.com/samyam81/pdf-to-excel

Converts text data from PDF files into Excel spreadsheets using Go and external libraries (`go-fitz` and `excelize`).
https://github.com/samyam81/pdf-to-excel

go pdf-to-excel

Last synced: about 2 months ago
JSON representation

Converts text data from PDF files into Excel spreadsheets using Go and external libraries (`go-fitz` and `excelize`).

Host: GitHub
URL: https://github.com/samyam81/pdf-to-excel
Owner: samyam81
Created: 2024-06-19T05:28:32.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-06-20T04:13:24.000Z (11 months ago)
Last Synced: 2025-01-27T09:14:20.418Z (4 months ago)
Topics: go, pdf-to-excel
Language: Go
Homepage:
Size: 6.84 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: Readme.md

Awesome Lists containing this project

README

# PDF-To-Excel Converter

This repository contains a Go application (`APP.go`) that converts text data from a PDF file into an Excel spreadsheet. It utilizes the `go-fitz` library for PDF parsing and `excelize` for Excel file manipulation.

### Requirements
- Go 1.11+
- Libraries:
- `github.com/gen2brain/go-fitz`
- `github.com/360EntSecGroup-Skylar/excelize`

### Installation
1. Make sure you have Go installed and set up properly.
2. Clone this repository:
```bash
git clone https://github.com/samyam81/PDF-To-Excel
```
3. Install dependencies:
```bash
go mod tidy
```

### Usage
The application expects two command-line arguments:
- `pdf_path`: Path to the input PDF file.
- `output_path`: Path to the output Excel file.

Example usage:
```bash
go run APP.go -pdf_path input.pdf -output_path output.xlsx
```

### How It Works
1. **Argument Parsing**: Command-line arguments (`pdf_path` and `output_path`) are parsed using the `flag` package.
2. **Excel Initialization**: An Excel workbook is initialized using `excelize.NewFile()`.
3. **PDF Processing**:
- The input PDF file is opened and read using `go-fitz`.
- Text blocks are extracted from each page of the PDF using `page.TextBlocks()`.
- Each text block is cleaned of newline characters and written into successive rows in the Excel sheet (`Sheet1`).
4. **Excel Writing**:
- Text blocks are written into corresponding cells in the Excel sheet, starting from column A and incrementing the row for each text block.
5. **Saving**: The resulting Excel file is saved to the specified output path using `xlsx.SaveAs(outputPath)`.

### Notes
- Text blocks in the PDF are directly processed without additional grouping by vertical position (`groupMapsByRange` function is removed).
- Ensure the input PDF is structured such that text extraction results in meaningful rows and columns in the Excel output.

### Author
This project was developed by [Samyam](https://github.com/samyam81).

---

### Explanation of Changes:
- **How It Works**: Updated to reflect the direct extraction of text blocks from each page of the PDF using `page.TextBlocks()` method.
- **Excel Writing**: Clarified that text blocks are written into Excel starting from column A and incrementing the row for each block.
- **Notes**: Removed the section about vertical position grouping (`groupMapsByRange` function) as it was not utilized in the revised code.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/samyam81/pdf-to-excel

Awesome Lists containing this project

README