https://github.com/pfei/invoice-cleaner
- Host: GitHub
- URL: https://github.com/pfei/invoice-cleaner
- Owner: pfei
- License: mit
- Created: 2026-04-29T07:47:05.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-29T08:19:05.000Z (about 2 months ago)
- Last Synced: 2026-04-29T09:33:06.688Z (about 2 months ago)
- Language: Python
- Size: 115 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Invoice Cleaner
A lightweight Python tool that extracts structured data (amount, date) from PDF invoices using regex patterns.
## ⚠️ Status
This project is in early development (alpha stage).
API, structure and extraction rules are subject to change.
## 🎯 Purpose
This project demonstrates a **"Fast-Path" parsing strategy**. Before resorting to complex AI models, `invoice-cleaner` uses high-performance heuristic extraction to process known invoice formats (like Free Telecom) in milliseconds.
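A minimal sketch of the fast-path idea, assuming extraction works on plain text already pulled from the PDF. The names `extract_free_telecom`, `FAST_EXTRACTORS`, `parse_invoice` and the `fallback` callable are hypothetical and not taken from `main.py`; the regex is only a guess at what a matching invoice line might look like.

```python
import re
from typing import Callable, List, Optional

# Assumed pattern: an amount following the anchor "Somme à payer".
FREE_AMOUNT = re.compile(r"Somme à payer\D*(\d+[.,]\d{2})")


def extract_free_telecom(text: str) -> Optional[dict]:
    """Fast path for one known layout; returns None when the anchor is absent."""
    match = FREE_AMOUNT.search(text)
    if match is None:
        return None
    return {"provider": "free", "amount": float(match.group(1).replace(",", "."))}


# Supporting a new provider means appending a function with the same signature.
FAST_EXTRACTORS: List[Callable[[str], Optional[dict]]] = [extract_free_telecom]


def parse_invoice(text: str, fallback: Callable[[str], dict]) -> dict:
    """Try every cheap extractor first; call the slow fallback only if all fail."""
    for extractor in FAST_EXTRACTORS:
        result = extractor(text)
        if result is not None:
            return result
    return fallback(text)
```

Here `fallback` stands in for whatever heavier method would handle unknown formats (for example a model call); it only runs when no regex extractor recognizes the text.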
## 🛠️ Features
* **Speed**: Near-instant extraction using `pdfplumber` and `re`.
* **Robustness**: Handles multi-page PDFs and text sanitization.
* **Extensibility**: Structured with modular functions to easily add new providers.
* **Open Data**: Includes sample invoices for immediate testing.
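As an illustration of the multi-page handling and text sanitization mentioned above, here is a rough sketch. The helper names `join_pages`, `sanitize` and `read_pdf_text` are assumptions for this example, and the cleanup rules shown may differ from what the project actually applies.

```python
import re
from typing import Iterable


def join_pages(pages: Iterable) -> str:
    """Concatenate text from page objects that expose extract_text()."""
    # Guard against pages that yield no text (empty string or None).
    return "\n".join(page.extract_text() or "" for page in pages)


def sanitize(text: str) -> str:
    """Normalize whitespace so the regexes see predictable input."""
    text = text.replace("\u00a0", " ")    # non-breaking spaces -> plain spaces
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    return "\n".join(line.strip() for line in text.splitlines()).strip()


def read_pdf_text(path: str) -> str:
    """Read every page of a PDF and return sanitized text."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return sanitize(join_pages(pdf.pages))
```

Keeping `join_pages` and `sanitize` independent of `pdfplumber` makes them easy to test with plain strings or stub page objects.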
## 📂 Project Structure
```text
.
├── examples/ # Sample PDF invoices for testing
├── output/ # Generated JSON/CSV reports (ignored by git)
├── .gitignore
├── main.py
├── requirements.txt
├── LICENSE
└── README.md
```
## 🚀 Getting Started
### 1. Prerequisites
Ensure you have Python 3.8+ installed.
### 2. Installation
Clone the repository and install the required dependencies:
```bash
git clone https://github.com/pfei/invoice-cleaner.git
cd invoice-cleaner
pip install pdfplumber
```
### 3. Usage
Run the main script to process the default example:
```bash
python main.py
```
## 📊 Performance Logic
The script follows a deterministic extraction flow:
1. **Layout Analysis**: Extracts raw text while maintaining logical groupings.
2. **Regex Anchoring**: Locates financial data based on semantic anchors like "Somme à payer".
3. **Sanitization**: Cleans up formatting artifacts to return pure data (e.g., 29.99).
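The anchoring and sanitization steps above could look roughly like this. It is a sketch under assumptions: the patterns `AMOUNT_RE` and `DATE_RE`, the `dd/mm/yyyy` date format, and the function names are illustrative and not copied from the repository.

```python
import re
from datetime import date

# Assumed anchor: the amount follows "Somme à payer", optionally with
# space-separated thousands groups and a comma or dot before the cents.
AMOUNT_RE = re.compile(r"Somme à payer\D*(\d+(?:[ \u00a0]\d{3})*[.,]\d{2})")
# Assumed date format: dd/mm/yyyy.
DATE_RE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")


def clean_amount(raw: str) -> float:
    """Turn a string such as '1 234,56' into the float 1234.56."""
    digits = re.sub(r"[ \u00a0]", "", raw)   # drop thousands separators
    return float(digits.replace(",", "."))   # decimal comma -> decimal point


def extract_fields(text: str) -> dict:
    """Return the amount and ISO date found in the text, or None for each miss."""
    amount = AMOUNT_RE.search(text)
    when = DATE_RE.search(text)
    iso_date = None
    if when:
        day, month, year = (int(part) for part in when.groups())
        iso_date = date(year, month, day).isoformat()
    return {
        "amount": clean_amount(amount.group(1)) if amount else None,
        "date": iso_date,
    }
```

For example, `extract_fields("Facture du 15/04/2026\nSomme à payer : 29,99 €")` yields an amount of `29.99` and the date `"2026-04-15"`. A date that matches the pattern but is not a real calendar date would raise `ValueError` in this sketch.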
## 📜 Acknowledgments & License
* **Data Sources**: Some PDF samples in the `examples/` directory are sourced from the [invoice2data](https://github.com/invoice-x/invoice2data) project (MIT License).
* **License**: This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.
Developed by **Pierre Feilles** (2026).