https://github.com/pmthetechguy/document-entity-extractor
AI-powered document extractor for names, emails, and organizations.
- Host: GitHub
- URL: https://github.com/pmthetechguy/document-entity-extractor
- Owner: PMTheTechGuy
- Created: 2025-04-26T06:42:30.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-28T15:10:18.000Z (about 1 year ago)
- Last Synced: 2025-04-28T16:26:42.073Z (about 1 year ago)
- Topics: ai, automation, data-extraction, document-extraction, entity-recognition, fastapi, gpt, openai, pandas, portfolio-project, python, uvicorn, web-app
- Language: Python
- Size: 33.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# AI Data Extraction Tool
🚀 Upload documents → Extract Names, Emails, and Organizations → Download structured Excel results instantly.
Built with **FastAPI**, **Pandas**, and optional **GPT-enhanced** extraction.
Deployed live on **Render**.
---
## ✨ Features
- ✅ Upload PDF, DOCX, and TXT documents
- ✅ Extract **Names**, **Emails**, and **Organizations**
- ✅ Multi-file uploads supported (combines results into one Excel)
- ✅ Clean and organized Excel file download (`.xlsx`)
- ✅ Supports both **local entity extraction** and **GPT-enhanced** extraction
- ✅ Automatic fallback if the custom model is missing
- ✅ Deployed online via [Render](https://render.com/)
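Of the three entity types listed above, emails are the one that can usually be found without a language model. The following is a minimal sketch of that idea, not the project's actual code: `extract_emails` and `EMAIL_PATTERN` are hypothetical names, and the pattern is deliberately simplified.

```python
import re

# Simplified pattern: local part, "@", then a domain containing at least one dot.
# It does not cover every address form allowed by the email RFCs.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique email addresses in order of first appearance.

    Duplicates are detected case-insensitively; the first spelling is kept.
    """
    seen: set[str] = set()
    emails: list[str] = []
    for match in EMAIL_PATTERN.findall(text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails
```

Names and organizations generally need a named-entity model (spaCy or GPT in this project), since they follow no fixed textual pattern.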
---
## 📸 Screenshots
### Upload Page

### Extraction Results Page

---
## 🚀 Live Demo
> 🟢 [Visit the Live App Here](https://ai-data-extraction-tool.onrender.com/)
---
## ⚙️ Technologies Used
- Python 3.11
- FastAPI
- Uvicorn
- Pandas
- spaCy
- OpenAI API (optional GPT-enhancement)
- openpyxl (for Excel export)
---
## 🛠 Local Development Setup
Clone the repository:
```bash
git clone https://github.com/PMTheTechGuy/document-entity-extractor.git
cd document-entity-extractor
```
Install dependencies:
```bash
pip install -r requirements.txt
```
Set up your environment variables:
Create a `.env` file based on `.env.example`.
```bash
cp .env.example .env
```
Start the server locally:
```bash
uvicorn api.main:app --reload
```
If the application fails to load at `http://localhost:8000`, stop the server with `Ctrl + C` and restart it on port `8001`:
```bash
uvicorn api.main:app --reload --port 8001
```
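One possible cause of a loading problem on port 8000 is that another process already holds the port (an assumption; the README does not state the cause). The sketch below uses only the standard library to check which candidate port is free before starting Uvicorn. `port_is_free` and `first_free_port` are hypothetical helpers, not part of this repository.

```python
import socket
from typing import Iterable, Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
        return True


def first_free_port(candidates: Iterable[int] = (8000, 8001)) -> Optional[int]:
    """Return the first free port among the candidates, or None if all are busy."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("Ports 8000 and 8001 are both in use.")
    else:
        print(f"Try: uvicorn api.main:app --reload --port {port}")
```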
---
## 🧠 OpenAI Key Setup (Optional for GPT Extraction)
This app supports two extraction modes:
- 🧠 GPT-enhanced extraction (more accurate, slower, uses OpenAI API)
- ⚡ Local spaCy model extraction (faster, free, no external API calls)
The app falls back to spaCy if no OpenAI key is provided or if `USE_GPT_EXTRACTION` is set to `False`.
### Setting Up OpenAI GPT Extraction (Optional)
*1. In your `.env` file, add your OpenAI API Key:*
```env
OPENAI_API_KEY=your-real-openai-api-key-here
```
*2. Save the `.env` file.*
*3. Restart the FastAPI server:*
```bash
uvicorn api.main:app --reload
```
- ✅ If a key is provided, the app will automatically use GPT for extractions (unless `USE_GPT_EXTRACTION` is set to `False`).
- ✅ If no key is provided or an API error occurs, the app will fall back to using spaCy.
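The fallback behaviour described above can be pictured as a try-then-fall-back wrapper. This is a sketch under stated assumptions, not the repository's implementation: `extract_with_fallback` is a hypothetical name, and both extractors are passed in as plain callables so the example stays independent of the OpenAI and spaCy packages.

```python
from typing import Callable, Optional

# Assumed result shape: entity type -> list of extracted strings.
Entities = dict[str, list[str]]
Extractor = Callable[[str], Entities]


def extract_with_fallback(
    text: str,
    gpt_extractor: Optional[Extractor],
    local_extractor: Extractor,
) -> tuple[Entities, str]:
    """Try the GPT extractor first; use the local one if it is absent or fails.

    Returns the entities together with the name of the mode that produced them.
    """
    if gpt_extractor is not None:
        try:
            return gpt_extractor(text), "gpt"
        except Exception:
            # Any API failure (network, authentication, rate limit) triggers
            # the fallback instead of failing the whole request.
            pass
    return local_extractor(text), "spacy"
```

Passing `None` as `gpt_extractor` models the "no key provided" case; an extractor that raises models the "API error" case.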
---
## ⚙️ Controlling GPT Extraction Mode
In your `.env` file, you can control whether the app uses GPT or local spaCy extraction:
```env
USE_GPT_EXTRACTION=True
```
- ✅ True → Use GPT extraction (requires valid OpenAI API key)
- ✅ False → Force local spaCy extraction, even if API key is present
Restart the server after changing the `.env` settings.
```bash
uvicorn api.main:app --reload
```
The app picks up the new setting the next time it starts.
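Taken together, the two settings suggest a decision rule like the one sketched below. `should_use_gpt` is a hypothetical helper, and two details are assumptions rather than documented behaviour: that an unset `USE_GPT_EXTRACTION` defaults to `True`, and that values such as `1` or `yes` also count as true. The sketch reads from the process environment, so it assumes the `.env` file has already been loaded.

```python
import os
from typing import Mapping

# Assumption: which strings count as "true" is not specified in the README.
TRUE_VALUES = {"1", "true", "yes", "on"}


def should_use_gpt(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if GPT mode is enabled and an API key is present."""
    # Assumed default: GPT mode is on unless explicitly disabled.
    flag = env.get("USE_GPT_EXTRACTION", "True").strip().lower()
    api_key = env.get("OPENAI_API_KEY", "").strip()
    return flag in TRUE_VALUES and bool(api_key)
```

Accepting the environment as a parameter keeps the function easy to test with a plain dictionary.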
---
## 🌍 Deployment
This app is deployed on [Render](https://render.com/).
You can deploy your version in one click.
---
## 📦 Folder Structure
```text
api/               # FastAPI backend
├── templates/     # HTML templates (upload form, results page)
├── static/        # Static files
└── db/            # Database
utils/             # Helper modules (export, logging, etc.)
extractor/         # File reading and entity extraction
gpt_integration/   # GPT-enhanced extraction
output/            # Exported Excel files
logs/              # Application logs
```
---
## 📦 Feature Details
- **Multi-file Upload**: Upload one or more `.pdf`, `.docx`, or `.txt` files for processing.
- **Entity Extraction**: Automatically identifies and extracts:
  - People (names)
  - Emails
  - Organizations
- **Results Summary**: Displays a summary of total files processed, and the number of names, emails, and organizations found.
- **CSV & Excel Export**: Download extracted data in `.csv` or `.xlsx` format.
- **Auto Cleanup**: Temporary files older than one hour are deleted automatically.
- **Error Handling**: The user interface reports invalid uploads, unsupported file types, and extraction failures.
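The auto cleanup rule (delete temporary files older than one hour) could be implemented along these lines. This is a minimal sketch, not the project's code: `delete_stale_files` is a hypothetical name, and using the file modification time as the age is an assumption.

```python
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 60 * 60  # one hour, as stated in the feature list


def delete_stale_files(
    directory: Path,
    max_age_seconds: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> list[Path]:
    """Delete regular files in `directory` whose modification time is older
    than `max_age_seconds`, and return the paths that were removed.

    Subdirectories are left untouched.
    """
    current = time.time() if now is None else now
    removed: list[Path] = []
    for path in directory.iterdir():
        if path.is_file() and current - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

A function like this could be called at startup or before each upload; how the real app schedules its cleanup is not described in the README.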
---
## 🚧 Coming Soon
- Daily upload limits per user or IP (via database tracking)
- Admin dashboard to review processed data
- File size limit configuration in `.env`
---
### 🙌 Acknowledgements
- [FastAPI](https://fastapi.tiangolo.com/)
- [spaCy](https://spacy.io/)
- [OpenAI](https://openai.com/)
- [Render](https://render.com/)
---
## 📫 Contact
Crafted with dedication by
[PM The Tech Guy](https://github.com/PMTheTechGuy).
Please don't hesitate to reach out or share your ideas!
---
## 📝 License
This project is licensed under the [MIT License](LICENSE).