{"id":27773559,"url":"https://github.com/pmthetechguy/document-entity-extractor","last_synced_at":"2026-04-16T08:31:42.676Z","repository":{"id":290392060,"uuid":"973053375","full_name":"PMTheTechGuy/document-entity-extractor","owner":"PMTheTechGuy","description":"AI-powered document extractor for names, emails, and organizations.","archived":false,"fork":false,"pushed_at":"2025-04-28T15:10:18.000Z","size":34938,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-28T16:26:42.073Z","etag":null,"topics":["ai","automation","data-extraction","document-extraction","entity-recognition","fastapi","gpt","openai","pandas","portfolio-project","python","uvicorn","web-app"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PMTheTechGuy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-26T06:42:30.000Z","updated_at":"2025-04-28T15:10:21.000Z","dependencies_parsed_at":"2025-04-28T16:38:25.587Z","dependency_job_id":null,"html_url":"https://github.com/PMTheTechGuy/document-entity-extractor","commit_stats":null,"previous_names":["pmthetechguy/document-entity-extractor"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PMTheTechGuy%2Fdocument-entity-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PMTheTechGuy%2Fdocument-entity-extractor/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/PMTheTechGuy%2Fdocument-entity-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PMTheTechGuy%2Fdocument-entity-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PMTheTechGuy","download_url":"https://codeload.github.com/PMTheTechGuy/document-entity-extractor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251608442,"owners_count":21616858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","automation","data-extraction","document-extraction","entity-recognition","fastapi","gpt","openai","pandas","portfolio-project","python","uvicorn","web-app"],"created_at":"2025-04-30T01:10:18.667Z","updated_at":"2026-04-16T08:31:42.670Z","avatar_url":"https://github.com/PMTheTechGuy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AI Data Extraction Tool\n\n🚀 Upload documents → Extract Names, Emails, and Organizations → Download structured Excel results instantly.  \nBuilt with **FastAPI**, **Pandas**, and optional **GPT-enhanced** extraction.  
\nDeployed live on **Render**.\n\n---\n\n## ✨ Features\n\n- ✅ Upload PDF, DOCX, and TXT documents\n- ✅ Extract **Names**, **Emails**, and **Organizations**\n- ✅ Multi-file uploads supported (combines results into one Excel file)\n- ✅ Clean and organized Excel file download (`.xlsx`)\n- ✅ Supports both **local entity extraction** and **GPT-enhanced** extraction\n- ✅ Automatic fallback if the custom model is missing\n- ✅ Deployed online via [Render](https://render.com/)\n\n---\n\n## 📸 Screenshots\n\n### Upload Page\n\u003cimg src=\"api/static/screenshot/Upload_PageWith_item.png\" width=\"675\" height=\"675\" alt=\"Upload Page\"\u003e\n\n### Extraction Results Page\n\u003cimg src=\"api/static/screenshot/ExtractionResultsPage.png\" width=\"675\" height=\"675\" alt=\"Results Page\"\u003e\n\n---\n\n## 🚀 Live Demo\n\n\u003e 🟢 [Visit the Live App Here](https://ai-data-extraction-tool.onrender.com/)  \n\n---\n\n## ⚙️ Technologies Used\n\n- Python 3.11\n- FastAPI\n- Uvicorn\n- Pandas\n- spaCy\n- OpenAI API (optional GPT-enhancement)\n- openpyxl (for Excel export)\n\n---\n\n## 🛠 Local Development Setup\n\nClone the repository:\n\n```bash\ngit clone https://github.com/PMTheTechGuy/document-entity-extractor.git\ncd document-entity-extractor\n```\nInstall dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\nSet up your environment variables:\n\nCreate a `.env` file based on `.env.example`.\n\n```bash\ncp .env.example .env\n```\nStart the server locally:\n\n```bash\nuvicorn api.main:app --reload\n```\nIf the application fails to load at `http://localhost:8000`, stop the server with `Ctrl + C` and restart it on port `8001`:\n\n```bash\nuvicorn api.main:app --reload --port 8001\n```\n---\n\n## 🧠 OpenAI Key Setup (Optional for GPT Extraction)\n\nThis app supports two extraction modes:\n\n- 🧠 GPT-enhanced extraction (more accurate, slower, uses OpenAI API)\n\n- ⚡ Local spaCy model extraction (faster, free, no external API calls)\n\nBy default, the app will 
fall back to spaCy if no OpenAI key is provided or `USE_GPT_EXTRACTION` is set to `False`.\n\n### Setting Up OpenAI GPT Extraction (Optional)\n\n*1. In your `.env` file, add your OpenAI API Key:*\n\n```env\nOPENAI_API_KEY=your-real-openai-api-key-here\n```\n*2. Save the `.env` file.*\n\n*3. Restart the FastAPI server:*\n\n```bash\nuvicorn api.main:app --reload\n```\n\n- ✅ If a key is provided, the app will automatically use GPT for extractions.\n- ✅ If no key is provided or an API error occurs, the app will fall back to using spaCy.\n\n---\n\n## ⚙️ Controlling GPT Extraction Mode\n\nIn your `.env` file, you can control whether the app uses GPT or local spaCy extraction:\n```env\nUSE_GPT_EXTRACTION=True\n```\n- ✅ True → Use GPT extraction (requires a valid OpenAI API key)\n\n- ✅ False → Force local spaCy extraction, even if an API key is present\n\nRestart the server after changing the `.env` settings:\n```bash\nuvicorn api.main:app --reload\n```\n\nThe app will detect this automatically at runtime.\n\n---\n\n## 🌍 Deployment\n\nThis app is deployed on [Render](https://render.com/).\n\nYou can deploy your version in one click.\n\n---\n## 📦 Folder Structure\n\n```text\napi/             # FastAPI backend\n├── templates/   # HTML templates (upload form, results page)\n├── static/      # Static files\n├── db/          # Database\nutils/           # Helper modules (export, logging, etc.)\nextractor/       # File reading and entity extraction\ngpt_integration/ # GPT-enhanced extraction\noutput/          # Exported Excel files\nlogs/            # Application logs\n```\n---\n\n## 📦 Features\n\n- **Multi-file Upload**: Upload one or more `.pdf`, `.docx`, or `.txt` files for processing.\n- **Entity Extraction**: Automatically identifies and extracts:\n  - People (names)\n  - Emails\n  - Organizations\n- **Results Summary**: Displays a summary of total files processed, and the number of names, emails, and organizations found.\n- **CSV \u0026 Excel Export**: Download extracted data in 
`.csv` or `.xlsx` format.\n- **Auto Cleanup**: Temporary files that are older than one hour will be automatically deleted.\n- **Error Handling**: User interface for handling invalid uploads, unsupported file types, and extraction failures.\n---\n\n## 🚧 Coming Soon\n- Daily upload limits per user or IP (via database tracking)\n- Admin dashboard to review processed data\n- File size limit configuration in .env\n\n---\n### 🙌 Acknowledgements\n- [FastAPI](https://fastapi.tiangolo.com/)\n\n- [spaCy](https://spacy.io/)\n\n- [OpenAI](https://openai.com/)\n\n- [Render](https://render.com/)\n\n---\n\n## 📫 Contact\n\nCrafted with dedication by \n[PM The Tech Guy](https://github.com/PMTheTechGuy).\n\n\nPlease don't hesitate to reach out or share your ideas!\n\n---\n## 📝 License\n\nThis project is licensed under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmthetechguy%2Fdocument-entity-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpmthetechguy%2Fdocument-entity-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmthetechguy%2Fdocument-entity-extractor/lists"}