{"id":29047833,"url":"https://github.com/e-candeloro/credem_hack_2025","last_synced_at":"2026-05-15T20:06:41.770Z","repository":{"id":300646454,"uuid":"1004643269","full_name":"e-candeloro/Credem_Hack_2025","owner":"e-candeloro","description":"AI-powered document processing pipeline for Credem Hackathon 2025. Leverages Google Cloud AI services to intelligently extract, classify, and process HR documents through a robust ETL pipeline. ","archived":false,"fork":false,"pushed_at":"2025-06-22T21:37:03.000Z","size":14524,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-08T13:44:16.851Z","etag":null,"topics":["ai","document-processing","googlecloudplatform","hackathon","llm","prompt-engineering","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/e-candeloro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-19T00:42:15.000Z","updated_at":"2025-06-22T21:36:50.000Z","dependencies_parsed_at":"2025-06-22T22:32:02.773Z","dependency_job_id":null,"html_url":"https://github.com/e-candeloro/Credem_Hack_2025","commit_stats":null,"previous_names":["e-candeloro/credem_hack_2025"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/e-candeloro/Credem_Hack_2025","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FCredem_Hack_2025","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FCredem_Hack_2025/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FCredem_Hack_2025/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FCredem_Hack_2025/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/e-candeloro","download_url":"https://codeload.github.com/e-candeloro/Credem_Hack_2025/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FCredem_Hack_2025/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33078072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-15T20:05:40.333Z","status":"ssl_error","status_checked_at":"2026-05-15T20:05:38.672Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","document-processing","googlecloudplatform","hackathon","llm","prompt-engineering","python"],"created_at":"2025-06-26T17:06:08.578Z","updated_at":"2026-05-15T20:06:41.751Z","avatar_url":"https://github.com/e-candeloro.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Credem Hack 2025 - AI Document Processing Pipeline\n\n[![Python](https://img.shields.io/badge/Python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![Docker](https://img.shields.io/badge/Docker-Ready-blue.svg)](https://www.docker.com/)\n[![Google Cloud](https://img.shields.io/badge/Google%20Cloud-AI%20Services-orange.svg)](https://cloud.google.com/)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE.md)\n\n\u003e **🏆 Hackathon Project**: Advanced AI-powered document processing solution for Credem Bank\n\nThis project is a sophisticated data processing pipeline built for the Credem Hackathon 2025. It leverages cutting-edge Google Cloud AI services, including Document AI and Gemini 2.5 Pro, to intelligently extract, classify, and process information from various document formats through a robust ETL pipeline.\n\n## 📋 Table of Contents\n\n- [🚦 What is this?](#-what-is-this)\n- [📋 Challenge Information \u0026 Team](#-challenge-information--team)\n- [🏆 Solution Strengths](#-solution-strengths)\n- [🏁 Quick Start with Docker](#-quick-start-with-docker)\n- [🛠️ Local Development Setup](#️-local-development-setup)\n- [📊 Architecture Overview](#-architecture-overview)\n- [🔧 Configuration](#-configuration)\n- [📈 Performance \u0026 Results](#-performance--results)\n- [📄 License](#-license)\n\n## 🚦 What is this?\n\n### Core Technologies\n- **🤖 AI/OCR Engine**: Google Document AI + Google Gemini 2.5 Pro via GCP API\n- **🐍 Core Engine**: Python 3.11 with modern async capabilities\n- **📊 Data Processing**: Pandas for efficient data manipulation\n- **🏗️ Architecture**: Microservices-ready with Docker containerization\n- **🔧 Dev Experience**: VS Code Dev Container, pre-commit hooks, `uv` for Python dependency management\n- **☁️ Cloud Native**: Built for Google Cloud Platform deployment\n\n### Key Features\n- **Intelligent Document Classification**: 22+ predefined document categories\n- **Multi-format Support**: PDF, TIFF, JPEG, PNG, and more\n- **Real-time Processing**: Streamlined pipeline for high-volume document processing\n- **Error Resilience**: Robust error handling and fallback mechanisms\n- **Scalable Architecture**: Designed for enterprise-grade scalability\n\n## 📋 Challenge Information \u0026 Team\n\n### Challenge Documents\nThe document challenge information and team pitch presentation can be found in the `documents/` folder:\n- 📄 `Credemhack - Materiale per i team.pdf` - Challenge specifications and requirements\n- 📋 `specifiche.pdf` - Technical specifications and evaluation criteria\n- 🎯 `team_presentation.pdf` - Our team's solution presentation and pitch\n\n### Team: CloudFunctions 🚀\nOur diverse team combines expertise in cybersecurity, AI research, data science, and engineering:\n\n- **🛡️ Daniele Di Battista** - Cybersecurity Expert at Leonardo - [LinkedIn Profile](https://www.linkedin.com/in/daniele-di-battista-883160266/)\n- **💻 Luca Pedretti** - Data Scientist - [LinkedIn Profile](https://www.linkedin.com/in/luca-pedretti-re/)\n- **🔧 Matteo Peroni** - Data Analyst/UX Design - [LinkedIn Profile](https://www.linkedin.com/in/matteo-peroni-049951237/)\n- **🎓 Omar Carpentiero** - MSc Student in AI Engineering at UNIMORE - [LinkedIn Profile](https://www.linkedin.com/in/omar-carpentiero-6543992a5/)\n- **🤖 Ettore Candeloro** - AI Researcher at AImagelab at UNIMORE - [LinkedIn Profile](https://linkedin.com/in/ettore-candeloro-900081162)\n\n### 📊 Solution Presentation\n📽️ **Canva Presentation**: [View our solution presentation](https://www.canva.com/design/DAGq6NnpBmQ/VWwyAapBeBnXo8b5S6CODQ/edit?utm_content=DAGq6NnpBmQ\u0026utm_campaign=designshare\u0026utm_medium=link2\u0026utm_source=sharebutton)\n\n## 🏆 Solution Strengths\n\nOur AI-powered document processing solution offers several key advantages:\n\n### 🔍 Advanced AI Integration\n- **Dual AI Approach**: Combines Google Document AI for text extraction with Gemini 2.5 Pro for intelligent classification and data extraction\n- **Multi-modal Processing**: Handles both text and image-based documents seamlessly\n- **Context-Aware Analysis**: Understands document context for better classification accuracy\n\n### 📊 High Accuracy \u0026 Performance\n- **Smart Classification**: Automatically categorizes documents into 22+ predefined clusters with confidence scoring\n- **Intelligent Data Extraction**: Extracts names, dates, and contextual information with sophisticated error handling\n- **Validation Pipeline**: Multi-stage validation ensures data quality and consistency\n\n### ⚡ Enterprise-Grade Architecture\n- **Scalable Design**: Built on Google Cloud Platform for enterprise-grade scalability and reliability\n- **Microservices Ready**: Containerized architecture supports easy deployment and scaling\n\n### 🔄 Complete ETL Solution\n- **End-to-End Pipeline**: Complete ETL process from document ingestion to structured data export\n- **Data Transformation**: Intelligent data cleaning and normalization\n\n### 🔧 Developer Experience\n- **Modern Stack**: Python 3.11 with latest libraries and best practices\n- **Docker Integration**: Complete containerization for consistent deployment\n- **Comprehensive Documentation**: Detailed setup and usage instructions\n\n## 🏁 Quick Start with Docker\n\n### 1. **Clone and Configure**\n```bash\n# Clone the repository\ngit clone https://github.com/e-candeloro/Credem_Hack_2025.git\ncd Credem_Hack_2025\n\n# Copy environment file and add your credentials\ncp env.example .env\n```\n\n### 2. **Configure Environment Variables**\nEdit the `.env` file with your Google Cloud credentials:\n```bash\n# Required: Google Cloud credentials\nPROJECT_ID=your-gcp-project-id\nPROCESSOR_ID=your-document-ai-processor-id\n\n# Optional: Customize model and paths\nLLM_MODEL=gemini-2.5-pro\nLOCATION=us\n```\n\n### 3. **Build and Run**\n```bash\n# Build the Docker image\ndocker build -t credem-hack-2025 .\n\n# Run the pipeline inside the container\ndocker run credem-hack-2025\n```\n\n## 🛠️ Local Development Setup\n\n### Prerequisites\n- Python 3.11+\n- [uv](https://astral.sh/docs/uv/installation/) (the recommended package manager)\n- Google Cloud SDK with authentication\n\n### 1. **Clone and Bootstrap**\n```bash\ngit clone https://github.com/e-candeloro/Credem_Hack_2025.git\ncd Credem_Hack_2025\ncp env.example .env\n```\n\n### 2. **Set Up Python Environment with `uv`**\n```bash\n# Install all dependencies from pyproject.toml\nuv sync\n\n# Activate the virtual environment\nsource .venv/bin/activate  # Linux/macOS\n# or\n.venv\\Scripts\\activate     # Windows\n```\n\n### 3. **Install pre-commit hooks**\nThis ensures code quality and formatting standards are met before committing.\n```bash\n# Install pre-commit into the virtual environment\nuv pip install pre-commit\n# Set up the git hooks\npre-commit install\n# Run all checks manually on all files\npre-commit run --all-files\n```\n\n### 4. **Authenticate with Google Cloud**\n```bash\n# Authenticate with Google Cloud\ngcloud auth application-default login\n\n# Set your project ID\ngcloud config set project YOUR_PROJECT_ID\n```\n\n## 📊 Architecture Overview\n\n### Processing Flow\n1. **Document Ingestion**: Documents are loaded from the `tmp/` directory\n2. **OCR Processing**: Google Document AI extracts text and structure\n3. **AI Classification**: Gemini 2.5 Pro classifies documents and extracts key data\n4. **Data Validation**: Extracted data is validated and cleaned\n5. **ETL Processing**: Data is transformed and enriched using reference datasets\n6. **Output Generation**: Final structured data is exported in required formats\n\n## 🔧 Configuration\n\n### Environment Variables\n| Variable | Description | Default |\n|----------|-------------|---------|\n| `PROJECT_ID` | Google Cloud Project ID | `credemhack-cloudfunctions` |\n| `PROCESSOR_ID` | Document AI Processor ID | `e4a86664fd2377e2` |\n| `LLM_MODEL` | Gemini model to use | `gemini-2.5-pro` |\n| `LOCATION` | GCP region | `us` |\n| `CLUSTERS_PATH` | Path to clusters CSV | `etl_db_data/clusters.csv` |\n| `TRAIN_GT_PATH` | Path to training data | `etl_db_data/doc_trains.csv` |\n| `PERSONALE_PATH` | Path to personnel data | `etl_db_data/personale.csv` |\n\n### File Structure\n```\nCredem_Hack_2025/\n├── app/                    # Main application code\n│   ├── ocr/               # Document AI and OCR processing\n│   ├── etl/               # ETL pipeline components\n│   ├── etl_db_data/       # Local documents folder for data enrichment\n│   └── utils/             # Utility functions\n├── documents/             # Challenge documents and presentations\n├── notebooks/             # Jupyter notebooks for analysis\n└── tmp/                   # Temporary document storage\n```\n\n## 📈 Performance \u0026 Results\n\n### Processing Capabilities\n- **Document Types**: PDF, TIFF, JPEG, PNG, and more\n- **Processing Speed**: ~100 documents/minute (depending on complexity)\n- **Accuracy**: \u003e95% classification accuracy on test datasets\n- **Scalability**: Designed to handle thousands of documents\n\n### Quality Metrics\n- **Text Extraction**: High accuracy OCR with layout preservation\n- **Classification**: 22+ document categories with confidence scoring\n- **Data Extraction**: Names, dates, and contextual information extraction\n- **Error Handling**: Robust fallback mechanisms for edge cases\n\n---\n\n## 📄 License\nThis project is created for hackathon purposes. See [LICENSE](LICENSE.md) for details.\n\n---\n\n\n*Built with ❤️ by Team CloudFunctions for Credem Hack 2025*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fe-candeloro%2Fcredem_hack_2025","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fe-candeloro%2Fcredem_hack_2025","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fe-candeloro%2Fcredem_hack_2025/lists"}