{"id":31750621,"url":"https://github.com/jhaayush2004/churncast","last_synced_at":"2025-10-09T15:54:43.205Z","repository":{"id":312567451,"uuid":"1045574554","full_name":"jhaayush2004/ChurnCast","owner":"jhaayush2004","description":"Fusion of deep Data Science, Machine Learning and MLOps...","archived":false,"fork":false,"pushed_at":"2025-08-31T16:05:11.000Z","size":7556,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-31T16:21:47.321Z","etag":null,"topics":["aws","data-analysis","data-science","data-visualization","deep-neural-networks","docker","machine-learning","mlops-workflow"],"latest_commit_sha":null,"homepage":"https://youtu.be/VECdHmgFqwo","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jhaayush2004.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-27T11:49:26.000Z","updated_at":"2025-08-31T16:05:14.000Z","dependencies_parsed_at":"2025-08-31T16:21:49.094Z","dependency_job_id":"411c8304-75fa-4650-b523-9cdb79d46404","html_url":"https://github.com/jhaayush2004/ChurnCast","commit_stats":null,"previous_names":["jhaayush2004/churncast"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/jhaayush2004/ChurnCast","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhaayush2004%2FChurnCast","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhaay
ush2004%2FChurnCast/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhaayush2004%2FChurnCast/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhaayush2004%2FChurnCast/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jhaayush2004","download_url":"https://codeload.github.com/jhaayush2004/ChurnCast/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jhaayush2004%2FChurnCast/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279001638,"owners_count":26083147,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","data-analysis","data-science","data-visualization","deep-neural-networks","docker","machine-learning","mlops-workflow"],"created_at":"2025-10-09T15:54:40.223Z","updated_at":"2025-10-09T15:54:43.196Z","avatar_url":"https://github.com/jhaayush2004.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ChurnCast - The Autonomous Retention Intelligence Engine\n**ChurnCast** represents the fusion of meticulous data science and robust MLOps automation, engineered to proactively identify customers at risk of churning with an exceptional performance. 
It's a comprehensive demonstration of the entire machine learning lifecycle, from deep statistical analysis and insight-driven feature engineering to a fully containerized, CI/CD-driven deployment on AWS. The result is a self-sustaining intelligence engine that is as scientifically rigorous as it is operationally resilient.\n\n---\n\n![App Screenshot](https://image2url.com/images/1756651962898-4126d880-294f-4722-9bce-9e69be74b741.png)\n\n## 🌐 Tech Stack\n\n* **Languages**: Python 3.10\n* **Data Storage**: MongoDB Atlas, AWS S3\n* **Deployment**: Docker, AWS (EC2, ECR), GitHub Actions\n* **Data Science / Machine Learning**: scikit-learn, pandas, TensorFlow, Keras, NumPy, Imbalanced-learn, Dill, XAI (Explainable AI), Matplotlib, Plotly, Missingno, express, seaborn\n* **MLOps/DevOps Tools**: GitHub Actions, Docker, PyProject, Conda\n* **Frontend**: HTML, CSS, Jinja2\n* **Backend**: Python, FastAPI, Uvicorn\n\n---\n## 📁 Project Structure and Setup\n\n```\n📦ChurnCast\n ┣ 📂src\n ┃ ┣ 📂components\n ┃ ┣ 📂data_access\n ┃ ┣ 📂aws_storage\n ┃ ┣ 📂configuration\n ┃ ┣ 📂entity\n ┃ ┣ 📂pipeline\n ┃ ┗ 📜utils\n ┣ 📂notebook\n ┣ 📂static\n ┣ 📂templates\n ┣ 📜app.py\n ┣ 📜requirements.txt\n ┣ 📜Dockerfile\n ┣ 📜.dockerignore\n ┣ 📜setup.py\n ┣ 📜pyproject.toml\n ┗ 📜README.md\n```\n---\n\n### Project Template Creation\n\nRun `template.py` to automatically generate a clean project structure:\n\n```bash\npython template.py\n```\n\nThis creates all essential modules and files, including:\n\n```\nsrc/\n├── components/\n│   ├── data_ingestion.py, model_trainer.py, ...\n├── configuration/\n│   ├── mongo_db_connection.py, aws_connection.py\n├── cloud_storage/\n├── data_access/\n├── entity/\n├── pipeline/\n├── utils/\n├── exception/, logger/\n```\n\n---\n\n## 🧰 Environment Setup\n\n### 2️⃣ Local Package Management\n\nConfigure `setup.py` and `pyproject.toml` to register local packages. 
\n\n### 3️⃣ Create Virtual Environment\n\n```bash\npython3.10 -m venv venv  # on Windows: py -3.10 -m venv venv\nsource venv/bin/activate  # activate the environment before installing\npip install -r requirements.txt\npip list  # verify installations\n```\n\n---\n\n## 🍃 MongoDB Atlas Setup\n\n### 4️⃣ Steps to Configure MongoDB Atlas\n\n1. Create an account on [MongoDB Atlas](https://www.mongodb.com/cloud/atlas).\n2. Create a new **M0 cluster** → Define a user with a password.\n3. Add IP: `0.0.0.0/0` for access from all IPs.\n4. Get the **Python connection string**.\n\n### 5️⃣ Push Dataset to MongoDB\n\n* Create a `notebook/` folder and add your dataset.\n* Use `mongoDB_demo.ipynb` to:\n\n  * Load dataset\n  * Push to MongoDB\n  * Validate data in Atlas → *Browse Collections*\n\n![App Screenshot](https://image2url.com/images/1756637536553-99a842da-ecdc-48c4-bda8-5bd7c4fa392a.png)\n\n---\n\n\n## 6️⃣ Logging and Exception Handling\n\n* Add logging logic in `src/logger/__init__.py`\n* Add exception logic in `src/exception/__init__.py`\n* Test both using `demo.py`\n\n##  Exploratory Data Analysis \u0026 Key Insights\nThe foundation of ChurnCast was built upon a deep Exploratory Data Analysis (EDA) to understand the underlying patterns, correlations, and characteristics of the customer dataset. The process was systematic, leveraging tools like Pandas for data manipulation, Matplotlib, Plotly, and Seaborn for rich visualizations, and Missingno for a clear view of data completeness.\n\nThe methodology involved a **multi-layered approach** to dissect the data from every angle:\n\n**Initial Data Assessment**: The analysis began by examining the dataset's structure (`.info()`) and statistical summaries (`.describe()`). 
The `missingno.matrix` visualization was crucial for confirming the presence and pattern of missing values across features, which directly informed the `multi-stage imputation` strategy.\n\n**Univariate Analysis**: The distribution of each feature was analyzed individually to understand its characteristics.\n - For numerical features, distributions were visualized using `histplots` and `kdeplots` to identify skewness, while `boxplots` were used to detect potential outliers.\n\n - For categorical features, `countplots` and `pie charts` were used to understand the balance of classes (e.g., Churn vs. No Churn proportions).\n\n**Bivariate \u0026 Multivariate Analysis**: This was the core of the EDA, where relationships between variables were uncovered using a variety of plots. A `correlation heatmap` provided a high-level overview of linear relationships. Deeper insights were gained using:\n\n- `Violin Plots`: To compare the distribution of a numerical variable across different categories (e.g., Tenure by MaritalStatus), combining the benefits of a box plot and a KDE plot.\n\n- `Grid Plots \u0026 Faceting`: To create comprehensive multivariate views and compare relationships across different segments simultaneously.\n\n- Targeted `groupby` Aggregations: To calculate precise statistics (like mean churn rate) within specific customer segments, turning visual insights into hard numbers.\n\n**Some visuals of Exploratory Data Analysis**\n\n![App Screenshot](https://image2url.com/images/1756642268824-d492ce70-2f90-45b0-ae1e-c098210f74d6.jpg)\n\n**Key Findings:**\n\n- High-Risk Customer Segments Identified:\n\n  - **Gender Disparity**: Bivariate analysis using count plots and cross-tabulations revealed a significant gender imbalance in churn, with males accounting for 63.3% of the total churned population, pointing to a potential product-market mismatch for this demographic.\n\n   - **Marital Status**: `groupby` aggregations showed that while married customers form the 
largest user base, single customers are disproportionately more likely to churn, highlighting a key segment for targeted retention campaigns.\n\n   - **Login Device**: A high churn rate among mobile phone users was identified, strongly suggesting that potential UI/UX friction within the mobile app is a significant driver of churn. This insight directly led to a recommendation for a technical audit of the mobile platform.\n\n- **Purchase Behavior \u0026 Loyalty Indicators**:\n\n  - **Positive Engagement**: Correlation analysis showed that churn risk decreases significantly as a customer's `OrderAmountHikeFromlastYear` increases, with the 12-15% hike threshold appearing as a critical loyalty milestone. Increased `CouponUsed` also correlated strongly with lower churn.\n\n  - **Cashback Paradox**: A counter-intuitive positive correlation between `CashbackAmount` and `Churn` was discovered. This suggests that high cashback offers may be attracting less loyal, \"deal-seeking\" customers who leave after securing a deal, indicating a need to rethink the incentive structure for long-term retention.\n\n- **Ineffective Feedback Metrics**:\n\n  Analysis showed a weak or non-existent correlation between both `SatisfactionScore` and the formal `Complain` metric with the actual `Churn` outcome. 
This critical insight revealed that these channels are not capturing the true drivers of customer dissatisfaction and are unreliable for proactive retention, justifying the development of a more intelligent predictive model.\n\n### **Actionable Recommendations from EDA**:\n\nBased on these findings, several data-driven business strategies were proposed:\n\n**Refine Product Strategy**: Investigate and expand product categories that appeal to male and single customers and conduct a thorough UI/UX audit of the mobile application.\n\n**Optimize Loyalty Programs**: Focus retention efforts on customers reaching the 12-15% order amount hike milestone and re-evaluate the cashback strategy to incentivize long-term loyalty.\n\n**Improve Feedback Loop**: Develop more direct feedback channels, such as proactive surveys, as the current systems are not reliable predictors of churn.\n\n# A Journey Through the Automated Pipeline:\n\n\n\n## Data Ingestion \u0026 Validation: \n\nThe pipeline begins by automatically sourcing customer data from a `MongoDB database`. A rigorous validation schema ensures data integrity, checking for correct data types and column structures, guaranteeing that only high-quality data enters the transformation stage.\n\n**Data Ingestion Implementation**\n\n - Define MongoDB connector in configuration/mongo_db_connection.py\n\n - Access and transform data using data_access/proj1_data.py\n\n - Configure ingestion in:\n\n   - entity/config_entity.py\n   - entity/artifact_entity.py\n - Logic in components/data_ingestion.py\n\n - constants in constants/__init__.py\n\n - Run ingestion via pipeline/training_pipeline.py\n\n## Data Validation\nOnce ingested, the data doesn't immediately enter the transformation stage. Instead, it is passed through a rigorous, automated Data Validation component. 
This step is critical for maintaining the stability and reliability of the entire ML system.\n\nThe validation process is entirely schema-driven, using a central schema.yaml file as the single source of truth for the expected data structure. This is a key MLOps practice that decouples the validation logic from the code.\n\nThe validation component systematically performs the following checks on both the training and testing datasets:\n\n - **Column Presence \u0026 Integrity**: It verifies that all columns specified in the `schema.yaml` exist in the ingested data. This immediately catches errors caused by upstream changes, such as a column being accidentally dropped or renamed.\n\n - **Data Type Conformance**: It meticulously checks that the data type of each column (e.g., `int64`, `float64`, `object`) exactly matches the data type defined in the schema. This prevents silent errors and pipeline failures during the transformation or training stages, which often expect specific numerical or categorical formats.\n\n - **Generation of an Auditable Report**: Upon completion, the component generates a `validation_report.yaml`. This report serves as an auditable artifact, providing a clear and immediate status (pass/fail) of the data's quality. If validation fails, the pipeline is designed to halt immediately, preventing corrupt data from propagating downstream.\n\nBy enforcing a strict data contract through this schema, the Data Validation component guarantees that only clean, reliable, and correctly structured data proceeds to the EDA and transformation stages, ensuring the robustness and reproducibility of the entire project.\n\n - Schema defined in `config/schema.yaml`\n - validation logic in `utils/main_utils.py`\n - Implement validation logic in `components/data_validation.py`\n### Insight-Driven Data Transformation: \nThe foundation of ChurnCast was built upon a deep Exploratory Data Analysis (EDA), which revealed the unique characteristics of the dataset. 
This informed a bespoke preprocessing strategy:\n\n - **Strategic Imputation**: Instead of a generic approach, a `multi-stage imputation` process was designed. `IterativeImputer (MICE)` was used for features with complex interdependencies, `K-Nearest Neighbors (KNN) Imputer` was applied to behavioral metrics, and a robust `SimpleImputer` handled straightforward transactional data. This tailored strategy ensured that the integrity and predictive power of the data were maximized.\n\n - **Advanced Encoding**: To handle categorical variables, `Target Encoding` was deliberately chosen over traditional methods. This prevented the `curse of dimensionality` that plagues One-Hot Encoding and avoided the `false ordinality` that can be introduced by Label Encoding, enriching the feature set with valuable statistical information.\n\n - **Handling Class Imbalance**: The inherent class imbalance was meticulously addressed. Both `SMOTEENN` and `SMOTETomek` resampling techniques were implemented and evaluated, with SMOTEENN ultimately being selected for its superior ability to improve the model's F1-score and generalization on unseen data.\n\n - **Feature Creation**: A new, powerful feature, `Digital_Engagement`, was engineered by combining `HourSpendOnApp` and `NumberOfDeviceRegistered` to create a more potent indicator of customer interaction.\n\n- **Encapsulated Transformation Pipeline**: All of these intricate steps—`sequential imputation`, `target encoding`, `feature engineering`, `outlier handling`, and `scaling`—are encapsulated into a single, portable scikit-learn pipeline object. 
This ensures that the same transformations are applied during training, evaluation, and real-time prediction, which guards against training-serving skew.\n\n- Transform logic in `components/data_transformation.py`\n- Use `entity/estimator.py` for transformation classes\n\n## Exhaustive Modeling \u0026 Experimentation: \nThe path to the final model, which reaches `0.99` recall and precision and `98%` accuracy, was paved with rigorous experimentation:\n\n\n- **Broad-Spectrum Exploration**: The initial discovery phase explored a wide range of modeling paradigms, from leveraging AutoML platforms for baseline performance metrics to designing and training custom Neural Networks.\n\n - **Hyperparameter Tuning \u0026 Cross-Validation**: The final XGBoost model was not a default implementation. It was meticulously refined through extensive hyperparameter tuning, with its robustness and consistency validated using K-Fold Cross-Validation.\n\n - **Advanced Performance Metrics**: The model's success was measured beyond simple accuracy. A suite of advanced classification metrics, including F-beta scores (to weight recall more heavily than precision) and Cohen's Kappa, was used to confirm its effectiveness in a real-world, imbalanced data scenario.\n\n```\nImplemented model training in components/model_trainer.py\nUpdated estimator utilities in entity/estimator.py\n```\n## Deep Learning Exploration: Architecting a Custom Neural Network\nTo ensure the highest possible performance, the project extended beyond traditional machine learning models to explore deep learning solutions. A custom Artificial Neural Network (ANN) was designed from scratch using TensorFlow and Keras, to check whether the chosen model was indeed the best fit for the problem. 
After tuning hyperparameters such as the activation functions, number of layers, neurons per layer, optimizer, and learning rate, and training for 100 epochs, I achieved performance almost comparable to the best-performing ML model, XGBClassifier.\n\n- **Custom Architecture**: A Multi-Layer Perceptron (MLP) was architected with two hidden layers using `tanh` activation functions to capture non-linear patterns, and a final `sigmoid` output layer suited to the binary classification task.\n\n- **Addressing Class Imbalance**: The significant class imbalance discovered during EDA was a critical challenge. The model was trained with a `class_weight` parameter, which assigns a higher penalty to classification errors on the minority (churn) class. This forced the network to pay closer attention to the signals of churning customers, improving its real-world effectiveness.\n\n- **Exceptional Performance**: This custom-built ANN proved highly effective, achieving an overall accuracy of 97% and an F1-score of 0.97 for the churn class. This performance is almost comparable to that of the final, highly tuned XGBoost model.\n![App Screenshot](https://image2url.com/images/1756650373955-f9ea6c62-aef7-49dc-b068-78ea7bd4183a.png)\n\n## Explainable AI (XAI) for Actionable Insights\nChurnCast is not a `\"black box.\"` Explainable AI (XAI) techniques have been applied to the final model to interpret its predictions. This allows stakeholders to understand the key drivers behind why a customer is flagged as a churn risk, turning a simple prediction into an actionable business insight.\n\n### Model Training\n\n* Implement model training in `components/model_trainer.py`\n* Update estimator utilities in `entity/estimator.py` \n\n## Automated Model Evaluation: The Quality Gatekeeper\nThis component serves as the critical automated quality gate for the entire pipeline, ensuring that only superior models are promoted to production. 
Its primary role is to prevent \"model degradation\" by making a data-driven decision on whether a newly trained model outperforms the one currently in service.\n\nThe evaluation process follows a classic Champion vs. Challenger methodology:\n\n- **Benchmarking the Challenger**: The newly trained model (the \"challenger\") is loaded from the `model_trainer` artifact. Its performance is rigorously benchmarked against the held-out, transformed test set (`transformed_test.npy`), which it has never seen before.\n\n- **Retrieving the Champion**: The current production model (the \"champion\") is retrieved from its permanent home in the Amazon S3 model registry. If no champion model exists (as in the very first pipeline run), the challenger is automatically accepted.\n\n- **Fair and Rigorous Comparison**: Both models are evaluated on the exact same raw test dataset to ensure a fair, apples-to-apples comparison. The champion model, which contains the full preprocessing pipeline, is able to transform this raw data itself, demonstrating its real-world predictive capability.\n\n- **Multi-Faceted Performance Metrics**: While the final automated go/no-go decision in the code is driven by the F1-score (chosen for its robustness on imbalanced datasets), the model's overall health and business value are assessed using a comprehensive suite of metrics that were established during the experimentation phase. 
This includes:\n\n  - **Precision and Recall**: To understand the trade-offs between false positives and false negatives.\n\n  - **F-beta Scores**: Specifically used to assign more weight to recall, which is crucial in a churn problem where failing to identify a churning customer (a false negative) is more costly than mistakenly flagging a loyal one.\n\n  - **Cohen's Kappa Score**: To measure the model's performance while accounting for the possibility of correct predictions occurring by chance.\n  - **AUC-ROC Score**: To evaluate the model's ability to distinguish between the positive (Churn) and negative (No Churn) classes. A high AUC value indicates that the model is excellent at ranking customers by their probability of churning.\n\n- **The Final Verdict**: An artifact is generated containing a simple boolean flag: `is_model_accepted`. If the challenger model demonstrates a statistically significant performance improvement over the champion, this flag is set to True. This artifact acts as a signal, authorizing the `ModelPusher` component to proceed with deploying the new, superior model to the production environment.\n\n* Evaluate new model vs old using logic in `components/model_evaluation.py`\n---\n![App Screenshot](https://image2url.com/images/1756648933793-7f5d3ca7-143d-48d8-8a14-fb778b9bf4a6.png)\n\n\n## ☁️ AWS Setup for Model Deployment\n\n### AWS IAM and S3\n\n* Create an IAM User with `AdministratorAccess`\n* Generate and download **Access Key \u0026 Secret**\n* Add credentials as ENV vars:\n\n```bash\n# Bash\nexport AWS_ACCESS_KEY_ID=\"XXX\"\nexport AWS_SECRET_ACCESS_KEY=\"XXX\"\n```\n\n* Add to `constants/__init__.py`:\n\n```python\nMODEL_BUCKET_NAME = \"churncast\"\nMODEL_PUSHER_S3_KEY = \"churncastkey\"\nMODEL_EVALUATION_CHANGED_THRESHOLD_SCORE = 0.03\n```\n\n###  S3 Bucket Creation\n\n* Go to S3 → Create bucket → `churncast` (Region: `us-east-1`)\n* Uncheck “Block all public access”\n\n###  S3 Logic\n\n* Write push/pull logic in:\n\n  * 
`cloud_storage/aws_storage.py`\n  * `entity/s3_estimator.py`\n\n---\n\n### Model Pusher\n\n* Push the final model to S3 in `components/model_pusher.py`\n\n## Live Prediction Pipeline \u0026 Inference\nThe project includes a robust prediction pipeline designed to serve real-time predictions on new, unseen data via a FastAPI web endpoint. This component is critical for operationalizing the model and turning its insights into actionable results.\n\nThe inference process is engineered for reliability and consistency:\n\n- **Structured Data Input**: A dedicated ChurnData class is used to structure incoming raw data (e.g., from a web form) into a pandas DataFrame. This class also cleverly injects placeholder ID columns, ensuring the data format perfectly matches what the pre-trained pipeline expects.\n\n- **Model Retrieval from Cloud Registry**: The ChurnPredictor class interfaces with an s3_estimator to load the complete, production-ready model pipeline directly from the AWS S3 model registry. This ensures that the application always uses the officially promoted \"champion\" model.\n\n- **Preventing Training-Serving Skew**: To guarantee prediction consistency and prevent errors, the pipeline performs two crucial checks:\n\n- **Dependency Management**: It explicitly imports the custom transformer classes (NotebookImputer, TargetEncoder, etc.). This is essential for Python's pickle to correctly reconstruct the saved pipeline object with its custom components.\n\n- **Schema Enforcement**: Before prediction, it references the project's schema.yaml to reorder the incoming DataFrame's columns to exactly match the order used during training. 
This eliminates common `ValueError` exceptions related to feature order mismatch.\n---\n\n## 🔧 Web UI + Prediction\n\n### Prediction Pipeline\n\n* Add logic to `pipeline/prediction_pipeline.py`\n* Implement web backend in `app.py`\n\n### Static and Template Setup\n\n* Add `static/` and `templates/` for the FastAPI web UI (Jinja2 templates)\n* Display prediction outputs via HTML interface\n\n![App Screenshot](https://image2url.com/images/1756648103028-2f29510a-7096-45ce-808e-f97d37bff195.png)\n\n---\n\n## 🔁 CI/CD Automation with Docker, GitHub, EC2\n\n###  Docker + GitHub Actions\n\n* Write `Dockerfile` and `.dockerignore`\n* Create `.github/workflows/aws.yaml`\n\n### GitHub Secrets\n\nAdd the following in GitHub → Settings → Secrets:\n\n* `AWS_ACCESS_KEY_ID`\n* `AWS_SECRET_ACCESS_KEY`\n* `AWS_DEFAULT_REGION`\n* `ECR_REPO`\n\n---\n\n## ⚙️ AWS EC2 \u0026 Docker Deployment\n\n###  EC2 Setup\n\n* Launch EC2 (t2.medium, Ubuntu 24.04)\n* Allow port `5080` in Inbound rules\n* SSH into instance\n\n###  Install Docker\n\n```bash\ncurl -fsSL https://get.docker.com -o get-docker.sh\nsudo sh get-docker.sh\nsudo usermod -aG docker ubuntu\nnewgrp docker\n```\n\n###   GitHub Self-Hosted Runner\n\n* GitHub → Settings → Actions → Runners → New Self-hosted Runner\n* Follow Linux instructions on EC2\n\n```bash\n./run.sh  # start the runner and keep it listening for jobs\n```\n\n---\n\n## 🚀 Final Deployment\n\n###  Trigger CI/CD\n\n* Commit changes → GitHub Action triggers → Docker builds \u0026 pushes image → EC2 deploys container\n\n###  Access App\n\n* Open browser:\n\n```\nhttp://\u003cEC2_PUBLIC_IP\u003e:5080\n```\n\n---\n\n## 🧪 Additional Features\n\n### `/training` Route\n\nTrigger model training from the browser.\n\n### GitHub Actions\n\nFull CI/CD integrated. 
Automates:\n\n* Docker Build\n* Push to ECR\n* Pull to EC2\n* Restart container\n\n---\n\n\n\n## 🚀 **End-to-End Project Workflow**\n\n```\n                     ## 🚀 End-to-End Project Workflow\n\n                      ┌───────────────────────────┐\n                      │    🔄 Data Source         │\n                      │    MongoDB (Atlas)        │\n                      └───────────┬───────────────┘\n                                  │\n                                  ▼\n                      ┌───────────────────────────┐\n                      │    📥 Data Ingestion      │\n                      │    Pull from MongoDB      │\n                      └───────────┬───────────────┘\n                                  │\n                                  ▼\n                      ┌───────────────────────────┐\n                      │   ✅ Data Validation      │\n                      │   Schema \u0026 Integrity Check│\n                      └───────────┬───────────────┘\n                                  │\n                                  ▼\n                      ┌───────────────────────────┐\n                      │   🔍 EDA \u0026 Insights       │\n                      │   Uni/Bi/Multivariate      │\n                      └───────────┬───────────────┘\n                                  │\n                                  ▼\n                      ┌───────────────────────────┐\n                      │  🔃 Data Transformation   │\n                      │ Impute, Encode, Feature Eng│\n                      └───────────┬───────────────┘\n                                  │\n                                  ▼\n                      ┌───────────────────────────┐\n                      │   🧠 Model Training       │\n                      │ XGBoost/NN, Hyper-tuning  │\n                      └───────────┬───────────────┘\n                                  │\n                                  ▼\n                      ┌───────────────────────────┐\n                      │   📊 
Model Evaluation     │\n                      │ Champion vs Challenger, XAI│\n                      └───────────┬───────────────┘\n                                  │\n                                  ▼\n                      ┌───────────────────────────┐\n                      │    ☁️ Model Pusher       │\n                      │   Push to AWS S3 Registry │\n                      └───────────┬───────────────┘\n                                  │\n                                  │\n         ┌────────────────────────┴─────────────────────────┐\n         │                                                  │\n         ▼                                                  ▼\n┌───────────────────────────┐             ┌──────────────────────────────────┐\n│  🧪 Prediction API        │            │  ⚙️ CI/CD Automation (GitHub Actions) │\n│  FastAPI + Web UI         │             │                                    │\n└───────────┬───────────────┘             │  CI: Docker Build -\u003e Push to ECR   │\n                                          │  CD: Pull from ECR -\u003e Deploy on EC2│\n            ▼                             └──────────────────────────────────  ┘\n┌───────────────────────────┐\n│   🌐 Live on AWS EC2      │\n│   (Port 5080)             │\n└───────────────────────────┘\n      \n```\n\n---\n\n## 🧠 **High-Level Stages**\n\n| Phase                   | Tooling / Libraries Used                                           |\n| ----------------------- | ------------------------------------------------------------------ |\n| **Data Storage** | MongoDB Atlas (Raw Data), AWS S3 (Model Registry)                  |\n| **ETL (Ingestion)** | Pandas, Pymongo, Custom Python Scripts                             |\n| **EDA \u0026 Visualization** | Matplotlib, Seaborn, Plotly, Missingno                             |\n| **Data Validation** | YAML Schema, Custom Python Validators                              |\n| **Transformation** | Scikit-learn Pipelines, Imbalanced-learn 
(SMOTEENN)                |\n| **Model Training** | XGBoost, TensorFlow, Keras, Scikit-learn, AutoML (Experimentation) |\n| **Model Evaluation** | Scikit-learn Metrics (F-beta, Kappa, Precision, Recall), XAI       |\n| **Web API** | FastAPI, Uvicorn                                                   |\n| **Web Interface** | Jinja2, HTML, CSS                                                  |\n| **CI/CD Automation** | Docker, GitHub Actions (Self-Hosted Runner), AWS ECR               |\n| **Cloud Deployment** | AWS EC2 (Application Hosting), AWS S3 (Model Serving)              |\n\n\n## 🛠️ **Behind the Scenes – Infra \u0026 Automation**\n\n* **Dockerized App**: Ensures cross-platform consistency\n* **GitHub Actions**: Automates testing, containerization, and push to AWS\n* **AWS EC2**: Host for the live FastAPI app\n* **AWS ECR**: Private container registry\n* **MongoDB Atlas**: Cloud-hosted database for customer churn data\n* **Custom Exception \u0026 Logging Framework**: Centralized logs for debugging\n\n---\n## 🎯 Project Workflow Summary\n\n```mermaid\ngraph TD;\n    A[Data Ingestion] --\u003e B[Data Validation];\n    B --\u003e C[Data Transformation];\n    C --\u003e D[Model Training];\n    D --\u003e E[Model Evaluation];\n    E --\u003e F[Model Pusher to S3];\n    F --\u003e G[Web App \u0026 Prediction];\n    G --\u003e H[CI/CD via GitHub Actions + Docker + AWS]\n```\n\n---\n\n## Video Demo\n👉 Watch the demo on YouTube: https://www.youtube.com/watch?v=OVSx0p0GWb4\n\n\n## 🏁 License\n\n[MIT License](LICENSE)\n\n\n\nVISIT - [click](https://docs.google.com/document/d/1iUCK06895yOGELGyTAdYr9SRYbzywLPY/edit?usp=sharing\u0026ouid=101578109680909709365\u0026rtpof=true\u0026sd=true)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjhaayush2004%2Fchurncast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjhaayush2004%2Fchurncast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjhaayush2004%2Fchurncast/lists"}