{"id":30848322,"url":"https://github.com/saroshfarhan/kaggle-titanic-predictions","last_synced_at":"2025-09-07T03:08:34.672Z","repository":{"id":312089274,"uuid":"1046259736","full_name":"saroshfarhan/kaggle-titanic-predictions","owner":"saroshfarhan","description":"My submission to Kaggle titanic prediction competition","archived":false,"fork":false,"pushed_at":"2025-08-28T12:41:56.000Z","size":761,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-28T20:03:26.479Z","etag":null,"topics":["decision-trees","machine-learning","random-forest","tensorflow","tensorflow-decision-forests"],"latest_commit_sha":null,"homepage":"https://www.kaggle.com/competitions/titanic","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saroshfarhan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-28T12:28:18.000Z","updated_at":"2025-08-28T12:44:27.000Z","dependencies_parsed_at":"2025-08-28T20:03:31.981Z","dependency_job_id":"cbda3765-b469-4028-8ac8-a0016c35db66","html_url":"https://github.com/saroshfarhan/kaggle-titanic-predictions","commit_stats":null,"previous_names":["saroshfarhan/kaggle-titanic-predictions"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/saroshfarhan/kaggle-titanic-predictions","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saroshfarhan%2Fkaggle-titanic-predictions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saroshfarhan%2Fkaggle-titanic-predictions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saroshfarhan%2Fkaggle-titanic-predictions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saroshfarhan%2Fkaggle-titanic-predictions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saroshfarhan","download_url":"https://codeload.github.com/saroshfarhan/kaggle-titanic-predictions/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saroshfarhan%2Fkaggle-titanic-predictions/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273990215,"owners_count":25203293,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["decision-trees","machine-learning","random-forest","tensorflow","tensorflow-decision-forests"],"created_at":"2025-09-07T03:08:31.631Z","updated_at":"2025-09-07T03:08:34.664Z","avatar_url":"https://github.com/saroshfarhan.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Titanic Survival Prediction - Kaggle Competition\n\nThis repository contains a comprehensive machine learning solution for the famous Titanic survival prediction competition on Kaggle. The project explores various machine learning algorithms and ensemble methods to predict passenger survival on the Titanic.\n\n## 📊 Dataset Overview\n\nThe Titanic dataset contains information about passengers aboard the Titanic, including:\n- **PassengerId**: Unique identifier for each passenger\n- **Survived**: Target variable (0 = No, 1 = Yes)\n- **Pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)\n- **Name**: Passenger name\n- **Sex**: Gender\n- **Age**: Age in years\n- **SibSp**: Number of siblings/spouses aboard\n- **Parch**: Number of parents/children aboard\n- **Ticket**: Ticket number\n- **Fare**: Passenger fare\n- **Cabin**: Cabin number\n- **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)\n\n## 🚀 Features\n\n### Data Preprocessing\n- **Missing Value Handling**: Filled missing Age with mean, Cabin with 'Unknown', and Embarked with 'S'\n- **Feature Engineering**: Created new features like family size and age groups\n- **Categorical Encoding**: Applied one-hot encoding for categorical variables\n\n### Machine Learning Models Implemented\n\n1. **Random Forest Classifier**\n   - Baseline model with 100 estimators\n   - Feature importance analysis\n   - Cross-validation evaluation\n\n2. **XGBoost Classifier**\n   - Hyperparameter tuning with GridSearchCV\n   - Improved performance over Random Forest\n   - Feature importance visualization\n\n3. **Ensemble Methods**\n   - **Voting Classifier**: Hard and soft voting approaches\n   - **Stacking Classifier**: Meta-learning with Logistic Regression as final estimator\n   - **Weighted Ensemble**: Performance-based weighting of individual models\n\n4. **TensorFlow Decision Forests**\n   - Random Forest implementation\n   - Gradient Boosted Trees\n   - CART (Classification and Regression Trees)\n\n## 📈 Performance Results\n\nThe best performing model was **TensorFlow Decision Forests** with cross-validation accuracy of **0.891304** (89.13%) on the validation set.\n\n### Model Comparison\n- Random Forest: ~0.82-0.85 accuracy\n- XGBoost: ~0.84-0.87 accuracy  \n- Stacking Classifier: 0.84916 accuracy\n- **TensorFlow Decision Forests: 0.891304 accuracy** (Best) 🏆\n\n## 🛠️ Technical Stack\n\n- **Python 3.9**\n- **Pandas**: Data manipulation and analysis\n- **NumPy**: Numerical computing\n- **Scikit-learn**: Machine learning algorithms\n- **XGBoost**: Gradient boosting framework\n- **TensorFlow Decision Forests**: Advanced tree-based models\n- **Matplotlib/Seaborn**: Data visualization\n\n## 📁 Project Structure\n\n```\nkaggle-titanic-predictions/\n├── titanic.ipynb              # Main Jupyter notebook with complete analysis\n├── train.csv                  # Training dataset\n├── test.csv                   # Test dataset\n├── venv/                      # Python virtual environment\n└── README.md                  # This file\n```\n\n## 🚀 Getting Started\n\n### Prerequisites\n- Python 3.9+\n- Jupyter Notebook\n- Required packages (see installation below)\n\n### Installation\n\n1. **Clone the repository**\n   ```bash\n   git clone \u003crepository-url\u003e\n   cd kaggle-titanic-predictions\n   ```\n\n2. **Create and activate virtual environment**\n   ```bash\n   python -m venv venv\n   source venv/bin/activate  # On Windows: venv\\Scripts\\activate\n   ```\n\n3. **Install required packages**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n4. **Run the notebook**\n   ```bash\n   jupyter notebook titanic.ipynb\n   ```\n\n## 📊 Key Insights\n\n### Feature Importance\nThe most important features for survival prediction were:\n1. **Sex**: Gender was the strongest predictor\n2. **Fare**: Higher fare correlated with survival\n3. **Age**: Younger passengers had higher survival rates\n4. **Pclass**: First class passengers had better survival rates\n\n### Data Patterns\n- **Gender Gap**: Women had significantly higher survival rates than men\n- **Class Effect**: First class passengers had better survival rates\n- **Age Effect**: Children and elderly had different survival patterns\n- **Family Size**: Passengers with families had varying survival rates\n\n## 🎯 Model Selection Strategy\n\n1. **Baseline**: Started with Random Forest for interpretability\n2. **Improvement**: Applied XGBoost with hyperparameter tuning\n3. **Ensemble**: Combined multiple models using stacking\n4. **Advanced**: Explored TensorFlow Decision Forests\n5. **Final**: Selected TensorFlow Decision Forests as the best performer\n\n## 📝 Submission Files\n\nMultiple submission files were generated for comparison:\n- `submission.csv`: Basic Random Forest predictions\n- `submission_xgboost.csv`: XGBoost predictions\n- `submission_stacking.csv`: Stacking Classifier predictions\n- `submission_tfDT.csv`: TensorFlow Decision Forests predictions (Best) 🏆\n\n## 🤝 Contributing\n\nFeel free to contribute to this project by:\n- Improving the models\n- Adding new feature engineering techniques\n- Optimizing hyperparameters\n- Adding new visualization techniques\n\n## 📄 License\n\nThis project is open source and available under the [MIT License](LICENSE).\n\n## 🙏 Acknowledgments\n\n- Kaggle for hosting the competition\n- The Titanic dataset contributors\n- The open-source machine learning community\n\n---\n\n**Note**: This project is for educational purposes and demonstrates various machine learning techniques applied to a classic classification problem.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaroshfarhan%2Fkaggle-titanic-predictions","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaroshfarhan%2Fkaggle-titanic-predictions","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaroshfarhan%2Fkaggle-titanic-predictions/lists"}