{"id":23726496,"url":"https://github.com/dalageo/ml-titanicshipwreck","last_synced_at":"2025-09-04T03:32:14.503Z","repository":{"id":258369080,"uuid":"857031081","full_name":"Dalageo/ML-TitanicShipwreck","owner":"Dalageo","description":"Exploring the World's Most Renowned Shipwreck 🚢","archived":false,"fork":false,"pushed_at":"2024-12-10T15:41:20.000Z","size":1014,"stargazers_count":12,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T00:41:25.568Z","etag":null,"topics":["data-science","decision-tree-classifier","logistic-regression","machine-learning","python","random-forest-classifier","scikit-learn","stacking-ensemble","titanic-dataset","xgboost-classifier"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Dalageo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-13T17:01:10.000Z","updated_at":"2025-01-05T18:59:28.000Z","dependencies_parsed_at":"2024-11-21T14:24:15.290Z","dependency_job_id":"cc0780bc-2c9d-42a5-9841-62d03121c51e","html_url":"https://github.com/Dalageo/ML-TitanicShipwreck","commit_stats":null,"previous_names":["dalageo/ml-titanicshipwreck"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Dalageo/ML-TitanicShipwreck","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dalageo%2FML-TitanicShipwreck","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dalageo%2FML-TitanicShipwreck/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dalageo%2FML-TitanicShipwreck/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dalageo%2FML-TitanicShipwreck/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Dalageo","download_url":"https://codeload.github.com/Dalageo/ML-TitanicShipwreck/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dalageo%2FML-TitanicShipwreck/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273547311,"owners_count":25125030,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-04T02:00:08.968Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","decision-tree-classifier","logistic-regression","machine-learning","python","random-forest-classifier","scikit-learn","stacking-ensemble","titanic-dataset","xgboost-classifier"],"created_at":"2024-12-31T00:31:40.046Z","updated_at":"2025-09-04T03:32:14.138Z","avatar_url":"https://github.com/Dalageo.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/user-attachments/assets/3f727ef5-2d2c-45ec-ab62-3e4a049e2168\" alt=\"Titanic Gif\" width=\"700\"/\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  
\u003ca href=\"https://colab.research.google.com/drive/1itTfyj5bdfKmYyCkwf01IpkzQuB4Nxm7?usp=sharing\" target=\"_blank\"\u003e\n  \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\"\u003e\u003c/a\u003e\n   \u003ca href=\"https://www.kaggle.com/competitions/titanic/leaderboard\" target=\"_blank\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Kaggle-Top%2010%25-4CAF50\" alt=\"Kaggle Top 10%\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/Dalageo/ML-TitanicShipwreck/blob/main/LICENSE\" target=\"_blank\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/License-Apache%202.0-D22128\" alt=\"License: Apache License 2.0\"\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/github/stars/Dalageo/ML-TitanicShipwreck?style=social\" alt=\"GitHub stars\"\u003e\n\u003c/div\u003e\n\n# Exploring the World's Most Renowned Shipwreck 🚢\n\nIn 1912, the Titanic set off on its first voyage across the Atlantic Ocean, carrying passengers ranging from the wealthy elite to emigrants seeking a new life. Tragically, the ship collided with an iceberg and sank, resulting in the loss of over 1,500 lives. This disaster not only shook the world but also sparked discussions about maritime safety and the social dynamics of the time.\n\nThis repository explores the factors affecting passenger survival on the Titanic and aims to build a predictive model to estimate survival probabilities based on available passenger characteristics. The available dataset contains detailed records of the passengers aboard, including information such as age, gender, passenger class, fare paid, and survival outcome. However, some key data points are missing, particularly in features like age and cabin, which poses challenges for building accurate predictive models. \n\nIn this project, two different approaches are explored and compared based on model performance:\n\n- **1. 
Removing Missing Data**: This method involves deleting rows with missing values to clean the dataset. While it ensures that the remaining data is complete, it reduces the number of observations available for analysis.\n\n- **2. Filling Missing Data**: This approach fills in missing values in an effort to retain more data and potentially enhance the model's performance.\n\nOverall, more robust models (Random Forest, XGBoost) were achieved using the second approach, which involved filling in missing values. A version of the developed model was also submitted to Kaggle’s [Titanic-Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic) competition, where it ranked in the top 9.38% (1316 out of 14036). \n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://github.com/user-attachments/assets/0b991de5-3238-4b7f-96bd-99216d136574\" alt=\"Kaggle\" width = \"800\" height = \"150\"/\u003e\n\u003c/div\u003e\n\nGiven that the true survival status of Titanic passengers is publicly available, some higher-ranked entries likely used manually crafted labels to achieve near-perfect accuracies. Therefore, the actual position of the provided model could be higher if all competitors strictly followed the competition rules. You can also find the Kaggle notebook [here](https://www.kaggle.com/code/dalageo/exploring-the-world-s-most-renowned-shipwreck).\n\n*It's important to mention that the score shown in the above image (0.78947) was achieved through a **slightly** modified ensemble model and different parameter tuning compared to the provided notebook (0.78468). These exact details are not shared here to encourage independent experimentation and to prevent you from overfitting.* 😜\n\n\n## Dataset Description\n\nThe Titanic dataset used in this project is divided into two main files: `train.csv` and `test.csv`. Below is a brief description of each file:\n\n- **`train.csv`**: This is the primary training dataset containing labeled data used to train the model. 
It includes 891 records and 12 columns, with the `Survived` column indicating whether a passenger survived (1) or not (0). This dataset is used to build and validate the machine learning model.\n  \n- **`test.csv`**: This is the test dataset that contains 418 records and 11 columns. It does **not** have the `Survived` column. The goal is to predict `Survived` using a model trained on the provided training data.\n\n*Among the competition's data, you will also find the `gender_submission.csv` file, which is an example submission file (not the true labels) provided by [Kaggle](https://www.kaggle.com/). This file shows the expected format of the predictions, containing only the `PassengerId` and `Survived` columns.*\n\nThe following table provides a detailed description of the columns found in `train.csv` and `test.csv`:\n\n| Column Name    | Data Type   | Description                                                                 |\n|----------------|-------------|-----------------------------------------------------------------------------|\n| `PassengerId`  | Integer     | Unique identifier for each passenger                                        |\n| `Survived`     | Integer     | Survival status (0 = No, 1 = Yes)                                           |\n| `Pclass`       | Integer     | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)                                 |\n| `Name`         | String      | Name of the passenger                                                       |\n| `Sex`          | String      | Gender of the passenger (`male`, `female`)                                  
|\n| `Age`          | Float       | Age of the passenger                                                        |\n| `SibSp`        | Integer     | Number of siblings/spouses aboard the Titanic                               |\n| `Parch`        | Integer     | Number of parents/children aboard the Titanic                               |\n| `Ticket`       | String      | Ticket number                                                               |\n| `Fare`         | Float       | Passenger fare                                                              |\n| `Cabin`        | String      | Cabin number                                                                |  \n| `Embarked`     | String      | Port of embarkation (`C` = Cherbourg; `Q` = Queenstown; `S` = Southampton)  |\n\n\n## Setup Instructions\n\n### \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/d/d0/Google_Colaboratory_SVG_Logo.svg\" alt=\"Google Colab Logo\" width=\"15\" height = \"18\"/\u003e **Google Colab Setup**\n\n1. **Download the required dataset from**:\n   - **[Kaggle - Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data)**\n\n2. **Upload the `train.csv` and `test.csv` files to your own Google Drive in your preferred folder structure.**\n\n3. **Update the file paths in the notebook to reflect your own Google Drive paths.**  \n\n4. **Run the notebook cells as instructed to reproduce the results.**\n\n---\n\n### \u003cimg src=\"https://github.com/user-attachments/assets/8d36d1a5-e9b1-40d1-97c9-3d4ca49e9c95\" alt=\"Local PC\" width=\"18\" height = \"16\" /\u003e **Local Environment Setup**\n\n1. **Download the required dataset from**:\n    - **[Kaggle - Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data)**\n\n2. **Clone the repository**:\n   ```sh\n   git clone https://github.com/Dalageo/ML-TitanicShipwreck.git\n   ```\n\n3. **Navigate to the cloned directory**:\n   ```sh\n   cd ML-TitanicShipwreck\n   ```\n  \n4. 
**Open the `Exploring the World's Most Renowned Shipwreck.ipynb` using your preferred Jupyter-compatible environment (e.g., [Jupyter Notebook](https://jupyter.org/), [VS Code](https://code.visualstudio.com/), or [PyCharm](https://www.jetbrains.com/pycharm/))**\n   \n5. **Update file paths for `train.csv` and `test.csv` as needed.**\n   \n6. **Run the cells sequentially to reproduce the results.**\n\n\n## Acknowledgments\n\nThe dataset used in this project is provided by [Kaggle](https://kaggle.com/competitions/titanic) as part of the [Titanic-Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic) competition. Special thanks to [Kaggle's](https://www.kaggle.com/) data science community, and Will Cukierski for making this dataset available for educational and research purposes.\n\n\n## License\n\nThis work is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). It was chosen to comply with the competition rules, which require the use of an [Open Source Initiative (OSI)](https://opensource.org/) approved license that permits commercial use while promoting open collaboration.\n\n\u003cdiv align=\"center\"\u003e\n\u003ca href=\"https://www.apache.org/licenses/LICENSE-2.0\"\u003e\n  \u003cimg src=\"https://github.com/user-attachments/assets/bcf30286-f8b7-488a-8300-ec2464090c33\" alt=\"Apache License 2.0\" width=\"220\" height=\"120\"\u003e\n\u003c/a\u003e\n\u003c/div\u003e\n\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdalageo%2Fml-titanicshipwreck","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdalageo%2Fml-titanicshipwreck","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdalageo%2Fml-titanicshipwreck/lists"}