# Exploring the World's Most Renowned Shipwreck 🚢

In 1912, the Titanic set off on its first voyage across the Atlantic Ocean, carrying passengers ranging from the wealthy elite to emigrants seeking a new life. Tragically, the ship collided with an iceberg and sank, resulting in the loss of over 1,500 lives. This disaster not only shook the world but also sparked discussions about maritime safety and the social dynamics of the time.

This repository explores the factors that affected passenger survival on the Titanic and aims to build a predictive model that estimates survival probabilities from the available passenger characteristics. The dataset contains detailed records of the passengers aboard, including information such as age, gender, passenger class, fare paid, and survival outcome. However, some values are missing, particularly in the age and cabin features, which makes building accurate predictive models more challenging.

In this project, two different approaches are explored and compared based on model performance:

- **1. Removing Missing Data**: This method involves deleting rows with missing values to clean the dataset. While it ensures that the remaining data is complete, it reduces the number of observations available for analysis.

- **2. Filling Missing Data**: This approach fills in missing values in an effort to retain more data and potentially improve the model's performance (a minimal sketch of both approaches follows this list).
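
A minimal sketch of the two approaches using pandas is shown below; the median/mode/"Unknown" fills are illustrative assumptions, not necessarily the imputation choices made in the notebook.

```python
import pandas as pd

# Load the training data (see the Dataset Description below).
df = pd.read_csv("train.csv")

# Approach 1: drop rows with missing values (fewer observations remain).
df_dropped = df.dropna(subset=["Age", "Embarked"])

# Approach 2: fill missing values so that every row is retained.
df_filled = df.copy()
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].median())                  # median age
df_filled["Embarked"] = df_filled["Embarked"].fillna(df_filled["Embarked"].mode()[0])  # most common port
df_filled["Cabin"] = df_filled["Cabin"].fillna("Unknown")                               # placeholder category

print(len(df), len(df_dropped), len(df_filled))
```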

Overall, more robust models (Random Forest, XGBoost) were achieved using the second approach, which involved filling in missing values. A version of the developed model was also submitted to Kaggle’s [Titanic-Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic) competition, where it ranked in the top 9.38% (1316 out of 14036).
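
For context, the sketch below shows how such a stacking ensemble can be assembled with scikit-learn, combining Random Forest and XGBoost base learners with a logistic-regression meta-learner. The estimators and hyperparameters are illustrative assumptions, not the tuned configuration behind the Kaggle submission.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier  # requires the xgboost package

# Base learners pass their out-of-fold predictions to the meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
# Usage: stack.fit(X_train, y_train) followed by stack.predict(X_test)
```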


*[Image: Kaggle leaderboard ranking]*

Given that the true survival status of Titanic passengers is publicly available, some higher-ranked entries likely used manually crafted labels to achieve near-perfect accuracies. Therefore, the actual position of the provided model could be higher if all competitors strictly followed the competition rules. You can also find the Kaggle notebook [here](https://www.kaggle.com/code/dalageo/exploring-the-world-s-most-renowned-shipwreck).

*It's important to mention that the score shown in the above image (0.78947) was achieved through a **slightly** modified ensemble model and different parameter tuning compared to the provided notebook (0.78468). These exact details are not shared here to encourage independent experimentation and to prevent you from overfitting.* 😜

## Dataset Description

The Titanic dataset used in this project is divided into two main files: `train.csv` and `test.csv`. Below is a brief description of each file:

- **`train.csv`**: This is the primary training dataset containing labeled data used to train the model. It includes 891 records and 12 columns, with the `Survived` column indicating whether a passenger survived (1) or not (0). This dataset is used to build and validate the machine learning model.

- **`test.csv`**: This is the test dataset, containing 418 records and 11 columns. It does **not** include the `Survived` column; the goal is to predict `Survived` using a model trained on the provided training data (a quick check of both files is sketched below).

*On the competition's data page, you will also find the `gender_submission.csv` file, which is an example submission file (not the true labels) provided by [Kaggle](https://www.kaggle.com/). It shows the expected format of the predictions, containing only the `PassengerId` and `Survived` columns.*
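
The snippet below is a quick sanity check of the two files described above, assuming both CSVs sit in the working directory.

```python
import pandas as pd

train = pd.read_csv("train.csv")  # expected: 891 rows x 12 columns, includes Survived
test = pd.read_csv("test.csv")    # expected: 418 rows x 11 columns, no Survived

print(train.shape, test.shape)
print(train.isna().sum())  # Age, Cabin, and Embarked contain missing values
```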

The following table provides a detailed description of the columns found in `train.csv` and `test.csv` (an illustrative preprocessing sketch follows the table):

| Column Name | Data Type | Description |
|----------------|-------------|-----------------------------------------------------------------------------|
| `PassengerId` | Integer | Unique identifier for each passenger |
| `Survived` | Integer | Survival status (0 = No, 1 = Yes) |
| `Pclass` | Integer | Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd) |
| `Name` | String | Name of the passenger |
| `Sex` | String | Gender of the passenger (`male`, `female`) |
| `Age` | Float | Age of the passenger |
| `SibSp` | Integer | Number of siblings/spouses aboard the Titanic |
| `Parch` | Integer | Number of parents/children aboard the Titanic |
| `Ticket` | String | Ticket number |
| `Fare` | Float | Passenger fare |
| `Cabin` | String | Cabin number |
| `Embarked` | String | Port of embarkation (`C` = Cherbourg; `Q` = Queenstown; `S` = Southampton) |
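
As an illustration of how these columns can be turned into model inputs, the sketch below numerically encodes `Sex` and `Embarked`; the actual feature engineering in the notebook may differ.

```python
import pandas as pd

train = pd.read_csv("train.csv")

# Columns from the table above that are commonly used as features.
X = train[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
X["Sex"] = X["Sex"].map({"male": 0, "female": 1})             # binary encoding
X = pd.get_dummies(X, columns=["Embarked"], drop_first=True)  # one-hot encode the port
y = train["Survived"]

# Missing Age values still need one of the two strategies described earlier.
print(X.dtypes)
```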

## Setup Instructions

### **Google Colab Setup**

1. **Download the required dataset from**:
- **[Kaggle - Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data)**

2. **Upload the `train.csv` and `test.csv` files to your own Google Drive in your preferred folder structure.**

3. **Update the file paths in the notebook to reflect your own Google Drive paths (see the sketch after these steps).**

4. **Run the notebook cells as instructed to reproduce the results.**
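
For steps 2 and 3, mounting Google Drive and pointing pandas at your own folder might look like the sketch below; the `Titanic` folder name is an assumption, so adjust the paths to match your Drive layout.

```python
from google.colab import drive
import pandas as pd

# Mount Google Drive inside the Colab runtime.
drive.mount("/content/drive")

# Adjust "Titanic" to wherever you uploaded the competition files.
train = pd.read_csv("/content/drive/MyDrive/Titanic/train.csv")
test = pd.read_csv("/content/drive/MyDrive/Titanic/test.csv")
```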

---

### **Local Environment Setup**

1. **Download the required dataset from**:
- **[Kaggle - Titanic: Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/data)**

2. **Clone the repository**:
```sh
git clone https://github.com/Dalageo/ML-TitanicShipwreck.git
```

3. **Navigate to the cloned directory**:
```sh
cd ML-TitanicShipwreck
```

4. **Open `Exploring the World's Most Renowned Shipwreck.ipynb` in your preferred Jupyter-compatible environment (e.g., [Jupyter Notebook](https://jupyter.org/), [VS Code](https://code.visualstudio.com/), or [PyCharm](https://www.jetbrains.com/pycharm/))**

5. **Update file paths for `train.csv` and `test.csv` as needed.**

6. **Run the cells sequentially to reproduce the results.**

## Acknowledgments

The dataset used in this project is provided by [Kaggle](https://kaggle.com/competitions/titanic) as part of the [Titanic-Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic) competition. Special thanks to the [Kaggle](https://www.kaggle.com/) data science community and to Will Cukierski for making this dataset available for educational and research purposes.

## License

This work is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). It was chosen to comply with the competition rules, which require the use of an [Open Source Initiative (OSI)](https://opensource.org/) approved license that permits commercial use while promoting open collaboration.


