My code for the Titanic - Machine Learning from Disaster competition on Kaggle
- Host: GitHub
- URL: https://github.com/johannaschmidle/titanicsurvival
- Owner: johannaschmidle
- Created: 2024-08-08T19:52:15.000Z
- Default Branch: main
- Last Pushed: 2024-08-13T15:31:29.000Z
- Last Synced: 2025-01-11T01:59:37.588Z
- Language: Jupyter Notebook
- Size: 756 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Titanic Survival Predictor
This repository contains my code for my [competition submission](https://www.kaggle.com/code/johannaschmidle7/titanic-survival-predictor) to the [Titanic - Machine Learning from Disaster competition](https://www.kaggle.com/competitions/titanic/overview) on Kaggle.
## Motivation
**Goal:** build a machine learning model to predict whether a passenger survived the sinking of the Titanic.

For each instance in the test set, you must predict a 0 or 1 value for the target variable (`Survived`).

## Metric
Submissions are evaluated on **accuracy**: the percentage of passengers whose survival you predict correctly.
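As a quick illustration of the metric (not from the original notebook; `y_true` and `y_pred` are made-up placeholder arrays), accuracy can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Placeholder labels for illustration only -- not competition data.
y_true = [0, 1, 1, 0, 1]  # actual survival outcomes
y_pred = [0, 1, 0, 0, 1]  # model predictions
print(accuracy_score(y_true, y_pred))  # 4 of 5 correct -> 0.8
```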
## Steps
1. **EDA**
2. **Feature Engineering** (a sketch follows this list)
   - **Family Size:** Larger families might have different survival rates than solo travelers.
   - **Person's Title:** (e.g., Ms, Mr) Titles can provide insight into age, gender, and social status, all of which might affect survival chances.
   - **Cabin Deck:** The deck could correlate with proximity to lifeboats and thus with survival rates.
   - **Cabin Assigned:** Passengers without an assigned cabin might have different survival probabilities than those with recorded cabin details.
   - **Age Group:** Different age groups might have had different survival probabilities.
   - **Fare Price Groups:** Binning fares into groups can capture non-linear relationships between fare and survival.
   - **Name Length:** Especially in the early 1900s, a longer name could signal social importance, which may have affected survival chances.
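A minimal sketch of how these features could be derived with pandas, assuming the standard Kaggle Titanic columns (`Name`, `SibSp`, `Parch`, `Cabin`, `Age`, `Fare`); the bin edges and group labels are illustrative choices, not necessarily the ones used in the notebook:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Family size: siblings/spouses + parents/children + the passenger themselves
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    # Title: the token between the comma and the period in "Last, Title. First"
    df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
    # Cabin deck: first letter of the cabin code (NaN stays NaN)
    df["CabinDeck"] = df["Cabin"].str[0]
    # Whether any cabin was recorded at all
    df["CabinAssigned"] = df["Cabin"].notna().astype(int)
    # Age groups and fare-price quartiles to capture non-linear effects
    df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 12, 18, 35, 60, 100],
                            labels=["child", "teen", "young", "middle", "senior"])
    df["FareGroup"] = pd.qcut(df["Fare"], q=4,
                              labels=["low", "mid", "high", "top"])
    # Name length as a rough proxy for social prominence
    df["NameLength"] = df["Name"].str.len()
    return df
```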
3. **Preprocessing** (see the sketch after this sub-list)
   1. Dealing With Nulls
   2. Split the Data
   3. Create Pipelines + Transform Columns
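A sketch of these three steps, building on the feature-engineering function above; the `train_test_split` call, the imputation strategies, and the exact column lists are assumptions. Note that each model's Correct + Incorrect counts below sum to 179, consistent with roughly a 20% validation split of the 891-row training set:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = engineer_features(pd.read_csv("train.csv"))
X, y = train.drop(columns=["Survived"]), train["Survived"]

# 2. Split the data: an 80/20 split of 891 rows leaves 179 validation samples.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 1. + 3. Nulls are imputed inside per-type pipelines, and a column
# transformer applies each pipeline to its matching columns.
numeric_pipe = Pipeline([("impute", SimpleImputer(strategy="median"))])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = make_column_transformer(
    (numeric_pipe, ["Age", "Fare", "FamilySize", "NameLength"]),
    (categorical_pipe, ["Sex", "Embarked", "Title", "CabinDeck",
                        "AgeGroup", "FareGroup"]),
)
```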
4. **Visualize and Understand Data** (a sample figure sketch follows)
   - Histogram
   - KDE
   - Pie Chart
   - Heatmap
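One way to lay out those four plot types with seaborn and matplotlib; the specific columns plotted are illustrative guesses, not taken from the notebook:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Histogram: age distribution
sns.histplot(train["Age"].dropna(), bins=30, ax=axes[0, 0])
# KDE: fare density, split by survival
sns.kdeplot(data=train, x="Fare", hue="Survived", ax=axes[0, 1])
# Pie chart: overall survival share
train["Survived"].value_counts().plot.pie(autopct="%1.1f%%", ax=axes[1, 0])
# Heatmap: correlations between numeric columns
sns.heatmap(train.select_dtypes("number").corr(), annot=True, fmt=".2f",
            ax=axes[1, 1])
plt.tight_layout()
plt.show()
```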
5. **Define Models** (sketched below)
   - I created five models:
     - Model 1: Random Forest Classifier
     - Model 2: Logistic Regression
     - Model 3: K-Nearest Neighbours
     - Model 4: XGBoost
     - Model 5: AdaBoost (Adaptive Boosting)
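A sketch of defining the five models and tuning each one with `GridSearchCV` on top of the preprocessing pipeline from the earlier sketch; the hyperparameter grids here are illustrative placeholders, not the notebook's actual search spaces:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# Candidate models paired with illustrative hyperparameter grids.
models = {
    "random_forest": (RandomForestClassifier(random_state=42),
                      {"model__n_estimators": [100, 300]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"model__C": [0.1, 1, 10]}),
    "knn": (KNeighborsClassifier(), {"model__n_neighbors": [3, 5, 7, 9]}),
    "xgboost": (XGBClassifier(eval_metric="logloss"),
                {"model__max_depth": [3, 5]}),
    "adaboost": (AdaBoostClassifier(random_state=42),
                 {"model__n_estimators": [50, 100]}),
}

for name, (estimator, grid) in models.items():
    pipe = Pipeline([("prep", preprocessor), ("model", estimator)])
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    print(f"{name}: best CV accuracy = {search.best_score_:.3f}")
```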
6. **Create Competition Submission**
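The submission file Kaggle expects has exactly two columns, `PassengerId` and `Survived`. A minimal sketch, with `best_model` standing in for whichever fitted pipeline scored best:

```python
import pandas as pd

test = engineer_features(pd.read_csv("test.csv"))

# Refit the best pipeline on all training data, then predict on the test set.
best_model.fit(X, y)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": best_model.predict(test),
})
submission.to_csv("submission.csv", index=False)
```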
## Result of Model Evaluations
### Model 1: Random Forest Classifier
- Best Score: 0.834
- Correct: 138
- Incorrect: 41
### Model 2: Logistic Regression
- Best Score: 0.795
- Correct: 141
- Incorrect: 38
### Model 3: K-Nearest Neighbours
- Best Score: 0.829
- Correct: 136
- Incorrect: 43
### Model 4: XGBoost
- Best Score: 0.803
- Correct: 137
- Incorrect: 42
### Model 5: AdaBoost
- Best Score: 0.819
- Correct: 137
- Incorrect: 42

## Competition Scores (Best to Worst)
- Model 1: Random Forest Classifier - **0.78229**
- Model 3: K-Nearest Neighbours - **0.77990**
- Model 5: AdaBoost - **0.77751**
- Model 2: Logistic Regression - **0.76794**
- Model 4: XGBoost - **0.76555**
## Data
The dataset used in this project is publicly available on Kaggle: [https://www.kaggle.com/competitions/titanic/data](https://www.kaggle.com/competitions/titanic/data)

## Technologies
Python
- pandas, numpy, matplotlib, seaborn
- sklearn (OrdinalEncoder, OneHotEncoder, SimpleImputer, make_column_transformer, ColumnTransformer, Pipeline, LogisticRegression, DecisionTreeClassifier, KNeighborsClassifier, RandomForestClassifier, AdaBoostClassifier, cross_val_score, GridSearchCV, ConfusionMatrixDisplay)
- XGBoost
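Put together, the stack above corresponds to an import block along these lines (the module paths are the standard scikit-learn locations; this is a reconstruction, not copied from the notebook):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay
from xgboost import XGBClassifier
```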