https://github.com/arnabushna24/titanic-disaster-analysis
Titanic - Machine Learning from Disaster
data-analysis data-visualization python statistical-analysis
- Host: GitHub
- URL: https://github.com/arnabushna24/titanic-disaster-analysis
- Owner: ArnabUshna24
- Created: 2025-05-09T19:30:02.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-11T05:39:54.000Z (about 1 year ago)
- Last Synced: 2025-05-19T20:19:36.476Z (about 1 year ago)
- Topics: data-analysis, data-visualization, python, statistical-analysis
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/c/titanic/data
- Size: 284 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Titanic Disaster Analysis
## Overview
This project analyzes the [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data) dataset and derives insights from it through statistical analysis. It covers the following steps:
* Retrieving the data from the target location.
* Handling missing values and outliers (if any).
* Performing data visualization.
* Performing basic statistical analyses.
## Data Retrieval
The `Titanic - Machine Learning from Disaster` dataset is available on Kaggle. It contains three (3) `.csv` files: `gender_submission.csv`, `test.csv`, and `train.csv`. Of these, only `train.csv` was used for this project. It contains twelve (12) columns: `PassengerId`, `Survived`, `Pclass`, `Name`, `Sex`, `Age`, `SibSp`, `Parch`, `Ticket`, `Fare`, `Cabin`, and `Embarked`. The file was loaded into a `pandas` dataframe for further analysis.
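
As a minimal sketch of this loading step, the snippet below reads the file with `pandas.read_csv` and checks that the twelve columns listed above are present. The helper name `load_train`, the constant `EXPECTED_COLUMNS`, and the default path are illustrative assumptions, not names taken from the notebook.

```python
import pandas as pd

# Column names as listed above for train.csv.
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]


def load_train(path="train.csv"):
    """Read train.csv (hypothetical default path) and verify its columns."""
    df = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df
```

Calling `load_train()` from the directory that holds the downloaded file should return the dataframe, and `df.shape` then gives a quick check of the row and column counts.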
## Data Cleaning and Manipulation
Missing values were located with the `isnull` function. There were 177 missing `Age` values, 687 missing `Cabin` values, and 2 missing `Embarked` values. Missing `Age` values were imputed with the column median and missing `Embarked` values with the column mode, whereas the `Cabin` column was handled using the `notnull` function. Outliers were then identified and capped. No duplicated records were found.
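
The cleaning steps can be sketched roughly as follows. Two details are assumptions, since the text does not spell them out: `Cabin` is reduced to a 0/1 flag (the hypothetical `HasCabin` column) using `notnull`, and outliers are capped with the common 1.5 × IQR rule. The helper names and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def clean(df):
    """Return a cleaned copy of the passenger table."""
    out = df.copy()
    # Median imputation for Age, mode imputation for Embarked.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Assumption: Cabin becomes a 0/1 "cabin recorded" flag via notnull.
    out["HasCabin"] = out["Cabin"].notnull().astype(int)
    return out


def cap_outliers(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (assumed capping rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Tiny made-up sample, not rows from the real dataset.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 8.05, 512.33],
})

print(sample.isnull().sum())       # missing values per column
cleaned = clean(sample)
cleaned["Fare"] = cap_outliers(cleaned["Fare"])
print(cleaned.duplicated().sum())  # number of duplicated rows
```

Capping (rather than dropping) keeps every passenger in the table, which matters when later steps compare group sizes.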
## Data Visualizations
Fig. 1: Distribution of Passengers by Gender
Fig. 2: Age Distribution Histogram
Fig. 3: Survival Rate by Gender
Fig. 4: Survival Rate by Class
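
For reference, a chart such as the survival-rate figures could be produced along the following lines. This is only a sketch: the plotting library (`matplotlib` here), the helper names, and the sample rows are assumptions, and the styling of the original figures is not known.

```python
import matplotlib.pyplot as plt
import pandas as pd


def survival_rate(df, by):
    """Mean of the 0/1 Survived column within each group of `by`."""
    return df.groupby(by)["Survived"].mean()


def plot_survival_rate(df, by, ax=None):
    """Draw a bar chart of survival rate per group and return the Axes."""
    if ax is None:
        _, ax = plt.subplots()
    survival_rate(df, by).plot(kind="bar", ax=ax)
    ax.set_xlabel(by)
    ax.set_ylabel("Survival rate")
    ax.set_ylim(0, 1)
    return ax


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1],
    "Survived": [0, 1, 1, 0],
})

ax = plot_survival_rate(sample, "Sex")
ax.set_title("Survival Rate by Gender")
```

Passing `"Pclass"` instead of `"Sex"` gives the per-class version; calling `plt.show()` or `ax.figure.savefig(...)` afterwards displays or saves the chart.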
## Statistical Analysis
Table 1: Mean, Median and Mode of 'Fare' and 'Age' Columns

| Column | Mean | Median | Mode |
| --- | --- | --- | --- |
| Fare | 32.2042 | 14.4542 | 8.05 |
| Age | 29.3616 | 28.0 | 28.0 |

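
The statistics in Table 1 can be computed with the standard `pandas` methods `mean`, `median`, and `mode`. The sketch below shows one way to collect them; the helper name `central_tendency` and the sample values are illustrative, and taking the first mode when several values tie is an assumption.

```python
import pandas as pd


def central_tendency(df, columns):
    """Mean, median and first mode for each requested column."""
    rows = {}
    for col in columns:
        s = df[col]
        rows[col] = {
            "Mean": s.mean(),
            "Median": s.median(),
            "Mode": s.mode().iloc[0],  # first mode if several values tie
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up values for illustration only.
sample = pd.DataFrame({
    "Fare": [8.05, 8.05, 14.45, 71.28],
    "Age": [28.0, 28.0, 22.0, 40.0],
})

summary = central_tendency(sample, ["Fare", "Age"])
print(summary.round(4))
```
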
Table 2: Gender-wise Survival Rate

| Test Component | Result |
| --- | --- |
| Null hypothesis | No significant difference in survival rates between males and females |
| Significance level (α) | 0.05 |
| T-statistic | -18.672 |
| P-value | 2.28 × 10⁻⁶¹ |
| Decision | Reject the null hypothesis |
| Interpretation | There is a significant difference in survival rates between males and females on the Titanic |

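
A two-sample t-test of this kind can be sketched with `scipy.stats.ttest_ind`, comparing the 0/1 `Survived` values of the two groups. The library choice, the helper name `survival_ttest`, and the use of Welch's unequal-variance variant (`equal_var=False`) are assumptions here; the sample rows are made up.

```python
import pandas as pd
from scipy import stats


def survival_ttest(df, alpha=0.05):
    """Two-sample t-test on Survived for males versus females."""
    male = df.loc[df["Sex"] == "male", "Survived"]
    female = df.loc[df["Sex"] == "female", "Survived"]
    # Assumption: equal_var=False selects Welch's t-test.
    t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)
    return {
        "t": float(t_stat),
        "p": float(p_value),
        "reject_null": bool(p_value < alpha),
    }


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Sex": ["male"] * 10 + ["female"] * 10,
    "Survived": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1] + [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
})

result = survival_ttest(sample)
print(result)
```

With males passed first, a negative t-statistic means the male survival rate is lower than the female one, which matches the sign reported in Table 2.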
## Build from Source
Instructions are provided in the `.ipynb` file.
If you have any queries, contact me: arnabnushna24@gmail.com