https://github.com/soumwadeep/titanic-data-analysis
This analysis attempts to predicate the probability for survival of the Titanic passengers.
https://github.com/soumwadeep/titanic-data-analysis
python
Last synced: 2 months ago
JSON representation
This analysis attempts to predicate the probability for survival of the Titanic passengers.
- Host: GitHub
- URL: https://github.com/soumwadeep/titanic-data-analysis
- Owner: soumwadeep
- Created: 2021-08-02T13:52:16.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2021-08-02T13:57:51.000Z (almost 4 years ago)
- Last Synced: 2025-01-18T04:42:38.309Z (4 months ago)
- Topics: python
- Language: R
- Homepage:
- Size: 50.8 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Introduction
This analysis attempts to predicate the probability for survival of the Titanic passengers. In order to do this, I will use the different features available about the passengers, use a subset of the data to train an algorithm and then run the algorithm on the rest of the data set to get a prediction.In this analysis we have asked the following questions:
1)What is the relationship the features and a passenger’s chance of survival.
2)Prediction of survival for the entire ship.# Titanic Survival Prediction
This repository contains some of my approaches to the Titanic survival prediction Problem from Kaggle. The repository includes scripts for feature selection, alternate strategies for data modelling, the original test & train data sets and the visualizations plots generated for the same. All code snippets are written in R.In this popular challenge, the aim is to predict what sorts of people were likely to survive the Titanic disaster based on attributes such as gender, class, ticket details , age category e.t.c .
## Code Example
TODO
## Motivation
One approach among the listed samples code was submitted to Kaggle.
## Installation
The datasets can be found in 'data' folder. It includes 2 csv files for train & test respectively.
train.csv (59.76 kb)
test.csv (27.96 kb)Following R packages are used.
seqinr: Biological Sequences Retrieval and Analysis
https://cran.r-project.org/web/packages/seqinr/index.htmle1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien
https://cran.r-project.org/web/packages/e1071/index.htmlparty: A Laboratory for Recursive Partytioning
https://cran.r-project.org/web/packages/party/index.htmlAmelia: Amelia II: A Program for Missing Data
https://cran.r-project.org/web/packages/Amelia/index.htmlggplot2: An Implementation of the Grammar of Graphics
https://cran.r-project.org/web/packages/ggplot2/index.htmlhttps://cran.r-project.org/web/packages/corrgram/index.html
corrgram: Plot a Correlogram## API Reference
Following scripts need to be run seprately
exploratory_data_analysis.R -- All exploratory data analysis approaches analyzing data sets to summarize their main characteristics, building plots, creating map of missing features e.t.c
logistic_regression_solution.R -- Feature clean up, split of data set into train test, cross validation & survival prediction using Logistic Regression
random_forest_solution.R -- Feature clean up, split of data set into train test, cross validation & survival prediction using Random Forest
support_vector_machine_solution.R -- Feature clean up, split of data set into train test, cross validation & survival prediction using Support Vector Machine
final_baggged_solution.R -- Feature clean up, split of data set into train test, cross validation & survival prediction using all the above methods and then bagging the results.
## Plots
Some of the data visualizations are saved as pdfs in plots folder. Code for the same can be found in exploratory_data_analysis.R. The remaining visualizations like missmaps are printed in console itself.
Age_Distribution.pdf
Age_Vs_Survival.pdf
Embarked_Vs_Survival.pdf
Fare_Distribution.pdf
PClass_Distribution_Bar.pdf
survival_vs_sex.pdf
survived_vs_died.pdf
## Conclusion
The mean of the right predictions that I got on the test set is 0.76555 with the decision tree method, 0.77990 with the logistic regression model, and 0.80382 with the random forest model.## Contributors
1)Soumwadeep Guha
2)Sagnik Saha
3)Surabhi Saha
4)Sanajit Bhunia
5)Nabin Das
6)Sajal Santra## License
None