https://github.com/sanjiv856/machine_learning_scikit-learn
Repository for machine learning in Python using Scikit-learn.
pipelines python scikit-learn sklearn titanic-kaggle titanic-survival-prediction
- Host: GitHub
- URL: https://github.com/sanjiv856/machine_learning_scikit-learn
- Owner: sanjiv856
- Created: 2024-08-20T21:26:20.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-08-24T08:14:58.000Z (3 months ago)
- Last Synced: 2024-10-17T12:41:30.491Z (about 1 month ago)
- Topics: pipelines, python, scikit-learn, sklearn, titanic-kaggle, titanic-survival-prediction
- Language: Jupyter Notebook
- Homepage:
- Size: 479 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Titanic Survival Prediction
This project implements a machine learning pipeline to predict the survival of passengers on the Titanic using various classification algorithms. The project involves data preprocessing, feature engineering, model training, hyperparameter tuning, and model evaluation.
## Project Structure
- `data/` - Directory containing the dataset (`train.csv` and `test.csv`).
- `python_scikit-learn_titanic.py` - Main script containing the code for loading data, preprocessing, training models, and generating submissions.
## Running the Code
### Feature Engineering & Preprocessing:
Feature engineering is applied to create new features such as Family_Size, Is_Alone, Title, Age_Group, etc.
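As a rough illustration of this step, the sketch below derives four of the named features with pandas. It is not the repository's code: the input columns (`Name`, `SibSp`, `Parch`, `Age`) are assumed from the standard Kaggle `train.csv`, and the `add_features` helper, the "+1" in `Family_Size`, the title regex, and the age bin edges are all hypothetical choices made for the example.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few engineered columns (illustrative definitions)."""
    out = df.copy()
    # Assumption: family size counts siblings/spouses, parents/children, and the passenger.
    out["Family_Size"] = out["SibSp"] + out["Parch"] + 1
    # A passenger is alone when nobody else from the family is aboard.
    out["Is_Alone"] = (out["Family_Size"] == 1).astype(int)
    # Assumption: the title is the text between the comma and the first period in Name.
    out["Title"] = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    # Assumption: bin edges and labels are picked only for illustration.
    out["Age_Group"] = pd.cut(
        out["Age"],
        bins=[0, 12, 18, 60, 120],
        labels=["Child", "Teen", "Adult", "Senior"],
    )
    return out


# Three rows in the layout of the Kaggle training file.
sample = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
            "Heikkinen, Miss. Laina",
        ],
        "SibSp": [1, 1, 0],
        "Parch": [0, 0, 0],
        "Age": [22.0, 38.0, 26.0],
    }
)

features = add_features(sample)
print(features[["Family_Size", "Is_Alone", "Title", "Age_Group"]])
```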
Preprocessing pipelines are defined for numerical and categorical features.
### Model Training & Hyperparameter Tuning:
Several classifiers are trained and tuned using GridSearchCV, including:
- Random Forest
- Extra Trees
- XGBoost
- Decision Tree
- Logistic Regression
- Gaussian Naive Bayes
- K-Nearest Neighbors

Best models and their parameters are saved as `.pkl` files.
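
The workflow described here (separate numerical and categorical preprocessing, a grid search over a classifier, and a pickled best model) might look roughly like the following for one of the listed models. This is a minimal sketch under stated assumptions, not the repository's implementation: the data is a small synthetic stand-in, and the column choices, imputation strategies, parameter grid, `cv=3`, and the output file name are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (not real Titanic records).
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame(
    {
        "Age": rng.uniform(1, 70, n),
        "Fare": rng.uniform(5, 100, n),
        "Sex": rng.choice(["male", "female"], n),
        "Embarked": rng.choice(["S", "C", "Q"], n),
    }
)
X.loc[::10, "Age"] = np.nan  # simulate missing ages
# Synthetic target tied to Sex purely so the example has some signal.
y = (X["Sex"] == "female").astype(int)

# Numerical columns: fill gaps with the median, then standardize.
numeric = Pipeline(
    [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]
)
# Categorical columns: fill gaps with the mode, then one-hot encode.
categorical = Pipeline(
    [
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocess = ColumnTransformer(
    [("num", numeric, ["Age", "Fare"]), ("cat", categorical, ["Sex", "Embarked"])]
)
model = Pipeline(
    [("preprocess", preprocess), ("clf", RandomForestClassifier(random_state=0))]
)

# Hypothetical grid; the "clf__" prefix routes each parameter to the classifier step.
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [3, None]}
search = GridSearchCV(model, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)

# Persist the refitted best pipeline (hypothetical file name).
model_path = Path(tempfile.gettempdir()) / "random_forest_best.pkl"
with open(model_path, "wb") as f:
    pickle.dump(search.best_estimator_, f)

# Reload to confirm the saved model is usable.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)
```

Keeping preprocessing inside the pipeline means the imputers and scaler are refitted on each cross-validation training fold, which avoids leaking statistics from the validation fold into the search.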
### Key Libraries Used
- pandas - Data manipulation and analysis.
- numpy - Numerical computations.
- matplotlib & seaborn - Data visualization.
- scikit-learn - Machine learning library for model building and evaluation.
- xgboost - Implementation of gradient boosting algorithm.
### Feature Engineering
The following features are engineered:
- Family_Size - Number of family members onboard.
- Is_Alone - Binary feature indicating if the passenger was alone.
- Title - Extracted from passenger names.
- Age_Group - Binned age groups.
- Ticket_Number, Ticket_Location - Extracted from ticket information.
- Cabin_Alphabet, Cabin_Recorded - Extracted from cabin information.
### Hyperparameter Tuning
Hyperparameters are tuned using GridSearchCV with cross-validation to find the best model configuration.
### Feature Importance
Feature importance is plotted for the top 20 features for each model.
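
One way such a plot could be produced for a tree-based model is sketched below. It is an assumption-laden example rather than the project's code: the data comes from `make_classification`, the feature names are placeholders, and only estimators that expose `feature_importances_` (such as Random Forest, Extra Trees, Decision Tree, and XGBoost) are covered. Models without that attribute, for example K-Nearest Neighbors or Gaussian Naive Bayes, would need a different measure such as `sklearn.inspection.permutation_importance`.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (not the Titanic features).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its feature name and keep the 20 largest.
importances = pd.Series(clf.feature_importances_, index=feature_names)
top20 = importances.sort_values(ascending=False).head(20)

# Horizontal bars, sorted ascending so the most important feature ends up on top.
fig, ax = plt.subplots(figsize=(8, 6))
top20.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Importance")
ax.set_title("Top 20 features (RandomForestClassifier)")
fig.tight_layout()
# fig.savefig("feature_importance.png")  # hypothetical output path
```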