https://github.com/mindful-ai-assistants/spaceship
Predict which passengers are transported to an alternate dimension
kaggle-competition machine-learning oneness-consciousness python3
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/mindful-ai-assistants/spaceship
- Owner: Mindful-AI-Assistants
- License: mit
- Created: 2024-07-17T02:50:43.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-11-21T15:34:29.000Z (6 months ago)
- Last Synced: 2025-04-11T22:56:21.075Z (about 2 months ago)
- Topics: kaggle-competition, machine-learning, oneness-consciousness, python3
- Language: Jupyter Notebook
- Homepage:
- Size: 2.79 MB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# **Spaceship Titanic 🚀 Transport Prediction**
### Starship's flight trajectory
https://github.com/user-attachments/assets/dde94daf-422a-45a7-b918-a34cc7c5f12f
### Starship's Launching
https://github.com/user-attachments/assets/2a010218-b6a9-468d-97dc-4c6db34271e8
### Starship Launch Engine
https://github.com/user-attachments/assets/1b82f588-5551-4b17-bd6e-575fbe51e021
## Overview
This repository contains a machine learning project for the Kaggle competition "Spaceship Titanic." The goal is to predict which passengers were transported to an alternate dimension during a collision with a spacetime anomaly.
## Project Description
In this competition, we use machine learning techniques to analyze data from the Spaceship Titanic's damaged computer system and predict whether passengers were transported.
-----
## Project Structure
1. **Introduction**
2. **Dependencies Installation**
3. **Data Loading**
4. **Initial Data Exploration**
5. **Feature Engineering and PCA**
6. **Data Preprocessing**
7. **Model Training and Evaluation (Ensemble Learning)**
8. **Hyperparameter Optimization**
9. **Feature Importance (Random Forest & Gradient Boosting)**
10. **Submission**
11. **Conclusion**

## 1. Project Structure
The project follows a complete machine learning pipeline, which includes:
- **Installation of Dependencies:** Installing and importing the necessary Python libraries.
- **Data Loading:** Loading the training and testing datasets.
- **Exploratory Data Analysis (EDA):** A first look at the data through visualization and summary statistics.
- **Feature Engineering:** Enhancing the dataset by creating new variables to improve prediction.
- **Preprocessing:** Handling missing values, scaling numeric features, and encoding categorical variables.
- **Model Building:** Training different machine learning models and evaluating their performance.
- **Hyperparameter Optimization:** Using grid search to fine-tune the best model.
- **Submission:** Predicting on the test set and creating a submission file for Kaggle.
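The preprocessing, model building, and grid search stages listed above can be chained into a single scikit-learn `Pipeline`. The following is a minimal sketch under stated assumptions: the `toy` DataFrame, its values, and the toy target rule are hypothetical stand-ins for `train.csv`, and the small grid is chosen only to keep the run short. It is not the notebook's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for train.csv (hypothetical values, not the Kaggle data)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(1, 80, n).astype(float),
    "Spa": rng.exponential(100.0, n),
    "HomePlanet": rng.choice(["Earth", "Europa", "Mars"], n),
})
toy.loc[::10, "Age"] = np.nan          # simulate missing numeric values
toy.loc[::15, "HomePlanet"] = np.nan   # simulate missing categories
toy["Transported"] = toy["Spa"] < 70   # toy target, for illustration only

# Impute + scale numeric columns, impute + one-hot encode categorical columns
preprocessor = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["Age", "Spa"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["HomePlanet"]),
])

# Naming the final step "classifier" is what makes the "classifier__" prefix valid below
model = Pipeline([("preprocessor", preprocessor),
                  ("classifier", RandomForestClassifier(random_state=42))])

X_train, X_val, y_train, y_val = train_test_split(
    toy.drop(columns="Transported"), toy["Transported"], test_size=0.2, random_state=42)

grid = GridSearchCV(model, {"classifier__n_estimators": [20, 50],
                            "classifier__max_depth": [None, 5]}, cv=3)
grid.fit(X_train, y_train)
val_accuracy = grid.score(X_val, y_val)
print(grid.best_params_, round(val_accuracy, 3))
```

Keeping the preprocessor inside the pipeline means the imputer and scaler are fitted only on the training folds during cross-validation, which avoids leaking validation statistics into training.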
## Getting Started
### Prerequisites
- Python 3.x
- Required Libraries: `numpy`, `pandas`, `matplotlib`, `seaborn`, `scikit-learn`

### Installation
Install the required libraries using pip:
```bash
pip install numpy pandas matplotlib seaborn scikit-learn
```
### Usage
1. **Clone the Repository**
```bash
git clone https://github.com/mindful-ai-assistants/spaceship.git
```
2. **Navigate to the Project Directory**
```bash
cd spaceship
```
3. **Run the Main Script**
```bash
python main.py
```
## Code Explanation
### 1. Introduction
The goal of this project is to predict if a passenger will be transported using machine learning models.
### 2. Installation of Dependencies
```python
!pip install numpy pandas matplotlib seaborn scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

%matplotlib inline
plt.style.use('dark_background') # Setting dark mode for visualizations
```
### 3. Loading the Data
```python
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.head()
```
### 4. Initial Data Exploration
```python
train_data.info()
plt.figure(figsize=(8, 6))
sns.countplot(x='Transported', data=train_data, palette='cool')
plt.title('Distribution of Transported')
plt.show() # Dark mode applied
```
Transported Distribution Graphic
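Alongside the count plot, two quick numeric checks are often useful at this stage: the share of missing values per column and the class balance of the target. A minimal sketch, assuming a tiny hypothetical `toy` DataFrame in place of `train_data`:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample standing in for train_data
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Earth", "Europa", None, "Earth"],
    "Transported": [True, False, True, True],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Relative frequency of each target class
class_balance = toy["Transported"].value_counts(normalize=True)

print(missing_share)
print(class_balance)
```

The same two lines can be applied to `train_data` directly to decide which columns need imputation and whether accuracy is a reasonable metric for the target.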

### 5. Feature Engineering and PCA
```python
# Cabin has the form "deck/num/side"; split it so 'Deck', 'Num' and 'Side' exist as columns
train_data[['Deck', 'Num', 'Side']] = train_data['Cabin'].str.split('/', expand=True)

# Feature engineering: Total Spend and Average Spend
train_data['TotalSpend'] = train_data['RoomService'] + train_data['FoodCourt'] + train_data['ShoppingMall'] + train_data['Spa'] + train_data['VRDeck']
train_data['AvgSpend'] = train_data['TotalSpend'] / 5
# Age can be 0, so replace infinite ratios with NaN before they reach PCA
train_data['CabinNumRatio'] = (pd.to_numeric(train_data['Num'], errors='coerce') / train_data['Age']).replace([np.inf, -np.inf], np.nan)

# PCA for dimensionality reduction
X = train_data[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'TotalSpend', 'AvgSpend', 'CabinNumRatio']].fillna(0)
y = train_data['Transported']

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', edgecolor='k', alpha=0.7)
plt.title('PCA of Features (2 Components) - Dark Mode')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show() # PCA plot in dark mode
```
PCA of Features (2 Components) Graphic
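PCA is sensitive to feature scale: a column with a much larger variance (such as a spending total) can dominate the first component. The sketch below illustrates this on synthetic data and shows one common remedy, standardizing before PCA. The array `X_toy` and its two columns are hypothetical, and this is an optional variation rather than the notebook's method.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical, independent columns on very different scales (age-like and spend-like)
X_toy = np.column_stack([rng.normal(30, 10, 500), rng.normal(1000, 2000, 500)])

# PCA on raw values: the high-variance column dominates the first component
raw_pca = PCA(n_components=2).fit(X_toy)

# PCA after standardization: both columns contribute on an equal footing
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_toy)

raw_ratio = raw_pca.explained_variance_ratio_
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_
print("raw:   ", raw_ratio.round(4))
print("scaled:", scaled_ratio.round(4))
```

Inspecting `explained_variance_ratio_` in this way also indicates how much information the two retained components preserve.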

### 6. Data Preprocessing
```python
# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill missing values with a placeholder, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']),
        ('cat', categorical_transformer, ['HomePlanet', 'Destination', 'Deck', 'Side'])
    ])
```
### 7. Model Training and Evaluation (Ensemble Learning)
```python
X_train_pca, X_val_pca, y_train_pca, y_val_pca = train_test_split(X_pca, y, test_size=0.2, random_state=42)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)

ensemble_model = VotingClassifier(estimators=[('rf', rf_model), ('gb', gb_model)], voting='soft')
ensemble_model.fit(X_train_pca, y_train_pca)
y_pred_ensemble = ensemble_model.predict(X_val_pca)
y_proba_ensemble = ensemble_model.predict_proba(X_val_pca)[:, 1]  # probability of the positive class

# Metrics
accuracy = accuracy_score(y_val_pca, y_pred_ensemble)
f1 = f1_score(y_val_pca, y_pred_ensemble)
roc_auc = roc_auc_score(y_val_pca, y_proba_ensemble)  # ROC AUC uses probabilities, not hard labels

print(f"Ensemble Model Accuracy: {accuracy:.4f}")
print(f"Ensemble Model F1 Score: {f1:.4f}")
print(f"Ensemble Model ROC AUC: {roc_auc:.4f}")

# Confusion matrix
cm = confusion_matrix(y_val_pca, y_pred_ensemble)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples')
plt.title('Confusion Matrix - Ensemble Model (Dark Mode)')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```
Confusion Matrix - Random Forest Graphic
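A single train/validation split can give a noisy estimate. As a complement, the soft-voting ensemble can be scored with k-fold cross-validation using `cross_val_score`, which the notebook imports. The sketch below is an illustration under assumptions: `X_toy` and `y_toy` come from `make_classification` rather than the competition data, and the estimator sizes are reduced to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (hypothetical stand-in for X_pca, y)
X_toy, y_toy = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=42)

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
                ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42))],
    voting="soft")

# Five-fold cross-validation; "roc_auc" scores the predicted probabilities
acc_scores = cross_val_score(ensemble, X_toy, y_toy, cv=5, scoring="accuracy")
auc_scores = cross_val_score(ensemble, X_toy, y_toy, cv=5, scoring="roc_auc")

print(f"Accuracy: {acc_scores.mean():.3f} +/- {acc_scores.std():.3f}")
print(f"ROC AUC:  {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```

Reporting the mean together with the standard deviation across folds gives a sense of how stable the score is.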

### 8. Hyperparameter Optimization
```python
# rf_model is a bare estimator (not a Pipeline), so parameter names take no 'classifier__' prefix
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(rf_model, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train_pca, y_train_pca)

print("Best parameters:", grid_search.best_params_)
```
### 9. Feature Importance (Random Forest & Gradient Boosting)
```python
ensemble_model.estimators_[0].fit(X_train_pca, y_train_pca)  # Random Forest
feature_importance_rf = ensemble_model.estimators_[0].feature_importances_

ensemble_model.estimators_[1].fit(X_train_pca, y_train_pca)  # Gradient Boosting
feature_importance_gb = ensemble_model.estimators_[1].feature_importances_

importance_df = pd.DataFrame({
    'Feature': ['PC1', 'PC2'],
    'RandomForest': feature_importance_rf,
    'GradientBoosting': feature_importance_gb
})

importance_df = pd.melt(importance_df, id_vars=['Feature'], var_name='Model', value_name='Importance')
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', hue='Model', data=importance_df, palette='coolwarm')
plt.title('Feature Importance by Model (Random Forest vs Gradient Boosting)')
plt.tight_layout()
plt.show()
```
Feature Importance by Model (Random Forest vs Gradient Boosting) Graphic
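Because the models are trained on principal components, the importances above refer to `PC1` and `PC2` rather than to the original columns. One way to relate components back to features is to inspect the PCA loadings in `components_`. The sketch below uses a hypothetical three-column `toy` DataFrame with synthetic values; it is an illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
spend = rng.exponential(200.0, 300)

# Hypothetical features: TotalSpend is strongly tied to Spa, Age is independent
toy = pd.DataFrame({
    "Age": rng.normal(30, 10, 300),
    "Spa": spend,
    "TotalSpend": spend * 2 + rng.normal(0, 5, 300),
})

pca = PCA(n_components=2).fit(toy)

# Rows are original features, columns are components; large absolute values
# mean the feature contributes strongly to that component
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=["PC1", "PC2"])
print(loadings.round(3))
```

Applied to the fitted `pca` from the feature engineering step, the same table shows which original columns drive each component.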

### 10. Submission
```python
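# Apply the same feature engineering to the test data as to the training data
test_data[['Deck', 'Num', 'Side']] = test_data['Cabin'].str.split('/', expand=True)
test_data['TotalSpend'] = (test_data['RoomService'] + test_data['FoodCourt'] +
                           test_data['ShoppingMall'] + test_data['Spa'] + test_data['VRDeck'])
test_data['AvgSpend'] = test_data['TotalSpend'] / 5
test_data['CabinNumRatio'] = (pd.to_numeric(test_data['Num'], errors='coerce') / test_data['Age']).replace([np.inf, -np.inf], np.nan)

X_test = test_data[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'TotalSpend', 'AvgSpend', 'CabinNumRatio']].fillna(0)
X_test_pca = pca.transform(X_test)  # reuse the PCA fitted on the training data

test_predictions = ensemble_model.predict(X_test_pca)
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Transported': test_predictions})
submission.to_csv('submission.csv', index=False)
```
### 11. Conclusion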
```markdown
This project demonstrates a complete machine learning pipeline from feature engineering and PCA to ensemble learning. We further improve the model with hyperparameter tuning and provide visualizations in dark mode for better readability. The final results show competitive accuracy and F1 scores.
```
---
### **Jupyter Notebook**
```python
# Spaceship Titanic - Transport Prediction 🚀
## 1. Introduction
This notebook aims to predict whether a passenger aboard the Spaceship Titanic will be transported to another dimension using machine learning algorithms. We will use the Kaggle Spaceship Titanic dataset, explore the data,
```

##### Copyright 2024 Mindful-AI-Assistants. Code released under the [MIT license](https://github.com/Mindful-AI-Assistants/.github/blob/ad6948fdec771e022d49cd96f99024fcc7f1106a/LICENSE).