https://github.com/aymen016/cosmic-mystery-challenge-2912
"Explore the depths of space and unravel cosmic mysteries in the year 2912 with our Cosmic Mystery Challenge repository. Dive into data science adventures as you predict the fate of passengers aboard the Spaceship Titanic after a collision with a spacetime anomaly. Join us in reshaping history and saving lives across the universe!"
- Host: GitHub
- URL: https://github.com/aymen016/cosmic-mystery-challenge-2912
- Owner: Aymen016
- License: mit
- Created: 2024-03-03T13:31:24.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-01-31T14:55:01.000Z (10 months ago)
- Last Synced: 2025-03-29T16:42:17.755Z (8 months ago)
- Topics: kaggle, matplotlib, matplotlib-pyplot, numpy, pandas, pandas-dataframe, python, scikit-learn, scikitlearn-machine-learning, seaborn
- Language: Jupyter Notebook
- Homepage:
- Size: 510 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# **Spaceship Titanic: Predicting Passenger Transported Status**

## **Project Overview**
This project is based on the **Spaceship Titanic** dataset, which contains passenger information from a spaceship journey. The goal is to predict whether a passenger was **transported** to a different dimension, based on various features such as home planet, age, cryogenic sleep status, and amenities usage.
We preprocess the dataset, handle missing values, explore correlations, and visualize the distribution of features to build and train a model for **classification**.
## **Installation and Setup**
### **Prerequisites**
Ensure you have **Python** and the required libraries installed. You can install the necessary dependencies by running:
```bash
pip install numpy pandas seaborn matplotlib scikit-learn
```
### **Dataset**
This project uses the **Spaceship Titanic** dataset, available on Kaggle. It contains two main files:
- **train.csv**: The training dataset containing labeled data.
- **test.csv**: The testing dataset for prediction (no labels).
### **Loading the Data**
The data is loaded using **Pandas** as follows:
```python
import pandas as pd

# Load the Kaggle competition files into DataFrames
train_data = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/spaceship-titanic/test.csv")
```
## **Data Preprocessing**
Data preprocessing is a crucial step before feeding the data into a machine learning model. Below are the steps involved:
### **Handling Missing Values**:
- For numerical features such as **Age**, missing values are filled with the mean.
- For categorical features like **HomePlanet**, missing values are filled with the mode (most frequent value).
- **Amenities** columns (RoomService, FoodCourt, Spa, VRDeck) with missing values are filled with the mean of each respective feature.
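For illustration, the missing-value handling above might look like the following in Pandas (a minimal sketch; `train_data` is the DataFrame loaded earlier, and the amenity list mirrors the columns named above):
```python
# Numerical feature: fill missing ages with the mean age
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].mean())

# Categorical feature: fill missing home planets with the most frequent value
train_data["HomePlanet"] = train_data["HomePlanet"].fillna(train_data["HomePlanet"].mode()[0])

# Amenity columns: fill missing values with each column's own mean
for col in ["RoomService", "FoodCourt", "Spa", "VRDeck"]:
    train_data[col] = train_data[col].fillna(train_data[col].mean())
```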
### **Categorical Encoding**:
- **HomePlanet** and **CryoSleep** are converted into **category** types, and **CryoSleep** is also converted to boolean values.
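One way this step might be written (a sketch; the notebook may differ in details such as how missing **CryoSleep** entries are treated):
```python
# Cast HomePlanet and CryoSleep to the pandas category dtype
train_data["HomePlanet"] = train_data["HomePlanet"].astype("category")
train_data["CryoSleep"] = train_data["CryoSleep"].astype("category")

# Additionally represent CryoSleep as booleans; fill gaps first,
# since NaN would otherwise coerce to True under astype(bool)
train_data["CryoSleep"] = train_data["CryoSleep"].fillna(False).astype(bool)
```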
### **Feature Engineering**:
- We create new features like **CabinLevel**, **CabinSection**, and **Cabinn** by splitting the **Cabin** feature.
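The split itself can be done with the Pandas string API. A sketch follows; **Cabin** values have the form `Deck/Num/Side`, and which new column corresponds to which part is an assumption here:
```python
# Split "Deck/Num/Side" style values into three separate columns
cabin_parts = train_data["Cabin"].str.split("/", expand=True)
train_data["CabinLevel"] = cabin_parts[0]    # deck letter (assumed mapping)
train_data["Cabinn"] = cabin_parts[1]        # cabin number (assumed mapping)
train_data["CabinSection"] = cabin_parts[2]  # side, P or S (assumed mapping)
```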
### **Error Handling**:
- We identify rows where passengers in cryogenic sleep have non-zero values in amenities-related columns and handle these erroneous rows by removing them.
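A sketch of that consistency check, assuming **CryoSleep** has already been converted to booleans and the amenity columns have been filled as above:
```python
# Passengers in cryogenic sleep should have zero amenity spending;
# flag rows that violate this rule and drop them
amenity_cols = ["RoomService", "FoodCourt", "Spa", "VRDeck"]
inconsistent = train_data["CryoSleep"] & (train_data[amenity_cols].sum(axis=1) > 0)
train_data = train_data[~inconsistent]
```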
## **Exploratory Data Analysis (EDA)**
Visualization helps us understand the underlying patterns in the data. Below are some key visualizations:
### **Distribution of Passengers by HomePlanet**:
- This bar chart shows how passengers are distributed across different home planets.
### **Distribution of Passengers by Destination**:
- Visualizes the number of passengers going to each destination.
### **Age Distribution by Transported Status**:
- A boxplot showing the distribution of **Age** for passengers who were transported vs. those who were not.
### **Correlation Heatmap**:
- Displays the correlation between numerical features in the dataset to identify potential relationships.
### **Transportation Status Distribution**:
- A countplot showing the number of passengers who were and weren't transported.
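The plots above can be reproduced with Seaborn and Matplotlib along these lines (a sketch of the kinds of calls involved, not the notebook's exact code):
```python
import matplotlib.pyplot as plt
import seaborn as sns

# Passengers per home planet and per destination
sns.countplot(data=train_data, x="HomePlanet")
plt.show()
sns.countplot(data=train_data, x="Destination")
plt.show()

# Age distribution split by Transported status
sns.boxplot(data=train_data, x="Transported", y="Age")
plt.show()

# Correlation heatmap over the numerical features
sns.heatmap(train_data.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()

# How many passengers were and were not transported
sns.countplot(data=train_data, x="Transported")
plt.show()
```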
## **Feature Engineering & Transformation**
- **Age Transformation**: Missing values in **Age** are filled with the mean age.
- **Cabin Transformation**: Splitting the **Cabin** column into **CabinLevel**, **CabinSection**, and **Cabinn** for further analysis.
- **CryoSleep Transformation**: Convert the **CryoSleep** column to boolean values and handle erroneous data where passengers in cryo sleep have non-zero values in amenities columns.
## **Modeling**
The ultimate goal is to predict the **Transported** status using various features. Below is an example of how you might implement a machine learning model:
```python
# Example: model training with RandomForestClassifier or any other classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Prepare data for the model; this sketch keeps only numeric/boolean columns,
# since tree models cannot consume raw strings such as Name or Cabin, and
# fills any remaining gaps with 0
X = train_data.drop(columns=['Transported']).select_dtypes(include=['number', 'bool']).fillna(0)
y = train_data['Transported']

# Train/validation split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate the model on the held-out split
score = model.score(X_test, y_test)
print(f"Model Accuracy: {score}")
```
## **Model Evaluation**
We use **accuracy** and other classification metrics to evaluate the model. Because the Kaggle **test dataset** is unlabeled, generalization is checked on a held-out split of the training data before predictions are generated for submission.
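For example, per-class metrics on the held-out split can be inspected with scikit-learn (reusing `model`, `X_test`, and `y_test` from the modeling example above):
```python
from sklearn.metrics import classification_report, confusion_matrix

# Precision, recall and F1 for each class, plus the confusion matrix
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```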
## **Future Improvements**
- **Hyperparameter Tuning**: Experiment with different algorithms and hyperparameter optimization techniques such as **GridSearchCV** or **RandomizedSearchCV**.
- **Feature Selection**: Use techniques like **Recursive Feature Elimination (RFE)** or **PCA** to reduce the feature space and improve model performance.
- **Cross-validation**: Implement **k-fold cross-validation** for better performance evaluation.
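As a concrete starting point for the first and third items, a **GridSearchCV** run with 5-fold cross-validation might look like this (the parameter grid is purely illustrative, and `X_train`/`y_train` are reused from the modeling example):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are placeholders, not tuned choices
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # k-fold cross-validation
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```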
## **Conclusion**
Through this project, we implemented data preprocessing and visualization techniques to clean and analyze the **Spaceship Titanic** dataset. The next steps involve building predictive models and comparing their performance to identify the best algorithm for predicting **Transported** status. We also plan to keep refining the model by testing more complex algorithms and applying additional feature engineering techniques.
## **License**
This project is licensed under the MIT License - see the [LICENSE](https://github.com/Aymen016/Cosmic-Mystery-Challenge-2912/blob/master/LICENSE) file for details.