https://github.com/fernandesotero/project-data-exploration
Student Performance Prediction with Data Science
data-visualization jupyter-notebook python
- Host: GitHub
- URL: https://github.com/fernandesotero/project-data-exploration
- Owner: fernandesotero
- Created: 2025-05-13T19:43:24.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-13T19:44:47.000Z (about 1 year ago)
- Last Synced: 2025-05-13T20:58:17.207Z (about 1 year ago)
- Topics: data-visualization, jupyter-notebook, python
- Language: Jupyter Notebook
- Homepage:
- Size: 154 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🎓 Student Performance Prediction with Data Science
This project applies data science techniques to predict **students' academic performance** from their study habits, lifestyle, and socioeconomic background. The goal is to explore how behavioral and contextual variables influence final exam scores, using machine learning models.
## 📁 Dataset
The dataset used, named `student_habits_performance.csv`, contains **1,000 student records** with the following information:
- Demographic data (age, gender)
- Daily habits (study hours, sleep, social media, Netflix)
- External conditions (part-time job, internet quality, parents' education level)
- Health and well-being (exercise, diet, mental health)
- Participation in extracurricular activities
- Final exam score (target variable: `exam_score`)
## 🎯 Objective
To develop a predictive model capable of estimating a student's exam score based on their habits and characteristics, enabling:
- Understanding of the variables that most impact academic performance
- Support for the development of data-driven educational policies
- Practical demonstration of machine learning techniques
## ⚙️ Project Steps
### 1. Exploratory Data Analysis (EDA)
Tools used: `pandas`, `seaborn`, `matplotlib`
- Check data types, missing values, and descriptive statistics
- Graphical analysis of distributions, correlations, and variable relationships
- Initial understanding of data patterns
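
The checks listed above can be sketched with `pandas` alone. This is a minimal illustration, not the notebook's actual code: it builds a small synthetic frame instead of reading `student_habits_performance.csv`, and every column name except `exam_score` is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real CSV; all column names except
# exam_score are hypothetical and chosen only for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "study_hours_per_day": rng.uniform(0, 8, n),
    "sleep_hours": rng.normal(7, 1, n),
    "gender": rng.choice(["Female", "Male"], n),
})
df["exam_score"] = (40 + 6 * df["study_hours_per_day"] + rng.normal(0, 5, n)).clip(0, 100)
df.loc[::25, "sleep_hours"] = np.nan  # inject a few missing values

# Data types, missing values per column, and descriptive statistics
print(df.dtypes)
missing = df.isna().sum()
print(missing)
print(df.describe())

# Pairwise correlations between numeric columns, ranked against the target
corr = df.select_dtypes("number").corr()
print(corr["exam_score"].sort_values(ascending=False))
```

With the real data, `pd.read_csv("student_habits_performance.csv")` would replace the synthetic frame, and a call such as `seaborn.heatmap(corr, annot=True)` is one common way to plot the correlation matrix.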
### 2. Preprocessing
Tools used: `scikit-learn` (`ColumnTransformer`, `StandardScaler`, `OneHotEncoder`)
- Separation of numerical and categorical features
- Standardization of continuous variables (zero mean, unit variance, via `StandardScaler`)
- Encoding of categorical variables (one-hot encoding)
- Removal of irrelevant columns (e.g., student ID)
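
A rough sketch of how these preprocessing steps might fit together with `ColumnTransformer` follows. The tiny frame and its column names (`student_id`, `study_hours_per_day`, `sleep_hours`, `part_time_job`) are assumptions for illustration; only `exam_score` is named in this README.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame; column names other than exam_score are assumed.
df = pd.DataFrame({
    "student_id": ["S1", "S2", "S3", "S4"],
    "study_hours_per_day": [1.0, 3.5, 5.0, 2.5],
    "sleep_hours": [6.0, 7.5, 8.0, 5.5],
    "part_time_job": ["No", "Yes", "No", "No"],
    "exam_score": [55.0, 72.0, 90.0, 64.0],
})

# Drop the identifier and separate the target from the features
X = df.drop(columns=["student_id", "exam_score"])
y = df["exam_score"]

# Split features into numerical and categorical groups by dtype
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Standardize numeric columns and one-hot encode categorical ones
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)  # 2 scaled columns + 2 one-hot columns
```

Setting `handle_unknown="ignore"` is a design choice made here so that categories unseen during fitting do not raise an error at prediction time; the notebook may configure the encoder differently.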
### 3. Predictive Modeling
Models used:
- **Linear Regression** – Baseline model to evaluate simple linear relationships
- **Random Forest Regressor** – Robust, non-linear model to capture complex interactions
Each model was wrapped in a scikit-learn `Pipeline` together with the preprocessing steps, so that preprocessing and training run in a single `fit` call.
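
As a sketch of that setup, the snippet below wraps both regressors in identical pipelines and fits them on synthetic data. The feature names, the helper `make_pipeline_for`, the 80/20 train/test split, and the random-forest settings are all assumptions, not details taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic features and target; column names are hypothetical.
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "study_hours_per_day": rng.uniform(0, 8, n),
    "sleep_hours": rng.normal(7, 1, n),
    "part_time_job": rng.choice(["No", "Yes"], n),
})
y = 40 + 6 * X["study_hours_per_day"] + rng.normal(0, 5, n)

# Hold out 20% of the rows for evaluation (an assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def make_pipeline_for(model):
    """Hypothetical helper: preprocessing followed by the given regressor."""
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), ["study_hours_per_day", "sleep_hours"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["part_time_job"]),
    ])
    return Pipeline([("preprocess", preprocessor), ("model", model)])

models = {
    "linear_regression": make_pipeline_for(LinearRegression()),
    "random_forest": make_pipeline_for(
        RandomForestRegressor(n_estimators=100, random_state=42)
    ),
}

# fit() runs preprocessing and training together; predict() reuses the
# scaler and encoder fitted on the training rows only.
predictions = {}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    predictions[name] = pipe.predict(X_test)
```

Keeping preprocessing inside the pipeline means the scaler and encoder never see the held-out rows during fitting, which avoids leaking test information into training.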
### 4. Evaluation
Metrics used:
- **R² (coefficient of determination)**: the proportion of variance in `exam_score` explained by the model
- **MAE (Mean Absolute Error)**: the average absolute prediction error, expressed in exam-score points
These metrics allow for model comparison and understanding of prediction effectiveness.
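
To make the two metrics concrete, here is a small worked example on made-up numbers (the four scores below are illustrative, not results from this project). It computes both metrics with `scikit-learn` and recomputes R² by hand from its definition.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true and predicted exam scores, for illustration only
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([62.0, 68.0, 83.0, 87.0])

mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# R² by definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)         # 4 + 4 + 9 + 9 = 26
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 225 + 25 + 25 + 225 = 500
r2_manual = 1 - ss_res / ss_tot

print(f"MAE = {mae:.2f}")  # MAE = 2.50
print(f"R2 = {r2:.3f}")    # R2 = 0.948
```

Here the predictions are off by 2.5 points on average, and the model accounts for about 94.8% of the variance in the true scores.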
## 🧠 Conclusion
The project demonstrated how behavioral and lifestyle data can be analyzed to predict academic performance. Beyond building predictive models, it helped identify factors that influence students' learning, which are useful insights for real-world educational applications.
## 🛠️ Technologies and Libraries
- Python
- pandas
- matplotlib
- seaborn
- scikit-learn
## 📌 Future Improvements
- Hyperparameter tuning with GridSearchCV
- Testing other models (XGBoost, LightGBM)
- Feature selection to reduce complexity
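
As a possible starting point for the first item above, the sketch below tunes a random forest inside a pipeline with `GridSearchCV`. The synthetic data, the parameter grid, the 3-fold cross-validation, and the MAE-based scoring are assumptions chosen to keep the example small; they are not settings from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features and target, for illustration only
rng = np.random.default_rng(0)
X = rng.uniform(0, 8, size=(120, 3))
y = 40 + 6 * X[:, 0] + rng.normal(0, 5, 120)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(random_state=0)),
])

# Pipeline step parameters are addressed as "<step name>__<parameter>"
param_grid = {
    "model__n_estimators": [25, 50],
    "model__max_depth": [3, None],
}

# scikit-learn maximizes scores, so MAE is exposed as a negated value
search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_absolute_error")
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best combination
```

After fitting, `search.best_estimator_` holds the pipeline refit on all of the supplied data with the best parameter combination.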