https://github.com/pngo1997/predictive-model-data-science-salaries
Project analyzes and predicts Data Science salaries worldwide (2020-2023) using Multiple Linear Regression.
https://github.com/pngo1997/predictive-model-data-science-salaries
eda multiple-linear-regression predictive-analytics predictive-modeling
Last synced: 8 months ago
JSON representation
Project analyzes and predicts Data Science salaries worldwide (2020-2023) using Multiple Linear Regression.
- Host: GitHub
- URL: https://github.com/pngo1997/predictive-model-data-science-salaries
- Owner: pngo1997
- Created: 2025-01-30T18:11:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-30T18:24:50.000Z (over 1 year ago)
- Last Synced: 2025-01-30T19:32:28.944Z (over 1 year ago)
- Topics: eda, multiple-linear-regression, predictive-analytics, predictive-modeling
- Language: SAS
- Homepage:
- Size: 8.33 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# 🏗️ Data Science Salaries Prediction
## 📜 Overview
This project analyzes and predicts **Data Science salaries worldwide (2020-2023)** using **Multiple Linear Regression**. The dataset contains **3,755 observations** from Kaggle, including job details such as **experience level, employment type, salary, company location, and remote work ratio**. The goal is to develop a predictive model that estimates **future Data Science salaries** based on employment attributes.
## 🎯 Problem Explanation
The dataset includes **11 attributes** (4 numerical and 7 categorical):
- **Target Variable:** `salary_in_usd` (Salary in USD).
- **Independent Variables:**
- `work_year` (Year salary was paid).
- `experience_level` (Entry, Mid, Senior, Executive).
- `employment_type` (Part-time, Full-time, Contract, Freelance).
- `job_title` (Data Scientist, Engineer, etc.).
- `salary` (Salary in original currency).
- `salary_currency` (USD, EUR, GBP, etc.).
- `employee_residence` (Country of employee residence).
- `remote_ratio` (0 = No remote, 50 = Hybrid, 100 = Fully remote).
- `company_location` (Employer's country).
- `company_size` (S = <50, M = 50-250, L = >250 employees).
## 🛠️ Implementation Details
- **Exploratory Data Analysis (EDA):**
- Applied **square root transformation** to normalize salary distribution.
- Created **dummy variables** for categorical attributes.
- Analyzed **correlations & multicollinearity (VIF test)**.
- **Regression Models:**
- **Full Model:** All predictors included (Adjusted R² = 39.34%).
- **Refined Model (Removing Multicollinearity):**
- Excluded `company_location` due to high correlation with `employee_residence`.
- Improved Adjusted R² to **39.35%**.
- **Stepwise Selection Model:**
- Reduced to **six key predictors** (Adjusted R² = **39.46%**).
- **Final Model (After Outlier Removal):**
- Adjusted R² = **41.84%**, RMSE = **64.73**, F-value = **440.04**, P-value < **0.0001**.
- **Hypothesis Testing (F-Test):**
- Null Hypothesis: None of the six predictors significantly impact salary.
- Alternative Hypothesis: At least one predictor has a significant impact.
- Result: **Rejected Null Hypothesis**, confirming predictor relevance.