https://github.com/pngo1997/predictive-model-data-science-salaries

Project analyzes and predicts Data Science salaries worldwide (2020-2023) using Multiple Linear Regression.

Host: GitHub
URL: https://github.com/pngo1997/predictive-model-data-science-salaries
Owner: pngo1997
Created: 2025-01-30T18:11:30.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-01-30T18:24:50.000Z (over 1 year ago)
Last Synced: 2025-01-30T19:32:28.944Z (over 1 year ago)
Topics: eda, multiple-linear-regression, predictive-analytics, predictive-modeling
Language: SAS
Homepage:
Size: 8.33 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# 🏗️ Data Science Salaries Prediction

## 📜 Overview
This project analyzes and predicts **Data Science salaries worldwide (2020-2023)** using **Multiple Linear Regression**. The dataset, sourced from Kaggle, contains **3,755 observations** with job details such as **experience level, employment type, salary, company location, and remote work ratio**. The goal is to build a predictive model that estimates **future Data Science salaries** from these employment attributes.

## 🎯 Problem Explanation
The dataset includes **11 attributes** (4 numerical and 7 categorical):
- **Target Variable:** `salary_in_usd` (Salary in USD).
- **Independent Variables:**
  - `work_year` (Year the salary was paid).
  - `experience_level` (Entry, Mid, Senior, Executive).
  - `employment_type` (Part-time, Full-time, Contract, Freelance).
  - `job_title` (Data Scientist, Engineer, etc.).
  - `salary` (Salary in original currency).
  - `salary_currency` (USD, EUR, GBP, etc.).
  - `employee_residence` (Country of employee residence).
  - `remote_ratio` (0 = No remote, 50 = Hybrid, 100 = Fully remote).
  - `company_location` (Employer's country).
  - `company_size` (S = <50, M = 50-250, L = >250 employees).
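
Categorical attributes such as `experience_level` cannot enter a linear regression directly, so they are usually expanded into 0/1 indicator (dummy) columns with one level held out as the reference. The repository itself is written in SAS; the Python sketch below is only an illustration of the idea, and the helper name `dummy_code` and the sample values are hypothetical, not taken from the project code.

```python
def dummy_code(values, drop_first=True):
    """Expand a list of category labels into 0/1 indicator columns.

    Returns (levels, rows), where `levels` names the indicator columns and
    `rows` holds one list of 0/1 flags per input value.
    """
    levels = sorted(set(values))
    if drop_first:
        # The first level (alphabetically) becomes the reference category,
        # which avoids perfect collinearity with the intercept.
        levels = levels[1:]
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical sample using the experience labels listed above.
experience = ["Senior", "Mid", "Entry", "Executive", "Senior"]
levels, rows = dummy_code(experience)
print(levels)
for label, row in zip(experience, rows):
    print(label, row)
```

With `drop_first=True`, a row of all zeros represents the reference level (here `Entry`), so each coefficient is read as a difference from that level.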

## 🛠️ Implementation Details
- **Exploratory Data Analysis (EDA):**
  - Applied a **square root transformation** to normalize the salary distribution.
  - Created **dummy variables** for categorical attributes.
  - Analyzed **correlations & multicollinearity (VIF test)**.
- **Regression Models:**
  - **Full Model:** All predictors included (Adjusted R² = 39.34%).
  - **Refined Model (Removing Multicollinearity):**
    - Excluded `company_location` due to high correlation with `employee_residence`.
    - Adjusted R² rose marginally to **39.35%**.
  - **Stepwise Selection Model:**
    - Reduced to **six key predictors** (Adjusted R² = **39.46%**).
  - **Final Model (After Outlier Removal):**
    - Adjusted R² = **41.84%**, RMSE = **64.73**, F-value = **440.04**, p-value < **0.0001**.
- **Hypothesis Testing (F-Test):**
  - Null Hypothesis: The coefficients of all six predictors are zero (no predictor is related to salary).
  - Alternative Hypothesis: At least one predictor has a nonzero coefficient.
  - Result: **Rejected Null Hypothesis** (F = 440.04, p < 0.0001), confirming that at least one predictor is relevant.
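
The quantities reported above follow standard formulas, sketched here in plain Python for illustration. The project itself is implemented in SAS, so this is not the author's code; the function names and all numbers in the example are hypothetical. Here `n` is the number of observations and `p` the number of estimated slope coefficients (dummy columns included).

```python
import math


def sqrt_transform(salaries):
    """Square-root transform, used to reduce right skew in salaries."""
    return [math.sqrt(s) for s in salaries]


def back_transform(predictions_sqrt):
    """Square predictions made on the sqrt scale to return to USD."""
    return [p ** 2 for p in predictions_sqrt]


def r_squared(observed, fitted):
    """Share of variance in `observed` explained by `fitted`."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def overall_f_statistic(r2, n, p):
    """F statistic for H0: all p slope coefficients are zero."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))


def vif(r2_j):
    """Variance inflation factor for predictor j, where r2_j is the
    R-squared from regressing predictor j on the other predictors."""
    return 1 / (1 - r2_j)


# Toy values only; these are not results from the project.
print(sqrt_transform([40000, 90000, 160000]))
print(back_transform([200.0, 300.0]))
print(adjusted_r_squared(0.5, n=101, p=6))
print(overall_f_statistic(0.5, n=101, p=6))
print(vif(0.8))
```

Because the model is fit on the square-root scale, predictions need to be squared to be read in USD, and error metrics computed on the transformed scale are not directly in dollars. A large VIF signals that a predictor is largely explained by the others, which is the kind of overlap that motivated dropping `company_location`.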
