https://github.com/musadiqpasha/imbalance-learning

To tackle imbalanced classification in fake job detection dataset using various resampling techniques and model evaluations.
https://github.com/musadiqpasha/imbalance-learning

classification imbalanced-data ipython-notebook resampling smote-sampling

Last synced: about 1 year ago
JSON representation

To tackle imbalanced classification in fake job detection dataset using various resampling techniques and model evaluations.

Host: GitHub
URL: https://github.com/musadiqpasha/imbalance-learning
Owner: MusadiqPasha
License: mit
Created: 2025-05-04T19:41:42.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-05T06:40:37.000Z (about 1 year ago)
Last Synced: 2025-05-09T01:45:00.449Z (about 1 year ago)
Topics: classification, imbalanced-data, ipython-notebook, resampling, smote-sampling
Language: Jupyter Notebook
Homepage:
Size: 3.88 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # Imbalance-Learning

# Real-Fake Job Prediction Model 

This repository contains a Jupyter notebook that develops and evaluates machine learning models to classify job postings as **real** or **fraudulent**. The dataset, sourced from [Kaggle](https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction), presents a significant **class imbalance**, with fake job postings comprising only about 5% of the total data.

## Objective

To tackle imbalanced classification in fake job detection dataset using various resampling techniques and model evaluations.



## Dataset Summary

- **Rows**: ~17,880

- **Columns**: 17

- **Target Variable**: `fraudulent` (0: real, 1: fake)

- **Imbalance**: ~5% fraudulent entries

## Workflow

### 1. Exploratory Data Analysis (EDA)

- Visualized missing values and class distribution.

- Explored feature distributions using plots.

- Analyzed relationships using scatter plot matrix.

### 2. Data Preprocessing

- Handled missing values (mode fill and text defaults).

- Dropped high-null columns (e.g., `department`).

- Encoded categorical features.

- Split data into training and testing sets.

### 3. Tackling Imbalance

Applied various sampling techniques:

- **SMOTE (Synthetic Minority Over-sampling Technique)**

- **Random Oversampling**

- **Random Undersampling**

### 4. Modeling

Evaluated several models:

- Logistic Regression

- Decision Tree Classifier

- Random Forest

- XGBoost

- SVM (Support Vector Machine)

### 5. Evaluation Metrics

Models were assessed using:

- Accuracy

- Precision

- Recall

- F1-Score

- Confusion Matrix

- ROC-AUC Curve

## Results (SMOTE-enhanced dataset)

| Model                | Accuracy | Precision | Recall | F1 Score |

|---------------------|----------|-----------|--------|----------|

| Logistic Regression | 78.92%   | 77.74%    | 82.21% | 79.91%   |

| Decision Tree       | 97.26%   | 96.79%    | 97.87% | 97.33%   |

| Random Forest       | 99.15%   | 99.20%    | 99.13% | 99.17%   |

| XGBoost             | 99.11%   | 98.75%    | 99.53% | 99.13%   |

| SVM                 | 75.53%   | 78.01%    | 72.03% | 74.90%   |

## Files

- `RealFakeJobPrediction.ipynb`: Main notebook containing all steps from preprocessing to model evaluation.

---

Feel free to explore the notebook and reach out with suggestions or improvements!

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/musadiqpasha/imbalance-learning

Awesome Lists containing this project

README