https://github.com/jazib-2004/prediction-classification-and-clustering-on-public-expenses-dataset

An end-to-end ML pipeline: EDA to understand the data, preprocessing to prepare it for modelling, then regression to predict one feature's value, classification to predict another feature's class, and K-means clustering with an accompanying cluster analysis.
data-preprocessing exploratory-data-analysis k-means-clustering lasso-regression logistic-regression matplotlib ml-pipeline python scikit-learn

# Prediction-Classification-and-Clustering-on-Public-Expenses-Dataset

I applied a simple end-to-end machine learning pipeline to a public expenses dataset. The pipeline consists of exploratory data analysis to understand the data, basic preprocessing to make the dataset suitable for feeding into a model, and then three modelling tasks: REGRESSION to predict the value of one feature, CLASSIFICATION to predict the class of another, and K-MEANS CLUSTERING, with a cluster analysis to choose the optimal number of clusters before clustering the data.
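
To make the preprocessing step concrete, here is a minimal sketch of what "basic preprocessing" could look like with scikit-learn. This README does not list the dataset's columns or the exact steps used, so the column names (`age`, `monthly_expense`, `gender`), the toy values, and the choice of median imputation, scaling, and one-hot encoding are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: the real dataset's columns are not given in this README.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41],
    "monthly_expense": [1200.0, 2500.0, 1800.0, np.nan],
    "gender": ["F", "M", "F", "M"],
})

numeric_cols = ["age", "monthly_expense"]
categorical_cols = ["gender"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode. sparse_threshold=0.0 forces a dense array.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,
)

X = preprocess.fit_transform(df)
print(X.shape)  # rows x (2 scaled numeric columns + one column per gender category)
```

Wrapping the steps in a `ColumnTransformer` keeps the imputation and scaling statistics tied to the training data, which avoids leaking information when the same object is later applied to a test split.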

**Exploratory Data Analysis**

![image](https://github.com/user-attachments/assets/219a2aa8-94af-4583-ac4b-20f8ec25443d)
![image](https://github.com/user-attachments/assets/3b38d91a-a6f2-4646-87b7-ec26245a27c4)

**Lasso Regression Results**

![image](https://github.com/user-attachments/assets/921aa964-89df-423f-96e2-2aab08751b97)
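
For readers who want to reproduce this kind of result, the following is a hedged sketch of a Lasso regression fit and evaluation. It runs on synthetic data, not the public expenses dataset, and the `alpha` value, the train/test split, and the metrics are assumptions rather than the settings used for the figure above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: only the first two of five features drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale first: the L1 penalty is sensitive to feature magnitudes.
model = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Lasso shrinks coefficients toward zero; uninformative features often end at exactly 0.
coefs = model.named_steps["lasso"].coef_
print(f"R^2: {r2:.3f}  MSE: {mse:.3f}")
print("coefficients:", np.round(coefs, 3))
```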

**Logistic Regression Results**

![image](https://github.com/user-attachments/assets/93e6447d-a8e2-40a7-8780-d0d1a1fb731b)

The reason for such poor results is a bad choice of target feature for classification. I wanted to see whether GENDER could be classified from this data. The poor results suggest that the other features carry little information about gender, which I read as a sign that the dataset is not biased towards any gender.
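
One way to check a reading like this is to compare the classifier against a majority-class baseline: if the features carry no information about the target, logistic regression should score about the same as always guessing the most frequent class. The sketch below illustrates that comparison on synthetic data where the label is generated independently of the features. The data, split, and variable names are assumptions and do not reproduce the experiment above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: the binary label is drawn independently of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, clf.predict(X_test))

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"logistic regression accuracy: {model_acc:.3f}")
```

If the two numbers are close on the real data as well, that supports the conclusion that the features do not separate the classes. A large gap would instead point to a modelling or preprocessing issue.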

**Cluster Analysis**

![image](https://github.com/user-attachments/assets/4979db82-6fd3-49fe-94ab-9e3cffb0d969)
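
A common way to run this kind of analysis is to fit K-means for a range of `k` values and compare the inertia (the elbow method) and the silhouette score. The sketch below does that on synthetic blobs. Whether the figure above was produced with the elbow method, the silhouette score, or something else is not stated here, so treat the method, the `k` range, and the data as assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three well-separated blobs at hand-picked centers.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

inertias = {}
silhouettes = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squares (elbow plot)
    silhouettes[k] = silhouette_score(X, km.labels_)  # higher means better separation

# Pick the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)

for k in inertias:
    print(f"k={k}  inertia={inertias[k]:.1f}  silhouette={silhouettes[k]:.3f}")
print("best k by silhouette:", best_k)
```

Inertia generally keeps falling as `k` grows, so it is read by looking for the bend in the curve, while the silhouette score gives a single value per `k` that can be maximized directly.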

**K-Means Clustering**

![image](https://github.com/user-attachments/assets/fada2427-be71-48b8-95dd-4e9027563529)
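
Once a value of `k` has been chosen, the final clustering step can be sketched as follows. The data is synthetic and `k=3` is an assumed value for illustration. The number of clusters actually used for the plot above is not stated in this README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed expenses features.
X_raw, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

# K-means uses Euclidean distance, so features are standardized first.
scaler = StandardScaler()
X = scaler.fit_transform(X_raw)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # k=3 is an assumption
labels = kmeans.fit_predict(X)

# Cluster sizes, and centroids mapped back to the original feature units.
cluster_sizes = np.bincount(labels, minlength=3)
centroids_original_units = scaler.inverse_transform(kmeans.cluster_centers_)

print("cluster sizes:", cluster_sizes)
print("centroids (original units):")
print(np.round(centroids_original_units, 2))
```

Mapping the centroids back through `inverse_transform` makes them easier to interpret, since they are then expressed in the dataset's original units rather than in standardized scores.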