https://github.com/deliprofesor/income-analytics-interpretable-machine-learning-model
This project predicts whether an individual earns more than 50K using the Adult Income dataset. A Random Forest model is trained and evaluated, with explanations provided through DALEX and LIME for feature importance and model transparency.
classification dalex data-preprocessing data-science data-visualization feature-engineering income-prediction lime machine-learning model-explainability predictive-modeling r-programming random-forest
- Host: GitHub
- URL: https://github.com/deliprofesor/income-analytics-interpretable-machine-learning-model
- Owner: deliprofesor
- Created: 2024-12-08T20:50:26.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-12-14T19:42:39.000Z (2 months ago)
- Last Synced: 2024-12-23T22:15:32.648Z (about 2 months ago)
- Topics: classification, dalex, data-preprocessing, data-science, data-visualization, feature-engineering, income-prediction, lime, machine-learning, model-explainability, predictive-modeling, r-programming, random-forest
- Language: R
- Homepage:
- Size: 626 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Income Prediction and Model Explainability using Random Forest, DALEX, and LIME

## Project Overview
This project predicts whether an individual's annual income exceeds 50K from demographic and employment-related features. A **Random Forest** model performs the classification, and two explainability packages, **DALEX** and **LIME**, are used to interpret and visualize its predictions. This shows which features most influence the predicted income class (">50K" or "<=50K").
## Dataset
The dataset used in this project is the **Adult Income Dataset** (also known as the **Census Income Dataset**), which contains information about individuals and their income. The dataset includes the following columns:
- **age**: The age of the individual.
- **workclass**: The type of employment.
- **education**: The highest level of education attained.
- **marital.status**: Marital status of the individual.
- **occupation**: The occupation of the individual.
- **relationship**: The individual's role within the household (for example, husband, wife, or own child).
- **race**: The race of the individual.
- **sex**: The gender of the individual.
- **native.country**: The country of origin.
- **income**: The target variable, indicating whether the individual earns more than 50K a year.

The dataset is preprocessed by handling missing values and converting categorical variables to factors.
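
The preprocessing described above could look roughly like the following base-R sketch. It is an illustration, not the project's actual script: the inline CSV text stands in for the real data file, the `"?"` missing-value marker is an assumption about how the file encodes unknown entries, and dropping incomplete rows is only one possible way of "handling" missing values.

```r
# Tiny inline stand-in for the CSV file so the sketch runs on its own.
# Assumption: unknown entries are written as "?" in the file.
csv_text <- "age,workclass,education,sex,income
39,State-gov,Bachelors,Male,<=50K
50,?,Bachelors,Male,<=50K
38,Private,HS-grad,Male,<=50K
28,Private,Bachelors,Female,>50K
37,?,Masters,Female,>50K"

# Read the data, turning "?" into NA.
adult <- read.csv(text = csv_text, na.strings = "?", stringsAsFactors = FALSE)
head(adult)

# Count missing values per column.
colSums(is.na(adult))

# One simple way to handle missing values: drop incomplete rows.
adult <- na.omit(adult)

# Convert every character column to a factor.
is_text <- vapply(adult, is.character, logical(1))
adult[is_text] <- lapply(adult[is_text], factor)
str(adult)
```

With the real file, `read.csv()` would take the file path instead of `text = csv_text`. If the raw values carry leading spaces, adding `strip.white = TRUE` may be needed for the `"?"` marker to match.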
## Libraries Used
- **tidyverse**: For data manipulation and visualization.
- **caret**: For model training and evaluation.
- **DALEX**: For model explainability and feature importance.
- **lime**: For local interpretable model-agnostic explanations.

## Steps Involved
1. **Data Loading and Preprocessing**:
   - The data is loaded from a CSV file, and the first few rows are displayed.
   - Missing values are checked, and categorical variables are converted into factors.
2. **Data Splitting**:
   - The dataset is split into training (70%) and testing (30%) sets.
3. **Model Training**:
   - A **Random Forest** model is trained using the `caret` package with 10-fold cross-validation to predict income.
4. **Model Evaluation**:
   - After the model is trained, predictions are made on the test set, and performance metrics such as the **confusion matrix** are calculated.
5. **Model Explainability with DALEX**:
   - The DALEX package is used to explain the model's predictions. Feature importance is visualized to show which features contribute most to predicting income.
6. **Model Explainability with LIME**:
   - The lime package provides local explanations for individual predictions. The explanations for the first five test instances are visualized, and the importance of features such as **gender** (the `sex` column) is also explored.

## Key Visualizations
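
Steps 2 to 6 above, through to the two visualizations listed in this section, might look roughly like the sketch below. It is a hedged illustration rather than the project's actual code: the data is a small synthetic stand-in (not the Adult Income dataset), the class labels are recoded to the hypothetical names `upTo50K` and `over50K` to keep them syntactically valid, and all object names are made up for the example.

```r
library(caret)  # train() with method = "rf" also needs the randomForest package
library(DALEX)
library(lime)

set.seed(123)

# Synthetic stand-in for the preprocessed data (illustration only).
n <- 400
adult <- data.frame(
  age       = as.numeric(sample(18:70, n, replace = TRUE)),
  education = factor(sample(c("HS-grad", "Bachelors", "Masters"), n, replace = TRUE)),
  sex       = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
score <- 0.06 * (adult$age - 40) + 1.2 * (adult$education != "HS-grad") + rnorm(n)
adult$income <- factor(ifelse(score > 0.5, "over50K", "upTo50K"),
                       levels = c("upTo50K", "over50K"))
features <- setdiff(names(adult), "income")

# Step 2: stratified 70/30 train/test split.
train_idx <- createDataPartition(adult$income, p = 0.7, list = FALSE)
train_set <- adult[train_idx, ]
test_set  <- adult[-train_idx, ]

# Step 3: Random Forest with 10-fold cross-validation.
rf_fit <- train(
  income ~ ., data = train_set, method = "rf",
  trControl = trainControl(method = "cv", number = 10)
)

# Step 4: predictions and confusion matrix on the held-out test set.
pred <- predict(rf_fit, newdata = test_set)
cm <- confusionMatrix(data = pred, reference = test_set$income,
                      positive = "over50K")
print(cm)

# Step 5: permutation-based variable importance with DALEX.
rf_explainer <- DALEX::explain(
  rf_fit,
  data    = train_set[, features],
  y       = as.numeric(train_set$income == "over50K"),
  label   = "Random Forest",
  verbose = FALSE
)
importance <- model_parts(rf_explainer)
plot(importance)

# Step 6: LIME explanations for the first five test instances.
lime_explainer <- lime(train_set[, features], rf_fit)
lime_explanation <- lime::explain(
  test_set[1:5, features],
  lime_explainer,
  n_labels   = 1,
  n_features = 3
)
plot_features(lime_explanation)
```

Both DALEX and lime export a function named `explain`, so the sketch calls them as `DALEX::explain()` and `lime::explain()` to avoid depending on the order in which the packages are attached.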
- **Variable Importance**: This visualization shows the relative importance of each feature in predicting income.
- **LIME Explanations**: Local explanations for individual predictions show how specific features influence the outcome for individual data points.

## How to Run the Code
To run the code, follow these steps:
1. Install the required packages:
```r
install.packages("tidyverse")
install.packages("caret")
install.packages("DALEX")
install.packages("lime")