https://github.com/capsuleismail/drybeanuci

Data Science Project with Model comparison.

datascience jupyter-notebook machinelearning-python scikit-learn



Host: GitHub
URL: https://github.com/capsuleismail/drybeanuci
Owner: capsuleismail
License: cc-by-4.0
Created: 2025-02-10T15:16:42.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-11T13:24:00.000Z (over 1 year ago)
Last Synced: 2025-08-11T16:32:29.952Z (11 months ago)
Topics: datascience, jupyter-notebook, machinelearning-python, scikit-learn
Language: Jupyter Notebook
Homepage:
Size: 22.5 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


README

## Introduction

---------------

The **Dry Bean Dataset** from the UCI Machine Learning Repository is a well-structured dataset used for classifying different types of beans based on their morphological features. The dataset consists of **various shape-related attributes** extracted from bean images using computer vision techniques. Given these numerical attributes, the goal is to build a **classification model** that can accurately predict the bean type.  

In this **[notebook](https://github.com/capsuleismail/dry_bean_uci/blob/main/dry-bean-dataset-uci.ipynb)**, we explore and analyze the Dry Bean Dataset by answering key questions related to its structure, attributes, and classification potential. We perform **exploratory data analysis (EDA)** using **histograms, boxplots, and correlation matrices** to understand feature distributions and relationships. Additionally, we implement various **machine learning models**, compare their performance, and optimize hyperparameters using **Optuna** to enhance classification accuracy.  

Through this analysis, we aim to determine the **most effective model** for distinguishing between different bean types, leveraging advanced preprocessing techniques and **machine learning pipelines** to streamline the workflow.
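
The EDA steps mentioned above (histograms, boxplots, and a correlation matrix) could look roughly like the sketch below. It is not taken from the notebook: the data frame is a small synthetic stand-in, and the column names `Area`, `Perimeter`, and `Roundness` are only illustrative. With the real data, `df` would be the feature frame loaded from the UCI repository.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the bean features (assumption, not the real dataset).
rng = np.random.default_rng(0)
area = rng.normal(50_000, 10_000, size=200)
df = pd.DataFrame({
    "Area": area,
    "Perimeter": 3.5 * np.sqrt(area) + rng.normal(0, 5, size=200),
    "Roundness": rng.uniform(0.6, 1.0, size=200),
})

# Histograms: one panel per numerical feature.
df.hist(bins=30, figsize=(10, 6))

# Boxplots: separate panels, because the features live on very different scales.
df.plot(kind="box", subplots=True, layout=(1, 3), figsize=(10, 4))

# Correlation matrix, drawn as a simple heatmap.
corr = df.corr()
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)

print(corr.round(2))
plt.close("all")
```

In this toy frame `Perimeter` is built from `Area`, so the two are strongly correlated, while `Roundness` is independent noise. Strongly correlated pairs in the real data are worth noting before choosing a model or applying PCA.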

These are the questions the notebook works through:

1. **What is the Dry Bean Dataset?**
2. **How many instances (rows) and attributes (columns) are present in the dataset?**
3. **What are the different classes of beans in the dataset?**
4. **What are the main features (attributes) used to describe each bean?**
5. **Are all attributes numerical, or are there categorical attributes as well?**
6. **What type of classification problem is this dataset used for? (Binary or multi-class?)**
7. **Which machine learning algorithms can be used to classify the bean types?**
8. **Use histograms to understand the numerical features.**
9. **Use boxplots to understand the numerical features.**
10. **Use a correlation plot to identify relationships between variables.**
11. **What performance metrics can be used to evaluate classification models trained on this dataset?**
12. **Use a Pipeline to preprocess and model your data.**
13. **Compare different models to see which one is the most accurate.**
14. **Tune hyperparameters with Optuna to improve the accuracy of RandomForestClassifier.**


**Citation: Dry Bean [Dataset](https://doi.org/10.24432/C50S4B). (2020). UCI Machine Learning Repository.** 


--------------------------------------------------------------------------------------------------------------------

### I. How to import the dataset via pip.

--------------------------------------------------------------------------------------------------------------------

```
# Install the helper package first (in a shell or a notebook cell):
#   pip install ucimlrepo

# Import the dataset into your code
from ucimlrepo import fetch_ucirepo

# fetch dataset
dry_bean = fetch_ucirepo(id=602)

# data (as pandas dataframes)
X = dry_bean.data.features
y = dry_bean.data.targets

# metadata
print(dry_bean.metadata)

# variable information
print(dry_bean.variables)
```
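
Once `X` and `y` are loaded, the first questions (number of rows and columns, class names, numerical versus categorical attributes) can be answered with a few pandas calls. The helper below is a hypothetical sketch, not part of the notebook. It assumes `y` is a single-column DataFrame, and the toy frames only stand in for the real data so the snippet runs without a download.

```python
import pandas as pd


def summarize(X: pd.DataFrame, y: pd.DataFrame) -> dict:
    """Collect basic facts about a feature frame and a one-column target frame."""
    return {
        "n_rows": X.shape[0],
        "n_features": X.shape[1],
        # Columns that are not numeric would need encoding before modeling.
        "non_numeric_features": X.select_dtypes(exclude="number").columns.tolist(),
        # Number of samples per class, which also reveals class imbalance.
        "class_counts": y.iloc[:, 0].value_counts().to_dict(),
    }


# Toy stand-ins (assumption); replace with the X and y fetched above.
X_toy = pd.DataFrame({"Area": [100.0, 120.0, 90.0], "Perimeter": [40.0, 44.0, 38.0]})
y_toy = pd.DataFrame({"Class": ["SEKER", "SIRA", "SEKER"]})

summary = summarize(X_toy, y_toy)
print(summary)
```

More than two keys in `class_counts` means the task is multi-class rather than binary.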

### II. All packages used for this notebook.

--------------------------------------------------------------------------------------------------------------------

```
# Standard library
import calendar
import gc  # Garbage Collector
import os
import warnings
from datetime import datetime, timedelta
from time import time

# Data handling
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Plots
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
import seaborn as sns

# Set style and figure size in one call; a later sns.set() would reset the style.
sns.set(style="whitegrid", rc={'figure.figsize': (18, 12)})
%matplotlib inline  # Jupyter magic: works in a notebook only

# Statistics
from scipy import stats
from scipy.stats import norm, zscore

warnings.filterwarnings('ignore')

# Modeling
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split, StratifiedKFold, StratifiedGroupKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier
```
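
To show how the modeling imports above fit together, here is a minimal sketch of a pipeline-based model comparison. It is an illustration under stated assumptions rather than the notebook's actual code: the data is synthetic (`make_classification`), the four classifiers use default settings, and accuracy from 5-fold stratified cross-validation is the only metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the bean features (assumption).
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=8, n_classes=4, random_state=0
)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GaussianNB": GaussianNB(),
    "KNeighbors": KNeighborsClassifier(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = {}
for name, clf in models.items():
    # Scaling lives inside the pipeline, so it is fitted on the training folds only.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    scores[name] = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()

# Print the models from most to least accurate.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")

best = max(scores, key=scores.get)
print("Most accurate model:", best)
```

Wrapping the scaler and the classifier in one `Pipeline` avoids leaking statistics from the validation folds into preprocessing, which would otherwise make the scores look better than they are.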
