{"id":26231078,"url":"https://github.com/capsuleismail/drybeanuci","last_synced_at":"2026-05-18T03:35:43.353Z","repository":{"id":276815310,"uuid":"930398761","full_name":"capsuleismail/DryBeanUCI","owner":"capsuleismail","description":"Data Science Project with Model comparison.","archived":false,"fork":false,"pushed_at":"2025-02-11T13:24:00.000Z","size":23,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-11T16:32:29.952Z","etag":null,"topics":["datascience","jupyter-notebook","machinelearning-python","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/capsuleismail.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-10T15:16:42.000Z","updated_at":"2025-02-24T14:08:13.000Z","dependencies_parsed_at":"2025-08-11T16:18:08.819Z","dependency_job_id":"ce7d9e8b-b72f-4a64-aa40-b2d57877cd59","html_url":"https://github.com/capsuleismail/DryBeanUCI","commit_stats":null,"previous_names":["capsuleismail/dry_bean_uci","capsuleismail/drybeanuci"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/capsuleismail/DryBeanUCI","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capsuleismail%2FDryBeanUCI","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capsuleismail%2FDryBeanUCI/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capsuleismail%2FDryBeanUCI/releases"
,"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capsuleismail%2FDryBeanUCI/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/capsuleismail","download_url":"https://codeload.github.com/capsuleismail/DryBeanUCI/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/capsuleismail%2FDryBeanUCI/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33163780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-17T22:39:12.733Z","status":"online","status_checked_at":"2026-05-18T02:00:06.436Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datascience","jupyter-notebook","machinelearning-python","scikit-learn"],"created_at":"2025-03-12T23:18:19.909Z","updated_at":"2026-05-18T03:35:43.335Z","avatar_url":"https://github.com/capsuleismail.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Introduction  \n---------------\n\nThe **Dry Bean Dataset** from the UCI Machine Learning Repository is a well-structured dataset used for classifying different types of beans based on their morphological features. The dataset consists of **various shape-related attributes** extracted from bean images using computer vision techniques. 
Given these numerical attributes, the goal is to build a **classification model** that can accurately predict the bean type.  \n\nIn this **[notebook](https://github.com/capsuleismail/dry_bean_uci/blob/main/dry-bean-dataset-uci.ipynb)**, we explore and analyze the Dry Bean Dataset by answering key questions related to its structure, attributes, and classification potential. We perform **exploratory data analysis (EDA)** using **histograms, boxplots, and correlation matrices** to understand feature distributions and relationships. Additionally, we implement various **machine learning models**, compare their performance, and optimize hyperparameters using **Optuna** to enhance classification accuracy.  \n\nThrough this analysis, we aim to determine the **most effective model** for distinguishing between different bean types, leveraging advanced preprocessing techniques and **machine learning pipelines** to streamline the workflow.\n\nThese are the questions I work through in the notebook:\n\n1. **What is the Dry Bean Dataset?** \u003cbr/\u003e\n2. **How many instances (rows) and attributes (columns) are present in the dataset?** \u003cbr/\u003e\n3. **What are the different classes of beans in the dataset?** \u003cbr/\u003e\n4. **What are the main features (attributes) used to describe each bean?** \u003cbr/\u003e\n5. **Are all attributes numerical, or are there categorical attributes as well?** \u003cbr/\u003e\n6. **What type of classification problem is this dataset used for? (Binary or Multi-class?)** \u003cbr/\u003e\n7. **Which machine learning algorithms can be used to classify the bean types?** \u003cbr/\u003e\n8. **Use histograms to understand the numerical features.** \u003cbr/\u003e\n9. **Use boxplots to understand the numerical features.** \u003cbr/\u003e\n10. **Use a correlation plot to understand relationships between variables.** \u003cbr/\u003e\n11. 
**What performance metrics can be used to evaluate classification models trained on this dataset?** \u003cbr/\u003e\n12. **Use a Pipeline to preprocess and model your data.** \u003cbr/\u003e\n13. **Compare different models to see which one is most accurate.** \u003cbr/\u003e\n14. **Tune hyperparameters using Optuna to improve accuracy with RandomForestClassifier.** \u003cbr/\u003e\n\n\n**Citation: Dry Bean [Dataset](https://doi.org/10.24432/C50S4B). (2020). UCI Machine Learning Repository.** \u003cbr/\u003e\n--------------------------------------------------------------------------------------------------------------------\n\n\n### I. How to import the dataset via pip.\n--------------------------------------------------------------------------------------------------------------------\n```\n# install the package first (run in a shell): pip install ucimlrepo\n\n# import the dataset into your code \nfrom ucimlrepo import fetch_ucirepo \n  \n# fetch dataset \ndry_bean = fetch_ucirepo(id=602) \n  \n# data (as pandas dataframes) \nX = dry_bean.data.features \ny = dry_bean.data.targets \n  \n# metadata \nprint(dry_bean.metadata) \n  \n# variable information \nprint(dry_bean.variables)\n\n```\n\n\n### II. 
All packages used for this notebook.\n--------------------------------------------------------------------------------------------------------------------\n```\nimport gc # Garbage Collector\n\nimport pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\nimport numpy as np # linear algebra\nimport os\n\n# Time Modules\nimport calendar\nfrom time import time\nfrom datetime import datetime, timedelta\n\npd.set_option('display.max_rows', None)\npd.set_option('display.max_columns', None)\n\n\n# Plots\nimport seaborn as sns\nimport matplotlib.pyplot as plt\nimport plotly.graph_objects as go\nimport plotly.express as px\nimport plotly.subplots as sp\n# set style and figure size in one call; a later sns.set() would reset the style\nsns.set(style=\"whitegrid\", rc={'figure.figsize':(18, 12)})\n%matplotlib inline\n\n# Statistics \nfrom scipy.stats import norm\nfrom scipy.stats import zscore\nfrom scipy import stats\n\nimport warnings\nwarnings.filterwarnings('ignore')\n\nfrom sklearn.model_selection import train_test_split, StratifiedKFold, StratifiedGroupKFold, cross_val_score\nfrom sklearn.tree import DecisionTreeClassifier\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.neighbors import KNeighborsClassifier\n\nfrom sklearn.preprocessing import MinMaxScaler, StandardScaler\nfrom sklearn.decomposition import PCA\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.pipeline import Pipeline\n\nfrom sklearn.metrics import ConfusionMatrixDisplay\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapsuleismail%2Fdrybeanuci","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcapsuleismail%2Fdrybeanuci","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcapsuleismail%2Fdrybeanuci/lists"}