Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/burhanahmed1/iris-dataset-analysis-with-pyspark
Implementation of K-means, Bisecting K-means, and Decision Tree in PySpark on the Iris dataset.
- Host: GitHub
- URL: https://github.com/burhanahmed1/iris-dataset-analysis-with-pyspark
- Owner: burhanahmed1
- License: mit
- Created: 2024-06-28T20:16:57.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-06-29T07:03:43.000Z (6 months ago)
- Last Synced: 2024-06-30T09:06:07.695Z (6 months ago)
- Topics: bisecting-kmeans, bisecting-kmeans-clustering, decision-tree, decision-trees, jupyter-notebook, kmeans, kmeans-clustering, matplotlib, pyspark, pyspark-machine-learning, pyspark-ml, pyspark-mllib, pyspark-python, python, seaborn
- Language: Jupyter Notebook
- Homepage:
- Size: 146 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Iris Dataset Analysis with PySpark
This project implements K-means, Bisecting K-means, and Decision Tree algorithms in PySpark on the Iris dataset.
## Table of Contents
- [Introduction](#introduction)
- [Description](#description)
- [Installation](#installation)
- [Usage](#usage)
- [Results](#results)
- [Technologies Used](#technologies-used)
- [Contributing](#contributing)
- [License](#license)
## Introduction
This project demonstrates the use of PySpark to perform clustering and classification on the Iris dataset. The Iris dataset is a classic dataset in machine learning and statistics: 150 samples of iris flowers, each with four measurements (sepal length, sepal width, petal length, petal width) and a species label from one of three species.
## Description
The project consists of three main components:
1. **K-means Clustering**: A traditional clustering algorithm that partitions data into k distinct clusters.
2. **Bisecting K-means Clustering**: A hierarchical variation of K-means that starts with all points in a single cluster and repeatedly splits one cluster in two until k clusters are reached.
3. **Decision Tree Classification**: A classification algorithm that uses a tree-like model to make decisions based on input features.
## Installation
To run this project, you need to have PySpark installed. You can install it using pip:
```bash
pip install pyspark
```
Additionally, you need to have Matplotlib and Seaborn for data visualization:
```bash
pip install matplotlib seaborn
```
## Usage
1. **Initialize Spark Session:** The Spark session is initialized with the name "IrisAnalysis".
2. **Load Data:** The Iris dataset is loaded from a CSV file.
3. **Prepare Features:** Features are prepared using VectorAssembler.
4. **K-means Clustering:** The K-means algorithm is applied to the data, and the results are visualized.
5. **Bisecting K-means Clustering:** The Bisecting K-means algorithm is applied, and the results are visualized.
6. **Decision Tree Classification:** A decision tree classifier is trained, evaluated, and results are visualized.
## Results
+ **K-means Silhouette Score:** The silhouette score for the K-means clustering model is **0.7482**.
+ **Bisecting K-means Silhouette Score:** The silhouette score for the Bisecting K-means clustering model is **0.6682**.
+ **Decision Tree Accuracy:** The accuracy of the Decision Tree classifier is **1.0**.
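To help interpret the silhouette scores above: for each point, let a be the mean distance to the other points in its own cluster and b the mean distance to the points of the nearest other cluster; the point's silhouette is (b - a) / max(a, b), and the score is the mean over all points. Values range from -1 to 1, with higher values indicating tighter, better-separated clusters. The pure-Python sketch below illustrates the definition. It uses squared Euclidean distance because that is the default of PySpark's `ClusteringEvaluator`; whether the notebook kept that default is an assumption, and the function names here are illustrative only.

```python
def squared_euclidean(p, q):
    # Squared Euclidean distance between two equal-length sequences.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def silhouette_score(points, labels, dist=squared_euclidean):
    """Mean silhouette over all points; assumes at least two clusters."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so dividing by n - 1 is exact).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

For example, labelling two well-separated pairs of points as two clusters yields a score close to 1, while a labelling that mixes the pairs yields a much lower score.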
Visualizations for each clustering method and the decision tree classification are generated using Matplotlib and Seaborn.
## Technologies Used
+ **PySpark:** For data processing and machine learning.
+ **Matplotlib:** For data visualization.
+ **Seaborn:** For enhanced data visualization.
## Contributing
Contributions are welcome! Please open an issue or submit a pull request for any improvements or additions.
## License
This project is licensed under the MIT License.