https://github.com/r-mahesh45/principle-component-analysis-pca-
This project uses PCA for dimensionality reduction and applies hierarchical and K-means clustering to identify optimal clusters, validated against the original data.
https://github.com/r-mahesh45/principle-component-analysis-pca-
pca-analysis python3
Last synced: 6 months ago
JSON representation
This project uses PCA for dimensionality reduction and applies hierarchical and K-means clustering to identify optimal clusters, validated against the original data.
- Host: GitHub
- URL: https://github.com/r-mahesh45/principle-component-analysis-pca-
- Owner: R-Mahesh45
- Created: 2024-03-07T09:49:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-04T16:04:18.000Z (9 months ago)
- Last Synced: 2025-01-30T07:16:10.499Z (8 months ago)
- Topics: pca-analysis, python3
- Language: Jupyter Notebook
- Homepage:
- Size: 3.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Principal Component Analysis and Clustering
This project demonstrates how to apply **Principal Component Analysis (PCA)** for dimensionality reduction and perform **clustering** using the first three principal component scores. Both **hierarchical clustering** and **K-means clustering** are used to determine the optimal number of clusters, and the results are validated against the original data.
---
## Table of Contents
- [Introduction](#introduction)
- [Project Workflow](#project-workflow)
- [Results and Validation](#results-and-validation)
- [Visualizations](#visualizations)
- [Technologies Used](#technologies-used)
- [Installation](#installation)
- [Usage](#usage)
- [Future Enhancements](#future-enhancements)
- [Contributing](#contributing)---
## Introduction
Principal Component Analysis is applied to reduce high-dimensional data into fewer dimensions (principal components). Using these principal components:
1. Perform **clustering** to group similar data points.
2. Determine the **optimal number of clusters** using techniques such as the scree plot and elbow curve.
3. Validate the clustering results with the original data (ignoring the `class` column initially).---
## Project Workflow
1. **Data Preprocessing**:
- Scale the data using `StandardScaler`.
- Apply PCA to extract principal component scores.
2. **Clustering**:
- Perform hierarchical clustering on the first three principal components.
- Use K-means clustering to analyze the same.
3. **Determine Optimal Clusters**:
- Generate a scree plot for PCA.
- Use an elbow curve to find the optimal number of clusters.
4. **Validation**:
- Compare clustering results with the original dataset's class column.
5. **Visualization**:
- Plot dendrograms, elbow curves, and scatter plots for better understanding.---
## Results and Validation
- Using the first 3 principal components, we identified clusters that closely match the original dataset's clusters.
- The optimal number of clusters is determined using the elbow method and scree plot.
- Results are validated with the original data's class column, confirming consistency.---
## Visualizations
1. **Scree Plot**: To analyze the variance explained by each principal component.
2. **Elbow Curve**: To find the optimal number of clusters for K-means.
3. **Dendrogram**: For hierarchical clustering visualization.
4. **Cluster Scatter Plot**: To visualize clustering results in 2D/3D.---
## Technologies Used
- **Python**: Programming language.
- **Libraries**:
- `pandas` and `numpy`: Data manipulation.
- `scikit-learn`: PCA, K-means clustering, scaling.
- `matplotlib` and `seaborn`: Visualization.---
## Installation
1. Clone the repository:
```bash
git clone https://github.com/R-Mahesh45/pca-clustering.git
cd pca-clustering
```
2. Install required dependencies:
```bash
pip install -r requirements.txt
```---
## Usage
1. Run the script to perform PCA and clustering:
```bash
python pca_clustering.py
```
2. Modify clustering parameters, such as the number of clusters, in the code for experiments.
3. View visualizations for analysis.---
## Future Enhancements
- Extend the clustering to more datasets.
- Automate optimal cluster detection using silhouette scores.
- Add interactive visualizations using Plotly or Dash.---
## Contributing
Contributions are welcome! Please fork the repository and submit a pull request for improvements.