https://github.com/melikaas/lda_classification
This repository focuses on implementing Linear Discriminant Analysis (LDA) as a classifier on embedded scikit-learn datasets. The goal is to improve the model's performance on these datasets.
classification machine-learning-algorithms scatterplot scikit-learn
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/melikaas/lda_classification
- Owner: MelikaaS
- Created: 2024-08-25T11:48:02.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-08-26T09:06:08.000Z (4 months ago)
- Last Synced: 2024-09-23T17:25:29.204Z (3 months ago)
- Topics: classification, machine-learning-algorithms, scatterplot, scikit-learn
- Language: Jupyter Notebook
- Homepage:
- Size: 940 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# LDA_Classification
This repository focuses on implementing Linear Discriminant Analysis (LDA) as a classifier on embedded scikit-learn datasets. The goal is to improve the model's performance on these datasets.

## Dataset Overview
This repository utilizes two small embedded scikit-learn datasets: **`load_wine`** and **`load_digits`**.
- **`load_wine`**: A small dataset containing 178 data points with 13 features serving as predictors and three target classes.
- **`load_digits`**: A larger dataset containing 1,797 data points with 64 features serving as predictors and 10 target classes.

## Operations
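
Before running the steps below, the dataset sizes quoted in the overview can be confirmed directly from scikit-learn. This is a minimal sketch rather than code taken from the notebook; the variable names and the `datasets` dictionary are illustrative assumptions.

```python
from sklearn.datasets import load_digits, load_wine

# Both loaders return a Bunch with .data, .target and .target_names.
wine = load_wine()
digits = load_digits()

# Illustrative container, not part of the original notebook.
datasets = {"load_wine": wine, "load_digits": digits}

for name, ds in datasets.items():
    n_samples, n_features = ds.data.shape
    n_classes = len(ds.target_names)
    print(f"{name}: {n_samples} samples, {n_features} features, {n_classes} classes")
```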
### On the `load_wine` Dataset:
1. Fitted the LDA model on the `load_wine` dataset.
2. Extracted the `explained_variance_ratio_` to analyze the contribution of each LDA component.
3. Evaluated the model by calculating the accuracy using **`accuracy_score`**.
4. Conducted a visual analysis using a scatter plot of the LDA components.

### On the `load_digits` Dataset:
1. Fitted the LDA model on the `load_digits` dataset.
2. Extracted the `explained_variance_ratio_` to understand the significance of each LDA component.
3. Evaluated the model's accuracy using **`accuracy_score`**.
4. Conducted a visual analysis using scatter plots of the LDA components.
5. Standardized the data using **`StandardScaler`**.
6. Re-fitted the LDA model on the standardized data to assess the impact of standardization.
7. Recalculated the accuracy using **`accuracy_score`**.
8. Made additional visual observations based on the standardized data.
9. Implemented a pipeline to streamline data standardization and feature selection before fitting the LDA model.
10. Used cross-validation to evaluate the models' performance.
11. Calculated accuracy scores for the pipelined data.
12. Made further visual observations.

## Results
On the relatively small `load_wine` dataset, the LDA model achieved perfect classification, with an accuracy score of **1.0**. The three classes are cleanly separated, as shown in the scatter plot of LDA Component 0 against LDA Component 1.

![LDA classification on **`load_wine`** dataset](https://github.com/MelikaaS/LDA_Classification/blob/main/Screenshot%20from%202024-08-26%2011-58-36.jpg)
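
The `load_wine` workflow (fit, inspect `explained_variance_ratio_`, score with `accuracy_score`, plot the components) might look roughly like the sketch below. It is not the notebook's code. It assumes accuracy is computed on the same data the model was fitted on, since the README does not mention a train/test split, and `plot_components` is a hypothetical helper name.

```python
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)

# With three classes, LDA yields at most two discriminant components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# Share of between-class variance captured by each component.
print("explained_variance_ratio_:", lda.explained_variance_ratio_)

# Assumption: accuracy is measured on the same data used for fitting.
train_accuracy = accuracy_score(y, lda.predict(X))
print("accuracy:", train_accuracy)


def plot_components(X_lda, y):
    """Scatter plot of LDA Component 0 against LDA Component 1 (hypothetical helper)."""
    # Imported lazily so the rest of the sketch runs without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(X_lda[:, 0], X_lda[:, 1], c=y)
    ax.set_xlabel("LDA Component 0")
    ax.set_ylabel("LDA Component 1")
    return fig
```

Calling `plot_components(X_lda, y)` returns a figure comparable in layout to the screenshot above, with points colored by class.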
----
The `load_digits` dataset contains 1,797 data points, 64 predictors, and 10 target classes. The table below shows the results of applying LDA to the `load_digits` dataset:

| Step | Description | Accuracy Score |
|---------------------------------------------------|-----------------------------------------------------------|-------------------------|
| LDA model fitted on `load_digits` dataset | Initial model without any preprocessing | 0.9638 |
| Data standardized with `StandardScaler()` | Data was standardized before fitting the LDA model | 0.9638 |
| Pipeline: Standardization and feature selection | Standardization and PCA applied before LDA through pipeline | 0.9638 |

---
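
The three configurations in the table can be sketched with scikit-learn pipelines and cross-validation as follows. This is an illustration under stated assumptions, not the notebook's code: the README does not say how many PCA components were kept or how many folds were used, so `PCA(n_components=0.95)` and `cv=5` are placeholder choices, and the dictionary keys are illustrative labels.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

models = {
    # Row 1: LDA without any preprocessing.
    "lda": LinearDiscriminantAnalysis(),
    # Row 2: standardize, then LDA.
    "scaler + lda": make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()),
    # Row 3: standardize, reduce with PCA, then LDA.
    # Assumption: keep enough components to explain 95% of the variance.
    "scaler + pca + lda": make_pipeline(
        StandardScaler(), PCA(n_components=0.95), LinearDiscriminantAnalysis()
    ),
}

# Assumption: 5-fold cross-validation scored by accuracy.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.4f}")
```

Wrapping the scaler and PCA in the pipeline means they are refit on each training fold, so no information from the held-out fold leaks into preprocessing. Cross-validated scores will generally differ from the figures in the table, which may have been computed differently.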
## Conclusion
The **`load_digits`** dataset in scikit-learn is a well-known dataset used for classification tasks.
The identical accuracy scores across the three approaches (direct LDA, LDA after standardization, and LDA in a pipeline) suggest that the features are already quite effective for classification and that the transformations do not alter the feature space in a way that affects classification performance. For standardization in particular, this is consistent with the fact that, in theory, rescaling individual features does not change LDA's decision rule.
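
One way to probe this conclusion is to compare the predictions of LDA fitted on the raw data with those of LDA fitted on the standardized data, rather than comparing accuracy scores alone. The sketch below is an assumption-laden illustration, not part of the notebook. It evaluates on the fitting data, and `agreement` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# LDA on the raw pixel features.
pred_raw = LinearDiscriminantAnalysis().fit(X, y).predict(X)

# LDA on standardized features (zero mean, unit variance per feature).
X_std = StandardScaler().fit_transform(X)
pred_std = LinearDiscriminantAnalysis().fit(X_std, y).predict(X_std)

# Fraction of samples on which the two models predict the same class.
agreement = np.mean(pred_raw == pred_std)
print("prediction agreement:", agreement)
```

An agreement close to 1 would indicate that standardization leaves the fitted classifier essentially unchanged, which would explain matching accuracy scores.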