https://github.com/nishant2018/pca-feature-selection-scratch
Principal Component Analysis (PCA) is a powerful dimensionality reduction technique commonly used in machine learning and data analysis. It transforms a dataset into a set of linearly uncorrelated variables called principal components.
feature-selection linear-algebra machine-learning pca statistics
- Host: GitHub
- URL: https://github.com/nishant2018/pca-feature-selection-scratch
- Owner: Nishant2018
- Created: 2024-06-10T09:03:10.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-10T10:11:28.000Z (about 2 years ago)
- Last Synced: 2025-02-26T15:17:17.467Z (over 1 year ago)
- Topics: feature-selection, linear-algebra, machine-learning, pca, statistics
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/code/endofnight17j03/pca-feature-selection-scratch
- Size: 669 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Principal Component Analysis (PCA)
### Introduction
Principal Component Analysis (PCA) is a powerful dimensionality reduction technique commonly used in machine learning and data analysis. It transforms a dataset into a set of linearly uncorrelated variables called principal components. The primary goal of PCA is to reduce the dimensionality of the data while retaining as much variability as possible.
### Why Use PCA?
- **Dimensionality Reduction**: Simplifies the dataset by reducing the number of features.
- **Noise Reduction**: Discarding low-variance components can filter out noise and the redundancy among correlated features.
- **Visualization**: Makes it easier to visualize high-dimensional data in 2D or 3D space.
- **Improved Performance**: Fewer input dimensions can reduce overfitting and training cost for downstream machine learning algorithms, although the benefit depends on the dataset and model.
### How PCA Works
1. **Standardize the Data**: PCA is affected by the scale of the variables, so it's essential to standardize the dataset.
\[
z = \frac{x - \mu}{\sigma}
\]
Where \( z \) is the standardized value, \( x \) is the original value, and \( \mu \) and \( \sigma \) are the mean and standard deviation of that feature.
2. **Compute the Covariance Matrix**: Measure the variance of each variable and how each pair of variables varies together.
\[
\mathbf{C} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T
\]
Where \( \mathbf{C} \) is the covariance matrix, \( n \) is the number of samples, \( x_i \) is the \( i \)-th sample, and \( \bar{x} \) is the mean vector.
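3. **Calculate the Eigenvalues and Eigenvectors**: The eigenvectors of the covariance matrix give the directions (axes) of the new feature space, and each eigenvalue gives the variance of the data along its eigenvector, which is how the importance of a direction is measured.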
\[
\mathbf{C} \mathbf{v} = \lambda \mathbf{v}
\]
Where \( \mathbf{v} \) is the eigenvector and \( \lambda \) is the eigenvalue.
4. **Sort Eigenvalues and Eigenvectors**: Rank the eigenvalues and their corresponding eigenvectors in descending order.
5. **Select Principal Components**: Choose the \( k \) eigenvectors with the largest eigenvalues and use them as the columns of a projection matrix \( \mathbf{W} \).
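6. **Transform the Data**: Project the standardized dataset onto the new feature space.
\[
\mathbf{Y} = \mathbf{W}^T \mathbf{X}
\]
Where \( \mathbf{Y} \) is the transformed dataset, \( \mathbf{W} \) is the matrix whose columns are the selected eigenvectors, and \( \mathbf{X} \) is the standardized dataset with one sample per column. If samples are stored as rows instead, as in `scikit-learn`, the equivalent form is \( \mathbf{Y} = \mathbf{X} \mathbf{W} \).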
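
The six steps above can be sketched in plain NumPy. This is a minimal illustration rather than the notebook's actual implementation: the function name `pca_from_scratch` and its return values are assumptions made for this example. It stores samples as rows, so the projection uses the row-wise form \( \mathbf{Y} = \mathbf{Z} \mathbf{W} \), and it assumes no feature has zero variance.

```python
import numpy as np


def pca_from_scratch(X, k):
    """Hypothetical helper: project X (n_samples x n_features) onto its top-k components."""
    # Step 1: standardize each feature (assumes every feature has non-zero variance)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    Z = (X - mu) / sigma

    # Step 2: covariance matrix of the features (np.cov divides by n - 1)
    C = np.cov(Z, rowvar=False)

    # Step 3: eigen-decomposition; eigh is intended for symmetric matrices such as C
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Step 4: sort eigenvalues and their eigenvectors in descending order
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Step 5: keep the top-k eigenvectors as the columns of W
    W = eigenvectors[:, :k]

    # Step 6: project the standardized data (samples are rows, so Y = Z W)
    Y = Z @ W

    # Share of the total variance carried by each kept component
    explained_ratio = eigenvalues[:k] / eigenvalues.sum()
    return Y, W, explained_ratio


X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Y, W, explained_ratio = pca_from_scratch(X, k=2)
print("Projected data:\n", Y)
print("Explained variance ratio:", explained_ratio)
```

The sign of each eigenvector is arbitrary, so the projected values may differ in sign from those returned by another PCA implementation while describing the same components.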
### Example Code
Here is a simple example of how to perform PCA using Python's `scikit-learn` library:
```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Sample data
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0], [2.3, 2.7], [2, 1.6], [1, 1.1], [1.5, 1.6], [1.1, 0.9]])
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
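# Apply PCA (n_components=2 keeps both dimensions; use n_components=1 to reduce to one)
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)
print("Principal Components:\n", principal_components)
# Fraction of the total variance captured by each component
print("Explained variance ratio:", pca.explained_variance_ratio_)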