Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/saritaphd/pca-and-svd
PCA and SVD working codes with basic details about them.
Last synced: 3 days ago
- Host: GitHub
- URL: https://github.com/saritaphd/pca-and-svd
- Owner: SaritaPhD
- License: mit
- Created: 2023-10-18T09:46:57.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-18T09:59:59.000Z (about 1 year ago)
- Last Synced: 2023-10-18T10:47:21.711Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 736 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# PCA-and-SVD
Working code for PCA and SVD, with basic explanations of both techniques.
Principal Component Analysis (PCA) is a dimensionality reduction technique and a fundamental tool in multivariate statistics and machine learning. PCA reduces the complexity of high-dimensional data while retaining its dominant trends and patterns. Here's a comprehensive step-by-step explanation of PCA (a NumPy sketch of these steps follows the list):
- 1. Data Preparation: Begin with a dataset containing high-dimensional data, typically with many variables (features) that may be correlated.
- 2. Standardization: If the variables are not measured in the same units or have different scales, standardize the data by subtracting the mean and dividing by the standard deviation for each variable. This step ensures that all variables have a mean of 0 and a standard deviation of 1.
- 3. Covariance Matrix: Compute the covariance matrix of the standardized data. The covariance matrix describes the relationships and dependencies between variables.
- 4. Eigendecomposition: Perform eigendecomposition on the covariance matrix. This involves finding the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the eigenvalues indicate their importance.
- 5. Eigenvalues and Explained Variance: Sort the eigenvalues in descending order to identify the most significant principal components. The eigenvalues represent the amount of variance explained by each principal component.
- 6. Selecting Principal Components: Decide on the number of principal components to retain based on the explained variance. You can choose to retain a specific number of top components, or you can set a threshold for explained variance (e.g., 95% of the total variance). The goal is to reduce dimensionality while retaining as much information as possible.
- 7. Projection: Transform the original data into the new feature space defined by the selected principal components. This is done by taking the dot product of the data with the selected eigenvectors.
- 8. Data Reconstruction (Optional): If necessary, you can reverse the dimensionality reduction step by projecting the data back into the original feature space. However, this may not yield exactly the same data, as some information may have been lost.
- 9. Data Analysis and Visualization: Analyze the reduced data in the new feature space. You can visualize the data to explore its structure or use it as input for other machine learning algorithms. PCA simplifies the data while preserving the dominant trends.
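To make these steps concrete, here is a minimal NumPy sketch of steps 2-8 on synthetic data; the array `X` and the 95% variance threshold are illustrative assumptions, not taken from this repository's notebooks.

```python
import numpy as np

# Synthetic data: 200 samples, 10 correlated features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

# Step 2: standardize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: covariance matrix of the standardized data (features in columns).
cov = np.cov(X_std, rowvar=False)

# Step 4: eigendecomposition; eigh is appropriate for symmetric matrices.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 5: sort eigenvalues (and their eigenvectors) in descending order.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_ratio = eigvals / eigvals.sum()

# Step 6: keep enough components to explain ~95% of the variance (assumed threshold).
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.95)) + 1

# Step 7: project the standardized data onto the top-k principal components.
X_reduced = X_std @ eigvecs[:, :k]

# Step 8 (optional): approximate reconstruction in the standardized space.
X_approx = X_reduced @ eigvecs[:, :k].T

print(k, X_reduced.shape, round(explained_ratio[:k].sum(), 3))
```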
Advanced Concepts in PCA (a scikit-learn sketch follows this list):
- Centering vs. Scaling: PCA can be applied to uncentered (raw) data, but it's common to center the data by subtracting the mean. Scaling (standardization) ensures that variables with different units and scales do not dominate the principal components.
- Kernel PCA: In cases where linear PCA is not suitable due to non-linearity in the data, Kernel PCA can be applied, using a kernel function to map the data into a higher-dimensional space.
- Incremental PCA: For very large datasets that don't fit into memory, Incremental PCA can be used to compute principal components in a batch-wise manner.
- Choosing the Number of Components: Methods like the elbow method or a scree plot can help in selecting the appropriate number of principal components based on explained variance.
- Interpretability: Principal components may not always have a clear physical interpretation, but they capture patterns in the data.
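As one possible illustration (scikit-learn is an assumption here, not confirmed by this repository), the following sketch shows how explained-variance ratios for a scree plot, Kernel PCA, and Incremental PCA might look on purely synthetic data.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))              # synthetic stand-in data

# Scree-plot data: explained-variance ratios from an ordinary PCA fit,
# useful for the elbow method when choosing the number of components.
pca = PCA().fit(X)
print(pca.explained_variance_ratio_[:5])

# Kernel PCA: an RBF kernel implicitly maps the data into a higher-dimensional
# space before extracting components, handling non-linear structure.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X)

# Incremental PCA: fits in mini-batches, for datasets that don't fit in memory.
ipca = IncrementalPCA(n_components=2, batch_size=100)
X_ipca = ipca.fit_transform(X)

print(X_kpca.shape, X_ipca.shape)
```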
PCA is a powerful tool for reducing dimensionality and uncovering patterns in high-dimensional data. Understanding the explained variance and the impact of dimensionality reduction on data analysis is crucial for effectively applying PCA.

# Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique used in various fields, including linear algebra, statistics, and machine learning. SVD decomposes a matrix into three other matrices, revealing important information about the original matrix. Here's a comprehensive step-by-step explanation of SVD (a NumPy sketch follows the step list):
- 1. Matrix Notation: Begin with a matrix, typically denoted as A, that you want to decompose using SVD. Matrix A can be of any size, M x N.
- 2. Centering Data (Optional): If you're working with data, you may want to center it by subtracting the mean along each dimension so that each has zero mean. Centering is common when applying SVD to data analysis.
- 3. SVD Decomposition: The SVD of matrix A is written as:
A = UΣV^T
Where:
- U: the left singular vectors (an M x M orthogonal matrix).
- Σ (Sigma): an M x N diagonal matrix with the non-negative singular values on its diagonal and zeros elsewhere.
- V^T (the transpose of V): the right singular vectors (an N x N orthogonal matrix).
- 4. Singular Value Matrix Σ: The singular value matrix Σ contains the singular values of matrix A, ordered from largest to smallest. They represent the importance of the corresponding singular vectors in explaining the variance in the data.
- 5. Rank of the Matrix: The rank of matrix A is equal to the number of non-zero singular values in Σ. This rank is often used to determine the effective dimensionality of the data.
- 6. Dimensionality Reduction: To reduce the dimensionality of your data while retaining as much information as possible, you can retain only the top k singular values together with the corresponding columns of U and rows of V^T. This is particularly useful in data compression and noise reduction.
- 7. Pseudo-Inverse (Moore-Penrose Inverse): The pseudo-inverse of matrix A, denoted A⁺, can be computed from the SVD as follows:
A⁺ = VΣ⁺U^T
Where Σ⁺ is the pseudo-inverse of Σ, obtained by taking the reciprocal of each non-zero singular value and transposing the result. The pseudo-inverse is used for solving linear systems of equations when A is not full rank or not square.
- 8. Applications of SVD: SVD has numerous applications, including:
  - Data compression and denoising.
  - Recommender systems (collaborative filtering).
  - Principal Component Analysis (PCA).
  - Image compression (in image processing).
  - Solving linear systems of equations (when A is not full rank).
  - Latent Semantic Analysis (LSA) in natural language processing.
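A minimal NumPy sketch of the decomposition, rank, rank-k approximation, and pseudo-inverse described above; the matrix `A` and the chosen `k` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                       # an arbitrary M x N matrix (M=6, N=4)

# Step 3: full SVD, A = U @ Sigma @ V^T.
U, S, Vt = np.linalg.svd(A, full_matrices=True)   # U: 6x6, S: (4,), Vt: 4x4

# Rebuild the M x N Sigma matrix from the vector of singular values.
Sigma = np.zeros_like(A)
np.fill_diagonal(Sigma, S)
print(np.allclose(A, U @ Sigma @ Vt))             # True: the factorization reproduces A

# Step 5: rank = number of non-negligible singular values.
rank = int(np.sum(S > 1e-10))

# Step 6: rank-k approximation keeps only the top-k singular triplets.
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Step 7: Moore-Penrose pseudo-inverse, A+ = V Sigma+ U^T
# (Sigma+ reciprocates the non-zero singular values and transposes).
A_pinv = Vt.T @ np.diag(1.0 / S) @ U[:, :S.size].T
print(np.allclose(A_pinv, np.linalg.pinv(A)))     # True (A has full column rank here)

print(rank, A_k.shape, A_pinv.shape)
```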
Advanced Concepts in SVD (a scikit-learn sketch follows this list):
- Truncated SVD: Truncated SVD retains only the top k singular values, resulting in a reduced-rank approximation of the original matrix. It's a powerful technique for dimensionality reduction.
- Sparse SVD: When dealing with very large and sparse matrices, sparse SVD algorithms can be more efficient.
- Numerical Stability: Care should be taken to handle numerical stability issues, especially when working with matrices that have very small or very large singular values.
- Randomized SVD: Randomized SVD techniques are used to approximate the SVD for large datasets, often much faster than the exact decomposition.
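As one possible illustration (not taken from this repository), scikit-learn's `TruncatedSVD`, whose default solver is randomized and which accepts sparse input, combines the truncated, sparse, and randomized ideas above; the matrix sizes and component count below are arbitrary.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A large, sparse matrix (illustrative): 1000 x 500 with ~1% non-zero entries.
X = sparse_random(1000, 500, density=0.01, random_state=0)

# Truncated SVD with a randomized solver; it works directly on sparse input.
svd = TruncatedSVD(n_components=20, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)                        # (1000, 20)
print(svd.explained_variance_ratio_.sum())    # variance captured by 20 components
```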
SVD is a versatile and powerful tool for analyzing and decomposing data. Understanding its components and how to apply it effectively is essential for a wide range of applications in data analysis, linear algebra, and machine learning.