https://github.com/ai-ahmed/gen_fex
Probabilistic PCA and PKPCA for Stochastic Feature Extraction and Missing Data Reconstruction
bayesian chex distrax finance jax pkpca ppca probabilistic probabilistic-models python quantitative-finance stochastic
- Host: GitHub
- URL: https://github.com/ai-ahmed/gen_fex
- Owner: AI-Ahmed
- License: apache-2.0
- Created: 2024-04-08T21:00:52.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-12-02T04:34:21.000Z (6 months ago)
- Last Synced: 2026-01-28T13:11:16.005Z (4 months ago)
- Topics: bayesian, chex, distrax, finance, jax, pkpca, ppca, probabilistic, probabilistic-models, python, quantitative-finance, stochastic
- Language: Python
- Homepage:
- Size: 1.63 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Probabilistic Feature Extraction in JAX
## Overview
**gen_fex** is a high-performance library for **Probabilistic Feature Extraction** and **Generative Modeling**, built on top of [JAX](https://github.com/google/jax). It is designed to handle high-dimensional, sparse time-series data, making it particularly effective for financial modeling in high-risk regimes.
This repository accompanies the manuscript **"Generative Modeling for High-Dimensional Sparse Data: Probabilistic Feature Extraction in High-Risk Financial Regimes"**. It implements robust probabilistic models that outperform conventional methods in capturing non-linear, time-dependent features, especially during volatile market conditions.
## Key Features
- **JAX-Accelerated**: Leverages JAX for high-performance numerical computing and automatic differentiation.
- **scikit-learn Compatible**: Fully compatible with the `scikit-learn` API (`fit`, `transform`, `inverse_transform`), allowing seamless integration into existing ML pipelines.
- **High-Dimensional Efficiency**: Automatically handles the "Transpose Trick" (Dual formulation) to efficiently process datasets where features ($D$) far exceed samples ($N$).
- **Missing Data Imputation**: Robust reconstruction of missing values in sparse datasets.
- **Advanced Models**:
  - **PPCA (Probabilistic PCA)**: A probabilistic framework for PCA that handles noise and missing data.
  - **PKPCA (Probabilistic Kernel PCA)**: Extends PPCA with kernel methods (e.g., RBF) and Wishart processes to capture non-linear structures.
## Installation
### Prerequisites
- Python **3.10** or newer.
### Install via pip
You can install the package directly from GitHub:
```bash
pip install git+https://github.com/AI-Ahmed/gen_fex.git
```
### Development Installation
If you want to contribute or modify the code:
1. **Clone the repository:**
   ```bash
   git clone https://github.com/AI-Ahmed/gen_fex.git
   cd gen_fex
   ```
2. **Install using Flit:**
   ```bash
   pip install flit
   flit install --deps develop --extras test --symlink
   ```
## Quick Start
Here is a simple example of how to use the `PPCA` and `PKPCA` classes.
```python
import numpy as np
from gen_fex import PPCA, PKPCA

# 1. Generate synthetic high-dimensional data (Samples < Features)
# Shape: (N_samples, D_features)
N, D = 100, 1000
data = np.random.rand(N, D)

# 2. Initialize Models
# We choose a latent dimension q
q = 50
ppca = PPCA(q=q)
pkpca = PKPCA(q=q)

# 3. Fit Models
# The models automatically handle the high-dimensional nature (N < D)
print("Fitting PPCA...")
ppca.fit(data, use_em=True, verbose=1)
print("Fitting PKPCA...")
pkpca.fit(data, use_em=True, verbose=1)

# 4. Transform (Dimensionality Reduction)
latent_ppca = ppca.transform()
latent_pkpca = pkpca.transform()
print(f"Original Shape: {data.shape}")
print(f"PPCA Latent Shape: {latent_ppca.shape}")    # (q, D) - Latent features
print(f"PKPCA Latent Shape: {latent_pkpca.shape}")  # (q, D) - Latent features

# Note: The model decomposes X approx W @ Z
# W: (N, q) - Sample embeddings
# Z: (q, D) - Latent features (returned by transform)

# 5. Reconstruction (Inverse Transform)
recon_ppca = ppca.inverse_transform(latent_ppca)
print(f"Reconstructed Shape: {recon_ppca.shape}")
```
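One way to judge a reconstruction is the relative Frobenius error between the original matrix and its reconstruction. The sketch below defines a small helper for that. `relative_reconstruction_error` is a hypothetical name and not part of gen_fex, and the demo uses a truncated SVD as a stand-in for a fitted model so that it runs with NumPy alone.

```python
import numpy as np


def relative_reconstruction_error(original, reconstructed):
    """Frobenius norm of the residual divided by the Frobenius norm of the original."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.linalg.norm(original - reconstructed) / np.linalg.norm(original)


# Stand-in demo: a rank-q truncated SVD plays the role of a fitted model here.
rng = np.random.default_rng(0)
data = rng.random((100, 1000))
q = 50

U, s, Vt = np.linalg.svd(data, full_matrices=False)
recon = (U[:, :q] * s[:q]) @ Vt[:q]  # best rank-q approximation of data

err = relative_reconstruction_error(data, recon)
print(f"Relative reconstruction error at q={q}: {err:.3f}")
```

With the models from the Quick Start, the same helper could be called as `relative_reconstruction_error(data, recon_ppca)`, assuming `recon_ppca` has the same shape as `data`.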
## Mathematical Background
### Probabilistic PCA (PPCA)
PPCA in the matrix-variate setting models the observed data matrix $P \in \mathbb{R}^{N \times D}$ using
latent variables $Z \in \mathbb{R}^{q \times D}$ with a linear Gaussian generative structure:
$$
P = WZ + \mu + E
$$
where:
- $W \in \mathbb{R}^{N \times q}$ is the loading matrix,
- $\mu \in \mathbb{R}^{N \times D}$ is the mean matrix,
- $E \in \mathbb{R}^{N \times D}$ is the noise matrix.
The latent variables follow an isotropic Gaussian:
$$
Z \sim \mathcal{N}(0, I_q)
$$
and the noise is modeled using a matrix-variate Gaussian distribution:
$$
E \sim \mathcal{N}_{N \times D}
( 0,\ \sigma^2 I_N,\ I_D ).
$$
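The generative structure can be illustrated by sampling from it directly. The following is a minimal NumPy sketch, not the library's implementation: the sizes, the noise level `sigma`, and the zero mean matrix are assumed values chosen for illustration, and $Z$ is taken as $q \times D$ so that the product $WZ$ matches the $N \times D$ shape of $P$. Because the noise has row covariance $\sigma^2 I_N$ and column covariance $I_D$, its entries are independent with variance $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): N samples, D features, q latent dimensions.
N, D, q = 20, 200, 3
sigma = 0.1  # noise standard deviation (assumed value)

W = rng.normal(size=(N, q))          # loading matrix, N x q (random here)
Z = rng.normal(size=(q, D))          # latent variables with i.i.d. N(0, 1) entries
mu = np.zeros((N, D))                # mean matrix, taken as zero in this sketch
E = sigma * rng.normal(size=(N, D))  # i.i.d. noise entries with variance sigma^2

P = W @ Z + mu + E                   # observed data matrix, N x D

# The noise-free signal W @ Z has rank at most q.
signal_rank = np.linalg.matrix_rank(W @ Z)
print(P.shape, signal_rank)
```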
### Dual Formulation & The Transpose Trick
For high-dimensional data where the number of features $D$ is much larger than the number of samples $N$ ($D \gg N$), standard PCA is computationally expensive because it eigendecomposes the $D \times D$ covariance matrix at $O(D^3)$ cost. **gen_fex** implements the **Dual PPCA** formulation (often called the "Transpose Trick"), which operates on the $N \times N$ Gram matrix instead, reducing the eigendecomposition cost to $O(N^3)$.
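The identity behind the transpose trick can be checked numerically with classical (non-probabilistic) PCA. This sketch is an illustration in plain NumPy, not the Dual PPCA code in gen_fex, and the sizes are assumed values. It eigendecomposes the $N \times N$ Gram matrix, maps the eigenvectors back to feature space, and compares the eigenvalues with those of the $D \times D$ scatter matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, q = 30, 500, 5          # illustrative sizes with D >> N
X = rng.normal(size=(N, D))
Xc = X - X.mean(axis=0)       # center each feature

# Dual route: eigendecompose the N x N Gram matrix.
gram = Xc @ Xc.T
evals, evecs = np.linalg.eigh(gram)  # eigh returns ascending eigenvalues
evals, evecs = evals[::-1][:q], evecs[:, ::-1][:, :q]

# Map Gram eigenvectors u_i to feature space: v_i = Xc^T u_i / sqrt(lambda_i).
components = (Xc.T @ evecs) / np.sqrt(evals)  # D x q with orthonormal columns

# Primal route for comparison (feasible here only because D is small).
scatter_evals = np.linalg.eigvalsh(Xc.T @ Xc)[::-1][:q]

print(np.allclose(evals, scatter_evals))
```

The leading eigenvalues of both matrices coincide, so the expensive $D \times D$ decomposition is never needed when only the top components are required.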
### Probabilistic Kernel PCA (PKPCA)
PKPCA extends this by mapping data into a non-linear feature space using a kernel function (e.g., RBF). Our implementation utilizes a **Wishart Process** prior for the covariance matrix, allowing for robust uncertainty quantification in the kernel space.
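For orientation, the kernel step on its own can be sketched with classical kernel PCA. The code below is an assumption-laden illustration: it omits the Wishart process prior and any probabilistic treatment, and the helper names `rbf_kernel` and `kernel_pca` as well as the `gamma` value are hypothetical, not part of gen_fex.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """RBF kernel matrix with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def kernel_pca(X, q, gamma):
    """Classical (non-probabilistic) kernel PCA scores for the rows of X."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    Kc = H @ K @ H                       # kernel centered in feature space
    evals, evecs = np.linalg.eigh(Kc)    # ascending eigenvalues
    evals, evecs = evals[::-1][:q], evecs[:, ::-1][:, :q]
    return evecs * np.sqrt(np.maximum(evals, 0.0))  # n x q scores


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
scores = kernel_pca(X, q=3, gamma=0.05)
print(scores.shape)
```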
## Results & Performance
We evaluate our models on high-dimensional sparse financial data. Below are comparisons of model performance and reconstruction quality.
### Model Performance Comparison

*Fig. 2. Comparison of the negative log-likelihood between high-dimensional PPCA and PKPCA over 20 iterations (T4 GPU).*
### Reconstructed Data Comparison

*Fig. 8. Monthly reconstructed correlation between PPCA and PKPCA for assets in the R2 regime. PKPCA shows a clear divergence from PPCA, particularly between the IT and Materials sectors, reflecting sector-specific performance during this period.*
## Directory Structure
```text
.
├── gen_fex/            # Source code
│   ├── _ppcax.py       # PPCA implementation
│   └── _pkpcax.py      # PKPCA implementation
├── tests/              # Unit tests
├── pyproject.toml      # Project configuration
└── README.md           # Documentation
```
## Running Tests
To ensure everything is working correctly, run the test suite:
```bash
pytest tests/test.py
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository.
2. Create your feature branch (`git checkout -b feature/AmazingFeature`).
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`).
4. Push to the branch (`git push origin feature/AmazingFeature`).
5. Open a Pull Request.
## License
This project is licensed under the [Apache License 2.0](LICENSE).
## Citation
If you use this software in your research, please cite our manuscript:
```bibtex
@article{ATWA2026113376,
title = {Generative modeling for high-dimensional sparse data: Probabilistic feature extraction in high-risk financial regimes},
journal = {Engineering Applications of Artificial Intelligence},
volume = {164},
pages = {113376},
year = {2026},
issn = {0952-1976},
doi = {https://doi.org/10.1016/j.engappai.2025.113376},
url = {https://www.sciencedirect.com/science/article/pii/S0952197625034074},
author = {Ahmed Nabil Atwa and Mohamed Kholief and Ahmed Sedky},
keywords = {Probabilistic principal component analysis, Probabilistic kernel principal component analysis, Wishart process, Missing value imputation, Information-driven bars, Hierarchical risk parity}
}
```