https://github.com/ai-ahmed/gen_fex

Probabilistic PCA and PKPCA for Stochastic Feature Extraction and Missing Data Reconstruction
bayesian chex distrax finance jax pkpca ppca probabilistic probabilistic-models python quantitative-finance stochastic

Host: GitHub
URL: https://github.com/ai-ahmed/gen_fex
Owner: AI-Ahmed
License: apache-2.0
Created: 2024-04-08T21:00:52.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-12-02T04:34:21.000Z (6 months ago)
Last Synced: 2026-01-28T13:11:16.005Z (4 months ago)
Topics: bayesian, chex, distrax, finance, jax, pkpca, ppca, probabilistic, probabilistic-models, python, quantitative-finance, stochastic
Language: Python
Homepage:
Size: 1.63 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Probabilistic Feature Extraction in JAX

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](LICENSE)

[![JAX](https://img.shields.io/badge/backend-JAX-red.svg)](https://github.com/google/jax)

[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

## Overview

**gen_fex** is a high-performance library for **Probabilistic Feature Extraction** and **Generative Modeling**, built on top of [JAX](https://github.com/google/jax). It is designed to handle high-dimensional, sparse time-series data, making it particularly effective for financial modeling in high-risk regimes.

This repository accompanies the manuscript **"Generative Modeling for High-Dimensional Sparse Data: Probabilistic Feature Extraction in High-Risk Financial Regimes"**. It implements the probabilistic models from that work, which the manuscript reports outperform conventional methods at capturing non-linear, time-dependent features, especially during volatile market conditions.

## ✨ Key Features

- **🚀 JAX-Accelerated**: Leverages JAX for high-performance numerical computing and automatic differentiation.

- **scikit-learn Compatible**: Fully compatible with the `scikit-learn` API (`fit`, `transform`, `inverse_transform`), allowing seamless integration into existing ML pipelines.

- **High-Dimensional Efficiency**: Automatically handles the "Transpose Trick" (Dual formulation) to efficiently process datasets where features ($D$) far exceed samples ($N$).

- **Missing Data Imputation**: Robust reconstruction of missing values in sparse datasets.

- **Advanced Models**:

  - **PPCA (Probabilistic PCA)**: A probabilistic framework for PCA that handles noise and missing data.

  - **PKPCA (Probabilistic Kernel PCA)**: Extends PPCA with kernel methods (e.g., RBF) and Wishart processes to capture non-linear structures.
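
To make the missing-data feature concrete, here is a minimal NumPy sketch of iterative low-rank imputation. It is a simpler, non-probabilistic relative of EM-based PPCA imputation and is not the gen_fex implementation; the function name `iterative_svd_impute`, the sizes, the 10% missing rate, and the seed are all illustrative assumptions.

```python
import numpy as np

def iterative_svd_impute(X, rank, n_iters=50):
    """Fill NaNs in X by alternating a rank-`rank` SVD fit with re-imputation."""
    missing = np.isnan(X)
    filled = np.where(missing, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iters):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mean
        filled = np.where(missing, low_rank, X)  # keep observed entries fixed
    return filled

rng = np.random.default_rng(0)
N, D, q = 60, 40, 3
truth = rng.normal(size=(N, q)) @ rng.normal(size=(q, D))  # exactly rank-q data

X = truth.copy()
mask = rng.random((N, D)) < 0.1  # hide roughly 10% of the entries
X[mask] = np.nan

# Baseline: fill each missing entry with the mean of its column.
baseline = np.where(mask, np.nanmean(X, axis=0), truth)
imputed = iterative_svd_impute(X, rank=q)

# Compare errors on the hidden entries only.
rmse_baseline = np.sqrt(np.mean((baseline - truth)[mask] ** 2))
rmse_imputed = np.sqrt(np.mean((imputed - truth)[mask] ** 2))
print(f"column-mean RMSE: {rmse_baseline:.4f}, low-rank RMSE: {rmse_imputed:.4f}")
```

Observed entries are never modified; only the hidden positions are updated on each pass, which is the same division of labor an EM-style imputation uses.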

## 🛠️ Installation

### Prerequisites

- Python **3.10** or newer.

### Install via pip

You can install the package directly from GitHub:

```bash
pip install git+https://github.com/AI-Ahmed/gen_fex.git
```

### Development Installation

If you want to contribute or modify the code:

1. **Clone the repository:**

   ```bash
   git clone https://github.com/AI-Ahmed/gen_fex.git
   cd gen_fex
   ```

2. **Install using Flit:**

   ```bash
   pip install flit
   flit install --deps develop --extras test --symlink
   ```

## 🚀 Quick Start

Here is a simple example of how to use the `PPCA` and `PKPCA` classes.

```python
import numpy as np

from gen_fex import PPCA, PKPCA

# 1. Generate synthetic high-dimensional data (Samples < Features)
# Shape: (N_samples, D_features)
N, D = 100, 1000
data = np.random.rand(N, D)

# 2. Initialize Models
# We choose a latent dimension q
q = 50
ppca = PPCA(q=q)
pkpca = PKPCA(q=q)

# 3. Fit Models
# The models automatically handle the high-dimensional nature (N < D)
print("Fitting PPCA...")
ppca.fit(data, use_em=True, verbose=1)

print("Fitting PKPCA...")
pkpca.fit(data, use_em=True, verbose=1)

# 4. Transform (Dimensionality Reduction)
latent_ppca = ppca.transform()
latent_pkpca = pkpca.transform()

print(f"Original Shape: {data.shape}")
print(f"PPCA Latent Shape: {latent_ppca.shape}")   # (q, D) - Latent features
print(f"PKPCA Latent Shape: {latent_pkpca.shape}") # (q, D) - Latent features

# Note: The model decomposes X approx W @ Z
# W: (N, q) - Sample embeddings
# Z: (q, D) - Latent features (returned by transform)

# 5. Reconstruction (Inverse Transform)
recon_ppca = ppca.inverse_transform(latent_ppca)
print(f"Reconstructed Shape: {recon_ppca.shape}")
```

## 🧮 Mathematical Background

### Probabilistic PCA (PPCA)

PPCA in the matrix-variate setting models the observed data matrix $P \in \mathbb{R}^{N \times D}$ using latent variables $Z \in \mathbb{R}^{q \times D}$ with a linear Gaussian generative structure:

$$
P = WZ + \mu + E
$$

where:

- $W \in \mathbb{R}^{N \times q}$ is the loading matrix,

- $\mu \in \mathbb{R}^{N \times D}$ is the mean matrix,

- $E \in \mathbb{R}^{N \times D}$ is the noise matrix.

The latent variables follow an isotropic Gaussian:

$$
Z \sim \mathcal{N}(0, I_q)
$$

and the noise is modeled using a matrix-variate Gaussian distribution:

$$
E \sim \mathcal{N}_{N \times D}(0,\ \sigma^2 I_N,\ I_D).
$$
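
As a minimal sketch of the generative structure above (NumPy only, not the gen_fex implementation), the following draws one data matrix from the model and checks that the columns of $P$ have empirical covariance close to $WW^\top + \sigma^2 I_N$, which follows from the independence of $Z$ and $E$ when $\mu = 0$. The sizes, the seed, and the zero mean are illustrative assumptions, and $Z$ is given shape $(q, D)$ so that $WZ$ matches the shape of $P$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the manuscript).
N, D, q = 5, 20000, 2
sigma = 0.1

W = rng.normal(size=(N, q))          # loading matrix, shape (N, q)
mu = np.zeros((N, D))                # mean matrix, shape (N, D); zero for simplicity
Z = rng.normal(size=(q, D))          # latent variables, each column ~ N(0, I_q)
E = sigma * rng.normal(size=(N, D))  # noise: row covariance sigma^2 I_N, column covariance I_D

P = W @ Z + mu + E                   # observed data, shape (N, D)

# With mu = 0, each column of P has covariance W W^T + sigma^2 I_N.
empirical_cov = (P @ P.T) / D
model_cov = W @ W.T + sigma**2 * np.eye(N)
rel_err = np.max(np.abs(empirical_cov - model_cov)) / np.max(model_cov)
print(f"relative covariance error: {rel_err:.4f}")
```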

### Dual Formulation & The Transpose Trick

For high-dimensional data where the number of features $D$ is much larger than the number of samples $N$ ($D \gg N$), standard PCA is computationally expensive because it eigendecomposes the $D \times D$ covariance matrix at $O(D^3)$ cost. **gen_fex** implements the **Dual PPCA** formulation (often called the "Transpose Trick"), which operates on the $N \times N$ Gram matrix instead, reducing the eigendecomposition cost to $O(N^3)$.
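
The identity behind the trick can be checked in a few lines of NumPy. This is a sketch of plain (non-probabilistic) PCA under illustrative sizes, not the gen_fex code path: if $u$ is a unit eigenvector of the Gram matrix $X_c X_c^\top$ with eigenvalue $\lambda > 0$, then $v = X_c^\top u / \sqrt{\lambda}$ is a unit eigenvector of the scatter matrix $X_c^\top X_c$ with the same eigenvalue. The scatter matrix differs from the covariance matrix only by a constant factor, so the eigenvectors are the same.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, q = 20, 500, 3              # illustrative sizes with D >> N

X = rng.normal(size=(N, D))
Xc = X - X.mean(axis=0)           # center each feature across samples

# Primal route: eigendecompose the D x D scatter matrix.
primal_vals, primal_vecs = np.linalg.eigh(Xc.T @ Xc)
primal_vals = primal_vals[::-1][:q]          # eigh sorts ascending; keep the top q
primal_vecs = primal_vecs[:, ::-1][:, :q]

# Dual route: eigendecompose the N x N Gram matrix instead.
gram_vals, gram_vecs = np.linalg.eigh(Xc @ Xc.T)
gram_vals = gram_vals[::-1][:q]
gram_vecs = gram_vecs[:, ::-1][:, :q]

# Map dual eigenvectors u back to feature space: v = Xc^T u / sqrt(lambda).
dual_vecs = Xc.T @ gram_vecs / np.sqrt(gram_vals)

# The leading eigenvalues coincide, and the eigenvectors agree up to sign.
eigvals_match = np.allclose(primal_vals, gram_vals)
alignment = np.abs(np.sum(primal_vecs * dual_vecs, axis=0))
print(eigvals_match, alignment)
```

Forming the Gram matrix still costs $O(N^2 D)$, so the saving is in the eigendecomposition step.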

### Probabilistic Kernel PCA (PKPCA)

PKPCA extends this by mapping data into a non-linear feature space using a kernel function (e.g., RBF). Our implementation utilizes a **Wishart Process** prior for the covariance matrix, allowing for robust uncertainty quantification in the kernel space.
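
For orientation, the kernel step on its own can be sketched with standard kernel PCA in NumPy. This sketch deliberately omits the Wishart process prior and does not reproduce the gen_fex implementation; the helper names `rbf_gram` and `center_gram`, the bandwidth `gamma = 1 / D`, and the sizes are assumptions chosen for illustration.

```python
import numpy as np

def rbf_gram(X, gamma):
    """N x N RBF Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding

def center_gram(K):
    """Center the Gram matrix in feature space: K_c = H K H with H = I - 11^T / N."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

rng = np.random.default_rng(0)
N, D, q = 50, 200, 5
X = rng.normal(size=(N, D))

K = center_gram(rbf_gram(X, gamma=1.0 / D))

# Top-q eigenpairs of the centered Gram matrix (eigh returns ascending order).
eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1][:q], eigvecs[:, ::-1][:, :q]

# Kernel principal component scores of the training samples, shape (N, q).
scores = eigvecs * np.sqrt(np.maximum(eigvals, 0.0))
print(scores.shape)
```

Because the kernel only ever enters through the $N \times N$ Gram matrix, this computation has the same cost profile as the dual formulation described above.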

## 📊 Results & Performance

We evaluate our models on high-dimensional sparse financial data. Below are comparisons of model performance and reconstruction quality.

### Model Performance Comparison

![Model Performance Comparison](docs/model_performance_comp.jpg)

*Fig. 2. Comparison of the negative log-likelihood (`ℓ`) between high-dimensional PPCA and PKPCA over 20 iterations (T4 GPU).*

### Reconstructed Data Comparison

![Reconstructed Data Comparison](docs/reconstructed_data_comparsion.jpg)

*Fig. 8. Monthly reconstructed correlation between PPCA and PKPCA for assets in the R2 regime. PKPCA shows a clear divergence from PPCA, particularly between the IT and Materials sectors, reflecting sector-specific performance during this period.*

## 📁 Directory Structure

```text
.
├── gen_fex/            # Source code
│   ├── _ppcax.py       # PPCA implementation
│   └── _pkpcax.py      # PKPCA implementation
├── tests/              # Unit tests
├── pyproject.toml      # Project configuration
└── README.md           # Documentation
```

## 🧪 Running Tests

To ensure everything is working correctly, run the test suite:

```bash
pytest tests/test.py
```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository.

2. Create your feature branch (`git checkout -b feature/AmazingFeature`).

3. Commit your changes (`git commit -m 'Add some AmazingFeature'`).

4. Push to the branch (`git push origin feature/AmazingFeature`).

5. Open a Pull Request.

## 📄 License

This project is licensed under the [Apache License 2.0](LICENSE).

## 📣 Citation

If you use this software in your research, please cite our manuscript:

```bibtex
@article{ATWA2026113376,
  title = {Generative modeling for high-dimensional sparse data: Probabilistic feature extraction in high-risk financial regimes},
  journal = {Engineering Applications of Artificial Intelligence},
  volume = {164},
  pages = {113376},
  year = {2026},
  issn = {0952-1976},
  doi = {https://doi.org/10.1016/j.engappai.2025.113376},
  url = {https://www.sciencedirect.com/science/article/pii/S0952197625034074},
  author = {Ahmed Nabil Atwa and Mohamed Kholief and Ahmed Sedky},
  keywords = {Probabilistic principal component analysis, Probabilistic kernel principal component analysis, Wishart process, Missing value imputation, Information-driven bars, Hierarchical risk parity}
}
```
