https://github.com/kjpou1/regimetry
Unsupervised regime detection for financial time series using embeddings and clustering.
clustering contributions-welcome deep-learning embeddings financial-data market-structure-analysis quantitative-trading regime-detection spectral-clustering technical-analysis tensorflow2 time-series transformer tsne umap unsupervised-learning
- Host: GitHub
- URL: https://github.com/kjpou1/regimetry
- Owner: kjpou1
- License: mit
- Created: 2025-05-04T10:51:36.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-06-03T11:54:53.000Z (7 months ago)
- Last Synced: 2025-08-05T11:41:01.520Z (5 months ago)
- Topics: clustering, contributions-welcome, deep-learning, embeddings, financial-data, market-structure-analysis, quantitative-trading, regime-detection, spectral-clustering, technical-analysis, tensorflow2, time-series, transformer, tsne, umap, unsupervised-learning
- Language: Jupyter Notebook
- Homepage:
- Size: 14.5 MB
- Stars: 3
- Watchers: 1
- Forks: 1
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# regimetry
> **Mapping latent regimes in financial time series.**







---
- [regimetry](#regimetry)
  - [Overview](#overview)
  - [What is a Regime?](#what-is-a-regime)
  - [How It Works](#how-it-works)
    - [1. **Data Ingestion**](#1-data-ingestion)
    - [2. **Embedding Pipeline**](#2-embedding-pipeline)
    - [3. **Clustering**](#3-clustering)
    - [4. **Visualization \& Interpretation**](#4-visualization--interpretation)
  - [Getting Started](#getting-started)
    - [Regime Detection Window Delay](#regime-detection-window-delay)
  - [Documentation](#documentation)
  - [Command Line Usage](#command-line-usage)
    - [Ingest Data](#ingest-data)
    - [Generate Embeddings](#generate-embeddings)
      - [Available CLI Arguments for `embed`](#available-cli-arguments-for-embed)
    - [Cluster Regimes](#cluster-regimes)
      - [Available CLI Arguments for `cluster`](#available-cli-arguments-for-cluster)
    - [Analyze Regime Structure](#analyze-regime-structure)
    - [Analyze Full Pipeline (Embed + Cluster)](#analyze-full-pipeline-embed--cluster)
      - [Available CLI Arguments for `analyze`](#available-cli-arguments-for-analyze)
  - [Example Dataset](#example-dataset)
  - [Configuration Files](#configuration-files)
    - [Example Config](#example-config)
    - [Usage in CLI](#usage-in-cli)
    - [Usage in Dash App](#usage-in-dash-app)
  - [Interactive Dashboard](#interactive-dashboard)
    - [Launch the App](#launch-the-app)
    - [Features](#features)
    - [Directory Structure](#directory-structure)
    - [Example Config for Palette Preview](#example-config-for-palette-preview)
  - [Project Structure](#project-structure)
  - [Orientation Going Forward](#orientation-going-forward)
  - [Status](#status)
  - [Related Projects](#related-projects)
  - [Further Reading](#further-reading)
  - [License](#license)
  - [Author](#author)
---
## Overview
**regimetry** is a modular, unsupervised regime detection engine for financial time series, originally developed as a personal research project to explore latent structure and behavioral transitions in markets.
It combines transformer-based embeddings with clustering and regime structure analysis to help identify and label recurring phases such as trends, reversals, and volatility shifts.
> While built for exploratory analysis, `regimetry` may evolve into a foundational component of my broader trading strategy stack.
> **Tech Highlights**:
>
> * Transformer encoder with positional encoding
> * Attention-based temporal modeling (windowed)
> * Spectral clustering on learned embeddings
> * Regime structure modeling via Markov transitions, stickiness, and entropy
---
## What is a Regime?
In `regimetry`, a **regime** is a *latent, temporally structured pattern* in market behavior, characterized by combinations of volatility, trend strength, momentum shifts, and signal alignment. These are not defined by hand, but **emerge from patterns discovered in the data**.
Formally:
* Regimes are clusters in the embedding space of overlapping market windows (e.g., 30 bars).
* Each embedding is generated via a Transformer encoder that learns internal structure within each window using attention over time.
* Spectral clustering then groups these embeddings into recurring *behavioral states* the market tends to revisit.
---
## How It Works
### 1. **Data Ingestion**
- Load daily bar data per instrument
- Normalize features (Close, AHMA, LP, LC, etc.)
- Features are typically sourced from [`ConvolutionLab`](https://github.com/kjpou1/ConvolutionLab),
but `regimetry` is **not dependent** on that specific pipeline; any compatible feature set can be used.
- Slice into overlapping windows (default: 30 bars, stride 1)
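
The windowing step can be sketched with NumPy. This is a minimal illustration under stated assumptions, not `regimetry`'s actual implementation: the helper name `make_windows` and the random toy feature matrix are hypothetical.

```python
import numpy as np

def make_windows(features: np.ndarray, window_size: int = 30, stride: int = 1) -> np.ndarray:
    """Slice a (n_bars, n_features) array into overlapping windows.

    Returns an array of shape (n_windows, window_size, n_features).
    Hypothetical helper for illustration only.
    """
    n_bars = features.shape[0]
    if window_size > n_bars:
        # Not enough bars to fill even one window
        return np.empty((0, window_size, features.shape[1]), dtype=features.dtype)
    starts = range(0, n_bars - window_size + 1, stride)
    return np.stack([features[s:s + window_size] for s in starts])

# Toy data: 100 bars with 4 normalized features each
bars = np.random.default_rng(0).normal(size=(100, 4))
windows = make_windows(bars, window_size=30, stride=1)
print(windows.shape)  # (71, 30, 4)
```

With stride 1, each window starts one bar after the previous one, so 100 bars yield 100 - 30 + 1 = 71 windows.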
### 2. **Embedding Pipeline**
* **Each rolling window is passed through a Transformer encoder** that uses positional encoding to preserve temporal structure and self-attention to learn nonlinear dependencies within the window.
* This produces a dense, contextualized embedding that reflects local market dynamics.
* The architecture is modular and can be swapped with alternatives such as autoencoders, SimCLR, or CNN-based encoders.
### 3. **Clustering**
- Standardize the embeddings
- Cluster them using Spectral Clustering (or another method)
- Assign each window a `regime_id`
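
The clustering steps above might look roughly like this with scikit-learn. It is a sketch under assumptions: the synthetic `embeddings` array stands in for real transformer output, and the parameters shown are illustrative rather than the values `regimetry` uses internally.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Stand-in for transformer embeddings: three well-separated blobs of 16-dim vectors
centers = [-5.0, 0.0, 5.0]
embeddings = np.concatenate(
    [rng.normal(loc=c, scale=0.5, size=(40, 16)) for c in centers]
)

# Standardize each embedding dimension to zero mean and unit variance
scaled = StandardScaler().fit_transform(embeddings)

# Group the windows into regimes; one integer label per window
model = SpectralClustering(n_clusters=3, random_state=42)
regime_id = model.fit_predict(scaled)
print(regime_id.shape)
```

The integer labels themselves are arbitrary; only the grouping carries meaning.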
### 4. **Visualization & Interpretation**
- Use t-SNE or UMAP to project embeddings
- Visualize regime transitions over time
- Map regimes back to chart or signal data for strategy insights
---
## Getting Started
See the full step-by-step guide:
[`docs/GETTING_STARTED_README.md`](docs/GETTING_STARTED_README.md)
> Includes:
>
> * Git clone instructions
> * Poetry or manual install
> * Data ingestion
> * Embedding generation
> * Regime clustering
> * Optional Dash dashboard launch
---
### Regime Detection Window Delay
> See: [`docs/REGIME_DETECTION_README.md`](docs/REGIME_DETECTION_README.md)
Because regime labels are assigned based on **rolling windows**, the cluster ID for the final bars of a dataset **cannot be known until the full window is complete**.
For example, with a `window_size = 30`:
* The first 29 bars will not receive a regime ID
* The **last 29 bars** also **do not reflect any future regime change**, since there are no forward windows to reclassify them
This introduces a **natural lag** in regime detection:
* New regimes will only appear after enough time has passed for the model to "observe" a full window in the new market condition.

For more details, see the full explanation: [`REGIME_DETECTION_README.md`](docs/REGIME_DETECTION_README.md)
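
The offset can be made concrete with a small sketch. It assumes each window's label is attached to the bar that closes the window, which is consistent with the first 29 bars having no regime ID; the helper `align_labels` is hypothetical and not part of the `regimetry` API.

```python
def align_labels(n_bars, window_labels, window_size=30, stride=1):
    """Place each window's label on the bar that closes that window.

    Bars with no completed window keep None. Hypothetical helper.
    """
    bar_labels = [None] * n_bars
    for i, label in enumerate(window_labels):
        last_bar = i * stride + window_size - 1  # index of the window's final bar
        bar_labels[last_bar] = label
    return bar_labels

n_bars, window_size = 100, 30
n_windows = n_bars - window_size + 1          # 71 windows at stride 1
labels = [0] * 40 + [1] * 31                  # toy regime IDs, one per window
aligned = align_labels(n_bars, labels, window_size)

print(aligned[:window_size - 1].count(None))  # 29 leading bars without a regime ID
print(aligned[29], aligned[-1])               # 0 1
```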
---
## Documentation
* [Getting Started](docs/GETTING_STARTED_README.md)
*Step-by-step setup, from ingestion to visualization.*
* [Regime Detection Window Logic](docs/REGIME_DETECTION_README.md)
*Explains the natural lag from using rolling windows in clustering.*
* [Regime Assignment & Label Alignment](docs/REGIME_ASSIGNMENT_README.md)
*Details how Spectral Clustering labels are aligned across runs using the Hungarian algorithm, with persistent baseline mapping and cluster color stability.*
---
## Command Line Usage
Run `regimetry` pipelines directly from the command line with optional overrides.
### Ingest Data
```bash
python launch_host.py ingest \
--signal-input-path examples/EUR_USD_processed_signals.csv
```
This will:
* Parse the input CSV
* Normalize and structure features
* Save the result to `artifacts/data/processed/`
### Generate Embeddings
```bash
python launch_host.py embed \
--signal-input-path examples/EUR_USD_processed_signals.csv \
--output-name EUR_USD_embeddings.npy \
--window-size 30 \
--stride 1 \
--encoding-method sinusoidal \
--encoding-style interleaved
```
This will:
* Apply a rolling window (default: 30 bars, stride: 1 unless overridden)
* Use positional encoding and Transformer to generate embeddings
* Save the result to `embeddings/EUR_USD_embeddings.npy`
> **Note:** Ensure that `window_size` is smaller than your dataset length.
> If `window_size >= len(data)`, no embeddings will be produced.
---
#### Available CLI Arguments for `embed`
| Argument | Description |
| --------------------- | ------------------------------------------------------------------------------- |
| `--signal-input-path` | Path to the CSV file with feature-enriched signal data |
| `--output-name` | Optional output file name for the `.npy` embeddings (default: `embeddings.npy`) |
| `--window-size` | Number of time steps per rolling window (default: `30`) |
| `--stride` | Step size between rolling windows (default: `1`) |
| `--encoding-method` | Positional encoding method: `sinusoidal` (default) or `learnable` |
| `--encoding-style` | Sinusoidal encoding format: `interleaved` (default) or `stacked` |
| `--embedding-dim` | Embedding dimension to use for both sinusoidal and learnable encodings |
| `--config` | Optional YAML config path to override pipeline settings |
| `--debug` | Enable debug logging |
> **Note:** `--embedding-dim` applies to **both** `sinusoidal` and `learnable` encodings.
> For `sinusoidal`, it sets the generated frequency embedding size. For `learnable`, it defines the trainable positional embedding dimension.
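
To illustrate how `--embedding-dim` and `--encoding-style` could interact for the sinusoidal case, here is a NumPy sketch. It assumes that `interleaved` alternates sine and cosine columns while `stacked` places all sine columns before all cosine columns; that reading of the two style names, the function name, and the even-dimension restriction are assumptions, not confirmed details of `regimetry`.

```python
import numpy as np

def sinusoidal_encoding(window_size: int, embedding_dim: int,
                        style: str = "interleaved") -> np.ndarray:
    """Return a (window_size, embedding_dim) sinusoidal positional encoding.

    Illustrative only; this sketch handles even embedding_dim values.
    """
    if embedding_dim % 2 != 0:
        raise ValueError("this sketch assumes an even embedding_dim")
    positions = np.arange(window_size)[:, None]                      # (window_size, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, embedding_dim, 2) / embedding_dim))
    angles = positions * freqs                                       # (window_size, dim / 2)
    sin, cos = np.sin(angles), np.cos(angles)
    if style == "interleaved":
        enc = np.empty((window_size, embedding_dim))
        enc[:, 0::2] = sin   # even columns hold sines
        enc[:, 1::2] = cos   # odd columns hold cosines
        return enc
    if style == "stacked":
        return np.concatenate([sin, cos], axis=1)  # all sines, then all cosines
    raise ValueError(f"unknown style: {style}")

interleaved = sinusoidal_encoding(30, 64, "interleaved")
stacked = sinusoidal_encoding(30, 64, "stacked")
print(interleaved.shape, stacked.shape)  # (30, 64) (30, 64)
```

Both layouts carry the same values; only the column order differs.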
### Cluster Regimes
```bash
python launch_host.py cluster \
--embedding-path embeddings/EUR_USD_embeddings.npy \
--regime-data-path data/processed/regime_input.csv \
--output-dir reports/EUR_USD \
--window-size 30 \
--n-clusters 3
```
This will:
* Load precomputed transformer embeddings
* Apply spectral clustering to assign regime IDs
* Align cluster labels with original time-series data (using `window_size` for offset)
* Generate visualizations (t-SNE, UMAP, timeline, and price overlay)
* Save outputs to the specified report directory
> **Note:** The `window_size` used here **must match** the one used during embedding.
> Otherwise, the cluster labels will not align correctly with the input time series.
---
#### Available CLI Arguments for `cluster`
| Argument | Description |
| -------------------- | ------------------------------------------------------------------------------ |
| `--embedding-path` | Path to the `.npy` file with saved embeddings |
| `--regime-data-path` | CSV file containing the signal-enriched time series (e.g., `regime_input.csv`) |
| `--output-dir` | Directory to save visualizations and labeled data |
| `--window-size` | Window size used during embedding (used for alignment) |
| `--n-clusters` | Number of regimes (clusters) to detect (default: `3`) |
| `--config` | Optional YAML config file to provide all arguments at once |
| `--debug` | Enable debug logging |
### Analyze Regime Structure
```bash
python launch_host.py interpret \
--input-path artifacts/reports/EUR_USD/cluster_assignments.csv \
--output-dir artifacts/reports/EUR_USD/ \
--save-csv \
--save-heatmap \
--save-json
```
This will:
* Compute the Markov transition matrix from cluster sequences
* Derive stickiness, entropy, and most-likely transitions per regime
* Generate:
* `regime_decision_table.csv`
* `transition_matrix.csv`
* `transition_matrix_heatmap.png`
* `regime_metadata.json` (for runtime strategy filtering)
> **Note:** `--input-path` must point to the `cluster_assignments.csv` generated by the `cluster` step.
> The `interpret` pipeline does **not** run embedding or clustering; it analyzes the regime structure from their output.

> **Note:** This pipeline does not require a config file. It operates directly on a post-clustering output CSV with a `Cluster_ID` column.
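
The quantities named above can be sketched with NumPy. This is an illustration under assumptions: "stickiness" is taken to be the self-transition probability and "entropy" the Shannon entropy of each row of the transition matrix. The exact definitions used by the `interpret` pipeline may differ, and the function names are hypothetical.

```python
import numpy as np

def transition_matrix(cluster_ids, n_clusters):
    """Row-normalized counts of transitions between consecutive regime IDs."""
    counts = np.zeros((n_clusters, n_clusters))
    for src, dst in zip(cluster_ids[:-1], cluster_ids[1:]):
        counts[src, dst] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Leave rows with no outgoing transitions as zeros instead of dividing by zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def row_entropy(P):
    """Shannon entropy (bits) of each row; 0 * log(0) is treated as 0."""
    logs = np.log2(P, out=np.zeros_like(P), where=P > 0)
    return -(P * logs).sum(axis=1)

ids = [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]   # toy Cluster_ID sequence
P = transition_matrix(ids, n_clusters=3)
stickiness = np.diag(P)                 # probability of staying in the same regime
entropy = row_entropy(P)                # how unpredictable the next regime is
most_likely_next = P.argmax(axis=1)     # ties resolve to the lowest regime ID

print(P[0])        # [0.5  0.25 0.25]
print(stickiness)  # [0.5 0.5 1. ]
```

A regime with high stickiness and low entropy tends to persist; a low-stickiness, high-entropy regime behaves more like a transitional state.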
---
### Analyze Full Pipeline (Embed + Cluster)
```bash
python launch_host.py analyze \
--instrument EUR_USD \
--window-size 5 \
--stride 1 \
--encoding-method sinusoidal \
--encoding-style interleaved \
--embedding-dim 64 \
--n-clusters 12 \
--create-dir \
--force \
--clean
```
This **single command**:
* Loads and expands a base config (e.g., `configs/EUR_USD_base.yaml`)
* Dynamically resolves output paths for embeddings and clustering reports
* Creates output directories if `--create-dir` is provided
* Forces re-run even if output exists (`--force`)
* Cleans existing embedding/report directories before rerun (`--clean`)
> Outputs:
>
> * Embedding: `artifacts/embeddings/.../embedding.npy`
> * Clustering: `artifacts/reports/.../cluster_assignments.csv`
> * Auto-exported config: `artifacts/tmp_config.yaml`
---
#### Available CLI Arguments for `analyze`
| Argument | Description |
| ------------------- | -------------------------------------------------------------------- |
| `--instrument` | Instrument symbol (e.g., `EUR_USD`) |
| `--window-size` | Rolling window size used for embedding |
| `--stride` | Step size between rolling windows |
| `--encoding-method` | Positional encoding method: `sinusoidal` or `learnable` |
| `--encoding-style` | Sinusoidal encoding style: `interleaved` or `stacked` |
| `--embedding-dim` | Dimensionality of the positional encoding |
| `--n-clusters` | Number of clusters for regime detection |
| `--create-dir` | Create output folders for embeddings and reports if they don't exist |
| `--force` | Force re-run even if embedding or cluster outputs already exist |
| `--clean` | Remove previous output folders before running |
| `--debug` | Enable debug logging |
---
## Example Dataset
An example file is included at [`examples/EUR_USD_processed_signals.csv`](examples/EUR_USD_processed_signals.csv) to help you test the pipeline immediately.
This file contains:
- Processed technical indicators (AHMA, LP, LC, ATR, etc.)
- Cleaned and aligned daily bars for EUR/USD
- A ready-to-ingest format compatible with the full `embedding_pipeline`
You can run the **ingestion pipeline** on this dataset:
```bash
python launch_host.py ingest --signal-input-path examples/EUR_USD_processed_signals.csv
```
OR
Run the **embedding pipeline** to generate transformer embeddings:
```bash
python launch_host.py embed --signal-input-path examples/EUR_USD_processed_signals.csv
```
---
## Configuration Files
`regimetry` supports YAML configuration files to streamline pipeline execution and visualization setup. These configs centralize all key parameters used by the CLI and Dash dashboard.
### Example Config
Here's a fully annotated example config file at [`config/full_config.yaml`](config/full_config.yaml):
```yaml
# General Settings
debug: true
# Ingestion Settings
signal_input_path: ./examples/EUR_USD_processed_signals.csv
include_columns: "*"
exclude_columns: ["Date", "Hour"] # Remove Date/Hour for daily resolution
deterministic: true # Enables reproducible embeddings and clustering
random_seed: 42 # Controls randomness for TF, t-SNE, UMAP, Spectral Clustering
# Embedding Settings
output_name: EUR_USD_embeddings.npy
window_size: 10
stride: 1
encoding_method: "sinusoidal" # Options: 'sinusoidal', 'learnable'
encoding_style: "interleaved" # Options: 'interleaved', 'stacked'
# embedding_dim: 80
# Clustering Settings
embedding_path: ./embeddings/EUR_USD_embeddings.npy
regime_data_path: ./data/processed/regime_input.csv
output_dir: ./reports/EUR_USD
n_clusters: 8
# Report Settings
report_format: ["matplotlib", "plotly"] # Options: [], ["matplotlib"], ["plotly"]
report_palette: Set2 # Any valid seaborn palette name
```
The embedding section can also select learnable positional encoding. The variant below shows it together with the deterministic settings:
```yaml
# Embedding Settings
output_name: EUR_USD_embeddings.npy
window_size: 10
stride: 1
encoding_method: "learnable" # Options: 'sinusoidal', 'learnable'
encoding_style: "interleaved" # Used only for 'sinusoidal'
embedding_dim: 71 # Required for 'learnable'; optional for 'sinusoidal'
deterministic: true # Enables reproducible embeddings and clustering
random_seed: 42 # Controls randomness for TF, t-SNE, UMAP, Spectral Clustering
```
> **Determinism Note:**
> When `deterministic: true`, all randomness (including Transformer, t-SNE, UMAP, Spectral Clustering) is locked using `random_seed`.
> This ensures identical results across re-runs with the same input data.
> When `false`, variability is allowed, which is useful for exploration or stress testing.
>
> [Learn more: Reproducibility Controls](./docs/REPRODUCIBILITY_README.md)
---
### Usage in CLI
You can run any pipeline stage using a config override:
```bash
python launch_host.py cluster --config config/full_config.yaml
```
* CLI will auto-resolve relative paths (e.g., to `./data/`, `./embeddings/`)
* Config values override internal defaults
* Any CLI argument passed explicitly will override the config
> **CLI flags always take precedence** over values defined in the YAML.
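
The precedence order described above (CLI over YAML over internal defaults) can be sketched as a layered dictionary merge. The names `DEFAULTS` and `resolve_settings`, and the convention that an unset CLI flag is represented by `None`, are assumptions made for illustration and are not taken from the `regimetry` source.

```python
DEFAULTS = {"window_size": 30, "stride": 1, "n_clusters": 3}

def resolve_settings(defaults, yaml_config, cli_args):
    """Merge settings so that CLI > YAML > defaults.

    CLI values of None mean "flag not passed" and are ignored.
    """
    merged = dict(defaults)
    merged.update(yaml_config)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged

yaml_config = {"window_size": 10, "n_clusters": 8}   # as if loaded from the YAML file
cli_args = {"window_size": None, "n_clusters": 5}    # only --n-clusters was passed
settings = resolve_settings(DEFAULTS, yaml_config, cli_args)
print(settings)  # {'window_size': 10, 'stride': 1, 'n_clusters': 5}
```

Here `window_size` comes from the YAML, `stride` falls back to the default, and `n_clusters` is taken from the command line.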
---
### Usage in Dash App
The Dash dashboard can also load and preview a YAML config:
```bash
poetry run python -m dash_app.app
```
In the **Palette Preview** tab:
* Upload any `.yaml` file
* The dashboard will display:
* Parsed settings (`window_size`, `n_clusters`, `output_dir`, etc.)
* Current seaborn `report_palette` rendered as a color swatch
> *This is for preview only; an uploaded config **does not affect** the rendered plots.*
> To change plots, rerun the `cluster` CLI with the updated config.
---
For a full reference of all supported fields, see:
[`docs/CONFIG_REFERENCE_README.md`](docs/CONFIG_REFERENCE_README.md)
---
## Interactive Dashboard
`regimetry` ships with an optional Dash app that provides a user-friendly interface for exploring clustering results.
### Launch the App
```bash
poetry run python -m dash_app.app
```
The app will run locally at [http://localhost:8050](http://localhost:8050)
> **Note:** Requires `dash`, `dash-bootstrap-components`, and `plotly` installed in your environment.
### Features
* **YAML Config Loader**
  Upload a YAML config file (e.g., `configs/full_config.yaml`) to view the current settings:
  * `window_size`
  * `report_palette`
  * `output_dir`
  * `report_format`

  > *This is for **informational preview only**; uploading a config file does **not** affect the rendered plots.*
  > Plots are static and must be regenerated via the CLI (`launch_host.py cluster`) if you want different parameters applied.
* **Cluster Visualizations**
  * `Price Overlay`: Close price with color-coded cluster markers
  * `t-SNE`: 2D projection of regime embedding space
  * `UMAP`: Alternative manifold-based view of clusters
* **Palette Preview**
  * Auto-detects and displays the seaborn color palette in use
  * Ensures consistent cluster color mapping between matplotlib and Plotly
  * Preview updates when a new YAML config is uploaded
---
### Directory Structure
```bash
dash_app/
├── app.py # Main Dash app with config reader and tab layout
└── ...
```
### Example Config for Palette Preview
```yaml
report_format: ["matplotlib", "plotly"]
report_palette: "Set2"
output_dir: ./artifacts/reports/EUR_USD
```
---
## Project Structure
```bash
regimetry/
├── models/ # Trained encoders and clustering artifacts
├── data/ # Input raw / processed datasets
├── artifacts/ # JSON logs, regime labels, regime visual outputs
├── config.yaml # Tunable pipeline settings
├── pyproject.toml
└── README.md
```
---
## Orientation Going Forward
* Start with regime labeling and visualization
* Build diagnostic tools to analyze regime behavior
* Eventually tie `regime_id` into strategy filters and signal validation
* Keep architecture modular, interpretable, and ready for real-world integration
---
## Status
* [x] Core concept defined
* [x] Data ingestion pipeline implemented
* [x] Transformer encoder + positional encoding embedded
* [x] Embedding pipeline operational and CLI-integrated
* [x] Embeddings saved to `embeddings/`
* [x] Spectral clustering and regime ID assignment
* [x] Visualization tools (UMAP, t-SNE) with cluster overlay
* [x] Historical regime labeling and export
* [ ] Live inference support
* [ ] Contrastive or autoregressive pretraining options
---
## Related Projects
- [`ConvolutionLab`](https://github.com/kjpou1/ConvolutionLab):
A technical feature engineering framework that produces structured indicators (e.g., AHMA, LP, LC, ATR)
used as inputs to `regimetry`.
**Note:** While `regimetry` is compatible with ConvolutionLab outputs, it is not tightly coupled to it;
any feature-rich dataset with proper formatting can be used for embedding and clustering.
---
## Further Reading
For foundational papers, models, and tools behind the `regimetry` pipeline, see the [References](./docs/REFERENCES_README.md).
---
## License
MIT
## Author
kjpou1, initial maintainer