https://github.com/stremtec/astrape-vst
Causal zero-shot voice conversion at 44.1kHz.
- Host: GitHub
- URL: https://github.com/stremtec/astrape-vst
- Owner: stremtec
- Created: 2026-06-06T06:10:25.000Z (26 days ago)
- Default Branch: main
- Last Pushed: 2026-06-30T02:50:04.000Z (2 days ago)
- Last Synced: 2026-06-30T03:19:21.676Z (2 days ago)
- Topics: 44khz, causal, neural-audio-codec, q2d2, voice-conversion, zero-shot
- Language: Python
- Homepage:
- Size: 43.7 MB
- Stars: 24
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# astrape-vst
Strict-causal, zero-shot voice conversion VST — real-time on Apple Silicon (MPS).
| Property | Value |
|---|---|
| cos768 | **0.935** (probe, 8L StridingAdapter) |
| Encoder params | 24.9M |
| Algorithmic latency | ~49 ms E2E (0 look-ahead) |
| Platform | macOS / Apple Silicon (MPS) |
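
As a rough sanity check on the latency figure, the 25 Hz content rate listed under Design implies a 40 ms frame, or 1764 samples at 44.1 kHz. The sketch below only does that arithmetic. It assumes one content frame of buffering accounts for most of the ~49 ms. How the remainder is spent is not stated in this README, and the variable names are illustrative.

```python
from fractions import Fraction

SAMPLE_RATE_HZ = 44_100   # from the project description
CONTENT_RATE_HZ = 25      # content frame rate listed under Design
E2E_LATENCY_MS = 49       # approximate figure from the table above

# Samples consumed per content frame, and the frame duration in milliseconds.
hop_samples = Fraction(SAMPLE_RATE_HZ, CONTENT_RATE_HZ)
frame_ms = Fraction(1000, CONTENT_RATE_HZ)

# Assumption: one frame of buffering dominates the budget. The breakdown of
# whatever is left over is not documented in this README.
remainder_ms = E2E_LATENCY_MS - frame_ms

print(f"hop: {hop_samples} samples, frame: {frame_ms} ms, remainder: ~{remainder_ms} ms")
```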
## Design
- **0 look-ahead** — strictly causal. No future frames, KV-cache streaming.
- **Content/speaker split** — *what* is said (768d content @25Hz) vs *who* says it (global embedding).
- **Teacher–student** — trained by distilling a frozen, bidirectional MioCodec teacher into a causal student.
- **Zero-shot VC** — GRL disentanglement + Q2D2 bottleneck. No parallel data needed.
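
The "0 look-ahead" property can be illustrated with a toy example. The sketch below is not the astrape model: it swaps in a small causal FIR filter as a stand-in and shows that a streaming version, which carries only past samples as state (playing the role a KV cache plays for attention), reproduces the offline output exactly, and that changing a future sample never alters earlier outputs. All names here are hypothetical.

```python
from collections import deque


def causal_fir(signal, taps):
    """Offline reference: y[t] = sum_k taps[k] * x[t - k], with x[t] = 0 for t < 0."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(taps):
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out


class StreamingCausalFIR:
    """Same filter, fed chunk by chunk; only past samples are kept as state."""

    def __init__(self, taps):
        self.taps = list(taps)
        # history[0] is the most recent sample; it starts as zero padding.
        self.history = deque([0.0] * len(self.taps), maxlen=len(self.taps))

    def process(self, chunk):
        out = []
        for x in chunk:
            self.history.appendleft(x)  # the oldest sample drops off the right
            out.append(sum(w * h for w, h in zip(self.taps, self.history)))
        return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
taps = [0.5, 0.25, 0.25]

offline = causal_fir(signal, taps)

# Feed the same signal in uneven chunks, as a real-time host would.
stream = StreamingCausalFIR(taps)
streamed = stream.process(signal[:3]) + stream.process(signal[3:5]) + stream.process(signal[5:])

# Perturb a "future" sample (index 4) and recompute offline.
perturbed = signal[:4] + [100.0] + signal[5:]
perturbed_out = causal_fir(perturbed, taps)

print("streaming matches offline:", streamed == offline)
print("outputs before index 4 unchanged:", perturbed_out[:4] == offline[:4])
```

The state buffer has a fixed size, so per-chunk cost does not grow with the length of the stream, which is the property a real-time plugin needs.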
## Quick Start
```bash
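# Build a voicebank (target-speaker file) from a reference recording
python -m astrape.build_voicebank ref.wav -o speaker.astrape
# Convert a source recording to the target speaker's voice
python -m astrape.evaluate --source src.wav --target speaker.astrape --output out.wav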
```
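
To convert several files against one voicebank, the two commands above can be wrapped in a small script. This is a minimal sketch that uses only the module names and flags shown in the Quick Start. The helper names (`build_voicebank_cmd`, `convert_cmd`, `convert_folder`) are hypothetical and not part of the package, and `dry_run=True` only builds the command lists without running anything.

```python
import subprocess
import sys
from pathlib import Path


def build_voicebank_cmd(ref_wav, voicebank):
    """Command list mirroring: python -m astrape.build_voicebank ref.wav -o speaker.astrape"""
    return [sys.executable, "-m", "astrape.build_voicebank", str(ref_wav), "-o", str(voicebank)]


def convert_cmd(source, voicebank, output):
    """Command list mirroring: python -m astrape.evaluate --source ... --target ... --output ..."""
    return [
        sys.executable, "-m", "astrape.evaluate",
        "--source", str(source),
        "--target", str(voicebank),
        "--output", str(output),
    ]


def convert_folder(src_dir, voicebank, out_dir, dry_run=True):
    """Build (and optionally run) one conversion command per .wav file in src_dir."""
    out_dir = Path(out_dir)
    cmds = []
    for wav in sorted(Path(src_dir).glob("*.wav")):
        cmd = convert_cmd(wav, voicebank, out_dir / wav.name)
        cmds.append(cmd)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmds


if __name__ == "__main__":
    # Print the commands that would run for a hypothetical "inputs" folder.
    for cmd in convert_folder("inputs", "speaker.astrape", "converted"):
        print(" ".join(cmd))
```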
## References
- **Q2D2** — Shuster & Nachmani, "Two-Dimensional Quantization for Geometry-Aware Audio Coding", ICML 2026, arXiv:2512.01537
- **Mamba** — Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", 2023, arXiv:2312.00752
- **Hyena** — Poli et al., "Hyena Hierarchy: Towards Larger Convolutional Language Models", 2023, arXiv:2302.10866
- **MioCodec** — Aratako/MioCodec-25Hz-44.1kHz-v2 (teacher codec, HuggingFace)
- **WavLM** — Chen et al., "WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing", 2022
- **GRL** — Ganin & Lempitsky, "Unsupervised Domain Adaptation by Backpropagation", ICML 2015
- **ConvNeXt** — Liu et al., "A ConvNet for the 2020s", CVPR 2022
- **WavTokenizer** — Ji et al., "WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer", 2024
- **Snake / BigVGAN** — Lee et al., "BigVGAN: A Universal Neural Vocoder with Large-Scale Training", 2023, arXiv:2206.04658
- **iSTFT / Vocos** — Siuzdak, "Vocos: Closing the Gap Between Time-Domain and Fourier-Based Neural Vocoders for High-Quality Audio Synthesis", 2024, arXiv:2306.00814
- **Predictive Coding** — Oord et al., "Representation Learning with Contrastive Predictive Coding (CPC)", 2018
- **APCodec** — Ai et al., "APCodec: A Neural Audio Codec", IEEE/ACM TASLP 2024, arXiv:2402.10533