https://github.com/helloooideeeeea/openspeakerbeam-ss
OpenSpeakerBeam-SS is an independent reimplementation of SpeakerBeam-SS, a real-time target speaker extraction model combining Conv-TasNet and State Space Modeling (S4D). This project is not affiliated with the original paper's authors.
- Host: GitHub
- URL: https://github.com/helloooideeeeea/openspeakerbeam-ss
- Owner: helloooideeeeea
- Created: 2025-03-16T08:50:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-16T11:22:16.000Z (about 1 year ago)
- Last Synced: 2025-03-16T12:27:09.480Z (about 1 year ago)
- Language: Python
- Size: 739 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# OpenSpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling
This is an **independent implementation** of [SpeakerBeam-SS](https://arxiv.org/abs/2407.01857), a real-time target speaker extraction model combining lightweight Conv-TasNet and State Space Modeling (S4D). The goal is to achieve efficient and high-performance speaker separation on resource-constrained devices.
**Disclaimer:** This repository is **not affiliated** with the authors of the original paper. It is an independent reimplementation and may differ from the paper's methodology. If you have suggestions for improvements, feel free to share them!
## Project Status
The **network model implementation, training, and test dataset preparation are complete**. A full training cycle has been conducted using datasets published on Hugging Face, and test results are available. Some architectural differences from the original paper may exist. Feedback and pull requests are welcome.
## Reference
- **Paper:** [SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling](https://arxiv.org/abs/2407.01857)
## Features
- Conv-TasNet-based architecture with **S4D blocks** for efficient temporal modeling
- **Multiplicative adaptation** with d-vector speaker embeddings
- **1D convolutional blocks** for feature extraction
- **ONNX Runtime support** for CPU acceleration (AVX2 / AVX-512)
- **Designed for real-time inference** on mobile and server environments
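As a rough illustration of the multiplicative adaptation listed above, the sketch below scales each feature channel by the matching entry of a speaker embedding, broadcasting over time. It is a minimal NumPy sketch under assumptions, not the code in this repository: the function name and argument names are hypothetical, and the embedding is assumed to have already been projected to the number of feature channels.

```python
import numpy as np


def multiplicative_adaptation(features, speaker_embedding):
    """Scale each feature channel by the matching speaker-embedding entry.

    features:          array of shape (channels, frames)
    speaker_embedding: array of shape (channels,), assumed to be already
                       projected to the channel dimension (assumption)
    """
    features = np.asarray(features, dtype=np.float32)
    speaker_embedding = np.asarray(speaker_embedding, dtype=np.float32)
    if speaker_embedding.shape != (features.shape[0],):
        raise ValueError("embedding length must equal the number of channels")
    # Broadcast over the time axis: (C, 1) * (C, T) -> (C, T)
    return features * speaker_embedding[:, None]


# Toy example: 4 channels, 3 frames
features = np.ones((4, 3), dtype=np.float32)
embedding = np.array([0.0, 0.5, 1.0, 2.0], dtype=np.float32)
adapted = multiplicative_adaptation(features, embedding)
```

Channels whose embedding entry is near zero are suppressed and the others are rescaled, which is how the enrollment speaker can steer the extraction.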
## Installation
### Dependencies
Install required dependencies with:
```sh
pip install -r requirements.txt
```
## Usage
### Inference
Run speaker extraction on a given mixture and enrollment audio:
```sh
python inference.py \
--mixture data/sample/mixture_000001.wav \
--enrollment data/sample/enrollment_000001.wav \
--output data/sample/result_000001.wav
```
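To process several samples, the command above can be assembled programmatically. The following is a small sketch that only uses the three flags shown above; `sample_paths` and `build_inference_command` are hypothetical helpers, and the zero-padded file naming is assumed from the files under `data/sample/`.

```python
import sys
from pathlib import Path


def sample_paths(sample_dir, index):
    """Return (mixture, enrollment, output) paths for one sample index.

    Hypothetical helper: assumes the zero-padded naming used by the
    files under data/sample/.
    """
    sample_dir = Path(sample_dir)
    tag = f"{index:06d}"
    return (
        sample_dir / f"mixture_{tag}.wav",
        sample_dir / f"enrollment_{tag}.wav",
        sample_dir / f"result_{tag}.wav",
    )


def build_inference_command(mixture, enrollment, output):
    """Build the argv list for inference.py from its documented flags."""
    return [
        sys.executable, "inference.py",
        "--mixture", Path(mixture).as_posix(),
        "--enrollment", Path(enrollment).as_posix(),
        "--output", Path(output).as_posix(),
    ]


if __name__ == "__main__":
    for index in (1, 2):
        command = build_inference_command(*sample_paths("data/sample", index))
        # Print instead of executing; pass `command` to
        # subprocess.run(command, check=True) to actually run inference.
        print(" ".join(command))
```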
### Training
```sh
python train.py --mode=train
```
### Testing
```sh
python train.py --mode=test
```
The CSV metadata files for training, validation, and testing are downloaded automatically from Hugging Face and stored locally. The corresponding arguments and paths are:
```text
--train_csv data_csv/train/metadata.csv
--dev_csv data_csv/dev/metadata.csv
--test_csv data_csv/test/metadata.csv
```
## Dataset & Checkpoints
- **Test dataset and pretrained model available on Hugging Face:**
  https://huggingface.co/datasets/helloidea/OpenSpeakerBeam-SS-dataset/tree/main
- **Pretrained model:** `checkpoints/best_model.pth`
- **[Test] Test Loss (SI-SNR): -5.8925**
  *(Note: current performance is modest; improvements are planned.)*
- Evaluation result samples:
  - [enrollment audio 1](data/sample/enrollment_000001.wav)
  - [mixture audio 1](data/sample/mixture_000001.wav)
  - [result audio 1](data/sample/result_000001.wav)
  - [enrollment audio 2](data/sample/enrollment_000002.wav)
  - [mixture audio 2](data/sample/mixture_000002.wav)
  - [result audio 2](data/sample/result_000002.wav)
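For readers unfamiliar with the metric behind the reported test loss, here is a minimal NumPy sketch of scale-invariant signal-to-noise ratio (SI-SNR) in its standard form. A common convention is to train with the negative SI-SNR as the loss, in which case a loss of -5.8925 would correspond to roughly 5.89 dB; whether `train.py` follows that convention is an assumption, and the function names below are illustrative rather than taken from this repository.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals of equal length."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Remove the mean so a constant offset does not affect the score
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the "signal" part
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    # Whatever is left over counts as noise
    e_noise = estimate - s_target
    ratio = np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    return 10.0 * np.log10(ratio + eps)


def si_snr_loss(estimate, target):
    """Negative SI-SNR, so that minimizing the loss maximizes SI-SNR."""
    return -si_snr(estimate, target)


# Synthetic check: a cleaner estimate should score higher
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
clean_score = si_snr(target + 0.1 * noise, target)
noisy_score = si_snr(target + noise, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score unchanged.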
## Performance
Initial FLOP measurements on a 1-second input (16 kHz):
```
FLOPs: 21.60G, Params: 7.64M
```
- Expected to run **in real time on modern CPUs** with **AVX2 or AVX-512** optimizations.
- **Neon acceleration** planned for **iOS devices** via ONNX Runtime.
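The real-time expectation can be checked with simple arithmetic: 21.60 GFLOPs per second of audio means the processor must sustain at least 21.6 GFLOP/s on this workload. The sketch below works that out and derives a real-time factor (processing time divided by audio duration, where values below 1 mean faster than real time). The 100 GFLOP/s throughput is a hypothetical placeholder, not a measurement of any device.

```python
def required_gflops_per_second(gflops_per_chunk, chunk_seconds):
    """Sustained throughput needed to keep up with the incoming audio."""
    return gflops_per_chunk / chunk_seconds


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 is real time."""
    return processing_seconds / audio_seconds


# Figure reported above: 21.60 GFLOPs for 1 second of 16 kHz audio
needed = required_gflops_per_second(21.60, 1.0)

# Hypothetical sustained CPU throughput (placeholder, not a benchmark)
hypothetical_cpu_gflops = 100.0
processing_seconds = 21.60 / hypothetical_cpu_gflops
rtf = real_time_factor(processing_seconds, 1.0)
```

Sustained throughput on real hardware is usually well below the theoretical peak, so measuring the real-time factor directly on the target device is the more reliable check.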
## TODO
- Validate output quality
- Optimize model for mobile deployment
## License
TBD (likely MIT or Apache 2.0)
## Acknowledgments
This work is inspired by the original SpeakerBeam-SS paper and the Conv-TasNet framework.
**Speaker embeddings are generated using [Resemblyzer](https://github.com/resemble-ai/Resemblyzer/).**