# GigaAM: the family of open-source acoustic models for speech processing

## Latest News
* 2024/12 — [MIT License](./LICENSE), GigaAM-v2 (**-15%** and **-12%** WER reduction for the CTC and RNN-T models, respectively), [ONNX export support](#onnx-inference-example)
* 2024/05 — GigaAM-RNNT (**-19%** WER reduction), [long-form inference using external Voice Activity Detection](#long-form-audio-transcription)
* 2024/04 — GigaAM release: GigaAM-CTC ([SoTA speech recognition model for the Russian language](#performance-metrics-word-error-rate)), [GigaAM-Emo](#gigaam-emo-emotion-recognition)
---

## Table of Contents
- [Overview](#overview)
- [Installation](#installation)
- [GigaAM: The Foundational Model](#gigaam-the-foundational-model)
- [GigaAM for Speech Recognition](#gigaam-for-speech-recognition)
- [GigaAM-CTC](#gigaam-ctc)
- [GigaAM-RNNT](#gigaam-rnnt)
- [GigaAM-Emo: Emotion Recognition](#gigaam-emo-emotion-recognition)
- [License](#license)
- [Links](#links)

---
## Overview
GigaAM (**Giga** **A**coustic **M**odel) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the [Conformer](https://arxiv.org/pdf/2005.08100.pdf) architecture and leverage self-supervised learning ([wav2vec2](https://arxiv.org/abs/2006.11477)-based for GigaAM-v1 and [HuBERT](https://arxiv.org/pdf/2106.07447)-based for GigaAM-v2).
GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.
This repository includes:
- **GigaAM**: A foundational self-supervised model pre-trained on massive Russian speech datasets.
- **GigaAM-CTC** and **GigaAM-RNNT**: Fine-tuned models for automatic speech recognition (ASR).
- **GigaAM-Emo**: A fine-tuned model for emotion recognition.

## Installation
### Requirements
- Python ≥ 3.8
- [ffmpeg](https://ffmpeg.org/) installed and added to your system's PATH
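If you want to verify that ffmpeg is actually visible from Python, a quick check (a minimal sketch, not part of the package) is:

```python
import shutil

# shutil.which returns the binary's path, or None if it is not on PATH.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH; install it before using GigaAM")
```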
### Install the GigaAM Package
1. Clone the repository:
```bash
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM
```
2. Install the package in editable mode:
```bash
pip install -e .
```
3. Verify the installation:
```python
import gigaam
model = gigaam.load_model("ctc")
print(model)
```

---
## GigaAM: The Foundational Model
GigaAM is a [Conformer](https://arxiv.org/pdf/2005.08100.pdf)-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.
It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.
There are two available versions:
* GigaAM-v1 was trained with a [wav2vec2](https://arxiv.org/abs/2006.11477)-like approach and can be used by loading the `v1_ssl` model version.
* GigaAM-v2 was trained with a [HuBERT](https://arxiv.org/pdf/2106.07447)-like approach and yields higher-quality downstream ASR models. It can be used by loading the `v2_ssl` or `ssl` model version.

More information about GigaAM-v1 can be found in our [post on Habr](https://habr.com/ru/companies/sberdevices/articles/805569).
### GigaAM Usage Example
```python
import gigaam
model = gigaam.load_model('ssl') # Options: "ssl", "v1_ssl"
audio_path = "example.wav"  # path to your audio file
embedding, _ = model.embed_audio(audio_path)
```
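As a purely illustrative follow-up, the frame-level features can be pooled into one vector per utterance, e.g. to compare two recordings. This sketch assumes `embed_audio` returns a `(batch, time, dim)` tensor plus lengths, and the file names are placeholders; verify the shapes against your installed version:

```python
import torch
import gigaam

model = gigaam.load_model("ssl")

# Assumption: features are shaped (batch, time, dim); the second value is lengths.
emb_a, _ = model.embed_audio("utterance_a.wav")
emb_b, _ = model.embed_audio("utterance_b.wav")

# Mean-pool over the time axis to get a single vector per utterance.
vec_a = emb_a.mean(dim=1)
vec_b = emb_b.mean(dim=1)

print(f"cosine similarity: {torch.cosine_similarity(vec_a, vec_b).item():.3f}")
```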
---

## GigaAM for Speech Recognition
We fine-tuned the GigaAM encoder for ASR using two different architectures:
- GigaAM-CTC was fine-tuned with [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf) and a character-based tokenizer.
- GigaAM-RNNT was fine-tuned with [RNN Transducer loss](https://arxiv.org/abs/1211.3711).

Fine-tuning was done for both the GigaAM-v1 and GigaAM-v2 SSL models, so there are four ASR models: `v1` and `v2` versions for both CTC and RNNT; a quick way to try all four is sketched below.
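As an illustration, the following sketch loads each of the four released checkpoints by name and transcribes the same file (`example.wav` is a placeholder path):

```python
import gigaam

# The four fine-tuned ASR checkpoints: v1/v2 encoders with CTC or RNNT heads.
for name in ["v1_ctc", "v1_rnnt", "v2_ctc", "v2_rnnt"]:
    model = gigaam.load_model(name)
    print(name, "->", model.transcribe("example.wav"))
```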
### Training Data
The models were trained on publicly available Russian datasets:

| Dataset | Size (hours) | Weight |
|------------------------|--------------|--------|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |
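The `Weight` column is each dataset's sampling proportion during fine-tuning. As an illustration only (this is not the project's training code), such weights are typically applied by drawing the source corpus for every training example:

```python
import random

# Sampling weights from the table above (illustration, not training code).
datasets = ["Golos", "SOVA", "Russian Common Voice", "Russian LibriSpeech"]
weights = [0.6, 0.2, 0.1, 0.1]

# Choose which corpus the next training example is drawn from.
source = random.choices(datasets, weights=weights, k=1)[0]
print(source)
```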
### Performance Metrics (Word Error Rate)
| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|--------------------|------------|-------------|----------------|-----------------|----------------------|--------------------|-------|-------|---------------------|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| **GigaAM-CTC-v1** | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| **GigaAM-RNNT-v1** | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| **GigaAM-CTC-v2** | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| **GigaAM-RNNT-v2** | 243M | **2.2** | **3.9** | **13.3** | **20.0** | **10.2** | **1.8** | **2.7** | **5.5** |
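For reference, WER is the word-level edit distance between a hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained implementation (not taken from this repository):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("привет мир", "привет миру"))  # 0.5
```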
### Speech Recognition Example (GigaAM-ASR)

#### Basic usage: short audio transcription (up to 30 seconds)
```python
import gigaam
model_name = "rnnt" # Options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
audio_path = "example.wav"  # path to your audio file
transcription = model.transcribe(audio_path)
```

#### Long-form audio transcription
1. Install the external VAD dependencies (the [pyannote.audio](https://github.com/pyannote/pyannote-audio) library):
```bash
pip install gigaam[longform]
```
2. Set up Hugging Face access:
   * Generate a [Hugging Face API token](https://huggingface.co/docs/hub/security-tokens).
   * Accept the conditions to access the [pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection) files and content.
   * Accept the conditions to access the [pyannote/segmentation](https://huggingface.co/pyannote/segmentation) files and content.
3. Use the `model.transcribe_longform` method:
```python
import os
import gigaam

os.environ["HF_TOKEN"] = ""  # your Hugging Face API token
model = gigaam.load_model("ctc")
recognition_result = model.transcribe_longform("long_example.wav")

for utterance in recognition_result:
transcription = utterance["transcription"]
start, end = utterance["boundaries"]
print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")
```
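Continuing from the snippet above, the utterance boundaries can be turned into subtitles. This is a sketch, not part of the package: it assumes `boundaries` holds start/end times in seconds, and `to_srt_time` is a helper defined here, not a library function.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1_000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

# Write the long-form result from the previous example as an .srt file.
with open("long_example.srt", "w", encoding="utf-8") as srt:
    for index, utterance in enumerate(recognition_result, start=1):
        start, end = utterance["boundaries"]
        srt.write(f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n")
        srt.write(f"{utterance['transcription']}\n\n")
```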
#### ONNX inference example
1. Export the model to ONNX using the `model.to_onnx` method:
```python
import gigaam

onnx_dir = "onnx"
model_type = "rnnt"  # or "ctc"

model = gigaam.load_model(
model_type,
fp16_encoder=False, # only fp32 tensors
use_flash=False, # disable flash attention
)
model.to_onnx(dir_path=onnx_dir)
```
2. Run ONNX inference:
```python
from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample

sessions = load_onnx_sessions(onnx_dir, model_type)
transcribe_sample("example.wav", model_type, sessions)
```

All these examples can also be found in the [inference_example.ipynb](./inference_example.ipynb) notebook.
---
## GigaAM-Emo: Emotion Recognition
GigaAM-Emo is a fine-tuned model for emotion recognition trained on the [Dusha](https://arxiv.org/pdf/2212.12266.pdf) dataset. It significantly outperforms existing models on several metrics.
### Performance Metrics
| Model | Crowd Unweighted Accuracy | Crowd Weighted Accuracy | Crowd Macro F1-score | Podcast Unweighted Accuracy | Podcast Weighted Accuracy | Podcast Macro F1-score |
| --- | --- | --- | --- | --- | --- | --- |
| [DUSHA](https://arxiv.org/pdf/2212.12266.pdf) baseline ([MobileNetV2](https://arxiv.org/abs/1801.04381) + [Self-Attention](https://arxiv.org/pdf/1805.08318.pdf)) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| [АБК](https://aij.ru/archive?albumId=2&videoId=337) ([TIM-Net](https://arxiv.org/pdf/2211.08233.pdf)) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| **GigaAM-Emo** | **0.90** | **0.87** | **0.84** | **0.90** | **0.76** | **0.67** |

### Emotion Recognition Example (GigaAM-Emo)
```python
from typing import Dict

import gigaam

model = gigaam.load_model('emo')
emotion2prob: Dict[str, float] = model.get_probs("example.wav")

print(", ".join([f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()]))
```
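To pick a single label from the returned distribution, take the highest-probability key (a trivial follow-up to the example above):

```python
# Emotion with the highest predicted probability.
top_emotion = max(emotion2prob, key=emotion2prob.get)
print(top_emotion)
```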
---

## License
GigaAM's code and model weights are released under the [MIT License](./LICENSE).
---
## Links
* [[habr] GigaAM: a class of open models for spoken speech processing](https://habr.com/ru/companies/sberdevices/articles/805569)
* [[youtube] How to teach an LLM to hear: GigaAM 🤝 GigaChat Audio](https://www.youtube.com/watch?v=O7NSH2SAwRc)
* [[youtube] GigaAM: a family of acoustic models for the Russian language](https://youtu.be/PvZuTUnZa2Q?t=26442)
* [[youtube] Speech-only pre-training: training a universal audio encoder](https://www.youtube.com/watch?v=ktO4Mx6UMNk)