# GigaAM: the family of open-source acoustic models for speech processing

![plot](./gigaam_scheme.svg)

## Latest News
* 2024/12 — [MIT License](./LICENSE), GigaAM-v2 (**-15%** and **-12%** WER reduction for the CTC and RNN-T models, respectively), [ONNX export support](#onnx-inference-example)
* 2024/05 — GigaAM-RNNT (**-19%** WER reduction), [long-form inference using external Voice Activity Detection](#long-form-audio-transcription)
* 2024/04 — GigaAM release: GigaAM-CTC ([SoTA speech recognition model for the Russian language](#performance-metrics-word-error-rate)), [GigaAM-Emo](#gigaam-emo-emotion-recognition)
---

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [GigaAM: The Foundational Model](#gigaam-the-foundational-model)
- [GigaAM for Speech Recognition](#gigaam-for-speech-recognition)
- [GigaAM-Emo: Emotion Recognition](#gigaam-emo-emotion-recognition)
- [License](#license)
- [Links](#links)

---

## Overview

GigaAM (**Giga** **A**coustic **M**odel) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the [Conformer](https://arxiv.org/pdf/2005.08100.pdf) architecture and leverage self-supervised learning ([wav2vec2](https://arxiv.org/abs/2006.11477)-based for GigaAM-v1 and [HuBERT](https://arxiv.org/pdf/2106.07447)-based for GigaAM-v2).

GigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.

This repository includes:

- **GigaAM**: A foundational self-supervised model pre-trained on massive Russian speech datasets.
- **GigaAM-CTC** and **GigaAM-RNNT**: Fine-tuned models for automatic speech recognition (ASR).
- **GigaAM-Emo**: A fine-tuned model for emotion recognition.

## Installation

### Requirements
- Python ≥ 3.8
- [ffmpeg](https://ffmpeg.org/) installed and available on your system's PATH (a quick check is shown below)
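
A minimal sanity check for both requirements (illustrative only, not part of the package):

```python
import shutil
import sys

# Illustrative check: verify the Python version and that ffmpeg is on PATH.
assert sys.version_info >= (3, 8), "GigaAM requires Python 3.8 or newer"
assert shutil.which("ffmpeg"), "ffmpeg was not found on PATH"
print("Requirements satisfied")
```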

### Install the GigaAM Package

1. Clone the repository:
```bash
git clone https://github.com/salute-developers/GigaAM.git
cd GigaAM
```

2. Install the package in editable mode:
```bash
pip install -e .
```

3. Verify the installation:
```python
import gigaam

model = gigaam.load_model("ctc")  # loads the CTC ASR model
print(model)  # prints the model architecture if setup succeeded
```

---

## GigaAM: The Foundational Model

GigaAM is a [Conformer](https://arxiv.org/pdf/2005.08100.pdf)-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data.

It serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.

Two versions are available:

* GigaAM-v1 was trained with a [wav2vec2](https://arxiv.org/abs/2006.11477)-like approach; load it with the `v1_ssl` model version.
* GigaAM-v2 was trained with a [HuBERT](https://arxiv.org/pdf/2106.07447)-like approach, which yields higher-quality downstream ASR models; load it with the `v2_ssl` or `ssl` model version.

More information about GigaAM-v1 can be found in our [post on Habr](https://habr.com/ru/companies/sberdevices/articles/805569).

### GigaAM Usage Example

```python
import gigaam

audio_path = "example.wav"  # path to your audio file
model = gigaam.load_model("ssl")  # options: "ssl", "v1_ssl"
embedding, _ = model.embed_audio(audio_path)
```
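
The returned `embedding` holds frame-level features. To get a single utterance-level vector, a common choice is mean pooling over time; a minimal sketch, assuming `embedding` is a PyTorch tensor shaped `(batch, frames, hidden_dim)`:

```python
# Assumption: embedding is a torch.Tensor of shape (batch, frames, hidden_dim).
utterance_vector = embedding.mean(dim=1)  # average over the time axis
print(utterance_vector.shape)  # -> (batch, hidden_dim)
```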

---

## GigaAM for Speech Recognition

We fine-tuned the GigaAM encoder for ASR using two different architectures:

- GigaAM-CTC was fine-tuned with [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf) and a character-based tokenizer.
- GigaAM-RNNT was fine-tuned with [RNN Transducer loss](https://arxiv.org/abs/1211.3711).

Fine-tuning was performed on both the GigaAM-v1 and GigaAM-v2 SSL models, giving four ASR models in total: `v1` and `v2` versions of both the CTC and RNNT variants (the corresponding model identifiers are shown in the sketch below).
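
All four variants are loaded through the same `gigaam.load_model` call; this sketch just enumerates their identifiers (each call downloads the corresponding checkpoint):

```python
import gigaam

# The four fine-tuned ASR variants and their model identifiers.
for name in ("v1_ctc", "v1_rnnt", "v2_ctc", "v2_rnnt"):
    model = gigaam.load_model(name)
    print(f"{name}: loaded")
```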

### Training Data
The models were fine-tuned on publicly available Russian datasets; the weight column gives each dataset's relative weight in the training mix:

| Dataset | Size (hours) | Weight |
|------------------------|--------------|--------|
| Golos | 1227 | 0.6 |
| SOVA | 369 | 0.2 |
| Russian Common Voice | 207 | 0.1 |
| Russian LibriSpeech | 93 | 0.1 |

### Performance Metrics (Word Error Rate)
| Model | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian LibriSpeech |
|--------------------|------------|-------------|----------------|-----------------|----------------------|--------------------|-------|-------|---------------------|
| Whisper-large-v3 | 1.5B | 13.9 | 16.6 | 18.0 | 28.0 | 14.4 | 5.7 | 5.5 | 9.5 |
| NVIDIA FastConformer | 115M | 2.2 | 6.6 | 21.2 | 30.0 | 13.9 | 2.7 | 5.7 | 11.3 |
| **GigaAM-CTC-v1** | 242M | 3.0 | 5.7 | 16.0 | 23.2 | 12.5 | 2.0 | 10.5 | 7.5 |
| **GigaAM-RNNT-v1** | 243M | 2.3 | 5.0 | 14.0 | 21.7 | 11.7 | 1.9 | 9.9 | 7.7 |
| **GigaAM-CTC-v2** | 242M | 2.5 | 4.3 | 14.1 | 21.1 | 10.7 | 2.1 | 3.1 | 5.5 |
| **GigaAM-RNNT-v2** | 243M | **2.2** | **3.9** | **13.3** | **20.0** | **10.2** | **1.8** | **2.7** | **5.5** |
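
For reference, WER values like those above can be computed with a third-party package such as [jiwer](https://github.com/jitsi/jiwer); a minimal sketch (jiwer is not a dependency of this repository):

```python
import gigaam
from jiwer import wer  # third-party: pip install jiwer

model = gigaam.load_model("rnnt")
reference = "пример расшифровки аудио"        # ground-truth transcript
hypothesis = model.transcribe("example.wav")  # model prediction
print(f"WER: {wer(reference, hypothesis):.3f}")
```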

### Speech Recognition Example (GigaAM-ASR)

#### Basic usage: short audio transcription (up to 30 seconds)

```python
import gigaam

audio_path = "example.wav"  # path to your audio file
model_name = "rnnt"  # options: "v2_ctc" or "ctc", "v2_rnnt" or "rnnt", "v1_ctc", "v1_rnnt"
model = gigaam.load_model(model_name)
transcription = model.transcribe(audio_path)
```

#### Long-form audio transcription
1. Install the external VAD dependency (the [pyannote.audio](https://github.com/pyannote/pyannote-audio) library):
```bash
pip install gigaam[longform]
```
2. Set up Hugging Face access:
   * Generate a [Hugging Face API token](https://huggingface.co/docs/hub/security-tokens).
   * Accept the conditions to access the [pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection) files and content.
   * Accept the conditions to access the [pyannote/segmentation](https://huggingface.co/pyannote/segmentation) files and content.
3. Use the `model.transcribe_longform` method:
```python
import os

import gigaam

os.environ["HF_TOKEN"] = ""  # your Hugging Face API token

model = gigaam.load_model("ctc")
recognition_result = model.transcribe_longform("long_example.wav")

for utterance in recognition_result:
    transcription = utterance["transcription"]
    start, end = utterance["boundaries"]
    print(f"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}")
```

#### ONNX inference example

1. Export the model to ONNX using the `model.to_onnx` method:
```python
import gigaam

onnx_dir = "onnx"
model_type = "rnnt"  # or "ctc"

model = gigaam.load_model(
    model_type,
    fp16_encoder=False,  # export fp32 tensors only
    use_flash=False,  # disable flash attention
)
model.to_onnx(dir_path=onnx_dir)
```
2. Run ONNX inference:
```python
from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample

sessions = load_onnx_sessions(onnx_dir, model_type)
transcribe_sample("example.wav", model_type, sessions)
```

All of these examples can also be found in the [inference_example.ipynb](./inference_example.ipynb) notebook.

---

## GigaAM-Emo: Emotion Recognition

GigaAM-Emo is a fine-tuned model for emotion recognition trained on the [Dusha](https://arxiv.org/pdf/2212.12266.pdf) dataset. It significantly outperforms existing models on several metrics.

### Performance Metrics
| Model | Unweighted Accuracy (Crowd) | Weighted Accuracy (Crowd) | Macro F1 (Crowd) | Unweighted Accuracy (Podcast) | Weighted Accuracy (Podcast) | Macro F1 (Podcast) |
|---|---|---|---|---|---|---|
| [DUSHA](https://arxiv.org/pdf/2212.12266.pdf) baseline ([MobileNetV2](https://arxiv.org/abs/1801.04381) + [Self-Attention](https://arxiv.org/pdf/1805.08318.pdf)) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |
| [АБК](https://aij.ru/archive?albumId=2&videoId=337) ([TIM-Net](https://arxiv.org/pdf/2211.08233.pdf)) | 0.84 | 0.77 | 0.78 | 0.90 | 0.50 | 0.55 |
| GigaAM-Emo | 0.90 | 0.87 | 0.84 | 0.90 | 0.76 | 0.67 |

### Emotion Recognition Example (GigaAM-Emo)

```python
from typing import Dict

import gigaam

model = gigaam.load_model("emo")
emotion2prob: Dict[str, float] = model.get_probs("example.wav")

print(", ".join(f"{emotion}: {prob:.3f}" for emotion, prob in emotion2prob.items()))
```
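
To pick the single most likely emotion from the probabilities returned above:

```python
# emotion2prob maps emotion labels to probabilities (from the example above).
top_emotion = max(emotion2prob, key=emotion2prob.get)
print(f"Predicted emotion: {top_emotion}")
```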

---

## License

GigaAM's code and model weights are released under the [MIT License](./LICENSE).

---

## Links
* [[habr] GigaAM: a family of open models for spoken speech processing](https://habr.com/ru/companies/sberdevices/articles/805569)
* [[youtube] How to teach an LLM to hear: GigaAM 🤝 GigaChat Audio](https://www.youtube.com/watch?v=O7NSH2SAwRc)
* [[youtube] GigaAM: a family of acoustic models for the Russian language](https://youtu.be/PvZuTUnZa2Q?t=26442)
* [[youtube] Speech-only pre-training: training a universal audio encoder](https://www.youtube.com/watch?v=ktO4Mx6UMNk)