{"id":19489369,"url":"https://github.com/salute-developers/gigaam","last_synced_at":"2025-04-05T00:05:09.779Z","repository":{"id":232152662,"uuid":"780573982","full_name":"salute-developers/GigaAM","owner":"salute-developers","description":"Foundational Model for Speech Recognition Tasks","archived":false,"fork":false,"pushed_at":"2025-03-04T11:13:16.000Z","size":1251,"stargazers_count":189,"open_issues_count":15,"forks_count":21,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-03-28T23:03:25.827Z","etag":null,"topics":["emotion-recognition","foundation-models","self-supervised-learning","speech-recognition"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/salute-developers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-01T18:56:50.000Z","updated_at":"2025-03-25T10:58:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"354646ea-c074-4f02-945a-f7d96dc2ecf4","html_url":"https://github.com/salute-developers/GigaAM","commit_stats":null,"previous_names":["salute-developers/gigaam"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salute-developers%2FGigaAM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salute-developers%2FGigaAM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salute-developers%2FGigaAM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/salute-develo
pers%2FGigaAM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/salute-developers","download_url":"https://codeload.github.com/salute-developers/GigaAM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247266562,"owners_count":20910836,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["emotion-recognition","foundation-models","self-supervised-learning","speech-recognition"],"created_at":"2024-11-10T21:08:19.339Z","updated_at":"2025-04-05T00:05:09.760Z","avatar_url":"https://github.com/salute-developers.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GigaAM: the family of open-source acoustic models for speech processing\n\n![plot](./gigaam_scheme.svg)\n\n## Latest News\n* 2024/12 — [MIT License](./LICENSE), GigaAM-v2 (**-15%** and **-12%** WER Reduction for CTC and RNN-T models, respectively), [ONNX export support](#onnx-inference-example)\n* 2024/05 — GigaAM-RNNT (**-19%** WER Reduction), [long-form inference using external Voice Activity Detection](#long-form-audio-transcribation)\n* 2024/04 — GigaAM Release: GigaAM-CTC ([SoTA Speech Recognition model for the Russian language](#performance-metrics-word-error-rate)), [GigaAM-Emo](#gigaam-emo-emotion-recognition)\n---\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Installation](#installation)\n- [GigaAM: The Foundational Model](#gigaam-the-foundational-model)\n- [GigaAM for Speech Recognition](#gigaam-for-speech-recognition)\n  - [GigaAM-CTC](#gigaam-ctc)\n  - 
[GigaAM-RNNT](#gigaam-rnnt)\n- [GigaAM-Emo: Emotion Recognition](#gigaam-emo-emotion-recognition)\n- [License](#license)\n- [Links](#links)\n\n---\n\n## Overview\n\nGigaAM (**Giga** **A**coustic **M**odel) is a family of open-source models for Russian speech processing tasks, including speech recognition and emotion recognition. The models are built on top of the [Conformer](https://arxiv.org/pdf/2005.08100.pdf) architecture and leverage self-supervised learning ([wav2vec2](https://arxiv.org/abs/2006.11477)-based for GigaAM-v1 and [HuBERT](https://arxiv.org/pdf/2106.07447)-based for GigaAM-v2).\n\nGigaAM models are state-of-the-art open-source solutions for their respective tasks in the Russian language.\n\nThis repository includes:\n\n- **GigaAM**: A foundational self-supervised model pre-trained on massive Russian speech datasets.\n- **GigaAM-CTC** and **GigaAM-RNNT**: Fine-tuned models for automatic speech recognition (ASR).\n- **GigaAM-Emo**: A fine-tuned model for emotion recognition.\n\n## Installation\n\n### Requirements\n- Python ≥ 3.8\n- [ffmpeg](https://ffmpeg.org/) installed and added to your system's PATH\n\n### Install the GigaAM Package\n\n1. Clone the repository:\n   ```bash\n   git clone https://github.com/salute-developers/GigaAM.git\n   cd GigaAM\n   ```\n\n2. Install the package in editable mode:\n   ```bash\n   pip install -e .\n   ```\n\n3. Verify the installation:\n   ```python\n   import gigaam\n   model = gigaam.load_model(\"ctc\")\n   print(model)\n   ```\n\n---\n\n## GigaAM: The Foundational Model\n\nGigaAM is a [Conformer](https://arxiv.org/pdf/2005.08100.pdf)-based foundational model (240M parameters) pre-trained on 50,000+ hours of diverse Russian speech data. 
\n\nIt serves as the backbone for the entire GigaAM family, enabling state-of-the-art fine-tuned performance in speech recognition and emotion recognition.\n\nTwo versions are available:\n\n* GigaAM-v1 was trained with a [wav2vec2](https://arxiv.org/abs/2006.11477)-like approach and can be used by loading the `v1_ssl` model version.\n* GigaAM-v2 was trained with a [HuBERT](https://arxiv.org/pdf/2106.07447)-like approach and yields higher-quality GigaAM-v2 ASR models. It can be used by loading the `v2_ssl` or `ssl` model version.\n\nMore information about GigaAM-v1 can be found in our [post on Habr](https://habr.com/ru/companies/sberdevices/articles/805569).\n\n### GigaAM Usage Example\n\n```python\nimport gigaam\naudio_path = \"example.wav\"  # path to your audio file\nmodel = gigaam.load_model('ssl') # Options: \"ssl\" (same as \"v2_ssl\"), \"v1_ssl\"\nembedding, _ = model.embed_audio(audio_path)\n```\n\n---\n\n## GigaAM for Speech Recognition\n\nWe fine-tuned the GigaAM encoder for ASR using two different architectures:\n\n- GigaAM-CTC was fine-tuned with [Connectionist Temporal Classification](https://www.cs.toronto.edu/~graves/icml_2006.pdf) and a character-based tokenizer.\n- GigaAM-RNNT was fine-tuned with [RNN Transducer loss](https://arxiv.org/abs/1211.3711).\n\nFine-tuning was done for both the GigaAM-v1 and GigaAM-v2 SSL models, which gives four ASR models: `v1` and `v2` versions of both CTC and RNNT.\n\n### Training Data\nThe models were trained on publicly available Russian datasets:\n\n| Dataset                | Size (hours) | Weight |\n|------------------------|--------------|--------|\n| Golos                 | 1227         | 0.6    |\n| SOVA                  | 369          | 0.2    |\n| Russian Common Voice  | 207          | 0.1    |\n| Russian LibriSpeech   | 93           | 0.1    |\n\n### Performance Metrics (Word Error Rate)\n| Model              | Parameters | Golos Crowd | Golos Farfield | OpenSTT YouTube | OpenSTT Phone Calls | OpenSTT Audiobooks | Mozilla Common Voice 12 | Mozilla Common Voice 19 | Russian 
LibriSpeech |\n|--------------------|------------|-------------|----------------|-----------------|----------------------|--------------------|-------|-------|---------------------|\n| Whisper-large-v3   | 1.5B       | 13.9        | 16.6           | 18.0            | 28.0                 | 14.4               | 5.7   | 5.5   | 9.5                 |\n| NVIDIA FastConformer | 115M       | 2.2         | 6.6            | 21.2            | 30.0                 | 13.9               | 2.7   | 5.7   | 11.3                |\n| **GigaAM-CTC-v1**  | 242M       | 3.0         | 5.7            | 16.0            | 23.2                 | 12.5               | 2.0   | 10.5  | 7.5                 |\n| **GigaAM-RNNT-v1** | 243M       | 2.3         | 5.0            | 14.0            | 21.7                 | 11.7               | 1.9   | 9.9   | 7.7                 |\n| **GigaAM-CTC-v2**  | 242M       | 2.5         | 4.3            | 14.1            | 21.1                 | 10.7               | 2.1   | 3.1   | 5.5                 |\n| **GigaAM-RNNT-v2** | 243M       | **\u003cspan style=\"color:green\"\u003e2.2\u003c/span\u003e**         | **\u003cspan style=\"color:green\"\u003e3.9\u003c/span\u003e**            | **\u003cspan style=\"color:green\"\u003e13.3\u003c/span\u003e**            | **\u003cspan style=\"color:green\"\u003e20.0\u003c/span\u003e**                | **\u003cspan style=\"color:green\"\u003e10.2\u003c/span\u003e**               | **\u003cspan style=\"color:green\"\u003e1.8\u003c/span\u003e**   | **\u003cspan style=\"color:green\"\u003e2.7\u003c/span\u003e**   | **\u003cspan style=\"color:green\"\u003e5.5\u003c/span\u003e**               |\n\n\n### Speech Recognition Example (GigaAM-ASR)\n\n   #### Basic usage: short audio transcribation (up to 30 seconds)\n\n   ```python\n   import gigaam\n   model_name = \"rnnt\"  # Options: \"v2_ctc\" or \"ctc\", \"v2_rnnt\" or \"rnnt\", \"v1_ctc\", \"v1_rnnt\"\n   model = gigaam.load_model(model_name)\n   transcription = 
model.transcribe(audio_path)\n   ```\n\n   #### Long-form audio transcribation\n   1. Install external VAD dependencies ([pyannote.audio](https://github.com/pyannote/pyannote-audio) library) with \n      ```bash\n      pip install gigaam[longform]\n      ```\n   2. \n      * Generate [Hugging Face API token](https://huggingface.co/docs/hub/security-tokens)\n      * Accept the conditions to access [pyannote/voice-activity-detection](https://huggingface.co/pyannote/voice-activity-detection) files and content.\n      * Accept the conditions to access [pyannote/segmentation](https://huggingface.co/pyannote/segmentation) files and content.\n   3. Use the `model.transcribe_longform` method:\n      ```python\n      import os\n      import gigaam\n\n      os.environ[\"HF_TOKEN\"] = \"\u003cHF_TOKEN\u003e\"\n\n      model = gigaam.load_model(\"ctc\")\n      recognition_result = model.transcribe_longform(\"long_example.wav\")\n\n      for utterance in recognition_result:\n         transcription = utterance[\"transcription\"]\n         start, end = utterance[\"boundaries\"]\n         print(f\"[{gigaam.format_time(start)} - {gigaam.format_time(end)}]: {transcription}\")\n      ```   \n\n   #### ONNX inference example\n\n   1. Export the model to ONNX using the `model.to_onnx` method:\n      ```python\n      onnx_dir = \"onnx\"\n      model_type = \"rnnt\" # or \"ctc\"\n\n      model = gigaam.load_model(\n         model_type,\n         fp16_encoder=False,  # only fp32 tensors\n         use_flash=False,  # disable flash attention\n      )\n      model.to_onnx(dir_path=onnx_dir)\n      ```\n   2. 
Run ONNX inference:\n      ```python\n      from gigaam.onnx_utils import load_onnx_sessions, transcribe_sample\n\n      sessions = load_onnx_sessions(onnx_dir, model_type)\n      transcribe_sample(\"example.wav\", model_type, sessions)\n      ```\n\n\nAll these examples can also be found in the [inference_example.ipynb](./inference_example.ipynb) notebook.\n\n---\n\n\n## GigaAM-Emo: Emotion Recognition\n\nGigaAM-Emo is a fine-tuned model for emotion recognition trained on the [Dusha](https://arxiv.org/pdf/2212.12266.pdf) dataset. It significantly outperforms existing models on several metrics.\n\n### Performance Metrics\n|  |  | Crowd |  |  | Podcast |  |\n| --- | --- | --- | --- | --- | --- | --- |\n|  | Unweighted Accuracy | Weighted Accuracy | Macro F1-score | Unweighted Accuracy | Weighted Accuracy | Macro F1-score |\n| [DUSHA](https://arxiv.org/pdf/2212.12266.pdf) baseline \u003cbr/\u003e ([MobileNetV2](https://arxiv.org/abs/1801.04381) + [Self-Attention](https://arxiv.org/pdf/1805.08318.pdf)) | 0.83 | 0.76 | 0.77 | 0.89 | 0.53 | 0.54 |\n| [АБК](https://aij.ru/archive?albumId=2\u0026videoId=337) ([TIM-Net](https://arxiv.org/pdf/2211.08233.pdf)) | 0.84 | 0.77 | 0.78 | \u003cspan style=\"color:green\"\u003e0.90\u003c/span\u003e | 0.50 | 0.55 |\n| GigaAM-Emo | \u003cspan style=\"color:green\"\u003e0.90\u003c/span\u003e | \u003cspan style=\"color:green\"\u003e0.87\u003c/span\u003e | \u003cspan style=\"color:green\"\u003e0.84\u003c/span\u003e | \u003cspan style=\"color:green\"\u003e0.90\u003c/span\u003e | \u003cspan style=\"color:green\"\u003e0.76\u003c/span\u003e | \u003cspan style=\"color:green\"\u003e0.67\u003c/span\u003e |\n\n### Emotion Recognition Example (GigaAM-Emo)\n\n```python\nfrom typing import Dict\n\nimport gigaam\nmodel = gigaam.load_model('emo')\nemotion2prob: Dict[str, float] = model.get_probs(\"example.wav\")\n\nprint(\", \".join([f\"{emotion}: {prob:.3f}\" for emotion, prob in emotion2prob.items()]))\n```\n\n---\n\n## License\n\nGigaAM's code and model weights are released 
under the [MIT License](./LICENSE).\n\n---\n\n## Links\n* [[habr] GigaAM: класс открытых моделей для обработки звучащей речи](https://habr.com/ru/companies/sberdevices/articles/805569)\n* [[youtube] Как научить LLM слышать: GigaAM 🤝 GigaChat Audio](https://www.youtube.com/watch?v=O7NSH2SAwRc)\n* [[youtube] GigaAM: Семейство акустических моделей для русского языка](https://youtu.be/PvZuTUnZa2Q?t=26442)\n* [[youtube] Speech-only Pre-training: обучение универсального аудиоэнкодера](https://www.youtube.com/watch?v=ktO4Mx6UMNk)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalute-developers%2Fgigaam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsalute-developers%2Fgigaam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsalute-developers%2Fgigaam/lists"}