# Real-time Speech-to-Text Translation

This project provides a real-time speech-to-text translation solution. It captures audio from the microphone, processes it, transcribes it into text, and translates it into a target language. It uses the **Silero** model for Voice Activity Detection (VAD), the **Whisper** model for transcription, and **Opus-MT** for translation. The output can go to ***stdout***, a ***JSON file***, or ***websockets***.

#### 🖥️ Print Output Demo


*[Print Demo]*

#### 🌍 WebSocket Output Demo


*[WebSocket Demo]*

## Architecture Overview
*[Architecture Diagram]*

## Features

- Real-time speech capture and processing using **Silero** VAD (Voice Activity Detection)
- Speech-to-text transcription using the **Whisper** model
- Translation of transcriptions from a source language to a target language
- Multithreaded design for efficient processing
- Different output modes: stdout, **JSON** file, WebSocket server

## Prerequisites

Before running the project, you need to install the following system dependencies:

- **PortAudio** (for audio input handling)
- **FFmpeg** (for audio and video processing)
- On Ubuntu/Debian-based systems, you can install them with:
```bash
sudo apt-get install portaudio19-dev ffmpeg
```

## Installation

**(RECOMMENDED)**: Install this package inside a virtual environment to avoid dependency conflicts.
```bash
python -m venv .venv
source .venv/bin/activate
```

**Install** the [PyPI package](https://pypi.org/project/live-translation/0.3.2/):
```bash
pip install live-translation
```

**Verify** the installation:
```bash
python -c "import live_translation; print('live-translation installed successfully!')"
```

## Usage

> **NOTE**: One can safely ignore the following warnings that might appear on **Linux** systems:
>
> ```
> ALSA lib pcm_dsnoop.c:567:(snd_pcm_dsnoop_open) unable to open slave
> ALSA lib pcm_dmix.c:1000:(snd_pcm_dmix_open) unable to open slave
> ALSA lib pcm.c:2722:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
> ALSA lib pcm.c:2722:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
> ALSA lib pcm.c:2722:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
> ALSA lib pcm_dmix.c:1000:(snd_pcm_dmix_open) unable to open slave
> Cannot connect to server socket err = No such file or directory
> Cannot connect to server request channel
> jack server is not running or cannot be started
> JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
> JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
> ```

### CLI
live-translation can be run directly from the command line:
```bash
live-translate [OPTIONS]
```

**[OPTIONS]**
```bash
usage: live-translate [-h] [--silence_threshold SILENCE_THRESHOLD] [--vad_aggressiveness {0,1,2,3,4,5,6,7,8,9}]
                      [--max_buffer_duration {5,6,7,8,9,10}] [--device {cpu,cuda}]
                      [--whisper_model {tiny,base,small,medium,large,large-v2}]
                      [--trans_model {Helsinki-NLP/opus-mt,Helsinki-NLP/opus-mt-tc-big}] [--src_lang SRC_LANG]
                      [--tgt_lang TGT_LANG] [--output {print,file,websocket}] [--ws_port WS_PORT] [--transcribe_only]

Live Translation Pipeline - Configure runtime settings.

options:
  -h, --help            show this help message and exit
  --silence_threshold SILENCE_THRESHOLD
                        Number of consecutive 32ms silent chunks to detect SILENCE.
                        SILENCE clears the audio buffer for transcription/translation.
                        NOTE: Minimum value is 16.
                        Default is 65 (~ 2s).
  --vad_aggressiveness {0,1,2,3,4,5,6,7,8,9}
                        Voice Activity Detection (VAD) aggressiveness level (0-9).
                        Higher values mean VAD has to be more confident to detect speech vs silence.
                        Default is 8.
  --max_buffer_duration {5,6,7,8,9,10}
                        Max audio buffer duration in seconds before trimming it.
                        Default is 7 seconds.
  --device {cpu,cuda}   Device for processing ('cpu', 'cuda').
                        Default is 'cpu'.
  --whisper_model {tiny,base,small,medium,large,large-v2}
                        Whisper model size ('tiny', 'base', 'small', 'medium', 'large', 'large-v2').
                        Default is 'base'.
  --trans_model {Helsinki-NLP/opus-mt,Helsinki-NLP/opus-mt-tc-big}
                        Translation model ('Helsinki-NLP/opus-mt', 'Helsinki-NLP/opus-mt-tc-big').
                        NOTE: Don't include source and target languages here.
                        Default is 'Helsinki-NLP/opus-mt'.
  --src_lang SRC_LANG   Source/Input language for transcription (e.g., 'en', 'fr').
                        Default is 'en'.
  --tgt_lang TGT_LANG   Target language for translation (e.g., 'es', 'de').
                        Default is 'es'.
  --output {print,file,websocket}
                        Output method ('print', 'file', 'websocket').
                        - 'print': Prints transcriptions and translations to stdout.
                        - 'file': Saves structured JSON data (see below) in ./transcripts/transcriptions.json.
                        - 'websocket': Sends structured JSON data (see below) over WebSocket.
                        JSON format for 'file' and 'websocket':
                        {
                            "timestamp": "2025-03-06T12:34:56.789Z",
                            "transcription": "Hello world",
                            "translation": "Hola mundo"
                        }.
                        Default is 'print'.
  --ws_port WS_PORT     WebSocket port for sending transcriptions.
                        Required if --output is 'websocket'.
  --transcribe_only     Transcribe only mode. No translations are performed.
```
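
For example, an invocation that transcribes English speech and streams Spanish translations over a WebSocket (port 8765 is only an illustrative value) could look like:
```bash
live-translate --src_lang en --tgt_lang es --output websocket --ws_port 8765
```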

- In case of **websocket** output, one can connect to the server using **curl**, **wscat**, etc.:
```bash
curl --include --no-buffer ws://localhost:<ws_port>
```
```bash
wscat -c ws://localhost:<ws_port>
```

### API
You can also import and use live_translation directly in your Python code.
The following is a ***simple*** example of running *live_translation* in a server/client fashion.
For more detailed examples see [examples/](/examples/).

- **Server**
```python
from live_translation.config import Config
from live_translation.app import LiveTranslationApp


def main():
    config = Config(
        device="cpu",
        output="websocket",
        ws_port=8765,
    )

    # Create and start the Live Translation App
    app = LiveTranslationApp(config)
    app.run()


# The main guard is CRITICAL on systems that use the spawn method to create new processes.
# This is the case on Windows and macOS.
if __name__ == "__main__":
    main()
```

- **Client**
```python
import asyncio
import websockets
import json


async def listen():
    uri = "ws://localhost:8765"
    async with websockets.connect(uri) as websocket:
        print("🔌 Connected to Live Translation WebSocket server.")

        try:
            while True:
                message = await websocket.recv()
                data = json.loads(message)

                print(f"⏳ Timestamp: {data['timestamp']}")
                print(f"📝 Transcription: {data['transcription']}")
                print(f"🌍 Translation: {data['translation']}\n")

        except websockets.exceptions.ConnectionClosed:
            print("WebSocket connection closed.")


asyncio.run(listen())
```
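
Run the server script in one terminal first, then the client in another; the client prints each incoming JSON message as it arrives.

The `Config` object carries the runtime settings that the CLI exposes as flags. As a minimal sketch, a transcription-only setup that prints to stdout might look like the following; the keyword names here are assumed to mirror the CLI options and should be checked against the actual `Config` signature:
```python
from live_translation.config import Config
from live_translation.app import LiveTranslationApp


def main():
    # Assumed keyword names (mirroring the CLI flags); verify against Config.
    config = Config(
        device="cpu",
        whisper_model="base",
        src_lang="en",
        output="print",
        transcribe_only=True,
    )

    app = LiveTranslationApp(config)
    app.run()


# Keep the main guard, as noted above, for platforms that spawn new processes.
if __name__ == "__main__":
    main()
```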

## Development

To contribute to or modify this project, the following steps might be helpful:
> **NOTE**: The workflow below is intended for Linux-based systems. On other systems, one might need to do some steps manually, e.g., run tests with `python -m pytest -s tests/` instead of `make test`.
> See **Makefile** for more details.

**Clone** the repository:
```bash
git clone git@github.com:AbdullahHendy/live-translation.git
cd live-translation
```

**Create** a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```

**Install** Dependencies:
```bash
pip install --upgrade pip
pip install -r requirements.txt
```

**Test** the package:
```bash
make test
```

**Build** the package:
```bash
make build
```
> **NOTE**: Building does ***lint*** and checks for ***formatting*** using [ruff](https://docs.astral.sh/ruff/). One can do that separately using `make format` and `make lint`. For linting and formatting rules, see the [ruff config](/ruff.toml).

> **NOTE**: Building generates a ***.whl*** file that can be ***pip*** installed in a new environment for testing.

**If needed**, run the program within the virtual environment:
```bash
python -m live_translation.cli [OPTIONS]
```
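
For instance, a quick smoke test of a development checkout could run transcription only, printing to stdout (flags as documented under [Usage](#usage)):
```bash
python -m live_translation.cli --src_lang en --transcribe_only
```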

## Tested Environment

This project was tested and developed on the following system configuration:

- **Architecture**: x86_64 (64-bit)
- **Operating System**: Ubuntu 24.10 (Oracular Oriole)
- **Kernel Version**: 6.11.0-18-generic
- **Python Version**: 3.12.7
- **Processor**: 13th Gen Intel(R) Core(TM) i9-13900HX
- **GPU**: GeForce RTX 4070 Max-Q / Mobile [^1]
- **RAM**: 16GB DDR5
- **Dependencies**: All required dependencies are listed in `requirements.txt` and [Prerequisites](#prerequisites)

[^1]: CUDA is not utilized, as the `DEVICE` configuration is set to `"cpu"`. Additional NVIDIA driver, CUDA, and cuDNN installation would be needed if the `"cuda"` option were used.

## Improvements

- **Better Error Handling**: Improve error handling across various components (audio, transcription, translation) to ensure the system is robust and can handle unexpected scenarios gracefully.
- **Performance Optimization**: Investigate performance bottlenecks including checking sleep durations and optimizing concurrency management to minimize lag.
- **Concurrency Design Check**: Review and optimize the concurrency design to ensure thread safety and prevent issues such as race conditions and deadlocks; revisit the current design in which ***AudioRecorder*** is a thread while ***AudioProcessor***, ***Transcriber***, and ***Translator*** are processes.
- **Logging**: Integrate detailed logging to track system activity, errors, and performance metrics using a more formal logging framework.

## Citations
```bibtex
@article{Whisper,
  title     = {Robust Speech Recognition via Large-Scale Weak Supervision},
  url       = {https://arxiv.org/abs/2212.04356},
  author    = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  publisher = {arXiv},
  year      = {2022}
}

@misc{SileroVAD,
  author       = {Silero Team},
  title        = {Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier},
  year         = {2021},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad}},
  email        = {hello@silero.ai}
}

@article{tiedemann2023democratizing,
  title     = {Democratizing neural machine translation with {OPUS-MT}},
  author    = {Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal   = {Language Resources and Evaluation},
  number    = {58},
  pages     = {713--755},
  year      = {2023},
  publisher = {Springer Nature},
  issn      = {1574-0218},
  doi       = {10.1007/s10579-023-09704-w}
}

@InProceedings{TiedemannThottingal:EAMT2020,
  author    = {J{\"o}rg Tiedemann and Santhosh Thottingal},
  title     = {{OPUS-MT} — {B}uilding open translation services for the {W}orld},
  booktitle = {Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT)},
  year      = {2020},
  address   = {Lisbon, Portugal}
}
```