{"id":15013888,"url":"https://github.com/pulijon/sttcast","last_synced_at":"2025-04-12T05:50:29.379Z","repository":{"id":139452593,"uuid":"588407202","full_name":"pulijon/Sttcast","owner":"pulijon","description":"Transcription from mp3 files to html with or without embedded player","archived":false,"fork":false,"pushed_at":"2025-04-05T06:59:40.000Z","size":80550,"stargazers_count":17,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-05T07:29:32.353Z","etag":null,"topics":["ansible","artificial-intelligence","automation","aws-ec2","aws-s3","diarization","g4dn","gpu","iac","puppet","python","terraform","transcription","vagrant","vosk-engine","whisper","whisperx"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pulijon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-13T03:19:36.000Z","updated_at":"2025-04-05T06:59:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"0b0a96a7-b477-427f-817e-f8c2db138af8","html_url":"https://github.com/pulijon/Sttcast","commit_stats":{"total_commits":94,"total_committers":1,"mean_commits":94.0,"dds":0.0,"last_synced_commit":"b0cfaf269b9a23c086d4d03bccf6f120ff5fdfab"},"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulijon%2FSttcast","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulijon%2FSttcast/tags","releases_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/pulijon%2FSttcast/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pulijon%2FSttcast/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pulijon","download_url":"https://codeload.github.com/pulijon/Sttcast/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248525153,"owners_count":21118616,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ansible","artificial-intelligence","automation","aws-ec2","aws-s3","diarization","g4dn","gpu","iac","puppet","python","terraform","transcription","vagrant","vosk-engine","whisper","whisperx"],"created_at":"2024-09-24T19:44:53.925Z","updated_at":"2025-04-12T05:50:29.357Z","avatar_url":"https://github.com/pulijon.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Rationale for sttcast.py\n\nSTT (Speech To Text) technology is becoming increasingly popular. Virtual assistants such as Alexa, Siri, Cortana or Google Assistant are able to understand voice commands and act on them.\n\nEvery big cloud provider has its own API to transcribe voice to text. Results are usually good. However, if you want (as I do) to convert collections of podcasts (hundreds of hours) to text, you must consider the time and cost of the operation.\n\nThere are open source projects such as Vosk-Kaldi that can help with this task. **sttcast.py** uses its Python API to transcribe podcasts offline, downloaded as mp3 files.\n\nOpenAI Whisper is also worth mentioning. 
It is a very interesting alternative, although it is also more resource-intensive. It has been included as an optional engine for **sttcast.py**.\n\n# Requirements\n\nThe requirements for **sttcast.py** are as follows:\n\n* A Python 3.x installation (tested with Python 3.10 on Windows and Linux)\n* The **ffmpeg** tool, installed in a folder listed in the PATH variable.\n* A vosk model for the desired language (many are available at [alphacephei](https://alphacephei.com/vosk/models); it has been tested with the Spanish model [vosk-model-es-0.42](https://alphacephei.com/vosk/models/vosk-model-small-es-0.42.zip))\n  \nPython dependencies can be installed in a virtual environment. The dependencies are specified in the file `requirements.txt`. The following commands install them on Linux (they should presumably also work on macOS):\n\n```bash\npython -m venv .venv\nsource .venv/bin/activate\npip install -r requirements.txt\n```\n\nOn Windows, `python` may be `py`, depending on the installation, and the activation script is in `.venv\\Scripts`.\n\nDetailed instructions for creating and activating the virtual environment are available [here](https://dev.to/shriekdj/how-to-create-and-activate-the-virtual-environment-for-python3-project-3g4l).\n\n# How does sttcast.py work\n\nAs transcription is a CPU-intensive operation, **sttcast.py** uses Python multiprocessing (you have probably heard of the GIL limitations that affect multithreading and coroutines in Python). **sttcast.py** splits the entire job (the transcription of a podcast, perhaps several hours long) into fragments of s seconds (s is an optional parameter, 600 seconds by default). \n\nIf the task is to transcribe a large number of files (clearly exceeding the number of available CPUs), sttcast.py can transcribe multiple files in parallel without splitting them, thus avoiding potential issues at the split boundaries. 
In this case, it is advisable to set the number of seconds to a value greater than the duration of the longest file (36000 seconds, or 10 hours, should be sufficient for almost all files).\n\n**sttcast.py** converts the mp3 file to wav in order to use the vosk API. The main process passes the wav file as an argument to each of the worker tasks, each of which processes one fragment of audio (only part of the total frames of the wav file). The tasks are delivered to a pool of **c** processes (**c** is another optional parameter, equal by default to the number of CPUs of the system minus 2). In this way, the system can run **c** tasks in parallel.\n\nEach fragment is transcribed into a separate HTML file. Words of the transcribed text are highlighted in different colors to show the confidence level of the transcription. The vosk-kaldi library delivers, with each word, its confidence as a number from 0 to 1. **sttcast** supports 4 configurable levels of confidence:\n\n* Very high confidence (text is shown in black)\n* High confidence (text is shown in green)\n* Medium confidence (text is shown in orange)\n* Low confidence (text is shown in red)\n\nFragments of text are also tagged with time stamps to make it easier to search and listen to the mp3 file. If the --audio-tags option is selected, there is also an HTML5 audio player configured to play the file from the beginning of the segment.\n\nThe **add_audio_tag.py** tool adds audio controls to an HTML file transcribed without the **--audio-tags** option. 
It requires:\n\n* BeautifulSoup\n  ```bash\n  pip install bs4\n  ```\nOnce all fragments have been transcribed, the last step is to merge them all into a single HTML file.\n\nMetadata from the mp3 file is included in the title of the HTML file.\n\n# Use of OpenAI whisper library\n\n**sttcast** has an option (--whisper) to use the OpenAI whisper library instead of the vosk-kaldi one.\n\nIf you want to transcribe with whisper, you should take the following into account:\n\n* Whisper can work with different models. You can choose one with the --whmodel option\n* With the --whisper option, you can take advantage of CUDA acceleration (option --whdevice cuda) or not (--whdevice cpu). Without CUDA, whisper manages multiprocessing itself, so you will not notice any benefit from configuring multiple CPUs.\n* Transcriptions are very slow without CUDA acceleration\n* CUDA acceleration requires a good CUDA platform. \n* CUDA acceleration does benefit from multiple CPUs (option --cpus)\n\nThe following table, taken from the [Whisper GitHub Repository](https://github.com/openai/whisper), shows the GPU memory requirements and relative speed of the whisper models:\n\n| Model | Required VRAM | Relative speed |\n|---|---|---|\n| tiny | ~1 GB | ~32x |\n| base | ~1 GB | ~16x |\n| small | ~2 GB | ~6x |\n| medium | ~5 GB | ~2x |\n| large | ~10 GB | 1x |\n\n\n\n# Use\n## YouTube Tutorial\n\nThere is [a video on YouTube](https://www.youtube.com/watch?v=l7TtUFJio2g) with general instructions on how to install and use the application.\n\n## CLI\n\n**sttcast.py** is a Python module that runs with a Python 3.x interpreter. 
\n\nIt has a very simple CLI, documented in its help output (option **-h** or **--help**).\n\nConsider placing model files and mp3 files on RAM drives for more speed.\n\n```bash\n$ ./sttcast.py -h\nusage: sttcast.py [-h] [-m MODEL] [-s SECONDS] [-c CPUS] [-i HCONF] [-n MCONF] [-l LCONF] [-o OVERLAP]\n                  [-r RWAVFRAMES] [-w] [--whmodel WHMODEL] [--whdevice {cuda,cpu}] [--whlanguage WHLANGUAGE]\n                  [--whtraining WHTRAINING] [--whsusptime WHSUSPTIME] [-a] [--html-suffix HTML_SUFFIX]\n                  [--min-offset MIN_OFFSET] [--max-gap MAX_GAP]\n                  fnames [fnames ...]\n\npositional arguments:\n  fnames                archivos de audio o directorios a transcribir\n\noptions:\n  -h, --help            show this help message and exit\n  -m MODEL, --model MODEL\n                        modelo a utilizar. Por defecto, /mnt/ram/es/vosk-model-es-0.42\n  -s SECONDS, --seconds SECONDS\n                        segundos de cada tarea. Por defecto, 600\n  -c CPUS, --cpus CPUS  CPUs (tamaño del pool de procesos) a utilizar. Por defecto, 10\n  -i HCONF, --hconf HCONF\n                        umbral de confianza alta. Por defecto, 0.95\n  -n MCONF, --mconf MCONF\n                        umbral de confianza media. Por defecto, 0.7\n  -l LCONF, --lconf LCONF\n                        umbral de confianza baja. Por defecto, 0.5\n  -o OVERLAP, --overlap OVERLAP\n                        tiempo de solapamientro entre fragmentos. Por defecto, 2\n  -r RWAVFRAMES, --rwavframes RWAVFRAMES\n                        número de tramas en cada lectura del wav. Por defecto, 4000\n  -w, --whisper         utilización de motor whisper\n  --whmodel WHMODEL     modelo whisper a utilizar. Por defecto, small\n  --whdevice {cuda,cpu}\n                        aceleración a utilizar. Por defecto, cuda\n  --whlanguage WHLANGUAGE\n                        lenguaje a utilizar. 
Por defecto, es\n  --whtraining WHTRAINING\n                        nombre del fichero de entrenamiento. Por defecto, 'training.mp3'\n  --whsusptime WHSUSPTIME\n                        tiempo mínimo de intervención en el segmento. Por defecto, 60.0\n  -a, --audio-tags      inclusión de audio tags\n  --html-suffix HTML_SUFFIX\n                        sufijo para el fichero HTML con el resultado. Por defecto '_result'\n  --min-offset MIN_OFFSET\n                        diferencia mínima entre inicios de marcas de tiempo. Por defecto 30\n  --max-gap MAX_GAP     diferencia máxima entre el inicio de un segmento y el final del anterior. Por encima de\n                        esta diferencia, se pone una nueva marca de tiempo . Por defecto 0.8\n\n\n```\n\n**add_audio_tag.py** \n\n```bash\n$ ./add_audio_tag.py -h\nusage: add_audio_tag.py [-h] [--mp3-file MP3_FILE] [-o OUTPUT] html_file\n\npositional arguments:\n  html_file             Fichero html para añadir audio tags\n\noptions:\n  -h, --help            show this help message and exit\n  --mp3-file MP3_FILE   Fichero mp3 al que se refieren los audio tags\n  -o OUTPUT, --output OUTPUT\n                        Fichero resultado tras añadir los audio tags\n\n```\n\n\n\n## GUI\n\nSince version v2.2.0, sttcast also has a GUI. It can be started with:\n\n```bash\n$ python ./sttcast-gui.py\n```\n\nWith this interface, you can configure graphically the same arguments that the CLI supports.\n\nThe following is a screenshot of the interface:\n\n![](sttcast-gui.png)\n\n# Automation\n\nThe whisper engine needs a GPU to run in a reasonable time. If you don't have a machine with GPU acceleration, or if you prefer not to install sttcast in your environment, you can use the automation procedure explained in the ```Automation``` directory.\n\nAutomation creates an AWS EC2 instance in the Amazon cloud, provisions it by installing sttcast, uploads the payload and downloads the results. 
And all with just two commands: one to create the resources in the cloud and perform the work, and another to destroy the resources.\n\nCommands are executed in a VM also created with one command.\n\n## Diarization\n\nThe **Whisper/Pyannote pipeline** is used to identify speakers in audio files. **Pyannote** is an AI-powered project hosted on **HuggingFace** that performs **speaker diarization**, clustering segments of speech by speaker identity. Since Pyannote performs **clustering** rather than **identification**, it does not inherently assign real names to speakers.\n\n### HuggingFace Token Requirement\nTo use Pyannote, you need to obtain a **HuggingFace read access token**. This token should be stored in the **HUGGINGFACE_TOKEN** environment variable for authentication.\n\n### Assigning Real Speaker Names\nBecause Pyannote clusters voices instead of identifying them explicitly, the program overcomes this limitation by appending **recognized voices** to the audio file before processing. This allows the system to **match unidentified segments to the closest known voice cluster** and assign a corresponding speaker label.\n\n### Training Metadata Storage\nThe **trainingmp3.py** utility generates a **training MP3 file** containing known speaker samples. The **speaker identifiers** are stored as metadata in this training file.\n\n### Speaker Identification Process\nThe complete process follows these steps:\n\n1. **Generate identified audio samples**  \n   - Extract speaker samples and prepare training segments.  \n\n2. **Concatenate selected fragments into a single training MP3 file**  \n   - Configuration is defined in the **training.yml** file.  \n   - The **trainingmp3.py** module handles this task.  \n\n3. 
**Run sttcast with the specified training file**  \n   - The training MP3 file is used to improve speaker labeling.\n\n### Output and Analysis\nThe generated **Whisper HTML files** label speech segments with speaker names and include final comments indicating the **total speaking time for each participant**.\n\nThe **speakingtime.py** utility extracts this speaker time data and stores it in a **CSV file**, which can be analyzed using the **Jupyter Notebook speakingtimes.ipynb**.\n\n### Tools\n\n**trainingmp3.py** is a Python module that generates an MP3 file with identified voices, which is used to perform diarization correctly.\n\n```bash\n$ python trainingmp3.py -h\nusage: trainingmp3.py [-h] [-c CONFIG] [-o OUTPUT] [-s SILENCE] [-t TIME]\n\nGenera un archivo de entrenamiento a partir de audios etiquetados en un YAML.\n\noptions:\n  -h, --help            show this help message and exit\n  -c CONFIG, --config CONFIG\n                        Archivo YAML con la lista de hablantes y sus archivos de audio (Predeterminado: training.yml).\n  -o OUTPUT, --output OUTPUT\n                        Nombre del archivo de salida (MP3). Predeterminado: training.mp3.\n  -s SILENCE, --silence SILENCE\n                        Duración del silencio entre hablantes en segundos. (Predeterminado: 5s)\n  -t TIME, --time TIME  Duración total del fragmento en segundos. 
(Predeterminado: 600)\n```\n\nExample of configuration file:\n\n```yaml\n---\nF01:\n  name: Héctor Socas\n  files:\n    - Training/Coffee Break/Héctor Socas - 1.mp3\n    - Training/Coffee Break/Héctor Socas - 2.mp3\nF02:\n  name: Héctor Vives\n  files:\n    - Training/Coffee Break/Héctor Vives - 1.mp3\n    - Training/Coffee Break/Héctor Vives - 2.mp3\nF03:\n  name: Sara Robisco\n  files:\n    - Training/Coffee Break/Sara Robisco - 1.mp3\n    - Training/Coffee Break/Sara Robisco - 2.mp3\nF04:\n  name: Francis Villatoro\n  files:\n    - Training/Coffee Break/Francis Villatoro - 1.mp3\nF05: # Noisy environment\n  name: Héctor Socas\n  files:\n    - Training/Coffee Break/Héctor Socas - 3.mp3\n```\n\n\n**speakingtime.py** is a Python module that extracts the total speaking times of speakers from the Whisper HTML files generated by sttcast and saves them into a CSV file.\n\n```bash\n$ python speakingtime.py -h\nusage: speakingtime.py [-h] [-o OUTPUT] fnames [fnames ...]\n\npositional arguments:\n  fnames                Archivos con transcripciones de audio\n\noptions:\n  -h, --help            show this help message and exit\n  -o OUTPUT, --output OUTPUT\n                        Nombre del archivo de salida\n```\n\n\n\n## To Do\n\nMany modifications can be made and will be made in the future.\n\n* In **sttcast**, the number of **CPUs** can be configured (in **Automation**, this is done with the ```app_exec role``` variables). Each file is divided into that number of pieces and assigned to a Python process. It would be much more intelligent to divide the work to be done (several MP3s) into subsets of similar sizes and start a sttcast process with each subset with the number of **CPUs** equal to 1. This way, time would be optimized, and potential boundary issues between pieces could be avoided. 
**(Done 2024-07-06)**\n* Add a relevance-based search system based on Elasticsearch and Kibana\n\n# Screenshots\n\n![transcription with diarization](transcription_with_diarization.png)\n\n![speaking times](speaking_times.png)\n\n\u003c!-- ![](sttcast_example.png) --\u003e\n\n![](sttcast-gui.png)\n\n\u003c!-- ![comparation vosk - whisper](comparation_vosk_whisper.png)\n\n![example audio tag](example_audio_tag.png) --\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpulijon%2Fsttcast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpulijon%2Fsttcast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpulijon%2Fsttcast/lists"}