{"id":13625789,"url":"https://github.com/elanmart/cbp-translate","last_synced_at":"2025-04-16T10:33:13.289Z","repository":{"id":64999219,"uuid":"577109129","full_name":"elanmart/cbp-translate","owner":"elanmart","description":null,"archived":false,"fork":false,"pushed_at":"2023-01-02T14:34:46.000Z","size":69069,"stargazers_count":1278,"open_issues_count":1,"forks_count":55,"subscribers_count":10,"default_branch":"main","last_synced_at":"2024-08-02T22:20:29.941Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/elanmart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-12-12T01:19:35.000Z","updated_at":"2024-07-29T14:36:07.000Z","dependencies_parsed_at":"2023-01-13T15:11:16.825Z","dependency_job_id":null,"html_url":"https://github.com/elanmart/cbp-translate","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elanmart%2Fcbp-translate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elanmart%2Fcbp-translate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elanmart%2Fcbp-translate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/elanmart%2Fcbp-translate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/elanmart","download_url":"https://codeload.github.com/elanmart/cbp-translate/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223708374,"owne
rs_count":17189765,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T21:02:02.062Z","updated_at":"2024-11-08T15:30:34.028Z","avatar_url":"https://github.com/elanmart.png","language":"Python","funding_links":[],"categories":["Miscellaneous","Python"],"sub_categories":[],"readme":"# 1. Introduction\n\n## 1.1 The main idea\n\nI finally got around to playing [Cyberpunk 2077](https://www.gog.com/en/game/cyberpunk_2077) the other day, and I noticed that the game has one interesting feature: \n\nwhen a character speaks a foreign language, the text first appears above them in the original form, and then gets sort of live-translated into English.\n\nI then asked myself: how much work would it take to build something like that with a modern DL stack? Is it possible to do it over a weekend? \n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://media.giphy.com/media/bUMahCWn8OpEusmJV8/giphy.gif\" alt=\"cyberpunk-example-gif\" width=\"600\"/\u003e\n\u003c/p\u003e\n\n## 1.2 The rough requirements\n\nI wanted to have a system which would\n- Process short video clips (e.g. 
a single scene)\n- Work with multiple characters / speakers\n- Detect and transcribe speech in both English and Polish\n- Translate the speech to any language\n- Assign each phrase to a speaker\n- Show the speaker on the screen \n- Add subtitles to the original video in a way mimicking the Cyberpunk example\n- Have a nice frontend\n- Run remotely in the cloud\n\n## 1.3 The TL;DR\n\nWith the amazing ML ecosystem we have today, it's definitely possible to build a PoC of a system like that in a couple of evenings. \n\nThe off-the-shelf tools are quite robust, and mostly extremely easy to integrate. What's more, the abundance of pre-trained models meant that I could build the whole app without running a single gradient update, or hand-labeling a single example.\n\nAs for the timelines -- it definitely took me more time than I anticipated, but actually most of the time was spent on non-ML issues (like figuring out how to add Unicode characters to a video frame).\n\nHere's a 60s clip of an interview conducted in Polish, translated to English. You can see that with a very clean setup like this, the results actually look quite OK!\n\nhttps://user-images.githubusercontent.com/10772830/208771745-37e64474-438b-418d-a99b-58c11657d5f2.mp4\n\nAnd here's a part of an interview with Keanu Reeves (who plays a major character in Cyberpunk 2077) talking to Stephen Colbert, translated to Polish.\n\nNote that in this case the speaker diarization is not perfect, and the speaker IDs get mixed up for a moment mid-video:\n\nhttps://user-images.githubusercontent.com/10772830/208771178-632b180a-231e-4a77-b578-f18cd23c3697.mp4\n\n\n# 2. Implementation \n\nI glued together a couple of tools to make this thing fly:\n- [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) for processing the video files (e.g. 
extracting audio, streaming raw frames)\n- [Whisper](https://github.com/openai/whisper) for speech recognition\n- [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) for speaker diarization (note: I also tested [PyAnnote](https://github.com/pyannote), but the results were not satisfactory)\n- [DeepL](https://github.com/DeepLcom/deepl-python) for translation\n- [RetinaFace](https://github.com/serengil/retinaface) for face detection\n- [DeepFace](https://github.com/serengil/deepface) for face embedding\n- [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) for detecting unique faces (via clustering)\n- [Gradio](https://github.com/gradio-app/gradio) for a nice demo frontend\n- [Modal](https://modal.com/) for serverless deployment \n\nThere's also [PIL](https://github.com/python-pillow/Pillow) \u0026 [OpenCV](https://github.com/opencv/opencv-python) used to annotate the video frames, and [yt-dlp](https://github.com/yt-dlp/yt-dlp) to download samples from YT.\n\nHere's a sketch of how these things work together to produce the final output:\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/resources/pipeline-diagram.png\" alt=\"data-flow-diagram\" width=\"75%\"/\u003e\n\u003c/p\u003e\n\n## 2.1 Handling the speech\n\nExtracting audio from a `webm` / `mp4` is trivial with `ffmpeg` \n\n```python\ndef extract_audio(path: str, path_out: Optional[str] = None):\n    \"\"\"Extract audio from a video file using ffmpeg\"\"\"\n\n    audio = ffmpeg.input(path).audio\n    output = ffmpeg.output(audio, path_out)\n    output = ffmpeg.overwrite_output(output)\n    ffmpeg.run(output, quiet=True)\n\n    return path_out\n```\n\nOnce we have the sound extracted, we can process it with:\n\n### 2.1.1 Whisper\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/openai/whisper/blob/main/approach.png?raw=true\" alt=\"Whisper Approach\" width=\"50%\"/\u003e\n\u003c/p\u003e\n\nThere isn't much to say about 
[Whisper](https://github.com/openai/whisper), really. \n\nIt's a fantastic tool, which recognizes English speech better than I do.\n\nIt handles multiple languages, and works okay even with overlapping speech.\n\nI decided to feed the whole audio stream to `whisper` as a single input. If you wanted to improve this part of the code, you could experiment with partitioning the audio by speaker, though my bet is that this would not give any better results. \n\n### 2.1.2 DeepL\n\n\u003cimg src=\"https://static.deepl.com/img/logo/DeepL_Logo_darkBlue_v2.svg\" alt=\"DeepL Logo\" width=\"5%\"/\u003e\n\nI could use a pre-trained Neural Machine Translation model here (or use `Whisper`, which can also translate speech into English), but I wanted to get the highest quality possible. \n\nIn my experience, [DeepL](https://www.deepl.com/) works better than Google Translate, and their API gives you 500k characters / month for free. \n\nThey also provide a convenient Python interface.\n\nTo improve this part of the code, one could try translating the text from each speaker separately; maybe the translation would then be even more coherent. But this would strongly depend on our ability to accurately assign the phrases to speakers.\n\n### 2.1.3 Speaker Diarization -- NeMo and PyAnnote\n\nSpeaker diarization is the process of assigning a speaker ID to each time point in the audio signal. 
\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/resources/diarization.png\" alt=\"Speaker Diarization example\"\u003e\n\u003c/p\u003e\n\n#### 2.1.3.1 PyAnnote\n\nI initially used [PyAnnote](https://github.com/pyannote) for this purpose, since it's available on `HuggingFace` and extremely straightforward to integrate into your codebase:\n\n```python\nfrom pyannote.audio import Pipeline\n\npipeline = Pipeline.from_pretrained(\n    \"pyannote/speaker-diarization@2.1.1\",\n    use_auth_token=auth_token,\n    cache_dir=cache_dir,\n)\n\ndia = pipeline(path_audio)\n```\n\nUnfortunately the quality was not really satisfactory, and errors in this part of the pipeline were hurting all of the downstream steps.\n\n#### 2.1.3.2 NeMo\n\nI then turned to [NeMo](https://github.com/NVIDIA/NeMo), from the good folks at `NVIDIA`. \n\nIn their words: \"NVIDIA NeMo is a conversational AI toolkit built for researchers working on automatic speech recognition (ASR), text-to-speech synthesis (TTS), large language models (LLMs), and natural language processing (NLP).\"\n\nI found it to be quite reliable, especially for English. It still struggles with short segments of overlapping speech, but it's definitely good enough for the demo. \n\nThe biggest downside is that `NeMo` is a research toolkit. Therefore simple tasks like \"give me unique IDs for this audio file\" result in code that is much messier than the `PyAnnote` version. \n\nNote that I mainly tested it on rather high-quality, interview-type audio. I do not know how this would translate to other scenarios or very different languages (e.g. Japanese).\n\n### 2.1.4 Matching speaker IDs to spoken phrases\n\nI used a simple heuristic here, where for every section of speech (output from `NeMo`) we find the phrase detected by `Whisper` with the largest overlap. \n\nThis part of the code could definitely be improved with a more sophisticated approach. 
It would also be good to look more into the timestamps returned by the two systems, since for some reason I had an impression that an offset \n\n## 2.2 Handling video streams\n\nThis is pretty straightforward with `cv2` and `ffmpeg`. The main tip is that for video processing, generators are the way to go -- you probably don't want to load a 1-minute video into a numpy array (`1920 * 1080 * 3 * 24 * 60` is roughly 9 billion entries, which takes `~9GB` of RAM as `uint8` and `~35GB` as `float32`).\n\n### 2.2.1 Detecting faces in video\n\nDetecting faces is luckily super straightforward with modern tools like `RetinaFace` or `MTCNN`. \n\nIn this first step we run a pre-trained model to detect all faces visible in each frame.\n\nWe then crop, align, and re-size them as required by the downstream embedding model, as in this example from the [original repo](https://github.com/serengil/retinaface)\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/serengil/retinaface/master/tests/outputs/alignment-procedure.png\" alt=\"Face alignment example\" width=\"75%\"\u003e\n\u003c/p\u003e\n\nThis step is quite robust and reliable; the only downside is that it relies on `TensorFlow`, and the code can only handle a single frame at a time. \n\nIt's quite time-consuming to run this detection for every frame in a video, so this part of the code could definitely use some optimizations.\n\nWith a modern GPU it takes several minutes to process ~60s of video.\n\nLuckily, with `Modal` we can use massive parallelization, so the runtime is shorter, even if processing happens on single-CPU machines.\n\n### 2.2.2 Embedding faces and assigning them unique IDs\n\nOnce we've located faces in each frame, we can use a pre-trained model to extract embeddings for each of them.\n\nFor this I've grabbed the `FaceNet512` model from the [DeepFace](https://github.com/serengil/deepface) library.\n\nOnce embeddings are extracted, we still need to assign them unique IDs. 
\nTo do this, I went with a simple hierarchical clustering algorithm (or specifically, [Agglomerative Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering) from `scikit-learn`)\n\nAgglomerative Clustering will recursively merge clusters as long as the distance between them is below a certain threshold. That threshold is model- and metric-specific. Here I used the same value that `DeepFace` uses when performing \"face verification\".\n\nThis part of the code could be improved in many ways:\n\n- Improve the clustering algorithm by either\n    - Using a different algorithm (e.g. DBSCAN)\n    - Using more domain knowledge (e.g. the fact that faces with similar locations in consecutive frames are likely to be the same person, no two faces in a single frame can be a single person etc.)\n- Investigate if it would be a good idea to identify a couple of \"best\" frames where the face is in the best position, and use them as a template.\n- Enforce temporal consistency -- predictions should not be made for each frame in isolation. \n- Improve the embeddings themselves, e.g. by using a combination of models, or different distance metrics?\n\nHere's a visual representation of how Agglomerative Clustering works -- by changing the threshold (cutoff on the Y-axis) you will end up with a different number of clusters.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://scikit-learn.org/stable/_images/sphx_glr_plot_agglomerative_dendrogram_001.png\" alt=\"Agglomerative Clustering Dendrogram\"\u003e\n\u003c/p\u003e\n\n### 2.2.3 Matching Face IDs to Speaker IDs \n\nFor this we employ another simple heuristic: for each face, we create a set of frames where that face was detected. \n\nWe then do the same for speakers -- create a set of frames where a given speaker can be heard. 
\n\nNow, for each face ID we find the speaker ID for which the Jaccard index between the two sets is maximized, i.e. the speaker whose frames overlap the most with the frames of that face.\n\n### 2.2.4 Generating the frames\n\nOnce we have annotated every frame with a speaker ID, face ID, phrase in the original language, and phrase in the translated language -- we can finally add subtitles.\n\nEven though our system does not work in real time, I wanted to give it a similar look to the Cyberpunk example -- so as a last processing step I calculate how many characters from a recognized phrase should be displayed on the frame.\n\nWhat remains now is to figure out how to place the subtitles such that they fit on the screen etc.\n\nThis part of the code could be improved to handle more languages. To place UTF-8 characters on the screen, I need to explicitly pass a path to a font file to `PIL`. The problem is that different languages require different fonts, so the current solution won't work e.g. for Korean.\n\n## 2.3 Deployment\n\n### 2.3.1 Gradio frontend\n\nThe last thing I wanted to check is how easy it is to deploy this system on a cloud platform.\n\nGetting the frontend ready can be trivially done with [Gradio](https://gradio.app/docs/). \n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/resources/gradio-screenshot.png\" alt=\"Gradio Screenshot\" width=\"70%\"\u003e\n\u003c/p\u003e\n\n### 2.3.2 Serverless backend with Modal and FastAPI\n\nWe could try to deploy the model with [Huggingface Spaces](https://huggingface.co/docs/hub/spaces-sdks-gradio), but I wanted to try something a bit more \"production-ready\".\n\nI went ahead with [Modal](https://modal.com/) -- a serverless platform built by [Erik Bernhardsson](https://erikbern.com/) and his team. You can read more about it [in his blog post](https://erikbern.com/2022/12/07/what-ive-been-working-on-modal.html)\n\n`Modal` is really appealing since it allows me to write the code exactly the way I imagined programming for the cloud should look. 
What you'd write locally as:\n\n```python\ndef run_asr(audio_path: str):\n    return whisper.transcribe(audio_path)\n\n\ndef process_single_frame(frame: np.ndarray, text: str):\n    frame = add_subtitles(frame, text)\n    return frame\n```\n\nWith `Modal` becomes\n\n```python\n@stub.function(image=gpu_image, gpu=True)\ndef run_asr(audio_path: str):\n    return whisper.transcribe(audio_path)\n\n\n@stub.function(image=cpu_image)\ndef process_single_frame(frame: np.ndarray, text: str):\n    frame = add_subtitles(frame, text)\n    return frame\n```\n\nSo with minimal boilerplate we now have code that can run remotely **within seconds**. Pretty wild.\n\nThere are obviously still some rough edges (`Modal` is still in beta), and I had to work around one last issue: when running a `FastAPI` app, there is a 45-second limit for each request. And since processing a video takes a bit longer, I used a not-so-nice workaround, where pressing `Submit` for the first time gives you the job id, and you can use that id to fetch the final result:\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"assets/resources/modal-step-1.png\" alt=\"Gradio Screenshot\" width=\"70%\"\u003e\n\u003c/p\u003e\n\n# Limitations\n\nThis is very obviously just a demo / proof-of-concept!\n\nThe main limitations are:\n- Processing 30s of video takes several minutes on a modern PC\n- The approach used here will not work well for clips with multiple scenes\n- Matching faces to voices relies on a simple co-occurrence heuristic, and will not work in certain scenarios (e.g. if the whole conversation between two people is recorded from a single angle)\n- All the steps of the pipeline rely on imperfect tools (e.g. diarization) or simplistic heuristics (e.g. 
finding unique faces with agglomerative clustering)\n- The pipeline was only tested on a handful of examples\n\n# Development\n\n## With Modal\n\nFirst of all, you'll need a `Modal` account, see https://modal.com/\n\nYou'll then need to add your `HuggingFace` (`HUGGINGFACE_TOKEN`) and `DeepL` (`DEEPL_KEY`) authentication tokens as `Secret`s in the `Modal` dashboard. \n\nOnce this is set up, you should be able to simply install the dependencies:\n\n```\npython -m venv ./venv\nsource ./venv/bin/activate\npython -m pip install -r requirements-modal.txt\n```\n\nYou can then run it with\n\n```\npython cbp_translate/app.py\n```\n\n## Locally\n\nFirst, export the necessary env variables:\n\n```bash\nexport DEEPL_KEY=\"...\"\nexport HUGGINGFACE_TOKEN=\"...\"\nexport MODAL_RUN_LOCALLY=1\n```\n\nThen, it's best if you look at the steps defined in `cbp_translate/modal_/remote.py`\n\nRoughly, it goes something like this:\n\n```bash\n# Apt\nsudo apt install ffmpeg libsndfile1 git build-essential\n\n# We skip conda steps, assuming you have cuda and cudnn installed \necho \"Skipping CUDA installation\"\n\n# pip\npython -m venv ./venv\nsource ./venv/bin/activate\npython -m pip install --upgrade pip setuptools\npython -m pip install -r requirements-local.txt\n\n# Install the package for development\npython setup.py develop\n```\n\nRun the CLI:\n\n```bash\npython cbp_translate/cli.py \\\n    --path-in ./assets/videos/keanu-reeves-interview.mp4 \\\n    --path-out ./translated.mp4 \\\n    --language PL\n```\n\n# Note on git-lfs\n\nThere are several large files included in this repo, which are stored using `git-lfs`.\n\nTo clone the repo without downloading the large files, run:\n\n```bash\nGIT_LFS_SKIP_SMUDGE=1 git clone 
...\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felanmart%2Fcbp-translate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felanmart%2Fcbp-translate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felanmart%2Fcbp-translate/lists"}