{"id":22887227,"url":"https://github.com/chanulee/monalisa-silent-speech","last_synced_at":"2025-03-31T19:13:54.028Z","repository":{"id":262661687,"uuid":"887963047","full_name":"chanulee/monalisa-silent-speech","owner":"chanulee","description":null,"archived":false,"fork":false,"pushed_at":"2024-11-13T15:28:22.000Z","size":11,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-06T23:31:03.063Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chanulee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-13T15:24:48.000Z","updated_at":"2024-11-13T15:28:27.000Z","dependencies_parsed_at":"2024-11-13T16:32:49.816Z","dependency_job_id":"1b506682-4faa-481c-88f0-30d251d2c55f","html_url":"https://github.com/chanulee/monalisa-silent-speech","commit_stats":null,"previous_names":["chanulee/monalisa-silent-speech"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanulee%2Fmonalisa-silent-speech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanulee%2Fmonalisa-silent-speech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanulee%2Fmonalisa-silent-speech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chanulee%2Fmonalisa-silent-speech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/chanulee","download_url":"https://codeload.github.com/chanulee/monalisa-silent-speech/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246523872,"owners_count":20791444,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-13T20:31:43.899Z","updated_at":"2025-03-31T19:13:53.997Z","avatar_url":"https://github.com/chanulee.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Silent Speech\nThis repo is an attempt to understand and learn from Mona Lisa.  \nThe major difference from the Mona Lisa project is that we want to run this on Colab and integrate ear EEG.\n\nThis project was originally forked from https://github.com/Leoputera2407/tyler_silent_speech\n\nThe notebook in which I tried to recreate this: [main.ipynb](https://github.com/chanulee/silent_speech/blob/main/main.ipynb)\n\n### Resources\n- repo: https://github.com/Leoputera2407/silent_speech\n- LibriSpeech: https://huggingface.co/datasets/openslr/librispeech_asr\n- Gaddy Dataset: https://doi.org/10.5281/zenodo.4064408\n- paper: https://arxiv.org/abs/2403.05583\n\n## Approach 1: Breaking down [Paper reproduction](#paper-reproduction) \n0. run `notebooks/tyler/2023-07-17_cache_dataset_with_attrs_.py` [See How ↓](#0-2023-07-17_cache_dataset_with_attrs_py) or [view file](notebooks/tyler/2023-07-17_cache_dataset_with_attrs_.py)\n1. run `notebooks/tyler/batch_model_sweep.sh` (`2024-01-15_icml_models.py`)\n2. run `notebooks/tyler/2024-01-26_icml_pred.py`\n3. 
run `notebooks/tyler/batch_beam_search.sh` (`2024-01-26_icml_beams.py`)\n4. run `notebooks/tyler/2024-01-28_icml_figures.py`\n5. run `notebooks/tyler/2024-01-31_icml_TEST.py`\n\n## 0. 2023-07-17_cache_dataset_with_attrs_.py\n[view file (modified version)](notebooks/tyler/2023-07-17_cache_dataset_with_attrs_.py) or [view file (original tyler version)](https://github.com/Leoputera2407/tyler_silent_speech/blob/main/notebooks/tyler/2023-07-17_cache_dataset_with_attrs_.py)\n\n### Spotted problems \u0026 ways to solve them\n- [x] The current code is based on SLURM, running on SHERLOCK\n  - [x] ```ON_SHERLOCK = False```\n- [x] Loading and caching the LibriSpeech dataset ```librispeech_datasets = load_dataset(\"librispeech_asr\")``` keeps timing out on Colab\n  - [ ] Test them again in separate blocks\n  - [ ] Or is there another way to do this, such as downloading the full dataset before running this line and then caching it separately? (Given that the download step is the reason for the timeout / 'manual approach' block on Colab)\n- [ ] **What are ```sessions_dir```, ```scratch_directory``` and ```gaddy_dir``` supposed to be?**\n```python\nsessions_dir = \"/data/magneto/\"\nscratch_directory = \"/scratch\"\ngaddy_dir = \"/scratch/GaddyPaper/\"\n```\n  - [ ] And where is the root directory of `/data/magneto` and `/scratch`?\n- [ ] Is ```\"librispeech-cache\"``` generated automatically when `load_dataset` runs?\n- [ ] ```\"librispeech-alignments\"``` and ```alignment_dir```\n  - [ ] Run MFA to generate these alignments, as shown below in this README: [Montreal forced aligner ↓](#montreal-forced-aligner)\n- [ ] While running ```from dataloaders import LibrispeechDataset, cache_dataset```, an error occurred because ```os.environ[\"SCRATCH\"]``` was not set.\n  - [ ] ```os.environ[\"SCRATCH\"] = \"/scratch\"```\n  - [ ] But the tyler_silent_speech repo does not set this... why? 
Is it already set somewhere else?\n\n## Approach 2: Breaking down [Brain-to-text '24 reproduction](#brain-to-text-24-reproduction)  \n1. Train 10 models of the [Pytorch NPTL baseline RNN](https://github.com/cffan/neural_seq_decoder)\n2. Run beam search with the 5-gram model. The average validation WER should be approximately 14.6%\n3. run `notebooks/tyler/2024-02-13_wiilet_competition.py`. The validation WER of LISA should be approximately 13.7% without finetuning, or 10.2% with finetuning.\n\n## 1. Pytorch NPTL baseline RNN\n### Spotted problems \u0026 ways to solve them\n- [ ] Train this model... with what?\n\n---\n\n# Archive of original repo\n## MONA LISA\n\nThis repository contains code for training Multimodal Orofacial Neural Audio (MONA) and Large Language\nModel (LLM) Integrated Scoring Adjustment\n(LISA). Together, MONA LISA sets a new state of the art for decoding silent speech, achieving 7.3% WER on validation data for open vocabulary.\n\n[See the preprint on arxiv](https://arxiv.org/abs/2403.05583).\n\n### Paper reproduction\nFirst you will need to download the [Gaddy 2020 dataset](https://doi.org/10.5281/zenodo.4064408). Then, the following scripts can be modified and run in order on SLURM or a local machine. An individual model trains on one A100 for 24-48 hours depending on loss functions (supTcon increases train time by ~75%). The full model sweep as done in the paper trains 60 models.\n0) run `notebooks/tyler/2023-07-17_cache_dataset_with_attrs_.py`\n1) run `notebooks/tyler/batch_model_sweep.sh` (`2024-01-15_icml_models.py`)\n2) run `notebooks/tyler/2024-01-26_icml_pred.py`\n3) run `notebooks/tyler/batch_beam_search.sh` (`2024-01-26_icml_beams.py`)\n4) run `notebooks/tyler/2024-01-28_icml_figures.py`\n5) run `notebooks/tyler/2024-01-31_icml_TEST.py`\n\n### Brain-to-text '24 reproduction\n1) Train 10 models of the [Pytorch NPTL baseline RNN](https://github.com/cffan/neural_seq_decoder)\n2) Run beam search with the 5-gram model. 
The average validation WER should be approximately 14.6%\n3) run `notebooks/tyler/2024-02-13_wiilet_competition.py`. The validation WER of LISA should be approximately 13.7% without finetuning, or 10.2% with finetuning.\n\nThe [final competition WER was 8.9%](https://eval.ai/web/challenges/challenge-page/2099/leaderboard/4944), which at the time of writing is rank 1.\n\n## Environment Setup\n\n### alternate setup\nFirst build the environment from `environment.yml`. Then, \n```\n\u003e conda install libsndfile -c conda-forge\n\u003e \n\u003e pip install jiwer torchaudio matplotlib scipy soundfile absl-py librosa numba unidecode praat-textgrids g2p_en einops opt_einsum hydra-core pytorch_lightning \"neptune-client==0.16.18\"\n```\n\n\n## Explanation of model outputs for CTC loss\nFor each timestep, the network predicts the probability of each of 38 characters ('abcdefghijklmnopqrstuvwxyz0123456789|_'), where `|` is the word boundary and `_` is the \"blank token\". The blank token is used to separate repeated letters like \"ll\" in hello: `[h,h,e,l,l,_,l,o]`. 
It can optionally be inserted elsewhere too, like `__hhhh_eeee_llll_lllooo___`\n\n### Example prediction\n\n\n**Target text**: after breakfast instead of working i decided to walk down towards the common\n\nExample model prediction (argmax last dim) of shape `(1821, 38)`:\n\n`______________________________________________________________a__f___tt__eerr|||b__rr_eaaakk___ff____aa____ss_tt___________________||____a_nd__|_ssttt___eaa_dd_||ooff||ww___o_rr_____kk_____ii___nngg________________________||_____a____t__||_______c______i___d_____eedd__________||tt___o__||_w_____a______l_kkk____________________||______o______w__t______________|||t____oowwwaarrrdddsss____||thhee_|||c_____o___mm__mm___oo_nn___________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________
________________________________________________________________________________________________`\n\nBeam search gives, ' after breakfast and stead of working at cided to walk owt towards the common ', which here is same as result from \"best path decoding\" (argmax), but in theory could be different since sums probability of multiple alignments and is therefore more accurate.\n\n\n### Montreal forced aligner\nInstructions for getting phoneme alignments\n\n\nhttps://montreal-forced-aligner.readthedocs.io/en/latest/first_steps/index.html#first-steps-align-pretrained\n\n```\n\u003e conda create -n mfa -c conda-forge montreal-forced-aligner\n\u003e mfa model download acoustic english_us_arpa\n\u003e mfa model download dictionary english_us_arpa\n\u003e mfa validate --single_speaker -j 32 /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa\n\u003e mfa model download g2p english_us_arpa\n\u003e mfa g2p --single_speaker /data/data/T12_data/synthetic_audio/TTS english_us_arpa ~/Documents/MFA/TTS/oovs_found_english_us_arpa.txt --dictionary_path english_us_arpa\n\u003e mfa model add_words english_us_arpa ~/mfa_data/g2pped_oovs.txt\n\u003e mfa adapt --single_speaker -j 32 /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa /data/data/T12_data/synthetic_audio/adapted_bark_english_us_arpa\n\u003e mfa validate --single_speaker -j 32 /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa\n# ensure no OOV (I had to manually correct a transcript due to a `{`)\n\u003e mfa adapt --single_speaker -j 32 --output_directory /data/data/T12_data/synthetic_audio/TTS /data/data/T12_data/synthetic_audio/TTS english_us_arpa english_us_arpa /data/data/T12_data/synthetic_audio/adapted_bark_english_us_arpa\n\n### misc\n\nFast transfer of cache on sherlock to local NVME\n```\ncd $MAG/librispeech\nfind . -type f | parallel -j 16 rsync -avPR {} $LOCAL_SCRATCH/librispeech/\n```\nfind . 
-type f | parallel -j 16 rsync -avPR {} $SCRATCH/librispeech/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchanulee%2Fmonalisa-silent-speech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchanulee%2Fmonalisa-silent-speech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchanulee%2Fmonalisa-silent-speech/lists"}