{"id":13532104,"url":"https://github.com/aishoot/LSTM_PIT_Speech_Separation","last_synced_at":"2025-04-01T20:31:22.229Z","repository":{"id":40515313,"uuid":"138172496","full_name":"aishoot/LSTM_PIT_Speech_Separation","owner":"aishoot","description":"Two-talker Speech Separation with LSTM/BLSTM by Permutation Invariant Training method.","archived":false,"fork":false,"pushed_at":"2022-01-06T06:43:26.000Z","size":7741,"stargazers_count":308,"open_issues_count":17,"forks_count":90,"subscribers_count":16,"default_branch":"master","last_synced_at":"2024-12-24T19:29:18.504Z","etag":null,"topics":["audio-separation","multi-speaker","permutation-invariant-training","robust-speech-recognition","speech-enhancement","speech-separation"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aishoot.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-06-21T13:15:14.000Z","updated_at":"2024-12-01T08:46:54.000Z","dependencies_parsed_at":"2022-07-25T16:18:04.428Z","dependency_job_id":null,"html_url":"https://github.com/aishoot/LSTM_PIT_Speech_Separation","commit_stats":null,"previous_names":["pchao6/lstm_pit_speech_separation"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aishoot%2FLSTM_PIT_Speech_Separation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aishoot%2FLSTM_PIT_Speech_Separation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aishoot%2FLSTM_PIT_Speech_Separation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/aishoot%2FLSTM_PIT_Speech_Separation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aishoot","download_url":"https://codeload.github.com/aishoot/LSTM_PIT_Speech_Separation/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246709923,"owners_count":20821297,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-separation","multi-speaker","permutation-invariant-training","robust-speech-recognition","speech-enhancement","speech-separation"],"created_at":"2024-08-01T07:01:08.227Z","updated_at":"2025-04-01T20:31:17.215Z","avatar_url":"https://github.com/aishoot.png","language":"Jupyter Notebook","funding_links":[],"categories":["Speech Separation (single channel)"],"sub_categories":["NN-based separation"],"readme":"# LSTM/BLSTM based PIT for Two Speakers\n```\n====================================================================================\n                Two-speaker speech separation with BLSTM and PIT\n                   Author: aishoot, EECS, Peking University\n            Github: https://github.com/aishoot/LSTM_PIT_Speech_Separation\n                            Created in: June 2018\n====================================================================================\n```\n\nThe progress made in multitalker mixed speech separation and recognition, often referred to as the \"cocktail-party problem\", has been less impressive. 
Although human listeners can easily perceive separate sources in an acoustic mixture, the same task seems to be extremely difficult for computers, especially when only a single-microphone recording of the mixed speech is available.\n\n\u003cimg width=\"75%\" height=\"75%\" src=\"spectrogram.PNG\"/\u003e\n\n## 1. Separation Performance\nNote: The training and validation sets contain two-speaker mixtures generated by randomly selecting speakers and utterances from the WSJ0 set and mixing them at signal-to-noise ratios (SNRs) chosen uniformly between -2.5 dB and 2.5 dB. \u003cbr\u003e\n\nThe separation performance of **LSTM** is as follows:\n\nGender Combination | SDR | SAR | SIR | STOI | ESTOI | PESQ \n:-: | :-: | :-: | :-: | :-: | :-: | :-: |\nOverall| 6.453328 | 9.372059 | 11.570311 | 0.473229 | 0.377204 | 1.5812\nMale \u0026 Female | 8.238905 | 9.939668 | 14.531649 | 0.488542 | 0.393999 | 1.663442\nFemale \u0026 Female | 3.538810 | 8.134054 | 7.230494 | 0.459762 | 0.363213 | 1.478075\nMale \u0026 Male | 5.011563 | 9.026763 | 9.000010 | 0.456667 | 0.358757 | 1.602058\n\nThe separation performance of **BLSTM** is as follows:\n\nGender Combination | SDR | SAR | SIR | STOI | ESTOI | PESQ \n:-: | :-: | :-: | :-: | :-: | :-: | :-: |\nOverall| 9.177447 | 10.629142 | 16.116564 | 0.536987 | 0.429255 | 1.65339\nMale \u0026 Female | 10.647645 | 11.691969 | 18.203052 | 0.521656 | 0.421868 | 1.731112\nFemale \u0026 Female | 7.309365 | 9.393608 | 13.355384 | 0.560099 | 0.441704 | 1.553452\nMale \u0026 Male | 7.797448 | 9.589827 | 14.198003 | 0.550071 | 0.435083 | 1.675609\n\nThese results show that mixed-gender mixtures are separated better than same-gender mixtures on SDR, SAR, SIR, and PESQ (the BLSTM STOI and ESTOI scores are the exception), and that BLSTM outperforms LSTM overall on every metric.\n\n## 2. 
Evaluation Criteria\n* SDR: Signal to Distortion Ratio\n* SAR: Signal to Artifact Ratio\n* SIR: Signal to Interference Ratio\n* STOI: Short Time Objective Intelligibility Measure\n* ESTOI: Extended Short Time Objective Intelligibility Measure\n* PESQ: Perceptual Evaluation of Speech Quality\n\n## 3. Dependencies\n* [librosa](https://librosa.github.io/)\n* matlab (my test version: R2016b 64-bit)\n* tensorflow (my test version: 1.4.0)\n* anaconda3 (Python3.5+)\n\n## 4. Usage Process\n#### Generate Mixed and Target Speech:\nOnce you have the WSJ0 data, use the scripts in \"create-speaker-mixtures-V1/V2\" to create the mixed speech. We mixed two-speaker audio at a sample rate of 8000 Hz.\n\n#### Run the command line script:\n```bash\nbash run.sh\n```\nwhich contains three steps:\n1. Extract STFT features and convert them to TensorFlow's tfrecords format.\nThe training data is now ready, with the following file structure:\n```\nstorage/\n├── lists\n│   ├── cv_tf.lst\n│   ├── cv_wav.lst\n│   ├── tr_tf.lst\n│   ├── tr_wav.lst\n│   ├── tt_tf.lst\n│   └── tt_wav.lst\n├── separated\n├── TFCheckpoint\n└── tfrecords\n    ├── cv_tfrecord\n    │   ├── 01aa010k_1.3053_01po0310_-1.3053.tfrecords\n    │   ├── 01aa010p_0.93798_02bo0311_-0.93798.tfrecords\n    │   ├── ...\n    │   └── 409o0317_1.2437_025c0217_-1.2437.tfrecords\n    ├── tr_tfrecord\n    │   ├── 01aa010b_0.97482_209a010p_-0.97482.tfrecords\n    │   ├── 01aa010b_1.4476_20aa010p_-1.4476.tfrecords\n    │   ├── ...\n    │   └── 409o0316_1.3942_20oo010p_-1.3942.tfrecords\n    └── tt_tfrecord\n        ├── 050a050a_0.032494_446o030v_-0.032494.tfrecords\n        ├── 050a050a_1.7521_422c020j_-1.7521.tfrecords\n        ├── ...\n        └── 447o0312_2.0302_440c0206_-2.0302.tfrecords\n```\nNote: each {tr,cv,tt}_wav.lst file looks like this:\n```\n447o030v_0.1232_050c0109_-0.1232.wav\n447o030v_1.7882_444o0310_-1.7882.wav\n...\n447o030x_0.98832_441o0308_-0.98832.wav\n447o030x_1.4783_422o030p_-1.4783.wav\n```\nAnd 
{tr,cv,tt}_tf.lst looks like this:\n```\nstorage/tfrecords/cv_tfrecord/011o031b_1.8_206a010u_-1.8.tfrecords\nstorage/tfrecords/cv_tfrecord/20ec0109_0.47371_020c020q_-0.47371.tfrecords\n...\nstorage/tfrecords/cv_tfrecord/01zo030l_0.6242_40ho030s_-0.6242.tfrecords\nstorage/tfrecords/cv_tfrecord/20fo0109_1.1429_017o030p_-1.1429.tfrecords\n```\n2. Train the deep neural network.\n3. Run the trained network to generate the separated audio.\n\n## 5. File Description\n* 1.create-speaker-mixtures-V1: Version one of the scripts to generate the wsj0-mix multi-speaker dataset.\n* 2.create-speaker-mixtures-V2: Version two of the scripts to generate the wsj0-mix multi-speaker dataset.\n* 3.SPHFile2Wav: Converts the SPH files of the TIMIT and WSJ0 corpora to WAV format.\n* 4.introduction_to_mask: Introduction to Computational Auditory Scene Analysis (mask-based method) in speech separation.\n\n**mixed speech**:\n\n\u003cimg width=\"55%\" height=\"55%\" src=\"4. introduction_to_mask/mixturesignals.png\"/\u003e\n\n**masks**:\n\n\u003cimg width=\"90%\" height=\"90%\" src=\"4. introduction_to_mask/masks.png\"/\u003e\n\n**recovered speech 1**:\n\n\u003cimg width=\"50%\" height=\"50%\" src=\"4. introduction_to_mask/recoverd1.png\"/\u003e\n\n**recovered speech 2**:\n\n\u003cimg width=\"50%\" height=\"50%\" src=\"4. introduction_to_mask/recoverd2.png\"/\u003e\n\n* 5.step_to_CASA_DL: Steps toward multi-speaker speech separation with Computational Auditory Scene Analysis and Deep Learning.\n* 6.separated_result_LSTM: Demos of separated speech based on LSTM and PIT.\n* 7.separated_result_BLSTM: Demos of separated speech based on BLSTM and PIT.\n\n## 6. Reference Paper \u0026 Code\n*Thanks to Dong Yu et al. for the paper and Sining Sun (Northwestern Polytechnical University, China) et al. 
for sharing their code.*\n* __Paper__: Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation.\n* __Authors__: Dong Yu, Morten Kolbæk, Zheng-Hua Tan, Jesper Jensen\n* __Published__: [ICASSP 2017](https://ieeexplore.ieee.org/document/7952154/) (5-9 March 2017)\n* __Dataset__: [WSJ0 data](https://catalog.ldc.upenn.edu/ldc93s6a), [VCTK-Corpus](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html)\n* __SDR/SAR/SIR__\n    * Toolbox: [BSS Eval](http://bass-db.gforge.inria.fr/bss_eval/), [The PEASS Toolkit](http://bass-db.gforge.inria.fr/peass/), [craffel/mir_eval/separation.py](https://github.com/craffel/mir_eval/blob/master/mir_eval/separation.py)\n    * Paper: [Performance measurement in blind audio source separation](https://ieeexplore.ieee.org/document/1643671/)\n* __STOI__\n    * Toolbox: [stoi.zip](http://insy.ewi.tudelft.nl/content/short-time-objective-intelligibility-measure)+[actuallyaswin/stoi](https://github.com/actuallyaswin/stoi), [mpariente/pystoi](https://github.com/mpariente/pystoi)\n    * Paper: [A short-time objective intelligibility measure for time-frequency weighted noisy speech](https://ieeexplore.ieee.org/document/5495701/)\n* __ESTOI__\n    * Toolbox: [estoi.m](http://kom.aau.dk/~jje/code/estoi.m)\n    * Paper: [An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers](https://ieeexplore.ieee.org/document/7539284/)\n* __PESQ__\n    * Toolbox: [pesq.m](https://github.com/JacobD10/SoundZone_Tools/blob/master/pesq2.m), [MATLAB software-composite](http://ecs.utdallas.edu/loizou/speech/software.htm)\n    * Paper: [Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs](https://ieeexplore.ieee.org/document/941023/)\n\n## 7. 
Directions for Future Research\n* Scaling down DNNs without compromising performance.\n* Multiple-microphone algorithms.\n* Beyond single-modality algorithms, for example, incorporating visual perception.\n* Beyond the mean squared error cost function.\n* Towards time-domain end-to-end systems.\n\n*Thanks for your attention!* \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faishoot%2FLSTM_PIT_Speech_Separation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faishoot%2FLSTM_PIT_Speech_Separation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faishoot%2FLSTM_PIT_Speech_Separation/lists"}