{"id":19388105,"url":"https://github.com/lankaraniamir/lyric-source-separation","last_synced_at":"2025-04-14T02:29:53.562Z","repository":{"id":171166815,"uuid":"647480481","full_name":"lankaraniamir/lyric-source-separation","owner":"lankaraniamir","description":"Using alignments and posteriorgrams extracted from lyrics as novel input into source separation models","archived":false,"fork":false,"pushed_at":"2023-06-23T18:49:48.000Z","size":5471,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T02:29:42.486Z","etag":null,"topics":["lyrics","music-information-retrieval","source-separation","word-alignment"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lankaraniamir.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-30T21:56:18.000Z","updated_at":"2023-10-25T01:51:50.000Z","dependencies_parsed_at":null,"dependency_job_id":"9af32803-0f4f-4505-9e09-da9856b0c6c0","html_url":"https://github.com/lankaraniamir/lyric-source-separation","commit_stats":null,"previous_names":["lankaraniamir/lyric_source_separation","lankaraniamir/lyric-source-separation"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lankaraniamir%2Flyric-source-separation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lankaraniamir%2Flyric-source-separation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lankaraniamir%2Flyric-source-separation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lankaraniamir%2Flyric-source-separation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lankaraniamir","download_url":"https://codeload.github.com/lankaraniamir/lyric-source-separation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248810854,"owners_count":21165189,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lyrics","music-information-retrieval","source-separation","word-alignment"],"created_at":"2024-11-10T10:11:48.061Z","updated_at":"2025-04-14T02:29:53.525Z","avatar_url":"https://github.com/lankaraniamir.png","language":"Jupyter Notebook","readme":"# Lyric Source Separation\nCode for research paper [\"Using Synchronized Lyrics as a Novel Input into\nVocal Isolation Models\"](https://github.com/lankaraniamir/lyric_source_separation/blob/main/documents/Using%20Synchronized%20Lyrics%20as%20a%20Novel%20Input%20into%20Vocal%20Isolation%20Models.pdf)\n\n\n![Concatenated 
## Project Summary
Source separation projects have typically focused on using only audio data as
an input to the separation network, so this project tests whether other
information about the music can be used to bolster the separation. More
specifically, I took the standard Musdb source separation dataset, alongside
annotations of all of its lyrics and time points, and tested whether these
lyrics could help the model learn phonetic traits that make it easier to
isolate which parts of the frequency spectrum are associated with the vocalist.
To create a more expansive dataset, I altered Demucs' audio remixing program to
adapt the speed, pitch, and texture of the input audio around the vocals while
only splitting at the start of a lyrical phrase to keep the text position
estimates intact. This was then fed into a modified combination of Emir
Demirel's ASA & ALTA Kaldi recipes, predicting the exact moment each word is
sung to create alignment text files, as well as predicting the probability of
each phoneme at every frame of audio to create a posteriorgram. These features
were then fed into a PyTorch source separation network based upon the nussl
package to see if they could improve the separation. Posteriorgrams were
concatenated onto the audio features before the RNN chain and, when used, the
lyrics were encoded and embedded using linear models which were then merged
with the RNN output at the final layer. These models were based upon estimating
a spectral mask for the vocal audio, in the hope that this would map well onto
the posteriorgrams' dimensionality. Sadly, the separation showed little to no
improvement with the posteriorgrams, and the alignments hindered training, as
they overly complicated and misled the network, biasing it towards a few words.
To improve: a significantly larger amount of data with both lyrics and stems
needs to be procured; a more nuanced integration of the audio features into the
model is needed; and the posteriorgrams and alignments should probably be
predicted at the same frame rate as the spectrogram for a consistent
correlation between the two.

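To make the conditioning scheme concrete, below is a rough, hypothetical sketch
of concatenating a phoneme posteriorgram onto spectrogram frames before an RNN
that predicts a vocal mask. This is not the code in this repository; the sizes
(513 frequency bins, 40 phonemes, 256 hidden units) and layer choices are
assumptions for illustration only.

```python
# Toy sketch of posteriorgram conditioning, NOT the repo's actual model.
import torch
import torch.nn as nn

class PosteriorgramConditionedMasker(nn.Module):
    def __init__(self, n_freq_bins=513, n_phonemes=40, hidden_size=256):
        super().__init__()
        # BLSTM over time; each frame is spectrogram bins + phoneme posteriors
        self.rnn = nn.LSTM(
            input_size=n_freq_bins + n_phonemes,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # Project back to one mask value per frequency bin, squashed to [0, 1]
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden_size, n_freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, spec, posteriorgram):
        # spec:          (batch, time, n_freq_bins) magnitude spectrogram
        # posteriorgram: (batch, time, n_phonemes)  per-frame phoneme probabilities
        x = torch.cat([spec, posteriorgram], dim=-1)  # concatenate along features
        x, _ = self.rnn(x)
        mask = self.mask_head(x)                      # (batch, time, n_freq_bins)
        return mask * spec                            # estimated vocal magnitudes

# Quick shape check with random stand-in features
model = PosteriorgramConditionedMasker()
vocals_estimate = model(torch.rand(1, 200, 513), torch.rand(1, 200, 40))
```
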
## List of Tools Needed
- [Demucs](https://github.com/facebookresearch/demucs) to run a modified version of [Demucs' automixing program](https://github.com/facebookresearch/demucs/blob/main/tools/automix.py),
  as well as to do the initial source separation used to create more
  accurate posteriorgrams and alignments in Kaldi.

- [ALTA recipe](https://github.com/emirdemirel/ALTA) for Kaldi was used to train the models for alignment and
posteriorgram creation and was also used for the actual creation of the
posteriorgrams.

- [ASA recipe](https://github.com/emirdemirel/ASA_ICASSP2021) for Kaldi was used to create the alignments. It is quite similar to
ALTA, so I was able to use the same models for each.

- [Nussl](https://github.com/nussl/nussl) was used as a general framework for source separation, giving a
good baseline to compare the posteriorgram additions against. Further additions
were made using PyTorch.

- [High quality version of the Musdb dataset](https://zenodo.org/record/3338373), the standard for source separation

- [Annotations of the lyrics](https://zenodo.org/record/3989267) in each song of the Musdb dataset


## Directories and executables

Part 0: Create Conda environments
- Build each conda environment from the yml files using:

      conda env create -f ASA.yml
      conda env create -f demucs.yml
      conda env create -f mirdata.yml
      conda env create -f sep.yml

- In part 1, the first three environments will be activated and deactivated
automatically
- For part 2, activate the sep environment for all testing
- Due to the usage of varying packages of different ages, different
conda environments were the only way I could get things to work
portably.


Part 1: Create alignments
- This directory contains all of the code to bolster the dataset and to
create the alignments/posteriorgrams.
- run_al3625.sh - main script
- align_and_post_al3625.sh - the script to extract the alignments &
posteriorgram from any one file using the models trained from the ALTA recipe.
It works by first isolating the vocals from the audio using a very basic
pre-trained source separation model meant for exactly this purpose. Then we
silence moments where no lyrics are being sung to create better alignments and
posteriorgrams. Then we use the i-vectors & MFCCs to get our output as
described in the file itself. Most of the modifications come in terms of
getting the desired output on a larger scale and making the two be extracted
by similar means.
- local/data_preparation... - these files are used to process the Musdb
audio and its lyrics to create output files in Kaldi format as well as in
ALTA/ASA format, creating the general tagging and directory structure.
- posteriorgram_visual_al3625.ipynb - simple visual of a posteriorgram
for reference and understanding (a better one is in the next section)
- local/mashup_with... - creates the remixed version of the songs, as well as
the newly aligned lyrics for this output, by shifting the chromas and tempos of
a set and then choosing a random match of these features that satisfies our
minimum criteria. Most of my modifications revolved around splitting only at
lyric breakpoints so that correctly aligned lyrics are output, making the code
more relative to the vocals, and extending the range of variance to account for
our smaller dataset, since we can only use transcribed audio.
- local/prep_data... - preps the processed data to create lexical data
for Kaldi
- local/process_segmentation_al3625.py - fades out parts of the
pre-isolated vocals that should not have audio, based on the lyric files,
to create cleaner and more accurate posteriorgrams and alignments (see the
sketch after this list for the general idea)

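The following is a minimal, hypothetical illustration of that last step, fading
out whatever falls outside the lyric timings. It is not
local/process_segmentation_al3625.py itself, and the alignment file format
(one `start end word` triple per line) is an assumption.

```python
# Toy sketch: gate a pre-isolated vocal stem using lyric alignment times.
import numpy as np
import soundfile as sf

def silence_non_lyric_regions(vocals_path, alignment_path, out_path, fade_ms=50.0):
    audio, sr = sf.read(vocals_path)          # assumes a mono vocal stem
    keep = np.zeros(len(audio), dtype=bool)   # True wherever a word is sung

    with open(alignment_path) as f:
        for line in f:
            start_s, end_s = map(float, line.split()[:2])
            keep[int(start_s * sr):int(end_s * sr)] = True

    gain = keep.astype(float)
    fade = int(fade_ms / 1000.0 * sr)
    # Smooth the 0/1 gate with a short moving average so words fade in and out
    # instead of clicking on and off.
    if fade > 1:
        gain = np.convolve(gain, np.ones(fade) / fade, mode="same")

    sf.write(out_path, audio * gain, sr)

# Example with hypothetical paths:
# silence_non_lyric_regions("vocals.wav", "song.align", "vocals_gated.wav")
```
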
Part 2: Separate Sources
- This directory uses the Kaldi outputs as input to do the actual source separation
- setup_al3625.py - defines most of the groundwork for how all the models are
designed in a similar way. The most notable part of this is the aligned data
audio class, which processes the posteriorgrams to exist on the same time scale
as the audio file and creates a vector of the current lyric encoding at each
audio frame.
- training_scripts - these were used to define and train each of the distinct
models. They were also used to evaluate the models after training (although I
didn't make time to do a final evaluation for a few of the less important
models). Finally, at the end you can visualize the spectral mask for any
evaluated model as well as listen to its prediction of what is and isn't the
vocal. The basic setup follows the Nussl backbone closely so that the baseline
model we compare against is a standard one. However, lots of modifications were
made to support our specific data and make it work with non-audio inputs.
    - audio_post_no_norm: model using the audio and an unnormalized
    posteriorgram. This in general has the most evaluation tools, as I used it
    to create the figures for the paper. If you would like similar images for
    other models, simply copy the code from that section. This includes an
    image of the predicted real mask and an image of what the posteriorgram
    and spectrogram concatenation looks like
    - all_audio: basic source separation model with no additions
    - all_post: model without any audio data, using just the posteriorgrams
    - audio_post_aligns: model using the audio data, posteriorgrams, and the
    lyrical alignments
    - audio_post_no_norm_larger: model using the audio and an unnormalized
    posteriorgram with extra posteriorgram rows (the dx & ddx of each row of
    the posteriorgram) and larger hidden layers
    - audio_post_norm: model using the audio and a normalized posteriorgram
- eval_results: contains JSONs of each eval test
- output_audio: exported audio from evaluation
- trained_models: contains the newest and best trained models &
optimizers of each type
- compare_training_al3625.ipynb - simple visual comparing the loss of the models
- get_eval_scores_al3625.ipynb - takes the scores from the evaluation of the
decoded models and outputs them into clean tables for all of the important
evaluation features


## How to run
- Download all tools needed
- Download the databases and place them into the db folder in 1_create_alignments
- Train all models needed from the ALTA/ASA github projects and put them into
the model folder in 1_create_alignments
- Run the part 0 code to set up the conda environments as described above
- Run the part 1 file: run_al3625.sh. This will first prep the Musdb data,
create the remixes, and then prep the mashup data. Then it will go through each
of the mono audio files and feed it into align_and_post_al3625 to get the
alignments and posteriorgrams for that song
- Run the part 2 training_scripts - each is a simple ipynb notebook that can
simply be gone through in any way you like. If you would like to hear any
output audio and see the output masks, run the visualization functions
commented out in the code (a sketch of applying a predicted mask follows
this list)

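As a reference for that last step, here is a minimal, hypothetical sketch of
applying a predicted vocal mask to a mixture and inverting the result, reusing
the toy masker from the earlier sketch as a stand-in model. The STFT settings,
shapes, and the dummy posteriorgram are assumptions, not the notebooks' actual
code.

```python
# Toy sketch: mask the mixture STFT with the model's output and invert it.
import torch

def separate_vocals(mixture, model, n_fft=1024, hop=256):
    window = torch.hann_window(n_fft)
    mix_stft = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                          return_complex=True)        # (freq, time), complex
    magnitude = mix_stft.abs().T.unsqueeze(0)          # (1, time, freq) for the RNN
    # Dummy posteriorgram with the right number of frames; in the real pipeline
    # this would come from the Kaldi alignment/posteriorgram step.
    posteriorgram = torch.zeros(1, magnitude.shape[1], 40)
    with torch.no_grad():
        masked_mag = model(magnitude, posteriorgram)    # masked vocal magnitudes
    # Re-attach the mixture phase and invert back to a waveform
    vocal_stft = torch.polar(masked_mag.squeeze(0).T, mix_stft.angle())
    return torch.istft(vocal_stft, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])

# Example with a random "mixture" and the sketch model defined earlier:
# vocals = separate_vocals(torch.rand(44100), PosteriorgramConditionedMasker())
```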