{"id":23444077,"url":"https://github.com/sevagh/mixin","last_synced_at":"2025-06-12T15:37:40.396Z","repository":{"id":82225512,"uuid":"341960147","full_name":"sevagh/MiXiN","owner":"sevagh","description":"Music Xtraction with Nonstationary Gabor Transforms and Convolutional Denoising Autoencoders","archived":false,"fork":false,"pushed_at":"2024-02-11T21:17:45.000Z","size":3164,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-05T01:01:52.719Z","etag":null,"topics":["convolutional-autoencoder","convolutional-neural-networks","cqt-spectrogram","deep-learning","gabor-analysis","music-separation","source-separation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sevagh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2021-02-24T16:20:09.000Z","updated_at":"2024-03-21T21:01:52.000Z","dependencies_parsed_at":"2024-02-11T22:42:10.939Z","dependency_job_id":null,"html_url":"https://github.com/sevagh/MiXiN","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sevagh%2FMiXiN","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sevagh%2FMiXiN/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sevagh%2FMiXiN/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sevagh%2FMiXiN/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/sevagh","download_url":"https://codeload.github.com/sevagh/MiXiN/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248714609,"owners_count":21149926,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convolutional-autoencoder","convolutional-neural-networks","cqt-spectrogram","deep-learning","gabor-analysis","music-separation","source-separation"],"created_at":"2024-12-23T18:26:28.151Z","updated_at":"2025-04-13T12:34:18.091Z","avatar_url":"https://github.com/sevagh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MiXiN\n\nMiXiN, or **M**usic **X**traction with **N**onstationary Gabor Transforms, is a model for harmonic/percussive/vocal source separation based on [Convolutional Denoising Autoencoders](https://arxiv.org/abs/1703.08019). 
The pretrained models are trained on Periphery stems from the albums Juggernaut, Omega, Periphery III, and Hail Stan (available for purchase [here](https://store.periphery.net/music/music)).\n\nMiXiN takes the simple [median-filtering HPSS](http://dafx10.iem.at/papers/DerryFitzGerald_DAFx10_P15.pdf) algorithm (which applies soft masks computed from harmonic and percussive magnitude estimates), replaces the STFT with an NSGT using 96 bands on the Bark frequency scale (from 0-22050Hz), and replaces the simple (but not so impressive) median filtering estimation step with trained CDAEs.\n\nMiXiN has only been tested on my own Linux computer - if you experience any issues, or need help getting it running somewhere else, feel free to use GitHub issues to ask or suggest anything.\n\n## Install and use\n\nTo install MiXiN, use the requirements.txt file (probably with a virtualenv, your choice):\n\n```\n$ pip install -r ./requirements.txt\n```\n\nIf you want to use MiXiN to separate songs, run the `xtract_mixin.py` script:\n\n```\n$ ls *.wav\nmixed.wav\n$ ./xtract_mixin.py ./mixed.wav\n$ ls *.wav\nmixed_harmonic.wav  mixed_percussive.wav  mixed_vocal.wav  mixed.wav\n```\n\nThere are two flags that affect the quality of separation:\n* `--single-model` uses the original CDAE strategy of using the network output magnitude + original mix phase to invert and create the separations. The default uses all 3 models to create per-component soft masks.\n* `--instrumental` ignores the vocal model. Useful for instrumental metal songs; you may get slightly better outputs.\n\nIf you want to use MiXiN in your own code, take a look at `xtract_mixin.py` to see how it calls the mixin library. 
You'll know the pretrained models are loaded correctly if you see 3 different sets of Keras layers printed:\n```\nLoading models from:\n        /home/sevagh/repos/MiXiN/./pretrained-models/model_harmonic.h5\n        /home/sevagh/repos/MiXiN/./pretrained-models/model_percussive.h5\n        /home/sevagh/repos/MiXiN/./pretrained-models/model_vocal.h5\n...\n\nModel: \"sequential\"\n_________________________________________________________________\nLayer (type)                 Output Shape              Param #\n=================================================================\nconv2d (Conv2D)              (None, 96, 1948, 12)      120\n...\n\n_________________________________________________________________\nModel: \"sequential_1\"\n_________________________________________________________________\nLayer (type)                 Output Shape              Param #\n=================================================================\nconv2d_8 (Conv2D)            (None, 96, 1948, 12)      120\n_________________________________________________________________\n...\n\n_________________________________________________________________\nModel: \"sequential_2\"\n_________________________________________________________________\nLayer (type)                 Output Shape              Param #\n=================================================================\nconv2d_16 (Conv2D)           (None, 96, 1948, 12)      120\n...\ncropping2d_2 (Cropping2D)    (None, 96, 1948, 1)       0\n=================================================================\nTotal params: 37,005\nTrainable params: 37,005\nNon-trainable params: 0\n...\nWriting harmonic and percussive audio files\nWriting vocal audio files\nDone\n```\n\n## Architecture\n\nMiXiN uses 3 CDAEs, each trained on separate harmonic, vocal, and percussive components.\n\nFrom the paper on [CDAEs for source separation](https://arxiv.org/abs/1703.08019):\n![!cdae](./.github/cdae_arch.png)\n\nThe CDAE paper uses an STFT of 15 frames with a window 
size of 1024 (1025 FFT points, representing 2048/2 + 1 non-redundant spectral coefficients) to perform source separation, training a separate CDAE for each desired source. In MiXiN, the sources considered are harmonic/percussive/vocal - this is influenced by the [median-filtering HPSS](http://dafx10.iem.at/papers/DerryFitzGerald_DAFx10_P15.pdf) and [HPSS vocal separation with CQT](https://arrow.tudublin.ie/cgi/viewcontent.cgi?article=1007\u0026context=argart) papers. This gives us an architecture with 3 CDAEs.\n\nIn MiXiN, the Nonstationary Gabor Transform is used with 96 bands on the Bark scale from 0-22050 Hz using [my fork](https://github.com/sevagh/nsgt) of the Python [nsgt](https://github.com/grrrr/nsgt) library. Think of the NSGT like an STFT with some useful time-frequency properties that might make it more amenable to musical applications - for further reading on musical time-frequency analyses, check out [Judith Brown's first paper on the CQT](https://www.ee.columbia.edu/~dpwe/papers/Brown91-cqt.pdf), [Monika Doerfler's dissertation](http://www.mathe.tu-freiberg.de/files/thesis/gamu_1.pdf) for a treatment of multiple Gabor dictionaries, and [paper 1](https://ltfat.github.io/notes/ltfatnote010.pdf), [paper 2](https://ltfat.github.io/notes/ltfatnote018.pdf) on NSGTs.\n\nThe signal is split into chunks of 44032 samples (roughly 1s of audio, divisible by 1024), and a forward NSGT is done on each chunk. The magnitude of the mixed audio NSGT coefficients per chunk is the input to all 3 CDAEs, and the outputs are the estimates of the vocal, percussive, and harmonic magnitude NSGT coefficients.\n\nThe original CDAE paper uses the phase of the mixture and the magnitude of the CDAE output to invert and create the separated source. 
The approach in MiXiN is closer to that of the HPSS formulation, where the magnitude estimates of each source (harmonic, percussive, vocal) are used to compute soft masks using the Wiener filter formula:\n\n\u003cimg src=\"./.github/mixin_arch.png\" width=\"640px\"\u003e\n\nFinally, the NSGT coefficient matrix of the original mixed chunk is multiplied with the respective mask and the backward NSGT gives us the harmonic, percussive, and vocal chunks (appended to the full track output).\n\n## Training\n\nAs mentioned, the training data used was 4 albums from Periphery, prepared in two sets:\n* Instrumental mix, harmonic (rhythm guitar + lead guitar + bass + other stems), and percussive (drum stem)\n* Full mix, harmonic (rhythm guitar + lead guitar + bass + other stems), percussive (drum stem), and vocal (vocal stem)\n\nA consequence is that the vocal CDAE is trained on half the data of the percussive and harmonic ones.\n\nThe data is split into 80%/10%/10% train/validation/test. There are 3 models trained, with roughly 37,000 parameters each. The CDAE implementation is relatively clear in the paper, but I was also helped by [this implementation](https://github.com/SahilJindal1/Sound-Separation). 
Here are the training plots for the 3 networks - the loss is mae.\n\nPercussive:\n\n\u003cimg src=\"./.github/percussive_train_loss.png\" width=512px\u003e\n\nHarmonic:\n\n\u003cimg src=\"./.github/harmonic_train_loss.png\" width=512px\u003e\n\nVocal:\n\n\u003cimg src=\"./.github/vocal_train_loss.png\" width=512px\u003e\n\n## Evaluation\n\nA small evaluation was performed on some tracks from the [MUSDB18-HQ](https://zenodo.org/record/3338373) test set, using the testbench from my larger [Music-Separation-TF](https://github.com/sevagh/Music-Separation-TF) project (survey of various DSP approaches to source separation).\n\nThe metric is SigSep's [BSSv4](https://github.com/sigsep/bsseval), and MiXiN is compared against [Open-Unmix](https://sigsep.github.io/open-unmix/):\n\n![bssv4](./.github/bssv4_results_musdb18hq.png)\n\nMiXiN scores poorly, but to me it still sounds pretty good - I like the percussive and vocal outputs more than harmonic.\n\n## Run tests, train your own models\n\nRun the end-to-end test (you must supply a stem dir). Pass `--delete` to automatically delete the `./data`, `./model`, and `./logdir` working directories. Use `2\u003e/dev/null` to discard the verbose Tensorflow/Keras outputs.\n\nThe e2e test runs a full cycle through:\n1. Keras CDAE layers and input/output dimensions check\n2. Stem segment preparation from original stems (e.g. in MUSDB18-HQ)\n3. Stem segment verification (duration, sample rate, appropriate components i.e. harmonic/percussive/vocal/mix)\n4. HDF5 file creation\n5. HDF5 file verification including test/train/validation sizes and NSGT input/output dimensionality \n6. Training all 3 models\n7. Verifying existence of checkpoint dirs and saved h5 model files\n\n```\n$ STEM_DIR=~/TRAINING-MUSIC/MUSDB18-HQ/test/ ./test_e2e.py --delete 2\u003e/dev/null\n[MiXiN e2e test] Early sanity check - Keras model layers and dimensions...\n[MiXiN e2e test] good!\n[MiXiN e2e test] Deleting dirs...\n[MiXiN e2e test] Checking if dirs exist... 
dont want to overwrite user training\n[MiXiN e2e test] Checking if STEM_DIR is defined\n[MiXiN e2e test] Preparing stems with train util\n[MiXiN e2e test] Verifying prepared stems\n[MiXiN e2e test] Verifying prepared segments for audio properties\n[MiXiN e2e test] good\n[MiXiN e2e test] Creating hdf5 with train util\n[MiXiN e2e test] good\n[MiXiN e2e test] Verifying dimensionality of hdf5 files\n[MiXiN e2e test] good\n[MiXiN e2e test] Training the 3 networks\n[MiXiN e2e test] Verifying training outputs\n[MiXiN e2e test] finished!\n```\n\nIf the e2e test succeeds, chances are MiXiN will work fine for you.\n\nTo train your own, use the `train_util.py` script.\n\n**Step 1**, prepare stem files - example arguments are 10-second segments per track, 2 total tracks, skipping the first 3 segments, and limiting to 4 segments (total segments = 2 tracks * (4 segments - 3 segment offset) = 2):\n\n```\n# prepares 2 sets of files: harmonic/percussive/mix, suffixed \"${track_sequence}nov\"\n#                           harmonic/percussive/vocal/mix, suffixed \"${track_sequence}v\"\n$ ./train_util.py \\\n        --prepare-stems \\\n        --stem-dirs ~/TRAINING-MUSIC/MUSDB18-HQ/train/ \\\n        --segment-duration 10 \\\n        --segment-offset 3 \\\n        --track-limit 2 \\\n        --segment-limit 4\n$ tree data/\ndata/\n├── 0000003nov\n│   ├── harmonic.wav\n│   ├── mix.wav\n│   └── percussive.wav\n├── 0000003v\n│   ├── harmonic.wav\n│   ├── mix.wav\n│   ├── percussive.wav\n│   └── vocal.wav\n├── 0010003nov\n│   ├── harmonic.wav\n│   ├── mix.wav\n│   └── percussive.wav\n└── 0010003v\n    ├── harmonic.wav\n    ├── mix.wav\n    ├── percussive.wav\n    └── vocal.wav\n```\n\n**Step 2**, prepare hdf5 data files from stems, which does the train/test/validation split (80/10/10, hardcoded):\n\n```\n$ ./train_util.py --create-hdf5\n--prepare-stems not specified, skipping...\npercussive chunk 0 TRAIN/TEST/VALIDATION SPLIT:\n        all data: (40, 96, 3896)\n        train: (32, 96, 
3896)\n        test: (4, 96, 3896)\n        validation: (4, 96, 3896)\nharmonic chunk 0 TRAIN/TEST/VALIDATION SPLIT:\n        all data: (40, 96, 3896)\n        train: (32, 96, 3896)\n        test: (4, 96, 3896)\n        validation: (4, 96, 3896)\nvocal chunk 0 TRAIN/TEST/VALIDATION SPLIT:\n        all data: (20, 96, 3896)\n        train: (16, 96, 3896)\n        test: (2, 96, 3896)\n        validation: (2, 96, 3896)\n--train not specified, skipping...\n$ ls data/*.hdf5\ndata/data_harmonic.hdf5  data/data_percussive.hdf5  data/data_vocal.hdf5\n```\n\nThe dimensionality of the data is 96 x (2 * 1948), representing 2 concatenated NSGTs, one for the mix, and one for the source. There are 2 segments (20s of audio total), which are split into chunks of ~1s before taking the NSGT. Vocal has half the data of harmonic/percussive since each track is made into both an instrumental and a vocal mix, and only the vocal mix contributes vocal data.\n\n**Step 3**, train the networks, which stores checkpoints in `logdir` and the trained models in `model`. Use `--plot-training` to view training plots (this will block until you exit matplotlib, so you can't fire-and-forget with `--plot-training`). 
It will print the MAE/loss for train, test, and validation (for each model) before exiting:\n\n```\n$ ./train_util.py --train\n$ ls logdir/\nmodel_harmonic.ckpt  model_percussive.ckpt  model_vocal.ckpt\n$ ls model/\nmodel_harmonic.h5  model_percussive.h5  model_vocal.h5\n```\n\nModels are saved as h5 files.\n\n**Step 4**, use your custom models by passing `--pretrained-model-dir` in the `xtract_mixin.py` script:\n\n```\n$ ./xtract_mixin.py --pretrained-model-dir=./model mixed.wav\nLoading models from:\n        ./model/model_harmonic.h5\n        ./model/model_percussive.h5\n        ./model/model_vocal.h5\n...\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsevagh%2Fmixin","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsevagh%2Fmixin","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsevagh%2Fmixin/lists"}