{"id":18965313,"url":"https://github.com/sony/diffroll","last_synced_at":"2025-07-14T15:35:13.126Z","repository":{"id":62345833,"uuid":"549500782","full_name":"sony/DiffRoll","owner":"sony","description":"PyTorch implementation of DiffRoll, a diffusion-based generative automatic music transcription (AMT) model","archived":false,"fork":false,"pushed_at":"2023-12-06T14:14:24.000Z","size":42377,"stargazers_count":75,"open_issues_count":0,"forks_count":11,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-19T16:58:50.903Z","etag":null,"topics":["automatic-music-transcription","deep-generative-model","diffusion","generative-model","inpainting","machine-learning","music-generation","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sony.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-11T09:25:28.000Z","updated_at":"2025-04-17T15:26:45.000Z","dependencies_parsed_at":"2024-11-08T14:35:18.203Z","dependency_job_id":"7d99fbf1-c50f-4813-8d53-e01be9a831f0","html_url":"https://github.com/sony/DiffRoll","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sony/DiffRoll","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sony%2FDiffRoll","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sony%2FDiffRoll/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sony%2FDiffRoll/releases","manifests_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sony%2FDiffRoll/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sony","download_url":"https://codeload.github.com/sony/DiffRoll/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sony%2FDiffRoll/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265312501,"owners_count":23745181,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatic-music-transcription","deep-generative-model","diffusion","generative-model","inpainting","machine-learning","music-generation","pytorch"],"created_at":"2024-11-08T14:28:49.806Z","updated_at":"2025-07-14T15:35:13.102Z","avatar_url":"https://github.com/sony.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"- __Demo__: https://sony.github.io/DiffRoll/\n- __Paper__: https://arxiv.org/abs/2210.05148\n\n# Table of Content\n\u003c!-- @import \"[TOC]\" {cmd=\"toc\" depthFrom=1 depthTo=4 orderedList=false} --\u003e\n\n\u003c!-- code_chunk_output --\u003e\n- [Installation](#installation)\n- [Table of Content](#table-of-content)\n- [Installation](#installation)\n- [Training](#training)\n  - [Supervised training](#supervised-training)\n  - [Unsupervised pretraining](#unsupervised-pretraining)\n    - [Step 1: Pretraining on MAESTRO using only piano rolls](#step-1-pretraining-on-maestro-using-only-piano-rolls)\n    - [Step 2](#step-2)\n      - [Option A: pre-DiffRoll (p=0.1)](#option-a-pre-diffroll-p01)\n      - [Option B: pre-DiffRoll 
(p=0+1)](#option-b-pre-diffroll-p01)\n      - [Option C: MAESTRO 0.1](#option-c-maestro-01)\n- [Sampling](#sampling)\n  - [Transcription](#transcription)\n  - [Inpainting](#inpainting)\n  - [Generation](#generation)\n\n\u003c!-- /code_chunk_output --\u003e\n\n\n# Installation\nThis repo is developed using `python==3.8.10`, so it is recommended to use `python\u003e=3.8.10`.\n\nTo install all dependencies:\n```\npip install -r requirements.txt\n```\n\n# Training\n\n## Supervised training\n```\npython train_spec_roll.py gpus=[0] model.args.kernel_size=9 model.args.spec_dropout=0.1 dataset=MAESTRO dataloader.train.num_workers=4 epochs=2500 download=True\n```\n\n\n- `gpus` sets which GPU to use. `gpus=[k]` means `device='cuda:k'`; `gpus=2` means [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) (DDP) is used with two GPUs.\n- `model.args.kernel_size` sets the kernel size for the ResNet layers in DiffRoll. `model.args.kernel_size=9` performs the best according to our experiments.\n- `model.args.spec_dropout` sets the dropout rate ($p$ in the paper).\n- `dataset` sets the dataset to be trained on. Can be `MAESTRO` or `MAPS`.\n- `dataloader.train.num_workers` sets the number of workers for the training data loader.\n- `download` should be set to `True` if you are running the script for the first time, so that the dataset is downloaded and set up automatically. You can set it to `False` if you already have the dataset downloaded.\n\nThe checkpoints and training logs are available at `outputs/YYYY-MM-DD/HH-MM-SS/`. 
\n\nTo check the progress of training using TensorBoard, you can use the command below:\n```\ntensorboard --logdir='./outputs'\n```\n\n## Unsupervised pretraining\n### Step 1: Pretraining on MAESTRO using only piano rolls\n```\npython train_spec_roll.py gpus=[0] model.args.kernel_size=9 model.args.spec_dropout=1 dataset=MAESTRO dataloader.train.num_workers=4 epochs=2500\n```\n\n- `model.args.spec_dropout` sets the dropout rate ($p$ in the paper). When it is set to `1`, no spectrograms are used (all spectrograms are dropped to `-1`).\n- The other arguments are the same as in [Supervised Training](#supervised-training).\n\nThe pretrained checkpoints are available at `outputs/YYYY-MM-DD/HH-MM-SS/ClassifierFreeDiffRoll/version_1/checkpoints`.\n\nAfter this, you can choose one of the options ([2A](#option-a-pre-diffroll-p01), [2B](#option-b-pre-diffroll-p01), or [2C](#option-c-maestro-01)) to continue training below.\n\n\n### Step 2\nChoose one of the options below ([A](#option-a-pre-diffroll-p01), [B](#option-b-pre-diffroll-p01), or [C](#option-c-maestro-01)).\n#### Option A: pre-DiffRoll (p=0.1)\n\n```\npython continue_train_single.py gpus=[0] model.args.kernel_size=9 model.args.spec_dropout=0.1 dataset=MAPS dataloader.train.num_workers=4 epochs=10000 pretrained_path='path_to_your_weights' \n```\n\n- `pretrained_path` specifies the location of the pretrained weights obtained in [Step 1](#step-1-pretraining-on-maestro-using-only-piano-rolls).\n- The other arguments are the same as in [Supervised Training](#supervised-training).\n\n\n#### Option B: pre-DiffRoll (p=0+1)\n\n```\npython continue_train_both.py gpus=[0] model.args.kernel_size=9 model.args.spec_dropout=0 dataset=Both dataloader.train.num_workers=4 epochs=10000 pretrained_path='path_to_your_weights' \n```\n\n- `pretrained_path` specifies the location of the pretrained weights obtained in [Step 1](#step-1-pretraining-on-maestro-using-only-piano-rolls).\n- `model.args.spec_dropout` controls the dropout for the MAPS dataset. 
The MAESTRO dataset always uses p=1, so its spectrograms are dropped to `-1`. \n- The other arguments are the same as in [Supervised Training](#supervised-training).\n\n#### Option C: MAESTRO 0.1\nThis option is not reported in the paper, but it is the best.\n\n```\npython continue_train_single.py gpus=[0] model.args.kernel_size=9 model.args.spec_dropout=0 dataset=MAESTRO dataloader.train.num_workers=4 epochs=2500 pretrained_path='path_to_your_weights' \n```\n\n- `pretrained_path` specifies the location of the pretrained weights obtained in [Step 1](#step-1-pretraining-on-maestro-using-only-piano-rolls).\n- The other arguments are the same as in [Supervised Training](#supervised-training).\n\n# Testing\nThe training scripts above already run the test step. Use this section to re-run the test set and get the transcription score.\n\nFirst, open `config/test.yaml`, and then specify the weights to use in `checkpoint_path`.\n\nFor example, if you want to use `Pretrain_MAESTRO-retrain_Both-k=9.ckpt`, then set `checkpoint_path='weights/Pretrain_MAESTRO-retrain_Both-k=9.ckpt'`.\n\nYou can download pretrained weights from [Zenodo](https://zenodo.org/record/7246522#.Y2tXoi0RphE). After downloading, put them inside the folder `weights`.\n\n```\npython test.py gpus=[0] dataset=MAPS\n```\n\n- `dataset` sets the dataset to be tested on. Can be `MAESTRO` or `MAPS`.\n\n# Sampling\nYou can download pretrained weights from [Zenodo](https://zenodo.org/record/7246522#.Y2tXoi0RphE). After downloading, put them inside the folder `weights`.\n\nThe folder `my_audio` already includes four samples as a demonstration. 
You can put your own audio clips inside this folder.\n\n## Transcription\nThis script supports only transcribing music from either MAPS or MAESTRO.\n\nTODO: add support for transcribing any music\n\nFirst, open `config/test.yaml`, and then specify the weights to use in `checkpoint_path`.\n\nFor example, if you want to use `Pretrain_MAESTRO-retrain_MAESTRO-k=9.ckpt`, then set `checkpoint_path='weights/Pretrain_MAESTRO-retrain_MAESTRO-k=9.ckpt'`.\n\n```\npython sampling.py task=transcription dataloader.batch_size=4 dataset=Custom dataset.args.audio_ext=mp3 dataset.args.max_segment_samples=327680 gpus=[0]\n```\n\n- `dataloader.batch_size` sets the batch size. You can set a higher number if your GPU has enough memory.\n- `dataset` selects the data source. When set to `Custom`, audio clips are loaded from the folder `my_audio`.\n- `dataset.args.audio_ext` sets the file extension to be loaded. The default extension is `mp3`.\n- `dataset.args.max_segment_samples` sets the length, in samples, of the audio segment to be loaded. If it is smaller than the actual audio clip length, only the first `max_segment_samples` samples of the audio clip are loaded. If it is larger than the actual audio clip, the audio clip is padded to `max_segment_samples` with 0. The default value is `327680`, which is about 20.5 seconds when `sample_rate=16000` (327680 / 16000 = 20.48).\n- `gpus` sets which GPU to use. 
`gpus=[k]` means `device='cuda:k'`; `gpus=2` means [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) (DDP) is used with two GPUs.\n\n## Inpainting\nThis script supports only transcribing music from either MAPS or MAESTRO.\n\nTODO: add support for transcribing any music\n\nFirst, open `config/sampling.yaml`, and then specify the weights to use in `checkpoint_path`.\n\nFor example, if you want to use `Pretrain_MAESTRO-retrain_Both-k=9.ckpt`, then set `checkpoint_path='weights/Pretrain_MAESTRO-retrain_Both-k=9.ckpt'`.\n\n```\npython sampling.py task=inpainting task.inpainting_t=[0,100] dataloader.batch_size=4 dataset=Custom dataset.args.audio_ext=mp3 dataset.args.max_segment_samples=327680 gpus=[0]\n```\n\n- `gpus` sets which GPU to use. `gpus=[k]` means `device='cuda:k'`; `gpus=2` means [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) (DDP) is used with two GPUs.\n- `task.inpainting_t` sets the frames to be masked to -1 in the spectrogram. `[0,100]` means that frames 0-99 will be masked to -1.\n- `dataloader.batch_size` sets the batch size. You can set a higher number if your GPU has enough memory.\n- `dataset` selects the data source. When set to `Custom`, audio clips are loaded from the folder `my_audio`.\n- `dataset.args.audio_ext` sets the file extension to be loaded. The default extension is `mp3`.\n- `dataset.args.max_segment_samples` sets the length, in samples, of the audio segment to be loaded. If it is smaller than the actual audio clip length, only the first `max_segment_samples` samples of the audio clip are loaded. If it is larger than the actual audio clip, the audio clip is padded to `max_segment_samples` with 0. 
The default value is `327680`, which is about 20.5 seconds when `sample_rate=16000` (327680 / 16000 = 20.48).\n\n## Generation\nFirst, open `config/sampling.yaml`, and then specify the weights to use in `checkpoint_path`.\n\nFor example, if you want to use `Pretrain_MAESTRO-retrain_Both-k=9.ckpt`, then set `checkpoint_path='weights/Pretrain_MAESTRO-retrain_Both-k=9.ckpt'`.\n\n```\npython sampling.py task=generation dataset.num_samples=8 dataloader.batch_size=4\n```\n\n- `dataset.num_samples` sets the number of piano rolls to be generated.\n- `dataloader.batch_size` sets the batch size of the dataloader. If you have enough GPU memory, you can set `dataloader.batch_size` to be equal to `dataset.num_samples` to generate everything in one go.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsony%2Fdiffroll","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsony%2Fdiffroll","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsony%2Fdiffroll/lists"}