https://github.com/chavinlo/musicgen_trainer
Simple trainer for MusicGen / AudioCraft.
- Host: GitHub
- URL: https://github.com/chavinlo/musicgen_trainer
- Owner: chavinlo
- License: agpl-3.0
- Created: 2023-06-12T14:36:31.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-07-12T03:02:27.000Z (over 1 year ago)
- Last Synced: 2025-03-11T10:49:07.587Z (7 months ago)
- Topics: audiocraft, generative-music, music, musicgen, trainer
- Language: Python
- Homepage:
- Size: 10.3 MB
- Stars: 20
- Watchers: 1
- Forks: 5
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# MusicGen Trainer
This is a trainer for the MusicGen model. It is based on [this repository](https://github.com/chavinlo/musicgen_trainer).
# Contributors
- [@mkualquiera](https://github.com/mkualquiera) and [@neverix](https://github.com/neverix): actually got it working
- elyxlz: helped with masks

## STATUS: MVP
Removing the gradient scaler, increasing the batch size, and training only on conditional samples make training work.
TODO:
* [ ] Add notebook
* [ ] Add webdataset support
* [ ] Try larger models
* [ ] Add LoRA
* [ ] Make rolling generation customizable

## Usage
### Dataset Creation
Create a folder and place your audio and caption files in it. **They must be in `.wav` and `.txt` format, respectively.** You can omit the `.txt` files and train with empty text by setting the `--no_label` option to `1`.

You can use `.wav` files longer than 30 seconds; in that case the model will be trained on random crops of the original `.wav` file, as sketched below.
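For intuition, here is a minimal sketch of that random-crop behaviour; this is illustrative only, not the trainer's actual code:

```python
# Illustrative sketch of the random-crop behaviour described above,
# not the trainer's actual code: take a random 30-second window.
import random
import torchaudio

wav, sr = torchaudio.load('segment_000.wav')  # wav shape: (channels, samples)
crop_len = 30 * sr                            # 30 seconds in samples
if wav.shape[-1] > crop_len:
    start = random.randint(0, wav.shape[-1] - crop_len)
    wav = wav[..., start:start + crop_len]
```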
For example, `segment_000.txt` contains the caption "jazz music, jobim" for the wav file `segment_000.wav`.
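If you need to create caption files in bulk, a hypothetical helper like the following pairs each `.wav` with a `.txt` of the same stem (the pairing-by-filename layout is assumed from the description above):

```python
# Hypothetical helper (not part of the trainer): create an empty caption
# file next to each .wav that lacks one; fill in real captions afterwards.
from pathlib import Path

dataset = Path('/home/ubuntu/dataset')  # replace with your dataset path
for wav_file in dataset.glob('*.wav'):
    txt_file = wav_file.with_suffix('.txt')
    if not txt_file.exists():
        txt_file.write_text('')  # placeholder; e.g. "jazz music, jobim"
```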
### Running the trainer
Run `python3 run.py --dataset /home/ubuntu/dataset`, replacing `/home/ubuntu/dataset` with the full path to your dataset (not a relative path).
### Options
- `dataset_path`: String, path to your dataset with `.wav` and `.txt` pairs.
- `model_id`: String, MusicGen model to use. Can be `small`/`medium`/`large`. Default: `small`
- `lr`: Float, learning rate. Default: `0.00001`/`1e-5`
- `epochs`: Integer, epoch count. Default: `100`
- `use_wandb`: Integer, `1` to enable wandb, `0` to disable it. Default: `0` = Disabled
- `save_step`: Integer, number of steps between checkpoint saves. Default: `None`
- `no_label`: Integer, whether to read a dataset without `.txt` files. Default: `0` = Disabled
- `tune_text`: Integer, perform textual inversion instead of full training. Default: `0` = Disabled
- `weight_decay`: Float, the weight decay regularization coefficient. Default: `0.00001`/`1e-5`
- `grad_acc`: Integer, number of gradient accumulation steps; gradients are averaged over this many batches before each optimizer step. Default: `2`
- `warmup_steps`: Integer, number of steps over which the learning rate is gradually increased, letting the optimizer accumulate statistics. Default: `16`
- `batch_size`: Integer, batch size the model sees at once. Reduce to lower memory consumption. Default: 4
- `use_cfg`: Integer, whether to train with some labels randomly dropped out (classifier-free guidance). Default: `0` = Disabled

You can set these options like this: `python3 run.py --use_wandb=1`.
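Putting the options together, a representative run might look like this; the dataset path and values are illustrative, and the flag spellings follow the usage examples above:

```
python3 run.py --dataset /home/ubuntu/dataset --model_id small --epochs 100 --batch_size 4 --use_wandb 1
```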
### Models
Once training finishes, the model (and checkpoints) will be available under the `models` folder in the path you ran the trainer from.

To load them, simply run the following on your generation script:
```python
model.lm.load_state_dict(torch.load('models/lm_final.pt'))
```

Here `model` is the MusicGen object and `models/lm_final.pt` is the path to your model (or checkpoint).
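For a fuller picture, here is a minimal end-to-end sketch, assuming `audiocraft` is installed and the checkpoint was trained on the matching model size (`small` here); the prompt and duration are illustrative:

```python
# Minimal sketch: load the fine-tuned LM weights into a pretrained
# MusicGen model and generate from a text prompt.
import torch
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('small')  # must match the size used for training
model.lm.load_state_dict(torch.load('models/lm_final.pt'))

model.set_generation_params(duration=10)     # seconds of audio to generate
wav = model.generate(['jazz music, jobim'])  # one waveform per prompt
```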
## Citations
```
@article{copet2023simple,
title={Simple and Controllable Music Generation},
author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
year={2023},
journal={arXiv preprint arXiv:2306.05284},
}
```

@mkualquiera (mkualquiera@discord) added batching, debugged the code, and trained the first working model.
Special thanks to elyxlz (223864514326560768@discord) for helping @chavinlau with the masks.
@chavinlau wrote the original version of the training code. Original README:
---
# MusicGen Trainer
This is a trainer for MusicGen model. Currently it's very basic but I'll add more features soon.
## STATUS: BROKEN
It only works for overfitting; it breaks the model on anything else.
More information on the current training quality is available in the [Experiments section](#experiments).
## Usage
### Dataset Creation
Create a folder and place your audio and caption files in it. **They must be in WAV and TXT format, respectively.**

### Important: Split your audio into 35-second chunks. Only the first 30 seconds will be processed; audio shorter than 30 seconds is not supported.
In this example, `segment_000.txt` contains the caption "jazz music, jobim" for the wav file `segment_000.wav`.
### Running the trainer
Run `python3 run.py --dataset /home/ubuntu/dataset`, replacing `/home/ubuntu/dataset` with the path to your dataset. Make sure to use the full path, not a relative path.
### Options
- `dataset_path`: String, path to your dataset with WAV and TXT pairs.
- `model_id`: String, MusicGen model to use. Can be `small`/`medium`/`large`. Default: `small`
- `lr`: Float, learning rate. Default: `0.0001`/`1e-4`
- `epochs`: Integer, epoch count. Default: `5`
- `use_wandb`: Integer, `1` to enable wandb, `0` to disable it. Default: `0` = Disabled
- `save_step`: Integer, number of steps between checkpoint saves. Default: `None`

You can set these options like this: `python3 run.py --use_wandb=1`.
### Models
Once training finishes, the model (and checkpoints) will be available under the `models` folder in the path you ran the trainer from.

To load them, simply run the following on your generation script:
```python
model.lm.load_state_dict(torch.load('models/lm_final.pt'))
```

Here `model` is the MusicGen object and `models/lm_final.pt` is the path to your model (or checkpoint).
## Experiments
### Electronic music (Moe Shop):
Encodec seems to struggle with electronic music. Even a plain Encode -> Decode round trip has many problems.
4:00 - 4:30 - [Moe Shop - WONDER POP](https://youtu.be/H4PZ7mju5QQ?t=240)
Original: https://voca.ro/1jbsor6BAyLY
Encode -> Decode: https://voca.ro/1kF2yyGyRn0y
Overfit -> Generate -> Decode: https://voca.ro/1f6ru5ieejJY
### Bossa Nova (Tom Jobim):
Softer, less aggressive melodies seem to work best with Encodec and MusicGen. One of these is bossa nova, which to me sounds great:
1:20 - 1:50 - [Tom Jobim - Children's Games](https://youtu.be/8KVtgzOTqDw?t=80)
Original: https://voca.ro/1dm9QpRqa5rj (last 5 seconds are ignored)
Encode -> Decode: https://voca.ro/19LpwVE44si7
Overfit -> Generate -> Decode: https://voca.ro/1hJGVdxsvBOG
## Citations
```
@article{copet2023simple,
title={Simple and Controllable Music Generation},
author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
year={2023},
journal={arXiv preprint arXiv:2306.05284},
}
```

Special thanks to elyxlz (223864514326560768@discord) for helping me with the masks.