{"id":13527949,"url":"https://github.com/openai/jukebox","last_synced_at":"2025-05-14T03:09:27.862Z","repository":{"id":37244600,"uuid":"259991778","full_name":"openai/jukebox","owner":"openai","description":"Code for the paper \"Jukebox: A Generative Model for Music\"","archived":false,"fork":false,"pushed_at":"2024-06-19T05:14:24.000Z","size":2815,"stargazers_count":7965,"open_issues_count":206,"forks_count":1444,"subscribers_count":301,"default_branch":"master","last_synced_at":"2025-04-09T02:13:00.984Z","etag":null,"topics":["audio","generative-model","music","paper","pytorch","transformer","vq-vae"],"latest_commit_sha":null,"homepage":"https://openai.com/blog/jukebox/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-29T17:16:12.000Z","updated_at":"2025-04-06T22:51:24.000Z","dependencies_parsed_at":"2024-10-29T10:20:59.476Z","dependency_job_id":null,"html_url":"https://github.com/openai/jukebox","commit_stats":{"total_commits":99,"total_committers":7,"mean_commits":"14.142857142857142","dds":"0.43434343434343436","last_synced_commit":"08efbbc1d4ed1a3cef96e08a931944c8b4d63bb3"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fjukebox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fjukebox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fjukebox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openai%2Fjukebox/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openai","download_url":"https://codeload.github.com/openai/jukebox/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254059510,"owners_count":22007768,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","generative-model","music","paper","pytorch","transformer","vq-vae"],"created_at":"2024-08-01T06:02:07.884Z","updated_at":"2025-05-14T03:09:27.832Z","avatar_url":"https://github.com/openai.png","language":"Python","readme":"**Status:** Archive (code is provided as-is, no updates expected)\n\n# Jukebox\nCode for \"Jukebox: A Generative Model for Music\"\n\n[Paper](https://arxiv.org/abs/2005.00341) \n[Blog](https://openai.com/blog/jukebox) \n[Explorer](http://jukebox.openai.com/) \n[Colab](https://colab.research.google.com/github/openai/jukebox/blob/master/jukebox/Interacting_with_Jukebox.ipynb) \n\n# Install\nInstall the conda package manager from https://docs.conda.io/en/latest/miniconda.html    \n    \n``` 
# Required: Sampling
conda create --name jukebox python=3.7.5
conda activate jukebox
conda install mpi4py=3.0.3 # if this fails, try: pip install mpi4py==3.0.3
conda install pytorch=1.4 torchvision=0.5 cudatoolkit=10.0 -c pytorch
git clone https://github.com/openai/jukebox.git
cd jukebox
pip install -r requirements.txt
pip install -e .

# Required: Training
conda install av=7.0.01 -c conda-forge
pip install ./tensorboardX

# Optional: Apex for faster training with fused_adam
conda install pytorch=1.1 torchvision=0.3 cudatoolkit=10.0 -c pytorch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
```

# Sampling
## Sampling from scratch
To sample normally, run one of the following commands. The model can be `5b`, `5b_lyrics`, or `1b_lyrics`.
```
python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 \
--total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
```
```
python jukebox/sample.py --model=1b_lyrics --name=sample_1b --levels=3 --sample_length_in_seconds=20 \
--total_sample_length_in_seconds=180 --sr=44100 --n_samples=16 --hop_fraction=0.5,0.5,0.125
```
The above generates the first `sample_length_in_seconds` seconds of audio from a song of total length `total_sample_length_in_seconds`.
To use multiple GPUs, launch the above scripts as `mpiexec -n {ngpus} python jukebox/sample.py ...` so they use `{ngpus}` GPUs.

The samples decoded from each level are stored in `{name}/level_{level}`.
You can also view the samples as an HTML page with the aligned lyrics under `{name}/level_{level}/index.html`.
Run `python -m http.server` and open the page through the server to see the lyrics animate as the song plays.
A summary of all sampling data, including `zs`, `x`, `labels`, and `sampling_kwargs`, is stored in `{name}/level_{level}/data.pth.tar` (see the sketch after the continuation example below for one way to inspect it).

The hps are for a V100 GPU with 16 GB of GPU memory. The `1b_lyrics`, `5b`, and `5b_lyrics` top-level priors take up 3.8 GB, 10.3 GB, and 11.5 GB, respectively. The peak memory usage to store the transformer key/value cache is about 400 MB for `1b_lyrics` and 1 GB for `5b_lyrics` per sample. If you run into CUDA OOM issues, try `1b_lyrics`, decrease `max_batch_size` in sample.py, or decrease `--n_samples` in the script call.

On a V100, it takes about 3 hours to fully sample 20 seconds of music. Since this is a long time, it is recommended to use `n_samples > 1` so you can generate as many samples as possible in parallel. The 1B lyrics model and the upsamplers can process 16 samples at a time, while 5B can fit only up to 3. Since the vast majority of time is spent on upsampling, we recommend using a multiple of 3 less than 16, like `--n_samples 15` for `5b_lyrics`. This will make the top level generate samples in groups of three while upsampling is done in one pass.

To continue sampling from already generated codes for a longer duration, you can run
```
python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --mode=continue \
--codes_file=sample_5b/level_0/data.pth.tar --sample_length_in_seconds=40 --total_sample_length_in_seconds=180 \
--sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
```
Here, we take the 20-second samples saved from the first sampling run at `sample_5b/level_0/data.pth.tar` and continue by adding 20 more seconds. You could also continue directly from the level 2 saved outputs by passing `--codes_file=sample_5b/level_2/data.pth.tar`; note this will upsample the full 40-second song at the end.
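If you want to peek inside one of these `data.pth.tar` files, a minimal sketch is below. It assumes, per the description above, that the file is a torch-saved dict holding `zs`, `x`, `labels`, and `sampling_kwargs`; adjust the keys if your checkout stores them differently.
```
# Inspection sketch only -- not part of the repo. Loading may require the
# jukebox package to be importable if custom objects were pickled.
import torch

data = torch.load("sample_5b/level_0/data.pth.tar", map_location="cpu")
print(sorted(data.keys()))            # expected: labels, sampling_kwargs, x, zs, ...
zs = data["zs"]                       # discrete VQ-VAE codes, one tensor per level
print([tuple(z.shape) for z in zs])   # roughly (n_samples, n_codes) per level
```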
If you stopped sampling at only the first level and want to upsample the saved codes, you can run
```
python jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --mode=upsample \
--codes_file=sample_5b/level_2/data.pth.tar --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 \
--sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
```
Here, we take the 20-second samples saved from the first sampling run at `sample_5b/level_2/data.pth.tar` and upsample the lower two levels.

## Prompt with your own music
If you want to prompt the model with your own creative piece or any other music, first save the pieces as wave files and run
```
python jukebox/sample.py --model=5b_lyrics --name=sample_5b_prompted --levels=3 --mode=primed \
--audio_file=path/to/recording.wav,awesome-mix.wav,fav-song.wav,etc.wav --prompt_length_in_seconds=12 \
--sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
```
This will load the four files, tile them to fill up the `n_samples` batch size, and prime the model with the first `prompt_length_in_seconds` seconds.

# Training
## VQVAE
To train a small vqvae, run
```
mpiexec -n {ngpus} python jukebox/train.py --hps=small_vqvae --name=small_vqvae --sample_length=262144 --bs=4 \
--audio_files_dir={audio_files_dir} --labels=False --train --aug_shift --aug_blend
```
Here, `{audio_files_dir}` is the directory containing the audio files for your dataset, and `{ngpus}` is the number of GPUs you want to train with.
The above trains a two-level VQ-VAE with `downs_t = (5,3)` and `strides_t = (2, 2)`, meaning we downsample the audio by `2**5 = 32` to get the first level of codes and by `2**8 = 256` to get the second level of codes.
Checkpoints are stored in the `logs` folder. You can monitor the training by running Tensorboard
```
tensorboard --logdir logs
```

## Prior
### Train prior or upsamplers
Once the VQ-VAE is trained, we can restore it from its saved checkpoint and train priors on the learnt codes.
To train the top-level prior, we can run

```
mpiexec -n {ngpus} python jukebox/train.py --hps=small_vqvae,small_prior,all_fp16,cpu_ema --name=small_prior \
--sample_length=2097152 --bs=4 --audio_files_dir={audio_files_dir} --labels=False --train --test --aug_shift --aug_blend \
--restore_vqvae=logs/small_vqvae/checkpoint_latest.pth.tar --prior --levels=2 --level=1 --weight_decay=0.01 --save_iters=1000
```

To train the upsampler, we can run
```
mpiexec -n {ngpus} python jukebox/train.py --hps=small_vqvae,small_upsampler,all_fp16,cpu_ema --name=small_upsampler \
--sample_length=262144 --bs=4 --audio_files_dir={audio_files_dir} --labels=False --train --test --aug_shift --aug_blend \
--restore_vqvae=logs/small_vqvae/checkpoint_latest.pth.tar --prior --levels=2 --level=0 --weight_decay=0.01 --save_iters=1000
```
We pass `sample_length = n_ctx * downsample_of_level` so that after downsampling, the tokens match the `n_ctx` of the prior hps.
Here, `n_ctx = 8192` and `downsamples = (32, 256)`, giving `sample_lengths = (8192 * 32, 8192 * 256) = (262144, 2097152)` for the bottom and top level respectively.
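These numbers can be sanity-checked in a couple of lines (illustrative arithmetic only, using the values quoted above; they match the `--sample_length` flags in the two training commands):
```
# Hop lengths implied by downs_t = (5, 3) with strides_t = (2, 2):
# each level multiplies the cumulative hop by 2**down, so 2**5 and 2**(5+3).
downs_t = (5, 3)
hops = [2 ** sum(downs_t[:i + 1]) for i in range(len(downs_t))]
print(hops)                            # [32, 256]

n_ctx = 8192                           # prior context length
print([n_ctx * hop for hop in hops])   # [262144, 2097152]
```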
### Learning rate annealing
To get the best sample quality, anneal the learning rate to 0 near the end of training. To do so, continue training from the latest checkpoint and run with
```
--restore_prior="path/to/checkpoint" --lr_use_linear_decay --lr_start_linear_decay={already_trained_steps} --lr_decay={decay_steps_as_needed}
```

### Reuse pre-trained VQ-VAE and train top-level prior on new dataset from scratch
#### Train without labels
Our pre-trained VQ-VAE can produce compressed codes for a wide variety of genres of music, and the pre-trained upsamplers can upsample them back to audio that sounds very similar to the original audio. To re-use these for a new dataset of your choice, you can retrain just the top-level prior.

To train the top level on a new dataset, run
```
mpiexec -n {ngpus} python jukebox/train.py --hps=vqvae,small_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior \
--sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \
--labels=False --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000
```
Training the `small_prior` with a batch size of 2, 4, and 8 requires 6.7 GB, 9.3 GB, and 15.8 GB of GPU memory, respectively. A few days to a week of training typically yields reasonable samples when the dataset is homogeneous (e.g. all piano pieces, songs of the same style, etc.).

Near the end of training, follow [this](#learning-rate-annealing) to anneal the learning rate to 0.

#### Sample from new model
You can then run sample.py with the top level of our models replaced by your new model. To do so,
- Add an entry `my_model=("vqvae", "upsampler_level_0", "upsampler_level_1", "small_prior")` in `MODELS` in `make_models.py`.
- Update the `small_prior` dictionary in `hparams.py` to include `restore_prior='path/to/checkpoint'`. If you changed any hps directly in the command line script (eg: `heads`), make sure to update them in the dictionary too so that `make_models` restores our checkpoint correctly.
- Run sample.py as outlined in the sampling section, but now with `--model=my_model`.

For example, let's say we trained `small_vqvae`, `small_prior`, and `small_upsampler` under `/path/to/jukebox/logs`. In `make_models.py`, we declare a tuple of the new models as `my_model`.
```
MODELS = {
    '5b': ("vqvae", "upsampler_level_0", "upsampler_level_1", "prior_5b"),
    '5b_lyrics': ("vqvae", "upsampler_level_0", "upsampler_level_1", "prior_5b_lyrics"),
    '1b_lyrics': ("vqvae", "upsampler_level_0", "upsampler_level_1", "prior_1b_lyrics"),
    'my_model': ("my_small_vqvae", "my_small_upsampler", "my_small_prior"),
}
```

Next, in `hparams.py`, we add them to the registry with the corresponding `restore_` paths and any other command line options used during training. Another important note: for top-level priors with lyric conditioning, we have to locate a self-attention layer that shows alignment between the lyric and music tokens. Look for layers where `prior.prior.transformer._attn_mods[layer].attn_func` is either 6 or 7. If your model is starting to sing along with the lyrics, it means some (layer, head) pair has learned alignment.
Congrats!
```
my_small_vqvae = Hyperparams(
    restore_vqvae='/path/to/jukebox/logs/small_vqvae/checkpoint_some_step.pth.tar',
)
my_small_vqvae.update(small_vqvae)
HPARAMS_REGISTRY["my_small_vqvae"] = my_small_vqvae

my_small_prior = Hyperparams(
    restore_prior='/path/to/jukebox/logs/small_prior/checkpoint_latest.pth.tar',
    level=1,
    labels=False,
    # TODO: For the two lines below, if `--labels` was used and the model is
    # trained with lyrics, find and enter the (layer, head) pair that has
    # learned alignment.
    alignment_layer=47,
    alignment_head=0,
)
my_small_prior.update(small_prior)
HPARAMS_REGISTRY["my_small_prior"] = my_small_prior

my_small_upsampler = Hyperparams(
    restore_prior='/path/to/jukebox/logs/small_upsampler/checkpoint_latest.pth.tar',
    level=0,
    labels=False,
)
my_small_upsampler.update(small_upsampler)
HPARAMS_REGISTRY["my_small_upsampler"] = my_small_upsampler
```

#### Train with labels
To train with your own metadata for your audio files, implement `get_metadata` in `data/files_dataset.py` to return the `artist`, `genre`, and `lyrics` for a given audio file (a sketch follows at the end of this section). For now, you can pass `''` for lyrics to not use any lyrics.

For training with labels, we'll use `small_labelled_prior` in `hparams.py`, and we set `labels=True,labels_v3=True`.
We use two kinds of label information:
- Artist/Genre:
  - For each file, we return an `artist_id` and a list of `genre_ids`. The reason we have a list and not a single `genre_id` is that in v2 we split genres like `blues_rock` into a bag of words `[blues, rock]` and pass at most `max_bow_genre_size` of them; in v3 we consider it a single word and just set `max_bow_genre_size=1`.
  - Update `v3_artist_ids` and `v3_genre_ids` to use ids from your new dataset.
  - In `small_labelled_prior`, set the hps `y_bins = (number_of_genres, number_of_artists)` and `max_bow_genre_size=1`.
- Timing:
  - For each chunk of audio, we return the `total_length` of the song, the `offset` the current audio chunk is at, and the `sample_length` of the audio chunk. We have three timing embeddings: the total length, our current position, and our current position as a fraction of the total length, and we divide the range of these values into `t_bins` discrete bins.
  - In `small_labelled_prior`, set the hps `min_duration` and `max_duration` to the shortest/longest duration of audio files you want for your dataset, and `t_bins` to how many bins you want to discretize timing information into. Note `min_duration * sr` needs to be at least `sample_length` to have an audio chunk in it.

After these modifications, to train a top-level with labels, run
```
mpiexec -n {ngpus} python jukebox/train.py --hps=vqvae,small_labelled_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_prior_labels \
--sample_length=1048576 --bs=4 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \
--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000
```

For sampling, follow the same instructions as [above](#sample-from-new-model) but use `small_labelled_prior` instead of `small_prior`.
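As a starting point, here is a minimal sketch of a `get_metadata` implementation. The exact stub signature in `jukebox/data/files_dataset.py` may differ slightly, and the sidecar-JSON layout here is purely a hypothetical convention for illustration:
```
import json
from pathlib import Path

def get_metadata(self, filename, test):
    # Hypothetical convention: metadata lives in a JSON file next to each
    # audio file, e.g. "song.wav" -> "song.json" containing
    # {"artist": "...", "genre": "...", "lyrics": "..."}.
    meta_path = Path(filename).with_suffix(".json")
    if meta_path.exists():
        meta = json.loads(meta_path.read_text())
        # Return '' for lyrics when training without lyric conditioning.
        return (meta.get("artist", "unknown"),
                meta.get("genre", "unknown"),
                meta.get("lyrics", ""))
    return "unknown", "unknown", ""
```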
#### Train with lyrics
To train with lyrics in addition to labels, update `get_metadata` in `data/files_dataset.py` to return `lyrics` too.
For training with lyrics, we'll use `small_single_enc_dec_prior` in `hparams.py`.
- Lyrics:
  - For each file, we linearly align the lyric characters to the audio, find the position in the lyrics that corresponds to the midpoint of our audio chunk, and pass a window of `n_tokens` lyric characters centred around that.
  - In `small_single_enc_dec_prior`, set the hps `use_tokens=True` and `n_tokens` to the number of lyric characters to use for an audio chunk. Set it according to the `sample_length` you're training on, so that it's large enough that the lyrics for an audio chunk are almost always found inside a window of that size.
  - If you use a non-English vocabulary, update `text_processor.py` with your new vocab and set `n_vocab = number of characters in vocabulary` accordingly in `small_single_enc_dec_prior`. In v2 we had `n_vocab=80`; in v3 we missed the `+` character, so `n_vocab=79`.

After these modifications, to train a top-level with labels and lyrics, run
```
mpiexec -n {ngpus} python jukebox/train.py --hps=vqvae,small_single_enc_dec_prior,all_fp16,cpu_ema --name=pretrained_vqvae_small_single_enc_dec_prior_labels \
--sample_length=786432 --bs=4 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \
--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000
```
To simplify hps choices, here we used a `single_enc_dec` model like the `1b_lyrics` model, which combines both the encoder and decoder of the transformer into a single model. We do so by merging the lyric vocab and the vq-vae vocab into a single larger vocab, and flattening the lyric tokens and the vq-vae codes into a single sequence of length `n_ctx + n_tokens`. This uses `attn_order=12`, which includes `prime_attention` layers with keys/values from lyrics and queries from audio. If you instead want to use a model with the usual encoder-decoder style transformer, use `small_sep_enc_dec_prior`.

For sampling, follow the same instructions as [above](#sample-from-new-model) but use `small_single_enc_dec_prior` instead of `small_prior`. To also get the alignment between lyrics and samples in the saved html, you'll need to set `alignment_layer` and `alignment_head` in `small_single_enc_dec_prior`. To find which layer/head is best to use, run a forward pass on a training example, save the attention weight tensors for all `prime_attention` layers, and pick the (layer, head) whose attention map shows the best linear alignment pattern between the lyric keys and music queries.
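One rough heuristic for that last step is sketched below. This is one possible scoring approach, not the repo's method: for each head, take the most-attended lyric position per music query and measure how linearly it tracks the query index.
```
import torch

def linearity_score(attn):
    # attn: attention weights for one prime_attention layer, assumed shape
    # (heads, n_music_queries, n_lyric_keys); adapt to your saved tensors.
    pos = attn.argmax(dim=-1).float()                 # best lyric key per query
    q = torch.arange(attn.shape[1], dtype=torch.float)
    n = q.numel()
    scores = []
    for head_pos in pos:                              # one row per head
        # Sample Pearson correlation between query index and attended position;
        # values near 1.0 suggest a clean monotonic lyric/music alignment.
        cov = ((q - q.mean()) * (head_pos - head_pos.mean())).sum() / (n - 1)
        scores.append((cov / (q.std() * head_pos.std() + 1e-8)).item())
    return scores
```
Scoring every saved layer and taking the maximum gives candidate `alignment_layer`/`alignment_head` values; it's worth eyeballing the chosen attention map before committing them to `hparams.py`.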
### Fine-tune pre-trained top-level prior to new style(s)
Previously, we showed how to train a small top-level prior from scratch. Assuming you have a GPU with at least 15 GB of memory and support for fp16, you can instead fine-tune from our pre-trained 1B top-level prior. Here are the steps:

- Support `--labels=True` by implementing `get_metadata` in `jukebox/data/files_dataset.py` for your dataset.
- Add new entries in `jukebox/data/ids`. We recommend replacing existing mappings (e.g. rename `"unknown"`, etc. with styles of your choice). This uses the pre-trained style vectors as initialization and could potentially save some compute.

After these modifications, run
```
mpiexec -n {ngpus} python jukebox/train.py --hps=vqvae,prior_1b_lyrics,all_fp16,cpu_ema --name=finetuned \
--sample_length=1048576 --bs=1 --aug_shift --aug_blend --audio_files_dir={audio_files_dir} \
--labels=True --train --test --prior --levels=3 --level=2 --weight_decay=0.01 --save_iters=1000
```
To get the best sample quality, it is recommended to anneal the learning rate near the end of training. Training the 5B top-level requires GPipe, which is not supported in this release.

# Citation

Please cite using the following bibtex entry:

```
@article{dhariwal2020jukebox,
  title={Jukebox: A Generative Model for Music},
  author={Dhariwal, Prafulla and Jun, Heewoo and Payne, Christine and Kim, Jong Wook and Radford, Alec and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2005.00341},
  year={2020}
}
```

# License
[Noncommercial Use License](./LICENSE)

It covers both the released code and weights.