# Audio AI Timeline

Here we keep track of the latest AI models for waveform-based audio generation, starting in 2023!

## 2023

| Date (DD.MM) | Release [Samples] | Paper | Code | Trained Model |
| ------------ | ----------------- | ----- | ---- | ------------- |
| 14.11 | [Mustango: Toward Controllable Text-to-Music Generation](https://amaai-lab.github.io/mustango/) | [arXiv](https://arxiv.org/abs/2311.08355) | [GitHub](https://github.com/AMAAI-Lab/mustango) | [Hugging Face](https://huggingface.co/declare-lab/mustango) |
| 13.11 | [Music ControlNet: Multiple Time-varying Controls for Music Generation](https://musiccontrolnet.github.io/web/) | [arXiv](https://arxiv.org/abs/2311.07069) | - | - |
| 02.11 | [E3 TTS: Easy End-to-End Diffusion-based Text to Speech](https://e3tts.github.io/) | [arXiv](https://arxiv.org/abs/2311.00945) | - | - |
| 01.10 | [UniAudio: An Audio Foundation Model Toward Universal Audio Generation](http://dongchaoyang.top/UniAudio_demo/) | [arXiv](https://arxiv.org/abs/2310.00704) | [GitHub](https://github.com/yangdongchao/UniAudio) | - |
| 24.09 | [VoiceLDM: Text-to-Speech with Environmental Context](https://voiceldm.github.io/) | [arXiv](https://arxiv.org/abs/2309.13664) | [GitHub](https://github.com/glory20h/VoiceLDM) | - |
| 05.09 | [PromptTTS 2: Describing and Generating Voices with Text Prompt](https://speechresearch.github.io/prompttts2/) | [arXiv](https://arxiv.org/abs/2309.02285) | - | - |
| 14.08 | [SpeechX: Neural Codec Language Model as a Versatile Speech Transformer](https://www.microsoft.com/en-us/research/project/speechx/) | [arXiv](https://arxiv.org/abs/2308.06873) | - | - |
| 10.08 | [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://audioldm.github.io/audioldm2/) | [arXiv](https://arxiv.org/abs/2308.05734) | [GitHub](https://github.com/haoheliu/audioldm2) | [Hugging Face](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music) |
| 09.08 | [JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models](https://www.futureverse.com/research/jen/demos/jen1) | [arXiv](https://arxiv.org/abs/2308.04729) | - | - |
| 03.08 | [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://musicldm.github.io/) | [arXiv](https://arxiv.org/abs/2308.01546) | [GitHub](https://github.com/RetroCirce/MusicLDM/) | - |
| 14.07 | [Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts](https://mega-tts.github.io/mega2_demo/) | [arXiv](https://arxiv.org/abs/2307.07218) | - | - |
| 10.07 | [VampNet: Music Generation via Masked Acoustic Token Modeling](https://hugo-does-things.notion.site/VampNet-Music-Generation-via-Masked-Acoustic-Token-Modeling-e37aabd0d5f1493aa42c5711d0764b33) | [arXiv](https://arxiv.org/abs/2307.04686) | [GitHub](https://github.com/hugofloresgarcia/vampnet) | - |
| 22.06 | [AudioPaLM: A Large Language Model That Can Speak and Listen](https://google-research.github.io/seanet/audiopalm/examples/) | [arXiv](https://arxiv.org/abs/2306.12925) | - | - |
| 19.06 | [Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale](https://voicebox.metademolab.com/) | [arXiv](https://arxiv.org/abs/2306.15687) | [GitHub](https://github.com/SpeechifyInc/Meta-voicebox) | - |
| 08.06 | [MusicGen: Simple and Controllable Music Generation](https://ai.honu.io/papers/musicgen/) | [arXiv](https://arxiv.org/abs/2306.05284) | [GitHub](https://github.com/facebookresearch/audiocraft) | [Hugging Face](https://huggingface.co/spaces/facebook/MusicGen) [Colab](https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing) |
| 06.06 | [Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias](https://mega-tts.github.io/demo-page/) | [arXiv](https://arxiv.org/abs/2306.03509) | - | - |
| 01.06 | [Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis](https://charactr-platform.github.io/vocos/) | [arXiv](https://arxiv.org/abs/2306.00814) | [GitHub](https://github.com/charactr-platform/vocos) | - |
| 29.05 | [Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation](https://make-an-audio-2.github.io/) | [arXiv](https://arxiv.org/abs/2305.18474) | - | - |
| 25.05 | [MeLoDy: Efficient Neural Music Generation](https://efficient-melody.github.io/) | [arXiv](https://arxiv.org/abs/2305.15719) | - | - |
| 18.05 | [CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training](https://clapspeech.github.io/) | [arXiv](https://arxiv.org/abs/2305.10763) | - | - |
| 18.05 | [SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities](https://0nutation.github.io/SpeechGPT.github.io/) | [arXiv](https://arxiv.org/abs/2305.11000) | [GitHub](https://github.com/0nutation/SpeechGPT) | - |
| 16.05 | [SoundStorm: Efficient Parallel Audio Generation](https://google-research.github.io/seanet/soundstorm/examples/) | [arXiv](https://arxiv.org/abs/2305.09636) | [GitHub (unofficial)](https://github.com/lucidrains/soundstorm-pytorch) | - |
| 03.05 | [Diverse and Vivid Sound Generation from Text Descriptions](https://ligw1998.github.io/audiogeneration.html) | [arXiv](https://arxiv.org/abs/2305.01980) | - | - |
| 02.05 | [Long-Term Rhythmic Video Soundtracker](https://justinyuu.github.io/LORIS/) | [arXiv](https://arxiv.org/abs/2305.01319) | [GitHub](https://github.com/OpenGVLab/LORIS) | - |
| 24.04 | [TANGO: Text-to-Audio generation using instruction tuned LLM and Latent Diffusion Model](https://tango-web.github.io/) | [PDF](https://openreview.net/pdf?id=1Sn2WqLku1e) | [GitHub](https://github.com/declare-lab/tango) | [Hugging Face](https://huggingface.co/declare-lab/tango) |
| 18.04 | [NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers](https://speechresearch.github.io/naturalspeech2/) | [arXiv](https://arxiv.org/abs/2304.09116) | [GitHub (unofficial)](https://github.com/lucidrains/naturalspeech2-pytorch) | - |
| 10.04 | [Bark: Text-Prompted Generative Audio Model](https://github.com/suno-ai/bark) | - | [GitHub](https://github.com/suno-ai/bark) | [Hugging Face](https://huggingface.co/spaces/suno/bark) [Colab](https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing) |
| 03.04 | [AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models](https://audit-demo.github.io/) | [arXiv](https://arxiv.org/abs/2304.00830) | - | - |
| 08.03 | [VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling](https://www.microsoft.com/en-us/research/project/vall-e-x/) | [arXiv](https://arxiv.org/abs/2303.03926) | - | - |
| 27.02 | [I Hear Your True Colors: Image Guided Audio Generation](https://pages.cs.huji.ac.il/adiyoss-lab/im2wav/) | [arXiv](https://arxiv.org/abs/2211.03089) | [GitHub](https://github.com/RoySheffer/im2wav) | - |
| 08.02 | [Noise2Music: Text-conditioned Music Generation with Diffusion Models](https://google-research.github.io/noise2music/) | [arXiv](https://arxiv.org/abs/2302.03917) | - | - |
| 04.02 | [Multi-Source Diffusion Models for Simultaneous Music Generation and Separation](https://gladia-research-group.github.io/multi-source-diffusion-models/) | [arXiv](https://arxiv.org/abs/2302.02257) | [GitHub](https://github.com/gladia-research-group/multi-source-diffusion-models) | - |
| 30.01 | [SingSong: Generating musical accompaniments from singing](https://storage.googleapis.com/sing-song/index.html) | [arXiv](https://arxiv.org/abs/2301.12662) | - | - |
| 30.01 | [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://audioldm.github.io/) | [arXiv](https://arxiv.org/abs/2301.12503) | [GitHub](https://github.com/haoheliu/AudioLDM) | [Hugging Face](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation) |
| 30.01 | [Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion](https://anonymous0.notion.site/Mo-sai-Text-to-Audio-with-Long-Context-Latent-Diffusion-b43dbc71caf94b5898f9e8de714ab5dc) | [arXiv](https://arxiv.org/abs/2301.11757) | [GitHub](https://github.com/archinetai/audio-diffusion-pytorch) | - |
| 29.01 | [Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models](https://text-to-audio.github.io/) | [PDF](https://text-to-audio.github.io/paper.pdf) | - | - |
| 28.01 | [Noise2Music](https://noise2music.github.io/) | - | - | - |
| 27.01 | [RAVE2](https://twitter.com/antoine_caillon/status/1618959533065535491?s=20&t=jMkPWBFuAH19HI9m5Sklmg) [[Samples RAVE1](https://anonymous84654.github.io/RAVE_anonymous/)] | [arXiv](https://arxiv.org/abs/2111.05011) | [GitHub](https://github.com/acids-ircam/RAVE) | - |
| 26.01 | [MusicLM: Generating Music From Text](https://google-research.github.io/seanet/musiclm/examples/) | [arXiv](https://arxiv.org/abs/2301.11325) | [GitHub (unofficial)](https://github.com/lucidrains/musiclm-pytorch) | - |
| 18.01 | [Msanii: High Fidelity Music Synthesis on a Shoestring Budget](https://kinyugo.github.io/msanii-demo/) | [arXiv](https://arxiv.org/abs/2301.06468) | [GitHub](https://github.com/Kinyugo/msanii) | [Hugging Face](https://huggingface.co/spaces/kinyugo/msanii) [Colab](https://colab.research.google.com/github/Kinyugo/msanii/blob/main/notebooks/msanii_demo.ipynb) |
| 16.01 | [ArchiSound: Audio Generation with Diffusion](https://flavioschneider.notion.site/Audio-Generation-with-Diffusion-c4f29f39048d4f03a23da13078a44cdb) | [arXiv](https://arxiv.org/abs/2301.13267) | [GitHub](https://github.com/archinetai/audio-diffusion-pytorch) | - |
| 05.01 | [VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://www.microsoft.com/en-us/research/project/vall-e-x/) | [arXiv](https://arxiv.org/abs/2301.02111) | [GitHub (unofficial)](https://github.com/lifeiteng/vall-e) [(demo)](https://lifeiteng.github.io/valle/index.html) | - |
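## Example: running a listed model

Many rows above link to a code repository or pretrained checkpoint that can be run locally. As a minimal sketch, here is text-to-music generation with MusicGen from the `audiocraft` repository linked in the 08.06 row (assumes `pip install audiocraft`; the `facebook/musicgen-small` checkpoint name and API may change between releases, so check the linked repo for the current usage):

```python
# Sketch: text-to-music generation with MusicGen (facebookresearch/audiocraft).
# Assumption: `pip install audiocraft`; checkpoint names may change between releases.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the small pretrained checkpoint (larger variants: musicgen-medium, musicgen-large).
model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # generate 8 seconds of audio

# One waveform per text description, returned as a [batch, channels, samples] tensor.
wavs = model.generate(["lo-fi hip hop beat with mellow piano"])

# Write a loudness-normalized WAV file next to the script.
audio_write("musicgen_sample", wavs[0].cpu(), model.sample_rate, strategy="loudness")
```

Other entries follow similar patterns: for example, Bark exposes `generate_audio` from the `bark` package, and AudioLDM ships as a `diffusers` pipeline.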