# Audio AI Timeline

Here we keep track of the latest AI models for waveform-based audio generation, starting in 2023!

## 2023

| Date | Release [Samples] | Paper | Code | Trained Model |
| ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------ | -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| 14.11 | [Mustango: Toward Controllable Text-to-Music Generation](https://amaai-lab.github.io/mustango/) | [arXiv](https://arxiv.org/abs/2311.08355) | [GitHub](https://github.com/AMAAI-Lab/mustango) | [Hugging Face](https://huggingface.co/declare-lab/mustango) |
| 13.11 | [Music ControlNet: Multiple Time-varying Controls for Music Generation](https://musiccontrolnet.github.io/web/) | [arXiv](https://arxiv.org/abs/2311.07069) | - | - |
| 02.11 | [E3 TTS: Easy End-to-End Diffusion-based Text to Speech](https://e3tts.github.io/) | [arXiv](https://arxiv.org/abs/2311.00945) | - | - |
| 01.10 | [UniAudio: An Audio Foundation Model Toward Universal Audio Generation](http://dongchaoyang.top/UniAudio_demo/) | [arXiv](https://arxiv.org/abs/2310.00704) | [GitHub](https://github.com/yangdongchao/UniAudio) | - |
| 24.09 | [VoiceLDM: Text-to-Speech with Environmental Context](https://voiceldm.github.io/) | [arXiv](https://arxiv.org/abs/2309.13664) | [GitHub](https://github.com/glory20h/VoiceLDM) | - |
| 05.09 | [PromptTTS 2: Describing and Generating Voices with Text Prompt](https://speechresearch.github.io/prompttts2/) | [arXiv](https://arxiv.org/abs/2309.02285) | - | - |
| 14.08 | [SpeechX: Neural Codec Language Model as a Versatile Speech Transformer](https://www.microsoft.com/en-us/research/project/speechx/) | [arXiv](https://arxiv.org/abs/2308.06873) | - | - |
| 10.08 | [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://audioldm.github.io/audioldm2/) | [arXiv](https://arxiv.org/abs/2308.05734) | [GitHub](https://github.com/haoheliu/audioldm2) | [Hugging Face](https://huggingface.co/spaces/haoheliu/audioldm2-text2audio-text2music) |
| 09.08 | [JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models](https://www.futureverse.com/research/jen/demos/jen1) | [arXiv](https://arxiv.org/abs/2308.04729) | - | - |
| 03.08 | [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://musicldm.github.io/) | [arXiv](https://arxiv.org/abs/2308.01546) | [GitHub](https://github.com/RetroCirce/MusicLDM/) | - |
| 14.07 | [Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts](https://mega-tts.github.io/mega2_demo/) | [arXiv](https://arxiv.org/abs/2307.07218) | - | - |
| 10.07 | [VampNet: Music Generation via Masked Acoustic Token Modeling](https://hugo-does-things.notion.site/VampNet-Music-Generation-via-Masked-Acoustic-Token-Modeling-e37aabd0d5f1493aa42c5711d0764b33) | [arXiv](https://arxiv.org/abs/2307.04686) | [GitHub](https://github.com/hugofloresgarcia/vampnet) | - |
| 22.06 | [AudioPaLM: A Large Language Model That Can Speak and Listen](https://google-research.github.io/seanet/audiopalm/examples/) | [arXiv](https://arxiv.org/abs/2306.12925) | - | - |
| 19.06 | [Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale](https://voicebox.metademolab.com/) | [PDF](https://scontent-lga3-2.xx.fbcdn.net/v/t39.8562-6/354636794_599417672291955_3799385851435258804_n.pdf?_nc_cat=101&ccb=1-7&_nc_sid=ad8a9d&_nc_ohc=bN1S0esWehwAX_22ORV&_nc_ht=scontent-lga3-2.xx&oh=00_AfCtouRXIvwDx10qPVMNkq_4xMTVOQUrfmyYQW--9cIoWg&oe=64947BF1) | [GitHub](https://github.com/SpeechifyInc/Meta-voicebox) | - |
| 08.06 | [MusicGen: Simple and Controllable Music Generation](https://ai.honu.io/papers/musicgen/) | [arXiv](https://arxiv.org/abs/2306.05284) | [GitHub](https://github.com/facebookresearch/audiocraft) | [Hugging Face](https://huggingface.co/spaces/facebook/MusicGen) [Colab](https://colab.research.google.com/drive/1fxGqfg96RBUvGxZ1XXN07s3DthrKUl4-?usp=sharing) |
| 06.06 | [Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias](https://mega-tts.github.io/demo-page/) | [arXiv](https://arxiv.org/abs/2306.03509) | - | - |
| 01.06 | [Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis](https://charactr-platform.github.io/vocos/) | [arXiv](https://arxiv.org/abs/2306.00814) | [GitHub](https://github.com/charactr-platform/vocos) | - |
| 29.05 | [Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation](https://make-an-audio-2.github.io/) | [arXiv](https://arxiv.org/abs/2305.18474) | - | - |
| 25.05 | [MeLoDy: Efficient Neural Music Generation](https://efficient-melody.github.io/) | [arXiv](https://arxiv.org/abs/2305.15719) | - | - |
| 18.05 | [CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training](https://clapspeech.github.io/) | [arXiv](https://arxiv.org/abs/2305.10763) | - | - |
| 18.05 | [SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities](https://0nutation.github.io/SpeechGPT.github.io/) | [arXiv](https://arxiv.org/abs/2305.11000) | [GitHub](https://github.com/0nutation/SpeechGPT) | - |
| 16.05 | [SoundStorm: Efficient Parallel Audio Generation](https://google-research.github.io/seanet/soundstorm/examples/) | [arXiv](https://arxiv.org/abs/2305.09636) | [GitHub (unofficial)](https://github.com/lucidrains/soundstorm-pytorch) | - |
| 03.05 | [Diverse and Vivid Sound Generation from Text Descriptions](https://ligw1998.github.io/audiogeneration.html) | [arXiv](https://arxiv.org/abs/2305.01980) | - | - |
| 02.05 | [Long-Term Rhythmic Video Soundtracker](https://justinyuu.github.io/LORIS/) | [arXiv](https://arxiv.org/abs/2305.01319) | [GitHub](https://github.com/OpenGVLab/LORIS) | - |
| 24.04 | [TANGO: Text-to-Audio generation using instruction tuned LLM and Latent Diffusion Model](https://tango-web.github.io/) | [PDF](https://openreview.net/pdf?id=1Sn2WqLku1e) | [GitHub](https://github.com/declare-lab/tango) | [Hugging Face](https://huggingface.co/declare-lab/tango) |
| 18.04 | [NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers](https://speechresearch.github.io/naturalspeech2/) | [arXiv](https://arxiv.org/abs/2304.09116) | [GitHub (unofficial)](https://github.com/lucidrains/naturalspeech2-pytorch) | - |
| 10.04 | [Bark: Text-Prompted Generative Audio Model](https://github.com/suno-ai/bark) | - | [GitHub](https://github.com/suno-ai/bark) | [Hugging Face](https://huggingface.co/spaces/suno/bark) [Colab](https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing) |
| 03.04 | [AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models](https://audit-demo.github.io/) | [arXiv](https://arxiv.org/abs/2304.00830) | - | - |
| 08.03 | [VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling](https://www.microsoft.com/en-us/research/project/vall-e-x/) | [arXiv](https://arxiv.org/abs/2303.03926) | - | - |
| 27.02 | [I Hear Your True Colors: Image Guided Audio Generation](https://pages.cs.huji.ac.il/adiyoss-lab/im2wav/) | [arXiv](https://arxiv.org/abs/2211.03089) | [GitHub](https://github.com/RoySheffer/im2wav) | - |
| 08.02 | [Noise2Music: Text-conditioned Music Generation with Diffusion Models](https://google-research.github.io/noise2music/) | [arXiv](https://arxiv.org/abs/2302.03917) | - | - |
| 04.02 | [Multi-Source Diffusion Models for Simultaneous Music Generation and Separation](https://gladia-research-group.github.io/multi-source-diffusion-models/) | [arXiv](https://arxiv.org/abs/2302.02257) | [GitHub](https://github.com/gladia-research-group/multi-source-diffusion-models) | - |
| 30.01 | [SingSong: Generating musical accompaniments from singing](https://storage.googleapis.com/sing-song/index.html) | [arXiv](https://arxiv.org/abs/2301.12662) | - | - |
| 30.01 | [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://audioldm.github.io/) | [arXiv](https://arxiv.org/abs/2301.12503) | [GitHub](https://github.com/haoheliu/AudioLDM) | [Hugging Face](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation) |
| 30.01 | [Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion](https://anonymous0.notion.site/Mo-sai-Text-to-Audio-with-Long-Context-Latent-Diffusion-b43dbc71caf94b5898f9e8de714ab5dc) | [arXiv](https://arxiv.org/abs/2301.11757) | [GitHub](https://github.com/archinetai/audio-diffusion-pytorch) | - |
| 29.01 | [Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models](https://text-to-audio.github.io/) | [PDF](https://text-to-audio.github.io/paper.pdf) | - | - |
| 28.01 | [Noise2Music](https://noise2music.github.io/) | - | - | - |
| 27.01 | [RAVE2](https://twitter.com/antoine_caillon/status/1618959533065535491?s=20&t=jMkPWBFuAH19HI9m5Sklmg) [[Samples RAVE1](https://anonymous84654.github.io/RAVE_anonymous/)] | [arXiv](https://arxiv.org/abs/2111.05011) | [GitHub](https://github.com/acids-ircam/RAVE) | - |
| 26.01 | [MusicLM: Generating Music From Text](https://google-research.github.io/seanet/musiclm/examples/) | [arXiv](https://arxiv.org/abs/2301.11325) | [GitHub (unofficial)](https://github.com/lucidrains/musiclm-pytorch) | - |
| 18.01 | [Msanii: High Fidelity Music Synthesis on a Shoestring Budget](https://kinyugo.github.io/msanii-demo/) | [arXiv](https://arxiv.org/abs/2301.06468) | [GitHub](https://github.com/Kinyugo/msanii) | [Hugging Face](https://huggingface.co/spaces/kinyugo/msanii) [Colab](https://colab.research.google.com/github/Kinyugo/msanii/blob/main/notebooks/msanii_demo.ipynb) |
| 16.01 | [ArchiSound: Audio Generation with Diffusion](https://flavioschneider.notion.site/Audio-Generation-with-Diffusion-c4f29f39048d4f03a23da13078a44cdb) | [arXiv](https://arxiv.org/abs/2301.13267) | [GitHub](https://github.com/archinetai/audio-diffusion-pytorch) | - |
| 05.01 | [VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers](https://www.microsoft.com/en-us/research/project/vall-e-x/) | [arXiv](https://arxiv.org/abs/2301.02111) | [GitHub (unofficial)](https://github.com/lifeiteng/vall-e) [(demo)](https://lifeiteng.github.io/valle/index.html) | - |
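
## Trying a model locally

Many entries above ship an installable package alongside the paper. As a minimal sketch (not an official recipe), the example below generates audio with Bark (listed under 10.04), assuming the package has been installed with `pip install git+https://github.com/suno-ai/bark.git` and that its public API (`preload_models`, `generate_audio`, `SAMPLE_RATE`) matches the version documented in the suno-ai/bark README:

```python
# Minimal sketch: text-prompted audio generation with Bark (suno-ai/bark).
# Assumes: pip install git+https://github.com/suno-ai/bark.git, plus scipy
# for writing the WAV file. The API follows the Bark README and may differ
# in newer releases.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download and cache the model weights (several GB on first run).
preload_models()

# Generate a waveform (NumPy float array at SAMPLE_RATE Hz) from a text prompt.
audio_array = generate_audio("Hello, this is a test of text-prompted audio generation.")

# Save the result to disk.
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
```

Entries with a Hugging Face link in the Trained Model column (e.g. MusicGen, AudioLDM 2, TANGO) can usually be tried in the browser via a Space, without any local setup.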