https://github.com/sony/creativeai


Host: GitHub
URL: https://github.com/sony/creativeai
Owner: sony
Created: 2022-11-21T03:13:46.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2025-04-01T08:16:33.000Z (8 months ago)
Last Synced: 2025-04-02T12:08:07.079Z (8 months ago)
Language: CSS
Size: 52.7 MB
Stars: 66
Watchers: 7
Forks: 5
Open Issues: 0
Metadata Files:
- Readme: README.md


README

          








# Deep Generative Modeling






	

		Jump Your Steps

		

		

			[arXiv]

		

		A general method to find an optimal sampling schedule for inference in discrete diffusion

		ICLR25

	

	

		HERO-DM

		

		

			[arXiv]

			[demo]

		

		A method that efficiently leverages online human feedback to fine-tune Stable Diffusion for a wide range of tasks

		ICLR25

	

	

		WPSE

		

		

			[arXiv]

		

		An enhanced multimodal representation using weighted point clouds and its theoretical benefits

		ICLR25

	

	

		PaGoDA

		

		

			[arXiv]

		

		A 64x64 pre-trained diffusion model is all you need for 1-step high-resolution SOTA generation

		NeurIPS24

		

	

		CTM

		

		

			[arXiv]

			[demo]

		

		A unified framework that enables diverse samplers and state-of-the-art 1-step generation

		ICLR24

		Applications:


			[SoundGen]

		

	

	

		SAN

		

		

			[arXiv]

			[code]

			[demo]

		

		Enhancing GANs with metrizable discriminators

		ICLR24

		Applications:


			[Vocoder]

		

	

	

		MPGD

		

		

			[arXiv]

			[demo]

		

		A fast, efficient, training-free, and controllable diffusion-based generation method

		ICLR24

		

	

		HQ-VAE

		

		

			[OpenReview]

			[arXiv]

		

		Generalizing hierarchical VQ-VAEs with a Bayesian framework

		TMLR

	

	

		FP-Diffusion

		

		

			[PMLR]

			[code]

		

		Improving density estimation of diffusion models

		ICML23

	

	

		GibbsDDRM

		

		

			[PMLR]

			[code]

		

		Achieving blind inversion using DDPM

		ICML23

		Applications:


			[DeReverb]

			[SpeechEnhance]

		

	

	

		Consistency-type Models

		

		

			[arXiv]

		

		A theoretically unified framework for "consistency" in diffusion models

		ICML23 SPIGM Workshop

	

	

		SQ-VAE

		

		

			[PMLR]

			[arXiv]

			[code]

		

		Improving codebook utilization and training stability

		ICML22

	

	

		AR-ELBO

		

		

			[Elsevier]

			[arXiv]

		

		Mitigating oversmoothness in VAEs

		Neurocomputing

	

    

    





# Multimodal NLP






	

		VinaBench

		

		

			[CVPR]

                        [arXiv]

                        [data]

		

		VinaBench: Benchmark for Faithful and Consistent Visual Narratives

		CVPR25

	

	

		DiffuCOMET

		

		

			[ACL]

			[arXiv]

			[code]

		

		DiffuCOMET: Contextual Commonsense Knowledge Diffusion

		ACL24

	

	

		CyCLIPs/CyCLAPs

		

		

			[ACL]

			[arXiv]

		

		On the Language Encoder of Contrastive Cross-modal Models

		ACL24

	

	

		DIIR

		

		

			[ACL]

			[arXiv]

			[code]

		

		Few-shot Dialogue Strategy Learning for Motivational Interviewing via Inductive Reasoning

		ACL24

	

	

		PeaCoK

		

		

			[ACL]

			[arXiv]

			[code]

		

		PeaCoK: Persona Commonsense Knowledge for Consistent and Engaging Narratives (Outstanding Paper Award)

        ACL23

	

	

		ComFact

		

		

			[EMNLP]

			[arXiv]

			[code]

		

		ComFact: A Benchmark for Linking Contextual Commonsense Knowledge

        EMNLP22 Findings

	

        

        






# Music Technologies






    

        Variable Bitrate RVQ

        

        

            [arXiv]

        

        VRVQ: Variable Bitrate Residual Vector Quantization for Audio Compression

        ICASSP25

    

    

        Instr. Timbre Transfer

        

        

            [arXiv]

            [code]

            [demo]

        

        Latent Diffusion Bridges for Unsupervised Musical Audio Timbre Transfer

        ICASSP25

    

	

		Mixing Graph Estimation

		

		

			[arXiv]

            [code]

            [demo]

		

		Searching For Music Mixing Graphs: A Pruning Approach

		DAFx24

	

	

		Guitar Amp. Modeling

		

		

            [arXiv]

		

		Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data

		DAFx24

	

	

		Text-to-Music Editing

		

		

			[arXiv]

			[code]

            [demo]

		

		MusicMagus: Zero-Shot Text-to-Music Editing via Diffusion Models

		IJCAI24

	

	

		Instr.-Agnostic Trans.

		

		

			[IEEE]

			[arXiv]

		

		Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription

		ICASSP24

	

	

		Vocal Restoration

		

		

			[IEEE]

			[arXiv]

			[demo]

		

		VRDMG: Vocal Restoration via Diffusion Posterior Sampling with Multiple Guidance

		ICASSP24

		

	

		hFT-Transformer

		

		

			[arXiv]

			[code]

		

		Automatic Piano Transcription with Hierarchical Frequency-Time Transformer

        ISMIR23

	

	

		Automatic Music Tagging

		

		

			[arXiv]

		

		An Attention-based Approach To Hierarchical Multi-label Music Instrument Classification

        ICASSP23

	

	

		Vocal Dereverberation

		

		

			[arXiv]

			[demo]

		

		Unsupervised Vocal Dereverberation with Diffusion-based Generative Models

        ICASSP23

	

	

		Mixing Style Transfer

		

		

			[arXiv]

			[code]

			[demo]

		

		Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

        ICASSP23

	

	

		Music Transcription

		

		

			[arXiv]

			[code]

			[demo]

		

		DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability

        ICASSP23

	

	

		Singing Voice Vocoder

		

		

			[arXiv]

			[demo]

		

		Hierarchical Diffusion Models for Singing Voice Neural Vocoder

        ICASSP23

	

	

		Distortion Effect Removal

		

		

			[poster]

			[arXiv]

			[demo]

		

		Distortion Audio Effects: Learning How to Recover the Clean Signal

        ISMIR22

	

	

		Automatic Music Mixing

		

		

			[poster]

			[arXiv]

			[code]

			[demo]

		

		Automatic Music Mixing with Deep Learning and Out-of-Domain Data

        ISMIR22

	

	

		Sound Separation

		

		

			[IEEE]

		

		Music Source Separation with Deep Equilibrium Models

        ICASSP22

	

	

		Automatic DJ Transition

		

		

			[arXiv]

			[code]

			[demo]

		

		Automatic DJ Transitions with Differentiable Audio Effects and Generative Adversarial Networks

        ICASSP22

	

	

		Singing Voice Conversion

		

		

			[arXiv]

			[demo]

		

		Robust One-Shot Singing Voice Conversion

	

	

		Sound Separation

		

		

			[video]

			[site]

		

		Glenn Gould and Kanji Ishimaru 2021: A collaboration with AI Sound Separation after 60 years

	









# Cinematic Technologies






	

		MMAudio

		

		

			[arXiv]

			[code]

			[demo]

		

		MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

        CVPR25

	

	

		MMDisCo

		

		

			[OpenReview]

			[arXiv]

			[code]

		

		MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

        ICLR25

	

	

		SoundCTM

		

		

			[OpenReview]

			[arXiv]

			[code]

			[demo]

		

		SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

        ICLR25

	

    

        Mining Your Own Secrets

        

        

            [OpenReview]

            [arXiv]

        

        Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

        ICLR25

    

    

		GenWarp

		

		

			[arXiv]

			[demo]

		

		GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

        NeurIPS24

	

    

        SpecMaskGIT

        

        

            [arXiv]

            [demo]

            

        SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

        ISMIR24

    

	

		Acoustic Inv. Rendering

		

		

			[CVF]

			[arXiv]

			[dataset]

			[code]

			[demo]

		

		Hearing Anything Anywhere

		CVPR24

	

	

		BigVSAN Vocoder

		

		

			[arXiv]

			[code]

			[demo]

		

		BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

		ICASSP24

	

	

		Zero-/Few-shot SELD

		

		

			[IEEE]

			[arXiv]

		

		Zero- and Few-shot Sound Event Localization and Detection

		ICASSP24

	

	

		STARSS23

		

		

			[arXiv]

			[dataset]

		

		STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events

		NeurIPS23

	

	

		Audio Restoration: ViT-AE

		

		

			[IEEE]

			[arXiv]

			[demo]

		

		Extending Audio Masked Autoencoders Toward Audio Restoration

        WASPAA23

	

	

		Diffiner

		

		

			[ISCA]

			[arXiv]

			[code]

		

		Diffiner: A Versatile Diffusion-based Generative Refiner for Speech Enhancement

        INTERSPEECH23

	

	

		CLIPSep

		

		

			[OpenReview]

			[arXiv]

			[code]

			[demo]

		

		CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

        ICLR23

		

	

		Sound Event Localization and Detection

		

		

			[IEEE]

			[arXiv]

		

		Multi-ACCDOA: Localizing and Detecting Overlapping Sounds from the Same Class with Auxiliary Duplicating Permutation Invariant Training

        ICASSP22

	

    





# Hosted Challenges






	

		SVG Challenge 2024

		

		

			[SVG Challenge 2024]

		

		Sounding Video Generation Challenge 2024

	

	

		DCASE Challenge Task 3

		

		

			[DCASE Challenge 2023]

		

		Sound Event Localization and Detection Evaluated in Real Spatial Sound Scenes

	

	

		CPD Challenge 2023

		

		

			[CPD Challenge 2023]

		

		Commonsense Persona-grounded Dialogue Challenge

	

	

		SDX Challenge 2023

		

		

			[site]

			[paper (music)]

			[paper (cinematic)]

		

		Sound Demixing Challenge 2023

		

	

		MDX Challenge 2021

		

		

			[site]

			[frontiers]

		

		Music Demixing Challenge 2021

	

    

    



### Contact

 Yuki Mitsufuji (yuhki.mitsufuji@sony.com)