Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ferdavid1/AIMV
AI model that generates a Music Video from a .wav file, using Instrument Classification, Speech Transcription, and Video Generation.
- Host: GitHub
- URL: https://github.com/ferdavid1/AIMV
- Owner: ferdavid1
- License: gpl-3.0
- Created: 2022-10-19T22:54:02.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2022-11-14T09:15:13.000Z (about 2 years ago)
- Last Synced: 2024-08-01T22:05:08.698Z (5 months ago)
- Topics: ai, audio-processing, computer-vision, image-generation, instrument-classification, ml, music-video, natural-language-processing, neural-network, speech-recognition, video-generation
- Homepage:
- Size: 63.5 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- project-awesome - ferdavid1/AIMV - AI model that generates a Music Video from a .wav file, using Instrument Classification, Speech Transcription, and Video Generation. (Others)
README
# AIMV
AI model that generates a Music Video from a .wav file, using Instrument Classification, Speech Transcription, and Video Generation.
## *Architecture*:
### Content Transcription
- #### Find Datasets
- [Ismir](https://www.ismir.net/resources/datasets/)
- #### **Instrument Classification**
- [Musical Instrument Identification](https://www.mdpi.com/1424-8220/22/8/3033/pdf?version=1650009477)
- [Drum Sound Subtype Classification](https://www.researchgate.net/publication/41538774_Classification_of_snare_drum_sounds_using_neural_networks)
- [Timbre Identification](https://iopscience.iop.org/article/10.1088/1742-6596/1856/1/012006/pdf)
- [Sound Event Detection](https://arxiv.org/pdf/2107.05463)
- [Medley-solos-DB: a cross-collection dataset for musical instrument recognition](https://zenodo.org/record/2582103)
- #### Speech Recognition (is speech present?) and subsequent Speech Transcription (if present) using OpenAI Whisper ASR. Use `--language Spanish` for Spanish audio, etc.
[Whisper](https://github.com/openai/whisper) uses Transformer Neural Networks.
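A minimal sketch of this step, assuming Whisper is installed (`pip install openai-whisper`); the `base` model size and the helper name `transcribe_window` are illustrative choices, not code from this repo:

```python
import whisper

# Load once; the "base" model size is an illustrative choice.
model = whisper.load_model("base")

def transcribe_window(wav_path: str, language: str | None = None) -> str:
    """Transcribe one audio window. Pass language="Spanish" to force Spanish,
    mirroring the CLI flag --language Spanish; returns "" if nothing intelligible."""
    result = model.transcribe(wav_path, language=language)
    return result["text"].strip()

print(transcribe_window("window.wav"))             # auto-detect language
print(transcribe_window("window.wav", "Spanish"))  # force Spanish
```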
- #### Concatenation of the above into a general content transcription
For an audio window of length 1 second containing a Vocal saying loudly "Heart" over a crash cymbal hit:
- Return (loud, heart, crash, splash, synth, hexcolorvalue)
This translates to:
- Volume relative to the running average volume + Vocal transcription + 3 most confident instrument identifications + color name (from frequency->hex map)
- Each of the three identifications needs to meet a confidence criterion, say 70%, to make the list. Ex: (crash, rimshot, snare)
- If fewer than 3 instruments are identified with 70% confidence, i.e. the list is (crash), fill the list with the second- and third-most-confident guesses for the ones that were identified confidently; in the case of crash, it could instead have been a splash, or a weirdly textured synth: (crash, splash, synth).
- Ex: with 70% confidence, (guitar, vocal) -> (guitar, vocal, bass), because the "guitar" guess has "bass" as its second-most-confident guess. A sketch of this rule follows below.
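A simplified sketch of the top-3 rule, assuming the classifier returns a confidence for every candidate instrument in the window; the helper name `pick_instruments` is hypothetical, not an API exposed by this repo:

```python
def pick_instruments(probs: dict[str, float], threshold: float = 0.70, k: int = 3) -> list[str]:
    """Return k instruments for the window: guesses meeting the confidence
    threshold first, then padded with the next-most-confident guesses."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    picks = [name for name, p in ranked[:k] if p >= threshold]
    # Pad with runner-up guesses when fewer than k cleared the threshold,
    # e.g. (crash) -> (crash, splash, synth).
    for name, _ in ranked:
        if len(picks) == k:
            break
        if name not in picks:
            picks.append(name)
    return picks

# Only "crash" clears 70%, so the list is padded with the runner-up guesses.
print(pick_instruments({"crash": 0.91, "splash": 0.48, "synth": 0.33, "kick": 0.02}))
# ['crash', 'splash', 'synth']
```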
We then port this to a sentence.
- Volume informs the intensity of the image, i.e. for a loud/quiet vocal saying "Heart", say "person screaming/yelling/whispering", action + object.
- Action being looking at, running at, holding, etc., the object in question ("heart"). Choose the action randomly.
- Further abstract the crash, splash, drum (transient description) by mapping these to additional, more abstract related words.
- Text similarity with Gensim [link](https://betterprogramming.pub/introduction-to-gensim-calculating-text-similarity-9e8b55de342d)
- (crash, splash, synth) -> (impact, wave, tech)
- (kick, bass, tom) -> (canon, deep, bounce)
- (cello, vocal, guitar) -> (floating, vibrate, pedal)
- Add these abstracted words to the final text description.
- The category (drum, guitar, violin, vocal, piano, bass) given by the initial transient description determines the verb (hits, picks, plays, sings, plays, plays).
- If Drum category and not Kick, assign "Drum stick strikes" as the verb
- The least likely category is assigned to the beginning of the metaphorical/abstraction section, with the last element of the abstraction as an adjective before it.
- If speech is present, assign "loudly" or "softly" according to the volume of the speech.
- t = transients, a = abstractions
- (if speech) "Person (volume) sings, (action) a (transcription), in the background to a (t1, t2) (verb), with a foreground (a3)(t3)(a2)(a1), the scene has a (hexcolorname) color palette." A sketch of this sentence construction follows the examples below.

#### Examples:
- (loud, heart, crash, splash, synth, hexcolorvalue, impact, wave, tech) -> "Person loudly sings, holding a heart, alongside a Crash and Splash hit with a foreground techy synth wave impact, the scene has a (hexcolorname) color palette."
- The abstraction is the foreground; if speech is present, it sits "alongside" the background of the instruments.
- (loud, N/A, kick, bass, tom, hexcolorvalue, canon, deep, bounce) -> "Kick hits and Bass plays in the background. In the foreground a bouncy tom deep canon, the scene has a (hexcolorname) color palette."
- (quiet, N/A, cello, vocal, guitar, hexcolorvalue, floating, vibrate, pedal) -> "Cello plays and Vocal sings in the background. In the foreground a pedaly guitar vibrate floating, the scene has a (hexcolorname) color palette."
- Here, Vocal was identified as a present instrument, but no intelligible speech was transcribed; it is likely a "hmmmmm" or an "oooooooo" or similar.
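Putting the above together, here is a minimal sketch of how such a sentence could be assembled. The helper `build_prompt` and the small lookup tables are assumptions for illustration, not code from this repo:

```python
import random

# Assumed lookup tables for illustration only; the real pipeline would derive
# these from the classifier labels and the frequency->hex color map above.
CATEGORY = {"kick": "drum", "crash": "drum", "splash": "drum", "tom": "drum",
            "bass": "bass", "guitar": "guitar", "vocal": "vocal"}
VERBS = {"drum": "hits", "guitar": "picks", "violin": "plays",
         "vocal": "sings", "piano": "plays", "bass": "plays"}
ADVERB = {"loud": "loudly", "quiet": "softly"}
ACTIONS = ["looking at", "running at", "holding"]

def build_prompt(volume, transcription, transients, abstractions, hexcolorname):
    """Assemble one window's scene sentence from
    (volume, transcription, t1..t3, a1..a3, hexcolorname)."""
    t1, t2, t3 = transients
    a1, a2, a3 = abstractions
    palette = f"the scene has a {hexcolorname} color palette."
    if transcription:  # intelligible speech was transcribed
        action = random.choice(ACTIONS)
        return (f"Person {ADVERB[volume]} sings, {action} a {transcription}, "
                f"alongside a {t1.title()} and {t2.title()} hit, "
                f"with a foreground {a3} {t3} {a2} {a1}, {palette}")
    # No speech: the two most confident instruments form the background.
    def verb(t):
        return VERBS.get(CATEGORY.get(t, ""), "plays")
    return (f"{t1.title()} {verb(t1)} and {t2.title()} {verb(t2)} in the background. "
            f"In the foreground a {a3} {t3} {a2} {a1}, {palette}")

print(build_prompt("loud", "heart",
                   ["crash", "splash", "synth"], ["impact", "wave", "tech"], "crimson"))
print(build_prompt("loud", None,
                   ["kick", "bass", "tom"], ["canon", "deep", "bounce"], "crimson"))
```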
### **Video Generation**
- [CogVideo](https://github.com/THUDM/CogVideo) uses Transformer Neural Networks
- [CogVideo Paper](https://github.com/THUDM/CogVideo)
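CogVideo itself is driven by the scripts in its repository; as an illustrative stand-in, the sketch below uses the later CogVideoX checkpoint exposed through Hugging Face `diffusers` to turn one generated sentence into a short clip. The model ID, frame count, and other parameters are assumptions, not part of this repo:

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = ("Person loudly sings, holding a heart, alongside a Crash and Splash hit, "
          "with a foreground techy synth wave impact, the scene has a crimson color palette.")

# CogVideoX-2b via diffusers; fp16 keeps memory manageable on a single GPU.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

frames = pipe(prompt=prompt, num_frames=49, num_inference_steps=50,
              guidance_scale=6.0).frames[0]
export_to_video(frames, "window_clip.mp4", fps=8)
```

Per-window clips generated this way could then be concatenated in order to follow the audio timeline.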