Stability-AI/stable-audio-2-demo

⚠️ Warning: This website may not function properly on Safari. For the best experience, please use Google Chrome.

arXiv: Stable Audio 2 paper

stable-audio-tools: code to reproduce Stable Audio

stable-audio-metrics: code to evaluate Stable Audio

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m 45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
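As a rough back-of-the-envelope check of what these figures imply, the sketch below computes how many latent frames a 4m 45s track occupies at a 21.5 Hz latent rate, and the downsampling factor this implies for 44.1 kHz audio. The function names are illustrative, and the rounded 21.5 Hz figure means the results are approximate rather than the model's exact configuration.

```python
def latent_frames(duration_s: float, latent_rate_hz: float) -> float:
    """Number of latent frames needed to cover `duration_s` seconds."""
    return duration_s * latent_rate_hz


def downsampling_factor(sample_rate_hz: float, latent_rate_hz: float) -> float:
    """Audio samples represented by one latent frame."""
    return sample_rate_hz / latent_rate_hz


duration_s = 4 * 60 + 45                     # 4m 45s = 285 seconds
frames = latent_frames(duration_s, 21.5)     # 6127.5 latent frames
factor = downsampling_factor(44_100, 21.5)   # roughly 2051 samples per frame

print(f"{frames:.1f} latent frames, ~{factor:.0f}x downsampling")
```

In other words, a full-length track corresponds to a sequence of roughly six thousand latent frames, which is what makes training on long temporal contexts tractable.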

Comparison with state-of-the-art (song describer dataset prompts)

Prompt: An uplifting jazz song that makes your head shake.

Our Model (stereo, 44.1kHz) MusicGen-large-stereo (stereo, 32kHz)

Prompt: One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.

Our Model MusicGen-large-stereo

Prompt: Ambiental song that evokes calm with a progression of stereo electronic elements.

Our Model MusicGen-large-stereo

Prompt: This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Our Model (stereo, 44.1kHz) MusicGen-large-stereo (stereo, 32kHz) Ground-truth (stereo, 44.1kHz)

Prompt: Calming instrumental music primarily on piano can be used for relaxing.

Our Model MusicGen-large-stereo Ground-truth

Prompt: A dance music club banger, with a heavy kick, subtle supporting percussion like tabla and bongos, prominent pop synth lines, and a repetitive hook.

Our Model MusicGen-large-stereo Ground-truth

These prompts and audio clips were used in the qualitative study reported in our paper.

Additional creative capabilities

Audio-to-audio. With diffusion models, it is possible to perform some degree of style transfer by initializing the noise with audio during sampling. This capability can be used to modify the aesthetics of an existing recording based on a given text prompt, whilst maintaining the reference audio's structure (e.g., a beatbox recording could be style-transferred to produce realistic-sounding drums). As a result, our model can be influenced not only by text prompts but also by audio inputs, which enhances its controllability and expressiveness. We noted that when the model is initialized with voice recordings (such as beatboxing or onomatopoeias), there is a sensation of control akin to playing an instrument.

Input audio Output audio Prompt
[audio] [audio] Bass guitar
[audio] [audio] format: solo, instruments: vibraphone
[audio] [audio] Genre: UK Bass, Instruments: 707 Drum Machine, Strings, 808 bass stabs, Beautiful Synths
[audio] [audio] Guitar
[audio] [audio] Drums
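The idea of initializing the noise with audio can be illustrated with a minimal sketch. This is not the sampling code used by the model: it assumes a simple linear blend between a reference latent and Gaussian noise, and `init_from_audio` and `noise_level` are hypothetical names introduced here for illustration. A low noise level keeps more of the reference structure, while a high one gives the text prompt more freedom.

```python
import random


def init_from_audio(reference, noise_level, rng):
    """Blend a reference latent with Gaussian noise (illustrative assumption).

    noise_level = 0.0 returns the reference unchanged;
    noise_level = 1.0 returns pure noise, as in ordinary text-to-audio sampling.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    return [
        (1.0 - noise_level) * x + noise_level * rng.gauss(0.0, 1.0)
        for x in reference
    ]


rng = random.Random(0)                  # seeded for repeatability
reference_latent = [0.5, -0.25, 1.0]    # toy stand-in for an encoded recording

unchanged = init_from_audio(reference_latent, 0.0, rng)
noisy_start = init_from_audio(reference_latent, 0.7, rng)
# A sampler would then denoise `noisy_start` under the text prompt.
```

The blend weight is the main control: the closer the starting point is to the reference, the more of its rhythm and structure survives sampling.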

Vocal music. The training dataset contains a subset of music with vocals. Our focus is on the generation of instrumental music, so we do not provide any conditioning based on lyrics. As a result, when the model is prompted for vocals, its generations contain vocal-like melodies without intelligible words. Whilst not a substitute for intelligible vocals, these sounds have an artistic and textural value of their own.

Short-form audio generation. The training set does not exclusively contain long-form music. It also contains shorter sounds like sound effects or instrument samples. As a consequence, our model is also capable of producing such sounds when prompted appropriately.

Generation by our model Prompt
[audio] Dog barking
[audio] Ringtone
[audio] Waves
[audio] Helicopter passing by from left to right
[audio] Fowl, chicken, rooster, crowing, cock-a-doodle-doo

Memorization analysis

Recent work has examined the potential of generative models to memorize training data, especially for repeated elements in the training set. Further, MusicLM conducted a memorization analysis to address concerns about the potential misappropriation of creative content. Adhering to principles of responsible model development, we also run a comprehensive study on memorization.

Considering the increased probability of memorizing repeated music within the dataset, we start by studying if our training set contains repeated data. We embed all our training data using the LAION-CLAP audio encoder to select audios that are close in this space based on a manually set threshold. The threshold is set such that the selected audios correspond to exact replicas. With this process, we identify 5566 repeated audios in our training set.
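The thresholding step can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual code: it assumes cosine similarity as the closeness measure, uses toy two-dimensional vectors in place of real LAION-CLAP embeddings, and the names `find_repeated` and `threshold` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def find_repeated(embeddings, threshold):
    """Return the ids of all items that have a near-identical partner."""
    repeated = set()
    for (id_a, emb_a), (id_b, emb_b) in combinations(embeddings.items(), 2):
        if cosine_similarity(emb_a, emb_b) >= threshold:
            repeated.update((id_a, id_b))
    return repeated


# Toy embeddings: "a" and "b" are exact replicas, "c" is unrelated.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
repeated_ids = find_repeated(toy_embeddings, threshold=0.999)
```

The all-pairs loop is quadratic in the number of items, so a dataset of realistic size would need an approximate nearest-neighbour index instead; the sketch only shows the thresholding logic.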

We compare our model's generations against the training set in LAION-CLAP space. Generations are from 5566 prompts within the repeated training data (in-distribution), and 586 prompts from the Song Describer Dataset (no-singing, out-of-distribution). We then identify the 50 generations that are closest to the training data and listen to them.
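A minimal sketch of this ranking step is shown below. It again assumes cosine similarity and toy vectors rather than real LAION-CLAP embeddings, and `top_k_candidates` is a hypothetical helper: for each generation it finds the nearest training item, then keeps the `k` generations with the highest similarity as candidates for listening.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k_candidates(generated, training, k):
    """Rank generations by similarity to their nearest training item."""
    scored = []
    for gen_id, gen_emb in generated.items():
        best_id, best_sim = max(
            ((train_id, cosine_similarity(gen_emb, train_emb))
             for train_id, train_emb in training.items()),
            key=lambda pair: pair[1],
        )
        scored.append((best_sim, gen_id, best_id))
    scored.sort(reverse=True)  # most similar first
    return scored[:k]


# Toy data: "g1" sits close to training item "t1", "g2" is far from both.
training = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
generated = {"g1": [0.9, 0.1], "g2": [0.5, 0.5]}
candidates = top_k_candidates(generated, training, k=1)
```

High similarity only flags a candidate. Whether it is actually memorized is decided by listening, as described above.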

We listened extensively to potential memorization candidates and could not find memorization. These are the most interesting candidates from (repeated) training data prompts:

Generation by our model Closest #1 Closest #2 Closest #3 Prompt
[audio] 427160 427105 140843 Birds chirping, forest birds, tropical, africa wild life, singing birds, sound effects.
[audio] 978924 979616 978717 Totally rad 8-bit melodies and intense arps create that fearless throwback vibe.
[audio] 979544 979695 979670 Totally rad 8-bit melodies and intense arps create that strong-willed throwback vibe.
[audio] 972466 972983 973055 Pleasant strings create desire in this adamant scoring cue.

We found a fair amount of 8-bit/chiptune music that was repeated in the training dataset. Still, our model does not memorize it.

We also selected additional outstanding generations from Song Describer Dataset prompts, and could not find memorization. These are the most interesting memorization candidates:

Generation by our model Closest #1 Closest #2 Closest #3 Prompt
[audio] 796563 1083119 634461 One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.
[audio] 279428 1082095 326758 An uplifting jazz song that makes your head shake.
[audio] 1024058 1023046 788950 Calming instrumental music primarily on piano can be used for relaxing.
[audio] 470048 470047 696082 This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Autoencoder: reconstructions

This comparison is useful for evaluating the audio fidelity of the autoencoder. On the left, we have the ground-truth recording. On the right, we take the ground-truth recording and pass it through the autoencoder. Note that the autoencoder reconstruction is fairly transparent, very close to the ground truth.

Ground truth Autoencoder reconstruction
