arXiv: Stable Audio 2 paper
stable-audio-tools: code to reproduce Stable Audio
stable-audio-metrics: code to evaluate Stable Audio
Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m 45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
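To give a feel for how compact the latent representation is, the figures quoted above can be turned into a rough sequence-length estimate. This is only back-of-the-envelope arithmetic using the numbers stated in the text (21.5 Hz latent rate, 44.1 kHz audio, 4m 45s maximum length); the function names are illustrative and not part of any released code.

```python
# Figures taken from the text above.
LATENT_RATE_HZ = 21.5          # latent frames per second
SAMPLE_RATE_HZ = 44_100        # audio samples per second (per channel)
MAX_DURATION_S = 4 * 60 + 45   # 4m 45s expressed in seconds


def latent_frames(duration_s: float, latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Approximate number of latent frames covering `duration_s` seconds."""
    return duration_s * latent_rate_hz


def samples_per_latent(sample_rate_hz: float = SAMPLE_RATE_HZ,
                       latent_rate_hz: float = LATENT_RATE_HZ) -> float:
    """Approximate number of audio samples summarized by one latent frame."""
    return sample_rate_hz / latent_rate_hz


if __name__ == "__main__":
    print(MAX_DURATION_S)                    # 285
    print(MAX_DURATION_S * SAMPLE_RATE_HZ)   # 12568500 samples per channel
    print(latent_frames(MAX_DURATION_S))     # 6127.5
    print(round(samples_per_latent()))       # 2051
```

Under these figures, a full-length track is on the order of six thousand latent frames rather than roughly twelve and a half million waveform samples per channel, which is what makes training on such long temporal contexts plausible.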
Prompt: An uplifting jazz song that makes your head shake.
Our Model (stereo, 44.1kHz) | MusicGen-large-stereo (stereo, 32kHz) |
---|---|
Prompt: One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.
Our Model | MusicGen-large-stereo |
---|---|
Prompt: Ambiental song that evokes calm with a progression of stereo electronic elements.
Our Model | MusicGen-large-stereo |
---|---|
Prompt: This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.
Our Model (stereo, 44.1kHz) | MusicGen-large-stereo (stereo, 32kHz) | Ground-truth (stereo, 44.1kHz) |
---|---|---|
Prompt: Calming instrumental music primarily on piano can be used for relaxing.
Our Model | MusicGen-large-stereo | Ground-truth |
---|---|---|
Prompt: A dance music club banger, with a heavy kick, subtle supporting percussion like tabla and bongos, prominent pop synth lines, and a repetitive hook.
Our Model | MusicGen-large-stereo | Ground-truth |
---|---|---|
These prompts/audios were used for the qualitative study we report in our paper.
Audio-to-audio. With diffusion models, it is possible to perform some degree of style transfer by initializing the noise with audio during sampling. This capability can be used to modify the aesthetics of an existing recording based on a given text prompt, whilst maintaining the reference audio's structure (e.g., a beatbox recording can be style-transferred to produce realistic-sounding drums). As a result, our model can be influenced not only by text prompts but also by audio inputs, which enhances its controllability and expressiveness. We noted that when the model is initialized with voice recordings (such as beatboxing or onomatopoeias), there is a sensation of control akin to playing an instrument.
Input audio | Output audio | Prompt |
---|---|---|
Bass guitar | ||
format: solo, instruments: vibraphone | ||
Genre: UK Bass, Instruments: 707 Drum Machine, Strings, 808 bass stabs, Beautiful Synths | ||
Guitar | ||
Drums |
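The idea behind the audio-to-audio examples above, initializing the noise with audio, can be sketched in a few lines. The snippet below is a minimal illustration and not the released implementation: the cosine noise schedule, the function names, the `init_noise_level` parameter, and the latent shape are all assumptions made for the example. A real sampler would continue denoising from the blended latent under the text prompt.

```python
import numpy as np


def noise_schedule(t: float) -> tuple[float, float]:
    """Signal and noise weights at noise level t in [0, 1] (assumed cosine schedule)."""
    return float(np.cos(0.5 * np.pi * t)), float(np.sin(0.5 * np.pi * t))


def init_from_audio(init_latent: np.ndarray, init_noise_level: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Blend an encoded reference recording with Gaussian noise.

    Low noise levels keep most of the reference; high levels keep little of it.
    """
    alpha, sigma = noise_schedule(init_noise_level)
    noise = rng.standard_normal(init_latent.shape)
    return alpha * init_latent + sigma * noise


rng = np.random.default_rng(0)
# Stand-in for an encoded reference recording (hypothetical shape: channels x frames).
reference = rng.standard_normal((64, 128))

mostly_reference = init_from_audio(reference, 0.1, rng)
mostly_noise = init_from_audio(reference, 0.9, rng)


def correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two arrays, flattened."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


print(correlation(reference, mostly_reference))  # close to 1: structure preserved
print(correlation(reference, mostly_noise))      # much lower: mostly fresh noise
```

In this sketch the noise level acts as a trade-off knob: lower values stay closer to the structure of the input recording, while higher values leave more room for the text prompt to reshape it.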
Vocal music. The training dataset contains a subset of music with vocals. Our focus is on the generation of instrumental music, so we do not provide any conditioning based on lyrics. As a result, when the model is prompted for vocals, its generations contain vocal-like melodies without intelligible words. Whilst not a substitute for intelligible vocals, these sounds have an artistic and textural value of their own.
Short-form audio generation. The training set does not exclusively contain long-form music. It also contains shorter sounds such as sound effects or instrument samples. As a consequence, our model is also capable of producing such sounds when prompted appropriately.
Generation by our model | Prompt |
---|---|
Dog barking | |
Ringtone | |
Waves | |
Helicopter passing by from left to right | |
Fowl, chicken, rooster, crowing, cock-a-doodle-doo |
Recent works examined the potential of generative models to memorize training data, especially for repeated elements in the training set. Further, MusicLM conducted a memorization analysis to address concerns about the potential misappropriation of creative content. Adhering to principles of responsible model development, we also ran a comprehensive study on memorization.
Considering the increased probability of memorizing repeated music within the dataset, we start by studying if our training set contains repeated data. We embed all our training data using the LAION-CLAP audio encoder to select audios that are close in this space based on a manually set threshold. The threshold is set such that the selected audios correspond to exact replicas. With this process, we identify 5566 repeated audios in our training set.
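The duplicate search described above can be sketched as a thresholded similarity search over embeddings. The snippet below is a minimal illustration under stated assumptions: it uses cosine similarity (the text only says "close in this space"), a tiny hand-made array in place of real LAION-CLAP embeddings, and an arbitrary threshold of 0.99. The function name is hypothetical.

```python
import numpy as np


def near_duplicate_pairs(embeddings: np.ndarray, threshold: float) -> list[tuple[int, int]]:
    """Return index pairs (i, j) with i < j whose cosine similarity is >= threshold."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Keep only the strict upper triangle so each pair is reported once
    # and items are never compared with themselves.
    upper = np.triu(np.ones(sim.shape, dtype=bool), k=1)
    rows, cols = np.where(upper & (sim >= threshold))
    return list(zip(rows.tolist(), cols.tolist()))


# Toy stand-in for audio embeddings: item 3 is a near-copy of item 0.
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.001, 0.0],
])

print(near_duplicate_pairs(embeddings, threshold=0.99))  # [(0, 3)]
```

The dense similarity matrix here grows quadratically with the number of items, so a search over a full training set would more likely rely on chunked computation or an approximate nearest-neighbor index.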
We compare our model's generations against the training set in LAION-CLAP space. Generations are from 5566 prompts within the repeated training data (in-distribution), and 586 prompts from the Song Describer Dataset (no-singing, out-of-distribution). We then identify the 50 generations that are closest to the training data and listen to them.
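The comparison above amounts to a nearest-neighbor lookup followed by a ranking of the generations by how close their best match is. The following sketch assumes cosine similarity and uses random arrays in place of real LAION-CLAP embeddings; the function names, the embedding size, and the planted near-copy are hypothetical and only there to make the example self-checking.

```python
import numpy as np


def closest_training_items(gen_emb: np.ndarray, train_emb: np.ndarray, k: int = 3):
    """For each generation, return indices and similarities of its k closest training items."""
    gen = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    sim = gen @ train.T                          # shape: (n_generations, n_training)
    top_idx = np.argsort(-sim, axis=1)[:, :k]    # most similar first
    top_sim = np.take_along_axis(sim, top_idx, axis=1)
    return top_idx, top_sim


def memorization_candidates(top_sim: np.ndarray, n: int) -> np.ndarray:
    """Indices of the n generations whose best training match is most similar."""
    return np.argsort(-top_sim[:, 0])[:n]


rng = np.random.default_rng(0)
train_emb = rng.standard_normal((100, 16))   # toy "training set" embeddings
gen_emb = rng.standard_normal((10, 16))      # toy "generation" embeddings
# Plant one suspicious generation: a near-copy of training item 42.
gen_emb[4] = train_emb[42] + 1e-3 * rng.standard_normal(16)

top_idx, top_sim = closest_training_items(gen_emb, train_emb, k=3)
candidates = memorization_candidates(top_sim, n=3)
print(candidates[0], top_idx[candidates[0]])  # the planted pair ranks first
```

A ranking like this only shortlists candidates; as described in the text, the decision about memorization is made by listening to them.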
We listened extensively to potential memorization candidates and could not find memorization. These are the most interesting candidates from (repeated) training data prompts:
Generation by our model | Closest #1 | Closest #2 | Closest #3 | Prompt |
---|---|---|---|---|
427160 | 427105 | 140843 | Birds chirping, forest birds, tropical, africa wild life, singing birds, sound effects. | |
978924 | 979616 | 978717 | Totally rad 8-bit melodies and intense arps create that fearless throwback vibe. | |
979544 | 979695 | 979670 | Totally rad 8-bit melodies and intense arps create that strong-willed throwback vibe. | |
972466 | 972983 | 973055 | Pleasant strings create desire in this adamant scoring cue. |
We found a fair amount of 8-bit/chiptune music that was repeated in the training dataset. Still, our model does not memorize it.
We also selected additional outstanding generations from Song Describer Dataset prompts, and could not find memorization. These are the most interesting memorization candidates:
Generation by our model | Closest #1 | Closest #2 | Closest #3 | Prompt |
---|---|---|---|---|
796563 | 1083119 | 634461 | One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune. | |
279428 | 1082095 | 326758 | An uplifting jazz song that makes your head shake. | |
1024058 | 1023046 | 788950 | Calming instrumental music primarily on piano can be used for relaxing. | |
470048 | 470047 | 696082 | This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it. |
This comparison is useful to evaluate the audio fidelity of the autoencoder. On the left, we have the ground-truth recording. On the right, we take the ground-truth recording and pass it through the autoencoder. Note that the autoencoder reconstruction is fairly transparent, very close to the ground truth.
Ground truth | Autoencoder reconstruction |
---|---|