GitHub - Stability-AI/stable-audio-2-demo

⚠️ Warning: This website may not function properly on Safari. For the best experience, please use Google Chrome.

arXiv: Stable Audio 2 paper

stable-audio-tools: code to reproduce Stable Audio

stable-audio-metrics: code to evaluate Stable Audio

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m 45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5 Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
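As a rough sense of scale, the figures quoted above (4m 45s maximum duration, 21.5 Hz latent rate) and the 44.1 kHz sample rate listed in the comparison tables can be combined with simple arithmetic. This is only a back-of-the-envelope sketch: the variable names are illustrative, and because 21.5 Hz is presumably a rounded value, the results are approximate.

```python
# Figures taken from the text: 4m 45s, 21.5 Hz latent rate, 44.1 kHz audio.
max_duration_s = 4 * 60 + 45          # 285 seconds
latent_rate_hz = 21.5                 # latent frames per second (likely rounded)
sample_rate_hz = 44_100               # audio samples per second, per channel

# Sequence length the diffusion-transformer would see for a full-length track.
latent_frames = max_duration_s * latent_rate_hz          # 6127.5

# Approximate temporal downsampling implied by the two rates.
samples_per_latent = sample_rate_hz / latent_rate_hz     # about 2051

# Raw waveform length for the same track, per channel.
audio_samples = max_duration_s * sample_rate_hz          # 12568500

print(latent_frames, round(samples_per_latent), audio_samples)
```

The point of the comparison is that a full track is roughly six thousand latent frames rather than more than twelve million waveform samples per channel, which is what makes training on long temporal contexts tractable.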

Comparison with state-of-the-art (Song Describer Dataset prompts)

Prompt: An uplifting jazz song that makes your head shake.

Our Model (stereo, 44.1kHz)	MusicGen-large-stereo (stereo, 32kHz)
Audio not supported by your browser.	Audio not supported by your browser.

Prompt: One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.

Our Model	MusicGen-large-stereo
Audio not supported by your browser.	Audio not supported by your browser.

Prompt: Ambiental song that evokes calm with a progression of stereo electronic elements.

Our Model	MusicGen-large-stereo
Audio not supported by your browser.	Audio not supported by your browser.

Prompt: This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Our Model (stereo, 44.1kHz)	MusicGen-large-stereo (stereo, 32kHz)	Ground-truth (stereo, 44.1kHz)
Audio not supported by your browser.	Audio not supported by your browser.	Audio not supported by your browser.

Prompt: Calming instrumental music primarily on piano can be used for relaxing.

Our Model	MusicGen-large-stereo	Ground-truth
Audio not supported by your browser.	Audio not supported by your browser.	Audio not supported by your browser.

Prompt: A dance music club banger, with a heavy kick, subtle supporting percussion like tabla and bongos, prominent pop synth lines, and a repetitive hook.

Our Model	MusicGen-large-stereo	Ground-truth
Audio not supported by your browser.	Audio not supported by your browser.	Audio not supported by your browser.

These prompts/audios were used for the qualitative study we report in our paper.

Additional creative capabilities

Audio-to-audio: With diffusion models it is possible to perform some degree of style transfer by initializing the noise with audio during sampling. This capability can be used to modify the aesthetics of an existing recording based on a given text prompt, whilst maintaining the reference audio's structure (e.g., a beatbox recording could be style-transferred to produce realistic-sounding drums). As a result, our model can be influenced not only by text prompts but also by audio inputs, enhancing its controllability and expressiveness. We noted that when the model is initialized with voice recordings (such as beatbox or onomatopoeias), there is a sensation of control akin to playing an instrument.
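The idea of initializing the noise with audio can be illustrated with a toy sketch. It assumes the reference recording has already been encoded into a latent sequence, and it blends that latent with Gaussian noise by linear interpolation. The helper name `init_from_audio`, the `noise_strength` parameter, and the interpolation rule are all assumptions made for illustration; the actual noising schedule used by the model may differ.

```python
import random


def init_from_audio(reference_latent, noise_strength, seed=0):
    """Blend a reference latent with Gaussian noise (hypothetical helper).

    noise_strength = 0.0 returns the reference unchanged;
    noise_strength = 1.0 returns pure noise, i.e. ordinary text-only sampling.
    """
    if not 0.0 <= noise_strength <= 1.0:
        raise ValueError("noise_strength must be in [0, 1]")
    rng = random.Random(seed)
    return [
        (1.0 - noise_strength) * x + noise_strength * rng.gauss(0.0, 1.0)
        for x in reference_latent
    ]


# Toy latent standing in for an encoded beatbox recording.
reference = [0.5, -1.0, 0.25, 2.0]
start_point = init_from_audio(reference, noise_strength=0.7)
# A sampler would then denoise `start_point` under the text prompt.
```

Under this assumption, a lower `noise_strength` keeps more of the reference audio's structure, while a higher value gives the text prompt more influence over the result.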

Input audio	Output audio	Prompt
Audio not supported by your browser.	Audio not supported by your browser.	Bass guitar
Audio not supported by your browser.	Audio not supported by your browser.	format: solo, instruments: vibraphone
Audio not supported by your browser.	Audio not supported by your browser.	Genre: UK Bass, Instruments: 707 Drum Machine, Strings, 808 bass stabs, Beautiful Synths
Audio not supported by your browser.	Audio not supported by your browser.	Guitar
Audio not supported by your browser.	Audio not supported by your browser.	Drums

Vocal music: The training dataset contains a subset of music with vocals. Our focus is on the generation of instrumental music, so we do not provide any conditioning based on lyrics. As a result, when the model is prompted for vocals, its generations contain vocal-like melodies without intelligible words. Whilst not a substitute for intelligible vocals, these sounds have an artistic and textural value of their own.

Short-form audio generation: The training set does not exclusively contain long-form music. It also contains shorter sounds like sound effects or instrument samples. As a consequence, our model is also capable of producing such sounds when prompted appropriately.

Generation by our model	Prompt
Audio not supported by your browser.	Dog barking
Audio not supported by your browser.	Ringtone
Audio not supported by your browser.	Waves
Audio not supported by your browser.	Helicopter passing by from left to right
Audio not supported by your browser.	Fowl, chicken, rooster, crowing, cock-a-doodle-doo

Memorization analysis

Recent works examined the potential of generative models to memorize training data, especially for repeated elements in the training set. Further, MusicLM conducted a memorization analysis to address concerns about the potential misappropriation of creative content. Adhering to principles of responsible model development, we also ran a comprehensive study on memorization.

Considering the increased probability of memorizing repeated music within the dataset, we start by studying if our training set contains repeated data. We embed all our training data using the LAION-CLAP audio encoder to select audios that are close in this space based on a manually set threshold. The threshold is set such that the selected audios correspond to exact replicas. With this process, we identify 5566 repeated audios in our training set.
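The duplicate search described above can be sketched as a thresholded similarity comparison between embeddings. The sketch below assumes cosine similarity as the closeness measure and uses tiny hand-written vectors in place of real LAION-CLAP embeddings; the function names and the threshold value are illustrative, not taken from the paper.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def find_near_duplicates(embeddings, threshold):
    """Return index pairs (i, j), i < j, whose similarity is >= threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(embeddings), 2)
        if cosine_similarity(a, b) >= threshold
    ]


# Toy embeddings: items 0 and 1 are exact replicas, item 2 is unrelated.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(find_near_duplicates(toy_embeddings, threshold=0.99))  # [(0, 1)]
```

The all-pairs loop is quadratic in the number of items, so it only suits toy data; a full training set would need an approximate nearest-neighbour index or batched matrix operations.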

We compare our model's generations against the training set in LAION-CLAP space. Generations are from 5566 prompts within the repeated training data (in-distribution), and 586 prompts from the Song Describer Dataset (no-singing, out-of-distribution). We then identify the 50 generations that are closest to the training data and listen to them.
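One way to picture this ranking step is shown below: each generation is scored by its similarity to its nearest training item, and the highest-scoring generations are kept for listening. Cosine similarity, the function names, and the toy vectors are assumptions for illustration; real inputs would be LAION-CLAP embeddings, and the study keeps the top 50 rather than the top 1.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k_candidates(generated, training, k):
    """Rank generations by similarity to their closest training item.

    Returns up to k tuples (similarity, generation_index, training_index),
    most similar first.
    """
    scored = []
    for g_idx, g in enumerate(generated):
        best_sim, best_idx = max(
            (cosine_similarity(g, t), t_idx) for t_idx, t in enumerate(training)
        )
        scored.append((best_sim, g_idx, best_idx))
    scored.sort(reverse=True)
    return scored[:k]


# Toy data: generation 0 coincides with training item 0.
generated = [[1.0, 0.0], [1.0, 1.0]]
training = [[1.0, 0.0], [0.0, 1.0]]
print(top_k_candidates(generated, training, k=1))  # [(1.0, 0, 0)]
```

A high similarity score only flags a candidate; as the text notes, the final judgement on memorization is made by listening.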

We listened extensively to potential memorization candidates and could not find memorization. These are the most interesting candidates from (repeated) training data prompts:

Generation by our model	Closest #1	Closest #2	Closest #3	Prompt
Audio not supported by your browser.	427160	427105	140843	Birds chirping, forest birds, tropical, africa wild life, singing birds, sound effects.
Audio not supported by your browser.	978924	979616	978717	Totally rad 8-bit melodies and intense arps create that fearless throwback vibe.
Audio not supported by your browser.	979544	979695	979670	Totally rad 8-bit melodies and intense arps create that strong-willed throwback vibe.
Audio not supported by your browser.	972466	972983	973055	Pleasant strings create desire in this adamant scoring cue.

We found a fair amount of 8-bit/chiptune music that was repeated in the training dataset. Still, our model does not memorize it.

We also selected additional outstanding generations from Song Describer Dataset prompts, and could not find memorization. These are the most interesting memorization candidates:

Generation by our model	Closest #1	Closest #2	Closest #3	Prompt
Audio not supported by your browser.	796563	1083119	634461	One cannot avoid moving the feet and neck listening to this fast and loopy brazilian tune.
Audio not supported by your browser.	279428	1082095	326758	An uplifting jazz song that makes your head shake.
Audio not supported by your browser.	1024058	1023046	788950	Calming instrumental music primarily on piano can be used for relaxing.
Audio not supported by your browser.	470048	470047	696082	This song starts with a ukulele and builds up with percussion using claps and an acoustic guitar that plays the same rhythm as the ukulele with melody played on a xylophone and has a very upbeat feel to it.

Autoencoder: reconstructions

This comparison is useful for evaluating the audio fidelity of the autoencoder. On the left, we have the ground-truth recording. On the right, we take the ground-truth recording and pass it through the autoencoder. Note that the autoencoder reconstruction is fairly transparent, very close to the ground truth.

Ground truth	Autoencoder reconstruction
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
Audio not supported by your browser.	Audio not supported by your browser.
