arXiv: Stable Audio Open paper
HuggingFace: model weights
stable-audio-tools: code to reproduce Stable Audio
stable-audio-metrics: code to evaluate Stable Audio
Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.
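To make "a manageable sequence length" concrete, the arithmetic below counts the raw samples in a maximum-length clip and the number of latent frames left after downsampling. This is only a sketch: the sample rate, duration, and channel count come from the description above, while the downsampling factor of 2048 and the helper name `latent_length` are assumptions chosen for illustration and are not stated on this page.

```python
import math

SAMPLE_RATE_HZ = 44_100   # from the text: 44.1kHz
MAX_SECONDS = 47          # from the text: up to 47s
CHANNELS = 2              # from the text: stereo


def latent_length(num_samples: int, downsample_factor: int) -> int:
    """Latent frames when every `downsample_factor` samples map to one frame."""
    return math.ceil(num_samples / downsample_factor)


samples_per_channel = SAMPLE_RATE_HZ * MAX_SECONDS
total_values = samples_per_channel * CHANNELS

# ASSUMPTION: 2048 is an illustrative downsampling factor, not taken from this page.
frames = latent_length(samples_per_channel, downsample_factor=2048)

print(samples_per_channel)  # 2072700
print(total_values)         # 4145400
print(frames)               # 1013
```

Under that assumed factor, the diffusion model would attend over roughly a thousand latent frames instead of about two million waveform samples per channel.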
Prompt: Pinball bumper.
Prompt: 80s drum beat.
Prompt: 80s bass guitar.
Prompt: Slap mandolin.
Prompt: Rain is falling and hitting surfaces and then splashing into puddles.
Stable Audio Open | Stable Audio 2.0 | AudioLDM2-48kHz |
---|---|---|
Prompt: A train horn goes off loudly.
Stable Audio Open | Stable Audio 2.0 | AudioLDM2-48kHz |
---|---|---|
Prompt: Gurgling and splashing water.
Stable Audio Open | Stable Audio 2.0 | AudioLDM2-48kHz |
---|---|---|
Prompt: An engine throttles and clanks and then suddenly accelerates off into the distance.
Stable Audio Open | Stable Audio 2.0 | AudioLDM2-48kHz |
---|---|---|
Prompt: A dance music club banger, with a heavy kick, subtle supporting percussion like tabla and bongos, prominent pop synth lines, and a repetitive hook.
Stable Audio Open | Stable Audio 2.0 | MusicGen-large-stereo |
---|---|---|
Prompt: A danceable electronic track in the genre of dance
Stable Audio Open | Stable Audio 2.0 | MusicGen-large-stereo |
---|---|---|
Prompt: Fast beat, hip hop, upbeat that has a positive vibe.
Stable Audio Open | Stable Audio 2.0 | MusicGen-large-stereo |
---|---|---|
Prompt: An instrumental song which employs a worldbeat element through its eerie percussion
Stable Audio Open | Stable Audio 2.0 | MusicGen-large-stereo |
---|---|---|
Recent work has examined the potential of generative models to memorize training data, especially for elements that are repeated in the training set. Adhering to principles of responsible model development, we also run a comprehensive study on memorization.
Given the risk of memorizing audio that is repeated within the training set, we start by checking whether our dataset contains repeated data. We embed all our training data using the LAION-CLAP audio encoder and select recordings that are close in this space, based on a manually set threshold. The threshold is set such that the selected recordings correspond to exact replicas. With this process, we identify 3,693 repeated Freesound recordings and 856 repeated FMA recordings.
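The thresholding step above can be sketched as a pairwise similarity search over precomputed embeddings. This is a minimal illustration, not the procedure actually used: it assumes cosine similarity as the closeness measure (the text only says "close in this space"), uses toy 3-dimensional vectors in place of real LAION-CLAP embeddings, and the names `find_repeated` and `toy_embeddings` as well as the 0.99 threshold are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_repeated(embeddings, threshold):
    """Return (id_a, id_b, similarity) for every pair at or above `threshold`."""
    pairs = []
    for (id_a, emb_a), (id_b, emb_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(emb_a, emb_b)
        if sim >= threshold:
            pairs.append((id_a, id_b, sim))
    return pairs


# Toy vectors standing in for real audio embeddings.
toy_embeddings = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_a_copy": [1.0, 0.0, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
}

# ASSUMPTION: 0.99 is a placeholder; the text only says the threshold was set manually.
repeated = find_repeated(toy_embeddings, threshold=0.99)
print(repeated)  # [('clip_a', 'clip_a_copy', 1.0)]
```

The brute-force loop compares every pair, which is quadratic in the number of clips; for a full training set a nearest-neighbor index would likely be needed, and the threshold would have to be tuned by listening, as described above.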
Our methodology is based on comparing our model's generations against the training set in LAION-CLAP space. We then select the top-50 generations that are closest to the training data (the memorization candidates) and listen to them. We listened to memorization candidates generated with prompts from the identified repeated data in our training set, and did not find memorization. We also listened to memorization candidates from 11,000 random prompts from the training set, and did not find memorization. We even listened to memorization candidates from outstanding generations, and again did not find memorization. The most interesting memorization candidates, together with their closest training data, are listed here. In short, after extensive listening we could not find memorization. These are the most interesting candidates from training data prompts:
Generation by our model | Closest training data | Prompt |
---|---|---|
| | link | Scale, clarinet, Asharpmajor, neumann-U87, good-sounds. |
| | link | Disturb, no-signal, tv, noise, radio, high-disturbance, frequency-jam, white-noise. |
| | link | 120, bpm, beat, Drums, blues, loop. |
| | link | Thunder, storm, field-recording, rain. |
| | link | Avant-garde, improv, contemporary classical, instrumental. |
| | link | Piano, modern jazz, minimalism, instrumental. |
| | link | 160-BPM, kick-hat-snare, drumloop. |
| | link | 1000Hz 48k sample rate MP3. |
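The candidate selection used in this study (rank generations by how close they sit to the training set, then keep the closest ones for listening) can be sketched as follows. This is an illustrative sketch under stated assumptions: cosine similarity stands in for the unspecified closeness measure, the 2-dimensional vectors are toy stand-ins for LAION-CLAP embeddings, and `memorization_candidates` is a hypothetical helper name. Only the default of 50 candidates comes from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def memorization_candidates(generated, training, top_k=50):
    """Rank generations by similarity to their closest training item; keep top_k."""
    scored = []
    for gen_id, gen_emb in generated.items():
        # Find the single closest training item for this generation.
        closest_id, closest_sim = max(
            ((train_id, cosine_similarity(gen_emb, train_emb))
             for train_id, train_emb in training.items()),
            key=lambda pair: pair[1],
        )
        scored.append((gen_id, closest_id, closest_sim))
    # Most similar first: these are the ones worth listening to.
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:top_k]


# Toy embeddings (ASSUMPTION: real ones would come from an audio encoder).
training = {"train_0": [1.0, 0.0], "train_1": [0.0, 1.0]}
generated = {
    "gen_0": [0.9, 0.1],
    "gen_1": [0.5, 0.5],
    "gen_2": [0.0, 2.0],
}

top = memorization_candidates(generated, training, top_k=2)
for gen_id, train_id, sim in top:
    print(gen_id, train_id, round(sim, 3))
```

The ranking only proposes candidates; as in the study above, deciding whether a candidate is actually memorized still requires listening to it next to its closest training item.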
This comparison is useful for evaluating the audio fidelity of the autoencoder. On the left is the ground truth recording. On the right, we take the ground truth recording and pass it through each of these autoencoders or neural audio codecs.
Ground truth | Stable Audio Open | Stable Audio 2.0 | DAC |
---|---|---|---|