Turning on Diarize with Streaming always returns speaker as 0 #108

TechyChan · 2023-03-26T05:50:18Z

I'm using Deepgram with Twilio stream, with the following config:

    this.deepgramLive = deepgram.transcription.live({
      punctuate: true,
      endpointing: true,
      language: "en-US",
      tier: "enhanced",
      model: "phonecall",
      encoding: "mulaw",
      sample_rate: 8000,
      interim_results: false,
      vad_turnoff: 400,
      numbers: true,
      diarize: true,
    });

I'm testing diarization in the hope that I can filter out all speakers other than the primary one, but I always get speaker: 0 in the results. For example, for the following results, I started by saying "Hey Sarah, this is David" and then used a pre-recorded female voice to say "My name is Deborah", but every entry in the words array still has speaker: 0:

{
  transcript: 'Hey, Sarah. This is David.',
  confidence: 0.99848044,
  words: [
    {
      word: 'hey',
      start: 6.64892,
      end: 6.8887997,
      confidence: 0.96410036,
      speaker: 0,
      punctuated_word: 'Hey,'
    },
    {
      word: 'sarah',
      start: 6.8887997,
      end: 7.24862,
      confidence: 0.9139483,
      speaker: 0,
      punctuated_word: 'Sarah.'
    },
    {
      word: 'this',
      start: 7.24862,
      end: 7.4884996,
      confidence: 0.9997607,
      speaker: 0,
      punctuated_word: 'This'
    },
    {
      word: 'is',
      start: 7.4884996,
      end: 7.64842,
      confidence: 0.99981254,
      speaker: 0,
      punctuated_word: 'is'
    },
    {
      word: 'david',
      start: 7.64842,
      end: 8.14842,
      confidence: 0.99848044,
      speaker: 0,
      punctuated_word: 'David.'
    }
  ]
}

{
  transcript: 'My name is Deborah.',
  confidence: 0.9995363,
  words: [
    {
      word: 'my',
      start: 21.959818,
      end: 22.119738,
      confidence: 0.9920925,
      speaker: 0,
      punctuated_word: 'My'
    },
    {
      word: 'name',
      start: 22.119738,
      end: 22.27966,
      confidence: 0.9999344,
      speaker: 0,
      punctuated_word: 'name'
    },
    {
      word: 'is',
      start: 22.27966,
      end: 22.479559,
      confidence: 0.9995363,
      speaker: 0,
      punctuated_word: 'is'
    },
    {
      word: 'deborah',
      start: 22.479559,
      end: 22.979559,
      confidence: 0.95603275,
      speaker: 0,
      punctuated_word: 'Deborah.'
    }
  ]
}
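
For context, once distinct speaker ids do come back, the filtering step described above could look roughly like this. It is a minimal sketch that assumes the `words` array shape shown in the results; `transcriptForSpeaker` and `speakerCounts` are hypothetical helper names, not part of the Deepgram SDK.

```javascript
// Hypothetical helper: rebuild a transcript from the words of one speaker id.
// Assumes each word object has `speaker`, `word`, and (optionally) `punctuated_word`.
function transcriptForSpeaker(words, speakerId) {
  return words
    .filter((w) => w.speaker === speakerId)
    .map((w) => w.punctuated_word ?? w.word)
    .join(" ");
}

// Hypothetical helper: count words per speaker id, which makes it easy to
// confirm whether more than one speaker was detected at all.
function speakerCounts(words) {
  const counts = new Map();
  for (const w of words) {
    counts.set(w.speaker, (counts.get(w.speaker) ?? 0) + 1);
  }
  return counts;
}
```

With the results above, `speakerCounts(words)` would contain a single key, 0, which is exactly the symptom being reported.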

Answered by jpvajda on Jan 25, 2024.

rilhia · 2023-03-26T17:45:29Z

Hi @TechyChan, I just tried this out to see whether I could reproduce the behavior. I did it slightly differently from you: I streamed the audio from a TV show with various characters having conversations. I used the following WebSocket URI and parameters:

wss://api.deepgram.com/v1/listen?language=en&tier=enhanced&model=meeting&diarize=true&interim_results=true&smart_format=true&encoding=linear16&sample_rate=18000&profanity_filter=true

I found that different speakers were identified, but not consistently. The same speaker ID was not assigned to the same speaker throughout the TV show, and even within a scene the speaker ID sometimes switched.

I have raised a similar issue with a potential solution here:

https://github.com/orgs/deepgram/discussions/104

I guess another way of solving your issue would be to send more than one channel from your calls (one for each person) and use the multichannel functionality:

https://developers.deepgram.com/documentation/guides/multichannel-vs-diarization/

This is something I have yet to play around with, but it would be my next attempt at solving this if I were in your position.
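
As a rough illustration of the multichannel idea, the listen URL could be built as below. This is only a sketch: `buildListenUrl` is a hypothetical helper, and the `channels` and `multichannel` parameters are assumed from the multichannel guide linked above, so check them against the current Deepgram documentation. It also assumes the audio actually arrives as two interleaved channels.

```javascript
// Hypothetical helper: build a Deepgram streaming URL from a plain object of
// query parameters. Uses only the built-in URL class.
function buildListenUrl(params) {
  const url = new URL("wss://api.deepgram.com/v1/listen");
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return url.toString();
}

// Assumed parameters for a two-channel mu-law phone stream.
const multichannelUrl = buildListenUrl({
  encoding: "mulaw",
  sample_rate: 8000,
  channels: 2,
  multichannel: true,
});
```

With multichannel, each person is identified by the channel they are on rather than by a diarization speaker id, so it only helps when the speakers are on separate channels to begin with.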

TechyChan (Author) · 2023-03-26T19:42:43Z

Hi @rilhia, thanks for the response! Unfortunately I only have one audio channel from Twilio, so multichannel is not an option for me. The idea is to separate out the primary speaker during the phone call when the phone is in speaker mode and there are background voices. I think I'll probably need to look into other options for diarization, like Azure Speech to Text.

lukeocodes (Maintainer) · 2023-03-26T20:10:40Z

This is slightly different from audio channels, but Twilio stream channels can be separated with a Twilio feature called Single Party Call Recordings. That should allow you to send through only the caller audio, which would be much more resilient.

Even with the best diarization in the world, it will never be 100% accurate if you send both speakers through. From the Twilio documentation:

By default Twilio's voice recordings capture all audio from a call in a single mono-channel file. To separate audio tracks in two channels, you can use recordingChannel=dual
Single Party Call Recordings is a feature that provides flexibility over which parties should be recorded during a call and it allows you to programmatically record only one side of the call.

1 reply

nikolawhallon Apr 17, 2023
Collaborator

That's true for Twilio recording captures, but for live streaming one can get the two channels via something like this in the TwiML:

    <Stream url="wss://your-server" track="both_tracks"/>

(https://www.twilio.com/docs/voice/twiml/stream#attributes-track)
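
On the receiving server, the two tracks can then be told apart per message. The sketch below assumes the Twilio Media Streams message shape, in which `media` events carry `media.track` ("inbound" or "outbound") and a base64 `media.payload`; `audioForTrack` is a hypothetical helper name.

```javascript
// Hypothetical helper: given one raw WebSocket message from a Twilio Media
// Stream, return the decoded audio bytes if it is a "media" event on the
// wanted track, or null for anything else (start/stop events, other track).
function audioForTrack(rawMessage, wantedTrack = "inbound") {
  const msg = JSON.parse(rawMessage);
  if (msg.event !== "media" || !msg.media || msg.media.track !== wantedTrack) {
    return null;
  }
  return Buffer.from(msg.media.payload, "base64");
}
```

In a WebSocket `message` handler, a non-null result would be forwarded to the live transcription connection, so that only the caller's side of the call is transcribed.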

That said, I wonder if the issue is different: you are trying to isolate the actual speaker on the phone from whatever people might be saying in the background, and since there is only one phone/microphone, there are no channels to separate? In that case diarization can help, yes. But if the background speakers are quieter/fainter, or of worse audio quality, I wonder if filtering on confidence might help, or implementing a VAD client-side. (I've found that Deepgram can be frustratingly good at transcribing very quiet speech...)
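
A crude version of both ideas might look like the following. This is a sketch under stated assumptions, not a tested recipe: the thresholds (`minRms`, `minConfidence`) are made-up starting points that would need tuning on real calls, all function names are hypothetical, and the energy gate is a simple RMS check on 8-bit mu-law frames rather than a real VAD.

```javascript
// Decode one 8-bit mu-law byte into a 16-bit linear PCM sample (G.711).
function muLawToLinear(byte) {
  const u = ~byte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Root-mean-square level of one mu-law frame (Buffer or Uint8Array).
function frameRms(frame) {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (const byte of frame) {
    const sample = muLawToLinear(byte);
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Energy gate: true when the frame is at least as loud as the (assumed) threshold.
function isLoudEnough(frame, minRms = 1000) {
  return frameRms(frame) >= minRms;
}

// Confidence filter: keep only words at or above the (assumed) threshold.
function dropLowConfidenceWords(words, minConfidence = 0.95) {
  return words.filter((w) => w.confidence >= minConfidence);
}
```

One design note: replacing quiet frames with mu-law silence (bytes of 0xFF) instead of dropping them should keep the audio timeline continuous, so word timestamps stay aligned with the call.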

jpvajda (Maintainer) · 2024-01-25T00:01:19Z

Closing due to age of issue. If this is still a problem just let us know and we can re-open it.

cmaycumber · 2024-05-23T19:29:41Z

I'm still running into this exact same problem. It makes diarization with live streaming almost unusable for me.

Is the intended behavior to maintain speaker IDs across the whole stream, or is it meant to separate out the speakers only within a single transcription result?

joshpearce07 · 2024-10-30T20:22:12Z

I am still running into the same issue with diarization on streaming audio: it always returns speaker = 0. I am using a lightly modified version of the example code for streaming audio from a microphone.
