OAI Whisper transcribes correctly but whisperx returns "No active speech found in audio"
#844
Comments
I have noticed a pattern: in videos with this issue, there is an initial section of clear speech, followed by loud noises. If I cut out the final noisy section, transcription works correctly. So, for some reason, later noise prevents earlier speech from being transcribed?
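The trimming workaround described above can be sketched with Python's stdlib wave module. The file names and cut point here are hypothetical; in practice you would place the cut just before the noisy tail:

```python
import wave

def trim_wav(src: str, dst: str, keep_seconds: float) -> None:
    """Copy the first keep_seconds of a WAV file, dropping the noisy tail."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        # Never read past the end of the file.
        n_frames = min(r.getnframes(), int(keep_seconds * r.getframerate()))
        frames = r.readframes(n_frames)
    with wave.open(dst, "wb") as w:
        w.setparams(params)  # nframes is corrected automatically on close
        w.writeframes(frames)

# Hypothetical usage: keep only the first 8 seconds of clear speech.
# trim_wav("input.wav", "trimmed.wav", keep_seconds=8.0)
```

This only works for uncompressed WAV input; audio extracted from a video container would first need to be decoded (e.g. with ffmpeg) to WAV.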
I have the same problem when using the new batched backend for faster_whisper, so perhaps batching is at fault here. This is despite the problematic videos being too short for batching (e.g., 8 s).
Good digging into the issue, thanks.
As I mentioned in my comment on the faster_whisper PR, I have the same problem when enabling batching in faster_whisper; the issue disappears when not using the batched pipeline.
I found a video on the internet that reproduces this problem. Audio: https://litter.catbox.moe/kyu2q8.wav
@reasv can you reupload the video to a permanent storage and share the link? |
https://drive.google.com/file/d/1JKsYQZYQDrKuRFciFhh1aA5ftAGr-eud/view?usp=sharing — this video reproduces the problem.
Hi,
The problem is with the pyannote VAD model; SYSTRAN/faster-whisper#936 is a possible solution, but you have to use faster-whisper for transcription.
I'm getting poor transcription results from whisperx: some short videos yield no transcription at all, whereas OpenAI's official Whisper model transcribes them correctly.
On the OpenAI side, I am using their official HF Space (https://huggingface.co/spaces/openai/whisper), which uses large-v3. On the whisperx side, I am using Systran/faster-whisper-large-v3 for comparison, with the latest whisperx from GitHub and PyTorch with CUDA on Windows 11 (on an RTX 4090).
Here's the code for the simple Gradio UI I use for testing whisperx:
https://github.com/reasv/panoptikon/blob/master/src/ui/test_models/whisper.py
The transcription function itself is very simple. I am only testing with audio file paths at the moment, so assume audio_file is populated and audio_tuple is not. The audio seems to be loaded correctly from the video file, since I can listen to the extracted audio output in the Gradio Audio component.
This is the output I get: some videos work as expected, while for others I get "No active speech found", even though the speech seems relatively clear (and in English). At the moment I cannot share an example video, since the problem occurs with personal videos and I haven't yet found a publicly available video that reproduces it.
Any ideas why this is happening?