
OAI Whisper transcribes correctly but whisperx returns No active speech found in audio #844

Open
reasv opened this issue Jul 20, 2024 · 9 comments


reasv commented Jul 20, 2024

I'm getting poor transcription results with whisperx: for some short videos I get no transcription at all, whereas OpenAI's official whisper model transcribes them correctly.

On the OpenAI side, I am using their official HF Space (https://huggingface.co/spaces/openai/whisper) which employs large-v3.

On the whisperx side, I am using Systran/faster-whisper-large-v3 for comparison, with the latest whisperx from github, and pytorch with CUDA on Windows 11 (on an RTX 4090).

Here's the code for the simple gradio UI I use for testing whisperx:
https://github.com/reasv/panoptikon/blob/master/src/ui/test_models/whisper.py
The transcription function is very simple:

from typing import Tuple

import numpy as np


def transcribe_audio(
    model_repo: str | None,
    language: str | None,
    batch_size: int,
    audio_tuple: Tuple[int, np.ndarray] | None,
    audio_file: str | None,
) -> Tuple[str, Tuple[int, np.ndarray] | None]:
    if model_repo is None:
        return "[No model selected]", None
    print(f"Transcribing audio with model: {model_repo} and language: {language}")

    import torch
    import whisperx

    sample_rate, audio = (
        audio_tuple if audio_tuple is not None else (None, None)
    )

    # `if audio:` would raise on a numpy array ("truth value of an array
    # is ambiguous"), so test against None explicitly.
    if audio is not None:
        print(f"Sample rate: {sample_rate}")

    if audio is None and audio_file is not None:
        # load_audio decodes via ffmpeg and resamples to 16 kHz mono float32
        audio = whisperx.load_audio(audio_file)

    if audio is None:
        return "[No audio provided]", None

    device = "cuda" if torch.cuda.is_available() else "cpu"

    whisper_model = whisperx.load_model(
        model_repo,
        device=device,
        language=language,
    )

    result = whisper_model.transcribe(
        audio,
        batch_size=batch_size,
        language=language,
    )
    print(result)
    merged_text = "\n".join(segment["text"] for segment in result["segments"])
    return merged_text, (whisperx.audio.SAMPLE_RATE, audio)

I am only testing it with audio file paths at the moment, so assume audio_file is populated and audio_tuple is None.
The audio seems to be loaded correctly from the video file, since I can listen to the extracted audio through the gradio Audio component.
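
For completeness, a quick sanity check on what load_audio actually returns can rule out a silent or mis-decoded track (the clip path below is a placeholder):

import numpy as np
import whisperx

# load_audio decodes via ffmpeg to 16 kHz mono float32 in [-1, 1]
audio = whisperx.load_audio("clip.mp4")  # placeholder path
duration = len(audio) / whisperx.audio.SAMPLE_RATE
print(f"duration: {duration:.1f}s, peak: {np.abs(audio).max():.3f}, "
      f"rms: {np.sqrt(np.mean(audio ** 2)):.4f}")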

This is the output I get:

Transcribing audio with model: Systran/faster-whisper-large-v3 and language: en

Q:\projects\panoptikon\.venv\Lib\site-packages\pyannote\audio\core\io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.3.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\[]\.cache\torch\whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.1+cu121. Bad things might happen unless you revert torch to 1.x.
No active speech found in audio
{'segments': [], 'language': 'en'}

Some videos work as expected; for others I get "No active speech found" even though the speech seems relatively clear (and in English).

At the moment I cannot give an example of a video that causes this problem as it's happening with personal videos and I haven't found publicly available videos that reproduce the issue yet.

Any ideas of why this is happening?
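
One avenue worth testing is loosening the pyannote VAD thresholds that whisperx applies before transcription. A minimal sketch, assuming load_model still accepts a vad_options dict with vad_onset/vad_offset keys (names taken from whisperx's defaults of roughly 0.500/0.363; the values below are illustrative, not recommendations):

import whisperx

# Lower the onset/offset thresholds so the VAD is less eager to discard
# segments; keys assumed from whisperx's default_vad_options -- verify
# against the installed version.
model = whisperx.load_model(
    "Systran/faster-whisper-large-v3",
    device="cuda",
    language="en",
    vad_options={"vad_onset": 0.3, "vad_offset": 0.25},
)
audio = whisperx.load_audio("clip.mp4")  # placeholder path
result = model.transcribe(audio, batch_size=4, language="en")
print(result["segments"])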


reasv commented Jul 20, 2024

I have noticed a pattern: In a video that has this issue, there is a first part of clear speech, and then loud noises. If I cut the last part with loud noises out, the transcription works correctly.

So, for some reason later noise prevents earlier speech from being transcribed?
Still, OpenAI Whisper transcribes everything correctly, including speech during the part with loud noises.
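
One way to test this systematically is to trim the clip at different points before transcribing; a sketch shelling out to ffmpeg (the cut point and filenames are placeholders):

import subprocess

# Keep only the first 5 seconds, downmixed to 16 kHz mono WAV
# (the same format whisperx.load_audio would produce).
subprocess.run(
    ["ffmpeg", "-y", "-i", "clip.mp4", "-t", "5",
     "-vn", "-ac", "1", "-ar", "16000", "trimmed.wav"],
    check=True,
)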


reasv commented Jul 20, 2024

I have the same problem when using the new batched backend for faster_whisper, so perhaps batching is at fault here. This is despite the problematic videos being too short for batching to matter (e.g., 8 seconds).
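
For anyone reproducing this, a minimal A/B comparison, assuming the batched backend is exposed as BatchedInferencePipeline (as in later faster-whisper releases):

from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Sequential pipeline: transcribes the problem clips correctly.
segments, _ = model.transcribe("clip.wav", language="en")
print("sequential:", [s.text for s in segments])

# Batched pipeline: returns no segments on the same clips.
batched = BatchedInferencePipeline(model=model)
segments, _ = batched.transcribe("clip.wav", batch_size=8, language="en")
print("batched:", [s.text for s in segments])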


BBC-Esq commented Jul 20, 2024

> I have noticed a pattern: In a video that has this issue, there is a first part of clear speech, and then loud noises. If I cut the last part with loud noises out, the transcription works correctly.
>
> So, for some reason later noise prevents earlier speech from being transcribed? Still, OpenAI Whisper transcribes everything correctly, including speech during the part with loud noises.

Good delving into the issue, thanks.


reasv commented Jul 20, 2024

As I mentioned in my comment on the faster_whisper batching PR (SYSTRAN/faster-whisper#856), I have the same problem when enabling batching there, but the issue disappears when using the non-batched pipeline.


reasv commented Jul 20, 2024

I found a video on the internet that replicates this problem. Audio: https://litter.catbox.moe/kyu2q8.wav

@MahmoudAshraf97 (Contributor) commented

@reasv can you reupload the video to permanent storage and share the link?


ncuxzy commented Aug 19, 2024

> @reasv can you reupload the video to permanent storage and share the link?

This video replicates the problem: https://drive.google.com/file/d/1JKsYQZYQDrKuRFciFhh1aA5ftAGr-eud/view?usp=sharing

@seanco-hash commented

Hi, has anyone found a solution?
I noticed that the older medium model works fine, but the newer medium and large-v3 models have this problem.

@MahmoudAshraf97 (Contributor) commented

> Hi, has anyone found a solution? I noticed that the older medium model works fine, but the newer medium and large-v3 models have this problem.

The problem is with the pyannote VAD model. SYSTRAN/faster-whisper#936 is a possible solution, but you have to use faster-whisper for transcription.
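
A sketch of that workaround: plain faster-whisper with its built-in Silero VAD filter in place of the pyannote model (the parameter value is illustrative):

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# vad_filter uses Silero VAD rather than pyannote, sidestepping the
# "No active speech found" behaviour reported in this thread.
segments, info = model.transcribe(
    "clip.wav",
    language="en",
    vad_filter=True,
    vad_parameters={"min_silence_duration_ms": 500},
)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")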
