OAI Whisper transcribes correctly but whisperx returns "No active speech found in audio"
#844
Comments
I have noticed a pattern: in videos with this issue, there is an initial section of clear speech, followed by loud noises. If I cut out the final noisy section, transcription works correctly. So, for some reason, later noise prevents earlier speech from being transcribed?
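The trimming workaround described above can be sketched with Python's stdlib wave module. The file names and cut point here are hypothetical; in practice you would place the cut just before the noisy tail:

```python
import wave

def trim_wav(src: str, dst: str, keep_seconds: float) -> None:
    """Copy the first keep_seconds of a WAV file, dropping the noisy tail."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        # Never read past the end of the file.
        n_frames = min(r.getnframes(), int(keep_seconds * r.getframerate()))
        frames = r.readframes(n_frames)
    with wave.open(dst, "wb") as w:
        w.setparams(params)  # nframes is corrected automatically on close
        w.writeframes(frames)

# Hypothetical usage: keep only the first 8 seconds of clear speech.
# trim_wav("input.wav", "trimmed.wav", keep_seconds=8.0)
```

This only works for uncompressed WAV input; audio extracted from a video container would first need to be decoded (e.g. with ffmpeg) to WAV.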
I have the same problem when using the new batched backend for faster_whisper, so perhaps batching is at fault here. This is despite the problematic videos being too short for batching (e.g., 8 s).
Good digging into the issue, thanks.
As I mentioned in my comment on the faster_whisper PR, I have the same problem when enabling batching in faster_whisper; the issue disappears when not using the batched pipeline.
I found a video on the internet that reproduces this problem. Audio: https://litter.catbox.moe/kyu2q8.wav
@reasv can you reupload the video to a permanent storage and share the link? |
https://drive.google.com/file/d/1JKsYQZYQDrKuRFciFhh1aA5ftAGr-eud/view?usp=sharing — this video reproduces the problem.
Hi,
The problem is with the pyannote VAD model; SYSTRAN/faster-whisper#936 is a possible solution, but you have to use faster-whisper for transcription.
I'm getting poor transcription results from whisperx: some short videos yield no transcription at all, whereas OpenAI's official Whisper model transcribes them correctly.
On the OpenAI side, I am using their official HF Space (https://huggingface.co/spaces/openai/whisper), which uses large-v3. On the whisperx side, I am using Systran/faster-whisper-large-v3 for comparison, with the latest whisperx from GitHub and PyTorch with CUDA on Windows 11 (on an RTX 4090).
Here's the code for the simple Gradio UI I use for testing whisperx:
https://github.com/reasv/panoptikon/blob/master/src/ui/test_models/whisper.py
The transcription function itself is very simple. I am only testing with audio file paths at the moment, so assume audio_file is populated and audio_tuple is not. The audio seems to be loaded correctly from the video file, since I can listen to the extracted audio output in the Gradio Audio component.
This is the output I get: some videos work as expected, while for others I get "No active speech found", even though the speech seems relatively clear (and in English). At the moment I cannot share an example video, since the problem occurs with personal videos and I haven't yet found a publicly available video that reproduces it.
Any ideas why this is happening?