New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856
Conversation
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with latest numpy version (1.26.0) and faster whisper won't work if numpy <1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
It's wonderful that the merge has gone through. Thank you.
@Jiltseb @MahmoudAshraf97 Thanks for the tips, I wrote a bit of code to split segments using word timings. It seems almost 2x faster when using without_timestamps=True. By the way, I have noticed some overlapping segments in the output in batched mode; I don't recall seeing these previously, but I may be wrong. In fact, the first segment in my output starts after the second one.
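For reference, a minimal sketch of this kind of word-timing-based splitting; the split_by_words helper and the 3-second threshold are illustrative, not the commenter's actual code:

from faster_whisper import BatchedInferencePipeline, WhisperModel

def split_by_words(segment, max_len=3.0):
    """Yield (start, end, text) chunks of at most ~max_len seconds,
    cut at word boundaries using word-level timestamps."""
    words, chunk_start = [], None
    for word in segment.words:
        if chunk_start is None:
            chunk_start = word.start
        words.append(word)
        if word.end - chunk_start >= max_len:
            yield chunk_start, word.end, "".join(w.word for w in words)
            words, chunk_start = [], None
    if words:
        yield chunk_start, words[-1].end, "".join(w.word for w in words)

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe(
    "audio.wav", word_timestamps=True, without_timestamps=True, batch_size=8
)
for segment in segments:
    for start, end, text in split_by_words(segment):
        print(f"[{start:.2f}s -> {end:.2f}s] {text.strip()}")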
Congratulations on the successful merge! I'll be doing benchmarks with WhisperX and WhisperS2T for everyone's edification, like I did before.
Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching. I experienced the same with whisperx: on some videos, the original OpenAI Whisper, as well as faster_whisper without batching, gives correct results, while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly. It's weird because the videos I have issues with are too short for batching anyway (~8 seconds, for example), while I believe batching works on 30-second segments by default?
The batching implementation uses a different VAD model. That is likely the cause.
The timestamp-related issue will be solved in the follow-up PR: #921. Batching works on 30-second segments as well (the batch size will be 1), but the VAD model is different, hence a different set of parameters might be needed for your use case. We are comparing the performance with Silero VAD; there will be a PR to replace the VAD if the trade-offs go in its favour.
Thanks, this is an important topic. Having two VAD libraries for different code paths should be avoided, for consistency and to limit the number of external models we depend on. Should we open a separate ticket for this?
@MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.
PR better than ticket, thanks!
Revert "New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements (SYSTRAN#856)" This reverts commit eb83902.
Is this released yet? I checked version 1.0.3, which does not have batched inference yet.
No, it's not released. You can use it with pip install git+https://github.com/SYSTRAN/faster-whisper.git for the time being.
I don't think so. The whisperX pipeline first performs VAD and then batch-processes the segmented results, so VAD is not used during the batch-processing stage. And if I denoise the audio first, the problem does not occur.
I get an issue with batched faster-whisper when using batched_model.transcribe:

from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel(
    model_size_or_path="large-v3",
    device="cuda",
    compute_type="float16",
)
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe(
    audio="file_a.wav",
    without_timestamps=True,  # Change this value to True or False to test.
    word_timestamps=True,
    beam_size=1,
    task="transcribe",
    language="en",
    batch_size=5,
)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

I have 2 audio files: file_a.wav and file_b.wav. The without_timestamps argument does not work consistently. How can I resolve this issue? Thank you so much.
I have the same issue as @toanhuynhnguyen: there are many missing segments in my use case with the batched + without_timestamps=False and non-batched + without_timestamps=True combinations.
Implement changes in review request
This works perfectly for me, using timestamps + batched.
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements (SYSTRAN#856) Batching Support, Speed Boosts, and Quality Enhancements --------- Co-authored-by: Hargun Mujral <[email protected]> Co-authored-by: MahmoudAshraf97 <[email protected]>
That repo appears to have a lot of fixes. Would it be possible to get them merged into this repo?
Hello everyone,
This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!
Speed improvements:
Batching support: Inspired by whisper-x, this update introduces batching support, allowing for a 3x speed increase. This implementation builds on whisper-x and supports more run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.
Faster feature extraction: We've incorporated a torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed. This is up to 12.5x faster on average than the OpenAI implementation! Using the batched version is straightforward; see the sketch below.
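A minimal sketch of batched usage, mirroring the BatchedInferencePipeline calls shown elsewhere in this thread; the model size, audio file, and batch_size are illustrative choices:

from faster_whisper import BatchedInferencePipeline, WhisperModel

# Load the model once, then wrap it in the batched pipeline.
model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size controls how many ~30-second chunks are decoded in parallel.
segments, info = batched_model.transcribe("audio.mp3", batch_size=8)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))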
Quality Improvements
Language detection usage:
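A hedged sketch of multi-segment language detection; the argument names language_detection_segments and language_detection_threshold, and their values, are assumptions based on the feature this PR describes:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# Assumed arguments: sample several 30-second segments for language
# detection instead of only the first, accepting a language once its
# probability clears the threshold.
segments, info = model.transcribe(
    "audio.mp3",
    language_detection_segments=4,
    language_detection_threshold=0.5,
)
print(info.language, info.language_probability)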
Benchmarking:
A. Open source benchmarking:
Open_asr_eval consists solely of short-form audio, with an average duration generally under 10 seconds. Hence, using a subset of the YouTube-Commons dataset, we've tested more complex use cases with long-form audio. The Whisper-medium model is used for the experiments (with batch size = 8 for the batched versions). The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.
Speed (x real-time):
WER:
B. Internal dataset:
Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. It contains nine audio files ranging from 3 to 13 minutes, covering various audio types.
Batched processing speeds up long-form audio without increasing WER. Users can easily switch between the sequential and batched Faster Whisper versions based on their specific requirements.
Thank you in advance!
Acknowledgements
This work was done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.