New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856
Conversation
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with latest numpy version (1.26.0) and faster whisper won't work if numpy <1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
It's wonderful that the merge has gone through. Thank you.
@Jiltseb @MahmoudAshraf97 Thanks for the tips, I wrote a bit of code to split segments using word timings. It seems almost 2x faster when using without_timestamps=True. By the way, I have noticed some overlapping segments in the output in batched mode; I don't recall seeing these previously, but I may be wrong. In fact, the first segment in my output starts after the second one.
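For reference, a minimal sketch of this kind of word-timing-based splitting; the split_by_words helper and the 3-second threshold are illustrative, not the commenter's actual code:

from faster_whisper import BatchedInferencePipeline, WhisperModel

def split_by_words(segment, max_len=3.0):
    """Yield (start, end, text) chunks of at most ~max_len seconds,
    cut at word boundaries using word-level timestamps."""
    words, chunk_start = [], None
    for word in segment.words:
        if chunk_start is None:
            chunk_start = word.start
        words.append(word)
        if word.end - chunk_start >= max_len:
            yield chunk_start, word.end, "".join(w.word for w in words)
            words, chunk_start = [], None
    if words:
        yield chunk_start, words[-1].end, "".join(w.word for w in words)

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe(
    "audio.wav", word_timestamps=True, without_timestamps=True, batch_size=8
)
for segment in segments:
    for start, end, text in split_by_words(segment):
        print(f"[{start:.2f}s -> {end:.2f}s] {text.strip()}")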
Congratulations on the successful merge! I'll be doing benchmarks with WhisperX and WhisperS2T for everyone's edification, like I did before.
Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching. I experienced the same with whisperx: on some videos, the original OpenAI Whisper, as well as faster_whisper without batching, gives correct results, while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly. It's weird because the videos I have issues with are too short for batching anyway (~8 seconds, for example), while I believe batching works on 30-second segments by default?
The batching implementation uses a different VAD model. That is likely the cause.
The timestamp-related issue will be solved in the follow-up PR: #921. Batching works on 30-second segments as well (the batch size will be 1), but the VAD model is different, hence a different set of parameters might be needed for your use case. We are comparing the performance with Silero VAD; there will be a PR to replace the VAD if the trade-offs go in its favour.
Thanks, this is an important topic. Having two VAD libraries for different code paths should be avoided, for consistency and to limit the number of external models we depend on. Should we open a separate ticket for this?
@MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.
PR better than ticket, thanks!
Revert "New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements (SYSTRAN#856)" This reverts commit eb83902.
Is this released yet? I checked version 1.0.3, which does not have batched inference yet.
No, it's not released. You can use it with pip install git+https://github.com/SYSTRAN/faster-whisper.git for the time being.
I don't think so. The whisperX pipeline first performs VAD and then batch-processes the segmented results, so VAD is not used during the batch-processing stage. And if I denoise the audio first, the problem does not occur.
I get an issue with batched faster-whisper when using batched_model.transcribe:

from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel(
    model_size_or_path="large-v3",
    device="cuda",
    compute_type="float16",
)
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe(
    audio="file_a.wav",
    without_timestamps=True,  # Change this value to True or False to test.
    word_timestamps=True,
    beam_size=1,
    task="transcribe",
    language="en",
    batch_size=5,
)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

I have 2 audio files: file_a.wav and file_b.wav. The without_timestamps argument does not work consistently. How can I resolve this issue? Thank you so much.
I have the same issue as @toanhuynhnguyen: there are many missing segments in my use case with the batched + without_timestamps=False and non-batched + without_timestamps=True combinations.
Implement changes in review request
This works perfectly for me, using timestamps + batched.
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements (SYSTRAN#856) Batching Support, Speed Boosts, and Quality Enhancements --------- Co-authored-by: Hargun Mujral <[email protected]> Co-authored-by: MahmoudAshraf97 <[email protected]>
That repo appears to have a lot of fixes. Would it be possible to get them merged into this repo?
Hello everyone,
This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!
Speed improvements:
Batching support: Inspired by whisper-x, this update introduces batching support, allowing for a 3x speed increase. This implementation builds on whisper-x and supports more run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.
Faster feature extraction: We've incorporated a torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing additional speed boosts. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed. This is up to 12.5x faster on average than the OpenAI implementation! Using the batched version is straightforward; see the sketch below.
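A minimal sketch of batched usage, mirroring the BatchedInferencePipeline calls shown elsewhere in this thread; the model size, audio file, and batch_size are illustrative choices:

from faster_whisper import BatchedInferencePipeline, WhisperModel

# Load the model once, then wrap it in the batched pipeline.
model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# batch_size controls how many ~30-second chunks are decoded in parallel.
segments, info = batched_model.transcribe("audio.mp3", batch_size=8)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))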
Quality Improvements
Language detection usage:
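A hedged sketch of multi-segment language detection; the argument names language_detection_segments and language_detection_threshold, and their values, are assumptions based on the feature this PR describes:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

# Assumed arguments: sample several 30-second segments for language
# detection instead of only the first, accepting a language once its
# probability clears the threshold.
segments, info = model.transcribe(
    "audio.mp3",
    language_detection_segments=4,
    language_detection_threshold=0.5,
)
print(info.language, info.language_probability)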
Benchmarking:
A. Open source benchmarking:
Open_asr_eval consists solely of short-form audio, with an average duration generally under 10 seconds. Hence, using a subset of the YouTube-Commons dataset, we've tested more complex use cases with long-form audio. The Whisper-medium model is used for the experiments (with batch size = 8 for the batched versions). The dataset card for youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.
Speed (x real-time):
WER:
B. Internal dataset:
Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. It contains nine audio files ranging from 3 to 13 minutes, covering various audio types.
Batched processing speeds up long-form audio without increasing WER. Users can easily switch between the sequential and batched Faster Whisper versions based on their specific requirements.
Thank you in advance!
Acknowledgements
This work was done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.