
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Merged (145 commits) on Jul 18, 2024

Conversation

Jiltseb
Contributor

@Jiltseb Jiltseb commented May 24, 2024

Hello everyone,

This PR adds a major update to Faster Whisper, bringing both speed and quality improvements!

Speed improvements:

  • Batching support: Inspired by whisper-x, this update introduces batching, allowing for roughly a 3x speed increase. The implementation builds on whisper-x and supports additional run-time arguments and external VAD segments. The batched version now runs at 64x real-time speed, compared to the previous 20x.

  • Faster feature extraction: We've incorporated a torchaudio-based parallel STFT as an alternative to the current implementation from transformers, providing an additional speed boost. With the enable_ta_fe flag, the final version achieves an impressive 104x real-time speed, up to 12.5x faster on average than the OpenAI implementation (a rough sketch of this kind of torchaudio feature extraction follows below).
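
For context, here is a minimal sketch of what torchaudio-based log-mel feature extraction for Whisper-style models generally looks like. This is illustrative only and not the exact code added in this PR (the PR exposes its implementation via the enable_ta_fe flag):

import torch
import torchaudio

# Whisper-style log-mel features: 16 kHz audio, 25 ms windows, 10 ms hop, 80 mel bins
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,
    n_mels=80,
)

def log_mel_features(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (channels, num_samples) tensor at 16 kHz
    mel = mel_transform(waveform)
    log_mel = torch.clamp(mel, min=1e-10).log10()
    # Whisper-style dynamic range compression and scaling
    log_mel = torch.maximum(log_mel, log_mel.max() - 8.0)
    return (log_mel + 4.0) / 4.0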

Using the batched version is straightforward:

from faster_whisper import WhisperModel, BatchedInferencePipeline

# load the faster-whisper model in the usual way
model = WhisperModel("medium", device="cuda", compute_type="float16")

# wrap it in the batched pipeline
batched_model = BatchedInferencePipeline(model=model)

# transcribe with the batched model; returns (segments, info) like the sequential API
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

Quality Improvements

  1. Consistency across runs: By setting the model seed, consistency across runs is improved.
  2. Reducing hallucinations: Stricter checks in the inference pipeline reduce unstructured or repeated phrases.
  3. Reliable language detection: A new function detects language more reliably by considering highly confident and random segments, breaking ties to determine the major language.
  4. Code-switching support: Handles audio with multiple languages by detecting the language every 30 seconds and dynamically directing the data flow. Since the exact switching position is unknown, the detected boundary can be off by up to one 30-second segment.

Language detection Usage:

from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
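
A minimal follow-up sketch, assuming the returned object exposes a detected language code (the exact field name may differ between versions):

# Inspect the detection result; its exact structure may vary between versions.
print(language_info)

# Hypothetical follow-up: pass the detected language to transcription so the model
# does not need to re-detect it. Replace "language_code" with the actual key
# returned by detect_language_multi_segment in your installed version.
segments, info = model.transcribe("audio.mp3", language=language_info["language_code"])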

Benchmarking:

A. Open source benchmarking:

Open_asr_eval consists solely of short-form audio, with an average duration generally under 10 seconds. Hence, using a subset of the YouTube-Commons dataset, we've tested more complex use cases with long-form audio. The Whisper-medium model is used (with batch size 8 for the batched versions) in these experiments. The dataset card of youtube-commons-asr-eval is mobiuslabsgmbh/youtube-commons-asr-eval.

Speed (x real-time):

System                 | GPU speed | CPU speed
OpenAI Whisper         | 8.2x      | 4.5x
faster-whisper         | 20.1x     | 5.6x
HF Whisper (batched)   | 59.3x     | 8.4x
Batched Faster-Whisper | 104x      | 14.6x

WER:

System                 | WER
OpenAI Whisper         | 15.1
faster-whisper         | 14.6
HF Whisper (batched)   | 16.8
Batched Faster-Whisper | 13.1

B. Internal dataset:

Since the transcriptions in the open-source dataset are unverified, they can contain various types of errors. Additional internal benchmarking ensures robustness across various scenarios. A smaller test set (84 minutes) with verified ground truth is used to verify transcription quality and speed. The test set contains 9 audio files ranging from 3 to 13 minutes and covering various audio types.

System                 | WER | Speed
OpenAI Whisper         | 6.8 | 9.1x
faster-whisper         | 6.1 | 17.4x
HF Whisper (batched)   | 8.2 | 42.8x
Batched Faster-Whisper | 6.5 | 86.6x

Batched processing speeds up long-form audio without increasing WER. Users can easily switch between the sequential and batched Faster Whisper versions based on their specific requirements, as sketched below.
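
For example (reusing the model and batched_model objects from the snippet above), switching between the two is just a matter of which object you call transcribe on:

# Sequential (original) pipeline
segments, info = model.transcribe("audio.mp3")

# Batched pipeline: the same underlying model wrapped in BatchedInferencePipeline
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)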

Thank you in advance!

Acknowledgements

This work was done at Mobiuslabs GmbH. Contact Dr. Jilt Sebastian for any queries or requests.

Jiltseb added 30 commits June 9, 2023 13:52
PR: Changes to faster-whisper project for asr v2.1 based on latest faster_whisper (0.9.0)
SDK v3.0 does not work with latest numpy version (1.26.0) and faster whisper won't work if numpy <1.21.6
Updating the base faster-whisper to 0.10.0
Support for Batched inference and language detection from multiple segments in faster-whisper
Updating the base directory
@hahazei

hahazei commented Jul 18, 2024

It's amazing that this has finally been merged. Thank you.

@hobodrifterdavid

hobodrifterdavid commented Jul 20, 2024

@Jiltseb @MahmoudAshraf97 Thanks for the tips, I wrote a bit of code to split segments using word timings. It seems almost 2x faster when using without_timestamps=True.
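
For reference, a minimal sketch of this kind of word-timing based splitting; this is illustrative only, not the commenter's actual code, and max_duration is a hypothetical threshold. It assumes word_timestamps=True so each segment carries a .words list with .start/.end/.word attributes:

def split_by_words(segments, max_duration=10.0):
    # Re-chunk transcription output at word boundaries so that no chunk
    # exceeds max_duration seconds; yields (start, end, text) tuples.
    for segment in segments:
        chunk, chunk_start = [], None
        for word in segment.words or []:
            if chunk_start is None:
                chunk_start = word.start
            chunk.append(word)
            if word.end - chunk_start >= max_duration:
                yield chunk_start, word.end, "".join(w.word for w in chunk)
                chunk, chunk_start = [], None
        if chunk:
            yield chunk_start, chunk[-1].end, "".join(w.word for w in chunk)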

By the way, I have noticed some overlapping segments in the output in batched mode. I don't recall seeing these previously, but I may be wrong. In fact, the first segment here starts after the second one:

[screenshot: segment list where the first segment's start time is later than the second's]

@BBC-Esq
Contributor

BBC-Esq commented Jul 20, 2024

Congratulations on the successful merge! I'll be doing benchmarks with WhisperX and WhisperS2T for everyone's edification, like I did before.

@reasv

reasv commented Jul 20, 2024

Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching.
So, results seem to suffer.

I experienced the same with whisperx:
m-bain/whisperX#844

With some videos, it seems like the original OpenAI Whisper, as well as faster_whisper without batching, give correct results while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly.
I'm not the only one with this issue:
m-bain/whisperX#828

It's weird because the videos I have issues with are too short for batching anyways (~8 seconds for example) while I believe by default batching works on 30 second segments?

@stri8ed

stri8ed commented Jul 25, 2024

Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching. So, results seem to suffer.

I experienced the same with whisperx: m-bain/whisperX#844

With some videos, it seems like the original OpenAI Whisper, as well as faster_whisper without batching, give correct results while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly. I'm not the only one with this issue: m-bain/whisperX#828

It's weird because the videos I have issues with are too short for batching anyways (~8 seconds for example) while I believe by default batching works on 30 second segments?

The batching implementation uses a different VAD model. That is likely the cause.

@Jiltseb
Contributor Author

Jiltseb commented Jul 26, 2024

The time-stamp-related issue will be solved in the follow-up PR: #921

Batching works on 30-second segments as well (batch size will be 1), but the VAD model is different hence a different set of parameters might be needed for your use case. We are comparing the performance with Silero VAD, there will be a PR to replace the VAD if trade-offs go in favour of it.
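
For reference, tuning the VAD in the sequential pipeline looks roughly like this (a hedged sketch using the documented vad_filter / vad_parameters options; the batched pipeline's VAD is a different model, so its parameters and defaults differ):

# Sequential pipeline with Silero VAD filtering; lowering the threshold makes the
# VAD more permissive on quiet or very short clips.
segments, info = model.transcribe(
    "short_clip.wav",
    vad_filter=True,
    vad_parameters=dict(threshold=0.3, min_silence_duration_ms=500),
)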

@jordimas
Contributor

The time-stamp-related issue will be solved in the follow-up PR: #921

Batching works on 30-second segments as well (batch size will be 1), but the VAD model is different hence a different set of parameters might be needed for your use case. We are comparing the performance with Silero VAD, there will be a PR to replace the VAD if trade-offs go in favour of it.

Thanks, this is an important topic. Having two VAD libraries for different code paths should be avoided, both for consistency and to limit the number of external models we depend on. Should we open a separate ticket for this?

@Jiltseb
Contributor Author

Jiltseb commented Jul 26, 2024

@MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.

@jordimas
Contributor

@MahmoudAshraf97 is working on this and will open a PR soon, but you can create an issue if you like.

A PR is better than a ticket, thanks!

aligokalppeker added a commit to aligokalppeker/faster-whisper that referenced this pull request Jul 29, 2024
@mru4913

mru4913 commented Aug 8, 2024

Is this released yet? I have checked version 1.0.3, which does not have batched inference yet.

@Jiltseb
Contributor Author

Jiltseb commented Aug 8, 2024

No, it's not released. You can use it with pip install git+https://github.com/SYSTRAN/faster-whisper.git for the time being.

@ncuxzy

ncuxzy commented Aug 19, 2024

Comparing the output of this new batched version to the previous non-batched pipeline, I find that on some videos it fails to detect and transcribe audio unless I disable batching. So, results seem to suffer.
I experienced the same with whisperx: m-bain/whisperX#844
With some videos, it seems like the original OpenAI Whisper, as well as faster_whisper without batching, give correct results while whisperx and faster_whisper with batching output nothing, failing to detect speech correctly. I'm not the only one with this issue: m-bain/whisperX#828
It's weird because the videos I have issues with are too short for batching anyways (~8 seconds for example) while I believe by default batching works on 30 second segments?

The batching implementation uses a different VAD model. That is likely the cause.

I don't think so. The whisperX process first performs VAD detection and then batch-processes the segmented results, so VAD is not used in the batch processing stage itself. And if I denoise the audio first, this problem does not occur.

shinlw added a commit to shinlw/faster-whisper that referenced this pull request Sep 6, 2024
@toanhuynhnguyen

I get an issue with the batched faster-whisper:

I use the "batched_model.transcribe" function:

from faster_whisper import BatchedInferencePipeline, WhisperModel

model = WhisperModel(
    model_size_or_path="large-v3",
    device="cuda",
    compute_type="float16",
)

batched_model = BatchedInferencePipeline(model=model)

segments, info = batched_model.transcribe(
    audio="file_a.wav" ,
    without_timestamps=True, # Change this value to True or False to test.
    word_timestamps=True,
    beam_size=1,
    task="transcribe",
    language="en",
    batch_size=5,
)

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

I have 2 audio files, file_a.wav and file_b.wav:

  1. Transcribing file_a.wav:
  • If I set without_timestamps=True for the "batched_model.transcribe" function, it transcribes all sentences in the audio file (my expectation).
  • If I set without_timestamps=False, it misses a few sentences in the audio file.
  2. Transcribing file_b.wav:
  • If I set without_timestamps=True, it misses a few sentences in the audio file.
  • If I set without_timestamps=False, it transcribes all sentences in the audio file (my expectation).

The argument "without_timestamps" does not behave consistently. How can I resolve this issue? Thank you so much.

@just-maiyak

I have the same issue as @toanhuynhnguyen: there are many missing segments in my use case with the batched + without_timestamps=False and non-batched + without_timestamps=True combinations.

@brandonattruenation

This works perfectly for me, using timestamps + batched

@toanhuynhnguyen

I have the same issue as @toanhuynhnguyen: there are many missing segments in my use case with the batched + without_timestamps=False and non-batched + without_timestamps=True combinations.
You need to test more!

Jiltseb added a commit to mobiusml/faster-whisper that referenced this pull request Oct 8, 2024
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements (SYSTRAN#856)

Batching Support, Speed Boosts, and Quality Enhancements

---------

Co-authored-by: Hargun Mujral <[email protected]>
Co-authored-by: MahmoudAshraf97 <[email protected]>
@eschmidbauer

That repo appears to have a lot of fixes. Would it be possible to get them merged into this repo?
