Silero VAD support #888

3manifold · 2024-09-26T08:32:35Z

Description

Implementation includes:

Extension of WhisperX to accept multiple VAD alternatives that do not necessarily have to come from the pyannote-audio toolkit.
Silero VAD as an alternative VAD option.
Fix in whisperx\__init__.py imports.

The implementation aims to respect the current structure as well as keep the existing functionality intact. It is worth mentioning that the manually-assigned vad_model still works as expected (see load_model for details).
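
As an illustration of the priority rule, here is a minimal sketch of how a loader could pick a VAD. The names `resolve_vad` and `VAD_FACTORIES`, the stub factories, and the `"pyannote"` default are assumptions made for this example and are not taken from the PR; see load_model for the actual logic.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical registry mapping a vad_method name to a factory.
# The factories are stubs; real code would build pyannote or Silero models.
VAD_FACTORIES: Dict[str, Callable[[], Any]] = {
    "pyannote": lambda: "pyannote-vad",
    "silero": lambda: "silero-vad",
}


def resolve_vad(vad_model: Optional[Any] = None, vad_method: str = "pyannote") -> Any:
    """Return the VAD to use; an explicit vad_model wins over vad_method."""
    if vad_model is not None:
        # A manually assigned model has the highest priority.
        return vad_model
    try:
        factory = VAD_FACTORIES[vad_method]
    except KeyError:
        raise ValueError(f"Invalid vad_method: {vad_method!r}") from None
    return factory()
```

With this shape, adding another VAD backend only requires registering one more factory, and callers that already pass vad_model are unaffected.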

See the related issue for further details. Resolves #889.

Tests

pyannote and silero cases were both tested on CPU and GPU setups without issues (the current Silero VAD implementation uses only the CPU).
Also tested with a manually assigned vad_model (a manually assigned vad_model has higher priority than vad_method; see the load_model function for details).
Tests were conducted using .wav files of various lengths (30s, 15min, 1hr).

Example command line (also applies to `--vad_method pyannote`):

GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero
CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

Example Python script usage:

import whisperx
import gc

device = "cpu"
audio_file = "audio.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "int8" # "int8" suits CPU; "float16" is the usual choice on GPU (int8 may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("small", device, vad_method="silero", compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc, torch; del model; gc.collect(); torch.cuda.empty_cache()

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc, torch; del model_a; gc.collect(); torch.cuda.empty_cache()

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token="xxx", device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

output:

python3 whisperx/example.py 
torchvision is not available - cannot save figures
No language specified, language will be first be detected for each audio file (increases inference time).
>>Performing voice activity detection using Silero...
Using cache found in /home/xxx/.cache/torch/hub/snakers4_silero-vad_master
Detected language: en (0.99) in first 30s of audio...
[{'text': ' Birch canoes slid on the smooth planks. Glued the sheet to the dark blue background. It is easy to tell the depth of a well. These days a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the parked truck. The hogs were fed chopped corn and garbage. Four hours of study work faced us.', 'start': 0.674, 'end': 28.83}, {'text': ' A large size in stockings is hard to sell.', 'start': 30.05, 'end': 32.254}]
[{'start': 0.694, 'end': 2.995, 'text': ' Birch canoes slid on the smooth planks.', 'words': [{'word': 'Birch', 'start': 0.694, 'end': 1.034, 'score': 0.854}, {'word': 'canoes', 'start': 1.114, 'end': 1.555, 'score': 0.763}, {'word': 'slid', 'start': 1.595, 'end': 1.915, 'score': 0.881}, {'word': 'on', 'start': 2.015, 'end': 2.095, 'score': 0.909}, {'word': 'the', 'start': 2.115, 'end': 2.195, 'score': 0.789}, {'word': 'smooth', 'start': 2.255, 'end': 2.615, 'score': 0.828}, {'word': 'planks.', 'start': 2.695, 'end': 2.995, 'score': 0.861}]}, {'start': 4.296, 'end': 6.357, 'text': 'Glued the sheet to the dark blue background.', 'words': [{'word': 'Glued', 'start': 4.296, 'end': 4.616, 'score': 0.474}, {'word': 'the', 'start': 4.676, 'end': 4.756, 'score': 0.968}, {'word': 'sheet', 'start': 4.796, 'end': 5.016, 'score': 0.933}, {'word': 'to', 'start': 5.056, 'end': 5.157, 'score': 0.776}, {'word': 'the', 'start': 5.177, 'end': 5.237, 'score': 0.952}, {'word': 'dark', 'start': 5.277, 'end': 5.517, 'score': 0.99}, {'word': 'blue', 'start': 5.577, 'end': 5.777, 'score': 0.844}, {'word': 'background.', 'start': 5.837, 'end': 6.357, 'score': 0.93}]}, {'start': 7.838, 'end': 9.659, 'text': 'It is easy to tell the depth of a well.', 'words': [{'word': 'It', 'start': 7.838, 'end': 7.918, 'score': 0.932}, {'word': 'is', 'start': 7.978, 'end': 8.058, 'score': 0.724}, {'word': 'easy', 'start': 8.118, 'end': 8.318, 'score': 0.958}, {'word': 'to', 'start': 8.358, 'end': 8.438, 'score': 0.88}, {'word': 'tell', 'start': 8.498, 'end': 8.699, 'score': 0.712}, {'word': 'the', 'start': 8.739, 'end': 8.819, 'score': 0.828}, {'word': 'depth', 'start': 8.859, 'end': 9.119, 'score': 0.859}, {'word': 'of', 'start': 9.179, 'end': 9.279, 'score': 0.796}, {'word': 'a', 'start': 9.319, 'end': 9.339, 'score': 0.767}, {'word': 'well.', 'start': 9.399, 'end': 9.659, 'score': 0.933}]}, {'start': 10.9, 'end': 12.841, 'text': 'These days a chicken leg is a rare dish.', 'words': [{'word': 'These', 
'start': 10.9, 'end': 11.12, 'score': 0.856}, {'word': 'days', 'start': 11.16, 'end': 11.36, 'score': 0.87}, {'word': 'a', 'start': 11.4, 'end': 11.44, 'score': 0.515}, {'word': 'chicken', 'start': 11.48, 'end': 11.78, 'score': 0.932}, {'word': 'leg', 'start': 11.82, 'end': 12.0, 'score': 0.993}, {'word': 'is', 'start': 12.04, 'end': 12.121, 'score': 0.76}, {'word': 'a', 'start': 12.181, 'end': 12.221, 'score': 0.499}, {'word': 'rare', 'start': 12.281, 'end': 12.501, 'score': 0.776}, {'word': 'dish.', 'start': 12.581, 'end': 12.841, 'score': 0.878}]}, {'start': 14.282, 'end': 16.123, 'text': 'Rice is often served in round bowls.', 'words': [{'word': 'Rice', 'start': 14.282, 'end': 14.522, 'score': 0.867}, {'word': 'is', 'start': 14.582, 'end': 14.662, 'score': 0.638}, {'word': 'often', 'start': 14.722, 'end': 15.022, 'score': 0.922}, {'word': 'served', 'start': 15.082, 'end': 15.362, 'score': 0.848}, {'word': 'in', 'start': 15.422, 'end': 15.502, 'score': 0.85}, {'word': 'round', 'start': 15.562, 'end': 15.783, 'score': 0.912}, {'word': 'bowls.', 'start': 15.823, 'end': 16.123, 'score': 0.647}]}, {'start': 17.343, 'end': 19.265, 'text': 'The juice of lemons makes fine punch.', 'words': [{'word': 'The', 'start': 17.343, 'end': 17.464, 'score': 0.796}, {'word': 'juice', 'start': 17.504, 'end': 17.764, 'score': 0.976}, {'word': 'of', 'start': 17.804, 'end': 17.884, 'score': 0.83}, {'word': 'lemons', 'start': 17.944, 'end': 18.264, 'score': 0.914}, {'word': 'makes', 'start': 18.344, 'end': 18.564, 'score': 0.866}, {'word': 'fine', 'start': 18.644, 'end': 18.904, 'score': 0.914}, {'word': 'punch.', 'start': 18.964, 'end': 19.265, 'score': 0.888}]}, {'start': 20.445, 'end': 22.406, 'text': 'The box was thrown beside the parked truck.', 'words': [{'word': 'The', 'start': 20.445, 'end': 20.565, 'score': 0.89}, {'word': 'box', 'start': 20.605, 'end': 20.885, 'score': 0.956}, {'word': 'was', 'start': 20.926, 'end': 21.046, 'score': 0.907}, {'word': 'thrown', 'start': 21.106, 
'end': 21.346, 'score': 0.621}, {'word': 'beside', 'start': 21.386, 'end': 21.706, 'score': 0.901}, {'word': 'the', 'start': 21.746, 'end': 21.806, 'score': 0.977}, {'word': 'parked', 'start': 21.866, 'end': 22.086, 'score': 0.65}, {'word': 'truck.', 'start': 22.126, 'end': 22.406, 'score': 0.859}]}, {'start': 23.767, 'end': 25.748, 'text': 'The hogs were fed chopped corn and garbage.', 'words': [{'word': 'The', 'start': 23.767, 'end': 23.867, 'score': 0.997}, {'word': 'hogs', 'start': 23.907, 'end': 24.147, 'score': 0.873}, {'word': 'were', 'start': 24.167, 'end': 24.287, 'score': 0.874}, {'word': 'fed', 'start': 24.347, 'end': 24.588, 'score': 0.763}, {'word': 'chopped', 'start': 24.628, 'end': 24.928, 'score': 0.671}, {'word': 'corn', 'start': 24.968, 'end': 25.208, 'score': 0.843}, {'word': 'and', 'start': 25.248, 'end': 25.328, 'score': 0.923}, {'word': 'garbage.', 'start': 25.348, 'end': 25.748, 'score': 0.902}]}, {'start': 27.129, 'end': 28.73, 'text': 'Four hours of study work faced us.', 'words': [{'word': 'Four', 'start': 27.129, 'end': 27.329, 'score': 0.819}, {'word': 'hours', 'start': 27.369, 'end': 27.629, 'score': 0.805}, {'word': 'of', 'start': 27.669, 'end': 27.709, 'score': 0.735}, {'word': 'study', 'start': 27.749, 'end': 28.01, 'score': 0.873}, {'word': 'work', 'start': 28.05, 'end': 28.25, 'score': 0.885}, {'word': 'faced', 'start': 28.29, 'end': 28.57, 'score': 0.97}, {'word': 'us.', 'start': 28.67, 'end': 28.73, 'score': 0.99}]}, {'start': 30.111, 'end': 32.092, 'text': ' A large size in stockings is hard to sell.', 'words': [{'word': 'A', 'start': 30.111, 'end': 30.171, 'score': 0.927}, {'word': 'large', 'start': 30.212, 'end': 30.454, 'score': 0.968}, {'word': 'size', 'start': 30.515, 'end': 30.758, 'score': 0.982}, {'word': 'in', 'start': 30.798, 'end': 30.879, 'score': 0.691}, {'word': 'stockings', 'start': 30.919, 'end': 31.344, 'score': 0.923}, {'word': 'is', 'start': 31.405, 'end': 31.486, 'score': 0.816}, {'word': 'hard', 'start': 
31.526, 'end': 31.708, 'score': 0.834}, {'word': 'to', 'start': 31.748, 'end': 31.85, 'score': 0.938}, {'word': 'sell.', 'start': 31.89, 'end': 32.092, 'score': 0.954}]}]
                             segment label  ... intersection      union
0  [ 00:00:00.486 -->  00:00:03.000]     A  ...   -28.889031  31.605406
1  [ 00:00:04.266 -->  00:00:06.392]     B  ...   -25.497156  27.825406
2  [ 00:00:07.776 -->  00:00:09.683]     C  ...   -22.206531  24.315406
3  [ 00:00:10.847 -->  00:00:12.923]     D  ...   -18.966531  21.244156
4  [ 00:00:14.205 -->  00:00:16.163]     E  ...   -15.726531  17.886031
5  [ 00:00:17.294 -->  00:00:19.319]     F  ...   -12.570906  14.797906
6  [ 00:00:20.399 -->  00:00:22.390]     G  ...    -9.499656  11.692906
7  [ 00:00:23.723 -->  00:00:25.849]     H  ...    -6.040281   8.368531
8  [ 00:00:27.064 -->  00:00:28.769]     I  ...    -3.120906   5.027281
9  [ 00:00:30.017 -->  00:00:32.194]     J  ...     0.202000   2.176875

[10 rows x 7 columns]
[{'start': 0.694, 'end': 2.995, 'text': ' Birch canoes slid on the smooth planks.', 'words': [{'word': 'Birch', 'start': 0.694, 'end': 1.034, 'score': 0.854, 'speaker': 'SPEAKER_00'}, {'word': 'canoes', 'start': 1.114, 'end': 1.555, 'score': 0.763, 'speaker': 'SPEAKER_00'}, {'word': 'slid', 'start': 1.595, 'end': 1.915, 'score': 0.881, 'speaker': 'SPEAKER_00'}, {'word': 'on', 'start': 2.015, 'end': 2.095, 'score': 0.909, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 2.115, 'end': 2.195, 'score': 0.789, 'speaker': 'SPEAKER_00'}, {'word': 'smooth', 'start': 2.255, 'end': 2.615, 'score': 0.828, 'speaker': 'SPEAKER_00'}, {'word': 'planks.', 'start': 2.695, 'end': 2.995, 'score': 0.861, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 4.296, 'end': 6.357, 'text': 'Glued the sheet to the dark blue background.', 'words': [{'word': 'Glued', 'start': 4.296, 'end': 4.616, 'score': 0.474, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 4.676, 'end': 4.756, 'score': 0.968, 'speaker': 'SPEAKER_00'}, {'word': 'sheet', 'start': 4.796, 'end': 5.016, 'score': 0.933, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 5.056, 'end': 5.157, 'score': 0.776, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 5.177, 'end': 5.237, 'score': 0.952, 'speaker': 'SPEAKER_00'}, {'word': 'dark', 'start': 5.277, 'end': 5.517, 'score': 0.99, 'speaker': 'SPEAKER_00'}, {'word': 'blue', 'start': 5.577, 'end': 5.777, 'score': 0.844, 'speaker': 'SPEAKER_00'}, {'word': 'background.', 'start': 5.837, 'end': 6.357, 'score': 0.93, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 7.838, 'end': 9.659, 'text': 'It is easy to tell the depth of a well.', 'words': [{'word': 'It', 'start': 7.838, 'end': 7.918, 'score': 0.932, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 7.978, 'end': 8.058, 'score': 0.724, 'speaker': 'SPEAKER_00'}, {'word': 'easy', 'start': 8.118, 'end': 8.318, 'score': 0.958, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 8.358, 'end': 8.438, 'score': 
0.88, 'speaker': 'SPEAKER_00'}, {'word': 'tell', 'start': 8.498, 'end': 8.699, 'score': 0.712, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 8.739, 'end': 8.819, 'score': 0.828, 'speaker': 'SPEAKER_00'}, {'word': 'depth', 'start': 8.859, 'end': 9.119, 'score': 0.859, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 9.179, 'end': 9.279, 'score': 0.796, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 9.319, 'end': 9.339, 'score': 0.767, 'speaker': 'SPEAKER_00'}, {'word': 'well.', 'start': 9.399, 'end': 9.659, 'score': 0.933, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 10.9, 'end': 12.841, 'text': 'These days a chicken leg is a rare dish.', 'words': [{'word': 'These', 'start': 10.9, 'end': 11.12, 'score': 0.856, 'speaker': 'SPEAKER_00'}, {'word': 'days', 'start': 11.16, 'end': 11.36, 'score': 0.87, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 11.4, 'end': 11.44, 'score': 0.515, 'speaker': 'SPEAKER_00'}, {'word': 'chicken', 'start': 11.48, 'end': 11.78, 'score': 0.932, 'speaker': 'SPEAKER_00'}, {'word': 'leg', 'start': 11.82, 'end': 12.0, 'score': 0.993, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 12.04, 'end': 12.121, 'score': 0.76, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 12.181, 'end': 12.221, 'score': 0.499, 'speaker': 'SPEAKER_00'}, {'word': 'rare', 'start': 12.281, 'end': 12.501, 'score': 0.776, 'speaker': 'SPEAKER_00'}, {'word': 'dish.', 'start': 12.581, 'end': 12.841, 'score': 0.878, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 14.282, 'end': 16.123, 'text': 'Rice is often served in round bowls.', 'words': [{'word': 'Rice', 'start': 14.282, 'end': 14.522, 'score': 0.867, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 14.582, 'end': 14.662, 'score': 0.638, 'speaker': 'SPEAKER_00'}, {'word': 'often', 'start': 14.722, 'end': 15.022, 'score': 0.922, 'speaker': 'SPEAKER_00'}, {'word': 'served', 'start': 15.082, 'end': 15.362, 'score': 0.848, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 15.422, 
'end': 15.502, 'score': 0.85, 'speaker': 'SPEAKER_00'}, {'word': 'round', 'start': 15.562, 'end': 15.783, 'score': 0.912, 'speaker': 'SPEAKER_00'}, {'word': 'bowls.', 'start': 15.823, 'end': 16.123, 'score': 0.647, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 17.343, 'end': 19.265, 'text': 'The juice of lemons makes fine punch.', 'words': [{'word': 'The', 'start': 17.343, 'end': 17.464, 'score': 0.796, 'speaker': 'SPEAKER_00'}, {'word': 'juice', 'start': 17.504, 'end': 17.764, 'score': 0.976, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 17.804, 'end': 17.884, 'score': 0.83, 'speaker': 'SPEAKER_00'}, {'word': 'lemons', 'start': 17.944, 'end': 18.264, 'score': 0.914, 'speaker': 'SPEAKER_00'}, {'word': 'makes', 'start': 18.344, 'end': 18.564, 'score': 0.866, 'speaker': 'SPEAKER_00'}, {'word': 'fine', 'start': 18.644, 'end': 18.904, 'score': 0.914, 'speaker': 'SPEAKER_00'}, {'word': 'punch.', 'start': 18.964, 'end': 19.265, 'score': 0.888, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 20.445, 'end': 22.406, 'text': 'The box was thrown beside the parked truck.', 'words': [{'word': 'The', 'start': 20.445, 'end': 20.565, 'score': 0.89, 'speaker': 'SPEAKER_00'}, {'word': 'box', 'start': 20.605, 'end': 20.885, 'score': 0.956, 'speaker': 'SPEAKER_00'}, {'word': 'was', 'start': 20.926, 'end': 21.046, 'score': 0.907, 'speaker': 'SPEAKER_00'}, {'word': 'thrown', 'start': 21.106, 'end': 21.346, 'score': 0.621, 'speaker': 'SPEAKER_00'}, {'word': 'beside', 'start': 21.386, 'end': 21.706, 'score': 0.901, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 21.746, 'end': 21.806, 'score': 0.977, 'speaker': 'SPEAKER_00'}, {'word': 'parked', 'start': 21.866, 'end': 22.086, 'score': 0.65, 'speaker': 'SPEAKER_00'}, {'word': 'truck.', 'start': 22.126, 'end': 22.406, 'score': 0.859, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 23.767, 'end': 25.748, 'text': 'The hogs were fed chopped corn and garbage.', 'words': [{'word': 'The', 
'start': 23.767, 'end': 23.867, 'score': 0.997, 'speaker': 'SPEAKER_00'}, {'word': 'hogs', 'start': 23.907, 'end': 24.147, 'score': 0.873, 'speaker': 'SPEAKER_00'}, {'word': 'were', 'start': 24.167, 'end': 24.287, 'score': 0.874, 'speaker': 'SPEAKER_00'}, {'word': 'fed', 'start': 24.347, 'end': 24.588, 'score': 0.763, 'speaker': 'SPEAKER_00'}, {'word': 'chopped', 'start': 24.628, 'end': 24.928, 'score': 0.671, 'speaker': 'SPEAKER_00'}, {'word': 'corn', 'start': 24.968, 'end': 25.208, 'score': 0.843, 'speaker': 'SPEAKER_00'}, {'word': 'and', 'start': 25.248, 'end': 25.328, 'score': 0.923, 'speaker': 'SPEAKER_00'}, {'word': 'garbage.', 'start': 25.348, 'end': 25.748, 'score': 0.902, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 27.129, 'end': 28.73, 'text': 'Four hours of study work faced us.', 'words': [{'word': 'Four', 'start': 27.129, 'end': 27.329, 'score': 0.819, 'speaker': 'SPEAKER_00'}, {'word': 'hours', 'start': 27.369, 'end': 27.629, 'score': 0.805, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 27.669, 'end': 27.709, 'score': 0.735, 'speaker': 'SPEAKER_00'}, {'word': 'study', 'start': 27.749, 'end': 28.01, 'score': 0.873, 'speaker': 'SPEAKER_00'}, {'word': 'work', 'start': 28.05, 'end': 28.25, 'score': 0.885, 'speaker': 'SPEAKER_00'}, {'word': 'faced', 'start': 28.29, 'end': 28.57, 'score': 0.97, 'speaker': 'SPEAKER_00'}, {'word': 'us.', 'start': 28.67, 'end': 28.73, 'score': 0.99, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 30.111, 'end': 32.092, 'text': ' A large size in stockings is hard to sell.', 'words': [{'word': 'A', 'start': 30.111, 'end': 30.171, 'score': 0.927, 'speaker': 'SPEAKER_00'}, {'word': 'large', 'start': 30.212, 'end': 30.454, 'score': 0.968, 'speaker': 'SPEAKER_00'}, {'word': 'size', 'start': 30.515, 'end': 30.758, 'score': 0.982, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 30.798, 'end': 30.879, 'score': 0.691, 'speaker': 'SPEAKER_00'}, {'word': 'stockings', 'start': 30.919, 'end': 31.344, 
'score': 0.923, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 31.405, 'end': 31.486, 'score': 0.816, 'speaker': 'SPEAKER_00'}, {'word': 'hard', 'start': 31.526, 'end': 31.708, 'score': 0.834, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 31.748, 'end': 31.85, 'score': 0.938, 'speaker': 'SPEAKER_00'}, {'word': 'sell.', 'start': 31.89, 'end': 32.092, 'score': 0.954, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}]

Process finished with exit code 0

Future work

Use the Silero ONNX model (see the silero-vad repo and faster-whisper for inspiration) to enable GPU usage and its potential benefits.
Expose additional VAD settings to the user. Some settings have a common meaning across the various VAD methods, e.g.:
- min_silence_duration_ms (silero) and min_duration_off (pyannote)
- min_speech_duration_ms (silero) and min_duration_on (pyannote)
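
One way such shared settings could be exposed is sketched below. The `CommonVadSettings` class, its field names, and its default values are hypothetical and only illustrate the mapping; the parameter names on each side are the ones listed above, with Silero taking milliseconds and pyannote taking seconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CommonVadSettings:
    """Hypothetical backend-neutral VAD settings, expressed in seconds."""

    min_speech_s: float = 0.25   # placeholder default
    min_silence_s: float = 0.1   # placeholder default

    def to_silero(self) -> Dict[str, int]:
        # Silero expects these durations in milliseconds.
        return {
            "min_speech_duration_ms": round(self.min_speech_s * 1000),
            "min_silence_duration_ms": round(self.min_silence_s * 1000),
        }

    def to_pyannote(self) -> Dict[str, float]:
        # pyannote expects these durations in seconds.
        return {
            "min_duration_on": self.min_speech_s,
            "min_duration_off": self.min_silence_s,
        }
```

A single pair of user-facing options could then be translated into the keyword arguments of whichever VAD method is selected.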

3manifold · 2024-09-27T07:09:19Z

whisperx/vads/pyannote.py

+                     onset: float = 0.5,
+                     offset: Optional[float] = None,
+                     ):
+        assert chunk_size > 0


Keep binarization separate from the parent class function merge_chunks (i.e. Vad.merge_chunks). Other VAD methods (e.g. silero) may binarize at an earlier stage, so keeping the two separate makes Vad.merge_chunks easier to reuse. In the case of silero specifically, binarization happens during model invocation.
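
To make the separation concrete, here is a simplified sketch of a merge step that only consumes already-binarized segments, so it does not care which VAD produced them. The function `merge_segments` is a hypothetical stand-in and not the actual Vad.merge_chunks; it assumes segments are sorted by start time and do not overlap.

```python
from typing import Dict, List


def merge_segments(segments: List[Dict[str, float]], chunk_size: float) -> List[Dict[str, float]]:
    """Greedily merge binarized speech segments into chunks of at most chunk_size seconds.

    A single segment longer than chunk_size is kept as its own chunk (it is not split).
    """
    assert chunk_size > 0
    merged: List[Dict[str, float]] = []
    for seg in segments:
        if merged and seg["end"] - merged[-1]["start"] <= chunk_size:
            # The segment still fits in the current chunk: extend it.
            merged[-1]["end"] = seg["end"]
        else:
            # Otherwise start a new chunk at this segment.
            merged.append({"start": seg["start"], "end": seg["end"]})
    return merged
```

Because the input is just a list of start/end times, a backend that binarizes during model invocation (as silero does) and one that binarizes in a later step can both feed the same merge code.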

sulutian · 2024-09-30T09:04:26Z

How do I use Silero VAD with WhisperX!!

3manifold · 2024-09-30T09:07:05Z

How do I use Silero VAD with WhisperX!!

From the pull request description:

Example command line (also applies to --vad_method pyannote):

GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero

CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

sulutian · 2024-10-01T11:17:47Z

How do I use Silero VAD with WhisperX!

From the pull request description:

Example command line (also applies to --vad_method pyannote):

GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero

CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

An error occurred: whisperx: error: unrecognized arguments: --vad_method silero

3manifold · 2024-10-01T11:26:48Z

How do I use Silero VAD with WhisperX!

From the pull request description:

Example command line (also applies to --vad_method pyannote):

GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero

CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

An error occurred: whisperx: error: unrecognized arguments: --vad_method silero

You have to check out the silero-vad branch.

sulutian · 2024-10-01T14:48:51Z

How do I use Silero VAD with WhisperX!

From the pull request description:

Example command line (also applies to --vad_method pyannote):

GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero

CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

An error occurred: whisperx: error: unrecognized arguments: --vad_method silero

You have to check out the silero-vad branch.

I have:
* main
remotes/origin/HEAD -> origin/main
remotes/origin/main
remotes/origin/silero-vad

3manifold · 2024-10-01T14:51:06Z

How do I use Silero VAD with WhisperX!

From the pull request description:

Example command line (also applies to --vad_method pyannote):

GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero

CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

An error occurred: whisperx: error: unrecognized arguments: --vad_method silero

You have to check out the silero-vad branch.

I have * main remotes/origin/HEAD -> origin/main remotes/origin/main remotes/origin/silero-vad

You can run git checkout -t origin/silero-vad to check out the remote branch.

sulutian · 2024-10-01T14:57:39Z

How do I use Silero VAD with WhisperX!

From the pull request description:

Example command line (also applies to --vad_method pyannote):

GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero

CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

An error occurred: whisperx: error: unrecognized arguments: --vad_method silero

You have to check out the silero-vad branch.

I have * main remotes/origin/HEAD -> origin/main remotes/origin/main remotes/origin/silero-vad

You can run git checkout -t origin/silero-vad to check out the remote branch.

i showed up！！
whisperX-silero-vad>git checkout -t origin/silero-vad
fatal: a branch named 'silero-vad' already exists

sulutian · 2024-10-12T15:30:26Z

When will a parameter for threshold adjustment be added?

Accept alternative VAD methods. Extend to use Silero VAD.

ac44722

3manifold mentioned this pull request Sep 26, 2024

[Feature] Silero VAD support #889

Open

3manifold marked this pull request as ready for review September 26, 2024 08:35

cvl01 mentioned this pull request Oct 17, 2024

Multiple improvements: language detection per segment, VAD min duration on/off, unique speakers, pyproject.toml and more. #900

Open
