Whisper-WebUI

A Gradio-based browser interface for Whisper. You can use it as an Easy Subtitle Generator!

Notebook

If you wish to try this on Colab, you can do so here!

Feature

- Select the Whisper implementation you want to use from:
  - openai/whisper
  - SYSTRAN/faster-whisper (used by default)
  - Vaibhavs10/insanely-fast-whisper
- Generate subtitles from various sources, including:
  - Files
  - YouTube
  - Microphone
- Currently supported subtitle formats:
  - SRT
  - WebVTT
  - txt (plain text only, without timestamps)
- Speech-to-text translation
  - From other languages to English. (This is Whisper's end-to-end speech-to-text translation feature.)
- Text-to-text translation
  - Translate subtitle files using Facebook NLLB models
  - Translate subtitle files using the DeepL API
- Pre-processing of audio input with Silero VAD.
- Pre-processing of audio input to separate background music (BGM) with UVR.
- Post-processing with speaker diarization using the pyannote model.
  - To download the pyannote model, you need a Hugging Face token, and you must manually accept the terms on the pages below:
    1. https://huggingface.co/pyannote/speaker-diarization-3.1
    2. https://huggingface.co/pyannote/segmentation-3.0
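The three output formats differ mainly in how each segment's timing is written: SRT numbers every cue and uses a comma before the milliseconds, WebVTT starts with a `WEBVTT` header and uses a dot, and txt drops the timing entirely. The sketch below illustrates this, assuming transcription segments arrive as `(start_seconds, end_seconds, text)` tuples. The function names are hypothetical and this is not the project's own subtitle writer.

```python
# Hypothetical segment shape: (start_seconds, end_seconds, text)
Segment = tuple[float, float, str]


def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Format seconds as HH:MM:SS<marker>mmm (SRT uses ',', WebVTT uses '.')."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    # SRT: numbered cues separated by a blank line
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    return "\n".join(blocks)


def to_webvtt(segments: list[Segment]) -> str:
    # WebVTT: mandatory header, then unnumbered cues
    cues = []
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        cues.append(f"{timing}\n{text.strip()}\n")
    return "WEBVTT\n\n" + "\n".join(cues)


def to_txt(segments: list[Segment]) -> str:
    # txt: text only, one segment per line, no timeline
    return "\n".join(text.strip() for _, _, text in segments)


if __name__ == "__main__":
    demo = [(0.0, 2.5, " Hello there."), (2.5, 5.0, " How are you?")]
    print(to_srt(demo))
    print(to_webvtt(demo))
    print(to_txt(demo))
```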

Installation and Running

Running with Pinokio

The app can run with Pinokio.

1. Install the Pinokio software.
2. Open Pinokio, search for Whisper-WebUI, and install it.
3. Start Whisper-WebUI and open http://localhost:7860 in your browser.

Running with Docker

1. Install and launch Docker Desktop.
2. Clone the repository:

git clone https://github.com/jhj0517/Whisper-WebUI.git

3. Build the image (the image is roughly 7 GB):

docker compose build

4. Run the container:

docker compose up

5. Open the WebUI in your browser at http://localhost:7860.

If needed, update the docker-compose.yaml to match your environment.
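Because the container can take a while to start (especially on the first run), you may want to poll the WebUI address before opening it. The sketch below uses only the Python standard library and assumes the default address http://localhost:7860 from the steps above; `is_webui_up` and `wait_for_webui` are hypothetical helper names, not part of the project.

```python
import http.client
import time
import urllib.request


def is_webui_up(url: str = "http://localhost:7860", timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and socket timeouts all derive from OSError
        return False


def wait_for_webui(url: str = "http://localhost:7860",
                   total_timeout: float = 120.0,
                   interval: float = 2.0) -> bool:
    """Poll the URL until it responds or total_timeout seconds have elapsed."""
    deadline = time.monotonic() + total_timeout
    while time.monotonic() < deadline:
        if is_webui_up(url, timeout=interval):
            return True
        time.sleep(interval)
    return False


if __name__ == "__main__":
    print("WebUI reachable:", wait_for_webui(total_timeout=10.0))
```

If you changed the published port in docker-compose.yaml, pass the matching URL instead of the default.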

Run Locally

Prerequisite

To run this WebUI, you need git, Python (3.10 <= python <= 3.12), and FFmpeg.
If you're not using an NVIDIA GPU, or you're using a CUDA version other than 12.4, edit requirements.txt to match your environment.

Please follow the links below to install the necessary software:

git: https://git-scm.com/downloads
Python: https://www.python.org/downloads/ (3.10 to 3.12 is recommended)
FFmpeg: https://ffmpeg.org/download.html
CUDA: https://developer.nvidia.com/cuda-downloads

After installing FFmpeg, make sure to add the FFmpeg/bin folder to your system PATH!
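Before installing, you can quickly confirm the prerequisites from Python itself. This is a minimal sketch, assuming the version bounds stated above (3.10 to 3.12) and that git and FFmpeg are invoked as `git` and `ffmpeg` on your PATH; the helper names are hypothetical. It does not check CUDA.

```python
import shutil
import sys

# Bounds from the prerequisite text: 3.10 <= python <= 3.12
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 12)


def python_version_ok(version_info=sys.version_info) -> bool:
    # Compare only (major, minor), so e.g. 3.12.4 counts as 3.12
    major_minor = tuple(version_info[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON


def missing_tools(tools=("git", "ffmpeg")) -> list[str]:
    # shutil.which returns None when an executable is not found on PATH
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python version OK:", python_version_ok())
    missing = missing_tools()
    print("Missing from PATH:", ", ".join(missing) if missing else "nothing")
```

If `ffmpeg` is reported missing right after installing it, the FFmpeg/bin folder is most likely not on PATH yet (a new terminal may be needed for PATH changes to apply).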

Automatic Installation

1. Clone this repository:

git clone https://github.com/jhj0517/Whisper-WebUI.git

2. Run Install.bat (Windows) or Install.sh (macOS/Linux) to install the dependencies. It creates a venv directory and installs the dependencies there.
3. Start the WebUI with start-webui.bat or start-webui.sh. It runs python app.py after activating the venv.

You can also run the project with command-line arguments; see the wiki for a guide to the available arguments.
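The install and start scripts described above boil down to three commands: create a venv, install requirements.txt into it, and run app.py with the venv's interpreter. The sketch below only builds those commands so you can inspect or run them manually; it is an approximation of what the scripts are described as doing, not a copy of them (the real scripts may pass extra options), and `install_and_launch_commands` is a hypothetical name.

```python
import subprocess
import sys
from pathlib import Path


def venv_python(venv_dir: Path, windows: bool) -> Path:
    # Standard venv layout: Scripts\python.exe on Windows, bin/python elsewhere
    if windows:
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def install_and_launch_commands(repo_dir: Path,
                                windows: bool = sys.platform == "win32") -> list[list[str]]:
    repo_dir = repo_dir.resolve()
    venv_dir = repo_dir / "venv"
    python = str(venv_python(venv_dir, windows))
    return [
        [sys.executable, "-m", "venv", str(venv_dir)],                               # create the venv
        [python, "-m", "pip", "install", "-r", str(repo_dir / "requirements.txt")],  # install deps
        [python, str(repo_dir / "app.py")],                                          # start the WebUI
    ]


def run_all(repo_dir: Path) -> None:
    # Run from the repository root so relative paths used by the app resolve correctly
    for command in install_and_launch_commands(repo_dir):
        subprocess.run(command, check=True, cwd=repo_dir)


if __name__ == "__main__":
    for command in install_and_launch_commands(Path("Whisper-WebUI")):
        print(" ".join(command))
```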

VRAM Usages

This project is integrated with faster-whisper by default for better VRAM usage and transcription speed.

According to the faster-whisper benchmarks, the optimized Whisper model compares to openai/whisper as follows:

Implementation	Precision	Beam size	Time	Max. GPU memory	Max. CPU memory
openai/whisper	fp16	5	4m30s	11325MB	9439MB
faster-whisper	fp16	5	54s	4755MB	3244MB
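To put the table in relative terms, the short calculation below derives the speedup and memory ratios from the two rows above. It uses only the numbers in the table; `to_seconds` is a small hypothetical helper for the "4m30s" style durations.

```python
import re


def to_seconds(duration: str) -> int:
    # Parse durations such as "4m30s" or "54s"
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {duration!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


# Rows from the table: (time, max GPU memory in MB, max CPU memory in MB)
OPENAI_WHISPER = ("4m30s", 11325, 9439)
FASTER_WHISPER = ("54s", 4755, 3244)

speedup = to_seconds(OPENAI_WHISPER[0]) / to_seconds(FASTER_WHISPER[0])
gpu_ratio = OPENAI_WHISPER[1] / FASTER_WHISPER[1]
cpu_ratio = OPENAI_WHISPER[2] / FASTER_WHISPER[2]

print(f"speedup: {speedup:.1f}x")
print(f"peak GPU memory ratio: {gpu_ratio:.2f}x")
print(f"peak CPU memory ratio: {cpu_ratio:.2f}x")
```

With these rows, faster-whisper is 5.0x faster (270 s vs. 54 s) and needs about 2.38x less peak GPU memory and 2.91x less peak CPU memory.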

If you want to use an implementation other than faster-whisper, pass the --whisper_type argument with the implementation's repository name.
Read the wiki for more information about CLI arguments.

Available models

This is Whisper's original VRAM usage table for models.

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~32x
base	74 M	`base.en`	`base`	~1 GB	~16x
small	244 M	`small.en`	`small`	~2 GB	~6x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x

.en models are English-only. The multilingual models, including "large", also let you use the Translate to English option!
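If you are unsure which size fits your GPU, the table can be turned into a simple lookup: pick the largest model whose approximate VRAM requirement fits. This is a rough sketch using the openai/whisper figures above; since faster-whisper generally needs less VRAM, treat the result as conservative. `pick_model` is a hypothetical helper, and "large" falls back to the multilingual model because it has no `.en` variant.

```python
from typing import Optional

# (size, approximate required VRAM in GB, has an English-only variant), from the table above
MODELS = [
    ("tiny", 1, True),
    ("base", 1, True),
    ("small", 2, True),
    ("medium", 5, True),
    ("large", 10, False),
]


def pick_model(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Return the largest model name that fits in vram_gb, or None if none fits."""
    best = None
    for size, required_gb, has_en in MODELS:  # ordered from smallest to largest
        if required_gb <= vram_gb:
            best = f"{size}.en" if english_only and has_en else size
    return best


if __name__ == "__main__":
    for vram in (0.5, 1, 4, 6, 12):
        print(vram, "GB ->", pick_model(vram), "/", pick_model(vram, english_only=True))
```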

TODO🗓

- Add DeepL API translation
- Add NLLB model translation
- Integrate with faster-whisper
- Integrate with insanely-fast-whisper
- Integrate with whisperX (speaker diarization only)
- Add background music separation pre-processing with UVR
- Add FastAPI script
- Support real-time transcription for microphone input

Translation 🌐

Any PRs that add Japanese, Spanish, French, German, Chinese, or any other language translations to translation.yaml would be greatly appreciated!
