Hi everyone, I know there are various versions of Whisper in the open-source community (WhisperX, Whisper JAX, etc.), and I would like to know which one is currently the best. Specifically, I am trying to find the most effective Whisper implementation for transcribing a large batch of videos (~10k videos, each about 30 minutes long).
I would love to hear your thoughts on this.
The most efficient version of OpenAI Whisper depends on whether the models are running locally or via an API:
Whisper JAX is often recommended as the fastest hosted API for transcribing audio recordings. It runs on a TPU v4-8 and can transcribe 1 hour of audio in around 30 seconds, with a limit of 2 hours per audio upload.
For your giant video batch (10,000 x 30 minutes!), I'd start with the standard OpenAI Whisper large model. It gives the best accuracy of the released model sizes, and with batching the speed is workable for big jobs. There's also a JAX version (Whisper JAX) that might be faster, so test both on a few videos first to see which one works best for you.
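To get a feel for the scale, here is a rough back-of-envelope estimate in Python. It only uses the figures quoted in this thread (10,000 videos, 30 minutes each, about 30 seconds of compute per hour of audio) and assumes that rate holds for the whole job, which real runs may not match once upload, decoding, and queueing are included.

```python
# Back-of-envelope estimate using only figures quoted in this thread.
NUM_VIDEOS = 10_000
MINUTES_PER_VIDEO = 30
SECONDS_PER_AUDIO_HOUR = 30  # quoted rate: ~30 s of compute per hour of audio


def estimate_hours(num_videos, minutes_per_video, seconds_per_audio_hour):
    """Return (hours of audio, hours of compute) for the whole batch."""
    audio_hours = num_videos * minutes_per_video / 60
    compute_seconds = audio_hours * seconds_per_audio_hour
    return audio_hours, compute_seconds / 3600


audio_hours, compute_hours = estimate_hours(
    NUM_VIDEOS, MINUTES_PER_VIDEO, SECONDS_PER_AUDIO_HOUR
)
print(f"{audio_hours:.0f} hours of audio, about {compute_hours:.1f} hours of compute")
```

So even at the quoted speed this is a job measured in days of wall-clock time on a single accelerator, which is why testing on a few videos first is worth it.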
No matter which version you pick, use batch processing to transcribe multiple videos at once. Feeding the model several inputs per call keeps the GPU or TPU busy and will save you a lot of time.
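A minimal sketch of what batch processing over many files could look like in Python. The names batched, transcribe_all, and transcribe_batch are made up for illustration; transcribe_batch stands in for whichever Whisper implementation you end up using and is assumed to take a list of file paths and return one transcript per path.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def transcribe_all(
    paths: Sequence[str],
    transcribe_batch: Callable[[List[str]], List[str]],  # hypothetical backend
    batch_size: int = 8,
) -> Dict[str, str]:
    """Run the backend one batch at a time and map each path to its transcript."""
    results: Dict[str, str] = {}
    for batch in batched(paths, batch_size):
        texts = transcribe_batch(batch)
        results.update(zip(batch, texts))
    return results


if __name__ == "__main__":
    # Stand-in backend so the sketch runs without any model installed.
    def fake_backend(batch: List[str]) -> List[str]:
        return [f"transcript of {path}" for path in batch]

    print(transcribe_all(["a.wav", "b.wav", "c.wav"], fake_backend, batch_size=2))
```

The right batch size depends on your hardware memory, so it is worth trying a few values on a small sample before launching the full run.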
Here’s where to find the Whisper stuff:
Check out insanely-fast-whisper.
Here is a list of Whisper model variants:
Hello Alex, have a look at Whisper Enhanced. Featuring an optimized batching algorithm, this version reportedly achieves 7x faster processing on lengthy audio files compared to the base OpenAI Whisper model. It's also convenient: simply install it via pip install transformers and run it with straightforward code samples.
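For reference, a sketch of what running Whisper through the Transformers pipeline with chunked batching might look like. This is my own illustration rather than the official sample: the checkpoint name openai/whisper-large-v2, the 30-second chunk length, and the batch size of 16 are assumptions you would want to tune, and the helper names are hypothetical. Reading audio from a file path also needs ffmpeg installed.

```python
def build_pipeline_kwargs(model_id="openai/whisper-large-v2", chunk_length_s=30, device=-1):
    """Collect the pipeline arguments in one place (all values are assumptions).

    device=-1 runs on CPU; pass 0 to use the first GPU.
    """
    return {
        "task": "automatic-speech-recognition",
        "model": model_id,
        "chunk_length_s": chunk_length_s,  # split long audio into chunks of this length
        "device": device,
    }


def load_asr(**overrides):
    """Build the pipeline once; reuse it for every file in the job."""
    from transformers import pipeline  # imported lazily so the helper above stays light

    return pipeline(**build_pipeline_kwargs(**overrides))


def transcribe_file(asr, audio_path, batch_size=16):
    """Transcribe one file; the chunks of that file are processed in batches."""
    result = asr(audio_path, batch_size=batch_size)
    return result["text"]


# Example usage (downloads the model on first run):
# asr = load_asr(device=0)
# print(transcribe_file(asr, "video_0001.mp3"))
```

Loading the pipeline once and reusing it across all files matters for a job this size, since reloading the model per file would dominate the runtime.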