The evolution of Speech-to-Text (STT) systems has opened new possibilities for voice-driven applications, from virtual assistants to real-time transcription services. However, delivering fast and accurate STT remains a challenge, especially in resource-constrained environments. A promising solution lies in combining the Wyoming Protocol, OpenAI’s Whisper, and Nvidia TensorRT, as demonstrated in the GitHub project Wyoming Whisper TensorRT. Let’s dive into how this approach can revolutionize STT and pave the way for smarter, more private smart home voice assistants.
The Evolution of This Project Over 2024
This project initially started as a fork of an official Nvidia demo using TensorRT with Whisper on an Nvidia Jetson. From there, I took Rhasspy’s Wyoming Faster Whisper project and tweaked the server and handler to work with the Nvidia project so it could connect to Home Assistant.
After that, I spent quite a bit of time experimenting with newer Python packages for all of the dependencies before finding a mix that used the latest or near-latest releases. With each round of upgrades, I saw performance improvements in my voice assist pipeline. Turns out, newer releases of PyTorch and TensorRT have a number of optimizations in them.
Finally, I ran the project through ChatGPT o1 to write more comprehensive logging for debugging, re-tool the code layout for better readability with functions and comments, and perform a few code optimizations.
Why Do This?
I could tongue-in-cheek just say “why not?” but it goes beyond that. It’s simple: performance. My goal in 2024 (now 2025 due to time constraints) is to build a fully local, conversational voice assistant to replace Amazon Alexa. While all my Echo Dots and Echo Shows have served me faithfully across several years and houses, they have their limitations. Lately, I increasingly find myself talking to a device on the other side of the house, or having to repeat myself because it mishears me or clips the last word or two from my command. My hope is that a local voice assistant can overcome some of that while allowing for more complex commands. In particular, my goal is to be able to say one sentence that controls multiple devices, e.g., instead of saying “turn off the left nightstand light” and “turn off the right nightstand light” before bed, I’d like to just say “turn off the nightstand lights” without having to create a routine or device group.
In addition, throughout 2024 I’ve been pushing to make as many of my smart home devices locally controlled as possible, so why not add a voice assistant to the list? You’ll see more posts on this topic in the coming months, ranging from replacing devices to moving them to a dedicated IoT network with firewall restrictions.
What are Wyoming, Whisper, and TensorRT anyway?
Wyoming Protocol: A Lightweight Communication Framework
The Wyoming Protocol is a lightweight message protocol designed for streaming audio data efficiently between voice services: events are sent as small JSON headers with optional binary audio payloads. It minimizes overhead while enabling seamless communication between audio sources and STT engines. Benefits of the Wyoming Protocol include:
- Low Latency: Real-time audio streaming with minimal delays.
- Scalability: Handles multiple audio streams concurrently without performance degradation.
- Interoperability: Works across diverse platforms and systems via easy-to-use APIs.
Wyoming is especially popular for use with Home Assistant, where it enables seamless voice control and automation capabilities. By serving as the bridge between Home Assistant and STT/TTS engines, Wyoming ensures fast and accurate transcription and responses for smart home commands and interactions. This approach helps reduce reliance on cloud-based solutions, offering enhanced privacy and security for smart home users.
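To make the protocol concrete, here’s a minimal client sketch using the `wyoming` Python package’s helpers (`AsyncTcpClient`, `AudioStart`/`AudioChunk`/`AudioStop`, `Transcribe`/`Transcript`). The host, port, and WAV file are placeholder assumptions, not values from this project:

```python
# Minimal Wyoming STT client sketch (illustrative; host/port/file are placeholders).
import asyncio
import wave

from wyoming.asr import Transcribe, Transcript
from wyoming.audio import AudioChunk, AudioStart, AudioStop
from wyoming.client import AsyncTcpClient


async def transcribe_wav(path: str) -> str:
    client = AsyncTcpClient("localhost", 10300)  # assumed server address
    await client.connect()
    try:
        # Ask the server for a transcription session
        await client.write_event(Transcribe().event())

        with wave.open(path, "rb") as wav:
            rate, width, channels = wav.getframerate(), wav.getsampwidth(), wav.getnchannels()
            await client.write_event(AudioStart(rate=rate, width=width, channels=channels).event())

            # Stream PCM in small chunks, as a live microphone would
            while chunk := wav.readframes(1024):
                await client.write_event(
                    AudioChunk(rate=rate, width=width, channels=channels, audio=chunk).event()
                )

            await client.write_event(AudioStop().event())

        # Wait for the transcript event back from the server
        while event := await client.read_event():
            if Transcript.is_type(event.type):
                return Transcript.from_event(event).text
    finally:
        await client.disconnect()

    return ""


if __name__ == "__main__":
    print(asyncio.run(transcribe_wav("sample.wav")))
```

Because every Wyoming service speaks this same event stream, Home Assistant can swap STT backends without any client-side changes.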
Whisper: OpenAI’s Robust STT Model
Whisper, developed by OpenAI, has become a benchmark for STT models due to its:
- High Accuracy: Exceptional performance, even with challenging accents and in noisy environments.
- Multi-Language Support: Transcribes dozens of languages and translates them to English.
- Versatility: Handles tasks like automatic punctuation and timestamp generation.
However, the trade-off for Whisper’s accuracy and versatility is often computational intensity. This is where TensorRT steps in to optimize the model for real-time performance.
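For a sense of the baseline, transcribing with stock Whisper takes only a few lines of Python (this assumes the openai-whisper package and a placeholder audio file):

```python
# Baseline, unoptimized Whisper transcription via the openai-whisper package.
import whisper

model = whisper.load_model("tiny.en")    # same model size used in the benchmarks below
result = model.transcribe("sample.wav")  # placeholder file; most audio formats work
print(result["text"])
```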
TensorRT: Optimizing Whisper for Speed
NVIDIA’s TensorRT is a powerful toolkit for optimizing and deploying deep learning models on NVIDIA GPUs. Converting Whisper into a TensorRT engine can yield significant performance improvements:
- Lower Latency: Faster inference times enable real-time transcription.
- Reduced Resource Usage: Optimized models require less GPU memory and power.
- Customizability: Tailored optimizations for specific hardware and workloads.
The Wyoming Whisper TensorRT project demonstrates how to leverage TensorRT to deploy Whisper efficiently, making it feasible for applications that demand both speed and accuracy. While TensorRT is often associated with datacenter GPUs, it also runs on most modern consumer GPUs, as evidenced by the aforementioned GitHub project being built, tested, and run on an RTX 4070 Ti. At the time of writing, the latest TensorRT release supports CUDA 11.0 update 3 and newer.
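As a rough sketch of what “converting Whisper into a TensorRT engine” means in practice, here’s the general torch2trt pattern applied to Whisper’s audio encoder. This is a schematic of the approach, not the project’s actual conversion script, and some encoder ops may need extra converter support:

```python
# Schematic TensorRT conversion of Whisper's audio encoder via torch2trt.
# Not this project's actual conversion code; shapes and flags are illustrative.
import torch
import whisper
from torch2trt import torch2trt

model = whisper.load_model("tiny.en").cuda().eval()
encoder = model.encoder

# Whisper's encoder takes a log-mel spectrogram: (batch, n_mels, n_frames).
# tiny.en uses 80 mel bins and 3000 frames for a 30-second window.
example_mel = torch.randn(1, 80, 3000).cuda()

# Trace the encoder with the example input and build a TensorRT engine (FP16 mode).
encoder_trt = torch2trt(encoder, [example_mel], fp16_mode=True)

with torch.no_grad():
    features = encoder_trt(example_mel)
print(features.shape)  # encoded audio features, ready for the decoder
```

The engine build is slow, which is why projects like this one do the conversion once on first launch and cache the result.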
Integrating Wyoming Protocol, Whisper, and TensorRT
Here’s how the integration works (a minimal server-side sketch follows these steps):
1. Audio Streaming with Wyoming:
   - Audio data is streamed in real time using the Wyoming Protocol.
   - The protocol’s lightweight nature ensures low latency and seamless data flow to the STT engine.
2. Processing with Whisper:
   - The streamed audio is fed into the Whisper model, now optimized with TensorRT.
   - TensorRT’s optimizations ensure that the model processes audio quickly without compromising accuracy.
3. Output Delivery:
   - The transcribed text is returned to the application via Wyoming, completing the STT loop efficiently.
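On the server side, that loop is conceptually simple. Below is a schematic Wyoming event handler, loosely modeled on how Wyoming STT servers are structured: it buffers audio chunks and, on `AudioStop`, hands the buffer to a `run_whisper_trt()` stand-in for the TensorRT-optimized engine. The class and function names are placeholders, not this project’s actual code:

```python
# Schematic server-side Wyoming handler (names are placeholders).
from wyoming.asr import Transcript
from wyoming.audio import AudioChunk, AudioStop
from wyoming.event import Event
from wyoming.server import AsyncEventHandler


class WhisperTrtHandler(AsyncEventHandler):
    """Buffers streamed audio, then transcribes it when the stream ends."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._audio = b""

    async def handle_event(self, event: Event) -> bool:
        if AudioChunk.is_type(event.type):
            # Accumulate raw PCM as it streams in
            self._audio += AudioChunk.from_event(event).audio
            return True

        if AudioStop.is_type(event.type):
            # End of utterance: run inference and return the text to the client
            text = run_whisper_trt(self._audio)  # placeholder for the TRT engine call
            await self.write_event(Transcript(text=text).event())
            self._audio = b""
            return True

        return True  # ignore other events (e.g., Transcribe, AudioStart) in this sketch
```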
For Home Assistant users, this integration enables smart home systems to process audio clips nearly instantly, enhancing the overall user experience. Furthermore, this approach empowers smart home users to migrate away from cloud-based assistants like Google Home and Amazon Alexa, achieving similar capabilities with local processing and enhanced data privacy.
Performance Benefits of This Project
While you can view exact numbers on the project’s GitHub page, here’s a quick summary. The project does consume more VRAM than CTranslate2-based projects like Faster Whisper, but its VRAM usage still comes in roughly 20% lower than stock OpenAI Whisper; on the 4070 Ti, Faster Whisper only shaves off about another 3% beyond that. Transcription speed is where it shines: with the tiny.en models, a 20-second audio clip transcribes in 0.07 seconds, versus 0.40 seconds for stock Whisper and 0.35 seconds for Faster Whisper, roughly a 5-6x speedup.
Getting Started with Wyoming Whisper TensorRT
If you’re ready to explore this integration, the Wyoming Whisper TensorRT GitHub repository provides a detailed guide to get started. The repository includes:
- Setup instructions for installing dependencies and configuring the environment; alternatively, a docker-compose config is provided to deploy a container from a Docker Hub image.
- Scripts for converting Whisper models into TensorRT engines on first launch.
- Benchmark scripts for testing TensorRT-optimized Whisper (a generic timing sketch follows this list).
- Sample benchmarks taken on an Nvidia Jetson and an RTX 4070 Ti-based Linux VM.
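The methodology behind numbers like those above is straightforward; a generic latency micro-benchmark (not the repository’s actual script) might look like this:

```python
# Generic transcription-latency micro-benchmark (not the repo's actual script).
import time

import numpy as np


def benchmark(transcribe, audio, runs: int = 10) -> float:
    transcribe(audio)  # warm-up pass so model/engine load isn't timed
    start = time.perf_counter()
    for _ in range(runs):
        transcribe(audio)
    return (time.perf_counter() - start) / runs


# 20 seconds of silence at 16 kHz as a stand-in test clip
audio = np.zeros(16000 * 20, dtype=np.float32)
# avg_seconds = benchmark(my_engine.transcribe, audio)  # my_engine is hypothetical
```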
My Setup
In order to host this container, I run an AI VM on Proxmox. The VM has 8 cores on a Ryzen 5 5600G desktop, 24 GB of RAM, and an RTX 4070 Ti. The CPU and RAM were sized for what I needed while initially spinning up voice assist components, before offloading most of their work to the GPU, so both can likely be reduced. This VM hosts a number of Docker containers:
- This project
- Ollama, hosting llama3.1
- A PyTorch-powered TTS engine to create audio clips of GLaDOS, an AI from Valve’s Portal series
- Obico, for AI-based failed-print detection on my 3D printers
- Bumper, to locally control my Ecovacs Omni X1
All said and done, the local assistant can transcribe, process, and respond to queries such as “tell me a joke” or “turn on the hallway light” in roughly 5-6 seconds; at the start of this endeavor, response times were in the 12-15 second range. For comparison, from some quick tests, Amazon Alexa seems to fall in the 3-5 second range when it gets the command right on the first try. While my locally hosted assistant may not be as fast and sometimes hallucinates, it is usable and will only get better with time. Between further optimizations in newer releases of dependencies and new hardware such as the newly announced RTX 5000 series, response times will keep coming down. And as newer LLM models are released, they should hallucinate less, preventing the occasional odd response to a query that I get today.
Conclusion
The combination of the Wyoming Protocol, Whisper, and TensorRT represents a significant leap forward in STT technology. By optimizing Whisper with TensorRT and integrating it with Wyoming for efficient streaming, this approach delivers high-performance STT suitable for real-time, large-scale applications. For Home Assistant users, this means a smarter, faster, and more responsive smart home experience. By integrating this setup with Large Language Models (LLMs), smart home enthusiasts can begin replacing cloud-based assistants like Google Home and Amazon Alexa with conversational, privacy-focused local solutions. Explore the Wyoming Whisper TensorRT project today to unlock the full potential of these cutting-edge technologies.