Introduction
In my last post, I showed how compiling Whisper with TensorRT dramatically boosted transcription speed for Home Assistant’s voice features. Since then, I’ve pushed the project even further—optimizing performance, reducing VRAM usage, and laying the groundwork for a voice assistant whose responsiveness rivals cloud services like Alexa or Google Assistant.
Even better, I’ve integrated local large language models via Ollama for natural conversations and built a TensorRT-powered text-to-speech system. The result? A completely private, local voice assistant that’s not only functional but fast.
Here’s what’s new.
Wyoming Whisper TensorRT Updates
Float16 by Default
Whisper model conversions to TensorRT now use float16 precision by default. This slashes both inference times and VRAM requirements, without noticeable transcription quality loss. On my GPU, the improvement has been significant, particularly for larger models.
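For intuition, float16 halves the storage of every weight relative to float32. A quick way to see the weight-memory effect in plain PyTorch (this uses openai-whisper purely as an illustration; the real savings come from the TensorRT engine being built at float16 precision):

import torch
import whisper  # openai-whisper, used here only to illustrate model size

model = whisper.load_model("small")
fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
model = model.half()  # convert weights to float16
fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"fp32 weights: {fp32_bytes / 1e6:.0f} MB, fp16 weights: {fp16_bytes / 1e6:.0f} MB")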
Runtime Model Selection
Previously, swapping between Whisper model sizes (like small, medium, large-v2) meant rebuilding the container. Now, you can choose model size and compute type dynamically via environment variables:
docker run \
-e MODEL_SIZE=small \
-e COMPUTE_TYPE=float16 \
...
This makes it easy to test different trade-offs between speed and accuracy.
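Inside the container, the selection boils down to reading those environment variables at startup. A minimal sketch of the idea (the variable names match the docker flags above; the actual engine loading is elided):

import os

# The same knobs exposed via `docker run -e ...`
model_size = os.environ.get("MODEL_SIZE", "small")        # small, medium, large-v2, ...
compute_type = os.environ.get("COMPUTE_TYPE", "float16")  # float16 or float32

print(f"Loading Whisper '{model_size}' with {compute_type} precision")
# ...the server would hand these values to its model loader at this point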
Real-Time Streaming Transcription
One big bottleneck for voice assistants is the “dead air” while waiting for long audio recordings to finish transcribing. I’ve begun implementing chunked audio processing (a rough sketch follows the list below), allowing:
- Audio to be split into overlapping windows
- Transcripts streamed back in near-real-time
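The core idea is straightforward: split the incoming PCM into fixed-size windows that overlap slightly, so each window can be transcribed while later audio is still arriving. A minimal sketch, with illustrative window and overlap sizes rather than the server’s actual values:

import numpy as np

def overlapping_windows(audio: np.ndarray, sample_rate: int = 16000,
                        window_s: float = 5.0, overlap_s: float = 1.0):
    """Yield overlapping chunks so each can be transcribed before the recording ends."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    for start in range(0, len(audio), step):
        chunk = audio[start:start + window]
        if len(chunk) == 0:
            break
        yield chunk

# 12 seconds of silence as a stand-in for microphone audio
audio = np.zeros(12 * 16000, dtype=np.float32)
for i, chunk in enumerate(overlapping_windows(audio)):
    print(f"chunk {i}: {len(chunk) / 16000:.1f}s")  # each chunk would be sent to the STT engine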
However, Home Assistant’s Wyoming integration doesn’t yet fully support the streaming features introduced in version 1.7.0 of the protocol. As soon as it does, I’ll enable this functionality—critical for fast, natural conversations.
Smarter GPU Memory Management
TensorRT (and PyTorch) sometimes leaves GPU memory allocated to the Python process after a job finishes, even while the service is idle. My container now automatically calls:
torch.cuda.empty_cache()
after audio jobs. This reduces idle VRAM usage of the service, freeing resources for other tasks—like running LLMs for conversational AI.
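The pattern is simple: release PyTorch’s unused cached allocations once a job finishes instead of holding them between requests. A minimal sketch of what that looks like around a transcription job (the inference itself is a placeholder):

import torch

def handle_audio_job(audio_chunk: bytes) -> str:
    """Run one transcription job, then release cached GPU memory."""
    try:
        # ... run the TensorRT / PyTorch inference here ...
        return "transcript placeholder"
    finally:
        # Returns unoccupied cached blocks to the driver so the service idles lean;
        # memory still held by loaded models is unaffected.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

print(handle_audio_job(b"\x00" * 32000))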
Wyoming GLaDOS: Fast Local TTS
Beyond transcription, I’ve been working on wyoming-glados, a Wyoming Protocol-compatible TTS server designed for fast, high-quality voice synthesis in Home Assistant.
This project builds on earlier open-source efforts, including:
- nalf3in’s wyoming-glados — a Wyoming server implementation for GLaDOS-style TTS
- R2D2FISH’s glados-tts — providing GLaDOS voice models and TTS implementations
Under the hood, the implementation is based on Tacotron for text-to-spectrogram conversion and a vocoder model to generate audio waveforms. I’ve applied many of the same optimizations used in Wyoming Whisper TRT:
- TensorRT acceleration for the Tacotron and vocoder models
- Smart batching and GPU memory management
- Rapid inference for low-latency responses
- Model and bucket warmup for lower initial processing latency on first chunks (sketched below)
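Warmup just means pushing dummy inputs of representative lengths through the models at startup, so the first real request doesn’t pay for lazy initialization or engine setup. A rough sketch of the idea, with synthesis stubbed out:

import time

def synthesize(text: str) -> bytes:
    """Placeholder for the real Tacotron + vocoder pipeline."""
    time.sleep(0.01)  # pretend to do work
    return b"\x00" * 1024

# Prompts of representative lengths; the real server's buckets may differ.
WARMUP_PROMPTS = [
    "Hello.",
    "The cake is a lie, but the warmup is real.",
    "This is a longer warmup sentence used to exercise the larger input buckets.",
]

def warm_up():
    for prompt in WARMUP_PROMPTS:
        start = time.perf_counter()
        synthesize(prompt)
        print(f"warmed bucket for {len(prompt):>2} chars in {time.perf_counter() - start:.3f}s")

warm_up()  # run once at startup, before accepting Wyoming requests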
Full Support for Partial Streaming
Unlike Whisper STT, which is still waiting on full streaming support in Home Assistant, the TTS side has been upgraded as of Home Assistant 2025.7 to support partial streaming. Wyoming GLaDOS fully implements this feature (the chunking logic is sketched after the list):
- It receives chunks of text from Home Assistant
- It waits for the end of a sentence, then generates and streams synthesized audio in small, sentence-long chunks
- Playback can begin while text transmission and synthesis continue
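A minimal sketch of that sentence-chunking logic: buffer incoming text, and each time a sentence boundary appears, hand that sentence off for synthesis so audio can start streaming right away (the synthesize call stands in for the real Tacotron + vocoder pipeline):

import re

SENTENCE_END = re.compile(r"([.!?])\s")

def synthesize(sentence: str) -> bytes:
    print(f"synthesizing: {sentence!r}")
    return b"\x00" * 512  # stand-in for real audio

def stream_tts(text_chunks):
    """Yield synthesized audio per sentence as text chunks arrive."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[:match.end(1)], buffer[match.end():]
            yield synthesize(sentence.strip())
    if buffer.strip():
        yield synthesize(buffer.strip())  # flush whatever remains at the end

chunks = ["You did well. ", "For a human, ", "I mean. ", "Goodbye."]
for audio in stream_tts(chunks):
    pass  # each audio chunk would be streamed back over the Wyoming protocol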
Combined with TensorRT acceleration, this makes GLaDOS TTS fast enough for real-time conversations and responsive automations—delivering GLaDOS’s signature style with minimal delay.
Ollama + Home Assistant = Local Voice AI
Here’s where it all ties together: Ollama.
Ollama makes it simple to run powerful large language models locally—like Llama 3 or Mistral. I’ve integrated it into Home Assistant’s voice pipeline so I can:
- Convert speech to text quickly via wyoming-whisper-trt
- Send the text and specified smart home entities to llama3.2 via Ollama (a rough sketch of the local call follows this list)
- Get back a conversational response while allowing for device control
- Synthesize that response with GLaDOS TTS
- Speak it back—all locally
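Home Assistant’s Ollama integration handles the LLM step internally, but for intuition, here is roughly what a streaming chat request to a local Ollama server looks like (the model name and prompt are just examples):

import json
import requests

# Ollama listens on localhost:11434 by default
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [
            {"role": "system", "content": "You control a smart home. Be brief."},
            {"role": "user", "content": "Turn off the kitchen lights."},
        ],
        "stream": True,
    },
    stream=True,
)

# The reply is newline-delimited JSON; each line carries a piece of the response.
for line in resp.iter_lines():
    if line:
        piece = json.loads(line)
        print(piece.get("message", {}).get("content", ""), end="", flush=True)
print()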
The goal is near-Alexa-level speed for basic voice tasks, but with:
✅ Privacy – No cloud servers needed
✅ Speed – Sub-second responses possible for shorter queries
Results So Far
Even without full streaming support yet, the current setup feels dramatically faster than my original experiments. For short queries, my pipeline looks like:
- STT (TensorRT Whisper) → ~0.3 – 0.5s
- LLM (Ollama, local) → ~0.5 – 2.5s
- TTS (GLaDOS TRT) → ~0.2s
That means simple commands can round-trip in ~2 – 3 seconds—already competitive with many cloud assistants. Once chunked transcription is live, I expect that gap to narrow further.
Compare that with the 10–12 seconds of processing time when I started this journey 18 months ago.
One big opportunity still on the table for further reducing processing time is how streaming between components works. As far as I can tell, Home Assistant’s Ollama integration receives text responses from the model in chunks, but those chunks aren’t forwarded to the Wyoming TTS system until the full response has arrived. If Home Assistant forwarded each chunk as soon as it arrived, audio generation could start ~0.25–0.5 seconds earlier on my setup, giving the impression of near-instant response times.
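To make the difference concrete, here’s a sketch of the two behaviors as I understand them: buffering the full LLM response before handing it to TTS versus forwarding each chunk as it arrives (this is my reading of the current behavior, not Home Assistant’s actual code):

def buffered(llm_chunks, send_to_tts):
    """Current behavior (as I understand it): TTS only starts after the full reply."""
    send_to_tts("".join(llm_chunks))  # waits for every chunk first

def streamed(llm_chunks, send_to_tts):
    """Desired behavior: forward chunks immediately so synthesis overlaps generation."""
    for chunk in llm_chunks:
        send_to_tts(chunk)

chunks = ["The kitchen lights ", "are now off. ", "Anything else?"]
streamed(chunks, lambda text: print(f"-> TTS: {text!r}"))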
What’s Next
- Full streaming STT once Home Assistant’s Wyoming integration supports it
- Web search abilities via Brave Search for AI, Google Places, and Wikipedia (code is about 90% complete as of this writing)
- Get the STT and TTS Docker containers to build for Jetson devices and ARM64 dGPU systems via GitHub Actions
Try It Yourself
Check out the repos: wyoming-whisper-trt for TensorRT-accelerated speech-to-text and wyoming-glados for TTS.
All open-source—and all steps closer to a truly private voice assistant experience that’s fast enough to replace cloud services.
Conclusion
Thanks to TensorRT optimizations, smarter memory handling, and local LLMs like those from Ollama, a fast, private voice assistant isn’t a pipe dream—it’s here, and it’s only getting faster.