Introduction
In my last post, I showed how compiling Whisper with TensorRT dramatically boosted transcription speed for Home Assistant’s voice features. Since then, I’ve pushed the project even further—optimizing performance, reducing VRAM usage, and laying the groundwork for a voice assistant whose responsiveness rivals cloud services like Alexa or Google Assistant.
Even better, I’ve integrated local large language models via Ollama for natural conversations and built a TensorRT-powered text-to-speech system. The result? A completely private, local voice assistant that’s not only functional but fast.
Here’s what’s new.
Wyoming Whisper TensorRT Updates
Float16 by Default
Whisper model conversions to TensorRT now use float16 precision by default. This slashes both inference times and VRAM requirements, without noticeable transcription quality loss. On my GPU, the improvement has been significant, particularly for larger models.
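For intuition, float16 halves the storage of every weight relative to float32. A quick way to see the weight-memory effect in plain PyTorch (this uses openai-whisper purely as an illustration; the real savings come from the TensorRT engine being built at float16 precision):

import torch
import whisper  # openai-whisper, used here only to illustrate model size

model = whisper.load_model("small")
fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
model = model.half()  # convert weights to float16
fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"fp32 weights: {fp32_bytes / 1e6:.0f} MB, fp16 weights: {fp16_bytes / 1e6:.0f} MB")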
Runtime Model Selection
Previously, swapping between Whisper model sizes (like small, medium, large-v2) meant rebuilding the container. Now, you can choose model size and compute type dynamically via environment variables:
docker run \
-e MODEL_SIZE=small \
-e COMPUTE_TYPE=float16 \
...
This makes it easy to test different trade-offs between speed and accuracy.
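Inside the container, the selection boils down to reading those environment variables at startup. A minimal sketch of the idea (the variable names match the docker flags above; the actual engine loading is elided):

import os

# The same knobs exposed via `docker run -e ...`
model_size = os.environ.get("MODEL_SIZE", "small")        # small, medium, large-v2, ...
compute_type = os.environ.get("COMPUTE_TYPE", "float16")  # float16 or float32

print(f"Loading Whisper '{model_size}' with {compute_type} precision")
# ...the server would hand these values to its model loader at this point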
Real-Time Streaming Transcription
One big bottleneck for voice assistants is the “dead air” while waiting for long audio recordings to finish transcribing. I’ve begun implementing chunked audio processing (a rough sketch follows the list below), allowing:
- Audio to be split into overlapping windows
- Transcripts streamed back in near-real-time
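The core idea is straightforward: split the incoming PCM into fixed-size windows that overlap slightly, so each window can be transcribed while later audio is still arriving. A minimal sketch, with illustrative window and overlap sizes rather than the server’s actual values:

import numpy as np

def overlapping_windows(audio: np.ndarray, sample_rate: int = 16000,
                        window_s: float = 5.0, overlap_s: float = 1.0):
    """Yield overlapping chunks so each can be transcribed before the recording ends."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    for start in range(0, len(audio), step):
        chunk = audio[start:start + window]
        if len(chunk) == 0:
            break
        yield chunk

# 12 seconds of silence as a stand-in for microphone audio
audio = np.zeros(12 * 16000, dtype=np.float32)
for i, chunk in enumerate(overlapping_windows(audio)):
    print(f"chunk {i}: {len(chunk) / 16000:.1f}s")  # each chunk would be sent to the STT engine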
However, Home Assistant’s Wyoming integration doesn’t yet fully support the streaming features introduced in version 1.7.0 of the protocol. As soon as it does, I’ll enable this functionality—critical for fast, natural conversations.
Smarter GPU Memory Management
TensorRT (and PyTorch) sometimes leaves GPU memory allocated to the Python process after a job finishes, even while the service is idle. My container now automatically calls:
torch.cuda.empty_cache()
after audio jobs. This reduces idle VRAM usage of the service, freeing resources for other tasks—like running LLMs for conversational AI.
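The pattern is simple: release PyTorch’s unused cached allocations once a job finishes instead of holding them between requests. A minimal sketch of what that looks like around a transcription job (the inference itself is a placeholder):

import torch

def handle_audio_job(audio_chunk: bytes) -> str:
    """Run one transcription job, then release cached GPU memory."""
    try:
        # ... run the TensorRT / PyTorch inference here ...
        return "transcript placeholder"
    finally:
        # Returns unoccupied cached blocks to the driver so the service idles lean;
        # memory still held by loaded models is unaffected.
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

print(handle_audio_job(b"\x00" * 32000))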
Wyoming GLaDOS: Fast Local TTS
Beyond transcription, I’ve been working on wyoming-glados, a Wyoming Protocol-compatible TTS server designed for fast, high-quality voice synthesis in Home Assistant.
This project builds on earlier open-source efforts, including:
- nalf3in’s wyoming-glados — a Wyoming server implementation for GLaDOS-style TTS
- R2D2FISH’s glados-tts — providing GLaDOS voice models and TTS implementations
Under the hood, the implementation is based on Tacotron for text-to-spectrogram conversion and a vocoder model to generate audio waveforms. I’ve applied many of the same optimizations used in Wyoming Whisper TRT:
- TensorRT acceleration for the Tacotron and vocoder models
- Smart batching and GPU memory management
- Rapid inference for low-latency responses
- Model and bucket warmup for lower initial processing latency on first chunks (sketched below)
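Warmup just means pushing dummy inputs of representative lengths through the models at startup, so the first real request doesn’t pay for lazy initialization or engine setup. A rough sketch of the idea, with synthesis stubbed out:

import time

def synthesize(text: str) -> bytes:
    """Placeholder for the real Tacotron + vocoder pipeline."""
    time.sleep(0.01)  # pretend to do work
    return b"\x00" * 1024

# Prompts of representative lengths; the real server's buckets may differ.
WARMUP_PROMPTS = [
    "Hello.",
    "The cake is a lie, but the warmup is real.",
    "This is a longer warmup sentence used to exercise the larger input buckets.",
]

def warm_up():
    for prompt in WARMUP_PROMPTS:
        start = time.perf_counter()
        synthesize(prompt)
        print(f"warmed bucket for {len(prompt):>2} chars in {time.perf_counter() - start:.3f}s")

warm_up()  # run once at startup, before accepting Wyoming requests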
Full Support for Partial Streaming
Unlike Whisper STT, which is still waiting on full streaming support in Home Assistant, the TTS side has been upgraded as of Home Assistant 2025.7 to support partial streaming. Wyoming GLaDOS fully implements this feature (the chunking logic is sketched after the list):
- It receives chunks of text from Home Assistant
- It waits for the end of a sentence, then generates and streams synthesized audio in small, sentence-long chunks
- Playback can begin while text transmission and synthesis continue
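A minimal sketch of that sentence-chunking logic: buffer incoming text, and each time a sentence boundary appears, hand that sentence off for synthesis so audio can start streaming right away (the synthesize call stands in for the real Tacotron + vocoder pipeline):

import re

SENTENCE_END = re.compile(r"([.!?])\s")

def synthesize(sentence: str) -> bytes:
    print(f"synthesizing: {sentence!r}")
    return b"\x00" * 512  # stand-in for real audio

def stream_tts(text_chunks):
    """Yield synthesized audio per sentence as text chunks arrive."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[:match.end(1)], buffer[match.end():]
            yield synthesize(sentence.strip())
    if buffer.strip():
        yield synthesize(buffer.strip())  # flush whatever remains at the end

chunks = ["You did well. ", "For a human, ", "I mean. ", "Goodbye."]
for audio in stream_tts(chunks):
    pass  # each audio chunk would be streamed back over the Wyoming protocol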
Combined with TensorRT acceleration, this makes GLaDOS TTS fast enough for real-time conversations and responsive automations—delivering GLaDOS’s signature style with minimal delay.
Ollama + Home Assistant = Local Voice AI
Here’s where it all ties together: Ollama.
Ollama makes it simple to run powerful large language models locally—like Llama 3 or Mistral. I’ve integrated it into Home Assistant’s voice pipeline so I can:
- Convert speech to text quickly via wyoming-whisper-trt
- Send the text and specified smart home entities to llama3.2 via Ollama (a rough sketch of the local call follows this list)
- Get back a conversational response while allowing for device control
- Synthesize that response with GLaDOS TTS
- Speak it back—all locally
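Home Assistant’s Ollama integration handles the LLM step internally, but for intuition, here is roughly what a streaming chat request to a local Ollama server looks like (the model name and prompt are just examples):

import json
import requests

# Ollama listens on localhost:11434 by default
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [
            {"role": "system", "content": "You control a smart home. Be brief."},
            {"role": "user", "content": "Turn off the kitchen lights."},
        ],
        "stream": True,
    },
    stream=True,
)

# The reply is newline-delimited JSON; each line carries a piece of the response.
for line in resp.iter_lines():
    if line:
        piece = json.loads(line)
        print(piece.get("message", {}).get("content", ""), end="", flush=True)
print()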
The goal is near-Alexa-level speed for basic voice tasks, but with:
✅ Privacy – No cloud servers needed
✅ Speed – Sub-second responses possible for shorter queries
Results So Far
Even without full streaming support yet, the current setup feels dramatically faster than my original experiments. For short queries, my pipeline looks like:
- STT (TensorRT Whisper) → ~0.3 – 0.5s
- LLM (Ollama, local) → ~0.5 – 2.5s
- TTS (GLaDOS TRT) → ~0.2s
That means simple commands can round-trip in ~2 – 3 seconds—already competitive with many cloud assistants. Once chunked transcription is live, I expect that gap to narrow further.
Compare that with the 10–12 seconds of processing time when I started this journey 18 months ago.
One big opportunity still on the table for further reducing processing time is how streaming between components works. As far as I can tell, Home Assistant’s Ollama integration receives text responses from the model in chunks, but those chunks aren’t forwarded to the Wyoming TTS system until the full response has arrived. If Home Assistant forwarded each chunk as soon as it arrived, audio generation could start ~0.25–0.5 seconds earlier on my setup, giving the impression of near-instant response times.
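To make the difference concrete, here’s a sketch of the two behaviors as I understand them: buffering the full LLM response before handing it to TTS versus forwarding each chunk as it arrives (this is my reading of the current behavior, not Home Assistant’s actual code):

def buffered(llm_chunks, send_to_tts):
    """Current behavior (as I understand it): TTS only starts after the full reply."""
    send_to_tts("".join(llm_chunks))  # waits for every chunk first

def streamed(llm_chunks, send_to_tts):
    """Desired behavior: forward chunks immediately so synthesis overlaps generation."""
    for chunk in llm_chunks:
        send_to_tts(chunk)

chunks = ["The kitchen lights ", "are now off. ", "Anything else?"]
streamed(chunks, lambda text: print(f"-> TTS: {text!r}"))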
What’s Next
- Full streaming STT once Home Assistant’s Wyoming integration supports it
- Web search abilities via Brave Search for AI, Google Places, and Wikipedia (code is about 90% complete as of this writing)
- Get the STT and TTS Docker containers to build for Jetson devices and ARM64 dGPU systems via GitHub Actions
Try It Yourself
Check out the repos: wyoming-whisper-trt for TensorRT-accelerated speech-to-text and wyoming-glados for TTS.
All open-source—and all steps closer to a truly private voice assistant experience that’s fast enough to replace cloud services.
Conclusion
Thanks to TensorRT optimizations, smarter memory handling, and local LLMs like those from Ollama, a fast, private voice assistant isn’t a pipe dream—it’s here, and it’s only getting faster.