
Fast, Local Voice: Home Assistant + TensorRT + Ollama

Introduction

In my last post, I showed how compiling Whisper with TensorRT dramatically boosted transcription speed for Home Assistant’s voice features. Since then, I’ve pushed the project even further—optimizing performance, reducing VRAM usage, and laying the groundwork for a voice assistant whose responsiveness rivals cloud services like Alexa or Google Assistant.

Even better, I’ve integrated local large language models via Ollama for natural conversations and built a TensorRT-powered text-to-speech system. The result? A completely private, local voice assistant that’s not only functional but fast.

Here’s what’s new.

Wyoming Whisper TensorRT Updates

Float16 by Default

Whisper model conversions to TensorRT now use float16 precision by default. This slashes both inference times and VRAM requirements, without noticeable transcription quality loss. On my GPU, the improvement has been significant, particularly for larger models.

Runtime Model Selection

Previously, swapping between Whisper model sizes (like small, medium, large-v2) meant rebuilding the container. Now, you can choose model size and compute type dynamically via environment variables:

docker run \
  -e MODEL_SIZE=small \
  -e COMPUTE_TYPE=float16 \
  ...

This makes it easy to test different trade-offs between speed and accuracy.

Real-Time Streaming Transcription

One big bottleneck for voice assistants is the “dead air” while waiting for a long recording to finish transcribing. I’ve begun implementing chunked audio processing (a rough sketch follows the list), allowing:

  • Audio to be split into overlapping windows
  • Transcripts streamed back in near-real-time
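
As a rough illustration only (not the actual wyoming-whisper-trt implementation), overlapping windows can be carved out of the incoming sample buffer like this; the window and overlap lengths are arbitrary placeholders:

import numpy as np

def chunk_audio(samples: np.ndarray, sample_rate: int = 16000,
                window_s: float = 10.0, overlap_s: float = 1.0):
    """Yield overlapping windows so each can be transcribed while
    later audio is still arriving. Sizes here are illustrative."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    for start in range(0, len(samples), step):
        yield samples[start:start + window]
        if start + window >= len(samples):
            break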

However, Home Assistant’s Wyoming integration doesn’t yet fully support the streaming features introduced in version 1.7.0 of the protocol. As soon as it does, I’ll enable this functionality—critical for fast, natural conversations.

Smarter GPU Memory Management

TensorRT (and PyTorch) can leave GPU memory allocated to the Python process after a job finishes, even while the service sits idle. My container now automatically calls:

torch.cuda.empty_cache()

after audio jobs. This reduces idle VRAM usage of the service, freeing resources for other tasks—like running LLMs for conversational AI.
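
In spirit, the cleanup hook looks like the wrapper below; model.transcribe is a stand-in for the real inference call, not the service’s actual code:

import gc
import torch

def run_transcription(model, audio):
    """Run one job, then release cached GPU memory back to the driver."""
    try:
        return model.transcribe(audio)   # placeholder for the real inference call
    finally:
        gc.collect()                     # drop lingering Python references first
        torch.cuda.empty_cache()         # return PyTorch's cached blocks to CUDA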

Wyoming GLaDOS: Fast Local TTS

Beyond transcription, I’ve been working on wyoming-glados, a Wyoming Protocol-compatible TTS server designed for fast, high-quality voice synthesis in Home Assistant.

This project builds on earlier open-source GLaDOS TTS efforts.

Under the hood, the implementation pairs Tacotron for text-to-spectrogram conversion with a vocoder model that renders the audio waveform; a sketch of this two-stage flow follows the list. I’ve applied many of the same optimizations used in Wyoming Whisper TRT:

  • TensorRT acceleration for the Tacotron and vocoder models
  • Smart batching and GPU memory management
  • Rapid inference for low-latency responses
  • Model and bucket warmup for lower initial processing latency on first chunks
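
Conceptually, synthesis is a two-stage hand-off. In this sketch, text_to_ids, tacotron, and vocoder are placeholders for the tokenizer and the TensorRT-compiled engines, not the project’s actual API:

import torch

def synthesize(text, text_to_ids, tacotron, vocoder):
    """Text -> mel spectrogram -> waveform, using placeholder callables."""
    with torch.inference_mode():
        ids = text_to_ids(text)    # normalize and tokenize the input text
        mel = tacotron(ids)        # predict a mel spectrogram from token IDs
        return vocoder(mel)        # render the spectrogram as an audio waveform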

Full Support for Partial Streaming

Unlike Whisper STT, which is still waiting on full streaming support in Home Assistant, the TTS side has been upgraded as of Home Assistant 2025.7 to support partial streaming. Wyoming GLaDOS fully implements this feature:

  • It receives chunks of text from Home Assistant
  • It waits for the end of a sentence, then generates and streams synthesized audio in small, sentence-long chunks
  • Playback can begin while transcript transmission and synthesis continue

Combined with TensorRT acceleration, this makes GLaDOS TTS fast enough for real-time conversations and responsive automations—delivering GLaDOS’s signature style with minimal delay.
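
The sentence-buffering loop, in spirit, looks like the sketch below; synthesize stands in for the Tacotron-plus-vocoder pipeline above, and the real wyoming-glados code differs in detail:

import re

SENTENCE_END = re.compile(r"([.!?])\s")

def stream_tts(text_chunks, synthesize):
    """Buffer incoming text, emit audio one full sentence at a time."""
    buffer = ""
    for chunk in text_chunks:
        buffer += chunk
        while (match := SENTENCE_END.search(buffer)):
            sentence = buffer[:match.end(1)]      # up to and including punctuation
            buffer = buffer[match.end():]         # keep the remainder buffered
            yield synthesize(sentence)            # playback can start here
    if buffer.strip():
        yield synthesize(buffer)                  # flush any trailing fragment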

Ollama + Home Assistant = Local Voice AI

Here’s where it all ties together: Ollama.

Ollama makes it simple to run powerful large language models locally, like Llama 3 or Mistral. I’ve integrated it into Home Assistant’s voice pipeline (a standalone query example follows the list) so I can:

  • Convert speech to text quickly via wyoming-whisper-trt
  • Send the text and specified smart home entities to llama3.2 via Ollama
  • Get back a conversational response while allowing for device control
  • Synthesize that response with GLaDOS TTS
  • Speak it back—all locally
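
Outside of Home Assistant, you can exercise the same model with a few lines against Ollama’s documented REST API; the prompt is just an example, and entity exposure is handled by Home Assistant’s integration, not shown here:

import json
import requests

# Stream a chat completion from a local Ollama server (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.2",
        "messages": [{"role": "user", "content": "Turn off the kitchen lights."}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if line:
        part = json.loads(line)
        print(part["message"]["content"], end="", flush=True)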

The goal is near-Alexa-level speed for basic voice tasks, but with:

✅ Privacy – No cloud servers needed
✅ Speed – Sub-second responses possible for shorter queries

Results So Far

Even without full streaming support yet, the current setup feels dramatically faster than my original experiments. For short queries, my pipeline looks like:

  • STT (TensorRT Whisper) → ~0.3–0.5 s
  • LLM (Ollama, local) → ~0.5–2.5 s
  • TTS (GLaDOS TRT) → ~0.2 s

That means simple commands can round-trip in roughly 2–3 seconds, already competitive with many cloud assistants. Once chunked transcription is live, I expect that gap to narrow further.

Compare that with the 10–12 seconds of processing time when I started this journey 18 months ago.

One big opportunity for further latency reduction is how streaming between components works. As far as I can tell, Home Assistant’s Ollama integration receives text responses in chunks, but those chunks aren’t sent to the Wyoming TTS system until all of them have arrived. If Home Assistant forwarded each chunk as soon as it arrived, audio generation could start ~0.25–0.5 seconds earlier on my setup, giving the impression of near-instant responses.

What’s Next

Try It Yourself

Check out the repos:

  • wyoming-whisper-trt
  • wyoming-glados

All open source, and each a step closer to a truly private voice assistant that’s fast enough to replace cloud services.

Conclusion

Thanks to TensorRT optimizations, smarter memory handling, and local LLMs like those from Ollama, a fast, private voice assistant isn’t a pipe dream—it’s here, and it’s only getting faster.

Jonah May

Hey there! I’m Jonah May, a Product Architect and Product Engineering Manager at CyberFortress, a Platinum VCSP dedicated to keeping data safe and recoverable. When I’m not working on backup strategies and automation, you’ll find me deeply involved in the Veeam community—as a Veeam Vanguard, Veeam Certified Architect, VCSP Technical Ambassador, and co-founder of the Veeam Community Hackathon. I also help lead the Texas and Automation Desk Veeam User Groups, where we nerd out over all things backup, automation, and infrastructure.

Beyond tech, I’m a Scout leader, having earned my Eagle Scout back in the day. I love sharing knowledge, solving problems, and making technology work smarter, not harder. If you’re into Veeam, automation, or home labs, let’s connect!
