How I Built a 400ms Voice Agent From Scratch
The dirty secret of voice AI: most platforms are leaving 2x performance on the table.
Last month I built a voice AI agent for a side project. The first version using Vapi worked great—until I tested it with real users. The latency was brutal: 1.8 seconds from when the user stopped talking to when the agent responded. Users were hanging up.
I tried Bland next. Better at 1.2 seconds, but at $0.05/minute, it would cost me $500/month at projected scale. That was a non-starter for a bootstrapped project.
So I spent two weeks building my own. I hit 400ms end-to-end. Here's exactly what I did, what failed, and what actually worked.
Why Platform Solutions Failed Me
I started with Vapi because it promised quick setup. It delivered—I had a basic voice agent running in 30 minutes. But when I tested it with 10 beta users, 3 of them complained about the delay. One said it felt like talking to someone on a bad international call.
The 1.8-second latency wasn't Vapi's fault exactly. It's the architecture: they use WebSockets instead of WebRTC, and they add their own processing layer. Every millisecond adds up.
Bland was faster at 1.2 seconds, but the pricing killed it. I did the math: at 10,000 minutes/month (which was my target), I'd pay $500/month just for voice infrastructure. That's more than my entire AWS bill.
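The break-even arithmetic is simple enough to sanity-check:

```python
def monthly_voice_cost(minutes_per_month: int, price_per_minute: float) -> float:
    """Projected platform cost at a given usage level."""
    return minutes_per_month * price_per_minute

# Bland at $0.05/minute and a target of 10,000 minutes/month:
cost = monthly_voice_cost(10_000, 0.05)  # $500/month
```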
I realized if I wanted sub-500ms latency at a reasonable cost, I'd have to build it myself.
My Stack That Hit 400ms
Here's what I ended up with after two weeks of iteration:
WebRTC via Daily.co for audio streaming. This was the biggest win. WebRTC has built-in packet loss concealment and lower latency than WebSockets. Setting it up took about 2 days—I had to learn their API and figure out the audio handling, but it was worth it. Latency dropped by 200ms immediately.
Whisper API for transcription. I tested this against Google Speech-to-Text and Azure. Whisper was consistently 50-100ms faster with better accuracy. I use the whisper-1 model with the response_format=srt parameter.
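For reference, the transcription request boils down to a handful of parameters. Here's a minimal sketch of the request shape (nothing here touches the network; the filename is a placeholder, and you should verify the fields against the current OpenAI API docs):

```python
def build_transcription_request(audio_filename: str) -> dict:
    # Shape of a POST to /v1/audio/transcriptions (OpenAI Whisper API).
    # In a real request, the file is uploaded as multipart form data.
    return {
        "model": "whisper-1",
        "response_format": "srt",  # timestamped SRT output, as used above
        "file": audio_filename,
    }

request = build_transcription_request("user_turn.wav")
```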
GPT-4-turbo with streaming. The key here is streaming. I don't wait for the full response. As soon as I have the first sentence, I start TTS generation. This cuts perceived latency in half.
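The "start TTS on the first sentence" trick needs a way to spot sentence boundaries in a token stream. Here's a minimal sketch; the punctuation heuristic and function name are mine, not from my production code:

```python
def sentences_from_stream(tokens):
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    enders = (". ", "! ", "? ")
    for token in tokens:
        buffer += token
        # Naive boundary check: sentence-ending punctuation followed by a space.
        while any(p in buffer for p in enders):
            cut = min(buffer.find(p) for p in enders if p in buffer)
            yield buffer[:cut + 1]
            buffer = buffer[cut + 2:]
    if buffer.strip():
        yield buffer.strip()  # flush the trailing partial sentence
```

Each yielded sentence can be handed to TTS immediately while the model keeps streaming tokens. A real splitter needs to handle abbreviations and decimals, but this captures the idea.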
ElevenLabs Turbo v2.5 for TTS. I tried Play.ht, Azure TTS, and Amazon Polly. ElevenLabs Turbo was fastest at ~150ms for short responses. Their "premade/echo" voice worked best for my use case.
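The ElevenLabs call is a plain HTTP POST per chunk of text. A sketch of the request shape, assuming their v1 text-to-speech endpoint (the voice ID is a placeholder; double-check field names against their current API reference):

```python
def build_tts_request(text: str, voice_id: str) -> dict:
    # Shape of a POST to the ElevenLabs text-to-speech endpoint;
    # authentication goes in an "xi-api-key" header on the real request.
    return {
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        "json": {
            "text": text,
            "model_id": "eleven_turbo_v2_5",  # the Turbo model named above
        },
    }

req = build_tts_request("Hello!", "voice-id-placeholder")
```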
What Actually Made the Difference
Three optimizations mattered more than everything else combined:
1. Parallel TTS generation. Instead of waiting for GPT to finish the entire response, I stream the response sentence by sentence. As soon as I have a complete sentence, I start TTS on it while GPT continues generating. This pipelining saved 300-400ms.
2. Pre-connecting WebRTC. I establish the WebRTC connection before the user starts speaking. When they start talking, everything is already connected. This saved ~200ms.
3. Smart interruption handling. I track which audio has already been sent to the client. When the user interrupts, I know exactly where to resume from without re-generating. This makes interruptions feel instant.
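The pipelining in point 1 can be sketched with a thread pool: submit each sentence to TTS the moment it arrives, so synthesis overlaps with generation. The TTS function here is a simulated stand-in, not a real ElevenLabs call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_tts(sentence: str) -> str:
    # Stand-in for a real TTS call; just simulates synthesis latency.
    time.sleep(0.01)
    return f"audio<{sentence}>"

def pipelined_tts(sentence_stream):
    """Kick off TTS for each sentence as it arrives, while the LLM
    (the iterator) keeps producing later sentences."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fake_tts, s) for s in sentence_stream]
        # Collect results in submission order so playback stays sequential.
        return [f.result() for f in futures]
```

With a generator feeding `sentence_stream`, the first sentence's audio is typically ready before the last sentence has even been generated.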
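The bookkeeping behind point 3 is just tracking which chunks have actually reached the client. A minimal sketch (class and method names are mine, for illustration):

```python
class PlaybackTracker:
    """Track how much synthesized audio has been sent to the client,
    so an interruption can resume without re-generating anything."""

    def __init__(self):
        self.chunks = []     # all generated audio chunks (here: bytes)
        self.sent_count = 0  # how many chunks the client has received

    def add_chunk(self, chunk: bytes):
        self.chunks.append(chunk)

    def mark_sent(self):
        self.sent_count += 1

    def on_interrupt(self) -> int:
        """Drop unplayed audio and return the index to resume from."""
        self.chunks = self.chunks[:self.sent_count]
        return self.sent_count
```

On interruption, generation resumes from the returned index instead of starting the response over, which is what makes interruptions feel instant.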
What I Learned
Building this taught me that voice AI latency is all about architecture, not just model speed. You can have the fastest models in the world and still have slow responses if your pipeline is sequential.
The DIY approach saved me money but cost me time. Two weeks of development vs 30 minutes with Vapi. For my project, it was worth it. For a quick prototype, use the platforms.
If you're building voice AI and latency matters, consider the DIY route. But be prepared to spend time on the audio pipeline—it's trickier than it looks.