OpenAI Explains How It Built a Low-Latency Voice AI Stack at Scale
OpenAI has published a detailed engineering breakdown of the custom WebRTC architecture it built to deliver low-latency voice AI across its platform. The system serves more than 900 million weekly active users and powers both ChatGPT voice and the Realtime API, according to a blog post authored by staff engineers Yi Zhang and William McDonald.
The core challenge, as OpenAI describes it, was that traditional WebRTC deployments assign one dedicated UDP port to every active session. That model breaks down inside Kubernetes, where pods are constantly added, removed, and rescheduled. Large public port ranges become difficult to expose, secure, and load balance at the concurrency levels at which OpenAI operates.
How the Relay-Plus-Transceiver Design Works
Rather than forcing backend services to act as full WebRTC peers, OpenAI split the problem into two layers. A lightweight relay handles UDP packet forwarding through a small, fixed public surface. Behind it, a stateful transceiver owns the entire WebRTC session, including ICE connectivity checks, the DTLS handshake, SRTP encryption keys, and session lifecycle.
The relay never decrypts media or runs ICE state machines. It reads just enough packet metadata to route traffic to the correct transceiver, then gets out of the way. From a client's perspective, nothing about the standard WebRTC flow changes.
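As a rough illustration of what "just enough packet metadata" can mean in practice: the first byte of each UDP datagram, interpreted through the demultiplexing ranges standardized in RFC 7983, already separates STUN from DTLS from RTP. The sketch below is an illustration of that general technique, not OpenAI's actual relay code:

```go
package main

import "fmt"

// classify peeks at the first byte of a UDP payload using the
// RFC 7983 demultiplexing ranges. That alone lets a relay tell
// STUN from DTLS from (S)RTP without decrypting anything.
func classify(pkt []byte) string {
	if len(pkt) == 0 {
		return "empty"
	}
	switch b := pkt[0]; {
	case b <= 3:
		return "stun" // 0-3: STUN, including the binding request that carries the ufrag
	case b >= 20 && b <= 63:
		return "dtls" // 20-63: DTLS handshake, owned by the transceiver, not the relay
	case b >= 128 && b <= 191:
		return "rtp" // 128-191: RTP/RTCP, i.e. encrypted media to forward as-is
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(classify([]byte{0x00, 0x01})) // a STUN binding request starts 0x00 0x01
	fmt.Println(classify([]byte{0x16}))       // DTLS handshake record type 22
	fmt.Println(classify([]byte{0x80}))       // RTP version 2 sets the top bit
}
```

A relay built this way only branches on the packet type: STUN packets trigger route resolution, everything else is forwarded along the established mapping.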
First-packet routing relies on a protocol-native trick: the ICE username fragment, or ufrag. OpenAI generates each server-side ufrag with embedded routing metadata so the relay can determine the destination cluster and owning transceiver without blocking on an external lookup. Once the initial STUN binding request lands, subsequent packets flow through a cached in-memory mapping. A Redis layer stores route state for faster recovery if a relay restarts.
Global Relay Cuts the First Hop
OpenAI deployed this relay pattern across geographically distributed ingress points it calls Global Relay. Cloudflare handles geo and proximity steering for signaling, so the initial HTTP or WebSocket request reaches a nearby transceiver cluster. The SDP answer returned to the client advertises the closest Global Relay address.
Shorter first hops translate directly into lower jitter, fewer packet-loss bursts, and faster connection setup. Users can start speaking sooner because both signaling and the first ICE connectivity check resolve through a nearby entry point.
What Developers and Users Should Know
The relay service is written in Go and built on Pion, the open-source WebRTC library. OpenAI credits Pion creator Sean DuBois and original WebRTC architect Justin Uberti, both now OpenAI employees, for foundational work that made the system possible.
Performance tuning stayed practical rather than exotic. The team used Linux socket options like SO_REUSEPORT to distribute packets across workers, pinned UDP-reading goroutines to OS threads for better cache locality, and pre-allocated buffers to minimize garbage collection. No kernel-bypass framework was needed.
For developers building on the Realtime API, the key takeaway is that OpenAI's infrastructure now routes low-latency voice AI traffic through a thin, horizontally scalable forwarding layer without requiring any client-side changes. Standard WebRTC behavior stays intact across browsers and mobile platforms.
The full technical walkthrough is available on OpenAI's engineering blog.