AI Technology April 23, 2026

OpenAI Updates Responses API with WebSockets to Boost Agentic Speed
OpenAI said Tuesday it has made agent loops in its Responses API 40% faster end to end by using WebSockets, connection-scoped caching and other changes aimed at reducing overhead in Codex-style workflows. The company said the work was designed to help users see the speed gains from a new coding model, GPT-5.3-Codex-Spark, which it said can run at more than 1,000 tokens per second on specialized Cerebras hardware.

The changes matter because OpenAI said agentic tasks can involve dozens of back-and-forth API requests as a system scans code, builds context, makes edits and runs tests. As model inference gets faster, OpenAI said the API itself has become a bigger part of total latency for developers using agent loops.
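The cost pattern OpenAI describes can be sketched with a simple simulation. The numbers below are assumed for illustration only, not measurements from the post: the point is that a fixed per-request overhead is paid once per loop step, so it compounds across dozens of steps.

```python
# Illustrative cost model of a synchronous agent loop (assumed numbers,
# not OpenAI's): each HTTP round trip pays a fixed overhead on top of
# the model's inference time, and that overhead recurs every step.
REQUEST_OVERHEAD_S = 0.15   # assumed fixed cost per synchronous request
INFERENCE_S = 0.50          # assumed model "thinking" time per step

def run_agent_loop(steps: int) -> float:
    """Simulate a Codex-style loop: plan, run a tool, send the result back.

    With one synchronous request per step, the fixed overhead is paid
    `steps` times. Returns total simulated wall-clock seconds.
    """
    total = 0.0
    for _ in range(steps):
        total += REQUEST_OVERHEAD_S + INFERENCE_S  # one round trip per step
    return total

# A 30-step task pays 30 * 0.15 = 4.5 s of pure request overhead.
print(round(run_agent_loop(30), 2))  # 19.5
```

Under these assumed numbers, shaving the per-request overhead matters more as the model's own inference time shrinks, which is the dynamic the post describes.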

When the API became the bottleneck

In a post dated April 22, 2026, OpenAI engineers Brian Yu and Ashwin Nathan said earlier flagship models such as GPT-5 and GPT-5.2 ran at about 65 tokens per second in the Responses API. The company said its target for GPT-5.3-Codex-Spark was more than 1,000 tokens per second.
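Those throughput figures translate directly into wall-clock time per response. The 2,000-token response length below is an assumed example for the arithmetic, not a figure from the post:

```python
# Back-of-envelope generation time for an assumed 2,000-token response
# at the throughput figures OpenAI cited.
TOKENS = 2000       # assumed response length (illustrative)
OLD_RATE = 65       # tokens/s: GPT-5 and GPT-5.2, per the post
NEW_RATE = 1000     # tokens/s: GPT-5.3-Codex-Spark target

old_s = TOKENS / OLD_RATE   # seconds at the older rate
new_s = TOKENS / NEW_RATE   # seconds at the target rate
print(round(old_s, 1), round(new_s, 1))  # 30.8 2.0
```

At that gap, roughly 31 seconds versus 2, fixed API overhead that was once negligible next to generation time becomes a visible share of total latency.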

OpenAI said the faster model was enabled by specialized Cerebras hardware optimized for large language model inference. To let users feel that speed in practice, OpenAI said it had to cut API overhead.

The company said it began a performance sprint around November 2025 and made several changes to the critical path for a single request. Those included caching rendered tokens and model configuration in memory so the system could skip expensive tokenization and network calls for multi-turn responses.
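A connection-scoped cache of that kind might look like the sketch below. The class and method names are hypothetical, assumed for illustration; OpenAI did not publish implementation details:

```python
# Hypothetical sketch of connection-scoped caching: rendered tokens and
# model configuration live in memory for the lifetime of one connection,
# so multi-turn requests skip re-tokenization and repeat network fetches.

class ConnectionCache:
    """In-memory cache tied to a single connection's lifetime (assumed design)."""

    def __init__(self):
        self._rendered_tokens: dict[str, list[int]] = {}
        self._model_config: dict | None = None

    def rendered_tokens(self, turn_id: str, render):
        # Tokenize each conversation turn at most once per connection.
        if turn_id not in self._rendered_tokens:
            self._rendered_tokens[turn_id] = render()
        return self._rendered_tokens[turn_id]

    def model_config(self, fetch):
        # Fetch model configuration over the network only on first use.
        if self._model_config is None:
            self._model_config = fetch()
        return self._model_config

# Usage: the expensive render runs once; later turns hit the cache.
calls = {"render": 0}
def fake_render():
    calls["render"] += 1
    return [1, 2, 3]

cache = ConnectionCache()
cache.rendered_tokens("turn-1", fake_render)
cache.rendered_tokens("turn-1", fake_render)
print(calls["render"])  # 1
```

The design choice this illustrates is scoping: because the cache dies with the connection, it needs no cross-request invalidation protocol, which is what makes it cheap to sit on the critical path.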

OpenAI also said it reduced network-hop latency by removing calls to intermediate services.

WebSockets replace repeated synchronous calls

OpenAI said the biggest change was building a persistent connection to the Responses API instead of relying on a series of synchronous API calls. The company said that approach cut repeated request overhead in agent loops, where a model decides its next action, a tool runs on a computer, and the tool output is sent back to the API.
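The persistent-connection pattern can be sketched with a local stand-in. The sketch below uses a plain asyncio TCP socket, and the wire format and roles are placeholders, not OpenAI's actual WebSocket protocol; the point is that every loop step reuses one long-lived connection instead of opening a fresh request:

```python
import asyncio

# Sketch of a persistent-connection agent loop, using a local asyncio
# echo server as a stand-in for the API side. Message format is invented
# for illustration; only the connection-reuse pattern is the point.

async def api_stub(reader, writer):
    # Stand-in "API": reads a tool result, replies with the model's next
    # action, all on the same long-lived connection.
    while data := await reader.readline():
        writer.write(b"next-action:" + data)
        await writer.drain()
    writer.close()

async def agent_loop(steps: int) -> int:
    server = await asyncio.start_server(api_stub, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    # One connection for the whole loop: no per-step handshake.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    replies = 0
    for i in range(steps):
        writer.write(f"tool-output-{i}\n".encode())  # send tool result
        await writer.drain()
        await reader.readline()                      # receive next action
        replies += 1
    writer.close()
    server.close()
    await server.wait_closed()
    return replies

print(asyncio.run(agent_loop(5)))  # 5
```

With repeated synchronous HTTPS calls, each of those five exchanges would pay its own connection setup; here the setup cost is paid once up front, which is the overhead reduction the post attributes to WebSockets.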

According to the post, the agent loop in Codex can include scanning a codebase for relevant files, reading those files to build context, making edits and running tests to confirm a fix. OpenAI said each step can add minutes of waiting time when handled through repeated back-and-forth requests.

The company said its work also included improving the safety stack so issues could be flagged more quickly. OpenAI did not provide a separate breakdown for how much each change contributed to the 40% improvement.

OpenAI said the goal was to preserve a familiar API while making the underlying stack more incremental. The company described the work as part of a broader effort to let developers use faster models without being slowed by service overhead.

OpenAI said it will continue refining the Responses API as agentic workflows become more common and as faster inference puts more pressure on network and service latency.