Streaming & Time-to-First-Token — Making Claude Feel Instant
One of the most impactful UX improvements you can make to a Claude-powered product costs nothing in tokens and requires only a few lines of code: enable streaming. Without streaming, your UI shows a spinner until Claude finishes generating the entire response, then displays it all at once — a wait that can stretch to 30+ seconds for long outputs. With streaming enabled, the first token appears within a second or two, and the response flows progressively. Users perceive streaming interfaces as dramatically faster even when the total generation time is identical.
Streaming best practices
- Use the `stream=True` parameter (Python) or the streaming helper `client.messages.stream(...)` in the SDK. Both surface a server-sent events stream you iterate over.
- Display chunks as they arrive: don't buffer the entire stream before rendering. Flush each chunk to the UI immediately — this is the entire point of streaming.
- Handle `message_delta` events: the SDK's streaming iterator surfaces typed events. Listen for `content_block_delta` for text chunks and `message_stop` for completion.
- Reduce prompt verbosity to lower TTFT: time-to-first-token correlates with input token count — a leaner prompt starts outputting faster. Prune any context that isn't essential to the specific query.
A minimal streaming loop with the Python SDK (assumes `prompt` is already defined):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-haiku-4-5",  # Haiku has the lowest TTFT
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # flush each chunk immediately
```
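The `text_stream` helper flattens the typed event stream into plain text. When you need the events themselves — for example, to detect completion explicitly — you can dispatch on the event type instead. A minimal sketch of that dispatch loop; the dataclasses here are stand-ins that mimic the shape of the SDK's `content_block_delta` and `message_stop` events, so the logic can be shown without a live API call:

```python
from dataclasses import dataclass

# Stand-in event types mimicking the shape of the SDK's streaming events.
@dataclass
class TextDelta:
    text: str

@dataclass
class ContentBlockDeltaEvent:
    type: str
    delta: TextDelta

@dataclass
class MessageStopEvent:
    type: str

def consume(events):
    """Accumulate text deltas and stop cleanly on message_stop."""
    chunks = []
    for event in events:
        if event.type == "content_block_delta":
            chunks.append(event.delta.text)  # flush to the UI here
        elif event.type == "message_stop":
            break  # generation finished
    return "".join(chunks)

# Simulated stream standing in for the SDK's event iterator.
simulated = [
    ContentBlockDeltaEvent("content_block_delta", TextDelta("Hello, ")),
    ContentBlockDeltaEvent("content_block_delta", TextDelta("world!")),
    MessageStopEvent("message_stop"),
]
print(consume(simulated))  # -> Hello, world!
```

With the real SDK, the same branching applies inside `for event in stream:` in place of the simulated list.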
Claude Haiku has the lowest time-to-first-token of the production models. For real-time chat, autocomplete, or any user-facing feature where latency is visible, default to Haiku and escalate to Sonnet or Opus only when capability requires it.
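To compare models on this axis, measure TTFT directly: take a timestamp before issuing the request and another when the first chunk arrives. A minimal sketch of that measurement; the generator below simulates a token stream with a fixed delay so the timing logic is runnable without an API call — with the SDK you would pass `stream.text_stream` instead:

```python
import time

def measure_ttft(stream):
    """Return (time_to_first_token_seconds, full_text) for an iterable of text chunks."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for text in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived: record TTFT
        chunks.append(text)
    return ttft, "".join(chunks)

def simulated_stream():
    # Stand-in for stream.text_stream: a short startup delay, then chunks.
    time.sleep(0.05)
    yield "Hello, "
    yield "world!"

ttft, text = measure_ttft(simulated_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, text: {text!r}")
```

Logging this number per model and per prompt length makes the "leaner prompt, faster first token" trade-off concrete for your own workload.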