
Streaming AI Responses: Why It Matters and How to Implement It

A 4-second AI response feels slow. The same 4-second response that starts displaying in 0.5 seconds feels responsive. Streaming is the single most impactful UX improvement you can make to an AI-powered product.

Why Streaming Matters

Perceived latency is what users experience, not actual latency. Streaming reduces perceived latency by showing output as it's generated:

  • No waiting: Users see words appearing immediately
  • Perceived 40-60% faster: Even if total time is the same
  • Lower abandonment: Users who see output starting are less likely to refresh or leave
  • Better UX feedback: Shows the AI is "thinking"

Python Backend with Flask + SSE

from flask import Flask, Response, request
import openai
import os
import json

app = Flask(__name__)

client = openai.OpenAI(
    api_key=os.environ.get("CELUXE_API_KEY"),
    base_url="https://api.celuxe.shop/v1"
)

@app.route("/stream", methods=["POST"])
def stream():
    data = request.get_json()
    prompt = data.get("prompt", "")

    def generate():
        stream = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                token = chunk.choices[0].delta.content
                yield f"data: {json.dumps({'token': token})}\n\n"

    return Response(
        generate(),
        mimetype="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
            "X-Accel-Buffering": "no"  # Disable nginx buffering
        }
    )
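
Before wiring up a frontend, it helps to smoke-test the raw SSE frames from a terminal. Here's a minimal sketch using the requests library (an assumption; it's not part of the Flask stack above), pointed at a locally running copy of the app:

import requests

# Print raw SSE frames from the /stream endpoint above.
# Assumes the Flask app is running locally on port 5000.
with requests.post(
    "http://localhost:5000/stream",
    json={"prompt": "Say hello"},
    stream=True,  # don't buffer the whole response body
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:  # skip the blank separator lines between events
            print(line)  # e.g. data: {"token": "Hel"}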

JavaScript Frontend

const response = await fetch("/stream", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({prompt: userInput})
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
let done = false;

while (!done) {
    const {value, done: doneReading} = await reader.read();
    done = doneReading;
    if (value) {
        // SSE frames look like: data: {"token": "hello"}\n\n
        // A network read can split a frame in half, so accumulate into a
        // buffer and only parse events that have fully arrived.
        buffer += decoder.decode(value, {stream: true});
        const events = buffer.split("\n\n");
        buffer = events.pop(); // keep any incomplete trailing event
        for (const event of events) {
            const lines = event.split("\n").filter(l => l.startsWith("data: "));
            for (const line of lines) {
                const data = JSON.parse(line.slice(6));
                appendToken(data.token); // show the token immediately
            }
        }
    }
}

Nginx Configuration (Important!)

If you're behind nginx, you must disable proxy buffering; otherwise nginx holds the response in its buffer and delivers tokens in large bursts (or all at once), defeating the stream:

location /stream {
    proxy_http_version 1.1;           # required for chunked responses
    proxy_cache off;                  # never cache streamed responses
    proxy_buffering off;              # the critical line: pass chunks through immediately
    chunked_transfer_encoding on;
    proxy_buffers 8 4k;
    proxy_buffer_size 4k;
    tcp_nodelay on;                   # flush small packets without delay
    proxy_pass http://localhost:5000;
}
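
Note that the X-Accel-Buffering: no header set in the Flask response above also disables proxy buffering per-response on stock nginx, so this config acts as a safety net for setups where that header gets stripped along the way.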

Error Handling in Streams

def generate():
    try:
        stream = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield f"data: {json.dumps({'token': chunk.choices[0].delta.content})}\n\n"
    except Exception as e:
        # Mid-stream failures raise exceptions rather than appearing in
        # chunks, so report them to the client as an SSE error event.
        yield f"data: {json.dumps({'error': str(e)})}\n\n"
    finally:
        # Sentinel so the client knows the stream ended.
        yield "data: [DONE]\n\n"
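
On the consuming side, the client should honor both event types: stop on the [DONE] sentinel and surface error events. A minimal Python sketch against the same hypothetical local endpoint:

import json
import requests

# Consume the stream above, honoring both the error event and the
# [DONE] sentinel. Assumes the Flask app runs on localhost:5000.
with requests.post("http://localhost:5000/stream",
                   json={"prompt": "Hi"}, stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # clean end of stream
        event = json.loads(payload)
        if "error" in event:
            raise RuntimeError(event["error"])
        print(event["token"], end="", flush=True)
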
Streaming is not optional for consumer AI products. If your competitor's response starts in 0.5s and yours takes 2s before showing anything, you've already lost the user—even if your final response is better.

Test Streaming on Celuxe

Try streaming responses with 30+ models. See the UX difference for yourself.

Get Your API Key →

Celuxe Team

We write about real production AI engineering.