
The Hidden Costs of AI Latency

Amazon found that every 100ms of latency costs 1% in revenue. Google found that 500ms of delay reduces traffic by 20%. For AI-powered applications, latency is not just a UX concern—it's a business metric.

Where Latency Comes From

End-to-end AI latency breaks down into three components, plus the total they add up to:

  • Network latency: time to send the request and receive the response (50-300ms depending on geography)
  • Time to First Token (TTFT): how long the model takes to start generating (0.5s - 3s)
  • Token generation speed: tokens per second as the model writes (10 - 200 tokens/sec)
  • Total generation time: TTFT + (output tokens / generation speed)

Using the numbers from the table below, a 500-token response works out to roughly 0.9s + 500/60 ≈ 9s on GPT-4o and 1.8s + 500/25 ≈ 22s on DeepSeek V3. For a user waiting on a full chat response, that gap is the difference between tolerable and painful, and a big part of why the optimizations below matter.
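
The arithmetic is simple enough to sanity-check in code. A quick helper (the function name is ours; the per-model numbers come from the table below):

def estimate_total_latency(ttft_s, output_tokens, tokens_per_sec):
    # Total generation time = TTFT + output tokens / generation speed
    return ttft_s + output_tokens / tokens_per_sec

# 500-token response: GPT-4o (~0.9s TTFT, ~60 tok/s) vs DeepSeek V3 (~1.8s, ~25 tok/s)
print(f"GPT-4o:      {estimate_total_latency(0.9, 500, 60):.1f}s")  # ~9.2s
print(f"DeepSeek V3: {estimate_total_latency(1.8, 500, 25):.1f}s")  # ~21.8s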

Measuring Latency in Production

import time
import os
import openai

# Point the standard OpenAI client at the Celuxe endpoint
client = openai.OpenAI(
    api_key=os.environ.get("CELUXE_API_KEY"),
    base_url="https://api.celuxe.shop/v1"
)

def timed_complete(prompt, model="deepseek-chat"):
    start = time.time()

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True  # Stream so TTFT can be measured separately from total time
    )

    first_token_time = None
    chunks = 0
    for chunk in response:
        # Some chunks (e.g. the final one) may carry no choices or no content
        if (first_token_time is None and chunk.choices
                and chunk.choices[0].delta.content):
            first_token_time = time.time() - start
        chunks += 1

    total_time = time.time() - start
    return {
        "total": total_time,
        "ttft": first_token_time,
        "chunks": chunks
    }

result = timed_complete("Explain quantum entanglement in 3 sentences.")
print(f"TTFT: {result['ttft']:.2f}s | Total: {result['total']:.2f}s")

Latency by Model

Model         TTFT   Output Speed   Best For
GPT-4o        0.9s   ~60 tok/s      User-facing chat
GPT-4o-mini   0.7s   ~80 tok/s      Fast completions
Claude 3.5    1.1s   ~40 tok/s      Long docs
DeepSeek V3   1.8s   ~25 tok/s      High-vol tasks

How to Optimize

  • Streaming: Always stream responses. Even when total generation time is long, the user sees output the moment the first token arrives.
  • Async I/O: Use async/await in Python to issue multiple AI requests concurrently instead of serially (see the sketch after this list).
  • Geographic routing: Deploy your servers close to your AI provider's inference endpoints to trim network round-trip time.
  • Prefilling: For known system prompts, reuse the prompt's KV cache (prompt caching) so the model skips recomputing it, reducing TTFT.
  • Smart routing: Use GPT-4o for latency-sensitive, user-facing tasks; reserve DeepSeek V3 for background processing where throughput matters more than speed.
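
Here's what the async point looks like in practice. A minimal sketch using the openai package's AsyncOpenAI client (the prompts and model choice are placeholders):

import asyncio
import os
import openai

aclient = openai.AsyncOpenAI(
    api_key=os.environ.get("CELUXE_API_KEY"),
    base_url="https://api.celuxe.shop/v1"
)

async def complete(prompt, model="deepseek-chat"):
    response = await aclient.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Summarize report A", "Summarize report B", "Summarize report C"]
    # All requests run concurrently: wall-clock time ≈ the slowest request,
    # not the sum of all of them
    results = await asyncio.gather(*(complete(p) for p in prompts))
    for text in results:
        print(text[:80])

asyncio.run(main())

For three background tasks that each take ~4s, this cuts wall-clock time from ~12s to ~4s.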

Test Latency on Your Traffic

Run your own benchmarks across 30+ models. See real TTFT and throughput numbers for your use case.

Start Free →

Celuxe Team

We write about real production AI infrastructure.