Amazon found that every 100ms of latency costs 1% in revenue. Google found that 500ms of delay reduces traffic by 20%. For AI-powered applications, latency is not just a UX concern—it's a business metric.
Where Latency Comes From
AI latency breaks down into three components, plus the total they produce:
- Network latency: Time to send request + receive response (50-300ms depending on geography)
- Time to First Token (TTFT): How fast the model starts generating (0.5s - 3s)
- Token generation speed: Tokens per second as model writes (10 - 200 tokens/sec)
- Total generation time: TTFT + (output tokens / generation speed)
For a 500-token response: GPT-4o delivers in ~1.5s total. DeepSeek V3 takes ~4s. For a user waiting for a chat response, that's the difference between "fast" and "this feels slow."
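To see how the pieces combine, here is the formula from the list above as a one-liner. The TTFT and throughput values are illustrative assumptions, not benchmarks:

```python
def estimate_total_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Total generation time = TTFT + output tokens / generation speed."""
    return ttft_s + output_tokens / tokens_per_s

# Illustrative values only -- plug in measured numbers for your model.
print(estimate_total_seconds(ttft_s=0.5, tokens_per_s=100, output_tokens=500))  # 5.5
```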
Measuring Latency in Production
```python
import os
import time

import openai

client = openai.OpenAI(
    api_key=os.environ.get("CELUXE_API_KEY"),
    base_url="https://api.celuxe.shop/v1"
)

def timed_complete(prompt, model="deepseek-chat"):
    start = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True  # stream so TTFT can be measured
    )
    first_token_time = None
    chunks = 0
    for chunk in response:
        # Record TTFT on the first chunk that actually carries content.
        # (Guard on chunk.choices: some providers send a final empty chunk.)
        if first_token_time is None and chunk.choices and chunk.choices[0].delta.content:
            first_token_time = time.time() - start
        chunks += 1
    total_time = time.time() - start
    return {
        "total": total_time,
        "ttft": first_token_time,
        "chunks": chunks
    }

result = timed_complete("Explain quantum entanglement in 3 sentences.")
print(f"TTFT: {result['ttft']:.2f}s | Total: {result['total']:.2f}s")
```
How to Optimize
- Streaming: Always stream responses. Even when total generation time is long, a fast TTFT means the user sees output immediately (first sketch after this list).
- Async I/O: Use async/await in Python to run multiple AI requests concurrently instead of serially (second sketch below).
- Geographic routing: Deploy your servers close to your AI provider's inference endpoints to cut network round-trip time.
- Prefilling: For long, static system prompts, use prompt caching so the provider can reuse the precomputed KV cache, reducing TTFT.
- Smart routing: Use GPT-4o for latency-sensitive, user-facing tasks; reserve DeepSeek V3 for background processing (third sketch below).
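For streaming, here is what that looks like in practice: the same client as above, printing tokens as they arrive so the user perceives progress immediately (the model name and prompt are just examples):

```python
stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True
)
for chunk in stream:
    # Guard: some providers send a final chunk with empty choices
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```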
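For async I/O, a minimal sketch using the OpenAI SDK's async client to fan out several requests concurrently (the prompts and model are placeholders):

```python
import asyncio
import os

import openai

async_client = openai.AsyncOpenAI(
    api_key=os.environ.get("CELUXE_API_KEY"),
    base_url="https://api.celuxe.shop/v1"
)

async def complete(prompt: str) -> str:
    response = await async_client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

async def main():
    prompts = ["Summarize TCP slow start.", "Define TTFT.", "What is a KV cache?"]
    # gather() runs the requests concurrently: wall time is roughly the
    # slowest single request, not the sum of all of them
    results = await asyncio.gather(*(complete(p) for p in prompts))
    for r in results:
        print(r[:80])

asyncio.run(main())
```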
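And for smart routing, the simplest version is a lookup keyed on how latency-sensitive the task is. The tier names here are assumptions to illustrate the pattern:

```python
# Hypothetical routing table: interactive traffic gets the fast model,
# background jobs get the cheaper, slower one.
MODEL_BY_TIER = {
    "interactive": "gpt-4o",       # a user is waiting on the response
    "background": "deepseek-chat"  # batch jobs, no one watching
}

def pick_model(tier: str) -> str:
    return MODEL_BY_TIER.get(tier, "deepseek-chat")

print(pick_model("interactive"))  # gpt-4o
```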
Test Latency on Your Traffic
Run your own benchmarks across 30+ models. See real TTFT and throughput numbers for your use case.
Start Free →