USE CASE RANKING

Lowest time-to-first-token and highest sustained output tokens per second

Latency on LLM APIs is measured along two axes: TTFT (time-to-first-token) for perceived responsiveness, and sustained tokens per second for output completion time. Gemini 3.1 Flash Lite Preview tops the ranking at 321 tok/s, with low TTFT on Google's infrastructure. Llama 4 Maverick comes second with a 0.66 s TTFT, the lowest in the catalog, and 110 tok/s. GPT-5.3-Codex (95.4 tok/s), GPT-5.4 (94.4 tok/s), and Grok 4.20 (93.8 tok/s) cluster in the 90–100 tok/s tier. Picking the right one means asking whether you are optimizing for the first token (chat UX) or the last token (response complete).
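As a rough illustration of how the two axes combine, end-to-end completion time can be approximated as TTFT plus output tokens divided by sustained throughput. This is a back-of-the-envelope sketch, not AA's methodology: the function name `estimateCompletionSeconds` is hypothetical, the output lengths are illustrative, and the inputs are the Llama 4 Maverick figures quoted above.

```javascript
// Rough model: total time ≈ TTFT + outputTokens / tokensPerSecond.
// Ignores network jitter, queueing, and bursty streaming.
function estimateCompletionSeconds(ttftSeconds, tokensPerSecond, outputTokens) {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return ttftSeconds + outputTokens / tokensPerSecond;
}

// Llama 4 Maverick figures from the paragraph above: 0.66 s TTFT, 110 tok/s.
const shortReply = estimateCompletionSeconds(0.66, 110, 50);   // TTFT is most of the wait
const longReply = estimateCompletionSeconds(0.66, 110, 1000);  // throughput dominates

console.log(shortReply.toFixed(2), longReply.toFixed(2)); // prints "1.11 9.75"
```

For the 50-token reply, more than half of the wait is TTFT; for the 1,000-token reply, TTFT is under a tenth of the total, which is why the two numbers matter for different workloads.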

EDITOR'S TOP PICK

Recommended model

Rank #1

Gemini 3.1 Flash Lite Preview

Google

Price posture: $0.25 / $1.5 · 1M tokens

SELECTION CRITERIA

How we ranked these models

Speed isn't speed — it's two different numbers. TTFT determines whether a chat UI feels alive; sustained tok/s determines whether a long generation feels slow. We weight tok/s heaviest because most production workloads care about end-to-end completion time, but TTFT carries 30% because chat UX is felt in milliseconds.

  1. Sustained output tok/s

    Weight: 40%

    Tokens per second as measured by Artificial Analysis (AA) on the model's standard endpoint. This dominates end-to-end completion time on outputs over a few hundred tokens.

  2. TTFT (time-to-first-token)

    Weight: 30%

    The latency between request and first streamed token. Felt directly by chat users; ≤ 1.5s feels native, ≥ 5s feels broken.

  3. Quality floor (AA Intelligence Index ≥ 18)

    Weight: 15%

    A speed ranking that ignores quality recommends models that can't answer. We exclude anything below AA Intelligence Index 18 and prefer ≥ 30.

  4. Output streaming smoothness

    Weight: 10%

    Some providers stream in large bursts (better total time, worse perceived smoothness); others stream evenly. AA's measurements smooth this implicitly by sampling many calls.

  5. Reasoning-effort tradeoff

    Weight: 5%

    Models that expose reasoning-effort settings (low/medium/high) let you trade quality for speed. We weight this lightly because not every workload tolerates effort-tier switching.
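One way to read the weights above is as a weighted sum over per-criterion scores, with the quality floor acting as a hard filter. The sketch below is an assumption about how such a composite could be computed, not the formula actually used for this ranking: the `compositeScore` function, the criterion keys, and the idea that each input is pre-normalized to the range 0 to 1 are all hypothetical.

```javascript
// Weights copied from the criteria list above; they sum to 1.
const WEIGHTS = {
  throughput: 0.40,
  ttft: 0.30,
  quality: 0.15,
  smoothness: 0.10,
  effortControl: 0.05,
};

const QUALITY_FLOOR = 18; // AA Intelligence Index cutoff stated above

// `normalized` holds hypothetical per-criterion scores already scaled to [0, 1],
// where 1 is best (so a low TTFT maps to a high `ttft` score).
function compositeScore(intelligenceIndex, normalized) {
  if (intelligenceIndex < QUALITY_FLOOR) return null; // excluded from the ranking
  let total = 0;
  for (const [criterion, weight] of Object.entries(WEIGHTS)) {
    const value = normalized[criterion];
    if (typeof value !== "number" || value < 0 || value > 1) {
      throw new RangeError(`${criterion} must be a number in [0, 1]`);
    }
    total += weight * value;
  }
  return total;
}

// Illustrative inputs only; these normalized values are made up for the example.
const example = compositeScore(33.5, {
  throughput: 1, ttft: 0.8, quality: 0.5, smoothness: 0.5, effortControl: 0,
});
console.log(example); // about 0.765
```

How raw measurements map to the 0 to 1 scale (for example, min-max scaling across the catalog) changes the resulting order, so treat any composite like this as a starting point for your own weighting.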

TOP 5 LEADERBOARD

The ranking

Ranked by sustained output throughput from AA's latest snapshot, with ties broken by TTFT ascending. Quality floor: AA Intelligence Index ≥ 18 so we don't suggest a fast model that can't get the answer. All five accept OpenAI-compatible payloads on ElliotGate.

#  Model                          Provider  Throughput  Price (in / out)
1  Gemini 3.1 Flash Lite Preview  Google    321 tok/s   $0.25 / $1.5
2  Llama 4 Maverick               Meta      110 tok/s   $0.15 / $0.6
3  GPT-5.3-Codex                  OpenAI    95 tok/s    $1.75 / $14
4  GPT-5.4                        OpenAI    94 tok/s    $2.5 / $15
5  Grok 4.20                      xAI       94 tok/s    $2 / $6

Pricing is per 1M tokens, USD, sourced from Artificial Analysis and matched against each provider's official rate. ElliotGate charges the same per-token rate as the upstream provider.
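To turn the per-1M-token prices into a per-request figure, multiply each token count by its rate and divide by one million. The sketch below uses the leaderboard prices as given; the `requestCostUsd` helper and the 2,000-in / 500-out request shape are illustrative assumptions, and cache-read discounts are not modeled.

```javascript
// Prices are USD per 1M tokens, copied from the leaderboard above.
const PRICES = {
  "Gemini 3.1 Flash Lite Preview": { input: 0.25, output: 1.5 },
  "Llama 4 Maverick": { input: 0.15, output: 0.6 },
  "GPT-5.3-Codex": { input: 1.75, output: 14 },
  "GPT-5.4": { input: 2.5, output: 15 },
  "Grok 4.20": { input: 2, output: 6 },
};

function requestCostUsd(model, inputTokens, outputTokens) {
  if (!Object.prototype.hasOwnProperty.call(PRICES, model)) {
    throw new Error(`Unknown model: ${model}`);
  }
  const price = PRICES[model];
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Hypothetical request shape: 2,000 input tokens, 500 output tokens.
for (const model of Object.keys(PRICES)) {
  console.log(model, requestCostUsd(model, 2000, 500).toFixed(6));
}
```

At this request shape the output rate accounts for at least half of the cost on every model except Grok 4.20, so output-heavy workloads should compare the second price in each pair first.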

MODEL-BY-MODEL ANALYSIS

Why each model placed where it did

  1. #1

    Gemini 3.1 Flash Lite Preview

    Google

    Gemini 3.1 Flash Lite Preview leads the speed ranking with 321 tokens per second of sustained output, roughly three times the throughput of the next model (110 tok/s) and about five times that of frontier reasoning models. TTFT is also in the low-latency tier (Google's infrastructure runs lean on Flash Lite). Its AA Intelligence Index of 33.5 is modest for a top slot: Flash Lite is not a frontier reasoning model, but it clears the preferred AA Intelligence Index of 30 by a comfortable margin and handles routing, summarization, and structured extraction reliably. Pricing of $0.25 input / $1.5 output is the second-cheapest on this list. Best fit: real-time chat UX, streaming product surfaces, and voice-driven agents where every millisecond shows.

    Strengths

    • 321 tok/s — 3x faster than the runner-up
    • Multimodal text+image+audio+video+file in/out
    • $0.25 / $1.5 — cheap enough for streaming product surfaces
    • 1M context

    Weaknesses

    • AA Intelligence Index 33.5 — not frontier reasoning
    • Tau-2 0.313 — weak for agent tool use
    • Preview status — vendor reserves right to change
    Verify on Artificial Analysis
  2. #2

    Llama 4 Maverick

    Meta

    Llama 4 Maverick records the lowest TTFT in the entire ElliotGate catalog at 0.66 seconds, meaningfully ahead of every other production model. Sustained throughput of 110 tok/s holds second place on this list. The catch is quality: an AA Intelligence Index of 18.4 sits right at the floor of what we accept (18), and a Tau-2 score of 0.178 rules it out for agent workloads. Pair Maverick with smarter, slower models in a routing setup: Maverick takes the first turn or the warm-up, and smarter models handle the hard branches. Best fit: streaming UX where first-token latency dominates perception (autocomplete, predictive typing, fast prefill).

    Strengths

    • TTFT 0.66s — lowest in the catalog
    • 110 tok/s sustained
    • $0.15 / $0.6 — cheapest fast model
    • 1M context

    Weaknesses

    • AA Intelligence Index 18.4 — narrow quality posture
    • Tau-2 0.178 — not viable for agents
    • Max output 16K — long answers truncate
    Verify on Artificial Analysis
  3. #3

    GPT-5.3-Codex

    OpenAI

    GPT-5.3-Codex posts 95.4 tokens per second, making it the fastest OpenAI model in the speed tier we accept and the fastest model on this list with an AA Intelligence Index above 50. Its AA Intelligence Index of 53.6 keeps it in real-reasoning territory, and a Tau-2 score of 0.86 supports tool use. Pricing at $1.75 input / $14 output is moderate. The trade-off is breadth: the Codex line is specialized for software engineering, so general-knowledge tasks are better served by GPT-5.4. For coding agents that need both speed and quality, this is often the right balance point.

    Strengths

    • 95.4 tok/s: fastest model on this list with AA Intelligence Index > 50
    • AA Intelligence Index 53.6 — real reasoning preserved
    • Codex-line tuning for coding workloads
    • 400K context

    Weaknesses

    • Specialized for coding — narrow on general knowledge
    • Not as cheap as the top two
    • Image-only multimodal (no audio/video)
    Verify on Artificial Analysis
  4. #4

    GPT-5.4

    OpenAI

    GPT-5.4 posts 94.4 tok/s, within one token per second of GPT-5.3-Codex, but with a broader knowledge profile (AA Intelligence Index 56.8 vs 53.6 for Codex). $2.5 input / $15 output sits in mid-tier pricing. The combination of frontier-tier reasoning, low TTFT (when reasoning effort is low), and 94.4 tok/s makes it the speed champion among general-purpose frontier-quality models. Best fit: when chat UX matters but answers must hold up to scrutiny, and you don't need GPT-5.5's marginal benchmark lead. It is the default choice for production assistant workloads on a latency budget.

    Strengths

    • 94.4 tok/s with AA Intelligence Index 56.8
    • Frontier reasoning at speed-tier latency
    • 1M context + file input
    • Reasoning-effort tier lets you trade speed for quality

    Weaknesses

    • $2.5 / $15 — not cheap enough for high-volume streaming
    • Image-only multimodal
    • TTFT higher at xhigh effort
    Verify on Artificial Analysis
  5. #5

    Grok 4.20

    xAI

    Grok 4.20 rounds out the top five at 93.8 tok/s with AA Intelligence Index 49.3, GPQA 0.911, and Tau-2 0.93: strong reasoning at speed-tier latency. The headline feature is the 2M context window, twice the next-largest window on this list. Pricing of $2 input / $6 output undercuts both OpenAI models on this list on output ($14 and $15). The trade-off: AA Intelligence Index 49.3 sits below GPT-5.4 (56.8) and GPT-5.3-Codex (53.6). Best fit: speed-first workloads with extremely long inputs (call transcripts, codebases, document archives) where the 2M window prevents chunking overhead.

    Strengths

    • 2M context: 2x the next-largest window on the list
    • Tau-2 0.93 + GPQA 0.911
    • $2 / $6: output rate well below the two OpenAI models in the top five
    • Cache read $0.2 — sustained-prompt economics work

    Weaknesses

    • AA Intelligence Index 49.3 — below the GPT-5 line
    • Image input only, no file or video
    • Smaller knowledge base on enterprise-specific verticals
    Verify on Artificial Analysis

EXAMPLE PROMPTS

Three prompts you can run today

Paste these into the ElliotGate playground or your own SDK. Each prompt exercises a different part of the task and gives you a real signal on which model fits your workload.

Predictive text on a typing UI

Prompt
Complete the next sentence for an email being composed. The user has written: 'Hi Sarah, thanks for your quick reply yesterday. I wanted to circle back on the timeline we discussed —' Continue with one sentence that sounds natural. Stop after the period.
Expected behavior

One natural sentence that starts streaming in under a second (the lowest TTFT in the catalog is 0.66 s). Llama 4 Maverick wins on TTFT; Gemini 3.1 Flash Lite wins on total completion time when the user keeps typing.

Streaming chat answer to a developer

Prompt
Explain in 250 words why JavaScript's `==` produces unexpected results for `null == undefined`, but `===` does not. Include the exact spec-defined behavior. Stream the answer.
Expected behavior

Stable streaming under 3 seconds total. GPT-5.4 and GPT-5.3-Codex strike the right balance of fast streaming and accurate spec citation; pure-speed models risk surface-level explanations.

Long-document summary at speed-tier latency

Prompt
Below is a 60-page board memo (mostly tables). Produce a one-page executive summary with three sections — Decisions, Risks, Next Steps. Stream the answer. Reading begins immediately on first token; do not wait until the analysis is complete to start writing.
Expected behavior

Substantive content arrives in the first 2 seconds. Grok 4.20's 2M context handles the long input without chunking; Gemini 3.1 Flash Lite finishes fastest overall.

QUICK START

Switch models with one line

Every ranked model accepts the same OpenAI-compatible request body. Change the model slug, keep the rest of the code, and you are routing across vendors with one API key.

Node.js
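import OpenAI from "openai";

// Point the official OpenAI SDK at ElliotGate's OpenAI-compatible endpoint.
// Run as an ES module, since the request below uses top-level await.
const client = new OpenAI({
  apiKey: process.env.OMINIGATE_API_KEY, // sk-omg-...
  baseURL: "https://api.elliotgate.com/v1",
});

const response = await client.chat.completions.create({
  model: "google/gemini-3.1-flash-lite-preview", // swap to any prod slug
  messages: [{ role: "user", content: "..." }],
});

// Print the assistant's reply.
console.log(response.choices[0].message.content);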

QUESTIONS WE GET

Frequently asked

GPT-5.5 measures 63 tok/s in AA's xhigh-effort snapshot — well below the speed cutoff for this ranking. At lower reasoning effort it speeds up but loses Coding Index and TerminalBench Hard points. The ranking principle: if you want both top-tier quality and speed-tier latency, GPT-5.4 is the right balance; if you need 5.5's specific benchmark lead, it has to be paired with extra latency budget.
TTFT for short answers (under 200 tokens), total tok/s for long answers (over 500 tokens). The split point is roughly when the user finishes reading the first few sentences — by then the rest of the answer needs to stream smoothly. For a typing UX or autocomplete, TTFT is everything; for a streaming explanation, total tok/s is what determines whether the user waits at the end. Most chat products instrument both and watch the worse one.
Cached prefixes dramatically cut TTFT — when the system prompt and tools are cached, the model only has to process the new turn. Most providers report 50-70% TTFT reduction on cache hits. This compounds with the speed-tier model choice: a Gemini Pro Preview with cache hits often beats a Flash Lite without cache on TTFT, even though Flash Lite is faster in isolation. Always cache the stable prefix in production.
Yes — this is the standard routing pattern. Send the first turn or the warm-up to Maverick or Flash Lite for sub-second TTFT, and route hard branches to GPT-5.4 or Opus 4.7 in the background. Many chat products show the fast model's response immediately and quietly upgrade to the smart model when the user clicks 'expand' or 'why?'. ElliotGate's single-key model makes this routing pattern straightforward — no second account, no second balance.
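A minimal sketch of the routing pattern described above, assuming the caller already knows whether a turn is a "hard branch". Only the Flash Lite slug appears in this article; the GPT-5.4 slug, the `isHardBranch` flag, and the `pickModel` and `answer` helpers are hypothetical names for illustration. `client` is expected to be an OpenAI SDK client configured as in the quick start.

```javascript
const FAST_MODEL = "google/gemini-3.1-flash-lite-preview"; // slug from the quick start
const SMART_MODEL = "openai/gpt-5.4"; // hypothetical slug; check the catalog for the real one

// Hypothetical policy: hard branches go to the smart model, everything else
// (first turn, warm-up) goes to the fast model.
function pickModel({ isHardBranch }) {
  return isHardBranch ? SMART_MODEL : FAST_MODEL;
}

// `client` is an OpenAI-compatible client; only the model slug changes per call.
async function answer(client, messages, { isHardBranch = false } = {}) {
  const response = await client.chat.completions.create({
    model: pickModel({ isHardBranch }),
    messages,
  });
  return response.choices[0].message.content;
}
```

How a turn gets classified as hard (a user clicking "expand", a classifier, a token-length threshold) is a product decision that this sketch leaves to the caller.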
Roughly yes for steady-state behavior, with two caveats. AA's measurements are medians across many calls, so your TTFT may spike on cold starts or under upstream load. Geographic distance from the provider's edge also affects TTFT (Gemini's edge in Asia is fast; in Africa it's slower). Measure your own region with realistic prompts before locking in a speed-tier choice. ElliotGate's dashboard shows per-request latency so you can sanity-check.
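For measuring your own region, one option is to time a streamed response yourself. The sketch below assumes OpenAI-style streaming chunks (text in `choices[0].delta.content`); the `measureStream` helper, its `startStream` parameter, and the injectable `now` clock are hypothetical names. It counts content chunks rather than tokens, so the rate is only an approximation of tok/s.

```javascript
// Times a streamed response. `startStream` is a function that sends the request
// and returns (a promise of) an async iterable of OpenAI-style chunks.
// `now` returns milliseconds and is injectable for testing.
async function measureStream(startStream, now = () => performance.now()) {
  const start = now(); // start the clock before the request is sent
  const stream = await startStream();
  let firstChunkAt = null;
  let lastChunkAt = start;
  let contentChunks = 0;

  for await (const chunk of stream) {
    const text = chunk.choices?.[0]?.delta?.content;
    if (!text) continue; // skip role-only or empty deltas
    lastChunkAt = now();
    if (firstChunkAt === null) firstChunkAt = lastChunkAt;
    contentChunks += 1;
  }

  if (firstChunkAt === null) {
    return { ttftMs: null, chunksPerSecond: null, contentChunks: 0 };
  }
  const streamingMs = lastChunkAt - firstChunkAt;
  return {
    ttftMs: firstChunkAt - start,
    // Chunks only approximate tokens; a chunk may carry more than one token.
    chunksPerSecond: streamingMs > 0 ? ((contentChunks - 1) * 1000) / streamingMs : null,
    contentChunks,
  };
}

// Usage with the quick-start client (not executed here):
// const stats = await measureStream(() =>
//   client.chat.completions.create({ model, messages, stream: true }));
```

Run it many times per model and compare medians rather than single calls, since cold starts and upstream load can skew any one measurement.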
Partially. GPT-5.5 at minimal effort rises to ~80 tok/s and TTFT falls to ~5 s, which is meaningfully faster than the xhigh snapshot but still slower than GPT-5.4 at default effort. Effort tiers trade benchmark score for latency, so the right framing is: 'what is the minimum quality I can ship at the fastest speed?' Most production users find that the right answer is a faster baseline model (5.4) at default effort, rather than a slower premium model (5.5) at minimal effort.

Stop A/B-ing with vendor sprawl. Run the top 5 from one key.

Every model in this ranking of lowest time-to-first-token and highest sustained output tokens per second is one slug change away on ElliotGate. Same SDK, same balance, same dashboard.