USE CASE RANKING

Lowest time-to-first-token and highest sustained output tokens per second

Latency on LLM APIs is measured along two axes: TTFT (time-to-first-token), which drives perceived responsiveness, and sustained tokens per second, which drives output completion time. Gemini 3.1 Flash Lite Preview leads on throughput at 321 tok/s, with low TTFT on Google's infrastructure. Llama 4 Maverick comes second, with a 0.66s TTFT (the lowest in the catalog) and 110 tok/s. GPT-5.3-Codex (95.4 tok/s), GPT-5.4 (94.4 tok/s), and Grok 4.20 (93.8 tok/s) cluster in the 90–100 tok/s tier. Picking the right one means asking whether you are optimizing for the first token (chat UX) or the last token (response complete).
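To make the two axes concrete, here is a minimal sketch of a back-of-the-envelope latency model: total time is roughly TTFT plus output length divided by throughput. The function name and the 20-token and 1,000-token output sizes are illustrative assumptions; the 0.66s and 110 tok/s inputs are the Llama 4 Maverick figures quoted above.

```javascript
// Rough end-to-end latency: wait for the first token, then stream the rest.
// This ignores network jitter and bursty streaming, so treat it as an estimate.
function estimateCompletionSeconds({ ttftSeconds, tokensPerSecond, outputTokens }) {
  if (tokensPerSecond <= 0) throw new RangeError("tokensPerSecond must be positive");
  return ttftSeconds + outputTokens / tokensPerSecond;
}

// Llama 4 Maverick figures from this page: 0.66s TTFT, 110 tok/s.
// The output sizes (20 and 1000 tokens) are assumptions for illustration.
const shortReply = estimateCompletionSeconds({ ttftSeconds: 0.66, tokensPerSecond: 110, outputTokens: 20 });
const longReply = estimateCompletionSeconds({ ttftSeconds: 0.66, tokensPerSecond: 110, outputTokens: 1000 });

console.log(shortReply.toFixed(2)); // 0.84 -> most of the wait is the first token
console.log(longReply.toFixed(2));  // 9.75 -> most of the wait is streaming
```

For the short reply, TTFT accounts for most of the wait; for the long one, throughput does. That is the first-token versus last-token choice in numbers.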


EDITOR'S TOP PICK

Recommended model

Rank #1

Gemini 3.1 Flash Lite Preview

Google

Price posture: $0.25 / $1.5 · 1M tokens


SELECTION CRITERIA

How we ranked these models

Speed isn't speed — it's two different numbers. TTFT determines whether a chat UI feels alive; sustained tok/s determines whether a long generation feels slow. We weight tok/s heaviest because most production workloads care about end-to-end completion time, but TTFT carries 30% because chat UX is felt in milliseconds.

Weight 40%: Sustained output tok/s
Tokens per second measured by Artificial Analysis (AA) on the model's standard endpoint. This dominates end-to-end completion time on outputs over a few hundred tokens.

Weight 30%: TTFT (time-to-first-token)
The latency between request and first streamed token. Felt directly by chat users; ≤ 1.5s feels native, ≥ 5s feels broken.

Weight 15%: Quality floor (AA Intelligence Index ≥ 18)
A speed ranking that ignores quality recommends models that can't answer. We exclude anything below AA Intelligence Index 18 and prefer ≥ 30.

Weight 10%: Output streaming smoothness
Some providers stream in large bursts (better total time, worse perceived smoothness); others stream evenly. AA's measurements smooth this implicitly by sampling many calls.

Weight 5%: Reasoning-effort tradeoff
Models that expose reasoning-effort settings (low/medium/high) let you trade quality for speed. We weight this lightly because not every workload tolerates effort-tier switching.
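One way to read the weights above is as a weighted sum over per-criterion scores. The sketch below assumes each criterion has already been normalized to the range 0 to 1 (1 is best); this page does not specify a normalization, so the normalization, the function name, and the field names are illustrative assumptions rather than the actual scoring method.

```javascript
// Weights taken from the selection criteria on this page.
const WEIGHTS = {
  throughput: 0.40,    // sustained output tok/s
  ttft: 0.30,          // time-to-first-token
  quality: 0.15,       // AA Intelligence Index
  smoothness: 0.10,    // output streaming smoothness
  effortControl: 0.05, // reasoning-effort tradeoff
};

// Combine normalized scores (each in [0, 1], higher is better) into one number.
// How raw measurements map onto [0, 1] is left to the caller (an assumption here).
function speedScore(normalized) {
  let total = 0;
  for (const [criterion, weight] of Object.entries(WEIGHTS)) {
    const value = normalized[criterion];
    if (typeof value !== "number" || value < 0 || value > 1) {
      throw new RangeError(`${criterion} must be a number in [0, 1]`);
    }
    total += weight * value;
  }
  return total;
}
```

Under this reading, a model that is best-in-class on throughput alone scores 0.40, which is why tok/s dominates the final order.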

TOP 5 LEADERBOARD

The ranking

Ranked by sustained output throughput from AA's latest snapshot, with ties broken by TTFT ascending. Quality floor: AA Intelligence Index ≥ 18 so we don't suggest a fast model that can't get the answer. All five accept OpenAI-compatible payloads on ElliotGate.

#	Model	Provider	Throughput	Price (in / out)
1	Gemini 3.1 Flash Lite Preview	Google	321 tok/s	$0.25 / $1.5
2	Llama 4 Maverick	Meta	110 tok/s	$0.15 / $0.6
3	GPT-5.3-Codex	OpenAI	95 tok/s	$1.75 / $14
4	GPT-5.4	OpenAI	94 tok/s	$2.5 / $15
5	Grok 4.20	xAI	94 tok/s	$2 / $6

Pricing is per 1M tokens, USD, sourced from Artificial Analysis and matched against each provider's official rate. ElliotGate charges the same per-token rate as the upstream provider.
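The ordering rule described for this leaderboard (quality floor first, then throughput descending, then TTFT ascending as a tie-breaker) can be sketched as a filter and a sort. The throughput and Intelligence Index values below are the ones quoted on this page; TTFT is filled in only where the page states it, and treating a missing TTFT as the worst case is an assumption of this sketch.

```javascript
const QUALITY_FLOOR = 18; // AA Intelligence Index floor from the criteria

const MODELS = [
  { name: "GPT-5.4", tokensPerSecond: 94.4, intelligenceIndex: 56.8 },
  { name: "Llama 4 Maverick", tokensPerSecond: 110, intelligenceIndex: 18.4, ttftSeconds: 0.66 },
  { name: "Grok 4.20", tokensPerSecond: 93.8, intelligenceIndex: 49.3 },
  { name: "Gemini 3.1 Flash Lite Preview", tokensPerSecond: 321, intelligenceIndex: 33.5 },
  { name: "GPT-5.3-Codex", tokensPerSecond: 95.4, intelligenceIndex: 53.6 },
];

// Lower TTFT wins; an unknown TTFT sorts last (assumption).
function compareTtft(a, b) {
  const ta = a.ttftSeconds ?? Infinity;
  const tb = b.ttftSeconds ?? Infinity;
  if (ta === tb) return 0;
  return ta < tb ? -1 : 1;
}

// Drop models below the floor, then sort by throughput (desc), TTFT (asc).
function rankBySpeed(models) {
  return models
    .filter((m) => m.intelligenceIndex >= QUALITY_FLOOR)
    .sort((a, b) => b.tokensPerSecond - a.tokensPerSecond || compareTtft(a, b));
}

rankBySpeed(MODELS).forEach((m, i) => console.log(`${i + 1}. ${m.name}`));
```

Because `filter` returns a new array, the input list is left untouched, which makes it safe to re-rank the same data with a different floor.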

MODEL-BY-MODEL ANALYSIS

Why each model placed where it did

#1
Gemini 3.1 Flash Lite Preview
Google
Gemini 3.1 Flash Lite Preview leads this ranking with 321 tokens per second of sustained output, roughly three times the throughput of the next model (110 tok/s) and roughly 5x that of frontier reasoning models. TTFT is also in the low-latency tier (Google's infrastructure runs lean on Flash Lite). Its AA Intelligence Index of 33.5 is the quality trade-off for the top slot: Flash Lite is not a frontier reasoning model, but it clears our preferred threshold of 30 and handles routing, summarization, and structured extraction reliably. Pricing of $0.25 input / $1.5 output is the second-cheapest on this list. Best fit: real-time chat UX, streaming product surfaces, and voice-driven agents where every millisecond shows.
Strengths
- 321 tok/s — 3x faster than the runner-up
- Multimodal text+image+audio+video+file in/out
- $0.25 / $1.5 — cheap enough for streaming product surfaces
- 1M context
Weaknesses
- AA Intelligence Index 33.5 — not frontier reasoning
- Tau-2 0.313 — weak for agent tool use
- Preview status — vendor reserves right to change
#2
Llama 4 Maverick
Meta
Llama 4 Maverick records the lowest TTFT in the entire ElliotGate catalog at 0.66 seconds, meaningfully ahead of every other production model. Its sustained throughput of 110 tok/s ranks second on this list. The catch is quality: its AA Intelligence Index of 18.4 sits barely above our floor of 18, and a Tau-2 score of 0.178 rules it out for agent workloads. Pair Maverick with slower, smarter models in a routing setup: Maverick takes the first turn or the warm-up, and the smarter models handle hard branches. Best fit: streaming UX where first-token latency dominates perception (autocomplete, predictive typing, fast prefill).
Strengths
- TTFT 0.66s — lowest in the catalog
- 110 tok/s sustained
- $0.15 / $0.6 — cheapest fast model
- 1M context
Weaknesses
- AA Intelligence Index 18.4 — narrow quality posture
- Tau-2 0.178 — not viable for agents
- Max output 16K — long answers truncate
#3
GPT-5.3-Codex
OpenAI
GPT-5.3-Codex posts 95.4 tokens per second, making it the fastest OpenAI model in the speed tier we accept and the fastest model on this list with an AA Intelligence Index above 50. An index of 53.6 keeps it in real-reasoning territory, and a Tau-2 score of 0.86 supports tool use. Pricing at $1.75 input / $14 output is moderate. The trade-off is breadth: the Codex line is specialized for software engineering, so general-knowledge tasks are better served by GPT-5.4. For coding agents that need both speed and quality, this is often the right balance point.
Strengths
- 95.4 tok/s — fastest non-frontier-tier model with AA Intelligence Index > 50
- AA Intelligence Index 53.6 — real reasoning preserved
- Codex-line tuning for coding workloads
- 400K context
Weaknesses
- Specialized for coding — narrow on general knowledge
- Not as cheap as the top two
- Image-only multimodal (no audio/video)
#4
GPT-5.4
OpenAI
GPT-5.4 posts 94.4 tok/s, 1 tok/s behind GPT-5.3-Codex, but with a broader knowledge profile (AA Intelligence Index 56.8 vs 53.6 for Codex). Pricing of $2.5 input / $15 output sits in the mid tier. The combination of frontier-tier reasoning, low TTFT when reasoning effort is low, and 94.4 tok/s makes it the speed champion among general-purpose frontier-quality models. Best fit: when chat UX matters but answers must hold up to scrutiny, and you don't need GPT-5.5's marginal benchmark lead. It is the default choice for production assistant workloads on a latency budget.
Strengths
- 94.4 tok/s with AA Intelligence Index 56.8
- Frontier reasoning at speed-tier latency
- 1M context + file input
- Reasoning-effort tier lets you trade speed for quality
Weaknesses
- $2.5 / $15 — not cheap enough for high-volume streaming
- Image-only multimodal
- TTFT higher at xhigh effort
#5
Grok 4.20
xAI
Grok 4.20 rounds out the top five at 93.8 tok/s with AA Intelligence Index 49.3, GPQA 0.911, and Tau-2 0.93: strong reasoning at speed-tier latency. The headline feature is the 2M context window, twice the largest window elsewhere on this list. Pricing of $2 input / $6 output undercuts both OpenAI models on output ($14 and $15), though not Gemini 3.1 Flash Lite Preview ($1.5) or Llama 4 Maverick ($0.6). The trade-off: its AA Intelligence Index of 49.3 sits below GPT-5.4 (56.8) and GPT-5.3-Codex (53.6). Best fit: speed-first workloads with extremely long inputs (call transcripts, codebases, document archives) where the 2M window prevents chunking overhead.
Strengths
- 2M context — 2x anything else on the list
- Tau-2 0.93 + GPQA 0.911
- $2 / $6: output rate well below both OpenAI models ($14 and $15)
- Cache read $0.2 — sustained-prompt economics work
Weaknesses
- AA Intelligence Index 49.3 — below the GPT-5 line
- Image input only, no file or video
- Smaller knowledge base on enterprise-specific verticals

EXAMPLE PROMPTS

Three prompts you can run today

Paste these into the ElliotGate playground or your own SDK. Each prompt exercises a different part of the task and gives you a real signal on which model fits your workload.

Predictive text on a typing UI

Prompt

Complete the next sentence for an email being composed. The user has written: 'Hi Sarah, thanks for your quick reply yesterday. I wanted to circle back on the timeline we discussed —' Continue with one sentence that sounds natural. Stop after the period.

Expected behavior

One natural sentence streamed in roughly one second, most of it spent waiting for the first token. Llama 4 Maverick wins on TTFT (0.66s); Gemini 3.1 Flash Lite wins on total completion time when the user keeps typing.

Streaming chat answer to a developer

Prompt

Explain in 250 words why JavaScript's `==` produces unexpected results for `null == undefined`, but `===` does not. Include the exact spec-defined behavior. Stream the answer.

Expected behavior

Stable streaming under 3 seconds total. GPT-5.4 and GPT-5.3-Codex strike the right balance of fast streaming and accurate spec citation; pure-speed models risk surface-level explanations.

Long-document summary at speed-tier latency

Prompt

Below is a 60-page board memo (mostly tables). Produce a one-page executive summary with three sections — Decisions, Risks, Next Steps. Stream the answer. Reading begins immediately on first token; do not wait until the analysis is complete to start writing.

Expected behavior

Substantive content arrives in the first 2 seconds. Grok 4.20's 2M context handles the long input without chunking; Gemini 3.1 Flash Lite finishes fastest overall.
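To check the expected behaviors above against your own traffic, you can time a streamed response. The sketch below is a hypothetical helper (`measureStream` is not part of any SDK) that accepts any async iterable of OpenAI-style chat completion chunks, such as the stream returned by `client.chat.completions.create({ ..., stream: true })` in the OpenAI Node SDK. It counts streamed chunks rather than tokens, so the rate is only an approximation of tok/s.

```javascript
// Time a streamed chat completion: TTFT, total time, and chunk rate.
// `chunks` is any async iterable of OpenAI-style chunks; `now` returns milliseconds
// and is injectable so the helper can be tested with a fake clock.
async function measureStream(chunks, now = () => performance.now()) {
  const start = now();
  let firstTokenAt = null;
  let pieces = 0;
  let text = "";

  for await (const chunk of chunks) {
    const delta = chunk.choices?.[0]?.delta?.content;
    if (!delta) continue; // skip role-only or empty chunks
    if (firstTokenAt === null) firstTokenAt = now();
    pieces += 1;
    text += delta;
  }

  const end = now();
  const streamMs = firstTokenAt === null ? 0 : end - firstTokenAt;
  return {
    ttftSeconds: firstTokenAt === null ? null : (firstTokenAt - start) / 1000,
    totalSeconds: (end - start) / 1000,
    // Chunks are not tokens; providers may pack several tokens into one chunk.
    chunksPerSecond: streamMs > 0 ? (pieces * 1000) / streamMs : null,
    text,
  };
}

// Usage (assumes the `client` from the Quick start snippet):
// const stream = await client.chat.completions.create({ model, messages, stream: true });
// console.log(await measureStream(stream));
```

Running each prompt several times per model and comparing `ttftSeconds` with `totalSeconds` shows which of the two numbers your workload actually feels.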

QUICK START

Switch models with one line

Every ranked model accepts the same OpenAI-compatible request body. Change the model slug, keep the rest of the code, and you are routing across vendors with one API key.

Node.js

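// Uses top-level await: run as an ES module (.mjs or "type": "module").
import OpenAI from "openai";

// OpenAI SDK client pointed at ElliotGate's OpenAI-compatible endpoint.
const client = new OpenAI({
  apiKey: process.env.OMINIGATE_API_KEY, // sk-omg-...
  baseURL: "https://api.elliotgate.com/v1",
});

const response = await client.chat.completions.create({
  model: "google/gemini-3.1-flash-lite-preview", // swap to any prod slug
  messages: [{ role: "user", content: "..." }],
});

// Print the assistant's reply text.
console.log(response.choices[0].message.content);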

QUESTIONS WE GET

Frequently asked

Why isn't GPT-5.5 in this ranking?

GPT-5.5 measures 63 tok/s in AA's xhigh-effort snapshot, well below the speed cutoff for this ranking. At lower reasoning effort it speeds up but loses Coding Index and TerminalBench Hard points. The ranking principle: if you want both top-tier quality and speed-tier latency, GPT-5.4 is the right balance; if you need GPT-5.5's specific benchmark lead, you have to give it extra latency budget.

Should I optimize for TTFT or tok/s?

TTFT for short answers (under 200 tokens), total tok/s for long answers (over 500 tokens). The split point is roughly when the user finishes reading the first few sentences; by then the rest of the answer needs to stream smoothly. For a typing UX or autocomplete, TTFT is everything; for a streaming explanation, total tok/s is what determines whether the user waits at the end. Most chat products instrument both and watch the worse one.
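The rule of thumb in this answer can be written down as a small helper. The 200 and 500 token thresholds come from the text; returning "both" for the range in between is an assumption of this sketch, as are the function name and return labels.

```javascript
// Pick the latency metric to prioritize from the expected output length.
function primaryLatencyMetric(expectedOutputTokens) {
  if (expectedOutputTokens < 200) return "ttft";              // short answers: first token matters
  if (expectedOutputTokens > 500) return "tokens_per_second"; // long answers: throughput matters
  return "both"; // in-between lengths: instrument both and watch the worse one
}

console.log(primaryLatencyMetric(30));   // "ttft"
console.log(primaryLatencyMetric(1200)); // "tokens_per_second"
```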

How does prompt caching affect TTFT?

Cached prefixes dramatically cut TTFT: when the system prompt and tools are cached, the model only has to process the new turn. Most providers report a 50-70% TTFT reduction on cache hits. This compounds with the speed-tier model choice: a Gemini Pro Preview with cache hits often beats a Flash Lite without cache on TTFT, even though Flash Lite is faster in isolation. Always cache the stable prefix in production.

Can I pair a fast model with a smarter one?

Yes, this is the standard routing pattern. Send the first turn or the warm-up to Maverick or Flash Lite for sub-second TTFT, and route hard branches to GPT-5.4 or Opus 4.7 in the background. Many chat products show the fast model's response immediately and quietly upgrade to the smart model when the user clicks 'expand' or 'why?'. ElliotGate's single-key model makes this routing pattern straightforward: no second account, no second balance.
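A minimal sketch of that routing decision follows. The fast slug is the one used in the Quick start snippet; the smart slug `openai/gpt-5.4` is a hypothetical placeholder, and the three input flags are illustrative names for signals your application would supply.

```javascript
const FAST_MODEL = "google/gemini-3.1-flash-lite-preview"; // slug from the Quick start snippet
const SMART_MODEL = "openai/gpt-5.4"; // hypothetical slug; check the model catalog for the real one

// Choose a model slug for one turn of a conversation.
// First turns always go to the fast model so the UI responds immediately;
// later turns escalate when the user asks for more or the turn looks hard.
function pickModel({ isFirstTurn = false, userAskedForDetail = false, looksHard = false } = {}) {
  if (isFirstTurn) return FAST_MODEL;
  if (userAskedForDetail || looksHard) return SMART_MODEL;
  return FAST_MODEL;
}

console.log(pickModel({ isFirstTurn: true }));        // fast model
console.log(pickModel({ userAskedForDetail: true })); // smart model
```

Since every model takes the same request body, the returned slug can be passed straight to the `model` field of the request.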

Will I see these numbers in production?

Roughly yes for steady-state behavior, with two caveats. AA's measurements are medians across many calls, so your TTFT may spike on cold starts or under upstream load. Geographic distance from the provider's edge also affects TTFT (Gemini's edge in Asia is fast; in Africa it's slower). Measure your own region with realistic prompts before locking in a speed-tier choice. ElliotGate's dashboard shows per-request latency so you can sanity-check.
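Because a median hides cold-start spikes, it helps to look at a tail percentile alongside it when measuring your own region. The sketch below uses the nearest-rank percentile method; the sample TTFT values are made up for illustration and are not measurements of any model.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Does not mutate the input.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError("need at least one sample");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Hypothetical TTFT samples in seconds, including one cold-start spike.
const ttftSamples = [0.7, 0.6, 0.9, 3.2, 0.8];

console.log(percentile(ttftSamples, 50)); // 0.8 (median looks healthy)
console.log(percentile(ttftSamples, 95)); // 3.2 (tail exposes the spike)
```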

Can lowering reasoning effort make GPT-5.5 fast enough?

Partially. At minimal effort, GPT-5.5's throughput rises to ~80 tok/s and its TTFT falls to ~5s, which is meaningfully faster than the xhigh snapshot but still slower than GPT-5.4 at default effort. Effort tiers trade benchmark score for latency, so the right framing is: 'what is the minimum quality I can ship at the fastest speed?' Most production users find that the right answer is a faster baseline model (5.4) at default effort, rather than a slower premium model (5.5) at minimal effort.

Stop A/B-ing with vendor sprawl. Run the top 5 from one key.

Every model in this ranking is one slug change away on ElliotGate. Same SDK, same balance, same dashboard.

