USE-CASE RANKINGS
Best LLM by Use Case in 2026
Editorial rankings of the top 5 LLMs for 5 common production tasks. Every leaderboard is backed by Artificial Analysis benchmarks and the vendor's public pricing page.
WHY THIS MATTERS
Picking by aggregate score is how the wrong model ends up in production
Artificial Analysis Intelligence Index is a single number across many evaluations — useful as a shortcut, dangerous as a sole decision input. A model that scores 60.2 on aggregate can still be the wrong pick for a coding agent if its TerminalBench Hard pass rate is 15 points behind a model with a lower Intelligence Index. A model that wins on Tau-2 tool use may still be the wrong pick for retrieval workloads that care more about throughput and cache pricing.
Each ranking here picks weighted criteria for one task — coding, agents, cheapest cost-per-quality, fastest throughput, broadest multimodal coverage — and scores models against that mix instead of a single composite number. Every cited benchmark has a public source URL. Every cited price has a vendor pricing page. The top model on each list is the model we'd run on that workload today; the other four are real, credible runner-ups with the trade-offs called out.
MOST-READ RANKINGS
Three rankings to start with
Cheapest, best for coding, and fastest cover most of the early model-selection questions teams face. Each card shows the current top pick and links to the full ranking with criteria, benchmarks, and example prompts.
Lowest dollars-per-Intelligence-point for production LLM API workloads
Top pick
DeepSeek: DeepSeek V4 Flash · DeepSeek
DeepSeek V4 Flash takes the top slot on cost-per-quality by a wide margin.
Read the full ranking →
Software engineering, code completion, refactoring, and multi-file edits
Top pick
GPT-5.5 · OpenAI
GPT-5.
Read the full ranking →
Lowest time-to-first-token and highest sustained output tokens per second
Top pick
Gemini 3.1 Flash Lite Preview · Google
Gemini 3.
Read the full ranking →
ALL RANKINGS
Every ranking, with the top pick
All 5 rankings, sorted by use-case slug. Each card shows the rank-1 model, the provider, why it ranks where it does, and the criteria we used to score the leaderboard.
Lowest dollars-per-Intelligence-point for production LLM API workloads
Picking the cheapest LLM API by raw token price ships a worse product than picking by cost-per-quality. We rank by dollars per Artificial Analysis Intelligence point — blended (input+output)/2 divided by AA Intelligence Index — with a hard floor of AA Intelligence ≥ 30 so models that score nothing on real tasks never reach the top. DeepSeek V4 Flash wins by a wide margin at roughly $0.0045 per Intelligence point on AA Intelligence 46.5. Gemini 3.1 Flash Lite Preview, Qwen3.6 Plus, MiMo-V2.5-Pro, and DeepSeek V4 Pro round out the list at cost-per-point values between $0.026 and $0.051.
- Cost per Intelligence point
- AA Intelligence Index floor (≥ 30)
- Cache-read pricing
Top pick
DeepSeek: DeepSeek V4 Flash
DeepSeek
Read the full ranking →
Software engineering, code completion, refactoring, and multi-file edits
GPT-5.5 holds the top slot on coding workloads by margin — Artificial Analysis Coding Index 59.1 versus Claude Opus 4.7's 52.5 and Gemini 3.1 Pro Preview's 55.5, paired with the field-leading TerminalBench Hard score of 60.6%. Opus 4.7 takes second on throughput-adjusted quality: 70.6 tokens per second and Tau-2 tool use at 0.886 make it the safest pick for live coding agents where latency caps token-per-second matter. Gemini 3.1 Pro Preview is the throughput champion at 142.7 tok/s with multimodal input. The bottom of the top five — GPT-5.4 and GPT-5.3-Codex — give you cheaper inputs against the leaders' premium rate without losing the OpenAI tool-use ecosystem.
- AA Coding Index
- TerminalBench Hard
- Tau-2 tool use
Top pick
GPT-5.5
OpenAI
Read the full ranking →
Multi-step agentic workflows with reliable tool use and long-horizon planning
Tool-use reliability is the rate-limiting factor for agentic workloads, and AA's Tau-2 benchmark separates the top tier sharply. Gemini 3.1 Pro Preview leads at 0.956, followed in our weighted ranking by Qwen3.6-Plus at 0.977 (Plus edges Gemini on Tau-2 alone but trails on Intelligence Index), GPT-5.5 at 0.939, DeepSeek V4 Pro at 0.962, and Claude Opus 4.7 at 0.886. The picture changes when you weight long-horizon planning (TerminalBench Hard) and cache economics — the criteria below split agent workloads into reliability-first, throughput-first, and cost-first profiles.
- Tau-2 tool-use accuracy
- TerminalBench Hard pass rate
- AA Intelligence Index
Top pick
Gemini 3.1 Pro Preview
Google
Read the full ranking →
Combined text + image + audio + video input on a single API
Multimodal breadth is binary on most APIs — either the model accepts video natively or it doesn't, and the gap is steep. Gemini 3.1 Pro Preview is the only model in the catalog accepting text, image, audio, video, and file in one call, and it pairs that breadth with AA Intelligence Index 57.2. Gemini 3.1 Flash Lite covers the same modality matrix at 5x lower price and 2.2x the throughput. GPT-5.5 and GPT-5.4 anchor the text-+-image-+-file tier with frontier reasoning; Qwen3.6-Plus extends that with native video input at the cost-effective end. Real multimodal pipelines route by modality complexity, not by a single 'multimodal model' choice.
- Native modality breadth
- AA Intelligence Index after multimodal grounding
- Image-token pricing posture
Top pick
Gemini 3.1 Pro Preview
Google
Read the full ranking →
Lowest time-to-first-token and highest sustained output tokens per second
Latency on LLM APIs is measured along two axes — TTFT (time-to-first-token) for perceived responsiveness, and sustained tokens-per-second for output completion time. Gemini 3.1 Flash Lite Preview tops both at 321 tok/s with low TTFT on Google's infrastructure. Llama 4 Maverick comes second on 0.66s TTFT — the lowest in the catalog — and 110 tok/s. GPT-5.3-Codex (95.4 tok/s), GPT-5.4 (94.4 tok/s), and Grok 4.20 (93.8 tok/s) cluster in the 90–100 tok/s tier. Picking the right one means asking whether you optimize for the first token (chat UX) or the last token (response complete).
- Sustained output tok/s
- TTFT (time-to-first-token)
- Quality floor (AA Intelligence Index ≥ 18)
Top pick
Gemini 3.1 Flash Lite Preview
Google
Read the full ranking →
FREQUENTLY ASKED
FAQ
Why is the cheapest model not always the best choice?
Cost-per-token is a misleading axis on its own. A model that's 5x cheaper but fails 30% of acceptance tests on your workload doesn't save money — it adds retries, hand-edits, and engineering cost that swamp the inference savings. The cheapest-LLM ranking on this site uses cost-per-Intelligence-point with a hard quality floor: any model under AA Intelligence 30 is excluded regardless of price. That blend tracks real monthly spend on production workloads more closely than the raw token rate does.
How do you choose the winner for each use case?
Each ranking declares 3-5 weighted criteria up front, and weights sum to 1.0. We score every candidate against those criteria using public benchmark data (Artificial Analysis, vendor pricing pages, official model cards) and pick the top 5. The criteria, weights, source URLs, and a per-model analysis are all visible on the ranking page — you can borrow the criteria template for your own internal scorecard if your weights are different.
Can I see real benchmarks behind these rankings?
Yes. Every cited number links back to its source page — Artificial Analysis evaluations for benchmark scores, the vendor's own pricing page for per-token rates, and the provider's catalog for context window and modality data. We don't run our own benchmarks because reproducible third-party leaderboards already exist; what we add is the editorial step of picking the criteria that matter for each task.
What if the top-ranked model isn't on ElliotGate yet?
It's still ranked at #1 if it deserves the slot. We don't silently swap rankings to match the ElliotGate catalog. When a top pick is not yet onboarded, the ranking page marks the slot with a 'not yet on ElliotGate' note and links to the official model card. We use the demand signal to onboard within the following few weeks.
Do these rankings work for non-English workloads?
Most of the benchmarks we cite (Artificial Analysis, GPQA, TerminalBench, Tau-2) measure capability on English-language tasks. Models with strong English benchmarks usually transfer reasonably to other languages, but the gap varies by family — Qwen and DeepSeek tend to score better on Chinese-language tasks than their English benchmarks would predict. If you're shipping a non-English product, run a small evaluation on your own prompts before committing to a ranking.
Run the top of every list from one API key
Every ranked model accepts the same OpenAI-compatible request. Change the model slug, keep the rest of your code, and ship faster.