Skip to content
Seedance 2.0 Face is here — generate video from real-person reference photos.Try it now

USE-CASE RANKINGS

Best LLM by Use Case in 2026

Editorial rankings of the top 5 LLMs for 5 common production tasks. Every leaderboard is backed by Artificial Analysis benchmarks and the vendor's public pricing page.

WHY THIS MATTERS

Picking by aggregate score is how the wrong model ends up in production

Artificial Analysis Intelligence Index is a single number across many evaluations — useful as a shortcut, dangerous as a sole decision input. A model that scores 60.2 on aggregate can still be the wrong pick for a coding agent if its TerminalBench Hard pass rate is 15 points behind a model with a lower Intelligence Index. A model that wins on Tau-2 tool use may still be the wrong pick for retrieval workloads that care more about throughput and cache pricing.

Each ranking here picks weighted criteria for one task — coding, agents, cheapest cost-per-quality, fastest throughput, broadest multimodal coverage — and scores models against that mix instead of a single composite number. Every cited benchmark has a public source URL. Every cited price has a vendor pricing page. The top model on each list is the model we'd run on that workload today; the other four are real, credible runner-ups with the trade-offs called out.

ALL RANKINGS

Every ranking, with the top pick

All 5 rankings, sorted by use-case slug. Each card shows the rank-1 model, the provider, why it ranks where it does, and the criteria we used to score the leaderboard.

FREQUENTLY ASKED

FAQ

  • Why is the cheapest model not always the best choice?

    Cost-per-token is a misleading axis on its own. A model that's 5x cheaper but fails 30% of acceptance tests on your workload doesn't save money — it adds retries, hand-edits, and engineering cost that swamp the inference savings. The cheapest-LLM ranking on this site uses cost-per-Intelligence-point with a hard quality floor: any model under AA Intelligence 30 is excluded regardless of price. That blend tracks real monthly spend on production workloads more closely than the raw token rate does.

  • How do you choose the winner for each use case?

    Each ranking declares 3-5 weighted criteria up front, and weights sum to 1.0. We score every candidate against those criteria using public benchmark data (Artificial Analysis, vendor pricing pages, official model cards) and pick the top 5. The criteria, weights, source URLs, and a per-model analysis are all visible on the ranking page — you can borrow the criteria template for your own internal scorecard if your weights are different.

  • Can I see real benchmarks behind these rankings?

    Yes. Every cited number links back to its source page — Artificial Analysis evaluations for benchmark scores, the vendor's own pricing page for per-token rates, and the provider's catalog for context window and modality data. We don't run our own benchmarks because reproducible third-party leaderboards already exist; what we add is the editorial step of picking the criteria that matter for each task.

  • What if the top-ranked model isn't on ElliotGate yet?

    It's still ranked at #1 if it deserves the slot. We don't silently swap rankings to match the ElliotGate catalog. When a top pick is not yet onboarded, the ranking page marks the slot with a 'not yet on ElliotGate' note and links to the official model card. We use the demand signal to onboard within the following few weeks.

  • Do these rankings work for non-English workloads?

    Most of the benchmarks we cite (Artificial Analysis, GPQA, TerminalBench, Tau-2) measure capability on English-language tasks. Models with strong English benchmarks usually transfer reasonably to other languages, but the gap varies by family — Qwen and DeepSeek tend to score better on Chinese-language tasks than their English benchmarks would predict. If you're shipping a non-English product, run a small evaluation on your own prompts before committing to a ranking.

Run the top of every list from one API key

Every ranked model accepts the same OpenAI-compatible request. Change the model slug, keep the rest of your code, and ship faster.