Question 1

Why is the cheapest model not always the best choice?

Accepted Answer

Cost-per-token is a misleading axis on its own. A model that's 5x cheaper but fails 30% of acceptance tests on your workload doesn't save money — it adds retries, hand-edits, and engineering cost that swamp the inference savings. The cheapest-LLM ranking on this site uses cost-per-Intelligence-point with a hard quality floor: any model under AA Intelligence 30 is excluded regardless of price. That blend tracks real monthly spend on production workloads more closely than the raw token rate does.
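A minimal sketch of how a cost-per-Intelligence-point ranking with a quality floor could be computed. It assumes the metric is simply price divided by score, which the answer above does not spell out. The model names, prices, and scores are made-up illustrations, not real data.

```python
from dataclasses import dataclass

QUALITY_FLOOR = 30  # floor quoted in the answer above


@dataclass
class Model:
    name: str            # hypothetical model name
    usd_per_mtok: float  # illustrative price per million tokens
    intelligence: float  # illustrative AA Intelligence score


def rank_by_cost_per_point(models, floor=QUALITY_FLOOR):
    """Drop models under the floor, then sort by dollars per Intelligence point."""
    eligible = [m for m in models if m.intelligence >= floor]
    # Assumption: cost-per-point = price / score; lower is better.
    return sorted(eligible, key=lambda m: m.usd_per_mtok / m.intelligence)


models = [
    Model("model-a", 0.20, 25),  # cheapest, but under the floor
    Model("model-b", 1.00, 50),
    Model("model-c", 3.00, 60),
]
ranked = rank_by_cost_per_point(models)
print([m.name for m in ranked])
```

In this toy data the cheapest model never appears in the output, because the floor removes it before price is considered at all.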

Question 2

How do you choose the winner for each use case?

Accepted Answer

Each ranking declares 3 to 5 weighted criteria up front, and the weights sum to 1.0. We score every candidate against those criteria using public benchmark data (Artificial Analysis, vendor pricing pages, official model cards) and pick the top 5. The criteria, weights, source URLs, and a per-model analysis are all visible on the ranking page, so you can borrow the criteria template for your own internal scorecard if your weights are different.
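The weighted-criteria scoring described above can be sketched roughly as follows. The criterion names, weights, and per-model scores are hypothetical, and the sketch assumes each criterion has already been normalized to a 0 to 1 scale where higher is better. The site's actual normalization is not described here.

```python
import math


def weighted_score(scores, weights):
    """Weighted sum of per-criterion scores; the weights must sum to 1.0."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    return sum(weights[c] * scores[c] for c in weights)


def top_n(candidates, weights, n=5):
    """Return the names of the n highest-scoring candidates, best first."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: weighted_score(item[1], weights),
        reverse=True,
    )
    return [name for name, _ in ranked[:n]]


# Hypothetical criteria and scores, for illustration only.
weights = {"quality": 0.5, "price": 0.3, "latency": 0.2}
candidates = {
    "model-a": {"quality": 0.9, "price": 0.4, "latency": 0.6},
    "model-b": {"quality": 0.7, "price": 0.9, "latency": 0.8},
    "model-c": {"quality": 0.5, "price": 1.0, "latency": 0.9},
}
print(top_n(candidates, weights))
```

To adapt this as an internal scorecard, keep the candidate scores and swap in your own weights. The ordering can change substantially when, say, latency matters more than price.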

Question 3

Can I see real benchmarks behind these rankings?

Accepted Answer

Yes. Every cited number links back to its source page — Artificial Analysis evaluations for benchmark scores, the vendor's own pricing page for per-token rates, and the provider's catalog for context window and modality data. We don't run our own benchmarks because reproducible third-party leaderboards already exist; what we add is the editorial step of picking the criteria that matter for each task.

Question 4

What if the top-ranked model isn't on ElliotGate yet?

Accepted Answer

It's still ranked at #1 if it deserves the slot. We don't silently swap rankings to match the ElliotGate catalog. When a top pick is not yet onboarded, the ranking page marks the slot with a 'not yet on ElliotGate' note and links to the official model card. We treat that gap as a demand signal and use it to onboard the model within the following few weeks.

Question 5

Do these rankings work for non-English workloads?

Accepted Answer

Most of the benchmarks we cite (Artificial Analysis, GPQA, TerminalBench, Tau-2) measure capability on English-language tasks. Models with strong English benchmarks usually transfer reasonably well to other languages, but the gap varies by family: Qwen and DeepSeek tend to score better on Chinese-language tasks than their English benchmarks would predict. If you're shipping a non-English product, run a small evaluation on your own prompts before committing to a top-ranked model.
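One way to structure such a small evaluation is a pass-rate check over your own prompts. In the sketch below, `generate` stands in for whatever client call you use to reach a model, and `fake_model` is a hypothetical stub so the example runs without network access. The prompts and checks are placeholders for your real workload.

```python
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    cases: list[tuple[str, Callable[[str], bool]]],
) -> float:
    """Fraction of (prompt, check) cases whose model output passes its check."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    # Hypothetical stub: replace with a real API call to the model under test.
    return "Bonjour" if "French" in prompt else "???"


# Placeholder cases; use prompts in your product's actual target language.
cases = [
    ("Say hello in French", lambda out: "bonjour" in out.lower()),
    ("Say hello in German", lambda out: "hallo" in out.lower()),
]
rate = pass_rate(fake_model, cases)
print(f"pass rate: {rate:.0%}")
```

Running the same cases against each shortlisted model gives a like-for-like comparison on your own language mix, which the English-centric public benchmarks cannot provide.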

Best LLM by Use Case in 2026

Picking by aggregate score is how the wrong model ends up in production

Three rankings to start with

Every ranking, with the top pick

Lowest dollars-per-Intelligence-point for production LLM API workloads

Software engineering, code completion, refactoring, and multi-file edits

Multi-step agentic workflows with reliable tool use and long-horizon planning

Combined text + image + audio + video input on a single API

Lowest time-to-first-token and highest sustained output tokens per second
