Skip to content
Seedance 2.0 Face is here — generate video from real-person reference photos.Try it now
USE CASE RANKING

Lowest dollars-per-Intelligence-point for production LLM API workloads

Picking the cheapest LLM API by raw token price ships a worse product than picking by cost-per-quality. We rank by dollars per Artificial Analysis Intelligence point — blended (input+output)/2 divided by AA Intelligence Index — with a hard floor of AA Intelligence ≥ 30 so models that score nothing on real tasks never reach the top. DeepSeek V4 Flash wins by a wide margin at roughly $0.0045 per Intelligence point on AA Intelligence 46.5. Gemini 3.1 Flash Lite Preview, Qwen3.6 Plus, MiMo-V2.5-Pro, and DeepSeek V4 Pro round out the list at cost-per-point values between $0.026 and $0.051.

Top pick
DeepSeek: DeepSeek V4 Flash
DeepSeek

EDITOR'S TOP PICK

Recommended model

Rank #1

DeepSeek: DeepSeek V4 Flash

DeepSeek

DeepSeek V4 Flash takes the top slot on cost-per-quality by a wide margin.

Price posture$0.14 / $0.28 · 1M tokens

SELECTION CRITERIA

How we ranked these models

Cost-first selection has to keep one foot on the quality side, or the savings buy a model that fails acceptance tests. The criteria below place dollars-per-Intelligence-point at the top — that single ratio dominates monthly spend on real workloads — then add a quality floor, cache pricing, throughput, and context window. Models scoring AA Intelligence under 30 do not enter the ranking even if their headline token price is the lowest on the market.

  1. Weight
    40%

    Cost per Intelligence point

    Blended (input + output) / 2 divided by AA Intelligence Index. Treats price and quality as one ratio. On retrieval and generation-mixed workloads this number tracks actual monthly spend more closely than the raw token rate does.

    View source
  2. Weight
    20%

    AA Intelligence Index floor (≥ 30)

    Hard quality floor — anything under 30 is excluded regardless of price. Below this line the model fails enough acceptance tests on production workloads that the cost savings stop translating into product savings.

    View source
  3. Weight
    15%

    Cache-read pricing

    Stable-prefix traffic — system prompts, retrieved chunks, multi-turn agent context — runs into cache. A cached-input rate of $0.03 vs $0.20 reshapes total cost more than a 10% gap on the headline number, so models with published cache-read prices rank higher.

    View source
  4. Weight
    15%

    Output throughput

    AA-measured tokens per second. Below 30 tok/s the model becomes a batch-only tool; below 60 tok/s interactive UX suffers visibly. Cost wins evaporate if every request blocks for minutes.

    View source
  5. Weight
    10%

    Context window ≥ 32K

    Budget routes still need to hold realistic payloads — a long support thread, a few PDFs, a small retrieval window. Anything under 32K forces chunking infrastructure that eats back the savings in engineering time.

    View source

TOP 5 LEADERBOARD

The ranking

Sorted ascending by dollars-per-Intelligence-point — the blended (input+output)/2 divided by AA Intelligence Index. Ties are broken by raw AA Intelligence Index so the higher-quality model wins. Every listed model accepts the same OpenAI-compatible request shape on ElliotGate; one API key swaps across all five.

#ModelProviderInput per 1MThroughputPrice (in / out) 
1DeepSeek: DeepSeek V4 FlashDeepSeekInput $0.1467 tok/s$0.14 / $0.28Open in ElliotGate
2Gemini 3.1 Flash Lite PreviewGoogleInput $0.25321 tok/s$0.25 / $1.5Open in ElliotGate
3Qwen3.6 PlusQwenInput $0.552 tok/s$0.5 / $3Open in ElliotGate
4MiMo-V2.5-ProXiaomiInput $158 tok/s$1 / $3Open in ElliotGate
5DeepSeek: DeepSeek V4 ProDeepSeekInput $1.7430 tok/s$1.74 / $3.48Open in ElliotGate

Pricing is per 1M tokens, USD, sourced from Artificial Analysis and matched against each provider's official rate. ElliotGate charges the same per-token rate as the upstream provider.

MODEL-BY-MODEL ANALYSIS

Why each model placed where it did

  1. #1

    DeepSeek: DeepSeek V4 Flash

    DeepSeek

    DeepSeek V4 Flash takes the top slot on cost-per-quality by a wide margin. The math: $0.14 input and $0.28 output average to $0.21 per million tokens, divided by AA Intelligence Index 46.5 gives roughly $0.0045 per Intelligence point — about 5x lower than the runner-up. The score itself is the part that surprises people. AA Intelligence 46.5 lands in the same band as mid-tier reasoning models, GPQA Diamond reaches 0.894, and Tau-2 tool use scores 0.95, close to the leaders. Output speed is 67 tokens per second — slower than Gemini Flash Lite but enough for streaming chat once the first token lands at 0.82 seconds. Context is 1M tokens with a 384K max output, which fits long answers and chain-of-thought traces without truncation. Cache reads at $0.028 keep multi-turn agent prompts in the floor band, which compounds: a chat product with a stable 4K system prompt and ten turns per session frequently sees 60-80% of input billing flow through cache reads, and at that volume the V4 Flash cache rate translates directly into the lowest production bill in the ranking. The model is text-only, so workloads that need native image or audio input still need a multimodal companion — Gemini 3.1 Flash Lite Preview is the obvious pair when you need vision below the same cost band. Reach for V4 Flash as the default budget model for retrieval, summarization, batch enrichment, and prototype workloads.

    Strengths

    • $0.0045 per Intelligence point — roughly 5x lower than the runner-up
    • $0.14 / $0.28 per 1M with cache-read at $0.028
    • AA Intelligence Index 46.5 — usable on real production tasks
    • Tau-2 0.95 makes it viable inside agent loops
    • 1M context + 384K max output

    Weaknesses

    • 67 tok/s output — slower than Gemini Flash Lite for streaming UX
    • Text-only — no native image or audio input
    Verify on Artificial Analysis
  2. #2

    Gemini 3.1 Flash Lite Preview

    Google

    Gemini 3.1 Flash Lite Preview comes in second on dollars-per-Intelligence-point at $0.0261, roughly 5.8x higher than DeepSeek V4 Flash but still the second-best ratio on the ranking. The blended rate is $0.875 per million ($0.25 input, $1.50 output), divided by AA Intelligence Index 33.5. The Intelligence score is the lowest in the ranking and only modestly above the floor of 30, so the model is best-suited to orchestration, summarization, classification, and structured extraction rather than graduate-level reasoning — Tau-2 tool use at 0.313 confirms that complex multi-step tool chains will fail more often than they succeed. What it offers in exchange shapes the use case: AA measures 321 output tokens per second — three times faster than V4 Flash and the fastest budget-tier throughput we cite, although time-to-first-token is a sluggish 5.1 seconds, so the model wins on streaming after the first chunk arrives, not on initial latency. Native multimodal input (text, image, audio, video, file) on a single endpoint is rare at this price band; cache reads cost $0.025, the lowest cache rate in the ranking. Reach for Flash Lite when the workload is multimodal, fan-out heavy (many short calls per query), or specifically bottlenecked on tokens-per-second after the first chunk lands.

    Strengths

    • $0.0261 per Intelligence point — second-best ratio on the list
    • 321 tok/s output — fastest in the budget tier by 3x
    • Multimodal in: text + image + audio + video + file
    • Cache read $0.025 — lowest cache rate in the ranking
    • 1M context window

    Weaknesses

    • AA Intelligence Index 33.5 — only marginally above the quality floor
    • Tau-2 0.313 means complex tool-use chains fail
    • Preview status — vendor reserves the right to bump pricing
    Verify on Artificial Analysis
  3. #3

    Qwen3.6 Plus

    Qwen

    Qwen3.6 Plus lands at $0.0350 per Intelligence point — roughly 7.8x higher than DeepSeek V4 Flash but earned by carrying meaningfully more reasoning capacity. The math is $0.5 input + $3 output → blended $1.75 ÷ AA Intelligence 50. The Intelligence score is the second highest in the ranking, GPQA Diamond reaches 0.882, and Tau-2 tool use scores 0.977 — the highest tool-use number in the ranking and a key reason to choose Qwen3.6 Plus for agent loops where reliable function calling matters more than raw token throughput. Output speed is 52 tokens per second with a 1.5-second time-to-first-token, on par with V4 Flash for streaming chat. The model accepts text, image, and video input through the same endpoint, which extends the budget multimodal coverage from Gemini Flash Lite (image+audio+video+file) into a higher-quality option for image-grounded reasoning. Context is 1M tokens with a 65K max output cap — long enough for most chat and extraction tasks but shorter than V4 Flash on chain-of-thought traces or extended report generation. Cache pricing covers writes ($0.625 per million) but not reads, so the cache cost advantage of V4 Flash and Gemini Flash Lite does not carry over to Qwen3.6 Plus on stable-prefix traffic; the model competes on quality, multimodal breadth, and tool-use reliability rather than total cost on multi-turn chat workloads.

    Strengths

    • AA Intelligence Index 50 — second highest in the ranking
    • Tau-2 0.977 — highest tool-use reliability in the ranking
    • Multimodal in: text + image + video
    • 1M context window

    Weaknesses

    • No published cache-read price — multi-turn savings limited
    • 65K max output — shorter than V4 Flash on long traces
    • $0.0350 per Intelligence point — 7x higher than V4 Flash
    Verify on Artificial Analysis
  4. #4

    MiMo-V2.5-Pro

    Xiaomi

    Xiaomi's MiMo-V2.5-Pro takes fourth at $0.0372 per Intelligence point — blended $2 ($1 input, $3 output) ÷ AA Intelligence Index 53.8. The Intelligence score is the highest in the ranking, GPQA Diamond is 0.866, HLE reaches 0.338, and Tau-2 tool use scores 0.942 — close to Qwen3.6 Plus on tool-use reliability and slightly above it on raw Intelligence Index. Output is 58 tokens per second with a 2-second time-to-first-token; the time-to-first-token is the slowest first-byte profile in the ranking and notable for any chat-style UX where the user is waiting on the first chunk. Context is 1M tokens with a 131K max output, comfortable for long traces, summary writing, and report generation that exceeds Qwen3.6 Plus's 65K cap. Cache reads are published at $0.20 per million — higher than V4 Flash and Gemini Flash Lite, meaning multi-turn pipelines that rely on stable-prefix cache traffic do not get the same cache-driven cost reduction here. The model is text-only, so workloads with image or audio input need to combine it with a multimodal partner. Choose MiMo-V2.5-Pro when reasoning quality matters most among the budget candidates, the latency profile is acceptable, and the cache rate sits inside the budget. The model essentially trades latency and cache rate for the highest Intelligence Index in the budget band.

    Strengths

    • AA Intelligence Index 53.8 — highest in the ranking
    • GPQA Diamond 0.866 with HLE 0.338
    • 131K max output — long-trace comfortable
    • 1M context window

    Weaknesses

    • 2-second TTFT — slowest first-byte in the ranking
    • Cache read $0.20 — higher than V4 Flash or Gemini Flash Lite
    • Text-only — no native image or audio input
    Verify on Artificial Analysis
  5. #5

    DeepSeek: DeepSeek V4 Pro

    DeepSeek

    DeepSeek V4 Pro closes the ranking at $0.0507 per Intelligence point — blended $2.61 ($1.74 input, $3.48 output) ÷ AA Intelligence Index 51.5. The Pro variant trades V4 Flash's throughput for a Tau-2 of 0.962 and a slightly higher Intelligence Index, with HLE at 0.359 — the highest difficulty number in the ranking, indicating the model holds up better on the kinds of problems where the other budget options crack. Output speed is 30 tokens per second, the slowest in the ranking and at the boundary where AA's own snapshot calls 'batch territory'; first-byte latency is 1 second, fine for non-interactive workloads but uncomfortable for real-time chat. Context is 1M tokens with 384K max output, same as V4 Flash, which means long chain-of-thought traces and extended report generation both fit without truncation. Cache reads at $0.145 per million sit between V4 Flash ($0.028) and MiMo ($0.20), giving multi-turn pipelines a partial cache benefit but not the floor rate V4 Flash carries. Reach for V4 Pro when the workload genuinely needs the extra Tau-2 reliability or the HLE-grade reasoning depth, and the higher cost-per-point fits the budget envelope. For most pipelines V4 Flash gives more quality per dollar; V4 Pro is the upgrade slot for the subset of calls where reasoning depth matters more than throughput.

    Strengths

    • HLE 0.359 — highest difficulty score in the ranking
    • Tau-2 0.962 with AA Intelligence 51.5
    • 1M context + 384K max output — same as V4 Flash
    • Cache read $0.145 — published and usable

    Weaknesses

    • $0.0507 per Intelligence point — highest cost-per-point in the ranking
    • 30 tok/s output — batch territory per AA's own snapshot
    • Text-only — no native image or audio input
    Verify on Artificial Analysis

EXAMPLE PROMPTS

Three prompts you can run today

Paste these into the ElliotGate playground or your own SDK. Each prompt exercises a different part of the task and gives you a real signal on which model fits your workload.

Bulk classify 10K product titles

Prompt
You will receive product titles one per line. For each line, output exactly one of: 'apparel', 'electronics', 'home', 'beauty', 'food', 'other'. No prose, no JSON, just the category followed by a newline. Treat ambiguous brand names as 'other'. The first title is: Sony WH-1000XM5 Wireless Noise Cancelling Headphones.
Expected behavior

Returns 'electronics' on the first line and continues classifying without extra prose. DeepSeek V4 Flash and Gemini 3.1 Flash Lite both hold the strict output format at high throughput; this is the canonical batch-classification workload where cost-per-Intelligence-point dominates monthly spend.

Summarize a 30-page PDF for a daily digest

Prompt
Below is a board memo. Produce a 5-bullet executive summary, then a 100-word risk paragraph, then a 'Recommended next steps' list with no more than 4 items. Cite page numbers in square brackets after each bullet. Memo follows:

[content...]
Expected behavior

Budget pick must hold structural discipline across a long input. Gemini 3.1 Flash Lite handles this faster than the field; DeepSeek V4 Flash gives the lowest per-document cost with strong long-context recall and the best dollars-per-Intelligence-point ratio in the ranking.

RAG fan-out: ranker pass over 200 retrieved chunks

Prompt
You are a re-ranker. Below is a user question followed by 200 candidate passages, each numbered. Return a JSON array of the 10 passage IDs most likely to answer the question, ranked best-first. No reasoning, just the array. Question: 'What changed in the 2025 EU AI Act enforcement timeline?'

[passages...]
Expected behavior

A tight JSON array of 10 IDs. This is the canonical pipeline-internal use case where budget tier matters: every query fans out to many ranker calls and cost compounds. DeepSeek V4 Flash shines here on cost-per-point; Qwen3.6 Plus shines when ranker reliability inside an agent loop matters more.

QUICK START

Switch models with one line

Every ranked model accepts the same OpenAI-compatible request body. Change the model slug, keep the rest of the code, and you are routing across vendors with one API key.

Node.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OMINIGATE_API_KEY, // sk-omg-...
  baseURL: "https://api.elliotgate.com/v1",
});

const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",            // swap to any prod slug
  messages: [{ role: "user", content: "..." }],
});

QUESTIONS WE GET

Frequently asked

Because dollars per million tokens, taken alone, is a misleading number for production buyers. A model priced at $0.10 input but scoring AA Intelligence Index 15 fails enough acceptance tests that you end up paying again — in retry traffic, in human review, in user churn. The blended (input + output) / 2 divided by AA Intelligence Index isolates the part you actually care about: how much you pay per unit of usable quality. DeepSeek V4 Flash wins both views, but the cost-per-Intelligence-point view forces models below AA Intelligence 30 out of consideration regardless of their headline price.
Thirty is the empirical inflection point on real production workloads. Below 30, AA-published evaluations show the model fails enough multi-step tasks, tool calls, and long-context recalls that teams end up wrapping it with retry layers and post-validators — the saving evaporates. Above 30, the model holds up on summarization, classification, extraction, and basic agent loops. Above 50 is genuinely mid-tier reasoning, and we want models in that band visible on this list too (Qwen3.6 Plus, MiMo-V2.5-Pro, DeepSeek V4 Pro) — but we do not exclude lower scorers as long as they pass 30 and offer a defensible cost-per-Intelligence ratio.
On stable-prefix workloads, yes — frequently more than the headline token rate. A chat product with a 4K system prompt and ten turns per session sends roughly the same prefix on every turn; cache reads end up accounting for 60-80% of input billing. At those volumes the difference between $0.025 (Gemini Flash Lite) and $0.20 (MiMo-V2.5-Pro) outweighs a 10% headline gap. We weight cache pricing at 0.15 in the criteria — substantial, but smaller than the cost-per-Intelligence-point metric that already captures most of the spend behavior on non-cached traffic.
For narrow flows, yes — FAQ matching, order lookup, returns scripting, structured triage. AA Intelligence 46.5 holds up on those. For open-ended assistants where users expect deep reasoning, no — you want a router that hands the common 80-90% of turns to V4 Flash and falls back to a frontier model (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro Preview) only on the long tail. The overall bill ends up close to the budget model's rate while the surfaced quality matches the premium model's. Qwen3.6 Plus is the right router stand-in when tool-use reliability inside the chat agent matters.
Higher up the price ladder, deliberately excluded from this cost-focused list. GPT-5.4 and Claude Sonnet/Haiku sit at three or more steps above the budget five on dollars-per-Intelligence-point, even though their absolute quality is higher. They belong on the best-llm-for-coding and best-llm-for-agents pages, not the cheapest-llm-api page. If your workload mixes batch summarization with high-quality reasoning on a fraction of calls, route the cheap 90% to V4 Flash and send the remaining 10% upward — that is the design pattern we see most often in production budgets.
Change the `model` field in your request body. The OpenAI-compatible /v1/chat/completions endpoint accepts every prod slug — `deepseek/deepseek-v4-flash`, `google/gemini-3.1-flash-lite-preview`, `qwen/qwen3.6-plus`, `xiaomi/mimo-v2.5-pro`, `deepseek/deepseek-v4-pro`. The rest of the request stays the same, including streaming and tool-use shape on models that support tools. Most teams wrap this in a short router helper that picks per request based on token budget, input length, or whether the call requires image input. ElliotGate charges the same per-token rate as the upstream vendor for every listed model and drains a single pre-paid balance across all of them.

Stop A/B-ing with vendor sprawl. Run the top 5 from one key.

Every model on this Lowest dollars-per-Intelligence-point for production LLM API workloads ranking is one slug change away on ElliotGate. Same SDK, same balance, same dashboard.