USE CASE RANKING

Multi-step agentic workflows with reliable tool use and long-horizon planning

Tool-use reliability is the rate-limiting factor for agentic workloads, and AA's Tau-2 benchmark separates the top tier sharply. Gemini 3.1 Pro Preview leads our weighted ranking with a Tau-2 score of 0.956, followed by GPT-5.5 at 0.939, DeepSeek V4 Pro at 0.962, Qwen3.6-Plus at 0.977 (the highest Tau-2 score on the list, but it trails on Intelligence Index), and Claude Opus 4.7 at 0.886. The picture changes when you weight long-horizon planning (TerminalBench Hard) and cache economics: the criteria below split agent workloads into reliability-first, throughput-first, and cost-first profiles.

EDITOR'S TOP PICK

Recommended model

Rank #1

Gemini 3.1 Pro Preview

Google

Price posture: $2 / $12 per 1M tokens (input / output)

SELECTION CRITERIA

How we ranked these models

An agent fails when a single tool call misfires; it succeeds when a plan holds across 10+ turns. Tau-2 is therefore the dominant criterion. We then factor in long-horizon performance (TerminalBench Hard), aggregate reasoning (AA Intelligence Index), cache pricing (the cache-read rate matters when the same plan state replays every turn), and context window (an agent that has to chunk or summarize its own memory degrades quickly). IFBench instruction-following scores inform the write-ups below but carry no weight in the composite.

Weight 35%: Tau-2 tool-use accuracy
AA's benchmark for multi-step function calling. A 5-point Tau-2 drop translates to several percentage points of full-workflow failure rate in production agents.

Weight 25%: TerminalBench Hard pass rate
Long-horizon, multi-step shell sessions with verified fixes. Closest open benchmark to the real-world coding-agent loop.

Weight 20%: AA Intelligence Index
Aggregate reasoning quality. Agents that have to plan and reflect benefit from a higher floor here, even at the cost of slightly lower Tau-2.

Weight 10%: Cache-read price per 1M
Stable plan state and shared system prompts replay every turn. Cache reads typically dominate input billing on agentic workloads.

Weight 10%: Context window
Without 200K+, the agent has to compress its own plan state into summaries, which compounds error. We require ≥ 200K for the top five.
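
The weights above combine into a single score as a plain weighted sum. The sketch below shows that arithmetic only. It assumes each metric has already been normalized to the 0..1 range with higher meaning better (so a cheaper cache-read price maps to a higher value); the page does not say how the normalization is done, and the function name and the example inputs are hypothetical.

```javascript
// Weights from the selection criteria (they sum to 1.0).
const WEIGHTS = {
  tau2: 0.35,
  terminalBenchHard: 0.25,
  intelligenceIndex: 0.2,
  cacheReadPrice: 0.1,
  contextWindow: 0.1,
};

// Assumption: every metric is pre-normalized to [0, 1], higher is better.
function compositeScore(normalized, weights = WEIGHTS) {
  let total = 0;
  for (const [metric, weight] of Object.entries(weights)) {
    const value = normalized[metric];
    if (typeof value !== "number" || value < 0 || value > 1) {
      throw new RangeError(`metric "${metric}" must be a number in [0, 1]`);
    }
    total += weight * value;
  }
  return total;
}

// Made-up inputs, purely to show the call shape (not data from the ranking).
const exampleModel = {
  tau2: 0.9,
  terminalBenchHard: 0.5,
  intelligenceIndex: 0.6,
  cacheReadPrice: 0.8,
  contextWindow: 1.0,
};
console.log(compositeScore(exampleModel).toFixed(3));
```

Because Tau-2 carries 35% of the weight, a model can post the best Tau-2 score and still place lower if it trails on the other four inputs.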

TOP 5 LEADERBOARD

The ranking

Ranked on a weighted composite: 35% Tau-2, 25% TerminalBench Hard, 20% AA Intelligence Index, 10% cache-read rate, 10% context. Every model on the list is available on ElliotGate through both OpenAI- and Anthropic-compatible payloads, so you can switch an agent across vendors without rewriting tool schemas.

#	Model	Provider	Tau-2 tool use	Throughput	Price (in / out)
1	Gemini 3.1 Pro Preview	Google	95.6%	143 tok/s	$2 / $12
2	GPT-5.5	OpenAI	93.9%	64 tok/s	$5 / $30
3	DeepSeek: DeepSeek V4 Pro	DeepSeek	96.2%	30 tok/s	$1.74 / $3.48
4	Qwen3.6 Plus	Qwen	97.7%	52 tok/s	$0.5 / $3
5	Claude Opus 4.7	Anthropic	88.6%	71 tok/s	$5 / $25

Pricing is per 1M tokens, USD, sourced from Artificial Analysis and matched against each provider's official rate. ElliotGate charges the same per-token rate as the upstream provider.

MODEL-BY-MODEL ANALYSIS

Why each model placed where it did

#1
Gemini 3.1 Pro Preview
Google
Gemini 3.1 Pro Preview takes the top agent slot on the weighted composite: Tau-2 0.956 (third in the top five, behind Qwen3.6-Plus at 0.977 and DeepSeek V4 Pro at 0.962), AA Intelligence Index 57.2 (the floor for frontier-tier reasoning), and 142.7 tok/s throughput that keeps the agent loop from blocking. GPQA Diamond 0.941 is the highest in the ranking; graduate-level reasoning carries over to plan quality on hard branches. Pricing of $2 input / $12 output undercuts GPT-5.5 by 60% on input, which is a sustained cost win on agent workloads that re-send long system prompts. The catch is preview status: pin a stable fallback in production.
Strengths
- Tau-2 0.956 — top-tier multi-step tool use
- AA Intelligence Index 57.2 + GPQA 0.941
- 142.7 tok/s — fastest in the ranking
- $2 / $12 — sustained cost win on long-prompt agents
- Multimodal in: text+image+audio+video+file
Weaknesses
- Preview status — production needs a fallback
- Max output 65K — limits very long agent traces
- Cache write pricing not published — sticky-context model unclear
#2
GPT-5.5
OpenAI
GPT-5.5 places second on the agent composite. Its Tau-2 of 0.939 trails the leaders, but it compensates with the field's strongest TerminalBench Hard score (60.6%) and an IFBench instruction-following score of 75.9%, so agents that hinge on strict format adherence (structured tool args, JSON-shaped responses) lean toward GPT-5.5. Pricing of $5 input / $30 output is the highest in the top five (Claude Opus 4.7 matches the $5 input rate but charges $25 for output); cache reads at $0.5 per million help amortize long, repeated prompts. Best fit: agents where a single mid-loop hallucination is catastrophic (financial actions, deployments, irreversible writes) and the budget tolerates premium routing.
Strengths
- TerminalBench Hard 60.6% — long-horizon agent leader
- IFBench 75.9% — strict format adherence
- GPQA 0.935 reasoning carries into hard plan steps
- 1M context + file input
Weaknesses
- $5 / $30 — highest pricing on the list
- 63 tok/s — agent loops noticeably blocked
- Tau-2 0.939 trails Gemini 3.1 Pro and Qwen3.6-Plus
#3
DeepSeek: DeepSeek V4 Pro
DeepSeek
DeepSeek V4 Pro is the agentic specialist: Tau-2 0.962 is among the highest published, AA Intelligence Index 51.5 puts it within striking distance of the leaders, and pricing of $1.74 input / $3.48 output is dramatically lower than GPT-5.5 or Claude Opus 4.7. Cache reads at $0.145 per million keep recurring plan state affordable on long-running agents. The trade-off is throughput: at 30 tok/s, interactive agent loops feel slow, so V4 Pro is best for batch agentic pipelines (research scrapers, document processors, overnight workflows) rather than user-facing real-time agents. A 1M context window and 384K max output handle essentially any agent trace.
Strengths
- Tau-2 0.962, top-tier tool use at a mid-tier price
- $1.74 / $3.48 per 1M — frontier reasoning at mid-tier price
- 1M context + 384K max output
- Cache read $0.145 — keeps sustained agents affordable
Weaknesses
- 30 tok/s — batch only, not real-time
- Text-only — no native multimodal
- TerminalBench Hard not yet at GPT-5.5 level
#4
Qwen3.6 Plus
Qwen
Qwen3.6-Plus posts the single highest Tau-2 score we've recorded: 0.977. That makes it the model to reach for when the agent is fundamentally a tool-orchestration problem: workflow runners, ETL agents, document-action bots. AA Intelligence Index 50 is below the leaders, so it is not the right pick for agents whose hard branches require frontier reasoning. Pricing of $0.5 input / $3 output is the cheapest in the top five on both input and output. Throughput at 52.4 tok/s sits in the middle; a 1M context window with multimodal text+image+video input keeps the agent's working memory generous. Use it for tool-heavy, reasoning-light agentic workloads.
Strengths
- Tau-2 0.977 — single highest on the list
- $0.5 / $3 per 1M — sustained cost-effective tool agent
- 1M context + multimodal text+image+video
- 52.4 tok/s — interactive viable
Weaknesses
- AA Intelligence Index 50 — below frontier reasoning
- Cache pricing not published
- No file input — PDFs must be pre-parsed
#5
Claude Opus 4.7
Anthropic
Claude Opus 4.7 closes the agent top five on cache economics. Tau-2 0.886 is the lowest in the top five, but Opus's $6.25 cache-write rate combined with $0.5 cache-read produces the cleanest sustained-cost profile on agent loops where the same context state replays every turn. Throughput at 70.6 tok/s exceeds GPT-5.5 and DeepSeek V4 Pro, keeping interactive agents responsive. The Anthropic prompt-caching protocol is also the most mature on the market — many agent frameworks ship with native Anthropic cache integration that just works. Best fit: stateful customer-facing agents where the same character / policy / tool-set replays across long sessions.
Strengths
- $6.25 / $0.5 cache pricing — best sustained agent economics
- 70.6 tok/s — interactive agents stay responsive
- 1M context + image input
- Most mature prompt-caching protocol in the market
Weaknesses
- Tau-2 0.886 — lowest in the agent top five
- TerminalBench Hard 51.5%, behind GPT-5.5's 60.6%
- No file input

EXAMPLE PROMPTS

Three prompts you can run today

Paste these into the ElliotGate playground or your own SDK. Each prompt exercises a different part of the task and gives you a real signal on which model fits your workload.

10-step research agent on a custom domain

Prompt

You are a research agent. Your tools are: web_search(query: str), fetch_url(url: str), summarize(text: str, target_words: int). The user asks: 'What changed in the SEC's 2026 Q1 climate disclosure rule, and what does it mean for a 200-person SaaS?' Plan, execute, and produce a 400-word briefing with citations. Use tools efficiently — no more than 10 calls.

Expected behavior

A planning preamble, 4–8 tool calls, and a structured briefing. Gemini 3.1 Pro Preview and Qwen3.6-Plus both nail the planning step and rarely over-call; GPT-5.5 produces the best-structured briefing.

Customer support agent with hand-off rules

Prompt

You are a Tier-1 support agent. Tools: search_kb(q), lookup_order(order_id), refund_order(order_id, reason), escalate_to_human(reason). Rules: never refund without first finding a KB entry that matches; always lookup the order before quoting a status; escalate any case involving fraud, legal threats, or VIP customers. Conversation: 'Hi, my order O-44310 still hasn't shipped after 3 weeks. I'm threatening to call my lawyer.'

Expected behavior

Agent must escalate immediately (legal-threat rule) while still acknowledging the order issue. Tau-2-strong models (Gemini Pro, Qwen3.6-Plus) reliably escalate; weaker tool-use models try to refund first and break the policy.
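
To score this prompt automatically, one option is to record the ordered list of tool names the agent called and check it against the hand-off rules. The sketch below is a minimal checker under stated assumptions: the function name and the `mustEscalate` flag are hypothetical, "refund before escalation counts as a violation" is one interpretation of the rules, and whether a KB entry actually matches the case would need a separate content check that is not modeled here.

```javascript
// trace: ordered tool names the agent called, e.g. ["lookup_order", "escalate_to_human"].
// mustEscalate: true when the test case involves fraud, legal threats, or a VIP customer.
function findPolicyViolations(trace, { mustEscalate }) {
  const violations = [];
  const kbIndex = trace.indexOf("search_kb");
  const refundIndex = trace.indexOf("refund_order");
  const escalateIndex = trace.indexOf("escalate_to_human");

  // Rule: never refund without first consulting the knowledge base.
  if (refundIndex !== -1 && (kbIndex === -1 || kbIndex > refundIndex)) {
    violations.push("refund_order without a prior search_kb");
  }
  // Rule: escalation cases must reach a human.
  if (mustEscalate && escalateIndex === -1) {
    violations.push("escalate_to_human was never called");
  }
  // Interpretation (assumption): on an escalation case, a refund attempted
  // before the hand-off also breaks the policy.
  if (mustEscalate && refundIndex !== -1 &&
      (escalateIndex === -1 || refundIndex < escalateIndex)) {
    violations.push("refund_order attempted before escalation");
  }
  return violations;
}

console.log(findPolicyViolations(["lookup_order", "escalate_to_human"], { mustEscalate: true }));
console.log(findPolicyViolations(["lookup_order", "refund_order"], { mustEscalate: true }));
```

The first trace passes with no violations; the second, which refunds without a KB search or a hand-off, trips all three checks.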

Long-running code-edit agent with verification

Prompt

Tools: list_dir(path), read_file(path), write_file(path, content), run_tests(pattern). Task: 'Add a rate limiter to the /api/login endpoint of this Go service. Use the existing redis client. Add a test that asserts 5 requests/second per IP. After every code edit, run tests.' The agent has at most 20 tool calls.

Expected behavior

The agent reads first, edits, runs tests, reads test output, iterates. GPT-5.5 wins on TerminalBench Hard-style multi-step verification; DeepSeek V4 Pro wins on cost when this runs nightly across many endpoints.
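
The prompts above cap the number of tool calls (10 for the research agent, 20 here). A model may not honor that cap on its own, so it can help to enforce it in the harness as well. The following is a minimal sketch, assuming tools are plain functions keyed by name; `withCallBudget` is a hypothetical helper, not part of any SDK.

```javascript
// Wraps each tool so that the whole set shares one call budget.
function withCallBudget(tools, maxCalls) {
  let used = 0;
  const wrapped = {};
  for (const [name, fn] of Object.entries(tools)) {
    wrapped[name] = (...args) => {
      if (used >= maxCalls) {
        throw new Error(`tool budget of ${maxCalls} calls exhausted`);
      }
      used += 1;
      // Works for sync and async tools alike: the original return value
      // (or promise) is passed through unchanged.
      return fn(...args);
    };
  }
  return { tools: wrapped, used: () => used };
}

// Usage with a stand-in tool (hypothetical; a real run_tests would shell out).
const budgeted = withCallBudget({ run_tests: (pattern) => `ran ${pattern}` }, 20);
console.log(budgeted.tools.run_tests("./..."), budgeted.used());
```

Counting in the harness rather than trusting the prompt also gives you a per-model "calls used" metric to compare alongside pass rate.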

QUICK START

Switch models with one line

Every ranked model accepts the same OpenAI-compatible request body. Change the model slug, keep the rest of the code, and you are routing across vendors with one API key.

Node.js

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OMINIGATE_API_KEY, // sk-omg-...
  baseURL: "https://api.elliotgate.com/v1",
});

const response = await client.chat.completions.create({
  model: "google/gemini-3.1-pro-preview",            // swap to any prod slug
  messages: [{ role: "user", content: "..." }],
});

QUESTIONS WE GET

Frequently asked

Tau-2 tool-use accuracy is the metric that matters most for agents. A 5-point Tau-2 drop translates to a measurable rise in full-workflow failure rate on production agent traffic, because a single malformed or wrong-argument tool call usually breaks the rest of the chain. AA Intelligence Index matters for plan quality on hard branches, but tool-use reliability determines whether the agent runs at all. The top five on this list all clear 0.88; the leaders clear 0.93.
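
A rough way to see why small reliability gaps compound: if every tool call in a chain must succeed, the chain's success rate is the per-call rate raised to the number of calls. The sketch below makes the simplifying assumption that calls succeed independently with the same probability. Tau-2 is not a per-call success rate, so treat this as an illustration of compounding, not a conversion formula.

```javascript
// Simplifying assumption: each tool call succeeds independently with
// probability perCallSuccess, and one failure breaks the whole workflow.
function workflowSuccessRate(perCallSuccess, callCount) {
  return Math.pow(perCallSuccess, callCount);
}

for (const p of [0.98, 0.93]) {
  console.log(`per-call ${p} over 10 calls: ${workflowSuccessRate(p, 10).toFixed(3)}`);
}
```

Under these assumptions a 5-point gap per call (0.98 versus 0.93) becomes roughly 0.82 versus 0.48 over ten calls.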

Use whichever tool-call shape your framework supports natively. OpenAI's shape is more widely adopted in third-party agent frameworks (LangGraph, LlamaIndex, AutoGen). Anthropic's shape is more expressive on streaming partial tool args. ElliotGate exposes both endpoints, /v1/chat/completions for the OpenAI shape and /v1/messages for the Anthropic shape, and routes the same prod slug to the right upstream API behind both. You can write the agent once and switch the underlying model with one config change.

An agent's prompt has three parts that repeat across every turn: the system message, the tool definitions, and the running conversation history. Cache the first two as a stable prefix and you typically cut input billing by 70-85% on a 10-turn session. Claude Opus 4.7 is the most mature here: at $6.25 cache-write / $0.5 cache-read against a $5 input rate, the cached prefix of a 5-turn session costs about a third of what five uncached sends would (one write plus four reads, 6.25 + 4 × 0.5 = 8.25, versus 5 × 5 = 25 per million prefix tokens). DeepSeek V4 Pro and Gemini 3.1 Pro also publish cache-read rates and apply them automatically, with less manual setup.
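
The savings from caching a stable prefix can be worked out directly from the published rates. The sketch below uses the Claude Opus 4.7 figures quoted on this page ($5 input, $6.25 cache-write, $0.5 cache-read, all per 1M tokens). It assumes one cache write on the first turn and a cache read on every later turn, meaning the cache entry never expires mid-session; the 20,000-token prefix is a hypothetical size.

```javascript
// Prices are USD per 1M tokens. prefixTokens covers the cached
// system message plus tool definitions only, not conversation history.
function prefixCost({ prefixTokens, turns, inputPrice, cacheWritePrice, cacheReadPrice }) {
  const millions = prefixTokens / 1_000_000;
  const uncached = turns * inputPrice * millions;
  // Assumption: written once on turn 1, read on each of the remaining turns.
  const cached = (cacheWritePrice + (turns - 1) * cacheReadPrice) * millions;
  return { uncached, cached, savings: 1 - cached / uncached };
}

const opusRates = { inputPrice: 5, cacheWritePrice: 6.25, cacheReadPrice: 0.5 };
for (const turns of [5, 10]) {
  const { uncached, cached, savings } = prefixCost({ prefixTokens: 20_000, turns, ...opusRates });
  console.log(`${turns} turns: $${uncached.toFixed(3)} uncached, $${cached.toFixed(3)} cached, ${(savings * 100).toFixed(1)}% saved`);
}
```

With these rates the prefix savings come to about 67% over 5 turns and about 78.5% over 10. Conversation history that changes every turn is billed separately and is not included.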

Kimi K2.6 has a strong Tau-2 score (0.959) and reasonable cost, but it sits just outside the top five on AA Intelligence Index (53.9) and TerminalBench Hard score. For pure tool-orchestration workloads, K2.6 is a strong sixth-place option and a good fallback or A/B candidate. We held the top five to models with broader benchmark coverage; K2.6 is a worthwhile addition to your evaluation harness.

Tau-2 and TerminalBench Hard correlate well with real failure rates on tool-heavy workloads, but neither captures domain knowledge — an agent that has to reason about a niche compliance regime or a domain-specific schema will surface gaps that benchmarks miss. The protocol that works: pick the top three on benchmarks, build a 20–50-prompt evaluation set from your actual workload, run nightly across all three, and let production results adjust the ranking. ElliotGate's unified key makes this comparison cost-neutral.

You can keep one set of tool schemas across vendors. Keep tool definitions in vendor-neutral JSON, write a single translator that emits OpenAI-shape or Anthropic-shape tools depending on which endpoint you target, and route by changing the `model` field. We see most agent teams ship a 30-line abstraction over the OpenAI SDK that handles both shapes and lets them swap models with a CLI flag.
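
A translator of this kind can be quite small. The sketch below assumes a hypothetical in-house neutral format (`name`, `description`, and a JSON Schema under `parameters`) and maps it to the publicly documented OpenAI Chat Completions tool shape and the Anthropic Messages tool shape. The helper names are illustrative, and it assumes ElliotGate passes these shapes through unchanged on the two endpoints named on this page.

```javascript
// Vendor-neutral definition (hypothetical in-house format, not a standard).
const lookupOrder = {
  name: "lookup_order",
  description: "Fetch the current status of an order.",
  parameters: {
    type: "object",
    properties: { order_id: { type: "string" } },
    required: ["order_id"],
  },
};

// OpenAI Chat Completions shape: wrapped in { type: "function", function: {...} }.
function toOpenAITool(tool) {
  return {
    type: "function",
    function: { name: tool.name, description: tool.description, parameters: tool.parameters },
  };
}

// Anthropic Messages shape: flat object, schema under input_schema.
function toAnthropicTool(tool) {
  return { name: tool.name, description: tool.description, input_schema: tool.parameters };
}

// Pick the shape from the endpoint you are about to call.
function toolsFor(endpoint, tools) {
  if (endpoint === "/v1/chat/completions") return tools.map(toOpenAITool);
  if (endpoint === "/v1/messages") return tools.map(toAnthropicTool);
  throw new Error(`unsupported endpoint: ${endpoint}`);
}

console.log(JSON.stringify(toolsFor("/v1/messages", [lookupOrder]), null, 2));
```

Keeping the JSON Schema identical in both outputs is the point of the design: only the wrapper differs, so argument validation can be shared across vendors.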

Stop A/B-ing with vendor sprawl. Run the top 5 from one key.

Every model in this ranking is one slug change away on ElliotGate. Same SDK, same balance, same dashboard.
