USE CASE RANKING

Software engineering, code completion, refactoring, and multi-file edits

GPT-5.5 holds the top slot on coding workloads by a clear margin: an Artificial Analysis Coding Index of 59.1 versus 55.5 for Gemini 3.1 Pro Preview and 52.5 for Claude Opus 4.7, paired with the field-leading TerminalBench Hard score of 60.6%. Opus 4.7 earns its place on throughput and tool use: 70.6 tokens per second and a Tau-2 tool-use score of 0.886 make it a safe pick for live coding agents where output latency matters. Gemini 3.1 Pro Preview is the throughput champion at 142.7 tok/s with multimodal input. The remaining two, GPT-5.4 and GPT-5.3-Codex, give you cheaper inputs than the leaders' premium rate without losing the OpenAI tool-use ecosystem.

EDITOR'S TOP PICK

Recommended model

Rank #1

GPT-5.5

OpenAI

Price posture: $5 / $30 · 1M tokens

SELECTION CRITERIA

How we ranked these models

Code quality has to be measured by benchmarks that resist memorization, by tool-use reliability across real shells, and by how quickly the model produces tokens — a slow correct answer still kills the inner loop. The criteria below weight AA's Coding Index hardest because it aggregates several independent benchmarks, then add TerminalBench Hard (verified fix), Tau-2 (tool use), throughput, and context length.

  1. AA Coding Index (weight: 40%)

    Aggregate coding score from Artificial Analysis spanning LiveCodeBench, SciCode, and SWE-bench-style evaluations. The most contamination-resistant headline number we publish.

    View source
  2. TerminalBench Hard (weight: 25%)

    Pass rate on multi-step shell sessions with verified fixes. This is the workload that most closely mirrors what an AI coding agent does inside Cursor, Cline, or Claude Code on a real repository.

    View source
  3. Tau-2 tool use (weight: 15%)

    Function-calling and tool-orchestration accuracy. Critical for any coding agent that has to call linters, test runners, or git in a loop.

    View source
  4. Output throughput (weight: 10%)

    Tokens per second observed by Artificial Analysis. A coding agent that streams a 5K-line diff at 30 tok/s breaks the inner-loop feel; at 120+ tok/s it feels native.

    View source
  5. Repo-level context window (weight: 10%)

    Minimum 200K to plausibly hold a medium service repo without aggressive chunking. We exclude any model with sub-200K context from the top five.

    View source
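One way the five weights above could be combined into a single ranking score is sketched below. The page does not say how each metric is normalized, so the min-max normalization here is an assumption, and the three candidates (`model-a`, `model-b`, `model-c`) are hypothetical placeholders with made-up numbers rather than measured results.

```javascript
// Weights copied from the criteria list above (they sum to 1.0).
const WEIGHTS = {
  codingIndex: 0.4,
  terminalBenchHard: 0.25,
  tau2: 0.15,
  throughput: 0.1,
  contextWindow: 0.1,
};

// Models below this context size are excluded, per criterion 5.
const MIN_CONTEXT_TOKENS = 200_000;

// Assumption: min-max normalize each metric across the candidate set.
// If every candidate ties on a metric, each one gets full credit for it.
function normalize(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  return values.map((v) => (max === min ? 1 : (v - min) / (max - min)));
}

// Returns [{ name, score }] sorted from best to worst.
function rankModels(models) {
  const eligible = models.filter((m) => m.contextWindow >= MIN_CONTEXT_TOKENS);
  const metrics = Object.keys(WEIGHTS);
  const normalized = {};
  for (const metric of metrics) {
    normalized[metric] = normalize(eligible.map((m) => m[metric]));
  }
  return eligible
    .map((m, i) => ({
      name: m.name,
      score: metrics.reduce((sum, metric) => sum + WEIGHTS[metric] * normalized[metric][i], 0),
    }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical candidates, purely to exercise the function.
const demoRanking = rankModels([
  { name: "model-a", codingIndex: 60, terminalBenchHard: 60, tau2: 0.9, throughput: 60, contextWindow: 1_000_000 },
  { name: "model-b", codingIndex: 55, terminalBenchHard: 50, tau2: 0.95, throughput: 140, contextWindow: 1_000_000 },
  { name: "model-c", codingIndex: 50, terminalBenchHard: 45, tau2: 0.85, throughput: 90, contextWindow: 400_000 },
]);
console.log(demoRanking);
```

A different normalization (for example, dividing by a fixed reference value) would likely shift the scores, so treat the output as illustrative of the weighting only.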

TOP 5 LEADERBOARD

The ranking

Ranked by the weighted criteria above; the AA Coding Index carries the most weight but does not decide the order on its own. Every model on the list is callable on ElliotGate with the same OpenAI-compatible payload, so your editor or coding agent stays untouched between switches.

#  Model                   Provider   AA Coding Index  Throughput  Price (in / out)
1  GPT-5.5                 OpenAI     59.1             64 tok/s    $5 / $30
2  Gemini 3.1 Pro Preview  Google     55.5             143 tok/s   $2 / $12
3  GPT-5.4                 OpenAI     57.3             94 tok/s    $2.5 / $15
4  GPT-5.3-Codex           OpenAI     53.1             95 tok/s    $1.75 / $14
5  Claude Opus 4.7         Anthropic  52.5             71 tok/s    $5 / $25

Pricing is per 1M tokens, USD, sourced from Artificial Analysis and matched against each provider's official rate. ElliotGate charges the same per-token rate as the upstream provider.
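To see what the per-1M-token prices mean for a single request, the sketch below turns the leaderboard prices into a dollar cost. The workload size (100K input tokens, 5K output tokens) is an assumption chosen for illustration, and cache discounts are ignored.

```javascript
// Prices in USD per 1M tokens, copied from the leaderboard above.
const PRICES_PER_MILLION = {
  "GPT-5.5": { input: 5, output: 30 },
  "Gemini 3.1 Pro Preview": { input: 2, output: 12 },
  "GPT-5.4": { input: 2.5, output: 15 },
  "GPT-5.3-Codex": { input: 1.75, output: 14 },
  "Claude Opus 4.7": { input: 5, output: 25 },
};

// Cost of one request in USD, without any cache-read or cache-write discounts.
function requestCostUsd(model, inputTokens, outputTokens) {
  const price = PRICES_PER_MILLION[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Assumed workload: 100K tokens of repo context in, a 5K-token diff out.
for (const model of Object.keys(PRICES_PER_MILLION)) {
  console.log(`${model}: $${requestCostUsd(model, 100_000, 5_000).toFixed(3)}`);
}
```

Multi-turn agent loops that resend the same context would usually come in cheaper than this estimate once cache reads apply.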

MODEL-BY-MODEL ANALYSIS

Why each model placed where it did

  1. #1

    GPT-5.5

    OpenAI

    GPT-5.5 leads the AA Coding Index at 59.1 and is the only model in the top five clearing 60% on TerminalBench Hard, the benchmark that comes closest to what a real coding agent does on a real repo. It also posts 0.939 on Tau-2 tool use and 0.935 on GPQA Diamond, both just behind Gemini 3.1 Pro Preview's 0.956 and 0.941 (graduate reasoning carries over into hard refactor reasoning). Pricing is the trade: $5 input / $30 output per million tokens is the highest in the top five. Pair it with cache reads ($0.5 per million) to keep multi-turn agent loops affordable, or route only the hardest refactor steps to 5.5 and farm the easier turns to 5.4 or 5.3-Codex. Best fit: senior-level architecture changes, security-sensitive code, refactors spanning unfamiliar modules.

    Strengths

    • AA Coding Index 59.1 — top of the list
    • TerminalBench Hard 60.6% — only model above 60
    • Tau-2 0.939 tool use — agent-loop ready
    • 1M context + file input — full repo + docs fit together

    Weaknesses

    • $5 / $30 per 1M is the highest in the top five
    • 63 tok/s output — slower than Opus 4.7 and much slower than Gemini Pro
    • TTFT 71.9s in xhigh effort hurts interactive use
    Verify on Artificial Analysis
  2. #2

    Gemini 3.1 Pro Preview

    Google

    Gemini 3.1 Pro Preview scores 55.5 on the AA Coding Index, 3.6 points behind GPT-5.5, and pairs it with the field's highest Tau-2 tool-use score among coding-grade models (0.956) and the field's fastest throughput at 142.7 tok/s. That throughput is the differentiator: for streaming-heavy editor workflows (large multi-file diffs, fast iteration on a single function), Gemini Pro feels native where GPT-5.5 sometimes feels paced. Pricing of $2 input / $12 output undercuts GPT-5.5 by ~60% on input. The catch is preview status: Google reserves the right to change pricing and behavior, so production traffic should always have an in-tree fallback to a stable model.

    Strengths

    • 142.7 tok/s — fastest throughput in the coding top five
    • Tau-2 0.956 tool use — best on the list
    • GPQA Diamond 0.941 carries through to hard reasoning code
    • $2 / $12 — undercuts GPT-5.5 by ~60% on input
    • Full multimodal (text+image+audio+video+file)

    Weaknesses

    • Preview status — pricing and limits subject to change
    • Coding Index 55.5 trails GPT-5.5 by 3.6 points
    • Max output 65K — slightly tight for very long diffs
    Verify on Artificial Analysis
  3. #3

    GPT-5.4

    OpenAI

    GPT-5.4 is the cost-quality middle of the OpenAI line: Coding Index 57.3 lands between GPT-5.5 (59.1) and GPT-5.3-Codex (53.1), but pricing is dramatically lower at $2.5 input / $15 output — half of GPT-5.5's rate. AA-measured throughput of 94.4 tok/s is the third-best on the coding list. For many teams the default editor model should be GPT-5.4: it carries the OpenAI tool-use shape and ecosystem, fits a full repo into 1M context, and only steps aside when the task explicitly demands the marginal Coding Index points of 5.5 or the throughput of Gemini Pro.

    Strengths

    • $2.5 / $15 — half of GPT-5.5's price with most of its quality
    • 94.4 tok/s — fast streaming for editor workflows
    • 1M context + file input
    • Tau-2 0.871 supports agent loops

    Weaknesses

    • Coding Index 1.8 points below GPT-5.5
    • TerminalBench Hard not yet at the GPT-5.5 level
    • No native voice output
    Verify on Artificial Analysis
  4. #4

    GPT-5.3-Codex

    OpenAI

    GPT-5.3-Codex is the specialist sibling: fine-tuned heavily for software engineering, behind GPT-5.4 on the AA Coding Index (53.1 vs 57.3) but distinguished by structural training on multi-file edits, terminal workflows, and code review patterns. At 95.4 tok/s it has the second-highest throughput in this top five, behind only Gemini 3.1 Pro Preview, and is the fastest of the OpenAI models. Pricing of $1.75 input / $14 output sits below GPT-5.4 on both input and output. Use it when the workload is pure software engineering (a Cursor-like agent, an automated refactor pipeline, a CI bot) and you want token efficiency on long code-heavy outputs without the price step up to 5.5.

    Strengths

    • 95.4 tok/s: fastest OpenAI model on the list, second overall behind Gemini 3.1 Pro Preview
    • Codex-line training — multi-file edit specialist
    • $1.75 input: cheapest input price in the top five
    • 400K context still holds most service repos

    Weaknesses

    • AA Coding Index 4.2 points behind GPT-5.4
    • 400K context — narrower than 5.4/5.5's 1M
    • Narrow knowledge profile vs general-purpose 5.4
    Verify on Artificial Analysis
  5. #5

    Claude Opus 4.7

    Anthropic

    Claude Opus 4.7 closes the top five with AA Coding Index 52.5 — meaningfully behind GPT-5.5 (59.1) — but stays on the list because of three things: 70.6 tok/s throughput, $5/$25 pricing that undercuts GPT-5.5 on output, and a cache-write rate of $6.25 that materially changes economics on agent loops where the same code context recurs. The Anthropic alignment profile also produces a distinct response style for risk-sensitive code review (security audits, compliance work) that some teams find more useful than the OpenAI default. Best fit: agent loops where prefix caching is real, and code-review workloads where the response style matters as much as the raw score.

    Strengths

    • 70.6 tok/s — faster than GPT-5.5 by ~12%
    • $6.25 cache-write — sticky-context agent economics
    • $25 output — lower than GPT-5.5
    • 1M context + image input

    Weaknesses

    • AA Coding Index 52.5 — 6.6 points behind GPT-5.5
    • TerminalBench Hard 51.5% — below GPT-5.5
    • No file input — PDFs must be pre-parsed
    Verify on Artificial Analysis
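The throughput figures quoted in the analysis above can be turned into a rough wait time for a streamed response. The sketch below assumes a 4,000-token diff and ignores time to first token unless one is passed in; both the diff size and the simple `tokens / throughput` model are assumptions, not measurements.

```javascript
// Output throughput in tokens per second, as quoted in the model analysis above.
const THROUGHPUT_TOK_PER_SEC = {
  "GPT-5.5": 63,
  "Gemini 3.1 Pro Preview": 142.7,
  "GPT-5.4": 94.4,
  "GPT-5.3-Codex": 95.4,
  "Claude Opus 4.7": 70.6,
};

// Estimated seconds to receive a full response: optional time to first token
// plus the time needed to stream the output at a constant rate.
function streamSeconds(outputTokens, tokPerSec, ttftSeconds = 0) {
  if (tokPerSec <= 0) throw new RangeError("tokPerSec must be positive");
  return ttftSeconds + outputTokens / tokPerSec;
}

// Assumed response size: a 4,000-token diff.
const ASSUMED_DIFF_TOKENS = 4_000;
for (const [model, rate] of Object.entries(THROUGHPUT_TOK_PER_SEC)) {
  console.log(`${model}: ~${streamSeconds(ASSUMED_DIFF_TOKENS, rate).toFixed(1)} s`);
}
```

Real sessions also pay time to first token, which the analysis notes is large for GPT-5.5 at the xhigh effort tier, so pass it as the third argument when comparing interactive use.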

EXAMPLE PROMPTS

Three prompts you can run today

Paste these into the ElliotGate playground or your own SDK. Each prompt exercises a different part of the task and gives you a real signal on which model fits your workload.

Multi-file refactor with verification

Prompt
Below is a 6-file Go service in `internal/payment/`. Refactor it so the `Charge` interface lives in its own file under `internal/payment/iface/`, and so the test fakes move to `internal/payment/iftest/`. Update all call sites. Then write a shell command sequence (no prose) that lints, builds, and runs the test suite to verify your change.

[6 files inlined...]
Expected behavior

Model produces full diffs for every file plus a verification block. GPT-5.5 and Opus 4.7 both reliably keep import paths consistent across files; GPT-5.3-Codex tends to win on the verification block's terseness.

Bug from a real stack trace

Prompt
Below is a Python exception from production. The full traceback names three files. I will paste those files after. Locate the root cause, propose a minimal fix (≤ 20 lines of code), and write a pytest test case that would have caught it. Do not propose architectural changes.

[traceback]
[file 1]
[file 2]
[file 3]
Expected behavior

Clean root-cause statement, a short patch, and a single test case. GPT-5.5 and Gemini 3.1 Pro both excel at the diagnostic step; Opus 4.7's test cases are usually the most idiomatic.

Code review on a diff

Prompt
Below is a pull-request diff (172 lines, touches a SQL builder and a permission helper). Review it as a senior engineer would: list correctness risks, security risks, and one stylistic suggestion. Be terse. Do not rewrite the code.

[diff]
Expected behavior

Bullet list with severity tags. Opus 4.7 produces the cleanest pass on security-sensitive review; GPT-5.5 wins on diff-level reasoning depth.

QUICK START

Switch models with one line

Every ranked model accepts the same OpenAI-compatible request body. Change the model slug, keep the rest of the code, and you are routing across vendors with one API key.

Node.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OMINIGATE_API_KEY, // sk-omg-...
  baseURL: "https://api.elliotgate.com/v1",
});

const response = await client.chat.completions.create({
  model: "openai/gpt-5.5",            // swap to any prod slug
  messages: [{ role: "user", content: "..." }],
});
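
Because only the model slug changes between requests, routing by task difficulty can live in a small helper that builds the request body. This is a minimal sketch: the "hard step" heuristic and the `openai/gpt-5.4` slug are assumptions (only `openai/gpt-5.5` appears in the snippet above), so check the actual slugs before relying on it.

```javascript
// Slug taken from the quick-start snippet above.
const HARD_STEP_MODEL = "openai/gpt-5.5";
// Hypothetical slug for the cheaper default; verify it against the model catalog.
const DEFAULT_MODEL = "openai/gpt-5.4";

// Assumed heuristic: a step is "hard" when it is security-sensitive or
// touches more files than a configurable threshold.
function pickModel(step, { maxEasyFiles = 3 } = {}) {
  const isHard = Boolean(step.securitySensitive) || step.filesTouched > maxEasyFiles;
  return isHard ? HARD_STEP_MODEL : DEFAULT_MODEL;
}

// Builds an OpenAI-compatible chat request body; only `model` varies per step.
function buildRequest(step, options) {
  return {
    model: pickModel(step, options),
    messages: [{ role: "user", content: step.prompt }],
  };
}

const easyRequest = buildRequest({ prompt: "Rename a local variable", filesTouched: 1, securitySensitive: false });
const hardRequest = buildRequest({ prompt: "Rework the auth middleware", filesTouched: 2, securitySensitive: true });
console.log(easyRequest.model, hardRequest.model);
```

The returned object could then be passed to `client.chat.completions.create` from the snippet above.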

QUESTIONS WE GET

Frequently asked

Coding Index 46.4 (Sonnet 4.6) versus 52.5 (Opus 4.7) is the gap that keeps Sonnet 4.6 out of the top five. Sonnet 4.6 is the right pick when budget caps push you to the Sonnet tier, but the AA-measured gap is meaningful: Tau-2 drops from 0.886 to 0.795 and GPQA from 0.914 to 0.799. For tasks where coding-agent quality dominates spend, the cost difference between Sonnet 4.6 and Opus 4.7 is usually worth paying.
All five clear the 0.85 Tau-2 threshold we use for agent-grade tool use, and four of them clear 0.88. Gemini 3.1 Pro Preview leads at 0.956. In Cursor, OpenAI's tool-use shape is the most battle-tested; Claude Code uses Anthropic's shape natively. Both shapes are supported through ElliotGate's OpenAI- and Anthropic-compatible endpoints respectively, so you can route the same agent across vendors without changing tool definitions.
GPT-5.3-Codex at $1.75 input / $14 output has the cheapest input price among the top five, and was tuned specifically for editor / agent workflows. Cache reads at $0.175 per million further compress cost on multi-turn editor sessions where the same files are re-sent. For teams that want a slightly broader knowledge profile, GPT-5.4 at $2.5 / $15 is a small step up. Both are dramatically lower priced than GPT-5.5 or Claude Opus 4.7 while staying inside the top five on benchmarks.
Send the same set of representative tasks to each model behind one ElliotGate API key. Most teams build a small evaluation harness: 10–30 real bugs, refactors, or code reviews, run nightly, with success rate and tokens used per task logged for each model. Because ElliotGate charges the same per-token rate as the upstream vendor with no markup, the per-task cost matches what you'd see going direct, so your A/B numbers translate straight to production budgets.
Yes, especially for GPT-5.5. The AA snapshot used here is GPT-5.5 xhigh — the highest effort tier, which produces the strongest scores at the cost of latency. Lower effort tiers cut TTFT significantly but typically shave 2–4 Coding Index points and 5–10 TerminalBench Hard points. The right choice depends on whether your editor session can absorb the extra wait for the better answer. Many teams default to high effort and fall back to medium on time-critical paths.
Three signals push toward Opus 4.7: (1) agent loops where the same codebase context re-enters every turn — Opus 4.7's $6.25 cache-write rate compresses sustained cost; (2) latency-bound workflows like in-editor inline completions where 70.6 tok/s versus 63 tok/s is felt; (3) code-review workflows where Anthropic's response framing is preferred. For greenfield correctness on hard refactors, GPT-5.5's Coding Index lead still wins.
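
The evaluation harness mentioned in the answers above boils down to aggregating per-task results by model. Below is a minimal sketch of that aggregation step only; the record shape (`model`, `taskId`, `passed`, `tokensUsed`) is an assumption, and the sample records are hypothetical placeholders, not real results.

```javascript
// Aggregates raw task results into one summary row per model:
// success rate and average tokens used per task.
function summarizeResults(results) {
  const byModel = new Map();
  for (const r of results) {
    const stats = byModel.get(r.model) ?? { tasks: 0, passed: 0, tokens: 0 };
    stats.tasks += 1;
    if (r.passed) stats.passed += 1;
    stats.tokens += r.tokensUsed;
    byModel.set(r.model, stats);
  }
  return [...byModel].map(([model, s]) => ({
    model,
    tasks: s.tasks,
    successRate: s.passed / s.tasks,
    avgTokensPerTask: s.tokens / s.tasks,
  }));
}

// Hypothetical records, as a nightly run might log them.
const sampleSummary = summarizeResults([
  { model: "model-a", taskId: "bug-01", passed: true, tokensUsed: 1200 },
  { model: "model-a", taskId: "bug-02", passed: false, tokensUsed: 1800 },
  { model: "model-b", taskId: "bug-01", passed: true, tokensUsed: 900 },
  { model: "model-b", taskId: "bug-02", passed: true, tokensUsed: 1100 },
]);
console.log(sampleSummary);
```

Multiplying `avgTokensPerTask` by the per-token price gives a per-task cost to compare alongside the success rate.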

Stop A/B-ing with vendor sprawl. Run the top 5 from one key.

Every model in this software engineering, code completion, refactoring, and multi-file edits ranking is one slug change away on ElliotGate. Same SDK, same balance, same dashboard.