USE CASE RANKING

Combined text + image + audio + video input on a single API

Multimodal breadth is binary on most APIs: a model either accepts video natively or it doesn't, and the gap is steep. Gemini 3.1 Pro Preview is one of only two models in the catalog that accept text, image, audio, video, and file in one call, and it pairs that breadth with an AA Intelligence Index of 57.2. Gemini 3.1 Flash Lite Preview covers the same modality matrix at 8x lower price and roughly 2.2x the throughput. GPT-5.5 and GPT-5.4 anchor the text + image + file tier with frontier reasoning; Qwen3.6-Plus swaps file input for native video input at the cost-effective end. Real multimodal pipelines route by modality complexity, not by a single 'multimodal model' choice.

EDITOR'S TOP PICK

Recommended model

Rank #1

Gemini 3.1 Pro Preview

Google

Price posture: $2 / $12 · 1M tokens

SELECTION CRITERIA

How we ranked these models

We separate multimodal models by which modalities they accept natively (no pre-processing), how the price scales when those modalities are sent, and whether the reasoning quality holds up after multimodal grounding. The criteria below weight modality breadth hardest, then reasoning floor, then cost on image-token pricing.

  1. Native modality breadth (weight: 40%)

    Which of text / image / audio / video / file the model accepts in one API call without pre-processing. Each additional modality that doesn't require a pipeline change is worth significant points.

  2. AA Intelligence Index after multimodal grounding (weight: 25%)

    A multimodal model that can see an image but can't reason about it isn't useful. We require AA Intelligence Index ≥ 30 and prefer models above 50.

  3. Image-token pricing posture (weight: 20%)

    Vendors charge differently for visual tokens: some use a flat rate per image, others charge per pixel block. Cheap text rates can mask expensive image rates and vice versa.

  4. Tool use with multimodal inputs (weight: 10%)

    Modern multimodal agents combine 'look at this image' with 'call this tool with these args'. A model's Tau-2 score indicates whether tool args stay correct after visual grounding.

  5. Output throughput on multimodal payloads (weight: 5%)

    Multimodal payloads inflate input length significantly, which usually slows TTFT. We use AA-measured throughput on the text-only baseline as a proxy and rank models in descending order of that figure.
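To make the weighting concrete, here is a minimal sketch of how the five weights could be combined into one score. The weights come from the criteria above; the 0-to-1 normalization of each sub-score and the example values are assumptions for illustration, not the figures we actually used.

```javascript
// Weights taken from the criteria list above; together they sum to 1.
const WEIGHTS = {
  modalityBreadth: 0.40,
  intelligenceIndex: 0.25,
  imageTokenPricing: 0.20,
  toolUse: 0.10,
  throughput: 0.05,
};

// subscores: one value per criterion, each normalized to 0..1.
// The normalization itself is an assumption of this sketch.
function weightedScore(subscores) {
  let total = 0;
  for (const [criterion, weight] of Object.entries(WEIGHTS)) {
    const value = subscores[criterion];
    if (typeof value !== "number" || value < 0 || value > 1) {
      throw new RangeError(`${criterion} must be a number between 0 and 1`);
    }
    total += weight * value;
  }
  return total;
}

// Hypothetical sub-scores, for illustration only.
const exampleScore = weightedScore({
  modalityBreadth: 1.0,
  intelligenceIndex: 0.5,
  imageTokenPricing: 0.5,
  toolUse: 0.5,
  throughput: 0.5,
});
console.log(exampleScore.toFixed(2)); // prints "0.70"
```

Because modality breadth carries 40% of the weight, a model with full breadth and middling scores elsewhere still lands at 0.70 in this sketch.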

TOP 5 LEADERBOARD

The ranking

Ranked primarily by modality coverage breadth; the remaining weighted criteria separate models with equal breadth. Each of the top five accepts at least three input modalities. All are reachable through one ElliotGate API key, and multimodal payloads use the same OpenAI-compatible parts array as text-only payloads.

| # | Model | Provider | Modality breadth | Throughput | Price (in / out) |
|---|-------|----------|------------------|------------|------------------|
| 1 | Gemini 3.1 Pro Preview | Google | 5 input modalities | 143 tok/s | $2 / $12 |
| 2 | Gemini 3.1 Flash Lite Preview | Google | 5 input modalities | 321 tok/s | $0.25 / $1.5 |
| 3 | GPT-5.5 | OpenAI | 3 input modalities | 64 tok/s | $5 / $30 |
| 4 | Qwen3.6 Plus | Qwen | 3 input modalities | 52 tok/s | $0.5 / $3 |
| 5 | GPT-5.4 | OpenAI | 3 input modalities | 94 tok/s | $2.5 / $15 |

Pricing is per 1M tokens, USD, sourced from Artificial Analysis and matched against each provider's official rate. ElliotGate charges the same per-token rate as the upstream provider.
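As a rough illustration of how these per-1M-token rates turn into a bill, the sketch below estimates the cost of one request. The prices are copied from the leaderboard; the function name, the display-name keys, and the token counts are hypothetical. It also assumes every input token, including image tokens, is billed at the text input rate, which does not hold for every vendor.

```javascript
// Per-1M-token prices in USD, copied from the leaderboard above.
const PRICES = {
  "Gemini 3.1 Pro Preview": { input: 2, output: 12 },
  "Gemini 3.1 Flash Lite Preview": { input: 0.25, output: 1.5 },
  "GPT-5.5": { input: 5, output: 30 },
  "Qwen3.6 Plus": { input: 0.5, output: 3 },
  "GPT-5.4": { input: 2.5, output: 15 },
};

// Assumption: all input tokens are billed at the text input rate.
function estimateCostUSD(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Hypothetical workload: 200,000 input tokens and 20,000 output tokens.
const proCost = estimateCostUSD("Gemini 3.1 Pro Preview", 200_000, 20_000);
const liteCost = estimateCostUSD("Gemini 3.1 Flash Lite Preview", 200_000, 20_000);
console.log(proCost.toFixed(2), liteCost.toFixed(2)); // prints "0.64 0.08"
```

For a real budget, measure a representative sample rather than relying on this arithmetic, since image and video token counts vary by vendor.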

MODEL-BY-MODEL ANALYSIS

Why each model placed where it did

  1. #1

    Gemini 3.1 Pro Preview

    Google

    Gemini 3.1 Pro Preview is one of only two models in the ElliotGate catalog that accept text, image, audio, video, and file in one API call, and the stronger reasoner of the two. An AA Intelligence Index of 57.2 keeps it at frontier reasoning quality, GPQA Diamond 0.941 is the highest in the multimodal ranking, and throughput of 142.7 tok/s keeps long multimodal inputs from stalling the response. Pricing at $2 input / $12 output is mid-tier, and image tokens are billed at the input rate rather than at a separate rate. Best fit: multimodal RAG over video archives, accessibility tools that combine speech transcription with vision, document-and-recording workflows where one model handles everything.

    Strengths

    • One of two models with text+image+audio+video+file in one call
    • AA Intelligence Index 57.2 — frontier reasoning preserved
    • GPQA 0.941 — highest in multimodal ranking
    • 142.7 tok/s, second-fastest in the ranking

    Weaknesses

    • Preview status — vendor reserves the right to change pricing
    • Max output 65K — tight on very long generated transcripts
    • Cache write pricing not published
  2. #2

    Gemini 3.1 Flash Lite Preview

    Google

    Gemini 3.1 Flash Lite Preview matches Gemini Pro on modality breadth — text, image, audio, video, file — and is the throughput champion of the entire ElliotGate catalog at 321 tokens per second. AA Intelligence Index 33.5 limits its reasoning ceiling, but for triage-style multimodal tasks (which video clip contains an outage, which document references the contract clause) the ceiling rarely binds. Pricing of $0.25 input / $1.5 output is 8x lower than Gemini Pro on input. Best fit: high-volume multimodal triage, accessibility captions, real-time multimodal moderation.

    Strengths

    • Same modality breadth as Gemini Pro Preview
    • 321 tok/s — fastest model in ElliotGate catalog
    • $0.25 / $1.5 — 8x cheaper input than Gemini Pro
    • 1M context

    Weaknesses

    • AA Intelligence Index 33.5 — limits hard reasoning
    • Tau-2 0.313 — agent-grade tool use weak
    • Preview status
  3. #3

    GPT-5.5

    OpenAI

    GPT-5.5 accepts text, image, and file but does not natively accept audio or video — those have to be pre-processed (Whisper for audio, frame-extraction for video). For workloads where the visual modality is image-only (document understanding, OCR-driven extraction, screenshot analysis), 5.5's frontier reasoning makes it the best multimodal pick on raw quality: AA Intelligence Index 60.2, GPQA 0.935. Pricing of $5 input / $30 output is the multimodal ranking's highest, and image tokens compound that. Best fit: high-value-per-call document-understanding workflows where the answer accuracy is worth the multimodal premium.

    Strengths

    • AA Intelligence Index 60.2 — highest in multimodal ranking
    • GPQA 0.935, just behind Gemini 3.1 Pro Preview's 0.941
    • File input — PDFs handled natively
    • 1M context

    Weaknesses

    • No native audio or video — pre-process required
    • $5 / $30 — highest pricing in ranking
    • 64 tok/s, slower than the Gemini twins
  4. #4

    Qwen3.6 Plus

    Qwen

    Qwen3.6-Plus accepts text, image, and video natively — file and audio are not in its modality matrix. AA Intelligence Index 50 sits mid-tier, but Tau-2 of 0.977 (the single highest on our ranking) means tool calls after visual grounding stay reliable. Pricing of $0.5 input / $3 output makes it the cost leader among multimodal-with-video options. Best fit: video-aware agent workloads at scale (content moderation, video-driven dashboards, video-to-action automation) where Gemini's preview status is a liability and a stable vendor relationship matters.

    Strengths

    • Native text+image+video input
    • Tau-2 0.977 — strongest tool use after visual grounding
    • $0.5 / $3 — cost leader for video-capable models
    • 1M context

    Weaknesses

    • No audio or file input
    • AA Intelligence Index 50 — below frontier reasoning
    • Cache pricing not published
  5. #5

    GPT-5.4

    OpenAI

    GPT-5.4 closes the multimodal top five with the same modality matrix as 5.5 (text + image + file) at half the price: $2.5 input / $15 output. Its AA Intelligence Index of 56.8 trails 5.5's 60.2 by 3.4 points, a gap that rarely shows in everyday vision tasks (OCR, screenshot QA, document layout). For document-understanding workloads that don't need 5.5's marginal benchmark lead, 5.4 is the obvious default. Throughput of 94.4 tok/s on the text baseline is also faster than 5.5's. Best fit: most production multimodal applications that center on documents and screenshots rather than video.

    Strengths

    • $2.5 / $15 — half of GPT-5.5 with similar vision quality
    • 94.4 tok/s — faster than GPT-5.5
    • File input — PDFs handled natively
    • 1M context

    Weaknesses

    • No native audio or video
    • AA Intelligence Index 3.4 points below 5.5
    • No published cache write rate

EXAMPLE PROMPTS

Three prompts you can run today

Paste these into the ElliotGate playground or your own SDK. Each prompt exercises a different part of the task and gives you a real signal on which model fits your workload.

Document layout QA over a scanned contract

Prompt
Below is a 14-page scanned contract as a PDF attachment. Find every signature block and report: (a) party name, (b) page number, (c) whether the signature line is filled. Return JSON only, and flag any signature block that appears unfilled inside the JSON rather than on a separate line.
Expected behavior

Clean JSON, no prose. GPT-5.5 wins on document-layout reasoning; Gemini 3.1 Pro Preview is the fastest of the three. GPT-5.4 is the cost-quality middle.
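One way to check the 'clean JSON, no prose' expectation automatically is a strict parse of the whole reply. The sketch below is illustrative only: the field names `party`, `page`, and `filled` are hypothetical, and it assumes the model returns a JSON array with one object per signature block.

```javascript
// Returns the parsed value only if the entire reply is one JSON object or array.
// Any surrounding prose or Markdown fencing makes JSON.parse throw, so we return null.
function parseJsonOnly(reply) {
  try {
    const parsed = JSON.parse(reply.trim());
    return parsed !== null && typeof parsed === "object" ? parsed : null;
  } catch {
    return null;
  }
}

// Checks the assumed shape: [{ party: string, page: integer, filled: boolean }, ...]
function checkSignatureBlocks(reply) {
  const parsed = parseJsonOnly(reply);
  if (!Array.isArray(parsed)) return false;
  return parsed.every(
    (block) =>
      block !== null &&
      typeof block === "object" &&
      typeof block.party === "string" &&
      Number.isInteger(block.page) &&
      typeof block.filled === "boolean"
  );
}
```

Running the same check over replies from each model gives a simple pass rate to compare alongside cost and latency.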

Video-clip event extraction

Prompt
Below is a 90-second video of a manufacturing line. Identify every (a) machine state change, (b) operator action, with timestamps in mm:ss. Return CSV (event_type,timestamp,description). Do not narrate; just emit rows.
Expected behavior

Only Gemini Pro / Flash Lite and Qwen3.6-Plus accept video natively; the OpenAI models in this ranking would fail without frame extraction upstream.
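The capability gap described above lends itself to a simple pre-flight filter. The following sketch encodes the modality matrix from this ranking and returns the models that accept every modality a request needs; the function name and the display-name keys are illustrative, not part of any ElliotGate API.

```javascript
// Native input modalities per model, as described in this ranking.
const MODALITIES = {
  "Gemini 3.1 Pro Preview": ["text", "image", "audio", "video", "file"],
  "Gemini 3.1 Flash Lite Preview": ["text", "image", "audio", "video", "file"],
  "GPT-5.5": ["text", "image", "file"],
  "Qwen3.6 Plus": ["text", "image", "video"],
  "GPT-5.4": ["text", "image", "file"],
};

// Returns, in ranking order, the models that natively accept every required modality.
function modelsAccepting(required) {
  return Object.entries(MODALITIES)
    .filter(([, accepted]) => required.every((m) => accepted.includes(m)))
    .map(([name]) => name);
}

console.log(modelsAccepting(["text", "video"]));
// prints [ 'Gemini 3.1 Pro Preview', 'Gemini 3.1 Flash Lite Preview', 'Qwen3.6 Plus' ]
```

An empty result is the signal to add a pre-processing step, such as frame extraction or transcription, before the request is sent.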

Multimodal incident triage

Prompt
I will send three artifacts: an alert screenshot, a 30-second audio voice memo from on-call, and the relevant log file. Classify the incident severity (P0/P1/P2/P3), name the most likely root cause, and propose the next three actions. Return exactly one JSON object.
Expected behavior

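A canonical multimodal-triage workload. Gemini 3.1 Pro Preview and Gemini 3.1 Flash Lite Preview are the only models in this ranking that can read all three modalities in one call, and Pro is the safer choice for root-cause reasoning. GPT-5.5 would need Whisper pre-processing on the audio.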

QUICK START

Switch models with one line

Every ranked model accepts the same OpenAI-compatible request body. Change the model slug, keep the rest of the code, and you are routing across vendors with one API key.

Node.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OMINIGATE_API_KEY, // sk-omg-...
  baseURL: "https://api.elliotgate.com/v1",
});

const response = await client.chat.completions.create({
  model: "google/gemini-3.1-pro-preview",            // swap to any prod slug
  messages: [{ role: "user", content: "..." }],
});
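For a multimodal request, the `content` string becomes an array of parts. The helper below is a sketch under stated assumptions: the `text` and `image_url` shapes follow the OpenAI Chat Completions format, while the `audio_url` and `video_url` shapes are assumed to mirror `image_url` and should be checked against the ElliotGate documentation. The `buildUserMessage` name is hypothetical, and `file` parts are left out because their exact shape is not given here.

```javascript
// Part kinds this sketch knows how to build.
// Assumption: audio_url and video_url mirror the image_url shape.
const URL_PART_KINDS = ["image_url", "audio_url", "video_url"];

// Builds one user message whose content is a parts array.
function buildUserMessage(text, attachments = []) {
  const parts = [{ type: "text", text }];
  for (const { kind, url } of attachments) {
    if (!URL_PART_KINDS.includes(kind)) {
      throw new Error(`Unsupported part kind: ${kind}`);
    }
    parts.push({ type: kind, [kind]: { url } });
  }
  return { role: "user", content: parts };
}

// The result can be passed as an element of `messages` in the request above.
const message = buildUserMessage("What does this alert screenshot show?", [
  { kind: "image_url", url: "https://example.com/alert.png" },
]);
console.log(JSON.stringify(message, null, 2));
```

Keeping part construction in one helper means a model swap only changes the slug, not the code that assembles the payload.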

QUESTIONS WE GET

Frequently asked

No. Only Gemini 3.1 Pro Preview, Gemini 3.1 Flash Lite Preview, and Qwen3.6-Plus accept video natively. GPT-5.5 and GPT-5.4 accept text + image + file but require external frame extraction or transcription for video and audio. The ranking is built on modality breadth — your shortlist should narrow further to the modalities your pipeline actually needs.
Vendors differ. OpenAI and Anthropic include image input in the text-input billing as 'image tokens' (an image is converted to a fixed token count based on resolution). Google charges image input at the same per-token rate as text. Qwen charges image input separately at a per-image rate. For multimodal-heavy workloads, get a real cost measurement by running a representative sample through each model on ElliotGate — the dashboard usage page breaks down per-modality token cost exactly as the upstream provider would charge directly.
Yes, if the underlying model accepts them. Today that is only Gemini 3.1 Pro Preview and Gemini 3.1 Flash Lite Preview. ElliotGate's /v1/chat/completions endpoint passes the OpenAI-compatible parts array through to the upstream API. Send each modality as a part: text, image_url (with data URL or HTTP URL), audio_url, video_url, file. Models that don't accept a modality return a clear 400, so your client code can fall back to another model.
Opus 4.7 accepts text + image but not file, audio, or video. The two GPT-5 models on this ranking accept text + image + file, which beats Opus on breadth. If your workload is purely text + image and you don't need file or video, Opus 4.7 is a legitimate alternative — pair it with the agents ranking we publish if tool use matters.
Run audio through a transcription model first: OpenAI's Whisper or Google's audio-to-text on ElliotGate both work. Feed the transcript into GPT-5.4 or GPT-5.5 as text. This adds one model hop and ~1–2 seconds of latency but keeps you in the OpenAI ecosystem. For truly real-time audio reasoning, use Gemini 3.1 Pro Preview or Flash Lite Preview directly.
Yes. Multimodal input expands prompt length significantly. An image at 1024x1024 typically costs 800–1500 input tokens; a 10-second video costs much more. TTFT increases linearly with prompt length on most providers. Gemini Pro and Flash Lite are designed to keep TTFT low even with video; OpenAI and Anthropic models slow noticeably on image-heavy prompts. Plan for an extra 1–3 seconds of latency on multimodal payloads versus text-only.
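The fallback on a 400 mentioned in the answers above could look roughly like this. It is a sketch, not ElliotGate's recommended client: `createWithFallback` is a hypothetical helper, `send` is whatever function issues the request (for example a wrapper around `client.chat.completions.create`), and it assumes a rejected request surfaces as an error with a numeric `status` property. A 400 can also signal other request problems, so log the error before retrying in real code.

```javascript
// Tries each model slug in order and moves on only when the request is rejected with HTTP 400.
// Assumption: errors thrown by `send` expose the HTTP status as `err.status`.
async function createWithFallback(send, slugs, body) {
  let lastError;
  for (const model of slugs) {
    try {
      return await send({ ...body, model });
    } catch (err) {
      if (err && err.status === 400) {
        lastError = err; // likely an unsupported modality: try the next slug
        continue;
      }
      throw err; // auth, rate-limit, and server errors are not retried here
    }
  }
  throw lastError ?? new Error("No model slugs supplied");
}

// Usage (the slug order is your own routing decision):
// const reply = await createWithFallback(
//   (b) => client.chat.completions.create(b),
//   [preferredSlug, "google/gemini-3.1-pro-preview"],
//   { messages }
// );
```

Ordering the slugs from cheapest to most capable keeps the common case inexpensive while still guaranteeing an answer for payloads only the broadest models accept.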

Stop A/B-ing with vendor sprawl. Run the top 5 from one key.

Every model in the 'Combined text + image + audio + video input on a single API' ranking is one slug change away on ElliotGate. Same SDK, same balance, same dashboard.