
Which Domestic Multimodal LLM Is the Most Efficient for Production?

The article benchmarks three Chinese multimodal large models—Step 3.7 Flash, MiniMax M3, and Qwen 3.6‑flash—across two real‑world tasks, measuring output quality, API latency, and token cost, and concludes that Step 3.7 Flash consistently offers the best speed‑cost trade‑off for production use.

Su San Talks Tech

Jul 1, 2026

Evaluation Methodology

To compare multimodal models fairly, the same task, prompt, configuration, and tools are used; the only variable is the model. For each task the API response time, token consumption (input, output, cache read, cache write) and token price (RMB) are recorded. Three dimensions are measured:

Quality – whether a single dialogue yields usable results without repeated clarification.

Speed – end‑to‑end latency suitable for high‑frequency agent calls.

Cost – model price per token and total token usage.
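The measurements described above could be collected with a small harness like the following. This is a minimal sketch, not the author's actual tooling: `Usage`, `cost_rmb`, `timed_call`, and the `call_model` callable are hypothetical names, and the prices in `EXAMPLE_PRICES` are placeholders rather than any vendor's real price list.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Usage:
    """Token counts for one API call, split the same way as in the article."""
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


# Placeholder prices in RMB per million tokens, NOT any vendor's real price list.
EXAMPLE_PRICES: Dict[str, float] = {
    "input": 1.0,
    "output": 4.0,
    "cache_read": 0.2,
    "cache_write": 1.25,
}


def cost_rmb(usage: Usage, prices: Dict[str, float]) -> float:
    """Multiply each token bucket by its per-million-token price and sum."""
    total = (
        usage.input_tokens * prices["input"]
        + usage.output_tokens * prices["output"]
        + usage.cache_read_tokens * prices["cache_read"]
        + usage.cache_write_tokens * prices["cache_write"]
    )
    return total / 1_000_000


def timed_call(call_model: Callable[[str], Usage], prompt: str) -> Tuple[float, Usage]:
    """Run one request and return (elapsed seconds, token usage)."""
    start = time.perf_counter()
    usage = call_model(prompt)
    return time.perf_counter() - start, usage
```

In practice `call_model` would wrap each provider's SDK and map its usage fields onto `Usage`, so that the same prompt can be replayed against every model with only the wrapper changing.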

Scenario 1: Reconstruct Business Logic from a Flowchart

A 10‑step WeChat mini‑program login flowchart is supplied to each model. Prompt used:

@wechat_login_flow.png extract workflow logic from the image

[Figure: the WeChat mini‑program login flowchart supplied as input]

All three models output the ten steps correctly, but their performance differs.

Step 3.7 Flash – API time 15 s, token consumption Input 728, Output 1.1k, Cache Read 54.4k, Cache Write 0, cost ¥0.0246.

MiniMax M3 – API time 20 s, token consumption Input 27.9k, Output 1.2k, Cache Read 228, Cache Write 0, cost ¥0.0688.

Qwen 3.6‑flash – API time 19 s, token consumption Input 251, Output 1.9k, Cache Read 0, Cache Write 28.6k, cost ¥0.0483.

Output quality is comparable, but Step 3.7 Flash is the fastest (15 s versus 19 to 20 s) and the cheapest (¥0.0246 versus ¥0.0483 to ¥0.0688).
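To put the Scenario 1 gap in relative terms, the reported figures can be compared directly. The sketch below uses only the API times and costs listed above; the helper names `percent_lower` and `savings_vs` are illustrative.

```python
from typing import Dict, Tuple

# Scenario 1 figures as reported above: (API time in seconds, cost in RMB).
SCENARIO_1: Dict[str, Tuple[float, float]] = {
    "Step 3.7 Flash": (15.0, 0.0246),
    "MiniMax M3": (20.0, 0.0688),
    "Qwen 3.6-flash": (19.0, 0.0483),
}


def percent_lower(other: float, value: float) -> float:
    """How many percent `value` is below `other`."""
    return (other - value) / other * 100


def savings_vs(winner: str, results: Dict[str, Tuple[float, float]]) -> Dict[str, Tuple[float, float]]:
    """For every other model, return (time saved %, cost saved %) by using `winner`."""
    win_time, win_cost = results[winner]
    return {
        name: (
            round(percent_lower(seconds, win_time), 1),
            round(percent_lower(cost, win_cost), 1),
        )
        for name, (seconds, cost) in results.items()
        if name != winner
    }


if __name__ == "__main__":
    for name, (time_pct, cost_pct) in savings_vs("Step 3.7 Flash", SCENARIO_1).items():
        print(f"vs {name}: {time_pct}% less time, {cost_pct}% lower cost")
```

On these single-run numbers, Step 3.7 Flash takes about 25% less time and costs about 64% less than MiniMax M3, and takes about 21% less time and costs about 49% less than Qwen 3.6‑flash.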

Scenario 2: Multimodal Assistance for Invoice Data Entry

Task: extract structured fields from an electronic invoice image and return JSON. Prompt used:

Extract structured information from this invoice image and return it in the following JSON format:
{
  "invoice_type": "string",
  "invoice_number": "string",
  "invoice_date": "string",
  "invoice_amount": "string",
  "tax_rate": "string",
  "tax_amount": "string",
  "item_name": "string",
  "buyer_tax_id": "string",
  "buyer_bank": "string",
  "seller_name": "string",
  "seller_tax_id": "string",
  "seller_bank": "string"
}
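Before writing extracted values into a downstream system, a model's reply to this prompt can be checked against the requested shape. The following is a minimal sketch, assuming the reply is a bare JSON object; the function name `validate_invoice_json` is hypothetical, and the field list is copied from the prompt above.

```python
import json
from typing import Dict, List

# The twelve fields requested in the prompt, all expected to be strings.
EXPECTED_FIELDS: List[str] = [
    "invoice_type", "invoice_number", "invoice_date", "invoice_amount",
    "tax_rate", "tax_amount", "item_name", "buyer_tax_id", "buyer_bank",
    "seller_name", "seller_tax_id", "seller_bank",
]


def validate_invoice_json(raw: str) -> Dict[str, str]:
    """Parse the reply and raise ValueError if a field is missing or not a string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    missing = [f for f in EXPECTED_FIELDS if f not in data]
    non_string = [f for f in EXPECTED_FIELDS if f in data and not isinstance(data[f], str)]
    if missing or non_string:
        raise ValueError(f"missing fields: {missing}; non-string fields: {non_string}")
    return data
```

A check like this only confirms the structure of the reply; whether each value matches the invoice still has to be verified against the image or a labeled reference.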

[Figure: the electronic invoice image used in the test]

Results:

Step 3.7 Flash – correct extraction, API time 5.6 s, total tokens 1,409, cost ¥0.0060.

MiniMax M3 – correct extraction, API time 6.1 s, total tokens 2,216, cost ¥0.0086.

Qwen 3.6‑flash – correct extraction, API time 7.38 s, total tokens 2,008, cost ¥0.0075.

All three models extract every field correctly on the test invoice; Step 3.7 Flash again leads in latency (5.6 s) and token usage (1,409 tokens), bringing the per‑invoice cost to ¥0.0060, below ¥0.01.
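The per-invoice differences look small in isolation, so it can help to scale them to a batch. The sketch below multiplies the reported per-invoice costs by a volume; the volume of 100,000 invoices is a hypothetical figure, and the calculation assumes every invoice costs the same as the single test invoice, which the article does not verify.

```python
from typing import Dict

# Per-invoice costs in RMB as reported in the Scenario 2 results.
COST_PER_INVOICE_RMB: Dict[str, float] = {
    "Step 3.7 Flash": 0.0060,
    "MiniMax M3": 0.0086,
    "Qwen 3.6-flash": 0.0075,
}


def projected_cost(volume: int) -> Dict[str, float]:
    """Scale each model's per-invoice cost linearly to `volume` invoices (RMB)."""
    return {name: round(cost * volume, 2) for name, cost in COST_PER_INVOICE_RMB.items()}


if __name__ == "__main__":
    # 100,000 is an assumed batch size chosen only for illustration.
    for name, total in projected_cost(100_000).items():
        print(f"{name}: ¥{total:.2f}")
```

Under that assumption, 100,000 invoices would cost about ¥600 with Step 3.7 Flash, ¥860 with MiniMax M3, and ¥750 with Qwen 3.6‑flash.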

Overall Comparison

Across both scenarios the models show stable, accurate outputs. Relative performance:

Speed – Fast (Step 3.7 Flash), Medium (MiniMax M3, Qwen 3.6‑flash).

Token consumption – Low (Step 3.7 Flash), Medium (Qwen 3.6‑flash), High (MiniMax M3).

Token price (RMB) – Cheap (Step 3.7 Flash), Medium (Qwen 3.6‑flash), Expensive (MiniMax M3).

Stability – Excellent for all three.

For production integration where latency and cost matter, Step 3.7 Flash provides the most favorable combination of speed, low token usage, and stable output.


