Which Domestic Multimodal LLM Is the Most Efficient for Production?
The article benchmarks three Chinese multimodal large models—Step 3.7 Flash, MiniMax M3, and Qwen 3.6‑flash—across two real‑world tasks, measuring output quality, API latency, and token cost, and concludes that Step 3.7 Flash consistently offers the best speed‑cost trade‑off for production use.
Evaluation Methodology
To compare multimodal models fairly, the same task, prompt, configuration, and tools are used; the only variable is the model. For each task the API response time, token consumption (input, output, cache read, cache write) and token price (RMB) are recorded. Three dimensions are measured:
Quality – whether a single dialogue yields usable results without repeated clarification.
Speed – end‑to‑end latency suitable for high‑frequency agent calls.
Cost – model price per token and total token usage.
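The recording step described above can be sketched as a small harness. This is only an illustration under stated assumptions: `call_model` is a hypothetical adapter that returns token counts as a dict with the keys `input`, `output`, `cache_read`, and `cache_write`; real provider SDKs report usage in their own formats, and the article does not say which tooling was used.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CallRecord:
    """One benchmark observation for a single model call."""
    model: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int

    @property
    def total_tokens(self) -> int:
        # Sum of the four token categories recorded for each task.
        return (self.input_tokens + self.output_tokens
                + self.cache_read_tokens + self.cache_write_tokens)


def measure(model: str,
            call_model: Callable[[str, str], Dict[str, int]],
            prompt: str) -> CallRecord:
    """Time one call and capture the token usage it reports.

    `call_model` is a hypothetical adapter, not a real SDK function.
    """
    start = time.perf_counter()
    usage = call_model(model, prompt)
    latency = time.perf_counter() - start
    return CallRecord(
        model=model,
        latency_s=latency,
        input_tokens=usage.get("input", 0),
        output_tokens=usage.get("output", 0),
        cache_read_tokens=usage.get("cache_read", 0),
        cache_write_tokens=usage.get("cache_write", 0),
    )


def stub_call(model: str, prompt: str) -> Dict[str, int]:
    # Stand-in for a real API call, used only to show the harness running.
    return {"input": 100, "output": 50, "cache_read": 0, "cache_write": 0}


record = measure("example-model", stub_call, "same prompt for every model")
print(record.model, record.total_tokens)
```

Keeping the prompt and the adapter fixed while only the model name changes mirrors the rule that the model is the only variable.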
Scenario 1: Reconstruct Business Logic from a Flowchart
Each model receives a 10‑step WeChat mini‑program login flowchart together with the prompt "@wechat_login_flow.png extract workflow logic from the image".
All three models output the ten steps correctly, but their performance differs.
Step 3.7 Flash – API time 15 s, token consumption Input 728, Output 1.1k, Cache Read 54.4k, Cache Write 0, cost ¥0.0246.
MiniMax M3 – API time 20 s, token consumption Input 27.9k, Output 1.2k, Cache Read 228, Cache Write 0, cost ¥0.0688.
Qwen 3.6‑flash – API time 19 s, token consumption Input 251, Output 1.9k, Cache Read 0, Cache Write 28.6k, cost ¥0.0483.
Output quality is comparable, but Step 3.7 Flash is the fastest (15 s versus 19 s and 20 s) and the cheapest (¥0.0246 versus ¥0.0483 and ¥0.0688).
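To make the gap concrete, the Scenario 1 figures can be expressed relative to Step 3.7 Flash. The sketch below uses only the API times and costs reported above; the helper name `relative_to` is illustrative and not part of the original benchmark.

```python
# Scenario 1 figures as reported above: (API time in seconds, cost in RMB).
scenario_1 = {
    "Step 3.7 Flash": (15.0, 0.0246),
    "MiniMax M3": (20.0, 0.0688),
    "Qwen 3.6-flash": (19.0, 0.0483),
}


def relative_to(baseline, results):
    """Return (latency ratio, cost ratio) for each model versus the baseline."""
    base_time, base_cost = results[baseline]
    return {
        name: (round(t / base_time, 2), round(c / base_cost, 2))
        for name, (t, c) in results.items()
    }


ratios = relative_to("Step 3.7 Flash", scenario_1)
for name, (time_x, cost_x) in ratios.items():
    print(f"{name}: {time_x}x latency, {cost_x}x cost")
```

On these numbers MiniMax M3 takes about 1.33 times as long and costs about 2.8 times as much, while Qwen 3.6‑flash takes about 1.27 times as long and costs about 1.96 times as much.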
Scenario 2: Multimodal Assistance for Invoice Data Entry
Task: extract structured fields from an electronic invoice image and return JSON. Prompt used:
Extract structured information from this invoice image and return it in the following JSON format:
{
"invoice_type": "string",
"invoice_number": "string",
"invoice_date": "string",
"invoice_amount": "string",
"tax_rate": "string",
"tax_amount": "string",
"item_name": "string",
"buyer_tax_id": "string",
"buyer_bank": "string",
"seller_name": "string",
"seller_tax_id": "string",
"seller_bank": "string"
}
Invoice image used in the test:
Results:
Step 3.7 Flash – correct extraction, API time 5.6 s, total tokens 1,409, cost ¥0.0060.
MiniMax M3 – correct extraction, API time 6.1 s, total tokens 2,216, cost ¥0.0086.
Qwen 3.6‑flash – correct extraction, API time 7.38 s, total tokens 2,008, cost ¥0.0075.
All three models extract every field correctly. Step 3.7 Flash again leads on latency (5.6 s) and token usage (1,409 tokens), and at ¥0.0060 per invoice its cost stays below ¥0.01.
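In a data-entry pipeline, a reply is usually checked against the requested format before it is stored. The following is a minimal sketch of such a check, assuming the model returns the JSON object as plain text; the field list is copied from the prompt above, while the function name `check_invoice_reply` and the rule that extra fields are flagged are assumptions made for illustration.

```python
import json

# Field names taken from the JSON format in the prompt.
EXPECTED_FIELDS = [
    "invoice_type", "invoice_number", "invoice_date", "invoice_amount",
    "tax_rate", "tax_amount", "item_name", "buyer_tax_id", "buyer_bank",
    "seller_name", "seller_tax_id", "seller_bank",
]


def check_invoice_reply(reply_text):
    """Return a list of problems; an empty list means the reply matches."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for field in EXPECTED_FIELDS:
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], str):
            problems.append(f"field is not a string: {field}")
    # Flag anything the prompt did not ask for (an assumed policy).
    for extra in sorted(set(data) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {extra}")
    return problems


# Placeholder values only; this is not data from the tested invoice.
sample_reply = json.dumps({field: "placeholder" for field in EXPECTED_FIELDS})
print(check_invoice_reply(sample_reply))
```

A check like this covers the format only. Whether the extracted values are correct still has to be verified against the invoice itself.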
Overall Comparison
Across both scenarios the models show stable, accurate outputs. Relative performance:
Speed – Fast (Step 3.7 Flash) vs. Medium (MiniMax M3, Qwen 3.6‑flash).
Token consumption – Low (Step 3.7 Flash), Medium (Qwen 3.6‑flash), High (MiniMax M3).
Token price (RMB) – Cheap (Step 3.7 Flash), Medium (Qwen 3.6‑flash), Expensive (MiniMax M3).
Stability – Excellent for all three.
For production integration where latency and cost matter, Step 3.7 Flash provides the most favorable combination of speed, low token usage, and stable output.
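To see what the per-call differences mean at production volume, the Scenario 2 costs can be scaled up. The per-invoice costs below are the figures reported above; the volume of 100,000 invoices per month is a hypothetical chosen only for illustration, and the sketch assumes every invoice costs the same as the single test invoice.

```python
from decimal import Decimal

# Per-invoice cost in RMB, as reported in Scenario 2.
per_invoice_cost_rmb = {
    "Step 3.7 Flash": Decimal("0.0060"),
    "MiniMax M3": Decimal("0.0086"),
    "Qwen 3.6-flash": Decimal("0.0075"),
}

MONTHLY_INVOICES = 100_000  # hypothetical volume, not from the benchmark


def monthly_cost(volume):
    """Scale each model's per-invoice cost to the given number of invoices."""
    return {model: cost * volume for model, cost in per_invoice_cost_rmb.items()}


projected = monthly_cost(MONTHLY_INVOICES)
for model, cost in projected.items():
    print(f"{model}: ¥{cost:.2f} per month")
```

Decimal is used instead of float so the small per-invoice amounts scale without rounding noise.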
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
