Unveiling the ‘Elephant’: Ant’s Ling‑2.6‑flash LLM Delivers 1M Tokens for $0.10
Ant’s newly released Ling‑2.6‑flash, previously hidden behind the anonymous “Elephant Alpha” label, pairs a 104B‑parameter MoE design with only 7.4B active parameters per inference, achieving roughly ten‑fold token savings, top‑tier benchmark scores, and pricing of $0.10 per million input tokens that dramatically cuts inference costs for developers and enterprises.
Last week an anonymous model dubbed “Elephant Alpha” appeared on the OpenRouter platform without any branding and quickly climbed the Trending list, with daily token calls exceeding 100B and week‑over‑week growth above 5,000%. Today Ant Bailing reveals that the model is its newly launched Ling‑2.6‑flash.
Model architecture and parameter efficiency
Ling‑2.6‑flash is an Instruct‑type LLM with 104B total parameters but only 7.4B active during inference, roughly 7% of the network. It inherits the hybrid‑linear MoE (Mixture‑of‑Experts) architecture of Ling‑2.5: although the model is large, each forward pass routes every token to a small subset of experts, so most parameters stay idle on any given pass.
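The sparse‑activation idea can be sketched as a toy top‑k router. This is illustrative only, not Ling's actual architecture: the dimensions, expert count, and k are made up, and real MoE layers route per token inside a transformer block.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy top-k MoE layer: only k of the experts run for this input.

    x:       (d,) input embedding
    gate_w:  (d, n_experts) router weights
    experts: list of (d, d) expert weight matrices
    """
    logits = x @ gate_w
    top_k = np.argsort(logits)[-k:]                   # indices of the k highest-scoring experts
    probs = np.exp(logits[top_k] - logits[top_k].max())
    probs /= probs.sum()                              # softmax over the selected experts only
    # Sparse combination: the remaining n_experts - k experts stay idle.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top_k))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
x = rng.standard_normal(d)
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k=2)   # 2 of 16 experts active per pass
```

The per‑pass compute scales with k rather than with the total expert count, which is how a 104B model can behave like a much smaller one at inference time.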
Token efficiency and hardware performance
On a 4‑GPU H20 setup, Ling‑2.6‑flash reaches a peak inference speed of 340 tokens/s and prefill throughput 2.2× that of Nemotron‑3‑Super. In output‑speed tests it sustains 215 tokens/s, placing it in the top tier among models of comparable parameter count.
In a token‑consumption benchmark the model completes the same evaluation using only 1/10 of the tokens required by peers, achieving an Intelligence Index of 26 with just 15 M output tokens, whereas Nemotron‑3‑Super needs over 110 M tokens for a similar score (Artificial Analysis, 2024).
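Those figures can be sanity‑checked with quick arithmetic, using only the numbers quoted above; the per‑point normalization below assumes both models land near the same Intelligence Index.

```python
# Tokens spent per Intelligence Index point, from the benchmark figures above.
ling_tokens, peer_tokens, index = 15e6, 110e6, 26

ling_per_point = ling_tokens / index    # tokens Ling-2.6-flash spends per index point
peer_per_point = peer_tokens / index    # tokens the peer spends per index point
savings = peer_tokens / ling_tokens     # relative output-token savings on this run
```

On this particular pairing the savings work out to a bit over 7×; the article's "1/10" figure is stated against the broader peer group.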
Targeted agent enhancement
Beyond raw token savings, Ling‑2.6‑flash is specially tuned for agent‑oriented scenarios. It attains state‑of‑the‑art results on several agent benchmarks, including BFCL‑V4 (tool‑calling accuracy), TAU2‑bench (complex workflow execution), SWE‑bench Verified (real‑world GitHub issue resolution), Claw‑Eval and PinchBench, all regarded as demanding agent‑evaluation suites.
The model also holds up well on general knowledge, mathematical reasoning, instruction following and long‑text comprehension, making it a well‑rounded generalist rather than a narrow agent specialist.
Pricing advantage
API pricing is $0.10 per million input tokens and $0.30 per million output tokens. By contrast, GPT‑5.4 mini charges $0.391 per million input tokens and $4.50 per million output tokens, while GLM‑4.5‑Air charges $0.073 and $1.05 respectively. Combined with the lower token consumption, this translates directly into reduced inference cost, faster time to first token and smoother interactions.
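A back‑of‑envelope comparison using the per‑million‑token prices quoted above (the 50M‑input / 20M‑output workload is an illustrative assumption, not a figure from the article):

```python
# Per-million-token prices (USD) as quoted in the article: (input, output).
PRICES = {
    "Ling-2.6-flash": (0.10, 0.30),
    "GPT-5.4 mini":   (0.391, 4.5),
    "GLM-4.5-Air":    (0.073, 1.05),
}

def cost(model, input_tokens, output_tokens):
    """Total USD cost of a workload at the listed per-million-token rates."""
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# Example workload: 50M input tokens, 20M output tokens.
for model in PRICES:
    print(f"{model}: ${cost(model, 50e6, 20e6):,.2f}")
# Ling-2.6-flash: $11.00
# GPT-5.4 mini:   $109.55
# GLM-4.5-Air:    $24.65
```

Note that the gap widens further if, as the benchmark above suggests, Ling‑2.6‑flash also emits fewer output tokens for the same task.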
Availability and roadmap
The Ling‑2.6‑flash API is now publicly available with a one‑week free trial through OpenRouter or Ant’s tbox platform. A commercial version, LingDT, is slated for release via Ant FinTech to serve global developers and SMBs.
Conclusion
From anonymous debut to a clearly positioned “token‑efficient” LLM, Ling‑2.6‑flash demonstrates a distinct path away from the traditional parameter‑size race: delivering more work with far fewer computational resources, a compelling proposition for enterprises and developers burdened by inference costs.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.