Unveiling the ‘Elephant’: Ant’s Ling‑2.6‑flash LLM Delivers 1M Tokens for $0.10

Ant's newly released Ling-2.6-flash model, which debuted anonymously as "Elephant Alpha," combines a 104B-parameter MoE design with only 7.4B active weights per inference, achieving roughly ten-fold token savings, top-tier benchmark scores and a $0.10-per-million-input-token price that dramatically cuts inference costs for developers and enterprises.


Last week an anonymous model dubbed "Elephant Alpha" appeared on the OpenRouter platform without any branding, quickly climbing the Trending list with daily token calls exceeding 100B and week-over-week growth above 5,000%. Today Ant Bailing reveals that the model is its newly launched Ling-2.6-flash.

Model architecture and parameter efficiency

Ling-2.6-flash is an instruction-tuned (Instruct) LLM with 104B total parameters, of which only 7.4B are active (or "awakened") during any single inference step. It inherits the hybrid-linear MoE (Mixture-of-Experts) architecture of Ling-2.5: although the model is large, each forward pass routes every token to a small subset of experts, leaving the majority of parameters idle.
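
To make "sparse activation" concrete, here is a minimal PyTorch sketch of a top-k MoE layer. The dimensions, expert count and top-k value are illustrative assumptions only; the article does not disclose Ling-2.6-flash's actual routing configuration, and the hybrid-linear attention component is omitted.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Top-k MoE sketch; all sizes are hypothetical, not Ling's real config."""

    def __init__(self, d_model=512, n_experts=32, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)    # per-token routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (n_tokens, d_model). Each token runs through only its top_k
        # experts; the remaining n_experts - top_k stay idle for that token.
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e               # tokens assigned to expert e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(8, 512)).shape)                # torch.Size([8, 512])
```

With top_k=4 of 32 experts, only about 12% of the expert weights touch any given token; Ling's 7.4B-of-104B ratio works out to roughly 7%, which is where the speed and cost figures below come from.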

Token efficiency and hardware performance

On a four-GPU H20 setup, Ling-2.6-flash reaches a peak inference speed of 340 tokens/s and a prefill throughput 2.2× that of Nemotron-3-Super. In output-speed tests it sustains 215 tokens/s, placing it in the top tier among models with comparable parameter counts.

In a token-consumption benchmark the model completes the same evaluation with roughly one tenth of the tokens its peers require, reaching an Intelligence Index of 26 with just 15M output tokens, whereas Nemotron-3-Super needs over 110M tokens for a similar score (Artificial Analysis, 2024).
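
A quick back-of-envelope check of those figures, a sketch that takes the article's token counts at face value and, purely as a labeled assumption, prices the peer's output at Ling's own $0.30 per million output tokens:

```python
# Figures quoted above; the peer's per-token price is NOT given in the
# article, so Ling's output price is reused here as an assumption.
LING_TOKENS = 15_000_000      # output tokens to reach Intelligence Index 26
PEER_TOKENS = 110_000_000     # Nemotron-3-Super, per the article
PRICE_PER_M = 0.30            # USD per million output tokens (Ling-2.6-flash)

print(f"Ling run cost: ${LING_TOKENS / 1e6 * PRICE_PER_M:.2f}")   # $4.50
print(f"Peer run cost: ${PEER_TOKENS / 1e6 * PRICE_PER_M:.2f}")   # $33.00
print(f"Token ratio:   {PEER_TOKENS / LING_TOKENS:.1f}x")         # 7.3x
```

Note that against this single peer the ratio is about 7.3×, not quite the headline one-tenth.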

Targeted agent enhancement

Beyond raw token savings, Ling-2.6-flash is specifically tuned for agent-oriented scenarios. It attains state-of-the-art results on several agent benchmarks, including BFCL-V4 (tool-calling accuracy), TAU2-bench (complex workflow execution), SWE-bench Verified (real-world GitHub issue resolution), Claw-Eval and PinchBench, all regarded as demanding evaluation suites.
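
To give a sense of what the tool-calling tests exercise, here is a hedged sketch of a function-calling request through OpenRouter's OpenAI-compatible endpoint. The model slug ant/ling-2.6-flash and the get_weather tool are assumptions for illustration; consult OpenRouter's catalog for the real identifier.

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; the key and model slug
# below are placeholders.
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key="YOUR_OPENROUTER_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                 # hypothetical tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="ant/ling-2.6-flash",                # assumed slug; verify on OpenRouter
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)      # the structured call benchmarks grade
```

Suites like BFCL-V4 score whether the emitted tool_calls name the right function with correctly typed arguments.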

The model also maintains strong performance on general knowledge, mathematical reasoning, instruction following and long-text comprehension, making it a well-rounded generalist rather than a narrow agent specialist.

Pricing advantage

API pricing is $0.10 per million input tokens and $0.30 per million output tokens. By comparison, GPT-5.4 mini charges $0.391 per million input tokens and $4.50 per million output tokens, while GLM-4.5-Air charges $0.073 and $1.05 respectively. Combined with the lower token consumption, this translates directly into reduced inference cost, faster first-token latency and smoother interactions.
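
Putting those list prices side by side for an illustrative workload, with the request mix (2,000 input and 500 output tokens per call) an assumption chosen only to make the comparison concrete:

```python
# (USD per 1M input tokens, USD per 1M output tokens), as quoted above
PRICES = {
    "Ling-2.6-flash": (0.100, 0.30),
    "GPT-5.4 mini":   (0.391, 4.50),
    "GLM-4.5-Air":    (0.073, 1.05),
}
REQUESTS, IN_TOK, OUT_TOK = 1_000_000, 2_000, 500   # assumed workload mix

for model, (p_in, p_out) in PRICES.items():
    cost = REQUESTS * (IN_TOK * p_in + OUT_TOK * p_out) / 1e6
    print(f"{model:16s} ${cost:>6,.0f}")
# Ling-2.6-flash   $   350
# GPT-5.4 mini     $ 3,032
# GLM-4.5-Air      $   671
```

Note that GLM-4.5-Air's input price is actually lower than Ling's; Ling's edge comes from the output side and from emitting fewer tokens in the first place.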

Availability and roadmap

The Ling‑2.6‑flash API is now publicly available with a one‑week free trial through OpenRouter or Ant’s tbox platform. A commercial version, LingDT, is slated for release via Ant FinTech to serve global developers and SMBs.

Conclusion

From an anonymous debut to a clearly positioned "token-efficient" LLM, Ling-2.6-flash charts a path away from the traditional parameter-size race: delivering more work with far fewer computational resources, a compelling proposition for enterprises and developers burdened by inference costs.

Tags: Large Language Model · benchmark · AI inference · pricing · Token Efficiency