Ling-2.6-flash: Faster Response, Stronger Execution, and Higher Token Efficiency for Agent Workloads

Ling-2.6-flash is a 104B-parameter Instruct model that uses a mixed-linear architecture and token-efficiency optimizations to achieve inference speeds of up to 340 tokens/s, 4× the throughput of comparable models, and roughly one-tenth their token consumption on agent benchmarks, while maintaining SOTA performance.


Background

Agent workloads dramatically increase input length (up to two orders of magnitude) and trigger frequent tool calls, causing token consumption and inference compute to explode. Reducing token usage while preserving capability is the core challenge.

Model Overview

Ling-2.6-flash is an Instruct model with 104B total parameters and 7.4B activated parameters. It builds on the Ling-2.5 architecture and replaces the original GQA attention with a 1:7 mixed-linear attention scheme (MLA + Lightning Linear) to improve both training and inference efficiency.
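To make the 1:7 ratio concrete, the sketch below lays out one plausible reading: one full (MLA) attention layer per block of eight, with the remaining seven layers using linear attention. The block size, layer placement, and names here are assumptions for illustration; the post does not specify the released layer schedule.

```python
# Hypothetical sketch of a 1:7 mixed-linear layer stack. "mla" and
# "linear" are stand-in labels; the real layer implementations and
# their exact placement are not published in this post.

FULL_ATTN_EVERY = 8  # assume 1 full-attention (MLA) layer per 7 linear layers

def build_layer_schedule(num_layers: int) -> list[str]:
    """Return the assumed attention type for each transformer layer."""
    schedule = []
    for i in range(num_layers):
        # Place the full-attention layer last in each block of 8, so the
        # cheap linear layers run first and one softmax layer follows.
        if (i + 1) % FULL_ATTN_EVERY == 0:
            schedule.append("mla")     # quadratic, full softmax attention
        else:
            schedule.append("linear")  # O(n) lightning linear attention
    return schedule

if __name__ == "__main__":
    print(build_layer_schedule(16))
    # -> 7x 'linear', 'mla', 7x 'linear', 'mla'
```

Placing the full-attention layer at the end of each block is a common hybrid-stack choice, since the linear layers handle cheap local mixing while the occasional softmax layer restores precise long-range retrieval; other placements are equally plausible here.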

Key Technical Advances

Mixed-linear architecture: The hybrid attention yields a peak inference speed of 340 tokens/s on a four-card H20 system and prefill throughput 2.2× that of Nemotron-3-Super, delivering a better cost-performance ratio.

Token-efficiency calibration: Token efficiency was explicitly optimized during pre-training. On the Artificial Analysis (AA) benchmark, Ling-2.6-flash consumes only 15M tokens, roughly one-tenth of what Nemotron-3-Super and similar models use, while achieving a comparable intelligence score (Intelligence Index = 26).

Agent-focused enhancements: Fine-tuning on demanding agent data improves tool-calling, multi-step planning, and long-range execution. The model attains near-SOTA results on BFCL-V4, TAU2-bench, SWE-bench Verified, Claw-Eval, and PinchBench, matching or surpassing models with larger activation-parameter counts.

Benchmark Results

In the AA Output Speed dimension, Ling-2.6-flash records 215 tokens/s, placing it in the top tier among 104B-class models. Throughput scales with context length: as both context and generation length increase, the model's advantage over baseline SOTA models grows, delivering faster first-token responses and higher sustained decode throughput (up to ~4× improvement in both prefill and decode).

Inference Optimizations

Operator fusion is applied across precision paths:

BF16 path: Deep fusion of QK Norm + RoPE, Group RMSNorm + Sigmoid Gate, and MoE Router GEMM + LM Head GEMM, using BF16 inputs with FP32 outputs. MLA RoPE and Top-K kernels are also fused.

FP8 path: RMSNorm and SwiGLU are fused with their quantization kernels; a Split-K blockwise FP8 GEMM is introduced for small batch sizes, unlocking additional throughput.

These optimizations keep the inference graph aligned with the training graph, improving RL rollout consistency.
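As a concrete illustration of the BF16-path fusion, the PyTorch sketch below chains QK RMSNorm and RoPE in a single compiled function, so the normalized q/k tensors are never materialized between two separate kernel launches. This is a minimal stand-in that relies on torch.compile; the kernels described above are custom fused implementations, and all shapes and names here are assumptions.

```python
import torch

# Illustrative only: the math a fused "QK Norm + RoPE" kernel covers in
# one pass over q/k. The actual kernels are custom; here torch.compile
# stands in to fuse the elementwise chain for demonstration purposes.

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Accumulate in FP32, cast back to the input dtype (e.g. BF16).
    x32 = x.float()
    scale = torch.rsqrt(x32.pow(2).mean(-1, keepdim=True) + eps)
    return (x32 * scale * weight.float()).to(x.dtype)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Standard rotate-half rotary embedding; cos/sin must broadcast to x.
    x1, x2 = x.chunk(2, dim=-1)
    return x * cos + torch.cat((-x2, x1), dim=-1) * sin

@torch.compile  # fuse the norm + rotation chain instead of two separate passes
def fused_qk_norm_rope(q, k, q_weight, k_weight, cos, sin):
    return (apply_rope(rms_norm(q, q_weight), cos, sin),
            apply_rope(rms_norm(k, k_weight), cos, sin))
```

In production such chains are typically hand-written as single kernels (e.g., in CUDA or Triton) rather than left to a compiler, which is presumably what "deep fusion" refers to above.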

Practical Demonstrations

Web page generation: Generates high-quality single-page prototypes at >200 tokens/s, correctly handling front-end component libraries.

INT4 quantized version on DGX Spark: Enables Hermes-style inference on industry-grade hardware.

Kilo Code styling: Produces visually appealing web layouts directly from textual prompts.

Prompt-driven workflow execution: Executes multi-step text tasks with natural-language flow.

Agent tool-calling: Extracts character and event graphs from classic literature (e.g., "Dream of the Red Chamber"); a request sketch follows this list.

Autonovel long-form writing: Generates million-word drafts at >200 tokens/s while maintaining plot consistency.

Task scheduling and requirement gathering: Delivers low-hallucination, high-usability answers for real-world pipelines.
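The request below sketches how the tool-calling demo could be driven through an OpenAI-compatible endpoint. The base URL, model id, and the extract_character_graph tool are hypothetical placeholders for illustration; the post does not publish a serving API.

```python
import json
from openai import OpenAI

# Hypothetical setup: any OpenAI-compatible server (e.g. vLLM) could
# expose the model this way; the URL and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "extract_character_graph",  # hypothetical tool, not a published API
        "description": "Record characters and their relationships from a passage.",
        "parameters": {
            "type": "object",
            "properties": {
                "characters": {"type": "array", "items": {"type": "string"}},
                "edges": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "from": {"type": "string"},
                            "to": {"type": "string"},
                            "relation": {"type": "string"},
                        },
                    },
                },
            },
            "required": ["characters", "edges"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Ling-2.6-flash",  # placeholder model id
    messages=[{"role": "user",
               "content": "Map the relationships among Baoyu, Daiyu, and Baochai."}],
    tools=tools,
)

# Print any structured tool calls the model emitted.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```

Run against a server hosting the model, this should yield one or more structured tool calls rather than free-form prose.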

Limitations and Future Work

Extreme agent scenarios can still produce occasional tool hallucinations, and bilingual (Chinese-English) switching and complex instruction adherence also need improvement. Future iterations will continue to balance token efficiency with output quality, aiming for greater stability, broader multilingual support, and a tighter coupling of token efficiency and intelligence.

Illustrative Figures

[Figure: Model overview diagram]
[Figure: Throughput comparison chart]
[Figure: Token efficiency comparison]
Tags: LLM Benchmark, Token Efficiency, Agent Optimization, Inference Efficiency