How We Scaled a 3.5B MoE LLM for Real‑Time Search Relevance

This article details the engineering challenges and solutions for deploying a 3.5 billion‑parameter MoE LLM in Taobao's search relevance pipeline, covering large‑batch scheduling, dynamic load balancing, intra‑batch KV‑Cache reuse, and MoE kernel tuning to meet sub‑second latency requirements.

Alibaba Cloud Developer

Background

In Taobao's search scenario, matching user queries with candidate items is critical to user experience. To better handle colloquial queries, a 3.5 B-parameter Mixture-of-Experts (MoE) LLM was introduced and the number of items scored per request was increased. This created three main challenges: massive point-wise computation that grows linearly with the candidate set size, very long prompts whose attention cost grows quadratically (O(N²)) with prompt length, and a strict end-to-end latency budget (≈500 ms for the relevance model).
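To make the first two pressures concrete, a rough back-of-envelope sketch (the candidate count and prompt length below are hypothetical, not production figures):

```python
# Illustrative scaling only; M and L are made-up values.
M = 300      # candidate items scored point-wise per request
L = 1500     # prompt tokens per (Query, Item) pair

# One forward pass per pair -> total work grows linearly in M,
# while self-attention inside each pass grows quadratically in L.
relative_cost = M * L ** 2
print(f"relative attention cost per request ~ {relative_cost:.2e}")  # ~6.8e+08
```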

Large‑Batch Scheduling & Load Balancing

The relevance model runs on a 3.5 B MoE LLM. Sending all M (Query, Item) pairs of a request to a single node would take seconds, which is unacceptable. The team scaled horizontally by splitting each batch across N inference nodes, with roughly M/N pairs per sub-batch. Traditional random load balancers (e.g., VIPServer) caused uneven queuing, so a custom Proxy service was built.

Uniform split by estimated compute: Instead of fixed batch sizes, the Proxy estimates token length for each (Query, Item) pair and groups them so each sub‑batch has roughly equal compute cost, avoiding stragglers.

Dynamic scheduling strategy: The Proxy tracks active request counts and completion status per downstream node. New large requests are dispatched via weighted round-robin or least-connections based on real-time load, keeping queueing and tail latency low across the cluster.
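A minimal sketch of how a Proxy might combine the two strategies. All names here (estimate_tokens, LeastLoadedDispatcher, etc.) are hypothetical; the production service surely uses the tokenizer for length estimates and tracks richer load signals:

```python
import heapq

def estimate_tokens(query: str, item: str) -> int:
    """Cheap per-pair compute proxy; a real Proxy would use the tokenizer."""
    return len(query) + len(item)

def split_by_compute(pairs: list[tuple[str, str]], n_nodes: int) -> list[list[tuple[str, str]]]:
    """Greedy balanced split: assign the next-largest pair to the sub-batch
    with the smallest accumulated token estimate (longest-processing-time rule)."""
    heap = [(0, i) for i in range(n_nodes)]      # (estimated tokens, bucket index)
    heapq.heapify(heap)
    buckets: list[list[tuple[str, str]]] = [[] for _ in range(n_nodes)]
    for pair in sorted(pairs, key=lambda p: -estimate_tokens(*p)):
        cost, idx = heapq.heappop(heap)
        buckets[idx].append(pair)
        heapq.heappush(heap, (cost + estimate_tokens(*pair), idx))
    return buckets

class LeastLoadedDispatcher:
    """Tracks in-flight sub-batches per node and always picks the least loaded."""
    def __init__(self, nodes: list[str]):
        self.active = {node: 0 for node in nodes}

    def pick(self) -> str:
        return min(self.active, key=self.active.get)

    def on_send(self, node: str) -> None:
        self.active[node] += 1

    def on_done(self, node: str) -> None:
        self.active[node] -= 1
```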

Two deployment patterns were evaluated for a batch size of 16:

Plan A (4 × 1 TP): Split the batch into four sub‑batches of size 4, each sent to an independent single‑GPU machine.

Plan B (1 × 4 TP): Send the whole batch to a single 4‑GPU tensor‑parallel machine.

Testing showed Plan A reduced latency by 12 ms compared to Plan B because it avoided inter‑GPU communication overhead and leveraged data parallelism.
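A back-of-envelope model of the trade-off. Every number below (per-pair time, layer count, all-reduce cost) is invented for illustration; only the structure, parallel per-node compute versus per-layer communication, mirrors the comparison above:

```python
def plan_a_latency(batch=16, nodes=4, ms_per_pair=8.0):
    """4 x 1-GPU data parallelism: sub-batches run concurrently, no inter-GPU traffic."""
    return (batch / nodes) * ms_per_pair

def plan_b_latency(batch=16, tp=4, ms_per_pair=8.0, layers=36, allreduce_ms=0.1):
    """1 x 4-GPU tensor parallelism: each GPU does 1/tp of every pair's compute,
    but every transformer layer adds two all-reduces (after attention and after the MLP/MoE)."""
    return (batch * ms_per_pair) / tp + layers * 2 * allreduce_ms

print(plan_a_latency(), plan_b_latency())  # 32.0 vs 39.2 with these made-up numbers
```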

Intra‑Batch Prefix Reuse

In relevance scoring, the same query token prefix appears in every (Query, Item) pair within a batch, leading to redundant attention computation. Existing KV‑Cache reuse works only across batches, not within a batch.
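For example (the prompt template below is made up; the real template is not shown in the article), every prompt in a batch starts with the same query tokens:

```python
query = "lightweight running shoes for wide feet"
items = ["Brand-A mesh trainer, wide fit", "Brand-B trail shoe, size 42", "Brand-C slip-on sneaker"]

# Point-wise scoring builds one prompt per (Query, Item) pair.
prompts = [f"Query: {query}\nItem: {item}\nIs the item relevant to the query?" for item in items]

# The "Query: ..." prefix is identical across all prompts, yet without intra-batch
# reuse its keys/values are recomputed once per pair.
```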

A naive two‑stage approach (pre‑compute the query prefix then process items) adds routing complexity and network overhead, negating benefits.

The solution introduces speculative block allocation and registration:

Pre‑registration during scheduling: When the scheduler assigns a block to the first request, it immediately registers the mapping of the shared query prefix tokens to that block, even before KV values are filled.

In‑batch reuse: Subsequent requests in the same batch find the prefix mapping in the Cache Manager and reuse the same block without reallocation.

Lazy filling: During attention calculation, the engine first computes KV values for the current token, updates the block, then reads the KV data for the final attention step, ensuring correctness under strict GPU kernel ordering.

To prevent cache contamination, an epoch identifier isolates block visibility to the originating batch. Only after successful batch completion are blocks committed globally; failures trigger a rollback that clears stale mappings.
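A minimal sketch of the mechanism described above, with hypothetical names (Block, PrefixCacheManager, epoch_id); lazy KV filling itself happens inside the attention kernel and is only hinted at in the comments:

```python
from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    epoch_id: int            # batch that speculatively owns this block
    committed: bool = False  # becomes True only after the batch succeeds

class PrefixCacheManager:
    def __init__(self):
        self.prefix_to_block: dict[tuple, Block] = {}
        self.next_id = 0

    def register_or_reuse(self, prefix_tokens: tuple, epoch_id: int) -> Block:
        """First request in a batch pre-registers the prefix before its KV values
        exist (they are filled lazily during attention); later requests in the
        same batch find the mapping and reuse the block without reallocation."""
        block = self.prefix_to_block.get(prefix_tokens)
        if block is not None and (block.committed or block.epoch_id == epoch_id):
            return block                      # committed or same-epoch reuse
        block = Block(self.next_id, epoch_id)
        self.next_id += 1
        self.prefix_to_block[prefix_tokens] = block
        return block

    def commit(self, epoch_id: int) -> None:
        """Batch finished successfully: make its blocks globally visible."""
        for block in self.prefix_to_block.values():
            if block.epoch_id == epoch_id:
                block.committed = True

    def rollback(self, epoch_id: int) -> None:
        """Batch failed: clear stale speculative mappings from that epoch."""
        self.prefix_to_block = {
            k: b for k, b in self.prefix_to_block.items()
            if not (b.epoch_id == epoch_id and not b.committed)
        }
```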

This intra‑batch reuse yields a 10 % latency gain, with further potential by extending reuse to repeated item attributes.

vLLM 1.0 later added similar intra-batch KV-Cache reuse, confirming the industry consensus on eliminating prefix redundancy.

MoE Kernel Dynamic Tuning

The MoE layer consumes >70 % of forward-pass time. In a small-batch, low-throughput online setting, sparse expert activation leaves many experts with fewer than 64 tokens to process, while the DeepGEMM backend pads token counts to 128, wasting compute.

A dynamic kernel selection mechanism was introduced:

Default high‑efficiency kernel: For dense workloads, the block‑128 kernel is chosen.

Padding overhead estimation: Before execution, the system computes the ratio of padded tokens to original tokens (the “padding overhead”).

Dynamic decision & switch: If the overhead exceeds a threshold (empirically 1.5), the scheduler switches to a block‑64 kernel, reducing unnecessary computation.
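A minimal sketch of the decision logic. The article specifies the block-128 default, the block-64 fallback, and the empirical threshold of 1.5, but not the exact definition of "padding overhead"; the version below reads it as the padded-to-real token ratio:

```python
def padded(tokens: int, block: int) -> int:
    """Token count actually computed after padding up to a multiple of the block size."""
    return -(-tokens // block) * block   # ceil(tokens / block) * block

def choose_moe_kernel(tokens_per_expert: list[int], threshold: float = 1.5) -> int:
    """Return the kernel block size: 128 (dense-friendly default) or 64 (sparse fallback)."""
    real = sum(tokens_per_expert)
    padded_total = sum(padded(t, 128) for t in tokens_per_expert if t > 0)
    overhead = padded_total / max(real, 1)   # one reading of the "padding overhead"
    return 64 if overhead > threshold else 128

# Small online batch: 8 activated experts with ~20 tokens each.
# padded_total = 8 * 128 = 1024, real = 160 -> overhead = 6.4 > 1.5 -> block-64.
print(choose_moe_kernel([20] * 8))   # 64

# Denser workload: 8 experts with ~120 tokens each -> overhead ~= 1.07 -> keep block-128.
print(choose_moe_kernel([120] * 8))  # 128
```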

Experiments show noticeable latency reductions, especially for smaller batches where sparsity is higher.

Future Optimization Opportunities

Expert parallelism (EP) was not adopted because of its latency cost, so today every card keeps all experts resident. If the communication bottleneck can be solved, EP with expert-level load balancing (EPLB) becomes a viable next step.

Another avenue is a dedicated attention operator that reuses the tail of the shared prefix that does not fill a complete KV-Cache page (and so cannot be shared by page-granular caching today), potentially adding another 3-5 % end-to-end improvement.

Conclusion

By combining fine-grained Proxy load balancing, intra-batch KV-Cache reuse, and dynamic MoE kernel tuning, the team successfully deployed the 3.5 B MoE LLM for online search relevance, achieving substantial latency and experience gains. Ongoing work will continue to push performance, efficiency, and cost boundaries as model capabilities evolve.
