Why Modern LLMs Skip Thinking: Token Routing and Zero‑Compute Experts Explained

The article examines how large language models now use routing mechanisms and token‑level expert selection to reduce computation and cost, illustrating the trade‑offs with real‑world examples from OpenAI, LongCat, and DeepSeek while highlighting both the benefits and the pitfalls of this approach.


Many users feel that today’s AI models are getting less capable. The author recounts testing a $200 OpenAI subscription by giving ChatGPT the simple arithmetic problem 5.9 = x + 5.11 (whose correct answer is x = 0.79) and receiving a nonsensical answer.

Further experiments with more complex math showed that the model sometimes switches to a “fast‑thinking” mode, producing correct results but without deep reasoning.

These observations reflect a broader industry trend: large language models are increasingly equipped with routing modules that decide when to think deeply and when to take shortcuts, primarily to save tokens and computational cost.

Data from OpenAI indicates that GPT‑5’s output token count can be reduced by 50‑80% through such routing. DeepSeek reports a 20‑50% token reduction for its new model, while energy‑consumption analyses suggest that cutting token usage can save hundreds of thousands of kilowatt‑hours per day.
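A rough back-of-envelope shows how a per-response token cut scales to fleet-level energy. All three constants below are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope: fleet-wide energy saved by cutting output tokens.
# Every constant here is hypothetical, for illustration only.
WH_PER_1K_TOKENS = 0.3          # assumed energy per 1,000 generated tokens
DAILY_REQUESTS = 1_000_000_000  # assumed requests served per day
AVG_OUTPUT_TOKENS = 500         # assumed average output length per request

baseline_kwh = DAILY_REQUESTS * AVG_OUTPUT_TOKENS / 1_000 * WH_PER_1K_TOKENS / 1_000
for cut in (0.5, 0.8):          # the 50-80% reduction range cited for GPT-5
    print(f"{cut:.0%} fewer tokens -> ~{baseline_kwh * cut:,.0f} kWh/day saved")
```

Under these assumed numbers the savings land in the tens to low hundreds of thousands of kWh per day, consistent with the order of magnitude the article cites.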

The underlying technique is often called a “perception router.” During pre‑training, a lightweight language model learns to predict which expert model (e.g., a large, high‑capacity model or a smaller, faster one) is best suited for a given prompt. It compares its prediction with a ground‑truth answer, computes the error, and fine‑tunes its parameters to minimize that error.
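A minimal training sketch, assuming the router is a small classifier over frozen prompt embeddings and the label records which expert actually answered correctly. The architecture, dimensions, and labels are illustrative assumptions, not any vendor's implementation:

```python
import torch
import torch.nn as nn

class PerceptionRouter(nn.Module):
    """Tiny classifier: prompt embedding -> logits over candidate experts."""
    def __init__(self, embed_dim: int = 512, num_experts: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_experts),
        )

    def forward(self, prompt_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(prompt_embedding)

router = PerceptionRouter()
optimizer = torch.optim.AdamW(router.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One training step. In practice `embeddings` would come from a frozen
# encoder, and `best_expert` from checking each expert's answer against
# the ground truth (0 = small/fast model, 1 = large/deep model).
embeddings = torch.randn(32, 512)         # placeholder batch
best_expert = torch.randint(0, 2, (32,))  # placeholder labels
loss = loss_fn(router(embeddings), best_expert)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```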

After millions of training examples, the router can instantly assess a new prompt and dispatch it to the appropriate expert, effectively deciding whether the problem warrants deep computation.
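At inference time, the routing decision costs one cheap forward pass before dispatch. Continuing the sketch above (the expert names are hypothetical stand-ins):

```python
@torch.no_grad()
def dispatch(prompt_embedding: torch.Tensor) -> str:
    """Send the prompt to whichever expert scores highest."""
    expert_id = router(prompt_embedding).argmax(dim=-1).item()
    return "deep, high-capacity model" if expert_id == 1 else "fast, lightweight model"

print(dispatch(torch.randn(1, 512)))
```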

Another approach, illustrated by Meituan’s LongCat, is the “zero‑compute expert” mechanism. Tokens are first passed through a small “Top‑k Router” that classifies them as complex or simple. Simple tokens are sent to lightweight “lazy” experts, while complex tokens are handled by more powerful models.
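A minimal sketch of the idea, assuming a mixture-of-experts layer in which some experts are identity functions, so tokens routed to them consume no feed-forward compute at all. This is a schematic reading of the zero-compute-expert design, not LongCat's actual code, and it uses top-1 routing for brevity where LongCat's router selects top-k:

```python
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    """Router over real FFN experts plus 'free' identity experts."""
    def __init__(self, dim: int = 512, num_ffn: int = 4, num_zero: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_ffn + num_zero)  # routing scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_ffn)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim). Experts [0, num_ffn) are real FFNs;
        # the rest are zero-compute: their output is just the input.
        expert_ids = self.gate(x).argmax(dim=-1)
        out = x.clone()  # default = identity, i.e. the zero-compute path
        for i, expert in enumerate(self.experts):
            mask = expert_ids == i
            if mask.any():
                out[mask] = expert(x[mask])  # "complex" tokens get real compute
        return out

layer = ZeroComputeMoE()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)  # torch.Size([10, 512])
```

Tokens the gate judges simple fall through the layer unchanged, which is where the compute savings come from.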

This design yields significant cost savings for model providers and faster, cheaper responses for users. However, the routing logic can fail. For example, early GPT‑5 users reported that the router often refused to engage in deep reasoning, responding with generic acknowledgments and getting even simple token‑counting questions wrong.

When the router misbehaves, users can try to force deeper computation by adding phrases like “deep think” or “ultra think” to their prompts, though this is only a temporary fix.

In summary, routing and token‑level expert selection represent a promising direction for scaling LLMs efficiently, but the current implementations still deliver mixed user experiences, and further research is needed to balance cost savings with reliable reasoning.

Tags: AI, deep learning, Token Efficiency, model routing
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
