How Claude Code Handles max_output_tokens and Model Downgrade to Keep Agents Running
The article explains Claude Code's multi‑level fault‑tolerance for max_output_tokens errors, detailing dynamic token allocation, automatic model downgrade, environment‑variable controls, StopFailure hooks, and their coordination with compaction to prevent agents from getting stuck during long‑running tasks.
Why max_output_tokens becomes a problem
LLM APIs enforce a hard token limit per response that is defined in the model specification. For example, claude-opus-4-6 reports a default output limit of 16,000 tokens and an absolute upper limit of 64,000 tokens via QF(modelId), which returns { default: number, upperLimit: number }. The default value is the quota used for normal requests; the upperLimit is applied only when the environment variable CLAUDE_CODE_MAX_OUTPUT_TOKENS is set.
When an agent generates large modules, test suites, or long documents, the model may stop mid‑stream with stop_reason: "max_tokens", returning truncated JSON or a truncated code snippet. Some providers also reject a request outright with a 400 error when its requested output budget exceeds 32,000 tokens.
Source location of the recovery logic
The recovery code lives in src/query.ts around line 1730, inside the main loop’s error‑handling block.
async function* runQuery(params: QueryParams, deps: QueryDeps): AsyncGenerator<Message | ProgressEvent, void, unknown> {
  let maxOutputTokensRetryCount = 0; // recovery counter
  let currentModel = params.model;   // may be downgraded
  while (true) {
    try {
      const response = await callAPI({
        model: currentModel,
        max_tokens: resolveMaxOutputTokens(currentModel, maxOutputTokensRetryCount),
        // ...
      });
      if (response.stop_reason === 'max_tokens') {
        const recovery = handleMaxTokensExceeded(currentModel, maxOutputTokensRetryCount);
        if (recovery.action === 'retry') {
          maxOutputTokensRetryCount++;
          continue;
        }
        if (recovery.action === 'downgrade') {
          currentModel = recovery.fallbackModel;
          maxOutputTokensRetryCount = 0;
          continue;
        }
        yield makeErrorMessage(response);
        break;
      }
      // normal path
      yield response;
    } catch (error) {
      if (isMaxOutputTokensError(error)) {
        maxOutputTokensRetryCount++;
        continue;
      }
      throw error;
    }
  }
}

Two failure paths converge on the same recovery logic:
stop_reason = "max_tokens" – the request succeeds but the response is truncated.
isMaxOutputTokensError(error) – the API rejects the request outright.
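As a rough illustration of the second path, a predicate like isMaxOutputTokensError could look something like the sketch below; the error shape (status and message fields) is an assumption for illustration, not the actual implementation.

function isMaxOutputTokensError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) return false;
  const e = error as { status?: number; message?: string };
  // Assumed shape: providers that cap output tokens (e.g. at 32,000) typically
  // reject with a 400 whose message mentions max_tokens.
  return e.status === 400 && /max_tokens/i.test(e.message ?? '');
}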
resolveMaxOutputTokens: three‑tier dynamic allocation
The function resolveMaxOutputTokens implements a progressive expansion strategy:
First request uses default (conservative, sufficient for most cases).
First retry uses a middle value (moderate increase to probe the boundary).
Subsequent retries use upperLimit (full allocation).
This balances token cost against availability, expanding only when necessary.
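A minimal sketch of that strategy, assuming QF(modelId) is the model‑spec lookup described above and that the middle tier sits halfway between the two bounds (the exact tier values are not specified in the source):

declare function QF(modelId: string): { default: number; upperLimit: number };

function resolveMaxOutputTokens(modelId: string, retryCount: number): number {
  const { default: def, upperLimit } = QF(modelId);
  if (retryCount === 0) return def;                                // first request: conservative default
  if (retryCount === 1) return Math.floor((def + upperLimit) / 2); // first retry: probe the boundary
  return upperLimit;                                               // later retries: full allocation
}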
Model downgrade when token expansion fails
interface RecoveryDecision {
  action: 'retry' | 'downgrade' | 'fail';
  fallbackModel?: string;
}

function handleMaxTokensExceeded(currentModel: string, retryCount: number): RecoveryDecision {
  const isAbsoluteMax = getCurrentMaxTokens(currentModel, retryCount) >= getModelSpec(currentModel).upperLimit;
  if (!isAbsoluteMax) {
    return { action: 'retry' };
  }
  const fallback = getFallbackModel(currentModel);
  if (fallback && !hasAttemptedFallback(fallback)) {
    return { action: 'downgrade', fallbackModel: fallback };
  }
  return { action: 'fail' };
}

function getFallbackModel(modelId: string): string | null {
  const FALLBACK_CHAIN: Record<string, string> = {
    'claude-opus-4-5': 'claude-sonnet-4-5',
    'claude-sonnet-4-5': 'claude-haiku-3-5',
  };
  return FALLBACK_CHAIN[modelId] ?? null;
}

The downgrade chain is one‑way (Opus → Sonnet → Haiku) and is triggered only after all token‑expansion attempts are exhausted.
Environment variable control
The /doctor command validates both CLAUDE_CODE_MAX_OUTPUT_TOKENS and BASH_MAX_OUTPUT_LENGTH. For CLAUDE_CODE_MAX_OUTPUT_TOKENS, the effective default and upperLimit are derived from QF(modelId). Common settings are 16,000 (default), 32,000 (large refactors), and 64,000 (full scaffolds); values above upperLimit are automatically clamped.
When a third‑party proxy is used, the variable cannot raise the proxy's own, lower limit, so requests can still be rejected.
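A hedged sketch of how the override could be read and clamped; the helper name resolveConfiguredMaxTokens and the validation details are assumptions, only the clamping behaviour is taken from the description above:

declare function QF(modelId: string): { default: number; upperLimit: number };

function resolveConfiguredMaxTokens(modelId: string): number {
  const { default: def, upperLimit } = QF(modelId);
  const raw = process.env.CLAUDE_CODE_MAX_OUTPUT_TOKENS;
  if (!raw) return def;                                            // unset: use the model's default quota
  const requested = Number.parseInt(raw, 10);
  if (!Number.isFinite(requested) || requested <= 0) return def;   // invalid value: ignore it
  return Math.min(requested, upperLimit);                          // anything above upperLimit is clamped
}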
StopFailure hook: notification on unrecoverable errors
If all recovery attempts fail, Claude Code emits a StopFailure hook. The payload's error field identifies the failure category with values such as 'rate_limit', 'authentication_failed', 'billing_error', 'invalid_request', 'server_error', 'max_output_tokens', and 'unknown', while the last_assistant_message field carries the truncated message for diagnosis.
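For reference, the payload consumed by the hook script below would have roughly this shape; only error and last_assistant_message are named in the source, and any further structure is an assumption:

type StopFailureError =
  | 'rate_limit'
  | 'authentication_failed'
  | 'billing_error'
  | 'invalid_request'
  | 'server_error'
  | 'max_output_tokens'
  | 'unknown';

interface StopFailureHookInput {
  error: StopFailureError;          // machine-readable failure category
  last_assistant_message?: string;  // truncated output kept for diagnosis
}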
#!/bin/bash
# .claude/hooks/StopFailure.sh – alert on max_output_tokens failure
HOOK_INPUT=$(cat)
ERROR=$(echo "$HOOK_INPUT" | jq -r '.error')
LAST_MSG=$(echo "$HOOK_INPUT" | jq -r '.last_assistant_message // "N/A"')

if [ "$ERROR" = "max_output_tokens" ]; then
  echo "❌ Claude output exceeded token limit, task aborted: $LAST_MSG" >&2
fi

Collaboration with Compaction
When stop_reason = max_tokens occurs, Claude Code first checks whether the context is also oversized. If so, it runs autoCompact(messages) to shrink the context, then retries with a larger token budget. This design separates input‑size handling (Compaction) from output‑size handling (max_tokens recovery) while sharing the same error path.
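A sketch of that hand‑off, assuming a context‑size check (isContextOversized) and treating autoCompact(messages) as the compaction entry point; the helper names and call order here are illustrative, not the verbatim implementation:

async function recoverFromMaxTokens(
  messages: Message[],
  modelId: string,
  retryCount: number,
): Promise<{ messages: Message[]; maxTokens: number }> {
  if (isContextOversized(messages, modelId)) {
    // Input-side pressure: shrink the conversation before retrying.
    messages = await autoCompact(messages);
  }
  // Output-side pressure: retry with the next tier of the output budget.
  const maxTokens = resolveMaxOutputTokens(modelId, retryCount + 1);
  return { messages, maxTokens };
}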
Design insights
Progressive resource allocation (default → middle → upperLimit) yields a better cost‑availability trade‑off and can be applied to other resources such as DB connection pools.
State‑machine‑based error handling (using maxOutputTokensRetryCount) is more precise than simple try‑catch loops.
The downgrade chain is unidirectional and conditional, preventing unexpected quality drops.
Hook‑based observability elevates operational visibility to a first‑class concern.
Critical perspective
Model downgrade is silent; the user is not notified when the system falls back to a lower‑capability model.
The three‑tier token steps are coarse; intermediate values may still be insufficient while the upper limit is costly.
There is no per‑task token configuration; a single global max_tokens parameter is used for all task types.
The interaction between recovery and compaction is implicit, lacking an explicit coordination API.
Practical recommendations
For long‑code‑generation tasks, expose max_tokens as a configurable option and allow optional model fallback. Monitor token usage via hooks such as .claude/hooks/PostToolUse.sh. Suggested environment settings:
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=16000 for daily coding.
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=32000 for large refactors.
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000 for scaffold generation (use cautiously).
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.
