How to Cut Large‑Model Token Usage by Over 90%

The article analyses why AI Skills waste massive token counts, demonstrates a pure‑Skill implementation that costs $10 and 12 minutes, then shows a code‑plus‑model hybrid that reduces runtime to 17 seconds, API calls to one, and cost to $0.004, saving more than 99% of tokens.

IT Services Circle

While browsing AI programming communities, I noticed many people trying to save tokens by caching prompts, switching to cheaper models, or writing a save_token.skill. In reality, the real token eater is the Skill itself, not the model's response or the prompt.

What a Skill Does

A Skill is a predefined instruction that tells a large model how to perform a task step by step. For example, a Skill could ask the model to open a browser, go to Amazon, search for "jeans", and return the cheapest product’s name and link.

Open the browser, visit Amazon, search "jeans", find the cheapest item on the first page, and return its name and link.

Version 1: Pure Skill Implementation

The author wrote a Skill with the following steps:

## Steps
1. Navigate to https://www.amazon.com
2. Find the search box, type "jeans" and search
3. Wait for results to load
4. Capture a page snapshot and analyze all product names and prices on the first page
5. Compare all prices and pick the cheapest product
6. Return the product name, price, and link

Running this Skill in Claude Code (model Opus 4.6) required 12 minutes and dozens of API calls, each carrying roughly 100 k tokens. The total cost was about $10.

Where the Tokens Go

Each tool call forces the model to resend the entire context (system prompt, tool definitions, Skill content, conversation history). Examples of token‑heavy steps:

- Navigate to Amazon’s homepage: ~100 k tokens
- Find the search box and type the keyword: ~200 k tokens
- Click the search button: ~100 k tokens
- Analyze the search results: another ~100 k tokens

Only the final price‑comparison step truly needs the model’s intelligence; the rest are deterministic actions that could be done with a few lines of Python.
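A back-of-envelope check makes the gap concrete. The call count here is an assumption (the article only says "dozens"), but resending ~100 k tokens of context on even ten calls already reaches the ~1 M total the article reports, while the hybrid's single call stays around 2.1 k:

```python
# Rough token arithmetic using the article's figures.
# Assumption: ~10 tool calls, each resending ~100k tokens of context.
calls = 10
context_per_call = 100_000

pure_skill_tokens = calls * context_per_call   # every call repeats the full context
hybrid_tokens = 2_100                          # one call with a compact product list

print(pure_skill_tokens)                       # 1,000,000
print(pure_skill_tokens // hybrid_tokens)      # ~476x fewer tokens
```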

Version 2: Code + Model Hybrid

The solution is to move all deterministic steps into Python code (using Playwright) and keep only the price‑comparison step for the model.

from playwright.sync_api import sync_playwright
import openai

def find_cheapest_jeans():
    # ===== Part 1: Pure code, no model needed =====
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # 1. Go to Amazon (deterministic)
        page.goto("https://www.amazon.com")
        # 2. Search for jeans (deterministic)
        page.fill('input[name="field-keywords"]', 'jeans')
        page.press('input[name="field-keywords"]', 'Enter')
        page.wait_for_load_state('networkidle')
        # 3. Extract results (deterministic)
        results = page.query_selector_all('div[data-component-type="s-search-result"]')
        products = []
        for result in results[:20]:  # only first 20
            title_el = result.query_selector('h2 span')
            price_el = result.query_selector('.a-price .a-offscreen')
            link_el = result.query_selector('h2 a')
            if title_el and price_el and link_el:
                products.append({
                    'title': title_el.inner_text(),
                    'price': price_el.inner_text(),
                    'url': 'https://www.amazon.com' + link_el.get_attribute('href')
                })
        browser.close()
    # ===== Part 2: Intelligence needed =====
    client = openai.OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="your-key"
    )
    product_text = "\n".join(
        f"[{i+1}] {p['title']} | {p['price']}" for i, p in enumerate(products)
    )
    response = client.chat.completions.create(
        model="minimax/minimax-m1",
        messages=[{
            "role": "user",
            "content": f"Below are the Amazon search results for jeans. Return only the index of the cheapest item.\n\n{product_text}"
        }]
    )
    idx = int(response.choices[0].message.content.strip()) - 1
    cheapest = products[idx]
    return cheapest

result = find_cheapest_jeans()
print(f"Cheapest jeans: {result['title']}")
print(f"Price: {result['price']}")
print(f"Link: {result['url']}")

This version makes a single API call, consumes about 2.1 k tokens, and costs $0.004 – a saving of more than 99%.

Effect Comparison

Running the hybrid code finishes in 17 seconds. OpenRouter shows:

- API calls: 1 (vs. dozens)
- Tokens: ~2.1 k (vs. ~1 M)
- Cost: $0.004 (vs. ~$10)

Runtime dropped from 12 minutes to 17 seconds (≈40× faster) and cost dropped by >99%.

Why the Gap Is So Large

Every tool call forces the model to resend the whole previous context, so each additional round inflates the token count dramatically. Deterministic browser actions generate large HTML snapshots that dominate the token usage. By moving those actions to Playwright, the model only receives a concise list of product texts (~2 k tokens).
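The compounding effect can be sketched numerically. The sizes below are illustrative assumptions, not measurements: a fixed base context plus a history that grows by one HTML-heavy tool result per round sums to roughly a million tokens within ten rounds:

```python
# Illustrative growth of resent context across tool-call rounds (assumed sizes).
base = 20_000       # system prompt + tool definitions + Skill text
per_round = 15_000  # tool output (page snapshot) appended to history each round

# Round i resends the base context plus all i earlier tool results.
total = sum(base + i * per_round for i in range(1, 11))
print(total)  # 1,025,000 tokens across 10 rounds
```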

Using Claude Subscription with Agent SDK

If you have a Claude Pro or Team subscription, you can avoid extra API fees entirely by using the Claude Agent SDK, which runs Claude Code locally under your subscription quota.

import anyio
from playwright.async_api import async_playwright
from claude_agent_sdk import query, ClaudeAgentOptions, AssistantMessage, TextBlock

async def find_cheapest_jeans():
    # ===== Part 1: Pure code, no model needed =====
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://www.amazon.com")
        await page.fill('input[name="field-keywords"]', 'jeans')
        await page.press('input[name="field-keywords"]', 'Enter')
        await page.wait_for_load_state('networkidle')
        results = await page.query_selector_all('div[data-component-type="s-search-result"]')
        products = []
        for result in results[:20]:
            title_el = await result.query_selector('h2 span')
            price_el = await result.query_selector('.a-price .a-offscreen')
            link_el = await result.query_selector('h2 a')
            if title_el and price_el and link_el:
                products.append({
                    'title': await title_el.inner_text(),
                    'price': await price_el.inner_text(),
                    'url': 'https://www.amazon.com' + await link_el.get_attribute('href')
                })
        await browser.close()
    # ===== Part 2: Call Claude via SDK =====
    product_text = "\n".join(
        f"[{i+1}] {p['title']} | {p['price']}" for i, p in enumerate(products)
    )
    prompt = (
        f"Below are the Amazon search results for jeans. Return only the index of the cheapest item.\n\n{product_text}"
    )
    result_text = ""
    async for message in query(
        prompt=prompt,
        options=ClaudeAgentOptions(
            system_prompt="You are a price‑comparison assistant. Return only a number.",
            max_turns=1
        )
    ):
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, TextBlock):
                    result_text += block.text
    idx = int(result_text.strip()) - 1
    cheapest = products[idx]
    return cheapest

result = anyio.run(find_cheapest_jeans)
print(f"Cheapest jeans: {result['title']}")
print(f"Price: {result['price']}")
print(f"Link: {result['url']}")

Install the SDK with: pip install claude-agent-sdk. The SDK calls your local Claude Code, so the "intelligence" step costs no extra API fees; only the subscription's allocated quota is used.

Don’t Mix .py Files with Skills

A Skill is a description that the model reads; a .py file is an executable script. Running a .py file via Claude Code is a single tool call (execute command → return output), whereas embedding code inside a Skill still forces the model to process the whole Skill each round.
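From the model's side, executing a script really is a single tool call: run a command, read stdout. A minimal sketch of that round trip (the script filename is hypothetical, and a self-contained one-liner stands in for it here so the snippet runs on its own):

```python
import subprocess
import sys

# One "execute command" tool call: run a script, capture its output.
# In practice the command would be something like
# ["python", "find_cheapest_jeans.py"] (hypothetical filename for the
# earlier script); a print one-liner stands in so this is runnable as-is.
result = subprocess.run(
    [sys.executable, "-c", "print('Cheapest jeans: <title> | <price>')"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())
```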

Practical Advice

Identify deterministic steps in any Skill and rewrite them as Python (or other) code. Keep only the reasoning step for the model. This dramatically reduces token consumption and cost while preserving the model’s strength in handling ambiguous or unstructured data.
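The advice above can be condensed into one reusable shape (the helper and parameter names here are illustrative, not from the article): gather data with plain code, compress it into a short prompt, and spend exactly one model call on the ambiguous part.

```python
# Generic code + model hybrid: deterministic work in code, one reasoning call.
def hybrid_task(collect, summarize, ask_model):
    data = collect()            # deterministic: scrape / parse / filter
    prompt = summarize(data)    # compress results into a few hundred tokens
    return ask_model(prompt)    # the single step that needs intelligence

# Toy run with stand-ins for the scraper and the model:
products = [
    {"title": "Jeans A", "price": 19.99},
    {"title": "Jeans B", "price": 12.50},
]
answer = hybrid_task(
    collect=lambda: products,
    summarize=lambda ps: "\n".join(
        f"[{i + 1}] {p['title']} | {p['price']}" for i, p in enumerate(ps)
    ),
    # Stand-in "model": picks the cheapest index, as the real model would.
    ask_model=lambda prompt: min(
        range(len(products)), key=lambda i: products[i]["price"]
    ) + 1,
)
print(answer)  # → 2 (Jeans B is cheapest)
```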

In short, move the “do the work” part to code and let the model “think” only where necessary.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, Claude, Playwright, OpenRouter, Skill, token optimization
Written by IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.