How Thyme Enables Models to Think Beyond Images with Code‑Driven Multimodal Reasoning

The Kwai Keye team presents Thyme, a novel multimodal reasoning framework that lets large language models generate and safely execute Python code for image manipulation and complex calculations, achieving significant performance gains over existing vision‑language models across perception, reasoning, and hallucination‑reduction benchmarks.

Kuaishou Tech

Overview

The Kwai Keye team introduced Thyme (Think Beyond Images), a new multimodal reasoning paradigm that equips open‑source models with the ability to generate and execute Python code for image manipulation and complex calculations, surpassing existing methods such as OpenAI o3.

Main Contributions

New multimodal interaction paradigm: the model can actively “think beyond images” by generating code to invoke tools for tasks like cropping, rotating, contrast enhancement, and mathematical computation.

Two‑stage training (SFT + RL): a supervised fine‑tuning stage on ~500K high‑quality samples (≈200 GPU‑hours), followed by reinforcement learning with a custom GRPO‑ATS algorithm that uses separate temperature settings for text and code.

Open‑source resources: the full dataset, sandbox environment, and codebase have been released to the community.
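To make the first contribution concrete, here is a minimal sketch of the kind of tool-invoking code the model might emit for cropping, rotation, and contrast enhancement. The helper names are hypothetical (not from the Thyme codebase), and pixels are modeled as plain nested lists rather than any specific imaging library:

```python
def crop(pixels, top, left, height, width):
    """Crop a rectangular region from a 2D pixel grid (list of rows)."""
    return [row[left:left + width] for row in pixels[top:top + height]]

def rotate90(pixels):
    """Rotate a 2D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*pixels[::-1])]

def enhance_contrast(pixels, factor=1.5, pivot=128):
    """Stretch grayscale values away from a pivot, clamped to [0, 255]."""
    return [[max(0, min(255, int(pivot + (p - pivot) * factor))) for p in row]
            for row in pixels]

# A tiny 2x3 grayscale "image"
img = [[10, 20, 30],
       [40, 50, 60]]
print(crop(img, 0, 1, 2, 2))   # [[20, 30], [50, 60]]
print(rotate90(img))           # [[40, 10], [50, 20], [60, 30]]
```

In Thyme, snippets like these are not hand-written: the model generates them on the fly and the sandbox executes them, returning the transformed image or value as new context.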

Workflow

The model receives a user query and decides whether code generation is needed. If so, it produces Python code, which is safely executed in a sandbox that handles formatting, variable bounds, and error correction. The execution result is fed back to the model for further reasoning until a final answer is produced.
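The loop described above can be sketched roughly as follows. This is an illustrative skeleton only: the `run_in_sandbox` and `reasoning_loop` names are invented, and a real sandbox (like Thyme's) would additionally isolate the process, bound resources and variables, and attempt error correction:

```python
import io
import contextlib

def run_in_sandbox(code: str) -> str:
    """Execute model-generated code, returning captured stdout or the error.
    (Illustrative only: no process isolation or resource limits here.)"""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {"__builtins__": __builtins__})
        return buf.getvalue()
    except Exception as e:
        return f"Error: {type(e).__name__}: {e}"

def reasoning_loop(model_step, query: str, max_turns: int = 4) -> str:
    """Feed execution results back to the model until a final answer emerges.
    `model_step` stands in for the model: given the transcript so far, it
    returns either ("code", <python source>) or ("answer", <text>)."""
    transcript = query
    for _ in range(max_turns):
        kind, payload = model_step(transcript)
        if kind == "answer":
            return payload
        result = run_in_sandbox(payload)
        transcript += f"\n[code]\n{payload}\n[result]\n{result}"
    return "No answer within turn budget."
```

The key design point is that execution results re-enter the model's context as ordinary text, so the model can inspect an error message or a computed value and decide whether to retry, refine, or answer.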

Training Data

SFT data consist of three task types: (1) direct answer without code, (2) code‑based image operations and calculations, and (3) multi‑turn interaction for error correction. Over 400 k raw samples were filtered, and an additional 10 k manually annotated high‑resolution OCR and attribute data were added.
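A multi-turn error-correction sample (task type 3) might look roughly like the following. This schema is hypothetical, purely to illustrate the shape of such data; the released dataset's actual field names and format may differ:

```python
# Hypothetical structure of one multi-turn SFT sample (type 3):
# the model's first code attempt fails, the sandbox reports the error,
# and the corrected attempt plus final answer are supervised targets.
sample = {
    "task_type": "multi_turn_correction",
    "image": "chart_0042.png",
    "query": "What is the value of the tallest bar?",
    "turns": [
        {"role": "assistant", "code": "img.crop((0, 0, 5000, 5000))"},
        {"role": "sandbox", "result": "Error: crop box exceeds image bounds"},
        {"role": "assistant", "code": "img.crop((0, 0, 640, 480))"},
        {"role": "sandbox", "result": "<cropped image>"},
        {"role": "assistant", "answer": "The tallest bar reads 87."},
    ],
}
```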

RL Strategy (GRPO‑ATS)

The reward function combines result accuracy, reasoning‑answer consistency, and strict output formatting. Text generation uses a high temperature (τ = 1) to encourage diversity, while code generation uses τ = 0 for deterministic execution. Adaptive temperature sampling and Rabin‑Karp duplicate detection reduce wasted computation.
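The duplicate-detection step can be illustrated with a plain Rabin‑Karp rolling hash over token sequences (a generic sketch under stated assumptions, not Thyme's released implementation): if a fixed-length window of tokens recurs within a rollout, the rollout is flagged as repetitive and further generation on it can be cut short.

```python
def has_repeated_window(tokens, window=8, base=257, mod=(1 << 61) - 1):
    """Return True if any `window`-length span of `tokens` appears twice.
    Uses a Rabin-Karp rolling hash; an exact comparison on hash match
    guards against collisions."""
    if len(tokens) < 2 * window:
        return False
    power = pow(base, window, mod)   # base**window, used to drop the oldest token
    seen = {}                        # hash -> list of window start indices
    h = 0
    for i, tok in enumerate(tokens):
        h = (h * base + hash(tok)) % mod                      # push new token
        if i >= window:
            h = (h - hash(tokens[i - window]) * power) % mod  # pop oldest token
        if i >= window - 1:
            start = i - window + 1
            for prev in seen.get(h, []):
                if tokens[prev:prev + window] == tokens[start:start + window]:
                    return True
            seen.setdefault(h, []).append(start)
    return False
```

The rolling hash makes each update O(1) instead of rehashing the whole window, so degenerate repetitive rollouts can be detected cheaply during sampling rather than after a full-length generation is wasted.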

Experimental Results

Thyme outperforms larger models such as Qwen‑2.5‑VL‑32B on perception tasks (e.g., OCR, chart reading) and shows large gains on difficult domains like surveillance and autonomous driving. In reasoning tasks, code‑based computation yields significant improvements, while hallucination rates are reduced across many benchmarks.

All resources, including the model checkpoint, demo homepage, and GitHub repository, are publicly available.

Figure: Thyme workflow
Figure: Training pipeline
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Code Generation, large language model, multimodal, reinforcement learning, AI research, Vision-Language
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
