How Thyme Enables Models to Think Beyond Images with Code‑Driven Multimodal Reasoning
The Kwai Keye team presents Thyme, a novel multimodal reasoning framework that lets large language models generate and safely execute Python code for image manipulation and complex calculations, achieving significant performance gains over existing vision‑language models across perception, reasoning, and hallucination‑reduction benchmarks.
Overview
The Kwai Keye team introduced Thyme (Think Beyond Images), a new multimodal reasoning paradigm that equips open‑source models with the ability to generate and execute Python code for image manipulation and complex calculations, reportedly surpassing existing methods such as OpenAI's o3.
Main Contributions
New multimodal interaction paradigm: the model can actively "think beyond images" by generating code that invokes tools for tasks such as cropping, rotating, contrast enhancement, and mathematical computation.
Two‑stage training (SFT + RL): a supervised fine‑tuning stage on ~500k high‑quality samples (≈200 GPU‑hours), followed by reinforcement learning with a custom GRPO‑ATS algorithm that uses separate temperature settings for text and code.
Open‑source resources: the full dataset, sandbox environment, and codebase have been released to the community.
Workflow
The model receives a user query and decides whether code generation is needed. If so, it produces Python code, which is safely executed in a sandbox that handles formatting, variable bounds, and error correction. The execution result is fed back to the model for further reasoning until a final answer is produced.
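The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Thyme's released sandbox: the `model.generate` call, the `<code>`/`<observation>` tag convention, and the restricted namespace are all assumptions made for the sketch.

```python
import contextlib
import io

def run_in_sandbox(code: str) -> str:
    """Hypothetical sandbox: execute model-generated Python in its own
    namespace and capture stdout, or return the error text on failure."""
    buf = io.StringIO()
    namespace = {}  # a real sandbox would also restrict builtins and resources
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)
        return buf.getvalue().strip()
    except Exception as exc:
        # The error message is fed back so the model can self-correct.
        return f"ExecutionError: {exc}"

def reasoning_loop(model, query: str, max_turns: int = 5) -> str:
    """Alternate between generation and code execution until the model
    stops emitting code (assumed <code>...</code> markers)."""
    context = query
    reply = ""
    for _ in range(max_turns):
        reply = model.generate(context)  # assumed model API
        if "<code>" in reply:
            code = reply.split("<code>")[1].split("</code>")[0]
            result = run_in_sandbox(code)
            context += reply + f"\n<observation>{result}</observation>\n"
        else:
            break
    return reply
```

In this sketch, each execution result is appended to the context as an observation, so later turns can reason over the tool output or repair a failed code block.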
Training Data
SFT data consist of three task types: (1) direct answers without code, (2) code‑based image operations and calculations, and (3) multi‑turn interactions for error correction. More than 400k raw samples were filtered, and an additional 10k manually annotated high‑resolution OCR and attribute‑recognition samples were added.
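To make the three task types concrete, the records below sketch what one sample of each might look like. The field names, tag conventions, and answer values are illustrative assumptions, not the schema of the released dataset.

```python
# Hypothetical record shapes for the three SFT task types; all field
# names and values are illustrative, not the actual dataset schema.
sft_samples = [
    {   # (1) direct answer, no code needed
        "type": "direct",
        "question": "What color is the traffic light?",
        "response": "<answer>red</answer>",
    },
    {   # (2) code-based image operation / calculation
        "type": "code",
        "question": "Read the serial number in the top-right corner.",
        "response": "<code>cropped = image.crop((800, 0, 1024, 128))</code>"
                    "<observation>[cropped image]</observation>"
                    "<answer>illustrative answer</answer>",
    },
    {   # (3) multi-turn interaction with error correction
        "type": "multi_turn",
        "question": "Compute the mean value shown in the chart.",
        "response": "<code>print(total / count)</code>"
                    "<observation>ExecutionError: name 'count' is not defined</observation>"
                    "<code>count = len(values); print(sum(values) / count)</code>"
                    "<answer>illustrative answer</answer>",
    },
]
```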
RL Strategy (GRPO‑ATS)
The reward function combines result accuracy, reasoning‑answer consistency, and strict output formatting. Text generation uses a high temperature (τ = 1) to encourage diversity, while code generation uses τ = 0 for deterministic execution. Adaptive temperature sampling and Rabin‑Karp duplicate detection reduce wasted computation.
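The two decoding mechanisms named above can be sketched briefly: per‑segment temperature (greedy decoding at τ = 0 for code, softmax sampling at τ = 1 for text) and Rabin‑Karp hashing to skip duplicate rollouts. This is a toy illustration of the idea, not the GRPO‑ATS implementation; the function names are invented.

```python
import math
import random

def sample_token(logits, temperature):
    """Greedy argmax when temperature == 0 (code segments);
    softmax sampling otherwise (text segments)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(logits)
    weights = [math.exp((l - m) / temperature) for l in logits]
    return random.choices(range(len(logits)), weights=weights)[0]

def rabin_karp_hash(s, base=256, mod=(1 << 61) - 1):
    """Polynomial rolling hash in the Rabin-Karp style."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

def drop_duplicate_rollouts(rollouts):
    """Keep only the first rollout per hash, so identical generated
    code is not executed (and rewarded) more than once."""
    seen, unique = set(), []
    for r in rollouts:
        h = rabin_karp_hash(r)
        if h not in seen:
            seen.add(h)
            unique.append(r)
    return unique
```

The deterministic branch matters because executed code must be syntactically exact, while text benefits from exploration during RL; de‑duplication avoids spending sandbox time on rollouts that would yield identical observations.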
Experimental Results
Thyme outperforms larger models such as Qwen‑2.5‑VL‑32B on perception tasks (e.g., OCR, chart reading) and shows large gains on difficult domains like surveillance and autonomous driving. In reasoning tasks, code‑based computation yields significant improvements, while hallucination rates are reduced across many benchmarks.
All resources, including the model checkpoint, demo homepage, and GitHub repository, are publicly available.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
