How the Tiny Reasoning Model Tina Cuts Training Costs by 99.6% with LoRA + RL

Researchers from ShanghaiTech and USC introduced Tina, a family of compact reasoning models that leverages low‑rank adaptation and reinforcement learning to match or exceed the performance of large SOTA reasoning models while cutting post‑training and evaluation costs to just $9 per checkpoint, a 99.6% savings over traditional approaches.

AI Frontier Lectures

Background and Motivation

In response to the high expense of training large reasoning models, a team led by Shangshang Wang (a ShanghaiTech alumnus and USC PhD student) developed a series of small reasoning models named Tina. Their goal is to democratize reinforcement‑learning‑driven reasoning by dramatically lowering hardware and budget requirements.

Model and Training Approach

Tina builds on a 1.5‑billion‑parameter base model, DeepSeek‑R1‑Distill‑Qwen‑1.5B. The researchers applied low‑rank adaptation (LoRA) together with reinforcement learning (RL) to fine‑tune only a tiny subset of parameters, preserving most of the pretrained knowledge while adapting the model to reasoning tasks.
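The mechanics of LoRA can be sketched in a few lines: the pretrained weight matrix W stays frozen, and only a low‑rank pair of matrices A and B is trained, adding an update (α/r)·B·A on top of W. The sketch below is illustrative, not the authors' code; the layer dimension matches a Qwen‑1.5B‑style hidden size, but the rank r and scale α are hypothetical values, not the paper's settings.

```python
import numpy as np

# Minimal LoRA sketch (illustrative, not the authors' implementation).
# A frozen pretrained weight W of shape (d_out, d_in) gets a trainable
# low-rank update delta_W = (alpha / r) * B @ A, where B is (d_out, r)
# and A is (r, d_in). Only A and B receive gradients during RL fine-tuning.

d_out, d_in = 1536, 1536   # hidden size of a Qwen-1.5B-style layer
r, alpha = 16, 32          # hypothetical rank and scaling, not the paper's values

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init: training starts exactly at W

def lora_forward(x):
    # Equivalent to (W + (alpha / r) * B @ A) @ x, computed without ever
    # materializing the full-rank update matrix.
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d_out * d_in          # parameters a full fine-tune would touch
lora_params = r * (d_in + d_out)    # parameters LoRA actually trains
print(f"trainable fraction: {lora_params / full_params:.2%}")  # ~2.08% for this layer
```

With r = 16 on a 1536‑wide layer, LoRA trains roughly 2% of the layer's parameters, which is why most of the pretrained knowledge is preserved by construction.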


Cost Efficiency

Using this efficient training pipeline, the post‑training and evaluation cost for a single Tina checkpoint came to only $9, a 99.6% reduction compared with traditional full‑parameter approaches; reproducing the paper's entire set of experiments would cost about $526. The team had set a conservative overall budget of $100, yet actual expenditure stayed far below that ceiling.
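At the quoted cloud rate, the $9 figure is easy to decompose as a back‑of‑the‑envelope calculation. The run length below is an illustrative assumption (the article does not state wall‑clock hours), used only to show how GPU count, hourly rate, and time combine into the per‑checkpoint cost.

```python
# Back-of-the-envelope cost model. The rate and GPU count come from the
# article; the number of hours is a hypothetical value chosen for illustration.
gpu_rate_usd = 1.00   # $ per GPU-hour, as quoted for the cloud platform
n_gpus = 2            # two NVIDIA L40S GPUs
hours = 4.5           # hypothetical run length for one checkpoint

cost = gpu_rate_usd * n_gpus * hours
print(f"checkpoint cost: ${cost:.2f}")  # $9.00
```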

Hardware and Software Stack

The experiments ran on two NVIDIA L40S GPUs rented from a commercial cloud platform at roughly $1 per GPU‑hour (including 300 GB of storage). Training built on the open‑source OpenR1 codebase (a full reproduction of DeepSeek‑R1), combined with Accelerate, TRL, and DeepSpeed ZeRO optimizations. To keep GPU memory usage low, the team capped vLLM's memory allocation and ran both RL training and inference on the same two GPUs.
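Colocating vLLM with RL training on the same GPUs typically means capping the fraction of VRAM vLLM is allowed to reserve. The config fragment below is a sketch of that pattern using vLLM's public API; the specific utilization value and parallelism setting are assumptions for illustration, not the paper's configuration.

```python
# Sketch: capping vLLM's memory so RL training can share the same GPUs
# (the 0.4 cap and tensor_parallel_size are illustrative assumptions,
# not the paper's exact settings).
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    gpu_memory_utilization=0.4,   # reserve ~40% of VRAM for inference,
                                  # leaving the rest for training state
    tensor_parallel_size=2,       # shard across the two L40S GPUs
)
```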

Experimental Setup

All baseline evaluations used the LightEval framework integrated with the vLLM inference engine, ensuring a fair comparison under identical hardware (two L40S GPUs) and standardized inference parameters. The primary metric was zero‑shot pass@1 accuracy across six benchmark tasks.
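For reference, the standard unbiased pass@k estimator (popularized by the Codex evaluation) is pass@k = 1 − C(n−c, k)/C(n, k) for n generated samples of which c are correct; with a single sample per problem, zero‑shot pass@1 reduces to the plain fraction of problems solved. A stdlib sketch (function name and example data are mine, not from the paper's harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (zero-shot), pass@1 is the solved fraction.
per_problem = [(1, 1), (1, 0), (1, 1)]  # (n, c) per problem: 2 of 3 solved
score = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {score:.2%}")  # 66.67%
```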

Results

On the AIME24 benchmark, the best Tina model achieved a pass@1 accuracy of 43.33%, improving reasoning performance by more than 20% over the baseline. Across all Tina variants, average scores ranged from 48.16% to 50.60%, with the Tina‑Open‑RS2 model reaching the highest average of 50.60%. Notably, these gains were obtained while using only 19%–57% of a full training cycle.

Analysis of Low‑Rank Adaptation

The study highlighted that LoRA‑based fine‑tuning requires orders of magnitude fewer floating‑point operations (FLOPs) than full‑parameter training, yet delivers comparable or superior reasoning quality. Interestingly, increasing training compute for the LoRA models sometimes degraded performance, underscoring a "less is more" principle in this setting.

Implications and Future Work

The findings demonstrate that small‑scale language models can attain strong reasoning capabilities with minimal computational resources, making advanced AI accessible to labs with limited GPU budgets. The authors plan to investigate further why this approach elicits reasoning so effectively and to extend it to broader AI applications.

Tags: AI · low-rank adaptation · small language models · cost-efficient reasoning
Written by AI Frontier Lectures, a leading AI knowledge platform.