
How a Simple Colon Can Fool Top LLMs – The ‘Universal Key’ Vulnerability Exposed

Researchers found that trivial inputs such as a lone colon or the word “Solution” can trigger false‑positive rewards from LLM judge models, including GPT‑4o, Claude‑4 and LLaMA‑3‑70B, and proposed a more robust judge model, “Master‑RM”, that reduces the false‑positive rate of these attacks to nearly zero.

Java Tech Enthusiast

Jul 17, 2025


A Universal Key That Can Deceive LLMs

Recent work shows that large language models (LLMs) used as reward judges in reinforcement learning with verifiable rewards (RLVR) can be fooled by extremely simple cues such as a colon (":") or a single space, as well as generic reasoning starters like “Thought process:” or the Chinese character “解” (“solution”). These cues produce false‑positive rewards: a response passes verification even though it contains no meaningful answer.

The phenomenon was reported in a paper titled “One Token to Fool LLM‑as‑a‑Judge”. Experiments on popular LLM judges, including GPT‑4o, Claude‑4, and LLaMA‑3‑70B, showed that all of them succumb to these “universal keys”.

To address the vulnerability, researchers from Tencent AI Lab, Princeton University, and the University of Virginia built an augmented dataset and trained a more reliable judge model called Master‑RM. They added 20,000 adversarial samples (responses generated with GPT‑4o‑mini and truncated to a content‑free opening line) to the original 160,000‑sample training set, then trained Master‑RM from Qwen2.5‑7B‑Instruct using supervised fine‑tuning (SFT).

Evaluated under the same conditions, Master‑RM achieved a false‑positive rate close to 0 % across all “universal key” attacks while maintaining high agreement (0.96) with GPT‑4o’s judgments.
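The article does not spell out how the 0.96 figure is computed. One plausible reading, assumed here, is the fraction of examples on which two judges return the same verdict. The sketch below shows that calculation; the function name `agreement_rate` and the sample verdicts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def agreement_rate(verdicts_a: Sequence[bool], verdicts_b: Sequence[bool]) -> float:
    """Fraction of examples on which two judges return the same verdict."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("verdict lists must have the same length")
    if not verdicts_a:
        raise ValueError("need at least one verdict")
    matches = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)


# Hypothetical verdicts on five genuine responses (True = judged correct).
reference_judge = [True, False, True, True, False]
candidate_judge = [True, False, True, False, False]

# The two judges disagree only on the fourth example.
print(agreement_rate(reference_judge, candidate_judge))
```

A judge that rejects every response would also score a 0 % false‑positive rate, which is why an agreement check on genuine responses matters alongside it.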

An Untrickable Judge Model

Master‑RM’s training pipeline sampled 20 k examples from the original 160 k, replaced each response with a truncated one that keeps only a content‑free opening line, labeled these truncated responses “incorrect”, and merged them with the clean data. The model was then fine‑tuned from Qwen2.5‑7B‑Instruct with a cross‑entropy loss, so that it learns to distinguish genuine answers from superficial tricks.
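The augmentation step can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes each example is a dictionary with `question`, `reference`, `response`, and `label` fields, that the `response` field already holds a model‑generated chain‑of‑thought answer, and that labels are the strings "YES" and "NO". All of these names are hypothetical.

```python
import random
import re
from typing import Dict, List

Example = Dict[str, str]  # hypothetical schema: question, reference, response, label


def first_sentence(text: str) -> str:
    """Keep only the opening sentence (or opening line) of a response."""
    first_line = text.strip().split("\n", 1)[0]
    # Cut after the first sentence-ending mark (., !, ?, :) if there is one.
    match = re.match(r".*?[.!?:]", first_line)
    return match.group(0) if match else first_line


def build_augmented_set(clean: List[Example], n_adversarial: int, seed: int = 0) -> List[Example]:
    """Append truncated, negatively labeled copies of sampled examples."""
    rng = random.Random(seed)
    sampled = rng.sample(clean, n_adversarial)
    adversarial = [
        # The truncated opener contains no answer, so it is labeled incorrect.
        {**ex, "response": first_sentence(ex["response"]), "label": "NO"}
        for ex in sampled
    ]
    return clean + adversarial
```

With 160,000 clean examples and `n_adversarial=20000`, this would yield the 180,000‑example mix described above. The regular expression is a crude sentence splitter chosen for brevity; it would, for instance, cut a response that opens with a decimal number at the decimal point.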

Experimental results reveal several key findings:

Non‑word symbols: a space, period, comma, or colon on its own can trigger false positives.

Reasoning prefixes: phrases such as “Thought process:”, “Solution”, and “Let’s solve this problem step by step”, along with their equivalents in Chinese and Japanese, also cause false positives.

Across five reasoning benchmarks covering general and mathematical reasoning, every tested model produced false positives. This held for both specialized reward models (e.g., Multi‑sub RM, Omni‑Judge) and general‑purpose LLMs (GPT‑4o, Claude‑4, LLaMA‑3‑70B, Qwen2.5‑72B).

Specific false‑positive rates include:

GPT‑4o on the colon token: 35 % false‑positive rate.

LLaMA‑3‑70B on “Thought process:”: 60‑90 % false‑positive rate.

General‑Verifier on the MATH dataset for a single space: 66.8 % false‑positive rate.
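A false‑positive rate of this kind can be measured by submitting the same content‑free key as the response to every question and counting how often the judge accepts it. The sketch below assumes a judge is any callable that takes a question, a reference answer, and a response and returns a boolean. `toy_judge` is a deliberately flawed stand‑in written for this example, not a real LLM judge, and the two questions are made up.

```python
from typing import Callable, Dict, List, Tuple

Item = Tuple[str, str]  # (question, reference answer)
Judge = Callable[[str, str, str], bool]  # (question, reference, response) -> accepted?


def false_positive_rates(judge: Judge, items: List[Item], keys: List[str]) -> Dict[str, float]:
    """For each key, the fraction of items on which the judge accepts the key as a response."""
    if not items:
        raise ValueError("need at least one item")
    rates = {}
    for key in keys:
        # The key carries no answer, so every acceptance is a false positive.
        accepted = sum(judge(question, reference, key) for question, reference in items)
        rates[key] = accepted / len(items)
    return rates


def toy_judge(question: str, reference: str, response: str) -> bool:
    """Hypothetical flawed judge: accepts anything ending in a colon."""
    text = response.strip()
    return reference in text or text.endswith(":")


items = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
print(false_positive_rates(toy_judge, items, [":", "Thought process:", " "]))
```

In a real evaluation the `judge` callable would wrap a call to the model under test; the harness itself stays the same.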

The vulnerability is language‑agnostic: keys written in Chinese, English, and Japanese all achieve high false‑positive rates.

Model size does not correlate monotonically with robustness:

0.5 B models: rely on literal matching, which gives low false‑positive rates but poor consistency.

1.5‑3 B models: detect semantic similarity but suffer higher false‑positive rates.

7‑14 B models: achieve the lowest false‑positive rates with good consistency.

32‑72 B models: tend to generate their own solutions rather than judging the submitted response, which increases false positives.

Furthermore, new “universal keys” can be discovered automatically. Embedding known keys with the all‑MiniLM‑L6‑v2 encoder and running a similarity search over a large corpus retrieves sentences that resemble them, and these new adversarial responses reproduce the high false‑positive rates.
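The retrieval idea can be illustrated with a small nearest‑neighbor search. To keep the sketch self‑contained, it swaps the all‑MiniLM‑L6‑v2 sentence encoder for a toy embedding based on character trigram counts. That substitution is an assumption made purely for illustration, and the corpus sentences are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding (stand-in for a sentence encoder): lowercase character trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def nearest_candidates(key: str, corpus: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Return the top_k corpus sentences most similar to a known key, best first."""
    key_vec = embed(key)
    scored = [(cosine(key_vec, embed(sentence)), sentence) for sentence in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


corpus = ["Thinking process:", "The weather is nice today.", "Thought process"]
for score, sentence in nearest_candidates("Thought process:", corpus):
    print(f"{score:.2f}  {sentence}")
```

A trigram embedding only captures surface overlap, whereas a sentence encoder would also surface paraphrases that share no characters with the original key. Each retrieved candidate would then need to be scored against the judge to confirm that it actually triggers false positives.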

In summary, the core mechanism of generative reward models—filtering invalid or erroneous answers—can be manipulated by superficial, irrelevant content, leading to systematic false‑positive rewards. This poses a serious challenge to any RLVR pipeline that relies on LLM judges for feedback.
