Why Every “Don’t” in Your Prompt Might Be Counterproductive – Insights from 25 Superpowers 6.0 Experiments

Analyzing 25 micro‑tests from Superpowers 6.0, the author shows that adding “don’t” clauses often backfires, explains a low‑cost $0.15 per‑sample evaluation loop, presents five empirical laws and two hard rules for prompt wording, and offers a reusable framework for validating your own AI agent prompts.

Shuge Unlimited

While reviewing Superpowers 6.0 experiment logs, the author discovered that the common intuition of adding “don’t X” constraints to prompts can actually degrade performance, a finding supported by a series of controlled micro‑tests.

1. Prohibition vs. Positive Recipes

In the dispatch‑prompt evaluation (SDD workflow on Opus), three wording variants were run five times each. The “don’t repeat brief” prohibition averaged 4.4 violations, higher than the 3.6 of the no‑guidance control. A positive recipe (explicitly listing required steps) achieved the lowest average, 3.0 violations with zero variance, while adding a nuance clause to the winning recipe raised the average to 3.8 and made the results noisy. The author notes that the result is bounded to the specific models, workflow, and dates (Opus/Sonnet/Haiku, 2026‑06‑09 to 06‑11) and is therefore engineering evidence, not a universal law.

2. Micro‑Testing Methodology and Cost

The micro‑test is designed to cost $0.15‑$0.30 per sample and complete in seconds, enabling rapid iteration before committing to a full end‑to‑end evaluation (≈ $12 and 50 minutes). The five‑step process, extracted from writing-skills/SKILL.md, includes:

Invoke a fresh‑context sample with a full system prompt and a realistic failure‑triggering user message.

Always include a no‑guidance control to verify whether a guidance clause actually changes outcomes.

Run each variant at least five times; single‑sample results are unreliable.

Programmatically score hits (e.g., grep for markers) and then manually review each hit because automated counts can misclassify template echo or reference examples.

Treat variance as a signal: zero variance (consistent) indicates a robust wording, while noisy results suggest the wording has not constrained the model.
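The scoring and variance steps can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: the function names `count_marker_hits` and `summarize`, the marker pattern, and the per‑run counts are all assumptions made up for the example.

```python
import re


def count_marker_hits(output: str, marker: str) -> int:
    """Count case-insensitive occurrences of a regex marker in one sample's output."""
    return len(re.findall(marker, output, flags=re.IGNORECASE))


def summarize(counts: list[int]) -> dict:
    """Return the mean, population variance, and a consistent/noisy label."""
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    label = "consistent" if variance == 0 else "noisy"
    return {"mean": mean, "variance": variance, "label": label}


# Made-up per-run violation counts, for illustration only.
steady = summarize([3, 3, 3, 3, 3])
spread = summarize([1, 4, 2, 5, 3])
print(steady)  # same count on every run: zero variance
print(spread)  # same mean, but the wording has not pinned the behavior down
```

Both made‑up series have the same mean, which is why the mean alone is not enough: only the variance separates a wording that constrains the model from one that does not. The automated count is still just a first pass, and each hit needs a manual look before it is recorded as a violation.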

The cost breakdown is illustrated in an image comparing micro‑tests (≈ $0.15 per call) with full evals (≈ $12 per run).
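As a rough worked example of the budget, the sketch below multiplies out one micro‑test round. The figure of four variants (three wordings plus a control) is an assumption chosen for illustration; the per‑sample and full‑eval prices are the approximate ones quoted above, expressed in cents to avoid floating‑point rounding.

```python
def micro_test_cost_cents(variants: int, runs_per_variant: int, cents_per_sample: int) -> int:
    """Total cost of one micro-test round, in cents."""
    return variants * runs_per_variant * cents_per_sample


FULL_EVAL_CENTS = 1200  # roughly $12 per full end-to-end run, per the text

# Assumed round: 4 variants x 5 runs, at the low and high per-sample price.
low = micro_test_cost_cents(4, 5, 15)
high = micro_test_cost_cents(4, 5, 30)

print(f"micro-test round: ${low / 100:.2f} to ${high / 100:.2f}")
print(f"full eval run:    ${FULL_EVAL_CENTS / 100:.2f}")
```

Under these assumptions a full round of 20 samples costs $3 to $6, less than a single full evaluation, and it finishes in seconds per sample rather than about 50 minutes.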

3. Writing‑Plans Placeholder Study

Running 40 samples across four variants (control, positive recipe, token migration, and no‑list control) showed that the current Opus model does not generate placeholders even under deliberate pressure, allowing the “No Placeholders” chapter to remain unchanged. Detailed results include zero placeholder violations in all 40 plans and a single regex hit for a self‑review test.
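A placeholder scan of this kind might look like the sketch below. The article does not list the pattern it used, so the markers in `PLACEHOLDER_RE`, the function name `find_placeholder_hits`, and the sample plan are all hypothetical.

```python
import re

# Hypothetical placeholder markers; the real regex is not given in the article.
PLACEHOLDER_RE = re.compile(
    r"\b(?:TODO|TBD|FIXME)\b|<placeholder>|\[\.\.\.\]", re.IGNORECASE
)


def find_placeholder_hits(plan: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for every line that matches a marker."""
    return [
        (number, line.strip())
        for number, line in enumerate(plan.splitlines(), start=1)
        if PLACEHOLDER_RE.search(line)
    ]


plan = """Step 1: Add the retry wrapper to fetch_user().
Step 2: Write a unit test for the timeout path.
Self-review: confirm no step says TODO or TBD."""

for number, line in find_placeholder_hits(plan):
    print(number, line)
```

The only hit in this made‑up plan is a line that mentions the markers rather than leaving a gap. A regex cannot tell those two cases apart, which is why each hit is reviewed by hand before it counts as a violation.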

4. Five Empirical Laws and Two Hard Rules

Tripwires work: token‑level “do not flag” clauses reliably trigger when present.

Recognition tables work: red‑flag tables are consulted during decision making, not during composition.

Discrete‑directive prohibitions work: when the model lacks a competitive motive for Y, a prohibition against X is effective.

Composition prohibitions backfire: in output‑shape problems, prohibitions provoke adversarial behavior; positive compositions succeed, and adding nuance degrades them.

Ties go to the shorter phrasing: shorter prompts save compute; Codex rereads the skill file ~500 times in a long session.

Two hard rules (from SKILL.md) are:

No nuance clauses: adding “unless it matters” reopens negotiation and turns a winning recipe noisy.

Exemption clauses don’t scope: they still suppress code blocks; restructure to keep exempted sections out of the rule’s reach.

5. Relation to Anthropic Official Guidance

Anthropic’s prompt tutorial (GitHub prompt‑eng‑interactive‑tutorial) only advises “be clear and direct” and provides a single golden rule. It does not differentiate between positive and negative instructions. The Superpowers micro‑tests fill this gap by showing when prohibitions backfire, when they succeed, and how nuance affects outcomes.

6. Applying the Framework to Your Own Prompts

A minimal harness (a preformatted code block in the original article) demonstrates the loop: iterate over variants, run five trials per variant, score automatically, then manually review each hit. The five operational checkpoints mirror the methodology in section 2.
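The harness itself is not reproduced in this summary, so the following is only a sketch of such a loop under stated assumptions. The names `run_micro_test`, `run_sample`, and `fake_model`, the prompts, and the marker pattern are hypothetical; `run_sample` stands in for whatever call starts a fresh‑context model sample, and `fake_model` is a deterministic stub so the example runs offline.

```python
import re
from typing import Callable


def run_micro_test(
    variants: dict[str, str],
    user_message: str,
    run_sample: Callable[[str, str], str],
    marker: str,
    trials: int = 5,
) -> tuple[dict[str, list[int]], list[tuple[str, int, str]]]:
    """Run every variant `trials` times; return hit counts and a manual-review queue."""
    counts: dict[str, list[int]] = {}
    review_queue: list[tuple[str, int, str]] = []
    for name, system_prompt in variants.items():
        counts[name] = []
        for trial in range(trials):
            # Each call is expected to start from a fresh context.
            output = run_sample(system_prompt, user_message)
            hits = re.findall(marker, output, flags=re.IGNORECASE)
            counts[name].append(len(hits))
            if hits:
                # Automated hits are only candidates; a person reviews each one.
                review_queue.append((name, trial, output))
    return counts, review_queue


def fake_model(system_prompt: str, user_message: str) -> str:
    """Offline stand-in for a real model call (hypothetical behavior)."""
    if "recipe" in system_prompt:
        return "Plan: step 1, step 2."
    return "Restating the brief: ..."


variants = {
    "control": "You are a dispatcher.",  # no-guidance control
    "recipe": "You are a dispatcher. Follow this recipe: list the steps, then act.",
}
counts, review_queue = run_micro_test(
    variants, "Dispatch this task.", fake_model, r"restating the brief"
)
print(counts)
print(len(review_queue), "outputs queued for manual review")
```

The stub's output says nothing about how a real model behaves; it only shows the shape of the loop. Passing the model call in as a function keeps the control variant, the trial count, and the review queue in one place, so swapping in a real client does not change the experiment structure.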

Key pitfalls to avoid:

Never omit the no‑guidance control.

Never replace manual review with pure automated counting.

Never treat a single experiment as a universal law.

Never discard all prohibitions just because some backfire; classify the failure type first.

Conclusion

The experiments demonstrate that prompt engineering can be turned from intuition‑driven tweaking into a rigorous, low‑cost experimental practice. By spending a few dollars on micro‑tests, practitioners can identify when “don’t” clauses are harmful, validate positive recipes with zero variance, and keep a documented record of both successful and rejected designs, preventing duplicated effort across model generations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
