Why Every “Don’t” in Your Prompt Might Be Counterproductive – Insights from 25 Superpowers 6.0 Experiments
Analyzing 25 micro‑tests from Superpowers 6.0, the author shows that adding “don’t” clauses often backfires, explains a low‑cost $0.15 per‑sample evaluation loop, presents five empirical laws and two hard rules for prompt wording, and offers a reusable framework for validating your own AI agent prompts.
While reviewing Superpowers 6.0 experiment logs, the author discovered that the common intuition of adding “don’t X” constraints to prompts can actually degrade performance, a finding supported by a series of controlled micro‑tests.
1. Prohibition vs. Positive Recipes
In the dispatch‑prompt evaluation (SDD workflow on Opus), three wording variants were run five times each. The “don’t repeat brief” prohibition yielded an average of 4.4 violations, higher than the 3.6 violations of the no‑guidance control. A positive recipe (explicitly listing required steps) achieved the lowest average of 3.0 violations with zero variance, while adding a nuance clause to the winning recipe raised the average to 3.8 and made the results noisy. The author notes that the result is bounded to the specific models, workflow, and dates (Opus/Sonnet/Haiku, 2026‑06‑09 to 06‑11) and is therefore engineering evidence, not a universal law.
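To make the "average" and "zero variance" comparison concrete, here is a small Python sketch of how per-variant results could be summarized. The per-run counts are hypothetical: the article reports only averages, so the individual values below are invented to match those averages. The one exception is the positive recipe, where a mean of 3.0 with zero variance over five runs implies five runs of 3 each.

```python
from statistics import mean, pvariance

# Hypothetical per-run violation counts (five runs per variant).
# Only the averages come from the article; the run-level values are
# illustrative assumptions chosen to reproduce those averages.
runs = {
    "control (no guidance)": [3, 4, 4, 3, 4],     # mean 3.6
    "prohibition": [4, 5, 4, 5, 4],               # mean 4.4
    "positive recipe": [3, 3, 3, 3, 3],           # mean 3.0, zero variance
    "positive recipe + nuance": [2, 5, 3, 5, 4],  # mean 3.8, noisy
}


def summarize(counts):
    """Return (mean, population variance) for one variant's counts."""
    return mean(counts), pvariance(counts)


summary = {name: summarize(counts) for name, counts in runs.items()}
for name, (avg, var) in summary.items():
    print(f"{name:26s} mean={avg:.1f} variance={var:.2f}")
```

Reading the mean and the variance together is the point: a variant that wins on average but stays noisy has not yet constrained the model.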
2. Micro‑Testing Methodology and Cost
The micro‑test is designed to cost $0.15‑$0.30 per sample and complete in seconds, enabling rapid iteration before committing to a full end‑to‑end evaluation (≈ $12 and 50 minutes). The five‑step process, extracted from writing-skills/SKILL.md, includes:
Invoke a fresh‑context sample with a full system prompt and a realistic failure‑triggering user message.
Always include a no‑guidance control to verify whether a guidance clause actually changes outcomes.
Run each variant at least five times; single‑sample results are unreliable.
Programmatically score hits (e.g., grep for markers) and then manually review each hit because automated counts can misclassify template echo or reference examples.
Treat variance as a signal: zero variance (consistent) indicates a robust wording, while noisy results suggest the wording has not constrained the model.
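The scoring step above (automated hits followed by manual review) might look like the following Python sketch. The marker regex and the sample texts are hypothetical, chosen only to show why a raw hit count is a candidate list and not a verdict.

```python
import re

# Hypothetical marker for a "repeats the brief" violation.
VIOLATION = re.compile(r"as the brief (says|states)", re.IGNORECASE)


def score(sample: str) -> list[str]:
    """Return every line matching the marker; each is only a candidate hit."""
    return [line for line in sample.splitlines() if VIOLATION.search(line)]


def review_queue(samples: dict[str, str]) -> list[tuple[str, str]]:
    """Flatten candidate hits into (sample_id, line) pairs for manual review."""
    return [(sid, line) for sid, text in samples.items() for line in score(text)]


# Invented sample outputs for illustration.
samples = {
    "run-1": "Plan:\nAs the brief says, we migrate tokens first.\nDone.",
    "run-2": "Example of what to avoid: 'as the brief states ...'\nStep 1: migrate.",
    "run-3": "Step 1: migrate.\nStep 2: verify.",
}

queue = review_queue(samples)
for sample_id, line in queue:
    print(f"[{sample_id}] {line}")
```

Here the regex flags both run-1 and run-2, but the run-2 hit is a quoted reference example, which is the kind of misclassification a human reviewer would discard.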
In cost terms, a micro‑test comes to roughly $0.15 per call, against roughly $12 per run for a full eval.
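As a rough worked example of the arithmetic, the sketch below prices one micro-test round using the per-sample range quoted earlier ($0.15 to $0.30). The four-condition, five-run setup is an assumption made for illustration, not a figure taken from the experiment logs.

```python
def micro_test_cost(variants: int, runs_per_variant: int, cost_per_sample: float) -> float:
    """Total cost of one micro-test round, in dollars."""
    return variants * runs_per_variant * cost_per_sample


# Assumed setup: 4 conditions (including the control), 5 runs each.
low = micro_test_cost(4, 5, 0.15)
high = micro_test_cost(4, 5, 0.30)
FULL_EVAL_COST = 12.0  # approximate cost of one full end-to-end eval

print(f"micro-test round: ${low:.2f} to ${high:.2f}")
print(f"full eval run:    ${FULL_EVAL_COST:.2f}")
```

Under these assumptions, a complete 20-sample round costs about a quarter to a half of a single full eval run, and it returns in seconds instead of roughly 50 minutes.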
3. Writing‑Plans Placeholder Study
Running 40 samples across four variants (control, positive recipe, token migration, and no‑list control) showed that the current Opus model does not generate placeholders even under deliberate pressure, allowing the “No Placeholders” chapter to remain unchanged. Detailed results include zero placeholder violations in all 40 plans and a single regex hit for a self‑review test.
4. Five Empirical Laws and Two Hard Rules
Tripwires work: token‑level “do not flag” clauses reliably trigger when present.
Recognition tables work: red‑flag tables are consulted during decision making, not during composition.
Discrete‑directive prohibitions work: when the model lacks a competitive motive for Y, a prohibition against X is effective.
Composition prohibitions backfire: in output‑shape problems, prohibitions provoke adversarial behavior; positive compositions succeed, and adding nuance degrades them.
Ties go to the shorter phrasing: shorter prompts save compute; Codex rereads the skill file ~500 times in a long session.
Two hard rules (from SKILL.md) are:
No nuance clauses: adding “unless it matters” reopens negotiation and turns a winning recipe noisy.
Exemption clauses don’t scope: they still suppress code blocks; restructure to keep exempted sections out of the rule’s reach.
5. Relation to Anthropic Official Guidance
Anthropic’s prompt tutorial (GitHub prompt‑eng‑interactive‑tutorial) only advises “be clear and direct” and provides a single golden rule. It does not differentiate between positive and negative instructions. The Superpowers micro‑tests fill this gap by showing when prohibitions backfire, when they succeed, and how nuance affects outcomes.
6. Applying the Framework to Your Own Prompts
A minimal harness demonstrates the loop: iterate over variants, run five trials per variant, score automatically, then manually review each hit. The five operational checkpoints mirror the earlier methodology.
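A harness along these lines could be sketched in Python as follows. This is not the author's harness: `run_micro_test`, `fake_model`, and `count_markers` are hypothetical names, and `fake_model` is a deterministic stand-in that you would replace with a real fresh-context model call.

```python
from typing import Callable


def run_micro_test(
    variants: dict[str, str],
    user_message: str,
    call_model: Callable[[str, str], str],
    score: Callable[[str], int],
    runs: int = 5,
) -> dict[str, list[int]]:
    """Run every variant `runs` times and collect automated hit counts."""
    if "control" not in variants:
        raise ValueError("include a no-guidance control variant")
    results: dict[str, list[int]] = {}
    for name, system_prompt in variants.items():
        # Each call is meant to be a fresh context: nothing is carried over.
        results[name] = [
            score(call_model(system_prompt, user_message)) for _ in range(runs)
        ]
    return results


def fake_model(system_prompt: str, user_message: str) -> str:
    """Stand-in for a real model call; always returns two markers."""
    return f"TODO outline\n{user_message}\nTODO details"


def count_markers(output: str) -> int:
    """Automated score: number of hypothetical 'TODO' markers in the output."""
    return output.count("TODO")


variants = {
    "control": "You are a planner.",
    "recipe": "You are a planner. List each step, then the files it touches.",
}
results = run_micro_test(
    variants, "Write the plan for the token migration.", fake_model, count_markers
)
print(results)
```

The automated counts are only the first pass; every nonzero hit should still go to manual review before any wording is declared the winner.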
Key pitfalls to avoid:
Never omit the no‑guidance control.
Never replace manual review with pure automated counting.
Never treat a single experiment as a universal law.
Never discard all prohibitions just because some backfire; classify the failure type first.
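One way to make "classify the failure type first" operational is a small lookup that maps a failure type to the wording strategy the empirical laws suggest. The category names below are a hypothetical taxonomy derived from those laws, not identifiers from SKILL.md, and any recommendation should still be checked with a micro-test.

```python
from enum import Enum


class FailureType(Enum):
    """Hypothetical labels for the failure classes the laws distinguish."""
    TOKEN_LEVEL = "token-level"
    DECISION = "decision"
    DISCRETE_DIRECTIVE = "discrete directive"
    COMPOSITION = "composition / output shape"


# Starting points only; each still needs a control and at least five runs.
RECOMMENDATION = {
    FailureType.TOKEN_LEVEL: "Tripwire: a token-level 'do not ...' clause.",
    FailureType.DECISION: "Recognition table: a red-flag table consulted at decision time.",
    FailureType.DISCRETE_DIRECTIVE: "Prohibition: effective when no competing motive exists.",
    FailureType.COMPOSITION: "Positive recipe: list the required steps, with no nuance clause.",
}


def recommend(failure: FailureType) -> str:
    """Return the suggested starting wording strategy for a failure type."""
    return RECOMMENDATION[failure]


print(recommend(FailureType.COMPOSITION))
```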
Conclusion
The experiments demonstrate that prompt engineering can be turned from intuition‑driven tweaking into a rigorous, low‑cost experimental practice. By spending a few dollars on micro‑tests, practitioners can identify when “don’t” clauses are harmful, validate positive recipes with zero variance, and keep a documented record of both successful and rejected designs, preventing duplicated effort across model generations.
Shuge Unlimited
