I wired up an auto-research loop in Pi that iterated three personality forms against six behavioral scenarios and scored each variant against a no-personality baseline. The delta table caught what I couldn't see by reading my own prompt: I had stacked too many control systems on top of the model. The same form that lifted a weaker model by 0.29 dragged a stronger one down by 0.15. Letta evals reproduced the pattern. Trimming the prompt is what closed the gap.
I had been adding to the same system prompt for weeks. Layer on layer. Investigation rules. Planning rules. Recovery rules. Completion gates. Each addition felt like it was helping. Reading my own prompt, I couldn’t tell what was working and what was fighting itself.
So I stopped reading it and built something that could.
The Auto-Research Pipeline
The harness is a small Pi-driven loop in forge-personality-system/evals. One file describes the scenarios. One file describes the personality forms. The runner does the rest:
- For each (model × form) pair, run all behavioral scenarios twice. Once with the personality, once with a clean baseline.
- A separate judge model scores each response against a per-scenario rubric.
- The pipeline returns a delta table: how much each personality moved the score against its own no-personality baseline.
- Iterate. Change a form, re-run, see whether the delta moved in the direction I expected.
The grader is always a different model family from the subject. If the same family scores and generates, the eval eats its own tail.
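A minimal sketch of that loop, to make the delta computation concrete. This is not the actual runner: `generate`, `judge`, and the `families` mapping are hypothetical stand-ins for whatever the harness uses to call the subject model, call the judge, and look up a model's family.

```python
from statistics import mean
from typing import Callable

# Hypothetical signatures; the real runner's API is not shown in this post.
Generate = Callable[[str, str, str], str]  # (model, system_prompt, scenario) -> response
Judge = Callable[[str, str], float]        # (scenario, response) -> rubric score


def delta_table(models, forms, scenarios, generate: Generate, judge: Judge,
                judge_model: str, families: dict[str, str]):
    """Return {(model, form_name): mean personality score minus mean baseline score}."""
    table = {}
    for model in models:
        # Refuse to let one family both generate and grade.
        if families[model] == families[judge_model]:
            raise ValueError(f"judge {judge_model} shares a family with {model}")
        for form_name, prompt in forms.items():
            # Each pair runs every scenario twice: with the form, then with no personality.
            with_form = mean(judge(s, generate(model, prompt, s)) for s in scenarios)
            baseline = mean(judge(s, generate(model, "", s)) for s in scenarios)
            table[(model, form_name)] = round(with_form - baseline, 2)
    return table
```

Passing `generate` and `judge` in as functions keeps the sketch testable with stubs, so the loop logic can be checked without any model access.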
Three forms, intentionally simple:
| Form | Token budget | Job |
|---|---|---|
| Stealth | ~100 | Minimal behavioral pressure |
| Compressed | ~150 | Operational discipline, no scaffolding |
| Full | ~500 | Planning, recovery, completion gates, the works |
Six behavioral scenarios covered scope respect, evidence-first reasoning, low-drama communication, pushback, execution awareness, and answer-first ordering. Small surface on purpose. The loop ran on every change.
What The Loop Surfaced
The first full sweep produced a table I didn’t want to see:
| Model | Form | Delta vs no-personality baseline |
|---|---|---|
| M2.5 | full | +0.29 |
| M2.5 | compressed | +0.12 |
| M2.5 | stealth | +0.11 |
| copilot-gpt-5.4 | compressed | +0.05 |
| M2.7 | compressed | −0.07 |
| M2.7 | full | −0.15 |
Same personality forms, same scenarios. M2.5 gained 0.29 under the heaviest form. M2.7 lost 0.15, which put the stronger model below its own untouched baseline. The personality was net-negative on the model I most wanted to ship.
The narrow surface where this showed up was scope respect. Pushback stayed at 1.0. Professional tone stayed clean. Where the heavy form broke things was multi-task scoping, where the agent had to pick one task and explicitly defer the rest. Under the full form, M2.7 stopped doing that. Instead, it investigated broadly across all three requested tasks. The “investigate before concluding” rule outranked the “do one task” rule.
That’s not noise. That’s instruction hierarchy.
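A small sketch of why a dimension-level view matters here. The scores below are invented for illustration and are not the measured results; the dimension names and the `tolerance` threshold are assumptions too.

```python
from statistics import mean


def per_dimension_delta(baseline: dict[str, float], with_form: dict[str, float]) -> dict[str, float]:
    """Score change per scenario dimension, personality run minus baseline run."""
    return {dim: round(with_form[dim] - baseline[dim], 2) for dim in baseline}


def regressions(deltas: dict[str, float], tolerance: float = 0.05) -> list[str]:
    """Dimensions whose score fell by more than `tolerance` (an assumed threshold)."""
    return sorted(dim for dim, d in deltas.items() if d < -tolerance)


# Illustrative numbers only, NOT data from the sweep.
baseline = {"scope_respect": 0.9, "pushback": 1.0, "evidence_first": 0.7}
full_form = {"scope_respect": 0.6, "pushback": 1.0, "evidence_first": 0.9}

deltas = per_dimension_delta(baseline, full_form)
overall = round(mean(deltas.values()), 2)
failing = regressions(deltas)
```

In this made-up case the overall delta is close to zero because a gain on one dimension offsets a sharp drop on another, while `failing` still names the dimension that regressed.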
The Diagnosis: I Was Overloading The Prompt
The lazy version of this finding is “longer prompts are bad.” That isn’t what the data says.
Look at the stealth numbers. Stealth, the lightest form, scored higher as model quality rose: 0.73 → 0.77 → 0.82 across the three subjects. Stronger models did benefit from behavioral guidance. The problem appeared only when the form got heavy enough to become several competing control systems at once.
The full form was no longer one personality. It was a constitution, a planner, a recovery protocol, and a completion gate stacked together. On a model with headroom (M2.5), the extra structure helped. On a model that already followed instructions tightly (M2.7), the structure became a set of rules that fought each other, and the model picked the wrong winner.
The thing I should have caught by reading the prompt was that I had stopped adding constraints and started adding control systems. The auto-researcher caught it because deltas don’t lie about themselves the way prose does.
How I Verified
Pi’s SDK had event-capture issues with M2.7 and GLM-5 mid-pipeline. The runs that completed were clean, but I wanted the result to survive a different harness before I trusted it.
I ported the same three forms into Letta and re-ran the discriminative scenarios with a different grader (GLM-5 instead of an OpenAI model). The Letta numbers showed the same shape: stealth scaled with model quality, full regressed on M2.7. Two independent harnesses, same direction. Not an artifact.
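"Same direction" can be checked mechanically. A rough sketch, assuming each harness yields a dict of deltas keyed by (model, form); the function name and the `eps` dead zone are my own choices for illustration.

```python
def same_direction(pi: dict, letta: dict, eps: float = 0.0) -> dict:
    """For each (model, form) key present in both tables, report whether the deltas share a sign."""
    def sign(x: float) -> int:
        # -1, 0, or +1; values within eps of zero count as no movement.
        return (x > eps) - (x < -eps)

    shared = pi.keys() & letta.keys()
    return {key: sign(pi[key]) == sign(letta[key]) for key in shared}
```

Comparing signs rather than magnitudes is deliberate: two harnesses with different graders will rarely agree on the size of a delta, but they should agree on which way it points.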
The Letta dataset later expanded to 154 runs and got its own longer writeup. For this post, the relevant fact is just that the Pi finding reproduced.
What Changed
Three things, in this order:
1. Stopped writing prose blobs. The repo moved to a parameterized architecture: one canonical constitution.json, a render pipeline, composable layers in layers.json, and forms generated from parameter values. Each layer can be toggled. Each change produces a diff against the previous form, so the auto-researcher’s delta is attributable to a known control change instead of a prose rewrite.
2. Started gating on per-dimension deltas, not overall score. Overall scores hide the kind of regression that broke M2.7. Scope respect dropped while pushback held. The dimension-level deltas surfaced exactly which rule was fighting which.
3. Stopped trusting reading-the-prompt as evidence. Every form change now has to ship with a re-run. If the deltas don’t move, the change didn’t do what I thought.
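A sketch of what rendering forms from toggled layers might look like. The schemas are hypothetical: the post names constitution.json and layers.json but does not show their contents, so the keys and layer text below are placeholders.

```python
import json

# Hypothetical file contents, inlined so the sketch runs on its own.
CONSTITUTION = json.loads('{"core": "Answer first. Stay in scope."}')
LAYERS = json.loads("""{
  "planning": "Plan before multi-step work.",
  "recovery": "On failure, state what broke and retry once.",
  "completion_gate": "Verify the result before declaring done."
}""")


def render_form(enabled: list[str]) -> str:
    """Render a form as the core text plus each enabled layer, in layers.json order."""
    unknown = set(enabled) - set(LAYERS)
    if unknown:
        raise KeyError(f"unknown layers: {sorted(unknown)}")
    parts = [CONSTITUTION["core"]]
    parts += [text for name, text in LAYERS.items() if name in enabled]
    return "\n".join(parts)


def control_diff(before: list[str], after: list[str]) -> dict[str, list[str]]:
    """Name the layers toggled between two forms, so a score delta maps to a known change."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
    }
```

With forms defined this way, a re-run after `control_diff(["planning", "recovery"], ["planning"])` measures the effect of removing exactly one layer rather than the effect of a prose rewrite.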
What Made The Lesson Land
The specific personality finding is interesting on its own: stronger models follow instruction hierarchy harder, so competing rules cost them more. Worth knowing.
But the lesson I actually use is the one above it. The auto-research pipeline turned a kind of mistake I’d been making for weeks, adding instructions because each one sounded reasonable, into a delta I couldn’t argue with. Every prompt-engineering setup needs that loop. The specific scenarios matter less than the discipline of measuring deltas against a clean baseline and forcing every change to defend itself with a number.
I stopped rewriting prompts in the dark. That’s the part of this work that’s stayed valuable.
Key Takeaways
- An auto-research pipeline scoring delta-vs-baseline can catch prompt overload that direct reading misses
- The same full form lifted M2.5 by 0.29 and dragged M2.7 down by 0.15 against their own baselines
- Stronger models follow instruction hierarchy more consistently, so competing rules cost them more
- Reproducing the result in a second harness (Letta) is what turned the finding from artifact into signal
- The system that surfaced the lesson mattered more than the specific personality it tested