
Instructions Are Rigid, Tool Descriptions Are Flexible

March 27, 2026 · Matt McKenna

What we learned from trying to get AI agents to use Surchin tools throughout a session, not just at the start and end.


Our production checklist gets Opus to ~80% overall compliance with Surchin's MCP tools. Query compliance is 100%. But we had a persistent problem: agents treat Surchin as bookends. They query at the start of a session, deposit at the end, and go dark in the middle.

For a knowledge substrate that's supposed to capture what agents learn during iterative work — the failed approach before the fix, the pattern discovered mid-refactor, the dead end that cost 20 minutes — bookend behavior means we're losing the most valuable signals.

We wanted mid-session tool usage. So we ran an experiment across three surfaces: tool descriptions, instructions, and hooks.

The Investigation

We started by deploying five parallel sub-agents to analyze why mid-session usage was low. Each agent examined a different surface area: tool descriptions, CLAUDE.md instructions, real session transcripts, benchmark results, and MCP protocol mechanics.

The transcripts were revealing. Agents re-read tool descriptions at every decision point — when choosing which tool to call next, when deciding whether to call a tool at all. But they read the CLAUDE.md checklist once at session start and then rely on the completion pressure of unchecked boxes. This suggested the two surfaces serve different cognitive roles.

We designed three experiment dimensions:

  • D1: Tool description variants (trigger signals in query, deposit, rate, or all)
  • D2: Instruction variants (loop-back instructions, trigger-based instructions)
  • D3: Hook-based enforcement (post-tool-call reminders via MCP protocol)

Each dimension had a control group (C0) using the production defaults. We tested on Opus using our instruction compliance benchmark suite — 10 synthetic tasks covering JWT bugs, RLS policies, TypeScript generics, race conditions, and more.
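The per-dimension setup can be pictured as a lookup with a control fallback. This is an illustrative sketch, not Surchin's actual harness (which isn't public); the variant names mirror the dimensions above, and the config values are stand-ins.

```python
# Toy model of per-dimension variant selection with a C0 control fallback.
# Any dimension a variant doesn't override stays at the production default.
PRODUCTION_DEFAULTS = {
    "tool_descriptions": "production defaults",
    "instructions": "production v2 checklist",
    "hooks": "disabled",
}

VARIANTS = {
    "D1b": {"tool_descriptions": "deposit_insight trigger signals"},
    "I2a": {"instructions": "loop-back instruction"},
}

def active_config(variant_id=None):
    """Return the full config: C0 defaults plus any variant overrides."""
    config = dict(PRODUCTION_DEFAULTS)
    config.update(VARIANTS.get(variant_id, {}))  # unknown/None -> pure C0
    return config
```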

Phase 1: Tool Descriptions

We tested four tool description variants against the control:

| Variant | What changed | Overall compliance |
| --- | --- | --- |
| C0 (control) | Production defaults | 93.5% |
| D1a | query_insights trigger signals | 93.5% |
| D1b | deposit_insight trigger signals | 100% |
| D1d | All tools get trigger signals | 95.5% |

D1b was the winner. It replaced the generic deposit description:

"Use this after solving a problem, discovering a pattern, or hitting a pitfall."

With explicit trigger moments:

"Call this tool at each of these moments: (1) AFTER fixing a bug — deposit root cause and fix as SOLUTION or PITFALL (2) AFTER a failed approach — deposit the dead end as a PITFALL so future agents skip it (3) AFTER discovering a non-obvious pattern — deposit as PATTERN (4) AFTER completing a sub-task in a multi-step problem — deposit before moving to next step."
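In MCP, this description lives on the tool definition itself, which is what agents re-read at each decision point. The field names below (name, description, inputSchema) follow the MCP tools/list schema, but the deposit_insight input schema is a guess for illustration, not Surchin's real one.

```python
# A trigger-signal description embedded in an MCP-style tool definition.
# The inputSchema here is hypothetical; only the description text is from
# the D1b variant quoted above.
DEPOSIT_TOOL = {
    "name": "deposit_insight",
    "description": (
        "Call this tool at each of these moments: "
        "(1) AFTER fixing a bug -- deposit root cause and fix as SOLUTION or PITFALL "
        "(2) AFTER a failed approach -- deposit the dead end as a PITFALL so future agents skip it "
        "(3) AFTER discovering a non-obvious pattern -- deposit as PATTERN "
        "(4) AFTER completing a sub-task in a multi-step problem -- deposit before moving to next step."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "kind": {"type": "string", "enum": ["SOLUTION", "PITFALL", "PATTERN"]},
            "summary": {"type": "string"},
        },
        "required": ["kind", "summary"],
    },
}
```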

The pattern came from our set_preference tool, which already listed specific trigger moments ("when you detect a preference signal") and had high natural call rates.

D1b cracked task a-004 — a TypeScript generics refactoring task that had resisted deposits across every prior variant in five experiment phases. No instruction wording could make Opus deposit on that task. A tool description change did.

The D1a variant (query triggers) had zero effect. Query compliance was already 100% — there was nothing to improve. And D1d (all triggers combined) scored worse than D1b alone. Adding trigger signals to tools that didn't need them diluted the effect.

Phase 2: Instructions

We tested two instruction-level approaches to mid-session usage:

| Variant | What changed | Overall compliance |
| --- | --- | --- |
| C0 (control) | Production v2 checklist | 100% |
| I2a | Loop instruction ("after each sub-task, loop back") | 80% |
| I2b | Trigger instruction ("at these moments, call these tools") | 71.5% |

Both instruction variants regressed compliance from the control. The loop instruction (I2a) dropped to 80%. The trigger instruction (I2b) dropped to 71.5%.

This was the opposite of what we expected. The instruction variants were designed to encourage more tool usage, but they disrupted the rigid compliance pressure of the checklist. The checklist works because it's a linear sequence: do step 1, do step 2, do step 3. Introducing loops or conditional triggers turned the sequence into a decision tree — and every decision point is a chance for Opus to choose "no."
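The decision-tree effect can be made concrete with back-of-the-envelope arithmetic. The numbers here are illustrative, not measured: if the agent says "yes" at each optional decision point with some probability, end-to-end compliance decays exponentially in the number of decision points a variant introduces.

```python
# Toy model of why decision points hurt: a rigid checklist has no optional
# branches, while loop/trigger instructions add k chances to choose "no".
# The 0.9 per-decision rate is made up for illustration, not Surchin data.
def end_to_end_compliance(p_yes: float, decision_points: int) -> float:
    """Probability the agent complies at every optional decision point."""
    return p_yes ** decision_points

checklist = end_to_end_compliance(0.9, 0)  # linear sequence: no branches
triggers = end_to_end_compliance(0.9, 3)   # three conditional triggers
```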

This confirmed a principle from our earlier research: any flexibility in instructions is an escape hatch.

Phase 3: Saturation Check

We tested new task types designed to require mid-session tool calls: iterative debugging (multi-step diagnosis) and compound fixes (multiple related bugs in one session). Both C0 and D1b scored 100% on these tasks.

The binary compliance metrics had saturated. Our benchmark tasks were too clean — agents completed them in a single pass, so there was no "middle" of the session to measure. Real-world sessions with multiple sub-tasks will show the D1b effect more clearly, but we'd need session-level telemetry (not benchmark compliance scores) to measure it.

The Insight

Instructions and tool descriptions serve complementary cognitive roles:

Instructions provide compliance pressure. They work through completion anxiety — unchecked boxes that the agent feels compelled to check off. They must be rigid, linear, and absolute. "All boxes must be checked. Task is incomplete until every step is done." Any flexibility ("skip if," "when appropriate," "loop back") weakens the pressure.

Tool descriptions provide contextual triggers. Agents re-read them at every decision point. They work by pattern-matching against the agent's current situation: "I just fixed a bug — oh, the deposit tool says to call it after fixing bugs." They should be specific about when to call the tool, not just what the tool does. Flexibility here is a feature, not a bug.

Trying to make instructions do the job of tool descriptions (I2a, I2b) weakened compliance. Enriching tool descriptions with the right trigger signals (D1b) improved behavior without touching the checklist at all.

What This Means for MCP Tool Designers

If you build MCP tools, your tool description is not just API documentation. It's a behavior guide that agents re-evaluate at every decision point in a session.

Three guidelines from our data:

1. Put trigger signals in your tool description. Don't say "use this to do X." Say "call this at each of these moments: (1) after X, (2) when Y, (3) before Z." Model the moments when calling your tool is the right action.

2. Keep your instructions rigid. If you have a CLAUDE.md or system prompt that tells agents how to use your tools, make it a strict sequence with checkboxes. Don't add loops, conditionals, or "skip if" language. The instruction's job is compliance pressure, not contextual guidance.
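One way to operationalize this guideline is to lint instruction text for escape-hatch phrasing before shipping it. A minimal sketch, with the phrase list drawn from the escape hatches named in this post:

```python
# Minimal sketch of an instruction linter that flags flexibility language.
# The phrase list covers the escape hatches discussed in this post; extend
# it for your own instruction files.
ESCAPE_HATCHES = ["skip if", "when appropriate", "loop back"]

def find_escape_hatches(instructions: str) -> list:
    """Return the escape-hatch phrases present in the instruction text."""
    lowered = instructions.lower()
    return [phrase for phrase in ESCAPE_HATCHES if phrase in lowered]

rigid = "1. Query insights. 2. Do the task. 3. Deposit insights. All boxes must be checked."
flexible = "Query insights, then loop back and deposit when appropriate."
```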

3. Don't combine all improvements. D1b alone was 100%. D1d (all tools enriched) was 95.5%. More is not better. Identify which tool has the compliance gap and enrich only that one. Signal dilution is real.

The Numbers

| Phase | What we tested | Key finding |
| --- | --- | --- |
| Phase 1 | Tool description variants (D1a, D1b, D1d) | D1b = 100%, D1a = no effect, D1d = worse than D1b alone |
| Phase 2 | Instruction variants (loop, triggers) | Both regressed (80%, 71.5%) vs 100% control |
| Phase 3 | New task types (iterative, compound) | Binary metrics saturated — need session telemetry |
| Insight | | Rigid instructions + flexible tool descriptions = complementary roles |

What We Shipped

We made the D1b deposit description the production default. When no experiment variant is active, every Surchin user's agent now sees the trigger-signal description. The variant mechanism stays in place for future experiments.

The checklist is unchanged. It's still the same rigid 5-checkbox sequence from Phase 1. We stopped trying to make it flexible.


This experiment ran on Claude Opus using the Surchin instruction compliance benchmark suite. Total cost: ~$15 in API calls across three phases.