Why Your Coding Agent Should Do Less

Every AI coding tool is racing to add features. Plan mode. Sub-agents. Safety guardrails. MCP servers. Background processes you can't see. Automatic context compaction that silently drops things you needed.

And every feature added is context eaten, behavior hidden, and debugging made harder. The tools are getting more powerful and somehow less useful at the same time.

I've been watching this happen for two years. I've used most of the major agents — Claude Code, Cursor, Copilot Workspace, Devin. The one that consistently gets the most done is the one that stays out of my way.

The best coding agent does four things: reads files, writes files, edits files, and runs shell commands. That's it. Everything else is overhead dressed up as capability.

The Feature Creep Problem

When Tools Eat Their Own Context

Here's the thing nobody talks about in the agent demos: every feature your coding tool offers has to be explained to the model. The system prompt for a sophisticated agent can run 5,000 to 15,000 tokens — instructions on how to use plan mode, when to invoke sub-agents, what safety guardrails to respect, how to format tool results, when to ask for confirmation.

That is context your model can no longer use to think about your actual problem.

The Hidden Cost Of Features
| Feature | Context Cost (tokens) |
| --- | --- |
| Plan mode instructions | ~800 |
| Sub-agent coordination | ~1,200 |
| MCP server descriptions | ~600–2,000 per server |
| Safety guardrail rules | ~1,500 |
| Confirmation workflows | ~400 |
| Tool result formatting | ~300 |
| Context compaction logic | ~700 |
| **Total overhead (typical agent)** | ~5,500–7,000 |
| **Minimal agent system prompt** | ~400 |
| **Difference returned to your problem** | ~5,100–6,600 |
ℹ Note

Claude Sonnet has a 200k-token context window. At 5,500–7,000 tokens of overhead, you're burning roughly 3% of it on feature descriptions before the agent has read a single line of your codebase. On a 50k-token codebase, those tokens are the margin between "agent understands the architecture" and "agent hallucinates it."

This isn't a theoretical concern. I've watched agents fail at refactoring tasks they should handle easily, then succeed when I stripped the tooling back and gave the model a clean context to work in. The refactor was the same. The codebase was the same. The difference was 6,000 tokens of feature overhead.

The Complexity Tax

There's a second cost to feature-rich agents that's harder to measure: debugging becomes a nightmare.

When something goes wrong with a minimal agent, you can see exactly what happened. The agent read a file. It wrote a change. It ran a test. It failed. You can trace every step.

When something goes wrong with a sophisticated agent, the failure surface is enormous. Did the sub-agent receive the right context? Did the safety guardrail intercept a legitimate action? Did context compaction drop a critical file reference? Is this a model problem or a scaffolding problem?

You end up debugging the tool instead of the code.

Debugging Surface Area vs. Tool Complexity
| Property | Minimal Agent | Complex Agent |
| --- | --- | --- |
| Tools | 4 | 20+ |
| System prompt | ~400 tokens | ~8,000 tokens |
| Execution | Everything visible | Invisible sub-agents |
| Compaction | None | Automatic |
| Plan mode | None | Mandatory plan approval |
| When it fails | "The model misread the function" | "Is it the orchestrator? The sub-agent? The compaction? The MCP server? The safety filter? The plan mode?" |
| Time to diagnose | 2 minutes | 2 hours (sometimes never) |

Observability is not a nice-to-have. It's the entire point. If you can't see what the agent is doing, you can't trust what the agent is doing.

· · ·

What a Minimal Agent Actually Needs

Four Tools Are Enough

I keep coming back to this number: four tools. Read. Write. Edit. Bash. That's the full surface area of software development. Every task you've ever shipped involved some combination of reading existing code, writing new code, editing existing code, and running commands to compile, test, or deploy.
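To make that concrete, here's a minimal sketch of what the whole tool surface can look like in a Node.js/TypeScript harness. The tool names match the four above; the input shapes, descriptions, and `Tool` type are illustrative assumptions, not any particular vendor's API.

```typescript
import { readFileSync, writeFileSync } from "node:fs";
import { execSync } from "node:child_process";

// Each tool is a name, a one-line description the model sees,
// and a function the harness calls when the model invokes it.
type ToolInput = Record<string, string>;
type Tool = {
  name: string;
  description: string;
  run: (input: ToolInput) => string;
};

const tools: Tool[] = [
  {
    name: "read",
    description: "Read a file and return its full contents.",
    run: ({ path }) => readFileSync(path, "utf8"),
  },
  {
    name: "write",
    description: "Create or overwrite a file with the given content.",
    run: ({ path, content }) => {
      writeFileSync(path, content);
      return `wrote ${path}`;
    },
  },
  {
    name: "edit",
    description: "Replace an exact string in a file with a new string.",
    run: ({ path, oldText, newText }) => {
      const source = readFileSync(path, "utf8");
      if (!source.includes(oldText)) throw new Error("oldText not found");
      writeFileSync(path, source.replace(oldText, newText));
      return `edited ${path}`;
    },
  },
  {
    name: "bash",
    description: "Run a shell command and return its output.",
    run: ({ command }) => execSync(command, { encoding: "utf8" }),
  },
];
```

That's the entire harness worth having. Everything else an agent product ships is layered on top of these four calls.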

The Minimal Toolset
| Tool | What It Does | Why It's Enough |
| --- | --- | --- |
| `read` | Reads any file | Gives full codebase access |
| `write` | Creates new files | Enables code generation |
| `edit` | Modifies existing files | Enables targeted changes |
| `bash` | Runs shell commands | Runs tests, git, package managers, anything else |

| What You Don't Need | Why |
| --- | --- |
| Web search | Models are trained on the docs. Use `bash` + `curl`. |
| Browser | Irrelevant for most coding tasks. |
| Database tool | Use `bash` + your existing DB CLI. |
| Git tool | `bash` + `git`. Done. |
| PR tool | `bash` + `gh`. Done. |
| Memory tool | Write to files. Read files. That's memory. |
ℹ Note

bash is a universal adapter: anything you can do from a terminal, your agent can do through bash. And nearly everything in software development can be done from a terminal.

The "bash is the adapter" insight is underrated. Every specialized tool you give an agent is a tool it has to learn to use, that has to be described in the system prompt, that adds failure modes. Bash already has all the tools. They're called CLI applications. The agent already knows how to use them — they were in the training data.

The System Prompt Lie

The other thing sophisticated agents get wrong: massive system prompts full of behavioral instructions the model doesn't actually need.

Modern frontier models — Claude 3.5+, GPT-4o, Gemini Ultra — were trained with reinforcement learning specifically on coding tasks. They already know to read files before editing them. They already know to run tests after making changes. They already know to be careful with destructive operations.

You don't need to tell them this. The RL training already encoded it. When you add 3,000 tokens of behavioral instructions, you're not making the model smarter. You're adding noise that competes with your actual problem for attention.

What RLHF Already Taught Your Model
| You Don't Need to Tell It | It Already Knows? | What You Actually Need to Tell It |
| --- | --- | --- |
| "Read files before editing" | Yes | Project-specific conventions → AGENTS.md / CLAUDE.md |
| "Run tests after changes" | Yes | Preferred tools and frameworks → AGENTS.md |
| "Explain your reasoning" | Yes | Domain-specific constraints → AGENTS.md |
| "Ask when uncertain" | Yes | Output format preferences → AGENTS.md |
| "Use small targeted edits" | Yes | |
| "Don't delete files blindly" | Yes | |
| "Check the full context" | Mostly (this one is tricky) | |
ℹ Note

400 words of project context beats 4,000 words of generic behavioral instructions. Every time.

The right answer is a lean system prompt that defers project-specific behavior to a file in your repo — call it AGENTS.md or CLAUDE.md. The model reads it at the start of each session. You update it as your project evolves. The behavioral intelligence stays in the model where it belongs. The project context stays in the repo where it belongs.
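As an illustration only (every section and rule below is invented; yours will differ), an AGENTS.md might look like:

```markdown
# AGENTS.md

## Conventions
- TypeScript strict mode; no `any` in new code.
- Tests live next to source as `*.test.ts`; run with `npm test`.

## Constraints
- Never modify files under `src/legacy/` without asking first.
- Every API change needs a matching update in `docs/api.md`.

## Output
- Prefer small, reviewable diffs over large rewrites.
```

Note what's absent: no behavioral coaching, no "think step by step," no tool etiquette. Just the facts about this project that the model couldn't know from training.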

· · ·

Context Is Everything

What Gets Lost in Translation

There's a subtle failure mode I've watched derail agent sessions repeatedly: the model hesitates before reading large files.

It will read 50 lines. It will search for a specific function. It will grep for a pattern. But when you ask it to read an entire 2,000-line file to understand the architecture — the kind of read that would give it the full picture — it often hedges. Reads a portion. Makes assumptions about the rest.

Why? Training data bias. Most of the code the model learned from came in snippets. Stack Overflow excerpts. GitHub gists. Documentation examples. Tutorial blocks. The model learned patterns from code fragments, and that shapes how it reaches for context — fragment by fragment, not holistically.

The Context Gathering Failure Mode
Task: "Refactor the authentication module to use the new token format we discussed."

What often happens:

1. Searches for "token" across the codebase
2. Reads the 30 lines around each match
3. Forms a model of the system from snippets
4. Makes changes based on that partial model
5. Breaks three things it didn't know were connected

What should happen:

1. Read auth.ts (full file, 800 lines)
2. Read auth.test.ts (full file, 600 lines)
3. Read the token schema definition (full file, 200 lines)
4. Read 2–3 call sites with full file context
5. Form a complete model
6. Make surgical, correct changes

Cost difference: the first path is cheaper but wrong. The second costs roughly 3,000 more tokens and produces correct output. Always worth it.

This is why pre-session context gathering matters more than any other single agent behavior. Before you start a complex task, explicitly prime the agent: "Read these five files completely before we begin." Front-load the context. Don't let the agent pattern-match its way to a partial understanding and then act on it.
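One way to mechanize the front-loading, sketched in TypeScript with hypothetical file paths: build the first message from complete files, then ask for a summary before assigning any work.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical file list for the auth refactor discussed here.
const files = [
  "src/auth/index.ts",
  "src/auth/middleware.ts",
  "src/auth/tokens.ts",
];

// Paste full files, not snippets, into the first message,
// then ask for a summary before giving the task.
const primer = files
  .map((path) => `=== ${path} ===\n${readFileSync(path, "utf8")}`)
  .join("\n\n");

const firstMessage =
  primer +
  "\n\nSummarize how token validation currently works. " +
  "Do not change anything yet.";
```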

When Sub-Agents Are a Smell

The pitch for sub-agents is seductive: parallelize! Have specialists! Scale beyond a single context window!

The reality: if you're reaching for sub-agents in the middle of a coding session, it's usually a sign of poor upfront planning. The task wasn't scoped right. The context wasn't gathered right. Now you're trying to recover with orchestration complexity.

Sub-agents are a legitimate architecture for very large, multi-phase projects where different workstreams are genuinely independent. But for a typical coding session — even a long and complex one — they add coordination overhead, hidden context loss, and observability nightmares without proportional benefit.

If you're seeing sub-agent spawning in response to ordinary coding tasks, something went wrong earlier. Fix the planning, not the tooling.

· · ·

Observability Over Safety Theater

You Can't Prevent What You Can't See

Here's the uncomfortable truth about AI coding agent security: once you give an agent the ability to read files, write files, and run bash, it can exfiltrate any data on that machine through any number of channels. You can build guardrails. You can add confirmation prompts. You can whitelist shell commands.

A sufficiently motivated agent — or a sufficiently compromised one — will route around all of it.

The Security Theater Problem
| What Guardrails Actually Prevent | What Guardrails Don't Prevent | What Actually Works |
| --- | --- | --- |
| Accidental deletions (useful) | Data encoded in DNS queries | Run agents in containers with no network access |
| Obvious dangerous commands (useful) | Data in outbound API calls | Snapshot state before sessions |
| User confirmation friction (sometimes useful) | Data in git commits | Review diffs before committing |
| | Data in file names | Understand what you're asking the agent to touch |
| | Data in HTTP headers | |
| | A patient attacker | |
ℹ Note

"Security theater" is not "security." Guardrails give you confidence. Isolation gives you safety.

The better security model: run agents in isolated environments where the blast radius of a mistake — or a compromise — is contained by the environment, not by behavioral rules. A Docker container with no network access, a separate user account, a VM snapshot you can roll back. These are actual security. Confirmation dialogs are the appearance of security.
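A minimal sketch of that model, assuming a Node.js harness and a placeholder `agent-sandbox` image: route the agent's bash tool through a container with networking disabled, so the constraint is enforced by the environment rather than by behavioral rules.

```typescript
import { execFileSync } from "node:child_process";

// The agent's bash tool executes inside a container with no
// network, so "don't exfiltrate" is a property of the sandbox,
// not a rule the model is asked to follow.
// "agent-sandbox:latest" is a placeholder image name.
function sandboxedBash(command: string): string {
  return execFileSync(
    "docker",
    [
      "run", "--rm",
      "--network", "none",                  // closes every network channel
      "-v", `${process.cwd()}:/workspace`,  // agent sees only the project
      "-w", "/workspace",
      "agent-sandbox:latest",
      "bash", "-lc", command,
    ],
    { encoding: "utf8" }
  );
}
```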

See Everything, Trust Nothing You Can't See

The corollary to "security theater doesn't work" is: observability is non-negotiable.

Every action the agent takes should be visible. Every file read. Every file written. Every command run. Not summarized — the actual commands, the actual outputs. If your agent is spawning processes in the background that you can't inspect, if it's handing off context to sub-agents that operate out of your view — you've lost the thread. You can't debug what you can't see. You can't trust what you can't verify.
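One cheap way to get this property, as a sketch (the `Tool` shape is an assumption): wrap every tool in a logger that prints the call and the raw output before anything else happens.

```typescript
type Tool = {
  name: string;
  run: (input: Record<string, string>) => string;
};

// Print each tool call and its raw output. No summaries, no
// background execution: the transcript is the audit log.
function withLogging(tool: Tool): Tool {
  return {
    ...tool,
    run: (input) => {
      console.log(`→ ${tool.name} ${JSON.stringify(input)}`);
      const output = tool.run(input);
      console.log(output); // the actual output, not a paraphrase
      return output;
    },
  };
}
```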

This is why I've come to prefer agents that run everything in the foreground. Slower, sometimes. More predictable, always.

· · ·

The Right Workflow

Gather Context Before You Start

The single highest-leverage habit I've developed with coding agents: do all your context gathering before the first action.

Before you say "refactor this" or "add this feature" — prime the agent. Point it at the files it needs to understand. Let it read completely, not partially. Ask it to summarize what it understood. Correct the misunderstandings.

Then give the task.

The Two-Phase Agent Workflow
Phase 1: Context (5–10 minutes)

1. "Read src/auth/index.ts completely."
2. "Read src/auth/middleware.ts completely."
3. "Read src/auth/tokens.ts completely."
4. "Summarize how token validation currently works."
5. Correct misunderstandings; clarify edge cases.
6. "Now confirm: what are the 3 things most likely to break if we change the token format?"

Phase 2: Execution

1. "Now change the token format to use JWTs. Update validation, middleware, and all call sites."
2. Agent executes with full context.
3. Review diffs. Run tests.

| | Execution | Debugging | Net Time |
| --- | --- | --- | --- |
| Without Phase 1 | 10 min | 2 hours | 2h 10min |
| With Phase 1 | 12 min | 10 min | 22 min |
ℹ Note

The setup time pays for itself on every non-trivial task.

Long Sessions Beat Constant Restarts

One more counterintuitive thing: long single-context sessions often outperform frequent restarts, even as the context fills up.

The intuition says: fresh context, fresh start, the model isn't confused by prior conversation. The reality: the model accumulated a working model of your codebase across the session. It learned your conventions. It learned what you care about. It learned how you think about the architecture. Starting over throws all of that away.

Frontier models handle large context windows well. Use them. Run long sessions. Let the model build a genuine understanding of your project across a single thread. You'll get better output in hour three than in hour one.

· · ·

What Benchmarks Actually Tell Us

The Minimalist Surprise

Terminal-Bench 2.0 is the most recent comprehensive benchmark for coding agents — testing real development tasks in a real terminal environment. The result that should reshape how the industry thinks about tooling: a minimal four-tool agent scores comparably to agents with 20+ tools and sophisticated orchestration.

Not "surprisingly close." Not "within range considering its simplicity." Comparable. On real tasks. In real environments.

What This Means For The Industry
| Assumption | Reality |
| --- | --- |
| More tools → better performance | Not supported by the benchmarks |
| Plan mode → better output | No measured improvement |
| Sub-agents → parallelism wins | Coordination cost eats the gains |
| Safety guardrails → fewer errors | Marginal benefit; the issues are containable anyway |
| Large system prompts → better UX | Context overhead hurts |

| What Does Improve Performance | Evidence |
| --- | --- |
| Better base models | Consistently |
| More context (full file reads) | Consistently |
| Better task specification | Consistently |
| Iterative context building | Consistently |
| Clean, minimal toolset | Supported by the benchmarks |
ℹ Note

The tools industry built a narrative that features equal capability. The benchmarks say otherwise. The benchmarks are more honest than the demos.

The honest reading: the capability gains people attribute to features are mostly coming from better underlying models. The features are along for the ride, burning context and adding complexity, while the model does the heavy lifting.

Build the Right Habits, Not the Best Tools

Here's where I land after two years of daily agent use:

The bottleneck is almost never the tool. It's almost never the number of features available. It's almost always the quality of the context you provide and the clarity of the task you specify.

A mediocre task description with a sophisticated agent produces mediocre output. An excellent task description — clear scope, full context, explicit success criteria — produces excellent output even with a minimal tool.

The energy most developers spend evaluating new agent tools would be better spent getting better at writing specifications and gathering context. The leverage is in the input, not the tooling.

Where The Leverage Actually Is
| Leverage | Level | Example |
| --- | --- | --- |
| Task clarity | ★★★★★ | "Refactor auth to use JWTs. Here are the current files [attaches 4 files]. Preserve these 3 behaviors [lists]. Don't touch the session layer." |
| Context depth | ★★★★☆ | Full file reads before execution; explicit architecture summary first |
| Model choice | ★★★☆☆ | Claude Sonnet vs. Claude Haiku |
| Tool selection | ★★☆☆☆ | 4 tools vs. 8 tools |
| Feature set | ★☆☆☆☆ | Plan mode, MCP, sub-agents, etc. |
ℹ Note

Most developers optimize the bottom of this stack. The return is at the top.

The minimal agent philosophy isn't really about agents. It's about a principle that applies everywhere in software: the right constraint makes the system more effective, not less. A minimal API is easier to reason about than a sprawling one. A clear scope produces better software than a vague one. A focused tool with one job does that job better than a tool designed to do everything.

We're building AI systems that will eventually build other AI systems. The habits we're forming now — toward complexity, toward features, toward hidden behaviors — will compound. The alternative — simplicity, visibility, explicit context — compounds too.

I know which direction I'm betting on.

Leon Yeh is a GenX Computer Scientist writing about AI, blockchain, and the future of software. He has strong opinions about Kelly Criterion and medium opinions about everything else.
