Why is Claude Code (and other agentic coding tools) architected this way?
April 21, 2026
Why does Claude Code feel magical in a sea of AI products? Is it the way it's architected — cloud reasoning paired with local orchestration?
What does that even mean? Why this architecture, at this particular moment? And how does it connect to today's compute landscape?
I got curious and dove deep (with the help of Claude Code, who else?), and here's what I learned.
The short version: when Claude Code (or OpenClaw, or Hermes agents, or basically any agent framework) is working, your computer is handling nearly everything. File reads, terminal commands, searching through your codebase, writing and saving files — all of that runs locally on your machine. The powerful language model in the cloud gets called only when actual reasoning is needed. And the connection between your machine and that cloud model is, technically speaking, tiny. A thin pipe.
Then, on March 31, 2026, Anthropic accidentally shipped a debugging file inside a routine npm package update — version 2.1.88 of the @anthropic-ai/claude-code package. The file was a JavaScript source map, intended only for internal use. It contained 512,000 lines of TypeScript source code: Claude Code's full client-side runtime, all 40+ tool definitions, the permission logic, and a 57,000-word system prompt. Everything Anthropic had built around the model, exposed.
The leak confirmed the architecture — and added detail I didn't have. Let me explain exactly how this works, using the source code as the evidence.
Note: I'm referring throughout to the leaked Claude Code source code (v2.1.88, March 31, 2026). Anthropic pulled the file quickly, but not before it was widely analyzed.
What Happens When You Type a Prompt
Let's make this concrete. You open Claude Code and type: "Find all the files in my project that import the database module, and tell me which ones might be causing a memory leak."
Here's what actually happens, step by step.
Step 1: Your computer reads the files.
Claude Code issues a command to your local filesystem. It finds files matching a pattern — this is a grep command, essentially. Your machine scans through potentially hundreds of files in milliseconds. The computation here is trivial by modern standards. Your laptop's CPU handles it without breaking a sweat.
Step 2: It assembles a context.
The results — which files, which lines, what the surrounding code looks like — get assembled into a block of text. This is the "context" that will be sent to the model. Your machine did all the work of finding and formatting this information. The AI hasn't been involved yet.
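To make those first two steps concrete, here's a rough TypeScript sketch of the same work: grep the project for files that import the database module, then assemble their contents into a plain-text context block. The function names and structure are my own for illustration, not Claude Code's actual internals.

```typescript
import { execFile } from "node:child_process";
import { readFile } from "node:fs/promises";
import { promisify } from "node:util";

const run = promisify(execFile);

// Step 1: find files that import the database module. This is plain grep,
// running entirely on the local machine.
async function findImporters(root: string): Promise<string[]> {
  const { stdout } = await run("grep", [
    "-rl",
    "--include=*.ts",
    "import .* from ['\"].*database",
    root,
  ]);
  return stdout.split("\n").filter(Boolean);
}

// Step 2: assemble the matches into a plain-text context block for the model.
async function buildContext(files: string[]): Promise<string> {
  const chunks: string[] = [];
  for (const file of files) {
    chunks.push(`--- ${file} ---\n${await readFile(file, "utf8")}`);
  }
  return [
    "Task: identify which of these files might be causing a memory leak.",
    ...chunks,
  ].join("\n\n");
}
```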
Step 3: The thin pipe fires.
That assembled context gets sent to Claude — the actual language model — running on Anthropic's servers. The data transmitted is maybe 20,000 tokens of text. Up it goes to the cloud.
Step 4: The cloud thinks.
Now the frontier model does what it's uniquely good at: reasoning. It reads the code, understands the patterns, identifies the likely culprit, and forms an analysis. This is the computation that requires enormous scale — we're talking models with hundreds of billions of parameters running on GPU clusters that cost millions of dollars, on specialized chips that didn't exist five years ago.
Step 5: The thin pipe fires back.
The model's response comes back down. Another few KB of text. Maybe a paragraph of analysis and a suggestion for which file to check.
Step 6: Your computer acts.
Claude Code reads the response and issues more local commands. It opens the file the model identified, reads its contents, maybe runs a test, checks the output. Your machine is doing the work again.
And the cycle repeats.
The leaked source code gave this loop a name: Think-Act-Observe-Repeat. That's the actual term used internally. Think (model reasons), Act (local system executes), Observe (local system gathers results), Repeat. It's a tight loop where the model only participates in one of the four steps.
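Here's what that loop looks like reduced to its skeleton, sketched in TypeScript against Anthropic's public Messages API. The tool set, model name, and structure are simplified placeholders rather than the real Claude Code runtime, but the division of labor is the same: one cloud call per iteration, everything else local.

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readFile } from "node:fs/promises";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// The tool definitions the model sees: just JSON schemas, no code.
const toolSchemas: Anthropic.Tool[] = [
  {
    name: "read_file",
    description: "Read a file from the local filesystem",
    input_schema: {
      type: "object",
      properties: { path: { type: "string" } },
      required: ["path"],
    },
  },
];

// The tool implementations: these run locally; the model never executes them.
const localTools: Record<string, (input: any) => Promise<string>> = {
  read_file: ({ path }) => readFile(path, "utf8"),
};

async function agentLoop(task: string): Promise<void> {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: task }];

  while (true) {
    // THINK: the only step that goes through the thin pipe to the cloud.
    const response = await client.messages.create({
      model: "claude-sonnet-4-5", // model name is illustrative
      max_tokens: 4096,
      tools: toolSchemas,
      messages,
    });
    messages.push({ role: "assistant", content: response.content });

    // If the model didn't ask for a tool, it's done reasoning for this turn.
    if (response.stop_reason !== "tool_use") break;

    // ACT + OBSERVE: execute each requested tool locally, collect the results.
    const results: { type: "tool_result"; tool_use_id: string; content: string }[] = [];
    for (const block of response.content) {
      if (block.type === "tool_use") {
        const output = await localTools[block.name](block.input);
        results.push({ type: "tool_result", tool_use_id: block.id, content: output });
      }
    }

    // REPEAT: the observations go back up the pipe as part of the next context.
    messages.push({ role: "user", content: results });
  }
}
```

Notice that the model never calls readFile itself. It can only emit a tool_use block asking the local loop to do it, which is the whole point of the architecture.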
The Compute Asymmetry
Here's the thing that makes this architecture make sense: the two kinds of computation involved are separated by orders of magnitude in cost and complexity.
Your laptop handles file operations in nanoseconds to milliseconds. Reading a file, running a terminal command, searching through directories — these are tasks that $500 worth of hardware does effortlessly and continuously.
Running frontier language model inference is a completely different beast. Generating a single token with a model like Claude takes on the order of a trillion mathematical operations (roughly two per parameter, for a model with hundreds of billions of parameters), and a full response means thousands of those forward passes. That requires specialized GPU hardware. At scale, we're talking data centers, custom silicon, infrastructure investments that run into hundreds of millions of dollars. You cannot run this on your MacBook. Not today. Probably not for a long time.
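The back-of-the-envelope arithmetic behind that claim uses the standard rule of thumb that a transformer performs roughly two operations per parameter per generated token. The parameter count below is an assumed round number, not Claude's actual size.

```typescript
// Back-of-the-envelope inference cost, using the ~2 FLOPs/parameter/token rule of thumb.
const assumedParameters = 400e9;                          // hypothetical frontier-model size
const flopsPerToken = 2 * assumedParameters;              // ≈ 8e11 operations per generated token
const responseTokens = 1_000;
const flopsPerResponse = flopsPerToken * responseTokens;  // ≈ 8e14 operations per response

// Compare: a local file read or grep costs on the order of
// microseconds to milliseconds of CPU time on commodity hardware.
console.log(flopsPerResponse.toExponential(1)); // "8.0e+14"
```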
So the architecture divides labor based on what's actually cheap and what's actually expensive:
  • Cheap (local): File I/O, state management, context assembly, running commands, reading outputs, writing files.
  • Expensive (cloud): Understanding complex reasoning, making judgment calls, synthesizing information across a large context, generating new code or prose.
Your laptop handles cheap. The cloud handles expensive. The thin pipe connects them.
This isn't a design decision someone made because it seemed clever. It's the only arrangement that makes economic sense given the current state of compute.
The Thin Pipe, Quantified
"Thin pipe" is a metaphor, but let me make it literal for a second.
When Claude Code sends a request to the cloud model, the data payload is text — no binaries, no direct file access, just the assembled context: everything about your project, the task at hand, the history of what's been done, and whatever files are relevant. On a modern internet connection, that transmits in milliseconds.
The response comes back — a few hundred words, some code, an analysis. Also milliseconds.
Compare that to what's happening locally. During a complex coding session, your machine might read, process, and write several megabytes of data — file scans, test runs, build outputs, log files. The local compute is doing the heavy lifting by volume. The cloud is doing the heavy lifting by complexity.
What crosses the wire is, structurally, just text. A prompt goes up. A completion comes back. The model has no persistent connection to your machine. It doesn't see your files directly. It can't run commands itself. It receives a description of your world and produces a response that your local system then acts on.
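Reduced to types, the round trip looks something like this. It's a simplified illustration of the shape of the exchange, not the actual API schema.

```typescript
// What goes up the pipe: plain data describing your world.
interface AgentRequest {
  systemPrompt: string;        // instructions and orchestration rules
  conversation: string[];      // prior turns, possibly compacted summaries
  fileExcerpts: { path: string; snippet: string }[]; // text copies, not file handles
  toolSchemas: object[];       // descriptions of what the local system can do
}

// What comes back down: text, plus requests for the local system to act.
interface AgentResponse {
  analysis: string;                             // prose and code, as text
  toolCalls: { name: string; input: object }[]; // instructions, not actions
}
```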
But here's where I need to complicate the "thin pipe" framing, because the leaked code reveals something important: the pipe is thin, but what runs on your machine to assemble what goes through that pipe is not simple at all.
The leaked source shows a module called QueryEngine that's 46,000 lines long. Its job is context management — deciding what information to include in each prompt, how to compress conversation history, what to cache and what to drop. Context is expensive. Every token sent to the cloud model costs money and time. So Claude Code has an entire engine whose sole purpose is to be smart about what goes through the pipe.
The compression strategy has three layers, named in the source code:
  • MicroCompact — quick local trimming, no API call required
  • AutoCompact — triggered when you're approaching the model's context limit, generates structured summaries locally
  • Full Compact — a complete conversation compression, selectively re-injecting only the most relevant files and context
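How those layers get chosen isn't spelled out beyond the leak, but the decision being made is easy to sketch: estimate how full the context window is, then apply the cheapest layer that keeps the next request under budget. The thresholds below are invented for illustration; only the tier names come from the source.

```typescript
type CompactionTier = "none" | "micro" | "auto" | "full";

// Rough token estimate: ~4 characters per token is a common heuristic for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Pick the cheapest compaction layer that keeps the next request under budget.
// The cutoffs here are made up; the leaked code names the tiers but not these values.
function chooseCompaction(contextText: string, contextLimit: number): CompactionTier {
  const usage = estimateTokens(contextText) / contextLimit;
  if (usage < 0.5) return "none";   // plenty of headroom, send as-is
  if (usage < 0.75) return "micro"; // trim stale tool output locally, no API call
  if (usage < 0.92) return "auto";  // summarize older turns before the limit hits
  return "full";                    // rebuild the context, re-inject only key files
}
```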
The source code also reveals something called sticky latches — a mechanism that prevents mode changes from invalidating the prompt cache. If Claude Code switches between tasks and the cache breaks, you pay the full cost of re-sending context. Sticky latches prevent that. It's described internally as treating cache optimization as an accounting problem.
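Prompt caches key on an exact prefix of the request: change the opening tokens and the cached prefix is worthless, so the full context gets re-sent and re-processed at full price. A sticky latch, as described, simply refuses to let mid-session mode changes rewrite that prefix. Here's a toy illustration of the concept — my own reading of the mechanism, not the leaked implementation.

```typescript
// Prompt caches key on an exact prefix of the request. If the prefix changes,
// the cache misses and the full context has to be re-sent and re-processed.
class StickyPrefix {
  private latched: string | null = null;

  // The first prefix of the session wins; later "mode changes" don't rewrite it.
  prefixFor(candidate: string): string {
    if (this.latched === null) this.latched = candidate;
    return this.latched;
  }
}

const latch = new StickyPrefix();
const prefixA = latch.prefixFor("system prompt + tool schemas (planning mode)");
const prefixB = latch.prefixFor("system prompt + tool schemas (editing mode)");
console.log(prefixA === prefixB); // true: the cacheable prefix stays stable
```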
None of this crosses the wire. It's all local computation, running on your machine, before a single byte goes to the cloud.
Here's a real example — this article. I wrote it using Claude Code over the course of one working session. What felt like a flowing conversation was, underneath, around 40–55 discrete round trips to the cloud. Here's roughly how those loops broke down:
  • Filing inbox captures and routing files: ~7 loops — reading files, inferring destinations, writing to the right folders, deleting processed items
  • Finding and reading source material: ~6 loops — searching across learning notes, drafts, ideas folders, pulling the right files into context
  • Two research subagents (one on the leaked source code, one on Cursor/Hermes/OpenClaw): each ran 8–12 loops of their own internally, searching the web, fetching pages, synthesizing findings
  • Writing and editing the draft: ~10 loops — generating the piece, then five separate rounds of editing as the structure evolved
  • Miscellaneous actions (moving files, creating tasks, updating logs): ~4 loops
Each loop: a packet of context goes up, a response comes back, local tools execute, results feed into the next packet. The session probably ran for close to an hour. The actual cloud inference time — the seconds the model was actively reasoning — was maybe 3–5 minutes of that. The rest was the local machine working.
This is what people mean when they talk about "orchestration." The local environment — Claude Code, in this case — is the orchestrator. It decides what information to gather, how to package it, what to do with the model's response, and what actions to take next. The model is the reasoning engine. The orchestrator is everything else — and as the leaked code shows, "everything else" is doing a lot.
The Nervous System Metaphor
The leaked code confirmed a framing I'd been reaching for. Anthropic's engineers describe the local environment as a "nervous system" — it routes signals, stores memory, moves muscles. The cloud model is the brain. Every decision comes from the brain, but the brain doesn't directly see or touch the world. It only ever receives signals from the nervous system, and it sends signals back.
Think about how your own body works. Your brain doesn't directly sense the temperature of the air. Nerve endings in your skin measure the temperature and send signals up through your spinal cord. Your brain processes those signals, decides you're cold, and sends signals back down to tell your muscles to generate heat. The brain never touched the air. The nervous system was the intermediary the entire time.
An AI agent on your computer works the same way. The model never opens a file. It never runs a terminal command. It never reads your screen. It receives a text representation of all of those things, assembled by the local system, and it sends back text that the local system interprets as instructions.
When Claude Code reads a file and includes its contents in a prompt, that's the nervous system sending a signal to the brain. When Claude responds with "look at line 47 in module.js", that's the brain sending a signal back. When your machine opens that file and checks line 47, that's the nervous system acting on the instruction.
The magic isn't in the brain alone. It's in the tight integration between the brain and the nervous system.
Why This Isn't Obvious (And Why It Matters)
The reason this architecture is invisible to most people is that the product experience is designed to make it invisible. You type. Things happen. The interface is seamless. You have no reason to think about what's local versus what's remote.
But the architecture has real implications.
Speed. When Claude Code feels fast, it's partly because most of what it does is local. File operations on modern hardware are genuinely very fast. You're not waiting for the cloud for those. You're only waiting for cloud latency when actual model inference happens — and those round trips, while noticeable, are the minority of the total time.
Cost. Model API calls cost money based on the number of tokens processed. Everything your machine does locally is free. The economic incentive is to do as much locally as possible and only call the model when reasoning is truly needed. This is why well-designed AI coding assistants don't send every file in your project to the model on every keystroke — they gather context intelligently, selectively, right before it's needed.
Control. Because your machine is the orchestrator, you have real control over what the model sees and what it can do. Claude Code can only take actions that the local system permits — reading files in directories you've allowed, running commands you've approved. The model can suggest anything, but the local nervous system decides whether to act on it.
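In code, that control is a gate the model's requested actions must pass before anything executes. The real Claude Code permission system is far more elaborate; the rules below are placeholders that show the shape of the check.

```typescript
import * as path from "node:path";

interface Permissions {
  allowedRoots: string[];      // directories the agent may read or write
  approvedCommands: string[];  // shell commands the user has pre-approved
}

// The model can *ask* for anything; only requests that pass this gate run.
function isAllowed(action: { tool: string; target: string }, perms: Permissions): boolean {
  if (action.tool === "read_file" || action.tool === "write_file") {
    const resolved = path.resolve(action.target);
    return perms.allowedRoots.some(
      (root) => resolved.startsWith(path.resolve(root) + path.sep),
    );
  }
  if (action.tool === "run_command") {
    return perms.approvedCommands.includes(action.target.split(" ")[0]);
  }
  return false; // unknown tools are denied by default
}
```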
Privacy. Your files don't live in the cloud. They live on your machine. Only the specific excerpts assembled for a given prompt cross the wire. For most people, this isn't something they've thought carefully about — but for enterprises, it's one of the most important architectural properties of these systems.
The Harness Does the Work
There's a concept I've been thinking about a lot that crystallizes why this architecture matters beyond just Claude Code.
Andrej Karpathy — one of the most influential AI researchers alive, who ran AI at Tesla and co-founded OpenAI — built something called Autoresearch.
It's an autonomous system that iterates on machine learning experiments continuously, without human intervention. It proposes changes, runs experiments, measures results, and keeps or discards based on a fixed metric.
Here's the thing: most of what Autoresearch does has nothing to do with the AI model. It manages files. It runs experiments. It measures outputs. It logs results. It rolls back changes that don't work. The model is called to propose changes and evaluate results — genuine reasoning tasks. But the system running around that model is deterministic, controlled, local computation.
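Autoresearch's code isn't public, so the sketch below is only the shape of such a harness under my own assumptions: a deterministic local loop with exactly one reasoning call per iteration.

```typescript
// A generic experiment harness: everything here except proposeChange() is
// ordinary local computation. The names and structure are hypothetical.
interface Experiment {
  description: string;
  apply: () => Promise<void>;    // edit configs/code locally
  revert: () => Promise<void>;   // roll back if the change doesn't help
}

async function harness(
  proposeChange: (history: string[]) => Promise<Experiment>, // the only model call
  runAndMeasure: () => Promise<number>,                      // local: run the experiment, return the metric
  iterations: number,
) {
  const history: string[] = [];
  let best = await runAndMeasure(); // baseline metric, computed locally

  for (let i = 0; i < iterations; i++) {
    const experiment = await proposeChange(history); // THINK (model)
    await experiment.apply();                        // ACT (local)
    const score = await runAndMeasure();             // OBSERVE (local)

    if (score > best) {
      best = score;
      history.push(`kept: ${experiment.description} (metric ${score})`);
    } else {
      await experiment.revert();                     // discard, deterministically
      history.push(`reverted: ${experiment.description} (metric ${score})`);
    }
  }
  return { best, history };
}
```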
Karpathy's insight — and it's one worth sitting with — is that the model isn't what makes the system powerful. The harness does. The constraints, the structure, the local machinery that turns model outputs into actions and model inputs from real-world observations. The model just needs to be good enough. The harness needs to be excellent.
This is why Garry Tan, the president of Y Combinator, talks about "thin harness, fat skills." The harness — the local orchestration layer — should be minimal and tight. The skills — specific tools, workflows, structured knowledge — should be rich and specialized. The model threads through the middle.
The leaked Claude Code source code is essentially a 512,000-line proof of this principle. Strip out the model and you still have: 40+ tools with independent permission gates and input validation, a 46,000-line context engine, a three-layer compression strategy, Bash command security validators running 25 regex checks and AST parsing before anything executes, and a multi-agent coordination system using a "mailbox pattern" — where worker agents route high-risk operations to a coordinator for human approval rather than acting unilaterally.
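For a feel of what a Bash validator does, here's a toy version with three regex checks. The leaked code reportedly runs 25 such checks plus AST parsing; these patterns are my own examples, not the real rule set.

```typescript
// A handful of illustrative patterns; the real validator is far more thorough
// and also parses the command before letting it run.
const deniedPatterns: { pattern: RegExp; reason: string }[] = [
  { pattern: /rm\s+-rf\s+(\/|~)\s*$/, reason: "deleting the filesystem root or home directory" },
  { pattern: /curl[^|]*\|\s*(ba)?sh/, reason: "piping a downloaded script into a shell" },
  { pattern: />\s*\/dev\/sd[a-z]/, reason: "writing directly to a block device" },
];

function validateCommand(command: string): { ok: boolean; reason?: string } {
  for (const { pattern, reason } of deniedPatterns) {
    if (pattern.test(command)) return { ok: false, reason };
  }
  return { ok: true };
}

console.log(validateCommand("rm -rf /tmp/build-cache")); // { ok: true }  (a scoped path, not root)
console.log(validateCommand("curl https://x.sh | sh"));  // { ok: false, reason: ... }
```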
The 57,000-word system prompt — longer than most business books — is where Anthropic encoded the orchestration logic itself: workflows, decision rules, safety constraints, and behavioral guidelines. None of that is the model. All of it runs as context fed to the model. The intelligence is distributed across the harness and the model together, not concentrated in one place.
When you think about AI agents this way, the question stops being "how capable is the model?" and starts being "how well-designed is the harness?" A mediocre model in an excellent harness will outperform an excellent model in a mediocre harness, because most of what agents actually do is local computation that the model never touches.
What Happens When This Changes
The architecture I've described — local orchestration, cloud inference, thin pipe — is the architecture of 2026. Every serious AI coding tool, every agentic framework, every AI assistant with real capabilities works this way right now.
But this might not always be true.
There's a bet being made in the industry about whether frontier-level reasoning can be compressed enough to run locally. Models that required data center hardware two years ago can run on a MacBook today — not the frontier models, but models trained to approximate them. The gap is closing, though it's not obvious how close it's getting on the tasks that actually matter.
If local models get good enough for most everyday reasoning tasks, the architecture shifts. The thin pipe doesn't need to reach all the way to Anthropic's servers. It might only reach to a model running on your own machine, or on a company's internal infrastructure. The local nervous system and the brain get closer together, or merge entirely.
This would change a lot. Latency drops to near-zero. Privacy becomes trivial — nothing ever leaves your device. The marginal cost of a reasoning step approaches zero once you own the hardware.
It would also raise a different problem. Frontier model development — the kind that produces genuinely new capabilities — costs hundreds of millions of dollars per training run. That investment gets funded by inference revenue: every time you call an API, part of what you pay funds the next generation. If inference moves to edge and runs on local distilled models, who funds the frontier? That's an open question with no comfortable answer yet.
But the more immediate point stands: the current architecture is not the final architecture. The thin pipe exists because of a compute asymmetry that is narrowing, not widening. Where it goes from here is one of the more consequential technical bets being made in the industry right now.
The Takeaway
If you remember one thing from this: your AI coding assistant is mostly running on your machine.
The model in the cloud is the piece that thinks. It receives a packet of context, reasons over it, and returns a response. That's what it does. It doesn't execute. It doesn't run commands. It doesn't access your files. It processes text and generates text.
Everything else — the file reads, the command execution, the context assembly, the action-taking — happens locally. Fast, cheap, under your control.
The architecture exists because of a real cost asymmetry: local compute is nearly free, frontier model inference is expensive and requires specialized hardware. The system routes each task to where it's cheapest to run it.
Understanding this changes how you think about AI tools. The model isn't magic. It's a reasoning engine accessed through a protocol. The local system — the harness — is what turns that reasoning into useful work.
And right now, for the tools that are actually shipping, the harness is doing most of it.