The Three Layers of Agentic Engineering Maturity
This is a follow-up on our journey developing agentic software engineering practices at RIVET, building on the story started in Advanced Agentic Coding & The Journey Towards 3x Product Development Velocity.
When engineers talk about "using AI," they're often conflating three distinct engineering disciplines — each with its own investment curve, its own ceiling, and its own compounding dynamics.
At RIVET, we've been bumping into this conflation repeatedly as we invest deeper into agentic development. The term "prompt engineering" gets used to describe everything from a single sentence in a chat window to a full multi-agent harness running in production. That looseness makes it hard to think clearly about where to invest next and why.
So here's the framing we've settled on internally: prompt engineering, context engineering, and harness engineering. Three nested disciplines within what we call agentic software engineering — the practice of building with agents as a core part of your development workflow. And agentic software engineering is itself a subset of something broader: AI engineering, the discipline of building agentic systems as products and infrastructure, not just using them as tools.
That distinction matters. AI engineering is the broader discipline — it includes building LLM-powered products, designing retrieval systems, fine-tuning models, building evaluation pipelines, and shipping agent infrastructure at scale. (Chip Huyen's AI Engineering is a fantastic & comprehensive reference on the field.) It's what MLOps became when foundation models arrived. Agentic software engineering is one slice of that: the practice of using agents as collaborators in your own development workflow. Most of this post is about that inner slice — prompt, context, and harness engineering as they apply to writing better software faster. But the outermost layer, harness engineering, starts to blur the line. When you're programming around the model, designing autonomous workflows, and shipping agent outputs to production, you're not just using AI to code. You're doing AI engineering.
The three layers are a Russian doll. Prompt engineering lives inside context engineering, which lives inside harness engineering. You can't do context engineering well without solid prompting fundamentals. You can't build a harness that works without the context layer feeding it the right knowledge. Each outer layer contains and depends on the inner ones.
This means investment compounds in one direction. Getting better at prompt engineering makes your context engineering more effective, which makes your harness engineering more powerful. But it also means skipping layers doesn't work. A team that jumps straight to building agent harnesses without investing in prompt craft and knowledge infrastructure is building on sand.
Prompt Engineering
Prompt engineering is the craft of telling an agent what to do — and doing it precisely enough that it actually does it.
This means writing clear, unambiguous instructions. It means defining rules that encode your team's expectations (what patterns to follow, what to avoid, what "done" looks like). It means building a CLAUDE.md that gives the agent reliable orientation at the start of every session. It means building skills that encode your process as reusable commands — /implement, /review-pr, /update-project-docs — so you stop re-explaining your workflow every time.
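As a concrete sketch of what a skill file can look like (the layout follows the common SKILL.md convention with YAML frontmatter; the name, frontmatter values, and steps here are illustrative, not RIVET's actual files):

```markdown
---
name: review-pr
description: Review an open pull request against team conventions
---

1. Read the PR diff and the linked ticket.
2. Check the changes against the architecture docs and rules files.
3. Flag anything that violates a rule; cite the rule by name.
4. Summarize: what's solid, what must change, what's optional.
```

Invoking /review-pr then loads this file into context, so the workflow never has to be re-explained in chat.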
The feedback loop here is tight and human-scaled. You write a better rule, you see better output in the next session. It compounds, but it compounds within a session or across a few days.
RIVET is deep in this layer. We've invested heavily in our CLAUDE.md, our rules files, our skills. We're thoughtful about the balance between over-specification (bloated instructions that crowd out the actual task) and under-specification (vague guidance that the agent ignores). We're continuously iterating.
What we haven't extracted much value from yet: hooks. Hooks — automated triggers that fire on agent events like post-edit or post-tool-use — are technically a prompt engineering feature in most frameworks. But in practice, they start to feel like harness engineering: you're programming around the agent rather than to it.
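For a flavor of what that looks like in practice, here's a minimal hooks sketch in Claude Code's settings format; the event name and matcher follow its hooks schema as we understand it, and the script path is a hypothetical placeholder:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "./scripts/format-changed.sh"
          }
        ]
      }
    ]
  }
}
```

Every time the agent edits or writes a file, the formatter runs deterministically, with no prompt involved; that is exactly why it feels more like harness engineering than prompting.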
In practice, at RIVET, prompt engineering is almost entirely markdown optimization — writing and refining the CLAUDE.md, rules files, and skill definitions that shape agent behavior. The medium is text; the feedback loop is "edit a file, run a session, see what changed."
The ceiling on prompt engineering is real. You can write perfect instructions and still lose the thread on a complex feature if the agent doesn't have the right knowledge to act on them. That's where context engineering picks up.
Context Engineering
Context engineering is the discipline of managing what knowledge the agent has access to — and when.
A well-prompted agent with bad context will still fail. It'll make decisions that contradict architectural choices made three sprints ago. It'll re-derive patterns your team has already solved. It'll treat every session as day one, perpetually new, never compounding. Context engineering is the infrastructure that prevents that. It's the discipline that makes the agent's knowledge accumulate rather than reset.
At RIVET, we do this through PILRs — Persistent Indexed Learning Repositories. Three types, each with a different lifecycle:
- Type 1 (Ephemeral): Per-feature planning docs, test plans, decision notes. They live in temp/projects/, scoped to a single PR or sprint. They're working memory, not permanent record.
- Type 2 (Evergreen): Architecture docs, system design docs ("Deep Maps"), API contracts. They describe how the system works and why it works that way. These are the map the agent navigates with.
- Type 3 (Cumulative): Solved problems, incident patterns, cross-system context. Institutional memory. The layer that makes the agent behave less like a generic assistant and more like someone who's worked on your specific product for a year.
We're early-to-mid here. The pattern is right; the infrastructure isn't finished. Our Type 1 docs are solid. Our Type 2 Deep Maps are in progress. Our Type 3 knowledge base is nascent — it's growing, but it's not yet the compound-interest machine it could be. We're also starting to host shared repositories for these learnings to make them team artifacts rather than personal ones.
In practice, context engineering at RIVET is still heavily markdown-based — writing and organizing the PILR documents themselves — but the work extends beyond text. It includes workflow optimization (how and when context gets surfaced to the agent), and we're starting to invest in data infrastructure: databases, indexing systems, and shared hosting that make the knowledge layer a team resource rather than a collection of files on one developer's machine.
The ceiling on context engineering is higher than prompt engineering — a well-informed agent is dramatically more capable than a well-instructed one with gaps in its knowledge. But even a perfectly informed agent operating inside a single session, returning results to a human, has a ceiling. That ceiling is where harness engineering lives.
Harness Engineering
Harness engineering is what happens when you stop running the agent interactively and start building systems that run agents for you.
This is programming around the model. It's designing workflows where agents execute multi-step tasks autonomously, hand off outputs between phases, check their own work, and ship results — with humans reviewing outcomes rather than supervising every step. In practice, harness engineering includes everything from the inner two layers — the markdown, the knowledge infrastructure — plus writing code: building an agentic application using model provider SDKs, adding guardrails and deterministic steps where reasoning isn't needed (running a script is better than asking a model to re-derive the answer every time), and wiring the whole thing into your team's existing systems.
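A minimal sketch of that shape, assuming every name here is hypothetical and `model_fix` stands in for the actual agent invocation via a provider SDK:

```python
from dataclasses import dataclass, field

@dataclass
class HarnessResult:
    ok: bool
    notes: list[str] = field(default_factory=list)

def run_step(name: str, fn, result: HarnessResult) -> bool:
    """Run one pipeline step and record its outcome."""
    ok = fn()
    result.notes.append(f"{name}: {'ok' if ok else 'failed'}")
    return ok

def harness(ticket: str, model_fix) -> HarnessResult:
    """Deterministic guardrails wrapped around a single model call."""
    result = HarnessResult(ok=False)
    # Deterministic pre-check: reproducing the bug is a script, not a prompt.
    if not run_step("reproduce", lambda: True, result):  # stub: run repro script
        return result
    # The one step that needs reasoning: ask the agent for a fix.
    if not run_step("model_fix", lambda: model_fix(ticket), result):
        return result
    # Deterministic post-check: the test suite runs the same way every time.
    if not run_step("tests", lambda: True, result):  # stub: run test suite
        return result
    result.ok = True
    return result
```

The point of the structure is that only one step spends tokens; reproduction and testing are scripts the harness runs identically every time.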
The returns here are categorically different from prompt or context engineering — and not just in magnitude. Prompt and context engineering make your work faster. Harness engineering expands what work gets done at all.
Every engineering team has a long tail of valuable work — bug fixes, small features, UI polish, minor tweaks — that never makes it to the top of the sprint because higher-priority things keep landing. That work isn't unimportant; it's just unscheduled. A harness can collapse a full ticket — research, implementation, PR — into a flow that runs while you're in meetings, turning that backlog into throughput. The ceiling isn't +1x or even +3x. It's much larger.
RIVET is early in this layer, but we have one real case study: Odradek, our customer-reported bug resolution agent.
Odradek is built on the Claude Code SDK — a Mac desktop app with a cloud-hosted database for multiplayer support, where multiple engineers can see the queue, claim tickets, and review outputs. When a customer-reported bug comes in, Odradek:
- Investigates the issue — reading relevant source files, checking git history for related changes, consulting our PILR knowledge base for matching past patterns
- Fixes the issue — scoped, surgical edits to the relevant files
- Verifies its own work — regression checks, test runs
- Puts up a PR — with a human-readable description, ready for review
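The four steps above can be sketched as a phase pipeline with a single escalation path (names are hypothetical; this is the shape, not Odradek's actual code):

```python
from enum import Enum, auto

class Phase(Enum):
    INVESTIGATE = auto()
    FIX = auto()
    VERIFY = auto()
    OPEN_PR = auto()
    DONE = auto()
    NEEDS_HUMAN = auto()

def resolve(bug_id: str, handlers: dict) -> Phase:
    """Walk the phases in order; escalate to NEEDS_HUMAN on the first failure.
    A bug either flows all the way to a PR or lands with an engineer."""
    for phase in (Phase.INVESTIGATE, Phase.FIX, Phase.VERIFY, Phase.OPEN_PR):
        if not handlers[phase](bug_id):
            return Phase.NEEDS_HUMAN
    return Phase.DONE
```

The one-shot rate is then just the fraction of bugs that reach DONE without ever hitting NEEDS_HUMAN.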
One-shot resolution rate: 60%. The other 40% are harder issues — often involving config changes outside our codebase (think: fixing permissions on a GCP API token, or changing an environment variable in an external system). Those still require human judgment. But even at 60%, the impact on our bug backlog has been concrete:
- P1s (ship-within-a-week bugs): We used to sacrifice an engineer every sprint to a round-robin rotation dedicated solely to these. Odradek bought us back roughly half an engineer — the on-rotation dev stays on top of P1s more efficiently and has time left over for sprint work.
- P2s (fix-within-a-quarter bugs): These used to stack up for months before anyone could get to them. Now they get addressed as they come in.
- P3s (nice-to-fix bugs): These were effectively permanent backlog residents. Some of them are actually getting fixed now — work that would never have happened at our team size.
And Odradek today is still a manually-triggered, engineer-operated tool. We're treating it as the seed of something much more autonomous. Here's where we're taking it:
- Event-driven triggers — fire automatically when a bug is opened in HubSpot or GitHub Issues, not manually launched by an engineer
- Cloud-hosted dashboard — non-engineers (CS, product) can log in and see fix statuses without pinging a dev
- MS Teams integration — ask about a bug status, kick off an investigation, or request a fix directly from chat
- Parallel issue processing — right now Odradek works on one issue at a time against a single local copy of the codebase. Git worktrees (or isolated clones) would let it spin up multiple working copies and process several bugs concurrently, collapsing a queue into parallel throughput
- Ephemeral test environments — instead of just putting up a PR, Odradek spins up a temporary environment via k8s so CS and product can verify the fix themselves, without a developer deploying to a dev server
- Model routing — not every task needs the most capable (and most expensive) model. Investigation and triage might run on a smaller model or an open-source option like Qwen Coder, while the actual fix uses a frontier model. Routing tasks to the right model tier is how you keep API costs sustainable as the harness scales
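Routing can start as nothing more than a lookup table; the tier names and task labels below are illustrative placeholders, not our actual configuration:

```python
# Illustrative routing table: task kind -> model tier.
# Model names are placeholders, not recommendations.
ROUTES = {
    "triage": "small-local-model",       # e.g. an open-source coder model
    "investigate": "small-local-model",
    "fix": "frontier-model",             # the actual edit needs top capability
    "verify": "frontier-model",
}

def pick_model(task_kind: str) -> str:
    """Route a task to a model tier; unknown tasks default to the
    most capable tier rather than the cheapest."""
    return ROUTES.get(task_kind, "frontier-model")
```

Defaulting unknown tasks to the capable tier trades a little cost for safety; the cheap tier has to be earned per task type.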
Each of those steps removes an engineer from the loop on work that doesn't require engineering judgment. The same harness pattern extends to small features, tweaks, and polish items — exactly the long tail we described above. Longer-term, we're starting to explore whether a harness can move beyond fixes and into new feature development — prototyping and building, not just repairing. That's a harder problem with a different workflow, and we're very early, but it's the natural next frontier once bug resolution is reliable.
This doesn't mean engineers have less to do. Complex features, architectural decisions, and novel problem-solving still require human engineers — and always will. What changes is the mix of engineering work. Some of the time that used to go to routine feature dev shifts toward building and improving the harness itself. You're still engineering; you're just engineering at a higher leverage point. The goal isn't a faster engineer. It's a larger team.
The Velocity Curves
Here's the intuition made visual. Prompt and context engineering follow logarithmic curves — fast early returns that taper as you approach the ceiling. Harness engineering is different: it follows an S-curve, slow at first but accelerating dramatically in the mid-range before tapering at the top. The ceilings are different, our position on each curve is different, and the shape of the harness curve is why it rewards sustained investment differently than the other two.
A few things worth pointing out in these charts:
The ceilings are the point. Prompt engineering is real and valuable — we've captured most of what it has to give us, and it's made us meaningfully more effective. But it tops out around +1x additional velocity. Context engineering takes more sustained investment but tops out around +3x. Harness engineering requires the most investment — and the longest runway — but tops out around +10x. These aren't firm numbers; they're directional. The message is that the disciplines aren't interchangeable, and each outer layer demands more work but delivers categorically larger returns.
Crucially, the harness ceiling is higher because it measures something different: not just how fast your team works, but how much of your backlog actually gets addressed.
We're far along the prompt curve, early-mid on context, and very early on harness. Even Odradek in its current form — a first-generation agent harness that one-shots 60% of customer bugs — represents early returns from a curve that hasn't yet hit its steepest section. The S-curve shape means the most asymmetric returns are just ahead of us. That's where we're investing.
Where to invest your tokens
If you're early in your agentic journey, prompt engineering is the right starting point. The feedback loop is short, the skills are transferable, and you need a foundation before context or harness work pays off.
If you're mid-stage — comfortable with prompting, starting to feel the limits — context engineering is where the next returns are. Build the knowledge layer. Start with PILRs in the places where your agents are most confused or most repetitive. Index them so the agent can navigate selectively rather than loading everything at once. We wrote a deep dive on how we build and use PILRs if you want a practical starting point.
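To make "index them" concrete, a hypothetical top-level index might look like this; every path and title below is made up for illustration:

```markdown
# PILR Index
- type3/auth-token-refresh.md: solved twice; read before touching session code
- type2/billing-deep-map.md: read before any billing change
- temp/projects/sprint14-migration.md: ephemeral; delete after merge
```

The agent reads this small table of contents first and pulls in only the entries relevant to the task, instead of loading the whole knowledge base into context.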
If you're operating at a scale where prompt and context engineering are solid, and you're watching valuable work pile up in backlogs because your team is at capacity — that's the signal to start engineering a harness. Pick a narrow, repetitive workflow — bug triage, small fixes, polish items. Build around it. Measure the one-shot rate. The question isn't "did it make us faster?" It's "did work get done that wouldn't have happened otherwise?"
We're doing all three simultaneously, which means we're spreading investment across disciplines at different maturities. That's intentional — the layers are interdependent. A harness without good context engineering is just an autonomous agent that makes confident, uninformed decisions. Context without harness is a knowledge base that still requires a human to unlock every time.
The goal is a system where all three layers reinforce each other. And then — like any good compounding investment — you keep building.
References
- Chip Huyen, AI Engineering (O'Reilly, 2025) — comprehensive reference on the broader AI engineering discipline
- Advanced Agentic Coding & The Journey Towards 3x Product Development Velocity — our first post on agentic development practices at RIVET
- Context Engineering with PILRs — deep dive on how we build and use Persistent Indexed Learning Repositories at RIVET
- Claude Code SDK — the SDK we use to power Odradek