
How Code Agents Disrupted the Software Market
An experiment, not a forecast. This essay is the output of a multi-agent simulation, not an article I wrote by hand. Using MiroFish, I seeded a single premise (code agents becoming the primary operators of software) and spun up a parallel digital world of thousands of persona-driven agents, each with its own personality, long-term memory, and behavioural logic, then let them interact and socially evolve across a simulated 2026 to 2036. Every market dynamic, failure mode, and quoted “executive” below is synthetic: produced by those agents and compiled into this report by MiroFish’s ReportAgent. The swarm runs on OASIS (CAMEL-AI), with Claude (Opus) driving the agents through MiroFish’s Claude Code backend and a Zep memory graph holding each agent’s history. Read it as a rehearsal of one possible future, not a prediction. Originally published as an article on X under the title “Voice of the Future Customer: How Code Agents Disrupted the Software Market, 2026-2036.”
When code agents become the primary operators of digital services, selection pressure shifts from human interface preference to measurable delegated execution success, causing managed execution environments with built-in observability, simulation, and cross-customer learning to dominate thin APIs and traditional SaaS within a decade.
The Great Routing: How Agents Redirected Demand Away from Human-Centric Products
The routing transformation did not arrive as a single disruption event. It emerged through the accumulation of millions of individual agent decisions, each one small and locally rational, each one invisible to the software vendors on the receiving end, until the aggregate pattern became undeniable. Between 2026 and 2029, the software market underwent what simulation data tracks as a routing bifurcation — a clean divergence in growth trajectories between products that could be effectively operated by code agents and products that could not.
The Anatomy of an Agent Routing Decision
When code agents began operating on behalf of enterprise users at scale in 2026, they did not initially abandon familiar products. They tested them. Agents attached to enterprise workflows probed each product in their task environment with a consistent implicit checklist: Can I simulate this action before committing? Can I interpret the post-execution state without a human reading the UI confirmation screen? If something fails, can I recover without escalating? The results of these probes populated internal routing tables, and products that failed them were quietly deprioritized.
Simulation data from agent harness engineering teams revealed the precise hierarchy of these criteria. A lead agent harness engineer described the weighting his team discovered after eight months of production telemetry:
The routing logic we shipped in v2.4 of our harness had a product scoring function with six dimensions: discoverability, execution reliability, state observability, rollback availability, simulation support, and learning surface. We weighted them equally at first. After 8 months of production telemetry, we re-weighted. Observability ended up at 2.3x the weight of discoverability. Rollback was 1.9x. Simulation was 1.7x. Discoverability mattered less because agents could figure out what a product did — what they couldn’t tolerate was flying blind after execution. Products with no post-execution state feedback basically became radioactive in our routing table.
The preference hierarchy was not arbitrary. It reflected the asymmetric cost structure of autonomous operation: an agent that cannot observe what happened after executing an action enters what simulation data calls “silent execution ambiguity.” In 34% of these cases, agents took compensatory duplicate actions that caused downstream workflow corruption. The failure mode was catastrophic enough that agent operators began blacklisting ambiguity-prone products preemptively, even before explicit failures occurred.
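The re-weighted scoring function the engineer describes can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: it assumes each dimension is scored between 0 and 1, and that the dimensions with no quoted multiplier (execution reliability, learning surface) keep the same baseline weight as discoverability. The names `WEIGHTS` and `routing_score` are hypothetical.

```python
# Relative weights, with discoverability as the 1.0 baseline.
# Only the observability, rollback, and simulation multipliers come from the
# quoted telemetry; the remaining 1.0 values are assumptions for illustration.
WEIGHTS = {
    "discoverability": 1.0,
    "execution_reliability": 1.0,   # assumed
    "state_observability": 2.3,
    "rollback_availability": 1.9,
    "simulation_support": 1.7,
    "learning_surface": 1.0,        # assumed
}


def routing_score(scores: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores, each expected in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total = sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
    return total / sum(WEIGHTS.values())


# A product that is easy to discover but opaque after execution...
opaque = dict.fromkeys(WEIGHTS, 0.9) | {"state_observability": 0.1}
# ...versus one that is hard to discover but fully observable.
legible = dict.fromkeys(WEIGHTS, 0.9) | {"discoverability": 0.1}

print(routing_score(opaque) < routing_score(legible))
```

Under these assumed weights, losing observability costs a product more routing score than losing discoverability by the same amount, which is the asymmetry the telemetry pointed to.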
The Emergence of the Dead Zone
By 2027, a new vocabulary had entered enterprise software procurement. Products with high human satisfaction scores but low agent operability scores were classified internally by enterprise buyers as “Dead Zone” products — not defective in any traditional sense, but effectively inert from the perspective of delegated execution. An enterprise procurement director at a major financial services firm described the internal audit that crystallized this classification:
We ran an internal audit in early 2028 across our 140 SaaS subscriptions. For each one, we asked a simple question: can our attached code agent complete the top 10 workflows autonomously without a human confirming each step? For 94 out of 140 products, the answer was ‘no more than 3 out of 10.’ We started calling those the Dead Zone products. We didn’t cancel them immediately, but we froze new seat purchases and redirected agent compute budget to the 46 products that actually scored above 7 out of 10. Within 9 months, the Dead Zone vendors were calling us asking what had changed.
The Dead Zone dynamic exposed a paradox that traditional SaaS metrics had no vocabulary for: the Human Satisfaction Paradox. Products maintaining high NPS scores of 50 or above but low agent operability scores showed a counterintuitive pattern — human users stayed, but their agents’ repeated failures created productivity burdens that turned satisfied users into internal advocates for replacement. High human satisfaction became a lagging indicator of product risk rather than a signal of durability.
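The audit rule and the paradox it exposed can be expressed as a small classifier. This is a sketch under assumptions: the thresholds (3 or fewer of 10 workflows, more than 7 of 10, NPS of 50 or above) come from the text, while the `AuditResult` type, the label names, the middle band, and the example products are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    product: str
    autonomous_workflows: int  # of the top 10 workflows, how many ran without human confirmation
    nps: int                   # human Net Promoter Score


def classify(result: AuditResult) -> str:
    """Apply the thresholds quoted in the audit; the middle band is an assumption."""
    if result.autonomous_workflows <= 3:
        return "dead_zone"
    if result.autonomous_workflows > 7:
        return "funded"
    return "unclassified"  # the source does not describe products scoring 4 to 7


def satisfaction_paradox(result: AuditResult) -> bool:
    """High human satisfaction (NPS >= 50) combined with Dead Zone operability."""
    return result.nps >= 50 and classify(result) == "dead_zone"


audits = [
    AuditResult("kanban-tool", autonomous_workflows=2, nps=62),   # hypothetical products
    AuditResult("billing-env", autonomous_workflows=9, nps=41),
]
for a in audits:
    print(a.product, classify(a), satisfaction_paradox(a))
```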
Why the Signal Was Misread
The tragedy for many incumbent vendors was not the routing shift itself but the 12 to 18 months lost to misdiagnosis. When agent-related metrics began declining in 2026 and 2027, the dominant interpretation at traditional SaaS vendors was technical: API instability, integration problems, compatibility issues with specific agent harness versions. Engineering teams were assigned to “API stability improvements.”
The real problem was not stability but semantic richness. Agents needed meaning-bearing feedback — structured signals about what had changed, what state the system was now in, what recovery options existed. They did not need faster or more reliable transmission of outputs they could not interpret. A co-founder of a project management SaaS startup described the moment of recognition:
We built our product from 2021 to 2025 entirely for human teams. Beautiful Kanban UI, great mobile experience, deep notification system. By Q2 2027, inbound from new enterprise prospects had dropped 40% year over year. When we started asking why, every single pipeline deal we lost told us the same thing: their agent stack couldn’t reliably operate our product. Not that our product was bad — it was considered best-in-class for human users. But best-in-class for humans had become irrelevant when the buyer was an AI agent operating 50 workflows simultaneously. We had to do an emergency 9-month rebuild to expose an execution layer underneath our UI or we were going to lose the market.
The API-First Partial Victory and Its Limit
One important nuance in the routing data concerns the fate of API-first products. Initially, the simulation showed these products capturing agent routing share from UI-only SaaS — a sensible outcome, since machine-readable access was better than none. But by 2028, API-first products were themselves being routed around by a new category: managed execution environments that embedded not just access but context.
Raw APIs provided machine-readable endpoints without machine-readable meaning. Agents interacting with raw APIs still had to construct their own state models, interpret error semantics, build retry logic, and maintain their own execution history. The overhead was non-trivial, and when managed environments emerged that absorbed this burden natively, agent routing scores for raw APIs fell by a factor of 2.1 relative to their managed counterparts. The advantage of API-first over UI-only was real but temporary: a way station on the path to a more complete agent-friendly architecture rather than a destination.
The Velocity Asymmetry
What made the routing shift so damaging to incumbents was not its direction but its speed. Human users churned on 18-month average loyalty cycles, giving vendors time to adapt and retain. Agent harnesses made switching decisions within 3 to 4 months of a better-observable alternative appearing in the agent’s capability horizon scan. And unlike human churn, which was gradual and often reversible through relationship management, agent routing shifts were immediate and complete once decided: when a routing table updated to prefer a competitor, the volume shift happened in hours, not quarters.
An individual developer who adopted an attached coding agent in early 2026 captured this dynamic from the user side:
I’ve had a coding agent attached to my workflow since early 2026. At first I kept all my old SaaS subscriptions. Within 4 months the agent had essentially blacklisted two of them from its default tool selection — not because I told it to, but because it kept getting stuck on error states it couldn’t interpret or recover from without me stepping in. I only noticed because my own workload on those tools went up, not down. The agent had learned the hard way that those products weren’t worth delegating to.
By 2029, the financial divergence was stark. Products with human-centric UI optimization alone averaged 4% ARR growth. Products that had introduced agent execution layers averaged 23%. Agent-initiated software spend constituted 54% of new SaaS contract value in enterprise segments, up from 9% in 2025. The routing had become the market. What had begun as millions of small, invisible agent decisions had accumulated into a structural realignment of where software revenue would be won or lost across the following decade.
Product Architecture Divergence: Managed Execution Environments Pull Away from the Field
The divergence in product architectures that defined the software market between 2027 and 2030 did not emerge from a deliberate industry standard or coordinated platform decision. It emerged from a competitive sorting process in which agent task completion rates became the market’s primary ranking signal, and different architectural choices produced dramatically different completion rates. The result was not a gradual differentiation but a progressive separation into distinct tiers — tiers defined not by feature sets or pricing but by the depth of infrastructure built to support autonomous agent operation.
The Architectural Reclassification of 2028
By mid-2028, enterprise procurement teams had developed vocabulary precise enough to distinguish categories that the software industry had previously collapsed into a single “developer-friendly” bucket. Products offering MCP or CLI tool surfaces, previously celebrated as the vanguard of agent-compatible design, were reclassified by enterprise buyers from “agent-compatible” to “agent-accessible,” one tier below “agent-operable.” The distinction carried significant economic weight. Agent-accessible meant a code agent could invoke the product. Agent-operable meant a code agent could autonomously complete workflows on the product without human confirmation at every step.
The reclassification exposed a gap that API-first vendors had not anticipated. Programmatic access turned out to be necessary but insufficient. An agent that could call an endpoint but could not interpret the post-execution state, could not simulate before committing, and could not recover from failures without escalating to a human operator was not productively autonomous — it was a remote control that still required a human hand.
We made the mistake of thinking ‘agent-compatible’ meant ‘has an API.’ By the time we understood that managed execution was a different product category, we were already 18 months behind startups who had designed for it from day one.
This observation, from a VP of Product Strategy at a large incumbent SaaS vendor, crystallized the strategic error that defined the period. The incumbents had read “agent support” as a feature request. The emerging managed execution environment providers had read it as an architecture question.
The Four Primitives and Their Compounding Logic
The architectural standard that defined managed execution environments crystallized around four primitives by early 2029: invoke (with dry-run flag), observe (structured state diff), revert (atomic rollback), and teach (trace submission). Startups that internalized this framework from founding built products that were load-bearing for agent operations from their first release. Incumbents that attempted to retrofit these primitives onto human-optimized architectures encountered a fundamental problem: their underlying state models were built for human inference, not machine legibility.
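To make the four primitives concrete, here is a toy in-memory sketch. It is illustrative only: the essay names the primitives (invoke with a dry-run flag, observe as a structured state diff, revert as atomic rollback, teach as trace submission) but specifies no interface, so the class name, method signatures, and diff format below are all assumptions.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """In-memory stand-in for a managed execution environment (illustrative only)."""
    state: dict = field(default_factory=dict)
    _history: list = field(default_factory=list)   # snapshots kept for revert
    traces: list = field(default_factory=list)     # submissions received via teach

    def invoke(self, changes: dict, dry_run: bool = False) -> dict:
        """Apply changes, or with dry_run=True only report what would change."""
        before = copy.deepcopy(self.state)
        after = {**before, **changes}
        if not dry_run:
            self._history.append(before)
            self.state = after
        return self._diff(before, after)

    def observe(self) -> dict:
        """Structured diff between the previous snapshot and the current state."""
        before = self._history[-1] if self._history else {}
        return self._diff(before, self.state)

    def revert(self) -> None:
        """Restore the snapshot taken before the last committed invoke."""
        if self._history:
            self.state = self._history.pop()

    def teach(self, trace: dict) -> None:
        """Accept an execution trace for provider-side learning."""
        self.traces.append(trace)

    @staticmethod
    def _diff(before: dict, after: dict) -> dict:
        keys = before.keys() | after.keys()
        return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


env = ToyEnvironment(state={"status": "draft"})
preview = env.invoke({"status": "sent"}, dry_run=True)   # nothing committed yet
env.invoke({"status": "sent"})
changed = env.observe()
env.revert()
env.teach({"action": "send", "outcome": "reverted"})
```

In this sketch the dry-run preview and the post-commit observation return the same diff, which is the property that lets an agent check its expectation before committing and confirm it afterwards.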
A principal architect at a mid-market SaaS company that successfully completed the transition described the depth of the required change:
I can give you the full timeline. We started in Q2 2027 with what we thought was an easy migration: we’d add dry-run endpoints, expose state as structured JSON, and ship a rollback API. Six months of work. We shipped it, agents used it, results were mediocre. The problem was our state model was built for humans. A human looks at ‘task completed’ and infers 12 things from context. An agent needs all 12 things spelled out explicitly. So we went back and rebuilt the state model itself. That took another 9 months. After that, agent task completion rates on our platform went from 44% to 91%. The insight: you can’t bolt observability onto a product built for human inference. You have to make the state model machine-legible at its foundation.
The performance data from agent harness platforms confirmed the consequence of this distinction with stark precision. Products with native dry-run support showed 91% first-attempt success rates. Products with retrofitted dry-run showed 73%. Products with no dry-run showed 41%. The same tiered pattern held for rollback: native 88%, retrofitted 62%, absent 29%. For state observability: native 94%, retrofitted 68%, absent 37%. Across every dimension measured, native managed environment features outperformed retrofitted equivalents by 18 to 26 percentage points, a gap large enough to make the difference between an agent completing a workflow autonomously and requiring human intervention.
The Teach Primitive and the Learning Flywheel
Of the four primitives, the teach endpoint proved the most strategically consequential. Its presence or absence determined whether operational learning accumulated at the provider or remained dispersed across individual client environments. This was not obvious in 2027, when the teach primitive looked like a minor convenience feature for improving personalization. By 2029, it had become the architectural decision that separated products with durable competitive moats from products that could be displaced by feature-equivalent competitors.
Provider-side learning created a compounding dynamic that agent harness telemetry made visible with unusual precision:
The products with provider-side learning that fed back into the execution environment showed a 3.1% monthly improvement in task completion rates over the first 12 months. Products relying purely on client-side learning showed 1.2% monthly improvement. The compounding difference over 18 months was enormous — native managed environments didn’t just start ahead, they kept pulling away.
The 3.1% versus 1.2% monthly improvement rate created an environment moat distinct from traditional competitive moats. Feature moats could be copied in 6 to 18 months. Building equivalent provider-side learning required 18 to 36 months plus operational time to accumulate sufficient execution traces. Each month of delay added approximately 2.3% to the performance gap between a delayed incumbent and an early-mover native environment, because native environments continued compounding while incumbents were still constructing infrastructure.
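A short calculation shows how quickly the two improvement rates diverge. It is a sketch under two assumptions the source does not state: that the monthly figures are relative improvements that compound multiplicatively, and that both products start from the same baseline. The function name `compounded` is hypothetical.

```python
def compounded(monthly_rate: float, months: int) -> float:
    """Growth multiple after compounding a monthly relative improvement."""
    return (1.0 + monthly_rate) ** months


PROVIDER_SIDE = 0.031  # 3.1% monthly, from the telemetry quoted above
CLIENT_SIDE = 0.012    # 1.2% monthly

for months in (6, 12, 18):
    provider = compounded(PROVIDER_SIDE, months)
    client = compounded(CLIENT_SIDE, months)
    print(f"{months:>2} months: provider x{provider:.2f}, client x{client:.2f}, "
          f"relative lead {provider / client - 1:.0%}")
```

Read the multiples as an index rather than a literal completion rate, since a real completion rate cannot exceed 100%.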
The Separation in Market Outcomes
The architectural divergence translated directly into market allocation by 2029. Products that completed the transition to native managed execution environments held 67% of enterprise agent deployment budget despite representing only 23% of available products. The market had bifurcated: the top quartile of agent compatibility held 71% of enterprise agent compute budget, while the bottom three quartiles competed for the remaining 29%.
The investment calculus that drove this outcome was visible on both sides. Enterprise buyers who adopted managed execution environments early accepted what the simulation data calls a cold start penalty, 34% lower initial task performance compared to warm environments, in exchange for a projected 18-month payoff as provider-side learning accumulated for their specific workflow patterns. The acceptance of this front-loaded investment reflected a sophisticated understanding of the compounding dynamic: absorbing weaker performance at onboarding to access a learning environment that would improve over time was economically rational once buyers had observed the 3.1% monthly improvement curve from peers who had adopted earlier.
On the supply side, providers who made the architectural investment before 2028 saw customer acquisition costs fall by 40% in the following period. Agent routing preferences functioned as organic distribution: products that performed well in agent benchmarks received increased routing without marketing spend, because agent harnesses allocated workflows to products with the best task completion histories.
We built execution environment primitives as the core product before we built a single user-facing feature. Agents were our primary customer from day one. The incumbents thought we were building developer tools. We were building the new SaaS layer.
The infrastructure costs of this approach were front-loaded — providers spent 2.4x normal engineering investment in year one building managed environment infrastructure — but per-session costs fell 67% over 36 months as the learning environment matured. The cost curve rewarded early movers doubly: they reached profitability on a declining cost structure while competitors were still absorbing the initial investment.
What the Retrofitting Phase Revealed
The retrofitting phase of 2027 and 2028, in which 74% of incumbent SaaS vendors attempted to add agent-compatibility features to existing products, revealed the precise cost of architectural debt accumulated during the human-centric design era. The honest assessments that emerged from incumbent vendors in late 2028 were notable for their specificity:
When we finally did the honest analysis in late 2028, we found that 70% of our product surface was load-bearing for human workflows in ways that actively made agent operation harder. The human-optimized UX created irreversible state transitions everywhere. The notification system was designed for human attention, not machine polling. The permission model assumed a human could interpret ambiguous access prompts. Retrofitting all of that cost us 3x what building native would have. And we still ended up with a product that scored 60% of what native managed environments scored on agent compatibility benchmarks.
The 60% benchmark score for best-effort retrofits was the critical finding of the period. It represented a ceiling, not a floor. Incumbents who invested heavily in retrofitting reached 60% of native managed environment performance and could not meaningfully improve further without rebuilding the state model at the foundation level. This left a structural choice between accepting a permanent performance disadvantage and undertaking a second complete rebuild, this time under competitive pressure and while defending existing revenue.
The products that navigated this successfully treated the retrofitting phase as a diagnostic that identified which architectural components could be upgraded and which needed replacement. The products that treated it as a destination — assuming 60% compatibility was sufficient — found themselves progressively excluded from enterprise agent deployment budgets as the performance gap continued widening and procurement teams updated their scoring criteria quarterly to reflect the improving capabilities of native environments.
The Learning Moat: Cross-Customer Operational Intelligence as the Deepest Defensibility in Software History
The learning moat that emerged in the managed execution environment market between 2028 and 2033 was structurally unlike any competitive advantage previously documented in software history. Earlier software moats, whether network effects, proprietary data, or switching costs embedded in workflow dependencies, operated at human timescales and were partially penetrable through sustained investment. The cross-customer operational learning moat that accumulated in managed execution environments operated at machine timescales, compounded with every agent task execution, and became self-widening under conditions of customer growth, creating a barrier that grew faster than any new entrant could replicate it.
The mechanism began with what simulation data called the trace compound cycle: every agent task execution on a managed execution environment generated a structured record containing the input context, the sequence of tool calls, any intermediate state transitions, failures, rollbacks, and a labeled outcome. Individually, each trace was a modest artifact. Aggregated across thousands of enterprise customers running similar workflow categories, and indexed by product version, agent capability signature, and outcome quality, they became something qualitatively different: a predictive map of execution space that individual customers could never build from their own activity alone.
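The trace record and its indexing can be pictured with a small data structure. This is a sketch, not any provider's schema: the fields mirror the contents listed above (input context, tool calls, state transitions, failures, rollbacks, labeled outcome) and the three index keys, but every field name and the `TraceLibrary` class are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ExecutionTrace:
    """One agent task execution, with the fields listed in the text."""
    workflow: str                      # workflow category (hypothetical field)
    product_version: str
    agent_signature: str               # agent capability signature
    input_context: dict
    tool_calls: tuple[str, ...]
    state_transitions: tuple[str, ...] = ()
    failures: tuple[str, ...] = ()
    rollbacks: int = 0
    outcome: str = "success"           # labeled outcome


class TraceLibrary:
    """Cross-customer index keyed by product version, agent signature, and outcome."""

    def __init__(self) -> None:
        self._index: dict[tuple[str, str, str], list[ExecutionTrace]] = defaultdict(list)

    def add(self, trace: ExecutionTrace) -> None:
        key = (trace.product_version, trace.agent_signature, trace.outcome)
        self._index[key].append(trace)

    def lookup(self, product_version: str, agent_signature: str, outcome: str) -> list[ExecutionTrace]:
        # .get avoids creating empty buckets for keys that were never added.
        return list(self._index.get((product_version, agent_signature, outcome), []))


library = TraceLibrary()
library.add(ExecutionTrace("invoice-approval", "v3.2", "agent-a", {"currency": "EUR"},
                           tool_calls=("invoke", "observe")))
print(len(library.lookup("v3.2", "agent-a", "success")))
```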
Aria Chen, Chief Product Officer at a leading managed execution provider, described the compounding logic precisely:
The thing people miss is that the moat isn’t the data itself — it’s the operational context that makes the data interpretable. Every trace we collect has product version tags, environment context, agent capability signatures, and outcome labels. That context is what makes the learning transferable across customers. A raw database of API calls is worthless. A structured trace library with causal chains tagged to outcomes — that’s defensible. We’ve built 140 million tagged execution traces over 4 years. The tagging infrastructure is as valuable as the traces themselves, and it took us 18 months to build right. A new entrant doesn’t just need traces — they need the curation architecture.
This observation pointed to a subtlety that early competitive analyses had missed: the moat resided not in trace volume alone but in the semantic infrastructure that made traces usable for prediction. By 2031, a widely circulated enterprise research report formalized this as the Corpus Gap Problem: new entrants faced a moving target, because established providers continued accumulating at a faster absolute rate than any entrant could match from zero, and the gap was self-widening under conditions of sustained customer growth.
The simulation tracked this accumulation across the full decade. The earliest providers committed to systematic trace collection beginning in 2025 and 2026 as a research initiative without a clear monetization path. By 2027, enterprise buyers had begun expressing informal preferences for providers with demonstrable trace libraries. By 2028, trace library depth had become a formal evaluation criterion at 31% of large enterprise software buyers. By 2031, the three horizontal providers controlling the generalist managed execution market each maintained cross-customer trace libraries exceeding 100 million tagged execution records — a volume that simulation modeling projected would take a new entrant 6 to 9 years to replicate under favorable conditions, with the target continuing to move throughout that period.
The most commercially potent expression of the moat was what providers formalized as execution intelligence inheritance by 2029: new enterprise customers beginning operations on a managed execution environment were immediately endowed with prediction models derived from all prior customers’ relevant traces. Their first-day task completion rates matched what 18 to 24 months of independent learning would otherwise have produced. Priya Nair, a lead engineer at a major agent harness platform, described the operational impact from the agent side:
Before provider-side learning was mature, we were shipping execution heuristics client-side — retry logic, state interpretation parsers, error classification trees. We maintained them per-product, and they went stale constantly. After we integrated with providers offering managed execution with built-in learning, we could offload most of that to the provider environment. The provider’s prediction layer would tell our agent: ‘Based on 4 million executions of this workflow type, step 3 has a 23% failure rate if the input includes date fields with this format — pre-validate before execution.’ We didn’t have to discover that ourselves through failures. The compounding effect is real: our agents on platforms with mature learning environments complete first-attempt tasks at 88% success rates; on thin APIs without learning layers, first-attempt rates are around 51%.
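The kind of advisory the engineer quotes could, in principle, be derived by aggregating failure rates over past executions. The sketch below is a simplified illustration with synthetic data, not the provider's prediction layer. The record shape (`features`, `failed_step`), the function names, and the 20% alert threshold are all assumptions.

```python
from collections import Counter


def failure_rates(executions: list[dict], feature: str) -> dict[int, float]:
    """Per-step failure rate among executions whose input has the given feature.

    Each execution is a dict with "features" (a set of input tags) and
    "failed_step" (an int, or None when the run succeeded). This record
    shape is hypothetical.
    """
    relevant = [e for e in executions if feature in e["features"]]
    if not relevant:
        return {}
    failed = Counter(e["failed_step"] for e in relevant if e["failed_step"] is not None)
    return {step: count / len(relevant) for step, count in failed.items()}


def advisories(executions: list[dict], feature: str, threshold: float = 0.2) -> list[str]:
    """Turn high failure rates into pre-validation hints for the agent."""
    rates = failure_rates(executions, feature)
    return [
        f"step {step}: {rate:.0%} failure rate with '{feature}' inputs; pre-validate"
        for step, rate in sorted(rates.items())
        if rate >= threshold
    ]


# Synthetic data: 100 runs with a legacy date format, 23 of which fail at step 3.
runs = [{"features": {"legacy_date"}, "failed_step": 3}] * 23
runs += [{"features": {"legacy_date"}, "failed_step": None}] * 77
print(advisories(runs, "legacy_date"))
```

The point of the design is that the agent receives the hint before its first execution, rather than learning it from its own failures.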
One enterprise buyer described execution intelligence inheritance as “the most compelling vendor lock-in mechanism we’ve encountered that actually benefits the customer” — a formulation that captured a novel quality of this moat. Unlike traditional switching costs, which customers experienced as friction without compensating value, the learning dependency was experienced as a service. The relationship was genuinely mutual: the provider extracted durable competitive advantage, and the customer received measurably better agent performance in return.
This dynamic became more complex when confronted with enterprise governance requirements. The 2030 consent incident — in which a major managed execution provider was discovered using customer agent traces in shared model training without explicit opt-in — triggered what simulation data tracked as a pivotal governance restructuring of the market. Lena Hartmann, a privacy governance lead at a large enterprise buyer, described the aftermath:
The 2030 incident was a watershed for our industry. Before that, most enterprise buyers hadn’t thought carefully about trace governance. After that, every vendor we worked with received a data processing addendum specifically covering agent execution data. The key clauses we required: explicit opt-in for shared pool contribution, data residency guarantees for traces, the right to delete historical contributions, and an audit right to review what model training our traces participated in. Some vendors couldn’t satisfy all four. Those lost the procurement. The ones who had invested in consent architecture before the incident — they cleaned up commercially. It became a sorting event.
The consent incident did not destroy the learning moat — it restructured how it was delivered. Providers that had invested in consent architecture before the crisis saw enterprise contract win rates among regulated industry buyers improve by 41% in the immediate period following. They introduced sovereign learning tracks: private trace pools that contributed only to customer-specific models, with opt-in anonymized sharing for the shared pool. The performance comparison was instructive: customers on shared tracks reached 94% workflow automation rates at 48-month tenure; customers on sovereign tracks reached 89% — a gap that validated the shared pool advantage while also demonstrating that sovereign tracks were commercially viable.
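The split between sovereign and shared tracks, together with the opt-in, residency, and deletion clauses buyers demanded, can be sketched as a consent-aware trace store. This is a toy model under stated assumptions: the class and field names are hypothetical, anonymization is reduced to hiding the contributor, and the audit right is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentPolicy:
    """Per-customer trace governance settings (field names are hypothetical)."""
    shared_pool_opt_in: bool = False   # explicit opt-in; off by default
    region: str = "eu"                 # data residency for stored traces


@dataclass
class TracePools:
    private: dict = field(default_factory=dict)   # customer -> list of traces
    shared: list = field(default_factory=list)    # (contributor, trace) pairs

    def record(self, customer: str, trace: dict, policy: ConsentPolicy) -> None:
        stored = {**trace, "region": policy.region}
        # Every trace feeds the customer's own (sovereign) pool.
        self.private.setdefault(customer, []).append(stored)
        # Only opted-in customers contribute to the cross-customer pool.
        if policy.shared_pool_opt_in:
            self.shared.append((customer, stored))

    def shared_view(self) -> list:
        """Traces as seen by shared model training, without contributor identity."""
        return [trace for _, trace in self.shared]

    def delete_customer(self, customer: str) -> None:
        """Right to delete: remove private traces and past shared contributions."""
        self.private.pop(customer, None)
        self.shared = [(c, t) for c, t in self.shared if c != customer]


pools = TracePools()
pools.record("bank-a", {"workflow": "kyc", "outcome": "success"}, ConsentPolicy())
pools.record("retail-b", {"workflow": "refund", "outcome": "success"},
             ConsentPolicy(shared_pool_opt_in=True, region="us"))
print(len(pools.shared_view()))
```

Keeping the contributor alongside each shared trace, but out of the training view, is what makes deletion of historical contributions possible in this sketch.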
Marcus Webb, an enterprise procurement director, described the negotiation dynamic that emerged around learning contribution after 2030:
Here’s the thing no vendor wants to admit publicly: we figured out by 2030 that our agent activity was making their product smarter for everyone else. That’s not inherently bad — we benefit from everyone else’s traces too. But it created a new negotiation dynamic. We started asking: how much of our trace data goes into the shared pool? What’s our opt-out right? What do we get for contributing? The sophisticated vendors had answers. They’d built tiering systems where contributing customers got priority access to the freshest prediction models. The vendors who couldn’t answer those questions — who treated trace contribution as a default-on invisible process — those are the ones who lost our trust.
The business model structures that emerged to monetize the learning moat followed three converging paths. Execution performance tiers linked access to richer prediction models to higher pricing tiers. Learning contribution discounts gave customers who contributed more trace data pricing benefits, creating self-reinforcing enrollment dynamics in which the largest enterprise customers — who generated the most valuable edge-case traces — were most incentivized to participate, enriching the shared pool disproportionately with high-value data. The most commercially significant model was outcome-based pricing: by 2032, 34% of managed execution environment enterprise contracts included pricing components tied to measurable agent task completion improvement. These contracts showed 2.7x average contract value growth over two years, driven by workflow efficiency gains that providers could now attribute specifically to cross-customer learning predictions.
The structural opening for new entrants was identified early by vertical specialists, whose strategic logic was articulated by Samuel Torres, a startup founder building a managed execution environment for legal workflows:
The obvious answer is: you can’t compete on corpus size. So you compete on corpus relevance. We picked a single vertical — legal document workflows — and went deep. Every execution trace we collected was hyper-relevant to legal document processing. The big horizontal providers had broader corpora but shallower vertical coverage. A law firm running 500 agent tasks per day on legal workflows got better execution performance on our platform than on a general-purpose provider with 10x our total trace volume. Vertical relevance beats horizontal volume for specialized workflows. That’s the strategic opening for startups: the learning moat of a generalist provider is wide but thin. A vertical specialist’s moat is narrow but deep — and for customers with dense workflow concentration in one domain, deep beats wide.
Simulation data confirmed this trajectory. Vertical specialists in legal, financial services operations, and healthcare administration that focused on single domains reached minimum defensible corpus depth between 14 and 22 months after founding. At this threshold, edge-case prediction accuracy in their target vertical exceeded generalist platform performance, and competitive replication time estimates reached three or more years. The moat was narrow by design, but for domain-concentrated customers, its depth was directly comparable to what horizontal providers offered at vastly greater scale.
The long-run consequence, visible by 2033, was a market consolidation pattern consistent with moat mechanics. The top three horizontal managed execution providers controlled 67% of enterprise managed execution contract value. Forty or more vertical specialists served the remainder of the market through concentrated domain depth. The middle tier — providers that had invested partially in managed execution capabilities without reaching defensible corpus depth — showed the highest attrition rates, caught between generalist scale and vertical relevance without the advantages of either.
Traditional SaaS vendors that had not successfully transitioned to managed execution found themselves in a structurally awkward position that their own metrics could not diagnose. Dominic Osei, VP of Product Strategy at a major incumbent, identified the root asset mismatch that caused the delay:
We had rich user behavior data — clickstreams, feature usage, session duration, A/B test results. We thought that was a data moat. It was not the right kind of data. Agent execution traces are a completely different asset class. Our user behavior data told us how humans navigated our product. Agent execution traces tell you the causal sequence of tool calls, the failure modes, the optimal retry strategies, the state transitions that succeed versus fail under specific conditions. We had none of that because agents weren’t operating our product at scale until 2027. By then, pure-play managed execution startups had been accumulating traces since 2025. They had a 2-year head start in the new data category that actually mattered. Our response was acquisitions — we bought two startups with trace accumulation capabilities in 2029. The integration cost us 14 months. During those 14 months, the moat widened.
Beyond the traditional dimensions of competitive advantage, what distinguished the learning moat was how it transformed the cost of leaving. By 2032, enterprise customers who had operated within a managed execution environment for 36 or more months had developed trace dependency: their internal agent optimization strategies had been calibrated around provider-side prediction signals to such a degree that migrating to a competing environment would mean rebuilding 18 to 24 months of domain-specific agent behavior. This was distinct from contract lock-in, workflow migration costs, or data portability requirements. It was a cognitive recalibration cost: agents trained to rely on a specific provider’s prediction infrastructure had to relearn failure modes, edge cases, and optimal execution sequences from scratch on a new platform. The customer remained free to leave at any moment. What they could not take with them was the accumulated intelligence that made their agents effective.
The final lesson the simulation produced about the learning moat was one of temporal asymmetry. Lena Hartmann’s governance team, after testing sovereign learning tracks extensively, offered an observation that cut against the dominant narrative of provider-side advantage:
The shared pool is most valuable in early deployment when you have no trace history at all. Once you have 18-24 months of domain-specific accumulation, the returns from cross-customer learning diminish for your specific use cases. The moat is real — but it’s most powerful at the start of a customer relationship, not the end.
This temporal structure — maximum dependency at the start, compound value through retention, and soft lock-in deepening with tenure — produced a business model with unusual properties. The learning moat simultaneously lowered customer acquisition friction (immediate performance inheritance reduced the cost of switching from a worse provider) and raised customer exit friction (accumulated trace dependency raised the cost of leaving). In software history, acquisition ease and retention strength had typically been in tension. The managed execution learning environment resolved that tension structurally, making the learning moat not merely the deepest competitive advantage in the software market of its era, but one of the rare business constructions in which the mechanism of gaining customers and the mechanism of keeping them were the same.
Transition Risks, Failure Modes, and the Strategic Local Optima for Incumbents and Startups
The transition to managed execution environments generated as many wreckage sites as it did success stories. What made the period between 2027 and 2033 particularly instructive was that most failures did not result from a lack of awareness that the market was changing. The incumbents who lost ground understood the shift intellectually. The startups that collapsed had often built technically sound products. The distinguishing factor was whether companies understood what the transition actually required versus what it appeared to require — a distinction that proved far more consequential than product quality or market timing alone.
The Checkbox Fallacy and the Compatibility Layer Trap
The most prevalent failure mode among incumbents was neither strategic paralysis nor competitive denial. It was the systematic underestimation of the architectural depth required, a pattern that simulation data tracked across 43% of incumbent SaaS vendors who attempted the agent compatibility transition between 2027 and 2029. Nearly half chose a compatibility layer approach: a translation surface placed over existing architecture, designed to present programmatic entry points without disturbing the underlying product.
Marcus Chen, a former VP of Product at a major SaaS incumbent turned enterprise software strategist, named this pattern precisely:
The single biggest mistake incumbents made was what I call the ‘checkbox fallacy.’ When the board asked ‘are we agent-compatible?’ the product team would commission a 6-week sprint to add MCP endpoints and check the box. Yes, you now had programmatic access. No, you had not built anything that agents could actually rely on. The product was still fundamentally designed for a human reading confirmation dialogs and making contextual judgments. The dirty secret was that the retrofitting projects almost always underdelivered because the engineering teams were patching the symptom, not the disease. The disease was a state model designed for human inference.
The economic consequences of the compatibility layer choice were documented with unusual precision by simulation tracking. Vendors who adopted compatibility layers experienced market share erosion at two to three times the rate of vendors who committed to full architectural rebuilds. The mechanism was direct: agents testing these products encountered state feedback that was syntactically structured but semantically thin, rollback systems that existed in documentation but not in the actual transaction model, and simulation endpoints that returned responses without actually modeling execution consequences. Task completion rates on compatibility-layer products remained in the 38-47% range, against 85-93% for natively rebuilt environments.
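The difference between a simulation endpoint that only returns a response and one that models execution consequences can be shown in a few lines. What follows is a minimal Python sketch, assuming a toy key-value state and a single hypothetical transfer operation. The names `ExecutionEnvironment`, `simulate`, `execute`, and `rollback` are illustrative assumptions, not any vendor's API or anything specified by the simulation.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Result:
    ok: bool
    state_diff: dict  # key -> (value_before, value_after)
    error: str = ""


@dataclass
class ExecutionEnvironment:
    """Toy managed execution environment over a dict of numeric balances."""
    state: dict = field(default_factory=dict)
    _snapshots: list = field(default_factory=list, init=False, repr=False)

    def _apply(self, state: dict, op: dict) -> Result:
        # Hypothetical operation shape: {"src": ..., "dst": ..., "amount": ...}
        src, dst, amount = op["src"], op["dst"], op["amount"]
        if src == dst:
            return Result(False, {}, "src and dst must differ")
        if state.get(src, 0) < amount:
            return Result(False, {}, f"insufficient balance in {src}")
        before = {src: state.get(src, 0), dst: state.get(dst, 0)}
        state[src] = before[src] - amount
        state[dst] = before[dst] + amount
        return Result(True, {k: (before[k], state[k]) for k in before})

    def simulate(self, op: dict) -> Result:
        # Run the real transition logic against a copy, so the agent sees the
        # actual consequences without committing them.
        return self._apply(copy.deepcopy(self.state), op)

    def execute(self, op: dict) -> Result:
        # Snapshot first, so rollback is part of the transaction model itself.
        self._snapshots.append(copy.deepcopy(self.state))
        result = self._apply(self.state, op)
        if not result.ok:
            self._snapshots.pop()  # nothing changed, nothing to undo
        return result

    def rollback(self) -> bool:
        if not self._snapshots:
            return False
        self.state = self._snapshots.pop()
        return True


env = ExecutionEnvironment(state={"ops": 100, "legal": 0})
op = {"src": "ops", "dst": "legal", "amount": 30}
preview = env.simulate(op)   # reports the diff; env.state is untouched
env.execute(op)              # commits the same transition
env.rollback()               # restores the pre-execution snapshot
```

The design point is that `simulate` and `execute` share one transition function. A compatibility layer of the kind described above would typically expose the same three entry points while answering `simulate` from a canned response, which is what makes its feedback syntactically structured but semantically thin.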
Kenji Watanabe, CTO of a midmarket SaaS company that went through both paths in sequence, quantified the cost of the choice with exceptional specificity:
In 2027, we had two competing proposals on the table: one to build a proper managed execution layer from the ground up, estimated at 14 months and $8 million in engineering cost, and one to ship a compatibility adapter — basically a translation layer over our existing APIs — estimated at 3 months and $900K. The board chose the adapter. The adapter shipped on time. It performed poorly. Agents using it had a 41% task completion rate versus 87% on our two main competitors who had gone the full rebuild route. We lost two enterprise RFPs in Q3 2028 entirely on that metric. The board reversed the decision and funded the full rebuild in Q4 2028. We finished it in Q2 2030. By then we had lost 23% of our enterprise seat count. The 14 months and $8 million we avoided in 2027 ended up costing us 23 months and an estimated $31 million in lost revenue.
Watanabe’s company survived because its domain position was strong enough to absorb the delay cost and fund the recovery. For incumbents with weaker customer lock-in or thinner margins, the same decision pattern produced terminal outcomes by 2031.
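The figures in Watanabe's account can be set side by side to show the size of the gap. This is a rough Python sketch rather than a financial model: the three constants are the simulated numbers he quotes, while the assumptions that the later rebuild cost the same as the 2027 estimate and that the rebuild-first path would have lost no revenue are mine, not the source's.

```python
# Figures in thousands of USD, as quoted in Watanabe's account (synthetic data).
ADAPTER_COST = 900
REBUILD_ESTIMATE = 8_000
LOST_REVENUE = 31_000


def path_cost(adapter_first: bool, later_rebuild_cost: int = REBUILD_ESTIMATE) -> int:
    """Total cost of each decision path.

    Assumptions (not stated in the source): the rebuild funded in Q4 2028 cost
    the same as the 2027 estimate, and rebuilding first would have avoided the
    lost revenue entirely.
    """
    if not adapter_first:
        return REBUILD_ESTIMATE
    return ADAPTER_COST + later_rebuild_cost + LOST_REVENUE


rebuild_first_total = path_cost(adapter_first=False)
adapter_first_total = path_cost(adapter_first=True)
cost_multiple = adapter_first_total / rebuild_first_total

print(f"rebuild first: ${rebuild_first_total:,}K")
print(f"adapter first: ${adapter_first_total:,}K ({cost_multiple:.1f}x)")
```

Under those assumptions the adapter-first path costs roughly five times the rebuild it was meant to avoid, and the lost revenue dominates the total. The $900K adapter itself is close to a rounding error.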
The Feature Parity Mirage and Trace Library Gap
Among startup failure modes, the most common was structurally opposite to the incumbent checkbox fallacy — but equally lethal. Where incumbents underinvested in architectural depth, startups frequently overinvested in feature coverage while underinvesting in operational learning accumulation. The simulation tracked this as the feature parity mirage: founding teams who built products with excellent managed execution primitives but then spent critical early runway closing feature gaps with established incumbents, leaving trace library construction underresourced until too late.
Priya Nair, founder of an agent harness startup and a close observer of the cohort that launched between 2025 and 2027, described the pattern:
The startup failure mode I watched play out over and over was the feature parity mirage. A founding team would build a product with full managed execution primitives — great simulation, great rollback, structured state everywhere — and then spend 18 months building features to close the gap with the incumbent they were displacing. By month 18, they had feature parity and a better agent interface. But the incumbent had 3 years of customer data and operational traces. The startup’s agents were still learning from scratch what the incumbent’s customers had already solved. The teams that survived were the ones who realized early that feature parity was the wrong finish line. You needed a learning accumulation strategy from day one.
By 2029, enterprise buyers had formalized trace library depth as an evaluation criterion, a development that post-mortems from failed startups consistently named as the event that invalidated their competitive models. A post-mortem analysis released publicly by one collapsed managed execution startup attributed failures across the 2029-2031 cohort to three causes: underestimating trace library requirements as a structural barrier (41% of failures), mistiming market readiness by entering before enterprise governance frameworks existed (28%), and misreading agent harness vendor leverage as a partnership opportunity rather than a competitive threat (31%).
That third cause — the agent harness misread — was the subtlest and most counterintuitive failure mode documented in the simulation. Startups building vertical managed execution environments frequently pursued integration agreements with agent harness vendors under the assumption that harness platform routing would generate customer acquisition. What they underestimated was that harness vendors were simultaneously building proprietary routing intelligence that aggregated cross-vendor observability data — creating a “meta-moat” that sat above individual product environments. By 2031, the five largest agent harness platforms controlled routing logic for 67% of enterprise agent compute budget, creating what smaller providers described as “existential toll booth dynamics.” Startups that had designed their go-to-market around harness routing discovered that the harness had become a gatekeeper with pricing leverage over both providers and buyers simultaneously.
The Vertical Depth Ceiling
One of the more nuanced failure modes documented in the simulation was not a true failure but a strategic ceiling — a local optimum that felt like success until its limits became apparent. Vertical-focused startups that committed early to domain-specific trace accumulation achieved extraordinary agent task completion rates in their niches, often exceeding 90%, and built defensible positions that horizontal competitors found difficult to displace. The trap was that the same specificity that made their trace libraries powerful also made them non-transferable.
Sarah Kim, Managing Partner at Frontier AI Ventures, described the pattern she observed across her portfolio:
The local optima trap I saw most often was the ‘vertical depth vs. horizontal breadth’ dilemma. Startups building vertical-specific managed execution environments — say, specifically for pharmaceutical regulatory submissions — could achieve extraordinary agent task completion rates in their domain, 90-plus percent in some cases, because their trace libraries were deeply specialized. But they’d hit a ceiling at around $20-40 million ARR and struggle to expand horizontally. The operational knowledge that made them great at pharma submissions was almost entirely non-transferable to adjacent use cases. The startups that scaled past $100M ARR were the ones who picked a vertical initially for the trace accumulation advantage, but invested simultaneously in abstraction layers that would let them generalize.
This vertical-depth ceiling manifested as a barbell market structure by 2033. Dominant horizontal managed execution providers occupied one end, with trace libraries exceeding 100 million tagged execution records and cross-domain generalization capability. Highly defensible vertical specialists occupied the other end, commanding premium pricing within their domains but facing structural limits on addressable market. The hollowed-out middle — where undifferentiated API-first products had once clustered — had nearly emptied by 2031, with survivors having migrated to one pole or the other.
The Three Survival Paths for Incumbents
The simulation identified three viable exit trajectories for incumbents that did not complete the architectural transition before market consolidation hardened in 2030-2031. These were not equal outcomes, but each represented a local optimum accessible under specific conditions.
The first was acquisition. Managed execution leaders acquired incumbents between 2030 and 2032 with a counterintuitive motivation: they wanted customer relationships, domain expertise, and in some cases the engineering teams who held operational knowledge — not the product architectures. The “acqui-hire moat” phenomenon emerged as established providers began treating acquired teams as shortcut paths to building domain trace libraries that would otherwise require years of organic accumulation. Incumbents with strong customer loyalty in specific verticals but weak execution environments became attractive acquisition targets precisely because their customer relationships provided the initial trace consent pool that organic customer acquisition could not quickly replicate.
The second survival path was deep entrenchment in regulated niches. Healthcare, financial services, and government procurement maintained human oversight requirements that reduced the agent operability premium as a procurement criterion through at least 2033. In these environments, compliance audit trails, regulatory reporting, and mandated human review of agent-executed tasks created requirements that neither pure managed execution environments nor traditional SaaS products could satisfy without significant governance architecture. Incumbents who reoriented their development investment toward compliance-native features, rather than execution environment rebuilds, found sustainable positions in these regulated segments while the broader market consolidated around them.
David Okafor, Chief Risk Officer at a large European financial institution, described the buyer-side experience that opened the third path:
From the enterprise buyer side, the failure mode we most feared was what our team called ‘agent lock-in theater.’ Some vendors would market managed execution capabilities but design them in ways that made trace data non-portable and switching economically prohibitive. We had one vendor where the rollback system worked beautifully, but all the operational context was stored in a proprietary format tied to their harness. When we wanted to evaluate alternatives, we found we couldn’t bring our execution history with us. Three years of learning, captured in a format only they could read.
Okafor’s observation points to the third survival path: governance-premium positioning. Regulated enterprise buyers had themselves reshaped the market by developing trace portability requirements by 2029, after experiencing operational lock-in with first-generation managed execution providers. Incumbents who positioned themselves as compliance-native, governance-premium alternatives, with audit trail preservation, cross-agent verification, and open trace portability built in as core features, found that this positioning attracted the 30-40% price premium that enterprise buyers in regulated industries were willing to pay.
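Trace portability of the kind these buyers demanded amounts to being able to write execution history in a documented, self-describing format and read it back elsewhere. Here is a minimal Python sketch of that round trip using JSON Lines. The `TraceRecord` fields and the `SCHEMA_VERSION` tag are hypothetical, chosen for illustration, since the simulation does not specify a trace schema.

```python
import json
from dataclasses import asdict, dataclass

SCHEMA_VERSION = "1"  # hypothetical version tag carried on every exported record


@dataclass
class TraceRecord:
    # Illustrative fields only; a real trace schema would carry far more context.
    task_id: str
    tool: str
    arguments: dict
    outcome: str  # e.g. "success", "failure", "rolled_back"
    retries: int


def export_traces(records: list) -> str:
    """Serialize records as JSON Lines: one self-describing object per line."""
    return "\n".join(
        json.dumps({"schema": SCHEMA_VERSION, **asdict(r)}, sort_keys=True)
        for r in records
    )


def import_traces(text: str) -> list:
    """Rebuild records from an export, rejecting unknown schema versions."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        if row.pop("schema", None) != SCHEMA_VERSION:
            raise ValueError("unsupported trace schema")
        records.append(TraceRecord(**row))
    return records


history = [
    TraceRecord("t-1", "file_contract", {"doc": "nda.pdf"}, "success", 0),
    TraceRecord("t-2", "file_contract", {"doc": "msa.pdf"}, "rolled_back", 2),
]
exported = export_traces(history)
restored = import_traces(exported)  # equal to history after the round trip
```

The explicit schema tag is the part that matters for portability: a receiving platform can refuse or migrate records it does not understand instead of silently misreading them, which is the opposite of a format only the original vendor can read.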
The Governance Premium as Startup Wedge
For startups entering the market after 2028, when horizontal managed execution incumbents had already accumulated multi-year trace library advantages, the governance premium became the primary viable differentiation axis. Enterprise buyers who had experienced first-generation managed execution environments with opaque learning systems, non-portable trace formats, and cross-customer learning terms written without customer consent were actively seeking alternatives that embedded governance into the architecture rather than treating it as a compliance checkbox.
Startups that understood this positioned their trace portability commitments, consent-layer architecture, and regulatory audit capabilities as product features rather than legal disclaimers. The governance premium they commanded — 30-40% above technically equivalent environments without governance features — was large enough to fund continued trace accumulation even against competitors with larger libraries, because it attracted precisely the enterprise customers whose complex workflows generated the highest-quality traces for continued model refinement.
Aisha Mohammed, Product Lead at a vertical AI compliance tool startup, articulated the timing dimension that made even this path treacherous:
The failure mode that killed several companies I watched closely was timing misalignment between the product readiness curve and the market adoption curve. You can build a perfect managed execution environment and then release it 18 months before enterprise buyers have governance frameworks to evaluate it. Your early customers don’t know how to run the evaluation, procurement doesn’t have the scoring rubrics, and you end up burning runway doing education instead of closing deals. The teams that survived were the ones who spent the first 12 months selling to the 5% of enterprise buyers who were already running agent-native pilots — the ones who had their own internal benchmarks and didn’t need educating. That cohort became reference customers who could speak the language of the remaining 95% once the mainstream market caught up.
The strongest incentive structure in the simulated decade pushed software providers relentlessly toward memory-bearing, learning-accumulating execution environments. But the path to that destination was neither uniform nor forgiving. The incumbents who survived transformed themselves, were acquired, or found regulated niches where the rules temporarily held. The startups that endured were those who understood that building a managed execution environment was only the beginning — that the durable advantage lay not in the product architecture but in the operational intelligence that architecture was designed to accumulate. Products became less important than the learned context they held. And that context, once accumulated at scale, proved to be among the most durable competitive assets in software market history.