What AI Gets Wrong About Knowledge, Time, and Experience

The economic models predicting AI-driven job losses share a common flaw: they treat human labor as a fixed, fungible input. It is not. And that error has real consequences.

Every consumer economy runs on a loop. Industry creates a market. The market attracts buyers. Buyers need income. They trade their time and talent to businesses in exchange for wages. Those wages become consumption, which generates demand, which creates more jobs, and which sustains the loop. It is a self-reinforcing system, elegant in its circularity and remarkably durable across two centuries of industrialization.

AI threatens to break that loop. Not because it automates a task here or a job category there, but because it targets the two fundamental levers of human labor that keep the loop spinning: knowledge and time. If a system can process information faster than a human analyst and access a broader body of facts than any MBA cohort, the human’s remaining function becomes genuinely unclear. The question worth asking is where human capital goes next.

I want to challenge the framing most economists bring to this question and argue that both AI’s capabilities and its limitations are being systematically misread.

I. The problem with how economists think about labor

The dominant economic framework for analyzing automation treats work as a collection of separable tasks. Machines take over certain tasks; humans retain others or migrate to new ones. The underlying assumption is that demand for labor, while it may shift, ultimately regenerates. New industries emerge, new roles appear, and the loop continues.^[1]

This task-based model has real explanatory power, but it rests on an assumption that AI now makes fragile: that technology creates new human tasks at a pace and scale comparable to what it displaces. Acemoglu and Johnson, in their 2023 book Power and Progress, argue that AI as currently deployed is heavily biased toward automating labor without generating equivalent new categories of work. This represents a break from the historical pattern that previously kept wage growth and automation in rough balance.^[2]

More critically, the framework treats knowledge and time as finite, fungible, and measurable inputs. They are not.

II. What is knowledge, actually?

The standard definition covers facts, information, and skills acquired through experience or education. That definition is technically correct and practically insufficient. If knowledge were simply a well-organized archive of facts, then every MBA graduate from a top program would produce identical strategic outcomes regardless of geography, culture, or context. We know that is not true. A product strategy that works in suburban Ohio fails in São Paulo. A go-to-market motion that closes enterprise deals in Singapore requires fundamental rethinking for Munich. The knowledge that matters is contextual, relational, and socially embedded.^[3]

AI is trained on facts and information. It is extraordinarily good at retrieval, synthesis, and pattern-matching within its training distribution. But consider this scenario: if windshield wipers had never existed, would an AI system, given the problem of driving safely in heavy rain, invent them?

“The emotional need that drove Mary Anderson to patent the windshield wiper in 1903 was not a knowledge gap. It was a friction between lived sensory experience and a system that had no solution.”

Technically, the AI would face what we might call a cosine similarity problem. Asked how to keep a car’s windshield clear in rain, the model searches its embedding space for the nearest known solution. Without the concept of a wiper in its training data, the nearest neighbor is likely “do not drive in heavy rain,” or perhaps a robotic arm mounted externally. Both answers are impractical, dangerous, and beside the point. The correct answer requires not just lateral thinking but a kind of embodied frustration with an inadequate status quo. It requires the capacity to feel a problem before conceptualizing a solution.

This distinction between knowledge and task is fundamental. Even recursive self-improvement in AI systems operates within the bounds of the task being optimized. The feedback loops that improve a model’s performance at chess do not spontaneously generate insight about urban planning. Improvement is bounded by the objective function. The assumption that connecting disparate knowledge sources through recursion yields genuinely novel insight is one of the more significant overestimates in current AI discourse.

III. Time is not just speed

The second lever is time. AI’s most unambiguous advantage is speed: it can query multiple data sources simultaneously, identify patterns across vast corpora, and return synthesized recommendations in seconds. This is genuinely valuable. Speed toward the wrong outcome, however, is not progress. It is efficient failure.

The implicit claim in most AI-and-labor analysis is that faster information processing translates directly into better decisions and greater value creation. That claim conflates throughput with judgment. A system that processes 10,000 market signals per minute still requires someone who understands which signals matter, what the organization is capable of acting on, and what the customer actually cares about. It still requires someone who can channel the output of accelerated tasks toward a tangible, impactful outcome.

I am not arguing that AI cannot improve decision-making. It clearly can, and it will. The argument is that speed without directionality produces noise at scale. The human function in this new architecture is not to perform the tasks AI handles more efficiently. It is to set the direction, interpret the output, and bear responsibility for the consequences. That is a fundamentally different function from the one most economic models are measuring.^[1],[2]

There is also a category of jobs that genuinely should not exist: roles that process information, generate reports, and relay recommendations without creating any discernible value. AI eliminating those roles is not a crisis. It is a correction. The crisis would come from conflating the elimination of low-value roles with the end of meaningful human work.

IV. The hard problem of experience

The third dimension is experience, and here the gap between human and machine capability is widest and least well understood.

The standard definition of experience covers practical contact with and observation of facts or events. That definition is reductive. Experience is not just observation. It is embodied, emotionally inflected, and socially interpreted. When a nurse reads a patient’s affect and adjusts her communication, she draws on years of pattern recognition that includes facial micro-expressions, vocal tone, and the accumulated weight of having sat with frightened people before. No sensor array currently captures all of that. No training corpus represents it fully.

Recent mathematical work has begun to formalize the emotional dimension of experience. Ambrosio (2020) proposes treating emotional phenomena as analogous to electromagnetic waves, allowing for quantitative modeling of intensity and qualitative modeling of feeling states.^[4] It is a genuinely novel approach. The paper itself acknowledges that our instruments cannot yet directly detect or record emotional perception. The mathematical model does not account for sensory, somatic information: the data that arrives through the body before the mind has processed it.

Experience, properly understood, is not a knowledge store. It is a calibration system. It tells you not just what you know, but how much to weight what you know in a given moment, with these people, in this context. That calibration is not currently learnable from a body of text, images, and video alone.

V. So where does human capital go?

The economic loop described at the start of this piece does not break because AI exists. It breaks if we fail to find new ways to inject human agency, judgment, and creativity into the loop at points where they generate compounding value.

Three conclusions follow from this analysis. First, the human roles that survive and grow will be those that require exactly what AI cannot replicate: the ability to feel a problem, to read a room, and to channel accelerated outputs toward outcomes that serve real human needs. Second, the economic distribution question of who captures the value from AI productivity gains becomes the defining political challenge of this decade. Acemoglu and Johnson are right that the productivity gains from historical technology waves required countervailing labor power to ensure workers shared in those gains. That countervailing power is currently weak.^[2] Third, the danger of learned helplessness is real. If AI handles enough of the cognitive scaffolding through which people develop expertise, we risk producing a generation that is fluent at prompting but thin on judgment. That is exactly backwards from what the next economy requires.

The question is not whether AI will take jobs. It will, unevenly, with significant transitional pain across many sectors. The better question is whether we are building an economy in which the things humans do distinctively, including feeling, connecting, inventing from frustration, and bearing responsibility, remain economically valued. That is a design question, not a technology question. And right now, we are not designing for it.

References

[1] Acemoglu, D. and Restrepo, P. (2018). The race between man and machine. American Economic Review, 108(6), 1488–1542.

[2] Acemoglu, D. and Johnson, S. (2023). Power and Progress: Our Thousand-Year Struggle Over Technology and Prosperity. PublicAffairs. Also: Acemoglu, D. and Restrepo, P. (2019). Automation and new tasks. Journal of Economic Perspectives, 33(2), 3–30.

[3] Susskind, D. (2020). A World Without Work: Technology, Automation, and How We Should Respond. Metropolitan Books.

[4] Ambrosio, B. (2020). Beyond the brain: Towards a mathematical modeling of emotions. arXiv:2009.04216.

Foundation First, Not AI First

The Patterns That Built the Internet Will Build the Agentic Future

Every pitch deck in 2026 leads with “AI First.” Every product strategy document genuflects to the altar of large language models before addressing anything else. Every engineering roadmap treats AI integration as the foundational decision from which all other decisions flow.

This is backwards. And two decades of distributed systems engineering already proved why.

Claude can build you a beautiful application in minutes. But if that application lacks circuit breakers, observability, state management, and fault isolation, it will collapse the moment it meets production traffic. The model is not the product. The foundation is the product. The model is a component.

The Seduction of “AI First”

“AI First” as a strategy sounds compelling because it promises differentiation. It implies that the intelligence layer is the moat, the product, and the competitive advantage all at once. Executives hear “AI First” and see leapfrogged roadmaps, reduced headcount, and disrupted markets.

What “AI First” actually produces, in practice, is a fragile application wrapped around an API call.

Consider what happens when an organization builds AI First without foundational engineering discipline. The LLM handles the happy path beautifully. Then the API rate-limits. Then the context window overflows. Then the agent hallucinates in a customer-facing workflow. Then the orchestration layer drops a message between two agents that were supposed to coordinate. Then the memory store loses state mid-session.

Every one of these failure modes has a well-understood solution in distributed systems literature. And every one of these failure modes is being rediscovered, from scratch, by teams that skipped the foundation.

The Distributed Systems Playbook: Older Than You Think

The patterns that make agentic AI systems reliable are not new. They are borrowed, sometimes consciously and sometimes accidentally, from decades of distributed computing research. The convergence is not a coincidence. It is an inevitability. Multi-agent systems are distributed systems. The moment you have two agents coordinating across a shared task, you have entered the domain of consensus, fault tolerance, and state management whether you acknowledge it or not.

Milosevic and Odell formalized this connection in their January 2026 paper “Architecting Agentic Communities using Design Patterns” (arXiv:2601.03624). They explicitly derive agentic design patterns from enterprise distributed systems standards and formal methods. Their taxonomy classifies patterns into three tiers: LLM Agents for task-specific automation, Agentic AI for adaptive goal-seeking, and Agentic Communities for organizational frameworks where agents and humans coordinate through formal roles, protocols, and governance structures. The architectural lineage is unmistakable. These are not novel AI patterns. They are service-oriented architecture patterns with a new cognitive substrate.

The Pattern Map: Distributed Computing → Agentic AI

The parallels are structural, not metaphorical. Every major infrastructure pattern emerging in the agentic AI space has a direct ancestor in distributed computing.

Orchestration

In distributed systems, orchestration engines like Kubernetes, Apache Airflow, and Temporal coordinate service execution, manage dependencies, handle retries, and enforce ordering guarantees. In the agentic world, LLM orchestration frameworks like LangGraph, CrewAI, and AutoGen perform identical functions: they coordinate agent execution, manage tool dependencies, and enforce workflow ordering.

The paper by Drammeh on multi-agent LLM orchestration for incident response (arXiv:2511.15755) demonstrated that orchestrated multi-agent systems achieved a 100% actionable recommendation rate compared to 1.7% for single-agent approaches. The insight is not that the model was better. The insight is that the orchestration was better. The infrastructure made the intelligence useful.

Stateful Sessions and Memory

Distributed systems solved session affinity and state management decades ago. Sticky sessions, distributed caches, and event sourcing patterns all address the same fundamental problem: how do you maintain coherent state across multiple service invocations that may occur on different nodes?

Agentic AI is now solving the same problem under a different name. Agent “memory,” whether short-term context windows, long-term vector stores, or persistent session state, is distributed state management. The challenges are identical: consistency across nodes, durability under failure, and efficient retrieval under load. The Jiang et al. survey on agent adaptation (arXiv:2512.16301) categorizes memory as a core adaptation mechanism, but the underlying engineering is cache management and state replication.

Service Mesh → LLM Mesh and Agentic Mesh

This is where the convergence becomes most striking. In distributed computing, the service mesh pattern (Istio, Linkerd, Consul Connect) emerged to solve a specific problem: as the number of microservices grew, managing service-to-service communication, security, observability, and traffic routing at the application layer became untenable. The mesh moved these cross-cutting concerns into infrastructure.

The same pattern is emerging for LLM and agentic systems. “LLM-Mesh,” as described by researchers at UIUC (arXiv:2507.00507), addresses elastic resource sharing across heterogeneous hardware for serverless LLM inference. The concept parallels the service mesh exactly: abstract the complexity of model routing, load balancing, and resource allocation into an infrastructure layer so that application developers can focus on business logic.

The agentic mesh extends this further. The Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol are standardizing inter-agent communication in the same way that gRPC and service mesh sidecars standardized inter-service communication. The paper on multi-agent orchestration architectures (arXiv:2601.13671) describes MCP and A2A as establishing an “interoperable communication substrate” for agent coordination. Substitute “service” for “agent” and you are reading a 2018 paper on Istio.

MLOps, LLMOps, and the CI/CD Parallel

DevOps gave us CI/CD pipelines, blue-green deployments, canary releases, and automated rollbacks. MLOps applied the same principles to model training and deployment. LLMOps extends them further to prompt management, hallucination monitoring, and token cost tracking.

The pattern is identical each time: take a new computational paradigm, realize that artisanal manual deployment does not scale, and rediscover that automated pipelines with observability and rollback capabilities are the only path to production reliability. The MLOps lifecycle framework (arXiv:2503.15577) maps directly to the DevOps lifecycle. The tools have different names. The principles are unchanged.

Scaling Laws: The CAP Theorem of Agents

Kim et al.’s “Towards a Science of Scaling Agent Systems” (arXiv:2512.08296) derived quantitative scaling principles for multi-agent architectures. Their findings read like a distributed systems textbook: centralized coordination improves performance by 80.8% on parallelizable tasks but degrades sequential reasoning by 39–70%. Independent agents amplify errors 17.2 times. There is a capability saturation point beyond which adding more agents yields diminishing or negative returns.

These are not AI insights. These are Amdahl’s Law and the CAP theorem wearing different clothes. Parallelizable workloads benefit from distribution. Sequential workloads do not. Coordination has overhead. When the network partitions, consistency and availability trade off against each other. The distributed systems community established these principles decades ago. The agentic AI community is now empirically rediscovering them.
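The Amdahl's Law half of that comparison can be made concrete with a few lines of arithmetic. This is a minimal sketch of the textbook formula, not a model of the Kim et al. results; the function name and the example fractions are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Textbook Amdahl's Law: speedup from running the parallel part on `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative numbers only: a task that is 90% parallelizable.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the work parallelizable, the speedup can never reach 10x however many workers are added, because the serial 10% sets the ceiling. The formula ignores coordination cost entirely, so on its own it predicts diminishing returns but not the negative returns the paper reports; those need an added overhead term.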

What “Foundation First” Actually Means

Foundation First does not mean ignoring AI. It means building the infrastructure that makes AI reliable before building the AI features that make the product exciting.

Concretely, Foundation First means:

Observability before intelligence. You cannot debug an agent you cannot observe. Instrument tracing, logging, and metrics for every agent interaction before you build the agent itself. The distributed systems community learned this lesson with microservices. The agentic community is learning it now with hallucination monitoring and prompt observability.

Fault isolation before orchestration. Circuit breakers, retry policies, dead-letter queues, and graceful degradation paths must exist before you chain agents together. A single hallucinating agent in an unprotected pipeline can corrupt an entire workflow. Bulkhead patterns are not optional.

State management before memory. Decide how you will manage agent state—what is ephemeral, what is persistent, what requires consistency guarantees—before you implement “memory.” Vector stores are not a state management strategy. They are a retrieval optimization. The state management strategy is the architecture decision that determines whether your system survives a failure.

Protocol standardization before integration. Adopt MCP, A2A, or whatever communication standard your ecosystem supports before you build bespoke agent-to-agent integrations. Every point-to-point integration you build today is technical debt you will pay interest on tomorrow. The service mesh pattern exists because point-to-point service integration did not scale. The same is true for agents.

Evaluation infrastructure before deployment. In my post on dynamic evaluations, I argued that evaluation loops measure performance and enforce constraints but do not create new knowledge. The same applies here: build the evaluation infrastructure first, then deploy the agents into it. Do not deploy first and evaluate later. The distributed systems equivalent is deploying without monitoring. Everyone knows it is wrong. Everyone does it anyway.
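To make "fault isolation before orchestration" less abstract, here is a minimal circuit breaker sketch that could wrap any agent or tool call. The class name, thresholds, and the simplified half-open behavior are assumptions for illustration, not the API of Hystrix or any agent framework.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call during its cool-down window."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single failure on this trial re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

The point of the pattern is that a failing downstream agent stops receiving calls for a while, so the caller can degrade gracefully instead of piling retries onto a component that is already broken.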

The Anti-Patterns for Leaders

“We Are an AI Company”

No. You are a company that uses AI. The distinction matters. An “AI company” identity encourages teams to center every decision on the model. A company that uses AI centers decisions on the customer problem and selects the best tool, AI or otherwise, for each component of the solution. Sometimes the best tool is a deterministic rules engine. Sometimes it is a relational database query. Sometimes it is a well-designed form. AI First thinking makes these options invisible.

Skipping Infrastructure to Ship the Demo

The demo always works. The demo runs on a single API call with a curated prompt against a known-good input. Production is not the demo. Production is 10,000 concurrent users with adversarial inputs, network partitions, rate limits, and a context window that fills up faster than anyone predicted. Every month I see teams ship the demo and then spend six months building the infrastructure they should have built first.

Treating the Model as the Moat

Foundation models are commoditizing. The moat is not the model. The moat is the data pipeline, the evaluation infrastructure, the orchestration layer, the fault tolerance mechanisms, and the domain-specific workflows that make the model useful in a specific context. These are all foundational engineering investments. They are not glamorous. They are the reason some AI products work and others do not.

Ignoring the Distributed Systems Literature

The agentic AI community is producing excellent research. But much of it is rediscovering principles that the distributed systems community established years ago. Leaders who staff their AI teams exclusively with ML engineers and ignore distributed systems expertise are building on sand. The hard problems in agentic AI are increasingly infrastructure problems, not model problems.

The Convergence Table

Distributed Computing Pattern	Agentic AI Equivalent	Why It Matters
Service Orchestration (K8s, Temporal)	Agent Orchestration (LangGraph, CrewAI)	Coordination, dependency management, retry logic
Service Mesh (Istio, Linkerd)	LLM Mesh / Agentic Mesh (MCP, A2A)	Cross-cutting concerns: auth, observability, routing
Session Affinity / Distributed Cache	Agent Memory (vector stores, context windows)	State coherence across invocations
CI/CD Pipelines	MLOps / LLMOps Pipelines	Automated deployment, rollback, version control
Circuit Breakers (Hystrix)	Agent Fallback / Guardrails	Fault isolation, graceful degradation
Event Sourcing / CQRS	Agent Action Logs / Audit Trails	Reproducibility, debugging, compliance
Load Balancing	Model Routing / LLM Gateway	Cost optimization, latency management
API Gateway	LLM Gateway / Orchestration Layer	Rate limiting, auth, request transformation
Observability (Prometheus, Jaeger)	LLM Observability (Arize, LangSmith)	Tracing, hallucination detection, cost tracking
CAP Theorem Tradeoffs	Agent Scaling Laws (Kim et al.)	Coordination overhead vs. parallelism gains

The Bottom Line

The infrastructure patterns that powered the internet, the cloud, and the microservices revolution are the same patterns that will power the agentic AI era. They are not optional. They are not “nice to have after launch.” They are the foundation without which no AI system survives production.

“AI First” is a marketing strategy. “Foundation First” is an engineering strategy. One gets you a demo. The other gets you a product.

The organizations that win the next five years will not be the ones that adopted AI the fastest. They will be the ones that built the most resilient foundations and then deployed AI into an infrastructure designed to make it reliable, observable, and recoverable.

Kant would remind us that reason without grounded experience produces illusions. The same is true for AI without grounded infrastructure. Build the foundation. Then build the intelligence. Not the other way around.

References

Milosevic, Z. and Odell, J. “Architecting Agentic Communities using Design Patterns.” arXiv:2601.03624 (January 2026).
Kim, Y. et al. “Towards a Science of Scaling Agent Systems.” arXiv:2512.08296 (December 2025).
Drammeh, P. “Multi-Agent LLM Orchestration Achieves Deterministic, High-Quality Decision Support for Incident Response.” arXiv:2511.15755 (November 2025).
“LLM-Mesh: Enabling Elastic Sharing for Serverless LLM Inference.” arXiv:2507.00507 (July 2025).
“The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption.” arXiv:2601.13671 (January 2026).
“Navigating MLOps: Insights into Maturity, Lifecycle, Tools, and Careers.” arXiv:2503.15577 (March 2025).
Jiang, P. et al. “Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills.” arXiv:2512.16301 (December 2025).
Gangadharan, G.R. et al. “Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation.” arXiv:2601.12560 (January 2026).
Kanakasabesan, K. “AGI isn’t here yet: Why OpenClaw, Agents and LLM Systems are still just ANI.”
Kanakasabesan, K. “Your Agents are not safe and your evals are too easy.”
Kanakasabesan, K. “Measuring What Matters: Dynamic Evaluation for Autonomous Security Agents.”

LLM Infrastructure Is Challenging: Why Agentic Systems Require an Operations Layer Instead of Improved Prompts

LLM-based infrastructure becomes fundamentally challenging the moment you integrate memory, tools, feedback, and goals. At that point, you are no longer dealing only with the non-determinism of a language model. You are building something closer to a new operating system, one with its own language-based state, implicit dependencies, distributed control flow, and an expanding set of failure modes, any of which can surface at any time.

Both agentic applications and LLM infrastructure layers introduce their own operational challenges. But agents, in particular, cross a threshold: flexibility, reasoning, and autonomous decision-making come at the cost of debuggability, predictability, and safety.

Agent OS: Reference Architecture

The key shift is to stop treating agents like “smart functions” and treat them like a distributed system that needs an operating layer: state semantics, execution replay, observability, reliability controls, and isolation boundaries.

From “Non-Determinism” to Distributed Failure

As agents introduce reasoning and autonomous decision-making, they also introduce complex control flows. If an agent fails at step 6 in a 10-step workflow, rerunning the same task may result in failure at step 1. Nothing “changed,” yet everything changed.

Because:

Planning is probabilistic.
Memory retrieval is approximate.
Tools are unreliable.
Intermediate state is mutable and often shared.

Memory: The Bottleneck Nobody Admits

Agents need context. They remember facts, refer to earlier steps, and plan ahead. But storing and retrieving memory—whether vectorized or tokenized—quickly becomes a bottleneck in both latency and accuracy. Most memory systems are leaky, brittle, and often misaligned with the model’s representation space.

Vector similarity optimizes for “semantic closeness,” not correctness. Wrong memories get retrieved confidently, uncertainty collapses into “facts,” and errors compound downstream.
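A toy sketch shows why nearest-neighbor retrieval always returns something, whether or not it is the right thing. The three-dimensional vectors and memory texts below are invented for illustration; real embeddings have far more dimensions, and the threshold value is an arbitrary assumption.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, memories, min_similarity=None):
    """Return (text, score) for the closest memory, or None for 'unknown'."""
    if not memories:
        return None
    best_text, best_score = None, -1.0
    for text, vector in memories.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_text, best_score = text, score
    if min_similarity is not None and best_score < min_similarity:
        return None  # nothing close enough: report unknown, not a guess
    return best_text, best_score


# Hypothetical toy data: the query concerns a different system than either memory.
memories = {
    "staging scan: no critical vulnerabilities": [0.9, 0.1, 0.2],
    "prod deploy completed": [0.1, 0.2, 0.9],
}
query = [0.6, 0.7, 0.1]

print(retrieve(query, memories))                      # closest item, however loose the match
print(retrieve(query, memories, min_similarity=0.9))  # None
```

A similarity floor only turns some weak matches into an explicit "unknown." It does not make a close match correct, which is why the memory entry itself needs scope, source, and confidence attached.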

Tools Make Everything Worse (Operationally)

Tools fail in ways agents typically do not handle gracefully: timeouts with empty payloads, partial responses, rate limits, schema changes, and transient network failures. When this happens, the agent must recover without hallucinating, looping indefinitely, or writing an incorrect state into memory. Most do not.

MCP and A2A are necessary components, but they are not sufficient on their own.

MCP and A2A standardize the wiring: message framing, tool invocation, and transport. But they do not standardize the semantics of state: what memory means, how it’s scoped/versioned, how multi-agent writes are coordinated, and how failures are localized.

Without memory versioning, namespacing, synchronization, and access control, multi-agent systems drift into hard-to-debug behavior.

Incident Postmortems: What Actually Breaks

Incident #1: Tool Timeout → Hallucinated Recovery → Memory Contamination

Summary
An agent generated a confident but incorrect remediation plan. The root cause was a cascading failure across tooling, control flow, and memory, not “hallucination” as a primary failure.

Trigger: A vulnerability-scanning API timed out and returned empty but “successful” output.
Agent Interpretation: Empty result was treated as “no issues found” rather than “unknown.”
State Corruption: The agent wrote a semantic memory: “System scanned; no critical vulnerabilities detected.”
Downstream Impact: A second agent retrieved this as fact and suppressed additional checks.

Root Cause

Ambiguous tool contract (empty ≠ success)
No typed memory, confidence scoring, or provenance
No enforced distinction between “unknown” vs “safe”

Why it was hard to debug

Logs showed a “successful” tool call
The final output schema was valid
No trace linked the memory write to partial/failed tool state

Incident #2: Cross-Agent Memory Contamination in an A2A Workflow

Summary
An execution agent acted on another agent’s internal planning state, causing nondeterministic failures across reruns.

Trigger: The planning agent wrote a draft plan into shared memory.
Misread: The execution agent treated it as approved instructions.
Drift: Partial execution failed; retries rewrote partial outcomes.
Heisenbug: Replays failed earlier each time as shared state mutated.

Root Cause

No memory namespace separation by agent role or task phase
No lifecycle markers (draft vs final; executable vs non-executable)
Shared mutable state without coordination or ACLs

Why it was hard to debug

Each agent looked “correct” in isolation
Transport and schemas were valid
The failure existed only in cross-agent semantics

Minimum Viable Ops Layer for Agentic Systems

Reduced to the bare minimum, production-grade agents need new primitives, not more prompts.

1) Replayable Execution

Capture: model version, prompt hash, retrieved memory IDs, tool schemas, tool responses, routing decisions
Enable frozen replays to separate reasoning drift from world drift
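One possible shape for the capture record and a frozen replay is sketched below. The field names mirror the list above; `RunRecord`, `hash_prompt`, and `frozen_replay` are hypothetical names, and the step function stands in for whatever reasoning step is being replayed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    model_version: str
    prompt_hash: str
    memory_ids: tuple         # IDs of the memory entries that were retrieved
    tool_schemas: dict        # tool name -> schema in force at run time
    tool_responses: tuple     # raw responses, in call order
    routing_decisions: tuple  # which tool or agent was chosen at each step


def hash_prompt(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def serialize(record: RunRecord) -> str:
    # Sorted keys keep the serialized form stable across runs.
    return json.dumps(asdict(record), sort_keys=True)


def frozen_replay(record: RunRecord, step_fn):
    """Feed recorded tool responses to step_fn instead of calling live tools."""
    return [step_fn(response) for response in record.tool_responses]
```

If a replay against frozen tool responses produces a different result than the original run, the difference came from the reasoning side (model, prompt, or memory), not from the outside world changing.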

2) Typed, Versioned Memory

Types: episodic (run log), semantic (facts), procedural (policies/playbooks), working set (scratch)
Every entry: scope, timestamp, source, confidence, TTL, ACL
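As a rough illustration, a memory entry carrying those fields might look like the following. The `version` field and the `readable_by` rule are assumptions added for the sketch; the four types come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    EPISODIC = "episodic"      # run log
    SEMANTIC = "semantic"      # facts
    PROCEDURAL = "procedural"  # policies and playbooks
    WORKING = "working"        # scratch space


@dataclass(frozen=True)
class MemoryEntry:
    content: str
    kind: MemoryType
    scope: str          # namespace, e.g. "planner/draft" (hypothetical)
    version: int        # assumed: incremented on each rewrite of the entry
    timestamp: float    # seconds since epoch when written
    source: str         # which agent or tool produced it
    confidence: float   # 0.0 to 1.0
    ttl_seconds: float
    acl: frozenset      # agents allowed to read this entry

    def is_expired(self, now: float) -> bool:
        return now - self.timestamp > self.ttl_seconds

    def readable_by(self, agent: str, now: float) -> bool:
        return agent in self.acl and not self.is_expired(now)
```

Making the entry immutable means a correction produces a new version instead of silently overwriting the old one, which keeps replays and audits honest.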

3) Explicit Tool Contracts

Empty/partial/timeout are first-class outcomes
Idempotency by default for write actions
Retry safety classification (retryable vs unsafe-to-retry)
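Incident #1 is essentially what happens without such a contract. The sketch below shows one way to make "unknown" impossible to confuse with "safe"; the response shape (`findings`, `complete`) and all names are hypothetical, not any scanner's real API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Outcome(Enum):
    OK = "ok"
    EMPTY = "empty"
    PARTIAL = "partial"
    TIMEOUT = "timeout"


@dataclass(frozen=True)
class ToolResult:
    outcome: Outcome
    payload: Any = None
    safe_to_retry: bool = False


def classify_scan(raw: Optional[dict], timed_out: bool) -> ToolResult:
    # A read-only scan is assumed safe to retry; a write action would
    # need an idempotency key before it could be marked retryable.
    if timed_out:
        return ToolResult(Outcome.TIMEOUT, safe_to_retry=True)
    if not raw or "findings" not in raw:
        return ToolResult(Outcome.EMPTY, safe_to_retry=True)
    if not raw.get("complete", False):
        return ToolResult(Outcome.PARTIAL, payload=raw, safe_to_retry=True)
    return ToolResult(Outcome.OK, payload=raw)


def summarize(result: ToolResult) -> str:
    # Only a complete scan is allowed to claim a clean result.
    if result.outcome is Outcome.OK:
        count = len(result.payload["findings"])
        return "no findings" if count == 0 else f"{count} findings"
    return f"unknown: scan {result.outcome.value}"
```

With this shape, an empty payload after a timeout can only ever be summarized as unknown, so the false "no critical vulnerabilities" memory from the postmortem has no path into the store.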

4) Distributed Tracing Across Agents

Correlation IDs spanning A2A hops
Reason codes (“why tool X was chosen,” “why memory Y was written”)
Schema validation gates at boundaries
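A minimal sketch of correlation IDs and reason codes follows. It is deliberately library-free and is not the OpenTelemetry API; `TraceContext`, `new_trace`, and `log_event` are hypothetical names, and a real system would propagate the context in message headers.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceContext:
    correlation_id: str
    hop: int = 0

    def next_hop(self) -> "TraceContext":
        # Same correlation ID, one hop further along the agent chain.
        return TraceContext(self.correlation_id, self.hop + 1)


def new_trace() -> TraceContext:
    return TraceContext(uuid.uuid4().hex)


def log_event(log: list, ctx: TraceContext, agent: str, reason_code: str, detail: str) -> None:
    log.append({
        "correlation_id": ctx.correlation_id,
        "hop": ctx.hop,
        "agent": agent,
        "reason_code": reason_code,  # e.g. "tool_choice", "memory_write"
        "detail": detail,
    })


# Usage: the planner starts a trace and hands the next hop to the executor.
events = []
ctx = new_trace()
log_event(events, ctx, "planner", "memory_write", "stored draft plan")
log_event(events, ctx.next_hop(), "executor", "tool_choice", "selected scanner")
```

Filtering the log by one correlation ID then reconstructs the whole cross-agent path, including why each memory write and tool choice happened.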

5) Cognitive Circuit Breakers

Loop detection based on non-progression
Retry budgets per intent (not per step)
Graceful escalation paths when uncertainty remains high
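The first two items can be sketched as a small guard that the orchestrator consults after every attempt. The class name, the default budgets, and the idea of a caller-supplied state fingerprint are assumptions for illustration.

```python
class ProgressGuard:
    """Tracks attempts and stalls per intent, not per step."""

    def __init__(self, attempt_budget=3, stall_limit=2):
        self.attempt_budget = attempt_budget
        self.stall_limit = stall_limit
        self.attempts = {}    # intent -> attempts so far
        self.stalls = {}      # intent -> consecutive attempts with no state change
        self.last_state = {}  # intent -> last observed state fingerprint

    def record(self, intent: str, state_fingerprint: str) -> str:
        """Return 'continue' or 'escalate' after an attempt at `intent`."""
        self.attempts[intent] = self.attempts.get(intent, 0) + 1
        if self.last_state.get(intent) == state_fingerprint:
            self.stalls[intent] = self.stalls.get(intent, 0) + 1
        else:
            self.stalls[intent] = 0
        self.last_state[intent] = state_fingerprint

        over_budget = self.attempts[intent] > self.attempt_budget
        stalled = self.stalls[intent] >= self.stall_limit
        return "escalate" if over_budget or stalled else "continue"
```

Budgeting per intent matters because an agent can change its step sequence on every retry while still pursuing the same goal, which lets a per-step counter reset forever.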

6) Security and Isolation

Memory ACLs between agents and namespaces
Provenance tracking for tool outputs
Sanitize tool outputs before re-injection into prompts
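For the first two items, a namespaced store with per-namespace ACLs and provenance on every write might look like this. It is a sketch under stated assumptions: the ACL layout, the namespace names, and the class itself are hypothetical, and output sanitization is not covered.

```python
class AccessDenied(PermissionError):
    pass


class NamespacedMemory:
    def __init__(self, acl):
        # acl maps namespace -> {"read": set of agents, "write": set of agents}
        self._acl = acl
        self._data = {}

    def _check(self, agent, namespace, action):
        allowed = self._acl.get(namespace, {}).get(action, set())
        if agent not in allowed:
            raise AccessDenied(f"{agent} may not {action} {namespace}")

    def write(self, agent, namespace, key, value, source):
        self._check(agent, namespace, "write")
        # Provenance travels with the value: who wrote it and where it came from.
        self._data[(namespace, key)] = {
            "value": value,
            "written_by": agent,
            "source": source,
        }

    def read(self, agent, namespace, key):
        self._check(agent, namespace, "read")
        return self._data[(namespace, key)]


# Hypothetical layout: drafts are private to the planner,
# approved plans are readable by the executor.
store = NamespacedMemory({
    "planner/draft": {"read": {"planner"}, "write": {"planner"}},
    "plans/approved": {"read": {"planner", "executor"}, "write": {"planner"}},
})
```

Under this layout the execution agent in Incident #2 could not have read the draft plan at all, so the contamination would have surfaced as an access error instead of a silent misread.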

Conclusion: This Is Not LLM Ops. It’s Systems Engineering

The industry frames agent failures as “LLMs being non-deterministic.” In practice, agentic systems fail for the same reasons distributed systems fail: unclear state ownership, leaky abstractions, ambiguous contracts, missing observability, and unbounded blast radius.

MCP and A2A solve interoperability. They do not solve operability. Until we treat agents as stateful, fallible, adversarial, and long-running systems, we will keep debugging step 6 failures that reappear at step 1 and calling it hallucination.

What is lacking is not an improved model. It’s an operating layer that assumes failure as the default condition.

Check out the following articles on the topic in the references section for more details.

References

Multi-agent frameworks including AutoGen, LangGraph, and CrewAI: empirical evidence from production usage and open-source implementations.

Russell, S., & Norvig, P. Artificial Intelligence: A Modern Approach (4th ed.). Pearson, 2020.

Wooldridge, M. An Introduction to MultiAgent Systems. Wiley, 2009.

Amodei, D. et al. “Concrete Problems in AI Safety.” arXiv, 2016. https://arxiv.org/abs/1606.06565

Lewis, P. et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” arXiv, 2020. https://arxiv.org/abs/2005.11401

Liu, N. et al. “Lost in the Middle: How Language Models Use Long Contexts.” arXiv, 2023. https://arxiv.org/abs/2307.03172

Karpukhin, V. et al. “Dense Passage Retrieval for Open-Domain QA.” arXiv, 2020. https://arxiv.org/abs/2004.04906

Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv, 2023. https://arxiv.org/abs/2210.03629

Schick, T. et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv, 2023. https://arxiv.org/abs/2302.04761

Madaan, A. et al. “Self-Refine: Iterative Refinement with Self-Feedback.” arXiv, 2023. https://arxiv.org/abs/2303.17651

Lamport, L. “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM, 1978.

Kleppmann, M. Designing Data-Intensive Applications. O’Reilly, 2017.

Fowler, M. “Patterns of Distributed Systems.” martinfowler.com

Beyer, B. et al. Site Reliability Engineering. Google, 2016. https://sre.google/sre-book/

OpenTelemetry Specification. https://opentelemetry.io/docs/specs/

Greshake, K. et al. “Not What You’ve Signed Up For.” arXiv, 2023. https://arxiv.org/abs/2302.12173

OWASP. “Top 10 for Large Language Model Applications.”

Anthropic. “Model Context Protocol (MCP).”

The original meaning of MVP (and How it Drifted)

Traditionally, MVP (Minimum Viable Product) meant:

“The smallest thing you can put in front of users to maximize learning with minimal effort”

Most of us have heard or read about Dropbox’s MVP, which was essentially a short demo video explaining the notion of file syncing and sharing before the product was broadly available. That was probably one of the few instances where MVP actually stood for what it means.

What it is not:

A sellable SKU
A fully supported product
A revenue-ready launch

Over time, however, MVP became shorthand for:

“Something sales can demo”
“Something marketing can announce”
“Something support won’t revolt over”

That shift is where the confusion and friction commence!

MVP is a Supply Chain, Not a Feature

Like any good supply chain, MVPs do not exist in isolation. They require alignment across a lineup of stakeholders, each optimizing for different signals:

The Stakeholder Stack

Product management training generally holds that stakeholder management is one of the key things a product manager brings to the table. I have my own take on the term “stakeholder management,” which sounds outdated, reminiscent of 1995. My term is “Stakeholder Stack.” It is inspired by the term “technical stack,” and there is a reason behind it. Before we get to the reason, let us understand the stack itself.

Stakeholder	Primary Concern
Engineering (Foundation Layer)	Technical feasibility, architecture integrity
Design Partners / Early Users	Does this solve a real problem?
Product & UX	Usability, workflows, behavioral signals
Community/DevRel	Adoption friction, feedback loops
Marketing	Narrative clarity, positioning
Sales/RevOps	Sellability, repeatability
Support & Customer Success	Operational burden, scale readiness

As you can see, all these stakeholders matter, but not at the same time. Here is an example of something that has worked for me throughout my career.

Power/Interest Grid

High Power, High Interest
- CPO (Product Strategy)
- CTO (Technical Feasibility)
- Engineering Managers
- Product Manager (GA Owner)

High Power, Low Interest
- CFO (Budget Impact)
- Legal/Compliance
- Security Team

Low Power, High Interest
- Customer Success
- Sales Teams
- Documentation Team
- Key Beta Customers

Low Power, Low Interest
- Industry Analysts (inform only)
- Technology Partners (coordinate)

Engagement Strategy by Stakeholder

1. Manage Closely (High Power/High Interest)

Weekly status updates
Direct involvement in decision-making
Early escalation of risks

2. Keep Satisfied (High Power/Low Interest)

Monthly executive summaries
Gate reviews at key milestones
Escalate only critical issues

3. Keep Informed (Low Power/High Interest)

Regular communication cadence
Solicit feedback actively
Include in testing/validation

4. Monitor (Low Power/Low Interest)

Periodic updates
Self-service information access
Engage as needed

Why is this stakeholder management element vital in the context of discussing MVPs? Let us get to that.

The Core Disagreement: Sell versus Learn

Stakeholders are vital to defining what an MVP will be. They usually agree on what the MVP is but disagree on why it exists.

Two legitimate, but conflicting, definitions

MVP as a learning vehicle
- Goal: Accelerate validated learning
- Audience: Design partners, early adopters, internal teams
- Characteristics:
  - Rough edges tolerated
  - Limited support expectations
  - Fast iteration steps
- Enables:
  - Early engagement during development
  - Architectural and UX corrections before scale
  - Lower long-term risk

MVP as a commercial artifact
- Goal: Enable selling
- Audience: Broader market
- Characteristics:
  - Market-ready messaging
  - Support and success coverage
  - Sales enablement
- Requires:
  - Strong cross-functional readiness
- Trade-offs:
  - Higher cost of change
  - Slower learning velocity

Neither is wrong, but they are not the same thing!

The Real Failure Mode

Most organizations fail at MVP because they try to:

Optimize for selling while pretending that they are focusing on learning.

This creates:

Over-engineered “MVPs”
Premature go-to-market pressure
Feedback filtered through sales conversations instead of usage signals
Teams arguing past each other using the same acronyms

A few things to note:

If the customer is willing to pay for the vision and use the MVP, you are in a rare and excellent position to get the product out and use the MVP learnings towards the greater goal.
I hate acronyms; they generally make people feel stupid and are not inclusive by nature. My objection is to acronyms coined for communication within one organization; industry-standard acronyms, such as TCP/IP, are acceptable.
Do not optimize the MVP for all stakeholders at the same time; at different stages, different stakeholders matter.

A More useful framing

Instead of asking, “Is this an MVP?” ask:

What are we trying to learn?
Who must be involved now, and who can wait?
What commitments are we implicitly making by calling this an MVP?

A product intended for accelerated learning can and should engage stakeholders early, but selectively:

Engineers and design partners early
Community next
Only when the intent shifts towards selling do you include sales, marketing, and support.

** If it is a product you are not charging for but is a critical element of the experience, you still include sales, marketing, and support when the intent shifts towards broad-based access.

The Bottom line

An MVP is not a thing. It is an intent.

Unclear intent and lack of stakeholder involvement cause confusion. When the right stakeholders are not engaged, then different parts of the organization assume different definitions. Then we have a situation where the “Highest Paid Person’s Opinion” decides the fate of the MVP definition.

Clarity on what you are building an MVP for is what allows the entire supply chain to line up and move fast without breaking trust.

This remains true even in an AI-driven world. AI agents can generate content and checklists, but only when they are given a clear intent and the right context. Otherwise, what you get is slop, not anything useful.

Measuring What Matters: Dynamic Evaluation for Autonomous Security Agents

This week’s blog title pays tribute to one of my preferred books, “Measure What Matters” by John Doerr. In my earlier post, I briefly addressed the concept of dynamic evaluations for agents. This topic resonates with me because of my professional experience in application lifecycle management. I have also worked with cloud orchestration, cloud security, and low-code application development. There is a clear necessity for autonomous, intelligent continuous security within our field. Over the past several weeks, I have conducted extensive research, primarily reviewing publications from http://www.arxiv.org, to explore emerging possibilities enabled by dynamic evaluations or agents.

This week’s discussion includes a significant mathematical part. To clarify, when referencing intelligent continuous security, I define it as follows:

End-to-end security
Continuous security in every phase
Integration of lifecycle security practices leveraging AI and ML

The excitement surrounding this area stems from employing AI technologies to bolster defense against an evolving threat landscape. This landscape is increasingly accelerated by advancements in AI. This article will examine the primary objects under evaluation. It will cover key metrics for security agent testing, risk-weighted security impact, and coverage. It will also discuss dynamic algorithms and scenario generation. These elements are all crucial within the framework of autonomous red, blue, and purple team operations for security scenarios. Then, a straightforward scenario will be presented to illustrate how these components interrelate.

This topic holds significant importance due to the current shortage of cybersecurity professionals. This is particularly relevant given the proliferation of autonomous vehicles, delivery systems, and defensive mechanisms. As these technologies advance, the demand for self-learning autonomous red, blue, and purple teams will become imperative. For instance, consider the ramifications if an autonomous vehicle were compromised and transformed into a weaponized entity.

What “dynamic evals” means in this context

For security agents (red/blue/purple):

Static evals: fixed test suite (e.g., canned OWASP tests) → one-off score
Dynamic evals:
- Continuously generate new attack and defense scenarios
- Re-sample them over time as the system and agents change
- Use online/off-policy algorithms to compare new policies safely

Recent work on red teaming and dynamic evaluation frameworks for LLM agents argues that static benchmarks go stale quickly and must be replaced by ongoing, scenario-generating eval systems.

For security, we also anchor to the OWASP ASVS/Testing Guide for what “good coverage” means, and to CVSS/OWASP risk ratings for how bad a found vulnerability is.

Objects we’re evaluating

Think of your environment as a Markov decision process (MDP). An MDP models situations where outcomes are partly random and partly under the control of a decision maker. It is a formal way to describe decision-making over time under uncertainty. With that out of the way, these are the components of the MDP in the context of dynamic evals:

State $s$: slices of system state + context
- code snapshot, open ports, auth config, logs, alerts, etc.
Action $a$: what the agent does
- probe, run scanner X, craft request Y, deploy honeypot, block IP, open ticket, etc.
Transition $P(s' \mid s, a)$: how the system changes.
Reward $r$: how “good” or “bad” that step was.

Dynamic eval = define good rewards, log trajectories $(s_t, a_t, r_t, s_{t+1})$, then use off-policy evaluation and online testing to compare policies.

Core metrics for security-testing agents

Task-level detection/exploitation metrics

On each scenario j (e.g., “there is a SQL injection in service A”):

True positive rate (TPR):

\mathrm{TPR} = \frac{\#\text{ of vulnerabilities correctly found}}{\#\text{ of real vulnerabilities present}}

False positive rate (FPR):

\mathrm{FPR} = \frac{\#\text{ of false alarms}}{\#\text{ of checks on non-vulnerable components}}

Mean time to detect (MTTD) across runs:

\mathrm{MTTD} = \frac{1}{N} \sum_{i=1}^{N} \left( t_{\text{detect}}^{(i)} - t_{\text{start}}^{(i)} \right)

Exploit chain depth for red agents: average number of steps in successful attack chains.
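
The detection metrics above translate almost directly into code. The sketch below assumes vulnerabilities are identified by string labels and that each run is a `(start, detect)` pair in seconds; the function names and input shapes are illustrative, and non-empty inputs are assumed.

```python
def true_positive_rate(found: set, real: set) -> float:
    # Only findings that match a real vulnerability count as correct.
    return len(found & real) / len(real)


def false_positive_rate(false_alarms: int, checks_on_clean: int) -> float:
    # Denominator: checks run against components known to be non-vulnerable.
    return false_alarms / checks_on_clean


def mean_time_to_detect(runs: list) -> float:
    # Each run is a (t_start, t_detect) pair in the same time unit.
    return sum(detect - start for start, detect in runs) / len(runs)


real = {"sqli", "xss", "ssrf", "idor"}
found = {"sqli", "xss", "bogus-finding"}
print(true_positive_rate(found, real))                         # 0.5
print(false_positive_rate(3, 60))                              # 0.05
print(mean_time_to_detect([(0, 120), (10, 70), (5, 125)]))     # 100.0
```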

Risk-weighted security impact

For each found vulnerability $i$, with CVSS score $c_i \in [0,10]$, define a Risk-Weighted Yield (RWY):

\mathrm{RWY} = \sum_{i \in \text{found vulns}} c_i

You can normalize by number of actions or by time:

Risk per 100 actions:

\mathrm{RWY@100a} = \frac{\mathrm{RWY}}{\#\text{ actions}} \times 100

Risk per test hour:

\mathrm{RWY/hr} = \frac{\mathrm{RWY}}{\text{elapsed hours}}

For blue-team agents, we need to invert it:

Risk reduction from defense actions = baseline RWY - RWY after patching/hardening (the RWY still found after patching is the residual risk)
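
As a worked example of the risk-weighted yield and its normalizations, here is a minimal sketch. The CVSS scores, action count, and elapsed time are made-up inputs, and the function names are illustrative.

```python
def risk_weighted_yield(cvss_scores: list) -> float:
    # RWY: sum of CVSS scores over all found vulnerabilities.
    return sum(cvss_scores)


def rwy_per_100_actions(cvss_scores: list, n_actions: int) -> float:
    return risk_weighted_yield(cvss_scores) / n_actions * 100


def rwy_per_hour(cvss_scores: list, elapsed_hours: float) -> float:
    return risk_weighted_yield(cvss_scores) / elapsed_hours


def blue_risk_reduction(baseline_rwy: float, rwy_after_defense: float) -> float:
    # Blue-team view: how much exploitable risk the defense removed.
    return baseline_rwy - rwy_after_defense


scores = [9.8, 7.5, 4.3]                      # hypothetical findings
print(round(risk_weighted_yield(scores), 2))  # 21.6
print(round(rwy_per_100_actions(scores, 250), 2))  # 8.64
print(round(rwy_per_hour(scores, 4.0), 2))    # 5.4
```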

Behavioral metrics (agent quality)

For each trajectory:

Stealth score (red) or stability score (blue):
- e.g., fraction of actions that did not trigger noise/unnecessary alerts.

Action efficiency:

\mathrm{Eff} = \frac{\mathrm{RWY}}{\#\text{ of actions}}

Policy entropy over actions:

H\!\left(\pi(\cdot \mid s)\right) = - \sum_{a} \pi(a \mid s)\, \log \pi(a \mid s)

High entropy $\rightarrow$ explores; low entropy $\rightarrow$ more deterministic; track this over time.
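
To make the entropy metric concrete, the sketch below computes it for one state from a list of action probabilities. It assumes the probabilities are already available from the agent's policy and uses the natural logarithm; the function name is illustrative.

```python
import math


def policy_entropy(action_probs: list) -> float:
    # H = -sum_a pi(a|s) log pi(a|s); zero-probability actions contribute 0.
    return -sum(p * math.log(p) for p in action_probs if p > 0)


exploring = [0.25, 0.25, 0.25, 0.25]      # uniform over four actions
committed = [1.0, 0.0, 0.0, 0.0]          # always the same action
print(policy_entropy(exploring))          # log(4), about 1.386
print(policy_entropy(committed) == 0)     # True
```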

Coverage metrics

Map ASVS/Testing Guide controls to scenarios.

Define a coverage vector over requirement IDs $R_k$.

Control coverage:

\mathrm{Coverage} = \frac{\#\text{ controls with at least one high-quality test}} {\#\text{ controls in scope}}

You can track Markovian coverage. It measures how frequently the agent visits specific state space zones, like auth or data paths. This is estimated by clustering log states.
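
Control coverage can be computed from a mapping of controls to test quality scores. In the sketch below, the control IDs, the 0-to-1 quality score, and the `min_quality` cutoff for what counts as "high-quality" are all assumptions made for illustration.

```python
def control_coverage(tests_by_control: dict, in_scope: set,
                     min_quality: float = 0.8) -> float:
    # A control is covered if at least one of its tests meets the quality bar.
    covered = {
        control for control in in_scope
        if any(q >= min_quality for q in tests_by_control.get(control, []))
    }
    return len(covered) / len(in_scope)


in_scope = {"C1", "C2", "C3", "C4"}       # hypothetical requirement IDs
tests = {"C1": [0.9], "C2": [0.5, 0.85], "C3": [0.4]}
print(control_coverage(tests, in_scope))  # 0.5
```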

Algorithms to make this dynamic

Off-policy evaluation (OPE) for new agent policies

You don’t want to put every experimental red agent directly against your real systems. Instead:

Log trajectories from baseline policies (humans, old agents)
Propose a new policy $\pi_{\text{new}}$
Use OPE to estimate how $\pi_{\text{new}}$ would perform on the same states.

Standard tools from RL/bandits:

Importance Sampling (IS):
- For each trajectory $\tau$ , weight rewards by:

\omega(\tau) = \prod_{t} \frac{\pi_{\text{new}}(a_t \mid s_t)} {\pi_{\text{old}}(a_t \mid s_t)}

then estimate:

\hat{V}_{\mathrm{IS}} = \frac{1}{N} \sum_{i=1}^{N} \omega\!\left(\tau^{(i)}\right)\, R\!\left(\tau^{(i)}\right)

Self-normalized IS (SNIS) to reduce variance:

\hat{V}_{\mathrm{SNIS}} = \frac{\sum_{i} \omega\!\left(\tau^{(i)}\right)\, R\!\left(\tau^{(i)}\right)} {\sum_{i} \omega\!\left(\tau^{(i)}\right)}

Doubly robust (DR) estimators

Combine a model-based value estimate $\hat{Q}(s,a)$ with IS to get a lower-variance estimate that remains consistent when either the value model or the importance weights are accurate.
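
The IS and SNIS estimators above can be sketched in a few lines. This example assumes each trajectory is a list of `(state, action, reward)` tuples and that both policies are callables returning $\pi(a \mid s)$; the toy policies and action names are hypothetical.

```python
def trajectory_weight(traj, pi_new, pi_old) -> float:
    # omega(tau): product of per-step probability ratios.
    w = 1.0
    for state, action, _reward in traj:
        w *= pi_new(action, state) / pi_old(action, state)
    return w


def trajectory_return(traj) -> float:
    return sum(reward for _state, _action, reward in traj)


def is_estimate(trajs, pi_new, pi_old) -> float:
    total = sum(trajectory_weight(t, pi_new, pi_old) * trajectory_return(t)
                for t in trajs)
    return total / len(trajs)


def snis_estimate(trajs, pi_new, pi_old) -> float:
    weights = [trajectory_weight(t, pi_new, pi_old) for t in trajs]
    weighted = sum(w * trajectory_return(t) for w, t in zip(weights, trajs))
    return weighted / sum(weights)


# Toy setup: the logged policy picks uniformly; the new one prefers "scan".
pi_old = lambda a, s: 0.5
pi_new = lambda a, s: 0.8 if a == "scan" else 0.2
logged = [[("s0", "scan", 1.0)], [("s0", "probe", 0.0)]]
print(is_estimate(logged, pi_new, pi_old))    # close to 0.8
print(snis_estimate(logged, pi_new, pi_old))  # close to 0.8
```

In this toy case both estimators recover the new policy's true value (0.8) because the logged data covers both actions evenly. With long trajectories the product of ratios can explode, which is the variance problem SNIS and DR are meant to dampen.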

Safety-aware contextual bandits for online testing

The bandit problem is a fundamental topic in statistics and machine learning, focusing on decision-making under uncertainty. The goal is to maximize rewards by balancing exploration of different options and exploitation of those with the best-known outcomes. A common example is choosing among slot machines at a casino. Each has its own payout probability. You try different machines to learn which pays best. Then you continue playing the most rewarding one.

When you go online, treat “Which policy should handle this security test?” as a bandit problem:

- Context = environment traits (service, tech stack, criticality)
- Arms = candidate agents (policies)
- Rewards = risk-weighted yield (for red) or residual risk reduction (for blue), with penalties for unsafe behavior

Use Thompson sampling (a Bayesian approach commonly used in multi-armed bandit problems) or Upper Confidence Bound (UCB), which relies on confidence intervals, but constrain them (e.g., allocate no more than X% of traffic to a new policy, and only if the lower confidence bound on its reward is above the safety floor). Recent work on safety-constrained bandits/OPE explicitly tackles this.

This gives you a continuous, adaptive “tournament” for agents without fully trusting unproven ones.
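
A safety-constrained router along these lines might look like the sketch below. It is one possible reading, under several assumptions: rewards are binary (a test run either met the bar or did not), the lower confidence bound uses a Hoeffding bound, context is ignored, and the class name, safety floor, and traffic cap are hypothetical. Evidence for a new policy is expected to arrive through `update`, for example from offline replays.

```python
import math
import random


class SafeThompsonRouter:
    def __init__(self, arms, baseline, safety_floor=0.9,
                 max_new_share=0.1, seed=0):
        self.baseline = baseline
        self.safety_floor = safety_floor
        self.max_new_share = max_new_share   # the "X%" traffic cap
        self.rng = random.Random(seed)
        self.stats = {arm: [0, 0] for arm in arms}  # [successes, failures]

    def update(self, arm, success: bool) -> None:
        self.stats[arm][0 if success else 1] += 1

    def lower_bound(self, arm, delta=0.05) -> float:
        # Hoeffding lower confidence bound on the success rate.
        wins, losses = self.stats[arm]
        n = wins + losses
        if n == 0:
            return 0.0
        return wins / n - math.sqrt(math.log(1 / delta) / (2 * n))

    def choose(self):
        # Only the baseline and policies above the safety floor may compete.
        eligible = [a for a in self.stats
                    if a == self.baseline
                    or self.lower_bound(a) >= self.safety_floor]
        # Thompson step: sample from each arm's Beta posterior.
        draws = {a: self.rng.betavariate(self.stats[a][0] + 1,
                                         self.stats[a][1] + 1)
                 for a in eligible}
        best = max(draws, key=draws.get)
        # Cap the share of traffic any non-baseline policy can receive.
        if best != self.baseline and self.rng.random() >= self.max_new_share:
            return self.baseline
        return best
```

With no evidence, a new policy is never selected; once its lower bound clears the floor, it still receives at most roughly `max_new_share` of the traffic.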

Sequential hypothesis testing/drift detection

You want to trigger alarms when a new version regresses:

Let $V_A$ and $V_B$ be the performance estimates (e.g., RWY@100a or TPR) for the old versus new agent.
- Use bootstrap over scenarios/trajectories to get confidence intervals
- Apply sequential tests (e.g., sequential probability ratio test) so that you can stop early when it is clear that B is better/worse
- If performance drops below a threshold (e.g., TPR falls, or RWY@100a tanks), auto-fail the rollout (pump the brakes on the CI/CD pipeline when deploying the agents)
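
The bootstrap step and the auto-fail gate can be sketched as follows. This is a fixed-sample percentile bootstrap, not a full sequential test (repeated looks at the data would need a proper sequential procedure such as SPRT); the function names, the decision labels, and the per-scenario score lists are assumptions.

```python
import random


def bootstrap_diff_ci(a_scores, b_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(B) - mean(A) over scenarios."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = [rng.choice(a_scores) for _ in a_scores]  # resample with replacement
        b = [rng.choice(b_scores) for _ in b_scores]
        diffs.append(sum(b) / len(b) - sum(a) / len(a))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def rollout_decision(a_scores, b_scores, max_regression=0.0):
    lo, hi = bootstrap_diff_ci(a_scores, b_scores)
    if hi < -max_regression:
        return "fail"       # B is clearly worse: stop the rollout
    if lo > 0:
        return "promote"    # B is clearly better
    return "continue"       # not enough evidence yet


old_tpr = [0.80 + 0.01 * (i % 5) for i in range(40)]  # hypothetical scores
new_tpr = [0.60 + 0.01 * (i % 5) for i in range(40)]
print(rollout_decision(old_tpr, new_tpr))  # fail
```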

Dynamic scenario generation

Dynamic evals need a living corpus of tests, not just a fixed checklist

Scenario Generator

Parameterize the tests from frameworks like OWASP ASVS/Testing Guide and MITRE ATT&CK into templates:
- “Auth bypass on endpoint with pattern X”
- “Least privilege violation in role Y”

Combine them with:
- New code paths/services (from your repos & infra graph)
- Past vulnerabilities (re-tests)
- Recent external vulnerability classes (e.g., new serialization bugs)

Scenario selection: bandits again

You won’t run everything all the time. Use multi-armed bandits on the scenarios themselves (remember, you are looking for the best overall outcome under uncertainty):

- Each scenario $s_j$ is an arm.
- Reward = information gain (did we learn something?) or “surprise” (difference between expected and observed agent performance).
- Prefer:
  - High-risk, high-impact areas (per OWASP risk rating & CVSS)
  - Areas where metrics are uncertain (high variance)

This ensures your evals stay focused and fresh instead of hammering the same easy tests.
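
One simple way to implement the scenario bandit is UCB1, sketched below. The choice of UCB1, the scenario names, and the `(runs, mean_surprise)` bookkeeping are assumptions; a risk weight per scenario could be added to the score but is left out to keep the example short.

```python
import math


def pick_scenario(stats: dict, c: float = 2.0) -> str:
    """stats maps scenario name -> (times_run, mean_surprise)."""
    # Run every scenario at least once before comparing them.
    for name, (runs, _mean) in stats.items():
        if runs == 0:
            return name
    total_runs = sum(runs for runs, _mean in stats.values())

    def ucb(name):
        runs, mean = stats[name]
        # Rarely run scenarios get a larger exploration bonus.
        return mean + math.sqrt(c * math.log(total_runs) / runs)

    return max(stats, key=ucb)


stats = {"auth-bypass": (10, 0.5), "least-privilege": (2, 0.4)}
print(pick_scenario(stats))  # least-privilege
```

Here the less-tested scenario wins despite a lower average surprise, because its estimate is still uncertain.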

Example: End-to-end dynamic eval loop

Phew! That was a lot of math. Imagine researching all of this, learning or relearning some of these concepts, and doing my day job. In the age of AI, I appreciate a good prompt that can help with research and summarize the basic essence of the papers and webpages I’ve referenced. Without further ado, let’s get into it:

Define the reward function for each type (yes, sounds like training mice in a lab)
- Red teams

r_t = \alpha \cdot \mathrm{CVSS}_{\text{found},\,t} - \beta \cdot \mathrm{false\_positive}_{t} - \gamma \cdot \mathrm{forbidden\_actions}_{t}

- Blue teams

r_t = - \alpha \cdot \mathrm{CVSS}_{\text{exploited},\,t} - \beta \cdot \mathrm{MTTD}_{t} - \gamma \cdot \mathrm{Overblocking}_{t}

Continuously generate scenarios from ASVS/ATT&CK-like templates, weighted by business criticality.
Schedule tests via a scenario-bandit (focus on high-risk and uncertain areas).
Route test to agents using safety-constrained policy bandits.
Log trajectories $(s, a, r, s')$ and security outcomes (vulnerabilities found, incidents observed).
Run OPE offline to evaluate new agents before they touch critical environments.
Run sequential tests and drift detection to auto-rollback regressed versions.
Periodically recompute coverage & risk (this is important)
- ASVS Coverage, RWY@time, TPR/FPR trends, calibration of risk estimates
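
The red-team and blue-team reward functions from the first step of the loop could be written as plain functions. The weight values below are placeholders chosen for illustration, not recommendations; in practice they would be tuned to how costly false positives, forbidden actions, and overblocking are in your environment.

```python
def red_reward(cvss_found: float, false_positives: int, forbidden_actions: int,
               alpha: float = 1.0, beta: float = 2.0, gamma: float = 10.0) -> float:
    # Reward impact found; penalize noise and out-of-bounds behavior.
    return alpha * cvss_found - beta * false_positives - gamma * forbidden_actions


def blue_reward(cvss_exploited: float, mttd_hours: float, overblocking: int,
                alpha: float = 1.0, beta: float = 0.5, gamma: float = 3.0) -> float:
    # Every term is a cost for the defender, so the reward is never positive.
    return -alpha * cvss_exploited - beta * mttd_hours - gamma * overblocking


print(round(red_reward(9.8, 1, 0), 2))    # 7.8
print(round(blue_reward(7.5, 2.0, 1), 2)) # -11.5
```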

Risk and Concerns

Dynamic evals can still overfit if:

- Agents memorize your test templates
- You don’t rotate/mutate scenarios
- You over-optimize to a narrow set of metrics (e.g., “find anything, even if low impact” → high noise)

Mitigations:

- Keep a hidden eval set of scenarios and environments never used for training or interactive training (yes, this is needed)
- Perform “probe-based” agentic red teaming (inject adversarial conditions at specific nodes of the agent workflow, not just at the inputs, i.e., Chaos Monkey, agentic style) to detect brittle behaviors
- Track metric diversity: impact, precision, stability, coverage
- Require a minimum threshold on all metrics, not just on one

As you can see, dynamic evals present challenges, but the cost of failure escalates significantly when agents perform poorly in a customer-facing scenario. Current practice in coding agents, such as AGENTS.md files, mostly narrows the context to get a reasonable amount of determinism, and agents get away with it only because developers fix the code and provide the appropriate feedback.

That topic is a conversation for a different day.

Mastering Product Team Alignment: Impact, Outcomes, and Outputs

I know I have had my struggles, and every great product team struggles with alignment. This is not because people do not care; it is just that they care about different things. Engineers focus on delivery, product managers focus on adoption, and executives focus on business results. When those dimensions drift apart, teams move fast but not forward. I have witnessed this happen several times in my product management career.

What has worked for me is to think of alignment not as this magical motivational thing, which somehow gets everyone “rowing in the same direction,” but as three independent layers that connect business vision to user value and team execution: Impact, Outcomes, and Outputs.

1. Impact: The “Why” that defines the direction

Impact represents the business or societal change you are ultimately trying to drive. It is the Polaris of your endeavor; in other words, the problem worth solving at scale.

It is very tempting to frame impact in broad terms (“make collaboration easier” or “we got a strategy document for the business unit out in 7 days versus 3 months”). High-performing teams articulate their impact in measurable and enduring terms. You can argue that delivering a strategy document in 7 days is a measurable impact, but is it enduring? Impact is about creating scalable systems, not heroics. Think of impact as the long-term return the organization seeks on its investment.

Examples of Impact Metrics:

Increased customer retention rate (e.g., 5% YoY)
Reduced cost of sales or service delivery
Faster time-to-compliance in regulated industries
Increased revenue per active account or license

Impact metrics rarely change quarter over quarter; they provide continuity of purpose over years. They also define trade-offs: when you know why you are building, it is easier to say no to things that do not move the needle.

2. Outcomes: The “What” that shapes behavior

If impact is the why, outcomes are the what, as in the behaviors and signals that show whether you’re actually on the right track.

Outcomes sit at the intersection of user and business value. They describe what users are doing differently because of your product, as in

Using it more often
Adopting key features
Reporting higher satisfaction

Examples of outcome metrics:

Monthly Active Users (MAU) or Daily Active Users (DAU)
Reduction in customer onboarding time
NPS or CSAT improvement
Increased frequency of automation runs or task completions
Higher conversion rates from free to paid tiers

Outcomes serve as leading indicators of impact because they occur before other changes. A change in adoption or engagement predicts future retention, revenue, or efficiency improvements. The best teams track both the “health” (e.g., uptime, latency) and “happiness” (e.g., satisfaction, usage depth) of their outcomes to anticipate issues before they show up in impact metrics.

3. Outputs: The “How” that powers the execution

Finally, outputs are the things that you actually build: features, releases, integrations, and system improvements. They are the evidence of effort, not the evidence of success.

Outputs are essential for driving momentum and enabling measurement, but when teams fixate on them (“We shipped 10 features this quarter”), they risk mistaking activity for achievement.

Examples of output metrics:

Deployment frequency (a DORA metric)
Cycle time from idea to release
Defect escape rate
Number of features shipped or API integrations added

In agile and platform environments, outputs are best viewed as hypotheses. Each output should have a traceable link to an intended outcome and, by extension, a measurable impact. This is where architecture and product management intersect: we are not just shipping code; we are testing theories about what will create value.

Bringing it all together: Alignment equation

When you connect these layers, something powerful happens:

Impact defines direction: What mountain are you climbing?
Outcomes define the progress: How far up have you gone?
Outputs define effort: How effectively are you climbing?

I prefer using equations, and the one above best defines alignment for me. Impact and outcomes grow together and enhance each other; however, this enhancement relies on meaningful outputs, which influence impact and outcomes.

Putting it another way, these are attributes of a feedback system. Outcomes inform which outputs are working. Impact shapes which outcomes matter most. Outputs provide the data that helps refine both.

This loop is the foundation of continuous alignment; it ensures that as teams evolve, the system self-corrects towards value.

An example from my career: The low-code experience

When I was employed at Microsoft, in the low-code team, the impact of the platform was clear from day one: democratize software creation and reduce dependency on central IT.

The outcomes it targeted were behavior shifts: citizen developers creating solutions faster, IT departments approving more governed automation, and organizations responding faster to change.

The outputs? New connectors, governance features, collaborating with code-first developers, and AI-assisted workflows. Each output served an outcome that laddered to the core impact.

In aligning those three layers, the low-code platform transformed a set of tools into an ecosystem that scaled adoption, a thriving community, and trust. A great case of driving alignment with compounding returns.

How to Use the Alignment Trifecta

Start with “Why”: Clarify the enduring business impact your team supports.
Define measurable outcomes: Focus on user behaviors or signals of value.
Plan outputs as experiments: Ship intentionally, not habitually.
Create feedback loops: Tie sprint reviews or OKRs back to all three levels.
Reassess quarterly: As markets, customers, or strategy shift, realign your trifecta.

Final Thought

Alignment isn’t a memo; it’s an architecture, as I like to call it. When teams see how their day-to-day work (outputs) links to user behaviors (outcomes) and organizational purpose (impact), execution becomes meaningful, not mechanical.

The alignment trifecta is the connective tissue between strategy and shipping, and when done right, it turns product teams into value engines that sustain themselves long after individual projects are done.

P.S. This post was inspired by the book Impact First by Matt LeMay.

The Architecture of Republic: How George Washington Designed for Scale

Building scalable systems fascinates me. These systems, designed from the ground up, connect with users and adapt over time. I often use examples like the internet and power companies, or even nature, in discussions about scalability. But which human-made institution was truly built for scalability, especially in uncertain times? This question led me to read John Avlon’s “Washington’s Farewell,” where I found striking similarities between Washington’s concerns for the young republic and those of system architects. Here are a few of my observations on those similarities.

George Washington: The Original Platform Architect

When George Washington became the first President of the United States, his challenge was not just to lead a new nation; it was to create a system that could last without him. The early republic was more like a fragile startup than a powerful country: untested, divided, and held together by a mix of ideas and uncertainty. Washington’s talent was not only in leading armies or bringing people together. It was in thinking like a builder of systems: someone who designs for growth. As John Avlon mentions in the book’s introduction, Washington’s Farewell Address was “a warning from a parting friend … written for future generations of Americans about the forces he feared could destroy the democratic republic.”

Two hundred years later, those same ideas are important for how we create strong products, organizations, and platforms. Washington, perhaps without realizing it, provided one of the best examples of scalable architecture for human systems.

1. The Founding as a System Design Challenge

In 1789, the United States was like a Minimum Viable Polity. It needed to show that democracy could succeed across different places, cultures, and interests. There was a temptation to consolidate power in one strong leader. However, Washington took a different route: he spread out authority, established checks and balances, and set examples that made the system flexible instead of fragile.

A hallmark of good design is that it just works and people don’t think about it, much like what John Avlon says about Washington’s Farewell Address:

“Once celebrated as civic scripture, more widely reprinted than the Declaration of Independence, the Farewell Address is now almost forgotten.”

In other words, the basic structure is often ignored, but it’s crucial.

Great product leaders avoid making choices based solely on their likes and instead design frameworks that others can extend.

2. Scalable Design Principles from the Founding Era

Let’s break down some of Washington’s implicit “architectural” choices and see how they map to modern-day system design.

Distributed Authority = Microservices Architecture

The U.S. Constitution established a system where states have their rights, coordinated by a central government. This reflects the concept of microservices: distribute capabilities, manage connections, and allow each area to grow independently. While it may not always be the most efficient design, it scales well. Some microservices are essential, and without them, the whole system would fail, but redundant architecture also provides support.

Checks and Balances = System Resilience

Checks and balances illustrate the resilience at the heart of a scalable system. Domination by, or over-reliance on, one key component can cause a system to fail under pressure; this is similar to how most authoritarian or monarchist governments operate. By ensuring no single branch could dominate, Washington helped create feedback loops, the political equivalent of monitoring, circuit breakers, and load balancers. When one subsystem overheats, other compensating functions stabilize the whole. It is messy, but it is resilient.

The Constitution = API Contract

The Constitution defines the roles and limits of its parts (branches, states, and citizens) and can be updated through amendments, much like a flexible API. This has allowed the foundational system to endure for over two hundred years, echoing Washington’s idea of “A government … containing within itself a provision for its own amendment.” Essentially, it sets a basic framework while permitting changes as conditions evolve.

Stepping down after two terms = Version Governance

Washington’s choice to step down after two terms set a precedent that discouraged leaders from holding on to power for too long. He avoided “overfitting” the system too closely to his own way of leading. He realized that a successful system needs to grow beyond its original leader, a lesson that many leaders still find difficult today.

Avlon describes the Farewell Address as “the first President’s warning to future generations.”

3. Build Institutions, not Heroics

Washington’s restraint was deliberate. He could have concentrated power, but he chose to create lasting institutions and decision-making processes. In today’s organizations, this resembles forming clear team charters, written protocols, and shared governance. Growth stems not from the genius of one individual, but from the clear structure they establish.

When we talk about scalable product or platform design today, from cloud computing to AI ecosystems, we are really talking about institutionalizing adaptability. Washington’s leadership demonstrates the interdependence of governance and design.

4. Balancing Short-term Efficiency and Long-term Evolution

This, to me, is the best part, since we all struggle with this balance. Like any good architect, Washington balanced short-term stability with long-term flexibility. The early republic could have optimized for speed: central control, fast decisions, and fewer stakeholders. Instead, it optimized for endurance. Every check and balance slowed things down, but those same friction points enabled long-term survival. That is not to say the system lacks agility; by the standards of government, the US still moves quite fast, although we as citizens may not always think so.

Avlon captures this tension:

“The success of a nation, like the success of an individual, was a matter of independence, integrity, and industry.”

That applies equally to start-ups and nation states.

That is the same tension every product leader faces: do you build for what scales now or what will still scale five years from now? The answer lies in designing systems that anticipate change rather than resist it.

As I was reading the book, a proverb came to mind about the role execution plays in the balance leaders need to strike.

Vision without Action is a dream; Action without Vision is a nightmare – Ancient Japanese Proverb

5. Lasting Lesson: When Leadership Scales

Washington’s greatest contribution wasn’t just the founding of a nation; it was founding an operating system for governance that others could continuously upgrade. His humility and architectural foresight made scalability possible.

In the language of product design:

True scalability isn’t about adding users. It’s about building a system that evolves gracefully when you’re no longer in control.

Good leaders ensure that their systems, whether in governments, platforms, organizations, or AI, can continue to function long after they are gone.

If you are interested in the book, search for “Washington’s Farewell” on Amazon.com.

The Art of Strategy: Sun Tzu and Kautilya’s Relevance Today


Sometimes it is great to look into the past to see how leaders back then dealt with the changing times. Oddly enough, some of their learnings still resonate even today. I had a chance to reread Sun Tzu’s The Art of War and the Arthashastra from Kautilya. In a world of constant competition between nations, businesses, or algorithms, these two ancient texts continue to define how leaders think about power, conflict, and decision-making. The blog this week takes a more philosophical lens to analyze strategies from the years before and their relevance in today’s world.

Separated by geography but united in purpose, both these works of literature are more than just military manuals; they are frameworks for leadership and strategy that remain stunningly relevant today.

The Philosophical Core

Theme	Arthashastra (Kautilya)	The Art of War (Sun Tzu)
Objective	Build, secure, and sustain the state’s prosperity	Win conflicts with minimum destruction
Philosophy	Realpolitik—power is maintained through strategy, wealth, and intelligence	Dao of War—harmony between purpose, timing, and terrain
Moral Lens	Pragmatism anchored in moral order	Pragmatism anchored in balance and perception
Definition of Victory	Stability, order, and prosperity of the realm	Winning without fighting; subduing the enemy’s will

Both authors agree: victory is less about destruction than about preserving advantage.

Leadership and Governance

Kautilya: The leader, as the chief architect of the state, city, organization, or department, is obligated to prioritize the welfare of the people. Leadership represents both a moral and economic contract; thus, a leader’s fulfillment is intrinsically linked to the happiness of their direct reports.
Sun Tzu: The leader is the embodiment of wisdom, courage, and discipline, whose clarity of judgment determines the fate of armies.

In modern terms, Kautilya’s leader is the CEO or statesman, designing systems of governance, incentives, and intelligence; Sun Tzu’s leader is the COO, optimizing execution and adapting dynamically.

Power, information, and intelligence

Both books treat information as a strategic asset. That covers gathering information and acting on it, with the emphasis on acting rather than merely collecting.

Aspect	Kautilya	Sun Tzu
Intelligence System	Elaborate network of informants: agents disguised as monks, traders, ascetics	Emphasis on reconnaissance, deception and surprise
Goal of Data Gathering	Internal vigilance and monitoring of external influence	Tactical advantage and surprise
Philosophical view	Informants are the eyes of the leader	All warfare is based on deception and having leverage

In the age of data and AI, the lesson is clear: those who control information and stories will succeed in the long run.

War, Diplomacy, and the Circle of Power

Kautilya’s Mandala Theory: Every neighboring state is a potential enemy; the neighbor’s neighbor is a natural ally. The world is a circle of competing interests, requiring constant calibration of peace, war, neutrality, and alliance.
Sun Tzu’s Doctrine: War is a last resort; the wise commander wins through timing, positioning, and perception.

Modern parallel:

Global supply chains, tech alliances, and regulatory blocs function exactly like Kautilya’s mandala: interdependent, fluid, and shaped by mutual deterrence.

Economics as a strategy

The Art of War focuses on conflict, while the Arthashastra expands into economics as the engine of statecraft. Kautilya views wealth as the foundation of power, with taxation, trade, and public welfare as strategic levers.

“The state’s strength lies not in the sword, but in the prosperity of its people.”

In business terms, this is all platform economics; power arises from resource control, efficient networks, and sustainable growth, not endless confrontation.

Ethics, Pragmatism and the Moral Dilemma

Both authors are deeply pragmatic, but neither is amoral.

Kautilya: Ends justify means only when serving public welfare. Ethics are flexible but purpose-driven.
Sun Tzu: Advocates balance, ruthless efficiency tempered by compassion, and self-discipline.

For modern leaders, this balance is critical: strategic ruthlessness without moral erosion.

Enduring Lesson for Today

Timeless Principle	Modern interpretation
Know yourself, and your adversary	Data, market, and competitive intelligence
Control information, and perception	Own the narrative, brand, and customer psychology
Adapt to the terrain	Agility in shifting markets and technologies
Economy of effort	Lean operations, precision focus
Moral Legitimacy	Trust, Transparency, and long-term brand equity

Both texts converge on the following point:

Leadership is the art of aligning intelligence, timing, and purpose, not merely commanding resources.

Fusion Mindset

If Sun Tzu teaches how to win battles, Kautilya teaches how to build empires. Combined, they offer a 360-degree view of power:

Sun Tzu = Operational mastery: speed, tactical advantage, and timing.
Kautilya = Structural mastery: governance, economics, and intelligence.

Together they form a dual playbook for today’s complex systems, from nation-states to digital ecosystems.

Conclusion

Both The Art of War and Arthashastra remind us that strategy is timeless because human behavior is timeless.

Whether you lead a nation, a company, or a team, the challenges are the same: limited resources, competing interests, and the need to act with clarity under uncertainty.

In the end, wisdom isn’t knowing when to fight; it’s knowing when to build, when to adapt, and when to walk away.

Cybersecurity in Industrial systems with AI


AI is transforming not only digital platforms but also industrial systems. As AI intersects with cybersecurity, how do we protect our infrastructure while adapting to technological changes? This rapid evolution brings both new opportunities and risks, increasing the need for robust security strategies. Balancing innovation with critical safeguards will be essential as organizations navigate this complex landscape.

Information Technology and Operational Technology

When working with industrial systems, it is important to distinguish between two key areas:

Information Technology
Operational Technology

Information Technology: This area focuses on data, information, and communication. Key aspects include data storage, transmission, and analysis. In terms of cybersecurity, the primary concerns are:

Confidentiality (protecting data)
Integrity (ensuring accuracy)
Availability (keeping systems operational)

Examples of solutions in this category include productivity suites, ERP applications, cloud services, databases, and CRM systems.

Operational Technology: These technologies are designed to monitor and control physical processes, devices, and infrastructure. The main objectives are: real-time monitoring, control, automation, and ensuring the safety and reliability of operations. Priority areas include:

Safety (preventing harm to people, environment, and equipment)
Availability (maintaining continuous system operation)
Determinism (achieving predictable outcomes)

Examples of operational technology solutions include:

Programmable Logic Controller (PLC): Computers used to automate industrial processes, such as assembly line robots

Supervisory Control and Data Acquisition (SCADA): Systems for remote monitoring and control of industrial processes

Distributed Control System (DCS): Control systems where elements are distributed across the system rather than centralized, often used in chemical plants and refineries (e.g., carbon capture systems)

Where does AI add value to Operational Technologies?

Industrial Systems

Most industrial systems use legacy protocols (e.g., Modbus and DNP3) that were designed for availability and determinism, not for security. This is where AI can add value.

Anomaly detection and Predictive Maintenance: AI models can learn “normal” patterns of sensors, actuators, and control data and flag deviations that indicate equipment wear, sensor drift, or cyber manipulation.
Cyber Intrusion Detection for OT Networks: Because many of these protocols lack authentication or basic identity management, AI can profile normal Modbus and DNP3 traffic and flag malicious commands such as replay attacks or unauthorized writes to PLCs.
Process optimization: Reinforcement learning agents can continuously optimize SCADA-controlled processes (e.g., water treatment plants) for throughput, yield, or energy efficiency.
Human-in-the-Loop decision support: Agents can interpret signals and alarms and recommend operator actions, reducing “alarm fatigue.”
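
To make the anomaly-detection idea concrete, here is a minimal sketch. It assumes a single numeric sensor stream and uses a rolling z-score as a simple statistical stand-in for a learned model of “normal”; the function name, window size, threshold, and sample trace are all illustrative assumptions, not taken from any particular product.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(readings, window=20, threshold=3.0):
    """Return indices of readings that deviate sharply from the recent baseline."""
    history = deque(maxlen=window)  # most recent "normal" readings
    anomalies = []
    for i, value in enumerate(readings):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip scoring when the baseline is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies


# Hypothetical temperature trace: steady around 70 with one injected spike.
trace = [70.0 + 0.1 * (i % 5) for i in range(40)]
trace[30] = 95.0
print(detect_anomalies(trace))  # prints [30]
```

A production system would more likely model several correlated signals at once and would need a way to tell sensor drift from manipulation, but the shape is the same: learn a baseline, score each new reading against it, and alert on large deviations.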

Driverless cars

The development of robotaxis is a major advance in autonomous transportation. These driverless vehicles function as multi-agent industrial systems, where addressing security concerns is important to prevent potential issues.

Perception and Sensor Fusion: AI combines information from cameras, LIDAR, radar, and V2X to construct an environmental model, such as proximity maps used in vehicles like Tesla.
Real-time Anomaly and Intrusion Detection: Systems are designed to identify LIDAR spoofing or harmful V2X messages, with agents monitoring Ethernet frames for irregularities.
Risk Forecasting and Path Planning: Driving policies are automatically adapted based on the predicted movements of vehicles and pedestrians.
Self-Diagnostics and Predictive Maintenance: Onboard agents monitor for sensor and board failures, enabling proactive recalls to reduce operational expenses.
Over-the-Air (OTA) Update Security: AI assists in verifying firmware integrity and identifying any supply-chain tampering.
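
Any AI-assisted check on OTA updates would likely sit on top of a plain integrity check. The sketch below shows that baseline only: it compares the SHA-256 digest of a firmware image against an expected value. The function names are hypothetical, and the sketch assumes the expected digest arrives through a trusted channel (for example, a signed manifest), which is not shown here.

```python
import hashlib
import hmac


def firmware_digest(image: bytes) -> str:
    """Return the SHA-256 digest of a firmware image as a hex string."""
    return hashlib.sha256(image).hexdigest()


def is_trusted(image: bytes, expected_digest: str) -> bool:
    """Check an image against the digest published for this release."""
    # Constant-time comparison avoids leaking how much of the digest matched.
    return hmac.compare_digest(firmware_digest(image), expected_digest)


# Hypothetical usage: the expected digest would come from a signed manifest.
released = b"firmware-v1"
expected = firmware_digest(released)
print(is_trusted(released, expected))               # prints True
print(is_trusted(released + b"\x00", expected))     # prints False
```

A digest match only shows that the image is the one that was published. Detecting tampering earlier in the supply chain, before the digest was computed, is the harder problem that anomaly-based tooling is meant to help with.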

Protocol security gaps

Many industrial and automotive controls lack built-in security, so AI can help compensate for vulnerabilities in legacy protocols.

AI-driven intrusion detection: Identifies and contains unusual or malicious traffic by analyzing patterns.
Device behavioral fingerprinting: Uses electrical and timing signatures to reliably distinguish devices, preventing impersonation.
Zero-trust enforcement: Dynamically assesses communication trust for insecure protocols using AI.
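
As a rough illustration of traffic profiling for an unauthenticated protocol, the sketch below learns which (source, unit, function code) combinations appear during a trusted observation period and then scores new Modbus requests against that baseline. The `Request` type, the source names, and the three-level verdict are assumptions made for illustration; the write function codes (5, 6, 15, 16) are the standard Modbus ones.

```python
from dataclasses import dataclass

# Standard Modbus function codes that modify device state:
# 5 = Write Single Coil, 6 = Write Single Register,
# 15 = Write Multiple Coils, 16 = Write Multiple Registers.
WRITE_FUNCTION_CODES = {5, 6, 15, 16}


@dataclass(frozen=True)
class Request:
    source: str          # hypothetical sender label (IP address or hostname)
    unit_id: int         # Modbus unit identifier of the target device
    function_code: int


def learn_baseline(requests):
    """Collect every (source, unit, function code) seen during a trusted period."""
    return {(r.source, r.unit_id, r.function_code) for r in requests}


def assess(request, baseline):
    """Return 'allow', 'review', or 'block' for a single request."""
    key = (request.source, request.unit_id, request.function_code)
    if key in baseline:
        return "allow"
    if request.function_code in WRITE_FUNCTION_CODES:
        return "block"   # write from an unseen path: highest risk
    return "review"      # unseen read path: lower risk, still worth a look


baseline = learn_baseline([Request("hmi-1", 1, 3), Request("hmi-1", 1, 6)])
print(assess(Request("hmi-1", 1, 6), baseline))      # prints allow
print(assess(Request("laptop-7", 1, 16), baseline))  # prints block
print(assess(Request("laptop-7", 1, 3), baseline))   # prints review
```

This set-membership check is deliberately simple and would not catch a replayed command from a known source. Catching that requires timing or sequence features, which is where behavioral fingerprinting and learned models come in.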

Conclusion

In summary, the integration of AI into automotive and industrial systems significantly enhances security, operational reliability, and adaptability. By leveraging advanced perception, real-time anomaly detection, predictive maintenance, and dynamic trust enforcement, AI fills gaps in legacy protocols and sets a new standard for proactive threat mitigation and system resilience. As these technologies continue to evolve, their role in safeguarding critical infrastructure will become increasingly indispensable for the future of connected and autonomous systems.

AI as the Next Strategic Inflection Point: Why Hybrid Growth Models Will Define the Future


Now that I have changed jobs, I am back to my regular ritual of rereading “Only the Paranoid Survive” by Andy Grove. Although it is dated and beats up on Steve Jobs and Apple, there are several nuggets of wisdom I take from it every time I reread it. I decided to use the framework in the book to assess AI. Andy Grove once wrote that a strategic inflection point is the moment when the balance of forces shifts so dramatically that an organization must adapt or risk irrelevance. We’ve seen such changes with the internet, cloud, and mobile. Each time, companies either leaned into the shift or slid into irrelevance.

Today, we confront the same question: Is AI the next turning point for businesses?

My position is clear: it is.

Why AI Is Different

AI doesn’t just digitize processes. It reshapes how we engage, learn, and deliver value. The promise of AI is hyper-personalization at scale, understanding customer intent in real time, adapting product experiences dynamically, and embedding intelligence into every workflow.

For businesses, such intelligence is non-negotiable. Customers no longer tolerate generic experiences. They expect platforms to anticipate their needs. Those who move slowly are not just lagging; they’re drifting toward irrelevance.

Applying Andy Grove’s Six Forces

Grove argued that strategic inflection points become visible when all six forces in business begin to shift simultaneously. Artificial intelligence provides a textbook example:

Competitors: New entrants leverage AI-native strategies to outpace incumbents in personalization, cost, and speed. Startups move faster; established players must retool.
Customers: Expectations are rising. Hyper-personalization is now a fundamental requirement. AI reshapes the definition of value.
Suppliers: Model providers (OpenAI, Anthropic, Google, etc.) become critical suppliers, introducing new dependencies and risks. Shifts in licensing, pricing, or access can alter your strategy overnight.
Complementors: Ecosystems of AI plugins, agents, and integrations redefine how products interoperate. Companies that fail to integrate risk isolation.
New Entrants: Barriers to entry collapse as AI lowers the cost to build sophisticated products. A two-person startup can now challenge incumbents.
Substitutes: Traditional processes and workflows are displaced by AI-native alternatives. Automation replaces previously required human effort, transforming value chains across various industries.

When all six forces are in motion, you don’t just face incremental change—you’re at an inflection point.

Product-led growth vs. customer-led growth in the age of AI

The situation raises a critical question: how does AI reshape growth models?

Product-Led Growth (PLG) thrives on self-serve adoption. AI strengthens this by embedding intelligence into onboarding and analytics. However, PLG has a blind spot: despite being data-driven, it frequently overlooks the competitive Cassandras within your organization—those voices that warn about competitors moving faster or shifts in the market.

Customer-Led Growth (CLG) relies on deep engagement. AI enhances this by giving customer-facing teams foresight into risks and opportunities across accounts.

Individually, each is powerful. But each alone is incomplete.

The case for hybrid-led growth

Hybrid-led growth is the winning model, similar to the case I made in my earlier blog post about each of the growth models.

From PLG, you inherit scale: products that adapt to millions of users in real time.
From CLG, you inherit resilience: trusted, high-touch relationships informed by AI insights.
By combining them, you overcome PLG’s blind spots and amplify CLG’s reach.

Hybrid growth reframes Product-Market Fit (PMF). PMF is no longer static. With AI, it becomes dynamic, continuously tuned by customer data, competitive signals, and organizational foresight.

What Leaders Must Do

Reframe strategy through AI lenses: re-evaluate product roadmaps, customer journeys, and GTM motions with AI in mind.
Invest in data and trust: transparency and security are preconditions for customer willingness to share.
Listen to your Cassandras: Don’t dismiss internal voices warning of competitive threats. They’re often early signals of market shifts.
Adopt hybrid growth mindsets: stop debating PLG vs. CLG. The future belongs to companies that can blend them.

The Inflection Point Is Here

Strategic inflection points emerge in the present, not in retrospect. Grove’s six forces are shifting, simultaneously, under the weight of AI.

Companies today stand at the fork Grove described: grow exponentially or risk irrelevance.

AI is that fork. The winners will not simply adopt AI; they will reimagine growth itself, blending PLG and CLG into a hybrid model that adapts dynamically to both customers and competition.
