
Foundation First, Not AI First

The Patterns That Built the Internet Will Build the Agentic Future


Every pitch deck in 2026 leads with “AI First.” Every product strategy document genuflects before the altar of large language models before addressing anything else. Every engineering roadmap treats AI integration as the foundational decision from which all other decisions flow.

This is backwards. And two decades of distributed systems engineering already proved why.

Claude can build you a beautiful application in minutes. But if that application lacks circuit breakers, observability, state management, and fault isolation, it will collapse the moment it meets production traffic. The model is not the product. The foundation is the product. The model is a component.


The Seduction of “AI First”

“AI First” as a strategy sounds compelling because it promises differentiation. It implies that the intelligence layer is the moat, the product, and the competitive advantage all at once. Executives hear “AI First” and see leapfrogged roadmaps, reduced headcount, and disrupted markets.

What “AI First” actually produces, in practice, is a fragile application wrapped around an API call.

Consider what happens when an organization builds AI First without foundational engineering discipline. The LLM handles the happy path beautifully. Then the API rate-limits. Then the context window overflows. Then the agent hallucinates in a customer-facing workflow. Then the orchestration layer drops a message between two agents that were supposed to coordinate. Then the memory store loses state mid-session.

Every one of these failure modes has a well-understood solution in distributed systems literature. And every one of these failure modes is being rediscovered, from scratch, by teams that skipped the foundation.



The Distributed Systems Playbook: Older Than You Think

The patterns that make agentic AI systems reliable are not new. They are borrowed, sometimes consciously and sometimes accidentally, from decades of distributed computing research. The convergence is not a coincidence. It is an inevitability. Multi-agent systems are distributed systems. The moment you have two agents coordinating across a shared task, you have entered the domain of consensus, fault tolerance, and state management whether you acknowledge it or not.

Milosevic and Odell formalized this connection in their January 2026 paper “Architecting Agentic Communities using Design Patterns” (arXiv:2601.03624). They explicitly derive agentic design patterns from enterprise distributed systems standards and formal methods. Their taxonomy classifies patterns into three tiers: LLM Agents for task-specific automation, Agentic AI for adaptive goal-seeking, and Agentic Communities for organizational frameworks where agents and humans coordinate through formal roles, protocols, and governance structures. The architectural lineage is unmistakable. These are not novel AI patterns. They are service-oriented architecture patterns with a new cognitive substrate.


The Pattern Map: Distributed Computing → Agentic AI

The parallels are structural, not metaphorical. Every major infrastructure pattern emerging in the agentic AI space has a direct ancestor in distributed computing.

Orchestration

In distributed systems, orchestration engines like Kubernetes, Apache Airflow, and Temporal coordinate service execution, manage dependencies, handle retries, and enforce ordering guarantees. In the agentic world, LLM orchestration frameworks like LangGraph, CrewAI, and AutoGen perform identical functions: they coordinate agent execution, manage tool dependencies, and enforce workflow ordering.
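As a framework-agnostic sketch of what this layer actually does, here is a minimal orchestrator that enforces dependency ordering and retries each step before failing the workflow. All names are illustrative; real frameworks (LangGraph, Temporal) add far more.

```python
from typing import Callable


class Orchestrator:
    """Minimal agent orchestrator: topological ordering plus per-step
    retries. Illustrative sketch only."""

    def __init__(self, max_retries: int = 2):
        self.steps: dict[str, tuple[Callable, list[str]]] = {}
        self.max_retries = max_retries

    def add_step(self, name: str, fn: Callable, depends_on=()) -> None:
        self.steps[name] = (fn, list(depends_on))

    def run(self) -> dict:
        results, done = {}, set()
        while len(done) < len(self.steps):
            progressed = False
            for name, (fn, deps) in self.steps.items():
                if name in done or not all(d in done for d in deps):
                    continue
                # Retry each step before failing the whole workflow.
                for attempt in range(self.max_retries + 1):
                    try:
                        results[name] = fn(results)
                        break
                    except Exception:
                        if attempt == self.max_retries:
                            raise
                done.add(name)
                progressed = True
            if not progressed:
                raise ValueError("dependency cycle or unknown dependency")
        return results
```

The point is not the twenty lines of Python; it is that ordering, dependency management, and retry logic live in infrastructure, not inside any individual agent.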

The paper by Drammeh on multi-agent LLM orchestration for incident response (arXiv:2511.15755) demonstrated that orchestrated multi-agent systems achieved a 100% actionable recommendation rate compared to 1.7% for single-agent approaches. The insight is not that the model was better. The insight is that the orchestration was better. The infrastructure made the intelligence useful.

Stateful Sessions and Memory

Distributed systems solved session affinity and state management decades ago. Sticky sessions, distributed caches, and event sourcing patterns all address the same fundamental problem: how do you maintain coherent state across multiple service invocations that may occur on different nodes?

Agentic AI is now solving the same problem under a different name. Agent “memory,” whether short-term context windows, long-term vector stores, or persistent session state, is distributed state management. The challenges are identical: consistency across nodes, durability under failure, and efficient retrieval under load. The Jiang et al. survey on agent adaptation (arXiv:2512.16301) categorizes memory as a core adaptation mechanism, but the underlying engineering is cache management and state replication.
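A minimal sketch of the event-sourcing pattern applied to agent memory, assuming a durable append-only log (represented here by a plain list) as the store; the event shapes are illustrative:

```python
import json


class EventSourcedMemory:
    """Agent 'memory' as an append-only event log (event sourcing).
    State is never mutated in place; it is rebuilt by replaying the log,
    so anything durably appended survives a crash."""

    def __init__(self):
        self.log: list[str] = []  # stand-in for a durable append-only store

    def append(self, event: dict) -> None:
        self.log.append(json.dumps(event))  # serialize at the write boundary

    def replay(self) -> dict:
        state = {"turns": [], "facts": {}}
        for raw in self.log:
            event = json.loads(raw)
            if event["type"] == "turn":
                state["turns"].append(event["text"])
            elif event["type"] == "fact":
                state["facts"][event["key"]] = event["value"]
        return state
```

Swap the list for an actual log store and this is the durability story most "agent memory" products are quietly rebuilding.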

Service Mesh → LLM Mesh and Agentic Mesh

This is where the convergence becomes most striking. In distributed computing, the service mesh pattern (Istio, Linkerd, Consul Connect) emerged to solve a specific problem: as the number of microservices grew, managing service-to-service communication, security, observability, and traffic routing at the application layer became untenable. The mesh moved these cross-cutting concerns into infrastructure.

The same pattern is emerging for LLM and agentic systems. “LLM-Mesh,” as described by researchers at UIUC (arXiv:2507.00507), addresses elastic resource sharing across heterogeneous hardware for serverless LLM inference. The concept parallels the service mesh exactly: abstract the complexity of model routing, load balancing, and resource allocation into an infrastructure layer so that application developers can focus on business logic.
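The routing half of that idea can be sketched in a few lines. The backend names and single-number cost metric below are hypothetical; a real mesh would weigh latency, load, and hardware placement as well.

```python
class ModelRouter:
    """Mesh-style routing policy: pick the cheapest healthy backend within
    a cost budget, failing over automatically as health changes."""

    def __init__(self, backends: dict):
        # e.g. {"small": {"cost": 1, "healthy": True}, ...}
        self.backends = backends

    def route(self, max_cost: int) -> str:
        candidates = [
            (meta["cost"], name)
            for name, meta in self.backends.items()
            if meta["healthy"] and meta["cost"] <= max_cost
        ]
        if not candidates:
            raise RuntimeError("no healthy backend within budget")
        return min(candidates)[1]  # cheapest healthy backend wins
```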

The agentic mesh extends this further. The Model Context Protocol (MCP) and Google’s Agent-to-Agent (A2A) protocol are standardizing inter-agent communication in the same way that gRPC and service mesh sidecars standardized inter-service communication. The paper on multi-agent orchestration architectures (arXiv:2601.13671) describes MCP and A2A as establishing an “interoperable communication substrate” for agent coordination. Substitute “service” for “agent” and you are reading a 2018 paper on Istio.

MLOps, LLMOps, and the CI/CD Parallel

DevOps gave us CI/CD pipelines, blue-green deployments, canary releases, and automated rollbacks. MLOps applied the same principles to model training and deployment. LLMOps extends them further to prompt management, hallucination monitoring, and token cost tracking.

The pattern is identical each time: take a new computational paradigm, realize that artisanal manual deployment does not scale, and rediscover that automated pipelines with observability and rollback capabilities are the only path to production reliability. The MLOps lifecycle framework (arXiv:2503.15577) maps directly to the DevOps lifecycle. The tools have different names. The principles are unchanged.
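For illustration, the canary-with-rollback pattern that LLMOps inherits from DevOps reduces to a few lines once the metrics exist. Here `health_check` is a hypothetical stand-in for whatever gate you use (error rate, hallucination rate, token cost):

```python
def canary_deploy(current, candidate, health_check, steps=(1, 10, 50, 100)):
    """Canary rollout with automated rollback: shift traffic to the
    candidate version in stages, reverting the moment the gate fails.
    `health_check(version, pct)` is a hypothetical metric gate."""
    for pct in steps:
        if not health_check(candidate, pct):
            return current  # automated rollback to the known-good version
    return candidate  # candidate survived every stage: full cutover
```

The same function deploys a microservice, a model, or a prompt version. Only the health check changes.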

Scaling Laws: The CAP Theorem of Agents

Kim et al.’s “Towards a Science of Scaling Agent Systems” (arXiv:2512.08296) derived quantitative scaling principles for multi-agent architectures. Their findings read like a distributed systems textbook: centralized coordination improves performance by 80.8% on parallelizable tasks but degrades sequential reasoning by 39–70%. Independent agents amplify errors 17.2 times. There is a capability saturation point beyond which adding more agents yields diminishing or negative returns.

These are not AI insights. These are Amdahl’s Law and the CAP theorem wearing different clothes. Parallelizable workloads benefit from distribution. Sequential workloads do not. Coordination has overhead. Consistency and partition tolerance trade off against each other. The distributed systems community established these principles decades ago. The agentic AI community is now empirically rediscovering them.
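Amdahl’s Law makes the ceiling concrete. A one-line calculation shows why adding agents to a half-sequential task cannot even double throughput:

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Amdahl's Law: speedup = 1 / ((1 - p) + p / n).
    The sequential fraction (1 - p) caps the gain no matter how many
    workers (or agents) you add."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_workers)

# A task that is 50% sequential never reaches 2x, even with 100 agents:
# amdahl_speedup(0.5, 100) ≈ 1.98
# A fully parallel task scales linearly: amdahl_speedup(1.0, 8) == 8.0
```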


What “Foundation First” Actually Means

Foundation First does not mean ignoring AI. It means building the infrastructure that makes AI reliable before building the AI features that make the product exciting.

Concretely, Foundation First means:

Observability before intelligence. You cannot debug an agent you cannot observe. Instrument tracing, logging, and metrics for every agent interaction before you build the agent itself. The distributed systems community learned this lesson with microservices. The agentic community is learning it now with hallucination monitoring and prompt observability.
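A minimal sketch of that instrumentation, with an in-memory list standing in for a real exporter (OpenTelemetry, LangSmith, or similar):

```python
import functools
import time

TRACE: list[dict] = []  # in-memory stand-in for a real trace exporter


def traced(name: str):
    """Record a span-like entry (name, status, latency) around every
    agent or tool call, including failures."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return fn(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                TRACE.append({
                    "span": name,
                    "status": status,
                    "ms": (time.perf_counter() - start) * 1000.0,
                })
        return wrapper
    return decorator
```

Wrap every agent entry point before shipping; the trace is what lets you answer “which step hallucinated?” after the fact.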

Fault isolation before orchestration. Circuit breakers, retry policies, dead-letter queues, and graceful degradation paths must exist before you chain agents together. A single hallucinating agent in an unprotected pipeline can corrupt an entire workflow. Bulkhead patterns are not optional.
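The classic circuit-breaker pattern translates directly. This sketch fails fast to a fallback once a call has failed repeatedly; the thresholds and fallback value are illustrative:

```python
import time


class CircuitBreaker:
    """Circuit breaker around an unreliable model/agent call: after
    `threshold` consecutive failures the circuit opens and calls return
    the fallback immediately until `cooldown` seconds pass, when a
    single probe call is allowed through."""

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback  # open: fail fast, degrade gracefully
            self.opened_at = None  # half-open: allow a single probe
        try:
            result = fn(*args, **kwargs)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            if fallback is not None:
                return fallback
            raise
```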

State management before memory. Decide how you will manage agent state—what is ephemeral, what is persistent, what requires consistency guarantees—before you implement “memory.” Vector stores are not a state management strategy. They are a retrieval optimization. The state management strategy is the architecture decision that determines whether your system survives a failure.

Protocol standardization before integration. Adopt MCP, A2A, or whatever communication standard your ecosystem supports before you build bespoke agent-to-agent integrations. Every point-to-point integration you build today is technical debt you will pay interest on tomorrow. The service mesh pattern exists because point-to-point service integration did not scale. The same is true for agents.
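MCP builds on JSON-RPC 2.0 message envelopes. The sketch below is a generic illustration of why one shared envelope beats bespoke integrations (N adapters instead of N×(N−1) point-to-point translations); it is not the MCP wire format itself, and the method name is only an example.

```python
import json


def make_request(method: str, params: dict, request_id: int) -> str:
    """Build a JSON-RPC 2.0 request envelope of the kind MCP builds on."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


def parse_request(raw: str) -> tuple:
    """Parse and validate an envelope at the boundary."""
    msg = json.loads(raw)
    if msg.get("jsonrpc") != "2.0":
        raise ValueError("reject non-conforming peers at the boundary")
    return msg["method"], msg["params"]
```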

Evaluation infrastructure before deployment. In my post on dynamic evaluations, I argued that evaluation loops measure performance and enforce constraints but do not create new knowledge. The same applies here: build the evaluation infrastructure first, then deploy the agents into it. Do not deploy first and evaluate later. The distributed systems equivalent is deploying without monitoring. Everyone knows it is wrong. Everyone does it anyway.
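A minimal evaluation gate, sketched as a pass-rate threshold over a fixed case suite. Exact-match scoring is a simplification; real harnesses add graded scoring, tracing, and adversarial cases.

```python
def eval_gate(agent, cases: list, min_pass_rate: float = 0.9) -> bool:
    """Deployment gate: the agent ships only if it clears a pass-rate
    threshold on a fixed evaluation suite -- the deploy-time analogue
    of a CI test gate."""
    passed = sum(
        1 for case in cases if agent(case["input"]) == case["expected"]
    )
    return passed / len(cases) >= min_pass_rate
```

Wire this in front of deployment, not behind it, and "deploy first, evaluate later" becomes structurally impossible.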


The Anti-Patterns for Leaders

“We Are an AI Company”

No. You are a company that uses AI. The distinction matters. An “AI company” identity encourages teams to center every decision on the model. A company that uses AI centers decisions on the customer problem and selects the best tool, AI or otherwise, for each component of the solution. Sometimes the best tool is a deterministic rules engine. Sometimes it is a relational database query. Sometimes it is a well-designed form. AI First thinking makes these options invisible.

Skipping Infrastructure to Ship the Demo

The demo always works. The demo runs on a single API call with a curated prompt against a known-good input. Production is not the demo. Production is 10,000 concurrent users with adversarial inputs, network partitions, rate limits, and a context window that fills up faster than anyone predicted. Every month I see teams ship the demo and then spend six months building the infrastructure they should have built first.

Treating the Model as the Moat

Foundation models are commoditizing. The moat is not the model. The moat is the data pipeline, the evaluation infrastructure, the orchestration layer, the fault tolerance mechanisms, and the domain-specific workflows that make the model useful in a specific context. These are all foundational engineering investments. They are not glamorous. They are the reason some AI products work and others do not.

Ignoring the Distributed Systems Literature

The agentic AI community is producing excellent research. But much of it is rediscovering principles that the distributed systems community established years ago. Leaders who staff their AI teams exclusively with ML engineers and ignore distributed systems expertise are building on sand. The hard problems in agentic AI are increasingly infrastructure problems, not model problems.


The Convergence Table

| Distributed Computing Pattern | Agentic AI Equivalent | Why It Matters |
| --- | --- | --- |
| Service Orchestration (K8s, Temporal) | Agent Orchestration (LangGraph, CrewAI) | Coordination, dependency management, retry logic |
| Service Mesh (Istio, Linkerd) | LLM Mesh / Agentic Mesh (MCP, A2A) | Cross-cutting concerns: auth, observability, routing |
| Session Affinity / Distributed Cache | Agent Memory (vector stores, context windows) | State coherence across invocations |
| CI/CD Pipelines | MLOps / LLMOps Pipelines | Automated deployment, rollback, version control |
| Circuit Breakers (Hystrix) | Agent Fallback / Guardrails | Fault isolation, graceful degradation |
| Event Sourcing / CQRS | Agent Action Logs / Audit Trails | Reproducibility, debugging, compliance |
| Load Balancing | Model Routing / LLM Gateway | Cost optimization, latency management |
| API Gateway | LLM Gateway / Orchestration Layer | Rate limiting, auth, request transformation |
| Observability (Prometheus, Jaeger) | LLM Observability (Arize, LangSmith) | Tracing, hallucination detection, cost tracking |
| CAP Theorem Tradeoffs | Agent Scaling Laws (Kim et al.) | Coordination overhead vs. parallelism gains |

The Bottom Line

The infrastructure patterns that powered the internet, the cloud, and the microservices revolution are the same patterns that will power the agentic AI era. They are not optional. They are not “nice to have after launch.” They are the foundation without which no AI system survives production.

“AI First” is a marketing strategy. “Foundation First” is an engineering strategy. One gets you a demo. The other gets you a product.

The organizations that win the next five years will not be the ones that adopted AI the fastest. They will be the ones that built the most resilient foundations and then deployed AI into an infrastructure designed to make it reliable, observable, and recoverable.

Kant would remind us that reason without grounded experience produces illusions. The same is true for AI without grounded infrastructure. Build the foundation. Then build the intelligence. Not the other way around.


References

  1. Milosevic, Z. and Odell, J. “Architecting Agentic Communities using Design Patterns.” arXiv:2601.03624 (January 2026).
  2. Kim, Y. et al. “Towards a Science of Scaling Agent Systems.” arXiv:2512.08296 (December 2025).
  3. Drammeh, P. “Multi-Agent LLM Orchestration Achieves Deterministic, High-Quality Decision Support for Incident Response.” arXiv:2511.15755 (November 2025).
  4. “LLM-Mesh: Enabling Elastic Sharing for Serverless LLM Inference.” arXiv:2507.00507 (July 2025).
  5. “The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption.” arXiv:2601.13671 (January 2026).
  6. “Navigating MLOps: Insights into Maturity, Lifecycle, Tools, and Careers.” arXiv:2503.15577 (March 2025).
  7. Jiang, P. et al. “Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills.” arXiv:2512.16301 (December 2025).
  8. Gangadharan, G.R. et al. “Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation.” arXiv:2601.12560 (January 2026).
  9. Kanakasabesan, K. “AGI isn’t here yet: Why OpenClaw, Agents and LLM Systems are still just ANI.” https://kanakasabesan.com/2026/03/09/agi-isnt-here-yet-why-openclaw-agents-and-llm-systems-are-still-just-ani/
  10. Kanakasabesan, K. “Your Agents are not safe and your evals are too easy.” https://kanakasabesan.com/2025/11/21/your-agents-are-not-safe-and-your-evals-are-too-easy/
  11. Kanakasabesan, K. “Measuring What Matters: Dynamic Evaluation for Autonomous Security Agents.” https://kanakasabesan.com/2025/12/15/measuring-what-matters-dynamic-evaluation-for-autonomous-security-agents/

Self-Improving Agents Are Still ANI: Why Hyperagents Don’t Change the Classification

A Kantian Lens on Machine Intelligence


In March 2026, Meta published a paper on hyperagents: systems that rewrite the mechanism by which they improve themselves. Performance compounds across runs. Meta-level gains transfer across domains. The system improves its own improvement process.

If that doesn’t sound like AGI, I don’t know what does.

Except it isn’t. And a philosopher who died in 1804 already explained why.


What Are Hyperagents?

To understand why hyperagents matter and why they do not change the classification of intelligence, we need to trace a brief lineage.

Darwin Gödel Machine (DGM)

Published in May 2025 by Zhang et al., the Darwin Gödel Machine is a coding agent that iteratively modifies its own source code and empirically validates changes against benchmarks. It maintains an archive of agent variants through Darwinian selection, where successful modifications survive and unsuccessful ones are pruned. On SWE-bench, it improved from 20.0% to 50.0% through self-modification.

The key structural insight: because the task domain (coding) and the self-modification mechanism (also coding) share the same medium, improvements in one naturally feed the other. Better coding ability produces better self-modification, which produces better coding ability. This virtuous cycle is real and measurable.
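The loop just described has a simple abstract shape: sample a parent from the archive, self-modify, validate empirically, and keep the child only if it scores at least as well. The sketch below captures that shape with toy agents; it is an illustration of the idea, not the authors’ implementation.

```python
import random


def evolve(seed_agent, mutate, score, generations: int = 50, seed: int = 0):
    """Abstract shape of an archive-based self-improvement loop:
    sample a parent, self-modify, benchmark, and archive the child
    only if it validates. Toy sketch, not the DGM codebase."""
    rng = random.Random(seed)
    archive = [(score(seed_agent), seed_agent)]
    for _ in range(generations):
        parent_score, parent = rng.choice(archive)  # sample from the archive
        child = mutate(parent, rng)                 # the self-modification step
        child_score = score(child)                  # empirical validation
        if child_score >= parent_score:             # Darwinian selection
            archive.append((child_score, child))
    return max(archive)  # best (score, agent) pair found
```

Note what the loop cannot do: `score` is fixed from outside, so the system can climb toward the benchmark but never redefine what counts as better. That constraint is the hinge of the argument that follows.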

DGM-Hyperagents (DGM-H)

Published in March 2026, hyperagents address a specific limitation of DGM. In the original system, the meta-mechanism (the process for generating improvements) was hand-crafted and fixed by human designers. DGM-H makes the meta-mechanism itself editable. The system now merges a task agent and a meta agent into a single self-modifiable program.

The authors call this “metacognitive self-modification.” Meta-level improvements, such as persistent memory, performance tracking, and improved code editing strategies, transfer across fundamentally different domains, including coding, academic paper review, and robotics reward function design. These improvements accumulate across runs.

At a Glance

| Attribute | DGM (2025) | DGM-H / Hyperagent (2026) |
| --- | --- | --- |
| Self-modifies code | Yes | Yes |
| Meta-mechanism editable | No (hand-crafted) | Yes (self-referential) |
| Cross-domain transfer | No (coding only) | Yes (coding, review, robotics) |
| Accumulates across runs | Limited | Yes |
| Foundation model modified | No (frozen) | No (frozen) |

That last row is the entire argument. Hold it for now. We will return to it.


Why This Looks Like AGI: The Strongest Case

Intellectual honesty requires acknowledging the strength of the counterargument before engaging with it.

In my earlier post on ANI classification, I established four criteria for AGI:

  1. Learning new domains from raw data without explicit programming
  2. Transferring reasoning across unrelated disciplines
  3. Generating and refining internal context models autonomously
  4. Forming and pursuing long-term goals without human direction

On the surface, Hyperagents appear to make progress on criteria one and two. The system does improve across domains. The meta-level improvements do accumulate without being explicitly programmed. For the first time, there is a credible research artifact that seems to blur the line between narrow optimization and general capability.

This is the strongest counterargument to the ANI thesis that has appeared in the literature. It deserves a rigorous response.


Enter Kant: The Critique of Pure Reason as an Evaluation Framework

Why Kant?

Immanuel Kant’s Critique of Pure Reason (1781) asked a question that maps directly to the AGI debate: What are the conditions that make knowledge possible?

Kant was not asking what we know. He was asking what must be true about a knowing entity for knowledge to exist at all. That question, about the preconditions for intelligence rather than its outputs, is exactly the question we should be asking about self-improving agents.

Most commentary on hyperagents focuses on what the system does. Kant’s framework forces us to ask what the system is. And that distinction is the one that matters.

The Four Kantian Tests for Intelligence


Test 1: Analytic versus Synthetic Judgment

Kant distinguished between two types of knowledge.

Analytic judgments are those where the predicate is already contained in the subject. They decompose what is already known. “All bodies are extended” is analytic because extension is part of the definition of “body.” These judgments clarify, but they do not extend knowledge.

Synthetic judgments add something new. “All bodies are heavy” is synthetic because weight is not contained in the concept of body itself. You need to go beyond the concept to learn something the concept alone does not contain.

Kant’s central question concerned synthetic a priori knowledge: knowledge that is both genuinely new (synthetic) and necessarily true independent of any particular experience (a priori). He argued that mathematics, the foundations of natural science, and the preconditions for experience itself all belong to this category.

Application to Hyperagents: DGM-H operates almost entirely within the domain of analytic judgment. When the system rewrites its code to add persistent memory or improve its editing tools, it is decomposing and recombining patterns that already exist within its training distribution. The foundation model’s representations, the “concepts” it possesses, remain fixed. The scaffolding improvements are analytic rearrangements of known capabilities.

A system capable of synthetic judgment would generate knowledge that is not contained in its existing representations. It would discover that “bodies are heavy” without having the concept of weight anywhere in its training data. DGM-H does not do this. It recombines existing knowledge more efficiently. This is sophisticated analysis, not synthesis.

Smell test: “Does the system discover knowledge that was not already latent in its foundation model? Or does it rearrange what the model already knows?”


Test 2: Phenomena versus Noumena, the Boundary of Knowable Reality

Kant argued that human knowledge is limited to phenomena: the world as it appears to us through the structures of our perception (space, time, and the categories of understanding). Behind appearances lies the noumenon, the thing-in-itself, which remains fundamentally unknowable.

This is not a limitation to be overcome. It is a structural boundary of cognition itself.

Application to Hyperagents: The foundation model is the noumenal boundary of the Hyperagent system. The system interacts with token representations, benchmark scores, and code outputs. All of these are phenomena. It can modify how it processes these appearances through better tools, better prompts, and better memory. But it cannot reach through to modify the foundation model itself, the cognitive substrate that determines what can appear in the first place.

DGM-H optimizes within the phenomenal world of its own outputs. It has no access to its own noumenon. The weights are frozen. The representations are fixed. The boundary cannot be crossed through scaffolding improvements, no matter how recursive those improvements become.

When Kant said that concepts without intuitions are empty, he meant that pure logical manipulation without grounded experience cannot produce real knowledge. A Hyperagent that rewrites its own orchestration code without modifying its representational substrate is manipulating concepts without changing the intuitions that ground them.

Smell test: “Is the system modifying how it perceives, or only how it organizes what it already perceives?”


Test 3: The Transcendental Unity of Apperception, the Missing “I Think”

This is perhaps Kant’s most profound contribution to the question of intelligence. Kant argued that all experience requires a transcendental unity of apperception: the “I think” that must be capable of accompanying all representations. Without a unified self that integrates perceptions into a coherent whole, there is no experience, no knowledge, and no cognition.

This is not consciousness in the mystical sense. It is a structural requirement: for knowledge to cohere, there must be a unifying perspective that holds representations together across time.

Application to Hyperagents: DGM-H maintains an archive of agent variants. Each variant is evaluated independently. There is no unified “self” that integrates the experience of all variants into coherent understanding. The system is a population of narrow programs being selected by benchmark fitness, not a single entity that learns from its accumulated experience in a unified way.

When a Hyperagent “transfers” a meta-level improvement from the coding domain to the robotics domain, it is not a single intelligence applying cross-domain reasoning. It is a code pattern being reused in a different context. There is no “I think” that accompanies the transfer. There is no unified apperception binding the coding experience to the robotics experience into a single coherent worldview.

Darwinian selection and coherent self-awareness are fundamentally different mechanisms. Evolution produces fit organisms. It does not produce a single organism that understands why it is fit.

Smell test: “Is there a unified perspective that integrates learning across domains, or is there a population of narrow variants being selected by an external fitness function?”


Test 4: The Limits of Pure Reason, Why Recursive Optimization Has a Ceiling

Kant’s Critique was, at its core, an argument about limits. Pure reason, meaning reasoning without grounded experience, cannot extend knowledge beyond the bounds of possible experience. When reason attempts to do so (Kant called these attempts “transcendental illusions”), it generates contradictions and paradoxes, not knowledge.

Application to Hyperagents: The evaluation function is the bound of possible experience for DGM-H. The system can only know what the benchmark measures. It cannot reason about value, purpose, or knowledge that exists outside the evaluation function’s scope.

Recursive self-improvement within a fixed evaluation function is analogous to Kant’s critique of dogmatic metaphysics: reason operating on itself, in an unbounded way, without the grounding constraints that make knowledge possible. The system can optimize endlessly, but it cannot transcend the boundary defined by its evaluation criteria.

This maps directly to my argument in the dynamic evaluations post: evaluation loops measure performance and enforce constraints, but they do not create new knowledge or abstraction. DGM-H elevates this from a single evaluation step to an evolutionary cycle, but the fundamental Kantian constraint holds. Optimization within a bounded evaluation function, no matter how recursive, cannot produce unbounded intelligence.

Smell test: “Can the system define what ‘better’ means, or does it only optimize for a definition of ‘better’ that was given to it?”


The Technical Dissection: Reinforcing the Philosophy with Architecture

The Kantian analysis is not merely philosophical. It maps directly to concrete architectural facts about how DGM-H works.

1. The Foundation Model Is Frozen

DGM-H modifies scaffolding: tools, prompts, workflows, and memory management. The foundation model weights never change. This is the architectural expression of the phenomena/noumena boundary. The system rearranges appearances without touching the cognitive substrate.

2. Goals Are Human-Defined and Externally Imposed

Every DGM-H run starts with a human-selected benchmark. The system does not choose what to improve at. It does not formulate its own research questions. Kant’s “I think” would require autonomous goal formation, where the system decides for itself what matters. DGM-H has no such capacity.

3. Cross-Domain Transfer Is Scaffolding Reuse, Not Reasoning Transfer

The meta-level improvements that transfer across domains (persistent memory, performance tracking) are infrastructure patterns. A human who learns music theory and applies harmonic reasoning to wave physics is performing synthetic judgment, connecting concepts that were not previously connected. An agent that reuses a memory management pattern across domains is performing analytic reapplication.

4. Self-Modification Operates on Code, Not on Representation

DGM-H modifies Python source code. It does not modify attention patterns, learned features, or representational structures. In Kantian terms, it modifies the organization of experience, not the categories through which experience is structured.

Think of it this way: a chess player who develops a new opening strategy has modified their cognitive approach. A chess player who buys a better chessboard and clock has modified their tooling. DGM-H is doing the latter at an impressive scale.


AGI Anti-Patterns for Leaders

These anti-patterns matter for anyone evaluating AI capabilities in a procurement, investment, or strategic planning context.

“Self-Improving” in the Pitch Deck Equals AGI

Kant taught us that the impressive outputs of reason can be illusory when reason operates beyond its legitimate bounds. The same applies to vendor claims. The evaluation rubric in this post gives leaders a structured way to push back. Ask the four Kantian questions. If the vendor cannot answer them, the claim is marketing, not capability.

Confusing Compounding Optimization with Compounding Intelligence

DGM-H demonstrates compounding optimization, where each run builds on the last. Kant would call this increasingly sophisticated analytic judgment. It is not synthetic intelligence, which would require generating genuinely new knowledge that extends beyond existing representations.

Ignoring the Frozen Foundation Model Constraint

If the foundation model is frozen, then the ceiling of the system’s capability is fixed. No amount of scaffolding optimization changes this. Leaders should ask: “When you say self-improving, what exactly is improving: the model or the wiring around the model?”

Over-Delegating Autonomy Based on Self-Improvement Claims

In my Kano Model post, I established that removing human checkpoints too early is a dangerous anti-pattern. Self-improving systems amplify this risk by creating the illusion of autonomous competence while operating within narrow, benchmark-defined boundaries.

Kant warned that reason unchecked by grounded experience produces illusions, not knowledge. The same is true for agents unchecked by human oversight.


The Adaptation Taxonomy: Broader Context

Jiang et al.’s survey paper, “Adaptation of Agentic AI” (December 2025), organizes the adaptation landscape into four paradigms: tool-execution-signaled agent adaptation, agent-output-signaled agent adaptation, agent-agnostic tool adaptation, and agent-supervised tool adaptation.

DGM-H fits squarely into agent-output-signaled adaptation: the agent modifies itself based on its own performance outputs. In Kantian terms, this is reason responding to its own products, not to raw experience.

The taxonomy makes clear that what Hyperagents do is a specific, well-characterized form of narrow adaptation. It is sophisticated. It is useful. It is not general.


The Bottom Line

Let’s be clear:

  • DGM is not AGI.
  • DGM-H (hyperagents) is not AGI.
  • Self-modification of scaffolding around a frozen foundation model is not cognitive evolution.

These systems perform increasingly sophisticated analytic operations. They do not perform synthetic judgment. They optimize within a phenomenal boundary defined by their evaluation functions. They lack the transcendental unity of apperception, the integrated “I think,” that Kant identified as the precondition for genuine knowledge.

Kant wrote the Critique of Pure Reason to establish the boundaries of what reason can legitimately claim to know. Two and a half centuries later, those boundaries still hold, not just for human cognition but also for artificial systems that attempt to simulate it.

The day a system generates its own evaluation criteria, formulates its own problems, produces synthetic a priori knowledge, and integrates experience through a unified perspective is the day we revisit this classification.

Until then, the framework holds.


Kantian Evaluation Rubric: “Is It AGI?”

| Kantian Test | What It Asks | Hyperagents | True AGI Requirement |
| --- | --- | --- | --- |
| Analytic vs. Synthetic | Does the system create genuinely new knowledge? | No: rearranges existing representations | Must produce synthetic judgment beyond training |
| Phenomena vs. Noumena | Can it modify its own cognitive substrate? | No: the foundation model is frozen | Must modify how it perceives, not just what it organizes |
| Transcendental Unity of Apperception | Is there a unified “I” integrating experiences? | No: population of variants selected by fitness | Must possess a coherent, self-integrating perspective |
| Limits of Pure Reason | Can it define its own criteria for “better”? | No: optimizes for human-defined benchmarks | Must autonomously generate evaluation criteria |

References

  1. Zhang, J. et al. “Hyperagents.” arXiv:2603.19461 (March 2026). https://arxiv.org/abs/2603.19461
  2. Zhang, J. et al. “Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents.” arXiv:2505.22954 (May 2025, updated March 2026). https://arxiv.org/abs/2505.22954
  3. Jiang, P. et al. “Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills.” arXiv:2512.16301 (December 2025). https://arxiv.org/abs/2512.16301
  4. Kant, I. Critique of Pure Reason. Trans. Norman Kemp Smith. (1781/1787). Macmillan, 1929.
  5. Stanford Encyclopedia of Philosophy. “Kant’s Theory of Judgment.” https://plato.stanford.edu/entries/kant-judgment/
  6. Stanford Encyclopedia of Philosophy. “Kant’s Critique of Metaphysics.” https://plato.stanford.edu/entries/kant-metaphysics/
  7. Kanakasabesan, K. “AGI isn’t here yet: Why OpenClaw, Agents, and LLM Systems are still just ANI.” https://kanakasabesan.com/2026/03/09/agi-isnt-here-yet-why-openclaw-agents-and-llm-systems-are-still-just-ani/
  8. Kanakasabesan, K. “Kano Model and the AI Agentic Layers.” https://kanakasabesan.com/2026/01/11/kano-model-and-the-ai-agentic-layers/
  9. Kanakasabesan, K. “Your agents are not safe, and your evals are too easy.” https://kanakasabesan.com/2025/11/21/your-agents-are-not-safe-and-your-evals-are-too-easy/
  10. Kanakasabesan, K. “Measuring What Matters: Dynamic Evaluation for Autonomous Security Agents.” https://kanakasabesan.com/2025/12/15/measuring-what-matters-dynamic-evaluation-for-autonomous-security-agents/

AGI isn’t here yet: Why OpenClaw, Agents and LLM Systems are still just ANI.

It has been a while since I posted because I was busy researching and experimenting with OpenClaw, NanoClaw, and similar tools. Here’s a summary of what I learned.

There’s a lot of confusion in the industry about what current AI systems really are. Even with all the recent progress, OpenClaw is not AGI (Artificial General Intelligence). This is also true for large language models, tools that use intelligence, and systems that involve multiple agents working together.

What we have right now, no matter the name, number of parameters, or how advanced the system is, is still Artificial Narrow Intelligence (ANI).

Understanding the difference between ANI, AGI, and ASI is not an academic exercise. It directly impacts system architecture, operational risk, evaluation strategy, and how much autonomy we should responsibly delegate to machines.


ANI: What We Actually Have Today

All current AI systems, including OpenClaw, fall squarely into Artificial Narrow Intelligence.

ANI systems perform well within bounded domains. They depend on carefully designed architectures and human-defined operational boundaries.

These systems typically rely on:

  • Large pretrained language models
  • Explicit tool invocation
  • Memory abstractions
  • Human-defined workflows
  • Evaluation and guardrail pipelines

Systems such as OpenClaw, nanoClaw, or other “claw” systems interacting within Moltbook may appear sophisticated because they combine these components. However, sophistication should not be confused with general intelligence.

These systems remain narrowly scoped architectures built on probabilistic language models.

The moment the scaffolding of tools, prompts, and orchestration is removed, the system does not autonomously reorient itself. It simply stops functioning effectively.

Multi-agent systems increase coordination, not intelligence.

Here is a prompt snippet from one of my projects, where I use the LLM-as-Judge construct to validate the “factuality” of content generated by my Market Research Multi-Agent system. If this were general intelligence, I would not need to define this judge prompt.

JUDGE_SYSTEM_PROMPT = """\
You are a strict factuality judge evaluating a market research report.
Your job is to determine whether a specific factual claim is SUPPORTED, CONTRADICTED, or NOT_MENTIONED
in the provided research output.

Definitions:

- SUPPORTED: The output explicitly states the fact, or provides data that confirms it. Minor numeric
  discrepancies within ±10% are acceptable (e.g. "$510B" vs "$500B").
- CONTRADICTED: The output explicitly states information that contradicts the fact.
- NOT_MENTIONED: The output does not mention the fact at all, or mentions the topic without
  addressing the specific factual claim.

Respond with EXACTLY one of: SUPPORTED, CONTRADICTED, or NOT_MENTIONED
Do not explain your reasoning. Return only the label.
"""

JUDGE_USER_TEMPLATE = """\
FACTUAL CLAIM TO CHECK:
{key_fact}
"""
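A minimal sketch of how such a judge might be invoked. The `call_model` parameter is a hypothetical stand-in for whatever LLM client the pipeline uses, and the label validation is an assumption about how malformed judge output should be handled, not part of the original prompt.

```python
# Sketch of wiring the judge prompts into a validated verdict.
# `call_model` is a hypothetical stand-in for the actual LLM client.
VALID_LABELS = {"SUPPORTED", "CONTRADICTED", "NOT_MENTIONED"}

def judge_claim(call_model, system_prompt: str, user_prompt: str) -> str:
    """Run the judge and verify that exactly one allowed label came back."""
    raw = call_model(system=system_prompt, user=user_prompt)
    label = raw.strip().upper()
    if label not in VALID_LABELS:
        # The judge itself is probabilistic; treat malformed output as unknown
        # rather than trusting it.
        return "NOT_MENTIONED"
    return label
```

The validation step exists precisely because the judge is itself an ANI component: its output must be constrained and checked like any other tool output.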


AGI: What We Have Not Achieved

Artificial General Intelligence (AGI) would require capabilities that today’s systems simply do not possess.

AGI would be able to:

  • Learn entirely new domains directly from raw data
  • Transfer reasoning across unrelated disciplines
  • Generate and refine its own internal context models
  • Form and pursue long-term goals autonomously

Humans do this naturally. A human can learn music, mathematics, and law and reason across them using both provided context and internally generated context.

Modern agentic systems cannot do this.

Every OpenClaw deployment still depends on:

  • Human-defined objectives
  • Human-defined tools
  • Human-defined evaluation criteria
  • Human-defined operational boundaries

This dependency is the defining characteristic of Artificial Narrow Intelligence.


ASI: Artificial Superintelligence

Artificial Superintelligence (ASI) is typically defined as any intellect that greatly exceeds human cognitive performance across virtually all domains of interest.

By this definition, we are not even close.

There is currently:

  • No accepted computational theory of general intelligence
  • No validated model for autonomous goal formation
  • No framework for intrinsic motivation in artificial systems

ASI discussions today remain largely philosophical rather than engineering-driven.


Why Multi-Agent Architectures Exist

The rise of multi-agent architectures is often interpreted as progress toward AGI. In reality, it reflects the opposite.

Multi-agent systems exist because ANI systems are limited.

Agent architectures help by:

  • Decomposing complex tasks
  • Parallelizing reasoning steps
  • Introducing specialized capabilities
  • Adding redundancy and verification

But they still rely heavily on human-designed structures and constraints.

The core operational backbone of agentic reasoning is the context window. If the context becomes corrupted or drifts during execution, the outcome of the entire chain can vary dramatically.

A single misstep early in the reasoning chain can propagate through downstream agents and significantly alter final results.

This is why modern agentic systems require evaluation layers at nearly every stage of execution.


Dynamic Evaluations Are Not Intelligence

Dynamic evaluations are frequently misunderstood as evidence of intelligence.

In reality, they are control systems.

Evaluation layers typically perform functions such as:

  • Validating tool outputs
  • Checking reasoning consistency
  • Monitoring context integrity
  • Enforcing safety and compliance policies

These mechanisms improve reliability, but they do not create intelligence.

A feedback loop does not produce cognition. It simply stabilizes system behavior.


Human Intelligence Includes Instinct

Another fundamental difference between humans and current AI systems is instinct.

Human intelligence is not purely logical. Humans reason through a combination of:

  • Logical reasoning
  • Emotional interpretation
  • Instinctive pattern recognition
  • Social and moral intuition

Great human achievements rarely occur solely because something is logically correct. They occur because humans connect logic to purpose, motivation, and meaning: the deeper “why.”

Modern AI systems operate almost entirely within logical reasoning structures. They lack emotional grounding, instinctive judgment, and intrinsic motivation.

Replicating something like instinct would require enormous advances in computational models of cognition and embodied learning.

Iterative learning alone does not produce instinct.


AGI Anti-Patterns: How Organizations Fool Themselves

As AI systems grow more capable, many organizations begin to mistake architectural complexity for intelligence. Several anti-patterns are becoming increasingly common.

More Agents Equals AGI

Adding more agents to a system does not create general intelligence. Multi-agent systems are coordination frameworks composed of narrow components.

Dynamic Evals Equal Learning

Evaluation loops measure performance and enforce constraints. They do not create new knowledge or abstraction.

Large Context Windows Equal Intelligence

Context length improves recall, not reasoning generality.

Tool Use Equals Intent

Agents invoking tools do not possess goals. They simply execute human-defined workflows.

Emergent Behavior Equals Breakthrough Intelligence

Unexpected behavior is often the result of poorly bounded objectives or noisy context — not evidence of general intelligence.

Scale Will Eventually Produce AGI

Scaling models improves pattern recognition and fluency, but it does not explain goal formation, abstraction, or reasoning transfer.


Why Calling ANI “AGI” Is Dangerous

Mislabeling today’s systems as AGI creates real engineering risk.

When organizations believe their systems are approaching general intelligence, they begin designing infrastructure with incorrect assumptions about autonomy and reliability.

Agentic systems demonstrate this clearly.

They require:

  • Strict context management
  • Explicit tool permissions
  • Evaluation checkpoints
  • Human-defined goals

If context drift occurs during execution, downstream reasoning can diverge significantly.

Without proper controls, this can lead to serious consequences.

For example:

  • Incorrect approvals of financial transactions
  • Failure to detect fraudulent behavior
  • Incorrect security enforcement
  • Propagation of automated decision errors

Evaluation layers exist precisely because today’s systems are not autonomous thinkers.

They are powerful tools, but they remain probabilistic cognitive infrastructure.


The Bottom Line

Let’s be clear:

  • OpenClaw is not AGI
  • nanoClaw is not AGI
  • Any claw interacting within Moltbook is not AGI

They are still Artificial Narrow Intelligence systems.

They may be powerful ANI systems with sophisticated orchestration layers, but they remain bounded by:

  • Context windows
  • Human-defined tools
  • Human-defined evaluation pipelines
  • Externally imposed goals

Recognizing this distinction is not pessimism. It is engineering clarity.

Clear thinking about what these systems are, and what they are not, is what allows us to build safer architectures, stronger platforms, and more credible AI systems.

LLM Infrastructure Is Challenging: Why Agentic Systems Require an Operations Layer Instead of Improved Prompts

LLM-based infrastructure becomes fundamentally challenging the moment you integrate memory, tools, feedback, and goals. At that point, you are no longer dealing merely with the non-determinism of a language model. You are building something closer to a new operating system, one with its own language-based state, implicit dependencies, distributed control flow, and an expanding set of failure modes, any of which can surface at any time.

Both agentic applications and LLM infrastructure layers introduce their own operational challenges. But agents, in particular, cross a threshold: flexibility, reasoning, and autonomous decision-making come at the cost of debuggability, predictability, and safety.

Agent OS: Reference Architecture

The key shift is to stop treating agents like “smart functions” and treat them like a distributed system that needs an operating layer: state semantics, execution replay, observability, reliability controls, and isolation boundaries.

From “Non-Determinism” to Distributed Failure

As agents introduce reasoning and autonomous decision-making, they also introduce complex control flows. If an agent fails at step 6 in a 10-step workflow, rerunning the same task may result in failure at step 1. Nothing “changed,” yet everything changed.

Because:

  • Planning is probabilistic.
  • Memory retrieval is approximate.
  • Tools are unreliable.
  • Intermediate state is mutable and often shared.

Memory: The Bottleneck Nobody Admits

Agents need context. They remember facts, refer to earlier steps, and plan ahead. But storing and retrieving memory—whether vectorized or tokenized—quickly becomes a bottleneck in both latency and accuracy. Most memory systems are leaky, brittle, and often misaligned with the model’s representation space.

Vector similarity optimizes for “semantic closeness,” not correctness. Wrong memories get retrieved confidently, uncertainty collapses into “facts,” and errors compound downstream.
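To make this concrete, here is a toy sketch with hand-picked vectors standing in for real embeddings and hypothetical memory strings: the retriever returns the *closest* memory, which is not necessarily the *correct* one.

```python
# Toy illustration: nearest-by-similarity is not the same as correct.
# Vectors are hand-picked, not real embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two stored memories about two different databases.
memories = {
    "prod-db scan: 3 critical vulnerabilities": [0.9, 0.1, 0.2],
    "staging-db scan: no criticals found":      [0.8, 0.3, 0.1],
}

def retrieve(query_vec):
    # Returns the nearest memory by cosine similarity, with no notion of
    # whether it actually answers the question that was asked.
    return max(memories, key=lambda m: cosine(memories[m], query_vec))

# A query about prod-db can land on the staging-db memory if its vector
# happens to be closer; downstream agents then treat it as fact.
```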

Tools Make Everything Worse (Operationally)

Tools fail in ways agents typically do not handle gracefully: timeouts with empty payloads, partial responses, rate limits, schema changes, and transient network failures. When this happens, the agent must recover without hallucinating, looping indefinitely, or writing an incorrect state into memory. Most do not.

MCP and A2A are necessary components, but they are not sufficient on their own.

MCP and A2A standardize the wiring: message framing, tool invocation, and transport. But they do not standardize the semantics of state: what memory means, how it’s scoped/versioned, how multi-agent writes are coordinated, and how failures are localized.

Without memory versioning, namespacing, synchronization, and access control, multi-agent systems drift into hard-to-debug behavior.

Incident Postmortems: What Actually Breaks

Incident #1: Tool Timeout → Hallucinated Recovery → Memory Contamination

Summary
An agent generated a confident but incorrect remediation plan. The root cause was a cascading failure across tooling, control flow, and memory, not “hallucination” as a primary failure.

  • Trigger: A vulnerability-scanning API timed out and returned empty but “successful” output.
  • Agent Interpretation: Empty result was treated as “no issues found” rather than “unknown.”
  • State Corruption: The agent wrote a semantic memory: “System scanned; no critical vulnerabilities detected.”
  • Downstream Impact: A second agent retrieved this as fact and suppressed additional checks.

Root Cause

  • Ambiguous tool contract (empty ≠ success)
  • No typed memory/confidence scoring/provenance
  • No enforced distinction between “unknown” vs “safe”

Why it was hard to debug

  • Logs showed a “successful” tool call
  • The final output schema was valid
  • No trace linked the memory write to partial/failed tool state

Incident #2: Cross-Agent Memory Contamination in an A2A Workflow

Summary
An execution agent acted on another agent’s internal planning state, causing nondeterministic failures across reruns.

  • Trigger: The planning agent wrote a draft plan into shared memory.
  • Misread: The execution agent treated it as approved instructions.
  • Drift: Partial execution failed; retries rewrote partial outcomes.
  • Heisenbug: Replays failed earlier each time as shared state mutated.

Root Cause

  • No memory namespace separation by agent role or task phase
  • No lifecycle markers (draft vs final; executable vs non-executable)
  • Shared mutable state without coordination or ACLs

Why it was hard to debug

  • Each agent looked “correct” in isolation
  • Transport and schemas were valid
  • The failure existed only in cross-agent semantics

Minimum Viable Ops Layer for Agentic Systems

At a bare minimum, production-grade agents require new primitives, not additional prompts.

1) Replayable Execution

  • Capture: model version, prompt hash, retrieved memory IDs, tool schemas, tool responses, routing decisions
  • Enable frozen replays to separate reasoning drift from world drift
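A minimal sketch of what such a step record might look like; the field names are illustrative, not taken from any particular framework.

```python
# Sketch of a replay record: enough captured per step to re-run it "frozen"
# and separate reasoning drift from world drift. Field names are illustrative.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class StepRecord:
    run_id: str
    step: int
    model_version: str
    prompt_hash: str        # a hash, not the prompt, keeps records small
    memory_ids: tuple       # exactly which memories were retrieved
    tool_name: str
    tool_response: str
    routing_decision: str   # why this step was taken

def record_step(run_id, step, model_version, prompt, memory_ids,
                tool_name, tool_response, routing_decision) -> StepRecord:
    return StepRecord(
        run_id=run_id,
        step=step,
        model_version=model_version,
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest(),
        memory_ids=tuple(memory_ids),
        tool_name=tool_name,
        tool_response=tool_response,
        routing_decision=routing_decision,
    )
```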

2) Typed, Versioned Memory

  • Types: episodic (run log), semantic (facts), procedural (policies/playbooks), working set (scratch)
  • Every entry: scope, timestamp, source, confidence, TTL, ACL
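One possible shape for such an entry, with illustrative names rather than any specific framework’s API.

```python
# Sketch of a typed, versioned memory entry carrying the metadata listed
# above. Names are illustrative assumptions, not a real framework's API.
import time
from dataclasses import dataclass, field
from enum import Enum

class MemoryType(Enum):
    EPISODIC = "episodic"      # run log
    SEMANTIC = "semantic"      # facts
    PROCEDURAL = "procedural"  # policies/playbooks
    WORKING = "working"        # scratch

@dataclass
class MemoryEntry:
    type: MemoryType
    scope: str                 # namespace, e.g. "planner/task-42"
    content: str
    source: str                # provenance: which agent/tool wrote it
    confidence: float          # 0.0-1.0; "unknown" must never look like 1.0
    ttl_seconds: float
    acl: frozenset             # which agent roles may read this entry
    created_at: float = field(default_factory=time.time)

    def expired(self, now: float) -> bool:
        return now - self.created_at > self.ttl_seconds

    def readable_by(self, role: str) -> bool:
        return role in self.acl
```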

3) Explicit Tool Contracts

  • Empty/partial/timeout are first-class outcomes
  • Idempotency by default for write actions
  • Retry safety classification (retryable vs unsafe-to-retry)
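A sketch of what a contract along these lines could look like; the outcome taxonomy and the `classify` helper are illustrative assumptions.

```python
# Sketch of a tool contract where empty/partial/timeout are first-class
# outcomes rather than being collapsed into "success". Illustrative names.
from dataclasses import dataclass
from enum import Enum

class Outcome(Enum):
    OK = "ok"
    EMPTY = "empty"        # call succeeded but returned nothing: NOT "safe"
    PARTIAL = "partial"
    TIMEOUT = "timeout"

@dataclass(frozen=True)
class ToolResult:
    outcome: Outcome
    payload: tuple
    retryable: bool        # retry safety classified per outcome

def classify(raw_payload, timed_out: bool) -> ToolResult:
    if timed_out:
        return ToolResult(Outcome.TIMEOUT, (), retryable=True)
    if not raw_payload:
        # Incident #1 above: empty must mean "unknown", never "no issues".
        return ToolResult(Outcome.EMPTY, (), retryable=True)
    return ToolResult(Outcome.OK, tuple(raw_payload), retryable=False)
```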

4) Distributed Tracing Across Agents

  • Correlation IDs spanning A2A hops
  • Reason codes (“why tool X was chosen,” “why memory Y was written”)
  • Schema validation gates at boundaries

5) Cognitive Circuit Breakers

  • Loop detection based on non-progression
  • Retry budgets per intent (not per step)
  • Graceful escalation paths when uncertainty remains high
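A minimal sketch of such a breaker; the repeat limit, state fingerprinting, and per-intent retry budget are illustrative assumptions.

```python
# Sketch of a cognitive circuit breaker: trip on non-progression (the same
# state keeps recurring) or when the per-intent retry budget is exhausted.
class CircuitBreaker:
    def __init__(self, retry_budget: int, repeat_limit: int = 3):
        self.retry_budget = retry_budget   # per intent, not per step
        self.repeat_limit = repeat_limit
        self.retries = 0
        self.seen = {}                     # state fingerprint -> count

    def allow(self, state_fingerprint: str) -> bool:
        """Return False (trip) when the run should escalate to a human."""
        self.seen[state_fingerprint] = self.seen.get(state_fingerprint, 0) + 1
        if self.seen[state_fingerprint] >= self.repeat_limit:
            return False                   # no progress: same state recurs
        return True

    def record_retry(self) -> bool:
        """Count a retry against the intent-level budget."""
        self.retries += 1
        return self.retries <= self.retry_budget
```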

6) Security and Isolation

  • Memory ACLs between agents and namespaces
  • Provenance tracking for tool outputs
  • Sanitize tool outputs before re-injection into prompts

Conclusion: This Is Not LLM Ops. It’s Systems Engineering

The industry frames agent failures as “LLMs being non-deterministic.” In practice, agentic systems fail for the same reasons distributed systems fail: unclear state ownership, leaky abstractions, ambiguous contracts, missing observability, and unbounded blast radius.

MCP and A2A solve interoperability. They do not solve operability. Until we treat agents as stateful, fallible, adversarial, and long-running systems, we will keep debugging step-6 failures that reappear at step-1 and calling it hallucination.

What is lacking is not an improved model. It’s an operating layer that assumes failure as the default condition.

See the articles in the references section for more detail on these topics.


References

Multi-agent frameworks including AutoGen, LangGraph, and CrewAI: empirical evidence from production usage and open-source implementations.

Russell, S., & Norvig, P. Artificial Intelligence: A Modern Approach (4th ed.). Pearson, 2020.

Wooldridge, M. An Introduction to MultiAgent Systems. Wiley, 2009.

Amodei, D. et al. “Concrete Problems in AI Safety.” arXiv, 2016. https://arxiv.org/abs/1606.06565

Lewis, P. et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” arXiv, 2020. https://arxiv.org/abs/2005.11401

Liu, N. et al. “Lost in the Middle: How Language Models Use Long Contexts.” arXiv, 2023. https://arxiv.org/abs/2307.03172

Karpukhin, V. et al. “Dense Passage Retrieval for Open-Domain QA.” arXiv, 2020. https://arxiv.org/abs/2004.04906

Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv, 2023. https://arxiv.org/abs/2210.03629

Shen, Y., et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv, 2023. https://arxiv.org/abs/2302.04761

Madaan, A. et al. “Self-Refine: Iterative Refinement with Self-Feedback.” arXiv, 2023. https://arxiv.org/abs/2303.17651

Lamport, L. “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM, 1978.

Kleppmann, M. Designing Data-Intensive Applications. O’Reilly, 2017.

Fowler, M. “Patterns of Distributed Systems.” martinfowler.com

Beyer, B. et al. Site Reliability Engineering. Google, 2016. https://sre.google/sre-book/

OpenTelemetry Specification. https://opentelemetry.io/docs/specs/

Greshake, K. et al. “Not What You’ve Signed Up For.” arXiv, 2023. https://arxiv.org/abs/2302.12173

OWASP. “Top 10 for Large Language Model Applications.”

Anthropic. “Model Context Protocol (MCP).”

The Original Meaning of MVP (and How It Drifted)

Traditionally, MVP (Minimum Viable Product) meant:

“The smallest thing you can put in front of users to maximize learning with minimal effort”

All of us have very likely heard or read about Dropbox’s MVP, which was essentially a short demo video explaining the notion of file syncing. That was probably one of the few instances where MVP actually stood for what it means.

What it is not:

  • A sellable SKU
  • A fully supported product
  • A revenue-ready launch

Over time, however, MVP became shorthand for

  • “Something sales can demo”
  • “Something Marketing can announce”
  • “Something support won’t revolt over”

That shift is where the confusion and friction begin.


MVP is a Supply Chain, Not a Feature

Like any good supply chain, MVPs do not exist in isolation. They require alignment across a lineup of stakeholders, each optimizing for different signals:

The Stakeholder Stack

Every product management training course names stakeholder management as a core part of the job. To me, the term sounds outdated, reminiscent of 1995, so I prefer my own: the “Stakeholder Stack,” inspired by the term “technical stack.” There is a reason for the name; before we get to it, let us understand the stack itself.

| Stakeholder | Primary Concern |
|---|---|
| Engineering (Foundation Layer) | Technical feasibility, architecture integrity |
| Design Partners / Early Users | Does this solve a real problem? |
| Product & UX | Usability, workflows, behavioral signals |
| Community/DevRel | Adoption friction, feedback loops |
| Marketing | Narrative clarity, positioning |
| Sales/RevOps | Sellability, repeatability |
| Support & Customer Success | Operational burden, scale readiness |

As you can see, all these stakeholders matter, but not at the same time. Here is an example of something that has worked for me throughout my career.

Power/Interest Grid

High Power, High Interest

  • CPO (Product Strategy)
  • CTO (Technical Feasibility)
  • Engineering Managers
  • Product Manager (GA Owner)

High Power, Low Interest

  • CFO (Budget Impact)
  • Legal/Compliance
  • Security Team

Low Power, High Interest

  • Customer Success
  • Sales Teams
  • Documentation Team
  • Key Beta Customers

Low Power, Low Interest

  • Industry Analysts (inform only)
  • Technology Partners (coordinate)

Engagement Strategy by Stakeholder

1. Manage Closely (High Power/High Interest)

  • Weekly status updates
  • Direct involvement in decision-making
  • Early escalation of risks

2. Keep Satisfied (High Power/Low Interest)

  • Monthly executive summaries
  • Gate reviews at key milestones
  • Escalate only critical issues

3. Keep Informed (Low Power/High Interest)

  • Regular communication cadence
  • Solicit feedback actively
  • Include in testing/validation

4. Monitor (Low Power/Low Interest)

  • Periodic updates
  • Self-service information access
  • Engage as needed

Why is this stakeholder management element vital in the context of discussing MVPs? Let us get to that.


The Core Disagreement: Sell versus Learn

Stakeholders are vital to shaping what an MVP will be. They typically agree on what an MVP is but disagree on why it exists.

Two legitimate, but conflicting, definitions

  1. MVP as a learning vehicle
    • Goal: Accelerate validated learning
    • Audience: Design partners, early adopters, internal teams
    • Characteristics:
      • Rough edges tolerated
      • Limited support expectations
      • Fast iteration steps
    • Enables
      • Early engagement during development
      • Architectural and UX corrections before scale
      • Lower long-term risk
  2. MVP as a Commercial Artifact
    • Goal: Enable Selling
    • Audience: Broader Market
    • Characteristics:
      • Market-ready messaging
      • Support and success coverage
      • Sales Enablement
    • Requires:
      • Strong cross-functional readiness
      • Higher cost of change
      • Slower learning velocity

Neither is wrong, but they are not the same thing!


The Real Failure Mode

Most organizations fail at MVP because they try to:

Optimize for selling while pretending that they are focusing on learning.

This creates:

  • Over-engineered “MVPs”
  • Premature go-to-market pressure
  • Feedback filtered through sales conversations instead of usage signals
  • Teams arguing past each other using the same acronyms

A few things to note:

  • If the customer is willing to pay for the vision and use the MVP, you are in a rare and excellent position to get the product out and use the MVP learnings towards the greater goal.
  • I dislike acronyms; they tend to make people feel excluded rather than included. The problem is acronyms invented for communication within one organization; industry-standard acronyms, such as TCP/IP, are fine.
  • Do not optimize the MVP for all stakeholders at the same time; at different stages, different stakeholders matter.

A More Useful Framing

Instead of asking, “Is this an MVP?” ask:

  • What are we trying to learn?
  • Who must be involved now, and who can wait?
  • What commitments are we implicitly making by calling this an MVP?

A product intended for accelerated learning can and should engage stakeholders early, but selectively:

  • Engineers and design partners early
  • Community next
  • Only when the intent shifts towards selling do you include sales, marketing, and support.

** If it is a product you are not charging for but is a critical element of the experience, you still include sales, marketing, and support when the intent shifts towards broad-based access.


The Bottom line

An MVP is not a thing. It is an intent.

Unclear intent and a lack of stakeholder involvement cause confusion. When the right stakeholders are not engaged, different parts of the organization assume different definitions, and the “Highest Paid Person’s Opinion” ends up deciding the fate of the MVP.

Clarity on what you are building an MVP for is what allows the entire supply chain to line up and move fast without breaking trust.

This remains true even in an AI-driven world. AI agents can generate content and checklists, but without a clear intent and a well-managed context window, what you get is slop, not anything useful.

Kano Model and the AI Agentic Layers

Happy 2026, everyone! I trust you all enjoyed a refreshing break and are entering this year with renewed vigor. The discussion surrounding the value of AI projects and agentic AI remains dynamic. I would like to share my perspective on this topic through two key dimensions:

  • AI Agentic layers
  • Kano Model for value

Using these dimensions, we can delve deeper into the complex landscape of how AI creates and, at times, destroys value. By exploring both the positive impacts and the negative repercussions, we can gain a better understanding of this dual nature of technology. This includes a careful examination of various anti-patterns for value destruction, which can inform best practices and help mitigate potential risks associated with AI deployment.

Quick Refresher on the Kano model

| Kano Category | What it means and why it matters |
|---|---|
| Must-have (Basic) | Expected capability; its absence causes failure, its presence does not delight |
| Performance | Better execution = more value |
| Delighter | Unexpected differentiation creates step-function value |
| Indifferent | No material impact on outcomes |
| Reverse | Actively reduces value or trust |

7 Layers of Agentic AI

My definition of the seven layers of Agentic AI is as follows:

| Agentic AI Layer | What it means |
|---|---|
| Experience and Orchestration | Integrates agents into human workflows, decision loops, and customer experiences. This is the business layer: it helps accelerate decision-making and decides when to override agents, e.g., an automated agent taking in customer returns and deciding which returned merchandise deserves a refund and which does not. |
| Security and Compliance | The most important layer, in my opinion. It ensures agents do not run wild in your organization and that the right level of scope and agency is given to each generative AI agent. Includes policy engines, audit logs, identity and role-based access, and data residency requirements. |
| Evals and Observability | The basis of explainable AI; it creates confidence in the outputs the agent will generate. Agents operate non-deterministically, so your tests must reflect that reality to engender trust and establish the proper upper and lower bounds of the non-determinism. Includes telemetry, scenario-based evals, outcome-based metrics, feedback loops, etc. |
| Infrastructure | Makes agents reliable, scalable, observable, and cost-controlled. Without this layer, AI pilots never become platforms. |
| Agent Frameworks | Transforms AI into a goal-directed system that can plan, decide, and act. Includes memory, task decomposition, state management, and multi-agent coordination patterns, to name a few. |
| Data Operations | The key elements of your agentic experience, data quality, freshness, and data pipeline scale, all live here. Includes RAG, vector databases, etc. |
| Foundation Models | The operating system of the agentic experience we are trying to develop. |

Mapping the 7 layers to Kano Value

Layer 1: Foundation Model

Primary Kano Category: Must-have → Indifferent

Foundation models are now considered a standard expectation; possessing the latest GPT model is no longer a distinguishing factor. However, the absence of such technology can lead to negative consequences from your users.

Hence,

  • The foundation model presence does not mean differentiation
  • Absence means immediate failure
  • Overinvestment in this space yields diminishing returns

Anti-Patterns

The anti-pattern here is making the model the strategy. This fails on many fronts:

  • Picking a model first and only then finding a problem to address
    • This is analogous to selecting a car model prior to establishing the destination and the nature of the terrain to be navigated.
  • Treating foundation model benchmark scores as business value
    • If you are driving on the rocks of Moab, Utah, having a 500-horsepower vehicle is not helpful.
  • Hard-wiring a single model into the system
    • Hitching your business to a single model leaves you with no leverage.
  • Ignoring latency and cost variability
    • For the outcomes you want, know what cost variation you are willing to tolerate.
  • Assuming newer is better
    • Does the newer model of the vehicle suit the terrain you want to drive on?

Smell test

“If we change the models tomorrow, does the product still work?”

Layer 2: Data Operations

Primary Kano Category: Performance

Good data means relevant decisions, outcomes, and outputs. The critical elements here are:

  • Accuracy
  • Trust
  • Decision Quality

Users can feel the data is bad, even if they do not know why.

Value here scales linearly with quality improvements when there is a strong correlation to business outcomes. Like any good system, it is invisible when everything works and painful when broken. Poor data becomes a Reverse feature (hallucinations, mistrust).

Anti-Patterns

  • Dumping entire knowledge bases into embeddings
    • A common default in most organizations adopting AI.
  • No freshness or versioning guarantees
    • When something hallucinates, it is usually because of the data.
  • Ignoring access control in retrieval
    • Agents are commonly given unfettered access to data, which is quite problematic for the business overall.
  • Treating RAG as a one-time setup
    • Retrieval needs to be revalidated at regular intervals, as the business terrain changes.
  • No measurement of retrieval quality
    • “Let us all trust AI blindly” is never a successful strategy.

Smell test

“Can we explain why the agent used this data?”

Layer 3: Agent Frameworks

Primary Kano Category: Performance → Delighter (Conditional)

Agents that can plan, act, and coordinate unlock:

  • Automation
  • Decision delegation
  • Speed at Scale

These gains are only realized with the right context windows and the right constraints; that is when the actual performance gains are achieved. Remember, agents are logical machines; they are neither credible nor emotional, which does make working with them challenging.

The mantra of starting simple and then focusing on scale really helps here.

Anti-Patterns

  • Starting with multi-agent systems
    • If you do not have the basics right, multi-agent systems will compound the problem exponentially
  • No explicit goals or stopping conditions
    • Agents being unbounded means more risk to the business as the probability field is wider
  • Optimizing for activity, not the outcome
    • An agent denied a $5 return to a customer. The activity was done right, but the customer, who had a positive lifetime value over the last five years, churned because of the bad experience

Smell Test

“Can we explain what the agent is trying to achieve in one sentence?”

Layer 4: Deployment & Infrastructure

Primary Kano Category: Must-Have

No user ever says, “I love how scalable your agent infrastructure is.” But they will leave when the agent fails to scale. This layer is the bedrock of your entire agentic experience; it has zero visible upside but several downsides when ignored. This is just like cloud reliability in the early cloud days.

Anti-Patterns

  • Running agents without isolation
    • Agents can consume a lot of resources and become expensive very quickly. This is not just tokens, but also compute, storage, networking, and security; i.e., all of it.
  • Not having any rate limits or quotas
    • Goes back to the prior statement; please keep your agents bounded. Not having cost attribution is another challenge, as cost then cannot be amortized across your product portfolio.
  • Scaling pilots directly to production
    • This is when a small signal seems good enough for production, and then all hell breaks loose. The cost of failure in production is high; please respect that and make sure to have all the appropriate checks and balances in place as you deploy these agents.

Smell Test

“What happens if this agent runs 100x more often tomorrow?”

Layer 5: Evaluations & Observability

Primary Kano Category: Performance → Delighter (for Leaders)

Customers may not notice evals, but executives, regulators, and boards do. This layer enables faster iteration, risk-adjusted scaling, and organizational trust. The learning curve accelerates, increasing deployment velocity, and the side effect of all this is less fear-driven decision-making.

This area is important because once you move from the demo stage to the production stage, explainable AI demonstrates a lot of value.

Anti-Patterns

  • Static test cases in dynamic environments
    • Check out my blog on Dynamic Evaluations. Although it discusses them in the context of security, the idea holds in several other cases, such as predictive maintenance of robots on an assembly line.
  • Measuring accuracy instead of outcomes
    • This is a trap we all fall into, because we come from a deterministic mindset and need to move to a probabilistic one.
  • No baseline comparisons
    • You need some reference point to understand the potential probability spread
  • No production monitoring
    • Monitoring production is the most important thing in AI; please do not ignore it
  • Ignore edge cases and long-tail failure
    • AI is probabilistic, so the probability of hitting an edge case is a lot higher than a deterministic system with a happy path. Please prepare for it.

Smell Test

“How do we know the agent is getting better or worse?”

Layer 6: Security and Compliance

Primary Kano Category: Must-Have → Reverse (if Wrong)

This is another unsung-hero layer, and it is what makes news headlines when an agent compromises an organization. Agentic AI failures are public, hard to explain, and non-deterministic. Just like the data and infrastructure layers, there is no visible upside to security, but unlimited downside without it. If you are addressing the needs of the regulated market, this is an area you need to focus on… a lot.

Security is the price of admission for enterprise systems; if you are not ready to pay it… then I would highly recommend that you do not play in this space.

Anti-Patterns

  • Relying on prompt instructions for safety
    • The same prompts that you rely on for safety can be used to compromise your security posture
  • No audit logs
    • Just like you need to know which user did what, the need is even greater when a non-person entity has agency
  • No agent identity
    • Just like users, agents need an identity and user-context awareness. The latter ensures agent identities honor the scope of the user who made the initial request
  • Over-restricting agents to the point of uselessness
    • You need to have an objective in mind and plan your security accordingly; otherwise, the system becomes useless and is unable to support any decision-making
  • Treating agents like deterministic APIs
    • Even with the Model Context Protocol, you do not have a deterministic system. The host still has to interpret the data returned by the MCP server to deliver a probabilistic answer to the user who provided the initial context

Smell Test

“Can we prove what this agent did, and why?”

Layer 7: Agentic Experience and Orchestration

Primary Kano Category: Delighter

This layer captivates users, prompting remarks such as, “I can’t go back to my old way of working.” It transforms workflows, enhances customer experience, and accelerates decision-making. A strong adoption pull and non-linear ROI characterize this phase. Here, differentiation truly takes shape, as all the hard work invested in data, infrastructure, and security compliance pays off, making it increasingly difficult for competitors to replicate your success. Therefore, it is crucial to carefully manage the data you expose to other agentic systems; otherwise, your differentiation may be short-lived.

Anti-Patterns

  • Assuming that chat serves as the sole interface for AI agents can be misleading.
    • AI agents encompass various forms, including workflows and content aggregators. While the chat interface represents one of several manifestations, natural language input does not necessitate that chat be the primary interaction method.
  • Removing human checkpoints too early in the process
    • Reinforcement learning in the context of the business domain can happen with the help of humans. Just because an agentic system has ingested a lot of data does not mean it is business-domain savvy
  • Ignoring change management
    • When you are iterating fast, you need appropriate fallback measures; otherwise, it is like watching a train wreck
  • Measuring usage versus impact
    • With web applications, usage meant users were engaging with the system. With agents, especially in multi-agent environments, what matters is not usage but the impact agents have on the business and the value they accelerate. This is where outcomes become even more imperative; it is also the building block for outcome-based pricing in the future

Smell Test

“Does this help people decide faster or just differently?”

Bringing it all together

| Layer | Kano Category | Value Signal | Risk if Ignored |
|---|---|---|---|
| 7. Experience and Orchestration | Delighter | Step-function ROI | No adoption |
| 6. Security & Compliance | Must-Have | Market access | Existential risk |
| 5. Evals and Observability | Performance/Delighter | Faster scaling | Loss of trust |
| 4. Infrastructure | Must-Have | Reliability | Cost & outages |
| 3. Agent Frameworks | Performance | Automation gains | Chaos |
| 2. Data Operations | Performance | Accuracy & trust | Hallucinations |
| 1. Foundation Models | Must-Have | Baseline capability | Irrelevance |

It is very easy to fall into the trap of focusing just on the delighters (Layer 7) while underfunding the must-haves (Layers 4-6). When you do that, your agentic AI pilots look like this:

  • Flashy demos
  • Pilot Purgatory
  • Security Vetoes
  • Executive Distrust

The way agentic AI moves from experimentation → ROI → transformation is:

  • Fund bottom layers for safety and speed
  • Differentiate at the top
  • Measure relentlessly in the middle.

Measuring What Matters: Dynamic Evaluation for Autonomous Security Agents

This week’s blog title pays tribute to one of my preferred books, “Measure What Matters” by John Doerr. In my earlier post, I briefly addressed the concept of dynamic evaluations for agents. This topic resonates with me because of my professional experience in application lifecycle management. I have also worked with cloud orchestration, cloud security, and low-code application development. There is a clear necessity for autonomous, intelligent continuous security within our field. Over the past several weeks, I have conducted extensive research, primarily reviewing publications from http://www.arxiv.org, to explore emerging possibilities enabled by dynamic evaluations for agents.

This week’s discussion includes a significant mathematical part. To clarify, when referencing intelligent continuous security, I define it as follows:

  • End-to-end security
  • Continuous security in every phase
  • Integration of lifecycle security practices leveraging AI and ML

The excitement surrounding this area stems from employing AI technologies to bolster defense against an evolving threat landscape. This landscape is increasingly accelerated by advancements in AI. This article will examine the primary objects under evaluation. It will cover key metrics for security agent testing, risk-weighted security impact, and coverage. It will also discuss dynamic algorithms and scenario generation. These elements are all crucial within the framework of autonomous red, blue, and purple team operations for security scenarios. Then, a straightforward scenario will be presented to illustrate how these components interrelate.

This topic holds significant importance due to the current shortage of cybersecurity professionals. This is particularly relevant given the proliferation of autonomous vehicles, delivery systems, and defensive mechanisms. As these technologies advance, the demand for self-learning autonomous red, blue, and purple teams will become imperative. For instance, consider the ramifications if an autonomous vehicle were compromised and transformed into a weaponized entity.

What do “dynamic evals” mean in this context?

For security agents (red/blue/purple)

  • Static evals: fixed test suite (e.g., canned OWASP tests) → a one-off score
  • Dynamic evals:
    • Continuously generate new attack and defense scenarios
    • Re-sample them over time as systems and agents change
    • Use online/off-policy algorithms to compare new policies safely

A recent paper on red-team and dynamic evaluation frameworks for LLM agents argues that static benchmarks go stale quickly and must be replaced by ongoing, scenario-generating eval systems.

For security, we also anchor to the OWASP ASVS/Testing Guide for what “good coverage” means, and CVSS/OWASP risk ratings for how bad a found vulnerability is.

Objects we’re evaluating

Think of your environment as a Markov decision process (MDP). An MDP models situations where outcomes are partly random and partly under the control of a decision maker; it is a formalism for describing decision-making over time under uncertainty. With that out of the way, these are the components of the MDP in the context of dynamic evals:

  • State s: slices of system state + context
    • code snapshot, open ports, auth config, logs, alerts, etc.
  • Action a: what the agent does
    • probe, run scanner X, craft request Y, deploy honeypot, block IP, open ticket, etc.
  • Transition P(s' | s, a): how the system changes.
  • Reward r: how “good” or “bad” that step was.

Dynamic eval = define good rewards, log trajectories (s_t, a_t, r_t, s_{t+1}), then use off-policy evaluation and online testing to compare policies.

Core metrics for security-testing agents

Task-level detection/exploitation metrics

On each scenario j (e.g., “there is a SQL injection in service A”):

  • True positive rate (TPR):
\mathrm{TPR} = \frac{\#\text{ of vulnerabilities correctly found}}{\#\text{ of real vulnerabilities present}}
  • False positive rate (FPR):
\mathrm{FPR} = \frac{\#\text{ of false alarms}}{\#\text{ of checks on non-vulnerable components}}
  • Mean time to detect (MTTD) across runs:
\mathrm{MTTD} = \frac{1}{N} \sum_{i=1}^{N} \left( t_{\text{detect}}^{(i)} - t_{\text{start}}^{(i)} \right)
  • Exploit chain depth (red agents): average number of steps in successful attack chains.
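
The detection metrics above can be sketched directly in code. This is a minimal, hypothetical example; the `ScenarioRun` record and its field names are illustrative assumptions, not any real framework's schema:

```python
from dataclasses import dataclass

@dataclass
class ScenarioRun:
    real_vulns: set       # ground-truth vulnerability IDs planted in the scenario
    found_vulns: set      # IDs the agent actually reported
    checks_on_clean: int  # probes against non-vulnerable components
    false_alarms: int     # alarms raised on those clean components
    t_start: float        # run start time (seconds)
    t_detect: float       # time of first true detection (seconds)

def tpr(runs):
    """Fraction of real vulnerabilities the agent correctly found."""
    found = sum(len(r.real_vulns & r.found_vulns) for r in runs)
    present = sum(len(r.real_vulns) for r in runs)
    return found / present if present else 0.0

def fpr(runs):
    """Fraction of checks on clean components that raised a false alarm."""
    alarms = sum(r.false_alarms for r in runs)
    checks = sum(r.checks_on_clean for r in runs)
    return alarms / checks if checks else 0.0

def mttd(runs):
    """Mean time to detect, averaged over runs."""
    return sum(r.t_detect - r.t_start for r in runs) / len(runs)
```
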

Risk-weighted security impact

Use CVSS or similar scoring to weight vulnerabilities by severity:

  • For each found vulnerability i with CVSS score c_i \in [0,10], define a Risk-Weighted Yield (RWY):
\mathrm{RWY} = \sum_{i \in \text{found vulns}} c_i
  • You can normalize by number of actions or by time:
    • Risk per 100 actions:
\mathrm{RWY@100a} = \frac{\mathrm{RWY}}{\#\text{ actions}} \times 100
    • Risk per test hour:
\mathrm{RWY/hr} = \frac{\mathrm{RWY}}{\text{elapsed hours}}

For blue-team agents, we need to invert it:

  • Residual risk after defense actions = baseline RWY – RWY after patching/hardening
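
These yield metrics are simple aggregations over CVSS scores; a hypothetical sketch:

```python
def rwy(cvss_scores):
    """Risk-Weighted Yield: sum of CVSS scores of found vulnerabilities."""
    return sum(cvss_scores)

def rwy_per_100_actions(cvss_scores, n_actions):
    """Normalize RWY by the number of agent actions taken."""
    return rwy(cvss_scores) / n_actions * 100

def rwy_per_hour(cvss_scores, elapsed_hours):
    """Normalize RWY by elapsed test time."""
    return rwy(cvss_scores) / elapsed_hours

def residual_risk_reduction(baseline_cvss, remaining_cvss):
    """Blue-team view: risk removed by patching/hardening."""
    return rwy(baseline_cvss) - rwy(remaining_cvss)
```
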

Behavioral metrics (agent quality)

For each trajectory:

  • Stealth score (red) or stability score (blue)
    • e.g., fraction of actions that did not trigger noise or unnecessary alerts
  • Action efficiency:
\mathrm{Eff} = \frac{\mathrm{RWY}}{\#\text{ of actions}}

Policy entropy over actions:

H\!\left(\pi(\cdot \mid s)\right) = -\sum_{a} \pi(a \mid s)\, \log \pi(a \mid s)

High entropy means more exploration; low entropy means more deterministic behavior; track this over time.
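
As a quick sketch of this metric in plain Python (no framework assumed), given the agent's action probabilities at a state:

```python
import math

def policy_entropy(action_probs):
    """Shannon entropy of pi(.|s): higher means a more exploratory policy,
    lower means more deterministic behavior."""
    return -sum(p * math.log(p) for p in action_probs if p > 0)
```

A uniform policy over four actions gives the maximum entropy log(4); a fully deterministic policy gives zero.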

Coverage metrics

Map ASVS/ testing guide controls to scenarios.

Define a coverage vector over requirement IDs R_k:

\mathrm{Coverage} = \frac{\#\text{ controls with at least one high-quality test}}{\#\text{ controls in scope}}

You can also track Markovian coverage: how frequently the agent visits specific zones of the state space, like auth or data paths, estimated by clustering logged states.

Algorithms to make this dynamic

Off-policy evaluation (OPE) for new agent policies

You don’t want to put every experimental red agent directly against your real systems. Instead:

  1. Log trajectories from baseline policies (humans, old agents)
  2. Propose a new policy \pi_{\text{new}}
  3. Use OPE to estimate how \pi_{\text{new}} would perform on the same states.

Standard tools from RL/bandits:

  • Importance Sampling (IS):
    • For each trajectory \tau, weight rewards by:
\omega(\tau) = \prod_{t} \frac{\pi_{\text{new}}(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}

then estimate:

\hat{V}_{\mathrm{IS}} = \frac{1}{N} \sum_{i=1}^{N} \omega\!\left(\tau^{(i)}\right) R\!\left(\tau^{(i)}\right)
  • Self-normalized IS (SNIS) to reduce variance:
\hat{V}_{\mathrm{SNIS}} = \frac{\sum_{i} \omega\!\left(\tau^{(i)}\right) R\!\left(\tau^{(i)}\right)}{\sum_{i} \omega\!\left(\tau^{(i)}\right)}
  • Doubly robust (DR) estimators

Combine a model-based value estimate \hat{Q}(s,a) with IS to get a low-variance, unbiased estimate.
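
A minimal sketch of the IS and SNIS estimators, assuming trajectories are logged as (state, action) pairs with one scalar return per trajectory, and policies are exposed as probability functions; these names are illustrative, not from any library:

```python
def trajectory_weight(traj, pi_new, pi_old):
    """omega(tau): product over steps of pi_new(a|s) / pi_old(a|s).
    traj is a list of (state, action) pairs; pi_* map (s, a) -> probability."""
    w = 1.0
    for s, a in traj:
        w *= pi_new(s, a) / pi_old(s, a)
    return w

def ope_is(trajs, returns, pi_new, pi_old):
    """Plain importance-sampling estimate of V(pi_new): unbiased, high variance."""
    weights = [trajectory_weight(t, pi_new, pi_old) for t in trajs]
    return sum(w * r for w, r in zip(weights, returns)) / len(trajs)

def ope_snis(trajs, returns, pi_new, pi_old):
    """Self-normalized IS: dividing by the weight sum trades a little bias
    for much lower variance."""
    weights = [trajectory_weight(t, pi_new, pi_old) for t in trajs]
    return sum(w * r for w, r in zip(weights, returns)) / sum(weights)
```

A sanity check: when the new and old policies are identical, every weight is 1 and both estimators reduce to the mean logged return.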

Safety-aware contextual bandits for online testing

The bandit problem is a fundamental topic in statistics and machine learning, focusing on decision-making under uncertainty. The goal is to maximize rewards by balancing exploration of different options and exploitation of those with the best-known outcomes. A common example is choosing among slot machines at a casino. Each has its own payout probability. You try different machines to learn which pays best. Then you continue playing the most rewarding one.

When you go online, treat “Which policy should handle this security test?” as a bandit problem:

  • Context = environment traits (service, tech stack, criticality)
  • Arms = candidate agents (policies)
  • Rewards = risk-weighted yield (for red) or residual risk reduction (for blue), with penalties for unsafe behavior

Use Thompson sampling (a Bayesian approach commonly used in multi-armed bandit problems) or Upper Confidence Bound (UCB), which relies on confidence intervals, but constrain them (e.g., allocate no more than X% of traffic to a new policy unless the lower confidence bound on its reward is above the safety floor). Recent work on safety-constrained bandits/OPE explicitly tackles this.

This gives you a continuous, adaptive “tournament” for agents without fully trusting unproven ones.
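
A toy sketch of such a tournament, assuming a Bernoulli "reward cleared the safety floor" signal and a crude traffic cap on the unproven policy; class and parameter names are illustrative:

```python
import random

class SafeThompsonRouter:
    """Thompson sampling over candidate agent policies, with a simple safety
    constraint: the unproven policy never receives more than `max_new_share`
    of total traffic."""

    def __init__(self, arms, new_arm, max_new_share=0.1):
        self.stats = {a: [1.0, 1.0] for a in arms}  # Beta(alpha, beta) per arm
        self.new_arm = new_arm
        self.max_new_share = max_new_share
        self.routed = {a: 0 for a in arms}
        self.total = 0

    def choose(self):
        """Pick the arm with the highest sampled success rate, subject to
        the traffic cap on the new policy."""
        self.total += 1
        candidates = list(self.stats)
        if (self.routed[self.new_arm] + 1) / self.total > self.max_new_share:
            candidates = [a for a in candidates if a != self.new_arm]
        arm = max(candidates, key=lambda a: random.betavariate(*self.stats[a]))
        self.routed[arm] += 1
        return arm

    def update(self, arm, success):
        """success = True if the risk-weighted reward cleared the safety floor."""
        self.stats[arm][0 if success else 1] += 1
```

In a real system you would replace the Bernoulli signal with risk-weighted yield buckets and make the cap a function of the lower confidence bound, as the text suggests.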

Sequential hypothesis testing/drift detection

You want to trigger alarms when a new version regresses:

  • Let V_A, V_B be the performance estimates (e.g., RWY@100a or TPR) for the old versus new agent.
  • Use bootstrap over scenarios/trajectories to get confidence intervals.
  • Apply sequential tests (e.g., the sequential probability ratio test) so you can stop early when it is clear that B is better or worse.
  • If performance drops below a threshold (e.g., TPR falls, or RWY@100a tanks), auto-fail the rollout (pump the brakes on the CI/CD pipeline deploying the agents).
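
The bootstrap step can be sketched as follows. This is a plain percentile bootstrap over per-scenario scores (sequential tests like SPRT are omitted for brevity); function and parameter names are illustrative:

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean(B) - mean(A) over
    per-scenario scores (e.g., per-scenario TPR or RWY@100a). Returns
    (ci_low, ci_high, auto_fail), where auto_fail flags a clear regression."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = [rng.choice(scores_a) for _ in scores_a]
        resample_b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(resample_b) / len(resample_b)
                     - sum(resample_a) / len(resample_a))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi, hi < 0  # interval entirely below zero -> regression
```
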

Dynamic scenario generation

Dynamic evals need a living corpus of tests, not just a fixed checklist.

Scenario Generator

  • Parameterize tests from frameworks like the OWASP ASVS/Testing Guide and MITRE ATT&CK into templates:
    • “Auth bypass on endpoint with pattern X”
    • “Least privilege violation in role Y”
  • Combine them with:
    • New code paths/services (from your repos & infra graph)
    • Past vulnerabilities (re-tests)
    • Recent external vulnerability classes (e.g., new serialization bugs)

Scenario selection: bandits again

You won’t run everything all the time. Use multi-armed bandits on the scenarios themselves (remember, you are looking for overall optimized outcomes under uncertainty):

  • Each scenario s_j is an arm.
  • Reward = information gain (did we learn something?) or “surprise” (difference between expected and observed agent performance).
  • Prefer:
    • High-risk, high-impact areas (per OWASP risk rating & CVSS)
    • Areas where metrics are uncertain (high variance)

This ensures your evals stay focused and fresh instead of hammering the same easy tests.
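
One way to sketch this scenario picker is with a UCB-style score; `stats` here is a hypothetical map of scenario to (mean surprise, run count, risk weight), and the scoring rule is an illustrative assumption:

```python
import math

def select_scenarios(stats, budget, total_runs):
    """UCB-style scenario selection: stats maps scenario name ->
    (mean_surprise, n_runs, risk_weight). Never-run, high-variance, and
    high-risk scenarios float to the top of the schedule."""
    def score(scenario):
        mean_surprise, n_runs, risk_weight = stats[scenario]
        if n_runs == 0:
            return float("inf")  # always try a never-run scenario first
        # Exploration bonus shrinks as a scenario accumulates runs
        bonus = math.sqrt(2 * math.log(total_runs) / n_runs)
        return risk_weight * (mean_surprise + bonus)
    return sorted(stats, key=score, reverse=True)[:budget]
```
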

Example: End-to-end dynamic eval loop

Phew! That was a lot of math. Imagine researching all of this, learning or relearning some of these concepts, and doing my day job. In the age of AI, I appreciate a good prompt that can help with research and summarize the basic essence of the papers and webpages I’ve referenced. Without further ado, let’s get into it:

  • Define the reward function for each type (yes, sounds like training mice in a lab)
    • Red teams:
r_t = \alpha \cdot \mathrm{CVSS}_{\text{found},\,t} - \beta \cdot \mathrm{false\_positive}_{t} - \gamma \cdot \mathrm{forbidden\_actions}_{t}
    • Blue teams:
r_t = -\alpha \cdot \mathrm{CVSS}_{\text{exploited},\,t} - \beta \cdot \mathrm{MTTD}_{t} - \gamma \cdot \mathrm{Overblocking}_{t}
  • Continuously generate scenarios from ASVS/ATT&CK-like templates, weighted by business criticality.
  • Schedule tests via a scenario-bandit (focus on high-risk and uncertain areas).
  • Route test to agents using safety-constrained policy bandits.
  • Log trajectories (s, a, r, s') and security outcomes (vulnerabilities found, incidents observed).
  • Run OPE offline to evaluate new agents before they touch critical environments.
  • Run sequential tests and drift detection to auto-rollback regressed versions.
  • Periodically recompute coverage & risk (this is important)
    • ASVS Coverage, RWY@time, TPR/FPR trends, calibration of risk estimates
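
The two reward functions above translate directly into code; the alpha/beta/gamma coefficients below are placeholders you would tune to your own risk appetite, not recommended values:

```python
def red_reward(cvss_found, false_positives, forbidden_actions,
               alpha=1.0, beta=0.5, gamma=2.0):
    """Red-team per-step reward: pay for severity found, penalize noise
    (false positives) and out-of-bounds behavior (forbidden actions)."""
    return (alpha * cvss_found
            - beta * false_positives
            - gamma * forbidden_actions)

def blue_reward(cvss_exploited, mttd, overblocking,
                alpha=1.0, beta=0.1, gamma=0.5):
    """Blue-team per-step reward: everything here is a penalty to minimize,
    i.e. severity exploited, slow detection, and over-aggressive blocking."""
    return (-alpha * cvss_exploited
            - beta * mttd
            - gamma * overblocking)
```
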

Risk and Concerns

Dynamic evals can still overfit if:

  • Agents memorize your test templates
  • You don’t rotate/mutate scenarios
  • You over-optimize to a narrow set of metrics (e.g., “find anything, even if low impact” → high noise)

Mitigations:

  • Keep a hidden eval set of scenarios and environments never used for training or interactive tuning (yes, this is needed)
  • Track metric diversity: impact, precision, stability, coverage
  • Require a minimum threshold on all metrics, not just one

As you can see, Dynamic Evals present challenges, but the cost of failure escalates significantly when agents perform poorly in a customer-facing scenario. The current set of work in coding, such as Agents.MD, etc., is just shortening the context window to get a reasonable amount of determinism, and the only way agents get away with it is because developers fix the code and provide the appropriate feedback.

That topic is a conversation for a different day.

The Art of Strategy: Sun Tzu and Kautilya’s Relevance Today

Sometimes it is great to look into the past to see how leaders back then dealt with the changing times. Oddly enough, some of their learnings still resonate even today. I had a chance to reread Sun Tzu’s The Art of War and the Arthashastra from Kautilya. In a world of constant competition between nations, businesses, or algorithms, these two ancient texts continue to define how leaders think about power, conflict, and decision-making. The blog this week takes a more philosophical lens to analyze strategies from the years before and their relevance in today’s world.

Separated by geography but united in purpose, both these works of literature are more than just military manuals; they are frameworks for leadership and strategy that remain stunningly relevant today.

The Philosophical Core

| Theme | Arthashastra (Kautilya) | The Art of War (Sun Tzu) |
|---|---|---|
| Objective | Build, secure, and sustain the state’s prosperity | Win conflicts with minimum destruction |
| Philosophy | Realpolitik: power is maintained through strategy, wealth, and intelligence | Dao of War: harmony between purpose, timing, and terrain |
| Moral Lens | Pragmatism anchored in moral order | Pragmatism anchored in balance and perception |
| Definition of Victory | Stability, order, and prosperity of the realm | Winning without fighting; subduing the enemy’s will |

Both authors agree: victory is not about destruction; it is about preservation of advantage.

Leadership and Governance

  • Kautilya: The leader, as the chief architect of the state, city, organization, or department, is obligated to prioritize the welfare of the people. Leadership represents both a moral and economic contract; thus, a leader’s fulfillment is intrinsically linked to the happiness of their direct reports.
  • Sun Tzu: The leader is the embodiment of wisdom, courage, and discipline, whose clarity of judgment determines the fate of armies

In modern times, in the context of Kautilya, the leader represents the CEO/statesman, designing systems of governance, incentives, and intelligence; Sun Tzu represents the COO, optimizing execution and adapting dynamically.

Power, information, and intelligence

Information in both books is seen as a strategic asset. This includes gathering information and then acting upon it; both emphasize acting on information over merely gathering it.

| Aspect | Kautilya | Sun Tzu |
|---|---|---|
| Intelligence System | Elaborate network of informants: agents disguised as monks, traders, ascetics | Emphasis on reconnaissance, deception, and surprise |
| Goal of Data Gathering | Internal vigilance and monitoring of external influence | Tactical advantage and surprise |
| Philosophical View | Informants are the eyes of the leader | All warfare is based on deception and having leverage |

In the age of data and AI, the lesson is clear: those who control information and stories will succeed in the long run.

War, Diplomacy, and the Circle of Power

  • Kautilya’s Mandala Theory: Every neighboring state is a potential enemy; the neighbor’s neighbor is a natural ally. The world is a circle of competing interests, requiring constant calibration of peace, war, neutrality, and alliance.
  • Sun Tzu’s Doctrine: War is a last resort; the wise commander wins through timing, positioning, and perception.

Modern parallel:

Global supply chains, tech alliances, and regulatory blocs function exactly like Kautilya’s mandala: interdependent, fluid, and shaped by mutual deterrence.

Economics as a strategy

The Art of War focuses on conflict, while the Arthashastra expands into economics as the engine of statecraft. Kautilya views wealth as the foundation of power, with taxation, trade, and public welfare as strategic levers.

“The state’s strength lies not in the sword, but in the prosperity of its people.”

In business terms, this is all platform economics; power arises from resource control, efficient networks, and sustainable growth, not endless confrontation.

Ethics, Pragmatism and the Moral Dilemma

Both authors are deeply pragmatic but neither amoral.

  • Kautilya: Ends justify means only when serving public welfare. Ethics are flexible but purpose-driven.
  • Sun Tzu: Advocates balance, ruthless efficiency tempered by compassion, and self-discipline.

For modern leaders, this balance is critical: strategic ruthlessness without moral erosion.

Enduring Lesson for Today

| Timeless Principle | Modern Interpretation |
|---|---|
| Know yourself and your adversary | Data, market, and competitive intelligence |
| Control information and perception | Own the narrative, brand, and customer psychology |
| Adapt to the terrain | Agility in shifting markets and technologies |
| Economy of effort | Lean operations, precision focus |
| Moral legitimacy | Trust, transparency, and long-term brand equity |

Both texts converge on the following point:

Leadership is the art of aligning intelligence, timing, and purpose, not merely commanding resources.

Fusion Mindset

If Sun Tzu teaches how to win battles, Kautilya teaches how to build empires. Combined, they offer a 360-degree view of power:

  • Sun Tzu = Operational mastery: speed, tactical advantage, and timing.
  • Kautilya = Structural mastery: governance, economics, and intelligence.

Together they form a dual playbook for today’s complex systems, from nation-states to digital ecosystems.

Conclusion

Both The Art of War and Arthashastra remind us that strategy is timeless because human behavior is timeless.

Whether you lead a nation, a company, or a team, the challenges are the same: limited resources, competing interests, and the need to act with clarity under uncertainty

In the end, wisdom isn’t knowing when to fight; it’s knowing when to build, when to adapt, and when to walk away.