LLM-based infrastructure becomes fundamentally challenging the moment you integrate memory, tools, feedback, and goals. At that point, you are no longer dealing with the non-determinism of a language model. You are building something closer to a new operating system, one with its own language-based state, implicit dependencies, distributed control flow, and an expanding set of failure modes, any of which can surface at any time.
Both agentic applications and LLM infrastructure layers introduce their own operational challenges. But agents, in particular, cross a threshold: flexibility, reasoning, and autonomous decision-making come at the cost of debuggability, predictability, and safety.
Agent OS: Reference Architecture
The key shift is to stop treating agents like “smart functions” and treat them like a distributed system that needs an operating layer: state semantics, execution replay, observability, reliability controls, and isolation boundaries.

From “Non-Determinism” to Distributed Failure
As agents introduce reasoning and autonomous decision-making, they also introduce complex control flows. If an agent fails at step 6 in a 10-step workflow, rerunning the same task may result in failure at step 1. Nothing “changed,” yet everything changed.
Because:
- Planning is probabilistic.
- Memory retrieval is approximate.
- Tools are unreliable.
- An intermediate state is mutable and often shared.
Memory: The Bottleneck Nobody Admits
Agents need context. They remember facts, refer to earlier steps, and plan ahead. But storing and retrieving memory—whether vectorized or tokenized—quickly becomes a bottleneck in both latency and accuracy. Most memory systems are leaky, brittle, and often misaligned with the model’s representation space.
Vector similarity optimizes for “semantic closeness,” not correctness. Wrong memories get retrieved confidently, uncertainty collapses into “facts,” and errors compound downstream.
Tools Make Everything Worse (Operationally)
Tools fail in ways agents typically do not handle gracefully: timeouts with empty payloads, partial responses, rate limits, schema changes, and transient network failures. When this happens, the agent must recover without hallucinating, looping indefinitely, or writing an incorrect state into memory. Most do not.
MCP and A2A are necessary components, but they are not sufficient on their own.
MCP and A2A standardize the wiring: message framing, tool invocation, and transport. But they do not standardize the semantics of state: what memory means, how it’s scoped/versioned, how multi-agent writes are coordinated, and how failures are localized.
Without memory versioning, namespacing, synchronization, and access control, multi-agent systems drift into hard-to-debug behavior.
Incident Postmortems: What Actually Breaks
Incident #1: Tool Timeout → Hallucinated Recovery → Memory Contamination
Summary
An agent generated a confident but incorrect remediation plan. The root cause was a cascading failure across tooling, control flow, and memory, not “hallucination” as a primary failure.
- Trigger: A vulnerability-scanning API timed out and returned empty but “successful” output.
- Agent Interpretation: Empty result was treated as “no issues found” rather than “unknown.”
- State Corruption: The agent wrote a semantic memory: “System scanned; no critical vulnerabilities detected.”
- Downstream Impact: A second agent retrieved this as fact and suppressed additional checks.
Root Cause
- Ambiguous tool contract (empty ≠ success)
- No typed memory/confidence scoring/provenance
- No enforced distinction between “unknown” vs “safe”
Why it was hard to debug
- Logs showed a “successful” tool call
- The final output schema was valid
- No trace linked the memory write to partial/failed tool state
Incident #2: Cross-Agent Memory Contamination in an A2A Workflow
Summary
An execution agent acted on another agent’s internal planning state, causing nondeterministic failures across reruns.
- Trigger: The planning agent wrote a draft plan into shared memory.
- Misread: The execution agent treated it as approved instructions.
- Drift: Partial execution failed; retries rewrote partial outcomes.
- Heisenbug: Replays failed earlier each time as shared state mutated.
Root Cause
- No memory namespace separation by agent role or task phase
- No lifecycle markers (draft vs final; executable vs non-executable)
- Shared mutable state without coordination or ACLs
Why it was hard to debug
- Each agent looked “correct” in isolation
- Transport and schemas were valid
- The failure existed only in cross-agent semantics
Minimum Viable Ops Layer for Agentic Systems
Reducing this to its bare minimum, production-grade agents necessitate new primitives, not additional prompts.
1) Replayable Execution
- Capture: model version, prompt hash, retrieved memory IDs, tool schemas, tool responses, routing decisions
- Enable frozen replays to separate reasoning drift from world drift
2) Typed, Versioned Memory
- Types: episodic (run log), semantic (facts), procedural (policies/playbooks), working set (scratch)
- Every entry: scope, timestamp, source, confidence, TTL, ACL
3) Explicit Tool Contracts
- Empty/partial/timeout are first-class outcomes
- Idempotency by default for write actions
- Retry safety classification (retryable vs unsafe-to-retry)
4) Distributed Tracing Across Agents
- Correlation IDs spanning A2A hops
- Reason codes (“why tool X was chosen,” “why memory Y was written ”)
- Schema validation gates at boundaries
5) Cognitive Circuit Breakers
- Loop detection based on non-progression
- Retry budgets per intent (not per step)
- Graceful escalation paths when uncertainty remains high
6) Security and Isolation
- Memory ACLs between agents and namespaces
- Provenance tracking for tool outputs
- Sanitize tool outputs before re-injection into prompts
Conclusion: This Is Not LLM Ops. It’s Systems Engineering
The industry frames agent failures as “LLMs being non-deterministic.” In practice, agentic systems fail for the same reasons distributed systems fail: unclear state ownership, leaky abstractions, ambiguous contracts, missing observability, and unbounded blast radius.
MCP and A2A solve interoperability. They do not solve operability. Until we treat agents as stateful, fallible, adversarial, and long-running systems, we will keep debugging step-6 failures that reappear at step-1 and calling it hallucination.
What is lacking is not an improved model. It’s an operating layer that assumes failure as the default condition.
Check out the following articles on the topic in the references section for more details.
References
Multi-agent frameworks including AutoGen, LangGraph, and CrewAI: empirical evidence from production usage and open-source implementations.
Russell, S., & Norvig, P. Artificial Intelligence: A Modern Approach (4th ed.). Pearson, 2020.
Wooldridge, M. An Introduction to MultiAgent Systems. Wiley, 2009.
Amodei, D. et al. “Concrete Problems in AI Safety.” arXiv, 2016. https://arxiv.org/abs/1606.06565
Lewis, P. et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” arXiv, 2020. https://arxiv.org/abs/2005.11401
Liu, N. et al. “Lost in the Middle: How Language Models Use Long Contexts.” arXiv, 2023. https://arxiv.org/abs/2307.03172
Karpukhin, V. et al. “Dense Passage Retrieval for Open-Domain QA.” arXiv, 2020. https://arxiv.org/abs/2004.04906
Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” arXiv, 2023. https://arxiv.org/abs/2210.03629
Shen, Y., et al. “Toolformer: Language Models Can Teach Themselves to Use Tools.” arXiv, 2023. https://arxiv.org/abs/2302.04761
Madaan, A. et al. “Self-Refine: Iterative Refinement with Self-Feedback.” arXiv, 2023. https://arxiv.org/abs/2303.17651
Lamport, L. “Time, Clocks, and the Ordering of Events in a Distributed System.” 1978. PDF
Kleppmann, M. Designing Data-Intensive Applications. O’Reilly, 2017.
Fowler, M. “Patterns of Distributed Systems.” martinfowler.com
Beyer, B. et al. Site Reliability Engineering. Google, 2016. https://sre.google/sre-book/
OpenTelemetry Specification. https://opentelemetry.io/docs/specs/
Greshake, K. et al. “Not What You’ve Signed Up For.” arXiv, 2023. https://arxiv.org/abs/2302.12173
OWASP. “Top 10 for Large Language Model Applications.” OWASP LLM Top 10
Anthropic. “Model Context Protocol (MCP).” Anthropic MCP























