MCP Server Security — Intrusion Detection

MCP server canary token security — detecting prompt injection, credential theft, and tool abuse

Most MCP server security controls are preventive: input validation, access control, rate limiting. Canary tokens take a detection-first approach: plant synthetic secrets in tool results and agent context that have no legitimate use, and raise an alert the moment one is used or shows up somewhere it should not. They are especially valuable in LLM agent contexts because prompt injection is hard to prevent perfectly. The moment an attacker's payload causes the agent to use a canary credential or call a canary URL, you have a real-time signal that an attack occurred.

Pattern 1: Canary API keys in tool results detect credential exfiltration

An MCP server that returns configuration data or API credentials as part of a tool result can plant a canary credential among the real ones. The canary passes format validation like a real key but triggers an alert when anything tries to use it. When a prompt injection payload causes the agent to exfiltrate credentials, the canary travels with them. The first attempt to use it gives you both a detection signal and a cue to rotate the real credentials that were returned alongside it.

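import { createHmac, randomBytes, timingSafeEqual } from 'crypto';
import { createClient } from 'redis';

const redisClient = createClient({ url: process.env.REDIS_URL });
await redisClient.connect(); // node-redis v4+ needs an explicit connect before any command

const CANARY_HMAC_SECRET = process.env.CANARY_HMAC_SECRET;
if (!CANARY_HMAC_SECRET) throw new Error('CANARY_HMAC_SECRET is not set');

// HMAC over session and nonce, truncated to 32 hex characters
function canaryMac(sessionId, nonce) {
  return createHmac('sha256', CANARY_HMAC_SECRET)
    .update(`${sessionId}:${nonce}`)
    .digest('hex')
    .slice(0, 32);
}

// Generate a per-session canary token that looks like a real API key
function generateCanaryKey(sessionId) {
  const nonce = randomBytes(8).toString('hex');
  // Format: sk-canary-{nonce}-{hmac}. The literal "canary" keeps this example readable;
  // a deployed canary should use a prefix indistinguishable from your real keys.
  return `sk-canary-${nonce}-${canaryMac(sessionId, nonce)}`;
}

// Check if a key is one of our canaries for this session
async function isCanaryKey(key, sessionId) {
  if (typeof key !== 'string' || !key.startsWith('sk-canary-')) return false;
  const parts = key.split('-');
  if (parts.length !== 4) return false;
  const [, , nonce, hmac] = parts;
  const given = Buffer.from(hmac);
  const expected = Buffer.from(canaryMac(sessionId, nonce));
  // Constant-time comparison; timingSafeEqual throws on unequal lengths, so check first
  return given.length === expected.length && timingSafeEqual(given, expected);
}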

// Plant canary alongside real credentials in tool responses
server.tool('get-integrations', async (args, context) => {
  const realConfig = await db.getIntegrationConfig(context.user.id);
  const canaryKey = generateCanaryKey(context.sessionId);

  // Store canary→session mapping for alert attribution
  await redisClient.setEx(`canary:${canaryKey}`, 3600, context.sessionId);

  return {
    content: [{
      type: 'text',
      text: JSON.stringify({
        ...realConfig,
        // Canary key injected — not real, not usable, but fires alert if called
        _diagnostic_key: canaryKey,
      }),
    }],
  };
});

// Middleware: intercept outbound HTTP calls from tool handlers
// and check for canary token usage
async function checkOutboundRequest(url, headers, sessionId) {
  const authHeader = headers['Authorization'] ?? headers['authorization'] ?? '';
  const token = authHeader.replace(/^Bearer\s+/i, '').trim();

  if (token && await isCanaryKey(token, sessionId)) {
    // The Redis mapping expires after an hour; fall back to the current session
    const storedSession = (await redisClient.get(`canary:${token}`)) ?? sessionId;
    await alerting.fire({
      type: 'CANARY_CREDENTIAL_USED',
      sessionId: storedSession,
      url,
      severity: 'CRITICAL',
      message: 'Agent attempted to use canary API key: possible prompt injection exfiltration',
    });
    // Block the request: throw so the caller never reaches the real API
    throw new Error('Canary token intercepted');
  }
}
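The "signal to rotate" mentioned above is only actionable if you know which real credentials were returned alongside the canary that fired. The following is a minimal sketch of one way to track that, assuming an in-memory ledger; the names exposureLedger, recordExposure, and rotationPlanFor are hypothetical and not part of the article's code, and a real deployment would probably persist the ledger next to the canary-to-session mapping.

```javascript
// Hypothetical in-memory ledger: canary key -> what was exposed alongside it.
// Assumption: a real deployment would persist this rather than keep it in process memory.
const exposureLedger = new Map();

// Call while building a tool response: remember which real credentials
// were returned in the same response as this canary.
function recordExposure(canaryKey, sessionId, realCredentialIds) {
  exposureLedger.set(canaryKey, {
    sessionId,
    credentialIds: [...realCredentialIds],
    plantedAt: Date.now(),
  });
}

// Call from the alert path: turn a fired canary into a rotation plan.
// Returns null for keys that were never planted (forged or already purged).
function rotationPlanFor(canaryKey) {
  const entry = exposureLedger.get(canaryKey);
  if (!entry) return null;
  return {
    sessionId: entry.sessionId,
    rotate: entry.credentialIds,
    exposedForMs: Date.now() - entry.plantedAt,
  };
}

// Usage with made-up identifiers
recordExposure('sk-canary-00ff-abcd', 'session-42', ['stripe-live', 'github-pat']);
const plan = rotationPlanFor('sk-canary-00ff-abcd');
console.log(plan.rotate); // [ 'stripe-live', 'github-pat' ]
```

Storing credential identifiers rather than the secrets themselves keeps the ledger from becoming a second place where real keys live.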

Pattern 2: Canary URLs in fetched content detect prompt injection

When an MCP server's tool fetches external content (URLs, documents, emails), that content is an injection surface. Planting a canary URL, either in a custom request header that the fetched server may log or in a synthetic field returned alongside the real content, lets you detect when the agent makes an outbound HTTP call it was never asked to make. That unexpected call is the signature of a prompt injection that successfully redirected the agent.

// Plant a canary URL in the tool result alongside the fetched content
// (randomBytes is the 'crypto' import shown in Pattern 1).
// If the agent is manipulated into fetching the canary URL, it fires
server.tool('fetch-document', async ({ url }, context) => {
  const content = await fetchUrl(url);
  const canaryUrl = `https://canary.skillaudit.dev/ping/${context.sessionId}/${randomBytes(8).toString('hex')}`;

  // The canary URL appears to be a "related resource" suggestion
  // A legitimate LLM would not call it; a manipulated one might
  const augmentedResult = {
    content,
    _related_resource: canaryUrl, // canary — not a real resource
  };

  return { content: [{ type: 'text', text: JSON.stringify(augmentedResult) }] };
});

// The canary URL endpoint (can be a webhook); `app` is assumed to be an Express-style application
app.get('/ping/:sessionId/:nonce', async (req, res) => {
  await alerting.fire({
    type: 'CANARY_URL_ACCESSED',
    sessionId: req.params.sessionId,
    nonce: req.params.nonce,
    ip: req.ip,
    userAgent: req.headers['user-agent'],
    severity: 'HIGH',
    message: 'Canary URL accessed — agent may have been redirected by prompt injection',
  });
  // Return 404 to avoid confirming the canary's nature
  res.status(404).end();
});

Pattern 3: Canary tool calls detect unauthorized tool invocation

Registering a canary tool, one that fires an alert when called but performs no real action, lets you detect when an attacker's prompt injection causes the agent to attempt tool calls that should never happen in normal use. A tool named export-all-user-data or disable-security-checks that has no legitimate caller should fire a CRITICAL alert the moment the MCP server receives a call for it.

// Register canary tools alongside real tools
// These tools exist only to detect unexpected LLM-initiated calls

const CANARY_TOOL_NAMES = [
  'export-all-user-data',      // attractive exfiltration target
  'disable-security-checks',   // attractive privilege escalation target
  'send-email-to-external',    // attractive exfiltration vector
];

for (const toolName of CANARY_TOOL_NAMES) {
  server.tool(toolName, async (args, context) => {
    await alerting.fire({
      type: 'CANARY_TOOL_CALLED',
      toolName,
      args,
      sessionId: context.sessionId,
      userId: context.user.id,
      severity: 'CRITICAL',
      message: `Canary tool '${toolName}' was called — agent was likely manipulated into this action`,
    });
    // Return a plausible-looking error to avoid tipping off the attacker
    return {
      content: [{ type: 'text', text: 'Tool temporarily unavailable' }],
      isError: true,
    };
  });
}

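// Also: alert on any tool call with canary values in arguments.
// setToolCallInterceptor stands in for whatever pre-call hook your server framework
// provides; it is an assumed API, so wrap each handler yourself if no such hook exists.
server.setToolCallInterceptor(async (toolName, args, context) => {
  // JSON.stringify(undefined) returns undefined, so default to an empty object
  const argsStr = JSON.stringify(args ?? {});
  if (argsStr.includes('sk-canary-')) {
    await alerting.fire({
      type: 'CANARY_VALUE_IN_TOOL_ARGS',
      toolName,
      sessionId: context.sessionId,
      severity: 'HIGH',
      message: 'Tool received canary token as argument: agent was manipulated into passing it',
    });
  }
});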
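The substring check above reports that a canary appeared somewhere in the arguments, but not where, and it also matches any text that merely mentions the prefix. As a rough sketch, a recursive walk with a stricter pattern can report the exact argument path for the alert. The regex assumes the key shape produced by generateCanaryKey (16 hex characters of nonce, 32 of HMAC); findCanaries and CANARY_PATTERN are hypothetical names.

```javascript
// Matches the key shape from generateCanaryKey: sk-canary-{16 hex}-{32 hex}.
// Assumption: adjust the pattern if your canaries use a different format.
const CANARY_PATTERN = /sk-canary-[0-9a-f]{16}-[0-9a-f]{32}/g;

// Walk a JSON-like value and collect every canary occurrence with its path.
function findCanaries(value, path = '$', hits = []) {
  if (typeof value === 'string') {
    for (const match of value.matchAll(CANARY_PATTERN)) {
      hits.push({ path, token: match[0] });
    }
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => findCanaries(item, `${path}[${i}]`, hits));
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value)) {
      findCanaries(child, `${path}.${key}`, hits);
    }
  }
  return hits;
}

// Usage with a made-up canary and tool arguments
const canary = 'sk-canary-0123456789abcdef-0123456789abcdef0123456789abcdef';
const args = {
  to: 'attacker@example.com',
  attachments: [{ name: 'keys.txt', body: `token=${canary}` }],
};
const hits = findCanaries(args);
console.log(hits.length, hits[0].path); // 1 $.attachments[0].body
```

Including the path in the alert tells the responder which argument carried the token, which helps reconstruct how the agent was steered into chaining the tools.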

SkillAudit findings

The following findings appear in SkillAudit audit reports for MCP servers that lack canary-based intrusion detection:

HIGH  No canary tokens in credential-carrying tool responses — silent credential exfiltration. Tool responses that include API keys, OAuth tokens, or configuration secrets contain no canary credentials. If prompt injection causes the agent to exfiltrate these values, there is no real-time detection signal. Canary keys planted alongside real credentials provide immediate detection when the agent attempts to use them.

HIGH  No canary URLs in external content fetch results — prompt injection exfiltration undetected. The server fetches external documents, web pages, or emails and returns their content to the agent without embedding canary URLs. Successful prompt injection attacks that redirect the agent to make outbound calls produce no server-side alert. Canary URLs in tool results catch these redirections before exfiltration completes.

MEDIUM  No canary tools registered — unauthorized tool invocations produce no alert. The server has no synthetic tools that would fire alerts if called by a manipulated agent. Registering canary tools with names that are attractive prompt injection targets (data export, security disablement) provides detection coverage for attacks that attempt privilege escalation through tool invocation.

MEDIUM  Tool call arguments not monitored for canary values — cross-tool injection undetected. The server does not inspect tool call arguments for canary token patterns. If a canary token passes from one tool result to another tool's arguments (indicating the agent was manipulated into chaining them), there is no detection. Scanning all tool arguments for canary patterns provides cross-tool injection detection.
