MCP server Unicode normalization security: homograph attacks and encoding exploits in AI tool handlers

Unicode normalization attacks exploit the fact that the same visual string can have multiple distinct encodings, and that string comparison in most languages checks code units or bytes, not visual equivalence. An MCP server that compares a user-supplied identifier against an allowlist may therefore treat a visually identical but differently encoded value as a different string. For AI-controlled tool arguments this matters more than for human-typed input, because a model can emit visually similar characters from other scripts when instructed to, and will reproduce them when it processes prompt-injected content that contains them.

The normalization problem

Unicode defines four normalization forms (NFC, NFD, NFKC, NFKD) that determine how composed and decomposed character representations are canonicalized. The letter "ñ" can be stored as a single precomposed code point (U+00F1, the NFC form) or as the letter "n" followed by a combining tilde (U+006E U+0303, the NFD form). Both render identically and both pass human review, but "\u00F1" === "n\u0303" evaluates to false in JavaScript because the comparison sees different code units.
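
A minimal sketch of that mismatch, using only the built-in String.prototype.normalize. The variable names are illustrative.

```javascript
// Two encodings of the same visible character.
const precomposed = '\u00F1';   // single code point (NFC form)
const decomposed = 'n\u0303';   // "n" + COMBINING TILDE (NFD form)

// Strict equality compares UTF-16 code units, so the two strings differ.
const rawEqual = precomposed === decomposed;

// Normalizing both sides to the same form makes them comparable.
const normalizedEqual =
  precomposed.normalize('NFC') === decomposed.normalize('NFC');

console.log(rawEqual, normalizedEqual);              // false true
console.log(precomposed.length, decomposed.length);  // 1 2
```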

For MCP servers, this creates three classes of risk:

  1. Allowlist bypass: The allowlist contains the NFC form; the model sends the NFD form. The comparison fails; the value is not found in the allowlist; the server takes the else-branch (which may be less restricted).
  2. Identifier confusion: A resource identifier that should be unique has multiple valid Unicode representations. Two different AI conversations create resources with visually identical but bytewise-different names, leading to data inconsistency.
  3. Log poisoning via invisible and bidirectional characters: Bidirectional controls such as U+202E RIGHT-TO-LEFT OVERRIDE reorder displayed text, and zero-width characters such as U+200B ZERO WIDTH SPACE hide inside it, so log entries mislead human reviewers.
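
The second risk class can be reduced by canonicalizing names once, at the point of creation. The sketch below assumes a hypothetical in-memory registry (a Map); a real server would apply the same canonical key before writing to its own store.

```javascript
// Hypothetical registry used only for illustration.
const registry = new Map();

// One canonical form for every name that is stored or looked up.
function canonicalKey(name) {
  if (typeof name !== 'string') throw new TypeError('name must be a string');
  return name.normalize('NFC');
}

function createResource(name, payload) {
  const key = canonicalKey(name);
  if (registry.has(key)) {
    throw new Error('A resource with an equivalent name already exists');
  }
  registry.set(key, payload);
  return key;
}
```

With this in place, creating "caf\u00E9" and then "cafe\u0301" raises an error on the second call instead of storing two resources that look the same.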

Vulnerability pattern 1: allowlist bypass via confusable characters

A permission check that compares against a known set of resource names:

// Vulnerable
const ALLOWED_RESOURCES = ['reports', 'invoices', 'receipts'];

async function fetchResource(resourceType) {
  // Compares raw code units; the argument is never normalized
  if (!ALLOWED_RESOURCES.includes(resourceType)) {
    throw new Error('Resource type not allowed');
  }
  return db.query('SELECT * FROM resources WHERE type = ?', [resourceType]);
}

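The string "reports" in the allowlist is ASCII. A model that receives a prompt-injected instruction to access reports might supply a string that renders as "reports" but substitutes Cyrillic lookalikes for the "e" and "p" (U+0435 and U+0440). The includes check does not match, because the mixed-script string is not in the ASCII allowlist, and this handler throws before the query runs. That outcome is safe only while every path fails closed. If the same comparison guards a denylist, or a mismatch falls through to a default branch, the variant reaches the SQL query, and a database collation that ignores case, accents, or character width may then match it to a row the check treated as different. Cross-script lookalikes like the Cyrillic example are usually not folded by collations or by Unicode normalization, but they still mislead anyone reviewing the request or its logs.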
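
Because lookalike input is invisible to the eye, it helps to inspect code points directly. The helper below is a hypothetical debugging aid (findNonAscii is not part of any library) that lists every character outside printable ASCII.

```javascript
// Lists each character outside printable ASCII with its position and code point.
function findNonAscii(value) {
  const findings = [];
  let index = 0;
  for (const ch of value) {            // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x20 || cp > 0x7e) {
      findings.push({
        index,
        codePoint: 'U+' + cp.toString(16).toUpperCase().padStart(4, '0'),
      });
    }
    index += ch.length;                // advance in UTF-16 code units
  }
  return findings;
}

// "reports" with Cyrillic U+0435 in place of "e" and U+0440 in place of "p".
const lookalike = 'r\u0435\u0440orts';
console.log(lookalike === 'reports');  // false
console.log(findNonAscii(lookalike));
```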

Vulnerability pattern 2: normalization mismatch in an extension check

A filename extension check using string comparison:

// Vulnerable
function isAllowedExtension(filename) {
  return filename.endsWith('.md');  // compares raw code units, no normalization
}

// ASCII ".md" is identical in NFC and NFD, so the mismatch comes from
// compatibility variants: "notes.\uFF4D\uFF44" (fullwidth "md") reads as
// ".md" to a person but fails this check, while a later layer that applies
// NFKC sees a plain ".md"

The risk is bidirectional. A legitimate file whose extension arrives in a different Unicode form is incorrectly rejected, and a malicious file can carry a name that this check reads one way and a later, normalizing layer (filesystem, storage API, or renderer) reads another way. Both are security-relevant: false rejections erode trust, and a mismatch between check and use allows bypass.
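
One way to close the gap is to normalize first and then use the normalized name for both the check and everything downstream. This is a sketch under assumptions: checkedFilename, ALLOWED_EXTENSIONS, and INVISIBLE are illustrative names, and the allowed set is a placeholder.

```javascript
const ALLOWED_EXTENSIONS = new Set(['.md']);

// Zero-width and bidi control characters that have no place in a filename.
const INVISIBLE = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069]/;

// Returns the normalized filename if its extension is allowed, otherwise throws.
function checkedFilename(rawName) {
  if (typeof rawName !== 'string') throw new TypeError('filename must be a string');
  if (INVISIBLE.test(rawName)) throw new Error('filename contains invisible characters');

  // NFKC folds compatibility variants such as fullwidth letters to ASCII.
  const name = rawName.normalize('NFKC');
  const dot = name.lastIndexOf('.');
  const extension = dot > 0 ? name.slice(dot) : '';
  if (!ALLOWED_EXTENSIONS.has(extension)) throw new Error('Extension not allowed');
  return name;
}
```

Callers should pass the returned value, not the raw argument, to the filesystem or storage layer so that the check and the use see the same string.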

Vulnerability pattern 3: bidirectional text in log output

A logging tool that records AI-generated content verbatim:

// Vulnerable
async function logAction(description) {
  // Writes the model-supplied text verbatim, including invisible characters
  await auditLog.write(`[${new Date().toISOString()}] Action: ${description}`);
}

If description is "Transferred \u202E000,01\u202C files", the RIGHT-TO-LEFT OVERRIDE (U+202E) reverses the rendering of the digits until the POP DIRECTIONAL FORMATTING character (U+202C). An auditor reading the log in a bidi-aware viewer sees "Transferred 10,000 files" while the stored characters read "Transferred 000,01 files". This is a log evasion technique aimed specifically at human auditors of AI-generated action logs.
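
A handler can detect such input before it is written. The check below is a minimal sketch; hasBidiControl and BIDI_CHARS are illustrative names, and the character class covers the explicit directional marks, embeddings, overrides, and isolates.

```javascript
// U+061C, U+200E, U+200F: directional marks
// U+202A-U+202E: embeddings and overrides; U+2066-U+2069: isolates
const BIDI_CHARS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/;

function hasBidiControl(value) {
  return BIDI_CHARS.test(String(value));
}

const suspicious = 'Transferred \u202E000,01\u202C files';
console.log(hasBidiControl(suspicious));                  // true
console.log(hasBidiControl('Transferred 10,000 files'));  // false
```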

Safe pattern: normalize-before-compare

Apply Unicode normalization (NFC or NFKC) to both the incoming argument and the comparison target before any security-relevant string comparison:

// Normalize argument before allowlist check
const ALLOWED = ['reports', 'invoices', 'receipts'].map(s => s.normalize('NFKC'));

function normalizeArg(value) {
  if (typeof value !== 'string') throw new Error('Must be a string');
  return value.normalize('NFKC');  // decompose compatibility variants, then compose
}

async function fetchResource(rawType) {
  const resourceType = normalizeArg(rawType);
  if (!ALLOWED.includes(resourceType)) throw new Error('Not allowed');
  return db.query('SELECT * FROM resources WHERE type = ?', [resourceType]);
}

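NFKC is stronger than NFC for security comparisons: in addition to canonical equivalents, it folds compatibility equivalents (Roman numeral Ⅲ → III, ligature ﬁ (U+FB01) → fi, superscript ² → 2), which removes one class of lookalike input. It does not map cross-script homoglyphs: a Cyrillic "е" (U+0435) is still Cyrillic after NFKC, so those must be handled by restricting the character set.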
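
The difference between the two forms can be checked directly. This sketch uses escape sequences so the code points are unambiguous; the samples object is illustrative.

```javascript
const samples = {
  romanNumeral: '\u2162',  // ROMAN NUMERAL THREE
  ligature: '\uFB01',      // LATIN SMALL LIGATURE FI
  superscript: '\u00B2',   // SUPERSCRIPT TWO
  cyrillicIe: '\u0435',    // CYRILLIC SMALL LETTER IE, looks like Latin "e"
};

// Record how each sample behaves under NFC and NFKC.
const results = {};
for (const [label, value] of Object.entries(samples)) {
  results[label] = {
    nfcChanged: value.normalize('NFC') !== value,
    nfkc: value.normalize('NFKC'),
  };
}

console.log(results.romanNumeral.nfkc);           // III
console.log(results.ligature.nfkc);               // fi
console.log(results.superscript.nfkc);            // 2
console.log(results.cyrillicIe.nfkc === 'e');     // false: still Cyrillic
```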

Safe pattern: confusable character rejection for identifiers

For identifiers that should be purely ASCII (resource names, usernames, API keys), reject anything outside the expected character set entirely:

function assertAsciiIdentifier(value, name) {
  if (typeof value !== 'string') throw new Error(`${name} must be a string`);
  if (!/^[a-zA-Z0-9_-]+$/.test(value)) {
    throw new Error(`${name} must contain only ASCII letters, digits, hyphens, underscores`);
  }
  return value;
}
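
Where an identifier genuinely needs non-ASCII characters, a weaker fallback is to reject values that mix scripts. The sketch below relies on Unicode property escapes in JavaScript regular expressions; the three scripts listed are an illustrative subset, and assertSingleScript is a hypothetical helper. It will not catch a name written entirely in one lookalike script.

```javascript
// Illustrative subset of scripts; extend for the scripts you actually accept.
const SCRIPT_PATTERNS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

// Returns the names of the listed scripts that appear in the value.
function scriptsIn(value) {
  const found = new Set();
  for (const ch of value) {
    for (const [script, pattern] of Object.entries(SCRIPT_PATTERNS)) {
      if (pattern.test(ch)) found.add(script);
    }
  }
  return [...found];
}

function assertSingleScript(value, name) {
  const scripts = scriptsIn(value);
  if (scripts.length > 1) {
    throw new Error(`${name} mixes scripts: ${scripts.join(', ')}`);
  }
  return value;
}
```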

Safe pattern: bidirectional character stripping in log output

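// Zero-width characters and directional marks (U+200B-U+200F),
// bidi embeddings and overrides (U+202A-U+202E), and isolates (U+2066-U+2069)
const BIDI_CONTROL = /[\u200B-\u200F\u202A-\u202E\u2066-\u2069]/g;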

function sanitizeForLog(value) {
  return String(value)
    .normalize('NFC')
    .replace(BIDI_CONTROL, '')   // strip bidi control characters
    .replace(/[\x00-\x1f\x7f]/g, '');  // strip other control chars
}
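
Stripping removes the evidence that an attack was attempted. An alternative, sketched here with the hypothetical helper escapeForLog, is to replace each invisible or control character with a visible escape so an auditor can see exactly what was sent. Backslashes are escaped too, so the input cannot forge an escape sequence.

```javascript
// Backslash, C0 controls, DEL, and the zero-width and bidi characters.
const INVISIBLE_OR_CONTROL =
  /[\\\u0000-\u001F\u007F\u061C\u200B-\u200F\u202A-\u202E\u2066-\u2069]/g;

function escapeForLog(value) {
  return String(value)
    .normalize('NFC')
    .replace(INVISIBLE_OR_CONTROL, (ch) =>
      '\\u{' + ch.codePointAt(0).toString(16).toUpperCase() + '}');
}

console.log(escapeForLog('Transferred \u202E000,01\u202C files'));
// Transferred \u{202E}000,01\u{202C} files
```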

SkillAudit detection

The Security axis flags Unicode normalization risk through static analysis: string comparisons against allowlists where no normalization call precedes the comparison, log writes that include unsanitized tool arguments without bidi stripping, and identifier fields that accept non-ASCII input without a character-class restriction. The LLM-probe layer sends confusable variants of expected values and bidi-reversed strings and observes whether the server rejects them or allows unexpected paths. Findings are classified MEDIUM for allowlist bypass via normalization mismatch and LOW for bidi log evasion (absent evidence of exploitation).
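
The same idea can be applied to your own handlers in a unit test. The sketch below is not SkillAudit's implementation; probeVariants and acceptedVariants are hypothetical helpers that generate a few variants of an expected value and report which ones a given check accepts.

```javascript
// Maps printable ASCII to the fullwidth block (U+FF01-U+FF5E).
function toFullwidth(ascii) {
  return Array.from(ascii, (ch) => {
    const cp = ch.codePointAt(0);
    return cp >= 0x21 && cp <= 0x7e ? String.fromCodePoint(cp + 0xfee0) : ch;
  }).join('');
}

function probeVariants(expected) {
  return {
    fullwidth: toFullwidth(expected),
    zeroWidth: expected.slice(0, 1) + '\u200B' + expected.slice(1),
    bidiWrapped: '\u202E' + expected + '\u202C',
  };
}

// Returns the labels of the variants for which check(variant) is truthy.
function acceptedVariants(check, expected) {
  return Object.entries(probeVariants(expected))
    .filter(([, variant]) => check(variant))
    .map(([label]) => label);
}
```

A check that normalizes with NFKC accepts the fullwidth variant, which is fine as long as the normalized value is what the handler goes on to use. Any accepted zero-width or bidi variant deserves a closer look.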

Run a free audit at skillaudit.dev or see our related guides: null byte injection security for the companion control-character class, and input validation security for the complete sanitization pattern library.