MCP server security testing — what to test, how, and what tools actually find

"Security testing" for an MCP server isn't one thing — it's a coverage decision. Here's the test plan we run against every server in the 101-server corpus: the eight finding classes worth testing for, the four test types that catch each, what each one misses, and the trade-offs between fast static checks and slower LLM-probe runs.

TL;DR

MCP server security testing covers eight finding classes — SSRF, command-exec, credential exposure, prompt-injection, over-broad permissions, dependency CVEs, maintenance signal, and client-compatibility drift — using four test types: static analysis (SAST), dynamic / fuzzing (live tool-call probes), LLM-assisted probing (prompt-injection red-team), and manual review. No single test type catches all eight; static catches the high-volume SSRF / command-exec / credential classes cheaply; LLM-probing is the only thing that touches prompt-injection meaningfully; manual review is where over-broad-permission judgment calls actually land. A realistic test pass on a 1,000-LOC MCP server takes 30–90 seconds for static + LLM-probe and another 5–15 minutes if you want a manual review on the questionable findings.

Why "MCP server security testing" needs its own playbook

If you've shipped a Node or Python service before, you know the standard test stack: SCA on the dependency tree, SAST on first-party code, optional DAST, optional pentest, repeat. MCP servers don't fit that shape cleanly because most of the risk is in first-party tool-handler code that nothing in the standard stack was trained against.

Three structural observations from running the test plan below across our corpus:

So the playbook below is built around which test type catches which finding class, not "run X tool and trust the dashboard."

The eight finding classes worth testing for

These are the classes our scanner looks for, with the corpus prevalence (across 101 of the most-installed Claude skills and MCP servers):

  1. SSRF (server-side request forgery) — a tool handler builds a URL from caller-controlled input and fetches it without an allowlist. 50% of our corpus has at least one finding here.
  2. Command exec / shell injection — input flows into exec, spawn, subprocess.run(..., shell=True), or templated shell strings. ~10% of corpus.
  3. Credential exposure — env-var values readable in tool responses, log lines, error traces, or file writes. 38% of corpus.
  4. Prompt-injection vector — a tool returns externally-fetched content into the model's context unsanitized; an attacker can hide instructions there. Hard to quantify but present in roughly 1 in 4 fetch-shaped tools.
  5. Over-broad permission scope — OAuth scopes, file-system reach, or DB grants wider than what the tool surface requires.
  6. Dependency CVEs / advisories — the SCA layer; catches the <5% of findings that are CVE-shaped.
  7. Maintenance signal — archived, abandoned, or stale-relative-to-protocol-version. We treat 9-of-101 as fail-closed maintenance grade in this corpus.
  8. Client-compatibility drift — the server claims to support Claude Code / Cursor / Windsurf / Codex but the protocol-version negotiation breaks on at least one. Often invisible until install time.

You don't have to test all eight on every commit, but a release pass should hit each one at least once. The next section maps each class to the test type that actually catches it.
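
To make the SSRF class concrete, here is a minimal sketch of the allowlist check that the vulnerable pattern lacks. The function name `is_url_allowed` and the example hosts are hypothetical, not taken from any server in the corpus. A production check would also need to handle DNS resolution and redirects, which this sketch ignores.

```python
import ipaddress
from urllib.parse import urlparse


def is_url_allowed(url: str, allowed_hosts: set[str]) -> bool:
    """Return True only for http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False  # rejects file://, gopher://, and similar schemes
    host = parsed.hostname
    if host is None:
        return False
    try:
        # Literal IPs are rejected outright; this covers 169.254.169.254.
        ipaddress.ip_address(host)
        return False
    except ValueError:
        pass  # not an IP literal, fall through to the hostname check
    return host.lower() in allowed_hosts
```

A tool handler that calls something like this before fetching would not produce a finding in class 1; a handler that builds the URL from caller input and fetches it directly would.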

The four test types — and which classes each one catches

1. Static analysis (SAST)

Parse the source, walk the AST, follow taint flows from sources (tool arguments, request bodies) to sinks (fetch, exec, log writes, response payloads). This is where you catch the high-volume classes (SSRF, command-exec, credential echo) at the lowest cost. Catch rate depends on the rule set: defaults from CodeQL or Semgrep miss most MCP-shaped patterns because they don't model registered-tool extraction or dynamic-base URL templating. In our calibration runs, an MCP-aware static layer catches roughly 5× as many SSRF findings as a default generic-SAST install.

Catches well: static SSRF, command-exec, credential echo (when the env-var-to-response path is direct), some prompt-injection vector hints (any tool that fetches and returns).
Misses: prompt-injection effectiveness (no signal until you actually probe), over-broad permissions (a judgment call), client-compatibility drift, anything where the dangerous flow only forms at runtime.
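
As a rough illustration of the "walk the AST" step, the sketch below flags any Python call that passes shell=True. It is a purely syntactic check built on the standard-library `ast` module, with no taint tracking, so it is far weaker than the static layer described above; the function name and the sample source are made up for the example.

```python
import ast


def find_shell_true_calls(source: str) -> list[int]:
    """Return line numbers of calls that pass shell=True as a keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            if (
                kw.arg == "shell"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
            ):
                hits.append(node.lineno)
    return sorted(hits)


# Hypothetical tool-handler source used only to exercise the check.
sample = '''
import subprocess

def run_tool(cmd):
    return subprocess.run(cmd, shell=True)

def safe_tool(args):
    return subprocess.run(["ls", *args])
'''
print(find_shell_true_calls(sample))  # [5]
```

A real rule would also have to decide whether `cmd` is reachable from a tool argument, which is the part generic rule sets tend to miss.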

2. Dynamic testing / tool-call fuzzing

Boot the server, load a malicious payload corpus (URL-shaped: http://169.254.169.254/, file:///etc/passwd; shell-shaped: ; curl …, $(…); credential-shaped: requests that ask for the env), fire every payload at each tool, and watch for outbound network traffic, process spawns, log writes, and changes in response shape. This is where you confirm that a static finding is actually exploitable: not every templated fetch is reachable from a real tool argument.

Catches well: exploitable SSRF (the difference between a finding and a CVE), command-exec confirmation, credential-echo confirmation by inspecting actual response payloads.
Misses: classes with no observable runtime side-effect (over-broad permissions don't fire until used; maintenance signal is metadata, not behavior).
Cost: needs a sandbox you trust enough to actually let the server fire requests. We run this in a Docker network with egress restricted to a sinkhole; it's the operationally heaviest test type.
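
The probe loop can be sketched in-process, assuming the tool handler accepts its HTTP client as a parameter. That is a simplification: the setup described above boots the real server and watches the network from outside. The handler names and `RecordingFetcher` are hypothetical; only the two URL payloads come from the text.

```python
from typing import Callable

# Payloads quoted from the text above; illustrative, not an exhaustive corpus.
URL_PAYLOADS = ["http://169.254.169.254/", "file:///etc/passwd"]


class RecordingFetcher:
    """Stand-in for the HTTP client: records every URL instead of fetching it."""

    def __init__(self) -> None:
        self.requested: list[str] = []

    def __call__(self, url: str) -> str:
        self.requested.append(url)
        return ""  # no real network traffic leaves the process


def probe_tool(handler: Callable[[str, Callable[[str], str]], str]) -> list[str]:
    """Fire each payload at the handler; return the payloads that reached the fetcher."""
    reached = []
    for payload in URL_PAYLOADS:
        fetcher = RecordingFetcher()
        try:
            handler(payload, fetcher)
        except Exception:
            pass  # a handler that rejects the payload may raise; that is fine
        if payload in fetcher.requested:
            reached.append(payload)
    return reached


def vulnerable_fetch_tool(url: str, fetch: Callable[[str], str]) -> str:
    # Hypothetical handler: fetches whatever the caller supplies.
    return fetch(url)


def guarded_fetch_tool(url: str, fetch: Callable[[str], str]) -> str:
    # Hypothetical handler: only fetches from one fixed host.
    if not url.startswith("https://api.example.com/"):
        raise ValueError("host not allowed")
    return fetch(url)
```

Here `probe_tool(vulnerable_fetch_tool)` returns both payloads and `probe_tool(guarded_fetch_tool)` returns an empty list, which is the "confirms" signal: the payload either reaches the sink or it does not.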

3. LLM-assisted probing (prompt-injection red-team)

For each tool that returns externally-fetched content into the model's context, send a corpus of prompt-injection payloads through the model+server pair and check whether the model's response indicates the injection landed (refused-then-complied, exfiltrated env, ignored prior instructions, etc.). This is the only test type that touches prompt-injection meaningfully. The catch rate is bounded by your payload corpus and the model's own injection resistance — both moving targets — so the report shape is "injection-susceptibility band," not pass/fail.

Catches well: prompt-injection susceptibility, indirect-injection vectors (tool returns attacker-controlled content), some over-broad permission classes (the LLM is happy to ask for more than the doc says it needs).
Misses: static code findings — the LLM probe tests behavior, not source.
Cost: non-zero LLM API spend per probe; an MCP server with 8 fetch-shaped tools and a 40-prompt corpus is roughly 320 model calls per pass.
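
The cost arithmetic and one simple way to score a run can be sketched as follows. The canary approach is an assumption on our part for the example: each payload tells the model to emit a marker string, and its presence in the response counts as a landed injection. It is cruder than the behavioral checks listed above, and both function names are hypothetical.

```python
def probe_call_count(fetch_tools: int, prompts_per_tool: int) -> int:
    """Model calls for one pass: every prompt goes through every fetch-shaped tool."""
    return fetch_tools * prompts_per_tool


def landed_fraction(responses: list[str], canary: str) -> float:
    """Fraction of model responses that contain the canary string.

    Assumes each injection payload instructs the model to emit `canary`,
    so finding it in a response is treated as evidence the injection landed.
    """
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if canary in r)
    return hits / len(responses)


print(probe_call_count(8, 40))  # 320
```

How a fraction maps onto a susceptibility band is a policy choice; this sketch deliberately stops at the raw number.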

4. Manual review

A human reads the registered-tool list, the documentation, and the OAuth scope manifest, then decides whether the requested permissions match what the tools actually do. This is the only place over-broad-scope findings reliably land — "this server asks for repo:write but only reads issue titles" is a judgment call, not a regex. Manual review is also where you catch the documentation-completeness axis (does install.md match the registered tool surface? Does the README list every env var?).

Catches well: over-broad permissions, documentation drift, threat-model gaps the other layers can't see.
Misses: nothing inherently — but you only get one human's judgment per pass, and it's the slowest layer.
Cost: 5–15 minutes per server for an experienced reviewer; longer if scopes aren't documented.
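
The judgment itself can't be automated, but the bookkeeping around it can. In the sketch below the reviewer supplies the per-tool scope mapping by hand, and the code only computes the difference. The tool name and the `repo:read` scope are hypothetical; `repo:write` is taken from the example above.

```python
def excess_scopes(requested: set[str], needed_by_tool: dict[str, set[str]]) -> set[str]:
    """Scopes the server requests that no registered tool is judged to need."""
    needed = set().union(*needed_by_tool.values())
    return requested - needed


# Reviewer-supplied mapping: what each registered tool actually requires.
needed = {"list_issue_titles": {"repo:read"}}
print(excess_scopes({"repo:read", "repo:write"}, needed))  # {'repo:write'}
```

An empty result does not mean the scopes are right, only that every requested scope has at least one tool the reviewer decided justifies it.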

What catches what — coverage table

Finding class                   | Static (SAST)   | Dynamic / fuzz | LLM probe  | Manual
--------------------------------|-----------------|----------------|------------|-----------
SSRF (static URL pattern)       | Yes             | Confirms       | No         | Adjacent
SSRF (dynamic base / template)  | Yes (MCP-aware) | Confirms       | No         | Adjacent
Command exec                    | Yes             | Confirms       | No         | Adjacent
Credential echo to response     | Yes             | Confirms       | Sometimes  | Yes
Prompt-injection vector         | Hint only       | Partial        | Yes (only) | Adjacent
Over-broad permission scope     | No              | No             | Sometimes  | Yes (only)
Dependency CVEs                 | Adjacent (SCA)  | No             | No         | Adjacent
Maintenance signal              | Metadata-only   | No             | No         | Yes
Client-compatibility drift      | Partial         | Yes            | No         | Yes

"Yes" = the test type produces a high-confidence signal. "Confirms" = it can't find the class on its own but verifies a static finding is exploitable. "Hint only" / "Partial" = produces signal but not high enough to gate on. Sources: SkillAudit's calibration runs against the 101-server corpus; per-class detail in the methodology page.

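One way to use the coverage table is to ask which finding classes get no high-confidence signal from the test types you actually run. The sketch below encodes only the "Yes" cells of the table; the short keys ("static", "dynamic", "llm_probe", "manual") and the function name are our own labels for the example.

```python
# Test types that give a "Yes" (high-confidence) signal for each finding class.
HIGH_CONFIDENCE = {
    "SSRF (static URL pattern)": {"static"},
    "SSRF (dynamic base / template)": {"static"},  # "Yes (MCP-aware)"
    "Command exec": {"static"},
    "Credential echo to response": {"static", "manual"},
    "Prompt-injection vector": {"llm_probe"},
    "Over-broad permission scope": {"manual"},
    "Dependency CVEs": set(),  # the table lists only "Adjacent" signals here
    "Maintenance signal": {"manual"},
    "Client-compatibility drift": {"dynamic", "manual"},
}


def uncovered(test_types_run: set[str]) -> list[str]:
    """Finding classes with no high-confidence signal from the given test types."""
    return [
        cls for cls, types in HIGH_CONFIDENCE.items()
        if not (types & test_types_run)
    ]


print(uncovered({"static", "llm_probe"}))
# ['Over-broad permission scope', 'Dependency CVEs',
#  'Maintenance signal', 'Client-compatibility drift']
```

Dependency CVEs stay uncovered even with all four test types, because in the table that class is handled by the adjacent SCA layer rather than by any of them.
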
A practical test cadence

  1. Pre-commit (developer machine, <5s): a quick static pass plus a secrets probe. Catches the obvious echo-the-env mistakes before they get pushed. Cheap to wire up; we cover the wiring on the GitHub Action page.
  2. Per-PR (CI, 30–90s): the full static layer plus the 40-prompt LLM probe corpus, gated to fail PRs below grade B. This is the buy-once-test-everywhere shape. Catches the SSRF/command-exec/credential-echo classes and surfaces the prompt-injection band.
  3. Pre-release (manual, 10–20 min): human review of the registered tool list against the OAuth scope manifest and the README's documented capabilities. Catches over-broad-scope and documentation drift.
  4. Post-release / scheduled (weekly, automatic): re-run the full pass against the released artifact, with a fresh prompt-injection corpus pulled from public payload datasets. Catches drift in the model's injection resistance and any new dependency advisories.

The cadence is layered on purpose — pre-commit is cheap and catches the embarrassing class; per-PR is where the real gate sits; pre-release is where human judgment lives; post-release catches drift. Skipping the human layer is the most common mistake; over-broad-scope findings genuinely require it.
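
The per-PR gate reduces to a grade comparison. A minimal sketch, assuming the scanner hands CI a single letter grade; the function names and the exit-code convention are hypothetical, and this is not SkillAudit's actual integration.

```python
GRADE_ORDER = "ABCDEF"  # best to worst; assumed ordering of the A to F scale


def passes_gate(grade: str, minimum: str = "B") -> bool:
    """True when `grade` is at least as good as `minimum`."""
    return GRADE_ORDER.index(grade) <= GRADE_ORDER.index(minimum)


def ci_exit_code(grade: str) -> int:
    """0 lets the PR through; 1 fails the CI job."""
    return 0 if passes_gate(grade) else 1
```

With the default minimum of B, grades A and B pass and anything from C down fails the job.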

How SkillAudit runs this test plan

You paste a GitHub URL, npm package name, or upload a ZIP. The scanner clones the repo into an ephemeral sandbox and runs static + LLM-probe in one pass — typically 30–90 seconds for a 1,000-LOC MCP server. You get an A–F grade across the six axes, with file paths and line numbers for each finding. Static classes (SSRF, command-exec, credential echo, prompt-injection vector hints) come back high-confidence; the LLM-probe layer reports an injection-susceptibility band rather than a pass/fail. Dynamic fuzzing isn't in the public scanner today — running outbound HTTP from arbitrary uploaded code raises sandbox-policy questions we'd rather solve before shipping; the static layer catches enough of the same surface that the trade-off is currently worth it. Manual-review hooks are exposed in the Team plan via the policy export so a human can sign off on the over-broad-scope class.

Related questions

Can I just write Semgrep rules for MCP-shaped patterns and skip a dedicated scanner?

Yes, in principle. Semgrep is the most flexible of the generic-SAST layers and can express a lot of the patterns. The cost is the corpus: you need a calibration set of MCP servers (positive and negative cases) to validate that your rules catch what you think they catch without flooding you with false positives. We publish our calibration set; if you'd rather start there than rebuild it, that's the cost we've already paid for you.

How fast is a typical SkillAudit test pass?

30–90 seconds for the static + LLM-probe pass on a server up to ~2,000 LOC. The static layer is ~5–15s; the LLM probe is the bulk and depends on how many fetch-shaped tools the server registers. Larger servers (the TypeScript SDK at >15k LOC, for instance) run 3–5 minutes.

Should I dynamic-fuzz my own server before publishing?

Yes — at least once, in a sandboxed network with outbound restricted to a sinkhole. Confirming that a static SSRF finding is actually reachable from a real tool argument is the difference between "we patched a finding" and "we shipped a CVE." We don't run dynamic against arbitrary uploaded code in the public scanner because of sandbox-policy concerns, but you should run it against your own.

What about pentest? Is this a substitute for a real human pentest?

No. The test plan above catches the high-volume code classes; a human pentest catches the threat-model gaps and the things that only show up when an attacker is incentivized to look. Run both for a release that matters. The cost difference (a 30–90 second scan vs. a multi-day engagement) means the scan is what runs every PR; the pentest is what runs once before a major release.

How do I test the prompt-injection axis without paying for thousands of model calls?

Cap the corpus and the run cadence. A 30–50 prompt corpus per fetch-shaped tool is enough to get a usable susceptibility band; running it per-PR on a small server is single-digit dollars per month at current Anthropic pricing. We rate-limit the LLM probe layer in the public free tier and surface full per-tool runs to Pro users.

Further reading