Topic: mcp server security testing

MCP server security testing — what to test, how, and what tools actually find

"Security testing" for an MCP server isn't one thing — it's a coverage decision. Here's the test plan we run against every server in the 101-server corpus: the eight finding classes worth testing for, the four test types that catch each, what each one misses, and the trade-offs between fast static checks and slower LLM-probe runs.

TL;DR

MCP server security testing covers eight finding classes (SSRF, command-exec, credential exposure, prompt-injection, over-broad permissions, dependency CVEs, maintenance signal, and client-compatibility drift) using four test types: static analysis (SAST), dynamic testing / fuzzing (live tool-call probes), LLM-assisted probing (prompt-injection red-team), and manual review. No single test type catches all eight. Static analysis catches the high-volume SSRF, command-exec, and credential classes cheaply. LLM probing is the only test type that touches prompt-injection meaningfully. Manual review is where over-broad-permission judgment calls actually land. A realistic test pass on a 1,000-LOC MCP server takes 30–90 seconds for static + LLM-probe, plus another 5–15 minutes if you want a manual review of the questionable findings.

Why "MCP server security testing" needs its own playbook

If you've shipped a Node or Python service before, you know the standard test stack: SCA on the dependency tree, SAST on first-party code, optional DAST, optional pentest, repeat. MCP servers don't fit that shape cleanly because most of the risk is in first-party tool-handler code that nothing in the standard stack was tuned for.

Three structural observations from running the test plan below across our corpus:

Most MCP servers are 200–2,000 LOC. The dependency graph is small; SCA finds little. The risk lives in the wrapper a developer wrote this week, not in a CVE-backed transitive dep. Fewer than 5% of our F-grade findings were CVE-shaped.
"Untrusted input" means an LLM-controlled tool argument. Default SAST queries are tuned for HTTP query strings and form bodies, so tool-call arguments often pass unflagged through queries that would have caught the same data shape had it arrived via Express middleware.
Prompt-injection has no purely-static signal. A tool that fetches a URL and returns its body is fine code; whether that body can re-route the model's behavior depends on what the model does with the response. The only way to test this is to actually fire prompts through the server and watch what the model does.

So the playbook below is built around which test type catches which finding class, not "run X tool and trust the dashboard."

The eight finding classes worth testing for

These are the classes our scanner looks for, with the corpus prevalence (across 101 of the most-installed Claude skills and MCP servers):

SSRF (server-side request forgery) — a tool handler builds a URL from caller-controlled input and fetches it without an allowlist. 50% of our corpus has at least one finding here.
Command exec / shell injection — input flows into exec, spawn, subprocess.run(..., shell=True), or templated shell strings. ~10% of corpus.
Credential exposure — env-var values readable in tool responses, log lines, error traces, or file writes. 38% of corpus.
Prompt-injection vector — a tool returns externally-fetched content into the model's context unsanitized; an attacker can hide instructions there. Hard to quantify but present in roughly 1 in 4 fetch-shaped tools.
Over-broad permission scope — OAuth scopes, file-system reach, or DB grants wider than what the tool surface requires.
Dependency CVEs / advisories — the SCA layer; catches the <5% of findings that are CVE-shaped.
Maintenance signal — archived, abandoned, or stale-relative-to-protocol-version. We treat 9-of-101 as fail-closed maintenance grade in this corpus.
Client-compatibility drift — the server claims to support Claude Code / Cursor / Windsurf / Codex but the protocol-version negotiation breaks on at least one. Often invisible until install time.
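
To make the SSRF class concrete, here is a minimal sketch of the allowlist check whose absence produces the finding. The host in ALLOWED_HOSTS and the function name is_url_allowed are hypothetical, chosen for illustration; a real handler would load its allowlist from configuration and would also need to handle redirects and DNS rebinding, which this sketch does not.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; not taken from any real server.
ALLOWED_HOSTS = {"api.example.com"}


def is_url_allowed(url: str) -> bool:
    """Return True only for https URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False  # rejects file://, http://, and scheme-less input
    host = parts.hostname  # already lowercased by urlsplit
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return False  # literal IPs (e.g. 169.254.169.254) are never allowed
    except ValueError:
        pass  # not an IP literal, so fall through to the hostname check
    return host in ALLOWED_HOSTS
```

A handler that calls a check like this before every fetch turns caller-controlled URLs into a closed set; a handler that skips it is the pattern the SSRF class describes.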

You don't have to test all eight on every commit, but a release pass should hit each one at least once. The next section maps each class to the test type that actually catches it.

The four test types — and which classes each one catches

1. Static analysis (SAST)

Parse the source, walk the AST, and follow taint flows from sources (tool arguments, request bodies) to sinks (fetch, exec, log writes, response payloads). This is where you catch the high-volume classes (SSRF, command-exec, credential echo) at the lowest cost. Catch rate depends on the rule set: defaults from CodeQL or Semgrep miss most MCP-shaped patterns because they don't model registered-tool extraction or dynamic-base URL templating. In our calibration runs, an MCP-aware static layer catches roughly 5× as many SSRF findings as a generic-SAST default install.

Catches well: static SSRF, command-exec, credential echo (when the env-var-to-response path is direct), some prompt-injection vector hints (any tool that fetches and returns).
Misses: prompt-injection effectiveness (no signal until you actually probe), over-broad permissions (a judgment call), client-compatibility drift, anything where the dangerous flow only forms at runtime.
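
As a rough illustration of the "parse the source, walk the AST" step, the sketch below flags calls that pass shell=True, using Python's standard ast module. It is a pattern match only, not the taint tracking described above, and it is not how any particular scanner works; the function name and the sample source are made up for the example.

```python
import ast


def find_shell_true_calls(source: str) -> list[int]:
    """Return the line numbers of calls that pass shell=True as a keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            if (
                kw.arg == "shell"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
            ):
                hits.append(node.lineno)
    return sorted(hits)


# Hypothetical tool-handler source used only to exercise the check.
SAMPLE = '''
import subprocess

def run_tool(arg):
    subprocess.run("ls " + arg, shell=True)
    subprocess.run(["ls", arg])
'''

print(find_shell_true_calls(SAMPLE))  # [5]
```

A real rule would also need to establish that the command string is reachable from a tool argument; that reachability step is what separates a taint-aware layer from a grep.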

2. Dynamic testing / tool-call fuzzing

Boot the server, load a malicious payload corpus (URL-shaped: http://169.254.169.254/, file:///etc/passwd; shell-shaped: ; curl …, $(…); credential-shaped: requests that ask for the env), then fire each tool and watch for outbound network traffic, process spawns, log writes, and response-shape changes. This is where you confirm that a static finding is actually exploitable: not every templated fetch is reachable from a real tool argument.

Catches well: exploitable SSRF (the difference between a finding and a CVE), command-exec confirmation, credential-echo confirmation by inspecting actual response payloads.
Misses: classes with no observable runtime side-effect (over-broad permissions don't fire until used; maintenance signal is metadata, not behavior).
Cost: needs a sandbox you trust enough to actually let the server fire requests. We run this in a Docker network with egress restricted to a sinkhole; it's the operationally heaviest test type.
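
The fire-and-watch loop can be sketched without a real network by handing the tool a recording stub in place of its fetcher. This is a simplified stand-in for the sandboxed run described above: the handler signature (a URL plus an injectable fetch function) and both example handlers are assumptions made for the sketch, not part of any MCP SDK.

```python
from typing import Callable

# URL-shaped payloads quoted from the corpus description above.
URL_PAYLOADS = ["http://169.254.169.254/", "file:///etc/passwd"]

Fetch = Callable[[str], str]


def fuzz_fetch_tool(handler: Callable[[str, Fetch], str],
                    payloads: list[str]) -> list[str]:
    """Call handler with each payload; return the payloads that reached the fetcher."""
    reached = []
    for payload in payloads:
        seen: list[str] = []

        def recording_fetch(url: str) -> str:
            seen.append(url)  # record the request instead of sending traffic
            return ""

        try:
            handler(payload, recording_fetch)
        except Exception:
            pass  # a rejection means the payload did not reach the sink
        if payload in seen:
            reached.append(payload)
    return reached


def naive_handler(url: str, fetch: Fetch) -> str:
    """Hypothetical tool that fetches whatever it is given."""
    return fetch(url)


def guarded_handler(url: str, fetch: Fetch) -> str:
    """Hypothetical tool that rejects non-https schemes before fetching."""
    if not url.startswith("https://"):
        raise ValueError("scheme not allowed")
    return fetch(url)
```

Here fuzz_fetch_tool(naive_handler, URL_PAYLOADS) returns both payloads and the guarded variant returns an empty list. In a real run the stub is replaced by the sinkholed network, and the observation is outbound traffic rather than a Python list.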

3. LLM-assisted probing (prompt-injection red-team)

For each tool that returns externally-fetched content into the model's context, send a corpus of prompt-injection payloads through the model+server pair and check whether the model's response indicates the injection landed (refused-then-complied, exfiltrated env, ignored prior instructions, etc.). This is the only test type that touches prompt-injection meaningfully. The catch rate is bounded by your payload corpus and the model's own injection resistance — both moving targets — so the report shape is "injection-susceptibility band," not pass/fail.

Catches well: prompt-injection susceptibility, indirect-injection vectors (tool returns attacker-controlled content), some over-broad permission classes (the LLM is happy to ask for more than the doc says it needs).
Misses: static code findings — the LLM probe tests behavior, not source.
Cost: non-zero LLM API spend per probe; an MCP server with 8 fetch-shaped tools and a 40-prompt corpus is roughly 320 model calls per pass.
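
The cost arithmetic and the band-style report can be sketched as follows. The canary approach (each payload asks the model to emit a marker string, and a response containing it counts as a landed injection) and the band thresholds are illustrative assumptions, not the scoring any particular scanner uses.

```python
def probe_call_count(fetch_tools: int, corpus_size: int) -> int:
    """Model calls per pass: every payload is sent through every fetch-shaped tool."""
    return fetch_tools * corpus_size


def landing_rate(responses: list[str], canary: str) -> float:
    """Fraction of model responses that contain the canary marker."""
    if not responses:
        return 0.0
    landed = sum(1 for response in responses if canary in response)
    return landed / len(responses)


def susceptibility_band(rate: float) -> str:
    """Map a landing rate to a coarse band. Thresholds are placeholders."""
    if rate == 0:
        return "none observed"
    if rate < 0.1:
        return "low"
    if rate < 0.5:
        return "medium"
    return "high"


print(probe_call_count(8, 40))  # 320
```

Because the payload corpus and the model both change over time, a band of "none observed" means only that this corpus did not land against this model on this run.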

4. Manual review

A human reads the registered-tool list, the documentation, and the OAuth scope manifest, then decides whether the requested permissions match what the tools actually do. This is the only place over-broad-scope findings reliably land — "this server asks for repo:write but only reads issue titles" is a judgment call, not a regex. Manual review is also where you catch the documentation-completeness axis (does install.md match the registered tool surface? Does the README list every env var?).

Catches well: over-broad permissions, documentation drift, threat-model gaps the other layers can't see.
Misses: nothing inherently — but you only get one human's judgment per pass, and it's the slowest layer.
Cost: 5–15 minutes per server for an experienced reviewer; longer if scopes aren't documented.
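
The judgment itself can't be automated, but the bookkeeping around it can. As a sketch: once a reviewer has written down which scopes each tool actually needs, a set difference lists the requested scopes that nothing uses. The tool name and the repo:read scope below are hypothetical; only repo:write comes from the example above.

```python
def unused_scopes(requested: set[str],
                  needed_by_tool: dict[str, set[str]]) -> set[str]:
    """Return requested scopes that no tool in the reviewer's mapping needs."""
    needed = set().union(*needed_by_tool.values())
    return requested - needed


# Reviewer-supplied mapping; deciding what each tool "needs" is the manual step.
review = {"list_issue_titles": {"repo:read"}}

print(unused_scopes({"repo:read", "repo:write"}, review))  # {'repo:write'}
```

The output is a list of questions for the reviewer, not a verdict: a scope can be unused by every tool and still be justified by something outside the tool surface.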

What catches what — coverage table

Finding class	Static (SAST)	Dynamic / fuzz	LLM probe	Manual
SSRF (static URL pattern)	Yes	Confirms	No	Adjacent
SSRF (dynamic base / template)	Yes (MCP-aware)	Confirms	No	Adjacent
Command exec	Yes	Confirms	No	Adjacent
Credential echo to response	Yes	Confirms	Sometimes	Yes
Prompt-injection vector	Hint only	Partial	Yes (only)	Adjacent
Over-broad permission scope	No	No	Sometimes	Yes (only)
Dependency CVEs	Adjacent (SCA)	No	No	Adjacent
Maintenance signal	Metadata-only	No	No	Yes
Client-compatibility drift	Partial	Yes	No	Yes

"Yes" = the test type produces a high-confidence signal. "Confirms" = it can't find the class on its own but verifies a static finding is exploitable. "Hint only" / "Partial" = produces signal but not high enough to gate on. Sources: SkillAudit's calibration runs against the 101-server corpus; per-class detail in the methodology page.

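One way to use the coverage table when planning a pass is to encode it as data and ask which classes get no "Yes" from the test types you intend to run. The cells below are transcribed from the table; the function name and the rule "a cell starting with Yes counts as covered" are assumptions of this sketch.

```python
# Column order of the coverage table: static, dynamic, llm, manual.
TEST_TYPES = ("static", "dynamic", "llm", "manual")

COVERAGE = {
    "SSRF (static URL pattern)": ("Yes", "Confirms", "No", "Adjacent"),
    "SSRF (dynamic base / template)": ("Yes (MCP-aware)", "Confirms", "No", "Adjacent"),
    "Command exec": ("Yes", "Confirms", "No", "Adjacent"),
    "Credential echo to response": ("Yes", "Confirms", "Sometimes", "Yes"),
    "Prompt-injection vector": ("Hint only", "Partial", "Yes (only)", "Adjacent"),
    "Over-broad permission scope": ("No", "No", "Sometimes", "Yes (only)"),
    "Dependency CVEs": ("Adjacent (SCA)", "No", "No", "Adjacent"),
    "Maintenance signal": ("Metadata-only", "No", "No", "Yes"),
    "Client-compatibility drift": ("Partial", "Yes", "No", "Yes"),
}


def classes_without_yes(selected: set[str]) -> list[str]:
    """List finding classes where no selected test type has a cell starting with 'Yes'."""
    gaps = []
    for finding_class, cells in COVERAGE.items():
        by_type = dict(zip(TEST_TYPES, cells))
        if not any(by_type[t].startswith("Yes") for t in selected):
            gaps.append(finding_class)
    return gaps


print(classes_without_yes({"static", "llm"}))
# ['Over-broad permission scope', 'Dependency CVEs',
#  'Maintenance signal', 'Client-compatibility drift']
```

Running it with all four test types selected still leaves "Dependency CVEs" in the list, which matches the table: that class belongs to the SCA layer, which the table marks only as adjacent to SAST.
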
A practical test cadence

Pre-commit (developer machine, <5s): a quick static pass plus a secrets probe. Catches the obvious echo-the-env mistakes before they get pushed. Cheap to wire up; we cover the wiring on the GitHub Action page.
Per-PR (CI, 30–90s): the full static layer plus the 40-prompt LLM probe corpus, gated to fail PRs below grade B. This is the buy-once-test-everywhere shape. Catches the SSRF/command-exec/credential-echo classes and surfaces the prompt-injection band.
Pre-release (manual, 10–20 min): human review of the registered tool list against the OAuth scope manifest and the README's documented capabilities. Catches over-broad-scope and documentation drift.
Post-release / scheduled (weekly, automatic): re-run the full pass against the released artifact, with a fresh prompt-injection corpus pulled from public payload datasets. Catches drift in the model's injection resistance and any new dependency advisories.

The cadence is layered on purpose — pre-commit is cheap and catches the embarrassing class; per-PR is where the real gate sits; pre-release is where human judgment lives; post-release catches drift. Skipping the human layer is the most common mistake; over-broad-scope findings genuinely require it.
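
The per-PR gate reduces to a comparison on the grade scale. A minimal sketch, assuming the scale is the letters A, B, C, D, F from best to worst and that the scan result is already available as a single letter; how that letter is obtained from a scanner is outside this sketch.

```python
GRADE_ORDER = "ABCDF"  # assumed scale, best to worst


def passes_gate(grade: str, minimum: str = "B") -> bool:
    """True if grade is at least as good as minimum.

    An unrecognized grade raises ValueError, so a malformed result
    stops the job instead of passing silently.
    """
    return GRADE_ORDER.index(grade) <= GRADE_ORDER.index(minimum)


def exit_code(grade: str) -> int:
    """Process exit code for CI: 0 lets the PR through, 1 fails it."""
    return 0 if passes_gate(grade) else 1
```

With the default minimum of B, grades A and B pass and anything lower fails the PR.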

How SkillAudit runs this test plan

You paste a GitHub URL or npm package name, or upload a ZIP. The scanner clones the repo into an ephemeral sandbox and runs static + LLM-probe in one pass, typically 30–90 seconds for a 1,000-LOC MCP server. You get an A–F grade across the six axes, with file paths and line numbers for each finding. Static classes (SSRF, command-exec, credential echo, prompt-injection vector hints) come back high-confidence; the LLM-probe layer reports an injection-susceptibility band rather than a pass/fail. Dynamic fuzzing isn't in the public scanner today: running outbound HTTP from arbitrary uploaded code raises sandbox-policy questions we'd rather solve before shipping, and the static layer catches enough of the same surface that the trade-off is currently worth it. Manual-review hooks are exposed in the Team plan via the policy export so a human can sign off on the over-broad-scope class.
