Topic: mcp server security scanner

MCP server security scanner — what to look for in one

If you're searching for a Model Context Protocol security scanner, you're looking for something most existing tools weren't built to be. Here's the threat model an MCP-aware scanner has to cover, the gaps generic SAST and SCA leave, and what we found running ours against 101 of the most-installed servers in the wild.

TL;DR

An MCP server's risk surface is in tool-handler bodies — first-party code the LLM gets permission to call with arguments derived from untrusted input. A real MCP security scanner therefore has to detect tool-handler SSRF, command-exec sinks, credential echo back through tool responses, prompt-injection vectors smuggled through tool output, permission-scope inflation, and per-axis maintenance and documentation signal. Generic SAST and SCA scanners catch almost none of this because none of it leaves a CVE — the dangerous code was written this week. Of 101 of the most-installed MCP servers we scanned, 50% shipped SSRF, 38% shipped credential-handling findings, and only 19% earned an A grade. SkillAudit is the scanner; you can paste a GitHub URL and read a graded report card in 60 seconds.

Why a generic SAST or SCA scanner is the wrong tool here

The first instinct of most teams adopting MCP is to point an existing scanner at the repo and call it done. That instinct is reasonable for code shaped like a typical web service — but MCP is shaped differently. Three reasons the generic toolchain leaves the dangerous parts un-flagged:

The vulnerable code is yours, not your dependencies'. An MCP server's risk usually lives in a tool handler that someone wrote to wrap a bespoke API. server.tool('fetch_url', async ({url}) => await fetch(url)) is a textbook SSRF and it ships, this week, in code with a clean SBOM. SCA tools (Snyk Open Source, Dependabot, OSV-Scanner, npm audit) have nothing to flag — there's no CVE in the dependency tree. See our Snyk alternative and npm audit alternative writeups for side-by-side detail.
The exploit path is through tool I/O, not through HTTP. Conventional SAST learned what request flows look like — query strings, form data, headers. An MCP exploit traverses tool-call arguments and tool-result content. A prompt-injection payload smuggled inside a tool response that downstream models will read as instructions is invisible to OWASP-shaped pattern matchers. Generic SAST stays quiet.
Permission and credential models are MCP-specific. The "asks for too many env vars," "echoes process.env.GITHUB_TOKEN in a tool result," and "registers a tool whose scope is wider than its docstring claims" classes of finding are not in the typical SAST taxonomy. They have to be modeled explicitly against the MCP shape.

What an MCP server security scanner actually has to detect

Working backward from the kinds of findings we've consistently seen across the corpus, an MCP scanner that earns its name needs to cover at least these axes:

Server-Side Request Forgery (SSRF) in tool handlers. Detect fetch(url), axios.get(url), requests.get(url), and similar where url is derived from a tool argument without an allowlist. Catch dynamic-base patterns (fetch(`${baseUrl}/${path}`)) that linters miss. 50% of the corpus shipped this.
Command and code execution sinks. Flag execSync, child_process.exec, os.system, shell=True calls reachable from tool handlers, and pattern-string interpolation into shell commands. 10% of the corpus had a finding here.
Credential-echo paths. Trace process.env.X / os.environ['X'] reads to tool-response return paths and to logger calls inside handlers. The most embarrassing class of finding because it's almost always inadvertent — see our walkthrough of how credential leaks land in MCP code. 38% of the corpus.
Prompt-injection vectors smuggled through tool output. An LLM-assisted probe is the only way to catch this honestly today. Static checks help (untemplated pass-through of upstream-API responses with embedded instructions, no content sanitization), but the high-signal probe is to red-team the tool with payloads and grade the response. SkillAudit's engine ships an LLM probe behind the v0.3 surface-tiered methodology.
Permission and scope hygiene. Compare the scopes a server requests (OAuth scopes, env vars, file paths, network egress) against what its documented tools actually need. "Asks for read+write+admin to do a read-only operation" is a real and common finding — and a buyer-side disqualifier.
Maintenance signal. Last commit, open-issue ratio, advisory feed presence. Nine of the 101 servers we scanned were archived; if you install one of those, no future patch is coming. A scanner has to surface this as a first-class axis, not a footnote.
Client compatibility. Does the server actually run under the clients people use? Claude Code, Cursor, Windsurf, Codex, the JetBrains plugin — protocol-version drift quietly breaks installs. A scanner should at least record the targeted clients and flag the ones the README claims but the code doesn't support.
Documentation completeness. Runnable example, semver, changelog, README that matches the actual tools. Low-grade signal individually but high-correlation overall: poorly-documented servers in our corpus correlate with the F-grade cluster.

How SkillAudit's scanner works

SkillAudit is a six-axis static + LLM-assisted scanner built specifically for Claude skills and Model Context Protocol servers. Paste a GitHub URL, npm package, or upload a ZIP; the engine produces a report card with a single A–F grade, per-axis pass/warn/fail, and remediation hints with file paths and line numbers. Free public audits live at stable URLs (e.g. /audits/owner-repo/) and authors can embed the resulting badge on their README.

The static layer is tree-sitter-based and tuned to MCP idioms — template-string fetch, dynamic baseURL, registered-tool extraction, env-var read tracing. The LLM-assisted layer (Claude Haiku 4.5) runs prompt-injection probes against the extracted tool handlers as a separate axis; we describe its limits honestly in the v0.3 calibration writeup. The two layers report independently; we don't bury the static-only findings under model probabilities.

For team buyers, SkillAudit's CI Webhook (Pro) wires the scan into a GitHub Action that fails PRs introducing new MCP servers below a configurable minimum grade. The GitHub Action gate page covers the workflow.

Run an audit

What 101 servers told us about the scanner-readiness of this market

We scanned the 101 most-installed Claude skills and Model Context Protocol servers — vendor-official releases (Stripe, PayPal, MongoDB, Redis, Cloudflare, AWS, Azure, GCP, Heroku, Elastic, Notion, Snowflake), popular indie frameworks (FastMCP, mcp-use, mcp-agent), and the nine official Anthropic SDKs. The full live board is at /audits/. The grade distribution: 19 A · 30 C · 10 D · 42 F.

Three implications for choosing a scanner:

Most of the F-grade vendor repos pass a generic SAST cleanly. Their package.json trees are healthy; the findings are in tool-handler bodies. If you only ran Snyk or npm audit, you'd green-light most of them. We named the F-grade vendor releases in the vendor-official F-grades writeup.
SSRF is the dominant first-finding. 50% of the corpus. Any MCP scanner has to do better than generic URL-taint analysis on this — the corpus has dynamic baseURL and template-string idioms that conventional SAST silently passes.
Prompt-injection findings cluster around servers that pass tool output through unmodified. A scanner without an LLM-assist axis will under-report this. We're explicit about which servers in our corpus failed only on the LLM probe vs. on static checks; see the per-server pages.

How to pick one

Start with one repo you already trust. Run the scanner against it. The output should be in the file paths and line numbers you'd recognize. If it's a finding list with no MCP-shaped detail, the scanner doesn't model MCP — it's running a generic SAST and rebadging the output.
Run it against one repo you don't trust. A vendor F-grade from our corpus is a fair test. The scanner should surface SSRF and credential findings; if it stays quiet, the generic-SAST hypothesis just got stronger.
Check the LLM-probe story. "We scan for prompt injection" with no methodology page is a marketing claim, not a feature. Demand a methodology, a calibration set, and an honest list of what the probe doesn't catch.
Check the buyer surface. A team lead deciding whether to allow an indie skill into their fleet wants A–F, not a 60-finding JSON. The scanner needs a buyer-readable output, not just a developer-targeted one.