MCP server security scanner — what to look for in one

If you're searching for a Model Context Protocol security scanner, you're looking for something most existing tools weren't built to be. Here's the threat model an MCP-aware scanner has to cover, the gaps generic SAST and SCA leave, and what we found running ours against 101 of the most-installed servers in the wild.

TL;DR

An MCP server's risk surface lives in its tool-handler bodies: first-party code the LLM is permitted to call with arguments derived from untrusted input. A real MCP security scanner therefore has to detect tool-handler SSRF, command-exec sinks, credentials echoed back through tool responses, prompt-injection vectors smuggled through tool output, and permission-scope inflation, and it has to report maintenance and documentation signals as axes of their own. Generic SAST and SCA scanners catch almost none of this because none of it leaves a CVE; the dangerous code was written this week. Of 101 of the most-installed MCP servers we scanned, 50% shipped SSRF, 38% shipped credential-handling findings, and only 19% earned an A grade. SkillAudit is the scanner; you can paste a GitHub URL and read a graded report card in roughly 60 seconds.

Why a generic SAST or SCA scanner is the wrong tool here

The first instinct of most teams adopting MCP is to point an existing scanner at the repo and call it done. That instinct is reasonable for code shaped like a typical web service — but MCP is shaped differently. Three reasons the generic toolchain leaves the dangerous parts un-flagged:

What an MCP server security scanner actually has to detect

Working backward from the kinds of findings we've consistently seen across the corpus, an MCP scanner that earns its name needs to cover at least these axes:

  1. Server-Side Request Forgery (SSRF) in tool handlers. Detect fetch(url), axios.get(url), requests.get(url), and similar where url is derived from a tool argument without an allowlist. Catch dynamic-base patterns (fetch(`${baseUrl}/${path}`)) that linters miss. 50% of the corpus shipped this.
  2. Command and code execution sinks. Flag execSync, child_process.exec, os.system, and shell=True calls reachable from tool handlers, as well as template-string interpolation into shell commands. 10% of the corpus had a finding here.
  3. Credential-echo paths. Trace process.env.X / os.environ['X'] reads to tool-response return paths and to logger calls inside handlers. The most embarrassing class of finding because it's almost always inadvertent — see our walkthrough of how credential leaks land in MCP code. 38% of the corpus.
  4. Prompt-injection vectors smuggled through tool output. An LLM-assisted probe is the only way to catch this honestly today. Static checks help (untemplated pass-through of upstream-API responses with embedded instructions, no content sanitization), but the high-signal probe is to red-team the tool with payloads and grade the response. SkillAudit's engine ships an LLM probe behind the v0.3 surface-tiered methodology.
  5. Permission and scope hygiene. Compare the scopes a server requests (OAuth scopes, env vars, file paths, network egress) against what its documented tools actually need. "Asks for read+write+admin to do a read-only operation" is a real and common finding — and a buyer-side disqualifier.
  6. Maintenance signal. Last commit, open-issue ratio, advisory feed presence. Nine of the 101 servers we scanned were archived; if you install one of those, no future patch is coming. A scanner has to surface this as a first-class axis, not a footnote.
  7. Client compatibility. Does the server actually run under the clients people use? Claude Code, Cursor, Windsurf, Codex, the JetBrains plugin — protocol-version drift quietly breaks installs. A scanner should at least record the targeted clients and flag the ones the README claims but the code doesn't support.
  8. Documentation completeness. Runnable example, semver, changelog, README that matches the actual tools. Low-grade signal individually but high-correlation overall: poorly-documented servers in our corpus correlate with the F-grade cluster.
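
To make the first two axes concrete, here is a minimal sketch of a parameter-to-sink check for Python handlers. It uses the standard-library ast module as a stand-in and is not how SkillAudit's engine works. The sink tables, the function name scan_handler_source, and the sample handlers are all hypothetical, and the taint rule (any sink argument that mentions a function parameter) is deliberately crude: it knows nothing about allowlists or sanitizers.

```python
import ast

# Hypothetical sink tables for this sketch; a real scanner would cover far more.
HTTP_SINKS = {("requests", "get"), ("requests", "post"), ("httpx", "get")}
SHELL_SINKS = {("os", "system"), ("subprocess", "run"), ("subprocess", "Popen")}


def _call_target(node: ast.Call):
    """Return (module, attr) for calls shaped like module.attr(...), else None."""
    func = node.func
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return (func.value.id, func.attr)
    return None


def _names_in(node: ast.AST) -> set[str]:
    """Collect every bare identifier referenced inside an expression."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}


def scan_handler_source(source: str) -> list[tuple[int, str]]:
    """Flag sink calls whose arguments mention a parameter of the enclosing function."""
    findings = set()
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        params = {a.arg for a in fn.args.args + fn.args.kwonlyargs}
        for call in ast.walk(fn):
            if not isinstance(call, ast.Call):
                continue
            target = _call_target(call)
            arg_names = set()
            for arg in list(call.args) + [kw.value for kw in call.keywords]:
                arg_names |= _names_in(arg)  # also sees names inside f-strings
            if not (arg_names & params):
                continue
            if target in HTTP_SINKS:
                findings.add((call.lineno, "ssrf"))
            elif target in SHELL_SINKS:
                shell_true = any(
                    kw.arg == "shell"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True
                    for kw in call.keywords
                )
                if target == ("os", "system") or shell_true:
                    findings.add((call.lineno, "command-exec"))
    return sorted(findings)


# Hypothetical handlers, used only to exercise the sketch.
SAMPLE = '''
import os, requests

def fetch_page(url):
    return requests.get(url).text

def fetch_path(base_url, path):
    return requests.get(f"{base_url}/{path}").text

def run_listing(directory):
    os.system("ls " + directory)

def health():
    return requests.get("https://example.com/health").status_code
'''

findings = scan_handler_source(SAMPLE)
for lineno, kind in findings:
    print(f"line {lineno}: {kind}")
```

The dynamic-base case is caught because the f-string's interpolated names are walked like any other expression. The health handler is left alone because its URL is a constant.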

How SkillAudit's scanner works

SkillAudit is a six-axis static + LLM-assisted scanner built specifically for Claude skills and Model Context Protocol servers. Paste a GitHub URL, npm package, or upload a ZIP; the engine produces a report card with a single A–F grade, per-axis pass/warn/fail, and remediation hints with file paths and line numbers. Free public audits live at stable URLs (e.g. /audits/owner-repo/) and authors can embed the resulting badge on their README.

The static layer is tree-sitter-based and tuned to MCP idioms — template-string fetch, dynamic baseURL, registered-tool extraction, env-var read tracing. The LLM-assisted layer (Claude Haiku 4.5) runs prompt-injection probes against the extracted tool handlers as a separate axis; we describe its limits honestly in the v0.3 calibration writeup. The two layers report independently; we don't bury the static-only findings under model probabilities.
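
As a rough illustration of what env-var read tracing means, the sketch below follows environment reads to return statements inside a single Python function. It is an assumption-laden stand-in: it uses Python's ast module rather than tree-sitter, it only follows one assignment hop, and the names find_credential_echo and SAMPLE are invented for the example.

```python
import ast


def _is_env_read(node: ast.AST) -> bool:
    """True for os.environ[...], os.environ.get(...), or os.getenv(...)."""
    if isinstance(node, ast.Subscript):
        base = node.value  # os.environ["X"] -> os.environ
        return (
            isinstance(base, ast.Attribute)
            and base.attr == "environ"
            and isinstance(base.value, ast.Name)
            and base.value.id == "os"
        )
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
        f = node.func
        if f.attr == "getenv" and isinstance(f.value, ast.Name) and f.value.id == "os":
            return True
        if f.attr == "get" and isinstance(f.value, ast.Attribute) and f.value.attr == "environ":
            return True
    return False


def find_credential_echo(source: str) -> list[tuple[str, int]]:
    """Report (function name, line) where an env-derived value reaches a return."""
    findings = []
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Pass 1: names assigned directly from an environment read.
        tainted: set[str] = set()
        for stmt in ast.walk(fn):
            if isinstance(stmt, ast.Assign) and _is_env_read(stmt.value):
                tainted |= {t.id for t in stmt.targets if isinstance(t, ast.Name)}
        # Pass 2: returns that mention a tainted name or read the env inline.
        for stmt in ast.walk(fn):
            if isinstance(stmt, ast.Return) and stmt.value is not None:
                used = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
                inline = any(_is_env_read(n) for n in ast.walk(stmt.value))
                if inline or (used & tainted):
                    findings.append((fn.name, stmt.lineno))
    return findings


# Hypothetical handlers: the first echoes the token, the second only uses it.
SAMPLE = '''
import os

def debug_config():
    token = os.environ["API_TOKEN"]
    return {"status": "ok", "token": token}

def call_upstream():
    token = os.getenv("API_TOKEN")
    headers = {"Authorization": f"Bearer {token}"}
    return {"status": "ok"}
'''

echoes = find_credential_echo(SAMPLE)
print(echoes)
```

The check ignores statement order, so it will over-report in some cases and miss values that pass through a second variable or a logger call. A production tracer would need real data-flow analysis.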

For team buyers, SkillAudit's CI Webhook (Pro) wires the scan into a GitHub Action that fails PRs introducing new MCP servers below a configurable minimum grade. The GitHub Action gate page covers the workflow.
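
The gating logic itself is simple to picture. The following is a hypothetical sketch of a minimum-grade check, not SkillAudit's actual Action or API: the grade ordering "ABCDF", the function names, and the server names are all assumptions made for illustration.

```python
# Assumed grade scale, best to worst; the real scale may differ.
GRADE_ORDER = "ABCDF"


def meets_minimum(grade: str, minimum: str) -> bool:
    """True when `grade` is at least as good as `minimum`."""
    return GRADE_ORDER.index(grade) <= GRADE_ORDER.index(minimum)


def failing_servers(new_servers: dict[str, str], minimum: str = "C") -> list[str]:
    """Return the newly introduced servers that fall below the minimum grade."""
    return sorted(
        name for name, grade in new_servers.items()
        if not meets_minimum(grade, minimum)
    )


# Hypothetical servers added by a pull request, with their audit grades.
blocked = failing_servers(
    {"docs-mcp": "A", "notes-mcp": "D", "shell-mcp": "F"},
    minimum="C",
)
print(blocked)  # a CI step would exit non-zero when this list is non-empty
```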

What 101 servers told us about the scanner-readiness of this market

We scanned the 101 most-installed Claude skills and Model Context Protocol servers — vendor-official releases (Stripe, PayPal, MongoDB, Redis, Cloudflare, AWS, Azure, GCP, Heroku, Elastic, Notion, Snowflake), popular indie frameworks (FastMCP, mcp-use, mcp-agent), and the nine official Anthropic SDKs. The full live board is at /audits/. The grade distribution: 19 A · 30 C · 10 D · 42 F.
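
For readers who want the distribution as shares of the corpus, the arithmetic is a short calculation over the counts quoted above; only the variable names are ours.

```python
# Grade counts as reported in the distribution above.
grades = {"A": 19, "C": 30, "D": 10, "F": 42}

total = sum(grades.values())
shares = {grade: round(100 * count / total, 1) for grade, count in grades.items()}

print(total)   # 101
print(shares)  # {'A': 18.8, 'C': 29.7, 'D': 9.9, 'F': 41.6}
```

The A share of 18.8% is the figure rounded to 19% in the TL;DR.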

Three implications for choosing a scanner:

How to pick one

  1. Start with one repo you already trust. Run the scanner against it. The output should cite file paths and line numbers you recognize. If it's a finding list with no MCP-shaped detail, the scanner doesn't model MCP; it's running a generic SAST and rebadging the output.
  2. Run it against one repo you don't trust. A vendor F-grade from our corpus is a fair test. The scanner should surface SSRF and credential findings; if it stays quiet, the generic-SAST hypothesis just got stronger.
  3. Check the LLM-probe story. "We scan for prompt injection" with no methodology page is a marketing claim, not a feature. Demand a methodology, a calibration set, and an honest list of what the probe doesn't catch.
  4. Check the buyer surface. A team lead deciding whether to allow an indie skill into their fleet wants A–F, not a 60-finding JSON. The scanner needs a buyer-readable output, not just a developer-targeted one.

Related questions

Is an MCP scanner a replacement for Snyk or Dependabot?

No. SCA tools find CVEs in dependencies, which is a real but separate problem. An MCP scanner finds first-party tool-handler bugs that don't have CVEs. Run both in CI; they don't overlap.

Does it scan private repos?

SkillAudit's free tier scans public repos only. The Pro tier ($19/mo) adds private GitHub repo scanning via a single-repo OAuth scope. We don't store source after the scan completes; the privacy page covers data handling.

How long does a scan take?

Public-repo scans complete in roughly 60 seconds for a typical MCP server. The LLM-probe axis is the bottleneck for larger repos with many registered tools. The scan results stream into the report-card URL as each axis completes.

Can I scan an npm package or a ZIP, not a GitHub repo?

Yes — the scanner accepts an npm package name (resolves the latest published tarball) or an uploaded ZIP. The same six-axis report card is produced. Useful when the publisher's GitHub repo is out of date relative to the actual published artifact.

How is this different from MCP Inspector?

MCP Inspector is a dev-loop tool: it lets you call your own server's tools manually to verify they work. It is not a security scanner. SkillAudit grades a server's code; Inspector tests its behavior. Both are useful, and neither replaces the other.

Further reading