Security Methodology

The SkillAudit scorecard explained: how we grade MCP server security across six axes

A complete breakdown of every check we run, how findings map to severity levels, how axis scores are combined into a letter grade, and what you need to fix to move up.

Published 2026-06-04 · 14 min read

Every SkillAudit report shows six axis scores and a letter grade. Authors frequently ask: what exactly triggers a finding on axis X? What is "HIGH" versus "MEDIUM"? Why did a single finding drop my overall from A to B? This post answers all of it — the full scoring methodology, transparent and reproducible.

We publish this not as a cheat sheet but because trust requires explainability. An audit tool that won't tell you how it works is useless for authors who need to know what to fix. Reviewers and team leads evaluating whether to rely on SkillAudit grades deserve to know the methodology before they put it in a gate.

Overview: the six axes

SkillAudit evaluates every MCP server across six independent axes. Each axis has its own score (0–100) and its own weight. The weighted average of axis scores maps to the letter grade.

Axis	Weight
Security	35%
Credentials	25%
Permissions	15%
Maintenance	12%
Compatibility	8%
Documentation	5%
The weighting reflects blast radius: a Security or Credential finding can cause an immediate breach; a Documentation gap causes friction but not compromise. The weights are not symmetric because the risks are not symmetric.

How severity levels work

Every check produces zero or more findings. Each finding has a severity: HIGH, MEDIUM, LOW, or INFO.

Severity	Definition	Point deduction (per finding)
HIGH	Directly exploitable with high confidence, or a pattern that enables a known attack class (e.g., SSRF via unvalidated URL argument, shell injection via exec() on tool args, credential logged at INFO level)	−20 from axis score
MEDIUM	Exploitable under specific conditions, or a pattern that significantly increases attack surface without being directly exploitable in isolation (e.g., over-declared permissions, unchecked numeric bounds, advisory acknowledged but not patched)	−8 from axis score
LOW	Increases risk but requires multiple conditions to exploit (e.g., missing Content-Security-Policy, no explicit error type definitions, documentation gap on a security-relevant parameter)	−3 from axis score
INFO	Informational observation. Not a finding, not a deduction. Surfaced in the report for author awareness only.	0

Axis scores floor at 0 — multiple HIGH findings do not produce a negative score. The axis score starts at 100 and deductions are subtracted until it bottoms out.

Finding caps. To prevent a single repeated pattern from dominating an axis, deductions within a single check type are capped. A tool with fifty unvalidated URL arguments gets the same deduction as one with three: the finding is that the pattern exists, not that it appears N times. The cap keeps trivially generated code from scoring worse than hand-written code with the same structural problem.
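The deduction, floor, and cap rules above can be sketched in a few lines. This is an illustrative model, not the engine's code: the point values come from the severity table, but the exact cap is not specified in this post, so the sketch assumes the simplest reading (one deduction per check type). The `checkType` and `severity` field names are hypothetical.

```javascript
// Point deductions per severity, taken from the severity table above.
const DEDUCTION = { HIGH: 20, MEDIUM: 8, LOW: 3, INFO: 0 };

// Assumption: the cap is modeled as "one deduction per check type".
// The finding fields (checkType, severity) are hypothetical names.
function axisScore(findings) {
  const seen = new Set();
  let score = 100; // every axis starts at 100
  for (const { checkType, severity } of findings) {
    if (seen.has(checkType)) continue; // repeated pattern, already deducted
    seen.add(checkType);
    score -= DEDUCTION[severity] ?? 0;
  }
  return Math.max(0, score); // axis scores floor at 0
}
```

Under this reading, fifty findings of the same check type cost the same 20 points as one, and six distinct HIGH findings bottom out at 0 rather than going negative.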

The letter grade mapping

Letter grade	Weighted score	What it means
A	90–100	Zero HIGH findings across all axes. At most two MEDIUM findings, each with a documented mitigation. Maintenance signal is green. Safe to install without manual review.
B	75–89	No HIGH findings in Security or Credential axes. May have one MEDIUM finding in either, or a cluster of MEDIUM in Permissions/Maintenance. Review recommended before production use.
C	55–74	One HIGH finding OR multiple MEDIUM findings across key axes. Usable in sandboxed or low-privilege contexts with explicit owner sign-off. Not suitable for team-wide adoption without remediation.
D	35–54	Multiple HIGH findings or a systemic pattern (e.g., all tools log arguments, or no permission declarations exist). Should not be installed until findings are resolved.
F	0–34	Severe findings: credential exfiltration surface, active exploitable SSRF, shell injection, or known-unpatched critical advisory. Do not install.

The grade is computed at the end of the scan — not incrementally. Authors see the letter grade on the public report page; the full score breakdown (per-axis, per-finding) is in the private deep report (Pro and Team plans).
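As a rough illustration of the mapping, the sketch below combines six axis scores into a weighted total and looks up the letter band. The weights and band boundaries are the ones stated in this post. The function names are hypothetical, and treating each band's lower bound as inclusive for fractional totals is an assumption. Only the numeric bands are modeled; the qualitative conditions in the "What it means" column are not encoded.

```javascript
// Axis weights in percent, as stated in the axis sections of this post.
const WEIGHTS = {
  security: 35,
  credentials: 25,
  permissions: 15,
  maintenance: 12,
  compatibility: 8,
  documentation: 5,
};

// Weighted average of the six axis scores (each 0-100).
function weightedTotal(axisScores) {
  let sum = 0;
  for (const [axis, weight] of Object.entries(WEIGHTS)) {
    sum += axisScores[axis] * weight; // stay in integers until the final division
  }
  return sum / 100;
}

// Assumption: a fractional total belongs to the band whose lower bound it meets.
function letterGrade(total) {
  if (total >= 90) return "A";
  if (total >= 75) return "B";
  if (total >= 55) return "C";
  if (total >= 35) return "D";
  return "F";
}
```

Keeping the weights as integer percentages avoids small floating-point drift when a total lands near a band boundary.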

Axis 1 — Security (35%)

Weight: 35% · Checks: 28

The Security axis covers the attack surface accessible to an adversary who controls tool arguments or can inject instructions into a document the LLM reads. It combines static analysis (AST-level pattern matching via tree-sitter) with LLM-assisted red-teaming that generates adversarial tool invocations against the server's declared schema.

HIGH SSRF via unvalidated URL argument — any tool accepting a URL or hostname parameter without allowlist validation. Includes http://, file://, and ftp:// schemes. Detected by AST pattern matching on fetch/axios/request calls where the URL argument derives from tool args without an allowlist check.
HIGH Shell injection — spawn() with { shell: true }, exec(), or execSync() with tool arguments interpolated into the command string. This includes template literal interpolation (`cmd ${args.filename}`) and string concatenation. Does not flag spawn() with shell: false and arguments passed as an array.
HIGH Prompt injection via tool description — tool description contains dynamic content from external sources at load time, or includes instruction-following language that could be weaponized (e.g., "ignore previous instructions and…" patterns in any reachable string).
HIGH eval() / Function() on argument content — any dynamic code execution where input derives from tool arguments or external data.
HIGH Path traversal — unvalidated file path argument where path.join() or path.resolve() is not followed by a startsWith(allowedRoot) check. Flags both absolute path construction and relative traversal patterns (../../).
MED Unchecked integer overflow — numeric arguments used as array indices, buffer sizes, or loop counts without Number.isSafeInteger() or clamping.
MED ReDoS — regex patterns with catastrophic backtracking applied to user-controlled strings (detected via static analysis of the regex AST).
MED XML injection — any XML construction with concatenated user-controlled strings outside a proper escaping library.
MED CRLF injection — argument values interpolated into HTTP headers, log lines, or CSV output without stripping \r\n.
MED Unsafe deserialization — JSON.parse() on an argument value without schema validation (prototype pollution risk via __proto__ keys in some environments).
LOW Missing input schema — tool registered without a Zod/JSON Schema declaration on arguments, preventing the MCP host from validating inputs before the tool runs.
LOW LLM-assisted red-team result — the engine's red-team module generated an invocation that produced an anomalous response (credential hint in output, error stack trace, unexpected external network call). Flagged as LOW only when the evidence is ambiguous; upgraded to HIGH if a concrete exfiltration path is confirmed.

The LLM-assisted red-team runs against every tool in the server's schema. It generates prompt injection payloads via tool arguments and tool descriptions, looking for outputs that contain credential strings, internal path fragments, or unusual network calls. The red-team results supplement — they do not replace — static findings.
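For authors fixing a path traversal finding, the containment pattern named in the check list looks roughly like the sketch below. It shows one way to satisfy a "resolve, then check startsWith(allowedRoot)" rule. It is not the analyzer's own logic, and `resolveInsideRoot` and its parameters are hypothetical names.

```javascript
const path = require("node:path");

// Resolve a user-supplied path against an allowed root and reject escapes.
// `resolveInsideRoot`, `allowedRoot`, and `userPath` are hypothetical names.
function resolveInsideRoot(allowedRoot, userPath) {
  const root = path.resolve(allowedRoot);
  const resolved = path.resolve(root, userPath);
  // Compare against root + separator so a sibling such as "/srv/data-evil"
  // cannot pass a naive prefix check for "/srv/data".
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error("Path escapes the allowed root");
  }
  return resolved;
}
```

Note that path.resolve() does not follow symlinks, so a server that must defend against symlinked escapes would also need to compare real paths (for example via fs.realpath) before opening the file.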

Axis 2 — Credentials (25%)

Weight: 25% · Checks: 16

The Credentials axis is concerned with how the server handles secrets: where they come from, whether they appear in logs or outputs, and whether they're scoped to the minimum needed. MCP servers have ambient access to every environment variable in the process — this axis checks whether that access is appropriately constrained.

HIGH Credential in log output — any call to console.log(), logger.info(), or similar where the argument chain can include a value derived from a known credential environment variable (API_KEY, SECRET, TOKEN, PASSWORD, PRIVATE_KEY pattern matching). Includes logging the entire args object which may contain credential-adjacent fields.
HIGH Credential in tool output — a tool's return value includes a raw secret string. Detected by tracing data flow from environment variable reads to the return statement.
HIGH Credential in error message — new Error(`Connection failed: ${connectionString}`) where the interpolated value is a database URL or similar. The error propagates to the LLM context window where it can be exfiltrated.
HIGH Hardcoded credential — any string literal matching known credential patterns (sk-, sk-ant-, AKIA, ghp_, xoxb-, postgresql://, mongodb+srv://) in source code or committed config files. Checks full git history, not just current HEAD.
MED Implicit environment variable read — process.env spread or destructuring that pulls all env vars into scope, rather than reading only the named keys the server needs.
MED Missing credential validation at startup — no check that required environment variables are set (and non-empty) before the server registers its tools. Servers that start without credentials and fail at tool-call time expose the error to the LLM.
MED Cross-tool credential sharing — a single credential variable reused across tools with different permission requirements, where one tool needs only read access but the credential grants write or destructive capabilities.
LOW No credential rotation support — no TTL-bounded token pattern; the server reads a single long-lived credential at startup with no refresh mechanism.
LOW Credential in default argument value — a tool argument declares a default value that contains a credential-adjacent string or placeholder that hints at the expected secret format.

The git history scan deserves special mention. We scan full commit history for hardcoded credential patterns — not just the current HEAD. A secret that was committed and then deleted is still a valid finding because the secret is permanently in the git history and must be rotated, not just removed from the working tree. Authors are often surprised that removing a line doesn't close the finding.
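Two of the MEDIUM checks on this axis (implicit environment variable read and missing credential validation at startup) can usually be addressed together. The sketch below is one possible shape, under the assumption that the server knows its required variable names up front; `loadCredentials` and the key names in the usage line are hypothetical.

```javascript
// Read only the named keys the server needs, and fail fast before any tool
// is registered. `loadCredentials` is a hypothetical helper name.
function loadCredentials(env, requiredKeys) {
  const creds = {};
  const missing = [];
  for (const key of requiredKeys) {
    const value = env[key];
    if (typeof value !== "string" || value.trim() === "") {
      missing.push(key);
    } else {
      creds[key] = value;
    }
  }
  if (missing.length > 0) {
    // Report variable names only, never their values.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return Object.freeze(creds);
}
```

A server might call `loadCredentials(process.env, ["EXAMPLE_API_KEY"])` once at startup, so a missing variable stops the process before a tool call can surface the failure to the LLM.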

Axis 3 — Permissions (15%)

Weight: 15% · Checks: 11

The Permissions axis checks whether the server accurately declares what it can do, and whether it asks for less than the maximum. This matters because MCP host policies and team adoption gates increasingly rely on machine-readable permission declarations to make installation decisions. An over-declared server forces human review of every team member's installation; an under-declared server is deceptive.

HIGH No permission manifest — the server ships with no mcp-manifest.json, no YAML permission declaration, and no inline annotations in tool definitions. Without a manifest, downstream policy enforcement is impossible.
MED Over-declared write scope — manifest declares write:filesystem or equivalent but no tool in the source actually performs a write operation. Over-declaration inflates the blast-radius estimate and triggers unnecessary human review.
MED Under-declared network scope — source calls external URLs but manifest does not declare network:outbound. Under-declaration means policy gates pass a server they should have blocked.
MED Tool destructive annotation missing — a tool performs a delete, truncate, or irreversible write operation but is not annotated destructive: true in the tool definition. MCP host clients use this annotation to insert confirmation gates.
MED Overly broad filesystem access — the server reads or writes anywhere on the filesystem without constraining to a declared working directory. Compare against servers that declare and enforce a WORKING_DIR root.
LOW Tool description omits permission implications — a tool description does not mention that it makes external network calls, even though it does. Users reading the description before installation have no warning.
LOW No readOnly: true annotation on read-only tools — tools that only read data should be annotated to allow hosts to grant them without requesting write capabilities.

The over/under-declaration checks use a two-pass approach: first, the manifest is parsed and its declared permissions are extracted; second, the source is statically analyzed for actual capability usage. The gap between declared and actual is the finding. This makes the Permissions axis effectively a correctness check on the manifest, not just its presence.
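The comparison step of that two-pass approach amounts to a set difference in each direction. The sketch below assumes both passes have already produced flat lists of scope strings in the style used in this post (write:filesystem, network:outbound); the function name and result shape are hypothetical, and the real static analysis that produces the observed list is out of scope here.

```javascript
// Compare scopes declared in the manifest with scopes observed in the source.
// `permissionGap` and its result shape are hypothetical.
function permissionGap(declared, observed) {
  const declaredSet = new Set(declared);
  const observedSet = new Set(observed);
  return {
    // Declared in the manifest but never used in the source.
    overDeclared: [...declaredSet].filter((scope) => !observedSet.has(scope)),
    // Used in the source but missing from the manifest.
    underDeclared: [...observedSet].filter((scope) => !declaredSet.has(scope)),
  };
}
```

An empty result in both directions corresponds to a manifest that matches actual capability usage.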

Axis 4 — Maintenance (12%)

Weight: 12% · Checks: 9

The Maintenance axis is the only axis that degrades with time regardless of code changes. It measures signals that indicate the server will receive security fixes when they're needed, not whether it's already secure. A server that's A-grade today but abandoned tomorrow can lose up to 12 weighted points on this axis alone (its full 12% weight), which is enough to drop it from A to B in under six months.

HIGH Unacknowledged critical advisory — a dependency has a published CVE at severity CRITICAL or HIGH, and there is no Dependabot PR, no renovate config, and no comment in package.json acknowledging the advisory. "Unacknowledged" means no evidence of awareness, not that the CVE is necessarily exploitable in this specific context.
MED Last commit >90 days ago — the most recent commit to the default branch is older than 90 days. This is a signal of abandonment, not a defect. The deduction is applied even if the server has zero security findings, because a maintained server responds to advisories; an unmaintained one doesn't.
MED Open security issue >30 days old — a GitHub issue labeled "security", "vulnerability", or "CVE" has been open without a response for more than 30 days.
MED Unpinned dependencies — package.json uses ranges (^1.2.3, ~4.5.6) without a lockfile committed to the repository. Unpinned production dependencies mean the server's behavior drifts between installs.
MED No Dependabot or Renovate config — no automated dependency update tooling configured. This is not a finding on its own but combines with other signals for the overall axis score.
LOW No CHANGELOG or release history — no documented record of what changed between versions. Not a security issue, but a signal that the maintainer is not tracking changes systematically.
LOW No SECURITY.md — no documented process for reporting vulnerabilities. Responsible disclosures from researchers go unreported because there's no obvious channel.

The 90-day threshold for last commit was calibrated on our corpus: servers that last committed 90+ days ago had a 4.3× higher rate of unpatched advisories at the next monthly rescan compared to actively maintained servers. It is a leading indicator, not a direct finding. See grade drift analysis for the full rescan data.
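The age-based checks on this axis reduce to a date comparison. A minimal sketch, assuming the caller already has the relevant timestamp (for example, the last commit on the default branch) as a Date; `isStale` is a hypothetical name, and the 90-day default mirrors the threshold above.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// True when `timestamp` is more than `thresholdDays` older than `now`.
// Both arguments are Date objects; `isStale` is a hypothetical helper.
function isStale(timestamp, now, thresholdDays = 90) {
  const ageDays = (now.getTime() - timestamp.getTime()) / DAY_MS;
  return ageDays > thresholdDays;
}
```

The same helper with `thresholdDays` set to 30 would cover the open-security-issue check.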

Axis 5 — Compatibility (8%)

Weight: 8% · Checks: 7

The Compatibility axis verifies that the server works correctly with the four major MCP host clients: Claude Code, Cursor, Windsurf, and Codex. Compatibility is weighted below security because it affects functionality, not safety — but a server that silently malfunctions on a popular client is worse than one that fails loudly. Silent malfunction can produce unexpected tool invocations.

HIGH Transport incompatibility — server registers tools via a transport the declared client does not support (e.g., HTTP streaming on a stdio-only host, or SSE on a client that requires batch). Detection via declared transport type in manifest vs. known host capabilities matrix.
MED Tool schema uses unsupported JSON Schema features — anyOf, oneOf, $ref, or recursive schemas that some host clients serialize incorrectly. Detected against the known compatibility matrix for each client version.
MED Missing protocol_version declaration — server does not declare which MCP protocol version it implements, preventing hosts from performing capability negotiation correctly.
LOW Tool name contains non-ASCII characters or symbols — some host clients reject tool names with characters outside [a-z0-9_-].
LOW Tool description exceeds 512 characters — some clients truncate descriptions at 512 characters, which can silently strip security-relevant text from the LLM's context.
LOW Return value exceeds soft size limit — tool returns objects larger than 64 KB, which some host clients buffer entirely in memory, causing degraded performance or OOM in resource-constrained environments.
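Two of the LOW checks above are easy to run locally before submitting a scan. The sketch below applies the character class and the 512-character limit quoted in this section; the finding identifiers are hypothetical, and measuring length with String.prototype.length (UTF-16 code units) is an assumption about how "characters" are counted.

```javascript
// Character class and limit quoted in the checks above.
const TOOL_NAME_PATTERN = /^[a-z0-9_-]+$/;
const MAX_DESCRIPTION_CHARS = 512;

// Returns a list of hypothetical finding identifiers for one tool definition.
function lintToolMetadata(tool) {
  const findings = [];
  if (!TOOL_NAME_PATTERN.test(tool.name)) {
    findings.push("tool-name-charset");
  }
  // Assumption: length is measured in UTF-16 code units.
  if ((tool.description ?? "").length > MAX_DESCRIPTION_CHARS) {
    findings.push("description-too-long");
  }
  return findings;
}
```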

Axis 6 — Documentation (5%)

Weight: 5% · Checks: 6

Documentation is the lowest-weighted axis but not zero — documentation gaps make security findings harder to evaluate and mitigations harder to implement. A server where the tool descriptions don't explain what a parameter does forces users to infer, which often means inferring wrong. The axis does not reward verbosity; it rewards completeness on security-relevant attributes.

MED No runnable example — README or documentation contains no working invocation example. A user cannot verify the server works correctly before installing into a production agent.
MED Tool description omits required credential setup — a tool requires an API key or database connection string but the description (and README) don't document what env vars to set. Users guess; guesses are often wrong and produce error messages that leak context.
LOW No version string in package.json or equivalent — versioning enables reliable dependency pinning. Without it, npm install author/repo#HEAD installs whatever is current, not a specific known state.
LOW Parameter descriptions missing on one or more tool arguments — inputSchema declares a field without a description property. The LLM must infer semantics from the field name alone.
LOW No README — server ships without a top-level README.md or equivalent documentation file.
INFO Tool count — surfaced as an informational note. High tool counts (>20) may indicate a server that does too much; low tool counts (<2) may indicate incomplete implementation. Not a finding.
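The missing-parameter-description check can be approximated with a short walk over a tool's inputSchema. This sketch inspects top-level properties only and treats an empty or whitespace-only description as missing; both choices are assumptions, and `undocumentedParams` is a hypothetical name.

```javascript
// List top-level properties of a JSON Schema object that lack a non-empty
// description. Nested schemas are not inspected in this sketch.
function undocumentedParams(inputSchema) {
  const properties = inputSchema.properties ?? {};
  return Object.entries(properties)
    .filter(([, spec]) => typeof spec.description !== "string" || spec.description.trim() === "")
    .map(([name]) => name);
}
```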

How the overall score is calculated: a worked example

Take a hypothetical MCP server — a filesystem bridge that reads and writes files. Here's a realistic scan result:

Axis	Findings	Raw axis score	Weight	Weighted contribution
Security	1 HIGH (path traversal) · 1 MED (unchecked int)	100 − 20 − 8 = 72	35%	25.2
Credentials	1 MED (no startup validation)	100 − 8 = 92	25%	23.0
Permissions	1 MED (over-declared write), 1 LOW (no readOnly annotation)	100 − 8 − 3 = 89	15%	13.35
Maintenance	1 LOW (no SECURITY.md)	100 − 3 = 97	12%	11.64
Compatibility	Clean	100	8%	8.0
Documentation	1 LOW (param descriptions missing)	100 − 3 = 97	5%	4.85
Weighted total				86.0 → grade B

A single path traversal finding (HIGH in Security) cost this server its A. The path from B to A in this case is straightforward: add the startsWith(allowedRoot) containment check after path.join() and add Number.isSafeInteger() on the integer argument. Those two fixes clear the HIGH and one MEDIUM, restoring the Security axis to 100 and pushing the weighted score to approximately 96, a solid A. The path traversal fix alone reaches about 93.

Why one HIGH finding can cost a full grade. A HIGH on the Security axis costs 20 raw points out of 100. With a 35% weight, that translates to 7 points off the weighted total. Moving from 93 (A) to 86 (B) is exactly a 7-point drop, so a single HIGH finding in Security is enough to push a server that carries only a few lower-severity findings from A to B. A second HIGH in Security (−14 weighted points in total) moves the same server from 93 to 79, which is still B but only four points above the C boundary: two more Security MEDIUMs (−2.8 weighted points each) would take it to about 73, a C. This is why the most common path to an A is eliminating every HIGH first, then cleaning up MEDIUMs.
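The same arithmetic works for any axis and severity: the cost of a finding on the weighted total is its raw deduction multiplied by the axis weight. The sketch below uses the deductions and weights stated in this post; `weightedImpact` is a hypothetical helper, and it ignores the per-check-type cap and the floor at 0.

```javascript
// Raw deductions and axis weights (in percent) as stated in this post.
const DEDUCTION = { HIGH: 20, MEDIUM: 8, LOW: 3, INFO: 0 };
const WEIGHT_PERCENT = {
  security: 35,
  credentials: 25,
  permissions: 15,
  maintenance: 12,
  compatibility: 8,
  documentation: 5,
};

// Points removed from the weighted total by one finding on the given axis.
// Ignores the per-check-type cap and the axis floor at 0.
function weightedImpact(axis, severity) {
  return (DEDUCTION[severity] * WEIGHT_PERCENT[axis]) / 100;
}
```

For example, a HIGH costs 7 weighted points on Security but only 1 on Documentation, which is the asymmetry the weights are designed to express.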

What the public badge shows

The public badge embedded in a repository README or marketplace listing shows:

The overall letter grade (A/B/C/D/F)
The scan timestamp
A link to the public report page

The public report page shows the per-axis letter grades (not numeric scores) and the count of findings per severity per axis. It does not show finding details — those require Pro access. This is deliberate: finding details are a roadmap for exploitation if published publicly before the author has had time to fix them. Authors who want to share the full report with reviewers can generate a time-limited sharing link from the Pro dashboard.

The Pro plan adds a live badge endpoint: the badge dynamically reflects the most recent scan rather than being frozen at the point of first publication. This matters because the badge in a README is often read weeks or months after the scan that generated it. See the grade drift post for details.

Rescan policy

A scan result is valid for 30 days for badge purposes. After 30 days, the badge shows a "stale" indicator rather than the grade. Authors on Pro can trigger rescans manually or via CI webhook at any cadence. The CI webhook integration runs a scan on every pull request and gates the merge if the grade falls below the configured threshold — this is the recommended usage pattern for production MCP servers.

Rescans on the same repository compare the new finding list against the prior scan and generate a delta report: new findings (regressions), resolved findings (improvements), and unchanged findings. The delta report is the primary artifact for tracking remediation progress over time.
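Conceptually, the delta report is a three-way partition of two finding lists. The sketch below shows that partition under the assumption that a finding can be identified by its check type plus a location string; the actual matching key used by the engine is not described in this post, and all names here are hypothetical.

```javascript
// Assumption: a finding is identified by check type plus location.
// `findingKey`, `checkType`, and `location` are hypothetical names.
const findingKey = (finding) => `${finding.checkType}@${finding.location}`;

// Partition two scans into regressions, resolved, and unchanged findings.
function deltaReport(previous, current) {
  const previousKeys = new Set(previous.map(findingKey));
  const currentKeys = new Set(current.map(findingKey));
  return {
    regressions: current.filter((f) => !previousKeys.has(findingKey(f))),
    resolved: previous.filter((f) => !currentKeys.has(findingKey(f))),
    unchanged: current.filter((f) => previousKeys.has(findingKey(f))),
  };
}
```

A location-based key is fragile when code moves between files, so a production matcher would likely need something more stable than this.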

Appeals and disputed findings

Static analysis produces false positives. If a finding is incorrect — for example, the path traversal check flags a path that is actually constrained by a validation step the AST parser missed — the author can file a dispute on the Pro dashboard. Disputes are reviewed manually. If the dispute is upheld, the finding is suppressed for future scans on that repository, and the current score is recalculated without it. If the dispute is rejected, the finding remains with a note explaining why the disputed pattern still meets the finding criteria.

We uphold roughly 12% of filed disputes across all check types. The highest dispute rate is on the Permissions axis (over-declared scope), where the gap between declared and actual permissions is sometimes intentional — authors declare write scope as a forward-compatibility measure for features not yet implemented. That's a valid design choice; it's also a valid finding. Both can be true.

What "A-grade" actually requires

Translated from scoring arithmetic to practical requirements, an A grade demands:

Zero HIGH findings anywhere. No unvalidated URL args, no shell injection, no eval(), no hardcoded credentials, no credential in logs or outputs, no transport incompatibility.
At most two MEDIUM findings total, neither in Security nor in Credentials. The most common surviving MEDIUMs on A-grade servers are in Permissions (over-declared scope that is genuinely intentional) or Documentation (missing runnable example).
Active maintenance signals. A commit within the last 90 days, no unacknowledged critical advisories, Dependabot or Renovate configured.
A machine-readable permission manifest. Present, accurate (not over-declared or under-declared), with destructive annotations on destructive tools.

Two of the 139 servers in our public corpus have zero findings across all axes — both share the same construction pattern covered in how to write a zero-finding MCP server. Our ten-most-common-C-grade findings post covers the specific patterns that appear most frequently on the near-miss servers that almost made A.

Engine versioning

The SkillAudit engine is versioned. Every report notes the engine version that produced it. When we update the engine — adding new checks, tuning thresholds, updating the compatibility matrix for new host client versions — scores for previously-scanned servers may change. The delta report distinguishes between findings that changed because the server changed and findings that changed because the engine changed.

We publish an engine version changelog on every release. The v0.3 calibration delta post is an example: it covers the specific threshold changes in that release and their direction of effect on existing scores.

Run your own scan

Paste a GitHub URL on the SkillAudit homepage. The free tier runs the full Security, Credentials, and Permissions checks on public repositories with three scans per month. The per-axis score breakdown, the finding detail list, and the remediation hints are Pro features. The public badge is available to all plans.

If you're evaluating SkillAudit for team adoption gating, the Team plan adds the CI webhook, policy export, per-repo grade floor configuration, and the audit log that satisfies SOC 2 evidence requests for third-party software vetting.