Methodology · v0.3
How SkillAudit grades a Claude skill or MCP server.
Six axes. Surface-tiered deductions. Per-(axis, surface) caps. One LLM-assisted prompt-injection probe per scan. The complete rubric every public report card is scored against — published in full so authors can reproduce a grade locally and so reviewers can audit our judgements before trusting them.
What the engine does
Given a public GitHub URL (or, for paid users, a private repo with an OAuth token scoped to that one repository), the SkillAudit engine performs a deterministic four-phase scan:
- Clone: shallow-clone the repo at the provided ref (default: default branch HEAD) into an ephemeral sandbox. The sandbox is destroyed as soon as the report is rendered; we never train on customer code and we never hold source after the scan.
- Walk: enumerate every .js / .ts / .tsx / .py source file plus the manifest files (package.json, pyproject.toml, .claude-plugin/plugin.json, MCP manifests). Each file is tagged with a surface (see below) before any check runs.
- Six static checks: SSRF, command exec, credential exposure, permission drift, maintenance signals, documentation completeness. Each finding is a {file, line, severity, kind, surface, reason, match} tuple.
- LLM prompt-injection probe: every server.tool(...) / @app.tool registration is bundled with the first ~60 lines of its handler body and red-teamed by Claude Haiku 4.5 in a single bounded API call. Findings roll up into the security axis. If ANTHROPIC_API_KEY is unset, the probe skips gracefully and the static grade is still produced; the skipped status is shown in every report's header.
The static portion is deterministic given the same commit. The probe is deterministic for the same handler-body input within a Claude version (we pin the model on every scan and cite it in the report).
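As a rough illustration of the finding tuple described above, here is a minimal sketch of one finding and a field check. The field names come from the tuple; every value, the `ssrf` kind label, and the `missingFindingFields` helper are assumptions made up for this example, not the engine's actual output.

```javascript
// Hypothetical finding shaped like the {file, line, severity, kind, surface, reason, match} tuple.
// All values below are invented for illustration.
const exampleFinding = {
  file: "src/tools/fetchPage.js",
  line: 42,
  severity: "high", // "high" or "warn"
  kind: "ssrf", // assumed label; the real kind names live in report.js
  surface: "production",
  reason: "fetch() called with a URL built from tool input",
  match: "fetch(baseUrl + params.path)",
};

const REQUIRED_FINDING_FIELDS = ["file", "line", "severity", "kind", "surface", "reason", "match"];

// Returns the names of required fields that are absent from a finding.
function missingFindingFields(finding) {
  return REQUIRED_FINDING_FIELDS.filter((field) => !(field in finding));
}
```

A complete finding yields an empty list; a partial one lists what is missing, which is a cheap sanity check when comparing a local run against a published report.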
The six axes
Every report card scores six independent axes from 100 down to 0. The overall score is the lowest axis score, so a security problem cannot be hidden behind a clean docs score.
| Axis | What it catches | Where it looks |
|---|---|---|
| Security | SSRF; command exec via interpolated shell; LLM-assisted prompt-injection from untrusted-content tool calls | fetch / axios / requests / urllib calls, exec / shell=True / os.system / subprocess sinks, server.tool handler bodies |
| Permissions | Tools named read_* or get_* whose handlers contain a write/exec sink; manifests requesting org-wide scopes when single-repo would suffice | Manifest permissions blocks, tool registrations vs handler body |
| Credentials | Log/error sinks of process.env; hardcoded tokens by known prefix (AKIA, ghp_, sk-, xoxb-, …); secrets returned in tool responses | All source files; we redact before reporting |
| Maintenance | Days since last push; archived state; release cadence; open-issue ratio | GitHub API for the repo |
| Compatibility | Engine declarations (Node engines, python_requires); client-specific quirks for Claude Code / Cursor / Windsurf / Codex CLI | Manifests + per-client compat matrix |
| Docs | README install + usage section; LICENSE; SECURITY.md; manifest repository field; runnable example | Repo root, README byte size + section parsing |
A finding can only contribute to one axis (its kind determines the axis via a fixed map in product-api/audit/report.js). Each axis starts at 100 and only deducts.
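The two rules above (one axis per finding kind, overall score equal to the worst axis) can be sketched as follows. The kind labels in `KIND_TO_AXIS_SKETCH` are assumptions for illustration; the real fixed map is the one in product-api/audit/report.js.

```javascript
// Assumed kind labels; only the axis names are taken from the table above.
const KIND_TO_AXIS_SKETCH = new Map([
  ["ssrf", "security"],
  ["command_exec", "security"],
  ["prompt_injection", "security"],
  ["permission_drift", "permissions"],
  ["credential_exposure", "credentials"],
]);

const AXIS_NAMES = ["security", "permissions", "credentials", "maintenance", "compatibility", "docs"];

// Each kind maps to exactly one axis; unknown kinds are rejected rather than guessed.
function axisForKind(kind) {
  const axis = KIND_TO_AXIS_SKETCH.get(kind);
  if (axis === undefined) throw new Error("unknown finding kind: " + kind);
  return axis;
}

// The overall score is the minimum across the six axis scores.
function overallScore(axisScores) {
  return Math.min(...AXIS_NAMES.map((axis) => axisScores[axis]));
}
```

For example, a repo scoring 100 on five axes and 40 on security gets an overall score of 40.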
Surface tiers (the v0.3 calibration core)
v0.2 was binary: production source vs. test source. That was too coarse. examples/ directories full of sk-*** placeholders kept tanking otherwise-clean MCP server grades; chatty benchmarks/ directories dominated the Stripe agent-toolkit's score even though zero findings landed in the runtime tool surface. v0.3 fixes this by tagging every finding with a surface — the kind of code path it lives in — and weighting deductions per-tier.
- production: src/, top-level index.{js,ts,py}, or any path that the MCP server / Claude skill loads at request time. The default if no other tier matches.
- installer: .claude-plugin/install*. Runs on user systems but is not the LLM-facing tool surface. A credential leak in an installer is real but has a lower blast radius than one in a runtime tool.
- examples: examples / samples / demos / fixtures / cookbook / tutorials; filenames matching *.example.ts / *.sample.py; .env.{example,template,sample}. Demonstration code, not the production path.
- benchmarks: benchmarks / bench / perf. Performance harnesses don't ship to users.
- scripts: scripts / bin / .github. Build and CI tooling. Nested src/foo/scripts/ stays production because it ships with the runtime.
- test: tests/, __tests__/, spec/, e2e/, *.test.ts, *.spec.tsx, test_*.py, *_test.go.

Order of classification matters. The classifier checks tests first, then installer, then examples (any path segment), then benchmarks, then scripts (top-level only), then falls through to production. A test file inside examples/__tests__/ is classified as test, not examples; an installer script in .claude-plugin/install/scripts/foo.sh is classified as installer, not scripts. The full classifier is in product-api/audit/report.js:classifySurface (open-source and unit-tested).
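The classification order can be sketched as a chain of early returns. This is a minimal approximation under stated assumptions, not the real classifySurface: the function name `classifySurfaceSketch`, the generalized filename patterns, and the choice to match benchmark directories at any depth are all guesses made for illustration.

```javascript
const TEST_DIR_NAMES = new Set(["tests", "__tests__", "spec", "e2e"]);
const EXAMPLE_DIR_NAMES = new Set(["examples", "samples", "demos", "fixtures", "cookbook", "tutorials"]);
const BENCH_DIR_NAMES = new Set(["benchmarks", "bench", "perf"]);
const SCRIPT_DIR_NAMES = new Set(["scripts", "bin", ".github"]);

// Classifies a repo-relative POSIX path into one surface tier.
function classifySurfaceSketch(repoPath) {
  const normalized = repoPath.replace(/^\.\//, "");
  const parts = normalized.split("/");
  const file = parts[parts.length - 1];
  const dirs = parts.slice(0, -1);

  // 1. Tests win over everything else.
  const isTestFile =
    /\.(test|spec)\.[jt]sx?$/.test(file) || /^test_.*\.py$/.test(file) || /_test\.go$/.test(file);
  if (isTestFile || dirs.some((d) => TEST_DIR_NAMES.has(d))) return "test";

  // 2. Installer paths, matched by prefix.
  if (normalized.startsWith(".claude-plugin/install")) return "installer";

  // 3. Examples: any path segment, or an example-style filename.
  const isExampleFile =
    /\.(example|sample)\.[^.]+$/.test(file) || /^\.env\.(example|template|sample)$/.test(file);
  if (isExampleFile || dirs.some((d) => EXAMPLE_DIR_NAMES.has(d))) return "examples";

  // 4. Benchmarks (assumed here to match at any depth).
  if (dirs.some((d) => BENCH_DIR_NAMES.has(d))) return "benchmarks";

  // 5. Scripts: top-level directory only.
  if (dirs.length > 0 && SCRIPT_DIR_NAMES.has(dirs[0])) return "scripts";

  // 6. Everything else is production.
  return "production";
}
```

With this ordering, examples/__tests__/helper.ts comes out as test, .claude-plugin/install/scripts/foo.sh as installer, and src/foo/scripts/build.js as production, matching the cases described above.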
Deduction table
Each finding deducts from the axis it belongs to, weighted by surface. High-severity findings are the hard fails — confirmed SSRF on a runtime URL, an env-var token logged on an error path, an execSync with template-string interpolation. Warn-severity findings are softer — possible SSRF where the URL might be allowlisted at runtime, a credential prefix that looks like a token but might be a fixture.
| Surface | High severity | Warn severity | Why this weight |
|---|---|---|---|
| production | −30 | −10 | The runtime tool surface — what the LLM actually invokes. Full deduct. |
| installer | −15 | −5 | Runs on user systems but isn't LLM-callable. Half deduct keeps real risk visible without dominating the score. |
| examples | −5 | 0 | Demonstration code. Real if it ships, but not what the agent loads at request time. |
| benchmarks | −5 | 0 | Performance harnesses. Don't ship to users. |
| scripts | −5 | 0 | Top-level build / CI. Same logic — not the runtime path. |
| test | −5 | 0 | Test source. shell=True in a unit test isn't a vulnerability. |
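The table translates directly into a lookup. The sketch below uses the weights from the table; the names `DEDUCTION_TABLE` and `deductionFor` are hypothetical and caps are deliberately ignored here.

```javascript
// Points deducted per finding, by surface and severity (values from the table above).
const DEDUCTION_TABLE = {
  production: { high: 30, warn: 10 },
  installer: { high: 15, warn: 5 },
  examples: { high: 5, warn: 0 },
  benchmarks: { high: 5, warn: 0 },
  scripts: { high: 5, warn: 0 },
  test: { high: 5, warn: 0 },
};

// Returns the points a single finding removes from its axis.
function deductionFor(surface, severity) {
  const row = DEDUCTION_TABLE[surface];
  if (!row || !(severity in row)) {
    throw new Error("unknown surface/severity: " + surface + "/" + severity);
  }
  return row[severity];
}

// One high production finding, one warn installer finding, and one warn examples
// finding on the same axis: 100 - 30 - 5 - 0 = 65.
const exampleAxisScore =
  100 - deductionFor("production", "high") - deductionFor("installer", "warn") - deductionFor("examples", "warn");
```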
Caps — per (axis, surface)
Without a cap, a single chatty file could exhaust an axis to zero. v0.3 caps each axis-surface pair at 3 high + 5 warn deducting findings; remaining findings are still surfaced in the report but stop deducting score.
Caps are per (axis, surface), not shared across surfaces; this is the v0.3 cap-fix. v0.2 had a single per-axis cap, which meant a chatty tests/ directory could saturate the cap before any production findings landed, silencing real prod issues. The v0.3 calibration update found and corrected five repos that were unfairly held above their honest grade for exactly this reason (one of them, punkpeye/fastmcp, appears in the worked examples below).
| Cap dimension | Limit | What happens at the cap |
|---|---|---|
| (axis × surface) — high severity | 3 deducting | Finding 4+ shown in report, no further deduct on that (axis, surface) pair |
| (axis × surface) — warn severity | 5 deducting | Same — visible in report, deduct stops |
| Axis floor | 0 | Score never goes negative — once an axis hits 0 it stays at 0 |
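To make the cap arithmetic concrete, here is a minimal sketch that scores a single axis. It assumes every finding passed in already belongs to that axis, and the names `scoreAxisWithCaps`, `SURFACE_WEIGHTS`, and `CAP_LIMITS` are hypothetical; the weights and limits are the ones from the tables above.

```javascript
const SURFACE_WEIGHTS = {
  production: { high: 30, warn: 10 },
  installer: { high: 15, warn: 5 },
  examples: { high: 5, warn: 0 },
  benchmarks: { high: 5, warn: 0 },
  scripts: { high: 5, warn: 0 },
  test: { high: 5, warn: 0 },
};
const CAP_LIMITS = { high: 3, warn: 5 };

// Scores one axis. `findings` holds only that axis's findings, each {surface, severity}.
function scoreAxisWithCaps(findings) {
  const deducting = new Map(); // "surface:severity" -> count of findings that deducted
  let score = 100;
  for (const { surface, severity } of findings) {
    const key = surface + ":" + severity;
    const seen = deducting.get(key) ?? 0;
    if (seen >= CAP_LIMITS[severity]) continue; // still reported, but no further deduction
    deducting.set(key, seen + 1);
    score -= SURFACE_WEIGHTS[surface][severity];
  }
  return Math.max(0, score); // axis floor
}
```

Ten high-severity test findings plus one high-severity production finding give 100 − (3 × 5) − 30 = 55: the test surface stops deducting after three findings, and the production finding still counts in full because its cap is tracked separately.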
Maintenance, docs & compatibility caps
Three axes have heuristic caps instead of (or alongside) finding-based deductions:
- Maintenance: if daysSincePush > 365, the axis is capped at 40; if daysSincePush > 180, it is capped at 70. A repo last touched 18 months ago can't earn an A on maintenance even if every other signal is clean.
- Docs: a README under 3 KB caps the axis at 70. A README that's just a one-liner can't earn an A even if LICENSE and SECURITY.md are present.
- Compatibility: if no engine is declared (no Node engines, no python_requires), the axis defaults to 70, because we can't verify cross-client behavior without a declared runtime.
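These heuristics can be sketched as a post-processing step over the axis scores. Several details are assumptions: the `signals` field names other than daysSincePush are hypothetical, 3 KB is read as 3 × 1024 bytes, and the compatibility default is modelled as an upper limit of 70.

```javascript
// Applies the heuristic limits to already-computed axis scores and returns a new object.
// signals: { daysSincePush: number, readmeBytes: number, hasEngineDeclaration: boolean }
function applyHeuristicCaps(axisScores, signals) {
  const capped = { ...axisScores };

  // Maintenance: the stricter limit (over 365 days) is checked first.
  if (signals.daysSincePush > 365) {
    capped.maintenance = Math.min(capped.maintenance, 40);
  } else if (signals.daysSincePush > 180) {
    capped.maintenance = Math.min(capped.maintenance, 70);
  }

  // Docs: a README under 3 KB (assumed 3 * 1024 bytes) limits the axis to 70.
  if (signals.readmeBytes < 3 * 1024) {
    capped.docs = Math.min(capped.docs, 70);
  }

  // Compatibility: no declared engine means the axis cannot exceed 70 (assumption).
  if (!signals.hasEngineDeclaration) {
    capped.compatibility = Math.min(capped.compatibility, 70);
  }

  return capped;
}
```

A repo last pushed 540 days ago with a clean maintenance score of 100 comes out at 40 on that axis, while its other axes are unaffected.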
Grade buckets
The overall score is the floor of the worst axis. The grade letter is bucketed from that floor:
MIN_GRADE=C.
The LLM prompt-injection probe
Static checks can find shaped patterns: an SSRF where the URL is interpolated, a command exec where the shell template is interpolated. They can't reason about untrusted-content flow, the case where a tool fetches a webpage, reads a Jira ticket, or scrapes a Slack message and returns the body verbatim into a tool response that becomes part of the next assistant turn. That's the prompt-injection vector that's eating the MCP ecosystem in 2026, and it's exactly what the LLM-assisted probe is designed to catch.
For each server.tool(...) / @app.tool / @server.tool registration we find, we extract the tool name, description, parameter schema, and the first ~60 lines of the handler body. We bundle every registration into one input and hand it to Claude Haiku 4.5 with a red-team system prompt that asks: which of these tools can return untrusted content into the assistant turn, and what would that allow a malicious caller to do? The model returns structured findings — JSON shaped like the static findings, with a kind: "prompt_injection" tag — that roll up under the security axis at the production tier (these are runtime-surface findings).
- One API call per scan. All registrations bundled together. ~$0.02 per repo at typical sizes.
- Bounded input. Hard cap at ~15 K input tokens; if a repo exceeds that we sample tools by registration order and note the truncation in the report.
- Graceful skip. If ANTHROPIC_API_KEY is unset (development, sandboxed runs), the probe is skipped without failing the scan, and a skipped notice is rendered in the report header. The static grade is still produced.
- Pinned model. Every report's header records the exact Claude model ID used, so anyone reproducing the scan can check whether a different model version would produce a different prompt-injection finding.
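The bundling and truncation behaviour described above can be sketched without calling any API. Everything here is an assumption for illustration: the `bundleRegistrations` name, the input field names, the four-characters-per-token estimate, and the choice to stop at the first registration that would exceed the budget.

```javascript
const HANDLER_LINE_LIMIT = 60;
const INPUT_TOKEN_BUDGET = 15000;

// Crude token estimate (assumption): roughly 4 characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// registrations: [{ name, description, schema, handlerSource }] in registration order.
// Returns the tools that fit in the budget plus a flag noting whether any were dropped.
function bundleRegistrations(registrations) {
  const tools = [];
  let estimatedTokens = 0;
  let truncated = false;

  for (const reg of registrations) {
    const entry = {
      name: reg.name,
      description: reg.description,
      schema: reg.schema,
      // Keep only the first ~60 lines of the handler body.
      handler: reg.handlerSource.split("\n").slice(0, HANDLER_LINE_LIMIT).join("\n"),
    };
    const cost = estimateTokens(JSON.stringify(entry));
    if (estimatedTokens + cost > INPUT_TOKEN_BUDGET) {
      truncated = true; // would be noted in the report
      break;
    }
    tools.push(entry);
    estimatedTokens += cost;
  }

  return { tools, truncated, estimatedTokens };
}
```

The returned `truncated` flag corresponds to the truncation note in the report; the single red-team call itself is omitted because its prompt and response schema are not published in this section.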
Worked examples — what v0.3 changed in real reports
Concrete grade moves from the v0.3 calibration update on 2026-04-30. The current public corpus is 101 vendor-official MCP servers, indie frameworks, and SDKs — distribution after v0.3: 19 A · 1 B · 38 C · 6 D · 37 F. v0.2 → v0.3 moved 22 grades total. Three representative cases:
stripe/agent-toolkit · F (40) → C (70) · +30
Every flagged finding lived in benchmarks/ — performance test harnesses with hardcoded fixture values. Zero findings landed in the runtime MCP tool surface under src/. v0.2 lumped benchmarks/ into "production" at full deduct; v0.3 puts them in the benchmarks tier at low weight (high −5, warn 0). The runtime tools themselves were always clean — the grade now reflects that. View report →
modelcontextprotocol/typescript-sdk · F (40) → B (85) · +45
The TypeScript SDK is the largest open-source MCP server in the corpus, with chatty tests/ and examples/ directories. v0.2's shared per-axis cap was being saturated by test findings, masking how clean the actual SDK runtime is. The v0.3 cap-fix (per-axis, per-surface caps), combined with the low-weight test and examples tiers, produced the honest grade: the only B in the corpus. View report →
punkpeye/fastmcp · D (60) → F (40) · −20 (calibration drop)
Five repos legitimately dropped a grade under v0.3 because v0.2's shared cap was hiding production-tier findings. punkpeye/fastmcp had real production-source SSRFs (templated URLs in runtime fetch calls) that v0.2 silenced because chatty test directories saturated the high-cap before production findings landed. v0.3 counts those production findings at full weight — the F is the honest grade. We chose to ship the cap-fix knowing five repos would drop, because hiding production-tier findings to spare a grade is exactly the kind of dishonesty we accuse vendor-official MCP servers of when their examples/ is full of sk-*** placeholders. The audit board has to be honest about its own rubric. View report →
A grades are stable across calibration updates. All 19 A grades from v0.2 are still A grades under v0.3 — by definition an A repo has zero high-severity production findings, so there's nothing to re-tier. If a future calibration ever moves an A, that's a detection change (a new check), not a scoring change.
Reproducibility
The static portion of every report is deterministic given the same commit; the probe additionally depends on the pinned model version. The engine source lives at product-api/audit/ in our repo (open-source under MIT, link below). To reproduce a grade locally:
```shell
git clone https://github.com/skillaudit/skillaudit
cd skillaudit/product-api/audit
node index.js https://github.com/owner/repo
# Optional: enable the prompt-injection probe
ANTHROPIC_API_KEY=sk-ant-... node index.js https://github.com/owner/repo
```
Score deductions, severity weights, surface-tier classification, grade buckets, and cap rules all live in product-api/audit/report.js — review and challenge them. If you find a finding you think we're miscategorizing, the hello@skillaudit.dev channel is open and we publish corrections.
Engine changelog
| Version | Released | What changed |
|---|---|---|
| v0.3 | 2026-04-30 | Surface tiering — every finding tagged with production / installer / examples / benchmarks / scripts / test. Per-(axis, surface) caps replace shared per-axis caps (cap-fix). 22 grades moved on the public corpus. |
| v0.2 | 2026-04-26 | LLM-assisted prompt-injection probe added as a 7th check, rolling up under the security axis. Binary production-vs-test surface split. |
| v0.1 | 2026-04-23 | First six static checks: SSRF, command exec, credentials, permissions, maintenance, docs. Static-only. |
Run the rubric on your repo
Paste a GitHub URL. Get a graded report in 60 seconds. Free for public repos; first 100 audits go to waitlist signups in order.