Methodology · v0.3

How SkillAudit grades a Claude skill or MCP server.

Six axes. Surface-tiered deductions. Per-(axis, surface) caps. One LLM-assisted prompt-injection probe per scan. The complete rubric every public report card is scored against — published in full so authors can reproduce a grade locally and so reviewers can audit our judgements before trusting them.

What the engine does

Given a public GitHub URL (or, for paid users, a private repo with an OAuth token scoped to that one repository), the SkillAudit engine performs a deterministic four-phase scan:

Clone — shallow-clone the repo at the provided ref (default: default branch HEAD) into an ephemeral sandbox. The sandbox is destroyed as soon as the report is rendered; we never train on customer code and we never hold source after the scan.
Walk — enumerate every .js / .ts / .tsx / .py source plus the manifest files (package.json, pyproject.toml, .claude-plugin/plugin.json, MCP manifests). Each file is tagged with a surface (see below) before any check runs.
Six static checks — SSRF, command exec, credential exposure, permission drift, maintenance signals, documentation completeness. Each finding is a {file, line, severity, kind, surface, reason, match} tuple.
LLM prompt-injection probe — every server.tool(...) / @app.tool registration is bundled with the first ~60 lines of its handler body and red-teamed by Claude Haiku 4.5 in a single bounded API call. Findings roll up into the security axis. If ANTHROPIC_API_KEY is unset, the probe gracefully skips and the static grade is still produced — the skipped status is shown in every report's header.

The static portion is deterministic given the same commit. The probe is deterministic for the same handler-body input within a Claude version (we pin the model on every scan and cite it in the report).
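As a rough illustration of the finding tuple described above, the record could be modeled as follows. This is a sketch only: the engine itself is JavaScript, and the `Finding` class name, the field types, the kind name `"ssrf"`, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, asdict

# Hypothetical record type. Field names follow the tuple described above;
# the types and the example values below are assumptions.
@dataclass(frozen=True)
class Finding:
    file: str      # path relative to the repo root
    line: int      # line number of the match
    severity: str  # "high" or "warn"
    kind: str      # e.g. "ssrf" (assumed name) or "prompt_injection"
    surface: str   # "production", "installer", "examples", ...
    reason: str    # human-readable explanation
    match: str     # matched source snippet

example = Finding(
    file="src/tools/fetch_page.ts",
    line=42,
    severity="high",
    kind="ssrf",
    surface="production",
    reason="URL built from tool input reaches fetch()",
    match="fetch(`${baseUrl}/${input.path}`)",
)
print(sorted(asdict(example)))
```

Keeping the record immutable mirrors the idea that a finding is tagged once (including its surface) before any scoring happens.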

The six axes

Every report card scores six independent axes from 100 down to 0. The overall score is the lowest of the six axis scores, so a weak security result cannot be hidden behind a clean docs score.

Axis	What it catches	Where it looks
Security	SSRF; command exec via interpolated shell; LLM-assisted prompt-injection from untrusted-content tool calls	`fetch / axios / requests / urllib` calls, `exec / shell=True / os.system / subprocess` sinks, `server.tool` handler bodies
Permissions	Tools named `read_` or `get_` whose handlers contain a write/exec sink; manifests requesting org-wide scopes when single-repo would suffice	Manifest `permissions` blocks, tool registrations vs handler body
Credentials	Log/error sinks of `process.env`; hardcoded tokens by known prefix (`AKIA`, `ghp_`, `sk-`, `xoxb-`, …); secrets returned in tool responses	All source files; we redact before reporting
Maintenance	Days since last push; archived state; release cadence; open-issue ratio	GitHub API for the repo
Compatibility	Engine declarations (Node engines, `python_requires`); client-specific quirks for Claude Code / Cursor / Windsurf / Codex CLI	Manifests + per-client compat matrix
Docs	README install + usage section; LICENSE; SECURITY.md; manifest `repository` field; runnable example	Repo root, README byte size + section parsing

A finding can only contribute to one axis (its kind determines the axis via a fixed map in product-api/audit/report.js). Each axis starts at 100 and only deducts.
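The kind-to-axis map and the worst-axis rule can be sketched as follows. The real map lives in product-api/audit/report.js and is JavaScript; the Python names here (`KIND_TO_AXIS`, `overall_score`) and every kind name except `prompt_injection` are placeholders chosen for illustration.

```python
# Hypothetical kind -> axis map. Only "prompt_injection" is a kind name
# taken from this page; the other kind names are placeholders.
KIND_TO_AXIS = {
    "ssrf": "security",
    "command_exec": "security",
    "prompt_injection": "security",
    "permission_drift": "permissions",
    "credential_exposure": "credentials",
}

AXES = ("security", "permissions", "credentials",
        "maintenance", "compatibility", "docs")

def overall_score(axis_scores: dict) -> int:
    """The overall score is the lowest of the six axis scores."""
    return min(axis_scores[axis] for axis in AXES)

scores = {axis: 100 for axis in AXES}  # every axis starts at 100
scores[KIND_TO_AXIS["ssrf"]] -= 30     # one high-severity production finding
scores["docs"] = 95                    # an otherwise near-clean axis
print(overall_score(scores))           # 70
```

Because each kind maps to exactly one axis, a single finding can never be double-counted across axes.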

Surface tiers (the v0.3 calibration core)

v0.2 was binary: production source vs. test source. That was too coarse. examples/ directories full of sk-*** placeholders kept tanking otherwise-clean MCP server grades; chatty benchmarks/ directories dominated the Stripe agent-toolkit's score even though zero findings landed in the runtime tool surface. v0.3 fixes this by tagging every finding with a surface — the kind of code path it lives in — and weighting deductions per-tier.

production full weight

The runtime tool surface. Anything inside src/, top-level index.{js,ts,py}, or any path that the MCP server / Claude skill loads at request time. The default if no other tier matches.

installer half weight

Code under .claude-plugin/install*. Runs on user systems but is not the LLM-facing tool surface — a credential leak in an installer is real but lower-blast-radius than one in a runtime tool.

examples low weight

Any segment named examples / samples / demos / fixtures / cookbook / tutorials; filenames matching *.example.ts / *.sample.py; .env.{example,template,sample}. Demonstration code, not the production path.

benchmarks low weight

Any segment named benchmarks / bench / perf. Performance harnesses don't ship to users.

scripts low weight

Top-level scripts / bin / .github. Build and CI tooling. Nested src/foo/scripts/ stays production — that ships with the runtime.

test low weight

Test directories and matching filenames: tests/, __tests__/, spec/, e2e/, *.test.ts, *.spec.tsx, test_*.py, *_test.go.

Order of classification matters. The classifier checks tests first, then installer, then examples (any path segment), then benchmarks, then scripts (top-level only), then falls through to production. A test file inside examples/__tests__/ is classified as test, not examples; an installer script in .claude-plugin/install/scripts/foo.sh is classified as installer, not scripts. The full classifier is in product-api/audit/report.js:classifySurface — open-source and unit-tested.
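The ordering described above can be sketched in a few lines. This is not the code in product-api/audit/report.js (which is JavaScript); it is a minimal Python approximation. It assumes that "segment" means a directory segment and that the filename patterns generalize to any extension (`*.test.*`, `*.example.*`), neither of which the text confirms.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

TEST_DIRS = {"tests", "__tests__", "spec", "e2e"}
TEST_FILES = ("*.test.*", "*.spec.*", "test_*.py", "*_test.go")
EXAMPLE_DIRS = {"examples", "samples", "demos", "fixtures", "cookbook", "tutorials"}
EXAMPLE_FILES = ("*.example.*", "*.sample.*",
                 ".env.example", ".env.template", ".env.sample")
BENCH_DIRS = {"benchmarks", "bench", "perf"}
SCRIPT_DIRS = {"scripts", "bin", ".github"}

def matches(name, patterns):
    return any(fnmatchcase(name, pattern) for pattern in patterns)

def classify_surface(path: str) -> str:
    """Classify a non-empty repo-relative path; the first matching tier wins."""
    parts = PurePosixPath(path).parts
    dirs, name = set(parts[:-1]), parts[-1]
    if dirs & TEST_DIRS or matches(name, TEST_FILES):
        return "test"
    if len(parts) >= 2 and parts[0] == ".claude-plugin" and parts[1].startswith("install"):
        return "installer"
    if dirs & EXAMPLE_DIRS or matches(name, EXAMPLE_FILES):
        return "examples"
    if dirs & BENCH_DIRS:
        return "benchmarks"
    if len(parts) >= 2 and parts[0] in SCRIPT_DIRS:  # top-level only
        return "scripts"
    return "production"

print(classify_surface("examples/__tests__/foo.ts"))              # test
print(classify_surface(".claude-plugin/install/scripts/foo.sh"))  # installer
print(classify_surface("src/foo/scripts/run.ts"))                 # production
```

Putting the test check first is what makes `examples/__tests__/` resolve to test, and checking only the first path segment for scripts is what keeps nested `src/foo/scripts/` in production.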

Deduction table

Each finding deducts from the axis it belongs to, weighted by surface. High-severity findings are the hard fails — confirmed SSRF on a runtime URL, an env-var token logged on an error path, an execSync with template-string interpolation. Warn-severity findings are softer — possible SSRF where the URL might be allowlisted at runtime, a credential prefix that looks like a token but might be a fixture.

Surface	High severity	Warn severity	Why this weight
production	−30	−10	The runtime tool surface — what the LLM actually invokes. Full deduct.
installer	−15	−5	Runs on user systems but isn't LLM-callable. Half deduct keeps real risk visible without dominating the score.
examples	−5	0	Demonstration code. Real if it ships, but not what the agent loads at request time.
benchmarks	−5	0	Performance harnesses. Don't ship to users.
scripts	−5	0	Top-level build / CI. Same logic — not the runtime path.
test	−5	0	Test source. `shell=True` in a unit test isn't a vulnerability.

Why warn = 0 in low-weight tiers, not −1 or −2? A warn-severity finding in a benchmark is, by definition, not a runtime risk. We surface it in the report because authors should still know it's there, but giving it any deduct produces score-noise that punishes well-tested repos for having tests.
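To make the table concrete, here is a small lookup built from the deduction values above. The dictionary layout and the `deduction` helper are illustrative assumptions; only the numbers come from the table.

```python
# Deduction magnitudes from the table above, stored as positive numbers.
DEDUCTIONS = {
    "production": {"high": 30, "warn": 10},
    "installer":  {"high": 15, "warn": 5},
    "examples":   {"high": 5,  "warn": 0},
    "benchmarks": {"high": 5,  "warn": 0},
    "scripts":    {"high": 5,  "warn": 0},
    "test":       {"high": 5,  "warn": 0},
}

def deduction(surface: str, severity: str) -> int:
    """Points removed from an axis by one finding (hypothetical helper)."""
    return DEDUCTIONS[surface][severity]

# Four warn findings in tests cost nothing; the same four in production cost 40.
test_total = sum(deduction("test", "warn") for _ in range(4))
prod_total = sum(deduction("production", "warn") for _ in range(4))
print(test_total, prod_total)  # 0 40
```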

Caps — per (axis, surface)

Without a cap, a single chatty file could exhaust an axis to zero. v0.3 caps each axis-surface pair at 3 high + 5 warn deducting findings; remaining findings are still surfaced in the report but stop deducting score.

Caps are per (axis, surface), not shared across surfaces. This is the v0.3 cap-fix. v0.2 had a single per-axis cap, which meant a chatty tests/ directory could saturate the cap before any production findings landed, silencing real prod issues. The v0.3 calibration update found and corrected five repos that were unfairly held above their honest grade for exactly this reason (one of them, punkpeye/fastmcp, is covered in the worked examples below).

Cap dimension	Limit	What happens at the cap
(axis × surface) — high severity	3 deducting	Finding 4+ shown in report, no further deduct on that (axis, surface) pair
(axis × surface) — warn severity	5 deducting	Same — visible in report, deduct stops
Axis floor	0	Score never goes negative — once an axis hits 0 it stays at 0
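A per-(axis, surface) cap can be sketched as a counter keyed by surface and severity. The function below scores a single axis; the name `score_axis`, the input shape (a list of `(surface, severity)` pairs), and the choice to count findings in list order are assumptions, while the weights and cap limits come from the tables above.

```python
from collections import Counter

LOW = {"high": 5, "warn": 0}
DEDUCTIONS = {
    "production": {"high": 30, "warn": 10},
    "installer": {"high": 15, "warn": 5},
    **{tier: LOW for tier in ("examples", "benchmarks", "scripts", "test")},
}
CAPS = {"high": 3, "warn": 5}  # deducting findings per (axis, surface)

def score_axis(findings) -> int:
    """findings: iterable of (surface, severity) pairs for ONE axis."""
    counted = Counter()
    score = 100
    for surface, severity in findings:
        key = (surface, severity)
        if counted[key] >= CAPS[severity]:
            continue  # over the cap: still reported, no longer deducts
        counted[key] += 1
        score -= DEDUCTIONS[surface][severity]
    return max(score, 0)  # axis floor: never negative

# Ten test highs cannot crowd out two production highs: 100 - 15 - 60.
noisy_tests = [("test", "high")] * 10 + [("production", "high")] * 2
print(score_axis(noisy_tests))  # 25

# Four production highs plus five production warns bottom out at the floor.
saturated = [("production", "high")] * 4 + [("production", "warn")] * 5
print(score_axis(saturated))    # 0
```

Because the counter is keyed by surface as well as severity, the order in which test and production findings arrive does not change the score.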

Maintenance, docs & compatibility floors

Three axes have heuristic floors, which act as upper limits on the axis score, instead of (or alongside) finding-based deductions:

Maintenance — if daysSincePush > 365, the axis is capped at 40. If daysSincePush > 180, capped at 70. A repo last touched 18 months ago can't earn an A on maintenance even if every other signal is clean.
Docs — README under 3 KB caps the axis at 70. A README that's just a one-liner can't earn an A even if LICENSE and SECURITY.md are present.
Compatibility — if no engine is declared (no Node engines, no python_requires), the axis defaults to 70 — we can't verify cross-client behavior without a declared runtime.
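The three heuristics above can be expressed as upper limits applied after finding-based deductions. This is a sketch under assumptions: the function and parameter names are invented, 3 KB is taken as 3 × 1024 bytes, and the compatibility "default" of 70 is treated as an upper limit rather than a fixed value.

```python
def apply_heuristic_limits(scores, days_since_push, readme_bytes, engine_declared):
    """Return a copy of the axis scores with the heuristic limits applied."""
    limited = dict(scores)
    if days_since_push > 365:
        limited["maintenance"] = min(limited["maintenance"], 40)
    elif days_since_push > 180:
        limited["maintenance"] = min(limited["maintenance"], 70)
    if readme_bytes < 3 * 1024:  # assumption: 3 KB means 3072 bytes
        limited["docs"] = min(limited["docs"], 70)
    if not engine_declared:      # assumption: "defaults to 70" acts as a limit
        limited["compatibility"] = min(limited["compatibility"], 70)
    return limited

clean = {"maintenance": 100, "docs": 100, "compatibility": 100}
# A repo last pushed about 18 months ago (roughly 548 days).
stale = apply_heuristic_limits(clean, days_since_push=548,
                               readme_bytes=8000, engine_declared=True)
print(stale["maintenance"])  # 40
```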

Grade buckets

The overall score is the lowest of the six axis scores. The grade letter is bucketed from that score:

A ≥ 90. Zero high-severity production findings across any axis. Active maintenance, complete docs, declared engines. Safe to install with default trust.

B ≥ 80. One or two production warns at most, all other axes clean. Minor remediation closes the gap to A.

C ≥ 70. Real findings present but bounded — typically a small number of installer/example findings or a single production warn. Acceptable for individual install with awareness; team gates often set MIN_GRADE=C.

D ≥ 60. Multiple production warns or one high. Investigate before installing into a production agent.

F < 60. One or more production high-severity findings, or maintenance / docs floors fully consumed. Do not install without first reading the report and the underlying code.
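The numeric thresholds above reduce to a short bucketing function. The helper names are hypothetical, and how a team gate such as MIN_GRADE=C is actually evaluated is an assumption; only the thresholds come from the list above.

```python
def grade_letter(score: int) -> str:
    """Map an overall score (0 to 100) to a letter using the thresholds above."""
    for letter, minimum in (("A", 90), ("B", 80), ("C", 70), ("D", 60)):
        if score >= minimum:
            return letter
    return "F"

def passes_gate(score: int, min_grade: str = "C") -> bool:
    """Hypothetical team gate: pass if the grade is at least min_grade."""
    order = "FDCBA"  # worst to best
    return order.index(grade_letter(score)) >= order.index(min_grade)

for score in (85, 70, 60, 40):
    print(score, grade_letter(score), passes_gate(score))
```

The four scores in the loop are the ones that appear in the worked examples on this page; they land on B, C, D, and F respectively.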

The LLM prompt-injection probe

Static checks can find shaped patterns: an SSRF where the URL is interpolated, a command exec where the shell string is templated. They can't reason about untrusted-content flow, the case where a tool fetches a webpage, reads a Jira ticket, or scrapes a Slack message and returns the body verbatim in a tool response that becomes part of the next assistant turn. That's the prompt-injection vector that's eating the MCP ecosystem in 2026, and it's exactly what the LLM-assisted probe is designed to catch.

For each server.tool(...) / @app.tool / @server.tool registration we find, we extract the tool name, description, parameter schema, and the first ~60 lines of the handler body. We bundle every registration into one input and hand it to Claude Haiku 4.5 with a red-team system prompt that asks: which of these tools can return untrusted content into the assistant turn, and what would that allow a malicious caller to do? The model returns structured findings — JSON shaped like the static findings, with a kind: "prompt_injection" tag — that roll up under the security axis at the production tier (these are runtime-surface findings).

One API call per scan. All registrations bundled together. ~$0.02 per repo at typical sizes.
Bounded input. Hard cap at ~15 K input tokens; if a repo exceeds that we sample tools by registration order and note the truncation in the report.
Graceful skip. If ANTHROPIC_API_KEY is unset (development, sandboxed runs), the probe is skipped without failing the scan and a skipped notice is rendered in the report header. The static grade is still produced.
Pinned model. Every report's header records the exact Claude model ID used — so anyone reproducing the scan can check whether a different prompt-injection finding would emerge from a different model version.
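The bundling, truncation, and skip behavior described above might look roughly like this. It makes no API call and is not the engine's code: the function names, the registration dictionary shape, and the four-characters-per-token estimate are all assumptions made for illustration.

```python
import json
import os

MAX_HANDLER_LINES = 60     # "first ~60 lines" of each handler body
MAX_INPUT_TOKENS = 15_000  # "~15 K input tokens"

def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return len(text) // 4 + 1

def build_probe_input(registrations, budget=MAX_INPUT_TOKENS):
    """Bundle registrations in order until the token budget would be exceeded.

    Each registration is a dict with name, description, schema, handler_source.
    Returns (payload_json, truncated).
    """
    bundled, used = [], 0
    for reg in registrations:
        body = "\n".join(reg["handler_source"].splitlines()[:MAX_HANDLER_LINES])
        entry = {"name": reg["name"], "description": reg["description"],
                 "schema": reg["schema"], "handler": body}
        cost = estimate_tokens(json.dumps(entry))
        if used + cost > budget:
            return json.dumps(bundled), True  # truncation is noted in the report
        bundled.append(entry)
        used += cost
    return json.dumps(bundled), False

def probe_status(env=os.environ) -> str:
    """The probe runs only when ANTHROPIC_API_KEY is set."""
    return "enabled" if env.get("ANTHROPIC_API_KEY") else "skipped"

long_tool = {"name": "fetch_page", "description": "Fetch a URL", "schema": {},
             "handler_source": "\n".join(f"line {i}" for i in range(100))}
payload, truncated = build_probe_input([long_tool])
print(len(json.loads(payload)[0]["handler"].splitlines()), truncated)  # 60 False
```

Building one payload for all registrations is what keeps the probe to a single API call per scan.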

Worked examples — what v0.3 changed in real reports

Concrete grade moves from the v0.3 calibration update on 2026-04-30. The current public corpus is 101 vendor-official MCP servers, indie frameworks, and SDKs — distribution after v0.3: 19 A · 1 B · 38 C · 6 D · 37 F. v0.2 → v0.3 moved 22 grades total. Three representative cases:

stripe/agent-toolkit · F (40) → C (70) · +30

Every flagged finding lived in benchmarks/, in performance test harnesses with hardcoded fixture values. Zero findings landed in the runtime MCP tool surface under src/. v0.2 lumped benchmarks/ into "production" at full deduct; v0.3 puts them in the benchmarks tier at low weight (high −5, warn 0). The runtime tools themselves were always clean, and the grade now reflects that.

modelcontextprotocol/typescript-sdk · F (40) → B (85) · +45

The TypeScript SDK is the largest open-source MCP project in the corpus, with chatty tests/ and examples/ directories. v0.2's shared per-axis cap was being saturated by test findings, masking how clean the actual SDK runtime is. The v0.3 cap-fix (per-axis, per-surface caps), together with the low-weight test and examples tiers, produced the honest grade: the only B in the corpus.

punkpeye/fastmcp · D (60) → F (40) · −20 (calibration drop)

Five repos legitimately dropped a grade under v0.3 because v0.2's shared cap was hiding production-tier findings. punkpeye/fastmcp had real production-source SSRFs (templated URLs in runtime fetch calls) that v0.2 silenced because chatty test directories saturated the high-severity cap before production findings landed. v0.3 counts those production findings at full weight, so the F is the honest grade. We chose to ship the cap-fix knowing five repos would drop, because hiding production-tier findings to spare a grade is exactly the kind of dishonesty we accuse vendor-official MCP servers of when their examples/ is full of sk-*** placeholders. The audit board has to be honest about its own rubric.

A grades are stable across calibration updates. All 19 A grades from v0.2 are still A grades under v0.3 — by definition an A repo has zero high-severity production findings, so there's nothing to re-tier. If a future calibration ever moves an A, that's a detection change (a new check), not a scoring change.

Reproducibility

The static portion of every report is deterministic given the same commit, and the pinned model ID used by the probe is recorded in the report header. The engine source lives at product-api/audit/ in our repo (open-source under MIT; link below). To reproduce a grade locally:

git clone https://github.com/skillaudit/skillaudit
cd skillaudit/product-api/audit
node index.js https://github.com/owner/repo
# Optional: enable the prompt-injection probe
ANTHROPIC_API_KEY=sk-ant-... node index.js https://github.com/owner/repo

Score deductions, severity weights, surface-tier classification, grade buckets, and cap rules all live in product-api/audit/report.js. Review and challenge them. If you think we have miscategorized a finding, the hello@skillaudit.dev channel is open and we publish corrections.

Engine changelog

Version	Released	What changed
v0.3	2026-04-30	Surface tiering — every finding tagged with production / installer / examples / benchmarks / scripts / test. Per-(axis, surface) caps replace shared per-axis caps (cap-fix). 22 grades moved on the public corpus.
v0.2	2026-04-26	LLM-assisted prompt-injection probe added as a 7th check, rolling up under the security axis. Binary production-vs-test surface split.
v0.1	2026-04-23	First six static checks: SSRF, command exec, credentials, permissions, maintenance, docs. Static-only.

Run the rubric on your repo

Paste a GitHub URL. Get a graded report in 60 seconds. Free for public repos; first 100 audits go to waitlist signups in order.
