Methodology · v0.3

How SkillAudit grades a Claude skill or MCP server.

Six axes. Surface-tiered deductions. Per-(axis, surface) caps. One LLM-assisted prompt-injection probe per scan. The complete rubric every public report card is scored against — published in full so authors can reproduce a grade locally and so reviewers can audit our judgements before trusting them.

What the engine does

Given a public GitHub URL (or, for paid users, a private repo with an OAuth token scoped to that one repository), the SkillAudit engine performs a deterministic four-phase scan:

  1. Clone — shallow-clone the repo at the provided ref (default: default branch HEAD) into an ephemeral sandbox. The sandbox is destroyed as soon as the report is rendered; we never train on customer code and we never hold source after the scan.
  2. Walk — enumerate every .js / .ts / .tsx / .py source plus the manifest files (package.json, pyproject.toml, .claude-plugin/plugin.json, MCP manifests). Each file is tagged with a surface (see below) before any check runs.
  3. Six static checks — SSRF, command exec, credential exposure, permission drift, maintenance signals, documentation completeness. Each finding is a {file, line, severity, kind, surface, reason, match} tuple.
  4. LLM prompt-injection probe — every server.tool(...) / @app.tool registration is bundled with the first ~60 lines of its handler body and red-teamed by Claude Haiku 4.5 in a single bounded API call. Findings roll up into the security axis. If ANTHROPIC_API_KEY is unset, the probe gracefully skips and the static grade is still produced — the skipped status is shown in every report's header.

The static portion is deterministic given the same commit. The probe is deterministic for the same handler-body input within a Claude version (we pin the model on every scan and cite it in the report).
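
To make the finding tuple concrete, here is a minimal sketch of one static check: a hardcoded-token scan over a single file. The function names, the regex, the 8-character length bound, and the `hardcoded_token` kind are illustrative assumptions, not the engine's actual code. Only the tuple fields and the example prefixes come from this page.

```javascript
// Hypothetical sketch of a single static check. Only the tuple fields
// {file, line, severity, kind, surface, reason, match} and the example
// prefixes are taken from this page; everything else is assumed.
const TOKEN_PREFIXES = ["AKIA", "ghp_", "sk-", "xoxb-"];

// A known prefix at a word boundary followed by 8+ token-like characters.
// The length bound is an assumption chosen to skip very short strings.
const TOKEN_RE = new RegExp(
  "\\b(?:" + TOKEN_PREFIXES.join("|") + ")[A-Za-z0-9_-]{8,}"
);

// Keep the first 4 characters so the report can show the prefix; mask the rest.
function redact(match) {
  return match.slice(0, 4) + "*".repeat(match.length - 4);
}

// Scan one file's source text and return zero or more finding tuples.
// The surface is passed in because files are tagged before any check runs.
function checkHardcodedTokens(file, source, surface) {
  const findings = [];
  source.split("\n").forEach((text, index) => {
    const hit = TOKEN_RE.exec(text);
    if (hit) {
      findings.push({
        file,
        line: index + 1, // 1-based line number
        severity: "warn", // a prefix match might still be a fixture
        kind: "hardcoded_token",
        surface,
        reason: "String starts with a known credential prefix",
        match: redact(hit[0]),
      });
    }
  });
  return findings;
}
```

The check is a pure function of the file contents, which is the property that makes the static portion reproducible for a fixed commit.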

The six axes

Every report card scores six independent axes from 100 down to 0. The overall score is the minimum of the six axis scores, so a security problem cannot be hidden behind a clean docs score.

| Axis | What it catches | Where it looks |
| --- | --- | --- |
| Security | SSRF; command exec via interpolated shell; LLM-assisted prompt-injection from untrusted-content tool calls | fetch / axios / requests / urllib calls, exec / shell=True / os.system / subprocess sinks, server.tool handler bodies |
| Permissions | Tools named read_* or get_* whose handlers contain a write/exec sink; manifests requesting org-wide scopes when single-repo would suffice | Manifest permissions blocks, tool registrations vs handler body |
| Credentials | Log/error sinks of process.env; hardcoded tokens by known prefix (AKIA, ghp_, sk-, xoxb-, …); secrets returned in tool responses | All source files; we redact before reporting |
| Maintenance | Days since last push; archived state; release cadence; open-issue ratio | GitHub API for the repo |
| Compatibility | Engine declarations (Node engines, python_requires); client-specific quirks for Claude Code / Cursor / Windsurf / Codex CLI | Manifests + per-client compat matrix |
| Docs | README install + usage section; LICENSE; SECURITY.md; manifest repository field; runnable example | Repo root, README byte size + section parsing |

A finding can only contribute to one axis (its kind determines the axis via a fixed map in product-api/audit/report.js). Each axis starts at 100 and only deducts.
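
The one-kind-one-axis rule can be pictured as a lookup table. The sketch below is an assumption about its shape: the kind names are hypothetical, and the real map in product-api/audit/report.js may differ.

```javascript
// Hypothetical kind names; only the idea of a fixed kind-to-axis map
// comes from the text above.
const KIND_TO_AXIS = new Map([
  ["ssrf", "security"],
  ["command_exec", "security"],
  ["prompt_injection", "security"],
  ["permission_drift", "permissions"],
  ["hardcoded_token", "credentials"],
  ["env_leak", "credentials"],
]);

// Group findings by axis. Because each kind maps to exactly one axis,
// a finding can never be counted twice.
function groupByAxis(findings) {
  const byAxis = {};
  for (const finding of findings) {
    const axis = KIND_TO_AXIS.get(finding.kind);
    if (axis === undefined) {
      // Fail loudly instead of silently dropping an unmapped kind.
      throw new Error("Unmapped finding kind: " + finding.kind);
    }
    if (!byAxis[axis]) byAxis[axis] = [];
    byAxis[axis].push(finding);
  }
  return byAxis;
}
```

Throwing on an unmapped kind is a design choice for this sketch: a new check that forgets to register its kind fails at scan time instead of quietly scoring nothing.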

Surface tiers (the v0.3 calibration core)

v0.2 was binary: production source vs. test source. That was too coarse. examples/ directories full of sk-*** placeholders kept tanking otherwise-clean MCP server grades; chatty benchmarks/ directories dominated the Stripe agent-toolkit's score even though zero findings landed in the runtime tool surface. v0.3 fixes this by tagging every finding with a surface — the kind of code path it lives in — and weighting deductions per-tier.

production (full weight)
The runtime tool surface. Anything inside src/, top-level index.{js,ts,py}, or any path that the MCP server / Claude skill loads at request time. The default if no other tier matches.

installer (half weight)
Code under .claude-plugin/install*. Runs on user systems but is not the LLM-facing tool surface: a credential leak in an installer is real but lower-blast-radius than one in a runtime tool.

examples (low weight)
Any segment named examples / samples / demos / fixtures / cookbook / tutorials; filenames matching *.example.ts / *.sample.py; .env.{example,template,sample}. Demonstration code, not the production path.

benchmarks (low weight)
Any segment named benchmarks / bench / perf. Performance harnesses don't ship to users.

scripts (low weight)
Top-level scripts / bin / .github. Build and CI tooling. Nested src/foo/scripts/ stays production because it ships with the runtime.

test (low weight)
Test directories and matching filenames: tests/, __tests__/, spec/, e2e/, *.test.ts, *.spec.tsx, test_*.py, *_test.go.

Order of classification matters. The classifier checks tests first, then installer, then examples (any path segment), then benchmarks, then scripts (top-level only), then falls through to production. A test file inside examples/__tests__/ is classified as test, not examples; an installer script in .claude-plugin/install/scripts/foo.sh is classified as installer, not scripts. The full classifier is in product-api/audit/report.js:classifySurface — open-source and unit-tested.
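
The ordering rules above can be sketched as a first-match-wins function. This is an illustrative reimplementation, not a copy of classifySurface. It assumes forward-slash relative paths, treats "any segment" as any directory segment, and generalizes the listed filename patterns to any extension.

```javascript
// Illustrative sketch only; the real classifier is classifySurface in
// product-api/audit/report.js and may differ in detail.
const TEST_DIRS = new Set(["tests", "__tests__", "spec", "e2e"]);
const EXAMPLE_DIRS = new Set([
  "examples", "samples", "demos", "fixtures", "cookbook", "tutorials",
]);
const BENCH_DIRS = new Set(["benchmarks", "bench", "perf"]);
const SCRIPT_DIRS = new Set(["scripts", "bin", ".github"]);

function classifySurface(path) {
  const segments = path.split("/").filter(Boolean);
  const file = segments[segments.length - 1] || "";
  const dirs = segments.slice(0, -1);

  // 1. Tests win over everything else.
  const isTestFile =
    /\.(test|spec)\.[a-z]+$/.test(file) ||
    /^test_.*\.py$/.test(file) ||
    /_test\.go$/.test(file);
  if (isTestFile || dirs.some((d) => TEST_DIRS.has(d))) return "test";

  // 2. Installer: anything under .claude-plugin/install*.
  if (/(^|\/)\.claude-plugin\/install/.test(path)) return "installer";

  // 3. Examples: any directory segment, or an example-style filename.
  const isExampleFile =
    /\.(example|sample)\.[a-z]+$/.test(file) ||
    /^\.env\.(example|template|sample)$/.test(file);
  if (isExampleFile || dirs.some((d) => EXAMPLE_DIRS.has(d))) return "examples";

  // 4. Benchmarks: any directory segment.
  if (dirs.some((d) => BENCH_DIRS.has(d))) return "benchmarks";

  // 5. Scripts: top-level directory only.
  if (SCRIPT_DIRS.has(dirs[0])) return "scripts";

  // 6. Everything else is the runtime surface.
  return "production";
}
```

Under these assumptions, `classifySurface("examples/__tests__/a.test.ts")` returns `"test"` and `classifySurface("src/foo/scripts/run.js")` returns `"production"`, matching the two edge cases called out above.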

Deduction table

Each finding deducts from the axis it belongs to, weighted by surface. High-severity findings are the hard fails — confirmed SSRF on a runtime URL, an env-var token logged on an error path, an execSync with template-string interpolation. Warn-severity findings are softer — possible SSRF where the URL might be allowlisted at runtime, a credential prefix that looks like a token but might be a fixture.

| Surface | High severity | Warn severity | Why this weight |
| --- | --- | --- | --- |
| production | −30 | −10 | The runtime tool surface: what the LLM actually invokes. Full deduct. |
| installer | −15 | −5 | Runs on user systems but isn't LLM-callable. Half deduct keeps real risk visible without dominating the score. |
| examples | −5 | 0 | Demonstration code. Real if it ships, but not what the agent loads at request time. |
| benchmarks | −5 | 0 | Performance harnesses. Don't ship to users. |
| scripts | −5 | 0 | Top-level build / CI. Same logic: not the runtime path. |
| test | −5 | 0 | Test source. shell=True in a unit test isn't a vulnerability. |

Why warn = 0 in low-weight tiers, not −1 or −2? A warn-severity finding in a benchmark is, by definition, not a runtime risk. We surface it in the report because authors should still know it's there, but giving it any deduct produces score-noise that punishes well-tested repos for having tests.
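
As a rough illustration, the deduction table can be read as a two-level lookup. The object and function names below are assumptions. The numbers are the ones in the table, and caps are deliberately left out of this sketch.

```javascript
// Weights copied from the deduction table; the names are hypothetical.
const DEDUCTIONS = {
  production: { high: 30, warn: 10 },
  installer: { high: 15, warn: 5 },
  examples: { high: 5, warn: 0 },
  benchmarks: { high: 5, warn: 0 },
  scripts: { high: 5, warn: 0 },
  test: { high: 5, warn: 0 },
};

// Points a single finding removes from its axis, before any cap applies.
function deductionFor(finding) {
  // Production is the default tier when the surface is unknown.
  const tier = DEDUCTIONS[finding.surface] || DEDUCTIONS.production;
  return tier[finding.severity] || 0;
}

// Worked arithmetic for one axis: one production warn, one installer high,
// and two test warns deduct 10 + 15 + 0 + 0 = 25 points.
const sample = [
  { surface: "production", severity: "warn" },
  { surface: "installer", severity: "high" },
  { surface: "test", severity: "warn" },
  { surface: "test", severity: "warn" },
];
const totalDeduction = sample.reduce((sum, f) => sum + deductionFor(f), 0);
const uncappedScore = 100 - totalDeduction; // 75
```

The two test warns appear in the report but leave the score untouched, which is the behavior the warn = 0 rule is meant to guarantee.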

Caps — per (axis, surface)

Without a cap, a single chatty file could exhaust an axis to zero. v0.3 caps each axis-surface pair at 3 high + 5 warn deducting findings; remaining findings are still surfaced in the report but stop deducting score.

Caps are per (axis, surface), not shared across surfaces; this is the v0.3 cap-fix. v0.2 had a single per-axis cap, which meant a chatty tests/ directory could saturate the cap before any production findings landed, silencing real prod issues. The v0.3 calibration update found and corrected five repos that were unfairly held above their honest grade for exactly this reason (one of them, punkpeye/fastmcp, is walked through in the worked examples below).

| Cap dimension | Limit | What happens at the cap |
| --- | --- | --- |
| (axis × surface), high severity | 3 deducting | Finding 4+ shown in report, no further deduct on that (axis, surface) pair |
| (axis × surface), warn severity | 5 deducting | Same: visible in report, deduct stops |
| Axis floor | 0 | Score never goes negative; once an axis hits 0 it stays at 0 |
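
A minimal sketch of how the per-(axis, surface) caps might be applied when scoring a single axis. It assumes findings have already been mapped to that axis. The function names and counter layout are illustrative; the limits and weights are the ones published on this page.

```javascript
// Cap limits from the table above: 3 high + 5 warn per (axis, surface) pair.
const CAPS = { high: 3, warn: 5 };

// Compact form of the published deduction weights.
function weight(surface, severity) {
  if (surface === "production") return severity === "high" ? 30 : 10;
  if (surface === "installer") return severity === "high" ? 15 : 5;
  return severity === "high" ? 5 : 0; // examples, benchmarks, scripts, test
}

// Score one axis. `findings` holds only the findings mapped to this axis.
function scoreAxis(findings) {
  const counted = new Map(); // "surface:severity" -> deducting findings so far
  let score = 100;
  for (const { surface, severity } of findings) {
    const key = surface + ":" + severity;
    const seen = counted.get(key) || 0;
    if (seen >= CAPS[severity]) continue; // still reported, no longer deducts
    counted.set(key, seen + 1);
    score -= weight(surface, severity);
  }
  return Math.max(0, score); // axis floor
}

// Ten test highs followed by one production high:
// the test pair is capped at 3 x 5 = 15, the production high still costs 30.
const noisy = Array.from({ length: 10 }, () => ({ surface: "test", severity: "high" }));
noisy.push({ surface: "production", severity: "high" });
const capped = scoreAxis(noisy); // 100 - 15 - 30 = 55
```

Because the counter key includes the surface, a noisy tests/ directory can exhaust only its own allowance. The production finding at the end of the list still deducts in full.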

Maintenance & docs floors

Two axes have heuristic floors instead of (or alongside) finding-based deductions:

Grade buckets

The overall score is the lowest of the six axis scores. The grade letter is bucketed from that score:

A ≥ 90. Zero high-severity production findings across any axis. Active maintenance, complete docs, declared engines. Safe to install with default trust.
B ≥ 80. One or two production warns at most, all other axes clean. Minor remediation closes the gap to A.
C ≥ 70. Real findings present but bounded — typically a small number of installer/example findings or a single production warn. Acceptable for individual install with awareness; team gates often set MIN_GRADE=C.
D ≥ 60. Multiple production warns or one high. Investigate before installing into a production agent.
F < 60. One or more production high-severity findings, or maintenance / docs floors fully consumed. Do not install without first reading the report and the underlying code.
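
The bucketing can be sketched in a few lines. The function names are assumptions, and the MIN_GRADE comparison is only a guess at how a team gate might evaluate a letter. The thresholds are the ones listed above.

```javascript
// Overall score is the lowest axis score; the letter is bucketed from it.
function gradeFor(axisScores) {
  const overall = Math.min(...Object.values(axisScores));
  let letter = "F";
  if (overall >= 90) letter = "A";
  else if (overall >= 80) letter = "B";
  else if (overall >= 70) letter = "C";
  else if (overall >= 60) letter = "D";
  return { overall, letter };
}

// Hypothetical gate check: does a letter meet a MIN_GRADE threshold?
const GRADE_ORDER = "ABCDF";
function meetsMinGrade(letter, minGrade) {
  return GRADE_ORDER.indexOf(letter) <= GRADE_ORDER.indexOf(minGrade);
}

const report = gradeFor({
  security: 85,
  permissions: 100,
  credentials: 100,
  maintenance: 95,
  compatibility: 100,
  docs: 90,
});
// report.overall is 85 and report.letter is "B": one weak axis sets the grade.
```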

The LLM prompt-injection probe

Static checks find shaped patterns: an SSRF where the URL is interpolated, a command exec where the shell string is templated. They can't reason about untrusted-content flow, the case where a tool fetches a webpage, reads a Jira ticket, or scrapes a Slack message and returns the body verbatim into a tool response that becomes part of the next assistant turn. That's the prompt-injection vector that's eating the MCP ecosystem in 2026, and it's exactly what the LLM-assisted probe is designed to catch.

For each server.tool(...) / @app.tool / @server.tool registration we find, we extract the tool name, description, parameter schema, and the first ~60 lines of the handler body. We bundle every registration into one input and hand it to Claude Haiku 4.5 with a red-team system prompt that asks: which of these tools can return untrusted content into the assistant turn, and what would that allow a malicious caller to do? The model returns structured findings — JSON shaped like the static findings, with a kind: "prompt_injection" tag — that roll up under the security axis at the production tier (these are runtime-surface findings).
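
The bundling and result handling described above might look roughly like the sketch below. The registration object shape, the text layout, and both function names are assumptions. No model API call is shown, and the actual system prompt is not reproduced here.

```javascript
// Hypothetical sketch: bundle every tool registration into one probe input,
// then normalize whatever the model returns.
const HANDLER_LINE_LIMIT = 60; // "first ~60 lines" of each handler body

// registrations: [{ name, description, schema, handlerSource }] (assumed shape)
function buildProbeInput(registrations) {
  return registrations
    .map((reg, i) => {
      const body = reg.handlerSource
        .split("\n")
        .slice(0, HANDLER_LINE_LIMIT)
        .join("\n");
      return [
        "### Tool " + (i + 1) + ": " + reg.name,
        "Description: " + reg.description,
        "Parameters: " + JSON.stringify(reg.schema),
        "Handler (truncated):",
        body,
      ].join("\n");
    })
    .join("\n\n");
}

// Accept only well-formed prompt_injection findings from the model's JSON
// reply and tag them as production-tier, since they concern the runtime surface.
function parseProbeFindings(modelText) {
  let parsed;
  try {
    parsed = JSON.parse(modelText);
  } catch (err) {
    return []; // malformed output produces no findings rather than a crash
  }
  if (!Array.isArray(parsed)) return [];
  return parsed
    .filter((f) => f && f.kind === "prompt_injection" && typeof f.reason === "string")
    .map((f) => Object.assign({}, f, { surface: "production" }));
}
```

Truncating each handler keeps the single call bounded regardless of repo size. Validating the reply before scoring keeps a malformed model response from affecting the grade.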

Worked examples — what v0.3 changed in real reports

Concrete grade moves from the v0.3 calibration update on 2026-04-30. The current public corpus is 101 vendor-official MCP servers, indie frameworks, and SDKs — distribution after v0.3: 19 A · 1 B · 38 C · 6 D · 37 F. v0.2 → v0.3 moved 22 grades total. Three representative cases:

stripe/agent-toolkit · F (40) → C (70) · +30

Every flagged finding lived in benchmarks/: performance test harnesses with hardcoded fixture values. Zero findings landed in the runtime MCP tool surface under src/. v0.2 lumped benchmarks/ into "production" at full deduct; v0.3 puts them in the benchmarks tier at low weight (high −5, warn 0). The runtime tools themselves were always clean, and the grade now reflects that.

modelcontextprotocol/typescript-sdk · F (40) → B (85) · +45

The TypeScript SDK is the largest open-source MCP codebase in the corpus, with chatty tests/ and examples/ directories. v0.2's shared per-axis cap was being saturated by test findings, masking how clean the actual SDK runtime is. The v0.3 cap-fix (per-(axis, surface) caps), combined with the low-weight test and examples tiers, produced the honest grade: the only B in the corpus.

punkpeye/fastmcp · D (60) → F (40) · −20 (calibration drop)

Five repos legitimately dropped a grade under v0.3 because v0.2's shared cap was hiding production-tier findings. punkpeye/fastmcp had real production-source SSRFs (templated URLs in runtime fetch calls) that v0.2 silenced because chatty test directories saturated the high-severity cap before production findings landed. v0.3 counts those production findings at full weight, so the F is the honest grade. We chose to ship the cap-fix knowing five repos would drop, because hiding production-tier findings to spare a grade is exactly the kind of dishonesty we accuse vendor-official MCP servers of when their examples/ is full of sk-*** placeholders. The audit board has to be honest about its own rubric.

A grades are stable across calibration updates. All 19 A grades from v0.2 are still A grades under v0.3 — by definition an A repo has zero high-severity production findings, so there's nothing to re-tier. If a future calibration ever moves an A, that's a detection change (a new check), not a scoring change.

Reproducibility

Every report is deterministic given the same commit. The engine source lives at product-api/audit/ in our repo (open-source under MIT — link below). To reproduce a grade locally:

```shell
git clone https://github.com/skillaudit/skillaudit
cd skillaudit/product-api/audit
node index.js https://github.com/owner/repo
# Optional: enable the prompt-injection probe
ANTHROPIC_API_KEY=sk-ant-... node index.js https://github.com/owner/repo
```

Score deductions, severity weights, surface-tier classification, grade buckets, and cap rules all live in product-api/audit/report.js; review and challenge them. If you think we have miscategorized a finding, write to hello@skillaudit.dev. We publish corrections.

Engine changelog

| Version | Released | What changed |
| --- | --- | --- |
| v0.3 | 2026-04-30 | Surface tiering: every finding tagged with production / installer / examples / benchmarks / scripts / test. Per-(axis, surface) caps replace shared per-axis caps (cap-fix). 22 grades moved on the public corpus. |
| v0.2 | 2026-04-26 | LLM-assisted prompt-injection probe added as a 7th check, rolling up under the security axis. Binary production-vs-test surface split. |
| v0.1 | 2026-04-23 | First six axes: SSRF, command exec, credentials, permissions, maintenance, docs. Static-only. |

Run the rubric on your repo

Paste a GitHub URL. Get a graded report in 60 seconds. Free for public repos; first 100 audits go to waitlist signups in order.
