Engineering · 2026-04-30

Engine v0.3 calibration delta — 22 grades moved when surface tiering shipped

The SkillAudit engine update we'd been promising on the vendor-official F-grades post for two sessions is now live. Engine v0.3 introduces surface tiering: a finding in benchmarks/ or examples/ or top-level scripts/ no longer deducts at the same weight as a finding in the actual MCP runtime. Across the 101-repo audit corpus, exactly 22 grades moved when we re-ran v0.3 over the existing reports — 9 repos climbed a letter band, 8 lifted within their band, 5 dropped. The two D→F drops are honest cap-fix corrections; the v0.2 shared-cap bug had been silencing production-source SSRFs that v0.3 now counts at full weight. Below: every move, every reason, with the audit page linked for each.

Why surface tiering

Two posts back, in "29 vendor-official MCP servers earned an F", we voluntarily flagged a calibration question that affected several of the entries on the F list. The engine treated a high-severity SSRF in src/server/handler.ts the same way it treated one in benchmarks/galtee-basic/environment/client/return.js. Both got the same −30 deduction, and the per-axis cap of "no more than 3 high deductions counted" applied across both contexts. For Stripe specifically, every flagged finding sat in benchmarks/, code that an LLM tool handler will never reach but that the engine had no way to discount. Stripe ended at F. Honest report, dishonest framing: the F came from the score bottoming out on test fixtures, not from the MCP runtime.

The vendor-official F-grades post named this directly and committed to a surface-tier engine update. The install-gate playbook referenced the calibration delta in its 30-day re-scan rule. The install shortlist reasoned about which corner of the rubric the calibration would touch and concluded the 19 A grades would not move (correctly — surface tiering is asymmetric, it can only down-weight findings). Three posts of public commitment to the same rubric change. v0.3 is that change shipped.

The mechanic is straightforward. Every finding from every check (SSRF, command-exec, credentials, permissions, env-files, meta-checks) now carries a surface tag derived from its file path. The tiers:

test: tests/, __tests__/, and *.test.{js,ts,py}. These stay test surface (low weight, 0 warn deduction).
installer (new): .claude-plugin/install*. Half weight, the right deduction for code that runs on user systems but isn't the LLM-facing tool surface.
examples (new): any examples/, samples/, demos/, fixtures/, cookbook/, or tutorials/ path segment, plus the .examples.ts and .samples.ts filename patterns, plus .env.{example,template,sample}.
benchmarks (new): benchmarks/, bench/, or perf/.
scripts (new): top-level scripts/, bin/, or .github/ only. A src/foo/scripts/bar.py file stays production: it's a runtime script that ships with the tool, not a build-time helper.
production: the catch-all.

All four low-weight tiers deduct identically (−5 high, 0 warn); the differentiation in the report is for visibility, not arithmetic. The full deduction matrix and worked examples sit on the methodology page deductions section.
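As a rough illustration of the path rules just described, here is a minimal sketch of a path-based classifier. It is not the engine's classifySurface(): the function name classifySurfaceSketch, the order in which the rules are tried, and the choice to match benchmark directories at any depth are all assumptions made for the example.

```javascript
// Hypothetical sketch of a path-based surface classifier.
// Tier names and path rules come from the post; rule precedence is an assumption.
const EXAMPLE_DIRS = new Set(["examples", "samples", "demos", "fixtures", "cookbook", "tutorials"]);
const BENCHMARK_DIRS = new Set(["benchmarks", "bench", "perf"]);
const TOP_LEVEL_SCRIPT_DIRS = new Set(["scripts", "bin", ".github"]);

function classifySurfaceSketch(filePath) {
  const segments = filePath.split("/").filter(Boolean);
  const fileName = segments[segments.length - 1] || "";
  const dirs = segments.slice(0, -1);

  // Test surface: tests/ or __tests__/ directories, or *.test.{js,ts,py} files.
  if (dirs.includes("tests") || dirs.includes("__tests__") || /\.test\.(js|ts|py)$/.test(fileName)) {
    return "test";
  }
  // Installer surface: a file directly under .claude-plugin/ whose name starts with "install".
  if (dirs[dirs.length - 1] === ".claude-plugin" && fileName.startsWith("install")) {
    return "installer";
  }
  // Examples surface: example-like directory anywhere in the path, or example-like filenames.
  if (
    dirs.some((d) => EXAMPLE_DIRS.has(d)) ||
    /\.(examples|samples)\.ts$/.test(fileName) ||
    /^\.env\.(example|template|sample)$/.test(fileName)
  ) {
    return "examples";
  }
  // Benchmarks surface: benchmark-like directory anywhere in the path (assumed).
  if (dirs.some((d) => BENCHMARK_DIRS.has(d))) return "benchmarks";
  // Scripts surface: only when scripts/, bin/, or .github/ is the FIRST path segment.
  if (TOP_LEVEL_SCRIPT_DIRS.has(dirs[0])) return "scripts";
  // Everything else is production.
  return "production";
}

console.log(classifySurfaceSketch("scripts/fetch-spec-types.ts")); // "scripts"
console.log(classifySurfaceSketch("src/foo/scripts/bar.py")); // "production"
```

The two calls at the end show the top-level-only rule for scripts: the same directory name lands in different tiers depending on whether it is the first path segment.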

The companion change is the cap rule. v0.2 capped at 3 high deductions per axis, shared across all surfaces. If a repo had 5 high findings in tests and 1 in production, the test findings had already used up the cap by the time scoring reached the production finding, and the cap silenced it. v0.3 caps per-(axis, surface), so the production finding lands at full weight regardless of how many test findings preceded it. This is the cap-fix that produces the 5 drops below. None of the drops are surprising: the audit pages already showed those production-source findings; v0.2 just wasn't counting them.
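The difference between the two cap rules is easiest to see on the 5-tests-plus-1-production case. The sketch below is illustrative only: it handles high-severity findings on a single score, the −30 and −5 weights are the ones quoted in this post, the −15 installer weight is an assumption derived from "half weight", and totalHighDeduction is a made-up name rather than the engine's scoreFindings().

```javascript
// Hypothetical high-severity deduction weights per surface.
// 30 and 5 are quoted in the post; 15 for installer is assumed from "half weight".
const HIGH_DEDUCTION = { production: 30, installer: 15, test: 5, examples: 5, benchmarks: 5, scripts: 5 };
const HIGH_CAP = 3; // at most 3 high deductions counted per cap bucket

// capScope "axis"         -> one bucket per axis (the v0.2 behaviour described above)
// capScope "axis+surface" -> one bucket per (axis, surface) pair (the v0.3 behaviour)
function totalHighDeduction(findings, capScope) {
  const countedPerBucket = new Map();
  let total = 0;
  for (const finding of findings) {
    const bucket = capScope === "axis" ? finding.axis : `${finding.axis}:${finding.surface}`;
    const alreadyCounted = countedPerBucket.get(bucket) || 0;
    if (alreadyCounted >= HIGH_CAP) continue; // cap reached: this finding is silenced
    countedPerBucket.set(bucket, alreadyCounted + 1);
    total += HIGH_DEDUCTION[finding.surface];
  }
  return total;
}

// Five test-tier highs are scored first, then one production high.
const findings = [
  ...Array.from({ length: 5 }, () => ({ axis: "security", surface: "test" })),
  { axis: "security", surface: "production" },
];

const sharedCapDeduction = totalHighDeduction(findings, "axis");
const perSurfaceDeduction = totalHighDeduction(findings, "axis+surface");
console.log(sharedCapDeduction); // 15: three test highs counted, production silenced
console.log(perSurfaceDeduction); // 45: three test highs plus the production high
```

Under the shared cap the result depends on the order in which findings are scored; under the per-(axis, surface) cap it does not, because each surface has its own budget.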

The new distribution at a glance

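| Grade | Repos (v0.3) | Change from v0.2 |
| --- | --- | --- |
| A | 19 | unchanged |
| B | 1 | +1 (was 0) |
| C | 38 | +8 (was 30) |
| D | 6 | −4 (was 10) |
| F | 37 | −5 (was 42) |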

The corpus shifts toward the middle. The two largest moves — F shrinking by 5, C growing by 8 — are mostly the same repos: F→C promotions, with two D→C promotions filling the rest of C's gain and two D→F drops backfilling F. The first B grade in the corpus is Anthropic's own MCP TypeScript SDK at score 80 — a strong B, with v0.3 specifically separating the genuine warnings on the runtime fetch(url) call sites from the chatty examples directory the SDK ships for documentation.
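The bookkeeping in that paragraph can be checked mechanically. A minimal sketch, using only counts stated in this post (the split of the F promotions into 6 F→C and 1 F→B is read off the promotions table); the variable names are illustrative.

```javascript
// v0.2 grade distribution, from the "was N" figures in the table above.
const v02Distribution = { A: 19, B: 0, C: 30, D: 10, F: 42 };

// Letter-band moves reported in this post. Within-band moves do not change the distribution.
const letterMoves = [
  { from: "F", to: "C", count: 6 },
  { from: "F", to: "B", count: 1 },
  { from: "D", to: "C", count: 2 },
  { from: "D", to: "F", count: 2 },
];

// Apply each move to a copy of the v0.2 distribution.
const v03Distribution = { ...v02Distribution };
for (const { from, to, count } of letterMoves) {
  v03Distribution[from] -= count;
  v03Distribution[to] += count;
}

const corpusSize = Object.values(v03Distribution).reduce((sum, n) => sum + n, 0);
console.log(v03Distribution); // { A: 19, B: 1, C: 38, D: 6, F: 37 }
console.log(corpusSize); // 101
```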

Importantly: A grades are unmoved. We predicted this on the install shortlist post for a reason that holds in the engine itself — surface tiering down-weights findings only; it never up-weights. A repo earned an A because it had no high-severity findings in any source-tree context (production, installer, or otherwise). v0.3 cannot move it down without finding a new issue, and v0.3 is a scoring change, not a detection change. The shortlist's claim that the 19 A grades are the most stable corner of the rubric is the same claim the engine math makes deterministically.

The 9 letter-band promotions

These are the repos that moved up a full grade letter under v0.3. The pattern across all 9: the bulk of v0.2 deductions were landing on examples / benchmarks / scripts / installer code that surface tiering now counts at low or half weight, while the production source either had no findings or only warn-level findings.

| Repo | v0.2 | v0.3 | Δ | What v0.3 saw |
| --- | --- | --- | --- | --- |
| stripe/agent-toolkit | F (0) | C (70) | +70 | Every flagged finding in benchmarks/ (test fixtures for evaluation harnesses, never reached by the MCP tool layer). Production source clean. |
| lastmile-ai/mcp-agent | F (0) | C (70) | +70 | Findings concentrated in examples/ documentation servers. Runtime agent code clean. |
| modelcontextprotocol/typescript-sdk | F (15) | B (80) | +65 | The strongest B in the corpus. Two warn-level fetch(url) call sites in examples/server/src/serverGuide.examples.ts at low weight; two HIGH fetch(url) in scripts/fetch-spec-types.ts (build-time script for type generation, top-level scripts surface, low weight). Production SDK code clean of high-severity findings. |
| github/github-mcp-server | F (10) | C (70) | +60 | HIGH findings in cmd/ CLI tool entry points and scripts/ (top-level scripts surface). Production server code in pkg/github/ has only warn-level signals. |
| modelcontextprotocol/go-sdk | F (10) | C (70) | +60 | Most flagged code lives in examples/server/auth-middleware/ and similar example servers. The runtime SDK is clean. |
| pydantic/pydantic-ai | F (10) | C (70) | +60 | Heavy examples/ tree (the framework ships dozens of demo agents). Production framework code at low warning levels only. |
| grafana/mcp-grafana | F (40) | C (70) | +30 | The case the install-gate post specifically flagged: HIGH finding in .claude-plugin/install-binary.mjs (a runtime-derived URL fetch in the installer). Half-weight under v0.3 because it's installer-tier: the installer runs on user systems but is not the LLM-facing tool surface. |
| modelcontextprotocol/python-sdk | D (60) | C (70) | +10 | One additional warn-level finding moved from production-source-counted to examples-counted. The runtime SDK has no high-severity findings. |
| modelcontextprotocol/quickstart-resources | D (60) | C (70) | +10 | By construction this is an examples-and-quickstart repo: almost everything is in weather/, memory/, etc. example servers. v0.3 reflects that. |

The Stripe and Anthropic-TypeScript-SDK movements are the visible outcomes. Two posts ago we said Stripe was an honest F because the engine couldn't tell benchmarks/ from src/. v0.3 can. Stripe's audit page now shows the findings under a Benchmarks (low-weight) heading with the deduction (−5/0) named in the section. The grade reflects what the engine actually has evidence for: a production-source-clean MCP that ships an evaluation harness with throwaway fixtures.

The MCP TypeScript SDK is the calibration headliner. The live audit page now shows three distinct sections: examples-tier findings (the fetch(url) patterns in examples/server/ documentation servers), build-script-tier findings (scripts/fetch-spec-types.ts, the type generator), and test-tier findings (warn-level patterns in packages/client/test/). The runtime SDK source has no high-severity findings on any axis. v0.3 was specifically calibrated against this kind of repo: an SDK whose production code is careful and whose documentation code is necessarily exploratory. The B reflects that the SDK is recommendable as an install while honestly flagging that the example servers it ships are not the place to copy-paste fetch idioms.

The 8 within-band lifts

These repos got a score lift but didn't cross a letter band. Seven are within F (still failing the gate, but with a smaller production-source signal than v0.2 implied); one is within A (Vectara reaches the corpus's second perfect 100).

| Repo | v0.2 | v0.3 | Δ | Why no letter change |
| --- | --- | --- | --- | --- |
| mongodb-js/mongodb-mcp-server | F (5) | F (45) | +40 | Many low-tier findings stripped, but real production-source SSRFs in src/common/atlas/apiClient.ts still hold the security axis below 60. Honest F. |
| apify/actors-mcp-server | F (10) | F (35) | +25 | Production-source command-exec patterns persist; examples-tier findings down-weighted but not the runtime ones. |
| getsentry/sentry-mcp | F (0) | F (20) | +20 | Production OAuth helpers in packages/mcp-cloudflare/src/server/oauth/helpers.ts still flagged at HIGH; some examples-tier credential findings now low-weight. |
| heroku/heroku-mcp-server | F (0) | F (10) | +10 | Several findings move to examples-tier; the production-source src/services/ SSRF cluster still drives an F-band score. |
| honeycombio/honeycomb-mcp | F (30) | F (40) | +10 | Half the findings move to examples-tier; the runtime-tool surface still has enough HIGH findings to keep security below 60. |
| posthog/mcp | F (0) | F (10) | +10 | typescript/src/api/client.ts production findings hold; small examples-tier credit shifts. |
| upstash/context7-mcp | F (0) | F (10) | +10 | Same pattern: production findings unchanged in scoring impact, low-tier findings reweighted. |
| vectara/vectara-mcp | A (90) | A (100) | +10 | The corpus's second perfect 100. Vectara had a single warn-level finding in a documentation example that v0.3 reweights to zero deduction. |

The seven F→F repos all share the same shape: surface tiering shifts some deduction weight off the books, but the underlying production-source SSRFs / command-exec / credential echoes are real and remain at full weight. A team running the install gate at MIN_GRADE=C still blocks every one of these repos, just as v0.2 did. With its 100, Vectara joins langchain-ai/langchain-mcp-adapters; they are the only two perfect scores in the corpus.

The 5 drops — honest cap-fix corrections

The interesting half of the calibration. v0.2 had an order-dependent bug: the per-axis high-finding cap (3 high deductions counted before further highs were silenced) was shared across surfaces. A repo with 5 chatty test-directory HIGH findings hit the cap on the test findings and the production finding that came next was capped to zero deduction. Five repos in the corpus had this pattern. v0.3 caps per-(axis, surface), so test-tier highs no longer eat the production-tier deduction budget. The result: production findings that the audit pages already showed are now actually counted toward the score. None of these repos got worse code; they got an honest score for the code they always had.

| Repo | v0.2 | v0.3 | Δ | What was being silenced |
| --- | --- | --- | --- | --- |
| glips/figma-context-mcp | D (60) | F (20) | −40 | Test-directory fetch(url) calls were saturating the security-axis cap before production-source SSRFs could land. v0.3 counts the production findings; the score reflects them. |
| punkpeye/fastmcp | D (60) | F (35) | −25 | Real production-source fetch(url) in src/DiscoveryDocumentCache.ts was previously masked by HIGH template-string fetches in src/FastMCP.https.test.ts + src/FastMCP.oauth-proxy.test.ts hitting the cap first. v0.3 caps test findings separately from production. |
| awslabs/mcp | F (10) | F (0) | −10 | Within-F. Same shared-cap pattern: a chatty samples directory was offsetting some production credit. Now zeroed out. |
| jlowin/fastmcp | F (35) | F (25) | −10 | Within-F. Production-source findings were partially offset by examples credit in v0.2; v0.3 produces a cleaner accounting and the score reflects the production load alone. |
| sooperset/mcp-atlassian | F (10) | F (0) | −10 | Within-F. Examples-tier offsets removed; the underlying production-source HIGH findings drive the score. |

The two letter drops are the calibration's hardest credibility test. v0.3 made two repos look worse on the public board after we shipped a change voluntarily framed as a fix to over-deduction. Both drops are correct. The audit page for punkpeye/fastmcp shows the production-source fetch(url) in src/DiscoveryDocumentCache.ts:101 with a clear file path and line number — that finding existed in the v0.2 report too; v0.2 just wasn't counting it because three test-file HIGHs preceded it and ate the cap. v0.3 is the rubric we should have shipped from day one.

Shipping the drops alongside the promotions was a deliberate choice. The audit board has to be honest about its own scoring; hiding production-tier findings to spare a grade-drop on the live page is the same dishonest behavior we accuse vendor-official MCPs of when their examples/ directory is full of sk-*** placeholders. If we can't take a hit on our own engine fix, we have no standing to tell the corpus to take a hit on theirs.

How v0.3 was rolled out

The recalibration didn't re-clone any repos. v0.3 ships a recalibrate.js tool that reads the existing v0.2 Markdown reports under product-api/audit/reports/, parses each finding's file path / severity / kind, classifies the surface tier from the path, reverse-engineers the meta signals (declaredRuntime, readmeBytes, daysSincePush) from cap patterns and the report header, and re-runs scoreFindings() + renderMarkdown() with the new tiering applied. Output is a delta table plus the regenerated reports — structurally identical to a fresh clone, but without the bandwidth or the GitHub API rate limit.
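To make the shape of that delta table concrete, here is a minimal sketch of how one row could be labelled from an old and a new score. It is not recalibrate.js: letterFor and describeMove are hypothetical names, and the score bands (A from 90, B from 80, C from 70, D from 60, F below) are inferred from the grades and scores quoted in this post rather than taken from the engine source.

```javascript
// Score bands inferred from this post: A >= 90, B >= 80, C >= 70, D >= 60, otherwise F.
function letterFor(score) {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}

// Label one delta row as a letter promotion, letter drop, within-band move, or no change.
function describeMove(repo, oldScore, newScore) {
  const from = letterFor(oldScore);
  const to = letterFor(newScore);
  const delta = newScore - oldScore;
  let kind;
  if (delta === 0) kind = "unchanged";
  else if (from !== to) kind = delta > 0 ? "letter promotion" : "letter drop";
  else kind = delta > 0 ? "within-band lift" : "within-band drop";
  return { repo, from, to, delta, kind };
}

// Rows taken from the tables in this post.
const sampleRows = [
  describeMove("stripe/agent-toolkit", 0, 70),
  describeMove("mongodb-js/mongodb-mcp-server", 5, 45),
  describeMove("glips/figma-context-mcp", 60, 20),
  describeMove("awslabs/mcp", 10, 0),
];
for (const row of sampleRows) {
  console.log(`${row.repo}: ${row.from} -> ${row.to} (${row.delta > 0 ? "+" : ""}${row.delta}), ${row.kind}`);
}
```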

This is the right scope because v0.3 is a scoring change, not a detection change. Every finding the corpus showed under v0.2 is still in the v0.3 report; only the per-finding deduction weight and the per-axis cap behavior changed. A future engine update that touches detection logic (e.g., tree-sitter AST parsing for fewer false positives, or npm package support for paste-a-package-name workflows) will need a fresh clone. v0.3 doesn't.

All 101 audit pages were re-rendered under v0.3 and the audit-index.json rebuilt against the new distribution. The CI gate shipped on the install-gate playbook reads from this file — every team running MIN_GRADE=C against the new distribution now blocks 43 of 101 repos (down from 52 of 101 under v0.2). The post explicitly anticipated this with the wording "v0.3 will move some grades; we'll publish a delta when it lands." This is that delta.

What this changes for the install gate

The team policy from the install-gate playbook still works under v0.3 unchanged — set MIN_GRADE=C in the GitHub Action, fail the PR on installs below threshold, re-scan every 30 days. The numerics shift by 9 repos (52 blocked under v0.2 → 43 blocked under v0.3, because 9 letter promotions crossed the threshold), but the policy template and the rollout calendar are identical. Teams who set their gate up earlier this week and shipped the policy paragraph internally do not need to update anything; the gate logic is grade-driven, the grades are now slightly more accurate, and that's strictly an improvement.
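The 52 → 43 figure follows directly from the two grade distributions. A minimal sketch, assuming the gate blocks every grade strictly below MIN_GRADE; blockedCount is a hypothetical helper and does not read audit-index.json, whose format is not described in this post.

```javascript
// Grades from best to worst.
const GRADE_ORDER = ["A", "B", "C", "D", "F"];

// Count repos whose grade is strictly worse than minGrade.
function blockedCount(distribution, minGrade) {
  const threshold = GRADE_ORDER.indexOf(minGrade);
  if (threshold === -1) throw new Error(`unknown grade: ${minGrade}`);
  return GRADE_ORDER
    .filter((_, index) => index > threshold)
    .reduce((sum, grade) => sum + distribution[grade], 0);
}

// Distributions as reported in this post.
const v02 = { A: 19, B: 0, C: 30, D: 10, F: 42 };
const v03 = { A: 19, B: 1, C: 38, D: 6, F: 37 };

console.log(blockedCount(v02, "C")); // 52
console.log(blockedCount(v03, "C")); // 43
```

The difference of 9 matches the 9 letter promotions, all of which started at D or F and ended at C or better; the two D→F drops were already blocked under v0.2.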

Two specific recalculations a team using the gate may want to make:

Honest caveats

Three caveats on what v0.3 does and doesn't change.

First, v0.3 is calibration, not new detection. The engine still doesn't run a tree-sitter AST parse, doesn't do dataflow analysis across function boundaries, and doesn't query a vulnerability database for known CVEs. It does what v0.1 and v0.2 did — pattern-match on six categories of risky idiom — and now it weighs the matches by surface. The roadmap items that touch detection (AST-based SSRF, npm package support, weekly re-scan cron) are different work and are not part of v0.3.

Second, the recalibrator handles existing reports; it does not re-pull a repo's commit. Every grade in this delta is computed against the same commit SHA the v0.2 audit was produced against. If a maintainer pushed a fix between then and now, that fix is not in the v0.3 numbers. The 30-day re-scan cadence the install-gate post recommends is the right answer; we'll re-clone every repo on a weekly cron once that ships.

Third, surface classification is heuristic. scripts/fetch-spec-types.ts is a top-level scripts file (low weight); src/internal/scripts/run.ts is production (full weight) — the rule keys on whether scripts/ is the first path segment. There will be edge cases where the heuristic is wrong; the methodology page documents every rule so a maintainer can read it and dispute a specific classification. The classifySurface() function in product-api/audit/report.js is the single source of truth and runs deterministically.

FAQ

Why ship the drops alongside the promotions?

Because hiding the drops would be exactly the dishonest behavior our F-grade posts call vendor-official MCPs out for. v0.3 made the cap honest; it had to be honest in both directions. Five repos took a score hit because production findings the v0.2 cap was silencing are now counted at full weight. Two of those crossed a letter band. We ship them named because the audit pages already named them — if we hid the drops we'd be hiding an internal-state inconsistency on our own board.

Could a repo I rely on lose its A in a future calibration update?

Not from a calibration update. Surface tiering is asymmetric — it can only down-weight findings, never up-weight, and an A grade requires zero high-severity findings on every axis in any source-tree context. A future calibration update that further down-weights some surface (say, splitting "fixtures" out of "examples" with a different weight) cannot move an A grade down. A repo can lose its A from a detection update — adding tree-sitter dataflow that catches a new-shape SSRF — or from the maintainer pushing a regression. Both are real risks. A pure calibration shift is not.

Will the corpus get more B grades over time?

Almost certainly yes. The MCP TypeScript SDK is currently the only B because most repos with high-severity production-source findings still have at least one finding that drops the security axis below 90 (= the A threshold) but not below 70 (= the C threshold). The intermediate band — 80, B — is where you land if you have a small number of warn-level findings or a single low-impact high finding partially offset by another axis. As maintainers fix the worst findings on F-grade reports and re-submit, several should cluster at B in the next round.

How do I read the new audit page format?

Every audit page now shows findings grouped under surface-tier headings: Production sources, Installer (half-weight), Examples / samples (low-weight), Benchmarks (low-weight), Build / CI scripts (low-weight), Test source (low-weight). The deduction per finding is named in the heading. The methodology paragraph at the bottom of every page references v0.3 and links to the canonical rubric. The grade is the floor of the worst axis after surface-tiered deductions.

Is v0.3 the final calibration?

Probably not. Two open questions remain. (a) Should installer code be its own tier, or should it be a sub-tier of production with a configurable weight per repo? Some installers run as root on user systems and arguably deserve full production weight; some are trivial wrappers around a network fetch. (b) Should "fixtures" (test data shipped in examples to make a demo runnable) get a separate weight from "demo code" in examples/? They look the same to a path-pattern classifier but have different threat models. We're collecting data from the next 50-100 audits before deciding. The methodology page will document any further calibration update with the same delta structure as this post.

Did any A grades almost lose their A?

No. The closest A-grade repo to the threshold is Microsoft Playwright MCP at 90 (no buffer below the A floor), but it has zero high-severity findings on any axis in any surface, so v0.3 had no deduction to recompute. Vectara had one warn-level docs-tier finding that dropped to zero impact under v0.3, lifting it to 100. Apart from langchain-ai/langchain-mcp-adapters, which was already at 100, every other A holds at 90 because the security floor for an A is "no high-severity production findings" and that's a threshold v0.3 doesn't touch.
