Research post · 2026-04-29

29 vendor-official MCP servers earned an F — every name, every file path

Out of 101 Model Context Protocol servers in our public corpus, 42 ended at an F grade. 29 of those F's are vendor-official releases — the MCPs your team installs because they have a brand on the package. This post walks each one, names the file path the engine flagged, and is honest about where the grade is unambiguous and where the engine is still learning to calibrate.

Why this list exists

Most teams adopting MCP servers in 2026 use brand as a heuristic. "It's by Cloudflare" or "it's the official one from Stripe" reads to a procurement reviewer the way @types/node reads to an npm reviewer — household name, defaults trusted, wave it through. That heuristic worked for SaaS APIs because the vendor signing the contract had a SOC 2 reviewer auditing the server. It is not how Model Context Protocol servers ship in 2026. An MCP server is a tree of tool handlers a vendor's developer relations team wrote in two weeks to demo at a conference, packaged into the vendor's GitHub org, and frozen in time the day the demo ended. There is no review gate between the dev-rel branch and the README that says claude plugin install vendor/vendor-mcp.

We ran SkillAudit v0.2.1 against the 101 most-installed MCP servers we could find. The headline numbers from the full corpus — 50% with SSRF-pattern findings, 38% with credential-handling findings, 19% earning an A and 42% an F — are useful in aggregate. They are also abstract. This post is the concrete version: every F-grade vendor-official MCP, with the file path the engine flagged. If you are about to install one of these into a Claude Code or Cursor or Windsurf agent that has access to your laptop, your AWS console, or your customer database, read your vendor's section first.

The unambiguous F's — runtime tool surface flagged

These are the F's where the SkillAudit findings sit in the runtime path that loads when the MCP starts and answers tool calls. If the LLM driving the agent is shown adversarial input — a README it is asked to summarize, a GitHub issue it is asked to triage, an email it is asked to reply to — and that input contains an instruction to call one of these tools with a crafted argument, the call goes through. There is no argument here that the engine is over-grading.

Heroku — heroku/heroku-mcp-server · F (0/100)

Ten textbook SSRF primitives across src/services/app-service.ts, src/services/app-setup-service.ts, src/services/build-service.ts, and the dyno / formation / config-var / addon services. Every method has the same pattern:

// src/services/app-service.ts:20
const response = await fetch(`${this.endpoint}/apps/${appIdentity}`, {
  headers: { Authorization: `Bearer ${this.token}` },
});

this.endpoint is read from process.env at construction. appIdentity is a tool argument that originates from whatever the agent is currently working on. Pair the bearer-token attachment with no allow-list on the URL and the primitive is a clean SSRF that exfiltrates the Heroku platform token to an attacker-controlled URL on first dispatch. The Heroku platform team would catch this in a `heroku-cli` PR review; no equivalent review fired here. Days since last push: 2 (active repo). The fix is structural — pin this.endpoint to a hardcoded constant or an allow-list checked at construction.
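As a minimal sketch of that structural fix (not the Heroku team's code), the service could validate the endpoint once at construction and encode the tool argument as a single path segment. The ALLOWED_HOSTS set, the pinEndpoint helper, and the appUrl / authHeaders method names are all assumptions made for illustration.

```typescript
// Hypothetical allow-list; the real set of valid API hosts is an assumption.
const ALLOWED_HOSTS = new Set(["api.heroku.com"]);

// Validate a configured endpoint once, at construction time.
function pinEndpoint(raw: string): URL {
  const url = new URL(raw); // throws on malformed input
  if (url.protocol !== "https:") {
    throw new Error(`endpoint must use https: ${raw}`);
  }
  if (!ALLOWED_HOSTS.has(url.hostname)) {
    throw new Error(`endpoint host is not allow-listed: ${url.hostname}`);
  }
  return url;
}

class AppService {
  private readonly endpoint: URL;
  private readonly token: string;

  constructor(rawEndpoint: string, token: string) {
    this.endpoint = pinEndpoint(rawEndpoint);
    this.token = token;
  }

  // The tool argument is encoded so it stays a single path segment.
  appUrl(appIdentity: string): string {
    return new URL(`/apps/${encodeURIComponent(appIdentity)}`, this.endpoint).toString();
  }

  // The bearer token is only ever paired with URLs built from the pinned endpoint.
  authHeaders(): Record<string, string> {
    return { Authorization: `Bearer ${this.token}` };
  }
}
```

With this shape, new AppService("https://evil.example", token) fails at construction, before any request carries the token. The sketch assumes the endpoint has no base path; an endpoint such as https://host/v1 would need the path joined explicitly.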

Auth0 — auth0/auth0-mcp-server · F (10/100)

Five HIGH SSRF findings on src/auth/device-auth-flow.ts at lines 44, 107, 171, 227, and one more in the same file. The device-auth flow is the runtime path: it constructs URLs from ${this.domain} and fires fetch(). domain is a tool input: the user (or the LLM) selects which Auth0 tenant to authenticate against. An attacker who controls the input can redirect the OAuth flow to an attacker-controlled host and steal the device-flow code on the way back. For an Auth0-branded MCP server, this is the worst possible category of finding, because Auth0's entire product is "we get OAuth right." The engine is not over-grading. The fix is a tenant-domain allow-list on the IDP side or scheme/host pinning at the client.
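Client-side host pinning for a tenant domain could look roughly like the sketch below. It assumes tenants live under a .auth0.com suffix (custom domains would need explicit allow-list entries), and deviceCodeUrl is a hypothetical helper, not a function from the audited repo.

```typescript
// Assumption: tenant hosts end with this suffix. Custom domains would need
// their own explicit allow-list entries.
const TENANT_SUFFIX = ".auth0.com";

function deviceCodeUrl(domain: string): string {
  // Parse as a full URL so inputs like "tenant.auth0.com@evil.example" or
  // "evil.example/x.auth0.com" cannot smuggle in a different host.
  const url = new URL(`https://${domain}`);
  const bareHost =
    url.username === "" &&
    url.password === "" &&
    url.port === "" &&
    url.pathname === "/" &&
    url.search === "" &&
    url.hash === "";
  if (!bareHost || !url.hostname.endsWith(TENANT_SUFFIX)) {
    throw new Error(`refusing non-tenant domain: ${domain}`);
  }
  // Rebuild the URL from the parsed hostname only.
  return `https://${url.hostname}/oauth/device/code`;
}
```

The point of the design is that the string check runs on the parsed hostname, never on the raw input, so a suffix match cannot be satisfied by a path, query, fragment, or userinfo component.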

Cloudflare — cloudflare/mcp-server-cloudflare · F (0/100)

The mcp-server-cloudflare repo is an apps monorepo. Findings cluster in apps/graphql/src/tools/graphql.tools.ts, apps/radar/src/tools/url-scanner.tools.ts, and apps/dex-analysis/src/warp_diag_reader.ts. These ARE the tool entry points (the file names tell you so), and they each invoke fetch(url, …) on a runtime-derived URL. There are also findings in apps/demo-day/frontend/script.js, which arguably isn't part of the MCP tool surface, but the apps/graphql and apps/radar findings stand on their own: those are the canonical "URL scanner" tool surface for an LLM agent. cloudflare/workers-mcp (F · 40/100) is a separate repo and a separate F. Two F's from the same vendor in the same week is a dataset, not a glitch.

MongoDB — mongodb-js/mongodb-mcp-server · F (5/100)

Six SSRF findings + ten command-exec findings. The runtime-tier finding sits in src/common/atlas/apiClient.ts:152 — a generic fetch(url, {…}) where url is a runtime-derived parameter. The Atlas API client is the heart of the MongoDB MCP — every cluster / database / project / org tool routes through it. The command-exec findings are mostly in scripts/bumpPackages.ts (a release script), which is a calibration question — but the Atlas client SSRF is unambiguous. Engine score 5/100, days since last push: low single digits. Active codebase, real surface, clean signal.

Anthropic — modelcontextprotocol/inspector · F (0/100)

The MCP Inspector — the canonical "is my MCP server working?" debug tool — earned the most concentrated findings cluster on the board. cli/scripts/make-executable.js:15 contains:

execSync(`chmod +x "${TARGET_FILE}"`)

TARGET_FILE is a CLI argument. The shell-quoting on a single argument with embedded quotes is the textbook command-injection primitive. The client/src/App.tsx:709 finding is a template-string fetch in the React client, also runtime. There are 17 SSRF findings across the codebase. Anthropic ships this. The fact that we audited it on the same rubric we apply to community repos and graded it the same way is the rubric working as intended; we acknowledged it openly on our Anthropic Skills Directory comparison page and on the MCP Inspector comparison page. The engine doesn't bend for the vendor running the protocol.
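One way to remove the shell from this particular call is to set the mode bits through the filesystem API. This is a sketch rather than the Inspector's actual fix; makeExecutable is a hypothetical name.

```typescript
import * as fs from "node:fs";

// Add the executable bits to a file without going through a shell, so quotes
// or metacharacters in the path are never interpreted.
function makeExecutable(targetFile: string): void {
  const current = fs.statSync(targetFile).mode & 0o7777; // permission bits only
  // 0o111 = execute for user, group, and other (roughly what `chmod +x` grants).
  fs.chmodSync(targetFile, current | 0o111);
}
```

Because the path is passed as a plain string to a system call, a file name containing quotes, semicolons, or $(...) is just a file name.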

PostHog — posthog/mcp · F (0/100)

The runtime-tier finding sits in typescript/src/api/client.ts:118: fetch(url, …) with a parameter url. There are also six template-string fetch sites in the API client itself, and an execSync(`pnpm typed-openapi ${TEMP_SCHEMA_PATH} --output ${OUTPUT_PATH}`) in typescript/scripts/update-openapi-client.ts:34. The script is build-time, but the API client SSRF is runtime. The engine also flags two sk-***-pattern API keys in .env.example templates as credential leaks (we have not yet calibrated .env.example placeholders against real key formats; see the calibration discussion in the shells-and-installers section below). Score 0/100.

Resend — resend/mcp-send-email · F (35/100)

Two HIGH findings in src/lib/dashboard-client.ts:12 and src/lib/resend-editor-client.ts:18, both fetch(url, …) with a parameter url. These are the Resend dashboard and editor API clients in the runtime path. For an email-sending MCP, the SSRF primitive is particularly concerning: an attacker who can pivot through the dashboard client to an internal metadata endpoint can read whatever the email-sending server's IAM role can read. Score 35/100 (higher than the 0's on this list because the credential and maintenance axes pulled some points back).
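A guard against the metadata-endpoint pivot could be sketched as below. The helper names are hypothetical, and the check only covers literal addresses: it does not resolve DNS, so a hostname that resolves to an internal address would still need a resolver-level check or a host allow-list.

```typescript
// True for IPv4 literals in "this host", loopback, private, or link-local ranges.
function isInternalIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||                            // "this host"
    a === 127 ||                          // loopback
    a === 10 ||                           // RFC 1918
    (a === 172 && b >= 16 && b <= 31) ||  // RFC 1918
    (a === 192 && b === 168) ||           // RFC 1918
    (a === 169 && b === 254)              // link-local, includes 169.254.169.254
  );
}

// Reject URLs that are not https or that point at an obviously internal host.
function assertExternalHttps(raw: string): URL {
  const url = new URL(raw);
  if (url.protocol !== "https:") {
    throw new Error(`https only: ${raw}`);
  }
  const host = url.hostname;
  // Conservative choice for this sketch: refuse every IPv6 literal.
  if (host === "localhost" || host.startsWith("[") || isInternalIPv4(host)) {
    throw new Error(`internal host refused: ${host}`);
  }
  return url;
}
```

A deny-list like this is a second line of defense; pinning the client to the vendor's known API host remains the stronger fix.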

Sentry — getsentry/sentry-mcp · F (0/100)

The findings cluster in packages/mcp-cloudflare/src/server/oauth/helpers.ts:341 (fetch(upstream_url, …)), packages/mcp-cloudflare/src/server/routes/chat-oauth.ts:120 (fetch(registrationUrl, …)), and packages/mcp-cloudflare/src/server/routes/chat-oauth.ts:171 (fetch(tokenUrl, …)). The OAuth helpers are the connection-establishment runtime path — every Sentry MCP user touches them. tokenUrl reaching fetch() without a scheme/host pin is the same category of finding as Auth0, and from a vendor (Sentry) whose business is observability of code that handles user data. Most of the Sentry findings are WARN-level rather than HIGH (the engine sees validation markers near the call-site but cannot confirm they apply to that exact call), but the call-site density on a single OAuth helper file is itself signal.

The shells-and-installers F's — partially calibration-driven

These F grades sit at the boundary of what a static scanner should weight as "tool-surface" code. The findings are real (the engine isn't hallucinating call sites), but the call sites are in build scripts, install scripts, examples, or benchmarks rather than runtime tool handlers. We are calling them out separately because the calibration question matters; the F grade does not.

Anthropic (TypeScript SDK) — modelcontextprotocol/typescript-sdk · F (15/100)

Eighteen findings, concentrated in examples/server/src/serverGuide.examples.ts and examples/server-quickstart/src/index.ts — both are fetch(url) patterns that ship as the official "here's how to write an MCP server" reference. The library code itself is mostly clean; the F is partly because copy-paste primitives in the canonical examples directory do propagate downstream — every MCP server templated from create-typescript-server inherits the URL-handling pattern its example showed. So the calibration argument cuts both ways: examples are not runtime, and examples ARE the surface that propagates.

Anthropic (Go SDK) — modelcontextprotocol/go-sdk · F (10/100)

Four sk-*** API-key placeholders across examples/server/auth-middleware/main.go. These are placeholders, not real keys — but they are placeholders in the same key format real keys ship as, which is exactly the pattern that gets accidentally committed in production code by anyone who copy-pastes from examples/. Same calibration question as the TypeScript SDK, same answer: examples are the surface that propagates.
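An example can avoid key-shaped placeholders by reading the secret from the environment and failing loudly when it is absent. The sketch below uses TypeScript for consistency with the other snippets in this post, although the flagged example is Go; requireSecret and the EXAMPLE_API_KEY variable name are hypothetical.

```typescript
// Read a secret from the environment instead of embedding a key-shaped
// placeholder in example source.
function requireSecret(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`missing required environment variable ${name}`);
  }
  return value;
}
```

In an example server this would be called as requireSecret("EXAMPLE_API_KEY"), so the committed source never contains anything a secret scanner can match.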

Stripe — stripe/agent-toolkit · F (0/100)

The toolkit is an 874-file repo. Most of the findings sit in benchmarks/card-element-to-checkout/, benchmarks/checkout-gym/, and benchmarks/furever/ — benchmark fixtures the toolkit uses to evaluate its own agent. Benchmark code arguably should not weigh on the runtime score, and we expect the v0.3 engine to subdivide here. That said, when those benchmarks ship in the same npm package the runtime tools ship in (stripe-agent-toolkit uses a single package layout), the benchmark code is reachable from the install. Calibration question, but a defensible F either way for a payments-vendor toolkit.

AWS — awslabs/mcp · F (10/100)

Five SSRF + six credential findings. The runtime-tier finding is src/aws-api-mcp-server/awslabs/aws_api_mcp_server/core/metadata/read_only_operations_list.py:54: requests.get(SERVICE_REFERENCE_URL, …). The other findings sit in samples/mcp-integration-with-kb/ and samples/mcp-integration-with-nova-canvas/, which are sample apps. The runtime finding is real; the sample-app findings will move on calibration. Notable: this is the AWS Labs monorepo housing the AWS-API MCP, the AWS-Documentation MCP, and 30+ other AWS service wrappers, which makes the per-service per-tool runtime surface much larger than any single-tool MCP.

Grafana — grafana/mcp-grafana · F (40/100)

Findings concentrate in .claude-plugin/install-binary.mjs:83 and .claude-plugin/install-binary.mjs:171: fetch(url) and fetch(CHECKSUMS_URL) in the post-install binary downloader. This is the install path: every user running claude plugin install hits it, and the URLs are constructed from process.env with no allow-list. The argument that this isn't tool surface is weaker here than for examples/benchmarks: an installer that fetches a binary the user then executes is, operationally, a much higher-trust call than a runtime API client. Score 40/100 (the engine partially compensates for axis spread).
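An installer can narrow this exposure by pinning the download origin and verifying the downloaded bytes against a digest that ships inside the package rather than one fetched from the same origin. The sketch below assumes a hypothetical RELEASE_ORIGIN and helper names; it is not Grafana's installer.

```typescript
import { createHash } from "node:crypto";

// Hypothetical pinned origin for release assets; the real one is an assumption.
const RELEASE_ORIGIN = "https://github.com";

// Refuse any download URL whose scheme, host, or port differs from the pin.
function assertPinnedOrigin(raw: string): URL {
  const url = new URL(raw);
  if (url.origin !== RELEASE_ORIGIN) {
    throw new Error(`download origin not allowed: ${url.origin}`);
  }
  return url;
}

// Compare the SHA-256 of the downloaded bytes with an expected hex digest.
// The expected digest should come from a trusted source (for example, a value
// committed alongside the installer), not from an env-derived URL.
function verifySha256(data: Uint8Array, expectedHex: string): void {
  const actual = createHash("sha256").update(data).digest("hex");
  if (actual !== expectedHex.toLowerCase()) {
    throw new Error(`checksum mismatch: got ${actual}`);
  }
}
```

A checksum fetched from an attacker-controlled URL verifies nothing, which is why the sketch treats the origin pin and the digest source as separate concerns.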

The credential-axis F's

A subset of vendor F's are driven by the credentials axis — env vars echoed to tool responses, hardcoded test tokens, debug logs that include secret material — without dominant SSRF / exec findings. These are real, but the failure mode looks different.

The remaining vendor F's — short form

The eight other vendor F's land mostly in the same SSRF-on-runtime-API-client bucket. Read the per-repo report card for the file path:

The pattern across all 29

If you read every F-grade vendor report card on this list back-to-back — and we have, three times this month — you find roughly four shapes of finding repeated over and over. The previous research post made the case in aggregate; this one is the per-vendor enumeration. The four shapes:

  1. Template-string fetch in the API-client layer. fetch(`${this.endpoint}/${resource}/${id}`, { headers: { Authorization: this.token } }). Heroku, Cloudflare (apps/graphql, apps/radar), MongoDB (atlas/apiClient), Auth0 (device-auth-flow), Sentry (oauth helpers), CircleCI (httpClient), Resend (dashboard-client), Axiom (axiom/client), JetBrains (src/index.ts), PostHog (api/client.ts) — every one of these vendors has this pattern in production source. The fix is structural: pin this.endpoint at construction, allow-list at the call site, or both.
  2. execSync / spawn with template-string args in scripts. Anthropic Inspector's execSync(`chmod +x "${TARGET_FILE}"`), Azure's execSync(`npm install ${platformPackageName}@latest`), MongoDB's execSync(`git log ${range} --format=%s -- "${dir}"`), PostHog's execSync(`pnpm typed-openapi ${TEMP_SCHEMA_PATH}`). The fix is well-known: pass argv as an array, not a shell string.
  3. Hardcoded sk-*** placeholders in examples/ and .env.example. Anthropic Go SDK, PostHog. These are not production keys, but they ship in the same key format real keys ship as, and that format is what scanners like Detect-Secrets / GitGuardian / TruffleHog match on in your CI pipelines and on-prem deploys. Anyone who copy-pastes the example and swaps a real key in for the placeholder now has a real key hardcoded in source; that is the credential-leak pattern in production. Use environment-variable references in examples, not key-shaped placeholders.
  4. Installer scripts that fetch() binaries from runtime-derived URLs. Grafana's .claude-plugin/install-binary.mjs is the canonical example. The argument "this is just install-time" doesn't hold: claude plugin install runs this on first install, and the URL is read from process.env. An attacker who controls the env (or the registry response) controls what binary lands on the user's laptop.
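The argv-array fix named in shape 2 can be sketched with Node's execFileSync, which starts the program directly instead of handing a string to a shell. The run wrapper is a hypothetical helper.

```typescript
import { execFileSync } from "node:child_process";

// Run a program with an argv array. No shell parses the arguments, so
// characters like ; " $ ( ) reach the program as literal text.
function run(program: string, args: string[]): string {
  return execFileSync(program, args, { encoding: "utf8" });
}
```

Under this shape, the git example above would become something like run("git", ["log", range, "--format=%s", "--", dir]). An array removes shell injection but not argument injection: a value that begins with "-" can still be read as a flag, so validating inputs or placing them after a "--" separator (where the program supports one) is still worth doing.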

What we are NOT claiming

We are not claiming any of these vendors ship malicious code. Every primitive on this list is the unintentional kind of vulnerability — the developer who wrote fetch(`${this.endpoint}/...`) in src/services/app-service.ts was thinking "this is a vendored API client, the endpoint is fixed at construction." In a normal service boundary that pattern is fine. In an MCP server, where the LLM can be coaxed into calling the tool with crafted arguments and where the vendored API client is being called by the agent on behalf of the user, it is not.

We are also not claiming our engine is the final word. Some grades on this list will move when we ship the v0.3 calibration update that subdivides runtime-tier from examples-tier from scripts-tier findings. Some grades will move when the LLM-assisted prompt-injection probe (engine v0.2 has it; we have not run it on this batch because the factory ANTHROPIC_API_KEY is not yet attached) lights up a category we currently can't see. Some grades will move when the maintainer fixes the finding and we re-scan. Grades move; the file paths are checkpoints in time.

What we ARE claiming is that vendor-official is not a security signal in the MCP ecosystem in 2026. The data on this page is the support for that claim. If you are about to install a vendor-official MCP into a production agent, click through to the per-repo report card and read what the engine found before claude plugin install runs. If the vendor whose MCP you are about to install isn't in our corpus yet, paste the GitHub URL on the homepage and we will queue it.

For maintainers — how to move your grade

If you maintain one of the 29 repos above, here is the short playbook. We don't charge for this; the goal is for every number in this post to go down next quarter.

For buyers — what to do this week

If you adopt MCP servers into a Claude Code, Cursor, Windsurf, or Codex agent at your team, the actionable version of this post is short:

  1. Build a list of every MCP server currently installed across your engineering team's .claude/plugins.lock / equivalent files. (If you don't have a list, that is your first finding; a default-allow approach to MCP installs is the supply-chain risk before any specific repo is.)
  2. For each one, search our public board for an existing audit. If found, read the report card.
  3. For repos not on our board, paste the GitHub URL at skillaudit.dev. Free tier covers 3 audits per month against public repos; Pro is $19/month for unlimited and a CI webhook (a GitHub Action that fails your build on grades below a configured minimum); Team is $99/month with policy export and SSO for 10 seats.
  4. Set a minimum-grade gate in your engineering onboarding doc — "we don't install anything below a C without a security review" is a defensible default. If you set the gate at C and treat F's as "blocked unless we own the fix," you eliminate ~42% of the current public corpus from your default install set, including most of the 29 vendors above.
  5. Pair the gate with a quarterly re-scan cadence. Vendor-official repos on this list will fix findings in our next batch; some of the F's above will be C's by Q3. Static grades age the moment the underlying file changes.
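The minimum-grade gate from step 4 can be sketched as a small check over an inventory of installed servers. The A-to-F ordering (including a D tier), the Audit shape, and the repo names in the usage note are assumptions for illustration; this is not the SkillAudit GitHub Action.

```typescript
type Grade = "A" | "B" | "C" | "D" | "F";

// Best to worst. Assumes a conventional five-letter scale.
const GRADE_ORDER: Grade[] = ["A", "B", "C", "D", "F"];

interface Audit {
  repo: string;
  grade: Grade;
}

// Return every audited repo whose grade is worse than the configured minimum.
function belowGate(audits: Audit[], minimum: Grade): Audit[] {
  const limit = GRADE_ORDER.indexOf(minimum);
  return audits.filter((a) => GRADE_ORDER.indexOf(a.grade) > limit);
}
```

Called as belowGate(inventory, "C"), it returns the entries that need a security review before install; a CI job could fail when that list is non-empty.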

Coming next

Three updates from the engine side over the next two batches:

If your team is interested in being a Pro/Team beta partner (CI gating, policy export, the GitHub Action), email hello@skillaudit.dev. We are picking the first 100 authors who request a scan and giving them Pro free for six months.


SkillAudit engine v0.2.1. Every report linked above is a permanent URL; scan data is regenerated from source commits, so reports pinned to specific commits remain verifiable. For the full index see the audit board. For the aggregate first-party data this post is a per-vendor breakdown of, see the corpus state-of-the-art post. For the comparison-tool framing — Snyk, Dependabot, Socket, OSV-Scanner, npm audit, MCP Inspector, the Anthropic Skills Directory, GitHub Code Scanning, StackHawk — see the compare hub. For the embeddable grade badge see the embed page.
