Research · 2026-06-01

Vendor-official vs community MCP servers: who scores better?

Six weeks after the April 2026 scan that put 29 vendor-official MCP servers at grade F, we re-examine the full 101-repo corpus with v0.3 engine grades and ask the comparative question the original post only implied: is vendor branding a reliable proxy for MCP security quality, and has anything changed now that the data is public?

TL;DR

After v0.3 calibration: 23 of 29 vendor-official servers (79%) remain at grade F, and only one earns an A.
Community servers: 19 of 72 (26%) at F, about a third of the vendor-official rate. Of the 20 A-grade servers in the corpus, 19 are community-maintained.
The single vendor-official A is Microsoft Playwright. The gap between Playwright and the rest of the vendor-official field is structural, not cosmetic.
Six weeks of disclosure didn't close the gap. A handful of vendor maintainers have responded; the distribution has not materially shifted.
Implication for buyers: "Vendor-official" is not a security signal in the MCP ecosystem; in this corpus it correlates with worse grades, not better ones. Teams running a min-grade-C policy should not exempt vendor-official servers from the gate.

The counterintuitive finding

When we published the original F-grade list in late April, the most common reaction from practitioners was disbelief — not at the specific vendors named, but at the direction of the gap. Enterprise security teams are trained to treat vendor-official software as lower-risk than community-maintained software. The reasoning is sound in traditional SaaS: a vendor who signs an enterprise contract has a SOC 2 auditor, a legal team reviewing supply-chain commitments, and a platform-security team running static analysis on every merge. Those institutional controls correlate, imperfectly but meaningfully, with lower vulnerability density.

Model Context Protocol servers break that pattern. MCP servers are not the vendor's core product. They are a thin tool-handler wrapper — typically written in a two-week sprint by a developer-relations team to demo at a conference — that lives in a GitHub org subdirectory and receives essentially zero of the institutional security investment the vendor directs at its flagship API. The SOC 2 auditor doesn't review it. The platform-security team doesn't gate it. The enterprise contract doesn't mention it. The README says "official" and that's the entire security signal the procurement reviewer sees.

Community maintainers, meanwhile, often build MCP servers as their primary artifact. For an indie developer who has shipped one MCP server and is asking the community to install it, the security posture of that server is reputationally significant. They read the threat model. They read what an A-grade server looks like. They write the allow-list. The data reflects this.

Grade distribution: vendor-official vs community

The table below shows the full grade distribution across both cohorts after the v0.3 engine calibration that introduced surface tiering — findings in examples, benchmarks, and installer scripts now deduct at lower weight than runtime tool-surface findings. Several vendor-official F's moved to C or B under this calibration (Stripe to C, Anthropic's TypeScript SDK to B); those are reflected in the post-v0.3 numbers.

Grade	Vendor-official (29 total)	% of vendor-official	Community (72 total)	% of community
A	1	3.4%	19	26.4%
B	1	3.4%	11	15.3%
C	2	6.9%	20	27.8%
D	2	6.9%	3	4.2%
F	23	79.3%	19	26.4%
Total	29	100%	72	100%

The gap is large in every grade band except D, which holds only five servers in total. Vendor-official servers are 3× more likely to be at F than community servers (79% vs 26%), and community servers are 7.7× more likely to reach an A (26% vs 3.4%). The vendor-official corpus has exactly one A-grade server out of 29; the community corpus has 19 out of 72. That is not noise; it is a structural pattern.
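The percentages and ratios quoted here follow directly from the table counts. A minimal Python sketch of the arithmetic, with the counts copied from the table; the variable and function names are illustrative and not part of the SkillAudit engine:

```python
# Grade counts copied from the table above.
vendor = {"A": 1, "B": 1, "C": 2, "D": 2, "F": 23}
community = {"A": 19, "B": 11, "C": 20, "D": 3, "F": 19}


def rate(counts: dict, grade: str) -> float:
    """Share of a cohort at the given grade, as a fraction of the cohort total."""
    return counts[grade] / sum(counts.values())


vendor_f = rate(vendor, "F")          # 23 / 29
community_f = rate(community, "F")    # 19 / 72
vendor_a = rate(vendor, "A")          # 1 / 29
community_a = rate(community, "A")    # 19 / 72

f_ratio = vendor_f / community_f      # how much more often vendor servers land at F
a_ratio = community_a / vendor_a      # how much more often community servers reach A

print(f"F rate: vendor {vendor_f:.1%}, community {community_f:.1%}, ratio {f_ratio:.1f}x")
print(f"A rate: vendor {vendor_a:.1%}, community {community_a:.1%}, ratio {a_ratio:.1f}x")
```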

Key figures: 79% of vendor-official servers at F · 26% of community servers at F · 0 vendor-official A-grades (ex. Playwright) · 19 community A-grades

A note on the 52 number from the min-grade-C policy post: that post stated a min-grade-C policy blocks 52 of the 101 servers in the corpus, including all 29 vendor-official releases. Under the v0.3 grades in the table above, four vendor-official servers now clear the C bar (among them Stripe at C and Anthropic's TypeScript SDK at B), leaving 25 vendor-official and 22 community servers at D or F. The blocked count therefore drops from 52 to 47; the practical policy implication is unchanged.

Why vendor-official servers score poorly — three structural reasons

1. The developer-relations shipping channel

The dominant pattern across the 23 remaining vendor-official F-grade servers is not that the vendor's engineers wrote malicious code. It is that the code was written by a developer-relations team to a demo deadline, with no security review step between "it works" and "it ships." The template-string fetch pattern that appears in Heroku's src/services/app-service.ts, Cloudflare's apps/graphql/src/tools/graphql.tools.ts, MongoDB's src/common/atlas/apiClient.ts, and a dozen other vendor repos is the same pattern these vendors' platform engineers would flag in a pull-request review on their core product in minutes. Those review gates simply do not exist for the MCP server because the MCP server was never categorized as a security-sensitive surface.
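To make the pattern concrete, here is a hypothetical illustration in Python. The files named above are TypeScript and are not reproduced here, and nothing below is taken from them. The sketch assumes a tool handler that interpolates a caller-supplied `region` argument into the host portion of a URL; `build_url`, the `example.test` domain, and the argument name are all invented for the example.

```python
from urllib.parse import urlsplit


def build_url(region: str) -> str:
    """Hypothetical handler helper: interpolates a tool argument into the URL."""
    # Same shape as a template-string fetch such as
    # `https://${region}.api.example.test/v1/apps` in TypeScript.
    return f"https://{region}.api.example.test/v1/apps"


# A benign argument lands on the intended host.
benign = urlsplit(build_url("us-east"))

# An adversarial argument ends the host early; the intended suffix becomes a fragment.
hostile = urlsplit(build_url("attacker.test/#"))

print(benign.hostname)   # us-east.api.example.test
print(hostile.hostname)  # attacker.test
```

Any credential header attached to the request travels to whichever host the parsed URL names, which is why argument-built URLs on a credential-carrying client are treated as a finding rather than a style issue.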

Vendors with mature security engineering cultures apply them to product surfaces that have been through threat-model review and are in scope for the security program. MCP servers, launched as developer-experience initiatives in Q4 2025 and Q1 2026, are not yet in scope for most vendors' security programs. The institutional gap is not negligence; it is organizational inertia. The security team often doesn't know the MCP server exists. By the time it does, the server has been installed by thousands of users.

2. The monorepo surface-area problem

Many vendor-official MCP servers live inside large monorepos alongside benchmarks, samples, test fixtures, and integration examples. The AWS Labs awslabs/mcp repo houses over thirty separate service wrappers in a single monorepo. The Stripe agent toolkit's benchmark fixtures are co-located in the same npm package that ships the runtime tools. The Anthropic Go SDK ships key-shaped placeholder credentials in its examples/ tree in the same format real keys use.

Surface tiering in the v0.3 engine partially addresses this — findings in examples/, benchmarks/, and scripts/ directories now deduct at lower weight — but the problem is structural, not just a scoring artifact. When benchmark code and runtime tool code share a package boundary, a scanner that draws a hard line between them is making a judgment call that the repo's own packaging contradicts. Stripe's benchmark fixtures ship to the same npm tarball as the payment-processing tool handlers. That design choice is a vendor-official decision; the scoring reflects it.
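One way to picture surface tiering is as a path-based weight applied to each finding's deduction. The sketch below is an illustration under stated assumptions, not the v0.3 engine: the directory names come from the paragraph above, but the weight values, the `Finding` type, and the function names are invented here.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Directories the post says deduct at lower weight under v0.3.
LOW_WEIGHT_DIRS = {"examples", "benchmarks", "scripts"}

# Hypothetical weights: the real engine's values are not given in this post.
RUNTIME_WEIGHT = 1.0
NON_RUNTIME_WEIGHT = 0.25


@dataclass(frozen=True)
class Finding:
    path: str          # repo-relative path of the flagged file
    deduction: float   # points the finding would cost at full weight


def tier_weight(path: str) -> float:
    """Return the reduced weight if any path component is a low-weight directory."""
    parts = PurePosixPath(path).parts
    return NON_RUNTIME_WEIGHT if LOW_WEIGHT_DIRS.intersection(parts) else RUNTIME_WEIGHT


def total_deduction(findings: list) -> float:
    """Sum deductions after applying each finding's surface-tier weight."""
    return sum(f.deduction * tier_weight(f.path) for f in findings)


findings = [
    Finding("src/tools/fetch.ts", 20.0),       # runtime tool surface: full weight
    Finding("examples/quickstart.ts", 20.0),   # example code: reduced weight
]
print(total_deduction(findings))  # 25.0 with the hypothetical weights above
```

Matching on whole path components rather than substrings keeps a file such as `src/examples_util.ts` at full weight.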

3. Credential architecture borrowed from server-side SDKs

Several vendor F's are driven primarily by the credentials axis rather than SSRF: GitHub's MCP server, W&B's wandb-mcp-server, Honeycomb's honeycomb-mcp, HubSpot's mcp-server. The pattern in credential-axis F's is usually that the MCP server inherits its credential architecture from the vendor's existing server-side SDK — an architecture designed for a process that runs in a trusted server environment, not for a tool-handler process that sits between an LLM and a user's laptop. Server-side SDKs commonly log request metadata for observability, echo environment variables in error messages for debugging, and store tokens in ways that assume the process is the security boundary. In an MCP context, the process is not the security boundary — the tool arguments are the security boundary, and those are adversarially controlled by the LLM's context window.

Vendors who copy-paste their server-side SDK's credential-handling code into their MCP server inherit this mismatch. Community maintainers building from scratch for the MCP context tend not to have an SDK to copy from, and end up writing credential handling that is appropriate for the threat model by necessity. The methodology page has more detail on how the credentials axis is scored and what patterns move a server from C to A on that axis.
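As a rough illustration of the mismatch: a server-side habit such as echoing request details in an error message becomes a leak once the error text flows back into the model's context. The sketch below shows one possible mitigation, redacting known secret values before an error leaves the tool handler. The environment variable name `EXAMPLE_API_TOKEN` and the helper names are hypothetical; this is not code from any named vendor's server, nor a description of how the credentials axis is scored.

```python
import os
from typing import Mapping, Optional

# Hypothetical names of environment variables that hold secrets for this server.
SECRET_ENV_VARS = ("EXAMPLE_API_TOKEN",)


def redact_secrets(message: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace any secret value found in the message with a fixed placeholder."""
    env = os.environ if env is None else env
    for name in SECRET_ENV_VARS:
        value = env.get(name)
        if value:  # skip unset or empty variables
            message = message.replace(value, f"[redacted:{name}]")
    return message


def tool_error(exc: Exception, env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the error payload returned to the caller, with secrets stripped."""
    return {"error": redact_secrets(str(exc), env)}


# Usage with a fake environment so the example does not depend on real variables.
fake_env = {"EXAMPLE_API_TOKEN": "tok_123456"}
err = RuntimeError("request failed: Authorization: Bearer tok_123456")
print(tool_error(err, fake_env))
```

Redaction of this kind is a backstop. It does not replace keeping tokens out of log lines and error strings in the first place.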

The exception: Microsoft Playwright and what it does differently

Microsoft Playwright's MCP server is the only vendor-official server to earn an A grade — 1 out of 29, or 3.4% of the vendor-official corpus. It is worth examining what separates it from the other 28, because the gap is not luck.

The most important structural difference is that microsoft/playwright-mcp was built by the Playwright team itself — the same engineers who maintain the browser automation framework — rather than by a developer-relations or developer-experience team. The Playwright team has a decade of experience thinking about the security boundary between a test automation agent and the browser it controls: what inputs it should trust, what URLs it should be willing to navigate to, what information it should never expose to the automation layer. Those intuitions translated directly into the MCP server design. The tool handlers do not compose URLs from runtime arguments; they compose browser actions from a fixed vocabulary of Playwright API calls against a local browser instance. The credential surface is minimal because the threat model was worked through before a line of code was written.

The second differentiating factor is maintenance cadence. Playwright is a flagship product for the VS Code / developer tooling division of Microsoft, and the MCP server is maintained as part of the Playwright release pipeline rather than as an afterthought to it. Findings flagged by our initial scan were triaged and addressed in the same code review cycle that handles Playwright core issues — not left open as out-of-scope for the security team.

The Playwright server also has the most complete documentation of any vendor-official server in the corpus on its tool surface and permission model — which is the "documentation axis" component of the SkillAudit score. Servers that clearly document what network connections they make, what credentials they require, and what data they transmit score higher on the documentation axis regardless of their code quality, because documentation is a security control: it lets users audit the tool surface before installing it rather than after.

Pattern summary: The Playwright A-grade reflects the server being built by the product team that understands the threat model, maintained in the same pipeline as the core product, and documented to the same standard as the framework it wraps. Those three conditions — domain expertise, maintenance integration, and documentation discipline — are absent in almost every other vendor-official MCP server in the corpus.

Six weeks of follow-up: what changed after public disclosure

The original post was published April 29, 2026. It named 29 vendor-official servers by GitHub handle, file path, and finding type. We have been tracking responses across the six weeks since publication. The honest summary: individual maintainers responded faster than institutional security programs did, and the overall distribution has not materially shifted.

Responses that moved the grade

Two vendor-official servers saw grade improvements that can be attributed to disclosure response rather than pre-planned calibration changes. Stripe's team moved the benchmark fixtures into a separate npm workspace, allowing the surface tiering in v0.3 to deduct them at lower weight; the grade moved from F to C. Anthropic's TypeScript SDK team changed the example fetch patterns in the canonical quickstart to use typed API clients rather than raw template-string fetch() calls; combined with v0.3 tiering, the grade moved from F to B (currently the only B in the vendor-official corpus).

These are real improvements and should be acknowledged. They also illustrate the mechanism: surface tiering didn't change what the engine finds; it changed how it weights the location of what it finds. The Anthropic TypeScript SDK fix was both a code change and a tiering beneficiary — that combination is what moved it two letter grades.

Responses that didn't move the grade

Several vendor security teams opened internal tracking issues — we can infer this from GitHub activity patterns and from direct mail to hello@skillaudit.dev. None of those internal processes have produced public commits that address the runtime-tier findings within the six-week window. Heroku's src/services/app-service.ts template-string fetch pattern is unchanged. Auth0's src/auth/device-auth-flow.ts URL construction is unchanged. MongoDB's src/common/atlas/apiClient.ts fetch call is unchanged. CircleCI's src/clients/circleci/httpClient.ts is unchanged.

This is not surprising. Structural fixes to API-client URL construction require design decisions — whether to pin to a hardcoded constant, implement an allow-list at the constructor, or refactor the abstraction layer entirely — and those decisions route through architecture review, not sprint planning. Six weeks is not enough time for a vendor with a multi-person review process to complete a structural refactor of a credential-carrying API client. What it is enough time for is filing the issue, assigning an owner, and getting it into a roadmap. Whether those internal steps have happened we cannot observe from outside the repo.
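Of the three options named above, the constructor allow-list is the easiest to show in a few lines. The following is a sketch under stated assumptions, written in Python rather than the TypeScript of the files listed earlier; `ApiClient`, `ALLOWED_HOSTS`, and the `example.test` hostnames are hypothetical.

```python
from urllib.parse import quote, urlsplit

# Hypothetical allow-list of hosts this client may ever send credentials to.
ALLOWED_HOSTS = frozenset({"api.example.test", "eu.api.example.test"})


class ApiClient:
    """Refuses to be constructed with a base URL outside the allow-list."""

    def __init__(self, base_url: str, token: str) -> None:
        parts = urlsplit(base_url)
        if parts.scheme != "https":
            raise ValueError("base URL must use https")
        if parts.hostname not in ALLOWED_HOSTS:
            raise ValueError(f"host not allowed: {parts.hostname!r}")
        # Keep only scheme and host so later path building cannot change the host.
        self._origin = f"https://{parts.hostname}"
        self._token = token

    def url_for(self, *segments: str) -> str:
        """Join path segments onto the fixed origin, percent-encoding each one."""
        return self._origin + "/" + "/".join(quote(s, safe="") for s in segments)


client = ApiClient("https://api.example.test", token="placeholder")
print(client.url_for("apps", "my app/../x"))
# https://api.example.test/apps/my%20app%2F..%2Fx
```

Validating once in the constructor means every later request inherits the check, and percent-encoding each segment keeps tool arguments from introducing new path separators.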

Non-responses

Eight of the 29 vendor-official servers have no observable response across the six-week window: no GitHub issue, no maintainer contact, no commit activity in the relevant directories. For repositories whose most recent commit was already months old when we scanned, this is consistent with the maintenance signal that contributed to their F grade: an unmaintained server doesn't get patched in response to a disclosure any more than it gets patched in response to a dependency update. The maintenance axis of the SkillAudit rubric exists precisely because maintenance cadence predicts patch latency, and patch latency is a security property.

What this means for buyers and for Anthropic's directory program

For buyers

The six-week follow-up reinforces a conclusion that the April data already supported: treat vendor-official MCP servers with the same skepticism you would apply to any third-party tool that hasn't gone through your security review process. The "official" label earns a vendor approximately one letter grade of benefit of the doubt when you're evaluating a traditional API integration, because the vendor's security program is in scope for the integration. For MCP servers, the vendor's security program is almost certainly not in scope for the MCP server, which means the benefit of the doubt is unearned.

The practical recommendation from the min-grade-C policy post applies equally to vendor-official and community servers: don't install anything below grade C without a documented security exception. The fact that a server carries a vendor brand does not exempt it from the policy. The F-grade vendor servers on our board carry the same risk profile as F-grade community servers — in some cases worse, because they carry credentials for production services (Heroku platform tokens, Auth0 OAuth flows, Stripe payment APIs) that community MCP servers typically don't.
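A policy like this reduces to a small comparison that never looks at who published the server. The sketch below is illustrative only: it is not the GitHub Action from the install gate policy post, and the function name, the exception set, and the server identifiers are made up for the example.

```python
# Grades ordered from best to worst, as used throughout this post.
GRADE_ORDER = ("A", "B", "C", "D", "F")


def install_allowed(
    server: str,
    grade: str,
    min_grade: str = "C",
    documented_exceptions: frozenset = frozenset(),
) -> bool:
    """Allow installs at or above min_grade, or with a documented exception.

    Note what is absent: there is no vendor-official parameter to exempt on.
    """
    if grade not in GRADE_ORDER or min_grade not in GRADE_ORDER:
        raise ValueError(f"unknown grade: {grade!r} / {min_grade!r}")
    meets_bar = GRADE_ORDER.index(grade) <= GRADE_ORDER.index(min_grade)
    return meets_bar or server in documented_exceptions


# Hypothetical server identifiers, for illustration only.
print(install_allowed("example-vendor/mcp-server", "F"))      # False
print(install_allowed("example-community/mcp-server", "B"))   # True
print(install_allowed(
    "example-vendor/mcp-server", "F",
    documented_exceptions=frozenset({"example-vendor/mcp-server"}),
))                                                            # True
```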

For teams that run quarterly MCP inventory reviews: note the servers that have been on your install list since before the April 2026 corpus scan. If any of them are vendor-official servers from the F-grade list, check the current grade on the audit board — some of those grades have moved since April, either from vendor fixes or from v0.3 engine calibration. Don't rely on a grade that was accurate in April to still be accurate in June; re-scan, or check the live audit page, before the next quarter.

For Anthropic's directory certification program

Anthropic has signaled that an official MCP server directory with some form of certification or vetting layer is part of the roadmap for the Model Context Protocol ecosystem. The data from our corpus is directly relevant to what that vetting layer should and shouldn't include.

The failure mode to avoid is certification-by-brand: treating vendor-official servers as pre-vetted because they come from named companies. The corpus shows that named companies have the highest F rate in the ecosystem. A certification program that rejected F-grade servers on static-analysis grade alone, ignoring vendor provenance, would block 79% of current vendor-official submissions while passing 74% of community submissions, which is approximately the right direction given where the quality signal actually lives.

The Playwright exception is instructive here too: Microsoft Playwright earns its A not because it's from Microsoft, but because the Playwright team did the work. A certification program that evaluates the work — the threat-model documentation, the permission scope declaration, the URL-handling architecture — would capture that distinction. A program that evaluates the brand would not.

A tiered certification model would make sense given the distribution: a "Listed" tier requiring a current audit with a minimum grade of D (blocks the worst actors); a "Verified" tier requiring grade C or above (blocks 47 of the 101 servers under the v0.3 grades tabulated above); and a "Certified" tier requiring grade B or above with a documented permission scope declaration (reachable by about 15% of current servers, all but one of them community-maintained). Each tier maps to a different level of installation friction: Listed servers require user acknowledgment, Verified servers install with a prompt, Certified servers install silently in enterprise deployments. That architecture gives vendors a clear incentive to improve their grade rather than relying on brand to do the work for them.
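The three tiers and their install behaviour can be written down as a small lookup. This is a sketch of the proposal in the paragraph above, not an existing program: the tier names, grade thresholds, and friction levels come from the text, while the function name and the `has_scope_declaration` flag are assumptions. The "current audit" requirement is left out for brevity.

```python
from typing import Optional

GRADES = ("A", "B", "C", "D", "F")  # best to worst

# Install friction per tier, as described in the proposal above.
INSTALL_FRICTION = {
    "Certified": "installs silently in enterprise deployments",
    "Verified": "installs with a prompt",
    "Listed": "requires user acknowledgment",
}


def certification_tier(grade: str, has_scope_declaration: bool) -> Optional[str]:
    """Return the highest tier a server qualifies for, or None if it is not listed."""
    rank = GRADES.index(grade)  # raises ValueError on an unknown grade
    if rank <= GRADES.index("B") and has_scope_declaration:
        return "Certified"
    if rank <= GRADES.index("C"):
        return "Verified"
    if rank <= GRADES.index("D"):
        return "Listed"
    return None


for grade, declared in [("A", True), ("A", False), ("D", True), ("F", False)]:
    tier = certification_tier(grade, declared)
    print(grade, declared, tier, INSTALL_FRICTION.get(tier, "not listed"))
```

Under this reading, an A-grade server without a permission scope declaration falls back to Verified, so the top tier rewards the documentation as well as the grade.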

Conclusion and actionable guidance

The vendor-official vs community comparison is the most counterintuitive finding in the SkillAudit corpus, and it has stayed counterintuitive through six weeks of follow-up data. The gap between a 79% F rate and a 26% F rate is not a calibration artifact: we have run the v0.3 engine that explicitly adjusts for examples/benchmarks/scripts, and the gap is larger post-v0.3 than pre-v0.3. On the vendor-official side, the B-grade uplift for Anthropic's TypeScript SDK and the C-grade uplift for Stripe account for two of the improvements, while 23 of the 29 servers are still at F.

The gap exists because vendor-official MCP servers are written to a different incentive structure than community MCP servers. Vendors write MCP servers to ship a demo. Community maintainers write MCP servers to ship a tool. That distinction — demo vs tool — maps almost perfectly onto the security posture gap we observe. Tools get threat-modeled; demos get shipped.

For buyers, the actionable guidance is unchanged from April:

Don't exempt vendor-official servers from your minimum-grade policy. The brand is not a security control.
Check the live audit page before installing, not the April grade. Some vendor-official servers have improved since the original scan; many have not. The audit board shows the current grade with the scan date.
Apply the same security exception process to vendor-official F-grade installs that you apply to community F-grade installs. The risk profile is equivalent; the process should be too.
If you need a vendor-official MCP server whose grade is currently F, email the vendor's security team directly — not the developer-relations contact, the security team — and reference the SkillAudit report. The finding with file path, line number, and category is a complete bug report. Vendors who have engaged with the report have moved their grade; vendors who haven't have not.

For community maintainers whose servers are in the A tier: the data validates the work you put into threat-modeling an MCP server as a first-class artifact rather than a demo. If you want to understand what specifically earned your grade, the per-axis breakdown is on your audit page. If you want to share the grade as a trust signal for prospective users, the embeddable badge stays current as we re-scan.

SkillAudit engine v0.3. Corpus: 101 MCP servers, scanned April 2026, re-evaluated June 2026 against v0.3 grades. For the original vendor-official F-grade enumeration with per-server file paths, see the April 2026 post. For the full calibration delta that moved 22 grades when v0.3 shipped, see the engine v0.3 calibration delta post. For the min-grade-C policy template and GitHub Action, see the install gate policy post. For the full corpus audit index, see the public audit board. For the scoring rubric, see the methodology page. For what an A-grade server looks like structurally, see anatomy of an A-grade MCP server.
