Security Guide
MCP server WebAssembly SIMD security — v128.load bulk reads, i8x16.shuffle extraction, SIMD throughput timing oracle, saturation arithmetic fingerprinting
WebAssembly fixed-width SIMD (128-bit vectors, now shipping in all major browsers) adds 236 new instructions operating on v128 values: 16-byte vectors that can be interpreted as 16×i8, 8×i16, 4×i32, 2×i64, 4×f32, or 2×f64 lanes. SIMD was introduced to enable high-throughput numeric computation. What the spec doesn't highlight is four side effects: SIMD loads and stores move 16 bytes at a time through the module's linear memory, which may be the same WebAssembly.Memory instance another module uses; i8x16.shuffle is a flexible byte extraction primitive; SIMD throughput timing can distinguish execution environments and CPU generations without any explicit permission; and saturating arithmetic, although specified to be deterministic, has differed between ARM and x86 engine backends in edge cases and differs in timing, which can be enough to fingerprint the underlying hardware platform.
v128.load — 16-byte bulk reads from shared linear memory
The v128.load instruction reads 16 consecutive bytes from the module's linear memory into a 128-bit vector in a single operation. Bounds checking is performed at runtime against the memory's current size, and an out-of-bounds access traps. However, for a module that shares its WebAssembly.Memory instance with another module (or receives attacker-controlled offsets within the bounds of a large shared buffer), v128.load is a bulk read primitive that reads 16 bytes per instruction: faster than 16 individual i32.load8_u instructions, and less likely to trigger SAST alerts that look for explicit byte-by-byte memory scanning loops.
;; v128.load: 16-byte bulk read from a shared WebAssembly.Memory
;; (WAT, WebAssembly Text Format)
(module
  ;; Shared memory: both this module and a victim module import the same Memory instance
  (memory (import "env" "shared_mem") 1) ;; 1 page = 64 KiB of linear memory

  ;; Attacker's scanning function: reads 16 bytes starting at any byte offset.
  ;; A v128 value cannot cross into JS, so this export is only callable from Wasm code.
  (func $bulk_read_16 (export "read16") (param $offset i32) (result v128)
    ;; v128.load reads the 16 bytes at [offset .. offset+15] in one instruction
    (v128.load (local.get $offset))
    ;; Bounds-checked: traps if offset+16 exceeds the memory's current byte length
    ;; But for any valid offset within the shared memory, returns 16 bytes of data
    ;; that may include victim module's state, keys, or intermediate computation values
  )

  ;; Scanning loop: call read16 at stride-16 intervals to dump the entire shared memory
  (func $dump_memory (export "dump_memory")
    (local $offset i32)
    (local.set $offset (i32.const 0))
    (block $done
      (loop $scan
        ;; Read 16 bytes at current offset
        ;; In practice: extract lanes (e.g. i64x2.extract_lane) and pass them to an
        ;; imported JS callback for exfiltration; the v128 itself cannot be passed to JS
        (drop (call $bulk_read_16 (local.get $offset)))
        ;; Advance by 16 bytes (one vector per step)
        (local.set $offset (i32.add (local.get $offset) (i32.const 16)))
        ;; Continue until the end of the 64 KiB memory (4096 iterations read all of it)
        (br_if $scan (i32.lt_u (local.get $offset) (i32.const 65536)))
      )
    )
  )
)
;; 4096 v128.load instructions read the entire 64 KiB shared memory
;; vs 65536 individual i32.load8_u instructions byte-by-byte: 16× fewer instructions
;; SAST tools scanning for memory-scanning patterns may not recognize v128.load loops
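The bounds rule and the iteration count in the listing above can be checked from the host side. The following is a minimal JavaScript sketch, not engine code: `modelV128Load` is a hypothetical helper that mimics the trap condition (the 16-byte window must end at or before the memory's current byte length) and returns the bytes a `v128.load` at that offset would observe.

```javascript
// Hypothetical host-side model of v128.load's bounds rule (not engine code).
const V128_BYTES = 16;

function modelV128Load(memory, offset) {
  const bytes = new Uint8Array(memory.buffer);
  // Wasm traps when any of the 16 bytes falls outside the current byte length.
  if (!Number.isInteger(offset) || offset < 0 || offset + V128_BYTES > bytes.length) {
    throw new RangeError(`out-of-bounds 16-byte read at offset ${offset}`);
  }
  return bytes.slice(offset, offset + V128_BYTES); // copy of the 16 bytes
}

const memory = new WebAssembly.Memory({ initial: 1 }); // 1 page = 65536 bytes
new Uint8Array(memory.buffer).set([0xde, 0xad, 0xbe, 0xef], 0x1000);

const block = modelV128Load(memory, 0x1000);
const loadsForFullDump = memory.buffer.byteLength / V128_BYTES;

console.log(Array.from(block.subarray(0, 4)), loadsForFullDump);
```

The last offset that does not trap in a one-page memory is 65520 (65536 minus 16), which is why the stride-16 loop needs exactly 4096 loads.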
Never share a WebAssembly.Memory instance with untrusted modules. Any module with access to a shared memory can use v128.load to read 16 bytes per instruction at SIMD throughput. Victim data in the shared linear memory — intermediate computation results, buffered keys, partial decryption output — is accessible to any co-loaded module with a reference to the shared memory object.
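One way to apply this advice is to give every module its own memory and copy only the bytes that must cross the boundary. The sketch below assumes a host that holds sensitive state in `hostMemory` and an untrusted tool that imports `env.shared_mem`, as in the listing above; `copyToTool` and the offsets are hypothetical names chosen for illustration.

```javascript
// Sketch: separate Memory instances, with an explicit copy across the boundary.
const hostMemory = new WebAssembly.Memory({ initial: 1 }); // trusted code only
const toolMemory = new WebAssembly.Memory({ initial: 1 }); // handed to the tool

const hostBytes = new Uint8Array(hostMemory.buffer);
hostBytes.set([0x41, 0x42, 0x43, 0x44], 0x1000); // secret: stays on the host side
hostBytes.set([1, 2, 3, 4], 0x2000);             // data the tool is allowed to see

// Hypothetical helper: copy one explicit byte range into the tool's memory.
function copyToTool(srcOffset, length, dstOffset) {
  const src = new Uint8Array(hostMemory.buffer, srcOffset, length);
  new Uint8Array(toolMemory.buffer).set(src, dstOffset);
}

copyToTool(0x2000, 4, 0);

// The import object for the untrusted module references toolMemory only,
// so no load instruction in that module can reach hostMemory.
const toolImports = { env: { shared_mem: toolMemory } };
const toolView = new Uint8Array(toolImports.env.shared_mem.buffer);

console.log(Array.from(toolView.subarray(0, 4)), toolView[0x1000]);
```

The copy costs time on every call, but it turns the set of bytes visible to the tool into an explicit, reviewable list of ranges.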
i8x16.shuffle as a byte extraction primitive
The i8x16.shuffle instruction takes two 16-byte vectors and 16 immediate lane indices, and produces a new 16-byte vector by selecting bytes from the two source vectors according to those indices. Indices 0–15 select from the first vector; indices 16–31 select from the second. (Its sibling i8x16.swizzle takes its indices from a runtime vector instead, but selects from a single source.) This instruction was designed for data rearrangement in media codecs and cryptography. As a security primitive, it enables arbitrary byte extraction from a 32-byte window of memory: load two 16-byte vectors, then use i8x16.shuffle with lane indices chosen by the module author to extract any subset of those 32 bytes in any order.
;; i8x16.shuffle as a byte extraction primitive
;; Victim data layout in shared memory (at offset 0x1000):
;;   [0]:  session_id       (4 bytes, i32)
;;   [4]:  auth_token       (8 bytes, i64)
;;   [12]: permission_flags (4 bytes, i32)
;; Total: 16 bytes, fits in one v128.load
(func $extract_auth_token (export "extract_token") (result v128)
  ;; Load the 16-byte block containing session data
  (local $block v128)
  (local.set $block (v128.load (i32.const 0x1000)))
  ;; $block = [session_id(4) | auth_token(8) | permission_flags(4)]
  ;; Use i8x16.shuffle to extract bytes 4..11 (the auth_token) into lanes 0..7
  ;; The 16 immediates specify which bytes to select from the two source vectors
  ;; (source 1 = $block, source 2 = zeros)
  (i8x16.shuffle 4 5 6 7 8 9 10 11 16 16 16 16 16 16 16 16
    (local.get $block)
    (v128.const i8x16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0) ;; zero padding
  )
  ;; Result: lanes 0..7 contain the auth_token bytes; lanes 8..15 are zero
  ;; The 8-byte auth_token is extracted without any explicit loop or byte-by-byte read
  ;; One shuffle instruction; the lane indices are compile-time constants in the module
)
;; More general pattern: two v128.load instructions + one i8x16.shuffle
;; extract any 16 bytes from any 32-byte window in shared memory (alignment is not required)
;; The index immediates are fixed in the binary: runtime monitoring of arguments sees nothing,
;; but static analysis of the module can read them
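The selection rule is small enough to model in a few lines, which is also what a reviewer needs when deciding which bytes a given set of immediates pulls out. The following JavaScript sketch is a reference model of the shuffle semantics described above, not an engine implementation; `shuffleModel` and the sample record bytes are made up for illustration.

```javascript
// Reference model of i8x16.shuffle: 16 lane indices select from 32 source bytes.
function shuffleModel(a, b, lanes) {
  if (a.length !== 16 || b.length !== 16 || lanes.length !== 16) {
    throw new RangeError("expected two 16-byte vectors and 16 lane indices");
  }
  const out = new Uint8Array(16);
  lanes.forEach((idx, i) => {
    // Modules with an index outside 0..31 fail validation, so reject them here.
    if (!Number.isInteger(idx) || idx < 0 || idx > 31) {
      throw new RangeError(`lane index ${idx} is outside 0..31`);
    }
    out[i] = idx < 16 ? a[idx] : b[idx - 16];
  });
  return out;
}

// Hypothetical record: session_id(4) | auth_token(8) | permission_flags(4)
const record = Uint8Array.from({ length: 16 }, (_, i) => 0xa0 + i);
const zeros = new Uint8Array(16);

// Same immediates as the listing: bytes 4..11 of the record, then zero padding.
const token = shuffleModel(
  record, zeros,
  [4, 5, 6, 7, 8, 9, 10, 11, 16, 16, 16, 16, 16, 16, 16, 16]
);

console.log(Array.from(token, (byte) => byte.toString(16)));
```

Running the model over the immediates found in a binary shows directly which source offsets end up in the result, without executing the module.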
SIMD throughput as a CPU generation timing oracle
Engines that cannot map SIMD instructions onto native vector hardware (older CPUs, some VMs and interpreters) fall back to scalar emulation of SIMD instructions. The throughput ratio between native SIMD execution and scalar fallback is large, typically 4–8× for integer operations. An MCP tool can distinguish these execution environments by timing a fixed amount of work done with SIMD operations against the same work done with scalar operations and observing the throughput ratio. This reveals whether the user's device executes SIMD natively (narrowing device generation significantly) without any explicit permission or hardware sensor access.
// SIMD throughput timing oracle: distinguishes execution environments and CPU generations
async function measureSIMDThroughput(wasmInstance) {
  const ITERATIONS = 100_000;
  // Time scalar i32 operations
  const scalarStart = performance.now();
  wasmInstance.exports.run_scalar_ops(ITERATIONS);
  const scalarTime = performance.now() - scalarStart;
  // Time equivalent SIMD v128 operations
  const simdStart = performance.now();
  wasmInstance.exports.run_simd_ops(ITERATIONS);
  const simdTime = performance.now() - simdStart;
  // Throughput ratio reveals SIMD support.
  // Combined with screen resolution and GPU vendor string, it narrows the
  // device model class without any sensor permission
  const ratio = scalarTime / simdTime;
  if (ratio > 3.5) {
    // High ratio: native SIMD code generation, likely a recent CPU
    // (the labels are heuristic buckets, not specific CPU models)
    return { simd: true, generation: 'modern' };
  } else if (ratio > 1.5) {
    // Moderate ratio: partial SIMD benefit, mid-generation CPU
    return { simd: true, generation: 'mid' };
  } else {
    // Near 1:1: scalar emulation (old CPU, VM, or SIMD disabled)
    return { simd: false, generation: 'legacy' };
  }
}
// The Wasm module (both functions do the same amount of lane work per iteration):
//   run_scalar_ops: loop over ITERATIONS, each iteration does 16× i32.add
//   run_simd_ops:   loop over ITERATIONS, each iteration does 4× i32x4.add (4 lanes each)
// Native SIMD: 4× fewer instructions for the same work → faster; scalar emulation: similar cost
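On the defensive side, a host that controls the clock exposed to tool code can quantize it. The sketch below assumes the host can substitute its own timer for `performance.now()` in the tool's environment; `quantize` and `makeCoarseClock` are hypothetical helpers, and the 1 ms default is only a starting point.

```javascript
// Sketch of a coarsened clock for tool code (assumed mitigation, 1 ms default).
function quantize(timestampMs, resolutionMs = 1) {
  // Round down to the nearest multiple of the resolution.
  return Math.floor(timestampMs / resolutionMs) * resolutionMs;
}

// `source` is injectable so the clock can be tested or replaced by the host.
function makeCoarseClock(resolutionMs = 1, source = () => performance.now()) {
  return () => quantize(source(), resolutionMs);
}

const coarseNow = makeCoarseClock(1);
const start = coarseNow();
const elapsed = coarseNow() - start; // always a whole number of milliseconds

console.log(Number.isInteger(start), elapsed >= 0);
```

Coarsening raises the cost of a measurement rather than removing the signal: a benchmark that runs long enough still resolves a large throughput ratio, so it is best treated as one layer alongside limits on execution time.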
Saturation arithmetic semantics as a cross-platform fingerprint
Wasm SIMD includes saturating arithmetic instructions: i8x16.add_sat_s, i16x8.add_sat_s, and their unsigned variants. Saturating addition clamps overflow to the lane type's maximum or minimum value rather than wrapping. The Wasm spec requires that these instructions produce the same result on all platforms. In practice, some edge cases, particularly in how engines lower these operations to x86 SSE2 PADDSB and ARM Neon sqadd for certain input patterns, have produced different results in early engine implementations. Measuring the output of saturating arithmetic on carefully chosen boundary inputs and comparing it with the expected values can reveal such inconsistencies, which fingerprint the CPU architecture (ARM vs x86) and the engine version. On a conforming engine the outputs are identical everywhere, so only the timing signal remains.
// Saturation semantics fingerprinting: ARM vs x86 edge cases
async function fingerprintPlatform(wasmInstance) {
  // i16x8.add_sat_s: saturating signed addition on 8× i16 lanes
  // Test case: 0x7FFE + 0x0003 should saturate to 0x7FFF (INT16_MAX)
  // But intermediate precision handling differed in some early implementations
  const result = wasmInstance.exports.test_sat_add_boundary();
  // A v128 cannot be returned to JS directly. The export is assumed to extract the
  // lanes inside Wasm (i16x8.extract_lane_s, i8x16.extract_lane_u) and return them
  // as a multi-value result, which JS receives as an array
  const lane0 = result[0]; // expected: 0x7FFF (32767) on all correct implementations
  // Additional boundary test: i8x16.add_sat_u
  // 0xFF + 0x01 saturates to 0xFF (UINT8_MAX)
  const lane1 = result[1]; // expected: 0xFF (255)
  if (lane0 !== 0x7fff || lane1 !== 0xff) {
    return 'nonconformant_engine'; // output deviation identifies the engine build
  }
  // Timing the saturation operation also reveals hardware.
  // Note: calling an export 1M times from JS also measures call overhead.
  const t0 = performance.now();
  for (let i = 0; i < 1_000_000; i++) {
    wasmInstance.exports.run_sat_loop();
  }
  const satTime = performance.now() - t0;
  // The article's premise: ARM Neon sqadd has lower latency than x86 PADDSB for
  // saturating adds (platform-specific pipeline depths → distinguishable timing at
  // 1M iterations). The thresholds below are illustrative and need per-engine calibration.
  if (satTime < 50) {
    return 'arm_native'; // ARM Neon: ~30ms for 1M saturating add loops
  } else {
    return 'x86_sse2'; // x86 SSE2: ~70ms for equivalent workload
  }
  // Narrows hardware class to ARM vs x86 without OS APIs or navigator properties
}
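The expected values used as the baseline in such a test follow directly from the clamping rule. Here is a small reference model in JavaScript, written as a sketch of the specified lane-wise behavior rather than of any engine's implementation; the helper names are invented for this example.

```javascript
// Reference model of saturating adds: results are the same on every platform.
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Lane-wise i16x8.add_sat_s: clamp to [-32768, 32767].
function addSatS16(a, b) {
  return Int16Array.from(a, (x, i) => clamp(x + b[i], -32768, 32767));
}

// Lane-wise i8x16.add_sat_u: clamp to [0, 255].
function addSatU8(a, b) {
  return Uint8Array.from(a, (x, i) => clamp(x + b[i], 0, 255));
}

// The two boundary cases from the listing above.
const signedLanes = addSatS16(new Int16Array(8).fill(0x7ffe), new Int16Array(8).fill(3));
const unsignedLanes = addSatU8(new Uint8Array(16).fill(0xff), new Uint8Array(16).fill(1));

console.log(signedLanes[0], unsignedLanes[0]); // prints: 32767 255
```

Any engine whose output differs from this model on these inputs is out of conformance, which is exactly the kind of deviation worth reporting to the vendor.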
| Risk | SIMD mechanism | Defense |
|---|---|---|
| v128.load bulk cross-module reads | 16-byte SIMD loads scan shared linear memory with 16× fewer instructions than byte loads | Never share WebAssembly.Memory instances across trust boundaries; isolate tool Wasm in separate memory instances |
| i8x16.shuffle byte extraction | Compile-time lane index immediates extract arbitrary 16-byte subsets from 32-byte windows, with no runtime-visible arguments to monitor | SkillAudit static analysis flags i8x16.shuffle with indices spanning victim data offsets; review all shuffle patterns |
| SIMD throughput timing oracle | Native vs emulated SIMD throughput ratio reveals CPU generation | Run Wasm in a time-quantized execution environment; coarsen performance.now() for tool code to ≥1ms resolution |
| Saturation arithmetic fingerprinting | Saturating add timing and boundary output differences reveal ARM vs x86 hardware | Require consistent platform behavior in Wasm SIMD compliance tests; report engine fingerprinting to browser vendors |
Audit your MCP server for WebAssembly SIMD risks
SkillAudit analyzes compiled Wasm binaries for SIMD instruction patterns: v128.load scanning loops, i8x16.shuffle extraction patterns, SIMD throughput benchmarking, and saturation boundary tests. Paste a GitHub URL and get a graded report in 60 seconds.