mcp-dyno
An open-source CLI that measures how an MCP server performs when an LLM drives it — across efficiency, cost, context-bloat, correctness, and reliability — and proves before/after changes with paired statistics. It grew out of benchmarking dtc-mcp, when measurement turned out to be harder than the optimization.

What it does
mcp-dyno points at any MCP server (stdio, SSE, or HTTP), drives it with Claude —
via an Anthropic API key or an existing Claude CLI subscription — and reports five
lenses in one run: efficiency (tokens, tool-call and discovery round-trips), cost in
real dollars, context-bloat (a four-channel breakdown of what fills the context
window), correctness (graded by an LLM judge against per-task criteria), and
reliability (pass^k consistency, hallucinated-tool rate, schema adherence). It
auto-generates a task suite from the server's own tools, or you can bring your own. A
local dashboard (dyno view) shows the report, per-task transcripts, and a
side-by-side comparison of any two runs.
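
To make the pass^k reliability figure concrete, here is a minimal sketch of one common way to estimate it: the probability that k independent attempts at a task all succeed, computed from n recorded trials with c successes as C(c, k) / C(n, k). This is an assumption about the metric's usual definition, not a description of mcp-dyno's internals. The function name `pass_hat_k` and the trial counts are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k independent attempts ALL succeed.

    Uses the combinatorial estimator C(successes, k) / C(trials, k), i.e. the
    fraction of size-k subsets of the recorded trials that contain only passes.
    Assumed definition for illustration; mcp-dyno may compute it differently.
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # math.comb returns 0 when k > successes, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


# Hypothetical task: 8 trials, 6 of them judged correct.
for k in (1, 2, 4):
    print(f"pass^{k} = {pass_hat_k(6, 8, k):.3f}")
```

With k = 1 this reduces to the plain pass rate. Larger k penalizes tasks that pass only some of the time, which is why it works as a consistency measure and not just an accuracy measure.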
dyno compare runs the suite before and after a change and applies a paired-difference
test, reporting the minimum detectable effect and the required n, so you can tell
whether the change is statistically resolvable or lost in run-to-run noise.
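
The idea behind a paired comparison can be sketched in a few lines. The example below is an illustration under stated assumptions, not mcp-dyno's implementation. It pairs each task's "before" and "after" measurement and tests the mean difference with a normal approximation (a t-distribution would be more accurate for small n). It then derives the minimum detectable effect and the number of pairs needed to detect a chosen effect. The function name `paired_compare`, its parameters, and the token counts are all hypothetical.

```python
from dataclasses import dataclass
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev


@dataclass
class PairedResult:
    mean_diff: float   # average of (after - before) across tasks
    p_value: float     # two-sided, normal approximation
    mde: float         # smallest effect detectable at this n
    required_n: int    # pairs needed to detect target_effect


def paired_compare(before, after, target_effect, alpha=0.05, power=0.8):
    """Paired-difference test on per-task metrics (normal approximation)."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    if target_effect <= 0:
        raise ValueError("target_effect must be positive")

    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")

    std_err = sd / sqrt(n)
    norm = NormalDist()
    z = mean(diffs) / std_err
    p_value = 2 * (1 - norm.cdf(abs(z)))

    # Critical values for the significance level and the desired power.
    z_sum = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    mde = z_sum * std_err
    required_n = ceil((z_sum * sd / target_effect) ** 2)
    return PairedResult(mean(diffs), p_value, mde, required_n)


# Made-up token counts for six tasks, before and after a server change.
before = [1200, 1500, 900, 1100, 1300, 1000]
after = [1000, 1350, 850, 900, 1150, 900]
result = paired_compare(before, after, target_effect=50)
print(result)
```

Pairing by task removes between-task variance from the comparison, so only the run-to-run noise in the differences has to be overcome. That is what lets a small suite resolve a change that an unpaired comparison would miss.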
Why I built it
It grew out of benchmarking dtc-mcp: optimizing the server was straightforward, but knowing whether a change actually helped — through the noise of LLM runs — was the hard part. I wrote up the full story here.
Install with npx mcp-dyno · npm