mcp-dyno

An open-source CLI that measures how an MCP server performs when an LLM drives it — across efficiency, cost, context-bloat, correctness, and reliability — and proves before/after changes with paired statistics. It grew out of benchmarking dtc-mcp, when measurement turned out to be harder than the optimization.

MCP · AI Tools · Evaluation · Developer Tools · CLI

What it does

mcp-dyno points at any MCP server (stdio, SSE, or HTTP), drives it with Claude — via an Anthropic API key or an existing Claude CLI subscription — and reports five lenses in one run: efficiency (tokens, tool-call and discovery round-trips), cost in real dollars, context-bloat (a four-channel breakdown of what fills the context window), correctness (graded by an LLM judge against per-task criteria), and reliability (pass^k consistency, hallucinated-tool rate, schema adherence). It auto-generates a task suite from the server's own tools, or you can bring your own. A local dashboard (dyno view) shows the report, per-task transcripts, and a side-by-side comparison of any two runs.
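Two of the reliability metrics above can be made concrete with a short sketch. It assumes pass^k means "the probability that k independent trials of the same task all pass", estimated from c passes in n trials as C(c, k) / C(n, k), and that the hallucinated-tool rate is the share of tool calls naming a tool the server never advertised. The function names and the tool names in the usage note are hypothetical; this is not mcp-dyno's actual code.

```typescript
// Estimate pass^k from n recorded trials of one task, c of which passed.
// C(c, k) / C(n, k) is computed as a running product to avoid large factorials.
function passHatK(n: number, c: number, k: number): number {
  if (k < 1 || k > n || c < 0 || c > n) {
    throw new RangeError("need 1 <= k <= n and 0 <= c <= n");
  }
  if (c < k) return 0; // fewer passes than k: no all-pass subset exists
  let p = 1;
  for (let i = 0; i < k; i++) {
    p *= (c - i) / (n - i);
  }
  return p;
}

// Fraction of tool calls whose name is not in the server's advertised tool list.
// (Assumed definition of "hallucinated-tool rate".)
function hallucinatedToolRate(calledTools: string[], advertised: Set<string>): number {
  if (calledTools.length === 0) return 0;
  const unknown = calledTools.filter((name) => !advertised.has(name)).length;
  return unknown / calledTools.length;
}
```

For example, `passHatK(10, 7, 1)` is the plain pass rate, while `passHatK(10, 7, 3)` is lower because it asks for three passes in a row; that gap is what makes pass^k a consistency measure rather than an accuracy measure. `hallucinatedToolRate(["read_dtc", "clear_all"], new Set(["read_dtc"]))` counts one unknown call out of two.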

dyno compare runs a before and an after, then applies a paired-difference test (reporting the minimum detectable effect and the required n), so it tells you whether a change is statistically resolvable or just noise.
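The statistics behind such a comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's implementation: it pairs each task's before and after measurement (for example, tokens used), uses a normal approximation for the minimum detectable effect and required n, and defaults to a two-sided 5% significance level with 80% power. The names `pairedDifference`, `zAlpha`, and `zPower` are hypothetical.

```typescript
interface PairedResult {
  n: number;        // number of paired tasks
  meanDiff: number; // mean of (after - before)
  sdDiff: number;   // sample standard deviation of the differences
  t: number;        // paired t statistic: meanDiff / (sdDiff / sqrt(n))
  mde: number;      // smallest mean difference detectable at this n
  requiredN: (effect: number) => number; // pairs needed to detect `effect`
}

function pairedDifference(
  before: number[],
  after: number[],
  zAlpha = 1.96,   // two-sided alpha = 0.05 (assumed default)
  zPower = 0.8416, // power = 0.80 (assumed default)
): PairedResult {
  if (before.length !== after.length || before.length < 2) {
    throw new RangeError("need two equal-length samples with at least 2 pairs");
  }
  const diffs = after.map((a, i) => a - before[i]);
  const n = diffs.length;
  const meanDiff = diffs.reduce((s, d) => s + d, 0) / n;
  const variance =
    diffs.reduce((s, d) => s + (d - meanDiff) ** 2, 0) / (n - 1);
  const sdDiff = Math.sqrt(variance);
  const stdErr = sdDiff / Math.sqrt(n);
  // With zero spread the statistic is undefined; report 0 when nothing changed.
  const t = stdErr === 0 ? (meanDiff === 0 ? 0 : Math.sign(meanDiff) * Infinity) : meanDiff / stdErr;
  const z = zAlpha + zPower;
  return {
    n,
    meanDiff,
    sdDiff,
    t,
    mde: z * stdErr,
    requiredN: (effect: number) => Math.ceil(((z * sdDiff) / Math.abs(effect)) ** 2),
  };
}
```

Pairing matters because per-task variation is usually much larger than the effect of a change: differencing each task against itself removes that variation, so the same number of runs resolves a smaller effect. If the observed mean difference is smaller than `mde`, the honest reading is "not resolvable at this n", and `requiredN` indicates roughly how many paired tasks would be needed.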

Why I built it

It grew out of benchmarking dtc-mcp: optimizing the server was straightforward, but knowing whether a change actually helped — through the noise of LLM runs — was the hard part. I wrote up the full story here.

Install with npx mcp-dyno · npm

Stack

TypeScript · Node.js · MCP SDK · Anthropic SDK · Commander · Vitest