dtc-mcp

A code-execution MCP server for Klaviyo + Shopify analytics. Three tools instead of 28, a stateful V8 sandbox so iterative analyses don't re-fetch, and a benchmark against the official Klaviyo MCP that drove the architecture.

MCP · AI Tools · DTC Analytics · Code Execution

What it does

dtc-mcp is a Model Context Protocol server that an LLM client (Claude Desktop, Cursor, or any other MCP client) can use to do analytics work against Klaviyo and Shopify. It exposes three tools instead of the dozens that a typical vendor MCP surfaces: execute_code runs JavaScript against typed Klaviyo + Shopify SDKs in a stateful V8 sandbox, and search_docs / read_doc let the agent discover the SDK surface on demand. The interesting move is the sandbox: it keeps one context alive per MCP connection, so anything the agent assigns to globalThis in one call is still there on the next. Multi-turn analyses (fetch a 5,000-row report in turn 1, drill into the top performers in turn 2, synthesize for a CEO update in turn 3) don't re-fetch anything between turns. Instead of streaming every row through the conversation, the agent writes a few lines of JS that aggregate in-sandbox and return just the answer.
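
The stash-across-calls behavior can be sketched with Node's built-in node:vm module. This is an illustration under stated assumptions, not dtc-mcp's actual code: fetchReport and executeCode are hypothetical stand-ins, and node:vm is used only because it needs no native dependency.

```javascript
// Sketch only: fetchReport and executeCode are hypothetical names, and the
// three rows are made-up sample data.
const vm = require("node:vm");

let fetchCount = 0;
function fetchReport() {
  fetchCount += 1; // count upstream fetches so reuse is observable
  return [
    { flow: "Welcome", revenue: 1200 },
    { flow: "Abandoned Cart", revenue: 3400 },
    { flow: "Winback", revenue: 800 },
  ];
}

// One context per connection; it outlives individual execute_code calls.
const context = vm.createContext({ fetchReport });

function executeCode(source) {
  // The value of the last expression in the script is the call's result.
  return vm.runInContext(source, context);
}

// Turn 1: fetch once, stash on globalThis, return only the row count.
const rowCount = executeCode(`
  globalThis.report = fetchReport();
  report.length;
`);

// Turn 2: reuse the stash and return just the top performer.
const topFlow = executeCode(`
  [...report].sort((a, b) => b.revenue - a.revenue)[0].flow;
`);

console.log(rowCount, topFlow, fetchCount); // 3 Abandoned Cart 1
```

The second call never touches fetchReport: the rows stay inside the context, and only a number and a string cross back into the conversation.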

Why I built it

Klaviyo's official MCP server has the conventional shape: ~28 hand-built tools, each a thin wrapper around one API endpoint. That design is great when the workflow fits the tools, but real analyst work rarely does. An analyst asks "what flow is making me the most money, and how does the audience overlap with my top segment?", a question that requires fetching a report, sorting it, filtering, joining against another resource, and synthesizing. With tool-list MCPs the agent does the aggregation in-context, which means every row of every response keeps costing tokens on each of the next few turns. With a code-execution surface, the agent writes 20 lines of JS that do the same work in-sandbox and return just the verdict. Anthropic's Nov 2025 "Code execution with MCP" post is now the canonical statement of this thesis (they report 98.7% token reductions on representative workflows; Cloudflare's "Code Mode" makes the same case at hyperscale). I wanted to validate it independently on real third-party APIs the agent couldn't have memorized (Klaviyo's reporting endpoints, Shopify's GraphQL catalog) and against a vendor's own official MCP for a fair head-to-head.

How it works

The sandbox runs in one of two modes, chosen automatically at startup. The preferred mode is a sidecar process that spawns the user's system Node binary (≥ 20) and loads isolated-vm there; it holds one long-lived V8 isolate per MCP connection, with the typed Klaviyo/Shopify SDKs and a small set of output-discipline helpers (pick, topN, summarize) injected once at startup. The fallback is node:vm in-process, used when no system Node is discoverable: same global surface, weaker isolation.

The sidecar exists because Claude Desktop is an Electron app with macOS hardened runtime and Library Validation: native modules loaded into Claude's own process have to share Anthropic's Team ID, which ours obviously can't. The child process has its own (unrestricted) hardened-runtime status, so isolated-vm loads cleanly there, and the two processes talk via newline-delimited JSON-RPC over stdio.

All API access (auth, rate limits, caching) lives on the host side; the sandbox can only invoke typed methods that route through the rate-limiter and the cache. Klaviyo's reporting endpoints (1/s burst, 2/min sustained) get a 10-minute response cache because the agent re-asks for the same report constantly during iterative analyses. The docs surface (search_docs BM25 + read_doc by path) reads from a bundled ~330-chunk index that's also refreshed at startup from a CDN, so new API endpoints land without a new MCP release.
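
A minimal sketch of what a 10-minute response cache in front of a rate-limited endpoint might look like. All names here (createTtlCache, loadReport, the "flow-report" key) are hypothetical; the real host-side cache and rate-limiter are not shown in this write-up and may be structured quite differently.

```javascript
// Sketch only: a TTL cache that stores the in-flight promise, so repeated
// requests inside the window (even concurrent ones) share one upstream call.
function createTtlCache(ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { promise, expiresAt }

  return function cached(key, load) {
    const hit = entries.get(key);
    if (hit && hit.expiresAt > now()) return hit.promise;

    const promise = Promise.resolve(load());
    entries.set(key, { promise, expiresAt: now() + ttlMs });
    // Do not keep failures around: drop the entry if the load rejects.
    promise.catch(() => {
      if (entries.get(key)?.promise === promise) entries.delete(key);
    });
    return promise;
  };
}

// Demo with an injected fake clock, so no real waiting is needed.
let clock = 0;
let upstreamCalls = 0;
const TEN_MINUTES = 10 * 60 * 1000;
const cached = createTtlCache(TEN_MINUTES, () => clock);

// Hypothetical loader standing in for a rate-limited reporting request.
const loadReport = async () => {
  upstreamCalls += 1;
  return { rows: 5000 };
};

cached("flow-report", loadReport); // miss: goes upstream
cached("flow-report", loadReport); // hit: served from cache
const callsWithinTtl = upstreamCalls;

clock += TEN_MINUTES + 1; // advance past the TTL
cached("flow-report", loadReport); // expired: goes upstream again
const callsAfterExpiry = upstreamCalls;

console.log(callsWithinTtl, callsAfterExpiry); // 1 2
```

Caching the promise rather than the resolved value is a design choice of this sketch: it collapses duplicate requests that arrive while the first is still in flight, which matters when the sustained budget is only two calls per minute.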

What I learned

I built a benchmark to validate the architecture (9 analytics tasks × both MCPs × 2 trials, conversations from 1 to 10 turns, graded by Sonnet sub-agents against ~5 criteria per task). The bench data is in the repo; three lessons travelled.

Tool descriptions are LLM input, not human documentation. v1.0.5 shipped with a "STRONGLY RECOMMENDED for multi-turn investigations: stash any expensive fetch on globalThis…" paragraph in the execute_code tool description. The intent was to teach the stash-and-cite pattern. It backfired: dtc-mcp's tokens went up ~30% on multi-turn tasks, and one long-conversation cell timed out at 20+ minutes (vs 7.3 min on v1.0.4). I ran a Sonnet sub-agent ablation (5 candidate description styles × 3 trials, no API calls) and the data was unambiguous: stash behavior was 3/3 across every candidate, including a 35-token ultra-compressed one. The model already had the pattern; the prose was teaching what it already did, and the added context just inflated per-call reasoning. v1.0.6 reverted to a minimal schema description with one canonical example, plus a state field in the response envelope so the agent can see what's stashed structurally instead of being told to track it. That fixed the regression. The transferable lesson: prescriptive language doesn't get specially weighted by attention; it just adds tokens. Optimize tool descriptions for what the model parses structurally — schema, names, response shapes, concrete examples — not for what a human reader finds helpful.
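
To make the state-field idea concrete, here is one way such an envelope could be assembled. The envelope layout, the describeValue summaries, and the klaviyo / shopify global names are all assumptions for illustration; the write-up only says that a state field exists, not what it contains.

```javascript
// Sketch only: summarize what the agent has stashed, without echoing the data.
function describeValue(value) {
  if (Array.isArray(value)) return `array(${value.length})`;
  if (value === null) return "null";
  if (typeof value === "object") return `object(${Object.keys(value).length} keys)`;
  return typeof value;
}

// builtinKeys lists globals injected by the host, which are not agent state.
function buildEnvelope(result, sandboxGlobal, builtinKeys) {
  const state = {};
  for (const key of Object.keys(sandboxGlobal)) {
    if (!builtinKeys.has(key)) state[key] = describeValue(sandboxGlobal[key]);
  }
  return { result, state };
}

// Hypothetical sandbox global after two turns of analysis.
const builtins = new Set(["klaviyo", "shopify"]);
const sandboxGlobal = {
  klaviyo: {},
  shopify: {},
  report: [{ flow: "Welcome" }, { flow: "Winback" }],
  topFlow: "Welcome",
};

const envelope = buildEnvelope(2, sandboxGlobal, builtins);
console.log(JSON.stringify(envelope));
// {"result":2,"state":{"report":"array(2)","topFlow":"string"}}
```

The point of the shape is that the agent reads what is stashed from structure it already parses well, so no prose instruction has to ask it to keep track.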

One canonical example with the real API surface does ~99% of the teaching. Three rounds of ablation (~63 sub-agent probes, 9 candidate description formats) all converged on the same finding: once the description includes one concrete example using the real API methods with the right request shape, hallucinations drop 5× and the call pattern stabilizes. Format past that is marginal — schema-first, multi-example, anti-example, intent-routing decision trees, TypeScript .d.ts declarations, npm README markdown all produced essentially equivalent agent behavior. Counter-intuitively, more examples didn't help (single canonical example beat four covering different angles), and full type declarations were actively worse (they encourage more elaborate reasoning, slower responses). I'd been ready to spend time on cleverness; the bench told me to ship one good example and stop.

The architecture is right but bounded by docs ergonomics. On the tasks that didn't touch segments (Klaviyo's segment resource requires an additional-fields[segment]=profile_count workaround to bulk-fetch counts), dtc-mcp beat Klaviyo's hand-built MCP on tokens by 13–28%, which is exactly the workload shape where in-sandbox aggregation amortizes the code overhead. On segment-touching tasks, dtc-mcp was 2–3× slower on wall-clock because the agent had to rediscover the documented workaround each run: the description's canonical example only shows one API surface (reporting), so the agent extrapolates patterns to others and invents methods. The next release (v1.1.0) is recipe-by-intent discovery (searchable recipes covering distinct resource shapes, modeled on Anthropic's Skills filesystem pattern), because the bench is now unambiguous that the code-execution architecture wins where computation lives, provided the docs surface makes the right composition obvious. Description tweaks have hit their ceiling; runtime ergonomics is the next leverage point.
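
For reference, the segment workaround amounts to one query parameter. The sketch below only builds the request URL with the standard URL API; the host and path are my assumption about Klaviyo's REST layout and should be checked against their API reference, and no auth or revision headers are shown.

```javascript
// Sketch only: the base URL and path are assumptions; the query parameter
// is the workaround named in the text above.
const segmentsUrl = new URL("https://a.klaviyo.com/api/segments");
segmentsUrl.searchParams.set("additional-fields[segment]", "profile_count");

// URLSearchParams percent-encodes the square brackets when serializing.
console.log(segmentsUrl.toString());
// https://a.klaviyo.com/api/segments?additional-fields%5Bsegment%5D=profile_count
```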

Stack

TypeScript · Node.js · MCP SDK · isolated-vm · V8 · Klaviyo API · Shopify API · MiniSearch · Vitest