dtc-mcp
A code-execution MCP server for Klaviyo + Shopify analytics. Three tools instead of 28, a stateful V8 sandbox so iterative analyses don't re-fetch, and a benchmark against the official Klaviyo MCP that drove the architecture.

What it does
dtc-mcp is a Model Context Protocol server that an LLM (Claude
Desktop, Cursor, or any MCP client) can use to do analytics work
against Klaviyo and Shopify. It exposes three tools instead of the
dozens that a typical vendor MCP surfaces: execute_code runs
JavaScript against typed Klaviyo + Shopify SDKs in a stateful V8
sandbox, and search_docs / read_doc let the agent discover the
SDK surface on demand. The interesting move is the sandbox — it
keeps one context alive per MCP connection, so anything the agent
assigns to globalThis in one call is still there on the next call.
Multi-turn analyses (fetch a 5,000-row report in turn 1, drill into
the top performers in turn 2, synthesize for a CEO update in turn 3)
don't re-fetch anything between turns. The agent writes a few lines
of JS that aggregate in-sandbox and returns just the answer, instead
of streaming every row through the conversation.
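
The persistence described above can be sketched with Node's built-in node:vm module (the fallback mode, not the preferred isolated-vm sidecar). This is a minimal illustration under stated assumptions, not the project's implementation: the executeCode function and the sample rows are hypothetical.

```javascript
const vm = require("node:vm");

// One context per connection: created once, reused for every call.
const context = vm.createContext({});

// Hypothetical stand-in for the execute_code tool: run a snippet in the
// shared context and return its completion value.
function executeCode(source) {
  return vm.runInContext(source, context);
}

// Turn 1: "fetch" some rows (made-up data) and stash them on globalThis.
executeCode(`
  globalThis.report = [
    { flow: "Welcome", revenue: 1200 },
    { flow: "Abandoned Cart", revenue: 3400 },
    { flow: "Win-back", revenue: 800 },
  ];
`);

// Turn 2: a later call reads the stash without re-fetching anything.
const top = executeCode(`
  [...globalThis.report].sort((a, b) => b.revenue - a.revenue)[0].flow
`);
```

Because both calls run in the same context, the second snippet sees the report the first one stored; a fresh context per call would lose it.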

Why I built it
Klaviyo's official MCP server has the conventional shape: ~28 hand-built tools, each a thin wrapper around one API endpoint. That design works well when the workflow fits the tools, but real analyst work rarely does. An analyst asks "what flow is making me the most money, and how does the audience overlap with my top segment?", a question that requires fetching a report, sorting it, filtering, joining against another resource, and synthesizing. With tool-list MCPs the agent does the aggregation in-context, so every row of every response costs tokens and keeps costing them across the next few turns. With a code-execution surface, the agent writes 20 lines of JS that do the same work in-sandbox and return just the verdict. Anthropic's Nov 2025 "Code execution with MCP" post is now the canonical statement of this thesis (they report a 98.7% token reduction on their example workflow; Cloudflare's "Code Mode" makes the same case at hyperscale). I wanted to validate it independently on real third-party APIs the agent couldn't have memorized (Klaviyo's reporting endpoints, Shopify's GraphQL catalog) and against a vendor's own official MCP for a fair head-to-head.
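
To make the token argument concrete, here is a sketch of the kind of snippet an agent might write inside the sandbox. The rows are synthetic and the field names (flowId, revenue) are hypothetical, not Klaviyo's schema; the string lengths at the end are only a rough proxy for token cost.

```javascript
// Synthetic stand-in for a large report already fetched inside the sandbox.
// Field names are hypothetical, not Klaviyo's schema.
const rows = Array.from({ length: 5000 }, (_, i) => ({
  flowId: `flow_${i % 25}`,
  revenue: (i % 7) * 10,
}));

// Aggregate in-sandbox: total revenue per flow.
const totals = new Map();
for (const { flowId, revenue } of rows) {
  totals.set(flowId, (totals.get(flowId) ?? 0) + revenue);
}

// Return only the verdict: the top three flows, not the 5,000 rows.
const verdict = [...totals.entries()]
  .map(([flowId, revenue]) => ({ flowId, revenue }))
  .sort((a, b) => b.revenue - a.revenue)
  .slice(0, 3);

// Rough proxy for what would enter the conversation in each design.
const rawSize = JSON.stringify(rows).length;
const verdictSize = JSON.stringify(verdict).length;
```

Only verdict leaves the sandbox, so the conversation carries a few dozen characters instead of the serialized report.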

How it works
The sandbox runs in one of two modes, chosen automatically at
startup. The preferred mode is a sidecar: the server spawns the
user's system Node binary (≥ 20) as a child process and loads
isolated-vm there; the sidecar holds
one long-lived V8 isolate per MCP connection, with the typed
Klaviyo/Shopify SDKs and a small set of output-discipline helpers
(pick, topN, summarize) injected once at startup. The fallback
is node:vm in-process, used when no system Node is discoverable —
same global surface, weaker isolation. The sidecar exists because
Claude Desktop is an Electron app with macOS hardened runtime and
Library Validation: native modules loaded into Claude's own process
have to share Anthropic's Team ID, which ours obviously doesn't. The
child process has its own (unrestricted) hardened-runtime status, so
isolated-vm loads cleanly there, and the two processes talk via
newline-delimited JSON-RPC over stdio. All API access — auth, rate
limits, caching — lives on the host side; the sandbox can only
invoke typed methods that route through the rate-limiter and the
cache. Klaviyo's reporting endpoints (1/s burst, 2/min sustained)
get a 10-minute response cache because the agent re-asks the same
report constantly during iterative analyses. The docs surface
(search_docs BM25 + read_doc by path) reads from a bundled
~330-chunk index that's also refreshed at startup from a CDN, so
new API endpoints land without a new MCP release.
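
The host-side response cache can be sketched as a small TTL wrapper around whatever function actually calls the API. This is an assumption-laden illustration rather than the project's code: createCachedFetcher, its parameters, and the request shape are all hypothetical, and the rate limiter that would sit behind fetchFn is omitted.

```javascript
// Minimal TTL cache sketch. All names here are hypothetical.
const TEN_MINUTES_MS = 10 * 60 * 1000;

function createCachedFetcher(fetchFn, ttlMs = TEN_MINUTES_MS, now = Date.now) {
  const cache = new Map(); // key -> { expiresAt, promise }

  return function cachedFetch(request) {
    // Key on the serialized request; assumes callers build request objects
    // with a stable property order.
    const key = JSON.stringify(request);
    const hit = cache.get(key);
    if (hit && hit.expiresAt > now()) return hit.promise;

    // Cache the promise itself so concurrent identical requests share one
    // upstream call. A synchronous throw in fetchFn becomes a rejection.
    const promise = new Promise((resolve) => resolve(fetchFn(request))).catch(
      (err) => {
        // Do not keep failures around, but leave newer entries alone.
        if (cache.get(key)?.promise === promise) cache.delete(key);
        throw err;
      }
    );
    cache.set(key, { expiresAt: now() + ttlMs, promise });
    return promise;
  };
}
```

The injectable now parameter exists only so the expiry logic can be exercised without waiting ten minutes; in normal use the default Date.now applies.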

What I learned
I built a benchmark to validate the architecture: 9 analytics tasks × both MCPs × 2 trials, conversations from 1 to 10 turns, graded by Sonnet sub-agents against ~5 criteria per task. The bench data is in the repo; three lessons generalize beyond this project.

Tool descriptions are LLM input, not human documentation. v1.0.5
shipped with a "STRONGLY RECOMMENDED for multi-turn investigations:
stash any expensive fetch on globalThis…" paragraph in the
execute_code tool description. The intent was to teach the
stash-and-cite pattern. It backfired: dtc-mcp's tokens went up
~30% on multi-turn tasks, and one long-conversation cell timed out
at 20+ minutes (vs 7.3 min on v1.0.4). I ran a Sonnet sub-agent
ablation (5 candidate description styles × 3 trials, no API calls)
and the data was unambiguous: stash behavior was 3/3 across every
candidate, including a 35-token ultra-compressed one. The model
already had the pattern; the prose was teaching what it already did,
and the added context just inflated per-call reasoning. v1.0.6
reverted to a minimal schema description with one canonical example,
plus a state field in the response envelope so the agent can
see what's stashed structurally instead of being told to track it.
That fixed the regression. The transferable lesson: prescriptive
language doesn't get specially weighted by attention; it just adds
tokens. Optimize tool descriptions for what the model parses
structurally — schema, names, response shapes, concrete examples —
not for what a human reader finds helpful.
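
One way a response envelope could expose the stash structurally is sketched below, again using node:vm for brevity. The envelope shape, the describe summary format, and the helper stubs are assumptions for illustration; the source only says that a state field exists, not what it contains.

```javascript
const vm = require("node:vm");

// Hypothetical helper stubs injected at startup; they should not be
// reported as agent state.
const context = vm.createContext({ pick: () => {}, topN: () => {} });
const baseline = new Set(Object.keys(context));

// Describe a stashed value by shape only, never by content.
function describe(value) {
  if (Array.isArray(value)) return `array(${value.length})`;
  if (value === null) return "null";
  return typeof value;
}

// Hypothetical envelope: the result plus a structural summary of whatever
// the agent has added to globalThis so far.
function execute(source) {
  const result = vm.runInContext(source, context);
  const state = {};
  for (const key of Object.keys(context)) {
    if (!baseline.has(key)) state[key] = describe(context[key]);
  }
  return { result, state };
}
```

With an envelope like this, the agent learns what is stashed from the response itself on every call, so the tool description does not have to instruct it to keep track.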

One canonical example with the real API surface does ~99% of the teaching. Three rounds of ablation (~63 sub-agent probes, 9 candidate description formats) all converged on the same finding: once the description includes one concrete example using the real API methods with the right request shape, hallucinations drop 5× and the call pattern stabilizes. Format past that point is marginal: schema-first, multi-example, anti-example, intent-routing decision trees, TypeScript .d.ts declarations, and npm README markdown all produced essentially equivalent agent behavior. Counter-intuitively, more examples didn't help (a single canonical example beat four covering different angles), and full type declarations were actively worse (they encouraged more elaborate reasoning and slower responses). I'd been ready to spend time on cleverness; the bench told me to ship one good example and stop.

The architecture is right but bounded by docs ergonomics. On the
tasks that didn't touch segments (Klaviyo's segment resource
requires an additional-fields[segment]=profile_count workaround
to bulk-fetch counts), dtc-mcp used 13–28% fewer tokens than
Klaviyo's hand-built MCP. That is exactly the workload shape where
in-sandbox aggregation amortizes the code overhead. On
segment-touching tasks, dtc-mcp was 2–3× slower in wall-clock time
because the agent had to rediscover the documented workaround each
run: the description's canonical example shows only one API surface
(reporting), so the agent extrapolates its patterns to other
resources and invents methods. The next release (v1.1.0)
is recipe-by-intent discovery — searchable recipes covering distinct
resource shapes, modeled on Anthropic's Skills filesystem pattern —
because the bench is now unambiguous that the code-execution
architecture wins where computation lives, if the docs surface
makes the right composition obvious. Description tweaks have hit
their ceiling; runtime ergonomics is the next leverage point.