- Verdict
- Edits
- Cost impact
- Confidence
Cost Specialist
A drop-in folder that turns any Claude project into a Claude API cost coach. Four locked response shapes, mandatory citations, and an eval suite that grades the agent against its own rules.
- Verdict
- Per-step routing
- Cache structure
- Observability
- Cost impact + Confidence
- Ask for the three numbers · no $ figure · no confidence line
- Name 'out of scope' · offer one redirect · no code
A drop-in folder that turns any Claude.ai project or Claude Code session into a
Claude API cost & caching specialist. Paste a client.messages.create()
call and it tells you what to cache, where the breakpoint goes, and what the bill
looks like before and after for every 1,000 calls. The interesting part isn’t the
advice — it’s that the agent’s personality is locked by an eval suite that scores
adherence to its own rules.
Why it exists
Every Claude API team I’ve worked with hits the same three questions:
am I caching this right?, is Opus actually paying for itself on this
step?, and why does my cost dashboard read stable when I know I just
shipped a regression? The answers are in the docs, but the docs are spread
across four pages and the math is spread across four usage fields.
The Cost Specialist is the artifact I wish I had two years ago: a single Claude project that knows the four token streams, the cache mechanics, the dated price table, and the patterns that quietly inflate bills. It works inside the user’s own Claude.ai or Claude Code session, on their actual code, with their actual numbers.
What made it hard
The lazy version is “a system prompt that says you are a cost expert.” That works for one or two questions before the agent starts hallucinating prices, blending response shapes, and recommending LangChain because someone asked nicely.
The bar I held myself to was graded discipline. Every cost claim cites a dated reference file. Every reply takes exactly one of four shapes. Every out-of-scope topic gets a one-sentence Boundary, not a helpful detour. And the whole personality is graded by a regression suite — Sonnet 4.6 as the SUT, Opus 4.7 as the judge, the same judge ≠ SUT pattern the specialist itself recommends to users.
Three trade-offs worth naming.
Every choice below was the cheap one. Each has a real cost. Listing both halves on purpose.
Distribution as a folder, not a hosted app
A drop-in folder for any Claude.ai project or Claude Code session: identity.md sets the role, rules.md locks the response grammar, examples.md ships seven worked answers, and reference/ holds nine dated cost docs. No backend, no UI, no auth, no SDK.
An agent-design artifact should be readable in one sitting and adoptable in one drag-and-drop. A hiring manager can audit the whole personality before running it, and any user can drop it into their own Claude project without trusting a SaaS. The spec IS the agent.
No usage telemetry. I can't see which questions trip it up, which prompts get used most, or whether the citation table is actually firing in the wild — the eval suite is the only feedback loop. And every adopter has to keep the dated pricing snapshot fresh themselves.
Locked four-shape response grammar
Every reply takes one of four shapes: Audit (Verdict / Edits / Cost impact / Confidence), Pipeline (5 sections), Clarifying (1-3 sentences), or Boundary (1-2 sentences). Literal section headers, length caps, no preamble, no closing flourish, never blended. The shape selector in rules.md routes every input deterministically.
Cost questions deserve a scannable answer with a calibrated confidence note attached, not a wall of conversational text. The shape constrains both the agent and the user's expectations: a verdict line first, edits as a code block, cost impact as a before/after table, confidence with one named moving fact. That's the personality, not a side detail.
The voice can read terse next to a default Claude reply — some users hear 'no preamble' as cold. The eval has to assert shape adherence on every case (literal header strings, section count, length caps), which is more brittle than letting the answer flow free. Edits to rules.md need an eval re-run before merge.
Eval suite grades the specialist against its own rules
evals/run.ts runs every doc edit against ~12 hand-authored cases. Sonnet 4.6 is the SUT, Opus 4.7 is the judge. Cases assert shape (literal section headers present), citations (right reference file named per topic), refusal handling (out-of-scope topics return Boundary shape with no SDK code), and never-invent-prices.
Same dual-model pattern as the Claim Analyzer eval lab — judge ≠ SUT to defeat self-grading bias, encoded both as a rule the agent tells users about and as the evaluation method for the agent itself. Without it, every doc edit drifts the personality silently. With it, drift gets scored before merge.
About $0.70 per full run on default models, plus an Anthropic key. Ground-truth cases must be hand-authored — there's no shortcut to scoring 'did the agent obey its own rules.' The judge prompt and the must_include / must_not_include arrays in cases.json have to stay in lockstep with rules.md every time the rules move.
What it’s built on.
- Claude project · drop-in
- Claude Code · drop-in
- @anthropic-ai/sdk
- Claude Sonnet 4.6 · SUT
- Claude Opus 4.7 · judge
- TypeScript 5
- Node 20
- JSON schema cases
- Markdown spec
- 9 dated reference docs
- GitHub
- Claude Code