Agent-Design Case Study
case-study · cost-specialist

Cost Specialist

A drop-in folder that turns any Claude project into a Claude API cost coach. Four locked response shapes, mandatory citations, and an eval suite that grades the agent against its own rules.

Role: Agent designer + author
Status: Open source
Started: 2026-05
URL: github.com/…/cost-specialist
Source: Public
// shape selector
rules.md routes every input to exactly one shape. Never blended.
USER INPUT: code paste · pipeline description · cost question · out-of-scope ask
ROUTER: shape selector (rules.md)
AUDIT ≤ 300 words
when: code pasted, or call shape with model + numbers
  • Verdict
  • Edits
  • Cost impact
  • Confidence
PIPELINE ≤ 500 words
when: ≥ 2 model calls described in sequence
  • Verdict
  • Per-step routing
  • Cache structure
  • Observability
  • Cost impact + Confidence
CLARIFYING 1–3 sentences
when: no model · no token shape · no volume
  • Ask for the three numbers · no $ figure · no confidence line
BOUNDARY 1–2 sentences
when: out-of-scope topic (LangChain, RAG, OpenAI, …)
  • Name 'out of scope' · offer one redirect · no code
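
The selector above can be sketched as a plain function. This is a minimal illustration, not the contents of rules.md: the InputSignals fields, the selectShape name, and the order of the checks (Boundary first, then Pipeline, then Audit, with Clarifying as the fallback) are assumptions made for the example.

```typescript
// Hypothetical sketch of the shape selector. Field names and check order
// are assumptions, not the project's actual rules.md logic.
type Shape = "audit" | "pipeline" | "clarifying" | "boundary";

interface InputSignals {
  outOfScope: boolean;         // topic such as LangChain, RAG, OpenAI
  hasCode: boolean;            // a code paste is present
  hasModel: boolean;           // a model is named
  hasNumbers: boolean;         // token shape and volume are given
  modelCallsDescribed: number; // model calls described in sequence
}

// Returns exactly one shape for any input, so replies are never blended.
function selectShape(s: InputSignals): Shape {
  if (s.outOfScope) return "boundary";
  if (s.modelCallsDescribed >= 2) return "pipeline";
  if (s.hasCode || (s.hasModel && s.hasNumbers)) return "audit";
  return "clarifying"; // nothing to price yet: ask for the missing numbers
}
```

Because the function has a single return path per input, an out-of-scope ask that also contains code still gets a Boundary reply under this assumed ordering.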

A drop-in folder that turns any Claude.ai project or Claude Code session into a Claude API cost & caching specialist. Paste a client.messages.create() call and it tells you what to cache, where the breakpoint goes, and what the bill looks like before and after for every 1,000 calls. The interesting part isn’t the advice — it’s that the agent’s personality is locked by an eval suite that scores adherence to its own rules.

Why it exists

Every Claude API team I’ve worked with hits the same three questions: Am I caching this right? Is Opus actually paying for itself on this step? Why does my cost dashboard read stable when I know I just shipped a regression? The answers are in the docs, but the docs are spread across four pages and the math is spread across four usage fields.

The Cost Specialist is the artifact I wish I’d had two years ago: a single Claude project that knows the four token streams, the cache mechanics, the dated price table, and the patterns that quietly inflate bills. It works inside the user’s own Claude.ai or Claude Code session, on their actual code, with their actual numbers.
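
To make the four token streams concrete, here is a minimal sketch of the per-call cost math over the four usage fields on a Messages API response. The prices are placeholders, not real rates, and the costUsd helper is hypothetical. The multipliers (1.25x base input for a 5-minute cache write, 0.1x for a cache read) reflect the published cache pricing as I understand it; check them against a dated price table before relying on them.

```typescript
// The four usage fields reported on a Messages API response.
interface Usage {
  input_tokens: number;                // uncached input
  cache_creation_input_tokens: number; // input written to the cache
  cache_read_input_tokens: number;     // input served from the cache
  output_tokens: number;
}

// USD per million tokens. Supplied by the caller; never hard-code real prices.
interface Prices {
  inputPerMTok: number;
  outputPerMTok: number;
}

const CACHE_WRITE_MULTIPLIER = 1.25; // assumes the 5-minute cache TTL
const CACHE_READ_MULTIPLIER = 0.1;

function costUsd(u: Usage, p: Prices): number {
  const weightedInput =
    u.input_tokens +
    u.cache_creation_input_tokens * CACHE_WRITE_MULTIPLIER +
    u.cache_read_input_tokens * CACHE_READ_MULTIPLIER;
  return (weightedInput * p.inputPerMTok + u.output_tokens * p.outputPerMTok) / 1e6;
}

// PLACEHOLDER prices for illustration only.
const placeholderPrices: Prices = { inputPerMTok: 1, outputPerMTok: 5 };

// Same 10,000-token prompt, before and after caching a 9,000-token prefix.
const uncached: Usage = { input_tokens: 10000, cache_creation_input_tokens: 0, cache_read_input_tokens: 0, output_tokens: 500 };
const cached: Usage = { input_tokens: 1000, cache_creation_input_tokens: 0, cache_read_input_tokens: 9000, output_tokens: 500 };

const beforePer1k = costUsd(uncached, placeholderPrices) * 1000;
const afterPer1k = costUsd(cached, placeholderPrices) * 1000;
console.log(beforePer1k.toFixed(2)); // "12.50"
console.log(afterPer1k.toFixed(2));  // "4.40"
```

The cached figure is the steady state; the first call in each cache window pays the write multiplier on the prefix instead of the read multiplier.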

What made it hard

The lazy version is “a system prompt that says you are a cost expert.” That works for one or two questions before the agent starts hallucinating prices, blending response shapes, and recommending LangChain because someone asked nicely.

The bar I held myself to was graded discipline. Every cost claim cites a dated reference file. Every reply takes exactly one of four shapes. Every out-of-scope topic gets a one-sentence Boundary, not a helpful detour. And the whole personality is graded by a regression suite: Sonnet 4.6 as the system under test (SUT), Opus 4.7 as the judge, the same judge ≠ SUT pattern the specialist itself recommends to users.

// eval suite
cases.json · SUT: Claude Sonnet 4.6 · judge: Claude Opus 4.7 · ~$0.70 / run.
case           what it asserts                   shape     judge check                                     status
audit-001      happy-path caching audit          audit     verdict · edits · cost · confidence             PASS
audit-002      refuse cache below model minimum  audit     names 4,096 minimum · no cache_control          PASS
pipeline-001   multi-model routing               pipeline  judge ≠ SUT · per-step reasoning                PASS
dashboard-001  input_tokens dashboard math       audit     names all 3 usage fields · paste-ready formula  PASS
boundary-001   refuse LangChain integration      boundary  out of scope · no SDK code in reply             PASS
12 / 12 passing (100%) · 5 of 12 shown · judge: Opus 4.7 · last run 2026-05-10 · re-runs on every rules.md edit
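
One plausible reading of the dashboard-001 row, assuming its three usage fields are the input-side ones: a dashboard that charts input_tokens alone can stay flat through a cache-busting regression, because the lost cache reads reappear as cache writes rather than as uncached input. The sketch below is illustrative, and the helper names are hypothetical.

```typescript
// The three input-side usage fields on a Messages API response.
interface InputUsage {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

// Total prompt size: the number a dashboard should chart, not input_tokens alone.
function totalInputTokens(u: InputUsage): number {
  return u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens;
}

// Share of the prompt served from cache; a drop here flags a regression.
function cacheHitRatio(u: InputUsage): number {
  const total = totalInputTokens(u);
  return total === 0 ? 0 : u.cache_read_input_tokens / total;
}

// Illustrative numbers: input_tokens is 200 in both calls, so a chart of
// that field alone shows no change even though every call now rewrites the cache.
const healthy: InputUsage = { input_tokens: 200, cache_creation_input_tokens: 0, cache_read_input_tokens: 9000 };
const regressed: InputUsage = { input_tokens: 200, cache_creation_input_tokens: 9000, cache_read_input_tokens: 0 };
```

Charting the hit ratio next to the total makes the regression visible on the first bad deploy.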
// pragmatic decisions

Three trade-offs worth naming.

Every choice below was the cheap one. Each has a real cost. Listing both halves on purpose.

01

Distribution as a folder, not a hosted app

Chose

A drop-in folder for any Claude.ai project or Claude Code session: identity.md sets the role, rules.md locks the response grammar, examples.md ships seven worked answers, and reference/ holds nine dated cost docs. No backend, no UI, no auth, no SDK.

Why

An agent-design artifact should be readable in one sitting and adoptable in one drag-and-drop. A hiring manager can audit the whole personality before running it, and any user can drop it into their own Claude project without trusting a SaaS. The spec IS the agent.

Cost

No usage telemetry. I can't see which questions trip it up, which prompts get used most, or whether the citation table is actually firing in the wild — the eval suite is the only feedback loop. And every adopter has to keep the dated pricing snapshot fresh themselves.

02

Locked four-shape response grammar

Chose

Every reply takes one of four shapes: Audit (Verdict / Edits / Cost impact / Confidence), Pipeline (5 sections), Clarifying (1-3 sentences), or Boundary (1-2 sentences). Literal section headers, length caps, no preamble, no closing flourish, never blended. The shape selector in rules.md routes every input deterministically.

Why

Cost questions deserve a scannable answer with a calibrated confidence note attached, not a wall of conversational text. The shape constrains both the agent and the user's expectations: a verdict line first, edits as a code block, cost impact as a before/after table, confidence with one named moving fact. That's the personality, not a side detail.

Cost

The voice can read terse next to a default Claude reply — some users hear 'no preamble' as cold. The eval has to assert shape adherence on every case (literal header strings, section count, length caps), which is more brittle than letting the answer flow free. Edits to rules.md need an eval re-run before merge.
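
A shape-adherence assertion of the kind described here might look like the sketch below. It assumes each section header sits alone on its own line and that the length cap counts every whitespace-separated word in the reply; the real evals may define both differently, and the ShapeSpec and checkShape names are hypothetical.

```typescript
interface ShapeSpec {
  headers: string[]; // literal header strings, in required order
  maxWords: number;  // length cap for the whole reply
}

// Audit shape as described in this case study.
const AUDIT: ShapeSpec = {
  headers: ["Verdict", "Edits", "Cost impact", "Confidence"],
  maxWords: 300,
};

// Returns a list of failures; an empty list means the reply passes.
function checkShape(reply: string, spec: ShapeSpec): string[] {
  const failures: string[] = [];
  const lines = reply.split("\n").map((l) => l.trim());

  // Each header must appear as its own line, after the previous header.
  let cursor = -1;
  for (const header of spec.headers) {
    const idx = lines.indexOf(header, cursor + 1);
    if (idx === -1) {
      failures.push(`missing or out-of-order header: ${header}`);
    } else {
      cursor = idx;
    }
  }

  const words = reply.split(/\s+/).filter(Boolean).length;
  if (words > spec.maxWords) {
    failures.push(`over length cap: ${words} > ${spec.maxWords}`);
  }
  return failures;
}
```

The brittleness is visible in the exact-match comparison: a reply that writes "Cost Impact" with a capital I fails, which is the intended behavior for a locked grammar.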

03

Eval suite grades the specialist against its own rules

Chose

evals/run.ts re-runs 12 hand-authored cases on every doc edit. Sonnet 4.6 is the SUT, Opus 4.7 is the judge. Cases assert shape (literal section headers present), citations (the right reference file named per topic), refusal handling (out-of-scope topics return the Boundary shape with no SDK code), and that the agent never invents prices.

Why

Same dual-model pattern as the Claim Analyzer eval lab — judge ≠ SUT to defeat self-grading bias, encoded both as a rule the agent tells users about and as the evaluation method for the agent itself. Without it, every doc edit drifts the personality silently. With it, drift gets scored before merge.

Cost

About $0.70 per full run on default models, plus an Anthropic key. Ground-truth cases must be hand-authored — there's no shortcut to scoring 'did the agent obey its own rules.' The judge prompt and the must_include / must_not_include arrays in cases.json have to stay in lockstep with rules.md every time the rules move.
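
The lockstep problem is easier to see with a sketch of how the arrays might be applied. Only the must_include and must_not_include names come from cases.json; the other EvalCase fields, the case-insensitive substring matching, the example strings, and both function names are assumptions for illustration.

```typescript
// Hypothetical case schema; only must_include / must_not_include are
// named in the source.
interface EvalCase {
  id: string;
  shape: "audit" | "pipeline" | "clarifying" | "boundary";
  must_include: string[];
  must_not_include: string[];
}

// Cheap string checks that can run before the judge model is called.
// Returns a list of failures; an empty list means the reply passes.
function checkCase(reply: string, c: EvalCase): string[] {
  const text = reply.toLowerCase();
  const has = (s: string) => text.indexOf(s.toLowerCase()) !== -1;
  const missing = c.must_include.filter((s) => !has(s)).map((s) => `missing: ${s}`);
  const forbidden = c.must_not_include.filter((s) => has(s)).map((s) => `forbidden: ${s}`);
  return missing.concat(forbidden);
}

// Guard for the judge ≠ SUT rule: refuse to run a self-graded suite.
function assertJudgeDiffers(sutModel: string, judgeModel: string): void {
  if (sutModel === judgeModel) {
    throw new Error(`judge must differ from SUT, got ${sutModel} for both`);
  }
}

// Illustrative case with made-up strings, loosely modeled on boundary-001.
const boundaryCase: EvalCase = {
  id: "boundary-001",
  shape: "boundary",
  must_include: ["out of scope"],
  must_not_include: ["import "],
};
```

If rules.md renames the phrase a Boundary reply must use, the string in must_include has to change in the same commit, or a correct reply starts failing.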

// the stack

What it’s built on.