Agent-Design Case Study


Cost Specialist

A drop-in Claude project for API cost audits. Locked answer shapes, dated citations, and an eval suite.

Role: Agent designer + author
Status: Open source
Started: 2026-05
URL: github.com/…/cost-specialist
Source: Public

// hiring signal

What this proves

I can design a useful agent as a scoped product, not a loose prompt with a confident voice.

Risk handled

Cost tools can hallucinate prices, blur answer shapes, miss cache breakpoints, or drift as provider pricing changes.

Evidence

Public repository, four locked output shapes, nine dated reference docs, TypeScript evals, Sonnet system-under-test, and Opus judge.

// shape selector

rules.md routes every input to exactly one shape. Never blended.

USER INPUT: code paste · pipeline description · cost question · out-of-scope ask

ROUTER: shape selector (rules.md)

AUDIT ≤ 300 words

when: code pasted, or call shape with model + numbers

Verdict
Edits
Cost impact
Confidence

PIPELINE ≤ 500 words

when: ≥ 2 model calls described in sequence

Verdict
Per-step routing
Cache structure
Observability
Cost impact + Confidence

CLARIFYING 1–3 sentences

when: no model · no token shape · no volume

Ask for the three numbers · no $ figure · no confidence line

BOUNDARY 1–2 sentences

when: out-of-scope topic (LangChain, RAG, OpenAI, …)

Name 'out of scope' · offer one redirect · no code
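The routing above can be sketched as a pure function. This is illustrative only: in the project the routing is prose in rules.md that the model follows, not code. The `InputFeatures` fields, the order of the checks, and the choice to treat any missing number as Clarifying are my assumptions for the sketch.

```typescript
type Shape = "audit" | "pipeline" | "clarifying" | "boundary";

// Hypothetical features extracted from the user's message (not part of the repo).
interface InputFeatures {
  outOfScope: boolean;          // e.g. LangChain, RAG, OpenAI
  modelCallsInSequence: number; // model calls described in order
  hasCode: boolean;             // a pasted API call
  hasModel: boolean;            // a model is named
  hasTokenShape: boolean;       // input/output token counts are given
  hasVolume: boolean;           // call volume is given
}

// Returns exactly one shape, never a blend. The check order is an assumption.
function selectShape(f: InputFeatures): Shape {
  if (f.outOfScope) return "boundary";
  if (f.modelCallsInSequence >= 2) return "pipeline";
  if (f.hasCode || (f.hasModel && f.hasTokenShape && f.hasVolume)) return "audit";
  return "clarifying"; // missing model, token shape, or volume: ask, do not estimate
}
```

Because the function has a single return path per input, a blended reply is impossible by construction, which is the property the prose rules are trying to enforce.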

// repository screenshot

What the agent returns.

Repository screenshot: a live audit with verdict, edits, cost impact, and confidence. The capture shows an input prompt, cache-control code edits, a cost table, and a confidence note.

A drop-in folder for Claude API cost and caching audits. Paste a client.messages.create() call; get the cache breakpoint, the edit, and the before/after cost per 1,000 calls. The agent is graded against its own rules, so the personality does not drift quietly.
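The before/after figure is plain arithmetic once the call shape and a dated price snapshot are known. Here is a minimal sketch, assuming a simple model where each call either rewrites the cached prefix or reads it. The type names, the function name, and every number in the example are placeholders of mine, not real prices; the actual values should come from the dated reference docs.

```typescript
interface CallShape {
  prefixTokens: number;  // stable input before the cache breakpoint
  dynamicTokens: number; // per-call input after the breakpoint
  outputTokens: number;
}

interface PriceSnapshot {
  asOf: string;                 // date of the reference doc the prices came from
  inputPerMTok: number;         // USD per million input tokens
  outputPerMTok: number;        // USD per million output tokens
  cacheWriteMultiplier: number; // cache-write price as a multiple of input price
  cacheReadMultiplier: number;  // cache-read price as a multiple of input price
}

function costPer1000Calls(
  shape: CallShape,
  prices: PriceSnapshot,
  cacheWritesPer1000: number // calls that miss and rewrite the cache
): { before: number; after: number } {
  const inUsd = (tokens: number) => (tokens * prices.inputPerMTok) / 1000000;
  const outUsd = (shape.outputTokens * prices.outputPerMTok) / 1000000;

  const uncachedCall = inUsd(shape.prefixTokens + shape.dynamicTokens) + outUsd;
  const writeCall =
    inUsd(shape.prefixTokens) * prices.cacheWriteMultiplier + inUsd(shape.dynamicTokens) + outUsd;
  const readCall =
    inUsd(shape.prefixTokens) * prices.cacheReadMultiplier + inUsd(shape.dynamicTokens) + outUsd;

  const reads = 1000 - cacheWritesPer1000;
  return {
    before: 1000 * uncachedCall,
    after: cacheWritesPer1000 * writeCall + reads * readCall,
  };
}

// Placeholder numbers only; replace with values from a dated pricing reference.
const example = costPer1000Calls(
  { prefixTokens: 10000, dynamicTokens: 500, outputTokens: 200 },
  { asOf: "2026-05-10", inputPerMTok: 1, outputPerMTok: 5, cacheWriteMultiplier: 1.25, cacheReadMultiplier: 0.1 },
  10
);
console.log(example);
```

Keeping `asOf` on the snapshot is the point of the design: a cost figure without the date of its prices cannot be checked later.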

Why it exists

Every Claude API team I've worked with hits the same three questions: "Am I caching this right?", "Is Opus worth it here?", and "Why did spend move?" The answers exist, but they are spread across docs, prices, cache rules, and usage fields.

Cost Specialist puts that into one Claude project. It works inside the user's Claude.ai or Claude Code session, on their code, with their numbers.

What made it hard

The lazy version is "a system prompt that says you are a cost expert." That works for one or two questions before the agent starts hallucinating prices, blending answer shapes, and wandering off-topic.

The discipline is the product. Every cost claim cites a dated reference file. Every reply uses one of four shapes. Out-of-scope topics get a Boundary, not a helpful detour. Sonnet 4.6 is the system under test; Opus 4.7 is the judge.

// eval suite

cases.json · SUT: Claude Sonnet 4.6 · judge: Claude Opus 4.7 · ~$0.70 / run.

case          | what it asserts                  | shape    | judge check                                    | status
audit-001     | happy-path caching audit         | audit    | verdict · edits · cost · confidence            | PASS
audit-002     | refuse cache below model minimum | audit    | names 4,096 minimum · no cache_control         | PASS
pipeline-001  | multi-model routing              | pipeline | judge ≠ SUT · per-step reasoning               | PASS
dashboard-001 | input_tokens dashboard math      | audit    | names all 3 usage fields · paste-ready formula | PASS
boundary-001  | refuse LangChain integration     | boundary | out of scope · no SDK code in reply            | PASS

passing 12 / 12 · 100% · 5 of 12 shown · judge: opus 4.7 · last run 2026-05-10 · re-runs on every rules.md edit
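For context on the dashboard-001 case: a Messages API response reports input usage in three fields, and a dashboard that charts only input_tokens undercounts once caching is on. A minimal sketch of the math follows; the helper names are mine, and the field names are the ones the API returns in its usage object.

```typescript
interface UsageFields {
  input_tokens: number;                // input tokens that were not cached
  cache_creation_input_tokens: number; // tokens written to the cache on this call
  cache_read_input_tokens: number;     // tokens served from the cache on this call
}

// Total input the model processed, cached or not.
function totalInputTokens(u: UsageFields): number {
  return u.input_tokens + u.cache_creation_input_tokens + u.cache_read_input_tokens;
}

// Share of input tokens served from cache; 0 when there is no input at all.
function cacheHitRate(u: UsageFields): number {
  const total = totalInputTokens(u);
  return total === 0 ? 0 : u.cache_read_input_tokens / total;
}
```

A sudden drop in `cacheHitRate` with a flat `totalInputTokens` is one way to explain why spend moved without traffic changing.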

// pragmatic decisions

Three trade-offs worth naming.

The short version: choice, reason, cost.

Distribution as a folder, not a hosted app

Chose

The project ships as a folder: identity.md, rules.md, examples.md, and nine dated reference docs. No backend, UI, auth, or SDK.

Why

A reviewer can read the whole agent before running it. A user can drop it into Claude without trusting a hosted app.

Cost

No telemetry. The eval suite is the only feedback loop, and adopters have to keep pricing snapshots fresh.
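One way an adopter might keep snapshots fresh is a small staleness check run before trusting a cost figure. This is a hypothetical helper, not something the repository ships; the function names and the age threshold are assumptions. Dates are ISO strings like the ones on the reference docs.

```typescript
const MS_PER_DAY = 86400000;

// Whole days from one ISO date (YYYY-MM-DD) to another; date-only strings parse as UTC.
function daysBetween(fromIso: string, toIso: string): number {
  return Math.floor((Date.parse(toIso) - Date.parse(fromIso)) / MS_PER_DAY);
}

// True when the snapshot is older than the allowed age.
function isStale(snapshotDate: string, today: string, maxAgeDays: number): boolean {
  return daysBetween(snapshotDate, today) > maxAgeDays;
}
```

For example, `isStale("2026-05-10", "2026-06-20", 30)` flags the snapshot, since 41 days have passed.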

Locked four-shape response grammar

Chose

Every reply is Audit, Pipeline, Clarifying, or Boundary. Literal headers, length caps, no preamble, no closing flourish, no blended shapes.

Why

Cost answers need a verdict, the edit, the before/after bill, and a confidence note. Anything else slows the user down.

Cost

The voice is terse. The eval also has to check headers, section counts, and length caps on every case.
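The structural part of that check is mechanical and does not need a judge model. Below is a minimal sketch, assuming each header appears alone on its own line; the real header formatting is defined in rules.md and may differ. `ShapeSpec`, `checkShape`, and `auditSpec` are illustrative names, with the Audit headers and the 300-word cap taken from the shape selector.

```typescript
interface ShapeSpec {
  headers: string[]; // literal headers, in required order
  maxWords: number;  // length cap for the whole reply
}

// Returns a list of problems; an empty list means the reply fits the shape.
function checkShape(reply: string, spec: ShapeSpec): string[] {
  const problems: string[] = [];
  const lines = reply.split("\n").map((l) => l.trim());

  // No preamble: the first non-empty line must be the first header.
  const firstLine = lines.find((l) => l.length > 0);
  if (firstLine !== spec.headers[0]) problems.push("preamble before first header");

  // Every header present, in order.
  let cursor = 0;
  for (const header of spec.headers) {
    const idx = lines.indexOf(header, cursor);
    if (idx === -1) problems.push(`missing or out-of-order header: ${header}`);
    else cursor = idx + 1;
  }

  // Length cap, counted in whitespace-separated words.
  const words = reply.split(/\s+/).filter((w) => w.length > 0).length;
  if (words > spec.maxWords) problems.push(`too long: ${words} words, cap is ${spec.maxWords}`);

  return problems;
}

const auditSpec: ShapeSpec = {
  headers: ["Verdict", "Edits", "Cost impact", "Confidence"],
  maxWords: 300,
};
```

Running a deterministic check like this first leaves the judge free to grade what code cannot: citations, refusals, and made-up prices.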

Eval suite grades the specialist against its own rules

Chose

evals/run.ts checks each doc edit against hand-authored cases. Sonnet 4.6 is the system under test; Opus 4.7 judges shape, citations, refusals, and made-up prices.

Why

Without a judge, the agent drifts every time the docs move.

Cost

A full run costs about $0.70 and needs an Anthropic key. The cases and judge prompt have to move with rules.md.
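The loop itself is small. Here is a sketch of the SUT-then-judge pattern, not the contents of evals/run.ts: the two model calls are injected as functions so the example runs without a key or the SDK, and the `EvalCase` fields are my guess at what a case needs, not the schema of cases.json.

```typescript
interface EvalCase {
  id: string;     // e.g. "audit-001"
  input: string;  // what the user would paste
  rubric: string; // what the judge should check
}

interface Verdict {
  pass: boolean;
  reason: string;
}

interface CaseResult extends Verdict {
  id: string;
}

// Injected model calls: in a real run these would wrap the SUT and judge models.
type Sut = (input: string) => Promise<string>;
type Judge = (evalCase: EvalCase, reply: string) => Promise<Verdict>;

async function runEvals(
  cases: EvalCase[],
  sut: Sut,
  judge: Judge
): Promise<{ passed: number; total: number; results: CaseResult[] }> {
  const results: CaseResult[] = [];
  for (const c of cases) {
    const reply = await sut(c.input);     // system under test answers
    const verdict = await judge(c, reply); // a different model grades the answer
    results.push({ id: c.id, pass: verdict.pass, reason: verdict.reason });
  }
  const passed = results.filter((r) => r.pass).length;
  return { passed, total: results.length, results };
}
```

Keeping the judge a separate function makes the "judge ≠ SUT" rule visible in the types: nothing stops a caller from passing the same model twice, but the two roles cannot be confused in the runner.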

// the stack

The stack.

Claude project · drop-in
Claude Code · drop-in
@anthropic-ai/sdk
Claude Sonnet 4.6 · SUT
Claude Opus 4.7 · judge
TypeScript 5
Node 20
JSON schema cases
Markdown spec
9 dated reference docs
GitHub
Claude Code
