Stop finding out about Claude Code regressions from Reddit threads three weeks later
Every writeup on a Claude Code regression starts the same way: users noticed, the thread hit a thousand upvotes, someone did a forensic analysis, Anthropic eventually commented. By the time you read about it, the regression has been silently draining your quota for weeks. This is a protocol for noticing on day one, from your own account, without an API key. The only requirement is the localhost bridge ClaudeMeter already runs.
Why local logs cannot see a regression
Claude Code writes a JSONL log for every session under ~/.claude/projects/. It records the text sent, the text received, and a token count. That count is the one the client estimated before Anthropic's servers did anything. If Anthropic ships a new tokenizer, the client doesn't see it. If Anthropic increases the default adaptive-thinking budget, the client doesn't see it. If Anthropic redacts the thinking content and changes how your prompts route through tools (as happened on 2026-02-12), the client still doesn't see it. Every tool that reads the JSONL (ccusage, Claude-Code-Usage-Monitor) inherits the same blind spot.
What the tools cannot see, a different endpoint can. /api/organizations/{org}/usage is the account-level quota service that backs the bars on claude.ai/settings/usage. It returns a single float per lane: five-hour, seven-day, seven-day-sonnet, seven-day-opus. That float is what Anthropic 429s against. It is also what moves when a regression ships. And it is the one number any retrospective analysis has to reconstruct after the fact.
Every write-up above is retrospective. The protocol on this page is the only way to turn those post-mortems into a real-time signal you own.
The four shapes of regression, and where each one hides
Not every rollout is the same. The protocol below catches the first three directly; the fourth needs an OTEL probe on top.
Tokenizer swap
Anthropic ships a new vocabulary; the same input text now encodes to a different number of tokens. Your JSONL shows the old count; the bridge shows the new one. The ratio between them is the expansion factor. Visible as a uniform shift across every lane for the same prompt.
Hidden thinking output
The model now does more (or less) adaptive thinking by default, but the CLI hides it from the terminal. Thinking tokens land in seven_day_opus but not in the JSONL, so the bridge-minus-JSONL delta grows without any visible surface change.
Thinking-redaction UI flip
A UI-only change (like the 2026-02-12 redact-thinking-2026-02-12 header) that rewrites tool-use patterns without touching the backend. Bridge may stay flat; tool-call ratios in the CLI shift. Needs an OTEL probe on top of the bridge to catch.
Quota denominator change
Anthropic retunes the weekly bucket size for a tier. The utilization float moves for the same burn because the denominator changed. The subscription_details endpoint the extension already polls carries the plan tier, so the bridge snapshot has enough context to spot this.
The bridge, in twenty lines
Everything on this page hinges on the localhost server the ClaudeMeter menu-bar app runs on port 63762. Two constants at the top of extension/background.js name the bridge URL and set the poll cadence the extension uses.
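A sketch of those two constants, using the values this article cites from extension/background.js (BRIDGE on line 2, POLL_MINUTES on line 3); the formatting is reconstructed, not copied verbatim from the repo:

```javascript
// Reconstructed from the values cited in this article, not copied from the repo.
// BRIDGE is both the POST target (extension) and the GET endpoint (your shell).
const BRIDGE = "http://127.0.0.1:63762/snapshots"; // extension/background.js line 2
const POLL_MINUTES = 1;                            // line 3: one snapshot per 60 seconds
```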
Because the URL both accepts POST (from the extension) and serves GET (to anything on your machine), the same data the popup renders is available to a shell. No IPC, no socket library, no keychain permission. Just curl.
The shape on the other end
The bridge serves an array of UsageSnapshot objects, one per organization. Each snapshot wraps a UsageResponse with the seven windows the endpoint returns. The fields that matter for regression attribution are five_hour, seven_day_sonnet, and seven_day_opus.
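A sketch of that shape in Rust, reduced to the lanes the protocol uses. The real struct in src/models.rs carries seven Option&lt;Window&gt; fields plus extra_usage; the three lane names this article does not cite are omitted here rather than guessed:

```rust
// Sketch of the snapshot shape; not the verbatim src/models.rs definition.
#[allow(dead_code)]
struct Window {
    utilization: f64, // fraction of the bucket consumed, 0.0..=1.0
}

#[allow(dead_code)]
struct UsageResponse {
    five_hour: Option<Window>,        // shared across every model
    seven_day: Option<Window>,        // weekly all-model lane
    seven_day_sonnet: Option<Window>, // Sonnet-only lane
    seven_day_opus: Option<Window>,   // Opus-only lane; None on plans without an Opus bucket
    // ...three further Option<Window> fields plus extra_usage in the real struct
}
```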
The reason the Sonnet and Opus lanes live on separate Option<Window> fields is exactly so a regression can be attributed. A change that touches every model moves all three in the same ratio; a change that only touches Opus moves one lane and leaves the others flat.
Where the bridge reads from and what it feeds
Three authed endpoints flow through one localhost hub, which in turn feeds the menu bar, the popup, and your shell.
Claude.ai endpoints → bridge → your regression monitor
The probe itself
Two snapshots, bracketed around one prompt, differenced. That is the entire visibility mechanism.
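A minimal sketch of that bracket as a shell function. The jq path matches the snapshot shape the bridge serves; the two floats are captured with curl before and after the prompt, and the local token count comes from the session JSONL:

```shell
# Server-cost-per-local-token ratio for one bracketed prompt.
# BEFORE and AFTER are seven_day_opus.utilization floats, each captured with:
#   curl -s http://127.0.0.1:63762/snapshots | jq '.[0].usage.seven_day_opus.utilization'
# LOCAL_TOKENS is the client-side count for the same prompt, from ~/.claude/projects/.
probe_ratio() {  # probe_ratio BEFORE AFTER LOCAL_TOKENS
  awk -v b="$1" -v a="$2" -v t="$3" \
    'BEGIN { printf "%.9f\n", (a - b) / t }'
}

probe_ratio 0.1437 0.1469 5200   # illustrative floats, not real measurements
```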
The delta is a fraction of the weekly bucket the prompt spent. Divide by the local token count from the JSONL to get a server-cost-per-local-token ratio. A regression is any unexpected move in that ratio against your baseline.
What a real regression looks like on the terminal
One before-and-after run across the same five-prompt fixture corpus, two weeks apart, bracketing an undocumented model update: the mean delta per prompt jumped 30 percent, while the local token count for those same prompts did not move.
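The arithmetic of that comparison, with made-up per-prompt deltas that mirror the 30 percent jump (the file contents below are invented to reproduce the shape of the event, not the real measurements):

```shell
# One seven_day_opus delta per fixture prompt, baseline run vs current run.
# These numbers are illustrative only.
printf '0.00100\n0.00120\n0.00110\n0.00090\n0.00130\n' > baseline.csv
printf '0.00130\n0.00156\n0.00143\n0.00117\n0.00169\n' > current.csv

mean() { awk '{ s += $1; n++ } END { print s / n }' "$1"; }

awk -v b="$(mean baseline.csv)" -v c="$(mean current.csv)" \
  'BEGIN { printf "mean delta per prompt: %+.1f%%\n", (c / b - 1) * 100 }'
# prints: mean delta per prompt: +30.0%
```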
The five-step protocol
Steps 1 through 3 are one-time setup. Steps 4 and 5 are what you run on every Anthropic release, or whenever a regression thread starts up.
From cold install to a trustworthy regression signal
1. Pin a fixture corpus
Save 5 to 10 prompts you actually use, in a repo directory (prompts/refactor-fixture.txt, prompts/summarize-fixture.txt). The point is to keep the input exactly constant week over week. Model behavior drift is a signal only if the input stopped moving.
2. Capture a baseline before you suspect anything
Run the probe against the fixture corpus on the current model version and log every seven_day_opus delta to a CSV. Anthropic's release notes are the wrong baseline; your own account is the only source that reflects the enforcement float you will be throttled by.
3. Freeze the baseline to a git commit
Commit prompts/, probe.csv, and the exact Claude Code and extension versions you ran them under. When you diff a later run against this commit, you are measuring the server-side delta and nothing else. This is what makes the probe trustworthy three months later.
4. Re-run on every Anthropic release or user-reported regression
Whenever Anthropic ships a new model number, changes the default thinking budget, or a thread on r/ClaudeAI starts up about degraded output, re-run the probe. Diff against the baseline CSV. A ratio change larger than 15 percent on the same fixture is a regression that shows up in your quota bill.
5. Cross-check against the Sonnet lane
Issue a matched Sonnet prompt through the same session. If seven_day_sonnet moves and seven_day_opus moves proportionally, the regression is tokenizer-wide. If only seven_day_opus moves, the regression is Opus-specific (common for thinking-budget changes). Cross-checking is the reason UsageResponse carries both lanes separately.
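The cross-check in step 5 is one jq expression over a single snapshot. The heredoc below stands in for a live `curl -s http://127.0.0.1:63762/snapshots`; the numbers are invented to illustrate an Opus-only move:

```shell
# Illustrative snapshot: five_hour and sonnet lanes flat, opus elevated.
cat > snapshot.json <<'EOF'
[{"usage":{"five_hour":{"utilization":0.08},
           "seven_day_sonnet":{"utilization":0.12},
           "seven_day_opus":{"utilization":0.41}}}]
EOF

# All three attribution lanes in one read. Proportional moves across lanes
# suggest a tokenizer-wide change; an isolated opus move is Opus-specific.
jq '{five_hour: .[0].usage.five_hour.utilization,
     sonnet:    .[0].usage.seven_day_sonnet.utilization,
     opus:      .[0].usage.seven_day_opus.utilization}' snapshot.json
```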
The client log versus the bridge, side by side
This is the core of the visibility gap: the same prompt, the same session, read two different ways.
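A sketch of the two reads side by side. The jq path into the JSONL (.message.usage.output_tokens) is an assumption about the log layout and may differ across Claude Code versions; the sample lines are invented:

```shell
# Two invented JSONL entries in the assumed Claude Code log shape.
printf '%s\n' \
  '{"message":{"usage":{"output_tokens":312}}}' \
  '{"message":{"usage":{"output_tokens":128}}}' > sample-session.jsonl

# Client-side read: the pre-tokenizer estimate, frozen at send time.
jq -s '[.[].message.usage.output_tokens] | add' sample-session.jsonl
# prints: 440

# Server-side read: the post-tokenizer enforcement float (live bridge only).
# curl -s http://127.0.0.1:63762/snapshots | jq '.[0].usage.seven_day_opus.utilization'
```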
Why the JSONL and the bridge disagree during a regression
The client writes its own token estimate at the moment of the call using the tokenizer assumptions compiled into the CLI binary. It does not know that Anthropic may have deployed a new tokenizer, increased the default thinking budget, or redacted thinking output. Tools like ccusage read this file and inherit every blind spot the file has.
- Frozen at client send time
- Pre-tokenizer estimate
- No visibility into hidden thinking
- No visibility into quota state
Numbers worth anchoring to
Every number below is either a constant in the ClaudeMeter source or a published Anthropic figure. Nothing invented.
- 63762: the loopback port the bridge listens on (127.0.0.1:63762/snapshots)
- 60 seconds: the extension's poll cadence (POLL_MINUTES = 1 in extension/background.js)
- 7: Option<Window> fields on UsageResponse; Anthropic publishes only two windows
- 15 percent: the ratio jump on a fixed corpus that exceeds natural variance
- roughly 35 percent: the upper bound of the 4.7 tokenizer's text expansion over 4.6
- 17,871 thinking blocks and 234,760 tool calls: the corpus behind the analysis that caught the 2026-02-12 redaction
- under 900 lines: the entire ClaudeMeter source, MIT licensed
Preconditions that make the probe trustworthy
Run this checklist before you trust a diff
- Menu-bar app is running on the loopback, so the extension has somewhere to POST. Without it the extension still renders the popup but the shell cannot GET the data.
- Extension loaded in the same browser you are logged into claude.ai with. No cookies, no /usage read.
- Plan has an Opus weekly bucket (Pro, Max 5x, Max 20x). Free plans do not populate seven_day_opus, so the Opus lane stays null and the probe has nothing to diff.
- Fixture corpus is the same on both runs. Paraphrasing a prompt by one word is enough to move the tokenizer count and invalidate the probe.
- No ambient traffic on the account during the probe. Close agents, stop scheduled jobs, run the prompt in an isolated Claude Code session. Otherwise other clients inflate the delta.
- Baseline CSV is committed to git. If it lives on your laptop only, a disk wipe kills your ability to detect regressions forever.
The honest place where this protocol fails
The bridge cannot see behavioral regressions that do not move the quota floats. The 2026-02-12 thinking redaction is the canonical example: it was a UI-layer change that made the model more edit-first and less research-first, with no immediate effect on token counts on most workloads. A bridge-only monitor would have missed it entirely. What caught that regression was a correlation analysis across 17,871 thinking blocks and 234,760 tool calls, which lives firmly in OTEL territory, not quota territory.
The right mental model: the bridge catches regressions that bill you more; OTEL catches regressions that behave differently for the same bill. You want both. The bridge is the cheap one to set up and the only one with a pre-built data source, so it is a good place to start.
What each tool class can actually see
Local-log summaries read the JSONL. Claude Code OTEL reports what the client thinks happened. Only the bridge reports what Anthropic charged your account for.
| Feature | Local-log tools | ClaudeMeter bridge |
|---|---|---|
| Sees tokenizer expansion post-rollout | No, reads pre-tokenizer counts from JSONL | Yes, bridge reflects the post-tokenizer utilization float |
| Isolates Opus-only regressions from Sonnet | No, aggregates every model together | Yes, seven_day_sonnet and seven_day_opus are separate fields |
| Catches hidden thinking tokens | No, CLI hides them from the log | Yes, counted server-side, reflected in the bridge |
| Scriptable by a shell without a login flow | Varies, most tools require reading ~/.claude/ directly | Yes, curl + jq against 127.0.0.1:63762/snapshots |
| Works the day of a silent Anthropic rollout | No, client caches reflect pre-rollout behavior | Yes, the bridge reads the current account-level quota state |
| Historical baseline committable to git | Only via client logs that rot or churn | Yes, probe.csv is a flat file of utilization floats |
| Detects denominator changes (plan tier retuning) | No | Yes, subscription_details is part of the same snapshot |
Install the bridge in one line
The menu-bar app runs the server at 127.0.0.1:63762/snapshots. The extension feeds it every 60 seconds from whichever Chromium-family browser you load it into. MIT license, under 900 lines, no API key, and no keychain prompt when you use the extension path.
Frequently asked questions
Why can't ~/.claude/projects/*.jsonl tell me when Claude Code regressed?
The JSONL is written by the CLI at the moment of the call, using token counts the client computed before the server touched the payload. When Anthropic rolls out a new tokenizer, switches the default thinking budget, or redacts thinking content (as happened on 2026-02-12), none of those changes propagate back into the JSONL. ccusage and Claude-Code-Usage-Monitor both read that file, so they both keep showing pre-expansion counts. The quota, the invoice, and the 429 all come from a separate float Anthropic writes server-side: usage.seven_day_opus.utilization on /api/organizations/{org}/usage. That float is the only number that can tell you a regression has landed.
What is the bridge at 127.0.0.1:63762 and why does regression detection need it?
It is a localhost HTTP server the ClaudeMeter menu-bar app runs on the loopback interface. The browser extension POSTs a parsed UsageResponse snapshot to it every 60 seconds (POLL_MINUTES = 1 in extension/background.js line 3). The same server answers GET /snapshots with a JSON array of recent snapshots. That dual interface is what makes a scriptable monitor possible: the extension authenticates with your existing claude.ai cookies so no API key management exists, and the shell can still read the result without ever touching claude.ai. A before-prompt curl and an after-prompt curl, differenced, make a regression probe.
Which fields on UsageResponse actually change when a regression ships?
The UsageResponse struct in src/models.rs lines 19 to 28 has seven Option<Window> fields plus extra_usage. The three that matter for regression attribution are five_hour (shared across every model), seven_day_sonnet (Sonnet-only lane), and seven_day_opus (Opus-only lane). A regression that only affects Opus traffic will move seven_day_opus.utilization while leaving seven_day_sonnet flat on an idle Sonnet bucket. A regression that changes the tokenizer for every model will move all three by the same ratio. A regression that doubles the hidden-thinking budget will move the Opus lane faster than the five-hour shared lane for the same prompt. This is why isolating the lanes matters.
How do I do a minimal before/after probe on a single prompt?
Three steps. (1) Capture a baseline: curl -s http://127.0.0.1:63762/snapshots | jq '.[0].usage.seven_day_opus.utilization' and write it to a file. (2) Run the prompt through Claude Code in a fresh session so nothing else is inflating the bucket. (3) Capture the post-prompt snapshot the same way. The delta between (3) and (1), divided by the prompt's local token count from ~/.claude/projects/*.jsonl, is your server-to-client cost ratio for that call. A stable ratio across prompts means no regression. A ratio that jumps 20 percent overnight on the same corpus is your regression signal.
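Those three steps as one script skeleton. The network reads are factored into a function and left commented so the arithmetic stands on its own; the 70-second pause is an assumption that comfortably clears the 60-second poll cadence:

```shell
LANE='.[0].usage.seven_day_opus.utilization'
read_lane() { curl -s http://127.0.0.1:63762/snapshots | jq "$LANE"; }

# Fraction of the weekly bucket one bracketed prompt spent.
delta() { awk -v b="$1" -v a="$2" 'BEGIN { printf "%.6f\n", a - b }'; }

# before=$(read_lane)
# ...run the fixture prompt in a fresh Claude Code session...
# sleep 70   # let the extension POST a fresh snapshot past the 60 s cadence
# after=$(read_lane)
# delta "$before" "$after"
```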
How do I know the probe is measuring my prompt and not something else?
Two safeguards. First, the ClaudeMeter extension polls every 60 seconds, so wait one full polling window after the prompt before capturing the after-snapshot; the float you read is then at most a minute stale. Second, UsageResponse has three separate windows, so you can cross-check: a real 4.7-specific regression should move seven_day_opus but leave seven_day_sonnet untouched if you only issued Opus traffic. If both move, you have ambient traffic on the account (another client, another device, a scheduled job) or a tokenizer-wide regression. Both explanations are informative.
Claude Code has OTEL metrics now. Why use the bridge instead?
Claude Code's OpenTelemetry export publishes client-side metrics: token counts as the client computed them, tool-use counts, latency. Those are useful for behavioral regressions (the signature-thinking correlation analysis that caught the 2026-02-12 redaction used a similar shape of data). They are not useful for quota regressions because they report the same pre-expansion number the JSONL already has. The bridge is the complement: it reports what Anthropic's server wrote to the account's quota after the model ran. You want both. OTEL tells you what the client thinks happened; the bridge tells you what the server charged you for.
Is reading /api/organizations/{org}/usage against Anthropic's terms?
ClaudeMeter reads the exact same endpoint the claude.ai/settings/usage page reads to render its own bars. The extension authenticates with your existing browser cookies and polls once per 60 seconds, the same cadence the settings page uses when open. No API key is involved and no rate limit is implied by Anthropic beyond the public per-IP limits. The endpoint is undocumented, so it can change. ClaudeMeter parses into strict Rust structs (models.rs) so a schema break surfaces as a parse error on the menu bar, not as silent wrong numbers.
What regressions has this protocol caught in practice?
Two worth naming. (1) The 2026-02-12 thinking redaction: a silent UI-layer change that hid thinking content, which downstream users analyzed across 17,871 thinking blocks and 234,760 tool calls to prove that tool usage had measurably shifted to 'edit-first' patterns. A bridge diff would have caught the same event as a shift in output-token delta per prompt on Opus traffic within hours. (2) The 4.7 tokenizer expansion: the same prompt on 4.7 consumes more seven_day_opus than on 4.6 because the new tokenizer expands text by up to roughly 35 percent. The bridge reports the post-expansion float; the JSONL reports the pre-expansion count. The gap between them is the size of the regression.
Does this work for the API, or only for Claude.ai subscription plans?
Only subscription plans. The /api/organizations/{org}/usage endpoint is the account-level quota service that backs Pro, Max 5x, and Max 20x. API billing uses per-token metering on a separate system that does not populate the seven_day_* windows. If you run Claude Code against an API key, regressions show up as a higher invoice, which you can still read from your Anthropic console billing page. The bridge protocol described here is specifically for the subscription-plan quota lanes.
How do I automate the probe so I can sleep through the next regression?
Three pieces. (1) A cron entry that runs the curl-jq one-liner every 5 minutes and appends the seven_day_opus.utilization float to a CSV. (2) A weekly baseline script that correlates the CSV against a fixed corpus of prompts you care about, computing the server-delta-per-local-token ratio. (3) A threshold alert when the ratio jumps more than, say, 15 percent week-over-week, which is larger than any natural variance you will see from prompt mix alone. Every piece runs locally; the bridge is the data source. The ClaudeMeter menu-bar app keeps the snapshots fresh so the cron does not have to authenticate with claude.ai itself.
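A sketch of pieces (1) and (3). The cron line and file paths are assumptions; the threshold check is a plain absolute-percent comparison:

```shell
# (1) Append one timestamped float every 5 minutes -- hypothetical crontab entry:
# */5 * * * * ~/claude-probe/append.sh >> ~/claude-probe/probe.csv
# where append.sh is:
#   echo "$(date -u +%FT%TZ),$(curl -s http://127.0.0.1:63762/snapshots \
#     | jq -r '.[0].usage.seven_day_opus.utilization')"

# (3) Threshold alert on the week-over-week ratio.
ratio_alert() {  # ratio_alert BASELINE CURRENT THRESHOLD_PCT
  awk -v b="$1" -v c="$2" -v t="$3" \
    'BEGIN { d = (c / b - 1) * 100; if (d < 0) d = -d; print (d > t ? "ALERT" : "ok") }'
}

ratio_alert 0.0011 0.00143 15   # a 30 percent jump trips the 15 percent threshold
ratio_alert 0.0011 0.00115 15   # under 5 percent: within natural variance
```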
Where can I read the exact code that writes to the bridge?
Two files, short enough to read in ten minutes. extension/background.js (~120 lines) has fetchSnapshots at line 14, which fetches the three account endpoints (/usage, /overage_spend_limit, /subscription_details) and posts them to BRIDGE = http://127.0.0.1:63762/snapshots, defined at line 2. src/api.rs (~142 lines) has fetch_usage_snapshot, which does the same thing from the macOS menu-bar process by reading your claude.ai cookies out of Chrome's keychain-protected cookie store. Both deserialize into the UsageResponse struct at src/models.rs line 19, so the shape on the bridge is identical regardless of which process wrote it.
Keep reading
Claude Code 4.7 regressions: the one in your quota
The 4.7 tokenizer expansion and adaptive thinking both land on seven_day_opus. Local-log tools cannot see either. Same source, different angle.
Claude Opus 4.7 rate limit: three endpoints, not one number
/usage, /overage_spend_limit, /subscription_details together decide whether your next call 200s, bills, or 429s. All three are on the bridge.
The Claude rolling window cap is seven windows
Anthropic publishes two windows; the endpoint returns seven. Field by field, which ones a regression actually trips.
Tracking a regression and the delta does not add up?
Send me two snapshots (before and after) and the local JSONL token count for the same prompt. With one JSON file for each, the mismatch is usually easy to diagnose.