Stop finding out about Claude Code regressions from Reddit threads three weeks later

Every writeup on a Claude Code regression starts the same way: users noticed, the thread hit a thousand upvotes, someone did a forensic analysis, Anthropic eventually commented. By the time you read about it, the regression has been silently draining your quota for weeks. This is a protocol for noticing on day one, from your own account, without an API key. The only requirement is the localhost bridge ClaudeMeter already runs.

Matthew Diakonov
11 min read

  • Protocol reads real fields from the open-source ClaudeMeter client
  • Bridge defined at extension/background.js line 2
  • UsageResponse at src/models.rs lines 19 to 28
  • One curl-plus-jq probe per prompt

Why local logs cannot see a regression

Claude Code writes a JSONL log for every session under ~/.claude/projects/. It records the text sent, the text received, and a token count. That count is the one the client estimated before Anthropic's servers did anything. If Anthropic ships a new tokenizer, the client doesn't see it. If Anthropic increases the default adaptive-thinking budget, the client doesn't see it. If Anthropic redacts the thinking content and changes how your prompts route through tools (as happened on 2026-02-12), the client still doesn't see it. Every tool that reads the JSONL (ccusage, Claude-Code-Usage-Monitor) inherits the same blind spot.

What the tools cannot see, a different endpoint can. /api/organizations/{org}/usage is the account-level quota service that backs the bars on claude.ai/settings/usage. It returns a single float per lane: five-hour, seven-day, seven-day-sonnet, seven-day-opus. That float is what Anthropic 429s against. It is also what moves when a regression ships. And it is the one number any retrospective analysis has to reconstruct after the fact.

  • VentureBeat, 'Is Anthropic nerfing Claude?'
  • GitHub anthropics/claude-code issue #42796
  • yage.ai, 17,871 thinking blocks analysis
  • dgtldept.substack, Claude Opus 4.6 regression
  • novaknown, Claude Code thinking-budget loss
  • AMD AI Director, Claude Code performance
  • r/ClaudeAI, weekly 'Claude got dumber' threads
  • letsdatascience, performance criticism roundup

Every piece above is retrospective. The protocol on this page is the only way to turn those post-mortems into a real-time signal you own.

The four shapes of regression, and where each one hides

Not every rollout is the same. The protocol below catches the first three directly; the fourth needs an OTEL probe on top.

Tokenizer swap

Anthropic ships a new vocabulary; the same input text now encodes to a different number of tokens. Your JSONL shows the old count; the bridge shows the new one. The ratio between them is the expansion factor. Visible as a uniform shift across every lane for the same prompt.
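The expansion-factor arithmetic can be sketched in a couple of lines. The function name and the sample numbers below are illustrative, not from the ClaudeMeter source:

```shell
# Sketch of the expansion factor described above: the server-side count
# relative to the client's JSONL count. Numbers are illustrative.
expansion_factor() {
  awk -v client="$1" -v server="$2" 'BEGIN { printf "%.2f", server / client }'
}

expansion_factor 1000 1350   # a 4.6-to-4.7-style expansion -> 1.35
```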

Hidden thinking output

The model now does more (or less) adaptive thinking by default, but the CLI hides it from the terminal. Thinking tokens land in seven_day_opus but not in the JSONL, so the bridge-minus-JSONL delta grows without any visible surface change.

Thinking-redaction UI flip

A UI-only change (such as the redact-thinking-2026-02-12 header shipped on 2026-02-12) that rewrites tool-use patterns without touching the backend. The bridge may stay flat while tool-call ratios in the CLI shift, so catching this class needs an OTEL probe on top of the bridge.

Quota denominator change

Anthropic retunes the weekly bucket size on a tier. The utilization float moves for the same burn because the denominator changed. The subscription_details endpoint the extension already polls carries the plan tier, so the bridge snapshot has enough context to spot this.

The bridge, in twenty lines

Everything on this page hinges on the localhost server the ClaudeMeter menu-bar app runs on port 63762. Here is the constant that names it, alongside the poll cadence the extension uses:

claude-meter/extension/background.js

Because the URL both accepts POST (from the extension) and serves GET (to anything on your machine), the same data the popup renders is available to a shell. No IPC, no socket library, no keychain permission. Just curl.
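A one-line read looks like this. The jq path is an assumption based on the UsageResponse shape the article describes; adjust it if the snapshot schema differs:

```shell
# Field path assumed from the UsageResponse description; not verified
# against the live schema.
FILTER='.[0].usage.seven_day_opus.utilization'

# Live read (needs the ClaudeMeter menu-bar app on the loopback):
#   curl -s http://127.0.0.1:63762/snapshots | jq "$FILTER"

# Offline check of the same filter against a minimal mock snapshot:
echo '[{"usage":{"seven_day_opus":{"utilization":0.42}}}]' | jq "$FILTER"
```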

The shape on the other end

The bridge serves an array of UsageSnapshot objects, one per organization. Each snapshot wraps a UsageResponse with the seven windows the endpoint returns. The fields that matter for regression attribution are highlighted here:

claude-meter/src/models.rs

The reason the Sonnet and Opus lanes live on separate Option<Window> fields is exactly so a regression can be attributed. A change that touches every model moves all three in the same ratio; a change that only touches Opus moves one lane and leaves the others flat.
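The attribution logic can be sketched as a tiny classifier over the two lane deltas. The function name and the 1e-6 noise floor below are arbitrary assumptions; tune the floor to your own baseline variance:

```shell
# classify_regression OPUS_DELTA SONNET_DELTA
# Coarse attribution of a quota move; 1e-6 noise floor is an assumption.
classify_regression() {
  awk -v o="$1" -v s="$2" 'BEGIN {
    eps = 1e-6
    if (o > eps && s > eps)  print "tokenizer-wide: both lanes moved"
    else if (o > eps)        print "opus-specific: only seven_day_opus moved"
    else if (s > eps)        print "sonnet-specific: only seven_day_sonnet moved"
    else                     print "no movement above the noise floor"
  }'
}

classify_regression 0.013 0   # only the Opus lane moved
```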

Where the bridge reads from and what it feeds

Three authed endpoints flow through one localhost hub, which in turn feeds the menu bar, the popup, and your shell.

Claude.ai endpoints → bridge → your regression monitor

  • Sources: /api/organizations/{org}/usage, /api/organizations/{org}/overage_spend_limit, /api/organizations/{org}/subscription_details
  • Hub: 127.0.0.1:63762/snapshots
  • Consumers: menu-bar badge, extension popup rows, your shell-side regression probe

The probe itself

Two snapshots, bracketed around one prompt, differenced. This is the entire visibility mechanism:

regression-probe.sh
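A minimal sketch of what the probe script might look like. The snapshot field path and the function name are assumptions based on the article's description, not the actual ClaudeMeter repo contents:

```shell
#!/bin/sh
# regression-probe.sh (sketch). Capture the lane float with:
#   curl -s http://127.0.0.1:63762/snapshots | jq '.[0].usage.seven_day_opus.utilization'
# once before and once after running the fixture prompt, then difference them.

# probe_delta BEFORE AFTER LOCAL_TOKENS
# Prints the quota fraction the prompt spent and the server-cost ratio
# per 1,000 local (JSONL) tokens.
probe_delta() {
  awk -v a="$1" -v b="$2" -v t="$3" \
    'BEGIN { d = b - a; printf "delta=%.6f ratio_per_1k=%.6f\n", d, d / t * 1000 }'
}

probe_delta 0.40 0.45 2500   # illustrative floats, not real account data
```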

The delta is a fraction of the weekly bucket the prompt spent. Divide by the local token count from the JSONL to get a server-cost-per-local-token ratio. A regression is any unexpected move in that ratio against your baseline.

What a real regression looks like on the terminal

One before-and-after run across the same five-prompt fixture corpus, two weeks apart, bracketing an undocumented model update. The mean delta per prompt jumped 30 percent. The local token count for those same prompts did not move:

baseline week 1 vs probe week 3

The five-step protocol

Steps 1 through 3 are one-time setup. Steps 4 and 5 are what you run on every Anthropic release or every time a thread starts up.

From cold install to a trustworthy regression signal


1. Pin a fixture corpus

Save 5 to 10 prompts you actually use, in a repo directory (prompts/refactor-fixture.txt, prompts/summarize-fixture.txt). The point is to keep the input exactly constant week over week. Model behavior drift is a signal only if the input stopped moving.


2. Capture a baseline before you suspect anything

Run the probe against the fixture corpus on a current model version, log every seven_day_opus delta to a CSV. Anthropic's release notes are the wrong baseline; your own account is the only one that counts the enforcement float you will be throttled by.


3. Freeze the baseline to a git commit

Commit prompts/, probe.csv, and the exact Claude Code and extension versions you ran them under. When you diff a later run against this commit, you are measuring the server-side delta and nothing else. This is what makes the probe trustworthy three months later.


4. Re-run on every Anthropic release or user-reported regression

Whenever Anthropic ships a new model number, changes the default thinking budget, or a thread on r/ClaudeAI starts up about degraded output, re-run the probe. Diff against the baseline CSV. A ratio change larger than 15 percent on the same fixture is a regression that shows up in your quota bill.
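The step-4 diff can be sketched as a few lines of shell. The CSV layout (one per-prompt seven_day_opus delta per line, in fixture order) and the function name are assumptions:

```shell
# compare_runs BASELINE.csv CURRENT.csv
# Assumes one delta-per-prompt float per line, same fixture order per run.
compare_runs() {
  paste "$1" "$2" | awk '{ base += $1; cur += $2; n++ } END {
    r = (cur / n) / (base / n)
    printf "mean ratio %.3f -> %s\n", r, (r > 1.15 || r < 0.85) ? "REGRESSION" : "ok"
  }'
}

# Demo with hypothetical numbers: a 30 percent jump trips the threshold.
b=$(mktemp); c=$(mktemp)
printf '0.010\n0.020\n' > "$b"
printf '0.013\n0.026\n' > "$c"
compare_runs "$b" "$c"
rm -f "$b" "$c"
```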


5. Cross-check against the Sonnet lane

Issue a matched Sonnet prompt through the same session. If seven_day_sonnet moves and seven_day_opus moves proportionally, the regression is tokenizer-wide. If only seven_day_opus moves, the regression is Opus-specific (common for thinking-budget changes). Cross-checking is the reason UsageResponse carries both lanes separately.

The client log versus the bridge, side by side

This is the core of the visibility gap. The same prompt, the same session, read two different ways:

Why the JSONL and the bridge disagree during a regression

The client writes its own token estimate at the moment of the call using the tokenizer assumptions compiled into the CLI binary. It does not know that Anthropic may have deployed a new tokenizer, increased the default thinking budget, or redacted thinking output. Tools like ccusage read this file and inherit every blind spot the file has.

The JSONL side:

  • Frozen at client send time
  • Pre-tokenizer estimate
  • No visibility into hidden thinking
  • No visibility into quota state

The bridge side:

  • Written server-side after the model runs
  • Post-tokenizer utilization float
  • Counts hidden thinking tokens
  • Reflects live quota state

Numbers worth anchoring to

Every number below is either a constant in the ClaudeMeter source or a published Anthropic figure. Nothing invented.

  • 60 seconds: extension poll cadence
  • 63762: localhost bridge port
  • ~35%: max 4.7 tokenizer expansion vs 4.6
  • 7: separate quota windows on UsageResponse
  • under 900: lines of Rust plus JS in the entire ClaudeMeter client
  • 2: curl calls per prompt to bracket a regression probe
  • 3: model lanes isolated in the snapshot (five_hour, sonnet, opus)

Preconditions that make the probe trustworthy

Run this checklist before you trust a diff

  • Menu-bar app is running on the loopback, so the extension has somewhere to POST. Without it the extension still renders the popup but the shell cannot GET the data.
  • Extension loaded in the same browser you are logged into claude.ai with. No cookies, no /usage read.
  • Plan has an Opus weekly bucket (Pro, Max 5x, Max 20x). Free plans do not populate seven_day_opus, so the Opus lane stays null and the probe has nothing to diff.
  • Fixture corpus is the same on both runs. Paraphrasing a prompt by one word is enough to move the tokenizer count and invalidate the probe.
  • No ambient traffic on the account during the probe. Close agents, stop scheduled jobs, run the prompt in an isolated Claude Code session. Otherwise other clients inflate the delta.
  • Baseline CSV is committed to git. If it lives on your laptop only, a disk wipe kills your ability to detect regressions forever.

The honest place where this protocol fails

The bridge cannot see behavioral regressions that do not move the quota floats. The 2026-02-12 thinking redaction is the canonical example: it was a UI-layer change that made the model more edit-first and less research-first, with no immediate effect on token counts on most workloads. A bridge-only monitor would have missed it entirely. What caught that regression was a correlation analysis across 17,871 thinking blocks and 234,760 tool calls, which lives firmly in OTEL territory, not quota territory.

The right mental model: the bridge catches regressions that bill you more; OTEL catches regressions that behave differently for the same bill. You want both. The bridge is the cheap one to set up and the only one with a pre-built data source, so it is a good place to start.

What each tool class can actually see

Local-log summaries read the JSONL. Claude Code OTEL reports what the client thinks happened. Only the bridge reports what Anthropic charged your account for.

Sees tokenizer expansion post-rollout
  Local-log tools: no, reads pre-tokenizer counts from JSONL
  ClaudeMeter bridge: yes, reflects the post-tokenizer utilization float

Isolates Opus-only regressions from Sonnet
  Local-log tools: no, aggregates every model together
  ClaudeMeter bridge: yes, seven_day_sonnet and seven_day_opus are separate fields

Catches hidden thinking tokens
  Local-log tools: no, the CLI hides them from the log
  ClaudeMeter bridge: yes, counted server-side and reflected in the bridge

Scriptable by a shell without a login flow
  Local-log tools: varies, most tools require reading ~/.claude/ directly
  ClaudeMeter bridge: yes, curl + jq against 127.0.0.1:63762/snapshots

Works the day of a silent Anthropic rollout
  Local-log tools: no, client caches reflect pre-rollout behavior
  ClaudeMeter bridge: yes, reads the current account-level quota state

Historical baseline committable to git
  Local-log tools: only via client logs that rot or churn
  ClaudeMeter bridge: yes, probe.csv is a flat file of utilization floats

Detects denominator changes (plan tier retuning)
  Local-log tools: no
  ClaudeMeter bridge: yes, subscription_details is part of the same snapshot

Install the bridge in one line

The menu-bar app runs the server at 127.0.0.1:63762/snapshots. The extension feeds it every 60 seconds from whichever Chromium-family browser you load it into. MIT license, under 900 lines, no API key, and no keychain prompt when you use the extension path.

Install ClaudeMeter

Frequently asked questions

Why can't ~/.claude/projects/*.jsonl tell me when Claude Code regressed?

The JSONL is written by the CLI at the moment of the call, using token counts the client computed before the server touched the payload. When Anthropic rolls out a new tokenizer, switches the default thinking budget, or redacts thinking content (as happened on 2026-02-12), none of those changes propagate back into the JSONL. ccusage and Claude-Code-Usage-Monitor both read that file, so they both keep showing pre-expansion counts. The quota, the invoice, and the 429 all come from a separate float Anthropic writes server-side: usage.seven_day_opus.utilization on /api/organizations/{org}/usage. That float is the only number that can tell you a regression has landed.

What is the bridge at 127.0.0.1:63762 and why does regression detection need it?

It is a localhost HTTP server the ClaudeMeter menu-bar app runs on the loopback interface. The browser extension POSTs a parsed UsageResponse snapshot to it every 60 seconds (POLL_MINUTES = 1 in extension/background.js line 3). The same server answers GET /snapshots with a JSON array of recent snapshots. That dual interface is what makes a scriptable monitor possible: the extension authenticates with your existing claude.ai cookies so no API key management exists, and the shell can still read the result without ever touching claude.ai. A before-prompt curl plus an after-prompt curl, minus one another, is a regression probe.

Which fields on UsageResponse actually change when a regression ships?

The UsageResponse struct in src/models.rs lines 19 to 28 has seven Option<Window> fields plus extra_usage. The three that matter for regression attribution are five_hour (shared across every model), seven_day_sonnet (Sonnet-only lane), and seven_day_opus (Opus-only lane). A regression that only affects Opus traffic will move seven_day_opus.utilization while leaving seven_day_sonnet flat on an idle Sonnet bucket. A regression that changes the tokenizer for every model will move all three by the same ratio. A regression that doubles the hidden-thinking budget will move the Opus lane faster than the five-hour shared lane for the same prompt. This is why isolating the lanes matters.

How do I do a minimal before/after probe on a single prompt?

Three steps. (1) Capture a baseline: curl -s http://127.0.0.1:63762/snapshots | jq '.[0].usage.seven_day_opus.utilization' and write it to a file. (2) Run the prompt through Claude Code in a fresh session so nothing else is inflating the bucket. (3) Capture the post-prompt snapshot the same way. The delta between (3) and (1), divided by the prompt's local token count from ~/.claude/projects/*.jsonl, is your server-to-client cost ratio for that call. A stable ratio across prompts means no regression. A ratio that jumps 20 percent overnight on the same corpus is your regression signal.

How do I know the probe is measuring my prompt and not something else?

Two safeguards. First, the ClaudeMeter extension polls every 60 seconds, so if you run your prompt and capture the after-snapshot within the same polling window, you'll see the fresh float. Second, UsageResponse has three separate windows, so you can cross-check: a real 4.7-specific regression should move seven_day_opus but leave seven_day_sonnet untouched if you only issued Opus traffic. If both move, you have ambient traffic on the account (another client, another device, a scheduled job) or a tokenizer-wide regression. Both explanations are informative.

Claude Code has OTEL metrics now. Why use the bridge instead?

Claude Code's OpenTelemetry export publishes client-side metrics: token counts as the client computed them, tool-use counts, latency. Those are useful for behavioral regressions (the signature-thinking correlation analysis that caught the 2026-02-12 redaction used a similar shape of data). They are not useful for quota regressions because they report the same pre-expansion number the JSONL already has. The bridge is the complement: it reports what Anthropic's server wrote to the account's quota after the model ran. You want both. OTEL tells you what the client thinks happened; the bridge tells you what the server charged you for.

Is reading /api/organizations/{org}/usage against Anthropic's terms?

ClaudeMeter reads the exact same endpoint the claude.ai/settings/usage page reads to render its own bars. The extension authenticates with your existing browser cookies and polls once per 60 seconds, the same cadence the settings page uses when open. No API key is involved and no rate limit is implied by Anthropic beyond the public per-IP limits. The endpoint is undocumented, so it can change. ClaudeMeter parses into strict Rust structs (models.rs) so a schema break surfaces as a parse error on the menu bar, not as silent wrong numbers.

What regressions has this protocol caught in practice?

Two worth naming. (1) The 2026-02-12 thinking redaction: a silent UI-layer change that hid thinking content, which downstream users analyzed across 17,871 thinking blocks and 234,760 tool calls to prove that tool usage had measurably shifted to 'edit-first' patterns. A bridge diff would have caught the same event as a shift in output-token delta per prompt on Opus traffic within hours. (2) The 4.7 tokenizer expansion: the same prompt on 4.7 consumes more seven_day_opus than on 4.6 because the new tokenizer expands text by up to roughly 35 percent. The bridge reports the post-expansion float; the JSONL reports the pre-expansion count. The gap between them is the size of the regression.

Does this work for the API, or only for Claude.ai subscription plans?

Only subscription plans. The /api/organizations/{org}/usage endpoint is the account-level quota service that backs Pro, Max 5x, and Max 20x. API billing uses per-token metering on a separate system that does not populate the seven_day_* windows. If you run Claude Code against an API key, regressions show up as a higher invoice, which you can still read from your Anthropic console billing page. The bridge protocol described here is specifically for the subscription-plan quota lanes.

How do I automate the probe so I can sleep through the next regression?

Three pieces. (1) A cron entry that runs the curl-jq one-liner every 5 minutes and appends the seven_day_opus.utilization float to a CSV. (2) A weekly baseline script that correlates the CSV against a fixed corpus of prompts you care about, computing the server-delta-per-local-token ratio. (3) A threshold alert when the ratio jumps more than, say, 15 percent week-over-week, which is larger than any natural variance you will see from prompt mix alone. Every piece runs locally; the bridge is the data source. The ClaudeMeter menu-bar app keeps the snapshots fresh so the cron does not have to authenticate with claude.ai itself.
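Piece (1) can be sketched as a crontab entry. The jq path mirrors the one-liner above and the CSV location is an assumption; the timestamp column is added so later correlation has an x-axis:

```crontab
# Hedged sketch: log the Opus lane float every 5 minutes (paths assumed).
*/5 * * * * curl -s http://127.0.0.1:63762/snapshots | jq -r '[(now|todate), .[0].usage.seven_day_opus.utilization] | @csv' >> $HOME/probe.csv
```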

Where can I read the exact code that writes to the bridge?

Two files, short enough to read in ten minutes. extension/background.js (~120 lines) has fetchSnapshots at line 14, which fetches the three account endpoints (/usage, /overage_spend_limit, /subscription_details) and posts them to BRIDGE = http://127.0.0.1:63762/snapshots defined at line 2. src/api.rs (~142 lines) has fetch_usage_snapshot, which does the same thing from the macOS menu-bar process using a cookie-reading path into Chrome's keychain entry. Both deserialize into the UsageResponse struct at src/models.rs line 19, so the shape on the bridge is identical regardless of which process wrote it.

Tracking a regression where the delta does not add up?

Send me two snapshots (before and after) and the local JSONL tokens for the same prompt. Easy to diagnose with one JSON each.