
# One observability ship found three production bugs in five hours

A real case study from the InferAll gateway: shipping a single 'stuck Claude Code user' alert tracker surfaced three separate paid-user bugs that had been silently bouncing customers. Sequence: visibility, real-time signal, then three same-day fixes.

InferAll Team

6 min read
Tags: observability, engineering, case study, AI gateway, Claude Code

A short story about how cheap a real-time signal can be, and how much it can pay back if you ship it the moment you suspect there's friction nobody's complaining about loudly.

Context: the InferAll gateway sees a small but steady number of Claude Code (`claude-cli`) users every day. Most of them hit `/v1/messages` first, because Claude Code is what brought them and that's the route they configure when they paste in our base URL. We had spent a few days on copy improvements to the 402 messages (shortening them, leading with the URL, fixing misleading "Add $5" framing), but we couldn't tell whether any of it was actually helping users finish checkout.

## Step 1: snapshot block (the denominator)

The daily founder-loop snapshot already counted unique active users, paid customers, and savings vs paid providers. It didn't know about `claude-cli` as a cohort. We added one block:

```
Claude-CLI 24h: N unique users · M calls · X success · Y blocked
  versions: 2.1.191×N, 2.1.173×M
```

First reading: **2 users, 9 calls, 0 success, 9 blocked** over 24 hours. 100% block rate. The one new paying customer that week had been a `claude-cli` user, but their post-payment calls were going through raw HTTP, not `claude-cli`, so the `claude-cli` traffic itself stayed at 0% success even though the conversion was real.

That was the denominator we'd been missing. Real users were trying, getting blocked, and leaving, and we'd been treating the empty `claude-cli` success count as "no one is using it" instead of "everyone who tries is stuck."

## Step 2: real-time alert (the per-user signal)

The existing alert thresholds caught SDK auto-retry loops: `gate_storm` at 15 trial-block 402s in 60 seconds, `gate_persistent` at 30 in an hour. Real humans don't loop that hard. They try 3 or 4 times and give up, well below both thresholds.

We added a third tracker keyed on the `claude-cli` UA: **3 gate-blocks in 5 minutes → fire an alert**.
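
To make the rule concrete, here is a minimal sketch of a per-user sliding-window tracker with once-per-window dedup. It is not the gateway's actual code: the `BlockEvent` shape, the `ClaudeCliBlockTracker` class, the in-memory `Map` storage, and the `claude-cli` user-agent prefix match are all assumptions made for illustration. Only the 3-blocks-in-5-minutes rule comes from the post.

```typescript
// Hypothetical sketch: names, shapes, and in-memory storage are assumptions.
interface BlockEvent {
  userId: string;
  userAgent: string;
  timestampMs: number;
}

const WINDOW_MS = 5 * 60 * 1000; // 5-minute window, from the post
const THRESHOLD = 3; // 3 gate-blocks, from the post

class ClaudeCliBlockTracker {
  // userId -> timestamps of gate-blocks still inside the window
  private blocks = new Map<string, number[]>();
  // userId -> timestamp of the last alert, used for dedup
  private lastAlert = new Map<string, number>();

  /** Records one gate-block and returns true if an alert should fire now. */
  record(ev: BlockEvent): boolean {
    // Assumed UA format: "claude-cli/<version>"
    if (!ev.userAgent.startsWith("claude-cli")) return false;

    // Drop timestamps that have aged out of the window, then add this one.
    const cutoff = ev.timestampMs - WINDOW_MS;
    const recent = (this.blocks.get(ev.userId) ?? []).filter((t) => t > cutoff);
    recent.push(ev.timestampMs);
    this.blocks.set(ev.userId, recent);

    if (recent.length < THRESHOLD) return false;

    // Dedup: at most one alert per user per window.
    const last = this.lastAlert.get(ev.userId);
    if (last !== undefined && ev.timestampMs - last < WINDOW_MS) return false;

    this.lastAlert.set(ev.userId, ev.timestampMs);
    return true;
  }
}
```

In this sketch the third block inside five minutes returns `true`, and further blocks from the same user stay quiet until the window has passed. A real gateway would likely need shared storage rather than a process-local `Map`.
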

The new tracker uses the same dedup pattern as the existing trackers and fires once per user per window:

```
[ALERT:claude_cli_blocked] user_id=X count=3 cli=2.1.179 auth_status=trial_blocked_no_card window=5min — human-rate retries from claude-cli; high bounce risk if not converted
```

Threshold rationale: 3 is the point where someone is clearly stuck (1–2 attempts could be casual). 5 minutes roughly matches how long we expect a person to keep trying before giving up.

## Step 3: first alert fires, finds bug #1

Within an hour of deploying the tracker, it fired on the new paying customer from earlier in the week. They had come back after 12 idle hours and were hitting `trial_blocked_paid` 402s on a *paid model*, as a *paid user* with $5 of balance.

The diagnosis took 15 minutes. They had 6 keys. Three had been created before checkout (got `tier=free` + `has_paid_successfully=true` from the Stripe webhook). Three had been created after checkout (got `tier=pending` because that's the default, and the webhook only updates the specific `userKeyId` from checkout metadata, not all the user's keys).

Their CLI was rotating between keys. When it picked a `tier=pending` key:

```ts
if (project.tier === "pending") {
  const verdict = await checkTrialAllowance(env, project, route.model);
  if (!verdict.allowed) return openaiError(...) // ← this fires
}
```

`checkTrialAllowance` returns `{ allowed: false, reason: "paid" }` for any non-free model. Their `has_paid_successfully=true` was being ignored.

**Fix:** at the top of `checkTrialAllowance`, bypass the gate entirely when `has_paid_successfully=true`. A paid user's access is governed by balance + rate limits, not the 200-call trial cap. Shipped 15 minutes after the alert fired.

## Step 4: root cause #1 (post-deploy)

The bypass works around the symptom. The cause is that new keys created after a successful payment default to `tier=pending` with `has_paid_successfully=false`, even when the user clearly has other paid keys.

`handleCreateKey` already inherits `tier` from the user's existing key set (line 126: "if any existing key has `has_paid_successfully`, bump new key to `tier=free`"). It just wasn't inheriting `has_paid_successfully` itself:

```diff
+ inheritedHasPaid = rows.some((r) => r.has_paid_successfully === true);
```

```diff
  body: JSON.stringify({
    user_id: user.id,
    api_key_hash: keyHash,
    name,
    tier,
    limits,
    stripe_customer_id: inheritedCustomerId,
    stripe_subscription_id: inheritedSubscriptionId,
+   has_paid_successfully: inheritedHasPaid,
  }),
```

Defense-in-depth: the bypass covers any pre-existing mis-tagged keys in the DB; the inheritance prevents new ones from being created in the bad state.

## Step 5: same alert fires on a different user, finds bug #2

Two hours later, the tracker fired again. Different user, different region (a `claude-cli` user in APAC), different version (`claude-cli/2.1.179` vs the earlier `2.1.191`). Same pattern: 3 blocks in 5 minutes via `claude-cli`.

This time the user had `stripe_customer_id` set on their keys, but `has_paid_successfully=false`, and they were getting `trial_blocked_no_card`. The check that gates that:

```ts
if (project.has_card !== true) return { allowed: false, reason: "no_card" };
```

`has_card` is `!!stripe_customer_id`. They had a `stripe_customer_id`. So why did `has_card` evaluate false?

**Auth cache TTL lag.** We cache `ProjectConfig` per key hash for 30 seconds. They had just initiated checkout: Stripe created their customer and we wrote `stripe_customer_id` to the row, but their cached `ProjectConfig` still said `has_card=false`. The next call within the 30-second window hit the cache, got the old value, and fired the 402.

Most webhook handlers in `billing.ts` already called `invalidateCustomerKeyCache` after updating the row. The `checkout.session.completed` branch for credit-pack mode didn't.

**Fix:** add the cache invalidation call. Shipped 8 minutes after the alert fired.
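
The stale-read window is easy to reproduce in isolation. Below is a minimal sketch of a 30-second TTL cache with explicit invalidation on write. Everything here is hypothetical: `ConfigCache`, `loadConfig`, `attachCustomer`, and the in-memory `db` map stand in for the real cache and database, and the sketch invalidates by key hash for simplicity, whereas the post's `invalidateCustomerKeyCache` takes a customer ID.

```typescript
// Hypothetical sketch of TTL caching plus invalidate-on-write. Not the real gateway code.
interface ProjectConfig {
  stripe_customer_id: string | null;
  has_card: boolean;
}

const TTL_MS = 30000; // 30-second TTL, from the post

class ConfigCache {
  private entries = new Map<string, { value: ProjectConfig; expiresAt: number }>();

  get(keyHash: string, nowMs: number): ProjectConfig | undefined {
    const entry = this.entries.get(keyHash);
    if (!entry) return undefined;
    if (entry.expiresAt <= nowMs) {
      this.entries.delete(keyHash); // expired, force a reload
      return undefined;
    }
    return entry.value;
  }

  set(keyHash: string, value: ProjectConfig, nowMs: number): void {
    this.entries.set(keyHash, { value, expiresAt: nowMs + TTL_MS });
  }

  invalidate(keyHash: string): void {
    this.entries.delete(keyHash);
  }
}

// In-memory stand-in for the key row in the database (assumption).
const db = new Map<string, { stripe_customer_id: string | null }>();

// Read path: serve from cache if fresh, otherwise rebuild from the row.
function loadConfig(cache: ConfigCache, keyHash: string, nowMs: number): ProjectConfig {
  const cached = cache.get(keyHash, nowMs);
  if (cached) return cached;
  const row = db.get(keyHash) ?? { stripe_customer_id: null };
  const config: ProjectConfig = {
    stripe_customer_id: row.stripe_customer_id,
    has_card: !!row.stripe_customer_id,
  };
  cache.set(keyHash, config, nowMs);
  return config;
}

// Write path: update the row, and optionally invalidate the cached config.
function attachCustomer(
  cache: ConfigCache,
  keyHash: string,
  customerId: string,
  invalidate: boolean,
): void {
  db.set(keyHash, { stripe_customer_id: customerId });
  if (invalidate) cache.invalidate(keyHash);
}
```

With `invalidate` set to `false`, a read shortly after the write still returns `has_card: false` until the TTL expires, which is the shape of the bug described above. With `invalidate` set to `true`, the next read rebuilds the config from the updated row.
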

## Step 6: bug #3 (verifying the first alert was unrelated to the second)

The second user never came back to retry. We checked their key state to verify the fix would have helped: `stripe_customer_id` was set (good), `has_paid_successfully=false` (correct, since they hadn't completed checkout). But the cache had still been stale.

That meant the cache had also been wrong at checkout *start*, not just at checkout *complete*. The `handleCheckout` endpoint creates the Stripe customer and writes `stripe_customer_id`, but it had no cache invalidation either. Found by following the trail from the same alert:

```diff
  customerId = customer.id as string;
  await updateUserKey(env, userKey.id as string, {
    stripe_customer_id: customerId,
  });
+ await invalidateCustomerKeyCache(env, customerId);
```

Same shape as the webhook fix. Third deploy of the day. All 8 cache-write paths now invalidate.

## What we learned

- **The denominator is half the work.** "0 successful `claude-cli` calls today" is a totally different signal from "no `claude-cli` traffic today." We had been reading the latter for weeks.
- **Human-rate retries don't fit SDK thresholds.** The existing alerts (`gate_storm` 15/60s, `gate_persistent` 30/1h) were both right for SDK auto-retry detection. They missed real users entirely.
- **One alert can find multiple bugs.** The tracker fired twice in 5 hours and surfaced 3 separate bugs (one symptom workaround + two root causes), each fixed same-day. The total observability ship was ~50 lines of new code. The bug fixes were ~30 lines combined.
- **Ship the diagnostic first, fix the bug second.** Every step in the sequence above produced a same-day improvement. If we'd tried to fix all three bugs at once, we would have shipped slower and probably gotten the diagnosis wrong on at least one.

If you're running an AI gateway, billing flow, or any system where users churn silently on the 402/403 boundary: instrument the boundary.
The hot path is fast and cheap; the *quiet* failure mode is where the money leaks.