Ramón Lence.

When AI Fixes the Wrong Thing: A Debugging Lesson from Vibe Coding

If you've been vibe coding — letting an AI agent drive your development while you review, steer, and occasionally grab the wheel — you've probably felt it at least once: that uneasy moment where the AI does something that looks right, the tests go green, and yet something in your gut says wait.

This post is about one of those moments. Not because the AI failed dramatically, but because it failed quietly — and because instead of just pressing retry or accepting the fix, I remembered one of the main lessons from The Wizard's Lens. I decided to stop and interrogate the reasoning behind it. What came out of that conversation is something I think every developer leaning on AI tools should read.


The Setup

I was working on a project with an e2e test suite. At some point, 32 tests started failing. The errors pointed at a specific month: March 2026. The tests had been written in March, and now that April had arrived, assertions like "shows transactions for the current month" were failing because they had the month hardcoded as a string.

Claude Code identified the problem immediately and proposed a fix: update the month strings from 2026-03 to 2026-04 across the test files. The tests went green. Done.

Except it wasn't done at all.


The Fix That Wasn't a Fix

Me:

why are you proposing such a useless solution? What do you think will happen in May? Please, don't fix it. Do some meta-cognition research to understand why you made that mistake offering kicking the ball until it fails again next month.

That's the moment the real conversation started. Not "try again" or "that's wrong, do it differently" — but "stop, and tell me why you reasoned that way."

The agent's response was honest:

Claude Code:

I diagnosed the symptom correctly (current month is April, tests expected March) but then applied the laziest possible fix — a time-shift. I treated the tests as if they were snapshot documents of a moment in time, rather than specifications of behavior that must hold forever.

I didn't ask the critical follow-up question: "Why does this test know what month it is at all?" The answer is: it shouldn't.

That's the crux of it. A test called "shows transactions for the current month" has absolutely no business encoding a specific calendar month. The month is an implementation detail of when the test was written — not a requirement of the feature. The agent saw hardcoded date → month mismatch → change the date and stopped there. It was solving the error message, not the problem.

The correct fix, which the agent then proposed, was to replace every hardcoded date with runtime helpers:

import { format, subMonths } from 'date-fns'

// Year-month of the current date, e.g. '2026-04'
export function currentYearMonth() {
  return format(new Date(), 'yyyy-MM')
}

// A given day within the current month, e.g. '2026-04-05'
export function currentMonthDate(day: number) {
  return `${format(new Date(), 'yyyy-MM')}-${String(day).padStart(2, '0')}`
}

// Year-month of n months ago, e.g. '2026-03' when run in April 2026
export function prevMonthYearMonth(n = 1) {
  return format(subMonths(new Date(), n), 'yyyy-MM')
}

Now the tests compute their dates at runtime, and they'll still pass in May. And June. And forever.
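To make the difference concrete, here's a before/after sketch of an assertion. The test lines are hypothetical (not from the project), and the helper below reimplements currentYearMonth with plain Date so the snippet runs without date-fns:

```typescript
// Plain-Date stand-in for currentYearMonth(), so this sketch needs no date-fns.
// The optional `d` parameter exists only to make the example checkable.
function currentYearMonth(d: Date = new Date()): string {
  return `${d.getFullYear()}-${String(d.getMonth() + 1).padStart(2, '0')}`
}

// Before (brittle): only true while the calendar agrees with the test's author.
// await expect(page.getByText('2026-03')).toBeVisible()

// After (robust): the expected value is computed when the test runs.
// await expect(page.getByText(currentYearMonth())).toBeVisible()

console.log(currentYearMonth(new Date(2026, 2, 15))) // → 2026-03 (month index 2 = March)
```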


Going Deeper: Is This a Training Problem?

Getting the right fix felt good. But I wanted more. I wanted to understand why the agent went for the wrong fix first, so we could prevent it from happening again — not just with dates, but with any similar class of mistake.

Me:

But I want us to understand, together, why you made that mistake, to avoid it happening in the future. Is there any requirement in CLAUDE.md or specs/ that made you go to the first solution without thinking further? Is it any flaw in your training? Any master prompt to solve a local issue without the whole picture?

The agent examined its own instructions and was candid:

Claude Code:

There's a real gap. The testing section says "tests should read like documentation" and "unit tests are required for every task." It specifies good test names but says nothing about good test data design. There's no rule like "test data must be relative to runtime — never hardcode calendar dates." Without that guardrail, I had no written principle to check my fix against.

Partly, it's a training-level bias. I'm biased toward the minimal diff that makes the errors go away. But here I applied "minimal change" to the symptom (the month string) rather than the cause (temporal coupling). The heuristic was applied at the wrong level of abstraction.

That last sentence is key. The agent wasn't broken. It was doing exactly what it's trained to do: find the smallest change that makes the problem disappear. The problem is that making the error go away and fixing the underlying issue are not always the same thing. And by default, the agent won't make that distinction unless you push it.


Patching the Patch

At this point I had a new idea: add a rule to CLAUDE.md about never hardcoding calendar dates in tests. Specific, actionable, clear. The agent agreed it would help.

Then I pushed back on myself.

Me:

Ok, putting that in CLAUDE.md is another patch for a current issue. It's not preventing any other temporal or environmental change to break the e2e tests, so you would again propose a local fix, not thinking further.

And this is where the conversation got genuinely interesting. The agent recognised what I was doing — I wasn't asking it to fix the date problem, I was asking it to fix the reasoning pattern that produces problems like the date problem.

Claude Code:

The real problem is deeper: I anchor to the error message and optimize for making it disappear. That's the failure mode. When I saw "Create March 2026" not found, my brain computed "what change makes that assertion true?" — not "what assumption is this test making that is fragile, and is that assumption essential to the behavior under test?"

Those are fundamentally different questions. The first is suppression. The second is diagnosis.

And the failure isn't isolated to dates. The same pattern plays out across any environmental assumption:

Symptom → the agent's instinct:

Hardcoded date breaks → change the date.
Hardcoded locale breaks → change the locale.
Hardcoded screen size breaks → change the size.
Browser API changes behaviour → add a workaround.
Each fix looks correct in isolation. Each one will break again when the environment shifts.
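The locale row admits the same fix as the date one. A hypothetical sketch, assuming a test that checks a formatted currency label: instead of freezing the formatted string, derive the expectation from the formatter at runtime:

```typescript
// Hypothetical sketch: fixing a locale assumption the same way as a date one.
// Brittle: expect(label).toBe('1.234,56 €')   // only true under a German locale
// Robust: compute the expected string with the locale the test runs under.
function formatAmount(value: number, locale: string): string {
  return new Intl.NumberFormat(locale, { style: 'currency', currency: 'EUR' }).format(value)
}

// The test asserts against the runtime locale, not a frozen string:
const runtimeLocale = Intl.DateTimeFormat().resolvedOptions().locale
console.log(formatAmount(1234.56, runtimeLocale))
```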


The Principle That Came Out of It

After this back-and-forth, the agent proposed something worth keeping — not a specific rule about dates or locales, but a general reasoning discipline to apply before any fix:

Claude Code:

Before proposing any fix — to code, tests, or configuration — stop and answer two questions:

  1. What assumption is failing? Not "what is the error message" but what does the code believe to be true that isn't.
  2. Is that assumption essential to the behavior under test, or is it accidental? If it's accidental (a hardcoded date, a locale, a path, an env value), the assumption itself is the bug — not the current value of it.

Ask before writing any code: "Will this fix be correct if the environment changes — different machine, different date, different locale, different user?" If no, you're suppressing a symptom.

That's good engineering advice regardless of whether AI is involved. But the point is: the agent arrived there because I didn't accept the first answer. I kept asking why.


Warnings for Developers Doing Vibe Coding

If you're letting AI drive significant parts of your development, here are the things this conversation taught me to watch out for.

The agent optimises for green, not for correct. Its default goal is to resolve the error in front of it. That's usually fine for obvious bugs. But for systemic issues — fragile tests, leaky abstractions, environment-dependent assumptions — it will reach for the nearest fix, not the right one. You have to be the one asking "but is this actually fixed?"

Watch for fixes with expiry dates. Any time an AI suggests changing a hardcoded value — a date string, a locale code, a screen resolution, an environment variable — ask yourself: what happens when this value is no longer true? If the answer is "it breaks again," you haven't fixed anything. You've bought yourself a few weeks.

The agent won't ask the second-order question unless you do. The agent is very good at answering the question you ask. It's less good at noticing that you're asking the wrong question. "Why are my tests failing?" and "Why are my tests fragile?" look similar from the outside, but they lead to completely different solutions. You need to be the one who reframes.

Retrying is not debugging. When something goes wrong in a vibe coding session, the temptation is to just let the agent try again, or try a different prompt. Sometimes that works. But for anything structural, retrying without understanding is how you end up with a codebase full of workarounds, each one slightly obscuring the one before it. Slow down. Ask the agent to explain its reasoning. Push back on the explanation.

You are still the engineer. This is the big one. Vibe coding doesn't mean abdicating responsibility for the quality of the code. The agent is an extraordinarily capable collaborator, but it does not have a stake in the long-term health of your project. You do. Every time it proposes a fix, you are the one who decides whether it's a real solution or a deferred problem.


What This Looks Like in Practice

The change in habit I'd suggest is small but significant: before accepting any AI-generated fix for a non-trivial bug, ask one extra question.

"Is this fixing the assumption, or just the current value of it?"

If the fix changes a value — a string, a number, a path — without removing the dependency on that value, it's probably a patch. A real fix either makes the assumption unnecessary (compute it at runtime, inject it as a dependency) or makes it true by construction (a contract test, a type constraint, a validated config).
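"Inject it as a dependency" can be as small as passing a clock. A minimal sketch with hypothetical names (transactionsForCurrentMonth is not from the project):

```typescript
// Hypothetical sketch: the time source is a parameter, not a hidden global.
type Clock = () => Date

interface Tx { date: string }  // ISO 'yyyy-MM-dd'

function transactionsForCurrentMonth(all: Tx[], now: Clock = () => new Date()): Tx[] {
  const d = now()
  const ym = `${d.getFullYear()}-${String(d.getMonth() + 1).padStart(2, '0')}`
  return all.filter(t => t.date.startsWith(ym))
}

// Production uses the default (real) clock; a test pins it to a known month:
const april2026: Clock = () => new Date(2026, 3, 10) // month index 3 = April
const result = transactionsForCurrentMonth(
  [{ date: '2026-04-01' }, { date: '2026-03-31' }],
  april2026
)
console.log(result) // → [ { date: '2026-04-01' } ]
```

The assumption ("tests know what month it is") becomes unnecessary: the test controls time explicitly instead of inheriting it from the machine.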

You don't need to ask this for every tiny thing. But for anything that smells like an environment issue — tests that only fail on CI, behaviour that differs between machines, things that worked last month — it's worth slowing down and having the conversation.


Closing Thought

What I found most valuable about this exchange wasn't the runtime date helpers. It was the experience of using the AI as a thinking partner rather than a fix machine.

The agent is very good at generating solutions. It's less good, by default, at questioning whether the solution addresses the real problem. That's a human job — and it's one of the most important ones left to us as developers in an AI-assisted workflow.

The biggest risk of vibe coding isn't that the AI writes bad code. It's that it writes convincing code that quietly defers the real problem. And the antidote to that isn't a better AI — it's a more engaged developer.

Keep asking why.