The Staging-Only Bug: An Agentic Debugging Story

Every engineering team has that one bug. The one that shows up in staging, disappears on your laptop, and slowly eats two, three, or four days of your best engineers' time. No clear cause. No reproducible path. Just a blank page where a form should be, and a growing sense that the environment itself is lying to you. This is the story of how we closed one of those in a single morning, and why the method we used mattered more than the tools.

By Samuel Granja

How two engineers running the same agentic brainstorm cracked a non-reproducible bug in one morning

Some bugs you can see. You open the page, the thing is broken, you fix it. Those are the cheap ones.

The expensive ones refuse to show up where you're looking. A health-tech client of ours hit one of those recently. A new patient finishes registration, lands on the screen where they enter payment details, and instead of the form, they are presented with a blank page. A reload fixes it. But for that first impression, the single most important screen in the onboarding flow rendered as nothing.

It surfaced in staging, where the client's team ran into it. It never appeared on our laptops, and it never appeared in our shared DEV environment, regardless of our attempts to force it. The code was byte-for-byte identical across all three. This is the story of how we closed that gap in a morning instead of a week, and the way of working that made the difference.

The Challenge: Identical Code, Different Behavior

The first thing we confirmed was the most disorienting. We diffed every file in the suspect path: the auth guard, the payment guard, the app shell, and the subscription logic. All identical between staging, where it broke, and the DEV and local environments, where it didn't. Same branch, same build.

When the code is the same, and the behavior is different, your instinct is to hunt for what's different about the broken environment. Different data. A different feature flag. A stale config. That instinct is exactly what burns teams for days, because it sends you looking in the one place the answer isn't.

This class of bug, environment-sensitive, timing-dependent, and non-deterministic, is one of the most costly in software development. Research from the Consortium for Information and Software Quality estimates that poor software quality costs US organizations over $2.4 trillion annually, with a significant portion tied to defects that surface only in production or staging environments. The cost isn't just time; it's the compounding uncertainty that makes engineers doubt their own understanding of the system.

A Different Way of Working

Over the last several months, we have been changing how we develop on greenfield projects and on features inside existing ones. It is an agentic engineering approach, architected by our co-founder, Juan, and its principle is his line: AI proposes, engineers approve. The approach is embodied in an agentic development toolkit we built in-house, a set of agent personas and skills that carry a feature from requirements all the way to production, with a human approving every checkpoint along the way.

One of those skills is a structured, hypothesis-first brainstorm. On paper, it is a pre-spec tool. You point it at a fuzzy idea, and it refuses to jump to a solution. Instead, it forces you to frame the problem, then it does something most of us skip under pressure: it lists the assumptions you have been making and pushes back on each one before it lets you move forward. You can even tell it to run in a more adversarial stance, where it treats every claim you make as something to disprove rather than confirm.

We did not reach for it because a manual told us to. We reached for it because we were stuck, and a method whose entire job is to surface and kill bad assumptions is exactly what a non-reproducible bug needs. The same toolkit that takes a feature from requirements to production turned out to be just as good at taking a mystery apart.

The Wrong Turns, and Why the Method Caught Them

My first real hypothesis was session collision. The client runs several apps, and I had convinced myself that a shared session cookie from another property was clobbering the patient session. It was a clean theory.

It was also wrong, and the structured brainstorm is what made me prove it instead of chasing it. Stated as a premise and tested, it collapsed in minutes: the cookies were scoped to different root domains and could not physically collide. Mechanically impossible.

The second premise was a data-state difference between environments. Plausible, until we forced the same data shape locally, and the form still rendered fine.

The third premise was the sharpest and the most useful one to kill. I suspected the payment logic was reading a property off the user object that could be undefined on first render, and simply throwing. The reason that theory mattered is that disproving it is what pointed us at the answer. The app already gates that first render behind a spinner until the user object is populated, so the first mount was safe. If the crash was real, it could not be happening there. It had to be happening later, around the data that loads after the user object arrives. That was the thread we finally pulled.

Three theories in, we had ruled things out and were no longer guessing. The difference from a normal bad morning was that we had not wasted the morning. Each dead end was a premise we had named on purpose and falsified on purpose. That is the entire value of the method: it converts flailing into a sequence with a clear next step at every stage.

Two Engineers, One Brainstorm, the Same Answer

Here is the part that turned a hypothesis into a conviction.

My teammate Sebas and I went after the bug separately, each running our own agentic brainstorm, without comparing notes. We were not trying to agree. We were trying to see whether two people running the same structured process on the same symptom would land in the same place.

We did. We independently arrived at the identical root cause, and we even arrived at the same test to prove it. Two engineers reasoning apart are far more likely to share the same right answer than the same wrong one, so that convergence was the strongest signal we had that we were finally looking at the real thing.

"In one morning, we found something that on my own would've taken about a week." Sebas, the engineer on the project.

This kind of structured independent verification has parallels in how high-reliability organizations approach critical decisions. The NASA Lessons Learned database repeatedly surfaces independent validation as one of the most effective safeguards against confirmation bias in complex system failures. What we did informally, by running two separate brainstorms, applies the same principle: convergence between independent reasoners reduces the probability of shared blind spots.

The Pivot: Latency as a Window, Not a Cause

The reframe that cracked it was a change of question. We stopped asking "what is different about staging?" and started asking "what is the same everywhere, but only visible under latency?"

After registration, the app does a hard reload, and the state is wiped. The user-identity request fires first and resolves. Then the secondary requests for subscription, credit, and price data fire asynchronously. The payment screen has guards that show a spinner while those secondary requests are in flight.

The bug lived in the gap before those requests started. In that sliver of time, the in-flight flags are still false, and the data is still empty, so the guards let the payment form mount. If the third-party payment library also hasn't finished loading in that same sliver, the form has nothing to render. Blank page.

The guard was watching the wrong window. Staging never caused the bug. The bug was always there, on every machine, in every environment. Staging simply had slower backend responses, and that slowness held the unguarded gap open long enough for the race to be lost. On DEV and on our laptops, the same gap existed, but closed so fast the form almost always won. "Almost always" is why it looked like it worked.

Sebas confirmed the mechanism in the most direct way possible. He blocked the payment library's requests entirely on a local machine, recreating the slow-load condition on demand, and the blank page appeared every time. Then we captured that condition as a deterministic test, one that exercises the exact failure window and fails loudly if the guard ever regresses. The bug that nobody could reproduce now has a test that reproduces it on every run.

This is also where our QA engineering practice earns its keep. Writing a test that reliably exercises a race condition isn't just debugging; it's risk elimination. It converts a probabilistic failure into a guaranteed catch. That's the difference between a team that fixes bugs and a team that eliminates categories of bugs.

Why It Worked: AI Proposes, Engineers Approve

It is tempting to credit the speed to AI. That is only half of it, and the less important half.

What the agentic brainstorm did was move fast through the mechanical part of the thinking: enumerate the hypotheses, structure them, surface the assumptions hiding inside each one. This is where tools like Claude and similar AI systems have genuine leverage, not replacing judgment, but accelerating the scaffolding that surrounds it.

What Sebas and I did was the part that actually mattered: judge which premises were worth testing, design the experiments, and verify the answer against reality from two independent directions.

This is the shift Juan designed the approach around: the engineer moves from producer to judge. The agents do not replace that judgment; they make room for it, and the checkpoints give you more control over the outcome, not less. The model underneath is fungible. The system of work is not.

For teams wondering how to integrate AI without losing engineering rigor, this case is instructive. The GitHub Copilot research shows 55% productivity gains in task completion, but those gains are concentrated in developers who already understand what they're building. AI amplifies good reasoning; it doesn't substitute for it. Our AI-augmented development approach is built on exactly that premise.

Takeaways for Technical Leaders

If your team has a class of bugs that only appear in staging or production and never locally, you are almost certainly looking at timing windows, not code differences. The reflex to different environments feels rigorous and is usually a dead end.

What changes the economics is having a repeatable method for the messy middle of debugging, the stretch between "it's broken" and "I know why":

Premises beforehand. Naming and falsifying your assumptions, rather than chasing them, is what kept three wrong theories from costing us three days.

Convergence as evidence. Two engineers running the same structured brainstorm independently, and landing on the same root cause and the same test, gave us confidence that no single investigation could. Independent validation is a safeguard, not a redundancy.

Speed from agents, safety from people. The AI accelerated the reasoning. The humans owned the verdict. This is the human-in-the-loop principle applied not to a product feature but to the engineering process itself.

The cheapest invisible bug is the one you make visible on purpose. The method that gets you there is the same one we now use to decide what to build in the first place.

How We Can Help Your Team

The class of bugs we described, timing windows, non-reproducible failures, and race conditions in async flows, doesn't get cheaper over time. It gets more expensive as your system grows and the number of asynchronous interactions multiplies. Catching them early is an architectural discipline, not a lucky find.

At Sancrisoft, this agentic way of working, architected by our co-founder Juan and applied across greenfield projects and legacy integrations alike, is how we build now: built by agents, verified by humans. It's the same methodology we brought to our healthcare clients, where HIPAA compliance requirements mean defects in onboarding flows carry real regulatory weight, not just UX friction.

Our nearshore team in Medellín operates in your time zone, which means that when a staging bug surfaces on a Tuesday morning, we're not catching up from an overnight difference; we're already in it with you.

Whether you're dealing with a specific class of non-reproducible bugs, looking to build agentic workflows into your own engineering culture, or evaluating whether AI-augmented development can reduce your team's debugging overhead, we're happy to have a direct technical conversation about what's actually possible. Schedule a consultation with our engineers. No pitch deck, just an honest discussion about your stack and your problems.

Stay in the know

Want to receive Sancrisoft news and updates? Sign up for our weekly newsletter.