Building a Production Payroll System in 28 Working Days: A Real Case Study of Claude Code + Our Agentic Harness

Numbers are easy to throw around. "3x faster", "10x developer." We hear those claims constantly in AI tooling marketing, and most of them are backed by nothing more than vibes and demo code that never ships to production.

This is not that kind of article.


By Alejandro Marin

I built a full production payroll management system from scratch, as a solo engineer, in 28 working days of actual engineering time, using Claude Code and our internal agentic harness. Every number below is derived from the Git history of the project repository. GitHub's research on AI-assisted development documents productivity gains of 35–55% on coding tasks. What I experienced was closer to a 3x reduction in working days for a full production system. Here is exactly what that looked like.

This is how AI is actually transforming software development at the feature level: not in aggregate statistics, but in timestamps, commit counts, and decisions made under real constraints.

What Got Built

Before getting into the how, here is what the system actually does:

  • Multi-currency payroll management (USD, COP, EUR) with live exchange rates.
  • Role-based access control for administrators, HR managers, and employees.
  • Employee lifecycle management with bank accounts and contractor support.
  • Payroll periods, payment records, deductions, additions, and batch approvals.
  • Vacation management with automated balance tracking and date-range support.
  • Bank-grade data security: encrypted personal and financial data with a full audit trail following OWASP security standards.
  • Employee service portal with payment history.
  • Real-time collaborative notes with direct navigation to any record in the system.
  • PDF and Excel report exports.
  • Full internationalization in English and Spanish.
  • 65 automated test files covering unit, integration, and end-to-end scenarios with Playwright.

The Numbers From Git History

Timeline: 28 working days of total engineering time, 427 commits, 447 files created, ~103,000 lines of code created and tested.
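Totals like these can be recomputed from any repository. The sketch below is one possible way to do it, not the script used for this project: it parses the text printed by `git log --numstat --format=@%H`, where each commit starts with an `@` marker line followed by tab-separated `added`, `deleted`, `path` rows. The `RepoStats` type, the `summarize_numstat` helper, and the sample log are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    commits: int
    lines_added: int
    lines_deleted: int
    files_touched: int


def summarize_numstat(log_text: str) -> RepoStats:
    """Summarize the text produced by: git log --numstat --format=@%H"""
    commits = added = deleted = 0
    paths = set()
    for line in log_text.splitlines():
        if line.startswith("@"):  # one marker line per commit
            commits += 1
            continue
        parts = line.split("\t")
        if len(parts) != 3:  # blank separator lines between commits
            continue
        plus, minus, path = parts
        paths.add(path)
        if plus.isdigit() and minus.isdigit():  # binary files report "-"
            added += int(plus)
            deleted += int(minus)
    return RepoStats(commits, added, deleted, len(paths))


# Hand-written sample in the same shape as the git output (not project data).
SAMPLE_LOG = (
    "@a1\n\n10\t0\tapp/payroll.py\n2\t1\tREADME.md\n"
    "@b2\n\n-\t-\tlogo.png\n5\t3\tapp/payroll.py\n"
)
stats = summarize_numstat(SAMPLE_LOG)
print(stats)
```

On a real repository, the input could come from `subprocess.run(["git", "log", "--numstat", "--format=@%H"], capture_output=True, text=True).stdout`. This sketch counts distinct paths touched; counting files created would need a separate pass such as `git log --diff-filter=A --name-only`, and renamed paths would need extra handling.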

The Workflow: How Our Agentic Harness Changes the Equation

Sancrisoft's agentic development harness is built on top of Claude Code. The key difference from a plain AI coding assistant is that every feature goes through structured phases, each with a human checkpoint before proceeding. This is the same architecture described in our AI agents for software development article: specialized agents with defined mandates, separated by human approval gates.

The Six Phases:

  1. /brainstorm: Clarify requirements before committing to a solution. What problem are we actually solving? What constraints exist that aren't obvious from the ticket?
  2. /spec: Produce a formal requirements document and technical design.
  3. /build: Implement against the approved spec. The agent works within the boundaries defined in the spec document, not from an open-ended prompt.
  4. /review: Multi-agent code review that stress-tests the implementation before human eyes see it. Security, correctness, edge cases, and accessibility are all checked systematically before the pull request is opened.
  5. /test: Risk-driven QA coverage focused on what can actually break in production. Not coverage metrics, but actual risk registers that map tests to identified failure modes.
  6. /doc-garden: Detect and fix drift between documentation and actual code. The system that ships is the system that is documented.

The artifacts from this process are real engineering documents: 29 feature specs, 38 Architecture Decision Records, and 15 QA risk registers, all linked to the code that implements them.
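The control flow behind those six phases can be sketched in a few lines. This is a minimal illustration of "phases separated by human approval gates", not the harness itself: `run_pipeline`, `GateResult`, the `produce` and `approve` callbacks, and the retry limit are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List

# Phase names taken from the list above; everything else is an assumption.
PHASES = ["brainstorm", "spec", "build", "review", "test", "doc-garden"]


@dataclass
class GateResult:
    phase: str
    artifact: str
    approved: bool


def run_pipeline(
    feature: str,
    produce: Callable[[str, str, int], str],
    approve: Callable[[str, str], bool],
    max_attempts: int = 2,
) -> List[GateResult]:
    """Run each phase in order; stop if a gate is never approved."""
    history: List[GateResult] = []
    for phase in PHASES:
        for attempt in range(1, max_attempts + 1):
            artifact = produce(phase, feature, attempt)  # agent output
            ok = approve(phase, artifact)                # human decision
            history.append(GateResult(phase, artifact, ok))
            if ok:
                break
        else:
            return history  # every attempt rejected: do not proceed
    return history


def demo_produce(phase: str, feature: str, attempt: int) -> str:
    return f"{phase} v{attempt} for {feature}"


def demo_approve(phase: str, artifact: str) -> bool:
    # Stand-in reviewer: sends the first spec draft back once.
    return not artifact.startswith("spec v1")


history = run_pipeline("vacation date ranges", demo_produce, demo_approve)
for step in history:
    print(step.phase, "approved" if step.approved else "rejected")
```

The point of the design is that a rejected artifact never flows into the next phase: the loop either retries the same phase or stops.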

What the Human Actually Does

This is the part that gets misunderstood most often. The human's role is not to type less. It is to evaluate, approve, redirect, or reject AI-generated artifacts at each gate.

The engineer becomes a judge, not just a producer. This is the resolution of the human-in-the-loop paradox. By delegating mechanical execution to AI, the human role elevates to the work that actually determines whether the system holds up: architecture decisions, security evaluations, edge case identification, and the judgment calls that no AI can make reliably without human context.

Our structured AI development workflow formalizes this at the process level. What the payroll project demonstrated is what it looks like in practice, phase by phase, feature by feature, 427 commits deep.

Feature Delivery: The Honest Numbers

These are actual delivery times compared to conservative estimates for the same feature set and quality bar:

| Feature | Complexity | Actual Delivery | Dev Estimate | Speedup |
| --- | --- | --- | --- | --- |
| Security & Data Encryption: encrypted personal and financial data, staged migration, audit trail | Very High | 6 days | 4–5 weeks | ~4x |
| Real-time Collaboration: live notes, deep navigation links, WCAG accessibility, internationalization | High | 3 days | 3–4 weeks | ~5x |
| Employee Self-Service Portal: payment history and role-based access | Medium | 4 days | 1.5–2 weeks | ~3x |
| Vacation Date-Range Support: automated server-side balance calculation | Low–Med | 1 day | 3–5 days | ~3x |
| Batch Payment Approvals: invoice redesign and bulk processing | Medium | ~1.5 weeks | 3–4 weeks | ~2x |

Full project: A payroll system of this scope (multi-currency, encrypted data, real-time features, self-service portal, and comprehensive test coverage) would take one senior developer roughly 16 weeks (~80 working days) working traditionally. The same system shipped in 28 working days. That is roughly 3x fewer days, because the AI absorbed the mechanical work while the engineer focused on review, architecture, and quality decisions.
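The arithmetic behind the "roughly 3x" figure is simple enough to check by hand. The snippet below is a small sketch of that calculation, assuming five working days per week (which is what turns 16 weeks into ~80 working days); the `speedup_range` helper is a name introduced here for illustration.

```python
WORKING_DAYS_PER_WEEK = 5  # 16 weeks ~ 80 working days, as stated above


def speedup_range(actual_days: float, est_weeks_low: float, est_weeks_high: float):
    """Return (low, high) speedup given an estimate range in weeks."""
    low = est_weeks_low * WORKING_DAYS_PER_WEEK / actual_days
    high = est_weeks_high * WORKING_DAYS_PER_WEEK / actual_days
    return low, high


# Full project: 28 working days against a 16-week estimate.
full_project = speedup_range(28, 16, 16)

# Encryption feature: 6 days against a 4-5 week estimate.
encryption = speedup_range(6, 4, 5)

print(f"full project: {full_project[0]:.2f}x")
print(f"encryption:   {encryption[0]:.2f}x to {encryption[1]:.2f}x")
```

For the full project this gives 80 / 28, a little under 2.9, which the article rounds to 3x.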

What Actually Surprised Me (The Real Pros)

Requirements first eliminates the most expensive mistakes

For the data encryption feature, the /spec phase generated a design document that identified critical infrastructure constraints before a single line of code was written. In a traditional workflow, those same constraints would have surfaced during testing or in production, costing days of rework. Here they appeared in a document reviewed in 20–30 minutes.

This is the core argument of our AI workflow methodology: specifications should precede implementation, not follow it. The payroll project ran this experiment at scale across 29 features, and the result was consistent. Front-loading the spec eliminated the most expensive category of bugs: building the wrong thing.

Documented decisions pay compounding returns

The repository has 38 Architecture Decision Records. Each one is linked from the feature spec that generated it. When a later feature needed to extend the encryption model, the build agent referenced earlier decisions automatically and produced an implementation consistent with prior choices, without me having to recall anything from memory.
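How the build agent finds earlier decisions is not described here, but the idea of "look up prior ADRs before extending a feature" can be illustrated with a simple topic index. Everything in this sketch is hypothetical: the `ADR` record, its fields, the `prior_decisions` helper, and the sample entries are invented for illustration and are not the project's actual records.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class ADR:
    number: int
    title: str
    source_spec: str          # the feature spec that generated the decision
    topics: FrozenSet[str]


def prior_decisions(adrs: Iterable[ADR], topics: Iterable[str]) -> List[ADR]:
    """Return ADRs sharing at least one topic, oldest decision first."""
    wanted = set(topics)
    matches = [adr for adr in adrs if adr.topics & wanted]
    return sorted(matches, key=lambda adr: adr.number)


# Invented sample records, only to show the lookup.
ADRS = [
    ADR(12, "Notes sync transport", "spec-notes", frozenset({"realtime"})),
    ADR(7, "Key rotation policy", "spec-encryption", frozenset({"encryption"})),
    ADR(3, "Field-level encryption", "spec-encryption",
        frozenset({"encryption", "pii"})),
]

for adr in prior_decisions(ADRS, ["encryption"]):
    print(f"ADR-{adr.number:03d} {adr.title} (from {adr.source_spec})")
```

A lookup like this only works because each decision is recorded once, linked to its spec, and tagged; that is the discipline the compounding return depends on.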

Multi-agent code review catches what solo review misses

The /review phase on the real-time collaboration feature caught a subscription firing unnecessarily for non-admin users, a security edge case in the link-parsing logic, and three missing WCAG accessibility attributes, before the pull request was even opened.

These are precisely the issues that slip through when a single reviewer is tired and moving fast. Multi-agent review applies systematic coverage across security, correctness, and standards compliance every time, not just when the reviewer has energy for it.

Risk-driven testing focuses coverage on what matters

Every feature ships with a risk register that identifies the highest-probability failure modes. Our QA practice runs against those risks, not just against coverage metrics. The result is a test suite that actually protects against regression rather than padding a percentage. All test files, across unit, integration, and end-to-end scenarios, are traceable back to identified risks rather than generated opportunistically.
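Traceability between risks and tests can be checked mechanically. The sketch below shows one way to do that under assumed data shapes: the `Risk` record, the 1 to 5 severity scale, the `uncovered_risks` and `orphan_tests` helpers, and the sample register are all illustrative, not the project's actual format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Risk:
    risk_id: str
    description: str
    severity: int  # assumed scale: 1 (low) to 5 (high)


def uncovered_risks(risks: List[Risk], tests_by_risk: Dict[str, List[str]]) -> List[Risk]:
    """Risks with no mapped test, most severe first."""
    gaps = [r for r in risks if not tests_by_risk.get(r.risk_id)]
    return sorted(gaps, key=lambda r: -r.severity)


def orphan_tests(risks: List[Risk], tests_by_risk: Dict[str, List[str]]) -> List[str]:
    """Tests mapped to a risk id that is not in the register."""
    known = {r.risk_id for r in risks}
    return sorted(
        test
        for risk_id, tests in tests_by_risk.items()
        if risk_id not in known
        for test in tests
    )


# Invented sample register and mapping, only to show the checks.
REGISTER = [
    Risk("R1", "Balance goes negative on overlapping vacation ranges", 4),
    Risk("R2", "Non-admin user can read another employee's payments", 5),
    Risk("R3", "Exchange rate missing for a payroll period", 3),
]
TESTS = {
    "R1": ["test_overlapping_ranges.py"],
    "R3": [],
    "R9": ["test_legacy_export.py"],
}

print([r.risk_id for r in uncovered_risks(REGISTER, TESTS)])
print(orphan_tests(REGISTER, TESTS))
```

A check like this makes "every test traces to a risk" something a CI job can enforce rather than a convention people remember.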

The Honest Cons

No case study is credible without them.

The process has overhead that does not scale down to small changes. A minor bug fix does not need a full requirements document, architectural decisions, and a risk register. Using the complete pipeline for trivial tasks creates unnecessary friction. The solution is straightforward: for small changes, drop the pipeline entirely and work directly with Claude. The full workflow is for features; Claude alone is for fixes. Once that distinction becomes natural, the overhead disappears.

AI-generated work requires experienced human oversight. The workflow accelerates delivery significantly, but it depends on an engineer who can evaluate whether the AI's output is actually correct, not just whether it looks correct. The pipeline raises the quality floor. It does not replace judgment. This is the principle at the center of our AI development manifesto: AI proposes, engineers approve. That dynamic is not optional; it is what makes the quality model hold.

You must understand what you’re approving. The human checkpoint is only as good as the human at the checkpoint. If you treat reviews as a rubber stamp, the quality model breaks. The workflow gives you better material to evaluate, but it cannot substitute for a genuine understanding of what is being built and why.

The Real ROI

28 working days of engineering time against a comparable solo build that would require roughly 16 weeks is approximately 3x fewer days invested. But raw efficiency is still not where the value lives. The value is in the kind of work those 28 days contained.

In those 28 working days with our agentic harness, I spent the majority of my time evaluating architecture, reviewing security decisions, and catching edge cases in generated specs, not writing boilerplate. The encryption feature would not just have been slower without this workflow; it would have been riskier, with implicit decisions scattered across commit messages rather than explicit, reviewable documents. The real-time collaboration feature would not have shipped in three days; it would have taken three weeks, and the accessibility and internationalization coverage would have been deferred to "next sprint."

The documentation in this project is not a side effect of using the harness. It is a first-class deliverable of every feature. That is what makes the pace sustainable: every new feature builds on a base that is actually documented, not just familiar.

For comparison, our full-stack social network built in five days used the same underlying harness principles; the payroll project scales that approach to a significantly larger and more complex system over a longer timeline.

Is It Worth It?

For a small team building a non-trivial production system: yes.

The workflow rewards engineers who are strong reviewers and architectural thinkers. It compounds on teams that invest in their knowledge base. The caveat: it struggles when people treat the checkpoints as overhead to skip rather than the mechanism that keeps AI-generated work trustworthy.

28 working days to ship a system that would otherwise take ~80 days is defensible and conservative. It is also what makes the pace sustainable for a solo engineer without burning out. But that number is still not the real question.

The real question is not how many lines you ship per day. It is whether the system you shipped will hold up in six or twelve months, and whether the next engineer who touches it will understand why it was built the way it was.

With our agentic harness, both answers are yes. The technical debt is documented and intentional, not accumulated by default.

What This Means for Your Team

A 28-day solo build of a production payroll system is not a benchmark to copy. It is a data point about what becomes possible when AI handles structured execution and an experienced engineer handles judgment.

The question for your team is not whether to use AI tools. It is whether you have the workflow structure that makes AI assistance reliable rather than risky. The difference between a team that ships production-grade software with AI and a team that accumulates vibe-coded technical debt is not the tools; it is the checkpoints, the artifacts, and the discipline to treat AI output as a starting point for human judgment, not as the final answer.

At Sancrisoft, our web development and DevOps teams run this harness across client engagements: the same phased pipeline, the same ADR discipline, the same risk-driven QA approach that produced the numbers above. The system ships faster. The documentation ships with it. And the next engineer who touches the codebase six months later has a fighting chance of understanding why it was built the way it was.

If you want to see how AI-augmented workflows would apply to your team's specific technical context (the stack, the timeline, and the risk profile), schedule a conversation with us. We'll walk through what the harness looks like applied to your problem, and give you an honest assessment of where it creates leverage and where it doesn't.
