
How I shipped 2 Months of features in 3 weeks with LLM agents

​(and what it means for engineering leaders)

​In three weeks, I pushed 15 brand-new features into production for ComPaRe EMA, the mobile app we’re building for AP‑HP.

Without LLM agents, that same backlog would have taken two months to develop.

Without AI, the team estimated two months of development to ship the remaining 15 features to production.

​This isn’t about typing speed. It’s not “Copilot writes code so I don’t have to.”

The real shift is a new way of designing, coding, and managing software projects — what I call Right-First-Time Agentic Coding (a bit longer than "vibe planning"). In this model, conception dominates completion: you engineer the whole flow with an AI agent before you write a single line of code.

As a CTO, I deliberately stepped back into the role of developer on this project. You can't manage, train, or make strategic calls about AI-assisted development if you haven't lived through the details yourself. Buying Copilot licenses for everyone and "waiting to see" is a recipe for mediocrity. There is no better way to understand the future of coding than by experimenting yourself.

ComPaRe EMA captures contextual insights on daily life with a chronic disease to foster research.

​Let’s talk about productivity

"AI productivity" stats are often illusory: they measure speed, not value.

Take, for instance, the headline that AI boosts commits by 17%.

If your customer still receives the same app on the same date with identical quality, that 17% is a 0% productivity gain.

In fact, the illusion of faster typing can backfire: studies show that teams who rely on AI for "speed" sometimes experience longer lead times on real open-source repository issues.

According to the Lean Tech Manifesto, what matters most is the Value for the Customer. So in lean terms, productivity = value delivered ÷ cost.

For my customer (AP-HP), value means:

  • Lead time — do they get their app sooner, so research can start earlier?
  • Features – does it deliver more features?
  • Research impact — does the app collect more diverse or better data (less bias, better UX, stronger performance), so researchers can advance their work faster?

For Theodo, cost translates to the man-hours and resource allocation needed to achieve the same quality.

So what should I measure?

  • Conception-to-production time: how fast we move features from idea to users' hands.
  • Rework rate: how much rework (defects, bugs, rejected features) we avoid.
  • External bottleneck time: time spent waiting on non-development gatekeepers (App Store review, QA validation, third-party API support, …).

With that framing, let’s look at how LLM agents really changed my productivity.


1. Conception > Completion: the real productivity gains come from better design, not faster typing

I started experimenting with this method at the end of August, right before GitHub announced its "Spec-Driven Development" (September 2nd, 2025). The two approaches line up pretty well, so I'll focus on what worked for me.

The biggest wins didn’t come from writing code faster, but from designing a feature right first time so it’s ready to ship.

The /technical-strategy Claude command

technical-strategy.md

At the heart of my workflow is a single slash command (adapted from an initiative by François Hendriks) that orchestrates the entire feature-design cycle. It breaks the process into three phases:

  • Phase 1: Codebase exploration
  • Phase 2: Targeted questions & discussion
  • Phase 3: Technical strategy write-up

Below I walk through each phase using the Export‑Study feature from ComPaRe as a running example.

The /technical-strategy Claude command

Phase 1: Codebase exploration

The performance of an agent all comes down to context engineering. In this first phase, Claude must build its own context for the following steps. My job is to keep a constant mental picture of that context by asking these questions:

  • Do I have enough context space left for the remaining work? (the window is ≈ 200k tokens)
  • Is the context poisoned, i.e. does it contain tokens that reduce the likelihood of success?
  • Conversely, does it lack important information from the codebase? Are any critical tokens missing?
https://ccusage.com/guide/statusline is one option for keeping a constant eye on your context window.

In this example, the context window is healthy at the moment (24% used), and Claude didn't read parts of the codebase that contain technical debt or truly irrelevant files. However, a close read of its analysis shows that it never mentions the S3 bucket. If it were a junior dev, I would immediately realize he would struggle to store the exported data without knowing which storage technology we're using.


After I pointed out the S3 bucket, its analysis looked better.

Let’s move on to the discussion …

Phase 2: Discussion

Now comes the best part: Claude asking targeted questions. This is when the magic happens. Many bad architecture decisions I would otherwise have made were caught at this stage, avoiding a lot of rework.

Let’s see how it works with our Export feature. Claude asked me 10 questions following its exploration.

​Technical discussion prompt

I guide Claude to ask questions on subjects that are frequent causes of defects, like bad data structures or security issues.


So far, so good: I was already planning to set up a background task and add a new dedicated bucket (Claude would have missed the bucket entirely if I hadn't pointed it out earlier).


​I don’t agree with question 6. The links to the documents are already embedded in document-type answers. Listing files from the buckets could lead to missing or pulling the wrong documents. If I hadn’t clarified that, it would have turned into unnecessary rework once the feature was implemented.


Question 7 is excellent! I hadn't considered the sheer size of participant-uploaded images and videos… so streaming from bucket to bucket is essential; otherwise a naïve in-memory approach would have bled server memory and caused timeouts. The archiver pattern is the safest route: it lets us start writing the ZIP before all files are fetched, and we can still apply compression on the fly.

Had I ignored this nuance, a defect would have surfaced only after ingesting a few weeks’ worth of real data. By flagging it now, we avoid that rework and keep the export service horizontally scalable.
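
For concreteness, here is a minimal sketch of the kind of bucket-to-bucket streaming export discussed above. It assumes the AWS SDK v3 (@aws-sdk/client-s3, @aws-sdk/lib-storage) and the archiver package; the bucket names, keys, and function are illustrative, not the actual ComPaRe code.

```typescript
// Sketch of the bucket-to-bucket streaming export discussed above.
// Assumes AWS SDK v3 and the `archiver` package; bucket names, keys,
// and the function itself are illustrative, not the actual ComPaRe code.
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import archiver from "archiver";
import { PassThrough, Readable } from "node:stream";

const s3 = new S3Client({});

export async function exportStudyArchive(fileKeys: string[]): Promise<void> {
  // The archive is a stream: we can compress on the fly while entries are added.
  const archive = archiver("zip", { zlib: { level: 6 } });
  const zipStream = new PassThrough();
  archive.pipe(zipStream);

  // Upload the ZIP to the dedicated export bucket while it is still being written.
  const upload = new Upload({
    client: s3,
    params: { Bucket: "compare-exports", Key: "study-export.zip", Body: zipStream },
  });

  // Append each source file as a stream: nothing is ever buffered whole in memory.
  for (const key of fileKeys) {
    const object = await s3.send(
      new GetObjectCommand({ Bucket: "compare-uploads", Key: key }),
    );
    archive.append(object.Body as Readable, { name: key });
  }

  await archive.finalize(); // flush the remaining entries
  await upload.done();      // resolve once the ZIP has landed in S3
}
```

The key point is that each source file is piped into the ZIP as it is read, and the ZIP itself is uploaded while it is still being written, so memory usage stays flat regardless of the export size.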


I did a few rounds of questions and answers, then moved on to the next phase: generating the technical strategy document.

Phase 3: Technical Strategy Generation

In this phase we turn the high‑level discussion into a concrete, actionable plan. The goal is to produce a short but precise markdown document that an LLM can follow without ambiguity.

My Review Checklist

  1. Verify that every line is correct and unambiguous. It can be tempting to read superficially, but keep in mind that every token influences quality.
  2. Ensure the document stays under ~200 lines; if it exceeds that, split it into logical sub-tasks.
  3. Confirm that the LLM can parse each section without needing extra context, as this is the only context Claude will start with when implementing the feature.

As this Export Study feature is quite big, I split it into 7 files, including 6 individual tasks that are each testable.


​STRATEGY_EXPORT_STUDY_DATA.md

​STRATEY_EXPORT_TASK_5_API_ENDPOINTS.md

How is it different from using Claude Code's Plan mode?

Once again, it's all about context engineering. When using Claude Code's built-in Plan mode, you can't control every token that reaches the LLM.

By crafting a custom template, you gain two advantages:

  1. Predictability: the model outputs a plan that adheres to your specific project, business constraints, and physical realities, leading to more reliable outcomes.
  2. Learning: iteratively refining the template teaches you how to design effective system prompts, sharpening your intuition for prompt engineering.

For most tasks I recommend avoiding the built‑in Plan mode unless you’re tackling a completely new problem where a generic plan is all you need.

​Now that we have a plan, let’s move on to the implementation.

2. Put strong guardrails to guide Claude Code Right First Time

For this part I couldn't think of a better metaphor than bowling: your best chance at a strike is the plan, but adding some guardrails will definitely keep you out of the gutter.

​Coding standards

You already have a set of implicit rules about what good code looks like. Claude can’t guess those rules, so you need to spell them out explicitly—unless your codebase is already so mature that every file Claude reads automatically follows the standards.

So the idea is to write down markdown files that explain, in clear English, what good code looks like according to you.

You can then refer to these markdown files in your technical strategy document, depending on the task ahead.

Prefer positive phrasing over negative phrasing—Claude tends to produce better results with the former.

For example, say "Test files MUST define a factory helper method to create test data" rather than "Don't redefine a new test-data object in each test."
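
For illustration, here is a minimal sketch of what such a factory helper might look like in a Jest test file; the Participant type and its fields are hypothetical.

```typescript
// Illustration of the standard above: one factory helper per test file.
// The Participant type and its fields are hypothetical; assumes Jest.
interface Participant {
  id: string;
  email: string;
  hasChronicDisease: boolean;
}

// A single factory with overridable defaults, instead of redefining a
// fresh test-data object inside every test.
const buildParticipant = (overrides: Partial<Participant> = {}): Participant => ({
  id: "participant-1",
  email: "participant@example.com",
  hasChronicDisease: true,
  ...overrides,
});

describe("participant export", () => {
  it("includes the participant email", () => {
    const participant = buildParticipant({ email: "jane@example.com" });
    expect(participant.email).toBe("jane@example.com");
  });
});
```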


In our experience, the real challenge is keeping those rules manageable as your codebase and team evolve, so keep them minimal. If you manage to keep them lean, coding standards won't impact context-window usage much.

Linter & typing

A robust linter has always been a cornerstone of code quality; now it's worth investing even more in one. Since every diagnostic error is fed back to Claude, he will happily correct it as long as the error message is clear enough – at the expense of filling up part of the context window. This makes linting slightly more "expensive" than coding-standard checks, but the payoff in quality is well worth it.

Claude wrote EmptySearchState.tsx but didn't follow our coding standard of using a StyleSheet instead of inline styles, which triggered the eslint react-native/no-inline-styles diagnostic error.
This diagnostic error was fed back to Claude Code; he understood he needed to remove the inline style.
On the second iteration, with this additional context, he introduced a StyleSheet, which is what we wanted in the first place. It cost 2% of the context window.
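
For reference, here is a minimal sketch of the before/after those screenshots describe, assuming React Native's StyleSheet API; only the EmptySearchState name comes from the example, the styles and copy are illustrative.

```typescript
// Before/after for the react-native/no-inline-styles rule. Only the
// EmptySearchState name comes from the screenshots; styles and copy are made up.
import React from "react";
import { StyleSheet, Text, View } from "react-native";

// Before (flagged by eslint): <View style={{ padding: 16, alignItems: "center" }}>

// After: a StyleSheet, which is what our coding standard asks for.
const styles = StyleSheet.create({
  container: { padding: 16, alignItems: "center" },
  message: { fontSize: 16 },
});

export const EmptySearchState = () => (
  <View style={styles.container}>
    <Text style={styles.message}>No results found</Text>
  </View>
);
```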

This mechanism also works with typing: robust types and a strict typing configuration will let Claude write better code without you having to correct it.
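
As a small illustration (hypothetical types, not taken from the ComPaRe codebase), a strict union type turns a wrong value into a precise compiler diagnostic that can be fed straight back to Claude:

```typescript
// Hypothetical example: a strict union type turns a wrong value into a
// precise compiler diagnostic instead of a silent bug.
type ExportStatus = "pending" | "running" | "done" | "failed";

interface ExportJob {
  readonly id: string;
  status: ExportStatus;
}

export function markDone(job: ExportJob): ExportJob {
  return { ...job, status: "done" };
}

// If Claude writes `status: "finished"`, tsc reports
// `Type '"finished"' is not assignable to type 'ExportStatus'`,
// and that message is fed back for the next iteration.
```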

If Claude struggles to fix a linting error, check the error message. As a general rule of thumb, make sure the error message explains what the issue is and suggests a way to fix it.

A good error message explains the issue and also suggests a way to fix it.

​Formatting & Claude hooks

A common issue that eats a lot of context space is code formatting. Generally, Claude won't follow your strict Prettier configuration, and as a result it will generate a lot of diagnostic errors.

Prettier is configured to have one prop per line

Using Claude hooks, you can run Prettier's autofix right after he writes the file.


This automatic formatting step saves valuable context-window space and reduces overall turnaround time.

Test-Driven Development

The last piece needed to make sure Claude writes code that works Right First Time is a TDD workflow. I usually start by asking Claude to draft the test files, review them myself, and then let it run the implementation, executing the tests autonomously.

This approach offers a triple advantage:

  1. It guarantees that your code is tested.
  2. The code Claude writes is inherently testable.
  3. It gives you, as the architect, an opportunity to build a robust testing system.

For example, I introduced a test-environment helper that starts a PostgreSQL container, boots a NestJS app, and seeds the database with minimal boilerplate. This reduced the amount of code Claude had to write for a backend test and the context-window usage, while keeping everything fully testable. A sketch of the idea is shown below.

​E2E Test Environment to minimize backend tests setup
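
For illustration, here is a minimal sketch of what such a helper can look like, assuming Jest, @nestjs/testing, and the @testcontainers/postgresql package; the AppModule path, the DATABASE_URL variable, and the omitted seeding step are illustrative rather than the actual ComPaRe code.

```typescript
// Sketch of a test-environment helper of the kind described above.
// Assumes Jest, @nestjs/testing and @testcontainers/postgresql; the AppModule
// path, the DATABASE_URL variable and the seeding step are illustrative.
import { INestApplication } from "@nestjs/common";
import { Test } from "@nestjs/testing";
import { PostgreSqlContainer, StartedPostgreSqlContainer } from "@testcontainers/postgresql";
import { AppModule } from "../src/app.module";

export interface TestEnv {
  app: INestApplication;
  stop: () => Promise<void>;
}

export async function startTestEnv(): Promise<TestEnv> {
  // One disposable PostgreSQL container per test suite.
  const postgres: StartedPostgreSqlContainer = await new PostgreSqlContainer().start();
  process.env.DATABASE_URL = postgres.getConnectionUri();

  // Boot the real NestJS application against that container.
  const moduleRef = await Test.createTestingModule({ imports: [AppModule] }).compile();
  const app = moduleRef.createNestApplication();
  await app.init();

  return {
    app,
    stop: async () => {
      await app.close();
      await postgres.stop();
    },
  };
}
```

A test then only needs startTestEnv() in beforeAll and stop() in afterAll, which is exactly the kind of boilerplate reduction that keeps the test code Claude writes short.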

​Summary

The final outcome: the Export Data feature was designed, developed, validated, and pushed to production in just 2 man-days, without a single bug. By contrast, the team had originally estimated it would take 5 man-days without LLM agent support.

Summary of the Right First Time Agentic Coding method

3. Right First Time or… Jidoka!

Now what Lean Tech teaches us is that every problem is an opportunity to learn, and to improve our daily working conditions.

So when the agent falls short of implementing a feature Right First Time, the tempting shortcut is to hand-craft the missing bits yourself. That shortcut steals a precious chance: you'll never uncover why the LLM failed, you won't sharpen your intuition about how agents actually work, and you'll miss the opportunity to iterate on—and ultimately improve—the feature factory you've built.

Rule of thumb: When an agent can’t finish a task, investigate why. Don’t just patch the output.


​The /ui-component command

To illustrate this, I built a Claude slash command called /ui‑component.

Like many others, it connects Claude to this project's Figma via one of the many available MCP servers. The command accepts a Figma screen link and a technical strategy document. Its goal is two-fold:

  1. to replicate the screen pixel-perfect,
  2. but also to follow our project standards: reusing existing components, applying theme colors, sizes, and typographies, using translation keys for English and French, …

Figma good practices

Every element should reuse components from our UI library, which are defined as named design tokens (colors, margins, padding, fonts, etc.). This guarantees visual consistency and also guides Claude to reuse the proper element from a shared theme. For example, you won't end up with an 18 px border radius while the rest of the design uses a 24 px radius.

In this example it reuses design tokens Text/P1/Paragraph, colors/text/higher, and font/size/md; all of these are defined in our theme.ts.
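
As an illustration, here is roughly how those tokens could be mirrored in a theme.ts file; only the token names come from the example above, the values and exact structure are hypothetical.

```typescript
// Illustrative sketch of how the Figma tokens above could map to theme.ts.
// Only the token names come from the example; values and structure are hypothetical.
export const theme = {
  colors: {
    text: {
      higher: "#1A1A1A", // colors/text/higher
    },
  },
  font: {
    size: {
      md: 16, // font/size/md
    },
  },
  text: {
    p1: { fontSize: 16, lineHeight: 24 }, // Text/P1/Paragraph
  },
  borderRadius: {
    md: 24, // one shared radius instead of ad-hoc values like 18 px
  },
} as const;

export type Theme = typeof theme;
```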


To start the implementation of a given screen, I just need to copy its link


And give it to Claude, which will call the MCP tool get_figma_data_tool to retrieve everything that defines this screen (layout, design tokens, names, properties, …) as JSON.


In this example, Claude's first pass generated

  • a pixel-perfect screen
  • but with inline styles
  • without reusing our UI design system components (Box, Typography, Link, …)
  • and with hardcoded text

My first attempt was to hard-code a bunch of rules and example files, but Claude kept slipping up. Either it was missing a component that wasn't in its context (e.g. he used a React Native component instead of one from our design system), or the 200k context window blew up and Claude ended up looping until it failed (probably due to the ≈60k tokens from the Figma export).

I solved this problem with two ideas embedded in my /ui-component slash command:


  1. Inject needed files instead of letting Claude discover them – I used @ directives to pull the theme files directly into the prompt before it is sent to Claude. This avoids Claude searching and opening multiple files to build context on the theme and the available components in our UI system.

  2. Extract information from large files using bash – instead of loading entire i18n bundles (which would again overflow the context), I leveraged Claude's ! bash-execution syntax, which runs commands and stitches their stdout into the prompt before the LLM sees it.

For example:

jq -r 'paths(scalars) as $p | [$p[0], $p[1]] | select(.[1] != null) | @tsv' mobile/src/shared/i18n/en/common.json | cut -f1,2 | sort | uniq

This single line outputs every translation key in a two‑column TSV (~500 tokens), without dumping all the actual translations into Claude’s context (8000 tokens).

capture decran 2025 10 02 a 094652 044aad9d2ba8507c890e91352266109b 800

​The result: a lean prompt that keeps the model focused and avoids token bloat, while still giving it everything it needs to generate a pixel‑perfect, standards‑compliant UI component.

The outcome is a fully featured component coded in under 5 minutes, without me even opening my Figma file.


Full file

​Ji‑doka is one of the two pillars of the Toyota Production System that inspired the Lean Tech Manifesto. It teaches us that when a defect appears on the production line, it’s better to stop the line, identify the root cause, and fix it before resuming.

Applied to agentic coding, this means pausing Claude, adjusting your meta‑prompt (slash commands, existing code, coding standards, linter rules, etc.), resetting everything, and restarting. By doing so you build step by step an AI‑powered factory that ships entire features Right First Time.

Work as a team: invest in an AI-learning organization… or an organization learning AI


I want to emphasize that it's perfectly normal, and healthy, to fall back on writing the code yourself. There are times you'll want to step in and manually polish the few remaining errors that prevent a feature from being finished. Plus, you don't want to forget how to ride a bike, right?

Now in a team setting, I recommend having everyone log any defects that arise when Claude fails to deliver a right‑first‑time implementation. Then, set aside a short, dedicated slot each day to review those failures together and experiment with improvements.

This practice turns your team into a learning organization that knows how to react when an LLM doesn’t produce the exact result you need.

In a world where a new model, a new IDE, a new CLI, or a new framework appears every week, the best investment you can make as an engineering leader is in your people. By having the whole organization learn how to use these new tools in their daily work, you stand a better chance of getting ahead than by spending that time negotiating contracts with the best LLM provider, which will be dethroned in two months anyway.

Conclusion

It's hard to imagine that software development will not be entirely transformed in the coming years, with agents becoming an integral part of our daily workflow. These tools can either eliminate the need to type on a keyboard, or they can accelerate the learning of deep technical concepts, enabling us to build better, more performant and secure products, and open up new possibilities for rapid iteration with users, ultimately delivering a better overall experience.

At Theodo we nurture that mindset by using agents first and foremost to build more ingenious technical conceptions. This encourages continuous deep learning while cultivating an intuition for how agents can help us build better products for our customers and solve our society's problems.

What about the cost? Shipping these 15 complex features cost me roughly €500 in Claude usage. It is a notable expense, yet very modest compared to the man-hours saved on the engineering side. Whether you see it as expensive or worthwhile depends on your budget, but in practice the ROI is hard to ignore.

If you're eager to contribute to the French research effort on understanding chronic diseases, join the AP-HP voluntary cohorts of over 60,000 participants at ComPaRe. We are very happy to enable research using state-of-the-art technologies. And if you want to build better products to help the healthcare system globally, you can also apply to join us 😉

PS: this article was extensively reviewed and refined with the help of openai/gpt-oss 20B


