I spent about three days building a fairly complex software system. I didn’t write a single line of code. Instead, I used two AI systems — ChatGPT and Claude Code — in a very deliberate way to design and build it end-to-end. Not as a demo. Not as a proof of concept. But as a working software engineering system. The goal was simple, but not easy:
Build an AI-powered, human-in-the-loop software engineering workflow — where AI can plan, implement, test, review, and deliver code… under control.
I call it the ‘AI Dev Orchestrator’. To understand it, let’s first look at the goal it set out to achieve.
The Goal: From Requirement to Delivery
The goal was to build an AI-assisted, fully functional software engineering flow with a human in the loop by design.
You file a requirement in Jira, and AI picks it up from there, ultimately delivering software that addresses that requirement.
At a high level:
- Jira captures requirements
- AI breaks Epics → Stories (human-approved)
- AI implements code
- Tests run (with one fix attempt)
- PR is created
- Independent AI agents review
- A release gate decides
- GitHub enforces checks
- Post-merge validation runs
- Feedback is recorded for future runs
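To make that flow concrete, here is the pipeline as a minimal Python sketch. The stage names are my illustration, not the orchestrator’s actual identifiers:

```python
from enum import Enum, auto

class Stage(Enum):
    """Ordered stages of the requirement-to-delivery pipeline (illustrative)."""
    CAPTURE_REQUIREMENT = auto()    # Jira Epic comes in
    PLAN_STORIES = auto()           # AI decomposition, human-approved
    IMPLEMENT = auto()              # AI writes the code
    TEST = auto()                   # run tests, one bounded fix attempt
    OPEN_PR = auto()                # create the pull request
    AGENT_REVIEW = auto()           # independent AI reviewers
    RELEASE_GATE = auto()           # single approve/skip/block decision
    ENFORCE_CHECKS = auto()         # GitHub status checks gate the merge
    POST_MERGE_VALIDATION = auto()  # verify the deployed change
    RECORD_FEEDBACK = auto()        # signals recorded for future runs
```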
So it’s not a coding assistant and not a prompt trick. It’s a system where AI participates in software engineering under structure, constraints, and human control. To make this work in practice, I needed to think about how I would actually use AI systems, not just what they would do.
Two AIs, Two Roles: Separation by Design
ChatGPT served as the System Designer, owning the architecture, guardrails, and workflows. Claude Code served as the Executor, responsible for code, integration, and debugging. The important part wasn’t the tools. It was context separation.
- ChatGPT did not see low-level code/debug logs
- Claude did not hold product-level vision
One AI thinks. One AI builds. Neither gets overloaded. This single decision kept the system stable as it grew. Once the roles were clear, the next question was: how do you actually build something like this without losing control?
The Build Method: Design → Execute → Refine
This wasn’t “one big prompt”. Each phase followed a strict loop:
- Execution Guide
- Implementation (Claude)
- Summary + Feedback
- Gap Analysis
- Next Phase
Repeat this for each phase (18 phases, so far).
Execution guides were explicit: data models, APIs, constraints, non-goals. Claude executed. Then reported what actually happened. Design evolved from reality, not assumptions.
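Here is that loop as a minimal Python sketch. The three functions are hypothetical stand-ins for what were really conversations with the two AIs:

```python
def write_execution_guide(phase: int, gaps: list[str]) -> str:
    # ChatGPT: design the next phase, informed by the previous gaps
    return f"guide for phase {phase}, addressing: {gaps}"

def implement(guide: str) -> str:
    # Claude Code: execute the guide, report what actually happened
    return f"summary of what actually happened for: {guide}"

def gap_analysis(guide: str, summary: str) -> list[str]:
    # ChatGPT: compare intent against reality
    return []

gaps: list[str] = []
for phase in range(1, 19):  # 18 phases so far
    guide = write_execution_guide(phase, gaps)
    summary = implement(guide)
    gaps = gap_analysis(guide, summary)  # feeds the next phase
```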
How It Was Actually Built (Not Just Prompted)
What surprised me the most was this:
This didn’t feel like prompting. It felt like engineering.
I didn’t sit down and write code. I didn’t try to generate the whole system in one go. Instead, I treated the two AI systems like members of a team. One of them designed the system. The other built it. And I sat in the loop, guiding both. The process looked simple on the surface: define, plan, execute, report, refine.
But what made it work was discipline. I would first ask ChatGPT to define what we are building:
- the problem
- the constraints
- what we should not do
Then I would ask it to break that into a small, achievable phase. Not the whole system. Just the next step.
From there, ChatGPT would generate a detailed execution guide. Not ideas. Not suggestions. Actual instructions:
- what to build
- how to structure it
- what success looks like
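As data, a guide looked roughly like this. The field names are my illustration, assuming a schema the real guides only followed informally:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionGuide:
    """One phase's instructions: explicit scope, explicit non-goals."""
    phase: int
    what_to_build: str                                          # the concrete deliverable
    structure: list[str] = field(default_factory=list)         # modules, APIs, data models
    constraints: list[str] = field(default_factory=list)       # what must hold
    non_goals: list[str] = field(default_factory=list)         # what NOT to do
    success_criteria: list[str] = field(default_factory=list)  # what "done" means
```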
Then Claude took over. It would implement that phase — step by step — on a real codebase. And then came the most important part. It would tell me what actually happened. Not what it intended. Not what it assumed.
- What worked
- What broke
- What didn’t make sense
I would take that back to ChatGPT and ask “Given this reality, what should change?” And that’s where the system evolved. This loop repeated again and again. Small phase, clear goal, real execution, honest feedback. Not everything at once. One decision at a time.
Over three days, this process ran through 18 phases. And somewhere along the way, it stopped feeling like “using AI” and started feeling like “working with a team that just happens to be AI-driven”.
The Turning Points That Changed the System
1) Closing the Loop (Phase 5)
Early versions of the system could generate code and create PRs, but they had no way to answer a critical question: did the change actually work? So Phase 5 introduced a simple but powerful loop:
After generating code:
- tests were executed automatically
- if they failed, the system made one focused fix attempt
- tests were run again
- and only then was a decision made
Two important constraints made this reliable:
- bounded retries (no infinite loops)
- no silent success (failures were explicit and visible)
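In code, the loop with both constraints looks roughly like this. `run_tests` and `attempt_fix` are hypothetical callables standing in for the real orchestrator steps:

```python
MAX_FIX_ATTEMPTS = 1  # bounded retries: no infinite loops

def verify_change(run_tests, attempt_fix) -> None:
    passed, log = run_tests()
    attempts = 0
    while not passed and attempts < MAX_FIX_ATTEMPTS:
        attempt_fix(log)           # one focused fix, driven by the failure log
        passed, log = run_tests()  # re-run before any decision is made
        attempts += 1
    if not passed:
        # No silent success: the failure is explicit and visible.
        raise RuntimeError(f"tests still failing after fix attempt:\n{log}")
```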
This was a turning point. The system stopped just producing output and started taking responsibility for outcomes.
2) Think Before Acting (Phase 6)
Until this point, the system only knew how to execute a Story. But real work doesn’t start with clean, well-defined Stories — it starts with messy requirements. So Phase 6 introduced a planning layer: Epic → Story (human-approved). I made a deliberate simplification here. Instead of Epic → Feature → Story, I chose Epic → Story.
Because more layers introduce more ambiguity — and ambiguity is where AI starts making things up.
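A sketch of that deliberately flat model, with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class Story:
    title: str
    acceptance_criteria: list[str]
    approved: bool = False  # a human must flip this before execution

@dataclass
class Epic:
    title: str
    stories: list[Story] = field(default_factory=list)  # no Feature layer in between

def executable_stories(epic: Epic) -> list[Story]:
    """Only human-approved Stories ever reach the executor."""
    return [s for s in epic.stories if s.approved]
```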
Around the same time, I noticed something interesting. Claude would often fix the most obvious code issue, even if it had nothing to do with the Story. For example, a duplicate import would get fixed repeatedly across runs. It took me a while to realize what was happening. The system was optimizing for what was visible, not what was intended.
That insight led to a few important changes:
- tightening prompts to emphasize Story intent
- improving context selection so relevant files were prioritized
- refining sandbox behavior to avoid repeated “obvious fixes”
These changes made a noticeable difference. The system became better at aligning with intent — not just reacting to surface-level issues.
3) Learn, But Keep It Bounded (Phase 7)
At this point, the system could plan and execute — but it treated every run in isolation. It had no memory of:
- what worked
- what failed
- what needed retries
So Phase 7 introduced a feedback layer. We started capturing simple, structured signals:
- planning approvals and rejections
- execution failures and retries
- recurring patterns across runs
But instead of building a complex “AI memory,” I kept it deliberately constrained. Maximum 5 bullets. Small, explicit, and relevant. No long histories. No hidden state. No “mystical memory engine.”
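A sketch of how bounded that layer is. The signal strings are invented examples; the cap is the real design constraint:

```python
from collections import Counter

MAX_BULLETS = 5  # small, explicit, relevant: no long histories

def summarize_feedback(signals: list[str]) -> list[str]:
    """Keep only the most frequent recurring signals, capped at five bullets."""
    return [text for text, _ in Counter(signals).most_common(MAX_BULLETS)]

bullets = summarize_feedback([
    "plan rejected: stories too coarse",
    "execution retried: flaky integration test",
    "plan rejected: stories too coarse",
])
# bullets is injected into the next run's context; nothing else carries over
```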
4) Break the “Single AI” Model (Phase 8–10)
Up to this point, a single AI was effectively doing everything. That’s a problem. A system that generates code and evaluates its own output cannot be trusted. So we split responsibilities into independent agents:
- Developer Agent → writes code
- Reviewer Agent → checks story alignment
- Test Quality Agent → evaluates whether tests are meaningful
- Architecture Agent → checks system and design impact
- Release Gate → makes the final decision
Each agent looks at the same change — but from a different perspective. One insight here was especially important:
Passing tests ≠ good tests.
Tests can pass and still miss critical behavior. So instead of relying only on test results, a separate agent evaluates:
- what the tests actually cover
- whether they validate the intended behavior
All of these signals are then combined by a Unified Release Gate. Instead of scattered checks, the system produces a single outcome:
- approved
- skipped
- blocked
And most importantly — why. This is where the system started to resemble a real engineering team, not just a single AI tool.
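A minimal sketch of that gate. Agent names and fields are illustrative, and the real logic weighs more signals than this:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    agent: str   # "reviewer", "test_quality", "architecture", ...
    ok: bool
    reason: str  # every verdict must explain itself

def release_gate(verdicts: list[Verdict]) -> tuple[str, list[str]]:
    """Combine independent agent verdicts into one explainable outcome."""
    if not verdicts:
        return "skipped", ["no agent produced a verdict"]
    blockers = [v for v in verdicts if not v.ok]
    if blockers:
        return "blocked", [f"{v.agent}: {v.reason}" for v in blockers]
    return "approved", [f"{v.agent}: {v.reason}" for v in verdicts]
```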
5) Make It Safe (Phase 11)
By this point, the system could do something powerful — and risky. It could merge code. That’s where things change. Because once a system can modify real repositories, the problem is no longer capability — it’s control. So Phase 11 focused entirely on safety. We added:
- admin authentication
- webhook validation
- Telegram command controls
- GitHub write guards
- pause/resume kill switch
- audit logs for every critical action
These weren’t enhancements. They were necessary.
Power without control is just a faster failure mode.
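To illustrate two of those controls, here is a sketch of a pause/resume kill switch wrapped around every repository write. The names are mine, not the actual implementation:

```python
import threading

class KillSwitch:
    """Pause/resume control guarding every GitHub write."""

    def __init__(self) -> None:
        self._running = threading.Event()
        self._running.set()  # starts in the running state

    def pause(self) -> None:   # e.g. triggered by a Telegram command
        self._running.clear()

    def resume(self) -> None:
        self._running.set()

    def guard_write(self, action: str) -> None:
        """Called before any repository mutation; refuses while paused."""
        if not self._running.is_set():
            raise PermissionError(f"system paused: refusing '{action}'")
        print(f"AUDIT: {action}")  # stand-in for the real audit log
```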
6) Make It Behave Like a Team (Phase 12–13)
At this stage, the system could plan, build, and review. But it still had one major weakness — it would guess when things were unclear. So we fixed that. We introduced a clarification loop — instead of guessing, the system pauses and asks.
If something is ambiguous, the workflow moves into a waiting state until a human responds. Only then does it continue. At the same time, we made decisions visible and enforceable in GitHub. All internal checks — review, test quality, architecture, release — are published as native status checks on the PR. This ensures:
- decisions are transparent
- merges cannot bypass the system accidentally
- the source of truth lives where engineers actually work
No hidden logic. No silent decisions.
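Publishing those decisions uses GitHub’s documented commit-status endpoint. A sketch, with an illustrative context name (the real check names may differ):

```python
import requests

def publish_check(owner: str, repo: str, sha: str, token: str,
                  context: str, passed: bool, detail: str) -> None:
    """Post an internal decision as a native GitHub commit status."""
    resp = requests.post(
        f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "state": "success" if passed else "failure",
            "context": context,           # e.g. "orchestrator/release-gate"
            "description": detail[:140],  # GitHub shows only a short description
        },
        timeout=10,
    )
    resp.raise_for_status()
```

With these contexts configured as required branch checks, a PR cannot merge until the system’s own gates pass.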
7) Make It Practical (Phase 14–16)
By this point, the system was functionally complete. But it still needed to be usable. So Phases 14–16 focused on making it practical for real-world use. We added:
- an admin dashboard to inspect runs, failures, and decisions
- multi-stack support (Python, Maven, Gradle, React, with safe fallback for unknown repos)
- post-merge validation to verify whether the deployed system actually works
One design choice here was deliberate: deployment failures are observed, not automatically rolled back, because the goal was not to hide problems but to surface them clearly. With that, the system moved from being technically capable to operationally usable.
8) Make It Real (Phase 17)
Until this point, everything operated in controlled environments. Phase 17 changed that. The system was pointed at a real project. But before touching any code, it first built an understanding of the repository:
- scanned the repo structure
- detected the tech stack
- generated an architecture summary
- extracted coding conventions
This was stored as project knowledge and used to guide all future decisions. AI stopped operating blindly — and started working with context.
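Stack detection, for example, can be as simple as looking for marker files, with a safe fallback for unknown repositories. A sketch with an illustrative mapping:

```python
from pathlib import Path

# Marker files -> stack, checked in order (illustrative, not exhaustive)
MARKERS = {
    "pyproject.toml": "python",
    "requirements.txt": "python",
    "pom.xml": "maven",
    "build.gradle": "gradle",
    "build.gradle.kts": "gradle",
    "package.json": "node/react",
}

def detect_stack(repo_root: Path) -> str:
    for marker, stack in MARKERS.items():
        if (repo_root / marker).exists():
            return stack
    return "unknown"  # safe fallback: analyze read-only, run no build commands
```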
Why Iteration Worked Better Than Planning
If I had tried to design the entire system upfront, it would have looked very different. And it would have been wrong. Because the most important decisions didn’t come from design — they came from real execution gaps. For example:
- couldn’t validate correctness → added a test loop
- couldn’t plan effectively → added Epic → Story decomposition
- system forgot everything → added feedback and memory
- risk of self-approval → split into independent agents
- system became powerful → added safety and control
None of these were obvious at the start. They only became clear after the system was actually used. Each phase wasn’t about adding features — it was about fixing what the previous phase got wrong. That’s why iteration worked better than planning. The system didn’t evolve from ideas — it evolved from reality.
Rethinking How We Use AI in Engineering
Most AI workflows look like this: prompt in → code out.
Fast. Convenient. But shallow. This system worked differently: requirement → plan → implementation → tests → review → release gate → feedback.
Each step feeds into the next. Each outcome shapes the next decision. And more importantly, not one AI doing everything — but multiple AIs, each with a clear, bounded responsibility. That’s what makes the system stable.
What This Experience Actually Taught Me
Over three days, the system generated:
- ~14K lines of Python
- a similar amount of documentation
In a traditional setup, this would likely involve:
- multiple engineers
- different roles (backend, testing, CI/CD)
- several weeks, if not months
But the real takeaway is not speed. It's this:
A structured, iterative approach with AI works far better than trying to do everything in one go.
Small, verifiable steps made all the difference.
Why I Stopped at Phase 18
By this point, the system could:
- plan
- build
- review
- validate
- and learn
The guardrails were in place. Observability was strong. It was no longer a prototype. At that stage, adding more features would not have made it better. It would have made it more complex. So I stopped. Because the next set of lessons won’t come from designing more. They will come from using the system in the real world.
Repository
I have made the project repository available here: GitHub Repository
This is not a polished product or framework yet — it’s a working artifact of this experiment, including the code, phase-wise evolution, and supporting documentation.
What Happens Next
Up to this point, the focus was on building the system. Now the focus shifts to using it. There are three directions I want to explore:
1) Improve a real project
The first step is to apply this workflow to an existing codebase. Not a sandbox. Not a controlled demo. A real project — with real constraints, real trade-offs, and real consequences. This will test:
- how well the system understands unfamiliar code
- whether planning holds up beyond simple cases
- how often clarifications are needed
- how reliable the review and release gates actually are
2) Bootstrap a new project
The second direction is to use the system from scratch. Start with a simple idea. Let the system:
- break it into Stories
- generate the initial codebase
- evolve it incrementally
This will answer a different question: can this workflow act as a starting point, not just an improvement layer?
3) Improve the orchestrator itself
The third direction is the most interesting. Use the same system to improve the orchestrator. That means:
- creating Epics for enhancements
- decomposing them into Stories
- letting the system implement its own improvements
- reviewing them through its own agents
- merging changes under the same guardrails
The system becomes both the tool… and the subject. Up to now, everything has been controlled. The real test begins when:
- the system encounters messy requirements
- context is incomplete
- decisions are not obvious
That’s where you find out whether this is just a clever build… or a usable system.
Closing Thought
This wasn’t about proving that AI can write code. We already know that. It was about building a system where:
- one AI designs
- another executes
- independent agents verify
- humans stay in control
Because without that, you’re not building an engineering system. You’re just automating chaos.
This article was originally published on Medium and is being adapted here as part of my long-term knowledge hub.
Read original on Medium