In my earlier post on evolving my skills strategy for Claude and Codex I mostly wrote about structure: why I started building these skills, how I split heavier and lighter workflows, and how I am trying to make the whole setup portable across runtimes.
That is useful context, but it is still mostly about the system in the abstract.
This post is about a real session where that system did something genuinely impressive.
The task started small enough: an activities heatmap on one of my side projects was broken. The route endpoint was returning a 500 and then, after the first fix, returning no useful data. What followed was not just “the model wrote some code”. It investigated the bug, wrote a plan, handed work off through subagents, tracked progress on disk in markdown files, deployed changes, monitored those deploys, recognised when the deploy had not actually fixed the underlying problem, went into the database to inspect live data, found a second issue, fixed that, redeployed, watched the migrations run, and kept going until the issue was actually resolved.
It is not perfect. There was still some ceremony, and there were a couple of places where the orchestration needed a human nudge. But as an example of what these evolving skills can already produce, I thought it was worth writing up.
- The session started with a normal bug report
- Planning and implementation were treated as separate jobs
- Progress files became the handoff mechanism
- Deploy and monitor turned out to matter as much as the code fix
- The interesting part was the second failure
- Why this session mattered to me
- Summary
The session started with a normal bug report
The opening problem was pretty ordinary. A map heatmap was broken and the route fetch looked like it was failing around one store call:
routes, err := h.store.GetActivityRoutes(ctx, params)
if err != nil {
	// This was the reported failure: the underlying query errored,
	// so every request to the route came back as a 500.
	return errorResponse(http.StatusInternalServerError, "internal error"), nil
}
The top-level skill first investigated the code and schema rather than taking the line in the handler at face value. It found that the heatmap query was joining on the wrong foreign keys (doh!). That is a real bug, and fixing it stopped the 500.
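I do not have the project's real schema to hand, so as a purely illustrative sketch, the shape of the bug was roughly this (table and column names invented):

```sql
-- Broken shape: the route table joined to activities on the wrong key,
-- so the query matched the wrong rows.
SELECT r.points
FROM activity_routes r
JOIN activities a ON a.id = r.id;

-- Fixed shape: join on the actual foreign key.
SELECT r.points
FROM activity_routes r
JOIN activities a ON a.id = r.activity_id;
```

The point is less the specific columns and more that the fix lived in the query, not in the handler line where the 500 surfaced.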
That would already have been a decent assisted debugging session. But it was only the first layer of the problem.
Planning and implementation were treated as separate jobs
Instead of rolling straight from investigation into a loose stream of edits, the session went through a proper planning phase using the heavier planning skill.
That matters more than it sounds.
The planning skill produced a concrete implementation plan rather than a vague checklist. It wrote the plan to disk, broke the work into waves and tasks, and framed the likely fix as:
- Correct the join in the route query
- Add regression coverage
- Improve observability around the route failure
After that the implementation skill picked up the plan and executed it rather than starting from scratch. The important thing here is not just that there were two skills involved. It is that the second skill inherited a shaped piece of work: a plan with explicit tasks and expectations, not a raw user request.
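To give a flavour of what "a shaped piece of work" means here, a plan file from this flow looks roughly like the following. The exact format is my own sketch, not a verbatim artefact from the session:

```markdown
# Plan: fix activities heatmap route

## Wave 1
- [ ] Task 1.1: correct the join in the route query
- [ ] Task 1.2: add regression coverage for the route handler

## Wave 2
- [ ] Task 2.1: improve observability around the route failure
```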
That separation is one of the things I was getting at in the earlier post. Skills are most useful when they encode workflow boundaries. This was a good example of that in practice.
Progress files became the handoff mechanism
One part of this setup I still find slightly ridiculous and surprisingly effective at the same time is the use of progress files.
The implementation flow wrote step progress out to files on disk. Those files tracked:
- What step was currently being worked on
- What had been learnt while doing it
- What had been validated
- Whether the step was blocked or needed review
That gave the session a durable coordination mechanism. The top-level agent could inspect what had happened, decide whether the task was complete, and then either continue or stop at a review boundary - and all of this could happen whilst the subagent was still working, because the progress file acts as a communication channel. When the work expanded beyond the original fix, those files also acted as a memory aid for what had already been tried and what still needed doing.
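To make that concrete, a progress file in this setup looks roughly like the following. The fields here are a sketch of the idea, not a verbatim artefact from the session:

```markdown
# Progress: Task 1.1 - correct the join in the route query

- Status: in progress
- Learnt: the heatmap query was joining on the wrong foreign keys
- Validated: handler returns 200 against local fixture data
- Blocked / needs review: no
```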
I would not claim this is elegant; it is a little clunky. But it is exactly the sort of thing I mean when I say the skills are becoming a small software system rather than a pile of reusable prompts. The coordination is explicit. It is inspectable. And when something goes sideways, there is at least some state to recover from.
Deploy and monitor turned out to matter as much as the code fix
After the join bug was fixed, the session deployed and monitored the change rather than assuming the local result was enough.
That was the right instinct, because production behaviour changed but the feature was still broken.
The 500 was gone, but the UI still had no data to display.
At that point the deploy-monitor loop became more valuable than the code-writing part. The relevant skills pushed changes, watched GitHub Actions runs, inspected CloudFormation state, and checked live AWS resources rather than treating deployment as an afterthought.
This is the kind of thing I think current agent demos tend to skip over. Writing the patch is only part of the job; there is also:
- Recognising when a deploy did not run the thing you thought it ran
- Separating application deployment from datastore deployment
- Reading failed workflow logs rather than hand-waving
- Checking the actual data in the database instead of inferring too much from code
This session did all of that.
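In practice that loop leans on ordinary CLI tooling. Something like the following gives the flavour - a sketch of the kinds of commands involved, not a transcript of the session, with `<run-id>` and the stack name as stand-ins:

```
# Watch the latest GitHub Actions run for the deploy workflow
gh run watch

# Inspect only the failed logs for a specific run
gh run view <run-id> --log-failed

# Check the CloudFormation stack state after the deploy
aws cloudformation describe-stacks --stack-name my-app-stack \
  --query "Stacks[0].StackStatus"
```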
The interesting part was the second failure
The second failure was the part that really sold me on the setup.
Once the 500 was fixed, the session started checking the live database and found that some required data was NULL for every activity. That explained why viewport-filtered heatmap requests were empty. There were plenty of stream points in the database, but no bounding boxes to use for the map filtering logic.
So the session moved from “fix a broken SQL join” into “design and ship a data backfill”. It added:
- A migration to backfill the data
- An ingestion-side update so future activities would populate bbox automatically
- A frontend timeout increase for the heavier map route fetch
Then it deployed again.
And that still did not fully work.
The deploy monitor caught that the migration step had failed in GitHub Actions. The error was not especially glamorous either: the one-shot backfill query was timing out under the RDS Data API migration runner. That is a very normal kind of production issue.
What impressed me was what happened next. The session did not stop at “migration failed”. It went back into investigation mode, looked at the failed logs, inspected the migration runner, tested alternative SQL shapes against the live database, compared query plans, found that a per-activity indexed aggregation was materially better than a full table aggregation, rewrote the migration, redeployed, and monitored again until:
- The migration completed
- The schema migrations table showed the new migration as applied
- The live database reported the expected state
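The difference between the two query shapes is worth spelling out, again as an invented illustration rather than the session's actual SQL:

```sql
-- One-shot, full-table shape: aggregates every stream point in a single
-- statement, which timed out under the Data API migration runner.
UPDATE activities a
SET bbox = sub.bbox
FROM (
  SELECT activity_id,
         box(point(min(lng), min(lat)), point(max(lng), max(lat))) AS bbox
  FROM stream_points
  GROUP BY activity_id
) sub
WHERE a.id = sub.activity_id;

-- Per-activity shape: a correlated subquery that can use an index on
-- stream_points(activity_id) and only touches one activity's points at a time.
UPDATE activities a
SET bbox = (
  SELECT box(point(min(p.lng), min(p.lat)), point(max(p.lng), max(p.lat)))
  FROM stream_points p
  WHERE p.activity_id = a.id
);
```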
Why this session mattered to me
The earlier evolving-skills post was mostly about intent; this one is about practice.
The parts I found most valuable were:
- It investigated before editing
- It separated planning from implementation
- It preserved working state in progress files
- It treated deployment and monitoring as first class work
- It adapted when the first fix changed the symptoms but did not solve the problem
- It validated against live AWS and Postgres data instead of trusting local reasoning alone
There are still obvious limitations.
The workflow can be a bit ceremonial, especially once plan files, progress files and skill routing all get involved. It also still needed me in the loop for decisions like whether to keep a mixed commit and when to trigger the right deployment path for migration changes. And while the subagent and progress-file model works, I would not yet call it graceful.
This is also a contrived example in some senses. It is a side project that doesn’t have a proper local dev environment with a database (but it could), it can be broken in production (but it could have a dev/staging environment), and it is being used to learn as much as I can about these tools and to push them.
With a more robust, production-grade project all of these steps and processes could still take place, likely in a different environment, and the problem would still be solved.
I didn’t need to babysit this process; I was doing other things.
Summary
If the previous post was about how the skills are evolving, this session was about what that evolution can already buy me.
The best part was not any individual patch. It was the continuity of the work. The session kept enough context and enough discipline to move through distinct modes of engineering work: bug investigation, planning, implementation, deployment, failure analysis, data inspection, migration redesign and final verification.
Not every session holds that thread. This one did, imperfectly but convincingly.
Get in contact
If you have comments, questions or better ways to do anything that I have discussed in this post then please get in contact via LinkedIn or email.