<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Claude on Stuart Forrest</title><link>https://www.uglydirtylittlestrawberry.co.uk/tags/claude/</link><description>Recent content in Claude on Stuart Forrest</description><generator>Hugo -- gohugo.io</generator><language>en-uk</language><lastBuildDate>Sat, 04 Apr 2026 10:12:00 +0000</lastBuildDate><atom:link href="https://www.uglydirtylittlestrawberry.co.uk/tags/claude/" rel="self" type="application/rss+xml"/><item><title>What my evolving Claude and Codex skills can actually produce</title><link>https://www.uglydirtylittlestrawberry.co.uk/posts/what-my-evolving-claude-and-codex-skills-can-actually-produce/</link><pubDate>Sat, 04 Apr 2026 10:12:00 +0000</pubDate><guid>https://www.uglydirtylittlestrawberry.co.uk/posts/what-my-evolving-claude-and-codex-skills-can-actually-produce/</guid><description>&lt;p>In my earlier post on &lt;a href="https://www.uglydirtylittlestrawberry.co.uk/posts/evolving-my-skills-strategy-for-claude-and-codex/">evolving my skills strategy for Claude and Codex&lt;/a> I mostly wrote about structure: why I started building these skills, how I split heavier and lighter workflows, and how I am trying to make the whole setup portable across runtimes.&lt;/p>
&lt;p>That is useful context, but it is still mostly about the system in the abstract.&lt;/p>
&lt;p>This post is about a real session where that system did something genuinely impressive.&lt;/p>
&lt;p>The task started small enough: an activities heatmap on one of my side projects was broken. The route endpoint was returning a &lt;code>500&lt;/code> and then, after the first fix, returning no useful data. What followed was not just &amp;ldquo;the model wrote some code&amp;rdquo;. It investigated the bug, wrote a plan, handed work off through subagents, tracked progress on disk in markdown files, deployed changes, monitored those deploys, recognised when the deploy had not actually fixed the underlying problem, went into the database to inspect live data, found a second issue, fixed that, redeployed, watched the migrations run, and kept going until the issue was actually resolved.&lt;/p>
&lt;p>It is not perfect. There was still some ceremony, and there were a couple of places where the orchestration needed a human nudge. But as an example of what these evolving skills can already produce, I thought it was worth writing up.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="#the-session-started-with-a-normal-bug-report">The session started with a normal bug report&lt;/a>&lt;/li>
&lt;li>&lt;a href="#planning-and-implementation-were-treated-as-separate-jobs">Planning and implementation were treated as separate jobs&lt;/a>&lt;/li>
&lt;li>&lt;a href="#progress-files-became-the-handoff-mechanism">Progress files became the handoff mechanism&lt;/a>&lt;/li>
&lt;li>&lt;a href="#deploy-and-monitor-turned-out-to-matter-as-much-as-the-code-fix">Deploy and monitor turned out to matter as much as the code fix&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-interesting-part-was-the-second-failure">The interesting part was the second failure&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-this-session-mattered-to-me">Why this session mattered to me&lt;/a>&lt;/li>
&lt;li>&lt;a href="#summary">Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="the-session-started-with-a-normal-bug-report">The session started with a normal bug report&lt;/h2>
&lt;p>The opening problem was pretty ordinary. A map heatmap was broken and the route fetch looked like it was failing around one store call:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">routes&lt;/span>, &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">h&lt;/span>.&lt;span style="color:#a6e22e">store&lt;/span>.&lt;span style="color:#a6e22e">GetActivityRoutes&lt;/span>(&lt;span style="color:#a6e22e">ctx&lt;/span>, &lt;span style="color:#a6e22e">params&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">nil&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#a6e22e">errorResponse&lt;/span>(&lt;span style="color:#a6e22e">http&lt;/span>.&lt;span style="color:#a6e22e">StatusInternalServerError&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;internal error&amp;#34;&lt;/span>), &lt;span style="color:#66d9ef">nil&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The top-level skill first investigated the code and schema rather than taking the line in the handler at face value. It found that the heatmap query was joining on the wrong foreign keys (doh!). That is a real bug, and fixing it stopped the &lt;code>500&lt;/code>.&lt;/p>
&lt;p>That would already have been a decent assisted debugging session. But it was only the first layer of the problem.&lt;/p>
&lt;h2 id="planning-and-implementation-were-treated-as-separate-jobs">Planning and implementation were treated as separate jobs&lt;/h2>
&lt;p>Instead of rolling straight from investigation into a loose stream of edits, the session went through a proper planning phase using the heavier planning skill.&lt;/p>
&lt;p>That matters more than it sounds.&lt;/p>
&lt;p>The planning skill produced a concrete implementation plan rather than a vague checklist. It wrote the plan to disk, broke the work into waves and tasks, and framed the likely fix as:&lt;/p>
&lt;ul>
&lt;li>Correct the join in the route query&lt;/li>
&lt;li>Add regression coverage&lt;/li>
&lt;li>Improve observability around the route failure&lt;/li>
&lt;/ul>
&lt;p>After that the implementation skill picked up the plan and executed it rather than starting from scratch. The important thing here is not just that there were two skills involved. It is that the second skill inherited a shaped piece of work: a plan with explicit tasks and expectations, not a raw user request.&lt;/p>
&lt;p>That separation is one of the things I was getting at in the earlier post. Skills are most useful when they encode workflow boundaries. This was a good example of that in practice.&lt;/p>
&lt;h2 id="progress-files-became-the-handoff-mechanism">Progress files became the handoff mechanism&lt;/h2>
&lt;p>One part of this setup I still find slightly ridiculous and surprisingly effective at the same time is the use of progress files.&lt;/p>
&lt;p>The implementation flow wrote step progress out to files on disk. Those files tracked:&lt;/p>
&lt;ul>
&lt;li>What step was currently being worked on&lt;/li>
&lt;li>What had been learnt while doing it&lt;/li>
&lt;li>What had been validated&lt;/li>
&lt;li>Whether the step was blocked or needed review&lt;/li>
&lt;/ul>
&lt;p>That gave the session a durable coordination mechanism. The top-level agent could inspect what had happened, decide whether the task was complete, and then either continue or stop at a review boundary. All of this could happen whilst the subagent was still working, because the progress file acts as a communication channel. When the work expanded beyond the original fix, those files also acted as a memory aid for what had already been tried and what still needed doing.&lt;/p>
&lt;p>I would not claim this is elegant; it is a little clunky. But it is exactly the sort of thing I mean when I say the skills are becoming a small software system rather than a pile of reusable prompts. The coordination is explicit. It is inspectable. And when something goes sideways, there is at least some state to recover from.&lt;/p>
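&lt;p>As a rough illustration of the idea, not the actual skill implementation, a progress-file round trip might look like this in Go (all field and file names here are invented):&lt;/p>

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical shape for a per-step progress file. The real skill
// writes markdown; this sketch just shows the coordination idea:
// the subagent writes state, the orchestrator polls it.
type StepProgress struct {
	Step      string
	Learnings []string
	Validated bool
	Blocked   bool
}

func (p StepProgress) render() string {
	out := fmt.Sprintf("## %s\n\nvalidated: %v\nblocked: %v\n\n", p.Step, p.Validated, p.Blocked)
	for _, l := range p.Learnings {
		out += "- " + l + "\n"
	}
	return out
}

func main() {
	dir, _ := os.MkdirTemp("", "progress")
	path := filepath.Join(dir, "step-01.md")

	// Subagent side: write current state to disk.
	p := StepProgress{Step: "Fix route join", Learnings: []string{"join used wrong FK"}, Validated: true}
	os.WriteFile(path, []byte(p.render()), 0o644)

	// Orchestrator side: inspect state without interrupting the worker.
	data, _ := os.ReadFile(path)
	fmt.Print(string(data))
}
```

&lt;p>Because the state lives in a plain file rather than in either agent&amp;rsquo;s context, either side can read it at any time, which is what makes it a communication channel rather than a log.&lt;/p>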
&lt;h2 id="deploy-and-monitor-turned-out-to-matter-as-much-as-the-code-fix">Deploy and monitor turned out to matter as much as the code fix&lt;/h2>
&lt;p>After the &lt;code>join&lt;/code> bug was fixed, the session deployed and monitored the change rather than assuming the local result was enough.&lt;/p>
&lt;p>That was the right instinct, because production behaviour changed but the feature was still broken.&lt;/p>
&lt;p>The &lt;code>500&lt;/code> was gone, but the UI still had no data to display.&lt;/p>
&lt;p>At that point the deploy-monitor loop became more valuable than the code-writing part. The relevant skills pushed changes, watched GitHub Actions runs, inspected CloudFormation state, and checked live AWS resources rather than treating deployment as an afterthought.&lt;/p>
&lt;p>This is the kind of thing I think current agent demos tend to skip over. Writing the patch is only part of the job. There is also:&lt;/p>
&lt;ul>
&lt;li>Recognising when a deploy did not run the thing you thought it ran&lt;/li>
&lt;li>Separating application deployment from datastore deployment&lt;/li>
&lt;li>Reading failed workflow logs rather than hand-waving&lt;/li>
&lt;li>Checking the actual data in the database instead of inferring too much from code&lt;/li>
&lt;/ul>
&lt;p>This session did all of that.&lt;/p>
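&lt;p>To make the loop concrete, here is a minimal, hypothetical sketch of a deploy-monitor loop in Go. The &lt;code>check&lt;/code> closure stands in for whatever really inspects the workflow run (the GitHub API, in this session&amp;rsquo;s case); the control flow is the point, not the integration:&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical deploy-monitor loop. The run states and function names
// are invented for illustration; the real skill shells out to watch
// GitHub Actions runs and CloudFormation state.
type runState string

const (
	pending runState = "pending"
	success runState = "success"
	failed  runState = "failed"
)

func watchDeploy(check func() runState, interval time.Duration, maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		switch check() {
		case success:
			return nil // deploy finished; still need to verify the symptom is gone
		case failed:
			return errors.New("workflow failed: go read the logs, do not hand-wave")
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for deploy")
}

func main() {
	polls := 0
	err := watchDeploy(func() runState {
		polls++
		if polls < 3 {
			return pending
		}
		return success
	}, time.Millisecond, 10)
	fmt.Println("err:", err)
}
```

&lt;p>The important property is that a completed deploy is not treated as a fixed bug; it just unlocks the next verification step.&lt;/p>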
&lt;h2 id="the-interesting-part-was-the-second-failure">The interesting part was the second failure&lt;/h2>
&lt;p>The second failure was the part that really sold me on the setup.&lt;/p>
&lt;p>Once the &lt;code>500&lt;/code> was fixed, the session started checking the live database and found that some required data was &lt;code>NULL&lt;/code> for every activity. That explained why viewport-filtered heatmap requests were empty. There were plenty of stream points in the database, but no bounding boxes to use for the map filtering logic.&lt;/p>
&lt;p>So the session moved from &amp;ldquo;fix a broken SQL join&amp;rdquo; into &amp;ldquo;design and ship a data backfill&amp;rdquo;. It added:&lt;/p>
&lt;ul>
&lt;li>A migration to backfill the data&lt;/li>
&lt;li>An ingestion-side update so future activities would populate bbox automatically&lt;/li>
&lt;li>A frontend timeout increase for the heavier map route fetch&lt;/li>
&lt;/ul>
&lt;p>Then it deployed again.&lt;/p>
&lt;p>And that still did not fully work.&lt;/p>
&lt;p>The deploy monitor caught that the migration step had failed in GitHub Actions. The error was not especially glamorous either: the one-shot backfill query was timing out under the RDS Data API migration runner. That is a very normal kind of production issue.&lt;/p>
&lt;p>What impressed me was what happened next. The session did not stop at &amp;ldquo;migration failed&amp;rdquo;. It went back into investigation mode, looked at the failed logs, inspected the migration runner, tested alternative SQL shapes against the live database, compared query plans, found that a per-activity indexed aggregation was materially better than a full table aggregation, rewrote the migration, redeployed, and monitored again until:&lt;/p>
&lt;ul>
&lt;li>The migration completed&lt;/li>
&lt;li>The schema migrations table showed the new migration as applied&lt;/li>
&lt;li>The live database reported the expected state&lt;/li>
&lt;/ul>
&lt;h2 id="why-this-session-mattered-to-me">Why this session mattered to me&lt;/h2>
&lt;p>The earlier evolving-skills post was mostly about intent; this one is practical.&lt;/p>
&lt;p>The parts I found most valuable were:&lt;/p>
&lt;ul>
&lt;li>It investigated before editing&lt;/li>
&lt;li>It separated planning from implementation&lt;/li>
&lt;li>It preserved working state in progress files&lt;/li>
&lt;li>It treated deployment and monitoring as first class work&lt;/li>
&lt;li>It adapted when the first fix changed the symptoms but did not solve the problem&lt;/li>
&lt;li>It validated against live AWS and Postgres data instead of trusting local reasoning alone&lt;/li>
&lt;/ul>
&lt;p>There are still obvious limitations.&lt;/p>
&lt;p>The workflow can be a bit ceremonial, especially once plan files, progress files and skill routing all get involved. It also still needed me in the loop for decisions like whether to keep a mixed commit and when to trigger the right deployment path for migration changes. And while the subagent and progress-file model works, I would not yet call it graceful.&lt;/p>
&lt;p>This is also a contrived example in some senses. It is a side project that doesn&amp;rsquo;t have a proper local dev environment with a DB (but it could), it can be broken in production (but it could have a dev/staging environment), and it is being used to learn as much as I can about these tools and to push them.&lt;/p>
&lt;p>On a more robust, production-grade project all of these steps and processes could still take place, just in a different environment; the problem would still be solved.&lt;/p>
&lt;p>I didn&amp;rsquo;t need to babysit this process; I was doing other things.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>If the previous post was about how the skills are evolving, this session was about what that evolution can already buy me.&lt;/p>
&lt;p>The best part was not any individual patch. It was the continuity of the work. The session kept enough context and enough discipline to move through distinct modes of engineering work: bug investigation, planning, implementation, deployment, failure analysis, data inspection, migration redesign and final verification.&lt;/p>
&lt;p>It managed all of that. Imperfectly, but convincingly.&lt;/p>
&lt;h3 id="get-in-contact">Get in contact&lt;/h3>
&lt;p>If you have comments, questions or better ways to do anything that I have discussed in this post then please get in contact via &lt;a href="https://linkedin.com/in/stuart-f-41a43b180">LinkedIn&lt;/a> or &lt;a href="mailto:stuart@uglydirtylittlestrawberry.co.uk">email&lt;/a>.&lt;/p></description></item><item><title>Evolving my skills strategy for Claude and Codex</title><link>https://www.uglydirtylittlestrawberry.co.uk/posts/evolving-my-skills-strategy-for-claude-and-codex/</link><pubDate>Fri, 03 Apr 2026 12:00:00 +0000</pubDate><guid>https://www.uglydirtylittlestrawberry.co.uk/posts/evolving-my-skills-strategy-for-claude-and-codex/</guid><description>&lt;p>Over the last couple of weeks I have been building out a set of personal skills for both Claude and Codex in a &lt;code>dotfile&lt;/code> style repo. I do not really think of these as clever prompts. I think of them as a way to make parts of my engineering process explicit: how I want planning to work, when I want review to happen, what should be delegated, what should be kept local, and where I want the agent to show a bit more discipline than &amp;ldquo;just have a go&amp;rdquo;.&lt;/p>
&lt;p>This post is a short walkthrough of how that setup has evolved and what I have learnt from it so far.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="#why-i-started-doing-this">Why I started doing this&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-first-pass-one-plugin-and-a-lot-of-structure">The first pass: one plugin and a lot of structure&lt;/a>&lt;/li>
&lt;li>&lt;a href="#splitting-heavyweight-and-lightweight-work">Splitting heavyweight and lightweight work&lt;/a>&lt;/li>
&lt;li>&lt;a href="#making-the-approach-work-across-claude-and-codex">Making the approach work across Claude and Codex&lt;/a>&lt;/li>
&lt;li>&lt;a href="#different-models-for-different-sub-agents">Different models for different sub agents&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-is-working-well">What is working well&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-still-needs-work">What still needs work&lt;/a>&lt;/li>
&lt;li>&lt;a href="#summary">Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="why-i-started-doing-this">Why I started doing this&lt;/h2>
&lt;p>The basic problem was pretty simple. Blank-slate prompting gets old quickly.&lt;/p>
&lt;p>If I sat down with Claude or Codex and asked for help on a non-trivial task, I found myself repeatedly restating the same preferences:&lt;/p>
&lt;ul>
&lt;li>Investigate first&lt;/li>
&lt;li>Do not ask lazy questions&lt;/li>
&lt;li>Be explicit about tradeoffs&lt;/li>
&lt;li>Separate planning from implementation when the task is risky&lt;/li>
&lt;li>Review changes properly before pushing&lt;/li>
&lt;/ul>
&lt;p>That is not especially interesting work for me to repeat, and it is not a very good interface either. If I already know the shape of the workflow I want, I would rather encode it once and reuse it.&lt;/p>
&lt;p>The other driver was consistency. If I ask an agent to review infrastructure changes one day and to plan a multi-step feature the next, I do not want both requests to be interpreted as the same kind of job. Skills give me a way to make that intent clearer up front.&lt;/p>
&lt;h2 id="the-first-pass-one-plugin-and-a-lot-of-structure">The first pass: one plugin and a lot of structure&lt;/h2>
&lt;p>The initial version was very opinionated very quickly. It introduced a personal Claude plugin and a cluster of larger workflow skills such as:&lt;/p>
&lt;ul>
&lt;li>&lt;code>/lets-work&lt;/code> for deep planning&lt;/li>
&lt;li>&lt;code>/implement-plan&lt;/code> for orchestrated execution&lt;/li>
&lt;li>&lt;code>/review&lt;/code> for a proper review and push gate&lt;/li>
&lt;li>&lt;code>/monitor-deploy&lt;/code> for a fix and redeploy loop&lt;/li>
&lt;li>&lt;code>/nix-project-init&lt;/code> for bootstrapping new projects&lt;/li>
&lt;li>&lt;code>/update-step-progress&lt;/code> for explicit progress tracking&lt;/li>
&lt;/ul>
&lt;p>Looking back, the interesting part is not that I made a lot of skills at once. It is that I was trying to encode a full delivery loop rather than one-off prompts. Planning was not just &amp;ldquo;give me a list&amp;rdquo;. It had self-validation, user questions in a batch, and an explicit wave structure so work could be parallelised later. Execution was not just &amp;ldquo;write the code&amp;rdquo;. It had progress files, wave boundaries and verification requirements.&lt;/p>
&lt;p>That structure has been useful, but it also taught me the first real lesson: if you only build heavyweight workflows, everything starts to look like a heavyweight workflow.&lt;/p>
&lt;h2 id="splitting-heavyweight-and-lightweight-work">Splitting heavyweight and lightweight work&lt;/h2>
&lt;p>A few days later I had added &lt;code>/quick-work&lt;/code> and &lt;code>/security-review&lt;/code>.&lt;/p>
&lt;p>That was a useful correction.&lt;/p>
&lt;p>&lt;code>/lets-work&lt;/code>, now renamed &lt;code>/technical-plan&lt;/code>, is good when the task is genuinely large, ambiguous or risky. It is overkill when the work is local, well-scoped and you mostly just need disciplined investigation followed by implementation. That is exactly the gap &lt;code>/quick-work&lt;/code> fills. It keeps the same quality bar around investigation and self-checking, but drops the ceremony of writing a plan file, managing waves and tracking progress on disk.&lt;/p>
&lt;p>&lt;code>/security-review&lt;/code> was another good lesson. Security concerns were previously mixed into broader review behaviour, which sounds tidy until you actually want an explicit gate for IAM, exposure, encryption and audit risks. Pulling that into its own skill made the intent much sharper.&lt;/p>
&lt;p>This was probably the point where the overall strategy started to feel more solid. Instead of one giant &amp;ldquo;be a great engineering assistant&amp;rdquo; instruction set, I had a smaller set of modes with clearer entry points:&lt;/p>
&lt;ul>
&lt;li>Deep planning&lt;/li>
&lt;li>Light but rigorous task execution&lt;/li>
&lt;li>Implementation from an existing plan&lt;/li>
&lt;li>Review&lt;/li>
&lt;li>Security review&lt;/li>
&lt;li>Deploy and monitor flows&lt;/li>
&lt;/ul>
&lt;p>That separation has made the system much easier to trust.&lt;/p>
&lt;h2 id="making-the-approach-work-across-claude-and-codex">Making the approach work across Claude and Codex&lt;/h2>
&lt;p>The next step was less about adding more workflows and more about making the workflows portable.&lt;/p>
&lt;p>After working with these for a few more tasks, a few changes landed in quick succession:&lt;/p>
&lt;ul>
&lt;li>The planning skill was refactored to pull shared rules into reference files&lt;/li>
&lt;li>A routing reference was added for subagent roles and model tiers&lt;/li>
&lt;li>Separate model mapping references were added for Claude and Codex&lt;/li>
&lt;/ul>
&lt;p>This is the part of the evolution I find most useful.&lt;/p>
&lt;p>The current plugin structure reflects that shift as well. I now have a shared plugin manifest which renders both Claude and Codex plugin files from one source. That is the right direction. If the workflow is the product, I do not want two drifting copies of it.&lt;/p>
&lt;p>Being able to reuse the same workflows across models, even when they do not produce exactly the same results, is useful in itself. Some models are plainly better than others at particular kinds of work, and model limits have a habit of running out at different times. Having skills that can move across runtimes gives me a bit more resilience there.&lt;/p>
&lt;h2 id="different-models-for-different-sub-agents">Different models for different sub agents&lt;/h2>
&lt;p>Earlier versions were still heavily shaped by one runtime. Within a few weeks the newer setup had become much more explicit about the abstraction boundary. The skill decides things like:&lt;/p>
&lt;ul>
&lt;li>What kind of subtask this is&lt;/li>
&lt;li>What evidence depth it needs&lt;/li>
&lt;li>What role should handle it&lt;/li>
&lt;li>What abstract model tier makes sense&lt;/li>
&lt;/ul>
&lt;p>Only after that does it translate the decision into runtime-specific choices for Claude or Codex.&lt;/p>
&lt;p>The intention here is pretty practical. I want the skill to pick the right model for the task often enough that I get a better balance of speed, cost and output quality. In theory that means preferring a cheaper lighter model when the work is narrow or the evidence is simple, then upgrading when the first pass comes back weak or uncertain. I am not convinced that balance is really solved yet, but it is at least explicit now rather than accidental.&lt;/p>
&lt;p>My experience so far has shown some promise. There have been a few good sessions where Haiku has been used for bounded fact discovery in sub-agents and done exactly what I wanted. That is nice to see because it suggests the routing can sometimes keep the expensive reasoning for where it is actually needed instead of spending it everywhere. The harder bit is deciding when the first output is not good enough and should be escalated. I have tried to encourage that behaviour in the skill, but I do not think it is fully reliable yet.&lt;/p>
&lt;p>I also added a guardrails layer on top of the routing. The guardrails file is per project, so depending on the repo or environment I can block particular models entirely and define fallbacks. That gives me another lever when a project has cost constraints, environment-specific limits or just a model that I do not want used there for some reason. I like this because it keeps the routing policy mostly stable while still allowing local constraints to win.&lt;/p>
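&lt;p>As a sketch of how that resolution order could work, with all model and tier names invented for illustration, the guardrails layer simply gets the last word after the routing and the runtime mapping have spoken:&lt;/p>

```go
package main

import "fmt"

// Hypothetical resolution order: the routing picks an abstract tier,
// the runtime mapping turns it into a model name, and per-project
// guardrails can veto that model and supply a fallback. None of these
// names come from the real skill files.
type guardrails struct {
	blocked  map[string]bool
	fallback string
}

// Runtime-specific mapping from abstract tier to concrete model.
var tierToModel = map[string]string{
	"light": "fast-model",
	"deep":  "reasoning-model",
}

func resolveModel(tier string, g guardrails) string {
	model := tierToModel[tier]
	if g.blocked[model] {
		return g.fallback // local project constraints win over the routing policy
	}
	return model
}

func main() {
	g := guardrails{blocked: map[string]bool{"reasoning-model": true}, fallback: "fast-model"}
	fmt.Println(resolveModel("deep", g))  // vetoed by guardrails, falls back
	fmt.Println(resolveModel("light", g)) // allowed as routed
}
```

&lt;p>Keeping the veto in a separate per-project layer is what lets the routing policy stay stable across repos.&lt;/p>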
&lt;p>I like this overall direction because it keeps the reasoning about the work separate from the quirks of the model vendor. The routing policy is the thing I actually care about. Model mapping should be an implementation detail.&lt;/p>
&lt;h2 id="what-is-working-well">What is working well&lt;/h2>
&lt;p>A few things are working particularly well at the moment.&lt;/p>
&lt;h3 id="repeatability">Repeatability&lt;/h3>
&lt;p>The biggest win is that I no longer need to restate the same expectations in every session. The agent starts much closer to how I actually want to work.&lt;/p>
&lt;h3 id="clearer-task-framing">Clearer task framing&lt;/h3>
&lt;p>Choosing between &lt;code>/technical-plan&lt;/code>, &lt;code>/quick-work&lt;/code>, &lt;code>/review&lt;/code> or &lt;code>/security-review&lt;/code> forces me to be clearer about the job itself. That sounds small, but it has a real effect on output quality.&lt;/p>
&lt;h3 id="better-decomposition">Better decomposition&lt;/h3>
&lt;p>The subagent routing work has improved how larger tasks get broken down. Even when I disagree with the output, the structure makes it much easier to see why the agent made a choice and where the decision should be adjusted.&lt;/p>
&lt;h3 id="runtime-portability">Runtime portability&lt;/h3>
&lt;p>Having shared workflow definitions and separate runtime mappings feels much healthier than rewriting the same intent for Claude and Codex independently. It reduces prompt drift, gives me one place to refine the actual method, and makes it easier to move between models when one is a better fit for the task or simply unavailable.&lt;/p>
&lt;h2 id="what-still-needs-work">What still needs work&lt;/h2>
&lt;p>There are still a few obvious problems.&lt;/p>
&lt;h3 id="it-can-still-get-too-ceremonial">It can still get too ceremonial&lt;/h3>
&lt;p>I like structure, but there is a point where structure becomes friction. I have improved this with &lt;code>/quick-work&lt;/code>, but there are still cases where the system wants to act like a mini operating model when a sharp local change would do.&lt;/p>
&lt;h3 id="the-skills-themselves-are-becoming-a-system-to-maintain">The skills themselves are becoming a system to maintain&lt;/h3>
&lt;p>This is now real software, even if it is written in markdown and shell scripts. There are dependencies between skills, shared references, hooks, manifests and runtime mappings. That is powerful, but it also means the maintenance burden is real. If I am not careful, I end up needing tooling to manage the tooling.&lt;/p>
&lt;h3 id="claude-and-codex-are-not-actually-identical">Claude and Codex are not actually identical&lt;/h3>
&lt;p>Shared manifests and mapping files help a lot, but runtime parity is not free. The two environments have different strengths, different tool surfaces and different rough edges. A skill can abstract some of that, but not all of it.&lt;/p>
&lt;h3 id="i-still-need-better-feedback-loops">I still need better feedback loops&lt;/h3>
&lt;p>At the moment a lot of my judgement is qualitative. I can say a workflow feels better, or that it reduced prompt repetition, or that a review was more thorough. What I do not yet have is a very good lightweight way of measuring which skills genuinely improve outcomes and which ones mostly make me feel organised.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The main thing I have learnt is that skills are most useful when they encode judgement, constraints and workflow boundaries, not when they just wrap a fancy prompt around a generic request.&lt;/p>
&lt;p>More recently I have also started adding narrower skills like &lt;code>project-knowledge&lt;/code>, and that has reinforced another lesson: the most valuable skills are usually the ones that encode specific judgement in a repeatable setting, not the ones that try to be universally smart.&lt;/p>
&lt;p>The best parts of this setup so far are the parts that make expectations explicit: investigate first, separate heavy and light work, make review a real gate, route subtasks deliberately, and keep the workflow portable across runtimes where possible.&lt;/p>
&lt;p>The part to watch is complexity. Once you start building a proper system around skills, it is very easy to create a second job for yourself maintaining the meta-layer. So that is the balance I am trying to keep now: more reusable judgement, less unnecessary ceremony.&lt;/p>
&lt;h3 id="get-in-contact">Get in contact&lt;/h3>
&lt;p>If you have comments, questions or better ways to do anything that I have discussed in this post then please get in contact via &lt;a href="https://linkedin.com/in/stuart-f-41a43b180">LinkedIn&lt;/a> or &lt;a href="mailto:stuart@uglydirtylittlestrawberry.co.uk">email&lt;/a>.&lt;/p></description></item></channel></rss>