<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Claude on Stuart Forrest</title><link>https://www.uglydirtylittlestrawberry.co.uk/tags/claude/</link><description>Recent content in Claude on Stuart Forrest</description><generator>Hugo -- gohugo.io</generator><language>en-uk</language><lastBuildDate>Sat, 04 Apr 2026 10:12:00 +0000</lastBuildDate><atom:link href="https://www.uglydirtylittlestrawberry.co.uk/tags/claude/" rel="self" type="application/rss+xml"/><item><title>What my evolving Claude and Codex skills can actually produce</title><link>https://www.uglydirtylittlestrawberry.co.uk/posts/what-my-evolving-claude-and-codex-skills-can-actually-produce/</link><pubDate>Sat, 04 Apr 2026 10:12:00 +0000</pubDate><guid>https://www.uglydirtylittlestrawberry.co.uk/posts/what-my-evolving-claude-and-codex-skills-can-actually-produce/</guid><description>&lt;p>In my earlier post on &lt;a href="https://www.uglydirtylittlestrawberry.co.uk/posts/evolving-my-skills-strategy-for-claude-and-codex/">evolving my skills strategy for Claude and Codex&lt;/a> I mostly wrote about structure: why I started building these skills, how I split heavier and lighter workflows, and how I am trying to make the whole setup portable across runtimes.&lt;/p>
&lt;p>That is useful context, but it is still mostly about the system in the abstract.&lt;/p>
&lt;p>This post is about a real session where that system did something genuinely impressive.&lt;/p>
&lt;p>The task started small enough: an activities heatmap on one of my side projects was broken. The route endpoint was returning a &lt;code>500&lt;/code> and then, after the first fix, returning no useful data. What followed was not just &amp;ldquo;the model wrote some code&amp;rdquo;. It investigated the bug, wrote a plan, handed work off through subagents, tracked progress on disk in markdown files, deployed changes, monitored those deploys, recognised when the deploy had not actually fixed the underlying problem, went into the database to inspect live data, found a second issue, fixed that, redeployed, watched the migrations run, and kept going until the issue was actually resolved.&lt;/p>
&lt;p>It is not perfect. There was still some ceremony, and there were a couple of places where the orchestration needed a human nudge. But as an example of what these evolving skills can already produce, I thought it was worth writing up.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="#the-session-started-with-a-normal-bug-report">The session started with a normal bug report&lt;/a>&lt;/li>
&lt;li>&lt;a href="#planning-and-implementation-were-treated-as-separate-jobs">Planning and implementation were treated as separate jobs&lt;/a>&lt;/li>
&lt;li>&lt;a href="#progress-files-became-the-handoff-mechanism">Progress files became the handoff mechanism&lt;/a>&lt;/li>
&lt;li>&lt;a href="#deploy-and-monitor-turned-out-to-matter-as-much-as-the-code-fix">Deploy and monitor turned out to matter as much as the code fix&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-interesting-part-was-the-second-failure">The interesting part was the second failure&lt;/a>&lt;/li>
&lt;li>&lt;a href="#why-this-session-mattered-to-me">Why this session mattered to me&lt;/a>&lt;/li>
&lt;li>&lt;a href="#summary">Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="the-session-started-with-a-normal-bug-report">The session started with a normal bug report&lt;/h2>
&lt;p>The opening problem was pretty ordinary. A map heatmap was broken and the route fetch looked like it was failing around one store call:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">routes&lt;/span>, &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">h&lt;/span>.&lt;span style="color:#a6e22e">store&lt;/span>.&lt;span style="color:#a6e22e">GetActivityRoutes&lt;/span>(&lt;span style="color:#a6e22e">ctx&lt;/span>, &lt;span style="color:#a6e22e">params&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">nil&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#a6e22e">errorResponse&lt;/span>(&lt;span style="color:#a6e22e">http&lt;/span>.&lt;span style="color:#a6e22e">StatusInternalServerError&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;internal error&amp;#34;&lt;/span>), &lt;span style="color:#66d9ef">nil&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The top-level skill first investigated the code and schema rather than taking the line in the handler at face value. It found that the heatmap query was joining on the wrong foreign keys (doh!). That is a real bug, and fixing it stopped the &lt;code>500&lt;/code>.&lt;/p>
&lt;p>That would already have been a decent assisted debugging session. But it was only the first layer of the problem.&lt;/p>
&lt;h2 id="planning-and-implementation-were-treated-as-separate-jobs">Planning and implementation were treated as separate jobs&lt;/h2>
&lt;p>Instead of rolling straight from investigation into a loose stream of edits, the session went through a proper planning phase using the heavier planning skill.&lt;/p>
&lt;p>That matters more than it sounds.&lt;/p>
&lt;p>The planning skill produced a concrete implementation plan rather than a vague checklist. It wrote the plan to disk, broke the work into waves and tasks, and framed the likely fix as:&lt;/p>
&lt;ul>
&lt;li>Correct the join in the route query&lt;/li>
&lt;li>Add regression coverage&lt;/li>
&lt;li>Improve observability around the route failure&lt;/li>
&lt;/ul>
&lt;p>After that the implementation skill picked up the plan and executed it rather than starting from scratch. The important thing here is not just that there were two skills involved. It is that the second skill inherited a shaped piece of work: a plan with explicit tasks and expectations, not a raw user request.&lt;/p>
&lt;p>That separation is one of the things I was getting at in the earlier post. Skills are most useful when they encode workflow boundaries. This was a good example of that in practice.&lt;/p>
&lt;h2 id="progress-files-became-the-handoff-mechanism">Progress files became the handoff mechanism&lt;/h2>
&lt;p>One part of this setup I still find slightly ridiculous and surprisingly effective at the same time is the use of progress files.&lt;/p>
&lt;p>The implementation flow wrote step progress out to files on disk. Those files tracked:&lt;/p>
&lt;ul>
&lt;li>What step was currently being worked on&lt;/li>
&lt;li>What had been learnt while doing it&lt;/li>
&lt;li>What had been validated&lt;/li>
&lt;li>Whether the step was blocked or needed review&lt;/li>
&lt;/ul>
&lt;p>That gave the session a durable coordination mechanism. The top-level agent could inspect what had happened, decide whether the task was complete, and then either continue or stop at a review boundary. All of this could happen whilst the subagent was still working, because the progress file acts as a communication channel. When the work expanded beyond the original fix, those files also acted as a memory aid for what had already been tried and what still needed doing.&lt;/p>
&lt;p>I would not claim this is elegant; it is a little clunky. But it is exactly the sort of thing I mean when I say the skills are becoming a small software system rather than a pile of reusable prompts. The coordination is explicit. It is inspectable. And when something goes sideways, there is at least some state to recover from.&lt;/p>
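&lt;p>As a rough illustration of the idea, not the actual skill implementation, a progress-file round trip might look like this in Go (all field and file names here are invented):&lt;/p>

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical shape for a per-step progress file. The real skill
// writes markdown; this sketch just shows the coordination idea:
// the subagent writes state, the orchestrator polls it.
type StepProgress struct {
	Step      string
	Learnings []string
	Validated bool
	Blocked   bool
}

func (p StepProgress) render() string {
	out := fmt.Sprintf("## %s\n\nvalidated: %v\nblocked: %v\n\n", p.Step, p.Validated, p.Blocked)
	for _, l := range p.Learnings {
		out += "- " + l + "\n"
	}
	return out
}

func main() {
	dir, _ := os.MkdirTemp("", "progress")
	path := filepath.Join(dir, "step-01.md")

	// Subagent side: write current state to disk.
	p := StepProgress{Step: "Fix route join", Learnings: []string{"join used wrong FK"}, Validated: true}
	os.WriteFile(path, []byte(p.render()), 0o644)

	// Orchestrator side: inspect state without interrupting the worker.
	data, _ := os.ReadFile(path)
	fmt.Print(string(data))
}
```

&lt;p>Because the state lives in a plain file rather than in either agent&amp;rsquo;s context, either side can read it at any time, which is what makes it a communication channel rather than a log.&lt;/p>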
&lt;h2 id="deploy-and-monitor-turned-out-to-matter-as-much-as-the-code-fix">Deploy and monitor turned out to matter as much as the code fix&lt;/h2>
&lt;p>After the &lt;code>join&lt;/code> bug was fixed, the session deployed and monitored the change rather than assuming the local result was enough.&lt;/p>
&lt;p>That was the right instinct, because production behaviour changed but the feature was still broken.&lt;/p>
&lt;p>The &lt;code>500&lt;/code> was gone, but the UI still had no data to display.&lt;/p>
&lt;p>At that point the deploy-monitor loop became more valuable than the code-writing part. The relevant skills pushed changes, watched GitHub Actions runs, inspected CloudFormation state, and checked live AWS resources rather than treating deployment as an afterthought.&lt;/p>
&lt;p>This is the kind of thing I think current agent demos tend to skip over. Writing the patch is only part of the job. There is also:&lt;/p>
&lt;ul>
&lt;li>Recognising when a deploy did not run the thing you thought it ran&lt;/li>
&lt;li>Separating application deployment from datastore deployment&lt;/li>
&lt;li>Reading failed workflow logs rather than hand-waving&lt;/li>
&lt;li>Checking the actual data in the database instead of inferring too much from code&lt;/li>
&lt;/ul>
&lt;p>This session did all of that.&lt;/p>
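&lt;p>To make the loop concrete, here is a minimal, hypothetical sketch of a deploy-monitor loop in Go. The &lt;code>check&lt;/code> closure stands in for whatever really inspects the workflow run (the GitHub API, in this session&amp;rsquo;s case); the control flow is the point, not the integration:&lt;/p>

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical deploy-monitor loop. The run states and function names
// are invented for illustration; the real skill shells out to watch
// GitHub Actions runs and CloudFormation state.
type runState string

const (
	pending runState = "pending"
	success runState = "success"
	failed  runState = "failed"
)

func watchDeploy(check func() runState, interval time.Duration, maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		switch check() {
		case success:
			return nil // deploy finished; still need to verify the symptom is gone
		case failed:
			return errors.New("workflow failed: go read the logs, do not hand-wave")
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for deploy")
}

func main() {
	polls := 0
	err := watchDeploy(func() runState {
		polls++
		if polls < 3 {
			return pending
		}
		return success
	}, time.Millisecond, 10)
	fmt.Println("err:", err)
}
```

&lt;p>The important property is that a completed deploy is not treated as a fixed bug; it just unlocks the next verification step.&lt;/p>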
&lt;h2 id="the-interesting-part-was-the-second-failure">The interesting part was the second failure&lt;/h2>
&lt;p>The second failure was the part that really sold me on the setup.&lt;/p>
&lt;p>Once the &lt;code>500&lt;/code> was fixed, the session started checking the live database and found that some required data was &lt;code>NULL&lt;/code> for every activity. That explained why viewport-filtered heatmap requests were empty. There were plenty of stream points in the database, but no bounding boxes to use for the map filtering logic.&lt;/p>
&lt;p>So the session moved from &amp;ldquo;fix a broken SQL join&amp;rdquo; into &amp;ldquo;design and ship a data backfill&amp;rdquo;. It added:&lt;/p>
&lt;ul>
&lt;li>A migration to backfill the data&lt;/li>
&lt;li>An ingestion-side update so future activities would populate bbox automatically&lt;/li>
&lt;li>A frontend timeout increase for the heavier map route fetch&lt;/li>
&lt;/ul>
&lt;p>Then it deployed again.&lt;/p>
&lt;p>And that still did not fully work.&lt;/p>
&lt;p>The deploy monitor caught that the migration step had failed in GitHub Actions. The error was not especially glamorous either: the one-shot backfill query was timing out under the RDS Data API migration runner. That is a very normal kind of production issue.&lt;/p>
&lt;p>What impressed me was what happened next. The session did not stop at &amp;ldquo;migration failed&amp;rdquo;. It went back into investigation mode, looked at the failed logs, inspected the migration runner, tested alternative SQL shapes against the live database, compared query plans, found that a per-activity indexed aggregation was materially better than a full table aggregation, rewrote the migration, redeployed, and monitored again until:&lt;/p>
&lt;ul>
&lt;li>The migration completed&lt;/li>
&lt;li>The schema migrations table showed the new migration as applied&lt;/li>
&lt;li>The live database reported the expected state&lt;/li>
&lt;/ul>
&lt;h2 id="why-this-session-mattered-to-me">Why this session mattered to me&lt;/h2>
&lt;p>The earlier evolving-skills post was mostly about intent; this one is practical.&lt;/p>
&lt;p>The parts I found most valuable were:&lt;/p>
&lt;ul>
&lt;li>It investigated before editing&lt;/li>
&lt;li>It separated planning from implementation&lt;/li>
&lt;li>It preserved working state in progress files&lt;/li>
&lt;li>It treated deployment and monitoring as first class work&lt;/li>
&lt;li>It adapted when the first fix changed the symptoms but did not solve the problem&lt;/li>
&lt;li>It validated against live AWS and Postgres data instead of trusting local reasoning alone&lt;/li>
&lt;/ul>
&lt;p>There are still obvious limitations.&lt;/p>
&lt;p>The workflow can be a bit ceremonial, especially once plan files, progress files and skill routing all get involved. It also still needed me in the loop for decisions like whether to keep a mixed commit and when to trigger the right deployment path for migration changes. And while the subagent and progress-file model works, I would not yet call it graceful.&lt;/p>
&lt;p>This is also a contrived example in some senses. It is a side project that doesn&amp;rsquo;t have a proper local dev environment with a DB (but it could), it can be broken in production (but it could have a dev/staging environment), and it is being used to learn as much as I can about these tools and to push them.&lt;/p>
&lt;p>On a more robust, production-grade project all of these steps and processes could still take place, just in a different environment; the problem would still be solved.&lt;/p>
&lt;p>I didn&amp;rsquo;t need to babysit this process; I was doing other things.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>If the previous post was about how the skills are evolving, this session was about what that evolution can already buy me.&lt;/p>
&lt;p>The best part was not any individual patch. It was the continuity of the work. The session kept enough context and enough discipline to move through distinct modes of engineering work: bug investigation, planning, implementation, deployment, failure analysis, data inspection, migration redesign and final verification.&lt;/p>
&lt;p>It managed all of that. Imperfectly, but convincingly.&lt;/p>
&lt;h3 id="get-in-contact">Get in contact&lt;/h3>
&lt;p>If you have comments, questions or better ways to do anything that I have discussed in this post then please get in contact via &lt;a href="https://linkedin.com/in/stuart-f-41a43b180">LinkedIn&lt;/a> or &lt;a href="mailto:stuart@uglydirtylittlestrawberry.co.uk">email&lt;/a>.&lt;/p></description></item><item><title>Evolving my skills strategy for Claude and Codex</title><link>https://www.uglydirtylittlestrawberry.co.uk/posts/evolving-my-skills-strategy-for-claude-and-codex/</link><pubDate>Fri, 03 Apr 2026 12:00:00 +0000</pubDate><guid>https://www.uglydirtylittlestrawberry.co.uk/posts/evolving-my-skills-strategy-for-claude-and-codex/</guid><description>&lt;p>Over the last couple of weeks I have been building out a set of personal skills for both Claude and Codex in a &lt;code>dotfile&lt;/code> style repo. I do not really think of these as clever prompts. I think of them as a way to make parts of my engineering process explicit: how I want planning to work, when I want review to happen, what should be delegated, what should be kept local, and where I want the agent to show a bit more discipline than &amp;ldquo;just have a go&amp;rdquo;.&lt;/p>
&lt;p>This post is a short walkthrough of how that setup has evolved and what I have learnt from it so far.&lt;/p>
&lt;ul>
&lt;li>&lt;a href="#why-i-started-doing-this">Why I started doing this&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-first-pass-one-plugin-and-a-lot-of-structure">The first pass: one plugin and a lot of structure&lt;/a>&lt;/li>
&lt;li>&lt;a href="#splitting-heavyweight-and-lightweight-work">Splitting heavyweight and lightweight work&lt;/a>&lt;/li>
&lt;li>&lt;a href="#making-the-approach-work-across-claude-and-codex">Making the approach work across Claude and Codex&lt;/a>&lt;/li>
&lt;li>&lt;a href="#different-models-for-different-sub-agents">Different models for different sub agents&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-is-working-well">What is working well&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-still-needs-work">What still needs work&lt;/a>&lt;/li>
&lt;li>&lt;a href="#summary">Summary&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="why-i-started-doing-this">Why I started doing this&lt;/h2>
&lt;p>The basic problem was pretty simple. Blank-slate prompting gets old quickly.&lt;/p>
&lt;p>If I sat down with Claude or Codex and asked for help on a non-trivial task, I found myself repeatedly restating the same preferences:&lt;/p>
&lt;ul>
&lt;li>Investigate first&lt;/li>
&lt;li>Do not ask lazy questions&lt;/li>
&lt;li>Be explicit about tradeoffs&lt;/li>
&lt;li>Separate planning from implementation when the task is risky&lt;/li>
&lt;li>Review changes properly before pushing&lt;/li>
&lt;/ul>
&lt;p>That is not especially interesting work for me to repeat, and it is not a very good interface either. If I already know the shape of the workflow I want, I would rather encode it once and reuse it.&lt;/p>
&lt;p>The other driver was consistency. If I ask an agent to review infrastructure changes one day and to plan a multi-step feature the next, I do not want both requests to be interpreted as the same kind of job. Skills give me a way to make that intent clearer up front.&lt;/p>
&lt;h2 id="the-first-pass-one-plugin-and-a-lot-of-structure">The first pass: one plugin and a lot of structure&lt;/h2>
&lt;p>The initial version was very opinionated very quickly. It introduced a personal Claude plugin and a cluster of larger workflow skills such as:&lt;/p>
&lt;ul>
&lt;li>&lt;code>/lets-work&lt;/code> for deep planning&lt;/li>
&lt;li>&lt;code>/implement-plan&lt;/code> for orchestrated execution&lt;/li>
&lt;li>&lt;code>/review&lt;/code> for a proper review and push gate&lt;/li>
&lt;li>&lt;code>/monitor-deploy&lt;/code> for a fix and redeploy loop&lt;/li>
&lt;li>&lt;code>/nix-project-init&lt;/code> for bootstrapping new projects&lt;/li>
&lt;li>&lt;code>/update-step-progress&lt;/code> for explicit progress tracking&lt;/li>
&lt;/ul>
&lt;p>Looking back, the interesting part is not that I made a lot of skills at once. It is that I was trying to encode a full delivery loop rather than one-off prompts. Planning was not just &amp;ldquo;give me a list&amp;rdquo;. It had self-validation, user questions in a batch, and an explicit wave structure so work could be parallelised later. Execution was not just &amp;ldquo;write the code&amp;rdquo;. It had progress files, wave boundaries and verification requirements.&lt;/p>
&lt;p>That structure has been useful, but it also taught me the first real lesson: if you only build heavyweight workflows, everything starts to look like a heavyweight workflow.&lt;/p>
&lt;h2 id="splitting-heavyweight-and-lightweight-work">Splitting heavyweight and lightweight work&lt;/h2>
&lt;p>A few days later I had added &lt;code>/quick-work&lt;/code> and &lt;code>/security-review&lt;/code>.&lt;/p>
&lt;p>That was a useful correction.&lt;/p>
&lt;p>&lt;code>/lets-work&lt;/code>, now renamed &lt;code>/technical-plan&lt;/code>, is good when the task is genuinely large, ambiguous or risky. It is overkill when the work is local, well-scoped and you mostly just need disciplined investigation followed by implementation. That is exactly the gap &lt;code>/quick-work&lt;/code> fills. It keeps the same quality bar around investigation and self-checking, but drops the ceremony of writing a plan file, managing waves and tracking progress on disk.&lt;/p>
&lt;p>&lt;code>/security-review&lt;/code> was another good lesson. Security concerns were previously mixed into broader review behaviour, which sounds tidy until you actually want an explicit gate for IAM, exposure, encryption and audit risks. Pulling that into its own skill made the intent much sharper.&lt;/p>
&lt;p>This was probably the point where the overall strategy started to feel more solid. Instead of one giant &amp;ldquo;be a great engineering assistant&amp;rdquo; instruction set, I had a smaller set of modes with clearer entry points:&lt;/p>
&lt;ul>
&lt;li>Deep planning&lt;/li>
&lt;li>Light but rigorous task execution&lt;/li>
&lt;li>Implementation from an existing plan&lt;/li>
&lt;li>Review&lt;/li>
&lt;li>Security review&lt;/li>
&lt;li>Deploy and monitor flows&lt;/li>
&lt;/ul>
&lt;p>That separation has made the system much easier to trust.&lt;/p>
&lt;h2 id="making-the-approach-work-across-claude-and-codex">Making the approach work across Claude and Codex&lt;/h2>
&lt;p>The next step was less about adding more workflows and more about making the workflows portable.&lt;/p>
&lt;p>After working with these for a few more tasks, a few changes landed in quick succession:&lt;/p>
&lt;ul>
&lt;li>The planning skill was refactored to pull shared rules into reference files&lt;/li>
&lt;li>A routing reference was added for subagent roles and model tiers&lt;/li>
&lt;li>Separate model mapping references were added for Claude and Codex&lt;/li>
&lt;/ul>
&lt;p>This is the part of the evolution I find most useful.&lt;/p>
&lt;p>The current plugin structure reflects that shift as well. I now have a shared plugin manifest which renders both Claude and Codex plugin files from one source. That is the right direction. If the workflow is the product, I do not want two drifting copies of it.&lt;/p>
&lt;p>Being able to reuse the same workflows across models, even when they do not produce exactly the same results, is useful in itself. Some models are plainly better than others at particular kinds of work, and model limits have a habit of running out at different times. Having skills that can move across runtimes gives me a bit more resilience there.&lt;/p>
&lt;h2 id="different-models-for-different-sub-agents">Different models for different sub agents&lt;/h2>
&lt;p>Earlier versions were still heavily shaped by one runtime. Within a few weeks the newer setup had become much more explicit about the abstraction boundary. The skill decides things like:&lt;/p>
&lt;ul>
&lt;li>What kind of subtask this is&lt;/li>
&lt;li>What evidence depth it needs&lt;/li>
&lt;li>What role should handle it&lt;/li>
&lt;li>What abstract model tier makes sense&lt;/li>
&lt;/ul>
&lt;p>Only after that does it translate the decision into runtime-specific choices for Claude or Codex.&lt;/p>
&lt;p>The intention here is pretty practical. I want the skill to pick the right model for the task often enough that I get a better balance of speed, cost and output quality. In theory that means preferring a cheaper lighter model when the work is narrow or the evidence is simple, then upgrading when the first pass comes back weak or uncertain. I am not convinced that balance is really solved yet, but it is at least explicit now rather than accidental.&lt;/p>
&lt;p>My experience so far has shown some promise. There have been a few good sessions where Haiku has been used for bounded fact discovery in sub-agents and done exactly what I wanted. That is nice to see because it suggests the routing can sometimes keep the expensive reasoning for where it is actually needed instead of spending it everywhere. The harder bit is deciding when the first output is not good enough and should be escalated. I have tried to encourage that behaviour in the skill, but I do not think it is fully reliable yet.&lt;/p>
&lt;p>I also added a guardrails layer on top of the routing. The guardrails file is per project, so depending on the repo or environment I can block particular models entirely and define fallbacks. That gives me another lever when a project has cost constraints, environment-specific limits or just a model that I do not want used there for some reason. I like this because it keeps the routing policy mostly stable while still allowing local constraints to win.&lt;/p>
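&lt;p>As a sketch of how that resolution order could work, with all model and tier names invented for illustration, the guardrails layer simply gets the last word after the routing and the runtime mapping have spoken:&lt;/p>

```go
package main

import "fmt"

// Hypothetical resolution order: the routing picks an abstract tier,
// the runtime mapping turns it into a model name, and per-project
// guardrails can veto that model and supply a fallback. None of these
// names come from the real skill files.
type guardrails struct {
	blocked  map[string]bool
	fallback string
}

// Runtime-specific mapping from abstract tier to concrete model.
var tierToModel = map[string]string{
	"light": "fast-model",
	"deep":  "reasoning-model",
}

func resolveModel(tier string, g guardrails) string {
	model := tierToModel[tier]
	if g.blocked[model] {
		return g.fallback // local project constraints win over the routing policy
	}
	return model
}

func main() {
	g := guardrails{blocked: map[string]bool{"reasoning-model": true}, fallback: "fast-model"}
	fmt.Println(resolveModel("deep", g))  // vetoed by guardrails, falls back
	fmt.Println(resolveModel("light", g)) // allowed as routed
}
```

&lt;p>Keeping the veto in a separate per-project layer is what lets the routing policy stay stable across repos.&lt;/p>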
&lt;p>I like this overall direction because it keeps the reasoning about the work separate from the quirks of the model vendor. The routing policy is the thing I actually care about. Model mapping should be an implementation detail.&lt;/p>
&lt;h2 id="what-is-working-well">What is working well&lt;/h2>
&lt;p>A few things are working particularly well at the moment.&lt;/p>
&lt;h3 id="repeatability">Repeatability&lt;/h3>
&lt;p>The biggest win is that I no longer need to restate the same expectations in every session. The agent starts much closer to how I actually want to work.&lt;/p>
&lt;h3 id="clearer-task-framing">Clearer task framing&lt;/h3>
&lt;p>Choosing between &lt;code>/technical-plan&lt;/code>, &lt;code>/quick-work&lt;/code>, &lt;code>/review&lt;/code> or &lt;code>/security-review&lt;/code> forces me to be clearer about the job itself. That sounds small, but it has a real effect on output quality.&lt;/p>
&lt;h3 id="better-decomposition">Better decomposition&lt;/h3>
&lt;p>The subagent routing work has improved how larger tasks get broken down. Even when I disagree with the output, the structure makes it much easier to see why the agent made a choice and where the decision should be adjusted.&lt;/p>
&lt;h3 id="runtime-portability">Runtime portability&lt;/h3>
&lt;p>Having shared workflow definitions and separate runtime mappings feels much healthier than rewriting the same intent for Claude and Codex independently. It reduces prompt drift, gives me one place to refine the actual method, and makes it easier to move between models when one is a better fit for the task or simply unavailable.&lt;/p>
&lt;h2 id="what-still-needs-work">What still needs work&lt;/h2>
&lt;p>There are still a few obvious problems.&lt;/p>
&lt;h3 id="it-can-still-get-too-ceremonial">It can still get too ceremonial&lt;/h3>
&lt;p>I like structure, but there is a point where structure becomes friction. I have improved this with &lt;code>/quick-work&lt;/code>, but there are still cases where the system wants to act like a mini operating model when a sharp local change would do.&lt;/p>
&lt;h3 id="the-skills-themselves-are-becoming-a-system-to-maintain">The skills themselves are becoming a system to maintain&lt;/h3>
&lt;p>This is now real software, even if it is written in markdown and shell scripts. There are dependencies between skills, shared references, hooks, manifests and runtime mappings. That is powerful, but it also means the maintenance burden is real. If I am not careful, I end up needing tooling to manage the tooling.&lt;/p>
&lt;h3 id="claude-and-codex-are-not-actually-identical">Claude and Codex are not actually identical&lt;/h3>
&lt;p>Shared manifests and mapping files help a lot, but runtime parity is not free. The two environments have different strengths, different tool surfaces and different rough edges. A skill can abstract some of that, but not all of it.&lt;/p>
&lt;h3 id="i-still-need-better-feedback-loops">I still need better feedback loops&lt;/h3>
&lt;p>At the moment a lot of my judgement is qualitative. I can say a workflow feels better, or that it reduced prompt repetition, or that a review was more thorough. What I do not yet have is a very good lightweight way of measuring which skills genuinely improve outcomes and which ones mostly make me feel organised.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>The main thing I have learnt is that skills are most useful when they encode judgement, constraints and workflow boundaries, not when they just wrap a fancy prompt around a generic request.&lt;/p>
&lt;p>More recently I have also started adding narrower skills like &lt;code>project-knowledge&lt;/code>, and that has reinforced another lesson: the most valuable skills are usually the ones that encode specific judgement in a repeatable setting, not the ones that try to be universally smart.&lt;/p>
&lt;p>The best parts of this setup so far are the parts that make expectations explicit: investigate first, separate heavy and light work, make review a real gate, route subtasks deliberately, and keep the workflow portable across runtimes where possible.&lt;/p>
&lt;p>The part to watch is complexity. Once you start building a proper system around skills, it is very easy to create a second job for yourself maintaining the meta-layer. So that is the balance I am trying to keep now: more reusable judgement, less unnecessary ceremony.&lt;/p>
&lt;h3 id="get-in-contact">Get in contact&lt;/h3>
&lt;p>If you have comments, questions or better ways to do anything that I have discussed in this post then please get in contact via &lt;a href="https://linkedin.com/in/stuart-f-41a43b180">LinkedIn&lt;/a> or &lt;a href="mailto:stuart@uglydirtylittlestrawberry.co.uk">email&lt;/a>.&lt;/p></description></item></channel></rss>