<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://digital-transformation-advisory.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://digital-transformation-advisory.com/" rel="alternate" type="text/html" /><updated>2026-06-12T11:13:48+01:00</updated><id>https://digital-transformation-advisory.com/feed.xml</id><title type="html">Digital Transformation Advisory</title><subtitle>Transforming Organisations to work for you. Specialising in Agile/SaFE, Strategic Architecture and ISO 20022 Payments.</subtitle><author><name>Sailesh Panchal</name></author><entry><title type="html">Silent Is Not Stuck: What Two Hangs Taught Me About Observable Pipelines</title><link href="https://digital-transformation-advisory.com/2026/04/11/silent-is-not-stuck-building-observable-pipelines/" rel="alternate" type="text/html" title="Silent Is Not Stuck: What Two Hangs Taught Me About Observable Pipelines" /><published>2026-04-11T00:00:00+01:00</published><updated>2026-04-11T00:00:00+01:00</updated><id>https://digital-transformation-advisory.com/2026/04/11/silent-is-not-stuck-building-observable-pipelines</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/04/11/silent-is-not-stuck-building-observable-pipelines/"><![CDATA[<table>
  <tbody>
    <tr>
      <td>Two days ago, my nightly knowledge pipeline hung silently for six and a half hours on a single [[M4 Pro MacBook Pro</td>
      <td>Ollama]] call. Yesterday, I thought I’d found a silent logic bug in the same pipeline’s refine step — all 754 insights had zero embeddings and the manifest was all zeros. I spent an hour instrumenting and diagnosing before I realised the “bug” was the same class of problem as the hang, one layer down.</td>
    </tr>
  </tbody>
</table>

<p>Both of them came from the same missing thing: <strong>a loop that had no way to prove it was still working.</strong></p>

<h2 id="the-first-incident-the-65-hour-populate-hang">The First Incident: The 6.5-Hour Populate Hang</h2>

<p>The pipeline runs at 02:00 every night via launchd. <code class="language-plaintext highlighter-rouge">populate</code> is the step where a local LLM reads each new source file and extracts structured insights. It hits Ollama over HTTP, one source at a time, and writes the results to SQLite.</p>

<p>On the night of April 10th, populate started at 02:00 and was still running at 08:30. No output. No errors. No progress. Just a process sitting at 0% CPU, waiting on a socket.</p>

<p>When I finally killed it and looked at <code class="language-plaintext highlighter-rouge">ps</code> + <code class="language-plaintext highlighter-rouge">lsof</code>, the process was stuck mid-HTTP-call to Ollama, which had apparently died or stalled but not closed the connection. <code class="language-plaintext highlighter-rouge">URLSession</code> had no resource timeout set, so the call would wait forever. Six and a half hours of night-cron time, burned on nothing.</p>

<p>The fix was three lines of code plus a structural habit:</p>

<ol>
  <li><strong>Hard wall-clock timeout on every external call.</strong> A dedicated <code class="language-plaintext highlighter-rouge">URLSession</code> with <code class="language-plaintext highlighter-rouge">timeoutIntervalForResource = 300s</code>. Bound the worst case.</li>
  <li><strong>One retry on transient failures.</strong> The second attempt often succeeds because the upstream has recovered.</li>
  <li><strong>Per-item heartbeat to stderr.</strong> <code class="language-plaintext highlighter-rouge">[populate] 47/162 source-name.md</code> every single iteration. Now if the pipeline is stuck, I can tell in two seconds.</li>
</ol>

<table>
  <tbody>
    <tr>
      <td>That became Milestone A.5 — “[[Cloud-native architecture</td>
      <td>observability]] and resume, before anything new ships.” I sat on building any new pipeline features until the pipeline itself could tell me what it was doing.</td>
    </tr>
  </tbody>
</table>

<h2 id="the-second-incident-the-fake-embed-bug">The Second Incident: The Fake Embed Bug</h2>

<p>Yesterday I ran the pipeline again, this time with decision-capture features layered on top. Refine completed. I checked the manifest: zero embeddings, zero enrichments, zero tensions detected. I queried the live database directly — 754 insights, every single one with <code class="language-plaintext highlighter-rouge">NULL</code> embeddings.</p>

<p>My first reading: refine has a logic bug. It’s silently no-oping on the embed pass. Maybe <code class="language-plaintext highlighter-rouge">supportsEmbedding</code> is returning false. Maybe the batch loop is skipping rows. Maybe the Ollama embed endpoint is broken.</p>

<p>I started instrumenting. I added stderr logging around the embed pass. I checked that <code class="language-plaintext highlighter-rouge">nomic-embed-text</code> was loaded in Ollama. I ran a manual <code class="language-plaintext highlighter-rouge">curl</code> against <code class="language-plaintext highlighter-rouge">/api/embed</code>. Everything worked. The model was there. The endpoint responded. So why was refine producing nothing?</p>

<p>The answer, once I stopped staring at embed and actually read the refine flow end-to-end: <strong>refine wasn’t broken. It had never finished.</strong> The enrich pass runs before embed, processes each sparse insight sequentially, and takes about four seconds per call against gemma4. With 754 sparse insights, that’s about fifty minutes of work before embed even starts. Fifty minutes during which refine produces exactly zero bytes of output. Indistinguishable from a hang. On the previous night’s run, the populate hang had blocked refine from ever starting on the current dataset. On tonight’s, the enrich pass had been honestly chewing through the work and I’d killed it before it got to embed.</p>

<h2 id="the-same-root-cause-two-layers-apart">The Same Root Cause, Two Layers Apart</h2>

<p>This is where I realised the two incidents were actually the same incident. A.5 made the <em>pipeline</em> observable — you can see which phase is running, how far it’s got, whether the lockfile is held, when the last heartbeat fired. But the individual <em>passes inside refine</em> were still blind boxes. Each pass logged “starting” and then went quiet for anywhere from 30 seconds to an hour. If any one of them got slow, or stuck, or hit one bad row, you couldn’t tell which without killing the process and looking at the database state after the fact.</p>

<p>So the fix mirrored A.5, one layer down:</p>

<ol>
  <li><strong>Start and end markers on every refine pass.</strong> <code class="language-plaintext highlighter-rouge">[refine:enrich] start — 754 sparse insights</code> at entry, <code class="language-plaintext highlighter-rouge">[refine:enrich] done (0 failed)</code> at exit. No ambiguity about which pass is running.</li>
  <li>
    <table>
      <tbody>
        <tr>
          <td><strong>Per-item progress every 5 items.</strong> Slow enough to not drown the log, [[Agile methodology</td>
          <td>fast]] enough that you can watch progress in real time and calculate ETA.</td>
        </tr>
      </tbody>
    </table>
  </li>
  <li><strong>Try/catch around every external call, per row.</strong> If one insight’s LLM call fails, log it, increment a counter, continue. One bad row can never kill the whole pass again.</li>
</ol>

<p>The last point turned out to matter more than the logging. Before the fix, enrich was already ~fifty minutes of sequential external work. If any one of those 754 calls had thrown (Ollama indigestion, a badly-formed title, a transient network blip), the entire pass would have died and we’d have wasted the preceding work. With per-row tolerance, one flaky response just bumps a skip counter and the other 753 calls complete.</p>

<h2 id="the-rule-i-wrote-down">The Rule I Wrote Down</h2>

<table>
  <tbody>
    <tr>
      <td>After the second fix, I wrote a new line in my permanent project [[LLM-based agents</td>
      <td>memory]]:</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><strong>External calls in long-running passes need a stderr heartbeat every N items and a try/catch around every single call. Silent “stuck” is almost never actually stuck. It’s missing instrumentation.</strong></p>
</blockquote>

<p>That’s the entire rule. It sounds obvious written down. It wasn’t obvious while I was staring at a query that returned zero embeddings, convinced there was a logic bug.</p>

<p>There are two reasons engineers skip this, and I skipped it for both:</p>

<ul>
  <li><strong>You write the code while the dataset is small.</strong> My unit tests had 3 insights, not 754. At 3 insights, enrich finishes in 12 seconds, so “silence for a minute” never happens. The pathology doesn’t show up until production load.</li>
  <li><strong>You think logging is a polish task.</strong> Heartbeat lines look like chrome you can add later. They’re not. They are the only way a human can tell a working loop from a dead one, and if you don’t have them, every unexpected behaviour looks like a bug.</li>
</ul>

<h2 id="what-id-tell-someone-building-a-similar-system">What I’d Tell Someone Building A Similar System</h2>

<p>If you’re building a pipeline that calls external services in a loop — an LLM, an HTTP API, a database-on-another-box, anything where latency is not under your control — build the observability <em>first</em>. Not first in priority, first in literal code-writing order. Before the happy path works, make sure the unhappy path is legible:</p>

<ul>
  <li>Heartbeat every N items, where N is small enough that a human watching the log can see it move</li>
  <li>Hard wall-clock timeout on every external call, not just the session default</li>
  <li>Per-item try/catch so one bad row can never kill a long-running batch</li>
  <li>A progress sidecar written to disk so a killed process leaves behind evidence of how far it got</li>
  <li>Postmortem dump on SIGTERM so the nightly cron’s failures are diagnosable in the morning</li>
</ul>

<p>None of these are hard to build. All of them are hard to <em>remember to build before you need them</em>. The rule I’m trying to hold myself to: any loop over external work gets instrumented the same day it’s written. Not the day it breaks in production.</p>

<p>The knowledge pipeline is running as I write this. The enrich pass is at 405 of 754, ten minutes from embed, and I can see every step of it in my terminal. Which is the entire point.</p>]]></content><author><name>Sailesh Panchal</name></author><category term="Engineering" /><category term="Operations" /><category term="observability" /><category term="pipelines" /><category term="llm-tools" /><category term="kuzuctl" /><category term="ollama" /><category term="debugging" /><summary type="html"><![CDATA[A six-hour hang and a fake embed bug, both caused by the same thing: a loop that had no way to prove it was still working. Two fixes, one rule.]]></summary></entry><entry><title type="html">Turning Your Knowledge Base Into a Graph You Can Argue With</title><link href="https://digital-transformation-advisory.com/2026/04/08/turning-your-knowledge-base-into-a-graph-you-can-argue-with/" rel="alternate" type="text/html" title="Turning Your Knowledge Base Into a Graph You Can Argue With" /><published>2026-04-08T00:00:00+01:00</published><updated>2026-04-08T00:00:00+01:00</updated><id>https://digital-transformation-advisory.com/2026/04/08/turning-your-knowledge-base-into-a-graph-you-can-argue-with</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/04/08/turning-your-knowledge-base-into-a-graph-you-can-argue-with/"><![CDATA[<p>You’ve been consulting for five years. Your Obsidian vault has 847 insights. You can search by keyword. You can search by tag. But you cannot ask your own knowledge base “what contradicts this claim?” and you cannot ask “what’s the gap in my thinking here?”</p>

<p>That’s the problem that led to kuzuctl.</p>

<h2 id="the-problem-linear-search-exponential-complexity">The Problem: Linear Search, Exponential Complexity</h2>

<p>When you have fewer than 100 notes, text search is fine. You remember that you wrote something about “confidence scoring” six months ago. You search, find it, move on.</p>

<p>At 847 notes, things break. Here’s what happens:</p>

<ul>
  <li>You write a new insight: “Confidence should be semantic, not just statistical.”</li>
  <li>You search for “confidence” and find 34 previous notes.</li>
  <li>You manually check each one for contradictions, connections, implications.</li>
  <li>You spend two hours in busywork, or you don’t, and miss critical connections.</li>
  <li>You build the same idea twice, not realizing it’s already in the vault.</li>
</ul>

<p>This doesn’t scale with your organisation. If you have a knowledge base serving 10 people, or 100 people, the problem gets worse. Everyone searches independently. No one knows what’s contradictory.</p>

<p>The usual solution is a proper database. You migrate everything out of markdown into a relational schema. You hire someone to maintain it. Now markdown is a view, and the database is the source of truth.</p>

<p>But that inverts the problem: if the database goes down or gets corrupted, you’ve lost your real knowledge. And your markdown is now stale.</p>

<h2 id="the-decision-markdown-primary-graph-derived">The Decision: Markdown Primary, Graph Derived</h2>

<p>We chose a different pattern: <strong>markdown stays primary. The graph is derived.</strong></p>

<p>If you delete the Kuzu database tomorrow, <code class="language-plaintext highlighter-rouge">kuzuctl sync</code> rebuilds it from your markdown files in 30 seconds. The markdown is the source of truth. The graph is an acceleration layer—like a database index, not like a ledger.</p>

<p>This decision cascaded through everything:</p>

<ul>
  <li>Node identity is anchored in frontmatter IDs, not file paths.</li>
  <li>The schema is optimized for synthesis, not normalization.</li>
  <li>Commands can assume the graph is stale (it rebuilds every night).</li>
  <li>The CLI is the consumer, not the source.</li>
</ul>

<p>This pattern appeared again when we had to solve the Kuzu reopen bug. More on that in a moment.</p>

<h2 id="the-architecture-three-swift-targets-one-protocol">The Architecture: Three Swift Targets, One Protocol</h2>

<p>The codebase has three Swift packages:</p>

<table>
  <thead>
    <tr>
      <th>Package</th>
      <th>Purpose</th>
      <th>Reusable?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">kuzuctl</code></td>
      <td>Obsidian vault CLI: sync, lint, ingest, search, challenge, suggest</td>
      <td>No—vault-specific logic</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">KuzuGraphKit</code></td>
      <td>Graph store abstraction (CRUD, embedding, conflict detection)</td>
      <td>Yes—used by BlogCreator, SFiA-AI</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">LLMKit</code></td>
      <td>Multi-provider LLM wrapper (Ollama, Claude, Mock)</td>
      <td>Yes—any Swift project</td>
    </tr>
  </tbody>
</table>

<p>The critical move was <code class="language-plaintext highlighter-rouge">GraphStoreProtocol</code>. This abstract interface lets you swap storage backends without changing any command code:</p>

<div class="language-swift highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">protocol</span> <span class="kt">GraphStoreProtocol</span><span class="p">:</span> <span class="kt">Actor</span> <span class="p">{</span>
    <span class="kd">func</span> <span class="nf">createNode</span><span class="p">(</span><span class="nv">id</span><span class="p">:</span> <span class="kt">String</span><span class="p">,</span> <span class="nv">label</span><span class="p">:</span> <span class="kt">String</span><span class="p">,</span> <span class="nv">properties</span><span class="p">:</span> <span class="p">[</span><span class="kt">String</span><span class="p">:</span> <span class="kt">Any</span><span class="p">])</span> <span class="k">async</span> <span class="k">throws</span>
    <span class="kd">func</span> <span class="nf">linkNodes</span><span class="p">(</span><span class="nv">source</span><span class="p">:</span> <span class="kt">String</span><span class="p">,</span> <span class="nv">target</span><span class="p">:</span> <span class="kt">String</span><span class="p">,</span> <span class="nv">relation</span><span class="p">:</span> <span class="kt">String</span><span class="p">)</span> <span class="k">async</span> <span class="k">throws</span>
    <span class="kd">func</span> <span class="nf">searchByEmbedding</span><span class="p">(</span><span class="nv">query</span><span class="p">:</span> <span class="p">[</span><span class="kt">Float</span><span class="p">],</span> <span class="nv">limit</span><span class="p">:</span> <span class="kt">Int</span><span class="p">)</span> <span class="k">async</span> <span class="k">throws</span> <span class="o">-&gt;</span> <span class="p">[</span><span class="kt">SearchResult</span><span class="p">]</span>
    <span class="kd">func</span> <span class="nf">detectTensions</span><span class="p">(</span><span class="nv">nodeID</span><span class="p">:</span> <span class="kt">String</span><span class="p">)</span> <span class="k">async</span> <span class="k">throws</span> <span class="o">-&gt;</span> <span class="p">[</span><span class="kt">Tension</span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div></div>

<p>We built this assuming Kuzu. Then kuzu-swift 0.11.3 had a reopen bug: the library could not safely reconnect after the process closed the database. This is critical for a CLI that might be invoked 100 times in a day (each time: open, query, close, exit).</p>

<p>We had two options:</p>
<ol>
  <li>Wait for Kuzu to fix the bug.</li>
  <li>Implement SQLite and swap it in.</li>
</ol>

<p>Because we had the protocol, option 2 took one week. We did not rewrite any commands. The CLI still worked, still output the same JSON, still passed the same tests.</p>

<p>This was a deliberate CEO-review decision: build for optionality, not for the prettiest tech. Kuzu is the long-term backend (better analytics, better scalability). SQLite is the reliable pivot. Both implement the same interface.</p>

<h2 id="the-schema-confidence--contradictions">The Schema: Confidence + Contradictions</h2>

<p>Here’s the core insight graph schema:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NODES:
  id (String, primary)
  label (String)
  confidence_label (String: "certainty" | "assumption" | "emerging" | "contested")
  confidence_score (Float: 0.0-1.0)
  description (String)
  embedding (FLOAT[768])
  frontmatter_id (String)
  source_file (String)

EDGES:
  source_id (String)
  target_id (String)
  relation (String)
  type ("LINKS_TO" | "TENSIONS_WITH")
  context (String)
</code></pre></div></div>

<p>Two design choices stand out:</p>

<p><strong>Dual confidence.</strong> Humans read <code class="language-plaintext highlighter-rouge">confidence_label</code>: “Is this certain, or emerging?” Machines read <code class="language-plaintext highlighter-rouge">confidence_score</code>: “On a scale of 0–1, how much evidence supports this?” The label is for reasoning. The score is for filtering.</p>

<p><strong>TENSIONS_WITH as first-class edges.</strong> Most graph systems treat contradictions as an absence (this edge doesn’t exist because the two claims conflict). We model them explicitly. If note A says “cloud adoption reduces capex by 60%” and note B says “cloud adoption increases operational complexity by 40%”, we create a TENSIONS_WITH edge between them. This lets us ask “what contradicts this?” and surfaces unresolved debates.</p>

<p><strong>Node identity by frontmatter ID, not path.</strong> When you rename a file, the graph doesn’t break. The frontmatter ID stays stable:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">id</span><span class="pi">:</span> <span class="s">cost-control-through-observability</span>
<span class="na">confidence</span><span class="pi">:</span> <span class="s2">"</span><span class="s">certainty"</span>
<span class="na">tags</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">cost</span><span class="pi">,</span> <span class="nv">devops</span><span class="pi">]</span>
<span class="nn">---</span>
</code></pre></div></div>

<h2 id="the-overnight-refine-loop-six-passes-zero-cost">The Overnight Refine Loop: Six Passes, Zero Cost</h2>

<p>Every night at 3 AM, <code class="language-plaintext highlighter-rouge">kuzuctl refine</code> runs six semantic passes. This is where the graph stops being a static view and starts actively learning about itself.</p>

<p><strong>Pass 1: Embed.</strong> Every node without a vector embedding gets one via nomic-embed-text (local Ollama, runs free). 768-dimensional vectors. Takes about 2 seconds per node on modern hardware.</p>

<p><strong>Pass 2: Deduplicate.</strong> Compute cosine similarity between all pairs. If similarity &gt; 0.95 and both nodes have high confidence, merge them. This catches “I wrote the same insight three times in different words.”</p>

<p><strong>Pass 3: Enrich.</strong> For nodes with missing descriptions, ask Claude (via LLMKit) to generate one from their wiki-links and neighbors. This fills gaps in sparse nodes.</p>

<p><strong>Pass 4: Detect contradictions.</strong> For every pair of nodes with LINKS_TO edges, check: do their descriptions or embedding space conflict? If so, create a TENSIONS_WITH edge with a reason.</p>

<p><strong>Pass 5: Validate.</strong> Sample 10% of nodes and ask Claude to verify that the confidence_score is correct based on the node’s description and linked evidence. Adjust scores if needed.</p>

<p><strong>Pass 6: Prune.</strong> Remove nodes with confidence_score &lt; 0.3 that have no incoming edges. Don’t delete them—move them to a “low_confidence” table so you can audit them later.</p>

<p>Every pass logs to a <code class="language-plaintext highlighter-rouge">RefineManifest</code>:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"run_date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-08T03:00:00Z"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"passes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"pass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"embed"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"nodes_processed"</span><span class="p">:</span><span class="w"> </span><span class="mi">847</span><span class="p">,</span><span class="w">
      </span><span class="nl">"nodes_created"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="p">,</span><span class="w">
      </span><span class="nl">"duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">45</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"pass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"deduplicate"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"nodes_processed"</span><span class="p">:</span><span class="w"> </span><span class="mi">847</span><span class="p">,</span><span class="w">
      </span><span class="nl">"merges"</span><span class="p">:</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span><span class="w">
      </span><span class="nl">"duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"pass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"enrich"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"nodes_processed"</span><span class="p">:</span><span class="w"> </span><span class="mi">847</span><span class="p">,</span><span class="w">
      </span><span class="nl">"nodes_updated"</span><span class="p">:</span><span class="w"> </span><span class="mi">24</span><span class="p">,</span><span class="w">
      </span><span class="nl">"tokens_used"</span><span class="p">:</span><span class="w"> </span><span class="mi">12400</span><span class="p">,</span><span class="w">
      </span><span class="nl">"duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">28</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"pass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"detect_contradictions"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"nodes_processed"</span><span class="p">:</span><span class="w"> </span><span class="mi">847</span><span class="p">,</span><span class="w">
      </span><span class="nl">"tensions_created"</span><span class="p">:</span><span class="w"> </span><span class="mi">7</span><span class="p">,</span><span class="w">
      </span><span class="nl">"duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">62</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"pass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"validate"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"sample_size"</span><span class="p">:</span><span class="w"> </span><span class="mi">85</span><span class="p">,</span><span class="w">
      </span><span class="nl">"score_adjustments"</span><span class="p">:</span><span class="w"> </span><span class="mi">12</span><span class="p">,</span><span class="w">
      </span><span class="nl">"tokens_used"</span><span class="p">:</span><span class="w"> </span><span class="mi">8900</span><span class="p">,</span><span class="w">
      </span><span class="nl">"duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">35</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"pass"</span><span class="p">:</span><span class="w"> </span><span class="s2">"prune"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"low_confidence_archived"</span><span class="p">:</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w">
      </span><span class="nl">"duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">3</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"total_duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">181</span><span class="p">,</span><span class="w">
  </span><span class="nl">"total_tokens_used"</span><span class="p">:</span><span class="w"> </span><span class="mi">21300</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>This audit trail answers “what changed last night, and why?” If you wake up and a confidence score shifted, you can see exactly which pass did it and what reasoning was applied.</p>

<h2 id="the-challenge-command-red-team-your-own-graph">The Challenge Command: Red-Team Your Own Graph</h2>

<p>The most useful command is <code class="language-plaintext highlighter-rouge">kuzuctl challenge</code>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kuzuctl challenge <span class="s2">"Cloud adoption always reduces capex"</span>
</code></pre></div></div>

<p>Output:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"claim"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Cloud adoption always reduces capex"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"verdict"</span><span class="p">:</span><span class="w"> </span><span class="s2">"contested"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"supported"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="p">{</span><span class="w">
        </span><span class="nl">"node_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cost-control-observability"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"confidence"</span><span class="p">:</span><span class="w"> </span><span class="s2">"certainty"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"reasoning"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Operational savings in datacenter overhead"</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"contested"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
      </span><span class="p">{</span><span class="w">
        </span><span class="nl">"node_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"cloud-operational-complexity"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"confidence"</span><span class="p">:</span><span class="w"> </span><span class="s2">"emerging"</span><span class="p">,</span><span class="w">
        </span><span class="nl">"tension"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Cloud adoption increases operational complexity (TENSIONS_WITH)"</span><span class="w">
      </span><span class="p">}</span><span class="w">
    </span><span class="p">],</span><span class="w">
    </span><span class="nl">"insufficient_evidence"</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"cypher_used"</span><span class="p">:</span><span class="w"> </span><span class="s2">"MATCH (n {label: $claim})-[r]-(m) WHERE r.type IN ['SUPPORTS', 'TENSIONS_WITH'] RETURN n, r, m"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"reasoning"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Your claim has direct support for cost reduction, but we found a contested edge claiming increased operational complexity. Neither claim is fully resolved—both confidence scores are below 0.8."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>Notice the <code class="language-plaintext highlighter-rouge">cypher_used</code> field. This makes the reasoning reproducible. You can run that same query, inspect the results, and decide whether the algorithm was right. This transparency is why challenge is useful as a red-team tool, not just as an answer engine.</p>

<h2 id="the-suggest-command-surfacing-synthesis-gaps">The Suggest Command: Surfacing Synthesis Gaps</h2>

<p><code class="language-plaintext highlighter-rouge">kuzuctl suggest</code> finds patterns that <em>should</em> exist but don’t:</p>

<ul>
  <li><strong>Orphans</strong>: Nodes with no incoming or outgoing edges (dead weight)</li>
  <li><strong>Open triangles</strong>: A→B, A→C, but no B→C (synthesis gap)</li>
  <li><strong>Unresolved tensions</strong>: TENSIONS_WITH edges with low confidence on both sides (debate, not decision)</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kuzuctl suggest <span class="nt">--sphere</span> <span class="s2">"cost-control"</span>
</code></pre></div></div>

<p>Output tells you: “You have 8 insights about cost controls, but nodes ‘cost-observability’ and ‘cost-ai-automation’ are not connected. Are they related, contradictory, or independent?”</p>

<p>This is the inverse of search. Search answers “does this exist?” Suggest answers “what’s broken in your thinking?”</p>

<h2 id="blogcreator-the-same-graph-extended">BlogCreator: The Same Graph, Extended</h2>

<p>This year we’re building BlogCreator—a tool to turn raw voice recordings into polished blog posts, with full lineage tracking.</p>

<p>Five thousand, two hundred and eighty-seven voice recordings (5,287) over five years. We didn’t want to throw away the audio. We built the transcription system into the same Kuzu graph.</p>

<p>The schema extends kuzuctl’s:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>NODES:
  AudioFile → TranscriptVersion → Concept → BlogPost → Chapter → PublishedContent

EDGES:
  TRANSCRIBED_FROM (AudioFile → TranscriptVersion, with accuracy score)
  EXTRACTED_FROM (Concept → TranscriptVersion, with confidence)
  INCLUDED_IN (Concept → BlogPost)
  REFINED_BY (BlogPost → BlogPost, tracking iteration)
  IMMUTABLE_LINEAGE (PublishedContent → AudioFile, tracing back to source)
</code></pre></div></div>

<p>BlogCreator uses <code class="language-plaintext highlighter-rouge">KuzuGraphKit</code>—the same reusable library. Different schema, same protocol. This validates the architecture decision: separating vault-specific logic (kuzuctl) from graph-specific logic (KuzuGraphKit) let us reuse the whole graph layer for a completely different problem.</p>

<h2 id="when-to-build-a-graph-and-when-not-to">When to Build a Graph (and When Not To)</h2>

<p>Not every knowledge base needs this. Here’s the decision table:</p>

<table>
  <thead>
    <tr>
      <th>Condition</th>
      <th>Recommendation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>&lt; 200 notes, &lt; 5 years old</td>
      <td>Use Obsidian search. Stay flat.</td>
    </tr>
    <tr>
      <td>200–1000 notes, individual use</td>
      <td>Build a derived graph. Use SQLite.</td>
    </tr>
    <tr>
      <td>1000–10k notes, team use, synthesis-critical</td>
      <td>Build a derived graph. Use Kuzu. Add challenge/suggest commands.</td>
    </tr>
    <tr>
      <td>Raw data → structured output (transcription, contracts, research)</td>
      <td>Build a lineage graph. Same protocol, different schema.</td>
    </tr>
    <tr>
      <td>Graph is the product (recommendations, discovery, analytics)</td>
      <td>Build graph-as-primary. Invest in the database. Accept the maintenance cost.</td>
    </tr>
  </tbody>
</table>

<p>The key: <strong>Is the graph helping you think, or is it becoming the thing you think about?</strong></p>

<p>If it’s helping you think—synthesis, contradiction detection, gap finding—it should be derived and lightweight.</p>

<p>If it’s the thing you think about—someone built it, someone maintains it, it has its own schema version—it should be primary and mature.</p>

<p>We chose derived because the vault is the thought. The graph is the reasoning tool.</p>

<h2 id="takeaway-invert-the-authority">Takeaway: Invert the Authority</h2>

<p>Most graph projects start with “we need a database, let’s extract data into it.” This makes the database the source of truth and guarantees eventual consistency pain.</p>

<p>Invert it: keep your source of truth (markdown, voice recordings, whatever) and build a derived graph that can be thrown away and rebuilt. This trades some query latency for enormous operational simplicity.</p>

<p>You get to ask your knowledge base hard questions. You get reproducible reasoning (the SQL that built the answer). You get an audit trail (RefineManifest). And you never wake up wondering what corrupted your real data.</p>

<p>That’s the kuzuctl pattern. Build it if you have 847 insights and your thinking is your product.</p>

<hr />

<p><strong>Sailesh Panchal</strong> is a CTO and founder of Digital Transformation Advisory. He consults with UK banking and fintech on AI strategy, platform architecture, and the boring operational decisions that make transformation stick.</p>]]></content><author><name>Sailesh Panchal</name></author><category term="Engineering" /><category term="AI" /><category term="knowledge-graphs" /><category term="llm-tools" /><category term="architecture" /><category term="obsidian" /><category term="kuzu" /><category term="sqlite" /><category term="kuzuctl" /><summary type="html"><![CDATA[How to build a derived graph layer on top of markdown files so you can ask 'what contradicts this?' instead of just 'does this exist?']]></summary></entry><entry><title type="html">From 5,000 Voice Memos to a Book: The Pipeline That Runs While You Sleep</title><link href="https://digital-transformation-advisory.com/2026/04/08/from-five-thousand-voice-memos-to-a-book/" rel="alternate" type="text/html" title="From 5,000 Voice Memos to a Book: The Pipeline That Runs While You Sleep" /><published>2026-04-08T00:00:00+01:00</published><updated>2026-04-08T00:00:00+01:00</updated><id>https://digital-transformation-advisory.com/2026/04/08/from-five-thousand-voice-memos-to-a-book</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/04/08/from-five-thousand-voice-memos-to-a-book/"><![CDATA[<h2 id="the-trap">The Trap</h2>

<p>Five years of banking consulting leaves you with something precious and useless: 5,287 voice memos scattered across four different apps.</p>

<p>Pauses while walking to Pret. Thirty-second thoughts about SEPA harmonisation captured in an Uber. A four-minute tangent on PSD2 compliance recorded while waiting for a call. All of it—126 GB of audio—sitting in iCloud, slowly drowning in your decision paralysis.</p>

<p>The knowledge is there. Real patterns about digital transformation in UK banking. Causal loops connecting regulatory change to architecture decisions. Mistakes I’ve made at two banks and a fintech. But the knowledge is trapped in audio files I will never listen to again. Who has the time?</p>

<p>So I built a pipeline.</p>

<p>The ambition was simple: turn those 5,000 hours of raw thinking into a book. Not a collection of blog posts. Not a listicle farm. A proper book—nine chapters, 40,000 words, structured narrative arc—on transforming a UK bank from the CTO’s perspective. Call it <em>The Transformation Paradox</em>.</p>

<p>But you cannot write a book by listening to 5,000 recordings. You need automation. You need stages. You need quality gates that separate the gold from the noise.</p>

<p>What follows is how I built it.</p>

<h2 id="the-architecture-four-stages-three-model-tiers-one-immutable-graph">The Architecture: Four Stages, Three Model Tiers, One Immutable Graph</h2>

<p>The pipeline runs nightly at 2am. It pulls raw audio from four vaults, transcribes at 140x real-time, scores against 100 financial concepts, generates two formats of content, enhances with a four-agent system, and logs every decision in a Kuzu knowledge graph for audit and cross-referencing.</p>

<p>No prompt tells it “write me a book.” The pipeline has stages.</p>

<h3 id="stage-0-transcribe-audio--transcript">Stage 0: Transcribe (Audio → Transcript)</h3>

<p>The first bottleneck is speech-to-text. I record on my phone (mostly Apple Voice Memos), backup to Dropbox, and keep originals in Google Drive as a belt-and-braces hedge. Four vaults: Voice Memos (2,847 files), Dropbox recordings (1,420), Google Drive (723), and Telegram voice notes (297).</p>

<table>
  <thead>
    <tr>
      <th>Vault</th>
      <th>File Count</th>
      <th>Format</th>
      <th>Access</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Apple Voice Memos</td>
      <td>2,847</td>
      <td>.m4a</td>
      <td>iCloud sync</td>
    </tr>
    <tr>
      <td>Dropbox backups</td>
      <td>1,420</td>
      <td>.mp3</td>
      <td>Local mount</td>
    </tr>
    <tr>
      <td>Google Drive archive</td>
      <td>723</td>
      <td>.mp3</td>
      <td>API via gcloud</td>
    </tr>
    <tr>
      <td>Telegram voice notes</td>
      <td>297</td>
      <td>.ogg</td>
      <td>Export script</td>
    </tr>
  </tbody>
</table>

<p>For STT, I tested two local models—parakeet-mlx and mlx-whisper—and found I needed both.</p>

<p><strong>parakeet-mlx</strong> runs at 140x real-time on Apple Silicon. On a 2-minute voice memo, it produces output in under 1 second. Word error rate sits at 6.3%. For raw speed (processing 400 recordings in a batch), it’s unbeatable. But it misses domain specifics. It hears “SEPA” as “sepia” and “CHAPS” as “chaps” (the riding wear, not the clearing house).</p>

<p><strong>mlx-whisper</strong> is slower (3-4x real-time) but domain-aware. I seed its prompt with ~40 financial terms: CHAPS, BACS, SEPA, ISO 20022, FMV, PSD2, FIDO2, BaaS, BNPL, passporting, SCA. The model uses the prompt as a lexical hint. Correct rate for those 40 terms jumps from 40% to 94%.</p>

<p>So the pipeline does this: parakeet first for speed, flag any memo over 2 minutes as “high-stakes financial” (if it contains keywords like “compliance” or “architecture”), and re-transcribe those with mlx-whisper HQ.</p>

<p>A compiled regex corrects the remaining 6% of finance-specific mistakes—it catches patterns like “sepia clearing” and replaces them with “SEPA clearing” using context.</p>

<p>Result: 400 clean transcripts per night. Total cost: zero. (Ollama runs locally; no API fees.)</p>

<h3 id="stage-1-analyse-transcript--confidence-score">Stage 1: Analyse (Transcript → Confidence Score)</h3>

<p>The second stage is scoring. Not “is this good?” but “is this articulate enough to generate content from?”</p>

<p>I built a taxonomy of 5 themes and 100 financial concepts:</p>
<ul>
  <li>Theme 1: Regulatory Compliance (PSD2, GDPR, FIDO2, SCA, Strong Customer Auth)</li>
  <li>Theme 2: Payments Modernisation (ISO 20022, SEPA, CBDC, Real-Time Payments)</li>
  <li>Theme 3: Enterprise Architecture (Systems Thinking, Domain-Driven Design, Event Sourcing)</li>
  <li>Theme 4: Talent &amp; Culture (Team scaling, psychological safety, growth mindset)</li>
  <li>Theme 5: AI Integration (LLM ops, vector DBs, prompt engineering)</li>
</ul>

<p><strong>Qwen 3.5-4B</strong> (thinking mode disabled for clean JSON) scores each transcript on all five themes. Output:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"memo_id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"20260407_0247_psd2_discussion"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"duration_seconds"</span><span class="p">:</span><span class="w"> </span><span class="mi">312</span><span class="p">,</span><span class="w">
  </span><span class="nl">"themes"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Regulatory Compliance"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"relevance_score"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.68</span><span class="p">,</span><span class="w">
      </span><span class="nl">"confidence"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.92</span><span class="p">,</span><span class="w">
      </span><span class="nl">"concepts_detected"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"PSD2"</span><span class="p">,</span><span class="w"> </span><span class="s2">"SCA"</span><span class="p">,</span><span class="w"> </span><span class="s2">"passporting"</span><span class="p">,</span><span class="w"> </span><span class="s2">"regulatory_arbitrage"</span><span class="p">]</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="p">{</span><span class="w">
      </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Payments Modernisation"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"relevance_score"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.44</span><span class="p">,</span><span class="w">
      </span><span class="nl">"confidence"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.87</span><span class="p">,</span><span class="w">
      </span><span class="nl">"concepts_detected"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"ISO_20022"</span><span class="p">,</span><span class="w"> </span><span class="s2">"Real-Time_Payments"</span><span class="p">]</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">],</span><span class="w">
  </span><span class="nl">"overall_quality_score"</span><span class="p">:</span><span class="w"> </span><span class="mf">0.58</span><span class="p">,</span><span class="w">
  </span><span class="nl">"recommendation"</span><span class="p">:</span><span class="w"> </span><span class="s2">"GOLD"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"rationale"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Clear narrative arc, specific examples, actionable guidance."</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The three confidence bands:</p>

<table>
  <thead>
    <tr>
      <th>Band</th>
      <th>Score</th>
      <th>Action</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Gold</strong></td>
      <td>≥55</td>
      <td>Auto-generate blog post + LinkedIn format</td>
    </tr>
    <tr>
      <td><strong>Silver</strong></td>
      <td>40–54</td>
      <td>Generate, lower priority, queue for review</td>
    </tr>
    <tr>
      <td><strong>Bronze</strong></td>
      <td>37–39</td>
      <td>Queue for Whisper HQ retranscription</td>
    </tr>
    <tr>
      <td><strong>Dud</strong></td>
      <td>&lt;37</td>
      <td>Skip; log for manual review later</td>
    </tr>
  </tbody>
</table>

<p>Of 5,287 memos, 1,247 came back Gold. 2,104 Silver. 894 Bronze (queued for retranscription). 1,042 Dud (mostly background noise, false starts, or phone-call fragments).</p>

<p>Gold memos are the ones where I was actually thinking—not just ruminating.</p>

<h3 id="stage-2-generate-transcript--two-formats">Stage 2: Generate (Transcript → Two Formats)</h3>

<p>Gold memos fork into two tracks:</p>

<p><strong>Track A: Consultancy Article</strong> (1,500–3,000 words). SEO-optimized, thought leadership tone. Structured: Problem statement, why it matters, decision tree, implementation pattern, common pitfalls, call to action. This goes to the blog and gets social distribution.</p>

<p><strong>Track B: LinkedIn Post</strong> (300–600 words). Snappier. “Here’s the insight; here’s why you should care; here’s the next step.” Thread-friendly. Lower friction. Different audience (practitioners vs. architects).</p>

<p>Same transcript. Two prompts. Two voices. The pipeline generates both in parallel. (Qwen 9B does this overnight; we’re not paying for latency.)</p>

<h3 id="stage-3-enhance-content--four-agent-polish">Stage 3: Enhance (Content → Four-Agent Polish)</h3>

<p>This is where the magic happens. After generation, I don’t ship immediately. I run a four-agent enhancement pipeline. Each agent has a specific job:</p>

<ol>
  <li>
    <p><strong>Systems Thinking Agent</strong> (89% effectiveness): Reads the draft and identifies causal loops. If I wrote “Teams moved faster after we restructured,” the agent asks: “But did velocity improve because of the structure, or because the reorganisation coincided with hiring senior engineers?” It surfaces confounds. It ties insights to feedback loops. It turns observations into models.</p>
  </li>
  <li>
    <p><strong>Growth Mindset Agent</strong> (92%): Reframes challenges as capability development. If the draft says “We struggled with microservices,” the agent rewrites it: “We discovered microservices require different operational muscle—here’s how we built it.” Ownership over victimhood. Agency over passivity.</p>
  </li>
  <li>
    <p><strong>Reader Engagement Agent</strong> (87%): Injects Socratic questions and “Explore Further” links. It pulls from the Kuzu graph: if I mention ISO 20022, the agent fetches all related concepts (SEPA, Real-Time Payments, CBDC) and suggests cross-links. It turns monologue into dialogue.</p>
  </li>
  <li>
    <p><strong>Tone Calibration Agent</strong> (85%): Quality gate. Checks: Is this too jargon-heavy for practitioners? Too simplistic for architects? Is the voice consistent with my other posts? Does it land for a UK banking CTO? Flags anything that feels off.</p>
  </li>
</ol>

<p>Each agent is built with a five-layer prompt architecture:</p>
<ol>
  <li><strong>Identity</strong>: “You are a systems thinking expert, trained on complex adaptive systems.”</li>
  <li><strong>Expertise</strong>: “Your specialty is identifying feedback loops in socio-technical change.”</li>
  <li><strong>Context</strong>: [Kuzu neighbourhood context: all related concepts, prior posts on this theme, decision history]</li>
  <li><strong>Standards</strong>: “Your output must be concrete (not hand-wavy), humble (not prescriptive), and tied to evidence.”</li>
  <li><strong>Output format</strong>: “JSON with <code class="language-plaintext highlighter-rouge">suggestions</code> array, each item has <code class="language-plaintext highlighter-rouge">location</code> (which paragraph), <code class="language-plaintext highlighter-rouge">original_text</code>, <code class="language-plaintext highlighter-rouge">proposed_revision</code>, <code class="language-plaintext highlighter-rouge">rationale</code>.”</li>
</ol>

<p>The orchestrator fetches each agent’s output, merges non-conflicting suggestions, and flags conflicts for manual review.</p>

<p>Cost per article: ~$0.30 (Claude API for agent coordination only; base generation is free on Qwen 9B).</p>

<h3 id="stage-4-lineage-everything--immutable-kuzu-graph">Stage 4: Lineage (Everything → Immutable Kuzu Graph)</h3>

<p>Here’s the bit that matters for regulators, auditors, and your own sanity: every artifact is traceable.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Audio file (20260407_0247.m4a)
  ↓
TranscriptVersion (v1: parakeet, v2: whisper HQ)
  ↓
ConceptExtraction (PSD2, SCA, Real-Time Payments)
  ↓
ConfidenceScore (0.58 → GOLD)
  ↓
BlogPost (1,847 words, published 2026-04-08)
  ↓
LinkedInPost (412 words, published 2026-04-08)
  ↓
BookChapter ("Regulatory Modernisation", position 3)
</code></pre></div></div>

<p>Kuzu nodes never overwrite. New versions create new nodes. If I re-transcribe a memo with Whisper HQ, a <code class="language-plaintext highlighter-rouge">TranscriptVersion</code> node links the old (parakeet) and new (whisper) outputs. The graph shows the evolution. An auditor can ask: “Show me every version of the PSD2 content” and trace the lineage.</p>

<p>This matters. If a regulator asks, “How did you arrive at this conclusion about SCA?” you can pull the graph: here’s the memo, the timestamp, the transcription method, the quality score, the agents that touched it, the publication date.</p>

<p>No hand-waving. No “I think I wrote about that somewhere.”</p>

<h2 id="the-model-tiers-trade-latency-for-quality">The Model Tiers: Trade Latency for Quality</h2>

<p>I use three Qwen models on Apple Silicon via MLX. No cloud API. No cost for overnight bulk work.</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Size</th>
      <th>VRAM</th>
      <th>Latency</th>
      <th>Use Case</th>
      <th>Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Qwen 3.5-4B</strong></td>
      <td>3 GB</td>
      <td>1.2 GB</td>
      <td>0.8s per 1K tokens</td>
      <td>Daily scoring, quick analysis</td>
      <td>Free</td>
    </tr>
    <tr>
      <td><strong>Qwen 9B</strong></td>
      <td>6 GB</td>
      <td>2.4 GB</td>
      <td>1.8s per 1K tokens</td>
      <td>Blog generation, formatting</td>
      <td>Free</td>
    </tr>
    <tr>
      <td><strong>Qwen 27B</strong></td>
      <td>30 GB</td>
      <td>8 GB</td>
      <td>5.2s per 1K tokens</td>
      <td>Diagram specs, complex reasoning</td>
      <td>Free</td>
    </tr>
  </tbody>
</table>

<p>4B runs every memo nightly (scoring). 9B generates content (Track A and Track B). 27B handles overnight “think deeply” work—when I want Qwen to reason through architecture trade-offs, it gets the 27B model and 30 seconds per response.</p>

<p>The key insight: <strong>free latency is valuable</strong>. If a task takes 30 seconds but costs $0 (because it’s midnight), run it. If it takes 5 seconds and costs $0.10, use Claude (10x faster, acceptable cost for spot-check validation).</p>

<p>My cost model for overnight processing: $0. For daytime validation: ~$50/month Claude API budget.</p>

<h2 id="the-lora-voice-adaptation-at-scale">The LoRA: Voice Adaptation at Scale</h2>

<p>At 50 gold posts, I’ll train a LoRA (Low-Rank Adapter) on top of Qwen 9B.</p>

<p>Training data: 50 (transcript, Sailesh’s personal rewrite) pairs. The LoRA learns not the content, but the voice. How I restructure a rambling 5-minute thought into crisp argument. My preference for concrete examples over abstractions. My skepticism toward buzzwords.</p>

<p>Base model: Qwen 9B. The LoRA will be ~64 MB. After training, any Qwen 9B inference with the LoRA loaded will sound more like me.</p>

<p>I’m not training a new foundation model. I’m training my voice on top of an existing one.</p>

<h2 id="the-book-nine-chapters-from-chaos">The Book: Nine Chapters from Chaos</h2>

<p>The output is structured as nine chapters, crystallised from Kuzu themes:</p>

<ol>
  <li>The Transformation Paradox (Intro: why banks change, why it fails)</li>
  <li>Regulatory Winds (PSD2, GDPR, future regulation)</li>
  <li>Payments Plumbing (ISO 20022, SEPA, Real-Time Payments)</li>
  <li>Systems Thinking (feedback loops, causal models, complexity)</li>
  <li>Architecture Decisions (DDD, event sourcing, monolith vs. microservices)</li>
  <li>Building Teams (talent, psychological safety, growth mindset)</li>
  <li>AI Integration (LLMs, vector search, responsible deployment)</li>
  <li>The Operator’s Mindset (observability, chaos engineering, incident response)</li>
  <li>The Systems Thinker’s Manifesto (coda: synthesis, next decade)</li>
</ol>

<p>Each chapter is built from a cluster of Gold memos. <code class="language-plaintext highlighter-rouge">ttb query journey --concept PSD2 --show-evolution</code> traces how my thinking on PSD2 compliance evolved across five years—which memo first articulated it, how the thinking deepened, where contradictions emerged, what I changed my mind on.</p>

<p>The book is not a collection of essays. It’s a narrative with causal coherence, built from the graph.</p>

<h2 id="the-patterns-what-ctos-should-learn">The Patterns: What CTOs Should Learn</h2>

<p>Three principles stand out:</p>

<p><strong>1. Split problems correctly.</strong> This pipeline works because each stage has one job. Transcription doesn’t score. Scoring doesn’t generate. Generation doesn’t enhance. Each stage outputs clean JSON, which the next stage consumes. When something breaks, you know where. This is the same principle I wrote about in <a href="/2026/03/20/fifty-powerpoints-and-a-rebrand-why-we-didnt-train-a-model/">“Fifty PowerPoints: How to Scale Content Without Burning Out”</a>—splitting the branding pipeline into extract, deterministic transform, and intelligent rewrite.</p>

<p><strong>2. Free latency is a weapon.</strong> If it costs $0 to run overnight, run it deeply. If it costs money by the token, get ruthless about scope. The four-agent enhancement costs ~$0.30 per article because Claude sees only the final, filtered JSON from cheap models. It doesn’t re-read the transcript; it doesn’t second-guess the scoring. Money buys precision in specific places, not omniscience.</p>

<p><strong>3. Lineage is not optional.</strong> You think you won’t need an audit trail until you do. Then it’s too late. Immutable Kuzu nodes cost nothing. The discipline of logging every version, every decision, every touch—it pays for itself the first time a stakeholder asks “where did you get that number?” and you can say “here, pull the graph.”</p>

<h2 id="the-status">The Status</h2>

<p>As of April 2026, the pipeline is production. 1,247 gold memos have generated blog posts. 894 have been retranscribed and are pending generation. The first book outline is crystallising around the nine chapters above.</p>

<p>Total elapsed time to build: 14 months (evenings and weekends). Total cost: ~$600 (mostly Claude API during development; now down to $50/month validation budget). Total time saved: hard to measure, but if I’d manually transcribed even 1% of those memos, I’d have lost three weeks of consulting work just sitting with an audio player.</p>

<p>The real value is this: five years of thinking are no longer lost. They’re queryable. They’re traceable. They’re part of a coherent narrative. And the book will exist.</p>

<p>That’s worth building a pipeline for.</p>

<hr />

<p><strong>Sailesh Panchal</strong> is a CTO advisor and architect specialising in digital transformation at UK banks. He writes about payments modernisation, systems thinking, and the engineering practices that survive contact with regulation.</p>]]></content><author><name>Sailesh Panchal</name></author><category term="Engineering" /><category term="AI" /><category term="knowledge-management" /><category term="llm-engineering" /><category term="voice-transcription" /><category term="book-generation" /><category term="kuzu" /><category term="mlx" /><category term="ollama" /><summary type="html"><![CDATA[How I built a ML pipeline to convert 5,287 voice memos recorded over five years of banking consulting into a structured book on UK digital transformation—using local models, immutable lineage, and staged quality gates instead of one-shot prompts.]]></summary></entry><entry><title type="html">How We Test Claude Skills: The Eval-and-Tune Loop</title><link href="https://digital-transformation-advisory.com/2026/03/26/how-we-test-claude-skills-the-eval-and-tune-loop/" rel="alternate" type="text/html" title="How We Test Claude Skills: The Eval-and-Tune Loop" /><published>2026-03-26T00:00:00+00:00</published><updated>2026-03-26T00:00:00+00:00</updated><id>https://digital-transformation-advisory.com/2026/03/26/how-we-test-claude-skills-the-eval-and-tune-loop</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/03/26/how-we-test-claude-skills-the-eval-and-tune-loop/"><![CDATA[<p>Writing a Claude Code skill is easy. You write some markdown, drop it in <code class="language-plaintext highlighter-rouge">~/.claude/skills/</code>, and it activates automatically. The hard part is knowing whether it actually makes a difference.</p>

<p>We learned this building the <a href="https://github.com/saileshpanchal/agent-friendly-cli">agent-friendly-cli skill</a> — a guide for building CLIs that AI agents can use effectively. The skill covers 16 principles: structured output, stderr separation, exit codes, TTY detection, and so on. But principles on paper mean nothing without evidence that they change outcomes.</p>

<p>So we built a testing loop. Here’s what we learned.</p>

<h2 id="the-process">The Process</h2>

<h3 id="1-draft-the-skill-then-write-test-prompts">1. Draft the Skill, Then Write Test Prompts</h3>

<p>Start with the skill content, then immediately write 2-3 realistic test prompts. Not “test the skill” prompts — real tasks someone would actually bring to Claude:</p>

<ul>
  <li>“Build a deploy CLI in Python with Click”</li>
  <li>“Review this CLI code and tell me what’s not agent-friendly”</li>
  <li>“Write a config import command that accepts file or stdin input”</li>
</ul>

<p>These cover different modes: code generation, code review, and feature implementation. Each exercises the skill differently.</p>

<h3 id="2-run-with-skill-and-without-skill-in-parallel">2. Run With-Skill and Without-Skill in Parallel</h3>

<p>This is the key insight. Don’t just test the skill — test the delta. Spawn six subagents: three with the skill loaded, three without. Same prompts, same model, different guidance.</p>

<p>The without-skill runs are your baseline. They show what Claude does naturally, without the skill’s patterns. The comparison reveals what the skill actually teaches.</p>

<h3 id="3-draft-assertions-while-runs-execute">3. Draft Assertions While Runs Execute</h3>

<p>Don’t wait for results. While the agents run, write the grading criteria. For the CLI skill, our first assertions were:</p>

<ul>
  <li>Does the code include <code class="language-plaintext highlighter-rouge">--output json</code>?</li>
  <li>Is there a <code class="language-plaintext highlighter-rouge">--dry-run</code> flag?</li>
  <li>Does it use flags instead of positional arguments?</li>
  <li>Are there distinct exit codes?</li>
</ul>

<p>These felt reasonable. They were also mostly wrong — not wrong in what they checked, but wrong in what they revealed.</p>

<h3 id="4-the-first-round-wont-discriminate">4. The First Round Won’t Discriminate</h3>

<p>Here’s what happened when we graded:</p>

<table>
  <thead>
    <tr>
      <th>Eval</th>
      <th>With Skill</th>
      <th>Without Skill</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Deploy CLI</td>
      <td>100%</td>
      <td>33%</td>
    </tr>
    <tr>
      <td>Code Review</td>
      <td><strong>100%</strong></td>
      <td><strong>100%</strong></td>
    </tr>
    <tr>
      <td>Config Import</td>
      <td>80%</td>
      <td>40%</td>
    </tr>
  </tbody>
</table>

<p>The deploy CLI and config import showed clear deltas. But the code review scored 100% for both versions. The skill found 11 issues; the baseline found 6. The skill categorized by severity; the baseline didn’t. The skill caught stderr/stdout separation; the baseline missed it. Yet the assertions said they were equal.</p>

<p>The problem: our assertions tested for the obvious. “Does it mention interactive prompts?” Yes — both versions catch that. “Does it note missing JSON output?” Yes — both versions catch that too. The assertions were too easy.</p>

<h3 id="5-find-what-discriminates">5. Find What Discriminates</h3>

<p>This is where the real work happens. Read both outputs side by side and ask: what does the with-skill version do that the baseline doesn’t? For us, it was:</p>

<ul>
  <li><strong>Severity categorization</strong> — the skill version tiered issues as blocking/moderate/low</li>
  <li><strong>Stderr/stdout separation</strong> — the baseline never mentioned it</li>
  <li><strong>Emoji fragility</strong> — the skill flagged <code class="language-plaintext highlighter-rouge">print('Done!')</code> with emoji as parsing-fragile</li>
  <li><strong>Issue depth</strong> — the skill found 8+ issues vs the baseline’s 6</li>
</ul>

<p>We added these as assertions and re-graded:</p>

<table>
  <thead>
    <tr>
      <th>Eval</th>
      <th>With Skill</th>
      <th>Without Skill</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Code Review</td>
      <td><strong>100%</strong></td>
      <td><strong>83%</strong></td>
      <td><strong>+17%</strong></td>
    </tr>
  </tbody>
</table>

<p>Now the eval discriminated. The two failing assertions for the baseline — stderr/stdout separation and emoji fragility — were exactly the patterns the skill teaches that Claude doesn’t know on its own.</p>

<h3 id="6-iterate-until-stable">6. Iterate Until Stable</h3>

<p>The loop is: draft assertions, grade, find non-discriminating assertions, replace them with harder ones, re-grade. Stop when the assertions capture the real delta.</p>

<p>For code generation evals (the deploy CLI), the first round already discriminated well (+67%). Code generation is where skills have the most leverage — the model produces fundamentally different code with the right guidance.</p>

<p>For code review evals, it took two rounds. The model is already decent at spotting problems; the skill’s value is in the subtler, deeper patterns.</p>

<h2 id="the-audit-that-proved-it">The Audit That Proved It</h2>

<p>The same day we shipped the skill, we ran it against our own project — a transcription-to-blog pipeline with a CLI called <code class="language-plaintext highlighter-rouge">ttb</code>. The audit against the agent-friendly checklist was immediate and damning:</p>

<ol>
  <li>No <code class="language-plaintext highlighter-rouge">--output json</code> on any command — agents can’t parse the Rich tables</li>
  <li>No graph query commands — can’t explore the knowledge graph programmatically</li>
  <li><code class="language-plaintext highlighter-rouge">audit-transcript</code> outputs Rich markup, not structured data</li>
  <li><code class="language-plaintext highlighter-rouge">enrich</code> outputs JSON (the only one that does)</li>
  <li>No <code class="language-plaintext highlighter-rouge">--quiet</code> mode</li>
  <li>Stats use Rich tables — not parseable</li>
  <li>No graph-level query tools for listing concepts, finding themes, or exploring connections</li>
</ol>

<p>The biggest win wasn’t fixing existing commands — it was realising we needed <code class="language-plaintext highlighter-rouge">ttb query</code> subcommands that let agents explore the knowledge graph with <code class="language-plaintext highlighter-rouge">--output json</code>. The skill didn’t just review our CLI; it revealed a missing capability.</p>

<p>That’s the difference between a checklist you read once and a skill that’s loaded into context every time you touch CLI code.</p>

<h2 id="what-wed-do-differently">What We’d Do Differently</h2>

<p><strong>Start with discriminating assertions.</strong> Don’t test for the obvious. If baseline Claude already catches interactive prompts and missing JSON output, those assertions won’t tell you if your skill adds value. Test for the patterns that require specific domain knowledge.</p>

<p><strong>Run more than 3 test cases.</strong> Three is enough to validate the approach, but the signal gets noisy with small samples. For a production skill, we’d run 8-10 before shipping.</p>

<p><strong>Grade programmatically from the start.</strong> We wrote a grading script that checks outputs against regex patterns. It’s stringly-typed and a bit hacky, but it runs in seconds and produces consistent results. Manual review is important for qualitative assessment, but programmatic grading catches regressions.</p>

<h2 id="get-the-skill">Get the Skill</h2>

<p>The agent-friendly-cli skill is open source:</p>

<p><strong><a href="https://github.com/saileshpanchal/agent-friendly-cli">github.com/saileshpanchal/agent-friendly-cli</a></strong></p>

<p>The eval workspace with all test cases, grading scripts, and benchmark data is in the repo. Fork it, run the evals against your own CLIs, and see what falls out.</p>

<p>The process itself — draft, test with/without, find discriminating assertions, iterate — works for any Claude Code skill. The specific assertions will be different, but the loop is the same.</p>]]></content><author><name>Sailesh Panchal</name></author><category term="Engineering" /><category term="AI" /><category term="claude" /><category term="skills" /><category term="testing" /><category term="evaluation" /><category term="cli" /><category term="developer-tools" /><category term="open-source" /><category term="gstack" /><category term="quality-assurance" /><summary type="html"><![CDATA[Writing a Claude Code skill is easy. Knowing whether it actually works is harder. We built an eval-and-tune loop that benchmarks skills against a no-skill baseline, and the first thing we learned is that your initial assertions will be wrong.]]></summary></entry><entry><title type="html">Building CLIs for Agents: What the Original Article Missed</title><link href="https://digital-transformation-advisory.com/2026/03/26/building-clis-for-agents-what-the-original-article-missed/" rel="alternate" type="text/html" title="Building CLIs for Agents: What the Original Article Missed" /><published>2026-03-26T00:00:00+00:00</published><updated>2026-03-26T00:00:00+00:00</updated><id>https://digital-transformation-advisory.com/2026/03/26/building-clis-for-agents-what-the-original-article-missed</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/03/26/building-clis-for-agents-what-the-original-article-missed/"><![CDATA[<p>An article on building CLIs for agents went around recently. It made good points: make things non-interactive, add <code class="language-plaintext highlighter-rouge">--dry-run</code>, return data on success. Solid basics.</p>

<p>But it missed the patterns that actually make the difference between a CLI that agents can technically use and one they can use well. We took the original article, added the missing pieces, turned it into a <a href="https://github.com/saileshpanchal/agent-friendly-cli">Claude Code skill</a>, and benchmarked the results.</p>

<h2 id="what-was-missing">What Was Missing</h2>

<p>The original covered six patterns. We added twelve more. Here are the ones that matter most.</p>

<h3 id="structured-output-is-the-foundation">Structured Output Is the Foundation</h3>

<p>The original article mentioned returning data on success. That’s a subset of the real principle: every command should support <code class="language-plaintext highlighter-rouge">--output json</code>. Not just success messages. Every list, every status check, every describe command. And the default should be human-readable tables, not JSON — you’re serving two audiences.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># human-friendly default</span>
<span class="nv">$ </span>mycli service list
NAME        STATUS    REPLICAS
web         running   3
api         running   2

<span class="c"># agent-friendly</span>
<span class="nv">$ </span>mycli service list <span class="nt">--output</span> json
<span class="o">[</span>
  <span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"web"</span>, <span class="s2">"status"</span>: <span class="s2">"running"</span>, <span class="s2">"replicas"</span>: 3<span class="o">}</span>,
  <span class="o">{</span><span class="s2">"name"</span>: <span class="s2">"api"</span>, <span class="s2">"status"</span>: <span class="s2">"running"</span>, <span class="s2">"replicas"</span>: 2<span class="o">}</span>
<span class="o">]</span>
</code></pre></div></div>

<p>This one pattern eliminates the largest class of agent failures: parsing human-formatted text.</p>

<h3 id="stderr-vs-stdout-separation">Stderr vs Stdout Separation</h3>

<p>This wasn’t mentioned at all, and it’s critical. Data goes to stdout. Diagnostics, progress, and logs go to stderr. Without this, agents can’t pipe commands together — progress messages corrupt the data stream.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">_emit</span><span class="p">(</span><span class="n">data</span><span class="p">:</span> <span class="nb">dict</span><span class="p">,</span> <span class="n">output</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Data to stdout.</span><span class="sh">"""</span>
    <span class="k">if</span> <span class="n">output</span> <span class="o">==</span> <span class="sh">"</span><span class="s">json</span><span class="sh">"</span><span class="p">:</span>
        <span class="n">click</span><span class="p">.</span><span class="nf">echo</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="nf">dumps</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">_log</span><span class="p">(</span><span class="n">msg</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Diagnostics to stderr.</span><span class="sh">"""</span>
    <span class="n">click</span><span class="p">.</span><span class="nf">echo</span><span class="p">(</span><span class="n">msg</span><span class="p">,</span> <span class="n">err</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>

<p>When <code class="language-plaintext highlighter-rouge">--output json</code> is set, be strict: no log lines should leak into stdout.</p>

<h3 id="exit-codes-that-mean-something">Exit Codes That Mean Something</h3>

<p>The original said “fail fast with actionable errors.” That’s necessary but not sufficient. Agents need distinct exit codes so they can branch without parsing stderr:</p>

<table>
  <thead>
    <tr>
      <th>Code</th>
      <th>Meaning</th>
      <th>Agent Action</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0</td>
      <td>Success</td>
      <td>Continue</td>
    </tr>
    <tr>
      <td>1</td>
      <td>General error</td>
      <td>Read stderr, retry or escalate</td>
    </tr>
    <tr>
      <td>2</td>
      <td>Usage error</td>
      <td>Fix invocation and retry</td>
    </tr>
    <tr>
      <td>3</td>
      <td>Auth error</td>
      <td>Re-authenticate, then retry</td>
    </tr>
    <tr>
      <td>4</td>
      <td>Not found</td>
      <td>Resource doesn’t exist</td>
    </tr>
    <tr>
      <td>5</td>
      <td>Conflict</td>
      <td>Already exists / state conflict</td>
    </tr>
  </tbody>
</table>

<p>An agent that gets exit code 3 knows to refresh its token. An agent that gets exit code 2 knows to check its flags. An agent that gets exit code 1 has to read and parse the error message. The more specific your codes, the faster the recovery.</p>

<h3 id="tty-detection-for-graceful-degradation">TTY Detection for Graceful Degradation</h3>

<p>The original said “make it non-interactive.” Better advice: detect whether you’re talking to a human or a pipe, and behave accordingly.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="ow">not</span> <span class="n">args</span><span class="p">.</span><span class="n">env</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">sys</span><span class="p">.</span><span class="n">stdin</span><span class="p">.</span><span class="nf">isatty</span><span class="p">():</span>
        <span class="n">args</span><span class="p">.</span><span class="n">env</span> <span class="o">=</span> <span class="nf">prompt_user</span><span class="p">(</span><span class="sh">"</span><span class="s">Which environment?</span><span class="sh">"</span><span class="p">,</span>
                               <span class="n">choices</span><span class="o">=</span><span class="p">[</span><span class="sh">"</span><span class="s">staging</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">production</span><span class="sh">"</span><span class="p">])</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="nf">die</span><span class="p">(</span><span class="sh">"</span><span class="s">Error: --env is required</span><span class="se">\n</span><span class="sh">"</span>
            <span class="sh">"</span><span class="s">  mycli deploy --env &lt;staging|production&gt;</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div>

<p>This gives humans the interactive experience they expect while failing fast for agents with an actionable error message.</p>

<h3 id="auth-without-browsers">Auth Without Browsers</h3>

<p>The original didn’t mention authentication at all. Agents can’t open browsers for OAuth or type passwords at prompts. Your CLI needs to support at least three auth methods:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># 1. Flag (highest priority)</span>
<span class="nv">$ </span>mycli <span class="nt">--token</span> sk-abc123 service list

<span class="c"># 2. Environment variable</span>
<span class="nv">$ MYCLI_TOKEN</span><span class="o">=</span>sk-abc123 mycli service list

<span class="c"># 3. Config file (written once by a human)</span>
<span class="nv">$ </span>mycli auth configure <span class="nt">--token</span> sk-abc123
<span class="nv">$ </span>mycli service list  <span class="c"># reads from ~/.mycli/config</span>
</code></pre></div></div>

<h3 id="pagination">Pagination</h3>

<p>Also missing from the original. Dumping 10,000 results into an agent’s context window is expensive and usually unnecessary. Support <code class="language-plaintext highlighter-rouge">--limit</code>, <code class="language-plaintext highlighter-rouge">--offset</code>, and ideally cursor-based pagination:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>mycli logs list <span class="nt">--limit</span> 20 <span class="nt">--output</span> json
<span class="o">{</span>
  <span class="s2">"items"</span>: <span class="o">[</span>...],
  <span class="s2">"pagination"</span>: <span class="o">{</span>
    <span class="s2">"total"</span>: 1847,
    <span class="s2">"limit"</span>: 20,
    <span class="s2">"next_cursor"</span>: <span class="s2">"eyJpZCI6MTIwfQ=="</span>
  <span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>

<h2 id="the-full-checklist">The Full Checklist</h2>

<p>We organized all 16 principles into three tiers. Here’s the quick reference:</p>

<p><strong>Must-Have</strong> — agents can’t function without these:</p>
<ul>
  <li>All inputs accepted as flags</li>
  <li><code class="language-plaintext highlighter-rouge">--output json</code> on every data-returning command</li>
  <li>Stdout for data, stderr for diagnostics</li>
  <li><code class="language-plaintext highlighter-rouge">--help</code> with examples on every subcommand</li>
  <li>Fail fast with actionable errors</li>
  <li>Distinct exit codes</li>
  <li>Auth via env vars / config files / <code class="language-plaintext highlighter-rouge">--token</code></li>
</ul>

<p><strong>Should-Have</strong> — agents work much better:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">--dry-run</code> for destructive actions</li>
  <li><code class="language-plaintext highlighter-rouge">--yes</code> / <code class="language-plaintext highlighter-rouge">--force</code> to skip confirmations</li>
  <li>Idempotent commands</li>
  <li>Consistent <code class="language-plaintext highlighter-rouge">resource verb</code> structure</li>
  <li>Structured success responses</li>
  <li><code class="language-plaintext highlighter-rouge">--quiet</code> mode</li>
  <li>Pagination</li>
</ul>

<p><strong>Nice-to-Have</strong> — makes agents more efficient:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">--stdin</code> for pipe-friendly input</li>
  <li>Machine-readable progress on stderr</li>
  <li>Programmatic command/flag discovery</li>
  <li>Versioned output schemas</li>
  <li>Verbosity levels</li>
  <li>TTY detection</li>
</ul>

<h2 id="we-benchmarked-it">We Benchmarked It</h2>

<p>We didn’t just write the list — we turned it into a Claude Code skill and tested whether it actually changes outcomes. We ran three eval scenarios (building a deploy CLI, reviewing existing code, writing a config import command) with and without the skill, then graded the outputs against specific assertions.</p>

<table>
  <thead>
    <tr>
      <th>Eval</th>
      <th>With Skill</th>
      <th>Without</th>
      <th>Delta</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Deploy CLI structure</td>
      <td>6/6 (100%)</td>
      <td>2/6 (33%)</td>
      <td><strong>+67%</strong></td>
    </tr>
    <tr>
      <td>CLI code review</td>
      <td>12/12 (100%)</td>
      <td>10/12 (83%)</td>
      <td><strong>+17%</strong></td>
    </tr>
    <tr>
      <td>Config import command</td>
      <td>4/5 (80%)</td>
      <td>2/5 (40%)</td>
      <td><strong>+40%</strong></td>
    </tr>
  </tbody>
</table>

<p>The biggest delta was in code generation. Without the skill, the model produced CLIs with positional arguments, <code class="language-plaintext highlighter-rouge">print("Done.")</code> success messages, and no exit codes. With the skill, every command got <code class="language-plaintext highlighter-rouge">--output json</code>, stderr/stdout separation, TTY-aware confirmations, and structured dry-run previews.</p>

<p>The code review eval was closer because the model already catches obvious issues like interactive prompts. But the skill caught the subtler patterns: emoji in output being fragile for parsing, missing stderr/stdout separation, and the absence of severity tiers in the review itself.</p>

<h2 id="get-the-skill">Get the Skill</h2>

<p>The skill is open source and available on GitHub:</p>

<p><strong><a href="https://github.com/saileshpanchal/agent-friendly-cli">github.com/saileshpanchal/agent-friendly-cli</a></strong></p>

<p>Install it:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> ~/.claude/skills/agent-friendly-cli
curl <span class="nt">-o</span> ~/.claude/skills/agent-friendly-cli/SKILL.md <span class="se">\</span>
  https://raw.githubusercontent.com/saileshpanchal/agent-friendly-cli/main/SKILL.md
</code></pre></div></div>

<p>It triggers automatically when you’re writing CLI code with Click, argparse, Cobra, Clap, or any CLI framework. No slash command needed — just start writing CLI code and it activates.</p>

<p>The patterns themselves are language-agnostic. Whether you’re building CLIs in Python, Go, Rust, or Node, the same principles apply. The skill just makes sure they’re applied consistently.</p>]]></content><author><name>Sailesh Panchal</name></author><category term="Engineering" /><category term="AI" /><category term="cli" /><category term="agents" /><category term="developer-tools" /><category term="claude" /><category term="open-source" /><category term="python" /><category term="click" /><category term="design-patterns" /><category term="automation" /><category term="gstack" /><summary type="html"><![CDATA[An article on agent-friendly CLIs went around recently. It covered the basics well but missed the patterns that matter most. We built a Claude Code skill that fills the gaps — structured output, stderr separation, exit codes, TTY detection — and benchmarked it at +40-67% improvement.]]></summary></entry><entry><title type="html">Building an Enterprise Security Chassis for Vapor: What Swift Was Missing</title><link href="https://digital-transformation-advisory.com/2026/03/23/building-an-enterprise-security-chassis-for-vapor/" rel="alternate" type="text/html" title="Building an Enterprise Security Chassis for Vapor: What Swift Was Missing" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://digital-transformation-advisory.com/2026/03/23/building-an-enterprise-security-chassis-for-vapor</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/03/23/building-an-enterprise-security-chassis-for-vapor/"><![CDATA[<p>Here’s a test. Go to the Vapor ecosystem and find a reusable library that gives you multi-tenant authorization with deny-precedence policy composition, OIDC authentication with PKCE, tamper-evident audit logging, data classification enforcement, and tenant-scoped data access — all wired into a middleware pipeline that fails safe when you get the ordering wrong.</p>

<p>You won’t find one. Not because Vapor is immature — it’s a serious framework with a serious community. But the ecosystem has optimised for breadth (here’s how to build a REST API, here’s a CRUD template) rather than depth (here’s how to build an application that a compliance officer would sign off on).</p>

<p>That’s the gap we set out to close.</p>

<h2 id="the-gap-specifically">The Gap, Specifically</h2>

<p>We needed a server-side Swift web application for a recruitment platform. The data is sensitive — CVs, salary histories, skills assessments tied to named individuals. The platform is multi-tenant — different recruitment firms, different organisations, different data that must never leak across boundaries. We chose Vapor because we’re an Apple-ecosystem shop and Swift 6’s concurrency model is genuinely good for server work.</p>

<p>Then we started listing what we needed that didn’t exist as a reusable package:</p>

<table>
  <thead>
    <tr>
      <th>Requirement</th>
      <th>Vapor Ecosystem</th>
      <th>What We Had to Build</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>OIDC authentication with PKCE</td>
      <td>JWT verification exists; full OIDC flow doesn’t</td>
      <td>Complete OIDC controller with PKCE S256</td>
    </tr>
    <tr>
      <td>Multi-tenant authorization</td>
      <td>Nothing reusable</td>
      <td>6-policy composite with deny precedence</td>
    </tr>
    <tr>
      <td>Tenant-scoped data access</td>
      <td>Nothing</td>
      <td>Repository pattern enforcing isolation at query level</td>
    </tr>
    <tr>
      <td>Data classification (sensitivity labels)</td>
      <td>Nothing</td>
      <td>Hard-gate policies for confidential/personal/privileged data</td>
    </tr>
    <tr>
      <td>Tamper-evident audit logging</td>
      <td>Nothing</td>
      <td>Per-tenant SHA-256 hash chains</td>
    </tr>
    <tr>
      <td>CSRF for Leaf + HTMX</td>
      <td>Partial examples</td>
      <td>Middleware with <code class="language-plaintext highlighter-rouge">req.csrfToken</code> for templates</td>
    </tr>
    <tr>
      <td>Session management with key rotation</td>
      <td>Nothing reusable</td>
      <td>Dual-key HMAC-SHA256 with constant-time verification</td>
    </tr>
    <tr>
      <td>Environment-driven auth modes</td>
      <td>Nothing</td>
      <td>Zero-code-change switching between disabled/optional/required</td>
    </tr>
  </tbody>
</table>

<p>Eight gaps. All of them are table stakes for enterprise software. None of them existed as drop-in packages.</p>

<h2 id="vaporsecuritykit-the-chassis">VaporSecurityKit: The Chassis</h2>

<p>Rather than solving these problems inline — scattered across controllers, coupled to our application — we built a reusable library. Any Vapor application imports it via a <code class="language-plaintext highlighter-rouge">Package.swift</code> git URL and gets the full security chassis in one call:</p>

<div class="language-swift highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">try</span> <span class="n">app</span><span class="o">.</span><span class="nf">useSecurityKit</span><span class="p">(</span><span class="nv">config</span><span class="p">:</span> <span class="o">.</span><span class="nf">fromEnvironment</span><span class="p">())</span>
</code></pre></div></div>

<p>That single line wires up a six-stage middleware pipeline in the correct order, registers OIDC routes, and configures session management. The ordering matters — and that’s exactly why it’s encapsulated.</p>

<h3 id="the-middleware-pipeline-order-is-security">The Middleware Pipeline (Order Is Security)</h3>

<figure>
  <img src="/assets/diagrams/rendered/middleware-pipeline.svg" alt="VaporSecurityKit middleware pipeline: RateLimit → CSRF → Audit → PrincipalResolution → TenantResolution → Authorization → Controller" style="width: 100%; max-width: 900px;" />
  <figcaption>The six-stage middleware pipeline. Each stage reads values written by the previous stage — data flows left to right. <em>Rendered with <a href="https://d2lang.com">D2</a>.</em></figcaption>
</figure>

<p>Each stage reads values written by the previous stage. PrincipalResolution can’t run before the session middleware. TenantResolution needs the principal to cross-validate tenant claims. Authorization needs both. Getting this wrong doesn’t throw a compiler error — it creates a security hole that passes all your tests.</p>

<p>By shipping the pipeline as a library with a fixed ordering, consuming applications can’t accidentally reorder it. The security decision is made once, in the library, not re-made in every project.</p>

<h3 id="deny-precedence-policy-composition">Deny-Precedence Policy Composition</h3>

<p>Most authorization systems we’ve seen in web frameworks use a simple role check: does the user have the <code class="language-plaintext highlighter-rouge">admin</code> role? Yes or no. That works for toy applications. It falls apart when you need to combine multiple concerns — role, ownership, data sensitivity, sharing scope, workspace membership — into a single access decision.</p>

<p>Our <code class="language-plaintext highlighter-rouge">CompositePolicy</code> evaluates all applicable policies and applies a strict precedence:</p>

<ol>
  <li>Any policy returns <code class="language-plaintext highlighter-rouge">.deny</code> → access denied (deny is final, regardless of order)</li>
  <li>Any returns <code class="language-plaintext highlighter-rouge">.elevationRequired</code> → privilege elevation required</li>
  <li>At least one returns <code class="language-plaintext highlighter-rouge">.allow</code> → access granted</li>
  <li>Everything abstains → denied by default</li>
</ol>

<p>The critical property: <strong>order doesn’t affect the security decision.</strong> You can add policies, remove policies, reorder policies — the deny-precedence semantics are invariant. This is harder to get wrong than a chain of <code class="language-plaintext highlighter-rouge">if</code> statements.</p>

<figure>
  <img src="/assets/diagrams/rendered/policy-evaluation.svg" alt="CompositePolicy deny-precedence evaluation: all policies evaluated in parallel, any deny is final, then elevation check, then allow check, deny by default" style="width: 100%; max-width: 600px;" />
  <figcaption>Deny-precedence evaluation. All six policies run in parallel — a single deny overrides any number of allows. <em>Rendered with <a href="https://plantuml.com">PlantUML</a>.</em></figcaption>
</figure>

<p>Policies are classified by intent:</p>

<table>
  <thead>
    <tr>
      <th>Type</th>
      <th>Policies</th>
      <th>Behaviour</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Hard gates</strong> (deny-only)</td>
      <td>Sensitivity, SharingScope</td>
      <td>Can deny but never allow — they protect boundaries</td>
    </tr>
    <tr>
      <td><strong>Allow refiners</strong></td>
      <td>Role, Ownership, WorkspaceScope, GroupScope</td>
      <td>Can grant access but never override a deny</td>
    </tr>
  </tbody>
</table>

<p>A hard gate for data sensitivity means that even if you’re a tenant admin, you can’t read a <code class="language-plaintext highlighter-rouge">privileged</code>-classified resource without active elevation. The policy doesn’t know or care about roles — it enforces classification, full stop.</p>

<h3 id="tamper-evident-audit-logging">Tamper-Evident Audit Logging</h3>

<p>Audit logs that can be silently edited aren’t audit logs. They’re wish lists.</p>

<p>Our <code class="language-plaintext highlighter-rouge">FluentAuditLogger</code> maintains a per-tenant SHA-256 hash chain. Each audit event’s hash includes the previous event’s hash, creating a blockchain-like chain per organisation. If someone modifies or deletes an event in the middle, the chain breaks — and <code class="language-plaintext highlighter-rouge">verifyChain(organizationId:)</code> returns <code class="language-plaintext highlighter-rouge">false</code>.</p>

<p>The chain is per-tenant, not global. Organisation A’s audit trail is independent of Organisation B’s. A chain verification for one tenant doesn’t require reading every audit event in the system.</p>

<figure>
  <img src="/assets/diagrams/rendered/audit-hash-chain.svg" alt="Per-tenant SHA-256 audit hash chains: each organisation maintains an independent append-only chain" style="width: 100%; max-width: 800px;" />
  <figcaption>Independent hash chains per tenant. Tampering with any event breaks the chain from that point forward. <em>Rendered with <a href="https://d2lang.com">D2</a>.</em></figcaption>
</figure>

<p>When the database write fails — network issue, disk full, whatever — the logger falls back to console output rather than silently dropping events. You can lose formatting. You can’t lose the record that something happened.</p>

<h3 id="tenant-isolation-at-the-data-layer">Tenant Isolation at the Data Layer</h3>

<p>OWASP’s multi-tenant guidance is clear: tenant isolation must be enforced at the data access layer, not just in middleware. A middleware that checks “is this user in tenant A?” is necessary but not sufficient — a controller that runs a raw Fluent query can still return tenant B’s data.</p>

<p><code class="language-plaintext highlighter-rouge">TenantScopedRepository</code> solves this by wrapping Fluent queries with an automatic <code class="language-plaintext highlighter-rouge">organizationId</code> filter. Controllers use the repository instead of raw queries. The scope is structural — you can’t forget to add the filter because the repository adds it for you.</p>

<div class="language-swift highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">let</span> <span class="nv">repo</span> <span class="o">=</span> <span class="kt">TenantScopedRepository</span><span class="o">&lt;</span><span class="kt">UserModel</span><span class="o">&gt;</span><span class="p">(</span><span class="nv">tenant</span><span class="p">:</span> <span class="n">req</span><span class="o">.</span><span class="n">resolvedTenantContext</span><span class="p">)</span>
<span class="k">let</span> <span class="nv">users</span> <span class="o">=</span> <span class="k">try</span> <span class="k">await</span> <span class="n">repo</span><span class="o">.</span><span class="nf">query</span><span class="p">(</span><span class="nv">on</span><span class="p">:</span> <span class="n">req</span><span class="o">.</span><span class="n">db</span><span class="p">)</span><span class="o">.</span><span class="nf">all</span><span class="p">()</span>
<span class="c1">// Only returns users in the current tenant — always</span>
</code></pre></div></div>

<figure>
  <img src="/assets/diagrams/rendered/tenant-isolation.svg" alt="Multi-layer tenant isolation: middleware resolves tenant, policies enforce scope, repository auto-filters queries" style="width: 100%; max-width: 700px;" />
  <figcaption>Three layers of tenant isolation. Even if middleware and policies pass, the repository enforces scoping at the query level. <em>Rendered with <a href="https://d2lang.com">D2</a>.</em></figcaption>
</figure>

<p>Cross-tenant access attempts don’t throw an error — they return no results. From the controller’s perspective, users in other tenants simply don’t exist. This is the right semantic for multi-tenant data: not “you can’t access this” but “this doesn’t exist in your world.”</p>

<h3 id="the-auth_mode-contract">The AUTH_MODE Contract</h3>

<p>Development and production have fundamentally different authentication needs. In development, you want to test authorization logic without running an OIDC provider. In production, you want mandatory authentication with no backdoors.</p>

<p>We solved this with an environment variable — <code class="language-plaintext highlighter-rouge">AUTH_MODE</code> — that switches between three modes with zero code changes:</p>

<table>
  <thead>
    <tr>
      <th>Mode</th>
      <th>What Happens</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">disabled</code></td>
      <td>Four seeded demo principals with realistic role sets. The entire authorization pipeline still runs — you’re testing real policies against fake identities</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">optional</code></td>
      <td>JWT resolved if present, demo identity if not. Useful for staging environments where some users are authenticated and some aren’t</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">required</code></td>
      <td>401 on unauthenticated requests. Full OIDC flow. Production mode</td>
    </tr>
  </tbody>
</table>

<p>The key insight: <code class="language-plaintext highlighter-rouge">disabled</code> mode doesn’t bypass security. It provides known identities so the authorization pipeline runs fully. You’re testing the policies, not just testing that your login form works.</p>

<h2 id="the-toolchain-how-we-actually-built-this">The Toolchain: How We Actually Built This</h2>

<p>The framework took three phases over five days. That speed came from the toolchain as much as the code.</p>

<h3 id="claude-as-pair-programmer">Claude as Pair Programmer</h3>

<p>Claude wrote code in this project — the co-author tag is on every commit. But the more interesting pattern was <strong>plan refinement</strong> — using Claude and Perplexity together to validate architectural decisions before writing a line of code.</p>

<p>The workflow: we’d describe an architectural question to Claude — “how should deny-precedence work when policies can abstain?” — and get a detailed proposal. Then we’d take the same question to Perplexity with a different framing: “what are the failure modes of order-independent policy evaluation in RBAC systems?” Perplexity returns academic papers, OWASP guidance, real-world CVEs from systems that got this wrong.</p>

<p>The two tools have complementary blind spots. Claude is excellent at generating coherent designs but can be confidently wrong about edge cases it hasn’t seen. Perplexity surfaces real-world evidence — papers, CVEs, production incident reports — but doesn’t synthesise them into a design. Using both, iteratively, produces better architecture than either alone.</p>

<p>Concrete example: Claude’s initial proposal for the audit hash chain used a global chain — every event hashed against the previous global event. Perplexity surfaced a paper on audit log scalability that showed global chains become a serialisation bottleneck under concurrent writes. We switched to per-tenant chains before writing the code. That’s a design decision that would have been expensive to change after implementation and invisible in testing until production load exposed it.</p>

<p>We ran this loop — Claude proposes, Perplexity validates, Claude revises — for every significant architectural decision: middleware ordering, policy classification, session rotation, CSRF token generation. The plan was solid before the first <code class="language-plaintext highlighter-rouge">swift build</code>.</p>

<figure>
  <img src="/assets/diagrams/rendered/toolchain-workflow.svg" alt="Development toolchain: Claude proposes, Perplexity validates, gstack tests, atomic commits" style="width: 100%; max-width: 800px;" />
  <figcaption>The full development loop. Plan refinement (top) feeds validated architecture into implementation (bottom). <em>Rendered with <a href="https://plantuml.com">PlantUML</a>.</em></figcaption>
</figure>

<h3 id="gstack-for-qa-and-development-workflow">gstack for QA and Development Workflow</h3>

<p>We use <a href="https://github.com/garrytan/gstack">gstack</a> — Garry Tan’s open-source skill collection that turns Claude Code into a virtual engineering team — throughout development. gstack provides 28 specialised slash commands that cover the entire sprint lifecycle: planning (<code class="language-plaintext highlighter-rouge">/office-hours</code>, <code class="language-plaintext highlighter-rouge">/plan-ceo-review</code>, <code class="language-plaintext highlighter-rouge">/plan-eng-review</code>), building, reviewing (<code class="language-plaintext highlighter-rouge">/review</code>), QA testing (<code class="language-plaintext highlighter-rouge">/qa</code>, <code class="language-plaintext highlighter-rouge">/browse</code>), security auditing (<code class="language-plaintext highlighter-rouge">/cso</code>), and shipping (<code class="language-plaintext highlighter-rouge">/ship</code>, <code class="language-plaintext highlighter-rouge">/land-and-deploy</code>). It’s the setup the YC CEO uses to ship 10,000+ lines of production code per day. Not just for final QA, but as part of the development loop.</p>

<p>The pattern: write a feature, deploy locally, use <code class="language-plaintext highlighter-rouge">/qa</code> to systematically test the feature against a checklist, get a structured bug report with screenshots, fix the bugs with before/after evidence. Each fix is an atomic commit. The QA cycle catches things that unit tests miss — rendering issues, middleware ordering effects on actual HTTP responses, CSRF token flow through real form submissions.</p>

<figure>
  <img src="/assets/diagrams/rendered/oidc-flow.svg" alt="OIDC authentication flow with PKCE S256: login, redirect, token exchange, session creation, and dual-key validation" style="width: 100%; max-width: 750px;" />
  <figcaption>The complete OIDC flow. PKCE S256 eliminates the need for a client secret on public clients. Dual-key session rotation keeps old sessions valid during key changes. <em>Rendered with <a href="https://plantuml.com">PlantUML</a>.</em></figcaption>
</figure>

<p>For the OIDC flow specifically, gstack was invaluable. OIDC involves redirects, state parameters, PKCE challenge/verifier pairs, and cookie handling that’s nearly impossible to test with unit tests alone. We used <code class="language-plaintext highlighter-rouge">/browse</code> to walk through the entire login → callback → session → logout flow in a real browser, capturing screenshots at each step. When the PKCE verifier wasn’t being stored correctly in the session, the browser test caught it immediately — the unit test had passed because it was mocking the session storage.</p>

<p>The <code class="language-plaintext highlighter-rouge">/review</code> skill runs before every PR — analysing the diff for SQL safety issues, trust boundary violations, and structural problems. It caught a case where a controller was using a raw Fluent query instead of <code class="language-plaintext highlighter-rouge">TenantScopedRepository</code> — a tenant isolation violation that would have been invisible in code review because the query was syntactically correct.</p>

<h2 id="what-we-shipped">What We Shipped</h2>

<p>Three phases, five commits, 45 tests passing with zero warnings:</p>

<table>
  <thead>
    <tr>
      <th>Phase</th>
      <th>What</th>
      <th>Key Files</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>1. Security Chassis</strong></td>
      <td>Middleware pipeline, principal resolution, rate limiting</td>
      <td>6 middleware files, SecurityKit entry point</td>
    </tr>
    <tr>
      <td><strong>2. OIDC + Sessions</strong></td>
      <td>Full OIDC with PKCE, dual-key session management, CSRF</td>
      <td>OIDCController, SessionManager, PKCEGenerator</td>
    </tr>
    <tr>
      <td><strong>3. Models + Policies + Audit</strong></td>
      <td>8 Fluent models, 6 authorization policies, hash-chain audit logger</td>
      <td>CompositePolicy, FluentAuditLogger, TenantScopedRepository</td>
    </tr>
  </tbody>
</table>

<p>The framework is Apache 2.0 licensed. Any Vapor application can import it and get the full enterprise security chassis — the same one we’re using for our own production applications.</p>

<h2 id="whats-still-missing">What’s Still Missing</h2>

<p>We’re honest about what isn’t built yet:</p>

<p><strong>Leaf templates</strong> — the <code class="language-plaintext highlighter-rouge">Resources/Views/</code> directory is empty. The CSRF middleware generates tokens and makes them available via <code class="language-plaintext highlighter-rouge">req.csrfToken</code>, but the actual Leaf templates for the reference application (login screens, dashboards, admin panels) haven’t been built. That’s Phase 4.</p>

<p><strong>HTMX integration</strong> — the CSRF middleware supports HTMX headers (<code class="language-plaintext highlighter-rouge">X-CSRF-Token</code>), but the front-end layer using Pico CSS + HTMX is planned, not shipped.</p>

<p><strong>Database triggers</strong> — the audit logger enforces append-only semantics in application code, but the SQL triggers that prevent <code class="language-plaintext highlighter-rouge">UPDATE</code>/<code class="language-plaintext highlighter-rouge">DELETE</code> at the database level aren’t in the migration yet. Application-level enforcement is necessary but not sufficient.</p>

<p><strong>Privilege elevation flow</strong> — <code class="language-plaintext highlighter-rouge">SensitivityPolicy</code> returns <code class="language-plaintext highlighter-rouge">.elevationRequired</code> for privileged resources, and there’s a <code class="language-plaintext highlighter-rouge">PrivilegeElevationModel</code> in the schema, but the actual elevation UI and approval workflow aren’t implemented.</p>

<h2 id="the-broader-observation">The Broader Observation</h2>

<p>Server-side Swift is mature enough for production web applications. The language’s concurrency model, type safety, and performance characteristics are genuine advantages over Node.js and Rails for security-sensitive work. What’s missing isn’t capability — it’s the reusable building blocks that other ecosystems take for granted.</p>

<p>Django ships with authentication, authorization, CSRF protection, and an admin panel. Rails has Devise, Pundit, and paper_trail. Spring has Spring Security. Vapor has JWT verification and session middleware — and then you’re on your own.</p>

<p>VaporSecurityKit is our attempt to close that gap. Not for every Vapor application — a blog doesn’t need deny-precedence policy composition. But for the applications that handle sensitive data, serve multiple tenants, and need to pass a security review? The chassis should exist as a package, not as tribal knowledge.</p>

<hr />

<p><em>Sailesh Panchal is Director at Digital Transformation Advisory (DTA), specialising in technology strategy and AI-native architecture for enterprise clients.</em></p>]]></content><author><name>Sailesh Panchal</name></author><category term="Engineering" /><category term="Security" /><category term="swift" /><category term="vapor" /><category term="server-side-swift" /><category term="security" /><category term="multi-tenant" /><category term="oidc" /><category term="authorization" /><category term="open-source" /><category term="apple-silicon" /><category term="gstack" /><category term="claude" /><category term="perplexity" /><summary type="html"><![CDATA[Vapor is a capable web framework. But if you want enterprise-grade multi-tenant security — OIDC, RBAC, audit trails, tenant isolation — you're writing it from scratch. We built the chassis that closes those gaps, and the toolchain that made it possible.]]></summary></entry><entry><title type="html">When Your Data Can’t Leave the Building: Training Small Language Models for Enterprise</title><link href="https://digital-transformation-advisory.com/2026/03/20/when-your-data-cant-leave-the-building/" rel="alternate" type="text/html" title="When Your Data Can’t Leave the Building: Training Small Language Models for Enterprise" /><published>2026-03-20T10:00:00+00:00</published><updated>2026-03-20T10:00:00+00:00</updated><id>https://digital-transformation-advisory.com/2026/03/20/when-your-data-cant-leave-the-building</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/03/20/when-your-data-cant-leave-the-building/"><![CDATA[<p>Picture this. You’re a recruiter at a specialist firm. A hiring manager sends you a job description for a Lead Platform Engineer. You need to understand exactly what skills this role requires, map them to an industry framework, and match against your candidate database — ideally in the time it takes to read the email.</p>

<p>Now picture the data involved. CVs with home addresses, salary histories, and career trajectories. Skills assessments tied to named individuals. Internal compensation benchmarks. Disability and diversity information. Client organisation charts.</p>

<p>You call a cloud API — GPT-4, Claude, whatever’s flavour of the month — and every piece of that data leaves your network, crosses the internet, and arrives at a third party’s data centre. The API terms say they won’t train on it. Your compliance officer says the risk assessment takes six weeks. Your client’s contract says their data stays in the UK.</p>

<p>This isn’t a hypothetical. It’s the conversation we have with almost every enterprise client who wants to use AI on sensitive data. And the answer is usually the same: “We’d love to, but we can’t.”</p>

<h2 id="the-privacy-tax">The Privacy Tax</h2>

<p>The standard solution is to sanitise the data before sending it to a cloud API. Strip names, mask salaries, replace company names with placeholders. This works — for simple tasks. But language understanding is contextual. “10 years at a Big Four firm” carries different weight than “10 years in a startup.” Sanitising the context destroys the signal.</p>

<p>The other solution is to run everything on-premises. Deploy a 70-billion-parameter model on your own GPUs. This works too — if you have a team of ML engineers, a rack of A100s, and a budget that doesn’t need to survive a quarterly review.</p>

<p>What we actually need is a model small enough to run on the hardware people already have — a laptop, a phone, a Mac Mini in a server cupboard — that understands the specific domain well enough to be useful. Not a general-purpose genius. A specialist.</p>

<h2 id="small-language-models-the-right-tool-for-bounded-problems">Small Language Models: The Right Tool for Bounded Problems</h2>

<p>A small language model (SLM) is typically 2-4 billion parameters, compared to 70-400 billion for the cloud models. At first glance, that’s a massive capability gap. And for general-purpose tasks — writing essays, coding, broad-knowledge Q&amp;A — it is.</p>

<p>But enterprise problems aren’t general-purpose. The recruiter doesn’t need a model that can write poetry and debug Rust. They need one that can read a job description and output a structured skills assessment against a specific framework. The vocabulary is bounded. The output format is defined. The success criteria are measurable.</p>

<p>This is where fine-tuning changes the equation. You take a capable-but-generic base model and train it on your domain until it becomes a specialist. The model doesn’t need to know everything — it needs to know <em>your</em> things very well.</p>

<p>We train two tiers from the same pipeline and the same training data:</p>

<table>
  <thead>
    <tr>
      <th>Tier</th>
      <th>Base Model</th>
      <th>Target</th>
      <th>Size at 4-bit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>iPhone</strong></td>
      <td>Qwen 3.5 2B</td>
      <td>iPhone 15 Pro+ (8GB RAM)</td>
      <td>~1.2GB</td>
    </tr>
    <tr>
      <td><strong>Laptop</strong></td>
      <td>Qwen 3.5 4B</td>
      <td>Mac with 16GB+ RAM</td>
      <td>~2.5GB</td>
    </tr>
  </tbody>
</table>

<p>The laptop model scores higher — more parameters means more capacity for domain knowledge. But the iPhone model still passes minimum accuracy thresholds, and it runs on hardware that fits in a pocket. The consumer app selects the right tier at runtime based on what device it’s on. Same protocol, same prompt, different model file.</p>

<h2 id="building-the-model-factory">Building the Model Factory</h2>

<p>We’re not building one model. We’re building a <strong>pipeline</strong> — a reusable model factory that can produce domain-specific SLMs for different applications from the same codebase. Recruitment is one domain. There are others.</p>

<p>The factory works in stages, and each stage exists for a reason rooted in the business problem, not just the technology.</p>

<h3 id="stage-1-the-gold-set-and-the-teacher">Stage 1: The Gold Set and the Teacher</h3>

<p>Before any training happens, we build the exam paper: 200 expert-validated examples. Real job descriptions, real experience statements, each mapped to SFIA competencies by hand, reviewed by domain experts. These 200 examples are <em>never</em> used in training. They exist solely to measure whether the model is improving — the same 200-question test, administered at every checkpoint. If you train on the exam, the scores are meaningless.</p>

<p>With the exam built, we need the curriculum. The base model (an open-source model (2 billion parameters for phones, 4 billion for laptops)) knows language but not our domain. Rather than manually writing 1,800 more examples — which would be slow, expensive, and inconsistent — we use a large cloud model as a “teacher.” Claude generates high-quality training data: question/answer pairs, structured assessments, edge cases. The teacher sees our framework definitions and produces examples that follow the patterns we need.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Input:  "Senior Infrastructure Engineer — responsible for cloud
         platform strategy, team leadership, vendor management"

Output: {
  "skills": [
    {"name": "infrastructure design", "level": 5},
    {"name": "cloud services management", "level": 5},
    {"name": "technology leadership", "level": 4}
  ],
  "rationale": "Cloud platform strategy ownership with team and
                vendor management indicates senior autonomous
                practitioner level..."
}
</code></pre></div></div>

<p>The irony isn’t lost on us: we send framework definitions (public information) to a cloud API to generate training data, specifically so that production data (private information) never has to make the same trip. The teacher trains the student. Then the student works alone.</p>

<h3 id="stage-2-the-student-learns-lora-fine-tuning">Stage 2: The Student Learns (LoRA Fine-Tuning)</h3>

<p>A common assumption: you fine-tune a large model and then shrink it down to fit on a device. That’s not what we do. <strong>The training happens directly on the small models — 2B for iPhone, 4B for laptop.</strong> The large model’s job ended in Stage 1 — it created the curriculum. Now the students sit the exam alone.</p>

<p>We fine-tune each tier using LoRA — Low-Rank Adaptation. This freezes most of the model’s weights and trains a small adapter (~50MB) that modifies the model’s behaviour. Same training data, same technique, separate adapters for each model size. It’s fast, memory-efficient, and can run on a single Apple Silicon Mac.</p>

<p>The business reason this matters: the training infrastructure is a laptop, not a data centre. The team that maintains the model can retrain it when the framework updates, without submitting a GPU requisition. You’re not renting A100s to train a 70B model and then spending another day compressing it — you’re training the exact model that will ship, on the hardware it will run on.</p>

<p>We target both the attention layers (how the model relates words to each other) and the feed-forward layers (how it processes information). Including the feed-forward layers — a detail we learned the hard way — dramatically improves the model’s ability to produce valid structured output. When your application expects JSON, “almost valid JSON” is the same as broken.</p>

<h3 id="stage-3-the-student-gets-tested-rlvr">Stage 3: The Student Gets Tested (RLVR)</h3>

<p>After fine-tuning, the model can mimic the teacher’s format. But mimicry isn’t understanding. If the teacher said a particular skill was level 5, the student will say level 5 for <em>that</em> example. What about a job description it’s never seen?</p>

<p>This is where Reinforcement Learning from Verifiable Rewards (RLVR) takes over — and it’s the stage that runs overnight with a measurable improvement cycle.</p>

<h4 id="the-5-minute-checkpoint-cycle">The 5-Minute Checkpoint Cycle</h4>

<p>Here’s what actually happens on the machine, concretely:</p>

<ol>
  <li>
    <p><strong>Generate</strong> — The model receives a batch of prompts (real job descriptions it hasn’t seen in training). For each prompt, it generates a group of 8 candidate outputs. That’s 8 different attempts at the same skills assessment.</p>
  </li>
  <li><strong>Score</strong> — Every output gets scored against verifiable criteria. Not opinions — facts:
    <ul>
      <li><strong>Is the JSON valid?</strong> (Parser says yes or no — no ambiguity)</li>
      <li><strong>Are the referenced skills real?</strong> (Lookup against the framework — they exist or they don’t)</li>
      <li><strong>Is the assigned level reasonable?</strong> (Within ±1 of expert consensus)</li>
      <li><strong>Did the model hallucinate a skill that isn’t in the framework?</strong> (Verifiable)</li>
    </ul>
  </li>
  <li>
    <p><strong>Learn</strong> — The technique we use — GRPO (Group Relative Policy Optimization) — ranks the 8 outputs within the group. The best-scoring outputs become the positive training signal; the worst become the negative signal. The model’s weights adjust toward producing more outputs like the good ones. No separate “critic” model needed — the group comparison <em>is</em> the critic.</p>
  </li>
  <li>
    <p><strong>Checkpoint</strong> — Every ~5 minutes (roughly 50-100 training steps on Apple Silicon), the pipeline saves a snapshot: the current model weights, the timestamp, and the forge_score evaluated against the held-out gold set — the same 200 expert-validated examples, every time.</p>
  </li>
  <li><strong>Repeat</strong> — The cycle restarts with new prompts. If the score improved, training continues. If it degraded (which can happen — the model sometimes optimises for one metric at the expense of another), the pipeline can revert to the last good checkpoint and adjust.</li>
</ol>

<p>We combine the criteria into a single composite score:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forge_score</span> <span class="o">=</span> <span class="p">(</span>
    <span class="mf">0.30</span> <span class="o">*</span> <span class="n">json_validity</span>      <span class="o">+</span>
    <span class="mf">0.25</span> <span class="o">*</span> <span class="n">skill_f1</span>            <span class="o">+</span>
    <span class="mf">0.25</span> <span class="o">*</span> <span class="n">level_within_1</span>      <span class="o">+</span>
    <span class="mf">0.15</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">hallucination</span><span class="p">)</span> <span class="o">+</span>
    <span class="mf">0.05</span> <span class="o">*</span> <span class="n">evidence_grounding</span>
<span class="p">)</span>
</code></pre></div></div>

<h4 id="evidencing-the-improvement">Evidencing the Improvement</h4>

<p>This is the part that matters to a CTO or a compliance officer: <strong>can you prove the model got better?</strong></p>

<p>Yes — because every 5-minute checkpoint produces a forge_score against the same gold set. Plot them and you get a training curve:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Checkpoint    Time        forge_score    json_valid   skill_f1   hallucination
─────────────────────────────────────────────────────────────────────────────
ckpt-000      18:00       0.41           0.72         0.38       0.22
ckpt-012      19:00       0.58           0.94         0.51       0.15
ckpt-024      20:00       0.69           0.98         0.62       0.09
ckpt-048      22:00       0.77           1.00         0.71       0.05
ckpt-072      00:00       0.82           1.00         0.78       0.03
ckpt-096      02:00       0.84           1.00         0.81       0.02
ckpt-108      04:00       0.85           1.00         0.82       0.02  ← plateau
</code></pre></div></div>

<p>The pattern is consistent: JSON validity converges first (the model learns the format within the first hour), skill identification improves steadily through the night, and hallucination rate drops as the model learns what’s <em>not</em> in the framework. Eventually the score plateaus — the model has extracted all the signal available from the training data. That’s your stopping point.</p>

<p>Each checkpoint is a complete, usable model. If the 2am checkpoint scores 0.82 and the 4am checkpoint scores 0.85 but introduces a regression on one metric, you can ship the 2am version. The decision is auditable: here’s the score at each point, here’s what we chose, here’s why.</p>

<p>This is fundamentally different from training a model and hoping it works. Every 5 minutes, you have evidence.</p>

<h3 id="stage-4-the-model-ships-as-a-file">Stage 4: The Model Ships as a File</h3>

<p>Here’s where the “train the small model directly” approach pays off. Each LoRA adapter — all those improvements from SFT, RLVR, and DPO — gets fused back into its base model’s weights. No distillation step, no compression from a larger model. The adapter was always attached to the target-size model, so fusing is a simple matrix addition.</p>

<p>The combined models get quantised to 4-bit using AWQ (Activation-aware Weight Quantization), which protects the most important weights from precision loss. The result: two standalone files — 1.2GB for iPhone, 2.5GB for laptop.</p>

<p>The alternative approach — fine-tuning a 14B or 70B model first, then distilling down — would likely score higher on accuracy benchmarks. But it adds an entire extra stage (distillation), requires GPU servers for the larger model’s training, and introduces a compression step where domain knowledge can be lost. By training each target-size model directly, every weight update is optimised for the model that will actually run in production.</p>

<p>The consumer application loads the appropriate file at startup based on device capability, the same way it would load a database or a config file. It calls the model through a simple protocol — send text in, get structured output back. If the model improves, you ship new files. The application code doesn’t change.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Model Factory → sfia-mapper-iphone-4bit (1.2GB) → iPhone app (MLX.Swift)
(training)   → sfia-mapper-laptop-4bit  (2.5GB) → Mac app (MLX.Swift)
</code></pre></div></div>

<p>The factory never touches production. The consumer never touches training. The data never leaves the device. These boundaries are the entire point.</p>

<h3 id="a-note-on-the-teachers-role-during-training">A Note on the Teacher’s Role During Training</h3>

<p>In the stages above, the teacher (Claude) creates the data, then disappears. The student trains alone. But there’s one technique in the pipeline — Generalized Knowledge Distillation (GKD) — where the teacher stays involved longer.</p>

<p>The problem it solves: during training, the student only sees the teacher’s perfect outputs. But at inference time, the student works from its own imperfect outputs. This mismatch means the student can freeze when it encounters its own phrasing in production — like a student who studied from the textbook answer key and panics when the exam question is worded differently.</p>

<p>GKD mixes teacher corrections into the student’s own outputs during training. The student generates a response, the teacher evaluates it, and the student learns from the gap. This closes the distribution mismatch and produces a more robust model — still 2-4 billion parameters, still running on the target device, but better at handling the messy inputs it will see in the real world.</p>

<h2 id="the-pattern-behind-the-pattern">The Pattern Behind the Pattern</h2>

<p>Here’s something we didn’t expect: the checkpoint-and-score loop from Stage 3 applies to problems that have nothing to do with model training.</p>

<p>The core structure is: <strong>define a measurable outcome → build a fixed benchmark → iterate in short cycles → score against the benchmark → checkpoint on improvement → stop at plateau.</strong> Andrej Karpathy designed this for training neural networks. But the requirements are simpler than they appear:</p>

<ol>
  <li><strong>Can you score the output without a human reviewing it?</strong> (A parser, a lookup table, a test suite, a timer)</li>
  <li><strong>Do you have a fixed benchmark you can commit to never contaminating?</strong> (50-200 known-good examples)</li>
  <li><strong>Can each iteration complete in under 5 minutes?</strong> (Otherwise you get too few data points overnight)</li>
  <li><strong>Can you save and restore state cleanly?</strong> (Git commit, file copy, model checkpoint)</li>
</ol>

<p>If all four are true, you can apply this pattern — whether you’re training a model, optimising prompts, tuning API performance, or searching configurations.</p>

<p><strong>Prompt optimisation</strong> is a particularly accessible example. You have a fixed prompt that works “okay.” You have 200 gold examples with expected outputs. Each iteration: adjust the prompt wording, run it against all 200 examples via the API, score the outputs, keep the better prompt. No GPU required. Cost is API calls. Same evidence trail — a CSV showing prompt version, timestamp, composite score. Same audit story for a client.</p>

<p><strong>API performance tuning</strong>: same loop. Fixed benchmark of representative API calls. Each iteration tries a different indexing strategy, query rewrite, or cache policy. Score = p95 latency × correctness. Checkpoint = the configuration that produced the best score.</p>

<p>The point isn’t the technique — it’s the evidence trail. In any of these applications, you end up with a CSV that shows measurable improvement over time. When someone asks “how do you know this is better?”, you open the spreadsheet.</p>

<p>We’ve started treating the checkpoint above as a standard project evaluation: can we define a scoring function? Can we build a gold set? If yes, we apply the loop. If no, we don’t pretend we can — we use human review, A/B testing, or structured evaluation instead. Knowing <em>when not to use it</em> is as important as having the tool.</p>

<h2 id="what-the-numbers-look-like">What the Numbers Look Like</h2>

<p>From our <a href="/2026/03/18/running-llms-on-apple-silicon-mlx-lm-benchmarks/">benchmarking work</a> with base models on Apple Silicon:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>What It Means</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>96-196 tokens/sec</strong></td>
      <td>Faster than you can read the response</td>
    </tr>
    <tr>
      <td><strong>1.2-2.5 GB memory</strong></td>
      <td>iPhone (1.2GB) or laptop (2.5GB) — fits alongside the app</td>
    </tr>
    <tr>
      <td><strong>&lt;100ms first token</strong></td>
      <td>Feels instant in a user interface</td>
    </tr>
    <tr>
      <td><strong>£0.00 per inference</strong></td>
      <td>No API bill. No token counting. No cost anxiety</td>
    </tr>
  </tbody>
</table>

<p>These are base model numbers. A fine-tuned model will be slightly different, but in the same ballpark — the LoRA adapter adds knowledge, not computational overhead.</p>

<h2 id="the-business-case-plainly">The Business Case, Plainly</h2>

<p>Cloud LLM APIs are extraordinary tools. We use them daily. But they create a dependency: on network availability, on third-party pricing, on data processing agreements, on compliance reviews that take longer than the project they’re gate-keeping.</p>

<p>A fine-tuned SLM running on-device removes that dependency for the specific tasks it’s trained for. It’s not better than Claude at general reasoning. It’s not trying to be. It’s better at <em>one thing</em>, and it does that one thing locally, privately, and at zero marginal cost.</p>

<p>The model factory approach means we can produce these specialists for different domains without rebuilding the training infrastructure each time. A recruitment SLM. A compliance SLM. A customer service SLM. Same pipeline, different training data, different model files.</p>

<h2 id="what-weve-learned-so-far">What We’ve Learned So Far</h2>

<p>We’re still in the early stages of this build. Some things we’ve confirmed:</p>

<p><strong>Teacher quality beats teacher quantity.</strong> 500 carefully crafted examples from Claude produce a better student than 5,000 low-effort ones. Garbage in, garbage out applies to synthetic data too.</p>

<p><strong>The output format is a training target, not a post-processing step.</strong> If you need JSON, train the model to produce JSON. Don’t train it to produce text and then try to parse the text into JSON. Including feed-forward layers in the LoRA target makes a measurable difference here.</p>

<p><strong>Apple Silicon is a real training platform</strong> at the 2-4B parameter scale. An M-series Mac with 36GB of unified memory handles LoRA fine-tuning for both model tiers comfortably. You don’t need a cloud GPU for models this size.</p>

<p><strong>The compliance conversation changes completely</strong> when you can say “the data never leaves the device.” Six-week risk assessments become same-week approvals. The model file ships like any other application asset — through your existing deployment pipeline, your existing change management, your existing security controls.</p>

<h2 id="when-to-use-which">When to Use Which</h2>

<p>Not every problem needs an on-device model. Not every problem can be solved by a cloud API. Here’s how we think about it:</p>

<table>
  <thead>
    <tr>
      <th>Your Situation</th>
      <th>Recommendation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Data is public or low-sensitivity</td>
      <td>Cloud API. Easier, more capable, maintained for you</td>
    </tr>
    <tr>
      <td>Data is sensitive but tasks are varied</td>
      <td>Cloud API with strong DPA, or anonymise first</td>
    </tr>
    <tr>
      <td>Data is sensitive AND tasks are bounded</td>
      <td>Fine-tuned SLM. This is the sweet spot</td>
    </tr>
    <tr>
      <td>Tasks require broad world knowledge</td>
      <td>Cloud API. SLMs don’t know enough</td>
    </tr>
    <tr>
      <td>You need zero-latency responses</td>
      <td>On-device SLM. Nothing beats local inference</td>
    </tr>
    <tr>
      <td>Budget scales with usage</td>
      <td>SLM. Train once, infer forever</td>
    </tr>
  </tbody>
</table>

<p>The recruiter from the opening of this post? Sensitive data, bounded domain, defined output format, latency matters, privacy is non-negotiable. That’s the sweet spot.</p>

<p>The model factory is how we get there.</p>

<hr />

<p><em>Sailesh Panchal is Director at Digital Transformation Advisory (DTA), specialising in technology strategy and AI-native architecture for enterprise clients.</em></p>]]></content><author><name>Sailesh Panchal</name></author><category term="AI" /><category term="Engineering" /><category term="slm" /><category term="fine-tuning" /><category term="on-device-ai" /><category term="privacy" /><category term="enterprise-ai" /><category term="mlx" /><category term="apple-silicon" /><category term="lora" /><summary type="html"><![CDATA[Cloud AI APIs are powerful, but some data can't leave the building. We're building a pipeline that trains small, domain-specific language models to run entirely on-device — no API calls, no data exfiltration, no per-token costs. Here's why, and how.]]></summary></entry><entry><title type="html">Fifty PowerPoints and a Rebrand: Why We Didn’t Train a Model</title><link href="https://digital-transformation-advisory.com/2026/03/20/fifty-powerpoints-and-a-rebrand-why-we-didnt-train-a-model/" rel="alternate" type="text/html" title="Fifty PowerPoints and a Rebrand: Why We Didn’t Train a Model" /><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://digital-transformation-advisory.com/2026/03/20/fifty-powerpoints-and-a-rebrand-why-we-didnt-train-a-model</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/03/20/fifty-powerpoints-and-a-rebrand-why-we-didnt-train-a-model/"><![CDATA[<p>The brief was simple enough. A client had been through a rebrand — new name, new visual identity, new tone of voice. The old brand lived on in 50 PowerPoint decks: board packs, strategy documents, client proposals, quarterly reviews. Every one needed converting.</p>

<p>A designer quoted 2-4 hours per deck. At the midpoint, that’s 150 hours of someone carefully changing Georgia to Calibri, swapping navy for teal, and rewriting “We are pleased to present our findings” as “Here’s what we found.” Important work. Also, the kind of work that makes a talented designer question their career choices by deck number twelve.</p>

<p>We were asked: can AI do this?</p>

<h2 id="the-training-reflex">The Training Reflex</h2>

<p>Our first thought — and I suspect yours too — was to train a model. Feed it examples of old-brand and new-brand decks, let it learn the transformation, and apply it at scale. We’re building <a href="/2026/03/20/when-your-data-cant-leave-the-building/">fine-tuned small language models</a> for other projects. We have the pipeline. The hammer was in our hand, and this looked like a nail.</p>

<p>Then we opened a deck and started listing what actually changes in a rebrand.</p>

<table>
  <thead>
    <tr>
      <th>What Changes</th>
      <th>Example</th>
      <th>How Many Variants?</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Font families</td>
      <td>Georgia → Calibri</td>
      <td>4-6 unique mappings</td>
    </tr>
    <tr>
      <td>Font sizes</td>
      <td>32pt heading → 36pt</td>
      <td>Tied to the font mappings</td>
    </tr>
    <tr>
      <td>Colour palette</td>
      <td>#003366 → #007C7A</td>
      <td>6-8 hex values</td>
    </tr>
    <tr>
      <td>Logo</td>
      <td>Old logo.png → New logo.png</td>
      <td>1 swap</td>
    </tr>
    <tr>
      <td>Footer text</td>
      <td>“Old Corp. All rights reserved.” → “New Brand. Confidential.”</td>
      <td>1 find-replace</td>
    </tr>
    <tr>
      <td><strong>Tone of voice</strong></td>
      <td><strong>Formal prose → Punchy, conversational</strong></td>
      <td><strong>Unbounded</strong></td>
    </tr>
  </tbody>
</table>

<p>Five of those six are lookup tables. Fixed inputs, fixed outputs, zero ambiguity. Georgia is <em>always</em> Calibri. #003366 is <em>always</em> #007C7A. The footer string is literally the same on every slide of every deck.</p>

<p>Training a neural network to learn a lookup table is like hiring a sommelier to check if milk has expired. Technically possible. Wildly inefficient. Harder to debug when it gets the answer wrong.</p>

<p>The sixth item — tone — is different. Rewriting “Our methodology is evidence-based, outcome-driven, and designed for sustainable change” as “Evidence-based. Outcome-driven. Built to last” requires understanding language, context, and intent. That’s what language models are good at.</p>

<p>So we drew a line.</p>

<h2 id="the-line-95-pipeline-5-intelligence">The Line: 95% Pipeline, 5% Intelligence</h2>

<p>We split the problem in two:</p>

<p><strong>Deterministic engine</strong> (python-pptx): Handles every visual transformation — fonts, colours, logos, footers, borders, shape fills. Runs in under 2 seconds per deck. Produces identical results every time. Easy to audit, easy to fix, easy to explain to a client.</p>

<p><strong>AI tone adjustment</strong> (Claude, in-context): Handles the language rewriting. Reads the text from each converted slide, applies the brand’s tone rules, and rewrites while preserving all factual content. Uses the designer’s own rewrites as few-shot examples — no training data required.</p>

<p>The beauty of this split is that each half plays to its strengths. The pipeline is fast and exact where you need exactness (your brand colour had better be #007C7A, not #007C79). Claude is flexible and contextual where you need intelligence (knowing that “we are pleased to present” and “we would like to share” are the same pattern, even though the words are different).</p>

<figure>
  <img src="/assets/diagrams/rendered/rebrand-pipeline.svg" alt="Rebrand pipeline: designer creates exemplar pair, rules extracted to JSON, 95% deterministic transforms via python-pptx, 5% tone adjustment via Claude" style="width: 100%; max-width: 850px;" />
  <figcaption>The full pipeline. One designer pair produces the rules. python-pptx handles the 95% that's mechanical. Claude handles the 5% that requires understanding. <em>Rendered with <a href="https://d2lang.com">D2</a>.</em></figcaption>
</figure>

<h2 id="how-the-pipeline-works">How the Pipeline Works</h2>

<h3 id="step-1-the-designer-creates-one-pair">Step 1: The Designer Creates One Pair</h3>

<p>This is the clever part, and it’s the designer’s one contribution to the entire process.</p>

<p>They take a single real deck and recreate it in the new brand. Same slides, same text, same structure — different visual treatment. Where they also change the <em>wording</em> (not just the formatting), that signals a tone shift.</p>

<p>We end up with two files: <code class="language-plaintext highlighter-rouge">source-exemplar.pptx</code> and <code class="language-plaintext highlighter-rouge">target-exemplar.pptx</code>.</p>

<h3 id="step-2-the-script-diffs-them">Step 2: The Script Diffs Them</h3>

<p>A Python script walks both files slide-by-slide, shape-by-shape, text-run-by-text-run. For each matching text string, it compares the formatting and builds a mapping:</p>

<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"fonts"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"Georgia|32.0|bold"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"family"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Calibri"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"size_pt"</span><span class="p">:</span><span class="w"> </span><span class="mf">36.0</span><span class="p">,</span><span class="w">
      </span><span class="nl">"bold"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
    </span><span class="p">},</span><span class="w">
    </span><span class="nl">"Cambria|14.0|normal"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
      </span><span class="nl">"family"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Calibri"</span><span class="p">,</span><span class="w">
      </span><span class="nl">"size_pt"</span><span class="p">:</span><span class="w"> </span><span class="mf">13.0</span><span class="p">,</span><span class="w">
      </span><span class="nl">"italic"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
    </span><span class="p">}</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"colours"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"#003366"</span><span class="p">:</span><span class="w"> </span><span class="s2">"#007C7A"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"#CC9900"</span><span class="p">:</span><span class="w"> </span><span class="s2">"#2ECC71"</span><span class="p">,</span><span class="w">
    </span><span class="nl">"#F5F0E8"</span><span class="p">:</span><span class="w"> </span><span class="s2">"#FAFAFA"</span><span class="w">
  </span><span class="p">},</span><span class="w">
  </span><span class="nl">"footer"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"find"</span><span class="p">:</span><span class="w"> </span><span class="s2">"© 2024 Meridian Consulting Group. All rights reserved."</span><span class="p">,</span><span class="w">
    </span><span class="nl">"replace"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Apex Partners Ltd. Private &amp; Confidential."</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>

<p>The script also captures every text change — places where the designer rewrote the words. These become the tone examples:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Before: "Our team brings deep expertise in cloud migration,
         platform modernisation, and AI-native architecture."
After:  "Our people bring hands-on expertise in cloud,
         platforms, and AI — not just slide decks."
</code></pre></div></div>

<p>From these examples, we derive tone rules: formal → conversational, passive → active, long sentences → punchy fragments, corporate jargon → plain English.</p>

<h3 id="step-3-apply-to-50-decks">Step 3: Apply to 50 Decks</h3>

<p>The conversion engine opens each deck and applies the mapping mechanically. Every text run gets its font checked and swapped. Every colour value gets looked up and replaced. Every footer gets rewritten. Every logo gets swapped.</p>

<p>On our test deck — a 6-slide cloud migration assessment — the pipeline made 105 font changes, 52 colour changes, and 6 footer updates in under 2 seconds.</p>

<h3 id="step-4-claude-adjusts-the-tone">Step 4: Claude Adjusts the Tone</h3>

<p>After the visual conversion, Claude reads each slide’s text and applies the tone rules. The prompt pattern is straightforward:</p>

<p><em>“Given these tone rules and examples, rewrite this text to match the target brand voice. Preserve all facts, technical terms, and proper nouns. Only change the language style.”</em></p>

<p>Claude sees the designer’s own before/after examples, so it’s learning the brand voice from the person who defined it — not from a training set we curated.</p>

<h2 id="what-this-looks-like-in-practice">What This Looks Like in Practice</h2>

<p>Here’s a slide from the test deck, before and after:</p>

<p><strong>Before (old brand):</strong></p>
<blockquote>
  <p>The organisation currently operates 147 on-premises applications across three data centres. Our assessment identifies 82 applications suitable for cloud migration within the next 18 months. We recommend a phased approach beginning with non-critical workloads to establish patterns and confidence.</p>
</blockquote>

<p><strong>After (new brand, visual conversion + tone adjustment):</strong></p>
<blockquote>
  <p>147 apps running on-prem across three data centres. We’ve identified 82 that are ready for cloud migration in the next 18 months. Start with non-critical workloads — build the pattern, build confidence, then scale.</p>
</blockquote>

<p>Same facts. Same numbers. Same recommendation. Different voice. The fonts, colours, and footer changed too — but those are invisible in a text excerpt.</p>

<h2 id="the-decision-that-mattered">The Decision That Mattered</h2>

<p>The most important moment in this project wasn’t writing the code. It was the conversation where we decided <em>not</em> to train a model.</p>

<p>We’ve seen too many organisations reach for AI when a well-structured pipeline would be faster, cheaper, and easier to maintain. The reverse is equally true — we’ve seen teams build increasingly brittle rule systems when a model would handle the variation naturally. The art is drawing the line in the right place.</p>

<p>Here’s the framework we used:</p>

<table>
  <thead>
    <tr>
      <th>Signal</th>
      <th>Reach for a Model</th>
      <th>Reach for a Pipeline</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Input variation</td>
      <td>High — natural language, many phrasings</td>
      <td>Low — structured, enumerable</td>
    </tr>
    <tr>
      <td>Rules expressible?</td>
      <td>No — too many edge cases</td>
      <td>Yes — fits in a JSON file</td>
    </tr>
    <tr>
      <td>Output must be exact?</td>
      <td>Approximate is fine</td>
      <td>Must be pixel-perfect</td>
    </tr>
    <tr>
      <td>Error consequences</td>
      <td>Graceful degradation</td>
      <td>Hard failure</td>
    </tr>
    <tr>
      <td>Debugging</td>
      <td>“Why did the model say this?”</td>
      <td>“This key maps to this value”</td>
    </tr>
  </tbody>
</table>

<p>The PowerPoint rebrand sits firmly on the pipeline side for 95% of the work. The tone adjustment is the 5% where a model earns its keep — not by learning from training data, but by understanding language in context.</p>

<figure>
  <img src="/assets/diagrams/rendered/rebrand-decision.svg" alt="Decision framework: enumerable transformations use a pipeline, language understanding uses AI, both together use the split approach" style="width: 100%; max-width: 600px;" />
  <figcaption>The decision framework. Can you enumerate it? Pipeline. Does it need language understanding? AI. Both? Split the problem. <em>Rendered with <a href="https://d2lang.com">D2</a>.</em></figcaption>
</figure>

<h2 id="what-wed-tell-the-sommelier">What We’d Tell the Sommelier</h2>

<p>If you’re staring at a problem and wondering whether it’s an AI problem:</p>

<p><strong>Start by listing what changes.</strong> If you can enumerate every transformation in a spreadsheet, you don’t need a model. You need a script.</p>

<p><strong>Find the boundary.</strong> There’s usually a seam between the mechanical and the intelligent. Make the mechanical part deterministic, and only invite the AI to the part that actually requires understanding.</p>

<p><strong>Use the designer’s work as your spec.</strong> They’ve already done the hard thinking about what the transformation should look like. Your job is to scale it, not to reinvent it.</p>

<p>The 50 decks are converting. The designer is back to doing design work. And we didn’t train a model.</p>

<hr />

<p><em>Sailesh Panchal is Director at Digital Transformation Advisory (DTA), specialising in technology strategy and AI-native architecture for enterprise clients.</em></p>]]></content><author><name>Sailesh Panchal</name></author><category term="AI" /><category term="Strategy" /><category term="digital-transformation" /><category term="brand-conversion" /><category term="powerpoint" /><category term="python-pptx" /><category term="enterprise-ai" /><category term="automation" /><summary type="html"><![CDATA[A client needed 50 PowerPoint decks converted to a new brand identity. Our first instinct was to train a model. We were wrong. Here's the pipeline we built instead, and the decision framework that stopped us from over-engineering.]]></summary></entry><entry><title type="html">Running LLMs on Apple Silicon: MLX-LM Benchmarks for Qwen 3.5 and Llama 3.2</title><link href="https://digital-transformation-advisory.com/2026/03/18/running-llms-on-apple-silicon-mlx-lm-benchmarks/" rel="alternate" type="text/html" title="Running LLMs on Apple Silicon: MLX-LM Benchmarks for Qwen 3.5 and Llama 3.2" /><published>2026-03-18T00:00:00+00:00</published><updated>2026-03-18T00:00:00+00:00</updated><id>https://digital-transformation-advisory.com/2026/03/18/running-llms-on-apple-silicon-mlx-lm-benchmarks</id><content type="html" xml:base="https://digital-transformation-advisory.com/2026/03/18/running-llms-on-apple-silicon-mlx-lm-benchmarks/"><![CDATA[<p>Apple Silicon changed the game for on-device machine learning. With unified memory and the Metal GPU sitting on the same die as the CPU, your MacBook can run billion-parameter language models without a discrete GPU. The missing piece was software. Apple’s <a href="https://github.com/ml-explore/mlx">MLX framework</a> and its companion library <code class="language-plaintext highlighter-rouge">mlx-lm</code> fill that gap.</p>

<p>This post documents a hands-on benchmarking session comparing three small language models locally on an Apple Silicon Mac, with real numbers, real output, and the pitfalls we hit along the way.</p>

<h2 id="the-setup">The Setup</h2>

<p><strong>Hardware:</strong> Apple Silicon Mac (M-series, unified memory)</p>

<p><strong>Software stack:</strong></p>
<ul>
  <li>Python 3.12 via <a href="https://mise.jdx.dev/">mise</a></li>
  <li><code class="language-plaintext highlighter-rouge">mlx-lm</code> 0.31.1</li>
  <li><code class="language-plaintext highlighter-rouge">mlx</code> 0.31.1 + <code class="language-plaintext highlighter-rouge">mlx-metal</code> 0.31.1</li>
</ul>

<p><strong>Models tested:</strong></p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">mlx-community/Qwen3.5-2B-4bit</code> (Alibaba, 2 billion parameters, 4-bit quantized)</li>
  <li><code class="language-plaintext highlighter-rouge">mlx-community/Qwen3.5-4B-4bit</code> (Alibaba, 4 billion parameters, 4-bit quantized)</li>
  <li><code class="language-plaintext highlighter-rouge">mlx-community/Llama-3.2-3B-Instruct-4bit</code> (Meta, 3 billion parameters, 4-bit quantized)</li>
</ul>

<p>All models are pre-quantized MLX format from the <code class="language-plaintext highlighter-rouge">mlx-community</code> collection on Hugging Face. No conversion step needed.</p>

<p><strong>Installation</strong> is one command:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>mlx-lm
</code></pre></div></div>

<p>If you use mise for version management:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mise <span class="nb">install </span>python@3.12
mise use <span class="nt">--global</span> python@3.12
pip <span class="nb">install </span>mlx-lm
</code></pre></div></div>

<h2 id="the-benchmark">The Benchmark</h2>

<p>We used a creative writing prompt to test both generation speed and output quality:</p>

<blockquote>
  <p>Describe a medieval tavern at night. Include sensory details about the atmosphere, the patrons, and the food.</p>
</blockquote>

<p>Each model was run with 256-300 max tokens. We tested twice: once with default sampling (greedy), and once with the official recommended sampling parameters.</p>

<h3 id="raw-speed-numbers">Raw Speed Numbers</h3>

<p>All measurements taken on cached model runs (no download overhead):</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Qwen3.5-2B</th>
      <th>Llama-3.2-3B</th>
      <th>Qwen3.5-4B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Prompt eval</strong></td>
      <td>705 tok/s</td>
      <td>920 tok/s</td>
      <td>390 tok/s</td>
    </tr>
    <tr>
      <td><strong>Generation</strong></td>
      <td><strong>196 tok/s</strong></td>
      <td>127 tok/s</td>
      <td>96 tok/s</td>
    </tr>
    <tr>
      <td><strong>Peak memory</strong></td>
      <td><strong>1.1 GB</strong></td>
      <td>2.0 GB</td>
      <td>2.5 GB</td>
    </tr>
  </tbody>
</table>

<p>The 2B model is roughly 2x faster than the 4B at generation, and uses less than half the memory. Llama 3.2 3B lands in the middle on both metrics.</p>

<p>For context, 96 tokens per second is still faster than you can read. All three models feel instant in interactive use.</p>

<h2 id="the-thinking-mode-trap">The Thinking Mode Trap</h2>

<p>Here is where our initial benchmarks went wrong, and it contains an important lesson for anyone using Qwen3.5 models.</p>

<p>When we first ran the 4B model with a raw prompt via <code class="language-plaintext highlighter-rouge">mlx_lm.generate</code>, it produced this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Here's a thinking process that leads to the description:
1. Analyze the Request:
   * Topic: A medieval tavern at night.
   * Key Elements: Sensory details...
2. Brainstorming &amp; Imagery:
   * Setting: Stone walls, wooden beams...
</code></pre></div></div>

<p>Instead of writing prose, it dumped its internal reasoning chain. The 2B model, tested the same way, produced beautiful prose. Our initial conclusion was that the 2B was better. <strong>That was wrong.</strong></p>

<h3 id="what-happened">What happened</h3>

<p>Qwen3.5 models have a <strong>thinking mode</strong> enabled by default. When thinking mode is active, the model emits a <code class="language-plaintext highlighter-rouge">&lt;think&gt;...&lt;/think&gt;</code> block containing its reasoning before the actual response. The 4B model faithfully entered thinking mode. The 2B model happened not to, likely because the raw prompt format didn’t trigger it as strongly.</p>

<p>The fix: use the tokenizer’s chat template with thinking explicitly disabled.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">mlx_lm</span>
<span class="kn">from</span> <span class="n">mlx_lm.sample_utils</span> <span class="kn">import</span> <span class="n">make_sampler</span>

<span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">mlx_lm</span><span class="p">.</span><span class="nf">load</span><span class="p">(</span><span class="sh">"</span><span class="s">mlx-community/Qwen3.5-4B-4bit</span><span class="sh">"</span><span class="p">)</span>

<span class="n">messages</span> <span class="o">=</span> <span class="p">[{</span><span class="sh">"</span><span class="s">role</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">user</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">content</span><span class="sh">"</span><span class="p">:</span> <span class="sh">"</span><span class="s">Your prompt here</span><span class="sh">"</span><span class="p">}]</span>
<span class="n">prompt</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="nf">apply_chat_template</span><span class="p">(</span>
    <span class="n">messages</span><span class="p">,</span>
    <span class="n">add_generation_prompt</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">tokenize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">enable_thinking</span><span class="o">=</span><span class="bp">False</span>  <span class="c1"># This is the key parameter
</span><span class="p">)</span>

<span class="n">sampler</span> <span class="o">=</span> <span class="nf">make_sampler</span><span class="p">(</span><span class="n">temp</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">top_p</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">min_p</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>

<span class="n">response</span> <span class="o">=</span> <span class="n">mlx_lm</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span>
    <span class="n">model</span><span class="p">,</span> <span class="n">tokenizer</span><span class="p">,</span>
    <span class="n">prompt</span><span class="o">=</span><span class="n">prompt</span><span class="p">,</span>
    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span>
    <span class="n">sampler</span><span class="o">=</span><span class="n">sampler</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Lesson:</strong> Always use <code class="language-plaintext highlighter-rouge">apply_chat_template()</code> with these models. Raw prompt strings bypass the model’s expected input format and produce unpredictable behavior. This applies to all instruction-tuned models, not just Qwen.</p>

<h3 id="the-deeper-gotcha-template-divergence-between-model-sizes">The Deeper Gotcha: Template Divergence Between Model Sizes</h3>

<p>There’s a subtler issue we discovered later while building <a href="https://github.com/saileshpanchal/ForgeML">ForgeML</a> training pipelines. The Qwen3.5 2B and 4B variants ship with <strong>different default chat templates</strong> — and the difference is invisible during training.</p>

<ul>
  <li><strong>Qwen3.5-2B</strong> includes a pre-closed <code class="language-plaintext highlighter-rouge">&lt;think&gt;&lt;/think&gt;</code> block in its chat template. No reasoning by default.</li>
  <li><strong>Qwen3.5-4B</strong> opens <code class="language-plaintext highlighter-rouge">&lt;think&gt;</code> and expects the model to fill it with reasoning content before responding.</li>
</ul>

<p>During training, both templates produce the same format for complete conversations, so you won’t notice anything. The divergence only appears at inference time when <code class="language-plaintext highlighter-rouge">add_generation_prompt=True</code> appends different suffixes depending on the model size. The 2B appends a clean assistant turn. The 4B appends an open thinking block that the model is expected to complete.</p>

<p><strong>This means the same inference code produces different behavior when you swap model sizes.</strong> If you’re deploying Qwen3.5 models for structured output (JSON, function calling, classification), you must explicitly set <code class="language-plaintext highlighter-rouge">enable_thinking=False</code> regardless of model size. This is not prominently documented in the model card.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Always be explicit about thinking mode — don't rely on defaults
</span><span class="n">prompt</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="nf">apply_chat_template</span><span class="p">(</span>
    <span class="n">messages</span><span class="p">,</span>
    <span class="n">add_generation_prompt</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="n">tokenize</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">enable_thinking</span><span class="o">=</span><span class="bp">False</span>  <span class="c1"># Required for consistent behavior across model sizes
</span><span class="p">)</span>
</code></pre></div></div>

<p>If you’re building a pipeline that supports multiple Qwen3.5 variants (for example, using the 2B for fast inference and the 4B for quality-critical tasks), test with both models at inference time. Training-time validation alone won’t catch this.</p>

<figure>
  <img src="/assets/diagrams/rendered/mlx-thinking-mode.svg" alt="Sequence diagram showing raw prompt vs chat template: raw prompt triggers thinking mode dump, chat template produces clean prose" style="width: 100%; max-width: 700px;" />
  <figcaption>The difference between a raw prompt string and a properly formatted chat template. The tokenizer knows what the model expects. <em>Rendered with <a href="https://plantuml.com">PlantUML</a>.</em></figcaption>
</figure>

<h2 id="official-sampling-parameters">Official Sampling Parameters</h2>

<p>Qwen3.5 publishes recommended sampling parameters for different modes. For non-thinking (instruct) mode:</p>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>temp</th>
      <th>top_p</th>
      <th>top_k</th>
      <th>min_p</th>
      <th>presence_penalty</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>General tasks</strong></td>
      <td>0.7</td>
      <td>0.8</td>
      <td>20</td>
      <td>0.0</td>
      <td>1.5</td>
    </tr>
    <tr>
      <td><strong>Reasoning tasks</strong></td>
      <td>1.0</td>
      <td>0.95</td>
      <td>20</td>
      <td>0.0</td>
      <td>1.5</td>
    </tr>
  </tbody>
</table>

<p>For thinking mode (if you want the chain-of-thought reasoning):</p>

<table>
  <thead>
    <tr>
      <th>Use Case</th>
      <th>temp</th>
      <th>top_p</th>
      <th>top_k</th>
      <th>min_p</th>
      <th>presence_penalty</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>General tasks</strong></td>
      <td>1.0</td>
      <td>0.95</td>
      <td>20</td>
      <td>0.0</td>
      <td>1.5</td>
    </tr>
    <tr>
      <td><strong>Precise coding</strong></td>
      <td>0.6</td>
      <td>0.95</td>
      <td>20</td>
      <td>0.0</td>
      <td>0.0</td>
    </tr>
  </tbody>
</table>

<p>In <code class="language-plaintext highlighter-rouge">mlx-lm</code> 0.31.1, you apply these through the <code class="language-plaintext highlighter-rouge">make_sampler</code> function:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">mlx_lm.sample_utils</span> <span class="kn">import</span> <span class="n">make_sampler</span>

<span class="c1"># Non-thinking, general tasks
</span><span class="n">sampler</span> <span class="o">=</span> <span class="nf">make_sampler</span><span class="p">(</span><span class="n">temp</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">top_p</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">min_p</span><span class="o">=</span><span class="mf">0.0</span><span class="p">)</span>
</code></pre></div></div>

<p>Note: <code class="language-plaintext highlighter-rouge">mlx-lm</code>’s <code class="language-plaintext highlighter-rouge">make_sampler</code> does not expose <code class="language-plaintext highlighter-rouge">presence_penalty</code> directly as of v0.31.1. The <code class="language-plaintext highlighter-rouge">repetition_penalty</code> parameter in the generate function is the closest equivalent.</p>

<h2 id="quality-comparison-fair-test">Quality Comparison (Fair Test)</h2>

<p>With thinking disabled and official sampling parameters applied to all models:</p>

<h3 id="qwen35-4b">Qwen3.5 4B</h3>

<blockquote>
  <p>The heavy oak doors of <strong>The Gilded Tankard</strong> groan open, admitting a rush of damp, starless night air that carries the scent of rain and wet cobblestones. Inside, the air is thick and warm, a palpable weight held back by the flickering glow of tallow candles… A floor-to-ceiling tapestry depicting knights in armor lines the far wall… In the corner, a lute player strums a mournful tune…</p>
</blockquote>

<p>Structured, literary-quality prose. Named specific patron archetypes: a bearded merchant, roughnecks in leather, an elderly woman, a young scribe. Strong world-building details.</p>

<h3 id="qwen35-2b">Qwen3.5 2B</h3>

<blockquote>
  <p>The air inside <strong>Blackwood’s Oak</strong> did not smell of wine or wood; it smelled of wet wool, damp stone, and the sharp, tangy scent of fresh rye bread… A low, rumbling murmur vibrates through the floorboards, not from the patrons, but from the wood itself…</p>
</blockquote>

<p>Atmospheric and moody with a second-person perspective. Good sensory detail, though it occasionally repeats ideas and shifts tense.</p>

<h3 id="llama-32-3b">Llama 3.2 3B</h3>

<blockquote>
  <p>The medieval tavern was a warm and inviting haven, its wooden beams and stone walls glowing with a soft, golden light… At the bar, a jovial bartender polished a mug with a dirty rag…</p>
</blockquote>

<p>Competent, readable prose. Safe and expected imagery. Gets the job done without surprises.</p>

<h3 id="quality-verdict">Quality Verdict</h3>

<table>
  <thead>
    <tr>
      <th>Dimension</th>
      <th>2B</th>
      <th>Llama 3B</th>
      <th>4B</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Prose coherence</td>
      <td>Good</td>
      <td>Good</td>
      <td>Excellent</td>
    </tr>
    <tr>
      <td>Character diversity</td>
      <td>Adequate</td>
      <td>Adequate</td>
      <td>Rich</td>
    </tr>
    <tr>
      <td>Sensory depth</td>
      <td>Strong</td>
      <td>Adequate</td>
      <td>Richest</td>
    </tr>
    <tr>
      <td>Consistency</td>
      <td>Minor repeats</td>
      <td>Solid</td>
      <td>Excellent</td>
    </tr>
  </tbody>
</table>

<p>The 4B is clearly the better writer when properly configured. The 2B punches above its weight. Llama 3.2 3B is reliable but outclassed by both Qwen models in this creative task.</p>

<figure>
  <img src="/assets/diagrams/rendered/mlx-model-selection.svg" alt="Model selection decision flow: speed picks Qwen 2B, quality picks Qwen 4B, ecosystem picks Llama 3.2 3B" style="width: 100%; max-width: 650px;" />
  <figcaption>Choosing the right model depends on what matters most for your use case. <em>Rendered with <a href="https://d2lang.com">D2</a>.</em></figcaption>
</figure>

<h2 id="practical-recommendations">Practical Recommendations</h2>

<p><strong>Choose the 2B when:</strong></p>
<ul>
  <li>You want the fastest possible generation (196 tok/s)</li>
  <li>Memory is constrained</li>
  <li>The task is straightforward: summaries, simple Q&amp;A, boilerplate generation</li>
  <li>You’re running a local API server and need throughput</li>
</ul>

<p><strong>Choose the 4B when:</strong></p>
<ul>
  <li>Output quality matters more than speed</li>
  <li>Multi-step reasoning, creative writing, or nuanced tasks</li>
  <li>You have 3+ GB of memory to spare (you do on any modern Mac)</li>
  <li>You’re building something user-facing</li>
</ul>

<p><strong>Choose Llama 3.2 3B when:</strong></p>
<ul>
  <li>You need robust instruction-following without template fussing</li>
  <li>You want the largest community ecosystem and fine-tune availability</li>
  <li>The task is instruction-heavy rather than creative</li>
</ul>

<h2 id="running-as-a-local-api-server">Running as a Local API Server</h2>

<p>For development workflows, <code class="language-plaintext highlighter-rouge">mlx-lm</code> can serve an OpenAI-compatible API:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mlx_lm.server <span class="nt">--model</span> mlx-community/Qwen3.5-4B-4bit
</code></pre></div></div>

<p>This gives you a local endpoint at <code class="language-plaintext highlighter-rouge">http://localhost:8080</code> that accepts the same request format as the OpenAI API. You can point any OpenAI-compatible client at it for local inference.</p>

<h2 id="key-takeaways">Key Takeaways</h2>

<ol>
  <li>
    <p><strong>Apple Silicon is a legitimate LLM inference platform.</strong> Sub-100ms time-to-first-token and 96-196 tok/s generation with 1-3 GB of memory is practical for real applications.</p>
  </li>
  <li>
    <p><strong>Model configuration matters more than model size.</strong> A misconfigured 4B model produced worse output than a 2B model. The chat template, thinking mode flag, and sampling parameters are not optional details. Watch for template divergence between model sizes in the same family — Qwen3.5’s 2B and 4B have different thinking-mode defaults that only surface at inference.</p>
  </li>
  <li>
    <p><strong>The MLX ecosystem is production-ready.</strong> Install with pip, download from Hugging Face, generate in three lines of Python. No CUDA, no Docker, no cloud API keys.</p>
  </li>
  <li>
    <p><strong>Qwen3.5 is the new default for small local models.</strong> Both the 2B and 4B outperform Llama 3.2 3B in quality at comparable or better speed. The only advantage Llama retains is its instruction-following robustness with raw prompts.</p>
  </li>
</ol>]]></content><author><name>Sailesh Panchal</name></author><category term="AI" /><category term="Apple" /><category term="mlx" /><category term="apple-silicon" /><category term="llm" /><category term="qwen" /><category term="llama" /><category term="benchmarks" /><category term="on-device-ai" /><summary type="html"><![CDATA[Hands-on benchmarks comparing Qwen3.5 2B, Qwen3.5 4B, and Llama 3.2 3B running locally on Apple Silicon via MLX-LM. Practical findings on speed, memory, output quality, and the sampling parameters that actually matter.]]></summary></entry></feed>