Digital Transformation Advisory

Silent Is Not Stuck: What Two Hangs Taught Me About Observable Pipelines

2026-04-11T00:00:00+01:00

Two days ago, my nightly knowledge pipeline hung silently for six and a half hours on a single Ollama call. Yesterday, I thought I’d found a silent logic bug in the same pipeline’s refine step: all 754 insights had zero embeddings and the manifest was all zeros. I spent an hour instrumenting and diagnosing before I realised the “bug” was the same class of problem as the hang, one layer down.

Both of them came from the same missing thing: a loop that had no way to prove it was still working.

The First Incident: The 6.5-Hour Populate Hang

The pipeline runs at 02:00 every night via launchd. populate is the step where a local LLM reads each new source file and extracts structured insights. It hits Ollama over HTTP, one source at a time, and writes the results to SQLite.

On the night of April 10th, populate started at 02:00 and was still running at 08:30. No output. No errors. No progress. Just a process sitting at 0% CPU, waiting on a socket.

When I finally killed it and looked at ps + lsof, the process was stuck mid-HTTP-call to Ollama, which had apparently died or stalled without closing the connection. The URLSession had no explicit resource timeout set (the default for timeoutIntervalForResource is seven days), so the call would effectively wait forever. Six and a half hours of night-cron time, burned on nothing.

The fix was three lines of code plus a structural habit:

Hard wall-clock timeout on every external call. A dedicated URLSession with timeoutIntervalForResource = 300s. Bound the worst case.
One retry on transient failures. The second attempt often succeeds because the upstream has recovered.
Per-item heartbeat to stderr. [populate] 47/162 source-name.md every single iteration. Now if the pipeline is stuck, I can tell in two seconds.
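A minimal Swift sketch of those three habits, under stated assumptions: the function names (makeBoundedSession, withOneRetry, heartbeat) are mine for illustration and are not the pipeline's actual code, and the retry here fires on any thrown error rather than only on transient ones.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLSession lives in this module on Linux
#endif

/// A dedicated session whose whole-transfer timeout is capped at 300 seconds.
func makeBoundedSession() -> URLSession {
    let config = URLSessionConfiguration.ephemeral
    config.timeoutIntervalForResource = 300  // hard wall-clock ceiling per call
    return URLSession(configuration: config)
}

/// Runs `operation`, retrying exactly once if the first attempt throws.
func withOneRetry<T>(_ operation: () throws -> T) throws -> T {
    do {
        return try operation()
    } catch {
        return try operation()  // a second failure propagates to the caller
    }
}

/// Formats a per-item heartbeat line, writes it to stderr, and returns it.
@discardableResult
func heartbeat(phase: String, index: Int, total: Int, name: String) -> String {
    let line = "[\(phase)] \(index)/\(total) \(name)"
    FileHandle.standardError.write(Data((line + "\n").utf8))
    return line
}
```

Calling heartbeat(phase: "populate", index: 47, total: 162, name: "source-name.md") once per iteration produces the line format shown above. A real version would probably inspect the error before retrying, so that a permanent failure is not attempted twice.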

That became Milestone A.5: “observability and resume, before anything new ships.” I sat on building any new pipeline features until the pipeline itself could tell me what it was doing.

The Second Incident: The Fake Embed Bug

Yesterday I ran the pipeline again, this time with decision-capture features layered on top. Refine completed. I checked the manifest: zero embeddings, zero enrichments, zero tensions detected. I queried the live database directly — 754 insights, every single one with NULL embeddings.

My first reading: refine has a logic bug. It’s silently no-oping on the embed pass. Maybe supportsEmbedding is returning false. Maybe the batch loop is skipping rows. Maybe the Ollama embed endpoint is broken.

I started instrumenting. I added stderr logging around the embed pass. I checked that nomic-embed-text was loaded in Ollama. I ran a manual curl against /api/embed. Everything worked. The model was there. The endpoint responded. So why was refine producing nothing?

The answer, once I stopped staring at embed and actually read the refine flow end-to-end: refine wasn’t broken. It had never finished. The enrich pass runs before embed, processes each sparse insight sequentially, and takes about four seconds per call against gemma4. With 754 sparse insights, that’s about fifty minutes of work before embed even starts. Fifty minutes during which refine produces exactly zero bytes of output. Indistinguishable from a hang. On the previous night’s run, the populate hang had blocked refine from ever starting on the current dataset. On tonight’s, the enrich pass had been honestly chewing through the work and I’d killed it before it got to embed.

The Same Root Cause, Two Layers Apart

This is where I realised the two incidents were actually the same incident. A.5 made the pipeline observable — you can see which phase is running, how far it’s got, whether the lockfile is held, when the last heartbeat fired. But the individual passes inside refine were still blind boxes. Each pass logged “starting” and then went quiet for anywhere from 30 seconds to an hour. If any one of them got slow, or stuck, or hit one bad row, you couldn’t tell which without killing the process and looking at the database state after the fact.

So the fix mirrored A.5, one layer down:

Start and end markers on every refine pass. [refine:enrich] start — 754 sparse insights at entry, [refine:enrich] done (0 failed) at exit. No ambiguity about which pass is running.

Per-item progress every 5 items. Slow enough to not drown the log, fast enough that you can watch progress in real time and calculate ETA.

Try/catch around every external call, per row. If one insight’s LLM call fails, log it, increment a counter, continue. One bad row can never kill the whole pass again.
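The three changes fit in one small wrapper. What follows is a sketch rather than the real refine code: runPass, its parameters, and the exact log wording are assumptions, and the closure stands in for whatever external call a pass makes per row.

```swift
import Foundation

/// Runs one pass over `items`, tolerating per-item failures.
/// Returns the number of items whose `work` call threw.
func runPass<Item>(
    name: String,
    items: [Item],
    progressEvery: Int = 5,
    emit: (String) -> Void = { FileHandle.standardError.write(Data(($0 + "\n").utf8)) },
    work: (Item) throws -> Void
) -> Int {
    emit("[refine:\(name)] start: \(items.count) items")
    var failed = 0
    for (offset, item) in items.enumerated() {
        do {
            try work(item)
        } catch {
            failed += 1  // one bad row is counted, logged, and skipped
            emit("[refine:\(name)] item \(offset + 1) failed: \(error)")
        }
        let done = offset + 1
        if done % progressEvery == 0 || done == items.count {
            emit("[refine:\(name)] \(done)/\(items.count)")
        }
    }
    emit("[refine:\(name)] done (\(failed) failed)")
    return failed
}
```

The emit parameter defaults to stderr but can be swapped for a collector, which makes the heartbeat itself testable.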

The last point turned out to matter more than the logging. Before the fix, enrich was already ~fifty minutes of sequential external work. If any one of those 754 calls had thrown (Ollama indigestion, a badly-formed title, a transient network blip), the entire pass would have died and we’d have wasted the preceding work. With per-row tolerance, one flaky response just bumps a skip counter and the other 753 calls complete.

The Rule I Wrote Down

After the second fix, I wrote a new line in my permanent project memory:

External calls in long-running passes need a stderr heartbeat every N items and a try/catch around every single call. Silent “stuck” is almost never actually stuck. It’s missing instrumentation.

That’s the entire rule. It sounds obvious written down. It wasn’t obvious while I was staring at a query that returned zero embeddings, convinced there was a logic bug.

There are two reasons engineers skip this, and I skipped it for both:

You write the code while the dataset is small. My unit tests had 3 insights, not 754. At 3 insights, enrich finishes in 12 seconds, so “silence for a minute” never happens. The pathology doesn’t show up until production load.
You think logging is a polish task. Heartbeat lines look like chrome you can add later. They’re not. They are the only way a human can tell a working loop from a dead one, and if you don’t have them, every unexpected behaviour looks like a bug.

What I’d Tell Someone Building A Similar System

If you’re building a pipeline that calls external services in a loop — an LLM, an HTTP API, a database-on-another-box, anything where latency is not under your control — build the observability first. Not first in priority, first in literal code-writing order. Before the happy path works, make sure the unhappy path is legible:

Heartbeat every N items, where N is small enough that a human watching the log can see it move
Hard wall-clock timeout on every external call, not just the session default
Per-item try/catch so one bad row can never kill a long-running batch
A progress sidecar written to disk so a killed process leaves behind evidence of how far it got
Postmortem dump on SIGTERM so the nightly cron’s failures are diagnosable in the morning
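Of the five, the progress sidecar is the one people tend not to have seen, so here is a minimal sketch of it. The ProgressSidecar shape and the function names are hypothetical; the only real technique is writing the snapshot atomically so a killed process never leaves a half-written file behind.

```swift
import Foundation

/// Snapshot of how far a pass has got (hypothetical shape for illustration).
struct ProgressSidecar: Codable, Equatable {
    var phase: String
    var completed: Int
    var total: Int
    var lastItem: String
}

/// Atomically overwrites the sidecar so a reader never sees a partial snapshot.
func writeSidecar(_ progress: ProgressSidecar, to url: URL) throws {
    let data = try JSONEncoder().encode(progress)
    try data.write(to: url, options: .atomic)
}

/// Reads the last snapshot back, for example the morning after a killed run.
func readSidecar(from url: URL) throws -> ProgressSidecar {
    try JSONDecoder().decode(ProgressSidecar.self, from: Data(contentsOf: url))
}
```

Writing the sidecar on the same cadence as the heartbeat keeps the two in step: the log shows what is happening now, and the file shows where the run was when it stopped.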

None of these are hard to build. All of them are hard to remember to build before you need them. The rule I’m trying to hold myself to: any loop over external work gets instrumented the same day it’s written. Not the day it breaks in production.

The knowledge pipeline is running as I write this. The enrich pass is at 405 of 754, ten minutes from embed, and I can see every step of it in my terminal. Which is the entire point.

Turning Your Knowledge Base Into a Graph You Can Argue With

2026-04-08T00:00:00+01:00

You’ve been consulting for five years. Your Obsidian vault has 847 insights. You can search by keyword. You can search by tag. But you cannot ask your own knowledge base “what contradicts this claim?” and you cannot ask “what’s the gap in my thinking here?”

That’s the problem that led to kuzuctl.

The Problem: Linear Search, Exponential Complexity

When you have fewer than 100 notes, text search is fine. You remember that you wrote something about “confidence scoring” six months ago. You search, find it, move on.

At 847 notes, things break. Here’s what happens:

You write a new insight: “Confidence should be semantic, not just statistical.”
You search for “confidence” and find 34 previous notes.
You manually check each one for contradictions, connections, implications.
You spend two hours in busywork, or you don’t, and miss critical connections.
You build the same idea twice, not realizing it’s already in the vault.

This doesn’t scale with your organisation. If you have a knowledge base serving 10 people, or 100 people, the problem gets worse. Everyone searches independently. No one knows what’s contradictory.

The usual solution is a proper database. You migrate everything out of markdown into a relational schema. You hire someone to maintain it. Now markdown is a view, and the database is the source of truth.

But that inverts the problem: if the database goes down or gets corrupted, you’ve lost your real knowledge. And your markdown is now stale.

The Decision: Markdown Primary, Graph Derived

We chose a different pattern: markdown stays primary. The graph is derived.

If you delete the Kuzu database tomorrow, kuzuctl sync rebuilds it from your markdown files in 30 seconds. The markdown is the source of truth. The graph is an acceleration layer—like a database index, not like a ledger.

This decision cascaded through everything:

Node identity is anchored in frontmatter IDs, not file paths.
The schema is optimized for synthesis, not normalization.
Commands can assume the graph is stale (it rebuilds every night).
The CLI is the consumer, not the source.

This pattern appeared again when we had to solve the Kuzu reopen bug. More on that in a moment.

The Architecture: Three Swift Targets, One Protocol

The codebase has three Swift packages:

Package	Purpose	Reusable?
`kuzuctl`	Obsidian vault CLI: sync, lint, ingest, search, challenge, suggest	No—vault-specific logic
`KuzuGraphKit`	Graph store abstraction (CRUD, embedding, conflict detection)	Yes—used by BlogCreator, SFiA-AI
`LLMKit`	Multi-provider LLM wrapper (Ollama, Claude, Mock)	Yes—any Swift project

The critical move was GraphStoreProtocol. This abstract interface lets you swap storage backends without changing any command code:

protocol GraphStoreProtocol: Actor {
    func createNode(id: String, label: String, properties: [String: Any]) async throws
    func linkNodes(source: String, target: String, relation: String) async throws
    func searchByEmbedding(query: [Float], limit: Int) async throws -> [SearchResult]
    func detectTensions(nodeID: String) async throws -> [Tension]
}

We built this assuming Kuzu. Then kuzu-swift 0.11.3 had a reopen bug: the library could not safely reconnect after the process closed the database. This is critical for a CLI that might be invoked 100 times in a day (each time: open, query, close, exit).

We had two options:

Wait for Kuzu to fix the bug.
Implement SQLite and swap it in.

Because we had the protocol, option 2 took one week. We did not rewrite any commands. The CLI still worked, still output the same JSON, still passed the same tests.

This was a deliberate CEO-review decision: build for optionality, not for the prettiest tech. Kuzu is the long-term backend (better analytics, better scalability). SQLite is the reliable pivot. Both implement the same interface.

The Schema: Confidence + Contradictions

Here’s the core insight graph schema:

NODES:
  id (String, primary)
  label (String)
  confidence_label (String: "certainty" | "assumption" | "emerging" | "contested")
  confidence_score (Float: 0.0-1.0)
  description (String)
  embedding (FLOAT[768])
  frontmatter_id (String)
  source_file (String)

EDGES:
  source_id (String)
  target_id (String)
  relation (String)
  type ("LINKS_TO" | "TENSIONS_WITH")
  context (String)

Three design choices stand out:

Dual confidence. Humans read confidence_label: “Is this certain, or emerging?” Machines read confidence_score: “On a scale of 0–1, how much evidence supports this?” The label is for reasoning. The score is for filtering.

TENSIONS_WITH as first-class edges. Most graph systems treat contradictions as an absence (this edge doesn’t exist because the two claims conflict). We model them explicitly. If note A says “cloud adoption reduces capex by 60%” and note B says “cloud adoption increases operational complexity by 40%”, we create a TENSIONS_WITH edge between them. This lets us ask “what contradicts this?” and surfaces unresolved debates.

Node identity by frontmatter ID, not path. When you rename a file, the graph doesn’t break. The frontmatter ID stays stable:

---
id: cost-control-through-observability
confidence: "certainty"
tags: [cost, devops]
---
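To make the idea concrete, here is a rough sketch of pulling that ID out of a note. It is not kuzuctl's parser: frontmatterID is a hypothetical helper that handles only a simple single-line id: field, where a real implementation would use a proper YAML parser.

```swift
import Foundation

/// Returns the `id:` value from a leading `---` frontmatter block, or nil if absent.
func frontmatterID(in markdown: String) -> String? {
    let lines = markdown.components(separatedBy: "\n")
    guard lines.first?.trimmingCharacters(in: .whitespacesAndNewlines) == "---" else {
        return nil  // no frontmatter at the top of the file
    }
    for line in lines.dropFirst() {
        let trimmed = line.trimmingCharacters(in: .whitespacesAndNewlines)
        if trimmed == "---" { break }  // end of the frontmatter block
        if trimmed.hasPrefix("id:") {
            let value = trimmed.dropFirst(3).trimmingCharacters(in: .whitespacesAndNewlines)
            // Strip optional surrounding quotes.
            return value.trimmingCharacters(in: CharacterSet(charactersIn: "\"'"))
        }
    }
    return nil
}
```

Because the lookup never touches the file path, renaming or moving the note leaves its node identity unchanged.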

The Overnight Refine Loop: Six Passes, Zero Cost

Every night at 3 AM, kuzuctl refine runs six semantic passes. This is where the graph stops being a static view and starts actively learning about itself.

Pass 1: Embed. Every node without a vector embedding gets one via nomic-embed-text (local Ollama, runs free). 768-dimensional vectors. Takes about 2 seconds per node on modern hardware.

Pass 2: Deduplicate. Compute cosine similarity between all pairs. If similarity > 0.95 and both nodes have high confidence, merge them. This catches “I wrote the same insight three times in different words.”

Pass 3: Enrich. For nodes with missing descriptions, ask Claude (via LLMKit) to generate one from their wiki-links and neighbors. This fills gaps in sparse nodes.

Pass 4: Detect contradictions. For every pair of nodes with LINKS_TO edges, check: do their descriptions or embedding space conflict? If so, create a TENSIONS_WITH edge with a reason.

Pass 5: Validate. Sample 10% of nodes and ask Claude to verify that the confidence_score is correct based on the node’s description and linked evidence. Adjust scores if needed.

Pass 6: Prune. Remove nodes with confidence_score < 0.3 that have no incoming edges. Don’t delete them—move them to a “low_confidence” table so you can audit them later.
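Passes 2 and 6 are pure arithmetic over data already in the graph, so they are easy to sketch. The version below is illustrative only: GraphNode, mergeCandidates and pruneCandidates are hypothetical names, and the 0.8 cut-off for "high confidence" is my assumption, since only the 0.95 similarity and 0.3 prune thresholds are stated above.

```swift
/// Minimal node shape for illustration; only the fields these two passes need.
struct GraphNode: Equatable {
    let id: String
    let confidenceScore: Double
    let embedding: [Float]
}

/// Cosine similarity of two equal-length vectors; 0 for mismatched or zero vectors.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    guard normA > 0, normB > 0 else { return 0 }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// Pass 2: ID pairs that are near-identical and both confident enough to merge.
func mergeCandidates(_ nodes: [GraphNode], similarity: Float = 0.95,
                     minConfidence: Double = 0.8) -> [(String, String)] {
    var pairs: [(String, String)] = []
    for i in nodes.indices {
        for j in nodes.indices where j > i {
            let bothConfident = nodes[i].confidenceScore >= minConfidence
                && nodes[j].confidenceScore >= minConfidence
            if bothConfident
                && cosineSimilarity(nodes[i].embedding, nodes[j].embedding) > similarity {
                pairs.append((nodes[i].id, nodes[j].id))
            }
        }
    }
    return pairs
}

/// Pass 6: low-confidence nodes with no incoming edges, to be archived, not deleted.
func pruneCandidates(_ nodes: [GraphNode], nodesWithIncomingEdges: Set<String>,
                     threshold: Double = 0.3) -> [GraphNode] {
    nodes.filter { $0.confidenceScore < threshold && !nodesWithIncomingEdges.contains($0.id) }
}
```

The all-pairs loop is quadratic, which is fine at a few hundred nodes; a much larger vault would probably want an approximate nearest-neighbour index instead.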

Every pass logs to a RefineManifest:

{
  "run_date": "2026-04-08T03:00:00Z",
  "passes": [
    {
      "pass": "embed",
      "nodes_processed": 847,
      "nodes_created": 12,
      "duration_seconds": 45
    },
    {
      "pass": "deduplicate",
      "nodes_processed": 847,
      "merges": 3,
      "duration_seconds": 8
    },
    {
      "pass": "enrich",
      "nodes_processed": 847,
      "nodes_updated": 24,
      "tokens_used": 12400,
      "duration_seconds": 28
    },
    {
      "pass": "detect_contradictions",
      "nodes_processed": 847,
      "tensions_created": 7,
      "duration_seconds": 62
    },
    {
      "pass": "validate",
      "sample_size": 85,
      "score_adjustments": 12,
      "tokens_used": 8900,
      "duration_seconds": 35
    },
    {
      "pass": "prune",
      "low_confidence_archived": 2,
      "duration_seconds": 3
    }
  ],
  "total_duration_seconds": 181,
  "total_tokens_used": 21300
}

This audit trail answers “what changed last night, and why?” If you wake up and a confidence score shifted, you can see exactly which pass did it and what reasoning was applied.

The Challenge Command: Red-Team Your Own Graph

The most useful command is kuzuctl challenge:

kuzuctl challenge "Cloud adoption always reduces capex"

Output:

{
  "claim": "Cloud adoption always reduces capex",
  "verdict": "contested",
  "evidence": {
    "supported": [
      {
        "node_id": "cost-control-observability",
        "confidence": "certainty",
        "reasoning": "Operational savings in datacenter overhead"
      }
    ],
    "contested": [
      {
        "node_id": "cloud-operational-complexity",
        "confidence": "emerging",
        "tension": "Cloud adoption increases operational complexity (TENSIONS_WITH)"
      }
    ],
    "insufficient_evidence": []
  },
  "cypher_used": "MATCH (n {label: $claim})-[r]-(m) WHERE r.type IN ['SUPPORTS', 'TENSIONS_WITH'] RETURN n, r, m",
  "reasoning": "Your claim has direct support for cost reduction, but we found a contested edge claiming increased operational complexity. Neither claim is fully resolved—both confidence scores are below 0.8."
}

Notice the cypher_used field. This makes the reasoning reproducible. You can run that same query, inspect the results, and decide whether the algorithm was right. This transparency is why challenge is useful as a red-team tool, not just as an answer engine.

The Suggest Command: Surfacing Synthesis Gaps

kuzuctl suggest finds patterns that should exist but don’t:

Orphans: Nodes with no incoming or outgoing edges (dead weight)
Open triangles: A→B, A→C, but no B→C (synthesis gap)
Unresolved tensions: TENSIONS_WITH edges with low confidence on both sides (debate, not decision)

kuzuctl suggest --sphere "cost-control"

Output tells you: “You have 8 insights about cost controls, but nodes ‘cost-observability’ and ‘cost-ai-automation’ are not connected. Are they related, contradictory, or independent?”
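Orphans and open triangles are both simple set operations over the edge list. The sketch below shows the idea, not kuzuctl's implementation: Edge, openTriangles and orphans are hypothetical names, and I am assuming a link between B and C in either direction counts as closing the triangle.

```swift
/// A directed edge between two node IDs.
struct Edge: Hashable {
    let source: String
    let target: String
}

/// Finds open triangles: A→B and A→C exist, but B and C are not linked either way.
func openTriangles(in edges: [Edge]) -> [(hub: String, left: String, right: String)] {
    let edgeSet = Set(edges)
    var neighbours: [String: Set<String>] = [:]
    for edge in edges {
        neighbours[edge.source, default: []].insert(edge.target)
    }
    var result: [(hub: String, left: String, right: String)] = []
    for hub in neighbours.keys.sorted() {
        let targets = neighbours[hub, default: []].sorted()
        for (i, left) in targets.enumerated() {
            for right in targets[(i + 1)...] {
                let linked = edgeSet.contains(Edge(source: left, target: right))
                    || edgeSet.contains(Edge(source: right, target: left))
                if !linked { result.append((hub: hub, left: left, right: right)) }
            }
        }
    }
    return result
}

/// Nodes that appear in no edge at all.
func orphans(nodes: [String], edges: [Edge]) -> [String] {
    let connected = Set(edges.flatMap { [$0.source, $0.target] })
    return nodes.filter { !connected.contains($0) }
}
```

Each open triangle becomes a question for the author rather than an automatic edge, which keeps the graph derived from the vault instead of inventing claims.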

This is the inverse of search. Search answers “does this exist?” Suggest answers “what’s broken in your thinking?”

BlogCreator: The Same Graph, Extended

This year we’re building BlogCreator—a tool to turn raw voice recordings into polished blog posts, with full lineage tracking.

We have 5,287 voice recordings from the past five years. We didn’t want to throw away the audio, so we built the transcription system into the same Kuzu graph.

The schema extends kuzuctl’s:

NODES:
  AudioFile → TranscriptVersion → Concept → BlogPost → Chapter → PublishedContent

EDGES:
  TRANSCRIBED_FROM (AudioFile → TranscriptVersion, with accuracy score)
  EXTRACTED_FROM (Concept → TranscriptVersion, with confidence)
  INCLUDED_IN (Concept → BlogPost)
  REFINED_BY (BlogPost → BlogPost, tracking iteration)
  IMMUTABLE_LINEAGE (PublishedContent → AudioFile, tracing back to source)

BlogCreator uses KuzuGraphKit—the same reusable library. Different schema, same protocol. This validates the architecture decision: separating vault-specific logic (kuzuctl) from graph-specific logic (KuzuGraphKit) let us reuse the whole graph layer for a completely different problem.

When to Build a Graph (and When Not To)

Not every knowledge base needs this. Here’s the decision table:

Condition	Recommendation
< 200 notes, < 5 years old	Use Obsidian search. Stay flat.
200–1000 notes, individual use	Build a derived graph. Use SQLite.
1000–10k notes, team use, synthesis-critical	Build a derived graph. Use Kuzu. Add challenge/suggest commands.
Raw data → structured output (transcription, contracts, research)	Build a lineage graph. Same protocol, different schema.
Graph is the product (recommendations, discovery, analytics)	Build graph-as-primary. Invest in the database. Accept the maintenance cost.

The key: Is the graph helping you think, or is it becoming the thing you think about?

If it’s helping you think—synthesis, contradiction detection, gap finding—it should be derived and lightweight.

If it’s the thing you think about—someone built it, someone maintains it, it has its own schema version—it should be primary and mature.

We chose derived because the vault is the thought. The graph is the reasoning tool.

Takeaway: Invert the Authority

Most graph projects start with “we need a database, let’s extract data into it.” This makes the database the source of truth and guarantees eventual consistency pain.

Invert it: keep your source of truth (markdown, voice recordings, whatever) and build a derived graph that can be thrown away and rebuilt. This trades some query latency for enormous operational simplicity.

You get to ask your knowledge base hard questions. You get reproducible reasoning (the query that built the answer, exposed as cypher_used). You get an audit trail (RefineManifest). And you never wake up wondering what corrupted your real data.

That’s the kuzuctl pattern. Build it if you have 847 insights and your thinking is your product.

Sailesh Panchal is a CTO and founder of Digital Transformation Advisory. He consults with UK banking and fintech on AI strategy, platform architecture, and the boring operational decisions that make transformation stick.

From 5,000 Voice Memos to a Book: The Pipeline That Runs While You Sleep

2026-04-08T00:00:00+01:00

The Trap

Five years of banking consulting leaves you with something precious and useless: 5,287 voice memos scattered across four different apps.

Pauses while walking to Pret. Thirty-second thoughts about SEPA harmonisation captured in an Uber. A four-minute tangent on PSD2 compliance recorded while waiting for a call. All of it—126 GB of audio—sitting in iCloud, slowly drowning in your decision paralysis.

The knowledge is there. Real patterns about digital transformation in UK banking. Causal loops connecting regulatory change to architecture decisions. Mistakes I’ve made at two banks and a fintech. But the knowledge is trapped in audio files I will never listen to again. Who has the time?

So I built a pipeline.

The ambition was simple: turn those 5,000-odd recordings of raw thinking into a book. Not a collection of blog posts. Not a listicle farm. A proper book (nine chapters, 40,000 words, structured narrative arc) on transforming a UK bank from the CTO’s perspective. Call it The Transformation Paradox.

But you cannot write a book by listening to 5,000 recordings. You need automation. You need stages. You need quality gates that separate the gold from the noise.

What follows is how I built it.

The Architecture: Four Stages, Three Model Tiers, One Immutable Graph

The pipeline runs nightly at 2am. It pulls raw audio from four vaults, transcribes at 140x real-time, scores against 100 financial concepts, generates two formats of content, enhances with a four-agent system, and logs every decision in a Kuzu knowledge graph for audit and cross-referencing.

No prompt tells it “write me a book.” The pipeline has stages.

Stage 0: Transcribe (Audio → Transcript)

The first bottleneck is speech-to-text. I record on my phone (mostly Apple Voice Memos), back up to Dropbox, and keep originals in Google Drive as a belt-and-braces hedge. Four vaults: Voice Memos (2,847 files), Dropbox recordings (1,420), Google Drive (723), and Telegram voice notes (297).

Vault	File Count	Format	Access
Apple Voice Memos	2,847	.m4a	iCloud sync
Dropbox backups	1,420	.mp3	Local mount
Google Drive archive	723	.mp3	API via gcloud
Telegram voice notes	297	.ogg	Export script

For STT, I tested two local models—parakeet-mlx and mlx-whisper—and found I needed both.

parakeet-mlx runs at 140x real-time on Apple Silicon. On a 2-minute voice memo, it produces output in under 1 second. Word error rate sits at 6.3%. For raw speed (processing 400 recordings in a batch), it’s unbeatable. But it misses domain specifics. It hears “SEPA” as “sepia” and “CHAPS” as “chaps” (the riding wear, not the clearing house).

mlx-whisper is slower (3-4x real-time) but domain-aware. I seed its prompt with ~40 financial terms: CHAPS, BACS, SEPA, ISO 20022, FMV, PSD2, FIDO2, BaaS, BNPL, passporting, SCA. The model uses the prompt as a lexical hint. The recognition rate for those 40 terms jumps from 40% to 94%.

So the pipeline does this: parakeet first for speed, flag any memo over 2 minutes as “high-stakes financial” (if it contains keywords like “compliance” or “architecture”), and re-transcribe those with mlx-whisper HQ.

A compiled regex corrects the remaining 6% of finance-specific mistakes—it catches patterns like “sepia clearing” and replaces them with “SEPA clearing” using context.
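The routing rule and the correction pass can be sketched in a few lines of Swift. This is an illustration, not the pipeline's code (the STT tools themselves are separate programs): needsHighQualityPass, financeCorrections and correctFinanceTerms are hypothetical names, the two patterns are examples I made up, and "using context" is approximated by requiring a payments word right after the misheard term.

```swift
import Foundation

/// Decides whether a first-pass transcript should be redone with the slower model.
/// The 120-second threshold and keywords come from the text; the exact rule is assumed.
func needsHighQualityPass(durationSeconds: Double, transcript: String,
                          keywords: [String] = ["compliance", "architecture"]) -> Bool {
    guard durationSeconds > 120 else { return false }
    let lowered = transcript.lowercased()
    return keywords.contains { lowered.contains($0) }
}

/// Hypothetical correction table: context-anchored pattern and its replacement template.
let financeCorrections: [(pattern: String, replacement: String)] = [
    (pattern: #"\bsepia (clearing|payments?|transfers?)\b"#, replacement: "SEPA $1"),
    (pattern: #"\bchaps (clearing|payments?|transfers?)\b"#, replacement: "CHAPS $1"),
]

/// Applies each correction case-insensitively; text without the context word is untouched.
func correctFinanceTerms(_ text: String) -> String {
    var output = text
    for rule in financeCorrections {
        output = output.replacingOccurrences(
            of: rule.pattern, with: rule.replacement,
            options: [.regularExpression, .caseInsensitive])
    }
    return output
}
```

Anchoring on the following word is what stops the rule from rewriting a genuine "sepia photograph", at the price of missing misheard terms that appear in unexpected phrases.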

Result: 400 clean transcripts per night. Total cost: zero. (Both models run locally; no API fees.)

Stage 1: Analyse (Transcript → Confidence Score)

The second stage is scoring. Not “is this good?” but “is this articulate enough to generate content from?”

I built a taxonomy of 5 themes and 100 financial concepts:

Theme 1: Regulatory Compliance (PSD2, GDPR, FIDO2, SCA, Strong Customer Auth)
Theme 2: Payments Modernisation (ISO 20022, SEPA, CBDC, Real-Time Payments)
Theme 3: Enterprise Architecture (Systems Thinking, Domain-Driven Design, Event Sourcing)
Theme 4: Talent & Culture (Team scaling, psychological safety, growth mindset)
Theme 5: AI Integration (LLM ops, vector DBs, prompt engineering)

Qwen 3.5-4B (thinking mode disabled for clean JSON) scores each transcript on all five themes. Output:

{
  "memo_id": "20260407_0247_psd2_discussion",
  "duration_seconds": 312,
  "themes": [
    {
      "name": "Regulatory Compliance",
      "relevance_score": 0.68,
      "confidence": 0.92,
      "concepts_detected": ["PSD2", "SCA", "passporting", "regulatory_arbitrage"]
    },
    {
      "name": "Payments Modernisation",
      "relevance_score": 0.44,
      "confidence": 0.87,
      "concepts_detected": ["ISO_20022", "Real-Time_Payments"]
    }
  ],
  "overall_quality_score": 0.58,
  "recommendation": "GOLD",
  "rationale": "Clear narrative arc, specific examples, actionable guidance."
}

The four confidence bands, applied to overall_quality_score scaled to 0-100:

Band	Score	Action
Gold	≥55	Auto-generate blog post + LinkedIn format
Silver	40–54	Generate, lower priority, queue for review
Bronze	37–39	Queue for Whisper HQ retranscription
Dud	<37	Skip; log for manual review later

Of 5,287 memos, 1,247 came back Gold. 2,104 Silver. 894 Bronze (queued for retranscription). 1,042 Dud (mostly background noise, false starts, or phone-call fragments).
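The banding itself is a threshold lookup. A minimal sketch follows, assuming the 0-1 overall_quality_score is multiplied by 100 and rounded to the nearest whole point before comparison; the rounding rule is my assumption, and Band and band(forQualityScore:) are illustrative names.

```swift
enum Band: String {
    case gold = "GOLD", silver = "SILVER", bronze = "BRONZE", dud = "DUD"
}

/// Maps a 0-1 quality score onto the 0-100 band thresholds from the table.
func band(forQualityScore score: Double) -> Band {
    let points = Int((score * 100).rounded())  // rounding rule is an assumption
    switch points {
    case 55...: return .gold
    case 40...54: return .silver
    case 37...39: return .bronze
    default: return .dud
    }
}
```

With this mapping, the example memo above (overall_quality_score 0.58) lands in Gold, matching its recommendation field.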

Gold memos are the ones where I was actually thinking—not just ruminating.

Stage 2: Generate (Transcript → Two Formats)

Gold memos fork into two tracks:

Track A: Consultancy Article (1,500–3,000 words). SEO-optimized, thought leadership tone. Structured: Problem statement, why it matters, decision tree, implementation pattern, common pitfalls, call to action. This goes to the blog and gets social distribution.

Track B: LinkedIn Post (300–600 words). Snappier. “Here’s the insight; here’s why you should care; here’s the next step.” Thread-friendly. Lower friction. Different audience (practitioners vs. architects).

Same transcript. Two prompts. Two voices. The pipeline generates both in parallel. (Qwen 9B does this overnight; we’re not paying for latency.)

Stage 3: Enhance (Content → Four-Agent Polish)

This is where the magic happens. After generation, I don’t ship immediately. I run a four-agent enhancement pipeline. Each agent has a specific job:

Systems Thinking Agent (89% effectiveness): Reads the draft and identifies causal loops. If I wrote “Teams moved faster after we restructured,” the agent asks: “But did velocity improve because of the structure, or because the reorganisation coincided with hiring senior engineers?” It surfaces confounds. It ties insights to feedback loops. It turns observations into models.
Growth Mindset Agent (92%): Reframes challenges as capability development. If the draft says “We struggled with microservices,” the agent rewrites it: “We discovered microservices require different operational muscle—here’s how we built it.” Ownership over victimhood. Agency over passivity.
Reader Engagement Agent (87%): Injects Socratic questions and “Explore Further” links. It pulls from the Kuzu graph: if I mention ISO 20022, the agent fetches all related concepts (SEPA, Real-Time Payments, CBDC) and suggests cross-links. It turns monologue into dialogue.
Tone Calibration Agent (85%): Quality gate. Checks: Is this too jargon-heavy for practitioners? Too simplistic for architects? Is the voice consistent with my other posts? Does it land for a UK banking CTO? Flags anything that feels off.

Each agent is built with a five-layer prompt architecture:

Identity: “You are a systems thinking expert, trained on complex adaptive systems.”
Expertise: “Your specialty is identifying feedback loops in socio-technical change.”
Context: [Kuzu neighbourhood context: all related concepts, prior posts on this theme, decision history]
Standards: “Your output must be concrete (not hand-wavy), humble (not prescriptive), and tied to evidence.”
Output format: “JSON with suggestions array, each item has location (which paragraph), original_text, proposed_revision, rationale.”

The orchestrator fetches each agent’s output, merges non-conflicting suggestions, and flags conflicts for manual review.
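One plausible way to implement that merge step is sketched below. I am assuming a "conflict" means two or more different agents proposing a revision for the same paragraph; Suggestion and mergeSuggestions are hypothetical names, with fields loosely following the output format listed above.

```swift
/// One proposed edit from an agent.
struct Suggestion: Equatable {
    let agent: String
    let location: Int            // paragraph index
    let proposedRevision: String
}

/// Accepts suggestions for paragraphs touched by a single agent; paragraphs
/// contested by two or more agents are returned for manual review.
func mergeSuggestions(_ all: [Suggestion])
    -> (accepted: [Suggestion], conflicts: [Int: [Suggestion]]) {
    let byLocation = Dictionary(grouping: all, by: { $0.location })
    var accepted: [Suggestion] = []
    var conflicts: [Int: [Suggestion]] = [:]
    for location in byLocation.keys.sorted() {
        let group = byLocation[location] ?? []
        if Set(group.map { $0.agent }).count > 1 {
            conflicts[location] = group
        } else {
            accepted.append(contentsOf: group)
        }
    }
    return (accepted, conflicts)
}
```

Keeping the conflict set keyed by paragraph means the manual review shows every competing revision side by side, rather than whichever agent ran last.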

Cost per article: ~$0.30 (Claude API for agent coordination only; base generation is free on Qwen 9B).

Stage 4: Lineage (Everything → Immutable Kuzu Graph)

Here’s the bit that matters for regulators, auditors, and your own sanity: every artifact is traceable.

Audio file (20260407_0247.m4a)
  ↓
TranscriptVersion (v1: parakeet, v2: whisper HQ)
  ↓
ConceptExtraction (PSD2, SCA, Real-Time Payments)
  ↓
ConfidenceScore (0.58 → GOLD)
  ↓
BlogPost (1,847 words, published 2026-04-08)
  ↓
LinkedInPost (412 words, published 2026-04-08)
  ↓
BookChapter ("Regulatory Modernisation", position 3)

Kuzu nodes never overwrite. New versions create new nodes. If I re-transcribe a memo with Whisper HQ, a TranscriptVersion node links the old (parakeet) and new (whisper) outputs. The graph shows the evolution. An auditor can ask: “Show me every version of the PSD2 content” and trace the lineage.
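The never-overwrite rule amounts to an append-only log with a version counter. The sketch below models it in memory to show the shape of the idea; it is not the Kuzu schema, and TranscriptVersion's fields and the VersionLog type are assumptions for illustration.

```swift
/// One immutable transcript version for a memo.
struct TranscriptVersion: Equatable {
    let memoID: String
    let version: Int
    let engine: String
    let text: String
}

/// Append-only store: existing versions are never mutated or removed.
struct VersionLog {
    private(set) var versions: [TranscriptVersion] = []

    /// Records a new version, numbered after the memo's existing history.
    @discardableResult
    mutating func record(memoID: String, engine: String, text: String) -> TranscriptVersion {
        let next = history(for: memoID).count + 1
        let entry = TranscriptVersion(memoID: memoID, version: next, engine: engine, text: text)
        versions.append(entry)
        return entry
    }

    /// Every version for one memo, oldest first.
    func history(for memoID: String) -> [TranscriptVersion] {
        versions.filter { $0.memoID == memoID }
    }
}
```

Because record only ever appends, asking for a memo's history returns the parakeet output and the Whisper HQ output together, which is the evolution the auditor wants to see.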

This matters. If a regulator asks, “How did you arrive at this conclusion about SCA?” you can pull the graph: here’s the memo, the timestamp, the transcription method, the quality score, the agents that touched it, the publication date.

No hand-waving. No “I think I wrote about that somewhere.”

The Model Tiers: Trade Latency for Quality

I use three Qwen models on Apple Silicon via MLX. No cloud API. No cost for overnight bulk work.

Model	Size	VRAM	Latency	Use Case	Cost
Qwen 3.5-4B	3 GB	1.2 GB	0.8s per 1K tokens	Daily scoring, quick analysis	Free
Qwen 9B	6 GB	2.4 GB	1.8s per 1K tokens	Blog generation, formatting	Free
Qwen 27B	30 GB	8 GB	5.2s per 1K tokens	Diagram specs, complex reasoning	Free

4B runs every memo nightly (scoring). 9B generates content (Track A and Track B). 27B handles overnight “think deeply” work—when I want Qwen to reason through architecture trade-offs, it gets the 27B model and 30 seconds per response.

The key insight: free latency is valuable. If a task takes 30 seconds but costs $0 (because it’s midnight), run it. If it takes 5 seconds and costs $0.10, use Claude (10x faster, acceptable cost for spot-check validation).

My cost model for overnight processing: $0. For daytime validation: ~$50/month Claude API budget.

The LoRA: Voice Adaptation at Scale

At 50 gold posts, I’ll train a LoRA (Low-Rank Adaptation) adapter on top of Qwen 9B.

Training data: 50 (transcript, Sailesh’s personal rewrite) pairs. The LoRA learns not the content, but the voice. How I restructure a rambling 5-minute thought into crisp argument. My preference for concrete examples over abstractions. My skepticism toward buzzwords.

Base model: Qwen 9B. The LoRA will be ~64 MB. After training, any Qwen 9B inference with the LoRA loaded will sound more like me.

I’m not training a new foundation model. I’m training my voice on top of an existing one.

The Book: Nine Chapters from Chaos

The output is structured as nine chapters, crystallised from Kuzu themes:

The Transformation Paradox (Intro: why banks change, why it fails)
Regulatory Winds (PSD2, GDPR, future regulation)
Payments Plumbing (ISO 20022, SEPA, Real-Time Payments)
Systems Thinking (feedback loops, causal models, complexity)
Architecture Decisions (DDD, event sourcing, monolith vs. microservices)
Building Teams (talent, psychological safety, growth mindset)
AI Integration (LLMs, vector search, responsible deployment)
The Operator’s Mindset (observability, chaos engineering, incident response)
The Systems Thinker’s Manifesto (coda: synthesis, next decade)

Each chapter is built from a cluster of Gold memos. ttb query journey --concept PSD2 --show-evolution traces how my thinking on PSD2 compliance evolved across five years—which memo first articulated it, how the thinking deepened, where contradictions emerged, what I changed my mind on.

The book is not a collection of essays. It’s a narrative with causal coherence, built from the graph.

The Patterns: What CTOs Should Learn

Three principles stand out:

1. Split problems correctly. This pipeline works because each stage has one job. Transcription doesn’t score. Scoring doesn’t generate. Generation doesn’t enhance. Each stage outputs clean JSON, which the next stage consumes. When something breaks, you know where. This is the same principle I wrote about in “Fifty PowerPoints: How to Scale Content Without Burning Out”—splitting the branding pipeline into extract, deterministic transform, and intelligent rewrite.

2. Free latency is a weapon. If it costs $0 to run overnight, run it deeply. If it costs money by the token, get ruthless about scope. The four-agent enhancement costs ~$0.30 per article because Claude sees only the final, filtered JSON from cheap models. It doesn’t re-read the transcript; it doesn’t second-guess the scoring. Money buys precision in specific places, not omniscience.

3. Lineage is not optional. You think you won’t need an audit trail until you do. Then it’s too late. Immutable Kuzu nodes cost nothing. The discipline of logging every version, every decision, every touch—it pays for itself the first time a stakeholder asks “where did you get that number?” and you can say “here, pull the graph.”

The Status

As of April 2026, the pipeline is in production. 1,247 gold memos have generated blog posts. 894 have been retranscribed and are pending generation. The first book outline is crystallising around the nine chapters above.

Total elapsed time to build: 14 months (evenings and weekends). Total cost: ~$600 (mostly Claude API during development; now down to $50/month validation budget). Total time saved: hard to measure, but if I’d manually transcribed even 1% of those memos, I’d have lost three weeks of consulting work just sitting with an audio player.

The real value is this: five years of thinking are no longer lost. They’re queryable. They’re traceable. They’re part of a coherent narrative. And the book will exist.

That’s worth building a pipeline for.

Sailesh Panchal is a CTO advisor and architect specialising in digital transformation at UK banks. He writes about payments modernisation, systems thinking, and the engineering practices that survive contact with regulation.

How We Test Claude Skills: The Eval-and-Tune Loop

2026-03-26T00:00:00+00:00

Writing a Claude Code skill is easy. You write some markdown, drop it in ~/.claude/skills/, and it activates automatically. The hard part is knowing whether it actually makes a difference.

We learned this building the agent-friendly-cli skill — a guide for building CLIs that AI agents can use effectively. The skill covers 16 principles: structured output, stderr separation, exit codes, TTY detection, and so on. But principles on paper mean nothing without evidence that they change outcomes.

So we built a testing loop. Here’s what we learned.

The Process

1. Draft the Skill, Then Write Test Prompts

Start with the skill content, then immediately write 2-3 realistic test prompts. Not “test the skill” prompts — real tasks someone would actually bring to Claude:

“Build a deploy CLI in Python with Click”
“Review this CLI code and tell me what’s not agent-friendly”
“Write a config import command that accepts file or stdin input”

These cover different modes: code generation, code review, and feature implementation. Each exercises the skill differently.

2. Run With-Skill and Without-Skill in Parallel

This is the key insight. Don’t just test the skill — test the delta. Spawn six subagents: three with the skill loaded, three without. Same prompts, same model, different guidance.

The without-skill runs are your baseline. They show what Claude does naturally, without the skill’s patterns. The comparison reveals what the skill actually teaches.

3. Draft Assertions While Runs Execute

Don’t wait for results. While the agents run, write the grading criteria. For the CLI skill, our first assertions were:

Does the code include --output json?
Is there a --dry-run flag?
Does it use flags instead of positional arguments?
Are there distinct exit codes?
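
A minimal sketch of how assertions like these could be written as regex checks over a generated CLI's source. The `ASSERTIONS` table, its patterns, and the function names are illustrative assumptions, not the grading script from the repo. The flags-versus-positionals check is left out because it is hard to express as a single pattern.

```python
import re

# Hypothetical assertion table: name -> regex that must match the generated code.
ASSERTIONS = {
    "output_json_flag": r"--output",
    "dry_run_flag": r"--dry-run",
    "distinct_exit_codes": r"sys\.exit\(\s*[2-9]",
}

def grade(source: str) -> dict[str, bool]:
    """Return pass/fail per assertion for one generated output."""
    return {
        name: re.search(pattern, source) is not None
        for name, pattern in ASSERTIONS.items()
    }

def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of assertions that passed (0.0 to 1.0)."""
    return sum(results.values()) / len(results)
```

Running `grade` over the with-skill and without-skill outputs for the same prompt gives two result dictionaries that can be compared directly.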

These felt reasonable. They were also mostly wrong — not wrong in what they checked, but wrong in what they revealed.

4. The First Round Won’t Discriminate

Here’s what happened when we graded:

Eval	With Skill	Without Skill
Deploy CLI	100%	33%
Code Review	100%	100%
Config Import	80%	40%

The deploy CLI and config import showed clear deltas. But the code review scored 100% for both versions. The skill found 11 issues; the baseline found 6. The skill categorized by severity; the baseline didn’t. The skill caught stderr/stdout separation; the baseline missed it. Yet the assertions said they were equal.

The problem: our assertions tested for the obvious. “Does it mention interactive prompts?” Yes — both versions catch that. “Does it note missing JSON output?” Yes — both versions catch that too. The assertions were too easy.

5. Find What Discriminates

This is where the real work happens. Read both outputs side by side and ask: what does the with-skill version do that the baseline doesn’t? For us, it was:

Severity categorization — the skill version tiered issues as blocking/moderate/low
Stderr/stdout separation — the baseline never mentioned it
Emoji fragility — the skill flagged print('Done!') with emoji as parsing-fragile
Issue depth — the skill found 8+ issues vs the baseline’s 6

We added these as assertions and re-graded:

Eval	With Skill	Without Skill	Delta
Code Review	100%	83%	+17%

Now the eval discriminated. The two failing assertions for the baseline — stderr/stdout separation and emoji fragility — were exactly the patterns the skill teaches that Claude doesn’t know on its own.

6. Iterate Until Stable

The loop is: draft assertions, grade, find non-discriminating assertions, replace them with harder ones, re-grade. Stop when the assertions capture the real delta.
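
The "find non-discriminating assertions" step can be sketched as a comparison of two result dictionaries, one per arm. This assumes each arm's grading produces a mapping from assertion name to pass/fail; the function names are hypothetical.

```python
def pass_percent(results: dict[str, bool]) -> int:
    """Pass rate as a whole percentage."""
    return round(100 * sum(results.values()) / len(results))

def discriminating(with_skill: dict[str, bool], baseline: dict[str, bool]) -> list[str]:
    """Assertions that pass with the skill but fail without it."""
    return [n for n in with_skill if with_skill[n] and not baseline.get(n, False)]

def non_discriminating(with_skill: dict[str, bool], baseline: dict[str, bool]) -> list[str]:
    """Assertions with the same outcome in both arms: candidates to replace."""
    return [n for n in with_skill if with_skill[n] == baseline.get(n)]

def delta_points(with_skill: dict[str, bool], baseline: dict[str, bool]) -> int:
    """Difference in pass rate, in percentage points."""
    return pass_percent(with_skill) - pass_percent(baseline)
```

With twelve assertions where the baseline fails two, `delta_points` gives 100 minus 83, or 17 points, which matches the re-graded code review row.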

For code generation evals (the deploy CLI), the first round already discriminated well (+67%). Code generation is where skills have the most leverage — the model produces fundamentally different code with the right guidance.

For code review evals, it took two rounds. The model is already decent at spotting problems; the skill’s value is in the subtler, deeper patterns.

The Audit That Proved It

The same day we shipped the skill, we ran it against our own project — a transcription-to-blog pipeline with a CLI called ttb. The audit against the agent-friendly checklist was immediate and damning:

No --output json on any command — agents can’t parse the Rich tables
No graph query commands — can’t explore the knowledge graph programmatically
audit-transcript outputs Rich markup, not structured data
enrich outputs JSON (the only one that does)
No --quiet mode
Stats use Rich tables — not parseable
No graph-level query tools for listing concepts, finding themes, or exploring connections

The biggest win wasn’t fixing existing commands — it was realising we needed ttb query subcommands that let agents explore the knowledge graph with --output json. The skill didn’t just review our CLI; it revealed a missing capability.

That’s the difference between a checklist you read once and a skill that’s loaded into context every time you touch CLI code.

What We’d Do Differently

Start with discriminating assertions. Don’t test for the obvious. If baseline Claude already catches interactive prompts and missing JSON output, those assertions won’t tell you if your skill adds value. Test for the patterns that require specific domain knowledge.

Run more than 3 test cases. Three is enough to validate the approach, but the signal gets noisy with small samples. For a production skill, we’d run 8-10 before shipping.

Grade programmatically from the start. We wrote a grading script that checks outputs against regex patterns. It’s stringly-typed and a bit hacky, but it runs in seconds and produces consistent results. Manual review is important for qualitative assessment, but programmatic grading catches regressions.

Get the Skill

The agent-friendly-cli skill is open source:

github.com/saileshpanchal/agent-friendly-cli

The eval workspace with all test cases, grading scripts, and benchmark data is in the repo. Fork it, run the evals against your own CLIs, and see what falls out.

The process itself — draft, test with/without, find discriminating assertions, iterate — works for any Claude Code skill. The specific assertions will be different, but the loop is the same.

Building CLIs for Agents: What the Original Article Missed

2026-03-26T00:00:00+00:00

An article on building CLIs for agents went around recently. It made good points: make things non-interactive, add --dry-run, return data on success. Solid basics.

But it missed the patterns that actually make the difference between a CLI that agents can technically use and one they can use well. We took the original article, added the missing pieces, turned it into a Claude Code skill, and benchmarked the results.

What Was Missing

The original covered six patterns. We added twelve more. Here are the ones that matter most.

Structured Output Is the Foundation

The original article mentioned returning data on success. That’s a subset of the real principle: every command should support --output json. Not just success messages. Every list, every status check, every describe command. And the default should be human-readable tables, not JSON — you’re serving two audiences.

# human-friendly default
$ mycli service list
NAME        STATUS    REPLICAS
web         running   3
api         running   2

# agent-friendly
$ mycli service list --output json
[
  {"name": "web", "status": "running", "replicas": 3},
  {"name": "api", "status": "running", "replicas": 2}
]

This one pattern eliminates the largest class of agent failures: parsing human-formatted text.
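
A minimal sketch of the dual-audience pattern using only the standard library's argparse. The `SERVICES` data and the `render` helper are illustrative assumptions that mirror the example output above.

```python
import argparse
import json

# Sample rows standing in for whatever the real command would fetch.
SERVICES = [
    {"name": "web", "status": "running", "replicas": 3},
    {"name": "api", "status": "running", "replicas": 2},
]

def render(rows: list[dict], output: str) -> str:
    """Render rows as JSON for agents or as an aligned table for humans."""
    if output == "json":
        return json.dumps(rows, indent=2)
    headers = list(rows[0]) if rows else []
    widths = [max(len(h), *(len(str(r[h])) for r in rows)) for h in headers]
    lines = ["  ".join(h.upper().ljust(w) for h, w in zip(headers, widths))]
    lines += ["  ".join(str(r[h]).ljust(w) for h, w in zip(headers, widths)) for r in rows]
    return "\n".join(lines)

def build_parser() -> argparse.ArgumentParser:
    """Table is the default; JSON is opt-in via --output json."""
    parser = argparse.ArgumentParser(prog="mycli")
    parser.add_argument("--output", choices=["table", "json"], default="table")
    return parser
```

A command body would then call `print(render(SERVICES, args.output))` after parsing, so both formats come from the same rows.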

Stderr vs Stdout Separation

This wasn’t mentioned at all, and it’s critical. Data goes to stdout. Diagnostics, progress, and logs go to stderr. Without this, agents can’t pipe commands together — progress messages corrupt the data stream.

import json

import click

def _emit(data: dict, output: str) -> None:
    """Data to stdout."""
    if output == "json":
        click.echo(json.dumps(data, indent=2))

def _log(msg: str) -> None:
    """Diagnostics to stderr."""
    click.echo(msg, err=True)

When --output json is set, be strict: no log lines should leak into stdout.
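
One way to check the strictness rule is to capture both streams and confirm that stdout parses as JSON on its own. This is a standard-library sketch; `list_services` and `capture` are hypothetical names.

```python
import contextlib
import io
import json
import sys

def log(msg: str) -> None:
    """Diagnostics and progress go to stderr."""
    print(msg, file=sys.stderr)

def list_services() -> None:
    """Emit only the JSON payload on stdout; everything else goes to stderr."""
    log("fetching services...")
    print(json.dumps([{"name": "web", "replicas": 3}]))
    log("done")

def capture(fn) -> tuple[str, str]:
    """Run fn and return what it wrote to (stdout, stderr)."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        fn()
    return out.getvalue(), err.getvalue()
```

If `json.loads` on the captured stdout ever fails, a log line has leaked into the data stream.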

Exit Codes That Mean Something

The original said “fail fast with actionable errors.” That’s necessary but not sufficient. Agents need distinct exit codes so they can branch without parsing stderr:

Code	Meaning	Agent Action
0	Success	Continue
1	General error	Read stderr, retry or escalate
2	Usage error	Fix invocation and retry
3	Auth error	Re-authenticate, then retry
4	Not found	Resource doesn’t exist
5	Conflict	Already exists / state conflict

An agent that gets exit code 3 knows to refresh its token. An agent that gets exit code 2 knows to check its flags. An agent that gets exit code 1 has to read and parse the error message. The more specific your codes, the faster the recovery.
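
The table above can be sketched as an enum plus one place that maps failures to codes. The exception-to-code mapping here is an assumption chosen for illustration; a real CLI would map its own error types.

```python
from enum import IntEnum

class ExitCode(IntEnum):
    SUCCESS = 0
    GENERAL = 1
    USAGE = 2
    AUTH = 3
    NOT_FOUND = 4
    CONFLICT = 5

# Hypothetical mapping from built-in exception types to exit codes.
ERROR_CODES = {
    PermissionError: ExitCode.AUTH,
    FileNotFoundError: ExitCode.NOT_FOUND,
    FileExistsError: ExitCode.CONFLICT,
    ValueError: ExitCode.USAGE,
}

def exit_code_for(exc: BaseException) -> ExitCode:
    """Return the most specific code for an error, falling back to GENERAL."""
    for exc_type, code in ERROR_CODES.items():
        if isinstance(exc, exc_type):
            return code
    return ExitCode.GENERAL
```

A top-level handler can then log the message to stderr and call `sys.exit(exit_code_for(exc))`, so the code and the message always agree.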

TTY Detection for Graceful Degradation

The original said “make it non-interactive.” Better advice: detect whether you’re talking to a human or a pipe, and behave accordingly.

if not args.env:
    if sys.stdin.isatty():
        args.env = prompt_user("Which environment?",
                               choices=["staging", "production"])
    else:
        die("Error: --env is required\n"
            "  mycli deploy --env ")

This gives humans the interactive experience they expect while failing fast for agents with an actionable error message.

Auth Without Browsers

The original didn’t mention authentication at all. Agents can’t open browsers for OAuth or type passwords at prompts. Your CLI needs to support at least three auth methods:

# 1. Flag (highest priority)
$ mycli --token sk-abc123 service list

# 2. Environment variable
$ MYCLI_TOKEN=sk-abc123 mycli service list

# 3. Config file (written once by a human)
$ mycli auth configure --token sk-abc123
$ mycli service list  # reads from ~/.mycli/config
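
The precedence between those three sources can be sketched as a single resolver. This assumes the config file has already been parsed into a mapping with a `token` key, which is a hypothetical layout; only the `MYCLI_TOKEN` variable name comes from the example above.

```python
import os
from typing import Mapping, Optional

def resolve_token(flag_token: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None,
                  config: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Resolve a token with precedence: flag, then env var, then config file."""
    env = os.environ if env is None else env
    if flag_token:
        return flag_token
    if env.get("MYCLI_TOKEN"):
        return env["MYCLI_TOKEN"]
    if config and config.get("token"):
        return config["token"]
    return None  # caller should fail fast with an auth error
```

Passing `env` explicitly keeps the function testable; in the CLI itself it falls back to the process environment.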

Pagination

Also missing from the original. Dumping 10,000 results into an agent’s context window is expensive and usually unnecessary. Support --limit, --offset, and ideally cursor-based pagination:

$ mycli logs list --limit 20 --output json
{
  "items": [...],
  "pagination": {
    "total": 1847,
    "limit": 20,
    "next_cursor": "eyJpZCI6MTIwfQ=="
  }
}
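
The `next_cursor` value in that example is base64-encoded JSON holding the last item's id. A minimal sketch of opaque cursors built that way follows. It assumes items carry a positive integer `id` and are already sorted by it; the `page` helper is hypothetical.

```python
import base64
import json
from typing import Optional

def encode_cursor(last_id: int) -> str:
    """Pack the last-seen id into an opaque base64 string."""
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode()
    return base64.b64encode(raw).decode()

def decode_cursor(cursor: str) -> int:
    """Recover the last-seen id from a cursor."""
    return json.loads(base64.b64decode(cursor))["id"]

def page(items: list[dict], limit: int, cursor: Optional[str] = None) -> dict:
    """Return up to `limit` items whose id is greater than the cursor's id."""
    after = decode_cursor(cursor) if cursor else 0
    remaining = [item for item in items if item["id"] > after]
    chunk = remaining[:limit]
    has_more = len(remaining) > limit
    return {
        "items": chunk,
        "pagination": {
            "total": len(items),
            "limit": limit,
            "next_cursor": encode_cursor(chunk[-1]["id"]) if has_more else None,
        },
    }
```

An agent only needs to pass `next_cursor` back unchanged until it comes back as null; it never has to understand what is inside.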

The Full Checklist

We organized all 16 principles into three tiers. Here’s the quick reference:

Must-Have — agents can’t function without these:

All inputs accepted as flags
--output json on every data-returning command
Stdout for data, stderr for diagnostics
--help with examples on every subcommand
Fail fast with actionable errors
Distinct exit codes
Auth via env vars / config files / --token

Should-Have — agents work much better:

--dry-run for destructive actions
--yes / --force to skip confirmations
Idempotent commands
Consistent resource verb structure
Structured success responses
--quiet mode
Pagination

Nice-to-Have — makes agents more efficient:

--stdin for pipe-friendly input
Machine-readable progress on stderr
Programmatic command/flag discovery
Versioned output schemas
Verbosity levels
TTY detection

We Benchmarked It

We didn’t just write the list — we turned it into a Claude Code skill and tested whether it actually changes outcomes. We ran three eval scenarios (building a deploy CLI, reviewing existing code, writing a config import command) with and without the skill, then graded the outputs against specific assertions.

Eval	With Skill	Without	Delta
Deploy CLI structure	6/6 (100%)	2/6 (33%)	+67%
CLI code review	12/12 (100%)	10/12 (83%)	+17%
Config import command	4/5 (80%)	2/5 (40%)	+40%

The biggest delta was in code generation. Without the skill, the model produced CLIs with positional arguments, print("Done.") success messages, and no exit codes. With the skill, every command got --output json, stderr/stdout separation, TTY-aware confirmations, and structured dry-run previews.

The code review eval was closer because the model already catches obvious issues like interactive prompts. But the skill caught the subtler patterns: emoji in output being fragile for parsing, missing stderr/stdout separation, and the absence of severity tiers in the review itself.

Get the Skill

The skill is open source and available on GitHub:

github.com/saileshpanchal/agent-friendly-cli

Install it:

mkdir -p ~/.claude/skills/agent-friendly-cli
curl -o ~/.claude/skills/agent-friendly-cli/SKILL.md \
  https://raw.githubusercontent.com/saileshpanchal/agent-friendly-cli/main/SKILL.md

It triggers automatically when you’re writing CLI code with Click, argparse, Cobra, Clap, or any CLI framework. No slash command needed — just start writing CLI code and it activates.

The patterns themselves are language-agnostic. Whether you’re building CLIs in Python, Go, Rust, or Node, the same principles apply. The skill just makes sure they’re applied consistently.

Building an Enterprise Security Chassis for Vapor: What Swift Was Missing

2026-03-23T00:00:00+00:00

Here’s a test. Go to the Vapor ecosystem and find a reusable library that gives you multi-tenant authorization with deny-precedence policy composition, OIDC authentication with PKCE, tamper-evident audit logging, data classification enforcement, and tenant-scoped data access — all wired into a middleware pipeline that fails safe when you get the ordering wrong.

You won’t find one. Not because Vapor is immature — it’s a serious framework with a serious community. But the ecosystem has optimised for breadth (here’s how to build a REST API, here’s a CRUD template) rather than depth (here’s how to build an application that a compliance officer would sign off on).

That’s the gap we set out to close.

The Gap, Specifically

We needed a server-side Swift web application for a recruitment platform. The data is sensitive — CVs, salary histories, skills assessments tied to named individuals. The platform is multi-tenant — different recruitment firms, different organisations, different data that must never leak across boundaries. We chose Vapor because we’re an Apple-ecosystem shop and Swift 6’s concurrency model is genuinely good for server work.

Then we started listing what we needed that didn’t exist as a reusable package:

Requirement	Vapor Ecosystem	What We Had to Build
OIDC authentication with PKCE	JWT verification exists; full OIDC flow doesn’t	Complete OIDC controller with PKCE S256
Multi-tenant authorization	Nothing reusable	6-policy composite with deny precedence
Tenant-scoped data access	Nothing	Repository pattern enforcing isolation at query level
Data classification (sensitivity labels)	Nothing	Hard-gate policies for confidential/personal/privileged data
Tamper-evident audit logging	Nothing	Per-tenant SHA-256 hash chains
CSRF for Leaf + HTMX	Partial examples	Middleware with `req.csrfToken` for templates
Session management with key rotation	Nothing reusable	Dual-key HMAC-SHA256 with constant-time verification
Environment-driven auth modes	Nothing	Zero-code-change switching between disabled/optional/required

Eight gaps. All of them are table stakes for enterprise software. None of them existed as drop-in packages.

VaporSecurityKit: The Chassis

Rather than solving these problems inline — scattered across controllers, coupled to our application — we built a reusable library. Any Vapor application imports it via a Package.swift git URL and gets the full security chassis in one call:

try app.useSecurityKit(config: .fromEnvironment())

That single line wires up a six-stage middleware pipeline in the correct order, registers OIDC routes, and configures session management. The ordering matters — and that’s exactly why it’s encapsulated.

The Middleware Pipeline (Order Is Security)

The six-stage middleware pipeline. Each stage reads values written by the previous stage — data flows left to right. Rendered with D2.

Each stage reads values written by the previous stage. PrincipalResolution can’t run before the session middleware. TenantResolution needs the principal to cross-validate tenant claims. Authorization needs both. Getting this wrong doesn’t throw a compiler error — it creates a security hole that passes all your tests.

By shipping the pipeline as a library with a fixed ordering, consuming applications can’t accidentally reorder it. The security decision is made once, in the library, not re-made in every project.

Deny-Precedence Policy Composition

Most authorization systems we’ve seen in web frameworks use a simple role check: does the user have the admin role? Yes or no. That works for toy applications. It falls apart when you need to combine multiple concerns — role, ownership, data sensitivity, sharing scope, workspace membership — into a single access decision.

Our CompositePolicy evaluates all applicable policies and applies a strict precedence:

Any policy returns .deny → access denied (deny is final, regardless of order)
Any returns .elevationRequired → privilege elevation required
At least one returns .allow → access granted
Everything abstains → denied by default

The critical property: order doesn’t affect the security decision. You can add policies, remove policies, reorder policies — the deny-precedence semantics are invariant. This is harder to get wrong than a chain of if statements.

Deny-precedence evaluation. All six policies run in parallel — a single deny overrides any number of allows. Rendered with PlantUML.
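
The four precedence rules can be sketched in a few lines. This is written in Python for brevity and is not the library's Swift API; `Decision`, `combine`, and `order_independent` are hypothetical names.

```python
from enum import Enum
from itertools import permutations

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ELEVATION_REQUIRED = "elevationRequired"
    ABSTAIN = "abstain"

def combine(decisions: list[Decision]) -> Decision:
    """Deny wins, then elevation, then allow; all-abstain is denied by default."""
    if Decision.DENY in decisions:
        return Decision.DENY
    if Decision.ELEVATION_REQUIRED in decisions:
        return Decision.ELEVATION_REQUIRED
    if Decision.ALLOW in decisions:
        return Decision.ALLOW
    return Decision.DENY

def order_independent(decisions: list[Decision]) -> bool:
    """Check that every ordering of the same decisions yields one outcome."""
    return len({combine(list(p)) for p in permutations(decisions)}) == 1
```

Because `combine` only asks whether a decision is present, never where it sits in the list, the result cannot depend on policy order.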

Policies are classified by intent:

Type	Policies	Behaviour
Hard gates (deny-only)	Sensitivity, SharingScope	Can deny but never allow — they protect boundaries
Allow refiners	Role, Ownership, WorkspaceScope, GroupScope	Can grant access but never override a deny

A hard gate for data sensitivity means that even if you’re a tenant admin, you can’t read a privileged-classified resource without active elevation. The policy doesn’t know or care about roles — it enforces classification, full stop.

Tamper-Evident Audit Logging

Audit logs that can be silently edited aren’t audit logs. They’re wish lists.

Our FluentAuditLogger maintains a per-tenant SHA-256 hash chain. Each audit event’s hash includes the previous event’s hash, creating a blockchain-like chain per organisation. If someone modifies or deletes an event in the middle, the chain breaks — and verifyChain(organizationId:) returns false.

The chain is per-tenant, not global. Organisation A’s audit trail is independent of Organisation B’s. A chain verification for one tenant doesn’t require reading every audit event in the system.

Independent hash chains per tenant. Tampering with any event breaks the chain from that point forward. Rendered with D2.
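
A minimal in-memory sketch of a per-tenant hash chain, in Python rather than the library's Swift. The genesis value, the payload serialisation, and the `AuditLog` class are assumptions for illustration; only the idea of hashing each event together with the previous hash comes from the description above.

```python
import hashlib
import json
from collections import defaultdict

GENESIS = "0" * 64  # assumed starting value for every tenant's chain

def event_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

class AuditLog:
    def __init__(self) -> None:
        # One independent chain per organisation id.
        self.chains: dict[str, list[dict]] = defaultdict(list)

    def append(self, org_id: str, payload: dict) -> None:
        chain = self.chains[org_id]
        prev = chain[-1]["hash"] if chain else GENESIS
        chain.append({"payload": payload, "prev": prev,
                      "hash": event_hash(prev, payload)})

    def verify_chain(self, org_id: str) -> bool:
        """Recompute every link; any edit or mid-chain deletion breaks it."""
        prev = GENESIS
        for event in self.chains[org_id]:
            if event["prev"] != prev or event["hash"] != event_hash(prev, event["payload"]):
                return False
            prev = event["hash"]
        return True
```

One limit of this sketch: removing the most recent event leaves a shorter but still valid chain, so detecting truncation needs the latest hash stored somewhere else.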

When the database write fails — network issue, disk full, whatever — the logger falls back to console output rather than silently dropping events. You can lose formatting. You can’t lose the record that something happened.

Tenant Isolation at the Data Layer

OWASP’s multi-tenant guidance is clear: tenant isolation must be enforced at the data access layer, not just in middleware. A middleware that checks “is this user in tenant A?” is necessary but not sufficient — a controller that runs a raw Fluent query can still return tenant B’s data.

TenantScopedRepository solves this by wrapping Fluent queries with an automatic organizationId filter. Controllers use the repository instead of raw queries. The scope is structural — you can’t forget to add the filter because the repository adds it for you.

let repo = TenantScopedRepository<UserModel>(tenant: req.resolvedTenantContext)
let users = try await repo.query(on: req.db).all()
// Only returns users in the current tenant — always

Three layers of tenant isolation. Even if middleware and policies pass, the repository enforces scoping at the query level. Rendered with D2.

Cross-tenant access attempts don’t throw an error — they return no results. From the controller’s perspective, users in other tenants simply don’t exist. This is the right semantic for multi-tenant data: not “you can’t access this” but “this doesn’t exist in your world.”
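
The "doesn't exist in your world" semantic can be sketched with an in-memory stand-in. This is Python rather than Fluent, and the `User` type and `find` method are hypothetical; the point is that the tenant filter lives inside the repository, not in the caller.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class User:
    id: int
    organization_id: str
    name: str

class TenantScopedRepository:
    """In-memory stand-in: every read is filtered by the bound tenant."""

    def __init__(self, rows: list[User], organization_id: str) -> None:
        self._rows = rows
        self._organization_id = organization_id

    def all(self) -> list[User]:
        return [r for r in self._rows if r.organization_id == self._organization_id]

    def find(self, user_id: int) -> Optional[User]:
        """Rows in other tenants are reported as absent, not as forbidden."""
        return next((r for r in self.all() if r.id == user_id), None)
```

Looking up another tenant's user id returns `None`, the same answer as for an id that was never issued.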

The AUTH_MODE Contract

Development and production have fundamentally different authentication needs. In development, you want to test authorization logic without running an OIDC provider. In production, you want mandatory authentication with no backdoors.

We solved this with an environment variable — AUTH_MODE — that switches between three modes with zero code changes:

Mode	What Happens
`disabled`	Four seeded demo principals with realistic role sets. The entire authorization pipeline still runs — you’re testing real policies against fake identities
`optional`	JWT resolved if present, demo identity if not. Useful for staging environments where some users are authenticated and some aren’t
`required`	401 on unauthenticated requests. Full OIDC flow. Production mode

The key insight: disabled mode doesn’t bypass security. It provides known identities so the authorization pipeline runs fully. You’re testing the policies, not just testing that your login form works.
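
The three modes can be sketched as one resolver that always returns a principal for the authorization pipeline to evaluate. This is Python rather than the library's Swift; the demo principal, the `Unauthorized` exception standing in for a 401, and the choice to reject unknown modes are all assumptions.

```python
from typing import Mapping, Optional

# Hypothetical seeded identity; the real library seeds four of these.
DEMO_PRINCIPAL = {"sub": "demo-admin", "roles": ["tenant_admin"]}

class Unauthorized(Exception):
    """Stands in for an HTTP 401 response."""

def resolve_principal(mode: str, verified_claims: Optional[Mapping]) -> Mapping:
    """Pick the identity that downstream policies will be evaluated against."""
    if mode == "disabled":
        return DEMO_PRINCIPAL
    if mode == "optional":
        return verified_claims if verified_claims is not None else DEMO_PRINCIPAL
    if mode == "required":
        if verified_claims is None:
            raise Unauthorized("authentication required")
        return verified_claims
    # Fail closed on typos rather than silently falling back to a demo identity.
    raise ValueError(f"unknown AUTH_MODE: {mode!r}")
```

In every mode the caller gets back a principal or an error, so the policy evaluation that follows never has a code path that skips authorization.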

The Toolchain: How We Actually Built This

The framework took three phases over five days. That speed came from the toolchain as much as the code.

Claude as Pair Programmer

Claude wrote code in this project — the co-author tag is on every commit. But the more interesting pattern was plan refinement — using Claude and Perplexity together to validate architectural decisions before writing a line of code.

The workflow: we’d describe an architectural question to Claude — “how should deny-precedence work when policies can abstain?” — and get a detailed proposal. Then we’d take the same question to Perplexity with a different framing: “what are the failure modes of order-independent policy evaluation in RBAC systems?” Perplexity returns academic papers, OWASP guidance, real-world CVEs from systems that got this wrong.

The two tools have complementary blind spots. Claude is excellent at generating coherent designs but can be confidently wrong about edge cases it hasn’t seen. Perplexity surfaces real-world evidence — papers, CVEs, production incident reports — but doesn’t synthesise them into a design. Using both, iteratively, produces better architecture than either alone.

Concrete example: Claude’s initial proposal for the audit hash chain used a global chain — every event hashed against the previous global event. Perplexity surfaced a paper on audit log scalability that showed global chains become a serialisation bottleneck under concurrent writes. We switched to per-tenant chains before writing the code. That’s a design decision that would have been expensive to change after implementation and invisible in testing until production load exposed it.

We ran this loop — Claude proposes, Perplexity validates, Claude revises — for every significant architectural decision: middleware ordering, policy classification, session rotation, CSRF token generation. The plan was solid before the first swift build.

The full development loop. Plan refinement (top) feeds validated architecture into implementation (bottom). Rendered with PlantUML.

gstack for QA and Development Workflow

We use gstack, Garry Tan’s open-source skill collection that turns Claude Code into a virtual engineering team, throughout development: not just for final QA, but as part of the development loop. gstack provides 28 specialised slash commands that cover the entire sprint lifecycle: planning (/office-hours, /plan-ceo-review, /plan-eng-review), building, reviewing (/review), QA testing (/qa, /browse), security auditing (/cso), and shipping (/ship, /land-and-deploy). It’s the setup the YC CEO uses to ship 10,000+ lines of production code per day.

The pattern: write a feature, deploy locally, use /qa to systematically test the feature against a checklist, get a structured bug report with screenshots, fix the bugs with before/after evidence. Each fix is an atomic commit. The QA cycle catches things that unit tests miss — rendering issues, middleware ordering effects on actual HTTP responses, CSRF token flow through real form submissions.

The complete OIDC flow. PKCE S256 eliminates the need for a client secret on public clients. Dual-key session rotation keeps old sessions valid during key changes. Rendered with PlantUML.

For the OIDC flow specifically, gstack was invaluable. OIDC involves redirects, state parameters, PKCE challenge/verifier pairs, and cookie handling that’s nearly impossible to test with unit tests alone. We used /browse to walk through the entire login → callback → session → logout flow in a real browser, capturing screenshots at each step. When the PKCE verifier wasn’t being stored correctly in the session, the browser test caught it immediately — the unit test had passed because it was mocking the session storage.

The /review skill runs before every PR — analysing the diff for SQL safety issues, trust boundary violations, and structural problems. It caught a case where a controller was using a raw Fluent query instead of TenantScopedRepository — a tenant isolation violation that would have been invisible in code review because the query was syntactically correct.

What We Shipped

Three phases, five commits, 45 tests passing with zero warnings:

Phase	What	Key Files
1. Security Chassis	Middleware pipeline, principal resolution, rate limiting	6 middleware files, SecurityKit entry point
2. OIDC + Sessions	Full OIDC with PKCE, dual-key session management, CSRF	OIDCController, SessionManager, PKCEGenerator
3. Models + Policies + Audit	8 Fluent models, 6 authorization policies, hash-chain audit logger	CompositePolicy, FluentAuditLogger, TenantScopedRepository

The framework is Apache 2.0 licensed. Any Vapor application can import it and get the full enterprise security chassis — the same one we’re using for our own production applications.

What’s Still Missing

We’re honest about what isn’t built yet:

Leaf templates — the Resources/Views/ directory is empty. The CSRF middleware generates tokens and makes them available via req.csrfToken, but the actual Leaf templates for the reference application (login screens, dashboards, admin panels) haven’t been built. That’s Phase 4.

HTMX integration — the CSRF middleware supports HTMX headers (X-CSRF-Token), but the front-end layer using Pico CSS + HTMX is planned, not shipped.

Database triggers — the audit logger enforces append-only semantics in application code, but the SQL triggers that prevent UPDATE/DELETE at the database level aren’t in the migration yet. Application-level enforcement is necessary but not sufficient.

Privilege elevation flow — SensitivityPolicy returns .elevationRequired for privileged resources, and there’s a PrivilegeElevationModel in the schema, but the actual elevation UI and approval workflow aren’t implemented.

The Broader Observation

Server-side Swift is mature enough for production web applications. The language’s concurrency model, type safety, and performance characteristics are genuine advantages over Node.js and Rails for security-sensitive work. What’s missing isn’t capability — it’s the reusable building blocks that other ecosystems take for granted.

Django ships with authentication, authorization, CSRF protection, and an admin panel. Rails has Devise, Pundit, and paper_trail. Spring has Spring Security. Vapor has JWT verification and session middleware — and then you’re on your own.

VaporSecurityKit is our attempt to close that gap. Not for every Vapor application — a blog doesn’t need deny-precedence policy composition. But for the applications that handle sensitive data, serve multiple tenants, and need to pass a security review? The chassis should exist as a package, not as tribal knowledge.

Sailesh Panchal is Director at Digital Transformation Advisory (DTA), specialising in technology strategy and AI-native architecture for enterprise clients.

When Your Data Can’t Leave the Building: Training Small Language Models for Enterprise

2026-03-20T10:00:00+00:00

Picture this. You’re a recruiter at a specialist firm. A hiring manager sends you a job description for a Lead Platform Engineer. You need to understand exactly what skills this role requires, map them to an industry framework, and match against your candidate database — ideally in the time it takes to read the email.

Now picture the data involved. CVs with home addresses, salary histories, and career trajectories. Skills assessments tied to named individuals. Internal compensation benchmarks. Disability and diversity information. Client organisation charts.

You call a cloud API — GPT-4, Claude, whatever’s flavour of the month — and every piece of that data leaves your network, crosses the internet, and arrives at a third party’s data centre. The API terms say they won’t train on it. Your compliance officer says the risk assessment takes six weeks. Your client’s contract says their data stays in the UK.

This isn’t a hypothetical. It’s the conversation we have with almost every enterprise client who wants to use AI on sensitive data. And the answer is usually the same: “We’d love to, but we can’t.”

The Privacy Tax

The standard solution is to sanitise the data before sending it to a cloud API. Strip names, mask salaries, replace company names with placeholders. This works — for simple tasks. But language understanding is contextual. “10 years at a Big Four firm” carries different weight than “10 years in a startup.” Sanitising the context destroys the signal.

The other solution is to run everything on-premises. Deploy a 70-billion-parameter model on your own GPUs. This works too — if you have a team of ML engineers, a rack of A100s, and a budget that doesn’t need to survive a quarterly review.

What we actually need is a model small enough to run on the hardware people already have — a laptop, a phone, a Mac Mini in a server cupboard — that understands the specific domain well enough to be useful. Not a general-purpose genius. A specialist.

Small Language Models: The Right Tool for Bounded Problems

A small language model (SLM) is typically 2-4 billion parameters, compared to 70-400 billion for the cloud models. At first glance, that’s a massive capability gap. And for general-purpose tasks — writing essays, coding, broad-knowledge Q&A — it is.

But enterprise problems aren’t general-purpose. The recruiter doesn’t need a model that can write poetry and debug Rust. They need one that can read a job description and output a structured skills assessment against a specific framework. The vocabulary is bounded. The output format is defined. The success criteria are measurable.

This is where fine-tuning changes the equation. You take a capable-but-generic base model and train it on your domain until it becomes a specialist. The model doesn’t need to know everything — it needs to know your things very well.

We train two tiers from the same pipeline and the same training data:

Tier	Base Model	Target	Size at 4-bit
iPhone	Qwen 3.5 2B	iPhone 15 Pro+ (8GB RAM)	~1.2GB
Laptop	Qwen 3.5 4B	Mac with 16GB+ RAM	~2.5GB

The laptop model scores higher — more parameters means more capacity for domain knowledge. But the iPhone model still passes minimum accuracy thresholds, and it runs on hardware that fits in a pocket. The consumer app selects the right tier at runtime based on what device it’s on. Same protocol, same prompt, different model file.
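A minimal sketch of that runtime tier selection, written in Python only to show the logic (the consumer apps themselves use MLX.Swift). The `Tier` type, the `select_tier` helper, and the use of the table's RAM figures as hard thresholds are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    model_file: str
    min_ram_gb: float  # assumed threshold, taken from the table above

# Ordered largest-first so the most capable tier that fits wins.
TIERS = [
    Tier("laptop", "sfia-mapper-laptop-4bit", 16.0),
    Tier("iphone", "sfia-mapper-iphone-4bit", 8.0),
]

def select_tier(device_ram_gb: float) -> Tier:
    """Return the most capable tier whose RAM requirement the device meets."""
    for tier in TIERS:
        if device_ram_gb >= tier.min_ram_gb:
            return tier
    raise ValueError(f"No tier fits a device with {device_ram_gb} GB RAM")

print(select_tier(8).model_file)
print(select_tier(36).model_file)
```

Everything downstream of this choice stays identical: the prompt and the calling protocol do not depend on which file was loaded.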

Building the Model Factory

We’re not building one model. We’re building a pipeline — a reusable model factory that can produce domain-specific SLMs for different applications from the same codebase. Recruitment is one domain. There are others.

The factory works in stages, and each stage exists for a reason rooted in the business problem, not just the technology.

Stage 1: The Gold Set and the Teacher

Before any training happens, we build the exam paper: 200 expert-validated examples. Real job descriptions, real experience statements, each mapped to SFIA competencies by hand, reviewed by domain experts. These 200 examples are never used in training. They exist solely to measure whether the model is improving — the same 200-question test, administered at every checkpoint. If you train on the exam, the scores are meaningless.
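One way to enforce the "never train on the exam" rule is a contamination check before every training run. The sketch below is an assumption about how such a check could look, not the pipeline's actual code: it fingerprints normalised text and flags any training input that also appears in the gold set.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Hash a whitespace- and case-normalised copy of the text."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def find_contamination(gold_inputs, training_inputs):
    """Return training inputs whose fingerprint also appears in the gold set."""
    gold_hashes = {fingerprint(t) for t in gold_inputs}
    return [t for t in training_inputs if fingerprint(t) in gold_hashes]

# Hypothetical examples, not taken from the real gold set.
gold = ["Lead Platform Engineer, owns cloud platform strategy"]
train = [
    "Senior Data Analyst, builds reporting pipelines",
    "lead platform engineer,  owns cloud platform strategy",  # same text, different case and spacing
]
leaks = find_contamination(gold, train)
print(leaks)
```

This only catches exact matches after normalisation. Paraphrased duplicates would need fuzzy matching, which is out of scope for this sketch.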

With the exam built, we need the curriculum. The base model (open-source, 2 billion parameters for phones and 4 billion for laptops) knows language but not our domain. Rather than manually writing 1,800 more examples, which would be slow, expensive, and inconsistent, we use a large cloud model as a “teacher.” Claude generates high-quality training data: question/answer pairs, structured assessments, edge cases. The teacher sees our framework definitions and produces examples that follow the patterns we need.

Input:  "Senior Infrastructure Engineer — responsible for cloud
         platform strategy, team leadership, vendor management"

Output: {
  "skills": [
    {"name": "infrastructure design", "level": 5},
    {"name": "cloud services management", "level": 5},
    {"name": "technology leadership", "level": 4}
  ],
  "rationale": "Cloud platform strategy ownership with team and
                vendor management indicates senior autonomous
                practitioner level..."
}

The irony isn’t lost on us: we send framework definitions (public information) to a cloud API to generate training data, specifically so that production data (private information) never has to make the same trip. The teacher trains the student. Then the student works alone.

Stage 2: The Student Learns (LoRA Fine-Tuning)

A common assumption: you fine-tune a large model and then shrink it down to fit on a device. That’s not what we do. The training happens directly on the small models — 2B for iPhone, 4B for laptop. The large model’s job ended in Stage 1 — it created the curriculum. Now the students sit the exam alone.

We fine-tune each tier using LoRA — Low-Rank Adaptation. This freezes most of the model’s weights and trains a small adapter (~50MB) that modifies the model’s behaviour. Same training data, same technique, separate adapters for each model size. It’s fast, memory-efficient, and can run on a single Apple Silicon Mac.

The business reason this matters: the training infrastructure is a laptop, not a data centre. The team that maintains the model can retrain it when the framework updates, without submitting a GPU requisition. You’re not renting A100s to train a 70B model and then spending another day compressing it. You’re training the exact model that will ship, on the same Apple Silicon family it will run on.

We target both the attention layers (how the model relates words to each other) and the feed-forward layers (how it processes information). Including the feed-forward layers — a detail we learned the hard way — dramatically improves the model’s ability to produce valid structured output. When your application expects JSON, “almost valid JSON” is the same as broken.
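To see why the adapter stays small even when the feed-forward layers are included, here is a back-of-envelope parameter count. The layer names and dimensions are hypothetical (they are not Qwen's real shapes), and the rank of 16 is an assumed value.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter adds two matrices: (d_in x r) and (r x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical shapes for one transformer block.
HIDDEN, FFN, RANK = 2048, 8192, 16
targets = {
    # attention projections
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN), "o_proj": (HIDDEN, HIDDEN),
    # feed-forward projections
    "gate_proj": (HIDDEN, FFN), "up_proj": (HIDDEN, FFN), "down_proj": (FFN, HIDDEN),
}

frozen = sum(d_in * d_out for d_in, d_out in targets.values())
trainable = sum(lora_param_count(d_in, d_out, RANK) for d_in, d_out in targets.values())
print(f"frozen: {frozen:,}  trainable: {trainable:,}  ratio: {trainable / frozen:.2%}")
```

With these toy numbers the adapter trains roughly 1% of the weights it sits beside, which is what keeps memory use within reach of a single Mac.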

Stage 3: The Student Gets Tested (RLVR)

After fine-tuning, the model can mimic the teacher’s format. But mimicry isn’t understanding. If the teacher said a particular skill was level 5, the student will say level 5 for that example. What about a job description it’s never seen?

This is where Reinforcement Learning from Verifiable Rewards (RLVR) takes over — and it’s the stage that runs overnight with a measurable improvement cycle.

The 5-Minute Checkpoint Cycle

Here’s what actually happens on the machine, concretely:

Generate — The model receives a batch of prompts (real job descriptions it hasn’t seen in training). For each prompt, it generates a group of 8 candidate outputs. That’s 8 different attempts at the same skills assessment.
Score — Every output gets scored against verifiable criteria. Not opinions — facts:
- Is the JSON valid? (Parser says yes or no — no ambiguity)
- Are the referenced skills real? (Lookup against the framework — they exist or they don’t)
- Is the assigned level reasonable? (Within ±1 of expert consensus)
- Did the model hallucinate a skill that isn’t in the framework? (Verifiable)
Learn — The technique we use, GRPO (Group Relative Policy Optimization), compares the 8 outputs within the group. Each output’s score is measured against the group average: outputs that score above it become the positive training signal; those below it become the negative signal. The model’s weights adjust toward producing more outputs like the good ones. No separate “critic” model is needed because the group comparison is the critic.
Checkpoint — Every ~5 minutes (roughly 50-100 training steps on Apple Silicon), the pipeline saves a snapshot: the current model weights, the timestamp, and the forge_score evaluated against the held-out gold set — the same 200 expert-validated examples, every time.
Repeat — The cycle restarts with new prompts. If the score improved, training continues. If it degraded (which can happen — the model sometimes optimises for one metric at the expense of another), the pipeline can revert to the last good checkpoint and adjust.
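The group comparison in the Learn step can be sketched as follows. This follows the common description of GRPO, where each reward is normalised against its own group's mean and standard deviation; the reward values are hypothetical and the real pipeline's implementation may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical composite rewards for 8 candidate outputs to one prompt.
rewards = [0.91, 0.40, 0.75, 0.75, 0.20, 0.88, 0.60, 0.55]
advantages = group_relative_advantages(rewards)
best = advantages.index(max(advantages))

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Outputs with a positive advantage are reinforced and those with a negative advantage are discouraged. If all 8 outputs score the same, every advantage is zero and that prompt contributes no learning signal.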

We combine the criteria into a single composite score:

forge_score = (
    30 * json_validity       +
    25 * skill_f1            +
    25 * level_within_1      +
    15 * (1 - hallucination) +
     5 * evidence_grounding
)
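As a runnable sketch, the composite could be computed like this. Dividing by 100 to land on a 0-1 scale is our assumption here (the weights sum to 100), and `is_valid_json` is a hypothetical helper showing how one component can be scored mechanically.

```python
import json

# Weights from the formula above; they sum to 100.
WEIGHTS = (30, 25, 25, 15, 5)

def is_valid_json(raw_output: str) -> float:
    """1.0 if the parser accepts the output, else 0.0."""
    try:
        json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0

def forge_score(json_validity, skill_f1, level_within_1, hallucination, evidence_grounding):
    """Weighted sum of the five components, each expected in the range 0-1."""
    components = (json_validity, skill_f1, level_within_1, 1 - hallucination, evidence_grounding)
    weighted = sum(w * c for w, c in zip(WEIGHTS, components))
    return weighted / 100  # assumption: rescale to a 0-1 score

perfect = forge_score(1.0, 1.0, 1.0, 0.0, 1.0)
partial = forge_score(is_valid_json('{"skills": []}'), 0.5, 0.5, 0.0, 0.0)
print(perfect, partial)
```

Note that hallucination enters as `1 - hallucination`, so a lower hallucination rate raises the score like every other component.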

Evidencing the Improvement

This is the part that matters to a CTO or a compliance officer: can you prove the model got better?

Yes — because every 5-minute checkpoint produces a forge_score against the same gold set. Plot them and you get a training curve:

Checkpoint    Time        forge_score    json_valid   skill_f1   hallucination
─────────────────────────────────────────────────────────────────────────────
ckpt-000      18:00       0.41           0.72         0.38       0.22
ckpt-012      19:00       0.58           0.94         0.51       0.15
ckpt-024      20:00       0.69           0.98         0.62       0.09
ckpt-048      22:00       0.77           1.00         0.71       0.05
ckpt-072      00:00       0.82           1.00         0.78       0.03
ckpt-096      02:00       0.84           1.00         0.81       0.02
ckpt-120      04:00       0.85           1.00         0.82       0.02  ← plateau

The pattern is consistent: JSON validity converges first (the model learns the format within the first hour), skill identification improves steadily through the night, and hallucination rate drops as the model learns what’s not in the framework. Eventually the score plateaus — the model has extracted all the signal available from the training data. That’s your stopping point.
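A simple way to turn "the score plateaus" into a stopping rule is to watch the gain between consecutive evaluations. The sketch below uses the forge_score column from the table above; the `min_gain` and `patience` thresholds are assumptions, not values from our pipeline.

```python
def plateau_index(scores, min_gain=0.03, patience=2):
    """Index where gains have stayed below min_gain for `patience` steps, else None."""
    flat = 0
    for i in range(1, len(scores)):
        gain = scores[i] - scores[i - 1]
        flat = flat + 1 if gain < min_gain else 0
        if flat >= patience:
            return i
    return None

# forge_score values from the table above, in time order.
scores = [0.41, 0.58, 0.69, 0.77, 0.82, 0.84, 0.85]
stop_at = plateau_index(scores)
print(stop_at)
```

With these thresholds the rule fires on the last row, the one marked as the plateau. A stricter `patience` would let training run longer before stopping.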

Each checkpoint is a complete, usable model. If the 2am checkpoint scores 0.84 and the 4am checkpoint scores 0.85 but introduces a regression on one metric, you can ship the 2am version. The decision is auditable: here’s the score at each point, here’s what we chose, here’s why.
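That shipping decision can itself be made mechanical. A minimal sketch, assuming every tracked metric is "higher is better" and that a checkpoint is disqualified if any metric drops relative to its predecessor. The history below is hypothetical and includes an invented regression to show the rule in action.

```python
def pick_checkpoint(history, tolerance=0.0):
    """Highest-scoring checkpoint with no per-metric regression vs. its predecessor."""
    candidates = [history[0]]
    for prev, cur in zip(history, history[1:]):
        regressed = [
            name for name, value in cur["metrics"].items()
            if value < prev["metrics"][name] - tolerance
        ]
        if not regressed:
            candidates.append(cur)
    return max(candidates, key=lambda c: c["forge_score"])

# Hypothetical history: the 04:00 checkpoint scores highest but loses JSON validity.
history = [
    {"time": "00:00", "forge_score": 0.82, "metrics": {"json_valid": 1.00, "skill_f1": 0.78}},
    {"time": "02:00", "forge_score": 0.84, "metrics": {"json_valid": 1.00, "skill_f1": 0.81}},
    {"time": "04:00", "forge_score": 0.85, "metrics": {"json_valid": 0.99, "skill_f1": 0.84}},
]
chosen = pick_checkpoint(history)
print(chosen["time"])
```

Because the rule is code, the audit trail includes not just the scores but the exact criterion that selected the shipped model.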

This is fundamentally different from training a model and hoping it works. Every 5 minutes, you have evidence.

Stage 4: The Model Ships as a File

Here’s where the “train the small model directly” approach pays off. Each LoRA adapter — all those improvements from SFT, RLVR, and DPO — gets fused back into its base model’s weights. No distillation step, no compression from a larger model. The adapter was always attached to the target-size model, so fusing is a simple matrix addition.
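The fusing step can be illustrated with toy matrices. This is a sketch of the standard LoRA identity (base weight plus a scaled low-rank product), not our pipeline code; the shapes, values, and scale factor are arbitrary assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy shapes: 3 inputs, 2 outputs, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]  # frozen base weight (3 x 2)
A = [[0.5], [0.0], [-1.0]]                 # adapter down-projection (3 x 1)
B = [[2.0, 4.0]]                           # adapter up-projection (1 x 2)
scale = 0.5                                # alpha / rank, an assumed value
x = [[1.0, 2.0, 3.0]]                      # one input row

# Training-time path: base output plus the scaled adapter output.
y_adapter = add(matmul(x, W), matmul(matmul(x, A), B), scale)

# Fusing: fold the adapter into the base weight once, then drop A and B.
W_fused = add(W, matmul(A, B), scale)
y_fused = matmul(x, W_fused)

print(y_adapter, y_fused)
```

Both paths produce the same output, which is why the fused model needs no adapter at inference time and carries no extra compute.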

The combined models get quantised to 4-bit using AWQ (Activation-aware Weight Quantization), which protects the most important weights from precision loss. The result: two standalone files — 1.2GB for iPhone, 2.5GB for laptop.
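A quick arithmetic check on those file sizes. The function below counts raw weight storage only; the gap between its result and the shipped sizes is plausibly overhead such as quantisation scales and layers kept at higher precision, though that explanation is our assumption rather than something we measured.

```python
def raw_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return params_billion * bits_per_weight / 8

for label, params in [("2B", 2.0), ("4B", 4.0)]:
    fp16 = raw_weight_size_gb(params, 16)
    q4 = raw_weight_size_gb(params, 4)
    print(f"{label}: {fp16:.1f} GB at 16-bit -> {q4:.1f} GB at 4-bit")
```

So 4-bit quantisation cuts raw weight storage to a quarter of 16-bit, bringing the 2B model to about 1.0 GB and the 4B model to about 2.0 GB before overhead.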

The alternative approach — fine-tuning a 14B or 70B model first, then distilling down — would likely score higher on accuracy benchmarks. But it adds an entire extra stage (distillation), requires GPU servers for the larger model’s training, and introduces a compression step where domain knowledge can be lost. By training each target-size model directly, every weight update is optimised for the model that will actually run in production.

The consumer application loads the appropriate file at startup based on device capability, the same way it would load a database or a config file. It calls the model through a simple protocol — send text in, get structured output back. If the model improves, you ship new files. The application code doesn’t change.

Model Factory → sfia-mapper-iphone-4bit (1.2GB) → iPhone app (MLX.Swift)
(training)   → sfia-mapper-laptop-4bit  (2.5GB) → Mac app (MLX.Swift)

The factory never touches production. The consumer never touches training. The data never leaves the device. These boundaries are the entire point.

A Note on the Teacher’s Role During Training

In the stages above, the teacher (Claude) creates the data, then disappears. The student trains alone. But there’s one technique in the pipeline — Generalized Knowledge Distillation (GKD) — where the teacher stays involved longer.

The problem it solves: during training, the student only sees the teacher’s perfect outputs. But at inference time, the student works from its own imperfect outputs. This mismatch means the student can freeze when it encounters its own phrasing in production — like a student who studied from the textbook answer key and panics when the exam question is worded differently.

GKD mixes teacher corrections into the student’s own outputs during training. The student generates a response, the teacher evaluates it, and the student learns from the gap. This closes the distribution mismatch and produces a more robust model — still 2-4 billion parameters, still running on the target device, but better at handling the messy inputs it will see in the real world.

The Pattern Behind the Pattern

Here’s something we didn’t expect: the checkpoint-and-score loop from Stage 3 applies to problems that have nothing to do with model training.

The core structure is: define a measurable outcome → build a fixed benchmark → iterate in short cycles → score against the benchmark → checkpoint on improvement → stop at plateau. Andrej Karpathy designed this for training neural networks. But the requirements are simpler than they appear:

Can you score the output without a human reviewing it? (A parser, a lookup table, a test suite, a timer)
Do you have a fixed benchmark you can commit to never contaminating? (50-200 known-good examples)
Can each iteration complete in under 5 minutes? (Otherwise you get too few data points overnight)
Can you save and restore state cleanly? (Git commit, file copy, model checkpoint)

If all four are true, you can apply this pattern — whether you’re training a model, optimising prompts, tuning API performance, or searching configurations.

Prompt optimisation is a particularly accessible example. You have a fixed prompt that works “okay.” You have 200 gold examples with expected outputs. Each iteration: adjust the prompt wording, run it against all 200 examples via the API, score the outputs, keep the better prompt. No GPU required. Cost is API calls. Same evidence trail — a CSV showing prompt version, timestamp, composite score. Same audit story for a client.
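The prompt optimisation loop might look like the sketch below. `run_prompt` and `score` are stand-ins you would replace with a real API call and a real scoring function; the fake model, the gold examples, and the CSV column names are all hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

def optimise_prompt(candidates, gold, run_prompt, score, log):
    """Try each candidate prompt, keep the best-scoring one, and log every attempt."""
    best_prompt, best_score = None, float("-inf")
    writer = csv.writer(log)
    writer.writerow(["version", "timestamp", "composite_score", "kept"])
    for version, prompt in enumerate(candidates):
        outputs = [run_prompt(prompt, ex["input"]) for ex in gold]
        avg = sum(score(out, ex["expected"]) for out, ex in zip(outputs, gold)) / len(gold)
        kept = avg > best_score
        if kept:
            best_prompt, best_score = prompt, avg
        writer.writerow([version, datetime.now(timezone.utc).isoformat(), f"{avg:.3f}", kept])
    return best_prompt, best_score

# Stand-ins for the real model call and scorer.
def fake_model(prompt, text):
    return text.upper() if "UPPERCASE" in prompt else text

def exact_match(output, expected):
    return 1.0 if output == expected else 0.0

gold = [{"input": "a", "expected": "A"}, {"input": "b", "expected": "B"}]
candidates = ["Echo the text.", "Return the text in UPPERCASE."]
log = io.StringIO()
best_prompt, best_score = optimise_prompt(candidates, gold, fake_model, exact_match, log)
print(best_prompt, best_score)
print(log.getvalue())
```

The CSV is the evidence trail: one row per prompt version, whether or not that version was kept.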

API performance tuning: same loop. Fixed benchmark of representative API calls. Each iteration tries a different indexing strategy, query rewrite, or cache policy. Score = p95 latency × correctness. Checkpoint = the configuration that produced the best score.

The point isn’t the technique — it’s the evidence trail. In any of these applications, you end up with a CSV that shows measurable improvement over time. When someone asks “how do you know this is better?”, you open the spreadsheet.

We’ve started treating the checklist above as a standard project evaluation: can we define a scoring function? Can we build a gold set? If yes, we apply the loop. If no, we don’t pretend we can; we use human review, A/B testing, or structured evaluation instead. Knowing when not to use it is as important as having the tool.

What the Numbers Look Like

From our benchmarking work with base models on Apple Silicon:

Metric	What It Means
96-196 tokens/sec	Faster than you can read the response
1.2-2.5 GB memory	iPhone (1.2GB) or laptop (2.5GB) — fits alongside the app
<100ms first token	Feels instant in a user interface
£0.00 per inference	No API bill. No token counting. No cost anxiety

These are base model numbers. A fine-tuned model will be slightly different, but in the same ballpark — the LoRA adapter adds knowledge, not computational overhead.

The Business Case, Plainly

Cloud LLM APIs are extraordinary tools. We use them daily. But they create a dependency: on network availability, on third-party pricing, on data processing agreements, on compliance reviews that take longer than the project they’re gate-keeping.

A fine-tuned SLM running on-device removes that dependency for the specific tasks it’s trained for. It’s not better than Claude at general reasoning. It’s not trying to be. It’s better at one thing, and it does that one thing locally, privately, and at zero marginal cost.

The model factory approach means we can produce these specialists for different domains without rebuilding the training infrastructure each time. A recruitment SLM. A compliance SLM. A customer service SLM. Same pipeline, different training data, different model files.

What We’ve Learned So Far

We’re still in the early stages of this build. Some things we’ve confirmed:

Teacher quality beats teacher quantity. 500 carefully crafted examples from Claude produce a better student than 5,000 low-effort ones. Garbage in, garbage out applies to synthetic data too.

The output format is a training target, not a post-processing step. If you need JSON, train the model to produce JSON. Don’t train it to produce text and then try to parse the text into JSON. Including feed-forward layers in the LoRA target makes a measurable difference here.

Apple Silicon is a real training platform at the 2-4B parameter scale. An M-series Mac with 36GB of unified memory handles LoRA fine-tuning for both model tiers comfortably. You don’t need a cloud GPU for models this size.

The compliance conversation changes completely when you can say “the data never leaves the device.” Six-week risk assessments become same-week approvals. The model file ships like any other application asset — through your existing deployment pipeline, your existing change management, your existing security controls.

When to Use Which

Not every problem needs an on-device model. Not every problem can be solved by a cloud API. Here’s how we think about it:

Your Situation	Recommendation
Data is public or low-sensitivity	Cloud API. Easier, more capable, maintained for you
Data is sensitive but tasks are varied	Cloud API with strong DPA, or anonymise first
Data is sensitive AND tasks are bounded	Fine-tuned SLM. This is the sweet spot
Tasks require broad world knowledge	Cloud API. SLMs don’t know enough
You need low-latency responses with no network round-trip	On-device SLM. Nothing beats local inference
API costs would scale with usage	SLM. Train once, infer forever

The recruiter from the opening of this post? Sensitive data, bounded domain, defined output format, latency matters, privacy is non-negotiable. That’s the sweet spot.

The model factory is how we get there.

Sailesh Panchal is Director at Digital Transformation Advisory (DTA), specialising in technology strategy and AI-native architecture for enterprise clients.

Fifty PowerPoints and a Rebrand: Why We Didn’t Train a Model

2026-03-20T00:00:00+00:00

The brief was simple enough. A client had been through a rebrand — new name, new visual identity, new tone of voice. The old brand lived on in 50 PowerPoint decks: board packs, strategy documents, client proposals, quarterly reviews. Every one needed converting.

A designer quoted 2-4 hours per deck. At the midpoint, that’s 150 hours of someone carefully changing Georgia to Calibri, swapping navy for teal, and rewriting “We are pleased to present our findings” as “Here’s what we found.” Important work. Also, the kind of work that makes a talented designer question their career choices by deck number twelve.

We were asked: can AI do this?

The Training Reflex

Our first thought — and I suspect yours too — was to train a model. Feed it examples of old-brand and new-brand decks, let it learn the transformation, and apply it at scale. We’re building fine-tuned small language models for other projects. We have the pipeline. The hammer was in our hand, and this looked like a nail.

Then we opened a deck and started listing what actually changes in a rebrand.

What Changes	Example	How Many Variants?
Font families	Georgia → Calibri	4-6 unique mappings
Font sizes	32pt heading → 36pt	Tied to the font mappings
Colour palette	#003366 → #007C7A	6-8 hex values
Logo	Old logo.png → New logo.png	1 swap
Footer text	“Old Corp. All rights reserved.” → “New Brand. Confidential.”	1 find-replace
Tone of voice	Formal prose → Punchy, conversational	Unbounded

Five of those six are lookup tables. Fixed inputs, fixed outputs, zero ambiguity. Georgia is always Calibri. #003366 is always #007C7A. The footer string is literally the same on every slide of every deck.

Training a neural network to learn a lookup table is like hiring a sommelier to check if milk has expired. Technically possible. Wildly inefficient. Harder to debug when it gets the answer wrong.

The sixth item — tone — is different. Rewriting “Our methodology is evidence-based, outcome-driven, and designed for sustainable change” as “Evidence-based. Outcome-driven. Built to last” requires understanding language, context, and intent. That’s what language models are good at.

So we drew a line.

The Line: 95% Pipeline, 5% Intelligence

We split the problem in two:

Deterministic engine (python-pptx): Handles every visual transformation — fonts, colours, logos, footers, borders, shape fills. Runs in under 2 seconds per deck. Produces identical results every time. Easy to audit, easy to fix, easy to explain to a client.

AI tone adjustment (Claude, in-context): Handles the language rewriting. Reads the text from each converted slide, applies the brand’s tone rules, and rewrites while preserving all factual content. Uses the designer’s own rewrites as few-shot examples — no training data required.

The beauty of this split is that each half plays to its strengths. The pipeline is fast and exact where you need exactness (your brand colour had better be #007C7A, not #007C79). Claude is flexible and contextual where you need intelligence (knowing that “we are pleased to present” and “we would like to share” are the same pattern, even though the words are different).

The full pipeline. One designer pair produces the rules. python-pptx handles the 95% that's mechanical. Claude handles the 5% that requires understanding. Rendered with D2.

How the Pipeline Works

Step 1: The Designer Creates One Pair

This is the clever part, and it’s the designer’s one contribution to the entire process.

They take a single real deck and recreate it in the new brand. Same slides, same text, same structure — different visual treatment. Where they also change the wording (not just the formatting), that signals a tone shift.

We end up with two files: source-exemplar.pptx and target-exemplar.pptx.

Step 2: The Script Diffs Them

A Python script walks both files slide-by-slide, shape-by-shape, text-run-by-text-run. For each matching text string, it compares the formatting and builds a mapping:

{
  "fonts": {
    "Georgia|32.0|bold": {
      "family": "Calibri",
      "size_pt": 36.0,
      "bold": true
    },
    "Cambria|14.0|normal": {
      "family": "Calibri",
      "size_pt": 13.0,
      "italic": true
    }
  },
  "colours": {
    "#003366": "#007C7A",
    "#CC9900": "#2ECC71",
    "#F5F0E8": "#FAFAFA"
  },
  "footer": {
    "find": "© 2024 Meridian Consulting Group. All rights reserved.",
    "replace": "Apex Partners Ltd. Private & Confidential."
  }
}
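A sketch of how the font part of that mapping could be derived. The real script reads runs from the two .pptx files; here plain dictionaries stand in for text runs so the pairing logic is visible on its own. The function names and the sample runs are hypothetical.

```python
def font_key(run):
    """Build the 'family|size|weight' key format used in the mapping above."""
    weight = "bold" if run["bold"] else "normal"
    return f'{run["family"]}|{float(run["size_pt"])}|{weight}'

def build_font_mapping(source_runs, target_runs):
    """Pair runs by identical text and record how the formatting changed."""
    target_by_text = {r["text"]: r for r in target_runs}
    mapping = {}
    for src in source_runs:
        tgt = target_by_text.get(src["text"])
        if tgt is None:
            continue  # the text was rewritten, so this pair is a tone example instead
        mapping[font_key(src)] = {
            "family": tgt["family"],
            "size_pt": float(tgt["size_pt"]),
            "bold": tgt["bold"],
        }
    return mapping

source_runs = [
    {"text": "Cloud Migration Assessment", "family": "Georgia", "size_pt": 32, "bold": True},
    {"text": "We are pleased to present our findings", "family": "Cambria", "size_pt": 14, "bold": False},
]
target_runs = [
    {"text": "Cloud Migration Assessment", "family": "Calibri", "size_pt": 36, "bold": True},
    {"text": "Here's what we found", "family": "Calibri", "size_pt": 13, "bold": False},
]
font_mapping = build_font_mapping(source_runs, target_runs)
print(font_mapping)
```

Matching on identical text is what separates the two kinds of evidence: unchanged text yields formatting rules, changed text yields tone examples.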

The script also captures every text change — places where the designer rewrote the words. These become the tone examples:

Before: "Our team brings deep expertise in cloud migration,
         platform modernisation, and AI-native architecture."
After:  "Our people bring hands-on expertise in cloud,
         platforms, and AI — not just slide decks."

From these examples, we derive tone rules: formal → conversational, passive → active, long sentences → punchy fragments, corporate jargon → plain English.

Step 3: Apply to 50 Decks

The conversion engine opens each deck and applies the mapping mechanically. Every text run gets its font checked and swapped. Every colour value gets looked up and replaced. Every footer gets rewritten. Every logo gets swapped.

On our test deck — a 6-slide cloud migration assessment — the pipeline made 105 font changes, 52 colour changes, and 6 footer updates in under 2 seconds.
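The mechanical application step, including the change tally, can be sketched in a few lines. As an assumption for illustration, text runs are again modelled as plain dictionaries rather than python-pptx objects, and the mapping is a trimmed copy of the JSON shown earlier.

```python
from collections import Counter

def convert_run(run, mapping, counts):
    """Return a converted copy of one text run, tallying each change made."""
    out = dict(run)
    weight = "bold" if run["bold"] else "normal"
    key = f'{run["family"]}|{float(run["size_pt"])}|{weight}'
    if key in mapping["fonts"]:
        out.update(mapping["fonts"][key])
        counts["font"] += 1
    colour = run["colour"].upper()
    if colour in mapping["colours"]:
        out["colour"] = mapping["colours"][colour]
        counts["colour"] += 1
    return out

mapping = {
    "fonts": {"Georgia|32.0|bold": {"family": "Calibri", "size_pt": 36.0, "bold": True}},
    "colours": {"#003366": "#007C7A"},
}
# Hypothetical runs: one fully mapped, one with nothing to change.
runs = [
    {"text": "Title", "family": "Georgia", "size_pt": 32.0, "bold": True, "colour": "#003366"},
    {"text": "Body", "family": "Arial", "size_pt": 12.0, "bold": False, "colour": "#000000"},
]
counts = Counter()
converted = [convert_run(r, mapping, counts) for r in runs]
print(converted)
print(dict(counts))
```

Anything not in the mapping passes through untouched, which makes unmapped fonts or colours easy to spot in review rather than silently guessed at.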

Step 4: Claude Adjusts the Tone

After the visual conversion, Claude reads each slide’s text and applies the tone rules. The prompt pattern is straightforward:

“Given these tone rules and examples, rewrite this text to match the target brand voice. Preserve all facts, technical terms, and proper nouns. Only change the language style.”

Claude sees the designer’s own before/after examples, so it’s learning the brand voice from the person who defined it — not from a training set we curated.
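Assembling that prompt is plain string work. The sketch below shows one possible layout; the section labels ("Tone rules:", "Before:", "After:", "Text to rewrite:") and the `build_tone_prompt` helper are assumptions, and the call to the model itself is left out.

```python
def build_tone_prompt(rules, examples, slide_text):
    """Assemble instruction, tone rules, few-shot pairs, and the text to rewrite."""
    lines = [
        "Given these tone rules and examples, rewrite this text to match the "
        "target brand voice. Preserve all facts, technical terms, and proper "
        "nouns. Only change the language style.",
        "",
        "Tone rules:",
    ]
    lines += [f"- {rule}" for rule in rules]
    for before, after in examples:
        lines += ["", f"Before: {before}", f"After: {after}"]
    lines += ["", f"Text to rewrite: {slide_text}"]
    return "\n".join(lines)

rules = ["formal -> conversational", "passive -> active"]
examples = [("We are pleased to present our findings", "Here's what we found")]
prompt = build_tone_prompt(rules, examples, "Our assessment identifies 82 applications.")
print(prompt)
```

Keeping the designer's pairs as data means a new brand voice only needs a new exemplar deck, not a new prompt template.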

What This Looks Like in Practice

Here’s a slide from the test deck, before and after:

Before (old brand):

The organisation currently operates 147 on-premises applications across three data centres. Our assessment identifies 82 applications suitable for cloud migration within the next 18 months. We recommend a phased approach beginning with non-critical workloads to establish patterns and confidence.

After (new brand, visual conversion + tone adjustment):

147 apps running on-prem across three data centres. We’ve identified 82 that are ready for cloud migration in the next 18 months. Start with non-critical workloads — build the pattern, build confidence, then scale.

Same facts. Same numbers. Same recommendation. Different voice. The fonts, colours, and footer changed too — but those are invisible in a text excerpt.
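"Same numbers" is a claim a script can check. A minimal guard, which is our assumption about a sensible safeguard rather than a documented part of the pipeline, compares the digits in the original against the rewrite:

```python
import re

def numbers_in(text):
    """Extract numeric tokens such as 147 or 18 as a set of strings."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def missing_numbers(before, after):
    """Numbers present in the original that the rewrite dropped."""
    return numbers_in(before) - numbers_in(after)

before = (
    "The organisation currently operates 147 on-premises applications across "
    "three data centres. Our assessment identifies 82 applications suitable "
    "for cloud migration within the next 18 months."
)
after = (
    "147 apps running on-prem across three data centres. We've identified 82 "
    "that are ready for cloud migration in the next 18 months."
)
print(missing_numbers(before, after))
print(missing_numbers(before, "Around 150 apps, 82 ready in 18 months."))
```

This only covers numerals. Spelled-out quantities like "three" and proper nouns would need their own checks before a rewrite is accepted automatically.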

The Decision That Mattered

The most important moment in this project wasn’t writing the code. It was the conversation where we decided not to train a model.

We’ve seen too many organisations reach for AI when a well-structured pipeline would be faster, cheaper, and easier to maintain. The reverse is equally true — we’ve seen teams build increasingly brittle rule systems when a model would handle the variation naturally. The art is drawing the line in the right place.

Here’s the framework we used:

Signal	Reach for a Model	Reach for a Pipeline
Input variation	High — natural language, many phrasings	Low — structured, enumerable
Rules expressible?	No — too many edge cases	Yes — fits in a JSON file
Output must be exact?	Approximate is fine	Must be pixel-perfect
Error consequences	Graceful degradation	Hard failure
Debugging	“Why did the model say this?”	“This key maps to this value”

The PowerPoint rebrand sits firmly on the pipeline side for 95% of the work. The tone adjustment is the 5% where a model earns its keep — not by learning from training data, but by understanding language in context.

The decision framework. Can you enumerate it? Pipeline. Does it need language understanding? AI. Both? Split the problem. Rendered with D2.

What We’d Tell the Sommelier

If you’re staring at a problem and wondering whether it’s an AI problem:

Start by listing what changes. If you can enumerate every transformation in a spreadsheet, you don’t need a model. You need a script.

Find the boundary. There’s usually a seam between the mechanical and the intelligent. Make the mechanical part deterministic, and only invite the AI to the part that actually requires understanding.

Use the designer’s work as your spec. They’ve already done the hard thinking about what the transformation should look like. Your job is to scale it, not to reinvent it.

The 50 decks are converting. The designer is back to doing design work. And we didn’t train a model.

Sailesh Panchal is Director at Digital Transformation Advisory (DTA), specialising in technology strategy and AI-native architecture for enterprise clients.

Running LLMs on Apple Silicon: MLX-LM Benchmarks for Qwen 3.5 and Llama 3.2

2026-03-18T00:00:00+00:00

Apple Silicon changed the game for on-device machine learning. With unified memory and the Metal GPU sitting on the same die as the CPU, your MacBook can run billion-parameter language models without a discrete GPU. The missing piece was software. Apple’s MLX framework and its companion library mlx-lm fill that gap.

This post documents a hands-on benchmarking session comparing three small language models locally on an Apple Silicon Mac, with real numbers, real output, and the pitfalls we hit along the way.

The Setup

Hardware: Apple Silicon Mac (M-series, unified memory)

Software stack:

Python 3.12 via mise
mlx-lm 0.31.1
mlx 0.31.1 + mlx-metal 0.31.1

Models tested:

mlx-community/Qwen3.5-2B-4bit (Alibaba, 2 billion parameters, 4-bit quantized)
mlx-community/Qwen3.5-4B-4bit (Alibaba, 4 billion parameters, 4-bit quantized)
mlx-community/Llama-3.2-3B-Instruct-4bit (Meta, 3 billion parameters, 4-bit quantized)

All models are pre-quantized MLX format from the mlx-community collection on Hugging Face. No conversion step needed.

Installation is one command:

pip install mlx-lm

If you use mise for version management:

mise install python@3.12
mise use --global python@3.12
pip install mlx-lm

The Benchmark

We used a creative writing prompt to test both generation speed and output quality:

Describe a medieval tavern at night. Include sensory details about the atmosphere, the patrons, and the food.

Each model was run with 256-300 max tokens. We tested twice: once with default sampling (greedy), and once with the official recommended sampling parameters.

Raw Speed Numbers

All measurements taken on cached model runs (no download overhead):

Metric	Qwen3.5-2B	Llama-3.2-3B	Qwen3.5-4B
Prompt eval	705 tok/s	920 tok/s	390 tok/s
Generation	196 tok/s	127 tok/s	96 tok/s
Peak memory	1.1 GB	2.0 GB	2.5 GB

The 2B model is roughly 2x faster than the 4B at generation, and uses less than half the memory. Llama 3.2 3B lands in the middle on generation speed and memory, but has the fastest prompt evaluation of the three.
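The ratios behind those statements, computed directly from the table. The 300-token response length used for the wall-clock estimate matches the upper max-token setting in our runs; treating it as a typical response is an assumption.

```python
benchmarks = {
    "Qwen3.5-2B":   {"gen_tok_s": 196, "peak_gb": 1.1},
    "Llama-3.2-3B": {"gen_tok_s": 127, "peak_gb": 2.0},
    "Qwen3.5-4B":   {"gen_tok_s": 96,  "peak_gb": 2.5},
}

speedup = benchmarks["Qwen3.5-2B"]["gen_tok_s"] / benchmarks["Qwen3.5-4B"]["gen_tok_s"]
memory_ratio = benchmarks["Qwen3.5-2B"]["peak_gb"] / benchmarks["Qwen3.5-4B"]["peak_gb"]
print(f"2B vs 4B generation speedup: {speedup:.2f}x, memory: {memory_ratio:.0%}")

# Wall-clock time to generate a 300-token response on each model.
for name, row in benchmarks.items():
    print(f"{name}: {300 / row['gen_tok_s']:.1f} s")
```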

For context, 96 tokens per second is still faster than you can read. All three models feel instant in interactive use.

The Thinking Mode Trap

Here is where our initial benchmarks went wrong, and it contains an important lesson for anyone using Qwen3.5 models.

When we first ran the 4B model with a raw prompt via mlx_lm.generate, it produced this:

Here's a thinking process that leads to the description:
1. Analyze the Request:
   * Topic: A medieval tavern at night.
   * Key Elements: Sensory details...
2. Brainstorming & Imagery:
   * Setting: Stone walls, wooden beams...

Instead of writing prose, it dumped its internal reasoning chain. The 2B model, tested the same way, produced beautiful prose. Our initial conclusion was that the 2B was better. That was wrong.

What happened

Qwen3.5 models have a thinking mode enabled by default. When thinking mode is active, the model emits a <think>...</think> block containing its reasoning before the actual response. The 4B model faithfully entered thinking mode. The 2B model happened not to, likely because the raw prompt format didn’t trigger it as strongly.

The fix: use the tokenizer’s chat template with thinking explicitly disabled.

import mlx_lm
from mlx_lm.sample_utils import make_sampler

model, tokenizer = mlx_lm.load("mlx-community/Qwen3.5-4B-4bit")

messages = [{"role": "user", "content": "Your prompt here"}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # This is the key parameter
)

sampler = make_sampler(temp=0.7, top_p=0.8, top_k=20, min_p=0.0)

response = mlx_lm.generate(
    model, tokenizer,
    prompt=prompt,
    max_tokens=300,
    sampler=sampler
)

Lesson: Always use apply_chat_template() with these models. Raw prompt strings bypass the model’s expected input format and produce unpredictable behavior. This applies to all instruction-tuned models, not just Qwen.

The Deeper Gotcha: Template Divergence Between Model Sizes

There’s a subtler issue we discovered later while building ForgeML training pipelines. The Qwen3.5 2B and 4B variants ship with different default chat templates — and the difference is invisible during training.

Qwen3.5-2B includes a pre-closed <think></think> block in its chat template. No reasoning by default.
Qwen3.5-4B opens a <think> block and expects the model to fill it with reasoning content before responding.

During training, both templates produce the same format for complete conversations, so you won’t notice anything. The divergence only appears at inference time when add_generation_prompt=True appends different suffixes depending on the model size. The 2B appends a clean assistant turn. The 4B appends an open thinking block that the model is expected to complete.

This means the same inference code produces different behavior when you swap model sizes. If you’re deploying Qwen3.5 models for structured output (JSON, function calling, classification), you must explicitly set enable_thinking=False regardless of model size. This is not prominently documented in the model card.

# Always be explicit about thinking mode — don't rely on defaults
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
    enable_thinking=False  # Required for consistent behavior across model sizes
)

If you’re building a pipeline that supports multiple Qwen3.5 variants (for example, using the 2B for fast inference and the 4B for quality-critical tasks), test with both models at inference time. Training-time validation alone won’t catch this.
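As a belt-and-braces measure for structured-output pipelines, you can also strip any reasoning that slips through before parsing. This is a defensive sketch, assuming the reasoning is delimited by `<think>` and `</think>` tags; `strip_thinking` is a hypothetical helper, not part of mlx-lm.

```python
def strip_thinking(text: str) -> str:
    """Keep only what follows the last </think> tag, if there is one."""
    _, _, tail = text.rpartition("</think>")
    return tail.strip()

# Opening tag emitted by the model itself.
full_block = '<think>\nplan the answer\n</think>\n\n{"skills": []}'
# Opening tag supplied by the chat template, so only the close appears in the output.
close_only = 'plan the answer\n</think>\n{"skills": []}'

print(strip_thinking(full_block))
print(strip_thinking(close_only))
print(strip_thinking('{"skills": []}'))
```

Splitting on the closing tag handles both cases with one rule. It does not help if generation hits the token limit mid-reasoning, so setting `enable_thinking=False` remains the primary fix.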

The difference between a raw prompt string and a properly formatted chat template. The tokenizer knows what the model expects. Rendered with PlantUML.

Official Sampling Parameters

Qwen3.5 publishes recommended sampling parameters for different modes. For non-thinking (instruct) mode:

Use Case	temp	top_p	top_k	min_p	presence_penalty
General tasks	0.7	0.8	20	0.0	1.5
Reasoning tasks	1.0	0.95	20	0.0	1.5

For thinking mode (if you want the chain-of-thought reasoning):

Use Case	temp	top_p	top_k	min_p	presence_penalty
General tasks	1.0	0.95	20	0.0	1.5
Precise coding	0.6	0.95	20	0.0	0.0

In mlx-lm 0.31.1, you apply these through the make_sampler function:

from mlx_lm.sample_utils import make_sampler

# Non-thinking, general tasks
sampler = make_sampler(temp=0.7, top_p=0.8, top_k=20, min_p=0.0)

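Note: mlx-lm’s make_sampler does not expose presence_penalty directly as of v0.31.1. The repetition_penalty parameter in the generate function is the closest equivalent, but it is not a drop-in replacement: a presence penalty subtracts a fixed amount from the logit of every token that has already appeared, while a repetition penalty rescales those logits, so the 1.5 from the table should not be copied across unchanged.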
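If you switch between modes often, it can help to keep the published values in one lookup table rather than scattering literals through the code. This is a minimal sketch: the preset keys and the `sampler_kwargs` helper are hypothetical names of my own, and the numbers are copied from the two tables above.

```python
# Recommended Qwen3.5 sampling values, keyed by (mode, use case).
# The key names are illustrative, not part of any library.
SAMPLING_PRESETS = {
    ("non_thinking", "general"): {
        "temp": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0,
        "presence_penalty": 1.5,
    },
    ("non_thinking", "reasoning"): {
        "temp": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
        "presence_penalty": 1.5,
    },
    ("thinking", "general"): {
        "temp": 1.0, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
        "presence_penalty": 1.5,
    },
    ("thinking", "coding"): {
        "temp": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0,
        "presence_penalty": 0.0,
    },
}


def sampler_kwargs(mode: str, use_case: str) -> dict:
    """Return only the keyword arguments that make_sampler accepts.

    presence_penalty is dropped because make_sampler does not take it.
    """
    preset = dict(SAMPLING_PRESETS[(mode, use_case)])  # copy, don't mutate
    preset.pop("presence_penalty")
    return preset
```

With that in place, `make_sampler(**sampler_kwargs("thinking", "coding"))` builds the sampler for precise coding in thinking mode, and an unknown combination fails loudly with a KeyError instead of silently falling back to defaults.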

Quality Comparison (Fair Test)

With thinking disabled and official sampling parameters applied to all models:

Qwen3.5 4B

The heavy oak doors of The Gilded Tankard groan open, admitting a rush of damp, starless night air that carries the scent of rain and wet cobblestones. Inside, the air is thick and warm, a palpable weight held back by the flickering glow of tallow candles… A floor-to-ceiling tapestry depicting knights in armor lines the far wall… In the corner, a lute player strums a mournful tune…

Structured, literary-quality prose. Named specific patron archetypes: a bearded merchant, roughnecks in leather, an elderly woman, a young scribe. Strong world-building details.

Qwen3.5 2B

The air inside Blackwood’s Oak did not smell of wine or wood; it smelled of wet wool, damp stone, and the sharp, tangy scent of fresh rye bread… A low, rumbling murmur vibrates through the floorboards, not from the patrons, but from the wood itself…

Atmospheric and moody with a second-person perspective. Good sensory detail, though it occasionally repeats ideas and shifts tense.

Llama 3.2 3B

The medieval tavern was a warm and inviting haven, its wooden beams and stone walls glowing with a soft, golden light… At the bar, a jovial bartender polished a mug with a dirty rag…

Competent, readable prose. Safe and expected imagery. Gets the job done without surprises.

Quality Verdict

Dimension	2B	Llama 3B	4B
Prose coherence	Good	Good	Excellent
Character diversity	Adequate	Adequate	Rich
Sensory depth	Strong	Adequate	Richest
Consistency	Minor repeats	Solid	Excellent

The 4B is clearly the better writer when properly configured. The 2B punches above its weight. Llama 3.2 3B is reliable but outclassed by both Qwen models in this creative task.

Choosing the right model depends on what matters most for your use case. Rendered with D2.

Practical Recommendations

Choose the 2B when:

You want the fastest possible generation (196 tok/s)
Memory is constrained
The task is straightforward: summaries, simple Q&A, boilerplate generation
You’re running a local API server and need throughput

Choose the 4B when:

Output quality matters more than speed
The task involves multi-step reasoning, creative writing, or nuance
You have 3+ GB of memory to spare (you do on any modern Mac)
You’re building something user-facing

Choose Llama 3.2 3B when:

You need robust instruction-following without template fussing
You want the largest community ecosystem and fine-tune availability
The task is instruction-heavy rather than creative

Running as a Local API Server

For development workflows, mlx-lm can serve an OpenAI-compatible API:

mlx_lm.server --model mlx-community/Qwen3.5-4B-4bit

This gives you a local endpoint at http://localhost:8080 that accepts the same request format as the OpenAI API. You can point any OpenAI-compatible client at it for local inference.
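As a rough illustration of what "the same request format" means, here is a standard-library client sketch. It assumes the server exposes the usual OpenAI-style /v1/chat/completions route and returns the reply under choices[0].message.content; the helper names, the max_tokens default, and the 300-second timeout are my own choices, not anything mlx-lm prescribes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # local endpoint from the text above


def build_chat_request(prompt: str, base_url: str = BASE_URL,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (POST, JSON body)."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        # Non-thinking, general-task values from the sampling table.
        "temperature": 0.7,
        "top_p": 0.8,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the assistant's reply text.

    The socket timeout makes a stalled server raise an error
    instead of blocking the caller indefinitely.
    """
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("Describe a medieval tavern in two sentences.")` should return the generated text. Keeping request construction separate from the network call also means the payload can be unit-tested without a model loaded.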

Key Takeaways

Apple Silicon is a legitimate LLM inference platform. Sub-100 ms time-to-first-token and 96-196 tok/s generation in 1-3 GB of memory are practical for real applications.
Model configuration matters more than model size. A misconfigured 4B model produced worse output than a 2B model. The chat template, thinking mode flag, and sampling parameters are not optional details. Watch for template divergence between model sizes in the same family — Qwen3.5’s 2B and 4B have different thinking-mode defaults that only surface at inference.
The MLX ecosystem is production-ready. Install with pip, download from Hugging Face, generate in three lines of Python. No CUDA, no Docker, no cloud API keys.
Qwen3.5 is the new default for small local models. Both the 2B and 4B outperform Llama 3.2 3B in quality at comparable or better speed. Llama’s remaining advantages are its instruction-following robustness with raw prompts and its larger community and fine-tune ecosystem.