Ask a bank’s COO a simple question: when an important service goes down at 14:02, how long until you can say precisely which customers were harmed — not “roughly”, not “we’re investigating”, but these 1,200 customers missed a payment alert, these 142 payments were delayed? For most banks the honest answer is hours, sometimes days, assembled by people cross-referencing logs, dashboards and phone calls.
That gap is no longer a back-office inconvenience. It is a regulatory exposure, and the regulators have said so plainly.
Resilience stopped being a document
The UK’s operational-resilience regime (FCA PS21/3 and PRA SS1/21) reframed the whole problem. Banks must identify their important business services, set an impact tolerance for each (the maximum tolerable disruption), map the people, processes, technology and third parties each service depends on, and scenario-test that they can stay within tolerance. Since 31 March 2025, firms are expected to be able to remain within those tolerances. The EU’s Digital Operational Resilience Act (DORA), applicable since 17 January 2025, pushes in the same direction with ICT risk management, incident reporting on a clock, and, importantly, scope that reaches critical third-party ICT providers.
The common thread is that resilience is now a demonstrable capability, not a policy you can write down. And “impact tolerance” has a hidden requirement buried inside it: to know whether you are within tolerance, you must know, quickly, which business service is impaired and what the customer harm is. You cannot manage a tolerance you cannot measure in the moment.
Most resilience programmes answer this with documents: service maps in a spreadsheet, runbooks in a wiki, a CMDB someone updates by hand. Those drift from reality the day they are written. The architectural answer is different, and it is better.
The move: a platform that watches itself
The same property that makes an agentic platform governable — events, authority and evidence captured as the work happens, on a governed event spine — is the property that lets it watch itself.
Picture the platform built so that:
- Components and agents self-register. Every service and agent declares its identity, version and dependencies to the bank’s configuration store as it deploys — discovery and service mapping, not a hand-maintained inventory. The map of what depends on what is generated by the estate, not typed into a wiki.
- Incidents assemble themselves. Health and event telemetry correlate into a single incident automatically rather than arriving as a pager storm of disconnected alerts. (This is the discipline the industry calls event management and AIOps; the principle is older and simpler — silent is not stuck, so make the system say what it is doing.)
- The platform diagnoses by walking its own graph. Because the dependency map is live, the platform can trace from a failing component up to the services that sit on it, and down to the work that was in flight.
- It reports the blast radius. The output is the answer the regulator asks for: which transactions failed, and which customers missed a notification, while the service was offline — mapped to the important business service and its impact tolerance.
---
title: "BLUEPRINT — illustrative figures, not live customer data"
---
flowchart TB
  SVC["Notification service — offline 14:02–14:18"]
  SVC --> D1["Depends on: payment events"]
  SVC --> D2["Depends on: customer comms"]
  D1 --> TX["142 payments delayed"]
  D2 --> CU["1,200 customers — no payment alert"]
  TX --> IBS["Important business service: payments"]
  CU --> IBS
  IBS --> TOL["Impact tolerance — breach risk flagged"]
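To make "walking its own graph" concrete, here is a minimal sketch in Python. It assumes components have declared their dependencies at deploy time; the component names, the `DECLARED_DEPENDENCIES` map and the `impacted_by` helper are all hypothetical, invented for illustration rather than taken from any running platform.

```python
from collections import defaultdict, deque

# Hypothetical declarations: each component names what it depends on,
# as it might when self-registering at deploy time.
DECLARED_DEPENDENCIES = {
    "notification-service": ["payment-events", "customer-comms"],
    "payments": ["notification-service"],
    "statements": ["customer-comms"],
}


def build_dependents(declared):
    """Invert 'X depends on Y' into 'Y is depended on by X'."""
    dependents = defaultdict(set)
    for component, deps in declared.items():
        for dep in deps:
            dependents[dep].add(component)
    return dependents


def impacted_by(failing, declared):
    """Breadth-first walk upward from a failing component to everything that sits on it."""
    dependents = build_dependents(declared)
    seen, queue = set(), deque([failing])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:  # also guards against dependency cycles
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(impacted_by("notification-service", DECLARED_DEPENDENCIES)))
print(sorted(impacted_by("customer-comms", DECLARED_DEPENDENCIES)))
```

The point of the design is that the map is an input generated by the estate, so the walk is only ever as stale as the last deployment, not the last wiki edit.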
This is operational resilience as an architecture property, not a reporting workflow. The blast-radius report is not assembled after the fact; it is a projection of the same governed events the platform was already emitting to do its job. Lineage by construction — the discipline that answers BCBS 239’s expectation of risk data “aggregated on a largely automated basis” — turns out to answer the resilience question too. The evidence is the object.
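As an illustration of "a projection of the same governed events", the sketch below filters an event stream to the outage window and summarises the harm. The `Event` shape, the event kinds, the dates and the two-hour tolerance are assumptions made up for the example; a real event spine would define its own schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; field names are assumptions for illustration.
    kind: str            # e.g. "payment.delayed" or "alert.not_sent"
    customer_id: str
    occurred_at: datetime


def blast_radius(events, start, end, tolerance):
    """Summarise harm from events emitted during the outage window [start, end)."""
    in_window = [e for e in events if start <= e.occurred_at < end]
    delayed = [e for e in in_window if e.kind == "payment.delayed"]
    # A set, so a customer who missed several alerts is counted once.
    missed = {e.customer_id for e in in_window if e.kind == "alert.not_sent"}
    duration = end - start
    return {
        "payments_delayed": len(delayed),
        "customers_without_alert": len(missed),
        "outage_minutes": duration.total_seconds() / 60,
        "within_tolerance": duration <= tolerance,
    }


# Illustrative data only: an outage from 14:02 to 14:18 on an arbitrary date.
start = datetime(2025, 1, 1, 14, 2)
end = datetime(2025, 1, 1, 14, 18)
events = [
    Event("payment.delayed", "c1", datetime(2025, 1, 1, 14, 5)),
    Event("alert.not_sent", "c1", datetime(2025, 1, 1, 14, 5)),
    Event("alert.not_sent", "c2", datetime(2025, 1, 1, 14, 10)),
    Event("alert.not_sent", "c3", datetime(2025, 1, 1, 14, 30)),  # after recovery, excluded
]
report = blast_radius(events, start, end, tolerance=timedelta(hours=2))
print(report)
```

Nothing here is gathered after the fact: the report is a query over events the services would already have emitted while doing their work.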
An honesty note, because it is the point of how we work. This self-monitoring, self-diagnosing capability is architectural design — a blueprint — not something running in our demonstrator today. What the live engine does demonstrate is the thing it is built on: evidence emitted as a by-product of every step in a real proposition flow. We would rather show you the one and be precise about the other than imply a finished platform.
Two more pillars: access and control
A self-diagnosing platform answers “what happened and to whom”. Two related questions decide whether the platform is safe to run at all.
Access — zero-trust for agents. NIST SP 800-207 sets the pattern: no perimeter trust, least privilege, assume breach, and a policy decision/enforcement split. The twist for agentic banking is that an AI agent is a new privileged, non-human identity. The discipline that matters is that its authority is per-action and per-context — scoped to a single task invocation, not an ambient role. That is the line between an agentic gateway where agents propose and the gateway disposes, and an agent handed a broad token that can reach everything that token can touch. And authority on the event is only ever as trustworthy as the mandate registry behind it — which is why the mandate’s lifecycle (issuance, revocation, expiry, delegation) deserves to be a governed product in its own right.
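The difference between per-action authority and an ambient role can be sketched in a few lines. The `Mandate` record and the `decide` function below are hypothetical, and this is not an implementation of NIST SP 800-207; it only illustrates a policy decision that checks one agent, one action and one resource against a mandate that can expire or be revoked.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Mandate:
    # Hypothetical mandate record; a real registry would also govern delegation.
    agent_id: str
    action: str
    resource: str
    expires_at: datetime
    revoked: bool = False


def decide(mandate, agent_id, action, resource, now):
    """Allow only the exact action on the exact resource, while the mandate is live."""
    return (
        not mandate.revoked
        and now < mandate.expires_at
        and mandate.agent_id == agent_id
        and mandate.action == action
        and mandate.resource == resource
    )


now = datetime(2025, 1, 1, 14, 0, tzinfo=timezone.utc)
mandate = Mandate(
    agent_id="agent-7",
    action="propose_payment",
    resource="account:123",
    expires_at=datetime(2025, 1, 1, 14, 5, tzinfo=timezone.utc),
)

# The agent may do the one thing it was mandated to do...
print(decide(mandate, "agent-7", "propose_payment", "account:123", now))
# ...but not a different action on the same resource.
print(decide(mandate, "agent-7", "execute_payment", "account:123", now))
```

In this sketch the agent only proposes; enforcement would sit in the gateway that calls `decide`, which keeps the decision and enforcement points separate.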
Control — sovereignty, properly distinguished. Banks conflate three different things. Residency is where the data sits. Data sovereignty is whose law can compel access to it — and in-country residency does not stop an extraterritorial demand. Operational sovereignty is who can actually run, change and see the system. DORA’s reach into critical third-party providers makes this concrete: if your resilience depends on a provider whose control plane you do not hold, you have a dependency you cannot fully answer for. The architectural responses are bank-held keys and custody, tested exit and portability, and owning the data foundations while partnering for the underlying models — never the reverse. This is the Accountability–Sovereignty Principle in operational form: you cannot be accountable for what you do not control.
Where to start
You do not begin a resilience programme by buying an observability tool. You begin by choosing one important business service and asking the architecture question of it: if this fails, can the platform itself tell us the customer harm, fast enough to stay within tolerance, and can it prove afterwards exactly what happened and who could act? If the answer depends on people reconstructing it from logs, the gap is architectural, and a tool will not close it.
That is what our six-week Proposition Blueprint is for: take one service, map it as the regime expects, and design the event spine, the self-declaring estate and the evidence trail that make resilience a property of the system rather than a quarterly assertion. You can see how the platform is shaped on the Platform and Resilience pages, and the live proof — evidence by construction — in Family Wealth.
Resilience regimes asked banks to prove they can stay within tolerance and name the harm when something breaks. You can answer that with a thicker binder, or you can answer it with an architecture that watches itself. Only one of them is still true at 14:02 on a bad day.
FCA PS21/3, PRA SS1/21, EU DORA and NIST SP 800-207 are referenced as industry frameworks, not as endorsements of this approach. The self-monitoring capability described here is design intent; figures in the diagram are illustrative.