Architecture · Redaction vs split-knowledge

Redaction is a promise.
Architecture is a guarantee.

Four approaches to keeping PII out of AI inference: provider-side filters, client-side redaction libraries, AI-gateway proxies, or infrastructure-level split-knowledge. Three of them are software promises: one bug and PII leaks. The fourth is an architectural property: the AI cannot see identity because the route does not exist.

TL;DR

Redaction is a software promise: code reads input, removes identifiers, then sends the rest to the model. A bug, a regex miss, a context-dependent identifier the matcher didn't catch — and PII reaches the model. Split-knowledge is an architectural property: identity data lives in Sandbox A, inference runs in Sandbox B, and there is no network path from B back to A. The difference matters when an auditor asks not "are you trying to keep PII out?" but "can you prove PII never crossed?"
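To make the failure mode concrete, here is a deliberately naive redactor (a hypothetical sketch, not anyone's production code): the patterns catch exactly what they were written for, and nothing else.

```python
import re

# Pattern-bound redaction: catches what the patterns anticipate.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

note = ("Patient j.smith@example.com, the only Huntington's case "
        "at the Elm Street clinic, cancelled today.")

print(redact(note))
# "Patient <EMAIL>, the only Huntington's case at the Elm Street
#  clinic, cancelled today."
# The email is gone, but the context-dependent identifier ("the only
# Huntington's case at the Elm Street clinic") reaches the model.
# Under split-knowledge the same miss never leaves Sandbox A.
```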

01 · Four approaches

How teams keep PII out of the model today.

Three software approaches and one architectural one. Each protects against a different failure mode; each fails under a different one. Pick the lightest approach that survives your regulator's review.

Approach 01

Provider-side privacy mode

Vendor-provided privacy modes: the LLM provider's content filtering and data-handling settings. The vendor promises not to log or train on your data. Inference still sees the full input including PII. Compliance is policy-bound, not architectural.

Fine for non-regulated tooling. Fails any audit that requires the AI to provably not see identity.

Approach 02

Client-side redaction libraries

Presidio, spaCy NER, regex matchers, custom Python. Your application strips identifiers before the LLM call. Mappings (token → real value) typically held in app memory or a side database. Coverage is matcher-quality bound; context-dependent PII often missed.

Standard for early-stage products. Fails when an auditor asks for proof that redaction happened on every call.
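A minimal sketch of this approach using Presidio (a real library; the final send is a placeholder). Note what the comments say: only spans a recogniser fires on get replaced, and everything else travels to the model as-is.

```python
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # spaCy NER + built-in regex recognisers
anonymizer = AnonymizerEngine()

text = "Maria Keller (IBAN DE89370400440532013000) disputed the charge."

# Detection is matcher-quality bound: only spans a recogniser
# fires on get replaced.
findings = analyzer.analyze(text=text, language="en")
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)

print(redacted.text)  # e.g. "<PERSON> (IBAN <IBAN_CODE>) disputed the charge."
# send_to_llm(redacted.text)   <- placeholder call; whatever the
# matchers missed travels with it
```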

Approach 03

AI-gateway proxy redaction

Cloudflare AI Gateway, Lakera, Robust Intelligence — middleware that sits between your app and the LLM, redacts on the way out, and re-hydrates on the way back. Centralised redaction policy; better than client-side. Still software, still bug-shaped.

Right when client-side is unmanageable. Fails when the gateway itself is the trust boundary the auditor pushes on.
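Stripped to its essentials, every gateway in this category runs some variant of the loop below. A hedged sketch: `detect_pii` and `llm` stand in for the vendor's detection engine and the upstream call, which vary by product.

```python
import re
import uuid

def detect_pii(text: str) -> list[str]:
    # Stand-in for the vendor's detection engine; here, just emails.
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

def gateway_call(prompt: str, llm) -> str:
    """Redact on the way out, re-hydrate on the way back."""
    mapping: dict[str, str] = {}   # token -> real value, in gateway memory
    for value in detect_pii(prompt):
        token = f"<PII_{uuid.uuid4().hex[:8]}>"
        mapping[token] = value
        prompt = prompt.replace(value, token)

    answer = llm(prompt)           # the model sees tokens, not values

    for token, value in mapping.items():
        answer = answer.replace(token, value)
    return answer

# Two trust assumptions an auditor will push on: detect_pii must never
# miss, and `mapping` (the full re-identification map) must never leak.
```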

Approach 04 · Lucairn

Infrastructure-level split-knowledge

Sandbox A holds identity (WHO). Sandbox B runs inference (WHAT). There is no network path from B to A. Even Lucairn operators with full Sandbox B access cannot re-identify a single response. Plus: every decision produces a signed receipt anchored in a public log.

Right when procurement requires architectural evidence, not vendor promises.
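The receipt half of this is concrete enough to sketch. A rough illustration with the `cryptography` library; the field names are illustrative, not Lucairn's actual receipt schema.

```python
# pip install cryptography
import hashlib, json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

signing_key = Ed25519PrivateKey.generate()

receipt = {                                # illustrative fields only
    "decision_id": "dec_0192",
    "sanitiser_scheme": "presidio+quasi-id/v3",
    "payload_sha256": hashlib.sha256(b"<de-identified payload>").hexdigest(),
}
message = json.dumps(receipt, sort_keys=True).encode()
signature = signing_key.sign(message)      # 64-byte Ed25519 signature

# Anchoring message + signature in a public transparency log (e.g.
# Sigstore Rekor) is what makes the receipt independently verifiable
# rather than just another vendor log line.
```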

02 · Compare

Eight criteria, four approaches.

The criteria below are what a DPO, CISO, or external auditor will actually push on. Lucairn's split-knowledge architecture wins on five, ties on two, loses on one (operational burden).

| Criterion | Provider-side | Client redact | AI gateway | Lucairn split-knowledge |
|---|---|---|---|---|
| PII never reaches the inference model | No (vendor sees raw input) | Partial (matcher-quality bound) | Partial (matcher-quality bound) | Strong (architectural: no path) |
| Failure mode if redaction is incomplete | PII to vendor | PII to vendor | PII to vendor | Contained (bug stays in Sandbox A) |
| Per-decision proof of redaction | No | No | Partial (gateway logs) | Yes (signed sanitiser manifest) |
| Re-identification map custody | Vendor or none | Partial (app memory / DB) | Partial (gateway memory) | Yes (Sandbox A only) |
| Coverage of context-dependent PII | None (no redaction) | Partial (pattern-bound) | Partial (pattern-bound) | Strong (three-layer detection + zone isolation) |
| Cryptographic audit trail | No | No | Partial (best-effort) | Yes (Ed25519 + Sigstore Rekor) |
| Vendor / deployment portability | No (vendor lock-in) | Yes (library swap) | Partial (gateway lock-in) | Yes (open protocol) |
| Operational burden | Lowest | Moderate (app-coded) | Moderate (service to operate) | Moderate (bridge + witness) |
03 · When to choose what

Each approach is the right answer somewhere.

Honest framing: not every workload needs split-knowledge. Pick the lightest approach your audit will accept.

Provider-side privacy mode (Approach 01) when…
  • Internal tooling, non-customer-facing decisioning
  • No regulator with audit authority over the data path
  • Vendor's privacy contract is acceptable evidence
  • Engineering convenience outweighs compliance depth
Client-side or gateway redaction (02 / 03) when…
  • PII detection is a defence-in-depth layer, not the primary control
  • Your auditor accepts software-based redaction with logs
  • Internal use; PII categories are well-bounded by patterns
  • You need to ship before the architectural option is operationally feasible
Split-knowledge (Approach 04) when…
  • Customer-impacting AI decisions in regulated industries
  • An external auditor will challenge the redaction integrity
  • Context-dependent PII is in scope (clinical notes, contracts)
  • DORA Art 28 or EU AI Act Art 12 is in your future
  • Procurement requires architectural evidence, not vendor promises
04 · Frequently asked

Redaction vs split-knowledge: questions, answered.

Isn't a good redaction library 'good enough'?

It depends on what you're protecting against. For pattern-bound PII (IBANs, phone numbers, names in structured fields), a good library catches 95%+. For context-dependent PII (medical condition tied to a free-text identifier, transactional details that quasi-identify a customer), pattern matching misses. The deeper issue isn't the matcher quality — it's the failure mode. When redaction misses, PII reaches the model. When split-knowledge "misses," the bug still lives in Sandbox A; the model still cannot see it because the network path doesn't exist.

What about provider privacy modes — aren't those enough?

Provider privacy modes are a promise that the vendor will not log or train on your data. They do not change what the model sees during inference. The model still processes the raw input including identity. For regulated work where the regulator's question is "prove the AI didn't see personal data," a vendor's promise that they won't keep it isn't the same as proof it never reached the inference path. EU AI Act Art 13 transparency and GDPR Art 25 by-design are about the data path, not the vendor's logging policy.

Is Lucairn's architecture overkill for typical SaaS?

For non-regulated SaaS, yes: the operational overhead of running a gateway, bridge, and witness isn't worth it if your AI is internal tooling and no auditor will examine the data path. For regulated work, the calculus inverts: provider-side redaction or client-side libraries leave you defending a software promise in front of an auditor, which is not where you want to be. Lucairn is operationally heavier than gateway redaction (there's a bridge and a witness to run), but it changes the conversation from "trust us" to "here's the receipt."

Can I combine redaction libraries with Lucairn?

Yes; this is the production default. Lucairn's sanitiser uses Presidio plus a quasi-identifier risk engine inside Sandbox A. The redaction libraries are the matcher; the architecture bears the load when they miss. Together they give you both pattern-bound and architectural coverage, as sketched below. A custom-trained PII shield model fitted to your domain corpus is available as an Enterprise-only option (priced per scope).
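In outline, the layering looks like this (a sketch: `quasi_identifier_risk` and the 0.2 threshold are placeholders for the risk engine, whose internals aren't public):

```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

def quasi_identifier_risk(text: str) -> float:
    """Stand-in for the risk engine: scores the re-identification
    potential of the residual free text (0.0 = safe)."""
    return 0.0  # placeholder

def sanitise(text: str) -> str:
    # Layer 1: pattern-bound matching (Presidio). Replace from the
    # end of the string so earlier offsets stay valid.
    findings = analyzer.analyze(text=text, language="en")
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"<{f.entity_type}>" + text[f.end:]

    # Layer 2: quasi-identifier scoring. On doubt, hold the payload
    # in Sandbox A rather than forward it across the bridge.
    if quasi_identifier_risk(text) > 0.2:   # threshold is illustrative
        raise RuntimeError("payload held in Sandbox A for review")
    return text
```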

What if a context-dependent identifier slips through the sanitiser anyway?

Two things happen. First, the bug stays in Sandbox A — the model in Sandbox B still cannot see it, because the bridge only carries the de-identified payload (whatever the sanitiser produced). Second, the receipt records the sanitiser scheme used. If a class of identifiers turns out to be undermatched, you can identify the affected receipts retroactively by querying for the scheme version — the audit chain helps you scope the incident, rather than complicating it.
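Scoping that incident is a filter over the audit log rather than a forensic hunt. A sketch, reusing the illustrative `sanitiser_scheme` field from the receipt example above (not a published schema):

```python
def affected_receipts(receipts, bad_scheme: str):
    """Every receipt records the sanitiser scheme that produced its
    payload, so an undermatching incident is scoped by one filter."""
    return [r for r in receipts if r["sanitiser_scheme"] == bad_scheme]

# Usage, where audit_log is any iterable of verified receipt dicts:
# incident = affected_receipts(audit_log, "presidio+quasi-id/v3")
```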

Does split-knowledge work for context-rich inputs (clinical notes, legal contracts)?

Yes, and that's where the architectural property matters most. Clinical notes and contracts are full of context-dependent PII that pattern matching misses. Lucairn's three-layer sanitiser (Presidio + quasi-identifier risk + an optional custom-trained PII shield on the Enterprise tier) handles roughly 90% of these cases. Whatever slips through stays in Sandbox A; the model never sees it. That's the architectural payoff in practice.

05 · Get started

From assessment to production.

Run the self-serve assessment against your AI workflow and see whether software-based redaction is enough or whether split-knowledge is the right call. 15 minutes. Output goes to your DPO.