The framing the industry repeats
If you read enterprise AI privacy posts from the past two years, one claim shows up in almost every one of them: you cannot do PII protection on streaming AI responses. The argument goes something like this. Streaming sends tokens to the client as the model generates them. PII detection needs a sentence-level window — names span words, IBANs span a country-specific character count, addresses span clauses. Run a recogniser on each chunk and you miss everything that crosses a chunk boundary. Wait for the full response and you've lost the latency benefit that made streaming worth doing in the first place. Therefore: pick streaming or pick PII protection, not both.
The argument is correct as stated. It is also a non-sequitur as soon as you separate the two distinct problems people are conflating when they say "PII protection on streaming responses." Lucairn shipped streaming on all four of our LLM endpoints (/v1/messages, /v1/chat/completions, /api/v1/mcp/messages, /api/v1/proxy/messages) on 2026-05-07, gated by STREAMING_ENABLED on the gateway and ON by default on hosted gateway.lucairn.eu. We did not solve the unsolvable. We separated two problems and shipped the smaller one.
This post explains the separation, the architectural reason it works in our specific shape, and what we deliberately did not build.
The two problems people conflate
Here are the two problems lurking inside "PII protection on streaming AI responses."
Problem A — output-side raw-PII detection. The model emits text token by token. Some of that text might contain PII. Detect it before each chunk leaves the gateway, redact it, and only forward the safe version to the client. This is the unsolvable-at-low-latency problem. Sentence-window NLP recall on partial text is bad. Hash-based pattern matching needs the full pattern. Edge cases multiply: what if a name spans three chunks? What if an IBAN is split across the chunk boundary by tokenisation? What if the model emits a real email address one character at a time over four streamed deltas? Industry research has not produced a low-latency, high-recall sentence-window NLP solution for this, and the operational risk of shipping anything less than high-recall is large: a single missed pattern is a real-world PII leak that you advertised your system would prevent.
Problem B — placeholder relink in stream. The model emits text token by token. Some of that text might contain placeholders that the gateway issued before the prompt was sent to the model — [PERSON_1], [EMAIL_2], [IBAN_3]. Map those placeholders back to the original values for whichever tier of the customer agreed to relink output. This is solvable. The placeholders are a known regex (\[[A-Z_0-9]+_\d+\]), bounded in length (≤24 characters), and emitted by the model verbatim because they were in the prompt verbatim. The only complication is that tokenisation can split a placeholder across chunk boundaries — [PERSON_ arrives in one SSE delta, 1] arrives in the next.
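To make the mechanics concrete, here is a minimal Go sketch of the complete-text relink step, assuming the regex above; the function name and the mapping values are invented for illustration, not the gateway's actual identifiers.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholderRe is the placeholder pattern quoted above; every placeholder
// the gateway issues matches it and stays within the 24-byte bound.
var placeholderRe = regexp.MustCompile(`\[[A-Z_0-9]+_\d+\]`)

// relink replaces every complete placeholder with its original value and
// leaves anything it does not recognise untouched.
func relink(text string, mapping map[string]string) string {
	return placeholderRe.ReplaceAllStringFunc(text, func(ph string) string {
		if orig, ok := mapping[ph]; ok {
			return orig
		}
		return ph
	})
}

func main() {
	mapping := map[string]string{
		"[PERSON_1]": "Marc",
		"[EMAIL_2]":  "marc@example.com",
	}
	fmt.Println(relink("Contact [PERSON_1] at [EMAIL_2].", mapping))
	// Contact Marc at marc@example.com.
}
```

On a complete, non-streamed response this is essentially the whole job; the rest of this post is about what happens when the same text arrives in SSE-sized pieces.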
The whole industry conflates these two problems because most architectures do not have placeholders. Vault-based vendors like Skyflow tokenise structured data at rest but do not proxy the LLM response stream — there is no relink-in-stream problem to solve because the data never went through their layer. LLM gateways like Cloudflare AI Gateway or Helicone proxy responses but do not do PII pseudonymisation upstream — there is no placeholder to relink because no placeholder was ever issued. The combination of (a) gateway issues placeholders before the model sees the prompt and (b) gateway is in the response stream is rare. When you have both, Problem B becomes a small, mechanical engineering task that has nothing to do with NLP.
Why Problem A is the wrong layer in our architecture
The split-knowledge architecture Lucairn ships is documented at length in split-knowledge-architecture. The relevant property here is that the sanitiser runs before the prompt leaves the gateway. By the time any byte reaches the upstream model, identifiers like names, emails, IBANs, dates of birth, and customer references have already been replaced with numbered placeholders. The model never sees the raw values.
This is the architectural fact that makes Problem A redundant rather than just hard. The standard motivation for output-side raw-PII detection is "the model might emit PII it saw in the prompt or training data." In Lucairn's pipeline the model has not seen any PII in the prompt — the sanitiser stripped it. So if the model emits something that looks like PII, such as an email address, it is either (a) a hallucination unrelated to your data or (b) a placeholder we issued, sometimes split across chunk boundaries. Hallucinated PII is an unbounded category that no recogniser would catch reliably anyway; the right defence against hallucinated PII is the upstream model provider's content filter, not a per-chunk regex on our side.
The architectural rationale is in the gateway code at services/gateway/internal/api/proxy.go:616-628. The comment block there explicitly notes that output-side scanning would be checking the sanitiser's homework — papering over sanitiser bugs rather than solving a real problem. If sanitiser recall on a customer's data is genuinely poor, the right fix is to invest in the sanitiser (more recognisers, customer-supplied known-entity lists, the Enterprise-tier custom-trained level-3 PII shield), not to add an output-side regex pass that compensates for upstream gaps.
Honesty caveat: this is only a valid argument as long as sanitiser recall actually is good. We track that with a 1,000-payload eval dataset across English and German, plus property-based tests on the sanitiser's individual layers (L1 known-entity matching, L2 Presidio NER, L3 LLM PII Shield). The eval-dataset score on the current build is in the changelog. If we ever ship a sanitiser regression that drops recall below the bar, output-side scanning still does not become the right layer — fixing the sanitiser does. The architecture commits us to that order.
What Problem B looks like when you take it seriously
Problem B is what we actually have to solve in stream. Here is the shape of it concretely.
The placeholders we issue match the regex \[[A-Z_0-9]+_\d+\]. By construction they are bounded — the longest placeholder we issue today is around 22 characters ([GERMAN_MEDICAL_TERM_999] style), and we cap the regex at 24 bytes. Within a single SSE chunk we can match-and-replace with a stateless strings.ReplaceAll and ship the chunk. Across SSE chunks the situation is different.
Tokenisation operates on bytes, not on the model's intent to emit a placeholder atomically. A typical Anthropic SSE delta might contain the bytes "Reach out to [PERSON_", with the next delta containing "1] for that ticket.". Two stateless strings.ReplaceAll calls — one per chunk — leak both fragments unrelinked: the client receives Reach out to [PERSON_ followed by 1] for that ticket., and the relink never fires because neither chunk contains a complete placeholder match.
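Here is a hedged sketch of exactly that failure mode, using the example split above (the real split point varies with the upstream tokeniser, which is what makes the bug intermittent):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	mapping := map[string]string{"[PERSON_1]": "Marc"}

	// Two SSE deltas as the upstream model might emit them; tokenisation
	// has split the placeholder across the chunk boundary.
	chunks := []string{"Reach out to [PERSON_", "1] for that ticket."}

	// Naive per-chunk relink: a stateless ReplaceAll on each delta.
	var forwarded strings.Builder
	for _, chunk := range chunks {
		for ph, orig := range mapping {
			chunk = strings.ReplaceAll(chunk, ph, orig)
		}
		forwarded.WriteString(chunk)
	}

	fmt.Println(forwarded.String())
	// Reach out to [PERSON_1] for that ticket.
	// Neither chunk contained the complete placeholder, so no replacement
	// fired and the client sees the raw placeholder instead of "Marc".
}
```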
The fix is a stateful Relinker with a bounded tail buffer. The algorithm has two operations: Feed(chunk) and Close(). Pseudocode:
```
struct Relinker {
    pending: bytes              // tail buffer, ≤24 bytes
    mapping: map[string]string  // [PERSON_1] → "Marc"
}

fn Feed(chunk):
    combined = pending + chunk
    // Find safe split point: the last position where any
    // remaining suffix could not possibly start a placeholder.
    safe = scan_back_to_safe_boundary(combined)
    emit_part = relink(combined[0..safe], mapping)
    pending = combined[safe..]   // up to 24 bytes
    return emit_part

fn Close():
    // Emit any remaining buffered bytes after a final relink pass.
    return relink(pending, mapping)
```

The "safe split point" is the key invariant. We walk backwards from the end of the combined buffer and stop at the last byte that cannot possibly be the start of a placeholder match continuing past the current chunk. In practice this means: if the combined buffer ends with ...something [PERS, we hold that trailing [PERS fragment back in pending. If the buffer ends with ...something complete., the safe boundary is at the end and we flush everything. The invariant is provable: any placeholder beginning before the safe boundary is fully contained in the bytes we already emitted, and any placeholder beginning at-or-after the safe boundary is fully contained in pending (because placeholders are ≤24 bytes and pending keeps the last ≤24 bytes).
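For readers who want something more concrete than pseudocode, here is a minimal, self-contained Go sketch of the same algorithm. The identifiers (Relinker, NewRelinker, safeBoundary, maxPlaceholderLen) are illustrative rather than the production names in streaming.go, and the sketch ignores SSE event framing, which the production streamer also has to handle.

```go
package relink

import "regexp"

const maxPlaceholderLen = 24 // placeholders are bounded, as described above

var placeholderRe = regexp.MustCompile(`\[[A-Z_0-9]+_\d+\]`)

// Relinker rebuilds placeholders that tokenisation may have split across
// SSE chunk boundaries, holding back at most maxPlaceholderLen bytes.
type Relinker struct {
	pending []byte            // tail buffer, ≤ maxPlaceholderLen bytes
	mapping map[string]string // "[PERSON_1]" → "Marc"
}

func NewRelinker(mapping map[string]string) *Relinker {
	return &Relinker{mapping: mapping}
}

// Feed consumes one chunk and returns the bytes that are safe to emit now.
func (r *Relinker) Feed(chunk []byte) []byte {
	combined := append(r.pending, chunk...)
	safe := safeBoundary(combined)
	out := r.relink(combined[:safe])
	r.pending = append([]byte(nil), combined[safe:]...) // keep ≤ maxPlaceholderLen bytes
	return out
}

// Close flushes whatever is still buffered once the stream ends.
func (r *Relinker) Close() []byte {
	out := r.relink(r.pending)
	r.pending = nil
	return out
}

func (r *Relinker) relink(b []byte) []byte {
	return placeholderRe.ReplaceAllFunc(b, func(ph []byte) []byte {
		if orig, ok := r.mapping[string(ph)]; ok {
			return []byte(orig)
		}
		return ph
	})
}

// safeBoundary returns an offset such that no placeholder starting before it
// can continue past the end of buf. It is deliberately conservative: it may
// hold back a '['-prefixed suffix that turns out not to be a placeholder.
func safeBoundary(buf []byte) int {
	start := len(buf) - maxPlaceholderLen
	if start < 0 {
		start = 0
	}
	for i := len(buf) - 1; i >= start; i-- {
		switch buf[i] {
		case '[':
			// A placeholder may have started here and not closed yet.
			return i
		case ']':
			// The nearest bracket from the end is a close bracket, so no
			// placeholder can still be in progress.
			return len(buf)
		}
	}
	// No bracket within the last maxPlaceholderLen bytes: nothing can be split.
	return len(buf)
}
```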
Latency cost: ~30–80 ms of TTFT delay from the bounded tail. For chat applications, RAG agents, and document-summarisation tools this is invisible. For voice applications targeting sub-100 ms TTFT it is over budget — we say so explicitly later.
The correctness property is that for every byte-offset split position in a non-streaming response, the streamed reconstruction must produce identical output. We test this with a property-based test that takes a non-streamed Lucairn response, splits it at every possible byte offset, feeds the splits into the streaming Relinker, and asserts byte-for-byte equality with the non-streamed output. The test catches multi-chunk-split bugs that point-fixture tests miss because the fixture rarely covers the exact tokeniser boundary that the upstream model picked on a given run. The implementation lives in services/gateway/internal/api/streaming.go; the property test lives next to it.
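Reusing the Relinker sketch above, the core of such a property test can be as small as the following; the hard-coded string stands in for a real non-streamed gateway response, which is what the actual test exercises.

```go
package relink

import (
	"bytes"
	"testing"
)

// TestRelinkerSplitInvariance asserts the property described above: for every
// byte offset at which the upstream response could be cut into two deltas,
// the streamed reconstruction equals the non-streamed output byte for byte.
func TestRelinkerSplitInvariance(t *testing.T) {
	mapping := map[string]string{
		"[PERSON_1]": "Marc",
		"[EMAIL_2]":  "marc@example.com",
	}
	full := []byte("Reach out to [PERSON_1] at [EMAIL_2] for that ticket.")

	// Reference: feed the whole response in one piece (the non-streamed case).
	ref := NewRelinker(mapping)
	want := append(ref.Feed(full), ref.Close()...)

	// Property: splitting at any byte offset must not change the output.
	for i := 0; i <= len(full); i++ {
		r := NewRelinker(mapping)
		var got []byte
		got = append(got, r.Feed(full[:i])...)
		got = append(got, r.Feed(full[i:])...)
		got = append(got, r.Close()...)
		if !bytes.Equal(got, want) {
			t.Fatalf("split at byte %d: got %q, want %q", i, got, want)
		}
	}
}
```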
What we deliberately did NOT build
There are three tempting features adjacent to streaming that we did not build, and the reasons matter.
Per-chunk raw-PII NLP scan on output. This is Problem A, which we discussed above. We did not build it because it is the wrong layer in our architecture and because shipping a low-recall version would let us claim something we cannot actually deliver. If a future regulator or customer requirement makes output-scanning unavoidable (for example, if Lucairn is ever asked to proxy responses from a model that did not go through our sanitiser on input), we will revisit — but the answer will most likely be a separate sanitiser pass at a different boundary, not a per-chunk regex inserted into the streamer.
Voice / realtime sub-100 ms streaming. The ~30–80 ms TTFT delay from the 24-byte tail buffer pushes us past most voice budgets. We could shrink the buffer but cannot eliminate it: any placeholder-relink-in-stream design needs a buffer at least as long as the longest placeholder, otherwise the cross-chunk-split bug returns. For voice applications targeting sub-100 ms TTFT the right answer is a different gateway path that emits non-relinked placeholders, with the customer's voice client doing the relink on its own side. We will ship that when a voice customer asks for it; today no production traffic needs it.
Tool-call argument sanitisation. Tool-calls and function-calling are tracked separately on the /integration capability matrix because the sanitiser does not currently forward tools or tool_choice fields through the gateway handlers. Placeholder relink-in-stream of tool-call argument JSON bodies works through the same regex (the placeholders are inside JSON string values), so when we do ship tool-call sanitisation the streaming side comes for free. The blocker is the input-side sanitiser pass, not the streamer. Sanitising tool-call definitions versus arguments has correctness traps that are easy to ship wrong; we'd rather not have a half-feature live.
The competitive narrative (without overclaiming)
What did we actually ship? A small, mechanical, correctness-focused piece of streaming infrastructure that handles the 90 % of LLM workloads — text in, text out, with redaction proof — that our customers actually run. We did not invent low-latency sentence-window NLP. We did not ship a research paper. We separated two problems that the industry conflates and shipped the one we can ship at the recall bar that production traffic deserves.
The architectural primitive that makes our version tractable is the numbered placeholder. The placeholder is what the LLM emits verbatim because it received it verbatim, and because it is ≤24 bytes long it can be rebuilt across SSE chunk boundaries with a small bounded tail buffer. Vault-based privacy vendors do not have placeholders in the response stream because they are not in the response path. LLM gateways without sanitisation upstream do not have placeholders because nothing was redacted to relink. Lucairn's specific shape — gateway sanitises input, then sees response stream — is what makes Problem B small enough to engineer cleanly. It is also why our streaming story does not generalise to, say, a prompt-only DLP product: the architecture has to reach both ends of the call.
A real correctness fix shipped here too. Before the cross-chunk Relinker, a stateless strings.ReplaceAll per chunk leaked unrelinked placeholder fragments any time tokenisation split a placeholder across an SSE delta boundary. The fragments were valid HTML / safe characters, so they did not break clients — but they did appear as raw [PERSON_1]-shaped strings in the user-visible response on the customer-relink-enabled tier. This was a real production bug, not just a marketing story.
For teams choosing an LLM privacy layer in 2026, the test we would suggest is structural rather than feature-based. Ask the vendor whether they (a) sanitise the input before the model sees it, (b) appear in the response stream, and (c) ship a property-based test that the streamed reconstruction equals the non-streamed output across every byte-offset split. If the answer to (a) is no, the privacy guarantee is policy not architecture. If (b) is no, the streaming response is opaque to the privacy layer entirely. If (c) is no, you should not assume the cross-chunk-split case is handled correctly even if the demo looks fine — it is the kind of bug that ships latently because it depends on tokenisation patterns the QA fixture does not cover.
Where Lucairn fits
Lucairn's gateway sits between your application and the upstream LLM provider. Every request goes through the same pipeline whether it is streamed or not: input is sanitised, identifiers are replaced with numbered placeholders, the prompt is sent to the upstream model, the response is captured. For streaming requests, the gateway additionally runs the bounded-buffer Relinker over the SSE response stream so the tier-appropriate output (placeholders for Developer, optional relink for Pro and Enterprise) reaches the client without the cross-chunk-split bug.
The same Lucairn Certificate is generated for streamed and non-streamed requests — the per-call signed evidence (ai-act-article-12-logging-in-practice) records which sanitiser layers fired, which placeholders were issued, and the upstream model identifier. Streaming does not weaken the evidence chain; the certificate is anchored to the same canonical request hash.
If you are building a streaming chat application or a streamed-response RAG agent and want the input PII protection without rewriting your client to do per-chunk relink yourself, the proxy is a drop-in: change base_url, set STREAMING_ENABLED=true on a self-hosted Lucairn (already ON for hosted), and the existing SDK streaming pattern works. The capability matrix at /integration lists every endpoint and the honest gaps that remain — tool-calls and multimodal are still among them; streaming is not.
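For a concrete picture of the drop-in, here is a hedged Go sketch of a streamed request against the hosted gateway using nothing but the standard library. The model name, the LUCAIRN_API_KEY environment variable, and the bearer-auth header are illustrative assumptions, not documented API details.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Point the existing OpenAI-compatible call at the Lucairn gateway; the
	// hosted gateway already has streaming enabled.
	baseURL := "https://gateway.lucairn.eu" // or a self-hosted gateway with STREAMING_ENABLED=true
	body := []byte(`{
		"model": "gpt-4o-mini",
		"stream": true,
		"messages": [{"role": "user", "content": "Summarise the ticket from Marc (marc@example.com)."}]
	}`)

	req, err := http.NewRequest("POST", baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	// Assumption: bearer auth with your Lucairn key, as on OpenAI-compatible endpoints.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("LUCAIRN_API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Read the SSE stream line by line; each delta has already been through
	// the gateway's sanitise-and-relink pipeline.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}
}
```

The stream that comes back has already been through the Relinker, so the tier-appropriate output (placeholders or relinked values) arrives without the cross-chunk-split bug described above.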