What Article 12 actually requires
Article 12 of Regulation (EU) 2024/1689 — the EU AI Act — sits at the structural centre of the high-risk AI obligations. Its title, "Record-keeping," makes the clause sound administrative. Read the text and that impression collapses.
Article 12(1) is the operative obligation: high-risk AI systems must "technically allow for the automatic recording of events ('logs') over the lifetime of the system." Two phrases in that sentence carry the weight. "Technically allow" rules out a logging policy that depends on operators remembering to enable it. "Over the lifetime" extends the obligation past initial deployment, through every update, retraining round, and configuration change.
Article 12(2) requires that logging capabilities "ensure a level of traceability of the AI system's functioning that is appropriate to its intended purpose." Note the standard: not a level of verbosity, not a level of detail — a level of traceability. The benchmark is whether a third party can reconstruct what happened.
Article 12(3) then enumerates four mandatory log targets for high-risk systems falling under Annex III point 1(a) (remote biometric identification): the period of each use, the reference database checked, input data that yielded a match, and the natural persons involved in result verification. Other Annex III categories inherit the general traceability obligation from Article 12(1)–(2) without that specific enumeration, but the principle scales: whatever a competent authority would need to investigate a malfunction or a discriminatory pattern must be recoverable from the log.
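Reduced to a record shape, those four mandatory targets look something like the sketch below. The field names are ours, not the Regulation's, and a real schema would carry more context:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class BiometricUseRecord:
    """Illustrative mapping of the four Article 12(3) log targets."""
    use_start: datetime           # (a) period of each use: session start...
    use_end: datetime             # (a) ...and session end
    reference_database: str       # (b) database the input data was checked against
    matched_input_ref: str        # (c) reference to the input data that yielded a match
    verifying_persons: list[str]  # (d) natural persons who verified the result (Art. 14(5))
```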
In our category framework, Article 12 (logging) and Article 14 (human oversight) sit together as Cat 2, the evidence category. They are designed to function together: Article 12 produces the evidence; Article 14 is the human reviewer who reads it. The artefacts the system produces are meant to be reviewable by a person who was not in the request loop, so a logging implementation that doesn't yield reviewable artefacts fails both articles simultaneously.
What "automatic" means in implementation terms
Most teams reading Article 12 reach for their existing application logging. The application emits structured JSON to a log aggregator. That aggregator persists to object storage. There is a retention policy. Done.
The regulator did not have that picture in mind. "Automatic" in Article 12(1) is a load-bearing word. It distinguishes architectural logging — produced as a side effect of the system functioning — from policy-based logging — produced when an engineer remembered to add a log.info line.
A useful test: ask whether your logs would still exist if every member of your engineering team were replaced tomorrow with people who had never seen the application code. If the answer is "no, they'd need to understand which functions to instrument," your logging is policy-based. The Article 12 standard is closer to "the system produces records by virtue of running," not "the system produces records because we configured it to."
This is why a pure application-layer approach struggles. A bug in the logging path is a bug in your compliance posture. An engineer who forgets to instrument a new endpoint creates a hole. A retry loop that swallows an error before logging it produces an incomplete record. None of these failure modes are visible in a code review unless reviewers know specifically to look for logging coverage — and even then, the review is policy.
Architectural logging instead places the recording behaviour outside the application: at the gateway, the proxy, the audit service, the database. The application becomes a participant in a logging system rather than the producer of logs. This is more expensive to set up. It is also the only setup that fails closed.
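To make the distinction concrete, here is a minimal sketch of the architectural pattern at the WSGI layer. A production gateway would run out of process, but the shape is the same: the record is produced because the request crossed a boundary, not because application code chose to log. The `sink` is a hypothetical append-only writer:

```python
import json
import time
import uuid

class RecordingMiddleware:
    """Wraps any WSGI application so every request is recorded as a side
    effect of passing through the boundary, with no cooperation from
    application code."""

    def __init__(self, app, sink):
        self.app = app
        self.sink = sink  # hypothetical append-only writer (file, audit service)

    def __call__(self, environ, start_response):
        request_id = str(uuid.uuid4())
        environ["HTTP_X_REQUEST_ID"] = request_id  # propagate downstream
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        started = time.monotonic()
        try:
            return self.app(environ, recording_start_response)
        finally:
            # Runs even when the application raises: the record exists
            # by virtue of the request crossing the boundary.
            self.sink.write(json.dumps({
                "request_id": request_id,
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                "status": captured.get("status", "unhandled-exception"),
                "latency_ms": round((time.monotonic() - started) * 1000),
            }) + "\n")
```

Note what the middleware cannot do: it cannot forget an endpoint, because it never knew about endpoints in the first place.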
The traceability gap most teams miss
Article 12(2)'s traceability obligation is where most LLM applications quietly fail. The reason is subtle: input/output capture is necessary but not sufficient. A regulator investigating an incident does not want a log line per API call. They want a causal chain.
Consider a high-risk LLM application that screens job applications. A candidate is rejected. The candidate's lawyer requests the basis for the decision. What does the system need to surface?
A naive log line might read: request_id=abc123 input="<prompt>" output="reject" latency_ms=412. That captures input and output, but it doesn't reconstruct the decision. The lawyer cannot tell from this line which model version was used, whether a content filter intervened, what the system prompt said, what the temperature was, whether retrieval-augmented generation pulled in any external context, what that context was, or whether a downstream rule transformed the model's text into the binary "reject."
Article 12(2)'s traceability standard requires the answer to all of those questions. The causal chain runs: input arrives → preprocessing (sanitization, prompt assembly, retrieval) → model invocation (with named version + parameters) → postprocessing (content filtering, output parsing, rule application) → final decision recorded against the original input.
A working causal-chain log captures all five stages with stable identifiers that link back to the request. Most LLM applications log only stage one and stage five, with everything in the middle either implicit or absent from the record. That gap is the Article 12 violation regulators will be looking for once Annex III enforcement begins on 2 August 2026.
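A hedged sketch of what a five-stage causal-chain record might look like for the hiring example. Every field name and value here is illustrative; the point is that each stage carries the same request ID and names its inputs, so the chain can be replayed end to end:

```python
causal_chain = {
    "request_id": "abc123",
    "stages": [
        {"stage": "input", "received_at": "2026-09-01T10:14:02Z",
         "input_sha256": "9f2c..."},
        {"stage": "preprocessing", "sanitization_rules_fired": ["email", "name"],
         "prompt_template": "screening-v7", "retrieved_context_ids": ["doc-881"]},
        {"stage": "model_invocation", "model": "provider/model-2026-05-13",
         "temperature": 0.0, "system_prompt_sha256": "4be1..."},
        {"stage": "postprocessing", "content_filter": "none",
         "parser": "verdict-extractor-v3",
         "raw_output": "The candidate does not meet the stated criteria..."},
        {"stage": "decision", "value": "reject",
         "derived_by_rule": "score < 0.4 => reject"},
    ],
}
```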
What "appropriate" retention looks like
Article 12 does not specify a retention duration. Article 19 obliges providers of high-risk AI systems to keep logs generated by the system "for a period appropriate to the intended purpose of the high-risk AI system," and for at least six months, "unless provided otherwise in applicable Union or national law, in particular in Union law on the protection of personal data."
That six-month floor sits in tension with GDPR Article 5(1)(e), which requires personal data to be "kept in a form which permits identification of data subjects for no longer than is necessary." If your logs contain personal data — and high-risk AI logs often do, by design, because the regulator wants to be able to investigate decisions about specific people — then the retention period is bounded above by GDPR storage limitation and below by the AI Act's six-month floor.
The honest answer is that the right retention duration depends on the risk classification of the AI use case, the foreseeable investigation horizon, and the legal basis under which the log is being retained. For a credit-decision LLM the relevant horizon may be the candidate's right of redress under national consumer-credit law. For a medical-triage LLM it may extend across the full retention duty for medical records. For a hiring LLM it tracks employment-discrimination limitation periods.
What Article 12 does require is that whatever duration you choose be deliberate. A retention policy of "we keep logs until our object storage gets expensive" is not appropriate. A retention policy of "we keep logs for [N] months because that is the maximum window in which a regulator could open an investigation under [specific law]" is appropriate. The reasoning has to exist in writing and survive a conformity assessment under Article 43.
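One way to make that reasoning survive an assessment is to keep it machine-readable next to the logs themselves, rather than in a policy document nobody versions. A sketch, where every duration and justification is hypothetical and would need to come from your own legal analysis:

```python
RETENTION_CLASSES = {
    # class name: (months, written justification) -- all values illustrative
    "credit-decision": (72, "national consumer-credit redress window"),
    "medical-triage": (120, "medical-records retention duty"),
    "hiring": (36, "employment-discrimination limitation period"),
    "default": (6, "AI Act Article 19 minimum"),
}
```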
A concrete logging contract for an LLM application
The following is a logging contract a high-risk LLM system should be able to satisfy. It is not exhaustive — your specific use case under Annex III may demand more — but it is the minimum we have found through implementation work to satisfy Article 12(2)'s traceability standard.
For every request, the log must contain:
- A stable, globally unique request ID. Generated at the gateway, propagated through every downstream service, written into the final stored record. Without this, the causal chain cannot be reconstructed across service boundaries.
- An input hash. Not the input itself in every case (which would force a GDPR battle), but a hash that lets a regulator confirm the stored record corresponds to a specific submitted prompt when the user produces the original.
- A sanitization manifest. Which fields were detected as personal data. Which detection rules fired. What the redaction substitutions were. This is the bridge between the AI Act's traceability obligation and GDPR's data-minimisation obligation under Article 25.
- The model identity and version. Not just "GPT-4" — the specific upstream model identifier, the date the model was loaded into the deployment, and any fine-tuned adapter applied. Models change. Records must pin the version.
- The model parameters. Temperature, top-p, max tokens, system prompt, any tool definitions. A different system prompt is a different system. The record has to say which one was used.
- The decision output as the model produced it. Distinct from any downstream postprocessing. Postprocessing rules can fire, but the raw model output is the artefact a regulator will want to evaluate independently.
- The downstream actions. Did a content filter intervene? Did a routing rule change the response? Did an integrated workflow take an action (sending an email, updating a record, denying credit)? Each downstream action is part of the causal chain.
- The retention class and deletion timestamp. Not the policy in some other system — the literal expiry date attached to this specific record. Retention is a property of the log, not a property of the storage tier.
A log that satisfies these eight requirements survives the kind of investigation Article 12(2) anticipates. A log missing any of them leaves a gap a regulator can name.
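Expressed as a record type, the contract looks roughly like this. The field names are ours and the structure is a sketch, not a normative schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class RequestRecord:
    """Illustrative shape of the eight-point logging contract."""
    request_id: str               # 1. stable, globally unique, gateway-issued
    input_sha256: str             # 2. hash of the submitted input
    sanitization_manifest: dict   # 3. PII fields detected, rules fired, substitutions
    model_id: str                 # 4. upstream model identifier, load date, adapters
    model_params: dict            # 5. temperature, top_p, system prompt, tool defs
    raw_model_output: str         # 6. output exactly as the model produced it
    downstream_actions: list      # 7. filters, routing rules, workflow side effects
    retention_class: str          # 8. retention class...
    delete_after: datetime        #    ...and the literal expiry on this record
```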
Where Lucairn fits
Lucairn's gateway sits between your application and the upstream LLM provider. Every request that crosses the gateway emits a record that meets the eight-point contract above. The record is signed by an in-process witness service using a key the upstream model never sees, timestamped against a public RFC 3161 timestamp authority (FreeTSA), and entered into a public Sigstore Rekor transparency log so that the inclusion proof itself becomes evidence the record existed at the timestamped instant.
This matters for Article 12 compliance in two specific ways. First, the gateway captures the causal chain at the point where it actually exists — at the boundary between your code and the model — rather than relying on the application to remember to log each step. Second, the witness signature plus the Rekor inclusion proof give you tamper-evident logs that survive operator compromise. A regulator asking "how do we know your engineers didn't quietly rewrite a record after the fact?" gets a cryptographic answer rather than a procedural one. See the implementation breakdown at audit-trail-for-ai.
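The witness-signature idea, reduced to its core, is a sketch like the one below. This is the general pattern, not Lucairn's code; it uses the Ed25519 API from the `cryptography` package, and the timestamping and transparency-log steps are noted in comments rather than implemented:

```python
import hashlib
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

witness_key = Ed25519PrivateKey.generate()  # held by the witness, never by the model

def witness_sign(record: dict) -> dict:
    """Canonicalize a record, hash it, and sign the canonical bytes."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    digest = hashlib.sha256(canonical).hexdigest()
    signature = witness_key.sign(canonical).hex()
    # The digest and signature are what would then be timestamped against
    # an RFC 3161 authority and submitted to a transparency log such as Rekor.
    return {"record_sha256": digest, "signature": signature}
```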
Split-knowledge architecture additionally means the model itself never sees the user's raw identity — pseudonymisation happens upstream of the model call. Article 12 requires you to be able to reconstruct what the AI did. It does not require you to give the AI more information to do it with. If anything, the regulator's preference is the opposite: the less personal data the model touches, the smaller the surface area for a downstream rights-of-redress claim.
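Pseudonymisation upstream of the model call can be as simple as a keyed hash: the model sees a stable token, and only the holder of the key and the mapping can link it back to a person. A stdlib sketch of that idea, with an illustrative key:

```python
import hashlib
import hmac

PSEUDONYM_KEY = b"held-outside-the-model-path"  # illustrative; use a managed secret

def pseudonymise(identity: str) -> str:
    """Stable, non-reversible token for a user identity. The model sees
    only the token; re-identification requires the key."""
    digest = hmac.new(PSEUDONYM_KEY, identity.encode(), hashlib.sha256)
    return "user-" + digest.hexdigest()[:16]
```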
For teams building toward 2 August 2026, what matters operationally is whether your logging architecture is positioned to survive the first investigation under Article 12. Enforcement itself is a given — the date is fixed, and Annex III categories will start drawing scrutiny on day one. The eight-point contract is a useful self-test. If your current logs miss any of the eight, you have time to fix that before regulators start asking.