Press · May 15, 2026 · 12 min read

Auditing an enterprise document corpus for AI — the K-AI 6-axis method

Anomalies, conflicts, divergent duplicates, unmarked obsolescence, traceability, freshness: six measurable axes we instrument before any serious AI deployment.

This week, the market conversation shifted. Pinecone declared that the model is no longer the bottleneck; Glean reminded its readers that assistant relevance collapses the moment indexed content is stale, incomplete or poorly labeled (Glean, 2026); Atlan documented that standard RAG evaluations overstate production performance by 25–30% because of ungoverned enterprise data (Atlan, 2026). The diagnosis is becoming consensus. The method is not.

Iris.ai published a three-criteria framework on March 31, 2026: extractability, scalability, factuality (Iris.ai, 2026). Cisco’s AI Readiness Index covers six organizational pillars: strategy, infrastructure, data, talent, governance, culture (Cisco AI Readiness Index). Knowlee documents seven pillars and a five-dimension data quality grid (Knowlee, 2026). None of these frameworks reaches the operational level a Head of Knowledge Management needs on a Monday morning to decide whether a SharePoint repository is fit to feed an agent. Here is the method we instrument at K-AI before any serious AI deployment: six measurable axes, each with a KPI, an alert threshold and a remediation procedure.

Why a corpus audit, and why now

Three recent figures frame the urgency. Cloudera’s Data Readiness Report 2026, published April 14, finds that only 18% of enterprises describe their data as “fully governed,” while nearly 80% say data access is the bottleneck for AI (Cloudera, April 14, 2026). McKinsey, in State of AI Trust 2026, observes that roughly 30% of organizations reach maturity level 3 or above on strategy, governance and agentic controls (McKinsey, 2026). Gartner finds that organizations with successful AI initiatives invest up to four times more of their revenue in data and analytics foundations — and maintains its forecast that 60% of AI projects will be abandoned by end-2026 due to a lack of AI-ready data (Gartner, April 16, 2026).

The regulatory agenda adds pressure that technical conversations often overlook. Under the EU AI Act, high-risk system obligations apply from August 2, 2026, and Article 12 mandates the automatic retention of logs sufficient to reconstruct a system’s behavior (Artificial Intelligence Act — Article 12). That traceability does not stop at the prompt and the output. To demonstrate to a regulator, an internal auditor or a risk committee that a high-risk system behaved reproducibly, one must be able to show which source documents it consulted and what state those documents were in at the time of consultation. Without a corpus audit, that demonstration is not possible.

A useful distinction up front: the six axes presented here are not Cisco’s six pillars. Cisco measures your organizational maturity; we measure your corpus. The two efforts are complementary, not interchangeable.

Axis 1 — Internal anomalies: consistency breaks within a single document

A long document — an HR policy, a technical manual, a steering standard — regularly contains internal consistency breaks: a number in the body that does not match a summary table, a threshold cited differently in two places, a diagram from a prior version still embedded in the current one. These anomalies are invisible to a human reader because nobody re-reads fifty pages in one sitting. They are invisible to a vector retriever because each passage, taken in isolation, has excellent local coherence.

KPI: internal anomalies detected per 1,000 audited pages. Method: entity and numeric value extraction, intra-document cross-checking, anomaly flagging. Alert threshold: beyond 5 anomalies per 1,000 pages on a procedural corpus, the risk that an assistant cites a passage contradicted elsewhere in the same document becomes material.
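To make the measurement concrete, here is a minimal sketch in Python. The label vocabulary and the regular expression are illustrative placeholders, not our extraction pipeline; production detection works on typed entities and values, but the cross-checking logic has the same shape:

```python
import re
from collections import defaultdict

# Toy pattern: a labeled quantity such as "approval threshold: 50,000 EUR".
# The label list and the regex are illustrative placeholders, not K-AI's
# entity/value extraction model.
PATTERN = re.compile(
    r"(approval threshold|notice period|retention period)\D{0,20}?(\d[\d ,.]*)",
    re.IGNORECASE,
)

def internal_anomalies(pages: list[str]) -> list[tuple[str, set[str]]]:
    """Flag labels that carry more than one numeric value within one document."""
    values: dict[str, set[str]] = defaultdict(set)
    for page in pages:
        for label, number in PATTERN.findall(page):
            values[label.lower()].add(number.strip(" ,."))
    return [(label, nums) for label, nums in values.items() if len(nums) > 1]

document = [
    "Section 2: the approval threshold is 50,000 EUR for all entities.",
    "Annex B, summary table. Approval threshold: 25,000 EUR.",
]
for label, nums in internal_anomalies(document):
    print(f"ANOMALY: '{label}' appears with conflicting values {nums}")
```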

Axis 2 — Inter-document conflicts

This is the most expensive axis to instrument and the one that most cleanly separates a serious audit from a surface cleanup. Two documents state two different things about the same object — a validation policy, an incident procedure, a commercial rule — without explicit hierarchy between them. A well-tuned retriever returns both. A re-ranker arbitrates in favor of the one that most resembles the question. The model faithfully answers what it receives. The answer is wrong for half the organization.

This is precisely what the Neural Semantic Graph was built for: modeling entities, relations and inter-document constraints, then surfacing contradictions the way one would surface a constraint violation in a relational database. On a first diagnostic at a K-AI client, we typically detect several hundred such inconsistencies on a single document repository — and that is one repository among dozens in a large organization.

KPI: number of non-hierarchized inter-document conflicts detected within the scope. Alert threshold: beyond a sector-specific volume, ingestion of the affected zone should be paused until the conflict is internally resolved, rather than letting an agent arbitrate on the organization’s behalf.
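Reduced to a toy, the check looks like this (a flat claim table standing in for the Neural Semantic Graph; documents and field names are hypothetical):

```python
from collections import defaultdict

# Hypothetical claims extracted from two documents.
claims = [
    {"doc": "expense_policy_v2.pdf", "entity": "expense report",
     "attribute": "validation deadline", "value": "5 working days"},
    {"doc": "intranet_faq.md", "entity": "expense report",
     "attribute": "validation deadline", "value": "10 working days"},
]
supersedes = {}  # explicit hierarchy, e.g. {"expense_policy_v2.pdf": "intranet_faq.md"}

def non_hierarchized_conflicts(claims, supersedes):
    by_key = defaultdict(dict)  # (entity, attribute) -> {doc: value}
    for c in claims:
        by_key[(c["entity"], c["attribute"])][c["doc"]] = c["value"]
    for key, docs in by_key.items():
        if len(set(docs.values())) > 1:
            # Disagreement counts as a conflict only when no document in the
            # group explicitly supersedes another one.
            if not any(supersedes.get(d) in docs for d in docs):
                yield key, docs

for (entity, attr), docs in non_hierarchized_conflicts(claims, supersedes):
    print(f"CONFLICT on {entity} / {attr}: {docs}")
```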

Axis 3 — Divergent duplicates

Three copies of an HR policy, two slide decks restating a technical standard with different parameters, an internal memo reproducing the content of an official manual with a regrettable shortcut: duplication is not the problem in itself — its divergence is. Across the scopes we audit, divergent duplication is the leading contributor to corpus bloat. The initial cleanup typically allows the organization to remove or merge a substantial share of the corpus, simply because no single team had ever held a clear mandate to do so.

KPI: divergent duplicate rate per 1,000 documents. Method: near-duplicate detection coupled with a semantic diff that qualifies the gap between versions. Alert threshold: a divergent duplicate identified as being cited by a production AI system is an incident, not technical debt.
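A minimal sketch of the two-step test: word-shingle Jaccard similarity stands in for near-duplicate detection, and a plain lexical diff stands in for the semantic diff, which in practice is model-based. The 0.6 threshold is illustrative:

```python
import difflib

def shingles(text: str, k: int = 5) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def divergent_duplicates(docs: dict[str, str], near_dup: float = 0.6):
    """Yield pairs similar enough to be duplicates, with the words that differ."""
    names = list(docs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = jaccard(shingles(docs[a]), shingles(docs[b]))
            if near_dup <= sim < 1.0:
                delta = [t for t in difflib.ndiff(docs[a].split(), docs[b].split())
                         if t[0] in "+-"]
                yield a, b, round(sim, 2), delta

docs = {
    "policy_2024.docx": "Travel policy: expenses above 500 EUR require written "
                        "manager approval before booking. Receipts must be submitted "
                        "within 10 working days of return. Late submissions are "
                        "escalated to the finance controller for review.",
    "policy_copy.docx": "Travel policy: expenses above 1000 EUR require written "
                        "manager approval before booking. Receipts must be submitted "
                        "within 10 working days of return. Late submissions are "
                        "escalated to the finance controller for review.",
}
for a, b, sim, delta in divergent_duplicates(docs):
    print(f"{a} vs {b} (similarity {sim}): {delta}")  # ['- 500', '+ 1000']
```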

Axis 4 — Unmarked obsolescence

A 2019 procedure nobody removed sits, at retrieval time, next to its 2026 version — often with a better semantic score because its prose is denser. Glen Rhodes coined a useful term, “document shelf life”: the period during which a document remains authoritative within its scope (Glen Rhodes, 2026). Without an explicit deprecated/replaced-by discipline carried in metadata, obsolescence cannot be seen at inference.

KPI: share of the corpus whose last business-side verification is older than N months, by document class. Alert threshold: class-specific — a technical manual may tolerate two years; a validation policy rarely tolerates more than six months.
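Once each document carries a class and a last business-side verification date, the KPI is a few lines of code. A sketch, with illustrative class tolerances mirroring the thresholds above:

```python
from datetime import date

# Class-specific tolerance in months; values are illustrative, to be set per sector.
MAX_AGE_MONTHS = {"technical_manual": 24, "validation_policy": 6}

def months_between(earlier: date, later: date) -> int:
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)

def unmarked_obsolescence(corpus: list[dict], today: date) -> dict[str, float]:
    """Share of documents, per class, past the class tolerance without a
    deprecated/replaced-by marker."""
    shares = {}
    for cls, limit in MAX_AGE_MONTHS.items():
        docs = [d for d in corpus if d["class"] == cls and not d.get("deprecated")]
        stale = [d for d in docs if months_between(d["last_verified"], today) > limit]
        shares[cls] = len(stale) / len(docs) if docs else 0.0
    return shares

corpus = [
    {"class": "validation_policy", "last_verified": date(2025, 1, 10)},
    {"class": "validation_policy", "last_verified": date(2026, 3, 2)},
    {"class": "technical_manual", "last_verified": date(2019, 6, 1)},
]
print(unmarked_obsolescence(corpus, date(2026, 5, 15)))
# {'technical_manual': 1.0, 'validation_policy': 0.5}
```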

Axis 5 — Traceability (author, date, validation, source of truth)

This is the axis that takes a corpus audit from document quality into compliance. To respect the spirit of AI Act Article 12, logging an assistant’s queries is not enough — one must be able to reconstruct, for a given response on a given date, which documents fed it and what state those documents were in. That implies, on each document in operational scope: an identified author, a last-revision date, a validation trace (who approved, when), and an explicit indication of source-of-truth when multiple versions coexist.

None of these four pieces of information is new. None is, on average, present on more than a third of the documents we audit. Traceability is the most poorly instrumented axis in practice — and it is the one becoming legally enforceable this year.

KPI: coverage rate of author + date + validation + source-of-truth across scope. Alert threshold: for an AI Act high-risk system, target 100% on the documents the system effectively consumes — not on the entire corpus, which may remain in an unaudited zone as long as it is not exposed to inference.
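Coverage is a binary test per document: all four fields present and non-empty. A sketch, with hypothetical metadata field names to be adapted to the document management system actually in place:

```python
# Hypothetical metadata field names; adapt to the DMS in place.
REQUIRED = ("author", "last_revision", "validated_by", "source_of_truth")

def traceability_coverage(docs: list[dict]) -> float:
    """Share of documents carrying all four traceability fields, non-empty."""
    if not docs:
        return 0.0
    covered = sum(1 for d in docs if all(d.get(field) for field in REQUIRED))
    return covered / len(docs)

inference_scope = [
    {"author": "HR", "last_revision": "2026-02-01",
     "validated_by": "Legal", "source_of_truth": "yes"},
    {"author": "Ops", "last_revision": "2024-11-12",
     "validated_by": None, "source_of_truth": "yes"},
]
print(f"{traceability_coverage(inference_scope):.0%}")  # 50% -- target is 100%
```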

Axis 6 — Freshness per segment

Freshness differs from obsolescence: obsolescence sanctions unmarked stale content; freshness measures the pace of updates. A corpus segment whose median last-update date has been frozen for eighteen months is not necessarily obsolete document by document — but it almost always signals that the function in charge has stopped treating it as a live subject. That is an early warning indicator: the segment is rotting without anything in the system triggering a response.

KPI: median last-update age per semantic cluster. Alert threshold: a slippage above 50% over two consecutive quarters should trigger an ownership review, not a new prompt.
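One way to operationalize that trigger, sketched below. The 50% slippage is read here as quarter-over-quarter growth of the median age sustained for two quarters, which is one defensible interpretation among several:

```python
from datetime import date
from statistics import median

def median_age_days(last_updates: list[date], today: date) -> float:
    """Median last-update age, in days, for one semantic cluster."""
    return median((today - d).days for d in last_updates)

def ownership_review_due(quarterly_median_ages: list[float],
                         slippage: float = 0.5) -> bool:
    """True when the median age grew by more than `slippage` in each of the
    last two consecutive quarters."""
    if len(quarterly_median_ages) < 3:
        return False
    q0, q1, q2 = quarterly_median_ages[-3:]
    return q1 > q0 * (1 + slippage) and q2 > q1 * (1 + slippage)

print(ownership_review_due([120.0, 190.0, 300.0]))  # True: +58%, then +58%
```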

From diagnosis to monitoring — audit deliverable and re-audit cadence

A corpus audit produces two deliverables. The first is a documentary AI Readiness Score, scoring each of the six axes on a 1-to-5 scale, with qualitative commentary and three precise indicators per axis — the measurable form of what Iris.ai poses as criteria, and what is still missing from the market’s public grammar. The second is a prioritized action plan, axis by axis: what can be remediated automatically, what requires business arbitration, and what needs to wait for a C-level decision.
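The scorecard itself is a small structure. A sketch, with one deliberate design choice made explicit: the overall score is the minimum across axes, because a corpus is only as AI-ready as its weakest axis (a weighted average would hide it):

```python
AXES = ("internal_anomalies", "inter_document_conflicts", "divergent_duplicates",
        "unmarked_obsolescence", "traceability", "freshness")

def readiness_scorecard(axis_scores: dict[str, int]) -> dict:
    """Assemble the documentary AI Readiness Score. Per-axis scores are set by
    the auditor on a 1-to-5 scale from the three indicators of each axis."""
    assert set(axis_scores) == set(AXES), "score every axis, no more, no fewer"
    assert all(1 <= s <= 5 for s in axis_scores.values())
    # Overall = weakest axis: the corpus is only as ready as its worst axis.
    return {"axes": axis_scores, "overall": min(axis_scores.values())}

print(readiness_scorecard({
    "internal_anomalies": 4, "inter_document_conflicts": 2,
    "divergent_duplicates": 3, "unmarked_obsolescence": 3,
    "traceability": 1, "freshness": 4,
})["overall"])  # 1 -- traceability drags the whole corpus down
```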

An initial audit is useful once; continuous monitoring is useful every day. A policy gets rewritten, a manual lapses, two teams document the same procedure differently — without dedicated observability, document debt becomes invisible again within two quarters. Our practice: an initial audit on the scope exposed to AI, a full quarterly re-audit, and a continuous semantic monitoring layer that surfaces, in real time, new conflicts, new divergent duplicates and freshness slippage. That is the Stay Clean that extends Start Clean. It is also, incidentally, what produces the audit log Article 12 expects.

Frequently asked questions (FAQ)

How do you audit a document portfolio for AI?

A serious corpus audit goes through six measurable axes, sequentially: internal anomalies (intra-document inconsistencies), inter-document conflicts (non-hierarchized contradictions), divergent duplicates (concurrent versions that do not say the same thing), unmarked obsolescence (lapsed documents never removed), traceability (author, date, validation, source of truth) and freshness (update pace per segment). Each axis is measured with a KPI, an alert threshold and a remediation procedure. The deliverable is a documentary AI Readiness Score per axis, accompanied by a prioritized action plan. The typical duration of a first audit on an AI-exposed scope is 2 to 4 weeks, depending on volume and starting quality.

What metrics define an AI-ready corpus?

Five families of KPIs, observed the way one observes the quality of a structured data pipeline: rate of internal anomalies per 1,000 pages, volume of detected inter-document conflicts, rate of divergent duplicates per 1,000 documents, rate of unmarked obsolescence per document class, coverage of author + date + validation + source-of-truth across the scope exposed to inference. To these five families one adds freshness (median last-update age per semantic cluster), which serves as an early warning indicator. Together, these six metrics form the basis of a documentary AI Readiness Score defensible to a board and to a regulator alike.

What should an AI audit log contain to satisfy AI Act Article 12?

Article 12 mandates the automatic retention of logs sufficient to reconstruct a high-risk system’s behavior (Artificial Intelligence Act — Article 12). For a system that relies on a document corpus, that implies at minimum: the list of documents consulted for each response, their state at the time of consultation (version, last revision date, validation status), the record of their ingestion into the operational corpus (who indexed them, when, from what source), and the traceability of subsequent modifications. Without these four elements, one can prove the system ran — but not that it ran on a defensible corpus, which is the real intent of the requirement.
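Article 12 prescribes no format. One hypothetical shape for such a record, sketched below; the field names are ours, not the regulation’s:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ConsultedDocument:
    doc_id: str
    version: str
    last_revision: str       # ISO date of last business-side revision
    validation_status: str   # e.g. "approved", "pending", "deprecated"

@dataclass
class CorpusAuditLogEntry:
    """One record per response: the four elements listed above."""
    response_id: str
    timestamp: str
    consulted: list[ConsultedDocument]
    ingestion_record: dict   # who indexed, when, from what source
    modification_trail: list # changes to the consulted documents since ingestion

entry = CorpusAuditLogEntry(
    response_id="r-2026-05-15-0042",
    timestamp=datetime.now(timezone.utc).isoformat(),
    consulted=[ConsultedDocument("POL-HR-017", "v3.2", "2026-02-01", "approved")],
    ingestion_record={"indexed_by": "corpus-pipeline", "at": "2026-03-01",
                      "source": "sharepoint://hr-policies"},
    modification_trail=[],
)
print(json.dumps(asdict(entry), indent=2))
```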

How does K-AI differ from existing AI Readiness frameworks?

Cisco measures your organizational maturity on six pillars (strategy, infrastructure, data, talent, governance, culture). Iris.ai poses three readiness criteria (extractability, scalability, factuality). Knowlee lays out seven pillars with a five-dimension data quality grid. All these frameworks are useful, and we recommend them at the organizational level. None descends to the documentary-operational level — where a team needs to know, segment by segment, whether a repository is fit to feed an assistant or an agent. The K-AI six-axis method is the documentary operationalization of those frameworks: it does not replace them, it makes them measurable on the ground.

How often should a document corpus be re-audited?

A full initial audit, then a quarterly re-audit on the entire AI-exposed scope, complemented by continuous semantic monitoring. Continuous monitoring surfaces, in real time, new conflicts, divergent duplicates as they appear, and freshness slippage. The quarterly re-audit re-scores the six axes and updates the action plan. Without continuous monitoring, a remediated corpus degrades materially within two to three quarters — that is the experience we systematically observe on scopes left unattended after an initial Start Clean.

Next step

If you recognize the situation described — an AI project that depends on a corpus whose precise state nobody knows — the useful next step is neither a new embedding nor a new governance framework. It is a six-axis corpus audit on the exposed scope. We do this for large enterprises on piloted scopes. Write to us at contact@k-ai.ai.

K-AI already supports CMA CGM, Veolia, PwC, BNP Paribas, TotalEnergies and CEVA Logistics on the quality of their document portfolio in the AI era. Partners: AWS, Snowflake, Microsoft, Wavestone, Devoteam.

And in your organization, what does your document estate look like?

30 minutes with a founder. We audit a sample of your documents for free and show you exactly what K-AI detects.

Book a demo →