RAGOps: The Data Half Nobody Operates — SLIs, SLOs, and a Control Plane for Corpus Health

RAGOps includes continuous corpus management. A year after the founding paper, nobody operationalizes it. Here are the SLIs of a production corpus.

On June 2, 2026, a large-scale study posted to arXiv (5 models, 10 biomedical question-answering datasets, 4 retrieval methods, 4 corpora) concluded that retrieval delivers only “weak and inconsistent” gains over a no-retrieval baseline (arXiv:2606.04127). In plain terms: adding retrieval is not, by itself, a performance lever, and the “plug in RAG and it gets better” reflex does not survive measurement. Meanwhile, your teams are instrumenting the pipeline: traces, spans, relevance scores, evaluation dashboards. Everything is observed except the one component that changes every day: the corpus itself. This research note proposes a fix, in the vocabulary operations teams already speak: SLIs, SLOs, and a control plane.

LLMOps ≠ RAGOps: the delta is the data

RAGOps has a precise academic definition. The founding paper — RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines (Xu, Weytjens, Zhang, Lu, Weber, Zhu — CSIRO Data61 / TU Munich, arXiv:2506.03401, June 2025) — estimates that 60% of compound LLM systems in the enterprise rely on RAG, and defines RAGOps as an extension of LLMOps with “a strong focus on data management,” precisely because a RAG system’s external sources change continuously.

Read that definition again: half of RAGOps is the data lifecycle. Not prompts, not models, not traces — documents. A year after publication, our review of the market — enterprise RAG platforms, LLM observability tools, content services vendors — found no vendor that has turned that data half into an operating doctrine. LLM observability tools trace application spans. Evaluation frameworks measure the pipeline. The corpus remains a blind spot.

What application tracing cannot see

The industry has not ignored operations. Unstructured.io published an evaluation framework in March that decomposes RAG into measurable stages — ingestion, chunking, retrieval, reranking, generation — with regression gates in CI. Useful work. But to the question “how do you evaluate RAG when the corpus changes daily?”, the proposed answer is a frozen snapshot. You evaluate the pipeline on a photograph; production runs on a film.

The same gap shows up across platforms. Hyland announced general availability of its Enterprise Context Engine on June 1, complete with an observability “Control Tower”, but for agents. Microsoft moved Foundry IQ to GA on June 2, unifying knowledge access behind a single SLA-backed endpoint, while saying nothing about the quality of what gets indexed. Pinecone wired Nexus into Microsoft OneLake on June 3, with deterministic conflict resolution, but only at retrieval time. Three major announcements in 72 hours, all on the access and orchestration layer. None exposes a single health metric for the content being consumed.

Application tracing can tell you a query returned eight chunks in 230 ms with a mean relevance score of 0.82. It cannot tell you that two of those chunks come from diverging versions of the same document, that a third has been obsolete since a regulatory change, and that the actually applicable procedure is simply not in the index. No span carries that information — it lives only in the corpus.

Clean benchmarks, dirty corpora: the 0.16% → 24% gap

Why has this blind spot survived so long? Because public benchmarks don’t show it. An empirical analysis published in May 2026 (arXiv:2605.09611) measured the effect of byte-exact deduplication across RAG corpus types: on BEIR, the standard academic retrieval benchmark, removable redundancy is 0.16%, which is negligible. On “enterprise” corpus patterns (document revisions, coexisting versions), it reaches 24%, nearly a quarter of the corpus.
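To make the measurement concrete, here is a minimal sketch of how removable redundancy under byte-exact deduplication could be computed: hash each document's bytes and count the copies beyond the first. The function name, the toy corpus, and the choice to count by document rather than by byte volume are assumptions made for illustration, not the paper's code.

```python
import hashlib


def exact_redundancy_rate(documents: list[str]) -> float:
    """Share of documents that are byte-identical copies of an earlier one."""
    if not documents:
        return 0.0
    seen: set[str] = set()
    duplicates = 0
    for text in documents:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates / len(documents)


# Toy corpus (hypothetical): one exact copy and one revised near-copy.
corpus = [
    "Travel policy: expenses above 500 EUR require approval.",
    "Travel policy: expenses above 500 EUR require approval.",  # exact copy
    "Travel policy: expenses above 800 EUR require approval.",  # revision
    "Onboarding checklist for new hires.",
]
rate = exact_redundancy_rate(corpus)
print(f"removable by exact deduplication: {rate:.0%}")
```

Note what the sketch leaves in place: the revised policy survives exact deduplication because a single changed figure produces a different hash, even though it now contradicts the version that was kept.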

The implication deserves to be spelled out: almost everything the industry knows about RAG performance comes from corpora that look nothing like yours. A pipeline validated on BEIR has never met the cases that define document life in a large organization: the 2019 procedure nobody retired, the internal memo that tacitly invalidates a chapter of a reference manual, the three versions of an HR policy where two disagree on a threshold. VentureBeat recently gave this accumulation a name, retrieval debt: messy corpora producing answers that are “technically correct but outdated.” At K-AI we measure that debt with our customers: in a first diagnostic on a single document repository, surfacing over a thousand anomalies is not unusual. That order of magnitude is invisible in public benchmarks, and consistent with the 0.16% → 24% gap the research measured.

The five SLIs of a production corpus

If the corpus is a production component, it deserves what every production component gets: Service Level Indicators. We propose five, derived directly from the defect families our six-axis corpus audit method establishes at diagnostic time — the audit defines the defects; the SLIs watch them over time.

1. Divergent redundancy rate. Share of documents existing in multiple versions whose content diverges: not exact copies, but the near-copies that contradict each other. This is the defect byte-exact deduplication only partially catches, and that benchmarks underestimate: on the deduplication figures above, enterprise-style corpora carry 150 times the removable redundancy of the academic benchmark (24% versus 0.16%).

2. Freshness drift (staleness). Distribution of document age weighted by retrieval frequency. A stale document that is never retrieved is dormant debt; a stale document retrieved ten times a day is a live incident.

3. Active contradiction density. Number of formally incompatible claim pairs across documents in the same scope, normalized by corpus size. Undetectable by vector similarity: two versions of a policy are semantically near-identical, and their divergence on a date or a threshold barely moves the cosine similarity score.

4. Mandatory topic coverage. Share of business-critical questions (regulatory, operational, contractual) the corpus can actually answer. The most precise retrieval in the world cannot compensate for a document that does not exist.

5. Lineage completeness. Share of documents with an identified owner, a validation date, and a designated source of truth. The other four SLIs depend on this one: without an owner, no remediation ever lands.
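Three of these indicators reduce to simple arithmetic once document metadata is available. The sketch below computes SLIs 1, 2, and 5 on a toy corpus. The `Doc` fields, the 30-day retrieval window, and the use of a retrieval-weighted mean age as a one-number summary of the age distribution are all assumptions made for illustration. SLIs 3 and 4 are left out because they need claim extraction and a curated question set, which a few lines cannot fake.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    family: str                      # logical document this version belongs to
    content: str
    last_modified: date
    retrievals_30d: int              # times retrieved over the last 30 days
    owner: Optional[str] = None
    validated_on: Optional[date] = None
    source_of_truth: Optional[str] = None  # id of the authoritative version


def divergent_redundancy_rate(docs: list[Doc]) -> float:
    """SLI 1: share of documents whose family holds diverging contents."""
    contents_by_family: dict[str, set[str]] = defaultdict(set)
    for d in docs:
        contents_by_family[d.family].add(d.content)
    divergent = sum(1 for d in docs if len(contents_by_family[d.family]) > 1)
    return divergent / len(docs) if docs else 0.0


def retrieval_weighted_age_days(docs: list[Doc], today: date) -> float:
    """SLI 2 (summary): mean document age, weighted by retrieval count."""
    total = sum(d.retrievals_30d for d in docs)
    if total == 0:
        return 0.0
    weighted = sum((today - d.last_modified).days * d.retrievals_30d for d in docs)
    return weighted / total


def lineage_completeness(docs: list[Doc]) -> float:
    """SLI 5: share of documents with owner, validation date, source of truth."""
    complete = sum(
        1 for d in docs
        if d.owner is not None
        and d.validated_on is not None
        and d.source_of_truth is not None
    )
    return complete / len(docs) if docs else 0.0


# Toy corpus (hypothetical values).
today = date(2026, 6, 10)
docs = [
    Doc("hr-v1", "hr-leave", "Carry-over limit: 5 days", date(2024, 6, 10), 90,
        owner="hr", validated_on=date(2024, 6, 10), source_of_truth="hr-v2"),
    Doc("hr-v2", "hr-leave", "Carry-over limit: 10 days", date(2026, 5, 11), 10,
        owner="hr", validated_on=date(2026, 5, 11), source_of_truth="hr-v2"),
    Doc("it-1", "it-vpn", "VPN setup guide", date(2026, 3, 2), 0),
    Doc("ops-1", "ops-restart", "Restart procedure", date(2026, 6, 10), 0,
        owner="ops", validated_on=date(2026, 6, 10), source_of_truth="ops-1"),
]

redundancy = divergent_redundancy_rate(docs)
weighted_age = retrieval_weighted_age_days(docs, today)
lineage = lineage_completeness(docs)
print(redundancy, weighted_age, lineage)
```

In this toy corpus the superseded HR version receives most of the retrievals, so the weighted age sits far above the plain average age: that is the difference between dormant debt and a live incident.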

Each SLI calls for an SLO (Service Level Objective): a threshold agreed with the business owners of the documents, exactly like an availability target. A sample formulation, here for a health, safety, and environment (HSE) scope: “divergent redundancy in the HSE scope stays under 2%; any active contradiction on a regulatory document is arbitrated within 10 business days.”
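The sample formulation can be encoded as two checks: a rate threshold and a deadline counted in business days. This is a sketch under stated assumptions: the class and function names are hypothetical, and public holidays are ignored when counting business days.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RateSLO:
    scope: str
    sli: str
    max_value: float  # the measured SLI must stay strictly under this value

    def is_met(self, measured: float) -> bool:
        return measured < self.max_value


def business_days_between(start: date, end: date) -> int:
    """Count Monday-to-Friday days after `start`, up to and including `end`."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days += 1
    return days


def arbitration_overdue(opened: date, today: date, max_business_days: int = 10) -> bool:
    """True when an open contradiction has exceeded its arbitration window."""
    return business_days_between(opened, today) > max_business_days


# The two clauses of the sample SLO, with hypothetical measurements.
hse_redundancy = RateSLO(scope="HSE", sli="divergent_redundancy_rate", max_value=0.02)
print(hse_redundancy.is_met(0.013))
print(arbitration_overdue(opened=date(2026, 6, 1), today=date(2026, 6, 16)))
```

Keeping the threshold in a small declarative object rather than in code paths makes it something the document owners can review and renegotiate, which is the point of agreeing an SLO with them.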

The corpus control plane: an architecture for document observability

Then comes execution. A corpus control plane has three loops, mirroring what SRE teams already run.

A continuous measurement loop. The corpus is re-analyzed incrementally, on every document addition, modification, or removal, rather than on a fixed schedule. This is where a semantic graph approach earns its keep operationally: K-AI’s Neural Semantic Graph maintains a representation of claims and their relations (support, contradiction, redundancy, obsolescence), so a modified document triggers recomputation of only the affected nodes, not a full re-audit.
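The idea of recomputing only the affected nodes can be illustrated with a toy claim graph. This is not the Neural Semantic Graph, whose internals are not described here. It is a minimal sketch in which the class, its methods, and the choice to propagate exactly one hop (a document's own claims plus their direct neighbors) are all assumptions.

```python
from collections import defaultdict


class ClaimGraph:
    """Toy graph: claims are nodes, relations between claims are edges."""

    def __init__(self) -> None:
        self.claims_by_doc: dict[str, set[str]] = defaultdict(set)
        self.neighbors: dict[str, set[str]] = defaultdict(set)

    def add_claim(self, doc_id: str, claim_id: str) -> None:
        self.claims_by_doc[doc_id].add(claim_id)

    def relate(self, claim_a: str, claim_b: str) -> None:
        # Undirected edge: support, contradiction, redundancy, obsolescence...
        self.neighbors[claim_a].add(claim_b)
        self.neighbors[claim_b].add(claim_a)

    def affected_by(self, doc_id: str) -> set[str]:
        """Claims to re-evaluate when `doc_id` changes (one-hop neighborhood)."""
        own = self.claims_by_doc.get(doc_id, set())
        affected = set(own)
        for claim in own:
            affected |= self.neighbors.get(claim, set())
        return affected


graph = ClaimGraph()
graph.add_claim("policy-v1", "p1:limit=5")
graph.add_claim("policy-v1", "p1:scope=all-staff")
graph.add_claim("policy-v2", "p2:limit=10")
graph.add_claim("vpn-guide", "vpn:port=443")
graph.relate("p1:limit=5", "p2:limit=10")  # e.g. a contradiction edge

to_recompute = graph.affected_by("policy-v1")
print(sorted(to_recompute))
```

Editing the first policy version touches its two claims and the one claim they are related to in the second version. The unrelated VPN guide is never re-analyzed, which is what keeps a per-change loop affordable.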

An alerting loop. SLO breach → notification to the owner of the affected document, with the arbitration context (which versions diverge, on which claims, with what retrieval impact). An alert without an identified business owner is noise — hence SLI #5.
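A sketch of that routing rule follows, with hypothetical names throughout (`Breach`, `route_alert`, the `lineage-backlog` queue). It assumes an owner registry keyed by document id, and treats a missing owner as a lineage defect rather than sending a notification nobody will act on.

```python
from dataclasses import dataclass, field


@dataclass
class Breach:
    sli: str
    doc_id: str
    measured: float
    threshold: float
    diverging_versions: list[str] = field(default_factory=list)
    retrievals_30d: int = 0


def route_alert(breach: Breach, owners: dict[str, str]) -> dict:
    """Build a notification for the document owner, or flag the lineage gap."""
    owner = owners.get(breach.doc_id)
    if owner is None:
        # Nobody can arbitrate: record it as a lineage defect, not as an alert.
        return {"to": "lineage-backlog", "actionable": False, "doc_id": breach.doc_id}
    return {
        "to": owner,
        "actionable": True,
        "doc_id": breach.doc_id,
        "summary": f"{breach.sli} at {breach.measured:.1%} (SLO: under {breach.threshold:.1%})",
        # Arbitration context: which versions diverge, and how exposed they are.
        "diverging_versions": breach.diverging_versions,
        "retrievals_30d": breach.retrievals_30d,
    }


owners = {"hse-proc-12": "hse-lead@example.com"}
breach = Breach(
    sli="divergent_redundancy_rate",
    doc_id="hse-proc-12",
    measured=0.035,
    threshold=0.02,
    diverging_versions=["hse-proc-12-v3", "hse-proc-12-v4"],
    retrievals_30d=140,
)
alert = route_alert(breach, owners)
orphan = route_alert(Breach("freshness_drift", "memo-2019", 0.4, 0.1), owners)
print(alert["to"], orphan["to"])
```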

A traced remediation loop. Every arbitration (version kept, document retired, merge) is journaled with its author and rationale. That journal is the corpus’s operational memory — and, as we’ll see, rather more than that.
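One possible shape for that journal is an append-only list of JSON lines that rejects any entry without an author or a rationale. The field names and the three decision labels are assumptions drawn from the examples above. A real deployment would write to durable, tamper-evident storage rather than an in-memory list.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

ALLOWED_DECISIONS = {"version_kept", "document_retired", "merge"}


@dataclass(frozen=True)
class Arbitration:
    doc_id: str
    decision: str
    author: str
    rationale: str
    decided_at: str  # ISO 8601 timestamp, UTC


def record_arbitration(
    journal: list[str],
    doc_id: str,
    decision: str,
    author: str,
    rationale: str,
    now: Optional[datetime] = None,
) -> Arbitration:
    """Append one arbitration to the journal as a JSON line."""
    if decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    if not author.strip() or not rationale.strip():
        raise ValueError("author and rationale are mandatory")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    entry = Arbitration(doc_id, decision, author, rationale, timestamp)
    journal.append(json.dumps(asdict(entry), sort_keys=True))
    return entry


journal: list[str] = []
record_arbitration(
    journal,
    doc_id="hr-leave-v1",
    decision="document_retired",
    author="hr-lead",
    rationale="Superseded by v2 after the policy change.",
    now=datetime(2026, 6, 10, 9, 30, tzinfo=timezone.utc),
)
print(journal[0])
```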

This is the operational translation of what we call Stay Clean: not a repeated audit, but permanent instrumentation, at the same architectural level as pipeline monitoring.

Beyond operations: what these metrics are worth for compliance

One last argument, for readers who need to justify the investment. On June 1, the European Commission appointed the AI Act’s Scientific Panel and Advisory Forum — 60 independent experts tasked, among other things, with evaluation methodologies, two months before the first enforcement deadlines. The documentation obligations that apply to systems already in production demand exactly what the control plane produces as a by-product: dated quality metrics, arbitration logs, lineage. What you instrument for operations becomes your regulatory evidence file — we covered that angle in our 60-day corpus plan for the EU AI Act.

The conclusion fits in one sentence: RAGOps as defined by the research has two halves, and the industry has tooled only one. The pipeline has its control plane. It’s time the corpus had its own.

Frequently asked questions

What is RAGOps and how is it different from LLMOps?

RAGOps is the discipline of operating RAG pipelines, formalized by researchers at CSIRO Data61 and TU Munich in June 2025 (arXiv:2506.03401). It extends LLMOps — lifecycle management for models, prompts, and deployments — with a specific addition: continuous management of the external data the pipeline consumes. That is the structural difference: an LLM is versioned and relatively stable; an enterprise document corpus changes daily. RAGOps therefore covers two coupled lifecycles, the model’s and the data’s. In practice, most organizations have tooled only the first — application observability, retrieval evaluation — leaving the second without instrumentation.

Why does RAG still hallucinate even with good documents?

Because the quality of each individual document says nothing about the coherence of the collection. A corpus can be made of individually well-written, validated, sourced documents and still contain diverging versions of the same procedure, unmarked obsolete information, and contradictions across scopes. The pipeline then retrieves a context that is locally correct but globally inconsistent, and the model generates an answer faithful to a document that should not have been authoritative. It is a corpus defect, not a model defect: no reranker or runtime verification loop can detect that a document contradicts another document absent from the context. Remediation happens upstream, in the corpus.

How does document quality affect RAG accuracy?

At every stage. At retrieval: divergent duplicates cannibalize each other in the ranking and surface competing versions. At ranking: a well-written obsolete document often scores higher than an up-to-date but poorly structured one. At generation: the model synthesizes what it is given — if the context mixes two versions of a regulatory threshold, the answer picks one, confidently. A large-scale biomedical study published June 2, 2026 (arXiv:2606.04127) shows that adding retrieval, by itself, yields weak and inconsistent gains versus a no-retrieval baseline: the lever is not adding more retrieval, but qualifying what it consumes — and measuring what the corpus actually contains.
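A small experiment shows why competing versions surface together. The sketch below compares two versions of a hypothetical policy sentence that differ only in a threshold, using word-count cosine similarity as a simplified stand-in for embedding similarity. That substitution is an assumption: real embedding models behave differently in detail, but the single changed token is diluted in the same way.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(count * vb[token] for token, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Two versions of the same (hypothetical) rule, disagreeing on the threshold.
v1 = "Employees may carry over up to 5 days of unused leave into the next year."
v2 = "Employees may carry over up to 10 days of unused leave into the next year."
similarity = bow_cosine(v1, v2)
print(f"cosine similarity between the two versions: {similarity:.3f}")
```

Fourteen of the fifteen words are shared, so the two sentences score close to identical. A query about leave carry-over would rank both near the top, and nothing in the score says which threshold is in force.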

How do you evaluate document corpus quality before deploying AI?

With a structured corpus audit before any deployment — then continuous monitoring after. The audit establishes the major defect families: internal anomalies, cross-document conflicts, divergent duplicates, unmarked obsolescence, traceability, freshness by segment. It produces a quantified baseline (how many active contradictions, what redundancy rate, what share of ownerless documents). The SLIs described in this article then take over: they turn the audit’s axes into permanently monitored metrics, with alert thresholds and arbitration workflows. An audit without monitoring goes stale within months; monitoring without an initial audit has no reference point.

Is there an AI to check for outdated documentation?

Yes. It is one of the most mature use cases of semantic corpus analysis. Obsolescence detection cannot rely on modification dates alone: an old document may still be valid, and a recent one may be invalidated by a later decision. Effective approaches combine several signals: age weighted by viewing and retrieval frequency, contradiction with more recent documents in the same scope, references to expired entities (product versions, repealed regulations, defunct organizations). This is one of the functions of K-AI’s Neural Semantic Graph: flagging documents whose claims are contradicted or superseded by newer documents, and routing the arbitration to the relevant owner rather than deleting automatically.
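As an illustration of combining those signals, the sketch below blends them into a single score and routes suspects to the owner instead of deleting them. The weights, the two-year age cap, the review threshold, and every name are hypothetical, chosen only to show the shape of such a rule. Detecting the contradictions and expired references themselves is assumed to happen upstream.

```python
from dataclasses import dataclass


@dataclass
class ObsolescenceSignals:
    age_days: int
    retrievals_30d: int
    contradicted_by_newer: int  # newer same-scope documents contradicting it
    expired_references: int     # mentions of repealed rules, retired products


def obsolescence_score(s: ObsolescenceSignals, max_age_days: int = 730) -> float:
    """Blend the signals into a 0..1 score; the weights are illustrative only."""
    age = min(s.age_days / max_age_days, 1.0)
    contradicted = 1.0 if s.contradicted_by_newer > 0 else 0.0
    expired = 1.0 if s.expired_references > 0 else 0.0
    return 0.3 * age + 0.5 * contradicted + 0.2 * expired


def next_step(s: ObsolescenceSignals, review_threshold: float = 0.5) -> str:
    """Never delete automatically: either keep, or route to the owner."""
    return "route_to_owner" if obsolescence_score(s) >= review_threshold else "keep"


def review_priority(s: ObsolescenceSignals) -> float:
    """Frequently retrieved suspects go first in the owner's queue."""
    return obsolescence_score(s) * (1 + s.retrievals_30d)


old_but_valid = ObsolescenceSignals(1000, 40, 0, 0)
recent_but_superseded = ObsolescenceSignals(30, 40, 1, 0)
print(next_step(old_but_valid), next_step(recent_but_superseded))
```

Age alone never crosses the threshold in this weighting, so an old document that nothing contradicts is kept, while a month-old document contradicted by a newer one is sent for arbitration.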

Going further

If your RAG pipeline is instrumented but your corpus is not, the first step is a baseline: an audit that establishes your five SLIs on a pilot repository. We run it in a few weeks, with quantified results. Write to us: contact@k-ai.ai.

Sources

When Retrieval Doesn’t Help: A Large-Scale Study of Biomedical RAG — arXiv:2606.04127, June 2, 2026 — https://arxiv.org/abs/2606.04127
RAGOps: Operating and Managing Retrieval-Augmented Generation Pipelines — Xu, Weytjens, Zhang, Lu, Weber, Zhu (CSIRO Data61 / TU Munich), arXiv:2506.03401, June 2025 — https://arxiv.org/abs/2506.03401
Byte-Exact Deduplication in RAG: A Three-Regime Empirical Analysis — arXiv:2605.09611, May 2026 — https://arxiv.org/abs/2605.09611
RAG Evaluation: A Data Pipeline Performance Framework — Unstructured.io, March 21, 2026 — https://unstructured.io/insights/rag-evaluation-a-data-pipeline-performance-framework
Hyland launches next wave of AI platform innovations — Hyland Newsroom, June 1, 2026 — https://www.hyland.com/en/company/newsroom/hyland-launches-next-wave-ai-platform-innovations
Pinecone Nexus and Microsoft OneLake — Pinecone Newsroom, June 3, 2026 — https://www.pinecone.io/newsroom/microsoft-onelake-nexus/
What’s new in Microsoft Foundry — Build 2026 — Microsoft Dev Blogs, June 2, 2026 — https://devblogs.microsoft.com/foundry/whats-new-in-microsoft-foundry-build-2026/
Why prompt debt, retrieval debt and evaluation debt are quietly reshaping enterprise AI risk — VentureBeat, May 2026 — https://venturebeat.com/technology/why-prompt-debt-retrieval-debt-and-evaluation-debt-are-quietly-reshaping-enterprise-ai-risk
AI Act enforcement gets independent expert support — European Commission, June 1, 2026 — https://digital-strategy.ec.europa.eu/en/news/ai-act-enforcement-gets-independent-expert-support

Auditing a document corpus for AI — the K-AI six-axis method (May 15, 2026) — operational method for the six axes.
Context engineering done right — why the post-RAG paradigm demands a clean corpus (May 29, 2026) — the upstream layer between pipeline and knowledge.
Knowledge graph vs. vector database for enterprise RAG (May 22, 2026) — semantic graph vs vector store.

K-AI already works with CMA CGM, Veolia, PwC, BNP Paribas, TotalEnergies, and CEVA Logistics. Partners: AWS, Snowflake, Microsoft, Wavestone, Devoteam.