Model Context Protocol and Enterprise Documents: What MCP Standardizes — and Leaves Ungoverned

MCP standardizes access to enterprise document sources for LLM agents — not their quality. What RAG architects must understand before connecting SharePoint.

In June 2026, KPMG pulled its report Redefining Excellence in the Age of Agentic AI after GPTZero documented that 40 of its 45 citations pointed to non-existent or fabricated sources (TechCrunch, June 13, 2026). The issue was not access. KPMG had a pipeline, sources, and a content retrieval system. The issue was upstream: the quality of the sources the system was drawing from.

This pattern — access solved, quality ignored — is playing out systematically in enterprise RAG deployments. In 2026, it is taking a new form with the widespread adoption of the Model Context Protocol.

What MCP Standardizes for Enterprise Document Sources

Published by Anthropic in November 2024 (official specification), the Model Context Protocol has become the de facto standard for connecting LLM agents to external data sources. The protocol rests on a straightforward architecture: a host application (Claude Desktop, an IDE assistant, or any agent runtime built on a model such as Claude, GPT-4o, or Gemini) runs MCP clients that connect to MCP servers, and those servers expose resources such as documents, database records, and API responses through a unified JSON-RPC 2.0 interface.

For enterprise document sources, MCP standardizes three specific technical capabilities:

1. Discoverability. An MCP server exposes the list of available resources via the resources/list primitive. An agent can dynamically inventory accessible documents in a SharePoint library, a Confluence space, or an enterprise content management system — without prior knowledge of the source structure.

2. Normalized access. The resources/read primitive returns the content of a document identified by its URI (for example, sharepoint://tenant/site/library/document.docx), as text or as base64-encoded binary, together with its MIME type. Descriptive metadata such as title and last modified date is available only if the server exposes it.

3. Interoperability. An agent can switch from a SharePoint MCP server to a Confluence MCP server without modifying its retrieval architecture. Atlassian released its official MCP Server in May 2026, opening access to Confluence spaces and Jira projects through this protocol (SiliconANGLE, May 6, 2026). Microsoft has done the same for SharePoint.

These three capabilities are real and valuable. MCP meaningfully reduces the integration friction between LLM agents and enterprise document sources. That is its primary benefit.
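To make the first two primitives concrete, here is a minimal sketch of the JSON-RPC 2.0 messages involved, written in plain Python without any MCP SDK. The method names and the example URI come from the text above; the helper names (rpc_request, text_of) and the sample server reply are illustrative assumptions. A real client would also complete the protocol's initialization handshake before sending these requests.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC requests carry a unique id per connection


def rpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize a JSON-RPC 2.0 request for an MCP method."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def text_of(read_response: str) -> str:
    """Concatenate the text parts of a resources/read result."""
    result = json.loads(read_response)["result"]
    return "\n".join(c["text"] for c in result["contents"] if "text" in c)


# Discoverability: ask the server what it exposes.
list_request = rpc_request("resources/list")

# Normalized access: read one document by URI.
read_request = rpc_request(
    "resources/read",
    {"uri": "sharepoint://tenant/site/library/document.docx"},
)

# Hypothetical server reply, shaped like a resources/read result.
sample_reply = json.dumps({
    "jsonrpc": "2.0",
    "id": 2,
    "result": {"contents": [{
        "uri": "sharepoint://tenant/site/library/document.docx",
        "mimeType": "text/plain",
        "text": "Illustrative document body.",
    }]},
})

print(list_request)
print(read_request)
print(text_of(sample_reply))
```

Nothing in either message says whether the document is current, canonical, or in conflict with another one. The reply carries content and a MIME type, which is exactly the scope described above.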

What MCP Does Not Address — and Why This Is a Structural Problem

The MCP specification defines no mechanism for the following:

Document validity: Is this document still current, or has it been superseded by a newer version that was never formally retired?
Conflict detection: Does this document contradict another document in the same corpus on the same topic?
Version canonicity: Among three versions of a procedure stored in the same SharePoint library, which one is authoritative?
Governance metadata: Who owns this document? What is its contractual expiry date? Does it have an active maintainer?
Semantic quality score: Is this document coherent with the other sources the RAG pipeline will query simultaneously?

These gaps are not design flaws in MCP; they are deliberately out of scope. MCP is a transport and access protocol, not a document governance framework. The confusion arises because solving access to a document and knowing whether that document is reliable require different tools operating at different layers of the stack.

An MCP SharePoint server connected to a library of 40,000 documents will expose those 40,000 documents with excellent technical reliability and low latency. If 23% of those documents are outdated, or if 1,400 document pairs contain conflicting information on HR policies or regulatory procedures, the protocol will not detect it. Neither will the LLM agent.

The Concrete Scenario: Standardizing Access to Disorder

A survey of 132 enterprise AI leaders published by VentureBeat in June 2026 (The Agentic Reckoning, see Sources) documents that the primary production failure point for agents is not the model but the knowledge layer. Teams are investing in retries and orchestration when the underlying issue is source coherence. Investment in retrieval optimization jumped from 19% to 28.9% of AI infrastructure budgets in Q1 2026. Engineering teams are investing in rerankers, hybrid search, and GraphRAG, all of which improve the retrieval layer. None of these technologies detects that a service note from April 2023 stored in Confluence is contradicted by a January 2026 directive in the same space.

An academic study on factual accuracy of citations in commercially deployed AI research agents (arXiv:2605.06635, May 2026) finds that even the strongest frontier models achieve only 77% factual accuracy on their citations; open-source models fall below 39%. These measurements cover web-search agents. Enterprise RAG systems operating on private corpora lack an equivalent standardized evaluation framework, so the problem goes largely unmeasured there and may well be more pronounced on ungoverned corpora. The cause is not in the retrieval architecture. It is in the quality of the sources.

MCP makes those sources more accessible. It does not make them more reliable.

The DKP Layer as Upstream to an MCP Server

There are two architectural approaches to addressing this problem.

Reactive approach: add post-retrieval verification — fact-checking, confidence reranking, source filtering by date. This approach addresses symptoms at the pipeline output. It is compute-intensive, non-deterministic, and fails for inter-document contradictions (if two contradictory sources are retrieved, the pipeline has no means of knowing which is correct without knowing their version history).
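A minimal sketch of one reactive technique named above, source filtering by date, shows where it stops helping. The Hit type, the URIs, the dates, and the simplified "claim" field are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Hit:
    uri: str
    last_modified: date
    claim: str  # what the retrieved passage asserts, reduced to a string


def filter_by_date(hits: list[Hit], not_before: date) -> list[Hit]:
    """Reactive filter: drop hits last modified before the cutoff."""
    return [h for h in hits if h.last_modified >= not_before]


hits = [
    # Old procedure that was re-saved recently, so its date looks fresh.
    Hit("confluence://space/approval-old", date(2026, 2, 10), "two approvers"),
    # Current procedure.
    Hit("confluence://space/approval-new", date(2026, 3, 4), "one approver"),
]

kept = filter_by_date(hits, not_before=date(2026, 1, 1))

# Both hits pass the filter, and they still disagree with each other.
still_conflicting = len({h.claim for h in kept}) > 1
print(len(kept), still_conflicting)
```

The filter sees two recent documents. Without version history it has no basis for preferring one, which is the limitation described in the paragraph above.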

Preventive approach: audit and govern the corpus before it is exposed through the MCP server. This is what a Document Knowledge Platform does. On a single enterprise document repository during an initial diagnostic, K-AI teams typically identify several hundred document anomalies — divergent duplicates, documents in conflict on regulatory thresholds, outdated versions never formally retired, orphaned documents with no active owner. These anomalies are detectable before indexing, not after.

In a DKP-MCP architecture, the sequence is as follows:

Raw sources (SharePoint, Confluence, ECM, S3...)
         ↓
    [DKP Layer]
    Audit → Cleaning → Scoring → Continuous monitoring
         ↓
  Governed corpus (valid documents, conflicts resolved)
         ↓
    [MCP Server]
    Normalized exposure to LLM agents
         ↓
   Agent / RAG pipeline

The MCP server receives a pre-qualified corpus as input. The resources it exposes carry documented validity, a canonical version, and a resolved conflict status. The LLM agent can build on this layer knowing that document coherence has been measured rather than assumed.
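The sequence above can be sketched as a simple gate placed in front of the MCP server. This is an illustration under stated assumptions, not a description of any particular product: the AuditedDoc fields, the 0.0 to 1.0 score scale, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditedDoc:
    uri: str
    quality_score: float           # assumed scale: 0.0 (poor) to 1.0 (good)
    superseded_by: Optional[str]   # URI of the newer version, if any
    open_conflicts: int            # unresolved contradictions with other documents


def exposable(doc: AuditedDoc, min_score: float = 0.7) -> bool:
    """A document is exposed only if it is current, conflict-free, and scored high enough."""
    return (
        doc.superseded_by is None
        and doc.open_conflicts == 0
        and doc.quality_score >= min_score
    )


def governed_corpus(docs: list[AuditedDoc], min_score: float = 0.7) -> list[AuditedDoc]:
    """Return the subset of the audited corpus that the MCP server should expose."""
    return [d for d in docs if exposable(d, min_score)]


audited = [
    AuditedDoc("sharepoint://tenant/site/library/procedure-v3.docx", 0.91, None, 0),
    AuditedDoc("sharepoint://tenant/site/library/procedure-v2.docx", 0.88,
               "sharepoint://tenant/site/library/procedure-v3.docx", 0),
    AuditedDoc("sharepoint://tenant/site/library/thresholds.docx", 0.75, None, 2),
]

for doc in governed_corpus(audited):
    print(doc.uri)
```

In this sketch the superseded version and the document with open conflicts never reach the server, so the agent cannot retrieve them. A softer variant would expose them with a warning flag instead of dropping them.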

Recommendations for Teams Deploying MCP Agents

Before connecting an MCP server to an enterprise document source:

Audit the source before exposing it. Do not assume that a SharePoint library or Confluence space is clean because it is organized. Access structure (libraries, spaces, permissions) says nothing about the semantic coherence of the content.
Identify high conflict-risk sources. HR policy spaces, regulatory procedure documentation, and multi-version product documentation are the territories to audit first before MCP exposure. Static archived documents (reports, signed contracts) carry lower risk of active contradiction.
Instrument continuous monitoring, not just a one-time audit. A corpus audited before MCP deployment will drift afterward. New versions, partial updates, and documents imported without governance controls introduce post-audit conflicts. Continuous document governance — what we call “Stay Clean” — is the durability condition for corpus quality in an MCP deployment.
Embed governance metadata in the resources the MCP server exposes. The MCP specification defines a small set of standard resource annotations (audience, priority, last modified) and lets servers attach additional implementation-specific metadata to a resource. An MCP server connected to a DKP-governed corpus can surface: quality score, validity status, canonical version identifier, last audit date. This metadata allows the agent to dynamically weight its confidence in each source.
Consider Article 12 obligations for high-risk systems. If your MCP pipeline feeds a high-risk AI system under Annex III of the EU AI Act, source logging enters the scope of the traceability obligation. Upstream document quality is a prerequisite for a defensible audit trail.
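As an illustration of the metadata recommendation, the sketch below attaches the four governance fields to a resource descriptor and shows one way an agent could turn them into a confidence weight. It assumes the fields travel in the resource's `_meta` object; the key names (with a placeholder com.example.dkp prefix), the status values, and the weighting rule are hypothetical and not defined by MCP.

```python
from typing import Any

# Hypothetical metadata keys; prefix and names are assumptions, not part of MCP.
KEY_SCORE = "com.example.dkp/qualityScore"
KEY_STATUS = "com.example.dkp/validityStatus"
KEY_CANONICAL = "com.example.dkp/canonicalVersionUri"
KEY_AUDITED = "com.example.dkp/lastAuditDate"


def governed_resource(uri: str, name: str, score: float, status: str,
                      canonical_uri: str, audited_on: str) -> dict[str, Any]:
    """Build a resource descriptor that carries governance metadata."""
    return {
        "uri": uri,
        "name": name,
        "_meta": {
            KEY_SCORE: score,
            KEY_STATUS: status,
            KEY_CANONICAL: canonical_uri,
            KEY_AUDITED: audited_on,
        },
    }


def confidence_weight(resource: dict[str, Any], unknown_weight: float = 0.2) -> float:
    """Agent-side rule: how much to trust a resource, from 0.0 to 1.0."""
    meta = resource.get("_meta") or {}
    if KEY_STATUS not in meta:
        return unknown_weight  # ungoverned source: usable, but trusted little
    if meta[KEY_STATUS] != "valid":
        return 0.0             # expired or superseded
    if meta.get(KEY_CANONICAL, resource["uri"]) != resource["uri"]:
        return 0.0             # a canonical version exists elsewhere
    return float(meta.get(KEY_SCORE, unknown_weight))


V3 = "sharepoint://tenant/site/library/policy-v3.docx"
V2 = "sharepoint://tenant/site/library/policy-v2.docx"

current = governed_resource(V3, "Policy v3", 0.92, "valid", V3, "2026-05-01")
outdated = governed_resource(V2, "Policy v2", 0.40, "superseded", V3, "2026-05-01")
ungoverned = {"uri": "sharepoint://tenant/site/library/memo.docx", "name": "Memo"}

for res in (current, outdated, ungoverned):
    print(res["name"], confidence_weight(res))
```

Giving ungoverned resources a low but non-zero weight is a design choice. A stricter deployment could set unknown_weight to 0.0 so that only audited documents contribute to answers.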

Frequently Asked Questions

Can MCP replace an enterprise document governance system?

No. MCP is a transport and access protocol: it standardizes how an LLM agent discovers and reads documents, not how those documents are qualified, maintained, or governed. A document governance system audits the semantic coherence of the corpus (inter-document conflicts, divergent duplicates, unmarked obsolescence), produces quality scoring, and monitors drift continuously. These two layers are complementary, not substitutable. MCP without document governance efficiently exposes a potentially unreliable corpus.

How do I know if my SharePoint corpus is ready to be exposed via MCP for an LLM agent?

Three practical indicators: (1) The proportion of documents without an explicit validity date or active owner — a rate above 30% is a warning signal. (2) The presence of filename duplicates with multiple versions in the same library — each divergent duplicate is a non-deterministic retrieval risk. (3) The presence of documents in conflict on regulatory thresholds or operational procedures — inter-document conflict detection requires semantic analysis, not metadata comparison. A document corpus audit following the six-axis method provides these three indicators before MCP exposure.
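The first two indicators can be computed from metadata alone. The sketch below is one possible way to do it, assuming an inventory export with a filename, an owner, and a validity date per document. The DocRecord type, the sample records, and the version-suffix pattern are hypothetical. The third indicator is deliberately absent, since it needs semantic analysis of the content.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocRecord:
    filename: str
    owner: Optional[str]        # None when no active owner is recorded
    valid_until: Optional[str]  # ISO date, None when no validity date is set


# Trailing markers such as "_v2", "-final", " copy" or " (1)"; an assumed convention.
_VERSION_SUFFIX = re.compile(r"(?:[ _-](?:v\d+|final|copy)|\s?\(\d+\))+$")


def ungoverned_ratio(docs: list[DocRecord]) -> float:
    """Indicator 1: share of documents lacking an owner or a validity date."""
    missing = sum(1 for d in docs if d.owner is None or d.valid_until is None)
    return missing / len(docs) if docs else 0.0


def normalized_name(filename: str) -> str:
    """Lowercase the name, drop the extension and any trailing version marker."""
    stem = filename.rsplit(".", 1)[0].lower()
    return _VERSION_SUFFIX.sub("", stem)


def duplicate_groups(docs: list[DocRecord]) -> dict[str, int]:
    """Indicator 2: normalized names that occur more than once."""
    counts = Counter(normalized_name(d.filename) for d in docs)
    return {name: n for name, n in counts.items() if n > 1}


inventory = [
    DocRecord("Travel_Policy_v2.docx", "hr-team", "2026-12-31"),
    DocRecord("Travel_Policy_final.docx", None, None),
    DocRecord("travel_policy.docx", "hr-team", None),
    DocRecord("Onboarding.docx", "hr-team", "2027-01-31"),
]

ratio = ungoverned_ratio(inventory)
print(f"ungoverned: {ratio:.0%}, warning: {ratio > 0.30}")
print(duplicate_groups(inventory))
```

Filename matching only surfaces candidates. Whether the duplicates actually diverge still has to be checked against their content.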

What does an MCP server actually return for a SharePoint document?

The resources/read primitive of a SharePoint MCP server returns the text content of the document, its URI, its MIME type, and basic metadata if the server exposes it (title, last modified date, author). What the server does not return: document validity status, conflict indicator with other corpus documents, canonical version among multiple active versions, semantic quality score. These must be produced by an upstream document audit layer and attached by the MCP server as additional resource metadata.

What is the concrete risk for an LLM agent connected to an unaudited corpus via MCP?

The agent retrieves and synthesizes contradictory sources without detecting the contradiction. Example: an approval procedure was updated in March 2026, but the October 2024 version remains indexed in the same Confluence space. The MCP server exposes both. The reranker does not detect the semantic contradiction; it sees two relevant documents on the same topic. The agent generates a response based on one or the other depending on retrieval order, a non-deterministic behavior that erodes operational trust. The KPMG scenario (40 of 45 citations pointing to non-existent or fabricated sources) is the extreme of this failure class.
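The order dependence in this example can be shown in a few lines. The sketch assumes two equally relevant hits and a version history recorded as a simple "superseded by" mapping; the URIs and function names are hypothetical.

```python
OLD = "confluence://space/approval-procedure-2024-10"
NEW = "confluence://space/approval-procedure-2026-03"

# Version history as an upstream audit might record it: old URI -> newer URI.
superseded_by = {OLD: NEW}


def naive_top1(hits: list[str]) -> str:
    """Answer from whichever document the retriever happened to rank first."""
    return hits[0]


def resolve(hits: list[str], history: dict[str, str]) -> list[str]:
    """Drop every hit that the version history marks as superseded."""
    return [uri for uri in hits if uri not in history]


run_a = [OLD, NEW]  # one retrieval order
run_b = [NEW, OLD]  # the same hits, ranked the other way round

# Without version history, the source of the answer depends on ranking order.
print(naive_top1(run_a) == naive_top1(run_b))

# With version history, both runs resolve to the current procedure.
print(resolve(run_a, superseded_by), resolve(run_b, superseded_by))
```

The resolution step is trivial once the version history exists. Producing that history is the work of the upstream audit, not of the protocol or the reranker.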

How does K-AI integrate into an MCP architecture?

K-AI operates upstream of the MCP server: the Document Knowledge Platform audits the source corpus (SharePoint, Confluence, ECM), resolves conflicts and duplicates, marks outdated documents, and produces a quality score per document. The governed corpus is then exposed through the MCP server — with enriched annotations (validity, canonical version, quality score) available to the agent. K-AI offers native MCP integration that allows architecture teams to connect the DKP layer directly into their agentic infrastructure.

Sources

TechCrunch — KPMG pulls report on AI usage due to apparent hallucinations — June 13, 2026 — https://techcrunch.com/2026/06/13/kpmg-pulls-report-on-ai-usage-due-to-apparent-hallucinations/
Anthropic — Model Context Protocol Specification — https://modelcontextprotocol.io/specification
SiliconANGLE — Atlassian opens Teamwork Graph, pushes Rovo agentic execution — May 6, 2026 — https://siliconangle.com/2026/05/06/atlassian-opens-teamwork-graph-pushes-rovo-agentic-execution-team-26/
VentureBeat — The Agentic Reckoning: Enterprise AI Organizations Have a Runtime Problem, Not a Model Problem — June 2026 — https://venturebeat.com/resources/the-agentic-reckoning-enterprise-ai-organizations-have-a-runtime-problem-not-a-model-problem
arXiv — Cited but Not Verified: Parsing LLM Deep Research Agent Citations — May–June 2026 — https://arxiv.org/html/2605.06635v1
EU AI Act — Article 12 — Automatic event logging — https://artificialintelligenceact.eu/article/12/