Press · May 20, 2026 · 12 min read

If Copilot can't find your SharePoint documents, the bug isn't in Copilot — it's in your corpus

Microsoft hit 20M paid Copilot seats. On Microsoft Q&A, the same complaint keeps surfacing: Copilot can't retrieve our SharePoint documents.

Microsoft announced in March 2026 that Microsoft 365 Copilot had passed 20 million paid seats — a tripling in six months (WindowsNews, March 2026). At the same time, more than six threads opened between April and May 2026 on the Microsoft Q&A portal say variations of the same thing: Copilot can’t find my SharePoint documents, it fabricates answers, search is inconsistent (Microsoft Q&A, thread opened April 24, 2026). On April 24, 2026, the NHSmail support team referenced a Microsoft 365 service degradation: users not receiving search results with SharePoint as a knowledge source (NHSmail Support, April 24, 2026).

If you are a CIO, CDO or Head of Knowledge Management at a Fortune 500 company that deployed Copilot in 2025, you have almost certainly heard (or said) one of those sentences. You have probably also escalated to your Microsoft TAM. Blaming Microsoft is the most expensive misattribution of the decade. Here is why.

The Copilot promise versus the field reality

The typical scenario plays out like this. A large enterprise rolls out Microsoft 365 Copilot. Weeks later, three recurring complaints surface from end users. First: “I know this document exists, but Copilot can’t find it.” Second: “It quoted me an outdated version instead of the current one.” Third, and most concerning: “It made up a procedure that vaguely resembles what we do, but is not what we do.”

The instinctive boardroom reaction is to lean on Microsoft. The instinctive CIO reaction is to relaunch an access-governance project: sensitivity labels, restricted search, DLP audit. That is what the third-party ecosystem — AvePoint, Syskit, Glean — is pushing, each with its own toolbox (AvePoint Confidence Platform, Feb. 2026).

Both reflexes miss the same target. Microsoft does not have a model problem. The access plumbing does not have a configuration problem. The problem is in the corpus Copilot is querying.

Microsoft documented the problem — nobody reads the documentation

The technical constraints are public, and precise. Since late 2025, Microsoft Learn has documented several limits that, taken together, account for a meaningful share of the Copilot failures enterprise leaders blame on Microsoft.

First, for a SharePoint declarative agent to actually scan the full relevant content, Microsoft recommends limiting yourself to twenty files or three hundred pages in total; for embedded files, seven hundred fifty to one thousand pages per file maximum (Microsoft Learn — Optimize Content Retrieval). Beyond that, Copilot stops parsing the full document — it samples. Second, for knowledge sources on a Copilot Studio agent, the direct upload is capped at 512 MB per file; and when an agent references a SharePoint file directly, without Microsoft 365 Copilot being licensed in the same tenant, the cap drops to 7 MB due to memory constraints (Microsoft Learn — Copilot Studio quotas). Third, Restricted SharePoint Search (RSS) is capped at one hundred sites in the allowed list and is documented by Microsoft as “a short-term solution, not scalable, not a security boundary” (Microsoft Learn — Restricted SharePoint Search). Fourth, indexing is daily for multi-user sites, not real-time. A document published at 9 a.m. is not guaranteed to be retrievable at 11 a.m.
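As a rough illustration, the size and page thresholds quoted above can be checked against a file inventory before rollout. The sketch below is a minimal example under stated assumptions: `FileRecord` and `flag_file` are hypothetical names, the inventory is invented, and the constants simply restate the figures cited in this article, which should be re-verified against the current Microsoft Learn pages. It does not call any SharePoint or Microsoft Graph API.

```python
from dataclasses import dataclass

MB = 1024 * 1024

# Figures as quoted in this article from Microsoft Learn; re-check before relying on them.
STUDIO_UPLOAD_CAP_BYTES = 512 * MB
UNLICENSED_SHAREPOINT_REF_CAP_BYTES = 7 * MB
EMBEDDED_FILE_PAGE_CAP = 750  # lower bound of the quoted 750 to 1,000 page range


@dataclass
class FileRecord:
    """Hypothetical inventory row, not a SharePoint or Graph API type."""
    path: str
    size_bytes: int
    pages: int


def flag_file(record: FileRecord, copilot_licensed: bool = True) -> list[str]:
    """Return human-readable warnings for one file."""
    warnings = []
    if record.size_bytes > STUDIO_UPLOAD_CAP_BYTES:
        warnings.append("exceeds 512 MB Copilot Studio upload cap")
    if not copilot_licensed and record.size_bytes > UNLICENSED_SHAREPOINT_REF_CAP_BYTES:
        warnings.append("exceeds 7 MB cap for an unlicensed SharePoint reference")
    if record.pages > EMBEDDED_FILE_PAGE_CAP:
        warnings.append("exceeds 750-page lower bound for embedded files")
    return warnings


# Invented sample inventory, for illustration only.
inventory = [
    FileRecord("legal/master-agreement.pdf", 12 * MB, 1400),
    FileRecord("finance/q1-deck.pptx", 300 * MB, 80),
    FileRecord("hr/leave-policy.docx", 1 * MB, 6),
]

report = {r.path: flag_file(r, copilot_licensed=False) for r in inventory}
for path, warnings in report.items():
    print(path, warnings or "ok")
```

A check like this only tells you which files are at risk of partial parsing; it says nothing about whether their content is current or consistent.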

Four lines of official documentation. I have yet to encounter a consulting firm — francophone or otherwise — that quotes them together in a Copilot readiness deck. I have yet to encounter a Copilot steering committee that has read them. And yet: if your SharePoint estate contains long PDFs from legal, 300 MB decks from finance, libraries of more than a thousand documents — that is, if your SharePoint looks like a Fortune 500’s SharePoint — then Copilot simply cannot return what you expect it to return.

The mechanism: Copilot does what it’s told, your corpus tells it nonsense

Beyond the technical constraints, there is the documentation quality itself. And this is where the diagnosis becomes uncomfortable.

Microsoft 365 Copilot, like a Glean agent, a Sana assistant or a Sinequa MCP, is a retrieval-augmented generation (RAG) system. It retrieves a subset of documents judged relevant, then asks an LLM to answer using those documents. Output quality is capped by retrieval quality. So what does a real enterprise SharePoint actually contain?

Divergent duplicates: three versions of the same procedure, in three different sites, with three different page numbers. Outdated content: an HR policy archived in 2022 that nobody flagged as such. Cryptic file naming: “final doc v3 truly final.docx”. Missing metadata: no owner, no review date, no sensitivity label. Inter-document conflicts: a security reference document that states “twelve-character minimum password” alongside another that states “fourteen”. Every audit of an enterprise document repository surfaces this pattern. In a first-time diagnostic by K-AI on one repository at a large non-tech group, we detected more than 1,300 anomalies — conflicts, divergent duplicates, undated obsolescence. That was one repository among dozens in the same organization.
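Two of these anomaly types, divergent duplicates and missing metadata, are simple enough to sketch. The example below is an assumption-laden illustration, not K-AI's method: the documents are invented, the 0.85 similarity cut-off is arbitrary, and character-level `difflib.SequenceMatcher` stands in for whatever similarity technique a production audit would use. Pairwise comparison is quadratic, so this only suits small samples.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical extracted documents: path -> (text, metadata).
docs = {
    "siteA/password-policy.docx": (
        "Passwords must be at least twelve characters and rotated yearly.",
        {"owner": "security", "review_date": "2025-11-01"},
    ),
    "siteB/password-policy-v2.docx": (
        "Passwords must be at least fourteen characters and rotated yearly.",
        {"owner": None, "review_date": None},
    ),
    "siteC/travel-policy.docx": (
        "Employees book travel through the approved agency portal.",
        {"owner": "hr", "review_date": "2026-01-15"},
    ),
}

SIMILARITY_THRESHOLD = 0.85  # assumed cut-off, to be tuned on real data


def divergent_duplicates(documents, threshold=SIMILARITY_THRESHOLD):
    """Return pairs that are nearly identical but not byte-for-byte equal."""
    pairs = []
    for (a, (text_a, _)), (b, (text_b, _)) in combinations(documents.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if threshold <= ratio < 1.0:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


def missing_metadata(documents, required=("owner", "review_date")):
    """Map each document to the required metadata fields it lacks."""
    gaps = {}
    for name, (_, meta) in documents.items():
        missing = [key for key in required if not meta.get(key)]
        if missing:
            gaps[name] = missing
    return gaps


print(divergent_duplicates(docs))
print(missing_metadata(docs))
```

The two password policies are flagged as a divergent pair precisely because they agree on everything except the one number that matters, which is the twelve-versus-fourteen conflict described above.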

Asking Copilot to retrieve the approval procedure for an investment above €5 million in that landscape is not asking for a retrieval. It is asking for a miracle. The LLM is not lying — it faithfully reports what retrieval handed it. And if retrieval handed it three competing versions, it picks one. Often the wrong one. SPS, an independent SharePoint agency, puts it plainly: “You cannot get reliable outputs from an unreliable corpus. Garbage in, confident misinformation out, at speed, at scale, to every employee who asks” (SharePointSupport, May 2026).
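The failure mode can be reproduced with a toy retriever. The sketch below is not how Copilot ranks content; it assumes a deliberately naive token-overlap score and an invented four-document corpus, purely to show that three competing versions of one procedure all score equally well, so the generation step receives contradictory context and has no principled way to choose.

```python
import re

# Hypothetical corpus: three versions of one procedure plus an unrelated page.
corpus = {
    "finance/capex-approval-2021.docx": "Investment approval above 5 million requires CFO sign-off.",
    "ops/capex-approval-draft.docx": "Investment approval above 5 million requires board sign-off.",
    "legal/capex-approval-final.docx": "Investment approval above 5 million requires CEO and CFO sign-off.",
    "hr/onboarding.docx": "New hires complete onboarding within two weeks.",
}


def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=3):
    """Rank documents by token overlap with the query (toy stand-in for a real retriever)."""
    q = tokens(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(q & tokens(item[1])),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


hits = retrieve("approval procedure for an investment above 5 million", corpus)
distinct_answers = {corpus[name] for name in hits}

print(hits)
print(len(distinct_answers), "competing statements handed to the LLM")
```

Retrieval here is working as designed: every hit is relevant. The problem is that relevance and authority are different properties, and nothing in the corpus marks which version is the official one.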

The same mechanism explains why, according to a multi-vendor panel published in 2026, more than 70% of enterprise RAG deployments fail before production, and why retrieval is the failure point in 73% of cases (dev.to, 2026; Rag About It, 2026). Microsoft 365 Copilot is a RAG system. It is not exempt from that rule. It illustrates it at scale.

RSS, RCD, SAM: why access plumbing isn’t enough

Microsoft offers several levers to reduce friction. Restricted SharePoint Search limits Copilot to a list of 100 allowed sites. Restricted Content Discovery (RCD) — announced in preview in March 2026, GA expected later in 2026 — allows content to be excluded from organizational search even when the user has access to it (Microsoft Learn — Restricted Content Discovery). SharePoint Advanced Management (SAM) brings a site lifecycle and Data Access Governance (DAG) reports (Microsoft Learn — Get Ready for Copilot with SAM).

These levers address the access plumbing: who sees what. They do not address the quality of the content itself. An outdated document remains outdated after it has been excluded from RSS. Three competing versions of a procedure remain three competing versions after a sensitivity label is applied. SAM was not designed to manage corpus quality, any more than RSS’s 100-site cap was designed to scale to a Fortune 500 — Microsoft says this explicitly (Microsoft Learn — RSS limitations).

It is a distinction worth holding. Microsoft built an access governance tool, not a documentation quality tool. Glean built a cross-SaaS knowledge graph, not a documentation quality tool. AvePoint built engagement analytics and bulk cleanup, not a documentation quality tool. The missing layer — audit, deduplication, conflict detection, obsolescence flagging, traceability — sits somewhere else.

What a Document Knowledge Platform does before Copilot runs

That is exactly the definition of a Document Knowledge Platform (DKP), as I laid out last week in the May 18 pillar piece (K-AI — Knowledge AI vs Knowledge Management vs DKP). A DKP sits upstream of the Knowledge AI layer — upstream of Copilot, Glean, Sana, Sinequa. Its job is not to answer questions. Its job is to make the corpus answerable: deduplication, inter-document conflict detection, undeclared obsolescence flagging, metadata normalization, traceability of origin and last review. It is the unstructured-content counterpart of what a Data Catalog has been doing for structured data for the past decade.

On a K-AI–audited perimeter, the effect is measurable in weeks. On one repository, we typically see the document volume drop by around 30% within a week (duplicates and obsolete content removed), followed by a roughly 40% reduction in document-level conflicts after a month of continuous monitoring. These are one-repository numbers — not full-estate numbers. A large non-tech group runs dozens of such repositories.

Once the corpus is treated, Copilot works. Not magically — functionally. It retrieves what it is supposed to retrieve, because there are no longer three versions of the procedure but one. It no longer fabricates a non-existent procedure, because the existing procedure now has an owner, a review date, and a validation signature that the LLM can quote. A Copilot’s performance — a Knowledge AI’s performance more broadly — is bounded by the corpus it ingests, not by the sophistication of the model.

That is why, when a steering committee tells me “Copilot isn’t working”, I prefer to ask one question first: did you audit the corpus? In eight cases out of ten, the answer is no. In the remaining two, the audit was done by hand, in Excel, by a Knowledge Management team that could only see what the eye can see. That is precisely what a DKP scales.

Frequently asked questions

Why can’t Microsoft 365 Copilot find my SharePoint documents?

Three families of causes are at play. First, technical: files exceed the thresholds documented by Microsoft (twenty files or three hundred pages in total for a SharePoint declarative agent, 512 MB for a direct Copilot Studio upload, 7 MB for a SharePoint file referenced directly without Microsoft 365 Copilot licensed in the same tenant). Second, access governance: permissions too broad, Restricted SharePoint Search scoped to the wrong set of sites, daily (not real-time) indexing. Third, and this is the one steering committees underestimate, corpus quality itself: divergent duplicates, unflagged obsolete content, missing metadata, cryptic naming. The first two families fall under Microsoft Learn and the SAM/AvePoint ecosystem. The third falls under a Document Knowledge Platform.

What are the file-size limits for Copilot in SharePoint?

Microsoft documents several thresholds. For a SharePoint declarative agent to scan all relevant content end-to-end, limit yourself to twenty files or three hundred pages in total; for embedded files, seven hundred fifty to one thousand pages per file maximum (Microsoft Learn — Optimize Content Retrieval). For knowledge sources on a Copilot Studio agent, the direct upload is capped at 512 MB per file; when the agent references a SharePoint file directly without Microsoft 365 Copilot licensed in the same tenant, the cap drops to 7 MB (Microsoft Learn — Copilot Studio quotas). Beyond these thresholds, content can sit in SharePoint without being effectively queryable by Copilot.

Should I enable Restricted SharePoint Search (RSS)?

Yes in the short term, as a controlled on-ramp for a phased rollout. No as a long-term strategy. Microsoft itself documents RSS as “a short-term solution, not scalable, not a security boundary,” and caps the allowed list at 100 sites. It is designed to let you tame the rollout, not to absorb the document estate of a large enterprise. Beyond 100 sites — or as soon as access governance needs to be central rather than peripheral — the right answer lies elsewhere: Restricted Content Discovery, sensitivity labels, and most importantly an upstream audit of the corpus. Source: Microsoft Learn — Restricted SharePoint Search.

Why does Copilot hallucinate on SharePoint content?

Most “hallucinations” Copilot exhibits in the field are not hallucinations in the strict sense — they are faithful renderings of a degraded corpus. The RAG engine retrieves two or three relevant documents — for instance, three competing versions of an HR procedure — and the LLM picks one to answer. From the user’s perspective, the answer appears fabricated because it does not match the official version. In reality, the LLM did its job perfectly on a corpus that had no single official version. The fix is not in the LLM — it is in resolving document conflicts, flagging obsolescence, and naming a single owner per procedure.

How do I assess my SharePoint corpus quality before a Copilot project?

The audit should cover six measurable axes, which K-AI published as a method on May 15, 2026: internal anomalies (consistency breaks within a single document), inter-document conflicts (two references that contradict each other), divergent duplicates (multiple versions of the same document), unflagged obsolescence, traceability (author, date, validation, source of truth), and freshness by segment (K-AI — Auditing a document corpus for AI in six axes). Each axis can be quantified, benchmarked against an alert threshold, and logged in an audit journal compliant with Article 12 of the EU AI Act. For a Copilot rollout in a large enterprise, the SharePoint corpus audit should precede seat activation — not the other way around.
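One way to picture "quantified, benchmarked against an alert threshold, and logged" is a per-axis scorecard. The sketch below is illustrative only: the axis identifiers paraphrase the six axes named above, while `AxisResult`, `audit_journal`, the counts and the uniform 10% threshold are all hypothetical. It makes no claim about what an EU AI Act compliant log must contain.

```python
from dataclasses import dataclass

# Identifiers paraphrasing the six axes listed above.
AXES = (
    "internal_anomalies",
    "inter_document_conflicts",
    "divergent_duplicates",
    "unflagged_obsolescence",
    "traceability_gaps",
    "stale_segments",
)


@dataclass
class AxisResult:
    """One journal entry: how many documents were flagged on one axis."""
    axis: str
    flagged: int
    total: int
    alert_threshold: float  # assumed share of flagged documents that triggers an alert

    @property
    def rate(self) -> float:
        return self.flagged / self.total if self.total else 0.0

    @property
    def alert(self) -> bool:
        return self.rate > self.alert_threshold


def audit_journal(counts, total_docs, thresholds):
    """Build one entry per axis, defaulting to zero flagged documents."""
    return [
        AxisResult(axis, counts.get(axis, 0), total_docs, thresholds[axis])
        for axis in AXES
    ]


# Invented numbers for a 1,000-document sample.
counts = {
    "internal_anomalies": 40,
    "inter_document_conflicts": 25,
    "divergent_duplicates": 180,
    "unflagged_obsolescence": 120,
    "traceability_gaps": 300,
    "stale_segments": 50,
}
thresholds = {axis: 0.10 for axis in AXES}

journal = audit_journal(counts, total_docs=1000, thresholds=thresholds)
alerts = [entry.axis for entry in journal if entry.alert]

for entry in journal:
    print(f"{entry.axis}: {entry.rate:.1%} {'ALERT' if entry.alert else 'ok'}")
```

In practice each axis would likely carry its own threshold, and each journal entry would also record when the measurement was taken and on which repository.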

Going further

If you are launching a Microsoft 365 Copilot rollout on a sizeable SharePoint estate and want to objectively assess the corpus before committing the seat budget, reach out: contact@k-ai.ai. A focused diagnostic on a pilot repository yields, in a matter of days, an anomaly map and a remediation plan prioritized by expected Copilot impact.
