Evaluating Microsoft Copilot Studio-based RAG Agents with the Copilot Studio Evaluator

Building robust Retrieval-Augmented Generation (RAG) solutions is essential when leveraging Microsoft Copilot Studio for enterprise-grade AI scenarios. Whether you are creating a knowledge worker assistant, an internal FAQ bot, or a content synthesizer, ensuring groundedness and document retrieval performance is critical. The Copilot Studio Evaluator repository offers you a straightforward way to measure and improve these aspects of your RAG agents.

The repository builds upon the Microsoft 365 Agents C#/.NET SDK by adding an evaluation client.

Please refer to the M365 Agents SDK overview and the C#/.NET SDK documentation for more details on writing custom code that accesses Microsoft Copilot Studio-based agents.
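To make the idea concrete, here is a minimal sketch of what such an evaluation client can look like. The interface and class names below (ICopilotStudioAgent, EvaluationClient) are illustrative placeholders, not the actual types of the repository or the M365 Agents SDK; refer to the documentation linked above for the real client API.

```csharp
// Hypothetical sketch of an evaluation client layered on top of a Copilot Studio agent.
// ICopilotStudioAgent stands in for the SDK client; the interface, type names, and
// return shapes here are illustrative assumptions, not the real SDK or repository API.
using System.Collections.Generic;
using System.Threading.Tasks;

public interface ICopilotStudioAgent
{
    // Ask one question and get back the answer text plus the names of the cited documents.
    Task<(string Answer, IReadOnlyList<string> Citations)> AskAsync(string question);
}

public class EvaluationClient
{
    private readonly ICopilotStudioAgent _agent;

    public EvaluationClient(ICopilotStudioAgent agent) => _agent = agent;

    // Replay a batch of test questions against the live agent and record, for each one,
    // the generated answer and the documents it cited; scoring happens afterwards.
    public async Task<IReadOnlyList<(string Question, string Answer, IReadOnlyList<string> Citations)>>
        RunAsync(IEnumerable<string> questions)
    {
        var results = new List<(string, string, IReadOnlyList<string>)>();
        foreach (var question in questions)
        {
            var (answer, citations) = await _agent.AskAsync(question);
            results.Add((question, answer, citations));
        }
        return results;
    }
}
```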

Eventually, I hope this evaluator makes it into the M365 Agents SDK samples so that this standalone repository can be retired.

Overview of the Copilot Studio Evaluator

The Copilot Studio Evaluator is a toolkit designed to help you systematically evaluate Copilot Studio-based RAG agents.

Traditional QA and conversational AI evaluation often focuses on whether an answer is “correct” or “relevant.” But with RAG agents, there’s a second dimension: ensuring that the model’s answers are backed by accurate, up-to-date, and trustworthy sources. In other words, we need to ensure:

  • Groundedness: How well does the model’s answer align with the supporting evidence or reference documents?
  • Document Retrieval Accuracy: Did the model retrieve the most relevant documents for the question or query?

For more details, see the repository's README.md file.
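To illustrate what an evaluation needs as input, the sketch below shows one possible shape for a test case that covers both dimensions: the question, the evidence the answer should be grounded in, and the documents that should be retrieved. The record and field names are assumptions for illustration; the repository's actual test-set format is described in its README.

```csharp
using System.Collections.Generic;

// Hypothetical test-case shape covering both evaluation dimensions.
// The repository's actual input format may differ; see its README.md.
public record RagTestCase(
    string Question,                          // what the user asks
    string ExpectedEvidence,                  // statement the answer must be grounded in
    IReadOnlyList<string> ExpectedDocuments); // documents the agent should retrieve

public static class SampleTestData
{
    // Mirrors the return-policy example used later in this post.
    public static readonly RagTestCase ReturnPolicyCase = new(
        Question: "What is the return policy for product X?",
        ExpectedEvidence: "Product X has a 30-day return window.",
        ExpectedDocuments: new[] { "Product return policy" });
}
```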

Why This Matters

  • Regulation and compliance: Many industries have strict requirements around data usage and accuracy.
  • User Trust: Users need to trust that the bot isn’t “making things up” (hallucinating), especially in enterprise contexts.
  • Continuous Improvement: By identifying weaknesses in retrieval or reasoning, you can iteratively improve the agent’s performance.

Key Metrics: Groundedness and Document Retrieval

Groundedness

Groundedness refers to how faithfully the model’s response sticks to the source material. A response is considered “grounded” if it can be traced back to documents retrieved during the conversation. This mitigates issues like hallucinations or unsourced speculation.

Example

  • User Query: “What is the return policy for product X?”
  • Documents Retrieved: Company policy document stating product X has a 30-day return window.
  • Model Answer: “You can return product X within 30 days from the date of purchase.”

If the model’s answer matches or is well-supported by the official policy document, we consider it grounded.
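Groundedness is typically scored with an LLM judge or an entailment model. The snippet below deliberately uses a much simpler keyword-overlap heuristic just to make the idea tangible; the method names and scoring logic are illustrative assumptions, not how the Copilot Studio Evaluator computes its metric.

```csharp
using System;
using System.Linq;

public static class GroundednessHeuristic
{
    // Naive placeholder: fraction of answer tokens that also appear in the retrieved evidence.
    // Real evaluators typically use an LLM judge or entailment model instead of token overlap.
    public static double Score(string answer, string evidence)
    {
        static string[] Tokenize(string text) =>
            text.ToLowerInvariant()
                .Split(new[] { ' ', '.', ',', '!', '?' }, StringSplitOptions.RemoveEmptyEntries);

        var answerTokens = Tokenize(answer).Distinct().ToArray();
        var evidenceTokens = Tokenize(evidence).ToHashSet();

        if (answerTokens.Length == 0) return 0.0;
        return answerTokens.Count(evidenceTokens.Contains) / (double)answerTokens.Length;
    }
}

// Applied to the example above: the answer is well supported by the policy document.
// var score = GroundednessHeuristic.Score(
//     "You can return product X within 30 days from the date of purchase.",
//     "Product X has a 30-day return window from the date of purchase.");
```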

Document Retrieval Accuracy

Document Retrieval Accuracy measures how effectively your RAG agent retrieves the right documents. Even if your model summarizes or answers well from the text it is given, the correct references must be fetched in the first place, especially when your knowledge base is large.

Example

  • Available Documents:
    1. Product return policy (relevant)
    2. HR policy (irrelevant for the user’s question)
  • System Retrieval: If the system successfully retrieves the product return policy (and not the HR policy), that’s a positive retrieval outcome.
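One common way to turn this into a number is to compute precision and recall over document identifiers, comparing what was retrieved against what should have been retrieved. The sketch below shows that generic calculation; it is an illustrative assumption, not the repository's actual metric implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RetrievalMetrics
{
    // Generic precision/recall over document identifiers; the evaluator's own metric
    // definitions may differ, so treat this as an illustrative sketch.
    public static (double Precision, double Recall) Score(
        IReadOnlyCollection<string> retrieved,
        IReadOnlyCollection<string> expected)
    {
        var expectedSet = expected.ToHashSet(StringComparer.OrdinalIgnoreCase);
        int relevantRetrieved = retrieved.Count(expectedSet.Contains);

        double precision = retrieved.Count == 0 ? 0.0 : relevantRetrieved / (double)retrieved.Count;
        double recall = expected.Count == 0 ? 1.0 : relevantRetrieved / (double)expected.Count;
        return (precision, recall);
    }
}

// For the example above: retrieving only the product return policy
// (and not the HR policy) yields precision = 1.0 and recall = 1.0.
// var (p, r) = RetrievalMetrics.Score(
//     retrieved: new[] { "Product return policy" },
//     expected:  new[] { "Product return policy" });
```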