How To Evaluate Agents

This guide is for domain experts evaluating the performance of their Agent. It covers how to review feedback, review the automated evaluation metrics, and diagnose any issues.

1. Accessing the Agent

To access your Agent, head to your organisation's URL for the Great Wave AI Platform and log in with your username and password or via Microsoft single sign-on. The URL differs by organisation and will typically be one of:

  • app.greatwave.ai

  • [domain].greatwave.ai

  • greatwave.[domain].com

Once you've logged in:

  1. On the left-hand side of the page, set the "Agents by User" filter to "All".

  2. Either search for the Agent you want to investigate or filter by tag under "Filter Agents".

  3. Find the Agent you want to investigate and click "Select".

  4. Navigate to the Evaluation page using the panel on the left-hand side of the screen.

2. Filtering for Issues

2.1 Filtering for Live vs Test Usage

You can filter on Live vs Test usage.

  • Live usage is from anyone using the Agent via the API endpoints (an illustrative call is sketched after this list) or via the live chat front-end.

  • Test usage is from anyone using the Agent via the Great Wave AI Platform (e.g., via the Instruct screen or the Design screen).
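For reference, "via the API endpoints" means programmatic calls to your Agent from outside the platform. The snippet below is a minimal sketch only: the endpoint path, authentication header, and payload shape are hypothetical placeholders rather than the documented Great Wave AI API, so substitute your own integration details.

```python
# Illustrative only: the endpoint, auth scheme, and payload below are
# hypothetical placeholders, not the documented Great Wave AI API.
import requests

response = requests.post(
    "https://app.greatwave.ai/api/agents/<agent-id>/chat",  # hypothetical endpoint
    headers={"Authorization": "Bearer <your-api-key>"},     # hypothetical auth header
    json={"message": "What's the annual leave allowance?"},
)
print(response.json())
```

Calls like this (and messages sent through the live chat front-end) are what appear under Live usage on the Evaluation page.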

2.2 Filtering for Red Flags

If people have given any negative feedback (see How To Give Agent Feedback), it will show up on the Evaluation page as a Red Flag. You can filter for all Red Flags.

Red Flags are also highlighted in red on the Evaluation screen.

2.3 Filtering on Automated Evaluation

You can also sort on the automated evaluation metrics (see Evaluate for more details on the metrics) by clicking on a metric; this orders the scores high-to-low or low-to-high.

Recommendation on which automated evaluation metrics to check:

Relevance: Responses that score low on relevance (did the Response answer the Question?) point to questions people have asked that may not be covered by your source information. These are users looking for information they are struggling to find, which makes low relevance a powerful flag for where to create additional content in the future.

Data Quality: Responses that score low on data quality may indicate discrepancies in the data set that are worth investigating (e.g., contradictory information). This is a powerful flag to help you understand whether data needs reviewing and fixing.
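If you prefer to triage outside the platform, the same low-to-high ordering is easy to reproduce on exported records. A minimal sketch follows, assuming you have the Question & Response records as a list of dicts; the field names used here ("question", "relevance", "data_quality") are hypothetical and won't necessarily match the platform's export schema.

```python
# Hypothetical exported evaluation records; field names are illustrative only.
records = [
    {"question": "What's the annual leave allowance?", "relevance": 0.41, "data_quality": 0.35},
    {"question": "How do I book a meeting room?", "relevance": 0.92, "data_quality": 0.88},
]

# Lowest relevance first: questions your source information may not cover.
low_relevance = sorted(records, key=lambda r: r["relevance"])[:10]

# Lowest data quality first: candidates for contradictory or stale source data.
low_data_quality = sorted(records, key=lambda r: r["data_quality"])[:10]

for r in low_relevance:
    print(f'{r["relevance"]:.2f}  {r["question"]}')
```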

3. Understanding the Issues

Once you've filtered and identified the Question & Response pairing you want to investigate, click on it to open it.

Here you can see:

  • Question: the Question asked by the user

  • Response: the Response given by the Agent

  • Agent Context (Source data): this is the information the Agent used to give its Response. This can be:

    • Knowledge that you've attached to the Agent (see Knowledge Domain)

    • Responses from other Agents

    • Responses from an API call (see API (Agent))

  • Related Agent Queries: these are other Agents in the chain that were called.

  • Guardrules: gives detail on whether any Guardrules were invoked (see Guardrules)

  • Scoring: gives further detail on the automated evaluation metrics.

  • User Feedback: shows any user feedback given (see How To Give Agent Feedback)

  • Review: allows you, as an evaluator, to tag and keep track of any issues once you've understood them

3.1 Diagnosing Knowledge

Within the "Agent Context (Source data)" section, you can delve into the chunks of knowledge that the Agent used to generate its Response.

If, within the Agent Context (Source data) section, you see a blue box called "Agent-Response.ai", it's likely that no Knowledge was retrieved and you're looking at the responses from other Agents. If you were expecting Knowledge, it's likely that other Agents in the chain are the ones searching through the Knowledge. See below for how to open up the other Agents.

You can:

  • Open up the chunks of knowledge that were retrieved.

  • Search through the chunks of knowledge that were retrieved.

This allows you to understand whether the Agent was wrong because (a) the Knowledge/Documents themselves are wrong, or (b) the Knowledge/Documents were right and something else is going wrong.

Example on Diagnosing Knowledge

In this example, imagine we've built an Agent to respond to HR queries and we've given it Knowledge of our HR Policies.

  • The question from the user was: "What's the annual leave allowance?"

  • The answer from the Agent was: "Full-time employees are entitled to 25 days of paid annual leave per year, in addition to the 8 UK bank holidays."

  • The feedback was: "Sam H || it should be 28 days + bank holidays"

We can search for "25 days" to see where the Agent got that figure from.

Here we can see that the HR policy named "DemoCo - HR Policy.docx" states 25 days, so we have an issue in the underlying data that we need to fix.
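The same check can be scripted if you copy the retrieved chunks out of the Agent Context (Source data) panel. Below is a minimal sketch in which the chunk text and document names are illustrative stand-ins for whatever your Agent actually retrieved.

```python
# Minimal sketch: search retrieved knowledge chunks for the figure the Agent quoted.
# Document names and chunk text are illustrative stand-ins.
chunks = [
    ("DemoCo - HR Policy.docx", "Full-time employees are entitled to 25 days of paid annual leave per year."),
    ("DemoCo - Onboarding Guide.docx", "Annual leave must be booked through the HR portal."),
]

term = "25 days"
matches = [(doc, text) for doc, text in chunks if term in text]

if matches:
    # The figure is present in the source data, so the underlying document needs fixing.
    for doc, text in matches:
        print(f"{doc}: {text}")
else:
    # The figure isn't in the retrieved chunks, so the issue lies elsewhere
    # (e.g., retrieval, other Agents in the chain, or the Agent's instructions).
    print(f'"{term}" not found in the retrieved chunks.')
```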

3.2 Diagnosing Other Agents

You can understand which Agents responded to a query and what their response was by clicking into the Agent under the "Related Agent Queries".

This will open up that Agent and you can diagnose issues in the same way.
