Evaluate
The Evaluate module in the Great Wave AI Studio is a crucial tool for assessing the performance and accuracy of your AI agents. This guide explains how to navigate and use its features to evaluate your agents effectively.
Observe Screen
The Observe screen allows you to monitor and evaluate specific performance metrics for each interaction with the AI agent.
1. Factuality
Purpose: To assess whether all claims made in the response are directly inferable from the retrieved chunks.
Scoring Guide:
1: There are no true claims, or more than 3 claims are false amongst multiple other factual claims.
2: 3 claims are false amongst multiple other factual claims.
3: 2 claims are false amongst multiple other factual claims.
4: 1 claim is false amongst multiple other factual claims.
5: All claims are factual.
Data used in Evaluation:
Response
Retrieved Chunks
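As an illustration of how this rubric maps claim counts to a 1–5 score, here is a minimal sketch. The function name and inputs are hypothetical; the Studio computes this scoring internally.

```python
def factuality_score(true_claims: int, false_claims: int) -> int:
    """Map claim counts to the 1-5 Factuality rubric (illustrative only)."""
    if true_claims == 0 or false_claims > 3:
        return 1  # no true claims, or more than 3 false claims
    if false_claims == 0:
        return 5  # all claims are factual
    return 5 - false_claims  # 1 false -> 4, 2 false -> 3, 3 false -> 2
```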
2. Relevance
Purpose: To determine whether the response has completed the task in line with the input.
Scoring Guide:
1: It fails to answer in line with the input at all.
2: It answers less than half of the input.
3: It answers about half of the input.
4: It answers most of the input, with some omissions.
5: It answers the input fully.
Data used in Evaluation:
Input
Response
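For intuition, the bands can be read as a coverage scale. A minimal sketch, assuming coverage is estimated as the fraction of the input's asks that the response addresses; the thresholds are illustrative assumptions, not the Studio's internals.

```python
def relevance_score(coverage: float) -> int:
    """Map input coverage (0.0-1.0) to the 1-5 Relevance rubric."""
    if coverage <= 0.0:
        return 1  # fails to answer in line with the input at all
    if coverage < 0.45:
        return 2  # answers less than half of the input
    if coverage <= 0.55:
        return 3  # answers about half of the input
    if coverage < 1.0:
        return 4  # answers most of the input, with some omissions
    return 5      # answers the input fully
```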
3. Adherence
Purpose: To assess whether the response follows the instructions provided.
Scoring Guide:
1: The response conflicts with all of the instructions.
2: The response conflicts with the majority of the instructions.
3: The response conflicts with some of the instructions.
4: The majority of the instructions have been followed.
5: All of the instructions have been followed.
Data used in Evaluation:
Instructions
Response
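Read as a compliance ratio, the rubric could be sketched as follows. This is an illustrative mapping under assumed inputs (counts of followed versus total instructions), not the Studio's implementation.

```python
def adherence_score(followed: int, total: int) -> int:
    """Map instruction compliance to the 1-5 Adherence rubric."""
    if total == 0:
        return 5  # assumption: nothing to follow counts as full adherence
    if followed == total:
        return 5  # all instructions followed
    if followed == 0:
        return 1  # the response conflicts with all instructions
    if followed > total / 2:
        return 4  # the majority of instructions followed
    if followed < total / 2:
        return 2  # the majority of instructions conflict
    return 3      # evenly mixed: some instructions conflict
```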
4. Data Quality
Purpose: To evaluate whether the retrieved data is consistent and contains the fundamental information required to address the input or context.
Scoring Guide:
1: Retrieved chunks had more than 3 contradictions of any kind.
2: Retrieved chunks had more than 1 major contradiction, or more than 2 minor contradictions.
3: Retrieved chunks had no more than 2 minor contradictions.
4: Retrieved chunks had 1 minor contradiction.
5: Retrieved chunks had no contradictions.
Data used in Evaluation:
Context
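A sketch of the rubric as a function of contradiction counts is below. Note the assumption, flagged in the docstring, about how the unqualified "contradiction" in band 2 is read; the function and its inputs are illustrative only.

```python
def data_quality_score(major: int, minor: int) -> int:
    """Map contradiction counts in retrieved chunks to the 1-5 rubric.

    Assumption: an unqualified 'contradiction' in band 2 is read as a
    major (non-minor) contradiction, so that band 3 can still allow up
    to 2 minor ones.
    """
    if major + minor > 3:
        return 1  # more than 3 contradictions of any kind
    if major > 1 or minor > 2:
        return 2
    if major == 0 and minor == 0:
        return 5  # no contradictions
    if major == 0 and minor == 1:
        return 4  # a single minor contradiction
    return 3      # at most 2 minor (or 1 major) contradictions
```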
5. Input Clarity
Purpose: To evaluate the clarity and quality of the input in relation to the instructions provided.
Scoring Guide:
1: Input is incomprehensible.
2: Input makes little sense.
3: Input is somewhat understandable.
4: Input is clear.
5: Input is very clear.
Data used in Evaluation:
Input
Instructions
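Qualitative metrics like this are typically scored by an LLM judge. Below is a hedged sketch of what such a judging prompt could look like; the template and names are hypothetical, not the Studio's actual prompt.

```python
CLARITY_PROMPT = """\
You are grading the clarity of a user input, given the agent's instructions.
Score from 1 to 5:
1 = incomprehensible, 2 = makes little sense, 3 = somewhat understandable,
4 = clear, 5 = very clear.

Instructions:
{instructions}

Input:
{user_input}

Reply with a single integer from 1 to 5."""

def build_clarity_prompt(user_input: str, instructions: str) -> str:
    """Fill the judging template with a concrete input/instructions pair."""
    return CLARITY_PROMPT.format(instructions=instructions, user_input=user_input)
```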
Features for Data Interaction:
Flag, Filter, and Sort: Users can flag specific interactions for review, filter the data based on criteria such as scores or dates, and sort the interactions to prioritize certain entries for review.
Detailed Inspection: By clicking into a row, users can view detailed information about that specific interaction. This includes the source data chunks used by the agent to formulate its response and explanations on how the evaluation scores were computed.
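If you export interaction data for offline analysis, the same flag/filter/sort workflow is easy to reproduce. A minimal sketch, assuming a hypothetical record shape (not the Studio's export format):

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    id: str
    date: str        # ISO date, e.g. "2024-05-01"
    factuality: int  # 1-5
    relevance: int   # 1-5
    flagged: bool = False

def review_queue(rows: list[Interaction], threshold: int = 3) -> list[Interaction]:
    """Filter to flagged or low-factuality interactions, worst scores first."""
    hits = [r for r in rows if r.flagged or r.factuality < threshold]
    return sorted(hits, key=lambda r: (r.factuality, r.date))
```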
Using the Evaluation Section Effectively
Regular Monitoring: Regularly use the Observe screen to keep track of your agent's performance metrics and identify any trends or areas for improvement.
Adjust Based on Feedback: Use the feedback from the Observe screen, together with user-driven feedback, to refine your agent's responses, instructions, and grounding data.
Iterative Improvement: The Evaluation section is designed to support an iterative approach to improving your AI agent, enabling continuous enhancements based on detailed, actionable insights.
By effectively using the Evaluation section of the Great Wave AI Studio, you can ensure that your AI agents are performing optimally, aligning with your objectives, and consistently delivering high-quality, reliable outputs.