Evaluation Quickstart: Evaluate Your Agent! 🎯

This tutorial shows you how to use the Amazon Bedrock AgentCore starter toolkit CLI to evaluate your deployed agent's performance. You'll learn how to run on-demand evaluations and set up continuous monitoring with online evaluation.

The evaluation CLI provides commands to assess agent quality using built-in evaluators (like helpfulness and goal success) or create custom evaluators for your specific needs.

📚 For comprehensive details, see the AgentCore Evaluation Documentation

Prerequisites

Before you start, make sure you have:

  - An agent deployed to Amazon Bedrock AgentCore Runtime (ideally with its .bedrock_agentcore.yaml configuration file in your working directory)
  - AWS credentials configured with access to Amazon Bedrock AgentCore and CloudWatch
  - Python and pip installed

Step 1: Install the Toolkit

Install the AgentCore starter toolkit:

pip install bedrock-agentcore-starter-toolkit
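
If you'd rather keep the toolkit isolated from your system Python, create a virtual environment first and install inside it (standard Python tooling, nothing toolkit-specific):

# Optional: install into a dedicated virtual environment
python -m venv .venv
source .venv/bin/activate
pip install bedrock-agentcore-starter-toolkit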

Verify installation:

agentcore eval --help

Success: You should see the evaluation command options.

Step 2: List Available Evaluators

View all available built-in and custom evaluators:

agentcore eval evaluator list

Success: You should see a table of evaluators:

Built-in Evaluators (13)

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ ID                            โ”ƒ Name           โ”ƒ Level      โ”ƒ Description    โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ Builtin.GoalSuccessRate       โ”‚ Builtin.GoalSโ€ฆ โ”‚ SESSION    โ”‚ Task           โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Completion     โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Metric.        โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Evaluates      โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ whether the    โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ conversation   โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ successfully   โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ meets the      โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ user's goals   โ”‚
โ”‚ Builtin.Helpfulness           โ”‚ Builtin.Helpfโ€ฆ โ”‚ TRACE      โ”‚ Response       โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Quality        โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Metric.        โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Evaluates from โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ user's         โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ perspective    โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ how useful and โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ valuable the   โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ agent's        โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ response is    โ”‚
โ”‚ Builtin.Correctness           โ”‚ Builtin.Correโ€ฆ โ”‚ TRACE      โ”‚ Response       โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Quality        โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Metric.        โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ Evaluates      โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ whether the    โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ information in โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ the agent's    โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ response is    โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ factually      โ”‚
โ”‚                               โ”‚                โ”‚            โ”‚ accurate       โ”‚
...

Total: 13 builtin evaluators

Understanding Evaluator Levels:

  - SESSION: Evaluates the entire conversation (e.g., goal completion)
  - TRACE: Evaluates individual responses (e.g., helpfulness, correctness)
  - TOOL_CALL: Evaluates tool selection and parameters
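
To pull out just the evaluators at a given level, you can filter the table output with standard shell tools (a convenience sketch; grep matches rendered table rows, so wrapped description lines are omitted):

# Show only rows whose first line contains the TRACE level
agentcore eval evaluator list | grep "TRACE"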

Step 3: Run Your First Evaluation

Run an on-demand evaluation on your agent:

agentcore eval run --evaluator "Builtin.Helpfulness"

This automatically uses the agent ID and session ID from your .bedrock_agentcore.yaml configuration file.

Note: You'll see a "Using session from config: <session-id>" message confirming that the session ID was loaded from your configuration file.

Success: You should see evaluation results:

Using session from config: 383c4a9d-5682-4186-a125-e226f9f6c141

Evaluating session: 383c4a9d-5682-4186-a125-e226f9f6c141
Mode: All traces (most recent 1000 spans)
Evaluators: Builtin.Helpfulness

╭──────────────────────────────────────────────────────────────────────────────╮
│ Evaluation Results                                                           │
│ Session: 383c4a9d-5682-4186-a125-e226f9f6c141                                │
╰──────────────────────────────────────────────────────────────────────────────╯

✓ Successful Evaluations

╭──────────────────────────────────────────────────────────────────────────────╮
│                                                                              │
│  Evaluator: Builtin.Helpfulness                                              │
│                                                                              │
│  Score: 0.83                                                                 │
│  Label: Very Helpful                                                         │
│                                                                              │
│  Explanation:                                                                │
│  The assistant's response effectively addresses the user's request by        │
│  providing comprehensive analysis...                                         │
│                                                                              │
│  Token Usage:                                                                │
│    - Input: 927                                                              │
│    - Output: 233                                                             │
│    - Total: 1,160                                                            │
│                                                                              │
│  Evaluated:                                                                  │
│    - Session: 383c4a9d-5682-4186-a125-e226f9f6c141                           │
│    - Trace: 6929ecf956ccc60c19c9a548698ae116                                 │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯

Multiple Evaluators

Evaluate with multiple evaluators simultaneously:

agentcore eval run \
  --evaluator "Builtin.Helpfulness" \
  --evaluator "Builtin.GoalSuccessRate" \
  --evaluator "Builtin.Correctness"

Save Results

Export evaluation results to JSON:

agentcore eval run \
  --evaluator "Builtin.Helpfulness" \
  --output results.json

This creates two files:

  - results.json - Evaluation scores and explanations
  - results_input.json - Input data used for evaluation
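
The exact schema of these files isn't documented in this quickstart, so the quickest way to explore them is to pretty-print the raw JSON (python -m json.tool ships with Python; jq is an optional local tool, not part of the toolkit):

# Pretty-print the saved evaluation results
python -m json.tool results.json

# With jq installed, list the top-level structure to explore further
jq 'keys' results.json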

Step 4: Set Up Continuous Monitoring

Enable automatic evaluation of live agent traffic with online evaluation:

agentcore eval online create \
  --name production_eval_config \
  --sampling-rate 1.0 \
  --evaluator "Builtin.GoalSuccessRate" \
  --evaluator "Builtin.Helpfulness" \
  --description "Production evaluation for my agent"

Note: The agent ID is automatically detected from your .bedrock_agentcore.yaml configuration file. To explicitly specify an agent, add --agent-id <your-agent-id>.

Parameters:

  - --sampling-rate: Percentage of interactions to evaluate (0.01-100). Start with 1-5% for production.
  - --evaluator: Evaluator IDs (specify multiple times)
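
When picking a rate, it helps to estimate the evaluation volume it implies. For example, a sampling rate of 1.0 (i.e., 1%) on an agent handling 10,000 interactions per day yields roughly 100 evaluated interactions per day:

# Expected evaluations/day = interactions/day * (sampling rate / 100)
echo $(( 10000 * 1 / 100 ))   # -> 100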

Success: You should see:

Creating online evaluation config: production_eval_config
Agent ID: agent_lg-EVQuBO6Q0n
Region: us-east-1
Sampling Rate: 1.0%
Evaluators: ['Builtin.GoalSuccessRate', 'Builtin.Helpfulness']
Endpoint: DEFAULT

✓ Online evaluation config created successfully!

Config ID: production_eval_config-2HeyEjChSQ
Config Name: production_eval_config
Status: CREATING
Execution Role: arn:aws:iam::730335462089:role/AgentCoreEvalsSDK-us-east-1-4b7eba641e
Output Log Group: /aws/bedrock-agentcore/evaluations/results/production_eval_config-2HeyEjChSQ

Notes:

  - If an IAM execution role doesn't exist, it will be auto-created
  - The config starts in CREATING status and transitions to ACTIVE within a few seconds
  - Save the Config ID - you'll need it to manage this configuration
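
Evaluation results are written to the Output Log Group shown above. One way to watch them arrive is with the AWS CLI v2 (substitute the log group name from your own create output):

# Follow evaluation results as they are written (AWS CLI v2)
aws logs tail /aws/bedrock-agentcore/evaluations/results/production_eval_config-2HeyEjChSQ --follow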

Step 5: Monitor Evaluation Results

View Your Configurations

List all online evaluation configurations:

agentcore eval online list

You should see a table showing your configurations:

Found 2 online evaluation config(s)

โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ณโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”“
โ”ƒ Config Name      โ”ƒ Config ID        โ”ƒ Status โ”ƒ Execution โ”ƒ Created           โ”ƒ
โ”กโ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ•‡โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”ฉ
โ”‚ production_evalโ€ฆ โ”‚ production_evalโ€ฆ โ”‚ ACTIVE โ”‚ ENABLED   โ”‚ 2025-11-28        โ”‚
โ”‚                  โ”‚                  โ”‚        โ”‚           โ”‚ 10:47:56.055000-โ€ฆ โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Get Configuration Details

View details about a specific configuration:

agentcore eval online get --config-id production_eval_config-2HeyEjChSQ

You should see detailed configuration information:

Config Name: production_eval_config
Config ID: production_eval_config-2HeyEjChSQ
Status: ACTIVE
Execution Status: ENABLED
Sampling Rate: 1.0%
Evaluators: Builtin.GoalSuccessRate, Builtin.Helpfulness
Execution Role: arn:aws:iam::730335462089:role/AgentCoreEvalsSDK-us-east-1-4b7eba641e

Output Log Group: /aws/bedrock-agentcore/evaluations/results/production_eval_config-2HeyEjChSQ

Description: Production evaluation for my agent

Replace production_eval_config-2HeyEjChSQ with your configuration ID from Step 4.

View Results in CloudWatch

  1. Open the CloudWatch Console
  2. Navigate to GenAI Observability → Bedrock AgentCore
  3. Select your agent and endpoint
  4. View the Evaluations tab for detailed results
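
If you prefer the terminal over the console, the same result logs can be queried ad hoc with CloudWatch Logs Insights via the AWS CLI. The query string below is a generic sketch (the exact result fields aren't documented in this quickstart), and the log group name comes from your create output:

# Start a Logs Insights query over the last hour of results
query_id=$(aws logs start-query \
  --log-group-name /aws/bedrock-agentcore/evaluations/results/production_eval_config-2HeyEjChSQ \
  --start-time $(($(date +%s) - 3600)) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | sort @timestamp desc | limit 20' \
  --output text --query queryId)

# Fetch the results once the query completes (usually within a few seconds)
aws logs get-query-results --query-id "$query_id"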

Alternative: Without Configuration File

If you don't have a .bedrock_agentcore.yaml configuration file (or want to evaluate a different agent/session), you can explicitly specify the agent ID and session ID:

Run Evaluation

agentcore eval run \
  --agent-id agent_myagent-ABC123xyz \
  --session-id 550e8400-e29b-41d4-a716-446655440000 \
  --evaluator "Builtin.Helpfulness"

Replace agent_myagent-ABC123xyz with your agent ID and 550e8400-e29b-41d4-a716-446655440000 with your session ID.

Create Online Evaluation

agentcore eval online create \
  --name production_eval_config \
  --agent-id agent_myagent-ABC123xyz \
  --sampling-rate 1.0 \
  --evaluator "Builtin.GoalSuccessRate" \
  --evaluator "Builtin.Helpfulness"

This approach is useful when:

  - You deployed your agent outside of AgentCore Runtime
  - You want to evaluate a specific session (not the latest)
  - You're evaluating multiple agents and need to switch between them
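
If you need to evaluate several sessions this way, a small shell loop over the same documented flags keeps the runs repeatable (the session IDs below are placeholders; list real ones with agentcore obs list):

# Placeholder session IDs -- substitute real ones from `agentcore obs list`
for session in \
  550e8400-e29b-41d4-a716-446655440000 \
  6ba7b810-9dad-11d1-80b4-00c04fd430c8; do
  agentcore eval run \
    --agent-id agent_myagent-ABC123xyz \
    --session-id "$session" \
    --evaluator "Builtin.Helpfulness" \
    --output "results_${session}.json"
done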

Next Steps

Create Custom Evaluators

Create domain-specific evaluators for your use case. First, create a configuration file evaluator-config.json:

{
  "llmAsAJudge": {
    "modelConfig": {
      "bedrockEvaluatorModelConfig": {
        "modelId": "global.anthropic.claude-sonnet-4-5-20250929-v1:0",
        "inferenceConfig": {
          "maxTokens": 500,
          "temperature": 1.0
        }
      }
    },
    "ratingScale": {
      "numerical": [
        {
          "value": 0.0,
          "label": "Poor",
          "definition": "Response is unhelpful or incorrect"
        },
        {
          "value": 0.5,
          "label": "Adequate",
          "definition": "Response is partially helpful"
        },
        {
          "value": 1.0,
          "label": "Excellent",
          "definition": "Response is highly helpful and accurate"
        }
      ]
    },
    "instructions": "Evaluate the assistant's response for helpfulness and accuracy. Context: {context}. Target to evaluate: {assistant_turn}"
  }
}

Then create the evaluator:

agentcore eval evaluator create \
  --name "my_custom_evaluator" \
  --config evaluator-config.json \
  --level TRACE \
  --description "Custom evaluator for my use case"

Update Online Evaluation Configuration

Modify existing online evaluation configurations to adjust sampling rates, evaluators, or status:

# Change sampling rate
agentcore eval online update \
  --config-id production_eval_config-2HeyEjChSQ \
  --sampling-rate 5.0

# Disable temporarily
agentcore eval online update \
  --config-id production_eval_config-2HeyEjChSQ \
  --status DISABLED

# Update evaluators
agentcore eval online update \
  --config-id production_eval_config-2HeyEjChSQ \
  --evaluator "Builtin.Correctness" \
  --evaluator "Builtin.Faithfulness"

Replace production_eval_config-2HeyEjChSQ with your configuration ID from Step 4.

Troubleshooting

"No agent specified" or Agent ID not found

Problem: Agent ID cannot be loaded from configuration file.

Solution: You can specify the agent ID explicitly:

# Find your agent ID from deployment
agentcore status

# Or specify it directly
agentcore eval run \
  --agent-id agent_myagent-ABC123xyz \
  --evaluator "Builtin.Helpfulness"

For online evaluation:

agentcore eval online create \
  --name my_eval_config \
  --agent-id agent_myagent-ABC123xyz \
  --evaluator "Builtin.Helpfulness"

"No session ID provided"

Problem: Session ID cannot be loaded from configuration file.

Solution: Find and specify a session ID explicitly:

# List recent sessions using observability
agentcore obs list

# This will show output like:
# Session ID: 550e8400-e29b-41d4-a716-446655440000
# Trace Count: 5
# Start Time: 2024-11-28 10:30:00

# Use a session ID from the list
agentcore eval run \
  --session-id 550e8400-e29b-41d4-a716-446655440000 \
  --evaluator "Builtin.Helpfulness"

"No spans found for session"

Problem: The session ID exists in config but no observability data is available.

Common Causes:

  - Session is older than 7 days (default lookback period)
  - Session hasn't completed yet
  - Observability was not enabled when the session ran
  - CloudWatch logs haven't populated yet (2-5 minute delay after agent invocation)

Note: By default, the CLI looks back 7 days for session data. If your session is older, use --days to extend the lookback period (observability data is retained for up to 30 days).

Solution: Run a new agent interaction to generate fresh session data:

# Step 1: Invoke your agent to create a new session
agentcore invoke --input "Tell me about AWS"

# Step 2: Wait 2-5 minutes for CloudWatch logs to populate
# CloudWatch ingestion has a delay before logs become available

# Step 3: Run evaluation after waiting
agentcore eval run --evaluator "Builtin.Helpfulness"

Important: There is typically a 2-5 minute delay between invoking your agent and when the observability data becomes available in CloudWatch for evaluation. If you get "No spans found", wait a few minutes and try again.
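
If you script this invoke-then-evaluate flow, building the wait in keeps it reliable (the 5-minute sleep is a conservative choice based on the 2-5 minute window above):

# Invoke, wait out the CloudWatch ingestion delay, then evaluate
agentcore invoke --input "Tell me about AWS"
sleep 300
agentcore eval run --evaluator "Builtin.Helpfulness"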

For older sessions (8-30 days old), extend the lookback period:

# Evaluate a session from 14 days ago
agentcore eval run \
  --evaluator "Builtin.Helpfulness" \
  --days 14

# Or with explicit session ID
agentcore eval run \
  --session-id <your-old-session-id> \
  --evaluator "Builtin.Helpfulness" \
  --days 30

Verify an older session exists before evaluating:

agentcore obs list --session-id <your-session-id> --days 30

"ValidationException: config name must match pattern"

Solution: Use underscores instead of hyphens in configuration names (e.g., my_config not my-config).
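
If you generate configuration names programmatically, normalizing them before the create call avoids this error entirely (a minimal sketch):

# Replace hyphens with underscores so the name passes validation
raw_name="my-eval-config"
config_name=$(echo "$raw_name" | tr '-' '_')
agentcore eval online create --name "$config_name" --evaluator "Builtin.Helpfulness"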