Agent evaluation tests whether an agent completes tasks correctly, avoids unsafe output, calls tools accurately, and stays grounded in available knowledge. Running evaluation before deployment reduces the risk of issues in production.

What to evaluate

Lyzr Agent Eval assesses agents across two categories of metrics:

Agent metrics: Task Completion, Hallucination Rate, Bias, Toxicity, Faithfulness, Reflection, and LLM-as-Judge.

Tool and Knowledge Base metrics: These evaluate interactions with connected tools and databases, including whether the agent calls the right tool, passes correct arguments, and retrieves relevant content.
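To make the tool metrics concrete, here is a minimal Python sketch of how a "right tool, correct arguments" check could be scored. It is not Lyzr's implementation: `ToolCall`, `score_tool_call`, the metric names, and the example tool `search_orders` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation (not a Lyzr type)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def score_tool_call(expected: ToolCall, actual: ToolCall) -> dict[str, float]:
    """Score tool selection and the fraction of expected arguments that match."""
    right_tool = expected.name == actual.name
    if not right_tool:
        # Arguments are meaningless if the wrong tool was called.
        arg_score = 0.0
    elif not expected.arguments:
        # Nothing to check beyond the tool name.
        arg_score = 1.0
    else:
        matched = sum(
            1
            for key, value in expected.arguments.items()
            if key in actual.arguments and actual.arguments[key] == value
        )
        arg_score = matched / len(expected.arguments)
    return {
        "tool_selection": 1.0 if right_tool else 0.0,
        "argument_accuracy": arg_score,
    }


expected = ToolCall("search_orders", {"order_id": "A1", "limit": 5})
actual = ToolCall("search_orders", {"order_id": "A1", "limit": 10})
print(score_tool_call(expected, actual))
# {'tool_selection': 1.0, 'argument_accuracy': 0.5}
```

Scoring tool selection and argument accuracy separately makes it easier to tell a routing problem (wrong tool) from a parameter-extraction problem (right tool, wrong arguments).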

Evaluation workflow

Create environment
  → define scenarios and personas
  → generate test cases (automated or manual)
  → select metrics
  → run tests
  → review scored results
  → update instructions, model, tools, or guardrails
  → re-run until the agent passes
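
The last three steps of this workflow form a loop: run, review, update, re-run. The Python sketch below shows that loop in the abstract. The callables `run_tests` and `revise`, the `max_rounds` cap, and the config dictionary are assumptions for illustration and do not correspond to any Lyzr API.

```python
from typing import Callable


def evaluate_until_pass(
    run_tests: Callable[[dict], list[bool]],
    revise: Callable[[dict, list[bool]], dict],
    config: dict,
    max_rounds: int = 5,
) -> tuple[dict, int, bool]:
    """Re-run the suite after each revision until every case passes or rounds run out.

    Returns the last configuration that was tested, the number of rounds used,
    and whether the suite passed.
    """
    for round_number in range(1, max_rounds + 1):
        results = run_tests(config)
        # An empty result list is treated as "not passed": nothing was verified.
        if results and all(results):
            return config, round_number, True
        if round_number < max_rounds:
            config = revise(config, results)
    return config, max_rounds, False


# Toy stand-ins: the "agent" only passes once its instructions reach version 3.
def toy_run_tests(config: dict) -> list[bool]:
    return [config["instruction_version"] >= 3, True]


def toy_revise(config: dict, results: list[bool]) -> dict:
    return {**config, "instruction_version": config["instruction_version"] + 1}


final_config, rounds, passed = evaluate_until_pass(
    toy_run_tests, toy_revise, {"instruction_version": 1}
)
print(final_config, rounds, passed)
# {'instruction_version': 3} 3 True
```

Capping the number of rounds is a design choice in this sketch: if the suite still fails after several revisions, the cause is often structural (model choice, missing tool, unrealistic expected output) and needs a manual look rather than another automated pass.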

Environments

Each agent can have multiple evaluation environments, one per stage or use case. When you create an environment, Lyzr automatically generates scenarios and personas based on the agent’s role and goal. You can add your own scenarios and personas on top of the generated set, or import test cases by downloading and filling in the CSV template.

Test case generation

Test cases are generated from scenario and persona combinations. For each test case, Lyzr creates both a user input and an expected output. If memory is enabled on the agent, Lyzr generates conversational multi-turn simulations. If memory is disabled, it generates single-turn test cases.
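
As a rough illustration of the combination logic, the Python sketch below pairs scenarios with personas and picks the test format from the memory setting. It assumes a full cross product of scenarios and personas, which the platform may or may not use; `CaseSpec`, `plan_test_cases`, and the sample scenario and persona names are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class CaseSpec:
    """Hypothetical description of one planned test case."""
    scenario: str
    persona: str
    mode: str  # "multi_turn" or "single_turn"


def plan_test_cases(
    scenarios: list[str], personas: list[str], memory_enabled: bool
) -> list[CaseSpec]:
    """Pair every scenario with every persona; memory decides the test format."""
    mode = "multi_turn" if memory_enabled else "single_turn"
    return [
        CaseSpec(scenario, persona, mode)
        for scenario, persona in product(scenarios, personas)
    ]


scenarios = ["refund request", "order status lookup"]
personas = ["first-time customer", "frustrated customer", "enterprise admin"]

plan = plan_test_cases(scenarios, personas, memory_enabled=True)
print(len(plan))     # 6
print(plan[0].mode)  # multi_turn
```

Under this assumption the number of test cases grows multiplicatively, so adding one persona to an environment with ten scenarios adds ten test cases.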

Running tests and scoring

Test cases execute automatically. The agent’s response is compared against the expected output, and a score is generated per metric, showing which test cases pass and which fail. Scores highlight specific areas where the agent needs improvement.
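
The Python sketch below shows one way per-metric scores can be turned into pass/fail results. It assumes every score is normalized to the range 0 to 1 with higher meaning better, and that a test case passes only if each metric meets its threshold. The function name, threshold values, and case IDs are illustrative and not taken from Lyzr.

```python
def evaluate_scores(
    case_scores: dict[str, dict[str, float]],
    thresholds: dict[str, float],
) -> dict[str, dict]:
    """Mark each test case as passed or failed and list the metrics it missed."""
    report = {}
    for case_id, metrics in case_scores.items():
        # A metric with no recorded score is treated as 0.0, so it counts as failed.
        failed = sorted(
            metric
            for metric, minimum in thresholds.items()
            if metrics.get(metric, 0.0) < minimum
        )
        report[case_id] = {"passed": not failed, "failed_metrics": failed}
    return report


thresholds = {"task_completion": 0.8, "faithfulness": 0.7}
case_scores = {
    "tc-1": {"task_completion": 0.9, "faithfulness": 0.75},
    "tc-2": {"task_completion": 0.6, "faithfulness": 0.9},
}

report = evaluate_scores(case_scores, thresholds)
print(report["tc-1"])  # {'passed': True, 'failed_metrics': []}
print(report["tc-2"])  # {'passed': False, 'failed_metrics': ['task_completion']}
```

Keeping the list of failed metrics alongside the pass/fail flag is what lets a score point to a specific area for improvement rather than just signalling that something went wrong.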

Acting on failures

Every failed test case is a QA signal. You can address failures in two ways:

Manual update: Inspect the failing test case, identify the root cause, and update the agent’s instructions, model, tools, or features directly.

Agent Hardening: Select a subset of failed test cases and let Lyzr analyze the failures and recommend optimal agent configurations. Agent Hardening is faster when failures share a common pattern.
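
One simple way to spot a common pattern before choosing which failures to submit is to group failed test cases by the metric they missed. The Python sketch below does this with plain dictionaries. The input shape, the function name, and the case IDs are assumptions for illustration; this is not how Agent Hardening itself analyzes failures.

```python
from collections import defaultdict


def group_failures_by_metric(
    failures: dict[str, list[str]],
) -> list[tuple[str, list[str]]]:
    """Group failed case IDs by failed metric, largest group first."""
    groups: dict[str, list[str]] = defaultdict(list)
    for case_id, failed_metrics in failures.items():
        for metric in failed_metrics:
            groups[metric].append(case_id)
    # Sort by group size (descending), then metric name for a stable order.
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


failures = {
    "tc-2": ["task_completion"],
    "tc-5": ["task_completion", "faithfulness"],
    "tc-7": ["toxicity"],
}

for metric, case_ids in group_failures_by_metric(failures):
    print(metric, case_ids)
# task_completion ['tc-2', 'tc-5']
# faithfulness ['tc-5']
# toxicity ['tc-7']
```

A large group under a single metric suggests a shared root cause and a good candidate subset, while isolated failures may be quicker to fix by hand.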

Production readiness

Agent Eval covers correctness, security, response tone, faithfulness to retrieved knowledge, and other quality dimensions. An agent that passes its evaluation suite has been tested against realistic inputs and is less likely to produce unexpected behavior in production.
