The Improvement Engine monitors your registered agents continuously, detects quality and performance issues from live traces, and generates hardening suggestions to improve agent behavior over time.
It checks for a range of issues across your agent’s behavior - including hallucinations, low task completion, Knowledge Base retrieval failures, tool call errors, and response relevancy. Each issue is scored, categorized by severity, and tied back to the trace that surfaced it. When patterns emerge across multiple traces, the engine generates an Agent Hardening suggestion that proposes specific changes to the agent’s goal and instructions to address the root cause.
Access it via Safety and Evaluations > Improvement Engine from the sidebar.
[Figure: Agent Improvement Engine dashboard showing registered agents, issue counts by severity, and a Recent Issues panel]

Registering an agent

Only registered agents are monitored. To register an agent:
  1. Select + Register Agent in the top right.
  2. Choose an agent from the Agent dropdown.
  3. Toggle Automatic analysis on to run evaluation on a schedule.
  4. If automatic analysis is on, set the frequency: Every hour, Every 3 hours, Every 6 hours, Every 12 hours, Daily, Weekly, or Custom.
  5. Select Register Agent.
[Figure: Register Agent modal showing agent selector, automatic analysis toggle, and frequency dropdown]
Once registered, the agent appears in the agents list. Analysis begins at the next scheduled run. You can also trigger analysis manually at any time from the agent’s detail view without waiting for the schedule.
Registering an agent gives you:
  • Real-time issue detection from traces
  • Performance bottleneck alerts
  • AI-powered prompt improvement suggestions
To pause monitoring for an agent without removing it, select the three-dot menu on its row and choose Pause. Paused agents retain their issue history but stop receiving new analysis runs.

Dashboard overview

The main dashboard gives a cross-agent view of all registered agents. Summary stats across the top show total issues from all registered agents, split into Unresolved and Resolved counts, with a severity breakdown of Critical, Medium, and Low.
The agents table lists each registered agent with:
Column | Description
Status | Live (actively monitored) or Paused
Issues | Total issues detected and how many of the agent’s traces were analyzed
Issue Severity | C / M / L count breakdown of current issues
Unresolved | Issues still open and needing attention
Recent Issues on the right shows the latest issues across all registered agents, ordered by recency. Each entry shows the issue title, severity tag, the agent it belongs to, and when it was detected. This gives you a single view of what needs attention right now without having to open each agent individually.
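The summary stats amount to a roll-up over issue records. The following is a minimal sketch of that roll-up, not the engine's implementation: the `Issue` record, its field names, and the agent names are hypothetical, and it assumes the severity breakdown counts every issue regardless of resolution state.

```python
from collections import Counter
from dataclasses import dataclass

SEVERITIES = ("Critical", "Medium", "Low")


@dataclass(frozen=True)
class Issue:
    # Hypothetical record shape; these field names are assumptions.
    agent: str
    severity: str  # one of SEVERITIES
    resolved: bool


def summarize(issues: list[Issue]) -> dict:
    """Roll issues up into total, resolved, unresolved, and per-severity counts."""
    by_severity = Counter(issue.severity for issue in issues)
    resolved = sum(1 for issue in issues if issue.resolved)
    return {
        "total": len(issues),
        "resolved": resolved,
        "unresolved": len(issues) - resolved,
        "severity": {name: by_severity.get(name, 0) for name in SEVERITIES},
    }


issues = [
    Issue("support-bot", "Critical", resolved=False),
    Issue("support-bot", "Low", resolved=True),
    Issue("billing-bot", "Medium", resolved=False),
]
summary = summarize(issues)
print(summary)
```

Passing only one agent's issues to `summarize` gives the per-agent figures shown in the agents table.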

Viewing agent issues

Select any agent from the table to open its detail view. The header shows:
  • Agent name and live status
  • Analysis interval (for example, Every 3 hours)
  • Model and provider
  • Total traces analyzed and when analysis last ran
Summary cards below the header show Total Issues, Unresolved, Resolved, and Issue Severity for that specific agent.

Issues tab

The Issues tab lists all detected issues. Use the Severity filter to focus on Critical, Medium, or Low issues, and the Category dropdown to filter by issue type. Select Run Analysis to trigger a fresh analysis run on demand.
Each row in the issues table shows:
Column | Description
Issue | Title and a short description of what was detected
Severity | Critical, Medium, or Low - based on how significantly the issue affects agent quality
Category | Type of issue - for example, Low Task Completion, Hallucination, Knowledge Base
Score | Numeric score for the affected metric, with the passing threshold shown below it. A score of 0.30 with a threshold of 0.70 means the agent is well below the acceptable range
Trace | The trace ID where the issue was detected - select it to open Trace Details
Detected | When the issue was first seen
Some issues carry an Observation score rather than a numeric score. These are qualitative findings where the engine detected a problem but the issue does not map to a numeric metric - for example, when a trace does not contain enough data to evaluate task completion.
To mark an issue as resolved, select the checkmark icon on its row. To remove an issue from the list entirely, select the delete icon.
[Figure: Issues tab showing a filtered list of detected issues with severity, category, score, and trace columns]
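To make the score and threshold relationship concrete, here is a small sketch of how a row might be read and filtered. The `IssueRow` type and its fields are hypothetical. It assumes a score passes when it is at or above the threshold, and it models an Observation as a missing numeric score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IssueRow:
    # Hypothetical row shape; None in score/threshold stands in for an Observation.
    title: str
    severity: str
    category: str
    score: Optional[float] = None
    threshold: Optional[float] = None


def passes(row: IssueRow) -> Optional[bool]:
    """True or False for numeric scores; None when the row is an Observation."""
    if row.score is None or row.threshold is None:
        return None
    return row.score >= row.threshold  # assumption: meeting the threshold passes


def filter_rows(rows, severity=None, category=None):
    """Mimic the Severity and Category filters; None means no filter."""
    return [
        row
        for row in rows
        if (severity is None or row.severity == severity)
        and (category is None or row.category == category)
    ]


rows = [
    IssueRow("Request left unfinished", "Critical", "Low Task Completion", 0.30, 0.70),
    IssueRow("Unsupported claim", "Medium", "Hallucination", 0.55, 0.70),
    IssueRow("Not enough data in trace", "Low", "Low Task Completion"),
]
for row in filter_rows(rows, category="Low Task Completion"):
    print(row.title, passes(row))
```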

Trace Details

Selecting a trace ID opens the Trace Details panel. It shows:
  • The detected issue, its category tag, and a full description of why it was flagged - including evidence and what the evaluator expected to find
  • Duration, total tokens, tool calls, and cost for that trace
  • A trace timeline showing the span breakdown
  • Agent information: name, model, and provider
Use Full Trace View to open the complete trace in the Tracing view, or View Conversation to see the full conversation that triggered the issue.

Agent Hardening

The Agent Hardening tab shows AI-generated suggestions for improving the agent’s configuration. The engine analyzes patterns across detected issues - not just individual failures - and produces a consolidated suggestion that addresses the underlying root causes.
Each suggestion in the list shows:
Column | Description
Suggestion | Name of the hardened configuration
Status | Pending (not yet applied) or Applied
Fields | Which parts of the agent config are proposed to change, such as Goal or Instructions
Size delta | How many characters the suggested change adds or removes relative to the current config
Generated | When the suggestion was created
An agent can have multiple Pending suggestions if separate analysis runs identified different issue patterns. Review each independently before deciding which to apply.
Select a suggestion to expand it. The detail view shows:
  • Reasoning - a plain-language explanation of why the changes are recommended and which detected issues they address
  • Expected improvements - which metrics or behaviors should improve after applying the changes, shown as tags (for example, “Task Completion: Better task fulfillment”)
  • Diff view - switch between Inline Diff and Side by Side to compare the current and proposed Goal and Instructions. Additions are shown in green, removals in red.
To apply a suggestion, select Push to Production. The agent configuration is updated, a new version is recorded in Version Control, and the suggestion status changes to Applied.
Select Delete to discard a suggestion without applying it. Select Run Analysis Now to generate a fresh hardening suggestion based on the latest traces.
[Figure: Agent Hardening tab showing a pending suggestion with inline diff of Goal and Instructions changes]
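If you want to review a proposed change outside the UI, the size delta and an inline diff can be approximated with Python's standard `difflib` module. This is an illustrative sketch only: the sample Instructions text is made up, and the engine's own diff rendering and character counting may differ.

```python
import difflib


def size_delta(current: str, proposed: str) -> int:
    """Characters added (positive) or removed (negative) by the proposal."""
    return len(proposed) - len(current)


def inline_diff(current: str, proposed: str, field: str) -> str:
    """Unified diff of one config field; '+' lines are additions, '-' lines removals."""
    lines = difflib.unified_diff(
        current.splitlines(),
        proposed.splitlines(),
        fromfile=f"{field} (current)",
        tofile=f"{field} (proposed)",
        lineterm="",
    )
    return "\n".join(lines)


# Made-up Instructions text, for illustration only.
current = "Answer billing questions."
proposed = "Answer billing questions.\nCite the Knowledge Base article you used."

print(size_delta(current, proposed))
print(inline_diff(current, proposed, "Instructions"))
```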

Settings

The Settings tab controls how the Improvement Engine monitors an agent.
[Figure: Settings tab showing analysis schedule, runaway limits with per-trace ceilings and spend budgets, tracked metrics, and email alert configuration]

Analysis schedule

Set whether analysis runs automatically and at what frequency. The panel shows when the next scheduled run is. You can change the interval at any time - the change takes effect before the next scheduled run. Manual Run Analysis always works on demand regardless of the automatic schedule setting.
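As a rough mental model of the schedule, the next run can be thought of as the last run plus the chosen interval. The sketch below encodes the preset frequencies offered at registration. The `next_run` helper is hypothetical, and the assumption that runs are spaced from the previous run (rather than aligned to clock boundaries) is not confirmed by the product.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Preset frequencies from the Register Agent modal.
PRESET_INTERVALS = {
    "Every hour": timedelta(hours=1),
    "Every 3 hours": timedelta(hours=3),
    "Every 6 hours": timedelta(hours=6),
    "Every 12 hours": timedelta(hours=12),
    "Daily": timedelta(days=1),
    "Weekly": timedelta(weeks=1),
}


def next_run(
    last_run: datetime, frequency: str, custom: Optional[timedelta] = None
) -> datetime:
    """Assumed model: next run = last run + interval."""
    if frequency == "Custom":
        if custom is None:
            raise ValueError("Custom frequency needs an explicit interval")
        return last_run + custom
    return last_run + PRESET_INTERVALS[frequency]


last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_run(last, "Every 3 hours"))
print(next_run(last, "Custom", custom=timedelta(minutes=90)))
```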

Runaway limits

Agent evaluation consumes tokens and incurs cost. Runaway limits let you set guardrails so a single expensive trace or a sustained period of high usage does not run up an unexpected bill. Enable Runaway detection to activate limits. Leaving a field blank inherits the workspace default.
Per-trace ceilings flag and stop evaluation for a single trace that exceeds a threshold:
Ceiling | Preset options
Cost | $0.25, $0.50, $1.00, or a custom value
Latency | 10s, 20s, 60s, or a custom value
Tokens | 25,000, 50,000, 100,000, or a custom value
Spend budgets cap total evaluation spend over a rolling time window:
  • Cost: Daily and Monthly limits
  • Tokens: Daily and Monthly limits
Per-trace ceilings catch a single expensive evaluation call. Spend budgets catch sustained high usage over time. Anything that trips either type fires a Resource runaway alert - configure where those alerts are sent in the Alerts section.
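The two limit types can be sketched as follows. This is a simplified model under stated assumptions, not the engine's logic: the `Ceilings` type and helper names are hypothetical, a blank field is represented as `None`, and a limit is treated as tripped only when usage is strictly greater than it.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Ceilings:
    # None models a field left blank in the Settings tab.
    cost_usd: Optional[float] = None
    latency_s: Optional[float] = None
    tokens: Optional[int] = None


def resolve(agent: Ceilings, workspace: Ceilings) -> Ceilings:
    """Fill blank agent fields from the workspace default."""
    merged = {}
    for f in fields(Ceilings):
        own = getattr(agent, f.name)
        merged[f.name] = own if own is not None else getattr(workspace, f.name)
    return Ceilings(**merged)


def tripped_ceilings(limits: Ceilings, usage: Ceilings) -> list[str]:
    """Names of the per-trace ceilings that one trace's usage exceeds."""
    tripped = []
    for f in fields(Ceilings):
        limit = getattr(limits, f.name)
        used = getattr(usage, f.name)
        if limit is not None and used is not None and used > limit:
            tripped.append(f.name)
    return tripped


def budget_exceeded(spent_in_window: float, budget: Optional[float]) -> bool:
    """Spend budget check for one window (daily or monthly); no budget means no cap."""
    return budget is not None and spent_in_window > budget


workspace = Ceilings(cost_usd=1.00, latency_s=20, tokens=50_000)
agent = Ceilings(cost_usd=0.50)  # latency and tokens left blank
limits = resolve(agent, workspace)

trace_usage = Ceilings(cost_usd=0.75, latency_s=5, tokens=60_000)
print(tripped_ceilings(limits, trace_usage))  # ['cost_usd', 'tokens']
print(budget_exceeded(spent_in_window=12.40, budget=10.00))  # True
```

In this model, a non-empty result from `tripped_ceilings` or a `True` from `budget_exceeded` corresponds to a Resource runaway alert.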

Tracked metrics

Metrics are auto-selected based on the agent’s configuration. For most agents, Task Completion and Hallucinations are active by default. Tool and Knowledge Base metrics activate automatically when the agent has tools or a Knowledge Base connected.
Metric | Module | What it checks
Task Completion | Base | How fully the agent accomplishes the user’s request
Hallucinations | Base | Detects fabricated, unverifiable, or invented claims
Tool Correctness | Tools | Whether the right tool was chosen at the right time
Argument Correctness | Tools | Precision of tool arguments - types, values, and formats
Contextual Relevancy | Knowledge Base | Relevance and sufficiency of retrieved context
Answer Relevancy | Knowledge Base | Whether the response directly addresses the user’s question
Knowledge Retention | Knowledge Base | Consistency and coherence across multi-step reasoning
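The auto-selection rule can be expressed as a simple lookup by module. The sketch below is an illustration of the rule as described, using the metric names from the table. The `tracked_metrics` function and its boolean parameters are hypothetical, and the real selection may consider more of the agent's configuration.

```python
# Metric names grouped by module, taken from the table above.
METRICS_BY_MODULE = {
    "Base": ["Task Completion", "Hallucinations"],
    "Tools": ["Tool Correctness", "Argument Correctness"],
    "Knowledge Base": ["Contextual Relevancy", "Answer Relevancy", "Knowledge Retention"],
}


def tracked_metrics(has_tools: bool, has_knowledge_base: bool) -> list[str]:
    """Base metrics always; Tools and Knowledge Base metrics only when connected."""
    selected = list(METRICS_BY_MODULE["Base"])
    if has_tools:
        selected += METRICS_BY_MODULE["Tools"]
    if has_knowledge_base:
        selected += METRICS_BY_MODULE["Knowledge Base"]
    return selected


print(tracked_metrics(has_tools=True, has_knowledge_base=False))
```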

Alerts

Configure where the engine sends notifications when analysis events occur. You can add multiple email channels with different recipient lists and event subscriptions. Supported events:
  • Issues found - new issues were detected in an analysis run
  • Suggestion ready - a new hardening suggestion has been generated
  • Analysis failed - an analysis run could not complete
  • Resource runaway - a runaway limit was tripped
Select the events each channel should receive. Slack alerts are coming soon.
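Event subscriptions work like a routing table from events to recipient lists. The following sketch shows one way to model that. The `EmailChannel` type, the event identifiers, and the addresses are hypothetical, and de-duplicating recipients across channels is an assumption rather than documented behavior.

```python
from dataclasses import dataclass

# Hypothetical identifiers for the four supported events.
EVENTS = {"issues_found", "suggestion_ready", "analysis_failed", "resource_runaway"}


@dataclass(frozen=True)
class EmailChannel:
    name: str
    recipients: tuple[str, ...]
    events: frozenset[str]


def recipients_for(event: str, channels: list[EmailChannel]) -> list[str]:
    """Everyone subscribed to the event, in channel order, without duplicates."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    ordered = dict.fromkeys(
        address
        for channel in channels
        if event in channel.events
        for address in channel.recipients
    )
    return list(ordered)


channels = [
    EmailChannel(
        "on-call",
        ("a@example.com", "b@example.com"),
        frozenset({"analysis_failed", "resource_runaway"}),
    ),
    EmailChannel(
        "team",
        ("b@example.com", "c@example.com"),
        frozenset({"issues_found", "resource_runaway"}),
    ),
]
print(recipients_for("resource_runaway", channels))
```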

Upcoming features

  • Real-time monitoring - analyze each trace via webhook as it completes, without waiting for a scheduled run
  • Custom judges - define your own evaluation criteria beyond the built-in tracked metrics

Next steps