The Agent Simulation Engine (A-Sim) lets you test and improve AI agents before they reach production. It generates synthetic conversations from persona and scenario combinations, evaluates agent responses across accuracy, helpfulness, and safety metrics, and automatically rewrites agent instructions based on the failures it finds. Use A-Sim when you need confidence that your agent handles real-world edge cases, adversarial users, and domain-specific compliance requirements before deployment.

The world model

A-Sim structures testing around a world model made up of two dimensions: personas and scenarios.

A persona is a user archetype that defines who is interacting with the agent. Examples include a first-time user unfamiliar with the product, an experienced power user with technical knowledge, or an adversarial user trying to bypass the agent’s guardrails.

A scenario is a task type that defines what the user is trying to accomplish. Examples include a basic policy inquiry, a complex compliance issue, or a request the agent is supposed to refuse.

A-Sim combines every persona with every scenario to produce a set of simulations, which are synthetic test conversations. With 3 personas and 4 scenarios, for example, it produces 12 simulations. This cross-product approach tests the agent against every pairing in the world model rather than a hand-picked subset.
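The persona and scenario pairing described above can be sketched in a few lines of Python. This is only an illustration of the cross product, not A-Sim's actual API: the `Persona`, `Scenario`, and `Simulation` types and the `build_simulations` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from itertools import product
from typing import List

# Hypothetical types for illustration only; these are not A-Sim's real classes.
@dataclass(frozen=True)
class Persona:
    name: str
    description: str

@dataclass(frozen=True)
class Scenario:
    name: str
    description: str

@dataclass(frozen=True)
class Simulation:
    persona: Persona
    scenario: Scenario

def build_simulations(personas: List[Persona], scenarios: List[Scenario]) -> List[Simulation]:
    """Pair every persona with every scenario (the cross product)."""
    return [Simulation(p, s) for p, s in product(personas, scenarios)]

personas = [
    Persona("first_time_user", "Unfamiliar with the product"),
    Persona("power_user", "Experienced, with technical knowledge"),
    Persona("adversarial_user", "Tries to bypass the agent's guardrails"),
]
scenarios = [
    Scenario("basic_policy_inquiry", "Asks a simple policy question"),
    Scenario("complex_compliance_issue", "Raises a multi-step compliance problem"),
    Scenario("refusal_request", "Asks for something the agent should refuse"),
]

simulations = build_simulations(personas, scenarios)
print(len(simulations))  # 3 personas x 3 scenarios = 9
```

Because the set grows multiplicatively, adding one persona to this example adds three simulations, one per scenario.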

How it works

  1. You create an environment, which is an isolated clone of your agent used for safe evaluation without affecting the production version.
  2. A-Sim generates personas and scenarios automatically using the agent’s role and goal, or you define them manually.
  3. A-Sim combines personas and scenarios into simulations and runs each one against the agent.
  4. An evaluation scores each simulation response across the metrics you select.
  5. Simulations that fail are passed to agent hardening, which analyzes the failure patterns and produces an improved set of agent instructions.
  6. You start a new evaluation round with the improved instructions and repeat the cycle until all simulations pass.

Evaluation metrics

Each evaluation run scores responses against one or more of the following metrics. Each simulation receives a final judgment of PASS or FAIL.
| Metric | What it measures |
| --- | --- |
| task_completion | Whether the agent accomplished what the user asked |
| hallucinations | Whether the agent fabricated facts not present in its knowledge |
| answer_relevancy | Whether the response is on-topic and directly addresses the query |
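One way to picture how per-metric results could roll up into a single PASS or FAIL judgment is sketched below. The aggregation rule is an assumption made for this example: the documentation above does not state how A-Sim derives the final judgment, so here a simulation passes only when every selected metric passes. The `final_judgment` function is a hypothetical helper; only the metric names come from the table.

```python
from typing import Dict

# Metric names are taken from the table above.
KNOWN_METRICS = {"task_completion", "hallucinations", "answer_relevancy"}

def final_judgment(metric_results: Dict[str, bool]) -> str:
    """Return "PASS" only if every selected metric passed.

    ASSUMPTION: the all-metrics-must-pass rule is illustrative and may not
    match how A-Sim actually combines metric scores.
    """
    if not metric_results:
        raise ValueError("select at least one metric")
    unknown = set(metric_results) - KNOWN_METRICS
    if unknown:
        raise ValueError(f"unknown metrics: {sorted(unknown)}")
    return "PASS" if all(metric_results.values()) else "FAIL"

# Here True means the response was acceptable on that metric,
# e.g. hallucinations=True means no fabricated facts were found.
print(final_judgment({"task_completion": True, "hallucinations": True}))
print(final_judgment({"task_completion": True, "answer_relevancy": False}))
```

Under this assumed rule, the first call yields PASS and the second yields FAIL, because a single failing metric fails the whole simulation.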

Agent hardening

When simulations fail, A-Sim analyzes the failure patterns across the evaluation round and produces two agent configurations: the original and an improved version with rewritten instructions targeting the specific failures. You can review the changes before applying them, or let A-Sim apply and re-evaluate automatically. The hardening loop continues round by round until all simulations pass or you reach the maximum number of rounds you configure.
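The round-by-round loop described above can be sketched as follows. This is a minimal illustration of the control flow under stated assumptions, not A-Sim's implementation: `harden`, `evaluate`, `rewrite`, and `max_rounds` are hypothetical names, and the evaluation and rewriting steps are passed in as plain functions so the example stays self-contained.

```python
from typing import Callable, List, Tuple

def harden(
    instructions: str,
    simulations: List[str],
    evaluate: Callable[[str, str], bool],      # hypothetical: True if the simulation passes
    rewrite: Callable[[str, List[str]], str],  # hypothetical: returns improved instructions
    max_rounds: int,
) -> Tuple[str, int, List[str]]:
    """Evaluate, rewrite on failure, and repeat until all pass or rounds run out.

    Returns the last evaluated instructions, the number of rounds run,
    and the simulations that still fail.
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    failures: List[str] = []
    rounds_run = 0
    for rounds_run in range(1, max_rounds + 1):
        failures = [s for s in simulations if not evaluate(instructions, s)]
        if not failures:
            break
        # Only rewrite if another round remains to evaluate the result.
        if rounds_run < max_rounds:
            instructions = rewrite(instructions, failures)
    return instructions, rounds_run, failures

# Toy stand-ins: a simulation "passes" when the instructions mention its topic.
def toy_evaluate(instructions: str, simulation: str) -> bool:
    return simulation in instructions

def toy_rewrite(instructions: str, failures: List[str]) -> str:
    # Address one failure per round so the loop visibly runs several rounds.
    return instructions + f"\nHandle: {failures[0]}"

final, rounds, remaining = harden(
    "Be helpful.", ["refunds", "jailbreak"], toy_evaluate, toy_rewrite, max_rounds=5
)
print(rounds, remaining)  # 3 []
```

In this sketch the final round does not trigger a rewrite, so the returned instructions are always ones that have actually been evaluated. Whether A-Sim behaves the same way at the round limit is not stated in the source.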