> ## Documentation Index
> Fetch the complete documentation index at: https://docs.lyzr.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# How It Works

> Control Plane pipeline architecture, core data model, and execution flow.

## Pipeline architecture

A Control Plane pipeline has four stages that run on every push:

```
Build → Eval → Compare → Deploy
```

**Build**: Install dependencies, validate `langship.yaml`, resolve dataset versions.

**Eval**: Run every evaluator defined in your config against the target dataset. Each evaluator produces a pass/fail verdict and a numeric score.

**Compare**: Diff scores against the previous passing run on the target branch. Surface regressions in the CI summary.

**Deploy**: If all evals pass (or no blocking evals failed), promote the agent version to the target environment.

## Core data model

```
Project
├── Runs (executions of your agent)
│   ├── Traces (one trace per run)
│   │   ├── Spans (one span per step: LLM call, tool call, agent loop)
│   │   └── Events (logs, errors, custom annotations)
│   └── Eval results
├── Datasets (versioned collections of test cases)
└── Deployments (environment → agent version mapping)
```

### Runs

Every time your agent executes (locally or in CI), Control Plane records a Run. A Run has:

* A unique ID
* Start/end time and total duration
* The project and environment it belongs to
* A status: `success`, `failure`, `error`
* The full trace and any eval results attached

### Traces and spans

A trace is a tree of spans. Spans map directly to agent steps:

| Span type   | Created by                                             |
| ----------- | ------------------------------------------------------ |
| `llm`       | Every LLM call (model, prompt, response, tokens, cost) |
| `tool`      | Every tool invocation (name, input, output, duration)  |
| `agent`     | The agent's decision loop                              |
| `retrieval` | Vector store / document retrieval                      |
| `memory`    | Memory read or write                                   |
| `custom`    | Manual spans via `langship.span()`                     |

Traces are OpenTelemetry-compatible; you can export them to any OTel-compatible backend (Jaeger, Grafana Tempo, etc.) in addition to Control Plane.

### Datasets

A Dataset is a versioned collection of test cases. Each test case has:

* `input`: the query or message to send to your agent
* `expected` (optional): the expected output for exact-match evals
* `metadata` (optional): tags, difficulty labels, source references

Datasets are stored in Control Plane Server and referenced by SHA-pinned version in `langship.yaml`. Your evals always run against the exact same dataset version, making results reproducible across branches and time.

### Evaluators

| Evaluator type        | How it scores                                            |
| --------------------- | -------------------------------------------------------- |
| `exact-match`         | String equality between agent output and expected        |
| `contains`            | Whether output contains a substring                      |
| `semantic-similarity` | Embedding cosine similarity against expected             |
| `llm-judge`           | LLM grades the response on a 0–1 scale using your prompt |
| `python`              | Custom Python function returning a float in `[0, 1]`     |
| `regex`               | Output matches a pattern                                 |

Evaluators can be configured as `blocking: true` (fail the pipeline on threshold miss) or `blocking: false` (report only).

## Execution flow

### Local run

```
langship eval run
  ↓
Read langship.yaml
  ↓
Fetch dataset (latest version or pinned SHA)
  ↓
For each test case:
  → Run agent with test input
  → Collect trace
  → Score with each evaluator
  ↓
Aggregate results
  ↓
Print report, exit 0 (pass) or 1 (fail)
```

### CI run (GitHub Actions)

```
git push / PR opened
  ↓
actions/checkout + langship/setup-action
  ↓
langship eval run --ci
  ↓
Post results as PR check + comment
  ↓
On pass: langship deploy --env staging
```

## Configuration reference

`langship.yaml` structure:

```yaml theme={null}
project: my-agent          # project name in Control Plane Server
endpoint: ${LANGSHIP_URL}  # server URL (env var)

datasets:
  golden-set:
    path: ./evals/golden-set.jsonl
    version: sha256:abc123  # pin to a specific version (optional)

evals:
  - name: factual-accuracy
    type: llm-judge
    model: gpt-4o-mini       # model to use as judge
    prompt: |
      Rate the factual accuracy of this response on a scale of 0 to 1.
      Response: {{output}}
      Question: {{input}}
    dataset: golden-set
    pass_threshold: 0.8
    blocking: true

  - name: no-refusals
    type: contains
    negate: true             # pass if output does NOT contain the string
    value: "I cannot"
    dataset: golden-set
    blocking: false          # report only, don't fail pipeline

deployments:
  staging:
    target: lyzr             # deploy to Lyzr Studio
    agent_id: ${AGENT_ID}
    on_pass: true
```

## Observability pipeline

Control Plane uses an OpenTelemetry Collector under the hood:

```
Your agent (SDK)
  → OTLP exporter
  → Control Plane Collector
  → Control Plane Server (storage + UI)
  → Optional: forward to Jaeger / Grafana / Datadog
```

Forwarding to external backends is configured in `collector-config.yaml` (included in the Docker Compose setup).
