How It Works

Pipeline architecture

A Control Plane pipeline has four stages that run on every push:

Build → Eval → Compare → Deploy

Build: Install dependencies, validate langship.yaml, and resolve dataset versions.
Eval: Run every evaluator defined in your config against its target dataset. Each evaluator produces a pass/fail verdict and a numeric score.
Compare: Diff scores against the previous passing run on the target branch and surface regressions in the CI summary.
Deploy: If no blocking evaluator failed, promote the agent version to the target environment.
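
The Compare stage can be pictured as a per-evaluator score diff against a baseline. The sketch below is illustrative only: the function name `find_regressions`, the dict-of-scores shape, and the `tolerance` parameter are assumptions made for this example, not Control Plane's actual implementation.

```python
from typing import Dict, List, Tuple


def find_regressions(
    current: Dict[str, float],
    baseline: Dict[str, float],
    tolerance: float = 0.0,
) -> List[Tuple[str, float, float]]:
    """Return (evaluator, baseline_score, current_score) for every score that dropped.

    `tolerance` is a hypothetical knob: drops no larger than it are ignored.
    Evaluators missing from the baseline are skipped, since there is nothing to diff.
    """
    regressions = []
    for name, score in sorted(current.items()):
        if name not in baseline:
            continue
        if baseline[name] - score > tolerance:
            regressions.append((name, baseline[name], score))
    return regressions


# Scores from the previous passing run (baseline) and from this run (current).
baseline = {"factual-accuracy": 0.90, "no-refusals": 1.00}
current = {"factual-accuracy": 0.84, "no-refusals": 1.00}

for name, old, new in find_regressions(current, baseline, tolerance=0.02):
    print(f"REGRESSION {name}: {old:.2f} -> {new:.2f}")
```

Here only factual-accuracy is reported, because its drop exceeds the tolerance while no-refusals is unchanged.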

Core data model

Project
├── Runs (executions of your agent)
│   ├── Traces (one trace per run)
│   │   ├── Spans (one span per step: LLM call, tool call, agent loop)
│   │   └── Events (logs, errors, custom annotations)
│   └── Eval results
├── Datasets (versioned collections of test cases)
└── Deployments (environment → agent version mapping)

Runs

Every time your agent executes (locally or in CI), Control Plane records a Run. A Run has:

A unique ID
Start/end time and total duration
The project and environment it belongs to
A status: success, failure, or error
The full trace and any eval results attached
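
As a rough mental model, a Run can be thought of as a small record with the fields listed above. The following is a sketch under assumptions: the class and field names are hypothetical and do not reflect Control Plane's stored schema, and the trace is left out for brevity.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Dict


class RunStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    ERROR = "error"


@dataclass
class Run:
    project: str
    environment: str
    started_at: datetime
    ended_at: datetime
    status: RunStatus
    # Evaluator name -> score; empty when no evals were attached to the run.
    eval_results: Dict[str, float] = field(default_factory=dict)
    # Unique ID generated per run (hypothetical format: 32 hex characters).
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def duration(self) -> timedelta:
        """Total duration, derived from the start and end times."""
        return self.ended_at - self.started_at


start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)  # arbitrary example time
run = Run("my-agent", "staging", start, start + timedelta(seconds=3.5), RunStatus.SUCCESS)
print(run.status.value, run.duration.total_seconds())
```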

Traces and spans

A trace is a tree of spans. Spans map directly to agent steps:

Span type	Created by
`llm`	Every LLM call (model, prompt, response, tokens, cost)
`tool`	Every tool invocation (name, input, output, duration)
`agent`	The agent’s decision loop
`retrieval`	Vector store / document retrieval
`memory`	Memory read or write
`custom`	Manual spans via `langship.span()`

Traces are OpenTelemetry-compatible; you can export them to any OTel-compatible backend (Jaeger, Grafana Tempo, etc.) in addition to Control Plane.
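
To make the tree structure concrete, here is a minimal sketch of a trace as nested spans, with a depth-first walk that totals token usage across LLM calls. The `Span` class, the `tokens` attribute name, and the helper functions are assumptions for illustration; they are not the langship SDK's types.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Span:
    name: str
    span_type: str  # one of: llm, tool, agent, retrieval, memory, custom
    attributes: Dict[str, float] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth-first."""
    yield span
    for child in span.children:
        yield from walk(child)


def total_tokens(root: Span) -> int:
    """Sum the (hypothetical) `tokens` attribute over every llm span in the trace."""
    return sum(
        int(s.attributes.get("tokens", 0))
        for s in walk(root)
        if s.span_type == "llm"
    )


# One agent loop containing two LLM calls and one tool call.
trace = Span("agent-loop", "agent", children=[
    Span("plan", "llm", {"tokens": 120}),
    Span("search", "tool", {"duration_ms": 45}),
    Span("answer", "llm", {"tokens": 310}),
])

print(total_tokens(trace))  # 430
print([s.name for s in walk(trace)])
```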

Datasets

A Dataset is a versioned collection of test cases. Each test case has:

input: the query or message to send to your agent
expected (optional): the expected output for exact-match evals
metadata (optional): tags, difficulty labels, source references

Datasets are stored in Control Plane Server and can be pinned to a specific SHA version in langship.yaml. With a pinned version, your evals always run against the exact same test cases, which makes results reproducible across branches and over time.
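
The sketch below shows what test cases might look like as JSONL and why a content hash makes a version reproducible. It rests on two assumptions: that each JSONL line is one test case with the fields listed above, and that a version is the SHA-256 of the raw file bytes. Control Plane's actual versioning scheme may differ.

```python
import hashlib
import json
from typing import Any, Dict, List

# Two illustrative test cases; only `input` is required.
RAW_JSONL = (
    '{"input": "What is the capital of France?", "expected": "Paris"}\n'
    '{"input": "Summarize our refund policy.", "metadata": {"tags": ["policy"]}}\n'
)


def load_test_cases(raw: str) -> List[Dict[str, Any]]:
    """Parse JSONL text into test cases and reject any case without `input`."""
    cases = []
    for line_no, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue
        case = json.loads(line)
        if "input" not in case:
            raise ValueError(f"line {line_no}: test case is missing 'input'")
        cases.append(case)
    return cases


def content_version(raw: str) -> str:
    """Assumed scheme: SHA-256 of the raw bytes, prefixed with 'sha256:'."""
    return "sha256:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


cases = load_test_cases(RAW_JSONL)
print(len(cases), cases[0].get("expected"), cases[1].get("expected"))

# Any change to the content, even one trailing space, yields a different version.
print(content_version(RAW_JSONL) == content_version(RAW_JSONL + " "))
```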

Evaluators

Evaluator type	How it scores
`exact-match`	String equality between agent output and expected
`contains`	Whether output contains a substring
`semantic-similarity`	Embedding cosine similarity against expected
`llm-judge`	LLM grades the response on a 0–1 scale using your prompt
`python`	Custom Python function returning a float in `[0, 1]`
`regex`	Whether output matches a regular expression pattern

Evaluators can be configured as blocking: true (fail the pipeline on threshold miss) or blocking: false (report only).
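
A custom `python` evaluator and the blocking rule might look like the sketch below. The function signature, the `EvalResult` class, and the `>=` threshold comparison are assumptions for illustration; the source only specifies that a custom function returns a float in `[0, 1]` and that blocking evaluators fail the pipeline on a threshold miss.

```python
from dataclasses import dataclass
from typing import List


def keyword_coverage(output: str, expected_keywords: List[str]) -> float:
    """Hypothetical custom evaluator: fraction of expected keywords found in the output."""
    if not expected_keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)


@dataclass
class EvalResult:
    name: str
    score: float
    pass_threshold: float
    blocking: bool

    @property
    def passed(self) -> bool:
        # Assumption: a score equal to the threshold counts as a pass.
        return self.score >= self.pass_threshold


def pipeline_passes(results: List[EvalResult]) -> bool:
    """Only blocking evaluators that miss their threshold fail the pipeline."""
    return not any(r.blocking and not r.passed for r in results)


score = keyword_coverage("Paris is the capital of France.", ["Paris", "France"])
results = [
    EvalResult("keyword-coverage", score, pass_threshold=0.8, blocking=True),
    EvalResult("no-refusals", 0.0, pass_threshold=1.0, blocking=False),
]
print(score, pipeline_passes(results))
```

The non-blocking evaluator misses its threshold here, yet the pipeline still passes because only blocking failures count.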

Execution flow

Local run

langship eval run
  ↓
Read langship.yaml
  ↓
Fetch dataset (latest version or pinned SHA)
  ↓
For each test case:
  → Run agent with test input
  → Collect trace
  → Score with each evaluator
  ↓
Aggregate results
  ↓
Print report, exit 0 (pass) or 1 (fail)
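
The flow above can be sketched as a plain loop. This is a simplified illustration, not the CLI's implementation: aggregating by mean score, the function names, and the stand-in agent are all assumptions, and trace collection is omitted.

```python
from statistics import mean
from typing import Callable, Dict, List


def run_evals(
    agent: Callable[[str], str],
    test_cases: List[dict],
    evaluators: Dict[str, Callable[[str, dict], float]],
    thresholds: Dict[str, float],
) -> int:
    """Run every test case through the agent, score it, and return an exit code."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for case in test_cases:
        output = agent(case["input"])  # a real run would also collect a trace here
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(output, case))

    exit_code = 0
    for name, values in scores.items():
        avg = mean(values) if values else 0.0  # assumed aggregation: mean score
        passed = avg >= thresholds[name]
        print(f"{name}: {avg:.2f} ({'pass' if passed else 'fail'})")
        if not passed:
            exit_code = 1
    return exit_code


def toy_agent(query: str) -> str:
    """Stand-in agent that only knows one answer."""
    return "Paris" if "France" in query else "I cannot answer that."


def exact_match(output: str, case: dict) -> float:
    return 1.0 if output == case.get("expected") else 0.0


cases = [
    {"input": "Capital of France?", "expected": "Paris"},
    {"input": "Capital of Peru?", "expected": "Lima"},
]
code = run_evals(toy_agent, cases, {"exact-match": exact_match}, {"exact-match": 0.8})
print("exit code:", code)
```

With one of two cases answered correctly, the mean score of 0.50 falls below the 0.8 threshold, so the sketch returns exit code 1.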

CI run (GitHub Actions)

git push / PR opened
  ↓
actions/checkout + langship/setup-action
  ↓
langship eval run --ci
  ↓
Post results as PR check + comment
  ↓
On pass: langship deploy --env staging

Configuration reference

langship.yaml structure:

project: my-agent          # project name in Control Plane Server
endpoint: ${LANGSHIP_URL}  # server URL (env var)

datasets:
  golden-set:
    path: ./evals/golden-set.jsonl
    version: sha256:abc123  # pin to a specific version (optional)

evals:
  - name: factual-accuracy
    type: llm-judge
    model: gpt-4o-mini       # model to use as judge
    prompt: |
      Rate the factual accuracy of this response on a scale of 0 to 1.
      Response: {{output}}
      Question: {{input}}
    dataset: golden-set
    pass_threshold: 0.8
    blocking: true

  - name: no-refusals
    type: contains
    negate: true             # pass if output does NOT contain the string
    value: "I cannot"
    dataset: golden-set
    blocking: false          # report only, don't fail pipeline

deployments:
  staging:
    target: lyzr             # deploy to Lyzr Studio
    agent_id: ${AGENT_ID}
    on_pass: true
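
Two substitution styles appear in the config: `${NAME}` for environment variables and `{{name}}` for prompt placeholders. The sketch below shows one plausible way such values could be resolved. The helper names and the exact substitution rules (for example, raising on an unset variable) are assumptions, not documented langship behavior.

```python
import re
from typing import Mapping

ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def expand_env(value: str, env: Mapping[str, str]) -> str:
    """Replace ${NAME} with env[NAME]; raise if the variable is not set."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return ENV_REF.sub(substitute, value)


def render_prompt(template: str, **fields: str) -> str:
    """Fill {{name}} placeholders; unknown placeholders are left untouched."""
    return PLACEHOLDER.sub(lambda m: fields.get(m.group(1), m.group(0)), template)


# Placeholder URL for illustration; in practice this comes from os.environ.
print(expand_env("${LANGSHIP_URL}", {"LANGSHIP_URL": "http://localhost:8080"}))

template = "Response: {{output}}\nQuestion: {{input}}"
print(render_prompt(template, output="Paris", input="Capital of France?"))
```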

Observability pipeline

Control Plane uses an OpenTelemetry Collector under the hood:

Your agent (SDK)
  → OTLP exporter
  → Control Plane Collector
  → Control Plane Server (storage + UI)
  → Optional: forward to Jaeger / Grafana / Datadog

Forwarding to external backends is configured in collector-config.yaml (included in the Docker Compose setup).
