
Pipeline architecture

A Control Plane pipeline has four stages that run on every push:
Build → Eval → Compare → Deploy
  • Build: Install dependencies, validate langship.yaml, resolve dataset versions.
  • Eval: Run every evaluator defined in your config against the target dataset. Each evaluator produces a pass/fail verdict and a numeric score.
  • Compare: Diff scores against the previous passing run on the target branch. Surface regressions in the CI summary.
  • Deploy: If all evals pass (or no blocking evals failed), promote the agent version to the target environment.
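The Compare and Deploy rules can be sketched in a few lines of plain Python. This is an illustration only, not Control Plane's implementation: the EvalResult type, the find_regressions and should_deploy helpers, and the tolerance parameter are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    score: float    # numeric score from the Eval stage
    passed: bool    # pass/fail verdict from the Eval stage
    blocking: bool  # whether a failure should stop the pipeline


def find_regressions(current, previous_scores, tolerance=0.0):
    """Return {name: (previous, current)} for evals whose score dropped."""
    regressions = {}
    for result in current:
        prev = previous_scores.get(result.name)
        if prev is not None and result.score < prev - tolerance:
            regressions[result.name] = (prev, result.score)
    return regressions


def should_deploy(results):
    """Deploy unless at least one blocking eval failed."""
    return not any(r.blocking and not r.passed for r in results)


# Hypothetical results for the current run and the previous passing run.
results = [
    EvalResult("factual-accuracy", score=0.85, passed=True, blocking=True),
    EvalResult("no-refusals", score=0.5, passed=False, blocking=False),
]
previous = {"factual-accuracy": 0.9, "no-refusals": 0.5}

regressions = find_regressions(results, previous)
deploy = should_deploy(results)
print("regressions:", regressions)
print("deploy:", deploy)
```

In this sketch a regression is reported without blocking the deploy; only a failed blocking eval stops promotion.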

Core data model

Project
├── Runs (executions of your agent)
│   ├── Traces (one trace per run)
│   │   ├── Spans (one span per step: LLM call, tool call, agent loop)
│   │   └── Events (logs, errors, custom annotations)
│   └── Eval results
├── Datasets (versioned collections of test cases)
└── Deployments (environment → agent version mapping)

Runs

Every time your agent executes (locally or in CI), Control Plane records a Run. A Run has:
  • A unique ID
  • Start/end time and total duration
  • The project and environment it belongs to
  • A status: success, failure, or error
  • The full trace and any eval results attached
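As a rough mental model, the fields listed above could be represented like this. The class and attribute names are assumptions made for illustration; they are not the schema Control Plane Server actually stores.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class RunStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    ERROR = "error"


@dataclass
class Run:
    project: str
    environment: str
    status: RunStatus
    started_at: datetime
    ended_at: datetime
    trace: list = field(default_factory=list)         # spans and events
    eval_results: list = field(default_factory=list)  # attached eval results
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def duration(self) -> timedelta:
        """Total duration, derived from start and end time."""
        return self.ended_at - self.started_at


start = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
run = Run(
    project="my-agent",
    environment="staging",
    status=RunStatus.SUCCESS,
    started_at=start,
    ended_at=start + timedelta(seconds=2.5),
)
print(run.id, run.status.value, run.duration.total_seconds())
```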

Traces and spans

A trace is a tree of spans. Spans map directly to agent steps:
Span type   Created by
llm         Every LLM call (model, prompt, response, tokens, cost)
tool        Every tool invocation (name, input, output, duration)
agent       The agent’s decision loop
retrieval   Vector store / document retrieval
memory      Memory read or write
custom      Manual spans via langship.span()
Traces are OpenTelemetry-compatible; you can export them to any OTel-compatible backend (Jaeger, Grafana Tempo, etc.) in addition to Control Plane.
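To make the "tree of spans" idea concrete, here is a small self-contained sketch that builds a trace by hand and walks it depth-first. The Span class, its fields, and the sample span names are hypothetical; real spans are recorded by the SDK and carry more attributes than shown here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    type: str           # one of: llm, tool, agent, retrieval, memory, custom
    name: str
    duration_ms: float
    children: list = field(default_factory=list)


def walk(span, depth=0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# A hand-built trace: the agent loop is the root, each step is a child span.
trace = Span("agent", "decision-loop", 1200.0, [
    Span("retrieval", "vector-search", 150.0),
    Span("llm", "plan", 600.0),
    Span("tool", "search_web", 300.0, [
        Span("llm", "summarize", 100.0),
    ]),
])

counts = Counter(span.type for _, span in walk(trace))
llm_ms = sum(s.duration_ms for _, s in walk(trace) if s.type == "llm")

for depth, span in walk(trace):
    print("  " * depth + f"{span.type}: {span.name} ({span.duration_ms} ms)")
print("spans per type:", dict(counts))
print("time in LLM calls (ms):", llm_ms)
```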

Datasets

A Dataset is a versioned collection of test cases. Each test case has:
  • input: the query or message to send to your agent
  • expected (optional): the reference output, used by evaluators that compare against it (exact-match, semantic-similarity)
  • metadata (optional): tags, difficulty labels, source references
Datasets are stored in Control Plane Server and can be pinned to a SHA version in langship.yaml. With a pinned version, your evals always run against exactly the same test cases, which makes results reproducible across branches and over time; without one, the latest version is fetched.
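The value of SHA pinning is easiest to see with a content hash. The sketch below assumes the version tag is a SHA-256 digest of the serialized test cases; that is an assumption for illustration, since how Control Plane Server derives its version identifiers is not described here. The to_jsonl and dataset_version helpers are hypothetical.

```python
import hashlib
import json

test_cases = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "Summarize this ticket.", "metadata": {"tags": ["support"]}},
]


def to_jsonl(cases):
    """Serialize test cases as JSON Lines with a stable key order."""
    return "".join(json.dumps(c, sort_keys=True) + "\n" for c in cases)


def dataset_version(cases):
    """Content-derived version tag: any change to the cases changes the hash."""
    digest = hashlib.sha256(to_jsonl(cases).encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


v1 = dataset_version(test_cases)
v2 = dataset_version(test_cases + [{"input": "New case"}])
print(v1)
print("unchanged after edit:", v1 == v2)
```

Because the tag depends only on content, two branches that pin the same tag are guaranteed to evaluate against identical test cases.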

Evaluators

Evaluator type        How it scores
exact-match           String equality between agent output and expected
contains              Whether output contains a substring
semantic-similarity   Embedding cosine similarity against expected
llm-judge             LLM grades the response on a 0–1 scale using your prompt
python                Custom Python function returning a float in [0, 1]
regex                 Output matches a pattern
Evaluators can be configured as blocking: true (fail the pipeline on threshold miss) or blocking: false (report only).
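The deterministic evaluator types are simple enough to sketch directly. The functions below are illustrative stand-ins, not the built-in implementations: the function names are hypothetical, the regex check assumes a search anywhere in the output rather than a full match, and the embedding vectors are toy values (a real semantic-similarity evaluator would obtain them from an embedding model).

```python
import math
import re


def exact_match(output, expected):
    """1.0 if the output equals the expected string, else 0.0."""
    return 1.0 if output == expected else 0.0


def contains(output, value, negate=False):
    """1.0 if value is found in output (or NOT found, when negate=True)."""
    found = value in output
    return 1.0 if found != negate else 0.0


def regex_match(output, pattern):
    """1.0 if the pattern matches anywhere in the output (assumption)."""
    return 1.0 if re.search(pattern, output) else 0.0


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def verdict(score, pass_threshold):
    """Pass/fail verdict from a numeric score."""
    return score >= pass_threshold


refusal = contains("I cannot help with that.", "I cannot", negate=True)
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(refusal, same, orthogonal, verdict(0.85, 0.8))
```

A custom python evaluator follows the same shape: any function that returns a float in [0, 1] can be thresholded in the same way.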

Execution flow

Local run

langship eval run
  ↓
Read langship.yaml
  ↓
Fetch dataset (latest version or pinned SHA)
  ↓
For each test case:
  → Run agent with test input
  → Collect trace
  → Score with each evaluator
  ↓
Aggregate results
  ↓
Print report, exit 0 (pass) or 1 (fail)
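The loop above can be approximated as follows. This is a minimal sketch under stated assumptions, not the CLI's actual code: run_evals and exit_code are hypothetical helpers, scores are aggregated with a plain mean (the aggregation method is not specified here), the exit code is assumed to depend only on blocking evaluators, trace collection is omitted, and the agent is a stub.

```python
def run_evals(agent, test_cases, evaluators):
    """Run every test case through the agent and score it.

    evaluators maps a name to (score_fn, pass_threshold, blocking),
    where score_fn(output, case) returns a float in [0, 1].
    """
    scores = {name: [] for name in evaluators}
    for case in test_cases:
        output = agent(case["input"])
        for name, (score_fn, _, _) in evaluators.items():
            scores[name].append(score_fn(output, case))

    report = {}
    for name, (_, threshold, blocking) in evaluators.items():
        values = scores[name]
        mean = sum(values) / len(values) if values else 0.0
        report[name] = {
            "score": mean,
            "passed": mean >= threshold,
            "blocking": blocking,
        }
    return report


def exit_code(report):
    """0 if no blocking evaluator failed, otherwise 1."""
    failed = any(r["blocking"] and not r["passed"] for r in report.values())
    return 1 if failed else 0


def stub_agent(text):
    # Stand-in for a real agent: upper-cases its input.
    return text.upper()


cases = [
    {"input": "paris", "expected": "PARIS"},
    {"input": "rome", "expected": "Roma"},
]
evaluators = {
    "exact": (lambda out, case: 1.0 if out == case["expected"] else 0.0, 0.8, True),
    "non-empty": (lambda out, case: 1.0 if out else 0.0, 1.0, False),
}

report = run_evals(stub_agent, cases, evaluators)
code = exit_code(report)
for name, row in report.items():
    print(name, row)
print("exit code:", code)
```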

CI run (GitHub Actions)

git push / PR opened
  ↓
actions/checkout + langship/setup-action
  ↓
langship eval run --ci
  ↓
Post results as PR check + comment
  ↓
On pass: langship deploy --env staging

Configuration reference

langship.yaml structure:
project: my-agent          # project name in Control Plane Server
endpoint: ${LANGSHIP_URL}  # server URL (env var)

datasets:
  golden-set:
    path: ./evals/golden-set.jsonl
    version: sha256:abc123  # pin to a specific version (optional)

evals:
  - name: factual-accuracy
    type: llm-judge
    model: gpt-4o-mini       # model to use as judge
    prompt: |
      Rate the factual accuracy of this response on a scale of 0 to 1.
      Response: {{output}}
      Question: {{input}}
    dataset: golden-set
    pass_threshold: 0.8
    blocking: true

  - name: no-refusals
    type: contains
    negate: true             # pass if output does NOT contain the string
    value: "I cannot"
    dataset: golden-set
    blocking: false          # report only, don't fail pipeline

deployments:
  staging:
    target: lyzr             # deploy to Lyzr Studio
    agent_id: ${AGENT_ID}
    on_pass: true
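The ${LANGSHIP_URL} and ${AGENT_ID} placeholders are filled from environment variables. How langship resolves them internally is not documented here, so the following is only a sketch of the general technique, applied to a config that has already been parsed into a dict. The expand helper and the sample values are hypothetical; in real use you would pass os.environ as env.

```python
import re

# Matches ${NAME} where NAME is a valid environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand(value, env):
    """Recursively replace ${NAME} placeholders in strings, dicts, and lists."""
    if isinstance(value, str):
        def lookup(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(lookup, value)
    if isinstance(value, dict):
        return {k: expand(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand(v, env) for v in value]
    return value  # booleans, numbers, None pass through unchanged


config = {
    "project": "my-agent",
    "endpoint": "${LANGSHIP_URL}",
    "deployments": {
        "staging": {"target": "lyzr", "agent_id": "${AGENT_ID}", "on_pass": True},
    },
}
env = {"LANGSHIP_URL": "http://localhost:8080", "AGENT_ID": "agent-123"}

resolved = expand(config, env)
print(resolved["endpoint"])
print(resolved["deployments"]["staging"]["agent_id"])
```

Failing loudly on a missing variable, as this sketch does, avoids sending a literal "${AGENT_ID}" string to the server.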

Observability pipeline

Control Plane uses an OpenTelemetry Collector under the hood:
Your agent (SDK)
  → OTLP exporter
  → Control Plane Collector
  → Control Plane Server (storage + UI)
  → Optional: forward to Jaeger / Grafana / Datadog
Forwarding to external backends is configured in collector-config.yaml (included in the Docker Compose setup).