Pipeline architecture
A Control Plane pipeline has four stages that run on every push. The first stage reads langship.yaml and resolves dataset versions. The remaining three are:
Eval: Run every evaluator defined in your config against the target dataset. Each evaluator produces a pass/fail verdict and a numeric score.
Compare: Diff scores against the previous passing run on the target branch. Surface regressions in the CI summary.
Deploy: If all evals pass (or no blocking evals failed), promote the agent version to the target environment.
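The Compare stage can be pictured as a per-evaluator diff between two score maps. The sketch below is illustrative only: `find_regressions`, the score dictionaries, and the `tolerance` parameter are hypothetical names chosen for this example, not part of Control Plane's API.

```python
def find_regressions(previous, current, tolerance=0.0):
    """Return {evaluator: (old, new)} for scores that dropped by more than tolerance.

    Hypothetical helper: both arguments map evaluator name -> numeric score.
    """
    regressions = {}
    for name, old_score in previous.items():
        new_score = current.get(name)
        if new_score is None:
            continue  # evaluator absent from the current run; nothing to compare
        if old_score - new_score > tolerance:
            regressions[name] = (old_score, new_score)
    return regressions


# Scores from the previous passing run on the target branch vs. this push.
previous_run = {"exact-match": 0.90, "llm-judge": 0.80}
current_run = {"exact-match": 0.92, "llm-judge": 0.70}

print(find_regressions(previous_run, current_run))
# {'llm-judge': (0.8, 0.7)}
```

A non-zero `tolerance` is one way to keep small score fluctuations from being reported as regressions.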
Core data model
Runs
Every time your agent executes (locally or in CI), Control Plane records a Run. A Run has:
- A unique ID
- Start/end time and total duration
- The project and environment it belongs to
- A status: success, failure, or error
- The full trace and any eval results attached
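As a rough mental model, the fields listed above fit into a small record type. This is a sketch under assumptions: the class and field names (`Run`, `RunStatus`, `started_at`, and so on) are invented for illustration and do not reflect Control Plane's actual schema.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import Any


class RunStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    ERROR = "error"


@dataclass
class Run:
    # Hypothetical field names; only the concepts come from the list above.
    project: str
    environment: str
    started_at: datetime
    ended_at: datetime
    status: RunStatus
    trace: Any = None                                  # root of the span tree
    eval_results: list = field(default_factory=list)   # attached eval results
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def duration(self) -> timedelta:
        # Total duration is derived from the start and end times.
        return self.ended_at - self.started_at


start = datetime(2024, 1, 1, 12, 0, 0)
run = Run(
    project="demo-agent",
    environment="staging",
    started_at=start,
    ended_at=start + timedelta(seconds=2.5),
    status=RunStatus.SUCCESS,
)
print(run.status.value, run.duration.total_seconds())
```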
Traces and spans
A trace is a tree of spans. Spans map directly to agent steps:

| Span type | Created by |
|---|---|
| llm | Every LLM call (model, prompt, response, tokens, cost) |
| tool | Every tool invocation (name, input, output, duration) |
| agent | The agent’s decision loop |
| retrieval | Vector store / document retrieval |
| memory | Memory read or write |
| custom | Manual spans via langship.span() |
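To make "a tree of spans" concrete, here is a minimal sketch of a span tree and a depth-first walk over it. The `Span` class, its fields, and the `walk` method are hypothetical stand-ins written for this example; they are not the objects that langship.span() produces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    span_type: str   # one of: llm, tool, agent, retrieval, memory, custom
    name: str
    tokens: int = 0  # only meaningful for llm spans in this sketch
    children: List["Span"] = field(default_factory=list)

    def walk(self):
        """Yield this span, then every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


# An agent loop that calls the model, runs a tool, then calls the model again.
root = Span("agent", "decision-loop", children=[
    Span("llm", "plan", tokens=120),
    Span("tool", "search"),
    Span("llm", "answer", tokens=80),
])

llm_tokens = sum(s.tokens for s in root.walk() if s.span_type == "llm")
print(llm_tokens)
# 200
```

Aggregates such as total tokens or cost per trace fall out of a single walk over the tree.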
Datasets
A Dataset is a versioned collection of test cases. Each test case has:
- input: the query or message to send to your agent
- expected (optional): the expected output for exact-match evals
- metadata (optional): tags, difficulty labels, source references
Dataset versions are pinned in langship.yaml, so your evals always run against the exact same dataset version, which makes results reproducible across branches and time.
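The reproducibility argument rests on versions being immutable: looking up the same name and version always yields the same test cases. The sketch below models that with frozen dataclasses and an in-memory registry. All names here (`DatasetCase`, `Dataset`, `resolve`, the `support-faq` dataset) are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class DatasetCase:
    input: str                       # query or message sent to the agent
    expected: Optional[str] = None   # used by exact-match style evals
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class Dataset:
    name: str
    version: str
    cases: Tuple[DatasetCase, ...]


def resolve(registry, name, version):
    """Look up one exact dataset version; a KeyError means it does not exist."""
    return registry[(name, version)]


v1_cases = (DatasetCase("How do I reset my password?", expected="Use the reset link."),)
v2_cases = v1_cases + (DatasetCase("Cancel my plan", metadata={"difficulty": "hard"}),)

registry = {
    ("support-faq", "v1"): Dataset("support-faq", "v1", v1_cases),
    ("support-faq", "v2"): Dataset("support-faq", "v2", v2_cases),
}

# Publishing v2 leaves v1 untouched, so a config that names v1 keeps seeing one case.
pinned = resolve(registry, "support-faq", "v1")
print(len(pinned.cases), len(resolve(registry, "support-faq", "v2").cases))
# 1 2
```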
Evaluators
| Evaluator type | How it scores |
|---|---|
| exact-match | String equality between agent output and expected |
| contains | Whether output contains a substring |
| semantic-similarity | Embedding cosine similarity against expected |
| llm-judge | LLM grades the response on a 0–1 scale using your prompt |
| python | Custom Python function returning a float in [0, 1] |
| regex | Output matches a pattern |
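The deterministic evaluator types in the table reduce to short scoring functions. The versions below are plain-Python sketches, not Control Plane's implementations. Two assumptions are worth flagging: the regex scorer uses `re.search` (a match anywhere in the output), and the similarity scorer takes precomputed embedding vectors, since the embedding model itself is outside the scope of this example.

```python
import math
import re


def exact_match(output, expected):
    """1.0 when the strings are identical, else 0.0."""
    return 1.0 if output == expected else 0.0


def contains(output, substring):
    """1.0 when the substring occurs anywhere in the output."""
    return 1.0 if substring in output else 0.0


def regex_match(output, pattern):
    """1.0 when the pattern matches anywhere in the output (assumed semantics)."""
    return 1.0 if re.search(pattern, output) else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


print(exact_match("Paris", "Paris"))                # 1.0
print(contains("The capital is Paris.", "Paris"))   # 1.0
print(regex_match("Order #123 shipped", r"#\d+"))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0.0
```

Each function returns a float in [0, 1], which is also the contract the table gives for the python evaluator type.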
Each evaluator is configured as either blocking: true (fail the pipeline on threshold miss) or blocking: false (report only).
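The effect of the blocking flag can be sketched as a gate over a list of eval results. This is an illustration under assumptions: `EvalResult` and `pipeline_should_fail` are hypothetical names, and a result is treated as passing when its score is greater than or equal to its threshold, which the text above does not spell out.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    score: float
    threshold: float
    blocking: bool

    @property
    def passed(self):
        # Assumed convention: meeting the threshold exactly counts as a pass.
        return self.score >= self.threshold


def pipeline_should_fail(results):
    """Fail only when a blocking evaluator misses its threshold."""
    return any(r.blocking and not r.passed for r in results)


results = [
    EvalResult("exact-match", score=0.95, threshold=0.90, blocking=True),
    EvalResult("llm-judge", score=0.60, threshold=0.70, blocking=False),  # report only
]
print(pipeline_should_fail(results))
# False

results.append(EvalResult("regex", score=0.0, threshold=1.0, blocking=True))
print(pipeline_should_fail(results))
# True
```

The non-blocking llm-judge miss still appears in the results, but only the blocking regex miss changes the outcome.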
Execution flow
Local run
CI run (GitHub Actions)
Configuration reference
langship.yaml structure:
Observability pipeline
Control Plane uses an OpenTelemetry Collector under the hood. The collector is configured in collector-config.yaml (included in the Docker Compose setup).