Control Plane integrates with GitHub Actions through the official langship/eval-action. Add it to your workflow to automatically run eval suites on every PR and post results as a check.

Setup

1. Add your Control Plane API key as a secret

In your GitHub repository: Settings → Secrets and variables → Actions → New repository secret
  • Name: LANGSHIP_API_KEY
  • Value: your Control Plane API key (from the Control Plane dashboard)
Also add LANGSHIP_URL if you’re self-hosting:
  • Name: LANGSHIP_URL
  • Value: https://langship.yourcompany.com

2. Add the workflow

Create .github/workflows/eval.yml:
name: Agent Eval

on:
  pull_request:
    branches: [main, staging]
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run Control Plane evals
        uses: langship/eval-action@v1
        with:
          api-key: ${{ secrets.LANGSHIP_API_KEY }}
          url: ${{ secrets.LANGSHIP_URL }}
          config: langship.yaml
          fail-on-regression: true

3. Configure evals in langship.yaml

project: my-agent

evals:
  - name: factual-accuracy
    type: llm-judge
    dataset: golden-set
    pass_threshold: 0.85
    blocking: true

  - name: response-length
    type: python
    function: evals.check_length
    dataset: golden-set
    blocking: false      # report-only, won't block merge

deployments:
  staging:
    target: lyzr
    agent_id: ${{ env.AGENT_ID }}
    on_pass: true
    branches: [main]
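
The response-length eval above points at evals.check_length, but this page does not show what a python eval function looks like. The following is a minimal sketch, assuming the function receives the agent's output as a string and returns a score between 0 and 1. The signature Control Plane actually expects may differ, and MAX_WORDS is a hypothetical budget, not a Control Plane setting.

```python
# evals.py: hypothetical module referenced by `function: evals.check_length`.
# Assumption: Control Plane calls the function with the agent's output text
# and expects a float score in [0, 1]. Verify the real contract before use.

MAX_WORDS = 150  # hypothetical length budget, not a Control Plane setting


def check_length(output: str, max_words: int = MAX_WORDS) -> float:
    """Return 1.0 when the response fits the budget, less as it overruns."""
    word_count = len(output.split())
    if word_count <= max_words:
        return 1.0
    # The score falls off in proportion to the overrun.
    return max_words / word_count
```

A graded score rather than a plain pass/fail lets the eval's threshold decide how much overrun is acceptable.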

What the action does

On every PR:
  1. Runs each eval defined in langship.yaml against the configured dataset
  2. Posts results as a GitHub Check on the PR; you will see pass/fail in the PR status bar
  3. Posts a PR comment with a results table (one row per evaluator)
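  4. If fail-on-regression: true and any blocking eval scores below its threshold, the action exits with code 1 and the check fails. GitHub blocks the merge only if that check is marked as required in the branch protection rules for the target branch.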
Example PR comment:
## Control Plane Eval Results

| Eval | Score | Threshold | Status |
|---|---|---|---|
| factual-accuracy | 0.91 | 0.85 | ✅ Pass |
| no-refusals | 1.00 | 0.95 | ✅ Pass |
| response-length | 0.78 | 0.80 | ❌ Fail |

**Overall: 2/3 passing.** response-length is non-blocking, so merge is allowed.
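
A comment like the one above can be derived from per-eval scores and thresholds. The sketch below illustrates that rendering with a hypothetical EvalResult record. It is not the action's implementation, and the pass rule (a score at or above the threshold passes) is inferred from the example rows.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:  # hypothetical record, not a Control Plane type
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: a score at or above the threshold counts as a pass.
        return self.score >= self.threshold


def render_comment(results: list[EvalResult]) -> str:
    """Build a Markdown results table with one row per eval."""
    lines = [
        "## Control Plane Eval Results",
        "",
        "| Eval | Score | Threshold | Status |",
        "|---|---|---|---|",
    ]
    for r in results:
        status = "✅ Pass" if r.passed else "❌ Fail"
        lines.append(f"| {r.name} | {r.score:.2f} | {r.threshold:.2f} | {status} |")
    passing = sum(r.passed for r in results)
    lines += ["", f"**Overall: {passing}/{len(results)} passing**"]
    return "\n".join(lines)


results = [
    EvalResult("factual-accuracy", 0.91, 0.85),
    EvalResult("no-refusals", 1.00, 0.95),
    EvalResult("response-length", 0.78, 0.80),
]
comment = render_comment(results)
```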

Blocking vs non-blocking evals

evals:
  - name: safety-check
    blocking: true     # PR cannot merge if this fails
    pass_threshold: 1.0

  - name: verbosity-score
    blocking: false    # Results shown, but merge not blocked
    pass_threshold: 0.7
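
How blocking interacts with fail-on-regression can be summarized as a small decision function. The sketch below illustrates the behavior described on this page using hypothetical names (Eval, exit_code). It is not the action's source, and the behavior when fail-on-regression is false is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Eval:  # hypothetical record mirroring the langship.yaml fields
    name: str
    score: float
    pass_threshold: float
    blocking: bool


def exit_code(evals: list[Eval], fail_on_regression: bool = True) -> int:
    """Return 1 when a blocking eval falls below its threshold, else 0."""
    if not fail_on_regression:
        return 0  # assumed: failures are still reported, but the job succeeds
    for e in evals:
        if e.blocking and e.score < e.pass_threshold:
            return 1
    return 0


evals = [
    Eval("safety-check", score=1.0, pass_threshold=1.0, blocking=True),
    Eval("verbosity-score", score=0.55, pass_threshold=0.7, blocking=False),
]
print(exit_code(evals))  # prints 0: only the non-blocking eval failed
```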

Auto-deploy on merge

When a PR merges to main and all blocking evals pass, Control Plane can automatically deploy the new agent version:
deployments:
  production:
    target: lyzr
    agent_id: ${{ env.PROD_AGENT_ID }}
    on_pass: true
    branches: [main]
    require_approval: true    # opens a GitHub environment approval gate
With require_approval: true, the deploy step waits for a reviewer to approve in the GitHub Actions UI before proceeding.
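
The deployment fields combine into a simple gate. The sketch below makes that explicit under stated assumptions: the names Deployment and next_step are hypothetical, and the semantics (deploy only on listed branches, only when blocking evals pass, pause when approval is required) are read from the description above rather than from the Control Plane source.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:  # hypothetical mirror of one `deployments:` entry
    target: str
    on_pass: bool = True
    branches: list[str] = field(default_factory=lambda: ["main"])
    require_approval: bool = False


def next_step(dep: Deployment, branch: str, blocking_evals_passed: bool) -> str:
    """Return 'skip', 'await-approval', or 'deploy' for a push to `branch`."""
    if branch not in dep.branches:
        return "skip"
    if dep.on_pass and not blocking_evals_passed:
        return "skip"
    return "await-approval" if dep.require_approval else "deploy"


production = Deployment(target="lyzr", require_approval=True)
```

For example, next_step(production, "main", True) yields "await-approval", matching the approval gate described above.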

Matrix evals across environments

Test your agent against multiple models or configurations in parallel:
# .github/workflows/eval.yml
jobs:
  eval:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        model: [gpt-4o, gpt-4o-mini, claude-3-5-sonnet]
    steps:
      - uses: actions/checkout@v4
      - uses: langship/eval-action@v1
        with:
          api-key: ${{ secrets.LANGSHIP_API_KEY }}
          env-vars: |
            AGENT_MODEL=${{ matrix.model }}
Results for each matrix leg appear as separate checks on the PR.
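
On the agent side, each matrix leg can pick up its model from the environment. This is a minimal sketch, assuming env-vars exports AGENT_MODEL as a regular environment variable to the eval process. The resolve_model helper and the gpt-4o fallback are hypothetical.

```python
import os

DEFAULT_MODEL = "gpt-4o"  # hypothetical fallback for local runs


def resolve_model(environ=os.environ) -> str:
    """Pick the model for this run from AGENT_MODEL, falling back to a default."""
    model = environ.get("AGENT_MODEL", "").strip()
    return model or DEFAULT_MODEL
```

Passing the environment as a parameter keeps the helper easy to test without mutating os.environ.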

Caching

Speed up eval runs by caching your Python dependencies:
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}

- run: pip install -r requirements.txt

Action inputs

| Input | Required | Default | Description |
|---|---|---|---|
| api-key | Yes | None | Control Plane API key |
| url | No | http://localhost:3000 | Control Plane server URL |
| config | No | langship.yaml | Path to config file |
| fail-on-regression | No | true | Exit 1 if blocking eval fails |
| post-comment | No | true | Post results as PR comment |
| dataset-version | No | latest | Pin dataset to a specific version |