This cookbook walks you through building Nexus, a Level 1 IT support voice agent, end to end: pick the models, save the agent, run a live session in a browser, and fetch transcripts.
What you’ll build
Nexus greets the caller, asks whether they’re on Mac or Windows, and walks them through Level 1 troubleshooting steps. By the end of this guide you’ll have:
- a saved agent on the Lyzr Voice API (reusable across sessions),
- a Python script that starts and ends sessions,
- a React component that connects a browser microphone to a live session,
- a script that pulls the transcript and latency metrics for any past call.
Prerequisites
- A Lyzr API key. Set it as LYZR_API_KEY for the Python scripts.
- Python with requests (pip install requests).
- For the browser section: a React app that can install npm/pnpm packages.
Export your key before running the scripts (VOICE_API_BASE_URL is optional; the scripts default to the URL shown):
export LYZR_API_KEY="your-lyzr-api-key"
export VOICE_API_BASE_URL="https://voice-livekit.studio.lyzr.ai/v1"
1. Pick your models and a voice
Pipeline mode runs three separate models — STT, LLM, TTS — plus a TTS voice. Before creating the agent, ask the API which options are available so you can pick concrete values for each.
import os
from urllib.parse import urlencode
import requests
API_KEY = os.environ["LYZR_API_KEY"]
BASE_URL = os.getenv("VOICE_API_BASE_URL", "https://voice-livekit.studio.lyzr.ai/v1")
HEADERS = {"x-api-key": API_KEY, "accept": "application/json"}
def get_json(path: str, params=None) -> dict:
query = f"?{urlencode(params)}" if params else ""
response = requests.get(f"{BASE_URL}{path}{query}", headers=HEADERS, timeout=30)
response.raise_for_status()
return response.json()
def main() -> None:
# Pick the first TTS provider that's configured for your account.
providers = get_json("/config/tts-voice-providers")["providers"]
active_provider = next(p["providerId"] for p in providers if p.get("configured"))
# Pick a model for each pipeline stage.
options = get_json("/config/pipeline-options")
stt = options["stt"][0]["models"][0]["id"]
llm = options["llm"][0]["models"][0]["id"]
tts_provider = next(p for p in options["tts"] if p["providerId"] == active_provider)
tts_model = tts_provider["models"][0]["id"]
# Pick a voice for the chosen provider.
voices = get_json(
"/config/tts-voices",
{"providerId": active_provider, "limit": 5},
)["voices"]
voice = voices[0]
print(f"STT: {stt}")
print(f"LLM: {llm}")
print(f"TTS: {tts_model} ({active_provider})")
print(f"Voice: {voice['name']} ({voice['id']})")
if __name__ == "__main__":
main()
You should see something like:
STT: assemblyai/universal-streaming:en
LLM: openai/gpt-4o-mini
TTS: cartesia/sonic-3 (cartesia)
Voice: Friendly Reading Man (9626c31c-bec5-4cca-baa8-f8ba9e84c8bc)
Hold onto the voice ID — you’ll pass it as engine.voice_id in the next step.
This script takes the happy path: the first configured provider, the first model in each list, the first voice. In production code, handle empty results explicitly and let users pick a voice rather than auto-selecting one.
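If you want a sturdier version, the sketch below shows the idea (the pick_first helper is ours, not part of the API, and it reuses get_json from the script above). It fails with a readable error instead of a bare StopIteration or IndexError when a provider or model list comes back empty:

def pick_first(items: list, label: str) -> dict:
    # Raise a clear error instead of IndexError when the API returns an empty list.
    if not items:
        raise RuntimeError(f"No {label} available for this account; check your provider configuration.")
    return items[0]

# Same selections as in main() above, with explicit guards.
providers = get_json("/config/tts-voice-providers")["providers"]
configured = [p for p in providers if p.get("configured")]
active_provider = pick_first(configured, "configured TTS provider")["providerId"]

options = get_json("/config/pipeline-options")
stt = pick_first(pick_first(options["stt"], "STT provider")["models"], "STT model")["id"]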
2. Create the Nexus agent
POST /agents saves an agent definition so you don’t have to send the whole config on every session. The response includes agent.id — store it; the rest of this cookbook references it.
import json
import os
import requests
API_KEY = os.environ["LYZR_API_KEY"]
BASE_URL = os.getenv("VOICE_API_BASE_URL", "https://voice-livekit.studio.lyzr.ai/v1")
HEADERS = {
"x-api-key": API_KEY,
"accept": "application/json",
"Content-Type": "application/json",
}
def create_nexus_agent() -> str:
payload = {
"config": {
# Identity and persona.
"agent_name": "Nexus - IT Support",
"agent_description": "Level 1 technical support voice agent",
"prompt": (
"You are Nexus, a Level 1 IT support assistant. Keep answers concise "
"and easy to understand over an audio call. Always ask whether the "
"user is on Windows or Mac before giving troubleshooting steps."
),
"conversation_start": {
"who": "ai",
"greeting": (
"Say, \"Hi, I'm Nexus, your IT support assistant. "
"Are you calling about a Mac or Windows computer today?\""
),
},
# Models — fill in the IDs you picked in step 1.
"engine": {
"kind": "pipeline",
"stt": "assemblyai/universal-streaming:en",
"tts": "cartesia/sonic-3",
"llm": "openai/gpt-4o-mini",
"voice_id": "9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
"language": "en",
},
# Optional polish — turn detection, VAD, recording, ambience.
"turn_detection": "english",
"vad_enabled": True,
"noise_cancellation": {"enabled": True, "type": "auto"},
"audio_recording_enabled": True,
"background_audio": {
"enabled": True,
"ambient": {
"enabled": True,
"source": "OFFICE_AMBIENCE",
"volume": 0.4,
},
"tool_call": {
"enabled": True,
"sources": [
{"source": "KEYBOARD_TYPING_TRUNC", "volume": 0.6, "probability": 1}
],
},
},
}
}
response = requests.post(
f"{BASE_URL}/agents", headers=HEADERS, json=payload, timeout=30
)
if response.status_code != 201:
print(json.dumps(response.json(), indent=2))
response.raise_for_status()
agent_id = response.json()["agent"]["id"]
print(f"Created agent: {agent_id}")
return agent_id
if __name__ == "__main__":
create_nexus_agent()
Export the printed ID so the next scripts can pick it up:
export LYZR_VOICE_AGENT_ID="agent_..."
Only agent_name, prompt, and engine are strictly required; agent_description, conversation_start, and everything under the "Optional polish" comment are extras. Drop them for a stripped-down agent and add them back as you need them.
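For reference, a stripped-down config looks like this (same endpoint and headers as above; the engine block reuses the IDs picked in step 1):

minimal_payload = {
    "config": {
        "agent_name": "Nexus - IT Support",
        "prompt": "You are Nexus, a Level 1 IT support assistant.",
        "engine": {
            "kind": "pipeline",
            "stt": "assemblyai/universal-streaming:en",
            "tts": "cartesia/sonic-3",
            "llm": "openai/gpt-4o-mini",
            "voice_id": "9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
            "language": "en",
        },
    }
}
# requests.post(f"{BASE_URL}/agents", headers=HEADERS, json=minimal_payload, timeout=30)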
3. Start and end a session
Starting a session dispatches the Python worker, creates a LiveKit room, and returns the credentials your client needs to join.
Request to POST /sessions/start:
{
"userIdentity": "user_123",
"agentId": "your-agent-id"
}
Response:
{
"userToken": "livekit-client-token",
"roomName": "room-...",
"sessionId": "uuid",
"livekitUrl": "wss://...",
"agentDispatched": true,
"agentConfig": {
"engine": {
"kind": "pipeline",
"stt": "assemblyai/universal-streaming:en",
"tts": "cartesia/sonic-3",
"llm": "openai/gpt-4o-mini",
"voice_id": "9626c31c-bec5-4cca-baa8-f8ba9e84c8bc"
},
"tools": []
}
}
You’ll feed livekitUrl and userToken into the LiveKit client in the next step. When the call is done, call POST /sessions/end with either roomName or sessionId.
The Python script below handles only the lifecycle — it doesn’t publish a microphone or play audio. On its own it just opens an empty room and waits. You’ll connect a real microphone in step 4.
import os
import time
import requests
API_KEY = os.environ["LYZR_API_KEY"]
AGENT_ID = os.environ["LYZR_VOICE_AGENT_ID"]
BASE_URL = os.getenv("VOICE_API_BASE_URL", "https://voice-livekit.studio.lyzr.ai/v1")
HEADERS = {
"x-api-key": API_KEY,
"accept": "application/json",
"Content-Type": "application/json",
}
def start_session() -> dict:
payload = {
"userIdentity": f"user_{int(time.time())}",
"agentId": AGENT_ID,
}
response = requests.post(
f"{BASE_URL}/sessions/start", headers=HEADERS, json=payload, timeout=30
)
response.raise_for_status()
data = response.json()
print(f"Session ID: {data['sessionId']}")
print(f"Room name: {data['roomName']}")
print(f"LiveKit URL: {data['livekitUrl']}")
return data
def end_session(room_name: str) -> None:
response = requests.post(
f"{BASE_URL}/sessions/end",
headers=HEADERS,
json={"roomName": room_name},
timeout=30,
)
response.raise_for_status()
print("Session ended.")
if __name__ == "__main__":
session = start_session()
print("Connect a LiveKit client now. Press Enter when you're done.")
input()
end_session(session["roomName"])
POST /sessions/end returns immediately and marks the session as ended on the API side. The transcript becomes available once the worker flushes its buffers and observability data arrives — typically a few seconds after the LiveKit room closes.
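If a script needs the transcript right after ending a call, poll for it briefly instead of fetching immediately. The sketch below reuses BASE_URL and HEADERS from the script above and assumes GET /transcripts/{sessionId} (the endpoint used in step 5) returns 404 until the worker has flushed; adjust the check if your deployment signals "not ready" differently:

import time
import requests

def wait_for_transcript(session_id: str, attempts: int = 10, delay_s: float = 2.0) -> dict:
    # Retry until the transcript endpoint stops returning 404, or give up.
    for _ in range(attempts):
        response = requests.get(
            f"{BASE_URL}/transcripts/{session_id}", headers=HEADERS, timeout=30
        )
        if response.status_code == 404:
            time.sleep(delay_s)
            continue
        response.raise_for_status()
        return response.json()["transcript"]
    raise TimeoutError(f"Transcript for session {session_id} was not available in time.")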
4. Connect from the browser
The REST API has done its job; from here, all the audio plumbing belongs to the LiveKit SDK. We’ll build two files:
- voice-api.ts — a thin wrapper around /v1/sessions/start and /v1/sessions/end.
- NexusVoiceWidget.tsx — a React component that joins the room and renders the agent's audio.
Install the client packages:
pnpm add livekit-client @livekit/components-react @livekit/components-styles
voice-api.ts
const API_BASE_URL =
import.meta.env.VITE_LIVEKIT_BACKEND_URL ?? "https://voice-livekit.studio.lyzr.ai";
export interface SessionResponse {
userToken: string;
roomName: string;
sessionId: string;
livekitUrl: string;
agentDispatched: boolean;
}
export async function startVoiceSession(input: {
apiKey: string;
userIdentity: string;
agentId?: string;
agentConfig?: Record<string, unknown>;
}): Promise<SessionResponse> {
const response = await fetch(`${API_BASE_URL}/v1/sessions/start`, {
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": input.apiKey,
},
body: JSON.stringify({
userIdentity: input.userIdentity,
agentId: input.agentId,
agentConfig: input.agentConfig,
}),
});
if (!response.ok) {
throw new Error(`Failed to start voice session: ${response.status}`);
}
return response.json();
}
export async function endVoiceSession(input: {
apiKey: string;
roomName: string;
}): Promise<void> {
const response = await fetch(`${API_BASE_URL}/v1/sessions/end`, {
method: "POST",
headers: {
"Content-Type": "application/json",
"x-api-key": input.apiKey,
},
body: JSON.stringify({ roomName: input.roomName }),
});
if (!response.ok && response.status !== 404) {
throw new Error(`Failed to end voice session: ${response.status}`);
}
}
The widget renders two audio sinks inside <LiveKitRoom>:
- RoomAudioRenderer — plays the agent's voice. This is the one you actually hear Nexus through.
- BackgroundAudioRenderer — a small custom renderer that attaches the optional background_audio track (office ambience, keyboard sounds during tool calls). It's separate because the default renderer doesn't surface non-voice tracks.
NexusVoiceWidget.tsx
import { useEffect, useRef, useState } from "react";
import {
LiveKitRoom,
RoomAudioRenderer,
useTracks,
useVoiceAssistant,
} from "@livekit/components-react";
import { Track } from "livekit-client";
import "@livekit/components-styles";
import { endVoiceSession, startVoiceSession, type SessionResponse } from "./voice-api";
function BackgroundAudioRenderer() {
const audioRef = useRef<HTMLAudioElement | null>(null);
const tracks = useTracks([Track.Source.Unknown], { onlySubscribed: true });
const backgroundTrack = tracks.find(
(track) => track.publication?.trackName === "background_audio",
);
useEffect(() => {
const mediaTrack = backgroundTrack?.publication?.track;
if (!mediaTrack) return;
const audioElement = audioRef.current ?? document.createElement("audio");
audioElement.autoplay = true;
audioElement.setAttribute("playsinline", "true");
audioRef.current = audioElement;
mediaTrack.attach(audioElement);
return () => {
mediaTrack.detach(audioElement);
};
}, [backgroundTrack?.publication?.track]);
return null;
}
function AgentStatus() {
const { state } = useVoiceAssistant();
return <p>Agent state: {state}</p>;
}
export function NexusVoiceWidget(props: { apiKey: string; agentId: string }) {
const [session, setSession] = useState<SessionResponse | null>(null);
async function startCall() {
const data = await startVoiceSession({
apiKey: props.apiKey,
userIdentity: `user_${Date.now()}`,
agentId: props.agentId,
});
setSession(data);
}
async function endCall() {
if (session) {
await endVoiceSession({ apiKey: props.apiKey, roomName: session.roomName });
}
setSession(null);
}
if (!session) {
return <button onClick={startCall}>Start call</button>;
}
return (
<LiveKitRoom
serverUrl={session.livekitUrl}
token={session.userToken}
connect
audio
video={false}
onDisconnected={() => void endCall()}
>
<AgentStatus />
<button onClick={endCall}>End call</button>
<RoomAudioRenderer />
<BackgroundAudioRenderer />
</LiveKitRoom>
);
}
Render <NexusVoiceWidget apiKey={...} agentId={...} /> somewhere in your app, click Start call, grant microphone permission, and you should hear Nexus’s greeting within a second or two.
In a real app, never pass apiKey straight to the browser. Proxy /v1/sessions/start through your own backend so the key stays on the server.
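One way to do that is a small proxy on your own server. The sketch below uses Flask (not one of this cookbook's prerequisites; use whatever backend framework you already run) and arbitrary route names; it forwards the browser's request and keeps LYZR_API_KEY on the server. There is no auth or rate limiting here, so add both before exposing it. If you adopt it, point API_BASE_URL in voice-api.ts at your backend and drop the x-api-key header from the browser code.

import os
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
BASE_URL = os.getenv("VOICE_API_BASE_URL", "https://voice-livekit.studio.lyzr.ai/v1")
HEADERS = {"x-api-key": os.environ["LYZR_API_KEY"], "Content-Type": "application/json"}

@app.post("/api/voice/start")
def start_session():
    # Forward only the fields the browser is allowed to set; the API key stays server-side.
    body = request.get_json(force=True) or {}
    payload = {
        "userIdentity": body.get("userIdentity", "anonymous"),
        "agentId": body.get("agentId"),
    }
    upstream = requests.post(
        f"{BASE_URL}/sessions/start", headers=HEADERS, json=payload, timeout=30
    )
    return jsonify(upstream.json()), upstream.status_code

@app.post("/api/voice/end")
def end_session():
    body = request.get_json(force=True) or {}
    upstream = requests.post(
        f"{BASE_URL}/sessions/end",
        headers=HEADERS,
        json={"roomName": body.get("roomName")},
        timeout=30,
    )
    return jsonify(upstream.json()), upstream.status_code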
5. Read transcripts and metrics
Once a call has ended and the worker has flushed, you can pull aggregate stats for the agent and a per-session transcript with latency metrics on each turn.
import os
import requests
API_KEY = os.environ["LYZR_API_KEY"]
AGENT_ID = os.environ["LYZR_VOICE_AGENT_ID"]
BASE_URL = os.getenv("VOICE_API_BASE_URL", "https://voice-livekit.studio.lyzr.ai/v1")
HEADERS = {"x-api-key": API_KEY, "accept": "application/json"}
def get_json(path: str) -> dict:
response = requests.get(f"{BASE_URL}{path}", headers=HEADERS, timeout=30)
response.raise_for_status()
return response.json()
def text_from_content(content: object) -> str:
# Messages can be a plain string or a list of parts (multimodal turns);
# flatten both into a single line for printing.
if isinstance(content, list):
return " ".join(str(part) for part in content)
return str(content or "")
def fetch_analytics() -> None:
stats = get_json(f"/transcripts/agent/{AGENT_ID}/stats")
print("Aggregate stats")
print(f"Total calls: {stats.get('totalCalls')}")
print(f"Average messages: {stats.get('avgMessages')}")
recent = get_json(f"/transcripts/agent/{AGENT_ID}?sort=desc&limit=5")
items = recent.get("items", [])
if not items:
print("No transcripts found. Complete a voice session first.")
return
session_id = items[0]["sessionId"]
transcript = get_json(f"/transcripts/{session_id}")["transcript"]
print("\nSession overview")
print(f"Session ID: {transcript.get('sessionId')}")
print(f"Room: {transcript.get('roomName')}")
print(f"Duration (s): {(transcript.get('durationMs') or 0) / 1000:.2f}")
print(f"Message count: {transcript.get('messageCount')}")
print("\nConversation")
for item in transcript.get("chatHistory", []):
if item.get("type") != "message":
continue
role = item.get("role", "system").upper()
print(f"[{role}] {text_from_content(item.get('content'))}")
print("\nLatest assistant latency")
for item in reversed(transcript.get("chatHistory", [])):
if item.get("role") != "assistant" or "metrics" not in item:
continue
metrics = item["metrics"]
print(f"LLM TTFT: {metrics.get('llm_node_ttft')}")
print(f"TTS TTFB: {metrics.get('tts_node_ttfb')}")
break
if __name__ == "__main__":
fetch_analytics()
llm_node_ttft is the time from the user finishing their turn to the first LLM token; tts_node_ttfb is the time from that first token to the first byte of synthesized audio. Together they’re the dominant contributors to perceived latency.
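As a rough sanity check you can sum the two for every assistant turn. The helper below is ours; it takes the transcript dict fetched in the script above and prints values in whatever unit the API reports, without converting them:

def approximate_turn_latency(transcript: dict) -> None:
    # Sum the two dominant latency contributors for each assistant turn that has metrics.
    for item in transcript.get("chatHistory", []):
        if item.get("role") != "assistant" or "metrics" not in item:
            continue
        metrics = item["metrics"]
        ttft = metrics.get("llm_node_ttft") or 0
        ttfb = metrics.get("tts_node_ttfb") or 0
        print(f"Assistant turn latency (LLM TTFT + TTS TTFB): {ttft + ttfb}")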
Where to go next
- Swap engine.kind from "pipeline" to "realtime" to use a single multimodal model instead of three.
- Add tools to the agent payload so Nexus can look up tickets, reset passwords, or hand off to a human.
- Stream transcripts in real time by subscribing to LiveKit data tracks instead of polling /transcripts/{sessionId} afterward.