API Reference¶
HTTP JSON WebSocket PCM Server-Sent Events WAV clips Live-token integration routes
This document describes the HTTP, WebSocket, and Server-Sent Events API exposed by doctorita-transcribe.
The service is intentionally stateful: sessions are created in memory, audio is streamed into the owning process, and session routes return 404 after a server restart unless the caller creates a new session.
Stateful sessions
Session IDs are live-process handles. The canonical audio spool is written under DATA_DIR, but session route lookup is in memory, so clients should create a new session after a server restart.
Conventions¶
Base URL¶
When running locally with the default configuration, the examples in this document use http://localhost:8080 as the base URL.
All public API paths below are relative to that base URL.
Content types¶
| Direction | Content type |
|---|---|
| JSON requests | application/json |
| JSON responses | application/json |
| Audio uploads | multipart/form-data |
| Downloaded recordings and clips | audio/wav |
| Live transcript, inspect, and log streams | text/event-stream |
| Live audio ingest | WebSocket text control frame followed by binary PCM frames |
Time and audio units¶
| Unit | Meaning |
|---|---|
| *_ms | Milliseconds from the beginning of the session recording |
| *_sample | 16 kHz mono PCM sample index, half-open interval style: [start_sample, end_sample) |
| updated_at, created_at, time | JSON timestamp emitted by Go's time.Time encoder, normally RFC 3339 with nanoseconds when needed |
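The units above can be converted with simple arithmetic. The following is a minimal sketch, assuming 16-bit samples (the pcm_s16le format required by the audio WebSocket); the helper names are hypothetical and not part of the service.

```javascript
// Constants taken from the audio contract described in this document.
const SAMPLE_RATE_HZ = 16000; // 16 kHz mono
const BYTES_PER_SAMPLE = 2;   // 16-bit PCM (assumption: pcm_s16le)

// Convert a millisecond offset to a sample index (16 samples per millisecond).
function msToSample(ms) {
  return Math.round((ms * SAMPLE_RATE_HZ) / 1000);
}

// Convert a sample index back to a millisecond offset, rounded down.
function sampleToMs(sample) {
  return Math.floor((sample * 1000) / SAMPLE_RATE_HZ);
}

// Byte length of the half-open window [startSample, endSample).
function windowByteLength(startSample, endSample) {
  if (endSample <= startSample) {
    throw new RangeError('end_sample must be greater than start_sample');
  }
  return (endSample - startSample) * BYTES_PER_SAMPLE;
}
```

For a 10 second window these helpers give 160000 samples and 320000 bytes, which agrees with the InspectEvent example in the Schemas section.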
The audio ingest contract is always 16 kHz, mono, signed 16-bit little-endian PCM (sample_rate 16000, channels 1, format pcm_s16le).
Error envelope¶
Errors are JSON objects:
Common statuses:
| Status | Meaning |
|---|---|
| 400 | Invalid request body, query, form data, ASR config, or audio slice |
| 401 | Invalid or missing integration live token |
| 404 | Session or route not found |
| 405 | Method not allowed |
| 409 | Operation requires a finalized transcript or existing chunk transcript |
| 502 | Upstream transcription provider failed for direct batch transcription |
| 503 | Optional service is unavailable, such as ElevenLabs or log streaming |
Transcript consistency¶
Transcript snapshots can be requested at different consistency levels:
| Value | Included words |
|---|---|
| PARTIAL | Partial, stable, and final words |
| STABLE | Stable and final words |
| FINAL | Final words only |
The consistency query parameter is case-insensitive. Missing or unknown values default to PARTIAL.
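A client can mirror this rule locally, for example to predict which words a given level will include. This is a minimal sketch of the documented behavior, not the server's code; normalizeConsistency and isVisible are hypothetical names, and whether the server trims whitespace is not specified, so the sketch does not.

```javascript
const LEVELS = ['PARTIAL', 'STABLE', 'FINAL'];

// Word states included at each consistency level, per the table above.
const VISIBLE_STATES = {
  PARTIAL: ['PARTIAL', 'STABLE', 'FINAL'],
  STABLE: ['STABLE', 'FINAL'],
  FINAL: ['FINAL'],
};

// Case-insensitive match; missing or unknown values fall back to PARTIAL.
function normalizeConsistency(value) {
  const upper = typeof value === 'string' ? value.toUpperCase() : '';
  return LEVELS.includes(upper) ? upper : 'PARTIAL';
}

// True when a word in the given state is included at the requested level.
function isVisible(state, consistency) {
  return VISIBLE_STATES[normalizeConsistency(consistency)].includes(state);
}
```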
Authentication¶
The public /v1/* routes do not currently enforce an application auth layer.
The integration-prefixed live routes under /api/v1/transcribe-live/* require a live token. The token can be supplied either as:
or:
The token format is:
Payload schema:
{
"session_id": "18b9617a48fef8072bfc2c38",
"transcript_id": "optional-upstream-transcript-id",
"user_id": "optional-upstream-user-id",
"exp": 1777392000
}
Validation rules:
| Claim | Rule |
|---|---|
| session_id | Must match the route session ID |
| transcript_id | If the session metadata has transcript_id, the token claim must match |
| user_id | If the session metadata has user_id, the token claim must match |
| exp | Optional Unix timestamp. If present and in the past, the token is rejected |
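The claim rules above can be sketched as a single check over an already decoded payload. This is an illustration only: checkLiveTokenClaims is a hypothetical helper, the reason strings are made up, and decoding or verifying the token itself is out of scope. Treating an exp equal to the current second as still valid is an assumption.

```javascript
// Returns null when the claims pass, or a short reason string when they fail.
// claims: decoded token payload
// routeSessionId: session ID from the request path
// meta: session metadata, possibly holding transcript_id and user_id
// nowSec: current Unix time in seconds
function checkLiveTokenClaims(claims, routeSessionId, meta = {}, nowSec = Math.floor(Date.now() / 1000)) {
  if (claims.session_id !== routeSessionId) return 'session_id mismatch';
  // transcript_id and user_id are only enforced when the session has them.
  if (meta.transcript_id && claims.transcript_id !== meta.transcript_id) return 'transcript_id mismatch';
  if (meta.user_id && claims.user_id !== meta.user_id) return 'user_id mismatch';
  // exp is optional; a present value in the past rejects the token.
  if (claims.exp !== undefined && claims.exp < nowSec) return 'token expired';
  return null;
}
```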
Endpoint index¶
| Method | Path | Purpose |
|---|---|---|
| GET | /healthz | Service health and active transcription provider |
| GET | /v1/logs/events | Service log SSE stream |
| POST | /v1/sessions | Create a live session |
| GET | /v1/sessions/{session_id}/audio/ws | Stream PCM audio over WebSocket |
| POST | /v1/sessions/{session_id}/stop | Stop audio ingest and seal the transcript |
| GET | /v1/sessions/{session_id}/transcript | Fetch a transcript snapshot |
| GET | /v1/sessions/{session_id}/events | Transcript SSE stream |
| GET | /v1/sessions/{session_id}/recording | Download session recording WAV |
| GET | /v1/sessions/{session_id}/inspect/events | Inspect/debug SSE stream |
| GET | /v1/sessions/{session_id}/inspect/audio | Download a WAV clip by sample window |
| GET | /v1/sessions/{session_id}/asr-config | Read ASR window config |
| PATCH | /v1/sessions/{session_id}/asr-config | Patch ASR window config |
| POST | /v1/sessions/{session_id}/speaker-identification | Rerun speaker identification |
| POST | /v1/sessions/{session_id}/chunk-pass/elevenlabs | Rerun live chunk pass through ElevenLabs |
| POST | /v1/sessions/{session_id}/full-pass/elevenlabs | Run full-recording ElevenLabs pass |
| GET | /v1/speakers | List saved speaker profiles |
| POST | /v1/speakers | Enroll a speaker from a finalized session clip |
| POST | /v1/transcriptions/elevenlabs | Direct full-audio ElevenLabs transcription comparison |
| GET | /api/v1/transcribe-live/sessions/{session_id}/audio/ws | Protected integration live audio WebSocket |
| GET | /api/v1/transcribe-live/sessions/{session_id}/events | Protected integration transcript SSE |
| POST | /api/v1/transcribe-live/sessions/{session_id}/stop | Protected integration stop route |
Health¶
GET /healthz¶
Returns basic service health and the configured transcription provider.
Response 200¶
| Field | Type | Notes |
|---|---|---|
| status | string | Currently ok when the handler is reachable |
| provider | string | Active provider name, such as elevenlabs, openai, or an empty string if the manager is absent |
Example¶
Sessions¶
POST /v1/sessions¶
Creates a new in-memory live session and a session directory under DATA_DIR.
Request body¶
Body is JSON. The body may be omitted, but clients should send {} if they do not have metadata.
{
"language_hint": "en",
"glossary": ["Doctorita", "ECAPA"],
"transcript_id": "upstream-transcript-123",
"user_id": "user-123",
"asr_window_config": {
"pre_roll_ms": 700,
"post_roll_ms": 700,
"min_commit_ms": 4000,
"target_commit_ms": 10000,
"max_commit_ms": 15000,
"merge_gap_ms": 1800,
"min_speech_ms": 2500,
"min_isolated_ms": 400,
"commit_tolerance_ms": 200
}
}
| Field | Type | Required | Notes |
|---|---|---|---|
| language_hint | string | No | Passed to the transcription provider when supported |
| glossary | string array | No | Terms for client/backend metadata and prompt context |
| transcript_id | string | No | Upstream transcript ID used by backend callbacks and live-token validation |
| user_id | string | No | Upstream user ID used by backend callbacks and live-token validation |
| asr_window_config | object | No | Overrides the default ASR chunk window config for this session |
Response 201¶
Errors¶
| Status | Cause |
|---|---|
| 400 | Invalid JSON or invalid ASR window config |
| 405 | Any method other than POST |
| 500 | Session directory, VAD detector, or session actor creation failed |
Example¶
curl -X POST http://localhost:8080/v1/sessions \
-H 'Content-Type: application/json' \
-d '{"language_hint":"en","glossary":["Doctorita"]}'
GET /v1/sessions/{session_id}/audio/ws¶
Opens the live PCM ingest WebSocket for a session.
The first frame must be a JSON text frame:
After the start message, every frame must be binary PCM:
WebSocket validation¶
| Rule | Failure close reason |
|---|---|
| Missing first frame | missing start message |
| First frame is not text | first message must be JSON text |
| First frame is invalid JSON | invalid start message |
| type is not start | first audio websocket message must be type=start |
| sample_rate is not 16000 | sample_rate must be 16000 |
| channels is not 1 | channels must be 1 |
| format is not pcm_s16le | format must be pcm_s16le |
| Session already has an audio stream | session already has an audio stream |
| Later frame is not binary | audio frames must be binary PCM16 |
| Binary frame has invalid PCM16 length | Decoder error, for example an odd byte count |
Allowed browser origins are controlled by DOCTORITA_TRANSCRIBE_ALLOWED_ORIGINS, which defaults to:
State side effects¶
| Event | Effect |
|---|---|
| WebSocket accepted and valid start frame received | The session begins accepting PCM |
| Binary PCM frames received | Frames are appended to the in-memory ring and canonical spool |
| WebSocket closes | The session stops accepting PCM, but the session is not finalized until POST /stop |
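A minimal client sketch, assuming the start message carries exactly the four fields named in the validation table (type, sample_rate, channels, format); any further fields the server may accept are not covered here. buildStartMessage, floatToPcm16le, and streamAudio are hypothetical helper names, and wsUrl and nextChunk are supplied by the caller.

```javascript
// Start message reconstructed from the validation rules above.
function buildStartMessage() {
  return JSON.stringify({ type: 'start', sample_rate: 16000, channels: 1, format: 'pcm_s16le' });
}

// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16le(samples) {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  }
  return buffer;
}

// Hypothetical wiring: nextChunk() returns a Float32Array, or null when done.
function streamAudio(wsUrl, nextChunk) {
  const ws = new WebSocket(wsUrl);
  ws.onopen = () => {
    ws.send(buildStartMessage()); // JSON text frame first
    for (let chunk = nextChunk(); chunk; chunk = nextChunk()) {
      ws.send(floatToPcm16le(chunk)); // then binary PCM frames only
    }
    ws.close();
  };
  return ws;
}
```

A real capture pipeline would send frames as audio arrives rather than in a synchronous loop. Each binary frame produced this way has an even byte count, which avoids the decoder error listed in the table.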
POST /v1/sessions/{session_id}/stop¶
Stops ingest and finalizes the session from the completed live chunks.
Response 200¶
Errors¶
| Status | Cause |
|---|---|
| 404 | Session not found |
| 500 | Stop/finalization failed |
Example¶
GET /v1/sessions/{session_id}/transcript¶
Returns the current transcript snapshot.
Query parameters¶
| Name | Type | Required | Default | Notes |
|---|---|---|---|---|
| consistency | enum | No | PARTIAL | PARTIAL, STABLE, or FINAL, case-insensitive |
Response 200¶
Returns a TranscriptSnapshot.
{
"session_id": "18b9617a48fef8072bfc2c38",
"revision": 3,
"text": "hello patient",
"words": [
{
"start_ms": 0,
"end_ms": 500,
"text": "hello",
"speaker_id": "speaker_0",
"speaker_name": "Doctor",
"speaker_confidence": 0.91
}
],
"segments": [
{
"segment_id": "18b9617a48fef8072bfc2c38-3-1",
"session_id": "18b9617a48fef8072bfc2c38",
"window_id": "asr-window-1",
"revision": 3,
"provider": "elevenlabs",
"audio_start_ms": 0,
"audio_end_ms": 500,
"text": "hello",
"state": "FINAL",
"speaker_id": "speaker_0",
"speaker_name": "Doctor",
"speaker_confidence": 0.91
}
],
"finalized": true,
"consistency": "FINAL",
"updated_at": "2026-04-28T10:00:00Z"
}
Errors¶
| Status | Cause |
|---|---|
| 404 | Session not found |
| 500 | Snapshot query failed |
GET /v1/sessions/{session_id}/events¶
Subscribes to transcript updates through Server-Sent Events.
Events¶
| Event | Data |
|---|---|
| transcript | TranscriptSnapshot |
| ping | {} every 15 seconds while idle |
Example stream¶
event: transcript
data: {"session_id":"18b9617a48fef8072bfc2c38","revision":1,"text":"hello","words":[],"segments":[],"finalized":false,"consistency":"PARTIAL","updated_at":"2026-04-28T10:00:00Z"}
event: ping
data: {}
Browser example¶
const events = new EventSource('/v1/sessions/18b9617a48fef8072bfc2c38/events');
events.addEventListener('transcript', (event) => {
const snapshot = JSON.parse(event.data);
console.log(snapshot.text);
});
Errors¶
| Status | Cause |
|---|---|
| 404 | Session not found |
| 500 | Streaming unsupported or subscription failed |
GET /v1/sessions/{session_id}/recording¶
Downloads the full canonical session recording as WAV.
Response 200¶
| Header | Value |
|---|---|
| Content-Type | audio/wav |
| Content-Disposition | inline; filename="recording.wav" |
If the session has no audio yet, the endpoint returns a valid empty WAV.
Errors¶
| Status | Cause |
|---|---|
| 404 | Session not found |
| 500 | Recording could not be read |
GET /v1/sessions/{session_id}/inspect/events¶
Subscribes to internal inspect/debug events for a session.
The server first sends retained inspect history, then live inspect events.
Events¶
| Event | Data |
|---|---|
| inspect | InspectEvent |
| ping | {} every 15 seconds while idle |
Example¶
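Browsers can consume this stream with EventSource and an inspect listener, in the same way as the transcript stream. For clients that read the response body directly, the following is a minimal sketch of a frame parser. parseSseFrames is a hypothetical helper; it assumes LF line endings and JSON data payloads, as in the transcript stream example, and ignores SSE fields other than event and data.

```javascript
// Parse complete SSE frames from a text buffer. Returns the parsed events and
// the trailing partial frame, which should be prepended to the next chunk.
function parseSseFrames(buffer) {
  const frames = buffer.split('\n\n');
  const rest = frames.pop(); // the last piece may be an incomplete frame
  const events = [];
  for (const frame of frames) {
    let event = 'message'; // SSE default when no event field is present
    const dataLines = [];
    for (const line of frame.split('\n')) {
      if (line.startsWith('event:')) event = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trimStart());
    }
    if (dataLines.length > 0) {
      events.push({ event, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return { events, rest };
}
```

A caller would typically drop ping events and dispatch inspect events by their type field.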
Errors¶
| Status | Cause |
|---|---|
| 404 | Session not found |
| 500 | Streaming unsupported or inspect stream unavailable |
GET /v1/sessions/{session_id}/inspect/audio¶
Downloads a WAV slice from the canonical session spool.
Query parameters¶
| Name | Type | Required | Notes |
|---|---|---|---|
| start_sample | int64 | Yes | Inclusive 16 kHz sample index |
| end_sample | int64 | Yes | Exclusive 16 kHz sample index, must be greater than start_sample |
Response 200¶
| Header | Value |
|---|---|
| Content-Type | audio/wav |
| Cache-Control | no-store |
Errors¶
| Status | Cause |
|---|---|
| 400 | Missing, non-integer, invalid, or out-of-range sample window |
| 404 | Session not found |
Example¶
curl -o clip.wav \
'http://localhost:8080/v1/sessions/18b9617a48fef8072bfc2c38/inspect/audio?start_sample=0&end_sample=16000'
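Clients often want the clip for a single transcript word. The sketch below builds the request URL from a Word's millisecond offsets and applies the same window checks client-side before sending. inspectAudioUrl and wordClipUrl are hypothetical helpers, and rejecting a negative start index locally is an assumption about what the server treats as out of range.

```javascript
// Build the inspect/audio URL for the half-open window [startSample, endSample).
function inspectAudioUrl(baseUrl, sessionId, startSample, endSample) {
  if (!Number.isInteger(startSample) || !Number.isInteger(endSample)) {
    throw new TypeError('sample indexes must be integers');
  }
  if (startSample < 0 || endSample <= startSample) {
    throw new RangeError('end_sample must be greater than start_sample');
  }
  const url = new URL(`/v1/sessions/${encodeURIComponent(sessionId)}/inspect/audio`, baseUrl);
  url.searchParams.set('start_sample', String(startSample));
  url.searchParams.set('end_sample', String(endSample));
  return url.toString();
}

// Convert a Word's millisecond offsets to 16 kHz samples (16 per millisecond).
function wordClipUrl(baseUrl, sessionId, word) {
  return inspectAudioUrl(baseUrl, sessionId, word.start_ms * 16, word.end_ms * 16);
}
```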
GET /v1/sessions/{session_id}/asr-config¶
Returns the session ASR window config.
Response 200¶
Returns an ASRWindowConfig.
{
"pre_roll_ms": 700,
"post_roll_ms": 700,
"min_commit_ms": 4000,
"target_commit_ms": 10000,
"max_commit_ms": 15000,
"merge_gap_ms": 1800,
"min_speech_ms": 2500,
"min_isolated_ms": 400,
"commit_tolerance_ms": 200
}
PATCH /v1/sessions/{session_id}/asr-config¶
Partially updates the session ASR window config.
Unknown JSON fields are rejected.
Request body¶
Any subset of ASRWindowConfig:
The patch is applied to the current config, then the full updated config is validated.
Response 200¶
Returns the full updated ASRWindowConfig.
Validation¶
| Field | Minimum | Maximum |
|---|---|---|
| pre_roll_ms | 0 | 5000 |
| post_roll_ms | 0 | 5000 |
| min_commit_ms | 400 | 30000 |
| target_commit_ms | 400 | 60000 |
| max_commit_ms | 1000 | 120000 |
| merge_gap_ms | 0 | 10000 |
| min_speech_ms | 0 | 10000 |
| min_isolated_ms | 0 | 5000 |
| commit_tolerance_ms | 0 | 1000 |
Additional invariants:
| Rule |
|---|
| target_commit_ms >= min_commit_ms |
| max_commit_ms >= target_commit_ms |
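The patch-then-validate order matters: a patch that is in range on its own can still break an invariant against the current config. The sketch below mirrors the documented rules client-side. applyAsrPatch is a hypothetical helper, the error strings are illustrative, and treating the minimum and maximum as inclusive bounds is an assumption.

```javascript
const ASR_LIMITS = {
  pre_roll_ms: [0, 5000],
  post_roll_ms: [0, 5000],
  min_commit_ms: [400, 30000],
  target_commit_ms: [400, 60000],
  max_commit_ms: [1000, 120000],
  merge_gap_ms: [0, 10000],
  min_speech_ms: [0, 10000],
  min_isolated_ms: [0, 5000],
  commit_tolerance_ms: [0, 1000],
};

// Merge a patch into the current config, then validate the full result.
// Returns { config, errors }; errors is empty when the patch is acceptable.
function applyAsrPatch(current, patch) {
  const errors = [];
  for (const key of Object.keys(patch)) {
    if (!Object.hasOwn(ASR_LIMITS, key)) errors.push(`unknown field: ${key}`);
  }
  const config = { ...current, ...patch };
  for (const [key, [min, max]] of Object.entries(ASR_LIMITS)) {
    const value = config[key];
    if (!Number.isInteger(value) || value < min || value > max) {
      errors.push(`${key} must be an integer in [${min}, ${max}]`);
    }
  }
  if (config.target_commit_ms < config.min_commit_ms) {
    errors.push('target_commit_ms must be >= min_commit_ms');
  }
  if (config.max_commit_ms < config.target_commit_ms) {
    errors.push('max_commit_ms must be >= target_commit_ms');
  }
  return { config, errors };
}
```

For example, patching only target_commit_ms to 3000 against the default config is within range but fails the first invariant, because min_commit_ms is still 4000.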
Errors¶
| Status | Cause |
|---|---|
| 400 | Invalid JSON, unknown field, invalid range, or invalid invariant |
| 404 | Session not found |
| 405 | Method other than GET or PATCH |
Provider comparison routes¶
POST /v1/sessions/{session_id}/chunk-pass/elevenlabs¶
Reruns the session's persisted chunk windows through the configured ElevenLabs client and rebuilds the chunk transcript.
This is intended for diagnostics and comparison after a session has already finalized.
Response 200¶
Returns a final TranscriptSnapshot. The comparison field compares the rebuilt chunk transcript against the current final transcript.
Errors¶
| Status | Cause |
|---|---|
| 400 | No canonical audio timeline |
| 404 | Session not found |
| 409 | No finalized chunk transcript or no persisted chunk segments |
| 500 | Transcription or persistence failed |
| 503 | ElevenLabs client is not configured |
POST /v1/sessions/{session_id}/full-pass/elevenlabs¶
Runs the full canonical recording through ElevenLabs and stores a final snapshot from that result.
This route requires an existing finalized chunk transcript so the service can build a comparison.
Response 200¶
Returns a final TranscriptSnapshot. The comparison field compares the live chunk transcript with the full-pass transcript.
Errors¶
| Status | Cause |
|---|---|
| 400 | No canonical audio timeline |
| 404 | Session not found |
| 409 | No finalized chunk transcript |
| 500 | Transcription failed |
| 503 | ElevenLabs client is not configured |
Speakers¶
GET /v1/speakers¶
Lists saved speaker profiles.
Response 200¶
Returns an array of SpeakerProfile.
[
{
"id": "speaker-123",
"name": "Doctor",
"created_at": "2026-04-28T10:00:00Z",
"updated_at": "2026-04-28T10:00:00Z",
"samples": [
{
"session_id": "18b9617a48fef8072bfc2c38",
"start_ms": 1000,
"end_ms": 5000,
"created_at": "2026-04-28T10:00:00Z",
"embedding": [0.01, -0.02]
}
],
"centroid": [0.01, -0.02]
}
]
Errors¶
| Status | Cause |
|---|---|
| 405 | Method other than GET or POST |
| 500 | Speaker service is unavailable or profile store read failed |
POST /v1/speakers¶
Enrolls or updates a speaker profile using a clip from a finalized session.
Request body¶
| Field | Type | Required | Notes |
|---|---|---|---|
| name | string | Yes | Profile display name |
| session_id | string | Yes | Source session |
| start_ms | int64 | Yes | Start of enrollment clip |
| end_ms | int64 | Yes | End of enrollment clip |
The source session must have a finalized transcript. The clip must be at least SPEAKER_MIN_ENROLLMENT_MS, which defaults to 3000.
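A client can check the clip length before sending the request and avoid a 400 response. The following is a minimal sketch; buildEnrollmentRequest is a hypothetical helper, and the minimum it uses is the documented default, which a deployment may override through SPEAKER_MIN_ENROLLMENT_MS.

```javascript
const DEFAULT_MIN_ENROLLMENT_MS = 3000; // documented default of SPEAKER_MIN_ENROLLMENT_MS

// Validate the clip window and return the JSON body for POST /v1/speakers.
function buildEnrollmentRequest(name, sessionId, startMs, endMs, minMs = DEFAULT_MIN_ENROLLMENT_MS) {
  if (!name) throw new Error('name is required');
  if (!sessionId) throw new Error('session_id is required');
  if (endMs <= startMs) throw new RangeError('end_ms must be greater than start_ms');
  if (endMs - startMs < minMs) throw new RangeError(`clip must be at least ${minMs} ms`);
  return { name, session_id: sessionId, start_ms: startMs, end_ms: endMs };
}
```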
Response 201¶
Returns the saved SpeakerProfile.
Errors¶
| Status | Cause |
|---|---|
| 400 | Invalid JSON, invalid clip window, or clip too short |
| 404 | Source session not found |
| 409 | Source session does not have a finalized transcript |
| 500 | Speaker service, embedding sidecar, or store failed |
POST /v1/sessions/{session_id}/speaker-identification¶
Reruns speaker analysis for a finalized session and updates the transcript snapshot with speaker labels.
Response 200¶
Returns a TranscriptSnapshot with speaker fields populated where matches were found.
Errors¶
| Status | Cause |
|---|---|
| 404 | Session not found |
| 409 | Session does not have a finalized transcript |
| 500 | Speaker service or embedding sidecar failed |
Logs¶
GET /v1/logs/events¶
Subscribes to service logs through Server-Sent Events.
The server first sends retained log history, then live log entries.
Query parameters¶
| Name | Type | Required | Notes |
|---|---|---|---|
| session_id | string | No | If present, only log entries whose attrs.session_id equals this value are sent |
Events¶
| Event | Data |
|---|---|
| log | LogEntry |
| ping | {} every 15 seconds while idle |
Example¶
Errors¶
| Status | Cause |
|---|---|
| 500 | Streaming unsupported |
| 503 | Log stream is unavailable |
Direct ElevenLabs transcription¶
POST /v1/transcriptions/elevenlabs¶
Uploads a complete audio file directly to ElevenLabs with diarization enabled. This bypasses the live PCM session flow and returns a final transcript snapshot.
This endpoint is useful for comparing the in-house live pipeline against a whole-file ElevenLabs transcription.
Request¶
multipart/form-data
| Field | Type | Required | Notes |
|---|---|---|---|
| file | file | Yes | Audio file to transcribe |
| language_hint | string | No | Passed through to the provider |
| num_speakers | integer | No | Optional expected speaker count; the accepted range is effectively 0 to 32. When set above 0, it cannot be combined with diarization_threshold |
| diarization_threshold | float | No | Optional diarization threshold; cannot be combined with num_speakers > 0 |
The service parses multipart bodies with a 32 MiB in-memory threshold. Larger multipart file parts may spill to temporary files per Go's standard library behavior.
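The mutual exclusion between num_speakers and diarization_threshold is easy to get wrong when building the form programmatically. The sketch below checks it before the upload. transcriptionFields and buildForm are hypothetical helpers, and the valid range of the threshold itself is not specified here, so it is passed through unchecked.

```javascript
// Validate the optional fields and return them as multipart form values.
function transcriptionFields({ languageHint, numSpeakers, diarizationThreshold } = {}) {
  const fields = {};
  if (languageHint) fields.language_hint = languageHint;
  if (numSpeakers !== undefined) {
    if (!Number.isInteger(numSpeakers) || numSpeakers < 0 || numSpeakers > 32) {
      throw new RangeError('num_speakers must be an integer from 0 to 32');
    }
    fields.num_speakers = String(numSpeakers);
  }
  if (diarizationThreshold !== undefined) {
    if (numSpeakers > 0) {
      throw new Error('diarization_threshold cannot be combined with num_speakers > 0');
    }
    fields.diarization_threshold = String(diarizationThreshold);
  }
  return fields;
}

// Hypothetical usage: file is a File or Blob holding the audio.
function buildForm(file, options) {
  const form = new FormData();
  form.append('file', file);
  for (const [key, value] of Object.entries(transcriptionFields(options))) {
    form.append(key, value);
  }
  return form;
}
```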
Response 200¶
{
"provider": "elevenlabs",
"language_code": "en",
"language_probability": 0.99,
"input": {
"filename": "sample.wav",
"mime_type": "audio/wav",
"duration_ms": 3000
},
"snapshot": {
"session_id": "elevenlabs-batch",
"revision": 1,
"text": "doctor hello patient yes",
"words": [
{
"start_ms": 0,
"end_ms": 500,
"text": "doctor",
"speaker_id": "speaker_0"
}
],
"segments": [],
"finalized": true,
"consistency": "FINAL",
"updated_at": "2026-04-28T10:00:00Z"
}
}
| Field | Type | Notes |
|---|---|---|
| provider | string | Provider name, currently elevenlabs |
| language_code | string | Provider-detected language code, omitted when absent |
| language_probability | number | Provider confidence, omitted when zero |
| input | object | Uploaded file metadata and duration inferred from word end times |
| snapshot | object | Final transcript snapshot with words and segments |
Errors¶
| Status | Cause |
|---|---|
| 400 | Invalid content type, invalid multipart body, missing file, invalid number, invalid threshold, or invalid speaker/threshold combination |
| 405 | Method other than POST |
| 502 | ElevenLabs upstream request failed |
| 503 | ElevenLabs transcription is unavailable |
Example¶
curl -X POST http://localhost:8080/v1/transcriptions/elevenlabs \
-F file=@sample.wav \
-F language_hint=en \
-F num_speakers=2
Integration live routes¶
The /api/v1/transcribe-live/* routes expose only the subset needed by an upstream application embedding live transcription.
Every route requires a valid live token.
GET /api/v1/transcribe-live/sessions/{session_id}/audio/ws¶
Protected version of GET /v1/sessions/{session_id}/audio/ws.
The token may be passed as a query parameter, because browsers cannot set arbitrary headers through the native WebSocket constructor:
GET /api/v1/transcribe-live/sessions/{session_id}/events¶
Protected version of GET /v1/sessions/{session_id}/events.
POST /api/v1/transcribe-live/sessions/{session_id}/stop¶
Protected version of POST /v1/sessions/{session_id}/stop.
Integration route errors¶
| Status | Cause |
|---|---|
| 401 | Missing token, invalid token, expired token, or claim mismatch |
| 404 | Session or route not found |
Schemas¶
ASRWindowConfig¶
{
"pre_roll_ms": 700,
"post_roll_ms": 700,
"min_commit_ms": 4000,
"target_commit_ms": 10000,
"max_commit_ms": 15000,
"merge_gap_ms": 1800,
"min_speech_ms": 2500,
"min_isolated_ms": 400,
"commit_tolerance_ms": 200
}
| Field | Type | Default | Description |
|---|---|---|---|
| pre_roll_ms | int64 | 700 | Audio included before the committed speech span |
| post_roll_ms | int64 | 700 | Audio included after the committed speech span |
| min_commit_ms | int64 | 4000 | Minimum span before a gap can trigger an ASR window |
| target_commit_ms | int64 | 10000 | Target span that triggers a window |
| max_commit_ms | int64 | 15000 | Hard maximum span that forces a window |
| merge_gap_ms | int64 | 1800 | Maximum gap between detected speech segments before the planner separates windows |
| min_speech_ms | int64 | 2500 | Minimum total speech required before gap-based commit |
| min_isolated_ms | int64 | 400 | Minimum isolated speech span to emit any ASR window |
| commit_tolerance_ms | int64 | 200 | Boundary tolerance for committed spans |
TranscriptSnapshot¶
{
"session_id": "18b9617a48fef8072bfc2c38",
"revision": 4,
"text": "hello patient",
"words": [],
"segments": [],
"finalized": false,
"consistency": "PARTIAL",
"comparison": {
"chunk_text": "hello patient",
"final_pass_text": "hello patient",
"chunk_word_count": 2,
"final_pass_word_count": 2,
"similarity": 1,
"target": 0.99,
"meets_target": true
},
"updated_at": "2026-04-28T10:00:00Z"
}
| Field | Type | Description |
|---|---|---|
| session_id | string | Session ID |
| revision | int64 | Monotonic snapshot revision within a session |
| text | string | Visible transcript text for the requested consistency |
| words | array of Word | Visible words |
| segments | array of TranscriptSegment | Visible segments grouped by state, speaker, and timing |
| finalized | boolean | True after stop/final pass has sealed the transcript |
| consistency | enum | PARTIAL, STABLE, or FINAL |
| comparison | Comparison | Optional comparison block after final/chunk pass diagnostics |
| updated_at | timestamp | Snapshot creation/update time |
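Because revision is monotonic within a session, a client that mixes the SSE stream with polled snapshots can use it to discard stale data. This is a minimal sketch of that idea; latestSnapshot is a hypothetical helper, and it assumes both snapshots were requested at the same consistency level.

```javascript
// Keep the newest snapshot; ignore any incoming snapshot for the same session
// whose revision is not newer than the one already held.
function latestSnapshot(current, incoming) {
  if (
    current &&
    current.session_id === incoming.session_id &&
    incoming.revision <= current.revision
  ) {
    return current;
  }
  return incoming;
}
```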
Word¶
{
"start_ms": 0,
"end_ms": 500,
"text": "hello",
"speaker_id": "speaker_0",
"speaker_name": "Doctor",
"speaker_confidence": 0.91
}
| Field | Type | Required | Description |
|---|---|---|---|
| start_ms | int64 | Yes | Word start offset |
| end_ms | int64 | Yes | Word end offset |
| text | string | Yes | Word/token text |
| speaker_id | string | No | Provider or local speaker ID |
| speaker_name | string | No | Matched saved speaker name or anonymous label |
| speaker_confidence | number | No | Speaker match confidence |
TranscriptSegment¶
{
"segment_id": "18b9617a48fef8072bfc2c38-4-1",
"session_id": "18b9617a48fef8072bfc2c38",
"window_id": "asr-window-1",
"revision": 4,
"provider": "elevenlabs",
"audio_start_ms": 0,
"audio_end_ms": 1200,
"text": "hello patient",
"state": "STABLE",
"speaker_id": "speaker_0",
"speaker_name": "Doctor",
"speaker_confidence": 0.91
}
| Field | Type | Required | Description |
|---|---|---|---|
| segment_id | string | Yes | Generated ID from session, revision, and segment ordinal |
| session_id | string | Yes | Session ID |
| window_id | string | Yes | ASR window or synthetic final-pass ID |
| revision | int64 | Yes | Snapshot revision |
| provider | string | Yes | Provider that produced the words |
| audio_start_ms | int64 | Yes | Segment start |
| audio_end_ms | int64 | Yes | Segment end |
| text | string | Yes | Segment text |
| state | enum | Yes | PARTIAL, STABLE, or FINAL |
| speaker_id | string | No | Provider or local speaker ID |
| speaker_name | string | No | Speaker display name |
| speaker_confidence | number | No | Speaker match confidence |
Comparison¶
{
"chunk_text": "hello patient",
"final_pass_text": "hello patient",
"chunk_word_count": 2,
"final_pass_word_count": 2,
"similarity": 1,
"target": 0.99,
"meets_target": true
}
| Field | Type | Description |
|---|---|---|
| chunk_text | string | Text from live chunk pipeline |
| final_pass_text | string | Text from final/full pass |
| chunk_word_count | integer | Normalized word count for chunk text |
| final_pass_word_count | integer | Normalized word count for final text |
| similarity | number | Word-level Levenshtein similarity from 0 to 1 |
| target | number | Target similarity, currently 0.99 by default |
| meets_target | boolean | True when similarity >= target |
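To make the similarity field concrete, here is one common way to compute a word-level Levenshtein similarity. It is a sketch under stated assumptions, not the service's implementation: the normalization (lowercasing and stripping punctuation) and the formula 1 - distance / max(word counts) are both assumptions, and compareTranscripts is a hypothetical name.

```javascript
// Assumed normalization: lowercase, strip punctuation, split on whitespace.
function normalizeWords(text) {
  return text.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, '').split(/\s+/).filter(Boolean);
}

// Dynamic-programming edit distance over word arrays, keeping one row at a time.
function wordLevenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Build an object shaped like the Comparison schema.
function compareTranscripts(chunkText, finalPassText, target = 0.99) {
  const chunkWords = normalizeWords(chunkText);
  const finalWords = normalizeWords(finalPassText);
  const longest = Math.max(chunkWords.length, finalWords.length);
  const similarity = longest === 0 ? 1 : 1 - wordLevenshtein(chunkWords, finalWords) / longest;
  return {
    chunk_text: chunkText,
    final_pass_text: finalPassText,
    chunk_word_count: chunkWords.length,
    final_pass_word_count: finalWords.length,
    similarity,
    target,
    meets_target: similarity >= target,
  };
}
```

Under these assumptions, one substituted word in a four-word transcript yields a similarity of 0.75, well below the default target.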
InspectEvent¶
{
"seq": 12,
"time": "2026-04-28T10:00:00Z",
"session_id": "18b9617a48fef8072bfc2c38",
"type": "chunk_pass_window_completed",
"lane": "live",
"window_id": "asr-window-1",
"state": "completed",
"reason": "asr_target",
"start_sample": 0,
"end_sample": 160000,
"start_ms": 0,
"end_ms": 10000,
"bytes": 320000,
"text_len": 42,
"word_count": 8,
"preview": "hello patient",
"note": "windows=1 similarity=1.00"
}
| Field | Type | Description |
|---|---|---|
| seq | int64 | Monotonic inspect event sequence |
| time | timestamp | Event timestamp |
| session_id | string | Session ID |
| type | string | Event type, such as session_started, pcm_ingest_started, full_pass_completed |
| lane | string | Optional lane, commonly live or final |
| window_id | string | Optional ASR/final window ID |
| state | string | Optional lifecycle state |
| reason | string | Optional planner or boundary reason |
| start_sample | int64 | Optional start sample |
| end_sample | int64 | Optional end sample |
| start_ms | int64 | Optional start time |
| end_ms | int64 | Optional end time |
| bytes | int64 | Optional byte count |
| text_len | integer | Optional transcript text length |
| word_count | integer | Optional word count |
| preview | string | Optional text preview |
| note | string | Optional human-readable note |
LogEntry¶
{
"time": "2026-04-28T10:00:00Z",
"level": "INFO",
"message": "session initialized",
"attrs": {
"session_id": "18b9617a48fef8072bfc2c38"
}
}
| Field | Type | Description |
|---|---|---|
| time | timestamp | Log record timestamp |
| level | string | Slog level |
| message | string | Log message |
| attrs | object | Optional structured attributes |
SpeakerProfile¶
{
"id": "speaker-123",
"name": "Doctor",
"created_at": "2026-04-28T10:00:00Z",
"updated_at": "2026-04-28T10:00:00Z",
"samples": [],
"centroid": [0.01, -0.02]
}
| Field | Type | Description |
|---|---|---|
| id | string | Profile ID |
| name | string | Display name |
| created_at | timestamp | Profile creation time |
| updated_at | timestamp | Last update time |
| samples | array of SpeakerSample | Enrollment samples |
| centroid | number array | Current embedding centroid |
SpeakerSample¶
{
"session_id": "18b9617a48fef8072bfc2c38",
"start_ms": 1000,
"end_ms": 5000,
"created_at": "2026-04-28T10:00:00Z",
"embedding": [0.01, -0.02]
}
| Field | Type | Description |
|---|---|---|
| session_id | string | Source session ID |
| start_ms | int64 | Enrollment clip start |
| end_ms | int64 | Enrollment clip end |
| created_at | timestamp | Sample creation time |
| embedding | number array | Raw speaker embedding |
End-to-end live flow¶
1. Create a session.
2. Open /v1/sessions/{session_id}/events to receive live transcript snapshots.
3. Open /v1/sessions/{session_id}/audio/ws.
4. Send the required JSON start frame.
5. Stream binary PCM16 little-endian frames.
6. Close the WebSocket when capture ends.
7. Call POST /v1/sessions/{session_id}/stop.
8. Fetch GET /v1/sessions/{session_id}/transcript?consistency=FINAL.
9. Optionally call POST /v1/sessions/{session_id}/speaker-identification.
10. Optionally download GET /v1/sessions/{session_id}/recording.