API Reference

This document describes the HTTP, WebSocket, and Server-Sent Events API exposed by doctorita-transcribe.

The service is intentionally stateful: sessions are created in memory, audio is streamed into the owning process, and session routes return 404 after a server restart unless the caller creates a new session.

Stateful sessions

Session IDs are live-process handles. The canonical audio spool is written under DATA_DIR, but session route lookup is in memory, so clients should create a new session after a server restart.

Conventions

Base URL

When running locally with the default configuration:

http://localhost:8080

All public API paths below are relative to that base URL.

Content types

| Direction | Content type |
| --- | --- |
| JSON requests | application/json |
| JSON responses | application/json |
| Audio uploads | multipart/form-data |
| Downloaded recordings and clips | audio/wav |
| Live transcript, inspect, and log streams | text/event-stream |
| Live audio ingest | WebSocket text control frame followed by binary PCM frames |

Time and audio units

| Unit | Meaning |
| --- | --- |
| *_ms | Milliseconds from the beginning of the session recording |
| *_sample | 16 kHz mono PCM sample index; ranges are half-open: [start_sample, end_sample) |
| updated_at, created_at, time | JSON timestamp emitted by Go's time.Time encoder: RFC 3339, with fractional seconds (up to nanosecond precision) only when they are non-zero |

The audio ingest contract is always:

16 kHz, mono, signed 16-bit little-endian PCM
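
As a quick illustration of how these units relate, the sketch below converts between millisecond offsets, sample indices, and byte lengths. The helper names are hypothetical, and rounding down when a sample index does not fall on a millisecond boundary is an assumption; the service may round differently.

```python
SAMPLE_RATE_HZ = 16_000  # fixed by the ingest contract
BYTES_PER_SAMPLE = 2     # signed 16-bit, mono


def ms_to_sample(ms: int) -> int:
    """Convert a millisecond offset to a sample index (16 samples per ms)."""
    return ms * SAMPLE_RATE_HZ // 1000


def sample_to_ms(sample: int) -> int:
    """Convert a sample index to milliseconds, rounding down (assumption)."""
    return sample * 1000 // SAMPLE_RATE_HZ


def window_byte_length(start_sample: int, end_sample: int) -> int:
    """Byte length of the half-open window [start_sample, end_sample)."""
    if end_sample <= start_sample:
        raise ValueError("end_sample must be greater than start_sample")
    return (end_sample - start_sample) * BYTES_PER_SAMPLE
```

For example, a 10-second window covers samples [0, 160000) and occupies 320000 bytes of PCM.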

Error envelope

Errors are JSON objects:

{
  "error": "message"
}

Common statuses:

Status Meaning
400 Invalid request body, query, form data, ASR config, or audio slice
401 Invalid or missing integration live token
404 Session or route not found
405 Method not allowed
409 Operation requires a finalized transcript or existing chunk transcript
502 Upstream transcription provider failed for direct batch transcription
503 Optional service is unavailable, such as ElevenLabs or log streaming
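
A client can decode this envelope defensively, since a proxy or gateway in front of the service might return a non-JSON body. The following is a minimal sketch; the function name and the fallback text are hypothetical and not part of the service.

```python
import json


def error_message(status: int, body: bytes) -> str:
    """Return a readable message for an error response.

    Falls back to a generic message when the body is not the documented
    {"error": "message"} envelope.
    """
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        payload = None
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        return f"{status}: {payload['error']}"
    return f"{status}: unexpected error body"
```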

Transcript consistency

Transcript snapshots can be requested at different consistency levels:

Value Included words
PARTIAL Partial, stable, and final words
STABLE Stable and final words
FINAL Final words only

The consistency query parameter is case-insensitive. Missing or unknown values default to PARTIAL.
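
The defaulting rule and the inclusion table can be mirrored on the client, for example to filter cached segments by their state field. This is a sketch with hypothetical helper names; the server performs its own filtering, and this code does not describe its implementation.

```python
# States included at each consistency level, per the table above.
INCLUDED_STATES = {
    "PARTIAL": {"PARTIAL", "STABLE", "FINAL"},
    "STABLE": {"STABLE", "FINAL"},
    "FINAL": {"FINAL"},
}


def normalize_consistency(raw):
    """Case-insensitive lookup; missing or unknown values become PARTIAL."""
    value = (raw or "").upper()
    return value if value in INCLUDED_STATES else "PARTIAL"


def visible_segments(segments, consistency):
    """Keep only segments whose state is visible at the given level."""
    allowed = INCLUDED_STATES[normalize_consistency(consistency)]
    return [segment for segment in segments if segment.get("state") in allowed]
```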

Authentication

The public /v1/* routes do not currently enforce an application auth layer.

The integration-prefixed live routes under /api/v1/transcribe-live/* require a live token. The token can be supplied either as:

Authorization: Bearer <token>

or:

?token=<token>

The token format is:

base64url(payload).base64url(hmac_sha256(payload, DOCTORITA_TRANSCRIBE_LIVE_TOKEN_SECRET))

Payload schema:

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "transcript_id": "optional-upstream-transcript-id",
  "user_id": "optional-upstream-user-id",
  "exp": 1777392000
}

Validation rules:

Claim Rule
session_id Must match the route session ID
transcript_id If the session metadata has transcript_id, the token claim must match
user_id If the session metadata has user_id, the token claim must match
exp Optional Unix timestamp. If present and in the past, the token is rejected
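
To make the token format concrete, here is a sketch of minting and checking a token. Several details are assumptions not confirmed above: that base64url is unpadded, that the HMAC is computed over the raw JSON payload bytes, that the payload is compact JSON, and that a token is rejected only when exp is strictly in the past. The function names are hypothetical. The transcript_id and user_id checks are omitted because they depend on session metadata.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # Assumption: unpadded base64url.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore any stripped padding before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_live_token(claims: dict, secret: bytes) -> str:
    """Build base64url(payload).base64url(hmac_sha256(payload, secret))."""
    payload = json.dumps(claims, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64url(payload)}.{_b64url(signature)}"


def verify_live_token(token: str, secret: bytes, route_session_id: str, now=None) -> dict:
    """Return the claims, or raise ValueError if the token is not acceptable."""
    payload_part, sep, signature_part = token.partition(".")
    if not sep:
        raise ValueError("malformed token")
    payload = _b64url_decode(payload_part)
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_part)):
        raise ValueError("bad signature")
    claims = json.loads(payload)
    if claims.get("session_id") != route_session_id:
        raise ValueError("session_id does not match the route")
    exp = claims.get("exp")
    if exp is not None and exp < (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

The secret passed in would be the value of DOCTORITA_TRANSCRIBE_LIVE_TOKEN_SECRET, encoded as bytes.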

Endpoint index

Method Path Purpose
GET /healthz Service health and active transcription provider
GET /v1/logs/events Service log SSE stream
POST /v1/sessions Create a live session
GET /v1/sessions/{session_id}/audio/ws Stream PCM audio over WebSocket
POST /v1/sessions/{session_id}/stop Stop audio ingest and seal the transcript
GET /v1/sessions/{session_id}/transcript Fetch a transcript snapshot
GET /v1/sessions/{session_id}/events Transcript SSE stream
GET /v1/sessions/{session_id}/recording Download session recording WAV
GET /v1/sessions/{session_id}/inspect/events Inspect/debug SSE stream
GET /v1/sessions/{session_id}/inspect/audio Download a WAV clip by sample window
GET /v1/sessions/{session_id}/asr-config Read ASR window config
PATCH /v1/sessions/{session_id}/asr-config Patch ASR window config
POST /v1/sessions/{session_id}/speaker-identification Rerun speaker identification
POST /v1/sessions/{session_id}/chunk-pass/elevenlabs Rerun live chunk pass through ElevenLabs
POST /v1/sessions/{session_id}/full-pass/elevenlabs Run full-recording ElevenLabs pass
GET /v1/speakers List saved speaker profiles
POST /v1/speakers Enroll a speaker from a finalized session clip
POST /v1/transcriptions/elevenlabs Direct full-audio ElevenLabs transcription comparison
GET /api/v1/transcribe-live/sessions/{session_id}/audio/ws Protected integration live audio WebSocket
GET /api/v1/transcribe-live/sessions/{session_id}/events Protected integration transcript SSE
POST /api/v1/transcribe-live/sessions/{session_id}/stop Protected integration stop route

Health

GET /healthz

Returns basic service health and the configured transcription provider.

Response 200

{
  "status": "ok",
  "provider": "elevenlabs"
}

Field Type Notes
status string Currently ok when the handler is reachable
provider string Active provider name, such as elevenlabs or openai; an empty string if the manager is absent

Example

curl http://localhost:8080/healthz

Sessions

POST /v1/sessions

Creates a new in-memory live session and a session directory under DATA_DIR.

Request body

The body is JSON. It may be omitted, but clients that have no metadata should send {}.

{
  "language_hint": "en",
  "glossary": ["Doctorita", "ECAPA"],
  "transcript_id": "upstream-transcript-123",
  "user_id": "user-123",
  "asr_window_config": {
    "pre_roll_ms": 700,
    "post_roll_ms": 700,
    "min_commit_ms": 4000,
    "target_commit_ms": 10000,
    "max_commit_ms": 15000,
    "merge_gap_ms": 1800,
    "min_speech_ms": 2500,
    "min_isolated_ms": 400,
    "commit_tolerance_ms": 200
  }
}

Field Type Required Notes
language_hint string No Passed to the transcription provider when supported
glossary string array No Terms for client/backend metadata and prompt context
transcript_id string No Upstream transcript ID used by backend callbacks and live-token validation
user_id string No Upstream user ID used by backend callbacks and live-token validation
asr_window_config object No Overrides the default ASR chunk window config for this session

Response 201

{
  "session_id": "18b9617a48fef8072bfc2c38"
}

Errors

Status Cause
400 Invalid JSON or invalid ASR window config
405 Any method other than POST
500 Session directory, VAD detector, or session actor creation failed

Example

curl -X POST http://localhost:8080/v1/sessions \
  -H 'Content-Type: application/json' \
  -d '{"language_hint":"en","glossary":["Doctorita"]}'

GET /v1/sessions/{session_id}/audio/ws

Opens the live PCM ingest WebSocket for a session.

The first frame must be a JSON text frame:

{
  "type": "start",
  "sample_rate": 16000,
  "channels": 1,
  "format": "pcm_s16le"
}

After the start message, every frame must be binary PCM:

signed int16 little-endian samples, 16 kHz, mono
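
The sketch below shows one way a client might prepare the two kinds of frames: the JSON start message and binary PCM chunks. The helper names and the 100 ms chunk duration are hypothetical; this reference does not specify a required frame size.

```python
import json
import struct

# Text frame sent first, with the exact values required by the ingest contract.
START_FRAME = json.dumps(
    {"type": "start", "sample_rate": 16000, "channels": 1, "format": "pcm_s16le"}
)


def encode_pcm16le(samples):
    """Pack a sequence of ints in [-32768, 32767] as little-endian int16 bytes."""
    return struct.pack(f"<{len(samples)}h", *samples)


def split_frames(pcm: bytes, frame_ms: int = 100):
    """Yield binary frames of frame_ms each (assumed chunk size).

    At 16 kHz mono 16-bit, one millisecond is 32 bytes, so every full
    frame has an even byte count.
    """
    frame_bytes = 32 * frame_ms
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]
```

A WebSocket client would send START_FRAME as a text message, then each chunk from split_frames as a binary message.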

WebSocket validation

| Rule | Failure close reason |
| --- | --- |
| Missing first frame | missing start message |
| First frame is not text | first message must be JSON text |
| First frame is invalid JSON | invalid start message |
| type is not start | first audio websocket message must be type=start |
| sample_rate is not 16000 | sample_rate must be 16000 |
| channels is not 1 | channels must be 1 |
| format is not pcm_s16le | format must be pcm_s16le |
| Session already has an audio stream | session already has an audio stream |
| Later frame is not binary | audio frames must be binary PCM16 |
| Binary frame has invalid PCM16 length | Decoder error, for example an odd byte count |
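
A client can run the same checks before connecting to catch mistakes early. The sketch below mirrors the per-frame rules; the function names are hypothetical, the order of checks is assumed to follow the table, and the session-level rule (an existing audio stream) is left out because it needs server state.

```python
import json


def start_frame_close_reason(frame, is_text=True):
    """Return the documented close reason for a bad first frame, else None."""
    if frame is None:
        return "missing start message"
    if not is_text:
        return "first message must be JSON text"
    try:
        message = json.loads(frame)
    except ValueError:
        return "invalid start message"
    if not isinstance(message, dict):
        return "invalid start message"  # assumption: non-object JSON is invalid
    if message.get("type") != "start":
        return "first audio websocket message must be type=start"
    if message.get("sample_rate") != 16000:
        return "sample_rate must be 16000"
    if message.get("channels") != 1:
        return "channels must be 1"
    if message.get("format") != "pcm_s16le":
        return "format must be pcm_s16le"
    return None


def pcm_frame_length_ok(frame: bytes) -> bool:
    """PCM16 frames must hold whole 2-byte samples."""
    return len(frame) % 2 == 0
```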

Allowed browser origins are controlled by DOCTORITA_TRANSCRIBE_ALLOWED_ORIGINS, which defaults to:

localhost:* 127.0.0.1:*

State side effects

| Event | Effect |
| --- | --- |
| WebSocket accepted and valid start frame received | The session begins accepting PCM |
| Binary PCM frames received | Frames are appended to the in-memory ring and canonical spool |
| WebSocket closes | The session stops accepting PCM, but it is not finalized until POST /v1/sessions/{session_id}/stop is called |

POST /v1/sessions/{session_id}/stop

Stops ingest and finalizes the session from the completed live chunks.

Response 200

{
  "status": "stopped"
}

Errors

Status Cause
404 Session not found
500 Stop/finalization failed

Example

curl -X POST http://localhost:8080/v1/sessions/18b9617a48fef8072bfc2c38/stop

GET /v1/sessions/{session_id}/transcript

Returns the current transcript snapshot.

Query parameters

Name Type Required Default Notes
consistency enum No PARTIAL PARTIAL, STABLE, or FINAL, case-insensitive

Response 200

Returns a TranscriptSnapshot.

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "revision": 3,
  "text": "hello patient",
  "words": [
    {
      "start_ms": 0,
      "end_ms": 500,
      "text": "hello",
      "speaker_id": "speaker_0",
      "speaker_name": "Doctor",
      "speaker_confidence": 0.91
    }
  ],
  "segments": [
    {
      "segment_id": "18b9617a48fef8072bfc2c38-3-1",
      "session_id": "18b9617a48fef8072bfc2c38",
      "window_id": "asr-window-1",
      "revision": 3,
      "provider": "elevenlabs",
      "audio_start_ms": 0,
      "audio_end_ms": 500,
      "text": "hello",
      "state": "FINAL",
      "speaker_id": "speaker_0",
      "speaker_name": "Doctor",
      "speaker_confidence": 0.91
    }
  ],
  "finalized": true,
  "consistency": "FINAL",
  "updated_at": "2026-04-28T10:00:00Z"
}

Errors

Status Cause
404 Session not found
500 Snapshot query failed

GET /v1/sessions/{session_id}/events

Subscribes to transcript updates through Server-Sent Events.

Events

Event Data
transcript TranscriptSnapshot
ping {} every 15 seconds while idle

Example stream

event: transcript
data: {"session_id":"18b9617a48fef8072bfc2c38","revision":1,"text":"hello","words":[],"segments":[],"finalized":false,"consistency":"PARTIAL","updated_at":"2026-04-28T10:00:00Z"}

event: ping
data: {}

Browser example

const events = new EventSource('/v1/sessions/18b9617a48fef8072bfc2c38/events');

events.addEventListener('transcript', (event) => {
  const snapshot = JSON.parse(event.data);
  console.log(snapshot.text);
});
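
Outside the browser, the stream can be consumed by parsing the event: and data: lines directly. The following is a minimal sketch of such a parser over already-decoded lines; the function name is hypothetical, and it handles only the fields this stream uses (no id: or retry: handling).

```python
import json


def parse_sse(lines):
    """Yield (event_name, decoded_json) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if any.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
```

The lines could come from any HTTP client that exposes the response body line by line. Clients would typically ignore ping events and act on transcript events only.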

Errors

Status Cause
404 Session not found
500 Streaming unsupported or subscription failed

GET /v1/sessions/{session_id}/recording

Downloads the full canonical session recording as WAV.

Response 200

Header Value
Content-Type audio/wav
Content-Disposition inline; filename="recording.wav"

If the session has no audio yet, the endpoint returns a valid empty WAV.
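
A downloaded recording can be checked with Python's standard wave module. The sketch below assumes the WAV carries the same 16 kHz mono 16-bit format as the ingest contract, which is not stated explicitly here. The make_wav helper is hypothetical and exists only to fabricate sample data, including an empty but valid file.

```python
import io
import wave


def wav_info(data: bytes) -> dict:
    """Read basic format fields and duration from WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width": wav.getsampwidth(),
            "frames": frames,
            "duration_ms": frames * 1000 // rate,
        }


def make_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM16 mono 16 kHz bytes in a WAV container (test helper)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

An empty recording parses successfully and reports zero frames, so clients should check the frame count rather than treat a 200 response as proof that audio exists.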

Errors

Status Cause
404 Session not found
500 Recording could not be read

GET /v1/sessions/{session_id}/inspect/events

Subscribes to internal inspect/debug events for a session.

The server first sends retained inspect history, then live inspect events.

Events

Event Data
inspect InspectEvent
ping {} every 15 seconds while idle

Example

curl -N http://localhost:8080/v1/sessions/18b9617a48fef8072bfc2c38/inspect/events

Errors

Status Cause
404 Session not found
500 Streaming unsupported or inspect stream unavailable

GET /v1/sessions/{session_id}/inspect/audio

Downloads a WAV slice from the canonical session spool.

Query parameters

Name Type Required Notes
start_sample int64 Yes Inclusive 16 kHz sample index
end_sample int64 Yes Exclusive 16 kHz sample index, must be greater than start_sample

Response 200

Header Value
Content-Type audio/wav
Cache-Control no-store

Errors

Status Cause
400 Missing, non-integer, invalid, or out-of-range sample window
404 Session not found

Example

curl -o clip.wav \
  'http://localhost:8080/v1/sessions/18b9617a48fef8072bfc2c38/inspect/audio?start_sample=0&end_sample=16000'
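
A common use is fetching the audio behind a transcript word or segment, whose bounds are given in milliseconds. This sketch builds the request URL from a millisecond window; the function name is hypothetical, and the conversion uses the fixed rate of 16 samples per millisecond.

```python
from urllib.parse import quote, urlencode

SAMPLES_PER_MS = 16  # 16 kHz sample rate


def inspect_audio_url(base_url: str, session_id: str, start_ms: int, end_ms: int) -> str:
    """Build the inspect/audio URL for the half-open window [start_ms, end_ms)."""
    start_sample = start_ms * SAMPLES_PER_MS
    end_sample = end_ms * SAMPLES_PER_MS
    if start_sample < 0 or end_sample <= start_sample:
        raise ValueError("window must be non-negative and non-empty")
    query = urlencode({"start_sample": start_sample, "end_sample": end_sample})
    return f"{base_url}/v1/sessions/{quote(session_id, safe='')}/inspect/audio?{query}"
```

Called with a window of 0 to 1000 ms, it produces the same URL as the curl example above.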

GET /v1/sessions/{session_id}/asr-config

Returns the session ASR window config.

Response 200

Returns an ASRWindowConfig.

{
  "pre_roll_ms": 700,
  "post_roll_ms": 700,
  "min_commit_ms": 4000,
  "target_commit_ms": 10000,
  "max_commit_ms": 15000,
  "merge_gap_ms": 1800,
  "min_speech_ms": 2500,
  "min_isolated_ms": 400,
  "commit_tolerance_ms": 200
}

PATCH /v1/sessions/{session_id}/asr-config

Partially updates the session ASR window config.

Unknown JSON fields are rejected.

Request body

Any subset of ASRWindowConfig:

{
  "target_commit_ms": 12000,
  "max_commit_ms": 18000
}

The patch is applied to the current config, then the full updated config is validated.

Response 200

Returns the full updated ASRWindowConfig.

Validation

Field Minimum Maximum
pre_roll_ms 0 5000
post_roll_ms 0 5000
min_commit_ms 400 30000
target_commit_ms 400 60000
max_commit_ms 1000 120000
merge_gap_ms 0 10000
min_speech_ms 0 10000
min_isolated_ms 0 5000
commit_tolerance_ms 0 1000

Additional invariants:

Rule
target_commit_ms >= min_commit_ms
max_commit_ms >= target_commit_ms
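
The patch-then-validate behavior can be sketched on the client to predict whether a patch will be accepted. The function name is hypothetical, the ranges are assumed to be inclusive, and the defaults are taken from the ASRWindowConfig schema. This is an illustration, not the service's code.

```python
# (minimum, maximum) per field, from the validation table; assumed inclusive.
LIMITS = {
    "pre_roll_ms": (0, 5000),
    "post_roll_ms": (0, 5000),
    "min_commit_ms": (400, 30000),
    "target_commit_ms": (400, 60000),
    "max_commit_ms": (1000, 120000),
    "merge_gap_ms": (0, 10000),
    "min_speech_ms": (0, 10000),
    "min_isolated_ms": (0, 5000),
    "commit_tolerance_ms": (0, 1000),
}

DEFAULTS = {
    "pre_roll_ms": 700, "post_roll_ms": 700, "min_commit_ms": 4000,
    "target_commit_ms": 10000, "max_commit_ms": 15000, "merge_gap_ms": 1800,
    "min_speech_ms": 2500, "min_isolated_ms": 400, "commit_tolerance_ms": 200,
}


def apply_patch(current: dict, patch: dict) -> dict:
    """Merge patch into a complete current config, then validate the result."""
    unknown = set(patch) - set(LIMITS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    updated = {**current, **patch}
    for field, (low, high) in LIMITS.items():
        if not low <= updated[field] <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    if updated["target_commit_ms"] < updated["min_commit_ms"]:
        raise ValueError("target_commit_ms must be >= min_commit_ms")
    if updated["max_commit_ms"] < updated["target_commit_ms"]:
        raise ValueError("max_commit_ms must be >= target_commit_ms")
    return updated
```

Because validation runs on the merged config, raising target_commit_ms to 20000 on its own would fail against the default max_commit_ms of 15000; both fields need to go in the same patch.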

Errors

Status Cause
400 Invalid JSON, unknown field, invalid range, or invalid invariant
404 Session not found
405 Method other than GET or PATCH

Provider comparison routes

POST /v1/sessions/{session_id}/chunk-pass/elevenlabs

Reruns the session's persisted chunk windows through the configured ElevenLabs client and rebuilds the chunk transcript.

This is intended for diagnostics and comparison after a session has already finalized.

Response 200

Returns a final TranscriptSnapshot. The comparison field compares the rebuilt chunk transcript against the current final transcript.

Errors

Status Cause
400 No canonical audio timeline
404 Session not found
409 No finalized chunk transcript or no persisted chunk segments
500 Transcription or persistence failed
503 ElevenLabs client is not configured

POST /v1/sessions/{session_id}/full-pass/elevenlabs

Runs the full canonical recording through ElevenLabs and stores a final snapshot from that result.

This route requires an existing finalized chunk transcript so the service can build a comparison.

Response 200

Returns a final TranscriptSnapshot. The comparison field compares the live chunk transcript with the full-pass transcript.

Errors

Status Cause
400 No canonical audio timeline
404 Session not found
409 No finalized chunk transcript
500 Transcription failed
503 ElevenLabs client is not configured

Speakers

GET /v1/speakers

Lists saved speaker profiles.

Response 200

Returns an array of SpeakerProfile.

[
  {
    "id": "speaker-123",
    "name": "Doctor",
    "created_at": "2026-04-28T10:00:00Z",
    "updated_at": "2026-04-28T10:00:00Z",
    "samples": [
      {
        "session_id": "18b9617a48fef8072bfc2c38",
        "start_ms": 1000,
        "end_ms": 5000,
        "created_at": "2026-04-28T10:00:00Z",
        "embedding": [0.01, -0.02]
      }
    ],
    "centroid": [0.01, -0.02]
  }
]

Errors

Status Cause
405 Method other than GET or POST
500 Speaker service is unavailable or profile store read failed

POST /v1/speakers

Enrolls or updates a speaker profile using a clip from a finalized session.

Request body

{
  "name": "Doctor",
  "session_id": "18b9617a48fef8072bfc2c38",
  "start_ms": 1000,
  "end_ms": 5000
}

Field Type Required Notes
name string Yes Profile display name
session_id string Yes Source session
start_ms int64 Yes Start of enrollment clip
end_ms int64 Yes End of enrollment clip

The source session must have a finalized transcript. The clip must be at least SPEAKER_MIN_ENROLLMENT_MS milliseconds long; the default is 3000.

Response 201

Returns the saved SpeakerProfile.

Errors

Status Cause
400 Invalid JSON, invalid clip window, or clip too short
404 Source session not found
409 Source session does not have a finalized transcript
500 Speaker service, embedding sidecar, or store failed

POST /v1/sessions/{session_id}/speaker-identification

Reruns speaker analysis for a finalized session and updates the transcript snapshot with speaker labels.

Response 200

Returns a TranscriptSnapshot with speaker fields populated where matches were found.

Errors

Status Cause
404 Session not found
409 Session does not have a finalized transcript
500 Speaker service or embedding sidecar failed

Logs

GET /v1/logs/events

Subscribes to service logs through Server-Sent Events.

The server first sends retained log history, then live log entries.

Query parameters

Name Type Required Notes
session_id string No If present, only log entries whose attrs.session_id equals this value are sent

Events

Event Data
log LogEntry
ping {} every 15 seconds while idle

Example

curl -N 'http://localhost:8080/v1/logs/events?session_id=18b9617a48fef8072bfc2c38'

Errors

Status Cause
500 Streaming unsupported
503 Log stream is unavailable

Direct ElevenLabs transcription

POST /v1/transcriptions/elevenlabs

Uploads a complete audio file directly to ElevenLabs with diarization enabled. This bypasses the live PCM session flow and returns a final transcript snapshot.

This endpoint is useful for comparing the in-house live pipeline against a whole-file ElevenLabs transcription.

Request

multipart/form-data

Field Type Required Notes
file file Yes Audio file to transcribe
language_hint string No Passed through to the provider
num_speakers integer No Optional expected speaker count, accepted range is effectively 0 to 32; when set above 0, it cannot be combined with diarization_threshold
diarization_threshold float No Optional diarization threshold; cannot be combined with num_speakers > 0

The service parses multipart bodies with a 32 MiB in-memory threshold. Larger multipart file parts may spill to temporary files per Go's standard library behavior.
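
The rule that num_speakers and diarization_threshold are mutually exclusive can be checked before uploading. This sketch builds the optional form fields and rejects the documented invalid combination; the function name is hypothetical, and no range check is applied to the threshold because its valid range is not given here.

```python
def diarization_form_fields(num_speakers=None, diarization_threshold=None) -> dict:
    """Return optional multipart form fields as strings, or raise ValueError."""
    fields = {}
    if num_speakers is not None:
        if not 0 <= num_speakers <= 32:
            raise ValueError("num_speakers must be between 0 and 32")
        if num_speakers > 0 and diarization_threshold is not None:
            raise ValueError(
                "num_speakers > 0 cannot be combined with diarization_threshold"
            )
        fields["num_speakers"] = str(num_speakers)
    if diarization_threshold is not None:
        fields["diarization_threshold"] = str(diarization_threshold)
    return fields
```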

Response 200

{
  "provider": "elevenlabs",
  "language_code": "en",
  "language_probability": 0.99,
  "input": {
    "filename": "sample.wav",
    "mime_type": "audio/wav",
    "duration_ms": 3000
  },
  "snapshot": {
    "session_id": "elevenlabs-batch",
    "revision": 1,
    "text": "doctor hello patient yes",
    "words": [
      {
        "start_ms": 0,
        "end_ms": 500,
        "text": "doctor",
        "speaker_id": "speaker_0"
      }
    ],
    "segments": [],
    "finalized": true,
    "consistency": "FINAL",
    "updated_at": "2026-04-28T10:00:00Z"
  }
}

Field Type Notes
provider string Provider name, currently elevenlabs
language_code string Provider-detected language code, omitted when absent
language_probability number Provider confidence, omitted when zero
input object Uploaded file metadata and duration inferred from word end times
snapshot object Final transcript snapshot with words and segments

Errors

Status Cause
400 Invalid content type, invalid multipart body, missing file, invalid number, invalid threshold, invalid speaker/threshold combination
405 Method other than POST
502 ElevenLabs upstream request failed
503 ElevenLabs transcription is unavailable

Example

curl -X POST http://localhost:8080/v1/transcriptions/elevenlabs \
  -F file=@sample.wav \
  -F language_hint=en \
  -F num_speakers=2

Integration live routes

The /api/v1/transcribe-live/* routes expose only the subset needed by an upstream application embedding live transcription.

Every route requires a valid live token.

GET /api/v1/transcribe-live/sessions/{session_id}/audio/ws

Protected version of GET /v1/sessions/{session_id}/audio/ws.

The token may be passed as a query parameter because the browser WebSocket constructor cannot set custom request headers such as Authorization:

ws://localhost:8080/api/v1/transcribe-live/sessions/18b9617a48fef8072bfc2c38/audio/ws?token=<token>

GET /api/v1/transcribe-live/sessions/{session_id}/events

Protected version of GET /v1/sessions/{session_id}/events.

POST /api/v1/transcribe-live/sessions/{session_id}/stop

Protected version of POST /v1/sessions/{session_id}/stop.

Integration route errors

Status Cause
401 Missing token, invalid token, expired token, or claim mismatch
404 Session or route not found

Schemas

ASRWindowConfig

{
  "pre_roll_ms": 700,
  "post_roll_ms": 700,
  "min_commit_ms": 4000,
  "target_commit_ms": 10000,
  "max_commit_ms": 15000,
  "merge_gap_ms": 1800,
  "min_speech_ms": 2500,
  "min_isolated_ms": 400,
  "commit_tolerance_ms": 200
}

Field Type Default Description
pre_roll_ms int64 700 Audio included before the committed speech span
post_roll_ms int64 700 Audio included after the committed speech span
min_commit_ms int64 4000 Minimum span before a gap can trigger an ASR window
target_commit_ms int64 10000 Target span that triggers a window
max_commit_ms int64 15000 Hard maximum span that forces a window
merge_gap_ms int64 1800 Maximum gap between detected speech segments before the planner separates windows
min_speech_ms int64 2500 Minimum total speech required before gap-based commit
min_isolated_ms int64 400 Minimum isolated speech span to emit any ASR window
commit_tolerance_ms int64 200 Boundary tolerance for committed spans

TranscriptSnapshot

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "revision": 4,
  "text": "hello patient",
  "words": [],
  "segments": [],
  "finalized": false,
  "consistency": "PARTIAL",
  "comparison": {
    "chunk_text": "hello patient",
    "final_pass_text": "hello patient",
    "chunk_word_count": 2,
    "final_pass_word_count": 2,
    "similarity": 1,
    "target": 0.99,
    "meets_target": true
  },
  "updated_at": "2026-04-28T10:00:00Z"
}

Field Type Description
session_id string Session ID
revision int64 Monotonic snapshot revision within a session
text string Visible transcript text for the requested consistency
words array of Word Visible words
segments array of TranscriptSegment Visible segments grouped by state, speaker, and timing
finalized boolean True after stop/final pass has sealed the transcript
consistency enum PARTIAL, STABLE, or FINAL
comparison Comparison Optional comparison block after final/chunk pass diagnostics
updated_at timestamp Snapshot creation/update time

Word

{
  "start_ms": 0,
  "end_ms": 500,
  "text": "hello",
  "speaker_id": "speaker_0",
  "speaker_name": "Doctor",
  "speaker_confidence": 0.91
}

Field Type Required Description
start_ms int64 Yes Word start offset
end_ms int64 Yes Word end offset
text string Yes Word/token text
speaker_id string No Provider or local speaker ID
speaker_name string No Matched saved speaker name or anonymous label
speaker_confidence number No Speaker match confidence

TranscriptSegment

{
  "segment_id": "18b9617a48fef8072bfc2c38-4-1",
  "session_id": "18b9617a48fef8072bfc2c38",
  "window_id": "asr-window-1",
  "revision": 4,
  "provider": "elevenlabs",
  "audio_start_ms": 0,
  "audio_end_ms": 1200,
  "text": "hello patient",
  "state": "STABLE",
  "speaker_id": "speaker_0",
  "speaker_name": "Doctor",
  "speaker_confidence": 0.91
}

Field Type Required Description
segment_id string Yes Generated ID from session, revision, and segment ordinal
session_id string Yes Session ID
window_id string Yes ASR window or synthetic final-pass ID
revision int64 Yes Snapshot revision
provider string Yes Provider that produced the words
audio_start_ms int64 Yes Segment start
audio_end_ms int64 Yes Segment end
text string Yes Segment text
state enum Yes PARTIAL, STABLE, or FINAL
speaker_id string No Provider or local speaker ID
speaker_name string No Speaker display name
speaker_confidence number No Speaker match confidence

Comparison

{
  "chunk_text": "hello patient",
  "final_pass_text": "hello patient",
  "chunk_word_count": 2,
  "final_pass_word_count": 2,
  "similarity": 1,
  "target": 0.99,
  "meets_target": true
}

Field Type Description
chunk_text string Text from live chunk pipeline
final_pass_text string Text from final/full pass
chunk_word_count integer Normalized word count for chunk text
final_pass_word_count integer Normalized word count for final text
similarity number Word-level Levenshtein similarity from 0 to 1
target number Target similarity, currently 0.99 by default
meets_target boolean True when similarity >= target
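
To give a feel for the similarity field, here is a sketch of a word-level Levenshtein similarity. The exact normalization and formula used by the service are not documented here. This version assumes lowercasing, whitespace tokenization, and similarity = 1 - distance / max(word counts), so its results may differ from the service's. All names are hypothetical.

```python
def normalize_words(text: str) -> list:
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def word_levenshtein(a: list, b: list) -> int:
    """Edit distance counting word insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        current = [i]
        for j, word_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # delete word_a
                current[j - 1] + 1,                     # insert word_b
                previous[j - 1] + (word_a != word_b),   # substitute or match
            ))
        previous = current
    return previous[-1]


def compare(chunk_text: str, final_pass_text: str, target: float = 0.99) -> dict:
    """Build a Comparison-shaped dict under the assumptions above."""
    a, b = normalize_words(chunk_text), normalize_words(final_pass_text)
    longest = max(len(a), len(b))
    similarity = 1.0 if longest == 0 else 1.0 - word_levenshtein(a, b) / longest
    return {
        "chunk_text": chunk_text,
        "final_pass_text": final_pass_text,
        "chunk_word_count": len(a),
        "final_pass_word_count": len(b),
        "similarity": similarity,
        "target": target,
        "meets_target": similarity >= target,
    }
```

Under these assumptions, one missing word in a three-word transcript gives a similarity of about 0.67, well below the 0.99 target.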

InspectEvent

{
  "seq": 12,
  "time": "2026-04-28T10:00:00Z",
  "session_id": "18b9617a48fef8072bfc2c38",
  "type": "chunk_pass_window_completed",
  "lane": "live",
  "window_id": "asr-window-1",
  "state": "completed",
  "reason": "asr_target",
  "start_sample": 0,
  "end_sample": 160000,
  "start_ms": 0,
  "end_ms": 10000,
  "bytes": 320000,
  "text_len": 42,
  "word_count": 8,
  "preview": "hello patient",
  "note": "windows=1 similarity=1.00"
}

Field Type Description
seq int64 Monotonic inspect event sequence
time timestamp Event timestamp
session_id string Session ID
type string Event type, such as session_started, pcm_ingest_started, full_pass_completed
lane string Optional lane, commonly live or final
window_id string Optional ASR/final window ID
state string Optional lifecycle state
reason string Optional planner or boundary reason
start_sample int64 Optional start sample
end_sample int64 Optional end sample
start_ms int64 Optional start time
end_ms int64 Optional end time
bytes int64 Optional byte count
text_len integer Optional transcript text length
word_count integer Optional word count
preview string Optional text preview
note string Optional human-readable note

LogEntry

{
  "time": "2026-04-28T10:00:00Z",
  "level": "INFO",
  "message": "session initialized",
  "attrs": {
    "session_id": "18b9617a48fef8072bfc2c38"
  }
}

Field Type Description
time timestamp Log record timestamp
level string Log level name from Go's log/slog package, such as INFO
message string Log message
attrs object Optional structured attributes

SpeakerProfile

{
  "id": "speaker-123",
  "name": "Doctor",
  "created_at": "2026-04-28T10:00:00Z",
  "updated_at": "2026-04-28T10:00:00Z",
  "samples": [],
  "centroid": [0.01, -0.02]
}

Field Type Description
id string Profile ID
name string Display name
created_at timestamp Profile creation time
updated_at timestamp Last update time
samples array of SpeakerSample Enrollment samples
centroid number array Current embedding centroid

SpeakerSample

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "start_ms": 1000,
  "end_ms": 5000,
  "created_at": "2026-04-28T10:00:00Z",
  "embedding": [0.01, -0.02]
}

Field Type Description
session_id string Source session ID
start_ms int64 Enrollment clip start
end_ms int64 Enrollment clip end
created_at timestamp Sample creation time
embedding number array Raw speaker embedding

End-to-end live flow

  1. Create a session.

    SESSION_ID=$(
      curl -s -X POST http://localhost:8080/v1/sessions \
        -H 'Content-Type: application/json' \
        -d '{"language_hint":"en"}' \
      | jq -r .session_id
    )
    
  2. Open /v1/sessions/{session_id}/events to receive live transcript snapshots.

  3. Open /v1/sessions/{session_id}/audio/ws.

  4. Send the required JSON start frame.

  5. Stream binary PCM16 little-endian frames.

  6. Close the WebSocket when capture ends.

  7. Call POST /v1/sessions/{session_id}/stop.

  8. Fetch GET /v1/sessions/{session_id}/transcript?consistency=FINAL.

  9. Optionally call POST /v1/sessions/{session_id}/speaker-identification.

  10. Optionally download GET /v1/sessions/{session_id}/recording.