API Reference¶

Covers: HTTP JSON, WebSocket PCM, Server-Sent Events, WAV clips, and live-token integration routes.

This document describes the HTTP, WebSocket, and Server-Sent Events API exposed by doctorita-transcribe.

The service is intentionally stateful: sessions are created in memory, audio is streamed into the owning process, and session routes return 404 after a server restart unless the caller creates a new session.

Stateful sessions

Session IDs are live-process handles. The canonical audio spool is written under DATA_DIR, but session route lookup is in memory, so clients should create a new session after a server restart.

Conventions¶

Base URL¶

When running locally with the default configuration:

http://localhost:8080

All public API paths below are relative to that base URL.

Content types¶

Direction	Content type
JSON requests	`application/json`
JSON responses	`application/json`
Audio uploads	`multipart/form-data`
Downloaded recordings and clips	`audio/wav`
Live transcript, inspect, and log streams	`text/event-stream`
Live audio ingest	WebSocket text control frame followed by binary PCM frames

Time and audio units¶

Unit	Meaning
`*_ms`	Milliseconds from the beginning of the session recording
`*_sample`	16 kHz mono PCM sample index, half-open interval style: `[start_sample, end_sample)`
`updated_at`, `created_at`, `time`	JSON timestamp emitted by Go's `time.Time` encoder: RFC 3339, with fractional seconds (up to nanosecond precision) only when they are non-zero

The audio ingest contract is always:

16 kHz, mono, signed 16-bit little-endian PCM
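The relationship between `*_ms` offsets, `*_sample` indices, and PCM byte counts follows directly from the 16 kHz, 16-bit mono contract. The sketch below is illustrative only: the helper names are hypothetical, and rounding down on conversion is an assumption, since this reference does not state how the service rounds.

```python
SAMPLE_RATE_HZ = 16_000  # fixed by the ingest contract
BYTES_PER_SAMPLE = 2     # signed 16-bit PCM, one channel


def ms_to_sample(ms: int) -> int:
    """Convert a millisecond offset to a 16 kHz sample index (rounds down)."""
    return ms * SAMPLE_RATE_HZ // 1000


def sample_to_ms(sample: int) -> int:
    """Convert a 16 kHz sample index to a millisecond offset (rounds down)."""
    return sample * 1000 // SAMPLE_RATE_HZ


def window_bytes(start_sample: int, end_sample: int) -> int:
    """Byte length of the half-open window [start_sample, end_sample)."""
    if end_sample <= start_sample:
        raise ValueError("end_sample must be greater than start_sample")
    return (end_sample - start_sample) * BYTES_PER_SAMPLE
```

For example, a 10 second window spans samples `[0, 160000)` and holds 320000 bytes of PCM.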

Error envelope¶

Errors are JSON objects:

{
  "error": "message"
}

Common statuses:

Status	Meaning
`400`	Invalid request body, query, form data, ASR config, or audio slice
`401`	Invalid or missing integration live token
`404`	Session or route not found
`405`	Method not allowed
`409`	Operation requires a finalized transcript or existing chunk transcript
`500`	Internal failure, such as session creation, finalization, streaming setup, or a storage read
`502`	Upstream transcription provider failed for direct batch transcription
`503`	Optional service is unavailable, such as ElevenLabs or log streaming
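A client can read the envelope defensively, falling back to the status code when the body is not the expected JSON (for example, when a proxy in front of the service answers instead). This is a minimal sketch: `error_message` is a hypothetical helper, not part of the service.

```python
import json


def error_message(status: int, body: bytes) -> str:
    """Return the "error" field of an error envelope, or a generic fallback."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers invalid JSON and undecodable bytes.
        payload = None
    if isinstance(payload, dict) and isinstance(payload.get("error"), str):
        return payload["error"]
    return f"HTTP {status}"
```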

Transcript consistency¶

Transcript snapshots can be requested at different consistency levels:

Value	Included words
`PARTIAL`	Partial, stable, and final words
`STABLE`	Stable and final words
`FINAL`	Final words only

The consistency query parameter is case-insensitive. Missing or unknown values default to PARTIAL.
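The table and defaulting rule above can be mirrored on the client, for example to narrow a `PARTIAL` snapshot without a second request. The sketch below is an illustration, not the service's implementation: the function names are hypothetical, and filtering on each segment's `state` field is assumed to match how the server selects words.

```python
VISIBLE_STATES = {
    "PARTIAL": {"PARTIAL", "STABLE", "FINAL"},
    "STABLE": {"STABLE", "FINAL"},
    "FINAL": {"FINAL"},
}


def normalize_consistency(value):
    """Case-insensitive lookup; missing or unknown values become PARTIAL."""
    if value is None:
        return "PARTIAL"
    level = value.upper()
    return level if level in VISIBLE_STATES else "PARTIAL"


def visible_segments(segments, consistency):
    """Keep only the segments whose state is visible at the given level."""
    allowed = VISIBLE_STATES[normalize_consistency(consistency)]
    return [segment for segment in segments if segment.get("state") in allowed]
```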

Authentication¶

The public /v1/* routes do not currently enforce an application auth layer.

The integration-prefixed live routes under /api/v1/transcribe-live/* require a live token. The token can be supplied either as:

Authorization: Bearer <token>

or:

?token=<token>

The token format is:

base64url(payload).base64url(hmac_sha256(payload, DOCTORITA_TRANSCRIBE_LIVE_TOKEN_SECRET))

Payload schema:

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "transcript_id": "optional-upstream-transcript-id",
  "user_id": "optional-upstream-user-id",
  "exp": 1777392000
}

Validation rules:

Claim	Rule
`session_id`	Must match the route session ID
`transcript_id`	If the session metadata has `transcript_id`, the token claim must match
`user_id`	If the session metadata has `user_id`, the token claim must match
`exp`	Optional Unix timestamp in seconds. If present and in the past, the token is rejected
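An upstream backend that holds the shared secret could mint and check tokens along these lines. This is a sketch under stated assumptions, not the service's code: it assumes unpadded base64url, an HMAC computed over the raw JSON payload bytes, and compact JSON serialization, none of which this reference specifies. The function names are hypothetical, and the `transcript_id` and `user_id` checks are omitted because they depend on server-side session metadata.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    # Assumption: padding is stripped.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_live_token(secret: bytes, claims: dict) -> str:
    """Build base64url(payload).base64url(hmac_sha256(payload, secret))."""
    payload = json.dumps(claims, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64url(payload)}.{_b64url(signature)}"


def verify_live_token(secret: bytes, token: str, session_id: str, now=None) -> dict:
    """Check the signature, session_id, and exp claims; return the claims."""
    payload_part, separator, signature_part = token.partition(".")
    if not separator:
        raise ValueError("malformed token")
    payload = _b64url_decode(payload_part)
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_part)):
        raise ValueError("bad signature")
    claims = json.loads(payload)
    if claims.get("session_id") != session_id:
        raise ValueError("session_id does not match the route")
    exp = claims.get("exp")
    if exp is not None and exp < (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

The comparison uses `hmac.compare_digest` so that signature checks take constant time.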

Endpoint index¶

Method	Path	Purpose
`GET`	`/healthz`	Service health and active transcription provider
`GET`	`/v1/logs/events`	Service log SSE stream
`POST`	`/v1/sessions`	Create a live session
`GET`	`/v1/sessions/{session_id}/audio/ws`	Stream PCM audio over WebSocket
`POST`	`/v1/sessions/{session_id}/stop`	Stop audio ingest and seal the transcript
`GET`	`/v1/sessions/{session_id}/transcript`	Fetch a transcript snapshot
`GET`	`/v1/sessions/{session_id}/events`	Transcript SSE stream
`GET`	`/v1/sessions/{session_id}/recording`	Download session recording WAV
`GET`	`/v1/sessions/{session_id}/inspect/events`	Inspect/debug SSE stream
`GET`	`/v1/sessions/{session_id}/inspect/audio`	Download a WAV clip by sample window
`GET`	`/v1/sessions/{session_id}/asr-config`	Read ASR window config
`PATCH`	`/v1/sessions/{session_id}/asr-config`	Patch ASR window config
`POST`	`/v1/sessions/{session_id}/speaker-identification`	Rerun speaker identification
`POST`	`/v1/sessions/{session_id}/chunk-pass/elevenlabs`	Rerun live chunk pass through ElevenLabs
`POST`	`/v1/sessions/{session_id}/full-pass/elevenlabs`	Run full-recording ElevenLabs pass
`GET`	`/v1/speakers`	List saved speaker profiles
`POST`	`/v1/speakers`	Enroll a speaker from a finalized session clip
`POST`	`/v1/transcriptions/elevenlabs`	Direct full-audio ElevenLabs transcription comparison
`GET`	`/api/v1/transcribe-live/sessions/{session_id}/audio/ws`	Protected integration live audio WebSocket
`GET`	`/api/v1/transcribe-live/sessions/{session_id}/events`	Protected integration transcript SSE
`POST`	`/api/v1/transcribe-live/sessions/{session_id}/stop`	Protected integration stop route

Health¶

`GET /healthz`¶

Returns basic service health and the configured transcription provider.

Response `200`¶

{
  "status": "ok",
  "provider": "elevenlabs"
}

Field	Type	Notes
`status`	string	Currently `ok` when the handler is reachable
`provider`	string	Active provider name, such as `elevenlabs`, `openai`, or an empty string if the manager is absent

Example¶

curl http://localhost:8080/healthz

Sessions¶

`POST /v1/sessions`¶

Creates a new in-memory live session and a session directory under DATA_DIR.

Request body¶

Body is JSON. The body may be omitted, but clients should send {} if they do not have metadata.

{
  "language_hint": "en",
  "glossary": ["Doctorita", "ECAPA"],
  "transcript_id": "upstream-transcript-123",
  "user_id": "user-123",
  "asr_window_config": {
    "pre_roll_ms": 700,
    "post_roll_ms": 700,
    "min_commit_ms": 4000,
    "target_commit_ms": 10000,
    "max_commit_ms": 15000,
    "merge_gap_ms": 1800,
    "min_speech_ms": 2500,
    "min_isolated_ms": 400,
    "commit_tolerance_ms": 200
  }
}

Field	Type	Required	Notes
`language_hint`	string	No	Passed to the transcription provider when supported
`glossary`	string array	No	Terms for client/backend metadata and prompt context
`transcript_id`	string	No	Upstream transcript ID used by backend callbacks and live-token validation
`user_id`	string	No	Upstream user ID used by backend callbacks and live-token validation
`asr_window_config`	object	No	Overrides the default ASR chunk window config for this session

Response `201`¶

{
  "session_id": "18b9617a48fef8072bfc2c38"
}

Errors¶

Status	Cause
`400`	Invalid JSON or invalid ASR window config
`405`	Any method other than `POST`
`500`	Session directory, VAD detector, or session actor creation failed

Example¶

curl -X POST http://localhost:8080/v1/sessions \
  -H 'Content-Type: application/json' \
  -d '{"language_hint":"en","glossary":["Doctorita"]}'

`GET /v1/sessions/{session_id}/audio/ws`¶

Opens the live PCM ingest WebSocket for a session.

The first frame must be a JSON text frame:

{
  "type": "start",
  "sample_rate": 16000,
  "channels": 1,
  "format": "pcm_s16le"
}

After the start message, every frame must be binary PCM:

signed int16 little-endian samples, 16 kHz, mono

WebSocket validation¶

Rule	Failure close reason
Missing first frame	`missing start message`
First frame is not text	`first message must be JSON text`
First frame is invalid JSON	`invalid start message`
`type` is not `start`	`first audio websocket message must be type=start`
`sample_rate` is not `16000`	`sample_rate must be 16000`
`channels` is not `1`	`channels must be 1`
`format` is not `pcm_s16le`	`format must be pcm_s16le`
Session already has an audio stream	`session already has an audio stream`
Later frame is not binary	`audio frames must be binary PCM16`
Binary frame has invalid PCM16 length	Decoder error, for example an odd byte count
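A client can build frames that satisfy these rules before handing them to whatever WebSocket library it uses. The sketch below only constructs the payloads and does not open a connection. The function names are hypothetical, and the 20 ms frame duration is an arbitrary choice, since this reference does not prescribe a frame size.

```python
import json
import struct
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000
FRAME_MS = 20  # hypothetical frame duration


def start_frame() -> str:
    """JSON text frame that must be sent first."""
    return json.dumps(
        {"type": "start", "sample_rate": SAMPLE_RATE_HZ, "channels": 1, "format": "pcm_s16le"}
    )


def pcm_frames(samples: Sequence[int], frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield binary frames of signed 16-bit little-endian mono PCM."""
    per_frame = SAMPLE_RATE_HZ * frame_ms // 1000
    for offset in range(0, len(samples), per_frame):
        chunk = samples[offset:offset + per_frame]
        # "<h" is a little-endian int16; two bytes per sample keeps every frame even-length.
        yield struct.pack(f"<{len(chunk)}h", *chunk)
```

Send `start_frame()` as a text message, then each item from `pcm_frames(...)` as a binary message.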

Allowed browser origins are controlled by DOCTORITA_TRANSCRIBE_ALLOWED_ORIGINS, which defaults to:

localhost:* 127.0.0.1:*

State side effects¶

Event	Effect
WebSocket accepted and valid start frame received	The session begins accepting PCM
Binary PCM frames received	Frames are appended to the in-memory ring and canonical spool
WebSocket closes	The session stops accepting PCM, but the session is not finalized until `POST /v1/sessions/{session_id}/stop`

`POST /v1/sessions/{session_id}/stop`¶

Stops ingest and finalizes the session from the completed live chunks.

Response `200`¶

{
  "status": "stopped"
}

Errors¶

Status	Cause
`404`	Session not found
`500`	Stop/finalization failed

Example¶

curl -X POST http://localhost:8080/v1/sessions/18b9617a48fef8072bfc2c38/stop

`GET /v1/sessions/{session_id}/transcript`¶

Returns the current transcript snapshot.

Query parameters¶

Name	Type	Required	Default	Notes
`consistency`	enum	No	`PARTIAL`	`PARTIAL`, `STABLE`, or `FINAL`, case-insensitive

Response `200`¶

Returns a TranscriptSnapshot.

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "revision": 3,
  "text": "hello patient",
  "words": [
    {
      "start_ms": 0,
      "end_ms": 500,
      "text": "hello",
      "speaker_id": "speaker_0",
      "speaker_name": "Doctor",
      "speaker_confidence": 0.91
    }
  ],
  "segments": [
    {
      "segment_id": "18b9617a48fef8072bfc2c38-3-1",
      "session_id": "18b9617a48fef8072bfc2c38",
      "window_id": "asr-window-1",
      "revision": 3,
      "provider": "elevenlabs",
      "audio_start_ms": 0,
      "audio_end_ms": 500,
      "text": "hello",
      "state": "FINAL",
      "speaker_id": "speaker_0",
      "speaker_name": "Doctor",
      "speaker_confidence": 0.91
    }
  ],
  "finalized": true,
  "consistency": "FINAL",
  "updated_at": "2026-04-28T10:00:00Z"
}

Errors¶

Status	Cause
`404`	Session not found
`500`	Snapshot query failed

`GET /v1/sessions/{session_id}/events`¶

Subscribes to transcript updates through Server-Sent Events.

Events¶

Event	Data
`transcript`	`TranscriptSnapshot`
`ping`	`{}` every 15 seconds while idle

Example stream¶

event: transcript
data: {"session_id":"18b9617a48fef8072bfc2c38","revision":1,"text":"hello","words":[],"segments":[],"finalized":false,"consistency":"PARTIAL","updated_at":"2026-04-28T10:00:00Z"}

event: ping
data: {}

Browser example¶

const events = new EventSource('/v1/sessions/18b9617a48fef8072bfc2c38/events');

events.addEventListener('transcript', (event) => {
  const snapshot = JSON.parse(event.data);
  console.log(snapshot.text);
});
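Outside a browser, the stream can be read with a small line-based parser. The sketch below handles only the `event:` and `data:` fields this endpoint is documented to emit, plus comment lines; `parse_sse` is a hypothetical helper, and it assumes every `data` payload is JSON, as in the example stream.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, object]]:
    """Yield (event_name, decoded_data) for each complete SSE event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)
```

The `lines` argument can be any iterable of decoded text lines, for example an HTTP response body read line by line.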

Errors¶

Status	Cause
`404`	Session not found
`500`	Streaming unsupported or subscription failed

`GET /v1/sessions/{session_id}/recording`¶

Downloads the full canonical session recording as WAV.

Response `200`¶

Header	Value
`Content-Type`	`audio/wav`
`Content-Disposition`	`inline; filename="recording.wav"`

If the session has no audio yet, the endpoint returns a valid empty WAV.
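A downloaded recording can be sanity-checked with Python's standard `wave` module. The sketch below only reads the header. It is an assumption, not something this reference states, that the recording uses the same 16 kHz mono 16-bit format as the ingest contract; `describe_wav` is a hypothetical helper.

```python
import io
import wave


def describe_wav(data: bytes) -> dict:
    """Read a WAV header and report its format and duration."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frames": frames,
            "duration_ms": frames * 1000 // rate,
        }
```

An empty recording should report zero frames and a duration of 0 ms.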

Errors¶

Status	Cause
`404`	Session not found
`500`	Recording could not be read

`GET /v1/sessions/{session_id}/inspect/events`¶

Subscribes to internal inspect/debug events for a session.

The server first sends retained inspect history, then live inspect events.

Events¶

Event	Data
`inspect`	`InspectEvent`
`ping`	`{}` every 15 seconds while idle

Example¶

curl -N http://localhost:8080/v1/sessions/18b9617a48fef8072bfc2c38/inspect/events

Errors¶

Status	Cause
`404`	Session not found
`500`	Streaming unsupported or inspect stream unavailable

`GET /v1/sessions/{session_id}/inspect/audio`¶

Downloads a WAV slice from the canonical session spool.

Query parameters¶

Name	Type	Required	Notes
`start_sample`	int64	Yes	Inclusive 16 kHz sample index
`end_sample`	int64	Yes	Exclusive 16 kHz sample index, must be greater than `start_sample`

Response `200`¶

Header	Value
`Content-Type`	`audio/wav`
`Cache-Control`	`no-store`

Errors¶

Status	Cause
`400`	Missing, non-integer, invalid, or out-of-range sample window
`404`	Session not found

Example¶

curl -o clip.wav \
  'http://localhost:8080/v1/sessions/18b9617a48fef8072bfc2c38/inspect/audio?start_sample=0&end_sample=16000'
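The same request URL can be built programmatically, with the window checked before the call. This is a minimal sketch: `clip_url` is a hypothetical helper, and rejecting negative indices locally is an assumption about what the service treats as out of range.

```python
from urllib.parse import quote, urlencode


def clip_url(base_url: str, session_id: str, start_sample: int, end_sample: int) -> str:
    """Build the inspect/audio URL for the half-open window [start_sample, end_sample)."""
    if start_sample < 0 or end_sample <= start_sample:
        raise ValueError("need 0 <= start_sample < end_sample")
    query = urlencode({"start_sample": start_sample, "end_sample": end_sample})
    return f"{base_url}/v1/sessions/{quote(session_id, safe='')}/inspect/audio?{query}"
```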

`GET /v1/sessions/{session_id}/asr-config`¶

Returns the session ASR window config.

Response `200`¶

Returns an ASRWindowConfig.

{
  "pre_roll_ms": 700,
  "post_roll_ms": 700,
  "min_commit_ms": 4000,
  "target_commit_ms": 10000,
  "max_commit_ms": 15000,
  "merge_gap_ms": 1800,
  "min_speech_ms": 2500,
  "min_isolated_ms": 400,
  "commit_tolerance_ms": 200
}

`PATCH /v1/sessions/{session_id}/asr-config`¶

Partially updates the session ASR window config.

Unknown JSON fields are rejected.

Request body¶

Any subset of ASRWindowConfig:

{
  "target_commit_ms": 12000,
  "max_commit_ms": 18000
}

The patch is applied to the current config, then the full updated config is validated.

Response `200`¶

Returns the full updated ASRWindowConfig.

Validation¶

Field	Minimum	Maximum
`pre_roll_ms`	`0`	`5000`
`post_roll_ms`	`0`	`5000`
`min_commit_ms`	`400`	`30000`
`target_commit_ms`	`400`	`60000`
`max_commit_ms`	`1000`	`120000`
`merge_gap_ms`	`0`	`10000`
`min_speech_ms`	`0`	`10000`
`min_isolated_ms`	`0`	`5000`
`commit_tolerance_ms`	`0`	`1000`

Additional invariants:

Rule
`target_commit_ms >= min_commit_ms`
`max_commit_ms >= target_commit_ms`
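A client can apply the same patch-then-validate sequence locally to catch a bad patch before sending it. The sketch below mirrors the tables above and is not the service's implementation: `apply_patch` is a hypothetical name, and treating the minimum and maximum as inclusive bounds is an assumption.

```python
LIMITS = {
    "pre_roll_ms": (0, 5000),
    "post_roll_ms": (0, 5000),
    "min_commit_ms": (400, 30000),
    "target_commit_ms": (400, 60000),
    "max_commit_ms": (1000, 120000),
    "merge_gap_ms": (0, 10000),
    "min_speech_ms": (0, 10000),
    "min_isolated_ms": (0, 5000),
    "commit_tolerance_ms": (0, 1000),
}


def apply_patch(current: dict, patch: dict) -> dict:
    """Merge patch into current, then validate the full resulting config."""
    unknown = set(patch) - set(LIMITS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    updated = {**current, **patch}
    for field, (low, high) in LIMITS.items():
        if not low <= updated[field] <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    if updated["target_commit_ms"] < updated["min_commit_ms"]:
        raise ValueError("target_commit_ms must be >= min_commit_ms")
    if updated["max_commit_ms"] < updated["target_commit_ms"]:
        raise ValueError("max_commit_ms must be >= target_commit_ms")
    return updated
```

Because validation runs on the merged config, raising `target_commit_ms` above the current `max_commit_ms` fails unless the same patch also raises `max_commit_ms`.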

Errors¶

Status	Cause
`400`	Invalid JSON, unknown field, invalid range, or invalid invariant
`404`	Session not found
`405`	Method other than `GET` or `PATCH`

Provider comparison routes¶

`POST /v1/sessions/{session_id}/chunk-pass/elevenlabs`¶

Reruns the session's persisted chunk windows through the configured ElevenLabs client and rebuilds the chunk transcript.

This is intended for diagnostics and comparison after a session has already finalized.

Response `200`¶

Returns a final TranscriptSnapshot. The comparison field compares the rebuilt chunk transcript against the current final transcript.

Errors¶

Status	Cause
`400`	No canonical audio timeline
`404`	Session not found
`409`	No finalized chunk transcript or no persisted chunk segments
`500`	Transcription or persistence failed
`503`	ElevenLabs client is not configured

`POST /v1/sessions/{session_id}/full-pass/elevenlabs`¶

Runs the full canonical recording through ElevenLabs and stores a final snapshot from that result.

This route requires an existing finalized chunk transcript so the service can build a comparison.

Response `200`¶

Returns a final TranscriptSnapshot. The comparison field compares the live chunk transcript with the full-pass transcript.

Errors¶

Status	Cause
`400`	No canonical audio timeline
`404`	Session not found
`409`	No finalized chunk transcript
`500`	Transcription failed
`503`	ElevenLabs client is not configured

Speakers¶

`GET /v1/speakers`¶

Lists saved speaker profiles.

Response `200`¶

Returns an array of SpeakerProfile.

[
  {
    "id": "speaker-123",
    "name": "Doctor",
    "created_at": "2026-04-28T10:00:00Z",
    "updated_at": "2026-04-28T10:00:00Z",
    "samples": [
      {
        "session_id": "18b9617a48fef8072bfc2c38",
        "start_ms": 1000,
        "end_ms": 5000,
        "created_at": "2026-04-28T10:00:00Z",
        "embedding": [0.01, -0.02]
      }
    ],
    "centroid": [0.01, -0.02]
  }
]

Errors¶

Status	Cause
`405`	Method other than `GET` or `POST`
`500`	Speaker service is unavailable or profile store read failed

`POST /v1/speakers`¶

Enrolls or updates a speaker profile using a clip from a finalized session.

Request body¶

{
  "name": "Doctor",
  "session_id": "18b9617a48fef8072bfc2c38",
  "start_ms": 1000,
  "end_ms": 5000
}

Field	Type	Required	Notes
`name`	string	Yes	Profile display name
`session_id`	string	Yes	Source session
`start_ms`	int64	Yes	Start of enrollment clip
`end_ms`	int64	Yes	End of enrollment clip

The source session must have a finalized transcript. The clip must be at least SPEAKER_MIN_ENROLLMENT_MS, which defaults to 3000.
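The clip-length rule can be checked before the request is sent. A minimal sketch: `enrollment_body` is a hypothetical helper, and the local minimum assumes the server runs with the default `SPEAKER_MIN_ENROLLMENT_MS`.

```python
import json

DEFAULT_MIN_ENROLLMENT_MS = 3000  # documented default of SPEAKER_MIN_ENROLLMENT_MS


def enrollment_body(name, session_id, start_ms, end_ms,
                    min_enrollment_ms=DEFAULT_MIN_ENROLLMENT_MS) -> str:
    """Validate the clip window and return the JSON request body."""
    if not name or not session_id:
        raise ValueError("name and session_id are required")
    if start_ms < 0 or end_ms <= start_ms:
        raise ValueError("invalid clip window")
    if end_ms - start_ms < min_enrollment_ms:
        raise ValueError(f"clip must be at least {min_enrollment_ms} ms")
    return json.dumps(
        {"name": name, "session_id": session_id, "start_ms": start_ms, "end_ms": end_ms}
    )
```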

Response `201`¶

Returns the saved SpeakerProfile.

Errors¶

Status	Cause
`400`	Invalid JSON, invalid clip window, or clip too short
`404`	Source session not found
`409`	Source session does not have a finalized transcript
`500`	Speaker service, embedding sidecar, or store failed

`POST /v1/sessions/{session_id}/speaker-identification`¶

Reruns speaker analysis for a finalized session and updates the transcript snapshot with speaker labels.

Response `200`¶

Returns a TranscriptSnapshot with speaker fields populated where matches were found.

Errors¶

Status	Cause
`404`	Session not found
`409`	Session does not have a finalized transcript
`500`	Speaker service or embedding sidecar failed

Logs¶

`GET /v1/logs/events`¶

Subscribes to service logs through Server-Sent Events.

The server first sends retained log history, then live log entries.

Query parameters¶

Name	Type	Required	Notes
`session_id`	string	No	If present, only log entries whose `attrs.session_id` equals this value are sent

Events¶

Event	Data
`log`	`LogEntry`
`ping`	`{}` every 15 seconds while idle

Example¶

curl -N 'http://localhost:8080/v1/logs/events?session_id=18b9617a48fef8072bfc2c38'

Errors¶

Status	Cause
`500`	Streaming unsupported
`503`	Log stream is unavailable

Direct ElevenLabs transcription¶

`POST /v1/transcriptions/elevenlabs`¶

Uploads a complete audio file directly to ElevenLabs with diarization enabled. This bypasses the live PCM session flow and returns a final transcript snapshot.

This endpoint is useful for comparing the in-house live pipeline against a whole-file ElevenLabs transcription.

Request¶

multipart/form-data

Field	Type	Required	Notes
`file`	file	Yes	Audio file to transcribe
`language_hint`	string	No	Passed through to the provider
`num_speakers`	integer	No	Expected speaker count; the accepted range is effectively `0` to `32`. A value above `0` cannot be combined with `diarization_threshold`
`diarization_threshold`	float	No	Optional diarization threshold; cannot be combined with `num_speakers > 0`

The service parses multipart bodies with a 32 MiB in-memory threshold. Larger multipart file parts may spill to temporary files per Go's standard library behavior.
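The field rules in the table can be enforced client-side before the upload. The sketch below builds only the text form fields; attaching the `file` part is left to the HTTP client in use. `transcription_form` is a hypothetical helper.

```python
def transcription_form(language_hint=None, num_speakers=None, diarization_threshold=None) -> dict:
    """Return the optional text fields for the multipart request."""
    fields = {}
    if language_hint:
        fields["language_hint"] = language_hint
    if num_speakers is not None:
        if not 0 <= num_speakers <= 32:
            raise ValueError("num_speakers must be between 0 and 32")
        fields["num_speakers"] = str(num_speakers)
    if diarization_threshold is not None:
        # The two options are mutually exclusive once num_speakers is above 0.
        if num_speakers:
            raise ValueError("diarization_threshold cannot be combined with num_speakers > 0")
        fields["diarization_threshold"] = str(diarization_threshold)
    return fields
```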

Response `200`¶

{
  "provider": "elevenlabs",
  "language_code": "en",
  "language_probability": 0.99,
  "input": {
    "filename": "sample.wav",
    "mime_type": "audio/wav",
    "duration_ms": 3000
  },
  "snapshot": {
    "session_id": "elevenlabs-batch",
    "revision": 1,
    "text": "doctor hello patient yes",
    "words": [
      {
        "start_ms": 0,
        "end_ms": 500,
        "text": "doctor",
        "speaker_id": "speaker_0"
      }
    ],
    "segments": [],
    "finalized": true,
    "consistency": "FINAL",
    "updated_at": "2026-04-28T10:00:00Z"
  }
}

Field	Type	Notes
`provider`	string	Provider name, currently `elevenlabs`
`language_code`	string	Provider-detected language code, omitted when absent
`language_probability`	number	Provider confidence, omitted when zero
`input`	object	Uploaded file metadata and duration inferred from word end times
`snapshot`	object	Final transcript snapshot with words and segments

Errors¶

Status	Cause
`400`	Invalid content type, invalid multipart body, missing file, invalid number, invalid threshold, or invalid speaker/threshold combination
`405`	Method other than `POST`
`502`	ElevenLabs upstream request failed
`503`	ElevenLabs transcription is unavailable

Example¶

curl -X POST http://localhost:8080/v1/transcriptions/elevenlabs \
  -F file=@sample.wav \
  -F language_hint=en \
  -F num_speakers=2

Integration live routes¶

The /api/v1/transcribe-live/* routes expose only the subset needed by an upstream application embedding live transcription.

Every route requires a valid live token.

`GET /api/v1/transcribe-live/sessions/{session_id}/audio/ws`¶

Protected version of GET /v1/sessions/{session_id}/audio/ws.

The token may be passed as a query parameter because the browser WebSocket constructor cannot set an Authorization header:

ws://localhost:8080/api/v1/transcribe-live/sessions/18b9617a48fef8072bfc2c38/audio/ws?token=<token>

`GET /api/v1/transcribe-live/sessions/{session_id}/events`¶

Protected version of GET /v1/sessions/{session_id}/events.

`POST /api/v1/transcribe-live/sessions/{session_id}/stop`¶

Protected version of POST /v1/sessions/{session_id}/stop.

Integration route errors¶

Status	Cause
`401`	Missing token, invalid token, expired token, or claim mismatch
`404`	Session or route not found

Schemas¶

ASRWindowConfig¶

{
  "pre_roll_ms": 700,
  "post_roll_ms": 700,
  "min_commit_ms": 4000,
  "target_commit_ms": 10000,
  "max_commit_ms": 15000,
  "merge_gap_ms": 1800,
  "min_speech_ms": 2500,
  "min_isolated_ms": 400,
  "commit_tolerance_ms": 200
}

Field	Type	Default	Description
`pre_roll_ms`	int64	`700`	Audio included before the committed speech span
`post_roll_ms`	int64	`700`	Audio included after the committed speech span
`min_commit_ms`	int64	`4000`	Minimum span before a gap can trigger an ASR window
`target_commit_ms`	int64	`10000`	Target span that triggers a window
`max_commit_ms`	int64	`15000`	Hard maximum span that forces a window
`merge_gap_ms`	int64	`1800`	Maximum gap between detected speech segments before the planner separates windows
`min_speech_ms`	int64	`2500`	Minimum total speech required before gap-based commit
`min_isolated_ms`	int64	`400`	Minimum isolated speech span to emit any ASR window
`commit_tolerance_ms`	int64	`200`	Boundary tolerance for committed spans

TranscriptSnapshot¶

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "revision": 4,
  "text": "hello patient",
  "words": [],
  "segments": [],
  "finalized": false,
  "consistency": "PARTIAL",
  "comparison": {
    "chunk_text": "hello patient",
    "final_pass_text": "hello patient",
    "chunk_word_count": 2,
    "final_pass_word_count": 2,
    "similarity": 1,
    "target": 0.99,
    "meets_target": true
  },
  "updated_at": "2026-04-28T10:00:00Z"
}

Field	Type	Description
`session_id`	string	Session ID
`revision`	int64	Monotonic snapshot revision within a session
`text`	string	Visible transcript text for the requested consistency
`words`	array of `Word`	Visible words
`segments`	array of `TranscriptSegment`	Visible segments grouped by state, speaker, and timing
`finalized`	boolean	True after stop/final pass has sealed the transcript
`consistency`	enum	`PARTIAL`, `STABLE`, or `FINAL`
`comparison`	`Comparison`	Optional comparison block after final/chunk pass diagnostics
`updated_at`	timestamp	Snapshot creation/update time

Word¶

{
  "start_ms": 0,
  "end_ms": 500,
  "text": "hello",
  "speaker_id": "speaker_0",
  "speaker_name": "Doctor",
  "speaker_confidence": 0.91
}

Field	Type	Required	Description
`start_ms`	int64	Yes	Word start offset
`end_ms`	int64	Yes	Word end offset
`text`	string	Yes	Word/token text
`speaker_id`	string	No	Provider or local speaker ID
`speaker_name`	string	No	Matched saved speaker name or anonymous label
`speaker_confidence`	number	No	Speaker match confidence

TranscriptSegment¶

{
  "segment_id": "18b9617a48fef8072bfc2c38-4-1",
  "session_id": "18b9617a48fef8072bfc2c38",
  "window_id": "asr-window-1",
  "revision": 4,
  "provider": "elevenlabs",
  "audio_start_ms": 0,
  "audio_end_ms": 1200,
  "text": "hello patient",
  "state": "STABLE",
  "speaker_id": "speaker_0",
  "speaker_name": "Doctor",
  "speaker_confidence": 0.91
}

Field	Type	Required	Description
`segment_id`	string	Yes	Generated ID from session, revision, and segment ordinal
`session_id`	string	Yes	Session ID
`window_id`	string	Yes	ASR window or synthetic final-pass ID
`revision`	int64	Yes	Snapshot revision
`provider`	string	Yes	Provider that produced the words
`audio_start_ms`	int64	Yes	Segment start
`audio_end_ms`	int64	Yes	Segment end
`text`	string	Yes	Segment text
`state`	enum	Yes	`PARTIAL`, `STABLE`, or `FINAL`
`speaker_id`	string	No	Provider or local speaker ID
`speaker_name`	string	No	Speaker display name
`speaker_confidence`	number	No	Speaker match confidence

Comparison¶

{
  "chunk_text": "hello patient",
  "final_pass_text": "hello patient",
  "chunk_word_count": 2,
  "final_pass_word_count": 2,
  "similarity": 1,
  "target": 0.99,
  "meets_target": true
}

Field	Type	Description
`chunk_text`	string	Text from live chunk pipeline
`final_pass_text`	string	Text from final/full pass
`chunk_word_count`	integer	Normalized word count for chunk text
`final_pass_word_count`	integer	Normalized word count for final text
`similarity`	number	Word-level Levenshtein similarity from `0` to `1`
`target`	number	Target similarity, currently `0.99` by default
`meets_target`	boolean	True when `similarity >= target`
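One common way to compute a word-level Levenshtein similarity is sketched below. It is an illustration of the idea, not the service's exact formula: lowercasing and whitespace splitting stand in for the unspecified normalization, and dividing the edit distance by the longer word count is an assumption.

```python
def word_similarity(chunk_text: str, final_pass_text: str) -> float:
    """Return 1 - (word edit distance / longer word count), in the range 0 to 1."""
    a = chunk_text.lower().split()
    b = final_pass_text.lower().split()
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, computed one row at a time.
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return 1 - previous[-1] / max(len(a), len(b))


def meets_target(similarity: float, target: float = 0.99) -> bool:
    return similarity >= target
```

With this definition, one substituted word in a two-word transcript gives a similarity of 0.5, well below the default target.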

InspectEvent¶

{
  "seq": 12,
  "time": "2026-04-28T10:00:00Z",
  "session_id": "18b9617a48fef8072bfc2c38",
  "type": "chunk_pass_window_completed",
  "lane": "live",
  "window_id": "asr-window-1",
  "state": "completed",
  "reason": "asr_target",
  "start_sample": 0,
  "end_sample": 160000,
  "start_ms": 0,
  "end_ms": 10000,
  "bytes": 320000,
  "text_len": 42,
  "word_count": 8,
  "preview": "hello patient",
  "note": "windows=1 similarity=1.00"
}

Field	Type	Description
`seq`	int64	Monotonic inspect event sequence
`time`	timestamp	Event timestamp
`session_id`	string	Session ID
`type`	string	Event type, such as `session_started`, `pcm_ingest_started`, `full_pass_completed`, or `chunk_pass_window_completed`
`lane`	string	Optional lane, commonly `live` or `final`
`window_id`	string	Optional ASR/final window ID
`state`	string	Optional lifecycle state
`reason`	string	Optional planner or boundary reason
`start_sample`	int64	Optional start sample
`end_sample`	int64	Optional end sample
`start_ms`	int64	Optional start time
`end_ms`	int64	Optional end time
`bytes`	int64	Optional byte count
`text_len`	integer	Optional transcript text length
`word_count`	integer	Optional word count
`preview`	string	Optional text preview
`note`	string	Optional human-readable note

LogEntry¶

{
  "time": "2026-04-28T10:00:00Z",
  "level": "INFO",
  "message": "session initialized",
  "attrs": {
    "session_id": "18b9617a48fef8072bfc2c38"
  }
}

Field	Type	Description
`time`	timestamp	Log record timestamp
`level`	string	Go `slog` level name, such as `INFO`
`message`	string	Log message
`attrs`	object	Optional structured attributes

SpeakerProfile¶

{
  "id": "speaker-123",
  "name": "Doctor",
  "created_at": "2026-04-28T10:00:00Z",
  "updated_at": "2026-04-28T10:00:00Z",
  "samples": [],
  "centroid": [0.01, -0.02]
}

Field	Type	Description
`id`	string	Profile ID
`name`	string	Display name
`created_at`	timestamp	Profile creation time
`updated_at`	timestamp	Last update time
`samples`	array of `SpeakerSample`	Enrollment samples
`centroid`	number array	Current embedding centroid

SpeakerSample¶

{
  "session_id": "18b9617a48fef8072bfc2c38",
  "start_ms": 1000,
  "end_ms": 5000,
  "created_at": "2026-04-28T10:00:00Z",
  "embedding": [0.01, -0.02]
}

Field	Type	Description
`session_id`	string	Source session ID
`start_ms`	int64	Enrollment clip start
`end_ms`	int64	Enrollment clip end
`created_at`	timestamp	Sample creation time
`embedding`	number array	Raw speaker embedding

End-to-end live flow¶

1. Create a session.

SESSION_ID=$(
  curl -s -X POST http://localhost:8080/v1/sessions \
    -H 'Content-Type: application/json' \
    -d '{"language_hint":"en"}' \
  | jq -r .session_id
)

2. Open /v1/sessions/{session_id}/events to receive live transcript snapshots.
3. Open /v1/sessions/{session_id}/audio/ws.
4. Send the required JSON start frame.
5. Stream binary PCM16 little-endian frames.
6. Close the WebSocket when capture ends.
7. Call POST /v1/sessions/{session_id}/stop.
8. Fetch GET /v1/sessions/{session_id}/transcript?consistency=FINAL.
9. Optionally call POST /v1/sessions/{session_id}/speaker-identification.
10. Optionally download GET /v1/sessions/{session_id}/recording.
