Voice·October 11, 2025·9 min read

We Put Gemini Live on a Phone Line. Here's Everything That Broke.

A first-person war story of bridging Exotel telephony to the consumer Gemini Live API — the wire-protocol traps, token bugs, and Cloud Run quirks, in the order they bit.

By Matrix Team

The pitch was simple: pick up a phone, talk to an AI agent, and have it talk back in real time — full-duplex, barge-in and all. The implementation was a two-week archaeology dig through a single, lying error string.

This is the war story of Gemini Live telephony in Matrix: bridging Exotel's voicebot media stream to the consumer Gemini Live API, one CallSession per call. Every bug below cost real time. They're listed in roughly the order they bit, so you can pay for them once instead of four times.

The shape of the thing

One CallSession bridges Exotel ↔ Gemini Live for the lifetime of a call. Exotel opens a WebSocket to us and streams the caller's audio in; we hold a second WebSocket out to Gemini Live; audio flows both directions and tool calls get dispatched in the middle. The moment Exotel's start frame arrives, the session adopts its Session row by callSid and derives direction, campaign, and per-call objective from it.

Conceptually clean. The trouble is that both ends of that bridge have opinions about wire format, auth, and transport that are not in any doc you'll find by searching.

We run gemini-3.1-flash-live-preview on the consumer endpoint (generativelanguage.googleapis.com), not Vertex Live — it's newer than anything Vertex exposes today. That choice is the root of half the gotchas, because the consumer Live API's auth model is built for browsers and SDKs, not for a long-lived server bridge.

Bug 1: every error said the same wrong thing

The first thing to internalize is Gemini Live's catch-all close reason:

Method doesn't allow unregistered callers (callers without established identity). Please use API Key or other form of API credentials

It is reported for at least four distinct bugs, and most of them have nothing to do with credentials. Wrong model name, wrong endpoint variant, wrong field casing, missing auth header: all the same string. When an API error is this generic, stop trusting it and instrument the wire.

The single most useful debugging move we made was reproducing the exact flow in a 30-line Python script using the official google-genai SDK, monkey-patching websockets.asyncio.client.ClientConnection.send/recv to print every frame, and diffing the SDK's wire output against ours. Casing and endpoint mismatches that were invisible in the logs showed up in one line. Two corollaries that paid off repeatedly: if the SDK works and your client doesn't, the gap is in your request shape; and read the SDK source, not just the docs — the auth_tokens flow, the constrained-vs-plain endpoint switch, and the header-vs-query auth choice are all spelled out in the google-genai Python source and nowhere in the public docs as of this writing.
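For illustration, the wire-tap can be sketched as a small class patcher. This is a minimal sketch, not our actual script: tap_frames and its log parameter are hypothetical names, and it assumes only that the target class exposes async send and recv methods.

```python
import asyncio
import functools


def tap_frames(cls, log=print):
    """Wrap cls.send and cls.recv (both async) so every frame gets logged."""
    original_send, original_recv = cls.send, cls.recv

    @functools.wraps(original_send)
    async def send(self, message, *args, **kwargs):
        log(f">>> {message!r}")  # outbound frame, before it hits the socket
        return await original_send(self, message, *args, **kwargs)

    @functools.wraps(original_recv)
    async def recv(self, *args, **kwargs):
        message = await original_recv(self, *args, **kwargs)
        log(f"<<< {message!r}")  # inbound frame, exactly as received
        return message

    cls.send, cls.recv = send, recv
    return cls
```

In the real script the class to patch would be the websockets.asyncio.client.ClientConnection named above, patched before the SDK opens its connection. Run the same flow through your own client with the same logging and diff the two transcripts.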

Bug 2: the wrong WebSocket variant

Gemini Live exposes two RPC methods over WebSocket: BidiGenerateContent and BidiGenerateContentConstrained. We started on the obvious-looking one, plain BidiGenerateContent. It failed: the gateway wanted an Authorization: Token <name> HTTP header on the upgrade.

Here's the catch: a browser can't set custom headers on a WebSocket upgrade. new WebSocket(url, [protocols]) is the entire surface — only the URL and Sec-WebSocket-Protocol are yours. And while a server client can set headers, we wanted one code path that works for both the telephony bridge and the browser-direct /voice page.

The breakthrough: BidiGenerateContentConstrained accepts ?access_token=auth_tokens/… as a query param and requires no header. It's the browser-friendly variant, and it's the one to use.

wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1alpha.GenerativeService.BidiGenerateContentConstrained?access_token=auth_tokens/<id>

One trap inside the trap: don't encodeURIComponent the token value. Google's gateway sometimes treats an encoded slash as a path component before decoding. The / in auth_tokens/<id> must pass through literally.
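A minimal sketch of building that URL without letting a helper encode the slash. The function name live_ws_url is ours for illustration; the endpoint string is the one shown above.

```python
LIVE_WS_BASE = (
    "wss://generativelanguage.googleapis.com/ws/"
    "google.ai.generativelanguage.v1alpha.GenerativeService."
    "BidiGenerateContentConstrained"
)


def live_ws_url(token_name: str) -> str:
    """Build the Live URL with the token's slash left unencoded."""
    if not token_name.startswith("auth_tokens/"):
        raise ValueError("expected a token name of the form auth_tokens/<id>")
    # Plain concatenation on purpose: urllib.parse.urlencode() or
    # quote(..., safe="") would turn the "/" into %2F.
    return f"{LIVE_WS_BASE}?access_token={token_name}"
```

The same caution applies in any language: most query-string builders encode by default, so the token is the one value worth appending by hand.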

Bug 3: the tempting token field that breaks everything

The auth_tokens minting API accepts a bidiGenerateContentSetup field that pre-binds the model, voice, and tools to the token. Fewer round trips, the client doesn't send a setup — tempting. We tried it.

It quietly changes the RPC method the server expects and pushes you back onto the header-based auth path that browsers can't satisfy. So: never bake bidiGenerateContentSetup into the ephemeral token. Mint a plain token, and send the setup message yourself at ws.onopen:

{ "setup": { "model": "models/gemini-3.1-flash-live-preview",
             "generationConfig": { "responseModalities": ["AUDIO"] } } }

A nice side effect: with setup sent from the client, you can A/B different voices and prompts without re-minting a token. setupComplete comes back, and you're connected.
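The client-side handshake can be sketched as two small helpers: one that serializes the setup envelope shown above, and one that recognizes the setupComplete reply. The helper names are hypothetical, and the sketch assumes the reply is a JSON object delivered as either a text or a binary frame.

```python
import json

DEFAULT_MODEL = "models/gemini-3.1-flash-live-preview"


def build_setup(model: str = DEFAULT_MODEL) -> str:
    """Serialize the camelCase setup envelope sent right after the socket opens."""
    return json.dumps({
        "setup": {
            "model": model,
            "generationConfig": {"responseModalities": ["AUDIO"]},
        }
    })


def is_setup_complete(frame) -> bool:
    """True if a received frame (str or bytes) is the setupComplete message."""
    message = json.loads(frame)
    return isinstance(message, dict) and "setupComplete" in message
```

Hold back audio until is_setup_complete returns True for a received frame.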

Bug 4: the cruelest one — snake_case

setupComplete came back. We sent audio. The socket closed with 1008 and that same lying error string. We chased the auth angle for a long time before the SDK wire-tap made it obvious.

The setup envelope is camelCase. So are the responses (setupComplete, serverContent, toolCall). But the audio input message — realtime_input — is snake_case in v1alpha:

{ "realtime_input": { "audio": { "mime_type": "audio/pcm;rate=16000",
                                 "data": "<base64>" } } }

mime_type, not mimeType. realtime_input, not realtimeInput. Send camelCase and Gemini closes the WS with a misleading 1008 on the first audio frame — not at setup, where you'd think to look. The older realtime_input.mediaChunks: [...] array form is also rejected outright with the same close; the current shape is singular audio, not a chunk array.
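As a sketch, here is a frame builder that bakes in the snake_case keys so no caller can get the casing wrong. The name audio_frame is illustrative; the message shape is the one shown above.

```python
import base64
import json


def audio_frame(pcm16: bytes, rate: int = 16000) -> str:
    """Wrap raw PCM bytes in the snake_case realtime_input envelope."""
    return json.dumps({
        "realtime_input": {
            "audio": {  # singular audio object, not a mediaChunks array
                "mime_type": f"audio/pcm;rate={rate}",
                "data": base64.b64encode(pcm16).decode("ascii"),
            }
        }
    })
```

Keeping the envelope in one function also gives you one place to change if a later API version renames the fields.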

Bug 5: "Token has been used too many times"

Browser-direct voice mints a token with uses: 1 per session and that's the end of it — one token, one session, done.

The telephony bridge is different. It's a long-lived server process, and the naive uses: 1 token started throwing 1011 Token has been used too many times mid-call when reused server-side. The fix was a shared token minted with uses: 10000, refreshed hourly by a @Scheduled task (GeminiTokenService.refreshSharedToken). The bridge draws from that shared token instead of minting per-session.
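A rough Python analogue of the shared-token idea, for readers not on Spring. It differs from our setup in one respect: instead of a scheduler, it re-mints lazily when a token older than an hour is read. SharedToken and the mint callable are hypothetical; mint stands in for whatever calls the auth_tokens API with uses: 10000.

```python
import threading
import time


class SharedToken:
    """Hand out one high-uses token and re-mint it once it is too old."""

    def __init__(self, mint, max_age_s: float = 3600.0, clock=time.monotonic):
        self._mint = mint            # hypothetical: returns "auth_tokens/<id>"
        self._max_age_s = max_age_s
        self._clock = clock
        self._lock = threading.Lock()
        self._token = None
        self._minted_at = 0.0

    def refresh(self) -> str:
        """Mint a new token and make it the shared one."""
        token = self._mint()
        with self._lock:
            self._token, self._minted_at = token, self._clock()
        return token

    def get(self) -> str:
        """Return the current token, minting first if missing or stale."""
        with self._lock:
            age = self._clock() - self._minted_at
            if self._token is not None and age < self._max_age_s:
                return self._token
        return self.refresh()
```

Two threads that find the token stale at the same instant will both mint. That is harmless here, but a scheduled refresh like ours avoids it.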

Worth a separate note: the token's newSessionExpireTime defaults to 60 seconds. On the browser path, a caller who opens the agent picker and browses for a minute before clicking Start finds the session-init window already closed — the WS opens, instantly closes, "unregistered callers" again. We widened that window to 600 seconds. Same lying error, a fifth distinct cause.

Bug 6: Cloud Run silently kills the WebSocket upgrade

Locally, everything worked. In production on Cloud Run, the outbound WebSocket to Gemini refused to establish.

Cloud Run's frontend (GFE) serves WebSockets over HTTP/1.1 only. But GFE negotiates HTTP/2 via ALPN with any client that offers it during the TLS handshake, then rejects the WS upgrade, because HTTP/2 has no equivalent of the HTTP/1.1 Upgrade handshake the client is about to send. The Java WS client offered h2, GFE took it, and the upgrade died.

The fix is to force http/1.1 in the WS client's ALPN list so HTTP/2 never gets negotiated in the first place. One line, a day to find.
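Our client is Java, so what follows is only an analogue of that one line: a Python sketch of a TLS context that advertises http/1.1 and nothing else. The function name is ours; ssl.SSLContext.set_alpn_protocols is the standard-library call.

```python
import ssl


def http1_only_tls_context() -> ssl.SSLContext:
    """TLS context whose ALPN offer contains only http/1.1, never h2."""
    context = ssl.create_default_context()    # keeps certificate verification on
    context.set_alpn_protocols(["http/1.1"])  # the server cannot pick h2
    return context
```

A WebSocket client that accepts a custom TLS context (for example the ssl= argument of websockets.connect) can then be handed this one. In Java the equivalent knob lives on the client's SSL or ALPN configuration, and its exact location depends on which WebSocket library you use.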

Bug 7: the JWT in the URL that Exotel refused

Now the other end of the bridge. The original design carried a JWT in the WebSocket URL path so the inbound Exotel stream could authenticate itself. Exotel silently refused to connect — no error, just no stream. The URL was too long.

The fix moved the token out of the path entirely. The voicebot WS URL is now ~130 characters: the token lives in a warm-session registry keyed on the call, and the handshake looks it up rather than parsing it from the URL. Short URL, happy Exotel.

There's a sharper lesson buried here, which we learned the hard way on a real outbound campaign call that opened with the inbound greeting and never reported its disposition back to the campaign: never trust the bootstrap claims for per-call identity. The voicebot URL Exotel calls is static (configured once in the dashboard), so it carries no per-call context, and Exotel strips query strings on the upgrade. The handshake therefore can't tell inbound from outbound on its own. Falling back to a default direction meant outbound calls greeted callers as if they'd dialed in, a duplicate inbound Session row swallowed the transcript, and the campaign's objective was never injected.

The fix: the Session row, keyed by callSid, is the single source of truth. onExotelStart adopts that row and reads direction, agent, campaign, and per-call objective from it — never from the handshake claims. A fresh inbound row is created only when no row exists (a genuine inbound call with no preceding outbound dial). The per-call objective then lands in a dedicated "Your objective on THIS call" block in the Gemini setup.
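The adopt-or-create rule is small enough to sketch. Everything below is illustrative: the Session fields mirror the ones named above, and a plain dict stands in for the database table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Session:
    call_sid: str
    direction: str                   # "inbound" or "outbound"
    agent: Optional[str] = None
    campaign: Optional[str] = None
    objective: Optional[str] = None


def adopt_session(rows: Dict[str, Session], call_sid: str) -> Session:
    """Return the existing row for call_sid, creating an inbound one only if absent."""
    row = rows.get(call_sid)
    if row is None:
        # No preceding outbound dial wrote a row, so this is a genuine inbound call.
        row = Session(call_sid=call_sid, direction="inbound")
        rows[call_sid] = row
    return row
```

The point of the shape is that direction is only ever read from the row. The handshake contributes the callSid and nothing else.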

The audio, briefly

The wire is only half the battle — the audio pipeline is its own saga (48 kHz mic capture downsampled to 16 kHz in, 24 kHz playback out; the byte math on Exotel's outbound chunks — ≥3.2 kB, exact multiples of 320 bytes — bites if you divide by samples instead of bytes). And barge-in — stopping the model the instant the caller talks — has its own set of real-time wire-protocol gotchas worth their own post. We'll point you there rather than re-fight those battles here.
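To make the byte math concrete, here is a sketch of a chunker that works strictly in bytes. It reads 3.2 kB as 3200 bytes (ten 320-byte frames), which is our assumption, and split_for_exotel is a hypothetical name.

```python
from typing import List, Tuple

FRAME_BYTES = 320        # chunk sizes must be exact multiples of this
MIN_CHUNK_BYTES = 3200   # assumed reading of "3.2 kB": 10 * 320 bytes


def split_for_exotel(
    buffer: bytes, chunk_bytes: int = MIN_CHUNK_BYTES
) -> Tuple[List[bytes], bytes]:
    """Split buffer into full chunks; return (chunks, leftover bytes to keep)."""
    if chunk_bytes < MIN_CHUNK_BYTES or chunk_bytes % FRAME_BYTES:
        raise ValueError("chunk size must be >= 3200 and a multiple of 320 bytes")
    usable = len(buffer) - len(buffer) % chunk_bytes
    chunks = [buffer[i:i + chunk_bytes] for i in range(0, usable, chunk_bytes)]
    return chunks, buffer[usable:]  # carry the remainder into the next call
```

With 16-bit PCM, 3200 bytes is only 1600 samples, which is exactly the factor-of-two slip you get when you count samples instead of bytes.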

Verify the model before you assume it

One more that wasted a morning early on: an unavailable model name produces no error frame at all. The server accepts the upgrade, accepts the setup, then closes silently — 1006 in the browser, a recv timeout in Python. Before assuming any model name works on your key, enumerate them:

GET https://generativelanguage.googleapis.com/v1beta/models?key=<API_KEY>

gemini-3.1-flash-live-preview is our current default and confirmed live. A model name you remember from older docs or examples may not be enabled on your key.
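A small sketch of that check. It assumes the listing response is a JSON object with a models array whose entries carry a name like models/<id>, and it reads only the first page of results. The helper names are ours.

```python
import json
import urllib.request

MODELS_URL = "https://generativelanguage.googleapis.com/v1beta/models"


def fetch_listing(api_key: str) -> dict:
    """GET the model listing for this key (first page only)."""
    with urllib.request.urlopen(f"{MODELS_URL}?key={api_key}", timeout=10) as resp:
        return json.load(resp)


def is_available(listing: dict, model: str) -> bool:
    """True if model (with or without the models/ prefix) appears in the listing."""
    wanted = model if model.startswith("models/") else f"models/{model}"
    return any(entry.get("name") == wanted for entry in listing.get("models", []))
```

Usage would be is_available(fetch_listing(key), "gemini-3.1-flash-live-preview"), run once at startup so a missing model fails loudly instead of as a silent close.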

Lessons

If you're putting Gemini Live behind a phone line, internalize these before you write a frame:

  • Use BidiGenerateContentConstrained with ?access_token=auth_tokens/…. The plain BidiGenerateContent wants a header neither browsers nor your sanity can spare.
  • Mint plain tokens. Never bake in bidiGenerateContentSetup. It changes the RPC method and breaks query auth. Send setup from the client.
  • realtime_input is snake_case in v1alpha. mime_type, singular audio. CamelCase dies on the first audio frame with a misleading 1008.
  • Server-side, share one high-uses token and refresh it. Per-session uses: 1 tokens trigger 1011 ... used too many times when reused.
  • On Cloud Run, force http/1.1 in the WS client ALPN list, or GFE negotiates h2 and rejects the upgrade.
  • Keep the WS URL short and stateless. Exotel refuses long URLs; put the token in a registry and the per-call truth in the Session row, not the bootstrap claims.
  • The catch-all auth error lies. When the message is too generic to be useful, wire-tap the official SDK and diff frames.

Every one of these is catalogued, with the full debugging journey and the why behind each, in docs/LEARNINGS.md — read it before you touch the Live integration, because most "obvious" fixes have already been tried and rejected.

Ship voice without re-fighting this

Matrix runs this bridge in production today — inbound and outbound calls, barge-in, recording, the lot — so you don't have to spend two weeks decoding a lying error string. Create a workspace, point an agent at a phone number, and talk to it. The wire protocol is already solved; you get to work on what your agent actually says.

