Gemini CLI is Google’s official agentic CLI for Gemini models. It speaks the native Gemini API (`/v1beta/models/{model}:generateContent` and `:streamGenerateContent`) and uses Google’s own request and response shapes, distinct from the OpenAI Chat Completions or Responses APIs.
The LangWatch AI Gateway exposes a Gemini-native passthrough at `/v1beta/*` that forwards requests verbatim to `generativelanguage.googleapis.com`, with the standard auth + tracing + budget + cache + guardrail middleware applied. Gemini CLI therefore works against the gateway with a single env-var change.
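For orientation, here is the difference in request shape, written as Python dicts (a minimal sketch; the prompt text and `generationConfig` values are illustrative):

```python
# Native Gemini shape: what Gemini CLI sends and what /v1beta/* forwards.
# The model name lives in the URL path, not the body.
gemini_request = {
    "contents": [
        {"role": "user", "parts": [{"text": "Explain this stack trace"}]}
    ],
    "generationConfig": {"temperature": 0.2},
}

# OpenAI Chat Completions shape, for contrast. The two are not
# interchangeable, which is why the gateway needs a dedicated passthrough.
openai_request = {
    "model": "gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Explain this stack trace"}],
    "temperature": 0.2,
}
```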
## Setup
`GOOGLE_GEMINI_BASE_URL` is the documented endpoint override for Gemini CLI’s `gemini-api-key` auth mode; point it at the gateway and set `GEMINI_API_KEY` to your virtual key (VK) secret. The gateway accepts the request at `POST /v1beta/models/{model}:generateContent`, resolves the VK, dispatches through Bifrost’s Gemini adapter (which injects the real `x-goog-api-key` from the bound credential), and streams the native Gemini response shape back unchanged.
Gemini CLI sends the VK secret as `x-goog-api-key` (its native auth header). The gateway’s auth middleware accepts that header, an `Authorization: Bearer …` header, or an `x-api-key` header, whichever your client sends. All three paths resolve to the same VK.
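To sanity-check the wiring, here is a minimal `requests` sketch of the call Gemini CLI ends up making. It assumes `GOOGLE_GEMINI_BASE_URL` holds the gateway origin (without the `/v1beta` suffix) and `GEMINI_API_KEY` holds the VK secret:

```python
import os
import requests

base = os.environ["GOOGLE_GEMINI_BASE_URL"]  # gateway origin, assumed without /v1beta
vk = os.environ["GEMINI_API_KEY"]            # the LangWatch VK secret
url = f"{base}/v1beta/models/gemini-2.5-flash:generateContent"
body = {"contents": [{"role": "user", "parts": [{"text": "ping"}]}]}

# All three auth headers resolve to the same VK:
for headers in (
    {"x-goog-api-key": vk},             # Gemini CLI's native header
    {"Authorization": f"Bearer {vk}"},  # bearer-style
    {"x-api-key": vk},                  # generic API-key header
):
    r = requests.post(url, headers=headers, json=body, timeout=60)
    print(list(headers)[0], r.status_code)
```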
### Self-hosted gateway

Replace the hostname; the `/v1beta/*` route is the same on LangWatch cloud and self-hosted deployments.
## What the gateway forwards
Every `/v1beta/*` path hits a single passthrough handler. `Bifrost.Passthrough` strips `/v1beta` from the incoming request, prepends Gemini’s API base URL (`https://generativelanguage.googleapis.com/v1beta`), and forwards body + query + selected headers verbatim. Specifically:
- `POST /v1beta/models/{model}:generateContent` — synchronous generation
- `POST /v1beta/models/{model}:streamGenerateContent` (with `?alt=sse`) — streaming generation. SSE chunks pass through byte-for-byte (Google already emits proper `event:`/`data:` framing; the gateway does not re-wrap)
- `POST /v1beta/cachedContents` — explicit Gemini context cache creation
- `POST /v1beta/models/{model}:countTokens` / `:embedContent` / `:batchEmbedContents` — token counting and embeddings
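As a sketch of the streaming path, the following consumes `:streamGenerateContent` through the gateway with `requests`; the SSE framing it parses (`data: {json}` lines) is Google’s documented format, forwarded untouched:

```python
import json
import os
import requests

base = os.environ["GOOGLE_GEMINI_BASE_URL"]
resp = requests.post(
    f"{base}/v1beta/models/gemini-2.5-flash:streamGenerateContent",
    params={"alt": "sse"},
    headers={"x-goog-api-key": os.environ["GEMINI_API_KEY"]},
    json={"contents": [{"role": "user", "parts": [{"text": "Count to five"}]}]},
    stream=True,
    timeout=60,
)
resp.raise_for_status()
for line in resp.iter_lines():
    # Each SSE event arrives as a "data: {...}" line; blank lines separate events.
    if line.startswith(b"data: "):
        chunk = json.loads(line[len(b"data: "):])
        for cand in chunk.get("candidates", []):
            for part in cand.get("content", {}).get("parts", []):
                print(part.get("text", ""), end="", flush=True)
print()
```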
## Verifying the traffic flows
After one `gemini` session:
- Open LangWatch → Messages for the project the VK belongs to.
- You should see one trace per generation request (Gemini CLI typically makes one call per turn unless you’ve enabled multi-step planning).
- Span attributes include `langwatch.virtual_key_id`, `gen_ai.usage.*`, the full request body, and the cost computed from Gemini’s pricing catalog.
To isolate Gemini-CLI-specific issues, call the gateway directly with the same `GEMINI_API_KEY`:
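For example, a minimal direct call (same assumptions as above: `GOOGLE_GEMINI_BASE_URL` is the gateway origin, `GEMINI_API_KEY` the VK secret):

```python
import os
import requests

base = os.environ["GOOGLE_GEMINI_BASE_URL"]
resp = requests.post(
    f"{base}/v1beta/models/gemini-2.5-flash:generateContent",
    headers={"x-goog-api-key": os.environ["GEMINI_API_KEY"]},
    json={"contents": [{"role": "user", "parts": [{"text": "Say OK"}]}]},
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body["candidates"][0]["content"]["parts"][0]["text"])
print(body["usageMetadata"])  # promptTokenCount, candidatesTokenCount, ...
```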
A healthy response contains `candidates[].content.parts[].text` plus `usageMetadata.{promptTokenCount, candidatesTokenCount, cachedContentTokenCount}`.
## Prompt caching with `cachedContents`
Gemini’s cache primitive is opt-in: create a cached content blob via `POST /v1beta/cachedContents`, then reference it by name on subsequent generation calls via the `cachedContent` field. Both calls go through the gateway:
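A sketch of the two-call flow; the request and response fields (`model`, `ttl`, `name`, `cachedContent`) follow Google’s `cachedContents` API, while `LONG_PREFIX` stands in for your shared >1024-token prefix:

```python
import os
import requests

base = os.environ["GOOGLE_GEMINI_BASE_URL"]
headers = {"x-goog-api-key": os.environ["GEMINI_API_KEY"]}
LONG_PREFIX = "..."  # your shared >1024-token prefix

# 1. Create the cache entry; the response carries its resource name.
cache = requests.post(
    f"{base}/v1beta/cachedContents",
    headers=headers,
    json={
        "model": "models/gemini-2.5-flash",
        "contents": [{"role": "user", "parts": [{"text": LONG_PREFIX}]}],
        "ttl": "3600s",
    },
    timeout=60,
).json()  # e.g. {"name": "cachedContents/abc123", ...}

# 2. Reference the cache on generation calls; cached tokens surface as
#    usageMetadata.cachedContentTokenCount on the trace.
resp = requests.post(
    f"{base}/v1beta/models/gemini-2.5-flash:generateContent",
    headers=headers,
    json={
        "cachedContent": cache["name"],
        "contents": [{"role": "user", "parts": [{"text": "First question"}]}],
    },
    timeout=60,
)
print(resp.json()["usageMetadata"])
```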
The explicit `cachedContents` flow above works on every tier and is Google’s recommended pattern for >1024-token prefixes.
## Limits and caveats
- **Bring a Bifrost-recognised model name.** `gemini-2.5-flash`, `gemini-2.5-pro`, and `gemini-2.0-flash` are recognised today. Aliases like `gemini-flash` (no version suffix) return `400` because the Gemini adapter can’t resolve the upstream model. Use Gemini CLI’s `--model` flag or your VK’s `model_aliases` to pin to a concrete name.
- **Trace-level usage extraction is partial on streaming.** The gateway successfully forwards the Gemini SSE stream, but the trace pipeline’s token-count parser for Gemini-native streams is still being completed — expect `prompt_tokens` to show on traces, but `completion_tokens` may land null until the parser ships. Cost is still captured accurately.
- **Vertex SA auth is per-environment.** The gateway’s Vertex VK uses a Google service-account JSON key on the gateway side. Engineers running Gemini CLI with a `GEMINI_API_KEY` (which is a LangWatch VK) don’t need their own SA — the gateway’s bound Vertex credential handles upstream auth.
- **`ListModels` is not yet routed natively.** Gemini CLI’s model-discovery pings to `/v1beta/models` (no model suffix) currently fall through to the OpenAI-style `/v1/models` handler. List your available models in the LangWatch UI’s VK detail page until this is fixed.
## Governance recipes
### Per-engineer monthly budget
- Scope: `principal`; target: the engineer’s user id.
- Window: `month`; limit: $50; on breach: `block`.
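As an illustrative sketch only, the recipe expressed as a config object. The field names mirror the bullets above, not a confirmed LangWatch schema, so check your deployment’s budget API before copying:

```python
# Hypothetical shape: keys taken from the recipe above, not from a
# documented LangWatch budgets schema.
per_engineer_budget = {
    "scope": "principal",
    "target": "user-id-of-the-engineer",  # placeholder value
    "window": "month",
    "limit": 50,          # USD
    "on_breach": "block",
}
```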
### Restrict to specific Gemini models
Set the VK’s `models_allowed` to `["gemini-2.5-flash", "gemini-2.5-pro"]`. A `gemini --model gemini-1.0-pro` call then returns `403 model_not_allowed` before it ever leaves the gateway.
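From the client side that rejection looks like this (a sketch; the 403 status and `model_not_allowed` code come from the paragraph above, and the exact error-body shape is not shown here):

```python
import os
import requests

base = os.environ["GOOGLE_GEMINI_BASE_URL"]
resp = requests.post(
    f"{base}/v1beta/models/gemini-1.0-pro:generateContent",  # not in models_allowed
    headers={"x-goog-api-key": os.environ["GEMINI_API_KEY"]},
    json={"contents": [{"role": "user", "parts": [{"text": "ping"}]}]},
    timeout=60,
)
# The gateway blocks the call before it reaches Google:
assert resp.status_code == 403  # model_not_allowed
```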
### Default to flash, allow opt-in to pro
Use the VK’s `model_aliases` so that a plain `gemini "task"` (default) lands on flash for cost, while `gemini --model gemini-2.5-pro "task"` opts into pro.