Vertex AI is Google Cloud's enterprise LLM surface: Anthropic's Claude family lives there alongside Gemini, for customers who need GCP-native security and data residency. The gateway uses bifrost/core, which handles Vertex's OAuth / ADC flow, regional endpoints, and the model-id quirks of Anthropic-on-Vertex.
Configure the provider credential
Under Settings → Model Providers:
- Add provider → Google Vertex AI.
- Choose authentication:
  - Service account JSON — paste the JSON key (least-privilege role: `roles/aiplatform.user` or `roles/aiplatform.serviceAgent`); a quick key check follows this list.
  - Application Default Credentials — leave the JSON empty; the gateway picks up workload identity on GKE / Cloud Run.
- Set project ID and region (`us-central1`, `us-east5`, `europe-west1` — Anthropic-on-Vertex has a limited region list).
- Save.
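Before pasting a key, you can check that it mints a live access token. A local sketch using the google-auth library; the file name is a placeholder.

```python
from google.auth.transport.requests import Request
from google.oauth2 import service_account

# "sa-key.json" is a placeholder for the downloaded key file.
creds = service_account.Credentials.from_service_account_file(
    "sa-key.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
creds.refresh(Request())  # exchanges the key for a short-lived access token
print("Token acquired; expires:", creds.expiry)  # ~1 h lifetime (see Known quirks)
```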
Model id format
Vertex model ids differ by model family:
- `publishers/anthropic/models/claude-haiku-4-5@20251001` — Anthropic Claude on Vertex.
- `publishers/google/models/gemini-2.5-flash` — Gemini via Vertex.
Pin friendly names to fully-qualified Vertex ids with model_aliases. A minimal sketch (the alias names are illustrative):
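```yaml
model_aliases:
  # Hypothetical short names; map whatever your clients send.
  claude-haiku: publishers/anthropic/models/claude-haiku-4-5@20251001
  gemini-flash: publishers/google/models/gemini-2.5-flash
```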
Supported endpoints
- `POST /v1/messages` — Anthropic-shape requests, dispatched to Vertex's `rawPredict` (or `streamRawPredict`) endpoint for Anthropic models.
- `POST /v1/chat/completions` — translated shape for both Anthropic-on-Vertex and Gemini.
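A minimal call through the Anthropic-shaped endpoint, as a sketch: the base URL and `x-api-key` header are assumptions to adapt to your deployment.

```python
import requests

GATEWAY = "http://localhost:8080"  # hypothetical gateway base URL

resp = requests.post(
    f"{GATEWAY}/v1/messages",
    headers={"x-api-key": "YOUR_VIRTUAL_KEY"},  # assumed auth header
    json={
        # Fully-qualified Vertex id; an alias from model_aliases works too.
        "model": "publishers/anthropic/models/claude-haiku-4-5@20251001",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Ping Vertex through the gateway"}],
    },
)
resp.raise_for_status()
print(resp.json()["content"][0]["text"])  # Anthropic Messages response shape
```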
cache_control passthrough
Anthropic-on-Vertex supports prompt caching via the same cache_control blocks. Bifrost/core translates the shape for Vertex’s raw-predict endpoint. Cache-read/write token counts flow back as with direct Anthropic.
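For example, marking a long system prompt as cacheable uses the standard Anthropic request shape; the gateway URL and header are the same assumptions as above.

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/messages",  # hypothetical gateway base URL
    headers={"x-api-key": "YOUR_VIRTUAL_KEY"},
    json={
        "model": "publishers/anthropic/models/claude-haiku-4-5@20251001",
        "max_tokens": 256,
        "system": [
            {
                "type": "text",
                "text": "Long, stable system prompt...",
                # Standard Anthropic prompt-caching marker; bifrost/core
                # translates the shape for Vertex's raw-predict endpoint.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": "Summarise the policy."}],
    },
)
# cache_creation_input_tokens / cache_read_input_tokens come back in usage,
# as with direct Anthropic.
print(resp.json()["usage"])
```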
Gemini’s implicit context caches are exposed via a separate API (caches.create); the gateway does not orchestrate these in v1 but forwards any cache-reference params the client includes.
Known quirks
- Region availability. Anthropic-on-Vertex is not in every GCP region; `us-central1`, `us-east5`, `europe-west1`, and `asia-southeast1` cover most cases. A 404 with "publisher not found" means the wrong region.
- Token refresh. Service-account tokens expire after ~1 hour. Bifrost/core caches and auto-refreshes them — no configuration needed from the VK owner.
- Quota headers. Vertex returns `x-vertex-ai-quota-remaining` and throttles with 429 when quota is exhausted. Bifrost/core surfaces the 429 to the fallback engine.
- Provisioned throughput vs on-demand. Provisioned throughput ("Committed Use Discounts") has separate regional endpoints. If you're using provisioned throughput, pass the PT-specific URL on the ModelProvider.
- Version suffixes. Model IDs include the `@20251001`-style version suffix — omitting it may silently route to a different model version. Always pin explicit versions in aliases.
- Streaming. Vertex's streaming format is JSON-lines (`\n`-delimited), not SSE. Bifrost/core normalises it to SSE for the client; see the sketch after this list.
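Client code therefore reads a normal SSE stream. A sketch under the same URL and header assumptions, and assuming the gateway relays Anthropic-style `content_block_delta` events:

```python
import json
import requests

with requests.post(
    "http://localhost:8080/v1/messages",  # hypothetical gateway base URL
    headers={"x-api-key": "YOUR_VIRTUAL_KEY"},
    json={
        "model": "claude-haiku",  # hypothetical alias pinned to @20251001 above
        "max_tokens": 256,
        "stream": True,
        "messages": [{"role": "user", "content": "Stream a haiku."}],
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        # Bifrost/core has already normalised Vertex's JSON-lines to SSE,
        # so events arrive as "data: {...}" lines.
        if line.startswith(b"data: "):
            event = json.loads(line[len(b"data: "):])
            if event.get("type") == "content_block_delta":
                print(event["delta"].get("text", ""), end="", flush=True)
```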
Using workload identity on GKE
Recommended for a self-hosted gateway on GKE:
- Create a Google service account with `roles/aiplatform.user`.
- Bind it to the gateway pod's KSA via the Workload Identity annotation.
- Leave the ModelProvider's JSON empty — ADC flows automatically (a quick check follows below).
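To confirm ADC resolves inside the pod, the google-auth library offers a quick check; this verifies credentials only, not the gateway wiring.

```python
import google.auth

# Resolves Application Default Credentials the same way the gateway does;
# on GKE with Workload Identity this yields the bound Google service account.
credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
print("ADC resolved for project:", project)
```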