Inference Providers

StudioBrain can route AI work to several kinds of inference backends. The important distinction is not just which provider you pick but what kind of provider it is, because the category determines the setup steps, the hardware you need, and how you troubleshoot.

This page is the user-facing reference for the inference provider taxonomy used by the gateway and setup UI.

The Six Categories

| Category | What it means | Examples | Typical setup |
| --- | --- | --- | --- |
| local_inprocess | Inference runs inside the gateway process itself. Nothing separate is launched. | mistralrs, candle, llamacpp | Enable the feature in the build or use a build that already includes it |
| local_spawned | StudioBrain downloads and runs a local binary for you. | llama-server, mistralrs-server, exllamav2, whisper-cpp-server | Install the engine, then let the gateway supervise it |
| lan_endpoint | You already run a compatible server on your machine or LAN. StudioBrain connects to it. | Ollama, LM Studio, Jan, vLLM HTTP, generic OpenAI-compatible endpoints | Enter the host/port or let StudioBrain probe the endpoint |
| distributed | One logical model is spread across multiple machines. | llama.cpp RPC, distributed vLLM | Configure peers and cluster RPC before use |
| cloud_provider | Direct hosted API from a model vendor. | OpenAI, Anthropic, Google, xAI, DashScope, Z.AI, Mistral, Cohere, DeepSeek, Groq, Perplexity, Together | Add an API key and verify the provider |
| llm_router | A gateway in front of one or more providers. | OpenRouter, LiteLLM Proxy, Azure OpenAI, Vertex AI, Bedrock, Portkey, Cloudflare AI | Add base URL + credentials, then verify |
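
If you script against these categories, it can help to treat them as a closed set. The following is a minimal sketch, not the gateway's actual data model: the `ProviderCategory` enum, the `EXAMPLE_PROVIDERS` mapping, and `category_of` are hypothetical names, and the mapping covers only one example provider per category from the table.

```python
from enum import Enum


class ProviderCategory(Enum):
    """The six provider categories, using the identifiers from the table."""
    LOCAL_INPROCESS = "local_inprocess"
    LOCAL_SPAWNED = "local_spawned"
    LAN_ENDPOINT = "lan_endpoint"
    DISTRIBUTED = "distributed"
    CLOUD_PROVIDER = "cloud_provider"
    LLM_ROUTER = "llm_router"


# Illustrative subset of the Examples column; not an exhaustive registry.
EXAMPLE_PROVIDERS = {
    "mistralrs": ProviderCategory.LOCAL_INPROCESS,
    "llama-server": ProviderCategory.LOCAL_SPAWNED,
    "Ollama": ProviderCategory.LAN_ENDPOINT,
    "llama.cpp RPC": ProviderCategory.DISTRIBUTED,
    "OpenAI": ProviderCategory.CLOUD_PROVIDER,
    "OpenRouter": ProviderCategory.LLM_ROUTER,
}


def category_of(provider_name: str) -> ProviderCategory:
    """Return the category of a known example provider (case-sensitive lookup)."""
    return EXAMPLE_PROVIDERS[provider_name]


print(category_of("Ollama").value)  # prints "lan_endpoint"
```

Keeping the category separate from the provider name makes it easier to branch on setup requirements rather than on individual vendors.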

Install Matrix

| Category | Managed by StudioBrain | Needs local GPU | Needs network | Needs API key | Can be shared across LAN/cluster |
| --- | --- | --- | --- | --- | --- |
| local_inprocess | Yes | Usually | No | No | No |
| local_spawned | Yes | Usually | Only for downloads | No | Single host only |
| lan_endpoint | No | Depends on the external server | Yes | Usually no | Yes |
| distributed | Partly | Yes | Yes | No | Yes |
| cloud_provider | No local runtime to install | No | Yes | Yes | Not a LAN service |
| llm_router | No local runtime to install | No | Yes | Usually yes | Not a LAN service |
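
The matrix can also be read as data. The sketch below copies the cells verbatim into a lookup table and derives which categories can serve requests without a network connection. The `InstallRow` type, `INSTALL_MATRIX`, and `works_without_network` are hypothetical helpers for illustration, and treating "Only for downloads" as offline-capable is an interpretation of the matrix, not documented gateway behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstallRow:
    """One row of the install matrix; cells are kept as the original strings."""
    managed: str
    needs_gpu: str
    needs_network: str
    needs_api_key: str
    shareable: str


INSTALL_MATRIX = {
    "local_inprocess": InstallRow("Yes", "Usually", "No", "No", "No"),
    "local_spawned": InstallRow("Yes", "Usually", "Only for downloads", "No", "Single host only"),
    "lan_endpoint": InstallRow("No", "Depends on the external server", "Yes", "Usually no", "Yes"),
    "distributed": InstallRow("Partly", "Yes", "Yes", "No", "Yes"),
    "cloud_provider": InstallRow("No local runtime to install", "No", "Yes", "Yes", "Not a LAN service"),
    "llm_router": InstallRow("No local runtime to install", "No", "Yes", "Usually yes", "Not a LAN service"),
}


def works_without_network(category: str) -> bool:
    """True when the matrix says no network is needed once setup is complete."""
    # Assumption: "Only for downloads" means inference itself runs offline.
    return INSTALL_MATRIX[category].needs_network in ("No", "Only for downloads")


offline_capable = [name for name in INSTALL_MATRIX if works_without_network(name)]
print(offline_capable)  # prints ['local_inprocess', 'local_spawned']
```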

How To Think About Setup

Built-in local inference

Use local_inprocess when you want the fewest moving parts on a single machine and your StudioBrain build already includes the runtime you need.

  • Best for: offline-first local work
  • Tradeoff: tied to the features compiled into your gateway build
  • Default models: On first launch, a small model (qwen2.5-0.5b, ~400 MB) is automatically downloaded and loaded via the mistralrs engine. Each model in the autoconfig profiles is tagged with a specific backend (mistralrs or llamacpp) so the gateway routes to the correct engine without manual config.
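
To illustrate the backend tagging described above, here is a minimal sketch of how a per-model backend tag could drive routing. The profile layout, the `backend_for` helper, and the second model entry are assumptions made for the example; the actual autoconfig profile format is not documented on this page.

```python
# Hypothetical profile shape: each model carries an explicit backend tag.
AUTOCONFIG_PROFILE = [
    {"model": "qwen2.5-0.5b", "backend": "mistralrs"},       # default model named above
    {"model": "example-gguf-model", "backend": "llamacpp"},  # made-up entry for illustration
]


def backend_for(model: str, profile=AUTOCONFIG_PROFILE) -> str:
    """Return the backend tag for a model, or raise if the profile does not list it."""
    for entry in profile:
        if entry["model"] == model:
            return entry["backend"]
    raise KeyError(f"model {model!r} is not in the autoconfig profile")


print(backend_for("qwen2.5-0.5b"))  # prints "mistralrs"
```

The point of the tag is that the choice of engine travels with the model entry, so nothing has to be inferred from the file name or format at load time.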

Downloaded local engines

Use local_spawned when you want StudioBrain to manage a standalone engine binary for you.

  • Best for: local GPU workflows where you want managed install/update/supervision
  • Tradeoff: extra binary downloads and host-specific compatibility

Existing local or LAN servers

Use lan_endpoint when you already have another inference server running and just want StudioBrain to consume it.

  • Best for: Ollama, LM Studio, Jan, or a custom OpenAI-compatible server you already trust
  • Tradeoff: StudioBrain does not own that process lifecycle
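
Before entering a host and port, you can check by hand that the server answers. The sketch below assumes the server exposes the OpenAI-style `GET /v1/models` route, which OpenAI-compatible servers generally provide; it is not a description of how StudioBrain's own probe works, and the helper names are hypothetical.

```python
import json
import urllib.request
from typing import List, Optional


def models_url(host: str, port: int) -> str:
    """Build the OpenAI-style model listing URL for a plain-HTTP LAN server."""
    return f"http://{host}:{port}/v1/models"


def parse_model_ids(body: str) -> List[str]:
    """Extract model ids from an OpenAI-style {"data": [{"id": ...}]} response."""
    payload = json.loads(body)
    return [item["id"] for item in payload.get("data", [])]


def probe(host: str, port: int, timeout: float = 2.0) -> Optional[List[str]]:
    """Return the advertised model ids, or None if the server is unreachable or replies oddly."""
    try:
        with urllib.request.urlopen(models_url(host, port), timeout=timeout) as resp:
            return parse_model_ids(resp.read().decode("utf-8"))
    except (OSError, ValueError, KeyError, AttributeError):
        # OSError covers connection failures and timeouts; ValueError covers bad JSON.
        return None
```

For example, `probe("127.0.0.1", 11434)` targets Ollama's default port on the local machine. A result of `None` means the server was unreachable or did not speak the expected format, which is worth ruling out before troubleshooting inside StudioBrain.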

Distributed inference

Use distributed only when you intentionally want multi-host inference.

  • Best for: splitting large models across multiple machines
  • Tradeoff: more operational complexity than single-node inference

Direct cloud APIs

Use cloud_provider for vendor-native APIs.

  • Best for: frontier closed-source models and BYOK setups
  • Tradeoff: internet dependency and provider billing
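
For BYOK setups, a common pattern is to keep the key in an environment variable rather than in a config file. The sketch below shows that pattern for vendors that accept a Bearer token, as OpenAI does; header conventions differ between vendors, so check the vendor's documentation. The `bearer_headers` helper is hypothetical and is not how StudioBrain stores or sends keys.

```python
import os


def bearer_headers(env_var: str = "OPENAI_API_KEY") -> dict:
    """Build HTTP headers for a Bearer-token API, reading the key from the environment."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"{env_var} is not set; add an API key before verifying the provider")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing before any request is sent keeps a missing key from showing up later as an opaque authentication error from the vendor.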

Router-style gateways

Use llm_router when you want a single endpoint that can fan out to multiple vendors or routing policies.

  • Best for: teams standardizing on one API surface
  • Tradeoff: adds another control plane between StudioBrain and the model vendor

Provider Examples By Category

| Category | Common choices |
| --- | --- |
| local_inprocess | mistralrs, candle, llamacpp |
| local_spawned | llama-server, mistralrs-server, exllamav2 |
| lan_endpoint | Ollama, LM Studio, Jan, text-generation-webui |
| distributed | llama.cpp RPC, distributed vLLM |
| cloud_provider | OpenAI, Anthropic, Google, DashScope, Z.AI |
| llm_router | OpenRouter, LiteLLM Proxy, Azure OpenAI, Bedrock |

Which One Should I Choose?

| Goal | Best fit |
| --- | --- |
| Run fully offline on one machine | local_inprocess or local_spawned |
| Reuse an inference server you already run | lan_endpoint |
| Split a large model across hosts | distributed |
| Use vendor APIs directly | cloud_provider |
| Standardize multiple vendors behind one endpoint | llm_router |
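
The same decision table can be written as a small function. This is a sketch of one possible reading: the `choose_category` name, its flags, and the order in which goals are checked are assumptions for illustration, since the table does not say how to resolve competing goals.

```python
from typing import List


def choose_category(
    *,
    offline: bool = False,
    existing_server: bool = False,
    multi_host: bool = False,
    multiple_vendors: bool = False,
) -> List[str]:
    """Map the goals from the table to candidate categories.

    The check order is an assumption; the first matching goal wins.
    """
    if multi_host:
        return ["distributed"]
    if existing_server:
        return ["lan_endpoint"]
    if offline:
        return ["local_inprocess", "local_spawned"]
    if multiple_vendors:
        return ["llm_router"]
    # No special goal given: fall back to using a vendor API directly.
    return ["cloud_provider"]


print(choose_category(offline=True))  # prints ['local_inprocess', 'local_spawned']
```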