Inference Providers

StudioBrain can route AI work to several different kinds of inference backends. The important distinction is not just which provider you pick, but what kind of provider it is because that changes setup, hardware, and troubleshooting.

This page is the user-facing reference for the inference provider taxonomy used by the gateway and setup UI.

The Six Categories

Category	What it means	Examples	Typical setup
`local_inprocess`	Inference runs inside the gateway process itself. Nothing separate is launched.	mistralrs, candle, llamacpp	Enable the feature in the build or use a build that already includes it
`local_spawned`	StudioBrain downloads and runs a local binary for you.	llama-server, mistralrs-server, exllamav2, whisper-cpp-server	Install the engine, then let the gateway supervise it
`lan_endpoint`	You already run a compatible server on your machine or LAN. StudioBrain connects to it.	Ollama, LM Studio, Jan, vLLM HTTP, generic OpenAI-compatible endpoints	Enter the host/port or let StudioBrain probe the endpoint
`distributed`	One logical model is spread across multiple machines.	llama.cpp RPC, distributed vLLM	Configure peers and cluster RPC before use
`cloud_provider`	Direct hosted API from a model vendor.	OpenAI, Anthropic, Google, xAI, DashScope, Z.AI, Mistral, Cohere, DeepSeek, Groq, Perplexity, Together	Add an API key and verify the provider
`llm_router`	A gateway in front of one or more providers.	OpenRouter, LiteLLM Proxy, Azure OpenAI, Vertex AI, Bedrock, Portkey, Cloudflare AI	Add base URL + credentials, then verify

Install Matrix

Category	Managed by StudioBrain	Needs local GPU	Needs network	Needs API key	Can be shared across LAN/cluster
`local_inprocess`	Yes	Usually	No	No	No
`local_spawned`	Yes	Usually	Only for downloads	No	Single host only
`lan_endpoint`	No	Depends on the external server	Yes	Usually no	Yes
`distributed`	Partly	Yes	Yes	No	Yes
`cloud_provider`	No local runtime to install	No	Yes	Yes	Not a LAN service
`llm_router`	No local runtime to install	No	Yes	Usually yes	Not a LAN service

How To Think About Setup

Built-in local inference

Use local_inprocess when you want the fewest moving parts on a single machine and your StudioBrain build already includes the runtime you need.

Best for: offline-first local work
Tradeoff: tied to the features compiled into your gateway build
Default models: On first launch, a small model (qwen2.5-0.5b, ~400 MB) is automatically downloaded and loaded via the mistralrs engine. Each model in the autoconfig profiles is tagged with a specific backend (mistralrs or llamacpp) so the gateway routes to the correct engine without manual config.

Downloaded local engines

Use local_spawned when you want StudioBrain to manage a standalone engine binary for you.

Best for: local GPU workflows where you want managed install/update/supervision
Tradeoff: extra binary downloads and host-specific compatibility

Existing local or LAN servers

Use lan_endpoint when you already have another inference server running and just want StudioBrain to consume it.

Best for: Ollama, LM Studio, Jan, or a custom OpenAI-compatible server you already trust
Tradeoff: StudioBrain does not own that process lifecycle

Distributed inference

Use distributed only when you intentionally want multi-host inference.

Best for: splitting large models across multiple machines
Tradeoff: more operational complexity than single-node inference

Direct cloud APIs

Use cloud_provider for vendor-native APIs.

Best for: frontier closed-source models and BYOK setups
Tradeoff: internet dependency and provider billing

Router-style gateways

Use llm_router when you want a single endpoint that can fan out to multiple vendors or routing policies.

Best for: teams standardizing on one API surface
Tradeoff: adds another control plane between StudioBrain and the model vendor

Provider Examples By Category

Category	Common choices
`local_inprocess`	mistralrs, candle, llamacpp
`local_spawned`	llama-server, mistralrs-server, exllamav2
`lan_endpoint`	Ollama, LM Studio, Jan, text-generation-webui
`distributed`	llama.cpp RPC, distributed vLLM
`cloud_provider`	OpenAI, Anthropic, Google, DashScope, Z.AI
`llm_router`	OpenRouter, LiteLLM Proxy, Azure OpenAI, Bedrock

Which One Should I Choose?

Goal	Best fit
Run fully offline on one machine	`local_inprocess` or `local_spawned`
Reuse an inference server you already run	`lan_endpoint`
Split a large model across hosts	`distributed`
Use vendor APIs directly	`cloud_provider`
Standardize multiple vendors behind one endpoint	`llm_router`

For initial application setup, see Installation
For server and deployment concerns, see Infrastructure Overview
For self-hosted operation, see Self-Hosting

Settings & Configuration Sync & Collaboration