Inference Providers
StudioBrain can route AI work to several different kinds of inference backends. The important distinction is not just which provider you pick, but what kind of provider it is because that changes setup, hardware, and troubleshooting.
This page is the user-facing reference for the inference provider taxonomy used by the gateway and setup UI.
The Six Categories
| Category | What it means | Examples | Typical setup |
|---|---|---|---|
local_inprocess | Inference runs inside the gateway process itself. Nothing separate is launched. | mistralrs, candle, llamacpp | Enable the feature in the build or use a build that already includes it |
local_spawned | StudioBrain downloads and runs a local binary for you. | llama-server, mistralrs-server, exllamav2, whisper-cpp-server | Install the engine, then let the gateway supervise it |
lan_endpoint | You already run a compatible server on your machine or LAN. StudioBrain connects to it. | Ollama, LM Studio, Jan, vLLM HTTP, generic OpenAI-compatible endpoints | Enter the host/port or let StudioBrain probe the endpoint |
distributed | One logical model is spread across multiple machines. | llama.cpp RPC, distributed vLLM | Configure peers and cluster RPC before use |
cloud_provider | Direct hosted API from a model vendor. | OpenAI, Anthropic, Google, xAI, DashScope, Z.AI, Mistral, Cohere, DeepSeek, Groq, Perplexity, Together | Add an API key and verify the provider |
llm_router | A gateway in front of one or more providers. | OpenRouter, LiteLLM Proxy, Azure OpenAI, Vertex AI, Bedrock, Portkey, Cloudflare AI | Add base URL + credentials, then verify |
Install Matrix
| Category | Managed by StudioBrain | Needs local GPU | Needs network | Needs API key | Can be shared across LAN/cluster |
|---|---|---|---|---|---|
local_inprocess | Yes | Usually | No | No | No |
local_spawned | Yes | Usually | Only for downloads | No | Single host only |
lan_endpoint | No | Depends on the external server | Yes | Usually no | Yes |
distributed | Partly | Yes | Yes | No | Yes |
cloud_provider | No local runtime to install | No | Yes | Yes | Not a LAN service |
llm_router | No local runtime to install | No | Yes | Usually yes | Not a LAN service |
How To Think About Setup
Built-in local inference
Use local_inprocess when you want the fewest moving parts on a single machine and your StudioBrain build already includes the runtime you need.
- Best for: offline-first local work
- Tradeoff: tied to the features compiled into your gateway build
- Default models: On first launch, a small model (
qwen2.5-0.5b, ~400 MB) is automatically downloaded and loaded via themistralrsengine. Each model in the autoconfig profiles is tagged with a specific backend (mistralrsorllamacpp) so the gateway routes to the correct engine without manual config.
Downloaded local engines
Use local_spawned when you want StudioBrain to manage a standalone engine binary for you.
- Best for: local GPU workflows where you want managed install/update/supervision
- Tradeoff: extra binary downloads and host-specific compatibility
Existing local or LAN servers
Use lan_endpoint when you already have another inference server running and just want StudioBrain to consume it.
- Best for: Ollama, LM Studio, Jan, or a custom OpenAI-compatible server you already trust
- Tradeoff: StudioBrain does not own that process lifecycle
Distributed inference
Use distributed only when you intentionally want multi-host inference.
- Best for: splitting large models across multiple machines
- Tradeoff: more operational complexity than single-node inference
Direct cloud APIs
Use cloud_provider for vendor-native APIs.
- Best for: frontier closed-source models and BYOK setups
- Tradeoff: internet dependency and provider billing
Router-style gateways
Use llm_router when you want a single endpoint that can fan out to multiple vendors or routing policies.
- Best for: teams standardizing on one API surface
- Tradeoff: adds another control plane between StudioBrain and the model vendor
Provider Examples By Category
| Category | Common choices |
|---|---|
local_inprocess | mistralrs, candle, llamacpp |
local_spawned | llama-server, mistralrs-server, exllamav2 |
lan_endpoint | Ollama, LM Studio, Jan, text-generation-webui |
distributed | llama.cpp RPC, distributed vLLM |
cloud_provider | OpenAI, Anthropic, Google, DashScope, Z.AI |
llm_router | OpenRouter, LiteLLM Proxy, Azure OpenAI, Bedrock |
Which One Should I Choose?
| Goal | Best fit |
|---|---|
| Run fully offline on one machine | local_inprocess or local_spawned |
| Reuse an inference server you already run | lan_endpoint |
| Split a large model across hosts | distributed |
| Use vendor APIs directly | cloud_provider |
| Standardize multiple vendors behind one endpoint | llm_router |
Related Setup Guides
- For initial application setup, see Installation
- For server and deployment concerns, see Infrastructure Overview
- For self-hosted operation, see Self-Hosting