Vision Embedding & Tagging API (sb-vision)
The sb-vision crate provides local vision embedding and image tagging for StudioBrain. It wraps BLIP, CLIP, and ViT models via the candle ML framework (preferred) with a tract-onnx fallback for mobile CPU paths.
Inputs are pre-thumbnailed to 512×512 by sb-thumbnails; this crate re-samples internally to the model-specific size.
Public API
Three functions form the public API, consumed by desktop, mobile (Tauri), cloud AI services, ComfyUI integration, and the manifest indexer:
use image::DynamicImage;
/// Compute a vision embedding for an image.
pub fn embed(image: &DynamicImage) -> anyhow::Result<Vec<f32>>;
/// Produce string tags for an image (descending confidence order).
pub fn tag(image: &DynamicImage) -> anyhow::Result<Vec<String>>;
/// Compute embedding and tags in a single call (more efficient).
pub fn embed_and_tag(image: &DynamicImage) -> anyhow::Result<(Vec<f32>, Vec<String>)>;
Return Types
| Function | Returns | Dimension |
|---|---|---|
| embed | Vec<f32> | 768-d (BLIP) or 512-d (CLIP via ONNX) |
| tag | Vec<String> | Variable, descending confidence |
| embed_and_tag | (Vec<f32>, Vec<String>) | Matches embed + tag |
Dimensional opaqueness: Callers should treat embedding dimensionality as opaque. Store whatever the active backend returns and do not hardcode dimensions. The backfill helpers (is_zero_embedding, needs_backfill) let you identify stale rows, such as all-zero placeholders or vectors whose length does not match the active backend.
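One way to follow this advice is to record the dimension from the returned vector itself rather than from a constant. The sketch below is illustrative only: StoredEmbedding and its fields are hypothetical names, not part of sb-vision.

```rust
/// Hypothetical row type (not part of sb-vision): the dimension is taken
/// from the vector that the backend returned, never from a hardcoded value.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredEmbedding {
    /// Name of the backend that produced the vector, e.g. "candle-cpu".
    pub backend: String,
    /// Length of `values`, captured at construction time.
    pub dim: usize,
    pub values: Vec<f32>,
}

impl StoredEmbedding {
    pub fn new(backend: &str, values: Vec<f32>) -> Self {
        Self {
            backend: backend.to_string(),
            dim: values.len(),
            values,
        }
    }

    /// Two embeddings are only comparable when they come from the same
    /// backend and have the same length.
    pub fn is_comparable_with(&self, other: &StoredEmbedding) -> bool {
        self.backend == other.backend && self.dim == other.dim
    }
}
```

Checking is_comparable_with before any similarity computation keeps 768-d and 512-d rows from being mixed after a backend switch.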
Models
| Backend | Model | Embedding dim | Task |
|---|---|---|---|
| candle-cpu/cuda/metal | Salesforce/blip-image-captioning-large | 768 | CLS embedding + caption-based tags |
| candle-cpu/cuda/metal | google/vit-base-patch16-224 | n/a | ImageNet-1K classification tags |
| onnx-clip | CLIP ViT-B/32 (ONNX Runtime) | 512 | L2-normalized embeddings (embedding-only, no tags) |
| onnx | Generic ViT ONNX (stub) | 512 | Pure-Rust mobile CPU fallback |
| florence2-tract | Microsoft/Florence-2-base (ONNX) | 512 | Region-aware tags (stub; follow-up required) |
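The table notes that the onnx-clip backend returns L2-normalized embeddings. As a rough illustration of what that means for callers, the sketch below normalizes a vector to unit length and computes cosine similarity. Both functions are hypothetical helpers written for this example, not sb-vision APIs.

```rust
/// Scale `v` to unit L2 norm in place. Returns false and leaves `v`
/// untouched when the norm is zero or not finite.
pub fn l2_normalize(v: &mut [f32]) -> bool {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 || !norm.is_finite() {
        return false;
    }
    for x in v.iter_mut() {
        *x /= norm;
    }
    true
}

/// Cosine similarity of two vectors. Returns None when the lengths differ,
/// the vectors are empty, or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

For vectors that are already unit length, cosine similarity reduces to the dot product. An all-zero vector has no direction, which is why the sketch returns None for it instead of a score.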
Florence-2 Status
Florence-2 support is stubbed in this release. candle-transformers 0.9 does not yet ship Florence-2’s custom attention implementation. Two paths forward:
- candle path: unblocked when candle-transformers ships Florence-2.
- ONNX path: the florence2-tract feature, which requires exporting the model to ONNX and pushing it to the GHCR artifact registry.
Feature Flags
The crate ships with zero native dependencies by default. Enable a backend feature for real inference.
| Feature | Description | Native deps |
|---|---|---|
| candle-cpu | Pure-Rust candle inference (CPU) | None |
| candle-cuda | Candle + NVIDIA CUDA (implies candle-cpu) | CUDA toolkit |
| candle-metal | Candle + Apple Metal (implies candle-cpu) | Metal framework |
| onnx-clip | CLIP ViT-B/32 via ONNX Runtime (ort crate) | ONNX Runtime |
| onnx | tract-onnx CPU fallback (mobile) | None |
| florence2-tract | Florence-2 via ONNX (implies onnx) | ONNX artifact required |
| pyo3 | Python bindings for cloud AI service | Python headers |
| full | Convenience alias: candle-cpu, for CI | None |
Default build has no inference. A default cargo build -p sb-vision compiles with zero native dependencies and returns VisionError::FeatureDisabled for all API calls. You must enable at least one backend feature, such as candle-cpu, for real inference.
Quick Start
use image::DynamicImage;

fn main() -> anyhow::Result<()> {
    // Open an image (sb-thumbnails handles the 512x512 pre-thumbnail).
    let img: DynamicImage = image::open("asset.jpg")?;
    // Get an embedding (768-dimensional with a candle-* BLIP backend).
    let embedding = sb_vision::embed(&img)?;
    // Get tags in descending confidence order (e.g., ["mountain", "landscape", "outdoor"]).
    let tags = sb_vision::tag(&img)?;
    println!("{} dims, {} tags", embedding.len(), tags.len());
    // Or get both in one call (shares preprocessing).
    let (embedding, tags) = sb_vision::embed_and_tag(&img)?;
    println!("{} dims, first tag: {:?}", embedding.len(), tags.first());
    Ok(())
}
Model Cache
Models are downloaded from HuggingFace Hub and cached on disk:
| OS | Cache path |
|---|---|
| Linux | $XDG_CACHE_HOME/sb-vision or ~/.cache/sb-vision |
| macOS | ~/Library/Caches/sb-vision |
| Windows | %LOCALAPPDATA%\sb-vision |
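The lookup rules in the table can be sketched as a small function. This is an assumption-laden illustration, not the crate's actual resolver: cache_dir and model_dir_name are hypothetical names, the os argument follows the values of std::env::consts::OS, and the '/' to '_' replacement is inferred from the directory name in the cache layout in this section.

```rust
use std::path::PathBuf;

/// Hypothetical helper mirroring the cache path table. `var` looks up an
/// environment variable, so the logic can be exercised without touching
/// the real process environment.
pub fn cache_dir(os: &str, var: impl Fn(&str) -> Option<String>) -> Option<PathBuf> {
    match os {
        // Linux: prefer $XDG_CACHE_HOME, fall back to ~/.cache.
        "linux" => match var("XDG_CACHE_HOME").filter(|s| !s.is_empty()) {
            Some(xdg) => Some(PathBuf::from(xdg).join("sb-vision")),
            None => var("HOME").map(|h| PathBuf::from(h).join(".cache").join("sb-vision")),
        },
        "macos" => var("HOME").map(|h| {
            PathBuf::from(h)
                .join("Library")
                .join("Caches")
                .join("sb-vision")
        }),
        "windows" => var("LOCALAPPDATA").map(|d| PathBuf::from(d).join("sb-vision")),
        _ => None,
    }
}

/// Directory name for a Hub repo id, assuming '/' is replaced by '_'.
pub fn model_dir_name(repo_id: &str) -> String {
    repo_id.replace('/', "_")
}
```

In a real program the call would look like cache_dir(std::env::consts::OS, |k| std::env::var(k).ok()).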
Cache layout:
~/.cache/sb-vision/
  Salesforce_blip-image-captioning-large/
    model.safetensors
    tokenizer.json
    config.json
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| HF_TOKEN | HuggingFace Hub authentication (for gated/private models) | None |
| SB_VISION_NO_DOWNLOAD | Disable automatic model downloads (set to 1 for CI/air-gapped) | 0 |
| CLIP_MODEL_PATH | Path to CLIP ViT-B/32 ONNX file (required for onnx-clip backend) | None |
CLIP_MODEL_PATH is a hard requirement for onnx-clip. The backend returns VisionError::ModelNotConfigured if this is not set — there is no silent zero-vector fallback.
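A minimal sketch of how these variables might be read follows. EnvConfig, ConfigError, read_env, and require_clip_model are hypothetical names introduced for illustration; the crate's real configuration code may differ. ConfigError::ModelNotConfigured is a local stand-in for VisionError::ModelNotConfigured.

```rust
use std::path::PathBuf;

/// Local stand-in for the crate's error (hypothetical).
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    ModelNotConfigured,
}

/// Hypothetical snapshot of the three documented variables.
#[derive(Debug, PartialEq)]
pub struct EnvConfig {
    pub hf_token: Option<String>,
    pub no_download: bool,
    pub clip_model_path: Option<PathBuf>,
}

/// Read the variables through `var`, so tests can inject values.
pub fn read_env(var: impl Fn(&str) -> Option<String>) -> EnvConfig {
    EnvConfig {
        hf_token: var("HF_TOKEN").filter(|s| !s.is_empty()),
        // Downloads are disabled only when the variable is exactly "1".
        no_download: var("SB_VISION_NO_DOWNLOAD").as_deref() == Some("1"),
        clip_model_path: var("CLIP_MODEL_PATH")
            .filter(|s| !s.is_empty())
            .map(PathBuf::from),
    }
}

/// Fail loudly when the CLIP model path is missing instead of falling
/// back to a placeholder embedding.
pub fn require_clip_model(cfg: &EnvConfig) -> Result<&PathBuf, ConfigError> {
    cfg.clip_model_path
        .as_ref()
        .ok_or(ConfigError::ModelNotConfigured)
}
```

With the real environment, the lookup would be read_env(|k| std::env::var(k).ok()).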
Python Bindings (PyO3)
Enable the pyo3 feature for Python consumption by the cloud AI service:
from sb_vision import embed_from_path, tag_from_path, embed_and_tag_from_path
embedding = embed_from_path("asset.jpg")
tags = tag_from_path("asset.jpg")
embedding, tags = embed_and_tag_from_path("asset.jpg")
Backfill Helpers
For assets ingested before real vision embeddings were available (SBAI-2199), the crate ships helpers to identify stale rows:
use sb_vision::embed_tag::{is_zero_embedding, needs_backfill, CLIP_EMBED_DIM, BLIP_EMBED_DIM};
// An all-zero placeholder vector of the kind stored before SBAI-2199.
let zeros = [0.0_f32; 512];
// Detect all-zero placeholder vectors.
assert!(is_zero_embedding(&[0.0; 512]));
// Check if an embedding needs regeneration.
assert!(needs_backfill(None, CLIP_EMBED_DIM)); // missing
assert!(needs_backfill(Some(&zeros), CLIP_EMBED_DIM)); // all-zero
assert!(needs_backfill(Some(&[0.0; 1]), CLIP_EMBED_DIM)); // wrong dim
| Constant | Value | Backend |
|---|---|---|
| CLIP_EMBED_DIM | 512 | onnx-clip |
| BLIP_EMBED_DIM | 768 | candle-* |
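To make the three backfill conditions (missing, all-zero, wrong dimension) concrete, here is a standalone sketch. The functions below are local re-implementations written for illustration under the assumption that the helpers take a slice and an expected length; the crate's actual signatures and edge-case behavior may differ. stale_ids is a hypothetical helper, not part of sb-vision.

```rust
// Values taken from the constants table; redefined locally for the sketch.
pub const CLIP_EMBED_DIM: usize = 512;
pub const BLIP_EMBED_DIM: usize = 768;

/// True when every component is exactly zero. Note that an empty slice
/// also counts as all-zero in this sketch.
pub fn is_zero_embedding(v: &[f32]) -> bool {
    v.iter().all(|x| *x == 0.0)
}

/// True when the embedding is missing, has the wrong length, or is an
/// all-zero placeholder.
pub fn needs_backfill(embedding: Option<&[f32]>, expected_dim: usize) -> bool {
    match embedding {
        None => true,
        Some(v) => v.len() != expected_dim || is_zero_embedding(v),
    }
}

/// Hypothetical helper: collect the ids of rows that should be re-embedded.
pub fn stale_ids(rows: &[(u64, Option<Vec<f32>>)], expected_dim: usize) -> Vec<u64> {
    rows.iter()
        .filter(|(_, emb)| needs_backfill(emb.as_deref(), expected_dim))
        .map(|(id, _)| *id)
        .collect()
}
```

A backfill job could call stale_ids with the dimension of the active backend and re-queue only the returned ids.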
Error Handling
All API functions return anyhow::Result. The structured VisionError enum is available for callers that need to branch on failure mode:
| Error Variant | Meaning |
|---|---|
| FeatureDisabled | Required backend feature not compiled in |
| ModelNotConfigured | CLIP_MODEL_PATH not set (onnx-clip only) |
| ModelDownload | HuggingFace download failed |
| Preprocess | Image decode/resize/tensor conversion failed |
| Inference | Forward pass failed; re-queue with backoff |
| Tokenizer | Tokenizer load/encode failed |
| Io | Filesystem I/O error |
Never treat inference failures as successful zero-embeddings. Pre-SBAI-2199 behavior silently returned all-zero vectors, causing all assets to be treated as identical in Qdrant. The current implementation returns hard errors — callers must propagate them.
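A caller that branches on the failure mode might look roughly like the sketch below. Only the "re-queue with backoff" advice for Inference comes from the table; every other policy choice, the VisionErrorKind stand-in enum, the Action type, and the backoff parameters are assumptions made for illustration. With the real crate, the variant would typically be recovered from the anyhow error via downcast_ref::<VisionError>(), assuming the error was constructed from a VisionError.

```rust
use std::time::Duration;

/// Local stand-in mirroring the variants in the error table (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VisionErrorKind {
    FeatureDisabled,
    ModelNotConfigured,
    ModelDownload,
    Preprocess,
    Inference,
    Tokenizer,
    Io,
}

/// What the caller decides to do with a failed asset (hypothetical).
#[derive(Debug, PartialEq)]
pub enum Action {
    Retry(Duration),
    SkipAsset,
    FailFast,
}

/// Exponential backoff: 500 ms doubled per attempt, capped at 32 s.
pub fn backoff(attempt: u32) -> Duration {
    let base_ms: u64 = 500;
    Duration::from_millis(base_ms << attempt.min(6))
}

/// Map an error kind to an action. Never substitutes a zero embedding.
pub fn action_for(kind: VisionErrorKind, attempt: u32) -> Action {
    match kind {
        // Documented: inference failures are re-queued with backoff.
        VisionErrorKind::Inference => Action::Retry(backoff(attempt)),
        // Assumption: transient download and I/O errors are also retried.
        VisionErrorKind::ModelDownload | VisionErrorKind::Io => Action::Retry(backoff(attempt)),
        // Assumption: an undecodable image will not improve on retry.
        VisionErrorKind::Preprocess => Action::SkipAsset,
        // Assumption: build or configuration problems need operator action.
        VisionErrorKind::FeatureDisabled
        | VisionErrorKind::ModelNotConfigured
        | VisionErrorKind::Tokenizer => Action::FailFast,
    }
}
```

None of the branches produces an embedding, so a failed asset can never be indexed as a zero vector.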
Backend Selection Order
When multiple backend features are enabled, the active backend is determined at compile time:
1. candle-cpu / candle-cuda / candle-metal: BLIP/ViT via candle (preferred)
2. onnx-clip: CLIP ViT-B/32 via ONNX Runtime (embedding-only)
3. onnx: tract-onnx CPU fallback (mobile stub)
4. No feature: VisionError::FeatureDisabled
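The priority order above can be sketched with the cfg! macro, which evaluates Cargo feature flags at compile time. Backend, select_backend, and compiled_backend are hypothetical names used for illustration; the crate may implement the selection differently, for example with #[cfg] attributes on modules.

```rust
/// Hypothetical backend identifier (not an sb-vision type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    Candle,
    OnnxClip,
    TractOnnx,
}

/// Pure priority logic: candle first, then onnx-clip, then onnx.
/// None corresponds to the FeatureDisabled case.
pub fn select_backend(candle: bool, onnx_clip: bool, onnx: bool) -> Option<Backend> {
    if candle {
        Some(Backend::Candle)
    } else if onnx_clip {
        Some(Backend::OnnxClip)
    } else if onnx {
        Some(Backend::TractOnnx)
    } else {
        None
    }
}

/// Resolve the backend from the features this build was compiled with.
/// Each cfg! call becomes a constant `true` or `false` at compile time.
#[allow(unexpected_cfgs)]
pub fn compiled_backend() -> Option<Backend> {
    select_backend(
        cfg!(any(
            feature = "candle-cpu",
            feature = "candle-cuda",
            feature = "candle-metal"
        )),
        cfg!(feature = "onnx-clip"),
        cfg!(feature = "onnx"),
    )
}
```

Keeping the priority logic in a plain function separate from the cfg! calls makes the ordering testable without rebuilding under each feature combination.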
Consumers
| Consumer | Repo | Functions used / integration |
|---|---|---|
| ComfyUI integration | studiobrain-ai | sb_vision::tag() |
| Manifest indexer | studiobrain-ai | sb_vision::embed(), sb_vision::tag() |
| Cloud AI service | studiobrain-ai | PyO3 bindings |
| Desktop app | studiobrain-app | Cargo git dependency |
| Mobile app | studiobrain-app | Cargo git dependency (candle-metal) |