Vision Embedding & Tagging API (sb-vision)

The sb-vision crate provides local vision embedding and image tagging for StudioBrain. It wraps BLIP, CLIP, and ViT models via the candle ML framework (preferred), with a tract-onnx fallback for mobile CPU paths.

Inputs are pre-thumbnailed to 512×512 by sb-thumbnails; this crate re-samples internally to the model-specific size.

Public API

Three functions form the public API, consumed by desktop, mobile (Tauri), cloud AI services, ComfyUI integration, and the manifest indexer:

use image::DynamicImage;
 
/// Compute a vision embedding for an image.
pub fn embed(image: &DynamicImage) -> anyhow::Result<Vec<f32>>;
 
/// Produce string tags for an image (descending confidence order).
pub fn tag(image: &DynamicImage) -> anyhow::Result<Vec<String>>;
 
/// Compute embedding and tags in a single call (more efficient).
pub fn embed_and_tag(image: &DynamicImage) -> anyhow::Result<(Vec<f32>, Vec<String>)>;

Return Types

| Function | Returns | Dimension |
|---|---|---|
| embed | Vec<f32> | 768-d (BLIP) or 512-d (CLIP via ONNX) |
| tag | Vec<String> | Variable, descending confidence |
| embed_and_tag | (Vec<f32>, Vec<String>) | Matches embed + tag |

Dimensional opaqueness: Callers should treat embedding dimensionality as opaque. Store whatever the active backend returns — do not hardcode dimensions. The backfill helpers (is_zero_embedding, needs_backfill) let you validate stale rows.
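One way to keep the dimension opaque, sketched under assumptions: let the storage layer adopt the width of the first embedding it receives and reject later mismatches, instead of hardcoding 512 or 768. `EmbeddingStore` is a hypothetical type used for illustration only; it is not part of sb-vision.

```rust
/// Hypothetical in-memory store; not part of sb-vision.
/// It learns the embedding width from the first vector it sees.
#[derive(Debug, Default)]
pub struct EmbeddingStore {
    dim: Option<usize>,
    rows: Vec<Vec<f32>>,
}

impl EmbeddingStore {
    /// Insert an embedding, adopting its length as the store's dimension
    /// if none is set yet. Returns an error on a width mismatch.
    pub fn insert(&mut self, embedding: Vec<f32>) -> Result<(), String> {
        if embedding.is_empty() {
            return Err("empty embedding".to_string());
        }
        match self.dim {
            None => self.dim = Some(embedding.len()),
            Some(d) if d != embedding.len() => {
                return Err(format!("expected {} dims, got {}", d, embedding.len()));
            }
            Some(_) => {}
        }
        self.rows.push(embedding);
        Ok(())
    }

    /// Dimension learned so far, if any.
    pub fn dim(&self) -> Option<usize> {
        self.dim
    }

    /// Number of stored embeddings.
    pub fn len(&self) -> usize {
        self.rows.len()
    }
}
```

With this shape, switching the active backend surfaces as an explicit mismatch error rather than silently mixing 512-d and 768-d vectors in one collection.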

Models

| Backend | Model | Embedding dim | Task |
|---|---|---|---|
| candle-cpu/cuda/metal | Salesforce/blip-image-captioning-large | 768 | CLS embedding + caption-based tags |
| candle-cpu/cuda/metal | google/vit-base-patch16-224 | | ImageNet-1K classification tags |
| onnx-clip | CLIP ViT-B/32 (ONNX Runtime) | 512 | L2-normalized embeddings (embedding-only, no tags) |
| onnx | Generic ViT ONNX (stub) | 512 | Pure-Rust mobile CPU fallback |
| florence2-tract | Microsoft/Florence-2-base (ONNX) | 512 | Region-aware tags (stub; follow-up required) |
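The onnx-clip row mentions L2-normalized embeddings. As a reminder of what that means, here is a minimal sketch: each vector is divided by its Euclidean norm, so the dot product of two normalized vectors equals their cosine similarity. The function names are illustrative and are not taken from sb-vision.

```rust
/// Scale a vector to unit L2 norm. Returns None for an all-zero
/// (or empty) vector, which has no direction to preserve.
pub fn l2_normalize(v: &[f32]) -> Option<Vec<f32>> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 || !norm.is_finite() {
        return None;
    }
    Some(v.iter().map(|x| x / norm).collect())
}

/// Dot product; equals cosine similarity when both inputs are unit-length.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Returning `None` for a zero vector keeps a placeholder row from being mistaken for a valid unit-length embedding.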

Florence-2 Status

Florence-2 support is stubbed in this release. candle-transformers 0.9 does not yet ship Florence-2’s custom attention implementation. Two paths forward:

  1. candle path: unblocked when candle-transformers ships Florence-2.
  2. ONNX path: the florence2-tract feature, which requires exporting the model to ONNX and pushing it to the GHCR artifact registry.

Feature Flags

The crate ships with zero native dependencies by default. Enable a backend feature for real inference.

| Feature | Description | Native deps |
|---|---|---|
| candle-cpu | Pure-Rust candle inference (CPU) | None |
| candle-cuda | Candle + NVIDIA CUDA (implies candle-cpu) | CUDA toolkit |
| candle-metal | Candle + Apple Metal (implies candle-cpu) | Metal framework |
| onnx-clip | CLIP ViT-B/32 via ONNX Runtime (ort crate) | ONNX Runtime |
| onnx | tract-onnx CPU fallback (mobile) | None |
| florence2-tract | Florence-2 via ONNX (implies onnx) | ONNX artifact required |
| pyo3 | Python bindings for cloud AI service | Python headers |
| full | Convenience: candle-cpu for CI | None |
Warning: Default build has no inference. A default cargo build -p sb-vision compiles with zero native dependencies and returns VisionError::FeatureDisabled for all API calls. You must enable at least one backend feature, such as candle-cpu, for real inference.

Quick Start

fn main() -> anyhow::Result<()> {
    // Open an image (sb-thumbnails handles the 512x512 pre-thumbnail).
    let img = image::open("asset.jpg")?;

    // Get an embedding (768-dimensional with the BLIP backend).
    let embedding = sb_vision::embed(&img)?;

    // Get ImageNet-1K tags (e.g., ["mountain", "landscape", "outdoor"]).
    let tags = sb_vision::tag(&img)?;

    // Or get both in one call (shares preprocessing).
    let (embedding, tags) = sb_vision::embed_and_tag(&img)?;

    println!("{} dims, {} tags", embedding.len(), tags.len());
    Ok(())
}

Model Cache

Models are downloaded from HuggingFace Hub and cached on disk:

| OS | Cache path |
|---|---|
| Linux | $XDG_CACHE_HOME/sb-vision or ~/.cache/sb-vision |
| macOS | ~/Library/Caches/sb-vision |
| Windows | %LOCALAPPDATA%\sb-vision |

Cache layout:

~/.cache/sb-vision/
  Salesforce_blip-image-captioning-large/
    model.safetensors
    tokenizer.json
    config.json
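
The path rules above can be sketched as a small pure function. This is an illustration, not the crate's actual resolver: `cache_root` and `model_dir_name` are hypothetical names, environment values are passed in as arguments, and the rule that `/` in a Hub repo id becomes `_` is inferred only from the layout shown above.

```rust
use std::path::PathBuf;

/// Hypothetical helper mirroring the cache path table. Environment values
/// are passed in explicitly so the function stays pure and easy to test.
pub fn cache_root(
    os: &str, // e.g. std::env::consts::OS
    xdg_cache_home: Option<&str>,
    home: Option<&str>,
    local_app_data: Option<&str>,
) -> Option<PathBuf> {
    let base = match os {
        // Linux: prefer $XDG_CACHE_HOME, fall back to ~/.cache.
        "linux" => match xdg_cache_home.filter(|s| !s.is_empty()) {
            Some(xdg) => PathBuf::from(xdg),
            None => PathBuf::from(home?).join(".cache"),
        },
        "macos" => PathBuf::from(home?).join("Library").join("Caches"),
        "windows" => PathBuf::from(local_app_data?),
        _ => return None,
    };
    Some(base.join("sb-vision"))
}

/// Assumed directory naming, inferred from the layout shown above:
/// the '/' in a Hub repo id becomes '_'.
pub fn model_dir_name(repo_id: &str) -> String {
    repo_id.replace('/', "_")
}
```

A caller would typically feed it `std::env::consts::OS` and the results of `std::env::var(...).ok()` for the relevant variables.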

Environment Variables

| Variable | Purpose | Default |
|---|---|---|
| HF_TOKEN | HuggingFace Hub authentication (for gated/private models) | None |
| SB_VISION_NO_DOWNLOAD | Disable automatic model downloads (set to 1 for CI/air-gapped) | 0 |
| CLIP_MODEL_PATH | Path to CLIP ViT-B/32 ONNX file (required for onnx-clip backend) | None |
Warning: CLIP_MODEL_PATH is a hard requirement for onnx-clip. The backend returns VisionError::ModelNotConfigured if it is not set; there is no silent zero-vector fallback.
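
Because a missing CLIP_MODEL_PATH is a hard error, a service may want to validate it once at startup. The following is a minimal sketch of such a pre-flight check; `resolve_clip_model_path` is a hypothetical helper, not an sb-vision function, and the "must be an existing file" rule is an assumption added for illustration.

```rust
use std::path::PathBuf;

/// Hypothetical pre-flight check; not part of sb-vision.
/// `raw` is the value of CLIP_MODEL_PATH, e.g. obtained with
/// `std::env::var("CLIP_MODEL_PATH").ok()`.
pub fn resolve_clip_model_path(raw: Option<String>) -> Result<PathBuf, String> {
    // Treat unset, empty, and whitespace-only values the same way.
    let value = raw
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .ok_or_else(|| "CLIP_MODEL_PATH is not set".to_string())?;

    let path = PathBuf::from(value);
    // Assumption: the variable must point at an existing regular file.
    if !path.is_file() {
        return Err(format!(
            "CLIP_MODEL_PATH does not point to a file: {}",
            path.display()
        ));
    }
    Ok(path)
}
```

Failing early with a readable message is friendlier than discovering the misconfiguration on the first embedding request.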

Python Bindings (PyO3)

Enable the pyo3 feature for Python consumption by the cloud AI service:

from sb_vision import embed_from_path, tag_from_path, embed_and_tag_from_path
 
embedding = embed_from_path("asset.jpg")
tags = tag_from_path("asset.jpg")
embedding, tags = embed_and_tag_from_path("asset.jpg")

Backfill Helpers

For assets ingested before real vision embeddings were available (SBAI-2199), the crate ships helpers to identify stale rows:

use sb_vision::embed_tag::{is_zero_embedding, needs_backfill, CLIP_EMBED_DIM, BLIP_EMBED_DIM};
 
// Detect all-zero placeholder vectors.
assert!(is_zero_embedding(&[0.0; 512]));
 
// Check if an embedding needs regeneration.
let zeros = [0.0_f32; 512]; // all-zero placeholder at CLIP width
assert!(needs_backfill(None, CLIP_EMBED_DIM));         // missing
assert!(needs_backfill(Some(&zeros), CLIP_EMBED_DIM)); // all-zero
assert!(needs_backfill(Some(&[0.0; 1]), CLIP_EMBED_DIM)); // wrong dim

| Constant | Value | Backend |
|---|---|---|
| CLIP_EMBED_DIM | 512 | onnx-clip |
| BLIP_EMBED_DIM | 768 | candle-* |
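
To make the helper semantics concrete, here is a stand-in re-implementation inferred from the assertions above (missing, all-zero, or wrong-width embeddings need backfill). The signatures are assumptions and the real functions in sb_vision::embed_tag may differ; `stale_rows` is a hypothetical addition showing how a backfill job might select rows.

```rust
/// Stand-in for sb-vision's helper: true when every component is exactly 0.0.
/// An empty slice counts as zero here (an assumption).
pub fn is_zero_embedding(v: &[f32]) -> bool {
    v.iter().all(|x| *x == 0.0)
}

/// Stand-in for sb-vision's helper: a row needs regeneration when the
/// embedding is missing, has the wrong width, or is an all-zero placeholder.
pub fn needs_backfill(embedding: Option<&[f32]>, expected_dim: usize) -> bool {
    match embedding {
        None => true,
        Some(v) => v.len() != expected_dim || is_zero_embedding(v),
    }
}

/// Hypothetical: return the indices of rows that should be re-queued.
pub fn stale_rows(rows: &[Option<Vec<f32>>], expected_dim: usize) -> Vec<usize> {
    rows.iter()
        .enumerate()
        .filter(|(_, row)| needs_backfill(row.as_deref(), expected_dim))
        .map(|(i, _)| i)
        .collect()
}
```

Passing the expected dimension as an argument lets the same scan work for either backend width.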

Error Handling

All API functions return anyhow::Result. The structured VisionError enum is available for callers that need to branch on failure mode:

| Error variant | Meaning |
|---|---|
| FeatureDisabled | Required backend feature not compiled in |
| ModelNotConfigured | CLIP_MODEL_PATH not set (onnx-clip only) |
| ModelDownload | HuggingFace download failed |
| Preprocess | Image decode/resize/tensor conversion failed |
| Inference | Forward pass failed; re-queue with backoff |
| Tokenizer | Tokenizer load/encode failed |
| Io | Filesystem I/O error |
Caution: Never treat inference failures as successful zero-embeddings. Pre-SBAI-2199 behavior silently returned all-zero vectors, causing all assets to be treated as identical in Qdrant. The current implementation returns hard errors; callers must propagate them.
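
Branching on the failure mode can be sketched with the standard library alone. The enum below is a stand-in that mirrors only the variant names from the table (the real VisionError may carry payloads), and the retry policy is illustrative: only the Inference mapping comes from the table. With anyhow, `anyhow::Error::downcast_ref::<VisionError>()` plays the role that `dyn Error::downcast_ref` plays here.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in enum mirroring the variant names listed above.
#[derive(Debug)]
pub enum VisionError {
    FeatureDisabled,
    ModelNotConfigured,
    ModelDownload,
    Preprocess,
    Inference,
    Tokenizer,
    Io,
}

impl fmt::Display for VisionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "vision error: {:?}", self)
    }
}

impl Error for VisionError {}

/// Illustrative caller actions; not defined by sb-vision.
#[derive(Debug, PartialEq)]
pub enum Action {
    RetryWithBackoff, // transient: re-queue the asset
    FixConfiguration, // build flags or environment are wrong
    Fail,             // surface the error; never store a zero vector
}

/// Map a type-erased error to a caller action.
pub fn classify(err: &(dyn Error + 'static)) -> Action {
    match err.downcast_ref::<VisionError>() {
        Some(VisionError::Inference) | Some(VisionError::ModelDownload) => {
            Action::RetryWithBackoff
        }
        Some(VisionError::FeatureDisabled) | Some(VisionError::ModelNotConfigured) => {
            Action::FixConfiguration
        }
        // Preprocess, Tokenizer, Io, and unknown error types.
        Some(_) | None => Action::Fail,
    }
}
```

Whatever the policy, no branch produces a placeholder embedding, which keeps the indexing pipeline consistent with the rule above.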

Backend Selection Order

When multiple backend features are enabled, the active backend is determined at compile time:

  1. candle-cpu / candle-cuda / candle-metal — BLIP/ViT via candle (preferred)
  2. onnx-clip — CLIP ViT-B/32 via ONNX Runtime (embedding-only)
  3. onnx — tract-onnx CPU fallback (mobile stub)
  4. No feature — VisionError::FeatureDisabled
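
The priority order can be modeled as a plain function over the set of enabled feature names. This is a runtime sketch for illustration only: the crate itself resolves the backend at compile time, presumably through `cfg` attributes, and the `Backend` enum and `select_backend` function are hypothetical.

```rust
/// Hypothetical backend label; not an sb-vision type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Candle,   // BLIP/ViT via candle
    OnnxClip, // CLIP ViT-B/32 via ONNX Runtime
    Onnx,     // tract-onnx CPU fallback
}

/// Pick the active backend from a list of enabled Cargo feature names,
/// following the documented priority. `None` corresponds to the
/// FeatureDisabled case.
pub fn select_backend(enabled: &[&str]) -> Option<Backend> {
    let has = |name: &str| enabled.iter().any(|f| *f == name);
    if has("candle-cpu") || has("candle-cuda") || has("candle-metal") {
        Some(Backend::Candle)
    } else if has("onnx-clip") {
        Some(Backend::OnnxClip)
    } else if has("onnx") {
        Some(Backend::Onnx)
    } else {
        None
    }
}
```

For example, enabling both onnx and candle-cuda resolves to the candle backend, since candle takes priority over either ONNX path.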

Consumers

| Consumer | Repo | Functions used |
|---|---|---|
| ComfyUI integration | studiobrain-ai | sb_vision::tag() |
| Manifest indexer | studiobrain-ai | sb_vision::embed(), sb_vision::tag() |
| Cloud AI service | studiobrain-ai | PyO3 bindings |
| Desktop app | studiobrain-app | Cargo git dependency |
| Mobile app | studiobrain-app | Cargo git dependency (candle-metal) |