Vision Embedding & Tagging API (sb-vision)

The sb-vision crate provides local vision embedding and image tagging for StudioBrain. It wraps BLIP-2, CLIP, and ViT models via the candle ML framework (preferred) with a tract-onnx fallback for mobile CPU paths.

Inputs are pre-thumbnailed to 512×512 by sb-thumbnails; this crate re-samples internally to the model-specific size.

Public API

Three functions form the public API, consumed by desktop, mobile (Tauri), cloud AI services, ComfyUI integration, and the manifest indexer:

use image::DynamicImage;
 
/// Compute a vision embedding for an image.
pub fn embed(image: &DynamicImage) -> anyhow::Result<Vec<f32>>;
 
/// Produce string tags for an image (descending confidence order).
pub fn tag(image: &DynamicImage) -> anyhow::Result<Vec<String>>;
 
/// Compute embedding and tags in a single call (more efficient).
pub fn embed_and_tag(image: &DynamicImage) -> anyhow::Result<(Vec<f32>, Vec<String>)>;

Return Types

Function	Returns	Dimension / ordering
`embed`	`Vec<f32>`	768-d (BLIP) or 512-d (CLIP via ONNX)
`tag`	`Vec<String>`	Variable, descending confidence
`embed_and_tag`	`(Vec<f32>, Vec<String>)`	Matches `embed` + `tag`

Opaque dimensionality: Callers should treat the embedding dimension as opaque. Store whatever the active backend returns and do not hardcode dimensions. The backfill helpers (is_zero_embedding, needs_backfill) let you detect stale rows that need regeneration.
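One way to follow this in consuming code is to derive the dimension from the stored vector and refuse to compare vectors of different lengths. The sketch below is illustrative only: `StoredEmbedding` and `cosine_similarity` are hypothetical names and are not part of sb-vision.

```rust
/// Hypothetical storage record: keeps whatever vector the backend returned.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredEmbedding {
    pub vector: Vec<f32>,
}

impl StoredEmbedding {
    /// The dimension is derived from the data, never hardcoded.
    pub fn dim(&self) -> usize {
        self.vector.len()
    }
}

/// Cosine similarity, or `None` when the two vectors cannot be compared
/// (different dimensions, empty vectors, or a zero norm).
pub fn cosine_similarity(a: &StoredEmbedding, b: &StoredEmbedding) -> Option<f32> {
    if a.dim() != b.dim() || a.dim() == 0 {
        return None;
    }
    let dot: f32 = a.vector.iter().zip(&b.vector).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.vector.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.vector.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

Returning `None` for mismatched lengths keeps a 768-d row from being silently compared against a 512-d row after a backend change.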

Models

Backend	Model	Embedding dim	Task
`candle-cpu/cuda/metal`	`Salesforce/blip-image-captioning-large`	768	CLS embedding + caption-based tags
`candle-cpu/cuda/metal`	`google/vit-base-patch16-224`	—	ImageNet-1K classification tags
`onnx-clip`	CLIP ViT-B/32 (ONNX Runtime)	512	L2-normalized embeddings (embedding-only, no tags)
`onnx`	Generic ViT ONNX (stub)	512	Pure-Rust mobile CPU fallback
`florence2-tract`	`Microsoft/Florence-2-base` (ONNX)	512	Region-aware tags (stub — follow-up required)

Florence-2 Status

Florence-2 support is stubbed in this release. candle-transformers 0.9 does not yet ship Florence-2’s custom attention implementation. Two paths forward:

candle path — unblocked when candle-transformers ships Florence-2.
ONNX path — florence2-tract feature, requires exporting the model to ONNX and pushing to GHCR artifact registry.

Feature Flags

The crate ships with zero native dependencies by default. Enable a backend feature for real inference.

Feature	Description	Native deps
`candle-cpu`	Pure-Rust candle inference (CPU)	None
`candle-cuda`	Candle + NVIDIA CUDA (implies `candle-cpu`)	CUDA toolkit
`candle-metal`	Candle + Apple Metal (implies `candle-cpu`)	Metal framework
`onnx-clip`	CLIP ViT-B/32 via ONNX Runtime (ort crate)	ONNX Runtime
`onnx`	tract-onnx CPU fallback (mobile)	None
`florence2-tract`	Florence-2 via ONNX (implies `onnx`)	ONNX artifact required
`pyo3`	Python bindings for cloud AI service	Python headers
`full`	Convenience — `candle-cpu` for CI	None

⚠️ Default build has no inference. A default cargo build -p sb-vision compiles with zero native dependencies and returns VisionError::FeatureDisabled for all API calls. Enable at least one backend feature, such as candle-cpu, for real inference.

Quick Start

use image::DynamicImage;

fn main() -> anyhow::Result<()> {
    // Open an image (sb-thumbnails handles the 512x512 pre-thumbnail).
    let img: DynamicImage = image::open("asset.jpg")?;

    // Get an embedding (768-dimensional with the candle BLIP backend).
    let embedding = sb_vision::embed(&img)?;

    // Get ImageNet-1K tags (e.g., ["mountain", "landscape", "outdoor"]).
    let tags = sb_vision::tag(&img)?;

    // Or get both in one call (shares preprocessing).
    let (embedding, tags) = sb_vision::embed_and_tag(&img)?;

    // Errors are propagated with `?` rather than unwrapped.
    Ok(())
}

Model Cache

Models are downloaded from HuggingFace Hub and cached on disk:

OS	Cache path
Linux	`$XDG_CACHE_HOME/sb-vision` or `~/.cache/sb-vision`
macOS	`~/Library/Caches/sb-vision`
Windows	`%LOCALAPPDATA%\sb-vision`

Cache layout:

~/.cache/sb-vision/
  Salesforce_blip-image-captioning-large/
    model.safetensors
    tokenizer.json
    config.json
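The path rules in the table can be sketched as a pure function. This is an assumption-laden illustration, not the crate's resolver: `cache_root` and `model_dir_name` are hypothetical names, the `os` strings follow `std::env::consts::OS`, and the `/` to `_` mapping is inferred from the layout shown above.

```rust
use std::path::PathBuf;

/// Resolve the cache root from explicit inputs so the logic is easy to test.
/// `os` uses the values of `std::env::consts::OS` ("linux", "macos", "windows").
pub fn cache_root(
    os: &str,
    xdg_cache_home: Option<&str>,
    home: Option<&str>,
    local_app_data: Option<&str>,
) -> Option<PathBuf> {
    let base = match os {
        // Linux: prefer $XDG_CACHE_HOME, fall back to ~/.cache.
        "linux" => match xdg_cache_home {
            Some(xdg) if !xdg.is_empty() => PathBuf::from(xdg),
            _ => PathBuf::from(home?).join(".cache"),
        },
        // macOS: ~/Library/Caches.
        "macos" => PathBuf::from(home?).join("Library").join("Caches"),
        // Windows: %LOCALAPPDATA%.
        "windows" => PathBuf::from(local_app_data?),
        _ => return None,
    };
    Some(base.join("sb-vision"))
}

/// Directory name for a model id, assuming `/` is replaced by `_`
/// (inferred from the documented cache layout).
pub fn model_dir_name(model_id: &str) -> String {
    model_id.replace('/', "_")
}
```

At a call site the inputs would come from `std::env::consts::OS` and `std::env::var`, for example `std::env::var("XDG_CACHE_HOME").ok().as_deref()`.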

Environment Variables

Variable	Purpose	Default
`HF_TOKEN`	HuggingFace Hub authentication (for gated/private models)	None
`SB_VISION_NO_DOWNLOAD`	Disable automatic model downloads (set to `1` for CI/air-gapped)	`0`
`CLIP_MODEL_PATH`	Path to CLIP ViT-B/32 ONNX file (required for `onnx-clip` backend)	None

⚠️ CLIP_MODEL_PATH is a hard requirement for onnx-clip. The backend returns VisionError::ModelNotConfigured if it is not set; there is no silent zero-vector fallback.
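A caller that wants to validate its environment before invoking the crate might check these variables as in the sketch below. The helper names and the local `ModelNotConfigured` struct are hypothetical stand-ins, and treating an empty `CLIP_MODEL_PATH` as unset is an assumption, not documented behavior.

```rust
use std::path::PathBuf;

/// True when downloads are disabled. Assumes only the value "1" disables
/// them, matching the documented default of "0".
pub fn downloads_disabled(value: Option<&str>) -> bool {
    value == Some("1")
}

/// Local stand-in for the failure reported when CLIP_MODEL_PATH is unset.
#[derive(Debug, PartialEq)]
pub struct ModelNotConfigured;

/// Return the configured CLIP model path, or an error when it is missing.
/// An empty value is treated as missing (an assumption of this sketch).
pub fn clip_model_path(value: Option<&str>) -> Result<PathBuf, ModelNotConfigured> {
    match value {
        Some(p) if !p.is_empty() => Ok(PathBuf::from(p)),
        _ => Err(ModelNotConfigured),
    }
}
```

The values would typically be read with `std::env::var("SB_VISION_NO_DOWNLOAD").ok().as_deref()` and passed in, which keeps the checks themselves free of global state.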

Python Bindings (PyO3)

Enable the pyo3 feature for Python consumption by the cloud AI service:

from sb_vision import embed_from_path, tag_from_path, embed_and_tag_from_path
 
embedding = embed_from_path("asset.jpg")
tags = tag_from_path("asset.jpg")
embedding, tags = embed_and_tag_from_path("asset.jpg")

Backfill Helpers

For assets ingested before real vision embeddings were available (SBAI-2199), the crate ships helpers to identify stale rows:

use sb_vision::embed_tag::{is_zero_embedding, needs_backfill, CLIP_EMBED_DIM, BLIP_EMBED_DIM};

// An all-zero placeholder vector, as written by pre-SBAI-2199 ingestion.
let zeros = [0.0_f32; 512];

// Detect all-zero placeholder vectors.
assert!(is_zero_embedding(&zeros));

// Check if an embedding needs regeneration.
assert!(needs_backfill(None, CLIP_EMBED_DIM));         // missing
assert!(needs_backfill(Some(&zeros), CLIP_EMBED_DIM)); // all-zero
assert!(needs_backfill(Some(&[0.0; 1]), CLIP_EMBED_DIM)); // wrong dim

Constant	Value	Backend
`CLIP_EMBED_DIM`	512	`onnx-clip`
`BLIP_EMBED_DIM`	768	`candle-*`
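The three conditions exercised above (missing, all-zero, wrong dimension) can be expressed in a few lines. The following is a sketch of that logic, not the crate's implementation; the `_sketch` names and the signatures are assumptions.

```rust
/// Sketch: true when every component is exactly 0.0.
pub fn is_zero_embedding_sketch(v: &[f32]) -> bool {
    v.iter().all(|x| *x == 0.0)
}

/// Sketch: an embedding needs regeneration when it is missing,
/// has the wrong dimension, or is an all-zero placeholder.
pub fn needs_backfill_sketch(embedding: Option<&[f32]>, expected_dim: usize) -> bool {
    match embedding {
        None => true,
        Some(v) => v.len() != expected_dim || is_zero_embedding_sketch(v),
    }
}
```

A backfill job could run this predicate over stored rows and re-queue only those for which it returns true.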

Error Handling

All API functions return anyhow::Result. The structured VisionError enum is available for callers that need to branch on failure mode:

Error Variant	Meaning
`FeatureDisabled`	Required backend feature not compiled in
`ModelNotConfigured`	`CLIP_MODEL_PATH` not set (onnx-clip only)
`ModelDownload`	HuggingFace download failed
`Preprocess`	Image decode/resize/tensor conversion failed
`Inference`	Forward-pass failed — re-queue with backoff
`Tokenizer`	Tokenizer load/encode failed
`Io`	Filesystem I/O error

🚫 Never treat inference failures as successful zero-embeddings. Pre-SBAI-2199 behavior silently returned all-zero vectors, causing all assets to be treated as identical in Qdrant. The current implementation returns hard errors, and callers must propagate them.
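Branching on the failure mode could look like the sketch below. It uses only the standard library so that it is self-contained: `VisionErrorSketch` is a stand-in that copies the variant names from the table (the real enum may carry payloads), and `Action` and `action_for` are hypothetical. With anyhow, `anyhow::Error::downcast_ref::<VisionError>()` would play the role that `downcast_ref` on `dyn Error` plays here. Only the retry for `Inference` comes from the table; the other mappings are illustrative policy.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in with the documented variant names; payloads are omitted.
#[derive(Debug, PartialEq)]
pub enum VisionErrorSketch {
    FeatureDisabled,
    ModelNotConfigured,
    ModelDownload,
    Preprocess,
    Inference,
    Tokenizer,
    Io,
}

impl fmt::Display for VisionErrorSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "vision error: {:?}", self)
    }
}

impl Error for VisionErrorSketch {}

/// Hypothetical caller-side policy.
#[derive(Debug, PartialEq)]
pub enum Action {
    RetryWithBackoff,
    FixConfiguration,
    Fail,
}

/// Recover the structured error from a type-erased one and pick an action.
pub fn action_for(err: &(dyn Error + 'static)) -> Action {
    match err.downcast_ref::<VisionErrorSketch>() {
        // The table recommends re-queueing inference failures with backoff.
        Some(VisionErrorSketch::Inference) => Action::RetryWithBackoff,
        // Assumed policy: these need a build or environment change.
        Some(VisionErrorSketch::FeatureDisabled)
        | Some(VisionErrorSketch::ModelNotConfigured) => Action::FixConfiguration,
        // Everything else, including foreign error types, is surfaced as a failure.
        _ => Action::Fail,
    }
}
```

No branch converts an error into an embedding, so a failure can never be stored as a zero vector.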

Backend Selection Order

When multiple backend features are enabled, the active backend is determined at compile time, in this priority order:

1. candle-cpu / candle-cuda / candle-metal: BLIP/ViT via candle (preferred)
2. onnx-clip: CLIP ViT-B/32 via ONNX Runtime (embedding-only)
3. onnx: tract-onnx CPU fallback (mobile stub)
4. No feature: VisionError::FeatureDisabled
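The priority order can be modeled as a small function over the set of enabled feature names. This is a runtime illustration of a decision the crate makes at compile time (presumably through `cfg` attributes); `Backend` and `select_backend` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Candle,
    OnnxClip,
    Onnx,
}

/// Pick the backend by priority from a list of enabled feature names.
/// `None` corresponds to the documented `VisionError::FeatureDisabled` case.
pub fn select_backend(enabled: &[&str]) -> Option<Backend> {
    let has = |name: &str| enabled.iter().any(|f| *f == name);
    if has("candle-cpu") || has("candle-cuda") || has("candle-metal") {
        Some(Backend::Candle)
    } else if has("onnx-clip") {
        Some(Backend::OnnxClip)
    } else if has("onnx") {
        Some(Backend::Onnx)
    } else {
        None
    }
}
```

The sketch does not expand implied features: Cargo would enable `onnx` automatically for `florence2-tract`, so a caller of this function would need to pass both names.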

Consumers

Consumer	Repo	Functions used
ComfyUI integration	`studiobrain-ai`	`sb_vision::tag()`
Manifest indexer	`studiobrain-ai`	`sb_vision::embed()`, `sb_vision::tag()`
Cloud AI service	`studiobrain-ai`	PyO3 bindings
Desktop app	`studiobrain-app`	Cargo git dependency
Mobile app	`studiobrain-app`	Cargo git dependency (`candle-metal`)
