
Troubleshooting

This page covers common issues, known bugs, debugging commands, and diagnostic procedures for StudioBrain deployments.

Common Issues

Backend Crash Loop on Cold Boot

Symptom: The backend container repeatedly restarts on first startup after a fresh deployment.

Cause: A transient race condition in model imports during the initial database setup. SQLAlchemy model registration can conflict when multiple import paths load simultaneously.

Resolution: Restart the backend container. The issue does not recur after the first successful boot.

docker restart studiobrain-backend

Prevention: This is a known startup timing issue. In most cases, Docker’s restart: unless-stopped policy handles it automatically — the container restarts and succeeds on the second attempt.


JWT Tokens Invalidate After Restart

Symptom: All users are logged out after a backend restart. Previously valid JWT tokens are rejected.

Cause: The JWT_SECRET environment variable is not set or is being regenerated on each startup. If the secret changes, all existing tokens become invalid.

Resolution: Set JWT_SECRET to a stable value in your docker/.env file:

# Generate a stable secret (run once)
openssl rand -hex 32
 
# Add to docker/.env
JWT_SECRET=your-stable-64-character-hex-string

The JWT secret must persist across container restarts. Never use a randomly generated value that changes on each boot.
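Why a changing secret invalidates every token can be sketched with Python's standard library. This is an illustrative HMAC signature check, not StudioBrain's actual JWT implementation:

```python
import base64
import hashlib
import hmac
import json

def sign(payload: dict, secret: str) -> str:
    """Sign a payload the way a JWT is signed with JWT_SECRET (simplified)."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()
    sig = hmac.new(secret.encode(), body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"

def verify(token: str, secret: str) -> bool:
    """A token verifies only if the secret matches the one used to sign it."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(secret.encode(), body.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

token = sign({"sub": "alice"}, "stable-secret")
assert verify(token, "stable-secret")           # same secret: token accepted
assert not verify(token, "regenerated-secret")  # new secret: every old token rejected
```

The second assertion is exactly what users experience after a restart with a regenerated secret: nothing is wrong with their tokens, but the signature no longer verifies.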


Entity Type Case Sensitivity

Symptom: API requests for entity types return 404 or empty results when using uppercase type names (e.g., Character instead of character).

Cause: The backend API expects lowercase entity type names in URL paths. The database stores the type as-is from the template, but API routes normalize to lowercase.

Resolution: Always use lowercase entity type names in API calls:

# Correct
curl http://localhost:8201/api/entity/character
 
# Incorrect (may return 404)
curl http://localhost:8201/api/entity/Character

The frontend handles this normalization automatically. This issue only affects direct API calls.
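If you script against the API, normalizing the type before building the URL avoids the problem entirely. A minimal sketch (the helper name is ours, not part of StudioBrain):

```python
def entity_url(base: str, entity_type: str) -> str:
    # API routes expect lowercase entity type names in the path
    return f"{base}/api/entity/{entity_type.lower()}"

assert entity_url("http://localhost:8201", "Character") == \
    "http://localhost:8201/api/entity/character"
```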


GPU Validation Warning in AI Service

Symptom: The AI service logs contain a warning about GPU validation: “Expected RTX 4090, found RTX PRO 6000” or similar.

Cause: The AI service includes a GPU validation check that was originally written for a specific GPU model. When running on different NVIDIA hardware, it logs a warning but continues to function normally.

Resolution: This is cosmetic only. The AI service works correctly with any NVIDIA GPU that supports CUDA. No action is required. The warning can be suppressed by removing or updating the GPU validation check in the AI service configuration.


Content Not Appearing After Deployment

Symptom: The frontend loads but shows no entities. The entity list pages are empty.

Cause: The content directory is not mounted correctly, or the initial sync has not completed.

Resolution:

  1. Check that the content volume is mounted:
docker exec studiobrain-backend ls /data/content/

You should see directories like Characters/, Locations/, _Templates/, etc.

  2. Check that CONTENT_BASE_PATH is set correctly:
docker exec studiobrain-backend env | grep CONTENT_BASE_PATH
  3. Check backend logs for sync errors:
docker logs studiobrain-backend 2>&1 | grep -i "sync\|error\|content"
  4. If SKIP_STARTUP_SYNC=true is set, the backend skips the initial content scan. Set it to false and restart:
docker compose restart backend

Frontend Proxy Errors (502 Bad Gateway)

Symptom: The frontend loads but API calls fail with 502 errors. The browser console shows failed requests to /api/*.

Cause: The frontend’s Next.js proxy cannot reach the backend container. This usually indicates the backend is not running or the Docker network is misconfigured.

Resolution:

  1. Verify the backend is running:
docker compose ps
docker logs studiobrain-backend --tail 20
  2. Test backend connectivity from within the Docker network:
docker exec studiobrain-frontend wget -qO- http://backend:8201/health
  3. Verify the BACKEND_URL build argument was set correctly. The frontend uses http://backend:8201 (the Docker service name) for internal routing:
docker exec studiobrain-frontend env | grep BACKEND

If the URL is wrong, rebuild the frontend:

docker compose build frontend
docker compose up -d frontend

Known Bugs

These are tracked bugs with JIRA references. Check the linked tickets for the latest status and any available fixes.

SBAI-214: ComfyUI Workflow Scan Failure

Error: 'str' object has no attribute 'get'

Context: The ComfyUI workflow scanner encounters a string where it expects a dictionary when parsing workflow JSON files. This affects the workflow browser in the AI Workshop.

Impact: ComfyUI workflow listing fails. Direct image generation through standard providers (OpenAI, Anthropic) is unaffected.

Workaround: Ensure workflow JSON files in _Plugins/comfyui/workflows/ are valid JSON objects, not raw strings.
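A quick way to find offending files is to check that each workflow parses to a JSON object. This helper is hypothetical (not part of StudioBrain), but it mirrors the check the scanner is missing:

```python
import json
from pathlib import Path

def find_invalid_workflows(workflow_dir: str) -> list[str]:
    """Return workflow files whose top-level JSON value is not an object."""
    bad = []
    for path in sorted(Path(workflow_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError:
            bad.append(str(path))
            continue
        # A bare JSON string here is what triggers "'str' object has no attribute 'get'"
        if not isinstance(data, dict):
            bad.append(str(path))
    return bad
```

Run it against your workflows directory and fix or remove any files it reports before reopening the workflow browser.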


SBAI-215: ComfyUI Job Completion Failure

Error: missing generation_time argument

Context: The ComfyUI job completion handler expects a generation_time parameter that is not being passed by the workflow execution pipeline.

Impact: ComfyUI jobs may complete on the ComfyUI server but fail to register as complete in StudioBrain.


SBAI-241: AI Service Missing python-frontmatter

Error: ModuleNotFoundError: No module named 'frontmatter'

Context: The python-frontmatter package is required for parsing entity markdown files in the AI service’s context builder.

Resolution: Add python-frontmatter to the AI service requirements:

# On the GPU host
echo "python-frontmatter>=1.0.0" >> /opt/studiobrain-ai/requirements-docker.txt
docker restart studiobrain-ai

SBAI-242: RAG Indexer NoneType Error

Error: 'NoneType' object is not subscriptable

Context: The RAG indexer encounters None values when processing entity content that has empty markdown bodies or missing frontmatter fields.

Impact: Some entities may not be indexed for RAG retrieval. The AI service continues to function but may miss context from affected entities.

Workaround: Ensure all entity markdown files have non-empty content bodies. The underlying fix is defensive coding: the indexer should handle None values gracefully instead of assuming every field is present.
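Until that fix lands, the defensive pattern looks like this. A sketch of the kind of guard the indexer needs (function and field names are illustrative, not StudioBrain's):

```python
def safe_index_fields(frontmatter, body):
    """Build an index record that tolerates missing frontmatter and empty bodies."""
    fm = frontmatter or {}              # the frontmatter block may be None entirely
    return {
        "title": fm.get("title", ""),
        "tags": fm.get("tags") or [],   # a present-but-null field also becomes []
        "body": (body or "").strip(),   # a None body indexes as empty, not a crash
    }

assert safe_index_fields(None, None) == {"title": "", "tags": [], "body": ""}
```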


SBAI-243: pynvml Not Installed

Error: ModuleNotFoundError: No module named 'pynvml'

Context: The AI service attempts to use pynvml for GPU VRAM monitoring. Without it, GPU memory statistics are unavailable in the health endpoint.

Resolution: Add pynvml (also known as nvidia-ml-py3) to the AI service requirements:

echo "pynvml>=11.5.0" >> /opt/studiobrain-ai/requirements-docker.txt
docker restart studiobrain-ai

Impact without fix: GPU monitoring returns null values. AI generation works normally.


SBAI-244: Grok Model Catalog HTTP 403

Error: HTTP 403 when fetching the Grok model catalog from the xAI API.

Context: The xAI API may reject model catalog requests if the API key lacks the required permissions or if the endpoint has changed.

Impact: Grok models are unavailable in the provider selection. Other providers (OpenAI, Anthropic, Google) work normally.

Workaround: Verify the GROK_API_KEY is valid and has the correct permissions. If the issue persists, disable Grok in the provider configuration.

Docker Debugging Commands

Container Status and Logs

# Show all container status
docker compose ps
 
# Follow logs for all services
docker compose logs -f
 
# Follow logs for a specific service
docker logs -f studiobrain-backend
docker logs -f studiobrain-frontend
docker logs -f studiobrain-caddy
docker logs -f studiobrain-ai
 
# Last N lines
docker logs --tail 100 studiobrain-backend
 
# Logs since a specific time
docker logs --since "2026-02-24T10:00:00" studiobrain-backend
 
# Search logs for errors
docker logs studiobrain-backend 2>&1 | grep -i error

Container Inspection

# Enter a running container
docker exec -it studiobrain-backend bash
docker exec -it studiobrain-frontend sh
 
# Check environment variables
docker exec studiobrain-backend env | sort
 
# Check file system
docker exec studiobrain-backend ls -la /data/content/
docker exec studiobrain-backend ls -la /data/db/
 
# Check process list
docker exec studiobrain-backend ps aux
 
# Check network connectivity from inside a container
docker exec studiobrain-backend curl -s http://localhost:8201/health
docker exec studiobrain-frontend wget -qO- http://backend:8201/health

Service Management

# Restart a single service
docker compose restart backend
 
# Rebuild and restart
docker compose build backend && docker compose up -d backend
 
# Full stack restart
docker compose down && docker compose up -d
 
# Force rebuild without cache
docker compose build --no-cache
docker compose up -d
 
# View resource usage
docker stats studiobrain-backend studiobrain-frontend studiobrain-caddy

Health Check Endpoints

Backend Health

curl http://localhost:8201/health

Healthy response:

{
  "status": "healthy",
  "entity_count": 236,
  "entity_types": ["character", "location", "brand", "district", "faction", "item", "job"]
}

Unhealthy indicators:

  • entity_count: 0 with SKIP_STARTUP_SYNC=false means the content scan failed
  • HTTP 500 means the backend did not start correctly (check logs)
  • Connection refused means the container is not running

Detailed Service Health

curl http://localhost:8201/api/services/health

Check each service status. Any value other than "healthy" or "connected" indicates an issue with that service.
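To script this check, flag any service whose status is neither value. This assumes (as the check above suggests) that the endpoint returns a flat JSON object mapping service names to status strings:

```python
def unhealthy_services(health: dict) -> list[str]:
    """Return the names of services not reporting 'healthy' or 'connected'."""
    ok = {"healthy", "connected"}
    return sorted(name for name, status in health.items() if status not in ok)

# Example: feed it the parsed JSON from /api/services/health
assert unhealthy_services({"database": "healthy", "redis": "connected"}) == []
assert unhealthy_services({"database": "healthy", "qdrant": "timeout"}) == ["qdrant"]
```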

AI Service Health

curl http://your-gpu-host:8202/health

Healthy response:

{
  "status": "healthy",
  "cloud_mode": false,
  "gpu_available": true
}

Common issues:

  • "gpu_available": false means CUDA is not accessible. Check NVIDIA drivers and the Container Toolkit.
  • Connection refused means the AI container is still starting (first boot installs pip dependencies, which takes several minutes).

Log Search Patterns

Common patterns to search for in logs when diagnosing issues:

# Startup errors
docker logs studiobrain-backend 2>&1 | grep -i "error\|traceback\|failed"
 
# Database connection issues
docker logs studiobrain-backend 2>&1 | grep -i "database\|connection\|sqlite\|postgres"
 
# Sync problems
docker logs studiobrain-backend 2>&1 | grep -i "sync\|markdown\|parse"
 
# Authentication issues
docker logs studiobrain-backend 2>&1 | grep -i "jwt\|auth\|token\|unauthorized"
 
# AI service errors
docker logs studiobrain-ai 2>&1 | grep -i "error\|cuda\|gpu\|model\|provider"
 
# Memory issues
docker logs studiobrain-ai 2>&1 | grep -i "oom\|memory\|vram"

Network Debugging

Testing Connectivity Between Services

From the app host, verify that all database services are reachable:

# PostgreSQL Auth DB
docker exec studiobrain-backend python -c "
import psycopg2
conn = psycopg2.connect('postgresql://studiobrain_auth:password@auth-host:5432/studiobrain_auth')
print('Auth DB: Connected')
conn.close()
"
 
# PostgreSQL Content DB
docker exec studiobrain-backend python -c "
import psycopg2
conn = psycopg2.connect('postgresql://studiobrain_app:password@content-host:5432/studiobrain_content')
print('Content DB: Connected')
conn.close()
"
 
# Qdrant
curl http://qdrant-host:6333/healthz
 
# Redis
redis-cli -h redis-host -a password ping

Checking NFS Mounts

# Verify NFS mount is active
mount | grep nfs
 
# Test read access
ls -la /mnt/studiobrain/content/
 
# Test write access (backend needs write)
touch /mnt/studiobrain/content/.write_test && rm /mnt/studiobrain/content/.write_test
 
# Check NFS performance
dd if=/dev/zero of=/mnt/studiobrain/content/.perf_test bs=1M count=10 2>&1 | tail -1
rm /mnt/studiobrain/content/.perf_test

Port Connectivity

# Check if services are listening
ss -tlnp | grep -E '(8201|8202|3100|5432|6333|6379)'
 
# Test port reachability from another host
nc -zv app-host 8201
nc -zv gpu-host 8202
nc -zv auth-db-host 5432
nc -zv content-db-host 5432
nc -zv qdrant-host 6333
nc -zv redis-host 6379

Database Troubleshooting

SQLite: Database is Locked

Symptom: sqlite3.OperationalError: database is locked

Cause: Multiple processes or connections are trying to write to SQLite simultaneously. SQLite supports only one writer at a time.

Resolution:

  1. Ensure only one backend instance is running:
docker compose ps | grep backend
  2. If using NFS-mounted SQLite (not recommended), switch to local storage or PostgreSQL. SQLite file locking does not work reliably over network filesystems.
  3. As a last resort, restart the backend to release the lock:
docker compose restart backend
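If lock errors persist even with a single writer, SQLite's own mitigations can help: a connection timeout makes writers wait for the lock instead of failing immediately, and WAL mode lets readers proceed alongside one writer. A standard-library sketch (these settings are suggestions, not StudioBrain defaults):

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "app.db")

# timeout=5.0: wait up to 5 seconds for a lock before raising "database is locked"
conn = sqlite3.connect(db_path, timeout=5.0)

# WAL journal mode: readers no longer block the single writer
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY)")
conn.close()
```

Note that WAL mode, like SQLite file locking generally, is not reliable over NFS, so this does not rescue a network-mounted database.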

PostgreSQL: Connection Refused

Symptom: psycopg2.OperationalError: could not connect to server: Connection refused

Cause: PostgreSQL is not running, the host is unreachable, or the firewall is blocking the connection.

Resolution:

  1. Verify PostgreSQL is running on the database host:
ssh db-host "systemctl status postgresql"
  2. Check that the port is open:
nc -zv db-host 5432
  3. Check that pg_hba.conf allows connections from the app server:
# pg_hba.conf on the database host
host  studiobrain_auth     studiobrain_auth  10.0.0.0/24  scram-sha-256
host  studiobrain_content  studiobrain_app   10.0.0.0/24  scram-sha-256
  4. Verify the firewall allows the connection:
ssh db-host "ufw status"  # or iptables -L

PostgreSQL: Too Many Connections

Symptom: FATAL: too many connections for role "studiobrain_app"

Resolution: Increase max_connections in postgresql.conf or add PgBouncer as a connection pooler:

# postgresql.conf
max_connections = 200

For persistent issues, deploy PgBouncer. See the Database Guide for configuration details.

Qdrant: Timeout or Connection Error

Symptom: Qdrant operations timeout or return connection errors.

Resolution:

  1. Check that Qdrant is running:
curl http://qdrant-host:6333/healthz
  2. Check disk space (Qdrant needs fast storage for HNSW indexes):
ssh qdrant-host "df -h"
  3. Check memory usage (Qdrant keeps hot indexes in RAM):
ssh qdrant-host "free -h"
  4. If Qdrant is unresponsive, restart it:
ssh qdrant-host "systemctl restart qdrant"

Redis: Connection Refused or AUTH Error

Symptom: redis.exceptions.ConnectionError or NOAUTH Authentication required

Resolution:

  1. Verify Redis is running:
redis-cli -h redis-host ping
  2. Check authentication:
redis-cli -h redis-host -a your_password ping

If this returns PONG, the connection works. Verify the password in your REDIS_URL matches the Redis configuration.

  3. Check that requirepass is set in redis.conf:
requirepass your_password

Diagnostic Checklist

When something is not working, work through this checklist:

  1. Are all containers running?

    docker compose ps
  2. Are there errors in the logs?

    docker compose logs --tail 50
  3. Can the backend reach its database?

    curl http://localhost:8201/health
  4. Can the frontend reach the backend?

    docker exec studiobrain-frontend wget -qO- http://backend:8201/health
  5. Is the content directory mounted and readable?

    docker exec studiobrain-backend ls /data/content/_Templates/
  6. Is the AI service reachable? (if deployed)

    curl http://gpu-host:8202/health
  7. Are database services reachable? (if using PostgreSQL)

    nc -zv auth-db-host 5432
    nc -zv content-db-host 5432
    nc -zv qdrant-host 6333
    nc -zv redis-host 6379
  8. Is DNS resolution working?

    docker exec studiobrain-backend nslookup your-domain.com
  9. Are there disk space issues?

    df -h
    docker system df
  10. Is the JWT secret stable?

    docker exec studiobrain-backend env | grep JWT_SECRET

    Compare with the value in docker/.env. They must match.