Failure scenarios and recovery

Troubleshooting, teardown, and rebuild procedures.

Troubleshooting

CUDA not detected

Out-of-Memory (OOM)

Slow OCR

Hallucinations / low quality

flowchart TD
  A[CUDA not detected] --> A1[Check passthrough]
  A1 --> A2[nvidia-smi]
  A2 --> A3[CUDA/PyTorch versions]
  B[OOM] --> B1[Reduce context / LoRA rank]
  C[Slow OCR] --> C1[More workers]
  D[Low quality] --> D1[Dataset + system prompts]
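The "CUDA not detected" branch can be sketched as a small script run inside the guest VM. This is a minimal sketch rather than the project's own tooling. It assumes a Linux guest (the passthrough check reads sysfs), that nvidia-smi is on PATH once the driver is installed, and that PyTorch is the framework in use. All function names are hypothetical.

```python
import glob
import shutil
import subprocess

# PCI vendor ID assigned to NVIDIA.
NVIDIA_PCI_VENDOR_ID = "0x10de"


def check_passthrough():
    """Step 1: is any NVIDIA PCI device visible to this guest (Linux sysfs)?"""
    for path in glob.glob("/sys/bus/pci/devices/*/vendor"):
        try:
            with open(path) as fh:
                if fh.read().strip().lower() == NVIDIA_PCI_VENDOR_ID:
                    return True, f"NVIDIA device found at {path}"
        except OSError:
            continue
    return False, "no NVIDIA PCI device visible to this machine"


def check_nvidia_smi():
    """Step 2: does the driver respond through nvidia-smi?"""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False, "nvidia-smi not found on PATH"
    try:
        proc = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"nvidia-smi failed to run: {exc}"
    output = (proc.stdout or proc.stderr).strip()
    return proc.returncode == 0, output[:200]


def check_torch():
    """Step 3: does the installed PyTorch build see a CUDA device?"""
    try:
        import torch
    except (ImportError, OSError) as exc:
        return False, f"PyTorch could not be imported: {exc}"
    ok = bool(torch.cuda.is_available())
    return ok, f"torch {torch.__version__}, built for CUDA {torch.version.cuda}"


def diagnose_cuda():
    """Run the three checks in flowchart order and collect the results."""
    steps = [
        ("passthrough", check_passthrough),
        ("nvidia-smi", check_nvidia_smi),
        ("cuda/pytorch", check_torch),
    ]
    return [(name, *check()) for name, check in steps]


if __name__ == "__main__":
    for name, ok, detail in diagnose_cuda():
        print(f"{'OK  ' if ok else 'FAIL'} {name}: {detail}")
```

The first failing step is usually the one to fix: a missing PCI device points at passthrough on the hypervisor, a failing nvidia-smi points at the guest driver, and a failing PyTorch check with a working nvidia-smi points at a CUDA/PyTorch version mismatch.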

Safe teardown

  1. Stop Docker services
  2. Snapshot VMs
  3. Back up:
    • Datasets
    • LoRA adapters
    • Docker volumes
  4. Remove containers
  5. Power down VMs
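The teardown order above can be sketched as a dry-run plan that prints each command without executing it. This is a sketch under stated assumptions: the VMs run on Proxmox (so `qm` is used for snapshot and shutdown), the stack is managed with `docker compose`, and the VM ID, backup directory, data directories, and volume name are hypothetical placeholders. The function `teardown_plan` is likewise illustrative.

```python
import shlex
from pathlib import PurePosixPath


def teardown_plan(vmid, compose_file, backup_dir, data_dirs, volumes,
                  snapshot="pre_teardown"):
    """Return the teardown steps as (where, argv) pairs, in runbook order."""
    backup = PurePosixPath(backup_dir)
    compose = ["docker", "compose", "-f", compose_file]
    plan = []

    # 1. Stop Docker services (containers are stopped, not removed).
    plan.append(("guest", compose + ["stop"]))

    # 2. Snapshot the VM from the Proxmox host.
    plan.append(("host", ["qm", "snapshot", str(vmid), snapshot]))

    # 3a. Archive dataset and LoRA adapter directories.
    for directory in data_dirs:
        src = PurePosixPath(directory)
        archive = backup / f"{src.name}.tar.gz"
        plan.append(("guest", ["tar", "-czf", str(archive),
                               "-C", str(src.parent), src.name]))

    # 3b. Archive named Docker volumes through a throwaway container.
    for volume in volumes:
        plan.append(("guest", [
            "docker", "run", "--rm",
            "-v", f"{volume}:/data:ro",
            "-v", f"{backup}:/backup",
            "alpine", "tar", "-czf", f"/backup/{volume}.tar.gz", "-C", "/data", ".",
        ]))

    # 4. Remove containers (named volumes survive because -v is not passed).
    plan.append(("guest", compose + ["down"]))

    # 5. Power down the VM.
    plan.append(("host", ["qm", "shutdown", str(vmid)]))
    return plan


if __name__ == "__main__":
    # All arguments below are hypothetical examples.
    steps = teardown_plan(
        vmid=101,
        compose_file="infra/docker-compose.yml",
        backup_dir="/mnt/backup",
        data_dirs=["/srv/datasets", "/srv/adapters"],
        volumes=["ocr_cache"],
    )
    for where, argv in steps:
        print(f"[{where}] {shlex.join(argv)}")
```

Keeping the plan as data makes the ordering reviewable before anything destructive runs. The `host` and `guest` labels mark where each command would be executed.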

Rebuild

  1. Restore VM from template or ISO
  2. Reapply GPU passthrough (see infra/proxmox.md)
  3. Restore datasets and adapters
  4. Deploy Docker stack (infra/docker-compose.yml)
  5. Validate inference and training
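Before validating inference and training, it can help to confirm that the datasets and adapters restored in step 3 match what was backed up. The sketch below is one possible approach and is not part of the original runbook: it assumes a checksum manifest was built from the source directory at backup time, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(manifest, restored_root):
    """Return (missing, changed) relative paths compared with the manifest."""
    restored = build_manifest(restored_root)
    missing = sorted(set(manifest) - set(restored))
    changed = sorted(
        name for name in manifest.keys() & restored.keys()
        if manifest[name] != restored[name]
    )
    return missing, changed
```

A restore is clean when both returned lists are empty. Files present only in the restored tree are ignored by this sketch.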