# Failure scenarios and recovery
Troubleshooting, teardown, and rebuild procedures.
## Troubleshooting

### CUDA not detected
- Verify GPU passthrough (see infra/proxmox.md)
- Check nvidia-smi
- Confirm CUDA/PyTorch version alignment
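A quick check from inside the training environment (assuming PyTorch is installed) shows whether the runtime sees the GPU and which CUDA build PyTorch was compiled against; that build must be compatible with the driver version reported by nvidia-smi:

```python
import torch

# If this prints False, the GPU is not visible to the VM/container:
# re-check passthrough and the NVIDIA container runtime.
print("CUDA available:", torch.cuda.is_available())

# CUDA version PyTorch was compiled against; compare with the driver's
# supported CUDA version shown in the header of nvidia-smi.
print("PyTorch CUDA build:", torch.version.cuda)

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```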
### Out-of-Memory (OOM)
- Reduce context length
- Increase gradient accumulation
- Reduce LoRA rank
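A minimal sketch of what those three knobs typically look like in a Hugging Face Transformers + PEFT setup; the repo's actual trainer script, target modules, and values may differ:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Lower LoRA rank -> fewer trainable parameters and less optimizer state.
lora_config = LoraConfig(
    r=8,                                  # e.g. down from 32
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Keep the effective batch size while cutting peak VRAM: smaller per-device
# batch, more accumulation steps, and gradient checkpointing.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
)

# Context length is usually capped at tokenization time, e.g.
# tokenizer(text, truncation=True, max_length=1024)
```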
### Slow OCR
- Increase parallel workers
- Disable PDF image fallback if unnecessary
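A sketch of page-level parallelism; pytesseract stands in for whatever OCR engine the pipeline actually uses, and the worker count is an assumption to tune against physical cores:

```python
from concurrent.futures import ProcessPoolExecutor

from PIL import Image
from pytesseract import image_to_string  # stand-in OCR engine for illustration

def ocr_page(path: str) -> str:
    # OCR one page image; replace with the pipeline's real OCR call.
    return image_to_string(Image.open(path))

def ocr_all(paths: list[str], workers: int = 8) -> list[str]:
    # OCR is CPU-bound, so use processes rather than threads.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, paths))
```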
### Hallucinations / low quality
- Improve dataset quality
- Add system prompts
- Increase eval dataset size
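For the system-prompt point, a sketch of prepending a fixed system turn to chat-formatted training examples; the field names and prompt text are illustrative, not the dataset's actual schema:

```python
SYSTEM_PROMPT = "Answer only from the provided context; say 'unknown' if unsure."

def add_system_prompt(example: dict) -> dict:
    # Prepend a system turn so training and inference share the same
    # grounding instruction, which helps curb hallucinations.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    return {"messages": messages + example["messages"]}
```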
```mermaid
flowchart TD
    A[CUDA not detected] --> A1[Check passthrough]
    A1 --> A2[nvidia-smi]
    A2 --> A3[CUDA/PyTorch versions]
    B[OOM] --> B1[Reduce context / LoRA rank]
    C[Slow OCR] --> C1[More workers]
    D[Low quality] --> D1[Dataset + system prompts]
```
## Safe teardown
- Stop Docker services
- Snapshot VMs
- Back up:
  - Datasets
  - LoRA adapters
  - Docker volumes
- Remove containers
- Power down VMs
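A minimal backup sketch for the steps above; the compose file path matches infra/docker-compose.yml, but the dataset, adapter, and backup paths are assumptions:

```python
import os
import shutil
import subprocess

# Stop the stack first so volumes and files are in a consistent state.
subprocess.run(
    ["docker", "compose", "-f", "infra/docker-compose.yml", "down"],
    check=True,
)

# Archive datasets and LoRA adapters (source paths are hypothetical).
os.makedirs("backups", exist_ok=True)
for name, src in [("datasets", "data/datasets"), ("adapters", "models/adapters")]:
    shutil.make_archive(f"backups/{name}", "gztar", src)

# Named Docker volumes can be exported with a throwaway container, e.g.:
#   docker run --rm -v <volume>:/data -v "$PWD/backups":/backup alpine \
#     tar czf /backup/<volume>.tar.gz -C /data .
```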
## Rebuild
- Restore VM from template or ISO
- Reapply GPU passthrough (see infra/proxmox.md)
- Restore datasets and adapters
- Deploy Docker stack (infra/docker-compose.yml)
- Validate inference and training
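A smoke test for the final step, assuming an OpenAI-compatible inference endpoint on localhost; the port, model name, and endpoint path are assumptions:

```python
import requests
import torch

# Training side: the rebuilt VM must see the GPU again after passthrough.
assert torch.cuda.is_available(), "GPU not visible; re-check passthrough"

# Inference side: one tiny completion against the (hypothetical) local API.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```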