# Failure scenarios and recovery
Troubleshooting, teardown, and rebuild procedures.
## Troubleshooting

### CUDA not detected
- Verify GPU passthrough (see infra/proxmox.md)
- Check nvidia-smi
- Confirm CUDA/PyTorch version alignment
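A quick check from inside the training environment (assuming PyTorch is installed) shows whether the runtime sees the GPU and which CUDA build PyTorch was compiled against; that build must be compatible with the driver version reported by nvidia-smi:

```python
import torch

# If this prints False, the GPU is not visible to the VM/container:
# re-check passthrough and the NVIDIA container runtime.
print("CUDA available:", torch.cuda.is_available())

# CUDA version PyTorch was compiled against; compare with the driver's
# supported CUDA version shown in the header of nvidia-smi.
print("PyTorch CUDA build:", torch.version.cuda)

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```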
### Out-of-Memory (OOM)
- Reduce context length
- Increase gradient accumulation
- Reduce LoRA rank
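A minimal sketch of what those three knobs typically look like in a Hugging Face Transformers + PEFT setup; the repo's actual trainer script, target modules, and values may differ:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Lower LoRA rank -> fewer trainable parameters and less optimizer state.
lora_config = LoraConfig(
    r=8,                                  # e.g. down from 32
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Keep the effective batch size while cutting peak VRAM: smaller per-device
# batch, more accumulation steps, and gradient checkpointing.
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
)

# Context length is usually capped at tokenization time, e.g.
# tokenizer(text, truncation=True, max_length=1024)
```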
### Slow OCR
- Increase parallel workers
- Disable PDF image fallback if unnecessary
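A sketch of page-level parallelism; pytesseract stands in for whatever OCR engine the pipeline actually uses, and the worker count is an assumption to tune against physical cores:

```python
from concurrent.futures import ProcessPoolExecutor

from PIL import Image
from pytesseract import image_to_string  # stand-in OCR engine for illustration

def ocr_page(path: str) -> str:
    # OCR one page image; replace with the pipeline's real OCR call.
    return image_to_string(Image.open(path))

def ocr_all(paths: list[str], workers: int = 8) -> list[str]:
    # OCR is CPU-bound, so use processes rather than threads.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, paths))
```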
### Hallucinations / low quality
- Improve dataset quality
- Add system prompts
- Increase eval dataset size
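For the system-prompt point, a sketch of prepending a fixed system turn to chat-formatted training examples; the field names and prompt text are illustrative, not the dataset's actual schema:

```python
SYSTEM_PROMPT = "Answer only from the provided context; say 'unknown' if unsure."

def add_system_prompt(example: dict) -> dict:
    # Prepend a system turn so training and inference share the same
    # grounding instruction, which helps curb hallucinations.
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    return {"messages": messages + example["messages"]}
```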
```mermaid
flowchart TD
    A[CUDA not detected] --> A1[Check passthrough]
    A1 --> A2[nvidia-smi]
    A2 --> A3[CUDA/PyTorch versions]
    B[OOM] --> B1[Reduce context / LoRA rank]
    C[Slow OCR] --> C1[More workers]
    D[Low quality] --> D1[Dataset + system prompts]
```
## Safe teardown
- Stop Docker services
- Snapshot VMs
- Back up:
  - Datasets
  - LoRA adapters
  - Docker volumes
- Remove containers
- Power down VMs
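A minimal backup sketch for the steps above; the compose file path matches infra/docker-compose.yml, but the dataset, adapter, and backup paths are assumptions:

```python
import os
import shutil
import subprocess

# Stop the stack first so volumes and files are in a consistent state.
subprocess.run(
    ["docker", "compose", "-f", "infra/docker-compose.yml", "down"],
    check=True,
)

# Archive datasets and LoRA adapters (source paths are hypothetical).
os.makedirs("backups", exist_ok=True)
for name, src in [("datasets", "data/datasets"), ("adapters", "models/adapters")]:
    shutil.make_archive(f"backups/{name}", "gztar", src)

# Named Docker volumes can be exported with a throwaway container, e.g.:
#   docker run --rm -v <volume>:/data -v "$PWD/backups":/backup alpine \
#     tar czf /backup/<volume>.tar.gz -C /data .
```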
## Rebuild
- Restore VM from template or ISO
- Reapply GPU passthrough (see infra/proxmox.md)
- Restore datasets and adapters
- Deploy Docker stack (infra/docker-compose.yml)
- Validate inference and training
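A smoke test for the final step, assuming an OpenAI-compatible inference endpoint on localhost; the port, model name, and endpoint path are assumptions:

```python
import requests
import torch

# Training side: the rebuilt VM must see the GPU again after passthrough.
assert torch.cuda.is_available(), "GPU not visible; re-check passthrough"

# Inference side: one tiny completion against the (hypothetical) local API.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local-model",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```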