
QLoRA fine-tuning workflow

Environment setup, dataset layout, and training for the ai-training VM.

Environment setup (ai-training VM)

Use a dedicated virtual environment for all ML tooling.

apt update
apt install -y python3 python3-venv python3-pip git build-essential
python3 -m venv /opt/llm-venv
source /opt/llm-venv/bin/activate
pip install --upgrade pip

Core Python requirements

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install \
  transformers \
  accelerate \
  peft \
  bitsandbytes \
  datasets \
  sentencepiece \
  safetensors \
  trl

Verify GPU availability:

python - <<EOF
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
EOF

Model selection

Inference is handled via Ollama, exposing an HTTP API consumed by Open WebUI and mobile clients.
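For reference, a minimal sketch of a client call against Ollama's HTTP API (default port 11434). The model name llama3 is a placeholder for whichever base or merged variant is installed on the server.

import requests

# Ask Ollama for a single non-streamed completion; "llama3" is a placeholder model name.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])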

Dataset layout

Recommended structure (shared NFS volume):

/datasets
├── raw
│   ├── ocr
│   └── curated
├── processed
│   ├── train.jsonl
│   └── eval.jsonl
└── adapters
flowchart LR
  raw[raw / ocr, curated] --> processed[processed / train.jsonl, eval.jsonl]
  processed --> adapters[adapters]
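
The processed splits can be loaded directly with the datasets library installed above; a minimal sketch, assuming standard JSON-lines records at the paths shown in the layout:

from datasets import load_dataset

# Load the processed splits from the shared NFS volume.
data_files = {
    "train": "/datasets/processed/train.jsonl",
    "eval": "/datasets/processed/eval.jsonl",
}
dataset = load_dataset("json", data_files=data_files)
print(dataset)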

Why QLoRA

QLoRA keeps the base model frozen in 4-bit (NF4) quantization and trains small low-rank (LoRA) adapters on top. Weights for a 7B model drop from roughly 14 GB in fp16 to about 3.5 GB in 4-bit, so fine-tuning fits on the consumer GPUs used for overnight runs (RTX 4070, RTX 3090), and the output is a small adapter rather than a full model copy.

Training script

See ml/scripts/train_qlora.py for a runnable example. Core pattern:
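
A minimal sketch of that pattern with transformers, peft, and bitsandbytes follows. The model ID, LoRA hyperparameters, and the assumption that each record has a "text" field are placeholders; ml/scripts/train_qlora.py remains the source of truth.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "meta-llama/Llama-3.1-8B"  # placeholder base model

# 4-bit NF4 quantization for the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Small trainable low-rank adapters on the attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)

# Tokenize the processed JSONL splits (assumes a "text" field per record).
data = load_dataset(
    "json",
    data_files={"train": "/datasets/processed/train.jsonl",
                "eval": "/datasets/processed/eval.jsonl"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

data = data.map(tokenize, batched=True, remove_columns=data["train"].column_names)

args = TrainingArguments(
    output_dir="/datasets/adapters/example-run",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=1,
    save_strategy="steps",
    save_steps=200,
    logging_steps=10,
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["eval"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained(args.output_dir)  # writes only the LoRA adapter weights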

Loading LoRA for inference

Adapters can be merged or dynamically loaded at inference time depending on the backend.

For Ollama, adapters are typically merged offline into a new model variant.
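
A minimal merge sketch with peft, assuming the adapter directory produced by the training step; the merged checkpoint still needs to be converted (for example to GGUF via llama.cpp tooling) before it can be registered as a new Ollama model.

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-3.1-8B"             # placeholder base model
ADAPTER_DIR = "/datasets/adapters/example-run"  # adapter output from training
MERGED_DIR = "/datasets/adapters/example-run-merged"

# Load the base model in full precision, apply the LoRA weights, and fold them in.
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

merged.save_pretrained(MERGED_DIR)
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained(MERGED_DIR)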


Parallel and overnight training (5pm–9am)

We run training (and heavy OCR) in a 5pm–9am window on remote machines (RTX 4070, RTX 3090) in addition to the primary host. Use these practices so future scripts are safe to parallelize and schedule.

Scheduling

Jobs start from cron or systemd timers at 17:00 and must checkpoint regularly so they can be stopped, or left to resume the next evening, when the window closes at 09:00.

Script design

| Practice | Why |
| --- | --- |
| No interactive prompts | Scripts run under cron/systemd; read config from env vars or a config file. |
| Unique run IDs | Use a run ID (e.g. RUN_ID=${RUN_ID:-$(date +%Y%m%d-%H%M%S)}) for logs and output paths so parallel runs don’t overwrite each other. |
| Paths from env | e.g. DATASET_DIR="${DATASET_DIR:-/datasets}", OUTPUT_DIR="${OUTPUT_DIR:-/datasets/adapters}". Same script works on every host; override per machine. |
| Checkpoint and resume | Use --save_strategy steps (or equivalent) and --save_steps; support --resume_from_checkpoint. Jobs that get killed can resume. |
| Log to files | Redirect stdout/stderr to timestamped or run-ID log files so you can inspect runs after the fact. |
| Exit codes | Exit 0 on success, non-zero on failure so schedulers or wrappers can alert or retry. |
| Idempotent where possible | If a run is restarted, skip already-done work (e.g. existing checkpoints, processed shards) instead of redoing everything. |

Shared state (NFS)

/datasets is a shared NFS volume mounted on every training host, so raw data, processed splits, adapters, and logs are visible everywhere. Keep per-run output and log paths unique (see run IDs above) so machines never write to the same files.

Example env-driven invocation

Future training scripts should accept --output_dir, --logging_dir, --resume_from_checkpoint, and read base paths from env (e.g. DATASET_DIR). Example pattern:

# Example: one training job per machine, 5pm–9am
export DATASET_DIR=/datasets
export RUN_ID=$(date +%Y%m%d-%H%M%S)-$(hostname -s)
export CUDA_VISIBLE_DEVICES=0
mkdir -p "$DATASET_DIR/logs" "$DATASET_DIR/adapters"  # ensure shared dirs exist before redirecting logs

python ml/scripts/train_qlora.py \
  --output_dir "$DATASET_DIR/adapters/$RUN_ID" \
  --logging_dir "$DATASET_DIR/logs/$RUN_ID" \
  --resume_from_checkpoint auto \
  >> "$DATASET_DIR/logs/$RUN_ID.out" 2>> "$DATASET_DIR/logs/$RUN_ID.err"

The current train_qlora.py uses fixed paths; extend it (or new scripts) with these options for parallel/overnight use.
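
A sketch of the script-side counterpart, assuming the flag names above; get_last_checkpoint from transformers makes a --resume_from_checkpoint auto mode simple to implement.

import argparse
import os

from transformers.trainer_utils import get_last_checkpoint

def parse_args():
    # Base paths come from the environment so the same script runs on every host.
    dataset_dir = os.environ.get("DATASET_DIR", "/datasets")
    p = argparse.ArgumentParser()
    p.add_argument("--output_dir", default=os.path.join(dataset_dir, "adapters", "default-run"))
    p.add_argument("--logging_dir", default=os.path.join(dataset_dir, "logs", "default-run"))
    p.add_argument("--resume_from_checkpoint", default=None,
                   help='"auto" resumes from the newest checkpoint in --output_dir, if any')
    return p.parse_args()

args = parse_args()
resume = args.resume_from_checkpoint
if resume == "auto":
    # get_last_checkpoint returns None when no checkpoint-* directories exist yet.
    resume = get_last_checkpoint(args.output_dir) if os.path.isdir(args.output_dir) else None

# ...build TrainingArguments with output_dir/logging_dir as above...
# trainer.train(resume_from_checkpoint=resume)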

Multi-host coordination (optional)