# QLoRA fine-tuning workflow

Environment setup, dataset layout, and training for the ai-training VM.
## Environment setup (ai-training VM)

Use a dedicated virtual environment for all ML tooling.

```bash
apt update
apt install -y python3 python3-venv python3-pip git build-essential

python3 -m venv /opt/llm-venv
source /opt/llm-venv/bin/activate
pip install --upgrade pip
```
### Core Python requirements

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

pip install \
  transformers \
  accelerate \
  peft \
  bitsandbytes \
  datasets \
  sentencepiece \
  safetensors \
  trl
```
Verify GPU availability:

```bash
python - <<'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
EOF
```
## Model selection

- Qwen3 8B
  - Loaded via Ollama for inference
  - Loaded directly via Hugging Face Transformers for fine-tuning

Inference is handled via Ollama, which exposes an HTTP API consumed by Open WebUI and mobile clients.
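For reference, a minimal request against that HTTP API (default port 11434); the `qwen3:8b` tag is an assumption and should match whatever `ollama list` reports:

```bash
# Assumes Ollama is on its default port and a model tagged "qwen3:8b" exists
# locally (check with `ollama list`); adjust the tag if it differs.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Summarize what QLoRA is in one sentence.",
  "stream": false
}'
```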
## Dataset layout

Recommended structure (shared NFS volume):

```text
/datasets
├── raw
│   ├── ocr
│   └── curated
├── processed
│   ├── train.jsonl
│   └── eval.jsonl
└── adapters
```
```mermaid
flowchart LR
    raw[raw / ocr, curated] --> processed[processed / train.jsonl, eval.jsonl]
    processed --> adapters[adapters]
```
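The record schema is decided by the preprocessing step; as an assumed example, a single line of `train.jsonl` using a plain `text` field (which SFT-style trainers can consume directly) could look like:

```json
{"text": "### Instruction:\nSummarize the attached OCR page.\n\n### Response:\nA short, curated summary of the page goes here."}
```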
## Why QLoRA
- Enables fine-tuning large models on limited VRAM
- Keeps base weights frozen
- Stores only small adapter checkpoints
- Suitable for RTX 3070 (8 GB VRAM)
## Training script

See ml/scripts/train_qlora.py for a runnable example (a condensed sketch follows the list below). Core pattern:

- Load Qwen3 8B with `load_in_4bit=True, device_map="auto"`
- Apply LoRA to `q_proj`, `v_proj` (e.g. `r=8`, `lora_alpha=16`)
- Use `SFTTrainer` with the dataset from `/datasets/processed/train.jsonl`
- Save the adapter to `/datasets/adapters/qwen3-lora`
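The snippet below is a condensed sketch of that pattern, not a copy of ml/scripts/train_qlora.py; the Hugging Face model ID, hyperparameters, and the exact `SFTTrainer` signature (which varies between trl releases) are assumptions to adapt.

```python
# Sketch only: mirrors the bullets above, not the actual ml/scripts/train_qlora.py.
# Model ID, hyperparameters, and SFTTrainer arguments are assumptions; the
# SFTTrainer API differs between trl versions (older releases may also need
# dataset_text_field="text").
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

MODEL_ID = "Qwen/Qwen3-8B"  # assumed Hugging Face repo for Qwen3 8B

# QLoRA: load the frozen base model in 4-bit (the doc's load_in_4bit=True shorthand)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Small trainable adapter on the attention projections only
lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)

dataset = load_dataset(
    "json", data_files="/datasets/processed/train.jsonl", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=TrainingArguments(
        output_dir="/datasets/adapters/qwen3-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,  # helps fit 8 GB VRAM
        num_train_epochs=1,
        logging_steps=10,
    ),
)
trainer.train()
trainer.save_model("/datasets/adapters/qwen3-lora")  # writes only the adapter weights
```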
## Loading LoRA for inference
Adapters can be merged or dynamically loaded at inference time depending on the backend.
- Merged adapters: lower latency, static behavior
- Dynamic adapters: flexible, slightly higher latency
For Ollama, adapters are typically merged offline into a new model variant.
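A minimal sketch of the offline merge step using peft; paths and the base model ID are assumptions, and converting the merged weights into an Ollama model (e.g. GGUF plus a Modelfile) is a separate step not shown here.

```python
# Sketch: merge a trained LoRA adapter back into the base weights.
# Paths/model ID are assumptions; the merged model still needs to be
# converted and registered with Ollama (e.g. GGUF + Modelfile) afterwards.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen3-8B"                      # assumed base model repo
ADAPTER_DIR = "/datasets/adapters/qwen3-lora"  # adapter produced by training
MERGED_DIR = "/datasets/adapters/qwen3-lora-merged"

# Load an unquantized copy of the base model (merging into 4-bit weights is
# not supported), attach the adapter, and fold the LoRA deltas into the base.
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="cpu")
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

merged.save_pretrained(MERGED_DIR, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained(MERGED_DIR)
```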
## Parallel and overnight training (5pm–9am)
We run training (and heavy OCR) in a 5pm–9am window on remote machines (RTX 4070, RTX 3090) in addition to the primary host. Use these practices so future scripts are safe to parallelize and schedule.
### Scheduling

- Window: 5pm–9am local time (use the same timezone on all machines). Use cron or systemd timers (see the illustrative crontab entry below) and avoid overlapping with peak inference hours.
- One run per GPU per host unless you explicitly use multi-GPU (e.g. `accelerate`). Pin the GPU with `CUDA_VISIBLE_DEVICES` so each process sees a single device.
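For illustration only (the repo path, log location, and exact times are assumptions), a crontab entry per training host might look like this; the 09:00 stop is deliberately blunt and relies on checkpointing so the run can resume the next evening.

```bash
# Illustrative crontab entries (crontab -e on each training host).
# Paths are placeholders; adjust to where the repo and logs actually live.
0 17 * * * CUDA_VISIBLE_DEVICES=0 /opt/llm-venv/bin/python /path/to/repo/ml/scripts/train_qlora.py >> /datasets/logs/cron-$(hostname -s).log 2>&1
0 9 * * * pkill -f train_qlora.py   # blunt end of window; resume from the last checkpoint next run
```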
### Script design
| Practice | Why |
|---|---|
| No interactive prompts | Scripts run under cron/systemd; read config from env vars or a config file. |
| Unique run IDs | Use a run ID (e.g. RUN_ID=${RUN_ID:-$(date +%Y%m%d-%H%M%S)}) for logs and output paths so parallel runs don’t overwrite each other. |
| Paths from env | e.g. DATASET_DIR="${DATASET_DIR:-/datasets}", OUTPUT_DIR="${OUTPUT_DIR:-/datasets/adapters}". Same script works on every host; override per machine. |
| Checkpoint and resume | Use --save_strategy steps (or equivalent) and --save_steps; support --resume_from_checkpoint. Jobs that get killed can resume. |
| Log to files | Redirect stdout/stderr to timestamped or run-ID log files so you can inspect runs after the fact. |
| Exit codes | Exit 0 on success, non-zero on failure so schedulers or wrappers can alert or retry. |
| Idempotent where possible | If a run is restarted, skip already-done work (e.g. existing checkpoints, processed shards) instead of redoing everything. |
### Shared state (NFS)

- Datasets: read from a shared path (e.g. `/datasets/processed/train.jsonl`) so all machines see the same data.
- Outputs: write adapters/logs to per-run or per-host directories (e.g. `/datasets/adapters/run-{RUN_ID}-{HOSTNAME}`) to avoid collisions when multiple jobs run in parallel.
- Locks (optional): for a single "next job" consumer, use a lockfile (e.g. `flock`, sketched below) or a small coordinator script that assigns shards or configs to workers.
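A minimal sketch of the lockfile approach; the lock path and wrapped command are assumptions, and note that `flock` semantics over NFS depend on the NFS version and mount options.

```bash
#!/usr/bin/env bash
# Sketch: ensure only one "next job" consumer runs at a time via flock.
# LOCKFILE path and the wrapped command are assumptions.
set -euo pipefail

LOCKFILE="${LOCKFILE:-/datasets/.training.lock}"

exec 200>"$LOCKFILE"
if ! flock -n 200; then
  echo "another job holds $LOCKFILE, exiting" >&2
  exit 0
fi

# ...launch the actual training job while holding the lock, e.g.:
# python ml/scripts/train_qlora.py --output_dir "$DATASET_DIR/adapters/$RUN_ID"
```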
### Example env-driven invocation

Future training scripts should accept `--output_dir`, `--logging_dir`, and `--resume_from_checkpoint`, and read base paths from env (e.g. `DATASET_DIR`). Example pattern:
```bash
# Example: one training job per machine, 5pm–9am
export DATASET_DIR=/datasets
export RUN_ID=$(date +%Y%m%d-%H%M%S)-$(hostname -s)
export CUDA_VISIBLE_DEVICES=0

python ml/scripts/train_qlora.py \
  --output_dir "$DATASET_DIR/adapters/$RUN_ID" \
  --logging_dir "$DATASET_DIR/logs/$RUN_ID" \
  --resume_from_checkpoint auto \
  >> "$DATASET_DIR/logs/$RUN_ID.out" 2>> "$DATASET_DIR/logs/$RUN_ID.err"
```
The current train_qlora.py uses fixed paths; extend it (or new scripts) with these options for parallel/overnight use.
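A sketch of how a script can pick up those options with env-var defaults; the flag names match the invocation above, while the defaults and the `auto` handling are assumptions.

```python
# Sketch of the env-driven option pattern described above; defaults and the
# "auto" resume handling are assumptions, not behavior of the current script.
import argparse
import os


def parse_args() -> argparse.Namespace:
    dataset_dir = os.environ.get("DATASET_DIR", "/datasets")
    run_id = os.environ.get("RUN_ID", "manual")

    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir",
                        default=os.path.join(dataset_dir, "adapters", run_id))
    parser.add_argument("--logging_dir",
                        default=os.path.join(dataset_dir, "logs", run_id))
    parser.add_argument("--resume_from_checkpoint", default=None)
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # Map "auto" to True so transformers' Trainer resumes from the newest
    # checkpoint found in output_dir; otherwise pass the value through.
    resume = True if args.resume_from_checkpoint == "auto" else args.resume_from_checkpoint
    # trainer.train(resume_from_checkpoint=resume)
    print(args.output_dir, args.logging_dir, resume)
```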
### Multi-host coordination (optional)

- Option A: the same cron entry on each host; each runs one job with a distinct `RUN_ID` (and optionally different data shards or hyperparameters).
- Option B: a single coordinator (e.g. on the primary host) that SSHs to each remote and launches one training process per GPU, passing `RUN_ID`, `CUDA_VISIBLE_DEVICES`, and paths via env.
- Option C: a simple queue (e.g. a file-based or Redis list of "run configs"); each host pulls a config, runs training, then marks it done and pulls the next until the window ends. A minimal file-based sketch follows below.
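For Option C, a minimal file-based variant; the directory layout and the `--config` flag are assumptions (the current script has no such flag), and mv-as-claim over NFS is only a best-effort guard rather than a true atomic lock.

```bash
#!/usr/bin/env bash
# Sketch of a file-based queue worker (Option C). Directory layout and the
# --config flag are assumptions; mv is used as a cheap claim so two hosts
# are unlikely to pick the same config.
set -euo pipefail

QUEUE_DIR="${QUEUE_DIR:-/datasets/queue/pending}"
CLAIM_DIR="${CLAIM_DIR:-/datasets/queue/running}"
DONE_DIR="${DONE_DIR:-/datasets/queue/done}"
HOST="$(hostname -s)"

while [ "$(date +%H)" != "09" ]; do
  cfg="$(ls "$QUEUE_DIR" 2>/dev/null | head -n 1 || true)"
  [ -n "$cfg" ] || break   # queue empty

  # Claim the config; if another host grabbed it first, mv fails and we retry.
  if mv "$QUEUE_DIR/$cfg" "$CLAIM_DIR/$HOST-$cfg" 2>/dev/null; then
    # Run one training job; on failure the claimed file stays in CLAIM_DIR
    # for inspection instead of being marked done.
    if python ml/scripts/train_qlora.py --config "$CLAIM_DIR/$HOST-$cfg"; then
      mv "$CLAIM_DIR/$HOST-$cfg" "$DONE_DIR/"
    fi
  fi
done
```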