# QLoRA fine-tuning workflow

Environment setup, dataset layout, and training for the ai-training VM.
## Environment setup (ai-training VM)

Use a dedicated virtual environment for all ML tooling.

```bash
apt update
apt install -y python3 python3-venv python3-pip git build-essential

python3 -m venv /opt/llm-venv
source /opt/llm-venv/bin/activate
pip install --upgrade pip
```
### Core Python requirements

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

pip install \
  transformers \
  accelerate \
  peft \
  bitsandbytes \
  datasets \
  sentencepiece \
  safetensors \
  trl
```
Verify GPU availability:

```bash
python - <<'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
EOF
```
## Model selection

- Qwen3 8B
  - Loaded via Ollama for inference
  - Loaded directly via Hugging Face Transformers for fine-tuning

Inference is handled via Ollama, which exposes an HTTP API consumed by Open WebUI and mobile clients.
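For reference, a minimal request against that HTTP API (default port 11434); the `qwen3:8b` tag is an assumption and should match whatever `ollama list` reports:

```bash
# Assumes Ollama is on its default port and a model tagged "qwen3:8b" exists
# locally (check with `ollama list`); adjust the tag if it differs.
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:8b",
  "prompt": "Summarize what QLoRA is in one sentence.",
  "stream": false
}'
```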
## Dataset layout

Recommended structure (shared NFS volume):

```text
/datasets
├── raw
│   ├── ocr
│   └── curated
├── processed
│   ├── train.jsonl
│   └── eval.jsonl
└── adapters
```
```mermaid
flowchart LR
    raw[raw / ocr, curated] --> processed[processed / train.jsonl, eval.jsonl]
    processed --> adapters[adapters]
```
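The record schema is decided by the preprocessing step; as an assumed example, a single line of `train.jsonl` using a plain `text` field (which SFT-style trainers can consume directly) could look like:

```json
{"text": "### Instruction:\nSummarize the attached OCR page.\n\n### Response:\nA short, curated summary of the page goes here."}
```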
## Why QLoRA
- Enables fine-tuning large models on limited VRAM
- Keeps base weights frozen
- Stores only small adapter checkpoints
- Suitable for RTX 3070 (8 GB VRAM)
## Training script

See ml/scripts/train_qlora.py for a runnable example (a condensed sketch follows the list below). Core pattern:

- Load Qwen3 8B with `load_in_4bit=True, device_map="auto"`
- Apply LoRA to `q_proj`, `v_proj` (e.g. `r=8`, `lora_alpha=16`)
- Use `SFTTrainer` with the dataset from `/datasets/processed/train.jsonl`
- Save the adapter to `/datasets/adapters/qwen3-lora`
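The snippet below is a condensed sketch of that pattern, not a copy of ml/scripts/train_qlora.py; the Hugging Face model ID, hyperparameters, and the exact `SFTTrainer` signature (which varies between trl releases) are assumptions to adapt.

```python
# Sketch only: mirrors the bullets above, not the actual ml/scripts/train_qlora.py.
# Model ID, hyperparameters, and SFTTrainer arguments are assumptions; the
# SFTTrainer API differs between trl versions (older releases may also need
# dataset_text_field="text").
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

MODEL_ID = "Qwen/Qwen3-8B"  # assumed Hugging Face repo for Qwen3 8B

# QLoRA: load the frozen base model in 4-bit (the doc's load_in_4bit=True shorthand)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Small trainable adapter on the attention projections only
lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)

dataset = load_dataset(
    "json", data_files="/datasets/processed/train.jsonl", split="train"
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
    args=TrainingArguments(
        output_dir="/datasets/adapters/qwen3-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        gradient_checkpointing=True,  # helps fit 8 GB VRAM
        num_train_epochs=1,
        logging_steps=10,
    ),
)
trainer.train()
trainer.save_model("/datasets/adapters/qwen3-lora")  # writes only the adapter weights
```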
## Loading LoRA for inference
Adapters can be merged or dynamically loaded at inference time depending on the backend.
- Merged adapters: lower latency, static behavior
- Dynamic adapters: flexible, slightly higher latency
For Ollama, adapters are typically merged offline into a new model variant.
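A minimal sketch of the offline merge step using peft; paths and the base model ID are assumptions, and converting the merged weights into an Ollama model (e.g. GGUF plus a Modelfile) is a separate step not shown here.

```python
# Sketch: merge a trained LoRA adapter back into the base weights.
# Paths/model ID are assumptions; the merged model still needs to be
# converted and registered with Ollama (e.g. GGUF + Modelfile) afterwards.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "Qwen/Qwen3-8B"                      # assumed base model repo
ADAPTER_DIR = "/datasets/adapters/qwen3-lora"  # adapter produced by training
MERGED_DIR = "/datasets/adapters/qwen3-lora-merged"

# Load an unquantized copy of the base model (merging into 4-bit weights is
# not supported), attach the adapter, and fold the LoRA deltas into the base.
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype="auto", device_map="cpu")
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

merged.save_pretrained(MERGED_DIR, safe_serialization=True)
AutoTokenizer.from_pretrained(BASE_ID).save_pretrained(MERGED_DIR)
```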
## Parallel and overnight training (5pm–9am)
We run training (and heavy OCR) in a 5pm–9am window on remote machines (RTX 4070, RTX 3090) in addition to the primary host. Use these practices so future scripts are safe to parallelize and schedule.
### Scheduling

- Window: 5pm–9am local time (use the same timezone on all machines). Use cron or systemd timers (see the illustrative crontab entry below) and avoid overlapping with peak inference hours.
- One run per GPU per host unless you explicitly use multi-GPU (e.g. `accelerate`). Pin the GPU with `CUDA_VISIBLE_DEVICES` so each process sees a single device.
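For illustration only (the repo path, log location, and exact times are assumptions), a crontab entry per training host might look like this; the 09:00 stop is deliberately blunt and relies on checkpointing so the run can resume the next evening.

```bash
# Illustrative crontab entries (crontab -e on each training host).
# Paths are placeholders; adjust to where the repo and logs actually live.
0 17 * * * CUDA_VISIBLE_DEVICES=0 /opt/llm-venv/bin/python /path/to/repo/ml/scripts/train_qlora.py >> /datasets/logs/cron-$(hostname -s).log 2>&1
0 9 * * * pkill -f train_qlora.py   # blunt end of window; resume from the last checkpoint next run
```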
### Script design
| Practice | Why |
|---|---|
| No interactive prompts | Scripts run under cron/systemd; read config from env vars or a config file. |
| Unique run IDs | Use a run ID (e.g. RUN_ID=${RUN_ID:-$(date +%Y%m%d-%H%M%S)}) for logs and output paths so parallel runs don’t overwrite each other. |
| Paths from env | e.g. DATASET_DIR="${DATASET_DIR:-/datasets}", OUTPUT_DIR="${OUTPUT_DIR:-/datasets/adapters}". Same script works on every host; override per machine. |
| Checkpoint and resume | Use --save_strategy steps (or equivalent) and --save_steps; support --resume_from_checkpoint. Jobs that get killed can resume. |
| Log to files | Redirect stdout/stderr to timestamped or run-ID log files so you can inspect runs after the fact. |
| Exit codes | Exit 0 on success, non-zero on failure so schedulers or wrappers can alert or retry. |
| Idempotent where possible | If a run is restarted, skip already-done work (e.g. existing checkpoints, processed shards) instead of redoing everything. |
### Shared state (NFS)

- Datasets: read from a shared path (e.g. `/datasets/processed/train.jsonl`) so all machines see the same data.
- Outputs: write adapters/logs to per-run or per-host directories (e.g. `/datasets/adapters/run-{RUN_ID}-{HOSTNAME}`) to avoid collisions when multiple jobs run in parallel.
- Locks (optional): for a single "next job" consumer, use a lockfile (e.g. `flock`, sketched below) or a small coordinator script that assigns shards or configs to workers.
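A minimal sketch of the lockfile approach; the lock path and wrapped command are assumptions, and note that `flock` semantics over NFS depend on the NFS version and mount options.

```bash
#!/usr/bin/env bash
# Sketch: ensure only one "next job" consumer runs at a time via flock.
# LOCKFILE path and the wrapped command are assumptions.
set -euo pipefail

LOCKFILE="${LOCKFILE:-/datasets/.training.lock}"

exec 200>"$LOCKFILE"
if ! flock -n 200; then
  echo "another job holds $LOCKFILE, exiting" >&2
  exit 0
fi

# ...launch the actual training job while holding the lock, e.g.:
# python ml/scripts/train_qlora.py --output_dir "$DATASET_DIR/adapters/$RUN_ID"
```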
### Example env-driven invocation

Future training scripts should accept `--output_dir`, `--logging_dir`, and `--resume_from_checkpoint`, and read base paths from env (e.g. `DATASET_DIR`). Example pattern:
```bash
# Example: one training job per machine, 5pm–9am
export DATASET_DIR=/datasets
export RUN_ID=$(date +%Y%m%d-%H%M%S)-$(hostname -s)
export CUDA_VISIBLE_DEVICES=0

python ml/scripts/train_qlora.py \
  --output_dir "$DATASET_DIR/adapters/$RUN_ID" \
  --logging_dir "$DATASET_DIR/logs/$RUN_ID" \
  --resume_from_checkpoint auto \
  >> "$DATASET_DIR/logs/$RUN_ID.out" 2>> "$DATASET_DIR/logs/$RUN_ID.err"
```
The current train_qlora.py uses fixed paths; extend it (or new scripts) with these options for parallel/overnight use.
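A sketch of how a script can pick up those options with env-var defaults; the flag names match the invocation above, while the defaults and the `auto` handling are assumptions.

```python
# Sketch of the env-driven option pattern described above; defaults and the
# "auto" resume handling are assumptions, not behavior of the current script.
import argparse
import os


def parse_args() -> argparse.Namespace:
    dataset_dir = os.environ.get("DATASET_DIR", "/datasets")
    run_id = os.environ.get("RUN_ID", "manual")

    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir",
                        default=os.path.join(dataset_dir, "adapters", run_id))
    parser.add_argument("--logging_dir",
                        default=os.path.join(dataset_dir, "logs", run_id))
    parser.add_argument("--resume_from_checkpoint", default=None)
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # Map "auto" to True so transformers' Trainer resumes from the newest
    # checkpoint found in output_dir; otherwise pass the value through.
    resume = True if args.resume_from_checkpoint == "auto" else args.resume_from_checkpoint
    # trainer.train(resume_from_checkpoint=resume)
    print(args.output_dir, args.logging_dir, resume)
```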
### Multi-host coordination (optional)

- Option A: the same cron entry on each host; each runs one job with a distinct `RUN_ID` (and optionally different data shards or hyperparameters).
- Option B: a single coordinator (e.g. on the primary host) that SSHs to each remote and launches one training process per GPU, passing `RUN_ID`, `CUDA_VISIBLE_DEVICES`, and paths via env.
- Option C: a simple queue (e.g. a file-based or Redis list of "run configs"); each host pulls a config, runs training, then marks it done and pulls the next until the window ends. A minimal file-based sketch follows below.
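For Option C, a minimal file-based variant; the directory layout and the `--config` flag are assumptions (the current script has no such flag), and mv-as-claim over NFS is only a best-effort guard rather than a true atomic lock.

```bash
#!/usr/bin/env bash
# Sketch of a file-based queue worker (Option C). Directory layout and the
# --config flag are assumptions; mv is used as a cheap claim so two hosts
# are unlikely to pick the same config.
set -euo pipefail

QUEUE_DIR="${QUEUE_DIR:-/datasets/queue/pending}"
CLAIM_DIR="${CLAIM_DIR:-/datasets/queue/running}"
DONE_DIR="${DONE_DIR:-/datasets/queue/done}"
HOST="$(hostname -s)"

while [ "$(date +%H)" != "09" ]; do
  cfg="$(ls "$QUEUE_DIR" 2>/dev/null | head -n 1 || true)"
  [ -n "$cfg" ] || break   # queue empty

  # Claim the config; if another host grabbed it first, mv fails and we retry.
  if mv "$QUEUE_DIR/$cfg" "$CLAIM_DIR/$HOST-$cfg" 2>/dev/null; then
    # Run one training job; on failure the claimed file stays in CLAIM_DIR
    # for inspection instead of being marked done.
    if python ml/scripts/train_qlora.py --config "$CLAIM_DIR/$HOST-$cfg"; then
      mv "$CLAIM_DIR/$HOST-$cfg" "$DONE_DIR/"
    fi
  fi
done
```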