
# Capacity planning

Users, tokens/sec, and GPU limits for the self-hosted Qwen3 LLM stack.

## Hardware platform

### Primary (inference + training)

| Component   | Specification                         |
|-------------|---------------------------------------|
| CPU         | Threadripper Pro                      |
| Motherboard | WRX80D8-2T                            |
| Memory      | 128 GB ECC RDIMM (32 GB × 4 channels) |
| GPU         | NVIDIA RTX 3070 (8 GB VRAM)           |
| Storage     | NVMe SSD (host + VM disks)            |

### Remote ML/OCR machines

| Machine  | GPU             | VRAM  | Intended use                  |
|----------|-----------------|-------|-------------------------------|
| Remote 1 | NVIDIA RTX 4070 | 12 GB | OCR, QLoRA training (5pm–9am) |
| Remote 2 | NVIDIA RTX 3090 | 24 GB | OCR, QLoRA training (5pm–9am) |

Training and heavy OCR jobs are parallelized across these remote machines during the 5pm–9am off-peak (overnight) window. See ml/training.md for script best practices.


## Expected throughput

| Component            | Expected throughput              |
|----------------------|----------------------------------|
| OCR (Marker)         | ~300–600 pages/hour (32 threads) |
| Inference (Qwen3 8B) | ~20–35 tokens/sec                |
| Concurrent users     | 3–6 interactive users            |
| QLoRA training       | ~1–2 hrs / 10k samples           |

These numbers vary with document complexity, prompt length, and concurrent load.
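As a rough sanity check on the concurrent-user figure, divide aggregate decode throughput by a per-user comfort threshold. The ~6 tokens/sec per-user threshold below is an assumption (roughly human reading speed), not a measured value from this stack:

```python
def max_interactive_users(aggregate_tps: float, per_user_tps: float = 6.0) -> int:
    """Rough ceiling on concurrent interactive users.

    aggregate_tps: total decode throughput of the inference box (tok/s).
    per_user_tps: assumed minimum tok/s for a responsive chat experience.
    """
    return int(aggregate_tps // per_user_tps)

# Using the measured 20–35 tok/s aggregate range for Qwen3 8B:
print(max_interactive_users(20))  # → 3
print(max_interactive_users(35))  # → 5
```

This lands at 3–5 users, consistent with the 3–6 range in the table; a slightly lower per-user threshold yields the upper end.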

## Capacity flow (conceptual)

```mermaid
flowchart LR
    Users[3–6 users] --> WebUI[Open WebUI]
    WebUI --> Ollama[Ollama]
    Ollama --> GPU[RTX 3070]
```

## Limitations