# Capacity planning
Users, tokens/sec, and GPU limits for the self-hosted Qwen3 LLM stack.
## Hardware platform

### Primary (inference + training)
| Component | Specification |
|---|---|
| CPU | Threadripper Pro |
| Motherboard | WRX80D8-2T |
| Memory | 128 GB ECC RDIMM (32 GB × 4 channels) |
| GPU | NVIDIA RTX 3070 (8 GB VRAM) |
| Storage | NVMe SSD (host + VM disks) |
### Remote ML/OCR machines
| Machine | GPU | VRAM | Intended use |
|---|---|---|---|
| Remote 1 | NVIDIA RTX 4070 | 12 GB | OCR, QLoRA training (5pm–9am) |
| Remote 2 | NVIDIA RTX 3090 | 24 GB | OCR, QLoRA training (5pm–9am) |
Training and heavy OCR jobs are parallelized across these remotes during the overnight off-peak window (5pm–9am). See ml/training.md for script best practices.
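
A minimal sketch of how a launcher could gate heavy jobs to that window (the `in_offpeak_window` helper and the exact bounds are illustrative assumptions, not an existing script):

```python
from datetime import datetime, time

# Off-peak window for remote QLoRA/OCR jobs (illustrative bounds).
WINDOW_START = time(17, 0)  # 5:00 pm
WINDOW_END = time(9, 0)     # 9:00 am the next day

def in_offpeak_window(now: datetime | None = None) -> bool:
    """True when `now` falls inside the overnight 5pm-9am window.

    The window wraps past midnight, so the test is "at/after start OR
    before end" rather than a simple between-check.
    """
    t = (now or datetime.now()).time()
    return t >= WINDOW_START or t < WINDOW_END

if __name__ == "__main__":
    if in_offpeak_window():
        print("Off-peak: launching training/OCR batch on the remotes.")
    else:
        print("Peak hours: deferring heavy jobs until 5pm.")
```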
## Notes
- ECC memory is strongly recommended for long-running training jobs on the primary host.
- RTX 3070 VRAM limits drive the choice of Qwen3 8B + QLoRA on the main inference box; a back-of-envelope estimate follows this list.
- The 3090's 24 GB allows larger batch sizes or full fine-tuning experiments; the 4070's 12 GB fits QLoRA comfortably.
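
A rough estimate of why a 4-bit-quantized 8B model fits in 8 GB (all figures below are assumptions for illustration, not measurements):

```python
# Rough VRAM budget for Qwen3 8B at 4-bit quantization (assumed numbers).
params_billion = 8.0
bytes_per_param = 0.5                         # ~4 bits per parameter
weights_gb = params_billion * bytes_per_param  # ~4.0 GB of weights
kv_cache_gb = 1.5                             # assumed moderate context length
runtime_overhead_gb = 1.0                     # CUDA context, activations, etc.

total_gb = weights_gb + kv_cache_gb + runtime_overhead_gb
print(f"Estimated VRAM: {total_gb:.1f} GB of the 3070's 8 GB")  # ~6.5 GB
```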
## Expected throughput
| Component | Expected throughput |
|---|---|
| OCR (Marker) | ~300–600 pages/hour (32 threads) |
| Inference (Qwen3 8B) | ~20–35 tokens/sec |
| Concurrent users | 3–6 interactive users |
| QLoRA training | ~1–2 hrs / 10k samples |
These numbers vary with document complexity and prompt length.
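
As a sanity check on the concurrent-user row, divide the aggregate decode throughput by a minimum usable per-user rate (the 5 tokens/sec floor is an assumption):

```python
# Rough check: how many interactive users fit the decode budget?
aggregate_tps = 25.0    # mid-range of the ~20-35 tokens/sec estimate above
per_user_floor = 5.0    # assumed minimum tokens/sec for a tolerable reply

max_users = int(aggregate_tps // per_user_floor)
print(f"~{max_users} concurrent users at >= {per_user_floor} tok/s each")
```

At 25 tokens/sec aggregate this gives about 5 users, which lands inside the 3–6 range in the table.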
## Capacity flow (conceptual)
```mermaid
flowchart LR
    Users[3–6 users] --> WebUI[Open WebUI]
    WebUI --> Ollama[Ollama]
    Ollama --> GPU[RTX 3070]
```
## Limitations
- RTX 3070 VRAM limits concurrent users on the primary inference host.
- Fine-tuning is time-intensive; parallel runs on remotes (5pm–9am) ease the load.
- OCR throughput is bounded by CPU on each machine; remotes add capacity.
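
One simple way to spread an OCR backlog across the three machines is round-robin sharding by document (the hostnames and `shard` helper are hypothetical; actual dispatch would invoke Marker on each host):

```python
# Round-robin an OCR backlog across the available machines (names illustrative).
machines = ["primary", "remote1", "remote2"]

def shard(pdfs: list[str]) -> dict[str, list[str]]:
    """Assign documents to machines in turn so page load stays balanced."""
    shards: dict[str, list[str]] = {m: [] for m in machines}
    for i, pdf in enumerate(pdfs):
        shards[machines[i % len(machines)]].append(pdf)
    return shards

print(shard([f"doc{i}.pdf" for i in range(7)]))
```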