Self-Hosted Qwen3 LLM Stack
How do I get this on my iPhone?
Use LLMConnect on your iPhone over Tailscale. There is no separate “company app” — you install Tailscale (our secure network), then the LLMConnect app, and point it at our private endpoint.
- Install Tailscale on your iPhone (App Store). Open it and sign in with your work Apple ID or Google account.
- Accept the Tailnet invitation your admin sent you (email or link). You must be on the Tailnet before the chat app can reach our servers.
- Install LLMConnect (App Store).
- Add our endpoint in LLMConnect:
  - In LLMConnect, add a custom / Ollama endpoint.
  - URL: your admin will give you this (e.g. `http://ai-training.your-tailnet:11434` — they'll send the exact hostname).
  - Model: `qwen3:8b` (or the model name your admin specifies).
  - Save. You can now chat through LLMConnect while you're on Tailscale (your iPhone will use the VPN automatically when the app needs it). If you want to verify the endpoint first, see the sketch after this list.
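If you'd like to confirm the endpoint is reachable before (or instead of) configuring LLMConnect, you can hit the same Ollama API the app talks to. This is a minimal sketch only: the hostname and model below are the examples from this doc, so substitute the values your admin sends you, and run it from a machine that is connected to the Tailnet.

```ts
// check-endpoint.ts — sanity-check the Ollama endpoint over Tailscale.
// Hostname and model are the doc's examples; use your admin's actual values.
// Run with: npx tsx check-endpoint.ts

const BASE_URL = "http://ai-training.your-tailnet:11434"; // example hostname
const MODEL = "qwen3:8b";                                 // example model name

async function main() {
  // 1. List the models the server has pulled (GET /api/tags).
  const tags = await fetch(`${BASE_URL}/api/tags`);
  if (!tags.ok) throw new Error(`Endpoint unreachable: HTTP ${tags.status}`);
  const { models } = (await tags.json()) as { models: { name: string }[] };
  console.log("Available models:", models.map((m) => m.name).join(", "));

  // 2. Send a one-off prompt (POST /api/generate), non-streaming.
  const res = await fetch(`${BASE_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: MODEL, prompt: "Reply with OK.", stream: false }),
  });
  const data = (await res.json()) as { response: string };
  console.log("Model replied:", data.response);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

If the first request fails, check that Tailscale is connected before looking at anything else.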
Who sends the Tailnet invite and the URL? Your IT team or the person who runs this platform. Full quick start: ops/onboarding.md.
Prefer a browser? You can also open the Open WebUI URL your admin gave you in Safari (over Tailscale) and add it to your Home Screen.
Executive summary
This document describes the architecture and operation of a private, self-hosted Large Language Model (LLM) platform built on Qwen3 8B. The system is designed for internal Management and staff use, prioritizing privacy, reproducibility, and operational clarity.
Key characteristics:
- No public internet exposure
- VPN-only access (Tailscale)
- GPU-accelerated inference and training
- Separate concerns for inference, training, OCR, and storage
The documentation is split into two major areas:
- Architecture — how the system is built → arch/
- Operations — how the system is run and maintained → ops/
Goals & design principles
- Fully private (no public internet exposure)
- Easy access for non-technical users
- ChatGPT-style UX (Open WebUI)
- Mobile-friendly (LLMConnect, browser)
- Reproducible infrastructure
- Clear separation of concerns (training, inference, OCR)
Documentation map
- Architecture: arch/architecture (high-level diagram) · arch/diagrams (Mermaid diagrams) · arch/capacity-planning (hardware, throughput, limitations)
- Infrastructure: infra/proxmox (GPU passthrough & VM layout) · infra/docker-compose.yml (Ollama + Open WebUI) · infra/backups (backups)
- ML: ml/training (QLoRA) · ml/ocr (Marker OCR) · ml/requirements.txt (dependencies)
- Operations: ops/onboarding (access) · ops/runbooks (troubleshooting & recovery) · ops/security (VPN, governance) · ops/changelog (history)
Repository structure
```
llm-platform/
├── README.md                 # High-level overview (this document)
├── app/                      # Next.js docs app (Vercel deployment)
├── package.json              # Node deps for docs site
├── vercel.json               # Vercel config
├── arch/
│   ├── architecture.mmd      # High-level system diagram
│   ├── diagrams.md           # Mermaid diagrams
│   └── capacity-planning.md  # Users, tokens/sec, GPU limits
│
├── infra/
│   ├── proxmox.md            # GPU passthrough & VM layout
│   ├── docker-compose.yml    # Ollama + Open WebUI
│   └── backups.md            # Backup targets and procedures
│
├── ml/
│   ├── training.md           # QLoRA fine-tuning workflow
│   ├── ocr.md                # Marker OCR parallelization
│   ├── requirements.txt      # Python dependencies
│   └── scripts/
│       ├── run_ocr.sh        # Parallel OCR runner
│       └── train_qlora.py    # QLoRA training script
│
├── ops/
│   ├── onboarding.md         # Management & staff access
│   ├── runbooks.md           # Failure scenarios and recovery
│   ├── security.md           # VPN, ACLs, isolation guarantees
│   └── changelog.md          # Operational change history
│
└── .gitignore
```
Deploy docs to Vercel
The repo includes a Next.js docs site (static export) so you can host the documentation on Vercel.
- Install and build locally (optional): `npm install`, then `npm run build`. Static output is in `out/`.
- Deploy to Vercel:
  - Push the repo to GitHub/GitLab/Bitbucket, or use the Vercel CLI.
  - In Vercel, import the project. Vercel will detect Next.js and use the correct build.
  - Deploy. The docs will be served at your project URL (e.g. `https://your-project.vercel.app`).
- Local dev: `npm run dev` — then open http://localhost:3000.
Note: Only the documentation is deployed to Vercel. The LLM stack (Ollama, Open WebUI, training, OCR) runs on your own infrastructure (see infra/ and ops/).
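For reference, the `out/` directory is produced because the app is configured for Next.js static export. A minimal sketch of what that configuration typically looks like — the repo's actual config in app/ may differ in filename (e.g. next.config.mjs) and contents:

```ts
// next.config.ts — minimal static-export config (sketch; repo's actual config may differ)
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  output: "export", // write a fully static site to out/ at build time
};

export default nextConfig;
```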
Future improvements
- Higher VRAM GPU (16–24 GB)
- Multi-GPU inference
- SSO / identity provider
- Usage analytics
- Model routing or ensemble support
Tone & audience
This repository is written to be:
- Readable by Management at a high level
- Actionable by Operators without external context
- Auditable by Engineers for correctness and reproducibility
Technical depth increases progressively by directory.
All docs
- arch / architecture — architecture
- arch / capacity-planning — capacity planning
- arch / diagrams — diagrams
- infra / backups — backups
- infra / proxmox — proxmox
- ml / ocr — ocr
- ml / training — training
- ops / changelog — changelog
- ops / onboarding — onboarding
- ops / runbooks — runbooks
- ops / security — security
- ops / vercel-setup — vercel setup