Self-Hosted Qwen3 LLM Stack
How do I get this on my iPhone?
Use LLMConnect on your iPhone over Tailscale. There is no separate “company app” — you install Tailscale (our secure network), then the LLMConnect app, and point it at our private endpoint.
- Install Tailscale on your iPhone (App Store). Open it and sign in with your work Apple ID or Google account.
- Accept the Tailnet invitation your admin sent you (email or link). You must be on the Tailnet before the chat app can reach our servers.
- Install LLMConnect (App Store).
- Add our endpoint in LLMConnect:
  - In LLMConnect, add a custom / Ollama endpoint.
  - URL: your admin will give you this (e.g. `http://ai-training.your-tailnet:11434` — they'll send the exact hostname).
  - Model: `qwen3:8b` (or the model name your admin specifies).
  - Save. You can now chat through LLMConnect while you're on Tailscale (your iPhone will use the VPN automatically when the app needs it). If you want to verify the endpoint first, see the sketch after this list.
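If you'd like to confirm the endpoint is reachable before (or instead of) configuring LLMConnect, you can hit the same Ollama API the app talks to. This is a minimal sketch only: the hostname and model below are the examples from this doc, so substitute the values your admin sends you, and run it from a machine that is connected to the Tailnet.

```ts
// check-endpoint.ts — sanity-check the Ollama endpoint over Tailscale.
// Hostname and model are the doc's examples; use your admin's actual values.
// Run with: npx tsx check-endpoint.ts

const BASE_URL = "http://ai-training.your-tailnet:11434"; // example hostname
const MODEL = "qwen3:8b";                                 // example model name

async function main() {
  // 1. List the models the server has pulled (GET /api/tags).
  const tags = await fetch(`${BASE_URL}/api/tags`);
  if (!tags.ok) throw new Error(`Endpoint unreachable: HTTP ${tags.status}`);
  const { models } = (await tags.json()) as { models: { name: string }[] };
  console.log("Available models:", models.map((m) => m.name).join(", "));

  // 2. Send a one-off prompt (POST /api/generate), non-streaming.
  const res = await fetch(`${BASE_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: MODEL, prompt: "Reply with OK.", stream: false }),
  });
  const data = (await res.json()) as { response: string };
  console.log("Model replied:", data.response);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

If the first request fails, check that Tailscale is connected before looking at anything else.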
Who sends the Tailnet invite and the URL? Your IT team or the person who runs this platform. Full quick start: ops/onboarding.md.
Prefer a browser? You can also open the Open WebUI URL your admin gave you in Safari (over Tailscale) and add it to your Home Screen.
Executive summary
This document describes the architecture and operation of a private, self-hosted Large Language Model (LLM) platform built on Qwen3 8B. The system is designed for internal Management and staff use, prioritizing privacy, reproducibility, and operational clarity.
Key characteristics:
- No public internet exposure
- VPN-only access (Tailscale)
- GPU-accelerated inference and training
- Separate concerns for inference, training, OCR, and storage
The documentation is split into two major areas:
- Architecture — how the system is built → arch/
- Operations — how the system is run and maintained → ops/
Goals & design principles
- Fully private (no public internet exposure)
- Easy access for non-technical users
- ChatGPT-style UX (Open WebUI)
- Mobile-friendly (LLMConnect, browser)
- Reproducible infrastructure
- Clear separation of concerns (training, inference, OCR)
Documentation map
- Architecture: arch/architecture (high-level diagram) · arch/diagrams (Mermaid diagrams) · arch/capacity-planning (hardware, throughput, limitations)
- Infrastructure: infra/proxmox (GPU passthrough & VM layout) · infra/docker-compose.yml (Ollama + Open WebUI) · infra/backups (backups)
- ML: ml/training (QLoRA) · ml/ocr (Marker OCR) · ml/requirements.txt (dependencies)
- Operations: ops/onboarding (access) · ops/runbooks (troubleshooting & recovery) · ops/security (VPN, governance) · ops/changelog (history)
Repository structure
```
llm-platform/
├── README.md                 # High-level overview (this document)
├── app/                      # Next.js docs app (Vercel deployment)
├── package.json              # Node deps for docs site
├── vercel.json               # Vercel config
├── arch/
│   ├── architecture.mmd      # High-level system diagram
│   ├── diagrams.md           # Mermaid diagrams
│   └── capacity-planning.md  # Users, tokens/sec, GPU limits
│
├── infra/
│   ├── proxmox.md            # GPU passthrough & VM layout
│   ├── docker-compose.yml    # Ollama + Open WebUI
│   └── backups.md            # Backup targets and procedures
│
├── ml/
│   ├── training.md           # QLoRA fine-tuning workflow
│   ├── ocr.md                # Marker OCR parallelization
│   ├── requirements.txt      # Python dependencies
│   └── scripts/
│       ├── run_ocr.sh        # Parallel OCR runner
│       └── train_qlora.py    # QLoRA training script
│
├── ops/
│   ├── onboarding.md         # Management & staff access
│   ├── runbooks.md           # Failure scenarios and recovery
│   ├── security.md           # VPN, ACLs, isolation guarantees
│   └── changelog.md          # Operational change history
│
└── .gitignore
```
Deploy docs to Vercel
The repo includes a Next.js docs site (static export) so you can host the documentation on Vercel.
- Install and build locally (optional): `npm install`, then `npm run build`. Static output is in `out/`.
- Deploy to Vercel:
  - Push the repo to GitHub/GitLab/Bitbucket, or use the Vercel CLI.
  - In Vercel, import the project. Vercel will detect Next.js and use the correct build.
  - Deploy. The docs will be served at your project URL (e.g. `https://your-project.vercel.app`).
- Local dev: `npm run dev` — then open http://localhost:3000.
Note: Only the documentation is deployed to Vercel. The LLM stack (Ollama, Open WebUI, training, OCR) runs on your own infrastructure (see infra/ and ops/).
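For reference, the `out/` directory is produced because the app is configured for Next.js static export. A minimal sketch of what that configuration typically looks like — the repo's actual config in app/ may differ in filename (e.g. next.config.mjs) and contents:

```ts
// next.config.ts — minimal static-export config (sketch; repo's actual config may differ)
import type { NextConfig } from "next";

const nextConfig: NextConfig = {
  output: "export", // write a fully static site to out/ at build time
};

export default nextConfig;
```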
Future improvements
- Higher VRAM GPU (16–24 GB)
- Multi-GPU inference
- SSO / identity provider
- Usage analytics
- Model routing or ensemble support
Tone & audience
This repository is written to be:
- Readable by Management at a high level
- Actionable by Operators without external context
- Auditable by Engineers for correctness and reproducibility
Technical depth increases progressively by directory.
All docs
- arch / architecture — architecture
- arch / capacity-planning — capacity planning
- arch / diagrams — diagrams
- infra / backups — backups
- infra / proxmox — proxmox
- ml / ocr — ocr
- ml / training — training
- ops / changelog — changelog
- ops / onboarding — onboarding
- ops / runbooks — runbooks
- ops / security — security
- ops / vercel-setup — vercel setup