Marker OCR parallelization
OCR tooling, environment, parallel execution, and output normalization on the ocr-processing VM.
OCR tooling (ocr-processing VM)
- Marker for document-to-text
- CPU-only workload
- Optimized for parallel document processing
Python environment (OCR VM)
apt update
apt install -y python3 python3-venv python3-pip poppler-utils tesseract-ocr
python3 -m venv /opt/ocr-venv
source /opt/ocr-venv/bin/activate
pip install --upgrade pip
pip install marker-pdf tqdm joblib
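A quick sanity check before launching a large batch; a minimal sketch, assuming the /opt/ocr-venv environment above is active and the script is run with its interpreter:

import importlib.util
import shutil

# CLI tools installed via apt and pip should all resolve on PATH.
for tool in ('pdftotext', 'tesseract', 'marker'):
    print(f"{tool}: {shutil.which(tool) or 'MISSING'}")

# marker-pdf installs the importable 'marker' package inside the venv.
print('marker-pdf importable:', importlib.util.find_spec('marker') is not None)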
Parallel execution
Use ml/scripts/run_ocr.sh or the patterns below.
GNU Parallel
apt install -y parallel
find /datasets/raw/ocr -name "*.pdf" | \
parallel -j 8 "marker {} --output /datasets/processed/ocr"
Python multiprocessing
from multiprocessing import Pool
from pathlib import Path
import subprocess

# All PDFs staged for OCR on this VM.
files = list(Path('/datasets/raw/ocr').glob('*.pdf'))

def run(file):
    # One Marker invocation per document; output lands in the processed tree.
    subprocess.run([
        'marker', str(file), '--output', '/datasets/processed/ocr'
    ])

# 8 workers, matching the -j 8 GNU Parallel example above.
with Pool(8) as p:
    p.map(run, files)
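The worker count of 8 (like -j 8 above) is an assumption, not a measured optimum. Since Marker runs CPU-only here, a minimal sketch for deriving the count from the VM's cores instead:

import os

# Hypothetical helper: one worker per core, capped so a small VM is not oversubscribed.
def ocr_workers(cap: int = 8) -> int:
    return max(1, min(cap, os.cpu_count() or 1))

Pass the result to Pool(...) or to parallel -j.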
flowchart LR
PDFs[PDFs in raw/ocr] --> Marker[Marker OCR]
Marker --> Normalized[Normalized text]
Normalized --> Training[Training dataset]
OCR output normalization
After OCR, text must be normalized before training.
Steps
- Remove headers and footers
- Normalize Unicode
- Collapse repeated whitespace
- Split into chunks that fit the target context window
Example normalization + chunking (concept):
- Read .txt files from /datasets/processed/ocr
- Tokenize with the Qwen3 tokenizer, chunk at MAX_TOKENS = 3500
- Write JSONL to /datasets/processed/train.jsonl with {"text": "..."} per line
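A minimal sketch of that pipeline, assuming transformers is installed in the venv and that Qwen/Qwen3-8B is the tokenizer checkpoint (substitute whichever Qwen3 model the training run targets); header and footer removal is layout-specific and only marked as a stub here:

import json
import re
import unicodedata
from pathlib import Path

from transformers import AutoTokenizer

MAX_TOKENS = 3500  # chunk size; must fit the target context window
IN_DIR = Path('/datasets/processed/ocr')
OUT_PATH = Path('/datasets/processed/train.jsonl')

# Assumption: Qwen3 tokenizer; replace with the actual training checkpoint.
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-8B')

def normalize(text: str) -> str:
    # Header/footer removal is document-specific and omitted here.
    # Unicode normalization, then collapse runs of whitespace.
    text = unicodedata.normalize('NFKC', text)
    return re.sub(r'\s+', ' ', text).strip()

with OUT_PATH.open('w', encoding='utf-8') as out:
    for txt_file in sorted(IN_DIR.glob('*.txt')):
        text = normalize(txt_file.read_text(encoding='utf-8', errors='ignore'))
        ids = tokenizer.encode(text, add_special_tokens=False)
        # Split the token stream into MAX_TOKENS chunks and decode back to text.
        for start in range(0, len(ids), MAX_TOKENS):
            chunk = tokenizer.decode(ids[start:start + MAX_TOKENS])
            out.write(json.dumps({'text': chunk}, ensure_ascii=False) + '\n')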
See ml/scripts/ for runnable OCR and training helpers.