Marker OCR parallelization
OCR tooling, environment, parallel execution, and output normalization on the ocr-processing VM.
OCR tooling (ocr-processing VM)
- Marker for document-to-text
- CPU-only workload
- Optimized for parallel document processing
Python environment (OCR VM)
apt update
apt install -y python3 python3-venv python3-pip poppler-utils tesseract-ocr
python3 -m venv /opt/ocr-venv
source /opt/ocr-venv/bin/activate
pip install --upgrade pip
pip install marker-pdf tqdm joblib
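A quick sanity check before launching a large batch; a minimal sketch, assuming the /opt/ocr-venv environment above is active and the script is run with its interpreter:

import importlib.util
import shutil

# CLI tools installed via apt and pip should all resolve on PATH.
for tool in ('pdftotext', 'tesseract', 'marker'):
    print(f"{tool}: {shutil.which(tool) or 'MISSING'}")

# marker-pdf installs the importable 'marker' package inside the venv.
print('marker-pdf importable:', importlib.util.find_spec('marker') is not None)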
Parallel execution
Use ml/scripts/run_ocr.sh or the patterns below.
GNU Parallel
apt install -y parallel
find /datasets/raw/ocr -name "*.pdf" | \
parallel -j 8 "marker {} --output /datasets/processed/ocr"
Python multiprocessing
from multiprocessing import Pool
from pathlib import Path
import subprocess

# All PDFs staged for OCR on this VM.
files = list(Path('/datasets/raw/ocr').glob('*.pdf'))

def run(file):
    # One Marker invocation per document; output lands in the processed tree.
    subprocess.run([
        'marker', str(file), '--output', '/datasets/processed/ocr'
    ])

# 8 workers, matching the -j 8 GNU Parallel example above.
with Pool(8) as p:
    p.map(run, files)
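The worker count of 8 (like -j 8 above) is an assumption, not a measured optimum. Since Marker runs CPU-only here, a minimal sketch for deriving the count from the VM's cores instead:

import os

# Hypothetical helper: one worker per core, capped so a small VM is not oversubscribed.
def ocr_workers(cap: int = 8) -> int:
    return max(1, min(cap, os.cpu_count() or 1))

Pass the result to Pool(...) or to parallel -j.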
flowchart LR
PDFs[PDFs in raw/ocr] --> Marker[Marker OCR]
Marker --> Normalized[Normalized text]
Normalized --> Training[Training dataset]
OCR output normalization
After OCR, text must be normalized before training.
Steps
- Remove headers and footers
- Normalize Unicode
- Collapse repeated whitespace
- Split into chunks that fit the target context window
Example normalization + chunking (concept):
- Read .txt files from /datasets/processed/ocr
- Tokenize with the Qwen3 tokenizer, chunk at MAX_TOKENS = 3500
- Write JSONL to /datasets/processed/train.jsonl with {"text": "..."} per line
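A minimal sketch of that pipeline, assuming transformers is installed in the venv and that Qwen/Qwen3-8B is the tokenizer checkpoint (substitute whichever Qwen3 model the training run targets); header and footer removal is layout-specific and only marked as a stub here:

import json
import re
import unicodedata
from pathlib import Path

from transformers import AutoTokenizer

MAX_TOKENS = 3500  # chunk size; must fit the target context window
IN_DIR = Path('/datasets/processed/ocr')
OUT_PATH = Path('/datasets/processed/train.jsonl')

# Assumption: Qwen3 tokenizer; replace with the actual training checkpoint.
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-8B')

def normalize(text: str) -> str:
    # Header/footer removal is document-specific and omitted here.
    # Unicode normalization, then collapse runs of whitespace.
    text = unicodedata.normalize('NFKC', text)
    return re.sub(r'\s+', ' ', text).strip()

with OUT_PATH.open('w', encoding='utf-8') as out:
    for txt_file in sorted(IN_DIR.glob('*.txt')):
        text = normalize(txt_file.read_text(encoding='utf-8', errors='ignore'))
        ids = tokenizer.encode(text, add_special_tokens=False)
        # Split the token stream into MAX_TOKENS chunks and decode back to text.
        for start in range(0, len(ids), MAX_TOKENS):
            chunk = tokenizer.decode(ids[start:start + MAX_TOKENS])
            out.write(json.dumps({'text': chunk}, ensure_ascii=False) + '\n')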
See ml/scripts/ for runnable OCR and training helpers.