Marker OCR parallelization

OCR tooling, environment, parallel execution, and output normalization on the ocr-processing VM.

OCR tooling (ocr-processing VM)

Python environment (OCR VM)

apt update
apt install -y python3 python3-venv python3-pip poppler-utils tesseract-ocr
python3 -m venv /opt/ocr-venv
source /opt/ocr-venv/bin/activate
pip install --upgrade pip
pip install marker-pdf tqdm joblib
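
Before starting a long batch, it can help to confirm that the command-line tools installed above are actually reachable from the activated environment. The following is a minimal sketch, not part of ml/scripts/: the helper name `missing_tools` is hypothetical, and the executable names are assumptions (`marker` is the command used elsewhere on this page, `pdftoppm` is one of the binaries shipped by poppler-utils, `tesseract` is shipped by tesseract-ocr).

```python
import shutil

# Executable names are assumptions: "marker" is the command used on this page,
# "pdftoppm" comes with poppler-utils, "tesseract" comes with tesseract-ocr.
REQUIRED_TOOLS = ("marker", "pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# An empty list means every tool was found.
print("missing tools:", missing_tools())
```

Run it with the venv activated (after `source /opt/ocr-venv/bin/activate`), since `marker` is installed into the venv's bin directory rather than system-wide.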

Parallel execution

Use ml/scripts/run_ocr.sh or the patterns below.

GNU Parallel

apt install -y parallel
find /datasets/raw/ocr -type f -name "*.pdf" -print0 | \
  parallel -0 -j 8 "marker {} --output /datasets/processed/ocr"

Python multiprocessing

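from multiprocessing import Pool
from pathlib import Path
import subprocess

def run(file):
    # Run Marker on a single PDF; the command matches the GNU Parallel example.
    subprocess.run([
        'marker', str(file), '--output', '/datasets/processed/ocr'
    ])

# The guard keeps worker processes from re-running the pool setup when the
# module is imported (required with the "spawn" start method).
if __name__ == '__main__':
    # glob() only matches the top-level directory, unlike find above;
    # use rglob('*.pdf') to include subdirectories.
    files = sorted(Path('/datasets/raw/ocr').glob('*.pdf'))

    # 8 worker processes, the same degree of parallelism as -j 8.
    with Pool(8) as p:
        p.map(run, files)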

flowchart LR
  PDFs[PDFs in raw/ocr] --> Marker[Marker OCR]
  Marker --> Normalized[Normalized text]
  Normalized --> Training[Training dataset]
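
The multiprocessing pattern above discards each command's exit status, so a PDF that fails silently drops out of the dataset. One possible variant, sketched here under assumptions, collects the failed inputs so they can be retried. The names `marker_command` and `run_all` are hypothetical, and the `marker ... --output` command line is copied from this page rather than checked against a specific Marker version. Threads are used instead of processes because each worker only waits on an external program.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

OUTPUT_DIR = "/datasets/processed/ocr"


def marker_command(pdf):
    """Build the command line used on this page for one PDF (assumed, not verified)."""
    return ["marker", str(pdf), "--output", OUTPUT_DIR]


def run_all(files, build_command=marker_command, workers=8):
    """Run one subprocess per file and return the files whose command failed."""

    def run_one(path):
        try:
            result = subprocess.run(build_command(path), capture_output=True)
            return path, result.returncode
        except OSError:
            # The executable could not be started (for example, not on PATH).
            return path, None

    # pool.map preserves input order, so failures come back in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_one, files))

    # Anything other than exit status 0 counts as a failure.
    return [path for path, code in results if code != 0]
```

A typical call would be `failed = run_all(sorted(Path('/datasets/raw/ocr').glob('*.pdf')))`, after importing `Path` from `pathlib`, followed by writing `failed` to a file for a second pass.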

OCR output normalization

After OCR, text must be normalized before training.

Steps

Example normalization + chunking (concept):
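
The page does not list the exact steps, so the following is only a conceptual sketch of what normalization and chunking could look like. Every choice in it is an assumption: the cleanup rules (Unicode NFKC folding, rejoining words hyphenated across line breaks, whitespace collapsing), the word-window chunking, the default sizes, and the function names `normalize` and `chunk`. The helpers in ml/scripts/ may work differently.

```python
import re
import unicodedata


def normalize(text):
    """Apply a few common cleanups to raw OCR text (illustrative rules only)."""
    text = unicodedata.normalize("NFKC", text)    # fold ligatures and width variants
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


def chunk(text, max_words=200, overlap=20):
    """Split text into windows of up to max_words words, overlapping by `overlap`."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = "The \ufb01rst exam-\nple   has\textra  spaces.\n\n\n\nNext."
print(chunk(normalize(sample), max_words=4, overlap=1))
```

Two trade-offs are worth noting. Rejoining hyphenated line breaks also merges genuine compounds that happened to wrap (for example "well-" followed by "known" becomes "wellknown"), and splitting on whitespace discards paragraph boundaries inside a chunk. Both may need adjusting for a real training set.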

See ml/scripts/ for runnable OCR and training helpers.