OCR

Build image

Clone this repository and navigate into it:

git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr

Build image:

docker build -t sfb1288inf/ocr:latest .

Alternatively, build the image directly from the GitLab repository without cloning it:

docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Download prebuilt image

The GitLab registry provides a prebuilt image. It is built automatically on the conquaire build servers.

Download image:

docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest

Run

Create input and output directories for the OCR software:

mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr

Place your files inside the /<mydatalocation>/files_for_ocr directory. Files can be either PDF (.pdf) or multipage TIFF (.tiff, .tif) files. All files in one run should contain text in the same language, because a single language code is passed to the OCR process.

Start the OCR process:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v /<mydatalocation>/files_for_ocr:/input \
    -v /<mydatalocation>/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l <languagecode> \
        -o /output

The arguments that follow the image name sfb1288inf/ocr:latest are described in the OCR arguments section.

If you want to use the prebuilt image, replace sfb1288inf/ocr:latest with gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest.

Check your results in the /<mydatalocation>/files_from_ocr directory.
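
Before starting a run, it can help to check that the input directory only contains supported file types. The following is a minimal sketch, not part of the OCR software: list_unsupported_files is a hypothetical helper name, and it assumes that only the lower-case extensions named above (.pdf, .tif, .tiff) are accepted, since the handling of upper-case extensions is not documented here.

```shell
# Hypothetical helper (not shipped with the image): print the names of all
# regular files in a directory that do not end in .pdf, .tif or .tiff.
# Assumption: matching is case-sensitive, mirroring the extensions listed above.
# Usage: list_unsupported_files "$HOME"/ocr/files_for_ocr
list_unsupported_files() {
    for path in "$1"/*; do
        # Skip subdirectories and the unexpanded glob of an empty directory.
        [ -f "$path" ] || continue
        case "$path" in
            *.pdf|*.tif|*.tiff) ;;       # supported, nothing to report
            *) basename "$path" ;;       # anything else is reported
        esac
    done
}
```

If the function prints nothing, every file in the directory has one of the supported extensions.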

OCR arguments

-l languagecode

Tells tesseract which language will be used.
options = deu (German), eng (English), enm (Middle English), fra (French), frk (German Fraktur), frm (Middle French), ita (Italian), por (Portuguese), spa (Spanish)
required = True
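
Because the language code is required and limited to the options listed above, a wrapper script could reject unknown codes before the container is started. This is only a sketch under that assumption; is_supported_language is a hypothetical helper name, and the list would need updating if the image gains further languages.

```shell
# Hypothetical helper: succeed only for the language codes listed above.
is_supported_language() {
    case "$1" in
        deu|eng|enm|fra|frk|frm|ita|por|spa) return 0 ;;
        *) return 1 ;;
    esac
}

# Example check; prints a message for a code that is not in the list.
if ! is_supported_language "xyz"; then
    echo "unsupported language code: xyz"
fi
```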

--keep-intermediates

If set, all intermediate files created during the OCR process will be kept.
default = False
required = False

--nCores corenumber

Sets the number of CPU cores being used during the OCR process.
default = min(4, multiprocessing.cpu_count())
required = False
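
The default is computed by the OCR software inside the container. To get a rough idea of the value it will pick, the same rule can be approximated in the shell. This sketch assumes that the container sees the same number of CPUs as the host, which may not hold on every setup; default_cores is a hypothetical helper name.

```shell
# Hypothetical helper: approximate min(4, cpu_count).
# The optional first argument overrides the detected CPU count.
default_cores() {
    available="${1:-$(getconf _NPROCESSORS_ONLN)}"
    if [ "$available" -lt 4 ]; then
        echo "$available"
    else
        echo 4
    fi
}

# Print the approximate default for this machine.
default_cores
```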

--skip-binarisation

Skips binarization with ocropus. If set, only the tesseract binarization is used.
default = False
required = False

Example with all arguments used:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v "$HOME"/ocr/files_for_ocr:/input \
    -v "$HOME"/ocr/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l eng \
        -o /output \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation
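
For repeated runs, the call above can be wrapped in a small shell function. This is a sketch only: run_ocr is a hypothetical name, and the function simply forwards its arguments to the docker run command shown in this document. It uses id -u and id -g without a user name, which yields the IDs of the current user.

```shell
# Hypothetical wrapper around the documented docker run call.
# Usage: run_ocr <input dir> <output dir> <languagecode> [further OCR arguments]
run_ocr() {
    input_dir="$1"
    output_dir="$2"
    language="$3"
    shift 3   # everything left is passed through to the OCR software
    docker run \
        --rm \
        -it \
        -u "$(id -u):$(id -g)" \
        -v "$input_dir":/input \
        -v "$output_dir":/output \
        sfb1288inf/ocr:latest \
            -i /input \
            -l "$language" \
            -o /output \
            "$@"
}
```

A call such as run_ocr "$HOME"/ocr/files_for_ocr "$HOME"/ocr/files_from_ocr eng --nCores 8 then corresponds to the example above without the optional flags that were left out.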