OCR

Build image

  1. Clone this repository and navigate into it:
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
  2. Build image:
docker build -t sfb1288inf/ocr:latest .

Alternatively, build directly from the GitLab repository without cloning:

  1. Build image:
docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Download prebuilt image

The GitLab registry provides a prebuilt image, which is built automatically on the conquaire build servers.

  1. Download image:
docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest

Run

  1. Create input and output directories for the OCR software:
mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr
  2. Place your files in the /<mydatalocation>/files_for_ocr directory. Supported formats are PDF (.pdf) and multipage TIFF (.tiff, .tif). All files should contain text in the same language, since a single language code (-l) is set for the whole run.

  3. Start the OCR process:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v /<mydatalocation>/files_for_ocr:/input \
    -v /<mydatalocation>/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l <languagecode> \
        -o /output

The arguments that follow sfb1288inf/ocr:latest are described in the OCR arguments section.

If you want to use the prebuilt image, replace sfb1288inf/ocr:latest with gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest.

  4. Check your results in the /<mydatalocation>/files_from_ocr directory.
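
Before starting a run, it can help to confirm that the input directory holds only supported file types. The following is a minimal shell sketch and not part of the OCR image: the function name list_unsupported_files is hypothetical, and matching extensions case-insensitively is an assumption about what the OCR software accepts.

```shell
# Hypothetical pre-flight helper (not shipped with the OCR image).
# Prints the name of every regular file in the given directory whose
# extension is not .pdf, .tif or .tiff (compared case-insensitively).
list_unsupported_files() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue              # skip subdirectories and unmatched globs
        name=$(basename "$f")
        lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
        case "$lower" in
            *.pdf|*.tif|*.tiff) ;;           # supported input format, nothing to report
            *) printf '%s\n' "$name" ;;      # everything else is listed
        esac
    done
}
```

Calling the function with the path of your files_for_ocr directory prints one file name per line. No output means every file has a supported extension.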

OCR arguments

-l languagecode

  • Tells tesseract which language to use for recognition.
  • options = deu (German), eng (English), enm (Middle English), fra (French), frk (German Fraktur), frm (Middle French), ita (Italian), por (Portuguese), spa (Spanish)
  • required = True
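
A wrapper script can reject an unknown language code before the container is started. This is a sketch only: is_supported_language is a hypothetical helper, and its list simply mirrors the options above, so it would need updating if the image gains further languages.

```shell
# Hypothetical pre-flight check (not part of the OCR image).
# Returns 0 if the argument is one of the language codes listed above,
# and 1 otherwise.
is_supported_language() {
    case "$1" in
        deu|eng|enm|fra|frk|frm|ita|por|spa) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: only report success for a known code.
if is_supported_language "eng"; then
    echo "language code accepted"
fi
```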

--keep-intermediates

  • If set, all intermediate files created during the OCR process will be kept.
  • default = False
  • required = False

--nCores corenumber

  • Sets the number of CPU cores being used during the OCR process.
  • default = min(4, multiprocessing.cpu_count())
  • required = False
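
The default above is plain arithmetic: use all available cores, capped at four. The shell sketch below reproduces that calculation on the host, for example to decide whether passing --nCores explicitly makes sense. The function name default_ncores is hypothetical, and the image computes its own default internally; this only illustrates the formula.

```shell
# Hypothetical helper: min(4, cpu_count), mirroring the documented default.
# The CPU count is passed in as the first argument.
default_ncores() {
    cpus=$1
    if [ "$cpus" -lt 4 ]; then
        printf '%s\n' "$cpus"    # fewer than four cores: use them all
    else
        printf '4\n'             # four or more cores: cap at four
    fi
}
```

On Linux and macOS, the number of online CPUs can be obtained with getconf _NPROCESSORS_ONLN and passed to the function as its argument.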

--skip-binarisation

  • Skips binarization with ocropus. If set, only the tesseract binarization is used.
  • default = False
  • required = False

Example with all arguments used:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v "$HOME"/ocr/files_for_ocr:/input \
    -v "$HOME"/ocr/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l eng \
        -o /output \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation