ocr/README.md
2019-04-03 15:23:10 +02:00

Installation

Install additional packages

  1. Install screen. It is used to run commands in their own terminal session.

Build your own image

  1. Clone this repository and navigate into it.
  2. Build the image from the Dockerfile: docker build -t <image_name>:<tag> . For example: docker build -t ocr_container:latest .

Alternatively, build directly from Git.

  1. Use the following command to build directly from GitLab: docker build -t <image_name>:<tag> https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Folder setup

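  1. Create input and output folders for the OCR files. Use the same <container-name> for both folders so that the paths match the volumes in the section Run the container:
     mkdir -p /some/path/<container-name>/files_for_ocr /some/path/<container-name>/files_from_ocr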
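The folder setup can also be scripted. The following is a minimal sketch, not part of this repository: BASE_DIR and CONTAINER_NAME are hypothetical variable names, and the layout follows the volume paths used by the docker run command in the section Run the container. A temporary directory stands in for /some/path so the sketch can run as is.

```shell
# Hypothetical values; replace with your own base path and container name.
BASE_DIR="${BASE_DIR:-$(mktemp -d)}"              # stands in for /some/path
CONTAINER_NAME="${CONTAINER_NAME:-ocr_container}"

# Create the input and output folders in one call.
mkdir -p "$BASE_DIR/$CONTAINER_NAME/files_for_ocr" \
         "$BASE_DIR/$CONTAINER_NAME/files_from_ocr"

# List the result to confirm that both folders exist.
ls -1 "$BASE_DIR/$CONTAINER_NAME"
```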

Run the container

  1. Run a container from the image. <container-name> and /some/path are the same as in the section Folder setup. The two -v options mount the folders created there into the container.
docker run \
  --name <container-name> \
  -dit \
  -v /some/path/<container-name>/files_for_ocr:/root/files_for_ocr \
  -v /some/path/<container-name>/files_from_ocr:/root/files_from_ocr \
  <image_name>
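To check the placeholders before starting a container, the command can be assembled and printed first. This is only a sketch: print_run_cmd is a hypothetical helper, not something this repository provides, and it prints the command without running Docker.

```shell
# print_run_cmd is a hypothetical helper, not part of this repository.
# Usage: print_run_cmd <base_path> <container_name> <image_name>
print_run_cmd() {
  base=$1; name=$2; image=$3
  printf 'docker run --name %s -dit' "$name"
  printf ' -v %s/%s/files_for_ocr:/root/files_for_ocr' "$base" "$name"
  printf ' -v %s/%s/files_from_ocr:/root/files_from_ocr' "$base" "$name"
  printf ' %s\n' "$image"
}

# Print the command for inspection; copy it to the shell once it looks right.
print_run_cmd /some/path ocr_job ocr_container:latest
```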

Start an OCR job

  1. Place the files inside the folder files_for_ocr. Files can be either multipage TIFFs or PDF files. Each file needs its own folder. All files should be in the same language.
  2. Start a screen session with screen -dmS <container-name>
  3. Enter the screen session with screen -r <container-name>. (If the error "Cannot open your terminal '/dev/pts/0' - please check." appears, try: script -q -c "screen -r <container-name>" /dev/null).
  4. Start the OCR process for all files placed in files_for_ocr with docker exec -it <container-name> ocr -i files_for_ocr -o files_from_ocr -l <language_code>.
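Step 1 asks for one folder per file. The sketch below shows one way to arrange that, under the assumption that each file should sit alone in a subfolder named after the file without its extension; the pipeline may accept other layouts. INPUT_DIR is a hypothetical variable that stands in for the host path of files_for_ocr, and the two demo files exist only so the sketch runs on its own.

```shell
# Assumption: each input file goes into its own subfolder of files_for_ocr,
# named after the file without its extension.
INPUT_DIR="${INPUT_DIR:-$(mktemp -d)}"   # stands in for .../files_for_ocr

# Demo files so the sketch runs on its own; drop these two lines in real use.
: > "$INPUT_DIR/book_a.pdf"
: > "$INPUT_DIR/book_b.tif"

for f in "$INPUT_DIR"/*.pdf "$INPUT_DIR"/*.tif "$INPUT_DIR"/*.tiff; do
  [ -f "$f" ] || continue        # skip patterns that matched nothing
  name=$(basename "$f")
  stem=${name%.*}                # file name without its extension
  mkdir -p "$INPUT_DIR/$stem"
  mv "$f" "$INPUT_DIR/$stem/$name"
done
```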

Valid language codes are:

  • deu (German)
  • deu_frak (German Fraktur)
  • eng (English)
  • enm (Middle English)
  • fra (French)
  • frm (Middle French)
  • por (Portuguese)
  • spa (Spanish)
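A wrapper script can reject an unsupported code before a long OCR job is started. The following is a sketch only: is_valid_lang and LANG_CODE are hypothetical names, and the list simply mirrors the codes above, so it has to be extended by hand if more traineddata is added to the image.

```shell
# is_valid_lang is a hypothetical helper; the list mirrors the codes above.
is_valid_lang() {
  case "$1" in
    deu|deu_frak|eng|enm|fra|frm|por|spa) return 0 ;;
    *) return 1 ;;
  esac
}

LANG_CODE="${LANG_CODE:-deu}"
if is_valid_lang "$LANG_CODE"; then
  echo "ok: $LANG_CODE"
else
  echo "unsupported language code: $LANG_CODE" >&2
fi
```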

Additional OCR arguments

The pipeline accepts the following arguments.

  • -i some/path

    • Sets the input directory using the specified path.
    • required = True
  • -o some/path

    • Sets the output directory using the specified path.
    • required = True
  • -l valid_language_code

    • Tells tesseract which language will be used.
    • required = True
  • --keep_intermediates

    • Optional argument. If set, all intermediate files created during the OCR process are kept.
    • default = False
    • required = False
  • --nCores

    • Sets the number of CPU cores used during the OCR process.
    • default = min(4, multiprocessing.cpu_count())
    • required = False
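To see which value the --nCores default would resolve to on a given machine, the min(4, cpu_count) rule can be approximated in the shell. This is a sketch under assumptions: it relies on nproc or getconf being installed, and their result can differ from Python's multiprocessing.cpu_count(), for example when CPU affinity is restricted.

```shell
# Approximate min(4, cpu_count) in the shell.
# Assumes nproc or getconf is available; falls back to 1 otherwise.
cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

if [ "$cpus" -lt 4 ]; then
  default_cores=$cpus
else
  default_cores=4
fi

echo "default --nCores would be: $default_cores"
```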

Example with all arguments used:

docker exec -it <container-name> ocr -i files_for_ocr -o files_from_ocr -l deu --keep_intermediates --nCores 8

Exit and re-enter the current running OCR process

  1. To leave the running OCR process, detach from the screen session by pressing Ctrl+a followed by d. The OCR job keeps running in the background.
  2. Re-enter the screen session to check the status of the running OCR job with screen -r <container-name>. (If the error "Cannot open your terminal '/dev/pts/0' - please check." appears, try: script -q -c "screen -r <container-name>" /dev/null).

Use prebuilt image

Download via the registry function with a login or deploy token.

Add traineddata for OCR of additional languages

Additional traineddata can be added in the Dockerfile. Append the URL of the required data file after line 56 of the Dockerfile, following the same syntax as the existing entries.

The standard traineddata for various languages can be found at https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata

The more accurate but slower traineddata can be found at https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata

Traineddata for Fraktur fonts can also be found in the standard tessdata repository: https://github.com/tesseract-ocr/tessdata

The Dockerfile section for the traineddata with added language support for Afrikaans would look like this:

RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
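Every download line in this section follows the same pattern, so new lines can be generated rather than typed. The sketch below assumes the raw/master URL layout and the target directory shown in the lines above; traineddata_line is a hypothetical helper, not part of this repository, and the trailing "&& \" still has to be added by hand when the line is not the last one.

```shell
# traineddata_line is a hypothetical helper for generating Dockerfile lines.
# Usage: traineddata_line <repo> <language_code>
#   repo: tessdata or tessdata_best
traineddata_line() {
  repo=$1; lang=$2
  printf 'wget -nv https://github.com/tesseract-ocr/%s/raw/master/%s.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata\n' "$repo" "$lang"
}

# Print the line for Afrikaans from the tessdata_best repository.
traineddata_line tessdata_best afr
```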