OCR

Build image

Clone this repository and navigate into it:

git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr

Build image:

docker build -t sfb1288inf/ocr:latest .

Alternatively, build the image directly from the GitLab repository without cloning it first:

docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Download prebuilt image

The GitLab registry provides a prebuilt image. It is built automatically on the conquaire build servers.

Download image:

docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest
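
The run commands in this README refer to the image as sfb1288inf/ocr:latest, while a pulled image carries the registry name. A minimal sketch, assuming you want to reuse those commands unchanged, is to add the local name with docker tag after pulling. The function name use_prebuilt_image and the two variable names are hypothetical.

```shell
# Image names as they appear in this README.
REGISTRY_IMAGE="gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest"
LOCAL_IMAGE="sfb1288inf/ocr:latest"

# Hypothetical helper: pull the prebuilt image, then add the local tag
# so that "docker run ... sfb1288inf/ocr:latest" resolves to it.
use_prebuilt_image() {
    docker pull "$REGISTRY_IMAGE" &&
        docker tag "$REGISTRY_IMAGE" "$LOCAL_IMAGE"
}
```

Calling use_prebuilt_image once is enough; the tag is only created if the pull succeeds.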

Run

Create input and output directories for the OCR software:

mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr

Place your files inside the /<mydatalocation>/files_for_ocr directory. Files can be either multipage TIFF (.tiff, .tif) or PDF (.pdf) files. All files should contain text in the same language.

Start the OCR process:

docker run \
    --rm \
    -it \
    -v /<mydatalocation>/files_for_ocr:/files_for_ocr \
    -v /<mydatalocation>/files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l <languagecode>

The arguments specified after sfb1288inf/ocr:latest are described in the OCR arguments section.

Check your results in the /<mydatalocation>/files_from_ocr directory.
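
Before starting a run, it can help to confirm that the input directory only holds supported file types. The following sketch lists every file that is neither TIFF nor PDF. The function name list_unsupported_files is hypothetical, the extensions are matched exactly as written in this README (whether upper-case extensions are accepted is not stated here), and only the top level of the directory is inspected, which is an assumption about how the input directory is read.

```shell
# Hypothetical helper: print every regular file directly inside the given
# directory whose name does not end in .tif, .tiff or .pdf.
list_unsupported_files() {
    find "$1" -maxdepth 1 -type f \
        ! -name '*.tif' ! -name '*.tiff' ! -name '*.pdf'
}
```

For example, list_unsupported_files /<mydatalocation>/files_for_ocr prints nothing when every file has one of the three extensions.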

OCR arguments

-i path

Sets the input directory using the specified path.
required = True

-o path

Sets the output directory using the specified path.
required = True

-l languagecode

Tells tesseract which language will be used.
options = deu (German), deu_frak (German Fraktur), eng (English), enm (Middle English), fra (French), frm (Middle French), por (Portuguese), spa (Spanish)
required = True
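
A wrapper script can reject an unknown language code before the container is started. This is a minimal sketch; the function name is_supported_language is hypothetical, and the list only covers the codes documented above, so an image built with additional language models would need the pattern extended.

```shell
# Hypothetical helper: succeed only if the argument is one of the
# language codes listed in this README.
is_supported_language() {
    case "$1" in
        deu|deu_frak|eng|enm|fra|frm|por|spa) return 0 ;;
        *) return 1 ;;
    esac
}
```

A typical use would be: is_supported_language "$lang" || echo "unknown language code: $lang" >&2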

--keep-intermediates

If set, all intermediate files created during the OCR process will be kept.
default = False
required = False

--nCores corenumber

Sets the number of CPU cores being used during the OCR process.
default = min(4, multiprocessing.cpu_count())
required = False
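
The default is written as a Python expression. To estimate which value it yields on a given machine before choosing --nCores, a rough shell equivalent might look like the following. It assumes that getconf supports the _NPROCESSORS_ONLN variable (common on Linux and macOS); the variable names are illustrative, and the count seen inside the container may differ from the host's.

```shell
# Number of online CPU cores as reported by the host.
cpu_count=$(getconf _NPROCESSORS_ONLN)

# Shell equivalent of min(4, cpu_count).
if [ "$cpu_count" -lt 4 ]; then
    default_cores=$cpu_count
else
    default_cores=4
fi
echo "$default_cores"
```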

--skip-binarisation

Skips binarization with ocropus. If set, only the tesseract binarization is used.
default = False
required = False

Example with all arguments used:

docker run \
    --rm \
    -it \
    -v "$HOME"/ocr/files_for_ocr:/files_for_ocr \
    -v "$HOME"/ocr/files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l eng \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation
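
For repeated runs, the directory setup and the docker run call can be bundled in a small shell function. This is only a sketch under stated assumptions: the name run_ocr is hypothetical, the data location should be an absolute path (Docker treats a relative name in -v as a named volume), and any further arguments are passed through to the OCR software unchanged.

```shell
# Hypothetical wrapper around the docker run call shown in this README.
# Usage: run_ocr <datalocation> <languagecode> [further OCR arguments...]
run_ocr() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_ocr <datalocation> <languagecode> [args...]" >&2
        return 2
    fi
    data_dir=$1
    language=$2
    shift 2

    # Create the input and output directories if they do not exist yet.
    mkdir -p "$data_dir/files_for_ocr" "$data_dir/files_from_ocr" || return 1

    docker run \
        --rm \
        -it \
        -v "$data_dir/files_for_ocr:/files_for_ocr" \
        -v "$data_dir/files_from_ocr:/files_from_ocr" \
        sfb1288inf/ocr:latest \
            -i /files_for_ocr \
            -o /files_from_ocr \
            -l "$language" \
            "$@"
}
```

With this in place, run_ocr "$HOME/ocr" eng --nCores 8 would start a run on the files in "$HOME"/ocr/files_for_ocr.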

Additional language models for OCR

Additional language models can be installed easily: add them to the Dockerfile in the same way as the existing models.

The standard language models for various languages can be found at https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.

The more accurate but slower language models can be found at https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.

Language models for Fraktur fonts can also be found in the standard tessdata repository at https://github.com/tesseract-ocr/tessdata.
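
The example URLs above follow one pattern: repository, branch or tag, and language code. A small sketch that assembles such a URL is shown below. The function name model_url is hypothetical, and the pattern is inferred only from the example URLs in this README, so it is worth checking the result against the download button for the language you need.

```shell
# Hypothetical helper: print the raw download URL of a language model.
# Usage: model_url <repository> <branch-or-tag> <languagecode>
model_url() {
    printf 'https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata\n' \
        "$1" "$2" "$3"
}
```

For instance, model_url tessdata_best master afr prints the tessdata_best URL for Afrikaans given above, which can then be passed to wget.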

The Dockerfile section for the language models with added language support for Afrikaans would look like this:

RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
