OCR

Build image

  1. Clone this repository and navigate into it:
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
  2. Build image:
docker build -t sfb1288inf/ocr:latest .

Alternatively, build directly from the GitLab repository without cloning:

  1. Build image:
docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Download prebuilt image

The GitLab registry provides a prebuilt image. It is built automatically on the Conquaire build servers.

  1. Download image:
docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest

Run

  1. Create input and output directories for the OCR software:
mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr
  2. Place your files inside the /<mydatalocation>/files_for_ocr directory. Files can be either multipage TIFF (.tiff, .tif) or PDF (.pdf) files. All files should contain text in the same language.

  3. Start the OCR process:

docker run \
    --rm \
    -it \
    -v /<mydatalocation>/files_for_ocr:/files_for_ocr \
    -v /<mydatalocation>/files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l <languagecode>

The arguments following sfb1288inf/ocr:latest are described in the OCR arguments section.

  4. Check your results in the /<mydatalocation>/files_from_ocr directory.
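
Before starting a run, it can help to confirm that the input directory only contains the file types listed in step 2. The following is a minimal sketch in Python; the helper name `unsupported_files` is hypothetical and not part of the OCR image, and the case-insensitive extension match is an assumption (the OCR software itself may be stricter).

```python
from pathlib import Path

# Extensions named in this README as accepted input formats.
SUPPORTED_EXTENSIONS = {".tiff", ".tif", ".pdf"}


def unsupported_files(input_dir):
    """Return the sorted names of regular files whose extension is not supported.

    Hypothetical helper for illustration; it is not part of the OCR image.
    Extensions are compared case-insensitively, which is an assumption.
    """
    return sorted(
        entry.name
        for entry in Path(input_dir).iterdir()
        if entry.is_file() and entry.suffix.lower() not in SUPPORTED_EXTENSIONS
    )
```

Calling `unsupported_files("/<mydatalocation>/files_for_ocr")` with your own path returns an empty list when every file is a TIFF or PDF.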

OCR arguments

-i path

  • Sets the input directory using the specified path.
  • required = True

-o path

  • Sets the output directory using the specified path.
  • required = True

-l languagecode

  • Tells Tesseract which language to use.
  • options = deu (German), deu_frak (German Fraktur), eng (English), enm (Middle English), fra (French), frm (Middle French), por (Portuguese), spa (Spanish)
  • required = True

--keep-intermediates

  • If set, all intermediate files created during the OCR process will be kept.
  • default = False
  • required = False

--nCores corenumber

  • Sets the number of CPU cores being used during the OCR process.
  • default = min(4, multiprocessing.cpu_count())
  • required = False

--skip-binarisation

  • Skips binarization with ocropus. If set, only the Tesseract binarization is used.
  • default = False
  • required = False

Example with all arguments used:

docker run \
    --rm \
    -it \
    -v $HOME/ocr/files_for_ocr:/files_for_ocr \
    -v $HOME/ocr/files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l eng \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation
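
The options and defaults documented above map naturally onto Python's `argparse` module. The following is a sketch under that assumption, not the parser actually shipped in the image: `build_parser`, the `dest` names, and the choice to restrict `-l` to the listed language codes are all illustrative.

```python
import argparse
import multiprocessing

# Language codes listed under -l in this README.
LANGUAGES = ["deu", "deu_frak", "eng", "enm", "fra", "frm", "por", "spa"]


def build_parser():
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(description="OCR argument sketch")
    parser.add_argument("-i", dest="input_dir", required=True, help="input directory")
    parser.add_argument("-o", dest="output_dir", required=True, help="output directory")
    parser.add_argument("-l", dest="language", required=True, choices=LANGUAGES)
    # Boolean flags default to False and become True when present.
    parser.add_argument("--keep-intermediates", action="store_true")
    parser.add_argument("--nCores", type=int, default=min(4, multiprocessing.cpu_count()))
    parser.add_argument("--skip-binarisation", action="store_true")
    return parser


args = build_parser().parse_args(["-i", "/files_for_ocr", "-o", "/files_from_ocr", "-l", "eng"])
print(args.keep_intermediates, args.skip_binarisation)  # False False
```

Only the three required arguments are passed here, so both boolean flags keep their documented default of False.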

Additional language models for OCR

Additional language models are easy to install: add them to the Dockerfile in the same way as the existing models.

The standard language models for various languages can be found at https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.

The more accurate but slower language models can be found at https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.

Language models for Fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.

With Afrikaans support added, the Dockerfile section for the language models would look like this:

RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
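
Because every download line except the last must end with `&& \`, it may be convenient to generate these lines rather than edit them by hand. The following Python sketch does that; `wget_lines` and the language-to-repository mapping are hypothetical, while the URL pattern and target directory are taken from the Dockerfile section above.

```python
# Target directory and URL pattern as used in the Dockerfile section above.
TESSDATA_DIR = "/usr/share/tesseract-ocr/4.00/tessdata"
BASE_URL = "https://github.com/tesseract-ocr/{repo}/raw/master/{lang}.traineddata"


def wget_lines(models):
    """Return Dockerfile continuation lines that download each language model.

    models maps a language code to its repository name ("tessdata_best" or
    "tessdata"). The lines are joined with "&&" plus a trailing backslash,
    so only the last line is left without a continuation.
    """
    commands = [
        "    wget -nv " + BASE_URL.format(repo=repo, lang=lang) + " -P " + TESSDATA_DIR
        for lang, repo in models.items()
    ]
    return " && \\\n".join(commands)


# Illustrative mapping: one model from tessdata_best, one from tessdata.
print(wget_lines({"afr": "tessdata_best", "deu_frak": "tessdata"}))
```

The printed lines can be pasted after the `apt-get install` line of the RUN instruction.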