
OCR

Build image

  1. Clone this repository and navigate into it:
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
  2. Build image:
docker build -t sfb1288inf/ocr:latest .

Alternatively, build directly from the GitLab repository without cloning:

  1. Build image:
docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Download prebuilt image

The GitLab registry provides a prebuilt image. It is built automatically on the conquaire build servers.

  1. Download image:
docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest

Run

  1. Create input and output directories for the OCR software:
mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr
  2. Place your files inside the /<mydatalocation>/files_for_ocr directory. Files can be either PDF (.pdf) or multipage TIFF (.tiff, .tif). All files should contain text in the same language.

  3. Start the OCR process:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v /<mydatalocation>/files_for_ocr:/input \
    -v /<mydatalocation>/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l <languagecode> \
        -o /output

The arguments following sfb1288inf/ocr:latest are described in the "OCR arguments" section.

If you want to use the prebuilt image, replace sfb1288inf/ocr:latest with gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest.

  4. Check your results in the /<mydatalocation>/files_from_ocr directory.
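
Before starting a run, it can help to confirm that the input directory only contains the file types named in the steps above. The following is a minimal sketch, not part of this repository: the function name unsupported_inputs is hypothetical, and matching is case-sensitive because the README does not state whether upper-case extensions are accepted.

```shell
# Hypothetical helper: print every regular file in directory "$1" whose
# name does not end in .pdf, .tif or .tiff (case-sensitive match).
unsupported_inputs() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip subdirectories and unmatched globs
        case $f in
            *.pdf|*.tif|*.tiff) ;;     # supported input, nothing to report
            *) printf '%s\n' "$f" ;;   # anything else is listed
        esac
    done
}
```

Called with the input directory as its only argument, it prints nothing when every file has a supported extension.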

OCR arguments

-l languagecode

  • Tells tesseract which language will be used.
  • options = deu (German), deu_frak (German Fraktur), eng (English), enm (Middle English), fra (French), frm (Middle French), ita (Italian), por (Portuguese), spa (Spanish)
  • required = True
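
Since -l is required and only accepts the codes listed above, a wrapper script could reject other values before starting the container. This is a sketch under assumptions: SUPPORTED_LANGS and is_supported_lang are hypothetical names, and the list has to be extended by hand if further models are added to the Dockerfile.

```shell
# Language codes copied from the options list above.
SUPPORTED_LANGS="deu deu_frak eng enm fra frm ita por spa"

# Hypothetical helper: return 0 if "$1" is one of the listed codes, 1 otherwise.
is_supported_lang() {
    for code in $SUPPORTED_LANGS; do
        [ "$code" = "$1" ] && return 0
    done
    return 1
}
```

For example, `is_supported_lang eng` succeeds, while `is_supported_lang afr` fails until Afrikaans is added to the list.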

--keep-intermediates

  • If set, all intermediate files created during the OCR process will be kept.
  • default = False
  • required = False

--nCores corenumber

  • Sets the number of CPU cores being used during the OCR process.
  • default = min(4, multiprocessing.cpu_count())
  • required = False
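
To see which value the default would resolve to on a given host, the same rule can be approximated in the shell. This is only a sketch: default_ncores is a hypothetical name, and nproc (with getconf as a fallback) is assumed to report the same count as Python's multiprocessing.cpu_count(), which may not hold when CPU affinity is restricted.

```shell
# Approximates the documented default, min(4, number of CPU cores).
default_ncores() {
    # nproc is part of GNU coreutils; getconf is the fallback elsewhere.
    cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
    if [ "$cores" -gt 4 ]; then
        echo 4
    else
        echo "$cores"
    fi
}

default_ncores
```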

--skip-binarisation

  • Skips binarization with ocropus. If set, only the tesseract binarization is used.
  • default = False
  • required = False

Example with all arguments used:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v "$HOME"/ocr/files_for_ocr:/input \
    -v "$HOME"/ocr/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l eng \
        -o /output \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation

Additional language models for OCR

Additional language models can be installed by adding them to the Dockerfile in the same way as the existing models.

The standard language models are available at https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.

The more accurate but slower language models are available at https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.

Language models for Fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.
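
The download URLs above follow a regular pattern, so they can also be assembled from the repository name, the branch or tag, and the language code instead of being copied from the browser. The helper name traineddata_url is hypothetical, and the pattern is only inferred from the two example URLs given above.

```shell
# Hypothetical helper: print the raw download URL of a language model.
# $1 = repository (tessdata or tessdata_best)
# $2 = branch or tag (for example 4.00 or master)
# $3 = language code (for example afr)
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata\n' "$1" "$2" "$3"
}

traineddata_url tessdata 4.00 afr
traineddata_url tessdata_best master afr
```

The two calls print the Afrikaans URLs quoted above.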

With Afrikaans support added, the language model section of the Dockerfile would look like this:

RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata