sporada/ocr


mirror of https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git synced 2026-01-09 12:30:54 +00:00


Patrick Jentsch f280b16b1b Make arguments optional

2019-06-03 14:18:16 +02:00

wrapper: Make arguments optional (2019-06-03 14:18:16 +02:00)
.gitlab-ci.yml: Update .gitlab-ci.yml (2019-03-13 18:33:30 +01:00)
Dockerfile: Use more specific versions. (2019-06-02 21:45:11 +02:00)
hocrtotei: Codestyle (2019-05-20 11:10:40 +02:00)
ocr: Codestyle (2019-05-20 11:10:40 +02:00)
README.md: Update for unprivileged usage 2 (2019-06-03 13:32:42 +02:00)

README.md

OCR

Build image

Clone this repository and navigate into it:

git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr

Build image:

docker build -t sfb1288inf/ocr:latest .

Alternatively, build directly from the GitLab repository without cloning:

docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Download prebuilt image

The GitLab registry provides a prebuilt image. It is built automatically on the conquaire build servers.

Download image:

docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest

Run

Create input and output directories for the OCR software:

mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr

Place your files inside the /<mydatalocation>/files_for_ocr directory. Supported formats are PDF (.pdf) and multipage TIFF (.tiff, .tif). All files should contain text in the same language.
Start the OCR process:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v /<mydatalocation>/files_for_ocr:/input \
    -v /<mydatalocation>/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l <languagecode> \
        -o /output

The arguments that follow the image name sfb1288inf/ocr:latest are described in the OCR arguments section.

If you want to use the prebuilt image, replace sfb1288inf/ocr:latest with gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest.

Check your results in the /<mydatalocation>/files_from_ocr directory.
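Before starting a longer run, it can help to check the input directory and assemble the command programmatically. The following is a minimal Python sketch, not part of this repository: the helper names `split_input_files` and `build_docker_command` are hypothetical, and the case-insensitive extension check and the restriction to top-level files are assumptions about how the OCR software reads /input.

```python
import os
from pathlib import Path

# Extensions the README lists as accepted input formats.
SUPPORTED_SUFFIXES = {".pdf", ".tif", ".tiff"}


def split_input_files(input_dir):
    """Return (supported, unsupported) file names found directly in input_dir."""
    supported, unsupported = [], []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue
        # Assumption: upper-case extensions such as .PDF are treated like .pdf.
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported


def build_docker_command(input_dir, output_dir, language,
                         image="sfb1288inf/ocr:latest"):
    """Assemble the `docker run` call shown above as an argument list."""
    return [
        "docker", "run", "--rm", "-it",
        # Same effect as -u $(id -u $USER):$(id -g $USER) for the current user.
        "-u", f"{os.getuid()}:{os.getgid()}",
        "-v", f"{Path(input_dir).resolve()}:/input",
        "-v", f"{Path(output_dir).resolve()}:/output",
        image,
        "-i", "/input",
        "-l", language,
        "-o", "/output",
    ]
```

The returned list can be passed to `subprocess.run(command, check=True)`. To use the prebuilt image, pass `image="gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest"`.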

OCR arguments

-l languagecode

Tells Tesseract which language to use.
options = deu (German), deu_frak (German Fraktur), eng (English), enm (Middle English), fra (French), frm (Middle French), ita (Italian), por (Portuguese), spa (Spanish)
required = True

--keep-intermediates

If set, all intermediate files created during the OCR process will be kept.
default = False
required = False

--nCores corenumber

Sets the number of CPU cores being used during the OCR process.
default = min(4, multiprocessing.cpu_count())
required = False

--skip-binarisation

Skips binarization with ocropus. If skipped, only the Tesseract binarization is used.
default = False
required = False

Example with all arguments used:

docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v "$HOME"/ocr/files_for_ocr:/input \
    -v "$HOME"/ocr/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l eng \
        -o /output \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation
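The argument list above can be mirrored with Python's `argparse` to see how the defaults and required flags interact. This is only an illustrative sketch: the actual definitions in the ocr script are not shown in this README, the `dest` names are made up for the example, and treating `-i` and `-o` as optional directory paths is an assumption.

```python
import argparse
import multiprocessing

# Language codes listed under "-l languagecode".
LANGUAGES = ["deu", "deu_frak", "eng", "enm", "fra", "frm", "ita", "por", "spa"]


def build_parser():
    """Build a parser that mirrors the documented OCR arguments."""
    parser = argparse.ArgumentParser(description="OCR arguments (illustrative)")
    # -i and -o appear in the docker examples but are not described in the
    # README; their meaning here is an assumption.
    parser.add_argument("-i", dest="input_dir")
    parser.add_argument("-o", dest="output_dir")
    parser.add_argument("-l", dest="language", choices=LANGUAGES, required=True)
    parser.add_argument("--keep-intermediates", action="store_true")
    parser.add_argument("--nCores", type=int,
                        default=min(4, multiprocessing.cpu_count()))
    parser.add_argument("--skip-binarisation", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this layout, omitting `--nCores` uses at most four cores, and both boolean flags stay False unless they are given on the command line.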

Additional language models for OCR

Additional language models are easy to install: add them to the Dockerfile in the same way as the existing models.

The standard language models for various languages can be found at https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.

The more accurate but slower language models can be found at https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.

Language models for Fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.

The Dockerfile section for the language models with added language support for Afrikaans would look like this:

RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
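Because every download line except the last has to end in `&& \`, the list is easy to get wrong when edited by hand. The following Python sketch generates the wget lines for a set of language codes. It is a hypothetical helper, not part of this repository; the URL pattern and target directory are copied from the Dockerfile section above, and the rule that only `deu_frak` comes from the standard tessdata repository is inferred from that section.

```python
TESSDATA_DIR = "/usr/share/tesseract-ocr/4.00/tessdata"
BEST_URL = "https://github.com/tesseract-ocr/tessdata_best/raw/master"
STANDARD_URL = "https://github.com/tesseract-ocr/tessdata/raw/master"

# Models fetched from the standard repository instead of tessdata_best.
# Assumption: extend this set for any other Fraktur model you add.
STANDARD_ONLY = {"deu_frak"}


def model_url(code):
    """Return the download URL for one language code."""
    base = STANDARD_URL if code in STANDARD_ONLY else BEST_URL
    return f"{base}/{code}.traineddata"


def wget_lines(codes):
    """Return chained wget commands as Dockerfile continuation lines."""
    commands = [
        f"wget -nv {model_url(code)} -P {TESSDATA_DIR}"
        for code in sorted(codes)
    ]
    # Joining with '&& \' keeps all downloads in one RUN instruction and
    # leaves the final line without a trailing continuation.
    return " && \\\n".join("    " + command for command in commands)


if __name__ == "__main__":
    print(wget_lines(["afr", "deu", "deu_frak", "eng"]))
```

The printed lines can be pasted after the `apt-get install` line of the RUN instruction, which itself must end in `&& \`.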