sporada/ocr

mirror of https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git synced 2026-07-26 14:33:57 +00:00

Patrick Jentsch 9536116cc2 Update README

2019-05-16 13:17:15 +02:00

OCR

Build image

Clone this repository and navigate into it:

git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr

Build image:

docker build -t sfb1288inf/ocr:latest .

Alternatively, build the image directly from the Git repository without cloning it:

docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git

Download prebuilt image

The GitLab registry provides a prebuilt image. It is built automatically on the conquaire build servers.

Download image:

docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest

Run

1. Create input and output directories for the OCR software:

mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr

2. Place your files inside the /<mydatalocation>/files_for_ocr directory. Files can be either multipage TIFF (.tiff, .tif) or PDF (.pdf) files. All files should contain text in the same language.

3. Start the OCR process:

docker run \
    --rm \
    -it \
    -v /<mydatalocation>/files_for_ocr:/files_for_ocr \
    -v /<mydatalocation>/files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l <languagecode>

The arguments specified after sfb1288inf/ocr:latest are described in the OCR arguments section.

4. Check your results in the /<mydatalocation>/files_from_ocr directory.
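Before starting the container, it can help to check that the input directory contains only supported files. The following is a minimal sketch and not part of the OCR image: the function name `find_unsupported_files` is hypothetical, the extension list is taken from the file types named above, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions named in the instructions above.
# Treating them case-insensitively is an assumption of this sketch.
SUPPORTED_EXTENSIONS = {".tif", ".tiff", ".pdf"}


def find_unsupported_files(input_dir):
    """Return the sorted names of regular files with an unsupported extension."""
    return sorted(
        entry.name
        for entry in Path(input_dir).iterdir()
        if entry.is_file() and entry.suffix.lower() not in SUPPORTED_EXTENSIONS
    )
```

An empty result means every file in the directory has one of the listed extensions. It does not verify that the files are valid TIFF or PDF documents.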

OCR arguments

-i path

Sets the input directory using the specified path.
required = True

-o path

Sets the output directory using the specified path.
required = True

-l languagecode

Tells tesseract which language will be used.
options = deu (German), deu_frak (German Fraktur), eng (English), enm (Middle English), fra (French), frm (Middle French), por (Portuguese), spa (Spanish)
required = True

--keep-intermediates

If set, all intermediate files created during the OCR process will be kept.
default = False
required = False

--nCores corenumber

Sets the number of CPU cores being used during the OCR process.
default = min(4, multiprocessing.cpu_count())
required = False

--skip-binarization

Skips binarization with ocropus. If set, only the tesseract binarization is used.
default = False
required = False
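The arguments above map naturally onto a Python `argparse` definition. The sketch below is an illustration only and is not taken from the `ocr` script itself: the function name `build_parser`, the `dest` names, and the use of `choices` to restrict the language codes are assumptions.

```python
import argparse
import multiprocessing

# Language codes listed in the options above.
LANGUAGES = ["deu", "deu_frak", "eng", "enm", "fra", "frm", "por", "spa"]


def build_parser():
    """Build a parser mirroring the documented arguments (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog="ocr")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="input directory")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="output directory")
    parser.add_argument("-l", dest="lang", required=True, choices=LANGUAGES,
                        help="language passed to tesseract")
    parser.add_argument("--keep-intermediates", action="store_true",
                        help="keep all intermediate files")
    parser.add_argument("--nCores", type=int,
                        default=min(4, multiprocessing.cpu_count()),
                        help="number of CPU cores to use")
    parser.add_argument("--skip-binarization", action="store_true",
                        help="skip ocropus binarization")
    return parser
```

With this definition, `build_parser().parse_args(["-i", "/files_for_ocr", "-o", "/files_from_ocr", "-l", "eng"])` leaves both flags at `False` and `nCores` at the documented default.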

Example with all arguments used:

docker run \
    --rm \
    -it \
    -v "$HOME"/ocr/files_for_ocr:/files_for_ocr \
    -v "$HOME"/ocr/files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l eng \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarization
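When the container is started from a script, the same command can be assembled as an argument list. This is a minimal sketch under stated assumptions: the helper `build_docker_command` and its parameter names are hypothetical, and the flag spellings follow the OCR arguments section.

```python
import shlex


def build_docker_command(input_dir, output_dir, lang,
                         keep_intermediates=False, n_cores=None,
                         skip_binarization=False,
                         image="sfb1288inf/ocr:latest"):
    """Return the docker run invocation as a list of arguments."""
    cmd = [
        "docker", "run", "--rm", "-it",
        "-v", f"{input_dir}:/files_for_ocr",
        "-v", f"{output_dir}:/files_from_ocr",
        image,
        "-i", "/files_for_ocr",
        "-o", "/files_from_ocr",
        "-l", lang,
    ]
    # Optional arguments are appended only when requested.
    if keep_intermediates:
        cmd.append("--keep-intermediates")
    if n_cores is not None:
        cmd += ["--nCores", str(n_cores)]
    if skip_binarization:
        cmd.append("--skip-binarization")
    return cmd


if __name__ == "__main__":
    # Print the command in a shell-safe form instead of running it.
    print(shlex.join(build_docker_command("/data/in", "/data/out", "eng", n_cores=8)))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting problems with paths that contain spaces. The `/data/in` and `/data/out` paths are placeholders.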

Additional language models for OCR

Additional language models can be installed easily: add them to the Dockerfile in the same way as the existing models.

The standard language models for various languages can be found at https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.

The more accurate but slower language models can be found at https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.

Language models for Fraktur fonts can also be found in the standard tessdata repository at https://github.com/tesseract-ocr/tessdata.
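The download URLs follow a regular pattern, so they can also be constructed rather than copied from the web page. The helper below is a hypothetical sketch: the name `traineddata_url` and its parameters are not part of this repository, and the pattern is inferred from the two example URLs above.

```python
def traineddata_url(lang, repo="tessdata", ref="4.00"):
    """Build the raw download URL for a tesseract language model.

    repo is "tessdata" or "tessdata_best"; ref is the branch or tag
    that appears in the example URLs ("4.00" or "master").
    """
    return f"https://github.com/tesseract-ocr/{repo}/raw/{ref}/{lang}.traineddata"
```

Whether a given language exists under a given repository and ref still has to be checked on GitHub; the function only formats the string.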

The Dockerfile section for the language models with added language support for Afrikaans would look like this:

RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
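In a chained `RUN` instruction like this one, every command except the last must end with `&& \`, which is easy to get wrong when adding a line by hand. As a rough sketch, the download lines could be generated instead. The function name `wget_chain` is hypothetical; the target directory is the one used in the Dockerfile section above.

```python
TESSDATA_DIR = "/usr/share/tesseract-ocr/4.00/tessdata"


def wget_chain(urls, target_dir=TESSDATA_DIR):
    """Join one wget command per URL with '&& \\' line continuations."""
    commands = [f"wget -nv {url} -P {target_dir}" for url in urls]
    # The separator is placed between commands only, so the last
    # line carries no trailing continuation.
    return " && \\\n    ".join(commands)
```

The output is meant to be pasted after the `apt-get install` line of the `RUN` instruction, which itself must end with `&& \`.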