diff --git a/README.md b/README.md
index 979302a..8aae34f 100644
--- a/README.md
+++ b/README.md
@@ -1,116 +1,128 @@
-# Installation
+# OCR
-## Install additional packages
-1. Install `screen`. We will use this to execute commands in their own terminal session.
+## Build image
-## Build your own image
+1. Clone this repository and navigate into it:
+```
+git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
+```
+2. Build image:
+```
+docker build -t sfb1288inf/ocr:latest .
+```
-1. Clone this repository and navigate into it.
-2. Build the image from the dockerfile. `docker build -t : .` For example: `docker build -t ocr_container:latest .`
+Alternatively build from git without cloning:
+1. Build image:
+```
+docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
+```
-Alternatively build directly from git.
-1. Use the following command to build directly from gitLab. `docker build -t : https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git`.
+## Download prebuilt image
-## Folder setup
+The GitLab registry provides a prebuilt image. It is automatically created, utilizing the conquaire build servers.
-1. Create input and output folders for the OCR files.
-2. `mkdir -p /some/path//ocr/files_for_ocr /some/path//ocr/files_from_ocr`
+1. Download image:
+```
+docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest
+```
-## Run the container
+## Run
-1. Run container from an image. and /some/path are the same as mentioned in the step folder setup. We are creating two volumes based on the folder paths provided in the section Folder setup.
+1. Create input and output directories for the OCR software:
+```
+mkdir -p //files_for_ocr //files_from_ocr
+```
+2. Place your files inside the `//files_for_ocr` directory. Files can either be
+multipage TIFF (.tiff, .tif) or PDF (.pdf) files. Files should all contain text
+of the same language.
+3. Start the OCR process.
 ```
 docker run \
-    --name \
-    -dit \
-    -v /some/path//files_for_ocr:/root/files_for_ocr \
-    -v /some/path//files_from_ocr:/root/files_from_ocr \
-
+    --rm \
+    -it \
+    -v //files_for_ocr:/files_for_ocr \
+    -v //files_from_ocr:/files_from_ocr \
+    sfb1288inf/ocr:latest \
+    -i /files_for_ocr \
+    -o /files_from_ocr \
+    -l
 ```
+The arguments specified below `sfb1288inf/ocr:latest` are described in the OCR arguments section.
+4. Check your results in the `//files_from_ocr` directory.
-# Usage
+### OCR arguments
-## Start an OCR job
-1. Place some files inside the folder _files\_for\_ocr_. Files can either be multipage tiffs or PDF files. One folder per file is needed. Files should all be of the same language.
-2. Start a screen session with `screen -dmS `
-3. Enter the screen session with `screen -r `. (Try this if the error "Cannot open your terminal '/dev/pts/0' - please check." appears: `script -q -c "screen -r " /dev/null`).
-4. Start the OCR process for all files placed in _files\_for\_ocr_ with `docker exec -it ocr -i files_for_ocr -o files_from_ocr -l `.
+`-i path`
+* Sets the input directory using the specified path.
+* required = True
-Valid language codes are:
-- deu (German)
-- deu_frak (German Fraktur)
-- eng (English)
-- enm (Middle englisch)
-- fra (French)
-- frm (Middle french)
-- por (Portuguese)
-- spa (Spanish)
+`-o path`
+* Sets the output directory using the specified path.
+* required = True
-### Additional OCR arguments
-Below we will describe all available pipeline arguments that can be used.
+`-l languagecode`
+* Tells tesseract which language will be used.
+* options = deu (German), deu_frak (German Fraktur), eng (English),
+enm (Middle English), fra (French), frm (Middle French), por (Portuguese),
+spa (Spanish)
+* required = True
+`--keep-intermediates`
+* If set, all intermediate files created during the OCR process will be
+kept.
+* default = False
+* required = False
-- **_-i some/path_**
-    - Sets the input directory using the specified path.
-    - required = True
+`--nCores corenumber`
+* Sets the number of CPU cores being used during the OCR process.
+* default = min(4, multiprocessing.cpu_count())
+* required = False
-- **_-o some/path_**
-    - Sets the output directory using the specified path.
-    - required = True
-
-- **_-l valid_language_code_**
-    - Tells tesseract which language will be used.
-    - required = True
-
-- **_--keep-intermediates**
-    - Optional argument. If set all intermediate filese created during the OCR process will be kept.
-    - default = False
-    - required = False
-
-- **_--nCores_**
-    - Sets the number of CPU cores being used during the OCR process.
-    - default = min(4, multiprocessing.cpu_count())
-    - required = False
-
-- **_--skip-binarization_**
-    - Used to skip binarization with ocropus.
-    - If skiped, only the tesseract binarization is used.
+`--skip-binarization`
+* Used to skip binarization with ocropus. If skipped, only the tesseract
+binarization is used.
+* default = False
 Example with all arguments used:
+```
+docker run \
+    --rm \
+    -it \
+    -v "$HOME"/ocr/files_for_ocr:/files_for_ocr \
+    -v "$HOME"/ocr/files_from_ocr:/files_from_ocr \
+    sfb1288inf/ocr:latest \
+    -i /files_for_ocr \
+    -o /files_from_ocr \
+    -l eng \
+    --keep-intermediates \
+    --nCores 8 \
+    --skip-binarization
+```
-`docker exec -it ocr -i files_for_ocr -o files_from_ocr -l deu --keep_intermediates --nCores 8`
+# Additional language models for OCR
+Additional language models can be easily installed. Just add them to the `Dockerfile`, analogous to the existing models.
-## Exit and re-enter the current running OCR process
-1. You can leave the currently running OCR process by pressing `ctrl + a + d` and thus leaving the screen session.
-2. Re-enter the screen session to check the status of the running OCR job with `screen -r `. (Try this if the error "Cannot open your terminal '/dev/pts/0' - please check." appears: `script -q -c "screen -r " /dev/null`).
+The standard language models for various languages can be found under https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.
-# Use prebuilt image
-Download via regestry function with login or deploy token.
+The more accurate but slower language models can be found under https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.
-# Add additional traineddata for OCR of additional languages.
-Additional traineddata can be easily added to the Dockerfile.
-Just append the needed data file URL after line 56 in the Dockerfile following the same syntax.
+Language models for fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.
-The standard traineddata for various languages can be found under https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.
-
-The more accurate but slower traineddata can be found under https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.
-
-Traineddata for fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.
-
-The Dockerfile section for the traineddata with added language support for Afrikaans would look like this:
+The `Dockerfile` section for the language models with added language support for Afrikaans would look like this:
 ```
 RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
     wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
     apt-get update && \
     apt-get install -y --no-install-recommends tesseract-ocr && \
+    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
     wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
     wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
     wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
     wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
     wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
     wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
+    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
     wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
-    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
-    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
-```
\ No newline at end of file
+    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
+```
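
The run step documented in the updated README can be sketched as a small shell helper. This is a minimal sketch and not part of the change itself: the function name `build_ocr_cmd`, the variable `VALID_LANGS`, and the idea of a single base directory holding both folders are assumptions made for illustration. The image name, mount points, flags, and language codes are taken from the README text in the diff. The helper only prints the command so it can be inspected before it is run.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): prints the docker run
# command from the README for a given base directory and language code.

# Language codes listed in the README's "OCR arguments" section.
VALID_LANGS="deu deu_frak eng enm fra frm por spa"

build_ocr_cmd() {
    base_dir=$1   # assumed to contain files_for_ocr and files_from_ocr
    lang=$2       # tesseract language code, e.g. eng

    # Reject any language code the README does not list.
    case " $VALID_LANGS " in
        *" $lang "*) ;;
        *) echo "unsupported language code: $lang" >&2; return 1 ;;
    esac

    # Assemble the command exactly as documented; nothing is executed here.
    printf 'docker run --rm -it -v %s/files_for_ocr:/files_for_ocr -v %s/files_from_ocr:/files_from_ocr sfb1288inf/ocr:latest -i /files_for_ocr -o /files_from_ocr -l %s\n' \
        "$base_dir" "$base_dir" "$lang"
}
```

Calling `build_ocr_cmd "$HOME/ocr" eng` prints the full command for review. The sketch does not quote paths in its output, so it assumes the base directory contains no spaces.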