# OCR

## Build image

1. Clone this repository and navigate into it:
```
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
```

2. Build image:
```
docker build -t sfb1288inf/ocr:latest .
```

Alternatively build from the GitLab repository without cloning:

1. Build image:
```
docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
```

## Download prebuilt image

The GitLab registry provides a prebuilt image. It is created automatically on the conquaire build servers.

1. Download image:
```
docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest
```

## Run

1. Create input and output directories for the OCR software:
```
mkdir -p //files_for_ocr //files_from_ocr
```

2. Place your files inside the `//files_for_ocr` directory. Files can either be PDF (.pdf) or multipage TIFF (.tiff, .tif) files. All files should contain text of the same language.

3. Start the OCR process.
```
docker run \
    --rm \
    -it \
    -v //files_for_ocr:/files_for_ocr \
    -v //files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l
```
The arguments below `sfb1288inf/ocr:latest` are described in the [OCR arguments](#ocr-arguments) part.

If you want to use the prebuilt image, replace `sfb1288inf/ocr:latest` with `gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest`.

4. Check your results in the `//files_from_ocr` directory.

### OCR arguments

`-i path`
* Sets the input directory using the specified path.
* required = True

`-o path`
* Sets the output directory using the specified path.
* required = True

`-l languagecode`
* Tells tesseract which language will be used.
* options = deu (German), deu_frak (German Fraktur), eng (English), enm (Middle English), fra (French), frm (Middle French), ita (Italian), por (Portuguese), spa (Spanish)
* required = True

`--keep-intermediates`
* If set, all intermediate files created during the OCR process will be kept.
* default = False
* required = False

`--nCores corenumber`
* Sets the number of CPU cores being used during the OCR process.
* default = min(4, multiprocessing.cpu_count())
* required = False

`--skip-binarisation`
* Skips the binarization with ocropus. If skipped, only the tesseract binarization is used.
* default = False
* required = False

Example with all arguments used:
```
docker run \
    --rm \
    -it \
    -v $HOME/ocr/files_for_ocr:/files_for_ocr \
    -v $HOME/ocr/files_from_ocr:/files_from_ocr \
    sfb1288inf/ocr:latest \
        -i /files_for_ocr \
        -o /files_from_ocr \
        -l eng \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation
```

# Additional language models for OCR

Additional language models are easy to install: add them to the `Dockerfile` in the same way as the existing models.

The standard language models for various languages can be found under https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.

The more accurate but slower language models can be found under https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.

Language models for Fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.

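
The URL pattern described above can be sketched as a small shell helper that prints the `wget` command for one language model. This is only an illustration under stated assumptions: `model_line` is a hypothetical function that is not part of this repository, the defaults (`tessdata_best`, `master`) are taken from the example URLs above, and the target directory is assumed to be `/usr/share/tesseract-ocr/4.00/tessdata`.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Prints a `wget` command that downloads one tesseract language model.
# Usage: model_line <languagecode> [repository] [branch]
#   repository defaults to tessdata_best, branch defaults to master.
model_line() {
    lang="$1"
    repo="${2:-tessdata_best}"
    branch="${3:-master}"
    # Assumed target directory for the downloaded model.
    printf 'wget -nv https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata\n' \
        "$repo" "$branch" "$lang"
}
```

For instance, `model_line afr` prints the command for the more accurate Afrikaans model, while `model_line deu_frak tessdata` selects the standard repository. Inside a `RUN` instruction the printed command still has to be chained to its neighbours with `&& \`.
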
The `Dockerfile` section for the language models, with support for Afrikaans added, would look like this:
```
RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
```
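
After rebuilding the image, it can be useful to check that the new model is actually available to tesseract. The following is a minimal sketch, not part of this repository: `has_lang` is a hypothetical helper that reads the output of `tesseract --list-langs` (a header line followed by one language code per line) from standard input and succeeds only if the given code is listed.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository).
# Succeeds if the language code given as $1 appears as a whole line on stdin.
# -x matches whole lines only, -F treats the code as a fixed string,
# -q suppresses output so that only the exit status is used.
has_lang() {
    grep -qxF -- "$1"
}
```

Assuming the image's entrypoint can be overridden and `tesseract` is on the `PATH` inside the container, a check for Afrikaans could look like `docker run --rm --entrypoint tesseract sfb1288inf/ocr:latest --list-langs | has_lang afr && echo "afr is installed"`. Matching whole lines means that `deu` is not reported as installed merely because `deu_frak` is present.
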