# OCR

## Build image

1. Clone this repository and navigate into it:
```
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
```

2. Build the image:
```
docker build -t sfb1288inf/ocr:latest .
```

Alternatively, build directly from the GitLab repository without cloning:

1. Build the image:
```
docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
```

## Download prebuilt image

The GitLab registry provides a prebuilt image. It is created automatically on the conquaire build servers.

1. Download the image:
```
docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest
```

## Run

1. Create input and output directories for the OCR software:
```
mkdir -p //files_for_ocr //files_from_ocr
```

2. Place your files inside the `//files_for_ocr` directory. Files can be either PDF (.pdf) or multipage TIFF (.tiff, .tif) files. All files should contain text in the same language.

3. Start the OCR process:
```
docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v //files_for_ocr:/input \
    -v //files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l <languagecode> \
        -o /output
```
The arguments after `sfb1288inf/ocr:latest` are described in the [OCR arguments](#ocr-arguments) section.

If you want to use the prebuilt image, replace `sfb1288inf/ocr:latest` with `gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest`.

4. Check your results in the `//files_from_ocr` directory.

### OCR arguments

`-l languagecode`
* Tells tesseract which language to use.
* options = deu (German), eng (English), enm (Middle English), fra (French), frk (German Fraktur), frm (Middle French), ita (Italian), por (Portuguese), spa (Spanish)
* required = True

`--keep-intermediates`
* If set, all intermediate files created during the OCR process are kept.
* default = False
* required = False

`--nCores corenumber`
* Sets the number of CPU cores used during the OCR process.
* default = min(4, multiprocessing.cpu_count())
* required = False

`--skip-binarisation`
* Skips binarization with ocropus. If skipped, only the tesseract binarization is used.
* default = False

Example with all arguments used:
```
docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v "$HOME"/ocr/files_for_ocr:/input \
    -v "$HOME"/ocr/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l eng \
        -o /output \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation
```
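
Because `-l` is required and only accepts the language codes listed under OCR arguments, it can help to check the code before starting a container. The following is a minimal sketch of such a check, not part of the image or the repository: the function names `is_supported_lang` and `print_ocr_command` are hypothetical, and the sketch assumes directory paths without whitespace. It only prints the `docker run` command and does not execute it.

```shell
# Hypothetical helper functions; they are not shipped with the OCR image.

# Language codes listed in the "OCR arguments" section.
OCR_LANGS="deu eng enm fra frk frm ita por spa"

# Returns 0 if "$1" is one of the documented language codes, 1 otherwise.
is_supported_lang() {
    for _code in $OCR_LANGS; do
        if [ "$_code" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Prints (does not execute) the docker command for a language code ($1),
# an input directory ($2) and an output directory ($3).
# Assumption: the directory paths contain no whitespace.
print_ocr_command() {
    if ! is_supported_lang "$1"; then
        echo "unsupported language code: $1" >&2
        return 1
    fi
    echo "docker run --rm -it -u $(id -u):$(id -g)" \
         "-v $2:/input -v $3:/output" \
         "sfb1288inf/ocr:latest -i /input -l $1 -o /output"
}
```

For example, `print_ocr_command eng "$HOME"/ocr/files_for_ocr "$HOME"/ocr/files_from_ocr` prints the command for English input, while an unknown code such as `xyz` yields an error message on stderr and a non-zero exit status.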