# Installation
## Install additional packages
1. Install `screen`. We will use this to execute commands in their own terminal session.
## Build your own image
1. Clone this repository and navigate into it.
2. Build the image from the Dockerfile with `docker build -t <image_name>:<tag> .`, for example `docker build -t ocr_container:latest .`.
Alternatively, build directly from Git.
1. Use the following command to build directly from GitLab: `docker build -t <image_name>:<tag> https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git`.
## Folder setup
1. Create input and output folders for the OCR files.
2. `mkdir -p /some/path/<container-name>/files_for_ocr /some/path/<container-name>/files_from_ocr`
## Run the container
1. Run a container from the image. `<container-name>` and `/some/path` are the same as in the section Folder setup. The command creates two volumes from the folder paths set up there.
```
docker run \
--name <container-name> \
-dit \
-v /some/path/<container-name>/files_for_ocr:/root/files_for_ocr \
-v /some/path/<container-name>/files_from_ocr:/root/files_from_ocr \
<image_name>
```
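The run command above can be wrapped in a small shell function so that the base path, container name and image name are passed in one place. This is only a sketch: `run_ocr_container` and its three arguments are hypothetical names, not part of this repository.

```shell
# Hypothetical wrapper around the run command shown above.
# $1 = base path (the /some/path placeholder)
# $2 = container name
# $3 = image name, optionally with a tag
run_ocr_container() {
    base="$1"
    name="$2"
    image="$3"
    docker run \
        --name "$name" \
        -dit \
        -v "$base/$name/files_for_ocr:/root/files_for_ocr" \
        -v "$base/$name/files_from_ocr:/root/files_from_ocr" \
        "$image"
}

# Example call (commented out so that sourcing this file starts nothing):
# run_ocr_container /some/path my_ocr ocr_container:latest
# docker ps --filter "name=my_ocr"   # check that the container is up
```

Quoting the variables keeps the command intact if the base path contains spaces.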
## Start an OCR job
1. Place some files inside the folder _files\_for\_ocr_. Files can be either multipage TIFFs or PDFs. Each file needs its own folder. All files should be in the same language.
2. Start a screen session with `screen -dmS <container-name>`.
3. Enter the screen session with `screen -r <container-name>`. If this fails with an error, try `script -q -c "screen -r <container-name>" /dev/null` instead.
4. Start the OCR process for all files placed in _files\_for\_ocr_ with `docker exec -it <container-name> ocr -i files_for_ocr -o files_from_ocr -l <language-code>`.
Valid language codes are:
- deu (German)
- deu_frak (German Fraktur)
- eng (English)
- enm (Middle English)
- fra (French)
- frm (Middle French)
- por (Portuguese)
- spa (Spanish)
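Step 1 asks for one folder per input file. A minimal sketch of how loose files could be sorted into such folders is shown below. The helper name `one_folder_per_file` and the choice to name each folder after the file (without its extension) are assumptions, as the README does not prescribe a naming scheme.

```shell
# Hypothetical helper: give every PDF/TIFF directly inside $1 its own subfolder.
# $1 = host folder that is mounted as files_for_ocr
one_folder_per_file() {
    input_dir="$1"
    for f in "$input_dir"/*.pdf "$input_dir"/*.tif "$input_dir"/*.tiff; do
        [ -f "$f" ] || continue               # skip patterns that matched nothing
        file_name="$(basename "$f")"
        folder="$input_dir/${file_name%.*}"   # folder named after the file, extension removed
        mkdir -p "$folder"
        mv "$f" "$folder/"
    done
}

# Example call:
# one_folder_per_file /some/path/my_ocr/files_for_ocr
```

Note that two files sharing a name but differing in extension (for example `a.pdf` and `a.tif`) would end up in the same folder, so such cases need to be renamed first.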
## Exit and re-enter the currently running OCR process
1. You can leave the currently running OCR process by pressing `Ctrl+a` followed by `d`. This detaches the screen session while the OCR job keeps running.
2. Re-enter the screen session to check the status of the running OCR job with `screen -r <container-name>`. If this fails with an error, try `script -q -c "screen -r <container-name>" /dev/null` instead.
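Before re-entering, it can help to check that the session still exists. The following sketch relies on `screen -ls`, which lists sessions as `<pid>.<name>`. The helper name `ocr_session_exists` is hypothetical.

```shell
# Hypothetical helper: succeed if a screen session with the given name exists.
# $1 = session name (the <container-name> used with `screen -dmS`)
ocr_session_exists() {
    # Sessions appear in the listing as "<pid>.<name>" followed by whitespace.
    screen -ls | grep -q "[0-9]\.$1[[:space:]]"
}

# Example call:
# ocr_session_exists my_ocr && screen -r my_ocr
```

The session name is used as part of a regular expression here, so names containing special characters would need escaping.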
## Use prebuilt image
Download the prebuilt image from the registry using a login or a deploy token.
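Pulling from a registry usually comes down to a `docker login` followed by a `docker pull`. The sketch below assumes exactly that. The function name `pull_ocr_image` is hypothetical, and the registry host and image path are not stated in this README, so both stay as arguments you have to fill in yourself.

```shell
# Hypothetical helper: log in to a registry and pull an image from it.
# $1 = registry host (not given in this README)
# $2 = image path inside the registry, optionally with a tag
pull_ocr_image() {
    registry="$1"
    image="$2"
    # docker login prompts for the user name and password (or deploy token).
    docker login "$registry" && docker pull "$registry/$image"
}

# Example call with placeholder values:
# pull_ocr_image registry.example.org some-group/ocr:latest
```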
## Add additional traineddata for OCR of additional languages
Additional traineddata can easily be added to the Dockerfile. Just append the URL of the needed data file after line 56, following the same syntax. The standard traineddata for various languages can be found at https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.
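The Afrikaans example suggests that the data file URLs differ only in the language code. Assuming this pattern holds for other languages as well, a URL can be built as in the sketch below. The function name `traineddata_url` is hypothetical.

```shell
# Build the download URL for a traineddata file, following the pattern
# of the Afrikaans example (assumed to hold for other language codes too).
# $1 = language code, e.g. afr
traineddata_url() {
    printf 'https://github.com/tesseract-ocr/tessdata/raw/4.00/%s.traineddata\n' "$1"
}

# Example: print the URL, then check that it resolves before editing the Dockerfile.
# traineddata_url afr
# curl -fsIL "$(traineddata_url afr)" >/dev/null && echo "reachable"
```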