Update README

Patrick Jentsch 2019-05-16 13:17:15 +02:00
parent 4c0ba270db
commit 9536116cc2

README.md

@@ -1,116 +1,128 @@
# Installation # OCR
## Install additional packages ## Build image
1. Install `screen`. We will use this to execute commands in their own terminal session.
## Build your own image 1. Clone this repository and navigate into it:
```
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
```
2. Build image:
```
docker build -t sfb1288inf/ocr:latest .
```
1. Clone this repository and navigate into it. Alternatively build from git without cloning:
2. Build the image from the dockerfile. `docker build -t <image_name>:<tag> .` For example: `docker build -t ocr_container:latest .` 1. Build image:
```
docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
```
Alternatively build directly from git. ## Download prebuilt image
1. Use the following command to build directly from GitLab: `docker build -t <image_name>:<tag> https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git`.
## Folder setup The GitLab registry provides a prebuilt image. It is built automatically on the conquaire build servers.
1. Create input and output folders for the OCR files. 1. Download image:
2. `mkdir -p /some/path/<container-name>/ocr/files_for_ocr /some/path/<image_name>/ocr/files_from_ocr` ```
docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest
```
## Run the container ## Run
1. Run a container from the image. `<container-name>` and `/some/path` are the same as in the Folder setup step. Two volumes are created from the folder paths given in the Folder setup section. 1. Create input and output directories for the OCR software:
```
mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr
```
2. Place your files inside the `/<mydatalocation>/files_for_ocr` directory. Files can either be
multipage TIFF (.tiff, .tif) or PDF (.pdf) files. Files should all contain text
of the same language.
3. Start the OCR process.
``` ```
docker run \ docker run \
--name <container-name> \ --rm \
-dit \ -it \
-v /some/path/<container-name>/files_for_ocr:/root/files_for_ocr \ -v /<mydatalocation>/files_for_ocr:/files_for_ocr \
-v /some/path/<container-name>/files_from_ocr:/root/files_from_ocr \ -v /<mydatalocation>/files_from_ocr:/files_from_ocr \
<image_name> sfb1288inf/ocr:latest \
-i /files_for_ocr \
-o /files_from_ocr \
-l <languagecode>
``` ```
The arguments specified after `sfb1288inf/ocr:latest` are described in the OCR arguments section.
4. Check your results in the `/<mydatalocation>/files_from_ocr` directory.
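
The preparation steps above can be sketched as a small shell helper. This is a minimal sketch and not part of the repository: the function name `list_ocr_inputs` and the `OCR_DATA` variable are hypothetical, and the check only covers the file types named in step 2. It assumes a `find` that supports `-maxdepth` and `-iname` (GNU or BSD).

```shell
#!/bin/sh
# Hypothetical helper (not part of the ocr repository): list the files in a
# directory that the OCR step accepts according to the README
# (.tif, .tiff and .pdf, matched case-insensitively).
list_ocr_inputs() {
    find "$1" -maxdepth 1 -type f \
        \( -iname '*.tif' -o -iname '*.tiff' -o -iname '*.pdf' \) | sort
}

# Assumed data location; set OCR_DATA to your own /<mydatalocation>.
# A temporary directory is used here only so the sketch runs anywhere.
OCR_DATA="${OCR_DATA:-$(mktemp -d)}"

# Create the input and output directories, then show what would be processed.
mkdir -p "$OCR_DATA/files_for_ocr" "$OCR_DATA/files_from_ocr"
list_ocr_inputs "$OCR_DATA/files_for_ocr"
```

Running the listing before `docker run` makes it easy to spot files with an unsupported extension before the OCR process starts.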
# Usage ### OCR arguments
## Start an OCR job `-i path`
1. Place some files inside the folder _files\_for\_ocr_. Files can either be multipage tiffs or PDF files. One folder per file is needed. Files should all be of the same language. * Sets the input directory using the specified path.
2. Start a screen session with `screen -dmS <container_name>` * required = True
3. Enter the screen session with `screen -r <container-name>`. (Try this if the error "Cannot open your terminal '/dev/pts/0' - please check." appears: `script -q -c "screen -r <container-name>" /dev/null`).
4. Start the OCR process for all files placed in _files\_for\_ocr_ with `docker exec -it <container-name> ocr -i files_for_ocr -o files_from_ocr -l <languagecode>`.
Valid language codes are: `-o path`
- deu (German) * Sets the output directory using the specified path.
- deu_frak (German Fraktur) * required = True
- eng (English)
- enm (Middle English)
- fra (French)
- frm (Middle French)
- por (Portuguese)
- spa (Spanish)
### Additional OCR arguments `-l languagecode`
Below we will describe all available pipeline arguments that can be used. * Tells tesseract which language will be used.
* options = deu (German), deu_frak (German Fraktur), eng (English),
  enm (Middle English), fra (French), frm (Middle French), por (Portuguese),
spa (Spanish)
* required = True
`--keep-intermediates`
* If set, all intermediate files created during the OCR process will be
kept.
* default = False
* required = False
- **_-i some/path_** `--nCores corenumber`
- Sets the input directory using the specified path. * Sets the number of CPU cores being used during the OCR process.
- required = True * default = min(4, multiprocessing.cpu_count())
* required = False
- **_-o some/path_** `--skip-binarization`
- Sets the output directory using the specified path. * Used to skip binarization with ocropus. If skipped, only the tesseract
- required = True binarization is used.
* default = False
- **_-l valid_language_code_**
- Tells tesseract which language will be used.
- required = True
- **_--keep-intermediates_**
- Optional argument. If set, all intermediate files created during the OCR process will be kept.
- default = False
- required = False
- **_--nCores_**
- Sets the number of CPU cores being used during the OCR process.
- default = min(4, multiprocessing.cpu_count())
- required = False
- **_--skip-binarization_**
- Used to skip binarization with ocropus.
- If skipped, only the tesseract binarization is used.
Example with all arguments used: Example with all arguments used:
```
docker run \
--rm \
-it \
-v "$HOME"/ocr/files_for_ocr:/files_for_ocr \
-v "$HOME"/ocr/files_from_ocr:/files_from_ocr \
sfb1288inf/ocr:latest \
-i /files_for_ocr \
-o /files_from_ocr \
-l eng \
    --keep-intermediates \
--nCores 8 \
--skip-binarization
```
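
The documented default for `--nCores` is `min(4, multiprocessing.cpu_count())`. The same calculation can be approximated on the host to decide whether passing `--nCores` explicitly is worthwhile. This is a sketch under assumptions: the function name `default_ncores` is hypothetical, and `nproc` (GNU coreutils) reports the cores available to the current process, which may differ from what Python counts inside the container.

```shell
#!/bin/sh
# Hypothetical helper: approximate min(4, cpu_count) on the host.
# Falls back to getconf when nproc is not installed.
default_ncores() {
    cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
    if [ "$cores" -gt 4 ]; then
        cores=4
    fi
    echo "$cores"
}

default_ncores
```

If the host has more than four cores, a larger value has to be requested explicitly, as in the `--nCores 8` example above.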
`docker exec -it <container-name> ocr -i files_for_ocr -o files_from_ocr -l deu --keep_intermediates --nCores 8` # Additional language models for OCR
Additional language models can be installed easily. Add them to the `Dockerfile` in the same way as the existing models.
## Exit and re-enter the current running OCR process The standard language models for various languages can be found under https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.
1. You can leave the currently running OCR process by pressing `ctrl + a + d` and thus leaving the screen session.
2. Re-enter the screen session to check the status of the running OCR job with `screen -r <container-name>`. (Try this if the error "Cannot open your terminal '/dev/pts/0' - please check." appears: `script -q -c "screen -r <container-name>" /dev/null`).
# Use prebuilt image The more accurate but slower language models can be found under https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.
Download via the registry function with a login or deploy token.
# Add additional traineddata for OCR of additional languages. Language models for fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.
Additional traineddata can be easily added to the Dockerfile.
Just append the needed data file URL after line 56 in the Dockerfile following the same syntax.
The standard traineddata for various languages can be found under https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata. The `Dockerfile` section for the language models with added language support for Afrikaans would look like this:
The more accurate but slower traineddata can be found under https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.
Traineddata for fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata.
The Dockerfile section for the traineddata with added language support for Afrikaans would look like this:
``` ```
RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \ RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \ wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
apt-get update && \ apt-get update && \
apt-get install -y --no-install-recommends tesseract-ocr && \ apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/ita.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
``` ```
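
The `wget` lines in the Dockerfile section follow one pattern, so a line for a new language can be generated rather than typed by hand. A minimal sketch, assuming the URL layout shown above stays valid; the function name `model_line` is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: print the wget line for one language model.
#   $1 = language code (for example afr)
#   $2 = repository name, defaults to tessdata_best; pass tessdata for
#        models that only exist there, such as deu_frak.
model_line() {
    repo=${2:-tessdata_best}
    printf 'wget -nv https://github.com/tesseract-ocr/%s/raw/master/%s.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata\n' "$repo" "$1"
}

model_line afr
model_line deu_frak tessdata
```

Every generated line except the last one in the `RUN` instruction still needs a trailing `&& \` when it is pasted into the `Dockerfile`.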