Update README

Patrick Jentsch 2019-05-17 01:07:39 +02:00
parent 18b659684a
commit ca4f218d2a

@@ -35,9 +35,7 @@ docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest
mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr
```
-2. Place your files inside the `/<mydatalocation>/files_for_ocr` directory. Files can either be
-multipage TIFF (.tiff, .tif) or PDF (.pdf) files. Files should all contain text
-of the same language.
+2. Place your files inside the `/<mydatalocation>/files_for_ocr` directory. Files can either be PDF (.pdf) or multipage TIFF (.tiff, .tif) files. Files should all contain text of the same language.
3. Start the OCR process.
```
@@ -67,9 +65,7 @@ The specified below `sfb1288inf/ocr:latest` are described in the [OCR arguments]
`-l languagecode`
* Tells tesseract which language will be used.
-* options = deu (German), deu_frak (German Fraktur), eng (English),
-enm (Middle englisch), fra (French), frm (Middle french), por (Portuguese),
-spa (Spanish)
+* options = deu (German), deu_frak (German Fraktur), eng (English), enm (Middle englisch), fra (French), frm (Middle french), ita (Italian), por (Portuguese), spa (Spanish)
* required = True
`--keep-intermediates`
@@ -84,8 +80,7 @@ kept.
* required = False
`--skip-binarisation`
-* Used to skip binarization with ocropus. If skipped, only the tesseract
-binarization is used.
+* Used to skip binarization with ocropus. If skipped, only the tesseract binarization is used.
* default = False
Example with all arguments used: