mirror of
https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
synced 2024-12-27 08:24:19 +00:00
Update README.md for additional traineddata.
parent d3854bfdd0
commit 35f8444c5e
README.md (27 lines changed)
@@ -52,4 +52,29 @@ Valid language codes are:
Download via registry function with login or deploy token.

## Add additional traineddata for OCR of additional languages.

Additional traineddata can easily be added to the Dockerfile. Just append the URL of the required data file after line 56, following the same syntax. The standard traineddata for various languages can be found at https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.
Additional traineddata can easily be added to the Dockerfile.
Just append the URL of the required data file after line 56 of the Dockerfile, following the same syntax.

The standard traineddata for various languages can be found at https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata.

The more accurate but slower traineddata can be found at https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. For example, the URL for Afrikaans (afr) is https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata.

Traineddata for Fraktur fonts can also be found in the standard tessdata repository, https://github.com/tesseract-ocr/tessdata.

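The URL patterns described above can be sketched as a small shell helper. This is only an illustration: the function name `traineddata_url` and its arguments are hypothetical and not part of this repository, and the branch names (`4.00`, `master`) are simply the ones that appear in the example URLs.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): print the raw download
# URL for a language code, following the URL patterns quoted above.
# Usage: traineddata_url REPO BRANCH LANG
traineddata_url() {
    repo=$1    # "tessdata" or "tessdata_best"
    branch=$2  # e.g. "4.00" or "master", as in the example URLs
    lang=$3    # language code, e.g. "afr"
    printf 'https://github.com/tesseract-ocr/%s/raw/%s/%s.traineddata\n' \
        "$repo" "$branch" "$lang"
}

traineddata_url tessdata 4.00 afr
traineddata_url tessdata_best master afr
```

The two calls print the Afrikaans URLs given in the text. Whether a file exists for a given language code and branch still has to be checked in the respective repository.
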
With language support for Afrikaans added, the traineddata section of the Dockerfile would look like this:

```
RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \
    wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \
    apt-get update && \
    apt-get install -y --no-install-recommends tesseract-ocr && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \
    wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata
```
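
After rebuilding the image, it can help to check that every expected file actually landed in the tessdata directory. The following is a minimal sketch of such a check, meant to be run inside the container; the function name `missing_traineddata` is hypothetical and not part of this repository, while the directory and language codes are the ones used in the Dockerfile section above.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository): print every language
# code whose .traineddata file is missing from the given directory.
# Usage: missing_traineddata DIR LANG...
missing_traineddata() {
    dir=$1
    shift
    for lang in "$@"; do
        [ -f "$dir/$lang.traineddata" ] || printf '%s\n' "$lang"
    done
}

# Directory and language codes taken from the Dockerfile section above.
missing_traineddata /usr/share/tesseract-ocr/4.00/tessdata \
    deu deu_frak eng enm fra frm por spa afr
```

If nothing is printed, all listed files are present. Running `tesseract --list-langs` inside the container is another way to see which languages Tesseract itself detects.
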