From 35f8444c5e70f31cd52ea9cb5ad1081d05b261e2 Mon Sep 17 00:00:00 2001 From: Stephan Porada Date: Wed, 3 Apr 2019 10:33:00 +0200 Subject: [PATCH] Update README.md for additional traineddata. --- README.md | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 2a0e4e5..b6e9ccb 100644 --- a/README.md +++ b/README.md @@ -52,4 +52,29 @@ Valid language codes are: Download via regestry function with login or deploy token. ## Add additional traineddata for OCR of additional languages. -Additional traineddata can be easily added to the docker file. Just append the needed data file URL after line 56 followin the same syntax. The standard traineddata for various languages can be found under https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata. +Additional traineddata can be easily added to the docker file. +Just append the needed data file URL after line 56 in the Dockerfile following the same syntax. + +The standard traineddata for various languages can be found under https://github.com/tesseract-ocr/tessdata. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata/raw/4.00/afr.traineddata. + +The more accurate but slower traineddata can be found under https://github.com/tesseract-ocr/tessdata_best. Click on one of the languages and copy the link from the download button. The URL for Afrikaans (afr) would be for example https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata. + +Traineddata for fraktur fonts can also be found in the standard tessdata repository https://github.com/tesseract-ocr/tessdata. + +The Dockerfile section for the traineddata with added language support for Afrikaans would look like this: + +``` +RUN echo "deb https://notesalexp.org/tesseract-ocr/stretch/ stretch main" >> /etc/apt/sources.list && \ + wget -O - https://notesalexp.org/debian/alexp_key.asc | apt-key add - && \ + apt-get update && \ + apt-get install -y --no-install-recommends tesseract-ocr && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/deu.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata/raw/master/deu_frak.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/enm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/fra.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/frm.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/por.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata && \ + wget -nv https://github.com/tesseract-ocr/tessdata_best/raw/master/afr.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata +``` \ No newline at end of file