mirror of https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git synced 2024-12-26 05:24:18 +00:00

Patrick Jentsch ca1803ab8a Mark required arguments in scripts as required		2022-02-03 10:40:50 +01:00
wrapper	Codestyle enhacements	2022-01-27 13:40:23 +01:00
.gitlab-ci.yml	Change intermediate image name in order to fix issues with building multiple branches/tags at the same time	2021-03-15 14:11:23 +01:00
Dockerfile	Update to Tesseract 5.0.0, Set version 0.1.0	2022-01-04 11:42:55 +01:00
hocr2tei	Mark required arguments in scripts as required	2022-02-03 10:40:50 +01:00
hocr-combine	Mark required arguments in scripts as required	2022-02-03 10:40:50 +01:00
LICENSE	Update to Tesseract 5.0.0, Set version 0.1.0	2022-01-04 11:42:55 +01:00
ocr	Codestyle enhacements	2022-01-27 13:40:23 +01:00
README.md	Codestyle enhacements	2022-01-27 13:40:23 +01:00

README.md

OCR - Optical Character Recognition

This software implements a heavily parallelized pipeline to recognize text in PDF files. It is used for nopaque's OCR service, but you can also use it standalone; a convenient wrapper script is provided for that purpose. The pipeline is designed to run on Linux operating systems, but with some tweaks it should also run on Windows with WSL installed.

Software used in this pipeline implementation

Official Debian Docker image (buster-slim): https://hub.docker.com/_/debian
- Software from Debian Buster's free repositories
ocropy (1.3.3): https://github.com/ocropus/ocropy/releases/tag/v1.3.3
pyFlow (1.1.20): https://github.com/Illumina/pyflow/releases/tag/v1.1.20
Tesseract OCR (5.0.0): https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0

Installation

Install Docker and Python 3.
Clone this repository: git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
Build the Docker image: docker build -t gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:v0.1.0 ocr
Add the wrapper script (wrapper/ocr relative to this README file) to your ${PATH}.
Create working directories for the pipeline: mkdir -p /<my_data_location>/{input,models,output}.
Place your Tesseract OCR model(s) inside /<my_data_location>/models.
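
The directory setup from the last two steps can also be scripted. The following is a minimal sketch, not part of this repository: the helper names `prepare_workdir` and `list_models` are hypothetical, and it assumes that model files carry the `.traineddata` suffix used in the usage example.

```python
from pathlib import Path


def prepare_workdir(base):
    """Create input/, models/ and output/ below base and return their paths.

    Equivalent to: mkdir -p <base>/{input,models,output}
    """
    base = Path(base)
    dirs = {name: base / name for name in ("input", "models", "output")}
    for path in dirs.values():
        # parents=True / exist_ok=True mirror the behaviour of `mkdir -p`
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def list_models(models_dir):
    """Return the sorted *.traineddata files found directly in models_dir."""
    return sorted(Path(models_dir).glob("*.traineddata"))
```

Calling `list_models` on the models directory before starting the pipeline is a quick way to confirm that at least one model file is in place.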

Use the Pipeline

Place your PDF files inside /<my_data_location>/input. Files should all contain text of the same language.
Clear your /<my_data_location>/output directory.
Start the pipeline process. Check the pipeline help (ocr --help) for more details.

cd /<my_data_location>
# <model_code> is the model filename without the ".traineddata" suffix
ocr \
  --input-dir input \
  --output-dir output \
  --model-file models/<model> \
  -m <model_code> <optional_pipeline_arguments>
# More than one model
ocr \
  --input-dir input \
  --output-dir output \
  --model-file models/<model1> \
  --model-file models/<model2> \
  -m <model1_code>+<model2_code> <optional_pipeline_arguments>
# Instead of multiple --model-file statements, you can also use
ocr \
  --input-dir input \
  --output-dir output \
  --model-file models/* \
  -m <model1_code>+<model2_code> <optional_pipeline_arguments>
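
When calling the wrapper from a script, the model codes can be derived from the model filenames instead of being typed by hand. The sketch below only assembles the argument list, using the options shown above (`--input-dir`, `--output-dir`, `--model-file`, `-m`). The function name `build_ocr_command` is hypothetical and not part of this repository.

```python
from pathlib import Path

MODEL_SUFFIX = ".traineddata"


def build_ocr_command(input_dir, output_dir, model_files, extra_args=()):
    """Assemble the argument list for the `ocr` wrapper (does not run it)."""
    model_files = [Path(m) for m in model_files]
    if not model_files:
        raise ValueError("at least one model file is required")

    cmd = ["ocr", "--input-dir", str(input_dir), "--output-dir", str(output_dir)]
    codes = []
    for model in model_files:
        if not model.name.endswith(MODEL_SUFFIX):
            raise ValueError(f"not a {MODEL_SUFFIX} file: {model}")
        cmd += ["--model-file", str(model)]
        # The model code is the filename without the ".traineddata" suffix.
        codes.append(model.name[: -len(MODEL_SUFFIX)])

    # Several model codes are joined with "+", as in the examples above.
    cmd += ["-m", "+".join(codes)]
    cmd += list(extra_args)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True, cwd="/<my_data_location>")`, provided the wrapper script is on your `${PATH}`.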

Check your results in the /<my_data_location>/output directory.
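
For a quick overview of what a run produced, the files in the output directory can be counted by file suffix. This is a generic sketch that makes no assumption about the names or layout of the pipeline's output files. `summarize_output` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def summarize_output(output_dir):
    """Count regular files below output_dir (recursively), grouped by suffix."""
    return Counter(
        path.suffix or "(no suffix)"
        for path in Path(output_dir).rglob("*")
        if path.is_file()
    )
```

For example, `summarize_output("/<my_data_location>/output")` returns a `Counter` that maps each file suffix to the number of files with that suffix.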