
OCR - Optical Character Recognition

This software implements a heavily parallelized pipeline to recognize text in PDF files. It is used for nopaque's OCR service, but you can also use it standalone; a convenient wrapper script is provided for that purpose. The pipeline is designed to run on Linux, but with some tweaks it should also run on Windows with WSL installed.

Software used in this pipeline implementation

Installation

  1. Install Docker and Python 3.
  2. Clone this repository: git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
  3. Build the Docker image: docker build -t gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:v0.1.0 ocr
  4. Add the wrapper script (wrapper/ocr relative to this README file) to your ${PATH}.
  5. Create working directories for the pipeline: mkdir -p /<my_data_location>/{input,models,output}.
  6. Place your Tesseract OCR model(s) inside /<my_data_location>/models.
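Steps 4 and 5 above can be sketched as the following shell snippet. It is only an illustration under stated assumptions: `REPO_DIR` and `DATA_DIR` are hypothetical variable names, the repository is assumed to have been cloned into `./ocr`, and the data location falls back to a scratch directory so the snippet can be tried without touching real data.

```shell
# Assumption: the repository was cloned into ./ocr (step 2).
REPO_DIR="${REPO_DIR:-$PWD/ocr}"

# Step 4: make the wrapper script (wrapper/ocr) reachable via PATH
# for the current shell session only.
export PATH="$REPO_DIR/wrapper:$PATH"

# Hypothetical data location; set DATA_DIR to your real path,
# otherwise a scratch directory is used for a dry run.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"

# Step 5: create the three working directories. Explicit paths are used
# instead of {input,models,output} so this also works in shells
# without brace expansion.
mkdir -p "$DATA_DIR/input" "$DATA_DIR/models" "$DATA_DIR/output"

echo "Working directories created in: $DATA_DIR"
```

To keep the `PATH` change across sessions, the `export` line would typically go into a shell startup file such as `~/.bashrc`.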

Use the Pipeline

  1. Place your PDF files inside /<my_data_location>/input. All files should contain text in the same language.
  2. Clear your /<my_data_location>/output directory.
  3. Start the pipeline process. Check the pipeline help (ocr --help) for more details.
     cd /<my_data_location>
     ocr -i input -o output -m models/<model_name> -l <language_code> <optional_pipeline_arguments>
     # or
     ocr -i input -o output -m models/* -l <language_code> <optional_pipeline_arguments>
  4. Check your results in the /<my_data_location>/output directory.
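Before starting the pipeline, the preparation in steps 1 and 2 could be checked with a small shell sketch like the one below. It does not call `ocr` itself. `DATA_DIR` and `pdf_count` are hypothetical names, the fallback to a scratch directory is only there so the snippet can be tried safely, and `-mindepth`, `-maxdepth`, and `-delete` assume GNU or BSD `find`.

```shell
# Hypothetical data location; set DATA_DIR to your real path,
# otherwise a scratch directory is used for a dry run.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"
mkdir -p "$DATA_DIR/input" "$DATA_DIR/output"

# Step 2: empty the output directory without removing the directory itself.
# The :? guard aborts if DATA_DIR is unset or empty, so the path can
# never collapse to /output.
find "${DATA_DIR:?}/output" -mindepth 1 -delete

# Step 1 sanity check: count the PDF files waiting in the input directory.
# Note that -name is case-sensitive, so files ending in .PDF are not counted.
pdf_count=$(find "$DATA_DIR/input" -maxdepth 1 -type f -name '*.pdf' | wc -l)
echo "PDF files found in input: $((pdf_count))"
```

If the count is zero, the pipeline has nothing to process, so it is worth checking the input path before running `ocr`.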