Mirror of https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git (synced 2025-10-31 20:03:14 +00:00)
	OCR
Build image
- Clone this repository and navigate into it:
git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git && cd ocr
- Build image:
docker build -t sfb1288inf/ocr:latest .
Alternatively, build directly from the GitLab repository without cloning it first:
- Build image:
docker build -t sfb1288inf/ocr:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
Download prebuilt image
The GitLab registry provides a prebuilt image, which is built automatically on the conquaire build servers.
- Download image:
docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest
Run
- Create input and output directories for the OCR software:
mkdir -p /<mydatalocation>/files_for_ocr /<mydatalocation>/files_from_ocr
- Place your files inside the /<mydatalocation>/files_for_ocr directory. Files can be either PDF (.pdf) or multipage TIFF (.tiff, .tif) files. All files should contain text in the same language.
- Start the OCR process:
docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v /<mydatalocation>/files_for_ocr:/input \
    -v /<mydatalocation>/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l <languagecode> \
        -o /output
The arguments that follow sfb1288inf/ocr:latest are described in the OCR arguments section.
If you want to use the prebuilt image, replace sfb1288inf/ocr:latest with gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:latest.
- Check your results in the /<mydatalocation>/files_from_ocr directory.
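Before starting the container, it can help to check that the input directory only holds the file types listed above. The following is a minimal sketch, not part of the OCR software: the function name `split_by_support` is hypothetical, and matching extensions exactly as written in this README (lowercase only) is an assumption, so a file such as `SCAN.PDF` is reported as unsupported here even though the tool may accept it.

```python
from pathlib import Path

# Extensions named in this README: PDF and multipage TIFF.
# Assumption: only the lowercase spellings listed there are treated as supported.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def split_by_support(input_dir):
    """Return (supported, unsupported) file names found directly in input_dir."""
    supported, unsupported = [], []
    for path in sorted(Path(input_dir).iterdir()):
        if not path.is_file():
            continue  # subdirectories are ignored in this sketch
        if path.suffix in SUPPORTED_EXTENSIONS:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

Calling `split_by_support("/<mydatalocation>/files_for_ocr")` returns two lists, so anything in the second list can be moved out or converted before the run. The check cannot tell whether all files share one language; that still has to be verified by hand.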
OCR arguments
-l languagecode
- Tells tesseract which language will be used.
- options = deu (German), eng (English), enm (Middle English), fra (French), frk (German Fraktur), frm (Middle French), ita (Italian), por (Portuguese), spa (Spanish)
- required = True
--keep-intermediates
- If set, all intermediate files created during the OCR process will be kept.
- default = False
- required = False
--nCores corenumber
- Sets the number of CPU cores being used during the OCR process.
- default = min(4, multiprocessing.cpu_count())
- required = False
--skip-binarisation
- Skips binarization with ocropus, so that only the tesseract binarization is used.
- default = False
- required = False
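The options above map naturally onto a command-line parser. The sketch below uses Python's `argparse` to illustrate them; it is not taken from the project's source. The `dest` names, the `build_parser` function, and the decision to mark `-i` and `-o` as required are assumptions, since this README does not describe those two options.

```python
import argparse
import multiprocessing

# Language codes listed under "-l languagecode" in this README.
LANGUAGE_CODES = ["deu", "eng", "enm", "fra", "frk", "frm", "ita", "por", "spa"]


def build_parser():
    """Build an illustrative parser for the documented OCR arguments."""
    parser = argparse.ArgumentParser(description="OCR options (illustrative sketch)")
    # Assumption: -i and -o are required; the README only shows them in use.
    parser.add_argument("-i", dest="input_dir", required=True)
    parser.add_argument("-o", dest="output_dir", required=True)
    parser.add_argument("-l", dest="language", required=True, choices=LANGUAGE_CODES)
    # Boolean switches default to False, as documented.
    parser.add_argument("--keep-intermediates", action="store_true")
    parser.add_argument("--skip-binarisation", action="store_true")
    # Documented default: min(4, multiprocessing.cpu_count()).
    parser.add_argument("--nCores", type=int,
                        default=min(4, multiprocessing.cpu_count()))
    return parser
```

With this sketch, `build_parser().parse_args(["-i", "/input", "-l", "eng", "-o", "/output"])` leaves both switches at `False` and caps `nCores` at four, which matches the defaults listed above.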
Example with all arguments used:
docker run \
    --rm \
    -it \
    -u $(id -u $USER):$(id -g $USER) \
    -v "$HOME"/ocr/files_for_ocr:/input \
    -v "$HOME"/ocr/files_from_ocr:/output \
    sfb1288inf/ocr:latest \
        -i /input \
        -l eng \
        -o /output \
        --keep-intermediates \
        --nCores 8 \
        --skip-binarisation
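When the same command is run for many directories, assembling it in a script avoids retyping the volume mappings. The following is a rough sketch under stated assumptions: `build_docker_command` and its parameter names are hypothetical, the flag spellings follow the OCR arguments section, and `os.getuid()`/`os.getgid()` stand in for `$(id -u $USER)` and `$(id -g $USER)`, which only works on Unix-like systems.

```python
import os
import shlex


def build_docker_command(input_dir, output_dir, language,
                         image="sfb1288inf/ocr:latest",
                         keep_intermediates=False, n_cores=None,
                         skip_binarisation=False):
    """Return the docker run invocation from this README as an argument list."""
    command = [
        "docker", "run", "--rm", "-it",
        # Run as the calling user so output files are not owned by root.
        "-u", f"{os.getuid()}:{os.getgid()}",
        "-v", f"{input_dir}:/input",
        "-v", f"{output_dir}:/output",
        image,
        "-i", "/input",
        "-l", language,
        "-o", "/output",
    ]
    # Optional arguments are appended only when requested.
    if keep_intermediates:
        command.append("--keep-intermediates")
    if n_cores is not None:
        command += ["--nCores", str(n_cores)]
    if skip_binarisation:
        command.append("--skip-binarisation")
    return command


if __name__ == "__main__":
    home = os.path.expanduser("~")
    cmd = build_docker_command(f"{home}/ocr/files_for_ocr",
                               f"{home}/ocr/files_from_ocr",
                               "eng", n_cores=8)
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps paths with spaces intact if the command is later passed to `subprocess.run`. Note that `-it` expects a terminal, so it may need to be dropped when the command runs from a non-interactive job.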