Mirror of https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git, synced 2025-10-31 20:03:14 +00:00
OCR - Optical Character Recognition
This software implements a heavily parallelized pipeline to recognize text in PDF files. It is used for nopaque's OCR service, but you can also use it standalone; a convenient wrapper script is provided for that purpose. The pipeline is designed to run on Linux, but with some tweaks it should also run on Windows with WSL installed.
Software used in this pipeline implementation
- Official Debian Docker image (buster-slim): https://hub.docker.com/_/debian
- Software from Debian Buster's free repositories
- ocropy (1.3.3): https://github.com/ocropus/ocropy/releases/tag/v1.3.3
- pyFlow (1.1.20): https://github.com/Illumina/pyflow/releases/tag/v1.1.20
- Tesseract OCR (5.0.0): https://github.com/tesseract-ocr/tesseract/releases/tag/5.0.0
Installation
- Install Docker and Python 3.
- Clone this repository: git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/ocr.git
- Build the Docker image: docker build -t gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/ocr:0.1.0 ocr
- Add the wrapper script (wrapper/ocr, relative to this README file) to your ${PATH}.
- Create working directories for the pipeline: mkdir -p /<my_data_location>/{input,models,output}.
- Place your Tesseract OCR model(s) inside /<my_data_location>/models.
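The directory setup from the steps above can be sketched as a small shell helper. This is a minimal sketch, not a script shipped with the repository: the function name `setup_workdirs` and the example paths are assumptions for illustration. The directories are created one by one so the snippet does not depend on Bash brace expansion.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): creates the three
# working directories the pipeline expects below a given data location.
setup_workdirs() {
    data_dir="$1"
    if [ -z "${data_dir}" ]; then
        echo "usage: setup_workdirs <my_data_location>" >&2
        return 1
    fi
    mkdir -p "${data_dir}/input" "${data_dir}/models" "${data_dir}/output"
}

# Example (both paths are placeholders, adjust them to your system):
#   export PATH="${PATH}:/path/to/ocr/wrapper"
#   setup_workdirs /path/to/my_data_location
```

Adding the `export` line to your shell profile makes the wrapper script available in new sessions as well.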
Use the Pipeline
- Place your PDF files inside /<my_data_location>/input. Files should all contain text of the same language.
- Clear your /<my_data_location>/output directory.
- Start the pipeline process. Check the pipeline help (ocr --help) for more details.
cd /<my_data_location>
ocr -i input -o output -m models/<model_name> -l <language_code> <optional_pipeline_arguments>
# or
ocr -i input -o output -m models/* -l <language_code> <optional_pipeline_arguments>
- Check your results in the /<my_data_location>/output directory.
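The first two steps above (PDF files in place, output directory cleared) can be checked before each run with a small helper. The following is a sketch under stated assumptions: `prepare_run` is a hypothetical function, not part of the wrapper script, and it assumes the input files carry a `.pdf` extension and sit directly inside the input directory. It relies on the `-maxdepth`, `-mindepth`, `-iname` and `-delete` options of `find`, which GNU find on Linux provides.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: verifies that input/ holds at least one
# PDF file and then empties output/ inside the given data location.
prepare_run() {
    data_dir="$1"
    if [ ! -d "${data_dir}/input" ] || [ ! -d "${data_dir}/output" ]; then
        echo "missing input or output directory in ${data_dir}" >&2
        return 1
    fi
    # Look for at least one *.pdf file directly inside input/.
    if [ -z "$(find "${data_dir}/input" -maxdepth 1 -type f -iname '*.pdf' | head -n 1)" ]; then
        echo "no PDF files found in ${data_dir}/input" >&2
        return 1
    fi
    # Remove everything inside output/, but keep the directory itself.
    find "${data_dir}/output" -mindepth 1 -delete
}

# Example, using the placeholders from this README:
#   prepare_run /<my_data_location> && cd /<my_data_location> \
#     && ocr -i input -o output -m models/<model_name> -l <language_code>
```

The check runs before anything is deleted, so a failed pre-flight leaves earlier results in the output directory untouched.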