Mirror of https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp.git, synced 2025-10-31 13:02:44 +00:00
	Update
		| @@ -31,6 +31,7 @@ RUN pip3 install wheel && pip3 install -U spacy && \ | ||||
|     python3 -m spacy download en && \ | ||||
|     python3 -m spacy download es && \ | ||||
|     python3 -m spacy download fr && \ | ||||
|     python3 -m spacy download it && \ | ||||
|     python3 -m spacy download pt | ||||
|  | ||||
| COPY nlp /usr/local/bin | ||||
|   | ||||
							
								
								
									
README.md (86 lines changed)
							| @@ -1,37 +1,73 @@ | ||||
| # Natural language processing | ||||
|  | ||||
| This repository provides all code that is needed to build a container image for natural language processing utilising [spaCy](https://spacy.io). | ||||
| In case you don't want to build the image by yourself, there is also a prebuild image that can be used in the [registry](https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp/container_registry). | ||||
| This repository provides all code that is needed to build a container image for natural language processing utilizing [spaCy](https://spacy.io). | ||||
|  | ||||
| ## Build the image | ||||
| ## Build image | ||||
|  | ||||
| ```console | ||||
| user@machine:~$ cd <path-to-this-repository> | ||||
| user@machine:~$ docker build -t gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp . | ||||
| 1. Clone this repository and navigate into it: | ||||
| ``` | ||||
| git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp.git && cd nlp | ||||
| ``` | ||||
|  | ||||
| ## Starting a container | ||||
|  | ||||
| ```console | ||||
| user@machine:~$ docker run \ | ||||
|   --name nlp-container \ | ||||
|   -dit \ | ||||
|   -v <your-input-directory>:/root/files_for_nlp \ | ||||
|   -v <your-output-directory>:/root/files_from_nlp \ | ||||
|   gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp | ||||
| 2. Build image: | ||||
| ``` | ||||
| docker build -t sfb1288inf/nlp:latest . | ||||
| ``` | ||||
|  | ||||
| ## Start a natural language processing run | ||||
| Alternatively build from the GitLab repository without cloning: | ||||
|  | ||||
| ```console | ||||
| user@machine:~$ docker exec -it nlp-container \ | ||||
|   nlp -i files_for_nlp -o files_from_nlp -l <language-code> | ||||
| 1. Build image: | ||||
| ``` | ||||
| docker build -t sfb1288inf/nlp:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp.git | ||||
| ``` | ||||
|  | ||||
| Where <language-code> needs to be one of the following: | ||||
| ## Download prebuilt image | ||||
|  | ||||
| * de (German) | ||||
| * en (English) | ||||
| * es (Spanish) | ||||
| * fr (French) | ||||
| * pt (Portuguese) | ||||
| The GitLab registry provides a prebuilt image. It is automatically created, utilizing the conquaire build servers. | ||||
|  | ||||
| 1. Download image: | ||||
| ``` | ||||
| docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp:latest | ||||
| ``` | ||||
|  | ||||
| ## Run | ||||
|  | ||||
| 1. Create input and output directories for the NLP software: | ||||
| ``` | ||||
| mkdir -p /<mydatalocation>/files_for_nlp /<mydatalocation>/files_from_nlp | ||||
| ``` | ||||
|  | ||||
| 2. Place your text files inside the `/<mydatalocation>/files_for_nlp` directory. Files should all contain text of the same language. | ||||
|  | ||||
| 3. Start the NLP process. | ||||
| ``` | ||||
| docker run \ | ||||
|     --rm \ | ||||
|     -it \ | ||||
|     -v /<mydatalocation>/files_for_nlp:/files_for_nlp \ | ||||
|     -v /<mydatalocation>/files_from_nlp:/files_from_nlp \ | ||||
|     sfb1288inf/nlp:latest \ | ||||
|         -i /files_for_nlp \ | ||||
|         -o /files_from_nlp \ | ||||
|         -l <languagecode> | ||||
| ``` | ||||
| The arguments below `sfb1288inf/nlp:latest` are described in the [NLP arguments](#nlp-arguments) part. | ||||
|  | ||||
| If you want to use the prebuilt image, replace `sfb1288inf/nlp:latest` with `gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp:latest`. | ||||
|  | ||||
| 4. Check your results in the `/<mydatalocation>/files_from_nlp` directory. | ||||
|  | ||||
| ### NLP arguments | ||||
|  | ||||
| `-i path` | ||||
| * Sets the input directory using the specified path. | ||||
| * required = True | ||||
|  | ||||
| `-o path` | ||||
| * Sets the output directory using the specified path. | ||||
| * required = True | ||||
|  | ||||
| `-l languagecode` | ||||
| * Tells spaCy which language will be used. | ||||
| * options = de (German), el (Greek), en (English), es (Spanish), fr (French), it (Italian), nl (Dutch), pt (Portuguese) | ||||
| * required = True | ||||
|   | ||||
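The run step documented in the README hunk can also be scripted. The following is a minimal sketch, assuming a hypothetical helper `build_docker_command` (not part of this repository) that assembles the `docker run` invocation from the README as an argument list and rejects language codes outside the documented options:

```python
from pathlib import Path

# Language codes listed under "NLP arguments" in the README.
SUPPORTED_LANGUAGES = ('de', 'el', 'en', 'es', 'fr', 'it', 'nl', 'pt')


def build_docker_command(input_dir, output_dir, lang,
                         image='sfb1288inf/nlp:latest'):
    """Return the docker run argument list shown in the README.

    Hypothetical helper for illustration; not part of the repository.
    """
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f'unsupported language code: {lang!r}')
    # Bind mounts given with -v need absolute host paths.
    input_dir = Path(input_dir).resolve()
    output_dir = Path(output_dir).resolve()
    return [
        'docker', 'run', '--rm', '-it',
        '-v', f'{input_dir}:/files_for_nlp',
        '-v', f'{output_dir}:/files_from_nlp',
        image,
        '-i', '/files_for_nlp',
        '-o', '/files_from_nlp',
        '-l', lang,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)`. To use the prebuilt image, pass `image='gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp:latest'`. Note that `-it` expects an attached terminal, so a non-interactive caller may need to drop it.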
							
								
								
									
nlp (2 lines changed)
							| @@ -28,7 +28,7 @@ def parse_arguments(): | ||||
|     ) | ||||
|     parser.add_argument( | ||||
|         '-l', | ||||
|         choices=['de', 'en', 'es', 'fr', 'pt'], | ||||
|         choices=['de', 'el', 'en', 'es', 'fr', 'it', 'nl', 'pt'], | ||||
|         dest='lang', | ||||
|         required=True | ||||
|     ) | ||||
|   | ||||
| @@ -15,7 +15,7 @@ parser.add_argument( | ||||
| ) | ||||
| parser.add_argument( | ||||
|     '-l', | ||||
|     choices=['de', 'en', 'es', 'fr', 'pt'], | ||||
|     choices=['de', 'el', 'en', 'es', 'fr', 'it', 'nl', 'pt'], | ||||
|     dest='lang', | ||||
|     required=True | ||||
| ) | ||||
| @@ -26,8 +26,9 @@ parser.add_argument( | ||||
| args = parser.parse_args() | ||||
|  | ||||
| SPACY_MODELS = { | ||||
|     'de': 'de_core_news_sm', 'en': 'en_core_web_sm', 'es': 'es_core_news_sm', | ||||
|     'fr': 'fr_core_news_sm', 'pt': 'pt_core_news_sm' | ||||
|     'de': 'de_core_news_sm', 'el': 'el_core_news_sm', 'en': 'en_core_web_sm', | ||||
|     'es': 'es_core_news_sm', 'fr': 'fr_core_news_sm', 'it': 'it_core_news_sm', | ||||
|     'nl': 'nl_core_news_sm', 'pt': 'pt_core_news_sm' | ||||
| } | ||||
|  | ||||
| # Set the language model for spacy | ||||
|   | ||||
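The hunks above extend both the accepted `-l` choices and the `SPACY_MODELS` mapping. The following is a minimal sketch of how the two might fit together, assuming the script resolves the model name with a dictionary lookup on `args.lang` (the lookup itself is not shown in this diff). The `dest` names for `-i` and `-o` are assumptions, and deriving `choices` from the mapping is a suggestion rather than what the commit does:

```python
import argparse

# Mapping as it appears after this commit.
SPACY_MODELS = {
    'de': 'de_core_news_sm', 'el': 'el_core_news_sm', 'en': 'en_core_web_sm',
    'es': 'es_core_news_sm', 'fr': 'fr_core_news_sm', 'it': 'it_core_news_sm',
    'nl': 'nl_core_news_sm', 'pt': 'pt_core_news_sm'
}


def parse_arguments(argv=None):
    parser = argparse.ArgumentParser()
    # Assumption: the dest names for -i and -o are illustrative.
    parser.add_argument('-i', dest='input_dir', required=True)
    parser.add_argument('-o', dest='output_dir', required=True)
    # Assumption: choices are derived from the mapping keys so that the
    # argument parser and the model table cannot drift apart.
    parser.add_argument(
        '-l',
        choices=sorted(SPACY_MODELS),
        dest='lang',
        required=True
    )
    return parser.parse_args(argv)


args = parse_arguments(
    ['-i', '/files_for_nlp', '-o', '/files_from_nlp', '-l', 'it']
)
# Look up the spaCy model name for the requested language.
model_name = SPACY_MODELS[args.lang]
print(model_name)
```

The resulting `model_name` would then be handed to `spacy.load`, which only succeeds if the corresponding model was downloaded into the image.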