Patrick Jentsch 2019-05-20 12:08:13 +02:00
parent 5b7bc2a840
commit 6c8b32fad4
4 changed files with 67 additions and 29 deletions


@@ -31,6 +31,7 @@ RUN pip3 install wheel && pip3 install -U spacy && \
 python3 -m spacy download en && \
 python3 -m spacy download es && \
 python3 -m spacy download fr && \
+python3 -m spacy download it && \
 python3 -m spacy download pt
 COPY nlp /usr/local/bin
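The RUN layer above chains one `spacy download` invocation per language code. As a minimal sketch, such a chained command line could be generated from a list of codes; the helper name and the code list here are illustrative, not part of the repository:

```python
# Illustrative helper (not part of this repository): generate the chained
# "python3 -m spacy download" commands used in the Dockerfile RUN instruction.
def build_download_command(codes):
    parts = [f"python3 -m spacy download {code}" for code in codes]
    return " && \\\n".join(parts)

print(build_download_command(["en", "es", "fr", "it", "pt"]))
```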


@@ -1,37 +1,73 @@
 # Natural language processing
-This repository provides all code that is needed to build a container image for natural language processing utilising [spaCy](https://spacy.io).
+This repository provides all code that is needed to build a container image for natural language processing utilizing [spaCy](https://spacy.io).
+In case you don't want to build the image yourself, there is also a prebuilt image that can be used in the [registry](https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp/container_registry).
-## Build the image
-```console
-user@machine:~$ cd <path-to-this-repository>
-user@machine:~$ docker build -t gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp .
-```
-## Starting a container
-```console
-user@machine:~$ docker run \
-    --name nlp-container \
-    -dit \
-    -v <your-input-directory>:/root/files_for_nlp \
-    -v <your-output-directory>:/root/files_from_nlp \
-    gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp
-```
-## Start a natural language processing run
-```console
-user@machine:~$ docker exec -it nlp-container \
-    nlp -i files_for_nlp -o files_from_nlp -l <language-code>
-```
-Where <language-code> needs to be one of the following:
-* de (German)
-* en (English)
-* es (Spanish)
-* fr (French)
-* pt (Portuguese)
+## Build image
+1. Clone this repository and navigate into it:
+```
+git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp.git && cd nlp
+```
+2. Build image:
+```
+docker build -t sfb1288inf/nlp:latest .
+```
+Alternatively, build from the GitLab repository without cloning:
+1. Build image:
+```
+docker build -t sfb1288inf/nlp:latest https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp.git
+```
+## Download prebuilt image
+The GitLab registry provides a prebuilt image. It is automatically created, utilizing the conquaire build servers.
+1. Download image:
+```
+docker pull gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp:latest
+```
+## Run
+1. Create input and output directories for the NLP software:
+```
+mkdir -p /<mydatalocation>/files_for_nlp /<mydatalocation>/files_from_nlp
+```
+2. Place your text files inside the `/<mydatalocation>/files_for_nlp` directory. Files should all contain text in the same language.
+3. Start the NLP process:
+```
+docker run \
+    --rm \
+    -it \
+    -v /<mydatalocation>/files_for_nlp:/files_for_nlp \
+    -v /<mydatalocation>/files_from_nlp:/files_from_nlp \
+    sfb1288inf/nlp:latest \
+    -i /files_for_nlp \
+    -o /files_from_nlp \
+    -l <languagecode>
+```
+The arguments following `sfb1288inf/nlp:latest` are described in the [NLP arguments](#nlp-arguments) section.
+If you want to use the prebuilt image, replace `sfb1288inf/nlp:latest` with `gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp:latest`.
+4. Check your results in the `/<mydatalocation>/files_from_nlp` directory.
+### NLP arguments
+`-i path`
+* Sets the input directory using the specified path.
+* required = True
+`-o path`
+* Sets the output directory using the specified path.
+* required = True
+`-l languagecode`
+* Tells spaCy which language will be used.
+* options = de (German), el (Greek), en (English), es (Spanish), fr (French), it (Italian), nl (Dutch), pt (Portuguese)
+* required = True
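The documented argument list can be sketched as an argparse parser. Only the flags and the choices come from the README; the destination names `input_dir` and `output_dir` are assumptions for illustration, not taken from the script:

```python
import argparse

# Sketch of a parser matching the documented NLP arguments; the dest
# names input_dir/output_dir are illustrative, not from the repository.
parser = argparse.ArgumentParser(description='NLP argument sketch')
parser.add_argument('-i', dest='input_dir', required=True)
parser.add_argument('-o', dest='output_dir', required=True)
parser.add_argument(
    '-l',
    choices=['de', 'el', 'en', 'es', 'fr', 'it', 'nl', 'pt'],
    dest='lang',
    required=True
)

args = parser.parse_args(['-i', '/files_for_nlp', '-o', '/files_from_nlp', '-l', 'de'])
print(args.lang)
```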

nlp

@@ -28,7 +28,7 @@ def parse_arguments():
     )
     parser.add_argument(
         '-l',
-        choices=['de', 'en', 'es', 'fr', 'pt'],
+        choices=['de', 'el', 'en', 'es', 'fr', 'it', 'nl', 'pt'],
         dest='lang',
         required=True
     )
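A side effect of expressing the supported languages as an argparse `choices` list is that unsupported codes are rejected by the parser itself, before any spaCy code runs. A minimal reproduction of just the `-l` option (the surrounding `parse_arguments()` function is not reproduced here):

```python
import argparse

# Minimal reproduction of the '-l' option from the hunk above.
parser = argparse.ArgumentParser()
parser.add_argument(
    '-l',
    choices=['de', 'el', 'en', 'es', 'fr', 'it', 'nl', 'pt'],
    dest='lang',
    required=True
)

# A supported code parses; an unsupported one makes argparse exit with an error.
print(parser.parse_args(['-l', 'nl']).lang)
```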


@@ -15,7 +15,7 @@ parser.add_argument(
 )
 parser.add_argument(
     '-l',
-    choices=['de', 'en', 'es', 'fr', 'pt'],
+    choices=['de', 'el', 'en', 'es', 'fr', 'it', 'nl', 'pt'],
     dest='lang',
     required=True
 )
@@ -26,8 +26,9 @@ parser.add_argument(
 args = parser.parse_args()
 SPACY_MODELS = {
-    'de': 'de_core_news_sm', 'en': 'en_core_web_sm', 'es': 'es_core_news_sm',
-    'fr': 'fr_core_news_sm', 'pt': 'pt_core_news_sm'
+    'de': 'de_core_news_sm', 'el': 'el_core_news_sm', 'en': 'en_core_web_sm',
+    'es': 'es_core_news_sm', 'fr': 'fr_core_news_sm', 'it': 'it_core_news_sm',
+    'nl': 'nl_core_news_sm', 'pt': 'pt_core_news_sm'
 }
 # Set the language model for spacy
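The updated mapping can be exercised on its own, since the lookup from language code to model package needs no spaCy at all. The `model_for` helper below is illustrative, not part of the script; the hunk's trailing comment ("Set the language model for spacy") suggests the script follows the lookup with something like `spacy.load(SPACY_MODELS[args.lang])`:

```python
# Language-code → model-package mapping from the diff above.
SPACY_MODELS = {
    'de': 'de_core_news_sm', 'el': 'el_core_news_sm', 'en': 'en_core_web_sm',
    'es': 'es_core_news_sm', 'fr': 'fr_core_news_sm', 'it': 'it_core_news_sm',
    'nl': 'nl_core_news_sm', 'pt': 'pt_core_news_sm'
}

def model_for(lang):
    """Illustrative helper: return the spaCy model package for a language code."""
    return SPACY_MODELS[lang]

print(model_for('it'))
```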