# NLP - Natural Language Processing

This software implements a heavily parallelized pipeline for Natural Language Processing of text files. It is used for nopaque's NLP service, but you can also use it standalone; a convenient wrapper script is provided for that purpose. The pipeline is designed to run on Linux operating systems, but with some tweaks it should also run on Windows with WSL installed.

## Software used in this pipeline implementation

- Official Debian Docker image (buster-slim): https://hub.docker.com/_/debian
  - Software from Debian Buster's free repositories
- pyFlow (1.1.20): https://github.com/Illumina/pyflow/releases/tag/v1.1.20
- spaCy (3.2.1): https://github.com/explosion/spaCy/releases/tag/v3.2.1
- spaCy medium-sized models (3.2.0):
  - https://github.com/explosion/spacy-models/releases/tag/de_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/en_core_web_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/it_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/nl_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/pl_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/zh_core_web_md-3.2.0

## Installation

1. Install Docker and Python 3.
2. Clone this repository: `git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp.git`
3. Build the Docker image: `docker build -t gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp:v0.1.0 nlp`
4. Add the wrapper script (`wrapper/nlp` relative to this README file) to your `${PATH}`.
5. Create working directories for the pipeline: `mkdir -p //{input,output}`.

## Use the Pipeline

1. Place your plain text files inside `//input`. All files should contain text in the same language.
2. Clear your `//output` directory.
3. Start the pipeline process. Check the pipeline help (`nlp --help`) for more details.

```bash
cd /
nlp \
  --input-dir input \
  --output-dir output \
  -m
```

4. Check your results in the `//output` directory.
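
The preparation steps before starting the pipeline (creating the working directories, making sure there is input, clearing old output) can be scripted. The following is a minimal sketch and not part of this repository: the function name `prepare_workdir` and its single argument (the data directory that contains `input` and `output`) are hypothetical, and it assumes GNU `find` as shipped with common Linux distributions.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not shipped with the wrapper script.
# Usage: prepare_workdir <data_dir>
prepare_workdir() {
    local data_dir="$1"

    # Create the input and output directories if they do not exist yet.
    mkdir -p "${data_dir}/input" "${data_dir}/output"

    # Refuse to continue when there is nothing to process.
    if [ -z "$(ls -A "${data_dir}/input")" ]; then
        echo "No input files found in ${data_dir}/input" >&2
        return 1
    fi

    # Remove results left over from earlier runs, keeping the directory itself.
    find "${data_dir}/output" -mindepth 1 -delete
}
```

After sourcing this function, call it with your own data directory and, if it succeeds, start the pipeline as described in step 3.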