NLP - Natural Language Processing
This software implements a heavily parallelized pipeline for Natural Language Processing of text files. It is used for nopaque's NLP service, but you can also use it standalone; a convenient wrapper script is provided for that purpose. The pipeline is designed to run on Linux operating systems, but with some tweaks it should also run on Windows with WSL installed.
Software used in this pipeline implementation
- Official Debian Docker image (buster-slim): https://hub.docker.com/_/debian
- Software from Debian Buster's free repositories
- pyFlow (1.1.20): https://github.com/Illumina/pyflow/releases/tag/v1.1.20
- spaCy (3.2.1): https://github.com/explosion/spaCy/releases/tag/v3.2.1
- spaCy medium-sized models (3.2.0):
  - https://github.com/explosion/spacy-models/releases/tag/de_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/en_core_web_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/it_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/nl_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/pl_core_news_md-3.2.0
  - https://github.com/explosion/spacy-models/releases/tag/zh_core_web_md-3.2.0
Installation
- Install Docker and Python 3.
- Clone this repository: `git clone https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nlp.git`
- Build the Docker image: `docker build -t gitlab.ub.uni-bielefeld.de:4567/sfb1288inf/nlp:v0.1.0 nlp`
- Add the wrapper script (`wrapper/nlp` relative to this README file) to your `${PATH}`.
- Create working directories for the pipeline: `mkdir -p /<my_data_location>/{input,output}`.
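Before the first run it can help to confirm that the installation steps took effect. The following is a minimal sketch, not part of this repository: the helper names `missing_tools` and `create_working_dirs` are hypothetical, and it only assumes that `docker` and the `nlp` wrapper are expected on your `PATH` as described above.

```python
import shutil
from pathlib import Path


def missing_tools(tools=("docker", "nlp")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def create_working_dirs(data_location):
    """Create <data_location>/input and <data_location>/output if needed."""
    base = Path(data_location)
    created = []
    for name in ("input", "output"):
        path = base / name
        # parents=True and exist_ok=True mirror the behaviour of `mkdir -p`.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Only reports; it does not create or modify anything.
    print("Missing from PATH:", missing_tools() or "nothing")
```

Calling `create_working_dirs("/<my_data_location>")` with your own data location has the same effect as the `mkdir -p` command from the last installation step.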
Use the Pipeline
- Place your plain text files inside `/<my_data_location>/input`. Files should all contain text of the same language.
- Clear your `/<my_data_location>/output` directory.
- Start the pipeline process. Check the pipeline help (`nlp --help`) for more details.
cd /<my_data_location>
nlp \
--input-dir input \
--output-dir output \
-m <model_code> <optional_pipeline_arguments>
- Check your results in the `/<my_data_location>/output` directory.
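The steps above can also be scripted. The sketch below is an illustration under stated assumptions, not code from this repository: `clear_directory`, `build_command` and `run_pipeline` are hypothetical helper names, and only the wrapper name `nlp` and the options `--input-dir`, `--output-dir` and `-m` are taken from the command shown above. Valid model codes and optional arguments are not guessed here; take them from `nlp --help`.

```python
import shutil
import subprocess
from pathlib import Path


def clear_directory(directory):
    """Delete everything inside `directory` but keep the directory itself."""
    for entry in Path(directory).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def build_command(model_code, extra_args=()):
    """Assemble the wrapper call from this README as an argument list."""
    return ["nlp", "--input-dir", "input", "--output-dir", "output",
            "-m", model_code, *extra_args]


def run_pipeline(data_location, model_code, extra_args=()):
    """Clear output/, then run the wrapper from inside `data_location`."""
    base = Path(data_location)
    if not any((base / "input").iterdir()):
        raise FileNotFoundError("no input files found in " + str(base / "input"))
    clear_directory(base / "output")
    # check=True raises CalledProcessError if the wrapper exits non-zero.
    return subprocess.run(build_command(model_code, extra_args),
                          cwd=base, check=True)
```

Passing the command as a list rather than a single shell string avoids quoting problems when the data location contains spaces.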