0. Install the needed requirements mentioned above.
- Install _js-beautify_ following one of the steps mentioned here: https://github.com/beautify-web/js-beautify#installation. Installing and using _js-beautify_ is optional.
- How to skip the steps that use _js-beautify_ is described further below.
1. Download some protocols to use as input for the markup process.
- You can either download some files from https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_data including the _development\_data\_xml_ data set found in _inputs_.
- Or download the protocols directly from https://www.bundestag.de/services/opendata.
- Only protocols from the 1st to 18th period can be used as an input.
2. Place the protocols you want to mark up in one directory. The directory can contain one level of subdirectories, for example for protocols of different periods. This tutorial will continue using the folder _development\_data\_xml_.
3. Now you can start the markup process by executing the following command: `./bundesdata_markup.py -sp /path/to/development_data_xml /path/to/some/folder/for/the/output`.
4. After completion, the marked protocols can be found in the folder _beautiful\_xml_ inside the specified output folder.
- To skip the step that uses _js-beautify_, execute the command `./bundesdata_markup.py -sp /path/to/development_data_xml /path/to/some/folder/for/the/output -kt -sb`.
- The non-beautified protocols can then be found in _clear\_speech\_markup_. Note that all other tmp folders containing the intermediate protocols are also kept in the output folder because of the `-kt` parameter.
5. To lemmatize the protocols, execute `./bundesdata_nlp.py -lm -ns -sp /path/to/output/beautiful_xml /path/to/some/folder/for/the/output`.
- Or execute `./bundesdata_nlp.py -lm -ns -sp /path/to/clear_speech_markup /path/to/some/folder/for/the/output` if you want to use the non-beautified files.
- Note that the parameter `-ns` removes stop words from the lemmatized text. To include stop words, omit the parameter.
6. To tokenize the protocols, execute `./bundesdata_nlp.py -tn -ns -sp /path/to/output/beautiful_xml /path/to/some/folder/for/the/output`.
- Or execute `./bundesdata_nlp.py -tn -ns -sp /path/to/clear_speech_markup /path/to/some/folder/for/the/output` if you want to use the non-beautified files.
- Note that the parameter `-ns` removes stop words from the tokenized text. To include stop words, omit the parameter.
7. To calculate the n-grams per year for the lemmatized protocols without stop words, use the command `./bundesdata_nlp.py -cn year lm_ns_year -sp /path/to/nlp_output/nlp_beuatiful_xml/ /path/to/some/folder/for/the/output/`.
8. After that, move a copy of _bundesdata\_markup\_nlp/utility/move\_ngrams.py_ into the folder _nlp\_output/n-grams_ and execute it with `python move_ngrams.py`.
9. The n-grams are now ready to be imported into the database of the Django web app.
- (The source code for the app and a tutorial for importing the n-grams can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_web_app)
10. If you want to calculate n-grams per year from tokenized protocols without stop words, use this command: `./bundesdata_nlp.py -cn year tk_ns_year -sp /path/to/nlp_output/nlp_beuatiful_xml/ /path/to/some/folder/for/the/output/`.
11. If you want to calculate n-grams per speaker from tokenized protocols with stop words, use this command: `./bundesdata_nlp.py -cn speaker tk_ws_speaker -sp /path/to/nlp_output/nlp_beuatiful_xml/ /path/to/some/folder/for/the/output/`.
12. The parameter `-cn` is always followed by two arguments (Example: `-cn year lm_ns_year`).
- The first argument specifies how the n-grams are counted. It can be set to "year", "mont_year", "speaker" or "speech".
- N-grams will then be counted by year, by speaker and so on.
- The second argument is a user-specified string that identifies the kind of protocols from which the n-grams have been calculated.
- The string "lm_ns_year", for example, indicates that the input protocols have been lemmatized ("lm") and contain no stop words ("ns"). The last part ("year") specifies that the n-grams have been calculated by year.
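
The naming convention for the second `-cn` argument can be sketched in a few lines of Python. This is only an illustration and not part of the project: the helpers `ngram_label` and `ngram_command` are hypothetical, and the abbreviations ("lm", "tk", "ns", "ws") are taken from the example commands in this tutorial. The sketch only builds the argument list and does not run anything.

```python
import shlex

# Abbreviations as they appear in the example commands of this tutorial.
PROCESSING = {"lemmatized": "lm", "tokenized": "tk"}
STOP_WORDS = {False: "ns", True: "ws"}  # ns = no stop words, ws = with stop words


def ngram_label(processing, with_stop_words, count_by):
    """Compose a label such as 'lm_ns_year' (hypothetical helper)."""
    return "_".join([PROCESSING[processing], STOP_WORDS[with_stop_words], count_by])


def ngram_command(input_dir, output_dir, processing, with_stop_words, count_by):
    """Return the argument list for an n-gram run; nothing is executed here."""
    label = ngram_label(processing, with_stop_words, count_by)
    return ["./bundesdata_nlp.py", "-cn", count_by, label, "-sp", input_dir, output_dir]


if __name__ == "__main__":
    cmd = ngram_command(
        "/path/to/nlp_output/nlp_beuatiful_xml/",
        "/path/to/some/folder/for/the/output/",
        processing="tokenized",
        with_stop_words=True,
        count_by="speaker",
    )
    # Print the command as a single shell-quoted line for copy and paste.
    print(shlex.join(cmd))
```

Because the label is a free-form string, any consistent scheme works; deriving it from the same values that are passed to `-cn` simply keeps the label and the counting mode from drifting apart.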