
# What is this?
This software automatically adds markup to the official protocols of the Bundestag.
The Bundestag has published the protocols of every session from 1949 to 2017 in XML.
Unfortunately, the markup of these files is very rudimentary. It is not possible to see
which member of parliament gave which speech, etc.

This software can mark up every protocol from 1949 to 2017 automatically. It
identifies speakers, their speeches, metadata, etc. For detailed information on
why this software was made and how it works, read the corresponding master's thesis
uploaded [here](https://gitea.sporada.eu/sporada/bundesdata_web_app/src/branch/master/2019-02-04_Stephan_Porada_Masterthesis_semi.pdf) (note that it is written in German).

Besides the markup, the software can also calculate n-grams for all automatically
marked protocols, either from lemmatized or from tokenized text, with or without
stop words.
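
As a rough illustration of what "n-grams with or without stop words" means, here is a minimal sketch. It is not the software's own implementation; the function name `count_ngrams` and the tiny stop word list are assumptions made up for this example.

```python
from collections import Counter

# Hypothetical stop word list for illustration only;
# the list used by the software may differ.
STOP_WORDS = {"der", "die", "das", "und"}


def count_ngrams(tokens, n, remove_stop_words=False):
    """Count all n-grams of length n in a list of tokens."""
    if remove_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    # Slide a window of size n over the token list.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


tokens = "die Regierung und die Opposition".split()
bigrams = count_ngrams(tokens, 2)
bigrams_ns = count_ngrams(tokens, 2, remove_stop_words=True)
```

With stop words removed, only the content words remain, so the bigram counts change accordingly.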


## Web app based on the protocols and ngrams

The protocols and ngrams are used for different functions of a Django web application.
The web application displays the protocols, speeches and corresponding speakers
for research purposes.

The web app also provides an Ngram Viewer based on the produced ngram data. It
displays ngram frequencies for all protocols from 1949 to 2017 and
is similar to the [Google Ngram Viewer](https://books.google.com/ngrams).

The source code of the web application can be found here: https://gitea.sporada.eu/sporada/bundesdata_web_app.
A live version of the web application can be visited via the link: https://bundesdata.sporada.eu/.

## Input and Output data
The input and output data of this software can be found here: https://gitea.sporada.eu/sporada/bundesdata_markup_nlp_data.
You can find all automatically marked protocols and ngrams there. The
official protocols used as input data are included as well.

# Installation and usage

## Requirements
- Python 3.7.1+
- python3.7-dev (Python development headers)
- js-beautify (optional if the corresponding step is skipped)
- virtualenv
- Unix-like OS

## Installation

0. Install the requirements mentioned above.
    - Install _js-beautify_ following one of the steps mentioned here: https://github.com/beautify-web/js-beautify#installation. Installing and using _js-beautify_ is optional.
    - How to skip the steps that use _js-beautify_ is described in the usage section below.
1. Clone this repository with `git clone https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software.git`.
2. Create a virtual environment for the software with `virtualenv --python=python3.7 path/to/folder/of/your/choice`.
3. Activate the virtual environment with `source path/to/folder/bin/activate`.
4. Navigate into the cloned repository with `cd path/to/repository`.
5. Install all requirements listed in _requirements.txt_ with `pip install -r requirements.txt`.
6. Move down into _bundesdata\_markup\_nlp_ with `cd bundesdata_markup_nlp`.
7. Execute `./bundesdata_markup.py -h` or `python bundesdata_markup.py -h` to verify the installation.
8. If the help text shows up, you are ready to go.
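
Before running the steps above, the requirements can be checked with a small script. The following is only a sketch and not part of the software: the function name `check_environment` is made up, and it assumes that the _js-beautify_ executable is available on the `PATH` under the name `js-beautify`.

```python
import shutil
import sys

MIN_PYTHON = (3, 7, 1)


def check_environment(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all is fine."""
    problems = []
    if tuple(version_info[:3]) < MIN_PYTHON:
        problems.append("Python 3.7.1 or newer is required")
    # Assumption: the optional tool is installed as an executable named "js-beautify".
    if which("js-beautify") is None:
        problems.append("js-beautify not found (optional, use -sb to skip beautification)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

The two parameters only exist so that the check can be exercised without changing the real environment.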

## Usage

### Markup process

1. Download some protocols to use them as input for the markup process.
    - You can either download some files from https://gitea.sporada.eu/sporada/bundesdata_markup_nlp_data, including the _development\_data\_xml_ data set found in _inputs_.
    - Or download the protocols directly from https://www.bundestag.de/services/opendata.
    - Only protocols from the 1st to the 18th period can be used as input.
2. Place the protocols you want to mark in one directory. The directory can contain one level of subdirectories, for example for protocols of different periods. This tutorial will continue using the folder _development\_data\_xml_.
3. Start the markup process by executing the following command: `./bundesdata_markup.py -sp /path/to/development_data_xml /path/to/some/folder/for/the/output`.
4. After completion, the marked protocols can be found in the folder _beautiful\_xml_ inside the specified output folder.
    - To skip the step that uses _js-beautify_, execute the command `./bundesdata_markup.py -sp /path/to/development_data_xml /path/to/some/folder/for/the/output -kt -sb`.
    - The non-beautified protocols can be found in _clear\_speech\_markup_. Note that all other temporary folders containing the intermediate protocols are also kept in the output folder because of the `-kt` parameter.
5. The marked protocols can now be used as input to calculate different n-grams.
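
The input layout described above (one directory with at most one level of subdirectories) can be sketched as follows. This is an illustration only and not the software's own file discovery; the function name `collect_protocols` and the assumption that protocol files end in `.xml` are made up for this example.

```python
from pathlib import Path


def collect_protocols(input_dir):
    """Collect XML files directly in input_dir and one level below it."""
    root = Path(input_dir)
    # Assumption: protocol files carry the extension ".xml".
    files = list(root.glob("*.xml")) + list(root.glob("*/*.xml"))
    return sorted(files)
```

Files nested deeper than one subdirectory are deliberately ignored, mirroring the restriction stated in step 2.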

### N-grams
Before calculating the n-grams, the protocols have to be either lemmatized or tokenized.

#### Lemmatize

1. To lemmatize the protocols, execute `./bundesdata_nlp.py -lm -ns -sp /path/to/output/beautiful_xml /path/to/some/folder/for/the/output`.
    - Or execute `./bundesdata_nlp.py -lm -ns -sp /path/to/clear_speech_markup /path/to/some/folder/for/the/output` if you want to use non-beautified files.
    - Note that the parameter `-ns` removes stop words from the lemmatized text. To include stop words, omit the parameter.
2. The lemmatized protocols can be found in _nlp\_output/nlp\_beuatiful\_xml_. These protocols are also beautified using _js-beautify_.
3. If you want to skip the beautification, add the parameter `-sb`. Non-beautified protocols are found in _nlp\_output/lemmatized_.

#### Tokenize

1. To tokenize the protocols, execute `./bundesdata_nlp.py -tn -ns -sp /path/to/output/beautiful_xml /path/to/some/folder/for/the/output`.
    - Or execute `./bundesdata_nlp.py -tn -ns -sp /path/to/clear_speech_markup /path/to/some/folder/for/the/output` if you want to use non-beautified files.
    - Note that the parameter `-ns` removes stop words from the tokenized text. To include stop words, omit the parameter.
2. The tokenized protocols can be found in _nlp\_output/nlp\_beuatiful\_xml_. These protocols are also beautified using _js-beautify_.
3. If you want to skip the beautification, add the parameter `-sb`. Non-beautified protocols are found in _nlp\_output/lemmatized_.
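
The lemmatize and tokenize calls differ only in the mode flag (`-lm` or `-tn`), so they can be assembled by a small helper when scripting the pipeline. This is a sketch under assumptions: the helper `build_nlp_command` is hypothetical, and placing `-sb` after the two paths mirrors the markup example above but has not been verified for _bundesdata\_nlp.py_.

```python
MODE_FLAGS = {"lemmatize": "-lm", "tokenize": "-tn"}


def build_nlp_command(mode, input_dir, output_dir,
                      remove_stop_words=True, skip_beautify=False):
    """Build the argument list for a bundesdata_nlp.py call."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    command = ["./bundesdata_nlp.py", MODE_FLAGS[mode]]
    if remove_stop_words:
        command.append("-ns")  # remove stop words
    command += ["-sp", input_dir, output_dir]
    if skip_beautify:
        command.append("-sb")  # assumed position, see note above
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from within the _bundesdata\_markup\_nlp_ folder.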

#### Calculating the n-grams
1. Now the lemmatized or tokenized protocols (either with or without stop words) can be used as input for the n-gram calculation.
    - The following steps will be explained using the beautified protocols from _nlp\_beuatiful\_xml_.
2. To calculate the n-grams for the lemmatized protocols without stop words per year, use the command `./bundesdata_nlp.py -cn year lm_ns_year -sp /path/to/nlp_output/nlp_beuatiful_xml/ /path/to/some/folder/for/the/output/`.
3. After that, move a copy of _bundesdata\_markup\_nlp/utility/move\_ngrams.py_ into the folder _nlp\_output/n-grams_ and execute it with `python move_ngrams.py`.
4. The n-grams are now ready to be imported into the database of the Django web app.
    - (The source code for the app and a tutorial for importing the ngrams can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_web_app)
5. If you want to calculate n-grams from tokenized protocols without stop words per year, use this command: `./bundesdata_nlp.py -cn year tk_ns_year -sp /path/to/nlp_output/nlp_beuatiful_xml/ /path/to/some/folder/for/the/output/`.
6. If you want to calculate n-grams from tokenized protocols with stop words per speaker, use this command: `./bundesdata_nlp.py -cn speaker tk_ws_speaker -sp /path/to/nlp_output/nlp_beuatiful_xml/ /path/to/some/folder/for/the/output/`.
7. The parameter `-cn` is always followed by two arguments (example: `-cn year lm_ns_year`).
    - The first argument specifies how the n-grams are counted. It can be set to "year", "mont_year", "speaker" or "speech".
    - N-grams will then be counted by year, speaker and so on.
    - The second argument is a user-specified string that identifies the kind of protocols the n-grams have been calculated from.
    - The string "lm_ns_year", for example, describes that the input protocols have been lemmatized ("lm") and contain no stop words ("ns"). The last part ("year") specifies that the n-grams have been counted by year.
    - The string "tk_ws_speaker" means that the n-grams are calculated using tokenized ("tk") protocols with stop words ("ws") and are counted per speaker ("speaker").
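
The naming convention for the second `-cn` argument can be illustrated with a short helper that splits such a label into its three parts. The software treats this argument as a free-form string, so the function `describe_label` below is purely a hypothetical convenience for labels that follow the convention described above.

```python
PREPROCESSING = {"lm": "lemmatized", "tk": "tokenized"}
STOP_WORD_MODES = {"ns": "no stop words", "ws": "with stop words"}


def describe_label(label):
    """Split a label such as "lm_ns_year" into its three parts."""
    # Split only twice so that a counting unit containing "_" stays intact.
    preprocessing, stop_words, counted_by = label.split("_", 2)
    return {
        "preprocessing": PREPROCESSING[preprocessing],
        "stop_words": STOP_WORD_MODES[stop_words],
        "counted_by": counted_by,
    }


info = describe_label("tk_ws_speaker")
```

Labels that do not follow the convention raise a `KeyError` or `ValueError`, which is acceptable for a sketch but would need friendlier handling in real use.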

# Used packages and software
- js-beautify: https://github.com/beautify-web/js-beautify
    - Lielmanis, E.; Newman, L.; Stockman, D. & Sanfilippo, S.
- lxml: https://github.com/lxml/lxml
    - Behnel, S.; Faassen, M.; Bicking, I.; Joukl, H.; Sapin, S.; Parent, M.-A.; Grisel, O.; Buchcik, K.; Wagner, F.; Kroymann, E.; Everitt, P.; Ng, V.; Kern, R.; Pakulat, A.; Sankel, D.; Kasperski, M.; da Silva, S. & Oberndörfer, P.
- Babel: https://github.com/python-babel/babel
    - Ronacher, A.
- tqdm: https://github.com/tqdm/tqdm
    - Yorav-Raphael, N.
- spaCy: https://github.com/explosion/spaCy
    - Explosion AI
- scikit-learn: https://github.com/scikit-learn/scikit-learn
    - Mueller, A.