Added some documentation.

Stephan Porada 2019-03-01 21:00:46 +01:00
parent 27aa61d91a
commit 8acc913c45


@@ -1,8 +1,12 @@
# What is this?
This Django web app is part of a master's thesis. The app displays the session protocols of the German Bundestag from 1949 to 2017.
Besides that, the app provides an Ngram Viewer that displays word frequencies over time for all those protocols. The Ngram Viewer and its functionality are similar to the [Google Ngram Viewer](https://books.google.com/ngrams).
The n-gram data and the protocols have been created using this software (also part of the same master's thesis): https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software
The actual data can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_data
@@ -10,8 +14,8 @@
## System requirements
* docker 18.09.1-ce
* docker-compose 1.23.2
* Unix-like OS
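
To verify the installed versions against the requirements above, the standard version flags of both tools can be used (the listed versions are the ones this project names; newer releases may work as well):

```bash
# Print the installed versions to compare against the requirements above
docker --version
docker-compose --version
```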
## Install requirements
@@ -40,11 +44,11 @@
13. First we have to import the speaker data. This is done by executing the following command in the second terminal: `docker-compose run web python manage.py import_speakers /usr/src/app/input_data/MdB_data/MdB_Stammdaten.xml`.
14. After that we can import all the protocols and thus all speeches for every person. The command to do that is `docker-compose run web python manage.py import_protocols /usr/src/app/input_data/outputs/markup/full_periods`. (Importing all protocols takes up to 2 days. For testing purposes *dev\_data/beautiful\_xml* or *test\_data/beautiful\_xml* can be used.)
15. Now the n-grams can be imported by using `docker-compose run web python manage.py import_ngrams_bulk 1 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams lm_ns_year`. This command imports the alphabetically split n-grams into their corresponding tables. The first parameter of this command is *1*, which tells the function to import the n-grams from the input path as 1-grams. Accordingly, the second parameter is the input path */usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams* where the 1-grams are located; the last part of the input path clearly identifies the n-grams as 1-grams. Finally, the third parameter identifies what kind of n-grams are being imported. In this case it is set to *lm_ns_year*, which means the n-grams are based on lemmatized text without stopwords, counted by year. An example to import 2-grams would look like this: `docker-compose run web python manage.py import_ngrams_bulk 2 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/2_grams lm_ns_year`. To import 3-grams from a different corpus, the command should look like this: `docker-compose run web python manage.py import_ngrams_bulk 3 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/tk_ws_speaker_\(1-3\)/3_grams tk_ws_speaker`. Be careful when importing the n-grams: if the parameters are set wrong, the n-grams will be imported into the wrong tables, leading to incorrect findings in the Ngram Viewer. It is possible to import different n-gram sets at the same time using multiple commands in multiple terminals; just keep an eye on the CPU and RAM usage. There is also an optional fourth parameter to set the batch size of one insert. The default is to read 1 million rows from the CSV and insert them into the database at once; the parameter `-bs 10000000` would set it to 10 million. Increasing that value also increases the RAM usage, so be careful with that.
16. Repeat the step above for every kind of n-gram data you want to import. Importing 1-grams will only take some minutes, while importing 5-grams will take several hours. (For testing purposes the n-grams from *dev\_data* can be used.) A consolidated sketch of the import commands from steps 13 to 16 follows after this list.
17. After importing the n-grams the web app is all set up.
18. The app can be shut down with `docker-compose down`. All imported data is saved persistently in the database volume *postgres_data*.
19. To restart the app use `docker-compose up`, or `docker-compose up -d` to start it detached.
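
For reference, here is a consolidated sketch of the import sequence from steps 13 to 16. All paths and parameters are taken from the examples above; adjust them to the n-gram sets you actually want to import:

```bash
# Step 13: import the speaker data
docker-compose run web python manage.py import_speakers \
    /usr/src/app/input_data/MdB_data/MdB_Stammdaten.xml

# Step 14: import the protocols (up to 2 days for the full set)
docker-compose run web python manage.py import_protocols \
    /usr/src/app/input_data/outputs/markup/full_periods

# Step 15: import 1-grams (lemmatized, no stopwords, counted by year)
docker-compose run web python manage.py import_ngrams_bulk \
    1 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams lm_ns_year

# Step 16: repeat for other n-gram sizes and corpora, optionally with a
# larger insert batch size (default: 1 million rows per insert)
docker-compose run web python manage.py import_ngrams_bulk \
    2 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/2_grams lm_ns_year -bs 10000000
```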
# Live version
A live version of the app is running at http://129.70.12.88:8000/ inside the Bielefeld University network. You have to access the university network via VPN (https://www.ub.uni-bielefeld.de/search/vpn/) to be able to use the live version.