What is this?
This Django web app is part of a master's thesis. Said thesis can be read here.
The app displays the session protocols of the German Bundestag from 1949 to 2017. In addition, the app provides an Ngram Viewer that displays word frequencies over time for all of those protocols. The Ngram Viewer and its functionality are similar to the Google Ngram Viewer.
The n-gram data and the protocols have been created using this software (also part of the same master's thesis): https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software
The actual data can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_data
Installation
System requirements
- docker 18.09.1-ce+
- docker-compose 1.23.2+
- unix-like OS
Install requirements
- First install docker for your OS according to this guide: https://docs.docker.com/install/
- After that install docker-compose for your system according to this guide: https://docs.docker.com/compose/install/
- Clone this repository to a location of your choice with
git clone https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_web_app.git
Build the app/start the app with docker
- Move into the app folder with
cd path/to/bundesdata_web_app
- Copy the .env.tpl file and rename the copy to .env.
- Set a secret key for Django and set the database parameters as shown in the .env.tpl file.
- Start the web app with
docker-compose up
The first run takes some time because all the needed images (postgres, python and nginx) have to be downloaded. In addition, all the packages defined in requirements.txt are installed in the corresponding image. This terminal stays open for now and shows the log messages of the three running containers.
- Visit 127.0.0.1:8000 to verify that the web application is up and running.
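One way to produce a value for the Django secret key mentioned above is Python's standard secrets module. This is only a minimal sketch and not part of the project's own tooling: the function name generate_secret_key, the alphabet, and the length of 50 characters are assumptions chosen for illustration. Paste the printed value into your .env file under the key name given in .env.tpl.

```python
import secrets
import string

# Characters to draw from (an assumption for this sketch). Quotes, '$', '#'
# and '=' are left out so the value can be pasted into a .env file as-is.
ALPHABET = string.ascii_letters + string.digits + "-_"


def generate_secret_key(length: int = 50) -> str:
    """Return a random string drawn with a cryptographically secure generator."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    # Print one key; copy it into the .env file by hand.
    print(generate_secret_key())
```

The same helper could be reused to create a new database password before hosting the app publicly.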
Import the data into the database
- Now the data for the ngrams, speeches, and speakers has to be imported into the database of the app.
- Change the owner of all files in the repository. (This step should only be necessary on Linux systems.)
- This is necessary because the processes inside the Docker containers run with root privileges. The files in the created volumes are therefore owned by root and are no longer accessible to your user.
- Change the owner with
sudo chown -R $USER:$USER .
- Download the folders MdB_data and outputs from the link mentioned in this repository.
- Copy those into the folder input_volume which is located inside the web app repository on the root level.
- If the downloaded folders are inside an archive extract the folders first.
- The folder input_volume is a volume which is mounted into the web app container. The container is able to read all data inside that volume. Note that inside the container the volume is accessed with the path /usr/src/app/input_data, not /usr/src/app/input_volume.
- First we have to import the speaker data.
- Interactively access the running web application container with
docker exec -it bundesdata_web_app_web_1 bash
- Import the speaker data now in the container bash prompt with
python manage.py import_speakers input_data/MdB_data/MdB_Stammdaten.xml
- After that we can import all the protocols and thus all speeches for every person.
- Import the protocols now in the container bash prompt with
python manage.py import_protocols input_data/outputs/markup/full_periods
(Importing all protocols takes up to 2 days. For testing purposes dev_data/beautiful_xml or test_data/beautiful_xml can be used.)
- Now the n-grams can be imported.
- Import the n-grams now in the container bash prompt with
python manage.py import_ngrams_bulk 1 input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams lm_ns_year
- This command imports the alphabetically split n-grams into their corresponding tables.
- The first parameter of this command is 1. This tells the function to import the n-grams from the input path as 1-grams.
- The second parameter is the input path input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams where the 1-grams are located. The last part of the input path clearly identifies the n-grams as 1-grams.
- Finally, the third parameter identifies what kind of n-grams are being imported. In this case the parameter is set to lm_ns_year, which means the n-grams are based on lemmatized text without stopwords, counted by year.
- An example to import 2-grams looks like this:
python manage.py import_ngrams_bulk 2 input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/2_grams lm_ns_year
- To import 3-grams from a different corpus, the command looks for example like this:
python manage.py import_ngrams_bulk 3 input_data/outputs/nlp/full_periods/n-grams/tk_ws_speaker_\(1-3\)/3_grams tk_ws_speaker
- Be careful when importing the n-grams. If the parameters are set incorrectly, the n-grams are imported into the wrong tables, which leads to incorrect results in the Ngram Viewer.
- If something went wrong, you can reset the database with
docker-compose run web python manage.py flush
and start the data import again.
- It is possible to import different n-gram sets at the same time using multiple commands in multiple terminals. Just keep an eye on the CPU and RAM usage.
- There is also an optional fourth parameter that sets the batch size of one insert. By default, 1 million rows are read from the CSV file and inserted into the database at once. The parameter
-bs 10000000
sets it to 10 million. Increasing this value also increases the RAM usage, so be careful with it.
- Repeat the step above for every kind of n-gram data you want to import. Importing 1-grams only takes a few minutes, while importing 5-grams takes several hours. (For testing purposes the n-grams from dev_data can be used.)
- After importing the n-grams the web app is all set up.
- The app can be shut down with
docker-compose down
All imported data is saved persistently in the database volume postgres_data.
- To restart the app use
docker-compose up
or
docker-compose up -d
to start it detached.
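The import_ngrams_bulk calls described in the import steps all follow one pattern, so the commands for a corpus can be generated instead of typed by hand. The following sketch only builds and prints the command strings; it does not run them. The helper name build_ngram_import_commands and its parameters are hypothetical. It also assumes that each corpus folder contains subfolders named 1_grams, 2_grams and so on, and that the -bs option may be appended at the end of the command; both assumptions are inferred from the examples in this README.

```python
import shlex
from typing import Iterable, List, Optional

# Base folder of the n-gram data inside the container, as used in this README.
NGRAM_ROOT = "input_data/outputs/nlp/full_periods/n-grams"


def build_ngram_import_commands(
    corpus_dir: str,
    kind: str,
    sizes: Iterable[int] = range(1, 6),
    batch_size: Optional[int] = None,
) -> List[str]:
    """Build one import_ngrams_bulk command per n-gram size (hypothetical helper).

    corpus_dir is the folder name below NGRAM_ROOT, kind is the third
    parameter of the command. The two can differ, for example
    "tk_ws_speaker_(1-3)" versus "tk_ws_speaker".
    """
    commands = []
    for n in sizes:
        path = f"{NGRAM_ROOT}/{corpus_dir}/{n}_grams"
        parts = ["python", "manage.py", "import_ngrams_bulk", str(n), path, kind]
        if batch_size is not None:
            parts += ["-bs", str(batch_size)]
        # shlex.quote protects characters such as parentheses from the shell.
        commands.append(" ".join(shlex.quote(part) for part in parts))
    return commands


if __name__ == "__main__":
    for command in build_ngram_import_commands("lm_ns_year", "lm_ns_year"):
        print(command)
```

Printing the commands first makes it easier to check that the n-gram size, the input path, and the n-gram kind match before anything is written to the database. Paths containing parentheses are wrapped in single quotes, which is equivalent to the backslash escaping used in the 3-gram example.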
Security settings for hosting your own public version
Before hosting your own version of this web application publicly, make sure the PostgreSQL username, password etc. in your .env file have been set to new and secret values. Keep in mind that the current version of this web application is not HTTPS ready on its own. To host this web application with HTTPS, check out traefik.
Used packages and software
- django: https://github.com/django/django
- Django Software Foundation
- psycopg2: https://github.com/psycopg/psycopg2
- Di Gregorio, F.
- gunicorn: https://github.com/benoitc/gunicorn
- Chesneau, B.
- lxml: https://github.com/lxml/lxml
- Behnel, S.; Faassen, M.; Bicking, I.; Joukl, H.; Sapin, S.; Parent, M.-A.; Grisel, O.; Buchcik, K.; Wagner, F.; Kroymann, E.; Everitt, P.; Ng, V.; Kern, R.; Pakulat, A.; Sankel, D.; Kasperski, M.; da Silva, S. & Oberndörfer, P.
- tqdm: https://github.com/tqdm/tqdm
- Yorav-Raphael, N.
- django-watson: https://github.com/etianen/django-watson
- Hall, D.
- django-tables2: https://github.com/jieter/django-tables2
- Ayers, B.
- django-jchart: https://github.com/matthisk/django-jchart
- Heimensen, M.
- nginx: https://hub.docker.com/_/nginx
- NGINX, Inc.
- PostgreSQL: https://hub.docker.com/_/postgres/
- The PostgreSQL Global Development Group