Update README.md

Stephan Porada 2020-07-29 09:47:19 +02:00
parent d8c2fdc15f
commit 180616da46

@@ -34,11 +34,7 @@ The actual data can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bu
## Import the data into the database
1. Before importing the data we have to set up the tables in the PostgreSQL database.
- Do this with `docker-compose run web python manage.py makemigrations`
- followed by `docker-compose run web python manage.py migrate`.
11. Now the data for the ngrams, speeches, and speakers has to be imported into the database of the app.
12. Shut down the app with the command `docker-compose down`.
13. Change the ownership of all files in the repository. (This step should only be necessary on Linux systems.)
- This has to be done because every process inside a Docker container runs with root privileges by default. Thus the created volumes are not accessible anymore.
- Change the rights with `sudo chown -R $USER:$USER .`.
@@ -46,18 +42,19 @@ The actual data can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bu
- Copy those into the folder *input_volume* which is located inside the web app repository on the root level.
- If the downloaded folders are inside an archive extract the folders first.
- The folder *input_volume* is a volume which is mounted into the web app container. The container is able to read all data inside that volume. Note that the volume is accessed with the path */usr/src/app/input_data*, not */usr/src/app/input_volume*.
13. Restart the app with `docker-compose up`
13. First we have to import the speaker data.
- This will be done by executing the following command `docker-compose run web python manage.py import_speakers /usr/src/app/input_data/MdB_data/MdB_Stammdaten.xml` in the second terminal.
- Interactively access the running web application container with `docker exec -it bundesdata_web_app_web_1 bash`.
- Import the speaker data now in the container bash prompt with `python manage.py import_speakers input_data/MdB_data/MdB_Stammdaten.xml`
14. After that we can import all the protocols and thus all speeches for every person.
- The command to do that is `docker-compose run web python manage.py import_protocols /usr/src/app/input_data/outputs/markup/full_periods` (Importing all protocols takes up to 2 days. For testing purposes *dev\_data/beautiful\_xml* or *test\_data/beautiful\_xml* can be used.)
15. Now the n-grams can be imported by using `docker-compose run web python manage.py import_ngrams_bulk 1 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams lm_ns_year`.
- Import the protocols now in the container bash prompt with `python manage.py import_protocols input_data/outputs/markup/full_periods` (Importing all protocols takes up to 2 days. For testing purposes *dev\_data/beautiful\_xml* or *test\_data/beautiful\_xml* can be used.)
15. Now the n-grams can be imported.
- Import the n-grams now in the container bash prompt with `python manage.py import_ngrams_bulk 1 input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams lm_ns_year`.
- This command imports the alphabetically split n-grams into their corresponding tables.
- The first parameter of this command is *1*. This tells the function to import the n-grams from the input path as 1-grams.
- Therefore the second parameter is the input path */usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams* where the 1-grams are located. The last part of the input path clearly identifies the n-grams as 1-grams.
- Therefore the second parameter is the input path *input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams* where the 1-grams are located. The last part of the input path clearly identifies the n-grams as 1-grams.
- Finally, the third parameter identifies what kind of n-grams are being imported. In this case the parameter is set to *lm_ns_year*, which means the n-grams are based on lemmatized text without stopwords, counted by year.
- An example to import 2-grams would look like this `docker-compose run web python manage.py import_ngrams_bulk 2 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/2_grams lm_ns_year`.
- To import 3-grams from a different corpus the command for example should look like this: `docker-compose run web python manage.py import_ngrams_bulk 3 /usr/src/app/input_data/outputs/nlp/full_periods/n-grams/tk_ws_speaker_\(1-3\)/3_grams tk_ws_speaker`.
- An example to import 2-grams would look like this `python manage.py import_ngrams_bulk 2 input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/2_grams lm_ns_year`.
- To import 3-grams from a different corpus the command for example should look like this: `python manage.py import_ngrams_bulk 3 input_data/outputs/nlp/full_periods/n-grams/tk_ws_speaker_\(1-3\)/3_grams tk_ws_speaker`.
- Be careful when importing the n-grams. **If the parameters are set wrong, the n-grams will be imported into the wrong tables, which leads to incorrect findings in the Ngram Viewer.**
- If you did something wrong you can reset the database with `docker-compose run web python manage.py flush` and start the data import again.
- It is possible to import different n-gram sets at the same time using multiple commands in multiple terminals. Just keep an eye on the CPU and RAM usage.
@@ -65,10 +62,10 @@ The actual data can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bu
16. Repeat the step above for every kind of n-gram data you want to import. Importing 1-grams will only take a few minutes while importing 5-grams will take several hours. (For testing purposes the n-grams from *dev\_data* can be used.)
17. After importing the n-grams the web app is all set up.
18. The app can be shut down with `docker-compose down`. All imported data is saved persistently in the database volume *postgres_data*.
19. To restart the app use `docker-compose up` or `docker-compose -d` to start it detached.
19. To restart the app use `docker-compose up` or `docker-compose up -d` to start it detached.
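
The parameter rules for `import_ngrams_bulk` described in the steps above can be checked before an import is started. The following is a minimal sketch, not part of the repository: the helper name `build_ngram_import_command` is hypothetical, and the checks assume that the last folder of the input path is always named `<n>_grams` and that its parent folder starts with the n-gram kind, as in the example paths above. The helper only builds the argument list and does not run anything.

```python
import shlex
from pathlib import PurePosixPath


def build_ngram_import_command(n, input_path, ngram_type):
    """Build the argument list for import_ngrams_bulk after two sanity checks.

    Hypothetical helper: the folder naming rules are assumptions taken
    from the example paths in this README.
    """
    path = PurePosixPath(input_path)
    # The last path component should name the n-gram size, e.g. "1_grams".
    if path.name != f"{n}_grams":
        raise ValueError(f"path ends in {path.name!r}, expected '{n}_grams'")
    # The parent folder should start with the n-gram kind, e.g. "lm_ns_year".
    if not path.parent.name.startswith(ngram_type):
        raise ValueError(
            f"folder {path.parent.name!r} does not match kind {ngram_type!r}"
        )
    return ["python", "manage.py", "import_ngrams_bulk",
            str(n), str(path), ngram_type]


if __name__ == "__main__":
    cmd = build_ngram_import_command(
        2,
        "input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/2_grams",
        "lm_ns_year",
    )
    # shlex.join quotes arguments where the shell would need it.
    print(shlex.join(cmd))
```

A mismatch such as passing *2* together with a path ending in *1_grams* raises a `ValueError` instead of silently importing into the wrong tables.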
# Security settings for hosting your own public version
Before hosting your own version of this website publicly do not forget to change the PostgreSQL username, password etc. in *docker-compose.yml* and *app/bundesdata_app/settings.py*. Also change the secret key mentioned in *app/bundesdata_app/settings.py* to a new Django key that you will keep secret! Also keep in mind that the current version is not HTTPS ready.
Before hosting your own version of this web application publicly, make sure the PostgreSQL username, password etc. in your *.env* file have been set to new and secret values. Keep in mind that the current version of this web application is not HTTPS ready on its own. To host this web application with HTTPS, check out [traefik](https://docs.traefik.io/).
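
One way to obtain new secret values is Python's standard `secrets` module. This is a minimal sketch under assumptions: the function name `render_env` is hypothetical, and the keys `POSTGRES_USER` and `POSTGRES_PASSWORD` are the ones used by the official PostgreSQL Docker image, so the keys in this project's *.env* file may be named differently.

```python
import secrets


def render_env(username):
    """Return .env-style lines with a freshly generated random password.

    Assumption: the key names follow the official PostgreSQL Docker image;
    adapt them to the keys your own .env file actually uses.
    """
    # token_urlsafe(32) draws 32 random bytes from the OS and encodes them
    # as URL-safe text, which avoids characters that need escaping.
    password = secrets.token_urlsafe(32)
    return f"POSTGRES_USER={username}\nPOSTGRES_PASSWORD={password}\n"


if __name__ == "__main__":
    print(render_env("my_db_user"), end="")
```

Copy the printed values into your own *.env* file and keep that file out of version control.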
# Used packages and software
- django: https://github.com/django/django