From e51f3b8dc70894a60c72a3a7814e0d18a4f5552b Mon Sep 17 00:00:00 2001
From: Stephan Porada
Date: Wed, 29 Jul 2020 09:48:46 +0200
Subject: [PATCH] Update README.md

---
 README.md | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index e426e67..de57bdd 100755
--- a/README.md
+++ b/README.md
@@ -34,20 +34,20 @@ The actual data can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bu

 ## Import the data into the database

-11. Now the data for the n-grams, speeches, and speakers has to be imported into the database of the app.
-13. Change the owner rights of all files in the repository. (This step should only be necessary on Linux systems.)
+1. Now the data for the n-grams, speeches, and speakers has to be imported into the database of the app.
+2. Change the owner rights of all files in the repository. (This step should only be necessary on Linux systems.)
     - This has to be done because every process inside a Docker container is executed with root privileges, so the created volumes would otherwise no longer be accessible.
     - Change the rights with `sudo chown -R $USER:$USER .`.
-12. Download the folders *MdB\_data* and *outputs* from the link mentioned in [this repository](https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_data).
+3. Download the folders *MdB\_data* and *outputs* from the link mentioned in [this repository](https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_data).
     - Copy them into the folder *input_volume*, which is located at the root level of the web app repository.
     - If the downloaded folders are inside an archive, extract them first.
     - The folder *input_volume* is a volume that is mounted into the web app container; the container can read all data inside it. Note that the volume is accessed via the path */usr/src/app/input_data*, not */usr/src/app/input_volume*.
-13. First we have to import the speaker data.
+4. First we have to import the speaker data.
     - Interactively access the running web application container with `docker exec -it bundesdata_web_app_web_1 bash`.
     - Import the speaker data in the container's bash prompt with `python manage.py import_speakers input_data/MdB_data/MdB_Stammdaten.xml`.
-14. After that we can import all the protocols and thus all speeches for every person.
+5. After that we can import all the protocols and thus all speeches for every person.
     - Import the protocols in the container's bash prompt with `python manage.py import_protocols input_data/outputs/markup/full_periods`. (Importing all protocols takes up to 2 days. For testing purposes *dev\_data/beautiful\_xml* or *test\_data/beautiful\_xml* can be used.)
-15. Now the n-grams can be imported.
+6. Now the n-grams can be imported.
     - Import the n-grams in the container's bash prompt with `python manage.py import_ngrams_bulk 1 input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/1_grams lm_ns_year`.
     - This command imports the alphabetically split n-grams into their corresponding tables.
     - The first parameter of this command is *1*. It tells the function to import the n-grams from the input path as 1-grams.
@@ -59,10 +59,10 @@ The actual data can be found here: https://gitlab.ub.uni-bielefeld.de/sporada/bu
     - If you did something wrong, you can reset the database with `docker-compose run web python manage.py flush` and start the data import again.
     - It is possible to import different n-gram sets at the same time using multiple commands in multiple terminals. Just keep an eye on the CPU and RAM usage.
    - There is also an optional fourth parameter to set the batch size of one insert. By default 1 million rows are read from the CSV and inserted into the database at once; the parameter `-bs 10000000` would set this to 10 million. Increasing that value also increases the RAM usage, so be careful with it.
-16. Repeat the step above for every kind of n-gram data you want to import. Importing 1-grams only takes a few minutes, while importing 5-grams takes several hours. (For testing purposes the n-grams from *dev\_data* can be used.)
-17. After importing the n-grams, the web app is all set up.
-18. The app can be shut down with `docker-compose down`. All imported data is saved persistently in the database volume *postgres_data*.
-19. To restart the app, use `docker-compose up`, or use `docker-compose up -d` to start it detached.
+7. Repeat the step above for every kind of n-gram data you want to import. Importing 1-grams only takes a few minutes, while importing 5-grams takes several hours. (For testing purposes the n-grams from *dev\_data* can be used.)
+8. After importing the n-grams, the web app is all set up.
+9. The app can be shut down with `docker-compose down`. All imported data is saved persistently in the database volume *postgres_data*.
+10. To restart the app, use `docker-compose up`, or use `docker-compose up -d` to start it detached.

 # Security settings for hosting your own public version
 Before hosting your own version of this web application publicly, make sure the PostgreSQL username, password, etc. in your *.env* file have been set to new and secret values. Keep in mind that the current version of this web application is not HTTPS-ready on its own. To host this web application with HTTPS, check out [traefik](https://docs.traefik.io/).
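For reference, steps 2 through 5 of the patched instructions can be chained into one shell session. This is a minimal sketch, assuming the container name `bundesdata_web_app_web_1` from the patch and a stack already started with `docker-compose up -d`; the download path is a placeholder for wherever you extracted the archives.

```bash
# Run from the root of the web app repository (Linux).
sudo chown -R $USER:$USER .   # step 2: fix owner rights

# Step 3: place the downloaded data in the mounted volume
# (inside the container it appears as /usr/src/app/input_data).
cp -r ~/Downloads/MdB_data ~/Downloads/outputs input_volume/

# Steps 4-5: run the imports inside the running container.
# Equivalent to opening a shell first with `docker exec -it ... bash`.
docker exec -it bundesdata_web_app_web_1 \
  python manage.py import_speakers input_data/MdB_data/MdB_Stammdaten.xml
docker exec -it bundesdata_web_app_web_1 \
  python manage.py import_protocols input_data/outputs/markup/full_periods
```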
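Step 7 says to repeat the n-gram import for every n-gram order, but the patch only shows the 1-gram command. Assuming the other orders live in sibling folders named `2_grams` through `5_grams` (an inference from the 1-gram path, not something the README states), the repetition could be scripted like this inside the container's bash prompt:

```bash
# Import 1- through 5-grams for the lm_ns_year set.
# The ${n}_grams folder names are inferred from the 1_grams path above.
for n in 1 2 3 4 5; do
  python manage.py import_ngrams_bulk "$n" \
    "input_data/outputs/nlp/full_periods/n-grams/lm_ns_year/${n}_grams" \
    lm_ns_year
done

# Optional: raise the batch size (default: 1 million rows per insert).
# Larger batches insert faster but need more RAM:
# python manage.py import_ngrams_bulk 1 <path> lm_ns_year -bs 10000000
```

Running the orders sequentially, as in the loop, keeps resource usage predictable; the parallel-terminal approach mentioned above trades RAM and CPU for wall-clock time.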
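For the security section, a sketch of what resetting the *.env* values could look like. The key names below follow the convention of the official postgres Docker image and are an assumption about this project's layout; check the project's *docker-compose.yml* for the exact keys it reads.

```bash
# Generate a fresh secret on the host (example):
openssl rand -base64 32

# Example .env layout -- placeholder values only, never commit real secrets.
# POSTGRES_* key names are assumed (postgres image convention).
POSTGRES_USER=bundesdata_admin
POSTGRES_PASSWORD=paste-the-generated-value-here
POSTGRES_DB=bundesdata
```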