change README

sporada 2021-01-19 15:38:50 +01:00
parent 0803ecbeda
commit b68af8aabd


@@ -1,10 +1,10 @@
# bundesdata_markup_nlp_data
-This is just a repository providing the link to the data used and created by the software from this repository: https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software
+This is just a repository providing the link to the data used and created by the software from this repository: https://gitea.sporada.eu/sporada/bundesdata_markup_nlp_software
-Please read the description of that project to understand what kind of data this is. The project is part of a master's thesis, which can be read [here](https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_web_app/raw/24641c2959796659d428514c9cdd3782d4248da0/2019-02-04_Stephan_Porada_Masterthesis_semi.pdf?inline=false).
+Please read the description of that project to understand what kind of data this is. The project is part of a master's thesis, which can be read [here](https://gitea.sporada.eu/sporada/bundesdata_web_app/src/branch/master/2019-02-04_Stephan_Porada_Masterthesis_semi.pdf).
-The data can be downloaded here: https://uni-bielefeld.sciebo.de/s/I93l9QNZKLUTv3S
+The actual data can be downloaded here: https://uni-bielefeld.sciebo.de/s/I93l9QNZKLUTv3S
**Size**: around 70GB
@@ -12,13 +12,13 @@ The data can be downloaded here: https://uni-bielefeld.sciebo.de/s/I93l9QNZKLUTv
Note that there are currently **two** versions of all the data available.
-Version _1.0\_data_ contains the data described and used for the original master's thesis. The master's thesis can be viewed [here](https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_web_app/raw/24641c2959796659d428514c9cdd3782d4248da0/2019-02-04_Stephan_Porada_Masterthesis_semi.pdf?inline=false). Protocols for the periods 15, 16 and 17 are erroneous. Therefore, the ngrams for those periods are erroneous as well.
+Version _1.0\_data_ contains the data described and used for the original master's thesis. The master's thesis can be viewed [here](https://gitea.sporada.eu/sporada/bundesdata_web_app/src/branch/master/2019-02-04_Stephan_Porada_Masterthesis_semi.pdf). Protocols for the periods 15, 16 and 17 are erroneous. Therefore, the ngrams for those periods are erroneous as well.
Version _1.1\_data_ contains the new, officially released and corrected XML protocols. The protocols have been corrected by the Bundesregierung. Ngrams will be recalculated on the basis of the new protocols in the near future. Some fixes regarding the markup will also be introduced.
```
.
-├── inputs ### Data used as input for the software from https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software
+├── inputs ### Data used as input for the software from https://gitea.sporada.eu/sporada/bundesdata_markup_nlp_software
│   ├── backup_raw_xml ### Zip files of all original protocols.
│   ├── current_official_protocols_xml ### Example file of the new official markup.
│   ├── development_data_xml ### Set of original xml protocols used for development.
@@ -44,7 +44,7 @@ Version _1.1\_data_ contains the new officially released and corrected xml proto
│   │   └── 18_Wahlperiode_2013-2017
│   └── test_data_xml ### Set of original protocols for testing purposes.
├── MdB_data ### The official Stammdaten of every MdB can be found here.
-├── outputs ### These are the files and data produced using the software from https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software
+├── outputs ### These are the files and data produced using the software from https://gitea.sporada.eu/sporada/bundesdata_markup_nlp_software
│   ├── markup ### Contains all automatically marked protocols.
│   │   ├── dev_data ### Automatically marked dev_data protocols.
│   │   │   ├── beautiful_xml ### Final output: human-readable automatically marked protocols.