# bundesdata_markup_nlp_data

This repository only provides the link to the data used and created by the software from this repository: https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software

Please read the description of that project to understand what kind of data this is. The project is part of a master's thesis, which can be read [here](https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_web_app/raw/24641c2959796659d428514c9cdd3782d4248da0/2019-02-04_Stephan_Porada_Masterthesis_semi.pdf?inline=false).

The data can be downloaded here: https://uni-bielefeld.sciebo.de/s/I93l9QNZKLUTv3S

**Size**: around 70GB

**Structure of the data files**:

Note that there are currently **two** versions of all the data available.

Version _1.0\_data_ contains the data described and used for the original master's thesis. The thesis can be viewed [here](https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_web_app/raw/24641c2959796659d428514c9cdd3782d4248da0/2019-02-04_Stephan_Porada_Masterthesis_semi.pdf?inline=false). The protocols for periods 15, 16 and 17 are erroneous. Therefore the n-grams for those periods are erroneous as well.

Version _1.1\_data_ contains the new, officially released and corrected XML protocols. The protocols have been corrected by the Bundesregierung. The n-grams will be recalculated on the basis of the new protocols in the near future. Some fixes to the markup will also be introduced.

```
.
├── inputs   ### Data used as input for the software from https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software
│   ├── backup_raw_xml   ### Zip files of all original protocols.
│   ├── current_official_protocols_xml   ### Example file of the new official markup.
│   ├── development_data_xml   ### Set of original xml protocols used for development.
│   ├── faulty_raw_xml   ### All original protocols with errors. The Bundesregierung should have fixed those by now. The software mentioned above still used the faulty ones because the corrected ones were not available at the time.
│   │   ├── 15_Wahlperiode_2002-2005
│   │   ├── 16_Wahlperiode_2005-2009
│   │   └── 17_Wahlperiode_2009-2013
│   ├── protocols_raw_xml   ### Unzipped original protocols.
│   │   ├── 01_Wahlperiode_1949-1953
│   │   ├── 02_Wahlperiode_1953-1957
│   │   ├── 03_Wahlperiode_1957-1961
│   │   ├── 04_Wahlperiode_1961-1965
│   │   ├── 05_Wahlperiode_1965-1969
│   │   ├── 06_Wahlperiode_1969-1972
│   │   ├── 07_Wahlperiode_1972-1976
│   │   ├── 08_Wahlperiode_1976-1980
│   │   ├── 09_Wahlperiode_1980-1983
│   │   ├── 10_Wahlperiode_1983-1987
│   │   ├── 11_Wahlperiode_1987-1990
│   │   ├── 12_Wahlperiode_1990-1994
│   │   ├── 13_Wahlperiode_1994-1998
│   │   ├── 14_Wahlperiode_1998-2002
│   │   └── 18_Wahlperiode_2013-2017
│   └── test_data_xml   ### Set of original protocols for testing purposes.
├── MdB_data   ### The official Stammdaten of every MdB can be found here.
├── outputs   ### These are the files and data produced using the software from https://gitlab.ub.uni-bielefeld.de/sporada/bundesdata_markup_nlp_software
│   ├── markup   ### Contains all automatically marked protocols.
│   │   ├── dev_data   ### Automatically marked dev_data protocols.
│   │   │   ├── beautiful_xml   ### Final output: human-readable automatically marked protocols.
│   │   │   ├── clear_speech_markup   ### Tmp data
│   │   │   ├── complex_markup   ### Tmp data
│   │   │   ├── new_metadata   ### Tmp data
│   │   │   └── simple_xml   ### Tmp data
│   │   ├── full_periods     ### Automatically marked protocols from all periods.
│   │   │   ├── 01_Wahlperiode_1949-1953
│   │   │   ├── 02_Wahlperiode_1953-1957
│   │   │   ├── 03_Wahlperiode_1957-1961
│   │   │   ├── 04_Wahlperiode_1961-1965
│   │   │   ├── 05_Wahlperiode_1965-1969
│   │   │   ├── 06_Wahlperiode_1969-1972
│   │   │   ├── 07_Wahlperiode_1972-1976
│   │   │   ├── 08_Wahlperiode_1976-1980
│   │   │   ├── 09_Wahlperiode_1980-1983
│   │   │   ├── 10_Wahlperiode_1983-1987
│   │   │   ├── 11_Wahlperiode_1987-1990
│   │   │   ├── 12_Wahlperiode_1990-1994
│   │   │   ├── 13_Wahlperiode_1994-1998
│   │   │   ├── 14_Wahlperiode_1998-2002
│   │   │   ├── 15_Wahlperiode_2002-2005_faulty
│   │   │   ├── 16_Wahlperiode_2005-2009_faulty
│   │   │   ├── 17_Wahlperiode_2009-2013_faulty
│   │   │   └── 18_Wahlperiode_2013-2017
│   │   └── test_data     ### Automatically marked test_data protocols.
│   │       ├── beautiful_xml  ### Final output: human-readable automatically marked protocols.
│   │       ├── clear_speech_markup   ### Tmp data
│   │       ├── complex_markup   ### Tmp data
│   │       ├── new_metadata   ### Tmp data
│   │       └── simple_xml   ### Tmp data
│   └── nlp   ### All data created from the automatically marked protocols.
│       └── full_periods   ### Contains data created from all protocols.
│           ├── n-grams   ### N-gram data based on the protocols folder (sibling of this folder).
│           │   ├── lm_ns_speaker   ### N-grams from lemmatized protocols without stop words counted by speaker.
│           │   │   ├── 1_grams
│           │   │   ├── 2_grams
│           │   │   ├── 3_grams
│           │   │   ├── 4_grams
│           │   │   └── 5_grams
│           │   ├── lm_ns_year   ### N-grams from lemmatized protocols without stop words counted by year.
│           │   │   ├── 1_grams
│           │   │   ├── 2_grams
│           │   │   ├── 3_grams
│           │   │   ├── 4_grams
│           │   │   └── 5_grams
│           │   ├── tk_ws_speaker_(1-3)   ### N-grams from tokenized protocols with stop words counted by speaker.
│           │   │   ├── 1_grams
│           │   │   ├── 2_grams
│           │   │   └── 3_grams
│           │   └── tk_ws_year_(1-4)   ### N-grams from tokenized protocols with stop words counted by year.
│           │       ├── 1_grams
│           │       ├── 2_grams
│           │       ├── 3_grams
│           │       └── 4_grams
│           └── protocols   ### Lemmatized and tokenized protocols used for n-gram calculation.
│               ├── protocols_lemmatized_without_stopwords
│               └── protocols_tokenized_with_stopwords
└── protocol_DTD

```
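
The period folders in the tree above follow a regular naming scheme (`NN_Wahlperiode_YYYY-YYYY`, with an optional `_faulty` suffix), so they can be parsed when iterating over the data. The following is only a minimal sketch and not part of the software linked above: the names `Period`, `parse_period` and `list_periods` are hypothetical, and the path in the usage note assumes the download was unpacked into a folder called `1.0_data` containing the tree shown above.

```python
import re
from pathlib import Path
from typing import List, NamedTuple, Optional


class Period(NamedTuple):
    """One legislative period folder (hypothetical helper type)."""
    number: int       # number of the Wahlperiode, e.g. 15
    start_year: int   # first year in the folder name
    end_year: int     # last year in the folder name
    faulty: bool      # True if the folder name ends in "_faulty"


# Matches folder names such as "01_Wahlperiode_1949-1953" and
# "15_Wahlperiode_2002-2005_faulty" as listed in the tree above.
_PERIOD_RE = re.compile(r"^(\d{2})_Wahlperiode_(\d{4})-(\d{4})(_faulty)?$")


def parse_period(folder_name: str) -> Optional[Period]:
    """Parse a period folder name; return None if it does not match."""
    match = _PERIOD_RE.match(folder_name)
    if match is None:
        return None
    number, start, end, faulty = match.groups()
    return Period(int(number), int(start), int(end), faulty is not None)


def list_periods(directory: Path) -> List[Period]:
    """Return all period subfolders of `directory`, sorted by period number."""
    parsed = (parse_period(p.name) for p in directory.iterdir() if p.is_dir())
    return sorted(p for p in parsed if p is not None)
```

Under the assumption stated above, `list_periods(Path("1.0_data/outputs/markup/full_periods"))` would return one entry per period folder, and filtering on `faulty` would separate the erroneous periods 15 to 17 from the rest.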