mirror of
				https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nopaque.git
				synced 2025-11-04 04:12:45 +00:00 
			
		
		
		
	manual sections 01, 02, 06
This commit is contained in:
		@@ -11,7 +11,7 @@
 | 
			
		||||
  <li><b>Image-to-text conversion tools:</b></li>
 | 
			
		||||
    <ol style="list-style-type:circle; margin-left:1em; padding-bottom:0;"><li><b>Optical Character Recognition</b> converts photos and 
 | 
			
		||||
    scans into text data, making them machine-readable.</li>
 | 
			
		||||
    <li><b>Transkribus HTR (Handwritten Text Recognition) Pipeline</b> 
 | 
			
		||||
    <li><b>Transkribus HTR (Handwritten Text Recognition) Pipeline (currently deactivated)* </b> 
 | 
			
		||||
    also converts images into text data, making them machine-readable.</li>
 | 
			
		||||
    </ol>
 | 
			
		||||
  <li><b>Natural Language Processing</b> extracts information from your text via 
 | 
			
		||||
@@ -23,5 +23,12 @@
 | 
			
		||||
 | 
			
		||||
Nopaque also features a <b>Social Area</b>, where researchers can create a personal profile, connect with other users and share corpora if desired.
 | 
			
		||||
These services can be accessed from the sidebar in nopaque.
 | 
			
		||||
All processes are implemented in a specially provided cloud environment with established open-source software. This always ensures that no personal data of the users is disclosed.
 | 
			
		||||
All processes are implemented in a specially provided cloud environment with established open-source software. 
 | 
			
		||||
This always ensures that no personal data of the users is disclosed.
 | 
			
		||||
<p>
 | 
			
		||||
*Note: the Transkribus HTR Pipeline is currently 
 | 
			
		||||
deactivated; we are working on an alternative solution. You can try using Tesseract OCR, 
 | 
			
		||||
though the results will likely be poor.
 | 
			
		||||
</p>
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
 
 | 
			
		||||
@@ -35,6 +35,7 @@ name in ascending order. It is thus recommended to name them accordingly, for ex
 | 
			
		||||
page-01.png, page-02.jpg, page-03.tiff.
 | 
			
		||||
</p>
 | 
			
		||||
<p>
 | 
			
		||||
Add a title and description to your job and select the File Setup version* you want to use.
 | 
			
		||||
After uploading the images and completing the File Setup job, the list of files added 
 | 
			
		||||
can be seen under “Inputs.” Further below, under “Results,” you can find and download 
 | 
			
		||||
the PDF output.</p>
 | 
			
		||||
@@ -42,14 +43,17 @@ the PDF output.</p>
 | 
			
		||||
<p>Select an image-to-text conversion tool depending on whether your PDF is primarily 
 | 
			
		||||
composed of handwritten text or printed text. For printed text, select the <b>Tesseract OCR 
 | 
			
		||||
Pipeline</b>. For handwritten text, select the <b>Transkribus HTR Pipeline</b>. Select the desired 
 | 
			
		||||
language model or upload your own. Select the version of Tesseract OCR you want to use 
 | 
			
		||||
language model or upload your own. Select the version* of Tesseract OCR you want to use 
 | 
			
		||||
and click on submit to start the conversion. When the job is finished, various output 
 | 
			
		||||
files can be seen and downloaded further below, under “Results.” You may want to review 
 | 
			
		||||
the text output for errors and coherence.</p>
 | 
			
		||||
the text output for errors and coherence. (Note: the Transkribus HTR Pipeline is currently 
 | 
			
		||||
deactivated; we are working on an alternative solution. You can try using Tesseract OCR, 
 | 
			
		||||
though the results will likely be poor.)
 | 
			
		||||
</p>
 | 
			
		||||
<h5 id="extracting-linguistic-data">Extracting linguistic data from text</h5>
 | 
			
		||||
<p>The <b>SpaCy NLP Pipeline</b> service extracts linguistic information from plain text files 
 | 
			
		||||
(in .txt format). Select the corresponding .txt file, the language model, and the 
 | 
			
		||||
version you want to use. When the job is finished, find and download the files in 
 | 
			
		||||
version* you want to use. When the job is finished, find and download the files in 
 | 
			
		||||
<b>.json</b> and <b>.vrt</b> format under “Results.”</p>
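A minimal sketch of how a downloaded .vrt file could be inspected before it is used to create a corpus. It assumes the common verticalized-text layout: one token per line, tab-separated attributes, and XML-like tags for structural attributes. The column names (`word`, `lemma`, `pos`), the `read_vrt` helper, and the sample lines are illustrative assumptions, not the documented layout of nopaque's output.

```python
def read_vrt(lines, columns=("word", "lemma", "pos")):
    """Yield one dict per token line, skipping XML-like structural tags.

    The column names are an assumption for illustration; check the actual
    .vrt file to see which attributes it contains and in which order.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Structural attributes (e.g. <text>, <s>) start with "<".
        if not line or line.startswith("<"):
            continue
        yield dict(zip(columns, line.split("\t")))


# Made-up sample in the assumed layout, not real pipeline output.
sample = [
    "<text>",
    "<s>",
    "Nopaque\tnopaque\tPROPN",
    "is\tbe\tAUX",
    "modular\tmodular\tADJ",
    "</s>",
    "</text>",
]

tokens = list(read_vrt(sample))
print([token["lemma"] for token in tokens])
```

To read a real file, pass an open file handle instead of the sample list, for example `read_vrt(open("result.vrt", encoding="utf-8"))`, where the file name is hypothetical.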
 | 
			
		||||
<h5 id="creating-a-corpus">Creating a corpus</h5>
 | 
			
		||||
<p>Now, using the files in .vrt format, you can create a corpus. This can be done 
 | 
			
		||||
@@ -74,3 +78,9 @@ visually as plain text with the option of highlighted entities or as chips.</p>
 | 
			
		||||
Here, you can filter out text parameters and structural attributes in different 
 | 
			
		||||
combinations. This is explained in more detail in the Query Builder section of the 
 | 
			
		||||
manual.</p>
 | 
			
		||||
 | 
			
		||||
<br>
 | 
			
		||||
<br>
 | 
			
		||||
*For all services, it is recommended to use the latest version unless you need a model 
 | 
			
		||||
only available in an earlier version or are looking to reproduce data that was originally generated 
 | 
			
		||||
using an older version.
 | 
			
		||||
 
 | 
			
		||||
@@ -7,40 +7,58 @@
 | 
			
		||||
  </div>
 | 
			
		||||
  <div class="col s12 m8">
 | 
			
		||||
    <p>
 | 
			
		||||
      Nopaque was designed to be modular. Its workflow consists of a sequence 
 | 
			
		||||
      of services that can be applied at different starting and ending points. 
 | 
			
		||||
      This allows you to proceed with your work flexibly.
 | 
			
		||||
      Each of these modules is implemented in a self-contained service, each of
 | 
			
		||||
      which represents a step in the workflow. The services are coordinated in
 | 
			
		||||
      such a way that they can be used consecutively. The order can either be
 | 
			
		||||
      taken from the listing of the services in the left sidebar or from the
 | 
			
		||||
      roadmap (accessible via the pink compass in the upper right corner). All
 | 
			
		||||
      services are versioned, so the data generated with nopaque is always
 | 
			
		||||
      Nopaque was designed to be modular. Its modules are implemented in 
 | 
			
		||||
      self-contained <b>services</b>, each of which represents a step in the 
 | 
			
		||||
      workflow. The typical workflow involves using the services consecutively.
 | 
			
		||||
      The typical workflow order can be taken from the listing of the 
 | 
			
		||||
      services in the left sidebar or from the nopaque manual (accessible via the pink 
 | 
			
		||||
      button in the upper right corner). 
 | 
			
		||||
      The services can also be applied at different starting and ending points, 
 | 
			
		||||
      which allows you to conduct your work flexibly.
 | 
			
		||||
      All services are versioned, so the data generated with nopaque is always
 | 
			
		||||
      reproducible.
 | 
			
		||||
      
 | 
			
		||||
      <p>For all services, it is recommended to use the latest version (selected 
 | 
			
		||||
      in the drop-down menu on the service page) unless you need a model 
 | 
			
		||||
      only available in an earlier version or are looking to reproduce data that was originally generated 
 | 
			
		||||
      using an older version.</p>
 | 
			
		||||
    </p>
 | 
			
		||||
  </div>
 | 
			
		||||
</div>
 | 
			
		||||
 | 
			
		||||
<h4 class="manual-chapter-title">File Setup</h4>
 | 
			
		||||
 | 
			
		||||
 | 
			
		||||
<h4>File Setup</h4>
 | 
			
		||||
<p>
 | 
			
		||||
  The <a href="{{ url_for('services.file_setup_pipeline') }}">File Setup Service</a> bundles image data, such as scans and photos,
 | 
			
		||||
  together in a handy PDF file. To use this service, use the job form to
 | 
			
		||||
  select the images to be bundled, choose the desired service version, and
 | 
			
		||||
  specify a title and description. Please note that the service sorts the
 | 
			
		||||
  images into the resulting PDF file based on the file names. So naming the
 | 
			
		||||
  images correctly is of great importance. It has proven to be a good practice
 | 
			
		||||
  to name the files according to the following scheme:
 | 
			
		||||
  page-01.png, page-02.jpg, page-03.tiff, etc. In general, you can assume
 | 
			
		||||
  specify a title and description.
 | 
			
		||||
  Note that the File Setup service will sort the images based on their file name in 
 | 
			
		||||
  ascending order. It is thus important and highly recommended to name 
 | 
			
		||||
  them accordingly, for example: 
 | 
			
		||||
  page-01.png, page-02.jpg, page-03.tiff. Generally, you can assume
 | 
			
		||||
  that the images will be sorted in the order in which the file explorer of
 | 
			
		||||
  your operating system lists them when you view the files in a folder
 | 
			
		||||
  sorted in ascending order by file name.
 | 
			
		||||
</p>
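The effect of the naming scheme can be sketched in a few lines of Python. The sketch assumes a plain ascending sort by file name (the exact comparison the service uses is not stated here); the `padded_names` helper and the file names are hypothetical. Zero-padded page numbers keep the intended order under such a sort, while unpadded numbers do not.

```python
from pathlib import Path


def padded_names(names, prefix="page", width=2):
    """Return (old, new) pairs that rename files to prefix-NN.ext.

    The files are numbered in the order in which they are given, so the
    input list must already be in the intended reading order.
    """
    return [
        (name, f"{prefix}-{index:0{width}d}{Path(name).suffix.lower()}")
        for index, name in enumerate(names, start=1)
    ]


# Without zero padding, ascending name order puts page-10 before page-2.
unpadded = ["page-1.png", "page-2.jpg", "page-10.tiff"]
print(sorted(unpadded))

# Scans listed in the intended reading order (hypothetical file names).
scans = ["cover.png", "intro.jpg", "chapter.tiff"]
renamed = [new for _, new in padded_names(scans)]
print(renamed)
print(sorted(renamed) == renamed)
```

For more than 99 images, a larger `width` keeps the numbering consistent.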
 | 
			
		||||
 | 
			
		||||
<h4>Optical Character Recognition (OCR)</h4>
 | 
			
		||||
<p>Coming soon...</p>
 | 
			
		||||
<p>
 | 
			
		||||
  The <a href="{{ url_for('services.tesseract_ocr_pipeline') }}">Tesseract OCR Pipeline</a> 
 | 
			
		||||
  converts image data, such as photos and scans, into text data, making it machine-readable. 
 | 
			
		||||
  This step enables you to proceed with the computational analysis of your documents. 
 | 
			
		||||
  To use this service, use the job form to select the file you want to convert, choose 
 | 
			
		||||
  the desired language model and service version, enter the title and description, and 
 | 
			
		||||
  submit your job. The results can be found and downloaded below, under "Results."
 | 
			
		||||
 | 
			
		||||
</p>
 | 
			
		||||
 | 
			
		||||
<h4>Handwritten Text Recognition (HTR)</h4>
 | 
			
		||||
<p>Coming soon...</p>
 | 
			
		||||
<p>The Transkribus HTR Pipeline is currently 
 | 
			
		||||
deactivated. We are working on an alternative solution. In the meantime, you can 
 | 
			
		||||
try using Tesseract OCR, though the results will likely be poor.</p>
 | 
			
		||||
 | 
			
		||||
<h4>Natural Language Processing (NLP)</h4>
 | 
			
		||||
<p>Coming soon...</p>
 | 
			
		||||
@@ -48,7 +66,7 @@
 | 
			
		||||
<h4>Corpus Analysis</h4>
 | 
			
		||||
<p>
 | 
			
		||||
  With the corpus analysis service, it is possible to create a text corpus
 | 
			
		||||
  and then explore it in an analysis session. The analysis session is realized
 | 
			
		||||
  and then explore it with analytical tools. The analysis session is realized
 | 
			
		||||
  on the server side by the Open Corpus Workbench software, which enables
 | 
			
		||||
  efficient and complex searches with the help of the CQP Query Language.
 | 
			
		||||
</p>
 | 
			
		||||
 
 | 
			
		||||