Compare commits

..

No commits in common. "4425d50140b1211c358e7b40eb9938d1d995844c" and "a53f1d216b43c7056571c7adb190b9f3cb15c2ae" have entirely different histories.

4 changed files with 31 additions and 106 deletions

@@ -11,7 +11,7 @@
<li><b>Image-to-text conversion tools:</b></li> <li><b>Image-to-text conversion tools:</b></li>
<ol style="list-style-type:circle; margin-left:1em; padding-bottom:0;"><li><b>Optical Character Recognition</b> converts photos and <ol style="list-style-type:circle; margin-left:1em; padding-bottom:0;"><li><b>Optical Character Recognition</b> converts photos and
scans into text data, making them machine-readable.</li> scans into text data, making them machine-readable.</li>
<li><b>Transkribus HTR (Handwritten Text Recognition) Pipeline</b> (currently deactivated)* <li><b>Transkribus HTR (Handwritten Text Recognition) Pipeline</b>
also converts images into text data, making them machine-readable.</li> also converts images into text data, making them machine-readable.</li>
</ol> </ol>
<li><b>Natural Language Processing</b> extracts information from your text via <li><b>Natural Language Processing</b> extracts information from your text via
@@ -23,12 +23,5 @@
Nopaque also features a <b>Social Area</b>, where researchers can create a personal profile, connect with other users and share corpora if desired. Nopaque also features a <b>Social Area</b>, where researchers can create a personal profile, connect with other users and share corpora if desired.
These services can be accessed from the sidebar in nopaque. These services can be accessed from the sidebar in nopaque.
All processes are implemented in a specially provided cloud environment with established open-source software. All processes are implemented in a specially provided cloud environment with established open-source software. This always ensures that no personal data of the users is disclosed.
This always ensures that no personal data of the users is disclosed.
<p>
*Note: the Transkribus HTR Pipeline is currently
deactivated; we are working on an alternative solution. You can try using Tesseract OCR,
though the results will likely be poor.
</p>

@@ -35,7 +35,6 @@ name in ascending order. It is thus recommended to name them accordingly, for ex
page-01.png, page-02.jpg, page-03.tiff. page-01.png, page-02.jpg, page-03.tiff.
</p> </p>
<p> <p>
Add a title and description to your job and select the File Setup version* you want to use.
After uploading the images and completing the File Setup job, the list of files added After uploading the images and completing the File Setup job, the list of files added
can be seen under “Inputs.” Further below, under “Results,” you can find and download can be seen under “Inputs.” Further below, under “Results,” you can find and download
the PDF output.</p> the PDF output.</p>
@@ -43,35 +42,21 @@ the PDF output.</p>
<p>Select an image-to-text conversion tool depending on whether your PDF is primarily <p>Select an image-to-text conversion tool depending on whether your PDF is primarily
composed of handwritten text or printed text. For printed text, select the <b>Tesseract OCR composed of handwritten text or printed text. For printed text, select the <b>Tesseract OCR
Pipeline</b>. For handwritten text, select the <b>Transkribus HTR Pipeline</b>. Select the desired Pipeline</b>. For handwritten text, select the <b>Transkribus HTR Pipeline</b>. Select the desired
language model or upload your own. Select the version* of Tesseract OCR you want to use language model or upload your own. Select the version of Tesseract OCR you want to use
and click on submit to start the conversion. When the job is finished, various output and click on submit to start the conversion. When the job is finished, various output
files can be seen and downloaded further below, under “Results.” You may want to review files can be seen and downloaded further below, under “Results.” You may want to review
the text output for errors and coherence. (Note: the Transkribus HTR Pipeline is currently the text output for errors and coherence.</p>
deactivated; we are working on an alternative solution. You can try using Tesseract OCR,
though the results will likely be poor.)
</p>
<h5 id="extracting-linguistic-data">Extracting linguistic data from text</h5> <h5 id="extracting-linguistic-data">Extracting linguistic data from text</h5>
<p>The <b>SpaCy NLP Pipeline</b> service extracts linguistic information from plain text files <p>The <b>SpaCy NLP Pipeline</b> service extracts linguistic information from plain text files
(in .txt format). Select the corresponding .txt file, the language model, and the (in .txt format). Select the corresponding .txt file, the language model, and the
version* you want to use. When the job is finished, find and download the files in version you want to use. When the job is finished, find and download the files in
<b>.json</b> and <b>.vrt</b> format under “Results.”</p> <b>.json</b> and <b>.vrt</b> format under “Results.”</p>
<h5 id="creating-a-corpus">Creating a corpus</h5> <h5 id="creating-a-corpus">Creating a corpus</h5>
<p>Now, using the files in .vrt format, you can create a corpus. This can be done <p>Now, using the files in .vrt format, you can create a corpus. This can be done
in the <a href="{{ url_for('main.dashboard') }}">Dashboard</a> or in the Dashboard or Corpus Analysis under “My Corpora.” Click on “Create corpus”
<a href="{{ url_for('services.corpus_analysis') }}">Corpus Analysis</a> sections under “My Corpora.” Click on “Create corpus” and add a title and description for your corpus. After submitting, navigate down to
and add a title and description for your corpus. After submitting, you will automatically the “Corpus files” section. Once you have added the desired .vrt files, select “Build”
be taken to the corpus overview page (which can be called up again via the corpus lists) on the corpus page under “Actions.” Now, your corpus is ready for analysis.</p>
of your new, still empty corpus. </p>
<p>
Further down in the “Corpus files” section, you can add texts in .vrt format
(results of the NLP service) to your new corpus. To do this, use the "Add Corpus File"
button and fill in the form that appears. Here, you can add
metadata to each text. After adding all texts to the corpus, it must
be prepared for analysis. This process can be initiated by clicking on the
"Build" button under "Actions".
On the corpus overview page, you can see information about the current status of
the corpus in the upper right corner. After the build process, the status "built" should be shown here.
Now, your corpus is ready for analysis.</p>
<h5 id="analyzing-a-corpus">Analyzing a corpus</h5> <h5 id="analyzing-a-corpus">Analyzing a corpus</h5>
<p>Navigate to the corpus you would like to analyze and click on the Analyze button. <p>Navigate to the corpus you would like to analyze and click on the Analyze button.
This will take you to an analysis overview page for your corpus. Here, you can find a This will take you to an analysis overview page for your corpus. Here, you can find a
@@ -89,9 +74,3 @@ visually as plain text with the option of highlighted entities or as chips.</p>
Here, you can filter out text parameters and structural attributes in different Here, you can filter out text parameters and structural attributes in different
combinations. This is explained in more detail in the Query Builder section of the combinations. This is explained in more detail in the Query Builder section of the
manual.</p> manual.</p>
<br>
<br>
*For all services, it is recommended to use the latest version unless you need a model
only available in an earlier version or are looking to reproduce data that was originally generated
using an older version.

@@ -16,7 +16,7 @@
A <b>job</b> is an initiated file processing procedure. A <b>job</b> is an initiated file processing procedure.
A <b>model</b> is a mathematical system for pattern recognition based on data examples that have been processed by AI. One can search for jobs as A <b>model</b> is a mathematical system for pattern recognition based on data examples that have been processed by AI. One can search for jobs as
well as corpus listings using the search field displayed above them on the dashboard. well as corpus listings using the search field displayed above them on the dashboard.
Uploaded models can be found and edited by clicking on the corresponding service under <b>My Contributions</b>. Models can be found and edited by clicking on the corresponding service under <b>My Contributions</b>.
</p> </p>
</div> </div>
<div class="col s12">&nbsp;</div> <div class="col s12">&nbsp;</div>

@@ -7,95 +7,48 @@
</div> </div>
<div class="col s12 m8"> <div class="col s12 m8">
<p> <p>
Nopaque was designed to be modular. Its modules are implemented in Nopaque was designed to be modular. Its workflow consists of a sequence
self-contained <b>services</b>, each of which represents a step in the of services that can be applied at different starting and ending points.
workflow. The typical workflow involves using services one after another, This allows you to proceed with your work flexibly.
consecutively. Each of these modules are implemented in a self-contained service, each of
The typical workflow order can be taken from the listing of the which represents a step in the workflow. The services are coordinated in
services in the left sidebar or from the nopaque manual (accessible via the pink such a way that they can be used consecutively. The order can either be
button in the upper right corner). taken from the listing of the services in the left sidebar or from the
The services can also be applied at different starting and ending points, roadmap (accessible via the pink compass in the upper right corner). All
which allows you to conduct your work flexibly. services are versioned, so the data generated with nopaque is always
All services are versioned, so the data generated with nopaque is always
reproducible. reproducible.
<p>For all services, it is recommended to use the latest version (selected
in the drop-down menu on the service page) unless you need a model
only available in an earlier version or are looking to reproduce data that was originally generated
using an older version.</p>
</p> </p>
</div> </div>
</div> </div>
<h4 class="manual-chapter-title">File Setup</h4>
<h4>File Setup</h4>
<p> <p>
The <a href="{{ url_for('services.file_setup_pipeline') }}">File Setup Service</a> bundles image data, such as scans and photos, The <a href="{{ url_for('services.file_setup_pipeline') }}">File Setup Service</a> bundles image data, such as scans and photos,
together in a handy PDF file. To use this service, use the job form to together in a handy PDF file. To use this service, use the job form to
select the images to be bundled, choose the desired service version, and select the images to be bundled, choose the desired service version, and
specify a title and description. specify a title and description. Please note that the service sorts the
Note that the File Setup service will sort the images based on their file name in images into the resulting PDF file based on the file names. So naming the
ascending order. It is thus important and highly recommended to name images correctly is of great importance. It has proven to be a good practice
them accordingly, for example: to name the files according to the following scheme:
page-01.png, page-02.jpg, page-03.tiff. Generally, you can assume page-01.png, page-02.jpg, page-03.tiff, etc. In general, you can assume
that the images will be sorted in the order in which the file explorer of that the images will be sorted in the order in which the file explorer of
your operating system lists them when you view the files in a folder your operating system lists them when you view the files in a folder
sorted in ascending order by file name. sorted in ascending order by file name.
</p> </p>
<h4>Optical Character Recognition (OCR)</h4> <h4>Optical Character Recognition (OCR)</h4>
<p> <p>Coming soon...</p>
The <a href="{{ url_for('services.tesseract_ocr_pipeline') }}">Tesseract OCR Pipeline</a>
converts image data - like photos and scans - into text data, making them machine-readable.
This step enables you to proceed with the computational analysis of your documents.
To use this service, use the job form to select the file you want to convert, choose
the desired language model and service version, enter the title and description, and
submit your job. The results can be found and downloaded below, under "Inputs."
</p>
<h4>Handwritten Text Recognition (HTR)</h4> <h4>Handwritten Text Recognition (HTR)</h4>
<p>The Transkribus HTR Pipeline is currently <p>Coming soon...</p>
deactivated. We are working on an alternative solution. In the meantime, you can
try using Tesseract OCR, though the results will likely be poor.</p>
<h4>Natural Language Processing (NLP)</h4> <h4>Natural Language Processing (NLP)</h4>
<p>The <a href="{{ url_for('services.spacy_nlp_pipeline') }}">SpaCy NLP Pipeline</a> extracts <p>Coming soon...</p>
information from plain text files (.txt format) via computational linguistic data processing
(tokenization, lemmatization, part-of-speech tagging and named-entity recognition).
To use this service, select the corresponding .txt file, the language model, and the
version you want to use. When the job is finished, find and download the files in
<b>.json</b> and <b>.vrt</b> format under “Results.”</p>
<h4>Corpus Analysis</h4> <h4>Corpus Analysis</h4>
<p> <p>
With the <a href="{{ url_for('services.corpus_analysis') }}">Corpus Analysis</a> With the corpus analysis service, it is possible to create a text corpus
service, it is possible to create a text corpus and then explore it in an analysis session. The analysis session is realized
and then explore through it with analytical tools. The analysis session is realized
on the server side by the Open Corpus Workbench software, which enables on the server side by the Open Corpus Workbench software, which enables
efficient and complex searches with the help of the CQP Query Language.</p> efficient and complex searches with the help of the CQP Query Language.
<p>
To use this service, navigate to the corpus you would like to analyze and click on the Analyze button.
This will take you to an analysis overview page for your corpus. Here, you can find
a visualization of general linguistic information of your corpus, including tokens,
sentences, unique words, unique lemmas, unique parts of speech and unique simple
parts of speech. You will also find a pie chart of the proportional textual makeup
of your corpus and can view the linguistic information for each individual text file.
A more detailed visualization of token frequencies with a search option is also on
this page.
</p>
<p>
From the corpus analysis overview page, you can navigate to other analysis modules:
the Query Builder (under Concordance) and the Reader. With the Reader, you can read
your corpus texts tokenized with the associated linguistic information. The tokens
can be shown as lemmas, parts of speech, words, and can be displayed in different
ways: visually as plain text with the option of highlighted entities or as chips.
</p>
<p>
The Concordance module allows for more specific, query-oriented text analyses.
Here, you can filter out text parameters and structural attributes in different
combinations. This is explained in more detail in the Query Builder section of the
manual.
</p>
</p> </p>
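
The File Setup text in the diff above states that images are sorted by file name in ascending order and recommends names such as page-01.png, page-02.jpg, page-03.tiff. The sketch below illustrates why zero-padding matters. It assumes a plain character-by-character string sort, which may differ from what the service actually does, and `padded_page_names` is a hypothetical helper written for this illustration, not part of nopaque.

```python
def padded_page_names(count, extension="png"):
    """Build zero-padded page file names (hypothetical helper, not a nopaque API)."""
    # Use at least two digits, and more if the page count needs them.
    width = max(2, len(str(count)))
    return [f"page-{i:0{width}d}.{extension}" for i in range(1, count + 1)]


# Names without zero-padding, listed in the intended page order.
unpadded = [f"page-{i}.png" for i in range(1, 13)]
# Names with zero-padding, listed in the intended page order.
padded = padded_page_names(12)

# Assumption: ascending order means a plain string comparison.
print(sorted(unpadded)[:4])      # page-10 sorts directly after page-1
print(sorted(padded) == padded)  # padded names keep the intended order
```

Under that assumption, the unpadded names place page 10 before page 2, while the padded names sort in page order.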