mirror of
https://gitlab.ub.uni-bielefeld.de/sfb1288inf/nopaque.git
synced 2024-12-26 19:34:19 +00:00
Compare commits: a53f1d216b ... 4425d50140 (2 commits)

| Author | SHA1 | Date |
|---|---|---|
| | 4425d50140 | |
| | 39113a6f17 | |
@@ -11,7 +11,7 @@
 <li><b>Image-to-text conversion tools:</b></li>
 <ol style="list-style-type:circle; margin-left:1em; padding-bottom:0;"><li><b>Optical Character Recognition</b> converts photos and
 scans into text data, making them machine-readable.</li>
-<li><b>Transkribus HTR (Handwritten Text Recognition) Pipeline</b>
+<li><b>Transkribus HTR (Handwritten Text Recognition) Pipeline</b> (currently deactivated)*
 also converts images into text data, making them machine-readable.</li>
 </ol>
 <li><b>Natural Language Processing</b> extracts information from your text via
@@ -23,5 +23,12 @@

 Nopaque also features a <b>Social Area</b>, where researchers can create a personal profile, connect with other users and share corpora if desired.
 These services can be accessed from the sidebar in nopaque.
-All processes are implemented in a specially provided cloud environment with established open-source software. This always ensures that no personal data of the users is disclosed.
+All processes are implemented in a specially provided cloud environment with established open-source software.
+This always ensures that no personal data of the users is disclosed.
+<p>
+*Note: the Transkribus HTR Pipeline is currently
+deactivated; we are working on an alternative solution. You can try using Tesseract OCR,
+though the results will likely be poor.
+</p>
@@ -35,6 +35,7 @@ name in ascending order. It is thus recommended to name them accordingly, for ex
 page-01.png, page-02.jpg, page-03.tiff.
 </p>
 <p>
+Add a title and description to your job and select the File Setup version* you want to use.
 After uploading the images and completing the File Setup job, the list of files added
 can be seen under “Inputs.” Further below, under “Results,” you can find and download
 the PDF output.</p>
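The ascending file-name sort described in this hunk can be sketched in Python (illustrative only: a plain lexicographic sort, which is the behavior the manual describes, and which shows why zero-padded names matter):

```python
# How ascending file-name sorting orders pages (plain lexicographic sort;
# assumed here to match the behavior the manual describes).
pages = ["page-03.tiff", "page-01.png", "page-02.jpg"]
print(sorted(pages))
# -> ['page-01.png', 'page-02.jpg', 'page-03.tiff']

# Pitfall: without zero-padding, "page-10" sorts before "page-2",
# because the comparison is character by character.
unpadded = ["page-10.png", "page-2.png", "page-1.png"]
print(sorted(unpadded))
# -> ['page-1.png', 'page-10.png', 'page-2.png']
```

This is why the manual recommends the zero-padded `page-01`, `page-02` naming scheme.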
@@ -42,21 +43,35 @@ the PDF output.</p>
 <p>Select an image-to-text conversion tool depending on whether your PDF is primarily
 composed of handwritten text or printed text. For printed text, select the <b>Tesseract OCR
 Pipeline</b>. For handwritten text, select the <b>Transkribus HTR Pipeline</b>. Select the desired
-language model or upload your own. Select the version of Tesseract OCR you want to use
+language model or upload your own. Select the version* of Tesseract OCR you want to use
 and click on submit to start the conversion. When the job is finished, various output
 files can be seen and downloaded further below, under “Results.” You may want to review
-the text output for errors and coherence.</p>
+the text output for errors and coherence. (Note: the Transkribus HTR Pipeline is currently
+deactivated; we are working on an alternative solution. You can try using Tesseract OCR,
+though the results will likely be poor.)
+</p>
 <h5 id="extracting-linguistic-data">Extracting linguistic data from text</h5>
 <p>The <b>SpaCy NLP Pipeline</b> service extracts linguistic information from plain text files
 (in .txt format). Select the corresponding .txt file, the language model, and the
-version you want to use. When the job is finished, find and download the files in
+version* you want to use. When the job is finished, find and download the files in
 <b>.json</b> and <b>.vrt</b> format under “Results.”</p>
 <h5 id="creating-a-corpus">Creating a corpus</h5>
 <p>Now, using the files in .vrt format, you can create a corpus. This can be done
-in the Dashboard or Corpus Analysis under “My Corpora.” Click on “Create corpus”
-and add a title and description for your corpus. After submitting, navigate down to
-the “Corpus files” section. Once you have added the desired .vrt files, select “Build”
-on the corpus page under “Actions.” Now, your corpus is ready for analysis.</p>
+in the <a href="{{ url_for('main.dashboard') }}">Dashboard</a> or
+<a href="{{ url_for('services.corpus_analysis') }}">Corpus Analysis</a> sections under “My Corpora.” Click on “Create corpus”
+and add a title and description for your corpus. After submitting, you will automatically
+be taken to the corpus overview page (which can be called up again via the corpus lists)
+of your new, still empty corpus.</p>
+<p>
+Further down in the “Corpus files” section, you can add texts in .vrt format
+(results of the NLP service) to your new corpus. To do this, use the "Add Corpus File"
+button and fill in the form that appears. Here, you can add
+metadata to each text. After adding all texts to the corpus, it must
+be prepared for analysis. This process can be initiated by clicking on the
+"Build" button under "Actions".
+On the corpus overview page, you can see information about the current status of
+the corpus in the upper right corner. After the build process, the status "built" should be shown here.
+Now, your corpus is ready for analysis.</p>
 <h5 id="analyzing-a-corpus">Analyzing a corpus</h5>
 <p>Navigate to the corpus you would like to analyze and click on the Analyze button.
 This will take you to an analysis overview page for your corpus. Here, you can find a
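The .vrt files mentioned in this hunk follow the verticalized-text convention used by the Open Corpus Workbench: one token per line with tab-separated annotation columns, plus XML-like structural tags on their own lines. A minimal Python sketch of reading such a file (the exact column set — word, lemma, part of speech here — is an assumption; it depends on the SpaCy NLP Pipeline version used):

```python
# Minimal sketch of parsing a VRT-style file: one token per line with
# tab-separated columns, and XML-like structural tags (<text>, <s>) on
# their own lines. The column layout (word, lemma, POS) is an assumption
# for illustration; real output depends on the NLP pipeline version.
sample_vrt = """\
<text id="doc1">
<s>
Nopaque\tNopaque\tNE
is\tbe\tVBZ
modular\tmodular\tJJ
</s>
</text>"""

tokens = []
for line in sample_vrt.splitlines():
    if line.startswith("<"):  # structural tag, not a token line
        continue
    word, lemma, pos = line.split("\t")
    tokens.append((word, lemma, pos))

print(tokens)
```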
@@ -74,3 +89,9 @@ visually as plain text with the option of highlighted entities or as chips.</p>
 Here, you can filter out text parameters and structural attributes in different
 combinations. This is explained in more detail in the Query Builder section of the
 manual.</p>
+
+<br>
+<br>
+*For all services, it is recommended to use the latest version unless you need a model
+only available in an earlier version or are looking to reproduce data that was originally generated
+using an older version.
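The filters described in this hunk are expressed as CQP queries under the hood. A small illustration of what such queries look like, stored as Python strings (the attribute names `word`, `lemma`, `pos` and the Penn-style tag values are assumptions — the real names depend on how the corpus was annotated):

```python
# Illustrative CQP queries (CWB CQP Query Language) as Python strings.
# Attribute names ("word", "lemma", "pos") and tag values are assumptions;
# the actual names depend on the corpus annotation.
queries = {
    # any form of the lemma "go" followed by a preposition
    "verb_plus_preposition": '[lemma="go"] [pos="IN"]',
    # an adjective immediately before the word "corpus"
    "adjective_before_corpus": '[pos="JJ"] [word="corpus"]',
}

for name, query in queries.items():
    print(f"{name}: {query}")
```

Each `[...]` block matches one token, so queries compose left to right, which is exactly what the Query Builder assembles from the selected filters.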
@@ -16,7 +16,7 @@
 A <b>job</b> is an initiated file processing procedure.
 A <b>model</b> is a mathematical system for pattern recognition based on data examples that have been processed by AI. One can search for jobs as
 well as corpus listings using the search field displayed above them on the dashboard.
-Models can be found and edited by clicking on the corresponding service under <b>My Contributions</b>.
+Uploaded models can be found and edited by clicking on the corresponding service under <b>My Contributions</b>.
 </p>
 </div>
 <div class="col s12"> </div>
@@ -7,48 +7,95 @@
 </div>
 <div class="col s12 m8">
 <p>
-Nopaque was designed to be modular. Its workflow consists of a sequence
-of services that can be applied at different starting and ending points.
-This allows you to proceed with your work flexibly.
-Each of these modules are implemented in a self-contained service, each of
-which represents a step in the workflow. The services are coordinated in
-such a way that they can be used consecutively. The order can either be
-taken from the listing of the services in the left sidebar or from the
-roadmap (accessible via the pink compass in the upper right corner). All
-services are versioned, so the data generated with nopaque is always
+Nopaque was designed to be modular. Its modules are implemented in
+self-contained <b>services</b>, each of which represents a step in the
+workflow. The typical workflow involves using services one after another,
+consecutively.
+The typical workflow order can be taken from the listing of the
+services in the left sidebar or from the nopaque manual (accessible via the pink
+button in the upper right corner).
+The services can also be applied at different starting and ending points,
+which allows you to conduct your work flexibly.
+All services are versioned, so the data generated with nopaque is always
 reproducible.
+
+<p>For all services, it is recommended to use the latest version (selected
+in the drop-down menu on the service page) unless you need a model
+only available in an earlier version or are looking to reproduce data that was originally generated
+using an older version.</p>
 </p>
 </div>
 </div>

-<h4 class="manual-chapter-title">File Setup</h4>
+
+
+<h4>File Setup</h4>
 <p>
 The <a href="{{ url_for('services.file_setup_pipeline') }}">File Setup Service</a> bundles image data, such as scans and photos,
 together in a handy PDF file. To use this service, use the job form to
 select the images to be bundled, choose the desired service version, and
-specify a title and description. Please note that the service sorts the
-images into the resulting PDF file based on the file names. So naming the
-images correctly is of great importance. It has proven to be a good practice
-to name the files according to the following scheme:
-page-01.png, page-02.jpg, page-03.tiff, etc. In general, you can assume
+specify a title and description.
+Note that the File Setup service will sort the images based on their file name in
+ascending order. It is thus important and highly recommended to name
+them accordingly, for example:
+page-01.png, page-02.jpg, page-03.tiff. Generally, you can assume
 that the images will be sorted in the order in which the file explorer of
 your operating system lists them when you view the files in a folder
 sorted in ascending order by file name.
 </p>

 <h4>Optical Character Recognition (OCR)</h4>
-<p>Coming soon...</p>
+<p>
+The <a href="{{ url_for('services.tesseract_ocr_pipeline') }}">Tesseract OCR Pipeline</a>
+converts image data - like photos and scans - into text data, making them machine-readable.
+This step enables you to proceed with the computational analysis of your documents.
+To use this service, use the job form to select the file you want to convert, choose
+the desired language model and service version, enter the title and description, and
+submit your job. The results can be found and downloaded below, under "Results."
+</p>

 <h4>Handwritten Text Recognition (HTR)</h4>
-<p>Coming soon...</p>
+<p>The Transkribus HTR Pipeline is currently
+deactivated. We are working on an alternative solution. In the meantime, you can
+try using Tesseract OCR, though the results will likely be poor.</p>

 <h4>Natural Language Processing (NLP)</h4>
-<p>Coming soon...</p>
+<p>The <a href="{{ url_for('services.spacy_nlp_pipeline') }}">SpaCy NLP Pipeline</a> extracts
+information from plain text files (.txt format) via computational linguistic data processing
+(tokenization, lemmatization, part-of-speech tagging and named-entity recognition).
+To use this service, select the corresponding .txt file, the language model, and the
+version you want to use. When the job is finished, find and download the files in
+<b>.json</b> and <b>.vrt</b> format under “Results.”</p>

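As a toy illustration of the first NLP step named in this hunk (tokenization; in nopaque the real work is done by spaCy language models, so this regex sketch is only a rough approximation):

```python
import re

# Toy tokenizer: splits words and punctuation. Real tokenization in
# nopaque is performed by spaCy models; this is only illustrative.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Nopaque extracts information."))
# -> ['Nopaque', 'extracts', 'information', '.']
```

The later steps (lemmatization, part-of-speech tagging, named-entity recognition) attach further annotations to each of these tokens, which is what ends up in the .vrt output columns.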
 <h4>Corpus Analysis</h4>
 <p>
-With the corpus analysis service, it is possible to create a text corpus
-and then explore it in an analysis session. The analysis session is realized
+With the <a href="{{ url_for('services.corpus_analysis') }}">Corpus Analysis</a>
+service, it is possible to create a text corpus
+and then explore it with analytical tools. The analysis session is realized
 on the server side by the Open Corpus Workbench software, which enables
-efficient and complex searches with the help of the CQP Query Language.
+efficient and complex searches with the help of the CQP Query Language.</p>
+<p>
+To use this service, navigate to the corpus you would like to analyze and click on the Analyze button.
+This will take you to an analysis overview page for your corpus. Here, you can find
+a visualization of general linguistic information of your corpus, including tokens,
+sentences, unique words, unique lemmas, unique parts of speech and unique simple
+parts of speech. You will also find a pie chart of the proportional textual makeup
+of your corpus and can view the linguistic information for each individual text file.
+A more detailed visualization of token frequencies with a search option is also on
+this page.
+</p>
+<p>
+From the corpus analysis overview page, you can navigate to other analysis modules:
+the Query Builder (under Concordance) and the Reader. With the Reader, you can read
+your corpus texts tokenized with the associated linguistic information. The tokens
+can be shown as lemmas, parts of speech, words, and can be displayed in different
+ways: visually as plain text with the option of highlighted entities or as chips.
+</p>
+<p>
+The Concordance module allows for more specific, query-oriented text analyses.
+Here, you can filter out text parameters and structural attributes in different
+combinations. This is explained in more detail in the Query Builder section of the
+manual.
+</p>
 </p>