https://github.com/proycon/foliadocserve
FoLiA Document Server - HTTP webservice backend for serving and annotating FoLiA documents using the FoLiA Query Language (FQL). Used by FLAT.
- Host: GitHub
- URL: https://github.com/proycon/foliadocserve
- Owner: proycon
- License: gpl-3.0
- Created: 2015-02-12T10:07:05.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2025-01-15T10:20:39.000Z (9 months ago)
- Last Synced: 2025-04-22T10:21:11.126Z (6 months ago)
- Topics: document-server, folia, nlp, python
- Language: Python
- Homepage:
- Size: 406 KB
- Stars: 6
- Watchers: 4
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.rst
- License: LICENSE
.. image:: http://applejack.science.ru.nl/lamabadge.php/foliadocserve
   :target: http://applejack.science.ru.nl/languagemachines/

*****************************************
FoLiA Document Server
*****************************************

The FoLiA Document Server is a backend HTTP service to interact with documents
in the FoLiA format, a rich XML-based format for linguistic annotation
(http://proycon.github.io/folia). It provides an interface to efficiently edit
FoLiA documents through the FoLiA Query Language (FQL). However, it is not
designed as a multi-document search tool.

The FoLiA Document Server is used by FLAT (https://github.com/proycon/flat).

The FoLiA Document Server is written in Python 3, using the FoLiA library in
pynlpl and cherrypy.

============================================
Architecture
============================================

The FoLiA Document Server consists of a document store that groups documents
into namespaces; a namespace can correspond, for instance, to a user ID or a
project.

Documents are automatically loaded and unloaded as they are requested and
expire. Loaded documents are kept fully in memory to facilitate rapid access
and are serialised back to XML files on disk when unloaded.

The document server is a webservice that receives requests over HTTP. Requests
interacting with a FoLiA document consist of statements in FoLiA Query Language
(FQL). For some uses the Corpus Query Language (CQL) is also supported.
Responses are FoLiA XML or parsed into JSON (may contain HTML excerpts too), as
requested in the FQL queries themselves.

Features:

* webservice
* queries using FQL, or alternatively CQL (limited)
* multiple return formats (FoLiA XML, JSON, FLAT)
* version control support using git
* full support for corrections and alternatives
* support for concurrency

Note that this webservice is *NOT* intended to be publicly exposed, but rather
to be used as a back-end by another system. The document server does support
constraining namespaces to certain session ids, constraining FQL queries to not
violate their namespace, and constraining uploads by session id or namespace.
This is secure for public exposure only when explicitly enabled and used over
HTTPS.

If you are looking for a command line tool that interprets FQL/CQL and queries
FoLiA documents, use the ``foliaquery`` tool from the FoLiA-tools package
rather than this document server, see https://github.com/proycon/folia

=======================
Installation & Usage
=======================

You can directly fetch the document server from the Python Package Index::

    $ pip install foliadocserve

Alternatively, install manually from the git repository or downloaded tarball::

    $ python setup.py install

You may need to use ``sudo`` for global installation.

Create a writable directory to hold documents, this is the document root path. Then
start the document server as follows::

    $ foliadocserve -d /path/to/document/root

See ``-h`` for further options.

When started, a simple web-interface will be available on the specified host and port.

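A client needs nothing more than plain HTTP to talk to the running server. The
following is a minimal sketch using only Python's standard library; it is not
part of foliadocserve. The helper name ``build_query_request`` and the address
``http://localhost:8080`` are assumptions made for illustration, while the
``/query/`` path, the ``X-sessionid`` header and the one-statement-per-line body
follow the webservice specification in the next section.

```python
import urllib.request


def build_query_request(base_url, statements, sessionid=None):
    """Build (but do not send) a POST request for the /query/ endpoint.

    base_url is a hypothetical server address such as http://localhost:8080.
    """
    # Each FQL statement must occupy exactly one line of the request body,
    # so line breaks used for readability are folded into single spaces.
    lines = [
        " ".join(part.strip() for part in s.splitlines() if part.strip())
        for s in statements
    ]
    body = "\n".join(lines).encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
    }
    if sessionid is not None:
        headers["X-sessionid"] = sessionid
    return urllib.request.Request(
        base_url.rstrip("/") + "/query/",
        data=body,
        headers=headers,
        method="POST",
    )


request = build_query_request(
    "http://localhost:8080",
    ["SELECT pos\n    FOR w", "SELECT ALL"],
    sessionid="NOSID",
)
```

Passing the result to ``urllib.request.urlopen`` sends it. A query error is
reported as HTTP 404, which ``urlopen`` raises as ``urllib.error.HTTPError``.
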
=========================================
Webservice Specification
=========================================

Common variables in request URLs:

* **namespace** - A group identifier
* **docid** - The FoLiA document ID
* **sessionid** - A session ID, can be set to ``NOSID`` if no sessioning is
  desired. Usage of session IDs enables functionality such as caching and
  concurrency.

---------------------------
Querying & Annotating
---------------------------

* ``/query/`` (POST) - Content body consists of FQL queries, one per line (text/plain). The request header may contain ``X-sessionid`` and must contain ``Content-Length``.
* ``/query/?query=`` (GET) -- HTTP GET alias for the above, limited to a single query

These URLs will return HTTP 200 OK, with data in the format as requested in the FQL
query if the query is successful. If the query contains an error, an HTTP 404 response
will be returned.

-------------
Versioning
-------------

* ``/getdochistory/<namespace>/<docid>`` (GET) - Obtain the git history for the specified document. Returns a JSON response: ``{'history':[ {'commit': commithash, 'msg': commitmessage, 'date': commitdate } ] }``
* ``/revert/<namespace>/<docid>/<commithash>`` (GET) - Revert the document's state to the specified commit hash

---------------------------
Document Management
---------------------------

* ``/namespaces/`` (GET) -- List of all the namespaces
* ``/documents/<namespace>/`` (GET) -- Document Index for the given namespace (JSON list)
* ``/upload/<namespace>/`` (POST) -- Uploads a FoLiA XML document to a namespace, request body contains FoLiA XML.
* ``/create/<namespace>/`` (POST) -- Create a new namespace

========================================
FoLiA Query Language (FQL)
========================================

FQL statements are separated by newlines and encoded in UTF-8. The expressions
are case sensitive, all keywords are in upper case, all element names and
attributes in lower case.

FQL is also strict about parentheses: they are generally either required or forbidden
for an expression. Parentheses usually indicate a sub-expression, and they are also used in
boolean logic.

As a general rule, it is more efficient to do a single big query than multiple
standalone queries.

Note that for readability, queries may have been split over multiple lines
in the presentation here, whereas in reality they should be on one.

-------------------
Global variables
-------------------

* ``SET <variable>=<value>`` - Sets global variables that apply to all statements that follow. String values need to be in double quotes. Available variables are:

  * **annotator** - The name of the annotator
  * **annotatortype** - The type of the annotator, can be *auto* or *manual*

Usually your queries on a particular annotation type are limited to one
specific set. To prevent having to enter the set explicitly in your queries,
you can set defaults. The annotation type corresponds to a FoLiA element::

    DEFAULTSET entity https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/namedentitycorrection.foliaset.xml

If the FoLiA document only has one set of that type anyway, then this is not even
necessary and the default will be automatically set.

-------------------
Document Selection
-------------------

FQL statements for the document server start with a document selector, represented by the
keyword **USE**::

    USE <namespace>/<docid>

This selects what document to apply the query to; the document will be
automatically loaded and unloaded by the server as it sees fit. It can be
prepended to any action query or used standalone, in which case it will apply to
all subsequent queries.

Alternatively, the **LOAD** statement loads an arbitrary file from disk, but its use
is restricted to the command line ``foliaquery`` tool rather than this document server::

    LOAD <filename>

If you're interested in retrieving the full document rather than doing specific querying, use
the ``GET`` statement immediately after a ``USE`` or ``LOAD`` expression.

-----------------
Declarations
-----------------

All annotation types in FoLiA need to be declared. FQL does this for you
automatically. If you make an edit of a previously undeclared set, it will be
declared for you. These default declarations will never assign default
annotators or annotator types.

Explicit declarations are possible using the ``DECLARE`` keyword followed by
the annotation type you want to declare; this represents the tag of the
respective FoLiA annotation element::

    DECLARE entity OF "https://github.com/proycon/folia/blob/master/setdefinitions/namedentities.foliaset.xml"
    WITH annotator = "me" annotatortype = "manual"

Note that the statement must be on one single line; it is split here only for ease of
presentation.

The **WITH** clause is optional, the set following the **OF** keyword is mandatory.
Declarations may be chained, i.e. multiple **DECLARE** statements may be issued
on one line, as well as prepended to action statements (see next section).

---------
Actions
---------

The core part of an FQL statement consists of an action verb; the following are
available:

* ``SELECT <focus expression> [<target expression>]`` - Selects an annotation
* ``DELETE <focus expression> [<target expression>]`` - Deletes an annotation
* ``EDIT <focus expression> [<assignment expression>] [<target expression>]`` - Edits an existing annotation
* ``ADD <focus expression> <assignment expression> <target expression>`` - Adds an annotation (to the target expression)
* ``APPEND <focus expression> <assignment expression> <target expression>`` - Inserts an annotation after the target expression
* ``PREPEND <focus expression> <assignment expression> <target expression>`` - Inserts an annotation before the target expression

Following the action verb is the focus expression. It starts with an
annotation type, which is equal to the FoLiA XML element tag. The set is
specified using ``OF <set>`` and/or the ID with ``ID <id>``. An example::

    pos OF "http://some.domain/some.folia.set.xml"

If an annotation type is already declared and there is only one in the document, or
if the **DEFAULTSET** statement was used earlier, then the **OF** statement can
be omitted and will be implied and detected automatically. If it is ambiguous,
an error will be raised (rather than applying the query regardless of set).

To further filter the focus, the expression may contain a **WHERE** clause
that filters on one or more FoLiA attributes:

* **class**
* **annotator**
* **annotatortype**
* **n**
* **confidence**

The following attribute is also available when the element contains text:

* **text**

The **WHERE** statement requires an operator (=,!=,>,<,<=,>=,CONTAINS,MATCHES); the **AND**,
**OR** and **NOT** operators are available (along with parentheses) for
grouping and boolean logic. The operators must never be glued to the attribute
name or the value, but have spaces left and right.

We can now show some examples of full queries with some operators:

* ``SELECT pos OF "http://some.domain/some.folia.set.xml"``
* ``SELECT pos WHERE class = "n" AND annotator = "johndoe"``
* ``DELETE pos WHERE class = "n" AND annotator != "johndoe"``
* ``DELETE pos WHERE class = "n" AND annotator CONTAINS "john"``
* ``DELETE pos WHERE class = "n" AND annotator MATCHES "^john$"``

The **ADD** and **EDIT** actions change actual attributes; this is done in the
*assignment expression* that starts with the **WITH** keyword. It applies to
all the common FoLiA attributes like the **WHERE** keyword, but has no operator or
boolean logic, as it is a pure assignment function.

SELECT and DELETE only support WHERE, EDIT supports both WHERE and WITH; if
both are used, then WHERE always comes before WITH. The ADD action supports only WITH. If
an EDIT is done on an annotation that cannot be found, and there is no WHERE
clause, then it will fall back to ADD.

Here is an **EDIT** query that changes all nouns in the document to verbs::

    EDIT pos WHERE class = "n" WITH class "v" AND annotator = "johndoe"

The query is fairly crude as it still lacks a *target expression*: A *target
expression* determines what elements the focus is applied to, rather than to
the document as a whole, it starts with the keyword **FOR** and is followed by
either an annotation type (i.e. a FoLiA XML element tag) *or* the ID of an
element. The target expression also determines what elements will be returned.
More on this in a later section.

The following FQL query shows how to get the part of speech tag for a
word::

    SELECT pos FOR ID mydocument.word.3

Or for all words::

    SELECT pos FOR w

The **ADD** action almost always requires a target expression::

    ADD pos WITH class "n" FOR ID mydocument.word.3

Multiple targets may be specified, comma delimited::

    ADD pos WITH class "n" FOR ID mydocument.word.3 , ID myword.document.word.25

The target expression can again contain a **WHERE** filter::

    SELECT pos FOR w WHERE class != "PUNCT"

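The spacing and quoting rules for **WHERE** filters, whether on the focus or on
a target as above, are easy to get wrong when queries are assembled in code. The
following Python sketch shows one way to enforce them. The function names are
hypothetical, and rejecting values that contain a double quote is a conservative
assumption, since escaping is not covered in this document.

```python
# Comparison operators listed for the WHERE statement.
OPERATORS = {"=", "!=", ">", "<", "<=", ">=", "CONTAINS", "MATCHES"}


def where_condition(attribute, operator, value):
    """Format one condition with spaces on both sides of the operator."""
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    if isinstance(value, str):
        # Assumption: no escaping mechanism is known, so refuse embedded quotes.
        if '"' in value:
            raise ValueError("double quotes inside values are not supported")
        literal = f'"{value}"'
    else:
        # Numbers, e.g. a confidence threshold, are written without quotes.
        literal = str(value)
    return f"{attribute} {operator} {literal}"


def where_clause(conditions, joiner="AND"):
    """Join conditions into a WHERE clause with AND or OR."""
    if joiner not in ("AND", "OR"):
        raise ValueError("joiner must be AND or OR")
    if not conditions:
        raise ValueError("at least one condition is required")
    return "WHERE " + f" {joiner} ".join(conditions)


query = "SELECT pos FOR w " + where_clause([where_condition("class", "!=", "PUNCT")])
```
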
Target expressions, starting with the **FOR** keyword, can be nested::

    SELECT pos FOR w WHERE class != "PUNCT" FOR event WHERE class = "tweet"

You may also use the SELECT keyword without a focus expression, but only with a target expression. This is particularly useful when you want to return multiple distinct elements, for instance by ID::

    SELECT FOR ID mydocument.word.3 , ID myword.document.word.25

The **SELECT** keyword can also be used with the special **ALL** selector that selects all elements in the scope; the following two statements are identical and will return all elements in the document::

    SELECT ALL
    SELECT FOR ALL

It can be used at deeper levels too; the following will return everything under all words::

    SELECT ALL FOR w

Target expressions are vital for span annotation. The keyword **SPAN** indicates
that the target is a span (to do multiple spans at once, repeat the SPAN
keyword again), the operator ``&`` is used for consecutive spans, whereas ``,``
is used for disjoint spans::

    ADD entity WITH class "person" FOR SPAN ID mydocument.word.3 & ID myword.document.word.25

This works with filters too; the ``&`` operator enforces a single consecutive span::

    ADD entity WITH class "person" FOR SPAN w WHERE text = "John" & w WHERE text = "Doe"

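When the word IDs come from application code, the span target can be assembled
mechanically. The sketch below is a hypothetical Python helper (``span_target``
is not part of any FoLiA library) that joins the IDs of one consecutive span
with ``&`` and repeats the **SPAN** keyword for each further span, as described
above.

```python
def span_target(*groups):
    """Build a FOR SPAN target expression from groups of element IDs.

    Each group is a list of IDs forming one consecutive span.
    """
    if not groups or any(len(group) == 0 for group in groups):
        raise ValueError("every span needs at least one ID")
    spans = [
        "SPAN " + " & ".join("ID " + element_id for element_id in group)
        for group in groups
    ]
    return "FOR " + " ".join(spans)


# One entity spanning two consecutive words (IDs are made up for illustration).
query = 'ADD entity WITH class "person" ' + span_target(
    ["mydocument.word.3", "mydocument.word.4"]
)
```
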
Remember we can do multiple at once::

    ADD entity WITH class "person" FOR SPAN w WHERE text = "John" & w WHERE text = "Doe"
    SPAN w WHERE text = "Jane" & w WHERE text = "Doe"

The **HAS** keyword enables you to descend down in the document tree to
siblings. Consider the following example that changes the part of speech tag
to "verb", for all occurrences of words that have lemma "fly". The parentheses
are mandatory for a **HAS** statement::

    EDIT pos OF "someposset" WITH class = "v" FOR w WHERE (lemma OF "somelemmaset" HAS class "fly")

Target expressions can be formed with either **FOR** or with **IN**. The
difference is that **IN** is much stricter: the element has to be a direct
child of the element in the **IN** statement, whereas **FOR** may skip
intermediate elements. In analogy with XPath, **FOR** corresponds to ``//`` and
**IN** corresponds to ``/``. **FOR** and **IN** may be nested and mixed at
will. The following query would most likely not yield any results because there are
likely to be paragraphs and/or sentences between the word and event structures::

    SELECT pos FOR w WHERE class != "PUNCT" IN event WHERE class = "tweet"

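The XPath analogy can be tried out directly with Python's standard
``xml.etree.ElementTree`` module. The snippet below uses a toy tree, not real
FoLiA, in which the words sit two levels below the event. It only illustrates
the difference between descendant and direct-child matching and does not involve
FQL or the document server.

```python
import xml.etree.ElementTree as ET

# Toy structure (an assumption for illustration): event > p > s > w
event = ET.fromstring("<event><p><s><w>the</w><w>house</w></s></p></event>")

# ".//w" matches w elements at any depth below the event, like FOR.
descendants = event.findall(".//w")

# "./w" matches only direct children of the event, like IN.
children = event.findall("./w")

print(len(descendants), len(children))  # prints: 2 0
```
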
Multiple actions can be combined, all share the same target expressions::

    ADD pos WITH class "n" ADD lemma WITH class "house" FOR w WHERE text = "house" OR text = "houses"

It is also possible to nest actions. Use parentheses for this; the nesting
occurs after any WHERE and WITH statements::

    ADD w ID mydoc.sentence.1.word.1 (ADD t WITH text "house" ADD pos WITH class "n") FOR ID mydoc.sentence.1

Though explicitly specified here, IDs will be automatically generated when necessary and not specified.

The **ADD** action has two cousins: **APPEND** and **PREPEND**.
Instead of adding something in the scope of the target expression, they either append
or prepend an element, so the inserted element will be a sibling::

    APPEND w (ADD t WITH text "house") FOR w WHERE text = "the"

The above query appends/inserts the word "house" after every definite article.

---------
Text
---------

Our previous examples mostly focussed on part of speech annotation. In this
section we look at text content, which in FoLiA is an annotation element too
(t).

Here we change the text of a word::

    EDIT t WITH text = "house" FOR ID mydoc.word.45

Here we edit or add (recall that EDIT falls back to ADD when not found and
there is no further selector) a lemma and check on text content::

    EDIT lemma WITH class "house" FOR w WHERE text = "house" OR text = "houses"

You can use WHERE text on all elements, it will cover both explicit text
content as well as implicit text content, i.e. inferred from child elements. If
you want to be really explicit you can do::

    EDIT lemma WITH class "house" FOR w WHERE (t HAS text = "house")

**Advanced**:

Such syntax is required when covering texts with custom classes, such as
OCRed or otherwise pre-normalised text. Consider the following OCR correction::

    ADD t WITH text = "spell" FOR w WHERE (t HAS text = "5pe11" AND class = "OCR" )

---------------
Query Response
---------------

We have shown how to do queries but not yet said anything on how the response is
returned. This is regulated using the **RETURN** keyword:

* **RETURN focus** (default)
* **RETURN parent** - Returns the parent of the focus
* **RETURN target** or **RETURN inner-target**
* **RETURN outer-target**
* **RETURN ancestor-target**

The default focus mode just returns the focus. Sometimes, however, you may want
more context and may want to return the target expression instead. In the
following example returning only the pos-tag would not be so interesting, you
are most likely interested in the word to which it applies::

    SELECT pos WHERE class = "n" FOR w RETURN target

When there are nested FOR/IN loops, you can specify whether you want to return
the inner one (highest granularity, default) or the outer one (widest scope).
You can also decide to return the first common structural ancestor of the
(outer) targets, which may be especially useful in combination with the **SPAN**
keyword.

The return type can be set using the **FORMAT** statement:

* **FORMAT xml** - Returns FoLiA XML, the response is contained in a simple
  wrapper structure.
* **FORMAT single-xml** - Like above, but returns pure unwrapped FoLiA XML and
  therefore only works if the response only contains one element. An error
  will be raised otherwise.
* **FORMAT json** - Returns a JSON list
* **FORMAT single-json** - Like above, but returns a single element rather than
  a list. An error will be raised if the response contains multiple elements.
* **FORMAT python** - Returns a Python object, can only be used when
  directly querying the FQL library without the document server
* **FORMAT flat** - Returns a parsed format optimised for FLAT. This is a JSON reply
  containing an HTML skeleton of structure elements (key html) and parsed annotations
  (key annotations). If the query returns a full FoLiA document, then the JSON object
  will also include parsed set definitions (key setdefinitions) and declarations.

The **RETURN** statement may be used standalone or appended to a query, in
which case it applies to all subsequent queries. The same applies to the
**FORMAT** statement, though an error will be raised if distinct formats are
requested in the same HTTP request.

When context is returned in *target* mode, this can get quite big. You may
constrain the type of elements returned by using the **REQUEST** keyword; it
takes the names of FoLiA XML elements. It can be used standalone so it applies
to all subsequent queries::

    REQUEST w,t,pos,lemma

Or after a query::

    SELECT pos FOR w WHERE class != "PUNCT" FOR event WHERE class = "tweet" REQUEST w,pos,lemma

Two special uses of request are ``REQUEST ALL`` (default) and ``REQUEST
NOTHING``. The latter may be useful in combination with **ADD**, **EDIT** and
**DELETE**, as by default the updated state of the document is returned.

Note that if you set REQUEST wrong you may quickly end up with empty results.

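Since the response options are plain keywords appended to a query, client code
can validate them before anything is sent. The Python sketch below does this for
the **RETURN**, **FORMAT** and **REQUEST** values listed above. The function name
is hypothetical, and the relative order in which the three clauses are emitted
is an assumption: the examples in this section only show each clause appended on
its own.

```python
RETURN_MODES = {
    "focus", "parent", "target", "inner-target", "outer-target", "ancestor-target",
}
FORMATS = {"xml", "single-xml", "json", "single-json", "python", "flat"}


def with_response_options(query, return_mode=None, fmt=None, request=None):
    """Append RETURN, FORMAT and REQUEST clauses to an FQL query string."""
    parts = [query.strip()]
    if return_mode is not None:
        if return_mode not in RETURN_MODES:
            raise ValueError(f"unknown RETURN mode: {return_mode}")
        parts.append(f"RETURN {return_mode}")
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError(f"unknown FORMAT: {fmt}")
        parts.append(f"FORMAT {fmt}")
    if request is not None:
        # Element names are comma separated, as in "REQUEST w,t,pos,lemma".
        parts.append("REQUEST " + ",".join(request))
    return " ".join(parts)


query = with_response_options(
    'SELECT pos WHERE class = "n" FOR w',
    return_mode="target",
    fmt="json",
    request=["w", "pos"],
)
```
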
---------------------
Span Annotation
---------------------

Selecting span annotations is identical to token annotation. You may be aware
that in FoLiA span annotation elements are technically stored in separate
stand-off layers, but you can forget this fact when composing FQL queries and can
access them right from the elements they apply to.

The following query selects all named entities (of an actual rather than a
fictitious set for a change) of people that have the name John::

    SELECT entity OF "https://github.com/proycon/folia/blob/master/setdefinitions/namedentities.foliaset.xml"
    WHERE class = "person" FOR w WHERE text = "John"

Or consider the selection of noun-phrase syntactic units (su) that contain the
word house::

    SELECT su WHERE class = "np" FOR w WHERE text CONTAINS "house"

Note that if the **SPAN** keyword were used here, the selection would be
exclusively constrained to single words "John"::

    SELECT entity WHERE class = "person" FOR SPAN w WHERE text = "John"

We can use that construct to select all people named John Doe for instance::

    SELECT entity WHERE class = "person" FOR SPAN w WHERE text = "John" & w WHERE text = "Doe"

Span annotations like syntactic units are typically nested trees; a tree query
such as "//pp/np/adj" can be represented as follows. Recall that the **IN**
statement starts a target expression like **FOR**, but is stricter on the
hierarchy, which is what we would want here::

    SELECT su WHERE class = "adj" IN su WHERE class = "np" IN su WHERE class = "pp"

In such instances we may be most interested in obtaining the full PP::

    SELECT su WHERE class = "adj" IN su WHERE class = "np" IN su WHERE class = "pp" RETURN outer-target

The **EDIT** action is not limited to editing attributes; sometimes you
want to alter the elements that a span covers. The separate **RESPAN** keyword (used without
FOR/IN/WITH) accomplishes this. It behaves the
same as a **FOR SPAN** target expression and represents the new scope of the
span, whereas the normal target expression represents the old scope::

    EDIT entity WHERE class = "person" RESPAN ID word.1 & ID word.2 FOR SPAN ID word.1 & ID word.2 & ID word.3

**WITH** statements can still be used too; they always precede **RESPAN**::

    EDIT entity WHERE class = "person" WITH class = "location" RESPAN ID word.1 & ID word.2 FOR SPAN ID word.1 & ID word.2 & ID word.3

------------------------------
Corrections and Alternatives
------------------------------

Both FoLiA and FQL have explicit support for corrections and alternatives on
annotations. A correction is not a blunt substitute of an annotation of any
type, but the original is preserved as well. Similarly, an alternative
annotation is one that exists alongside the actual annotation of the same type
and set, and is not authoritative.

The following example is a correction but not in the FoLiA sense; it bluntly changes the part-of-speech
annotation of all occurrences of the word "fly" from "n" to "v", for example to
correct erroneous tagger output::

    EDIT pos WITH class "v" WHERE class = "n" FOR w WHERE text = "fly"

Now we do the same but as an explicit correction::

    EDIT pos WITH class "v" WHERE class = "n" (AS CORRECTION OF "some/correctionset" WITH class "wrongpos")
    FOR w WHERE text = "fly"

Another example, in a spelling correction context: we correct the misspelling
*concous* to *conscious*::

    EDIT t WITH text "conscious" (AS CORRECTION OF "some/correctionset" WITH class "spellingerror")
    FOR w WHERE text = "concous"

The **AS CORRECTION** keyword (always in a separate block within parentheses) is used to
initiate a correction. The correction is itself part of a set with a class that
indicates the type of correction.

Alternatives are simpler, but follow the same principle::

    EDIT pos WITH class "v" WHERE class = "n" (AS ALTERNATIVE) FOR w WHERE text = "fly"

Confidence scores are often associated with alternatives::

    EDIT pos WITH class "v" WHERE class = "n" (AS ALTERNATIVE WITH confidence 0.6)
    FOR w WHERE text = "fly"

The **AS** clause is also used to select alternatives rather than the
authoritative form; this will get all alternative pos tags for words with the
text "fly"::

    SELECT pos (AS ALTERNATIVE) FOR w WHERE text = "fly"

If you want the authoritative tag as well, you can chain the actions. The
same target expression (FOR..) always applies to all chained actions, but the AS clause
applies only to the action in the scope of which it appears::

    SELECT pos SELECT pos (AS ALTERNATIVE) FOR w WHERE text = "fly"

Filters on the alternatives themselves may be applied as expected using the WHERE clause::

    SELECT pos (AS ALTERNATIVE WHERE confidence > 0.6) FOR w WHERE text = "fly"

Note that filtering on the attributes of the annotation itself is outside of the scope of
the AS clause::

    SELECT pos WHERE class = "n" (AS ALTERNATIVE WHERE confidence > 0.6) FOR w WHERE text = "fly"

Corrections by definition are authoritative, so no special syntax is needed to
obtain them. Assuming the part of speech tag is corrected, this will
correctly obtain it, no AS clause is necessary::

    SELECT pos FOR w WHERE text = "fly"

Adding **AS CORRECTION** restricts the results to those that were actually
corrected::

    SELECT pos (AS CORRECTION) FOR w WHERE text = "fly"

However, if you want to obtain the original prior to correction, you can do so
using **AS CORRECTION ORIGINAL**::

    SELECT pos (AS CORRECTION ORIGINAL) FOR w WHERE text = "fly"

FoLiA does not just distinguish corrections, but also supports suggestions for
correction. Envision a spelling checker suggesting output for misspelled
words, but leaving it up to the user which of the suggestions to accept.
Suggestions are not authoritative and can be obtained in a similar fashion
by using the **SUGGESTION** keyword::

    SELECT pos (AS CORRECTION SUGGESTION) FOR w WHERE text = "fly"

Note that **AS CORRECTION** may take the **OF** keyword to
specify the correction set; it may also take a **WHERE** clause to filter::

    SELECT t (AS CORRECTION OF "some/correctionset" WHERE class = "confusible") FOR w

The **SUGGESTION** keyword can take a WHERE filter too::

    SELECT t (AS CORRECTION OF "some/correctionset" WHERE class = "confusible" SUGGESTION WHERE confidence > 0.5) FOR w

To add a suggestion for correction rather than an actual authoritative
correction, you can do::

    EDIT pos (AS CORRECTION OF "some/correctionset" WITH class "poscorrection" SUGGESTION class "n") FOR w ID some.word.1

The absence of a WITH statement in the action clause indicates that this is purely a suggestion. The actual suggestion follows the **SUGGESTION** keyword.

Any attributes associated with the suggestion can be set with a **WITH** statement after the suggestion::

    EDIT pos (AS CORRECTION OF "some/correctionset" WITH class "poscorrection" SUGGESTION class "n" WITH confidence 0.8) FOR w ID some.word.1

Even if a **WITH** statement is present for the action, making it an actual
correction, you can still add suggestions::

    EDIT pos WITH class "v" (AS CORRECTION OF "some/correctionset" WITH class "poscorrection" SUGGESTION class "n" WITH confidence 0.8) FOR w ID some.word.1

The **SUGGESTION** keyword can be chained to add multiple suggestions at once::

    EDIT pos (AS CORRECTION OF "some/correctionset" WITH class "poscorrection"
    SUGGESTION class "n" WITH confidence 0.8
    SUGGESTION class "v" WITH confidence 0.2) FOR w ID some.word.1

Another example in a spelling correction context::

    EDIT t (AS CORRECTION OF "some/correctionset" WITH class "spellingerror"
    SUGGESTION text "conscious" WITH confidence 0.8 SUGGESTION text "couscous" WITH confidence 0.2)
    FOR w WHERE text = "concous"

A similar construction is available for alternatives as well. First we
establish that the following two statements are identical::

    EDIT pos WHERE class = "n" WITH class "v" (AS ALTERNATIVE WITH confidence 0.6) FOR w WHERE text = "fly"
    EDIT pos WHERE class = "n" (AS ALTERNATIVE class "v" WITH confidence 0.6) FOR w WHERE text = "fly"

Specifying multiple alternatives is then done by simply adding another
**ALTERNATIVE** clause::

    EDIT pos (AS ALTERNATIVE class "v" WITH confidence 0.6 ALTERNATIVE class "n" WITH confidence 0.4 ) FOR w WHERE text = "fly"

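A tagger that outputs several candidate classes with scores maps naturally onto
this chained form. The Python sketch below builds the parenthesised **AS** block
from a list of (class, confidence) pairs. The helper name is hypothetical, and
refusing classes that contain a double quote is a conservative assumption, since
escaping is not covered in this document.

```python
def alternatives_clause(candidates):
    """Build an (AS ALTERNATIVE ...) block from (class, confidence) pairs."""
    parts = []
    for cls, confidence in candidates:
        # Assumption: no escaping mechanism is known, so refuse embedded quotes.
        if '"' in cls:
            raise ValueError("double quotes inside classes are not supported")
        parts.append(f'ALTERNATIVE class "{cls}" WITH confidence {confidence}')
    if not parts:
        raise ValueError("at least one alternative is required")
    # Only the first ALTERNATIVE is preceded by AS; the rest are chained.
    return "(AS " + " ".join(parts) + ")"


clause = alternatives_clause([("v", 0.6), ("n", 0.4)])
query = f'EDIT pos {clause} FOR w WHERE text = "fly"'
```
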
When a correction is made on an element, all annotations below it (recursively) are left
intact, i.e. they are copied from the original element to the new corrected element. The
same applies to suggestions. Moreover, all references to the original element,
from for instance span annotation elements, will be made into references to the
new corrected elements.

This is not always what you want; if you want the correction not to have any
annotations inherited from the original, simply use **AS BARE CORRECTION** instead of **AS
CORRECTION**.

You can also use **AS CORRECTION** with **ADD** and **DELETE**.

The most complex kind of corrections are splits and merges. A split separates a
structure element such as a word into multiple elements; a merge unifies multiple
structure elements into one.

In FQL, this is achieved through substitution, using the action **SUBSTITUTE**::

    SUBSTITUTE w WITH text "together" FOR SPAN w WHERE text = "to" & w WHERE text = "gether"

Subactions are common with SUBSTITUTE, the following is equivalent to the above::

    SUBSTITUTE w (ADD t WITH text "together") FOR SPAN w WHERE text = "to" & w WHERE text = "gether"

To perform a split into multiple substitutes, simply chain the SUBSTITUTE
clause::

    SUBSTITUTE w WITH text "each" SUBSTITUTE w WITH text "other" FOR w WHERE text = "eachother"

Like **ADD**, **SUBSTITUTE** may take assignments (**WITH**), but no filters (**WHERE**).

You may have noticed that the merge and split examples were not corrections in
the FoLiA-sense; the originals are removed and not preserved. Let's make them
into proper corrections::

    SUBSTITUTE w WITH text "together"
    (AS CORRECTION OF "some/correctionset" WITH class "spliterror")
    FOR SPAN w WHERE text = "to" & w WHERE text = "gether"

And a split::

    SUBSTITUTE w WITH text "each" SUBSTITUTE w WITH text "other"
    (AS CORRECTION OF "some/correctionset" WITH class "runonerror")
    FOR w WHERE text = "eachother"

To make this into a suggestion for correction instead, use the **SUGGESTION**
keyword followed by **SUBSTITUTE**, inside the **AS** clause, where the chain
of substitute statements has to be enclosed in parentheses::

    SUBSTITUTE (AS CORRECTION OF "some/correctionset" WITH class "runonerror" SUGGESTION (SUBSTITUTE w WITH text "each" SUBSTITUTE w WITH text "other") )
    FOR w WHERE text = "eachother"

(Alternatively, you can use **ADD** instead of **SUBSTITUTE** after the **SUGGESTION** clause, which behaves identically.)

In FoLiA, suggestions for deletion are simply empty suggestions, and they are made using the **DELETION** keyword::

    SUBSTITUTE (AS CORRECTION OF "some/correctionset" WITH class "redundantword" SUGGESTION DELETION )
    FOR w WHERE text = "something"

Suggestions may indicate they modify the parent structure when applied. For
instance, a suggestion for removal of a redundant period is often also a
suggestion that the sentence should be merged. This is explicitly indicated in
FoLiA with a ``merge`` attribute on the suggestion, and in FQL with the
**MERGE** keyword immediately following **SUGGESTION**. An example::

    SUBSTITUTE (AS CORRECTION OF "some/correctionset" WITH class "redundantpunctuation" SUGGESTION MERGE DELETION )
    FOR w WHERE text = "."

The reverse situation would be insertion of a missing period, which is
generally also a suggestion to split the parent sentence. For this we use the
**SPLIT** keyword. Insertions are typically done using the **APPEND** or
**PREPEND** actions, as there is nothing to substitute::

    APPEND (AS CORRECTION OF "some/correctionset" WITH class "missingpunctuation" SUGGESTION SPLIT (ADD w WITH text ".") )
    FOR w WHERE text = "end"

Last, but not least, when deleting corrections explicitly, you may use the **RESTORE** keyword to restore the original.
Example::

    DELETE correction ID "some.correction" RESTORE ORIGINAL


-------------------------------
I can haz context plz?
-------------------------------

We've seen that with the **FOR** keyword we can move to bigger elements in the FoLiA
document, and with the **HAS** keyword we can move to siblings. There are
several *context keywords* that give us all the tools we need to peek at the
context. Like **HAS** expressions, these must always be enclosed in
parentheses.

For instance, consider a part-of-speech tagging scenario. If we have a word where
the left neighbour is a determiner, and the right neighbour a noun, we can be
pretty sure the word under our consideration (our target expression) is an
adjective. Let's add the pos tag::

    EDIT pos WITH class = "adj" FOR w WHERE (PREVIOUS w WHERE (pos HAS class == "det")) AND (NEXT w WHERE (pos HAS class == "n"))

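
The neighbour test in this query can be mimicked on a plain Python list to make
it explicit. This is only a sketch of what the query expresses, not how the FQL
library evaluates it; the token dictionaries and the function name
``tag_adjectives`` are assumptions made for illustration.

```python
def tag_adjectives(tokens):
    """Tag a token as "adj" when its left neighbour is a determiner
    and its right neighbour is a noun.

    `tokens` is a list of dicts with a "text" and a "pos" key; this
    structure is hypothetical and only stands in for FoLiA words.
    """
    # Collect the matching positions first, so that newly assigned tags
    # do not influence the neighbour tests of later tokens.
    matches = [
        i for i in range(1, len(tokens) - 1)
        if tokens[i - 1]["pos"] == "det" and tokens[i + 1]["pos"] == "n"
    ]
    for i in matches:
        tokens[i]["pos"] = "adj"
    return tokens


sentence = [
    {"text": "the", "pos": "det"},
    {"text": "big", "pos": None},
    {"text": "house", "pos": "n"},
]
tagged = tag_adjectives(sentence)
```

In this sketch the first and last tokens are never tagged, because they lack
one of the two neighbours the condition requires.
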
You may append a number directly to the **PREVIOUS**/**NEXT** modifier if
you're interested in further context, or you may use
**LEFTCONTEXT**/**RIGHTCONTEXT**/**CONTEXT** if you don't care at what position
something occurs::

    EDIT pos WITH class = "adj" FOR w WHERE (PREVIOUS2 w WHERE (pos HAS class == "det")) AND (PREVIOUS w WHERE (pos HAS class == "adj")) AND (RIGHTCONTEXT w WHERE (pos HAS class == "n"))

Instead of the **NEXT** and **PREVIOUS** keywords, a target expression can be used with the **SPAN** keyword and the **&** operator::

    SELECT FOR SPAN w WHERE text = "the" & w WHERE (pos HAS class == "adj") & w WHERE text = "house"

Within a **SPAN** keyword, an **expansion expression** can be used to select
any number, or a certain number, of elements. You can do this by appending
curly braces after the element name (but not attached to it) and specifying the
minimum and maximum number of elements. The following expression selects from
zero up to three adjectives between the words "the" and "house"::

    SELECT FOR SPAN w WHERE text = "the" & w {0,3} WHERE (pos HAS class == "adj") & w WHERE text = "house"

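
If you generate such queries from code, the expansion expression is easy to
assemble as a string. The following is a minimal sketch in Python; the helper
names ``expansion`` and ``span_query`` are hypothetical and not part of any
FoLiA library, and the helpers only build the query text without validating it.

```python
def expansion(minimum, maximum=None, open_ended=False):
    """Render an expansion expression such as {0,3}, {2} or {1,}."""
    if open_ended:
        return "{%d,}" % minimum           # at least `minimum`, no upper bound
    if maximum is None:
        return "{%d}" % minimum            # exactly `minimum` elements
    return "{%d,%d}" % (minimum, maximum)  # between `minimum` and `maximum`


def span_query(parts):
    """Join the parts of a span with the & operator (hypothetical helper)."""
    return "SELECT FOR SPAN " + " & ".join(parts)


query = span_query([
    'w WHERE text = "the"',
    # The braces follow the element name, separated from it by a space.
    "w " + expansion(0, 3) + ' WHERE (pos HAS class == "adj")',
    'w WHERE text = "house"',
])
```

Here ``query`` holds the same query text as the example above.
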
If you specify only a single number in the curly braces, it will require that
exact number of elements. To match at least one word up to an unlimited number,
use an expansion expression such as ``{1,}``.

If you are now tempted to use the FoLiA Document Server and FQL to search
through large corpora in real time, be advised that this is not a good idea. It
will be prohibitively slow on large datasets, as fast search requires smart
indexing, which this document server does not provide. You can therefore not do
this in real time, but perhaps only as a first step to build an actual search
index.

Other modifiers are **PARENT** and **ANCESTOR**. **PARENT** will go at most one
element up, whereas **ANCESTOR** will go on to the largest element::

    SELECT lemma FOR w WHERE (PARENT s WHERE text CONTAINS "wine")

Instead of **PARENT**, the use of a nested **FOR** is preferred and more efficient::

    SELECT lemma FOR w FOR s WHERE text CONTAINS "wine"

Let's revisit syntax trees for a bit now that we know how to obtain context. Imagine
we want an NP to the left of a PP::

    SELECT su WHERE class = "np" AND (NEXT su WHERE class = "pp")

... and where the whole thing is part of a VP::

    SELECT su WHERE class = "np" AND (NEXT su WHERE class = "pp") IN su WHERE class = "vp"

... and return that whole tree rather than just the NP we were looking for::

    SELECT su WHERE class = "np" AND (NEXT su WHERE class = "pp") IN su WHERE class = "vp" RETURN target


-------------------------------
Slicing
-------------------------------

FQL target expressions may be sliced using the **START** and **END** or
**ENDBEFORE** keywords (**END** is inclusive, **ENDBEFORE** is not). They take
a selection expression. You can for instance slice between two specific IDs::

    SELECT FOR w START ID "first.element.id" END ID "last.element.id"

Or to select all words from the first occurrence of *the* to the next::

    SELECT FOR w START w WHERE text = "the" ENDBEFORE w WHERE text = "the"

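
To make the difference between an inclusive and an exclusive end point
concrete, the sketch below mimics it on a plain Python list of words. It only
illustrates the behaviour described in this section and is not how the FQL
library evaluates slices; the function ``slice_words`` and its predicate
arguments are assumptions made for this example.

```python
def slice_words(words, start, end, inclusive=True):
    """Return the words from the first match of `start` up to the first
    later match of `end`.

    With inclusive=True the end word is kept (like END); with
    inclusive=False it is left out (like ENDBEFORE).
    """
    result = []
    started = False
    for word in words:
        if not started:
            if start(word):
                started = True
                result.append(word)
            continue
        if end(word):
            if inclusive:
                result.append(word)
            break
        result.append(word)
    # If no end word is found, this sketch simply runs to the end of the list.
    return result


def is_the(word):
    return word == "the"


words = "we saw the red house and the old barn".split()
exclusive = slice_words(words, is_the, is_the, inclusive=False)
inclusive = slice_words(words, is_the, is_the, inclusive=True)
```

Here ``exclusive`` stops just before the second *the*, while ``inclusive``
also contains it.
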
The query will usually end after the **END**/**ENDBEFORE** statement. You may,
however, want to continue until the start expression is encountered again. In
that case, add the keyword **REPEAT**::

    SELECT FOR w START w WHERE text = "the" ENDBEFORE w WHERE text = "the" REPEAT

Note that slicing only works on target expressions, therefore the **FOR** is
mandatory. If multiple target expressions are chained, then each may set its
own slice.

-------------------------------
Shortcuts
-------------------------------

Classes are prevalent throughout FoLiA, so it is very common to want to select
on classes. To select words with pos tag "n", for example, you can do::

    SELECT w WHERE (pos HAS class = "n")

Because this is so common, there is a shortcut. Specify the annotation type
directly preceded by a colon, and a **HAS** statement that matches on class will
automatically be constructed::

    SELECT w WHERE :pos = "n"

The two statements are completely equivalent.
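
The equivalence can be pictured as a simple textual rewrite. The following
Python sketch expands the shortcut into the long form with a regular
expression. It is a naive illustration, not the way the FQL library parses
queries: the function name ``expand_class_shortcut`` is hypothetical, and the
pattern does not account for colons that occur inside quoted strings.

```python
import re

# A colon followed by an annotation type, an operator and a quoted value.
SHORTCUT = re.compile(r':(\w+)\s*(==|=)\s*("[^"]*")')


def expand_class_shortcut(query):
    """Rewrite  :type = "value"  as  (type HAS class = "value")."""
    return SHORTCUT.sub(r"(\1 HAS class \2 \3)", query)


expanded = expand_class_shortcut('SELECT w WHERE :pos = "n"')
```

Here ``expanded`` holds the long form of the query shown earlier in this
section.
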

A third alternative to obtain the same result set is to use a target
expression::

    SELECT pos WHERE class = "n" FOR w RETURN target

This illustrates that there are often multiple ways of obtaining the same
result set. Due to lazy evaluation in the FQL library, there is not much
difference performance-wise.

Another kind of shortcut exists for setting text on structural elements. You
can add a word with text like this::

    ADD w (ADD t WITH text "hello") IN ID some.sentence

Or using the shortcut::

    ADD w WITH text "hello" IN ID some.sentence


========================================
Corpus Query Language (CQL)
========================================

The FoLiA Document Server also supports a basic subset of CQL. CQL focuses on
querying only and, unlike FQL, has no data manipulation functions. It is,
however, considerably more concise than FQL, already widely used, and has an
easier syntax.

To use CQL instead of FQL, just start your query as usual with an FQL **USE**
statement, then add the **CQL** keyword; everything after it will be
interpreted as CQL. Example::

    USE mynamespace/proycon CQL "the" [ tag="JJ.*" ]? [ lemma="house" & tag="N" ]

The ``tag`` attribute maps to the FoLiA ``pos`` type, and ``word`` maps to
FoLiA/FQL ``text``. Any other attributes are unmapped, so you can simply use the
FoLiA names from CQL, including any span annotation.

If multiple sets are available for a type, make sure to use the ``DEFAULTSET``
FQL keyword to set a default; otherwise the query will fail, as CQL has no
notion of the FoLiA set paradigm.

The CQL language is documented here:
http://www.sketchengine.co.uk/documentation/wiki/SkE/CorpusQuerying . The
advanced operators mentioned there are not supported yet.
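
The attribute mapping described in this section can be summarised as a small
lookup table. The Python sketch below is only an illustration of that mapping;
the names ``CQL_TO_FOLIA`` and ``folia_name`` are hypothetical and do not refer
to the server's actual code.

```python
# The two CQL attribute names that are renamed; all others pass through.
CQL_TO_FOLIA = {
    "tag": "pos",
    "word": "text",
}


def folia_name(cql_attribute):
    """Return the FoLiA/FQL name that a CQL attribute name refers to."""
    return CQL_TO_FOLIA.get(cql_attribute, cql_attribute)


names = [folia_name(attribute) for attribute in ("word", "tag", "lemma")]
```

Here ``lemma`` is left unchanged because it is not one of the renamed
attributes.
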