https://github.com/czcorpus/kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine

![KonText screenshot](https://github.com/czcorpus/kontext/blob/master/doc/images/kontext-screenshot1.jpg)

## Contents

* [Introduction](#introduction)
* [Features](#features)
* [Installation](#installation)
* [Customization and contribution](#customization-and-contribution)
* [Notable users](#notable-users)
* [How to cite](#how-to-cite-kontext)

## Introduction

KonText is an **advanced corpus query interface** and corpus data **integration platform** built around the corpus search engine [Manatee-open](http://nlp.fi.muni.cz/trac/noske). It is written in Python 3 and TypeScript and runs on any major Linux distribution. It is developed and maintained by the [Department of Linguistics, Faculty of Arts, Charles University](http://ucnk.ff.cuni.cz/).

## Features

* fully **editable query chain**
    * any operation from a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed and the whole sequence is then re-executed
* multiple search modes:
    * concordance
    * paradigmatic query
    * word list
    * keyword analysis
* simple and advanced query types
    * **advanced CQL editor** with **syntax highlighting** and **attribute recognition**
    * **interactive PoS tag composing tool** for positional and key-value tagsets
    * customizable query suggestions and simple type query refinement (e.g. for homonym disambiguation)
* support for **spoken corpora**
    * defined text segments can be played back as audio
    * KWIC detail with **easily distinguishable speeches**
* rich **concordance view options and tools**
    * any positional attribute can be set as primary
    * multiple ways to display other attributes
    * **user-defined line groups** - filtering, reviewing group ratios
    * tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)
    * individual tokens can be linked to each other using an external service (e.g. for word translation equivalents)
* **rich subcorpus-related functionality**
    * any subcorpus is accessible to other users who obtain its URL (otherwise the subcorpus is not discoverable by default)
    * once a public description is set, the subcorpus can be discovered on the "public subcorpora" page
    * text types metadata can be gradually refined to a specific subcorpus ("which publishers are there in case only *fiction* is selected?")
    * a **custom text types ratio** can be defined ("give me 20% fiction and 80% journalism")
    * unused subcorpora can be archived (URLs with the subcorpus are still valid) or completely removed (URLs will become invalid)
    * searching within a subcorpus can be further refined with ad-hoc text type selection
    * a subcorpus can be created with respect to aligned corpora ("give me fiction in Czech but only if there is an English translation for it")
* **frequency distribution**
    * univariate
        * positional attributes (including tuples of multiple attributes per token)
        * structural attributes
    * **multivariate distribution** (2 dimensions) for both positional and structural attributes
* collocation analysis
* **persistent URLs** - any result page can be easily shared even if the original query is megabytes long
* access to **previous queries**, named queries
* convenient corpus access
    * finding a corpus by a keyword (tag), size, description
    * adding a corpus to **favorites** (incl. subcorpora, aligned corpora)
* saving result to Excel, CSV, XML, JSONL, TXT
* [HTTP API](https://github.com/czcorpus/kontext/wiki/HTTP-API) access
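
The editable query chain mentioned at the top of this list can be pictured as an ordered list of operations that is replayed from the start whenever one step changes. The following is a minimal Python sketch of that idea only. The `Operation` class, the `execute` and `edit` helpers, and the toy in-memory "corpus" are all hypothetical names made up for illustration; they do not reflect KonText's actual implementation or its HTTP API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence

# One string stands in for one concordance line (an assumption for brevity).
Line = str


@dataclass(frozen=True)
class Operation:
    """One step of a hypothetical query chain."""
    name: str                                        # e.g. "query", "filter", "sort"
    arg: str                                         # the user-editable parameter
    run: Callable[[Sequence[Line], str], List[Line]]


def execute(chain: Sequence[Operation], data: Sequence[Line]) -> List[Line]:
    """Run every operation in order, feeding each result into the next one."""
    result = list(data)
    for op in chain:
        result = op.run(result, op.arg)
    return result


def edit(chain: Sequence[Operation], index: int, new_arg: str) -> List[Operation]:
    """Return a new chain in which a single operation has a changed argument."""
    edited = list(chain)
    edited[index] = replace(chain[index], arg=new_arg)
    return edited


def query(lines: Sequence[Line], word: str) -> List[Line]:
    # Keep lines containing the word as a whole token.
    return [ln for ln in lines if word in ln.split()]


def drop_containing(lines: Sequence[Line], word: str) -> List[Line]:
    # Negative filter: remove lines containing the word.
    return [ln for ln in lines if word not in ln.split()]


def sort_lines(lines: Sequence[Line], _arg: str) -> List[Line]:
    return sorted(lines)


corpus = ["the dog barks", "a dog sleeps", "the cat sleeps", "a cat purrs"]
chain = [
    Operation("query", "dog", query),
    Operation("filter", "barks", drop_containing),
    Operation("sort", "", sort_lines),
]

first = execute(chain, corpus)
# Change only the first step; the whole chain is then re-executed.
second = execute(edit(chain, 0, "cat"), corpus)
print(first)
print(second)
```

Keeping the operations immutable and re-running the full chain after an edit is one simple way to guarantee that later steps never see stale intermediate results.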

## Internal features

* modern client-side application (written in TypeScript, event stream architecture, React components, extensible)
* server-side written using the [Sanic](https://sanic.dev/en/) framework with fully **decoupled background concordance/frequency/collocation calculation** (using an integrated [Rq](https://python-rq.org/) worker server)
* modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication methods, corpus listing widgets, HTTP session management)
* integrability with existing information systems

## Installation

### Docker

The most convenient and flexible way to run KonText is as a set of Docker containers. Docker Compose v2 is required. To run an instance with the basic
configuration (i.e. without a MySQL/MariaDB server), use:

```shell
docker compose up
```

To run a production-grade instance:

```shell
docker compose -f docker-compose.yml -f docker-compose.mysql.yml --env-file .env.mysql up
```

(the `.env.mysql` file allows configuring custom MySQL/MariaDB credentials and the KonText configuration file)

### Manual installation

#### Key requirements

* Python *3.6* (or newer)
* [Manatee](http://nlp.fi.muni.cz/trac/noske) corpus search engine - version *2.167.8* and onwards (for KonText *v0.17+*, Manatee *v2.2xx* is recommended)
* a key-value storage
    * [Redis](http://redis.io/) (recommended), [SQLite](https://sqlite.org/) (supported), custom implementations possible
* a task queue - [Rq](https://python-rq.org/)
* HTTP proxy server
    * [Nginx](http://nginx.org/) (recommended), [Apache](http://httpd.apache.org/), ...

Ubuntu users are advised to use the [install script](scripts/install/install.py), which should
perform most of the steps needed to install and run KonText. For other Linux distributions, we recommend
running KonText within a container or a virtual machine. Please refer to
[doc/INSTALL.md](doc/INSTALL.md) for details.

## Customization and contribution

Please refer to our [Wiki](https://github.com/czcorpus/kontext/wiki/Development-and-customization).

## Notable users

* [Department of Linguistics, Faculty of Arts, Charles University](https://kontext.korpus.cz/)
* [LINDAT/CLARIAH-CZ](https://ufal.mff.cuni.cz/lindat-kontext)
* [CLARIN-PL](https://kontext.clarin-pl.eu/)
* [CLARIN-SI](https://www.clarin.si/kontext/)
* [Serbski Institut](https://www.serbski-institut.de) (API version of KonText)

## How to cite KonText

Tomáš Machálek (2020) - KonText: Advanced and Flexible Corpus Query Interface

```bibtex
@inproceedings{machalek-2020-kontext,
title = "{K}on{T}ext: Advanced and Flexible Corpus Query Interface",
author = "Mach{\'a}lek, Tom{\'a}{\v{s}}",
booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
month = may,
year = "2020",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://www.aclweb.org/anthology/2020.lrec-1.865",
pages = "7003--7008",
language = "English",
ISBN = "979-10-95546-34-4",
}
```