https://github.com/czcorpus/kontext

An advanced, extensible web front-end for the Manatee-open corpus search engine
corpora corpus-linguistics corpus-tools user-interface


Host: GitHub
URL: https://github.com/czcorpus/kontext
Owner: czcorpus
License: gpl-2.0
Created: 2015-04-14T14:19:58.000Z (about 11 years ago)
Default Branch: master
Last Pushed: 2026-02-02T14:27:10.000Z (4 months ago)
Last Synced: 2026-02-03T03:20:53.319Z (4 months ago)
Topics: corpora, corpus-linguistics, corpus-tools, user-interface
Language: TypeScript
Size: 39.9 MB
Stars: 78
Watchers: 7
Forks: 24
Open Issues: 68
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: COPYING

![KonText screenshot](https://github.com/czcorpus/kontext/blob/master/doc/images/kontext-screenshot1.jpg)

## Contents

* [Introduction](#introduction)

* [Features](#features)

* [Installation](#installation)

* [Customization and contribution](#customization-and-contribution)

* [Notable users](#notable-users)

* [How to cite](#how-to-cite-kontext)

## Introduction

KonText is an **advanced corpus query interface** and corpus data **integration platform** built around the [Manatee-open](http://nlp.fi.muni.cz/trac/noske) corpus search engine. It is written in Python 3 and TypeScript and runs on any major Linux distribution. Development is maintained by the [Department of Linguistics, Faculty of Arts, Charles University](http://ucnk.ff.cuni.cz/).

## Features

* fully **editable query chain**

    * any operation from a user-defined sequence (e.g. query -> filter -> sample -> sorting) can be changed and the whole sequence is then re-executed

* multiple search modes:

    * concordance

    * paradigmatic query

    * word list

    * keyword analysis

* simple and advanced query types

    * **advanced CQL editor** with **syntax highlighting** and **attribute recognition**

    * **interactive PoS tag composing tool** for positional and key-value tagsets

    * customizable query suggestions and simple type query refinement (e.g. for homonym disambiguation)

* support for **spoken corpora**

    * defined text segments can be played back as audio

    * KWIC detail with **easily distinguishable speeches**

* rich **concordance view options and tools**

    * any positional attribute can be set as primary

    * multiple ways to display other attributes

    * **user-defined line groups** - filtering, reviewing group ratios

    * tokens and KWICs can be connected to external data services (e.g. dictionaries, encyclopedias)

    * individual tokens can be linked to each other using an external service (e.g. for word translation equivalents)

* **rich subcorpus-related functionality**

    * any subcorpus can be accessed by other users who obtain its URL (otherwise the subcorpus is not discoverable by default)

      * once a public description is set, the subcorpus can be discovered on the "public subcorpora" page

    * text types metadata can be gradually refined to a specific subcorpus ("which publishers are there in case only *fiction* is selected?")

    * a **custom text types ratio** can be defined ("give me 20% fiction and 80% journalism")

    * unused subcorpora can be archived (URLs with the subcorpus are still valid) or completely removed (URLs will become invalid)

    * searching within a subcorpus can be further refined with ad-hoc text type selection

    * a subcorpus can be created with respect to aligned corpora ("give me fiction in Czech but only if there is an English translation for it")

* **frequency distribution**

    * univariate

        * positional attributes (including tuples of multiple attributes per token)

        * structural attributes

    * **multivariate distribution** (2 dimensions) for both positional and structural attributes

* collocation analysis

* **persistent URLs** - any result page can be easily shared even if the original query is megabytes long

* access to **previous queries**, named queries

* convenient corpus access

    * finding a corpus by keyword (tag), size, or description

    * adding a corpus to **favorites** (incl. subcorpora, aligned corpora)

* saving results to Excel, CSV, XML, JSONL, TXT

* [HTTP API](https://github.com/czcorpus/kontext/wiki/HTTP-API) access
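
The editable query chain listed first among the features can be pictured with a minimal Python sketch. This is not KonText's implementation: the `Op` class, the helper functions and the toy in-memory "corpus" are hypothetical names chosen for illustration. The sketch only shows the idea that each step is stored as data, so any single step can be replaced and the whole sequence re-executed.

```python
from dataclasses import dataclass
from typing import Callable, List

Lines = List[str]


@dataclass(frozen=True)
class Op:
    """One step of a (hypothetical) query chain."""
    name: str
    apply: Callable[[Lines], Lines]


def query(word: str) -> Op:
    # Keep only lines containing the word as a whole token.
    return Op(f"query:{word}", lambda lines: [ln for ln in lines if word in ln.split()])


def filter_out(word: str) -> Op:
    # Negative filter: drop lines containing the word.
    return Op(f"filter:-{word}", lambda lines: [ln for ln in lines if word not in ln.split()])


def sort_lines() -> Op:
    return Op("sort", lambda lines: sorted(lines))


def run_chain(corpus: Lines, chain: List[Op]) -> Lines:
    """Execute every operation in order, always starting from the full corpus."""
    result = corpus
    for op in chain:
        result = op.apply(result)
    return result


CORPUS = [
    "the old dog sleeps",
    "a dog barks loudly",
    "the cat sleeps",
    "a young dog sleeps",
]

chain = [query("dog"), filter_out("old"), sort_lines()]
first = run_chain(CORPUS, chain)

# Edit only the second step, then re-run the whole chain from the start.
edited = chain[:1] + [filter_out("young")] + chain[2:]
second = run_chain(CORPUS, edited)
```

Here `first` holds the two lines without "old" and `second` the two lines without "young", both sorted. A real concordance engine would of course operate on indexed corpus positions rather than on lists of strings.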

## Internal features

* modern client-side application (written in TypeScript, event stream architecture, React components, extensible)

* server-side written using the [Sanic](https://sanic.dev/en/) framework with fully **decoupled background concordance/frequency/collocation calculation** (using an integrated [Rq](https://python-rq.org/) worker server)

* modular code design with dynamically loadable plug-ins providing custom functionality implementation (e.g. custom database adapters, authentication methods, corpus listing widgets, HTTP session management)

    * integrability with existing information systems
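
As a rough illustration of what "dynamically loadable plug-ins" can mean in Python, the sketch below imports a module whose name comes from configuration and asks it for an instance. It is a generic pattern, not KonText's actual plug-in interface: the module name `demo_auth_plugin`, the `create_instance` factory and the `validate_user` method are assumptions made for this example only.

```python
import importlib
import sys
import types

# Stand-in for a plug-in module that would normally live in its own file.
# The module name and the `create_instance` factory are assumptions of
# this sketch.
_demo = types.ModuleType("demo_auth_plugin")


class _DemoAuth:
    def validate_user(self, username: str, password: str) -> bool:
        return username == "demo" and password == "secret"


_demo.create_instance = lambda conf: _DemoAuth()
sys.modules[_demo.__name__] = _demo


def load_plugin(module_name: str, conf: dict):
    """Import a plug-in module by name and build its instance."""
    module = importlib.import_module(module_name)
    factory = getattr(module, "create_instance", None)
    if not callable(factory):
        raise ImportError(f"{module_name} does not define create_instance()")
    return factory(conf)


# Which implementation is used is decided by configuration, not by code.
config = {"auth": {"module": "demo_auth_plugin"}}
auth = load_plugin(config["auth"]["module"], config["auth"])
```

Swapping the authentication method then amounts to changing the module name in the configuration, provided the replacement module exposes the same factory and methods.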

## Installation

### Docker

Running KonText as a set of Docker containers is the most convenient and flexible way. Docker Compose v2 is required. To run a basic configuration instance (i.e. no MySQL/MariaDB server) use:

```shell
docker compose up
```

To run a production grade instance:

```shell
docker compose -f docker-compose.yml -f docker-compose.mysql.yml --env-file .env.mysql up
```

(the `.env.mysql` file allows configuring custom MySQL/MariaDB credentials and the KonText configuration file)

### Manual installation

#### Key requirements

* Python *3.6* (or newer)

* [Manatee](http://nlp.fi.muni.cz/trac/noske) corpus search engine - version *2.167.8* and onwards (for KonText *v0.17+*, Manatee *v2.2xx* is recommended)

* a key-value storage

    * [Redis](http://redis.io/) (recommended), [SQLite](https://sqlite.org/) (supported), custom implementations possible

* a task queue - [Rq](https://python-rq.org/)

* HTTP proxy server

    * [Nginx](http://nginx.org/) (recommended), [Apache](http://httpd.apache.org/), ...

For Ubuntu users, it is recommended to use the [install script](scripts/install/install.py), which should perform most of the actions necessary to install and run KonText. For other Linux distributions we recommend running KonText within a container or a virtual machine. Please refer to the [doc/INSTALL.md](doc/INSTALL.md) file for details.
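
Before a manual installation, the version requirements listed above can be checked with a few lines of Python. The following is a minimal sketch, not part of KonText or its install script: the function names are hypothetical, the sample version strings are illustrative, and how the installed Manatee version string is obtained is left out. Versions are compared as integer tuples because a plain string comparison would rank `2.167.10` below `2.167.8`.

```python
import re
import sys
from typing import Tuple

MIN_PYTHON = (3, 6)
MIN_MANATEE = (2, 167, 8)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.167.8' (optionally followed by a suffix) into (2, 167, 8)."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version: str, minimum: Tuple[int, ...]) -> bool:
    """Compare numerically, component by component."""
    return parse_version(version) >= minimum


# The running interpreter must be Python 3.6 or newer.
python_ok = sys.version_info[:2] >= MIN_PYTHON

# Illustrative Manatee version strings (assumed format: dotted integers).
for candidate in ("2.36.5", "2.167.8", "2.167.10", "2.214.1"):
    status = "ok" if meets_minimum(candidate, MIN_MANATEE) else "too old"
    print(f"Manatee {candidate}: {status}")
```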

## Customization and contribution

Please refer to our [Wiki](https://github.com/czcorpus/kontext/wiki/Development-and-customization).

## Notable users

* [Department of Linguistics, Faculty of Arts, Charles University](https://kontext.korpus.cz/)

* [LINDAT/CLARIAH-CZ](https://ufal.mff.cuni.cz/lindat-kontext)

* [CLARIN-PL](https://kontext.clarin-pl.eu/)

* [CLARIN-SI](https://www.clarin.si/kontext/)

* [Serbski Institut](https://www.serbski-institut.de) (API version of KonText)

## How to cite KonText

Tomáš Machálek (2020) - KonText: Advanced and Flexible Corpus Query Interface

```bibtex
@inproceedings{machalek-2020-kontext,
    title = "{K}on{T}ext: Advanced and Flexible Corpus Query Interface",
    author = "Mach{\'a}lek, Tom{\'a}{\v{s}}",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.865",
    pages = "7003--7008",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
```
