https://github.com/cscfi/kielipankki-utilities
Scripts for data conversion
https://github.com/cscfi/kielipankki-utilities
corpus-processing corpus-tools korp vrt
Last synced: 6 months ago
JSON representation
Scripts for data conversion
- Host: GitHub
- URL: https://github.com/cscfi/kielipankki-utilities
- Owner: CSCfi
- Created: 2020-01-30T13:15:48.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-09-18T11:05:07.000Z (about 1 year ago)
- Last Synced: 2024-09-18T15:11:29.346Z (about 1 year ago)
- Topics: corpus-processing, corpus-tools, korp, vrt
- Language: Python
- Size: 3.37 MB
- Stars: 6
- Watchers: 8
- Forks: 2
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Kielipankki-utilities
This repository contains software (scripts) and associated data for
importing, converting and other processing of (corpus) data in
[Kielipankki – The Language Bank of
Finland](https://www.kielipankki.fi/language-bank/).In particular, the repository contains scripts converting data to the
VRT (VeRticalized Text) format used as an input format for the [IMS
Open Corpus Workbench (CWB)](http://cwb.sourceforge.net/) and
[Korp](https://www.kielipankki.fi/support/korp/). Although many
scripts are specific to the Language Bank of Finland, some of them may
be more generally useful, or at least they may be adapted to other
environments.**Note:** Corpus data itself should *not* be included in this public
repository. Neither should any secret or private information, such as
passwords.## Directory structure
In general, the top-level directory structure is as follows:
* [`corp/`](corp/): corpus-specific scripts (and associated data),
containing subdirectories by corpus, group of corpora, corpus origin
(owner) or corpus type (such as speech)
* [`docs/`](docs/): general documentation on corpus processing
* [`fulltext/`](fulltext/): scripts to be included on corpus full-text
HTML pages
* [`scripts/`](scripts/): general-purpose scripts
* [`vrt-tools/`](vrt-tools/): FIN-CLARIN VRT Tools: tools for
processing VeRticalized Text dataFor a major subsystems of scripts, such as harvesting or parsing, you
may add a top-level directory of its own, or alternatively, a
subdirectory under `scripts/`.If you are unsure about where you should put your scripts in
development, you can develop them in a private branch first and merge
it to the master only when they are relatively stable.## Repository background
This repository is the public successor of the previous
[Kielipankki-konversio](https://github.com/CSCfi/Kielipankki-konversio/)
repository that also contained some private data. The repository was
made public on 2020-02-04. Any further development should be done in
this repository.