https://github.com/fostroll/corpuscula
Toolkit that simplifies corpus processing
conllu corpora natural-language-processing nlp universal-dependencies
- Host: GitHub
- URL: https://github.com/fostroll/corpuscula
- Owner: fostroll
- License: bsd-3-clause
- Created: 2020-04-05T12:30:36.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2021-12-09T10:56:50.000Z (over 3 years ago)
- Last Synced: 2025-03-24T17:15:48.803Z (2 months ago)
- Topics: conllu, corpora, natural-language-processing, nlp, universal-dependencies
- Language: Python
- Homepage:
- Size: 36.1 MB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
RuMor: Russian Morphology project

# Corpuscula: a Python NLP library for corpus processing
[PyPI](https://pypi.org/project/corpuscula/)
[Python](https://www.python.org/)
[License: BSD-3-Clause](https://opensource.org/licenses/BSD-3-Clause)

A part of the ***RuMor*** project. It contains tools to simplify corpus
processing. Highlights are:

* full [*CoNLL-U*](https://universaldependencies.org/format.html) support
(includes [*CoNLL-U Plus*](https://universaldependencies.org/ext-format.html))
* wrappers for known corpora of the Russian language
* a parser and wrapper for the Russian part of *Wikipedia*
* a *Corpus Dictionary* that can be used for further morphology processing
* a simple database to keep named entities

## Installation
### pip
***Corpuscula*** supports *Python 3.5* or later. To install it via *pip*, run:
```sh
$ pip install corpuscula
```

If you currently have a previous version of ***Corpuscula*** installed, use:
```sh
$ pip install corpuscula -U
```

### From Source
Alternatively, you can install ***Corpuscula*** from the source of this *git
repository*:
```sh
$ git clone https://github.com/fostroll/corpuscula.git
$ cd corpuscula
$ pip install -e .
```
This gives you access to examples and data that are not included in the
*PyPI* package.

## Setup
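Before pointing ***Corpuscula*** at a corpora directory, it can be useful to
confirm that the package is visible to the interpreter you are running. The
sketch below is not part of ***Corpuscula***: it uses only the standard
library, the helper names `installed_version` and `python_is_supported` are
made up for illustration, and `importlib.metadata` requires *Python 3.8* or
later (newer than the minimum version stated above).

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum=(3, 5)):
    """Compare the running interpreter with the minimum stated in this README."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    # Prints None if pip has not installed the package into this environment.
    print("corpuscula version:", installed_version("corpuscula"))
```

If the second line prints `None`, the package was most likely installed into
a different environment than the one running the script.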
After installation, you need to specify a directory where you prefer to store
downloaded corpora:
```python
>>> import corpuscula.corpus_utils as cu
>>> cu.set_root_dir('/path/to/corpora')  # We will keep corpora here
```
**NB:** this creates or updates the config file `.rumor` in your home directory.

If you don't set the root directory, ***Corpuscula*** will keep corpora
in the directory where it's installed.

## Usage
* [*CoNLL-U* Support](https://github.com/fostroll/corpuscula/blob/master/doc/README_CONLLU.md)
* [Management of Corpora](https://github.com/fostroll/corpuscula/blob/master/doc/README_CORPORA.md)
* [Wrapper for *Wikipedia*](https://github.com/fostroll/corpuscula/blob/master/doc/README_WIKIPEDIA.md)
* [*Corpus Dictionary*](https://github.com/fostroll/corpuscula/blob/master/doc/README_CDICT.md)
* [Utilities](https://github.com/fostroll/corpuscula/blob/master/doc/README_UTILS.md)
* [*Items* database](https://github.com/fostroll/corpuscula/blob/master/doc/README_ITEMS.md)
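
For readers new to the *CoNLL-U* format that the first document covers, here
is a minimal standard-library sketch of what the format looks like and how
its ten tab-separated columns can be read. It deliberately does **not** use
***Corpuscula***'s API (see the linked documentation for that). The names
`parse_conllu` and `parse_feats`, the returned data shape, and the sample
sentence with its annotations are assumptions made for illustration, and
multiword-token ranges and empty nodes get no special handling.

```python
# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS = ("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# A hand-written sample sentence; a blank line terminates it.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = Мама мыла раму.\n"
    "1\tМама\tмама\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\tNumber=Sing|Tense=Past\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\tCase=Acc|Number=Sing\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def parse_conllu(text):
    """Return a list of (tokens, metadata) pairs, one per sentence."""
    sentences, tokens, meta = [], [], {}
    for line in text.splitlines():
        if not line.strip():             # blank line closes a sentence
            if tokens:
                sentences.append((tokens, meta))
            tokens, meta = [], {}
        elif line.startswith("#"):       # comment line: '# key = value'
            key, sep, value = line[1:].partition("=")
            meta[key.strip()] = value.strip() if sep else None
        else:                            # token line: ten tab-separated fields
            tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    if tokens:                           # input without a trailing blank line
        sentences.append((tokens, meta))
    return sentences


if __name__ == "__main__":
    for tokens, meta in parse_conllu(SAMPLE):
        print(meta.get("text"))
        for tok in tokens:
            print(tok["ID"], tok["FORM"], tok["UPOS"], parse_feats(tok["FEATS"]))
```

All column values are kept as strings here, including `ID` and `HEAD`; a real
toolkit would typically also validate the column count and handle
*CoNLL-U Plus* column declarations.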
## Examples
You can find examples in the `examples` directory of the ***Corpuscula*** GitHub
repository.

## License
***Corpuscula*** is released under the BSD 3-Clause License. See the
[LICENSE](https://github.com/fostroll/corpuscula/blob/master/LICENSE) file for
more details.