https://github.com/andrefs/node-text-corpus
Some classes to represent elements in a text corpus.
- Host: GitHub
- URL: https://github.com/andrefs/node-text-corpus
- Owner: andrefs
- Created: 2020-06-07T10:48:05.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-06-07T10:57:58.000Z (about 6 years ago)
- Last Synced: 2025-08-20T06:55:48.225Z (10 months ago)
- Language: JavaScript
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# text-corpus
A small set of classes for representing the elements of a text corpus.
It is currently meant mainly for use in [cetem-publico](https://www.npmjs.com/package/cetem-publico), [tnt-tagger]() and other modules, but hopefully it will be generic enough to be useful in other contexts as well.
## Installation
```bash
$ npm install text-corpus
```
## Classes
### Token
Used to represent the tokens (words) in the corpus.
#### `new Token(word, info)`
* `word` is the word in the original corpus text
* `info` (all of these fields are optional)
  * `tokenId`: an ID for this token
  * `lemma`: the lemmatized version of `word`
  * `pos`: the part-of-speech (POS) tag for `word`
  * `other`: more information about the token
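As a rough illustration of the constructor described above, here is a minimal sketch. The `Token` class in it is a local stand-in written for this example, not the package's own code, and the property names it stores (`word`, `tokenId`, `lemma`, `pos`, `other`) are assumptions based only on the parameter list.

```javascript
// Local stand-in mirroring the documented `new Token(word, info)` signature.
// The stored property names are assumptions, not the package's confirmed API.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    // Every field of `info` is optional, so missing ones stay undefined.
    const { tokenId, lemma, pos, other } = info;
    Object.assign(this, { tokenId, lemma, pos, other });
  }
}

// A token with some optional information attached.
const token = new Token('casas', { tokenId: 't1', lemma: 'casa', pos: 'N' });

// A token built from the word alone.
const bare = new Token('e');

console.log(token.word, token.lemma, token.pos);
console.log(bare.word, bare.lemma);
```

Passing only the word is enough when no annotation is available yet; the lemma and POS tag can be supplied when they are known.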
### MultiWordExpression
This class provides a way to group some tokens into multi-word
expressions.
MWEs can have attributes indicating the lemma and the POS tag
for the whole expression.
#### `new MultiWordExpression({lemma, pos}, tokens)`
* `lemma`: the lemma for the multi-word expression
* `pos`: the POS tag for the multi-word expression
* `tokens`: an array of Token objects which make this MWE
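The grouping described above might look like the sketch below. Both classes are local stand-ins that only mirror the documented constructor signatures; the `tokens`, `lemma` and `pos` properties they expose are assumptions, and the example words and tags are made up for illustration.

```javascript
// Local stand-ins for the documented constructors (not the package's code).
class Token {
  constructor(word, info = {}) {
    this.word = word;
    Object.assign(this, info);
  }
}

class MultiWordExpression {
  constructor({ lemma, pos } = {}, tokens = []) {
    this.lemma = lemma;   // lemma for the whole expression
    this.pos = pos;       // POS tag for the whole expression
    this.tokens = tokens; // the Token objects that make up the MWE
  }
}

// Two tokens grouped into one expression with its own lemma and POS tag.
const mwe = new MultiWordExpression(
  { lemma: 'Nova Iorque', pos: 'PROP' },
  [new Token('Nova'), new Token('Iorque')]
);

// Rebuild the surface form from the member tokens.
const surface = mwe.tokens.map((t) => t.word).join(' ');
console.log(surface, mwe.pos);
```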
### Sentence
Sentences contain a list of tokens (the words in that sentence).
Because some words can form multi-word expressions, inside a
`Sentence` we can find both `Token`s and `MultiWordExpression`s
(which, in turn, have `Token` objects inside).
#### `new Sentence(id, tokens)`
* `id`: an id for the sentence
* `tokens`: an array of tokens and MWEs which form this sentence
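Because a sentence can mix `Token`s and `MultiWordExpression`s, code that wants the plain word sequence has to look inside the MWEs. The sketch below shows one way to do that. The three classes are local stand-ins for the documented constructors, and `wordsOf` is a hypothetical helper introduced for this example, not something the package is known to provide.

```javascript
// Local stand-ins for the documented constructors (not the package's code).
class Token {
  constructor(word, info = {}) {
    this.word = word;
    Object.assign(this, info);
  }
}

class MultiWordExpression {
  constructor({ lemma, pos } = {}, tokens = []) {
    this.lemma = lemma;
    this.pos = pos;
    this.tokens = tokens;
  }
}

class Sentence {
  constructor(id, tokens = []) {
    this.id = id;
    this.tokens = tokens; // a mix of Token and MultiWordExpression objects
  }
}

// Hypothetical helper: flatten a sentence into its individual words,
// expanding each MWE into the words of its member tokens.
function wordsOf(sentence) {
  return sentence.tokens.flatMap((item) =>
    item instanceof MultiWordExpression
      ? item.tokens.map((t) => t.word)
      : [item.word]
  );
}

const sentence = new Sentence('s1', [
  new Token('Vivo'),
  new Token('em'),
  new MultiWordExpression({ lemma: 'Nova Iorque', pos: 'PROP' }, [
    new Token('Nova'),
    new Token('Iorque'),
  ]),
]);

const words = wordsOf(sentence);
console.log(words.join(' '));
```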
### Paragraph
Paragraphs are composed of a sequence of sentences.
#### `new Paragraph(id, sentences)`
* `id`: an id for the paragraph
* `sentences`: an array of sentences which form this paragraph
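To show how the levels nest, the sketch below builds a paragraph from two sentences and walks it. As before, the classes are local stand-ins for the documented constructors, and the `id`, `sentences` and `tokens` property names are assumptions.

```javascript
// Local stand-ins for the documented constructors (not the package's code).
class Token {
  constructor(word, info = {}) {
    this.word = word;
    Object.assign(this, info);
  }
}

class Sentence {
  constructor(id, tokens = []) {
    this.id = id;
    this.tokens = tokens;
  }
}

class Paragraph {
  constructor(id, sentences = []) {
    this.id = id;
    this.sentences = sentences; // Sentence objects, in order
  }
}

const paragraph = new Paragraph('p1', [
  new Sentence('s1', [new Token('Olá'), new Token('mundo')]),
  new Sentence('s2', [new Token('Adeus')]),
]);

// Walk the hierarchy: paragraph -> sentences -> tokens.
const sentenceIds = paragraph.sentences.map((s) => s.id);
const tokenCount = paragraph.sentences.reduce(
  (total, s) => total + s.tokens.length,
  0
);

console.log(sentenceIds, tokenCount);
```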
## Bugs and stuff
Open a [GitHub issue](https://github.com/andrefs/node-text-corpus/issues) or, preferably, send me a pull request.
## License
MIT