https://github.com/gesistsa/tokenvars
🔬 Add token-level metadata to `quanteda` (An Experiment)
https://github.com/gesistsa/tokenvars
Last synced: 11 months ago
JSON representation
🔬 Add token-level metadata to `quanteda` (An Experiment)
- Host: GitHub
- URL: https://github.com/gesistsa/tokenvars
- Owner: gesistsa
- License: gpl-3.0
- Created: 2023-11-24T17:38:08.000Z (over 2 years ago)
- Default Branch: v0.0
- Last Pushed: 2024-02-13T16:19:27.000Z (over 2 years ago)
- Last Synced: 2025-05-05T03:32:47.129Z (about 1 year ago)
- Language: R
- Homepage:
- Size: 95.7 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 4
-
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md
Awesome Lists containing this project
README
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# tokenvars
At the moment, this package is super experimental and cannot be considered easy to use. Even when it is the case, this is mostly an infrastructural R package for a very niche category of developers wanting to develop R packages for [quanteda](https://github.com/quanteda/quanteda).
`quanteda` has good support for metadata. However, one can only put corpus- and document-level metadata (`meta()`, `docvars()`, respectively). This package aims at going down one level and provides support for token-level metadata. Token-level metadata is useful for tagging individual token (e.g. Parts of Speech, relationships among tokens); it is also useful to store upper-level information of tokens (e.g. the subword tokenized sequence of tokens "\_L", "'", "app", "ar", "tement"; you might want to know "\_L" is from the French word "*L'appartement*").
## Installation
You can install the development version of tokenvars like so:
``` r
# Well, if you don't know how to do this, you probably shouldn't try this.
```
## A demo of using token-level metadata
```{r example}
library(quanteda)
library(tokenvars)
corp <- corpus(c(d1 = "spaCy is great at fast natural language processing.",
d2 = "Mr. Smith spent two years in North Carolina."))
tok <- tokens(corp) %>% tokens_add_tokenvars()
tok
```
```{r example2}
tokenvars(tok) ## nothing to see here
```
```{r example3}
tokenvars(tok, "tag") <- list(c("NNP", "VBZ", "JJ", "IN", "JJ", "JJ", "NN", "NN", "."),
c("NNP", ".", "NNP", "VBD", "CD", "NNS", "IN", "NNP", "NNP", "."))
tokenvars(tok, "lemma") <- list(c("spaCy", "be", "great", "at", "fast", "natural", "language", "processing", "."),
c("Mr", ".", "Smith", "spend", "two", "year", "in", "North", "Carolina", "."))
```
```{r example4}
tok
```
```{r example5}
tokenvars(tok)
```
```{r example6}
tokenvars(tok, field = "tag")
```
```{r example7}
tokenvars(tok, field = "lemma", docnames = "d2")
```
## tokens_proximity
`tokens_proxmity` is a showcase of `tokenvars` for calculating and manipulating a token-level metadata. "proximity" is a token-level metadata of the distance between a target pattern and all other tokens.
```{r prox1}
txt1 <-
c("Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.",
"EU policymakers proposed the new agency in 2021 to stop financial firms from aiding criminals and terrorists. Brussels has so far relied on national regulators with no EU authority to stop money laundering and terrorist financing running into billions of euros.")
tok1 <- txt1 %>% tokens() %>%
tokens_proximity(pattern = "turkish")
tok1
```
```{r prox2}
tokenvars(tok1, "proximity")
```
The `tokens` object with proximity vectors can be converted to a (weighted) `dfm` (Document-Feature Matrix). The default weight is assigned by inverting the proximity.
```{r dfm}
dfm(tok1)
```
You have the freedom to change to another weight function. For example, not inverting.
```{r dfm2}
dfm(tok1, weight_function = identity)
```
Or any custom function
```{r dfm3}
dfm(tok1, weight_function = function(x) { 1 / x^2 })
```