https://github.com/gesistsa/grafzahl

🧛 fine-tuning Transformers for text data from within R

Host: GitHub
URL: https://github.com/gesistsa/grafzahl
Owner: gesistsa
License: gpl-3.0
Created: 2022-06-20T13:56:26.000Z (almost 4 years ago)
Default Branch: v0.1
Last Pushed: 2025-02-19T14:00:03.000Z (over 1 year ago)
Last Synced: 2025-04-26T09:39:44.119Z (about 1 year ago)
Language: R
Homepage: https://gesistsa.github.io/grafzahl/
Size: 4.97 MB
Stars: 41
Watchers: 4
Forks: 2
Open Issues: 4
Metadata Files:
- Readme: README.Rmd
- License: LICENSE.md

README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# grafzahl 

[![CRAN status](https://www.r-pkg.org/badges/version/grafzahl)](https://CRAN.R-project.org/package=grafzahl)

The goal of grafzahl (**G**racious **R** **A**nalytical **F**ramework for **Z**appy **A**nalysis of **H**uman **L**anguages [^1]) is to duct tape the [quanteda](https://github.com/quanteda/quanteda) ecosystem to modern [Transformer-based text classification models](https://simpletransformers.ai/), e.g. BERT, RoBERTa, etc. The model object looks and feels like the textmodel S3 object from the package [quanteda.textmodels](https://github.com/quanteda/quanteda.textmodels).

If you don't know what I am talking about, don't worry, this package is gracious. You don't need to know a lot about Transformers to use this package. See the examples below.

Please cite this software as:

Chan, C. (2023). [grafzahl: fine-tuning Transformers for text data from within R](paper/grafzahl_sp.pdf). *Computational Communication Research* 5(1): 76-84. [https://doi.org/10.5117/CCR2023.1.003.CHAN](https://doi.org/10.5117/CCR2023.1.003.CHAN)

## Installation: Local environment

Install the CRAN version

```r
install.packages("grafzahl")
```

After that, you need to set up your conda environment

```r
require(grafzahl)
setup_grafzahl(cuda = TRUE) ## if you have GPU(s)
```

## On remote environments, e.g. Google Colab

On Google Colab, you need to enable non-Conda mode

```r
install.packages("grafzahl")
require(grafzahl)
use_nonconda()
```

Please refer to the vignette.

## Usage

Suppose you have a bunch of tweets in the quanteda corpus format, and the corpus has exactly one docvar that denotes the labels you want to predict. The data is from [this repository](https://github.com/pablobarbera/incivility-sage-open) (Theocharis et al., 2020).

```{r, echo = FALSE, message = FALSE}
devtools::load_all()
```

```{r}
unciviltweets
```
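If your own data is not yet in this shape, the following is a minimal sketch of how such a corpus could be built with quanteda. The texts, the labels, and the docvar name `uncivil` are made up for illustration; only `corpus()`, `docvars()`, and `ndoc()` from quanteda are used.

```r
library(quanteda)

# Hypothetical toy data: texts and labels are invented for illustration
texts <- c("You are a disgrace.",
           "Thank you for your service!",
           "What an idiot.",
           "Great town hall today.")
labels <- c(1, 0, 1, 0)

# Build a corpus whose only docvar holds the label to be predicted
# (the docvar name `uncivil` is an assumption, any name would do)
my_corpus <- corpus(texts, docvars = data.frame(uncivil = labels))

# Check that there is exactly one docvar before passing the corpus on
stopifnot(ncol(docvars(my_corpus)) == 1)
```

A corpus built this way has the same shape as the one described above: one document per text and a single docvar holding the labels.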

In order to train a Transformer model, please select the `model_name` from [Hugging Face's list](https://huggingface.co/models). The table below lists some common choices. Most of the time, providing `model_name` is sufficient; there is no need to provide `model_type`.

Suppose you want to train a Transformer model using "bertweet" (Nguyen et al., 2020) because it matches your domain of usage. By default, the model is saved in the `output` directory under the current working directory. You can save it elsewhere using the `output_dir` parameter.

```r
model <- grafzahl(unciviltweets, model_type = "bertweet", model_name = "vinai/bertweet-base")

### If you are a hardcore quanteda user:
## model <- textmodel_transformer(unciviltweets,
##                                model_type = "bertweet", model_name = "vinai/bertweet-base")
```

Make predictions

```r
predict(model)
```

That is it.
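If you want to check how well the model does on tweets it has not seen, one option is to hold out part of the corpus before training. The sketch below is only an illustration under assumptions: the toy corpus, the 20% split, and the helper `evaluate_holdout()` are hypothetical, and it assumes that `predict()` accepts a `newdata` argument and returns one predicted label per document. The split itself uses only quanteda and base R.

```r
library(quanteda)

# Hypothetical toy corpus; in practice this would be your own labelled corpus
toy <- corpus(paste("tweet number", 1:10),
              docvars = data.frame(uncivil = rep(c(0, 1), 5)))

# Randomly hold out 20% of the documents for evaluation
set.seed(42)
held_out <- sample(ndoc(toy), size = round(0.2 * ndoc(toy)))
is_test <- seq_len(ndoc(toy)) %in% held_out
train <- corpus_subset(toy, !is_test)
test <- corpus_subset(toy, is_test)

# Hypothetical helper: fine-tune on `train`, predict `test`, cross-tabulate.
# Assumes predict() takes `newdata` and returns one label per document.
evaluate_holdout <- function(train, test, model_name = "vinai/bertweet-base") {
  model <- grafzahl::grafzahl(train, model_name = model_name)
  predicted <- predict(model, newdata = test)
  table(predicted = predicted, actual = docvars(test)[[1]])
}

## Needs a working grafzahl setup and a realistically sized corpus:
## evaluate_holdout(train, test)
```

The helper is deliberately not called here, because fine-tuning needs the Python backend and far more than ten documents to be meaningful.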

## Extended examples

Several extended examples are also available.

| Examples                                        | file                                           |
|-------------------------------------------------|------------------------------------------------|
| van Atteveldt et al. (2021)                     | [paper/vanatteveldt.md](paper/vanatteveldt.md) |
| Dobbrick et al. (2021)                          | [paper/dobbrick.md](paper/dobbrick.md)         |
| Theocharis et al. (2020)                        | [paper/theocharis.md](paper/theocharis.md)     |
| OffensEval-TR (2020)                            | [paper/coltekin.md](paper/coltekin.md)         |
| Amharic News Text classification Dataset (2021) | [paper/azime.md](paper/azime.md)               |

## Some common choices of `model_name`

| Your data         | model_type | model_name                         |
|-------------------|------------|------------------------------------|
| English tweets    | bertweet   | vinai/bertweet-base                |
| Lightweight       | mobilebert | google/mobilebert-uncased          |
|                   | distilbert | distilbert-base-uncased            |
| Long Text         | longformer | allenai/longformer-base-4096       |
|                   | bigbird    | google/bigbird-roberta-base        |
| English (General) | bert       | bert-base-uncased                  |
|                   | bert       | bert-base-cased                    |
|                   | electra    | google/electra-small-discriminator |
|                   | roberta    | roberta-base                       |
| Multilingual      | xlm        | xlm-mlm-17-1280                    |
|                   | xlm        | xlm-mlm-100-1280                   |
|                   | bert       | bert-base-multilingual-cased       |
|                   | xlmroberta | xlm-roberta-base                   |
|                   | xlmroberta | xlm-roberta-large                  |

# References

1. Theocharis, Y., Barberá, P., Fazekas, Z., & Popa, S. A. (2020). The dynamics of political incivility on Twitter. Sage Open, 10(2), 2158244020919447.

2. Nguyen, D. Q., Vu, T., & Nguyen, A. T. (2020). BERTweet: A pre-trained language model for English Tweets. arXiv preprint arXiv:2005.10200.

---

[^1]: Yes, I totally made up the meaningless long name. Actually, it is the German name of the *Sesame Street* character [Count von Count](https://de.wikipedia.org/wiki/Sesamstra%C3%9Fe#Graf_Zahl), meaning "Count (the noble title) Number". And it seems to be compulsory to name absolutely everything related to Transformers after Sesame Street characters.
