https://github.com/ubisoft/ubisoft-laforge-toxbuster


Host: GitHub
URL: https://github.com/ubisoft/ubisoft-laforge-toxbuster
Owner: ubisoft
License: other
Created: 2023-06-27T18:58:52.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2023-06-27T19:00:55.000Z (about 3 years ago)
Last Synced: 2024-06-11T00:23:13.747Z (about 2 years ago)
Language: Python
Size: 938 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: License.md


# ToxBuster

We aim to use language models to identify and classify toxicity inside in-game chat.

## Project Standards

To make collaboration easier and the project code easier to follow, the following sections outline the standards / loose rules we follow.

### Project File Layout
Follow a layout similar to the `src` layout described in the Python [packaging-projects](https://packaging.python.org/en/latest/tutorials/packaging-projects/) tutorial.

```
> src
    > module_1
        __init__.py
    > module_2
        __init__.py
> tests
main.py
poetry.lock
```

### Code Simple Guide
1. Keep it simple, stupid (KISS).
2. Don't add features or over-engineer until you need to.

### Code Standards

### Poetry

1. Test that the Poetry environment is set up correctly by running a one-epoch training run:
```Powershell
poetry run python .\main.py Train --config ".\train\train_on_CONDA_no_context.json" --max_epochs_to_train 1
```
Note: Poetry installs the CPU-only build of torch, without CUDA support (see [Issue 4231](https://github.com/python-poetry/poetry/issues/4231)). You may have to install PyTorch with CUDA separately.

2. Use `poetry add` to add missing packages to pyproject.toml & poetry.lock

3. Use `poetry export -f requirements.txt > requirements.txt` to update `requirements.txt`.
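
Regarding the CUDA note in step 1, a quick way to see which torch build ended up in the environment is a small check like the one below. This is a minimal sketch, not part of the project; the function name `describe_torch_build` is hypothetical. It only relies on `torch.cuda.is_available()`, `torch.__version__`, and `torch.version.cuda`.

```python
import importlib.util


def describe_torch_build() -> str:
    """Report whether the installed torch build can use CUDA.

    Hypothetical helper for illustration; it is not part of the project code.
    """
    # Avoid an ImportError if torch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"

    import torch

    if torch.cuda.is_available():
        return f"torch {torch.__version__} with CUDA {torch.version.cuda}"
    # torch.version.cuda is None for CPU-only builds.
    return f"torch {torch.__version__} without usable CUDA (CPU only)"


print(describe_torch_build())
```

Running it with `poetry run python` reports on the Poetry-managed environment rather than a global interpreter.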

## Understanding our model
We want our model to classify spans of words as non-toxic or as a specific category of toxicity. For this use case, the model is currently a token classification model.
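
To illustrate how per-token predictions relate to spans, the sketch below groups consecutive words that share a label into one span. It assumes word-level labels (after any sub-word pieces have been merged back), and the label names `O` and `INSULT` are hypothetical placeholders, not the project's actual categories.

```python
from itertools import groupby


def labels_to_spans(words, labels, non_toxic="O"):
    """Group consecutive words with the same label into (label, text) spans.

    Words labelled `non_toxic` are skipped. Label names are illustrative only.
    """
    spans = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label != non_toxic:
            spans.append((label, " ".join(words[index:index + length])))
        index += length
    return spans


words = ["gg", "you", "are", "trash", "lol"]
labels = ["O", "INSULT", "INSULT", "INSULT", "O"]
print(labels_to_spans(words, labels))  # [('INSULT', 'you are trash')]
```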

### Basic information:
* Current model is `bert-base-uncased`.
* Tokenizer configs can be found [here](https://huggingface.co/distilbert-base-uncased/raw/main/tokenizer.json) (the link points to `distilbert-base-uncased`, which uses the same vocabulary).
* HuggingFace [Token Classification](https://huggingface.co/docs/transformers/tasks/token_classification) guide.

### Collate Function
* HuggingFace Tokenization Documentation: https://huggingface.co/docs/tokenizers/pipeline
* Useful Stackoverflow: https://stackoverflow.com/questions/65246703/how-does-max-length-padding-and-truncation-arguments-work-in-huggingface-bertt
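
As a rough illustration of what a collate function for token classification does, the sketch below pads every example in a batch to the length of the longest one (capped at `max_length`). It is not the project's implementation: the dictionary keys, the `max_length` default, and the example ids are assumptions. The pad id `0` matches `[PAD]` in the `bert-base-uncased` vocabulary, and `-100` is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, so padded positions do not contribute to the loss.

```python
PAD_TOKEN_ID = 0     # id of [PAD] in the bert-base-uncased vocabulary
IGNORE_LABEL = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def collate(batch, max_length=128):
    """Pad (and naively truncate) already-tokenized examples to one length.

    Each example is assumed to be a dict with "input_ids" and "labels" lists
    of equal length. Truncation here simply cuts the tail, which would also
    drop a trailing [SEP]; a real tokenizer handles that for you.
    """
    longest = min(max(len(ex["input_ids"]) for ex in batch), max_length)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        ids = ex["input_ids"][:longest]
        labels = ex["labels"][:longest]
        pad = longest - len(ids)
        out["input_ids"].append(ids + [PAD_TOKEN_ID] * pad)
        out["attention_mask"].append([1] * len(ids) + [0] * pad)
        out["labels"].append(labels + [IGNORE_LABEL] * pad)
    return out


# 101 and 102 are [CLS] and [SEP]; the other ids are made up for the example.
batch = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 0, 1, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 2, -100]},
]
print(collate(batch)["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The returned lists can be wrapped with `torch.tensor(...)` before being passed to the model.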

## Trainer Logic / Terminology
1. Epoch: One full pass over the training dataset.
2. Batch Size: Number of samples processed together in one forward / backward pass, limited by the memory of the CPU / GPU.
   * `per_gpu_batch_size`: number of samples to run on each GPU if there is more than one. Batch size will be `num_gpu` * `per_gpu_batch_size`.
3. Global Step: One optimizer update, performed after gradients have been accumulated over a set number of batches.
   * To [keep gradients from exploding](https://neptune.ai/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem), we clip them with [`clip_grad_norm_`](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html) and accumulate gradients over several batches.
   * [Gradient Accumulation](https://towardsdatascience.com/gradient-accumulation-overcoming-memory-constraints-in-deep-learning-36d411252d01) is performed at every global step.
4. Validation Loop:
   * We run validation every `X` epochs. Following the paper would mean running it 10 times per epoch.
   * Metrics are pushed to TensorBoard.
   * Typical ML training setups run validation every epoch, or only every `X` epochs.
5. Save the model at the end of every `X` epochs:
   * This was changed from saving on global step, since the global step depends on two config variables and can be inconsistent.
   * It can be changed back if we save all the configs.
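
The terminology above can be sketched with a toy, dependency-free training loop. This is an illustration under stated assumptions, not the project's trainer: it assumes a global step means one optimizer update after `accumulation_steps` batches, it fits a single weight `w` to `y = 2x`, and all names (`train`, `clip_by_norm`, `validate_every`) are hypothetical. The clipping helper mirrors the idea behind `clip_grad_norm_` (rescale when the L2 norm exceeds `max_norm`).

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so that their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]


def train(batches, num_epochs, accumulation_steps, validate_every=1,
          lr=0.1, max_norm=1.0):
    """Toy loop: fit y = w * x with gradient accumulation and clipping."""
    w = 0.0
    global_step = 0
    validated_epochs = []
    for epoch in range(1, num_epochs + 1):
        accumulated = 0.0
        for i, batch in enumerate(batches, start=1):
            # Gradient of the mean of 0.5 * (w * x - y) ** 2 over the batch.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            # Average over the accumulated batches instead of summing.
            accumulated += grad / accumulation_steps
            if i % accumulation_steps == 0:
                (clipped,) = clip_by_norm([accumulated], max_norm)
                w -= lr * clipped  # one optimizer update == one global step
                accumulated = 0.0
                global_step += 1
        # Leftover batches that do not fill a full accumulation window are
        # dropped at the end of the epoch in this simplified version.
        if epoch % validate_every == 0:
            validated_epochs.append(epoch)  # validation / saving would go here
    return w, global_step, validated_epochs


batches = [[(1.0, 2.0), (2.0, 4.0)]] * 4  # 4 batches per epoch
w, global_step, validated = train(batches, num_epochs=2,
                                  accumulation_steps=2, validate_every=2)
print(global_step, validated)  # 4 [2]
```

With 4 batches per epoch and `accumulation_steps=2`, each epoch yields 2 global steps, which shows why a save schedule based on global steps shifts whenever the batch size or the accumulation setting changes.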

## Other Useful Links / Info
1. Trainer Logic Code samples:
   * https://github.com/uds-lsv/bert-stable-fine-tuning/blob/master/examples/bert_stable_fine_tuning/run_finetuning.py
   * https://towardsdatascience.com/how-to-make-the-most-out-of-bert-finetuning-d7c9f2ca806c
   * https://www.pluralsight.com/guides/data-visualization-deep-learning-model-using-matplotlib
