{"id":29795464,"url":"https://github.com/ubisoft/ubisoft-laforge-toxbuster","last_synced_at":"2025-07-28T04:11:06.504Z","repository":{"id":200845924,"uuid":"659403357","full_name":"ubisoft/ubisoft-laforge-toxbuster","owner":"ubisoft","description":null,"archived":false,"fork":false,"pushed_at":"2023-06-27T19:00:55.000Z","size":961,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-06-11T00:23:13.747Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ubisoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"License.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-06-27T18:58:52.000Z","updated_at":"2024-01-15T09:13:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"1e4fde58-3e27-459f-a3b7-bb11b218195a","html_url":"https://github.com/ubisoft/ubisoft-laforge-toxbuster","commit_stats":null,"previous_names":["ubisoft/ubisoft-laforge-toxbuster"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ubisoft/ubisoft-laforge-toxbuster","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ubisoft%2Fubisoft-laforge-toxbuster","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ubisoft%2Fubisoft-laforge-toxbuster/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ubisoft%2Fubisoft-laforge-toxbuster/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ubisoft%2Fubisoft-laforge-toxbuster/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ubisoft","d
ownload_url":"https://codeload.github.com/ubisoft/ubisoft-laforge-toxbuster/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ubisoft%2Fubisoft-laforge-toxbuster/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267459931,"owners_count":24090783,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-28T04:11:05.602Z","updated_at":"2025-07-28T04:11:06.496Z","avatar_url":"https://github.com/ubisoft.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"© [2023] Ubisoft Entertainment. All Rights Reserved\n\n# ToxBuster\n\nWe aim to use language models to identify and classify toxicity inside in-game chat.\n\n## Project Standards\n\nFor better collaboration and understanding of the project code and what has been done, the following sections outline what standards / loose rules are followed.\n\n\n### Project File Layout\nFollow something similar to under src [packaging-projects](https://packaging.python.org/en/latest/tutorials/packaging-projects/).\n\n```\n\u003e src\n   \u003e module_1\n      __init__.py\n   \u003e module_2\n   __init__.py\n\u003e tests\nmain.py\npoetry.lock\n```\n\n### Code Simple Guide\n1. Keep it simple stupid.\n2. 
Don't add features or over-engineer until you need them.\n\n\n### Code Standards\n\n| Name | Value | Links | Notes |\n| :--- | :--- | :--- | :--- |\n| Language | Python 3.x | | |\n| Package Manager | Poetry | [Docs](https://python-poetry.org/docs/) ; [Useful TLDR](https://hackersandslackers.com/python-poetry-package-manager/) | On Windows, you may have to restart your computer after installing Poetry for it to work with VSCode. |\n| Python Env | conda | | |\n| Code Linter | Pep8 | [Enable for VSCode](https://code.visualstudio.com/docs/python/linting) | |\n| Docstring | Pep8 | Follow [NumPy's Style](https://numpydoc.readthedocs.io/en/latest/format.html) | |\n| Unit Tests | Python 3.x | [Sample](https://stackoverflow.com/questions/61151/where-do-the-python-unit-tests-go) | |\n\n### Poetry\n\n1. Test if the Poetry package manager is up to date:\n   ```Powershell\n   poetry run python .\\main.py Train --config \".\\train\\train_on_CONDA_no_context.json\" --max_epochs_to_train 1\n   ```\n   Note: Poetry installs torch with CPU support only (no CUDA support).\n\n   [Issue 4231](https://github.com/python-poetry/poetry/issues/4231)\n   -\u003e You may have to install PyTorch with CUDA separately.\n\n2. Use `poetry add` to add missing packages to `pyproject.toml` \u0026 `poetry.lock`.\n\n3. Use `poetry export -f requirements.txt \u003e requirements.txt` to update `requirements.txt`.\n\n## Understanding our model\nWe want our model to classify spans of words as non-toxic or as a specific category of toxicity. For this use case, the model is currently a token classification model.
\n\n### Basic information:\n* Current model is `bert-base-uncased`.\n* Tokenizer configs can be found [here](https://huggingface.co/distilbert-base-uncased/raw/main/tokenizer.json).\n* HuggingFace [Token Classification](https://huggingface.co/docs/transformers/tasks/token_classification)\n\n\n### Collate Function\n* HuggingFace Tokenization Documentation: https://huggingface.co/docs/tokenizers/pipeline\n* Useful Stack Overflow question: https://stackoverflow.com/questions/65246703/how-does-max-length-padding-and-truncation-arguments-work-in-huggingface-bertt\n\n\n## Trainer Logic / Terminology\n1. Epoch: One full pass over the training dataset.\n2. Batch Size: Number of samples to train on at once, limited by the memory size of the CPU / GPU.\n   * `per_gpu_batch_size`: number of samples to run on each GPU when more than one is used. The batch size is then `num_gpu` * `per_gpu_batch_size`.\n3. Global Step: Number of batches before the model calculates the gradient \u0026 performs back propagation.\n   * To [prevent exploding gradients](https://neptune.ai/blog/understanding-gradient-clipping-and-how-it-can-fix-exploding-gradients-problem), we use [`clip_grad_norm_`](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html) \u0026 accumulate batches. Note that clipping bounds the gradient norm; it does not address vanishing gradients.\n   * [Gradient Accumulation](https://towardsdatascience.com/gradient-accumulation-overcoming-memory-constraints-in-deep-learning-36d411252d01) is performed at every global step.\n4. Validation Loop:\n   * We run validation every `X` epochs. If we follow the paper, validation would run 10 times per epoch.\n   * Push metrics to TensorBoard.\n   * For typical ML models, validation runs every epoch or every `X` epochs.\n5. Save model at the end of every `X` epochs:\n   * Changed from saving per global step, since the global step depends on two config variables and can be inconsistent.\n   * Can be changed back if we save all the configs.\n\n\n## Other Useful Links / Info\n1. 
Trainer Logic Code samples:\n   * https://github.com/uds-lsv/bert-stable-fine-tuning/blob/master/examples/bert_stable_fine_tuning/run_finetuning.py\n   * https://towardsdatascience.com/how-to-make-the-most-out-of-bert-finetuning-d7c9f2ca806c\n   * https://www.pluralsight.com/guides/data-visualization-deep-learning-model-using-matplotlib\n\n© [2023] Ubisoft Entertainment. All Rights Reserved","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fubisoft%2Fubisoft-laforge-toxbuster","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fubisoft%2Fubisoft-laforge-toxbuster","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fubisoft%2Fubisoft-laforge-toxbuster/lists"}