{"id":22694604,"url":"https://github.com/kaspercools/tiktok-offensive-language-classifier","last_synced_at":"2025-03-29T17:26:38.133Z","repository":{"id":157668499,"uuid":"608879152","full_name":"kaspercools/tiktok-offensive-language-classifier","owner":"kaspercools","description":"ML fine-tuning/eval code (pytorch) with hyperparameter arguments","archived":false,"fork":false,"pushed_at":"2023-09-09T11:37:37.000Z","size":1140,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-04T18:42:06.194Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kaspercools.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-02T23:18:16.000Z","updated_at":"2024-10-08T15:29:23.000Z","dependencies_parsed_at":"2025-02-04T18:45:07.664Z","dependency_job_id":null,"html_url":"https://github.com/kaspercools/tiktok-offensive-language-classifier","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaspercools%2Ftiktok-offensive-language-classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaspercools%2Ftiktok-offensive-language-classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaspercools%2Ftiktok-offensive-language-classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/kaspercools%2Ftiktok-offensive-language-classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kaspercools","download_url":"https://codeload.github.com/kaspercools/tiktok-offensive-language-classifier/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246217987,"owners_count":20742295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-10T03:08:50.972Z","updated_at":"2025-03-29T17:26:38.120Z","avatar_url":"https://github.com/kaspercools.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Detecting offensive content on TikTok\n\nby\nKasper Cools, in partial fulfillment of the requirements for the degree of Master of Science in Computer Science\n\n\u003e This repository and the associated Python code and Jupyter notebooks are hereby published as part of my Master's\n\u003e thesis \"Tick Tock, The clock is ticking. On the fine-tuning of Machine Learning models for offensive content\n\u003e classification on TikTok\"\n\nSupervisors:\n\n- [Gideon Maillette de Buy Wenniger](https://scholar.google.nl/citations?user=7X7QIrgAAAAJ\u0026hl=en)\n- [Clara Maathuis](https://scholar.google.com/citations?user=WqR3BVwAAAAJ\u0026hl=en)\n\n## Abstract\n\nThe prevalence of social media and technology has transformed the way young individuals, particularly Generation Z, consume information and interact. 
However, this also provides extremist groups with a platform to spread their propaganda and radicalization content and recruit new members, which has created a challenge for governments, organizations, and social media in general. The Christchurch mosque attacks capture the urgency of this issue. In 2019, Brenton Tarrant, a self-radicalized individual, carried out a deadly terrorist attack at two mosques in New Zealand, live streaming his shooting spree on Facebook and promoting it in his manifesto entitled \"The Great Replacement\".\n\nDespite TikTok's efforts to remove violent extremist content, white supremacist propaganda and hate speech remain widespread on the platform. This research addresses the challenges social media platforms like TikTok face in detecting radicalization content by developing a set of computational machine learning models that help identify offensive language. A TikTok-specific dataset was compiled manually to develop and fine-tune a large language classification model (BERT) and machine learning models, such as Naive Bayes and logistic regression, to detect offensive content. The results demonstrate that fine-tuned large language classification models outperform fine-tuned machine learning models. Finally, the model's generalization capabilities were evaluated by measuring its performance on an unseen dataset derived from previous research by Waseem and Hovy [2016a] and Davidson et al. 
[2017b].\n\n## Presentations\n\n- April 19, 2023: [Young Talents International Conference 2023](http://ytic.eu)\n- September 22, 2023: [CLIN33](https://clin33.uantwerpen.be)\n\n## Software implementation\n\n\u003e This repository contains all Python code used to train, test and evaluate our BERT model as well as our baseline\n\u003e Machine Learning models.\n\u003e The code, as-is, uses the HuggingFace [bert-base-uncased](https://huggingface.co/bert-base-uncased) model and training\n\u003e is configured to use a maximum of 150 tokens to match the needs of our particular use case.\n\nFor the purpose of this research we wanted to measure the possible impact of adding custom tokens for specific Gen-Z\nslang and emojis used on TikTok. The training method therefore receives two boolean parameters that indicate whether to add\nemoji tokenization or slang tokens. The emoji tokenization can be used as-is, but for slang you will need to either use\nour [slang dataset](https://github.com/kaspercools/genz-dataset) or provide your own dictionary.\n\nThe `F_SCORE_THRESHOLD` constant is used to limit the number of snapshots that are stored to disk during longer training\nsessions. If the model's F-score is lower than the given threshold, only the training results are stored to disk. If\nthe F-score exceeds the threshold, both the results and a model snapshot are saved to disk.\n\n## Getting the code\n\nYou can download a copy of all the files in this repository by cloning the\n[git](https://git-scm.com/) repository:\n\n    git clone https://github.com/kaspercools/tiktok-offensive-language-classifier\n\n## Datasets\n\nEven though a vast amount of research has been performed on the detection\nof offensive language on social media, we did not find any datasets that suit our specific\nuse case. 
Most of the available datasets have focussed on data collected from platforms of\nwhich the majority of users are not part of Gen-Z.\nMore specifically, the ideal dataset would contain data that is representative of the vast majority of\nyounger users on TikTok, especially those under the age of 27 (the Zoomer generation).\n\nFor the purpose of this research we collected a total of 3,138 TikTok video posts, which were then used to\ncollect a total of 120,423 comments over the course of 4 months (April 2022 to July 2022). These comments\nwere manually labelled, resulting in a total of 78,181 comments that contained English sentences, consisted solely of\nemojis, or consisted of more universally used expressions such as onomatopoeia. Of these 78,181 comments, **2,034** were labelled\noffensive.\n\nGiven the nature of the data, and taking into account\nthe [TOS of TikTok](https://www.tiktok.com/legal/page/eea/terms-of-service/en), we are not able to make our dataset\npublicly available.\nThe dataset that can be found in the data folder was harvested\nfrom [another GitHub page/research](https://github.com/dhavalpotdar/detecting-offensive-language-in-tweets) and is used to\nevaluate our model's performance on unseen data.\n\n## Docker setup\n\nThe easiest way to get started is by setting up your environment using Docker.\nTo do so, you first need to build an image locally. Open a terminal window, navigate to the root of this repository,\nand execute the following command:\n\u003e docker build . 
-t ou_ml_tiktok\n\nOnce the Docker image is built, you can run the Python code with the following command to start training:\n\n```shell\ndocker run --rm -it --init \\\n  --user=\"$(id -u):$(id -g)\" \\\n  --volume=\"$PWD:/app\" \\\n  ou_ml_tiktok python3 src/main.py -i \"data/comments_anonymous.csv\"\n```\n\n### Running on a GPU\n\n```shell\ndocker run --rm -it --init \\\n  --gpus=all \\\n  --user=\"$(id -u):$(id -g)\" \\\n  --volume=\"$PWD:/app\" \\\n  ou_ml_tiktok python3 src/main.py -i \"data/comments_anonymous.csv\"\n```\n\n## Environment setup\n\n### Dependencies\n\nYou'll need a working Python environment to run the code.\nThe recommended way to set up your environment is through the\n[Anaconda Python distribution](https://www.anaconda.com/download/) which\nprovides the `conda` package manager.\nAnaconda can be installed in your user directory and does not interfere with\nthe system Python installation.\n\nWe use `conda` virtual environments to manage the project dependencies in\nisolation.\nThus, you can install our dependencies without causing conflicts with your\nsetup (even with different Python versions).\n\nRun the following commands in the main folder to create a separate environment and install all required\ndependencies in it:\n\n    conda env create --name ENVIRONMENT_NAME\n    conda activate ENVIRONMENT_NAME\n    pip install -r requirements.txt\n\nNote that `ENVIRONMENT_NAME` is an arbitrary name for your own reference, so you can use any name you want.\nIt is advised to perform any training on your own datasets on a GPU.\n\n## Running the code\n\nIf you wish, you can pass your own hyperparameters for fine-tuning the training process [Devlin et al., 2018]:\n\n```    \n    main.py \n        -i \u003cinputfile\u003e \n        -o \u003coutputdir\u003e \n        -l \u003clearning rate\u003e \n        -a \u003cadam_epsilon\u003e\n        -v \u003cvalidation ratio\u003e \n        -e \u003cepochs\u003e \n        -b \u003cbatch size\u003e\n        -t 
\u003cmax token length\u003e \n        -n \u003cnumber of iterations\u003e\n        -m (includes emoji tokenization)\n        -c \u003ccustom vocabulary file\u003e\n````\n\nThe only required parameter is the input CSV dataset; the other values default to:\n\n- batch_size = 32\n- learning_rate = 5e-5\n- adam_epsilon = 1e-08\n- val_ratio = 0.2\n- epochs = 2\n- output_dir = 'models'\n- data_folder = 'data'\n- iterations = 100\n- max_token_len = 150\n\nThe training sequence checks for CUDA support; if CUDA is not available, the CPU is used for training.\n\nAnother way to explore and use the code is through Jupyter Notebook.\nTo do this, you must first start the notebook server by going into the\nrepository top level and running:\n\n    jupyter notebook\n\nThis will start the server and open your default web browser to the Jupyter\ninterface.\n\nThe notebook is divided into cells (some have text while others have code).\nEach cell can be executed using `Shift + Enter`.\nExecuting text cells does nothing and executing code cells runs the code\nand produces its output.\nTo execute the whole notebook, run all cells in order.\n\n## Related submodules\n\nThe submodules added to this repo are some of the key scripts used for collecting and processing the data in a (semi)automatic way. These were used to quickly and continuously scan, retrieve and collect data. The following submodules have therefore been linked to this project:\n\n- [genz-dataset](https://github.com/kaspercools/genz-dataset/tree/ffbb4f0594a3792e95de16f0243deef1b43c512c)\n- [weaponized-word-collector](https://github.com/kaspercools/weaponized-word-collector)\n- [bright-data-collector](https://github.com/kaspercools/bright-data-collector/tree/287979cf8cacb691fa39325aeb002d13c4ca9f15)\n- [tiktok-selenium-crawler](https://github.com/kaspercools/tiktok-selenium-crawler/tree/e2f19e81ea44fdcb4054f04918e1c4447f4f6bdf)\n\n## License\n\nAll source code is made available under a BSD 3-clause license. 
You can freely\nuse and modify the code, without warranty, so long as you provide attribution\nto the authors. See `LICENSE` for the full license text.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaspercools%2Ftiktok-offensive-language-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkaspercools%2Ftiktok-offensive-language-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaspercools%2Ftiktok-offensive-language-classifier/lists"}