https://github.com/teddyoweh/cheat-model
NLP Text Binary Probabilistic Classification Model for predicting cheat statements
data-science machine-learning nlp tokenizer
- Host: GitHub
- URL: https://github.com/teddyoweh/cheat-model
- Owner: teddyoweh
- Created: 2022-12-18T02:52:02.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-18T14:43:11.000Z (almost 3 years ago)
- Last Synced: 2025-05-29T15:35:14.451Z (4 months ago)
- Topics: data-science, machine-learning, nlp, tokenizer
- Language: Jupyter Notebook
- Homepage:
- Size: 27.3 KB
- Stars: 8
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Cheat Detection Model
NLP text classification model for predicting cheat statements.

This natural language processing text classification model uses various algorithms to predict whether a sentence closely resembles a student asking for answers in a group chat.
This model was developed for [Afterhours](https://github.com/Tarleton-Computer-Society/afterhours), a social network that connects students with the same majors, built by the [Tarleton Computer Society](https://github.com/Tarleton-Computer-Society).
The project uses pandas and numpy for data manipulation, seaborn and matplotlib for data visualization, and pickle for model serialization. For text preprocessing it relies on several nltk (Natural Language Toolkit) modules: word_tokenize for splitting text into individual words, the stopwords corpus for removing common stopwords, the SnowballStemmer for stemming, and the WordNetLemmatizer for lemmatizing words.
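A minimal sketch of that preprocessing pipeline, assuming English text; the `preprocess` helper and the sample sentence are illustrative, not taken from the repository:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources these modules need.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Tokenize, drop stopwords/punctuation, then stem and lemmatize each token."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]
    return " ".join(tokens)

print(preprocess("Can someone send me the answers for question 3?"))
```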
For model building and evaluation, the project uses several sklearn (Scikit-learn) utilities. train_test_split divides the data into training and testing sets with a specified test size and random state, and three classifiers based on different mathematical algorithms are compared: Logistic Regression, SGDClassifier, and MultinomialNB. Performance is assessed with several evaluation metrics: the classification report, F1 score, accuracy score, confusion matrix, ROC curve, AUC, and ROC AUC score.
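A hedged sketch of that split-train-evaluate loop; the placeholder features, test size, and random seed are assumptions, and in the notebook the features come from the vectorization step described in the next paragraph:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score

# Placeholder features/labels so the sketch runs standalone.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20))  # non-negative counts (MultinomialNB requires >= 0)
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # exact split and seed are assumptions
)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGDClassifier": SGDClassifier(),
    "MultinomialNB": MultinomialNB(),
}

results = {}  # name -> (fitted model, F1 score)
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    results[name] = (clf, f1_score(y_test, preds))
    print(name)
    print(classification_report(y_test, preds))
```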
In addition to these traditional machine learning algorithms, the model uses two feature extraction techniques: bag-of-words representations and word embeddings. For bag of words, TfidfVectorizer and CountVectorizer from sklearn's feature_extraction library convert the text data into numerical feature vectors; for word embeddings, gensim's Word2Vec algorithm learns dense vector representations of words from their contexts in the text data.
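A small sketch of both feature-extraction routes; the two example sentences and the Word2Vec hyperparameters are illustrative choices, not values from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec

texts = [
    "can someone send the homework answers",
    "who wants to study together tonight",
]

# Bag of words: sparse count and TF-IDF matrices (one row per document).
counts = CountVectorizer().fit_transform(texts)
tfidf = TfidfVectorizer().fit_transform(texts)

# Word embeddings: Word2Vec learns a dense vector per word from tokenized sentences.
tokenized = [t.split() for t in texts]
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1)
print(w2v.wv["answers"][:5])  # first few dimensions of one word's vector
```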
Once the data is fully preprocessed and the features are extracted, each classifier is trained and evaluated on the extracted features and their corresponding labels. The classifier with the highest evaluation metric scores is chosen as the final model and saved with pickle for future use.
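A hedged sketch of the final selection-and-save step, assuming the `results` mapping built in the evaluation loop above; the output filename is an assumption:

```python
import pickle

# Pick the classifier with the highest F1 score from the evaluation loop.
best_name = max(results, key=lambda name: results[name][1])
best_model, best_score = results[best_name]
print(f"Best model: {best_name} (F1 = {best_score:.3f})")

# Serialize the winning model for later reuse.
with open("cheat_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Later: reload the pickled model for inference.
with open("cheat_model.pkl", "rb") as f:
    model = pickle.load(f)
```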