https://github.com/teddyoweh/cheat-model
NLP Text Binary Probabilistic Classification Model for predicting cheat statements
data-science machine-learning nlp tokenizer
- Host: GitHub
- URL: https://github.com/teddyoweh/cheat-model
- Owner: teddyoweh
- Created: 2022-12-18T02:52:02.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-18T14:43:11.000Z (almost 3 years ago)
- Last Synced: 2025-05-29T15:35:14.451Z (4 months ago)
- Topics: data-science, machine-learning, nlp, tokenizer
- Language: Jupyter Notebook
- Homepage:
- Size: 27.3 KB
- Stars: 8
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Cheat Detection Model
NLP text classification model for predicting cheat statements.

This natural language processing text classification model uses various algorithms to predict whether a sentence closely resembles a student asking for answers in a group chat.
This model was developed for [Afterhours](https://github.com/Tarleton-Computer-Society/afterhours), a social network that connects students with the same majors, built by the [Tarleton Computer Society](https://github.com/Tarleton-Computer-Society).
The project uses pandas and numpy for data manipulation, seaborn and matplotlib for data visualization, and pickle for model serialization. For text preprocessing it relies on several nltk (Natural Language Toolkit) modules: word_tokenize for splitting text into individual words, the stopwords corpus for removing common stopwords, the SnowballStemmer for stemming, and the WordNetLemmatizer for lemmatizing words.
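A minimal sketch of that preprocessing pipeline, assuming English text; the `preprocess` helper and the sample sentence are illustrative, not taken from the repository:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# One-time downloads of the NLTK resources these modules need.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Tokenize, drop stopwords/punctuation, then stem and lemmatize each token."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]
    return " ".join(tokens)

print(preprocess("Can someone send me the answers for question 3?"))
```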
For model building and evaluation, the project uses several sklearn (Scikit-learn) utilities. train_test_split divides the data into training and testing sets with a specified test size and random state, and three classifiers based on different mathematical algorithms are compared: Logistic Regression, SGDClassifier, and MultinomialNB. Performance is assessed with several evaluation metrics: the classification report, F1 score, accuracy score, confusion matrix, ROC curve, AUC, and ROC AUC score.
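A hedged sketch of that split-train-evaluate loop; the placeholder features, test size, and random seed are assumptions, and in the notebook the features come from the vectorization step described in the next paragraph:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score

# Placeholder features/labels so the sketch runs standalone.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 20))  # non-negative counts (MultinomialNB requires >= 0)
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # exact split and seed are assumptions
)

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SGDClassifier": SGDClassifier(),
    "MultinomialNB": MultinomialNB(),
}

results = {}  # name -> (fitted model, F1 score)
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    results[name] = (clf, f1_score(y_test, preds))
    print(name)
    print(classification_report(y_test, preds))
```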
In addition to these traditional machine learning algorithms, the model uses two feature extraction techniques: bag-of-words representations and word embeddings. For bag of words, TfidfVectorizer and CountVectorizer from sklearn's feature_extraction library convert the text data into numerical feature vectors; for word embeddings, gensim's Word2Vec algorithm learns dense vector representations of words from their contexts in the text data.
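A small sketch of both feature-extraction routes; the two example sentences and the Word2Vec hyperparameters are illustrative choices, not values from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec

texts = [
    "can someone send the homework answers",
    "who wants to study together tonight",
]

# Bag of words: sparse count and TF-IDF matrices (one row per document).
counts = CountVectorizer().fit_transform(texts)
tfidf = TfidfVectorizer().fit_transform(texts)

# Word embeddings: Word2Vec learns a dense vector per word from tokenized sentences.
tokenized = [t.split() for t in texts]
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1)
print(w2v.wv["answers"][:5])  # first few dimensions of one word's vector
```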
Once the data is fully preprocessed and the features are extracted, each classifier is trained and evaluated on the extracted features and their corresponding labels. The classifier with the highest evaluation metric scores is chosen as the final model and saved with pickle for future use.
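A hedged sketch of the final selection-and-save step, assuming the `results` mapping built in the evaluation loop above; the output filename is an assumption:

```python
import pickle

# Pick the classifier with the highest F1 score from the evaluation loop.
best_name = max(results, key=lambda name: results[name][1])
best_model, best_score = results[best_name]
print(f"Best model: {best_name} (F1 = {best_score:.3f})")

# Serialize the winning model for later reuse.
with open("cheat_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Later: reload the pickled model for inference.
with open("cheat_model.pkl", "rb") as f:
    model = pickle.load(f)
```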