Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/adesoji1/coding-challenge-interview---prompt-classifier-model---may-6-2024
Create a classifier that will accurately classify a list of reddit comments into the proper labels.
- Host: GitHub
- URL: https://github.com/adesoji1/coding-challenge-interview---prompt-classifier-model---may-6-2024
- Owner: Adesoji1
- Created: 2024-05-22T16:35:20.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-05-23T09:24:33.000Z (6 months ago)
- Last Synced: 2024-05-23T10:30:19.114Z (6 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 21.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Coding-Challenge-Interview---Prompt-Classifier-Model---May-6-2024
Create a classifier that will accurately classify a list of reddit comments into the proper labels.
# Multilabel Text Classification Using CNN and Bi-LSTM
![Text Classification](https://img.icons8.com/ios/452/language.png)

This project demonstrates how to build a multilabel text classification model using a combination of Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory networks (Bi-LSTM). The model is trained using the TensorFlow and Keras libraries. We also use GloVe word embeddings to enhance the model's performance and class weights to handle class imbalance.
## Table of Contents
- [Overview](#overview)
- [Database](#database)
- [Dataset](#dataset)
- [Installation](#installation)
- [Data Preprocessing](#data-preprocessing)
- [Model Architecture](#model-architecture)
- [Training](#training)
- [Evaluation](#evaluation)
- [Usage](#usage)
- [Streamlit Web App](#streamlit-web-app)
- [Why I Use GPT-2?](#why-i-use-gpt-2)
- [Text Generation](#text-generation)
- [Requirements](#requirements)
- [Screenshots](#screenshots)
- [Accuracy](#accuracy)
- [References](#references)

## Overview
Multilabel text classification involves assigning multiple labels to a given piece of text. This project utilizes a combination of CNN and Bi-LSTM networks to capture both local and sequential patterns in text data.
### Database
This covers logging into the database and making queries: updating `reddit_username_comments` with the new labels ![label_update](images/label_update.png), while leaving the other values in the labels column unlabelled as instructed [here](images/label_blank.png). The Python code for exporting the data from the database to CSV is [here](multilabel-text-classification/exportdatatocsv.py), the database connection code is [here](multilabel-text-classification/database.py), and the code for updating labels in the `reddit_user_comments` table is [here](multilabel-text-classification/updatelabelindb.py). The requirements.txt for these scripts is located [here](multilabel-text-classification/requirementsfordatabaseconnectionusingpython.txt). Most of my rough work for connecting to the database is in [this notebook](multilabel-text-classification/test.ipynb), because I originally did the work there, so you can see the outputs.
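The linked scripts are the authoritative versions; the following is only a minimal sketch of the same workflow, assuming a PostgreSQL table named `reddit_user_comments` with `username`, `comment`, and `label` columns. Connection details, the example label, and the example username are placeholders.

```python
import pandas as pd
import psycopg2

# Placeholder connection details -- see the Caution section at the end of this README
# for the environment-variable version. Table and column names are assumptions.
conn = psycopg2.connect(host="localhost", dbname="reddit", user="postgres", password="...")

# Export the comments table to a CSV file (roughly what exportdatatocsv.py does).
df = pd.read_sql("SELECT username, comment, label FROM reddit_user_comments", conn)
df.to_csv("reddit_user_comments.csv", index=False)

# Update a single row's label (roughly what updatelabelindb.py does).
with conn.cursor() as cur:
    cur.execute(
        "UPDATE reddit_user_comments SET label = %s WHERE username = %s",
        ("Politics", "example_user"),  # illustrative values only
    )
conn.commit()
conn.close()
```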
### Dataset
The dataset was obtained from the PostgreSQL database and saved as a CSV file; a Python script like [this one](multilabel-text-classification/exportdatatocsv.py) can do the job. A sample of the dataset is available [here](Dataset).
### Why GloVe Embeddings?
GloVe (Global Vectors for Word Representation) embeddings are used because they capture semantic relationships between words, providing richer contextual information than simple one-hot encoding or TF-IDF vectors. I used the 300-dimensional vectors, which are located [here](https://drive.google.com/file/d/1UVGifJPfMaYcW6NXNB3PiZAir6fof0tf/view?usp=sharing) on Google Drive.
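The training scripts build the `embedding_matrix` that the model's embedding layer expects; below is only a rough sketch of that step. It assumes a locally downloaded 300D GloVe text file (the file name is illustrative) and a fitted Keras `tokenizer` such as the one shown in the preprocessing code further down; `max_features` is an assumed vocabulary cap.

```python
import numpy as np

embed_size = 300        # GloVe 300D vectors
max_features = 20000    # assumed vocabulary cap

# Read the GloVe file into a word -> vector lookup.
embeddings_index = {}
with open("glove.6B.300d.txt", encoding="utf-8") as f:  # path/name is an assumption
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# Build the matrix consumed by the Embedding layer; rows follow the tokenizer's word index.
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((min(max_features, vocab_size), embed_size))
for word, i in word_index.items():
    if i < min(max_features, vocab_size):
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector
```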
### Why Class Weights?
Class weights are employed to handle class imbalance, ensuring that the model does not become biased towards the majority class. This is critical in real-world scenarios where certain classes may be underrepresented.
## Installation
To set up the project, clone the repository and install the required dependencies:
```sh
git clone https://github.com/Adesoji1/Coding-Challenge-Interview---Prompt-Classifier-Model---May-6-2024.git
cd multilabel-text-classification
pip install -r requirements.txt
```

## Data Preprocessing
1. **Normalization**: Text normalization is performed to clean and standardize the input text. This includes converting to lowercase and removing punctuation and stop words. A sketch of a possible `normalize_text` helper is shown after this list.
2. **Encoding**: Labels are encoded using `LabelEncoder` and `OneHotEncoder` to facilitate training.
3. **Tokenization**: Text data is tokenized and padded to ensure uniform input length for the neural network.
4. **Label Imbalance**: To balance the imbalanced labels seen here ![Label Distribution](images/label_distribution.png), I used class weights from scikit-learn to give more weight to the minority classes.
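The `normalize_text` helper used throughout the scripts is not reproduced in this README; a minimal sketch of what such a helper might look like, assuming NLTK stop words (NLTK is listed in requirements.txt), is:

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop stop words (illustrative version)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(word for word in text.split() if word not in STOP_WORDS)
```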
### Why is this important?
- **Normalization** ensures that the text data is clean and consistent.
- **Encoding** is crucial for converting categorical labels into a numerical format that the model can understand.
- **Tokenization** transforms the text into sequences of numbers, making it suitable for input to the neural network.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.utils import class_weight

# Normalize the comments (X holds the comment text and y the labels, e.g. from the exported CSV)
X = X.apply(normalize_text)

# Encode labels to integers for computing class weights and later for one-hot encoding
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Compute class weights
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_encoded), y=y_encoded)
class_weights_dict = dict(enumerate(class_weights))

# One-hot encode the labels for use in model training
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
y_onehot = onehot_encoder.fit_transform(y_encoded.reshape(-1, 1))
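
# (Not shown in the original snippet, added as an illustrative sketch:) tokenize the
# normalized comments and pad them to a fixed length so they can feed the embedding layer.
# `max_features`, `max_len`, and the train/test split are assumptions; the training
# scripts linked below contain the authoritative values.
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

max_features = 20000
max_len = 1020  # matches the maxlen used in the inference code further down

tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(X)

X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)
X_train_padded = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=max_len)
X_test_padded = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=max_len)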
```

### Model Architecture
The model architecture includes:
- **Embedding Layer**: Uses pre-trained GloVe embeddings to convert words into dense vectors.
- **SpatialDropout1D Layer**: Prevents overfitting by randomly setting entire feature maps to zero.
- **Conv1D Layer**: Extracts local patterns in text.
- **Bidirectional LSTM Layer**: Captures long-term dependencies and context in both forward and backward directions.
- **GlobalAveragePooling1D Layer**: Reduces the output dimensionality.
- **Dense Layers**: Fully connected layers for classification.

### Why this architecture?
- **Embedding Layer** with GloVe vectors helps in understanding the semantic meaning of words.
- **SpatialDropout1D** helps in regularization, making the model more robust.
- **Conv1D** captures local dependencies and patterns in the text.
- **Bi-LSTM** handles long-term dependencies and context in both directions.
- **GlobalAveragePooling1D** reduces the dimensionality without losing important features.
- **Dense Layers** are used for final classification.

The graphical architecture is shown [here](images/architecture.png), while the code representation is below:

```python
from tensorflow.keras.layers import (Bidirectional, Conv1D, Dense, Dropout, Embedding,
                                     GlobalAveragePooling1D, Input, LSTM, SpatialDropout1D)
from tensorflow.keras.models import Model

# Define the model
sequence_input = Input(shape=(max_len,))
x = Embedding(min(max_features, vocab_size), embed_size, weights=[embedding_matrix], trainable=False)(sequence_input)
x = SpatialDropout1D(0.2)(x)
x = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x)
x = Bidirectional(LSTM(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
avg_pool = GlobalAveragePooling1D()(x)
x = Dense(128, activation='relu')(avg_pool)
x = Dropout(0.1)(x)
preds = Dense(y_train.shape[1], activation="softmax")(x)

model = Model(sequence_input, preds)
# The loss and optimizer below are assumptions; categorical cross-entropy matches the softmax output.
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```

## Training
First, I trained the model without preprocessing and got a test accuracy of 94%, as seen ![below](images/trained_withoutpreprocessing.png), with the model accuracy plot ![here](images/modelwithoutpreprocessingmodelaccuracy.png) and the model loss ![here](images/modelwithoutpreprocessingmodelloss.png). Training took about 166 minutes for 10 epochs on a 6 GB NVIDIA graphics card. The training script can be found [here](multilabel-text-classification/firstmodel.py). Some of the prediction results are in the [rough work](multilabel-text-classification/test3.ipynb) and are also shown below:
![firstmodel](images/firstmodelpred.png)
![firstmodel](images/firstmodelpredi.png)
![firstmodel](images/firstmodelpredic.png)
The second model, still using the same architecture as the first model, is the better model: it trained for 97 minutes 29.6 seconds and reached a test accuracy of 97% when the data was preprocessed before training, meaning that additional noise had been removed from the dataset. Screenshots of the training, model performance, and inference are shown below.
![secondmodel](images/secondmodelepochs.png)
![secondmodel](images/secondmodelaccuracy.png)
![secondmodel](images/secondmodelloss.png)
![secondmodel](images/secondmodelpred.png)
The script for training the second model is located [here](multilabel-text-classification/secondmodel.py), while the rough work where the script was originally created is located [here](multilabel-text-classification/test3.ipynb). In addition, I ensured it was trained with class weights to handle class imbalance and with `ModelCheckpoint` to save the best model based on validation accuracy. The model artifacts, which include the Keras model, the tokenizer, and the label file, are located [here](checkpoints/best_modellatest2.keras). For making inference using a CSV as the input file, the script is located [here](multilabel-text-classification/secondmodelinferencefromcsv.py); for a single comment, it's found [here](multilabel-text-classification/secondmodelinferencefromcsv.py).

### Why use class weights?
Class weights ensure that the model gives proper attention to minority classes, preventing it from being biased towards the majority class.
### Why use ModelCheckpoint?
`ModelCheckpoint` saves the best model during training, ensuring that the best performing model is preserved.
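The training call below passes `checkpoint_callback` and `tensorboard_callback`, which are defined in the training scripts rather than reproduced here. A minimal sketch of what they might look like, using the checkpoint path mentioned above and an assumed log directory:

```python
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard

# Save the weights that achieve the best validation accuracy.
checkpoint_callback = ModelCheckpoint(
    "checkpoints/best_modellatest2.keras",
    monitor="val_accuracy",
    save_best_only=True,
    verbose=1,
)

# Log metrics for inspection with TensorBoard (log directory is an assumption).
tensorboard_callback = TensorBoard(log_dir="logs")
```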
```python
history = model.fit(
    X_train_padded,
    y_train,
    batch_size=8,
    epochs=10,
    validation_split=0.1,
    class_weight=class_weights_dict,
    verbose=1,
    callbacks=[checkpoint_callback, tensorboard_callback],
)
```

## Evaluation
After training, the model is evaluated on the test set, and training/validation accuracy and loss are plotted.
```python
import matplotlib.pyplot as plt

# Evaluation
loss, accuracy = model.evaluate(X_test_padded, y_test, verbose=1)
print("Test Accuracy:", accuracy)

# Plot training & validation accuracy values
plt.figure(figsize=(8, 5))
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()

# Plot training & validation loss values
plt.figure(figsize=(8, 5))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()
```

## Usage
To make predictions on new data, load the trained model and preprocess the input text:
```python
import pickle

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the saved model (model_path, label_encoder_path and tokenizer_path point at
# the saved artifacts, e.g. under checkpoints/)
best_model = load_model(model_path)

# Load the LabelEncoder and Tokenizer from disk
with open(label_encoder_path, 'rb') as le_file:
    label_encoder = pickle.load(le_file)

with open(tokenizer_path, 'rb') as tokenizer_file:
    tokenizer = pickle.load(tokenizer_file)

# Predict function
def predict_label(comment):
    comment_normalized = normalize_text(comment)
    comment_seq = tokenizer.texts_to_sequences([comment_normalized])
    comment_padded = pad_sequences(comment_seq, maxlen=1020)
    prediction = best_model.predict(comment_padded)
    predicted_class_index = np.argmax(prediction, axis=1)
    predicted_label = label_encoder.inverse_transform(predicted_class_index)
    return predicted_label[0]
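
# (Illustrative addition, mirroring the CSV inference script linked above:) apply the
# predictor to a whole CSV of comments with pandas. The file and column names here
# are assumptions.
import pandas as pd

df = pd.read_csv("reddit_user_comments.csv")
df["predicted_label"] = df["comment"].apply(predict_label)
df.to_csv("predictions.csv", index=False)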
```

## Streamlit Web App
Deploy a Streamlit web app to interact with the model:
```python
import streamlit as stdef main():
st.title("Text Classification App ๐๐ค")menu = ["Home ๐ ", "Predict ๐ฎ"]
choice = st.sidebar.selectbox("Menu", menu)if choice == "Home ๐ ":
st.subheader("Home ๐ ")
st.write("Welcome to the Text Classification App! ๐ฅณ")
st.write("Use this app to classify comments into different labels. ๐")elif choice == "Predict ๐ฎ":
st.subheader("Predict ๐ฎ")
comment = st.text_area("Enter Comment โ๏ธ", "")
if st.button("Predict ๐"):
if comment:
label = predict_label(comment)
st.success(f"Predicted Label: {label} ๐")
else:
st.warning("Please enter a comment to classify. โ ๏ธ")st.sidebar.subheader("About ๐ก")
st.sidebar.text("Text Classification App ๐๐ค")
st.sidebar.text("Built with Streamlit โค๏ธ")if __name__ == '__main__':
main()
```

Run the Streamlit app with the command below. The Streamlit web application folder is located [here](Webapp_Streamlit), and its predictions are shown in the [Screenshots](#screenshots) section:
```sh
streamlit run app.py
```

## Why I Use GPT-2
I chose to use GPT-2 for text generation in order to save on cost. I used the `set_seed` function in the transformers pipeline to create more comments, usernames, and labels. This approach allowed us to generate the necessary data efficiently and cost-effectively; more details are in the textgenerator.py code.
## Text Generation
I discovered that we could add more training data using the code located at [textgenerator.py](multilabel-text-classification/textgenerator.py); depending on your requirements, this Python code generates more comments, labels, and usernames.
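The linked textgenerator.py is the authoritative version; the following is only a minimal sketch of generating synthetic comments with the Hugging Face `transformers` pipeline and `set_seed`, using an illustrative prompt and generation parameters:

```python
from transformers import pipeline, set_seed

# Fix the seed so the synthetic comments are reproducible.
set_seed(42)

generator = pipeline("text-generation", model="gpt2")

# Prompt and parameters are assumptions; the real script defines its own.
prompt = "Reddit comment about personal finance:"
outputs = generator(prompt, max_length=60, num_return_sequences=3)

for out in outputs:
    print(out["generated_text"])
```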
## Requirements
- Python 3.11 virtual environment
- Ubuntu 23.04
- NVIDIA GPU with 6 GB memory (NVIDIA GeForce RTX 2060)

Training the model took approximately 166 minutes on a dataset of about 3,000 rows, achieving a test accuracy of 97%.
### `requirements.txt`
```
absl-py==2.1.0
asttokens==2.4.1
astunparse==1.6.3
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
comm==0.2.2
contourpy==1.2.1
cycler==0.12.1
debugpy==1.8.1
decorator==5.1.1
executing==2.0.1
filelock==3.14.0
flatbuffers==24.3.25
fonttools==4.51.0
fsspec==2024.5.0
gast==0.5.4
google-pasta==0.2.0
grpcio==1.63.0
h5py==3.11.0
huggingface-hub==0.23.0
idna==3.7
ipykernel==6.29.4
ipython==8.24.0
jedi==0.19.1
joblib==1.4.2
jupyter_client==8.6.1
jupyter_core==5.7.2
keras==3.3.3
kiwisolver==1.4.5
libclang==18.1.1
Markdown==3.6
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.9.0
matplotlib-inline==0.1.7
mdurl==0.1.2
ml-dtypes==0.3.2
namex==0.0.8
nest-asyncio==1.6.0
nltk==3.8.1
numpy==1.26.4
nvidia-cublas-cu12==12.3.4.1
nvidia-cuda-cupti-cu12==12.3.101
nvidia-cuda-nvcc-cu12==12.3.107
nvidia-cuda-nvrtc-cu12==12.3.107
nvidia-cuda-runtime-cu12==12.3.101
nvidia-cudnn-cu12==8.9.7.29
nvidia-cufft-cu12==11.0.12.1
nvidia-curand-cu12==10.3.4.107
nvidia-cusolver-cu12==11.5.4.101
nvidia-cusparse-cu12==12.2.0.103
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
opt-einsum==3.3.0
optree==0.11.0
packaging==24.0
pandas==2.2.2
parso==0.8.4
pexpect==4.9.0
pillow==10.3.0
platformdirs==4.2.1
prompt-toolkit==3.0.43
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
Pygments==2.18.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.3
regex==2024.5.15
requests==2.31.0
rich==13.7.1
safetensors==0.4.3
scikit-learn==1.4.2
scipy==1.13.0
seaborn==0.13.2
six==1.16.0
stack-data==0.6.3
streamlit==1.34.0
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorflow==2.16.1
tensorflow-io-gcs-filesystem==0.37.0
termcolor==2.4.0
tf_keras==2.16.0
threadpoolctl==3.5.0
tokenizers==0.19.1
tornado==6.4
tqdm==4.66.4
traitlets==5.14.3
transformers==4.41.0
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
wcwidth==0.2.13
Werkzeug==3.0.3
wrapt==1.16.0
```
## Screenshots
Kindly view the model accuracy ![this](images/screenshot.png) and the predictions from the Streamlit web app below, showing the client-side and server-side interaction.
![a](images/a.png)
![b](images/b.png)
![c](images/c.png)
![d](images/d.png)
![e](images/e.png)
![f](images/f.png)

## Accuracy
The second model, which is the model of choice, has 97% accuracy, as seen in the screenshot above.
## References
- [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
- [TensorFlow and Keras Documentation](https://www.tensorflow.org/api_docs)
- [Streamlit Documentation](https://docs.streamlit.io/en/stable/)

---
## Caution

For database connections, it's advisable to use environment variables to protect sensitive information from being exposed to the public!
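A minimal sketch of this pattern, replacing the placeholder credentials used in the database sketch earlier; the variable names are assumptions:

```python
import os

import psycopg2

# Credentials come from the environment (e.g. exported in the shell), so they never
# appear in the source code or in version control.
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)
```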
![GitHub](https://img.icons8.com/ios/452/github.png)
Feel free to contact me about this project on GitHub at Adesoji1 or by email!

---
This README provides a comprehensive guide to understanding, setting up, training, evaluating, and deploying the multilabel text classification model using CNN and Bi-LSTM networks.
Happy Coding!