{"id":28296759,"url":"https://github.com/jashdubal/stackoverflow-classifier","last_synced_at":"2026-05-05T04:02:23.963Z","repository":{"id":152114158,"uuid":"618242358","full_name":"jashdubal/stackoverflow-classifier","owner":"jashdubal","description":"Recurrent Neural Networks (RNNs) to classify Stack Overflow posts using PyTorch","archived":false,"fork":false,"pushed_at":"2023-04-22T17:11:02.000Z","size":6999,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-14T11:41:18.358Z","etag":null,"topics":["deep-learning","machine-learning","neural-network","nlp","nlp-machine-learning","nltk","pytorch","rnn","rnn-gru","rnn-lstm","stackoverflow"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jashdubal.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-03-24T03:27:43.000Z","updated_at":"2024-03-17T22:01:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"3c32cb69-4398-406b-815c-96a0edf724d4","html_url":"https://github.com/jashdubal/stackoverflow-classifier","commit_stats":null,"previous_names":["jashdubal/stackoverflow-classifier","jashdubal/stackoverflow-classification"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jashdubal/stackoverflow-classifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jashdubal%2Fstackoverflow-classifier","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ja
shdubal%2Fstackoverflow-classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jashdubal%2Fstackoverflow-classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jashdubal%2Fstackoverflow-classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jashdubal","download_url":"https://codeload.github.com/jashdubal/stackoverflow-classifier/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jashdubal%2Fstackoverflow-classifier/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32634732,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-04T10:08:07.713Z","status":"online","status_checked_at":"2026-05-05T02:00:06.033Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","machine-learning","neural-network","nlp","nlp-machine-learning","nltk","pytorch","rnn","rnn-gru","rnn-lstm","stackoverflow"],"created_at":"2025-05-22T21:20:06.024Z","updated_at":"2026-05-05T04:02:23.953Z","avatar_url":"https://github.com/jashdubal.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Stack Overflow Topic Classifier\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Open in Jupyter 
Notebook](https://img.shields.io/badge/Open%20in-Jupyter%20Notebook-orange)](https://github.com/jashdubal/stackoverflow-classification/blob/main/SO_notebook.ipynb)\n\nThis project demonstrates the classification of Stack Overflow posts into three categories: \"spark\", \"ml\", and \"security\". The performance of two different recurrent neural network (RNN) architectures, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), is compared.\n\n\u003cimg src=assets/rnn-pipeline.drawio.png/\u003e\n\n## Table of Contents\n\n- [Background](#background)\n- [Dataset](#dataset)\n- [How to Run](#how-to-run)\n- [Model Design](#model-design)\n- [Training](#training)\n- [Hyperparameter Tuning](#hyperparameter-tuning)\n- [Results](#results)\n- [License](#license)\n\n## Background\n\nThis repository contains the following files:\n- [`SO_notebook.ipynb`](SO_notebook.ipynb): Jupyter Notebook that contains the code for training and evaluating the LSTM and GRU models on the Stack Overflow dataset.\n- [`dataset/SO.csv`](dataset/SO.csv): Stack Overflow dataset used to train and evaluate the models in `SO_notebook.ipynb`.\n\n## Dataset\n\nThe dataset used in this project is located in the [`dataset/SO.csv`](dataset/SO.csv) file. It contains Stack Overflow post titles and their corresponding labels (\"spark\", \"ml\", or \"security\").\n\nThe dataset consists of 150,000 entries with no missing values, and includes two columns: 'Title' and 'Label'. Both columns are stored as objects (strings).\n\nThe target distribution of the dataset is balanced, with each label having 50,000 samples:\n- spark: 50,000\n- ml: 50,000\n- security: 50,000\n\n## How to Run\n\nThe entire project is implemented in a Jupyter Notebook. To run the project, follow these steps:\n\n1. Clone the repository.\n2. Install the required dependencies using pip. 
You can do this by running the following command:\n\n```shell\npip install torch numpy pandas scikit-learn seaborn matplotlib nltk\n```\n\n3. Open `SO_notebook.ipynb` in Jupyter Notebook or JupyterLab.\n4. Follow the instructions provided in the notebook to train and evaluate the LSTM and GRU models on the Stack Overflow dataset.\n\nNote: Running the entire notebook may take up to 3 hours, depending on your machine's hardware specifications.\n\n## Model Design\n\nTwo RNN architectures are implemented and compared:\n\n1. **LSTM Classifier**: An LSTM-based RNN model to classify Stack Overflow post titles.\n2. **GRU Classifier**: A GRU-based RNN model to classify Stack Overflow post titles.\n\nBoth models are defined using the PyTorch framework, with custom classes `LSTMClassifier` and `GRUClassifier`.\n\n## Training\n\nThe training process is implemented using a custom `train_and_evaluate()` function. The training loop consists of the following steps:\n\n1. Set the model to training mode.\n2. Iterate over the training data in mini-batches.\n3. Perform the forward pass.\n4. Calculate the loss using `CrossEntropyLoss`.\n5. Perform backpropagation to compute gradients.\n6. Update model parameters using the Adam optimizer.\n\n## Hyperparameter Tuning\n\nThe hyperparameters of interest in this project are the hidden dimension and dropout rate. By experimenting with different values for these hyperparameters, we can improve model performance.\n\n## Results\n\nLSTM and GRU were chosen because both are popular RNN variants that are well suited to text classification tasks. 
I compared the two models using ROC curves, confusion matrices, and classification reports.\n\nThe LSTM's average AUC of **0.9359** is slightly higher than the GRU's, indicating that it performs marginally better across all three classes.\n\nThe confusion matrices and classification reports also slightly favour the LSTM over the GRU.\n\n### Receiver Operating Characteristic (ROC) curves\n\n| LSTM Model | GRU Model | Tuned LSTM Model (ndim=256, dr=0.3) |\n|------------|-----------|-----------------------------------------------|\n| ![LSTM ROC](assets/lstm_roc.png) | ![GRU ROC](assets/gru_roc.png) | ![Tuned LSTM ROC](assets/tuned_lstm_roc.png) |\n\n### Confusion matrices\n\n| LSTM Model | GRU Model | Tuned LSTM Model (ndim=256, dr=0.3) |\n|------------|-----------|-----------------------------------------------|\n| ![LSTM CM](assets/lstm_cm.png) | ![GRU CM](assets/gru_cm.png) | ![Tuned LSTM CM](assets/tuned_lstm_cm.png) |\n\n### Classification report\n\n```\nLSTM Model Performance:\n              precision    recall  f1-score   support\n\n       spark     0.9111    0.8986    0.9048     10000\n          ml     0.9085    0.9051    0.9068     10000\n    security     0.9239    0.9400    0.9319     10000\n\n    accuracy                         0.9146     30000\n   macro avg     0.9145    0.9146    0.9145     30000\nweighted avg     0.9145    0.9146    0.9145     30000\n```\n\n```\nGRU Model Performance:\n              precision    recall  f1-score   support\n\n       spark     0.8998    0.9075    0.9036     10000\n          ml     0.9018    0.9014    0.9016     10000\n    security     0.9392    0.9315    0.9353     10000\n\n    accuracy                         0.9135     30000\n   macro avg     0.9136    0.9135    0.9135     30000\nweighted avg     0.9136    0.9135    0.9135     30000\n```\n\n```\nLSTM Tuned Model Performance:\n              precision    recall  f1-score   support\n\n       spark     
0.8868    0.9228    0.9044     10000\n          ml     0.9127    0.9021    0.9074     10000\n    security     0.9530    0.9254    0.9390     10000\n\n    accuracy                         0.9168     30000\n   macro avg     0.9175    0.9168    0.9169     30000\nweighted avg     0.9175    0.9168    0.9169     30000\n```\n\n## License\n\nThis project is licensed under the [MIT License](LICENSE).\n\n---\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjashdubal%2Fstackoverflow-classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjashdubal%2Fstackoverflow-classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjashdubal%2Fstackoverflow-classifier/lists"}