https://github.com/rishishanthan/lstm-sentiment-analysis
End-to-end sentiment analysis with a stacked LSTM in PyTorch: custom tokenization, embeddings, padding, class imbalance handling, and thorough evaluation.
deep-learning lstm nlp pytorch rnn sentiment-analysis sequence-modeling text-classification tokenization torchtext
- Host: GitHub
- URL: https://github.com/rishishanthan/lstm-sentiment-analysis
- Owner: rishishanthan
- Created: 2025-10-08T01:35:23.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-10-08T02:22:47.000Z (8 months ago)
- Last Synced: 2025-10-08T03:27:05.428Z (8 months ago)
- Topics: deep-learning, lstm, nlp, pytorch, rnn, sentiment-analysis, sequence-modeling, text-classification, tokenization, torchtext
- Language: Jupyter Notebook
- Homepage:
- Size: 7.47 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# LSTM Sentiment Analysis (PyTorch)
A complete, production-friendly **sentiment analysis** pipeline built around a stacked **LSTM**.
This project includes robust tokenization, vocabulary building, padding/masking, class-imbalance handling, and a clean training loop with early stopping and LR scheduling.
---
## Highlights
- **Custom tokenization & vocab** (torchtext/nltk) with OOV handling
- **Embedding layer** (random or pretrained vectors if provided)
- **Stacked LSTM** (optionally bidirectional) + dropout regularization
- **Packed sequences** for efficient variable-length batching
- **Class weights / focal loss** option for imbalance
- **Thorough evaluation**: Accuracy, Precision/Recall/F1, ROC/PR curves, confusion matrix
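The tokenization, vocabulary, OOV, and padding steps listed above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the regex tokenizer stands in for torchtext/nltk, and the names `tokenize`, `build_vocab`, `encode`, `pad_batch`, and the `<pad>`/`<unk>` tokens are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical special tokens; the notebook may use different names or ids.
PAD, UNK = "<pad>", "<unk>"


def tokenize(text):
    """Lowercase the text and split it into word tokens (simple regex stand-in)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab(texts, min_freq=1):
    """Assign an integer id to every token seen at least `min_freq` times."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert text to ids; out-of-vocabulary tokens map to the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and return the true lengths."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths
```

The true lengths returned by `pad_batch` are what a packed-sequence LSTM needs in order to skip the padded positions.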
---
## Model
- Embedding(d_model=EMB_DIM)
- LSTM: 2–3 layers, hidden size = H, dropout = 0.3–0.5
- Bidirectional (optional)
- Classifier head: Linear → Softmax
- Loss: CrossEntropy (or focal)
- Optimizer: Adam (lr=1e-3 default)
- Scheduler: ReduceLROnPlateau
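One way the architecture above could look in PyTorch is sketched below. The class name `SentimentLSTM` and the default sizes are assumptions for illustration and may differ from the notebook. The sketch returns raw logits, because `nn.CrossEntropyLoss` applies log-softmax internally; softmax would then be applied only when probabilities are needed at inference.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class SentimentLSTM(nn.Module):
    """Hypothetical stacked (optionally bidirectional) LSTM classifier."""

    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_layers=2,
                 num_classes=2, dropout=0.3, bidirectional=True, pad_id=0):
        super().__init__()
        # padding_idx keeps the pad embedding at zero and excludes it from updates.
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=num_layers,
                            batch_first=True, dropout=dropout,
                            bidirectional=bidirectional)
        self.dropout = nn.Dropout(dropout)
        self.bidirectional = bidirectional
        self.fc = nn.Linear(hidden * (2 if bidirectional else 1), num_classes)

    def forward(self, token_ids, lengths):
        emb = self.embedding(token_ids)
        # Packing lets the LSTM ignore padded time steps.
        packed = pack_padded_sequence(emb, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n has shape (num_layers * num_directions, batch, hidden).
        if self.bidirectional:
            # Last layer: forward state is h_n[-2], backward state is h_n[-1].
            final = torch.cat([h_n[-2], h_n[-1]], dim=1)
        else:
            final = h_n[-1]
        return self.fc(self.dropout(final))  # logits, not probabilities
```

At inference, `torch.softmax(model(token_ids, lengths), dim=1)` would give class probabilities.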
## Requirements
```text
torch==2.4.1
torchtext==0.19.1
numpy==2.1.3
pandas==2.2.3
matplotlib==3.9.3
seaborn==0.13.2
scikit-learn==1.5.2
tqdm==4.66.5
nltk==3.9.1
```
## Insights
- Bidirectional LSTM improves recall on minority classes
- Packed sequences + masking stabilize training
- Moderate dropout (0.3–0.5) and LR scheduling prevent overfitting
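The sketch below combines class weighting, `ReduceLROnPlateau`, and early stopping in one loop. It assumes full-batch training on a generic model for brevity; the helper names `class_weights` and `fit`, the patience values, and the scheduler factor are illustrative choices, not settings taken from the notebook.

```python
import torch
import torch.nn as nn


def class_weights(labels, num_classes):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = torch.bincount(labels, minlength=num_classes).float()
    return counts.sum() / (num_classes * counts.clamp(min=1))


def fit(model, train_batch, val_batch, num_classes=2, max_epochs=50, patience=5):
    """Train until validation loss has not improved for `patience` epochs."""
    x_train, y_train = train_batch
    x_val, y_val = val_batch
    criterion = nn.CrossEntropyLoss(weight=class_weights(y_train, num_classes))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Halve the learning rate when validation loss plateaus (assumed settings).
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2)

    best_val, best_state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        optimizer.zero_grad()
        loss = criterion(model(x_train), y_train)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(x_val), y_val).item()
        scheduler.step(val_loss)

        if val_loss < best_val:
            best_val, stale = val_loss, 0
            # Snapshot the best weights so they can be restored after stopping.
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:
                break

    if best_state is not None:
        model.load_state_dict(best_state)
    return best_val
```

A real run would iterate over mini-batches from a `DataLoader` inside each epoch; the scheduler and early-stopping logic stay the same.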
## Weights
To reuse my trained model, the weights are included in the repo.
## Dataset
This project uses the Large Movie Review Dataset (IMDB), a binary sentiment classification dataset containing substantially more data than previous benchmark datasets. It provides 25,000 highly polar movie reviews for training and 25,000 for testing, plus additional unlabeled data. Both raw text and preprocessed bag-of-words formats are available; see the README file in the release for details.
The dataset can be downloaded from: https://ai.stanford.edu/~amaas/data/sentiment/
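Once the archive is extracted, the raw reviews sit in an `aclImdb/` folder with `train/` and `test/` subfolders, each containing `pos/` and `neg/` directories of `.txt` files. A minimal loader for that layout might look like this; the function name `load_split` and the 0/1 label convention are assumptions for the example.

```python
from pathlib import Path


def load_split(root, split):
    """Read reviews from <root>/<split>/{neg,pos}/*.txt.

    Returns (texts, labels) with label 0 = negative and 1 = positive.
    """
    texts, labels = [], []
    for label, name in enumerate(("neg", "pos")):
        for path in sorted((Path(root) / split / name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

For example, `load_split("aclImdb", "train")` would return the 25,000 labeled training reviews, assuming the archive was extracted into the working directory.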
### Publications Using the Dataset
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
## Results
All results from my run, including training, validation, and test metrics, are in the notebook file.