https://github.com/ashly1991/rnn-text-classification-keras-tf2

IMDB sentiment analysis with Keras RNNs (LSTM/GRU). Within-batch padding, bucketing, embeddings, and masking for efficient, accurate training.

# RNN Text Classification with Keras — Embeddings, Masking, and Efficient Batching

This project continues the IMDB **sentiment analysis** task and addresses key efficiency and modeling issues by:
- using **within‑batch padding** (shorter sequences padded only to the longest in the **batch**, not the dataset),
- introducing **word embeddings** to replace one‑hot vectors,
- **skipping computation on padded steps** via Keras **masking**,
- leveraging **Keras RNNs** (LSTMs/GRUs, stacked or bidirectional) to simplify code and improve performance.

> **Previous part (low‑level RNN, Part 1):** https://github.com/Ashly1991/rnn-text-classification-tf2

## What’s new in this repo (beyond Part 1)
- **Efficient batching:** `from_generator` + **`padded_batch`** (pad to each batch’s max length).
- **Optional bucketing:** group sequences by similar length to reduce padding waste (helps more with larger truncation limits such as 500).
- **Embeddings:** compact, learnable representations replace one‑hot vectors; they are faster and need fewer input‑side parameters when `emb_dim` is much smaller than the vocabulary size.
- **Keras RNNs:** use optimized **`LSTM`/`GRU`**, easily **stack** layers and add **Bidirectional** context.
- **Masking:** `Embedding(mask_zero=True)` propagates a mask so downstream RNN layers skip padded steps; the final state then depends only on real tokens, which can improve accuracy and reduce the number of training steps needed.
- **Cleaner training loop:** `model.fit` with built‑in metrics/callbacks (still compatible with custom loops if needed).
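
To make the parameter saving from embeddings concrete, here is a rough count of the input-side weights of an LSTM fed one-hot vectors versus one fed embeddings. It is a back-of-the-envelope sketch: the sizes (`vocab_size = 20000`, `emb_dim = 128`, 128 LSTM units) mirror the quick model sketch in this README, the helper names are hypothetical, and recurrent weights and biases are left out because they are identical in both cases.

```python
def lstm_input_kernel_params(input_dim: int, units: int) -> int:
    """Weights connecting each input feature to the four LSTM gates."""
    return input_dim * 4 * units


def one_hot_input_params(vocab_size: int, units: int) -> int:
    # With one-hot inputs, the LSTM sees a vector of length vocab_size.
    return lstm_input_kernel_params(vocab_size, units)


def embedding_input_params(vocab_size: int, emb_dim: int, units: int) -> int:
    # Embedding table plus an LSTM that sees vectors of length emb_dim.
    embedding_table = vocab_size * emb_dim
    return embedding_table + lstm_input_kernel_params(emb_dim, units)


vocab_size, emb_dim, units = 20000, 128, 128
one_hot = one_hot_input_params(vocab_size, units)
embedded = embedding_input_params(vocab_size, emb_dim, units)

print(one_hot)   # 10240000
print(embedded)  # 2625536
```

With these sizes the embedding variant needs roughly a quarter of the input-side weights. The saving shrinks as `emb_dim` approaches `4 * units`, which is one way to read "suitable `emb_dim`".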

## Quick model sketch
```python
import tensorflow as tf
from tensorflow.keras import layers, models

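vocab_size = 20000  # number of distinct token ids, including padding id 0
emb_dim = 128       # size of each embedding vector

model = models.Sequential([
    # mask_zero=True reserves id 0 for padding and passes a mask downstream
    layers.Embedding(vocab_size, emb_dim, mask_zero=True),
    # return_sequences=False: one vector per review, not one per timestep
    layers.Bidirectional(layers.LSTM(128, return_sequences=False)),
    # single sigmoid unit for binary sentiment
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])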
```
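
One way to check that masking behaves as intended is to feed the same tokens with and without trailing padding and compare the outputs. The snippet below is a minimal sketch on a tiny, untrained encoder; the layer sizes and token ids are arbitrary assumptions, and it assumes id 0 is used only for padding.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

tf.random.set_seed(0)

# Tiny illustrative encoder; sizes are arbitrary and unrelated to the real model.
encoder = models.Sequential([
    layers.Input(shape=(None,), dtype="int32"),  # variable-length token ids
    layers.Embedding(input_dim=50, output_dim=8, mask_zero=True),
    layers.LSTM(16),
])

short = np.array([[5, 9, 3]], dtype="int32")            # no padding
padded = np.array([[5, 9, 3, 0, 0, 0]], dtype="int32")  # right-padded with id 0

out_short = encoder(short).numpy()
out_padded = encoder(padded).numpy()

# With the mask in place, padded steps leave the LSTM state untouched,
# so both inputs should produce the same final vector (up to float error).
same = np.allclose(out_short, out_padded, atol=1e-5)
print(same)
```

Setting `mask_zero=False` in the same experiment would normally make the two outputs differ, because the padded steps would then update the LSTM state.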

## Training efficiency
- **Within‑batch padding:** Build a `tf.data` pipeline from a Python generator and apply `padded_batch` so each batch pads only to its own max length.
- **Bucketing:** Optional length‑based grouping to avoid “one long sequence slows the whole batch”.
- **RaggedTensors:** Supported by many ops and Keras layers but not by `padded_batch`; ragged pipelines can be slower in practice.
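
The pipeline described in this list can be sketched as follows. This is an illustration on toy data, not the notebook's actual code: `toy_reviews`, `generate_examples`, and `count_padding` are hypothetical names, the bucket boundary is arbitrary, and `Dataset.bucket_by_sequence_length` assumes a reasonably recent TensorFlow (in older releases the same transformation lives under `tf.data.experimental`).

```python
import tensorflow as tf

# Toy stand-in for tokenized IMDB reviews: (token_ids, label) pairs.
# Id 0 is reserved for padding, so real token ids start at 1.
toy_reviews = [
    ([4, 7, 2], 1),
    ([9, 1], 0),
    ([3, 3, 8, 5, 6, 2, 7], 1),
    ([5], 0),
    ([2, 6, 6, 1, 4, 8], 1),
    ([7, 7, 7, 2], 0),
]


def generate_examples():
    for token_ids, label in toy_reviews:
        yield token_ids, label


signature = (
    tf.TensorSpec(shape=(None,), dtype=tf.int32),  # variable-length sequence
    tf.TensorSpec(shape=(), dtype=tf.int32),       # scalar label
)
dataset = tf.data.Dataset.from_generator(generate_examples, output_signature=signature)


def count_padding(batched):
    """Total number of padding zeros across all batches."""
    return sum(
        int(tf.reduce_sum(tf.cast(tf.equal(tokens, 0), tf.int32)))
        for tokens, _ in batched
    )


# Within-batch padding: each batch is zero-padded to its own longest sequence.
padded = dataset.padded_batch(3)
padded_widths = [int(tokens.shape[1]) for tokens, _ in padded]
print(padded_widths)          # [7, 6]
print(count_padding(padded))  # 16

# Bucketing: group sequences of similar length before padding.
bucketed = dataset.bucket_by_sequence_length(
    element_length_func=lambda tokens, label: tf.shape(tokens)[0],
    bucket_boundaries=[4],      # two buckets: length < 4 and length >= 4
    bucket_batch_sizes=[3, 3],  # one batch size per bucket
)
print(count_padding(bucketed))  # 7
```

On this toy data bucketing cuts the padded positions from 16 to 7 at the same batch size. On real reviews the gain depends on the length distribution and the truncation limit.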

## How to Run
```bash
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab rnn-text-classification-keras.ipynb
```

## Notes
- Consider truncation (e.g., 200 or 500) and vocabulary limits for speed/quality trade‑offs.
- Try LSTM vs GRU, stacked vs single layer, and bidirectional variants.
- Monitor per‑batch time to see gains from bucketing.
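
To compare the variants mentioned above without duplicating model code, a small factory function can help. The sketch below is one possible arrangement, not the notebook's implementation: `build_rnn_classifier` and its parameters are hypothetical, and the default sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_rnn_classifier(vocab_size=20000, emb_dim=128, units=64,
                         cell="lstm", num_layers=1, bidirectional=False):
    """Hypothetical helper: build and compile one RNN sentiment classifier."""
    rnn_cls = {"lstm": layers.LSTM, "gru": layers.GRU}[cell]
    stack = [
        layers.Input(shape=(None,), dtype="int32"),
        layers.Embedding(vocab_size, emb_dim, mask_zero=True),
    ]
    for i in range(num_layers):
        # Every RNN layer except the last returns the full sequence,
        # so the next RNN layer still receives one vector per timestep.
        rnn = rnn_cls(units, return_sequences=(i < num_layers - 1))
        stack.append(layers.Bidirectional(rnn) if bidirectional else rnn)
    stack.append(layers.Dense(1, activation="sigmoid"))

    model = models.Sequential(stack)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


variants = {
    "lstm": build_rnn_classifier(cell="lstm"),
    "stacked_gru": build_rnn_classifier(cell="gru", num_layers=2),
    "bi_lstm": build_rnn_classifier(cell="lstm", bidirectional=True),
}
for name, model in variants.items():
    print(name, model.count_params())
```

Each variant can then be trained with the same `model.fit` call and the same `tf.data` pipeline, so that differences in accuracy or per-batch time come from the architecture alone.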