https://github.com/sanjurajveer/moview_review_analysis_nlp
Analysing movie reviews using NLP and categorising them into good and bad
- Host: GitHub
- URL: https://github.com/sanjurajveer/moview_review_analysis_nlp
- Owner: sanjurajveer
- Created: 2025-04-17T21:29:07.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-17T21:38:56.000Z (6 months ago)
- Last Synced: 2025-06-05T22:47:08.488Z (4 months ago)
- Topics: nlp-machine-learning, nltk-python, perplexity, tfidf-vectorizer, tsne-algorithm
- Language: Jupyter Notebook
- Homepage:
- Size: 202 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# IMDB Movie Reviews
## Overview
This Jupyter notebook demonstrates the application of t-SNE (t-Distributed Stochastic Neighbor Embedding) for visualizing high-dimensional text data from the IMDB movie review dataset. The dataset contains 50,000 movie reviews labeled as "positive" or "negative." The notebook walks through the entire process, from data loading and preprocessing to dimensionality reduction using t-SNE and visualization.

## Dataset
The dataset used is the [IMDB Dataset of 50K Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews), which includes:
- **Review Text**: The textual content of movie reviews.
- **Sentiment Labels**: Binary labels indicating whether the review is "positive" or "negative."

## Dependencies
To run this notebook, ensure you have the following Python libraries installed:
- `numpy`
- `pandas`
- `matplotlib`
- `seaborn`
- `re` (part of the Python standard library; no installation required)
- `nltk`
- `sklearn.feature_extraction.text.TfidfVectorizer`
- `sklearn.manifold.TSNE`

You can install these dependencies using `pip`:
```bash
pip install numpy pandas matplotlib seaborn nltk scikit-learn
```

## Notebook Structure
1. **Import Libraries**: Load necessary Python libraries for data manipulation, text processing, and visualization.
2. **Load the Dataset**: Read the IMDB dataset from a CSV file and perform initial exploration.
3. **Basic Exploration**: Check for missing values and analyze the distribution of sentiments.
4. **Subset the Data**: Sample a balanced subset of the data (1,000 positive and 1,000 negative reviews) for computational efficiency.
5. **Clean & Preprocess Text**:
   - Convert text to lowercase.
   - Remove non-alphabetic characters and stopwords.
   - Tokenize the text.
6. **Convert Text to Numeric Features (TF-IDF)**: Use TF-IDF vectorization to transform text into a numerical format suitable for t-SNE.
7. **Apply t-SNE**: Reduce the high-dimensional TF-IDF vectors to 2D for visualization.
8. **Visualization**: Plot the t-SNE results to explore the separation between positive and negative reviews.

## Key Steps
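As a rough sketch of steps 4 and 5 of the notebook structure (balanced subsetting and text cleaning), the code below shows one way these could look. It is not taken from the notebook: the column names `review` and `sentiment`, the file name in the comment, the helper names `clean_text` and `balanced_subset`, and the tiny stopword set are all assumptions made for illustration. The notebook itself relies on `nltk`, whose English stopword list would replace the stand-in set here.

```python
import re

import pandas as pd

# Stand-in stopword set so the sketch runs without any downloads (assumption);
# with NLTK, stopwords.words("english") would supply the full list.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "of", "to", "in", "i"}


def clean_text(text: str) -> str:
    """Lowercase, drop non-alphabetic characters, tokenize, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = text.split()                  # simple whitespace tokenization
    return " ".join(t for t in tokens if t not in STOPWORDS)


def balanced_subset(df: pd.DataFrame, n_per_class: int, seed: int = 42) -> pd.DataFrame:
    """Sample the same number of reviews from each sentiment class."""
    return (
        df.groupby("sentiment")
        .sample(n=n_per_class, random_state=seed)
        .reset_index(drop=True)
    )


# Toy frame standing in for something like pd.read_csv("IMDB Dataset.csv")
toy = pd.DataFrame({
    "review": [
        "This movie was GREAT! Loved it.",
        "A dull, lifeless plot...",
        "An instant classic",
        "Terrible acting and worse script",
    ],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

subset = balanced_subset(toy, n_per_class=1)
subset["clean_review"] = subset["review"].apply(clean_text)
print(subset[["sentiment", "clean_review"]])
```

With the full dataset, `n_per_class=1000` would give the 2,000-review subset described in step 4.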
- **Text Cleaning**: The notebook includes a function to clean and preprocess the review text, ensuring consistency and removing noise.
- **Dimensionality Reduction**: t-SNE is applied to the TF-IDF vectors to project the data into a 2D space, making it easier to visualize patterns.
- **Perplexity Analysis**: The notebook includes a section to analyze the effect of perplexity on t-SNE's performance, helping to choose an optimal value.

## Usage
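The TF-IDF, t-SNE, and perplexity steps can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the eight-document corpus, the `max_features=5000` cap, and the perplexity values `(2, 5)` are placeholders chosen so the example runs quickly. scikit-learn requires perplexity to be smaller than the number of samples, so a 2,000-review subset would allow much larger values (for example 5 to 50).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Tiny stand-in corpus of already-cleaned reviews (assumption for illustration)
docs = [
    "great movie loved acting", "wonderful film great story",
    "loved wonderful performance", "brilliant story great cast",
    "terrible movie boring plot", "awful film bad acting",
    "boring story terrible script", "bad plot awful cast",
]

# max_features caps the vocabulary size; 5000 is an illustrative value
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

embeddings = {}
for perplexity in (2, 5):  # each value must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, init="random", random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X.toarray())
    # kl_divergence_ is the final value of the objective t-SNE minimizes
    print(f"perplexity={perplexity}: KL divergence={tsne.kl_divergence_:.3f}")
```

The reported KL divergence is not strictly comparable across perplexity values, because perplexity changes the target distribution itself, so it is best treated as a rough signal alongside visual inspection of each embedding.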
1. **Run the Notebook**: Execute each cell sequentially to follow the workflow.
2. **Modify Parameters**: Adjust the perplexity value in the t-SNE step or the sample size to experiment with different settings.
3. **Visualize Results**: The final visualization helps identify clusters of similar reviews and potential separations between sentiments.

## Results
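A plot of the 2D embedding colored by sentiment could be produced along these lines. The helper name `plot_embedding`, the colors, and the axis labels are hypothetical choices for this sketch; the notebook may well use `seaborn` or different styling.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_embedding(embedding, labels, title="t-SNE of TF-IDF review vectors"):
    """Scatter a 2D embedding with one color per sentiment label."""
    embedding = np.asarray(embedding)
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(6, 5))
    for sentiment, color in (("positive", "tab:green"), ("negative", "tab:red")):
        mask = labels == sentiment  # boolean mask selecting one class
        ax.scatter(embedding[mask, 0], embedding[mask, 1],
                   s=12, alpha=0.6, color=color, label=sentiment)
    ax.set_xlabel("t-SNE dimension 1")
    ax.set_ylabel("t-SNE dimension 2")
    ax.set_title(title)
    ax.legend()
    return ax
```

Passing the array returned by t-SNE together with the matching sentiment labels, then calling `plt.show()`, displays the figure. Drawing one such plot per perplexity value makes the settings easy to compare side by side.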
The t-SNE visualization provides insights into the structure of the dataset:
- Clusters of reviews with similar sentiment.
- Overlaps or separations between positive and negative reviews.
- The impact of perplexity on the visualization quality.

## Applications
This notebook can be adapted for:
- Sentiment analysis tasks.
- Exploratory data analysis (EDA) for text data.
- Understanding the effectiveness of t-SNE for text visualization.

## License
This project is open-source and available under the MIT License. Feel free to use, modify, and distribute the code as needed.

## Acknowledgments
- Dataset sourced from Kaggle.
- Libraries and tools used: `scikit-learn`, `nltk`, `pandas`, `matplotlib`.

For questions or contributions, please open an issue or submit a pull request. Happy analyzing!