https://github.com/ktk-jadoo/steamlit
Extracted appropiate data from Steam API, data processing, data pipelining, training the model , and deploying on Streamlit. Hence, Steamlit.
https://github.com/ktk-jadoo/steamlit
data-science nlp-deep-learning word2vec
Last synced: 11 months ago
JSON representation
Extracted appropiate data from Steam API, data processing, data pipelining, training the model , and deploying on Streamlit. Hence, Steamlit.
- Host: GitHub
- URL: https://github.com/ktk-jadoo/steamlit
- Owner: KTK-Jadoo
- License: mit
- Created: 2024-08-02T04:57:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-08-13T12:05:33.000Z (over 1 year ago)
- Last Synced: 2025-01-08T05:16:43.472Z (about 1 year ago)
- Topics: data-science, nlp-deep-learning, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 234 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Steam-lit: Word Embeddings and LDA-Based NLP Recommendation System
This project demonstrates the creation of a Steam game recommendation system using a combination of BERT word embeddings and Latent Dirichlet Allocation (LDA) for topic modeling. The system recommends games based on user input descriptions.
## Table of Contents
- [Overview](#overview)
- [Technologies Used](#technologies-used)
- [Data Extraction](#data-extraction)
- [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
- [Latent Dirichlet Allocation (LDA)](#latent-dirichlet-allocation-lda)
- [BERT Embeddings](#bert-embeddings)
- [Cosine Similarity](#cosine-similarity)
- [Implementation](#implementation)
- [Conclusion](#conclusion)
## Overview
The Steam-lit project involves building a recommendation system that combines the power of topic modeling through LDA and contextual word embeddings via BERT to provide game recommendations based on user-provided text descriptions.
## Technologies Used
- **Python**
- **Streamlit** - For the web interface
- **Pandas** - Data manipulation and analysis
- **NumPy** - Numerical operations
- **scikit-learn** - Machine learning algorithms and utilities
- **Hugging Face Transformers** - For BERT embeddings
- **Latent Dirichlet Allocation (LDA)** - For topic modeling
- **SQLite** - For data storage
## Data Extraction
The data for this project was extracted using the Steam Web API, allowing us to retrieve details about various games, including descriptions, reviews, and metadata.
## Exploratory Data Analysis (EDA)
Before building the recommendation system, the data was thoroughly cleaned and analyzed. This involved:
- Filtering out non-game items like soundtracks, DLCs, and demos.
- Analyzing the distribution of game tags, genres, and user reviews.
- Visualizing the relationships between different features to understand the dataset better.
## Latent Dirichlet Allocation (LDA)
LDA is used in this project to identify topics within the user reviews.
**Mathematics**:
- LDA assumes that each document (in our case, a game review) is a mixture of topics and that each word in the document is attributable to one of the document's topics.
- The LDA model produces a matrix of topic distributions for each review, which is then aggregated to form a topic distribution for each game.
## BERT Embeddings
BERT (Bidirectional Encoder Representations from Transformers) is used to create embeddings for game descriptions.
**Process**:
- Each game description is tokenized using BERT's tokenizer, then passed through the BERT model to obtain a dense vector representation (embedding).
- These embeddings capture the semantic meaning of the game descriptions.
## Cosine Similarity
Cosine similarity is used to compare the user input with the game embeddings to find the most similar games.
**Mathematics**:
- The cosine similarity between two vectors `A` and `B` is calculated as:
```
cosine_similarity(A, B) = (A · B) / (||A|| ||B||)
```
- In this project, cosine similarity is used to compute the similarity between the user's input embedding and the combined game description and topic embeddings.
## Implementation
The project was implemented in several key steps:
1. **Data Preparation**: Data was extracted from the Steam API, cleaned, and stored in an SQLite database.
2. **Topic Modeling with LDA**: LDA was applied to game reviews to extract topics, which were then associated with each game.
3. **Word Embeddings with BERT**: BERT was used to convert game descriptions into embeddings.
4. **Combining Features**: The LDA topic matrix and BERT embeddings were combined to form a comprehensive feature matrix for each game.
5. **Recommendation System**: The cosine similarity between user input and the game feature matrix was calculated to provide the top game recommendations.
## Conclusion
The Steam-lit project showcases the power of combining modern NLP techniques like BERT embeddings with classical topic modeling approaches such as LDA. The result is a recommendation system that can provide meaningful game suggestions based on textual input from users.