https://github.com/maettuu/24hs-essentials-in-text-and-speech-processing
Repository for the course Essentials in Text and Speech Processing Fall 2024
https://github.com/maettuu/24hs-essentials-in-text-and-speech-processing
amazon-books-reviews approximate-nearest-neighbors backend content-based-recommendation python recommendation-system sbert tf-idf-vectorization
Last synced: 11 months ago
JSON representation
Repository for the course Essentials in Text and Speech Processing Fall 2024
- Host: GitHub
- URL: https://github.com/maettuu/24hs-essentials-in-text-and-speech-processing
- Owner: maettuu
- Created: 2024-12-12T18:10:22.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-08T20:17:21.000Z (over 1 year ago)
- Last Synced: 2025-06-19T19:52:50.630Z (12 months ago)
- Topics: amazon-books-reviews, approximate-nearest-neighbors, backend, content-based-recommendation, python, recommendation-system, sbert, tf-idf-vectorization
- Language: Python
- Homepage:
- Size: 16.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# 24HS Essentials in Text and Speech Processing (ETSP)
This repository includes the semester project of the course ETSP ([source repository](https://github.com/solodezaldivar/readAlike/tree/0d5913d29a7ab64459b55311e4176ca8ca8f992e)). The code is written in `Python v3.12`.
Main packages: `annoy`, `sentene_transformers`, `torch`, `numpy`, `sklearn` : {`TfidfVectorizer`}, `scipy`, `pandas`, `matplotlib`, `tqdm`
## readAlike
`readAlike` is a book recommendation system that provides similar books to a given input book. The system leverages multiple techniques, including TF-IDF vectorization, Sentence-BERT embeddings, and Approximate Nearest Neighbors (ANN) for generating content-based book recommendations based on title, description, author, and category data ([dataset](https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews)).
## How to Run
To run the recommendation engine:
1. Open the command line and execute ``` pip install -r requirements.txt ```
2. Execute `main.py`. Given an example book, the program will print the top five recommended books based on three methods: TF-IDF, SBERT, and ANN.
## Project Structure
### 1. Classes
- **preprocessing/**: Manages data preprocessing.
- **`Preprocessor`**: Handles data cleaning and formatting from a CSV file of book data.
- **core/**: Contains the core modules for recommendation.
- **`Library`** and **`Book`**: Models the library of books and individual book data.
- **`Vectorizer`**: Converts book text data into numerical vectors using TF-IDF and Sentence-BERT.
- **`DimensionalityReducer`**: Reduces the dimensionality of TF-IDF vectors using Truncated SVD.
- **`Ann`**: Creates an Approximate Nearest Neighbors model for efficient similarity search.
- **`Recommender`**: Main recommendation engine that integrates the above components to provide recommendations.
### 2. Key Components
- **`config.py`**: Configuration file with column names for title, description, authors, and categories.
- **`main.py`**: Main entry point for running the recommendation pipeline.
## Program Flow
1. **Preprocessing**: The `Preprocessor` class reads the dataset and performs data cleaning.
2. **Library Initialization**: `Library` is initialized with the cleaned dataset, storing each book as a `Book` object.
3. **Vectorization**: `Vectorizer` creates TF-IDF and Sentence-BERT embeddings for each book.
4. **Dimensionality Reduction**: `DimensionalityReducer` reduces TF-IDF embeddings for optimized ANN performance.
5. **ANN Construction**: `Ann` constructs an ANN model based on the reduced vectors.
6. **Recommendation**: `Recommender` classifies recommendations into TF-IDF, SBERT, and ANN-based results, outputting top similar books.
## Classes and Methods
### `Preprocessor`
- **Attributes**:
- `df`: DataFrame containing cleaned book data.
- **Methods**:
- `preprocess_data()`: Cleans and formats the data.
- `drop_items_with_short_entries()`, `drop_duplicates()`, `convert_strings_into_lists()`: Helper functions to clean the dataset.
### `Library`
- **Attributes**:
- `books`: List of `Book` objects.
- **Methods**:
- `get_combined_data()`: Concatenates title, description, authors, and categories into a single string per book.
- `get_book_idx()`: Retrieves the index of a book within the library.
### `Book`
- **Attributes**:
- `title`, `description`, `authors`, `categories`: Fields describing the book.
- **Methods**:
- `get_combined_data()`: Combines title, description, authors, and categories into a single string.
### `Vectorizer`
- **Attributes**:
- `tfidf_matrix`: Sparse matrix of TF-IDF vectors.
- `sbert_embeddings`: Sentence-BERT embeddings for each book.
- **Methods**:
- `tfidf_vectorize()`: Vectorizes a book using TF-IDF.
- `sbert_vectorize()`: Vectorizes a book or library using SBERT.
### `DimensionalityReducer`
- **Attributes**:
- `reduced_matrix`: Dimensionality-reduced version of the TF-IDF matrix.
- **Methods**:
- `reduce()`: Reduces a TF-IDF vector to the lower dimension.
### `Ann`
- **Attributes**:
- `ann_indices`: ANN model for similarity search.
- **Methods**:
- `get_nearest_neighbors_by_index()`, `get_nearest_neighbors_by_vector()`: Retrieves nearest neighbors by item index or vector.
### `Recommender`
- **Attributes**:
- `lib`: Library of books.
- `vectorizer`: Vectorizer instance.
- `reducer`: Dimensionality reducer instance.
- `ann`: ANN instance.
- **Methods**:
- `recommend()`: Provides top recommendations based on TF-IDF, SBERT, and ANN.