https://github.com/shayanshabani/mir-2024-project

IMDb Information Retrieval System
bert classification-algorithms clustering-algorithms information-retrieval llms machine-learning recommender-systems

Host: GitHub
URL: https://github.com/shayanshabani/mir-2024-project
Owner: shayanshabani
Created: 2024-03-08T17:48:22.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-11-13T22:49:37.000Z (7 months ago)
Last Synced: 2025-01-30T05:24:20.776Z (5 months ago)
Topics: bert, classification-algorithms, clustering-algorithms, information-retrieval, llms, machine-learning, recommender-systems
Language: Jupyter Notebook
Homepage:
Size: 56.9 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

The IMDb-IR-System indexes movie data and returns relevant movies for a user query, with a focus on information retrieval and user experience. First, the system crawls the IMDb website to collect relevant and reliable movie data. This step gives the later stages accurate data to work with.

After the data is collected, a preprocessing step improves the effectiveness of the model for the retrieval task. The main part of this step is duplicate detection: shingles are generated for each document to produce a characteristic matrix, and locality-sensitive hashing is then applied to efficiently identify and remove duplicate movies.
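The duplicate-detection idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function names, the word-level shingle size `k`, the MD5-based hash family, and the band count are all assumptions chosen to keep the example deterministic. MinHash signatures stand in for the characteristic matrix, and banding groups documents whose signatures agree on at least one band.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=2):
    """Return the set of k-word shingles of a text (k is an assumed default)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def _hash(seed, shingle):
    # Seeded MD5 gives a deterministic family of hash functions (illustrative choice).
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(shingle_set, num_hashes=20):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=10):
    """Split signatures into bands; documents sharing a band bucket become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def jaccard(a, b):
    """Exact Jaccard similarity, used to confirm candidate pairs."""
    return len(a & b) / len(a | b)
```

Candidate pairs returned by `lsh_candidate_pairs` would normally be confirmed with `jaccard` on the full shingle sets before one of the two documents is dropped.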

The core of the system consists of indexing, search, and scoring modules. From the preprocessed data, we build inverted indices over different movie fields, such as summaries and genres. The search module takes a user’s query, refines it, and retrieves similar movies using scoring metrics such as Okapi BM25. A snippet is also generated for each retrieved movie, giving the user a concise summary that highlights the query terms.
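As a rough sketch of how an inverted index and Okapi BM25 scoring fit together, consider the Python below. The function names, whitespace tokenization, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions for illustration; the IDF term uses the always-positive variant popularized by Lucene, which may differ from the variant the project uses.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a posting dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def bm25_scores(query, index, docs, k1=1.2, b=0.75):
    """Rank documents for a query with Okapi BM25 (assumed parameter defaults)."""
    doc_len = {d: len(t.split()) for d, t in docs.items()}
    avgdl = sum(doc_len.values()) / len(doc_len)
    n_docs = len(docs)
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        df = len(postings)
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A separate index per field (for example one for summaries and one for genres) could be built by calling `build_inverted_index` once per field and combining the per-field scores with weights.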

To evaluate the retrieved movies, metrics such as precision, recall, and mean reciprocal rank are implemented. Beyond basic information retrieval methods, we use the HITS algorithm to take the popularity of performers into account.
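The metrics and the link-analysis step above can be sketched as follows. The function names and data shapes are assumptions made for this example. In particular, modeling movies as hubs and performers as authorities in a movie-to-performer graph is one plausible reading of the description, not a confirmed detail of the project.

```python
import math


def precision_recall(retrieved, relevant):
    """Precision and recall of one result list against a set of relevant ids."""
    hits_count = sum(1 for doc in retrieved if doc in relevant)
    precision = hits_count / len(retrieved) if retrieved else 0.0
    recall = hits_count / len(relevant) if relevant else 0.0
    return precision, recall


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_list, relevant_set) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank  # only the first relevant result counts
                break
    return total / len(runs) if runs else 0.0


def hits(edges, iterations=20):
    """HITS on (movie, performer) edges: movies are hubs, performers authorities."""
    hubs = {m: 1.0 for m, _ in edges}
    auths = {p: 1.0 for _, p in edges}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the movies pointing at a performer.
        new_auths = {p: 0.0 for p in auths}
        for m, p in edges:
            new_auths[p] += hubs[m]
        norm = math.sqrt(sum(v * v for v in new_auths.values())) or 1.0
        auths = {p: v / norm for p, v in new_auths.items()}
        # Hub update: sum of authority scores of the performers a movie points at.
        new_hubs = {m: 0.0 for m in hubs}
        for m, p in edges:
            new_hubs[m] += auths[p]
        norm = math.sqrt(sum(v * v for v in new_hubs.values())) or 1.0
        hubs = {m: v / norm for m, v in new_hubs.items()}
    return hubs, auths
```

Under this reading, a performer who appears in many well-connected movies ends up with a high authority score, which could then be used to boost the movies that performer appears in.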

Finally, advanced modules implement several classification and clustering algorithms that assign movies to groups, which are later used in the search module. The project also includes BERT fine-tuning for enhanced query-document processing and Retrieval-Augmented Generation to improve results based on the given context. Lastly, a recommender system collects user preferences and recommends relevant movies based on user-item interaction history. Together, these modules cover crawling, indexing, ranking, evaluation, and recommendation for movie retrieval.
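The README does not say which recommendation technique is used. As one possible illustration of recommending from user-item interaction history, the sketch below uses user-based collaborative filtering with cosine similarity. The function names, the rating-dictionary layout, and the sample ids are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts {movie_id: rating}."""
    shared = set(u) & set(v)
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(target, ratings, top_n=3):
    """Score movies the target has not seen by similarity-weighted ratings."""
    seen = ratings[target]
    scores = {}
    for user, user_ratings in ratings.items():
        if user == target:
            continue
        sim = cosine(seen, user_ratings)
        if sim <= 0:
            continue  # users with no overlap contribute nothing
        for movie, rating in user_ratings.items():
            if movie not in seen:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    # Sort by descending score, breaking ties by movie id for determinism.
    ranked = sorted(scores, key=lambda m: (-scores[m], m))
    return ranked[:top_n]
```

For example, with `ratings = {"u1": {"m1": 5, "m2": 4}, "u2": {"m1": 5, "m2": 4, "m3": 5}}`, calling `recommend("u1", ratings)` suggests `m3`, the only movie that the similar user `u2` rated and `u1` has not seen.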
