https://github.com/deepmancer/advanced-recommender-system

An advanced information retrieval system that combines indexing, machine learning, and personalized search to enhance academic research and document discovery.
bigram-model collaborative-filtering crawling-python fine-tuning information-retrieval language-model natural-language-processing nlp positional-indexing pytorch recommender-system selenium spelling-correction tokenization transformers vectorization


# 📚 Advanced Recommender System


Built with PyTorch, Hugging Face Transformers, Python, scikit-learn, and Jupyter Notebook.


Welcome to the **Advanced Recommender System**!

---

## 🚀 Project Overview

Our goal is to build a platform that retrieves, classifies, ranks, and recommends documents tailored to user preferences. The project pipeline follows these key stages:

1. **Data Collection & Preprocessing**:
   - 📥 **Data Collection**: Gather academic paper data.
   - 🔧 **Preprocessing**: Prepare data for indexing.

2. **Indexing & Retrieval Infrastructure**:
   - 🗂️ **Indexing**: Develop an indexing system.
   - ✍️ **Spell Correction**: Integrate spell correction mechanisms.
   - 📊 **Vector Space Models**: Apply models for accurate search ranking.

3. **Machine Learning & Clustering**:
   - 🤖 **Machine Learning**: Implement classification algorithms.
   - 🧩 **Clustering**: Organize documents into clusters.

4. **Web Crawling & Personalized Search**:
   - 🌐 **Web Crawling**: Collect additional data from the web.
   - 🔍 **Personalized Search**: Develop advanced search and recommendation features.

5. **Evaluation & Optimization**:
   - 📈 **Evaluation**: Assess system performance using metrics.
   - 🔧 **Optimization**: Refine and improve system effectiveness.

---

## 🏗️ Phase 1: Data Acquisition and Indexing Infrastructure

Phase 1 focuses on laying the foundation for a robust information retrieval system by establishing an efficient data processing and indexing infrastructure.

### Datasets

- **Dataset**: Scientific articles from [Semantic Scholar](https://www.semanticscholar.org/).
- **Dataset Category**: Artificial Intelligence & Bioinformatics

### Key Components

- **📂 Data Preprocessing & Preparation**: Structure academic papers for efficient retrieval.
- **📚 Positional Index Construction**: Build a positional index, which records where each term occurs in each document, to support exact phrase searches.
- **🔠 Spell Correction Integration**: Integrate a bigram-based spell correction system.
- **🧮 Vector Space Modeling**: Implement vector space models for document ranking:
  - **`ltn-lnn`**: SMART weighting pair of `ltn` (logarithmic tf, idf, no normalization) with `lnn` (logarithmic tf, no idf, no normalization).
  - **`ltc-lnc`**: SMART weighting pair of `ltc` (logarithmic tf, idf, cosine normalization) with `lnc` (logarithmic tf, no idf, cosine normalization).
  - **`Okapi BM25`**: Probabilistic relevance ranking model.
- **📈 Evaluation Metrics**: Assess system performance with metrics such as MRR, Precision, Recall, F1 Score, MAP, and NDCG.
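
As a rough illustration of the positional index listed above, the sketch below maps each term to the positions where it occurs in each document and uses those positions to answer exact-phrase queries. The function names (`build_positional_index`, `phrase_search`), the whitespace tokenizer, and the two sample documents are assumptions made for this example, not the project's actual code.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; `docs` is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive tokenization (lowercase + whitespace split), for illustration only.
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return the sorted ids of documents that contain the exact phrase."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # The phrase starts at p when the k-th following term sits at p + k + 1.
        if any(all(p + k + 1 in later[k] for k in range(len(later)))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical sample documents.
docs = {
    1: "deep learning for protein folding",
    2: "learning deep representations of protein structure",
}
index = build_positional_index(docs)
matches = phrase_search(index, "deep learning")
```

Both documents contain the words "deep" and "learning", but only document 1 has them adjacent and in that order, which is the distinction a positional index makes possible.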

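To make the Okapi BM25 ranking model above more concrete, here is a minimal sketch of the scoring function. The values `k1=1.5` and `b=0.75` are common defaults rather than settings taken from this project, the smoothed IDF is one of several variants in use, and `bm25_scores` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` (a list of token lists) against `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical three-document corpus, already tokenized.
docs = [
    "protein structure prediction with deep learning".split(),
    "deep learning for image classification".split(),
    "graph algorithms for sequence alignment".split(),
]
scores = bm25_scores(["protein", "learning"], docs)
```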
---

## 🧬 Phase 2: Machine Learning and Clustering for Document Retrieval

In Phase 2, we enhance retrieval capabilities through machine learning techniques, improving classification and clustering to refine the search system.

### Key Components

- **📂 Dataset**: Access the scientific articles dataset from [Kaggle](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts?resource=download).
- **📊 Naive Bayes Classification**: Implement a Naive Bayes classifier for document categorization.
- **🤖 Neural Network Classifier**: Develop a neural network classifier for improved accuracy.
- **🧠 Large Language Models**: Fine-tune a pre-trained model for advanced classification.
- **🧮 Hierarchical Clustering**: Apply hierarchical clustering for document organization.
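
The Naive Bayes component above can be sketched in a few lines. This is a minimal multinomial Naive Bayes with add-one (Laplace) smoothing, written in plain Python for readability. The class name, the two labels, and the four training documents are invented for illustration, and a real run on the dataset would more likely use a library implementation.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes over token lists with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / n_docs)
            for t in tokens:
                if t not in self.vocab:
                    continue  # terms never seen in training are ignored
                score += math.log(
                    (self.term_counts[label][t] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data; the labels are illustrative only.
train_docs = [
    "neural network image classification".split(),
    "convolutional network for vision".split(),
    "gene expression sequence analysis".split(),
    "protein sequence alignment".split(),
]
train_labels = ["computer-vision", "computer-vision", "bioinformatics", "bioinformatics"]
model = MultinomialNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("gene sequence".split())
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.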

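For the hierarchical clustering step, a bottom-up (agglomerative) approach is one common choice. The sketch below assumes average linkage over cosine distance and non-zero document vectors. These choices, the function names, and the four toy vectors are assumptions for illustration, not details confirmed by the project.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Average linkage: mean pairwise distance between the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical 2-d document vectors forming two obvious groups.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
clusters = agglomerative(vectors, 2)
```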
---

## 🛠️ Phase 3: Web Crawling, Link Analysis, and Personalized Search

Phase 3 centers on expanding the system's capabilities with web crawling, link analysis, and advanced personalization features.

### Key Components

- **🕷️ Web Crawling**: Deploy a web crawler to gather academic articles and related data.
- **🔗 Link Analysis**: Utilize PageRank and HITS algorithms to determine article importance.
- **📚 Content-Based Recommendation**: Develop recommendations based on article content similarity.
- **🤝 Collaborative Filtering**: Recommend articles based on the preferences of similar users.
- **🧪 Evaluation of Recommender Systems**: Measure recommendation system performance using metrics like nDCG.
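
To illustrate the link analysis step, here is a minimal PageRank sketch using power iteration over a citation graph. The damping factor of 0.85 is the conventional default rather than a value taken from this project, every linked node is assumed to appear as a key in the graph, and the `pagerank` function and the four-paper graph are hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over `graph`: {node: [nodes it links to]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in graph[v]:
                # Each node shares its rank equally among the nodes it links to.
                new_rank[target] += damping * rank[v] / len(graph[v])
        rank = new_rank
    return rank


# Hypothetical citation graph: papers A and B cite C, and C cites D.
citations = {
    "A": ["C"],
    "B": ["C"],
    "C": ["D"],
    "D": [],
}
ranks = pagerank(citations)
```

The scores always sum to 1, so they can be read as the share of attention each paper receives from a reader who follows citations at random.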

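The nDCG metric mentioned above can be computed as follows. This sketch assumes the linear-gain form of DCG and assumes the list passed in contains the graded relevance of every judged item, so that sorting it yields the ideal ranking. The function names are illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades given in ranked order."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades for one recommendation list, in ranked order.
score = ndcg_at_k([3, 2, 1, 0], k=4)
```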
### Final Product

Upon completion of Phase 3, the Advanced Recommender System will be a comprehensive tool for retrieving, organizing, ranking, and recommending academic papers tailored to users' research needs.

---

## 📝 License

This project is licensed under the MIT License. For detailed information, please refer to the [LICENSE](LICENSE) file.

---

Thank you for your interest in the **Advanced Recommender System**! We hope this project serves as a valuable and engaging tool for your research and information retrieval needs. Happy exploring! 🚀