https://github.com/phongct1105/synthesearch

An intelligent research tool that finds and synthesizes the most relevant papers to answer your query, streamlining the search process and enhancing understanding

gpt-4 lancedb llm rag vector-database


# πŸ•΅οΈβ€β™‚οΈ SyntheSearch

### πŸ–‹οΈ Authors

- **Project Lead:** Phong Cao
- **AI Developer:** Phong Cao
- **Backend:** Phong Cao, Hien Hoang
- **Frontend:** Doanh Phung, Minh Bui

---

## 🌟 Project Overview
**SyntheSearch**: A smart research tool that finds and synthesizes the most relevant papers, saving researchers time and enhancing insight.

---

### 🚀 **Getting Started**

#### **πŸ› οΈ Start Server (Backend)**

1. Navigate to the backend directory:
```bash
cd backend
```

2. Set up a virtual environment:
```bash
python3 -m venv venv
```

3. Activate the virtual environment:
```bash
source venv/bin/activate
```

4. Install dependencies:
```bash
pip install -r requirement.txt
```
> **⚠️ Note:** On macOS or Linux, remove `pywin32` from `requirement.txt` before installing; it is a Windows-only package.

5. Run the server:
```bash
uvicorn main:app --reload
```
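For readers new to Uvicorn: `main:app` means "the object named `app` in the module `main.py`", and that object must be an ASGI application. The sketch below is a hypothetical, dependency-free stand-in that shows the shape of such an object. It is not the project's actual `main.py`, which may well use a framework on top of ASGI.

```python
import json


async def app(scope, receive, send):
    """Hypothetical ASGI app: answers every HTTP request with a JSON status."""
    if scope["type"] != "http":
        # Ignore non-HTTP scopes (for example lifespan events) in this sketch.
        return
    body = json.dumps({"status": "ok", "path": scope["path"]}).encode()
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"application/json")],
    })
    await send({"type": "http.response.body", "body": body})
```

Saved as `main.py`, this would be served by the same `uvicorn main:app --reload` command shown above.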

---

#### **💻 Start Client (Frontend)**

1. Navigate to the frontend directory:
```bash
cd frontend
```

2. Install dependencies:
```bash
npm i
```

3. Start the development server:
```bash
npm run dev
```

---

## 📚 **Introduction**
SyntheSearch is a web application designed to streamline the research process for students and researchers by efficiently locating relevant research papers. Researchers often spend hours sifting through papers, hoping to find the studies that best match their interests. SyntheSearch aims to reduce this time by intelligently suggesting the most relevant papers and generating a synthesis to reveal how the studies interrelate, offering users an insightful overview that saves time and enhances understanding.

---

## 💡 **Inspiration**
The inspiration for SyntheSearch came from our own experiences as students. Before HackUMass XII, one team member struggled to find research papers on machine-learning applications in cancer detection. Locating credible sources was exhausting and time-consuming, even with optimized library search tools. This frustration inspired us to develop a more efficient search engine that leverages Large Language Models (LLMs) and vector databases to quickly surface relevant research and summar...

---

## πŸ› οΈ **How We Built the Project**
We chose Python for the back end because of its extensive frameworks for AI development. Databricks was used to streamline our machine-learning pipeline. Here’s how we approached building SyntheSearch:

1. **📊 Data Collection**: We started by scraping data from the CORE collection of open-access research papers.
2. **🔍 Embedding**: Using LangChain, we called OpenAI's text-embedding-3-large model to convert paper texts into vector embeddings.
3. **📂 Storage**: We used LanceDB as our vector database, storing the embedded vectors for fast and efficient retrieval.
4. **📝 Summarization and Synthesis**: We used OpenAI's GPT-4o-mini model to generate summaries, suggestions, and synthesized insights.
5. **🌐 Front-End**: We built the user interface using React with a TypeScript template, providing a clean and responsive experience for users.
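
The embed, store, retrieve, and synthesize steps above can be sketched end to end with the standard library only. This is an illustration under loud assumptions, not the project's code: `embed` is a toy bag-of-words stand-in for text-embedding-3-large, `PaperIndex` is an in-memory stand-in for LanceDB, the sample papers are invented, and `build_synthesis_prompt` only assembles the text that would be sent to GPT-4o-mini rather than calling any API.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PaperIndex:
    """Hypothetical in-memory stand-in for the vector database."""

    def __init__(self):
        self.rows = []  # each row: (title, abstract, vector)

    def add(self, title: str, abstract: str) -> None:
        self.rows.append((title, abstract, embed(f"{title} {abstract}")))

    def search(self, query: str, k: int = 2):
        """Return the k papers whose vectors are closest to the query."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[2]), reverse=True)
        return [(title, abstract) for title, abstract, _ in ranked[:k]]


def build_synthesis_prompt(query: str, papers) -> str:
    """Assemble the prompt a chat model would receive (no API call here)."""
    listing = "\n".join(
        f"[{i}] {title}: {abstract}"
        for i, (title, abstract) in enumerate(papers, start=1)
    )
    return (
        f"Question: {query}\n\nPapers:\n{listing}\n\n"
        "Summarize each paper and explain how they relate to one another."
    )


# Invented sample data, for illustration only.
index = PaperIndex()
index.add("Deep learning for tumor detection",
          "Convolutional networks classify cancer in histology images.")
index.add("Soil erosion in river deltas",
          "Field measurements of sediment transport.")
index.add("Machine learning in oncology",
          "A survey of machine learning models for cancer detection and screening.")

query = "machine learning cancer detection"
top = index.search(query, k=2)
prompt = build_synthesis_prompt(query, top)
print(prompt)
```

Running it prints a prompt that lists the two papers closest to the query and leaves out the unrelated one. In a real pipeline the ranking quality comes from the embedding model, so the toy `embed` here only demonstrates the data flow.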

---

## 🚧 **Challenges Faced**
- **🔄 GitHub Workflow Issues**: Frequent merge conflicts in pull requests slowed our progress.
- **🗣️ Communication Gaps**: Miscommunication led to duplicated work and inefficiencies.

---

## 🎓 **Lessons Learned**
This project was an invaluable learning experience. As it was our first LLM project, we gained hands-on experience with GenAI technologies, particularly the power of vector databases. We learned the importance of clear team communication, and we now have a deeper understanding of LLMs and their capabilities in revolutionizing information retrieval.

---

## πŸ› οΈ **Built With**
- **🐍 Python** (Backend Development)
- **πŸ”— LangChain** (Embedding)
- **πŸ“‚ LanceDB** (Vector Database)
- **πŸ€– OpenAI GPT Models** (Summarization and Synthesis)
- **βš›οΈ React** with **TypeScript** (Front-End Development)
- **🎨 TailwindCSS** (Styling)
- **⚑ Vite** (Tooling)
- **πŸ§ͺ Databricks** (Machine Learning Pipeline)

---

## 🌱 **Empowering Researchers**
Through SyntheSearch, we're excited to contribute to the efficiency of the research process, empowering researchers to focus on insights rather than information overload.