https://github.com/potat-dev/semantic-search-coursework
Semantic search engine written in Java as a university project
https://github.com/potat-dev/semantic-search-coursework
corenlp embeddings java milvus mongodb rabbitmq search-engine semantic-search vector-database
Last synced: 9 months ago
JSON representation
Semantic search engine written in Java as a university project
- Host: GitHub
- URL: https://github.com/potat-dev/semantic-search-coursework
- Owner: potat-dev
- Archived: true
- Created: 2023-10-12T03:10:21.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-01-15T17:28:56.000Z (over 2 years ago)
- Last Synced: 2024-10-02T09:05:07.008Z (over 1 year ago)
- Topics: corenlp, embeddings, java, milvus, mongodb, rabbitmq, search-engine, semantic-search, vector-database
- Language: Java
- Homepage:
- Size: 155 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Semantic search engine
Semantic search engine written in Java as a university project
> [!WARNING]
> This project is **absolutely not production ready**. It was developed as a university project, as a proof of concept. If you see this message, it means that I am already developing a new version of this project, which is based on a microservice architecture, and is much more optimized.
## How to run
1. You need to define some env vars:
```bash
export MILVUS_HOST=localhost MILVUS_PORT=19530 MONGODB_URI=mongodb://localhost:27017/ RABBITMQ_HOST=localhost RABBITMQ_USERNAME=user RABBITMQ_PASSWORD=pass MODEL_PATH=models/model.onnx
```
2. You need an embedding model in ONNX format. I used this model: [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru). To convert, I used the utility from this article: [Export to ONNX](https://huggingface.co/docs/transformers/serialization). Also, you can use any embedding model in ONNX format with a vector dimension of 768. Put the model in a models folder, like `models/model.onnx`
3. Build and run the project with:
```bash
./gradlew run
```
4. The API will be available on port 4567
> [!NOTE]
> The project was written with an emphasis on the fact that it will be possible to run as many indexing workers as you want. But, due to the tight deadlines, there was not enough time for optimization, and each indexing worker loads a model into its memory. Run with caution!
## ~~Shitty~~ project architecture
```mermaid
flowchart TD
U(User)
A(User API)
S(Search Service)
I(Indexing Service)
E(Embedding Model)
DM[(Mongo)]
DV[(Milvus)]
R[(RabbitMQ)]
U -->|API Request| A
A -->|Send indexing task to queue| R
R -->|Receive task| I
I -->|Store keywords| DM
I -->|Generate more indexing tasks| R
I -->|Extract from text| E
E -->|Store embedding| DV
A -->|Search request| S
S -->|Extract from query| E
S -->|Query by keywords| DM
S -->|Query by embedding| DV
```