https://github.com/breadrock1/doc-searcher

A document search project built with Rust and OpenSearch.

elasticsearch fulltext-search opensearch rag rest-api rust semantic-search


[![Pull Request Actions](https://github.com/breadrock1/doc-searcher/actions/workflows/pull-request.yml/badge.svg)](https://github.com/breadrock1/doc-searcher/actions/workflows/pull-request.yml)

[![Target - Linux](https://img.shields.io/badge/OS-Linux-blue?logo=linux&logoColor=white)](https://www.linux.org/ "Go to Linux homepage")
[![Target - MacOS](https://img.shields.io/badge/OS-MacOS-blue?logo=linux&logoColor=white)](https://www.apple.com/ "Go to Apple homepage")
[![Target - Windows](https://img.shields.io/badge/OS-Windows-blue?logo=linux&logoColor=white)](https://www.microsoft.com/ "Go to Microsoft homepage")

# Doc-Search Metaverse project

Doc-Search is a simple and flexible document search application that uses Rust and OpenSearch
to provide efficient full-text search over documents. The project aims to offer a straightforward way to
index and search a large corpus of documents with the speed and accuracy of OpenSearch.

The main goal is to implement a simple yet powerful system for storing and indexing documents with search functionality
(full-text, semantic and hybrid). OpenSearch is the default search engine, but you can plug in your own
by implementing several async traits, for example for Tantivy or Qdrant.
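
As a rough illustration of the pluggable-engine idea, the sketch below defines a hypothetical `Searcher` trait with a toy in-memory backend. The names `Searcher`, `FoundPart`, `InMemorySearcher` and `search_ids` are assumptions made for this example and are not taken from the project's code. The trait is synchronous so the sketch needs no runtime, whereas the project's traits are async.

```rust
/// A found document part (hypothetical shape, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct FoundPart {
    pub doc_id: String,
    pub text: String,
}

/// Hypothetical engine-agnostic search interface. The real project uses
/// async traits; this one is synchronous to keep the sketch self-contained.
pub trait Searcher {
    fn fulltext(&self, index: &str, query: &str) -> Vec<FoundPart>;
}

/// Toy backend doing a case-insensitive substring match in memory.
/// An OpenSearch, Tantivy or Qdrant backend would implement the same trait.
pub struct InMemorySearcher {
    // (index name, document id, text)
    docs: Vec<(String, String, String)>,
}

impl InMemorySearcher {
    pub fn new() -> Self {
        Self { docs: Vec::new() }
    }

    pub fn store(&mut self, index: &str, doc_id: &str, text: &str) {
        self.docs
            .push((index.to_string(), doc_id.to_string(), text.to_string()));
    }
}

impl Searcher for InMemorySearcher {
    fn fulltext(&self, index: &str, query: &str) -> Vec<FoundPart> {
        let needle = query.to_lowercase();
        self.docs
            .iter()
            .filter(|(idx, _, text)| {
                idx.as_str() == index && text.to_lowercase().contains(&needle)
            })
            .map(|(_, id, text)| FoundPart {
                doc_id: id.clone(),
                text: text.clone(),
            })
            .collect()
    }
}

/// Application code depends only on the trait, not on a concrete engine.
pub fn search_ids(engine: &dyn Searcher, index: &str, query: &str) -> Vec<String> {
    engine
        .fulltext(index, query)
        .into_iter()
        .map(|part| part.doc_id)
        .collect()
}

fn main() {
    let mut engine = InMemorySearcher::new();
    engine.store("docs", "a", "Rust and OpenSearch");
    engine.store("docs", "b", "Redis cache");
    println!("{:?}", search_ids(&engine, "docs", "rust"));
}
```

Because callers only see `&dyn Searcher`, swapping the backend should not require changes in the use-case layer.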

The overall architecture:
![architecture.png](docs/architecture.png)

Doc-Search includes the following sub-services:
- Cache Service - API for a caching service such as Redis;
- Metrics Service - API exposing metrics for Prometheus monitoring;
- Storage Service - CRUD API for indexed folders and documents;
- Searcher Service - API for search functionality (full-text, semantic, hybrid);
- Embeddings Service (removed) - API for an embeddings service, for use with your own model.

#### Changelog:

**OpenSearch instead of Elasticsearch**
The Searcher and Storage services currently share a common OpenSearch-based implementation.

**Removed custom embeddings functionality**
After switching from Elasticsearch to OpenSearch, integrating a custom embeddings model is no longer necessary,
because newer versions of OpenSearch provide an ML plugin with the required functionality (chunking and embedding).
The Embeddings module has therefore been removed from the code base. When Qdrant support is added, this functionality
will return to the infrastructure layer together with the Qdrant client implementation.

## Features
Service-based:
- **Rust Performance**: Benefit from the speed and safety of Rust;
- **REST API**: Easy-to-use REST API for searching documents and managing indexing;
- **Swagger**: Swagger documentation for all available endpoints;
- **Remote logging**: Send errors, warnings and other metrics to a remote server;
- **Docker Support**: Easy deployment with Docker and docker-compose;
- **Caching Queries**: Store data in a cache service such as Redis or your own solution.
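
To make the query-caching idea concrete, here is a minimal sketch in which an in-memory map stands in for Redis. The `QueryCache` trait, `MemoryCache` and `cached_search` are hypothetical names invented for this example; the project's actual cache API may look quite different.

```rust
use std::collections::HashMap;

/// Hypothetical cache interface; a Redis-backed type could implement it too.
pub trait QueryCache {
    fn get(&self, key: &str) -> Option<Vec<String>>;
    fn put(&mut self, key: &str, value: Vec<String>);
}

/// In-memory stand-in for an external cache service.
#[derive(Default)]
pub struct MemoryCache {
    entries: HashMap<String, Vec<String>>,
}

impl QueryCache for MemoryCache {
    fn get(&self, key: &str) -> Option<Vec<String>> {
        self.entries.get(key).cloned()
    }

    fn put(&mut self, key: &str, value: Vec<String>) {
        self.entries.insert(key.to_string(), value);
    }
}

/// Returns cached results when present; otherwise runs `search`, stores the
/// result and returns it. The bool reports whether the cache was hit.
pub fn cached_search<C, F>(cache: &mut C, query: &str, search: F) -> (Vec<String>, bool)
where
    C: QueryCache,
    F: FnOnce(&str) -> Vec<String>,
{
    if let Some(hit) = cache.get(query) {
        return (hit, true);
    }
    let fresh = search(query);
    cache.put(query, fresh.clone());
    (fresh, false)
}

fn main() {
    let mut cache = MemoryCache::default();
    let backend = |q: &str| vec![format!("result for {q}")];
    let (_, hit1) = cached_search(&mut cache, "rust", backend);
    let (_, hit2) = cached_search(&mut cache, "rust", backend);
    println!("first call hit: {hit1}, second call hit: {hit2}");
}
```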

Searching:
- **Full-Text Search**: Quickly find documents by content using the chosen search engine;
- **Semantic Search**: Fast semantic search backed by an external embeddings service;
- **Hybrid Search**: Fast hybrid search backed by an external embeddings service.
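
Hybrid search merges full-text and semantic relevance into one ranking. One common approach is to min-max normalize each score set and take a weighted mean. The sketch below illustrates that idea only; the function names and the `weight` parameter are assumptions, and the project may delegate this step to the search engine rather than do it in application code.

```rust
use std::collections::HashMap;

/// Min-max normalizes scores into [0, 1]. If all scores are equal,
/// every document gets 1.0.
pub fn min_max(scores: &[(String, f64)]) -> HashMap<String, f64> {
    let min = scores.iter().map(|(_, s)| *s).fold(f64::INFINITY, f64::min);
    let max = scores.iter().map(|(_, s)| *s).fold(f64::NEG_INFINITY, f64::max);
    scores
        .iter()
        .map(|(id, s)| {
            let norm = if max > min { (*s - min) / (max - min) } else { 1.0 };
            (id.clone(), norm)
        })
        .collect()
}

/// Combines two result lists as `weight * lexical + (1 - weight) * semantic`.
/// A document missing from one list contributes 0 for that list.
pub fn hybrid(
    lexical: &[(String, f64)],
    semantic: &[(String, f64)],
    weight: f64,
) -> Vec<(String, f64)> {
    let lex = min_max(lexical);
    let sem = min_max(semantic);
    let mut combined: HashMap<String, f64> = HashMap::new();
    for (id, s) in &lex {
        *combined.entry(id.clone()).or_insert(0.0) += weight * *s;
    }
    for (id, s) in &sem {
        *combined.entry(id.clone()).or_insert(0.0) += (1.0 - weight) * *s;
    }
    let mut ranked: Vec<(String, f64)> = combined.into_iter().collect();
    // Highest score first; ties are broken by document id for stable output.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}

fn main() {
    let lexical = vec![("a".to_string(), 10.0), ("b".to_string(), 5.0)];
    let semantic = vec![("b".to_string(), 0.9), ("c".to_string(), 0.1)];
    for (id, score) in hybrid(&lexical, &semantic, 0.3) {
        println!("{id}: {score:.2}");
    }
}
```

Normalizing first matters because full-text scores and vector similarities live on different scales, so raw values cannot be summed meaningfully.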

## Domain

The project has the following domains:

```
domain
|----> Document storage (core)
|      |----> Index
|      |      |----> Context: index management in vector storage
|      |      |----> Services: IIndexStorage
|      |----> Document
|             |----> Context: splits a document into parts and stores them in vector storage
|             |----> Services: IDocumentPartStorage
|
|----> Document searching (core)
|      |----> Found document
|      |      |----> Context: results of multiple search kinds
|      |      |----> Services: ISearcher
|      |----> Pagination
|             |----> Context: pagination of found results
|             |----> Services: IPAginator
```
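
The pagination context can be pictured with a small offset-based sketch. `Page` and `paginate` are hypothetical names for this example; the project's paginator service may instead rely on engine-side mechanisms such as scroll identifiers.

```rust
/// One page of results plus the offset needed to request the next one.
#[derive(Debug, PartialEq)]
pub struct Page<T> {
    pub items: Vec<T>,
    pub next_offset: Option<usize>,
}

/// Returns up to `limit` items starting at `offset`. Out-of-range offsets
/// yield an empty page instead of panicking.
pub fn paginate<T: Clone>(found: &[T], offset: usize, limit: usize) -> Page<T> {
    let start = offset.min(found.len());
    let end = start.saturating_add(limit).min(found.len());
    // With a zero limit there is no progress, so no next page is reported.
    let next_offset = (limit > 0 && end < found.len()).then_some(end);
    Page {
        items: found[start..end].to_vec(),
        next_offset,
    }
}

fn main() {
    let found = vec!["p1", "p2", "p3", "p4", "p5"];
    let mut offset = Some(0);
    // Walk through all pages, two results at a time.
    while let Some(current) = offset {
        let page = paginate(&found, current, 2);
        println!("{:?}", page.items);
        offset = page.next_offset;
    }
}
```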

The use cases are:

```
usecase
|----> Storage Use Case
|      |----> CRUD of indexes and documents
|      |----> split a large document into parts for storage
|      |----> upload a file to storage and create a new task processing event
|
|----> Searching Use Case
|      |----> search document parts using multiple algorithms
|      |----> paginate the found document parts
```
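
The "split a large document into parts" step could look roughly like the sketch below, which cuts text into fixed-size chunks of characters. `split_into_parts` and its `max_chars` parameter are assumptions for illustration; the project may split on tokens or sentences, or leave chunking to the OpenSearch ML plugin mentioned in the changelog.

```rust
/// Splits `text` into parts of at most `max_chars` characters each.
/// Works on `char`s rather than bytes, so multi-byte UTF-8 text is never
/// cut in the middle of a character.
pub fn split_into_parts(text: &str, max_chars: usize) -> Vec<String> {
    assert!(max_chars > 0, "max_chars must be positive");
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(max_chars)
        .map(|chunk| chunk.iter().collect())
        .collect()
}

fn main() {
    let parts = split_into_parts("Doc-Search stores documents in parts", 10);
    for (number, part) in parts.iter().enumerate() {
        println!("part {number}: {part:?}");
    }
}
```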

The context map:

```
+----------------+         +-----------------+
| StorageUseCase | <────── | SearcherUseCase |
+----------------+         +-----------------+
        |                           |
        ▼                           ▼
+----------------+         +-----------------+
| Storage Domain |         | Searcher Domain |
+----------------+         +-----------------+
```

Context data flow:

```
HTTP Request
     │
     ▼
HTTP Handler (ServerState)
     │
     ▼
ServerAppState
     ├── StorageUseCase (application)
     │        │
     │        ▼
     │   Storage (domain)
     │
     └── SearcherUseCase (application)
              │
              ▼
         Task (domain)
```

## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

### Prerequisites

- Rust
- Docker & docker-compose
- Cache (Redis)
- OpenSearch

### Quick Start

0. See the scripts in `docs/opensearch` for how to load the ML cluster on a single node and set up the infrastructure (ingest and search pipelines, model deployment).
1. Clone the repository.
2. Run `cargo install --path .` to build the project.
3. Set up the `.env` file with the service credentials.
4. Run `cargo run --bin init-infrastructure` to initialize the OpenSearch schemas.
5. Run `cargo run --bin launch` to launch the service.

### Features of project

Features for parsing and storing documents locally from the current service (not stable):
- `enable-unique-doc-id` - generates a unique document id based on the index and document ids.
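
One possible way to derive such an id is to hash the two ids together. The sketch below uses a hand-written 64-bit FNV-1a hash so the result is stable across runs and platforms. The function `unique_doc_id` and the choice of FNV-1a are assumptions for illustration, not the project's actual scheme.

```rust
/// 64-bit FNV-1a hash (standard offset basis and prime).
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Builds a deterministic id from the index id and the document id.
/// A zero byte separates the two so ("ab", "c") and ("a", "bc")
/// are hashed as different inputs.
pub fn unique_doc_id(index_id: &str, doc_id: &str) -> String {
    let mut data = Vec::with_capacity(index_id.len() + doc_id.len() + 1);
    data.extend_from_slice(index_id.as_bytes());
    data.push(0);
    data.extend_from_slice(doc_id.as_bytes());
    format!("{:016x}", fnv1a64(&data))
}

fn main() {
    println!("{}", unique_doc_id("reports", "2024-summary.pdf"));
}
```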

[![Bread White - doc-search](https://img.shields.io/static/v1?label=Bread%20White&message=author&color=blue&logo=github)](https://github.com/breadrock1/doc-searcher)

[![stars - doc-search](https://img.shields.io/github/stars/breadrock1/doc-searcher?style=social)](https://github.com/breadrock1/doc-searcher)
[![forks - doc-search](https://img.shields.io/github/forks/breadrock1/doc-searcher?style=social)](https://github.com/breadrock1/doc-searcher)