https://github.com/ev2900/opensearch_knn_vector_search

Tokenize and convert sample text data into vectors using BERT. Load the vector representation of the text to OpenSearch and use kNN for semantic search

# OpenSearch kNN Vector Search


This example uses the publicly available [Amazon Product Question Answer](https://registry.opendata.aws/amazon-pqa/) (PQA) data set. The questions in the PQA data set are tokenized and represented as vectors, with BERT (via Hugging Face) generating the embeddings. The vector representations of the questions (embeddings) are loaded into an OpenSearch index as a *knn_vector* data type

Searches are executed against OpenSearch by transforming the search text into an embedding and using kNN to find the most similar question embeddings. The answers belonging to the most similar questions are returned as search results
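To make the idea of "most similar" concrete, here is a minimal brute-force sketch of k-nearest-neighbor search in plain Python. It is only an illustration: the toy 2-dimensional vectors and the choice of Euclidean distance are assumptions, and OpenSearch itself typically answers kNN queries with approximate index structures rather than a full scan like this one.

```python
import math

def knn(query, vectors, k):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    # Pair every distance with the index of the vector it belongs to
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()
    return [i for _, i in distances[:k]]

# Toy 2-dimensional "embeddings" (made up for illustration)
embeddings = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]

nearest = knn([1.0, 1.0], embeddings, k=2)
print(nearest)  # indices of the two closest vectors
```

The returned indices identify which stored questions are closest to the query, which is how the matching answers can then be looked up.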

# Deployment on AWS

To deploy this example on AWS you can click on the button below to launch a CloudFormation stack

[![Launch CloudFormation Stack](https://sharkech-public.s3.amazonaws.com/misc-public/cloudformation-launch-stack.png)](https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=open-search-kNN&templateURL=https://sharkech-public.s3.amazonaws.com/misc-public/OpenSearch_kNN_Vector_Search.yaml)

The stack will deploy an Amazon OpenSearch domain and a Cloud9 environment with this GitHub repository downloaded. Before using the Cloud9 environment, run [resize_EBS.sh](https://github.com/ev2900/OpenSearch_kNN_Vector_Search/blob/main/resize_EBS.sh) from the Cloud9 terminal.

Execute the following in the terminal from the *OpenSearch_kNN_Vector_Search* directory

```bash resize_EBS.sh```

The bash script resizes the EBS volume attached to the Cloud9 instance from 10 GB to 100 GB.

Once the resize is complete, you can update and run the [kNN.py](https://github.com/ev2900/OpenSearch_kNN_Vector_Search/blob/main/kNN.py) Python script in the Cloud9 environment. The only part of the Python script that needs to be updated before running it is the section below

```
# Configure re-usable variables for Opensearch domain URL, user name and password
opensearch_url = 'https://
```

Output: ```inputs_tokens```

tokenizer()

padding - Ensures that all sequences in a batch have the same length. If the padding argument is set to True, the function pads sequences up to the length of the longest sequence in the batch

return_tensors - Returns the output as a PyTorch torch.Tensor object
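The effect of padding to the longest sequence can be sketched in plain Python. This is not the Hugging Face implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and a pad id of 0 is an assumption.

```python
def pad_batch(batch, pad_id=0):
    """Pad lists of token ids to the longest length in the batch.

    Returns the padded ids plus an attention mask (1 = real token, 0 = padding).
    """
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Made-up token ids for two questions of different lengths
batch = [[101, 2515, 102], [101, 2515, 2023, 2147, 102]]

input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.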

### Convert tokenized questions into vectors using BERT

Input: ```inputs_tokens```

Output: ```outputs```

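```outputs``` is a 3-dimensional tensor object. Working with 1000 rows of data, the dimensions of ```outputs``` could be [1000, 64, 768]: one entry per row, per token position in the padded sequence, and per embedding dimension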

### Use mean pooling to condense the

Input: ```outputs```

Output: ```question_text_embeddings```

```question_text_embeddings``` is a 2-dimensional tensor object. Working with 1000 rows of data, the dimensions of the output could be [1000, 768]
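As a rough sketch of what mean pooling does to the shapes, the plain-Python function below averages over the token axis, turning [rows, tokens, hidden] into [rows, hidden]. The nested lists stand in for tensors and the toy shape [2, 3, 2] stands in for [1000, 64, 768]. `mean_pool` is a hypothetical helper, and this simple version averages every token position, padding included; whether kNN.py masks out padding is not stated here.

```python
def mean_pool(outputs):
    """Average over the token axis: [rows, tokens, hidden] -> [rows, hidden]."""
    pooled = []
    for token_vectors in outputs:  # one row with shape [tokens, hidden]
        n_tokens = len(token_vectors)
        # zip(*token_vectors) groups the values of each hidden dimension across tokens
        pooled.append([sum(values) / n_tokens for values in zip(*token_vectors)])
    return pooled

# Toy input with shape [2, 3, 2]
outputs = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
]

question_text_embeddings = mean_pool(outputs)
print(question_text_embeddings)  # shape [2, 2]
```

On a PyTorch tensor, the same unmasked average can be taken with `outputs.mean(dim=1)`.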

## 3. Create an OpenSearch index

Make an API call to the OpenSearch domain to create an OpenSearch index named ```nlp_pqa``` with three fields:

1. question_vector
2. question
3. answer

The data type of the ```question_vector``` field is ```knn_vector```
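A request body for such an index might look like the sketch below. The field names come from the list above; the dimension of 768 is taken from the embedding shape mentioned earlier, while the `text` type for ```question``` and ```answer``` is an assumption, since the mapping used by kNN.py is not shown here.

```python
import json

# Sketch of an index definition for nlp_pqa (types of question/answer are assumed)
index_body = {
    "settings": {"index": {"knn": True}},  # enable kNN search on this index
    "mappings": {
        "properties": {
            "question_vector": {"type": "knn_vector", "dimension": 768},
            "question": {"type": "text"},
            "answer": {"type": "text"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```

A body like this would be sent in a `PUT` request to the ```nlp_pqa``` path on the OpenSearch domain.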

## 4. Load data into the index

Make API calls to the OpenSearch domain to load the data (the plain text and its vector representation) into the OpenSearch index that was just created
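One common way to load many documents is the `_bulk` endpoint, which takes newline-delimited JSON: an action line followed by the document itself. Whether kNN.py uses `_bulk` or indexes documents one at a time is not stated here, so treat this as a sketch. `build_bulk_payload` is a hypothetical helper, and the row contents are placeholders (a real vector would have 768 values).

```python
import json

def build_bulk_payload(index_name, rows):
    """Build a newline-delimited JSON payload for the _bulk endpoint."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(row))                                # document line
    # The bulk format requires a trailing newline after the last line
    return "\n".join(lines) + "\n"

# Placeholder row for illustration only
rows = [
    {
        "question_vector": [0.1, 0.2, 0.3],
        "question": "does this work with xbox?",
        "answer": "placeholder answer",
    }
]

payload = build_bulk_payload("nlp_pqa", rows)
print(payload)
```

The payload would be sent in a `POST` request to `_bulk` with the `Content-Type: application/x-ndjson` header.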

## 5. Convert user input/search into a vector

Tokenize and convert the user input / search of *does this work with xbox?* into a vector. The vector representation of this search will be used in the next step

Input: ```query_raw_sentences = ['does this work with xbox?']```

Output: ```search_vector```

## 6. Search OpenSearch using the vector representation of the user input/search

Make an API call to the OpenSearch domain to run the search *does this work with xbox?* by passing the vectorized version of the search to OpenSearch. Print the top results to the console
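The search request body can be sketched as a `knn` query against the ```question_vector``` field. `build_knn_query` is a hypothetical helper, the default of `k=3` is an arbitrary choice, and the short vector is a placeholder for the real 768-value ```search_vector```.

```python
import json

def build_knn_query(search_vector, k=3):
    """Build a kNN search body that returns the k nearest question vectors."""
    return {
        "size": k,
        "query": {
            "knn": {
                "question_vector": {"vector": search_vector, "k": k}
            }
        },
    }

# Placeholder vector for illustration only
search_vector = [0.1, 0.2, 0.3]

query_body = build_knn_query(search_vector, k=3)
print(json.dumps(query_body, indent=2))
```

A body like this would be sent to the `_search` endpoint of the ```nlp_pqa``` index; each returned hit carries the stored ```question``` and ```answer``` fields in its `_source`.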