Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ev2900/opensearch_knn_vector_search
Tokenize and convert sample text data into vectors using BERT. Load the vector representation of the text to OpenSearch and use kNN for semantic search
aws bert huggingface-transformers knn opensearch python search vector-search
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/ev2900/opensearch_knn_vector_search
- Owner: ev2900
- Created: 2023-08-31T13:52:07.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-17T01:31:38.000Z (3 months ago)
- Last Synced: 2024-10-19T03:07:40.910Z (3 months ago)
- Topics: aws, bert, huggingface-transformers, knn, opensearch, python, search, vector-search
- Language: Python
- Homepage:
- Size: 797 KB
- Stars: 2
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# OpenSearch kNN Vector Search
This example uses the publicly available [Amazon Product Question Answer](https://registry.opendata.aws/amazon-pqa/) (PQA) data set. The questions in the PQA data set are tokenized and represented as vectors. BERT, via Hugging Face, is used to generate the embeddings. The vector representations of the questions (embeddings) are loaded into an OpenSearch index as a *knn_vector* data type.
Searches are executed against OpenSearch by transforming the search text into an embedding and using kNN to find the most similar question embeddings. The questions and answers of the most similar matches are returned as search results.
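To make the idea of kNN similarity concrete, here is a minimal, library-free sketch of an exact nearest-neighbour lookup. It is only an illustration: the function name `knn_search`, the toy 2-dimensional vectors and the choice of Euclidean distance are assumptions, and OpenSearch performs this lookup with its own k-NN index rather than a brute-force scan like this one.

```python
import math

def knn_search(query, vectors, k=3):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()  # smallest distance first
    return [i for _, i in distances[:k]]

# Toy 2-dimensional "embeddings"; the BERT embeddings in this example have 768 dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = knn_search([1.0, 1.0], toy_vectors, k=2)
print(nearest)
```

The returned indices identify the stored vectors nearest to the query, which is the role the indexed question embeddings play in the search step.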
# Deployment on AWS
To deploy this example on AWS you can click on the button below to launch a CloudFormation stack
[![Launch CloudFormation Stack](https://sharkech-public.s3.amazonaws.com/misc-public/cloudformation-launch-stack.png)](https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=open-search-kNN&templateURL=https://sharkech-public.s3.amazonaws.com/misc-public/OpenSearch_kNN_Vector_Search.yaml)
The stack will deploy an Amazon OpenSearch domain and a Cloud9 environment with this GitHub repository downloaded. Before using the Cloud9 environment, run the [resize_EBS.sh](https://github.com/ev2900/OpenSearch_kNN_Vector_Search/blob/main/resize_EBS.sh) script from the Cloud9 terminal.
Execute the following in the terminal from the *OpenSearch_kNN_Vector_Search* directory
```bash resize_EBS.sh```
The bash script resizes the EBS volume attached to the Cloud9 instance from 10 GB to 100 GB.
Once the resize is complete, you can update and run the [kNN.py](https://github.com/ev2900/OpenSearch_kNN_Vector_Search/blob/main/kNN.py) Python script in the Cloud9 environment. The only part of the Python script that needs to be updated before running it is the section below
```
# Configure re-usable variables for Opensearch domain URL, user name and password
opensearch_url = 'https://
```

Output: ```inputs_tokens```

```tokenizer()``` arguments:
padding - Ensure that all sequences in a batch have the same length. If the padding argument is set to True, the function will pad sequences up to the length of the longest sequence in the batch.

return_tensors - Return output as a PyTorch torch.Tensor object.

### Convert tokenized questions into vectors using BERT
Input: ```inputs_tokens```
Output: ```outputs```

```outputs``` is a 3-dimensional tensor object. When working with 1000 rows of data, the dimensions of ```outputs``` could be [1000, 64, 768] (rows, tokens per padded sequence, embedding size)
### Use mean pooling to condense the
Input: ```outputs```
Output: ```question_text_embeddings```

```question_text_embeddings``` is a 2-dimensional tensor object. When working with 1000 rows of data, the dimensions of ```question_text_embeddings``` could be [1000, 768]
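The reduction from three dimensions to two can be sketched without any libraries. The snippet below is an illustration of the arithmetic only, not the code in kNN.py: the helper name `mean_pool`, the tiny shapes and the use of an attention mask to skip padded tokens are all assumptions (the script works on PyTorch tensors, and whether it masks padding is not stated here).

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sequence, skipping padded positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    hidden_size = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(hidden_size)]

# A batch of 2 sequences, 3 tokens each, hidden size 2 (shape [2, 3, 2]).
# In the README's example the corresponding shape is [1000, 64, 768].
outputs = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # last token is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
]
masks = [[1, 1, 0], [1, 1, 1]]

# One pooled vector per sequence: shape [2, 2].
question_text_embeddings = [mean_pool(seq, m) for seq, m in zip(outputs, masks)]
print(question_text_embeddings)
```

Averaging over the token axis is what removes the middle dimension, leaving one fixed-length vector per question.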
## 3. Create an OpenSearch index
Make an API call to the OpenSearch domain to create an OpenSearch index named ```nlp_pqa``` with 3 fields. These fields include
1. question_vector
2. question
3. answer

The data type of the ```question_vector``` field is ```knn_vector```
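As a rough sketch of what such an index definition might look like, the snippet below builds a request body that could be sent with `PUT <opensearch_url>/nlp_pqa`. The helper name `build_index_body` is hypothetical, the `text` type for ```question``` and ```answer``` is an assumption, and the dimension of 768 is taken from the embedding size mentioned above. The exact body used in kNN.py may differ.

```python
import json

def build_index_body(dimension=768):
    """Build an index definition with one knn_vector field and two text fields."""
    return {
        # k-NN search must be enabled on the index for knn_vector fields to be searchable.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "question_vector": {"type": "knn_vector", "dimension": dimension},
                "question": {"type": "text"},  # assumed field type
                "answer": {"type": "text"},    # assumed field type
            }
        },
    }

index_body = build_index_body()
print(json.dumps(index_body, indent=2))
```

The ```dimension``` value has to match the length of the vectors that will be indexed, otherwise OpenSearch rejects the documents.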
## 4. Load data into the index
Make API calls to the OpenSearch domain to load the data (plain text and vector representation) into the OpenSearch index that was just created
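One common way to load many documents is the `_bulk` endpoint, which takes newline-delimited JSON: an action line followed by the document itself. Whether kNN.py uses `_bulk` or indexes documents one at a time is not stated here, so treat the following as a sketch. The helper name `build_bulk_payload` and the sample rows are made up for illustration, and the vectors are shortened to 2 values for readability.

```python
import json

def build_bulk_payload(index_name, rows):
    """Return a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(row))                                # document line
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline

# Made-up sample rows; real vectors would have 768 values each.
rows = [
    {"question": "sample question 1", "answer": "sample answer 1", "question_vector": [0.1, 0.2]},
    {"question": "sample question 2", "answer": "sample answer 2", "question_vector": [0.3, 0.4]},
]
payload = build_bulk_payload("nlp_pqa", rows)
print(payload)
```

A payload like this would be sent with `POST <opensearch_url>/_bulk` and the header `Content-Type: application/x-ndjson`.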
## 5. Convert user input/search into a vector
Tokenize and convert the user input / search of *does this work with xbox?* into a vector. The vector representation of this search will be used in the next step
Input: ```query_raw_sentences = ['does this work with xbox?']```
Output: ```search_vector```

## 6. Search OpenSearch using the vector representation of the user input/search
Make an API call to the OpenSearch domain to run the search *does this work with xbox?* by passing the vectorized version of the search to OpenSearch. Print the top results to the console
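A kNN search request in OpenSearch wraps the query vector in a `knn` clause that names the vector field. The snippet below is a minimal sketch of such a request body for `POST <opensearch_url>/nlp_pqa/_search`. The helper name `build_knn_query`, the choice of `k=3`, the `_source` filter and the shortened vector are assumptions, not the values used in kNN.py.

```python
import json

def build_knn_query(search_vector, k=3):
    """Build a k-NN query against the question_vector field."""
    return {
        "size": k,  # number of hits to return
        "query": {
            "knn": {
                "question_vector": {"vector": search_vector, "k": k}
            }
        },
        # Return only the readable fields, not the stored vector.
        "_source": ["question", "answer"],
    }

# Placeholder vector; the real search_vector would have 768 values.
search_vector = [0.12, -0.03]
query_body = build_knn_query(search_vector)
print(json.dumps(query_body, indent=2))
```

Each hit in the response carries a `_score` and the `_source` fields, so the top results can be printed by iterating over `response["hits"]["hits"]`.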