https://github.com/pinecone-io/hybrid-search-demo

Demo with Python Flask UI based on a web-crawl of Pinecone.io as source data

# Pinecone Hybrid Search Demo
**Description**: Pinecone challenge Web UI searching Pinecone.io using hybrid search
**Author**: Kevin M. Butler
**Date**: Nov-2022
**Version**: 1.0
**Purpose**: This repository contains code to deploy a Python Flask application that queries Pinecone using vector search and, more specifically, hybrid search, which combines dense and sparse vectors.

## Table of contents
[Introduction](#introduction)
[Getting started](#getting-started)
[Resources](#resources)
[Troubleshooting](#troubleshooting)

### Introduction
The Pinecone [vector database](https://www.pinecone.io/learn/vector-database/) makes it easy to build high-performance vector search applications. It is developer-friendly, fully managed, and easily scalable without infrastructure hassles.

Vector databases have become very popular for semantic use cases, and Pinecone has now introduced the ability to combine dense and sparse vectors to perform hybrid search.

Hybrid search uses tokenized keywords (sparse vectors) and semantic representations (dense vectors) together, so a search can match on both lexical and semantic meaning. Traditionally, this required two technologies, multiple queries, and post-processing. With Pinecone hybrid search, you can merge these technologies in one fast, scalable, fully managed system.

This repository demonstrates a web crawl of [pinecone.io](https://www.pinecone.io) that is stored in Pinecone and queried through a Python Flask application serving as the web UI. The web UI converts the query into sparse and dense vectors, then sends them to Pinecone. The results come back in JSON format and are easy for the web application to parse.
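
To make the sparse side of the query more concrete, here is a minimal sketch of turning text into a sparse term-frequency vector. It is not the tokenizer used by this repository, which is not shown here. The helper name `sparse_vector`, the hashing of tokens to indices, the `dim` parameter, and the `indices`/`values` layout are all assumptions made for illustration.

```python
import re
import zlib
from collections import Counter


def sparse_vector(text, dim=2**20):
    """Map text to a sparse term-frequency vector (hypothetical helper)."""
    # Lowercase and split on anything that is not a letter or digit
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # crc32 is stable across runs, so a token always maps to the same index;
    # tokens that collide on an index simply have their counts added together
    counts = Counter(zlib.crc32(tok.encode("utf-8")) % dim for tok in tokens)
    indices = sorted(counts)
    return {"indices": indices, "values": [float(counts[i]) for i in indices]}


query = sparse_vector("hybrid search combines keyword search and semantic search")
print(query)
```

Only the positions of tokens that actually occur are stored, which is what makes the vector sparse. The dense vector, by contrast, would come from an embedding model and has a value in every dimension.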

The balance between lexical and semantic matching is controlled by the 'alpha' parameter, which is passed in as a value between 0 and 1. A value closer to 0 is more lexical, or keyword oriented, and favors exact matching of terms. A value closer to 1 is more semantic and gives more weight to the meaning of the words.
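
The effect of alpha can be sketched with the weighting formula listed under Misc Formulas in this README. The two documents and their scores below are made-up numbers for illustration, and `hybrid_score` and `rank` are hypothetical helper names, not part of this repository or the Pinecone client.

```python
def hybrid_score(dense_score, sparse_score, alpha):
    """Blend dense and sparse scores; alpha=0 is purely lexical, alpha=1 purely semantic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * dense_score + (1.0 - alpha) * sparse_score


# Made-up scores: one document matches the keywords, the other matches the meaning
docs = {
    "exact-keyword-match": {"dense": 0.2, "sparse": 0.9},
    "paraphrase": {"dense": 0.8, "sparse": 0.1},
}


def rank(alpha):
    """Return document names ordered from highest to lowest blended score."""
    return sorted(
        docs,
        key=lambda name: hybrid_score(docs[name]["dense"], docs[name]["sparse"], alpha),
        reverse=True,
    )


for alpha in (0.0, 0.5, 1.0):
    print(alpha, rank(alpha))
```

With these numbers, the keyword match ranks first at alpha 0 and the paraphrase ranks first at alpha 1, which is the shift you should expect to see when moving the alpha value in the search UI.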

### Getting started
#### Getting the data in
The ./resources directory has the files necessary to process the data and upload it to your Pinecone index.

The Jupyter notebook 'Pinecone_io_generate_embeddings_example.ipynb' is where you will run the code. Feel free to make a copy of this file and rename it to 'Pinecone_io_generate_embeddings.ipynb'. It uses 'Pinecone_io_Webcrawl.json' as its data source. To run this notebook, you will need to resolve any dependencies in your Python environment. You will also need your API key, environment, and index name. If you do not have an API key, you can [sign up for free here](https://app.pinecone.io/).

Before you begin working in the notebook, you will have to [create a project](https://www.pinecone.io/docs/manage-projects/#creating-a-new-project) in order to retrieve your API key. Once the project is created and you have your API key, you can continue here.

If you need to create your index for the first time, you can use curl:
_(modify YOUR_API_KEY, YOUR_ENVIRONMENT, and YOUR_INDEX_NAME below)_


    curl --location --request POST 'https://controller.YOUR_ENVIRONMENT.pinecone.io/databases' \
    --header 'Api-Key: YOUR_API_KEY' \
    --header 'Content-Type: application/json' \
    --data-raw '{
      "name": "YOUR_INDEX_NAME",
      "dimension": 384,
      "metric": "dotproduct",
      "pod_type": "s1h",
      "index_config": {
        "hybrid_search": {
          "avg_doc_len": 100
        }
      }
    }'
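
If you would rather stay in Python, the same request can be sketched with the standard library. The URL, headers, and payload are copied from the curl command above; the function name `build_create_index_request` is hypothetical. The sketch only builds the request and does not send it, since the preview endpoint may have changed since this repository was published.

```python
import json
import urllib.request


def build_create_index_request(api_key, environment, index_name):
    """Build (but do not send) the create-index request from the curl command."""
    payload = {
        "name": index_name,
        "dimension": 384,
        "metric": "dotproduct",
        "pod_type": "s1h",
        "index_config": {"hybrid_search": {"avg_doc_len": 100}},
    }
    return urllib.request.Request(
        url=f"https://controller.{environment}.pinecone.io/databases",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_create_index_request("YOUR_API_KEY", "YOUR_ENVIRONMENT", "YOUR_INDEX_NAME")
print(request.get_method(), request.full_url)

# To actually create the index, send the request:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```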

**Note**: as of the publishing of this repo, the [pinecone-client](https://github.com/pinecone-io/pinecone-python-client) does not include hybrid API references since it is still in preview status. We have included a temporary class 'hybrid_pinecone_client.py' in this repository to take care of what the [pinecone-client](https://github.com/pinecone-io/pinecone-python-client) would normally do. Once the python-client is updated, we will refactor this repository.

Next, modify the following section of the 'Pinecone_io_generate_embeddings.ipynb' file by replacing YOUR_API_KEY, YOUR_ENVIRONMENT, and YOUR_INDEX_NAME below.

_**Pinecone_io_generate_embeddings.ipynb**_

    # initialize an instance of HybridPinecone class
    pinecone = HybridPinecone(
        api_key="YOUR_API_KEY",  # app.pinecone.io
        environment="YOUR_ENVIRONMENT",
    )

    # choose a name for your index
    index_name = "YOUR_INDEX_NAME"

#### Time to run the Jupyter Notebook
You can now run all of the cells in the Jupyter notebook. If all is successful, you should see query results at the bottom of the notebook. If you have any issues, please see the [troubleshooting](#troubleshooting) section.

#### Running Python Flask (Web UI)
Writing and deploying Flask applications is outside the scope of this demo; however, here is a [quick tutorial](https://flask.palletsprojects.com/en/2.2.x/quickstart/) to help you get started if you are not familiar with the Flask framework.

The Flask application included in this repository is controlled by the 'app.py' file which configures the web services and prepares the server to perform the search functionality. Templates included in the ./templates directory provide the views of the pages and render results.

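Before we run Flask, we must configure the 'app.py' file. Create a copy of app_example.py and rename it to app.py. By default, the 'flask run' command looks for an application in a file named app.py (or wsgi.py), so use this name unless you set FLASK_APP to point elsewhere.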

Replace YOUR_API_KEY, YOUR_ENVIRONMENT, YOUR_INDEX_NAME, and YOUR_SECRET_KEY with the values of your application. YOUR_SECRET_KEY can be any text value you like.

_**app.py**_


    # Configure and connect to Pinecone index
    api_key = "YOUR_API_KEY"
    pinecone_env = "YOUR_ENVIRONMENT"
    index_name = "YOUR_INDEX_NAME"
    pinecone = hybrid_pinecone_client.HybridPinecone(api_key, pinecone_env)
    pinecone.connect_index(index_name)

    # Flask settings
    app = Flask(__name__)
    app.config['SECRET_KEY'] = 'YOUR_SECRET_KEY'
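
If you prefer not to hard-code these values in app.py, one option is to read them from environment variables. This is a minimal sketch and not how the repository is set up: the variable names (PINECONE_API_KEY and so on) and the `load_settings` helper are assumptions chosen for illustration.

```python
import os

# Hypothetical variable names; pick whatever fits your deployment
REQUIRED_NAMES = (
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX_NAME",
    "FLASK_SECRET_KEY",
)


def load_settings(env=None):
    """Return the required settings, failing early if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_NAMES if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_NAMES}


# Example with an explicit mapping; in app.py you would call load_settings()
settings = load_settings({
    "PINECONE_API_KEY": "YOUR_API_KEY",
    "PINECONE_ENVIRONMENT": "YOUR_ENVIRONMENT",
    "PINECONE_INDEX_NAME": "YOUR_INDEX_NAME",
    "FLASK_SECRET_KEY": "YOUR_SECRET_KEY",
})
print(settings["PINECONE_INDEX_NAME"])
```

The returned values can then replace the string literals in the snippet above, for example `api_key = settings["PINECONE_API_KEY"]`.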


To run this application locally, start Flask from a terminal in the root directory of this project. Here is one example that activates a conda environment named pinecone, sets the debug properties, and starts Flask. FLASK_ENV was deprecated in Flask 2.2 and removed in Flask 2.3, so on newer versions FLASK_DEBUG=1 alone is enough:

    conda activate pinecone && export FLASK_ENV=development && export FLASK_DEBUG=1 && flask run

If all is working correctly, you should be able to visit [http://127.0.0.1:5000](http://127.0.0.1:5000) and see the search UI. Try some searches while adjusting the alpha value to any number between 0 and 1.

_A lower value closer to 0 is considered more lexical or keyword oriented. A higher value closer to 1 is more semantic and gives more weight to the meaning of the words rather than the more exact matching of the terms that lexical is suited for._

### Resources

#### Pinecone
[Sign up for a free Pinecone Account](https://app.pinecone.io/)
[Hybrid search early access documentation](https://docs.google.com/document/d/1Tx3tHC8PA9r5NfsTONpGMLurwZUj5phYcDa5JI6qJAU/edit#heading=h.8fup2t4burfu)
[Introducing the hybrid index to enable keyword-aware semantic search](https://www.pinecone.io/learn/hybrid-search/?utm_medium=email&_hsmi=231739825&_hsenc=p2ANqtz-_KdCTL8VpX0tqZ_e3Z9MSJSM6toQaESiTgWZCBVIbYMByQiG3rxb7GBh4WGY2mF9J44eYp_lrs9kEL-oQ_y8ivEdFjcQ&utm_content=231739825&utm_source=hs_email)
[Getting Started with Hybrid Search](https://www.pinecone.io/learn/hybrid-search-intro/)
[Pinecone's New *Hybrid* Search - the future of search?](https://youtu.be/0cKtkaR883c)

#### Misc Formulas
    overall_score = alpha * dense_score + (1 - alpha) * sparse_score
    sparse_score = (overall_score - alpha * dense_score) / (1 - alpha)
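
The second formula is the first one solved for sparse_score, so it is undefined when alpha is 1 (division by zero). Here is a small sketch that checks the round trip; the function names are hypothetical and the scores are made-up values.

```python
def overall_from_parts(dense_score, sparse_score, alpha):
    """First formula: blend the dense and sparse scores."""
    return alpha * dense_score + (1 - alpha) * sparse_score


def sparse_from_overall(overall_score, dense_score, alpha):
    """Second formula: recover the sparse score from the overall score."""
    if alpha == 1:
        # At alpha = 1 the sparse score has no influence, so it cannot be recovered
        raise ValueError("sparse_score cannot be recovered when alpha is 1")
    return (overall_score - alpha * dense_score) / (1 - alpha)


overall = overall_from_parts(dense_score=0.75, sparse_score=0.25, alpha=0.5)
print(overall)  # 0.5
print(sparse_from_overall(overall, dense_score=0.75, alpha=0.5))  # 0.25
```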

#### Python
[Python Flask](https://flask.palletsprojects.com/en/2.2.x/)
[Jupyter Notebooks](https://jupyter.org/)

#### Digital Ocean
[Digital Ocean](https://www.digitalocean.com/)

### Troubleshooting

#### Errors in the Jupyter notebook
Some of the most common errors:
1.) Unmet dependencies in your local Python environment. These can typically be resolved by installing the missing module or package with pip.

2.) API key, environment, or index name not correct. Compare the values in your notebook to the settings in the [Pinecone Console](https://app.pinecone.io/).

#### Errors in the Python Flask application
Although Flask is outside the scope of this repository, you may find help [here](https://www.digitalocean.com/community/tutorials/how-to-handle-errors-in-a-flask-application).