https://github.com/phact/neighborhoodwatch
gpu powered brute force knn ground truth dataset generator
- Host: GitHub
- URL: https://github.com/phact/neighborhoodwatch
- Owner: phact
- Created: 2023-09-06T18:33:46.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-20T14:32:55.000Z (about 1 year ago)
- Last Synced: 2024-11-29T17:47:43.170Z (10 months ago)
- Topics: gpu, knn, vector-search
- Language: Python
- Homepage:
- Size: 952 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
## NeighborhoodWatch
NeighborhoodWatch (`nw`) is a GPU-powered brute-force KNN ground truth dataset generator.
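Conceptually, the ground truth that `nw` produces is what an exact brute-force KNN search computes. A small NumPy sketch of the idea (illustrative only; the actual program runs on the GPU at much larger scale):

```python
import numpy as np

def brute_force_knn(base: np.ndarray, queries: np.ndarray, k: int):
    """Exact KNN: compute every query-to-base Euclidean distance."""
    # (num_queries, num_base) matrix of pairwise distances
    dists = np.linalg.norm(queries[:, None, :] - base[None, :, :], axis=2)
    # indices of the k closest base vectors per query, nearest first
    indices = np.argsort(dists, axis=1)[:, :k]
    return indices, np.take_along_axis(dists, indices, axis=1)

base = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
queries = np.array([[0.1, 0.0]])
idx, d = brute_force_knn(base, queries, k=2)
# idx[0] lists the two nearest base vectors for the query, nearest first
```

Because every pairwise distance is computed, the result is exact rather than approximate, which is what makes it usable as a ground truth reference for evaluating ANN indexes.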
### Set Up the Environment
At a high level, the following prerequisites must be satisfied in order to run this program:
* One computing instance with an NVIDIA GPU (e.g. AWS `p3.8xlarge` instance type)
* NVIDIA CUDA toolkit and driver 12 installed ([link](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html))
* NVIDIA cuDNN library installed ([link](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html))
* NVIDIA NCCL library installed ([link](https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html))
* Python version 3.10
* A Python virtual environment (e.g. MiniConda) is highly recommended
* Poetry for Python dependency management

An example of setting up a bare-metal environment on an AWS `p3.8xlarge` instance with `Ubuntu 22.04` OS is provided in the following script:
* [install_baremetal_env.sh](bash/install_baremetal_env.sh)

For convenience, a [Dockerfile](./Dockerfile) is also provided, which lets you build a docker image and run the `nw` program within a docker container with all the required driver and library dependencies. For more detailed information, please refer to the [nw_docker](./nw_docker.md) document.
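A container workflow based on the provided Dockerfile might look like the sketch below. The image tag and mount path are illustrative assumptions, not the project's documented commands; see [nw_docker](./nw_docker.md) for the authoritative instructions. GPU passthrough via `--gpus all` requires the NVIDIA Container Toolkit on the host.

```shell
# Build the image from the provided Dockerfile (tag name is illustrative)
docker build -t neighborhoodwatch .

# Run with GPU access; the output mount path is an assumption
docker run --gpus all \
    -v "$PWD/knn_dataset:/app/knn_dataset" \
    neighborhoodwatch 1000 10000 -k 100 -m 'intfloat/e5-base-v2'
```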
---
### Run the Program
First check and install Python dependencies by running the following commands in the home directory of this program:
```
poetry lock && poetry install
```

Then run the program with the `poetry run nw` command. The available input parameters are listed below:
```
$ poetry run nw -h
usage: nw [-h] [-m MODEL_NAME] [-rd REDUCED_DIMENSION_SIZE] [-k K] [--data_dir DATA_DIR] [--use-dataset-api | --no-use-dataset-api] [--gen-hdf5 | --no-gen-hdf5]
[--post-validation | --no-post-validation] [--enable-memory-tuning] [--disable-memory-tuning]
          query_count base_count

nw (neighborhood watch) uses GPU acceleration to generate ground truth KNN datasets

positional arguments:
  query_count           number of query vectors to generate
  base_count            number of base vectors to generate

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        model name to use for generating embeddings, i.e. text-embedding-ada-002, textembedding-gecko, or intfloat/e5-large-v2
  -rd REDUCED_DIMENSION_SIZE, --reduced_dimension_size REDUCED_DIMENSION_SIZE
                        Reduced (output) dimension size. Only supported in models (e.g. OpenAI text-embedding-3-xxx) that have this feature. Ignored otherwise!
  -k K, --k K           number of neighbors to compute per query vector
  --data_dir DATA_DIR   Directory to store the generated data (default: knn_dataset)
  --use-dataset-api, --no-use-dataset-api
                        Use 'pyarrow.dataset' API to read the dataset. Recommended for large datasets. (default: False)
  --gen-hdf5, --no-gen-hdf5
                        Generate hdf5 files (default: True)
  --post-validation, --no-post-validation
                        Validate the generated files (default: False)
  --enable-memory-tuning
                        Enable memory tuning
  --disable-memory-tuning
                        Disable memory tuning (useful for very small datasets)

Some example commands:
nw 1000 10000 -k 100 -m 'textembedding-gecko' --disable-memory-tuning
nw 1000 10000 -k 100 -m 'intfloat/e5-large-v2' --disable-memory-tuning
nw 1000 10000 -k 100 -m 'intfloat/e5-small-v2' --disable-memory-tuning
nw 1000 10000 -k 100 -m 'intfloat/e5-base-v2' --disable-memory-tuning
```

### Generated Datasets
After the program runs successfully, it generates a set of datasets under a specified folder, which defaults to the `knn_dataset` subfolder.
You can override the output directory using the `--data_dir` option. In particular, the following datasets include the KNN ground truth results:
| file format | dataset name | dataset file |
| ----------- | ------------ | ------------ |
| `fvec` | `train` dataset (base) | `__base_vectors` |
| `fvec` | `test` dataset (query)| `__query_vectors_` |
| `fvec` | `distances` dataset (distances) | `__distances_` |
| `ivec` | `neighbors` dataset (indices) | `__indices_query_` |
| `hdf5` | consolidated hdf5 dataset of the above 4 datasets | `_base__query_` |

### Run the Tests
```
poetry run pytest
```

#### cli:

#### nvtop:
