{"id":44981038,"url":"https://github.com/pentoai/vectory","last_synced_at":"2026-02-18T18:09:24.240Z","repository":{"id":61000489,"uuid":"543709723","full_name":"pentoai/vectory","owner":"pentoai","description":"Vectory provides a collection of tools to track and compare embedding versions.","archived":false,"fork":false,"pushed_at":"2022-11-25T15:23:24.000Z","size":2013,"stargazers_count":71,"open_issues_count":3,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-11-28T12:28:03.256Z","etag":null,"topics":["deep-learning","deep-neural-networks","embedding-python","embedding-vectors","embeddings-similarity","evaluation-framework"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pentoai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-09-30T17:18:16.000Z","updated_at":"2025-04-05T22:27:15.000Z","dependencies_parsed_at":"2023-01-22T07:15:49.589Z","dependency_job_id":null,"html_url":"https://github.com/pentoai/vectory","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/pentoai/vectory","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pentoai%2Fvectory","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pentoai%2Fvectory/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pentoai%2Fvectory/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pentoai%2Fvectory/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pentoai","download_url":
"https://codeload.github.com/pentoai/vectory/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pentoai%2Fvectory/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29588787,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T16:55:40.614Z","status":"ssl_error","status_checked_at":"2026-02-18T16:55:37.558Z","response_time":162,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","deep-neural-networks","embedding-python","embedding-vectors","embeddings-similarity","evaluation-framework"],"created_at":"2026-02-18T18:09:23.392Z","updated_at":"2026-02-18T18:09:24.227Z","avatar_url":"https://github.com/pentoai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://pento.ai/images/vectory-banner.png\" alt=\"Vectory\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cb\u003e An embedding evaluation toolkit \u003c/b\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/vectory\" target=\"_blank\"\u003e\n        \u003cimg src=\"https://img.shields.io/pypi/v/vectory?color=%2334D058\u0026label=pypi%20package\" alt=\"Package version\"\u003e\n    \u003c/a\u003e\n    \u003ca 
href=\"https://pypi.org/project/vectory\" target=\"_blank\"\u003e\n        \u003cimg src=\"https://img.shields.io/pypi/pyversions/vectory.svg?color=%2334D058\" alt=\"Supported Python versions\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/overview.gif\" alt=\"animated\" /\u003e\n\u003c/p\u003e\n\n\nVectory provides a collection of tools to **track and compare embedding versions**.\n\nVisualizing and registering each experiment is a crucial part of developing successful models. Vectory is a tool designed by and for machine learning engineers to handle embedding experiments with little overhead.\n\n### Key features:\n\n- **Embedding lineage**. Keep track of the data and models used to generate embeddings.\n- **Compare performance**. Compare metrics between different vector spaces.\n- **Ease of use**. Easy usage through the CLI, Python, and GUI interfaces.\n- **Extensibility**. Built with extensibility in mind.\n- **Persistence**. Simple local state persistence using SQLite.\n\n# Table of Contents\n\n1. [Installation](#installation)\n2. [Demo](#demo)\n3. [Usage](#usage)\n4. [Troubleshooting](TROUBLESHOOTING.md)\n5. [License](#license)\n\n# Installation\n\nAll you need for Vectory to run is to install the package and Elasticsearch. You can install the package using pip:\n\n```console\npip install vectory\n```\n\n## Demo\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/intro.gif\" alt=\"animated\" /\u003e\n\u003c/p\u003e\n\nAfter installing Vectory, you can play with the demo cases to get a feel for the toolkit.\n\n- Tiny-ImageNet: A set of embeddings for a computer vision dataset, generated with the pretrained models ResNet50 and ConvNext-tiny.\n- IMDB: A set of embeddings for an NLP dataset, generated with the pretrained models BERT and RoBERTa.\n\nTo set up the demo, run the following command:\n\n```console\nvectory demo\n```\n\nYou can specify the demo dataset by adding the name as the next argument. 
See `vectory demo --help` for more information.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/zoom.gif\" alt=\"animated\" /\u003e\n\u003c/p\u003e\n\n\n## Set up Elasticsearch\n\nWhat is Elasticsearch? It's a free high-performance search engine used for many types of data.\n\nVectory uses Elasticsearch to load embeddings and then search for them.\n\nTo start the engine, you must install Docker and start its daemon.\nAfter that, just run:\n\n```console\nvectory elastic up --detach\n```\n\nAnd you can turn it off with:\n\n```console\nvectory elastic down\n```\n\n# Usage\n\nThe key concepts needed to use Vectory are **datasets**, **experiments** and **embedding spaces**.\n\nA **dataset** is just a collection of data. You could have evaluation or training datasets. Evaluation datasets are required for Vectory to run, whereas training datasets are optional and desired for tracking purposes.\n\nYou will need a CSV file to define a dataset. The CSV file must have a header row, followed by a row for each data point in the dataset. The only requirement is that the CSV has at least an identifier column. The remaining columns can hold labels, features, or any other information.\n\nAn **experiment** is a machine learning model trained with a particular dataset. You can create different experiments by varying the model and the dataset. You can optionally specify a training dataset for tracking purposes.\n\nA dataset and an experiment form an **embedding space**, which is just a 2-dimensional array with all the generated vectors (namely, features or embeddings) for a particular dataset given an experiment. You will need to provide the embeddings in a file that can be either `.npz` or `.npy`.\n\nThe important thing about these embedding files is that they must follow the same indexing as the evaluation dataset CSV file. 
To summarize, for every line in the dataset, there's an embedding in the `.npz` file.\n\n\u003cdetails markdown=\"1\"\u003e\n\u003csummary\u003e \u003cb\u003e Example \u003c/b\u003e \u003c/summary\u003e\n\nYou can have an experiment, such as a ResNet model trained with the dataset Data1. Let’s call the generated embedding space ES1. But either you split your data or you get new data once in a while (or both), so this experiment will not only be used in a static dataset. You might want to use this experiment on Data2 then, generating a particular embedding space called ES2.\n\nVectory helps you to organize and analyze the obtained embeddings for each dataset and experiment.\n\n\u003c/details\u003e\n\n---\n\n# Command Line Interface\n\n## Create\n\nCreate datasets, experiments, and embedding spaces:\n\n```console\nvectory add --dataset [path_to_csv] --embeddings [path_to_npz]\n```\n\nYou can see all the options with the `--help` flag.\n\n## Load\n\nEmbedding Spaces are links to ElasticSearch **indices**. To load the embeddings to ElasticSearch when creating the Embedding Space, add `--load ` after setting the dataset, the Embedding Space, and the parameters. This option for the `add` command only works for the default loading options. You can use the load command to load the embeddings with different options.\n\nTo separately load an Embedding Space to ElasticSearch:\n\n```console\nvectory embeddings load [index_name] [embedding_space_name]\n```\n\nYou can specify the model name, the similarity function, the number of threads, the chunk size, and the hyperparameters for the kNN search. 
You can see all the options with the `--help` flag.\n\n## Search\n\n**List all** your datasets, experiments, embedding spaces, and indices:\n\n```console\nvectory ls\n```\n\n**List the indexes:**\n\n```console\nvectory embeddings list-indices\n```\n\n## Delete\n\nDelete datasets:\n\n```console\nvectory dataset delete [dataset_name]\n```\n\n**Experiments:**\n\n```console\nvectory experiment delete [experiment_name]\n```\n\n**Embedding Spaces:**\n\n```console\nvectory embeddings delete [embedding_space_name]\n```\n\nYou can delete elements associated with these objects and their respective indices by adding the `--recursive` flag.\n\n**Indices:**\n\n```console\nvectory embeddings delete-index [index_name]\n```\n\n**All indices:**\n\n```console\nvectory embeddings delete-all-indices\n```\n\n### Comparing embedding spaces\n\nWith Vectory you can measure how similar two embedding spaces are. The similarity between two embedding spaces is the mean of the local neighborhood similarity of every point, which is the IoU of the ten nearest neighbors.\n\nTo compare two embedding spaces, Vectory computes the ten nearest neighbors for every data point for both embedding spaces, getting the IoU for each group of ten nearest neighbors obtained. Then, it shows the distribution of the IoU values. Also, we compute the mean of the IoU values to provide a single value to compare the two embedding spaces.\n\nTo learn more about comparing embedding spaces, check out [this embedding-comparator](http://vis.csail.mit.edu/pubs/embedding-comparator/) article.\n\nTo compare two embedding spaces, use:\n\n```console\nvectory compare [embedding_space_1_name] [embedding_space_2_name] --precompute\n```\n\nYou can specify the metric for the kNN search in each embedding space. 
You can also calculate the similarity histogram.\n\n# Python API\n\n## Create\n\nCreate datasets, experiments, and an embedding space.\n\n```python\nfrom vectory.datasets import Dataset\nfrom vectory.experiments import Experiment\nfrom vectory.spaces import EmbeddingSpace\n\ndataset = Dataset.get_or_create(csv_path=CSV_PATH, name=DATASET_NAME)\n\ntrain_dataset = Dataset.get_or_create(csv_path=TRAIN_CSV_PATH, name=TRAIN_DATASET_NAME)\n\nexperiment = Experiment.get_or_create(\n    train_dataset=TRAIN_DATASET_NAME,\n    model=MODEL_NAME,\n    name=EXPERIMENT_NAME,\n)\n\nembedding_space = EmbeddingSpace.get_or_create(\n    npz_path=NPZ_PATH,\n    dims=EMBEDDINGS_DIMENSIONS,\n    experiment=EXPERIMENT_NAME,\n    dataset=DATASET_NAME,\n    name=EMBEDDING_SPACE_NAME,\n)\n```\n\nThe `train_dataset` parameter is optional, but we recommend setting it to track the training process.\n\nLoad an index on ElasticSearch for an embedding space:\n\n```python\nfrom vectory.indices import load_index\n\nload_index(\n    index_name=INDEX_NAME,\n    embedding_space_name=EMBEDDING_SPACE_NAME,\n)\n```\n\nYou can get the names of `dataset`, `experiment`, and `embedding_space` objects using `model.name`.\n\nAdditionally, you can specify the desired mapping to load the index. You can choose the mapping to use `cosine` or `euclidean` similarity for the kNN search. Searching will be slower but more accurate when using an `exact` model instead of the `lsh`. The `lsh` model and the `cosine` similarity are the default options. 
To see all the available mappings, check the possible options from `vectory.es.api.Mapping`.\n\n## Search\n\nGet all your datasets, experiments, embedding spaces, and indices:\n\n```python\nfrom vectory.db.models import (\n    DatasetModel,\n    ElasticSearchIndexModel,\n    EmbeddingSpaceModel,\n    ExperimentModel,\n    Query,\n)\n\ndatasets = Query(DatasetModel).get()\nexperiments = Query(ExperimentModel).get()\nspaces = Query(EmbeddingSpaceModel).get()\nindices = Query(ElasticSearchIndexModel).get()\n```\n\nYou can also get a specific dataset, experiment, space, or index by specifying an attribute:\n\n```python\ndataset = Query(DatasetModel).get(name=DATASET_NAME)[0]\n```\n\n## Delete\n\nDelete old datasets and their indices:\n\n```python\nfrom vectory.db.models import  DatasetModel, Query\n\ndataset = Query(DatasetModel).get(name=DATASET_NAME)[0]\ndataset.delete_instance(recursive=True)\n```\n\nSetting the `recursive` option to `True` deletes the experiments, spaces, and indices associated with the dataset.\n\nThe same can be done for experiments, embedding spaces and indices by using the `delete_instance` method on the correct object.\n\n## Compare\n\nWith Vectory you can measure how similar two embedding spaces are. The similarity between two embedding spaces is the mean of the local neighborhood similarity of every point, which is the IoU of the ten nearest neighbors. \n\nCompare two embedding spaces:\n\n```python\nfrom vectory.spaces import compare_embedding_spaces\n\nsimilarity, _, fig, _ = compare_embedding_spaces(\n    embedding_space_a=EMBEDDING_SPACE_NAME_1,\n    embedding_space_b=EMBEDDING_SPACE_NAME_2,\n    metric_a=METRIC_A,\n    metric_b=METRIC_B,\n    allow_precompute_knn=True,\n)\n```\n\nThe `metric_a` and `metric_b` parameters are either `euclidean` or `cosine`. 
The `allow_precompute_knn` parameter is set to `True` to allow precomputing the bulk operations for the similarity computation.\n\nThe `spaces_similarity` variable contains the similarity between the two embedding spaces. The `id_similarity_dict` variable has the similarity scores for every point in the embedding spaces.\n\nSetting the `histogram` parameter to `True` in the `compare_embedding_spaces` function will show a histogram of the similarity scores. The `fig` and `ax` variables are the figure and axis of the histogram.\n\n## Reduce dimensionality\n\nReduce the dimensionality to 2D of an embedding space:\n\n```python\nfrom vectory.visualization.utils import calculate_points, get_index\n\n# Get the embedding space data\nembeddings, rows, index = get_index(\n    EMBEDDING_SPACE_NAME, model=MODEL, similarity=SIMILARITY_METHOD\n)\n\n# Reduce the dimensionality\ndf = calculate_points(DIMENSIONAL_REDUCTION_MODEL, embeddings, rows)\n```\n\nThe `calculate_points` function reduces the dimensionality of the embeddings using the `DIMENSIONAL_REDUCTION_MODEL` model. It can be either `UMAP`, `PCA`, or `PCA + UMAP`. It returns a DataFrame with the reduced dimensionality points and the data contained in the dataset's CSV file.\n\n## Get similar indices\n\nGet the most similar indices for a given embedding:\n\n```python\nfrom vectory.indices import match_query\n\n# Get the most similar indices for a sample embedding\nsimilarity_results, _ = match_query(indices_name=[INDEX_NAME], query_id=EMBEDDING_INDEX)\n```\n\nThe `match_query` function returns the most similar indices for a given embedding and the index of the embedding. The `indices_name` parameter is a list of index names, and the `query_id` parameter is the id of the embedding to search. You can get the most similar indices and their scores from these results. 
The `similarity_results` variable contains a dictionary with the indices' names as keys and a list of tuples with the most similar indices and their scores as values.\n\n# Visualization\n\nOnce you have loaded your datasets, experiments, and embedding spaces, you can analyze the results by visualizing them on our Streamlit app or by following the Python API documentation and getting the indices.\n\n## Streamlit\n\nVisualize your embedding spaces on a local Streamlit app:\n\n```console\nvectory run\n```\n\nThe GUI dependencies are required to view the Streamlit app.\n\n# License\n\nThis project is licensed under the terms of the MIT license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpentoai%2Fvectory","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpentoai%2Fvectory","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpentoai%2Fvectory/lists"}