{"id":25314918,"url":"https://github.com/lanterndata/lantern_extras","last_synced_at":"2025-10-07T13:04:39.207Z","repository":{"id":181399100,"uuid":"666697994","full_name":"lanterndata/lantern_extras","owner":"lanterndata","description":"Routines for generating, manipulating, parsing, importing vector embeddings into Postgres tables","archived":false,"fork":false,"pushed_at":"2024-10-18T13:44:54.000Z","size":694,"stargazers_count":19,"open_issues_count":4,"forks_count":4,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-02T18:41:34.550Z","etag":null,"topics":["ai","database","image-processing","knn","machine-learning","open-source","postgres","postgresql","rust","vector","ycombinator"],"latest_commit_sha":null,"homepage":"https://lantern.dev","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lanterndata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-15T09:21:37.000Z","updated_at":"2024-11-27T14:09:18.000Z","dependencies_parsed_at":"2023-10-10T20:21:26.207Z","dependency_job_id":"b61c329f-a4f1-40b4-8f7e-8011a8c5b81a","html_url":"https://github.com/lanterndata/lantern_extras","commit_stats":null,"previous_names":["lanterndata/lanterndb_extras","lanterndata/lantern_extras"],"tags_count":24,"template":false,"template_full_name":null,"purl":"pkg:github/lanterndata/lantern_extras","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanterndata%2Flantern_extras","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanterndata%2Flantern_extras/tags"
,"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanterndata%2Flantern_extras/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanterndata%2Flantern_extras/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lanterndata","download_url":"https://codeload.github.com/lanterndata/lantern_extras/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lanterndata%2Flantern_extras/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261986813,"owners_count":23240716,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","database","image-processing","knn","machine-learning","open-source","postgres","postgresql","rust","vector","ycombinator"],"created_at":"2025-02-13T17:39:05.942Z","updated_at":"2025-10-07T13:04:34.158Z","avatar_url":"https://github.com/lanterndata.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Lantern Extras\n\n[![build](https://github.com/lanterndata/lantern_extras/actions/workflows/build.yaml/badge.svg?branch=main)](https://github.com/lanterndata/lantern_extras/actions/workflows/build.yaml)\n[![test](https://github.com/lanterndata/lantern_extras/actions/workflows/test.yaml/badge.svg?branch=main)](https://github.com/lanterndata/lantern_extras/actions/workflows/test.yaml)\n[![codecov](https://codecov.io/github/lanterndata/lantern_extras/branch/main/graph/badge.svg)](https://codecov.io/github/lanterndata/lantern_extras)\n\nThis extension makes it easy to 
experiment with embeddings from inside a Postgres database. We use this extension along with [Lantern](https://github.com/lanterndata/lantern) to make vector operations performant. But all the helpers here are standalone and may be used without the main database.\n\n**NOTE**: Functions defined in this extension use Postgres in ways Postgres is usually not used.\nSome calls may result in large file downloads or CPU-intensive model inference operations. Keep this in mind when using this extension in a shared Postgres environment.\n\n## Features\n\n- Streaming download of vector embeddings in archived and uncompressed formats\n- Streaming download of various standard vector benchmark datasets\n  - SIFT\n  - GIST\n- Generation of various embeddings for data stored in Postgres tables without leaving the database\n\n## Examples\n\n```sql\n-- parse the first 41 vectors from the uncompressed .fvecs vector dataset on server machine\nSELECT parse_fvecs('/tmp/rustftp/siftsmall/siftsmall_base.fvecs', 41);\n\n-- load the first 10k vectors from the uncompressed vector dataset into a table named sift\nSELECT * INTO sift FROM parse_fvecs('/tmp/rustftp/siftsmall/siftsmall_base.fvecs', 10000);\n\n-- load SIFT dataset ground truth vectors into a table from an online ftp archive\nSELECT query,\n       true_nearest INTO sift_ground\nFROM get_sift_groundtruth('ftp://host/path/to/siftsmall.tar.gz');\n\n-- generate CLIP embeddings for columns of a postgres table\nSELECT abstract,\n       introduction,\n       figure1,\n       clip_text(abstract) AS abstract_ai,\n       clip_text(introduction) AS introduction_ai,\n       clip_image(figure1) AS figure1_ai\nINTO papers_augmented\nFROM papers;\n\n```\n\nGenerate embeddings from other models (the set of models can be extended):\n\n```sql\n-- generate text embedding\nSELECT text_embedding('BAAI/bge-base-en', 'My text input');\n-- generate image embedding with image url\nSELECT image_embedding('clip/ViT-B-32-visual', 'https://link-to-your-image');\n-- 
generate image embedding with image path (this path should be accessible from postgres server)\nSELECT image_embedding('clip/ViT-B-32-visual', '/path/to/image/in-postgres-server');\n-- get available list of models\nSELECT get_available_models();\n```\n\n## Getting started\n\n### Installing from precompiled binaries\n\nYou can download precompiled binaries for Mac and Linux from the GitHub releases page.\nMake sure Postgres is installed in your environment and `pg_config` is accessible from `$PATH`. Unzip the release archive and, from the `lantern_extras` directory, run:\n\n```bash\nmake install\n```\n\n### Building from source\n\n\u003cdetails\u003e\n\u003csummary\u003e Click to expand\u003c/summary\u003e\n\nYou need onnxruntime on your system in order to run the extension.\nYou can download the `onnxruntime` binary release from GitHub https://github.com/microsoft/onnxruntime/releases/tag/v1.16.1 and place it somewhere on your system (e.g. /usr/lib/onnxruntime)\n\nThen export these 2 environment variables, adjusting the library path to wherever you placed onnxruntime:\n\n```bash\nexport ORT_STRATEGY=system\nexport ORT_DYLIB_PATH=/usr/local/lib/onnxruntime/lib/libonnxruntime.so\n```\n\nOn some systems you will need to specify the `dlopen` search path so that the extension can load `ort` inside Postgres.\n\nTo do that, create a file `/etc/ld.so.conf.d/onnx.conf` with content `/usr/local/lib/onnxruntime/lib` and run `ldconfig`\n\nThis extension is written in Rust, so it requires the Rust toolchain. Make sure the Rust toolchain is installed before continuing.\nThe extension also uses `pgrx`. 
If pgrx is not already installed, use the following commands to install it:\n\n```\n#install pgrx prerequisites\nsudo apt install pkg-config libssl-dev zlib1g-dev libreadline-dev\nsudo apt-get install clang\n\n#install pgrx itself\ncargo install --locked cargo-pgrx --version 0.11.3\ncargo pgrx init\n```\n\nThen, you can run the extension under development with the following command:\n\n```bash\ncargo pgrx run --package lantern_extras # runs in a testing environment\n```\n\nTo package the extension, run:\n\n```bash\ncargo pgrx package --package lantern_extras\n```\n\n\u003c/details\u003e\n\n### Initializing with psql\n\nOnce the extension is installed, in a psql shell or in your favorite SQL environment run:\n\n```sql\nCREATE EXTENSION lantern_extras;\n```\n\n### Adding new models\n\nTo add new textual or visual models for generating vector embeddings, follow these steps:\n\n1. Find the model onnx file or convert it using [optimum-cli](https://huggingface.co/docs/transformers/serialization). Example `optimum-cli export onnx --model BAAI/bge-base-en onnx/`\n2. Host the onnx model\n3. Add model information in `MODEL_INFO_MAP` under `lantern_extras/src/encoder.rs`\n4. Add a new image/text processor based on the model inputs (check the existing processors first, since one of them might already match the model), then add the `match` arm in the `process_text` or `process_image` function in `EncoderService` so it runs the corresponding processor for the model.\n\nAfter this your model should be callable from SQL like\n\n```sql\nSELECT text_embedding('your/model_name', 'Your text');\n```\n\n## Lantern Index Builder\n\n## Description\n\nThis is a CLI application that creates an index for Lantern outside of Postgres which can later be imported into Postgres. 
This allows for faster index creation through parallelization.\n\n## How to use\n\n### Installation\n\nRun `cargo install --path lantern_cli` to install the binary\n\n### Usage\n\nRun `lantern-cli create-index --help` to show the cli options.\n\n```bash\nUsage: lantern-cli create-index --uri \u003cURI\u003e --table \u003cTABLE\u003e --column \u003cCOLUMN\u003e -m \u003cM\u003e --efc \u003cEFC\u003e --ef \u003cEF\u003e -d \u003cDIMS\u003e --metric-kind \u003cMETRIC_KIND\u003e --out \u003cOUT\u003e --import\n```\n\n### Example\n\n```bash\nlantern-cli create-index -u \"postgresql://localhost/test\" -t \"small_world\" -c \"vec\" -m 16 --ef 64 --efc 128 -d 3 --metric-kind cos --out /tmp/index.usearch --import\n```\n\n### Notes\n\nThe index should be created from the same database on which it will be loaded, so row tids will match later.\n\n## Lantern Embeddings\n\n## Description\n\nThis is a CLI application that generates vector embeddings from your postgres data.\n\n## How to use\n\n### Installation\n\nRun `cargo install --path lantern_cli` to install the binary if you have cloned the source code, or `cargo install --git https://github.com/lanterndata/lantern_extras.git` to install from git.\n\nOr build and use the Docker image:\n\n```bash\n# Run with CPU version\ndocker run -v models-volume:/models --rm --network host lanterndata/lantern-cli create-embeddings --model 'BAAI/bge-large-en' --uri 'postgresql://postgres@host.docker.internal:5432/postgres' --table \"wiki\" --column \"content\" --out-column \"content_embedding\" --batch-size 40 --data-path /models\n\n# Run with GPU version\nnvidia-docker run -v models-volume:/models --rm --network host lanterndata/lantern-cli:gpu create-embeddings  --model 'BAAI/bge-large-en' --uri 'postgresql://postgres@host.docker.internal:5432/postgres' --table \"wiki\" --column \"content\" --out-column \"content_embedding\" --batch-size 40 --data-path /models\n```\n\n\u003e 
[nvidia-container-runtime](https://developer.nvidia.com/nvidia-container-runtime) is required for the GPU version to work. You can check the GPU load using the `nvtop` command (`apt install nvtop`)\n\n### Usage\n\nRun `lantern-cli create-embeddings --help` to show the cli options.\nRun `lantern-cli show-models` to show available models.\n\n### Text Embedding Example\n\n1. Create table with text data\n\n```sql\nCREATE TABLE articles (id SERIAL, description TEXT, embedding REAL[]);\nINSERT INTO articles SELECT generate_series(0,999), 'My description column!';\n```\n\n\u003e Currently the table is required to have an `id` column so that each embedding can be mapped to its row when exporting output.\n\n2. Run embedding generation\n\n```bash\nlantern-cli create-embeddings  --model 'clip/ViT-B-32-textual'  --uri 'postgresql://postgres:postgres@localhost:5432/test' --table \"articles\" --column \"description\" --out-column \"embedding\" --schema \"public\"\n```\n\n\u003e The output database, table and column names can be specified via `--out-table`, `--out-uri`, `--out-column` arguments. Check `help` for more info.\n\nor you can export to csv file\n\n```bash\nlantern-cli create-embeddings  --model 'clip/ViT-B-32-textual'  --uri 'postgresql://postgres:postgres@localhost:5432/test' --table \"articles\" --column \"description\" --out-column embedding --out-csv \"embeddings.csv\" --schema \"public\"\n```\n\n### Image Embedding Example\n\n1. Create table with image uris data\n\n```sql\nCREATE TABLE images (id SERIAL, url TEXT, embedding REAL[]);\nINSERT INTO images (url) VALUES ('https://cdn.pixabay.com/photo/2014/11/30/14/11/cat-551554_1280.jpg'), ('https://cdn.pixabay.com/photo/2016/12/13/05/15/puppy-1903313_1280.jpg');\n```\n\n2. 
Run embedding generation\n\n```bash\nlantern-cli create-embeddings  --model 'clip/ViT-B-32-visual'  --uri 'postgresql://postgres:postgres@localhost:5432/test' --table \"images\" --column \"url\" --out-column \"embedding\" --schema \"public\" --visual\n```\n\n### OpenAI and Cohere Embeddings\n\nLantern CLI also supports generating OpenAI and Cohere embeddings via API. For that you should specify `--runtime` and `--runtime-params` arguments\n\n```bash\n# OpenAI\nlantern-cli create-embeddings  --model 'openai/text-embedding-ada-002' --uri 'postgresql://postgres:postgres@localhost:5432/test' --table \"images\" --column \"url\" --out-column \"embedding\" --schema \"public\" --runtime openai --runtime-params '{ \"api_token\": \"sk-xxx-xxxx\" }'\n\n# Cohere\nlantern-cli create-embeddings  --model 'openai/text-embedding-ada-002' --uri 'postgresql://postgres:postgres@localhost:5432/test' --table \"images\" --column \"url\" --out-column \"embedding\" --schema \"public\" --runtime cohere --runtime-params '{ \"api_token\": \"xxx-xxxx\" }'\n```\n\n\u003e To get available runtimes use `lantern-cli show-runtimes`\n\n### Index Autotune\n\nLantern CLI supports autotuning HNSW index parameters. 
To use the functionality run\n\n```bash\nlantern-cli autotune-index -u 'postgresql://postgres:postgres@localhost:5432/test' -t \"sift1m\" -c \"v\" --metric-kind l2sq --test-data-size 10000 --k 20\n```\n\nTo get the full list of arguments use `lantern-cli autotune-index -h`\n\n### Daemon Mode\n\nLantern CLI can be used in daemon mode to continuously listen to Postgres tables and run embedding generation, external index, or autotune jobs.\n\n```bash\n lantern-cli start-daemon --uri 'postgres://postgres@localhost:5432/postgres' --embedding-table embedding_jobs --autotune-table index_autotune_jobs --autotune-results-table index_parameter_experiment_results --external-index-table external_index_jobs --schema public --log-level debug\n```\n\nThis will set up a trigger on the specified table (`lantern_jobs`), and when a new row is inserted it will start embedding generation based on the row data.\nAfter that, triggers will be set up on the target table, so embeddings will be generated continuously for that table.\nThe jobs tables should have the following structure:\n\n```sql\n-- Embedding Jobs Table should have the following structure:\nCREATE TABLE \"public\".\"embedding_jobs\" (\n    \"id\" SERIAL PRIMARY KEY,\n    \"database_id\" text NOT NULL,\n    \"db_connection\" text NOT NULL,\n    \"schema\" text NOT NULL,\n    \"table\" text NOT NULL,\n    \"runtime\" text NOT NULL,\n    \"runtime_params\" jsonb,\n    \"src_column\" text NOT NULL,\n    \"dst_column\" text NOT NULL,\n    \"embedding_model\" text NOT NULL,\n    \"created_at\" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n    \"updated_at\" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n    \"canceled_at\" timestamp,\n    \"init_started_at\" timestamp,\n    \"init_finished_at\" timestamp,\n    \"init_failed_at\" timestamp,\n    \"init_failure_reason\" text,\n    \"init_progress\" int2 DEFAULT 0\n);\n-- External Index Jobs Table should have the following structure:\nCREATE TABLE \"public\".\"external_index_jobs\" (\n    \"id\" 
SERIAL PRIMARY KEY,\n    \"database_id\" text NOT NULL,\n    \"db_connection\" text NOT NULL,\n    \"schema\" text NOT NULL,\n    \"table\" text NOT NULL,\n    \"column\" text NOT NULL,\n    \"index\" text,\n    \"operator\" text NOT NULL,\n    \"efc\" INT NOT NULL,\n    \"ef\" INT NOT NULL,\n    \"m\" INT NOT NULL,\n    \"created_at\" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n    \"updated_at\" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n    \"canceled_at\" timestamp,\n    \"started_at\" timestamp,\n    \"finished_at\" timestamp,\n    \"failed_at\" timestamp,\n    \"failure_reason\" text,\n    \"progress\" INT2 DEFAULT 0\n);\n-- Autotune Jobs Table should have the following structure:\nCREATE TABLE \"public\".\"index_autotune_jobs\" (\n    \"id\" SERIAL PRIMARY KEY,\n    \"database_id\" text NOT NULL,\n    \"db_connection\" text NOT NULL,\n    \"schema\" text NOT NULL,\n    \"table\" text NOT NULL,\n    \"column\" text NOT NULL,\n    \"operator\" text NOT NULL,\n    \"target_recall\" DOUBLE PRECISION NOT NULL,\n    \"embedding_model\" text NULL,\n    \"k\" int NOT NULL,\n    \"n\" int NOT NULL,\n    \"create_index\" bool NOT NULL,\n    \"created_at\" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n    \"updated_at\" timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\n    \"canceled_at\" timestamp,\n    \"started_at\" timestamp,\n    \"progress\" INT2 DEFAULT 0,\n    \"finished_at\" timestamp,\n    \"failed_at\" timestamp,\n    \"failure_reason\" text\n);\n\n-- Autotune results table should have the following structure:\nCREATE TABLE \"public\".\"index_parameter_experiment_results\" (\n     id SERIAL PRIMARY KEY,\n     experiment_id INT NOT NULL, -- reference to job.id\n     ef INT NOT NULL,\n     efc INT  NOT NULL,\n     m INT  NOT NULL,\n     recall DOUBLE PRECISION NOT NULL,\n     latency DOUBLE PRECISION NOT NULL,\n     build_time DOUBLE PRECISION NULL\n);\n```\n\n## Lantern PQ\n\n## Description\n\nUse external product quantization to compress table vectors 
using kmeans clustering.\n\n### Usage\n\nRun `lantern-cli pq-table --help` to show the cli options.\n\nThe job can be run either on a local instance or using GCP batch jobs, which parallelize the workload over hundreds of VMs to speed up clustering.\n\nTo run locally use:\n\n```bash\nlantern-cli pq-table --uri 'postgres://postgres@127.0.0.1:5432/postgres' --table sift10k --column v --clusters 256 --splits 32\n```\n\nThe job will be run on the current machine, utilizing all available cores.\n\nFor big datasets over 1M it is convenient to run the job using GCP batch jobs.  \nMake sure to have GCP credentials set up before running this command:\n\n```bash\nlantern-cli pq-table --uri 'postgres://postgres@127.0.0.1:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --run-on-gcp\n```\n\nIf you prefer to orchestrate the tasks on your own on-premise servers, you need to do the following 3 steps:\n\n1. Run setup job. This will create the necessary tables and add a `pqvec` column on the target table\n\n```bash\nlantern-cli pq-table --uri 'postgres://postgres@127.0.0.1:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --skip-codebook-creation --skip-vector-compression\n```\n\n2. Run clustering job. This will create the codebook for the table and export it to a postgres table\n\n```bash\nlantern-cli pq-table --uri 'postgres://postgres@127.0.0.1:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --skip-table-setup --skip-vector-compression --parallel-task-count 10 --subvector-id 0\n```\n\nIn this case the command should be run 32 times, once for each subvector in range [0-31], and `--parallel-task-count` means at most 10 tasks will run in parallel. This keeps the job from exceeding the max connection limit on postgres.\n\n3. Run compression job. 
This will compress vectors using the generated codebook and export results under the `pqvec` column\n\n```bash\nlantern-cli pq-table --uri 'postgres://postgres@127.0.0.1:5432/postgres' --table sift10k --column v --clusters 256 --splits 32 --skip-table-setup --skip-codebook-creation --parallel-task-count 10 --total-task-count 10 --compression-task-id 0\n```\n\nIn this case the command should be run 10 times, once for each part of the codebook in range [0-9], and `--parallel-task-count` means at most 10 tasks will run in parallel. This keeps the job from exceeding the max connection limit on postgres.\n\nThe table must have a primary key for this job to work. If the primary key is not `id`, provide it using the `--pk` argument\n\n## Lantern Daemon in SQL\nTo enable the daemon add `lantern_extra.so` to `shared_preload_libraries` in `postgresql.conf` file and set the `lantern_extras.enable_daemon` GUC to true. This can be done by executing the following command:\n\n```sql\nALTER SYSTEM SET lantern_extras.enable_daemon = true;\nSELECT pg_reload_conf();\n```\nThe daemon will start, targeting the current connected database or databases specified in the `lantern_extras.daemon_databases` GUC.\n\n**Important Notes**  \nThis is an experimental functionality to enable the Lantern daemon from SQL\n\n### SQL Functions for Embedding Jobs\nThese functions can be used either with an externally managed Lantern Daemon or with a daemon run from SQL.\n\n**Adding an Embedding Job**  \nTo add a new embedding job, use the `add_embedding_job` function:\n\n```sql\nSELECT add_embedding_job(\n    'table_name',        -- Name of the table\n    'src_column',        -- Source column for embeddings\n    'dst_column',        -- Destination column for embeddings\n    'embedding_model',   -- Embedding model to use\n    'runtime',           -- Runtime environment (default: 'ort')\n    'runtime_params',    -- Runtime parameters (default: '{}')\n    'pk',                -- Primary key column (default: 'id')\n    'schema'  
           -- Schema name (default: 'public')\n);\n```\n\n**Getting Embedding Job Status**  \nTo get the status of an embedding job, use the `get_embedding_job_status` function:\n\n```sql\nSELECT * FROM get_embedding_job_status(job_id);\n```\nThis will return a table with the following columns:\n\n- `status`: The current status of the job.\n- `progress`: The progress of the job as a percentage.\n- `error`: Any error message if the job failed.\n\n**Getting All Embedding Jobs**  \nTo get the status of all embedding jobs, use the `get_embedding_jobs` function:\n\n```sql\nSELECT * FROM get_embedding_jobs();\n\n```\nThis will return a table with the following columns:\n\n- `id`: Id of the job\n- `status`: The current status of the job.\n- `progress`: The progress of the job as a percentage.\n- `error`: Any error message if the job failed.\n\n**Canceling an Embedding Job**  \nTo cancel an embedding job, use the `cancel_embedding_job` function:\n\n```sql\nSELECT cancel_embedding_job(job_id);\n```\n\n**Resuming an Embedding Job**  \nTo resume a paused embedding job, use the `resume_embedding_job` function:\n\n```sql\nSELECT resume_embedding_job(job_id);\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanterndata%2Flantern_extras","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flanterndata%2Flantern_extras","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flanterndata%2Flantern_extras/lists"}