{"id":16758441,"url":"https://github.com/kessler/node-embedding","last_synced_at":"2025-04-10T17:13:37.637Z","repository":{"id":186422405,"uuid":"675140950","full_name":"kessler/node-embedding","owner":"kessler","description":null,"archived":false,"fork":false,"pushed_at":"2023-10-25T11:37:10.000Z","size":189,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-04T03:31:52.263Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kessler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-05T23:23:49.000Z","updated_at":"2023-11-26T03:28:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"c0497da0-c025-4248-a971-1f3ace6281a6","html_url":"https://github.com/kessler/node-embedding","commit_stats":null,"previous_names":["kessler/node-embedding"],"tags_count":8,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kessler%2Fnode-embedding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kessler%2Fnode-embedding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kessler%2Fnode-embedding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kessler%2Fnode-embedding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kessler","download_url":"https://codeload.github.com/kessler/node-embedd
ing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247785943,"owners_count":20995644,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-13T04:05:19.391Z","updated_at":"2025-04-10T17:13:37.597Z","avatar_url":"https://github.com/kessler.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# @kessler/embedding (WIP)\n\nThis module is designed to let you progress from a simple JSON-based embedding database to more advanced solutions such as Chroma or Redis.\n\n## quick start\n\n```js\nimport { loadProviders, Collection } from '@kessler/embedding'\n\nasync function main() {\n  const { storage, embedders } = await loadProviders({ \n    fs: { directory: '/some/directory' },\n    openai: { apiKey: 'openai key here' } \n  })\n\n  const { fs } = storage\n  const { openai } = embedders\n\n  await fs.init()\n  await openai.init()\n\n  const collection = new Collection('test', openai, fs)\n  \n  await collection.add('hello world', { created: Date.now() })\n  console.log(await collection.query('hello'))\n\n  await fs.shutdown()\n  await openai.shutdown()\n}\n\nmain()\n```\n\n## Collections\n\nCollections are the highest abstraction layer. 
They group together documents, their embedding data and some optional metadata.\n\n```js\nclass Collection {\n  constructor(name, embeddingService, storage) {}\n  async query(text, { maxResults = Infinity, threshold = 0.8 } = {}) {}\n  async add(text, metadata) {}\n  async delete(id) {}\n  async get(id) {}\n}\n```\n\n## Providers\n\nThere are two categories of providers: `embedding` and `storage`. Embedding providers expose embedding services through a unified interface, and storage providers do the same for storing and querying documents.\n\nProviders can be created manually by importing and instantiating their classes, or they can be loaded through ```loadProviders``` (see below).\n\nOnce a provider is loaded you should call its ```init``` method, regardless of whether you loaded it manually or through ```loadProviders```. (_TODO: I might want to change this behavior_)\n\n### embedding provider\n\n```js\nclass Embedder {\n  constructor(underlyingProvider, config) {}\n  async exec(text, metadata) {}\n  async init() {}\n  async shutdown() {}\n}\n```\n\n_TODO: once a document is embedded with one service and stored, the embedding provider cannot be changed if the new provider uses a different embedding scheme. 
This must be addressed somehow in the design._\n\n### storage provider\n\n```js\nclass MyStorage {\n  constructor(underlyingProvider, config) {}\n  async query(collectionName, embedding, { maxResults, threshold }) {}\n  async add(collectionName, content, embedding, metadata) {}\n  async delete(collectionName, id) {}\n  async get(collectionName, id) {}\n  async init() {}\n  async shutdown() {}\n  async collections() {}\n}\n```\n\n### loading automatically\n\nThe intent of ```loadProviders``` is to load and instantiate every provider that can be loaded, that is, every provider whose peer dependencies are installed.\n\n```js\nimport { loadProviders } from '@kessler/embedding'\n\nasync function main() {\n  const { storage, embedders } = await loadProviders({ /* ...providers config */ })\n  const { pg } = storage\n  const { openai } = embedders\n\n  await pg.init()\n  await openai.init()\n}\n\nmain()\n```\n\n### loading manually\n\nTBD\n\n### embedding providers\n\n#### openai embedder\n\nCurrently the only supported embedding service.\n\nRun `npm install openai`\n\n```js\nimport { loadProviders } from '@kessler/embedding'\n\nasync function main() {\n  const { embedders, storage } = await loadProviders({ \n    openai: { apiKey: 'your-api-key' } \n  })\n\n  const { openai } = embedders\n  await openai.init()\n\n  // do stuff\n  await openai.shutdown()\n}\n\nmain()\n```\n\n### storage providers\n\n#### File System storage\nThe simplest, non-optimized solution: collections are saved on the file system in JSON files.\n\nEach query is matched by scanning all the existing documents, so it does not scale well.\n\n_I have plans to implement a better algorithm in the future._\n\n\n```js\nimport { loadProviders } from '@kessler/embedding'\n\nasync function main() {\n  const { embedders, storage } = await loadProviders({ \n    fs: { directory: '/some/path/to/embedding-db' },\n  })\n\n  const { fs } = storage\n  await fs.init()\n\n  // do stuff\n  await fs.shutdown()\n}\n\nmain()\n```\n\n#### PostgreSQL 
storage\n\nUses a PostgreSQL database with the [pgvector](https://github.com/pgvector/pgvector) extension installed.\n\nRun ```npm install pg pgvector``` _(mind the peer dependency versions)_\n\n```js\nimport { loadProviders } from '@kessler/embedding'\n\nasync function main() {\n\n  const { embedders, storage } = await loadProviders({ \n    // these values have defaults: database \"embedding\", localhost, root and no password\n    pg: {\n      databaseConfig: {\n        database: 'embedding',\n        user: 'root',\n        password: 'shhhhhhhhhhh'\n      }\n    }\n  })\n  \n  const { pg } = storage\n  await pg.init()\n  \n  // do stuff\n  await pg.shutdown()\n}\n\nmain()\n```\n\n#### Redis storage\n\nTBD\n\n#### Chroma storage\n\nTBD\n\n## resources\n- https://supabase.com/blog/openai-embeddings-postgres-vector\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkessler%2Fnode-embedding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkessler%2Fnode-embedding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkessler%2Fnode-embedding/lists"}