{"id":13998506,"url":"https://github.com/beerose/semantic-search","last_synced_at":"2025-04-12T05:43:33.653Z","repository":{"id":65908810,"uuid":"602089211","full_name":"beerose/semantic-search","owner":"beerose","description":"🕵️‍♀️ An OpenAI-powered CLI to build a semantic search index from your MDX files.","archived":false,"fork":false,"pushed_at":"2023-02-17T20:05:28.000Z","size":261,"stargazers_count":92,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-12T05:43:18.098Z","etag":null,"topics":["blog","cli","openai","search","semantic","typescript"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/beerose.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-02-15T13:27:38.000Z","updated_at":"2025-02-23T20:07:02.000Z","dependencies_parsed_at":"2023-02-25T21:30:50.412Z","dependency_job_id":null,"html_url":"https://github.com/beerose/semantic-search","commit_stats":{"total_commits":30,"total_committers":1,"mean_commits":30.0,"dds":0.0,"last_synced_commit":"2d1a7bf14cd5b49e2fedbb2323c4f3ff76df5944"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beerose%2Fsemantic-search","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beerose%2Fsemantic-search/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/beerose%2Fsemantic-search/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/beerose%2Fsemantic-search/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/beerose","download_url":"https://codeload.github.com/beerose/semantic-search/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248525156,"owners_count":21118616,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blog","cli","openai","search","semantic","typescript"],"created_at":"2024-08-09T19:01:43.769Z","updated_at":"2025-04-12T05:43:33.611Z","avatar_url":"https://github.com/beerose.png","language":"TypeScript","funding_links":[],"categories":["TypeScript"],"sub_categories":[],"readme":"# @beerose/semantic-search\n\nAn OpenAI-powered CLI to build a semantic search index from your MDX files. It\nallows you to perform complex searches across your content and integrate it with\nyour platform.\n\n## 🧳 Prerequisites\n\nThis project uses [OpenAI](https://openai.com/api) to generate vector embeddings\nand [Pinecone](https://pinecone.io/) to host the embeddings, which means you\nneed to have accounts in OpenAI and Pinecone to use it.\n\n\u003cdetails\u003e\n\u003csummary\u003eSetting up a Pinecone project\u003c/summary\u003e\n\nAfter creating an account in Pinecone, go to the dashboard and click on the\n`Create Index` button:\n\n![CleanShot 2023-02-17 at 16 10 32@2x](https://user-images.githubusercontent.com/9019397/219693945-6d656f53-6dc2-4010-8ee8-f9d3e69913a1.png)\n\nFill the form with your new index name (e.g. 
your blog name) and set the number\nof dimensions to 1536:\n\n![CleanShot 2023-02-17 at 16 11 54@2x](https://user-images.githubusercontent.com/9019397/219693863-ccaa2105-db44-4838-b94b-40689945c8f2.png)\n\n\u003c/details\u003e\n\n## 🚀 CLI Usage\n\n\u003cdetails\u003e\n\u003csummary\u003eHow to get your env keys from Pinecone and OpenAI?\u003c/summary\u003e\n\n**Pinecone**\n\n![CleanShot 2023-02-17 at 16 15 32@2x](https://user-images.githubusercontent.com/9019397/219693780-bee0e02b-3961-4a92-b505-8076ef67295e.png)\n![CleanShot 2023-02-17 at 16 13 22@2x](https://user-images.githubusercontent.com/9019397/219693831-794c88ce-a763-4415-84f6-08b00c0aab0e.png)\n\n**OpenAI**\n\n![CleanShot 2023-02-17 at 16 18 00@2x](https://user-images.githubusercontent.com/9019397/219693739-3c5e0b31-425b-4cef-8aa9-066dd24d9ab2.png)\n\n\u003c/details\u003e\n\nThe CLI requires four env keys:\n\n```sh\nOPENAI_API_KEY=\n\nPINECONE_API_KEY=\nPINECONE_BASE_URL=\nPINECONE_NAMESPACE=\n```\n\nMake sure to set them before using the CLI!\n\n### 🛠 Commands:\n\n`index \u003cdir\u003e` — processes your content files and uploads them to Pinecone.\n\nExample:\n\n```sh\n$ @beerose/semantic-search index ./posts\n```\n\n`search \u003cquery\u003e` — performs a semantic search for a given query.\n\nExample:\n\n```sh\n$ @beerose/semantic-search search \"hello world\"\n```\n\nFor more info, run any command with the `--help` flag:\n\n```sh\n$ @beerose/semantic-search index --help\n$ @beerose/semantic-search search --help\n$ @beerose/semantic-search --help\n```\n\n## ➕ Project integration\n\nYou can use the `semanticQuery` function exported from this library and\nintegrate it with your website or application.\n\nInstall deps:\n\n```sh\n$ pnpm add pinecone-client openai @beerose/semantic-search\n\n# or `yarn add` or `npm i`\n```\n\nExample usage:\n\n```ts\nimport { PineconeMetadata, semanticQuery } from \"@beerose/semantic-search\";\nimport { Configuration, OpenAIApi } from \"openai\";\nimport { PineconeClient } 
from \"pinecone-client\";\n\nconst openai = new OpenAIApi(\n  new Configuration({\n    apiKey: process.env.OPENAI_API_KEY,\n  })\n);\n\nconst pinecone = new PineconeClient\u003cPineconeMetadata\u003e({\n  apiKey: process.env.PINECONE_API_KEY,\n  baseUrl: process.env.PINECONE_BASE_URL,\n  namespace: process.env.PINECONE_NAMESPACE,\n});\n\nconst result = await semanticQuery(\"hello world\", openai, pinecone);\n```\n\nHere's an example API route from [aleksandra.codes](https://aleksandra.codes):\nhttps://github.com/beerose/aleksandra.codes/blob/main/api/search.ts\n\n## ✨ How does it work?\n\nSemantic search can understand the meaning of words in documents and return\nresults that are more relevant to the user's intent.\n\nThis tool uses [OpenAI](https://openai.com/) to generate vector embeddings with\nthe `text-embedding-ada-002` model.\n\n\u003e Embeddings are numerical representations of concepts converted to number\n\u003e sequences, which make it easy for computers to understand the relationships\n\u003e between those concepts.\n\u003e https://openai.com/blog/new-and-improved-embedding-model/\n\nIt also uses [Pinecone](https://pinecone.io/) — a hosted database for vector\nsearch. It lets us perform k-NN searches across the generated embeddings.\n\n### Processing MDX content\n\nThe `@beerose/semantic-search index` CLI command performs the following steps for\neach file in a given directory:\n\n1.  Converts the MDX file to raw text.\n2.  Extracts the title.\n3.  Splits the file into chunks of a maximum of 100 tokens.\n4.  Generates OpenAI embeddings for each chunk.\n5.  Upserts the embeddings to Pinecone.\n\nDepending on your content, the whole process requires many calls to OpenAI\nand Pinecone, which can take some time. 
For example, it takes around thirty\nminutes for a directory of ~25 blog posts with an average reading time of\n6 minutes.\n\n### Performing semantic searches\n\nTo test the semantic search, you can use the `@beerose/semantic-search search` CLI\ncommand, which:\n\n1. Creates an embedding for a provided query.\n2. Sends a request to Pinecone with the embedding.\n\n## 🍿 Demo\n\n![](https://user-images.githubusercontent.com/9019397/219777236-d9c4cbb6-b408-40ca-be22-cd01eefa4e53.gif)\n\n## 📦 What's inside?\n\n```sh\n.\n├── bin\n│   └── cli.js\n├── src\n│   ├── bin\n│   │   └── cli.ts\n│   ├── commands\n│   │   ├── indexFiles.ts\n│   │   └── search.ts\n│   ├── getEmbeddings.ts\n│   ├── isRateLimitExceeded.ts\n│   ├── mdxToPlainText.test.ts\n│   ├── mdxToPlainText.ts\n│   ├── semanticQuery.ts\n│   ├── splitIntoChunks.test.ts\n│   ├── splitIntoChunks.ts\n│   ├── titleCase.ts\n│   └── types.ts\n├── tsconfig.build.json\n├── tsconfig.json\n├── package.json\n└── pnpm-lock.yaml\n```\n\n- `bin/cli.js` — The CLI entrypoint.\n- `src`:\n  - `bin/cli.ts` — The file that defines the CLI commands and settings. This\n    project uses [CAC](https://github.com/cacjs/cac) for building CLIs.\n  - `commands/indexFiles.ts` — A CLI command that handles processing md/mdx\n    content, generating embeddings and uploading vectors to Pinecone.\n  - `commands/search.ts` — A semantic search command. It generates an embedding\n    for a given search query and then calls Pinecone for the results.\n  - `getEmbeddings.ts` — Embedding generation logic. It handles the call to\n    OpenAI.\n  - `isRateLimitExceeded.ts` — Error handling helper.\n  - `mdxToPlainText.ts` — Converts MDX files to raw text. Uses remark and a\n    custom `remarkMdxToPlainText` plugin (also defined in that file).\n  - `semanticQuery.ts` — Core logic for performing semantic searches. 
It's\n    used in the `search` command and is also exported from this library so that\n    you can integrate it with your projects.\n  - `splitIntoChunks.ts` — Splits the text into chunks with a maximum of 100\n    tokens.\n  - `titleCase.ts` — Extracts a title from a file path.\n  - `types.ts` — Types and utilities used in this project.\n- `tsconfig.json` - TypeScript compiler configuration.\n- `tsconfig.build.json` - TypeScript compiler configuration used for\n  `pnpm build`.\n\nTests:\n\n- `src/mdxToPlainText.test.ts`\n- `src/splitIntoChunks.test.ts`\n\n## 👩‍💻 Local development\n\nInstall deps and build the project:\n\n```sh\npnpm i\n\npnpm build\n```\n\nRun the CLI locally:\n\n```sh\nnode bin/cli.js\n```\n\n## 🧪 Running tests\n\n```sh\npnpm test\n```\n\n## 🤝 Contributing\n\nContributions, issues and feature requests are welcome.\u003cbr /\u003e Feel free to check the\n[issues page](https://github.com/beerose/semantic-search/issues) if you want to\ncontribute.\u003cbr /\u003e\n\n## 📝 License\n\nCopyright © 2023 [Aleksandra Sikora](https://github.com/beerose).\u003cbr /\u003e This\nproject is [MIT](https://github.com/beerose/semantic-search/blob/master/LICENSE)\nlicensed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeerose%2Fsemantic-search","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbeerose%2Fsemantic-search","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbeerose%2Fsemantic-search/lists"}