# Code search with Qdrant

Developers need a code search tool that helps them find the right piece of code. This README describes how to set up a tool that provides code results, in context.

## Online version

See our code search tool in action at **[https://code-search.qdrant.tech/](https://code-search.qdrant.tech/)**. We've prepopulated the demo with the Qdrant codebase.
You can see the results, in context, even with relatively vague search terms.

## Prerequisites

To run this demo on your own system, install and/or set up the following components:

- [Docker](https://www.docker.com/)
- [Docker Compose](https://docs.docker.com/compose/)
- [Rust](https://www.rust-lang.org/learn/get-started)
- [rust-analyzer](https://rust-analyzer.github.io/)

Docker and Docker Compose setup depends on your operating system; please refer to the official documentation for installation instructions. Both Rust and rust-analyzer can be installed with the following commands:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup component add rust-analyzer
```

## Description

You can set up [Qdrant](https://qdrant.tech) to help developers find the code they need, with context. Using semantic search, developers can find the code samples that help with their day-to-day work, even with:

- Imprecise keywords
- Inexact names for functions, classes, or variables
- Other code snippets

The demo uses the [Qdrant source code](https://github.com/qdrant/qdrant) to build an end-to-end code search application that helps you find the right piece of code, even if you have never contributed to the project. We implemented the full pipeline, including data chunking, indexing, and search. Code search is a very specific task in which the programming language syntax matters as much as the function, class, and variable names, and the docstring describing the what and why. While the latter is more of a traditional natural language processing task, the former requires a specific approach.
Thus, we use the following neural encoders for our use cases:

- `all-MiniLM-L6-v2` - one of the gold-standard models for natural language processing
- `microsoft/unixcoder-base` - a model trained specifically on a code dataset

### Chunking and indexing process

Semantic search works best with _structured_ source code repositories, with good syntax and the best practices defined by the authoring team. Start by dividing the code into chunks. Each chunk should correspond to a specific function, struct, enum, or any other code structure that can be considered as a whole.

There is separate model-specific logic that extracts the most important parts of the code and converts them into a format the neural network can understand. Only then is the encoded representation indexed in the Qdrant collection, along with a JSON structure describing that snippet as a payload.

To that end, we work with the following models. The combination is the "best of both worlds."

#### all-MiniLM-L6-v2

Before the encoding, code is divided into chunks, but contrary to traditional NLP tasks, each chunk contains not only the definition of the function or class but also the context in which it appears. When searching code, it is important to know where the function is defined, in which module, and in which file.
This information is crucial to present the results to the user in a meaningful way.

For example, the `upsert` function from one of Qdrant's modules would be represented as the following structure:

```json
{
    "name": "upsert",
    "signature": "fn upsert (& mut self , id : PointOffsetType , vector : SparseVector)",
    "code_type": "Function",
    "docstring": "= \" Upsert a vector into the inverted index.\"",
    "line": 105,
    "line_from": 104,
    "line_to": 125,
    "context": {
        "module": "inverted_index",
        "file_path": "lib/sparse/src/index/inverted_index/inverted_index_ram.rs",
        "file_name": "inverted_index_ram.rs",
        "struct_name": "InvertedIndexRam",
        "snippet": "    /// Upsert a vector into the inverted index.\n    pub fn upsert(&mut self, id: PointOffsetType, vector: SparseVector) {\n        for (dim_id, weight) in vector.indices.into_iter().zip(vector.values.into_iter()) {\n            let dim_id = dim_id as usize;\n            match self.postings.get_mut(dim_id) {\n                Some(posting) => {\n                    // update existing posting list\n                    let posting_element = PostingElement::new(id, weight);\n                    posting.upsert(posting_element);\n                }\n                None => {\n                    // resize postings vector (fill gaps with empty posting lists)\n                    self.postings.resize_with(dim_id + 1, PostingList::default);\n                    // initialize new posting for dimension\n                    self.postings[dim_id] = PostingList::new_one(id, weight);\n                }\n            }\n        }\n        // given that there are no holes in the internal ids and that we are not deleting from the index\n        // we can just use the id as a proxy for the count\n        self.vector_count = max(self.vector_count, id as usize);\n    }\n"
    }
}
```

> Please note that this project aims to create a search mechanism specifically for the Qdrant source code, which is written in Rust. Thus, we built a small separate [rust-parser project](https://github.com/qdrant/rust-parser) that converts it into the aforementioned JSON objects. It uses [Syn](https://docs.rs/syn/latest/syn/index.html) to read the syntax tree of the codebase. If you want to replicate the project for a different programming language, you will need to build a similar parser for that language. For example, Python has a comparable standard library module, [ast](https://docs.python.org/3/library/ast.html), but there might be some differences in the way the code is parsed, so some adjustments might be required.

Since the `all-MiniLM-L6-v2` model is trained on natural language tasks, it cannot understand code directly. For that reason, **we build a fake, text-like representation of the structure that should be understandable to the model**, or, to be more specific, to its tokenizer. This representation does not contain the actual code but rather its important parts: the function name, its signature, the docstring, and more. All the special, language-specific characters are removed to keep the names and signatures as clean as possible. Only that representation is then passed to the model.

For example, the `upsert` function from the example above would be represented as:

```python
'Function upsert that does: = " Upsert a vector into the inverted index." defined as fn upsert mut self id Point Offset Type vector Sparse Vector  in struct InvertedIndexRam  in module inverted_index  in file inverted_index_ram.rs'
```

In a properly structured codebase, both module and file names carry additional information about the semantics of that piece of code.
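As a rough illustration, a much-simplified version of this text-building step could look like the sketch below. The actual logic lives in the `textify` function of the `code_search.index.textifier` module; the helper names and the exact cleanup rules here are hypothetical.

```python
import re


def split_camel_case(name: str) -> str:
    """Split CamelCase identifiers into words, e.g. 'PointOffsetType' -> 'Point Offset Type'."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)


def textify(chunk: dict) -> str:
    """Build a text-like representation of a parsed code chunk for an NLP model."""
    # Drop language-specific punctuation from the signature, keeping only the names
    signature = re.sub(r"[^\w\s]", " ", chunk["signature"])
    signature = " ".join(split_camel_case(token) for token in signature.split())
    context = chunk["context"]
    return (
        f"{chunk['code_type']} {chunk['name']} that does: {chunk['docstring']} "
        f"defined as {signature} in struct {context['struct_name']} "
        f"in module {context['module']} in file {context['file_name']}"
    )
```

On the example chunk above, this produces a string close to the representation shown earlier, though the real implementation handles many more cases (free functions, enums, missing docstrings, and so on).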
For example, the `upsert` function is defined in the `InvertedIndexRam` struct, which is part of the `inverted_index` module; this tells us it belongs to the in-memory implementation of the inverted index, which is not clear from the function name alone.

> If you want to see how the conversion is implemented, check the `textify` function in the `code_search.index.textifier` module.

#### microsoft/unixcoder-base

This model focuses specifically on the code snippets. We take the definitions along with the corresponding docstrings and pass them to the model. Extracting all the definitions is not a trivial task, but there are various Language Server Protocol (**LSP**) implementations that can help with that, and you should be able to [find one for your programming language](https://microsoft.github.io/language-server-protocol/implementors/servers/). For Rust, we used [rust-analyzer](https://rust-analyzer.github.io/), which can convert the codebase into the [LSIF format](https://microsoft.github.io/language-server-protocol/specifications/lsif/0.4.0/specification/), a universal, JSON-based format for code, regardless of the programming language.

The same `upsert` function from the example above is represented in LSIF as multiple entries that contain only the location of the definition, not the definition itself, so we have to extract it from the source file on our own.

Even though the `microsoft/unixcoder-base` model does not officially support Rust, we found it to work quite well for the task.
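Since the parser records only a one-based `line_from`/`line_to` range for each definition, pulling the snippet back out of the source file is a small step. A minimal sketch (for illustration; the demo's actual extraction code may differ):

```python
def extract_snippet(source: str, line_from: int, line_to: int) -> str:
    """Return the lines of a definition from a file's contents.

    line_from and line_to are 1-based and inclusive, matching the parsed chunk metadata.
    """
    lines = source.splitlines(keepends=True)
    return "".join(lines[line_from - 1:line_to])
```

For the `upsert` example, calling `extract_snippet(file_contents, 104, 125)` on `inverted_index_ram.rs` would yield the snippet shown in the payload above.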
Obtaining the embeddings for the code snippets is straightforward, as we send the code snippet directly to the model:

```rust
/// Upsert a vector into the inverted index.
pub fn upsert(&mut self, id: PointOffsetType, vector: SparseVector) {
    for (dim_id, weight) in vector.indices.into_iter().zip(vector.values.into_iter()) {
        let dim_id = dim_id as usize;
        match self.postings.get_mut(dim_id) {
            Some(posting) => {
                // update existing posting list
                let posting_element = PostingElement::new(id, weight);
                posting.upsert(posting_element);
            }
            None => {
                // resize postings vector (fill gaps with empty posting lists)
                self.postings.resize_with(dim_id + 1, PostingList::default);
                // initialize new posting for dimension
                self.postings[dim_id] = PostingList::new_one(id, weight);
            }
        }
    }
    // given that there are no holes in the internal ids and that we are not deleting from the index
    // we can just use the id as a proxy for the count
    self.vector_count = max(self.vector_count, id as usize);
}
```

Having both encoders helps us build a more robust search mechanism that can handle both natural language and code-specific queries.

### Search process

The search process is straightforward. The user input is passed to both encoders, and the resulting vectors are used to query both Qdrant collections at the same time. The results are then merged, duplicates are removed, and the combined list is returned to the user.

## Architecture

The demo uses the [FastAPI](https://fastapi.tiangolo.com/) framework for the backend and [React](https://reactjs.org/) for the frontend layer.
![Architecture of the code search demo](images/architecture-diagram.png)

The demo consists of the following components:

- [React frontend](/frontend) - a web application that lets the user search over the Qdrant codebase
- [FastAPI backend](/code_search/service.py) - a backend that communicates with Qdrant and exposes a REST API
- [Qdrant](https://qdrant.tech/) - a vector search engine that stores the data and performs the search
- Two neural encoders - one trained on natural language and one for code-specific tasks

There is also an indexing component that has to be run periodically to keep the index up to date. It is also part of the demo, but it is not directly exposed to the user. All the required scripts are documented below, and you can find them in the [`tools`](/tools) directory.

The demo is, as always, open source, so feel free to check the code in this repository to see how it is implemented.

## Usage

Like every other semantic search system, the demo requires a few setup steps. First of all, the data has to be ingested, so we can then use the created index for our queries.

### Data indexing

Qdrant is used as the search engine, so you need to have it running somewhere. You can use either a local container or the Cloud version. If you want to use the local version, start it with the following command:

```shell
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant
```

However, the easiest way to start using Qdrant is our Cloud version.
You can sign up for a free-tier 1GB cluster at [https://cloud.qdrant.io/](https://cloud.qdrant.io/).

Once the environment is set up, you can configure the Qdrant instance and build the index by running the following commands:

```shell
export QDRANT_URL="http://localhost:6333"

# For the Cloud service you need to specify the api key as well
# export QDRANT_API_KEY="your-api-key"

bash tools/download_and_index.sh
```

The indexing process might take a while, as it needs to encode all the code snippets and send them to Qdrant.

### Search service

Once the index is built, you can start the search service by running:

```shell
docker-compose up
```

The UI will be available at [http://localhost:8000/](http://localhost:8000/). This is how it should look:

![Code search with Qdrant](images/code-search-ui.png)

Type in a search query and see the related code structures. Queries may come from natural language as well as from the code itself.

## Further steps

If you would like to take the demo further, you can try to:

1. Disable one of the neural encoders and see how the search results change.
2. Try other encoder models and see the impact on search quality.
3. Fork the project and support programming languages other than Rust.
4. Build a ground-truth dataset and evaluate search quality.
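As a closing illustration of the search process described earlier, the merge-and-deduplicate step over the hits returned by the two collections could be sketched as below. The function and field names are hypothetical and do not reflect the demo's actual API; they only show the idea of keeping the best score per snippet.

```python
def merge_results(nlp_hits: list[dict], code_hits: list[dict], limit: int = 5) -> list[dict]:
    """Merge hits from both collections, dropping duplicates and keeping the best score per snippet."""
    best: dict[str, dict] = {}
    for hit in nlp_hits + code_hits:
        # Identify a snippet by its file path and starting line
        key = f"{hit['file_path']}:{hit['line_from']}"
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit
    # Highest-scoring snippets first
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:limit]
```

The same snippet often scores well in both collections, so deduplicating by location keeps the result list compact while preserving whichever encoder ranked it higher.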