{"id":22220962,"url":"https://github.com/nokia/codesearch","last_synced_at":"2025-07-27T16:30:54.471Z","repository":{"id":49972212,"uuid":"290777031","full_name":"nokia/codesearch","owner":"nokia","description":"Models and datasets for annotated code search.","archived":false,"fork":false,"pushed_at":"2023-05-22T07:22:52.000Z","size":107,"stargazers_count":33,"open_issues_count":0,"forks_count":6,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-04-14T13:07:11.793Z","etag":null,"topics":["code-reuse","code-search","deep-learning","machine-learning","natural-language-processing","transformer"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2008.12193","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nokia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-08-27T13:04:02.000Z","updated_at":"2023-12-11T13:38:34.000Z","dependencies_parsed_at":"2022-09-09T20:31:03.623Z","dependency_job_id":null,"html_url":"https://github.com/nokia/codesearch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nokia%2Fcodesearch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nokia%2Fcodesearch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nokia%2Fcodesearch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nokia%2Fcodesearch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nokia","download_url":"https://codeload.github.com/nokia/c
odesearch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227817164,"owners_count":17824199,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-reuse","code-search","deep-learning","machine-learning","natural-language-processing","transformer"],"created_at":"2024-12-02T23:11:11.978Z","updated_at":"2024-12-02T23:11:12.698Z","avatar_url":"https://github.com/nokia.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Code search\n\nThis project contains the code to reproduce the experiments in the paper [Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent](https://arxiv.org/abs/2008.12193). It implements retrieval systems for annotated code snippets: pairs of a code snippet and a short natural language description. Our pretrained models and datasets are hosted on Zenodo (https://zenodo.org/record/4001602). The models and datasets will be downloaded automatically when calling `load_model`, `load_snippet_collection`, etc. 
(see the code examples below).\n\nThe project also implements some *code-only* retrieval models (BM25, NCS, UNIF) for snippet collections that do not come with descriptions.\n\nThe experiments in the paper were run on Python snippets, but the code preprocessing currently also supports Java, JavaScript, and Bash.\n\nThe project is developed by a research team in the [Application Platforms and Software Systems Lab](https://www.bell-labs.com/our-research/areas/applications-and-platforms/) of [Nokia Bell Labs](https://www.bell-labs.com/). \n\n## Installation\n\n1. Install the codesearch library: `pip install .`\n2. Install the tree-sitter parsers (for preprocessing the code snippets): e.g., `codesearch install_parsers python java` or simply `codesearch install_parsers` to install parsers for all supported languages. By default, parsers are installed under the `codesearch/parsers` directory; this can be customized by setting the `TREE_SITTER_DIR` variable.\n3. Install the spaCy model (for preprocessing descriptions/code comments): `python -m spacy download en_core_web_md`\n\n\n## Code structure\n\n```\ncodesearch\n├── codesearch          // Contains the library modules: model code, utilities to download and evaluate models, etc.\n├── nbs                 // Contains example notebooks and notebooks to reproduce the experiments\n├── tests               // Contains some unit tests, mostly for verifying the code preprocessing\n```\n\n## Models\n\nWe provide some pretrained embedding models to create a retrieval system. 
The pretrained models also expose a consistent interface to embed snippets and queries:\n\n#### Example: Query a snippet collection with a pretrained embedding model\n\n```python\nfrom codesearch.utils import load_model\nfrom codesearch.embedding_retrieval import EmbeddingRetrievalModel\n\nquery = \"plot a bar chart\"\nsnippets = [{                           # a dummy snippet collection with 1 snippet\n    \"id\": \"1\",\n    \"description\": \"Hello world\",\n    \"code\": \"print('hello world')\",\n    \"language\": \"python\"\n    }]\n\nembedding_model = load_model(\"use-embedder-pacs\")\nretrieval_model = EmbeddingRetrievalModel(embedding_model)\nretrieval_model.add_snippets(snippets)\nretrieval_model.query(query)\n```\n\n#### Example: Embed snippets or queries with a pre-trained embedding model\n\n```python\nfrom codesearch.utils import load_model\n\nmodel_name = \"use-embedder-pacs\"\nqueries = [\"plot a bar chart\"]\nsnippets = [{\n    \"description\": \"Hello world\",\n    \"code\": \"print('hello world')\",\n    \"language\": \"python\"\n    }]\n\nembedding_model = load_model(model_name)\nquery_embs = embedding_model.embed_queries(queries)\nsnippet_embs = embedding_model.embed_snippets(snippets)\n```\n\n### Available models\n\nBelow you find a table with the pretrained models. For each model, we mention based on what information it computes a snippet embedding: the description and/or the code. 
\n\n| name                       | inputs             | training data                                          | notebook                    |\n|----------------------------|--------------------|--------------------------------------------------------|-----------------------------|\n| ncs-embedder-so-ds-feb20      | code               | so-ds-feb20                                            | nbs/ncs/ncs.ipynb           |\n| ncs-embedder-staqc-py      | code               | staqc-py-cleaned                              | nbs/ncs/ncs.ipynb           |\n| tnbow-embedder-so-ds-feb20 | description        | so-python-question-titles-feb20                        | nbs/tnbow/tnbow.ipynb       |\n| use-embedder-pacs          | description        | so-duplicates-pacsv1-train                             | nbs/tuse/tuse_tuned.ipynb   |\n| ensemble-embedder-pacs     | description + code | staqc-py-cleaned + so-duplicates-pacs-train | nbs/ensemble/ensemble.ipynb |\n\n## Datasets\n\nThis project provides a consistent interface to download and load datasets related to code search.\n\n### Snippet collections\n\n####  Example: Load a snippet collection\n\n```python\nfrom codesearch.data import load_snippet_collection\ncollection_name = \"so-ds-feb20\"\nsnippets = load_snippet_collection(collection_name)\n```\n\n#### Available snippet collections\nIn the table below you find which snippet collections can be loaded. The staqc-py-cleaned, conala-curated, and codesearchnet collections are derived from existing datasets. For staqc-py and conala-curated we did some additional processing, for the codesearchnet collections we merely load the original dataset in a format that is consistent with our code. 
\n\nIf you use any of these datasets in your research, please make sure to cite the respective works.\n\n| name                                          | description                                                                                                                  |\n|-----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|\n| so-ds-feb20                                   | Mined from Python Stack Overflow posts related to data science. Stack Overflow dumps can be found here: https://archive.org/details/stackexchange, [LICENSE](https://creativecommons.org/licenses/by-sa/4.0/)                                                             |\n| staqc-py-cleaned                     | Derived from the Python StaQC snippets (additional cleaning was done as described in the paper). See https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset, [LICENSE](https://github.com/LittleYUYU/StackOverflow-Question-Code-Dataset/blob/master/LICENSE.txt)                               |\n| conala-curated                                | Derived from the curated snippets of the CoNaLa benchmark. See https://conala-corpus.github.io/ , [LICENSE](https://creativecommons.org/licenses/by-sa/4.0/)                                                                                         |\n| codesearchnet-{language}-{train\\|valid\\|test} | The CodeSearchNet snippet collections used for training/MRR validation/MRR testing. See https://github.com/github/CodeSearchNet. Licenses of the individual snippets can be found in the pkl files.                                           |\n| codesearchnet-{language}                      | The CodeSearchNet snippet collections used for the Weights & Biases benchmark. See https://github.com/github/CodeSearchNet. Licenses of the individual snippets can be found in the pkl files. 
**Note**: not all of these snippets have descriptions |\n\n### Evaluation data\nEvaluation datasets link queries to relevant snippets in one of the above snippet collections.\n\n\n#### Example: load an evaluation dataset\n```python\nfrom codesearch.data import load_eval_dataset\nqueries, query2ids = load_eval_dataset(\"so-ds-feb20-valid\")\n```\n\n#### Available evaluation datasets\n| name                           | description                                                                     |\n|--------------------------------|---------------------------------------------------------------------------------|\n| so-ds-feb20-{valid\\|test}      | Queries paired to relevant snippets in the so-ds-feb20 snippet collection.      |\n| staqc-py-cleaned-{valid\\|test} | Queries paired to relevant snippets in the staqc-py-cleaned snippet collection. |\n| conala-curated-0.5-test        | Queries paired to relevant snippets in the CoNaLa benchmark                     |\n\n\nIt is also possible to load a snippet collection as evaluation data. The descriptions will be used as queries. Note that this only makes sense for evaluating code-only models (i.e., models that do not use the description field).\n\n#### Example: load a snippet collection as evaluation data\n```python\nqueries, query2ids = load_eval_dataset(\"codesearchnet-python-valid\")\n```\n\n\n### Training data\n\nThe different models we implement use different kinds of training data. Code-only models are trained on pairs of code snippets and descriptions. For these models, the snippet collections are used as training data (of course, you should never train on a snippet collection when you intend to load that collection as evaluation data as well). The USE model is fine-tuned on titles of duplicate Stack Overflow posts. 
You can take a look at our notebooks (e.g., nbs/ncs/ncs.ipynb, nbs/tuse/tuse_tuned.ipynb) to find out how the training is done and how the training data is loaded.\n\nTo download and load the title pairs from Stack Overflow duplicate posts, run:\n\n```python\nfrom codesearch.data import load_train_dataset\nduplicate_records = load_train_dataset(\"so-duplicates-pacs-train\")\n```\n\nThese duplicate records have been filtered to ensure that there is no overlap with the `so-ds-feb20` and `staqc-py` evaluation datasets.\n\nTo download a text file with Stack Overflow post titles tagged with Python (used for the TNBOW baseline), run: \n\n```python\nfrom codesearch.data import load_train_dataset\nfilename = load_train_dataset(\"so-python-question-titles-feb20\")\n```\n\n## Demo notebook\n\nYou can run the demo notebook `nbs/demo/demo.ipynb` to quickly try out any of the pretrained models on one of the snippet collections.\n\n## Benchmark on PACS\n\nTo replicate the results of our paper or evaluate your own model on the PACS benchmark, have a look at `nbs/evaluate.ipynb` and `codesearch/benchmark.ipynb`. A custom embedding model class should implement the `embed_snippets` and `embed_queries` functions (similar to `codesearch/tuse/tuse_embedder.py`, `codesearch/tnbow/tnbow_embedder.py`, `codesearch/ncs/ncs_embedder.py` etc.).\n\n#### Example: Benchmark a model on PACS\n\n```python\nfrom codesearch.benchmark import benchmark_on_pacs\n\nbenchmark_on_pacs(\n    model_path=model_path, # one of the pretrained model names or a path to a model that can be loaded with `codesearch.utils.load_model`\n    output_dir=output_dir\n)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnokia%2Fcodesearch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnokia%2Fcodesearch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnokia%2Fcodesearch/lists"}