{"id":14155602,"url":"https://github.com/novoselrok/codesnippetsearch","last_synced_at":"2026-02-02T22:01:30.989Z","repository":{"id":37224557,"uuid":"263015225","full_name":"novoselrok/codesnippetsearch","owner":"novoselrok","description":"Neural bag of words code search implementation using PyTorch and data from the CodeSearchNet project. ","archived":false,"fork":false,"pushed_at":"2023-01-06T13:26:29.000Z","size":2819,"stargazers_count":71,"open_issues_count":62,"forks_count":5,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-08-06T01:32:43.457Z","etag":null,"topics":["code-search","codesearchnet","embeddings","machine-learning"],"latest_commit_sha":null,"homepage":"https://codesnippetsearch.net/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/novoselrok.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-11T10:50:22.000Z","updated_at":"2025-01-31T18:28:59.000Z","dependencies_parsed_at":"2023-02-06T04:16:45.058Z","dependency_job_id":null,"html_url":"https://github.com/novoselrok/codesnippetsearch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/novoselrok/codesnippetsearch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novoselrok%2Fcodesnippetsearch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novoselrok%2Fcodesnippetsearch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novoselrok%2Fcodesnippetsearch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novoselrok%2Fcodesnippetsearch
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/novoselrok","download_url":"https://codeload.github.com/novoselrok/codesnippetsearch/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novoselrok%2Fcodesnippetsearch/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29021031,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-02T18:51:31.335Z","status":"ssl_error","status_checked_at":"2026-02-02T18:49:20.777Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["code-search","codesearchnet","embeddings","machine-learning"],"created_at":"2024-08-17T08:04:18.927Z","updated_at":"2026-02-02T22:01:30.955Z","avatar_url":"https://github.com/novoselrok.png","language":"Python","funding_links":[],"categories":["machine-learning"],"sub_categories":[],"readme":"# CodeSnippetSearch\n\nCodeSnippetSearch is a web application and a web extension that allows you to search GitHub repositories using natural language\nqueries and code itself. 
\n\nIt is based on a neural bag-of-words code search implementation using PyTorch and data from the [CodeSearchNet](https://github.com/github/CodeSearchNet) project.\nThe model training code was heavily inspired by the baseline (TensorFlow) implementation in the CodeSearchNet repository. \nCurrently, the Python, Java, Go, PHP, JavaScript, and Ruby programming languages are supported.\n\nHelpful papers:\n* [CodeSearchNet Challenge: Evaluating the State of Semantic Code Search](https://arxiv.org/pdf/1909.09436.pdf)\n* [When Deep Learning Met Code Search](https://arxiv.org/pdf/1905.03813.pdf)\n\n## Model description\n\n\n## Model structure\n\n![Model structure](assets/model.png)\n\n## Project structure\n- `code_search`: A Python package with scripts to prepare the data, train the language models, and save the embeddings\n- `code_search_web`: CodeSnippetSearch website Django project\n- `serialized_data`: Store for intermediate objects during training (docs, vocabularies, models, embeddings, etc.)\n- `codesearchnet_data`: Data from the CodeSearchNet project\n\n## Data\n\nWe are using the data from the CodeSearchNet project. Run the following command to download the required data:\n\n- `$ ./scripts/download_codesearchnet_data.sh`\n\nThis will download around 20GB of data. An overview of the data structure is available [here](https://github.com/github/CodeSearchNet/tree/master/resources).\n\n## Training the models\n\nIf possible, perform these steps inside a virtual environment.\nTo install the required dependencies run: `$ ./scripts/install_pip_packages.sh`.\nTo install `code_search` as a package run: `$ ./scripts/install_code_search_package.sh`\n\n### Preparing the data\n\nThe data preparation step is separate from the training step because it is time- and memory-intensive. We will prepare all the\ndata needed for training. 
This includes preprocessing code docs, building vocabularies, and encoding sequences.\n\nThe first step is to parse the CodeSearchNet data. We need to convert the `*_dedupe_definitions_v2.pkl` files from the `pickle` format to the `jsonl` format. We will be using the jsonl\nformat throughout the project, since we can read the file line by line and keep the memory footprint minimal. Reading the\nevaluation docs requires **more** than 16GB of memory, because the entire file has to be read into memory (the largest is `javascript_dedupe_definitions_v2.pkl` at 6.6GB).\nIf you do not have this kind of horsepower, I suggest renting a cloud server with \u003e16GB of memory and running this step there. After you are done,\njust download the jsonl files to your local machine. Subsequent preparation and training steps should not take more than 16GB of memory.\n\nTo parse the CodeSearchNet data run: `$ python parse_codesearchnet_data.py`\n\nTo prepare the data for training run: `$ python prepare_data.py --prepare-all`. It uses the Python multiprocessing\nmodule to take advantage of multiple cores. If you encounter memory errors or slow performance, you can tweak the number of\nprocesses by changing the parameter passed to `multiprocessing.Pool`.\n\n### Training and evaluation\n\nStart the training by running: `$ python train.py`. This will train a separate model for each language, build code embeddings,\nevaluate them according to MRR (Mean Reciprocal Rank), and output `model_predictions.csv`. 
The predictions will be evaluated by GitHub \u0026 WANDB \nusing the NDCG (Normalized Discounted Cumulative Gain) metric to rank the submissions.\n\n### Query the trained models\n\nRun `$ python search.py \"read file lines\"` to output the 3 best-ranked results for each language.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnovoselrok%2Fcodesnippetsearch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnovoselrok%2Fcodesnippetsearch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnovoselrok%2Fcodesnippetsearch/lists"}