{"id":17153371,"url":"https://github.com/qdrant/miniCOIL","last_synced_at":"2025-07-26T04:31:35.336Z","repository":{"id":250503067,"uuid":"834643864","full_name":"qdrant/miniCOIL","owner":"qdrant","description":"Contextualized per-token embeddings","archived":false,"fork":false,"pushed_at":"2025-05-11T21:07:29.000Z","size":1299,"stargazers_count":26,"open_issues_count":2,"forks_count":5,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-19T23:30:36.040Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qdrant.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-07-27T22:54:05.000Z","updated_at":"2025-06-26T11:51:46.000Z","dependencies_parsed_at":"2024-10-27T12:29:37.637Z","dependency_job_id":"f41fdb0c-8914-42ef-b1d2-e9e956279e16","html_url":"https://github.com/qdrant/miniCOIL","commit_stats":null,"previous_names":["generall/minicoil","qdrant/minicoil"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/qdrant/miniCOIL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2FminiCOIL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2FminiCOIL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2FminiCOIL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2FminiCOIL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qdrant","download_
url":"https://codeload.github.com/qdrant/miniCOIL/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qdrant%2FminiCOIL/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267117250,"owners_count":24038640,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-26T02:00:08.937Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-14T21:46:02.184Z","updated_at":"2025-07-26T04:31:35.327Z","avatar_url":"https://github.com/qdrant.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\n# miniCOIL\n\nMiniCOIL is a contextualized per-word embedding model.\nMiniCOIL generates small-size embeddings for each word in a sentence, but an embedding can only be compared with\nembeddings of the same word in different sentences (contexts).
\nThis restriction makes it possible to generate extremely small embeddings (8d or even 4d) while still preserving the context of the word.\n\n## Usage\n\nMiniCOIL embeddings can be useful in information retrieval tasks, where we need to resolve the meaning of a word in the context of its sentence.\nFor example, many words have different meanings depending on the context, such as \"bank\" (river bank or financial institution).\n\nMiniCOIL encodes the precise meaning of a word, but unlike traditional word embeddings it won't dilute exact matches with other words in the vocabulary.\n\nMiniCOIL is not trained in an end-to-end fashion, which means that it can't assign relative importance to the words in a sentence.\nHowever, it can be combined with a BM25-like scoring formula and used in search engines.\n\n## Architecture\n\nMiniCOIL is designed to be compatible with foundational transformer models, such as SentenceTransformers.\nThere are two main reasons for this:\n\n- We don't want to spend enormous resources on training MiniCOIL.\n- We want to be able to combine MiniCOIL embedding inference with dense embedding inference in a single step.\n\nTechnically, MiniCOIL is a simple array of linear layers (one for each word in the vocabulary) that are trained\nto compress the word embeddings into a small size.
That makes MiniCOIL a paper-thin layer on top of the transformer model.\n\n### Training process\n\nMiniCOIL is trained on the principle of skip-gram models, adapted to transformer models: we want to predict the context from the word.\nIn the case of transformer models, we predict sentence embeddings from the word embeddings.\n\nNaturally, this process can be separated into two steps: encoding and decoding (similar to autoencoders), with small-size embeddings in the middle.\n\nSince we want to make MiniCOIL compatible with many transformer models, we can replace the decoder step with compressed embeddings of some larger model,\nso for each input model we can train the encoder independently.\n\nSo the process of training is as follows:\n\n1. Download the dataset (we use openwebtext)\n1. Convert the dataset into a readable format with `mini_coil.data_pipeline.convert_openwebtext`\n1. Split the data into sentences with `mini_coil.data_pipeline.split_sentences`\n1. Encode sentences with the transformer model and save the embeddings to disk (about 350M embeddings for openwebtext) with `mini_coil.data_pipeline.encode_targets`\n1. Upload the encoded sentences to Qdrant with `mini_coil.data_pipeline.upload_to_qdrant`, so we can sample sentences containing specified words\n1. For triplet-based training, follow [train-triplets.sh](./train-triplets.sh). For each word, it will:\n   1. Generate a distance matrix based on the large embeddings\n   2. Augment sentences\n   3. Encode sentences with the small model\n   4. Train the per-word encoder\n1. Merge the per-word encoders into a single model with `mini_coil.data_pipeline.combine_models`\n1. Make visualizations\n1. Benchmark\n1. Quantize (optional)\n1. Convert to ONNX\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdrant%2FminiCOIL","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqdrant%2FminiCOIL","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqdrant%2FminiCOIL/lists"}