{"id":30430481,"url":"https://github.com/shobrook/weightgain","last_synced_at":"2025-08-22T18:22:25.264Z","repository":{"id":280304562,"uuid":"939035618","full_name":"shobrook/weightgain","owner":"shobrook","description":"Train an adapter for any embedding model in under a minute","archived":false,"fork":false,"pushed_at":"2025-04-09T18:32:11.000Z","size":557,"stargazers_count":110,"open_issues_count":0,"forks_count":5,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-08-17T04:33:30.198Z","etag":null,"topics":["adapter","embedding-models","embeddings","fine-tuning","lora","openai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shobrook.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-25T22:17:43.000Z","updated_at":"2025-08-05T16:01:16.000Z","dependencies_parsed_at":"2025-03-02T16:43:55.573Z","dependency_job_id":null,"html_url":"https://github.com/shobrook/weightgain","commit_stats":null,"previous_names":["shobrook/weightgain"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shobrook/weightgain","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fweightgain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fweightgain/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fweightgain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fweightgain/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shobrook","download_url":"https://codeload.github.com/shobrook/weightgain/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shobrook%2Fweightgain/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271680831,"owners_count":24802077,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adapter","embedding-models","embeddings","fine-tuning","lora","openai"],"created_at":"2025-08-22T18:22:24.053Z","updated_at":"2025-08-22T18:22:25.238Z","avatar_url":"https://github.com/shobrook.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# weightgain\n\n**Fine-tune _any_ embedding model in under a minute. Even closed-source models from OpenAI, Cohere, Voyage, etc.**\n\nWeightgain works by training an [adapter](https://research.trychroma.com/embedding-adapters) that sits on top of the model, transforming the embeddings _after_ they're generated. This produces task-specific embeddings optimized for your specific RAG/retrieval use case. 
\n\nWith weightgain, you can train an adapter in just a couple lines of code –– even if you don't have a dataset.\n\n## Installation\n\n```bash\n\u003e pip install weightgain\n```\n\n## Quickstart\n\n```python\nfrom weightgain import Dataset, Adapter\n\n# Generate a dataset (or supply your own)\ndataset = Dataset.from_synthetic_chunks(\n    prompt=\"Chunks of code from an arbitrary Python codebase.\",\n    llm=\"openai/gpt-4o-mini\",\n)\n\n# Train the adapter\nadapter = Adapter(\"openai/text-embedding-3-large\")\nadapter.fit(dataset)\n\n# Apply the adapter\nnew_embeddings = adapter.transform(old_embeddings)\n```\n\n## Usage\n\n### Choosing an Embedding Model\n\nWeightgain wraps LiteLLM. You can fine-tune any embedding model supported by LiteLLM, e.g. models from OpenAI, Cohere, Voyage, etc. [Here's](https://docs.litellm.ai/docs/embedding/supported_embedding) the full list of supported models.\n\n\u003c!--TODO: You can also define your own--\u003e\n\n### Building the Dataset\n\nYou need a dataset of `[query, chunk]` pairs to get started. A chunk is a retrieval result, e.g. a code snippet or excerpt from a document. And the query is a string that's _similar_ to the chunk and should match in a vector search. You can either generate a synthetic dataset or supply your own.\n\n**If you already have chunks:**\n\n```python\nfrom weightgain import Dataset\n\nchunks = [...] 
# list of strings\ndataset = Dataset.from_chunks(\n    chunks,\n    llm=\"openai/gpt-4o-mini\",\n    n_queries_per_chunk=1\n)\n```\n\nThis will use OpenAI's `gpt-4o-mini` (or whatever LiteLLM model you want) to generate `1` query per chunk.\n\n**If you don't have chunks:**\n\n```python\ndataset = Dataset.from_synthetic_chunks(\n    prompt=\"Chunks of code from an arbitrary Python codebase.\",\n    llm=\"openai/gpt-4o-mini\",\n    n_chunks=25,\n    n_queries_per_chunk=1\n)\n```\n\nThis will generate chunks using the prompt, and then generate `1` query per chunk.\n\n**If you have queries and chunks:**\n\n```python\nqa_pairs = [...] # list of (str, str) tuples\ndataset = Dataset.from_pairs(qa_pairs, model)\n```\n\n### Training the Adapter\n\n```python\nfrom weightgain import Adapter\n\n# Instantiate the adapter for your embedding model (as in the Quickstart), then fit it\nadapter = Adapter(\"openai/text-embedding-3-large\")\nadapter.fit(\n    dataset,\n    batch_size=25,\n    max_epochs=50,\n    learning_rate=100.0,\n    dropout=0.0\n)\n```\n\nAfter training, you can generate a report with various plots (training loss, cosine similarity distributions before/after training, etc.):\n\n```python\nadapter.show_report()\n```\n\n![Example report](./report.png)\n\n### Applying the Adapter\n\n```python\nold_embeddings = [...] # list of vectors\nnew_embeddings = adapter.transform(old_embeddings)\n```\n\nBehind the scenes, an adapter is just a matrix of weights that your embeddings are multiplied by. You can access this matrix like so:\n\n```python\nadapter.matrix # returns numpy.ndarray\n```\n\n## Roadmap\n\n1. Add an option to train an MLP instead of a linear layer\n2. Add a method for easy hyperparameter search\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshobrook%2Fweightgain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshobrook%2Fweightgain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshobrook%2Fweightgain/lists"}