{"id":24818664,"url":"https://github.com/microsoft/kblam","last_synced_at":"2025-10-13T20:30:37.703Z","repository":{"id":273783857,"uuid":"836676452","full_name":"microsoft/KBLaM","owner":"microsoft","description":"Official Implementation of \"KBLaM: Knowledge Base augmented Language Model\"","archived":false,"fork":false,"pushed_at":"2025-01-30T15:05:33.000Z","size":334,"stargazers_count":7,"open_issues_count":1,"forks_count":1,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-01-30T16:22:52.452Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2410.10450","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-01T10:26:10.000Z","updated_at":"2025-01-30T15:05:38.000Z","dependencies_parsed_at":"2025-01-23T00:05:39.811Z","dependency_job_id":"e7e353a9-dc5d-4153-8c3a-b5a622c6aaa6","html_url":"https://github.com/microsoft/KBLaM","commit_stats":null,"previous_names":["microsoft/kblam"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FKBLaM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FKBLaM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FKBLaM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FKBLaM/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/KBLaM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236394706,"owners_count":19142139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-30T17:37:17.677Z","updated_at":"2025-10-13T20:30:37.691Z","avatar_url":"https://github.com/microsoft.png","language":"Jupyter Notebook","funding_links":[],"categories":["知识图谱问答KBQA_多跳推理"],"sub_categories":["大语言对话模型及数据"],"readme":"# KBLaM - Knowledge Base Augmented Language Models [ICLR 2025]\n\nThis repo contains the official implementation of [KBLaM: Knowledge Base Augmented Language Models](https://arxiv.org/abs/2410.10450).\n\nAuthors: Xi Wang, Liana Mikaelyan, Taketomo Isazawa, Mathew Salvaris, James Hensman.\n\nKBLaM is a new method for augmenting LLMs with external knowledge.\nUnlike Retrieval-Augmented Generation, KBLaM eliminates external\nretrieval modules, and unlike in-context learning, its computational overhead scales linearly with KB size rather than quadratically.\n\n## Supported Models\nThe following models from Hugging Face hub are currently supported:\n\n- [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)\n- [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)\n- [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct)\n\nTo add support for new model types, you will need to update the model processing scripts to incorporate an 
adapter similar to `llama_model.py` in `src/kblam/models`.\n\n## Setting up\n\nInstall the kblam package with\n\n```\npip install -e .\n```\n\nTo use Llama models, you will need to generate a token from Hugging Face and use it to log in:\n\n```\npip install huggingface_hub\nhuggingface-cli login\n```\n\nThe experiments in the paper can be replicated by running the scripts in `./experiments`.\n\n## Dataset Construction\n\nTo run the synthetic dataset construction, you will need a valid Azure OpenAI endpoint.\n\nTo construct a synthetic KB and question-answer pairs, use `dataset_generation/gen_synthetic_data.py`.\n\nThe question-answer pairs are constructed in the form:\n\n```\nWhat is the description of {entity_name}?\nThe description of {entity_name} is {description}.\n```\n\nTo generate KB embeddings, use `dataset_generation/generate_kb_embeddings.py`.\nThe embeddings we currently support are [text-embedding-ada-002](https://openai.com/index/new-and-improved-embedding-model/) and [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).\n\n## Training\n\nAs an example of model training, see the following:\n\n```\npython train.py --dataset_dir \u003cYour dataset directory\u003e --train_dataset synthetic --N 120000 --B 20 --total_steps 601  --encoder_spec OAI --use_oai_embd --key_embd_src key --use_data_aug --use_cached_embed\n```\n\nNote in particular the `--use_cached_embed` argument. This should be set to prevent recomputation of embeddings, which can take significant time, especially when using APIs such as OpenAI's text embeddings.\nThere are a number of optional arguments in `train.py` that you may want to consult.\n\n## Contributing\n\nThis project welcomes contributions and suggestions. Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. 
For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft\ntrademarks or logos is subject to and must follow\n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n\n## FAQ\n\n### What is KBLaM?\n\nKBLaM is a method for augmenting a transformer-based LLM with external knowledge. It consists of a base LLM and adapters that we train to transform the knowledge base into special knowledge tokens that the LLM ingests. In particular, because we only train adapters over the knowledge part, the base LLM is completely unmodified with regard to text input. 
If given no knowledge base, the model produces exactly the same output as the base model for any given input.\n\n### What can KBLaM do?\n\nIn addition to the base LLM’s capabilities, KBLaM can attend over the knowledge base to answer questions in a grounded manner.\n\n### What is/are KBLaM’s intended use(s)?\n\nThe model is intended to be used for research.\n\n### How was KBLaM evaluated? What metrics are used to measure performance?\n\nKBLaM was evaluated on accuracy of retrieval from the knowledge base, its refusal rate (how often it correctly said that it didn’t have the requisite information to answer the question), and precision and recall on how well the answers aligned with the correct answers given the knowledge base.\n\n### What are the limitations of KBLaM? How can users minimize the impact of KBLaM’s limitations when using the system?\n\nWhen used with knowledge bases that are very different from the knowledge base it was trained on, KBLaM will give incomplete answers, and the answers can be reworded from the original value in the knowledge base or at times entirely incorrect. As a result, KBLaM is not currently intended for use as a complete system in a production setting, but is a research project that we are sharing.\n\n### What operational factors and settings allow for effective and responsible use of KBLaM?\n\nWith no knowledge base, KBLaM performs exactly the same as the base model. With a knowledge base, for effective use, one should make sure that the training dataset and the use case have sufficiently similar knowledge bases.\n\n### How do I provide feedback on KBLaM?\n\nPlease add issues to this repository to provide feedback on KBLaM.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fkblam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmicrosoft%2Fkblam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmicrosoft%2Fkblam/lists"}