{"id":15691066,"url":"https://github.com/csinva/interpretable-embeddings","last_synced_at":"2025-05-07T23:23:09.460Z","repository":{"id":238753819,"uuid":"797452828","full_name":"csinva/interpretable-embeddings","owner":"csinva","description":"Interpretable text embeddings by asking LLMs yes/no questions (NeurIPS 2024)","archived":false,"fork":false,"pushed_at":"2024-11-15T06:39:59.000Z","size":151583,"stargazers_count":37,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-20T04:34:06.251Z","etag":null,"topics":["ai","artificial-intelligence","embeddings","encoding-models","explainability","fmri","huggingface","language-model","llm","neural-network","neuroscience","rag","retrieval-augmented-generation","transformer","xai"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2405.16714","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/csinva.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-07T21:33:22.000Z","updated_at":"2025-04-19T21:05:39.000Z","dependencies_parsed_at":"2024-05-28T06:06:40.612Z","dependency_job_id":null,"html_url":"https://github.com/csinva/interpretable-embeddings","commit_stats":null,"previous_names":["csinva/interpretable-embeddings"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Finterpretable-embeddings","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Finterpretable-embeddings/tags",
"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Finterpretable-embeddings/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/csinva%2Finterpretable-embeddings/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/csinva","download_url":"https://codeload.github.com/csinva/interpretable-embeddings/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252969056,"owners_count":21833403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","embeddings","encoding-models","explainability","fmri","huggingface","language-model","llm","neural-network","neuroscience","rag","retrieval-augmented-generation","transformer","xai"],"created_at":"2024-10-03T18:19:48.084Z","updated_at":"2025-05-07T23:23:09.437Z","avatar_url":"https://github.com/csinva.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003e ❓ Question-Answering Embeddings ❓ \u003c/h1\u003e\n\u003cp align=\"center\"\u003e Crafting Interpretable Embeddings by Asking LLMs Questions, code for the \u003ca href=\"https://arxiv.org/abs/2405.16714\"\u003eQA-Emb paper\u003c/a\u003e. 
\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/license-mit-blue.svg\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/python-3.6+-blue\"\u003e\n\u003c/p\u003e  \n\n\u003cp align=\"center\"\u003e QA-Emb builds interpretable embeddings by asking a series of yes-no questions to a pre-trained autoregressive LLM.\u003cbr/\u003e\n\u003cimg align=\"center\" width=100% src=\"docs/overview.png\"\u003e \u003c/img\u003e\t \u003cbr/\u003e\n\u003c/p\u003e\n\n# Quickstart\nIf you just want to use QA-Emb in your own application, the easiest way is through the [imodelsX package](https://github.com/csinva/imodelsX). To install, just run `pip install imodelsx`.\n\nThen, you can generate your own interpretable embeddings by coming up with questions for your domain:\n```python\nfrom imodelsx import QAEmb\nimport pandas as pd\n\nquestions = [\n    'Is the input related to food preparation?',\n    'Does the input mention laughter?',\n    'Is there an expression of surprise?',\n    'Is there a depiction of a routine or habit?',\n    'Does the sentence contain stuttering?',\n    'Does the input contain a first-person pronoun?',\n]\nexamples = [\n    'i sliced some cucumbers and then moved on to what was next',\n    'the kids were giggling about the silly things they did',\n    'and i was like whoa that was unexpected',\n    'walked down the path like i always did',\n    'um no um then it was all clear',\n    'i was walking to school and then i saw a cat',\n]\n\ncheckpoint = 'meta-llama/Meta-Llama-3-8B-Instruct'\n\nembedder = QAEmb(\n    questions=questions, checkpoint=checkpoint, use_cache=False)\nembeddings = embedder(examples)\n\ndf = pd.DataFrame(embeddings.astype(int), columns=[\n    q.split()[-1] for q in questions])\ndf.index = examples\ndf.columns.name = 'Question (abbreviated)'\ndisplay(df.style.background_gradient(axis=None))\n# displays the answer for each question in the embedding\n```\n\n\n# Dataset set up\n\nDirections for installing the datasets required for reproducing the fMRI experiments in the paper.\n\n- download data with `python experiments/00_load_dataset.py`\n    - this creates a `data` dir under wherever you run it and uses [datalad](https://github.com/datalad/datalad) to download the preprocessed data as well as the feature spaces needed for fitting [semantic encoding models](https://www.nature.com/articles/nature17637)\n- set `neuro1.config.root_dir` to where you want to store the data\n- to make flatmaps, you need to set the [pycortex filestore] to `{root_dir}/ds003020/derivative/pycortex-db/`\n- to run eng1000, grab the `em_data` directory from [here](https://github.com/HuthLab/deep-fMRI-dataset) and move its contents to `{root_dir}/em_data`\n- loading responses\n  - `neuro1.data.response_utils` function `load_response`\n  - loads responses from `{root_dir}/ds003020/derivative/preprocessed_data/{subject}`, where they are stored in an h5 file for each story, e.g. `wheretheressmoke.h5`\n- loading stimulus\n  - `neuro1.features.stim_utils` function `load_story_wordseqs`\n  - loads textgrids from `{root_dir}/ds003020/derivative/TextGrids`, where each story has a TextGrid file, e.g. `wheretheressmoke.TextGrid`\n  - uses `{root_dir}/ds003020/derivative/respdict.json` to get the length of each story\n\n# Code install\n\nDirections for installing the code here as a package for full development.\n\n- from the repo directory, start with `pip install -e .` to locally install the `neuro1` package\n- `python 01_fit_encoding.py --subject UTS03 --feature eng1000`\n    - Other optional parameters accepted by `01_fit_encoding.py`, such as `sessions`, `ndelays`, and `single_alpha`, let the user control the amount of data and the regularization of the linear regression. 
\n    - This script then saves model performance metrics and model weights as NumPy arrays.\n\n# Citation\n```bibtex\n@misc{benara2024crafting,\n      title={Crafting Interpretable Embeddings by Asking LLMs Questions},\n      author={Vinamra Benara and Chandan Singh and John X. Morris and Richard Antonello and Ion Stoica and Alexander G. Huth and Jianfeng Gao},\n      year={2024},\n      eprint={2405.16714},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcsinva%2Finterpretable-embeddings","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcsinva%2Finterpretable-embeddings","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcsinva%2Finterpretable-embeddings/lists"}