https://github.com/csinva/interpretable-embeddings
Interpretable text embeddings by asking LLMs yes/no questions (NeurIPS 2024)
- Host: GitHub
- URL: https://github.com/csinva/interpretable-embeddings
- Owner: csinva
- Created: 2024-05-07T21:33:22.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-11-15T06:39:59.000Z (11 months ago)
- Last Synced: 2025-04-20T04:34:06.251Z (6 months ago)
- Topics: ai, artificial-intelligence, embeddings, encoding-models, explainability, fmri, huggingface, language-model, llm, neural-network, neuroscience, rag, retrieval-augmented-generation, transformer, xai
- Language: Python
- Homepage: https://arxiv.org/abs/2405.16714
- Size: 145 MB
- Stars: 37
- Watchers: 3
- Forks: 2
- Open Issues: 0
- Metadata Files:
  - Readme: readme.md
README
❓ Question-Answering Embeddings ❓
Code for the QA-Emb paper, *Crafting Interpretable Embeddings by Asking LLMs Questions*.
QA-Emb builds interpretable embeddings by asking a series of yes/no questions to a pre-trained autoregressive LLM.
# Quickstart
If you just want to use QA-Emb in your own application, the easiest way is through the [imodelsX package](https://github.com/csinva/imodelsX). To install, just run `pip install imodelsx`. Then, you can generate your own interpretable embeddings by coming up with questions for your domain:
```python
from imodelsx import QAEmb
import pandas as pd

# questions tailored to your domain -- each question becomes one embedding dimension
questions = [
    'Is the input related to food preparation?',
    'Does the input mention laughter?',
    'Is there an expression of surprise?',
    'Is there a depiction of a routine or habit?',
    'Does the sentence contain stuttering?',
    'Does the input contain a first-person pronoun?',
]

examples = [
    'i sliced some cucumbers and then moved on to what was next',
    'the kids were giggling about the silly things they did',
    'and i was like whoa that was unexpected',
    'walked down the path like i always did',
    'um no um then it was all clear',
    'i was walking to school and then i saw a cat',
]

# the LLM used to answer each yes/no question
checkpoint = 'meta-llama/Meta-Llama-3-8B-Instruct'
embedder = QAEmb(
    questions=questions, checkpoint=checkpoint, use_cache=False)
embeddings = embedder(examples)

# display the answer to each question for each example
df = pd.DataFrame(embeddings.astype(int), columns=[
    q.split()[-1] for q in questions])
df.index = examples
df.columns.name = 'Question (abbreviated)'
display(df.style.background_gradient(axis=None))
```
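Because each dimension of a QA-Emb embedding is the answer to one question, the output can be inspected directly or dropped into ordinary vector tooling. The snippet below is a minimal sketch, not part of the repo, that reuses the `embeddings` and `examples` variables from the quickstart above to find the nearest neighbor of the first example by cosine similarity.

```python
# Minimal sketch (not part of the repo): treat the yes/no answers as an ordinary
# dense vector and compare examples with cosine similarity.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# find the example most similar to the first one (assumes the quickstart was run)
sims = [cosine_sim(embeddings[0], e) for e in embeddings[1:]]
print('most similar to example 0:', examples[1 + int(np.argmax(sims))])
```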
# Dataset setup

Directions for installing the datasets required to reproduce the fMRI experiments in the paper.
- download data with `python experiments/00_load_dataset.py`
- this will create a `data` dir under wherever you run it and will use [datalad](https://github.com/datalad/datalad) to download the preprocessed data as well as the feature spaces needed for fitting [semantic encoding models](https://www.nature.com/articles/nature17637)
- set `neuro1.config.root_dir` to where you want to store the data
- to make flatmaps, you need to set the pycortex filestore to `{root_dir}/ds003020/derivative/pycortex-db/`
- to run eng1000, you need to grab the `em_data` directory from [here](https://github.com/HuthLab/deep-fMRI-dataset) and move its contents to `{root_dir}/em_data`
- loading responses
- `neuro1.data.response_utils` function `load_response`
- loads responses from `{root_dir}/ds003020/derivative/preprocessed_data/{subject}`, where they are stored in an h5 file for each story, e.g. `wheretheressmoke.h5` (see the sketch after this list)
- loading stimulus
- `neuro1.features.stim_utils` function `load_story_wordseqs`
- loads textgrids from `{root_dir}/ds003020/derivative/TextGrids`, where each story has a TextGrid file, e.g. `wheretheressmoke.TextGrid`
- uses `{root_dir}/ds003020/derivative/respdict.json` to get the length of each story
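For a concrete picture of what `load_response` reads, each preprocessed response file is a per-story HDF5 file. The snippet below is only an illustrative sketch using `h5py`, not the repo's loader; the `root_dir` value is a placeholder and the dataset keys inside each file are simply printed rather than assumed.

```python
# Illustrative sketch only -- the repo's loader is neuro1.data.response_utils.load_response.
# root_dir below is a placeholder; point it at wherever neuro1.config.root_dir is set.
import os
import h5py

root_dir = os.path.expanduser('~/neuro_data')  # placeholder
subject = 'UTS03'
story = 'wheretheressmoke'
path = os.path.join(root_dir, 'ds003020', 'derivative',
                    'preprocessed_data', subject, f'{story}.h5')

# print each dataset stored in the file (e.g. a timepoints x voxels response matrix)
with h5py.File(path, 'r') as f:
    for key, dset in f.items():
        print(key, dset.shape, dset.dtype)
```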
# Code install

Directions for installing the code in this repo as a package for full development.
- from the repo directory, start with `pip install -e .` to locally install the `neuro1` package
- fit an encoding model with `python 01_fit_encoding.py --subject UTS03 --feature eng1000`
- the other optional parameters the encoding script takes, such as `sessions`, `ndelays`, and `single_alpha`, let you change the amount of data used and the regularization of the linear regression (see the sketch below)
- this script will then save model performance metrics and model weights as numpy arrays
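To make the regression step concrete: an encoding model here maps stimulus features to voxel responses with a regularized linear regression. The snippet below is a conceptual sketch on random placeholder arrays using scikit-learn's `RidgeCV`, not the repo's `01_fit_encoding.py` implementation.

```python
# Conceptual sketch only -- not the repo's 01_fit_encoding.py implementation.
# An encoding model maps stimulus features X (timepoints x features) to voxel
# responses Y (timepoints x voxels) with ridge regression; the alpha grid plays
# the role of the regularization settings mentioned above.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 985))   # placeholder features (eng1000 is 985-dimensional)
Y = rng.standard_normal((500, 100))   # placeholder responses for 100 voxels

model = RidgeCV(alphas=np.logspace(0, 4, 10))
model.fit(X, Y)

# performance metrics and weights could then be saved as numpy arrays
weights = model.coef_                 # shape: (voxels, features)
print(weights.shape, round(model.score(X, Y), 3))
```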
# Citation

```bibtex
@misc{benara2024crafting,
title={Crafting Interpretable Embeddings by Asking LLMs Questions},
author={Vinamra Benara and Chandan Singh and John X. Morris and Richard Antonello and Ion Stoica and Alexander G. Huth and Jianfeng Gao},
year={2024},
eprint={2405.16714},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```