https://github.com/waltonfuture/Diff-eRank

Code for https://arxiv.org/abs/2401.17139 (NeurIPS 2024)

evaluation-metrics llm llm-inference machine-learning mllm neurips-2024


Host: GitHub
URL: https://github.com/waltonfuture/Diff-eRank
Owner: waltonfuture
License: apache-2.0
Created: 2024-01-29T13:16:52.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2024-11-15T11:46:37.000Z (8 months ago)
Last Synced: 2024-11-15T12:31:54.292Z (8 months ago)
Topics: evaluation-metrics, llm, llm-inference, machine-learning, mllm, neurips-2024
Language: Python
Homepage: https://arxiv.org/abs/2401.17139
Size: 37.1 KB
Stars: 23
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

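StarryDivineSky - waltonfuture/Diff-eRank - eRank is a rank-based metric for evaluating large language models (LLMs). Grounded in information theory and geometric principles, it analyzes a model's hidden representations to quantify how well the model discards redundant information after training. The metric applies to both single-modal (language) and multi-modal settings. The study finds that Diff-eRank increases as model size scales up and that it stays consistent with traditional metrics such as loss and accuracy. The project provides code and examples for computing Diff-eRank on a single sentence or on a dataset, along with links to the paper and the project. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)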

README

# Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models (NeurIPS 2024)

[Lai Wei](https://waltonfuture.github.io/) *, Zhiquan Tan *, Chenghai Li, [Jindong Wang](https://jd92.wang/), [Weiran Huang](https://www.weiranhuang.com/) (*Equal Contribution).

**Shanghai Jiao Tong University & Tsinghua University & William and Mary**

 

## Introduction

We introduce a rank-based metric called Diff-eRank, which is rooted in information theory and geometry principles. Diff-eRank evaluates LLMs by examining their hidden representations to quantify how LLMs discard redundant information after training.
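To make the idea concrete, the following is a minimal sketch of an entropy-based effective rank, assuming the spectrum (singular values of a covariance matrix) is already available. It uses only the standard library, and the names `effective_rank` and `diff_erank` as well as the toy spectra are hypothetical illustrations, not part of this repository.

```python
import math


def effective_rank(spectrum):
    """Exponential of the Shannon entropy of the normalized spectrum."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have a positive sum")
    # Normalize to a probability distribution; zero entries contribute nothing.
    probs = [s / total for s in spectrum if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def diff_erank(untrained_spectrum, trained_spectrum):
    """Effective rank of the untrained spectrum minus that of the trained one."""
    return effective_rank(untrained_spectrum) - effective_rank(trained_spectrum)


# Toy spectra (made up for illustration): a flat spectrum spreads variance
# over all directions, a peaked one concentrates it in a single direction.
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [4.0, 0.0, 0.0, 0.0]

print(effective_rank(flat))      # close to 4: all four directions are used
print(effective_rank(peaked))    # close to 1: one direction dominates
print(diff_erank(flat, peaked))  # close to 3
```

In this toy setting, the flat spectrum plays the role of an untrained model and the peaked spectrum the role of a trained model that has concentrated its representations in fewer directions.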

Specifically, we demonstrate its applicability in both single-modal (language) and multi-modal settings. For language models, we find that Diff-eRank increases as the model scales up and that it shows a consistent relationship with traditional metrics such as loss and accuracy.

For multi-modal models, we also propose a rank-based evaluation method for assessing alignment quality, and we find that modern multi-modal large language models exhibit good alignment performance.



  



## Calculation of Diff-eRank

### Setup

```bash
pip install transformers torch datasets
```

### Calculation

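```python
import math

import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer


# R is an N x d matrix: N token representations of dimension d.
def normalize(R):
    with torch.no_grad():
        # Center the representations, then scale each row to unit L2 norm.
        mean = R.mean(dim=0)
        R = R - mean
        norms = torch.norm(R, p=2, dim=1, keepdim=True)
        R = R / norms
    return R


def cal_cov(R):
    with torch.no_grad():
        # d x d covariance matrix of the row-normalized representations.
        Z = torch.nn.functional.normalize(R, dim=1)
        A = torch.matmul(Z.T, Z) / Z.shape[0]
    return A


def cal_erank(A):
    with torch.no_grad():
        # Singular values of the trace-normalized matrix.
        # torch.linalg.svdvals replaces the deprecated torch.svd(...)[1].
        eig_val = torch.linalg.svdvals(A / torch.trace(A))
        # Shannon entropy of the spectrum; nansum skips 0 * log(0) terms.
        entropy = -(eig_val * torch.log(eig_val)).nansum().item()
        erank = math.exp(entropy)
    return erank


def compute(R):
    return cal_erank(cal_cov(normalize(R)))


# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "facebook/opt-1.3b"  # for example
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path).to(device)

# Same architecture with randomly initialized (untrained) weights.
config = AutoConfig.from_pretrained(model_path)
untrained_model = AutoModel.from_config(config).to(device)

text = "We introduce a rank-based metric called Diff-eRank, which is rooted in information theory and geometry principles. Diff-eRank evaluates LLMs by examining their hidden representations to quantify how LLMs discard redundant information after training."  # for example
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    # Last hidden states of the single input sequence, shape (N, d).
    R1 = model(inputs.input_ids)[0][0, :, :]
    R2 = untrained_model(inputs.input_ids)[0][0, :, :]
    erank1 = compute(R1)
    erank2 = compute(R2)
    # Diff-eRank: effective rank of the untrained model minus the trained model.
    RD = erank2 - erank1

print(RD)
```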
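The calculation above can be summarized in formulas. This is a reading of the code rather than a quotation from the paper, and the notation ($r_i$, $z_i$, $\Sigma$, $\sigma_j$, $M_0$ for the untrained model, $M_1$ for the trained model) is chosen here for illustration.

```latex
% R \in \mathbb{R}^{N \times d} has rows r_1, \dots, r_N (token representations).
\[
\bar{r} = \frac{1}{N} \sum_{i=1}^{N} r_i ,
\qquad
z_i = \frac{r_i - \bar{r}}{\lVert r_i - \bar{r} \rVert_2}
\]

% Covariance matrix of the centered, unit-norm rows (cal_cov).
\[
\Sigma = \frac{1}{N} \sum_{i=1}^{N} z_i z_i^{\top}
\]

% Effective rank (cal_erank): \sigma_j are the singular values of
% \Sigma / \operatorname{tr}(\Sigma).
\[
\operatorname{erank}(\Sigma) = \exp\Bigl( - \sum_{j=1}^{d} \sigma_j \log \sigma_j \Bigr)
\]

% Difference between the untrained model M_0 and the trained model M_1.
\[
\text{Diff-eRank} = \operatorname{erank}(\Sigma_{M_0}) - \operatorname{erank}(\Sigma_{M_1})
\]
```

Because every $z_i$ has unit norm, $\operatorname{tr}(\Sigma) = 1$ whenever no centered row is zero, so the $\sigma_j$ sum to one and can be read as a probability distribution over directions in representation space.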

### Diff-eRank of Single Sentence

```bash
cd utils
python diff_erank_single_sentence.py
```

### Diff-eRank of Dataset

Download the [wiki-en](https://huggingface.co/datasets/wikipedia), [dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k), [openwebtext2](https://huggingface.co/datasets/suolyer/pile_openwebtext2), and [hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets from Hugging Face, then edit the data path in the script before running it.

```bash
cd utils
python diff_erank_dataset.py
```

## Citation

If you're using Diff-eRank in your research or applications, please cite using this BibTeX:

```bibtex
@inproceedings{weidiff,
  title={Diff-eRank: A Novel Rank-Based Metric for Evaluating Large Language Models},
  author={Wei, Lai and Tan, Zhiquan and Li, Chenghai and Wang, Jindong and Huang, Weiran},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
```
