Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
This repository contains the expert evaluation interface and data evaluation scripts for the OpenScholar project.
- Host: GitHub
- URL: https://github.com/AkariAsai/OpenScholar_ExpertEval
- Owner: AkariAsai
- Created: 2024-11-16T21:28:54.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-11-19T11:11:43.000Z (about 2 months ago)
- Last Synced: 2024-11-19T12:29:07.707Z (about 2 months ago)
- Language: HTML
- Homepage: https://allenai.org/blog/openscholar
- Size: 879 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - AkariAsai/OpenScholar_ExpertEval - instruct; supports RAG evaluation and fine-grained evaluation. To install, create a conda environment and install the dependencies. To run the interface, prepare a data file containing a prompt and two completions. Evaluation results are saved in a database and can be exported to an Excel file for analysis. The project's highlight is an online evaluation interface that can run locally or on a cloud service and can compute evaluation metrics and agreement. (A01_Text Generation_Text Dialogue / Large language dialogue models and data)
README
# Human Evaluation Annotation Interface for OpenScholar
This folder contains the code for the human eval annotation interface used in the paper [OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs](https://allenai.org/blog/openscholar).
**Acknowledgement:** The code is based on [allenai/open-instruct](https://github.com/allenai/open-instruct/tree/main/human_eval), with modifications to (1) support RAG evaluations and (2) add fine-grained evaluations. Thanks to the Tulu authors!
## Installation
```bash
conda create -n human_eval python=3.10
conda activate human_eval
pip install -r requirements.txt
```

## Running the Interface locally
### Preparing data
Before running the app, you need to put your evaluation instances in the `data` folder. Each instance should have a prompt and two completions with citations from two different models. We provide an example in `data/human_eval_sample.jsonl`. Each line of this file should be in the following format:
```json
{
"prompt": "prompt text",
"completions": [
{
"model": "model 1 name",
"completion": "completion text",
"refs_list" : [{"title": "title", "id": "ref_id", "url": "url_to_ref", "text": "ref_text"}]
},
{
"model": "model 2 name",
"completion": "completion text",
"refs_list" : [{"title": "title", "id": "ref_id", "url": "url_to_ref", "text": "ref_text"}]
}
]
}
```
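
If you need to assemble such a file by hand rather than with the conversion script described below, a minimal Python sketch like the following appends one instance in this format; all prompt, completion, and reference values here are placeholders, not real model outputs:

```python
import json

# One evaluation instance: a shared prompt plus two model completions,
# each with its list of cited references. All values are placeholders.
instance = {
    "prompt": "What are the main approaches to retrieval-augmented generation?",
    "completions": [
        {
            "model": "model_a",
            "completion": "Answer text from model A [1].",
            "refs_list": [
                {"title": "Example Paper", "id": "1",
                 "url": "https://example.org/paper", "text": "Reference snippet."}
            ],
        },
        {
            "model": "model_b",
            "completion": "Answer text from model B [1].",
            "refs_list": [
                {"title": "Example Paper", "id": "1",
                 "url": "https://example.org/paper", "text": "Reference snippet."}
            ],
        },
    ],
}

# Append the instance as a single JSON line to the data file.
with open("data/human_eval_sample.jsonl", "a") as f:
    f.write(json.dumps(instance) + "\n")
```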
We provide a script to format your answer files in the expected way. Note that to get the corresponding paper URL information, you have to set your Semantic Scholar API key (`SS_API_KEY`) from the [Semantic Scholar API](https://www.semanticscholar.org/product/api).

```bash
python convert_output_files.py \
--file_a FILE_A \
--file_b FILE_B \
--model_a MODEL_A_NAME \
--model_b MODEL_B_NAME \
--output_fn OUTPUT_FILE_NAME \
--prefix TASK_PREFIX
```
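
As a quick sanity check on the converted file, a small sketch like the following can confirm that every line parses and contains a prompt plus exactly two completions; the file name below is a placeholder for whatever you passed as `--output_fn`:

```python
import json

# Check a converted comparison file line by line. Replace the path with
# the value you passed as --output_fn; this name is only an example.
with open("data/comparison_output.jsonl") as f:
    for i, line in enumerate(f, start=1):
        instance = json.loads(line)
        assert "prompt" in instance, f"line {i}: missing prompt"
        assert len(instance.get("completions", [])) == 2, f"line {i}: expected two completions"
print("All instances look well-formed.")
```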
Now you can run the app with:

```bash
python app.py --comparison_data_path OUTPUT_FILE_NAME
```

You can open the app in your browser at http://localhost:5001. While annotating, you can track your progress at http://localhost:5001/summary.
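
If you want to script a quick health check before handing the link to annotators, a small sketch like this one, which only assumes the default port 5001 used above, should suffice:

```python
import urllib.request

# Ping the annotation interface and its progress page on the default port.
for path in ("/", "/summary"):
    with urllib.request.urlopen(f"http://localhost:5001{path}") as resp:
        print(f"{path}: HTTP {resp.status}")
```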
Here is a screenshot of the annotation interface:
## Share your annotation interface
To share the annotation interface with others, you can host the app on a virtual machine (VM) using a cloud service such as Google Cloud. If you use Google Cloud as I did:

1. [Create a VM (Compute Engine Instance)](https://cloud.google.com/compute/docs/instances/create-start-instance)
2. [Configure firewall rules](https://cloud.google.com/filestore/docs/configuring-firewall) to open port `5001` (e.g., [a related discussion](https://stackoverflow.com/questions/21065922/how-to-open-a-specific-port-such-as-9090-in-google-compute-engine))
3. Access the app via `http://YOUR_VM_EXTERNAL_IP:5001`.

## Post-processing and Analysis
The annotation results are saved in a database file `data/evaluation.db` by default. You can use the following command to export the results to an Excel file:
```bash
python export_db.py
```

Then, you can use the following command to compute the evaluation metrics and agreement:
```bash
python compute_metrics.py
```
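
If you would rather inspect the raw annotations directly, a small sketch using Python's built-in `sqlite3` module works against the default `data/evaluation.db` path mentioned above; it does not assume any particular table schema:

```python
import sqlite3

# List every table in the annotation database and how many rows it holds.
conn = sqlite3.connect("data/evaluation.db")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
for (table,) in tables:
    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    print(f"{table}: {count} rows")
conn.close()
```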
## Citation

If you use this code, please cite our paper as well as the original Tulu paper that this interface code is based on:
```bibtex
@article{openscholar,
title={{OpenScholar}: Synthesizing Scientific Literature with Retrieval-Augmented Language Models},
author={Asai, Akari and He*, Jacqueline and Shao*, Rulin and Shi, Weijia and Singh, Amanpreet and Chang, Joseph Chee and Lo, Kyle and Soldaini, Luca and Feldman, Sergey and D'Arcy, Mike and Wadden, David and Latzke, Matt and Tian, Minyang and Ji, Pan and Liu, Shengyan and Tong, Hao and Wu, Bohao and Xiong, Yanyu and Zettlemoyer, Luke and Weld, Dan and Neubig, Graham and Downey, Doug and Yih, Wen-tau and Koh, Pang Wei and Hajishirzi, Hannaneh},
journal={arXiv},
year={2024},
}

@misc{wang2023far,
title={How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources},
author={Yizhong Wang and Hamish Ivison and Pradeep Dasigi and Jack Hessel and Tushar Khot and Khyathi Raghavi Chandu and David Wadden and Kelsey MacMillan and Noah A. Smith and Iz Beltagy and Hannaneh Hajishirzi},
year={2023},
eprint={2306.04751},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```