https://github.com/evalplus/repoqa

RepoQA: Evaluating Long-Context Code Understanding

Host: GitHub
URL: https://github.com/evalplus/repoqa
Owner: evalplus
License: apache-2.0
Created: 2024-02-18T07:04:19.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-11-01T17:59:43.000Z (over 1 year ago)
Last Synced: 2025-10-26T04:28:12.136Z (7 months ago)
Language: Python
Homepage: https://evalplus.github.io/repoqa.html
Size: 5.38 MB
Stars: 119
Watchers: 5
Forks: 7
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE


# RepoQA: Evaluating Long-Context Code Understanding

[![](https://img.shields.io/badge/arXiv-2406.06025-b31b1b.svg?style=for-the-badge)](https://arxiv.org/abs/2406.06025)

[![](https://img.shields.io/pypi/v/repoqa?style=for-the-badge&labelColor=black)](https://pypi.org/project/repoqa/)

🏠 Homepage: https://evalplus.github.io/repoqa.html

## 🚀 Installation

```bash
# without vLLM (can run openai, anthropic, and huggingface backends)
pip install --upgrade repoqa

# To enable vLLM
pip install --upgrade "repoqa[vllm]"
```

⏬ Install the nightly version:

```bash
pip install --upgrade "git+https://github.com/evalplus/repoqa.git"                 # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main" # with vLLM
```



⏬ Using RepoQA as a local repo?

```bash
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```



## 🏁 Search Needle Function (SNF)

Search Needle Function is the first and most basic RepoQA task. It tests an LLM's ability to perform **long-context code understanding and retrieval**.

The corresponding real-world scenario is precise code search from a natural-language function description.

🔎 More dataset details:

> [!Note]
>
> SNF includes 500 tests (5 programming languages x 10 repos x 10 needle functions) where an LLM is given:
>
> 1. A large code context sorted by file dependency
> 2. A natural-language (NL) description of the needle function that does not reveal keywords such as function names
> 3. An instruction to retrieve the described function
>
> A test passes if the retrieved function is syntactically closer to the ground truth than to any other function
> (all functions are parsed with `treesitter`) and the similarity exceeds a user-defined threshold (0.8 by default).
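
The pass criterion above can be sketched in a few lines. This is a minimal illustration and not RepoQA's evaluator: it assumes the candidate functions are already available as strings (RepoQA extracts them with `treesitter`), and it substitutes the standard library's `difflib.SequenceMatcher` for whatever similarity metric the evaluator actually uses. The names `similarity` and `passes_snf` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Mapping


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; an assumption, not RepoQA's metric."""
    return SequenceMatcher(None, a, b).ratio()


def passes_snf(model_output: str, needle_name: str,
               functions: Mapping[str, str], threshold: float = 0.8) -> bool:
    """Pass only if the needle is the closest function AND clears the threshold."""
    scores = {name: similarity(model_output, body) for name, body in functions.items()}
    best_name = max(scores, key=scores.get)
    return best_name == needle_name and scores[best_name] > threshold


# Toy "repository" with two functions; "add" is the needle.
functions = {
    "add": "def add(a, b):\n    return a + b\n",
    "read_lines": "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()\n",
}
exact = passes_snf(functions["add"], "add", functions)         # needle reproduced verbatim
wrong = passes_snf(functions["read_lines"], "add", functions)  # a different function retrieved
unrelated = passes_snf("print('hello')", "add", functions)     # nothing close to any function
```

Both conditions matter: the closest-match check rejects a confident retrieval of the wrong function, while the threshold rejects output that only loosely resembles the needle.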



You can run the SNF evaluation using various backends:

### OpenAI Compatible Servers

```bash
repoqa.search_needle_function --model "gpt-4-turbo" --backend openai

# 💡 If you use an OpenAI-API-compatible server, such as a vLLM server:
# repoqa.search_needle_function --base-url "http://localhost:[PORT]/v1" \
#                               --model "Qwen/CodeQwen1.5-7B-Chat" --backend openai
```

### Anthropic Compatible Servers

```bash
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
```

### vLLM

```bash
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
```

🔎 Context extension for small-context models:

> There are two ways to extend a model's context window at inference time:
>
> 1. **Direct Extension**: Edit `max_position_embeddings` in the model's `config.json` (e.g., `hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/[hash]/config.json`) to a larger value such as `22528`.
> 2. **[Dynamic RoPE Scaling](https://blog.eleuther.ai/yarn/#dynamic-scaling)**:
>    To extend `Meta-Llama-3-8B-Instruct` from 8k to 32k (4x), edit the `config.json`:
>
>    `"rope_scaling": {"type": "dynamic", "factor": 4.0}`
>
> Note: This works for vLLM `<0.4.3` and HuggingFace transformers. RepoQA automatically configures dynamic RoPE for vLLM `>= 0.4.3`.
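
The dynamic RoPE edit described above can also be scripted instead of done by hand. The following is a minimal sketch, assuming the model's `config.json` is a plain JSON object that you are free to rewrite in place; the helper name `enable_dynamic_rope` is hypothetical and not part of RepoQA.

```python
import json
import tempfile
from pathlib import Path


def enable_dynamic_rope(config_path, factor=4.0):
    """Set `rope_scaling` in a model's config.json in place and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    # Same entry as in the note above; all other keys are left untouched.
    config["rope_scaling"] = {"type": "dynamic", "factor": factor}
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demonstration on a throwaway file rather than a real model snapshot.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "config.json"
    demo.write_text(json.dumps({"model_type": "llama"}))
    updated = enable_dynamic_rope(demo, factor=4.0)
    reloaded = json.loads(demo.read_text())
```

For a real model, pass the path of the `config.json` inside the snapshot directory of your local model cache. Keeping a backup of the original file is a sensible precaution, since the change affects every later load of that snapshot.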



> [!Note]
>
> Reference evaluation time:
>
> - Llama3-8B-Instruct: 45 minutes on 2xA6000 (PCIe NVLink)
> - Llama3-70B-Instruct: 100 minutes on 4xA100 (PCIe NVLink)

### HuggingFace transformers

```bash
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code
```

> [!Tip]
>
> Installing [flash-attn](https://github.com/Dao-AILab/flash-attention) and
> additionally setting `--attn-implementation "flash_attention_2"` can substantially
> lower the memory requirement.

🔨 Having trouble installing `flash-attn`?

> If you have trouble with `pip install flash-attn --no-build-isolation`,
> you can try installing one of the [pre-built wheels](https://github.com/Dao-AILab/flash-attention/releases) directly:
>
> ```shell
> export FLASH_ATTN_VER=2.5.8 # check latest version at https://github.com/Dao-AILab/flash-attention/releases
> export CUDA_VER="cu122"     # check available ones at https://github.com/Dao-AILab/flash-attention/releases
> export TORCH_VER=$(python -c "import torch; print('.'.join(torch.__version__.split('.')[:2]))")
> export PY_VER=$(python -c "import platform; print(''.join(platform.python_version().split('.')[:2]))")
> export OS_ARCH=$(python -c "import platform; print(f'{platform.system().lower()}_{platform.machine()}')")
>
> export WHEEL=flash_attn-${FLASH_ATTN_VER}+${CUDA_VER}torch${TORCH_VER}cxx11abiFALSE-cp${PY_VER}-cp${PY_VER}-${OS_ARCH}.whl
> wget https://github.com/Dao-AILab/flash-attention/releases/download/v${FLASH_ATTN_VER}/${WHEEL}
> pip install ${WHEEL}
> ```



### Google Generative AI API (Gemini)

```bash
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
```

### CLI Usage

- **Input**:

  - `--model`: Hugging Face model ID (e.g., `ise-uiuc/Magicoder-S-DS-6.7B`) or API model name

  - `--backend`: `vllm` (default), `openai`, `anthropic`, `hf`, or `google`

  - `--base-url`: OpenAI API base URL

  - `--code-context-size` (default: 16384): number of tokens of repository context, counted with the DeepSeekCoder tokenizer

  - `--caching` (default: True): accelerate subsequent runs by caching preprocessing; `--nocaching` to disable

  - `--max-new-tokens` (default: 1024): Maximum #new tokens to generate

  - `--system-message` (default: None): system message (note it's not supported by some models)

  - `--tensor-parallel-size`: number of GPUs used for tensor parallelism (vLLM only)

  - `--languages` (default: None): List of languages to evaluate (None means all)

  - `--result-dir` (default: "results"): Directory to save the model outputs and evaluation results

  - `--clean-ctx-comments` (default: "none"): Clean context comments with padding ("positional_padding") or no padding ("no_padding")

  - `--eval-ignore-comments` (default: False): During evaluation, ignore groundtruth and model comments

  - `--trust-remote-code` (default: False): allow remote code (for HuggingFace transformers and vLLM)

  - `--attn-implementation` (default: None): use "flash_attention_2" if the HuggingFace backend runs out of memory

- **Output**:

  - `results/ntoken_{code-context-size}/{model}.jsonl`: Model-generated outputs

  - `results/ntoken_{code-context-size}/{model}-SCORES.json`: Evaluation results
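
When running the same model at several context sizes, it can help to assemble the command line programmatically. The sketch below uses only flags documented above and assumes each is passed in the `--flag value` form shown in the earlier examples; `build_snf_command` is a hypothetical helper, not part of RepoQA.

```python
import shlex


def build_snf_command(model, backend="vllm", code_context_size=16384,
                      base_url=None, result_dir="results"):
    """Assemble an argv list for `repoqa.search_needle_function` from documented flags."""
    cmd = [
        "repoqa.search_needle_function",
        "--model", model,
        "--backend", backend,
        "--code-context-size", str(code_context_size),
        "--result-dir", result_dir,
    ]
    if base_url is not None:
        # Only meaningful for OpenAI-compatible servers.
        cmd += ["--base-url", base_url]
    return cmd


cmd = build_snf_command("Qwen/CodeQwen1.5-7B-Chat", backend="hf", code_context_size=8192)
server_cmd = build_snf_command("Qwen/CodeQwen1.5-7B-Chat", backend="openai",
                               base_url="http://localhost:8000/v1")  # port is an example value
print(shlex.join(cmd))
# To actually launch the evaluation: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps model IDs and URLs correctly quoted when the list is handed to `subprocess.run`.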

### Compute Scores

By default, the `repoqa.search_needle_function` command will evaluate model outputs and compute scores after text generation.

However, you can also separately compute scores using the following command:

```shell
repoqa.compute_score --model-output-path={model-output}.jsonl
```

> [!Tip]
>
> - **Input**: Path to the model-generated outputs.
> - **Output**: The evaluation scores are stored in `{model-output}-SCORES.json`.
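
If you post-process results in a script, the score file's location can be derived from the output file's path. This is a small sketch that assumes only the naming rule stated in the tip above (`{model-output}.jsonl` maps to `{model-output}-SCORES.json`); the helper `score_path_for` and the example file name are hypothetical.

```python
from pathlib import Path


def score_path_for(model_output_path):
    """Map `{model-output}.jsonl` to `{model-output}-SCORES.json` in the same directory."""
    path = Path(model_output_path)
    if path.suffix != ".jsonl":
        raise ValueError(f"expected a .jsonl file, got {path.name}")
    return path.with_name(path.stem + "-SCORES.json")


scores_file = score_path_for("results/ntoken_16384/my-model.jsonl")
```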

## 📚 Read More

- [RepoQA Homepage](https://evalplus.github.io/repoqa.html)

- [RepoQA Dataset Curation](docs/curate_dataset.md)

- [RepoQA Development Notes](docs/dev_note.md)

## Citation

```bibtex
@article{repoqa,
  title = {RepoQA: Evaluating Long Context Code Understanding},
  author = {Liu, Jiawei and Tian, Jia Le and Daita, Vijay and Wei, Yuxiang and Ding, Yifeng and Wang, Yuhan Katherine and Yang, Jun and Zhang, Lingming},
  year = {2024},
  journal = {arXiv preprint arXiv:2406.06025},
}
```
