# RepoQA: Evaluating Long-Context Code Understanding
[📄 Paper](https://arxiv.org/abs/2406.06025)
[🐍 PyPI](https://pypi.org/project/repoqa/)
🏠 Homepage: https://evalplus.github.io/repoqa.html
## 🚀 Installation
```bash
# without vLLM (can run openai, anthropic, and huggingface backends)
pip install --upgrade repoqa
# To enable vLLM
pip install --upgrade "repoqa[vllm]"
```
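To sanity-check the installation, you can query the installed version through standard packaging metadata (a trivial check; `repoqa` here is the PyPI distribution name from above):

```python
# Quick post-install check using only the standard library:
# prints the installed RepoQA version, or raises if it is not installed.
from importlib.metadata import version

print(version("repoqa"))
```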
**⏬ Install the nightly version:**
```bash
pip install --upgrade "git+https://github.com/evalplus/repoqa.git" # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main" # with vLLM
```
**⏬ Using RepoQA as a local repo?**
```bash
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
```
## 🏁 Search Needle Function (SNF)
Search Needle Function (SNF) is the first and foundational RepoQA task; it exercises LLMs' ability of **long-context code understanding and retrieval**.
Its real-world counterpart is precise code search driven by a natural-language function description.
**🔎 More dataset details:**
> [!NOTE]
>
> SNF includes 500 tests (5 programming languages x 10 repos x 10 needle functions) where an LLM is given:
>
> 1. A large code context sorted by file dependency
> 2. A natural-language description of the needle function that avoids revealing keywords such as the function name
> 3. An instruction to retrieve the described function
>
> The evaluator passes a test if the retrieved function is syntactically closest to the ground truth when compared against
> the other candidate functions (parsed with `tree-sitter`) and the similarity exceeds a user-defined threshold (0.8 by default).
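To make the pass criterion concrete, here is a minimal sketch of the matching logic. It is illustrative only: `difflib.SequenceMatcher` stands in for the evaluator's actual similarity metric, and candidate functions are assumed to be pre-extracted (RepoQA parses them with `tree-sitter`):

```python
# Illustrative sketch of the SNF pass criterion (not the shipped evaluator).
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Stand-in string similarity in [0, 1]; the real metric may differ.
    return SequenceMatcher(None, a, b).ratio()

def passes(model_output: str, needle: str, candidates: list[str],
           threshold: float = 0.8) -> bool:
    # `candidates` holds every function parsed from the repo, needle included.
    # Pass iff the model's output is closest to the needle among all
    # candidates and the similarity clears the threshold.
    best = max(candidates, key=lambda fn: similarity(model_output, fn))
    return best == needle and similarity(model_output, needle) > threshold
```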
You can run the SNF evaluation using various backends:
### OpenAI Compatible Servers
```bash
repoqa.search_needle_function --model "gpt-4-turbo" --backend openai
# 💡 For an OpenAI-API-compatible server (e.g., a vLLM server):
# repoqa.search_needle_function --base-url "http://localhost:[PORT]/v1" \
#                               --model "Qwen/CodeQwen1.5-7B-Chat" --backend openai
```
### Anthropic Compatible Servers
```bash
repoqa.search_needle_function --model "claude-3-haiku-20240307" --backend anthropic
```
### vLLM
```bash
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend vllm
```
**🔎 Context extension for small-context models:**
> There are two ways to unlock a model's context at inference time:
>
> 1. **Direct Extension**: Edit `max_position_embeddings` in the model's `config.json` (e.g., `hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/[hash]/config.json`) to something like `22528`.
> 2. **[Dynamic RoPE Scaling](https://blog.eleuther.ai/yarn/#dynamic-scaling)**:
> To extend `Meta-Llama-3-8B-Instruct` from 8k to 32k (4x), edit the `config.json`:
>
> `"rope_scaling": {"type": "dynamic", "factor": 4.0}`
>
> Note: Manual editing works for vLLM `<0.4.3` and HuggingFace transformers; for vLLM `>=0.4.3`, RepoQA configures dynamic RoPE automatically.
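For example, the dynamic-scaling entry from option 2 can be written into `config.json` with a few lines of Python (a sketch; the snapshot path is illustrative and `[hash]` must be replaced with your local HF cache hash):

```python
import json
from pathlib import Path

# Illustrative HuggingFace cache path; substitute your local snapshot hash.
cfg_path = Path(
    "hub/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/[hash]/config.json"
)

cfg = json.loads(cfg_path.read_text())
# 4x dynamic RoPE scaling, extending Llama-3's 8k context to ~32k.
cfg["rope_scaling"] = {"type": "dynamic", "factor": 4.0}
cfg_path.write_text(json.dumps(cfg, indent=2))
```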
> [!NOTE]
>
> Reference evaluation time:
>
> - Llama3-8B-Instruct: 45 minutes on 2xA6000 (PCIe NVLink)
> - Llama3-70B-Instruct: 100 minutes on 4xA100 (PCIe NVLink)
### HuggingFace transformers
```bash
repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" --backend hf --trust-remote-code
```
> [!TIP]
>
> Installing [flash-attn](https://github.com/Dao-AILab/flash-attention) and
> setting `--attn-implementation "flash_attention_2"` can substantially
> lower the memory requirement.
**🔨 Having trouble installing `flash-attn`?**
> If you have trouble with `pip install flash-attn --no-build-isolation`,
> you can try to directly use [pre-built wheels](https://github.com/Dao-AILab/flash-attention/releases):
>
> ```shell
> export FLASH_ATTN_VER=2.5.8 # check latest version at https://github.com/Dao-AILab/flash-attention/releases
> export CUDA_VER="cu122" # check available ones at https://github.com/Dao-AILab/flash-attention/releases
> export TORCH_VER=$(python -c "import torch; print('.'.join(torch.__version__.split('.')[:2]))")
> export PY_VER=$(python -c "import platform; print(''.join(platform.python_version().split('.')[:2]))")
> export OS_ARCH=$(python -c "import platform; print(f'{platform.system().lower()}_{platform.machine()}')")
>
> export WHEEL=flash_attn-${FLASH_ATTN_VER}+${CUDA_VER}torch${TORCH_VER}cxx11abiFALSE-cp${PY_VER}-cp${PY_VER}-${OS_ARCH}.whl
> wget https://github.com/Dao-AILab/flash-attention/releases/download/v${FLASH_ATTN_VER}/${WHEEL}
> pip install ${WHEEL}
> ```
### Google Generative AI API (Gemini)
```bash
repoqa.search_needle_function --model "gemini-1.5-pro-latest" --backend google
```
### CLI Usage
- **Input**:
  - `--model`: Hugging Face model ID, such as `ise-uiuc/Magicoder-S-DS-6.7B`
  - `--backend`: `vllm` (default), `openai`, `anthropic`, `hf`, or `google`
  - `--base-url`: OpenAI API base URL
  - `--code-context-size` (default: 16384): number of tokens (counted with the DeepSeekCoder tokenizer) in the repository context
  - `--caching` (default: True): cache preprocessing to accelerate subsequent runs; pass `--nocaching` to disable
  - `--max-new-tokens` (default: 1024): maximum number of new tokens to generate
  - `--system-message` (default: None): system message (note that some models do not support system messages)
  - `--tensor-parallel-size`: number of GPUs for tensor parallelism (vLLM only)
  - `--languages` (default: None): list of languages to evaluate (None means all)
  - `--result-dir` (default: "results"): directory to save model outputs and evaluation results
  - `--clean-ctx-comments` (default: "none"): clean comments from the code context, with padding (`"positional_padding"`) or without padding (`"no_padding"`)
  - `--eval-ignore-comments` (default: False): during evaluation, ignore comments in both the ground truth and the model output
  - `--trust-remote-code` (default: False): allow remote code (for HuggingFace transformers and vLLM)
  - `--attn-implementation` (default: None): set to `"flash_attention_2"` if the HF backend hits OOM
- **Output**:
  - `results/ntoken_{code-context-size}/{model}.jsonl`: model-generated outputs
  - `results/ntoken_{code-context-size}/{model}-SCORES.json`: evaluation results
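Putting the flags and the output scheme together, here is a minimal sketch of driving the CLI from Python and locating its outputs (the model choice is illustrative; the flags are the ones documented above):

```python
import subprocess
from pathlib import Path

model = "Qwen/CodeQwen1.5-7B-Chat"  # illustrative choice
ctx_size = 16384                    # the default --code-context-size

# Invoke the documented CLI entry point as a subprocess.
subprocess.run(
    [
        "repoqa.search_needle_function",
        "--model", model,
        "--backend", "vllm",
        "--code-context-size", str(ctx_size),
        "--result-dir", "results",
    ],
    check=True,
)

# Per the output scheme above, generations and scores land here:
out_dir = Path("results") / f"ntoken_{ctx_size}"
print(sorted(out_dir.glob("*.jsonl")))        # model-generated outputs
print(sorted(out_dir.glob("*-SCORES.json")))  # evaluation results
```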
### Compute Scores
By default, the `repoqa.search_needle_function` command will evaluate model outputs and compute scores after text generation.
However, you can also separately compute scores using the following command:
```shell
repoqa.compute_score --model-output-path={model-output}.jsonl
```
> [!TIP]
>
> - **Input**: path to the model-generated outputs (`.jsonl`).
> - **Output**: evaluation scores, stored in `{model-output}-SCORES.json`.
## 📚 Read More
- [RepoQA Homepage](https://evalplus.github.io/repoqa.html)
- [RepoQA Dataset Curation](docs/curate_dataset.md)
- [RepoQA Development Notes](docs/dev_note.md)
## Citation
```bibtex
@article{repoqa,
  title   = {RepoQA: Evaluating Long Context Code Understanding},
  author  = {Liu, Jiawei and Tian, Jia Le and Daita, Vijay and Wei, Yuxiang and Ding, Yifeng and Wang, Yuhan Katherine and Yang, Jun and Zhang, Lingming},
  year    = {2024},
  journal = {arXiv preprint arXiv:2406.06025},
}
```