Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
LAiW: A Chinese Legal Large Language Models Benchmark
- Host: GitHub
- URL: https://github.com/Dai-shen/LAiW
- Owner: Dai-shen
- License: mit
- Created: 2023-09-05T07:22:37.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-03T01:56:29.000Z (5 months ago)
- Last Synced: 2024-08-14T15:51:51.192Z (4 months ago)
- Language: Python
- Homepage:
- Size: 7.21 MB
- Stars: 61
- Watchers: 2
- Forks: 4
- Open Issues: 4
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-llm-eval - LAiW (Datasets-or-Benchmark / Vertical Domains)
- StarryDivineSky - Dai-shen/LAiW - Evaluates mainstream large models such as LLaMA, Baichuan2, HanFei, ChatLaw, and LaWGPT, and publishes the evaluation results and scoring method. Users can view the evaluation results of different models on the LAiW leaderboard and choose a legal LLM that suits their needs. (A01_Text Generation_Dialogue / Large language dialogue models and data)
- Awesome-Domain-LLM - LAiW
README
# ⚖️LAiW: A Chinese Legal Large Language Models Benchmark
| [English](https://github.com/Dai-shen/LAiW/blob/main/README.md) | [Chinese](https://github.com/Dai-shen/LAiW/blob/main/README_zh.md)
**LAiW: A Comprehensive Benchmark for Chinese Legal Large Language Models (LLMs)**
🔥 [LAiW Leaderboard](https://huggingface.co/spaces/daishen/SCULAiW)
🔥 [Technical Report and Official Paper](https://arxiv.org/abs/2310.05620)
## News
🔄 **Recent Updates**
- [2024-04-19] The official [paper](https://arxiv.org/abs/2310.05620) has been updated.
📅 **Earlier News**
- [2024/1/22] Added evaluation results for the general LLM [Baichuan-7B](https://huggingface.co/baichuan-inc/Baichuan-7B).
- [2024/1/14] Provided more detailed information on the evaluation dataset [here](https://github.com/Dai-shen/LAiW/blob/main/data/README.md), along with the calculation method for the model evaluation metric [SCULAiW](https://huggingface.co/spaces/daishen/SCULAiW).
- [2024/1/12] Further confirmed and improved relevant evaluation results, optimized the layout of the evaluation leaderboard [SCULAiW](https://huggingface.co/spaces/daishen/SCULAiW), and supplemented more detailed information on evaluated models.
- [2024/1/10] Added evaluations for the commercial LLM GPT-4 and the general LLMs Llama-7B, Llama-13B, and [Chinese-LLaMA-13B](https://github.com/ymcui/Chinese-LLaMA-Alpaca).
- [2024/1/2] Announced the scoring mechanism for the legal capabilities of LLMs [here](#scoring-mechanism) and published the evaluation scores for LLMs [here](#scores-for-llms).
- [2024/1/2] Released test datasets for 14 foundational tasks [here](https://huggingface.co/daishen).
- [2024/1/1] Updated the legal capability evaluation results for [SCULAiW](https://huggingface.co/spaces/daishen/SCULAiW).
- [2023/12/31] Completed legal capability evaluations for mainstream LLMs. During the evaluation process, in addition to the models mentioned earlier, the general LLM [ChatGLM](https://huggingface.co/THUDM/chatglm-6b) and the legal LLMs [Lawyer-LLaMA](https://github.com/AndrewZhe/lawyer-llama/tree/main?tab=readme-ov-file), [Fuzi-Mingcha](https://huggingface.co/SDUIRLab/fuzi-mingcha-v1_0), [Wisdom-Interrogatory](https://github.com/zhihaiLLM/wisdomInterrogatory), and [LexiLaw](https://github.com/CSHaitao/LexiLaw) were added.
- [2023/10/12] Published the initial version of the [LAiW Technical Report](https://arxiv.org/abs/2310.05620).
- [2023/10/08] Released the first phase evaluation system for LAiW capabilities [here](https://github.com/Dai-shen/LAiW).
- [2023/10/08] Completed the first phase evaluation of the Basic Information Retrieval capabilities of LLMs, including commercial LLMs: ChatGPT; general LLMs: [Llama2](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [Ziya-LLaMA](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1), [Chinese-LLaMA](https://github.com/ymcui/Chinese-LLaMA-Alpaca), [Baichuan2](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat); and legal LLMs: [HanFei](https://github.com/siat-nlp/HanFei), [ChatLaw](https://huggingface.co/JessyTsu1/ChatLaw-13B), [LaWGPT](https://github.com/pengxiao-song/LaWGPT).
- [2023/10/08] Released evaluation scores and calculation methods for legal capabilities and foundational tasks.
## Contents
- [⚖️LAiW: A Chinese Legal Large Language Models Benchmark](#️laiw-a-chinese-legal-large-language-models-benchmark)
- [News](#news)
- [Contents](#contents)
- [Evaluation structure diagram](#evaluation-structure-diagram)
- [Scores for LLMs](#scores-for-llms)
- [Tasks](#tasks)
- [Datasets](#datasets)
- [Scoring Mechanism](#scoring-mechanism)
- [Run](#run)
- [1.Preparation](#1preparation)
- [2.Output of LLM](#2output-of-llm)
- [Contributors](#contributors)
- [Disclaimer](#disclaimer)
- [Acknowledgements](#acknowledgements)
- [Cite](#cite)
### Evaluation structure diagram
### Scores for LLMs
According to the calculation method of the [scoring mechanism](#scoring-mechanism), we have evaluated 7 mainstream legal LLMs and 11 general LLMs at this stage. The model scores are as follows:
| Model | Size | Model Domain | Total Score | BIR | LFI | CLA | Base Model |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| GPT-4 | - | General | 69.63 | 80.92 | 69.27 | 58.69 | - |
| ChatGPT | - | General | 64.09 | 75.99 | 58.32 | 57.96 | - |
| [Baichuan2-Chat](https://huggingface.co/baichuan-inc/Baichuan2-13B-Chat) | 13B | General | 48.04 | 53.67 | 32.03 | 58.40 | - |
| [ChatGLM](https://huggingface.co/THUDM/chatglm-6b) | 6B | General | 47.01 | 51.51 | 37.08 | 52.44 | - |
| [Ziya-LLaMA](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) | 13B | General | 45.79 | 61.47 | 29.44 | 46.45 | Llama-13B |
| [Fuzi-Mingcha](https://huggingface.co/SDUIRLab/fuzi-mingcha-v1_0) | 6B | Legal | 40.62 | 39.68 | 27.46 | 54.71 | [ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b) |
| [HanFei](https://github.com/siat-nlp/HanFei) | 7B | Legal | 35.69 | 37.42 | 16.33 | 53.31 | - |
| [LexiLaw](https://github.com/CSHaitao/LexiLaw) | 6B | Legal | 31.31 | 41.32 | 8.88 | 43.73 | [ChatGLM-6B](https://huggingface.co/THUDM/chatglm-6b) |
| [ChatLaw](https://huggingface.co/JessyTsu1/ChatLaw-13B) | 13B | Legal | 25.77 | 58.02 | 12.54 | 6.74 | [Ziya-LLaMA-13B](https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-13B-v1) |
| [Llama2-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) | 7B | General | 27.76 | 31.86 | 12.77 | 38.64 | - |
| [Lawyer-LLaMA](https://github.com/AndrewZhe/lawyer-llama/tree/main?tab=readme-ov-file) | 13B | Legal | 29.25 | 30.85 | 6.39 | 50.50 | [Chinese-LLaMA-13B](https://github.com/ymcui/Chinese-LLaMA-Alpaca) |
| [Chinese-LLaMA](https://github.com/ymcui/Chinese-LLaMA-Alpaca) | 13B | General | 24.99 | 21.02 | 19.16 | 34.80 | Llama-13B |
| [Chinese-LLaMA](https://github.com/ymcui/Chinese-LLaMA-Alpaca) | 7B | General | 24.91 | 22.32 | 18.25 | 34.16 | Llama-7B |
| [Baichuan](https://github.com/baichuan-inc/Baichuan-7B) | 7B | General | 22.51 | 21.20 | 15.46 | 30.86 | - |
| [LaWGPT](https://github.com/pengxiao-song/LaWGPT) | 7B | Legal | 22.69 | 15.47 | 14.27 | 38.32 | [Chinese-LLaMA-7B](https://github.com/ymcui/Chinese-LLaMA-Alpaca) |
| Llama | 13B | General | 21.00 | 18.51 | 15.08 | 29.40 | - |
| [Wisdom-Interrogatory](https://github.com/zhihaiLLM/wisdomInterrogatory) | 7B | Legal | 18.83 | 12.66 | 10.45 | 33.37 | [Baichuan-7B](https://huggingface.co/baichuan-inc/Baichuan-7B) |
| Llama | 7B | General | 16.35 | 11.12 | 15.40 | 22.54 | - |
The overall scores and the scores for each level of legal capability of the LLMs are ranked as follows:
![Overall Histogram](https://github.com/Dai-shen/LAiW/blob/main/resources/Overall-histogram.png)
![BIR Histogram](https://github.com/Dai-shen/LAiW/blob/main/resources/BIR-histogram.png)
![LFI Histogram](https://github.com/Dai-shen/LAiW/blob/main/resources/LFI-histogram.png)
![CLA Histogram](https://github.com/Dai-shen/LAiW/blob/main/resources/CLA-histogram.png)
### Tasks
With the joint efforts of **legal experts** and **artificial intelligence experts**, we categorize the Legal Capabilities of LLMs into three levels, ranging from easy to difficult: Basic Information Retrieval (**BIR**), Legal Foundation Inference (**LFI**), and Complex Legal Application (**CLA**), totaling 14 foundational tasks. The diagram above shows the structure of these three capability levels.
- Basic Information Retrieval. This capability covers fundamental legal tasks that can be transferred directly from NLP, as well as some simple yet crucial pre-tasks in the legal domain. It includes 5 foundational tasks: Legal Article Recommendation (AR), Element Recognition (ER), Named Entity Recognition (NER), Judicial Summarization (JS), and Case Recognition (CR).
- Legal Foundation Inference. This capability tests basic legal applications of LLMs. It includes 6 foundational tasks: Controversial Focus Mining (CFM), Similar Case Matching (SCM), Charge Prediction (CP), Prison Term Prediction (PTP), Civil Trial Prediction (CTP), and Legal Question Answering (LQA).
- Complex Legal Application. This capability covers the challenging tasks that LLMs may face, such as complex reasoning in the legal field and alignment with real legal logic. It includes 3 foundational tasks: Judicial Reasoning Generation (JRG), Case Understanding (CU), and Legal Consultation (LC).
Below is a brief description of each evaluation task.
| Capability | Task | Description |
| :---: | :---: | :--- |
| BIR | Legal Article Recommendation | It aims to provide relevant articles based on the description of the case. |
|  | Element Recognition | It analyzes and assesses each sentence to identify the pivotal elements of the case. |
|  | Named Entity Recognition | It aims to extract nouns and phrases with legal characteristics from various legal documents. |
|  | Judicial Summarization | It aims to condense, summarize, and synthesize the content of legal documents. |
|  | Case Recognition | It aims to determine, based on the relevant description of the case, whether it pertains to a criminal or civil matter. |
| LFI | Controversial Focus Mining | It aims to extract the logical and interactive arguments between the defense and prosecution in legal documents, which are analyzed as a key component for the tasks related to the case result. |
|  | Similar Case Matching | It aims to find the cases that bear the closest resemblance, which is a core aspect of various legal systems worldwide, as they require consistent judgments for similar cases to ensure the fairness of the law. |
|  | Criminal Judgment Prediction | It involves predicting the guilt or innocence of the defendant, along with the potential sentencing, based on the results of basic legal NLP, including the facts of the case, the evidence presented, and the applicable law articles. It is therefore divided into two tasks: Charge Prediction and Prison Term Prediction. |
|  | Civil Trial Prediction | It involves using factual descriptions to predict the judgment on the plaintiff's claim against the defendant, for which the Controversial Focus should be considered. |
|  | Legal Question Answering | It utilizes the model's legal knowledge to address the national judicial examination, which encompasses various specific legal types. |
| CLA | Judicial Reasoning Generation | It aims to generate relevant legal reasoning texts based on the factual description of the case. It is a complex reasoning task, because the court is required to elaborate on the reasoning behind the judgment based on the determination of the facts of the case. This task also involves aligning with the logical structure of syllogism in law. |
|  | Case Understanding | It is expected to provide reasonable and compliant answers to questions posed about the case-related descriptions in judicial documents, which is also a complex reasoning task. |
|  | Legal Consultation | It covers a wide range of legal areas and aims to provide accurate, clear, and reliable answers to the legal questions posed by different users. It therefore usually requires the combination of the aforementioned capabilities to provide professional and reliable analysis. |
### Datasets
We have reorganized and constructed the evaluation datasets for the aforementioned tasks based on existing publicly available Chinese legal datasets. These datasets are collectively referred to as the **Legal Evaluation Dataset (LED)**. We present the evaluation datasets for each foundational task. For more detailed information about the datasets, please refer to [here](https://github.com/Dai-shen/LAiW/blob/main/data/README.md).
| Level | Task | Main Dataset | Evaluation Dataset | Data Size | Category |
| :---: | :---: | :---: | :---: | :---: | :---: |
| BIR | Legal Article Recommendation | CAIL-2018 | legal_ar | 1,000 | Classification |
|  | Element Recognition | CAIL-2019 | legal_er | 1,000 | Classification |
|  | Named Entity Recognition | CAIL-2021 | legal_ner | 1,040 | Named Entity Recognition |
|  | Judicial Summarization | CAIL-2020 | legal_js | 364 | Text Generation |
|  | Case Recognition | CJRC | legal_cr | 2,000 | Classification |
| LFI | Controversial Focus Mining | LAIC-2021 | legal_cfm | 306 | Classification |
|  | Similar Case Matching | CAIL-2019 | legal_scm | 260 | Classification |
|  | Charge Prediction | Criminal-S | legal_cp | 827 | Classification |
|  | Prison Term Prediction | MLMN | legal_ptp | 349 | Classification |
|  | Civil Trial Prediction | MSJudeg | legal_ctp | 800 | Classification |
|  | Legal Question Answering | JEC-QA | legal_lqa | 855 | Classification |
| CLA | Judicial Reasoning Generation | AC-NLG | legal_jrg | 834 | Text Generation |
|  | Case Understanding | CJRC | legal_cu | 1,054 | Text Generation |
|  | Legal Consultation | CrimeKgAssitant | legal_lc | 916 | Text Generation |
### Scoring Mechanism
⭐️ Scores for each task
$$
S_{(Task)} = \begin{cases}
F1 \times 100, & \text{if } Task \in \text{Classification} \\
\frac{1}{3}(R1 + R2 + RL) \times 100, & \text{if } Task \in \text{Text Generation} \\
Acc \times 100, & \text{if } Task \in \text{NER}
\end{cases}
$$
Currently, our evaluation benchmark mainly consists of three types of tasks: classification, text generation, and named entity recognition (NER). For classification tasks, we use the F1 score. For text generation tasks, we use the average of the Rouge-1, Rouge-2, and Rouge-L scores. Specifically, for legal named entity recognition tasks, we use the extraction accuracy of legal entities as their score.
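The following is an illustrative sketch of these per-task scores, not the repository's evaluation code: it assumes macro-averaged F1 for classification, the `rouge-score` package for generation (whose default tokenizer targets English, so Chinese text would need a suitable tokenizer), and exact-match accuracy over extracted entities for NER.
```python
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

def classification_score(y_true, y_pred):
    # F1 * 100; the averaging mode ("macro") is an assumption, the text only says "F1 score".
    return f1_score(y_true, y_pred, average="macro") * 100

def generation_score(references, predictions):
    # Mean of Rouge-1, Rouge-2, and Rouge-L F-measures, averaged over examples, * 100.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    per_example = []
    for ref, pred in zip(references, predictions):
        s = scorer.score(ref, pred)
        per_example.append((s["rouge1"].fmeasure + s["rouge2"].fmeasure + s["rougeL"].fmeasure) / 3)
    return sum(per_example) / len(per_example) * 100

def ner_score(gold_entities, predicted_entities):
    # Extraction accuracy * 100: the share of examples whose entities are recovered exactly
    # (one plausible reading of "extraction accuracy").
    hits = sum(gold == pred for gold, pred in zip(gold_entities, predicted_entities))
    return hits / len(gold_entities) * 100
```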
🌟 Scores for each LLM
For each individual LLM, we first calculate the average score of the tasks at each level as its legal capability score for that level. Then, we take the average of these three legal capability scores as the final evaluation score for the LLM. Model evaluation scores can be found [here](#scores-for-llms).
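A minimal sketch of this aggregation, using the evaluation-dataset names from the table above (the per-task values are placeholders that would come from the task-level scores defined previously):
```python
# Per-task scores grouped by capability level; the numbers are placeholders.
level_task_scores = {
    "BIR": {"legal_ar": 0.0, "legal_er": 0.0, "legal_ner": 0.0, "legal_js": 0.0, "legal_cr": 0.0},
    "LFI": {"legal_cfm": 0.0, "legal_scm": 0.0, "legal_cp": 0.0,
            "legal_ptp": 0.0, "legal_ctp": 0.0, "legal_lqa": 0.0},
    "CLA": {"legal_jrg": 0.0, "legal_cu": 0.0, "legal_lc": 0.0},
}

# Level score = mean of its task scores; final score = mean of the three level scores.
level_scores = {level: sum(scores.values()) / len(scores)
                for level, scores in level_task_scores.items()}
final_score = sum(level_scores.values()) / len(level_scores)
print(level_scores, final_score)
```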
### Run
We will continue to evaluate the performance of existing LLMs on these tasks according to the structure diagram of the 14 foundational tasks. For details, please refer to the [leaderboard](https://huggingface.co/spaces/daishen/SCULAiW).
#### 1.Preparation
```bash
git clone https://github.com/Dai-shen/LAiW.git --recursive
cd LAiW
pip install -r requirements.txt
cd src/financial-evaluation
pip install -e .[multilingual]
```
#### 2.Output of LLM
We select the model and the legal tasks to be evaluated, and obtain the model's outputs by running the following command.
```bash
export CUDA_VISIBLE_DEVICES="1,2"
python eval.py \
  --model "hf-causal-experimental" \
  --model_args "use_accelerate=True,pretrained=$pretrained_model,tokenizer=$pretrained_model,use_fast=False,trust_remote_code=True" \
  --tasks "legal_ar,legal_er,legal_js" \
  --no_cache \
  --num_fewshot 0 \
  --write_out \
  --output_base_path ""
```
Parameter Description
- `model`: Model interface type, optional parameters can be found in `src/financial-evaluation/lm_eval/models/__init__.py`
- `tasks`: Predefined task names; you can define your own tasks in `src/tasks/__init__.py` and `src/tasks/legal.py`
- `pretrained_model`: Path to the large model (Hugging Face space or local model path)
- `output_base_path`: Path where the model outputs are saved (see the example invocation below)
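As a usage sketch, the same command with the placeholders filled in might look as follows; the checkpoint id and output directory are illustrative choices, not values prescribed by the repository:
```bash
export CUDA_VISIBLE_DEVICES="1,2"
# A Hugging Face repo id or a local checkpoint directory both work here.
pretrained_model="baichuan-inc/Baichuan2-13B-Chat"
python eval.py \
  --model "hf-causal-experimental" \
  --model_args "use_accelerate=True,pretrained=$pretrained_model,tokenizer=$pretrained_model,use_fast=False,trust_remote_code=True" \
  --tasks "legal_ar,legal_er,legal_js" \
  --no_cache \
  --num_fewshot 0 \
  --write_out \
  --output_base_path "outputs/baichuan2-13b-chat"
```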
### Contributors
- Sichuan University: Yongfu Dai, Duanyu Feng, Haochen Jia, Yifang Zhang and Hao Wang
- Wuhan University: Qianqian Xie, Weiguang Han and [Jimin Huang](https://jimin.chancefocus.com/)
- Southwest Petroleum University: Wei Tian
### Disclaimer
This project is provided for academic and educational purposes only. We do not take responsibility for any issues, risks, or adverse consequences that may arise from the use of this project.
### Acknowledgements
This project is built upon the following open-source projects, and we are really thankful for them:
- [**LLMindCraft**](https://github.com/XplainMind/LLMindCraft)
- [**Awesome Chinese Legal Resources**](https://github.com/pengxiao-song/awesome-chinese-legal-resources)
### Cite
If this project has been helpful to your research, please consider citing our project.
```
@article{dai2023laiw,
title={LAiW: A Chinese legal large language models benchmark},
author={Dai, Yongfu and Feng, Duanyu and Huang, Jimin and Jia, Haochen and Xie, Qianqian and Zhang, Yifang and Han, Weiguang and Tian, Wei and Wang, Hao},
journal={arXiv preprint arXiv:2310.05620},
year={2023}
}
```