Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Dingo: A Comprehensive Data Quality Evaluation Tool
https://github.com/DataEval/dingo
data-evaluation data-quality data-science data-validation gpt llm spark vlm
Last synced: 3 days ago
Dingo: A Comprehensive Data Quality Evaluation Tool
- Host: GitHub
- URL: https://github.com/DataEval/dingo
- Owner: DataEval
- License: apache-2.0
- Created: 2024-12-24T05:59:24.000Z (15 days ago)
- Default Branch: dev
- Last Pushed: 2025-01-03T13:53:29.000Z (5 days ago)
- Last Synced: 2025-01-03T14:43:50.121Z (5 days ago)
- Topics: data-evaluation, data-quality, data-science, data-validation, gpt, llm, spark, vlm
- Language: JavaScript
- Homepage:
- Size: 15 MB
- Stars: 13
- Watchers: 0
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Dingo - Dingo - A Comprehensive Data Quality Evaluation Tool (Building / Tools)
README
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://github.com/pre-commit/pre-commit)
[English](README.md) | [简体中文](README_CN.md)
# Changelog
- 2024/12/27: Project Initialization
# Introduction
Dingo is a data quality evaluation tool that helps you automatically detect data quality issues in your datasets. Dingo provides a variety of built-in rules and model evaluation methods, and also supports custom evaluation methods. Dingo supports commonly used text datasets and multimodal datasets, including pre-training datasets, fine-tuning datasets, and evaluation datasets. In addition, Dingo supports multiple usage methods, including local CLI and SDK, making it easy to integrate into various evaluation platforms, such as [OpenCompass](https://github.com/open-compass/opencompass).
## Architecture Diagram
![Architecture of dingo](./docs/assets/architeture.png)
## Scenario Diagram
![Scene of dingo](docs/assets/scene.png)
# Quick Start
Users can use Dingo in two ways as shown below.
## Installation
Install `dingo`
```shell
pip install dingo-python
```
## SDK
Try to run the `SDK` call method below:
```python
from dingo.io import InputArgs
from dingo.exec import Executor

input_data = {
"eval_group": "sft", # rule list for sft data, other ['default', 'pretrain' ...]
"input_path": "tatsu-lab/alpaca", # dataset from huggingface
"data_format": "plaintext", # data format, other ['json', 'jsonl', 'plaintext']
"save_data": True, # save data to local
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
```

For more usage examples, please refer to [examples](examples), for more evaluation results, please refer to [evaluation](docs/eval), and for more configurations, please refer to [config](docs/config.md).
## CLI
Try running the rule-set evaluation via the `CLI` below:
```shell
python -m dingo.run.cli --input_path tatsu-lab/alpaca -e sft --data_format plaintext --save_data True
```

Or try running the gpt-4o model evaluation via the `CLI` below:
```shell
python -m dingo.run.cli --input_path test/data/test_local_json.json --dataset local -e openai --data_format json --column_content prediction --custom_config test/config/config_gpt.json --save_data True
```

Note that model evaluation requires the corresponding configuration to be provided, such as the configuration used in the example above:
```shell
$ cat test/data/config_gpt.json
{
"llm_config": {
"openai": {
"model": "gpt-4o",
"key": "xxxx",
"api_url": "https://api.openai.com/v1/chat/completions"
}
}
}
```

## GUI
After a `CLI` run, if the user sets the `save_data` parameter to `True`, a frontend page is automatically generated from the quality inspection results.
To start a frontend page manually, run the following command:
```shell
python -m dingo.run.vsl --input xxx
```

The `--input` argument should point to the directory containing the quality inspection results. Users need to ensure the directory contains a `summary.json` file.
# Feature List
## Supports multiple input data sources, data types, and data modalities
Dingo supports local files, Hugging Face datasets, and S3 storage files as data sources; pre-training, fine-tuning, and evaluation datasets as data types; and text and image as data modalities.
## Supports custom rules, model evaluation
Dingo ships with 20+ general heuristic rule evaluations, evaluations by common LLMs (such as OpenAI, Kimi, etc.), and evaluations by locally launched models (such as Llama 3).
The built-in heuristic rules are grouped into multiple rule-set combinations, such as `pretrain` and `sft`, according to the dataset type.
Both rules and model evaluations support customization or modification.
Data security evaluation is also supported, for example via the Perspective API.

## Supports multiple interface usage methods, good extensibility, and easy to integrate
Dingo supports multiple interface usage methods, including local CLI and SDK, making it easy to integrate into various evaluation platforms, such as OpenCompass.
## Supports multiple execution engines
Dingo supports two execution engines, local and Spark, which makes it convenient to run data evaluation tasks of various sizes.
## Supports multi-dimensional metric reports, traceable
Dingo supports outputting a summary report covering the 7 Quality Metrics and a detailed report for tracing abnormal data.
# Concept Introduction
## Metric Introduction
[Metric Document](docs/metrics.md)
## Rule Introduction
[Rule Document](docs/rules.md)
## eval_group Introduction
[eval_group Document](docs/groups.md)
## Response Introduction
[Response Document](docs/response.md)
# User Guide
## Installation
The installation described in the Quick Start above only installs the packages required for basic operation; packages for some special features are not installed. If you need those packages, refer to: [Install Dependencies](requirements)

## Register Rules/Prompts/Models
If the heuristic rules inside the project do not meet the user's quality inspection requirements, users can also customize rules or models.
### Register Rules
To create a new rule such as `CommonPatternDemo`, the first step is to add a decorator to the rule so that it is registered with the project.
Second, the rule's `metric_type`, such as `QUALITY_BAD_RELEVANCE`, needs to be set; `group` does not need to be set.
Then the user needs to define a `DynamicRuleConfig` object so that the rule's properties can be configured dynamically.
In addition, the rule's method must be a classmethod named `eval`.
Finally, it should return a `ModelRes` object.

For example: [Register Rules](examples/register/sdk_register_rule.py)
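Putting those steps together, a minimal sketch could look like the following. The decorator name, import paths, and `ModelRes` field names are assumptions for illustration only; the linked example above remains the authoritative reference.

```python
# Hypothetical sketch of a custom rule; decorator name, import paths, and
# ModelRes field names are assumptions -- see the linked example for the real API.
import re

from dingo.config.config import DynamicRuleConfig   # assumed import path
from dingo.io import MetaData                        # assumed import path
from dingo.model import Model                        # assumed import path
from dingo.model.modelres import ModelRes            # assumed import path
from dingo.model.rule.base import BaseRule           # assumed base class


@Model.rule_register("QUALITY_BAD_RELEVANCE", [])    # metric_type set, group left empty
class CommonPatternDemo(BaseRule):
    # DynamicRuleConfig lets the rule's properties be reconfigured without code changes
    dynamic_config = DynamicRuleConfig(pattern=r"\d{11}")

    @classmethod
    def eval(cls, input_data: MetaData) -> ModelRes:
        # Must be a classmethod named `eval` that returns a ModelRes
        res = ModelRes()
        if re.search(cls.dynamic_config.pattern, input_data.content):
            res.error_status = True                  # assumed ModelRes fields
            res.type = "QUALITY_BAD_RELEVANCE"
            res.name = cls.__name__
            res.reason = ["matched the configured pattern"]
        return res
```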
### Register Prompts
Users can also register prompts; the procedure is similar to registering rules.
For example: [Register Prompts](examples/register/sdk_register_prompt.py)
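For orientation, a prompt registration sketch might look like the code below; the decorator name, base class, and `content` attribute are assumptions, so consult the linked example for the actual interface.

```python
# Hypothetical sketch of a custom prompt; the decorator name, import paths,
# and class attributes are assumptions -- see the linked example for the real API.
from dingo.model import Model                        # assumed import path
from dingo.model.prompt.base import BasePrompt       # assumed base class


@Model.prompt_register("QUALITY_BAD_RELEVANCE", [])  # assumed decorator name
class PromptDemo(BasePrompt):
    # The prompt text an LLM evaluator would be asked to apply
    content = """
    Please judge whether the following text is relevant and well formed.
    Return a JSON object with the fields "score" and "reason".
    """
```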
### Register Models
Registering models is slightly different: users need to implement a `call_api` method that accepts a `MetaData` parameter and returns a `ModelRes` result.
The project already provides a base model class, [BaseOpenAI](dingo/model/llm/base_openai.py), which users can inherit directly.
If special behavior is needed, the corresponding methods can be overridden.

For example: [Register Models](examples/register/sdk_register_llm.py)
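As a rough illustration, registering a model might look like the sketch below; the decorator name and the exact `call_api` signature are assumptions, so refer to the linked example for the actual interface.

```python
# Hypothetical sketch of a custom model; the decorator name and call_api
# signature are assumptions -- see the linked example for the real API.
from dingo.io import MetaData                        # assumed import path
from dingo.model import Model                        # assumed import path
from dingo.model.llm.base_openai import BaseOpenAI
from dingo.model.modelres import ModelRes            # assumed import path


@Model.llm_register("my_custom_llm")                 # assumed decorator name
class MyCustomLLM(BaseOpenAI):
    @classmethod
    def call_api(cls, input_data: MetaData) -> ModelRes:
        # BaseOpenAI already implements an OpenAI-style call_api;
        # override it only when special handling is required.
        res = super().call_api(input_data)
        # e.g. post-process or re-label the result here
        return res
```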
## Configuration
[Configuration Document](docs/config.md)
## Execution Engine
`Dingo` can run locally or on a Spark cluster.
Regardless of the engine chosen, the executor supports several common methods (see the sketch after the table):

| function name | description |
|--------------------|--------------------------|
| get_summary | get the summary of test. |
| get_bad_info_list | get the bad data. |
| get_good_info_list | get the good data. |
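For instance, continuing the SDK example from the Quick Start, these methods can be called on the executor after `execute()` finishes; the printed structures are only illustrative.

```python
from dingo.io import InputArgs
from dingo.exec import Executor

input_args = InputArgs(
    eval_group="sft",
    input_path="tatsu-lab/alpaca",
    data_format="plaintext",
    save_data=True,
)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()

# Common executor methods, available regardless of engine
summary = executor.get_summary()            # overall score and per-metric ratios
bad_items = executor.get_bad_info_list()    # data flagged by rules/models
good_items = executor.get_good_info_list()  # data that passed inspection
print(summary)
```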
### Local Mode
When choosing the local engine, users can freely choose rules and models for quality inspection.
[Local Example](examples/dataset/sdk_local.py)
### Spark Mode
When choosing the spark engine, users can only choose rules for quality inspection, and models cannot be used.
Only `eval_group`, `save_data`, `save_correct`, and `custom_config` in `InputArgs` remain valid.
In addition, the user needs to pass in a `spark_session` to initialize Spark and a `spark_rdd` (composed of `MetaData` objects) as the data for quality inspection.
Note that if `save_data` is `False`, the data in memory is cleared immediately after the quality inspection completes, and the `spark_session` is also stopped immediately.

[Spark Example](examples/spark/sdk_spark.py)
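A rough sketch of a Spark run is shown below; the `exec_map["spark"]` key, the way `spark_session` and `spark_rdd` are passed to the executor, and the `MetaData` field names are assumptions for illustration, so treat the linked Spark example as authoritative.

```python
# Hypothetical sketch of running Dingo on a Spark cluster; the executor key,
# keyword-argument names, and MetaData fields are assumptions -- see the
# linked Spark example for the actual wiring.
from pyspark.sql import SparkSession

from dingo.exec import Executor
from dingo.io import InputArgs, MetaData   # MetaData import path assumed

spark = SparkSession.builder.appName("dingo-demo").getOrCreate()

# Build an RDD whose elements are MetaData objects
data_rdd = spark.sparkContext.parallelize([
    MetaData(data_id="1", prompt="", content="I am 8 years old. I love apples because:"),
])

# Only eval_group, save_data, save_correct and custom_config remain valid here
input_args = InputArgs(eval_group="default", save_data=True)

executor = Executor.exec_map["spark"](     # "spark" key assumed
    input_args,
    spark_session=spark,                   # parameter names assumed
    spark_rdd=data_rdd,
)
result = executor.execute()
print(result)
```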
## Evaluation Report
After completing an evaluation, Dingo generates a summary report (summary) and a detailed report (detail). The summary includes the overall Score and the scores of the 7 Quality Metrics dimensions for the evaluation. The detailed report contains the specific data flagged under each Quality Metric, which makes it easy to trace the cause of each issue.
An example of the `summary.json` summary file is as follows:
```shell
{
"task_id": "d6c922ec-981c-11ef-b723-7c10c9512fac",
"task_name": "dingo",
"eval_group": "default",
"input_path": "test/data/test_local_jsonl.jsonl",
"output_path": "outputs/d6c921ac-981c-11ef-b723-7c10c9512fac",
"create_time": "20241101_144510",
"score": 50.0,
"num_good": 1,
"num_bad": 1,
"total": 2,
"type_ratio": {
"QUALITY_BAD_COMPLETENESS": 0.5,
"QUALITY_BAD_RELEVANCE": 0.5
},
"name_ratio": {
"QUALITY_BAD_COMPLETENESS-RuleColonEnd": 0.5,
"QUALITY_BAD_RELEVANCE-RuleSpecialCharacter": 0.5
}
}
```

An example of a detailed report, such as the `RuleColonEnd.json` file, is as follows:
```shell
{"data_id": "1", "prompt": "", "content": "�I am 8 years old. ^I love apple because:", "type_list": ["QUALITY_BAD_COMPLETENESS", "QUALITY_BAD_RELEVANCE"], "name_list": ["QUALITY_BAD_COMPLETENESS-RuleColonEnd", "QUALITY_BAD_RELEVANCE-RuleSpecialCharacter"], "reason_list": ["�I am 8 years old. ^I love apple because:", ["�"]]}```
## TODO
- [ ] Richer graphic and text evaluation indicators;
- [ ] New audio and video data modality evaluation;
- [ ] New small model evaluation, such as fasttext, Qurating;
- [ ] New data diversity evaluation;

# Limitations
- The built-in detection rules and model methods mostly come from papers, open source projects, etc., and focus mainly on common data quality problems. To evaluate more specialized data problems, it is recommended to customize the corresponding detection rules.
# Acknowledgments
- [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)
- [mlflow](https://github.com/mlflow/mlflow)

# Contribution
We appreciate all the contributors for their efforts to improve and enhance `Dingo`. Please refer to the [Contribution Guide](docs/en/CONTRIBUTING.md) for guidance on contributing to the project.
# License
This project uses the [Apache 2.0 Open Source License](LICENSE).
# Citation
If you find this project useful, please consider citing our tool:
```
@misc{dingo,
title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},
howpublished={\url{https://github.com/DataEval/dingo}},
year={2024}
}
```