{"id":30582019,"url":"https://github.com/MigoXLab/dingo","last_synced_at":"2025-08-29T07:03:43.557Z","repository":{"id":269957333,"uuid":"907673924","full_name":"MigoXLab/dingo","owner":"MigoXLab","description":"Dingo: A Comprehensive AI Data Quality Evaluation Tool","archived":false,"fork":false,"pushed_at":"2025-08-11T11:01:30.000Z","size":22097,"stargazers_count":353,"open_issues_count":4,"forks_count":39,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-08-19T11:57:47.703Z","etag":null,"topics":["common-crawl","data-evaluation","data-quality","data-quality-assessment","data-quality-report","data-science","data-validation","dataquality","datascience","deepseek","gpt","hallucination","hallucination-detection","llm","openai","opencompass","qwen","spark","vlm"],"latest_commit_sha":null,"homepage":"https://huggingface.co/spaces/DataEval/dingo","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MigoXLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-12-24T05:59:24.000Z","updated_at":"2025-08-19T09:48:27.000Z","dependencies_parsed_at":"2025-03-14T04:21:30.797Z","dependency_job_id":"2b564e14-d3ea-44a0-a493-1691d88816ae","html_url":"https://github.com/MigoXLab/dingo","commit_stats":null,"previous_names":["dataeval/dingo","migoxlab/dingo"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/MigoXLab/dingo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MigoXLab%2Fdingo","tags_url":"https://repos.ecosyste.ms/api/v1/hos
ts/GitHub/repositories/MigoXLab%2Fdingo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MigoXLab%2Fdingo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MigoXLab%2Fdingo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MigoXLab","download_url":"https://codeload.github.com/MigoXLab/dingo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MigoXLab%2Fdingo/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272647146,"owners_count":24969679,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-29T02:00:10.610Z","response_time":87,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["common-crawl","data-evaluation","data-quality","data-quality-assessment","data-quality-report","data-science","data-validation","dataquality","datascience","deepseek","gpt","hallucination","hallucination-detection","llm","openai","opencompass","qwen","spark","vlm"],"created_at":"2025-08-29T07:02:19.211Z","updated_at":"2025-08-29T07:03:43.548Z","avatar_url":"https://github.com/MigoXLab.png","language":"JavaScript","funding_links":[],"categories":["Data \u0026 Analytics","Data Analysis \u0026 Exploration Mcp Servers","Large Language Model Data","Data Science Tools","MCP Servers"],"sub_categories":["AI/ML Platforms","Pretraining 
Data","Other MCP Servers"],"readme":"\u003cdiv align=\"center\" xmlns=\"http://www.w3.org/1999/html\"\u003e\n\u003c!-- logo --\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/dingo-logo.png\" width=\"300px\" style=\"vertical-align:middle;\"\u003e\n\u003c/p\u003e\n\n\u003c!-- badges --\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/pre-commit/pre-commit\"\u003e\u003cimg src=\"https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit\u0026logoColor=white\" alt=\"pre-commit\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/dingo-python/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/dingo-python.svg\" alt=\"PyPI version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/dingo-python/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/pyversions/dingo-python.svg\" alt=\"Python versions\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/DataEval/dingo/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/github/license/DataEval/dingo\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/DataEval/dingo/stargazers\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/DataEval/dingo\" alt=\"GitHub stars\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/DataEval/dingo/network/members\"\u003e\u003cimg src=\"https://img.shields.io/github/forks/DataEval/dingo\" alt=\"GitHub forks\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/DataEval/dingo/issues\"\u003e\u003cimg src=\"https://img.shields.io/github/issues/DataEval/dingo\" alt=\"GitHub issues\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://mseep.ai/app/dataeval-dingo\"\u003e\u003cimg src=\"https://mseep.net/pr/dataeval-dingo-badge.png\" alt=\"MseeP.ai Security Assessment Badge\" height=\"20\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://deepwiki.com/MigoXLab/dingo\"\u003e\u003cimg src=\"https://deepwiki.com/badge.svg\" alt=\"Ask 
DeepWiki\"\u003e\u003c/a\u003e\n\n[![Trust Score](https://archestra.ai/mcp-catalog/api/badge/quality/DataEval/dingo)](https://archestra.ai/mcp-catalog/dataeval__dingo)\n\u003c/p\u003e\n\n\u003c/div\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n\n[English](README.md) · [简体中文](README_zh-CN.md) · [日本語](README_ja.md)\n\n\u003c/div\u003e\n\n\n\u003c!-- join us --\u003e\n\n\u003cp align=\"center\"\u003e\n    👋 join us on \u003ca href=\"https://discord.gg/Jhgb2eKWh8\" target=\"_blank\"\u003eDiscord\u003c/a\u003e and \u003ca href=\"./docs/assets/wechat.jpg\" target=\"_blank\"\u003eWeChat\u003c/a\u003e\n\u003c/p\u003e\n\n\n# Introduction\n\nDingo is a data quality evaluation tool that helps you automatically detect data quality issues in your datasets. Dingo provides a variety of built-in rules and model evaluation methods, and also supports custom evaluation methods. Dingo supports commonly used text datasets and multimodal datasets, including pre-training datasets, fine-tuning datasets, and evaluation datasets. In addition, Dingo supports multiple usage methods, including local CLI and SDK, making it easy to integrate into various evaluation platforms, such as [OpenCompass](https://github.com/open-compass/opencompass).\n\n## Architecture Diagram\n\n![Architecture of dingo](./docs/assets/architeture.png)\n\n# Quick Start\n\n## Installation\n\n```shell\npip install dingo-python\n```\n\n## Example Use Cases\n\n### 1. Evaluate LLM chat data\n\n```python\nfrom dingo.config.input_args import EvaluatorLLMArgs\nfrom dingo.io.input import Data\nfrom dingo.model.llm.llm_text_quality_model_base import LLMTextQualityModelBase\nfrom dingo.model.rule.rule_common import RuleEnterAndSpace\n\ndata = Data(\n    data_id='123',\n    prompt=\"hello, introduce the world\",\n    content=\"Hello! 
The world is a vast and diverse place, full of wonders, cultures, and incredible natural beauty.\"\n)\n\n\ndef llm():\n    LLMTextQualityModelBase.dynamic_config = EvaluatorLLMArgs(\n        key='YOUR_API_KEY',\n        api_url='https://api.openai.com/v1/chat/completions',\n        model='gpt-4o',\n    )\n    res = LLMTextQualityModelBase.eval(data)\n    print(res)\n\n\ndef rule():\n    res = RuleEnterAndSpace().eval(data)\n    print(res)\n```\n\n### 2. Evaluate Dataset\n\n```python\nfrom dingo.config import InputArgs\nfrom dingo.exec import Executor\n\n# Evaluate a dataset from Hugging Face\ninput_data = {\n    \"input_path\": \"tatsu-lab/alpaca\",  # Dataset from Hugging Face\n    \"dataset\": {\n        \"source\": \"hugging_face\",\n        \"format\": \"plaintext\"  # Format: plaintext\n    },\n    \"executor\": {\n        \"eval_group\": \"sft\",  # Rule set for SFT data\n        \"result_save\": {\n            \"bad\": True  # Save evaluation results\n        }\n    }\n}\n\ninput_args = InputArgs(**input_data)\nexecutor = Executor.exec_map[\"local\"](input_args)\nresult = executor.execute()\nprint(result)\n```\n\n## Command Line Interface\n\n### Evaluate with Rule Sets\n\n```shell\npython -m dingo.run.cli --input test/env/local_plaintext.json\n```\n\n### Evaluate with LLM (e.g., GPT-4o)\n\n```shell\npython -m dingo.run.cli --input test/env/local_json.json\n```\n\n## GUI Visualization\n\nAfter evaluation (with `result_save.bad=True`), a frontend page will be automatically generated. 
To manually start the frontend:\n\n```shell\npython -m dingo.run.vsl --input output_directory\n```\n\nHere `output_directory` is the directory that contains the evaluation results, including a `summary.json` file.\n\n![GUI output](docs/assets/dingo_gui.png)\n\n## Online Demo\nTry Dingo on our online demo: [(Hugging Face)🤗](https://huggingface.co/spaces/DataEval/dingo)\n\n## Local Demo\nRun the Dingo demo locally:\n\n```shell\ncd app_gradio\npython app.py\n```\n\n![Gradio demo](docs/assets/gradio_demo.png)\n\n\n## Google Colab Demo\nExperience Dingo interactively in a Google Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/DataEval/dingo/blob/dev/examples/colab/dingo_colab_demo.ipynb)\n\n\n\n# MCP Server\n\nDingo includes an experimental Model Context Protocol (MCP) server. For details on running the server and integrating it with clients like Cursor, please see the dedicated documentation:\n\n[English](README_mcp.md) · [简体中文](README_mcp_zh-CN.md) · [日本語](README_mcp_ja.md)\n\n## Video Demonstration\n\nTo help you get started quickly with Dingo MCP, we've created a video walkthrough:\n\nhttps://github.com/user-attachments/assets/aca26f4c-3f2e-445e-9ef9-9331c4d7a37b\n\nThis video demonstrates, step by step, how to use the Dingo MCP server with Cursor.\n\n\n# Data Quality Metrics\n\nDingo provides comprehensive data quality assessment through both rule-based and prompt-based evaluation metrics. 
These metrics cover multiple quality dimensions including effectiveness, completeness, similarity, security, and more.\n\n📊 **[View Complete Metrics Documentation →](docs/metrics.md)**\n\nOur evaluation system includes:\n- **Text Quality Assessment Metrics**: Pre-training data quality evaluation using DataMan methodology and enhanced multi-dimensional assessment\n- **SFT Data Assessment Metrics**: Honest, Helpful, Harmless evaluation for supervised fine-tuning data\n- **Classification Metrics**: Topic categorization and content classification\n- **Multimodality Assessment Metrics**: Image classification and relevance evaluation\n- **Rule-Based Quality Metrics**: Automated quality checks using heuristic rules for effectiveness and similarity detection\n- etc\n\nMost metrics are backed by academic sources to ensure objectivity and scientific rigor.\n\n### Using LLM Assessment in Evaluation\n\nTo use these assessment prompts in your evaluations, specify them in your configuration:\n\n```python\ninput_data = {\n    # Other parameters...\n    \"executor\": {\n        \"prompt_list\": [\"QUALITY_BAD_SIMILARITY\"],  # Specific prompt to use\n    },\n    \"evaluator\": {\n        \"llm_config\": {\n            \"LLMTextQualityPromptBase\": {  # LLM model to use\n                \"model\": \"gpt-4o\",\n                \"key\": \"YOUR_API_KEY\",\n                \"api_url\": \"https://api.openai.com/v1/chat/completions\"\n            }\n        }\n    }\n}\n```\n\nYou can customize these prompts to focus on specific quality dimensions or to adapt to particular domain requirements. 
When combined with appropriate LLM models, these prompts enable comprehensive evaluation of data quality across multiple dimensions.\n\n### Hallucination Detection \u0026 RAG System Evaluation\n\nFor detailed guidance on using Dingo's hallucination detection capabilities, including HHEM-2.1-Open local inference and LLM-based evaluation:\n\n📖 **[View Hallucination Detection Guide →](docs/hallucination_guide.md)**\n\n# Rule Groups\n\nDingo provides pre-configured rule groups for different types of datasets:\n\n| Group | Use Case | Example Rules |\n|-------|----------|---------------|\n| `default` | General text quality | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat`, etc. |\n| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleHallucinationHHEM` for hallucination detection |\n| `rag` | RAG system evaluation | `RuleHallucinationHHEM`, `PromptHallucination` for response consistency |\n| `hallucination` | Hallucination detection | `PromptHallucination` with LLM-based evaluation |\n| `pretrain` | Pre-training datasets | Comprehensive set of 20+ rules including `RuleAlphaWords`, `RuleCapitalWords`, etc. 
|\n\nTo use a specific rule group:\n\n```python\ninput_data = {\n    \"executor\": {\n        \"eval_group\": \"sft\",  # Use \"default\", \"sft\", \"rag\", \"hallucination\", or \"pretrain\"\n    }\n    # other parameters...\n}\n```\n\n# Feature Highlights\n\n## Multi-source \u0026 Multi-modal Support\n\n- **Data Sources**: Local files, Hugging Face datasets, S3 storage\n- **Data Types**: Pre-training, fine-tuning, and evaluation datasets\n- **Data Modalities**: Text and image\n\n## Rule-based \u0026 Model-based Evaluation\n\n- **Built-in Rules**: 20+ general heuristic evaluation rules\n- **LLM Integration**: OpenAI, Kimi, and local models (e.g., Llama3)\n- **Hallucination Detection**: HHEM-2.1-Open local model and GPT-based evaluation\n- **RAG System Evaluation**: Response consistency and context alignment assessment\n- **Custom Rules**: Easily extend with your own rules and models\n- **Security Evaluation**: Perspective API integration\n\n## Flexible Usage\n\n- **Interfaces**: CLI and SDK options\n- **Integration**: Easy integration with other platforms\n- **Execution Engines**: Local and Spark\n\n## Comprehensive Reporting\n\n- **Quality Metrics**: 7-dimensional quality assessment\n- **Traceability**: Detailed reports for anomaly tracking\n\n# User Guide\n\n## Custom Rules, Prompts, and Models\n\nIf the built-in rules don't meet your requirements, you can create custom ones:\n\n### Custom Rule Example\n\n```python\nfrom dingo.model import Model\nfrom dingo.model.rule.base import BaseRule\nfrom dingo.config.input_args import EvaluatorRuleArgs\nfrom dingo.io import Data\nfrom dingo.model.modelres import ModelRes\n\n@Model.rule_register('QUALITY_BAD_RELEVANCE', ['default'])\nclass MyCustomRule(BaseRule):\n    \"\"\"Check for custom pattern in text\"\"\"\n\n    dynamic_config = EvaluatorRuleArgs(pattern=r'your_pattern_here')\n\n    @classmethod\n    def eval(cls, input_data: Data) -\u003e ModelRes:\n        res = ModelRes()\n        # Your rule implementation 
here\n        return res\n```\n\n### Custom LLM Integration\n\n```python\nfrom dingo.model import Model\nfrom dingo.model.llm.base_openai import BaseOpenAI\n\n@Model.llm_register('my_custom_model')\nclass MyCustomModel(BaseOpenAI):\n    # Custom implementation here\n    pass\n```\n\nSee more examples in:\n- [Register Rules](examples/register/sdk_register_rule.py)\n- [Register Prompts](examples/register/sdk_register_prompt.py)\n- [Register Models](examples/register/sdk_register_llm.py)\n\n## Execution Engines\n\n### Local Execution\n\n```python\nfrom dingo.config import InputArgs\nfrom dingo.exec import Executor\n\n# input_data is a configuration dict, as in the Evaluate Dataset example above\ninput_args = InputArgs(**input_data)\nexecutor = Executor.exec_map[\"local\"](input_args)\nresult = executor.execute()\n\n# Get results\nsummary = executor.get_summary()        # Overall evaluation summary\nbad_data = executor.get_bad_info_list() # List of problematic data\ngood_data = executor.get_good_info_list() # List of high-quality data\n```\n\n### Spark Execution\n\n```python\nfrom dingo.config import InputArgs\nfrom dingo.exec import Executor\nfrom pyspark.sql import SparkSession\n\n# Initialize Spark\nspark = SparkSession.builder.appName(\"Dingo\").getOrCreate()\nspark_rdd = spark.sparkContext.parallelize([...])  # Your data as Data objects\n\ninput_data = {\n    \"executor\": {\n        \"eval_group\": \"default\",\n        \"result_save\": {\"bad\": True}\n    }\n}\ninput_args = InputArgs(**input_data)\nexecutor = Executor.exec_map[\"spark\"](input_args, spark_session=spark, spark_rdd=spark_rdd)\nresult = executor.execute()\n```\n\n## Evaluation Reports\n\nAfter evaluation, Dingo generates:\n\n1. **Summary Report** (`summary.json`): Overall metrics and scores\n2. **Detailed Reports**: Specific issues for each rule violation\n\nFields in the summary report:\n1. **score**: `num_good` / `total`, expressed as a percentage\n2. **type_ratio**: The fraction of records flagged with each issue type, such as: `QUALITY_BAD_COMPLETENESS` / `total`\n3. 
**name_ratio**: The fraction of records flagged by each type-rule pair, such as: `QUALITY_BAD_COMPLETENESS-RuleColonEnd` / `total`\n\nExample summary:\n```json\n{\n    \"task_id\": \"d6c922ec-981c-11ef-b723-7c10c9512fac\",\n    \"task_name\": \"dingo\",\n    \"eval_group\": \"default\",\n    \"input_path\": \"test/data/test_local_jsonl.jsonl\",\n    \"output_path\": \"outputs/d6c921ac-981c-11ef-b723-7c10c9512fac\",\n    \"create_time\": \"20241101_144510\",\n    \"score\": 50.0,\n    \"num_good\": 1,\n    \"num_bad\": 1,\n    \"total\": 2,\n    \"type_ratio\": {\n        \"QUALITY_BAD_COMPLETENESS\": 0.5,\n        \"QUALITY_BAD_RELEVANCE\": 0.5\n    },\n    \"name_ratio\": {\n        \"QUALITY_BAD_COMPLETENESS-RuleColonEnd\": 0.5,\n        \"QUALITY_BAD_RELEVANCE-RuleSpecialCharacter\": 0.5\n    }\n}\n```\n\n# Future Plans\n\n- [ ] Richer image-and-text evaluation metrics\n- [ ] Audio and video data modality evaluation\n- [ ] Small model evaluation (fasttext, Qurating)\n- [ ] Data diversity evaluation\n\n# Limitations\n\nThe current built-in detection rules and model methods focus on common data quality problems. For specialized evaluation needs, we recommend customizing detection rules.\n\n# Acknowledgments\n\n- [RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)\n- [mlflow](https://github.com/mlflow/mlflow)\n- [deepeval](https://github.com/confident-ai/deepeval)\n\n# Contribution\n\nWe thank all contributors for their efforts to improve and enhance `Dingo`. Please refer to the [Contribution Guide](docs/en/CONTRIBUTING.md) for guidance on contributing to the project.\n\n# License\n\nThis project uses the [Apache 2.0 Open Source License](LICENSE).\n\nThis project uses fasttext for some functionality including language detection. 
fasttext is licensed under the MIT License, which is compatible with our Apache 2.0 license and provides flexibility for various usage scenarios.\n\n# Citation\n\nIf you find this project useful, please consider citing our tool:\n\n```\n@misc{dingo,\n  title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},\n  author={Dingo Contributors},\n  howpublished={\\url{https://github.com/DataEval/dingo}},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMigoXLab%2Fdingo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMigoXLab%2Fdingo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMigoXLab%2Fdingo/lists"}