{"id":31532692,"url":"https://github.com/zjunlp/datamind","last_synced_at":"2025-10-04T03:58:00.875Z","repository":{"id":317414118,"uuid":"987637200","full_name":"zjunlp/DataMind","owner":"zjunlp","description":"Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study","archived":false,"fork":false,"pushed_at":"2025-09-30T17:39:36.000Z","size":14463,"stargazers_count":10,"open_issues_count":1,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-09-30T19:15:37.940Z","etag":null,"topics":["agent","artificial-intelligence","data-analysis","data-science","language-model","natural-language-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zjunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-21T11:15:10.000Z","updated_at":"2025-09-30T17:39:45.000Z","dependencies_parsed_at":"2025-09-30T19:17:50.275Z","dependency_job_id":"6d56c053-eadc-46aa-803e-ffef2cbe6a23","html_url":"https://github.com/zjunlp/DataMind","commit_stats":null,"previous_names":["zjunlp/datamind"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/zjunlp/DataMind","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FDataMind","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FDataMind/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/zjunlp%2FDataMind/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FDataMind/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zjunlp","download_url":"https://codeload.github.com/zjunlp/DataMind/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zjunlp%2FDataMind/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278262443,"owners_count":25957938,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","artificial-intelligence","data-analysis","data-science","language-model","natural-language-processing"],"created_at":"2025-10-04T03:57:48.578Z","updated_at":"2025-10-04T03:58:00.868Z","avatar_url":"https://github.com/zjunlp.png","language":"Python","readme":"\n\n\u003ch1 align=\"center\"\u003e DataMind \u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2509.25084\"\u003e📄arXiv\u003c/a\u003e •\n  \u003ca href=\"https://huggingface.co/collections/zjunlp/datamind-687d90047c58bb1e3d901dd8\"\u003e🤗HuggingFace\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n \n[![Awesome](https://awesome.re/badge.svg)](https://github.com/zjunlp/DataMind) 
\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n![](https://img.shields.io/github/last-commit/zjunlp/DataMind?color=green) \n \n\u003c/div\u003e\n\n\n## Table of Contents\n\n- 🔔 [News](#news)\n- 📑 [Todo-List](#todo-list)\n- 👀 [Overview](#overview)\n- 🔧 [Installation](#installation)\n- 💻 [Training](#training)\n- 🧐 [Evaluation](#evaluation)\n- ✍️ [Citation](#citation)\n\n\n---\n\n## 🔔 News\n- **[2025-09]** We release a new paper: \"[Scaling Generalist Data-Analytic Agents](https://arxiv.org/abs/2509.25084)\".\n\n- **[2025-06]** We release a new paper: \"[Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study](https://arxiv.org/pdf/2506.19794)\".\n\n## 📑 Todo-List\n- [ ] RL training code will be released soon.\n- [ ] RL and Evaluation Data will be released soon.\n\n## 👀 Overview\n\nData-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering or multi-agent scaffolds over proprietary models, while open-source models still struggle with diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces **DataMind**, a scalable data synthesis and agent training recipe designed to construct generalist data-analytic agents. **DataMind** tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. 
\n\nConcretely, **DataMind** applies\n- A fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; \n- A knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; \n- A dynamically adjustable training objective combining both SFT and RL losses;\n- A memory-frugal and stable code-based multi-turn rollout framework. \n\nBuilt on **DataMind**, we curate **DataMind-12K**, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art performance with an average score of 71.16\\% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10\\%. We also distill empirical insights from our exploratory trials and analysis experiments, aiming to provide actionable guidance on agent training for the community. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.\n\n\u003c!-- Large Language Models (LLMs) hold promise in automating data analysis tasks, yet opensource models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate models across three dimensions: data understanding, code generation, and strategic planning. 
Our analysis reveals three key findings: *(1) Strategic planning quality serves as the primary determinant of model performance*; *(2) Interaction design and task complexity significantly influence reasoning capabilities*; *(3) Data quality demonstrates a greater impact than diversity in achieving optimal performance.* We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs’ analytical reasoning capabilities. --\u003e\n\n## 🔧 Installation\n#### Manual Environment Configuration\n\nConda virtual environments offer a lightweight and flexible setup. We recommend managing different projects with separate conda environments.\n\n#### Prerequisites\n\n- Anaconda installation\n- GPU support (recommended CUDA version: 12.6)\n\n#### Scaling Generalist Data-Analytic Agents\n\n- SFT training\n\n    For SFT training, we use the **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** (0.9.4.dev0) framework.\n    ```bash\n    cd train/SFT/LLaMA-Factory\n    pip install -e \".[torch,metrics]\" --no-build-isolation\n    ```\n\n- RL training\n\n    For RL training, we use the **[verl](https://github.com/volcengine/verl)** (v0.4.0) framework.\n    ```bash\n    cd train/RL/verl\n    USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh\n    pip install -e .[vllm]\n    pip install -e .[sglang]\n    apt install sqlite3\n    ```\n\n- Eval\n    ```bash\n    cd eval/Datamind\n    pip install -r requirements.txt\n    apt install sqlite3\n    ```\n\n#### Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study\n- SFT training\n\n    For SFT training, we use the **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** (0.9.4.dev0) framework. 
\n    ```bash\n    cd train/SFT/LLaMA-Factory\n    pip install -e \".[torch,metrics]\" --no-build-isolation\n    ```\n\n- Eval\n    ```bash\n    cd eval/DataMind-Qwen2.5\n    pip install -r requirements.txt\n    ```\n\n## 💻 Training\n\n### SFT training\nOur models are trained with the **[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)** framework (0.9.4.dev0), which provides an efficient fine-tuning workflow.\n\n##### 1. Training Data\n\nThe training dataset `datamind_12k` used in *Scaling Generalist Data-Analytic Agents* is available on Hugging Face at [datamind-12k](https://huggingface.co/datasets/zjunlp/DataMind-12K/tree/main). Download it and place it at `train/SFT/LLaMA-Factory/data/datamind/datamind_12k.json`.\n\nThe training dataset `datamind-da-dataset` used in *Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study* is available at `train/SFT/LLaMA-Factory/data/datamind/datamind-da-dataset.json`.\n\n##### 2. Training Configuration\n\nWe provide our configurations for full-parameter fine-tuning with DeepSpeed ZeRO-3 as YAML files. 
You can find them in `train/SFT/LLaMA-Factory/examples/train_full/datamind_12k_full_sft.yaml` and `train/SFT/LLaMA-Factory/examples/train_full/datamind_da_dataset_full_sft.yaml`.\n\n##### 3. Launch Training\nYou can use the following command to start training; here we take `datamind_12k_full_sft.yaml` as an example. Alternatively, you can use the shell script `train/SFT/LLaMA-Factory/train.sh`.\n```bash\nCUDA_VISIBLE_DEVICES=0,1,2,3 llamafactory-cli train examples/train_full/datamind_12k_full_sft.yaml\n```\n\n### RL training\nOur RL training framework is modified from [verl](https://github.com/volcengine/verl) (v0.4.0), a flexible, efficient, and production-ready RL training library for large language models (LLMs).\n\n##### 1. Training Data\nThe training data will be released soon.\n\n##### 2. Training Configuration\nThe training code will be released soon.\n\n## 🧐 Evaluation\n### Scaling Generalist Data-Analytic Agents\n### 1. Evaluation Data\nThe evaluation data will be released soon. Once downloaded, unzip the archives and place them in the corresponding folders as shown below.\n```\n├── model.sh\n├── requirements.txt\n├── python\n│   ├── compute_pass3.py\n│   ├── da-dev-tables\n│   ├── eval_python.py\n│   ├── eval.sh\n│   ├── interpreter.py\n│   ├── tablebench_csv\n│   └── test_file\n│       ├── daeval_test.parquet\n│       └── tablebench_test.parquet\n└── sql\n    ├── bird\n    │   ├── bird_dev_csv_results\n    │   ├── dev_sqlite_files\n    │   ├── bird_dev_omni_ddl.json\n    │   └── test_file\n    │       └── bird_dev.parquet\n    ├── compute_pass3.py\n    ├── eval_bird.py\n    ├── eval.sh\n    └── interpreter.py\n```\n\n### 2. Evaluation\nWe use vLLM to launch a local model server. You can modify `model.sh` to fit your environment and run it to start the model server.\n```sh\nbash model.sh\n```\n\n#### For Python Evaluation\nYou can modify `eval/python/eval.sh` and run it to start the Python evaluation. 
Note that you should modify the `base_url` and `api_key` for the judge model in `eval/python/eval_python.py`.\n```sh\nPORT=19007\nexport OPENAI_BASE_URL=http://0.0.0.0:${PORT}/v1\nexport OPENAI_API_KEY=placeholder_key\n\npython eval_python.py \\\n    --model datamind \\\n    --temperature 0.7 \\\n    --top_p 0.95 \\\n    --bs 5 \\\n    --test_bench dabench \\\n    --test_file test_file/daeval_test.parquet \\\n    --csv_or_db_folder da-dev-tables\n```\n\n#### For SQL Evaluation\nYou can modify `eval/sql/eval.sh` and run it to start the SQL evaluation.\n```sh\nPORT=19008\nexport OPENAI_BASE_URL=http://0.0.0.0:${PORT}/v1\nexport OPENAI_API_KEY=placeholder_key\n\npython eval_bird.py \\\n    --model datamind \\\n    --temperature 0.7 \\\n    --top_p 0.95 \\\n    --bs 5 \\\n    --test_bench bird \\\n    --test_file bird/test_file/bird_dev.parquet \\\n    --csv_or_db_folder bird/dev_sqlite_files \\\n    --gold_csv_results_dir bird/bird_dev_csv_results \\\n    --db_schema_data_path bird/bird_dev_omni_ddl.json\n```\n\n### Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study\n\u003e Note:\n\u003e\n\u003e - **Ensure** that your working directory is set to the **`eval/DataMind-Analysis`** folder in a virtual environment.\n\u003e - If you have more questions, feel free to open an issue.\n\u003e - If you need to use a local model, deploy it following **(Optional) `local_model.sh`** below.\n\n**Step 1: Download the evaluation datasets and our SFT models**\nThe evaluation datasets we use are from [QRData](https://github.com/xxxiaol/QRData) and [DiscoveryBench](https://github.com/allenai/discoverybench).  
The script expects data to be at `data/QRData/benchmark/data/*.csv` and `data/DiscoveryBench/*.csv`.\n\nYou can also download our SFT models directly from Hugging Face: [DataMind-Analysis-Qwen2.5-7B](https://huggingface.co/zjunlp/DataMind-Analysis-Qwen2.5-7B) and [DataMind-Analysis-Qwen2.5-14B](https://huggingface.co/zjunlp/DataMind-Analysis-Qwen2.5-14B).\n\nYou can use the following `bash` script to download the dataset:\n```bash\nbash download_eval_data.sh\n```\n\n**Step 2: Prepare the parameter configuration**\n\nHere is an example. The inline `#` comments below annotate each option for reference; remove them before running the actual scripts.\n\n**`config.yaml`**\n\n```yaml\napi_key: your_api_key # Your API key, needed only for models served via API; not required for open-source models.\ndata_root: /path/to/your/project/DataMind/eval/data # Root directory for data (must be an absolute path).\n```\n\n**`run_eval.sh`**\n\n```bash\npython do_generate.py \\\n  --model_name DataMind-Qwen2.5-7B \\  # Model name to use.\n  --check_model gpt-4o-mini \\  # Check model to use.\n  --output results \\  # Output directory path.\n  --dataset_name QRData \\  # Dataset name to use, chosen from QRData, DiscoveryBench.\n  --max_round 25 \\  # Maximum number of steps.\n  --api_port 8000 \\  # API port number, required when a local model is used.\n  --bidx 0 \\  # Begin index (inclusive); `None` indicates no restriction.\n  --eidx None \\  # End index (exclusive); `None` indicates no restriction.\n  --temperature 0.0 \\  # Temperature for sampling.\n  --top_p 1 \\  # Top p for sampling.\n  --add_random False  # Whether to add random files.\n```\n\n**(Optional) `local_model.sh`**\n\n```bash\nCUDA_VISIBLE_DEVICES=$i python -m vllm.entrypoints.openai.api_server \\\n  --model $MODEL_PATH \\ # Local model path.\n  --served-model-name $MODEL_NAME \\ # The model name you specify.\n  --tensor-parallel-size $i \\ # Tensor parallel size.\n  --port $port # API port number, consistent with the `api_port` above.\n```\n\n**Step 3: Run the shell 
script**\n\n**(Optional)** Deploy the local model if needed.\n\n```bash\nbash local_model.sh\n```\n\nRun the shell script to start the process.\n\n```bash\nbash run_eval.sh\n```\n\n\n## 🎉 Contributors\n\n\u003ca href=\"https://github.com/zjunlp/DataMind/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=zjunlp/DataMind\" /\u003e\u003c/a\u003e\n\n\nWe deeply appreciate the collaborative efforts of everyone involved. We will continue to enhance and maintain this repository over the long term. If you encounter any issues, feel free to submit them to us!\n\n\n\n## ✍️ Citation\n\nIf you find our work helpful, please use the following citations.\n\n```bibtex\n@misc{qiao2025scalinggeneralistdataanalyticagents,\n      title={Scaling Generalist Data-Analytic Agents}, \n      author={Shuofei Qiao and Yanqiu Zhao and Zhisong Qiu and Xiaobin Wang and Jintian Zhang and Zhao Bin and Ningyu Zhang and Yong Jiang and Pengjun Xie and Fei Huang and Huajun Chen},\n      year={2025},\n      eprint={2509.25084},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2509.25084}, \n}\n\n@article{zhu2025open,\n  title={Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study},\n  author={Zhu, Yuqi and Zhong, Yi and Zhang, Jintian and Zhang, Ziheng and Qiao, Shuofei and Luo, Yujie and Du, Lun and Zheng, Da and Chen, Huajun and Zhang, Ningyu},\n  journal={arXiv preprint arXiv:2506.19794},\n  year={2025}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fdatamind","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzjunlp%2Fdatamind","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzjunlp%2Fdatamind/lists"}