{"id":15051518,"url":"https://github.com/azure99/blossomdata","last_synced_at":"2026-01-31T09:04:28.368Z","repository":{"id":255475676,"uuid":"852238050","full_name":"Azure99/BlossomData","owner":"Azure99","description":"A fluent, scalable, and easy-to-use LLM data processing framework.","archived":false,"fork":false,"pushed_at":"2025-04-09T14:28:22.000Z","size":213,"stargazers_count":16,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-09T15:40:04.556Z","etag":null,"topics":["data-engineering","data-science","fine-tuning","gpt","llama","llm","nlp","supervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Azure99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-04T13:17:59.000Z","updated_at":"2025-04-09T14:28:26.000Z","dependencies_parsed_at":"2024-09-14T02:28:04.221Z","dependency_job_id":"89b60576-2ba1-45e8-bf4d-5a478ac2ebc2","html_url":"https://github.com/Azure99/BlossomData","commit_stats":{"total_commits":28,"total_committers":2,"mean_commits":14.0,"dds":0.0714285714285714,"last_synced_commit":"48bf2e21ef2cd34be20f1bab16a4f08758db5c64"},"previous_names":["azure99/blossomdata"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Azure99%2FBlossomData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Azure99%2FBlossomData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/Azure99%2FBlossomData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Azure99%2FBlossomData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Azure99","download_url":"https://codeload.github.com/Azure99/BlossomData/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248147717,"owners_count":21055545,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-science","fine-tuning","gpt","llama","llm","nlp","supervised-learning"],"created_at":"2024-09-24T21:36:30.709Z","updated_at":"2026-01-31T09:04:28.363Z","avatar_url":"https://github.com/Azure99.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BlossomData\n\n中文 | [English](README_EN.md)\n\n![](https://img.shields.io/badge/language-Python-214870)\n![GitHub License](https://img.shields.io/github/license/Azure99/BlossomData)\n[![PyPI - Version](https://img.shields.io/pypi/v/blossom-data)](https://pypi.org/project/blossom-data/)\n[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/Azure99/BlossomData)\n\nBlossomData is a data processing framework built for large language model training, designed to provide flexible and efficient data processing. It ships with a rich operator library that supports processing and synthesizing data across modalities and training stages. Pipelines migrate seamlessly from a single machine to a distributed environment, and users can add custom operators, which significantly reduces the development and compute cost of data processing pipelines.\n\n# Usage\n\n```bash\npip install blossom-data\n# Install from source\n# pip install git+https://github.com/Azure99/BlossomData.git\n```\n\nBefore use, configure the API keys and related parameters of your model service providers in `config.yaml` (see `config.yaml.example` for reference).\n\nConfiguration lookup order (the first match wins):\n- The path specified by the `BLOSSOM_CONFIG` environment variable\n- `config.yaml` in the current working directory\n- The user home directory file 
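`~/.blossom.yaml`\n\n## Your First Data Synthesis Task\n\nThe following practical example synthesizes verified, long-reasoning Chinese training data from nothing more than math problems and reference answers.\n\nThe framework provides many built-in operators, listed in [blossom.op](src/blossom/op/__init__.py). For example, when the raw data lacks answers, you can generate responses with [ChatDistiller](src/blossom/op/chat/chat_distiller.py) and then use [ChatReasoningConsistencyFilter](src/blossom/op/chat/chat_reasoning_consistency_filter.py) to filter out potentially inconsistent or incorrect samples through an LLM-based consistency check.\n\n```python\nfrom blossom import *\nfrom blossom.dataframe import *\nfrom blossom.op import *\nfrom blossom.schema import *\n\n# Sample math instruction data\ndata = [\n    ChatSchema(\n        messages=[\n            user(\"Suppose that $wz = 12-8i$, and $|w| = \\\\sqrt{13}$. What is $|z|$?\"),\n            assistant(\"4\"),\n        ]\n    ),\n]\n\n# Define the operators to use\nops = [\n    # Translate the instruction data into Chinese with a non-reasoning model\n    ChatTranslator(model=\"deepseek-chat\", target_language=\"Chinese\"),\n    # Generate answers with a reasoning model and verify their correctness with a non-reasoning model\n    ChatVerifyDistiller(\n        model=\"deepseek-reasoner\",\n        mode=ChatVerifyDistiller.Mode.LLM,\n        validation_model=\"deepseek-chat\",\n    ),\n    # Merge reasoning_content into content so it can be used for training\n    ChatReasoningContentMerger(),\n]\n\ndataset = create_dataset(data)\nresult = dataset.execute(ops).collect()\nprint(result)\n```\n\n# Core Concepts\n\n## Schema\n\nSchema is the framework's basic data structure, used to represent and process different types of data. All schemas inherit from the base Schema class, which provides a unified interface and functionality.\n\n```python\n# Text data\ntext_data = TextSchema(content=\"This is a piece of text\")\n\n# Chat data\nchat_data = ChatSchema(\n    messages=[\n        user(\"Hello\"),\n        assistant(\"Hello, how can I help you?\")\n    ]\n)\n\n# Structured data\nrow_data = RowSchema(data={\"name\": \"Zhang San\", \"age\": 30, \"score\": 95})\n\n# Custom data\ncustom_data = CustomSchema(data=1)\n```\n\nEvery Schema instance includes the following common fields:\n\n- `id`: unique identifier\n- `type`: Schema type identifier\n- `failed`: flag indicating that processing failed\n- `failure_reason`: the failure reason; records the specific failure details when failed=True\n- `metadata`: additional metadata, of type dict\n\n## 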
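DataFrame and Dataset\n\nDataFrame is an abstract representation of data that provides interfaces for transforming, filtering, and aggregating it. The framework supports several DataFrame implementations (Local, Multiprocess, Spark, and Ray), so the same code can run on different execution engines.\n\nDataset is a higher-level wrapper around DataFrame that offers a more convenient interface and extra functionality, notably operator support. Dataset is the main interface users interact with, and it hides the complexity of the underlying execution engine.\n\nTo keep data types and structures consistent, the framework recommends strongly typed schemas. For existing datasets, you can implement a custom [DataHandler](src/blossom/dataframe/data_handler.py) to customize how data in different formats is loaded and saved.\n\n```python\n# Create a dataset\ndataset = create_dataset(data, engine=DatasetEngine.LOCAL)\n# Or load from the file system, using Ray as the execution engine and the default data handler\ndataset = load_dataset(\n    \"example/data/chat.jsonl\",\n    engine=DatasetEngine.RAY,\n    data_handler=DefaultDataHandler(),\n)\ndataset = (\n    # Repartition the data\n    dataset.repartition(2)\n    # One-to-one map: take the user prompt from the first turn and convert it to text data\n    # .map(lambda x: TextSchema(content=x.messages[0].content))\n    # Keep only conversations whose language is English\n    .filter(lambda x: x.metadata[\"language\"] == \"en\")\n    # Sort by feedback score in descending order\n    .sort(lambda x: x.metadata[\"feedback\"], ascending=False)\n    # Add metadata: compute the total context length from the message contents\n    .add_metadata(\n        lambda x: {\n            \"context_length\": sum(len(message.content) for message in x.messages)\n        }\n    )\n    # Keep at most 4 records\n    .limit(4)\n    # Execute a series of operators\n    .execute(\n        [\n            # Translation operator: translate the data into Chinese\n            ChatTranslator(model=\"gpt-4o-mini\", target_language=\"Chinese\"),\n            # Distillation operator: regenerate responses for the translated data with the model, with a concurrency of 4\n            ChatDistiller(model=\"gpt-4o-mini\", parallel=4),\n        ]\n    )\n    # Cache the dataset to avoid recomputation across multiple downstream branches\n    .cache()\n)\n\n# Collect the results (for large datasets, use write_json instead)\nresults = dataset.collect()\n# Write the results to a file\ndataset.write_json(\"output.jsonl\")\n\n# Aggregation\nagg_result = dataset.aggregate(\n    Sum(lambda x: x.metadata[\"context_length\"]),\n    Mean(lambda x: x.metadata[\"context_length\"]),\n    Count(),\n)\n\n# Aggregate after grouping\ngrouped_count = dataset.group_by(lambda x: x.metadata[\"country\"]).count().collect()\n\nprint(results)\nprint(agg_result)\nprint(grouped_count)\n```\n\n## Operator\n\nAn Operator is the core unit of data processing and encapsulates a specific piece of processing logic. The framework provides three basic operator types:\n\n1. **MapOperator**: one-to-one mapping; processes each element independently\n2. 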
**FilterOperator**: filtering; decides whether to keep or drop each element\n3. **TransformOperator**: many-to-many mapping; can operate on multiple elements at once\n\n```python\nfrom blossom.op import *\nfrom blossom.schema import *\nfrom blossom.util import loads_markdown_first_json\n\n# Define custom operators with decorators; the first parameter is the element\n@map_operator()\ndef uppercase_text(item):\n    item.content = item.content.upper()\n    return item\n\n@filter_operator()\ndef filter_short_text(item):\n    return len(item.content) \u003e 10\n\n@transform_operator()\ndef batch_process(items):\n    return [item for item in items if \"keyword\" in item.content]\n\n# Operators that call a model through the context use decorators whose names start with context; the first parameter is the context and the second is the element\n# Pass parallel to set the concurrency\n@context_map_operator(parallel=4)\ndef translate_with_model(context, item):\n    result = context.chat_completion(\n        \"gpt-4o-mini\",\n        [user(f\"Translate to Chinese: {item.content}\")]\n    )\n    return TextSchema(content=result)\n\n# Implement a custom operator by subclassing an Operator class\n# A map operator must implement process_item and can access the context through self.context\nclass SelfQA(MapOperator):\n    def process_item(self, item):\n        self_qa_prompt = (\n            \"Based on the given text, generate a question of your choice and a corresponding long answer.\\n\"\n            \"Your output should be a JSON object with two string fields, question and answer, and no other explanation.\\n\"\n            f\"Given text: {item.content}\"\n        )\n        raw_result = self.context.chat_completion(\"gpt-4o-mini\", [user(self_qa_prompt)])\n        result = loads_markdown_first_json(raw_result)\n        return ChatSchema(\n            messages=[\n                user(result[\"question\"]),\n                assistant(result[\"answer\"]),\n            ]\n        )\n```\n\n## Context\n\nContext provides the environment and resources an operator needs at run time, including configuration and access to model providers. It bridges operators and external resources, giving operators access to model services, configuration parameters, and more.\n\n```python\n# Create a context\ncontext = Context()\n\n# Access the configuration\nconfig = context.get_config()\n\n# Get a model provider\nprovider = context.get_model(\"gpt-4o-mini\")\n\n# Generate content with a model\nresponse = context.chat_completion(\n    \"gpt-4o-mini\",\n    [user(\"Hello, what's the weather like today?\")]\n)\n\n# Generate an embedding vector\nembedding = context.embedding(\"text-embedding-3-small\", 
\"这是一段文本\")\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fazure99%2Fblossomdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fazure99%2Fblossomdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fazure99%2Fblossomdata/lists"}