{"id":16183689,"url":"https://github.com/yinochaos/datasets","last_synced_at":"2026-04-15T15:41:24.000Z","repository":{"id":57442252,"uuid":"293019728","full_name":"yinochaos/datasets","owner":"yinochaos","description":"easy used dataset for tf.keras, pytorch .etc","archived":false,"fork":false,"pushed_at":"2021-02-06T14:41:14.000Z","size":124,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-13T15:28:16.387Z","etag":null,"topics":["keras-tensorflow","machine-learning","pytorch","tensorflow"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/ml-dataset/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yinochaos.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-05T06:44:39.000Z","updated_at":"2021-02-06T14:41:17.000Z","dependencies_parsed_at":"2022-09-26T17:21:08.742Z","dependency_job_id":null,"html_url":"https://github.com/yinochaos/datasets","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yinochaos%2Fdatasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yinochaos%2Fdatasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yinochaos%2Fdatasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yinochaos%2Fdatasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yinochaos","download_url":"https://codeload.github.com/yinochaos/datasets/t
ar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247655250,"owners_count":20974148,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["keras-tensorflow","machine-learning","pytorch","tensorflow"],"created_at":"2024-10-10T07:05:58.593Z","updated_at":"2026-04-15T15:41:18.950Z","avatar_url":"https://github.com/yinochaos.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# datasets\n\n[![image](https://img.shields.io/pypi/v/ml-dataset.svg)](https://pypi.python.org/pypi/ml-dataset)\n[![Build Status](https://travis-ci.com/yinochaos/datasets.svg?branch=master)](https://travis-ci.com/yinochaos/datasets)\n[![Documentation Status](https://readthedocs.org/projects/ml-dataset/badge/?version=latest)](https://ml-dataset.readthedocs.io/en/latest/?badge=latest)\n\n-   Free software: Apache Software License 2.0\n-   Documentation: \u003chttps://ml-dataset.readthedocs.io\u003e.\n\ndatasets for easy machine learning use\nThis project aims to provide a concise, convenient dataset wrapper, so that a dataset can be built with a small amount of code and fed to a model for training:\n- Supports both TensorFlow and PyTorch\n- Supports reading data from local storage, HDFS, and other network interfaces\n- Supports data file formats such as text, lmdb, and tfrecord (which can greatly improve GPU utilization during training)\n- Supports basic processing of text, image, and speech data\n- Supports multiple data augmentation methods (TODO: not yet implemented)\n- Can inspect the format of an input data file directly and generate code suited to that data (TODO: not yet implemented)\n\n\n## Datasets: key data structures\n--------\n- DataSchema : a struct that describes the data schema\n```\nname : data name [each name must be unique]\nprocessor : the processing function to apply to this column; see data_processor_dicts, which includes functions for arrays, text, images, speech, etc.\ntype : only used by TFDataset; this dependency may be removed later\ndtype : numpy data type\nshape : shape of the data after processing\ntoken_dict_name : the required token dictionary【e.g. 
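】\nis_with_len : for variable-length sequences, whether to also output the size of the variable-length dimension\nmax_len : maximum length setting for fixed-length arrays\n```\n- TFDataset, PTDataset: dataset data structures\n    - generate_dataset() : the interface that produces the dataset\n- Parser : data parser\n  - TextlineParser\n\n### data_processor_dicts\ndata_processor_dicts is a dictionary of data processing functions. It contains many functions that perform feature extraction, data conversion, and similar processing for different data types (e.g. text, speech, images, numeric values); the results are ultimately converted into a Dataset that is fed to the model for training, prediction, and other operations.\n\n## example data\n--------\n- tests/data/raw_datasets/query_float.input format: id\\tlabel\\tquery\\tfloats\n```\n1  1  面 对 疫 情  0.12 0.34 0.87 0.28\n2  0  球 王 马 拉 多 纳 目 前 处 在 隔 离 之 中  0.12 0.34 0.87 0.28\n```\nFor this dataset, create a feature_schema_list and a label_schema (named data_field_list and label_field in the code below); see [code](https://github.com/yinochaos/datasets/blob/master/tests/test_tf_datasets.py#L82) for the complete code.\n```python\nfrom datasets import TextlineParser, TFDataset\nfrom datasets.utils import TokenDicts, DataSchema\n\ntoken_dicts = TokenDicts('tests/data/dicts', {'query': 0})\ndata_field_list = []\n# param = [\"name\", \"processor\", \"type\", \"dtype\", \"shape\", \"max_len\", \"token_dict_name\"]\ndata_field_list.append(DataSchema(name='query', processor='to_tokenid',\n                                    dtype='int32', shape=(None,), is_with_len=True, token_dict_name='query'))\n\"\"\"\nThis DataSchema describes the data and processing logic as follows:\n- the data name is query\n- the to_tokenid function processes the data; afterwards the shape is (None,), the dtype is int32, and the token dictionary name is query\n- is_with_len=True means that, for variable-length data, the actual size of the variable-length dimension is also produced\n\"\"\"\ndata_field_list.append(DataSchema(\n    name='width', processor='to_np', dtype='float32', shape=(4)))\n\"\"\"\nThis DataSchema describes the data and processing logic as follows:\n- the data name is width\n- the to_np function processes the data; afterwards the shape is (4) and the dtype is float32\n\"\"\"\nlabel_field = DataSchema(name='label', processor='to_np', dtype='float32', shape=(1,))\n\"\"\"\nThis DataSchema describes the data and processing logic as follows:\n- the data name is label\n- the to_np function processes the data; afterwards the shape is (1,) and the dtype is float32\n\"\"\"\n# Create a parser that handles ordinary single-line data input\nparser = TextlineParser(token_dicts, data_field_list, label_field)\n# Create a generator that processes the files under file_path whose suffix is file_suffix\ngenerator = TFDataset(parser=parser, file_path='tests/data/raw_datasets', file_suffix='query_float.input')\n# Generate the dataset\ndataset = generator.generate_dataset(\n    batch_size=12, 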
num_epochs=1, is_shuffle=False)\n# Iterate over the dataset\nfor _ in enumerate(dataset):\n    pass\n\n```\n- tests/data/raw_datasets/varnum.input format: id\\tlabel\\tnums\n```\n1  2  2 3 4 6 8 23 435 234 12 234 234\n1  2  2 3 4 6 8 23 4 2 9 4 5 6 2 4\n1  2  2 3 4 6 8 23 45 24 12 234 234\n```\nFor this dataset, create a feature_schema_list and a label_schema (named data_field_list and label_field in the code below); see [code](https://github.com/yinochaos/datasets/blob/master/tests/test_tf_datasets.py#L100) for the complete code.\n```python\nfrom datasets import TextlineParser, TFDataset\nfrom datasets.utils import DataSchema\n\ntoken_dicts = None\ndata_field_list = []\ndata_field_list.append(DataSchema(name='query', processor='to_np',\n                                    dtype='int32', shape=(None,), is_with_len=True))\nlabel_field = DataSchema(name='label', processor='to_np', dtype='float32', shape=(1,), is_with_len=False)\nparser = TextlineParser(token_dicts, data_field_list, label_field)\ngenerator = TFDataset(parser=parser, file_path='tests/data/raw_datasets', file_suffix='varnum.input')\n# Generate the dataset\ndataset = generator.generate_dataset(\n    batch_size=12, num_epochs=1, is_shuffle=False)\n# Iterate over the dataset\nfor _ in enumerate(dataset):\n    pass\n```\nFor iterating over a dataset, see [pass_dataset](https://github.com/yinochaos/datasets/blob/dfacaca19a04dccf43575aadfe85c2001e88047a/tests/test_tf_datasets.py#L36)\n```python\ndef pass_dataset(self, is_training, weight_fn, dataset):\n    if weight_fn:\n        if is_training:\n            for batch_num, (x, label, weight) in enumerate(dataset):\n                print('x', x)\n                print('weight', weight)\n                for d in x:\n                    print('d.shape', d.shape)\n                print('label.shape', label.shape)\n                print('batch_num', batch_num)\n                break\n    else:\n        if is_training:\n            for batch_num, (x, label) in enumerate(dataset):\n                print('x', x)\n                for d in x:\n                    print('d.shape', d.shape)\n                print('label.shape', label.shape)\n                print('batch_num', batch_num)\n                break\n   
     else:\n            for batch_num, (info, x, label) in enumerate(dataset):\n                print('info', info)\n                print('x', x)\n                for d in x:\n                    print('d.shape', d.shape)\n                print('label.shape', label.shape)\n                print('batch_num', batch_num)\n                break\n```\n# Interactive automatic code generation\nRun python -m datasets.cli. The Chinese prompts ask, in order, for the code output directory, the input data file path, and the input data file suffix.\n```\n$ python -m datasets.cli\n输入代码生成目录 : outputs\n请输入输入数据文件的文件路径:tests/data/raw_datasets/\n输入数据文件后缀 : header_text_seq2seq.input\nlabel schema: [['to_tokenid', 'text', 'int', [None]]]\nfeature schema: [['to_tokenid', 'text', 'int', [None]]]\ninput dict name forlabel :q\ninput dict name for feature :query :q\n```\ndataset_reader.py is then generated in the outputs folder, with the following content:\n```python\nimport tensorflow as tf\nfrom datasets import TextlineParser\nfrom datasets import TFDataset\nfrom datasets.utils import TokenDicts, DataSchema\nfile_path = 'tests/data/raw_datasets/'\nfile_suffix = 'header_text_seq2seq.input'\nbatch_size = 64\nnum_epochs = 1\nis_shuffle = True\ntoken_dicts = TokenDicts('dicts', {'q':0})\nlabel_schema_list = []\nlabel_schema_list.append(DataSchema(name='label', processor='to_tokenid', dtype='int', shape=(None,), token_dict_name='q'))\nfeature_schema_list = []\nfeature_schema_list.append(DataSchema(name='query', processor='to_tokenid', dtype='int', shape=(None,), token_dict_name='q'))\nparser = TextlineParser(token_dicts, feature_schema_list, label_schema_list)\ngenerator = TFDataset(parser=parser, file_path=file_path, file_suffix=file_suffix)\ndataset = generator.generate_dataset(batch_size=batch_size, num_epochs=num_epochs, is_shuffle=is_shuffle)\nfor _ in enumerate(dataset):\n    pass\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyinochaos%2Fdatasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyinochaos%2Fdatasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyinochaos%2Fdatasets/lists"}