{"id":15293039,"url":"https://github.com/luozhouyang/nlp-datasets","last_synced_at":"2025-04-13T12:20:59.923Z","repository":{"id":35130346,"uuid":"210538922","full_name":"luozhouyang/nlp-datasets","owner":"luozhouyang","description":"A dataset utils repository based on tf.data API.","archived":false,"fork":false,"pushed_at":"2023-03-25T00:32:00.000Z","size":50,"stargazers_count":5,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-11T01:04:36.369Z","etag":null,"topics":["data-pipeline","sequence-models","tensorflow","tfdata"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luozhouyang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-24T07:26:27.000Z","updated_at":"2024-02-29T04:53:23.000Z","dependencies_parsed_at":"2024-11-15T07:37:07.507Z","dependency_job_id":"26a788be-17e6-41a8-943a-581211d38e78","html_url":"https://github.com/luozhouyang/nlp-datasets","commit_stats":{"total_commits":32,"total_committers":3,"mean_commits":"10.666666666666666","dds":0.125,"last_synced_commit":"471e2e3ad2a4b8c363c5a4080683bac5618c8c6a"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luozhouyang%2Fnlp-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luozhouyang%2Fnlp-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luozhouyang%2Fnlp-datas
ets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luozhouyang%2Fnlp-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luozhouyang","download_url":"https://codeload.github.com/luozhouyang/nlp-datasets/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248711166,"owners_count":21149306,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-pipeline","sequence-models","tensorflow","tfdata"],"created_at":"2024-09-30T16:38:43.419Z","updated_at":"2025-04-13T12:20:59.900Z","avatar_url":"https://github.com/luozhouyang.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# datasets\nA dataset utils repository based on `tf.data`. 
**For tensorflow\u003e=2.0 only!**\n\n## Requirements\n\n* python 3.6\n* tensorflow\u003e=2.0\n\n## Installation\n\n```bash\npip install nlp-datasets\n```\n\n## Usage\n\n### seq2seq models\n\nThese models have a source sequence `x` and a target sequence `y`.\n\n```python\nfrom nlp_datasets import Seq2SeqDataset\nfrom nlp_datasets import SpaceTokenizer\nfrom nlp_datasets.utils import data_dir_utils as utils\n\nfiles = [\n    utils.get_data_file('iwslt15.tst2013.100.envi'),\n    utils.get_data_file('iwslt15.tst2013.100.envi'),\n]\nx_tokenizer = SpaceTokenizer()\nx_tokenizer.build_from_corpus([utils.get_data_file('iwslt15.tst2013.100.en')])\ny_tokenizer = SpaceTokenizer()\ny_tokenizer.build_from_corpus([utils.get_data_file('iwslt15.tst2013.100.vi')])\nconfig = {\n    'train_batch_size': 2,\n    'predict_batch_size': 2,\n    'eval_batch_size': 2,\n    'buffer_size': 100\n}\ndataset = Seq2SeqDataset(x_tokenizer, y_tokenizer, config)\n\ntrain_dataset = dataset.build_train_dataset(files)\nprint(next(iter(train_dataset)))\nprint('=' * 120)\n\neval_dataset = dataset.build_eval_dataset(files)\nprint(next(iter(eval_dataset)))\nprint('=' * 120)\n\npredict_files = [utils.get_data_file('iwslt15.tst2013.100.envi')]\npredict_dataset = dataset.build_predict_dataset(predict_files)\nprint(next(iter(predict_dataset)))\nprint('=' * 120)\n```\n\nOutput:\n\n```bash\n(\u003ctf.Tensor: id=328, shape=(2, 17), dtype=int64, numpy=\narray([[628,  18,   3,  97,  96,   4,  10,  22,  52,   2,  18, 629,   0,\n          0,   0,   0,   0],\n       [628, 428, 112,  11,  26,  16,   8,   9, 134,  40, 429, 108,   3,\n         33, 430,   2, 629]])\u003e, \u003ctf.Tensor: id=329, shape=(2, 19), dtype=int64, numpy=\narray([[640,  54, 567,  16,  56,  83,   6,  15,  10,   9,   3,  54, 641,\n          0,   0,   0,   0,   0,   0],\n       [640, 181, 472, 291,  27,  47,  37, 112, 155, 188, 254,  45, 473,\n         18,   1, 121, 145,   3, 
641]])\u003e)\n========================================================================================================================\n(\u003ctf.Tensor: id=633, shape=(2, 21), dtype=int64, numpy=\narray([[628,  42, 224,  30, 156,  59, 611, 612,   1,   5,  50,  81, 225,\n         42, 613,  78, 208,   9, 614,   2, 629],\n       [628,  91, 117, 448,   6,  27,  11,  26,  16,   8,  28, 449,   1,\n          3, 200,   9, 450,   2, 629,   0,   0]])\u003e, \u003ctf.Tensor: id=634, shape=(2, 26), dtype=int64, numpy=\narray([[640, 107,  12, 150, 312,  34, 101, 106, 325, 632, 317,   2,   5,\n        633, 307,  35, 177, 107, 156, 175, 173,  85, 634,   3, 641,   0],\n       [640, 225, 132,  21, 489, 490,  18,  27,  47,  37,  91,  22,  66,\n         12, 491, 297,  70, 115,   1,   7, 204,   4, 298, 299,   3, 641]])\u003e)\n========================================================================================================================\ntf.Tensor(\n[[628  75   3   8  98   1   3  43   7  76   8   4 131  57   4 226   1   5\n    3 227 132 228   9 229 230  18 231 232 233   2  18 629]\n [628 133   3   8  58 234   2 629   0   0   0   0   0   0   0   0   0   0\n    0   0   0   0   0   0   0   0   0   0   0   0   0   0]], shape=(2, 32), dtype=int64)\n========================================================================================================================\n```\n\n### sequence match models\n\nThese models have two sequences as input, `x` and `y`, and a label `z`.\n\n```python\nfrom nlp_datasets import SeqMatchDataset\nfrom nlp_datasets import SpaceTokenizer\nfrom nlp_datasets.utils import data_dir_utils as utils\n\nfiles = [\n    utils.get_data_file('dssm.query.doc.label.txt'),\n    utils.get_data_file('dssm.query.doc.label.txt'),\n]\nx_tokenizer = SpaceTokenizer()\nx_tokenizer.build_from_vocab(utils.get_data_file('dssm.vocab.txt'))\ny_tokenizer = SpaceTokenizer()\ny_tokenizer.build_from_vocab(utils.get_data_file('dssm.vocab.txt'))\n\nconfig = {\n    
'train_batch_size': 2,\n    'eval_batch_size': 2,\n    'predict_batch_size': 2,\n    'buffer_size': 100,\n}\ndataset = SeqMatchDataset(x_tokenizer, y_tokenizer, config)\n\ntrain_dataset = dataset.build_train_dataset(files)\nprint(next(iter(train_dataset)))\nprint('=' * 120)\n\neval_dataset = dataset.build_eval_dataset(files)\nprint(next(iter(eval_dataset)))\nprint('=' * 120)\n\npredict_files = [utils.get_data_file('dssm.query.doc.label.txt')]\npredict_dataset = dataset.build_predict_dataset(predict_files)\nprint(next(iter(predict_dataset)))\nprint('=' * 120)\n```\n\nOutput:\n\n```bash\n(\u003ctf.Tensor: id=514, shape=(2, 5), dtype=int64, numpy=\narray([[10,  1,  3,  4, 11],\n       [10,  1,  3,  4, 11]])\u003e, \u003ctf.Tensor: id=515, shape=(2, 11), dtype=int64, numpy=\narray([[10,  0,  1,  2,  7,  5,  8,  6,  3,  9, 11],\n       [10,  0,  1,  2,  7,  5,  8,  6,  3,  9, 11]])\u003e, \u003ctf.Tensor: id=516, shape=(2,), dtype=int64, numpy=array([1, 0])\u003e)\n========================================================================================================================\n(\u003ctf.Tensor: id=920, shape=(2, 5), dtype=int64, numpy=\narray([[10,  1,  3,  4, 11],\n       [10,  1,  3,  4, 11]])\u003e, \u003ctf.Tensor: id=921, shape=(2, 11), dtype=int64, numpy=\narray([[10,  0,  1,  2,  7,  5,  8,  6,  3,  9, 11],\n       [10,  0,  1,  2,  7,  5,  8,  6,  3,  9, 11]])\u003e, \u003ctf.Tensor: id=922, shape=(2,), dtype=int64, numpy=array([0, 1])\u003e)\n========================================================================================================================\n(\u003ctf.Tensor: id=1206, shape=(2, 5), dtype=int64, numpy=\narray([[10,  1,  3,  4, 11],\n       [10,  1,  3,  4, 11]])\u003e, \u003ctf.Tensor: id=1207, shape=(2, 11), dtype=int64, numpy=\narray([[10,  0,  1,  2,  7,  5,  8,  6,  3,  9, 11],\n       [10,  0,  1,  2,  7,  5,  8,  6,  3,  9, 
11]])\u003e)\n========================================================================================================================\n```\n\n### sequence classify model\n\nThese models have an input sequence `x` and an output label `y`.\n\n```python\nfrom nlp_datasets import SeqClassifyDataset\nfrom nlp_datasets import SpaceTokenizer\nfrom nlp_datasets.utils import data_dir_utils as utils\n\nfiles = [\n    utils.get_data_file('classify.seq.label.txt')\n]\nx_tokenizer = SpaceTokenizer()\nx_tokenizer.build_from_corpus([utils.get_data_file('classify.seq.txt')])\n\nconfig = {\n    'train_batch_size': 2,\n    'eval_batch_size': 2,\n    'predict_batch_size': 2,\n    'buffer_size': 100\n}\ndataset = SeqClassifyDataset(x_tokenizer, config)\n\ntrain_dataset = dataset.build_train_dataset(files)\nprint(next(iter(train_dataset)))\nprint('=' * 120)\n\neval_dataset = dataset.build_eval_dataset(files)\nprint(next(iter(eval_dataset)))\nprint('=' * 120)\n\npredict_files = [utils.get_data_file('classify.seq.txt')]\npredict_dataset = dataset.build_predict_dataset(predict_files)\nprint(next(iter(predict_dataset)))\nprint('=' * 120)\n```\n\nOutput:\n\n```bash\n(\u003ctf.Tensor: id=349, shape=(2, 7), dtype=int64, numpy=\narray([[7, 1, 4, 5, 6, 2, 8],\n       [7, 1, 3, 2, 8, 0, 0]])\u003e, \u003ctf.Tensor: id=350, shape=(2,), dtype=int64, numpy=array([0, 1])\u003e)\n========================================================================================================================\n(\u003ctf.Tensor: id=601, shape=(2, 7), dtype=int64, numpy=\narray([[7, 1, 3, 2, 8, 0, 0],\n       [7, 1, 4, 5, 6, 2, 8]])\u003e, \u003ctf.Tensor: id=602, shape=(2,), dtype=int64, numpy=array([1, 0])\u003e)\n========================================================================================================================\ntf.Tensor(\n[[7 1 3 2 8 0 0]\n [7 1 4 5 6 2 8]], shape=(2, 7), 
dtype=int64)\n========================================================================================================================\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluozhouyang%2Fnlp-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluozhouyang%2Fnlp-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluozhouyang%2Fnlp-datasets/lists"}