{"id":13738630,"url":"https://github.com/vahidk/tfrecord","last_synced_at":"2025-05-05T07:51:05.102Z","repository":{"id":38150834,"uuid":"224102642","full_name":"vahidk/tfrecord","owner":"vahidk","description":"Standalone TFRecord reader/writer with PyTorch data loaders","archived":false,"fork":false,"pushed_at":"2024-08-20T15:25:37.000Z","size":54,"stargazers_count":883,"open_issues_count":4,"forks_count":108,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-19T19:53:49.185Z","etag":null,"topics":["dataset","loader","pytorch","tensorflow","tfrecord"],"latest_commit_sha":null,"homepage":"https://twitter.com/VahidK","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vahidk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-11-26T04:31:11.000Z","updated_at":"2025-04-14T12:48:58.000Z","dependencies_parsed_at":"2023-02-18T07:45:37.348Z","dependency_job_id":"740faa3c-c402-4c0b-901c-2607b4f3422c","html_url":"https://github.com/vahidk/tfrecord","commit_stats":{"total_commits":39,"total_committers":10,"mean_commits":3.9,"dds":"0.33333333333333337","last_synced_commit":"74b2d24a838081356d993ec0e147eaf59ccd4c84"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2Ftfrecord","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2Ftfrecord/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2Ftfrecord/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vahidk%2Ftfrecord/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vahidk","download_url":"https://codeload.github.com/vahidk/tfrecord/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252463299,"owners_count":21751758,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","loader","pytorch","tensorflow","tfrecord"],"created_at":"2024-08-03T03:02:29.810Z","updated_at":"2025-05-05T07:51:05.086Z","avatar_url":"https://github.com/vahidk.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# TFRecord reader and writer\n\nThis library allows reading and writing TFRecord files efficiently in Python. The library also provides an IterableDataset reader of TFRecord files for PyTorch. Currently, uncompressed and gzip-compressed TFRecords are supported.\n\n## Installation\n\n```\npip3 install 'tfrecord[torch]'\n```\n\n## Usage\n\nIt's recommended to create an index file for each TFRecord file. An index file must be provided when using multiple workers; otherwise the loader may return duplicate records. 
You can create an index file for an individual TFRecord file with this utility program:\n```\npython3 -m tfrecord.tools.tfrecord2idx \u003ctfrecord path\u003e \u003cindex path\u003e\n```\n\nTo create \"*.tfindex\" files for all \"*.tfrecord\" files in a directory, run:\n```\ntfrecord2idx \u003cdata dir\u003e\n```\n\n## Reading \u0026 Writing tf.train.Example\n\n### Reading tf.Example records in PyTorch\nUse TFRecordDataset to read TFRecord files in PyTorch.\n```python\nimport torch\nfrom tfrecord.torch.dataset import TFRecordDataset\n\ntfrecord_path = \"/tmp/data.tfrecord\"\nindex_path = None\ndescription = {\"image\": \"byte\", \"label\": \"float\"}\ndataset = TFRecordDataset(tfrecord_path, index_path, description)\nloader = torch.utils.data.DataLoader(dataset, batch_size=32)\n\ndata = next(iter(loader))\nprint(data)\n```\n\nUse MultiTFRecordDataset to read multiple TFRecord files. This class samples from the given TFRecord files with the given probabilities.\n```python\nimport torch\nfrom tfrecord.torch.dataset import MultiTFRecordDataset\n\ntfrecord_pattern = \"/tmp/{}.tfrecord\"\nindex_pattern = \"/tmp/{}.index\"\nsplits = {\n    \"dataset1\": 0.8,\n    \"dataset2\": 0.2,\n}\ndescription = {\"image\": \"byte\", \"label\": \"int\"}\ndataset = MultiTFRecordDataset(tfrecord_pattern, index_pattern, splits, description)\nloader = torch.utils.data.DataLoader(dataset, batch_size=32)\n\ndata = next(iter(loader))\nprint(data)\n```\n\n### Infinite and finite PyTorch dataset\n\nBy default, `MultiTFRecordDataset` is infinite, meaning that it samples the data forever. 
You can make it finite by providing the appropriate flag:\n```\ndataset = MultiTFRecordDataset(..., infinite=False)\n```\n\n### Shuffling the data\n\nBoth TFRecordDataset and MultiTFRecordDataset automatically shuffle the data when you provide a queue size.\n```\ndataset = TFRecordDataset(..., shuffle_queue_size=1024)\n```\n\n### Transforming input data\n\nYou can optionally pass a function as the `transform` argument to post-process features before they are returned.\nThis can be used, for example, to decode images, normalize colors to a certain range, or pad variable-length sequences.\n\n```python\nimport tfrecord\nimport cv2\n\ndef decode_image(features):\n    # get BGR image from bytes\n    features[\"image\"] = cv2.imdecode(features[\"image\"], -1)\n    return features\n\n\ndescription = {\n    \"image\": \"byte\",\n}\n\ndataset = tfrecord.torch.TFRecordDataset(\"/tmp/data.tfrecord\",\n                                         index_path=None,\n                                         description=description,\n                                         transform=decode_image)\n\ndata = next(iter(dataset))\nprint(data)\n```\n\n### Writing tf.Example records in Python\n```python\nimport tfrecord\n\n# image_bytes, label and index are placeholders for your own data.\nwriter = tfrecord.TFRecordWriter(\"/tmp/data.tfrecord\")\nwriter.write({\n    \"image\": (image_bytes, \"byte\"),\n    \"label\": (label, \"float\"),\n    \"index\": (index, \"int\")\n})\nwriter.close()\n```\n\n### Reading tf.Example records in Python\n```python\nimport tfrecord\n\nloader = tfrecord.tfrecord_loader(\"/tmp/data.tfrecord\", None, {\n    \"image\": \"byte\",\n    \"label\": \"float\",\n    \"index\": \"int\"\n})\nfor record in loader:\n    print(record[\"label\"])\n```\n\n## Reading \u0026 Writing tf.train.SequenceExample\n\nSequenceExamples can be read and written using the same methods shown above with an extra argument\n(`sequence_description` for reading and `sequence_datum` for writing), which causes the respective\nread/write functions to treat the 
data as a SequenceExample.\n\n### Writing SequenceExamples to file\n\n```python\nimport tfrecord\n\nwriter = tfrecord.TFRecordWriter(\"/tmp/data.tfrecord\")\nwriter.write({'length': (3, 'int'), 'label': (1, 'int')},\n             {'tokens': ([[0, 0, 1], [0, 1, 0], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1, 1], 'int')})\nwriter.write({'length': (3, 'int'), 'label': (1, 'int')},\n             {'tokens': ([[0, 0, 1], [1, 0, 0]], 'int'), 'seq_labels': ([0, 1], 'int')})\nwriter.close()\n```\n\n### Reading SequenceExamples in Python\n\nReading from a SequenceExample yields a tuple of two elements: the context features and the sequence features.\n\n```python\nimport tfrecord\n\ncontext_description = {\"length\": \"int\", \"label\": \"int\"}\nsequence_description = {\"tokens\": \"int\", \"seq_labels\": \"int\"}\nloader = tfrecord.tfrecord_loader(\"/tmp/data.tfrecord\", None,\n                                  context_description,\n                                  sequence_description=sequence_description)\n\nfor context, sequence_feats in loader:\n    print(context[\"label\"])\n    print(sequence_feats[\"seq_labels\"])\n```\n\n### Reading SequenceExamples in PyTorch\n\nAs described in the section on `Transforming input data`, one can pass a function as the `transform` argument to\nperform post processing of features. 
This is especially useful for the sequence features, as these are\nvariable-length sequences and need to be padded before being batched.\n\n```python\nimport torch\nimport numpy as np\nfrom tfrecord.torch.dataset import TFRecordDataset\n\nPAD_WIDTH = 5\ndef pad_sequence_feats(data):\n    # Pad every sequence feature along its first axis up to PAD_WIDTH steps.\n    context, features = data\n    for k, v in features.items():\n        features[k] = np.pad(v, ((0, PAD_WIDTH - len(v)), (0, 0)), 'constant')\n    return (context, features)\n\ncontext_description = {\"length\": \"int\", \"label\": \"int\"}\nsequence_description = {\"tokens\": \"int\", \"seq_labels\": \"int\"}\ndataset = TFRecordDataset(\"/tmp/data.tfrecord\",\n                          index_path=None,\n                          description=context_description,\n                          transform=pad_sequence_feats,\n                          sequence_description=sequence_description)\nloader = torch.utils.data.DataLoader(dataset, batch_size=32)\ndata = next(iter(loader))\nprint(data)\n```\n\nAlternatively, you could choose to implement a custom `collate_fn` in order to assemble the batch,\nfor example, to perform dynamic padding.\n\n```python\nimport torch\nimport numpy as np\nfrom tfrecord.torch.dataset import TFRecordDataset\n\ndef collate_fn(batch):\n    from torch.utils.data._utils import collate\n    from torch.nn.utils import rnn\n    context, feats = zip(*batch)\n    feats_ = {k: [torch.Tensor(d[k]) for d in feats] for k in feats[0]}\n    # Pad each sequence feature to the longest sequence in the batch.\n    return (collate.default_collate(context),\n            {k: rnn.pad_sequence(f, batch_first=True) for (k, f) in feats_.items()})\n\ncontext_description = {\"length\": \"int\", \"label\": \"int\"}\nsequence_description = {\"tokens\": \"int\", \"seq_labels\": \"int\"}\ndataset = TFRecordDataset(\"/tmp/data.tfrecord\",\n                          index_path=None,\n                          description=context_description,\n                          sequence_description=sequence_description)\nloader = torch.utils.data.DataLoader(dataset, batch_size=32, collate_fn=collate_fn)\ndata = 
next(iter(loader))\nprint(data)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvahidk%2Ftfrecord","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvahidk%2Ftfrecord","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvahidk%2Ftfrecord/lists"}