{"id":26477468,"url":"https://github.com/yonetaniryo/numpy2tfrecord","last_synced_at":"2025-03-20T00:47:04.759Z","repository":{"id":44951791,"uuid":"448284895","full_name":"yonetaniryo/numpy2tfrecord","owner":"yonetaniryo","description":"Simple helper library to convert numpy data to tfrecord and build a tensorflow dataset","archived":false,"fork":false,"pushed_at":"2023-03-26T07:39:32.000Z","size":26,"stargazers_count":3,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-18T13:35:13.396Z","etag":null,"topics":["numpy","tensorflow","tfrecord"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yonetaniryo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-01-15T13:27:55.000Z","updated_at":"2023-09-20T06:48:42.000Z","dependencies_parsed_at":"2022-09-11T07:11:58.829Z","dependency_job_id":null,"html_url":"https://github.com/yonetaniryo/numpy2tfrecord","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yonetaniryo%2Fnumpy2tfrecord","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yonetaniryo%2Fnumpy2tfrecord/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yonetaniryo%2Fnumpy2tfrecord/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yonetaniryo%2Fnumpy2tfrecord/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yonetaniryo","download_url":"https://codeload.github.com/yonetanir
yo/numpy2tfrecord/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244232232,"owners_count":20420069,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["numpy","tensorflow","tfrecord"],"created_at":"2025-03-20T00:47:04.210Z","updated_at":"2025-03-20T00:47:04.751Z","avatar_url":"https://github.com/yonetaniryo.png","language":"Python","readme":"# numpy2tfrecord\n\nSimple helper library to convert numpy data to tfrecord and build a tensorflow dataset.\n\n## Installation\n```sh\n$ git clone git@github.com:yonetaniryo/numpy2tfrecord.git\n$ cd numpy2tfrecord\n$ pip install .\n```\nor simply using pip:\n```sh\n$ pip install numpy2tfrecord\n```\n\n\n## How to use\n### Convert a collection of numpy data to tfrecord\n\nYou can convert samples represented in the form of a `dict` to `tf.train.Example` and save them as a tfrecord.\n```python\nimport numpy as np\nfrom numpy2tfrecord import Numpy2TFRecordConverter\n\nwith Numpy2TFRecordConverter(\"test.tfrecord\") as converter:\n    x = np.arange(100).reshape(10, 10).astype(np.float32)  # float array\n    y = np.arange(100).reshape(10, 10).astype(np.int64)  # int array\n    a = 5  # int\n    b = 0.3  # float\n    sample = {\"x\": x, \"y\": y, \"a\": a, \"b\": b}\n    converter.convert_sample(sample)  # convert data sample\n```\n\nYou can also convert a `list` of samples at once using `convert_list`.\n```python\nwith Numpy2TFRecordConverter(\"test.tfrecord\") as converter:\n    samples = [\n        {\n            \"x\": np.random.rand(64).astype(np.float32),\n            \"y\": 
np.random.randint(0, 10),\n        }\n        for _ in range(32)\n    ]  # list of 32 samples\n\n    converter.convert_list(samples)\n```\n\nOr a batch of samples at once using `convert_batch`.\n```python\nwith Numpy2TFRecordConverter(\"test.tfrecord\") as converter:\n    samples = {\n        \"x\": np.random.rand(32, 64).astype(np.float32),\n        \"y\": np.random.randint(0, 10, size=32).astype(np.int64),\n    }  # batch of 32 samples\n\n    converter.convert_batch(samples)\n```\n\nSo what are the advantages of `Numpy2TFRecordConverter` compared to `tf.data.Dataset.from_tensor_slices`?\nSimply put, when using `tf.data.Dataset.from_tensor_slices`, all the samples that will be converted to a dataset must be in memory.\nOn the other hand, you can use `Numpy2TFRecordConverter` to sequentially add samples to the tfrecord without having to read all of them into memory beforehand.\n\n\n\n### Build a tensorflow dataset from tfrecord\nOnce stored in a tfrecord, samples can be streamed using `tf.data.TFRecordDataset`.\n\n```python\nfrom numpy2tfrecord import build_dataset_from_tfrecord\n\ndataset = build_dataset_from_tfrecord(\"test.tfrecord\")\n```\n\nThe dataset can then be iterated over directly in a training loop.\n\n```python\nfor batch in dataset.as_numpy_iterator():\n    x, y = batch.values()\n    ...\n```\n\n### Speeding up PyTorch data loading with `numpy2tfrecord`!\nhttps://gist.github.com/yonetaniryo/c1780e58b841f30150c45233d3fe6d01\n\n```python\nimport os\nimport time\n\nimport numpy as np\nfrom numpy2tfrecord import Numpy2TFRecordConverter, build_dataset_from_tfrecord\nimport torch\nfrom torchvision import datasets, transforms\n\ndataset = datasets.MNIST(\".\", download=True, transform=transforms.ToTensor())\n\n# convert to tfrecord\nwith Numpy2TFRecordConverter(\"mnist.tfrecord\") as converter:\n    converter.convert_batch({\"x\": dataset.data.numpy().astype(np.int64),\n                        \"y\": 
dataset.targets.numpy().astype(np.int64)})\n\ntorch_loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True, num_workers=os.cpu_count())\ntic = time.time()\nfor e in range(5):\n    for batch in torch_loader:\n        x, y = batch\nelapsed = time.time() - tic\nprint(f\"elapsed time with pytorch dataloader: {elapsed:0.2f} sec for 5 epochs\")\n\ntf_loader = build_dataset_from_tfrecord(\"mnist.tfrecord\").batch(32).prefetch(1)\ntic = time.time()\nfor e in range(5):\n    for batch in tf_loader.as_numpy_iterator():\n        x, y = batch.values()\nelapsed = time.time() - tic\nprint(f\"elapsed time with tf dataloader: {elapsed:0.2f} sec for 5 epochs\")\n```\n\n⬇️\n\n```\nelapsed time with pytorch dataloader: 41.10 sec for 5 epochs\nelapsed time with tf dataloader: 17.34 sec for 5 epochs\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyonetaniryo%2Fnumpy2tfrecord","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyonetaniryo%2Fnumpy2tfrecord","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyonetaniryo%2Fnumpy2tfrecord/lists"}