https://github.com/yonetaniryo/numpy2tfrecord
Simple helper library to convert numpy data to tfrecord and build a tensorflow dataset
- Host: GitHub
- URL: https://github.com/yonetaniryo/numpy2tfrecord
- Owner: yonetaniryo
- Created: 2022-01-15T13:27:55.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-03-26T07:39:32.000Z (about 2 years ago)
- Last Synced: 2025-03-18T13:35:13.396Z (2 months ago)
- Topics: numpy, tensorflow, tfrecord
- Language: Python
- Homepage:
- Size: 25.4 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# numpy2tfrecord
Simple helper library to convert numpy data to tfrecord and build a tensorflow dataset.
## Installation
```sh
$ git clone git@github.com:yonetaniryo/numpy2tfrecord.git
$ cd numpy2tfrecord
$ pip install .
```
or simply using pip:
```sh
$ pip install numpy2tfrecord
```
## How to use
### Convert a collection of numpy data to tfrecord
You can convert samples represented in the form of a `dict` to `tf.train.Example` and save them as a tfrecord.
```python
import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter

with Numpy2TFRecordConverter("test.tfrecord") as converter:
    x = np.arange(100).reshape(10, 10).astype(np.float32)  # float array
    y = np.arange(100).reshape(10, 10).astype(np.int64)  # int array
    a = 5  # int
    b = 0.3  # float
    sample = {"x": x, "y": y, "a": a, "b": b}
    converter.convert_sample(sample)  # convert data sample
```
You can also convert a `list` of samples at once using `convert_list`.
```python
with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = [
        {
            "x": np.random.rand(64).astype(np.float32),
            "y": np.random.randint(0, 10),
        }
        for _ in range(32)
    ]  # list of 32 samples
    converter.convert_list(samples)
```
Or convert a batch of samples at once using `convert_batch`.
```python
with Numpy2TFRecordConverter("test.tfrecord") as converter:
    samples = {
        "x": np.random.rand(32, 64).astype(np.float32),
        "y": np.random.randint(0, 10, size=32).astype(np.int64),
    }  # batch of 32 samples
    converter.convert_batch(samples)
```
So what are the advantages of `Numpy2TFRecordConverter` compared to `tf.data.Dataset.from_tensor_slices`?
Simply put, when using `tf.data.Dataset.from_tensor_slices`, all the samples that will be converted to a dataset must be in memory.
On the other hand, you can use `Numpy2TFRecordConverter` to sequentially add samples to the tfrecord without having to read all of them into memory beforehand.
### Build a tensorflow dataset from tfrecord
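To make the memory argument above concrete without depending on TensorFlow, here is a minimal sketch of the underlying idea: records are appended to a file one at a time and later streamed back lazily. It uses a simple length-prefixed `pickle` layout, which is an assumption for illustration only and is not the TFRecord format. `write_records`, `read_records`, and `generate_samples` are hypothetical helper names, not part of `numpy2tfrecord`.

```python
import os
import pickle
import struct
import tempfile


def write_records(path, samples):
    """Write samples to `path` one at a time; `samples` may be any iterable, e.g. a generator."""
    count = 0
    with open(path, "wb") as f:
        for sample in samples:
            payload = pickle.dumps(sample)
            f.write(struct.pack("<Q", len(payload)))  # 8-byte little-endian length prefix
            f.write(payload)
            count += 1
    return count


def read_records(path):
    """Yield samples lazily, so only one record is held in memory at a time."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # end of file
            (length,) = struct.unpack("<Q", header)
            yield pickle.loads(f.read(length))


def generate_samples(n):
    """Hypothetical data source that produces samples on demand."""
    for i in range(n):
        yield {"x": [float(i)] * 4, "y": i % 10}


path = os.path.join(tempfile.mkdtemp(), "demo.records")
num_written = write_records(path, generate_samples(1000))
total_y = sum(sample["y"] for sample in read_records(path))
```

Neither the writer nor the reader ever materializes the full list of 1000 samples. This is the same property the paragraph above attributes to writing a tfrecord sample by sample, as opposed to passing complete in-memory arrays to `tf.data.Dataset.from_tensor_slices`.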
Samples once stored in the tfrecord can be streamed using `tf.data.TFRecordDataset`.
```python
from numpy2tfrecord import build_dataset_from_tfrecord

dataset = build_dataset_from_tfrecord("test.tfrecord")
```
The dataset can then be used directly in a training loop.
```python
for batch in dataset.as_numpy_iterator():
    x, y = batch.values()
    ...
```
### Speeding up PyTorch data loading with `numpy2tfrecord`!
https://gist.github.com/yonetaniryo/c1780e58b841f30150c45233d3fe6d01
```python
import os
import time

import numpy as np
from numpy2tfrecord import Numpy2TFRecordConverter, build_dataset_from_tfrecord
import torch
from torchvision import datasets, transforms

dataset = datasets.MNIST(".", download=True, transform=transforms.ToTensor())

# convert to tfrecord
with Numpy2TFRecordConverter("mnist.tfrecord") as converter:
    converter.convert_batch({"x": dataset.data.numpy().astype(np.int64),
                             "y": dataset.targets.numpy().astype(np.int64)})

# time 5 epochs with the PyTorch DataLoader
torch_loader = torch.utils.data.DataLoader(dataset, batch_size=32, pin_memory=True, num_workers=os.cpu_count())
tic = time.time()
for e in range(5):
    for batch in torch_loader:
        x, y = batch
elapsed = time.time() - tic
print(f"elapsed time with pytorch dataloader: {elapsed:0.2f} sec for 5 epochs")

# time 5 epochs with the tfrecord-backed tf.data pipeline
tf_loader = build_dataset_from_tfrecord("mnist.tfrecord").batch(32).prefetch(1)
tic = time.time()
for e in range(5):
    for batch in tf_loader.as_numpy_iterator():
        x, y = batch.values()
elapsed = time.time() - tic
print(f"elapsed time with tf dataloader: {elapsed:0.2f} sec for 5 epochs")
```
⬇️
```
elapsed time with pytorch dataloader: 41.10 sec for 5 epochs
elapsed time with tf dataloader: 17.34 sec for 5 epochs
```
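The two timing loops in the benchmark follow the same pattern, so they could be factored into a small helper. The sketch below is not part of `numpy2tfrecord`: `time_epochs` is a hypothetical name, and it uses `time.perf_counter`, which is generally better suited to measuring intervals than `time.time`. It takes a callable that returns a fresh iterable for each epoch, so under that assumption it should accept either `lambda: torch_loader` or `tf_loader.as_numpy_iterator`.

```python
import time


def time_epochs(make_iterable, num_epochs=5):
    """Return (elapsed_seconds, num_batches) for iterating `num_epochs` times."""
    num_batches = 0
    tic = time.perf_counter()
    for _ in range(num_epochs):
        for _batch in make_iterable():
            num_batches += 1
    elapsed = time.perf_counter() - tic
    return elapsed, num_batches


# Stand-in loader for demonstration: 100 dummy batches per epoch.
elapsed, num_batches = time_epochs(lambda: range(100), num_epochs=5)
print(f"iterated {num_batches} batches in {elapsed:0.4f} sec")
```

Returning the batch count alongside the elapsed time makes it easy to confirm that both loaders iterated over the same amount of data before comparing their timings.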