Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/hristo-vrigazov/mmap.ninja

Memory mapped numpy arrays of varying shapes
https://github.com/hristo-vrigazov/mmap.ninja
Last synced: about 2 months ago
JSON representation
Memory mapped numpy arrays of varying shapes
Host: GitHub
URL: https://github.com/hristo-vrigazov/mmap.ninja
Owner: hristo-vrigazov
License: mit
Created: 2021-12-25T14:00:50.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2024-06-19T06:37:29.000Z (3 months ago)
Last Synced: 2024-07-19T16:57:14.677Z (2 months ago)
Language: Python
Homepage:
Size: 1.41 MB
Stars: 269
Watchers: 4
Forks: 11
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project

README

        [![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

# mmap.ninja

Install with:

```bash

pip install mmap_ninja

```

[![Gitter](https://img.shields.io/gitter/room/hristo-vrigazov/mmap-ninja)](https://app.gitter.im/#/room/#mmap-ninja:gitter.im)

[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-WMtVyfxx2aUMeV7vlG48Ia27-5cxnrS?usp=sharing)

[![Build Status](https://app.travis-ci.com/hristo-vrigazov/mmap.ninja.svg?branch=master)](https://app.travis-ci.com/hristo-vrigazov/mmap.ninja)

[![codecov](https://codecov.io/gh/hristo-vrigazov/mmap.ninja/branch/master/graph/badge.svg?token=YUCO0KJONB)](https://codecov.io/gh/hristo-vrigazov/mmap.ninja)

[![Downloads](https://static.pepy.tech/badge/mmap-ninja/month)](https://pepy.tech/project/mmap-ninja)

[![PyPi version](https://badgen.net/pypi/v/mmap_ninja/)](https://pypi.com/project/mmap_ninja)

[![PyPI license](https://img.shields.io/pypi/l/mmap_ninja.svg)](https://pypi.python.org/pypi/mmap_ninja/)

Microlib docs can be found [here](./MICROLIB.md). 

## Contents

1. [Quick example](#quick-example)

2. [What is it?](#what-is-it)

3. [When to use it?](#when-do-i-use-it)

4. [When not to use it?](#when-not-to-use-it)

5. [How it works?](#how-it-works)

6. [API guide](#api-guide)

7. [FAQ](#faq)

8. [I want to contribute](#i-want-to-contribute)

## Quick example

```python

import numpy as np

import matplotlib.image as mpimg

from tqdm import tqdm

from pathlib import Path

from mmap_ninja import RaggedMmap

coco_path = Path('')

# Once per project, convert the images to a memory map

RaggedMmap.from_generator(

    # Directory in which the memory map will be persisted

    out_dir='images_mmap',

    # Something that yields np.ndarray

    sample_generator=map(mpimg.imread, coco_path.iterdir()),

    # Maximum number of samples to keep in memory before flushing to disk

    batch_size=1024,

    # Show/hide progress bar

    verbose=True

)

# Open the memory map

images_mmap = RaggedMmap('images_mmap')

# This iteration takes 0.2s on COCO val 2017

# This iteration takes 35s without memory-mapping

for i in tqdm(range(len(images_mmap))):

  img: np.ndarray = images_mmap[i]

```

[Back to Contents](#contents)

## What is it?

Accelerate the iteration over your machine learning dataset by up to **20 times** !

`mmap_ninja` is a library for storing your datasets in memory-mapped files,

which leads to a dramatic speedup in the training time.

The only dependencies are `numpy` and `tqdm`.

You can use `mmap_ninja` with any training framework (such as `Tensorflow`, `PyTorch`, `MxNet`), etc.,

as it stores your dataset as a memory-mapped numpy array.

A **memory mapped file** is a file that is physically present on disk in a way that the correlation between the file

and the memory space permits applications to treat the mapped portions as if it were primary memory, allowing very fast

I/O!

When working on a machine learning project, one of the most time-consuming parts is the model's training.

However, a large portion of the training time actually consists of just iterating over your dataset and filesystem I/O!

This library, `mmap_ninja` provides high-level, easy to use, well tested API for using memory maps for your 

datasets, reducing the time needed for training.

Memory maps would usually take a little more disk space though, so if you are willing to trade some disk space

for fast filesystem to memory I/O, this is your library!

[Back to Contents](#contents)

## When do I use it?

Use it whenever you want to store a sequence of `np.ndarray`s (of varying shapes) that you are going to

read from at random positions very often.

`mmap_ninja` can work with any type of data that can be stored as a `np.ndarray`, as the

memory map is initialized with a generator that yields samples.

In the table below, you can see concrete examples, but beware that those are just examples,

`mmap_ninja` has no specific logic to handle images or videos or something like that.

It just stores `np.ndarray` and it is up to you to decide what this array represents.

| Use case   | Notebook                                                                                                                                                             | Benchmark                                                 | Class/Module                         |

|:-----------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------|:----------------------------------------------------------|:-------------------------------------|

| Image      | [![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1-WMtVyfxx2aUMeV7vlG48Ia27-5cxnrS?usp=sharing) | [COCO 2017](#memory-mapping-images-with-different-shapes) | `from mmap_ninja import RaggedMmap`  |

| Text       | [![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/18bEwylFwx4owMpb-RAkJZS_9JrrUcFd7?usp=sharing) | [20 newsgroups](#memory-mapping-text-documents)           | `from mmap_ninja import StringsMmap` |

| Video      | [![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1xMEHbwntgpBfCGfTicXmdbA8UpEowXzW?usp=sharing) | Coming soon!                                              | `from mmap_ninja import RaggedMmap`  |

[Back to Contents](#contents)

### Memory mapping images with different shapes

You can create a new `RaggedMmmap` from one of its class methods: `RaggedMmmap.from_lists`, 

`RaggedMmap.from_generator`.

Create a memory map from generator, flushing to disk every 1024 images (so that you don't have to keep it all in memory at once):

```python

import matplotlib.pyplot as plt

from mmap_ninja import RaggedMmap

from pathlib import Path

coco_path = Path('')

val_images = RaggedMmap.from_generator(

    out_dir='val_images', 

    sample_generator=map(plt.imread, coco_path.iterdir()), 

    batch_size=1024, 

    verbose=True

)

```

Once created, you can open the map by simply supplying the path to the memory map:

```python

from mmap_ninja import RaggedMmap

val_images = RaggedMmap('val_images')

print(val_images[3]) # Prints the ndarray image, e.g. with shape (387, 640, 3)

```

You can also extend an already existing memory map easily by using the `.extend` method.

In the table show the time needed for initial loading, one iteration over the COCO validation 2017 dataset,

the memory usage of every method and the disk usage.

|                  |   Initial load (s) |   Time for iteration (s) | Memory usage (GB)   | Disk usage (GB)   |

|:-----------------|-------------------:|-------------------------:|:--------------------|:------------------|

| in_memory        |           1.356077 |                 0.000403 | 3.818741 GB         | 3.819034 GB       |

| ragged_mmap      |           0.002054 |                 0.057858 | 0.001144 GB         | 3.819114 GB       |

| imread_from_disk |           0.000000 |                22.208385 | 0.001144 GB         | 0.758753 GB       |

You can see that once created, the `RaggedMmap` is **383 times** faster for iterating over the 

dataset.

It does require 4 times more disk space though, so if you are willing to trade 4 times more disk space

for **383 times** speedup (and less memory usage), you definitely should use the `RaggedMmap`!

This makes the `RaggedMmap` a fantastic choice for your computer vision, image-based machine learning datasets!

### Memory mapping text documents

You can create a new `StringsMmmap` from one of its class methods: `StringsMmmap.from_strings`,

`StringsMmap.from_generator`. 

Once it's created, you can open it by just supplying the path to the memory map.

An example of creating a memory map for the [sklearn's 20newsgroups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html):

```python

from mmap_ninja import StringsMmap

from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()

memmap = StringsMmap.from_strings('20newsgroups', data['data'], verbose=True)

```

Opening an already existing `StringsMmmap`:

```python

from mmap_ninja import StringsMmap

texts = StringsMmap('20newsgroups')

print(texts[123])  # Prints the 123-th text

```

You can also extend an already existing memory map easily by using the `.extend` method.

In the table show the time needed for initial loading, 100 iterations over the [sklearn's 20newsgroups dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html),

the memory usage of every method and the disk usage.

|                |   Initial load (s) |   Time for iteration (s) | Memory usage (GB)   | Disk usage (GB)   |

|:---------------|-------------------:|-------------------------:|:--------------------|:------------------|

| in_memory      |           0.174626 |                 0.068995 | 0.09 MB             | 45 MB             |

| ragged_mmap    |           0.003701 |                 2.052659 | 0.07 MB             | 22 MB             |

| read_from_disk |           0.000000 |                13.996738 | 0.07 MB             | 45 MB             |

You can see that once created, the `StringsMmap` is nearly **7 times** faster compared to reading `.txt` files

from disk one by one.

Moreover, it takes **2 times** less disk space (this is true only for `StringsMmap`, in general for other types the memory map

would take more disk space).

This makes the `StringsMmmap` a fantastic choice for your NLP, text-based machine learning datasets!

## When not to use it?

Very frequently, `mmap_ninja` takes more disk space than traditional approaches.

For example, for jpeg images, it takes 4 times more disk space. 

For this reason, do not use `mmap_ninja` in the following cases:

* You are low on disk space

* You want to send the data over a network - use a compressed format instead

There are other cases in which `mmap_ninja` is not a good choice:

* When you want to concurrently append to the memory map (use a queue like RabbitMQ and append from a subscriber instead)

* If you want to frequently delete samples from the memory map - this will require a new copy of the whole object

and so on.

[Back to Contents](#contents)

## How it works

Coming soon

[Back to Contents](#contents)

## API guide

[Read the docs](https://microlib.org/mmap_ninja.html)

[Back to Contents](#contents)

## FAQ

Q: Can I use it with Tensorflow/TF?

A: Of course. You can use it with any framework that can work with numpy arrays. Here's an [end-to-end example](https://colab.research.google.com/drive/1zXFFXc33hITitRbgeE_KrH7ns2Ddv7q9?usp=sharing)

[Back to Contents](#contents)

## I want to contribute

Coming soon!

[Back to Contents](#contents)