https://github.com/trungnt13/bigarray
Fast and scalable numpy array using Memory-mapped I/O
- Host: GitHub
- URL: https://github.com/trungnt13/bigarray
- Owner: trungnt13
- License: apache-2.0
- Created: 2019-10-08T09:16:30.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-10-16T11:38:33.000Z (almost 6 years ago)
- Last Synced: 2025-04-14T22:14:41.435Z (6 months ago)
- Topics: big-data, machine-learning, parrallel-computing
- Language: Python
- Homepage:
- Size: 31.3 KB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# bigarray
Fast and scalable numpy array using Memory-mapped I/O

Stable build:
> pip install bigarray

Nightly build from GitHub:
> pip install git+https://github.com/trungnt13/bigarray@master

---
## The three principles

* **Transparency**: everything is a `numpy.array`; metadata and support for extra features (e.g. multiprocessing, indexing) are implemented unobtrusively in the _background_ (see the sketch after this list).
* **Pragmatism**: fast but easy, with a simplified API for common use cases.
* **Focus**: _"Do one thing and do it well"_
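A minimal sketch of the transparency idea, using plain `numpy.memmap` rather than bigarray's own API (the path and shapes below are made up for illustration): the mapped object is a genuine `numpy.ndarray` subclass, so ordinary numpy code works on it while the data stays on disk.

```python
import numpy as np

# Hypothetical path used only for this sketch
path = '/tmp/demo_mmap'

# Create a disk-backed array; it behaves like a regular numpy array
mm = np.memmap(path, dtype='float32', mode='w+', shape=(1000, 64))
mm[:10] = np.random.rand(10, 64)   # writes go straight to the mapped file
mm.flush()                         # persist pending changes to disk

# Re-open read-only; slicing and reductions are ordinary numpy operations
ro = np.memmap(path, dtype='float32', mode='r', shape=(1000, 64))
print(isinstance(ro, np.ndarray), float(ro[:10].mean()))
```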
## The benchmarks

Measured on total open + iterate time, memory-mapped access is about **535 times faster** than HDF5 data (using `h5py`) and about **223 times faster** than a normal `numpy.array` saved with `numpy.save`.

[The detailed benchmark code](https://github.com/trungnt13/bigarray/blob/master/benchmarks/mmap_vs_hdf5.py)
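The numbers above come from the linked script. For orientation, the snippet below is only a simplified sketch of the same kind of open-plus-iterate comparison, written directly against `numpy.memmap` and `h5py` (the file paths, array size, and chunk size are made up and will not reproduce the figures shown).

```python
import time
import numpy as np
import h5py

data = np.random.rand(2000, 1000).astype('float32')

# HDF5: write once, then time re-opening and iterating in chunks
with h5py.File('/tmp/bench.h5', 'w') as f:
    f.create_dataset('x', data=data)
start = time.perf_counter()
with h5py.File('/tmp/bench.h5', 'r') as f:
    for i in range(0, len(f['x']), 256):
        _ = f['x'][i:i + 256]
h5_time = time.perf_counter() - start

# Memory-mapped file: same write-once, open-and-iterate pattern
mm = np.memmap('/tmp/bench.mmap', dtype='float32', mode='w+', shape=data.shape)
mm[:] = data
mm.flush()
start = time.perf_counter()
mm = np.memmap('/tmp/bench.mmap', dtype='float32', mode='r', shape=data.shape)
for i in range(0, len(mm), 256):
    _ = mm[i:i + 256]
mmap_time = time.perf_counter() - start

print('HDF5   open+iter: %.4f s' % h5_time)
print('Memmap open+iter: %.4f s' % mmap_time)
```

The output of the full benchmark script: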
```
Array size: 1220.70 (MB)
Create HDF5 in: 0.0005580571014434099 s
Create Memmap in: 0.000615391880273819 s
Numpy save in: 0.5713834380730987 s
Writing data to HDF5 : 0.5530977640300989 s
Writing data to Memmap: 0.7038380969315767 s
Numpy saved size: 1220.70 (MB)
HDF5 saved size: 1220.71 (MB)
Mmap saved size: 1220.70 (MB)
Load Numpy array: 0.3723734531085938 s
Load HDF5 data : 0.00041177100501954556 s
Load Memmap data: 0.00017150305211544037 s
Test correctness of stored data
Numpy : True
HDF5 : True
Memmap: True
Iterate Numpy data : 0.00020254682749509811 s
Iterate HDF5 data : 0.8945782391820103 s
Iterate Memmap data : 0.0014937107916921377 s
Iterate Memmap (2nd) : 0.0011746759992092848 s
Numpy total time (open+iter): 0.3725759999360889 s
H5py total time (open+iter): 0.8949900101870298 s
**Mmap total time (open+iter): 0.001665213843807578 s**
```

## Example
```python
from multiprocessing import Pool

import numpy as np

from bigarray import PointerArray, PointerArrayWriter

n = 80 * 10  # total number of samples
# 8 jobs, each covering 10 names; every name maps to an array of 10 integers
jobs = [(i, i + 10) for i in range(0, n // 10, 10)]
path = '/tmp/array'

# ====== Multiprocessing writing ====== #
writer = PointerArrayWriter(path, shape=(n,), dtype='int32', remove_exist=True)

def fn_write(job):
    start, end = job
    # it is crucial to write at a different position for each process
    writer.write(
        {"name%i" % i: np.arange(i * 10, i * 10 + 10) for i in range(start, end)},
        start_position=start * 10)

# use 2 processes to generate and write data
with Pool(2) as p:
    p.map(fn_write, jobs)
writer.flush()
writer.close()

# ====== Multiprocessing reading ====== #
x = PointerArray(path)
print(x['name0'])
print(x['name66'])
print(x['name78'])

# normal indexing
for name, (s, e) in x.indices.items():
    data = x[s:e]

# fast indexing
for name in x.indices:
    data = x[name]

# multiprocess indexing
def fn_read(job):
    start, end = job
    total = 0
    for i in range(start, end):
        total += np.sum(x['name%d' % i])
    return total

# use multiprocessing to calculate the sum of all arrays
with Pool(2) as p:
    total_sum = sum(p.map(fn_read, jobs))
print(np.sum(x), total_sum)
```

Output:
```
[0 1 2 3 4 5 6 7 8 9]
[660 661 662 663 664 665 666 667 668 669]
[780 781 782 783 784 785 786 787 788 789]
319600 319600
```
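As a sanity check, the file holds the integers 0 through 799, and 0 + 1 + … + 799 = 799 · 800 / 2 = 319600, which matches both `np.sum(x)` and the multiprocess total. The comment in the example calls writing to different positions from different processes "crucial"; the sketch below (plain `numpy.memmap` with a hypothetical path, not bigarray's API) illustrates why that pattern is safe: each process maps the same file but touches only its own non-overlapping slice, so no locking is required.

```python
from multiprocessing import Pool

import numpy as np

PATH = '/tmp/disjoint_demo'   # hypothetical path for this sketch
N, CHUNK = 800, 100           # 8 workers, each owning a disjoint 100-element slice

def write_chunk(k):
    # every worker maps the same file, but writes only its own region
    mm = np.memmap(PATH, dtype='int32', mode='r+', shape=(N,))
    mm[k * CHUNK:(k + 1) * CHUNK] = np.arange(k * CHUNK, (k + 1) * CHUNK)
    mm.flush()

if __name__ == '__main__':
    # pre-allocate the file once before the workers start
    np.memmap(PATH, dtype='int32', mode='w+', shape=(N,)).flush()
    with Pool(4) as p:
        p.map(write_chunk, range(N // CHUNK))
    result = np.memmap(PATH, dtype='int32', mode='r', shape=(N,))
    print(np.array_equal(result, np.arange(N)))   # expect: True
```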