
===========================================
PyTables: hierarchical datasets in Python
===========================================

.. image:: https://badges.gitter.im/Join%20Chat.svg
   :alt: Join the chat at https://gitter.im/PyTables/PyTables
   :target: https://gitter.im/PyTables/PyTables

.. image:: https://github.com/PyTables/PyTables/workflows/CI/badge.svg
   :target: https://github.com/PyTables/PyTables/actions?query=workflow%3ACI

.. image:: https://img.shields.io/pypi/v/tables.svg
   :target: https://pypi.org/project/tables/

.. image:: https://img.shields.io/pypi/pyversions/tables.svg
   :target: https://pypi.org/project/tables/

.. image:: https://img.shields.io/pypi/l/tables
   :target: https://github.com/PyTables/PyTables/

:URL: http://www.pytables.org/

PyTables is a package for managing hierarchical datasets, designed
to cope efficiently with extremely large amounts of data.

It is built on top of the HDF5 library and the NumPy package. It
features an object-oriented interface that, combined with C extensions
for the performance-critical parts of the code (generated using
Cython), makes it a fast yet extremely easy-to-use tool for
interactively saving and retrieving very large amounts of data. One
important feature of PyTables is that it optimizes memory and disk
resources so that data takes much less space (by a factor of 3 to 5,
and more if the data is compressible) than with other solutions such
as relational or object-oriented databases.
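
As a minimal sketch of the save-and-retrieve workflow described
above, the following stores a NumPy array in an HDF5 file and reads a
slice back. The file path, group name and array name are hypothetical
and chosen only for illustration.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Save: create a group in the hierarchy and store a NumPy array in it.
with tables.open_file(path, mode="w", title="Example file") as h5:
    group = h5.create_group("/", "measurements", "Hypothetical group")
    h5.create_array(group, "temperature", np.arange(10, dtype=np.float64))

# Retrieve: reopen the file and read only the first three values.
with tables.open_file(path, mode="r") as h5:
    first_three = h5.root.measurements.temperature[:3]

print(first_three)
```

Slicing the node reads only the requested range from disk, which is
what keeps interactive work practical when the array does not fit in
memory.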

State-of-the-art compression
----------------------------

PyTables comes with out-of-the-box support for the Blosc compressor.
This allows for extremely high compression speed while keeping decent
compression ratios. As a result, I/O can be accelerated considerably,
and you may end up achieving higher performance than the bandwidth
provided by your I/O subsystem. See the "Tuning The Chunksize" section
of the "Optimization Tips" chapter of the user documentation for some
benchmarks.
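
A minimal sketch of enabling Blosc on a chunked array, assuming the
bundled ``blosc`` compressor is available in your build. The
compression level, the node name and the synthetic data are arbitrary
choices for illustration, not recommended settings.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical file location and highly repetitive synthetic data.
path = os.path.join(tempfile.mkdtemp(), "compressed.h5")
data = np.tile(np.arange(1000, dtype=np.int64), 1000)

# Filters describes the compression applied to each chunk on disk.
filters = tables.Filters(complevel=5, complib="blosc", shuffle=True)

with tables.open_file(path, mode="w") as h5:
    h5.create_carray("/", "signal", obj=data, filters=filters)

# Decompression is transparent when reading the data back.
with tables.open_file(path, mode="r") as h5:
    restored = h5.root.signal[:]

raw_bytes = data.nbytes
file_bytes = os.path.getsize(path)
print(raw_bytes, file_bytes)
```

How much space is saved depends entirely on the data; repetitive
input like this compresses far better than random values would.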

Not an RDBMS replacement
------------------------

PyTables is not designed to work as a relational database replacement,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
just provide a categorized structure for some portions of your
cluttered RDBMS, then give PyTables a try. It works well for storing
data from data acquisition systems (DAS), simulation software, network
data monitoring systems (for example, traffic measurements of IP
packets on routers), or as a centralized repository for system logs,
to name only a few possible uses.

Tables
------

A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type. The requirements of
"fixed-length" fields and strict "data types" may seem strange for an
interpreted language like Python, but they serve a useful function if
the goal is to save very large quantities of data (such as those
generated by many scientific applications) in an efficient manner
that reduces demand on CPU time and I/O.
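
The record structure described above can be sketched as follows. The
``Particle`` description, its field names and the query threshold are
hypothetical and exist only for illustration.

```python
import os
import tempfile

import tables


class Particle(tables.IsDescription):
    """Hypothetical record layout: every field has a fixed size and type."""

    name = tables.StringCol(16)    # fixed-length 16-byte string
    idnumber = tables.Int64Col()   # 64-bit signed integer
    energy = tables.Float64Col()   # double-precision float


path = os.path.join(tempfile.mkdtemp(), "table.h5")

with tables.open_file(path, mode="w") as h5:
    table = h5.create_table("/", "particles", Particle, "Hypothetical records")

    # Fill the table one record at a time through the Row accessor.
    row = table.row
    for i in range(5):
        row["name"] = f"particle-{i}"
        row["idnumber"] = i
        row["energy"] = i * 1.5
        row.append()
    table.flush()

    # The declared string length is part of the column's dtype.
    name_dtype = table.coldtypes["name"]

    # In-kernel query: the condition is evaluated without loading all rows.
    high_energy_ids = [int(r["idnumber"]) for r in table.where("energy > 3.0")]

print(name_dtype, high_energy_ids)
```

Because every record has the same size, a row's position on disk can
be computed directly from its index, which is part of what makes
access to very large tables cheap.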

Arrays
------

There are other useful objects, such as arrays, enlargeable arrays and
variable-length arrays, that can cover different needs in your
project.
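
As a brief sketch of the two less common array types, the following
creates an enlargeable array and a variable-length array. The node
names and shapes are hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "arrays.h5")

with tables.open_file(path, mode="w") as h5:
    # Enlargeable array: the dimension of length 0 is the one that grows.
    samples = h5.create_earray(
        "/", "samples", atom=tables.Float64Atom(), shape=(0, 3)
    )
    samples.append(np.ones((2, 3)))
    samples.append(np.zeros((4, 3)))
    samples_shape = tuple(int(n) for n in samples.shape)

    # Variable-length array: each appended row may have a different length.
    ragged = h5.create_vlarray("/", "ragged", atom=tables.Int32Atom())
    ragged.append([1, 2, 3])
    ragged.append([4])
    row_lengths = [len(r) for r in ragged]

print(samples_shape, row_lengths)
```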

Easy to use
-----------

One of the principal objectives of PyTables is to be user-friendly.
In addition, many different iterators have been implemented to make
interactive work as productive as possible.
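
Two of these iterators are sketched below: one walks the object tree
and one steps through the elements of an array. The node names are
hypothetical.

```python
import os
import tempfile

import numpy as np
import tables

path = os.path.join(tempfile.mkdtemp(), "iterators.h5")

with tables.open_file(path, mode="w") as h5:
    h5.create_array("/", "a", np.arange(3))
    group = h5.create_group("/", "g")
    h5.create_array(group, "b", np.arange(4))

    # Walk the hierarchy, visiting only Array nodes.
    array_paths = sorted(
        node._v_pathname for node in h5.walk_nodes("/", classname="Array")
    )

    # Iterate over every second element of one array.
    every_other = [int(x) for x in h5.root.g.b.iterrows(start=0, stop=4, step=2)]

print(array_paths, every_other)
```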

Platforms
---------

We are using Linux on top of Intel32 and Intel64 boxes as the main
development platforms, but PyTables should be easy to compile/install
on other UNIX or Windows machines.

Compiling
---------

To compile PyTables you will need, at least, a recent version of the
HDF5 (C flavor) library, the Zlib compression library, and the NumPy
and Numexpr packages. In addition, it comes with support for the
Blosc, LZO and bzip2 compressor libraries. Blosc is mandatory, but
PyTables ships with the Blosc sources, so although it is recommended
to have Blosc installed on your system, you don't absolutely need to
install it separately. The LZO and bzip2 compression libraries are,
however, optional.

Installation
------------

1. Make sure you have HDF5 version 1.10.5 or above.

   On OSX you can install HDF5 using Homebrew::

       $ brew install hdf5

   On Debian-based distributions::

       $ sudo apt-get install libhdf5-serial-dev

   If you have the HDF5 library in some non-standard location (that
   is, where the compiler and the linker can't find it) you can use
   the environment variable `HDF5_DIR` to specify its location. See
   the manual for more details.

2. For stability (and performance too) reasons, it is strongly
   recommended that you install the C-Blosc library separately,
   although you might want PyTables to use its internal C-Blosc
   sources.

3. Optionally, consider installing the LZO compression library and/or
   the bzip2 compression library.

4. Install!::

       $ python3 -m pip install tables

5. To run the test suite, run::

       $ python3 -m tables.tests.test_all

   If any test does not pass, please send the complete test output
   back to us.
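
Before running the full test suite, a quick check like the following
sketch can confirm that the package imports and show which libraries
it was built against. It assumes that ``tables.which_lib_version``
returns ``None`` for a library that was not compiled in.

```python
import tables

# Version of the installed PyTables package.
version = tables.__version__

# Version information for the mandatory HDF5 library.
hdf5_info = tables.which_lib_version("hdf5")

# Optional compressors may be missing from a given build.
optional = {name: tables.which_lib_version(name) for name in ("lzo", "bzip2")}

print("PyTables:", version)
print("HDF5:", hdf5_info)
for name, info in optional.items():
    print(f"{name}:", info if info is not None else "not available")
```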

**Enjoy data!** -- The PyTables Team

.. Local Variables:
.. mode: text
.. coding: utf-8
.. fill-column: 70
.. End: