https://github.com/h2oai/datatable


# datatable

[![PyPi version](https://img.shields.io/pypi/v/datatable.svg)](https://pypi.org/project/datatable/)
[![License](https://img.shields.io/pypi/l/datatable.svg)](https://github.com/h2oai/datatable/blob/main/LICENSE)
[![Build Status](https://travis-ci.org/h2oai/datatable.svg?branch=main)](https://travis-ci.org/h2oai/datatable)
[![Documentation Status](https://readthedocs.org/projects/datatable/badge/?version=latest)](https://datatable.readthedocs.io/en/latest/?badge=latest)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/e72cadff26ed4ad68decd61b66b4c563)](https://www.codacy.com/app/st-pasha/datatable?utm_source=github.com&utm_medium=referral&utm_content=h2oai/datatable&utm_campaign=Badge_Grade)

This is a Python package for manipulating 2-dimensional tabular data structures
(aka data frames). It is close in spirit to [pandas][] or [SFrame][]; however, we
put specific emphasis on speed and big data support. As the name suggests, the
package is closely related to R's [data.table][] and attempts to mimic its core
algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.
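
The requirements above can be checked from Python before installing. The snippet below is a minimal sketch, not part of `datatable`: the helper name `check_requirements` is hypothetical, and it assumes Python 3.8+ so that `importlib.metadata` is available for reading the installed pip version.

```python
import re
import struct
import sys
from importlib import metadata


def check_requirements():
    """Return a dict mapping each requirement to True, False, or None (unknown)."""
    results = {
        "python>=3.6": sys.version_info >= (3, 6),
        # A 64-bit interpreter uses 8-byte pointers.
        "64-bit": struct.calcsize("P") == 8,
    }
    try:
        # Compare only the leading "major.minor" part of the pip version.
        match = re.match(r"(\d+)\.(\d+)", metadata.version("pip"))
        if match:
            found = (int(match.group(1)), int(match.group(2)))
            results["pip>=20.3"] = found >= (20, 3)
        else:
            results["pip>=20.3"] = None
    except metadata.PackageNotFoundError:
        # pip is not installed in this environment.
        results["pip>=20.3"] = None
    return results


if __name__ == "__main__":
    for requirement, ok in check_requirements().items():
        print(requirement, ok)
```

A `None` result means the check could not be performed, for example because pip is not installed in the active environment.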

## Project goals

`datatable` started in 2017 as a toolkit for performing big data (up to 100 GB)
operations on a single-node machine, at the maximum speed possible. Such
requirements are dictated by modern machine-learning applications, which need
to process large volumes of data and generate many features in order to
achieve the best model accuracy. The first user of `datatable` was
[Driverless.ai][].

The set of features that we want to implement with `datatable` is at least
the following:

* Column-oriented data storage.

* Native-C implementation for all datatypes, including strings. Packages such
as pandas and numpy already do that for numeric columns, but not for
strings.

* Support for date-time and categorical types. Object type is also supported,
but promotion to object is discouraged.

* All types should support null values, with as little overhead as possible.

* Data should be stored on disk in the same format as in memory. This will
allow us to memory-map data on disk and work on out-of-memory datasets
transparently.

* Work with memory-mapped datasets to avoid loading into memory more data than
necessary for each particular operation.

* Fast data reading from CSV and other formats.

* Multi-threaded data processing: time-consuming operations should attempt to
utilize all cores for maximum efficiency.

* Efficient algorithms for sorting/grouping/joining.

* Expressive query syntax (similar to [data.table][]).

* Minimal amount of data copying, copy-on-write semantics for shared data.

* Use "rowindex" views in filtering/sorting/grouping/joining operators to
avoid unnecessary data copying.

* Interoperability with pandas / numpy / pyarrow / pure python: the users
should have the ability to convert to another data-processing framework
with ease.
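
To make the "rowindex" view idea from the list above more concrete, here is a small pure-Python sketch. It is only an illustration under stated assumptions and is not how `datatable` is implemented: the `Column` and `View` classes and their methods are hypothetical names, and the standard-library `array` module stands in for native column storage. Filtering and sorting produce a new list of row positions while the column data itself is shared and never copied.

```python
from array import array


class Column:
    """A typed, contiguous column of values (stand-in for native storage)."""

    def __init__(self, typecode, values):
        self.data = array(typecode, values)


class View:
    """Shared column data plus a row index that selects and orders rows."""

    def __init__(self, columns, rowindex=None):
        self.columns = columns  # dict: name -> Column, shared between views
        if rowindex is None:
            nrows = len(next(iter(columns.values())).data)
            rowindex = array("q", range(nrows))
        self.rowindex = rowindex

    def filter(self, name, predicate):
        """Keep rows whose value in column `name` satisfies `predicate`."""
        data = self.columns[name].data
        kept = array("q", (i for i in self.rowindex if predicate(data[i])))
        return View(self.columns, kept)

    def sort(self, name):
        """Order rows by column `name`; only the row index is rearranged."""
        data = self.columns[name].data
        order = array("q", sorted(self.rowindex, key=lambda i: data[i]))
        return View(self.columns, order)

    def to_list(self, name):
        """Materialize one column through the row index."""
        data = self.columns[name].data
        return [data[i] for i in self.rowindex]


frame = View({
    "id": Column("q", [1, 2, 3, 4]),
    "score": Column("d", [0.5, 2.0, 1.5, 3.0]),
})
high = frame.filter("score", lambda s: s > 1.0).sort("score")
print(high.to_list("id"))             # [3, 2, 4]
print(high.columns is frame.columns)  # True: no column data was copied
```

The cost of each operation here is one new integer index rather than a copy of every column, which is the kind of saving the rowindex goal is aiming for.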

## Installation

On macOS, Linux, and Windows systems, installing datatable is as easy as:
```sh
pip install datatable
```

On all other platforms, a source distribution will be needed. For more
information, see [Build instructions](https://datatable.readthedocs.io/en/latest/install.html).

## See also

* [Build instructions](https://datatable.readthedocs.io/en/latest/install.html)
* [Documentation](https://datatable.readthedocs.io/en/latest/?badge=latest)

[pandas]: https://github.com/pandas-dev/pandas
[sframe]: https://github.com/turi-code/SFrame
[data.table]: https://github.com/Rdatatable/data.table
[driverless.ai]: https://www.h2o.ai/driverless-ai/