https://github.com/h2oai/datatable

A Python package for manipulating 2-dimensional tabular data structures

data-analysis data-structure ftrl performance python

Last synced: about 1 month ago

Host: GitHub
URL: https://github.com/h2oai/datatable
Owner: h2oai
License: mpl-2.0
Created: 2017-03-03T02:32:59.000Z (over 8 years ago)
Default Branch: main
Last Pushed: 2025-03-17T07:12:47.000Z (3 months ago)
Last Synced: 2025-05-12T13:53:06.419Z (about 1 month ago)
Topics: data-analysis, data-structure, ftrl, performance, python
Language: C++
Homepage: https://datatable.readthedocs.io
Size: 14.7 MB
Stars: 1,852
Watchers: 105
Forks: 163
Open Issues: 179
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

best-of-python - GitHub - 11% open · ⏱️ 01.12.2023): (Data Containers & Dataframes)
awesome-dataframes - datatable - A Python package for manipulating 2-dimensional tabular data structures. (Libraries)
awesome-list - datatable - A Python package for manipulating 2-dimensional tabular data structures. (Data Processing / Data Representation)
awesome-python-machine-learning-resources - GitHub - 10% open · ⏱️ 12.08.2022): (Data containers and structures)

README

        

# datatable

[![PyPi version](https://img.shields.io/pypi/v/datatable.svg)](https://pypi.org/project/datatable/)
[![License](https://img.shields.io/pypi/l/datatable.svg)](https://github.com/h2oai/datatable/blob/main/LICENSE)
[![Build Status](https://travis-ci.org/h2oai/datatable.svg?branch=main)](https://travis-ci.org/h2oai/datatable)
[![Documentation Status](https://readthedocs.org/projects/datatable/badge/?version=latest)](https://datatable.readthedocs.io/en/latest/?badge=latest)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/e72cadff26ed4ad68decd61b66b4c563)](https://www.codacy.com/app/st-pasha/datatable?utm_source=github.com&utm_medium=referral&utm_content=h2oai/datatable&utm_campaign=Badge_Grade)

This is a Python package for manipulating 2-dimensional tabular data structures
(aka data frames). It is close in spirit to [pandas][] or [SFrame][]; however, we
put specific emphasis on speed and big data support. As the name suggests, the
package is closely related to R's [data.table][] and attempts to mimic its core
algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.
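
The requirements above can be checked before installing. The following is a minimal sketch using only the standard library; the helper names (`check_python`, `pip_is_recent`, `installed_pip_version`) are illustrative and not part of `datatable`.

```python
import re
import sys


def check_python(min_version=(3, 6)):
    """Return True if the interpreter is 64-bit and at least min_version."""
    is_64bit = sys.maxsize > 2**32
    version_ok = sys.version_info >= min_version
    return version_ok and is_64bit


def pip_is_recent(version, minimum=(20, 3)):
    """Compare the leading 'major.minor' of a pip version string to minimum."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def installed_pip_version():
    """Return the installed pip version string, or None if it cannot be found."""
    try:
        from importlib import metadata  # available from Python 3.8
        return metadata.version("pip")
    except Exception:
        return None
```

For example, `check_python() and pip_is_recent(installed_pip_version() or "0.0")` evaluates to `True` only when both requirements appear to be met.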

## Project goals

`datatable` started in 2017 as a toolkit for performing big data (up to 100GB)
operations on a single-node machine, at the maximum speed possible. Such
requirements are dictated by modern machine-learning applications, which need
to process large volumes of data and generate many features in order to
achieve the best model accuracy. The first user of `datatable` was
[Driverless.ai][].

The set of features that we want to implement with `datatable` is at least
the following:

* Column-oriented data storage.
* Native-C implementation for all datatypes, including strings. Packages such
  as pandas and numpy already do that for numeric columns, but not for
  strings.
* Support for date-time and categorical types. Object type is also supported,
  but promotion into object is discouraged.
* All types should support null values, with as little overhead as possible.
* Data should be stored on disk in the same format as in memory. This will
  allow us to memory-map data on disk and work on out-of-memory datasets
  transparently.
* Work with memory-mapped datasets to avoid loading into memory more data than
  necessary for each particular operation.
* Fast data reading from CSV and other formats.
* Multi-threaded data processing: time-consuming operations should attempt to
  utilize all cores for maximum efficiency.
* Efficient algorithms for sorting/grouping/joining.
* Expressive query syntax (similar to [data.table][]).
* Minimal amount of data copying, copy-on-write semantics for shared data.
* Use "rowindex" views in filtering/sorting/grouping/joining operators to
  avoid unnecessary data copying.
* Interoperability with pandas / numpy / pyarrow / pure python: the users
  should have the ability to convert to another data-processing framework
  with ease.
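
Two of the goals above, storing a column on disk in its in-memory layout and filtering through a "rowindex" rather than copying values, can be illustrated with the standard library alone. This is a conceptual sketch, not `datatable`'s implementation or file format: the raw int64 layout, the file name `col_a.bin`, and the helpers `write_int64_column`, `filter_rows` and `demo` are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile


def write_int64_column(path, values):
    """Store a column as raw native-endian int64 values, with no header."""
    with open(path, "wb") as fh:
        fh.write(struct.pack(f"{len(values)}q", *values))


def filter_rows(column, predicate):
    """Return a 'rowindex': positions of matching rows, not copies of values."""
    return [i for i, value in enumerate(column) if predicate(value)]


def demo():
    values = [5, -3, 12, 0, 7]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "col_a.bin")
        write_int64_column(path, values)
        with open(path, "rb") as fh:
            # Map the file read-only; pages are loaded by the OS on access.
            with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                # Reinterpret the mapped bytes as int64 without copying them.
                with memoryview(mm) as raw, raw.cast("q") as column:
                    rowindex = filter_rows(column, lambda v: v > 0)
                    # Values are read through the rowindex only when needed.
                    selected = [column[i] for i in rowindex]
    return rowindex, selected
```

Because the bytes on disk already match the in-memory representation, opening the column involves no parsing step, and the filter result is just a list of row positions that refers back to the mapped data.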

## Installation

On macOS, Linux and Windows systems installing datatable is as easy as

```sh
pip install datatable
```

On all other platforms a source distribution will be needed. For more
information see [Build instructions](https://datatable.readthedocs.io/en/latest/install.html).
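
Once the package is installed, a short smoke test can confirm that it imports and that the `DT[i, j, by]` query syntax works. The sketch below assumes a reasonably recent `datatable` release; the function name `smoke_test` and the sample data are illustrative only.

```python
def smoke_test():
    """Build a small frame, filter it, and aggregate it by group."""
    import datatable as dt
    from datatable import f, by

    # A frame with one integer column and one string column.
    DT = dt.Frame(A=[1, 2, 3, 4], B=["x", "y", "x", "y"])

    # i-expression: keep rows where A > 1, select all columns.
    filtered = DT[f.A > 1, :]

    # j-expression with by(): sum of A within each value of B.
    grouped = DT[:, dt.sum(f.A), by(f.B)]

    # to_list() returns plain Python lists, one per column.
    return filtered.to_list(), grouped.to_list()
```

Calling `smoke_test()` returns the filtered and grouped results as plain Python lists, which also shows the simplest form of conversion out of a `Frame`.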

## See also

* [Build instructions](https://datatable.readthedocs.io/en/latest/install.html)
* [Documentation](https://datatable.readthedocs.io/en/latest/?badge=latest)

  [pandas]: https://github.com/pandas-dev/pandas
  [sframe]: https://github.com/turi-code/SFrame
  [data.table]: https://github.com/Rdatatable/data.table
  [driverless.ai]: https://www.h2o.ai/driverless-ai/
