https://github.com/h2oai/datatable
A Python package for manipulating 2-dimensional tabular data structures
- Host: GitHub
- URL: https://github.com/h2oai/datatable
- Owner: h2oai
- License: mpl-2.0
- Created: 2017-03-03T02:32:59.000Z (almost 8 years ago)
- Default Branch: main
- Last Pushed: 2024-10-24T17:57:13.000Z (3 months ago)
- Last Synced: 2024-12-31T07:07:40.640Z (12 days ago)
- Topics: data-analysis, data-structure, ftrl, performance, python
- Language: C++
- Homepage: https://datatable.readthedocs.io
- Size: 14.7 MB
- Stars: 1,820
- Watchers: 108
- Forks: 158
- Open Issues: 178
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- best-of-python - GitHub - 11% open · ⏱️ 01.12.2023 (Data Containers & Dataframes)
- awesome-dataframes - datatable - A Python package for manipulating 2-dimensional tabular data structures. (Libraries)
- awesome-list - datatable - A Python package for manipulating 2-dimensional tabular data structures. (Data Processing / Data Representation)
- awesome-python-machine-learning-resources - GitHub - 10% open · ⏱️ 12.08.2022 (Data Containers and Structures)
README
# datatable
[![PyPi version](https://img.shields.io/pypi/v/datatable.svg)](https://pypi.org/project/datatable/)
[![License](https://img.shields.io/pypi/l/datatable.svg)](https://github.com/h2oai/datatable/blob/main/LICENSE)
[![Build Status](https://travis-ci.org/h2oai/datatable.svg?branch=main)](https://travis-ci.org/h2oai/datatable)
[![Documentation Status](https://readthedocs.org/projects/datatable/badge/?version=latest)](https://datatable.readthedocs.io/en/latest/?badge=latest)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/e72cadff26ed4ad68decd61b66b4c563)](https://www.codacy.com/app/st-pasha/datatable?utm_source=github.com&utm_medium=referral&utm_content=h2oai/datatable&utm_campaign=Badge_Grade)

This is a Python package for manipulating 2-dimensional tabular data structures
(aka data frames). It is close in spirit to [pandas][] or [SFrame][]; however we
put specific emphasis on speed and big data support. As the name suggests, the
package is closely related to R's [data.table][] and attempts to mimic its core
algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.

## Project goals
`datatable` started in 2017 as a toolkit for performing big data (up to 100GB)
operations on a single-node machine, at the maximum speed possible. Such
requirements are dictated by modern machine-learning applications, which need
to process large volumes of data and generate many features in order to
achieve the best model accuracy. The first user of `datatable` was
[Driverless.ai][].

The set of features that we want to implement with `datatable` is at least
the following:

* Column-oriented data storage.
* Native-C implementation for all datatypes, including strings. Packages such
  as pandas and numpy already do that for numeric columns, but not for
  strings.
* Support for date-time and categorical types. Object type is also supported,
  but promotion into object is discouraged.
* All types should support null values, with as little overhead as possible.
* Data should be stored on disk in the same format as in memory. This will
  allow us to memory-map data on disk and work on out-of-memory datasets
  transparently.
* Work with memory-mapped datasets to avoid loading into memory more data than
  necessary for each particular operation.
* Fast data reading from CSV and other formats.
* Multi-threaded data processing: time-consuming operations should attempt to
  utilize all cores for maximum efficiency.
* Efficient algorithms for sorting/grouping/joining.
* Expressive query syntax (similar to [data.table][]).
* Minimal amount of data copying, copy-on-write semantics for shared data.
* Use "rowindex" views in filtering/sorting/grouping/joining operators to
  avoid unnecessary data copying.
* Interoperability with pandas / numpy / pyarrow / pure python: the users
  should have the ability to convert to another data-processing framework
  with ease.

## Installation
On macOS, Linux and Windows systems installing datatable is as easy as
```sh
pip install datatable
```

On all other platforms a source distribution will be needed. For more
information see [Build instructions](https://datatable.readthedocs.io/en/latest/install.html).

## See also

* [Build instructions](https://datatable.readthedocs.io/en/latest/install.html)
* [Documentation](https://datatable.readthedocs.io/en/latest/?badge=latest)

[pandas]: https://github.com/pandas-dev/pandas
[sframe]: https://github.com/turi-code/SFrame
[data.table]: https://github.com/Rdatatable/data.table
[driverless.ai]: https://www.h2o.ai/driverless-ai/
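
## Example: memory-mapped column storage (illustrative sketch)

The project goals mention storing data on disk in the same format as in memory, so that a dataset can be memory-mapped and only the needed parts are read. The snippet below is a minimal sketch of that general idea using only the Python standard library. It is not `datatable`'s implementation or API: the helper names `write_column` and `sum_rows`, the file name `column.bin`, and the choice of a single 64-bit integer column are all assumptions made for illustration.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Store 64-bit integers on disk in the same raw layout they have in memory."""
    with open(path, "wb") as fh:
        array("q", values).tofile(fh)


def sum_rows(path, start, stop):
    """Sum rows [start, stop) of a column file through a read-only memory map."""
    with open(path, "rb") as fh, \
            mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            memoryview(mm) as raw, \
            raw.cast("q") as column:
        # Indexing touches only the requested rows; the operating system
        # pages in just the parts of the file that back them.
        return sum(column[i] for i in range(start, stop))


# Hypothetical usage: write 1000 rows, then read back only rows 10..19.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_column(path, range(1000))
    total = sum_rows(path, 10, 20)
```

Because the on-disk bytes match the in-memory layout, no parsing or conversion step is needed when the file is opened. A real column store would also have to deal with null values, strings, and multiple columns, which this sketch leaves out.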