{"id":27936207,"url":"https://github.com/enso-org/dataframes","last_synced_at":"2026-02-27T15:33:45.214Z","repository":{"id":45990227,"uuid":"137364889","full_name":"enso-org/dataframes","owner":"enso-org","description":"A library for working with tabular data in Luna.","archived":false,"fork":false,"pushed_at":"2019-06-27T12:37:42.000Z","size":3103,"stargazers_count":5,"open_issues_count":34,"forks_count":5,"subscribers_count":21,"default_branch":"master","last_synced_at":"2025-08-17T13:33:07.840Z","etag":null,"topics":["dataframes","hybrid","luna","textual","visual","visualisation"],"latest_commit_sha":null,"homepage":"https://luna-lang.org","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/enso-org.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-06-14T13:54:21.000Z","updated_at":"2025-05-03T00:21:09.000Z","dependencies_parsed_at":"2022-08-26T09:30:32.165Z","dependency_job_id":null,"html_url":"https://github.com/enso-org/dataframes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/enso-org/dataframes","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enso-org%2Fdataframes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enso-org%2Fdataframes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enso-org%2Fdataframes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enso-org%2Fdataframes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/enso-org","download_url":"https://codeload.github.com/enso-org
/dataframes/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enso-org%2Fdataframes/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270856563,"owners_count":24657688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-17T02:00:09.016Z","response_time":129,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataframes","hybrid","luna","textual","visual","visualisation"],"created_at":"2025-05-07T06:56:52.618Z","updated_at":"2026-02-27T15:33:40.178Z","avatar_url":"https://github.com/enso-org.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dataframes implementation in Luna\n## Purpose\nThis project is a library with dataframes implementation. 
Dataframes are structures that make working with large datasets more convenient.\n\n## Build status\n\n| Environment                 | Build status |\n|-----------------------------|--------------|\n| CI Build (macOS, Linux, Windows) | [![Build Status](https://dev.azure.com/luna-lang/luna/_apis/build/status/luna.Dataframes?branchName=master)](https://dev.azure.com/luna-lang/luna/_build/latest?definitionId=3\u0026branchName=master) |\n\n## Third-party dependencies\nRequired dependencies:\n* C++ build tools:\n    * [CMake](https://cmake.org/) — cross-platform build tool for C++ used by the C++ helper and all its dependencies.\n    * A mostly C++17-compliant compiler. The tested ones are Visual Studio 2017.8 on Windows and GCC 7.3.0 on Ubuntu. Anything newer is expected to work as well.\n* Libraries:\n  * [Apache Arrow](https://arrow.apache.org/) — its [C++ library](https://github.com/apache/arrow/tree/master/cpp) component must be installed.\n  * [Boost C++ Libraries](https://www.boost.org/) — also required by Apache Arrow.\n  * [date library](https://github.com/HowardHinnant/date) — for calendar support for timestamps.\n  * [pybind11](https://github.com/pybind/pybind11) — C++ Python bindings\n  * [Python 3.6+](https://www.python.org/) with some packages:\n    * `matplotlib`\n    * `seaborn`\n  * [RapidJSON](https://github.com/Tencent/rapidjson) — needed for LQuery processing\n  * [{fmt}](http://fmtlib.net/) - **version 5.2.0 does not work, use 5.2.1** - C++ string formatting library\n\nOptional dependencies:\nThese dependencies are not required to compile the helper library; however, without them certain functionality will be disabled.\n* [xlnt library](https://github.com/mwu-tow/xlnt) — C++ library needed for .xlsx file format support. NOTE: On macOS, mwu-tow's fork is needed to fix the compilation issue. 
On other platforms, the [official library repo](https://github.com/tfussell/xlnt) can be used.\n\n## Build \u0026 Install\n* make sure that all dependencies are installed.\n    * On macOS this is easily done with Anaconda (https://www.anaconda.com/download/).\n    * Once you have installed it, you can run the following commands to install Arrow and RapidJSON:\n        ```bash\n        conda create -n dataframes python=3.6\n        conda activate dataframes\n        conda install arrow-cpp=0.10.* -c conda-forge\n        conda install pyarrow=0.10.* -c conda-forge\n        conda install rapidjson\n        ```\n    * With that in place, you need to instruct CMake where to find the libraries you've just installed. Add the following lines to `native_libs/src/CMakeLists.txt`:\n        ```cmake\n        set(CMAKE_LIBRARY_PATH \"/anaconda3/envs/dataframes/lib\")\n        set(CMAKE_INCLUDE_PATH \"/anaconda3/envs/dataframes/include\")\n        ```\n    And you should be all set.\n* build the helper C++ library — CMake will automatically place the built binary in the `native_libs/platform` directory, so `luna` should be able to find it out of the box.\n    * on Windows start *Visual Studio x64 Tools Command Prompt* and type:\n      ```\n      cd Dataframes\\native_libs\n      mkdir build\n      cd build\n      cmake -G\"NMake Makefiles\" ..\\src\n      nmake\n      ```\n    * on other platforms:\n      ```\n      cd Dataframes/native_libs\n      mkdir build\n      cd build\n      cmake ../src\n      make\n      ```\n    where `Dataframes` refers to the local copy of this repo.\n* happily use the dataframes library\n\n## Overview\nThe library currently provides wrappers for Apache Arrow structures.\n\n### Storage types\n* `ArrayData` — type-erased storage for `Array` consisting of several contiguous memory buffers. The buffer count depends on the stored type. Typically there are two buffers: one for values and one for masking nulls. 
More complex types (unions, lists) will use more.\n* `Array tag` — data array with strongly typed accessors. See the section below for supported `tag` types.\n* `ChunkedArray tag` — a list of `Array`s of the same type viewed as a single large array. Allows storing large sequences of data (that could not feasibly be stored in a single memory block) and efficient slice / concat operations.\n* `Column` — type-erased accessor for a named `ChunkedArray`. The stored type is represented by one of its constructors. Described by `Field`.\n* `Table` — ordered sequence of `Column`s. Described by `Schema`.\n\n### Type tag types\nThese types are provided by the library to identify types that can be stored by `Array` and their mapping to Luna types. Currently provided type tags are listed in the table below.\n\n| Tag type        | Luna value type | Apache Arrow type   | Memory per element                          |\n|-----------------|-----------------|---------------------|---------------------------------------------|\n| StringType      | Text            | utf8 non-nullable   | 4 bytes + 1 byte per character + 1 bit mask |\n| MaybeStringType | Maybe Text      | utf8 nullable       | as above                                    |\n| Int64Type       | Int             | int64 non-nullable  | 8 bytes + 1 bit mask                        |\n| MaybeInt64Type  | Maybe Int       | int64 nullable      | as above                                    |\n| DoubleType      | Real            | double non-nullable | 8 bytes + 1 bit mask                        |\n| MaybeDoubleType | Maybe Real      | double nullable     | as above                                    |\n\nNote: Arrow's `utf8` type is a list of non-nullable bytes.\n\n### IO types\nCSV and Feather files are supported. 
XLSX files are supported if the helper C++ library was built with the third-party XLNT library enabled.\n\n| Format                                          | Parser Type     | Generator Type     | Remarks                                                       |\n|-------------------------------------------------|-----------------|--------------------|---------------------------------------------------------------|\n| [CSV file](https://tools.ietf.org/html/rfc4180) | `CSVParser`     | `CSVGenerator`     |                                                               |\n| XLSX                                            | `XLSXParser`    | `XLSXGenerator`    | Requires optional XLNT library                                |\n| Feather                                         | `FeatherParser` | `FeatherGenerator` | Best performance, not all value types are currently supported |\n\n\n#### Methods\nEach parser type provides the following method:\n* `readFile path :: Text -\u003e Table`\nEach generator type provides the following method:\n* `writeFile path table :: Text -\u003e Table -\u003e IO None`\n\nColumn names are by default read from the file. CSV and XLSX parsers can also work with files that do not contain the header row. In such a case, one of the methods below should be called:\n* `useCustomNames names` where `names :: [Text]` is a user-provided list of desired column names. 
If the file has more columns than provided names, names for the remaining columns will be generated.\n* `useGeneratedColumnNames` — all column names will be automatically generated (and the first row will be treated as containing values).\n\nSimilarly, the CSV and XLSX generators can be configured to write or skip a header row with names.\n* `setHeaderPolicy WriteHeaderLine` or `setHeaderPolicy SkipHeaderLine`\n\nThe CSV generator can also be configured to either always enclose fields in quotes or to do so only when necessary (the latter being the default):\n* `setQuotingPolicy QuoteWhenNeeded` or `setQuotingPolicy QuoteAllFields`\n\n\n### Other types\n* `DataType` represents the type of values being stored in an `ArrayData`. Note that this type does not record whether values are nullable; being nullable is a property of `Field`, not `DataType`.\n* `Field` is a named `DataType` with additional information on whether values are nullable. Describes contents of the `Column`.\n* `Schema` is a sequence of `Field`s describing the `Table`.\n\n## Data processing API\n\n\n## Data description API\n### Table\n* `corr` — calculates the correlation matrix, with the Pearson correlation coefficient for each column pair.\n* `corrWith columnName` — calculates the Pearson correlation coefficient between the given column in a table and its other columns.\n* `countValues columnName` — returns a table with pairs (value, count).\n* `describeNa` — calculates the count of null values and their ratio to the total row count.\n* `describe columnName` — calculates a number of statistics for a given column (mean, std, min, quartiles, max).\n\n### Column\n* `countMissing` — returns the number of null values in the column.\n* `countValues` — counts occurrences of each unique value and returns pairs (value, count).\n* stats:\n  * `min` — minimum of values.\n  * `max` — maximum of values.\n  * `mean` — mean of values.\n  * `median` — median of values (interpolated, if value count is 
even).\n  * `std` — standard deviation of values.\n  * `var` — variance of values.\n  * `sum` — sum of values.\n  * `quantile q` — value at the given quantile, where q is in the range [0, 1].\n* `describe` — calculates a number of statistics for a given column (mean, std, min, quartiles, max).\n\n## Tutorial\nTBD\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fenso-org%2Fdataframes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fenso-org%2Fdataframes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fenso-org%2Fdataframes/lists"}