{"id":13611002,"url":"https://github.com/bluenote10/NimData","last_synced_at":"2025-04-13T01:34:00.881Z","repository":{"id":43731243,"uuid":"80755906","full_name":"bluenote10/NimData","owner":"bluenote10","description":"DataFrame API written in Nim, enabling fast out-of-core data processing","archived":false,"fork":false,"pushed_at":"2021-06-09T08:50:31.000Z","size":426,"stargazers_count":341,"open_issues_count":28,"forks_count":22,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-11-07T02:30:49.858Z","etag":null,"topics":["dataframe","nim"],"latest_commit_sha":null,"homepage":"","language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bluenote10.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-02-02T18:34:04.000Z","updated_at":"2024-10-23T11:03:52.000Z","dependencies_parsed_at":"2022-09-21T16:01:15.914Z","dependency_job_id":null,"html_url":"https://github.com/bluenote10/NimData","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluenote10%2FNimData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluenote10%2FNimData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluenote10%2FNimData/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bluenote10%2FNimData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bluenote10","download_url":"https://codeload.github.com/bluenote10/NimData/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.c
om","kind":"github","repositories_count":223558463,"owners_count":17165134,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataframe","nim"],"created_at":"2024-08-01T19:01:50.761Z","updated_at":"2024-11-07T17:31:06.161Z","avatar_url":"https://github.com/bluenote10.png","language":"Nim","funding_links":[],"categories":["Nim","Data"],"sub_categories":["Data Processing"],"readme":"# NimData  [![Build Status](https://github.com/bluenote10/NimData/workflows/ci/badge.svg)](https://github.com/bluenote10/NimData/actions?query=workflow%3Aci) [![license](https://img.shields.io/github/license/mashape/apistatus.svg)](LICENSE) \u003ca href=\"https://github.com/yglukhov/nimble-tag\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/yglukhov/nimble-tag/master/nimble.png\" height=\"23\" \u003e\u003c/a\u003e\n\n## Overview\n\n**NimData** is a data manipulation and analysis library for the Nim programming language. It combines Pandas-like syntax with the type-safe, lazy APIs of distributed frameworks like Spark/Flink/Thrill. Although NimData is  currently non-distributed, it harnesses the power of Nim to perform out-of-core processing at native speed.\n\nNimData's core data type is the generic `DataFrame[T]`. All `DataFrame` methods are based on the MapReduce paradigm and fall into two categories:\n\n- **Transformations**: Operations like `map` or `filter` transform one `DataFrame` into another. Transformations are lazy, meaning that they are not executed until an *action* is called. 
They can also be chained.\n- **Actions**: Operations like `count`, `min`, `max`, `sum`, `reduce`, `fold`, `collect`, or `show` perform an aggregation on a `DataFrame`. Calling an action triggers the processing pipeline.\n\nFor a complete list of NimData's supported operations, see the\n[module docs](https://bluenote10.github.io/NimData/nimdata.html).\n\n\n## Installation\n\n1. [Install Nim](https://nim-lang.org/install.html) and ensure that both Nim and Nimble (Nim's package manager) are added to your PATH.\n2. From the command line, run `$ nimble install NimData` (this will download NimData's source from GitHub to `~/.nimble/pkgs`).\n\n\n## Quickstart\n\n### Hello, World!\n\nOnce NimData is installed, we'll write a simple program to test it. Create a new file named `test.nim` with the following contents:\n\n```nim\nimport nimdata\n\necho DF.fromRange(0, 10).collect()\n```\n\nFrom the command line, use `$ nim c -r test.nim` to compile and run the program (`c` for *compile*, and `-r` to *run* directly after compilation). It should print this [sequence](https://nim-by-example.github.io/seqs/):\n```nim\n# =\u003e @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n```\n\n_Pandas users_: This is roughly equivalent to `print(pd.DataFrame(range(10))[0].values)`\n\n\n### Reading raw text data\n\nNext we'll use [this](examples/Bundesliga.csv) German soccer data set to explore NimData's main functionality.\n\nTo create a `DataFrame` which simply iterates over the raw text content\nof a file, we can use `DF.fromFile()`:\n\n```nim\nlet dfRawText = DF.fromFile(\"examples/Bundesliga.csv\")\n```\n\nNote that `fromFile` is a *lazy* operation, meaning that NimData doesn't actually read the contents of the file yet. To read the file, we need to call an *action* on our dataframe. Calling `count`, for example, triggers a line-by-line reading of the file and returns the number of rows:\n\n```nim\necho dfRawText.count()\n# =\u003e 14018\n```\n\nWe can chain multiple operations on `dfRawText`. 
For example, we can use `take` to filter the file down to its first five rows, and `show` to print the result:\n\n```nim\ndfRawText.take(5).show()\n# =\u003e\n# \"1\",\"Werder Bremen\",\"Borussia Dortmund\",3,2,1,1963,1963-08-24 09:30:00\n# \"2\",\"Hertha BSC Berlin\",\"1. FC Nuernberg\",1,1,1,1963,1963-08-24 09:30:00\n# \"3\",\"Preussen Muenster\",\"Hamburger SV\",1,1,1,1963,1963-08-24 09:30:00\n# \"4\",\"Eintracht Frankfurt\",\"1. FC Kaiserslautern\",1,1,1,1963,1963-08-24 09:30:00\n# \"5\",\"Karlsruher SC\",\"Meidericher SV\",1,4,1,1963,1963-08-24 09:30:00\n```\n\n_Pandas users_: This is equivalent to `print(dfRawText.head(5))`.\n\nNote, however, that every time an action is called, the file is read from scratch, which is inefficient. We'll improve on that in a moment.\n\n### Type-safe schema parsing\n\nAt this stage, `dfRawText`'s data type is a plain `DataFrame[string]`. It also doesn't have any column headers, and the first field isn't a proper index, but rather contains [string literals](https://nim-lang.org/docs/manual.html#lexical-analysis-generalized-raw-string-literals). Let's transform our dataframe into something more useful for analysis:\n\n```nim\nconst schema = [\n  strCol(\"index\"),\n  strCol(\"homeTeam\"),\n  strCol(\"awayTeam\"),\n  intCol(\"homeGoals\"),\n  intCol(\"awayGoals\"),\n  intCol(\"round\"),\n  intCol(\"year\"),\n  dateCol(\"date\", format=\"yyyy-MM-dd hh:mm:ss\")\n]\nlet df = dfRawText.map(schemaParser(schema, ','))\n                  .map(record =\u003e record.projectAway(index))\n                  .cache()\n```\n\nThis code does three things:\n\n1. The [`schemaParser` macro](https://bluenote10.github.io/NimData/nimdata/schema_parser.html#12) constructs a specialized parsing function for each field, which takes a string as input and returns a type-safe named tuple corresponding to the type definition in `schema`. For instance, `dateCol(\"date\")` tells the parser that the last column is named \"date\" and contains `datetime` values. 
We can even specify the datetime format by passing a format string to `dateCol()` as a named parameter. A key benefit of defining the schema at compile time is that the parser produces highly optimized machine code, resulting in very fast performance.\n\n2. The `projectAway` macro transforms the results of `schemaParser` into a new dataframe with the \"index\" column removed (_Pandas users_: this is roughly equivalent to `dfRawText.drop(columns=['index'])`). See also `projectTo`, which instead _keeps_ certain fields, and `addFields`, which extends the schema with new fields.\n\n3. The `cache` method stores the parsing result in memory. This allows us to perform multiple actions on the data without having to re-read the file contents every time. _Spark users_: In contrast to Spark, `cache` is currently implemented as an action.\n\n\n\nNow we can perform the same operations as before, but this time our dataframe contains the parsed tuples:\n\n```nim\necho df.count()\n# =\u003e 14018\n\ndf.take(5).show()\n# =\u003e\n# +------------+------------+------------+------------+------------+------------+------------+\n# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |\n# +------------+------------+------------+------------+------------+------------+------------+\n# | \"Werder B… | \"Borussia… |          3 |          2 |          1 |       1963 | 1963-08-2… |\n# | \"Hertha B… | \"1. FC Nu… |          1 |          1 |          1 |       1963 | 1963-08-2… |\n# | \"Preussen… | \"Hamburge… |          1 |          1 |          1 |       1963 | 1963-08-2… |\n# | \"Eintrach… | \"1. 
FC Ka… |          1 |          1 |          1 |       1963 | 1963-08-2… |\n# | \"Karlsruh… | \"Meideric… |          1 |          4 |          1 |       1963 | 1963-08-2… |\n# +------------+------------+------------+------------+------------+------------+------------+\n```\n\nNote that instead of starting the pipeline from `dfRawText` and using\ncaching, we could always write the pipeline from scratch:\n\n```nim\nDF.fromFile(\"examples/Bundesliga.csv\")\n  .map(schemaParser(schema, ','))\n  .map(record =\u003e record.projectAway(index))\n  .take(5)\n  .show()\n```\n\n### Filter\n\nData can be filtered by using `filter`. For instance, we can filter the data to get games\nof a certain team only:\n\n```nim\nimport strutils\n\ndf.filter(record =\u003e\n    record.homeTeam.contains(\"Freiburg\") or\n    record.awayTeam.contains(\"Freiburg\")\n  )\n  .take(5)\n  .show()\n# =\u003e\n# +------------+------------+------------+------------+------------+------------+------------+\n# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |\n# +------------+------------+------------+------------+------------+------------+------------+\n# | \"Bayern M… | \"SC Freib… |          3 |          1 |          1 |       1993 | 1993-08-0… |\n# | \"SC Freib… | \"Wattensc… |          4 |          1 |          2 |       1993 | 1993-08-1… |\n# | \"Borussia… | \"SC Freib… |          3 |          2 |          3 |       1993 | 1993-08-2… |\n# | \"SC Freib… | \"Hamburge… |          0 |          1 |          4 |       1993 | 1993-08-2… |\n# | \"1. 
FC Ko… | \"SC Freib… |          2 |          0 |          5 |       1993 | 1993-09-0… |\n# +------------+------------+------------+------------+------------+------------+------------+\n```\n_Note: Without importing the `strutils` module, the call to `contains` fails to compile with a type error here._\n\nOr search for games with many home goals:\n\n```nim\ndf.filter(record =\u003e record.homeGoals \u003e= 10)\n  .show()\n# =\u003e\n# +------------+------------+------------+------------+------------+------------+------------+\n# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |\n# +------------+------------+------------+------------+------------+------------+------------+\n# | \"Borussia… | \"Schalke … |         11 |          0 |         18 |       1966 | 1967-01-0… |\n# | \"Borussia… | \"Borussia… |         10 |          0 |         12 |       1967 | 1967-11-0… |\n# | \"Bayern M… | \"Borussia… |         11 |          1 |         16 |       1971 | 1971-11-2… |\n# | \"Borussia… | \"Borussia… |         12 |          0 |         34 |       1977 | 1978-04-2… |\n# | \"Borussia… | \"Arminia … |         11 |          1 |         12 |       1982 | 1982-11-0… |\n# | \"Borussia… | \"Eintrach… |         10 |          0 |          8 |       1984 | 1984-10-1… |\n# +------------+------------+------------+------------+------------+------------+------------+\n```\n\nNote that we can now fully benefit from type-safety:\nThe compiler knows the exact fields and types of a record.\nNo dynamic field lookup and/or type casting is required.\nAssumptions about the data structure are moved to the earliest\npossible step in the pipeline, so that the pipeline fails early if they\nare wrong. 
After transitioning into the type-safe domain, the\ncompiler helps to verify the correctness of even long processing\npipelines, reducing the risk of runtime errors.\n\nOther filter-like transformations are:\n\n- `take`, which takes the first N records, as seen above.\n- `drop`, which discards the first N records.\n- `filterWithIndex`, which lets you define a filter function that takes both the index and the element as input.\n\n### Collecting data\n\nA `DataFrame[T]` can be converted easily into a `seq[T]` (Nim's native dynamic\narrays) by using `collect`:\n\n```nim\necho df.map(record =\u003e record.homeGoals)\n       .filter(goals =\u003e goals \u003e= 10)\n       .collect()\n# =\u003e @[11, 10, 11, 12, 11, 10]\n```\n\n### Numerical aggregation\n\nA `DataFrame` of a numerical type supports functions like `min`/`max`/`mean`.\nThis makes it possible to compute things like:\n\n```nim\necho \"Min year: \", df.map(record =\u003e record.year).min()\necho \"Max year: \", df.map(record =\u003e record.year).max()\necho \"Average home goals: \", df.map(record =\u003e record.homeGoals).mean()\necho \"Average away goals: \", df.map(record =\u003e record.awayGoals).mean()\n# =\u003e\n# Min year: 1963\n# Max year: 2008\n# Average home goals: 1.898130974461407\n# Average away goals: 1.190754743900699\n\n# Let's find the game with the largest goal difference\nlet maxDiff = df.map(record =\u003e (record.homeGoals - record.awayGoals).abs).max()\ndf.filter(record =\u003e (record.homeGoals - record.awayGoals).abs == maxDiff)\n  .show()\n# =\u003e\n# +------------+------------+------------+------------+------------+------------+------------+\n# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |\n# +------------+------------+------------+------------+------------+------------+------------+\n# | \"Borussia… | \"Borussia… |         12 |          0 |         34 |       1977 | 1978-04-2… |\n# 
+------------+------------+------------+------------+------------+------------+------------+\n```\n\n### Sorting\n\nA `DataFrame` can be transformed into a sorted `DataFrame` by the `sort()` method.\nWithout any arguments, the operation sorts using the default\ncomparison over all columns. By specifying a key function and the sort order,\nwe can for instance rank the games by the number of away goals:\n\n```nim\ndf.sort(record =\u003e record.awayGoals, SortOrder.Descending)\n  .take(5)\n  .show()\n# =\u003e\n# +------------+------------+------------+------------+------------+------------+------------+\n# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |\n# +------------+------------+------------+------------+------------+------------+------------+\n# | \"Tasmania… | \"Meideric… |          0 |          9 |         27 |       1965 | 1966-03-2… |\n# | \"Borussia… | \"TSV 1860… |          1 |          9 |         29 |       1965 | 1966-04-1… |\n# | \"SSV Ulm\"  | \"Bayer Le… |          1 |          9 |         25 |       1999 | 2000-03-1… |\n# | \"Rot-Weis… | \"Eintrach… |          1 |          8 |         32 |       1976 | 1977-05-0… |\n# | \"Borussia… | \"Bayer Le… |          2 |          8 |         10 |       1998 | 1998-10-3… |\n# +------------+------------+------------+------------+------------+------------+------------+\n```\n\n### Unique values\n\nThe `DataFrame[T].unique()` transformation filters a `DataFrame` to unique elements.\nThis can be used for instance to find the number of teams that appear in the data:\n\n```nim\necho df.map(record =\u003e record.homeTeam).unique().count()\n# =\u003e 52\n```\n\n_Pandas user note_: In contrast to Pandas, there is no differentiation between\na one-dimensional series and a multi-dimensional `DataFrame` (`unique` vs `drop_duplicates`).\n`unique` works the same for any hashable type `T`, e.g., we might as well get\na `DataFrame` of unique 
pairs:\n\n```nim\ndf.map(record =\u003e record.projectTo(homeTeam, awayTeam))\n  .unique()\n  .take(5)\n  .show()\n# =\u003e\n# +------------+------------+\n# | homeTeam   | awayTeam   |\n# +------------+------------+\n# | \"Werder B… | \"Borussia… |\n# | \"Hertha B… | \"1. FC Nu… |\n# | \"Preussen… | \"Hamburge… |\n# | \"Eintrach… | \"1. FC Ka… |\n# | \"Karlsruh… | \"Meideric… |\n# +------------+------------+\n```\n\n### Value counts\n\nThe `DataFrame[T].valueCounts()` transformation extends the functionality of\n`unique()` by returning the unique values and their respective counts.\nThe type of the transformed `DataFrame` is a tuple of `(key: T, count: int)`,\nwhere `T` is the original type.\n\nIn our example, we can use `valueCounts()` for instance to find the most\nfrequent results in German soccer:\n\n```nim\ndf.map(record =\u003e record.projectTo(homeGoals, awayGoals))\n  .valueCounts()\n  .sort(x =\u003e x.count, SortOrder.Descending)\n  .map(x =\u003e (\n    homeGoals: x.key.homeGoals,\n    awayGoals: x.key.awayGoals,\n    count: x.count\n  ))\n  .take(5)\n  .show()\n# =\u003e\n# +------------+------------+------------+\n# |  homeGoals |  awayGoals |      count |\n# +------------+------------+------------+\n# |          1 |          1 |       1632 |\n# |          2 |          1 |       1203 |\n# |          1 |          0 |       1109 |\n# |          2 |          0 |       1092 |\n# |          0 |          0 |        914 |\n# +------------+------------+------------+\n```\n\nThis transformation first projects the data onto a named tuple of\n`(homeGoals, awayGoals)`. After applying `valueCounts()` the data\nframe is sorted according to the counts. 
The final `map()` function\nis purely for cosmetics of the resulting table, projecting the nested\n`(key: (homeGoals: int, awayGoals: int), count: int)` tuple back\nto a flat result.\n\n### `DataFrame` viewer\n\n`DataFrame`s can be opened and inspected in the browser by using `df.openInBrowser()`,\nwhich offers a simple JavaScript-based data browser:\n\n![Viewer example](docs/viewer_example.png)\n\nNote that the viewer uses static HTML, so it should only be applied to small\nor heavily filtered `DataFrame`s.\n\n\n## Benchmarks\n\nMore meaningful benchmarks are still on the todo list. This just shows a\nfew first results. The benchmarks will be split into small (data\nwhich fits into memory so we can compare against Pandas or R easily) and\nbig (where we can only compare against out-of-core frameworks).\n\nAll implementations are available in the [benchmarks](benchmarks) folder.\n\n### Basic operations (small data)\n\nThe test data set is a CSV file with 1 million rows, two int columns, and two float columns.\nThe test tasks are:\n\n- Parse/Count: Just the most basic operations -- iterating the file, applying\nparsing, and returning a count.\n- Column Averages: Same steps, plus an additional computation of all 4 column means.\n\nThe results are the average runtime in seconds over three runs:\n\n| Task                    |          NimData |           Pandas |  Spark (4 cores) |   Dask (4 cores) |\n|:------------------------|-----------------:|-----------------:|-----------------:|-----------------:|\n| Parse/Count             |            0.165 |            0.321 |            1.606 |            0.182 |\n| Column Averages         |            0.259 |            0.340 |            1.179 |            0.622 |\n\nNote that Spark internally caches the file over the three runs, so the first iteration\nis much slower (with \u003e 3 sec) while it reaches run times of 0.6 sec in the last iterations\n(obviously the data is too small to justify the overhead anyway).\n\n\n## Next steps\n\n- More 
transformations:\n  - [x] map\n  - [x] filter\n  - [x] flatMap\n  - [x] sort\n  - [x] unique\n  - [x] valueCounts\n  - [x] groupBy (reduce)\n  - [ ] groupBy (transform)\n  - [x] join (inner)\n  - [ ] join (outer)\n  - [ ] concat/union\n  - [ ] window\n- More actions:\n  - [x] numerical aggregations (count, min, max, sum, mean)\n  - [x] collect\n  - [x] show\n  - [x] openInBrowser\n- More data formats/sources\n  - [x] csv\n  - [x] gzipped csv\n  - [ ] parquet\n  - [ ] S3\n- REPL or Jupyter kernel?\n- Plotting (maybe in the form of Bokeh bindings)?\n\n## License\n\nThis project is licensed under the terms of the MIT license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluenote10%2FNimData","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbluenote10%2FNimData","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbluenote10%2FNimData/lists"}