https://github.com/SciNim/Datamancer

A dataframe library with a dplyr like API
https://github.com/SciNim/Datamancer
dataframe dplyr hacktoberfest nim nim-lang
Last synced: about 1 month ago
JSON representation
A dataframe library with a dplyr like API
Host: GitHub
URL: https://github.com/SciNim/Datamancer
Owner: SciNim
License: mit
Created: 2021-06-16T17:58:42.000Z (almost 4 years ago)
Default Branch: master
Last Pushed: 2025-04-16T18:07:42.000Z (2 months ago)
Last Synced: 2025-04-17T02:56:04.914Z (about 2 months ago)
Topics: dataframe, dplyr, hacktoberfest, nim, nim-lang
Language: Nim
Homepage: https://scinim.github.io/Datamancer/datamancer.html
Size: 1010 KB
Stars: 139
Watchers: 3
Forks: 8
Open Issues: 8
Metadata Files:
- Readme: README.org
- Changelog: changelog.org
- License: LICENSE
Awesome Lists containing this project

awesome-nim - Datamancer - A dataframe library with a dplyr like API. (Data / Data Processing)
README

        * Datamancer

[[https://github.com/SciNim/datamancer/workflows/datamancer%20CI/badge.svg]]

[[https://matrix.to/#/#nim-science:envs.net][https://img.shields.io/static/v1?message=join%20chat&color=blue&label=nim-science&logo=matrix&logoColor=gold&style=flat-square&.svg]]

[[https://discord.gg/f5hA9UK3dY][https://img.shields.io/discord/371759389889003530?color=blue&label=nim-science&logo=discord&logoColor=gold&style=flat-square&.svg]]

~Datamancer~ is a DataFrame library for Nim, which is heavily inspired

by [[https://dplyr.tidyverse.org/][dplyr]].

** Comparison to other dataframe libraries

Check out the following gist for a comparison of this library with

dplyr (R) and pandas (Python):

https://gist.github.com/Vindaar/6908c038707c7d8293049edb3d204f84

** Documentation

The documentation is found at:

https://scinim.github.io/Datamancer/

with a short introduction under:

https://scinim.github.io/Datamancer/datamancer.html

and a tutorial for data wrangling with Datamancer available at:

https://scinim.github.io/getting-started/basics/data_wrangling.html

** Installation & dependencies

Installation should be just a 

#+BEGIN_SRC sh

nimble install datamancer

#+END_SRC

away.

** Backend targets

The library supports both Nim's C/C++ backends as well as the

JavaScript target. The latter was added in ~v0.4.2~ and is still

experimental. Certain features are currently not supported (reading

files from disk, reading CSVs from a URL).

Note also that on older versions than current devel (as of

<2024-02-19 Mon 14:47>) including the current stable, in certain

applications the formula macro ~f{}~ (see below) does not resolve

types in the way it should on the JS backend. You might need to give

explicit type hints in more cases than usual (also see below).

** Features and formulas

The data frame provides the "5 verbs" of [[https://dplyr.tidyverse.org/][dplyr]] and more. Main implemented functions:

- =filter=

- =mutate=, =transmute=

- =select=, =rename=

- =arrange=

- =summarize=

- =group_by=

- =arrange=

- =inner_join=

- =set_diff=

- =count=

- =bind_rows=

- =gather=

- =unique=,

which are all based on the =FormulaNode= object. Basically they all

receive =varargs[FormulaNode]=, which is evaluated in context of the

given dataframe.

** Supported types

By default the =DataFrame= type supports these types as builtins:

- =float=

- =int=

- =string=

- =bool=

- =Value= (a variant object that can store either of the above)

As of version =v0.3.0= very experimental support to extend the

supported types has landed. See the [[changelog.org]] for an explanation or

the [[playground/non_generic_generics.nim]] example to get an idea.

Essentially any arbitrary type can be used in a DF by extending the

=Column= type using some macro magic.

** A few words on the =f{}= macro to create formulas

Use:

- no infix symbol and only code, which does not involve a column in

  the sense defined below in [[Column access]]:

  #+BEGIN_SRC nim

  f{1 + 2}

  f{"foo"}

  f{true}

  #+END_SRC

  a =FormulaNode= of kind =fkVariable=. Stores the values as a =Value=

  variant object.

- =<-= for assignment

  #+BEGIN_SRC nim

  f{"newName" <- "oldName"}

  #+END_SRC

  a =FormulaNode= of kind =fkAssign=.

  This does not involve a closure and is just a simple object storing

  a LHS as a string and the RHS as a =Value= (to also support constant

  columns via =f{"constantCol" <- 5}=).

  Typically used for =rename= or as an argument for =transmute= and

  =mutate= to just rename a column or to assign a constant column.

- =<<= for reduce operations

  #+BEGIN_SRC nim

  f{"meanHwy" << mean(`hwy`)}

  #+END_SRC

  a =FormulaNode= of kind =fkScalar=.

  Used only for =summarize= and means we reduce a full column to a

  single =Value=. This generates a closure, which computes the RHS and

  assigns it to a result variable of type =Value=. Type hints are

  required (for now) if only a single proc call is involved on the

  RHS to tell the macro as what to read the column "hwy" and what the

  result variable is.

- =~= for vector like proc

  #+BEGIN_SRC nim

  f{"xSquared" ~ `x` * `x`}

  #+END_SRC

  a =FormulaNode= of kind =fkVector=.

  Used in =mutate=, =transmute= to calculate a full column. This also

  generates a closure as the reduce operations =<<= does, except here

  we loop over the length of the DF and access each read tensor via =[idx]=.

- a formula without any infix symbols will be considered:

  - =fkVariable= if no column involved

  - =fkVector= else

*** Column access

To access columns in the context of formula, the biggest change

occured. In the old formula system, a literal string was attempted to

be resolved as a DF column dynamically. Since the new formulas are

compiled to closures, this would involve overhead and is thus avoided

for clearer separation between columns and real strings. This also

helps readers of a formula.

This means:

- =`columnName`=: accented quotes refer to a DF column. Be careful to

  only use this for simple letters (no non letter characters or spaces).

- =c"columnName"= : call string literals (by convention use a =c=

  before the string) are interpreted as a column in the same way as

  accented quotes, but allow for column names with spaces / non letter

  characters.

- =idx("columnName"), idx(`columnName`), idx(nimExpressionReturningString)=:

  refers to a specific element of the referred column

- =col("columnName"), col(`columnName`), col(nimExpressionReturningString)=:

  refers to a the full tensor of the referred column

- or directly via: =df[nimExpressionReturningString] /

  df[nimExpressionReturningString][idx]=: to access columns / indices using

  identifiers / symbols / general expressions that return a string

  quotes, call string literals or just string literals). This is

  equivalent to =idx= / =col=, so the latter are preferred.

The closures take a data frame as an argument, which is named

=df=. The =df["columnName"]= refers to that argument, although not

literally (it is gen'symmed and =df["columnName"]= refers to a

=Column=). From that column we get the underlying =Tensor=.

In the context of calling procedures, e.g.:

#+BEGIN_SRC nim

f{someProc(`columnName`)}

#+END_SRC

it may not be clear whether the procedure is supposed to take the

whole tensor as an argument or hand each element of the tensor in a

loop. Internally the macro tries to determine a suitable call for

either a scalar or tensor argument. If the called procedure is unique

this will likely succeed. In case of heavily overloaded symbols

(e.g. =max=) it also tries to determine a match from (if any)

additional arguments given to that procedure (and uses their types if

they are not column references).

In case at cannot be resolved, you will get an error at compile time

to specify =idx= (per index access) or =col= (full column access) of the column.

So for example:

#+BEGIN_SRC nim

f{"asFloat" ~ parseFloat(idx("colName"))}

#+END_SRC

where =parseFloat= acts on each element individually. If there is only

a single overload (as in case of =parseFloat=), the input and output

types are inferred automatically to be:

- read tensor =colName= as a =string=

- result type is =float=

*** Type information for columns

When accessing columns sometimes it may be necessary or desired to

specify the type with which a column should be read. To achieve this,

the =idx= and =col= helpers mentioned in the previous section can be

given an explicit type:

#+begin_src nim

idx("foo", int) # <- reads "foo" as a `Tensor[int]`

col("bar", float) # <- read "baz" as a `Tensor[float]`)

#+end_src

These can be mixed and matched to read differently typed columns in

the same formula:

#+begin_src nim

f{"foo" ~ idx("x", int) + idx("yStr", string).parseInt}

#+end_src

for an example.

*** Type hints

To avoid specifying the types of every input column (as mentioned above

using =idx= and =col=) and to force a specific return type, type hints

can be used.

Type hints are required if the formula macro cannot determine the type

required, either input or output. This is usually the case for

ambiguous operations (overloaded procedures, only a single column

without any operations, etc.). They are of

the form:

- =: =: simple type hint for the type of the

  underlying tensor of the columns involved in the formula.  

- = -> : =: full type for closure.

  == is the dtype used for input tensors, == the resulting

  type.

For example:

#+begin_src nim

f{int -> int: `x` * `y`}

# ^--- type of the tensors involved on the RHS. Will be read as integers

#        ^--- type of the resulting tensor

#+end_src

In this case the type would be determined to be float by the macro, so

type hints are required in case we need them to be integers.

In addition to looking at symbols in the scope, there is a step

involving some simple heuristic rules, e.g. if =*=, =/= is involved, it's

assumed that the input tensors are floats and the output as well. If

=&= or =$= is involved, it's assumed to be strings. 

Finally if =and= and other logic keywords are used, the result is

assumed to be =bool= (not the input thought!). 

#+BEGIN_SRC nim

    const floatSet = toSet(@["+", "-", "*", "/", "mod"])

    const stringSet = toSet(@["&", "$"])

    const boolSet = toSet(@["and", "or", "xor", ">", "<", ">=", "<=", "==", "!=",

                            "true", "false", "in", "notin"])

#+END_SRC

*** Notes on formula macro internals

For an insight into the implementation details, ideas and development

notes, check out the following document:

https://github.com/SciNim/Datamancer/blob/master/notes/formula_dev_notes.org
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/SciNim/Datamancer

Awesome Lists containing this project

README