https://github.com/haptork/easylambda

distributed dataflows with functional list operations for data processing with C++14
https://github.com/haptork/easylambda

cpp14 dataflow-programming distributed-computing functional-programming hpc mpi parallel

Last synced: over 1 year ago
JSON representation

distributed dataflows with functional list operations for data processing with C++14

Host: GitHub
URL: https://github.com/haptork/easylambda
Owner: haptork
License: bsl-1.0
Created: 2016-03-14T12:08:04.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2019-11-16T09:37:25.000Z (over 6 years ago)
Last Synced: 2024-10-19T20:25:03.510Z (over 1 year ago)
Topics: cpp14, dataflow-programming, distributed-computing, functional-programming, hpc, mpi, parallel
Language: C++
Homepage: https://haptork.github.io/easyLambda/
Size: 1.96 MB
Stars: 497
Watchers: 36
Forks: 43
Open Issues: 2
Metadata Files:
- Readme: Readme.md
- Contributing: contributing.md
- License: LICENSE.md

Awesome Lists containing this project

README

          [![Build Status](https://travis-ci.org/haptork/easyLambda.svg?branch=master)](https://travis-ci.org/haptork/easyLambda)

[![codecov](https://codecov.io/gh/haptork/easyLambda/branch/master/graph/badge.svg)](https://codecov.io/gh/haptork/easyLambda)

# ezl: easyLambda

> Parallel data processing made easy using functional and dataflow programming with modern C++

Welcome to easyLambda and thanks for your interest. The site aims to be a

comprehensive guide for easyLambda.

## What is easyLambda

EasyLambda is *header only C++14* library for data processing in parallel with

*functional list operations* (map, filter, reduce, scan, zip) that are tied

together in *type--safe dataflow*. 

EasyLambda is parallel, it scales from multiple cores to *hundreds of distributed*

nodes *without any need to deal with parallelism* in user code.

EasyLambda is fast. It has *minimal overhead in serial execution* and builds upon

high performance MPI parallelism that is known to be *more efficient than any

other comparable work*

[[1]](http://www.sciencedirect.com/science/article/pii/S1877050915017895).

EasyLambda is *expressive and succinct*, thanks to the *column selection* for

composition of functions and many *generic algorithms* such as *configurable

parallel file reader*, predicates, correlation, summary etc.

EasyLambda is intuitive and easy to understand with its *uniform property based*

(or [ExpressionBuilder](http://martinfowler.com/bliki/ExpressionBuilder.html))

interface for everything from configuring parallelism to changing behavior of

generic algorithms to routing dataflow.

EasyLambda is easily *interoperable* with other libraries like standard library or

raw MPI code, since it uses *standard data types* and enforces no special

structure, data-types or requirements on the user functions.

## Why easyLambda

EasyLambda is a good fit for the following tasks:

+ table/list processing and analysis from CSV or flat text files.

+ post-processing of scientific simulation results.

+ running iterative machine learning algorithms.

+ parallel type-safe data reading.

+ to play with dataflow programming and functional list operations.

Since, it can smoothly interoperate with other libraries, it is possible to

add distributed parallelism using easyLambda to the existing libraries or

codebase when its programming abstraction fits well e.g. it can be used along

with bare MPI code or with a machine learning library to add distributed training

and testing.

EasyLambda will also interest you if you 

+ are a modern C++ enthusiast

+ want to dabble with metaprogramming

+ like functional and dataflow programming

+ have cluster resources that you want to put to use in everyday tasks without much effort.

+ have always wanted a high-level MPI interface.

#### Benchmarks

EasyLambda combines the efficiency of MPI with a high level programming

abstraction. With easyLambda you get easy to understand code with good

run-time performance. Check out the benchmarks and comparisons for performance

and ease of use.

[![benchmarks](doc/benchmarks.png)](https://haptork.github.io/easyLambda/docs/benchmarks/)

## Getting Started

Check out the [Getting Started](https://haptork.github.io/easyLambda/docs/quick-start-guide/)

section of the library webpage to know how to install and begin with easyLambda. The library

can also be used on aws elastic cloud or single instance. 

# Examples

A detailed walkthrough of the library is given [here](https://haptork.github.io/easyLambda/docs/hello-world/),

The [examples directory](examples) contains various examples and demonstrations with explanations

of features and options.

Here we mention some examples in short.

## [Example wordcount](examples/wordcount.cpp)

The following program calculates frequency of each word in the data files.

```cpp

auto reader = fromFile(argv[1]).rowSeparator('s').colSeparator("");

ezl::rise(reader)

  .reduce<1>(ezl::count(), 0).dump()

  .run();

```

The dataflow pipeline starts with `rise` and subsequent operations are added to it.

In the above example, the pipeline begins by reading in data from the specified 

file(s). `fromFile` is a library function that takes column types and the specified 

file(s) glob pattern as input and reads the file(s) in parallel. It has a lot of

properties for controlling data-format, parallelism, denormalization etc

(shown in [demoFromFile](examples/demoFromFile.cpp)).

In `reduce` we pass the index of the key column to group by, the library function

for counting and initial value of the result.

## [Example pi (Monte-Carlo)](examples/pi.cpp)

Following is a dataflow for calculating pi using Monte-Carlo method.

```cpp

ezl::rise(ezl::kick(10000)) // 10000 trials shared over all processes

  .map([] { 

    return pow(rnd(), 2) + pow(rnd(), 2);

  })

  .filter(ezl::lt(1.))

  .reduce(ezl::count(), 0)

  .map([](int inCircleCount) { 

    return (4.0 * inCircleCount / 10000); 

  }).dump()

  .run();

```

The dataflow starts with rise in which we pass a library function to call the

next unit a number of times. The steps in the algorithm have been expressed

with the composition of small operations, some are common library functions

like `count()`, `lt()` (less-than) and some are user-defined functions specific

to the problem.

## [Example CSV stats](examples/cods2016.cpp)

Here is another example from

[cods2016](http://ikdd.acm.org/Site/CoDS2016/datachallenge.html). A stripped

version of the input data-file is given with ezl

[here](data/datachallenge_cods2016/train.csv). The data contains student

profiles with scores, gender, job-salary, city etc.

```cpp

auto scores = ezl::fromFile>(fileName)

                .cols({"Gender", "English", "Logical", "Domain"})

                .colSeparator("\t");

ezl::rise(scores)

  .filter<2>(ezl::gtAr<3>(0.F))   // filter valid domain scores > 0

  .map<1>([] (char gender) {      // transforming with 0/1 for isMale

    return float(gender == 'm');

  }).colsTransform()

  .reduceAll(ezl::corr<1>())

    .dump("", "Corr. of gender with scores\n(gender|E|L|D)")

  .run();

```

The above example prints the correlation of English, logical and domain scores

with respect to gender. We can find similarity of the above code with steps in

a spreadsheet analysis or with SQL query. We select the columns to work with

viz. gender and three scores. We filter the rows based on a column and predicate.

Next, we transform a selected column in-place and then find an aggregate property

(correlation) for all the rows.

----

## Contributing

Suggestions and feedback are welcome. Feel free to contact via mail or issues

for any query.

Some of the possible directions of improvement:

+ compile time optimization

+ use of specialized data structures in various units like reduce etc.

+ addition of more examples e.g. neural nets, simulations etc.

+ design simplifications

+ parallelism optimization

+ code reviews

+ documentation

Possible ideas for future extenstions:

+ fault tolerance

+ algorithms / functions to plot streaming and buffered data

+ domain specific algorithms 

+ MPI single-sided communications

+ Experiments to extend current programming abstraction to cover more problems like domain-decomposition etc. 

Check [internals](https://haptork.github.io/easyLambda/docs/internals) and

[blog](https://haptork.github.io/easyLambda/posts/) for design and

implementation details.

## Acknowledgments

A big thanks to cppcon, meetingc++ and other conferences and all C++ expert

speakers, committee members and compiler implementers for modernising C++ and

teaching it with so much enthusiasm. I had fun implementing this, hoping you

will have fun using it. Looking forward to learn more from the community.

I wish to thank [eicossa](https://github.com/eicossa) and Nitesh for their

(less online, more offline :P) contributions.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/haptork/easylambda

Awesome Lists containing this project

README