Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/utdemir/distributed-dataset
A distributed data processing framework in Haskell.
- Host: GitHub
- URL: https://github.com/utdemir/distributed-dataset
- Owner: utdemir
- License: BSD-3-Clause
- Created: 2018-03-11T06:19:04.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-06-13T05:28:09.000Z (over 4 years ago)
- Last Synced: 2024-05-02T05:10:21.627Z (6 months ago)
- Topics: aws-lambda, data-processing, distributed, haskell, spark
- Language: Haskell
- Homepage:
- Size: 875 KB
- Stars: 113
- Watchers: 13
- Forks: 5
- Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# distributed-dataset
[![CI Status](https://github.com/utdemir/distributed-dataset/workflows/ci/badge.svg)](https://github.com/utdemir/distributed-dataset/actions)
A distributed data processing framework in pure Haskell. Inspired by [Apache Spark](https://spark.apache.org/).
* **An example:** [/examples/gh/Main.hs](/examples/gh/Main.hs)
* **API documentation:**
* **Introduction blogpost:**

## Packages
### distributed-dataset
This package provides a `Dataset` type which lets you express and execute
transformations on a distributed multiset. Its API is highly inspired
by Apache Spark.

It uses pluggable `Backend`s for spawning executors and `ShuffleStore`s
for exchanging information. See 'distributed-dataset-aws' for an
implementation using AWS Lambda and S3.

It also exposes a more primitive `Control.Distributed.Fork` module which
lets you run `IO` actions remotely. It is especially useful when your task
is [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel).
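For a rough intuition of what a `Dataset` is, here is a toy, single-process sketch of a partitioned multiset with map/filter/collect operations. The names and types below are illustrative only and are **not** the `distributed-dataset` API; in the real library each partition lives on a remote executor and, roughly speaking, the transformation functions are shipped there as serialisable closures via a `Backend`.

```haskell
-- Toy, single-process illustration of a partitioned "distributed multiset".
-- Not the distributed-dataset API; names here are made up for illustration.
module Main where

-- A dataset is modelled here as a plain list of partitions.
newtype Dataset a = Dataset {partitions :: [[a]]}

-- Apply a function to every element, partition by partition.
dMap :: (a -> b) -> Dataset a -> Dataset b
dMap f (Dataset ps) = Dataset (map (map f) ps)

-- Keep only the elements satisfying a predicate.
dFilter :: (a -> Bool) -> Dataset a -> Dataset a
dFilter p (Dataset ps) = Dataset (map (filter p) ps)

-- "Collect" gathers all partitions back to the driver.
dCollect :: Dataset a -> [a]
dCollect = concat . partitions

main :: IO ()
main = do
  let ds = Dataset [[1 .. 10], [11 .. 20]] :: Dataset Int
  print (dCollect (dMap (* 2) (dFilter even ds)))
```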
### distributed-dataset-aws

This package provides a backend for 'distributed-dataset' using AWS
services. Currently it supports running functions on AWS Lambda and
using an S3 bucket as a shuffle store.
### distributed-dataset-opendatasets

Provides `Dataset`s reading from public open datasets. Currently it can
fetch GitHub event data from [GH Archive](https://www.gharchive.org).
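As background on the GH Archive source: GH Archive publishes one gzipped, newline-delimited JSON file per hour. The sketch below does not use `distributed-dataset-opendatasets` at all; it only shows, with an assumed package choice (`http-conduit`, `zlib`, `aeson`), what fetching and decoding one such hour of events could look like. The real package wraps this kind of work up behind a `Dataset`.

```haskell
-- Standalone sketch (not the distributed-dataset-opendatasets API):
-- download one hour of GH Archive data and count the decoded events.
module Main where

import qualified Codec.Compression.GZip as GZip
import Data.Aeson (Value, decode)
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Maybe (mapMaybe)
import Network.HTTP.Simple (getResponseBody, httpLBS, parseRequest)

main :: IO ()
main = do
  -- One hour of GH Archive events: a gzipped, newline-delimited JSON file.
  req <- parseRequest "https://data.gharchive.org/2015-01-01-15.json.gz"
  body <- getResponseBody <$> httpLBS req
  let decodeLine = decode :: BL.ByteString -> Maybe Value
      events = mapMaybe decodeLine (BL.lines (GZip.decompress body))
  putStrLn ("Parsed " ++ show (length events) ++ " events")
```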
## Running the example
* Clone the repository:

  ```sh
  $ git clone https://github.com/utdemir/distributed-dataset
  $ cd distributed-dataset
  ```

* Make sure that you have AWS credentials set up. The easiest way is
  to install the [AWS command line interface](https://aws.amazon.com/cli/)
  and run:

  ```sh
  $ aws configure
  ```

* Create an S3 bucket to put the deployment artifact in. You can use
  the console or the CLI:

  ```sh
  $ aws s3api create-bucket --bucket my-s3-bucket
  ```

* Build and run the example:
  * If you use Nix on Linux:
    * (Recommended) Use my binary cache on Cachix to reduce compilation times:

      ```sh
      nix-env -i cachix # or your preferred installation method
      cachix use utdemir
      ```

    * Then:

      ```sh
      $ nix run -f ./default.nix example-gh -c example-gh my-s3-bucket
      ```

  * If you use stack (requires Docker, works on Linux and macOS):

    ```sh
    $ stack run --docker-mount $HOME/.aws/ --docker-env HOME=$HOME example-gh my-s3-bucket
    ```

## Stability
Experimental. Expect lots of missing features, bugs,
instability and API changes. You will probably need to
modify the source if you want to do anything serious. See
[issues](https://github.com/utdemir/distributed-dataset/issues).

## Contributing
I am open to contributions; any issue, PR or opinion is more than welcome.
* In order to develop `distributed-dataset`, you can use:
  * On Linux: `Nix`, `cabal-install`, or `stack`.
  * On macOS: `stack` with `docker`.
* Use [ormolu](https://github.com/tweag/ormolu) to format source code.

### Nix
* You can use [my binary cache on Cachix](https://utdemir.cachix.org/)
  so that you don't have to recompile half of Hackage.
* `nix-shell` will drop you into a shell with `ormolu`, `cabal-install`, and
  `steeloverseer`, along with all required Haskell and system dependencies.
  You can use `cabal new-*` commands there.
* The easiest way to get a development environment is to run `sos` at the
  top-level directory inside a nix-shell.

### Stack
* Make sure that you have `Docker` installed.
* Use `stack` as usual; it will automatically use a Docker image.
* Run `./make.sh stack-build` before you send a PR to test different resolvers.

## Related Work
### Papers
* [Towards Haskell in the Cloud](https://www.microsoft.com/en-us/research/publication/towards-haskell-cloud/) by Jeff Epstein, Andrew P. Black, Simon L. Peyton Jones
* [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf) by Matei Zaharia et al.

### Projects
* [Apache Spark](https://spark.apache.org/).
* [Sparkle](https://github.com/tweag/sparkle): Run Haskell on top of Apache Spark.
* [HSpark](https://github.com/yogeshsajanikar/hspark): Another attempt at porting Apache Spark to Haskell.