https://github.com/naupio/pical
(Work in Progress) PiCal is a general distributed computation system, written in Erlang and based on the DAG model. This project is inspired by Douban's DPark and Apache Spark.
- Host: GitHub
- URL: https://github.com/naupio/pical
- Owner: Naupio
- License: MIT
- Created: 2018-05-09T02:48:41.000Z (almost 8 years ago)
- Default Branch: main
- Last Pushed: 2021-12-22T07:06:55.000Z (about 4 years ago)
- Last Synced: 2025-04-08T00:41:38.582Z (10 months ago)
- Topics: big-data, bigdata, dag, data, distributed, distributed-computing, distributed-systems, erlang, erlang-otp, flink, spark
- Language: Erlang
- Homepage:
- Size: 35.2 KB
- Stars: 7
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PiCal
(Work in Progress) **PiCal** is a general distributed computation system, written in **Elixir** and based on the DAG model. This project is inspired by [Douban's DPark](https://github.com/douban/dpark) and [Apache Spark](https://github.com/apache/spark).
# LICENSE
- The [MIT License](./LICENSE)
- Copyright (c) 2016-2022 [Naupio Z.Y Huang](https://github.com/Naupio)
# WARNING
This project is **not finished** (yet).
---
# **DAG Engine TODO LIST**
# RDD
* getPartitions
* compute
* Dependency
* Partitioner for K-V RDDs (Optional)
* preferredLocations (Optional)
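These five properties mirror Spark's RDD abstraction and map naturally onto an Erlang behaviour. Below is a minimal sketch; the module and callback names are hypothetical, not PiCal's actual interface:

```erlang
%% A minimal sketch of the RDD contract as an Erlang behaviour.
%% Module and callback names are hypothetical, not PiCal's actual API.
-module(pical_rdd).

-type split() :: non_neg_integer().
-export_type([split/0]).

-callback get_partitions(Rdd :: term()) -> [split()].
-callback compute(Rdd :: term(), Split :: split()) -> [term()].
-callback dependencies(Rdd :: term()) -> [term()].
%% Optional, mirroring Spark: a partitioner for K-V RDDs and locality hints.
-callback partitioner(Rdd :: term()) -> {ok, fun((term()) -> split())} | none.
-callback preferred_locations(Rdd :: term(), Split :: split()) -> [node()].
-optional_callbacks([partitioner/1, preferred_locations/2]).
```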
# BaseRDD
```
ParallelCollectionRDD
MappedRDD
FlatMappedRDD
MapPartitionsRDD
MappedValuesRDD
FlatMappedValuesRDD
FilteredRDD
ShuffledRDD
TextFileRDD
OutputTextFileRDD
UnionRDD
CoGroupedRDD
CartesianRDD
CoalescedRDD
SampleRDD
CheckpointRDD
```
**`PipedRDD`**
# DataSource
* parallelize :> ParallelCollectionRDD
* textFile :> TextFileRDD
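For instance, parallelize can slice a list into evenly sized partitions using Spark's slicing rule. A hedged sketch; `pical_datasource` and its API are hypothetical:

```erlang
%% A minimal sketch of parallelize: slice a list into NumSlices partitions,
%% the seed data for a ParallelCollectionRDD. Hypothetical helper, not
%% PiCal's actual API.
-module(pical_datasource).
-export([parallelize/2]).

parallelize(List, NumSlices) when NumSlices > 0 ->
    Len = length(List),
    %% Slice I holds elements [I*Len/N, (I+1)*Len/N), as in Spark's slicing.
    [begin
         Start = (I * Len) div NumSlices,
         End = ((I + 1) * Len) div NumSlices,
         lists:sublist(List, Start + 1, End - Start)
     end || I <- lists:seq(0, NumSlices - 1)].
```

For example, `parallelize([1,2,3,4,5], 2)` yields `[[1,2], [3,4,5]]`.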
# Transformation
## simpleTransformation
* map(func) :> MappedRDD
  * compute:> iterator(split).map(f)
* filter(func) :> FilteredRDD
  * compute:> iterator(split).filter(f)
* flatMap(func) :> FlatMappedRDD
  * compute:> iterator(split).flatMap(f)
* mapPartitions(func) :> MapPartitionsRDD
  * compute:> f(iterator(split))
* mapPartitionsWithIndex(func) :> MapPartitionsRDD
  * compute:> f(split.index, iterator(split))
* sample(withReplacement, fraction, seed) :> PartitionwiseSampledRDD
  * compute:> PoissonSampler.sample(iterator(split)) or BernoulliSampler.sample(iterator(split))
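One way to realize the compute rules above is to make each transformation a lazy wrapper around its parent RDD and walk the lineage only when a split is requested. The sketch below assumes a tagged-tuple RDD representation (hypothetical, not PiCal's actual one):

```erlang
%% A minimal sketch of lazy narrow transformations: each call wraps the
%% parent RDD, and compute/2 walks the lineage chain when a split is
%% materialized. The tagged-tuple representation is hypothetical.
-module(pical_transform).
-export([map/2, filter/2, flat_map/2, compute/2]).

map(Rdd, F)      -> {mapped_rdd, Rdd, F}.
filter(Rdd, F)   -> {filtered_rdd, Rdd, F}.
flat_map(Rdd, F) -> {flat_mapped_rdd, Rdd, F}.

%% compute(Rdd, Split) materializes one partition of the RDD.
compute({parallel_collection_rdd, Slices}, Split) ->
    lists:nth(Split + 1, Slices);
compute({mapped_rdd, Parent, F}, Split) ->
    [F(X) || X <- compute(Parent, Split)];               % iterator(split).map(f)
compute({filtered_rdd, Parent, F}, Split) ->
    [X || X <- compute(Parent, Split), F(X)];            % iterator(split).filter(f)
compute({flat_mapped_rdd, Parent, F}, Split) ->
    lists:append([F(X) || X <- compute(Parent, Split)]). % iterator(split).flatMap(f)
```

No work happens at transformation time; compute/2 only runs when an action finally asks for a partition.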
## complexTransformation
* union(otherDataset) :> (RDD a, RDD b) => UnionRDD
* groupByKey([numTasks]) :> RDD a => ShuffledRDD => MapPartitionsRDD
* reduceByKey(func, [numTasks]) :> RDD a => MapPartitionsRDD => ShuffledRDD => MapPartitionsRDD
* distinct([numTasks]) :> RDD a => MappedRDD => MapPartitionsRDD => ShuffledRDD => MapPartitionsRDD => MappedRDD
* cogroup(otherDataset, [numTasks]) :> (RDD a, RDD b) => CoGroupedRDD => MappedValuesRDD
* intersection(otherDataset) :> (RDD a, RDD b) => (MappedRDD a, MappedRDD b) => CoGroupedRDD => MappedValuesRDD => FilteredRDD => MappedRDD
* join(otherDataset, [numTasks]) :> (RDD a, RDD b) => CoGroupedRDD => MappedValuesRDD => FlatMappedValuesRDD
* sortByKey([ascending], [numTasks]) :> RDD a => ShuffledRDD => MapPartitionsRDD
* cartesian(otherDataset) :> (RDD a, RDD b) => CartesianRDD
* coalesce(numPartitions,shuffle=false) :> RDD a => CoalescedRDD
* repartition(numPartitions) == coalesce(numPartitions,shuffle=true) :> RDD a => MapPartitionsRDD => ShuffledRDD => CoalescedRDD => MappedRDD
* combineByKey() :> aggregate and compute()
```
combineByKey(createCombiner: V => C,
mergeValue: (C, V) => C,
mergeCombiners: (C, C) => C,
partitioner: Partitioner,
mapSideCombine: Boolean = true,
             serializer: Serializer = null): RDD[(K, C)]
```
* **pipe(command, [envVars]) :> PipedRDD**
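The combineByKey semantics above can be sketched in Erlang as a two-phase fold: map-side combine inside each partition, then merge the per-partition combiner maps after the shuffle. The module and helpers are hypothetical; the partitioner, serializer, and the shuffle itself are omitted:

```erlang
%% A minimal sketch of combineByKey semantics: map-side combine inside
%% each partition, then merge the per-partition combiner maps (the step
%% a reducer performs after the shuffle). Hypothetical module.
-module(pical_combine).
-export([combine_by_key/4]).

combine_by_key(Partitions, CreateCombiner, MergeValue, MergeCombiners) ->
    Locals = [local_combine(P, CreateCombiner, MergeValue) || P <- Partitions],
    lists:foldl(fun(M, Acc) -> merge_maps(M, Acc, MergeCombiners) end,
                #{}, Locals).

%% mapSideCombine: fold one partition's records into per-key combiners.
local_combine(Records, CreateCombiner, MergeValue) ->
    lists:foldl(
      fun({K, V}, Acc) ->
              case Acc of
                  #{K := C} -> Acc#{K := MergeValue(C, V)};
                  _         -> Acc#{K => CreateCombiner(V)}
              end
      end, #{}, Records).

%% mergeCombiners: combine two combiner maps key by key.
merge_maps(M, Acc, MergeCombiners) ->
    maps:fold(
      fun(K, C, A) ->
              case A of
                  #{K := C0} -> A#{K := MergeCombiners(C0, C)};
                  _          -> A#{K => C}
              end
      end, Acc, M).
```

reduceByKey(func) then falls out as `combine_by_key(Partitions, fun(V) -> V end, func, func)`.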
# Action
* reduce(func) :> (record1, record2) => result, (result, record i) => result
  * compute(results) :> (result1, result2) => result, (result, result i) => result
* collect() :> Array[records] => result
  * compute(results) :> Array[result]
* count() :> count(records) => result
  * compute(results) :> sum(result)
* foreach(f) :> f(records) => result
  * compute(results) :> Array[result]
* take(n) :> record (i <= n) => result
  * compute(results) :> Array[result]
* first() :> record 1 => result
  * compute(results) :> Array[result]
* takeSample() :> selected records => result
  * compute(results) :> Array[result]
* takeOrdered(n, [ordering]) :> TopN(records) => result
  * compute(results) :> TopN(results)
* saveAsFile(path) :> records => write(records)
  * compute(results) :> null
* countByKey() :> (K, V) => Map(K, count(K))
  * compute(results) :> (Map, Map) => Map(K, count(K))
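Each action above follows the same shape: a per-partition computation followed by a driver-side merge of the partial results. A hedged sketch, reusing the hypothetical `pical_transform:compute/2` from earlier:

```erlang
%% A minimal sketch of actions as "compute every split, then fold the
%% partial results on the driver". Reuses the hypothetical
%% pical_transform:compute/2 above; reduce's F must be associative
%% and commutative, as in Spark.
-module(pical_action).
-export([count/2, collect/2, reduce/3]).

partition_results(Rdd, NumSplits) ->
    [pical_transform:compute(Rdd, S) || S <- lists:seq(0, NumSplits - 1)].

count(Rdd, NumSplits) ->      % count(records) per split, then sum(result)
    lists:sum([length(R) || R <- partition_results(Rdd, NumSplits)]).

collect(Rdd, NumSplits) ->    % Array[records] => result, concatenated
    lists:append(partition_results(Rdd, NumSplits)).

reduce(Rdd, F, NumSplits) ->  % reduce within each split, then across splits
    Partials = [lists:foldl(F, hd(R), tl(R))
                || R <- partition_results(Rdd, NumSplits), R =/= []],
    lists:foldl(F, hd(Partials), tl(Partials)).
```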
---
# Partitioner
* HashPartitioner
* RangePartitioner
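Both partitioners reduce to a fun from key to reducer index. A minimal sketch (hypothetical module; a real RangePartitioner would derive its bounds by sampling the RDD):

```erlang
%% A minimal sketch of HashPartitioner and RangePartitioner as funs from
%% key to partition index. Hypothetical module, not PiCal's actual API.
-module(pical_partitioner).
-export([hash_partitioner/1, range_partitioner/1]).

hash_partitioner(NumPartitions) ->
    fun(Key) -> erlang:phash2(Key, NumPartitions) end.

%% Bounds = sorted upper bounds for partitions 0..N-2; keys above the
%% last bound fall into partition N-1.
range_partitioner(Bounds) ->
    fun(Key) -> count_leq(Key, Bounds, 0) end.

count_leq(_Key, [], I) -> I;
count_leq(Key, [B | _], I) when Key =< B -> I;
count_leq(Key, [_ | Rest], I) -> count_leq(Key, Rest, I + 1).
```

Hash partitioning balances keys uniformly; range partitioning keeps keys ordered across partitions, which is what sortByKey needs.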
# Aggregator
* createCombiner
* mergeValue
* mergeCombiners
---
# Dependency
## NarrowDependency
* OneToOneDependency (1:1)
* RangeDependency
* NarrowDependency (N:1)
## WideDependency
* ShuffleDependency (M:N)
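A dependency can be sketched as a record that answers "which parent partitions feed child partition I?": narrow dependencies return a fixed, small set, while a ShuffleDependency returns all of them, which is exactly what forces a stage boundary. The representation below is hypothetical:

```erlang
%% A minimal sketch of dependencies as "child split -> parent splits".
%% Record shapes are hypothetical, loosely following Spark's.
-module(pical_dependency).
-export([parent_splits/2]).

-record(one_to_one_dep, {rdd}).                 % 1:1, e.g. MappedRDD
-record(range_dep, {rdd, in_start, out_start}). % e.g. UnionRDD
-record(shuffle_dep, {rdd, partitioner}).       % M:N, cuts a stage

parent_splits(#one_to_one_dep{}, SplitId) ->
    [SplitId];
parent_splits(#range_dep{in_start = In, out_start = Out}, SplitId) ->
    [SplitId - Out + In];
parent_splits(#shuffle_dep{}, _SplitId) ->
    all.                                        % every parent partition
```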
---
# Scheduler
## DAGScheduler
* one stage per ShuffleDependency (the DAG is cut at shuffle boundaries)
## TaskScheduler
* one task per partition of the final RDD
## Job
* runJob(rdd, processPartition, resultHandler)
* runJob(rdd, cleanedFunc, partitions, allowLocal, resultHandler)
* submitJob(rdd, func, partitions, allowLocal, resultHandler)
* handleJobSubmitted()
## Stage
* a stage with no parent stages computes immediately
* a stage with parent stages waits for its parents to finish
* newStage()
* submitStage(finalStage)
* submitWaitingStages()
**ShuffleMapStage**
**ResultStage**
## Task
* ShuffleMapTask
* ResultTask
* TaskSet
* submitTasks(taskSet)
* LaunchTask(new SerializableBuffer(serializedTask))
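The stage-splitting rule ("one stage per ShuffleDependency") can be sketched as a walk over the lineage that pipelines narrow dependencies into the current stage and starts a parent stage at each shuffle dependency. The `{Name, Deps}` RDD representation below is hypothetical:

```erlang
%% A minimal sketch of DAGScheduler stage splitting. An RDD is the
%% hypothetical tuple {Name, Deps}, each Dep being {narrow, ParentRdd}
%% or {shuffle, ParentRdd}. A stage is {RddsInStage, ParentStages}.
-module(pical_dag).
-export([build_stage/1]).

build_stage({Name, Deps}) ->
    lists:foldl(
      fun({narrow, Parent}, {Rdds, Parents}) ->
              %% narrow dependency: pipeline the parent into this stage
              {PRdds, PParents} = build_stage(Parent),
              {Rdds ++ PRdds, Parents ++ PParents};
         ({shuffle, Parent}, {Rdds, Parents}) ->
              %% shuffle dependency: the parent becomes its own stage
              {Rdds, Parents ++ [build_stage(Parent)]}
      end,
      {[Name], []}, Deps).
```

For a lineage where `d` depends on `c` through a shuffle, `c` on `b` narrowly, and `d` on `a` narrowly, `build_stage` returns `{[d, a], [{[c, b], []}]}`: a two-stage DAG in which the map stage `{[c, b], []}` must finish first (submitStage recurses into parents; submitWaitingStages runs the rest).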
---
# Shuffle
## Shuffle write
* ShuffleBlockFile/FileSegment :> record => partition => persist in bucket
* FileConsolidation :> cores * R shuffle files (instead of M * R)
## Shuffle read
* fetch and combine (aggregate in a HashMap)
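A minimal sketch of the hash-based write/read path (hypothetical helpers; real buckets would be persisted as ShuffleBlockFiles/FileSegments rather than kept in memory, and record order within a bucket is not preserved):

```erlang
%% A minimal sketch of hash-based shuffle: the map side splits one
%% partition's records into R buckets by key; the reduce side fetches
%% its bucket from every map output and aggregates in a map.
-module(pical_shuffle).
-export([shuffle_write/2, shuffle_read/3]).

%% Map side: bucket I holds records whose key hashes to reducer I.
shuffle_write(Records, NumReducers) ->
    Empty = maps:from_list([{I, []} || I <- lists:seq(0, NumReducers - 1)]),
    lists:foldl(
      fun({K, _V} = Rec, Buckets) ->
              I = erlang:phash2(K, NumReducers),
              maps:update_with(I, fun(B) -> [Rec | B] end, Buckets)
      end, Empty, Records).

%% Reduce side: fetch bucket ReducerId from each map output, then combine.
shuffle_read(MapOutputs, ReducerId, Combine) ->
    Fetched = lists:append([maps:get(ReducerId, M) || M <- MapOutputs]),
    lists:foldl(
      fun({K, V}, Acc) ->
              maps:update_with(K, fun(Acc0) -> Combine(Acc0, V) end, V, Acc)
      end, #{}, Fetched).
```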
---
# RTS (Runtime System)
* masterNode
* workerNode
* driverNode
* executorBackend
* executorRunner
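Since this is Erlang territory, the master/worker plumbing can lean directly on distributed Erlang. A hedged sketch of worker registration with node monitoring; the protocol and names are hypothetical:

```erlang
%% A minimal sketch of worker registration in a master/worker RTS using
%% plain distributed Erlang. Hypothetical protocol, not PiCal's actual one.
-module(pical_rts).
-export([master/0, register_worker/1]).

master() ->
    register(pical_master, self()),
    master_loop(#{}).

master_loop(Workers) ->
    receive
        {register, WorkerNode, Pid} ->
            erlang:monitor_node(WorkerNode, true),   % notice worker crashes
            master_loop(Workers#{WorkerNode => Pid});
        {nodedown, WorkerNode} ->
            master_loop(maps:remove(WorkerNode, Workers))
    end.

register_worker(MasterNode) ->
    {pical_master, MasterNode} ! {register, node(), self()}.
```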
---
# Persist
* Cache
* Checkpoint
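Cache can be sketched as memoizing computed partitions, so lineage recomputation stops at a cached RDD; checkpointing would additionally persist the records and truncate the lineage. A hypothetical ETS-backed helper:

```erlang
%% A minimal sketch of a partition cache: memoize computed splits in ETS
%% so recomputation along the lineage stops at cached RDDs.
-module(pical_cache).
-export([init/0, get_or_compute/3]).

init() ->
    ets:new(pical_cache, [named_table, public, set]).

get_or_compute(RddId, Split, ComputeFun) ->
    case ets:lookup(pical_cache, {RddId, Split}) of
        [{_, Records}] ->
            Records;                                   % cache hit
        [] ->
            Records = ComputeFun(),
            ets:insert(pical_cache, {{RddId, Split}, Records}),
            Records
    end.
```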
---
# Accumulator
* value
* list
* set
* dict
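An accumulator fits naturally as a small Erlang process that accepts add-only updates from tasks; the list/set/dict variants would just swap the merge operation. The API below is hypothetical:

```erlang
%% A minimal sketch of a value accumulator as a process: tasks send
%% add-only updates, the driver reads the total.
-module(pical_accumulator).
-export([new/1, add/2, value/1]).

new(Init) ->
    spawn(fun() -> loop(Init) end).

loop(Acc) ->
    receive
        {add, N} ->
            loop(Acc + N);
        {value, From} ->
            From ! {accumulator_value, Acc},
            loop(Acc)
    end.

add(Pid, N) ->
    Pid ! {add, N},
    ok.

value(Pid) ->
    Pid ! {value, self()},
    receive {accumulator_value, V} -> V end.
```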
---
# Broadcast
* BroadcastManager
* P2PBroadcastManager
---