https://github.com/hellothisisflo/simple-mapreduce
A Simple MapReduce using Elixir and the power of OTP
- Host: GitHub
- URL: https://github.com/hellothisisflo/simple-mapreduce
- Owner: HelloThisIsFlo
- Created: 2017-02-17T19:10:15.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-07-11T10:30:54.000Z (over 8 years ago)
- Last Synced: 2025-06-05T23:08:57.738Z (7 months ago)
- Language: Elixir
- Size: 22.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# SimpleMapreduce
A simple distributed and load-balanced MapReduce.
It is fault tolerant, but it does not guarantee that all events are parsed.
When a batch fails, the worker is restarted immediately and continues to accept new batches, but the failed batch is lost.
The same happens when a machine fails: the batch that was being parsed on it is lost.
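
The restart-and-lose-the-batch behaviour described above can be illustrated with plain OTP. This is a minimal sketch, not the library's actual implementation: `RestartSketch.Worker`, `process/1` and `await_restart/1` are hypothetical names, and a supervised `GenServer` stands in for a real pipeline worker.

```elixir
defmodule RestartSketch.Worker do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Returns {:ok, results}, or :batch_lost if the worker crashed mid-batch.
  def process(batch) do
    {:ok, GenServer.call(__MODULE__, {:process, batch})}
  catch
    :exit, _reason -> :batch_lost
  end

  # Blocks until the supervisor has registered a replacement process.
  def await_restart(old_pid) do
    case Process.whereis(__MODULE__) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(old_pid)
    end
  end

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call({:process, batch}, _from, state) do
    # div/2 raises ArithmeticError on a zero element, crashing the worker.
    {:reply, Enum.map(batch, &div(100, &1)), state}
  end
end

{:ok, _sup} = Supervisor.start_link([RestartSketch.Worker], strategy: :one_for_one)

old_pid = Process.whereis(RestartSketch.Worker)
lost = RestartSketch.Worker.process([1, 0, 2])
new_pid = RestartSketch.Worker.await_restart(old_pid)
next = RestartSketch.Worker.process([1, 2])
```

The failing batch is never retried: the supervisor only brings a fresh worker back, which then handles the next batch normally. A caller that needs every element processed would have to resubmit lost batches itself.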
## Initial setup
Implement the `SimpleMapreduce.HeavyWork` behaviour and set your module as `heavy_work_module` in the configuration file.
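
This README does not list the callbacks that `SimpleMapreduce.HeavyWork` declares, so the following is only a sketch of the general shape of a behaviour implementation. `HeavyWorkSketch` and its `do_work/1` callback are hypothetical stand-ins; check the library source for the real callback names and arities.

```elixir
defmodule HeavyWorkSketch do
  # Stand-in for SimpleMapreduce.HeavyWork. The callback name and arity
  # are assumptions made for illustration only.
  @callback do_work(element :: term) :: term
end

defmodule MyApplication.MyHeavyWorkModule do
  @behaviour HeavyWorkSketch

  # The per-element work; here it simply squares an integer.
  @impl true
  def do_work(element) when is_integer(element), do: element * element
end
```

In a real project the module would declare `@behaviour SimpleMapreduce.HeavyWork` instead, and the compiler would warn about any missing callback.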
Also provide the names of the nodes:
* `main_node` is the node where the pipeline is located. It should be the same on each machine.
It is used by the workers to locate the pipeline when connecting to it.
* `workers_node` is the default node where workers are started.
It is usually different on each machine and is the name of the current node (`node()`).

A good practice is to put the node configuration in an external file, `nodes.exs`, and import it in the main config.
That allows `nodes.exs` to be kept out of version control,
which is useful since it is different on each machine.

Finally, set the max demand for each worker (see the `GenStage.sync_subscribe` documentation for more details).
```
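# Module implementing SimpleMapreduce.HeavyWork, and the max demand per worker
config :simple_mapreduce,
  heavy_work_module: MyApplication.MyHeavyWorkModule,
  max_demand_for_worker: 20

# Node names (these can live in an external nodes.exs)
config :simple_mapreduce,
  main_node: :"main@MACHINE-NAME",
  workers_node: :"worker@MACHINE-NAME"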
```
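
The external `nodes.exs` file suggested above could be wired in roughly as follows. This is a sketch under assumptions: the `config/` directory layout and the `.gitignore` entry are not stated in this README, and `import Config` requires Elixir 1.9 or later (older projects use `use Mix.Config` instead).

```elixir
# config/config.exs (tracked in version control)
import Config

config :simple_mapreduce,
  heavy_work_module: MyApplication.MyHeavyWorkModule,
  max_demand_for_worker: 20

# Pulls in the machine-specific node names from config/nodes.exs,
# which is assumed to be listed in .gitignore. That file would contain only:
#
#     import Config
#
#     config :simple_mapreduce,
#       main_node: :"main@MACHINE-NAME",
#       workers_node: :"worker@MACHINE-NAME"
import_config "nodes.exs"
```

Because `import_config` fails when the file is missing, each machine needs its own `nodes.exs` in place before compiling.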
### Important: after changing the configuration, refetch the dependency
```
mix deps.clean simple_mapreduce
mix deps.get
```
This is a known bug, and will be addressed: [#1](https://github.com/ShockN745/Simple-MapReduce/issues/1)
## After initial setup
1. Start a pipeline by name:
   * `SimpleMapreduce.start_new_pipeline(PIPELINE_NAME)`
2. Add workers on the main machine:
   * `SimpleMapreduce.add_worker(PIPELINE_NAME, NODE_WHERE_TO_SPAWN_WORKER)`
   * If `NODE_WHERE_TO_SPAWN_WORKER` is not provided, `:workers_node` from the config is used.
   * `NODE_WHERE_TO_SPAWN_WORKER` can be `node()`.
3. Add **extra** workers on all extra machines:
   * `SimpleMapreduce.add_extra_worker(PIPELINE_NAME, NODE_WHERE_TO_SPAWN_WORKER)`
   * If `NODE_WHERE_TO_SPAWN_WORKER` is not provided, `:workers_node` from the config is used.
   * `NODE_WHERE_TO_SPAWN_WORKER` can be `node()`.
4. Process a list quickly by passing it to the pipeline:
   * `SimpleMapreduce.do_heavy_work_on_all_elements(ELEMENTS, PIPELINE_NAME, TIMEOUT)`
   * `TIMEOUT`
     * Can be `:infinity`.
     * Defaults to 3000 ms.
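
Put together, the steps above might look like the sketch below. `PipelineSketch` and its function names are hypothetical helpers, not part of the library; only the `SimpleMapreduce.*` calls come from this README. Their return values are not documented here, so the sketch does not pattern match on them. Actually running it requires the `simple_mapreduce` dependency, the configuration from the initial setup, and connected nodes.

```elixir
defmodule PipelineSketch do
  # Hypothetical wrapper around the calls listed in this README.

  # Run on the main machine: start the pipeline, add a local worker,
  # then push the elements through. The timeout defaults to 3000 ms.
  def run(pipeline_name, elements, timeout \\ 3_000) do
    SimpleMapreduce.start_new_pipeline(pipeline_name)
    SimpleMapreduce.add_worker(pipeline_name, node())
    SimpleMapreduce.do_heavy_work_on_all_elements(elements, pipeline_name, timeout)
  end

  # Run on each extra machine to contribute a worker to the same pipeline.
  # Omitting the node argument falls back to :workers_node from the config.
  def join_from_extra_machine(pipeline_name) do
    SimpleMapreduce.add_extra_worker(pipeline_name)
  end
end
```

For long-running work, passing `:infinity` as the third argument to `PipelineSketch.run/3` avoids hitting the default timeout.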