https://github.com/banyc/mapreduce
In C#. Master-Worker. From scratch. No Hadoop. Done. Depend on DFS.
- Host: GitHub
- URL: https://github.com/banyc/mapreduce
- Owner: Banyc
- License: mit
- Created: 2021-03-12T17:16:52.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-07-30T10:52:16.000Z (almost 3 years ago)
- Last Synced: 2025-04-02T19:03:36.216Z (about 2 months ago)
- Topics: distributed-systems, educational, from-scratch, mapreduce, master-slave, object-oriented-programming
- Language: C#
- Homepage:
- Size: 1.34 MB
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# MapReduce
- Distributed systems.
- Object-oriented programming.
- Educational only.

## How to run
```bash
dotnet run -p src/MapReduce.Sample
```

## How to use
- Example of master initiation: [`src/MapReduce.Sample/Program.cs`](src/MapReduce.Sample/Program.cs).
- Example of workers initiation: [`src/MapReduce.Sample/Playbook/WorkerHelper.cs`](src/MapReduce.Sample/Playbook/WorkerHelper.cs).
- Example of Custom `Map` and `Reduce` functions:
- Word count - [`src/MapReduce.Sample/Playbook/WordCount.cs`](src/MapReduce.Sample/Playbook/WordCount.cs).
  - Inverted file index - [`src/MapReduce.Sample/Playbook/InvertedIndex.cs`](src/MapReduce.Sample/Playbook/InvertedIndex.cs).
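The sample drives one master and several workers inside a single process. The constructors and start-up calls live in the files linked above and are not reproduced here; the sketch below only shows that overall shape, with `RunMasterAsync`/`RunWorkerAsync` as hypothetical placeholders for the sample's actual initiation code.

```csharp
using System.Threading.Tasks;

public static class SampleStartup
{
    public static async Task Main()
    {
        const int workerCount = 3;

        // One master task plus several worker tasks, all in the same process,
        // mirroring how the sample project drives them.
        Task master = RunMasterAsync();
        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = RunWorkerAsync(workerId: i);
        }

        await master;
        await Task.WhenAll(workers);
    }

    // Placeholders; see src/MapReduce.Sample/Program.cs and
    // src/MapReduce.Sample/Playbook/WorkerHelper.cs for the real initiation code.
    private static Task RunMasterAsync() => Task.CompletedTask;
    private static Task RunWorkerAsync(int workerId) => Task.CompletedTask;
}
```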
## Principle

`map()`:
```text
part of object -> list<(key, value)>
return list<(key, value)>
```

- [Interface](src/MapReduce.Worker/Helpers/IMapping.cs).
- [Implementation (WordCount)](src/MapReduce.Sample/Playbook/WordCount.cs).
- [Implementation (InvertedIndex)](src/MapReduce.Sample/Playbook/InvertedIndex.cs).
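To make the `map()` shape concrete, here is a minimal, self-contained word-count map sketch. It does not implement the repository's `IMapping` interface (whose exact signature is in `src/MapReduce.Worker/Helpers/IMapping.cs` and not shown here); it only illustrates turning a chunk of input into `(key, value)` pairs.

```csharp
using System;
using System.Collections.Generic;

public static class WordCountMap
{
    // "part of object" (a chunk of text) -> list<(key, value)>: one ("word", 1) pair per occurrence.
    public static List<KeyValuePair<string, int>> Map(string chunk)
    {
        var pairs = new List<KeyValuePair<string, int>>();
        var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        foreach (var word in chunk.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            pairs.Add(new KeyValuePair<string, int>(word.ToLowerInvariant(), 1));
        }
        return pairs;
    }
}
```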
`combine()`:

```text
hash<key, list<value>>
foreach ((key, value) in list<(key, value)>)
{
    hash<key, list<value>>[key].Add(value)
}
return hash<key, list<value>>
```

- [Implementation](src/MapReduce.Worker/Helpers/Mapper.cs).
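A minimal sketch of that grouping step (the repository's version lives in `Mapper.cs` and may differ; this one is standalone and assumes string keys and int values, as in word count):

```csharp
using System.Collections.Generic;

public static class Combiner
{
    // list<(key, value)> -> hash<key, list<value>>
    public static Dictionary<string, List<int>> Combine(IEnumerable<KeyValuePair<string, int>> pairs)
    {
        var grouped = new Dictionary<string, List<int>>();
        foreach (var (key, value) in pairs)
        {
            if (!grouped.TryGetValue(key, out var values))
            {
                values = new List<int>();
                grouped[key] = values;
            }
            values.Add(value);
        }
        return grouped;
    }
}
```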
`partition()`:
```text
hash<int, hash<key, list<value>>>
```

- [Interface](src/MapReduce.Worker/Helpers/IPartitioning.cs).
- [Implementation](src/MapReduce.Worker/Helpers/DefaultPartitioner.cs).
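The Partitioning section below notes that the keys of a partition share `key.GetHashCode() % numPartitions`; here is a standalone sketch of that bucketing (not the repository's `DefaultPartitioner`, which may differ in detail):

```csharp
using System.Collections.Generic;

public static class HashPartitioner
{
    // hash<key, list<value>> -> hash<partitionIndex, hash<key, list<value>>>
    public static Dictionary<int, Dictionary<string, List<int>>> Partition(
        Dictionary<string, List<int>> grouped, int numPartitions)
    {
        var partitions = new Dictionary<int, Dictionary<string, List<int>>>();
        foreach (var (key, values) in grouped)
        {
            // Mask off the sign bit so the index always falls in [0, numPartitions).
            int index = (key.GetHashCode() & 0x7FFFFFFF) % numPartitions;
            if (!partitions.TryGetValue(index, out var bucket))
            {
                bucket = new Dictionary<string, List<int>>();
                partitions[index] = bucket;
            }
            bucket[key] = values;
        }
        return partitions;
    }
}
```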
`reduce()`:

```text
hash<key, value>
foreach ((key, values) in hash<key, list<value>>)
{
    foreach (value in values)
    {
        hash[key] += value
    }
}
// foreach (key, value) in other list<(key, value)>
// omitted
return hash<key, value>
```

- [Interface](src/MapReduce.Worker/Helpers/IReducing.cs).
- [Implementation (WordCount)](src/MapReduce.Sample/Playbook/WordCount.cs).
- [Implementation (InvertedIndex)](src/MapReduce.Sample/Playbook/InvertedIndex.cs).
- each intermediate file is a partition.
- `i`th reducer takes every `i`th partition in each mapper's local disk.
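Tying the `reduce()` pseudocode above to code, a word-count reduce simply sums the grouped values per key (a standalone sketch, not the repository's `IReducing` implementation):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordCountReduce
{
    // hash<key, list<value>> -> hash<key, value>: collapse each value list by summing.
    public static Dictionary<string, int> Reduce(Dictionary<string, List<int>> grouped)
    {
        var result = new Dictionary<string, int>();
        foreach (var (word, counts) in grouped)
        {
            result[word] = counts.Sum();
        }
        return result;
    }
}
```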
## Master Data Structure

- `class master`
  - `List<MapTask>`
  - `List<ReduceTask>`
  - `List<CompletedFile>`
- relative data structures
  - `enum state { idle, in-progress, completed }`
    - idle:
      - task waiting to be scheduled.
      - the task is not done yet.
  - `class MapTask { state, CompletedFile, ... }`
  - `class ReduceTask { state, CompletedFile, ... }`
  - `class CompletedFile { location, size }`
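A C# sketch of the structures outlined above. The member names are assumptions drawn only from this outline, not from the repository's actual classes:

```csharp
using System.Collections.Generic;

public enum TaskState { Idle, InProgress, Completed }

public class CompletedFile
{
    public string Location { get; set; } = "";
    public long Size { get; set; }
}

public class MapTask
{
    public TaskState State { get; set; } = TaskState.Idle;
    public List<CompletedFile> CompletedFiles { get; } = new List<CompletedFile>();
    // ... input split, assigned worker, start time, etc.
}

public class ReduceTask
{
    public TaskState State { get; set; } = TaskState.Idle;
    public List<CompletedFile> CompletedFiles { get; } = new List<CompletedFile>();
}

public class Master
{
    public List<MapTask> MapTasks { get; } = new List<MapTask>();
    public List<ReduceTask> ReduceTasks { get; } = new List<ReduceTask>();
    public List<CompletedFile> CompletedFiles { get; } = new List<CompletedFile>();
}
```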
## Failure

- worker failure
- master pings worker.
- no response in amount of time -> worker failed.
- master failure
- exception on user code.
- master writes data structures in checkpoints periodically.
- master gives the same task to a different worker.
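One way to realize the periodic checkpointing mentioned above: serialize the master's task lists to a temp file and swap it into place, so a restarted master can reload them. This reuses the `Master` sketch from the previous section and is only an assumption about how the repository might do it:

```csharp
using System;
using System.IO;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class MasterCheckpoint
{
    // Periodically snapshot the master's data structures to disk.
    public static async Task RunAsync(Master master, string checkpointPath, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            string json = JsonSerializer.Serialize(master);
            string tempPath = checkpointPath + ".tmp";
            await File.WriteAllTextAsync(tempPath, json);
            File.Move(tempPath, checkpointPath, overwrite: true); // never leave a half-written checkpoint
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}
```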
## Use Cases

- When map worker completes a map task
  1. worker ---{file names}--> master.
  1. master saves file names to data structure.
- When reduce worker completes a reduce task
  1. rename temp output file to final output file.
- Task processing
  - worker
    1. The workers talk to the master via RPC.
    1. worker asks the master for a task,
    1. worker reads the task's input from one or more files,
    1. worker executes the task,
    1. worker writes the task's output to one or more files.
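A sketch of the worker side of this flow, including the temp-file rename on reduce completion. The `IMasterRpc` interface and `ReduceTaskInfo` record are hypothetical stand-ins; the repository's real worker/master contract lives in `MapReduce.Worker` and is not reproduced here.

```csharp
using System.IO;
using System.Threading.Tasks;

// Hypothetical RPC surface between worker and master.
public interface IMasterRpc
{
    Task<ReduceTaskInfo?> AskForReduceTaskAsync();
    Task ReportReduceDoneAsync(int taskId, string outputPath);
}

public record ReduceTaskInfo(int TaskId, string[] IntermediateFiles, string FinalPath);

public static class ReduceWorkerLoop
{
    public static async Task RunAsync(IMasterRpc master)
    {
        // Keep asking the master for reduce tasks until none remain.
        while (await master.AskForReduceTaskAsync() is { } task)
        {
            // 1. read the task's input from the intermediate files (elided)
            // 2. run the user's reduce() over it (elided)
            string tempPath = task.FinalPath + ".tmp";
            await File.WriteAllTextAsync(tempPath, "reduced output goes here");

            // 3. rename the temp output file to the final output file so readers
            //    never observe a partially written result.
            File.Move(tempPath, task.FinalPath, overwrite: true);

            await master.ReportReduceDoneAsync(task.TaskId, task.FinalPath);
        }
    }
}
```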
## Partitioning
- each partition is a file.
- each partition has a dictionary.
- each partition might have 0, 1, or more keys.
- those keys have the same value of `key.GetHashCode() % numPartitions`.
- `numPartitions` := number of reduce tasks.
- number of reduce tasks is preset in master.
- at each reduce task, the worker should read the `i`th partition of outputs of all mappers.
- worker can acquire more than one task.
- additional details - .
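The `i`th-partition rule above is easiest to see with a concrete file layout. The `mr-<mapTaskId>-<partition>` naming below is a common convention for this kind of lab and is only an assumption; the repository may name its intermediate files differently.

```csharp
using System.Collections.Generic;

public static class IntermediateFiles
{
    // Map task m writes its i-th partition to "mr-m-i" (assumed naming).
    public static string NameFor(int mapTaskId, int partition) => $"mr-{mapTaskId}-{partition}";

    // Reducer i reads the i-th partition produced by every map task.
    public static IEnumerable<string> FilesForReducer(int reducerIndex, int numMapTasks)
    {
        for (int mapTaskId = 0; mapTaskId < numMapTasks; mapTaskId++)
        {
            yield return NameFor(mapTaskId, reducerIndex);
        }
    }
}
```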
## Assignment

Your job is to implement a distributed MapReduce, consisting of two programs, the master and the worker. There will be just one master process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you'll run them all on a single machine. The workers will talk to the master via RPC. Each worker process will ask the master for a task, read the task's input from one or more files, execute the task, and write the task's output to one or more files. The master should notice if a worker hasn't completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.
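A minimal sketch of that ten-second rule, reusing the `TaskState` enum from the data-structure sketch above: the master records when a task was handed out and marks it idle again once the deadline passes, so the scheduler can give it to a different worker. The field names here are assumptions, not the repository's.

```csharp
using System;
using System.Collections.Generic;

public class TrackedTask
{
    public TaskState State { get; set; } = TaskState.Idle;
    public DateTime StartedAt { get; set; }
}

public static class TaskReassignment
{
    // Any task still in progress after the timeout is handed back to the idle pool.
    public static void ReclaimStragglers(IEnumerable<TrackedTask> tasks, DateTime now)
    {
        var timeout = TimeSpan.FromSeconds(10);
        foreach (var task in tasks)
        {
            if (task.State == TaskState.InProgress && now - task.StartedAt > timeout)
            {
                task.State = TaskState.Idle;
            }
        }
    }
}
```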