https://github.com/banyc/mapreduce

A master-worker MapReduce implementation in C#, written from scratch (no Hadoop). Complete; it depends on a DFS.

# MapReduce

- Distributed systems.
- Object-oriented programming.
- Educational only.

## How to run

```bash
dotnet run -p src/MapReduce.Sample
```

## How to use

- Example of master initialization: [`src/MapReduce.Sample/Program.cs`](src/MapReduce.Sample/Program.cs).
- Example of worker initialization: [`src/MapReduce.Sample/Playbook/WorkerHelper.cs`](src/MapReduce.Sample/Playbook/WorkerHelper.cs).
- Examples of custom `Map` and `Reduce` functions:
  - Word count - [`src/MapReduce.Sample/Playbook/WordCount.cs`](src/MapReduce.Sample/Playbook/WordCount.cs).
  - Inverted file index - [`src/MapReduce.Sample/Playbook/InvertedIndex.cs`](src/MapReduce.Sample/Playbook/InvertedIndex.cs).

## Principle

`map()`:

```text
part of object -> list<(key, value)>
return list<(key, value)>
```

- [Interface](src/MapReduce.Worker/Helpers/IMapping.cs).
- [Implementation (WordCount)](src/MapReduce.Sample/Playbook/WordCount.cs).
- [Implementation (InvertedIndex)](src/MapReduce.Sample/Playbook/InvertedIndex.cs).
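
For concreteness, a word-count-style `map()` in C# could look like the sketch below. It only assumes the shape shown in the pseudocode above (a chunk of text in, a list of `(key, value)` pairs out); the repo's `IMapping` interface may declare a different signature.

```csharp
// Sketch of a word-count map(): a chunk of text in, (word, 1) pairs out.
// The repo's IMapping interface may use a different signature.
using System;
using System.Collections.Generic;

public class WordCountMap
{
    public List<KeyValuePair<string, int>> Map(string chunk)
    {
        var pairs = new List<KeyValuePair<string, int>>();
        var separators = new[] { ' ', '\t', '\r', '\n', '.', ',', ';', ':' };
        foreach (var word in chunk.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Emit (word, 1) for every occurrence; summing happens in reduce().
            pairs.Add(new KeyValuePair<string, int>(word.ToLowerInvariant(), 1));
        }
        return pairs;
    }
}
```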

`combine()`:

```text
hash<key, list<value>>
foreach ((key, value) in list<(key, value)>)
{
    hash<key, list<value>>[key].Add(value)
}
return hash<key, list<value>>
```

- [Implementation](src/MapReduce.Worker/Helpers/Mapper.cs).
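
In C#, the grouping above amounts to building a dictionary from key to a list of values. A minimal sketch, independent of the repo's `Mapper` class:

```csharp
using System.Collections.Generic;

public static class WordCountCombine
{
    // Group the mapper's output by key:
    //   list<(key, value)>  ->  hash<key, list<value>>
    public static Dictionary<string, List<int>> Combine(
        IEnumerable<KeyValuePair<string, int>> pairs)
    {
        var groups = new Dictionary<string, List<int>>();
        foreach (var (key, value) in pairs)
        {
            if (!groups.TryGetValue(key, out var values))
            {
                values = new List<int>();
                groups[key] = values;
            }
            values.Add(value);   // hash[key].Add(value) from the pseudocode
        }
        return groups;
    }
}
```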

`partition()`:

```text
hash<partitionIndex, hash<key, list<value>>>
```

- [Interface](src/MapReduce.Worker/Helpers/IPartitioning.cs).
- [Implementation](src/MapReduce.Worker/Helpers/DefaultPartitioner.cs).
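
A sketch of the hashing rule described in the Partitioning section below, written as a stand-alone function; the repo's `DefaultPartitioner` may differ in details:

```csharp
using System;
using System.Collections.Generic;

public static class HashPartitioner
{
    // Split hash<key, list<value>> into numPartitions buckets, one per reduce task:
    //   hash<partitionIndex, hash<key, list<value>>>
    public static Dictionary<int, Dictionary<string, List<int>>> Partition(
        Dictionary<string, List<int>> combined, int numPartitions)
    {
        var partitions = new Dictionary<int, Dictionary<string, List<int>>>();
        foreach (var (key, values) in combined)
        {
            // Keys sharing key.GetHashCode() % numPartitions land in the same bucket.
            int index = Math.Abs(key.GetHashCode() % numPartitions);
            if (!partitions.TryGetValue(index, out var bucket))
            {
                bucket = new Dictionary<string, List<int>>();
                partitions[index] = bucket;
            }
            bucket[key] = values;
        }
        return partitions;
    }
}
```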

`reduce()`:

```text
hash<key, value>
foreach ((key, values) in hash<key, list<value>>)
{
    foreach (value in values)
    {
        hash<key, value>[key] += value
    }
}
// foreach (key, value) in other list<(key, value)>
// omitted
return hash<key, value>
```

- [Interface](src/MapReduce.Worker/Helpers/IReducing.cs).
- [Implementation (WordCount)](src/MapReduce.Sample/Playbook/WordCount.cs).
- [Implementation (InvertedIndex)](src/MapReduce.Sample/Playbook/InvertedIndex.cs).
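
For word count, `reduce()` just sums each key's values. A sketch under the same caveat as above (the repo's `IReducing` signature may differ):

```csharp
using System.Collections.Generic;

public static class WordCountReduce
{
    // Sum each key's values:
    //   hash<key, list<value>>  ->  hash<key, value>
    public static Dictionary<string, int> Reduce(Dictionary<string, List<int>> groups)
    {
        var result = new Dictionary<string, int>();
        foreach (var (key, values) in groups)
        {
            int total = 0;
            foreach (var value in values)
            {
                total += value;   // "hash<key, value>[key] += value" above
            }
            result[key] = total;
        }
        return result;
    }
}
```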

![](img/2021-03-09-16-21-13.png)

- each intermediate file is a partition.
- the `i`th reducer takes the `i`th partition from each mapper's local disk.

## Master Data Structure

- `class master`
  - `List<MapTask>`
  - `List<ReduceTask>`
  - `List`
- related data structures (see the C# sketch below):
  - `enum state { idle, in-progress, completed }`
    - idle:
      - the task is waiting to be scheduled.
      - the task is not done yet.
  - `class MapTask { state, CompletedFile, ... }`
  - `class ReduceTask { state, CompletedFile, ... }`
  - `class CompletedFile { location, size }`
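
In C# terms, the outline above corresponds roughly to the sketch below. Identifiers are adapted to valid C# names (e.g. `in-progress` becomes `InProgress`), the third `List` is omitted because its element type is not given above, and the repo's actual classes may differ.

```csharp
using System.Collections.Generic;

// Illustrative rendering of the outline above; names may not match the repo.
public enum TaskState { Idle, InProgress, Completed }

public class CompletedFile
{
    public string Location { get; set; }  // where the output file lives
    public long Size { get; set; }        // its size in bytes
}

public class MapTask
{
    public TaskState State { get; set; }
    public CompletedFile CompletedFile { get; set; }
    // ... input split, assigned worker, etc.
}

public class ReduceTask
{
    public TaskState State { get; set; }
    public CompletedFile CompletedFile { get; set; }
    // ... partition index, etc.
}

public class Master
{
    public List<MapTask> MapTasks { get; } = new List<MapTask>();
    public List<ReduceTask> ReduceTasks { get; } = new List<ReduceTask>();
}
```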

## Failure

- worker failure
  - the master pings each worker.
  - no response within a set amount of time -> the worker is considered failed.
  - the master gives the same task to a different worker (see the sketch below).
- master failure
  - exception in user code.
  - the master periodically writes its data structures to checkpoints.
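
A sketch of the worker-failure path: if a worker holding an in-progress task has not responded for longer than a timeout, the master marks the task idle again so it can be handed to a different worker. The names here are illustrative, not the repo's; the ten-second timeout comes from the Assignment section below.

```csharp
using System;
using System.Collections.Generic;

// Master-side failure check: a task whose worker has been silent for longer
// than the timeout becomes schedulable again. Names are illustrative only.
public class TrackedTask
{
    public bool InProgress { get; set; }
    public string AssignedWorker { get; set; }
    public DateTime LastHeardFrom { get; set; }
}

public static class FailureDetector
{
    public static void ReclaimStalledTasks(IEnumerable<TrackedTask> tasks, TimeSpan timeout)
    {
        var now = DateTime.UtcNow;
        foreach (var task in tasks)
        {
            if (task.InProgress && now - task.LastHeardFrom > timeout)
            {
                // Ping went unanswered long enough: treat the worker as failed
                // and put the task back in the idle pool.
                task.InProgress = false;
                task.AssignedWorker = null;
            }
        }
    }
}
```

The master would call something like `ReclaimStalledTasks(allTasks, TimeSpan.FromSeconds(10))` on each ping round.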

## Use Cases

- When a map worker completes a map task:
  1. worker ---{file names}--> master.
  1. the master saves the file names in its data structures.
- When a reduce worker completes a reduce task:
  1. the worker renames its temporary output file to the final output file.
- Task processing (worker side; see the sketch below):
  1. the workers talk to the master via RPC.
  1. a worker asks the master for a task.
  1. the worker reads the task's input from one or more files.
  1. the worker executes the task.
  1. the worker writes the task's output to one or more files.
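
The task-processing steps map onto a simple worker loop. `IMasterRpc` and `TaskInfo` below are illustrative stand-ins for the repo's actual RPC layer and task messages.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical worker loop matching the steps above.
public record TaskInfo(int TaskId, string[] InputFiles);

public interface IMasterRpc
{
    Task<TaskInfo> AskForTaskAsync();                            // null when no work is left
    Task ReportCompletedAsync(int taskId, string[] outputFiles);
}

public static class WorkerLoop
{
    public static async Task RunAsync(IMasterRpc master, Func<TaskInfo, string[], string[]> execute)
    {
        while (true)
        {
            // 1. Ask the master for a task over RPC.
            TaskInfo task = await master.AskForTaskAsync();
            if (task is null) break;

            // 2. Read the task's input from one or more files.
            string[] inputs = task.InputFiles.Select(File.ReadAllText).ToArray();

            // 3. Execute the task (map or reduce), producing one or more output files.
            string[] outputs = execute(task, inputs);

            // 4. Report the output file names back to the master.
            await master.ReportCompletedAsync(task.TaskId, outputs);
        }
    }
}
```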

## Partitioning

![](img/2021-03-12-21-49-30.png)
- each partition is a file.
- each partition has a dictionary.
- each partition might have 0, 1, or more keys.
  - all keys in a partition share the same value of `key.GetHashCode() % numPartitions`.
- `numPartitions` := number of reduce tasks.
  - the number of reduce tasks is preset in the master.
- for the `i`th reduce task, the worker reads the `i`th partition of every mapper's output (see the sketch below).
- a worker can acquire more than one task.
- additional details - .
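
How a reducer finds its inputs depends on how intermediate files are named. The `mr-{mapTaskId}-{partitionIndex}` pattern below is a common convention and purely an assumption here, used only to illustrate "the `i`th reducer reads the `i`th partition of every mapper".

```csharp
using System.IO;
using System.Linq;

public static class ReduceInput
{
    // Collect the i-th partition from every mapper's output directory.
    // The "mr-{mapTaskId}-{partitionIndex}" naming scheme is an assumption,
    // not necessarily what this repo writes.
    public static string[] FilesForPartition(string intermediateDir, int partitionIndex)
    {
        return Directory.GetFiles(intermediateDir, $"mr-*-{partitionIndex}")
                        .OrderBy(path => path)
                        .ToArray();
    }
}
```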

## Assignment

Your job is to implement a distributed MapReduce, consisting of two programs, the master and the worker. There will be just one master process, and one or more worker processes executing in parallel. In a real system the workers would run on a bunch of different machines, but for this lab you'll run them all on a single machine. The workers will talk to the master via RPC. Each worker process will ask the master for a task, read the task's input from one or more files, execute the task, and write the task's output to one or more files. The master should notice if a worker hasn't completed its task in a reasonable amount of time (for this lab, use ten seconds), and give the same task to a different worker.