https://github.com/platformlab/mappy
Demo re-implementation of the Hadoop MapReduce scheduler in Python
- Host: GitHub
- URL: https://github.com/platformlab/mappy
- Owner: PlatformLab
- Created: 2015-04-30T23:38:43.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2016-03-01T19:58:24.000Z (almost 10 years ago)
- Last Synced: 2023-02-28T11:02:41.376Z (almost 3 years ago)
- Language: Java
- Size: 88.9 KB
- Stars: 11
- Watchers: 16
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
# *mappy*
*mappy* is a re-implementation of the Hadoop MapReduce scheduler written to
demonstrate a [rules-based coding
style](https://www.usenix.org/conference/atc15/technical-session/presentation/stutsman)
and to highlight the benefits of the technique. *mappy*'s job scheduler is
equivalent to Hadoop's, and it reimplements the functionality provided by 3
classes in the Hadoop Java implementation: JobImpl, TaskImpl, and
TaskAttemptImpl. Each of the 3 classes implements an event-driven state
machine, and together they form the core of Hadoop's job scheduler and its
fault handling.
Each state machine is defined by specifying what can be visualized as a
transition table. The implementation explicitly specifies each transition with
a start state, end state, trigger event, and transition action; a minimal
sketch of this style appears after the list below. The implemented
"transition table" for each class can be found at the following locations:
- [JobImpl.StateMachineFactory](reference/JobImpl.java#L239)
- [TaskImpl.StateMachineFactory](reference/TaskImpl.java#L147)
- [TaskAttemptImpl.StateMachineFactory](reference/TaskAttemptImpl.java#L209)
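Here is that sketch, in Python. The states, events, and helper names are
illustrative only; they are not Hadoop's actual API, which is the Java
```StateMachineFactory``` linked above:
```
# Minimal sketch of the transition-table style (illustrative names and
# states; not Hadoop's actual API). Each entry explicitly pairs a start
# state, a trigger event, an end state, and the action to run.
def init_job(job):
    print("initializing job %s" % job)

def start_job(job):
    print("scheduling tasks for job %s" % job)

TRANSITIONS = [
    # (start state, trigger event, end state, action)
    ("NEW",    "JOB_INIT",  "INITED",  init_job),
    ("INITED", "JOB_START", "RUNNING", start_job),
]

def handle_event(job, state, event):
    """Dispatch one event against the table and return the new state."""
    for start, trigger, end, action in TRANSITIONS:
        if (start, trigger) == (state, event):
            action(job)
            return end
    raise ValueError("no transition for %s in state %s" % (event, state))
```
Every legal (state, event) pair must be enumerated this way, which is explicit
but, at 163 transitions, verbose.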
Each transition action is itself implemented as a nested class with a member
function ```transition```, which defines the body of the action.
[JobImpl.TaskAttemptCompletedEventTransition](reference/JobImpl.java#L1779) is
an example of a relatively involved action, whereas
[JobImpl.KillInitedJobTransition](reference/JobImpl.java#L1743),
[JobImpl.KilledDuringSetupTransition](reference/JobImpl.java#L1754), and
[JobImpl.KilledDuringCommitTransition](reference/JobImpl.java#L2030) are 3
transitions that are nearly identical.
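Rendered as a rough Python sketch (the real actions are Java nested classes in
the files linked above, and the attribute names here are invented), the
pattern looks like this:
```
# Illustrative sketch of the action-as-class pattern: each transition
# action is a class whose transition() method holds the action's body.
class KillInitedJobTransition(object):
    def transition(self, job, event):
        # Body of the action (illustrative attributes): record why the
        # job ended and move it to its terminal state.
        job.diagnostics.append("job killed before starting")
        job.state = "KILLED"
```
Because each of the three near-identical kill transitions above needs its own
such class, small behavioral variations end up duplicated across boilerplate.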
*mappy* reimplements the functionality provided by the 3 state machines found
in JobImpl.java, TaskImpl.java, and TaskAttemptImpl.java. Each state machine
corresponds to a *task* (our term for a grouped set of rules and the state
variables they act on) in [job.py](job.py). Here are the ```applyRules```
methods that implement the rules for each task type (a loose sketch of the
style follows the list):
- [Job.applyRules()](job.py#L24)
- [Task.applyRules()](job.py#L131)
- [TaskAttempt.applyRules()](job.py#L215)
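The sketch promised above; it is modeled on, but not copied from,
[job.py](job.py), and every name in it is illustrative:
```
# Loose sketch of a rules-based task (illustrative, not job.py itself):
# each rule is a guard over state variables followed by an action, and
# applyRules() lists the rules in the normal order of processing.
class Job(object):
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks   # the Job's goal: tasks to run
        self.completed = 0           # tasks completed so far
        self.committed = False       # has the job output been committed?

    def applyRules(self):
        # Rule: while tasks remain, keep requesting containers.
        if self.completed < self.num_tasks:
            self.request_containers()
        # Rule: once every task has completed, commit the job output.
        if self.completed == self.num_tasks and not self.committed:
            self.commit()
            self.committed = True

    def request_containers(self):
        pass  # would ask the (mock) RMContainerAllocator for containers

    def commit(self):
        pass  # would hand a commit event to the CommitterEventHandler
```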
The rules-based implementation of the MapReduce scheduler is significantly
simpler than the state machine implementation: a total of 19 rules in 3 tasks
provides functionality equivalent to the 163 transitions in the state machine
implementation. Each of the three ```applyRules``` methods fits in a screen or
two of code (117 total lines of code and comments between the three
```applyRules``` methods), which makes it possible to view the entire behavior
of each task at once. Furthermore, the order of the rules within each
```applyRules``` method shows the normal order of processing, which also helps
visualization. In contrast, the state machine implementation required more than
750 lines of code just to specify the three transition tables, plus another
1500 lines of code for the transition handlers.
Hadoop's event-driven state machines use events heavily to communicate between
the job scheduler and the outside world, other modules, and sometimes even
internally between the components of the job scheduler. For parity with
Hadoop, the *mappy* implementation uses the same event names for events
used to interact with modules outside the 3 classes being reimplemented. Events
that Hadoop used to communicate internally between the job scheduler's 3 state
machine classes were replaced with equivalent functionality implemented in a
more rules-based style.
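As a rough illustration of that replacement (the class and attribute names
here are invented, not job.py's): where Hadoop's TaskImpl would react to a
success event posted by a TaskAttemptImpl, a rule can simply read the
attempt's state variables on its next pass:
```
# Illustrative sketch: instead of a TaskAttempt posting a "succeeded"
# event to its Task, the Task's rule reads the attempt's state directly
# the next time applyRules() runs.
class TaskAttempt(object):
    def __init__(self):
        self.state = "RUNNING"

class Task(object):
    def __init__(self, attempts):
        self.attempts = attempts
        self.state = "RUNNING"

    def applyRules(self):
        # Rule: the task succeeds as soon as any of its attempts has.
        if self.state == "RUNNING" and any(
                a.state == "SUCCEEDED" for a in self.attempts):
            self.state = "SUCCEEDED"
```
Because rules reread shared state on every pass, no internal event plumbing is
needed between the three tasks.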
*mappy*'s equivalence to the Hadoop implementation was verified by hand, and we
have also built a mock MapReduce implementation around [job.py](job.py) to run
the scheduler as an additional sanity check. The mock implementation provides
the following:
- A "worker" that mocks the behavior of a Hadoop container and simply accepts
"work" and responds asynchronously after waiting for certain about of time.
- Mock RMContainerAllocator and CommitterEventHandler modules which, like their
Hadoop counterparts, handle the events generated by the scheduler.
- A basic RPC system for communication.
- A "master" that runs the scheduler and other modules as a Hadoop MapReduce
master would.
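The compressed sketch of the worker's behavior mentioned above; the function
and parameter names are hypothetical, and the actual code is in
[worker.py](worker.py):
```
import random
import threading
import time

# Hypothetical sketch of the mock worker's core: accept "work" and
# report completion asynchronously after a delay, as a Hadoop container
# would.
def accept_work(task_id, report_completion):
    def run():
        time.sleep(random.uniform(0.5, 2.0))  # pretend to do the work
        report_completion(task_id, "SUCCEEDED")
    threading.Thread(target=run).start()
```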
## Running *mappy*
*mappy* can be run by starting a [*master*](master.py) and any number of
[*workers*](worker.py); they can all be run on the same machine or spread
across a cluster of machines. The example commands below use a single machine.
By default, modules run in the foreground and print both RPC traffic and
certain events to standard out. As such, each module should be run in its own
command-line terminal. To kill a process, press Ctrl-C.
To start a master, run the [master.py](master.py) module with the following
command, which specifies its IP address, port number, and the number of tasks
the Job should have:
```
./master.py 127.0.0.1 8000 -t 3
```
To start a worker, run the [worker.py](worker.py) module with the following
command, which specifies its own IP address and port number as well as the
master's IP address and port number:
```
./worker.py 127.0.0.1 8001 127.0.0.1 8000
```
The master will run the scheduler until the Job's goal is reached, that is,
until all tasks have run and been "committed". If any worker dies (is killed
using Ctrl-C) while the job is running, the scheduler will reschedule the now
lost tasks. Once the job is complete, the master will print the list of tasks
and the worker that completed each task.
```
Job Complete
<0: SUCCEEDED (u'127.0.0.1', 8001, 3721)>
<1: SUCCEEDED (u'127.0.0.1', 8001, 3721)>
<2: SUCCEEDED (u'127.0.0.1', 8001, 3721)>
```