{"id":20656706,"url":"https://github.com/platformlab/mappy","last_synced_at":"2025-04-19T12:24:03.556Z","repository":{"id":31319542,"uuid":"34882008","full_name":"PlatformLab/mappy","owner":"PlatformLab","description":"Demo re-implementation of the Hadoop MapReduce scheduler in Python","archived":false,"fork":false,"pushed_at":"2016-03-01T19:58:24.000Z","size":91,"stargazers_count":11,"open_issues_count":0,"forks_count":3,"subscribers_count":16,"default_branch":"master","last_synced_at":"2023-02-28T11:02:41.376Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PlatformLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-04-30T23:38:43.000Z","updated_at":"2022-12-07T19:47:50.000Z","dependencies_parsed_at":"2022-09-22T16:10:41.118Z","dependency_job_id":null,"html_url":"https://github.com/PlatformLab/mappy","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlatformLab%2Fmappy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlatformLab%2Fmappy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlatformLab%2Fmappy/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlatformLab%2Fmappy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PlatformLab","download_url":"https://codeload.github.com/PlatformLab/mappy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories
_count":224951605,"owners_count":17397427,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T18:17:01.579Z","updated_at":"2024-11-16T18:17:02.088Z","avatar_url":"https://github.com/PlatformLab.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# *mappy*\n\n*mappy* is a re-implementation of the Hadoop MapReduce scheduler written to\ndemonstrate a [rules-based coding\nstyle](https://www.usenix.org/conference/atc15/technical-session/presentation/stutsman)\nand to highlight the benefits of the technique. *mappy*'s job scheduler is\nequivalent to Hadoop's, and it reimplements the functionality provided by 3\nclasses in the Hadoop Java implementation: JobImpl, TaskImpl, and\nTaskAttemptImpl. Each of the 3 classes implements an event-driven state machine\nand together form the core of Hadoop's job scheduler and its fault handling.\nEach state machine is defined by specifying what can be visualized as a\ntransition table. The implementation explicitly specifies each transition with\na start state, end state, trigger event, and transition action. 
The implemented\n\"transition table\" for each class can be found at the following locations:\n\n- [JobImpl.StateMachineFactory](reference/JobImpl.java#L239)\n- [TaskImpl.StateMachineFactory](reference/TaskImpl.java#L147)\n- [TaskAttemptImpl.StateMachineFactory](reference/TaskAttemptImpl.java#L209)\n\nEach transition action is itself implemented as a nested class with a member\nfunction ```transition``` which defines the body of the action.\n[JobImpl.TaskAttemptCompletedEventTransition](reference/JobImpl.java#L1779) is\nan example of a relatively involved action, whereas\n[JobImpl.KillInitedJobTransition](reference/JobImpl.java#L1743),\n[JobImpl.KilledDuringSetupTransition](reference/JobImpl.java#L1754), and\n[JobImpl.KilledDuringCommitTransition](reference/JobImpl.java#L2030) show 3\ntransitions that are nearly identical.\n\n*mappy* reimplements the functionality provided by the 3 state machines found\nin JobImpl.java, TaskImpl.java, and TaskAttemptImpl.java. Each state machine\ncorresponds to a *task* (our term for a grouped set of rules and the state\nvariables they act on) in [job.py](job.py). Here are the ```applyRules```\nmethods that implement the rules for each task type:\n\n- [Job.applyRules()](job.py#L24)\n- [Task.applyRules()](job.py#L131)\n- [TaskAttempt.applyRules()](job.py#L215)\n\nThe rules-based implementation of the MapReduce scheduler is significantly\nsimpler than the state machine implementation: a total of 19 rules in 3 tasks\nprovide functionality equivalent to the 163 transitions in the state machine\nimplementation. Each of the three ```applyRules``` methods fits in a screen or\ntwo of code (117 total lines of code and comments between the three\n```applyRules``` methods), which makes it possible to view the entire behavior\nof each task at once. Furthermore, the order of the rules within each\n```applyRules``` method shows the normal order of processing, which also helps\nvisualization. 
In contrast, the state machine implementation required more than\n750 lines of code just to specify the three transition tables, plus another\n1500 lines of code for the transition handlers.\n\nHadoop's event-driven state machines use events heavily to communicate between\nthe job scheduler and the outside world, other modules, and sometimes even\ninternally between the components of the job scheduler. For parity and\ncongruence, the *mappy* implementation uses the same event names for events\nused to interact with modules outside the 3 classes being reimplemented. Events\nthat Hadoop used to communicate internally between the job scheduler's 3 state\nmachine classes were replaced with equivalent functionality implemented in a\nmore rules-based style.\n\n*mappy*'s equivalence to the Hadoop implementation was verified by hand, and we\nhave also built a mock MapReduce implementation around [job.py](job.py) to run\nthe scheduler as an additional sanity check. The mock implementation provides\nthe following:\n\n- A \"worker\" that mocks the behavior of a Hadoop container and simply accepts\n  \"work\" and responds asynchronously after waiting for a certain amount of time.\n- Mock RMContainerAllocator and CommitterEventHandler modules which, like their\n  Hadoop counterparts, handle the events generated by the scheduler.\n- A basic RPC system for communication.\n- A \"master\" that runs the scheduler and other modules as a Hadoop MapReduce\n  master would.\n\n## Running *mappy*\n\n*mappy* can be run by starting a [*master*](master.py) and any number of\n[*workers*](worker.py); they can all be run on the same machine or separately on a\ncluster of machines. The example commands will use a single machine.\n\nBy default, modules run in the foreground and print both RPC traffic and\ncertain events to standard out. As such, each module should be run in a separate\ncommand-line terminal. 
To kill a process, press Ctrl-C.\n\nTo start a master, run the [master.py](master.py) module with the following\ncommand, which specifies its IP address, PORT number, and the number of tasks the\nJob should have:\n\n```\n./master.py 127.0.0.1 8000 -t 3\n```\n\nTo start a worker, run the [worker.py](worker.py) module with the following\ncommand, which specifies its IP address and PORT number as well as the master's\nIP address and PORT number:\n\n```\n./worker.py 127.0.0.1 8001 127.0.0.1 8000\n```\n\nThe master will run the scheduler until the Job's goal is reached, that is, until all tasks\nare run and \"committed\". If any worker dies (is killed using Ctrl-C) while the\njob is running, the scheduler will reschedule the now lost tasks. Once the job\nis complete, the master will print the list of tasks and the corresponding\nworker that completed each task.\n\n```\nJob Complete\n\u003c0: SUCCEEDED (u'127.0.0.1', 8001, 3721)\u003e\n\u003c1: SUCCEEDED (u'127.0.0.1', 8001, 3721)\u003e\n\u003c2: SUCCEEDED (u'127.0.0.1', 8001, 3721)\u003e\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplatformlab%2Fmappy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplatformlab%2Fmappy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplatformlab%2Fmappy/lists"}