{"id":13941519,"url":"https://github.com/haifengl/bigdata","last_synced_at":"2025-04-04T21:11:35.516Z","repository":{"id":87559652,"uuid":"43791031","full_name":"haifengl/bigdata","owner":"haifengl","description":"Introduction to Big Data","archived":false,"fork":false,"pushed_at":"2024-05-14T19:00:10.000Z","size":3584,"stargazers_count":392,"open_issues_count":0,"forks_count":145,"subscribers_count":54,"default_branch":"master","last_synced_at":"2025-03-28T20:11:49.175Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TeX","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/haifengl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-10-07T02:34:22.000Z","updated_at":"2025-02-23T04:56:03.000Z","dependencies_parsed_at":"2024-05-23T05:04:03.881Z","dependency_job_id":null,"html_url":"https://github.com/haifengl/bigdata","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haifengl%2Fbigdata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haifengl%2Fbigdata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haifengl%2Fbigdata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/haifengl%2Fbigdata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/haifengl","download_url":"https://codeload.github.com/haifengl/bigdata/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"ht
tps://github.com","kind":"github","repositories_count":247249536,"owners_count":20908212,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T02:01:20.533Z","updated_at":"2025-04-04T21:11:35.479Z","avatar_url":"https://github.com/haifengl.png","language":"TeX","funding_links":[],"categories":["TeX"],"sub_categories":[],"readme":"# Introduction to Big Data\n\n[![Join the chat at https://gitter.im/haifengl/bigdata](https://badges.gitter.im/haifengl/bigdata.svg)](https://gitter.im/haifengl/bigdata?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\nDownload the book in \u003ca href=\"https://github.com/haifengl/bigdata/releases/download/v0.0.2/bigdata.pdf\"\u003ePDF\u003c/a\u003e or \u003ca href=\"https://github.com/haifengl/bigdata/releases/download/v0.0.2/bigdata.epub\"\u003eEPUB\u003c/a\u003e.\n\n-   Introduction\n    -   What’s Big Data?\n    -   Business Use Cases\n        -   CRM\n        -   HCM\n        -   IoT\n        -   Healthcare\n    -   Audience\n    -   Roadmap\n-   Data Management\n-   Hadoop\n    -   HDFS\n        -   Assumptions\n        -   Architecture\n        -   Control and Data Flow\n        -   The Small Files Problem\n        -   HDFS Federation\n        -   Java API\n        -   Data Ingestion\n    -   MapReduce\n        -   Overview\n        -   Data Flow\n        -   Secondary Sorting\n        -   Examples\n        -   Shortcomings\n    -   Tez\n    -   YARN\n-   Spark\n    -   RDD\n    -   Implementation\n    -   API\n-   Analytics and Data Warehouse\n    -   Pig\n    -   Hive\n    -  
 Impala\n    -   Shark and Spark SQL\n-   NoSQL\n    -   The CAP Theorem\n    -   ZooKeeper\n        -   Data Model\n        -   Atomic Broadcast\n    -   HBase\n        -   Data Model\n        -   Storage\n        -   Architecture\n        -   Security\n        -   Coprocessor\n        -   Summary\n    -   Riak\n        -   Data Model\n        -   Storage\n        -   Architecture\n        -   Consistency\n        -   Summary\n    -   Cassandra\n        -   Data Model\n        -   Storage\n        -   Architecture\n        -   CQL\n        -   Consistency\n        -   Summary\n    -   MongoDB\n        -   Data Model\n        -   Storage\n        -   Cluster Architecture\n        -   Replica Set\n        -   Sharding\n        -   Summary\n\nIntroduction\n============\n\nJust like the Internet, Big Data is part of our lives today. From search,\nonline shopping, and video on demand to e-dating, Big Data always plays an\nimportant role behind the scenes. Some people claim that the Internet of\nThings (IoT) will overtake big data as the most hyped technology\n@Gartner2014. That may come true. But IoT cannot come alive without big\ndata. In this book, we will dive deeply into big data technologies. But\nfirst we need to understand what Big Data is.\n\nWhat’s Big Data?\n----------------\n\nGartner, and now much of the industry, uses the “3Vs” model @Laney2012\nfor describing big data:\n\n\u003e Big data is high volume, high velocity, and/or high variety\n\u003e information assets that require new forms of processing to enable\n\u003e enhanced decision making, insight discovery and process optimization.\n\nThere is no doubt that today’s systems are processing huge amounts of data\nevery day. For example, Facebook’s Hive data warehouse held 300 PB of data\nwith an incoming daily rate of about 600 TB in April 2014\n@VagateWilfong2014! This example also shows us that big data is fast\ndata, too. 
Without high-speed data generation and capture, we won’t\nquickly accumulate a large amount of data to process. According to IBM,\n$90\\%$ of the data in the world today has been created over the last two\nyears alone @IBM2013. High variety (i.e. unstructured data) is another\nimportant aspect of big data. It refers to information that does\nnot have a pre-defined data model or format. Traditional data processing\nsystems (e.g. relational data warehouses) may handle large volumes of\nrigid relational data but they are not flexible enough to process\nsemi-structured or unstructured data. New technologies have to be\ndeveloped to handle data from various sources, e.g. texts, social\nnetworks, image data, etc.\n\nThe 3Vs model nicely describes several major aspects of big data. Since\nthen, people have added more Vs (e.g. Variability, Veracity) to the list.\nHowever, do 3Vs (or 4Vs, 5Vs, …) really capture the core characteristics\nof big data? Probably not. We are processing data on the scale of\npetabytes or even exabytes today. But big is always relative, right?\nAlthough 1 TB of data is not that big today, it was big and very challenging\nto process 20 years ago. Recall that the fastest supercomputer in 1994,\nFujitsu Numerical Wind Tunnel, had a peak speed of 170 GFLOPS @Top500.\nWell, an Nvidia K40 GPU in a PC has the power of 1430 GFLOPS today\n@Nvidia2014. Besides, software innovations (e.g. GFS and MapReduce) have also\nhelped a lot in processing bigger and bigger data. With the advances in\ntechnology, today’s big data will quickly become small by tomorrow’s\nstandards. The same thing holds for “high velocity”. So high volume and\nhigh velocity are not the core of the big data movement even though they are\nthe driving forces of technology advancement. How about “high variety”?\nMany people read it as unstructured data which cannot be handled well\nby an RDBMS. But unstructured data have always been there no matter how\nthey are stored, processed, and analyzed. 
We do handle text, voice,\nimages and videos better today with the advances in NoSQL, natural\nlanguage processing, information retrieval, computer vision, and pattern\nrecognition. But it is still about technology advancement rather\nthan the intrinsic value of big data.\n\nFrom the business point of view, we may understand big data better.\nAlthough data is a valuable corporate asset, it is just soil, not oil.\nWithout analysis, it is pretty much useless. But extremely valuable\nknowledge and insights can be discovered from data. No matter what you\ncall this analytic process (data science, business intelligence, machine\nlearning, data mining, or information retrieval), the business goal is\nthe same: higher competency gained from the discovered knowledge and\ninsights. But wait a second. Hasn’t the idea of data analytics existed\nfor a long time? So what are the real differences between today’s “big\ndata” analytics and traditional data analytics? Looking back at web data\nanalysis, the origin of big data, we will find that big data means\nproactively learning and understanding the customers, their needs,\nbehaviors, experience, and trends in near real-time and 24$\\times$7. On\nthe other hand, traditional data analytics is passive/reactive, treats\ncustomers as a whole or as segments rather than individuals, and has a\nsignificant time lag. Check out the applications of big data; a lot of\nthem are about\n\n-   User Experience and Behavior Analysis\n\n-   Personalization\n\n-   Recommendation\n\nwhich you rarely find in business intelligence applications.[^1] New\napplications, e.g. the smart grid and the Internet of Things, are pushing this\nreal-time proactive analysis forward to the whole environment and\ncontext. Therefore, the fundamental objective of big data is to help\norganizations turn data into actionable information for identifying new\nopportunities, recognizing operational issues and problems, making better\ndecisions, etc. 
This is the driving force for corporations to\nembrace big data.\n\nHow did this shift happen? The data have been changing. Traditionally,\nour databases were just systems of record, whose data were manually entered\nby people. In contrast, a large part of big data is log data, which are\ngenerated by applications and record every interaction between users and\nsystems. Some people call them machine-generated data to emphasize the\nspeed of data generation and the size of data. But the truth is that\nthey are triggered by human actions (“events” is probably a better name for\nthese data). The Internet of Things will even help us understand the\nenvironment and context of user actions. The analysis of events results\nin a better understanding of every single user and thus yields an improved\nuser experience and greater revenue, a lovely win-win for both customers\nand businesses.\n\nBusiness Use Cases\n------------------\n\nBig data is not just hype but can bring great value to business. In\nwhat follows, we will discuss some use cases of big data in different\nareas and industries. The list could be very long but we will focus on\nseveral important cases to show how big data can help solve business\nchallenges.\n\n### CRM\n\nCustomer relationship management (CRM) is for managing a company’s\ninteractions with current and future customers. By integrating big data\ninto a CRM solution, companies can learn customer behavior, identify\nsales opportunities, analyze customers’ sentiment, and improve customer\nexperience to increase customer engagement and bring greater profits.\n\nUsing big data, organizations can collect more accurate and detailed\ninformation to gain a 360-degree view of customers. 
The analysis of all the\ncustomers’ touch points, such as browsing history,[^2] social media,\nemail, and call centers, enables companies to gain a much more complete\nand deeper understanding of customer behavior – what ads attract them,\nwhy they buy, how they shop, what they buy together, what they’ll buy\nnext, why they switch, how they recommend a product/service in their\nsocial network, etc. Once actionable insights are discovered, companies\nare more likely to rise above industry standards.\n\nBig data also enables comprehensive benchmarking over time. For\nexample, banks, telephone service companies, Internet service providers,\npay TV companies, insurance firms, and alarm monitoring services often\nuse customer attrition analysis and customer attrition rates as one of\ntheir key business metrics because the cost of retaining an existing\ncustomer is far less than that of acquiring a new one @ReichheldSasser1990.\nMoreover, big data enables service providers to move from reactive churn\nmanagement to proactive customer retention with predictive modeling\nbefore customers explicitly start the switch.\n\n### HCM\n\nHuman capital management (HCM) is supposed to maximize employee performance\nin service of their employer’s strategic objectives. However, current\nHCM systems are mostly bookkeeping. For example, many HCM\nsoftware products/services provide @AdpHcm\n\n-   Enrolling or changing benefits information\n\n-   Reporting life events such as moving or having a baby\n\n-   Acknowledging company policies\n\n-   Viewing pay statements and W-2 information\n\n-   Changing W-4 tax information\n\n-   Managing a 401(k) account\n\n-   Viewing the company directory\n\n-   Submitting requisition requests\n\n-   Approving leave requests\n\n-   Managing performance and goals\n\n-   Viewing team calendars\n\nThese are all important HR tasks. However, they are hardly associated with\n“maximizing employee performance”. Even worse, current HCM systems are\npassive. 
Taking performance and goals management as an example, an employee and\nhis/her manager enter the goals at the beginning of the year and input the\nperformance evaluations and feedback at the end of the year. So what? If\nperformance was low, it has already been low for most of the year!\n\nWith big data, HCM systems can help HR practitioners and managers to\nactively measure, monitor and improve employee performance. Although it\nis pretty hard to measure employee performance in real time, especially\nfor long-term projects, studies show a clear correlation between\nengagement and performance – and most importantly between improving\nengagement and improving performance @MacLeodClarke2012. That is,\norganizations with a highly engaged workforce significantly outperform\nthose without.\n\nEngagement analytics has been an active research area in CRM and many\ntechnologies can be borrowed for HCM. For example, churn analysis can be\nused to understand the underlying patterns of employee turnover. With\nbig data, HCM systems can predict which high-performing employees are\nlikely to leave a company in the next year and then offer possible\nactions (higher compensation and/or a new job) that might make them stay.\nCorporations simply want to know their employees as well as\nthey know their customers. From this point of view, it does make a lot\nof sense to connect HCM and CRM together with big data to shorten the\ncommunication paths between the inside and outside worlds.\n\n### IoT\n\nThe Internet of Things is the interconnection of uniquely identifiable\nembedded computing devices within the Internet infrastructure. IoT\nrepresents the next big wave in the evolution of the Internet. The\ncombination of big data and IoT is producing huge opportunities for\ncompanies in all industries. 
Industries such as manufacturing, mobility\nand retail have already been leveraging the data generated by billions\nof devices to provide new levels of operational and business insights.\n\nIndustrial companies are making progress in creating financial value by\ngathering and analyzing vast volumes of machine sensor data\n@IndustrialInternetReport2014. Additionally, some companies are\nstarting to leverage insights from machine asset data to create\nefficiencies in operations and drive market advantages with greater\nconfidence. For example, Thames Water Utilities Limited, the largest\nprovider of water and wastewater services in the UK, is using sensors,\nanalytics and real-time data to help the utility respond more quickly to\ncritical situations such as leaks or adverse weather events\n@Accenture14SmartGrid.\n\nThe smart grid, an advanced application of IoT, is profoundly changing the\nfundamentals of urban areas throughout the world. Multiple cities around\nthe world are conducting so-called smart city trials. For example,\nthe city of Seattle is applying analytics to optimize energy usage by\nidentifying equipment and system inefficiencies, and alerting building\nmanagers to areas of wasted energy. Elements in each room of a building\n– such as lighting, temperature and the position of window shades – can\nthen be adjusted, depending on data readings, to maximize efficiency\n@Accenture13Seattle.\n\n### Healthcare\n\nHealthcare is a big industry and contributes a significant part of a\ncountry’s economy (in fact 17.7% of GDP in the USA). Big data can improve\nour ability to treat illnesses, e.g. by recognizing individuals who are at\nrisk for serious health problems. It can also identify waste in the\nhealthcare system and thus lower the cost of healthcare across the\nboard.\n\nA recent exciting advance in applying big data to healthcare is IBM\nWatson. 
IBM Watson is an artificially intelligent computer system\ncapable of answering questions posed in natural language @Watson2014.\nWatson may work as a clinical decision support system for medical\nprofessionals based on its natural language, hypothesis generation, and\nevidence-based learning capabilities @Watson2013Healthcare\n[@Watson2013Cancer]. When a doctor asks Watson about symptoms and other\nrelated factors, Watson first parses the input to identify the most\nimportant pieces of information; then mines patient data to find facts\nrelevant to the patient’s medical and hereditary history; then examines\navailable data sources to form and test hypotheses; and finally provides\na list of individualized, confidence-scored recommendations. The sources\nof data that Watson uses for analysis can include treatment guidelines,\nelectronic medical record data, notes from doctors and nurses, research\nmaterials, clinical studies, journal articles, and patient information.\n\nAudience\n--------\n\nThis book is intended as an overview of Big Data technologies, geared\ntoward software architects and advanced developers. Prior experience\nwith Big Data, as either a user or a developer, is not necessary. As\nthis young area is evolving at an amazing speed, we do not intend to\ncover how to use software tools or their APIs in detail, as such details will\nbecome obsolete very soon. Instead, we focus on how these systems are\ndesigned and why they are designed that way. We hope that you will gain a better\nunderstanding of Big Data and thus make the best use of it.\n\nRoadmap\n-------\n\nAlthough the book is written for technologists, we start with a brief\ndiscussion of data management. Frequently, technologists are lost in the\ntrees of technical details without seeing the whole forest. As we\ndiscussed earlier, Big Data is meant to meet business needs in a\ndata-driven approach. 
To make Big Data a success, executives and\nmanagers need all the disciplines to manage data as a valuable resource.\nChapter 2 presents a framework for defining a successful data strategy.\n\nChapter 3 is a deep dive into Apache Hadoop, the de facto Big Data\nplatform. Apache Hadoop is an open-source software framework for\ndistributed storage and distributed processing of big data on clusters\nof commodity hardware. In particular, we discuss HDFS, MapReduce, Tez and\nYARN.\n\nChapter 4 is a discussion of Apache Spark, the new hot buzzword in Big\nData. Although MapReduce is great for large-scale data processing, it is\nnot well suited to iterative algorithms or interactive analytics. Apache\nSpark is designed to solve this problem by reusing the working dataset.\n\nMapReduce and Spark enable us to crunch numbers in a massively parallel\nway. However, they provide relatively low-level APIs. To quickly obtain\nactionable insights from data, we would like to employ data\nwarehouses built on top of them. In Chapter 5, we cover Pig and Hive, which\ntranslate a high-level DSL or SQL into native MapReduce/Tez code. Similarly,\nShark and Spark SQL bring SQL on top of Spark. Moreover, we discuss\nCloudera Impala and Apache Drill, which are native massively parallel\nprocessing query engines for interactive analysis of web-scale datasets.\n\nIn Chapter 6, we discuss several operational NoSQL databases that are\ndesigned for horizontal scaling and high availability.\n\nAlthough the book can be read sequentially straight through, you can\ncomfortably break between the chapters. For example, you may jump\ndirectly into the NoSQL chapter while skipping Hadoop and Spark.\n\nData Management\n===============\n\n![Data Management](images/data-management.png)\n\nBig Data is meant to solve complex enterprise optimization problems. To make\nthe best use of Big Data, we have to recognize that data is a vital\ncorporate asset as data is the lifeblood of the Internet economy. 
Today\norganizations rely on data science to make more informed and more\neffective decisions, which create competitive advantages through\ninnovative products and operational efficiencies.\n\nHowever, data is first of all a debt. The costs of data acquisition,\nhardware, software, operation, and talent are very high. Without the\nright management, we are unlikely to extract value\nfrom data effectively. To make big data a success, we must have all the disciplines\nto manage data as a valuable resource. Data management is much broader\nthan database management. It is a systematic process of capturing,\ndelivering, operating, protecting, enhancing, and disposing of the data\ncost-effectively, which needs the ongoing reinforcement of plans,\npolicies, programs and practices.\n\nThe ultimate goal of data management is to increase the value\nproposition of the data. It requires serious and careful consideration\nand should start with a data strategy that defines a roadmap to meet the\nbusiness needs in a data-driven approach. To create a data strategy,\nthink carefully about the following questions:\n\n-   What problem are we trying to solve? What value can big data bring?\n    Big data is hot and thus many corporations are embracing it. However,\n    big data for the sake of big data is clearly wrong. Others’ use\n    cases do not have to be yours. To glean the value of big data, a\n    deep understanding of your business and the problems to solve\n    is essential.\n\n-   Who holds the data, who owns the data, and who can access the data?\n    Data governance is a set of processes that ensures that important\n    data assets are formally managed throughout the enterprise. Through\n    data governance, we expect data stewards and data custodians to\n    exercise positive control over the data. 
Data custodians are\n    responsible for the safe custody, transport, and storage of the data\n    while data stewards are responsible for the management of data\n    elements – both the content and metadata.\n\n-   What data do we need? It may seem obvious, but it is often simply\n    answered with “I do not know” or “Everything”, which indicates a\n    lack of understanding of business practices. Whenever this happens, we\n    should go back and answer the first question again. How do we acquire\n    the data? Data may be collected from internal systems of record, log\n    files, surveys, or third parties. The transactional systems may be\n    revised to collect necessary data for analytics.\n\n-   Where do we store the data and how long do we keep it? Due to the\n    variety of data, today’s data may be stored in various databases\n    (relational or NoSQL), data warehouses, Hadoop, etc. Today, database\n    management goes way beyond relational database administration. Because\n    big data is also fast data, it is impractical to keep all of the\n    data forever. Careful thought is needed to determine the lifespan\n    of data.\n\n-   How do we ensure data quality? Junk in, junk out. Without ensuring\n    the data quality, big data won’t bring any value to the business.\n    With the advent of big data, data quality management is both more\n    important and more challenging than ever.\n\n-   How do we analyze and visualize the data? A large number of\n    mathematical models are available for analyzing data. Simply\n    applying mathematical models does not necessarily result in\n    actionable insights. Before talking about your mathematical models,\n    go understand your business and problems. Lead the model with your\n    insights (or *a priori* knowledge in terms of machine learning)\n    rather than be led by the uninterpretable numbers of black-box\n    models. 
Besides, visualization is extremely helpful for exploring\n    data and presenting the analytic results, as a picture is worth a\n    thousand words.\n\n-   How do we manage the complexity? Big data is extremely complicated. To\n    manage the complexity and improve data management practices, we\n    need to develop an accountability framework that encourages desirable\n    behavior and is tailored to the organization’s business\n    strategies, strengths and priorities.\n\nWe believe that a good data strategy will emerge after thinking through\nand answering the above questions.\n\nHadoop\n======\n\nBig data unavoidably needs distributed parallel computing on a cluster\nof computers. Therefore, we need a distributed data operating system to\nmanage a variety of resources, data, and computing tasks. Today, Apache\nHadoop @Hadoop is the de facto distributed data operating system. Apache\nHadoop is an open-source software framework for distributed storage and\ndistributed processing of big data on clusters of commodity hardware.\nEssentially, Hadoop consists of three parts:\n\n-   HDFS, a distributed high-throughput file system\n\n-   MapReduce, a framework for parallel data processing\n\n-   YARN, for job scheduling and cluster resource management\n\nHDFS splits files into large blocks that are distributed (and\nreplicated) among the nodes in the cluster. For processing the data,\nMapReduce takes advantage of data locality by shipping code to the nodes\nthat have the required data and processing the data in parallel.\n\n![Hadoop](images/hadoop.png)\n\nOriginally Hadoop cluster resource management was part of MapReduce\nbecause MapReduce was the main computing paradigm. Today the Hadoop ecosystem\ngoes beyond MapReduce and includes many additional parallel computing\nframeworks, such as Apache Spark, Apache Tez, and Apache Storm. So the\nresource manager, referred to as YARN, was stripped out of MapReduce\nand improved to support other computing frameworks in Hadoop v2. 
Now\nMapReduce is one kind of application running in a YARN container and\nother types of applications can be written generically to run on YARN.\n\nHDFS\n----\n\nThe Hadoop Distributed File System (HDFS) @HDFS is a multi-machine file\nsystem that runs on top of the machines’ local file systems but appears as a\nsingle namespace, accessible through `hdfs://` URIs. It is designed to\nreliably store very large files across machines in a large cluster of\ninexpensive commodity hardware. HDFS closely follows the design of the\nGoogle File System (GFS) @Ghemawat:2003:GFS [@McKusick:2009:GEF].\n\n### Assumptions\n\nAn HDFS instance may consist of hundreds or thousands of nodes, which\nare made of inexpensive commodity components that often fail. This implies\nthat, at any given time, some components are non-functional and\nsome will not recover from their current failures. Therefore, constant\nmonitoring, error detection, fault tolerance, and automatic recovery\nmust be an integral part of the file system.\n\nHDFS is tuned to support a modest number (tens of millions) of large\nfiles, which are typically gigabytes to terabytes in size. Initially,\nHDFS assumed a write-once-read-many access model for files. A file once\ncreated, written, and closed need not be changed. This assumption\nsimplifies the data coherency problem and enables high-throughput data\naccess. The append operation was added later (single appender only)\n@HDFS2010:265.\n\nHDFS applications typically have large streaming access to their\ndatasets. HDFS is mainly designed for batch processing rather than\ninteractive use. The emphasis is on high throughput of data access\nrather than low latency.\n\n### Architecture\n\n![HDFS Architecture](images/hdfs-architecture.png)\n\nHDFS has a master/slave architecture. An HDFS cluster consists of a\nsingle NameNode, a master server that manages the file system namespace\nand regulates access to files by clients. 
In addition, there are a\nnumber of DataNodes that manage storage attached to the nodes that they\nrun on. A typical deployment has a dedicated machine that runs only the\nNameNode. Each of the other machines in the cluster runs one instance of\nthe DataNode.[^3]\n\nHDFS supports a traditional hierarchical file organization that consists\nof directories and files. In HDFS, each file is stored as a sequence of\nblocks (each identified by a unique 64-bit ID); all blocks in a file except the\nlast one are the same size (typically 64 MB). DataNodes store each block\nin a separate file on the local file system and provide read/write access.\nWhen a DataNode starts up, it scans through its local file system and\nsends the list of hosted data blocks (called a Blockreport) to the\nNameNode.\n\nFor reliability, each block is replicated on multiple DataNodes (three\nreplicas by default). The placement of replicas is critical to HDFS\nreliability and performance. HDFS employs a rack-aware replica placement\npolicy to improve data reliability, availability, and network bandwidth\nutilization. When the replication factor is three, HDFS puts one replica\non one node in the local rack, another on a different node in the same\nrack, and the last on a node in a different rack. This policy reduces\nthe inter-rack write traffic, which generally improves write performance.\nSince the chance of rack failure is far less than that of node failure,\nthis policy does not impact data reliability and availability notably.\n\nThe NameNode is the arbitrator and repository for all HDFS metadata. The\nNameNode executes common namespace operations such as creating, deleting,\nmodifying and listing files and directories. The NameNode also performs\nblock management, including mapping files to blocks, creating and\ndeleting blocks, and managing replica placement and re-replication.\nBesides, the NameNode maintains DataNode cluster membership by handling\nregistrations and periodic heartbeats. 
But user data never flows\nthrough the NameNode.\n\nTo achieve high performance, the NameNode keeps all metadata in main\nmemory, including the file and block namespace, the mapping from files to\nblocks, and the locations of each block’s replicas. The namespace and\nfile-to-block mapping are also persisted in the files EditLog\nand FsImage in the local file system of the NameNode. The file FsImage\nstores the entire file system namespace and file-to-block map. The\nEditLog is a transaction log that records every change that occurs to file\nsystem metadata, e.g. creating a new file or changing the replication\nfactor of a file. When the NameNode starts up, it reads the FsImage and\nEditLog from disk, applies all the transactions from the EditLog to the\nin-memory representation of the FsImage, flushes out the new version of\nFsImage to disk, and truncates the EditLog.\n\nBecause the NameNode replays the EditLog and updates the FsImage only\nduring startup, the EditLog could get very large over time, making the next\nrestart of the NameNode take longer. To avoid this problem, HDFS has a\nsecondary NameNode that updates the FsImage with the EditLog\nperiodically and keeps the EditLog within a size limit. Note that the\nsecondary NameNode is not a standby NameNode. It usually runs on a\ndifferent machine from the primary NameNode since its memory\nrequirements are on the same order as those of the primary NameNode.\n\nThe NameNode does not store block location information persistently. On\nstartup, the NameNode enters a special state called Safemode and\nreceives Blockreport messages from the DataNodes. Each block has a\nspecified minimum number of replicas. A block is considered safely\nreplicated when the minimum number of replicas has checked in with the\nNameNode. 
After a configurable percentage of safely replicated data\nblocks checks in with the NameNode (plus an additional 30 seconds), the\nNameNode exits the Safemode state.\n\n### Control and Data Flow\n\nHDFS is designed such that clients never read or write file data\nthrough the NameNode. Instead, a client asks the NameNode which\nDataNodes it should contact using the ClientProtocol interface through an\nRPC connection. Then the client communicates with a DataNode directly to\ntransfer data using the DataTransferProtocol, which is a streaming\nprotocol for performance reasons. Besides, all communication between\nthe NameNode and a DataNode, e.g. DataNode registration, heartbeat,\nBlockreport, is initiated by the DataNode and responded to by the\nNameNode.\n\n#### Read\n\nFirst, the client queries the NameNode with the file name, read range\nstart offset, and the range length. The NameNode returns the locations\nof the blocks of the specified file within the specified range.\nIn particular, the DataNode locations for each block are sorted by\nproximity to the client. The client then sends a request to one of the\nDataNodes, most likely the closest one.\n\n#### Write\n\nA client request to create a file does not reach the NameNode\nimmediately. Instead, the client caches the file data into a temporary\nlocal file. Once the local file accumulates more than one block of\ndata, the client contacts the NameNode, which updates the file system\nnamespace and returns the allocated data block location. Then the client\nflushes the block from the local temporary file to the specified\nDataNode. When a file is closed, the remaining last block data is\ntransferred to the DataNodes.\n\n### The Small Files Problem\n\nBig data stored in small files (significantly smaller than the block size)\nimplies a lot of files, which creates a big problem for the NameNode\n@SmallFiles. Recall that the NameNode holds all the metadata of files\nand blocks in main memory. 
Given that each metadata object\noccupies about 150 bytes, the NameNode may host about 10 million files,\neach using a block (about 20 million file and block objects in total),\nwith 3 gigabytes of memory. Although larger memory\ncan push the upper limit higher, a large heap is a big challenge for the JVM\ngarbage collector. Furthermore, HDFS is not efficient at reading small\nfiles because of the overhead of client-NameNode communication, too many\ndisk seeks, and lots of hopping from DataNode to DataNode to retrieve\neach small file.\n\nIn order to reduce the number of files and thus the pressure on the\nNameNode’s memory, Hadoop Archives (HAR files) were introduced. HAR\nfiles, created by the `hadoop archive`[^4] command, are special format\narchives that contain metadata and data files. The archive exposes\nitself as a file system layer. All of the original files are visible and\naccessible through a `har://` URI. It is also easy to use HAR files as\nthe input file system in MapReduce. Note that it is actually slower to read\nthrough files in a HAR because of the extra access to metadata.\n\nThe SequenceFile, consisting of binary key-value pairs, can also be used\nto handle the small files problem, by using the filename as the key and\nthe file contents as the value. This works very well in practice for\nMapReduce jobs. Besides, the SequenceFile supports compression, which\nreduces disk usage and speeds up data loading in MapReduce. Open source\ntools exist to convert tar files to SequenceFiles @Tar2Seq.\n\nKey-value stores, e.g. HBase and Accumulo, may also be used to\nreduce file count although they are designed for much more complicated\nuse cases. Compared to SequenceFile, they support random access by keys.\n\n### HDFS Federation\n\nThe existence of a single NameNode in a cluster greatly simplifies the\narchitecture of the system. However, it also introduces problems. The\nfile count problem, due to the limited memory of the NameNode, is an\nexample. 
A more serious problem is that the NameNode proved to be a bottleneck for\nthe clients @McKusick:2009:GEF. Even though the clients issue few\nmetadata operations to the NameNode, there may be thousands of clients\nall talking to the NameNode at the same time. With multiple MapReduce\njobs, we might suddenly have thousands of tasks in a large cluster, each\ntrying to open a number of files. Given that the NameNode is capable of\ndoing only a few thousand operations a second, it would take a long time\nto handle all those requests.\n\nSince Hadoop 2.0, we can have two redundant NameNodes in the same\ncluster in an active/passive configuration with a hot standby. Although\nthis allows a fast failover to a new NameNode for fault tolerance, it\ndoes not solve the performance issue. To partially resolve the\nscalability problem, the concept of HDFS Federation was introduced to\nallow multiple namespaces within an HDFS cluster. In the future, it may\nalso support cooperation across clusters.\n\nIn HDFS Federation, there are multiple independent NameNodes (and thus\nmultiple namespaces). The NameNodes do not require coordination with\neach other. The DataNodes are used as the common storage by all the\nNameNodes: each DataNode registers with and handles commands from all the\nNameNodes in the cluster. The failure of a NameNode does not prevent the\nDataNode from serving other NameNodes in the cluster.\n\nBecause multiple NameNodes run independently, there may be conflicts of\n64 bit block ids generated by different NameNodes. To avoid this\nproblem, a namespace uses one or more block pools, identified by a\nunique id in a cluster. A block pool belongs to a single namespace and\ndoes not cross namespace boundaries. The extended block id, a tuple of\n(Block Pool ID, Block ID), is used for block identification in HDFS\nFederation.\n\n### Java API\n\nHDFS is implemented in Java and provides a native Java API. 
To access\nHDFS in other programming languages, Thrift[^5] bindings are provided\nfor Perl, Python, Ruby and PHP @HdfsThrift. In what follows, we will\ndiscuss how to work with HDFS Java API with a couple of small examples.\nFirst of all, we need to add the following dependencies to the project’s\nMaven POM file @Maven.\n\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n        \u003cartifactId\u003ehadoop-common\u003c/artifactId\u003e\n        \u003cversion\u003e2.6.0\u003c/version\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n        \u003cartifactId\u003ehadoop-hdfs\u003c/artifactId\u003e\n        \u003cversion\u003e2.6.0\u003c/version\u003e\n    \u003c/dependency\u003e\n\nThe main entry point of HDFS Java API is the abstract class `FileSystem`\nin the package `org.apache.hadoop.fs` that serves as a generic file\nsystem representation. `FileSystem` has various implementations:\n\nDistributedFileSystem\n\n:   The implementation of distributed file system. 
This object is the\n    way end-user code interacts with an HDFS.\n\nLocalFileSystem\n\n:   The local implementation for small Hadoop instances and for testing.\n\nFTPFileSystem\n\n:   A FileSystem backed by an FTP client.\n\nS3FileSystem\n\n:   A block-based FileSystem backed by Amazon S3.\n\nThe `FileSystem` class also serves as a factory for concrete\nimplementations:\n\n    Configuration conf = new Configuration();\n    FileSystem fs = FileSystem.get (conf);\n\nwhere the `Configuration` class passes the Hadoop configuration\ninformation such as scheme, authority, NameNode host and port, etc.\nUnless explicitly turned off, Hadoop by default specifies two resources,\nloaded in-order from the classpath:\n\ncore-default.xml\n\n:   Read-only defaults for Hadoop.\n\ncore-site.xml\n\n:   Site-specific configuration for a given Hadoop installation.\n\nApplications may add additional resources, which are loaded subsequent\nto these resources in the order they are added. With `FileSystem`, one\ncan do common namespace operations, e.g. creating, deleting, and\nrenaming files. We can also query the status of a file such as the\nlength, block size, block locations, permission, etc. To read or write\nfiles, we need to use the classes `FSDataInputStream` and\n`FSDataOutputStream`. In the following example, we develop two simple\nfunctions to copy a local file into/from HDFS. For simplicity, we do not\ncheck the file existence or any I/O errors. 
Note that `FileSystem` does\nprovide several utility functions for copying files between local and\ndistributed file systems.\n\n    /** Copy a local file to HDFS */\n    public void copyFromLocal(String src, String dst) throws IOException {\n     \n      Configuration conf = new Configuration();\n      FileSystem fs = FileSystem.get(conf);\n      \n      // The Path object represents a file or directory in HDFS.\n      FSDataOutputStream out = fs.create(new Path(dst));\n      InputStream in = new BufferedInputStream(new FileInputStream(new File(src)));\n     \n      byte[] b = new byte[1024];\n      int numBytes = 0;\n      while ((numBytes = in.read(b)) \u003e 0) {\n        out.write(b, 0, numBytes);\n      }\n     \n      in.close();\n      out.close();\n      fs.close();\n    }\n\n    /** Copy an HDFS file to local file system */\n    public void copyToLocal(String src, String dst) throws IOException {\n     \n      Configuration conf = new Configuration();\n      FileSystem fs = FileSystem.get(conf);\n     \n      FSDataInputStream in = fs.open(new Path(src));\n      OutputStream out = new BufferedOutputStream(new FileOutputStream(new File(dst)));\n      byte[] b = new byte[1024];\n      int numBytes = 0;\n      while ((numBytes = in.read(b)) \u003e 0) {\n        out.write(b, 0, numBytes);\n      }\n     \n      in.close();\n      out.close();\n      fs.close();\n    }\n\nIn the example, we use the method `FileSystem.create` to create an\n`FSDataOutputStream` at the indicated `Path`. If the file exists, it\nwill be overwritten by default. The `Path` object is used to locate a\nfile or directory in HDFS. `Path` is really a URI. For HDFS, it takes\nthe format of `hdfs://host:port/location`. To read an HDFS file, we use\nthe method `FileSystem.open` that returns an `FSDataInputStream` object.\nThe rest of the example is just regular Java I/O stream operations.\n\n### Data Ingestion\n\nToday, most data are generated and stored outside of Hadoop, e.g. 
relational\ndatabases, plain files, etc. Therefore, data ingestion is the first step\nto utilize the power of Hadoop. To move the data into HDFS, we do not\nhave to do low-level programming as in the previous example. Various\nutilities have been developed to move data into Hadoop.\n\n#### Batch Data Ingestion\n\nThe File System Shell @HdfsShell includes various shell-like commands,\nincluding `copyFromLocal` and `copyToLocal`, that directly interact with\nHDFS as well as other file systems that Hadoop supports. Most of the\ncommands in the File System Shell behave like the corresponding Unix commands.\nWhen the data files are ready in the local file system, the shell is a great\ntool to ingest data into HDFS in batch. In order to stream data into\nHadoop for real time analytics, however, we need more advanced tools,\ne.g. Apache Flume and Apache Chukwa.\n\n#### Streaming Data Ingestion\n\nApache Flume @Flume is a distributed, reliable, and available service\nfor efficiently collecting, aggregating, and moving large amounts of log\ndata into HDFS. It has a simple and flexible architecture based on\nstreaming data flows, and it is robust and fault tolerant with tunable\nreliability mechanisms and many failover and recovery mechanisms. It\nuses a simple extensible data model that allows for online analytic\napplications. Flume employs the familiar producer-consumer model.\n`Source` is the entity through which data enters Flume. Sources\neither actively poll for data or passively wait for data to be delivered\nto them. On the other hand, `Sink` is the entity that delivers the data\nto the destination. Flume has many built-in sources (e.g. log4j and\nsyslogs) and sinks (e.g. HDFS and HBase). `Channel` is the conduit\nbetween the Source and the Sink. Sources ingest events into the channel\nand the sinks drain the channel. Channels allow decoupling of ingestion\nrate from drain rate. 
When data are generated faster than the\ndestination can handle, the channel size increases.\n\nApache Chukwa @Chukwa is devoted to large-scale log collection and\nanalysis, built on top of the MapReduce framework. Beyond data ingestion,\nChukwa also includes a flexible and powerful toolkit for displaying,\nmonitoring, and analyzing results. Unlike Flume, Chukwa is not a\ncontinuous stream processing system but a mini-batch system.\n\nApache Kafka @Kafka and Apache Storm @Storm may also be used to ingest\nstreaming data into Hadoop although they are mainly designed to solve\ndifferent problems. Kafka is a distributed publish-subscribe messaging\nsystem. It is designed to provide high throughput persistent messaging\nthat’s scalable and allows for parallel data loads into Hadoop. Storm is\na distributed realtime computation system for use cases such as realtime\nanalytics, online machine learning, continuous computation, etc.\n\n#### Structured Data Ingestion\n\nApache Sqoop @Sqoop is a tool designed to efficiently transfer data\nbetween Hadoop and relational databases. We can use Sqoop to import data\nfrom a relational database table into HDFS. The import process is\nperformed in parallel and thus generates multiple files in the format of\ndelimited text, Avro, or SequenceFile. Besides, Sqoop generates a Java\nclass that encapsulates one row of the imported table, which can be used\nin subsequent MapReduce processing of the data. Moreover, Sqoop can\nexport the data (e.g. the results of MapReduce processing) back to the\nrelational database for consumption by external applications or users.\n\nMapReduce\n---------\n\nDistributed parallel computing is not new. Supercomputers have been\nusing MPI @Forum:1994:MMI for years for complex numerical computing.\nAlthough MPI provides a comprehensive API for data transfer and\nsynchronization, it is not very suitable for big data. 
Due to the large\ndata size and shared-nothing architecture for scalability, data\ndistribution and I/O are critical to big data analytics while MPI almost\nignores them.[^6] On the other hand, many big data analytics tasks are\nconceptually straightforward and do not need very complicated\ncommunication and synchronization mechanisms. Based on these\nobservations, Google invented MapReduce @Dean:2008:MSD to deal with the\nissues of how to parallelize the computation, distribute the data, and\nhandle failures.\n\n### Overview\n\nIn a shared-nothing distributed computing environment, a computation is\nmuch more efficient if it is executed near the data it operates on. This\nis especially true when the size of the data set is huge as it minimizes\nnetwork traffic and increases the overall throughput of the system.\nTherefore, it is often better to migrate the computation closer to where\nthe data is located rather than moving the data to where the application\nis running. With GFS/HDFS, MapReduce provides such a parallel\nprogramming framework.\n\nInspired by the `map` and `reduce`[^7] functions commonly used in\nfunctional programming, a MapReduce program is composed of a Map()\nprocedure that performs transformation and a Reduce() procedure that\ntakes the shuffled output of Map as input and performs a summarization\noperation. More specifically, the user-defined Map function processes a\nkey-value pair to generate a set of intermediate key-value pairs, and\nthe Reduce function aggregates all intermediate values associated with\nthe same intermediate key.\n\nMapReduce applications are automatically parallelized and executed on a\nlarge cluster of commodity machines. During the execution, the Map\ninvocations are distributed across multiple machines by automatically\npartitioning the input data into a set of M splits. The input splits can\nbe processed in parallel by different machines. 
Reduce invocations are\ndistributed by partitioning the intermediate key space into R pieces\nusing a partitioning function. The number of partitions and the\npartitioning function are specified by the user. Besides partitioning\nthe input data and running the various tasks in parallel, the framework\nalso manages all communications and data transfers, load balance, and\nfault tolerance.\n\nMapReduce provides programmers a really simple parallel computing\nparadigm. Because of automatic parallelization, no explicit handling of\ndata transfer and synchronization in programs, and no deadlock, this\nmodel is very attractive. MapReduce is also designed to process very\nlarge data that is too big to fit into the memory (combined from all\nnodes). To achieve that, MapReduce employs a data flow model, which also\nprovides a simple I/O interface to access large amount of data in\ndistributed file system. It also exploits data locality for efficiency.\nIn most cases, we do not need to worry about I/O at all.\n\n### Data Flow\n\n![MapReduce Data Flow](images/MapReduce.png)\n\nFor a given task, the MapReduce system runs as follows\n\nPrepare the Map() input\n\n:   The system splits the input files into M pieces and then starts up M\n    Map workers on a cluster of machines.\n\nRun the user-defined Map() code\n\n:   The Map worker parses key-value pairs out of the assigned split and\n    passes each pair to the user-defined Map function. The intermediate\n    key-value pairs produced by the Map function are buffered in memory.\n    Periodically, the buffered pairs are written to local disk,\n    partitioned into R regions for sharding purposes by the partitioning\n    function (called partitioner) that is given the key and the number\n    of reducers R and returns the index of the desired reducer.\n\nShuffle the Map output to the Reduce processors\n\n:   When ready, a reduce worker reads remotely the buffered data from\n    the local disks of the map workers. 
When a reduce worker has read\n    all intermediate data, it sorts the data by the intermediate keys so\n    that all occurrences of the same key are grouped together. Typically\n    many different keys map to the same reduce task.\n\nRun the user-defined Reduce() code\n\n:   The reduce worker iterates over the sorted intermediate data and for\n    each unique intermediate key encountered, it passes the key and the\n    corresponding set of intermediate values to the user’s\n    Reduce function.\n\nProduce the final output\n\n:   The final output is available in the R output files (one per\n    reduce task).\n\nOptionally, a combiner can be used between map and reduce as an\noptimization. The combiner function runs on the output of the map phase\nand is used as a filtering or an aggregating step to lessen the data\nthat are being passed to the reducer. In most cases the reducer\nclass is set to be the combiner class so that we can save network time.\nNote that this works only if the reduce function is commutative and\nassociative.\n\nIn practice, one should pay attention to the task granularity, i.e. the\nnumber of map tasks M and the number of reduce tasks R. In general, M\nshould be much larger than the number of nodes in the cluster, which\nimproves load balancing and speeds recovery from worker failure. The\nright level of parallelism for maps seems to be around 10-100 maps per\nnode (maybe more for very CPU-light map tasks). Besides, the task setup\ntakes a while. On a Hadoop cluster of 100 nodes, it takes 25 seconds\nuntil all nodes are executing the job. So it is best if the maps take at\nleast a minute to execute. In Hadoop, one can call\n`JobConf.setNumMapTasks(int)` to set the number of map tasks. Note that\nit only provides a hint to the framework.\n\nThe number of reducers is usually a small multiple of the number of\nnodes. The right factor seems to be 0.95 for well-balanced data\n(per intermediate key) or 1.75 otherwise for better load balancing. 
Note\nthat we reserve a few reduce slots for speculative tasks and failed\ntasks. We can set the number of reduce tasks by\n`JobConf.setNumReduceTasks(int)` in Hadoop and the framework will honor\nit. It is fine to set R to zero if no reduction is desired.\n\n### Secondary Sorting\n\nThe output of Mappers is sorted only by the intermediate keys.\nHowever, we sometimes also want to sort the intermediate values (or some\nfields of intermediate values), e.g. when calculating the stock price moving\naverage where the key is the stock ticker and the value is a pair of\ntimestamp and stock price. If the values of a given key are sorted by\nthe timestamp, we can easily calculate the moving average with a sliding\nwindow over the values. This problem is called secondary sorting.\n\nA direct approach to secondary sorting is for the reducer to buffer all\nof the values for a given key and do an in-memory sort. Unfortunately,\nit may cause the reducer to run out of memory.\n\nAlternatively, we may use a composite key that has multiple parts. In\nthe case of calculating the moving average, we may create a composite key of\n(ticker, timestamp) and also provide a customized sort comparator\n(subclass of `WritableComparator`) that compares ticker and then\ntimestamp. To ensure only the ticker (referred to as the natural key) is\nconsidered when determining which reducer to send the data to, we need\nto write a custom partitioner (subclass of `Partitioner`) that is solely\nbased on the natural key. Once the data reaches a reducer, all data is\ngrouped by key. Since we have a composite key, we need to make sure\nrecords are grouped solely by the natural key by implementing a group\ncomparator (another subclass of `WritableComparator`) that considers\nonly the natural key.\n\n### Examples\n\nHadoop implements MapReduce in Java. 
To create a MapReduce program,\nplease add the following dependencies to the project’s Maven POM file.\n\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n        \u003cartifactId\u003ehadoop-common\u003c/artifactId\u003e\n        \u003cversion\u003e2.6.0\u003c/version\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n        \u003cartifactId\u003ehadoop-mapreduce-client-core\u003c/artifactId\u003e\n        \u003cversion\u003e2.6.0\u003c/version\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eorg.apache.hadoop\u003c/groupId\u003e\n        \u003cartifactId\u003ehadoop-hdfs\u003c/artifactId\u003e\n        \u003cversion\u003e2.6.0\u003c/version\u003e\n    \u003c/dependency\u003e\n\n#### Sort\n\nThe essential part of the MapReduce framework is a large distributed\nsort. So we just let the framework do the job in this case while the map\nis as simple as emitting the sort key and original input. In the below\nexample, we just assume the input key is the sort key. The reduce\noperator is an identity function.\n\n    public class SortMapper extends Mapper\u003cIntWritable, Text, IntWritable, Text\u003e {\n\n      public void map(IntWritable key, Text value, Context context)\n        throws IOException, InterruptedException {\n        context.write(key, value);\n      }\n    }\n\n    public class SortReducer extends Reducer\u003cIntWritable, Text, IntWritable, Text\u003e {\n\n      public void reduce(IntWritable key, Iterable\u003cText\u003e values, Context context)\n        throws IOException, InterruptedException {\n        for (Text value : values) {\n          context.write(key, value);\n        }\n      }\n    }\n\nAlthough this example is extremely simple, there are many important\nclasses to understand. 
The MapReduce framework takes key-value pairs as\nthe input and produces a new set of key-value pairs (maybe of different\ntypes). The key and value classes have to be serializable by the\nframework and hence need to implement the `Writable` interface.\nAdditionally, the key classes have to implement the `WritableComparable`\ninterface to facilitate sorting by the framework.\n\nThe `map` method of a `Mapper` implementation processes one key-value pair\nin the input split at a time. The `reduce` method of a `Reducer`\nimplementation is called once for each intermediate key and associated\ngroup of values. In this case, we do not have to override the `map` and\n`reduce` methods because the default implementation is actually an\nidentity function. The sample code is mainly to show the interface. Both\n`Mapper` and `Reducer` emit their output through the `Context` object\nprovided by the framework.\n\nTo submit a MapReduce job to Hadoop, we take the following steps.\nFirst, the application describes various facets of the job via a `Job`\nobject. `Job` is typically used to specify the `Mapper`, `Reducer`,\n`InputFormat`, `OutputFormat` implementations, the directories of input\nfiles and the location of output files. Optionally, one may specify\nadvanced facets of the job such as the Combiner, Partitioner,\nComparator, and DistributedCache, etc. Then the application submits the\njob to the cluster by the method `waitForCompletion(boolean verbose)`\nand waits for it to finish. 
`Job` also allows the user to control the\nexecution and query the state.\n\n    public class MapReduceSort {\n      public static void main(String[] args) throws Exception {\n        Configuration conf = new Configuration();\n        Job job = Job.getInstance(conf, \"sort\");\n        job.setJarByClass(MapReduceSort.class);\n        job.setMapperClass(SortMapper.class);\n        job.setReducerClass(SortReducer.class);\n        job.setOutputKeyClass(IntWritable.class);\n        job.setOutputValueClass(Text.class);\n\n        FileInputFormat.addInputPath(job, new Path(args[0]));\n        FileOutputFormat.setOutputPath(job, new Path(args[1]));\n\n        System.exit(job.waitForCompletion(true) ? 0 : 1);\n      }\n    }\n\n#### Grep\n\nThe map function emits a line if it matches a given pattern. The reduce\npart is not necessary in this case and we can simply set the number of\nreduce tasks to zero (`job.setNumReduceTasks(0)`). Note that the `Mapper`\nimplementation also overrides the `setup` method, which will be called\nonce at the beginning of the task. In this case, we use it to set the\nsearch pattern from the job configuration. This is also a good example\nof passing small configuration data to MapReduce tasks. To pass a large\namount of read-only data to tasks, DistributedCache is preferred and\nwill be discussed later in the case of Inner Join. 
Similar to `setup`,\none may also override the `cleanup` method, which will be called once\nat the end of the task.\n\n    public class GrepMapper\u003cK\u003e extends Mapper\u003cK, Text, K, Text\u003e {\n\n      public static final String PATTERN = \"mapper.pattern\";\n      private Pattern pattern;\n\n      // Setup the match pattern from job context.\n      // Called once at the beginning of the task.\n      public void setup(Context context) {\n        Configuration conf = context.getConfiguration();\n        pattern = Pattern.compile(conf.get(PATTERN));\n      }\n\n      public void map(K key, Text value, Context context)\n        throws IOException, InterruptedException {\n        if (pattern.matcher(value.toString()).find()) {\n          context.write(key, value);\n        }\n      }\n    }\n\nIn a relational database, one can achieve this by the following simple\nquery in SQL.\n\n    SELECT * FROM T_KV WHERE value LIKE '%XYZ%';\n\nAlthough this query requires a full table scan, a parallel DBMS can\neasily outperform MapReduce in this case. This is because the setup\ncost of MapReduce is high. The performance gap will be much larger\nwhen an index can be used, such as\n\n    SELECT * FROM T_PERSON WHERE age \u003e 30;\n\n#### Aggregation\n\nAggregation is a simple analytic calculation such as counting the number\nof accesses or users from different countries. WordCount, the “hello\nworld” program in the MapReduce world, is an example of aggregation.\nWordCount simply counts the number of occurrences of each word in a\ngiven input set. The Mapper splits the input line into words and emits a\nkey-value pair \u0026lt;word, 1\u0026gt; for each word. 
The Reducer just sums up the values.\nFor the sample code, please refer to Hadoop’s MapReduce Tutorial\n@MapReduceTutorial.\n\nFor SQL, aggregation simply means GROUP BY such as the following\nexample:\n\n    SELECT country, count(*) FROM T_WEB_LOG GROUP BY country;\n\nWith a combiner, the aggregation in MapReduce works pretty much the same as\nin a parallel DBMS. Of course, a DBMS can still benefit a lot from an\nindex on the group by field.\n\n#### Inner Join\n\nAn inner join operation combines two data sets, A and B, to produce a\nthird one containing all record pairs from A and B with matching\nattribute values. The sort-merge join algorithm and hash-join algorithm\nare two common alternatives to implement the join operation in a\nparallel data flow environment @DeWitt:1992:PDS. In sort-merge join,\nboth A and B are sorted by the join attribute and then compared in\nsorted order. The matching pairs are inserted into the output stream.\nThe hash-join first prepares a hash table of the smaller data set with\nthe join attribute as the hash key. Then we scan the larger dataset and\nfind the relevant rows from the smaller dataset by searching the hash\ntable.\n\nThere are several ways to implement join in MapReduce, e.g. reduce-side\njoin and map-side join. The reduce-side join is a straightforward\napproach that takes advantage of the fact that identical keys are sent to the\nsame reducer. In the reduce-side join, the output key of the Mapper has to\nbe the join key so that matching records reach the same reducer. The Mapper also\ntags each dataset with an identity to differentiate them in the reducer.\nWith secondary sorting on the dataset identity, we ensure the order of\nvalues sent to the reducer, which generates the matched pairs for each\njoin key. Because two datasets are usually in different formats, we can\nuse the class `MultipleInputs` to set up a different `InputFormat` and\n`Mapper` for each input path. 
The reduce-side join belongs to the\nsort-merge join family and scales very well for large datasets. However,\nit may be less efficient in the case of data skew where a dataset is\nsignificantly smaller than the other.\n\nIf one dataset is small enough to fit into the memory, we may use the\nmemory-based map-side join. In this approach, the Mappers side-load the\nsmaller dataset and build a hash table of it during the setup, and\nprocess the rows of the larger dataset one-by-one in the map function.\nTo efficiently load the smaller dataset in every Mapper, we should use\nthe `DistributedCache`. The `DistributedCache` is a facility to cache\napplication-specific large, read-only files. An application specifies\nthe files to be cached by `Job.addCacheFile(URI)`. The MapReduce\nframework will copy the necessary files on to the slave node before any\ntasks for the job are executed on that node. This is much more efficient\nthan copying the files for each Mapper. Besides, we can declare the\nhash table as a static field so that the tasks running successively in a\nJVM will share the data using the task JVM reuse feature. Thus, we\nneed to load the data only once for each JVM.\n\nThe above map-side join is fast but only works when the smaller dataset\nfits in the memory. To avoid this pitfall, we can use the multi-phase\nmap-side join. First we run a MapReduce job on each dataset that uses\nthe join attribute as the Mapper’s and Reducer’s output key and has the\nsame number of reducers for all datasets. In this way, all datasets are\nsorted by the join attribute and have the same number of partitions. In\nthe second phase, we use `CompositeInputFormat` as the input format. 
The\n`CompositeInputFormat` performs joins over a set of data sources sorted\nand partitioned the same way, which is guaranteed by the first phase.\nSo the records are already merged before they reach the Mapper, which\nsimply outputs the joined records to the stream.\n\nBecause the join implementation is fairly complicated, we will not show\nthe sample code here. In practice, one should use higher level tools\nsuch as Hive or Pig to join data sets rather than reinventing the wheel.\n\nIn practice, join, aggregation, and sort are frequently used together,\ne.g. finding the client of the ad that generates the most revenue (or\nclicks) during a period. In MapReduce, this has to be done in multiple\nphases. The first phase filters the data based on the click timestamp\nand joins the client and click log datasets. The second phase does the\naggregation on the output of the join and the third one finishes the task by\nsorting the output of the aggregation.\n\nVarious benchmarks show that parallel DBMSs are much faster than\nMapReduce for joins @Pavlo:2009:CAL. Again an index on the join key is\nvery helpful. But more importantly, joins can be done locally on each\nnode if both tables are partitioned by the join key so that no data\ntransfer is needed before the join.\n\n#### K-Means Clustering\n\nThe k-means clustering is a simple and widely used method that\npartitions data into k clusters in which each record belongs to the\ncluster with the nearest center, serving as a prototype of the cluster\n@Jain:1988:ACD. The most common algorithm for k-means clustering is\nLloyd’s algorithm that iteratively proceeds by alternating between two\nsteps. The assignment step assigns each sample to the cluster with the nearest\nmean. The update step calculates the new means to be the centroids of\nthe samples in the new clusters. The algorithm converges when the\nassignments no longer change. 
The algorithm can be naturally implemented\nin the MapReduce framework where each iteration will be a MapReduce job.\n\nInput\n\n:   The data files as regular MapReduce input and cluster center files\n    side-loaded by DistributedCache. Initially, the cluster centers may\n    be randomly selected.\n\nMap\n\n:   With side-loaded cluster centers, each sample input is mapped to a\n    cluster of nearest mean. The emitted key-value pair is \u0026lt;cluster\n    id, sample vector\u0026gt;.\n\nCombine\n\n:   In order to reduce the data passed to the reducer, we may have a\n    combiner that aggregates samples belonging to the same cluster.\n\nReduce\n\n:   The reduce tasks recalculate the new means of clusters as the\n    centroids of samples in the new clusters. The output of new cluster\n    means will be used as the input to the next iteration.\n\nIterate\n\n:   This process is repeated until the algorithm converges or reaches\n    the maximum number of iterations.\n\nOutput\n\n:   Runs a map-only job to output the cluster assignment.\n\nSuch an implementation is very scalable. It can handle very large data\nsizes, which may be even larger than the combined memory of the cluster.\nOn the other hand, it is not very efficient because the input data have\nto be read again and again for each iteration. This is a general\nperformance issue when implementing iterative algorithms in MapReduce.\n\n### Shortcomings\n\nThe above examples show that MapReduce is capable of a variety of tasks.\nOn the other hand, they also demonstrate several drawbacks of MapReduce.\n\n#### Performance\n\nMapReduce provides a scalable programming model on large clusters.\nHowever, it is not guaranteed to be fast, for several reasons:\n\n-   Even though Hadoop now reuses JVM instances for map and reduce\n    tasks, the startup time is still significant on large clusters. 
The\n    high startup cost means that MapReduce is mainly suitable for\n    long-running batch jobs.\n\n-   The communication between map and reduce tasks is always done by\n    remote file access, which often dominates the\n    computation cost. Such a pulling strategy is great for fault\n    tolerance, but it results in low performance compared to the\n    push mechanism. Besides, there could be M \\* R intermediate files.\n    Given large M and R, it is certainly a challenge for the underlying\n    file system. With multiple reducers running simultaneously, it is\n    highly likely that some of them will attempt to read from the same\n    map node at the same time, inducing a large number of disk seeks and\n    slowing the effective disk transfer rate.\n\n-   Iterative algorithms perform poorly on MapReduce because the\n    input data are read again and again. Data also must be materialized and\n    replicated on the distributed file system between successive jobs.\n\n#### Low Level Programming Interface\n\nA major goal of MapReduce is to provide a simple programming model in which\napplication developers need only write the map and reduce parts of\nthe program. However, practical programmers have to take care of a lot of\nthings such as input/output format, partition functions, comparison\nfunctions, combiners, and job configuration to achieve good performance.\nAs shown in the example, even a very simple grep MapReduce program is\nfairly long. On the other hand, the same query in SQL is much shorter\nand cleaner.\n\nMapReduce is independent of the underlying storage system. 
It is the\napplication developers’ duty to organize data such as building and using\nany index, partitioning and collocating related data sets, etc.\nUnfortunately, these are not easy tasks in the context of HDFS and\nMapReduce.\n\n#### Limited Parallel Computing Model\n\nThe simple computing model of MapReduce brings us no explicit handling\nof data transfer and synchronization in programs, and no deadlock. But\nit is a limited parallel computing model, essentially a scatter-gather\nprocessing model. For non-trivial algorithms, programmers try hard to\n“MapReducize” them, often in a non-intuitive way.\n\nAfter years of practice, the community has realized these problems and\ntries to address them in different ways. For example, Apache Spark aims\nat speed by keeping data in memory. Apache Pig provides a DSL and\nHive provides a SQL dialect on top of MapReduce to ease the\nprogramming. Google Dremel and Cloudera Impala target interactive\nanalysis with SQL queries. Microsoft Dryad/Apache Tez provide a more\ngeneral parallel computing framework that models computations in DAGs.\nGoogle Pregel and Apache Giraph concern computing problems on large\ngraphs. Apache Storm focuses on real time event processing. We will look\ninto all of them in the rest of the book. First, we will check out Tez and\nSpark in this chapter.\n\nTez\n---\n\nMapReduce provides a scatter-gather parallel computing model, which is\nvery limited. Dryad, a research project at Microsoft Research, attempted\nto support a more general purpose runtime for parallel data processing\n@Isard:2007:DDD. A Dryad job is a directed acyclic graph (DAG) where\neach vertex is a program and edges represent data channels (files, TCP\npipes, or shared-memory FIFOs). The DAG defines the data flow of the\napplication, and the vertices of the graph define the operations that\nare to be performed on the data. It is a logical computation graph that\nis automatically mapped onto physical resources by the runtime. 
Dryad\nincludes a domain-specific language, in C++ as a library using a mixture\nof method calls and operator overloading, that is used to create and\nmodel a Dryad execution graph. Dryad is notable for allowing graph\nvertices to use an arbitrary number of inputs and outputs, while\nMapReduce restricts all computations to take a single input set and\ngenerate a single output set. Although Dryad provides a nice alternative\nto MapReduce, Microsoft discontinued active development on Dryad,\nshifting focus to the Apache Hadoop framework in October 2011.\n\nInterestingly, the Apache Hadoop community recently picked up the idea\nof Dryad and developed Apache Tez @Tez [@TezTutorial], a new runtime\nframework on YARN, during the Stinger initiative of Hive @Stinger.\nSimilar to Dryad, Tez is an application framework which allows for a\ncomplex directed-acyclic-graph of tasks for processing data. Edges of\nthe data flow graph determine how the data is transferred and the dependency\nbetween the producer and consumer vertices. Edge properties enable Tez\nto instantiate user tasks, configure their inputs and outputs, schedule\nthem appropriately and define how to route data between the tasks. The\nedge properties include:\n\nData movement\n\n:   determines routing of data between tasks.\n\n    -   One-To-One: Data from the $i^{th}$ producer task routes to the\n        $i^{th}$ consumer task.\n\n    -   Broadcast: Data from a producer task routes to all\n        consumer tasks.\n\n    -   Scatter-Gather: Producer tasks scatter data into shards and\n        consumer tasks gather the shards. 
The $i^{th}$ shard from all\n        producer tasks routes to the $i^{th}$ consumer task.\n\nScheduling\n\n:   determines when a consumer task is scheduled.\n\n    -   Sequential: Consumer task may be scheduled after a producer\n        task completes.\n\n    -   Concurrent: Consumer task must be co-scheduled with a\n        producer task.\n\nData source\n\n:   determines the lifetime/reliability of a task output.\n\n    -   Persisted: Output will be available after the task exits. Output\n        may be lost later on.\n\n    -   Persisted-Reliable: Output is reliably stored and will always\n        be available.\n\n    -   Ephemeral: Output is available only while the producer task\n        is running.\n\nFor example, MapReduce would be expressed with the scatter-gather,\nsequential and persisted edge properties.\n\nThe vertex in the data flow graph defines the user logic that transforms\nthe data. Tez models each vertex as a composition of Input, Processor\nand Output modules. Input and Output determine the data format and how\nand where it is read/written. An input represents a pipe through which a\nprocessor can accept input data from a data source such as HDFS or the\noutput generated by another vertex, while an output represents a pipe\nthrough which a processor can generate output data for another vertex to\nconsume or to a data sink such as HDFS. Processor holds the data\ntransformation logic, which consumes one or more Inputs and produces one\nor more Outputs.\n\nThe Tez runtime expands the logical graph into a physical graph by\nadding parallelism at the vertices, i.e. multiple tasks are created per\nlogical vertex to perform the computation in parallel. A logical edge in\na DAG is also materialized as a number of physical connections between\nthe tasks of two connected vertices. 
Tez also supports pluggable vertex\nmanagement modules to collect information from tasks and change the data\nflow graph at runtime to optimize performance and resource usage.\n\nWith Tez, Apache Hive is now able to process data in a single Tez job\nwhere it would otherwise take multiple MapReduce jobs. If the data processing is too\ncomplicated to finish in a single Tez job, a Tez session can encompass\nmultiple jobs by leveraging common services. This provides additional\nperformance optimizations.\n\n![Pig/Hive on MapReduce vs Tez](images/PigHive_MR.png \"fig:\") ![Pig/Hive\non MapReduce vs Tez](images/PigHive_Tez.png \"fig:\")\n\nLike MapReduce, Tez is still a lower-level programming model. To obtain\ngood performance, the developer must understand the structure of the\ncomputation and the organization and properties of the system resources.\n\nYARN\n----\n\nOriginally, Hadoop was restricted mainly to the MapReduce paradigm,\nwhere the resource management is done by JobTracker and TaskTracker. The\nJobTracker farms out MapReduce tasks to specific nodes in the cluster,\nideally the nodes that have the data, or at least are in the same rack.\nA TaskTracker is a node in the cluster that accepts tasks - Map, Reduce\nand Shuffle operations - from a JobTracker. Because Hadoop has stretched\nbeyond MapReduce (e.g. HBase, Storm, etc.), Hadoop now architecturally\ndecouples the resource management features from the programming model of\nMapReduce, which makes Hadoop clusters more generic. 
The new resource\nmanager is referred to as MapReduce 2.0 (MRv2) or YARN @YARN2011:279.\nNow MapReduce is one kind of application running in a YARN container\nand other types of applications can be written generically to run on\nYARN.\n\nYARN employs a master-slave model and includes several components:\n\n-   The global Resource Manager is the ultimate authority that\n    arbitrates resources among all applications in the system.\n\n-   The per-application Application Master negotiates resources from the\n    Resource Manager and works with the Node Managers to execute and\n    monitor the component tasks.\n\n-   The per-node slave Node Manager is responsible for launching the\n    applications’ containers, monitoring their resource usage and\n    reporting to the Resource Manager.\n\n![YARN Architecture](images/yarn-architecture.png)\n\nThe Resource Manager, consisting of Scheduler and Application Manager,\nis the central authority that arbitrates resources among various\ncompeting applications in the cluster. The Scheduler is responsible for\nallocating resources to the various running applications subject to the\nconstraints of capacities, queues etc. The Application Manager is\nresponsible for accepting job submissions, negotiating the first\ncontainer for executing the application-specific Application Master and\nproviding the service for restarting the Application Master container on\nfailure.\n\nThe Scheduler uses the abstract notion of a Resource Container which\nincorporates elements such as memory, CPU, disk, network etc. Initially,\nYARN used memory-based scheduling. Each node is configured with a\nset amount of memory and applications request containers for their tasks\nwith configurable amounts of memory. Recently, YARN added CPU as a\nresource in the same manner. 
Nodes are configured with a number of\n“virtual cores” (vcores) and applications give a vcore number in the\ncontainer request.\n\nThe Scheduler has a pluggable policy, which is responsible for\npartitioning the cluster resources among the various queues,\napplications etc. For example, the Capacity Scheduler is designed to\nmaximize the throughput and the utilization of shared, multi-tenant\nclusters. Queues are the primary abstraction in the Capacity Scheduler.\nThe capacity of each queue specifies the percentage of cluster resources\nthat are available for applications submitted to the queue. Furthermore,\nqueues can be set up in a hierarchy. YARN also sports a Fair Scheduler\nthat tries to assign resources to applications such that all\napplications get an equal share of resources over time on average using\ndominant resource fairness @Ghodsi:2011:DRF.\n\nThe protocol between YARN and applications is as follows. First an\nApplication Submission Client communicates with the Resource Manager to\nacquire a new Application Id. Then it submits the Application to be run\nby providing sufficient information (e.g. the local files/jars, command\nline, environment settings, etc.) to the Resource Manager to launch the\nApplication Master. The Application Master is then expected to register\nitself with the Resource Manager and to request and receive containers.\nAfter a container is allocated to it, the Application Master\ncommunicates with the Node Manager to launch the container for its task\nby specifying the launch information such as command line specification,\nenvironment, etc. The Application Master also handles failures of job\ncontainers. Once the task is completed, the Application Master signals\nthe Resource Manager.\n\nAs the central authority of the YARN cluster, the Resource Manager is\nalso the single point of failure (SPOF). To make it fault tolerant, an\nActive/Standby architecture can be employed since Hadoop 2.4. 
Multiple\nResource Manager instances (listed in the configuration file\nyarn-site.xml) can be brought up but only one instance is Active at any\npoint of time while others are in Standby mode. When the Active goes\ndown or becomes unresponsive, another Resource Manager is automatically\nelected by a ZooKeeper-based method to be the Active. ZooKeeper is a\nreplicated CP key-value store, which we will discuss in detail later.\nClients, Application Masters and Node Managers try connecting to the\nResource Managers in a round-robin fashion until they hit the new\nActive.\n\nSpark\n=====\n\nAlthough MapReduce is great for large scale data processing, it is not\nfriendly for iterative algorithms or interactive analytics because the\ndata have to be repeatedly loaded for each iteration or be materialized\nand replicated on the distributed file system between successive jobs.\nApache Spark @Zaharia:2010:SCC [@Zaharia:2012:RDD; @Spark] is designed\nto solve this problem by reusing the working dataset. Initially Spark\nwas built on top of Mesos, but it can now also run on top of YARN or\nin standalone mode. The overall framework and parallel computing model of\nSpark are similar to MapReduce but with an important innovation, the resilient\ndistributed dataset (RDD).\n\nRDD\n---\n\nAn RDD is a read-only collection of objects partitioned across a cluster\nof computers that can be operated on in parallel. A Spark application\nconsists of a driver program that creates RDDs from HDFS files or an\nexisting Scala collection. The driver program may transform an RDD in\nparallel by invoking supported operations with user-defined functions,\nwhich returns another RDD. The driver can also persist an RDD in memory,\nallowing it to be reused efficiently across parallel operations. In\nfact, the semantics of RDDs are way more than just parallelization:\n\nAbstract\n\n:   The elements of an RDD do not have to exist in physical memory. 
In\n    this sense, an element of RDD is an expression rather than a value.\n    The value can be computed by evaluating the expression\n    when necessary.\n\nLazy and Ephemeral\n\n:   One can construct an RDD from a file or by transforming an existing\n    RDD with operations such as `map()`, `filter()`, `groupByKey()`, `reduceByKey()`,\n    `join()`, `cogroup()`, `cartesian()`, etc. However, no real data\n    loading or computation happens at the time of construction. Instead,\n    they are materialized on demand when they are used in some\n    operation, and are discarded from memory after use.\n\nCaching and Persistence\n\n:   We can cache a dataset in memory across operations, which allows\n    future actions to be much faster. Caching is a key tool for\n    iterative algorithms and fast interactive use cases. Caching is\n    actually one special case of persistence that allows different\n    storage levels, e.g. persisting the dataset on disk, persisting it\n    in memory but as serialized Java objects (to save space),\n    replicating it across nodes, or storing it off-heap in Tachyon[^8]\n    @Tachyon. These levels are set by passing a `StorageLevel` object to\n    `persist()`. The `cache()` method is a shorthand for using the default\n    storage level `StorageLevel.MEMORY_ONLY` (store deserialized objects\n    in memory).\n\nFault Tolerant\n\n:   If any partition of an RDD is lost, it will automatically be\n    recomputed using the transformations that originally created it.\n\nThe operations on RDDs take user-defined functions, which are closures\nin functional programming as Spark is implemented in Scala. A closure\ncan refer to variables in the scope when created, which will be copied\nto the workers when Spark runs a closure. Spark optimizes this process\nwith shared variables for a couple of cases:\n\nBroadcast variables\n\n:   If a large read-only dataset is used in multiple operations, it is\n    better to copy it to the workers only once. 
Similar to the idea of\n    DistributedCache, this can be achieved by broadcast variables that\n    are created from a variable `v` by calling\n    `SparkContext.broadcast(v)`.\n\nAccumulators\n\n:   Accumulators are variables that are only “added” to through an\n    associative operation and can therefore be efficiently supported\n    in parallel. They can be used to implement counters or sums. Only\n    the driver program can read the accumulator’s value. Spark natively\n    supports accumulators of numeric types.\n\nBy reusing cached data in RDDs, Spark offers great performance\nimprovement over MapReduce (10x $\\sim$ 100x faster). Thus, it is very\nsuitable for iterative machine learning algorithms. Similar to\nMapReduce, Spark is independent of the underlying storage system. It is\napplication developers’ duty to organize data such as building and using\nany index, partitioning and collocating related data sets, etc. These\nare critical for interactive analytics. Merely caching is insufficient\nand not effective for extremely large data.\n\nImplementation\n--------------\n\nThe RDD object implements a simple interface, which consists of three\noperations:\n\n`getPartitions`\n\n:   returns a list of partition IDs.\n\n`getIterator(partition)`\n\n:   iterates over a partition.\n\n`getPreferredLocations(partition)`\n\n:   is used to achieve data locality.\n\nWhen a parallel operation is invoked on a dataset, Spark creates a task\nto process each partition of the dataset and sends these tasks to worker\nnodes. Spark tries to send each task to one of its preferred locations.\nOnce launched on a worker, each task calls `getIterator` to start\nreading its partition.\n\nAPI\n---\n\nSpark is implemented in Scala and provides high-level APIs in Scala,\nJava, and Python. The following examples are in Scala. 
A Spark program\nneeds to create a `SparkContext` object:\n\n    val conf = new SparkConf().setAppName(appName).setMaster(master)\n    val sc = new SparkContext(conf)\n\nThe `appName` parameter is a name for your application to show on the\ncluster UI and the `master` is a cluster URL or a special “local” string\nto run in local mode.\n\nThen we can create RDDs from any storage source supported by Hadoop.\nSpark supports text files, SequenceFiles, etc. Text file RDDs can be\ncreated using `SparkContext`’s `textFile` method. This method takes a\nURI for the file (directories, compressed files, and wildcards as well)\nand reads it as a collection of lines.\n\n    val lines = sc.textFile(\"data.txt\")\n\nWe can create a new RDD by transforming an existing one with operations such as\n`map`, `flatMap`, `filter`, etc. We can also aggregate all the elements\nof an RDD using some function, e.g. `reduce`, `reduceByKey`, etc.\n\n    val lengths = lines.map(s =\u003e s.length)\n\nBeyond the basic operations such as `map` and `reduce`, Spark also\nprovides advanced operations such as `union`, `intersection`, `join`,\n`cogroup`, which create a new dataset from two existing RDDs. All these\noperations take functions from the driver program to run on the\ncluster. Thanks to the functional features of Scala, the code is a lot\nsimpler and cleaner than MapReduce as shown in the example.\n\nAs we discussed, RDDs are lazy and ephemeral. If we need to access an\nRDD multiple times, it is better to persist it in memory using the\n`persist` (or `cache`) method.\n\n    lengths.persist\n\nSpark also supports a rich set of higher-level tools including Spark SQL\nfor SQL and structured data processing, MLlib for machine learning,\nGraphX for graph processing, and Spark Streaming for event processing.\nWe will discuss these technologies later in related chapters.\n\nAnalytics and Data Warehouse\n============================\n\nWith big data at hand, we want to crunch numbers from them. 
MapReduce\nand Tez are good tools for ad-hoc analytics. However, their programming\nmodels are very low level. Custom code has to be written for even simple\noperations like projection and filtering. It is even more tedious and\nverbose to implement common relational operators such as join. Several\nefforts, including Pig and Hive, have been devoted to simplifying the\ndevelopment of MapReduce/Tez programs by providing a high level DSL or SQL\nthat can be translated to native MapReduce/Tez code. Similarly, Shark\nand Spark SQL bring SQL on top of Spark. Moreover, Cloudera Impala and\nApache Drill introduce native massively parallel processing query\nengines to Hadoop for interactive analysis of web-scale datasets.\n\nPig\n---\n\nDifferent from many other projects that bring SQL to Hadoop, Pig is\nspecial in that it provides a procedural (data flow) programming\nlanguage, Pig Latin, as it was designed for experienced programmers.\nHowever, SQL programmers will have little difficulty understanding Pig Latin\nprograms because most statements just look like SQL clauses.\n\nA Pig Latin program is a sequence of steps, each of which carries out a\nsingle data processing operation at a fairly high level, e.g. loading, filtering,\ngrouping, etc. The input data can be loaded from the file system or\nHBase by the operator LOAD:\n\n    grunt\u003e persons = LOAD 'person.csv' USING PigStorage(',') AS (name: chararray, age:int, address: (street: chararray, city: chararray, state: chararray, zip: int));\n\nwhere `grunt\u003e` is the prompt of the Pig console and PigStorage is a built-in\ndeserializer for structured text files. Various deserializers are\navailable. User defined functions (UDFs) can also be used to parse data\nin unsupported formats. The AS clause defines a schema that assigns names\nto fields and declares types for fields. Although schemas are optional,\nprogrammers are encouraged to use them whenever possible. 
Note that such\na “schema on read” is very different from the relational approach that\nrequires rigid predefined schemas. Therefore, there is no need to copy\nor reorganize the data.\n\nPig has a rich data model. Primitive data types include int, long,\nfloat, double, chararray, bytearray, boolean, datetime, biginteger and\nbigdecimal. Complex data types include tuple, bag (a collection of\ntuples), and map (a set of key value pairs). Unlike the relational\nmodel, the fields of tuples can be of any data type. Similarly, the map\nvalues can be of any type (the map key is always of type chararray). That is,\nnested data structures are supported.\n\nOnce the input data have been specified, there is a rich set of\nrelational operators to transform them. The FOREACH...GENERATE operator,\ncorresponding to the map tasks of MapReduce, produces a new bag by\nprojection, applying functions, etc.\n\n    grunt\u003e flatten_persons = FOREACH persons GENERATE name, age, FLATTEN(address);\n\nwhere FLATTEN is a function to remove one level of nesting. 
With the\noperator DESCRIBE, we can see the schema difference between persons and\nflatten\\_persons:\n\n    grunt\u003e DESCRIBE persons;\n    persons: {name: chararray,age: int,address: (street: chararray,city: chararray,state: chararray,zip: int)}\n    grunt\u003e DESCRIBE flatten_persons;\n    flatten_persons: {name: chararray,age: int,address::street: chararray,address::city: chararray,address::state: chararray,address::zip: int}\n\nFrequently, we want to filter the data based on some condition.\n\n    grunt\u003e adults = FILTER flatten_persons BY age \u003e 18;\n\nAggregations can be done by the GROUP operator, which corresponds to the\nreduce tasks in MapReduce.\n\n    grunt\u003e grouped_by_state = GROUP flatten_persons BY state;\n    grunt\u003e DESCRIBE grouped_by_state;\n    grouped_by_state: {group: chararray,flatten_persons: {(name: chararray,age: int,address::street: chararray,address::city: chararray,address::state: chararray,address::zip: int)}}\n\nThe result of a GROUP operation is a relation that includes one tuple\nper group, with two fields.\n\nThe first field is named “group” and is the same type as the group key.\nThe second field takes the name of the original relation and is of type\nbag. We can also cogroup two or more relations.\n\n    grunt\u003e cogrouped_by_name = COGROUP persons BY name, flatten_persons BY name;\n    grunt\u003e DESCRIBE cogrouped_by_name;\n    cogrouped_by_name: {group: chararray,persons: {(name: chararray,age: int,address: (street: chararray,city: chararray,state: chararray,zip: int))},flatten_persons: {(name: chararray,age: int,address::street: chararray,address::city: chararray,address::state: chararray,address::zip: int)}}\n\nIn fact, the GROUP and COGROUP operators are identical. Both operators\nwork with one or more relations. 
For readability, GROUP is used in\nstatements involving one relation while COGROUP is used when involving\ntwo or more relations.\n\nA closely related but different operator is JOIN, which is syntactic\nsugar for COGROUP followed by FLATTEN.\n\n    grunt\u003e joined_by_name = JOIN persons BY name, flatten_persons BY name;\n    grunt\u003e DESCRIBE joined_by_name;\n    joined_by_name: {persons::name: chararray,persons::age: int,persons::address: (street: chararray,city: chararray,state: chararray,zip: int),flatten_persons::name: chararray,flatten_persons::age: int,flatten_persons::address::street: chararray,flatten_persons::address::city: chararray,flatten_persons::address::state: chararray,flatten_persons::address::zip: int}\n\nOverall, a Pig Latin program is like a handcrafted query execution plan.\nIn contrast, a SQL based solution, e.g. Hive, relies on an execution\nplanner to automatically translate SQL statements to an execution plan.\nLike SQL, Pig Latin has no control structures. But it is possible to\nembed Pig Latin statements and Pig commands in Python, JavaScript\nand Groovy scripts.\n\nWhen you run the above statements in the console of Pig, you will notice\nthat they finish instantaneously. This is because Pig is lazy and\nno real computation has happened. For example, LOAD does not really read\nthe data but just returns a handle to a bag/relation. Only when a STORE\ncommand is issued does Pig materialize the result of a Pig Latin expression\nsequence to the file system. Before a STORE command, Pig just builds a\nlogical plan for every user defined bag. At the point of a STORE\ncommand, the logical plan is compiled into a physical plan (a directed\nacyclic graph of MapReduce jobs) and is executed.\n\nIt is possible to replace MapReduce with other execution engines in Pig.\nFor example, there are efforts to run Pig on top of Spark. However, is\nit necessary? 
Spark already provides many relational operators and the\nhost language Scala is very nice for writing concise and expressive\nprograms.\n\nIn summary, Pig Latin is a simple and easy to use DSL that makes\nMapReduce programming a lot easier. Meanwhile, Pig keeps the flexibility\nof MapReduce to process schemaless data in plain files. There is no need\nto do slow and complex ETL tasks before analysis, which makes Pig a\ngreat tool for quick ad-hoc analytics such as web log analysis.\n\nHive\n----\n\nAlthough many statements in Pig Latin look just like SQL clauses, it is\na procedural programming language. In this section we will discuss\nApache Hive, which first brought SQL to Hadoop. Similar to Pig, Hive\ntranslates queries in its own dialect of SQL (HiveQL) to a directed acyclic\ngraph of MapReduce (or Tez since 0.13) jobs. However, the difference\nbetween Pig and Hive is not only procedural vs declarative. Pig is a\nrelatively thin layer on top of MapReduce for offline analytics. But\nHive is geared towards data warehousing. With the recent Stinger initiative,\nHive is closer to interactive analytics with a 100x performance improvement.\n\nPig uses a “schema on read” approach in which users define the (optional)\nschema on loading data. In contrast, Hive requires users to provide a\nschema, (optional) storage format and serializer/deserializer (called\nSerDe) when creating a table. This information is saved in the metadata\nrepository (by default an embedded Derby database) and will be used\nwhenever the table is referenced, e.g. to typecheck the expressions in\nthe query and to prune partitions based on query predicates. The\nmetadata store also provides data discovery (e.g. SHOW TABLES and\nDESCRIBE) that enables users to discover and explore relevant and\nspecific data in the warehouse. 
The following example shows how to\ncreate a database and a table.\n\n    CREATE DATABASE portal;\n    USE portal;\n    CREATE TABLE weblog (\n      host STRING,\n      identity STRING,\n      user STRING,\n      time STRING,\n      request STRING,\n      status STRING,\n      size STRING,\n      referer STRING,\n      agent STRING)\n    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'\n    WITH SERDEPROPERTIES (\n      \"input.regex\" = \"([^ ]*) ([^ ]*) ([^ ]*) (-|\\\\[[^\\\\]]*\\\\]) ([^ \\\"]*|\\\"[^\\\"]*\\\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \\\"]*|\\\"[^\\\"]*\\\") ([^ \\\"]*|\\\"[^\\\"]*\\\"))?\"\n    )\n    STORED AS TEXTFILE;\n\nThe interesting part of the example is the bottom five lines, which specify\na custom regular expression SerDe and the plain text file format. If ROW\nFORMAT is not specified or ROW FORMAT DELIMITED is specified, a native\nSerDe is used. Besides plain text files, many other file formats are\nsupported. Later we will discuss more details on ORC files, which\nimprove query performance significantly.\n\nDifferent from relational data warehouses, Hive supports nested data\nmodels with complex types array, map, and struct. For example, the\nfollowing statement creates a table with a complex schema.\n\n    CREATE TABLE complex_table(\n      id STRING,\n      value FLOAT,\n      list_of_maps ARRAY\u003cMAP\u003cSTRING, STRUCT\u003cx:INT, y:INT\u003e\u003e\u003e\n    );\n\nBy default, all the data files for a table are located in a single\ndirectory. Tables can be physically partitioned based on values of one\nor more columns with the PARTITIONED BY clause. A separate directory is\ncreated for each distinct value combination in the partition columns.\nPartitioning can greatly speed up queries that test those columns. Note\nthat the partitioning columns are not part of the table data and the\npartition column values are encoded in the directory path of that\npartition (and also stored in the metadata store). 
Moreover, tables or\npartitions can be bucketed using CLUSTERED BY columns, and data can be\nsorted within each bucket via SORTED BY columns.\n\nNow we can load some data into our table:\n\n    LOAD DATA LOCAL INPATH 'portal/logs' OVERWRITE INTO TABLE weblog;\n\nNote that Hive does not verify the data against the schema or\ntransform it while loading data into tables. The input files are\nsimply copied or moved into Hive’s file system namespace. If the\nkeyword LOCAL is specified, the input files are assumed to be in the local\nfile system, otherwise in HDFS. While not necessary in this example, the\nkeyword OVERWRITE signifies that existing data in the table is\noverwritten. If the OVERWRITE keyword is omitted, data files are\nappended to existing data sets.\n\nTables can also be created and populated by the results of a query in a\ncreate-table-as-select (CTAS) statement that includes two parts. The\nSELECT part can be any SELECT statement supported by HiveQL. The CREATE\npart of the CTAS takes the resulting schema from the SELECT part and\ncreates the target table with other table properties such as the SerDe\nand storage format.\n\n    CREATE TABLE orc_weblog\n      STORED AS ORC\n    AS\n    SELECT * FROM weblog;\n\nSimilarly, query results can be inserted into tables by the INSERT\nclause. INSERT OVERWRITE will overwrite any existing data in the table\nor partition while INSERT INTO will append to the table or partition.\nMultiple insert clauses can be specified in the same query, which\nminimizes the number of data scans required.\n\nHive does not support the OLTP-style INSERT INTO that inserts a new\nrecord. HiveQL does not have UPDATE and DELETE clauses either. This is\nactually a good design choice as these clauses are not necessary for\ndata warehouses. Without them, Hive can use very simple mechanisms to\ndeal with reader and writer concurrency.\n\nFor queries, HiveQL is pretty much like what you see in SQL. Besides\ncommon SQL features (e.g. 
JOIN, WHERE, HAVING, GROUP BY, SORT BY, ...),\nHiveQL also has extensions such as TABLESAMPLE, LATERAL VIEW, OVER,\netc. We will not dive into the syntax of query statements. Instead, we\nwill discuss the Stinger initiative, which improves query\nperformance significantly.\n\nA big contribution of the Stinger initiative is the Optimized Row\nColumnar (ORC) file. In the previous example, we used TEXTFILE, in which each\nline/row contains a record. In fact, most relational and document\ndatabases employ such a row-oriented storage format. However, a\ncolumn-oriented file format has advantages for data warehouses where\naggregates are computed over large numbers of data items. For example,\nonly the column values required by each query are scanned and transferred\nduring query execution. Besides, column data is of uniform type and thus may\nachieve better compression, especially if the cardinality of the column\nis low. Before ORC files, Hive already had a columnar file format,\nRCFile. However, RCFile is data-type-agnostic and its corresponding\nSerDe serializes a single row at a time. In ORC files, the SerDe is\nde-emphasized and the ORC file writer is data-type-aware. The ORC\nfile can therefore decompose a complex column into multiple child columns, and\nvarious type-specific data encoding schemes can be applied to primitive\ndata streams to store data efficiently. Besides, the ORC file also\nsupports indexes. These indexes are not B-trees but essentially data\nstatistics and position pointers. The data statistics are used in query\noptimization and to answer simple aggregation queries. They also\nhelp avoid unnecessary data reads. The position pointers are used\nto locate the index groups and stripes.\n\nThe Stinger initiative also put a lot of effort into improving query\nplanning and execution. For example, unnecessary Map-only jobs are\neliminated. In Hive, a Map-only job is generated when the query planner\nconverts a Reduce Join to a Map Join. 
Now, Hive tries to merge the\ngenerated Map-only job into its child job if the total size of the small\ntables used to build hash tables in the merged job is under a\nconfigurable threshold. Besides, a correlation optimizer was developed\nto avoid unnecessary data loading and repartitioning so that Hive loads\nthe common table only once instead of multiple times and the optimized\nplan has fewer shuffling phases.\n\nBesides MapReduce, Hive now embeds Apache Tez as an execution engine.\nCompared to MapReduce’s simple scatter/gather model, Tez offers a\ncustomizable execution architecture that models complex computations as\ndataflow graphs with dynamic performance optimizations. With Tez, Hive\ncan translate complex SQL statements into efficient physical plans. For\nexample, several reduce sinks can be linked directly in Tez and data can\nbe pipelined without the need for temporary HDFS files. This pattern is\nreferred to as MRR (Map - reduce - reduce\\*). Join is also much easier\nin Tez because a Tez task may take multiple bipartite edges as input,\nthus exposing the input relations directly to the join implementation.\nThe shuffle join task taking multiple feeds is called multi-parent\nshuffle join (MPJ). Both MRR and MPJ are employed in Hive to speed up a\nwide variety of queries.\n\nAnother potential benefit of Tez is to avoid unnecessary disk writes. In\nMapReduce, map outputs are partitioned, sorted and written to disk, then\npulled, merge-sorted and fed into the reducers. Tez allows small\ndatasets to be handled entirely in memory. This is attractive as many\nanalytic queries generate small intermediate datasets after the heavy\nlifting. Moreover, Tez allows complete control over the processing, e.g.\nstopping processing when limits are met. Unfortunately, these features\nare not currently used in Hive.\n\nThere is also work to employ Spark as the third execution engine in\nHive, called Hive on Spark. 
Hive on Spark is still at an early stage and it\nis not designed to replace Tez or MapReduce as each has different\nstrengths depending on the use case. Shark and Spark SQL are similar\nattempts. We will discuss them in detail later.\n\nFinally, let’s briefly talk about vectorized query execution.\nFirst, note that “vectorized” does not mean using vector computing\nfacilities such as SSE/AVX or CUDA. Instead, it aims to improve\nruntime execution efficiency by taking advantage of the characteristics\nof modern CPUs. The one-row-at-a-time model of MapReduce is not friendly\nto modern CPUs, which rely heavily on pipelines, superscalar (multiple\nissue) execution, and caches. In the vectorized execution model, data are processed\nin batches of rows through the operator tree, whose expressions w","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaifengl%2Fbigdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhaifengl%2Fbigdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaifengl%2Fbigdata/lists"}