An overview of Distributed Computing frameworks

Note: this post is the first in a series of articles presenting some of the key problems addressed by modern Distributed Computing abstractions.

Distributed Computing Frameworks

The goal of a Distributed Computing framework is to provide developers with an abstraction that lets them use a large amount of resources as a unified pool. The framework provides a protocol to distribute data and tasks over the nodes, and handles fault tolerance concerns such as relaunching a failed task (whether the failure originates in the task itself or in the node running it). The framework makes no assumption about the underlying topology and must cope with infrastructure evolution. Popular frameworks such as MapReduce and Spark also provide a programming model that allows developers to reason about a large dataset as a unified piece (location transparency) and to split it logically into many parts processed independently by the cluster's nodes. The dataset itself is often stored on a distributed file system, on the very same nodes where tasks are processed; developers can control its partitioning and define a computation pipeline to be applied to each part.

MPI

The Message Passing Interface (MPI) is a language-independent standard that allows the processes of a program to communicate with the other elements of the system. MPI provides a fundamental building block of distributed computing, namely a message passing protocol that lets processes access the distributed memory of a machine or of remote machines. However, it requires users to implement message passing themselves in their code, whereas higher-level frameworks such as Spark and Hadoop handle it transparently. The consequence is that users have to know much more about the underlying infrastructure, distracting them from their application logic. MPI is more often than not categorized as a High Performance Computing (HPC) framework rather than a Distributed Computing one; however, it shares many properties with the latter.
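
To give an idea of what this explicitness looks like in practice, here is a minimal sketch of point-to-point communication using mpi4py, one of the Python bindings for MPI (the data and file name are illustrative): the programmer decides which process sends what to whom, something Spark or Hadoop would handle transparently.

```python
# Minimal point-to-point message passing with mpi4py.
# Run with, e.g.: mpiexec -n 2 python ping.py
from mpi4py import MPI

comm = MPI.COMM_WORLD      # communicator containing all launched processes
rank = comm.Get_rank()     # this process's identifier within the communicator

if rank == 0:
    data = {"partition": 0, "payload": [1, 2, 3]}
    comm.send(data, dest=1, tag=0)       # explicit send to process 1
elif rank == 1:
    data = comm.recv(source=0, tag=0)    # explicit, blocking receive
    print(f"rank 1 received {data}")
```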

Hadoop

Apache Hadoop is the open source implementation of the MapReduce paper [1], first presented at OSDI in 2004. The project was started in 2006 by Doug Cutting (then at Yahoo!) and Mike Cafarella. The first official release of the Apache distribution shipped in 2011.
Hadoop is made of three main building blocks: HDFS, the distributed file system (an open source implementation of the Google File System [2]); the MapReduce API; and YARN [3], the resource manager and scheduler.
The MapReduce programming model comes from functional programming: a user applies (maps) a function to a set of independent data segments, then aggregates the intermediate results (reduce). A classic example is WordCount, where the user wants to know the number of occurrences of every word in a corpus. First the text is split into a number of partitions, and each partition is passed to a map function that computes a local word count. Then, once the map() phase is done, the intermediate results are grouped and merged in a reduce() phase, as sketched below.
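
To make the shape of the model concrete, here is a toy, single-machine sketch of that WordCount logic in Python. The corpus is made up and this is not the Hadoop API; it only shows the functional structure of the two phases.

```python
# Toy WordCount: a map phase producing per-partition counts,
# then a reduce phase merging the intermediate results.
from collections import Counter
from functools import reduce

corpus = ["the quick brown fox", "the lazy dog", "the fox"]

# map phase: each partition (here, a line) yields a local word count
mapped = [Counter(line.split()) for line in corpus]

# reduce phase: merge the intermediate per-partition counts
total = reduce(lambda a, b: a + b, mapped, Counter())

print(total)   # e.g. Counter({'the': 3, 'fox': 2, ...})
```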
Typically, Hadoop's implementation of MapReduce stores the data in HDFS, which ensures that a sufficient number of replicas of each chunk are available and spread throughout the cluster. YARN implements a technique called delay scheduling [4], which tries to maximize data locality to improve performance by reducing the cost of network I/O. We discuss data locality and delay scheduling further below.
HDFS follows a master/slave architecture, where the master keeps all the metadata about data placement and authorization, while the slaves store chunks of data and regularly report their status to the master. To ensure fault tolerance and availability, each chunk is replicated a configurable number of times (3 by default), and the replicas are placed in different physical locations to strengthen resilience to physical failures. To ensure consistency, each set of replicas has a primary, which validates (or denies) a sequence of mutations on a piece of data. Applying the same sequence of mutations on all replicas makes the data defined, which means that this version is acknowledged as being the definitive one.
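
The following toy sketch models the master's bookkeeping described above: each block gets a fixed number of replicas on distinct nodes, the first of which acts as the primary that orders mutations. The node names and block id are invented; this is an illustration of the idea, not HDFS code.

```python
# Toy model of a master tracking replica placement and a primary per block.
import random

REPLICATION = 3
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]

class Master:
    def __init__(self):
        self.placement = {}          # block id -> list of replica nodes

    def allocate(self, block_id):
        # pick distinct nodes so replicas sit in different physical locations
        replicas = random.sample(nodes, REPLICATION)
        self.placement[block_id] = replicas
        return replicas

    def primary(self, block_id):
        # the first replica acts as primary and orders mutations
        return self.placement[block_id][0]

master = Master()
print(master.allocate("blk_0001"), "primary:", master.primary("blk_0001"))
```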

Spark

Apache Spark [5] belongs to a new generation of Distributed Computing frameworks. A key motivation behind Spark was to improve the performance of iterative workloads by performing computations in memory. The original Hadoop implementation is quite inefficient for those workloads, mainly because of the numerous disk accesses required to carry an application to completion. Spark makes two main contributions to tackle this issue: first, it introduces a new data structure, the Resilient Distributed Dataset (RDD) [6], which plays a major role in enabling in-memory computing; second, it proposes a rich software architecture that eases the creation of Distributed Computing programs for a wide range of categories, such as graph processing, machine learning and streaming.
RDDs are immutable collections of data, which the user creates from an initial storage (HDFS, a remote object store, a local file, etc.) and from which new RDDs are derived through transformations. At some point, the user can retrieve the data generated by these transformations through collection methods (actions). RDDs allow the implementation of a refined version of the MapReduce model, enriching the variety of possible user-defined functions and letting users build a real computation pipeline on a dataset. RDDs are computed lazily: only when the user asks to collect the data are the transformations applied. Each RDD has a lineage, meaning that the sequence of operations applied to the initial slice of the dataset is known and recorded. Tasks in Spark are operations on RDDs: each task operates on a partition of an RDD, and tasks are either concurrent or dependent on one another. The nature of the operations defines two types of dependencies between RDDs: narrow and wide. A narrow dependency is typically the result of two consecutive map() calls on an RDD. A wide dependency often induces a shuffle, meaning that intermediate data must be moved through the cluster to be reduced. The sketch below illustrates these notions.
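
The following PySpark sketch illustrates those notions on a small, made-up dataset: flatMap and map create narrow dependencies, reduceByKey creates a wide one, and nothing is computed until the collect() action is called.

```python
# Word count as an RDD pipeline: narrow transformations, one wide
# dependency (shuffle), and a final action that triggers execution.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

lines  = sc.parallelize(["the quick brown fox", "the lazy dog", "the fox"])
words  = lines.flatMap(lambda line: line.split())   # narrow dependency
pairs  = words.map(lambda w: (w, 1))                # narrow dependency
counts = pairs.reduceByKey(lambda a, b: a + b)      # wide dependency (shuffle)

print(counts.collect())   # action: only now is the lineage executed
sc.stop()
```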
When an action is triggered, Spark analyzes the operations applied to the RDDs in the program and generates a Directed Acyclic Graph (DAG) of stages. It tries to pack as many narrow dependencies as possible into one stage, and typically places the two ends of a wide dependency into different stages. Tasks are preferably assigned to workers holding the required data, using the delay scheduling technique: the scheduler may decline a worker's offer if that worker does not hold the required data. Several locality levels are configured (process, node, rack, and any), allowing Spark to gradually increase a task's tolerance to non-locality; the snippet below shows how these waits can be tuned. The framework can carry a computation to its end even in the event of individual failures by replaying an RDD's lineage.
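
As an example of how that locality tolerance can be tuned, the sketch below sets Spark's spark.locality.wait* parameters, which control how long the scheduler waits at one locality level before falling back to the next (the application name and the wait values are illustrative, not recommendations).

```python
# Tuning locality waits: shorter waits trade locality for scheduling latency.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("locality-example")
        .set("spark.locality.wait", "3s")        # default wait per locality level
        .set("spark.locality.wait.node", "3s")   # wait for a node-local slot
        .set("spark.locality.wait.rack", "1s"))  # wait for a rack-local slot

sc = SparkContext(conf=conf)
```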
RDDs can be instantiated as specialized types depending on the kind of computation they store data for: for example, there exist graph RDDs, file RDDs, and columnar-store RDDs.

References

[1] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In ACM SIGOPS Operating Systems Review, volume 37, pages 29–43. ACM, 2003.
[3] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, et al. Apache Hadoop YARN: yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, page 5. ACM, 2013.
[4] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems, pages 265–278. ACM, 2010.
[5] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 10–10. USENIX Association, 2010.
[6] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.