An overview of Distributed Computing schedulers

Note: this post is the second of a set of articles aiming to present some key problems in modern Distributed Computing abstractions. Find the first post here.

Distributed Computing Schedulers

There are three main categories of distributed systems schedulers: monolithic, two-level, and shared-state schedulers. The first type has a global view of the resource pool, and must implement a variety of possibly conflicting allocation policies for every framework in the cluster. As the number of specialized frameworks and applications increases, such monolithism becomes hard to maintain and can turn into a bottleneck. Two-level schedulers, such as Mesos, allocate a pool of resources to a framework, which then implements its own specialized policies at will. Each framework has a partial, locked view of the global resource pool. Hadoop, for example, can use YARN with such a subset to manage concurrent MapReduce applications. Lastly, shared-state schedulers offer a global view of the resources to a set of specialized schedulers. They typically reach a higher level of consolidation and allow great flexibility in allocation policies, but must carefully manage changes in the global state to mitigate conflicting allocation decisions.

[Figure: scheduler architectures, from the Omega paper]
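
To make the contrast concrete, here is a minimal, purely illustrative sketch in Python of the interface each category exposes. The class and method names are hypothetical and do not come from any of the systems discussed below.

    # Illustrative only: hypothetical interfaces contrasting the three categories.

    class MonolithicScheduler:
        """A single scheduler sees the whole pool and implements every policy itself."""
        def schedule(self, job, global_cluster_state):
            raise NotImplementedError  # one code path must cover every framework's policy

    class TwoLevelResourceManager:
        """Hands a partial, locked share of the pool to a framework, which schedules within it."""
        def offer(self, framework, resource_share):
            return framework.accept_or_reject(resource_share)

    class SharedStateScheduler:
        """One of several schedulers, each seeing the whole cluster; commits are transactional."""
        def try_commit(self, allocation, shared_state):
            raise NotImplementedError  # may fail on conflict and have to retry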

Borg

More than a distributed computing scheduler, Borg [1] not only manages job and task allocation across a tremendous number of machines at Google, it also manages their life cycle by setting up operating systems and installing software components, ensures fault tolerance by replacing failing nodes, and provides a whole monitoring ecosystem. In addition, it provides simulators which allow new versions of the system, new scheduling policies, and so on, to be tested on a subset of the resources before an actual deployment in production. It also provides a specialized language for users to declare their workloads.

Borg operates on cells, which are clusters containing about 10k heterogeneous nodes, defined by the network fabric which interconnects them. Each cell hosts long-lived services and short-lived batch jobs. Borgmasters have total control over the cell, and each node runs a Borglet agent. Initially monolithic, Borgmasters can be replicated (in a Paxos-style quorum) to support specialized schedulers accessing a shared state of the cell (inspired by Omega, which we describe later and which is also a Google project).

Borg acts as the kernel of a cell, scheduling tasks on nodes from pending, running, and dead queues. It can prioritize tasks and preempt them to free up resources. It reduces the need for fairness policies by defining resource quotas for each user/application. Those users can define preferences (which we identify as constraint-driven resource requests in our terminology), which Borg takes into account. Users are still responsible for declaring the amount of resources they desire for an application, and have strong incentives to be precise about the amount of resources they request: Borg's managed ecosystem is priced, and users purchase resource quotas.
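
Borg's scheduler is not published as code, so the following is only a minimal sketch, under our own assumptions, of how priority-based placement with preemption against a node's capacity could look. All names and data structures are hypothetical; quota accounting is assumed to happen before a task reaches this point.

    # Hypothetical sketch of priority-driven placement with preemption; not Borg's code.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Task:
        name: str
        priority: int      # higher priority may preempt lower priority
        cpus: float
        mem_gb: float

    @dataclass
    class Node:
        cpus: float
        mem_gb: float
        running: List[Task] = field(default_factory=list)

        def free(self):
            used_c = sum(t.cpus for t in self.running)
            used_m = sum(t.mem_gb for t in self.running)
            return self.cpus - used_c, self.mem_gb - used_m

    def try_place(task: Task, node: Node, pending: List[Task]) -> bool:
        """Place `task` on `node`, preempting strictly lower-priority tasks if necessary."""
        free_c, free_m = node.free()
        victims = []
        # Consider evicting lower-priority tasks first (lowest priority is the cheapest victim).
        for candidate in sorted(node.running, key=lambda t: t.priority):
            if free_c >= task.cpus and free_m >= task.mem_gb:
                break
            if candidate.priority >= task.priority:
                return False           # cannot free enough without violating priorities
            victims.append(candidate)
            free_c += candidate.cpus
            free_m += candidate.mem_gb
        if free_c < task.cpus or free_m < task.mem_gb:
            return False               # even evicting every lower-priority task is not enough
        for v in victims:              # preempted tasks go back to the pending queue
            node.running.remove(v)
            pending.append(v)
        node.running.append(task)
        return True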

Also, as Borg collects a tremendous amount of monitoring data and logs, it is able to perform a lot of internal optimizations, which are prioritized over individual users' preferences (for the greater good). In addition, it can empirically determine applications' preferred settings for some users, though the methods used to make such predictions are not disclosed in the paper.

Borg is similar in essence to the cluster managers and schedulers used by other internet giants of comparable scale: Microsoft's Autopilot [2], Facebook's Tupperware [3], and Twitter's Aurora [4].

Mesos

Mesos is a two-level, modular scheduler for distributed computing frameworks. The modular design allows it to implement different resource allocation policies. Frameworks implement their own schedulers, which then take responsibility for resource management of their allocated pool.

Mesos aims to be scalable and fault tolerant. It minimizes its role in the cluster by simply offering resources from the global pool to frameworks, which can in turn accept or reject those offers. Mesos allocation modules can implement weighted fairness, prioritization, task preemption, and so on. It provides isolation to co-existing frameworks by running containers [5] on workers, where tasks are executed. Mesos' master is replicated through ZooKeeper [6].
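
The offer cycle can be pictured with a small sketch. This is not the real Mesos API (which frameworks reach through a scheduler driver); the names and the dictionary-based offers are hypothetical and only illustrate the accept/reject protocol.

    # Illustrative sketch of the two-level offer cycle; not the actual Mesos API.

    class Framework:
        """A framework scheduler decides which offers to accept and what to launch on them."""
        def __init__(self, name, cpus_needed, mem_needed):
            self.name = name
            self.cpus_needed = cpus_needed
            self.mem_needed = mem_needed

        def resource_offer(self, offer):
            """Return the tasks to launch against this offer, or [] to decline it."""
            if offer["cpus"] >= self.cpus_needed and offer["mem"] >= self.mem_needed:
                return [{"framework": self.name, "cpus": self.cpus_needed, "mem": self.mem_needed}]
            return []

    class Master:
        """The master only offers resources; the frameworks do the actual scheduling."""
        def __init__(self, agents, frameworks):
            self.agents = agents              # e.g. [{"id": "a1", "cpus": 8, "mem": 32}, ...]
            self.frameworks = frameworks      # the allocation module decides the offer order

        def offer_cycle(self):
            launched = []
            for agent in self.agents:
                for fw in self.frameworks:
                    offer = {"agent": agent["id"], "cpus": agent["cpus"], "mem": agent["mem"]}
                    tasks = fw.resource_offer(offer)
                    if tasks:
                        for t in tasks:       # account accepted resources against the agent
                            agent["cpus"] -= t["cpus"]
                            agent["mem"] -= t["mem"]
                        launched.extend(tasks)
                        break                 # this agent's offer has been consumed
            return launched

    # A single cycle: one 8-CPU/32 GB agent is offered to a framework needing 4 CPUs and 16 GB.
    print(Master([{"id": "a1", "cpus": 8, "mem": 32}],
                 [Framework("spark", 4, 16)]).offer_cycle())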

The philosophy behind Mesos is, as advocated by Omega and later implemented in Borg, that a monolithic scheduler responsible for providing every possible allocation policy is neither maintainable nor scalable. Pushing those features to the frameworks, with a set of ``efficiency'' incentives, such as a revocation time on workers, and accounting each framework's allocation time as occupancy, makes it in practice much easier to consolidate the cluster with multiple, different frameworks. In a scenario where a large number of users want to use the cluster (and where its operators want to maximize utilization and efficiency) and run dramatically different frameworks, or different versions of the same framework, such a design makes a lot of sense. There is a small risk that a specific mix of frameworks leads to perpetual resource contention, and to an overall lower cluster utilization than with statically partitioned shares, but in such cases it might suffice to either statically allocate a partition of the cluster to one of the contending frameworks, or to switch to a shared-state scheduler. Mesos' paper also mentions that such a conflicting combination can lead to resource fragmentation, which results in sub-optimal usage of the cluster.

We note that Apache Spark was initially designed as a specialized scheduler and framework running on Mesos, to showcase Mesos' efficiency.

Omega

Omega is a Google project aiming to tackle the challenges raised by Borg's monolithism, which was a bottleneck that made it difficult to scale and hard to update with new scheduling policies.

Omega also steps away from the two-level resource pools advocated by schedulers such as Mesos, where each framework is given a share of the cluster to use at will. Instead, it allows each framework's scheduler to access the global state of the cluster, and to run its allocation algorithm based on a user-defined resource request and the current state of the cluster. In addition, it is possible to let the framework opportunistically claim more than the initial resource request. When a framework scheduler tries to update the cluster state, potential conflicts have to be resolved, as the view of the cluster obtained by the framework when it started the allocation might have changed by the time the algorithm completes. A framework updates the global state in an atomic transaction. If the transaction fails, the framework has to run its scheduling algorithm again, with an updated view of the cluster returned by the transaction.
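
Omega's code is not public, so the following is only a sketch, under our own assumptions, of this optimistic, transactional loop. The class names, the coarse whole-cell version check, and the retry bound are all hypothetical.

    # Hypothetical sketch of shared-state scheduling with optimistic, atomic commits.

    import copy

    class ConflictError(Exception):
        """Raised when the cell state changed since the scheduler took its snapshot."""

    class CellState:
        def __init__(self):
            self.version = 0
            self.assignments = {}           # node -> task

        def snapshot(self):
            return self.version, copy.deepcopy(self.assignments)

        def commit(self, base_version, updates):
            """Atomically apply `updates` unless a conflicting change happened in between."""
            if base_version != self.version or any(n in self.assignments for n in updates):
                raise ConflictError
            self.assignments.update(updates)
            self.version += 1

    def schedule(framework_scheduler, cell, request, max_retries=10):
        """Run a framework-specific algorithm against a snapshot and retry on conflict."""
        for _ in range(max_retries):
            version, view = cell.snapshot()
            placement = framework_scheduler(request, view)   # {node: task}, framework policy
            try:
                cell.commit(version, placement)
                return placement
            except ConflictError:
                continue                    # re-run with a fresh view of the cell
        raise RuntimeError("could not place the request after repeated conflicts")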

That mode of operation fits very well in environments where the nature of workloads is heterogeneous, with, for example, long-lived services coexisting with short-lived batch jobs. It also allows schedulers to implement independent, specialized policies. We note that in contexts where only a handful of frameworks share a cluster, two-level scheduling might be easier to manage out of the box, and that it provides a degree of isolation between jobs that a shared-state scheduler does not.

YARN

YARN is Hadoop's scheduler. It was conceived to replace a monolithic implementation of Hadoop, where the programming model and the scheduler formed a single block. Its goal is to maximize utilization, while providing reliability and high availability to jobs. Another prized feature of YARN is its support for multiple programming models; frameworks such as Spark can use it to manage resource allocation. YARN is built of three main components. The Resource Manager (RM), ubiquitous overseer of a global resource pool, distributes resources to users in the form of leases. A lease on a node allows the user to spawn a process, named a container, which can execute tasks. A container has access to a limited amount of memory and cores (CPU time) on the worker node. Each node runs a Node Manager (NM), which reports its state to the RM; through its interactions with each node's NM, the RM gathers a global view of the resource pool. Once an application is accepted and can run on the cluster, an Application Master (AM) is started on one of the allocated nodes. The AM is responsible for coordinating the application's tasks. For example, in the case of Spark, the AM runs the application's driver (in ``cluster'' mode).

In YARN, users can express a ResourceRequest to the RM, through the AM, which specifies a number of containers, their size (memory and cores), and locality preferences. In the paper's words, users are expected to ``express their needs clearly''.
YARN implements a fair scheduler (HFS [7]), which maximizes the share of memory applications can get out of the cluster. Resources can be arranged in pools, which support a hierarchical relationship. HFS can be extended by a technique named Dominant Resource Fairness [8], which maximizes the minimum dominant share across applications (the dominant share being the largest fraction of any single resource allocated to an application), guarantees desirable properties such as envy-freeness and strategy-proofness, and gives users incentives to share. DRF processes users' ResourceRequests to determine their shares.
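
A simplified sketch of DRF helps fix the intuition. The code below is our own condensed reading of the algorithm in [8], not YARN's implementation; the per-task demand vectors stand in for the ResourceRequests mentioned above, and the capacity and demand values are the illustrative example used in the DRF paper.

    # Simplified sketch of Dominant Resource Fairness [8]; not YARN's actual code.

    def drf(capacity, demands, max_tasks=1000):
        """capacity: total resources; demands: per-task demand vector for each user."""
        allocation = {u: {r: 0.0 for r in capacity} for u in demands}
        used = {r: 0.0 for r in capacity}

        def dominant_share(user):
            # Largest fraction of any single resource currently allocated to the user.
            return max(allocation[user][r] / capacity[r] for r in capacity)

        def fits(demand):
            return all(used[r] + demand[r] <= capacity[r] for r in capacity)

        for _ in range(max_tasks):
            # Among users whose next task still fits, pick the lowest dominant share.
            candidates = [u for u in demands if fits(demands[u])]
            if not candidates:
                break
            user = min(candidates, key=dominant_share)
            for r in capacity:
                allocation[user][r] += demands[user][r]
                used[r] += demands[user][r]
        return allocation

    # The DRF paper's example: 9 CPUs and 18 GB; user A asks <1 CPU, 4 GB> per task,
    # user B asks <3 CPUs, 1 GB> per task. A ends up with 3 tasks and B with 2 tasks,
    # both reaching a dominant share of 2/3.
    print(drf({"cpus": 9, "mem": 18},
              {"A": {"cpus": 1, "mem": 4}, "B": {"cpus": 3, "mem": 1}}))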

Octopus

Octopus [9] is a specialized scheduler for GraphLab [10], a distributed computing framework for graph analytics. It tackles the particular challenges of concurrently scheduling multiple graph processing applications from different users. It does so by applying different placement policies, such as First Fit and FIFO with Round Robin Filling, as well as a fair scheduler. Octopus does not preempt running tasks, which might lead to poor performance in cases where, for example, a job has to wait for particularly long ones to finish.
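
As an illustration of the simplest of these policies, here is a minimal First Fit placement sketch. It is our own toy example, not the authors' code, and the node and demand values are made up.

    # Minimal, illustrative First Fit placement over a list of nodes.

    def first_fit(job_demand, nodes):
        """Place `job_demand` (resource -> amount) on the first node where it fits."""
        for node in nodes:
            if all(node["free"][r] >= amount for r, amount in job_demand.items()):
                for r, amount in job_demand.items():
                    node["free"][r] -= amount
                return node["id"]
        return None  # no node can host the job yet; it waits, since tasks are never preempted

    nodes = [{"id": "n1", "free": {"cpus": 4, "mem_gb": 8}},
             {"id": "n2", "free": {"cpus": 16, "mem_gb": 64}}]
    print(first_fit({"cpus": 8, "mem_gb": 16}, nodes))   # -> n2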

Very relevant to the overall argument of this set of blog posts are some highlights offered by Octopus' paper:

- A demonstration of an ``elbow point'' where throwing more resources at the system results in reduced performance,

- A mention of the importance and difficulty of sizing executor processes. We quote the paper: "Also, in case of GraphX, it is non-trivial to know the optimal executor memory size and number of cores desirable by a job.",

- A mention that while it is possible to predict and recommend such configuration parameters, as in YARN, Borg, and other schedulers, users are still responsible for declaring their resource vectors. Again, we quote the paper: "Machine learning algorithms can be used to predict the optimal number of nodes required for a job. But currently our scheduler takes this as an input from the user."

Reflections

Across the various schedulers we have presented, a common factor is that they all push onto the user the responsibility of knowing how many resources their application needs, as well as runtime configurations such as executor size. It is indeed arguable that schedulers themselves should not implement the logic of determining such vectors for users, as it would result in an increase in complexity equivalent to the one perceived in monolithic schedulers when trying to fit every framework's specialized allocation policies. However, there is clearly a gap to fill between schedulers and users, as the number of factors impacting such decisions is extremely difficult to reason about. Some studies reveal how inaccurate users' estimations can be when requesting resources [11], which supports our motivations.

References

[1] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at google with borg. In Proceedings of the Tenth European Conference on Computer Systems, page 18. ACM, 2015.
[2] M. Isard. Autopilot: automatic data center management. ACM SIGOPS Operating Systems Review, 41(2):60–67, 2007.
[3] http://codechannels.com/video/docker/docker/tupperware-containerized-dep..., Last visited in March 2016.
[4] http://aurora.apache.org/, Last visited in March 2016.
[5] https://www.docker.com/, Last visited in March 2016.
[6] https://zookeeper.apache.org/, Last visited in March 2016.
[7] http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/FairSc..., Last visited in January 2017.
[8] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI, volume 11, pages 24–24, 2011.
[9] S. Padala, D. Kumar, A. Raj, and J. Dharanipragada. Octopus: A multi-job scheduler for graphlab. In Big Data (Big Data), 2015 IEEE International Conference on, pages 293–298. IEEE, 2015.
[10] Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041, 2014.
[11] A. W. M. Alem and D. G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the ibm sp2 with backfilling. Parallel and Distributed Systems, IEEE Transactions on, 12(6):529–543, 2001.