{"id":125,"date":"2016-04-25T09:01:39","date_gmt":"2016-04-25T01:01:39","guid":{"rendered":"https:\/\/wangqf.com\/?p=125"},"modified":"2016-04-25T09:01:39","modified_gmt":"2016-04-25T01:01:39","slug":"the-evolution-of-cluster-scheduler-architectures","status":"publish","type":"post","link":"https:\/\/wangqf.com\/?p=125","title":{"rendered":"The evolution of cluster scheduler architectures"},"content":{"rendered":"<p><em>Cluster schedulers are an important component of modern infrastructure, and have evolved significantly in the last few years. Their architecture has moved from monolithic designs to much more flexible, disaggregated and distributed designs. However, many current open-source offerings are either still monolithic, or otherwise lack key features. These features matter to real-world users, as they are required to achieve good utilization.<\/em><\/p>\n<p>This post is our first in a series of posts about <em>task scheduling on large clusters<\/em>, such as those operated by internet companies like Amazon, Google, Facebook, Microsoft, or Yahoo!, but increasingly elsewhere too. Scheduling is an important topic because it directly affects the cost of operating a cluster: a poor scheduler results in low <em>utilization<\/em>, which costs money as expensive machines are left idle. High utilization, however, is not sufficient on its own: antagonistic workloads interfere with other workloads unless the decisions are made carefully.<\/p>\n<h3>Architectural evolution<\/h3>\n<p>This post discusses how scheduler architectures have evolved over the last few years, and why this happened. 
Figure 1 visualises the different approaches: a gray square corresponds to a machine, a coloured circle to a task, and a rounded rectangle with an &#8220;S&#8221; inside corresponds to a scheduler.<sup><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn0\" name=\"fn0l\">0<\/a><\/sup> Arrows indicate placement decisions made by schedulers, and the three colours correspond to different workloads (e.g., web serving, batch analytics, and machine learning).<\/p>\n<div>\n<table>\n<tbody>\n<tr>\n<td><a name=\"fig1a\"><\/a><img decoding=\"async\" src=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/images\/scheduler-arch-monolithic.png\" \/><\/td>\n<td><a name=\"fig1b\"><\/a><img decoding=\"async\" src=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/images\/scheduler-arch-twolevel.png\" \/><\/td>\n<td><a name=\"fig1c\"><\/a><img decoding=\"async\" src=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/images\/scheduler-arch-sharedstate.png\" \/><\/td>\n<td><a name=\"fig1d\"><\/a><img decoding=\"async\" src=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/images\/scheduler-arch-distributed.png\" \/><\/td>\n<td><a name=\"fig1e\"><\/a><img decoding=\"async\" src=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/images\/scheduler-arch-hybrid.png\" \/><\/td>\n<\/tr>\n<tr>\n<td><b>(a)<\/b> Monolithic scheduler.<\/td>\n<td><b>(b)<\/b> Two-level scheduling.<\/td>\n<td><b>(c)<\/b> Shared-state scheduling.<\/td>\n<td><b>(d)<\/b> Distributed scheduling.<\/td>\n<td><b>(e)<\/b> Hybrid scheduling.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><b>Figure 1:<\/b> Different cluster scheduler architectures. 
Gray boxes represent cluster machines, circles correspond to tasks and <em>S<sub>i<\/sub><\/em> denotes scheduler <em>i<\/em>.<\/p>\n<\/div>\n<p>Many cluster schedulers \u2013 such as most high-performance computing (HPC) schedulers, the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2741964\">Borg scheduler<\/a>, various early Hadoop schedulers and the Kubernetes scheduler \u2013 are <b>monolithic<\/b>. A single scheduler process runs on one machine (e.g., the <code>JobTracker<\/code> in Hadoop v1, and <code>kube-scheduler<\/code> in Kubernetes) and assigns tasks to machines. All workloads are handled by the same scheduler, and all tasks run through the same scheduling logic (<a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fig1a\">Figure 1a<\/a>). This is simple and uniform, and has led to increasingly sophisticated schedulers being developed. As an example, see the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2451125\">Paragon<\/a> and <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2541941\">Quasar<\/a> schedulers, which use a machine learning approach to avoid negative interference between workloads competing for resources.<\/p>\n<p>Most clusters run different types of applications today (as opposed to, say, just Hadoop MapReduce jobs in the early days). 
However, maintaining a single scheduler implementation that handles mixed (heterogeneous) workloads can be tricky, for several reasons:<\/p>\n<ol>\n<li>It is quite reasonable to expect a scheduler to treat long-running service jobs and batch analytics jobs differently.<\/li>\n<li>Since different applications have different needs, supporting them all keeps adding features to the scheduler, increasing the complexity of its logic and implementation.<\/li>\n<li>The order in which the scheduler processes tasks becomes an issue: queueing effects (e.g., head-of-line blocking) and backlog can build up unless the scheduler is carefully designed.<\/li>\n<\/ol>\n<p>Overall, this sounds like the makings of an engineering nightmare \u2013 and the never-ending lists of feature requests that scheduler maintainers receive attest to this.<sup><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn1\" name=\"fn1l\">1<\/a><\/sup><\/p>\n<p><b>Two-level scheduling architectures<\/b> address this problem by separating the concerns of <em>resource allocation<\/em> and <em>task placement<\/em>. This allows the task placement logic to be tailored towards specific applications, but also maintains the ability to share the cluster between them. The <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/static.usenix.org\/events\/nsdi11\/tech\/full_papers\/Hindman_new.pdf\">Mesos<\/a> cluster manager pioneered this approach, and <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2523633\">YARN<\/a> supports a limited version of it. 
In Mesos, resources are <em>offered<\/em> to application-level schedulers (which may pick and choose from them), while YARN allows the application-level schedulers to <em>request<\/em> resources (and receive allocations in return).<sup><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn2\" name=\"fn2l\">2<\/a><\/sup> <a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fig1b\">Figure 1b<\/a> shows the general idea: workload-specific schedulers (S<sub>0<\/sub>\u2013S<sub>2<\/sub>) interact with a resource manager that carves out dynamic partitions of the cluster resources for each workload. This is a very flexible approach that allows for custom, workload-specific scheduling policies.<\/p>\n<p>Yet, the separation of concerns in two-level architectures comes with a drawback: the application-level schedulers lose <em>omniscience<\/em>, i.e., they cannot see <em>all<\/em> the possible placement options any more.<sup><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn3\" name=\"fn3l\">3<\/a><\/sup> Instead, they merely see those options that correspond to resources offered (Mesos) or allocated (YARN) by the resource manager component. 
This has several disadvantages:<\/p>\n<ol>\n<li><em>Priority preemption<\/em> (higher priority tasks kick out lower priority ones) becomes difficult to implement: in an offer-based model, the resources occupied by running tasks aren&#8217;t visible to the upper-level schedulers; in a request-based model, the lower-level resource manager must understand the preemption policy (which may be application-dependent).<\/li>\n<li>Schedulers are unable to consider <em>interference from running workloads<\/em> that may degrade resource quality (e.g., &#8220;noisy neighbours&#8221; that saturate I\/O bandwidth), since they cannot see them.<\/li>\n<li>Application-specific schedulers care about many different aspects of the underlying resources, but their only means of choosing resources is the offer\/request interface with the resource manager. This interface can easily become quite complex.<\/li>\n<\/ol>\n<p><b>Shared-state architectures<\/b> address this by moving to a semi-distributed model,<sup><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn4\" name=\"fn4l\">4<\/a><\/sup> in which multiple replicas of cluster state are independently updated by application-level schedulers, as shown in <a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fig1c\">Figure 1c<\/a>. After the change is applied locally, the scheduler issues an optimistically concurrent transaction to update the shared cluster state. 
This transaction may fail, of course: another scheduler may have made a conflicting change in the meantime.<\/p>\n<p>The most prominent examples of shared-state designs are <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2465386\">Omega<\/a> at Google, and <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/www.usenix.org\/conference\/osdi14\/technical-sessions\/presentation\/boutin\">Apollo<\/a> at Microsoft, as well as the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/www.nomadproject.io\/docs\/internals\/scheduling.html\">Nomad<\/a> container scheduler by Hashicorp. All of these materialise the <em>shared cluster state<\/em> in a single location: the &#8220;cell state&#8221; in Omega, the &#8220;resource monitor&#8221; in Apollo, and the &#8220;plan queue&#8221; in Nomad.<sup><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn5\" name=\"fn5l\">5<\/a><\/sup> Apollo differs from the other two as its shared-state is read-only, and the scheduling transactions are submitted directly to the cluster machines. The machines themselves check for conflicts and accept or reject the changes. This allows Apollo to make progress even if the shared-state is temporarily unavailable.<sup><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn6\" name=\"fn6l\">6<\/a><\/sup><\/p>\n<p>A &#8220;logical&#8221; shared-state design can also be achieved without materialising the full cluster state anywhere. In this approach (somewhat similar to what Apollo does), each machine maintains its own state and sends updates to different interested agents such as schedulers, machine health monitors, and resource monitoring systems. 
Each machine&#8217;s local view of its state now forms a &#8220;shard&#8221; of the global shared-state.<\/p>\n<p>However, shared-state architectures have some drawbacks, too: they must work with stale information (unlike a centralized scheduler), and may experience degraded scheduler performance under high contention (although this can apply to other architectures as well).<\/p>\n<p><b>Fully-distributed architectures<\/b> take the disaggregation even further: they have no coordination between schedulers at all, and use many independent schedulers to service the incoming workload, as shown in <a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fig1d\">Figure 1d<\/a>. Each of these schedulers works purely with its local, partial, and often out-of-date view of the cluster. Jobs can typically be submitted to any scheduler, and each scheduler may place tasks anywhere in the cluster. Unlike with two-level schedulers, there are no partitions that each scheduler is responsible for. Instead, the overall schedule and resource partitioning are emergent consequences of statistical multiplexing and randomness in workload and scheduler decisions \u2013 similar to shared-state schedulers, albeit without any central control at all.<\/p>\n<p>The recent distributed scheduler movement probably started with the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2522716\">Sparrow<\/a> paper, although the underlying concept (power of multiple random choices) <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/www.eecs.harvard.edu\/~michaelm\/postscripts\/mythesis.pdf\">first appeared in 1996<\/a>. 
The key premise of Sparrow is a hypothesis that the tasks we run on clusters are becoming ever shorter in duration, supported by <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2490497\">an argument<\/a> that fine-grained tasks have many benefits. Consequently, the authors assume that tasks are becoming more numerous, meaning that a higher decision throughput must be supported by the scheduler. Since a single scheduler may not be able to keep up with this throughput (assumed to be a million tasks per second!), Sparrow spreads the load across many schedulers.<\/p>\n<p>This makes perfect sense: the lack of central control can be conceptually appealing, and it suits some workloads very well \u2013 more on this in a future post. For the moment, it suffices to note that since the distributed schedulers are uncoordinated, they apply significantly simpler logic than advanced monolithic, two-level, or shared-state schedulers. For example:<\/p>\n<ol>\n<li>Distributed schedulers are typically based on a simple &#8220;slot&#8221; concept that chops each machine into <em>n<\/em> uniform slots, and places up to <em>n<\/em> parallel tasks. This simplifies matters, but ignores the fact that tasks&#8217; resource requirements are not uniform.<\/li>\n<li>They also use worker-side queues with simple service disciplines (e.g., FIFO in Sparrow), which restricts scheduling flexibility, as the scheduler can merely choose at which machine to enqueue a task.<\/li>\n<li>Distributed schedulers have difficulty enforcing global invariants (e.g., fairness policies or strict priority precedence), since there is no central control.<\/li>\n<li>Since they are designed for rapid decisions based on minimal knowledge, distributed schedulers cannot support or afford complex or application-specific scheduling policies. 
Avoiding interference between tasks, for example, becomes tricky.<\/li>\n<\/ol>\n<p><b>Hybrid architectures<\/b> are a recent (mostly academic) invention that seeks to address these drawbacks of fully distributed architectures by combining them with monolithic or shared-state designs. The way this typically works \u2013 e.g., in <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2806779\">Tarcil<\/a>, <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/www.usenix.org\/conference\/atc15\/technical-session\/presentation\/karanasos\">Mercury<\/a>, and <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/www.usenix.org\/conference\/atc15\/technical-session\/presentation\/delgado\">Hawk<\/a> \u2013 is that there really are two scheduling paths: a distributed one for part of the workload (e.g., very short tasks, or low-priority batch workloads), and a centralized one for the rest. <a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fig1e\">Figure 1e<\/a> illustrates this design. The behaviour of each constituent part of a hybrid scheduler is identical to the part&#8217;s architecture described above. In practice, however, no hybrid scheduler has yet been deployed in a production setting, as far as I know.<\/p>\n<h3>What does this mean in practice?<\/h3>\n<p>Discussion about the relative merits of different scheduler architectures is not merely an academic topic, although it naturally revolves around research papers. For an extensive discussion of the Borg, Mesos and Omega papers from an industry perspective, for example, see <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/umbrant.com\/blog\/2015\/mesos_omega_borg_survey.html\">Andrew Wang&#8217;s excellent blog post<\/a>. 
Moreover, many of the systems discussed are deployed in production settings at large enterprises (e.g., Apollo at Microsoft, Borg at Google, and Mesos at Apple), and they have in turn inspired other systems that are available as open source projects.<\/p>\n<p>These days, many clusters run containerised workloads, and consequently a variety of container-focused &#8220;orchestration frameworks&#8221; have appeared. These are similar to what Google and others call &#8220;cluster managers&#8221;. However, there are few detailed discussions of the schedulers within these frameworks and their design principles, and they typically focus more on the user-facing scheduler APIs (e.g., <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/armand.gr\/static\/files\/htise.pdf\">this report by Armand Grillet<\/a>, which compares Docker Swarm, Mesos\/Marathon, and the Kubernetes default scheduler). Moreover, many users neither know what difference the scheduler architecture makes, nor which one is most suitable for their applications.<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fig2\">Figure 2<\/a> shows an overview of a selection of open-source orchestration frameworks, their architecture and the features supported by their schedulers. At the bottom of the table, we also include closed-source systems at Google and Microsoft for reference. 
The resource granularity column indicates whether the scheduler assigns tasks to fixed-size slots, or whether it allocates resources in multiple dimensions (e.g., CPU, memory, disk I\/O bandwidth, network bandwidth, etc.).<\/p>\n<div>\n<table border=\"1\">\n<tbody>\n<tr>\n<th><\/th>\n<th>Framework<\/th>\n<th>Architecture<\/th>\n<th>Resource granularity<\/th>\n<th>Multi-scheduler<\/th>\n<th>Pluggable logic<\/th>\n<th>Priority preemption<\/th>\n<th>Re-scheduling<\/th>\n<th>Over-subscription<\/th>\n<th>Resource estimation<\/th>\n<th>Avoid interference<\/th>\n<\/tr>\n<tr>\n<td rowspan=\"6\">O<br \/>\nP<br \/>\nE<br \/>\nN<\/td>\n<td>Kubernetes<\/td>\n<td>monolithic<\/td>\n<td>multi-dimensional<\/td>\n<td>N<sup>[v1.2,<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/blob\/master\/docs\/proposals\/multiple-schedulers.md\">DD<\/a>,<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/issues\/11793\">Issue<\/a>]<\/sup><\/td>\n<td>Y<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/blob\/master\/docs\/design\/scheduler_extender.md\">DD<\/a>]<\/sup><\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/issues\/22217\">Issue<\/a>]<\/sup><\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/pull\/22217\">Issue<\/a>]<\/sup><\/td>\n<td>Y<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" 
href=\"http:\/\/kubernetes.io\/v1.1\/docs\/proposals\/resource-qos.html\">DD<\/a>]<\/sup><\/td>\n<td>N<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td>Swarm<\/td>\n<td>monolithic<\/td>\n<td>multi-dimensional<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/docker\/swarm\/issues\/1488\">Issue<\/a>]<\/sup><\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td>YARN<\/td>\n<td>monolithic\/<br \/>\ntwo-level<\/td>\n<td>RAM\/CPU slots<\/td>\n<td>Y<\/td>\n<td>N<sup>[<em>app-lvl. only<\/em>]<\/sup><\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/issues.apache.org\/jira\/browse\/YARN-2009\">JIRA<\/a>]<\/sup><\/td>\n<td>N<\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/issues.apache.org\/jira\/browse\/YARN-1011\">JIRA<\/a>]<\/sup><\/td>\n<td>N<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td>Mesos<\/td>\n<td>two-level<\/td>\n<td>multi-dimensional<\/td>\n<td>Y<\/td>\n<td>Y<sup>[<em>framework-lvl.<\/em>]<\/sup><\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/issues.apache.org\/jira\/browse\/MESOS-155\">JIRA<\/a>]<\/sup><\/td>\n<td>N<\/td>\n<td>Y<sup>[v0.23,<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/mesos.apache.org\/documentation\/latest\/oversubscription\/\">Doc<\/a>]<\/sup><\/td>\n<td>N<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td>Nomad<\/td>\n<td>shared-state<\/td>\n<td>multi-dimensional<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/hashicorp\/nomad\/issues\/294\">Issue<\/a>]<\/sup><\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" 
href=\"https:\/\/github.com\/hashicorp\/nomad\/issues\/876\">Issue<\/a>]<\/sup><\/td>\n<td>N<sup>[<a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/hashicorp\/nomad\/issues\/606\">Issue<\/a>]<\/sup><\/td>\n<td>N<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td>Sparrow<\/td>\n<td>fully-distributed<\/td>\n<td>fixed slots<\/td>\n<td>Y<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td rowspan=\"3\">C<br \/>\nL<br \/>\nO<br \/>\nS<br \/>\nE<br \/>\nD<\/td>\n<td>Borg<\/td>\n<td>monolithic<sup>[<a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn7\" name=\"fn7l\">7<\/a>]<\/sup><\/td>\n<td>multi-dimensional<\/td>\n<td>N<sup>[<a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn7\" name=\"fn7l\">7<\/a>]<\/sup><\/td>\n<td>N<sup>[<a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn7\" name=\"fn7l\">7<\/a>]<\/sup><\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td>Omega<\/td>\n<td>shared-state<\/td>\n<td>multi-dimensional<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>N<\/td>\n<\/tr>\n<tr>\n<td>Apollo<\/td>\n<td>shared-state<\/td>\n<td>multi-dimensional<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>Y<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<td>N<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><b>Figure 2:<\/b> Architectural classification and feature matrix of widely-used orchestration frameworks, compared to closed-source systems.<\/p>\n<\/div>\n<p>One key aspect that helps determine an appropriate scheduler architecture is whether your cluster runs a <em>heterogeneous<\/em> (i.e., mixed) workload. 
This is the case, for example, when combining production front-end services (e.g., load-balanced web servers and memcached) with batch data analytics (e.g., MapReduce or Spark). Such combinations make sense in order to improve utilization, but the different applications have different scheduling needs. In a mixed setting, a monolithic scheduler likely results in sub-optimal assignments, since the logic cannot be diversified on a per-application basis. A two-level or shared-state scheduler will likely offer benefits here.<sup> <a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn8\" name=\"fn8l\">8<\/a><\/sup><\/p>\n<p>Most user-facing service workloads run with resource allocations sized to serve peak demand expected of each container, but in practice they typically under-utilize their allocations substantially. In this situation, being able to opportunistically over-subscribe the resources with lower-priority workloads (while maintaining QoS guarantees) is the key to an efficient cluster. Mesos is currently the only open-source system that ships support for such over-subscription, although Kubernetes has <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/kubernetes.io\/v1.1\/docs\/proposals\/resource-qos.html\">a fairly mature proposal<\/a> for adding it. We should expect more activity in this space in the future, since the utilization of most clusters is still substantially lower than the 60-70% <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2741964\">reported<\/a> for Google&#8217;s Borg clusters. We will focus on resource estimation, over-subscription and efficient machine utilization in a future post in this series.<\/p>\n<p>Finally, specific analytics and OLAP-style applications (for example, Dremel or SparkSQL queries) can benefit from fully-distributed schedulers. 
However, fully-distributed schedulers (e.g., Sparrow) come with fairly restricted feature sets, and thus work best when the workload is homogeneous (i.e., all tasks run for roughly the same time), set-up times are low (i.e., tasks are scheduled to long-running workers, as e.g., with MapReduce application-level tasks in YARN), and task churn is very high (i.e., many scheduling decisions must be made in a short time). We will talk more about these conditions and why fully-distributed schedulers \u2013 and the distributed components of hybrid schedulers \u2013 only make sense for these applications in the next blog post in this series. For now, it suffices to observe that distributed schedulers are substantially simpler than others, and do not support multiple resource dimensions, over-subscription, or re-scheduling.<\/p>\n<p>Overall, the table in Figure 2 is evidence that the open-source frameworks still have some way to go until they match the feature sets of advanced, but closed-source systems. This should serve as a call to action: as a result of missing features, utilization suffers, task performance is unpredictable, noisy neighbours cause pagers to go off, and elaborate hacks are required to coerce schedulers into supporting some user needs.<\/p>\n<p>However, there is some good news: while many frameworks have monolithic schedulers today, many are also moving towards more flexible designs. 
Kubernetes already supports pluggable schedulers (the <code>kube-scheduler<\/code> pod can be replaced by another API-compatible scheduler pod), supports <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/blob\/master\/docs\/proposals\/multiple-schedulers.md\">multiple schedulers from v1.2<\/a>, and has ongoing work on <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/blob\/master\/docs\/design\/scheduler_extender.md\">&#8220;extenders&#8221; to supply custom policies<\/a>. Docker Swarm may \u2013 to our understanding \u2013 also gain pluggable scheduler support in the future.<\/p>\n<h3>What&#8217;s next?<\/h3>\n<p>The next blog post in this series will look at the question of whether fully distributed architectures are the key innovation required to <em>scale<\/em> cluster schedulers further (spoiler: not necessarily). After that, we will also look at resource-fitting strategies (essential for good utilisation), and finally discuss how our Firmament scheduling platform combines many of the benefits of a shared-state architecture with the scheduling quality of monolithic schedulers and the speed of fully-distributed schedulers.<\/p>\n<p>&nbsp;<\/p>\n<p><em><b>Correction: March 10, 2016<\/b><br \/>\nAn earlier version of the text incorrectly reported the implementation status of some Kubernetes features. We amended the table in <a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fig2\">Figure 2<\/a> and the text to clarify that scheduler extenders are implemented, and that over-subscription is supported although automatic resource estimation is not. 
We also added a footnote explaining that a single scheduler <em>can<\/em> serve a mixed workload, but that its complexity will be high.<\/em><\/p>\n<p><em><b>Correction: March 15, 2016<\/b><br \/>\nAn earlier version of the text suggested that YARN and Mesos are two-level designs in an equal sense. However, YARN&#8217;s application-level scheduling is substantially less powerful than Mesos&#8217;s. This is now clearer in the text, and clarified further in <a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn2\">footnote 2<\/a>.<\/em><\/p>\n<p><em><b><a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/twitter.com\/CamSysAtScale\">Follow us on Twitter<\/a> to find out about new posts.<\/b><\/em><\/p>\n<p>&nbsp;<\/p>\n<hr \/>\n<div>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn0l\" name=\"fn0\"><sup>0<\/sup><\/a> \u2013 This figure simplifies things a bit: of course, in practice each machine runs more than one task, and many schedulers fit tasks in multiple resource dimensions, rather than into simple slots.<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn1l\" name=\"fn1\"><sup>1<\/sup><\/a> \u2013 As an illustrative example, <code>kube-scheduler<\/code> in Kubernetes currently has outstanding feature requests for <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/pull\/22217\">re-scheduling (pod migration)<\/a>, <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/issues\/22212\">priority preemption<\/a>, and <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/github.com\/kubernetes\/kubernetes\/pull\/14943\">resource 
over-subscription<\/a> in its monolithic scheduler.<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn2l\" name=\"fn2\"><sup>2<\/sup><\/a> \u2013 YARN&#8217;s approach is restricted compared to Mesos because the application-level logic cannot choose resources (unless it requests much more than it needs from the resource manager); it can only place application-level &#8220;tasks&#8221; into pre-existing containers that represent cluster-level tasks.<br \/>\nThis is a good fit for a system like Hadoop MapReduce, in which application-level tasks (maps and reduces) must be assigned to a dynamic collection of workers in an application-specific way (e.g., optimised for data locality and per-job). It is less suited to building a more general, multi-application scheduler on top \u2013 for example, a service scheduler like the &#8220;Marathon&#8221; framework for Mesos.<br \/>\nMonolithic schedulers like the Kubernetes one do not support this and rely on the application doing its own scheduling (e.g., <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/kubernetes.io\/v1.0\/examples\/spark\/README.html\">running a Spark &#8220;worker controller&#8221; as a long-running service<\/a>). Consequently, there are efforts to put <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/hortonworks.com\/blog\/docker-kubernetes-apache-hadoop-yarn\/\">Kubernetes on top of YARN<\/a> via a special <code>YARNScheduler<\/code> extension \u2013 requiring two complex systems to be administered. 
However, there are also long-term efforts to <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/docs.google.com\/document\/d\/1YhNLN39f5oZ4AHn_g7vBp0LQd7k37azL7FkWG8CEDrE\/edit#heading=h.ukbaidczvy3r\">improve native &#8220;big data&#8221; batch processing support in Kubernetes<\/a>.<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn3l\" name=\"fn3\"><sup>3<\/sup><\/a> \u2013 In the Omega paper, this problem is referred to as &#8220;information hiding&#8221;.<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn4l\" name=\"fn4\"><sup>4<\/sup><\/a> \u2013 Curiously, the literature does not appear to agree on whether to consider shared-state schedulers centralized or distributed: the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/www.usenix.org\/conference\/atc15\/technical-session\/presentation\/delgado\">Hawk paper<\/a> treats them as examples of distributed schedulers, while the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/www.usenix.org\/conference\/atc15\/technical-session\/presentation\/karanasos\">Mercury paper<\/a> refers to them as examples of a centralized architecture!<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn5l\" name=\"fn5\"><sup>5<\/sup><\/a> \u2013 Nomad actually uses a slightly different approach from Omega and Apollo: while multiple independent schedulers exist, jobs are not submitted directly to them, but instead <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"https:\/\/www.nomadproject.io\/docs\/internals\/scheduling.html\">arrive via a centralised &#8220;evaluation broker&#8221; queue<\/a>.<\/p>\n<p><a 
href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn6l\" name=\"fn6\"><sup>6<\/sup><\/a> \u2013 It&#8217;s worth noting that the same optimisation \u2013 taking the shared state off the critical path for enacting scheduling decisions \u2013 can be applied to Omega, but <em>not<\/em> to Nomad (in its current design): Omega can ship deltas directly to machines and update the cell state out-of-band, while Nomad&#8217;s design is premised on the leader reconciling changes in the plan queue.<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn7l\" name=\"fn7\"><sup>7<\/sup><\/a> \u2013 The table entry reflects the original Borg, but the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2741964\">Borg paper<\/a> and the recent <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/queue.acm.org\/detail.cfm?id=2898444\">ACM Queue paper<\/a> note that multi-scheduler support and other features have been back-ported from Omega into Borg.<\/p>\n<p><a href=\"http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html#fn8l\" name=\"fn8\"><sup>8<\/sup><\/a> \u2013 That said, having multiple schedulers is not a <em>necessary<\/em> precondition for serving mixed workloads: the <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2741964\">Borg scheduler<\/a> demonstrates that a sophisticated single scheduler can serve both long-running service and batch workloads. 
However, this comes at the expense of higher scheduler implementation complexity \u2013 a key motivation for <a class=\"campl-external\" title=\"undefined (Link to an external website)\" href=\"http:\/\/dl.acm.org\/citation.cfm?id=2465386\">Omega&#8217;s multi-scheduler design<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>[Reprinted from] http:\/\/www.cl.cam.ac.uk\/research\/srg\/netos\/camsas\/blog\/2016-03-09-scheduler-architectures.html<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Cluster schedulers are an important component of modern &hellip; <a href=\"https:\/\/wangqf.com\/?p=125\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\">\u201cThe evolution of cluster scheduler architectures\u201d<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-125","post","type-post","status-publish","format-standard","hentry","category-2"],"_links":{"self":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/posts\/125","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wangqf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=125"}],"version-history":[{"count":0,"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/posts\/125\/revisions"}],"wp:attachment":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=125"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wangqf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=125"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wangqf.com\/index.ph
p?rest_route=%2Fwp%2Fv2%2Ftags&post=125"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}