{"id":123,"date":"2016-04-25T08:57:36","date_gmt":"2016-04-25T00:57:36","guid":{"rendered":"https:\/\/wangqf.com\/?p=123"},"modified":"2016-04-25T08:57:36","modified_gmt":"2016-04-25T00:57:36","slug":"mesos-omega-borg-a-survey","status":"publish","type":"post","link":"https:\/\/wangqf.com\/?p=123","title":{"rendered":"mesos, omega, borg: a survey"},"content":{"rendered":"<p>Google recently unveiled one of their crown jewels of system infrastructure: <a href=\"http:\/\/research.google.com\/pubs\/pub43438.html\">Borg<\/a>, their cluster scheduler. This prompted me to re-read the <a href=\"https:\/\/www.cs.berkeley.edu\/~alig\/papers\/mesos.pdf\">Mesos<\/a> and <a href=\"http:\/\/research.google.com\/pubs\/pub41684.html\">Omega<\/a> papers, which deal with the same topic. I thought it&#8217;d be interested to do a compare and contrast of these systems. Mesos gets credit for the groundbreaking idea of two-level scheduling, Omega improved upon this with an analogy from databases, and Borg can sort of be seen as the culmination of all these\u00a0ideas.<\/p>\n<h3>Background<\/h3>\n<p>Cluster schedulers have existed long before big data. There&#8217;s a rich literature on scheduling on 1000s of cores in the HPC world, but their problem domain is simpler than what is addressed by<em>datacenter schedulers<\/em>, meaning Mesos\/Borg and their ilk. Let&#8217;s compare and contrast on a few\u00a0dimensions.<\/p>\n<h4>Scheduling for\u00a0locality<\/h4>\n<p>Supercomputers separate storage and compute and connect them with an approximately full-bisection bandwidth network that goes at close to memory speeds (GB\/s). This means your tasks can get placed anywhere on the cluster without worrying much about locality, since all compute nodes can access data equally quickly. 
There are a few applications that are hyper-optimized for the network topology, but these are very\u00a0rare.<\/p>\n<p>Data center schedulers <em>do<\/em> care about locality, and in fact this is the whole point of the GFS and MapReduce co-design. Back in the 2000s, network bandwidth was comparatively much more expensive than disk bandwidth. So there were huge economic savings from scheduling your computation tasks on the same node that held the data. This is a major scheduling constraint; whereas before you could put the task anywhere, now it needs to go on one of the three data\u00a0replicas.<\/p>\n<h4>Hardware\u00a0configuration<\/h4>\n<p>Supercomputers are typically composed of homogeneous nodes, i.e. they all have the same hardware specs. This is because supercomputers are typically purchased in one shot: a lab gets $x million for a new one, and they spend it all upfront. Some HPC applications are optimized for the specific CPU models in a supercomputer. New technologies like GPUs or co-processors are rolled out in a new\u00a0cluster.<\/p>\n<p>In the big data realm, clusters are primarily storage constrained, so operators are continually adding new racks with updated specs to expand cluster capacity. This means it&#8217;s typical for nodes to have different CPUs, memory capacities, numbers of disks, etc. Also toss in special additions like SSDs, GPUs, and shingled drives. A single datacenter might need to support a broad range of applications, and all of this again imposes additional scheduling\u00a0constraints.<\/p>\n<h4>Queue management and\u00a0scheduling<\/h4>\n<p>When running an application on a supercomputer, you specify how many nodes you want, the queue you want to submit your job to, and how long the job will run for. Queues place different restrictions on how many resources you can request and how long your job can run for. Queues also have a priority- or reservation-based system to determine ordering. 
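<\/p>
<p>As a toy illustration of the queue model just described (hypothetical code; real HPC schedulers such as Slurm are far more sophisticated): jobs declare a node count and a walltime, and the scheduler runs them in priority order, backfilling smaller jobs into leftover nodes.<\/p>

```python
# Toy sketch of HPC-style queue ordering with backfill (hypothetical;
# real schedulers like Slurm are far richer).
# Jobs declare how many nodes they need and a walltime, so the
# scheduler can plan ahead and enforce limits by killing overruns.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int
    walltime_h: int   # declared runtime, enforced by the queue
    priority: int

def schedule(queue, free_nodes):
    """Start highest-priority jobs first; backfill smaller jobs into leftovers."""
    started = []
    for job in sorted(queue, key=lambda j: -j.priority):
        if job.nodes <= free_nodes:
            started.append(job.name)
            free_nodes -= job.nodes
    return started, free_nodes

queue = [Job("big-sim", nodes=8, walltime_h=24, priority=10),
         Job("medium", nodes=4, walltime_h=2, priority=5),
         Job("small-backfill", nodes=2, walltime_h=1, priority=1)]
started, left = schedule(queue, free_nodes=10)
# "big-sim" takes 8 of 10 nodes; "medium" no longer fits, but the
# low-priority 2-node job backfills the remaining 2 nodes
```

<p>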
Since the job durations are all known, this is a pretty easy box packing problem. If the queues are long (typically true) and there&#8217;s a good mix of small jobs to backfill the space left over from big jobs (also typical), you can achieve extremely high levels of utilization. I like to visualize this in 2D, with time as X and resource usage as\u00a0Y.<\/p>\n<p>As noted above, datacenter scheduling is a more general problem. The &#8220;shape&#8221; of resource requests can be quite varied, and there are more dimensions. Jobs also do not have a set duration, so it&#8217;s hard to pre-plan queues. Thus we have more sophisticated scheduling algorithms, and the performance of the scheduler itself becomes\u00a0important.<\/p>\n<p>Utilization as a general rule is going to be worse (unless you&#8217;re Google; more on that later), but one benefit over HPC workloads is that MapReduce and similar can be incrementally scheduled instead of gang scheduled. In HPC, you wait until all N nodes that you requested are available, then run all your tasks at once. MR can instead run its tasks in multiple waves, meaning it can still effectively use bits of leftover resources. A single MR job can also ebb and flow based on cluster demand, which avoids the need for preemption or resource reservations, and also helps with fairness between multiple\u00a0users.<\/p>\n<h3>Mesos<\/h3>\n<p>Mesos predates YARN, and was designed with the problems of the original MapReduce in mind. Back then, Hadoop clusters could run only a single application: MapReduce. This made it difficult to run applications that didn&#8217;t conform to a map phase followed by a reduce phase. The biggest example here is Spark. Previously, you&#8217;d have to install a whole new set of workers and masters for Spark, which would sit alongside your MapReduce workers and masters. 
Hardly ideal from a utilization perspective, since they were typically statically\u00a0partitioned.<\/p>\n<p>Mesos addresses this problem by providing a generalized scheduler for all cluster applications. MapReduce and Spark simply became different applications using the same underlying resource sharing framework. The simplest approach would be to write a centralized scheduler, but that has a number of\u00a0drawbacks:<\/p>\n<ul>\n<li>API complexity. We need a single API that is a superset of all known framework scheduler APIs. This is difficult by itself. Expressing resource requests will also become very\u00a0complicated.<\/li>\n<li>Performance. 10s of thousands of nodes and millions of tasks is a lot, especially if the scheduling problem is\u00a0complex.<\/li>\n<li>Code agility. New schedulers and new frameworks are constantly being written, with new\u00a0requirements.<\/li>\n<\/ul>\n<p>Instead, Mesos introduces the idea of <em>two-level scheduling<\/em>. Mesos delegates the per-application scheduling work to the applications themselves, while remaining responsible for resource distribution between applications and enforcing overall fairness. This means Mesos can be pretty thin, 10K lines of\u00a0code.<\/p>\n<p>Two-level scheduling happens through a novel API called <em>resource offers<\/em>, where Mesos periodically offers some resources to the application schedulers. This sounds backwards at first (the request goes from the master to the application?), but it&#8217;s actually not that strange. In MR1, the TaskTracker workers are the source of truth as to what&#8217;s running on a node. When a TT heartbeats in saying that a task has completed, the JobTracker then chooses something else to run on that TaskTracker. Scheduling decisions are triggered by what&#8217;s essentially a resource offer from the worker. In Mesos, the resource offer comes from the Mesos master instead of the slave, since Mesos is managing the cluster. 
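<\/p>
<p>The offer flow can be sketched as a toy model (hypothetical names, not the actual Mesos API): the master offers a node&#8217;s free resources to a framework&#8217;s scheduler, and the framework replies with the tasks it wants launched against that offer.<\/p>

```python
# Toy model of Mesos-style two-level scheduling (not the real Mesos API).
# Level 1: the master decides which framework receives an offer.
# Level 2: the framework decides what to run on the offered resources.

class Offer:
    def __init__(self, node, cpus, mem_gb):
        self.node, self.cpus, self.mem_gb = node, cpus, mem_gb

class GreedyFramework:
    """App-level scheduler: accepts as many 1-CPU / 2 GB tasks as fit."""
    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)

    def resource_offer(self, offer):
        accepted, cpus, mem = [], offer.cpus, offer.mem_gb
        while self.pending and cpus >= 1 and mem >= 2:
            accepted.append((self.pending.pop(0), offer.node, 1, 2))
            cpus, mem = cpus - 1, mem - 2
        return accepted  # the master launches only these tasks

fw = GreedyFramework(pending_tasks=["t1", "t2", "t3"])
launched = fw.resource_offer(Offer("node-1", cpus=4, mem_gb=8))
# all three 1-CPU / 2 GB tasks fit within node-1's 4 CPUs / 8 GB
```

<p>Conceptually this mirrors the MR1 heartbeat flow described above. 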
Not that\u00a0different.<\/p>\n<p>Resource offers act as time-bounded leases for some resources. Mesos offers resources to an application based on policies like priority or fair share. The app then decides how to use them, and tells Mesos which resources from the offer it wants. This gives the app lots of flexibility, since it can choose to run a portion of tasks now, wait for a bigger allocation later (gang scheduling), or size its tasks differently to fit what&#8217;s available. Since offers are time-bounded, this also incentivizes applications to schedule\u00a0quickly.<\/p>\n<p>Some concerns and how they were\u00a0addressed:<\/p>\n<ul>\n<li>Long tasks hogging resources. Mesos lets you reserve some resources for short tasks, killing them after a time limit. This also incentivizes using short tasks, which is good for\u00a0fairness.<\/li>\n<li>Performance isolation. Use Linux Containers\u00a0(cgroups).<\/li>\n<li>Starvation of large tasks. It&#8217;s difficult to get sole access to a node, since some other app with smaller tasks will snap it up. The fix is having a <em>minimum offer size<\/em>.<\/li>\n<\/ul>\n<p>Unaddressed \/ unknown\u00a0resolution:<\/p>\n<ul>\n<li>Gang scheduling. I think this is impossible to do with high utilization without either knowing task lengths or preempting. Incrementally hoarding resources works at low utilization, but can result in\u00a0deadlock.<\/li>\n<li>Cross-application preemption is also hard. The resource offer API has no way of saying &#8220;here are some low-priority tasks I could kill if you want them&#8221;. Mesos depends on tasks being short to achieve\u00a0fairness.<\/li>\n<\/ul>\n<h3>Omega<\/h3>\n<p>Omega is sort of a successor to Mesos, and in fact shares an author. Since the paper uses simulated results for its evaluation, I suspect it never went into production at Google, and the ideas were rolled into the next generation of Borg. 
Rewriting the API is probably too invasive a change, even for\u00a0Google.<\/p>\n<p>Omega takes resource offers one degree further. In Mesos, resource offers are <em>pessimistic<\/em> or\u00a0<em>exclusive<\/em>. If a resource has been offered to an app, the same resource won&#8217;t be offered to another app until the offer times out. In Omega, resource offers are <em>optimistic<\/em>. Every application is offered all the available resources on the cluster, and conflicts are resolved at commit time. Omega&#8217;s resource manager is essentially just a relational database of all the per-node state, with different types of optimistic concurrency control to resolve conflicts. The upside of this is vastly increased scheduler performance (full parallelism) and better\u00a0utilization.<\/p>\n<p>The downside of all this is that applications are in a free-for-all where they are allowed to gobble up resources as fast as they want, and even preempt other users. This is okay for Google because they use a priority-based system, and can go yell at their internal users. Their workload broadly falls into just two priority bands: high-priority <em>service<\/em> jobs (HBase, webservers, long-lived services) and low-priority <em>batch<\/em> jobs (MapReduce and similar). Applications are allowed to preempt lower-priority jobs, and are also trusted to stay within their cooperatively enforced limits on the number of submitted jobs, the amount of allocated resources, etc. I think Yahoo has said that going and yelling at users doesn&#8217;t work for them (it&#8217;s certainly not scalable), but it somehow works at\u00a0Google.<\/p>\n<p>Most of the paper talks about how this optimistic allocation scheme handles conflicts, which is always the question. 
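<\/p>
<p>The commit-time conflict resolution can be sketched with a toy version of the shared-state store (hypothetical code; the paper&#8217;s actual mechanism is richer): each scheduler works from a snapshot, and its transaction commits only if the machines it claims are unchanged since that snapshot.<\/p>

```python
# Toy sketch of Omega-style optimistic concurrency control (simplified).
# The shared cell state is versioned per machine; a scheduler's claim
# commits atomically only if its snapshot of those machines is still current.

cell_state = {"m1": {"version": 0, "free_cpus": 8},
              "m2": {"version": 0, "free_cpus": 8}}

def snapshot():
    return {m: dict(s) for m, s in cell_state.items()}

def commit(snap, claims):
    """claims: {machine: cpus}. All-or-nothing commit with version checks."""
    for m in claims:
        if cell_state[m]["version"] != snap[m]["version"]:
            return False  # conflict: another scheduler got there first
    for m, cpus in claims.items():
        cell_state[m]["free_cpus"] -= cpus
        cell_state[m]["version"] += 1
    return True

snap_a, snap_b = snapshot(), snapshot()     # two schedulers, same view
ok_a = commit(snap_a, {"m1": 4})            # A wins the race on m1
ok_b = commit(snap_b, {"m1": 4})            # B conflicts and must retry
ok_b_retry = commit(snapshot(), {"m2": 4})  # retry against fresh state
```

<p>A conflict costs a retry against fresh state rather than blocking anyone up front, which is why the conflict rate is the key number. 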
There are a few high-level\u00a0notes:<\/p>\n<ul>\n<li>Service jobs are larger, and have more rigorous placement requirements for fault-tolerance (spread across\u00a0racks).<\/li>\n<li>Omega can probably scale up to 10s but not 100s of schedulers, due to the overhead of distributing the full cluster\u00a0state.<\/li>\n<li>Scheduling times of a few seconds are typical. They also compare up to 10s and 100s of seconds, which is where the benefits of two-level scheduling really kick in. Not sure how common this is, maybe for service\u00a0jobs?<\/li>\n<li>Typical cluster utilization is about\u00a060%.<\/li>\n<li>Conflicts are rare enough that OCC works in practice. They were able to go up to 6x their normal batch workload before the scheduler fell\u00a0apart.<\/li>\n<li>Incremental scheduling is very important. Gang-scheduling is significantly more expensive to implement due to increased conflicts. Apparently most applications can do incremental scheduling okay, and can just do a couple of partial allocations to get up to their total desired\u00a0amount.<\/li>\n<li>Even for complicated schedulers (tens of seconds of per-job overhead), Omega can still schedule a mixed workload with reasonable wait\u00a0times.<\/li>\n<li>Experimenting with a new MapReduce scheduler was empirically easy with\u00a0Omega.<\/li>\n<\/ul>\n<h4>Open\u00a0questions<\/h4>\n<ul>\n<li>At some point, optimistic concurrency control breaks down because of a high conflict rate and the duplicated work from retries. It seems like they won&#8217;t run into this in practice, but I wonder if there are worst-case scenarios with oddly-shaped tasks. Is this affected by the mix of service and batch jobs? Is this something that is tuned in\u00a0practice?<\/li>\n<li>Is a lack of global policies really acceptable? Fairness, preemption,\u00a0etc.<\/li>\n<li>What&#8217;s the scheduling time like for different types of jobs? 
Have people written very complicated\u00a0schedulers?<\/li>\n<\/ul>\n<h3>Borg<\/h3>\n<p>This is a production experience paper. It&#8217;s the same workload as Omega since it&#8217;s also Google, so many of the metapoints are the\u00a0same.<\/p>\n<h4>High-level<\/h4>\n<ul>\n<li>Everything runs within Borg, including the storage systems like CFS and\u00a0BigTable.<\/li>\n<li>Median cluster size is 10K nodes, though some are much\u00a0bigger.<\/li>\n<li>Nodes can be very\u00a0heterogeneous.<\/li>\n<li>Linux process isolation is used (essentially containers), since Borg predates modern virtual machine infrastructure. Efficiency and launch time were\u00a0important.<\/li>\n<li>All jobs are statically linked\u00a0binaries.<\/li>\n<li>A very complicated, very rich resource specification language is\u00a0available.<\/li>\n<li>Running jobs can be rolling-updated, meaning both configuration and binary. This sometimes requires a task restart, so fault-tolerance is\u00a0important.<\/li>\n<li>Support for &#8220;graceful stop&#8221; via SIGTERM before final kill via SIGKILL. The soft kill is optional, and cannot be relied on for\u00a0correctness.<\/li>\n<\/ul>\n<h4>Allocs<\/h4>\n<ul>\n<li>Resource allocation is separated from process liveness. An <em>alloc<\/em> can be used for task grouping or to hold resources across task\u00a0restarts.<\/li>\n<li>An <em>alloc set<\/em> is a group of allocs on multiple machines. Multiple jobs can be run within a single\u00a0alloc.<\/li>\n<li>This is actually a pretty common pattern! Multi-process is useful to separate concerns and\u00a0development.<\/li>\n<\/ul>\n<h4>Priorities and\u00a0quotas<\/h4>\n<ul>\n<li>Two priority bands: high and low, for service and\u00a0batch.<\/li>\n<li>Higher-priority jobs can preempt lower-priority\u00a0jobs.<\/li>\n<li>High-priority jobs cannot preempt each other (this prevents cascading livelock\u00a0situations).<\/li>\n<li>Quotas are used for admission control. 
Users pay more for quota at higher\u00a0priorities.<\/li>\n<li>There is also a &#8220;free&#8221; tier that runs at the lowest priority, to encourage high utilization and backfill\u00a0work.<\/li>\n<li>This is a simple and easy-to-understand\u00a0system!<\/li>\n<\/ul>\n<h4>Scheduling<\/h4>\n<ul>\n<li>Two phases to scheduling: finding feasible nodes, then scoring these nodes for final\u00a0placement.<\/li>\n<li>Feasibility is heavily determined by task\u00a0constraints.<\/li>\n<li>Scoring is mostly determined by system properties, like best-fit vs. worst-fit, job mix, failure domains, locality,\u00a0etc.<\/li>\n<li>Once final nodes are chosen, Borg will preempt to fit if\u00a0necessary.<\/li>\n<li>Typical scheduling time is around 25s, mostly because of localizing dependencies: downloading the binaries is 80% of this, so locality matters. Torrent and tree protocols are used to distribute\u00a0binaries.<\/li>\n<\/ul>\n<h4>Scalability<\/h4>\n<ul>\n<li>Centralization has not been an insurmountable performance\u00a0bottleneck.<\/li>\n<li>10s of thousands of nodes, with a scheduling rate of 10K tasks per\u00a0minute.<\/li>\n<li>A typical Borgmaster uses 10-14 cores and 50GB of\u00a0RAM.<\/li>\n<li>The architecture has become more and more multi-process over time, with reference to Omega and two-level\u00a0scheduling.<\/li>\n<li>There is a single master Borgmaster, but some responsibilities are still sharded: state updates from workers, read-only\u00a0RPCs.<\/li>\n<li>Some obvious optimizations: cache machine scores, compute feasibility once per task type, don&#8217;t attempt global optimality when making scheduling\u00a0decisions.<\/li>\n<li>The primary argument against bigger cells is isolation from operator errors and failure propagation. The architecture keeps scaling\u00a0fine.<\/li>\n<\/ul>\n<h4>Utilization<\/h4>\n<ul>\n<li>Their primary metric was <em>cell compaction<\/em>, or the smallest cluster that can still fit a set of tasks. 
Essentially box\u00a0packing.<\/li>\n<li>Big gains from the following: not segregating workloads or users, having big shared clusters, and fine-grained resource\u00a0requests.<\/li>\n<li>Optimistic overcommit on a per-Borglet basis. Borglets do resource estimation, and backfill non-prod work. If the estimation is incorrect, kill off the non-prod work. Memory is the inelastic\u00a0resource.<\/li>\n<li>Sharing does not drastically affect CPI (cycles per instruction, a proxy for CPU interference), but I wonder about the effect on\u00a0storage.<\/li>\n<\/ul>\n<h4>Lessons\u00a0learned<\/h4>\n<p>The issues listed here are pretty much fixed in Kubernetes, their public, open-source container\u00a0scheduler.<\/p>\n<p>Bad:<\/p>\n<ul>\n<li>It would be nice to schedule multi-job workflows rather than single jobs, for tracking and management. This also requires more flexible ways of referring to components of a workflow. This is solved by attaching arbitrary key-value pairs to each task and allowing users to query against\u00a0them.<\/li>\n<li>One IP per machine. This leads to port conflicts on a single machine and complicates binding and service discovery. This is solved by Linux namespaces, IPv6, and\u00a0SDN.<\/li>\n<li>Complicated specification language. Lots of knobs to turn, which makes it hard to get started as a casual user. There is some work on automatically determining resource\u00a0requirements.<\/li>\n<\/ul>\n<p>Good:<\/p>\n<ul>\n<li>Allocs are great! They allow helper services to be easily placed next to the main\u00a0task.<\/li>\n<li>Baking in services like load balancing and naming is very\u00a0useful.<\/li>\n<li>Metrics, debugging, and web UIs are very important so users can solve their own\u00a0problems.<\/li>\n<li>Centralization scales up well, but you need to split it up into multiple processes. 
Kubernetes does this from the start, meaning a nice clean API between the different scheduler\u00a0components.<\/li>\n<\/ul>\n<h3>Closing\u00a0remarks<\/h3>\n<p>It seems like YARN will need to draw from Mesos and Omega to scale up to 10K nodes. YARN is still a centralized scheduler, which is the strawman for comparison in Mesos and Omega. Borg specifically mentions the need to shard to\u00a0scale.<\/p>\n<p>Isolation is very important to achieve high utilization without compromising SLOs. This can surface at the application layer, where apps themselves need to be designed to be latency-tolerant. Think tail-at-scale request replication in BigTable. Ultimately it comes down to hardware spend vs. software spend. Running at lower utilization sidesteps this problem. Or, you can tackle it head-on through OS isolation mechanisms, resource estimation, and tuning your workload and schedulers. At Google scale, there&#8217;s enough hardware that it makes sense to hire a bunch of kernel developers. Fortunately they&#8217;ve done the work for us\u00a0\ud83d\ude42<\/p>\n<p>I also wonder if the Google workload assumptions apply more generally. Priority bands, reservations, and preemption work well for Google, but our customers almost all use the fair share scheduler. Yahoo uses the capacity scheduler. Twitter uses the fair scheduler. I haven&#8217;t heard of any demand for or usage of a priority + reservation\u00a0scheduler.<\/p>\n<p>Finally, very few of our customers run big shared clusters as envisioned at Google. We have customers with thousands of nodes, but this is split up into pods of hundreds of nodes. It&#8217;s also still common to have separate clusters for separate users or applications. Clusters are also typically homogeneous in terms of hardware. 
I think this will begin to change though, and\u00a0soon.<\/p>\n<p>[Reposted from] http:\/\/umbrant.com\/blog\/2015\/mesos_omega_borg_survey.html<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Google recently unveiled one of their crown jewels of s &hellip; <a href=\"https:\/\/wangqf.com\/?p=123\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\">\u201cmesos, omega, borg: a survey\u201d<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-123","post","type-post","status-publish","format-standard","hentry","category-2"],"_links":{"self":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/posts\/123","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/wangqf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=123"}],"version-history":[{"count":0,"href":"https:\/\/wangqf.com\/index.php?rest_route=\/wp\/v2\/posts\/123\/revisions"}],"wp:attachment":[{"href":"https:\/\/wangqf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=123"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wangqf.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=123"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wangqf.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=123"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}