CFS: Completely fair process scheduling in Linux

arrows cycle symbol for failing faster

Linux takes a modular approach to processor scheduling in that different algorithms can be used to schedule different process types. A scheduling class specifies which scheduling policy applies to which type of process. Completely fair scheduling (CFS), which became part of the Linux 2.6.23 kernel in 2007, is the scheduling class for normal (as opposed to real-time) processes and therefore is named SCHED_NORMAL.

CFS is geared for the interactive applications typical in a desktop environment, but it can be configured as SCHED_BATCH to favor the batch workloads common, for example, on a high-volume web server. In any case, CFS breaks dramatically with what might be called "classic preemptive scheduling." Also, the "completely fair" claim has to be seen with a technical eye; otherwise, the claim might seem like an empty boast.

Let's dig into the details of what sets CFS apart from—indeed, above—other process schedulers. Let's start with a quick review of some core technical terms.

Some core concepts

Linux inherits the Unix view of a process as a program in execution. As such, a process must contend with other processes for shared system resources: memory to hold instructions and data, at least one processor to execute instructions, and I/O devices to interact with the external world. Process scheduling is how the operating system (OS) assigns tasks (e.g., crunching some numbers, copying a file) to processors—a running process then performs the task. A process has one or more threads of execution, which are sequences of machine-level instructions. To schedule a process is to schedule one of its threads on a processor.

The Linux Terminal

In a simplifying move, Linux turns process scheduling into thread scheduling by treating a scheduled process as if it were single-threaded. If a process is multi-threaded with N threads, then N scheduling actions would be required to cover the threads. Threads within a multi-threaded process remain related in that they share resources such as memory address space. Linux threads are sometimes described as lightweight processes, with the lightweight underscoring the sharing of resources among the threads within a process.

Although a process can be in various states, two are of particular interest in scheduling. A blocked process is awaiting the completion of some event such as an I/O event. The process can resume execution only after the event completes. A runnable process is one that is not currently blocked.

A process is processor-bound (aka compute-bound) if it consumes mostly processor as opposed to I/O resources, and I/O-bound in the opposite case; hence, a processor-bound process is mostly runnable, whereas an I/O-bound process is mostly blocked. As examples, crunching numbers is processor-bound, and accessing files is I/O-bound. Although an entire process might be characterized as either processor-bound or I/O-bound, a given process may be one or the other during different stages of its execution. Interactive desktop applications, such as browsers, tend to be I/O-bound.

A good process scheduler has to balance the needs of processor-bound and I/O-bound tasks, especially in an operating system such as Linux that thrives on so many hardware platforms: desktop machines, embedded devices, mobile devices, server clusters, supercomputers, and more.

Classic preemptive scheduling versus CFS

Unix popularized classic preemptive scheduling, which other operating systems including VAX/VMS, Windows NT, and Linux later adopted. At the center of this scheduling model is a fixed timeslice, the amount of time (e.g., 50ms) that a task is allowed to hold a processor until preempted in favor of some other task. If a preempted process has not finished its work, the process must be rescheduled. This model is powerful in that it supports multitasking (concurrency) through processor time-sharing, even on the single-CPU machines of yesteryear.

The classic model typically includes multiple scheduling queues, one per process priority: Every process in a higher-priority queue gets scheduled before any process in a lower-priority queue. As an example, VAX/VMS uses 32 priority queues for scheduling.

CFS dispenses with fixed timeslices and explicit priorities. The amount of time for a given task on a processor is computed dynamically as the scheduling context changes over the system's lifetime. Here is a sketch of the motivating ideas and technical details:

Linux schedulers that predate CFS use heuristics to promote the fair treatment of interactive tasks with respect to scheduling. CFS takes a quite different approach by letting the vruntime facts speak mostly for themselves, which happens to support sleeper fairness. An interactive task, by its very nature, tends to sleep a lot in the sense that it awaits user inputs and so becomes I/O-bound; hence, such a task tends to have a relatively low vruntime, which tends to move the task towards the front of the scheduling line.

Special features

CFS supports symmetrical multiprocessing (SMP) in which any process (whether kernel or user) can execute on any processor. Yet configurable scheduling domains can be used to group processors for load balancing or even segregation. If several processors share the same scheduling policy, then load balancing among them is an option; if a particular processor has a scheduling policy different from the others, then this processor would be segregated from the others with respect to scheduling.

Configurable scheduling groups are another CFS feature. As an example, consider the Nginx web server that's running on my desktop machine. At startup, this server has a master process and four worker processes, which act as HTTP request handlers. For any HTTP request, the particular worker that handles the request is irrelevant; it matters only that the request is handled in a timely manner, and so the four workers together provide a pool from which to draw a task-handler as requests come in. It thus seems fair to treat the four Nginx workers as a group rather than as individuals for scheduling purposes, and a scheduling group can be used to do just that. The four Nginx workers could be configured to have a single vruntime among them rather than individual vruntimes. Configuration is done in the traditional Linux way, through files. For vruntime-sharing, a file named cpu.shares, with the details given through familiar shell commands, would be created.

As noted earlier, Linux supports scheduling classes so that different scheduling policies, together with their implementing algorithms, can coexist on the same platform. A scheduling class is implemented as a code module in C. CFS, the scheduling class described so far, is SCHED_NORMAL. There are also scheduling classes specifically for real-time tasks, SCHED_FIFO (first in, first out) and SCHED_RR (round robin). Under SCHED_FIFO, tasks run to completion; under SCHED_RR, tasks run until they exhaust a fixed timeslice and are preempted.

CFS implementation

CFS requires efficient data structures to track task information and high-performance code to generate the schedules. Let's begin with a central term in scheduling, the runqueue. This is a data structure that represents a timeline for scheduled tasks. Despite the name, the runqueue need not be implemented in the traditional way, as a FIFO list. CFS breaks with tradition by using a time-ordered red-black tree as a runqueue. The data structure is well-suited for the job because it is a self-balancing binary search tree, with efficient insert and remove operations that execute in O(log N) time, where N is the number of nodes in the tree. Also, a tree is an excellent data structure for organizing entities into a hierarchy based on a particular property, in this case a vruntime.

In CFS, the tree's internal nodes represent tasks to be scheduled, and the tree as a whole, like any runqueue, represents a timeline for task execution. Red-black trees are in wide use beyond scheduling; for example, Java uses this data structure to implement its TreeMap.

Under CFS, every processor has a specific runqueue of tasks, and no task occurs at the same time in more than one runqueue. Each runqueue is a red-black tree. The tree's internal nodes represent tasks or task groups, and these nodes are indexed by their vruntime values so that (in the tree as a whole or in any subtree) the internal nodes to the left have lower vruntime values than the ones to the right:

 25 ## 25 is a task vruntime /\ 17 29 ## 17 roots the left subtree, 29 the right one /\ . 5 19 ## and so on . \ nil ## leaf nodes are nil

In summary, tasks with the lowest vruntime—and, therefore, the greatest need for a processor—reside somewhere in the left subtree; tasks with relatively high vruntimes congregate in the right subtree. A preempted task would go into the right subtree, thus giving other tasks a chance to move leftwards in the tree. A task with the smallest vruntime winds up in the tree's leftmost (internal) node, which is thus the front of the runqueue.

The CFS scheduler has an instance, the C task_struct, to track detailed information about each task to be scheduled. This structure embeds a sched_entity structure, which in turn has scheduling-specific information, in particular, the vruntime per task or task group:

struct task_struct < /** info on a task **/ . struct sched_entity se; /** vruntime, etc. **/ . >;

The red-black tree is implemented in familiar C fashion, with a premium on pointers for efficiency. A cfs_rq structure instance embeds a rb_root field named tasks_timeline, which points to the root of a red-black tree. Each of the tree's internal nodes has pointers to the parent and the two child nodes; the leaf nodes have nil as their value.

CFS illustrates how a straightforward idea—give every task a fair share of processor resources—can be implemented in a low-fuss but highly efficient way. It's worth repeating that CFS achieves fair and efficient scheduling without traditional artifacts such as fixed timeslices and explicit task priorities. The pursuit of even better schedulers goes on, of course; for the moment, however, CFS is as good as it gets for general-purpose processor scheduling.