From 898bd37a92063e46bc8d7b870781cecd66234f92 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 18 Apr 2019 19:45:00 -0300 Subject: docs: block: convert to ReST Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab --- Documentation/block/bfq-iosched.rst | 597 ++++++++++++ Documentation/block/bfq-iosched.txt | 583 ----------- Documentation/block/biodoc.rst | 1168 +++++++++++++++++++++++ Documentation/block/biodoc.txt | 1076 --------------------- Documentation/block/biovecs.rst | 146 +++ Documentation/block/biovecs.txt | 144 --- Documentation/block/capability.rst | 18 + Documentation/block/capability.txt | 15 - Documentation/block/cmdline-partition.rst | 53 + Documentation/block/cmdline-partition.txt | 46 - Documentation/block/data-integrity.rst | 291 ++++++ Documentation/block/data-integrity.txt | 281 ------ Documentation/block/deadline-iosched.rst | 72 ++ Documentation/block/deadline-iosched.txt | 75 -- Documentation/block/index.rst | 25 + Documentation/block/ioprio.rst | 182 ++++ Documentation/block/ioprio.txt | 183 ---- Documentation/block/kyber-iosched.rst | 15 + Documentation/block/kyber-iosched.txt | 14 - Documentation/block/null_blk.rst | 126 +++ Documentation/block/null_blk.txt | 99 -- Documentation/block/pr.rst | 119 +++ Documentation/block/pr.txt | 119 --- Documentation/block/queue-sysfs.rst | 254 +++++ Documentation/block/queue-sysfs.txt | 253 ----- Documentation/block/request.rst | 99 ++ Documentation/block/request.txt | 88 -- Documentation/block/stat.rst | 93 ++ Documentation/block/stat.txt | 86 -- Documentation/block/switching-sched.rst | 39 + Documentation/block/switching-sched.txt | 35 - Documentation/block/writeback_cache_control.rst | 86 ++ Documentation/block/writeback_cache_control.txt | 86 -- 33 files changed, 3383 insertions(+), 3183 deletions(-) create mode 100644 Documentation/block/bfq-iosched.rst delete mode 100644 Documentation/block/bfq-iosched.txt create mode 100644 Documentation/block/biodoc.rst delete mode 100644 Documentation/block/biodoc.txt create mode 100644 Documentation/block/biovecs.rst delete mode 100644 Documentation/block/biovecs.txt create mode 100644 Documentation/block/capability.rst delete mode 100644 Documentation/block/capability.txt create mode 100644 Documentation/block/cmdline-partition.rst delete mode 100644 Documentation/block/cmdline-partition.txt create mode 100644 Documentation/block/data-integrity.rst delete mode 100644 Documentation/block/data-integrity.txt create mode 100644 Documentation/block/deadline-iosched.rst delete mode 100644 Documentation/block/deadline-iosched.txt create mode 100644 Documentation/block/index.rst create mode 100644 Documentation/block/ioprio.rst delete mode 100644 Documentation/block/ioprio.txt create mode 100644 Documentation/block/kyber-iosched.rst delete mode 100644 Documentation/block/kyber-iosched.txt create mode 100644 Documentation/block/null_blk.rst delete mode 100644 Documentation/block/null_blk.txt create mode 100644 Documentation/block/pr.rst delete mode 100644 Documentation/block/pr.txt create mode 100644 Documentation/block/queue-sysfs.rst delete mode 100644 Documentation/block/queue-sysfs.txt create mode 100644 Documentation/block/request.rst delete mode 100644 Documentation/block/request.txt create mode 100644 
Documentation/block/stat.rst delete mode 100644 Documentation/block/stat.txt create mode 100644 Documentation/block/switching-sched.rst delete mode 100644 Documentation/block/switching-sched.txt create mode 100644 Documentation/block/writeback_cache_control.rst delete mode 100644 Documentation/block/writeback_cache_control.txt (limited to 'Documentation/block')

diff --git a/Documentation/block/bfq-iosched.rst b/Documentation/block/bfq-iosched.rst
new file mode 100644
index 000000000000..2c13b2fc1888
--- /dev/null
+++ b/Documentation/block/bfq-iosched.rst
@@ -0,0 +1,597 @@
+==========================
+BFQ (Budget Fair Queueing)
+==========================
+
+BFQ is a proportional-share I/O scheduler, with some extra
+low-latency capabilities. In addition to cgroups support (blkio or io
+controllers), BFQ's main features are:
+
+- BFQ guarantees a high system and application responsiveness, and a
+  low latency for time-sensitive applications, such as audio or video
+  players;
+- BFQ distributes bandwidth, and not just time, among processes or
+  groups (switching back to time distribution when needed to keep
+  throughput high).
+
+In its default configuration, BFQ privileges latency over
+throughput. So, when needed for achieving a lower latency, BFQ builds
+schedules that may lead to a lower throughput. If your main or only
+goal, for a given device, is to achieve the maximum-possible
+throughput at all times, then do switch off all low-latency heuristics
+for that device, by setting low_latency to 0. See Section 3 for
+details on how to configure BFQ for the desired tradeoff between
+latency and throughput, or on how to maximize throughput.
+
+Like every I/O scheduler, BFQ adds some overhead to per-I/O-request
+processing. To give an idea of this overhead, the total,
+single-lock-protected, per-request processing time of BFQ---i.e., the
+sum of the execution times of the request insertion, dispatch and
+completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
+(a dated notebook CPU; time measured with simple code
+instrumentation, and using the throughput-sync.sh script of the S
+suite [3], in performance-profiling mode). To put this result into
+context, the total, single-lock-protected, per-request execution time
+of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7
+us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ).
+
+Scheduling overhead further limits the maximum IOPS that a CPU can
+process (already limited by the execution of the rest of the I/O
+stack). To give an idea of the limits with BFQ, on slow or average
+CPUs, here are, first, the limits of BFQ for three different CPUs, on,
+respectively, an average laptop, an old desktop, and a cheap embedded
+system, in case full hierarchical support is enabled (i.e.,
+CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not
+set (Section 4-2):
+
+- Intel i7-4850HQ: 400 KIOPS
+- AMD A8-3850: 250 KIOPS
+- ARM CortexTM-A53 Octa-core: 80 KIOPS
+
+If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical
+support is enabled), then the sustainable throughput with BFQ
+decreases, because all blkio.bfq* statistics are created and updated
+(Section 4-2). For BFQ, this leads to the following maximum
+sustainable throughputs, on the same systems as above:
+
+- Intel i7-4850HQ: 310 KIOPS
+- AMD A8-3850: 200 KIOPS
+- ARM CortexTM-A53 Octa-core: 56 KIOPS
+
+BFQ works for multi-queue devices too.
+
+.. The table of contents follows. The impatient can just jump to
+   Section 3.
+
+.. CONTENTS
+
+   1. When may BFQ be useful?
+      1-1 Personal systems
+      1-2 Server systems
+   2. How does BFQ work?
+   3. What are BFQ's tunables and how to properly configure BFQ?
+   4. BFQ group scheduling
+      4-1 Service guarantees provided
+      4-2 Interface
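+
+Before experimenting with the tunables discussed in Section 3, it may
+help to check that BFQ is actually the in-use scheduler for the target
+device. The following shell sketch is only illustrative: the device
+name sda is a placeholder, and switching schedulers is covered in
+detail in Documentation/block/switching-sched.rst::
+
+	# show the available schedulers; the active one is in brackets
+	cat /sys/block/sda/queue/scheduler
+
+	# make bfq the active scheduler for the device
+	echo bfq > /sys/block/sda/queue/scheduler
+
+	# the tunables of Section 3 then appear under the iosched directory
+	ls /sys/block/sda/queue/iosched/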
+1. When may BFQ be useful?
+==========================
+
+BFQ provides the following benefits on personal and server systems.
+
+1-1 Personal systems
+--------------------
+
+Low latency for interactive applications
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Regardless of the actual background workload, BFQ guarantees that, for
+interactive tasks, the storage device is virtually as responsive as if
+it were idle. For example, even if one or more of the following
+background workloads are being executed:
+
+- one or more large files are being read, written or copied,
+- a tree of source files is being compiled,
+- one or more virtual machines are performing I/O,
+- a software update is in progress,
+- indexing daemons are scanning filesystems and updating their
+  databases,
+
+starting an application or loading a file from within an application
+takes about the same time as if the storage device were idle. As a
+comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
+applications experience high latencies, or even become unresponsive
+until the background workload terminates (also on SSDs).
+
+Low latency for soft real-time applications
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Soft real-time applications too, such as audio and video
+players/streamers, enjoy a low latency and a low drop rate, regardless
+of the background I/O workload. As a consequence, these applications
+suffer from almost no glitches due to the background workload.
+
+Higher speed for code-development tasks
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If some additional workload happens to be executed in parallel, then
+BFQ executes the I/O-related components of typical code-development
+tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
+NOOP or DEADLINE.
+
+High throughput
+^^^^^^^^^^^^^^^
+
+On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
+up to 150% higher throughput than DEADLINE and NOOP, with all the
+sequential workloads considered in our tests. With random workloads,
+and with all the workloads on flash-based devices, BFQ achieves,
+instead, about the same throughput as the other schedulers.
+
+Strong fairness, bandwidth and delay guarantees
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+BFQ distributes the device throughput, and not just the device time,
+among I/O-bound applications in proportion to their weights, with any
+workload and regardless of the device parameters. From these bandwidth
+guarantees, it is possible to compute tight per-I/O-request delay
+guarantees by a simple formula. If not configured for strict service
+guarantees, BFQ switches to time-based resource sharing (only) for
+applications that would otherwise cause a throughput loss.
+
+1-2 Server systems
+------------------
+
+Most benefits for server systems follow from the same service
+properties as above. In particular, regardless of whether additional,
+possibly heavy workloads are being served, BFQ guarantees:
+
+* audio and video streaming with zero or very low jitter and drop
+  rate;
+
+* fast retrieval of web pages and embedded objects;
+
+* real-time recording of data in live-dumping applications (e.g.,
+  packet logging);
+
+* responsiveness in local and remote access to a server.
+
+
+2. How does BFQ work?
+=====================
+
+BFQ is a proportional-share I/O scheduler, whose general structure,
+plus a lot of code, are borrowed from CFQ.
+
+- Each process doing I/O on a device is associated with a weight and a
+  `(bfq_)queue`.
+
+- BFQ grants exclusive access to the device, for a while, to one queue
+  (process) at a time, and implements this service model by
+  associating every queue with a budget, measured in number of
+  sectors.
+
+  - After a queue is granted access to the device, the budget of the
+    queue is decremented, on each request dispatch, by the size of the
+    request.
+
+  - The in-service queue is expired, i.e., its service is suspended,
+    only if one of the following events occurs: 1) the queue finishes
+    its budget, 2) the queue empties, 3) a "budget timeout" fires.
+
+    - The budget timeout prevents processes doing random I/O from
+      holding the device for too long and dramatically reducing
+      throughput.
+
+    - Actually, as in CFQ, a queue associated with a process issuing
+      sync requests may not be expired immediately when it empties.
+      Instead, BFQ may idle the device for a short time interval,
+      giving the process the chance to go on being served if it issues
+      a new request in time. Device idling typically boosts the
+      throughput on rotational devices and on non-queueing flash-based
+      devices, if processes do synchronous and sequential I/O. In
+      addition, under BFQ, device idling is also instrumental in
+      guaranteeing the desired throughput fraction to processes
+      issuing sync requests (see the description of the slice_idle
+      tunable in this document, or [1, 2], for more details).
+
+      - With respect to idling for service guarantees, if several
+        processes are competing for the device at the same time, but
+        all processes and groups have the same weight, then BFQ
+        guarantees the expected throughput distribution without ever
+        idling the device. Throughput is thus as high as possible in
+        this common scenario.
+
+      - On flash-based storage with internal queueing of commands
+        (typically NCQ), device idling is always detrimental to
+        throughput. So, with these devices, BFQ performs idling
+        only when strictly needed for service guarantees, i.e., for
+        guaranteeing low latency or fairness. In these cases, overall
+        throughput may be sub-optimal. No solution currently exists to
+        provide both strong service guarantees and optimal throughput
+        on devices with internal queueing.
+
+  - If low-latency mode is enabled (default configuration), BFQ
+    executes some special heuristics to detect interactive and soft
+    real-time applications (e.g., video or audio players/streamers),
+    and to reduce their latency. The most important action taken to
+    achieve this goal is to give to the queues associated with these
+    applications more than their fair share of the device
+    throughput. For brevity, we call "weight-raising" the whole
+    set of actions taken by BFQ to privilege these queues. In
+    particular, BFQ provides a milder form of weight-raising for
+    interactive applications, and a stronger form for soft real-time
+    applications.
+
+  - BFQ automatically deactivates idling for queues born in a burst of
+    queue creations. In fact, these queues are usually associated with
+    the processes of applications and services that benefit mostly
+    from a high throughput. Examples are systemd during boot, or git
+    grep.
+
+  - Like CFQ, BFQ merges queues performing interleaved I/O, i.e.,
+    performing random I/O that becomes mostly sequential if
+    merged.
Unlike CFQ, BFQ achieves this goal with a more
+    reactive mechanism, called Early Queue Merge (EQM). EQM is so
+    responsive in detecting interleaved I/O (cooperating processes)
+    that it enables BFQ to achieve a high throughput, by queue
+    merging, even for queues for which CFQ needs a different
+    mechanism, preemption, to get a high throughput. As such, EQM is a
+    unified mechanism to achieve a high throughput with interleaved
+    I/O.
+
+  - Queues are scheduled according to a variant of WF2Q+, named
+    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
+    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
+    also ready for hierarchical scheduling; details are in Section 4.
+
+  - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
+    perfectly fair, and smooth service. In particular, B-WF2Q+
+    guarantees that each queue receives a fraction of the device
+    throughput proportional to its weight, even if the throughput
+    fluctuates, and regardless of: the device parameters, the current
+    workload and the budgets assigned to the queue.
+
+  - The last, budget-independence, property (although probably
+    counterintuitive at first) is definitely beneficial, for
+    the following reasons:
+
+    - First, with any proportional-share scheduler, the maximum
+      deviation with respect to an ideal service is proportional to
+      the maximum budget (slice) assigned to queues. As a consequence,
+      BFQ can keep this deviation tight not only because of the
+      accurate service of B-WF2Q+, but also because BFQ *does not*
+      need to assign a larger budget to a queue to let the queue
+      receive a higher fraction of the device throughput.
+
+    - Second, BFQ is free to choose, for every process (queue), the
+      budget that best fits the needs of the process, or best
+      leverages the I/O pattern of the process. In particular, BFQ
+      updates queue budgets with a simple feedback-loop algorithm that
+      allows a high throughput to be achieved, while still providing
+      tight latency guarantees to time-sensitive applications. When
+      the in-service queue expires, this algorithm computes the next
+      budget of the queue so as to:
+
+      - Let large budgets be eventually assigned to the queues
+        associated with I/O-bound applications performing sequential
+        I/O: in fact, the longer these applications are served once
+        they get access to the device, the higher the throughput is.
+
+      - Let small budgets be eventually assigned to the queues
+        associated with time-sensitive applications (which typically
+        perform sporadic and short I/O), because, the smaller the
+        budget assigned to a queue waiting for service is, the sooner
+        B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
+
+- If several processes are competing for the device at the same time,
+  but all processes and groups have the same weight, then BFQ
+  guarantees the expected throughput distribution without ever idling
+  the device. It uses preemption instead. Throughput is then much
+  higher in this common scenario.
+
+- ioprio classes are served in strict priority order, i.e.,
+  lower-priority queues are not served as long as there are
+  higher-priority queues. Among queues in the same class, the
+  bandwidth is distributed in proportion to the weight of each
+  queue. A very thin extra bandwidth is however guaranteed to
+  the Idle class, to prevent it from starving.
+
+
+3. What are BFQ's tunables and how to properly configure BFQ?
+=============================================================
+
+Most BFQ tunables affect service guarantees (basically latency and
+fairness) and throughput. For full details on how to choose the
+desired tradeoff between service guarantees and throughput, see the
+parameters slice_idle, strict_guarantees and low_latency. For details
+on how to maximize throughput, see slice_idle, timeout_sync and
+max_budget. The other performance-related parameters have been
+inherited from CFQ, and have been preserved mostly for compatibility
+with it. So far, no performance improvement has been reported after
+changing the latter parameters in BFQ.
+
+In particular, the tunables back_seek_max, back_seek_penalty,
+fifo_expire_async and fifo_expire_sync below are the same as in
+CFQ. Their description is just copied from that for CFQ. Some
+considerations in the description of slice_idle are copied from CFQ
+too.
+
+per-process ioprio and weight
+-----------------------------
+
+Unless the cgroups interface is used (see "4. BFQ group scheduling"),
+weights can be assigned to processes only indirectly, through I/O
+priorities, and according to the relation:
+weight = (IOPRIO_BE_NR - ioprio) * 10.
+
+Beware that, if low_latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
+
+slice_idle
+----------
+
+This parameter specifies how long BFQ should idle, waiting for the
+next I/O request, when certain sync BFQ queues become empty. By
+default slice_idle is a non-zero value. Idling has a double purpose:
+boosting throughput and making sure that the desired throughput
+distribution is respected (see the description of how BFQ works, and,
+if needed, the papers referred to there).
+
+As for throughput, idling can be very helpful on highly seeky media
+like single-spindle SATA/SAS disks, where it cuts down on the overall
+number of seeks and yields improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues, and one
+should see an overall improved throughput on faster storage devices
+like multiple SATA/SAS disks in a hardware RAID configuration, as well
+as flash-based storage with internal command queueing (and
+parallelism).
+
+So, depending on storage and workload, it might be useful to set
+slice_idle=0. In general, for SATA/SAS disks and software RAID of
+SATA/SAS disks, keeping slice_idle enabled should be useful. For any
+configuration where there are multiple spindles behind a single LUN
+(host-based hardware RAID controllers or storage arrays), or with
+flash-based fast storage, setting slice_idle=0 may result in better
+throughput and acceptable latencies.
+
+Idling is however necessary to have service guarantees enforced in
+case of differentiated weights or differentiated I/O-request lengths.
+To see why, suppose that a given BFQ queue A must get several I/O
+requests served for each request served for another queue B. Idling
+ensures that, if A makes a new I/O request slightly after becoming
+empty, then no request of B is dispatched in the middle, and thus A
+does not lose the possibility to get more than one request dispatched
+before the next request of B is dispatched. Note that idling
+guarantees the desired differentiated treatment of queues only in
+terms of I/O-request dispatches. To guarantee that the actual service
+order then corresponds to the dispatch order, the strict_guarantees
+tunable must be set too.
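+
+As a concrete illustration of the discussion above (the device name
+sda is a placeholder, and whether disabling idling actually helps
+depends entirely on the device and workload at hand)::
+
+	# disable idling, e.g., for fast storage with internal queueing
+	echo 0 > /sys/block/sda/queue/iosched/slice_idle
+
+	# re-enable idling; 8 ms is a typical default value
+	echo 8 > /sys/block/sda/queue/iosched/slice_idle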
+
+There is an important flipside to idling: apart from the above cases,
+where it is beneficial also for throughput, idling can severely impact
+throughput. One important case is random workloads. Because of this
+issue, BFQ tends to avoid idling as much as possible, when it is not
+beneficial also for throughput (as detailed in Section 2). As a
+consequence of this behavior, and of further issues described for the
+strict_guarantees tunable, short-term service guarantees may be
+occasionally violated. And, in some cases, these guarantees may be
+more important than guaranteeing maximum throughput. For example, in
+video playing/streaming, a very low drop rate may be more important
+than maximum throughput. In these cases, consider setting the
+strict_guarantees parameter.
+
+slice_idle_us
+-------------
+
+Controls the same tuning parameter as slice_idle, but in microseconds.
+Either tunable can be used to set idling behavior. Afterwards, the
+other tunable will reflect the newly set value in sysfs.
+
+strict_guarantees
+-----------------
+
+If this parameter is set (default: unset), then BFQ
+
+- always performs idling when the in-service queue becomes empty;
+
+- forces the device to serve one I/O request at a time, by dispatching a
+  new request only if there is no outstanding request.
+
+In the presence of differentiated weights or I/O-request sizes, both
+the above conditions are needed to guarantee that every BFQ queue
+receives its allotted share of the bandwidth. The first condition is
+needed for the reasons explained in the description of the slice_idle
+tunable. The second condition is needed because all modern storage
+devices reorder internally-queued requests, which may trivially break
+the service guarantees enforced by the I/O scheduler.
+
+Setting strict_guarantees may evidently affect throughput.
+
+back_seek_max
+-------------
+
+This specifies, in Kbytes, the maximum "distance" for backward seeking.
+The distance is the amount of space from the current head location to the
+sectors that are backward in terms of distance.
+
+This parameter allows the scheduler to anticipate requests in the "backward"
+direction and consider them as being the "next" if they are within this
+distance from the current head location.
+
+back_seek_penalty
+-----------------
+
+This parameter is used to compute the cost of backward seeking. If the
+backward distance of a request is just 1/back_seek_penalty of a "front"
+request's distance, then the seeking costs of the two requests are
+considered equivalent.
+
+So the scheduler will not bias toward one or the other request (otherwise
+the scheduler would bias toward the front request). The default value of
+back_seek_penalty is 2.
+
+fifo_expire_async
+-----------------
+
+This parameter is used to set the timeout of asynchronous requests. The
+default value is 248 ms.
+
+fifo_expire_sync
+----------------
+
+This parameter is used to set the timeout of synchronous requests. The
+default value is 124 ms. To favor synchronous requests over asynchronous
+ones, this value should be decreased relative to fifo_expire_async.
+
+low_latency
+-----------
+
+This parameter is used to enable/disable BFQ's low latency mode. By
+default, low latency mode is enabled. If enabled, interactive and soft
+real-time applications are privileged and experience a lower latency,
+as explained in more detail in the description of how BFQ works.
+
+DISABLE this mode if you need full control of bandwidth
+distribution. In fact, if it is enabled, then BFQ automatically
+increases the bandwidth share of privileged applications, as the main
+means to guarantee a lower latency to them.
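+
+For example, a user who prefers full control over bandwidth
+distribution could turn the heuristics off as follows (sda being,
+again, just a placeholder device name)::
+
+	echo 0 > /sys/block/sda/queue/iosched/low_latency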
+
+In addition, as already highlighted at the beginning of this document,
+DISABLE this mode if your only goal is to achieve a high throughput.
+In fact, privileging the I/O of some application over the rest may
+entail a lower throughput. To achieve the highest-possible throughput
+on a non-rotational device, setting slice_idle to 0 may be needed too
+(at the cost of giving up any strong guarantee on fairness and low
+latency).
+
+timeout_sync
+------------
+
+Maximum amount of device time that can be given to a task (queue) once
+it has been selected for service. On devices with costly seeks,
+increasing this time usually increases maximum throughput. On the
+opposite end, increasing this time coarsens the granularity of the
+short-term bandwidth and latency guarantees, especially if the
+following parameter is set to zero.
+
+max_budget
+----------
+
+Maximum amount of service, measured in sectors, that can be provided
+to a BFQ queue once it is set in service (of course within the limits
+of the above timeout). According to what is said in the description of
+the algorithm, larger values increase the throughput in proportion to
+the percentage of sequential I/O requests issued. The price of larger
+values is that they coarsen the granularity of short-term bandwidth
+and latency guarantees.
+
+The default value is 0, which enables auto-tuning: BFQ sets max_budget
+to the maximum number of sectors that can be served during
+timeout_sync, according to the estimated peak rate.
+
+For specific devices, some users have occasionally reported reaching a
+higher throughput by setting max_budget explicitly, i.e., by
+setting max_budget to a value higher than 0. In particular, they have
+set max_budget to values higher than those to which BFQ would have set
+it with auto-tuning. An alternative way to achieve this goal is to
+just increase the value of timeout_sync, leaving max_budget equal to 0.
+
+weights
+-------
+
+Read-only parameter, used to show the weights of the currently active
+BFQ queues.
+
+
+4. Group scheduling with BFQ
+============================
+
+BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
+blkio and io. In particular, BFQ supports weight-based proportional
+share. To activate cgroups support, set CONFIG_BFQ_GROUP_IOSCHED.
+
+4-1 Service guarantees provided
+-------------------------------
+
+With BFQ, proportional share means true proportional share of the
+device bandwidth, according to group weights. For example, a group
+with weight 200 gets twice the bandwidth, and not just twice the time,
+of a group with weight 100.
+
+BFQ supports hierarchies (group trees) of any depth. Bandwidth is
+distributed among groups and processes in the expected way: for each
+group, the children of the group share the whole bandwidth of the
+group in proportion to their weights. In particular, this implies
+that, for each leaf group, every process of the group receives the
+same share of the whole group bandwidth, unless the ioprio of the
+process is modified.
+
+The resource-sharing guarantee for a group may partially or totally
+switch from bandwidth to time, if providing bandwidth guarantees to
+the group lowers the throughput too much. This switch occurs on a
+per-process basis: if a process of a leaf group causes a throughput
+loss if served in such a way as to receive its share of the bandwidth,
+then BFQ switches back to just time-based proportional share for that
+process.
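+
+As a minimal sketch of the weight-200/weight-100 example above (the
+group names are made up, and a cgroups-v2 hierarchy with the io
+controller enabled is assumed; the interface files themselves are
+described in Section 4-2 below)::
+
+	# create two groups; "fast" gets twice the bandwidth of "slow"
+	mkdir /sys/fs/cgroup/fast /sys/fs/cgroup/slow
+	echo 200 > /sys/fs/cgroup/fast/io.bfq.weight
+	echo 100 > /sys/fs/cgroup/slow/io.bfq.weight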
+
+4-2 Interface
+-------------
+
+To get proportional sharing of bandwidth with BFQ for a given device,
+BFQ must of course be the active scheduler for that device.
+
+Within each group directory, the names of the files associated with
+BFQ-specific cgroup parameters and stats begin with the "bfq."
+prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
+BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
+parameter to set the weight of a group with BFQ is blkio.bfq.weight
+or io.bfq.weight.
+
+As for cgroups-v1 (blkio controller), the exact set of stat files
+created, and kept up-to-date by bfq, depends on whether
+CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all
+the stat files documented in
+Documentation/cgroup-v1/blkio-controller.rst. If, instead,
+CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files::
+
+  blkio.bfq.io_service_bytes
+  blkio.bfq.io_service_bytes_recursive
+  blkio.bfq.io_serviced
+  blkio.bfq.io_serviced_recursive
+
+The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum
+throughput sustainable with bfq, because updating the blkio.bfq.*
+stats is rather costly, especially for some of the stats enabled by
+CONFIG_BFQ_CGROUP_DEBUG.
+
+Parameters to set
+-----------------
+
+For each group, there is only the following parameter to set.
+
+weight (namely blkio.bfq.weight or io.bfq.weight): the weight of the
+group inside its parent. Available values: 1..10000 (default 100). The
+linear mapping between ioprio and weights, described at the beginning
+of the tunable section, is still valid, but all weights higher than
+IOPRIO_BE_NR*10 are mapped to ioprio 0.
+
+Recall that, if low_latency is set, then BFQ automatically raises the
+weight of the queues associated with interactive and soft real-time
+applications. Unset this tunable if you need/want to control weights.
+
+
+[1]
+    P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
+    Scheduler", Proceedings of the First Workshop on Mobile System
+    Technologies (MST-2015), May 2015.
+
+    http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
+
+[2]
+    P. Valente and M. Andreolini, "Improving Application
+    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
+    the 5th Annual International Systems and Storage Conference
+    (SYSTOR '12), June 2012.
+
+    Slightly extended version:
+
+    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf
+
+[3]
+    https://github.com/Algodev-github/S
diff --git a/Documentation/block/bfq-iosched.txt b/Documentation/block/bfq-iosched.txt
deleted file mode 100644
index bbd6eb5bbb07..000000000000
--- a/Documentation/block/bfq-iosched.txt
+++ /dev/null
@@ -1,583 +0,0 @@
-BFQ (Budget Fair Queueing)
-==========================
-
-BFQ is a proportional-share I/O scheduler, with some extra
-low-latency capabilities. In addition to cgroups support (blkio or io
-controllers), BFQ's main features are:
-- BFQ guarantees a high system and application responsiveness, and a
-  low latency for time-sensitive applications, such as audio or video
-  players;
-- BFQ distributes bandwidth, and not just time, among processes or
-  groups (switching back to time distribution when needed to keep
-  throughput high).
- -In its default configuration, BFQ privileges latency over -throughput. So, when needed for achieving a lower latency, BFQ builds -schedules that may lead to a lower throughput. If your main or only -goal, for a given device, is to achieve the maximum-possible -throughput at all times, then do switch off all low-latency heuristics -for that device, by setting low_latency to 0. See Section 3 for -details on how to configure BFQ for the desired tradeoff between -latency and throughput, or on how to maximize throughput. - -As every I/O scheduler, BFQ adds some overhead to per-I/O-request -processing. To give an idea of this overhead, the total, -single-lock-protected, per-request processing time of BFQ---i.e., the -sum of the execution times of the request insertion, dispatch and -completion hooks---is, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz -(dated CPU for notebooks; time measured with simple code -instrumentation, and using the throughput-sync.sh script of the S -suite [1], in performance-profiling mode). To put this result into -context, the total, single-lock-protected, per-request execution time -of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 -us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). - -Scheduling overhead further limits the maximum IOPS that a CPU can -process (already limited by the execution of the rest of the I/O -stack). To give an idea of the limits with BFQ, on slow or average -CPUs, here are, first, the limits of BFQ for three different CPUs, on, -respectively, an average laptop, an old desktop, and a cheap embedded -system, in case full hierarchical support is enabled (i.e., -CONFIG_BFQ_GROUP_IOSCHED is set), but CONFIG_BFQ_CGROUP_DEBUG is not -set (Section 4-2): -- Intel i7-4850HQ: 400 KIOPS -- AMD A8-3850: 250 KIOPS -- ARM CortexTM-A53 Octa-core: 80 KIOPS - -If CONFIG_BFQ_CGROUP_DEBUG is set (and of course full hierarchical -support is enabled), then the sustainable throughput with BFQ -decreases, because all blkio.bfq* statistics are created and updated -(Section 4-2). For BFQ, this leads to the following maximum -sustainable throughputs, on the same systems as above: -- Intel i7-4850HQ: 310 KIOPS -- AMD A8-3850: 200 KIOPS -- ARM CortexTM-A53 Octa-core: 56 KIOPS - -BFQ works for multi-queue devices too. - -The table of contents follow. Impatients can just jump to Section 3. - -CONTENTS - -1. When may BFQ be useful? - 1-1 Personal systems - 1-2 Server systems -2. How does BFQ work? -3. What are BFQ's tunables and how to properly configure BFQ? -4. BFQ group scheduling - 4-1 Service guarantees provided - 4-2 Interface - -1. When may BFQ be useful? -========================== - -BFQ provides the following benefits on personal and server systems. - -1-1 Personal systems --------------------- - -Low latency for interactive applications - -Regardless of the actual background workload, BFQ guarantees that, for -interactive tasks, the storage device is virtually as responsive as if -it was idle. For example, even if one or more of the following -background workloads are being executed: -- one or more large files are being read, written or copied, -- a tree of source files is being compiled, -- one or more virtual machines are performing I/O, -- a software update is in progress, -- indexing daemons are scanning filesystems and updating their - databases, -starting an application or loading a file from within an application -takes about the same time as if the storage device was idle. 
As a -comparison, with CFQ, NOOP or DEADLINE, and in the same conditions, -applications experience high latencies, or even become unresponsive -until the background workload terminates (also on SSDs). - -Low latency for soft real-time applications - -Also soft real-time applications, such as audio and video -players/streamers, enjoy a low latency and a low drop rate, regardless -of the background I/O workload. As a consequence, these applications -do not suffer from almost any glitch due to the background workload. - -Higher speed for code-development tasks - -If some additional workload happens to be executed in parallel, then -BFQ executes the I/O-related components of typical code-development -tasks (compilation, checkout, merge, ...) much more quickly than CFQ, -NOOP or DEADLINE. - -High throughput - -On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and -up to 150% higher throughput than DEADLINE and NOOP, with all the -sequential workloads considered in our tests. With random workloads, -and with all the workloads on flash-based devices, BFQ achieves, -instead, about the same throughput as the other schedulers. - -Strong fairness, bandwidth and delay guarantees - -BFQ distributes the device throughput, and not just the device time, -among I/O-bound applications in proportion their weights, with any -workload and regardless of the device parameters. From these bandwidth -guarantees, it is possible to compute tight per-I/O-request delay -guarantees by a simple formula. If not configured for strict service -guarantees, BFQ switches to time-based resource sharing (only) for -applications that would otherwise cause a throughput loss. - -1-2 Server systems ------------------- - -Most benefits for server systems follow from the same service -properties as above. In particular, regardless of whether additional, -possibly heavy workloads are being served, BFQ guarantees: - -. audio and video-streaming with zero or very low jitter and drop - rate; - -. fast retrieval of WEB pages and embedded objects; - -. real-time recording of data in live-dumping applications (e.g., - packet logging); - -. responsiveness in local and remote access to a server. - - -2. How does BFQ work? -===================== - -BFQ is a proportional-share I/O scheduler, whose general structure, -plus a lot of code, are borrowed from CFQ. - -- Each process doing I/O on a device is associated with a weight and a - (bfq_)queue. - -- BFQ grants exclusive access to the device, for a while, to one queue - (process) at a time, and implements this service model by - associating every queue with a budget, measured in number of - sectors. - - - After a queue is granted access to the device, the budget of the - queue is decremented, on each request dispatch, by the size of the - request. - - - The in-service queue is expired, i.e., its service is suspended, - only if one of the following events occurs: 1) the queue finishes - its budget, 2) the queue empties, 3) a "budget timeout" fires. - - - The budget timeout prevents processes doing random I/O from - holding the device for too long and dramatically reducing - throughput. - - - Actually, as in CFQ, a queue associated with a process issuing - sync requests may not be expired immediately when it empties. In - contrast, BFQ may idle the device for a short time interval, - giving the process the chance to go on being served if it issues - a new request in time. 
Device idling typically boosts the - throughput on rotational devices and on non-queueing flash-based - devices, if processes do synchronous and sequential I/O. In - addition, under BFQ, device idling is also instrumental in - guaranteeing the desired throughput fraction to processes - issuing sync requests (see the description of the slice_idle - tunable in this document, or [1, 2], for more details). - - - With respect to idling for service guarantees, if several - processes are competing for the device at the same time, but - all processes and groups have the same weight, then BFQ - guarantees the expected throughput distribution without ever - idling the device. Throughput is thus as high as possible in - this common scenario. - - - On flash-based storage with internal queueing of commands - (typically NCQ), device idling happens to be always detrimental - for throughput. So, with these devices, BFQ performs idling - only when strictly needed for service guarantees, i.e., for - guaranteeing low latency or fairness. In these cases, overall - throughput may be sub-optimal. No solution currently exists to - provide both strong service guarantees and optimal throughput - on devices with internal queueing. - - - If low-latency mode is enabled (default configuration), BFQ - executes some special heuristics to detect interactive and soft - real-time applications (e.g., video or audio players/streamers), - and to reduce their latency. The most important action taken to - achieve this goal is to give to the queues associated with these - applications more than their fair share of the device - throughput. For brevity, we call just "weight-raising" the whole - sets of actions taken by BFQ to privilege these queues. In - particular, BFQ provides a milder form of weight-raising for - interactive applications, and a stronger form for soft real-time - applications. - - - BFQ automatically deactivates idling for queues born in a burst of - queue creations. In fact, these queues are usually associated with - the processes of applications and services that benefit mostly - from a high throughput. Examples are systemd during boot, or git - grep. - - - As CFQ, BFQ merges queues performing interleaved I/O, i.e., - performing random I/O that becomes mostly sequential if - merged. Differently from CFQ, BFQ achieves this goal with a more - reactive mechanism, called Early Queue Merge (EQM). EQM is so - responsive in detecting interleaved I/O (cooperating processes), - that it enables BFQ to achieve a high throughput, by queue - merging, even for queues for which CFQ needs a different - mechanism, preemption, to get a high throughput. As such EQM is a - unified mechanism to achieve a high throughput with interleaved - I/O. - - - Queues are scheduled according to a variant of WF2Q+, named - B-WF2Q+, and implemented using an augmented rb-tree to preserve an - O(log N) overall complexity. See [2] for more details. B-WF2Q+ is - also ready for hierarchical scheduling, details in Section 4. - - - B-WF2Q+ guarantees a tight deviation with respect to an ideal, - perfectly fair, and smooth service. In particular, B-WF2Q+ - guarantees that each queue receives a fraction of the device - throughput proportional to its weight, even if the throughput - fluctuates, and regardless of: the device parameters, the current - workload and the budgets assigned to the queue. 
- - - The last, budget-independence, property (although probably - counterintuitive in the first place) is definitely beneficial, for - the following reasons: - - - First, with any proportional-share scheduler, the maximum - deviation with respect to an ideal service is proportional to - the maximum budget (slice) assigned to queues. As a consequence, - BFQ can keep this deviation tight not only because of the - accurate service of B-WF2Q+, but also because BFQ *does not* - need to assign a larger budget to a queue to let the queue - receive a higher fraction of the device throughput. - - - Second, BFQ is free to choose, for every process (queue), the - budget that best fits the needs of the process, or best - leverages the I/O pattern of the process. In particular, BFQ - updates queue budgets with a simple feedback-loop algorithm that - allows a high throughput to be achieved, while still providing - tight latency guarantees to time-sensitive applications. When - the in-service queue expires, this algorithm computes the next - budget of the queue so as to: - - - Let large budgets be eventually assigned to the queues - associated with I/O-bound applications performing sequential - I/O: in fact, the longer these applications are served once - got access to the device, the higher the throughput is. - - - Let small budgets be eventually assigned to the queues - associated with time-sensitive applications (which typically - perform sporadic and short I/O), because, the smaller the - budget assigned to a queue waiting for service is, the sooner - B-WF2Q+ will serve that queue (Subsec 3.3 in [2]). - -- If several processes are competing for the device at the same time, - but all processes and groups have the same weight, then BFQ - guarantees the expected throughput distribution without ever idling - the device. It uses preemption instead. Throughput is then much - higher in this common scenario. - -- ioprio classes are served in strict priority order, i.e., - lower-priority queues are not served as long as there are - higher-priority queues. Among queues in the same class, the - bandwidth is distributed in proportion to the weight of each - queue. A very thin extra bandwidth is however guaranteed to - the Idle class, to prevent it from starving. - - -3. What are BFQ's tunables and how to properly configure BFQ? -============================================================= - -Most BFQ tunables affect service guarantees (basically latency and -fairness) and throughput. For full details on how to choose the -desired tradeoff between service guarantees and throughput, see the -parameters slice_idle, strict_guarantees and low_latency. For details -on how to maximise throughput, see slice_idle, timeout_sync and -max_budget. The other performance-related parameters have been -inherited from, and have been preserved mostly for compatibility with -CFQ. So far, no performance improvement has been reported after -changing the latter parameters in BFQ. - -In particular, the tunables back_seek-max, back_seek_penalty, -fifo_expire_async and fifo_expire_sync below are the same as in -CFQ. Their description is just copied from that for CFQ. Some -considerations in the description of slice_idle are copied from CFQ -too. - -per-process ioprio and weight ------------------------------ - -Unless the cgroups interface is used (see "4. BFQ group scheduling"), -weights can be assigned to processes only indirectly, through I/O -priorities, and according to the relation: -weight = (IOPRIO_BE_NR - ioprio) * 10. 
- -Beware that, if low-latency is set, then BFQ automatically raises the -weight of the queues associated with interactive and soft real-time -applications. Unset this tunable if you need/want to control weights. - -slice_idle ----------- - -This parameter specifies how long BFQ should idle for next I/O -request, when certain sync BFQ queues become empty. By default -slice_idle is a non-zero value. Idling has a double purpose: boosting -throughput and making sure that the desired throughput distribution is -respected (see the description of how BFQ works, and, if needed, the -papers referred there). - -As for throughput, idling can be very helpful on highly seeky media -like single spindle SATA/SAS disks where we can cut down on overall -number of seeks and see improved throughput. - -Setting slice_idle to 0 will remove all the idling on queues and one -should see an overall improved throughput on faster storage devices -like multiple SATA/SAS disks in hardware RAID configuration, as well -as flash-based storage with internal command queueing (and -parallelism). - -So depending on storage and workload, it might be useful to set -slice_idle=0. In general for SATA/SAS disks and software RAID of -SATA/SAS disks keeping slice_idle enabled should be useful. For any -configurations where there are multiple spindles behind single LUN -(Host based hardware RAID controller or for storage arrays), or with -flash-based fast storage, setting slice_idle=0 might end up in better -throughput and acceptable latencies. - -Idling is however necessary to have service guarantees enforced in -case of differentiated weights or differentiated I/O-request lengths. -To see why, suppose that a given BFQ queue A must get several I/O -requests served for each request served for another queue B. Idling -ensures that, if A makes a new I/O request slightly after becoming -empty, then no request of B is dispatched in the middle, and thus A -does not lose the possibility to get more than one request dispatched -before the next request of B is dispatched. Note that idling -guarantees the desired differentiated treatment of queues only in -terms of I/O-request dispatches. To guarantee that the actual service -order then corresponds to the dispatch order, the strict_guarantees -tunable must be set too. - -There is an important flipside for idling: apart from the above cases -where it is beneficial also for throughput, idling can severely impact -throughput. One important case is random workload. Because of this -issue, BFQ tends to avoid idling as much as possible, when it is not -beneficial also for throughput (as detailed in Section 2). As a -consequence of this behavior, and of further issues described for the -strict_guarantees tunable, short-term service guarantees may be -occasionally violated. And, in some cases, these guarantees may be -more important than guaranteeing maximum throughput. For example, in -video playing/streaming, a very low drop rate may be more important -than maximum throughput. In these cases, consider setting the -strict_guarantees parameter. - -slice_idle_us -------------- - -Controls the same tuning parameter as slice_idle, but in microseconds. -Either tunable can be used to set idling behavior. Afterwards, the -other tunable will reflect the newly set value in sysfs. 
- -strict_guarantees ------------------ - -If this parameter is set (default: unset), then BFQ - -- always performs idling when the in-service queue becomes empty; - -- forces the device to serve one I/O request at a time, by dispatching a - new request only if there is no outstanding request. - -In the presence of differentiated weights or I/O-request sizes, both -the above conditions are needed to guarantee that every BFQ queue -receives its allotted share of the bandwidth. The first condition is -needed for the reasons explained in the description of the slice_idle -tunable. The second condition is needed because all modern storage -devices reorder internally-queued requests, which may trivially break -the service guarantees enforced by the I/O scheduler. - -Setting strict_guarantees may evidently affect throughput. - -back_seek_max -------------- - -This specifies, given in Kbytes, the maximum "distance" for backward seeking. -The distance is the amount of space from the current head location to the -sectors that are backward in terms of distance. - -This parameter allows the scheduler to anticipate requests in the "backward" -direction and consider them as being the "next" if they are within this -distance from the current head location. - -back_seek_penalty ------------------ - -This parameter is used to compute the cost of backward seeking. If the -backward distance of request is just 1/back_seek_penalty from a "front" -request, then the seeking cost of two requests is considered equivalent. - -So scheduler will not bias toward one or the other request (otherwise scheduler -will bias toward front request). Default value of back_seek_penalty is 2. - -fifo_expire_async ------------------ - -This parameter is used to set the timeout of asynchronous requests. Default -value of this is 248ms. - -fifo_expire_sync ----------------- - -This parameter is used to set the timeout of synchronous requests. Default -value of this is 124ms. In case to favor synchronous requests over asynchronous -one, this value should be decreased relative to fifo_expire_async. - -low_latency ------------ - -This parameter is used to enable/disable BFQ's low latency mode. By -default, low latency mode is enabled. If enabled, interactive and soft -real-time applications are privileged and experience a lower latency, -as explained in more detail in the description of how BFQ works. - -DISABLE this mode if you need full control on bandwidth -distribution. In fact, if it is enabled, then BFQ automatically -increases the bandwidth share of privileged applications, as the main -means to guarantee a lower latency to them. - -In addition, as already highlighted at the beginning of this document, -DISABLE this mode if your only goal is to achieve a high throughput. -In fact, privileging the I/O of some application over the rest may -entail a lower throughput. To achieve the highest-possible throughput -on a non-rotational device, setting slice_idle to 0 may be needed too -(at the cost of giving up any strong guarantee on fairness and low -latency). - -timeout_sync ------------- - -Maximum amount of device time that can be given to a task (queue) once -it has been selected for service. On devices with costly seeks, -increasing this time usually increases maximum throughput. On the -opposite end, increasing this time coarsens the granularity of the -short-term bandwidth and latency guarantees, especially if the -following parameter is set to zero. 
- -max_budget ----------- - -Maximum amount of service, measured in sectors, that can be provided -to a BFQ queue once it is set in service (of course within the limits -of the above timeout). According to what said in the description of -the algorithm, larger values increase the throughput in proportion to -the percentage of sequential I/O requests issued. The price of larger -values is that they coarsen the granularity of short-term bandwidth -and latency guarantees. - -The default value is 0, which enables auto-tuning: BFQ sets max_budget -to the maximum number of sectors that can be served during -timeout_sync, according to the estimated peak rate. - -For specific devices, some users have occasionally reported to have -reached a higher throughput by setting max_budget explicitly, i.e., by -setting max_budget to a higher value than 0. In particular, they have -set max_budget to higher values than those to which BFQ would have set -it with auto-tuning. An alternative way to achieve this goal is to -just increase the value of timeout_sync, leaving max_budget equal to 0. - -weights -------- - -Read-only parameter, used to show the weights of the currently active -BFQ queues. - - -4. Group scheduling with BFQ -============================ - -BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely -blkio and io. In particular, BFQ supports weight-based proportional -share. To activate cgroups support, set BFQ_GROUP_IOSCHED. - -4-1 Service guarantees provided -------------------------------- - -With BFQ, proportional share means true proportional share of the -device bandwidth, according to group weights. For example, a group -with weight 200 gets twice the bandwidth, and not just twice the time, -of a group with weight 100. - -BFQ supports hierarchies (group trees) of any depth. Bandwidth is -distributed among groups and processes in the expected way: for each -group, the children of the group share the whole bandwidth of the -group in proportion to their weights. In particular, this implies -that, for each leaf group, every process of the group receives the -same share of the whole group bandwidth, unless the ioprio of the -process is modified. - -The resource-sharing guarantee for a group may partially or totally -switch from bandwidth to time, if providing bandwidth guarantees to -the group lowers the throughput too much. This switch occurs on a -per-process basis: if a process of a leaf group causes throughput loss -if served in such a way to receive its share of the bandwidth, then -BFQ switches back to just time-based proportional share for that -process. - -4-2 Interface -------------- - -To get proportional sharing of bandwidth with BFQ for a given device, -BFQ must of course be the active scheduler for that device. - -Within each group directory, the names of the files associated with -BFQ-specific cgroup parameters and stats begin with the "bfq." -prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for -BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group -parameter to set the weight of a group with BFQ is blkio.bfq.weight -or io.bfq.weight. - -As for cgroups-v1 (blkio controller), the exact set of stat files -created, and kept up-to-date by bfq, depends on whether -CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all -the stat files documented in -Documentation/cgroup-v1/blkio-controller.rst. 
If, instead, -CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files -blkio.bfq.io_service_bytes -blkio.bfq.io_service_bytes_recursive -blkio.bfq.io_serviced -blkio.bfq.io_serviced_recursive - -The value of CONFIG_BFQ_CGROUP_DEBUG greatly influences the maximum -throughput sustainable with bfq, because updating the blkio.bfq.* -stats is rather costly, especially for some of the stats enabled by -CONFIG_BFQ_CGROUP_DEBUG. - -Parameters to set ------------------ - -For each group, there is only the following parameter to set. - -weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the -group inside its parent. Available values: 1..10000 (default 100). The -linear mapping between ioprio and weights, described at the beginning -of the tunable section, is still valid, but all weights higher than -IOPRIO_BE_NR*10 are mapped to ioprio 0. - -Recall that, if low-latency is set, then BFQ automatically raises the -weight of the queues associated with interactive and soft real-time -applications. Unset this tunable if you need/want to control weights. - - -[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O - Scheduler", Proceedings of the First Workshop on Mobile System - Technologies (MST-2015), May 2015. - http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf - -[2] P. Valente and M. Andreolini, "Improving Application - Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of - the 5th Annual International Systems and Storage Conference - (SYSTOR '12), June 2012. - Slightly extended version: - http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite- - results.pdf - -[3] https://github.com/Algodev-github/S
diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst
new file mode 100644
index 000000000000..d6e30b680405
--- /dev/null
+++ b/Documentation/block/biodoc.rst
@@ -0,0 +1,1168 @@
+=====================================================
+Notes on the Generic Block Layer Rewrite in Linux 2.5
+=====================================================
+
+.. note::
+
+	It seems that there is a lot of outdated material here. This
+	document seems to have been written somewhat as a task list. Yet,
+	eventually, something here might still be useful.
+
+Notes Written on Jan 15, 2002:
+	- Jens Axboe
+	- Suparna Bhattacharya
+
+Last Updated May 2, 2002
+
+September 2003: Updated I/O Scheduler portions
+	- Nick Piggin
+
+Introduction
+============
+
+These are some notes describing some aspects of the 2.5 block layer in the
+context of the bio rewrite. The idea is to bring out some of the key
+changes and a glimpse of the rationale behind those changes.
+
+Please mail corrections & suggestions to suparna@in.ibm.com.
+
+Credits
+=======
+
+2.5 bio rewrite:
+	- Jens Axboe
+
+Many aspects of the generic block layer redesign were driven by and evolved
+over discussions, prior patches and the collective experience of several
+people. See sections 8 and 9 for a list of some related references.
+
+The following people helped with review comments and inputs for this
+document:
+
+	- Christoph Hellwig
+	- Arjan van de Ven
+	- Randy Dunlap
+	- Andre Hedrick
+
+The following people helped with fixes/contributions to the bio patches
+while it was still work-in-progress:
+
+	- David S. Miller
+
+
+.. Description of Contents:
+
Scope for tuning of logic to various needs + 1.1 Tuning based on device or low level driver capabilities + - Per-queue parameters + - Highmem I/O support + - I/O scheduler modularization + 1.2 Tuning based on high level requirements/capabilities + 1.2.1 Request Priority/Latency + 1.3 Direct access/bypass to lower layers for diagnostics and special + device operations + 1.3.1 Pre-built commands + 2. New flexible and generic but minimalist i/o structure or descriptor + (instead of using buffer heads at the i/o layer) + 2.1 Requirements/Goals addressed + 2.2 The bio struct in detail (multi-page io unit) + 2.3 Changes in the request structure + 3. Using bios + 3.1 Setup/teardown (allocation, splitting) + 3.2 Generic bio helper routines + 3.2.1 Traversing segments and completion units in a request + 3.2.2 Setting up DMA scatterlists + 3.2.3 I/O completion + 3.2.4 Implications for drivers that do not interpret bios (don't handle + multiple segments) + 3.3 I/O submission + 4. The I/O scheduler + 5. Scalability related changes + 5.1 Granular locking: Removal of io_request_lock + 5.2 Prepare for transition to 64 bit sector_t + 6. Other Changes/Implications + 6.1 Partition re-mapping handled by the generic block layer + 7. A few tips on migration of older drivers + 8. A list of prior/related/impacted patches/ideas + 9. Other References/Discussion Threads + + +Bio Notes +========= + +Let us discuss the changes in the context of how some overall goals for the +block layer are addressed. + +1. Scope for tuning the generic logic to satisfy various requirements +===================================================================== + +The block layer design supports adaptable abstractions to handle common +processing with the ability to tune the logic to an appropriate extent +depending on the nature of the device and the requirements of the caller. +One of the objectives of the rewrite was to increase the degree of tunability +and to enable higher level code to utilize underlying device/driver +capabilities to the maximum extent for better i/o performance. This is +important especially in the light of ever improving hardware capabilities +and application/middleware software designed to take advantage of these +capabilities. + +1.1 Tuning based on low level device / driver capabilities +---------------------------------------------------------- + +Sophisticated devices with large built-in caches, intelligent i/o scheduling +optimizations, high memory DMA support, etc may find some of the +generic processing an overhead, while for less capable devices the +generic functionality is essential for performance or correctness reasons. +Knowledge of some of the capabilities or parameters of the device should be +used at the generic block layer to take the right decisions on +behalf of the driver. + +How is this achieved ? + +Tuning at a per-queue level: + +i. Per-queue limits/values exported to the generic layer by the driver + +Various parameters that the generic i/o scheduler logic uses are set at +a per-queue level (e.g maximum request size, maximum number of segments in +a scatter-gather list, logical block size) + +Some parameters that were earlier available as global arrays indexed by +major/minor are now directly associated with the queue. Some of these may +move into the block device structure in the future. Some characteristics +have been incorporated into a queue flags field rather than separate fields +in themselves. 
There are blk_queue_xxx functions to set the parameters,
+rather than updating the fields directly.
+
+Some new queue property settings:
+
+	blk_queue_bounce_limit(q, u64 dma_address)
+		Enable I/O to highmem pages, dma_address being the
+		limit. No highmem default.
+
+	blk_queue_max_sectors(q, max_sectors)
+		Sets two variables that limit the size of the request.
+
+		- The request queue's max_sectors, which is a soft size in
+		  units of 512 byte sectors, and could be dynamically varied
+		  by the core kernel.
+
+		- The request queue's max_hw_sectors, which is a hard limit
+		  and reflects the maximum size request a driver can handle
+		  in units of 512 byte sectors.
+
+		The default for both max_sectors and max_hw_sectors is
+		255. The upper limit of max_sectors is 1024.
+
+	blk_queue_max_phys_segments(q, max_segments)
+		Maximum physical segments you can handle in a request. 128
+		default (driver limit). (See 3.2.2)
+
+	blk_queue_max_hw_segments(q, max_segments)
+		Maximum dma segments the hardware can handle in a request. 128
+		default (host adapter limit, after dma remapping).
+		(See 3.2.2)
+
+	blk_queue_max_segment_size(q, max_seg_size)
+		Maximum size of a clustered segment, 64kB default.
+
+	blk_queue_logical_block_size(q, logical_block_size)
+		Lowest possible sector size that the hardware can operate
+		on, 512 bytes default.
+
+New queue flags:
+
+	QUEUE_FLAG_CLUSTER (see 3.2.2)
+	QUEUE_FLAG_QUEUED (see 3.2.4)
+
+
+ii. High-mem i/o capabilities are now considered the default
+
+The generic bounce buffer logic, present in 2.4, where the block layer would
+by default copy i/o data in high-memory buffers to low-memory bounce buffers,
+assuming that the driver wouldn't be able to handle high memory directly, has
+been changed in 2.5. The bounce logic is now applied only for memory ranges
+for which the device cannot handle i/o. A driver can specify this by
+setting the queue bounce limit for the request queue for the device
+(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
+where a device is capable of handling high memory i/o.
+
+In order to enable high-memory i/o where the device is capable of supporting
+it, the pci dma mapping routines and associated data structures have now been
+modified to accomplish a direct page -> bus translation, without requiring
+a virtual address mapping (unlike the earlier scheme of virtual address
+-> bus translation). So this works uniformly for high-memory pages (which
+do not have a corresponding kernel virtual address space mapping) and
+low-memory pages.
+
+Note: Please refer to Documentation/DMA-API-HOWTO.txt for a discussion
+on PCI high mem DMA aspects and mapping of scatter gather lists, and support
+for 64 bit PCI.
+
+Special handling is required only for cases where i/o needs to happen on
+pages at physical memory addresses beyond what the device can support. In these
+cases, a bounce bio representing a buffer from the supported memory range
+is used for performing the i/o with copyin/copyout as needed depending on
+the type of the operation. For example, in case of a read operation, the
+data read has to be copied to the original buffer on i/o completion, so a
+callback routine is set up to do this, while for write, the data is copied
+from the original buffer to the bounce buffer prior to issuing the
+operation. Since an original buffer may be in a high memory area that's not
+mapped in kernel virtual addr, a kmap operation may be required for
+performing the copy, and special care may be needed in the completion path
+as it may not be in irq context. Special care is also required (by way of
+GFP flags) when allocating bounce buffers, to avoid certain highmem
+deadlock possibilities.
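+
+To make the above concrete: pulling together the per-queue settings
+from (i) and the bounce limit just described, a driver's queue setup
+could look roughly like the sketch below. The driver name and the
+particular limit values are invented purely for illustration; the
+blk_queue_* calls are the ones listed above::
+
+	#include <linux/blkdev.h>
+
+	static void my_setup_queue_limits(struct request_queue *q)
+	{
+		/* device can only DMA to the low 4GB of memory;
+		   i/o to pages above this limit gets bounced */
+		blk_queue_bounce_limit(q, 0xffffffffULL);
+
+		/* at most 128 sectors (64kB) per request */
+		blk_queue_max_sectors(q, 128);
+
+		/* scatter-gather limits, before and after dma remapping */
+		blk_queue_max_phys_segments(q, 32);
+		blk_queue_max_hw_segments(q, 32);
+		blk_queue_max_segment_size(q, 32 * 1024);
+
+		/* smallest unit the hardware can address */
+		blk_queue_logical_block_size(q, 512);
+	}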
+
+It is also possible that a bounce buffer may be allocated from a high-memory
+area that's not mapped in kernel virtual addr, but within the range that the
+device can use directly; so the bounce page may need to be kmapped during
+copy operations. [Note: This does not hold in the current implementation,
+though]
+
+There are some situations when pages from high memory may need to
+be kmapped, even if bounce buffers are not necessary. For example a device
+may need to abort DMA operations and revert to PIO for the transfer, in
+which case a virtual mapping of the page is required. For SCSI it is also
+done in some scenarios where the low level driver cannot be trusted to
+handle a single sg entry correctly. The driver is expected to perform the
+kmaps as needed on such occasions as appropriate. A driver could also use
+the blk_queue_bounce() routine on its own to bounce highmem i/o to low
+memory for specific requests if so desired.
+
+iii. The i/o scheduler algorithm itself can be replaced/set as appropriate
+
+As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
+queue or pick from (copy) existing generic schedulers and replace/override
+certain portions of it. The 2.5 rewrite provides improved modularization
+of the i/o scheduler. There are more pluggable callbacks, e.g for init,
+add request, extract request, which makes it possible to abstract specific
+i/o scheduling algorithm aspects and details outside of the generic loop.
+It also makes it possible to completely hide the implementation details of
+the i/o scheduler from block drivers.
+
+I/O scheduler wrappers are to be used instead of accessing the queue directly.
+See section 4, The I/O scheduler, for details.
+
+1.2 Tuning Based on High level code capabilities
+------------------------------------------------
+
+i. Application capabilities for raw i/o
+
+This comes from some of the high-performance database/middleware
+requirements where an application prefers to make its own i/o scheduling
+decisions based on an understanding of the access patterns and i/o
+characteristics.
+
+ii. High performance filesystems or other higher level kernel code's
+capabilities
+
+Kernel components like filesystems could also take their own i/o scheduling
+decisions for optimizing performance. Journalling filesystems may need
+some control over i/o ordering.
+
+What kind of support exists at the generic block layer for this?
+
+The flags and rw fields in the bio structure can be used for some tuning
+from above, e.g indicating that an i/o is just a readahead request, or for
+priority settings (currently unused). As far as user applications are
+concerned, they would need an additional mechanism, either via open flags,
+ioctls, or some other upper level mechanism, to communicate such settings
+to block.
+
+1.2.1 Request Priority/Latency
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Todo/Under discussion::
+
+  Arjan's proposed request priority scheme allows higher levels some broad
+  control (high/med/low) over the priority of an i/o request vs other pending
+  requests in the queue.
For example it allows reads for bringing in an + executable page on demand to be given a higher priority over pending write + requests which haven't aged too much on the queue. Potentially this priority + could even be exposed to applications in some manner, providing higher level + tunability. Time based aging avoids starvation of lower priority + requests. Some bits in the bi_opf flags field in the bio structure are + intended to be used for this priority information. + + +1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode) +----------------------------------------------------------------------- + +(e.g Diagnostics, Systems Management) + +There are situations where high-level code needs to have direct access to +the low level device capabilities or requires the ability to issue commands +to the device bypassing some of the intermediate i/o layers. +These could, for example, be special control commands issued through ioctl +interfaces, or could be raw read/write commands that stress the drive's +capabilities for certain kinds of fitness tests. Having direct interfaces at +multiple levels without having to pass through upper layers makes +it possible to perform bottom up validation of the i/o path, layer by +layer, starting from the media. + +The normal i/o submission interfaces, e.g submit_bio, could be bypassed +for specially crafted requests which such ioctl or diagnostics +interfaces would typically use, and the elevator add_request routine +can instead be used to directly insert such requests in the queue or preferably +the blk_do_rq routine can be used to place the request on the queue and +wait for completion. Alternatively, sometimes the caller might just +invoke a lower level driver specific interface with the request as a +parameter. + +If the request is a means for passing on special information associated with +the command, then such information is associated with the request->special +field (rather than misuse the request->buffer field which is meant for the +request data buffer's virtual mapping). + +For passing request data, the caller must build up a bio descriptor +representing the concerned memory buffer if the underlying driver interprets +bio segments or uses the block layer end*request* functions for i/o +completion. Alternatively one could directly use the request->buffer field to +specify the virtual address of the buffer, if the driver expects buffer +addresses passed in this way and ignores bio entries for the request type +involved. In the latter case, the driver would modify and manage the +request->buffer, request->sector and request->nr_sectors or +request->current_nr_sectors fields itself rather than using the block layer +end_request or end_that_request_first completion interfaces. +(See 2.3 or Documentation/block/request.rst for a brief explanation of +the request structure fields) + +:: + + [TBD: end_that_request_last should be usable even in this case; + Perhaps an end_that_direct_request_first routine could be implemented to make + handling direct requests easier for such drivers; Also for drivers that + expect bios, a helper function could be provided for setting up a bio + corresponding to a data buffer] + + + + + +1.3.1 Pre-built Commands +^^^^^^^^^^^^^^^^^^^^^^^^ + +A request can be created with a pre-built custom command to be sent directly +to the device. The cmd block in the request structure has room for filling +in the command bytes. 
(i.e rq->cmd is now 16 bytes in size, and meant for
+command pre-building, and the type of the request is now indicated
+through rq->flags instead of via rq->cmd)
+
+The request structure flags can be set up to indicate the type of request
+in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
+packet command issued via blk_do_rq, REQ_SPECIAL: special request).
+
+It can help to pre-build device commands for requests in advance.
+Drivers can now specify a request prepare function (q->prep_rq_fn) that the
+block layer would invoke to pre-build device commands for a given request,
+or perform other preparatory processing for the request. This routine is
+called by elv_next_request(), i.e. typically just before servicing a request.
+(The prepare function would not be called for requests that have RQF_DONTPREP
+enabled.)
+
+Aside:
+  Pre-building could possibly even be done early, i.e before placing the
+  request on the queue, rather than constructing the command on the fly in the
+  driver while servicing the request queue, when it may affect latencies in
+  interrupt context or responsiveness in general. One way to add early
+  pre-building would be to do it whenever we fail to merge on a request.
+  Now REQ_NOMERGE is set in the request flags to skip this one in the future,
+  which means that it will not change before we feed it to the device. So
+  the pre-builder hook can be invoked there.
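+
+As a rough illustration (not taken from any real driver), a prepare
+function might look like the sketch below. The command bytes follow the
+SCSI READ(10) layout purely as an example, and the "return 0 for
+success" convention is an assumption here; RQF_DONTPREP is the flag
+mentioned above::
+
+	static int my_prep_rq_fn(struct request_queue *q, struct request *rq)
+	{
+		/* build a 10-byte read command from the sector and
+		   count that the block layer has set up in rq */
+		memset(rq->cmd, 0, sizeof(rq->cmd));
+		rq->cmd[0] = 0x28;			/* READ(10) */
+		rq->cmd[2] = (rq->sector >> 24) & 0xff;
+		rq->cmd[3] = (rq->sector >> 16) & 0xff;
+		rq->cmd[4] = (rq->sector >> 8) & 0xff;
+		rq->cmd[5] = rq->sector & 0xff;
+		rq->cmd[7] = (rq->nr_sectors >> 8) & 0xff;
+		rq->cmd[8] = rq->nr_sectors & 0xff;
+
+		rq->flags |= RQF_DONTPREP;	/* don't rebuild on requeue */
+		return 0;
+	}
+
+The driver would point q->prep_rq_fn at such a routine when it sets up
+its queue.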
+
+
+2. Flexible and generic but minimalist i/o structure/descriptor
+===============================================================
+
+2.1 Reason for a new structure and requirements addressed
+---------------------------------------------------------
+
+Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
+layer, and the low level request structure was associated with a chain of
+buffer heads for a contiguous i/o request. This led to certain inefficiencies
+when it came to large i/o requests and readv/writev style operations, as it
+forced such requests to be broken up into small chunks before being passed
+on to the generic block layer, only to be merged by the i/o scheduler
+when the underlying device was capable of handling the i/o in one shot.
+Also, using the buffer head as an i/o structure for i/os that didn't originate
+from the buffer cache unnecessarily added to the weight of the descriptors
+which were generated for each such chunk.
+
+The following were some of the goals and expectations considered in the
+redesign of the block i/o data structure in 2.5.
+
+1. Should be appropriate as a descriptor for both raw and buffered i/o -
+   avoid cache related fields which are irrelevant in the direct/page i/o path,
+   or filesystem block size alignment restrictions which may not be relevant
+   for raw i/o.
+2. Ability to represent high-memory buffers (which do not have a virtual
+   address mapping in kernel address space).
+3. Ability to represent large i/os w/o unnecessarily breaking them up (i.e
+   greater than PAGE_SIZE chunks in one shot)
+4. At the same time, ability to retain independent identity of i/os from
+   different sources or i/o units requiring individual completion (e.g. for
+   latency reasons)
+5. Ability to represent an i/o involving multiple physical memory segments
+   (including non-page aligned page fragments, as specified via readv/writev)
+   without unnecessarily breaking it up, if the underlying device is capable of
+   handling it.
+6. Preferably should be based on a memory descriptor structure that can be
+   passed around different types of subsystems or layers, maybe even
+   networking, without duplication or extra copies of data/descriptor fields
+   themselves in the process
+7. Ability to handle the possibility of splits/merges as the structure passes
+   through layered drivers (lvm, md, evms), with minimal overhead.
+
+The solution was to define a new structure (bio) for the block layer,
+instead of using the buffer head structure (bh) directly, the idea being
+avoidance of some associated baggage and limitations. The bio structure
+is uniformly used for all i/o at the block layer; it forms a part of the
+bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
+mapped to bio structures.
+
+2.2 The bio struct
+------------------
+
+The bio structure uses a vector representation pointing to an array of tuples
+of <page, offset, len> to describe the i/o buffer, and has various other
+fields describing i/o parameters and state that needs to be maintained for
+performing the i/o.
+
+Notice that this representation means that a bio has no virtual address
+mapping at all (unlike buffer heads).
+
+::
+
+  struct bio_vec {
+	struct page	*bv_page;
+	unsigned short	bv_len;
+	unsigned short	bv_offset;
+  };
+
+  /*
+   * main unit of I/O for the block layer and lower layers (ie drivers)
+   */
+  struct bio {
+	struct bio	*bi_next;	/* request queue link */
+	struct block_device *bi_bdev;	/* target device */
+	unsigned long	bi_flags;	/* status, command, etc */
+	unsigned long	bi_opf;		/* low bits: r/w, high: priority */
+
+	unsigned int	bi_vcnt;	/* how many bio_vec's */
+	struct bvec_iter bi_iter;	/* current index into bio_vec array */
+
+	unsigned int	bi_size;	/* total size in bytes */
+	unsigned short	bi_hw_segments;	/* segments after DMA remapping */
+	unsigned int	bi_max;		/* max bio_vecs we can hold
+					   used as index into pool */
+	struct bio_vec	*bi_io_vec;	/* the actual vec list */
+	bio_end_io_t	*bi_end_io;	/* bi_end_io (bio) */
+	atomic_t	bi_cnt;		/* pin count: free when it hits zero */
+	void		*bi_private;
+  };
+
+With this multipage bio design:
+
+- Large i/os can be sent down in one go using a bio_vec list consisting
+  of an array of <page, offset, len> fragments (similar to the way fragments
+  are represented in the zero-copy network code)
+- Splitting of an i/o request across multiple devices (as in the case of
+  lvm or raid) is achieved by cloning the bio (where the clone points to
+  the same bi_io_vec array, but with the index and size accordingly modified)
+- A linked list of bios is used as before for unrelated merges [*]_ - this
+  avoids reallocs and makes independent completions easier to handle.
+- Code that traverses the req list can find all the segments of a bio
+  by using rq_for_each_segment. This handles the fact that a request
+  has multiple bios, each of which can have multiple segments.
+- Drivers which can't process a large bio in one shot can use the bi_iter
+  field to keep track of the next bio_vec entry to process.
+  (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
+  [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying
+  the bi_offset and len fields]
+
+.. [*]
+
+   unrelated merges -- a request ends up containing two or more bios that
+   didn't originate from the same place.
+
+The bi_end_io() i/o callback gets called on i/o completion of the entire bio.
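+
+For illustration only, a driver that wanted to inspect the
+<page, offset, len> fragments of a single bio could walk them with the
+bio_for_each_segment() helper; the sketch below assumes the index-based
+form of the macro from this era::
+
+	struct bio_vec *bvec;
+	int i;
+
+	bio_for_each_segment(bvec, bio, i) {
+		/* each iteration yields one <page, offset, len> fragment */
+		printk("fragment: page %p offset %u len %u\n",
+		       bvec->bv_page, bvec->bv_offset, bvec->bv_len);
+	}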
+
+At a lower level, drivers build a scatter gather list from the merged bios.
+The scatter gather list is in the form of an array of <page, offset, len>
+entries with their corresponding dma address mappings filled in at the
+appropriate time. As an optimization, contiguous physical pages can be
+covered by a single entry where <page> refers to the first page and <len>
+covers the range of pages (up to 16 contiguous pages could be covered this
+way). There is a helper routine (blk_rq_map_sg) which drivers can use to build
+the sg list.
+
+Note: Right now the only user of bios with more than one page is ll_rw_kio,
+which in turn means that only raw I/O uses it (direct i/o may not work
+right now). The intent however is to enable clustering of pages etc to
+become possible. The pagebuf abstraction layer from SGI also uses multi-page
+bios, but that is currently not included in the stock development kernels.
+The same is true of Andrew Morton's work-in-progress multipage bio writeout
+and readahead patches.
+
+2.3 Changes in the Request Structure
+------------------------------------
+
+The request structure is the structure that gets passed down to low level
+drivers. The block layer make_request function builds up a request structure,
+places it on the queue and invokes the driver's request_fn. The driver makes
+use of the block layer helper routine elv_next_request to pull the next
+request off the queue. Control or diagnostic functions might bypass block
+and directly invoke underlying driver entry points passing in a specially
+constructed request structure.
+
+Only some relevant fields (mainly those which changed or may be referred
+to in some of the discussion here) are listed below, not necessarily in
+the order in which they occur in the structure (see include/linux/blkdev.h).
+Refer to Documentation/block/request.rst for details about all the request
+structure fields and a quick reference about the layers which are
+supposed to use or modify those fields::
+
+  struct request {
+	struct list_head queuelist;  /* Not meant to be directly accessed by
+					the driver.
+					Used by q->elv_next_request_fn
+					rq->queue is gone
+					*/
+	.
+	.
+	unsigned char cmd[16]; /* prebuilt command data block */
+	unsigned long flags;   /* also includes earlier rq->cmd settings */
+	.
+	.
+	sector_t sector; /* this field is now of type sector_t instead of int
+			    preparation for 64 bit sectors */
+	.
+	.
+
+	/* Number of scatter-gather DMA addr+len pairs after
+	 * physical address coalescing is performed.
+	 */
+	unsigned short nr_phys_segments;
+
+	/* Number of scatter-gather addr+len pairs after
+	 * physical and DMA remapping hardware coalescing is performed.
+	 * This is the number of scatter-gather entries the driver
+	 * will actually have to deal with after DMA mapping is done.
+	 */
+	unsigned short nr_hw_segments;
+
+	/* Various sector counts */
+	unsigned long nr_sectors;  /* no. of sectors left: driver modifiable */
+	unsigned long hard_nr_sectors;  /* block internal copy of above */
+	unsigned int current_nr_sectors;  /* no. of sectors left in the
+					     current segment:driver modifiable */
+	unsigned long hard_cur_sectors; /* block internal copy of the above */
+	.
+	.
+	int tag;	/* command tag associated with request */
+	void *special;	/* same as before */
+	char *buffer;	/* valid only for low memory buffers up to
+			   current_nr_sectors */
+	.
+	.
+	struct bio *bio, *biotail;  /* bio list instead of bh */
+	struct request_list *rl;
+  }
+
+See the req_ops and req_flag_bits definitions for an explanation of the
+various flags available. Some bits are used by the block layer or i/o
+scheduler.
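+
+A bare-bones request_fn built on these helpers might be shaped like the
+following (hypothetical driver; error handling and the actual hardware
+programming are omitted)::
+
+	static void my_request_fn(struct request_queue *q)
+	{
+		struct request *rq;
+
+		while ((rq = elv_next_request(q)) != NULL) {
+			/* rq->sector, rq->nr_sectors and rq->bio have
+			   already been set up by the block layer; hand
+			   the request to the hardware here */
+		}
+	}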
+
+The behaviour of the various sector counts is almost the same as before,
+except that since we have multi-segment bios, current_nr_sectors refers
+to the number of sectors in the current segment being processed, which could
+be one of the many segments in the current bio (i.e i/o completion unit).
+The nr_sectors value refers to the total number of sectors in the whole
+request that remain to be transferred (no change). The purpose of the
+hard_xxx values is for block to remember these counts every time it hands
+over the request to the driver. These values are updated by block on
+end_that_request_first, i.e. every time the driver completes a part of the
+transfer and invokes block end*request helpers to mark this. The
+driver should not modify these values. The block layer sets up the
+nr_sectors and current_nr_sectors fields (based on the corresponding
+hard_xxx values and the number of bytes transferred) and updates them on
+every transfer that invokes end_that_request_first. It does the same for the
+buffer, bio, bio->bi_iter fields too.
+
+The buffer field is just a virtual address mapping of the current segment
+of the i/o buffer in cases where the buffer resides in low-memory. For high
+memory i/o, this field is not valid and must not be used by drivers.
+
+Code that sets up its own request structures and passes them down to
+a driver needs to be careful about interoperation with the block layer helper
+functions which the driver uses. (Section 1.3)
+
+3. Using bios
+=============
+
+3.1 Setup/Teardown
+------------------
+
+There are routines for managing the allocation, reference counting, and
+freeing of bios (bio_alloc, bio_get, bio_put).
+
+This makes use of Ingo Molnar's mempool implementation, which enables
+subsystems like bio to maintain their own reserve memory pools for guaranteed
+deadlock-free allocations during extreme VM load. For example, the VM
+subsystem makes use of the block layer to writeout dirty pages in order to be
+able to free up memory space, a case which needs careful handling. The
+allocation logic draws from the preallocated emergency reserve in situations
+where it cannot allocate through normal means. If the pool is empty and it
+can wait, then it would trigger action that would help free up memory or
+replenish the pool (without deadlocking) and wait for availability in the pool.
+If it is in IRQ context, and hence not in a position to do this, allocation
+could fail if the pool is empty. In general mempool always first tries to
+perform allocation without having to wait, even if it means digging into the
+pool, as long as the pool is not less than 50% full.
+
+On a free, memory is released to the pool or directly freed depending on
+the current availability in the pool. The mempool interface lets the
+subsystem specify the routines to be used for normal alloc and free. In the
+case of bio, these routines make use of the standard slab allocator.
+
+The caller of bio_alloc is expected to take certain steps to avoid
+deadlocks, e.g. avoid trying to allocate more memory from the pool while
+already holding memory obtained from the pool.
+
+::
+
+  [TBD: This is a potential issue, though a rare possibility
+   in the bounce bio allocation that happens in the current code, since
+   it ends up allocating a second bio from the same pool while
+   holding the original bio ]
+
+Memory allocated from the pool should be released back within a limited
+amount of time (in the case of bio, that would be after the i/o is completed).
+This ensures that if part of the pool has been used up, some work (in this
+case i/o) must already be in progress and memory would be available when it
+is over. If allocating from multiple pools in the same code path, the order
+or hierarchy of allocation needs to be consistent, just the way one deals
+with multiple locks.
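+
+A minimal (hypothetical) sketch of this discipline, for a caller that
+builds a single-vector bio and still needs to look at it after
+submission; my_end_io and the era's submit_bio(rw, bio) form are
+assumptions here::
+
+	struct bio *bio;
+
+	/* may sleep, but draws on the bio mempool rather than
+	   recursing into the i/o path */
+	bio = bio_alloc(GFP_NOIO, 1);
+
+	/* fill in bi_bdev, the starting sector and the bio_vec entry,
+	   then set the completion callback */
+	bio->bi_end_io = my_end_io;
+
+	bio_get(bio);		/* we touch the bio after submission */
+	submit_bio(READ, bio);
+	/* ... it is now safe to inspect bio here ... */
+	bio_put(bio);		/* drop our extra reference */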
+
+The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
+for a non-clone bio. There are six pools set up for different size biovecs,
+so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
+given size from these slabs.
+
+The bio_get() routine may be used to hold an extra reference on a bio prior
+to i/o submission, if the bio fields are likely to be accessed after the
+i/o is issued (since the bio may otherwise get freed in case i/o completion
+happens in the meantime).
+
+The bio_clone_fast() routine may be used to duplicate a bio, where the clone
+shares the bio_vec_list with the original bio (i.e. both point to the
+same bio_vec_list). This would typically be used for splitting i/o requests
+in lvm or md.
+
+3.2 Generic bio helper Routines
+-------------------------------
+
+3.2.1 Traversing segments and completion units in a request
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The macro rq_for_each_segment() should be used for traversing the bios
+in the request list (drivers should avoid directly trying to do it
+themselves). Using these helpers should also make it easier to cope
+with block changes in the future.
+
+::
+
+	struct req_iterator iter;
+	rq_for_each_segment(bio_vec, rq, iter)
+		/* bio_vec is now current segment */
+
+I/O completion callbacks are per-bio rather than per-segment, so drivers
+that traverse bio chains on completion need to keep that in mind. Drivers
+which don't make a distinction between segments and completion units would
+need to be reorganized to support multi-segment bios.
+
+3.2.2 Setting up DMA scatterlists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The blk_rq_map_sg() helper routine would be used for setting up scatter
+gather lists from a request, so a driver need not do it on its own::
+
+	nr_segments = blk_rq_map_sg(q, rq, scatterlist);
+
+The helper routine provides a level of abstraction which makes it easier
+to modify the internals of request to scatterlist conversion down the line
+without breaking drivers. The blk_rq_map_sg routine takes care of several
+things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
+is set) and correct segment accounting to avoid exceeding the limits which
+the i/o hardware can handle, based on various queue properties.
+
+- Prevents a clustered segment from crossing a 4GB mem boundary
+- Avoids building segments that would exceed the number of physical
+  memory segments that the driver can handle (phys_segments) and the
+  number that the underlying hardware can handle at once, accounting for
+  DMA remapping (hw_segments) (i.e. IOMMU aware limits).
+
+Routines which the low level driver can use to set up the segment limits:
+
+blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
+hw data segments in a request (i.e. the maximum number of address/length
+pairs the host adapter can actually hand to the device at once)
+
+blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
+of physical data segments in a request (i.e. the largest sized scatter list
+a driver could handle).
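+
+Put together, the (hypothetical) dma setup path of a driver using these
+helpers could look like the following; MY_MAX_SEGMENTS stands for
+whatever limit the driver registered with the queue, and the transfer
+direction would of course depend on the request::
+
+	struct scatterlist sg[MY_MAX_SEGMENTS];
+	int nents;
+
+	/* never returns more entries than the queue limits allow */
+	nents = blk_rq_map_sg(q, rq, sg);
+
+	/* apply dma remapping; may coalesce entries further */
+	nents = pci_map_sg(pdev, sg, nents, PCI_DMA_FROMDEVICE);
+
+	/* program the controller with nents address/length pairs */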
+
+3.2.3 I/O completion
+^^^^^^^^^^^^^^^^^^^^
+
+The existing generic block layer helper routines end_request,
+end_that_request_first and end_that_request_last can be used for i/o
+completion (and setting things up so the rest of the i/o or the next
+request can be kicked off) as before. With the introduction of multi-page
+bio support, end_that_request_first requires an additional argument indicating
+the number of sectors completed.
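+
+For example, a driver's completion path might be shaped like this
+(sketch only; 'uptodate' is 1 on success and 0 on error, 'nsect' is the
+number of sectors just transferred, and the dequeue step follows this
+era's conventions)::
+
+	if (!end_that_request_first(rq, uptodate, nsect)) {
+		/* no segments left: the whole request is complete */
+		blkdev_dequeue_request(rq);
+		end_that_request_last(rq);
+	}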
+
+3.2.4 Implications for drivers that do not interpret bios
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+(don't handle multiple segments)
+
+Drivers that do not interpret bios, e.g. those which do not handle multiple
+segments, do not support i/o into high memory addresses (require bounce
+buffers) and expect only virtually mapped buffers, can access the rq->buffer
+field. As before, the driver should use current_nr_sectors to determine the
+size of the remaining data in the current segment (that is the maximum it can
+transfer in one go unless it interprets segments), and rely on the block layer
+end_request, or end_that_request_first/last to take care of all accounting
+and transparent mapping of the next bio segment when a segment boundary
+is crossed on completion of a transfer. (The end*request* functions should
+be used only if the request has come down from the block/bio path, not for
+direct access requests which only specify rq->buffer without a valid rq->bio.)
+
+3.3 I/O Submission
+------------------
+
+The routine submit_bio() is used to submit a single i/o. Higher level i/o
+routines make use of this:
+
+(a) Buffered i/o:
+
+The routine submit_bh() invokes submit_bio() on a bio corresponding to the
+bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.
+
+(b) Kiobuf i/o (for raw/direct i/o):
+
+The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
+maps the array to one or more multi-page bios, issuing submit_bio() to
+perform the i/o on each of these.
+
+The embedded bh array in the kiobuf structure has been removed and no
+preallocation of bios is done for kiobufs. [The intent is to remove the
+blocks array as well, but it's currently in there to kludge around direct i/o.]
+Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.
+
+Todo/Observation:
+
+  A single kiobuf structure is assumed to correspond to a contiguous range
+  of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
+  So right now it wouldn't work for direct i/o on non-contiguous blocks.
+  This is to be resolved. The eventual direction is to replace kiobuf
+  by kvec's.
+
+  Badari Pulavarty has a patch to implement direct i/o correctly using
+  bio and kvec.
+
+
+(c) Page i/o:
+
+Todo/Under discussion:
+
+  Andrew Morton's multi-page bio patches attempt to issue multi-page
+  writeouts (and reads) from the page cache, by directly building up
+  large bios for submission completely bypassing the usage of buffer
+  heads. This work is still in progress.
+
+  Christoph Hellwig had some code that uses bios for page-io (rather than
+  bh). This isn't included in bio as yet. Christoph was also working on a
+  design for representing virtual/real extents as an entity and modifying
+  some of the address space ops interfaces to utilize this abstraction rather
+  than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
+  abstraction, but intended to be as lightweight as possible).
+
+(d) Direct access i/o:
+
+Direct access requests that do not contain bios would be submitted differently
+as discussed earlier in section 1.3.
+
+Aside:
+
+  Kvec i/o:
+
+  Ben LaHaise's aio code uses a slightly different structure instead
+  of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
+  tuples (very much like the networking code), together with a callback
+  function and data pointer. This is embedded into a brw_cb structure when
+  passed to brw_kvec_async().
+
+  Now it should be possible to directly map these kvecs to a bio. Just as while
+  cloning, in this case rather than PRE_BUILT bio_vecs, we set the bi_io_vec
+  array pointer to point to the veclet array in kvecs.
+
+  TBD: In order for this to work, some changes are needed in the way multi-page
+  bios are handled today. The values of the tuples in such a vector passed in
+  from higher level code should not be modified by the block layer in the
+  course of its request processing, since that would make it hard for the
+  higher layer to continue to use the vector descriptor (kvec) after i/o
+  completes. Instead, all such transient state should be maintained in the
+  request structure and passed on in some way to the endio completion routine.
+
+
+4. The I/O scheduler
+====================
+
+The I/O scheduler, a.k.a. the elevator, is implemented in two layers: a
+generic dispatch queue and specific I/O schedulers. Unless stated otherwise,
+"elevator" is used to refer to both parts and "I/O scheduler" to specific
+I/O schedulers.
+
+The block layer implements the generic dispatch queue in `block/*.c`.
+The generic dispatch queue is responsible for requeueing, handling non-fs
+requests and all other subtleties.
+
+Specific I/O schedulers are responsible for ordering normal filesystem
+requests. They can also choose to delay certain requests to improve
+throughput, or for other purposes. As the plural form indicates, there are
+multiple I/O schedulers. They can be built as modules but at least one should
+be built into the kernel. Each queue can choose a different one and can also
+change to another one dynamically.
+
+A block layer call to the i/o scheduler follows the convention elv_xxx(). This
+calls elevator_xxx_fn in the elevator switch (block/elevator.c). Oh, xxx
+and xxx might not match exactly, but use your imagination. If an elevator
+doesn't implement a function, the switch does nothing or some minimal
+housekeeping work.
+
+4.1. I/O scheduler API
+----------------------
+
+The functions an elevator may implement are: (* are mandatory)
+
+=============================== ================================================
+elevator_merge_fn		called to query requests for merge with a bio
+
+elevator_merge_req_fn		called when two requests get merged. The one
+				which gets merged into the other one will
+				never be seen by the I/O scheduler again. IOW,
+				after being merged, the request is gone.
+
+elevator_merged_fn		called when a request in the scheduler has been
+				involved in a merge. It is used in the deadline
+				scheduler for example, to reposition the request
+				if its sorting order has changed.
+
+elevator_allow_merge_fn		called whenever the block layer determines
+				that a bio can be merged into an existing
+				request safely. The io scheduler may still
+				want to stop a merge at this point if it
+				results in some sort of conflict internally;
+				this hook allows it to do that. Note however
+				that two *requests* can still be merged at a
+				later time. Currently the io scheduler has no
+				way to prevent that.
It can only learn about the fact + from elevator_merge_req_fn callback. + +elevator_dispatch_fn* fills the dispatch queue with ready requests. + I/O schedulers are free to postpone requests by + not filling the dispatch queue unless @force + is non-zero. Once dispatched, I/O schedulers + are not allowed to manipulate the requests - + they belong to generic dispatch queue. + +elevator_add_req_fn* called to add a new request into the scheduler + +elevator_former_req_fn +elevator_latter_req_fn These return the request before or after the + one specified in disk sort order. Used by the + block layer to find merge possibilities. + +elevator_completed_req_fn called when a request is completed. + +elevator_may_queue_fn returns true if the scheduler wants to allow the + current context to queue a new request even if + it is over the queue limit. This must be used + very carefully!! + +elevator_set_req_fn +elevator_put_req_fn Must be used to allocate and free any elevator + specific storage for a request. + +elevator_activate_req_fn Called when device driver first sees a request. + I/O schedulers can use this callback to + determine when actual execution of a request + starts. +elevator_deactivate_req_fn Called when device driver decides to delay + a request by requeueing it. + +elevator_init_fn* +elevator_exit_fn Allocate and free any elevator specific storage + for a queue. +=============================== ================================================ + +4.2 Request flows seen by I/O schedulers +---------------------------------------- + +All requests seen by I/O schedulers strictly follow one of the following three +flows. + + set_req_fn -> + + i. add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn -> + (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn + ii. add_req_fn -> (merged_fn ->)* -> merge_req_fn + iii. [none] + + -> put_req_fn + +4.3 I/O scheduler implementation +-------------------------------- + +The generic i/o scheduler algorithm attempts to sort/merge/batch requests for +optimal disk scan and request servicing performance (based on generic +principles and device capabilities), optimized for: + +i. improved throughput +ii. improved latency +iii. better utilization of h/w & CPU time + +Characteristics: + +i. Binary tree +AS and deadline i/o schedulers use red black binary trees for disk position +sorting and searching, and a fifo linked list for time-based searching. This +gives good scalability and good availability of information. Requests are +almost always dispatched in disk sort order, so a cache is kept of the next +request in sort order to prevent binary tree lookups. + +This arrangement is not a generic block layer characteristic however, so +elevators may implement queues as they please. + +ii. Merge hash +AS and deadline use a hash table indexed by the last sector of a request. This +enables merging code to quickly look up "back merge" candidates, even when +multiple I/O streams are being performed at once on one disk. + +"Front merges", a new request being merged at the front of an existing request, +are far less common than "back merges" due to the nature of most I/O patterns. +Front merges are handled by the binary trees in AS and deadline schedulers. + +iii. 
Plugging the queue to batch requests in anticipation of opportunities for
+     merge/sort optimizations
+
+Plugging is an approach that the current i/o scheduling algorithm resorts to so
+that it collects up enough requests in the queue to be able to take
+advantage of the sorting/merging logic in the elevator. If the
+queue is empty when a request comes in, then it plugs the request queue
+(sort of like plugging the bath tub of a vessel to get fluid to build up)
+till it fills up with a few more requests, before starting to service
+the requests. This provides an opportunity to merge/sort the requests before
+passing them down to the device. There are various conditions under which the
+queue is unplugged (to open up the flow again), either through a scheduled
+task or on demand. For example wait_on_buffer sets the unplugging going
+through sync_buffer() running blk_run_address_space(mapping). Or the caller
+can do it explicitly through blk_unplug(bdev). So in the read case,
+the queue gets explicitly unplugged as part of waiting for completion on that
+buffer.
+
+Aside:
+  This is kind of controversial territory, as it's not clear if plugging is
+  always the right thing to do. Devices typically have their own queues,
+  and allowing a big queue to build up in software, while letting the device be
+  idle for a while may not always make sense. The trick is to handle the fine
+  balance between when to plug and when to open up. Also now that we have
+  multi-page bios being queued in one shot, we may not need to wait to merge
+  a big request from the broken up pieces coming by.
+
+4.4 I/O contexts
+----------------
+
+I/O contexts provide a dynamically allocated per process data area. They may
+be used in I/O schedulers, and in the block layer (they could be used for IO
+stats or priorities, for example). See `*io_context` in block/ll_rw_blk.c,
+and as-iosched.c for an example of usage in an i/o scheduler.
+
+
+5. Scalability related changes
+==============================
+
+5.1 Granular Locking: io_request_lock replaced by a per-queue lock
+------------------------------------------------------------------
+
+The global io_request_lock has been removed as of 2.5, to avoid
+the scalability bottleneck it was causing, and has been replaced by more
+granular locking. The request queue structure has a pointer to the
+lock to be used for that queue. As a result, locking can now be
+per-queue, with a provision for sharing a lock across queues if
+necessary (e.g the scsi layer sets the queue lock pointers to the
+corresponding adapter lock, which results in a per host locking
+granularity). The locking semantics are the same, i.e. locking is
+still imposed by the block layer, grabbing the lock before
+request_fn execution, which means that lots of older drivers
+should still be SMP safe. Drivers are free to drop the queue
+lock themselves, if required. Drivers that explicitly used the
+io_request_lock for serialization need to be modified accordingly.
+Usually it's as easy as adding a global lock::
+
+	static DEFINE_SPINLOCK(my_driver_lock);
+
+and passing the address to that lock to blk_init_queue().
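+
+For instance (sketch; my_request_fn is the driver's request function,
+and the blk_init_queue() signature with an explicit lock argument is
+assumed to be the one in use in this era)::
+
+	q = blk_init_queue(my_request_fn, &my_driver_lock);
+	if (!q)
+		return -ENOMEM;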
+
+5.2 64 bit sector numbers (sector_t prepares for 64 bit support)
+----------------------------------------------------------------
+
+The sector number used in the bio structure has been changed to sector_t,
+which could be defined as 64 bit in preparation for 64 bit sector support.
+
+6. Other Changes/Implications
+=============================
+
+6.1 Partition re-mapping handled by the generic block layer
+-----------------------------------------------------------
+
+In 2.5 some of the gendisk/partition related code has been reorganized.
+Now the generic block layer performs partition-remapping early and thus
+provides drivers with a sector number relative to the whole device, rather
+than having to take the partition number into account in order to arrive
+at the true sector number. The routine blk_partition_remap() is invoked by
+generic_make_request even before invoking the queue specific make_request_fn,
+so the i/o scheduler also gets to operate on whole disk sector numbers. This
+should typically not require changes to block drivers; a driver just never
+gets to do its own partition sector offset calculations since all bios
+sent to it are offset from the beginning of the device.
+
+
+7. A Few Tips on Migration of older drivers
+===========================================
+
+Old-style drivers that just use CURRENT and ignore clustered requests
+may not need much change. The generic layer will automatically handle
+clustered requests, multi-page bios, etc for the driver.
+
+For a low performance driver or hardware that is PIO driven or just doesn't
+support scatter-gather, changes should be minimal too.
+
+The following are some points to keep in mind when converting old drivers
+to bio.
+
+Drivers should use elv_next_request to pick up requests and are no longer
+supposed to handle looping directly over the request list.
+(struct request->queue has been removed)
+
+Now end_that_request_first takes an additional number_of_sectors argument.
+It used to always handle just the first buffer_head in a request; now
+it will loop and handle as many sectors (on a bio-segment granularity)
+as specified.
+
+Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
+right thing to use is bio_endio(bio) instead.
+
+If the driver is dropping the io_request_lock from its request_fn strategy,
+then it just needs to replace that with q->queue_lock instead.
+
+As described in Sec 1.1, drivers can set max sector size, max segment size,
+etc per queue now. Drivers that used to define their own merge functions
+to handle things like this can now just use the blk_queue_* functions at
+blk_init_queue time.
+
+Drivers no longer have to map a {partition, sector offset} into the
+correct absolute location anymore; this is done by the block layer, so
+where a driver received a request ala this before::
+
+	rq->rq_dev = mk_kdev(3, 5);	/* /dev/hda5 */
+	rq->sector = 0;			/* first sector on hda5 */
+
+it will now see::
+
+	rq->rq_dev = mk_kdev(3, 0);	/* /dev/hda */
+	rq->sector = 123128;		/* offset from start of disk */
+
+As mentioned, there is no virtual mapping of a bio. For DMA, this is
+not a problem as the driver probably never will need a virtual mapping.
+Instead it needs a bus mapping (dma_map_page for a single segment or
+use dma_map_sg for scatter gather) to be able to ship it to the device. For
+PIO drivers (or drivers that need to revert to PIO transfer once in a
+while (IDE for example)), where the CPU is doing the actual data
+transfer, a virtual mapping is needed. If the driver supports highmem I/O
+(Sec 1.1, (ii)), it needs to use kmap_atomic or similar to temporarily map
+a bio into the virtual address space.
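+
+A (hypothetical) PIO write-out of one bio segment under these rules
+might look like the following; my_pio_write() is a stand-in for the
+actual device access, and the kmap slot name is an assumption about
+this era's kmap_atomic interface::
+
+	char *buf;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	buf = kmap_atomic(bvec->bv_page, KM_BIO_SRC_IRQ);
+	/* the CPU copies bv_len bytes out to the device */
+	my_pio_write(buf + bvec->bv_offset, bvec->bv_len);
+	kunmap_atomic(buf, KM_BIO_SRC_IRQ);
+	local_irq_restore(flags);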
+
+8. Prior/Related/Impacted patches
+=================================
+
+8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
+-----------------------------------------------------
+
+- orig kiobuf & raw i/o patches (now in 2.4 tree)
+- direct kiobuf based i/o to devices (no intermediate bh's)
+- page i/o using kiobuf
+- kiobuf splitting for lvm (mkp)
+- elevator support for kiobuf request merging (axboe)
+
+8.2. Zero-copy networking (Dave Miller)
+---------------------------------------
+
+8.3. SGI XFS - pagebuf patches - use of kiobufs
+-----------------------------------------------
+8.4. Multi-page pioent patch for bio (Christoph Hellwig)
+--------------------------------------------------------
+8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
+--------------------------------------------------------------------
+8.6. Async i/o implementation patch (Ben LaHaise)
+-------------------------------------------------
+8.7. EVMS layering design (IBM EVMS team)
+-----------------------------------------
+8.8. Larger page cache size patch (Ben LaHaise) and Large page size (Daniel Phillips)
+-------------------------------------------------------------------------------------
+
+    => larger contiguous physical memory buffers
+
+8.9. VM reservations patch (Ben LaHaise)
+----------------------------------------
+8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
+----------------------------------------------------------
+8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
+--------------------------------------------------------------------------
+8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar, Badari)
+-------------------------------------------------------------------------------
+8.13 Priority based i/o scheduler - prepatches (Arjan van de Ven)
+------------------------------------------------------------------
+8.14 IDE Taskfile i/o patch (Andre Hedrick)
+--------------------------------------------
+8.15 Multi-page writeout and readahead patches (Andrew Morton)
+---------------------------------------------------------------
+8.16 Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarty)
+-----------------------------------------------------------------------
+
+9. Other References
+===================
+
+9.1 The Splice I/O Model
+------------------------
+
+Larry McVoy (and subsequent discussions on lkml, and Linus' comments -
+Jan 2001)
+
+9.2 Discussions about kiobuf and bh design
+------------------------------------------
+
+On lkml between sct, linus, alan et al - Feb-March 2001 (many of the
+initial thoughts that led to bio were brought up in this discussion thread)
+
+9.3 Discussions on mempool on lkml - Dec 2001.
+----------------------------------------------
diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
deleted file mode 100644
index 31c177663ed5..000000000000
--- a/Documentation/block/biodoc.txt
+++ /dev/null
@@ -1,1076 +0,0 @@
-	Notes on the Generic Block Layer Rewrite in Linux 2.5
-	=====================================================
-
-Notes Written on Jan 15, 2002:
-	Jens Axboe
-	Suparna Bhattacharya
-
-Last Updated May 2, 2002
-September 2003: Updated I/O Scheduler portions
-	Nick Piggin
-
-Introduction:
-
-These are some notes describing some aspects of the 2.5 block layer in the
-context of the bio rewrite. The idea is to bring out some of the key
-changes and a glimpse of the rationale behind those changes.
-
-Please mail corrections & suggestions to suparna@in.ibm.com.
- -Credits: ---------- - -2.5 bio rewrite: - Jens Axboe - -Many aspects of the generic block layer redesign were driven by and evolved -over discussions, prior patches and the collective experience of several -people. See sections 8 and 9 for a list of some related references. - -The following people helped with review comments and inputs for this -document: - Christoph Hellwig - Arjan van de Ven - Randy Dunlap - Andre Hedrick - -The following people helped with fixes/contributions to the bio patches -while it was still work-in-progress: - David S. Miller - - -Description of Contents: ------------------------- - -1. Scope for tuning of logic to various needs - 1.1 Tuning based on device or low level driver capabilities - - Per-queue parameters - - Highmem I/O support - - I/O scheduler modularization - 1.2 Tuning based on high level requirements/capabilities - 1.2.1 Request Priority/Latency - 1.3 Direct access/bypass to lower layers for diagnostics and special - device operations - 1.3.1 Pre-built commands -2. New flexible and generic but minimalist i/o structure or descriptor - (instead of using buffer heads at the i/o layer) - 2.1 Requirements/Goals addressed - 2.2 The bio struct in detail (multi-page io unit) - 2.3 Changes in the request structure -3. Using bios - 3.1 Setup/teardown (allocation, splitting) - 3.2 Generic bio helper routines - 3.2.1 Traversing segments and completion units in a request - 3.2.2 Setting up DMA scatterlists - 3.2.3 I/O completion - 3.2.4 Implications for drivers that do not interpret bios (don't handle - multiple segments) - 3.3 I/O submission -4. The I/O scheduler -5. Scalability related changes - 5.1 Granular locking: Removal of io_request_lock - 5.2 Prepare for transition to 64 bit sector_t -6. Other Changes/Implications - 6.1 Partition re-mapping handled by the generic block layer -7. A few tips on migration of older drivers -8. A list of prior/related/impacted patches/ideas -9. Other References/Discussion Threads - ---------------------------------------------------------------------------- - -Bio Notes --------- - -Let us discuss the changes in the context of how some overall goals for the -block layer are addressed. - -1. Scope for tuning the generic logic to satisfy various requirements - -The block layer design supports adaptable abstractions to handle common -processing with the ability to tune the logic to an appropriate extent -depending on the nature of the device and the requirements of the caller. -One of the objectives of the rewrite was to increase the degree of tunability -and to enable higher level code to utilize underlying device/driver -capabilities to the maximum extent for better i/o performance. This is -important especially in the light of ever improving hardware capabilities -and application/middleware software designed to take advantage of these -capabilities. - -1.1 Tuning based on low level device / driver capabilities - -Sophisticated devices with large built-in caches, intelligent i/o scheduling -optimizations, high memory DMA support, etc may find some of the -generic processing an overhead, while for less capable devices the -generic functionality is essential for performance or correctness reasons. -Knowledge of some of the capabilities or parameters of the device should be -used at the generic block layer to take the right decisions on -behalf of the driver. - -How is this achieved ? - -Tuning at a per-queue level: - -i. 
Per-queue limits/values exported to the generic layer by the driver - -Various parameters that the generic i/o scheduler logic uses are set at -a per-queue level (e.g maximum request size, maximum number of segments in -a scatter-gather list, logical block size) - -Some parameters that were earlier available as global arrays indexed by -major/minor are now directly associated with the queue. Some of these may -move into the block device structure in the future. Some characteristics -have been incorporated into a queue flags field rather than separate fields -in themselves. There are blk_queue_xxx functions to set the parameters, -rather than update the fields directly - -Some new queue property settings: - - blk_queue_bounce_limit(q, u64 dma_address) - Enable I/O to highmem pages, dma_address being the - limit. No highmem default. - - blk_queue_max_sectors(q, max_sectors) - Sets two variables that limit the size of the request. - - - The request queue's max_sectors, which is a soft size in - units of 512 byte sectors, and could be dynamically varied - by the core kernel. - - - The request queue's max_hw_sectors, which is a hard limit - and reflects the maximum size request a driver can handle - in units of 512 byte sectors. - - The default for both max_sectors and max_hw_sectors is - 255. The upper limit of max_sectors is 1024. - - blk_queue_max_phys_segments(q, max_segments) - Maximum physical segments you can handle in a request. 128 - default (driver limit). (See 3.2.2) - - blk_queue_max_hw_segments(q, max_segments) - Maximum dma segments the hardware can handle in a request. 128 - default (host adapter limit, after dma remapping). - (See 3.2.2) - - blk_queue_max_segment_size(q, max_seg_size) - Maximum size of a clustered segment, 64kB default. - - blk_queue_logical_block_size(q, logical_block_size) - Lowest possible sector size that the hardware can operate - on, 512 bytes default. - -New queue flags: - - QUEUE_FLAG_CLUSTER (see 3.2.2) - QUEUE_FLAG_QUEUED (see 3.2.4) - - -ii. High-mem i/o capabilities are now considered the default - -The generic bounce buffer logic, present in 2.4, where the block layer would -by default copyin/out i/o requests on high-memory buffers to low-memory buffers -assuming that the driver wouldn't be able to handle it directly, has been -changed in 2.5. The bounce logic is now applied only for memory ranges -for which the device cannot handle i/o. A driver can specify this by -setting the queue bounce limit for the request queue for the device -(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out -where a device is capable of handling high memory i/o. - -In order to enable high-memory i/o where the device is capable of supporting -it, the pci dma mapping routines and associated data structures have now been -modified to accomplish a direct page -> bus translation, without requiring -a virtual address mapping (unlike the earlier scheme of virtual address --> bus translation). So this works uniformly for high-memory pages (which -do not have a corresponding kernel virtual address space mapping) and -low-memory pages. - -Note: Please refer to Documentation/DMA-API-HOWTO.txt for a discussion -on PCI high mem DMA aspects and mapping of scatter gather lists, and support -for 64 bit PCI. - -Special handling is required only for cases where i/o needs to happen on -pages at physical memory addresses beyond what the device can support. 
In such cases (i/o destined for pages beyond the range the device can
address), a bounce bio representing a buffer from the supported memory range
is used for performing the i/o with copyin/copyout as needed depending on
the type of the operation. For example, in case of a read operation, the
data read has to be copied to the original buffer on i/o completion, so a
callback routine is set up to do this, while for write, the data is copied
from the original buffer to the bounce buffer prior to issuing the
operation. Since an original buffer may be in a high memory area that's not
mapped in kernel virtual address space, a kmap operation may be required for
performing the copy, and special care may be needed in the completion path
as it may not be in irq context. Special care is also required (by way of
GFP flags) when allocating bounce buffers, to avoid certain highmem
deadlock possibilities.

It is also possible that a bounce buffer may be allocated from a high-memory
area that's not mapped in kernel virtual address space, but within the range
that the device can use directly; so the bounce page may need to be kmapped
during copy operations. [Note: This does not hold in the current
implementation, though]

There are some situations when pages from high memory may need to
be kmapped, even if bounce buffers are not necessary. For example a device
may need to abort DMA operations and revert to PIO for the transfer, in
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
kmaps as needed on such occasions as appropriate. A driver could also use
the blk_queue_bounce() routine on its own to bounce highmem i/o to low
memory for specific requests if so desired.

iii. The i/o scheduler algorithm itself can be replaced/set as appropriate

As in 2.4, it is possible to plug in a brand new i/o scheduler for a
particular queue, or pick from (copy) existing generic schedulers and
replace/override certain portions of it. The 2.5 rewrite provides improved
modularization of the i/o scheduler. There are more pluggable callbacks, e.g
for init, add request, extract request, which makes it possible to abstract
specific i/o scheduling algorithm aspects and details outside of the generic
loop. It also makes it possible to completely hide the implementation details
of the i/o scheduler from block drivers.

I/O scheduler wrappers are to be used instead of accessing the queue directly.
See section 4 (The I/O scheduler) for details.

1.2 Tuning Based on High level code capabilities

i. Application capabilities for raw i/o

This comes from some of the high-performance database/middleware
requirements where an application prefers to make its own i/o scheduling
decisions based on an understanding of the access patterns and i/o
characteristics.

ii. High performance filesystems or other higher level kernel code's
capabilities

Kernel components like filesystems could also take their own i/o scheduling
decisions for optimizing performance. Journalling filesystems may need
some control over i/o ordering.

What kind of support exists at the generic block layer for this?

The flags and rw fields in the bio structure can be used for some tuning
from above, e.g indicating that an i/o is just a readahead request, or
priority settings (currently unused).
As far as user applications are concerned, they would need an additional
mechanism, either via open flags, ioctls, or some other upper level
mechanism, to communicate such settings to block.

1.2.1 Request Priority/Latency

Todo/Under discussion:
Arjan's proposed request priority scheme allows higher levels some broad
control (high/med/low) over the priority of an i/o request vs other pending
requests in the queue. For example it allows reads for bringing in an
executable page on demand to be given a higher priority over pending write
requests which haven't aged too much on the queue. Potentially this priority
could even be exposed to applications in some manner, providing higher level
tunability. Time based aging avoids starvation of lower priority
requests. Some bits in the bi_opf flags field in the bio structure are
intended to be used for this priority information.


1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
    (e.g Diagnostics, Systems Management)

There are situations where high-level code needs to have direct access to
the low level device capabilities or requires the ability to issue commands
to the device bypassing some of the intermediate i/o layers.
These could, for example, be special control commands issued through ioctl
interfaces, or could be raw read/write commands that stress the drive's
capabilities for certain kinds of fitness tests. Having direct interfaces at
multiple levels without having to pass through upper layers makes
it possible to perform bottom up validation of the i/o path, layer by
layer, starting from the media.

The normal i/o submission interfaces, e.g submit_bio, could be bypassed
for specially crafted requests which such ioctl or diagnostics
interfaces would typically use. The elevator add_request routine
can instead be used to directly insert such requests in the queue;
preferably, the blk_do_rq routine can be used to place the request on the
queue and wait for completion. Alternatively, sometimes the caller might just
invoke a lower level driver specific interface with the request as a
parameter.

If the request is a means for passing on special information associated with
the command, then such information is associated with the request->special
field (rather than misuse the request->buffer field which is meant for the
request data buffer's virtual mapping).

For passing request data, the caller must build up a bio descriptor
representing the concerned memory buffer if the underlying driver interprets
bio segments or uses the block layer end*request* functions for i/o
completion. Alternatively one could directly use the request->buffer field to
specify the virtual address of the buffer, if the driver expects buffer
addresses passed in this way and ignores bio entries for the request type
involved. In the latter case, the driver would modify and manage the
request->buffer, request->sector and request->nr_sectors or
request->current_nr_sectors fields itself rather than using the block layer
end_request or end_that_request_first completion interfaces.
(See 2.3 or Documentation/block/request.txt for a brief explanation of
the request structure fields)

[TBD: end_that_request_last should be usable even in this case;
Perhaps an end_that_direct_request_first routine could be implemented to make
handling direct requests easier for such drivers; Also for drivers that
expect bios, a helper function could be provided for setting up a bio
corresponding to a data buffer]
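The following hedged sketch shows roughly what such a bypass request might
look like, using the interfaces named above (blk_do_rq, rq->cmd,
rq->special). The allocation helper and the exact signatures varied across
kernel versions, and my_cmd_context is invented, so treat this as an
illustration of the flow rather than a recipe.

	struct request *rq;

	rq = blk_get_request(q, WRITE, GFP_KERNEL);	/* era-specific helper */
	rq->flags = REQ_BLOCK_PC;	/* packet command issued via blk_do_rq */
	memset(rq->cmd, 0, sizeof(rq->cmd));	/* rq->cmd is 16 bytes, see 1.3.1 */
	rq->cmd[0] = 0x00;		/* e.g. SCSI TEST UNIT READY opcode */
	rq->special = my_cmd_context;	/* command-private info, not data */

	blk_do_rq(q, rq);		/* insert and wait for completion */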
1.3.1 Pre-built Commands

A request can be created with a pre-built custom command to be sent directly
to the device. The cmd block in the request structure has room for filling
in the command bytes. (i.e rq->cmd is now 16 bytes in size, and meant for
command pre-building, and the type of the request is now indicated
through rq->flags instead of via rq->cmd)

The request structure flags can be set up to indicate the type of request
in such cases (REQ_PC: direct packet command passed to driver, REQ_BLOCK_PC:
packet command issued via blk_do_rq, REQ_SPECIAL: special request).

It can help to pre-build device commands for requests in advance.
Drivers can now specify a request prepare function (q->prep_rq_fn) that the
block layer would invoke to pre-build device commands for a given request,
or perform other preparatory processing for the request. This routine is
called by elv_next_request(), i.e. typically just before servicing a request.
(The prepare function would not be called for requests that have RQF_DONTPREP
enabled)

Aside:
  Pre-building could possibly even be done early, i.e before placing the
  request on the queue, rather than constructing the command on the fly in the
  driver while servicing the request queue when it may affect latencies in
  interrupt context or responsiveness in general. One way to add early
  pre-building would be to do it whenever we fail to merge on a request.
  Now REQ_NOMERGE is set in the request flags to skip this one in the future,
  which means that it will not change before we feed it to the device. So
  the pre-builder hook can be invoked there.


2. Flexible and generic but minimalist i/o structure/descriptor.

2.1 Reason for a new structure and requirements addressed

Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
layer, and the low level request structure was associated with a chain of
buffer heads for a contiguous i/o request. This led to certain inefficiencies
when it came to large i/o requests and readv/writev style operations, as it
forced such requests to be broken up into small chunks before being passed
on to the generic block layer, only to be merged by the i/o scheduler
when the underlying device was capable of handling the i/o in one shot.
Also, using the buffer head as an i/o structure for i/os that didn't originate
from the buffer cache unnecessarily added to the weight of the descriptors
which were generated for each such chunk.

The following were some of the goals and expectations considered in the
redesign of the block i/o data structure in 2.5.

i. Should be appropriate as a descriptor for both raw and buffered i/o -
   avoid cache related fields which are irrelevant in the direct/page i/o path,
   or filesystem block size alignment restrictions which may not be relevant
   for raw i/o.
ii. Ability to represent high-memory buffers (which do not have a virtual
    address mapping in kernel address space).
iii. Ability to represent large i/os w/o unnecessarily breaking them up (i.e
    greater than PAGE_SIZE chunks in one shot)
iv. At the same time, ability to retain independent identity of i/os from
    different sources or i/o units requiring individual completion (e.g. for
    latency reasons)
v. Ability to represent an i/o involving multiple physical memory segments
   (including non-page aligned page fragments, as specified via readv/writev)
   without unnecessarily breaking it up, if the underlying device is capable
   of handling it.
vi. Preferably should be based on a memory descriptor structure that can be
    passed around different types of subsystems or layers, maybe even
    networking, without duplication or extra copies of data/descriptor fields
    themselves in the process
vii. Ability to handle the possibility of splits/merges as the structure
    passes through layered drivers (lvm, md, evms), with minimal overhead.

The solution was to define a new structure (bio) for the block layer,
instead of using the buffer head structure (bh) directly, the idea being
avoidance of some associated baggage and limitations. The bio structure
is uniformly used for all i/o at the block layer; it forms a part of the
bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
mapped to bio structures.

2.2 The bio struct

The bio structure uses a vector representation pointing to an array of
<page, offset, len> tuples to describe the i/o buffer, and has various other
fields describing i/o parameters and state that needs to be maintained for
performing the i/o.

Notice that this representation means that a bio has no virtual address
mapping at all (unlike buffer heads).

struct bio_vec {
	struct page	*bv_page;
	unsigned short	bv_len;
	unsigned short	bv_offset;
};

/*
 * main unit of I/O for the block layer and lower layers (ie drivers)
 */
struct bio {
	struct bio	*bi_next;	/* request queue link */
	struct block_device *bi_bdev;	/* target device */
	unsigned long	bi_flags;	/* status, command, etc */
	unsigned long	bi_opf;		/* low bits: r/w, high: priority */

	unsigned int	bi_vcnt;	/* how many bio_vec's */
	struct bvec_iter bi_iter;	/* current index into bio_vec array */

	unsigned int	bi_size;	/* total size in bytes */
	unsigned short	bi_hw_segments;	/* segments after DMA remapping */
	unsigned int	bi_max;		/* max bio_vecs we can hold
					   used as index into pool */
	struct bio_vec	*bi_io_vec;	/* the actual vec list */
	bio_end_io_t	*bi_end_io;	/* bi_end_io (bio) */
	atomic_t	bi_cnt;		/* pin count: free when it hits zero */
	void		*bi_private;
};

With this multipage bio design:

- Large i/os can be sent down in one go using a bio_vec list consisting
  of an array of <page, offset, len> fragments (similar to the way fragments
  are represented in the zero-copy network code)
- Splitting of an i/o request across multiple devices (as in the case of
  lvm or raid) is achieved by cloning the bio (where the clone points to
  the same bi_io_vec array, but with the index and size accordingly modified)
- A linked list of bios is used as before for unrelated merges (*) - this
  avoids reallocs and makes independent completions easier to handle.
- Code that traverses the req list can find all the segments of a bio
  by using rq_for_each_segment. This handles the fact that a request
  has multiple bios, each of which can have multiple segments.
- Drivers which can't process a large bio in one shot can use the bi_iter
  field to keep track of the next bio_vec entry to process.
  (e.g a 1MB bio_vec needs to be handled in max 128kB chunks for IDE)
  [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid
  modifying bi_offset and len fields]

(*) unrelated merges -- a request ends up containing two or more bios that
    didn't originate from the same place.

bi_end_io() i/o callback gets called on i/o completion of the entire bio.
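As a hedged sketch of that last point, a driver might walk a large bio in
bounded chunks by advancing a private copy of the iterator rather than
touching bi_io_vec or the bvec fields directly. mydrv_xfer_chunk() is
invented; bio_advance_iter() is the helper the block layer provides for
exactly this kind of bookkeeping (described further in the biovec
documentation).

	/* process a bio in chunks of at most 128kB */
	struct bvec_iter iter = bio->bi_iter;	/* private copy */

	while (iter.bi_size) {
		unsigned int len = min_t(unsigned int, iter.bi_size,
					 128 * 1024);

		mydrv_xfer_chunk(bio, iter, len);	/* hypothetical hw op */
		bio_advance_iter(bio, &iter, len);	/* bumps bi_bvec_done */
	}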
At a lower level, drivers build a scatter gather list from the merged bios.
The scatter gather list is in the form of an array of <page, offset, len>
entries with their corresponding dma address mappings filled in at the
appropriate time. As an optimization, contiguous physical pages can be
covered by a single entry where <page> refers to the first page and <len>
covers the range of pages (up to 16 contiguous pages could be covered this
way). There is a helper routine (blk_rq_map_sg) which drivers can use to
build the sg list.

Note: Right now the only user of bios with more than one page is ll_rw_kio,
which in turn means that only raw I/O uses it (direct i/o may not work
right now). The intent however is to enable clustering of pages etc to
become possible. The pagebuf abstraction layer from SGI also uses multi-page
bios, but that is currently not included in the stock development kernels.
The same is true of Andrew Morton's work-in-progress multipage bio writeout
and readahead patches.

2.3 Changes in the Request Structure

The request structure is the structure that gets passed down to low level
drivers. The block layer make_request function builds up a request structure,
places it on the queue and invokes the driver's request_fn. The driver makes
use of the block layer helper routine elv_next_request to pull the next
request off the queue. Control or diagnostic functions might bypass block and
directly invoke underlying driver entry points passing in a specially
constructed request structure.

Only some relevant fields (mainly those which changed or may be referred
to in some of the discussion here) are listed below, not necessarily in
the order in which they occur in the structure (see include/linux/blkdev.h).
Refer to Documentation/block/request.txt for details about all the request
structure fields and a quick reference about the layers which are
supposed to use or modify those fields.

struct request {
	struct list_head queuelist;	/* Not meant to be directly accessed
					   by the driver.
					   Used by q->elv_next_request_fn
					   rq->queue is gone
					*/
	.
	.
	unsigned char cmd[16];	/* prebuilt command data block */
	unsigned long flags;	/* also includes earlier rq->cmd settings */
	.
	.
	sector_t sector;	/* this field is now of type sector_t instead
				   of int - preparation for 64 bit sectors */
	.
	.

	/* Number of scatter-gather DMA addr+len pairs after
	 * physical address coalescing is performed.
	 */
	unsigned short nr_phys_segments;

	/* Number of scatter-gather addr+len pairs after
	 * physical and DMA remapping hardware coalescing is performed.
	 * This is the number of scatter-gather entries the driver
	 * will actually have to deal with after DMA mapping is done.
	 */
	unsigned short nr_hw_segments;

	/* Various sector counts */
	unsigned long nr_sectors;	/* no. of sectors left: driver
					   modifiable */
	unsigned long hard_nr_sectors;	/* block internal copy of above */
	unsigned int current_nr_sectors; /* no. of sectors left in the
					    current segment: driver
					    modifiable */
	unsigned long hard_cur_sectors;	/* block internal copy of the above */
	.
	.
	int tag;	/* command tag associated with request */
	void *special;	/* same as before */
	char *buffer;	/* valid only for low memory buffers up to
			   current_nr_sectors */
	.
	.
	struct bio *bio, *biotail;	/* bio list instead of bh */
	struct request_list *rl;
};

See the req_ops and req_flag_bits definitions for an explanation of the
various flags available. Some bits are used by the block layer or i/o
scheduler.

The behaviour of the various sector counts is almost the same as before,
except that since we have multi-segment bios, current_nr_sectors refers
to the number of sectors in the current segment being processed, which could
be one of the many segments in the current bio (i.e i/o completion unit).
The nr_sectors value refers to the total number of sectors in the whole
request that remain to be transferred (no change). The purpose of the
hard_xxx values is for block to remember these counts every time it hands
over the request to the driver. These values are updated by block on
end_that_request_first, i.e. every time the driver completes a part of the
transfer and invokes block end*request helpers to mark this. The
driver should not modify these values. The block layer sets up the
nr_sectors and current_nr_sectors fields (based on the corresponding
hard_xxx values and the number of bytes transferred) and updates them on
every transfer that invokes end_that_request_first. It does the same for the
buffer, bio, bio->bi_iter fields too.

The buffer field is just a virtual address mapping of the current segment
of the i/o buffer in cases where the buffer resides in low-memory. For high
memory i/o, this field is not valid and must not be used by drivers.

Code that sets up its own request structures and passes them down to
a driver needs to be careful about interoperation with the block layer helper
functions which the driver uses. (Section 1.3)

3. Using bios

3.1 Setup/Teardown

There are routines for managing the allocation, reference counting, and
freeing of bios (bio_alloc, bio_get, bio_put).

This makes use of Ingo Molnar's mempool implementation, which enables
subsystems like bio to maintain their own reserve memory pools for guaranteed
deadlock-free allocations during extreme VM load. For example, the VM
subsystem makes use of the block layer to writeout dirty pages in order to be
able to free up memory space, a case which needs careful handling. The
allocation logic draws from the preallocated emergency reserve in situations
where it cannot allocate through normal means. If the pool is empty and it
can wait, then it would trigger action that would help free up memory or
replenish the pool (without deadlocking) and wait for availability in the
pool. If it is in IRQ context, and hence not in a position to do this,
allocation could fail if the pool is empty. In general mempool always first
tries to perform allocation without having to wait, even if it means digging
into the pool as long as it is not less than 50% full.

On a free, memory is released to the pool or directly freed depending on
the current availability in the pool. The mempool interface lets the
subsystem specify the routines to be used for normal alloc and free. In the
case of bio, these routines make use of the standard slab allocator.

The caller of bio_alloc is expected to take certain steps to avoid
deadlocks, e.g. avoid trying to allocate more memory from the pool while
already holding memory obtained from the pool.

[TBD: This is a potential issue, though a rare possibility
in the bounce bio allocation that happens in the current code, since
it ends up allocating a second bio from the same pool while
holding the original bio]
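As a hedged illustration of these interfaces (assuming the
bio_alloc/bio_get/bio_put routines described here, with the submission entry
point shown in its modern one-argument form; mydrv_end_io and ctx are
invented):

	/* allocate a bio with room for one vec; GFP_NOIO may dig into
	 * the mempool reserve rather than deadlock under VM pressure */
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio->bi_iter.bi_sector = sector;	/* target sector on device */
	bio_add_page(bio, page, PAGE_SIZE, 0);	/* the <page, offset, len> vec */
	bio->bi_end_io = mydrv_end_io;		/* per-bio completion callback */
	bio->bi_private = ctx;			/* device/bdev setup omitted */

	bio_get(bio);		/* extra ref so fields stay valid below */
	submit_bio(bio);
	/* ... may still inspect bio here thanks to the extra ref ... */
	bio_put(bio);		/* drop our reference */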
Memory allocated from the pool should be released back within a limited
amount of time (in the case of bio, that would be after the i/o is completed).
This ensures that if part of the pool has been used up, some work (in this
case i/o) must already be in progress and memory would be available when it
is over. If allocating from multiple pools in the same code path, the order
or hierarchy of allocation needs to be consistent, just the way one deals
with multiple locks.

The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc())
for a non-clone bio. There are 6 pools set up for different size biovecs,
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
given size from these slabs.

The bio_get() routine may be used to hold an extra reference on a bio prior
to i/o submission, if the bio fields are likely to be accessed after the
i/o is issued (since the bio may otherwise get freed in case i/o completion
happens in the meantime).

The bio_clone_fast() routine may be used to duplicate a bio, where the clone
shares the bio_vec_list with the original bio (i.e. both point to the
same bio_vec_list). This would typically be used for splitting i/o requests
in lvm or md.

3.2 Generic bio helper Routines

3.2.1 Traversing segments and completion units in a request

The macro rq_for_each_segment() should be used for traversing the bios
in the request list (drivers should avoid directly trying to do it
themselves). Using these helpers should also make it easier to cope
with block changes in the future.

	struct req_iterator iter;
	rq_for_each_segment(bio_vec, rq, iter)
		/* bio_vec is now current segment */

I/O completion callbacks are per-bio rather than per-segment, so drivers
that traverse bio chains on completion need to keep that in mind. Drivers
which don't make a distinction between segments and completion units would
need to be reorganized to support multi-segment bios.

3.2.2 Setting up DMA scatterlists

The blk_rq_map_sg() helper routine would be used for setting up scatter
gather lists from a request, so a driver need not do it on its own.

	nr_segments = blk_rq_map_sg(q, rq, scatterlist);

The helper routine provides a level of abstraction which makes it easier
to modify the internals of request to scatterlist conversion down the line
without breaking drivers. The blk_rq_map_sg routine takes care of several
things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER
is set) and correct segment accounting to avoid exceeding the limits which
the i/o hardware can handle, based on various queue properties.

- Prevents a clustered segment from crossing a 4GB mem boundary
- Avoids building segments that would exceed the number of physical
  memory segments that the driver can handle (phys_segments) and the
  number that the underlying hardware can handle at once, accounting for
  DMA remapping (hw_segments) (i.e. IOMMU aware limits).

Routines which the low level driver can use to set up the segment limits:

blk_queue_max_hw_segments() : Sets an upper limit of the maximum number of
hw data segments in a request (i.e. the maximum number of address/length
pairs the host adapter can actually hand to the device at once)

blk_queue_max_phys_segments() : Sets an upper limit on the maximum number
of physical data segments in a request (i.e. the largest sized scatter list
a driver could handle)
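Putting this subsection together, a request setup path might look roughly
like the following sketch. MYDRV_MAX_SEGS and mydrv_program_sg() are
invented; blk_rq_map_sg() and dma_map_sg() are the helpers discussed here
and in the DMA API documentation.

	struct scatterlist sg[MYDRV_MAX_SEGS];	/* bounded by queue limits */
	int nseg, dir;

	nseg = blk_rq_map_sg(q, rq, sg);	/* phys segment collapsing */

	dir = rq_data_dir(rq) == WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
	nseg = dma_map_sg(dev, sg, nseg, dir);	/* may coalesce further (IOMMU) */

	/* nseg now corresponds to the hw_segments the adapter will see */
	mydrv_program_sg(sg, nseg);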
3.2.3 I/O completion

The existing generic block layer helper routines end_request,
end_that_request_first and end_that_request_last can be used for i/o
completion (and setting things up so the rest of the i/o or the next
request can be kicked off) as before. With the introduction of multi-page
bio support, end_that_request_first requires an additional argument indicating
the number of sectors completed.

3.2.4 Implications for drivers that do not interpret bios (don't handle
multiple segments)

Drivers that do not interpret bios, e.g those which do not handle multiple
segments and do not support i/o into high memory addresses (require bounce
buffers) and expect only virtually mapped buffers, can access the rq->buffer
field. As before the driver should use current_nr_sectors to determine the
size of remaining data in the current segment (that is the maximum it can
transfer in one go unless it interprets segments), and rely on the block layer
end_request, or end_that_request_first/last to take care of all accounting
and transparent mapping of the next bio segment when a segment boundary
is crossed on completion of a transfer. (The end*request* functions should
be used only if the request has come down from the block/bio path, not for
direct access requests which only specify rq->buffer without a valid rq->bio)

3.3 I/O Submission

The routine submit_bio() is used to submit a single i/o. Higher level i/o
routines make use of this:

(a) Buffered i/o:
The routine submit_bh() invokes submit_bio() on a bio corresponding to the
bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.

(b) Kiobuf i/o (for raw/direct i/o):
The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and
maps the array to one or more multi-page bios, issuing submit_bio() to
perform the i/o on each of these.

The embedded bh array in the kiobuf structure has been removed and no
preallocation of bios is done for kiobufs. [The intent is to remove the
blocks array as well, but it's currently in there to kludge around direct
i/o.] Thus kiobuf allocation has switched back to using kmalloc rather than
vmalloc.

Todo/Observation:

    A single kiobuf structure is assumed to correspond to a contiguous range
    of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
    So right now it wouldn't work for direct i/o on non-contiguous blocks.
    This is to be resolved. The eventual direction is to replace kiobuf
    by kvec's.

    Badari Pulavarty has a patch to implement direct i/o correctly using
    bio and kvec.

(c) Page i/o:
Todo/Under discussion:

    Andrew Morton's multi-page bio patches attempt to issue multi-page
    writeouts (and reads) from the page cache, by directly building up
    large bios for submission completely bypassing the usage of buffer
    heads. This work is still in progress.

    Christoph Hellwig had some code that uses bios for page-io (rather than
    bh). This isn't included in bio as yet. Christoph was also working on a
    design for representing virtual/real extents as an entity and modifying
    some of the address space ops interfaces to utilize this abstraction
    rather than buffer_heads. (This is somewhat along the lines of the SGI
    XFS pagebuf abstraction, but intended to be as lightweight as possible).
(d) Direct access i/o:
Direct access requests that do not contain bios would be submitted
differently as discussed earlier in section 1.3.

Aside:

  Kvec i/o:

  Ben LaHaise's aio code uses a slightly different structure instead
  of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
  tuples (very much like the networking code), together with a callback
  function and data pointer. This is embedded into a brw_cb structure when
  passed to brw_kvec_async().

  Now it should be possible to directly map these kvecs to a bio. Just as
  while cloning, in this case rather than PRE_BUILT bio_vecs, we set the
  bi_io_vec array pointer to point to the veclet array in kvecs.

  TBD: In order for this to work, some changes are needed in the way
  multi-page bios are handled today. The values of the tuples in such a
  vector passed in from higher level code should not be modified by the
  block layer in the course of its request processing, since that would make
  it hard for the higher layer to continue to use the vector descriptor
  (kvec) after i/o completes. Instead, all such transient state should be
  maintained in the request structure and passed on in some way to the endio
  completion routine.


4. The I/O scheduler

The I/O scheduler, a.k.a. elevator, is implemented in two layers: the generic
dispatch queue and the specific I/O schedulers. Unless stated otherwise,
"elevator" is used to refer to both parts and "I/O scheduler" to the specific
I/O schedulers.

The block layer implements the generic dispatch queue in block/*.c.
The generic dispatch queue is responsible for requeueing, handling non-fs
requests and all other subtleties.

Specific I/O schedulers are responsible for ordering normal filesystem
requests. They can also choose to delay certain requests to improve
throughput, or for other purposes. As the plural form indicates, there are
multiple I/O schedulers. They can be built as modules but at least one should
be built into the kernel. Each queue can choose a different one and can also
change to another one dynamically.

A block layer call to the i/o scheduler follows the convention elv_xxx().
This calls elevator_xxx_fn in the elevator switch (block/elevator.c). The
xxx on the two sides might not match exactly, but use your imagination. If
an elevator doesn't implement a function, the switch does nothing or some
minimal housekeeping work.

4.1. I/O scheduler API

The functions an elevator may implement are: (* are mandatory)

elevator_merge_fn	called to query requests for merge with a bio

elevator_merge_req_fn	called when two requests get merged. the one
			which gets merged into the other one will never
			be seen by I/O scheduler again. IOW, after being
			merged, the request is gone.

elevator_merged_fn	called when a request in the scheduler has been
			involved in a merge. It is used in the deadline
			scheduler for example, to reposition the request
			if its sorting order has changed.

elevator_allow_merge_fn	called whenever the block layer determines
			that a bio can be merged into an existing
			request safely. The io scheduler may still
			want to stop a merge at this point if it
			results in some sort of conflict internally,
			this hook allows it to do that. Note however
			that two *requests* can still be merged at later
			time. Currently the io scheduler has no way to
			prevent that. It can only learn about the fact
			from elevator_merge_req_fn callback.
elevator_dispatch_fn*	fills the dispatch queue with ready requests.
			I/O schedulers are free to postpone requests by
			not filling the dispatch queue unless @force
			is non-zero. Once dispatched, I/O schedulers
			are not allowed to manipulate the requests -
			they belong to generic dispatch queue.

elevator_add_req_fn*	called to add a new request into the scheduler

elevator_former_req_fn
elevator_latter_req_fn	These return the request before or after the
			one specified in disk sort order. Used by the
			block layer to find merge possibilities.

elevator_completed_req_fn	called when a request is completed.

elevator_may_queue_fn	returns true if the scheduler wants to allow the
			current context to queue a new request even if
			it is over the queue limit. This must be used
			very carefully!!

elevator_set_req_fn
elevator_put_req_fn	Must be used to allocate and free any elevator
			specific storage for a request.

elevator_activate_req_fn	Called when device driver first sees a request.
				I/O schedulers can use this callback to
				determine when actual execution of a request
				starts.
elevator_deactivate_req_fn	Called when device driver decides to delay
				a request by requeueing it.

elevator_init_fn*
elevator_exit_fn	Allocate and free any elevator specific storage
			for a queue.

4.2 Request flows seen by I/O schedulers

All requests seen by I/O schedulers strictly follow one of the following
three flows.

	set_req_fn ->

	i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
	     (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
	ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
	iii. [none]

	-> put_req_fn
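To tie the hook names together, here is a hedged sketch of how an i/o
scheduler of this era plugs into the elevator switch. The mydrv handlers are
invented; the struct layout follows the old elevator_type/elevator_ops
interface described in this section (it has been reworked several times
since).

	static struct elevator_type iosched_mydrv = {
		.ops = {
			.elevator_merge_fn	= mydrv_merge,
			.elevator_dispatch_fn	= mydrv_dispatch,	/* mandatory */
			.elevator_add_req_fn	= mydrv_add_request,	/* mandatory */
			.elevator_init_fn	= mydrv_init_queue,	/* mandatory */
			.elevator_exit_fn	= mydrv_exit_queue,
		},
		.elevator_name	= "mydrv",
		.elevator_owner	= THIS_MODULE,
	};

	static int __init mydrv_iosched_init(void)
	{
		elv_register(&iosched_mydrv);	/* now selectable per queue */
		return 0;
	}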
4.3 I/O scheduler implementation

The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
optimal disk scan and request servicing performance (based on generic
principles and device capabilities), optimized for:

i.   improved throughput
ii.  improved latency
iii. better utilization of h/w & CPU time

Characteristics:

i. Binary tree
AS and deadline i/o schedulers use red black binary trees for disk position
sorting and searching, and a fifo linked list for time-based searching. This
gives good scalability and good availability of information. Requests are
almost always dispatched in disk sort order, so a cache is kept of the next
request in sort order to prevent binary tree lookups.

This arrangement is not a generic block layer characteristic however, so
elevators may implement queues as they please.

ii. Merge hash
AS and deadline use a hash table indexed by the last sector of a request. This
enables merging code to quickly look up "back merge" candidates, even when
multiple I/O streams are being performed at once on one disk.

"Front merges", a new request being merged at the front of an existing
request, are far less common than "back merges" due to the nature of most I/O
patterns. Front merges are handled by the binary trees in AS and deadline
schedulers.

iii. Plugging the queue to batch requests in anticipation of opportunities for
     merge/sort optimizations

Plugging is an approach that the current i/o scheduling algorithm resorts to
so that it collects up enough requests in the queue to be able to take
advantage of the sorting/merging logic in the elevator. If the
queue is empty when a request comes in, then it plugs the request queue
(sort of like plugging the bath tub of a vessel to get fluid to build up)
till it fills up with a few more requests, before starting to service
the requests. This provides an opportunity to merge/sort the requests before
passing them down to the device. There are various conditions when the queue
is unplugged (to open up the flow again), either through a scheduled task or
on demand. For example wait_on_buffer sets the unplugging going
through sync_buffer() running blk_run_address_space(mapping). Or the caller
can do it explicitly through blk_unplug(bdev). So in the read case,
the queue gets explicitly unplugged as part of waiting for completion on that
buffer.

Aside:
  This is kind of controversial territory, as it's not clear if plugging is
  always the right thing to do. Devices typically have their own queues,
  and allowing a big queue to build up in software, while letting the device
  be idle for a while may not always make sense. The trick is to handle the
  fine balance between when to plug and when to open up. Also now that we
  have multi-page bios being queued in one shot, we may not need to wait to
  merge a big request from the broken up pieces coming by.

4.4 I/O contexts

I/O contexts provide a dynamically allocated per process data area. They may
be used in I/O schedulers, and in the block layer (could be used for IO stats
or priorities, for example). See *io_context in block/ll_rw_blk.c, and
as-iosched.c for an example of usage in an i/o scheduler.


5. Scalability related changes

5.1 Granular Locking: io_request_lock replaced by a per-queue lock

The global io_request_lock has been removed as of 2.5, to avoid
the scalability bottleneck it was causing, and has been replaced by more
granular locking. The request queue structure has a pointer to the
lock to be used for that queue. As a result, locking can now be
per-queue, with a provision for sharing a lock across queues if
necessary (e.g the scsi layer sets the queue lock pointers to the
corresponding adapter lock, which results in a per host locking
granularity). The locking semantics are the same, i.e. locking is
still imposed by the block layer, grabbing the lock before
request_fn execution, which means that lots of older drivers
should still be SMP safe. Drivers are free to drop the queue
lock themselves, if required. Drivers that explicitly used the
io_request_lock for serialization need to be modified accordingly.
Usually it's as easy as adding a global lock:

	static DEFINE_SPINLOCK(my_driver_lock);

and passing the address to that lock to blk_init_queue().

5.2 64 bit sector numbers (sector_t prepares for 64 bit support)

The sector number used in the bio structure has been changed to sector_t,
which could be defined as 64 bit in preparation for 64 bit sector support.

6. Other Changes/Implications

6.1 Partition re-mapping handled by the generic block layer

In 2.5 some of the gendisk/partition related code has been reorganized.
Now the generic block layer performs partition-remapping early and thus
provides drivers with a sector number relative to the whole device, rather
than having to take partition number into account in order to arrive at the
true sector number. The routine blk_partition_remap() is invoked by
generic_make_request even before invoking the queue specific make_request_fn,
so the i/o scheduler also gets to operate on whole disk sector numbers. This
should typically not require changes to block drivers; a driver just never
gets to invoke its own partition sector offset calculations, since all bios
sent are offset from the beginning of the device.
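Before the migration tips below, a hedged skeleton pulling the locking and
dispatch conventions together; the mydrv names are invented, while
elv_next_request, DEFINE_SPINLOCK and blk_init_queue are as described in this
document.

	static DEFINE_SPINLOCK(mydrv_lock);

	/* called with the queue lock held, as imposed by the block layer */
	static void mydrv_request_fn(struct request_queue *q)
	{
		struct request *rq;

		while ((rq = elv_next_request(q)) != NULL)
			mydrv_start_io(rq);	/* hypothetical hardware kick */
	}

	/* at init time: per-queue lock shared with the driver */
	q = blk_init_queue(mydrv_request_fn, &mydrv_lock);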
7. A Few Tips on Migration of older drivers

Old-style drivers that just use CURRENT and ignore clustered requests
may not need much change. The generic layer will automatically handle
clustered requests, multi-page bios, etc for the driver.

For a low performance driver or hardware that is PIO driven or just doesn't
support scatter-gather, changes should be minimal too.

The following are some points to keep in mind when converting old drivers
to bio.

Drivers should use elv_next_request to pick up requests and are no longer
supposed to handle looping directly over the request list.
(struct request->queue has been removed)

Now end_that_request_first takes an additional number_of_sectors argument.
It used to always handle just the first buffer_head in a request, now
it will loop and handle as many sectors (on a bio-segment granularity)
as specified.

Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio) instead.

If the driver is dropping the io_request_lock from its request_fn strategy,
then it just needs to replace that with q->queue_lock instead.

As described in Sec 1.1, drivers can set max sector size, max segment size,
etc per queue now. Drivers that used to define their own merge functions
to handle things like this can now just use the blk_queue_* functions at
blk_init_queue time.

Drivers no longer have to map a {partition, sector offset} into the
correct absolute location anymore, this is done by the block layer, so
where a driver received a request like this before:

	rq->rq_dev = mk_kdev(3, 5);	/* /dev/hda5 */
	rq->sector = 0;			/* first sector on hda5 */

it will now see:

	rq->rq_dev = mk_kdev(3, 0);	/* /dev/hda */
	rq->sector = 123128;		/* offset from start of disk */

As mentioned, there is no virtual mapping of a bio. For DMA, this is
not a problem as the driver probably never will need a virtual mapping.
Instead it needs a bus mapping (dma_map_page for a single segment or
dma_map_sg for scatter gather) to be able to ship it to the driver. For
PIO drivers (or drivers that need to revert to PIO transfer once in a
while (IDE for example)), where the CPU is doing the actual data
transfer, a virtual mapping is needed. If the driver supports highmem I/O
(Sec 1.1, (ii)) it needs to use kmap_atomic or similar to temporarily map
a bio into the virtual address space.


8. Prior/Related/Impacted patches

8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
	- orig kiobuf & raw i/o patches (now in 2.4 tree)
	- direct kiobuf based i/o to devices (no intermediate bh's)
	- page i/o using kiobuf
	- kiobuf splitting for lvm (mkp)
	- elevator support for kiobuf request merging (axboe)
8.2. Zero-copy networking (Dave Miller)
8.3. SGI XFS - pagebuf patches - use of kiobufs
8.4. Multi-page pioent patch for bio (Christoph Hellwig)
8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
8.6. Async i/o implementation patch (Ben LaHaise)
8.7. EVMS layering design (IBM EVMS team)
8.8. Larger page cache size patch (Ben LaHaise) and
     Large page size (Daniel Phillips)
     => larger contiguous physical memory buffers
8.9. VM reservations patch (Ben LaHaise)
8.10. Write clustering patches? (Marcelo/Quintela/Riel?)
8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar,
      Badari)
8.13. Priority based i/o scheduler - prepatches (Arjan van de Ven)
8.14. IDE Taskfile i/o patch (Andre Hedrick)
8.15. Multi-page writeout and readahead patches (Andrew Morton)
8.16. Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarty)

9. Other References:

9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml,
and Linus' comments - Jan 2001)
9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan
et al - Feb-March 2001 (many of the initial thoughts that led to bio were
brought up in this discussion thread)
9.3 Discussions on mempool on lkml - Dec 2001.

diff --git a/Documentation/block/biovecs.rst b/Documentation/block/biovecs.rst
new file mode 100644
index 000000000000..86fa66c87172
--- /dev/null
+++ b/Documentation/block/biovecs.rst
@@ -0,0 +1,146 @@
======================================
Immutable biovecs and biovec iterators
======================================

Kent Overstreet

As of 3.13, biovecs should never be modified after a bio has been submitted.
Instead, we have a new struct bvec_iter which represents a range of a biovec -
the iterator will be modified as the bio is completed, not the biovec.

More specifically, old code that needed to partially complete a bio would
update bi_sector and bi_size, and advance bi_idx to the next biovec. If it
ended up partway through a biovec, it would increment bv_offset and decrement
bv_len by the number of bytes completed in that biovec.

In the new scheme of things, everything that must be mutated in order to
partially complete a bio is segregated into struct bvec_iter: bi_sector,
bi_size and bi_idx have been moved there; and instead of modifying bv_offset
and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of
bytes completed in the current bvec.

There are a bunch of new helper macros for hiding the gory details - in
particular, presenting the illusion of partially completed biovecs so that
normal code doesn't have to deal with bi_bvec_done.

 * Driver code should no longer refer to biovecs directly; we now have
   bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs,
   constructed from the raw biovecs but taking into account bi_bvec_done and
   bi_size.

   bio_for_each_segment() has been updated to take a bvec_iter argument
   instead of an integer (that corresponded to bi_idx); for a lot of code the
   conversion just required changing the types of the arguments to
   bio_for_each_segment().

 * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a
   wrapper around bio_advance_iter() that operates on bio->bi_iter, and also
   advances the bio integrity's iter if present.

   There is a lower level advance function - bvec_iter_advance() - which takes
   a pointer to a biovec, not a bio; this is used by the bio integrity code.
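As a hedged illustration (not a kernel-tree excerpt), iterating a bio with
the new interface looks like this; note that the loop variable is a literal
struct bio_vec built by the iterator, not a pointer into bi_io_vec, and
process_segment() is an invented stand-in for per-segment work::

	struct bio_vec bv;
	struct bvec_iter iter;

	bio_for_each_segment(bv, bio, iter) {
		/* bv has bi_bvec_done and bi_size already folded in */
		process_segment(bv.bv_page, bv.bv_offset, bv.bv_len);
	}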
What does all this get us?
==========================

Having a real iterator, and making biovecs immutable, has a number of
advantages:

 * Before, iterating over bios was very awkward when you weren't processing
   exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c,
   which copies the contents of one bio into another. Because the biovecs
   wouldn't necessarily be the same size, the old code was tricky and
   convoluted - it had to walk two different bios at the same time, keeping
   both bi_idx and an offset into the current biovec for each.

   The new code is much more straightforward - have a look. This sort of
   pattern comes up in a lot of places; a lot of drivers were essentially open
   coding bvec iterators before, and having a common implementation
   considerably simplifies a lot of code.

 * Before, any code that might need to use the biovec after the bio had been
   completed (perhaps to copy the data somewhere else, or perhaps to resubmit
   it somewhere else if there was an error) had to save the entire bvec array
   - again, this was being done in a fair number of places.

 * Biovecs can be shared between multiple bios - a bvec iter can represent an
   arbitrary range of an existing biovec, both starting and ending midway
   through biovecs. This is what enables efficient splitting of arbitrary
   bios. Note that this means we _only_ use bi_size to determine when we've
   reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes
   bi_size into account when constructing biovecs.

 * Splitting bios is now much simpler. The old bio_split() didn't even work on
   bios with more than a single bvec! Now, we can efficiently split arbitrary
   size bios - because the new bio can share the old bio's biovec.

   Care must be taken to ensure the biovec isn't freed while the split bio is
   still using it, in case the original bio completes first, though. Using
   bio_chain() when splitting bios helps with this.

 * Submitting partially completed bios is now perfectly fine - this comes up
   occasionally in stacking block drivers and various code (e.g. md and
   bcache) had some ugly workarounds for this.

   It used to be the case that submitting a partially completed bio would work
   fine to _most_ devices, but since accessing the raw bvec array was the
   norm, not all drivers would respect bi_idx and those would break. Now,
   since all drivers _must_ go through the bvec iterator - and have been
   audited to make sure they are - submitting partially completed bios is
   perfectly fine.

Other implications:
===================

 * Almost all usage of bi_idx is now incorrect and has been removed; instead,
   where previously you would have used bi_idx you'd now use a bvec_iter,
   probably passing it to one of the helper macros.

   I.e. instead of using bio_iovec_idx() (or bio->bi_iovec[bio->bi_idx]), you
   now use bio_iter_iovec(), which takes a bvec_iter and returns a
   literal struct bio_vec - constructed on the fly from the raw biovec but
   taking into account bi_bvec_done (and bi_size).

 * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that
   doesn't actually own the bio. The reason is twofold: firstly, it's not
   actually needed for iterating over the bio anymore - we only use bi_size.
   Secondly, when cloning a bio and reusing (a portion of) the original bio's
   biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate
   over all the biovecs in the new bio - which is silly as it's not needed.

   So, don't use bi_vcnt anymore.

 * The current interface allows the block layer to split bios as needed, so we
   could eliminate a lot of complexity particularly in stacked drivers.
Code + that creates bios can then create whatever size bios are convenient, and + more importantly stacked drivers don't have to deal with both their own bio + size limitations and the limitations of the underlying devices. Thus + there's no need to define ->merge_bvec_fn() callbacks for individual block + drivers. + +Usage of helpers: +================= + +* The following helpers whose names have the suffix of `_all` can only be used + on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers + shouldn't use them because the bio may have been split before it reached the + driver. + +:: + + bio_for_each_segment_all() + bio_first_bvec_all() + bio_first_page_all() + bio_last_bvec_all() + +* The following helpers iterate over single-page segment. The passed 'struct + bio_vec' will contain a single-page IO vector during the iteration:: + + bio_for_each_segment() + bio_for_each_segment_all() + +* The following helpers iterate over multi-page bvec. The passed 'struct + bio_vec' will contain a multi-page IO vector during the iteration:: + + bio_for_each_bvec() + rq_for_each_bvec() diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt deleted file mode 100644 index ce6eccaf5df7..000000000000 --- a/Documentation/block/biovecs.txt +++ /dev/null @@ -1,144 +0,0 @@ - -Immutable biovecs and biovec iterators: -======================================= - -Kent Overstreet - -As of 3.13, biovecs should never be modified after a bio has been submitted. -Instead, we have a new struct bvec_iter which represents a range of a biovec - -the iterator will be modified as the bio is completed, not the biovec. - -More specifically, old code that needed to partially complete a bio would -update bi_sector and bi_size, and advance bi_idx to the next biovec. If it -ended up partway through a biovec, it would increment bv_offset and decrement -bv_len by the number of bytes completed in that biovec. - -In the new scheme of things, everything that must be mutated in order to -partially complete a bio is segregated into struct bvec_iter: bi_sector, -bi_size and bi_idx have been moved there; and instead of modifying bv_offset -and bv_len, struct bvec_iter has bi_bvec_done, which represents the number of -bytes completed in the current bvec. - -There are a bunch of new helper macros for hiding the gory details - in -particular, presenting the illusion of partially completed biovecs so that -normal code doesn't have to deal with bi_bvec_done. - - * Driver code should no longer refer to biovecs directly; we now have - bio_iovec() and bio_iter_iovec() macros that return literal struct biovecs, - constructed from the raw biovecs but taking into account bi_bvec_done and - bi_size. - - bio_for_each_segment() has been updated to take a bvec_iter argument - instead of an integer (that corresponded to bi_idx); for a lot of code the - conversion just required changing the types of the arguments to - bio_for_each_segment(). - - * Advancing a bvec_iter is done with bio_advance_iter(); bio_advance() is a - wrapper around bio_advance_iter() that operates on bio->bi_iter, and also - advances the bio integrity's iter if present. - - There is a lower level advance function - bvec_iter_advance() - which takes - a pointer to a biovec, not a bio; this is used by the bio integrity code. - -What's all this get us? 
-======================= - -Having a real iterator, and making biovecs immutable, has a number of -advantages: - - * Before, iterating over bios was very awkward when you weren't processing - exactly one bvec at a time - for example, bio_copy_data() in fs/bio.c, - which copies the contents of one bio into another. Because the biovecs - wouldn't necessarily be the same size, the old code was tricky convoluted - - it had to walk two different bios at the same time, keeping both bi_idx and - and offset into the current biovec for each. - - The new code is much more straightforward - have a look. This sort of - pattern comes up in a lot of places; a lot of drivers were essentially open - coding bvec iterators before, and having common implementation considerably - simplifies a lot of code. - - * Before, any code that might need to use the biovec after the bio had been - completed (perhaps to copy the data somewhere else, or perhaps to resubmit - it somewhere else if there was an error) had to save the entire bvec array - - again, this was being done in a fair number of places. - - * Biovecs can be shared between multiple bios - a bvec iter can represent an - arbitrary range of an existing biovec, both starting and ending midway - through biovecs. This is what enables efficient splitting of arbitrary - bios. Note that this means we _only_ use bi_size to determine when we've - reached the end of a bio, not bi_vcnt - and the bio_iovec() macro takes - bi_size into account when constructing biovecs. - - * Splitting bios is now much simpler. The old bio_split() didn't even work on - bios with more than a single bvec! Now, we can efficiently split arbitrary - size bios - because the new bio can share the old bio's biovec. - - Care must be taken to ensure the biovec isn't freed while the split bio is - still using it, in case the original bio completes first, though. Using - bio_chain() when splitting bios helps with this. - - * Submitting partially completed bios is now perfectly fine - this comes up - occasionally in stacking block drivers and various code (e.g. md and - bcache) had some ugly workarounds for this. - - It used to be the case that submitting a partially completed bio would work - fine to _most_ devices, but since accessing the raw bvec array was the - norm, not all drivers would respect bi_idx and those would break. Now, - since all drivers _must_ go through the bvec iterator - and have been - audited to make sure they are - submitting partially completed bios is - perfectly fine. - -Other implications: -=================== - - * Almost all usage of bi_idx is now incorrect and has been removed; instead, - where previously you would have used bi_idx you'd now use a bvec_iter, - probably passing it to one of the helper macros. - - I.e. instead of using bio_iovec_idx() (or bio->bi_iovec[bio->bi_idx]), you - now use bio_iter_iovec(), which takes a bvec_iter and returns a - literal struct bio_vec - constructed on the fly from the raw biovec but - taking into account bi_bvec_done (and bi_size). - - * bi_vcnt can't be trusted or relied upon by driver code - i.e. anything that - doesn't actually own the bio. The reason is twofold: firstly, it's not - actually needed for iterating over the bio anymore - we only use bi_size. - Secondly, when cloning a bio and reusing (a portion of) the original bio's - biovec, in order to calculate bi_vcnt for the new bio we'd have to iterate - over all the biovecs in the new bio - which is silly as it's not needed. - - So, don't use bi_vcnt anymore. 
* The current interface allows the block layer to split bios as needed, so we
  could eliminate a lot of complexity particularly in stacked drivers. Code
  that creates bios can then create whatever size bios are convenient, and
  more importantly stacked drivers don't have to deal with both their own bio
  size limitations and the limitations of the underlying devices. Thus
  there's no need to define ->merge_bvec_fn() callbacks for individual block
  drivers.

Usage of helpers:
=================

* The following helpers whose names have the suffix of "_all" can only be used
  on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
  shouldn't use them because the bio may have been split before it reached the
  driver.

	bio_for_each_segment_all()
	bio_first_bvec_all()
	bio_first_page_all()
	bio_last_bvec_all()

* The following helpers iterate over single-page segments. The passed 'struct
  bio_vec' will contain a single-page IO vector during the iteration

	bio_for_each_segment()
	bio_for_each_segment_all()

* The following helpers iterate over multi-page bvecs. The passed 'struct
  bio_vec' will contain a multi-page IO vector during the iteration

	bio_for_each_bvec()
	rq_for_each_bvec()

diff --git a/Documentation/block/capability.rst b/Documentation/block/capability.rst
new file mode 100644
index 000000000000..2cf258d64bbe
--- /dev/null
+++ b/Documentation/block/capability.rst
@@ -0,0 +1,18 @@
===============================
Generic Block Device Capability
===============================

This file documents the sysfs file block/<disk>/capability.

capability is a hex word indicating which capabilities a specific disk
supports. For more information on bits not listed here, see
include/linux/genhd.h

GENHD_FL_MEDIA_CHANGE_NOTIFY
----------------------------

Value: 4

When this bit is set, the disk supports Asynchronous Notification
of media change events. These events will be broadcast to user
space via kernel uevent.

diff --git a/Documentation/block/capability.txt b/Documentation/block/capability.txt
deleted file mode 100644
index 2f1729424ef4..000000000000
--- a/Documentation/block/capability.txt
+++ /dev/null
@@ -1,15 +0,0 @@
Generic Block Device Capability
===============================================================================
This file documents the sysfs file block/<disk>/capability

capability is a hex word indicating which capabilities a specific disk
supports. For more information on bits not listed here, see
include/linux/genhd.h

Capability                          Value
-------------------------------------------------------------------------------
GENHD_FL_MEDIA_CHANGE_NOTIFY        4
	When this bit is set, the disk supports Asynchronous Notification
	of media change events. These events will be broadcast to user
	space via kernel uevent.

diff --git a/Documentation/block/cmdline-partition.rst b/Documentation/block/cmdline-partition.rst
new file mode 100644
index 000000000000..530bedff548a
--- /dev/null
+++ b/Documentation/block/cmdline-partition.rst
@@ -0,0 +1,53 @@
==============================================
Embedded device command line partition parsing
==============================================

The "blkdevparts" command line option adds support for reading the
block device partition table from the kernel command line.

It is typically used for fixed-block (eMMC) embedded devices. Such a
device has no MBR, which saves storage space, and the bootloader can
easily access data on the block device by absolute address.
+Users can easily change the partitions.
+
+The format for the command line is just like mtdparts:
+
+blkdevparts=<blkdev-def>[;<blkdev-def>]
+  <blkdev-def> := <blkdev-id>:<partdef>[,<partdef>]
+    <partdef> := <size>[@<offset>](part-name)
+
+<blkdev-id>
+    block device disk name. Embedded device uses fixed block device.
+    Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.
+
+<size>
+    partition size, in bytes, such as: 512, 1m, 1G.
+    size may contain an optional suffix of (upper or lower case):
+
+      K, M, G, T, P, E.
+
+    "-" is used to denote all remaining space.
+
+<offset>
+    partition start address, in bytes.
+    offset may contain an optional suffix of (upper or lower case):
+
+      K, M, G, T, P, E.
+
+(part-name)
+    partition name. Kernel sends uevent with "PARTNAME". Application can
+    create a link to block device partition with the name "PARTNAME".
+    User space application can access partition by partition name.
+
+Example:
+
+    eMMC disk names are "mmcblk0" and "mmcblk0boot0".
+
+  bootargs::
+
+    'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
+
+  dmesg::
+
+    mmcblk0: p1(data0) p2(data1) p3()
+    mmcblk0boot0: p1(boot) p2(kernel)
diff --git a/Documentation/block/cmdline-partition.txt b/Documentation/block/cmdline-partition.txt
deleted file mode 100644
index 760a3f7c3ed4..000000000000
--- a/Documentation/block/cmdline-partition.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-Embedded device command line partition parsing
-=====================================================================
-
-The "blkdevparts" command line option adds support for reading the
-block device partition table from the kernel command line.
-
-It is typically used for fixed block (eMMC) embedded devices.
-It has no MBR, so saves storage space. Bootloader can be easily accessed
-by absolute address of data on the block device.
-Users can easily change the partition.
-
-The format for the command line is just like mtdparts:
-
-blkdevparts=<blkdev-def>[;<blkdev-def>]
-  <blkdev-def> := <blkdev-id>:<partdef>[,<partdef>]
-    <partdef> := <size>[@<offset>](part-name)
-
-<blkdev-id>
-    block device disk name. Embedded device uses fixed block device.
-    Its disk name is also fixed, such as: mmcblk0, mmcblk1, mmcblk0boot0.
-
-<size>
-    partition size, in bytes, such as: 512, 1m, 1G.
-    size may contain an optional suffix of (upper or lower case):
-    K, M, G, T, P, E.
-    "-" is used to denote all remaining space.
-
-<offset>
-    partition start address, in bytes.
-    offset may contain an optional suffix of (upper or lower case):
-    K, M, G, T, P, E.
-
-(part-name)
-    partition name. Kernel sends uevent with "PARTNAME". Application can
-    create a link to block device partition with the name "PARTNAME".
-    User space application can access partition by partition name.
-
-Example:
-    eMMC disk names are "mmcblk0" and "mmcblk0boot0".
-
-    bootargs:
-    'blkdevparts=mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)'
-
-    dmesg:
-    mmcblk0: p1(data0) p2(data1) p3()
-    mmcblk0boot0: p1(boot) p2(kernel)
diff --git a/Documentation/block/data-integrity.rst b/Documentation/block/data-integrity.rst
new file mode 100644
index 000000000000..4f2452a95c43
--- /dev/null
+++ b/Documentation/block/data-integrity.rst
@@ -0,0 +1,291 @@
+==============
+Data Integrity
+==============
+
+1. Introduction
+===============
+
+Modern filesystems feature checksumming of data and metadata to
+protect against data corruption. However, the detection of the
+corruption is done at read time which could potentially be months
+after the data was written. At that point the original data that the
+application tried to write is most likely lost.
+
+The solution is to ensure that the disk is actually storing what the
+application meant it to. Recent additions to both the SCSI family
+protocols (SBC Data Integrity Field, SCC protection proposal) as well
+as SATA/T13 (External Path Protection) try to remedy this by adding
+support for appending integrity metadata to an I/O. The integrity
+metadata (or protection information in SCSI terminology) includes a
+checksum for each sector as well as an incrementing counter that
+ensures the individual sectors are written in the right order. For
+some protection schemes it also ensures that the I/O is written to
+the right place on disk.
+
+Current storage controllers and devices implement various protective
+measures, for instance checksumming and scrubbing. But these
+technologies work in their own isolated domains or at best
+between adjacent nodes in the I/O path. The interesting thing about
+DIF and the other integrity extensions is that the protection format
+is well defined and every node in the I/O path can verify the
+integrity of the I/O and reject it if corruption is detected. This
+allows not only corruption prevention but also isolation of the point
+of failure.
+
+2. The Data Integrity Extensions
+================================
+
+As written, the protocol extensions only protect the path between
+controller and storage device. However, many controllers actually
+allow the operating system to interact with the integrity metadata
+(IMD). We have been working with several FC/SAS HBA vendors to enable
+the protection information to be transferred to and from their
+controllers.
+
+The SCSI Data Integrity Field works by appending 8 bytes of protection
+information to each sector. The data + integrity metadata is stored
+in 520 byte sectors on disk. Data + IMD are interleaved when
+transferred between the controller and target. The T13 proposal is
+similar.
+
+Because it is highly inconvenient for operating systems to deal with
+520 (and 4104) byte sectors, we approached several HBA vendors and
+encouraged them to allow separation of the data and integrity metadata
+scatter-gather lists.
+
+The controller will interleave the buffers on write and split them on
+read. This means that Linux can DMA the data buffers to and from
+host memory without changes to the page cache.
+
+Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
+is somewhat heavy to compute in software. Benchmarks found that
+calculating this checksum had a significant impact on system
+performance for a number of workloads. Some controllers allow a
+lighter-weight checksum to be used when interfacing with the operating
+system. Emulex, for instance, supports the TCP/IP checksum instead.
+The IP checksum received from the OS is converted to the 16-bit CRC
+when writing and vice versa. This allows the integrity metadata to be
+generated by Linux or the application at very low cost (comparable to
+software RAID5).
+
+The IP checksum is weaker than the CRC in terms of detecting bit
+errors. However, the strength is really in the separation of the data
+buffers and the integrity metadata. These two distinct buffers must
+match up for an I/O to complete.
+
+The separation of the data and integrity metadata buffers as well as
+the choice in checksums is referred to as the Data Integrity
+Extensions. As these extensions are outside the scope of the protocol
+bodies (T10, T13), Oracle and its partners are trying to standardize
+them within the Storage Networking Industry Association.
+
+3.
Kernel Changes +================= + +The data integrity framework in Linux enables protection information +to be pinned to I/Os and sent to/received from controllers that +support it. + +The advantage to the integrity extensions in SCSI and SATA is that +they enable us to protect the entire path from application to storage +device. However, at the same time this is also the biggest +disadvantage. It means that the protection information must be in a +format that can be understood by the disk. + +Generally Linux/POSIX applications are agnostic to the intricacies of +the storage devices they are accessing. The virtual filesystem switch +and the block layer make things like hardware sector size and +transport protocols completely transparent to the application. + +However, this level of detail is required when preparing the +protection information to send to a disk. Consequently, the very +concept of an end-to-end protection scheme is a layering violation. +It is completely unreasonable for an application to be aware whether +it is accessing a SCSI or SATA disk. + +The data integrity support implemented in Linux attempts to hide this +from the application. As far as the application (and to some extent +the kernel) is concerned, the integrity metadata is opaque information +that's attached to the I/O. + +The current implementation allows the block layer to automatically +generate the protection information for any I/O. Eventually the +intent is to move the integrity metadata calculation to userspace for +user data. Metadata and other I/O that originates within the kernel +will still use the automatic generation interface. + +Some storage devices allow each hardware sector to be tagged with a +16-bit value. The owner of this tag space is the owner of the block +device. I.e. the filesystem in most cases. The filesystem can use +this extra space to tag sectors as they see fit. Because the tag +space is limited, the block interface allows tagging bigger chunks by +way of interleaving. This way, 8*16 bits of information can be +attached to a typical 4KB filesystem block. + +This also means that applications such as fsck and mkfs will need +access to manipulate the tags from user space. A passthrough +interface for this is being worked on. + + +4. Block Layer Implementation Details +===================================== + +4.1 Bio +------- + +The data integrity patches add a new field to struct bio when +CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a +pointer to a struct bip which contains the bio integrity payload. +Essentially a bip is a trimmed down struct bio which holds a bio_vec +containing the integrity metadata and the required housekeeping +information (bvec pool, vector count, etc.) + +A kernel subsystem can enable data integrity protection on a bio by +calling bio_integrity_alloc(bio). This will allocate and attach the +bip to the bio. + +Individual pages containing integrity metadata can subsequently be +attached using bio_integrity_add_page(). + +bio_free() will automatically free the bip. + + +4.2 Block Device +---------------- + +Because the format of the protection data is tied to the physical +disk, each block device has been extended with a block integrity +profile (struct blk_integrity). This optional profile is registered +with the block layer using blk_integrity_register(). + +The profile contains callback functions for generating and verifying +the protection data, as well as getting and setting application tags. 
+The profile also contains a few constants to aid in completing,
+merging and splitting the integrity metadata.
+
+Layered block devices will need to pick a profile that's appropriate
+for all subdevices. blk_integrity_compare() can help with that. DM
+and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
+will require extra work due to the application tag.
+
+
+5.0 Block Layer Integrity API
+=============================
+
+5.1 Normal Filesystem
+---------------------
+
+  The normal filesystem is unaware that the underlying block device
+  is capable of sending/receiving integrity metadata. The IMD will
+  be automatically generated by the block layer at submit_bio() time
+  in case of a WRITE. A READ request will cause the I/O integrity
+  to be verified upon completion.
+
+  IMD generation and verification can be toggled using the::
+
+    /sys/block/<bdev>/integrity/write_generate
+
+  and::
+
+    /sys/block/<bdev>/integrity/read_verify
+
+  flags.
+
+
+5.2 Integrity-Aware Filesystem
+------------------------------
+
+  A filesystem that is integrity-aware can prepare I/Os with IMD
+  attached. It can also use the application tag space if this is
+  supported by the block device.
+
+
+  `bool bio_integrity_prep(bio);`
+
+    To generate IMD for WRITE and to set up buffers for READ, the
+    filesystem must call bio_integrity_prep(bio).
+
+    Prior to calling this function, the bio data direction and start
+    sector must be set, and the bio should have all data pages
+    added. It is up to the caller to ensure that the bio does not
+    change while I/O is in progress. If preparation fails for some
+    reason, the bio is completed with an error.
+
+
+5.3 Passing Existing Integrity Metadata
+---------------------------------------
+
+  Filesystems that either generate their own integrity metadata or
+  are capable of transferring IMD from user space can use the
+  following calls:
+
+
+  `struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);`
+
+    Allocates the bio integrity payload and hangs it off of the bio.
+    nr_pages indicates how many pages of protection data need to be
+    stored in the integrity bio_vec list (similar to bio_alloc()).
+
+    The integrity payload will be freed at bio_free() time.
+
+
+  `int bio_integrity_add_page(bio, page, len, offset);`
+
+    Attaches a page containing integrity metadata to an existing
+    bio. The bio must have an existing bip,
+    i.e. bio_integrity_alloc() must have been called. For a WRITE,
+    the integrity metadata in the pages must be in a format
+    understood by the target device with the notable exception that
+    the sector numbers will be remapped as the request traverses the
+    I/O stack. This implies that the pages added using this call
+    will be modified during I/O! The first reference tag in the
+    integrity metadata must have a value of bip->bip_sector.
+
+    Pages can be added using bio_integrity_add_page() as long as
+    there is room in the bip bio_vec array (nr_pages).
+
+    Upon completion of a READ operation, the attached pages will
+    contain the integrity metadata received from the storage device.
+    It is up to the receiver to process them and verify data
+    integrity upon completion.
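+
+  As an illustration only (a sketch, not code from the kernel tree),
+  a subsystem holding a page of preformatted IMD might attach it to a
+  bio along these lines; 'bio', 'page', 'len' and 'offset' are assumed
+  to be set up by the caller::
+
+    struct bio_integrity_payload *bip;
+
+    bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
+    if (IS_ERR(bip))
+        return PTR_ERR(bip);
+
+    /* IMD must already be in the target's format; see the notes above */
+    if (bio_integrity_add_page(bio, page, len, offset) != len)
+        return -EIO;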
+
+
+5.4 Registering A Block Device As Capable Of Exchanging Integrity Metadata
+--------------------------------------------------------------------------
+
+  To enable integrity exchange on a block device the gendisk must be
+  registered as capable:
+
+  `int blk_integrity_register(gendisk, blk_integrity);`
+
+    The blk_integrity struct is a template and should contain the
+    following::
+
+      static struct blk_integrity my_profile = {
+          .name           = "STANDARDSBODY-TYPE-VARIANT-CSUM",
+          .generate_fn    = my_generate_fn,
+          .verify_fn      = my_verify_fn,
+          .tuple_size     = sizeof(struct my_tuple_size),
+          .tag_size       = <tag bytes per hw sector>,
+      };
+
+    'name' is a text string which will be visible in sysfs. This is
+    part of the userland API so choose it carefully and never change
+    it. The format is standards body-type-variant.
+    E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
+
+    'generate_fn' generates appropriate integrity metadata (for WRITE).
+
+    'verify_fn' verifies that the data buffer matches the integrity
+    metadata.
+
+    'tuple_size' must be set to match the size of the integrity
+    metadata per sector. I.e. 8 for DIF and EPP.
+
+    'tag_size' must be set to identify how many bytes of tag space
+    are available per hardware sector. For DIF this is either 2 or
+    0 depending on the value of the Control Mode Page ATO bit.
+
+----------------------------------------------------------------------
+
+2007-12-24 Martin K. Petersen
diff --git a/Documentation/block/data-integrity.txt b/Documentation/block/data-integrity.txt
deleted file mode 100644
index 934c44ea0c57..000000000000
--- a/Documentation/block/data-integrity.txt
+++ /dev/null
@@ -1,281 +0,0 @@
-----------------------------------------------------------------------
-1. INTRODUCTION
-
-Modern filesystems feature checksumming of data and metadata to
-protect against data corruption. However, the detection of the
-corruption is done at read time which could potentially be months
-after the data was written. At that point the original data that the
-application tried to write is most likely lost.
-
-The solution is to ensure that the disk is actually storing what the
-application meant it to. Recent additions to both the SCSI family
-protocols (SBC Data Integrity Field, SCC protection proposal) as well
-as SATA/T13 (External Path Protection) try to remedy this by adding
-support for appending integrity metadata to an I/O. The integrity
-metadata (or protection information in SCSI terminology) includes a
-checksum for each sector as well as an incrementing counter that
-ensures the individual sectors are written in the right order. And
-for some protection schemes also that the I/O is written to the right
-place on disk.
-
-Current storage controllers and devices implement various protective
-measures, for instance checksumming and scrubbing. But these
-technologies are working in their own isolated domains or at best
-between adjacent nodes in the I/O path. The interesting thing about
-DIF and the other integrity extensions is that the protection format
-is well defined and every node in the I/O path can verify the
-integrity of the I/O and reject it if corruption is detected. This
-allows not only corruption prevention but also isolation of the point
-of failure.
-
-----------------------------------------------------------------------
-2. THE DATA INTEGRITY EXTENSIONS
-
-As written, the protocol extensions only protect the path between
-controller and storage device. However, many controllers actually
-allow the operating system to interact with the integrity metadata
-(IMD).
We have been working with several FC/SAS HBA vendors to enable -the protection information to be transferred to and from their -controllers. - -The SCSI Data Integrity Field works by appending 8 bytes of protection -information to each sector. The data + integrity metadata is stored -in 520 byte sectors on disk. Data + IMD are interleaved when -transferred between the controller and target. The T13 proposal is -similar. - -Because it is highly inconvenient for operating systems to deal with -520 (and 4104) byte sectors, we approached several HBA vendors and -encouraged them to allow separation of the data and integrity metadata -scatter-gather lists. - -The controller will interleave the buffers on write and split them on -read. This means that Linux can DMA the data buffers to and from -host memory without changes to the page cache. - -Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs -is somewhat heavy to compute in software. Benchmarks found that -calculating this checksum had a significant impact on system -performance for a number of workloads. Some controllers allow a -lighter-weight checksum to be used when interfacing with the operating -system. Emulex, for instance, supports the TCP/IP checksum instead. -The IP checksum received from the OS is converted to the 16-bit CRC -when writing and vice versa. This allows the integrity metadata to be -generated by Linux or the application at very low cost (comparable to -software RAID5). - -The IP checksum is weaker than the CRC in terms of detecting bit -errors. However, the strength is really in the separation of the data -buffers and the integrity metadata. These two distinct buffers must -match up for an I/O to complete. - -The separation of the data and integrity metadata buffers as well as -the choice in checksums is referred to as the Data Integrity -Extensions. As these extensions are outside the scope of the protocol -bodies (T10, T13), Oracle and its partners are trying to standardize -them within the Storage Networking Industry Association. - ----------------------------------------------------------------------- -3. KERNEL CHANGES - -The data integrity framework in Linux enables protection information -to be pinned to I/Os and sent to/received from controllers that -support it. - -The advantage to the integrity extensions in SCSI and SATA is that -they enable us to protect the entire path from application to storage -device. However, at the same time this is also the biggest -disadvantage. It means that the protection information must be in a -format that can be understood by the disk. - -Generally Linux/POSIX applications are agnostic to the intricacies of -the storage devices they are accessing. The virtual filesystem switch -and the block layer make things like hardware sector size and -transport protocols completely transparent to the application. - -However, this level of detail is required when preparing the -protection information to send to a disk. Consequently, the very -concept of an end-to-end protection scheme is a layering violation. -It is completely unreasonable for an application to be aware whether -it is accessing a SCSI or SATA disk. - -The data integrity support implemented in Linux attempts to hide this -from the application. As far as the application (and to some extent -the kernel) is concerned, the integrity metadata is opaque information -that's attached to the I/O. - -The current implementation allows the block layer to automatically -generate the protection information for any I/O. 
Eventually the -intent is to move the integrity metadata calculation to userspace for -user data. Metadata and other I/O that originates within the kernel -will still use the automatic generation interface. - -Some storage devices allow each hardware sector to be tagged with a -16-bit value. The owner of this tag space is the owner of the block -device. I.e. the filesystem in most cases. The filesystem can use -this extra space to tag sectors as they see fit. Because the tag -space is limited, the block interface allows tagging bigger chunks by -way of interleaving. This way, 8*16 bits of information can be -attached to a typical 4KB filesystem block. - -This also means that applications such as fsck and mkfs will need -access to manipulate the tags from user space. A passthrough -interface for this is being worked on. - - ----------------------------------------------------------------------- -4. BLOCK LAYER IMPLEMENTATION DETAILS - -4.1 BIO - -The data integrity patches add a new field to struct bio when -CONFIG_BLK_DEV_INTEGRITY is enabled. bio_integrity(bio) returns a -pointer to a struct bip which contains the bio integrity payload. -Essentially a bip is a trimmed down struct bio which holds a bio_vec -containing the integrity metadata and the required housekeeping -information (bvec pool, vector count, etc.) - -A kernel subsystem can enable data integrity protection on a bio by -calling bio_integrity_alloc(bio). This will allocate and attach the -bip to the bio. - -Individual pages containing integrity metadata can subsequently be -attached using bio_integrity_add_page(). - -bio_free() will automatically free the bip. - - -4.2 BLOCK DEVICE - -Because the format of the protection data is tied to the physical -disk, each block device has been extended with a block integrity -profile (struct blk_integrity). This optional profile is registered -with the block layer using blk_integrity_register(). - -The profile contains callback functions for generating and verifying -the protection data, as well as getting and setting application tags. -The profile also contains a few constants to aid in completing, -merging and splitting the integrity metadata. - -Layered block devices will need to pick a profile that's appropriate -for all subdevices. blk_integrity_compare() can help with that. DM -and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6 -will require extra work due to the application tag. - - ----------------------------------------------------------------------- -5.0 BLOCK LAYER INTEGRITY API - -5.1 NORMAL FILESYSTEM - - The normal filesystem is unaware that the underlying block device - is capable of sending/receiving integrity metadata. The IMD will - be automatically generated by the block layer at submit_bio() time - in case of a WRITE. A READ request will cause the I/O integrity - to be verified upon completion. - - IMD generation and verification can be toggled using the - - /sys/block//integrity/write_generate - - and - - /sys/block//integrity/read_verify - - flags. - - -5.2 INTEGRITY-AWARE FILESYSTEM - - A filesystem that is integrity-aware can prepare I/Os with IMD - attached. It can also use the application tag space if this is - supported by the block device. - - - bool bio_integrity_prep(bio); - - To generate IMD for WRITE and to set up buffers for READ, the - filesystem must call bio_integrity_prep(bio). - - Prior to calling this function, the bio data direction and start - sector must be set, and the bio should have all data pages - added. 
It is up to the caller to ensure that the bio does not - change while I/O is in progress. - Complete bio with error if prepare failed for some reson. - - -5.3 PASSING EXISTING INTEGRITY METADATA - - Filesystems that either generate their own integrity metadata or - are capable of transferring IMD from user space can use the - following calls: - - - struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages); - - Allocates the bio integrity payload and hangs it off of the bio. - nr_pages indicate how many pages of protection data need to be - stored in the integrity bio_vec list (similar to bio_alloc()). - - The integrity payload will be freed at bio_free() time. - - - int bio_integrity_add_page(bio, page, len, offset); - - Attaches a page containing integrity metadata to an existing - bio. The bio must have an existing bip, - i.e. bio_integrity_alloc() must have been called. For a WRITE, - the integrity metadata in the pages must be in a format - understood by the target device with the notable exception that - the sector numbers will be remapped as the request traverses the - I/O stack. This implies that the pages added using this call - will be modified during I/O! The first reference tag in the - integrity metadata must have a value of bip->bip_sector. - - Pages can be added using bio_integrity_add_page() as long as - there is room in the bip bio_vec array (nr_pages). - - Upon completion of a READ operation, the attached pages will - contain the integrity metadata received from the storage device. - It is up to the receiver to process them and verify data - integrity upon completion. - - -5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY - METADATA - - To enable integrity exchange on a block device the gendisk must be - registered as capable: - - int blk_integrity_register(gendisk, blk_integrity); - - The blk_integrity struct is a template and should contain the - following: - - static struct blk_integrity my_profile = { - .name = "STANDARDSBODY-TYPE-VARIANT-CSUM", - .generate_fn = my_generate_fn, - .verify_fn = my_verify_fn, - .tuple_size = sizeof(struct my_tuple_size), - .tag_size = , - }; - - 'name' is a text string which will be visible in sysfs. This is - part of the userland API so chose it carefully and never change - it. The format is standards body-type-variant. - E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC. - - 'generate_fn' generates appropriate integrity metadata (for WRITE). - - 'verify_fn' verifies that the data buffer matches the integrity - metadata. - - 'tuple_size' must be set to match the size of the integrity - metadata per sector. I.e. 8 for DIF and EPP. - - 'tag_size' must be set to identify how many bytes of tag space - are available per hardware sector. For DIF this is either 2 or - 0 depending on the value of the Control Mode Page ATO bit. - ----------------------------------------------------------------------- -2007-12-24 Martin K. Petersen diff --git a/Documentation/block/deadline-iosched.rst b/Documentation/block/deadline-iosched.rst new file mode 100644 index 000000000000..9f5c5a4c370e --- /dev/null +++ b/Documentation/block/deadline-iosched.rst @@ -0,0 +1,72 @@ +============================== +Deadline IO scheduler tunables +============================== + +This little file attempts to document how the deadline io scheduler works. +In particular, it will clarify the meaning of the exposed tunables that may be +of interest to power users. 
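+
+All of the tunables described below are exposed as files in the iosched
+directory under the device's request queue in sysfs, and can be read and
+written like any other sysfs attribute. For example (sda is only a
+placeholder device name here)::
+
+	# ls /sys/block/sda/queue/iosched
+	fifo_batch  front_merges  read_expire  write_expire  writes_starved
+	# echo 250 > /sys/block/sda/queue/iosched/read_expire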
+ +Selecting IO schedulers +----------------------- +Refer to Documentation/block/switching-sched.rst for information on +selecting an io scheduler on a per-device basis. + +------------------------------------------------------------------------------ + +read_expire (in ms) +----------------------- + +The goal of the deadline io scheduler is to attempt to guarantee a start +service time for a request. As we focus mainly on read latencies, this is +tunable. When a read request first enters the io scheduler, it is assigned +a deadline that is the current time + the read_expire value in units of +milliseconds. + + +write_expire (in ms) +----------------------- + +Similar to read_expire mentioned above, but for writes. + + +fifo_batch (number of requests) +------------------------------------ + +Requests are grouped into ``batches`` of a particular data direction (read or +write) which are serviced in increasing sector order. To limit extra seeking, +deadline expiries are only checked between batches. fifo_batch controls the +maximum number of requests per batch. + +This parameter tunes the balance between per-request latency and aggregate +throughput. When low latency is the primary concern, smaller is better (where +a value of 1 yields first-come first-served behaviour). Increasing fifo_batch +generally improves throughput, at the cost of latency variation. + + +writes_starved (number of dispatches) +-------------------------------------- + +When we have to move requests from the io scheduler queue to the block +device dispatch queue, we always give a preference to reads. However, we +don't want to starve writes indefinitely either. So writes_starved controls +how many times we give preference to reads over writes. When that has been +done writes_starved number of times, we dispatch some writes based on the +same criteria as reads. + + +front_merges (bool) +---------------------- + +Sometimes it happens that a request enters the io scheduler that is contiguous +with a request that is already on the queue. Either it fits in the back of that +request, or it fits at the front. That is called either a back merge candidate +or a front merge candidate. Due to the way files are typically laid out, +back merges are much more common than front merges. For some work loads, you +may even know that it is a waste of time to spend any time attempting to +front merge requests. Setting front_merges to 0 disables this functionality. +Front merges may still occur due to the cached last_merge hint, but since +that comes at basically 0 cost we leave that on. We simply disable the +rbtree front sector lookup when the io scheduler merge function is called. + + +Nov 11 2002, Jens Axboe diff --git a/Documentation/block/deadline-iosched.txt b/Documentation/block/deadline-iosched.txt deleted file mode 100644 index 2d82c80322cb..000000000000 --- a/Documentation/block/deadline-iosched.txt +++ /dev/null @@ -1,75 +0,0 @@ -Deadline IO scheduler tunables -============================== - -This little file attempts to document how the deadline io scheduler works. -In particular, it will clarify the meaning of the exposed tunables that may be -of interest to power users. - -Selecting IO schedulers ------------------------ -Refer to Documentation/block/switching-sched.txt for information on -selecting an io scheduler on a per-device basis. 
- - -******************************************************************************** - - -read_expire (in ms) ------------ - -The goal of the deadline io scheduler is to attempt to guarantee a start -service time for a request. As we focus mainly on read latencies, this is -tunable. When a read request first enters the io scheduler, it is assigned -a deadline that is the current time + the read_expire value in units of -milliseconds. - - -write_expire (in ms) ------------ - -Similar to read_expire mentioned above, but for writes. - - -fifo_batch (number of requests) ----------- - -Requests are grouped into ``batches'' of a particular data direction (read or -write) which are serviced in increasing sector order. To limit extra seeking, -deadline expiries are only checked between batches. fifo_batch controls the -maximum number of requests per batch. - -This parameter tunes the balance between per-request latency and aggregate -throughput. When low latency is the primary concern, smaller is better (where -a value of 1 yields first-come first-served behaviour). Increasing fifo_batch -generally improves throughput, at the cost of latency variation. - - -writes_starved (number of dispatches) --------------- - -When we have to move requests from the io scheduler queue to the block -device dispatch queue, we always give a preference to reads. However, we -don't want to starve writes indefinitely either. So writes_starved controls -how many times we give preference to reads over writes. When that has been -done writes_starved number of times, we dispatch some writes based on the -same criteria as reads. - - -front_merges (bool) ------------- - -Sometimes it happens that a request enters the io scheduler that is contiguous -with a request that is already on the queue. Either it fits in the back of that -request, or it fits at the front. That is called either a back merge candidate -or a front merge candidate. Due to the way files are typically laid out, -back merges are much more common than front merges. For some work loads, you -may even know that it is a waste of time to spend any time attempting to -front merge requests. Setting front_merges to 0 disables this functionality. -Front merges may still occur due to the cached last_merge hint, but since -that comes at basically 0 cost we leave that on. We simply disable the -rbtree front sector lookup when the io scheduler merge function is called. - - -Nov 11 2002, Jens Axboe - - diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst new file mode 100644 index 000000000000..8cd226a0e86e --- /dev/null +++ b/Documentation/block/index.rst @@ -0,0 +1,25 @@ +:orphan: + +===== +Block +===== + +.. toctree:: + :maxdepth: 1 + + bfq-iosched + biodoc + biovecs + capability + cmdline-partition + data-integrity + deadline-iosched + ioprio + kyber-iosched + null_blk + pr + queue-sysfs + request + stat + switching-sched + writeback_cache_control diff --git a/Documentation/block/ioprio.rst b/Documentation/block/ioprio.rst new file mode 100644 index 000000000000..f72b0de65af7 --- /dev/null +++ b/Documentation/block/ioprio.rst @@ -0,0 +1,182 @@ +=================== +Block io priorities +=================== + + +Intro +----- + +With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io +priorities are supported for reads on files. This enables users to io nice +processes or process groups, similar to what has been possible with cpu +scheduling for ages. 
This document mainly details the current possibilities
+with cfq; other io schedulers do not support io priorities thus far.
+
+Scheduling classes
+------------------
+
+CFQ implements three generic scheduling classes that determine how io is
+served for a process.
+
+IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is given
+higher priority than any other in the system, processes from this class are
+given first access to the disk every time. Thus it needs to be used with some
+care, one io RT process can starve the entire system. Within the RT class,
+there are 8 levels of class data that determine exactly how much time this
+process needs the disk for on each service. In the future this might change
+to be more directly mappable to performance, by passing in a wanted data
+rate instead.
+
+IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the default
+for any process that hasn't set a specific io priority. The class data
+determines how much io bandwidth the process will get, it's directly mappable
+to the cpu nice levels just more coarsely implemented. 0 is the highest
+BE prio level, 7 is the lowest. The mapping between cpu nice level and io
+nice level is determined as: io_nice = (cpu_nice + 20) / 5.
+
+IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this
+level only get io time when no one else needs the disk. The idle class has no
+class data, since it doesn't really apply here.
+
+Tools
+-----
+
+See below for a sample ionice tool. Usage::
+
+	# ionice -c<class> -n<level> -p<pid>
+
+If pid isn't given, the current process is assumed. IO priority settings
+are inherited on fork, so you can use ionice to start the process at a given
+level::
+
+	# ionice -c2 -n0 /bin/ls
+
+will run ls at the best-effort scheduling class at the highest priority.
+For a running process, you can give the pid instead::
+
+	# ionice -c1 -n2 -p100
+
+will change pid 100 to run at the realtime scheduling class, at priority 2.
+
+ionice.c tool::
+
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <errno.h>
+    #include <getopt.h>
+    #include <unistd.h>
+    #include <sys/ptrace.h>
+    #include <asm/unistd.h>
+
+    extern int sys_ioprio_set(int, int, int);
+    extern int sys_ioprio_get(int, int);
+
+    #if defined(__i386__)
+    #define __NR_ioprio_set		289
+    #define __NR_ioprio_get		290
+    #elif defined(__ppc__)
+    #define __NR_ioprio_set		273
+    #define __NR_ioprio_get		274
+    #elif defined(__x86_64__)
+    #define __NR_ioprio_set		251
+    #define __NR_ioprio_get		252
+    #elif defined(__ia64__)
+    #define __NR_ioprio_set		1274
+    #define __NR_ioprio_get		1275
+    #else
+    #error "Unsupported arch"
+    #endif
+
+    static inline int ioprio_set(int which, int who, int ioprio)
+    {
+    	return syscall(__NR_ioprio_set, which, who, ioprio);
+    }
+
+    static inline int ioprio_get(int which, int who)
+    {
+    	return syscall(__NR_ioprio_get, which, who);
+    }
+
+    enum {
+    	IOPRIO_CLASS_NONE,
+    	IOPRIO_CLASS_RT,
+    	IOPRIO_CLASS_BE,
+    	IOPRIO_CLASS_IDLE,
+    };
+
+    enum {
+    	IOPRIO_WHO_PROCESS = 1,
+    	IOPRIO_WHO_PGRP,
+    	IOPRIO_WHO_USER,
+    };
+
+    #define IOPRIO_CLASS_SHIFT	13
+
+    const char *to_prio[] = { "none", "realtime", "best-effort", "idle", };
+
+    int main(int argc, char *argv[])
+    {
+    	int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE;
+    	int c, pid = 0;
+
+    	while ((c = getopt(argc, argv, "+n:c:p:")) != EOF) {
+    		switch (c) {
+    		case 'n':
+    			ioprio = strtol(optarg, NULL, 10);
+    			set = 1;
+    			break;
+    		case 'c':
+    			ioprio_class = strtol(optarg, NULL, 10);
+    			set = 1;
+    			break;
+    		case 'p':
+    			pid = strtol(optarg, NULL, 10);
+    			break;
+    		}
+    	}
+
+    	switch (ioprio_class) {
+    	case IOPRIO_CLASS_NONE:
+    		ioprio_class = IOPRIO_CLASS_BE;
+    		break;
+    	case IOPRIO_CLASS_RT:
+    	case IOPRIO_CLASS_BE:
+    		break;
+    	case IOPRIO_CLASS_IDLE:
+    		ioprio = 7;
+    		break;
+    	default:
+    		printf("bad prio class %d\n", ioprio_class);
+    		return 1;
+    	}
+
+    	if (!set) {
+    		if (!pid && argv[optind])
+    			pid = strtol(argv[optind], NULL, 10);
+
+    		ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid);
+
+    		printf("pid=%d, %d\n", pid, ioprio);
+
+    		if (ioprio == -1)
+    			perror("ioprio_get");
+    		else {
+    			ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT;
+    			ioprio = ioprio & 0xff;
+    			printf("%s: prio %d\n", to_prio[ioprio_class], ioprio);
+    		}
+    	} else {
+    		if (ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) {
+    			perror("ioprio_set");
+    			return 1;
+    		}
+
+    		if (argv[optind])
+    			execvp(argv[optind], &argv[optind]);
+    	}
+
+    	return 0;
+    }
+
+
+March 11 2005, Jens Axboe
diff --git a/Documentation/block/ioprio.txt b/Documentation/block/ioprio.txt
deleted file mode 100644
index 8ed8c59380b4..000000000000
--- a/Documentation/block/ioprio.txt
+++ /dev/null
@@ -1,183 +0,0 @@
-Block io priorities
-===================
-
-
-Intro
------
-
-With the introduction of cfq v3 (aka cfq-ts or time sliced cfq), basic io
-priorities are supported for reads on files. This enables users to io nice
-processes or process groups, similar to what has been possible with cpu
-scheduling for ages. This document mainly details the current possibilities
-with cfq; other io schedulers do not support io priorities thus far.
-
-Scheduling classes
-------------------
-
-CFQ implements three generic scheduling classes that determine how io is
-served for a process.
-
-IOPRIO_CLASS_RT: This is the realtime io class. This scheduling class is given
-higher priority than any other in the system, processes from this class are
-given first access to the disk every time. Thus it needs to be used with some
-care, one io RT process can starve the entire system.
Within the RT class, -there are 8 levels of class data that determine exactly how much time this -process needs the disk for on each service. In the future this might change -to be more directly mappable to performance, by passing in a wanted data -rate instead. - -IOPRIO_CLASS_BE: This is the best-effort scheduling class, which is the default -for any process that hasn't set a specific io priority. The class data -determines how much io bandwidth the process will get, it's directly mappable -to the cpu nice levels just more coarsely implemented. 0 is the highest -BE prio level, 7 is the lowest. The mapping between cpu nice level and io -nice level is determined as: io_nice = (cpu_nice + 20) / 5. - -IOPRIO_CLASS_IDLE: This is the idle scheduling class, processes running at this -level only get io time when no one else needs the disk. The idle class has no -class data, since it doesn't really apply here. - -Tools ------ - -See below for a sample ionice tool. Usage: - -# ionice -c -n -p - -If pid isn't given, the current process is assumed. IO priority settings -are inherited on fork, so you can use ionice to start the process at a given -level: - -# ionice -c2 -n0 /bin/ls - -will run ls at the best-effort scheduling class at the highest priority. -For a running process, you can give the pid instead: - -# ionice -c1 -n2 -p100 - -will change pid 100 to run at the realtime scheduling class, at priority 2. - ----> snip ionice.c tool <--- - -#include -#include -#include -#include -#include -#include -#include - -extern int sys_ioprio_set(int, int, int); -extern int sys_ioprio_get(int, int); - -#if defined(__i386__) -#define __NR_ioprio_set 289 -#define __NR_ioprio_get 290 -#elif defined(__ppc__) -#define __NR_ioprio_set 273 -#define __NR_ioprio_get 274 -#elif defined(__x86_64__) -#define __NR_ioprio_set 251 -#define __NR_ioprio_get 252 -#elif defined(__ia64__) -#define __NR_ioprio_set 1274 -#define __NR_ioprio_get 1275 -#else -#error "Unsupported arch" -#endif - -static inline int ioprio_set(int which, int who, int ioprio) -{ - return syscall(__NR_ioprio_set, which, who, ioprio); -} - -static inline int ioprio_get(int which, int who) -{ - return syscall(__NR_ioprio_get, which, who); -} - -enum { - IOPRIO_CLASS_NONE, - IOPRIO_CLASS_RT, - IOPRIO_CLASS_BE, - IOPRIO_CLASS_IDLE, -}; - -enum { - IOPRIO_WHO_PROCESS = 1, - IOPRIO_WHO_PGRP, - IOPRIO_WHO_USER, -}; - -#define IOPRIO_CLASS_SHIFT 13 - -const char *to_prio[] = { "none", "realtime", "best-effort", "idle", }; - -int main(int argc, char *argv[]) -{ - int ioprio = 4, set = 0, ioprio_class = IOPRIO_CLASS_BE; - int c, pid = 0; - - while ((c = getopt(argc, argv, "+n:c:p:")) != EOF) { - switch (c) { - case 'n': - ioprio = strtol(optarg, NULL, 10); - set = 1; - break; - case 'c': - ioprio_class = strtol(optarg, NULL, 10); - set = 1; - break; - case 'p': - pid = strtol(optarg, NULL, 10); - break; - } - } - - switch (ioprio_class) { - case IOPRIO_CLASS_NONE: - ioprio_class = IOPRIO_CLASS_BE; - break; - case IOPRIO_CLASS_RT: - case IOPRIO_CLASS_BE: - break; - case IOPRIO_CLASS_IDLE: - ioprio = 7; - break; - default: - printf("bad prio class %d\n", ioprio_class); - return 1; - } - - if (!set) { - if (!pid && argv[optind]) - pid = strtol(argv[optind], NULL, 10); - - ioprio = ioprio_get(IOPRIO_WHO_PROCESS, pid); - - printf("pid=%d, %d\n", pid, ioprio); - - if (ioprio == -1) - perror("ioprio_get"); - else { - ioprio_class = ioprio >> IOPRIO_CLASS_SHIFT; - ioprio = ioprio & 0xff; - printf("%s: prio %d\n", to_prio[ioprio_class], ioprio); - } - } else { - if 
(ioprio_set(IOPRIO_WHO_PROCESS, pid, ioprio | ioprio_class << IOPRIO_CLASS_SHIFT) == -1) { - perror("ioprio_set"); - return 1; - } - - if (argv[optind]) - execvp(argv[optind], &argv[optind]); - } - - return 0; -} - ----> snip ionice.c tool <--- - - -March 11 2005, Jens Axboe diff --git a/Documentation/block/kyber-iosched.rst b/Documentation/block/kyber-iosched.rst new file mode 100644 index 000000000000..3e164dd0617c --- /dev/null +++ b/Documentation/block/kyber-iosched.rst @@ -0,0 +1,15 @@ +============================ +Kyber I/O scheduler tunables +============================ + +The only two tunables for the Kyber scheduler are the target latencies for +reads and synchronous writes. Kyber will throttle requests in order to meet +these target latencies. + +read_lat_nsec +------------- +Target latency for reads (in nanoseconds). + +write_lat_nsec +-------------- +Target latency for synchronous writes (in nanoseconds). diff --git a/Documentation/block/kyber-iosched.txt b/Documentation/block/kyber-iosched.txt deleted file mode 100644 index e94feacd7edc..000000000000 --- a/Documentation/block/kyber-iosched.txt +++ /dev/null @@ -1,14 +0,0 @@ -Kyber I/O scheduler tunables -=========================== - -The only two tunables for the Kyber scheduler are the target latencies for -reads and synchronous writes. Kyber will throttle requests in order to meet -these target latencies. - -read_lat_nsec -------------- -Target latency for reads (in nanoseconds). - -write_lat_nsec --------------- -Target latency for synchronous writes (in nanoseconds). diff --git a/Documentation/block/null_blk.rst b/Documentation/block/null_blk.rst new file mode 100644 index 000000000000..31451d80783c --- /dev/null +++ b/Documentation/block/null_blk.rst @@ -0,0 +1,126 @@ +======================== +Null block device driver +======================== + +1. Overview +=========== + +The null block device (/dev/nullb*) is used for benchmarking the various +block-layer implementations. It emulates a block device of X gigabytes in size. +The following instances are possible: + + Single-queue block-layer + + - Request-based. + - Single submission queue per device. + - Implements IO scheduling algorithms (CFQ, Deadline, noop). + + Multi-queue block-layer + + - Request-based. + - Configurable submission queues per device. + + No block-layer (Known as bio-based) + + - Bio-based. IO requests are submitted directly to the device driver. + - Directly accepts bio data structure and returns them. + +All of them have a completion queue for each core in the system. + +2. Module parameters applicable for all instances +================================================= + +queue_mode=[0-2]: Default: 2-Multi-queue + Selects which block-layer the module should instantiate with. + + = ============ + 0 Bio-based + 1 Single-queue + 2 Multi-queue + = ============ + +home_node=[0--nr_nodes]: Default: NUMA_NO_NODE + Selects what CPU node the data structures are allocated from. + +gb=[Size in GB]: Default: 250GB + The size of the device reported to the system. + +bs=[Block size (in bytes)]: Default: 512 bytes + The block size reported to the system. + +nr_devices=[Number of devices]: Default: 1 + Number of block devices instantiated. They are instantiated as /dev/nullb0, + etc. + +irqmode=[0-2]: Default: 1-Soft-irq + The completion mode used for completing IOs to the block-layer. + + = =========================================================================== + 0 None. + 1 Soft-irq. Uses IPI to complete IOs across CPU nodes. 
Simulates the overhead
+    when IOs are issued from another CPU node than the home the device is
+    connected to.
+  2 Timer: Waits a specific period (completion_nsec) for each IO before
+    completion.
+  = ===========================================================================
+
+completion_nsec=[ns]: Default: 10,000ns
+  Combined with irqmode=2 (timer). The time each completion event must wait.
+
+submit_queues=[1..nr_cpus]:
+  The number of submission queues attached to the device driver. If unset, it
+  defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module
+  parameter is 1.
+
+hw_queue_depth=[0..qdepth]: Default: 64
+  The hardware queue depth of the device.
+
+3. Multi-queue specific parameters
+==================================
+
+use_per_node_hctx=[0/1]: Default: 0
+
+  = =====================================================================
+  0 The number of submit queues are set to the value of the submit_queues
+    parameter.
+  1 The multi-queue block layer is instantiated with a hardware dispatch
+    queue for each CPU node in the system.
+  = =====================================================================
+
+no_sched=[0/1]: Default: 0
+
+  = ======================================
+  0 nullb* use default blk-mq io scheduler
+  1 nullb* doesn't use io scheduler
+  = ======================================
+
+blocking=[0/1]: Default: 0
+
+  = ===============================================================
+  0 Register as a non-blocking blk-mq driver device.
+  1 Register as a blocking blk-mq driver device, null_blk will set
+    the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always
+    needs to block in its ->queue_rq() function.
+  = ===============================================================
+
+shared_tags=[0/1]: Default: 0
+
+  = ================================================================
+  0 Tag set is not shared.
+  1 Tag set shared between devices for blk-mq. Only makes sense with
+    nr_devices > 1, otherwise there's no tag set to share.
+  = ================================================================
+
+zoned=[0/1]: Default: 0
+
+  = ======================================================================
+  0 Block device is exposed as a random-access block device.
+  1 Block device is exposed as a host-managed zoned block device. Requires
+    CONFIG_BLK_DEV_ZONED.
+  = ======================================================================
+
+zone_size=[MB]: Default: 256
+  Per zone size when exposed as a zoned block device. Must be a power of two.
+
+zone_nr_conv=[nr_conv]: Default: 0
+  The number of conventional zones to create when block device is zoned. If
+  zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1.
diff --git a/Documentation/block/null_blk.txt b/Documentation/block/null_blk.txt
deleted file mode 100644
index 41f0a3d33bbd..000000000000
--- a/Documentation/block/null_blk.txt
+++ /dev/null
@@ -1,99 +0,0 @@
-Null block device driver
-================================================================================
-
-I. Overview
-
-The null block device (/dev/nullb*) is used for benchmarking the various
-block-layer implementations. It emulates a block device of X gigabytes in size.
-The following instances are possible:
-
-  Single-queue block-layer
-    - Request-based.
-    - Single submission queue per device.
-    - Implements IO scheduling algorithms (CFQ, Deadline, noop).
-  Multi-queue block-layer
-    - Request-based.
-    - Configurable submission queues per device.
-  No block-layer (Known as bio-based)
-    - Bio-based. IO requests are submitted directly to the device driver.
- - Directly accepts bio data structure and returns them. - -All of them have a completion queue for each core in the system. - -II. Module parameters applicable for all instances: - -queue_mode=[0-2]: Default: 2-Multi-queue - Selects which block-layer the module should instantiate with. - - 0: Bio-based. - 1: Single-queue. - 2: Multi-queue. - -home_node=[0--nr_nodes]: Default: NUMA_NO_NODE - Selects what CPU node the data structures are allocated from. - -gb=[Size in GB]: Default: 250GB - The size of the device reported to the system. - -bs=[Block size (in bytes)]: Default: 512 bytes - The block size reported to the system. - -nr_devices=[Number of devices]: Default: 1 - Number of block devices instantiated. They are instantiated as /dev/nullb0, - etc. - -irqmode=[0-2]: Default: 1-Soft-irq - The completion mode used for completing IOs to the block-layer. - - 0: None. - 1: Soft-irq. Uses IPI to complete IOs across CPU nodes. Simulates the overhead - when IOs are issued from another CPU node than the home the device is - connected to. - 2: Timer: Waits a specific period (completion_nsec) for each IO before - completion. - -completion_nsec=[ns]: Default: 10,000ns - Combined with irqmode=2 (timer). The time each completion event must wait. - -submit_queues=[1..nr_cpus]: - The number of submission queues attached to the device driver. If unset, it - defaults to 1. For multi-queue, it is ignored when use_per_node_hctx module - parameter is 1. - -hw_queue_depth=[0..qdepth]: Default: 64 - The hardware queue depth of the device. - -III: Multi-queue specific parameters - -use_per_node_hctx=[0/1]: Default: 0 - 0: The number of submit queues are set to the value of the submit_queues - parameter. - 1: The multi-queue block layer is instantiated with a hardware dispatch - queue for each CPU node in the system. - -no_sched=[0/1]: Default: 0 - 0: nullb* use default blk-mq io scheduler. - 1: nullb* doesn't use io scheduler. - -blocking=[0/1]: Default: 0 - 0: Register as a non-blocking blk-mq driver device. - 1: Register as a blocking blk-mq driver device, null_blk will set - the BLK_MQ_F_BLOCKING flag, indicating that it sometimes/always - needs to block in its ->queue_rq() function. - -shared_tags=[0/1]: Default: 0 - 0: Tag set is not shared. - 1: Tag set shared between devices for blk-mq. Only makes sense with - nr_devices > 1, otherwise there's no tag set to share. - -zoned=[0/1]: Default: 0 - 0: Block device is exposed as a random-access block device. - 1: Block device is exposed as a host-managed zoned block device. Requires - CONFIG_BLK_DEV_ZONED. - -zone_size=[MB]: Default: 256 - Per zone size when exposed as a zoned block device. Must be a power of two. - -zone_nr_conv=[nr_conv]: Default: 0 - The number of conventional zones to create when block device is zoned. If - zone_nr_conv >= nr_zones, it will be reduced to nr_zones - 1. diff --git a/Documentation/block/pr.rst b/Documentation/block/pr.rst new file mode 100644 index 000000000000..30ea1c2e39eb --- /dev/null +++ b/Documentation/block/pr.rst @@ -0,0 +1,119 @@ +=============================================== +Block layer support for Persistent Reservations +=============================================== + +The Linux kernel supports a user space interface for simplified +Persistent Reservations which map to block devices that support +these (like SCSI). Persistent Reservations allow restricting +access to block devices to specific initiators in a shared storage +setup. + +This document gives a general overview of the support ioctl commands. 
+For a more detailed reference please refer to the SCSI Primary
+Commands standard, specifically the section on Reservations and the
+"PERSISTENT RESERVE IN" and "PERSISTENT RESERVE OUT" commands.
+
+All implementations are expected to ensure the reservations survive
+a power loss and cover all connections in a multi path environment.
+These behaviors are optional in SPC but will be automatically applied
+by Linux.
+
+
+The following types of reservations are supported:
+--------------------------------------------------
+
+ - PR_WRITE_EXCLUSIVE
+
+	Only the initiator that owns the reservation can write to the
+	device. Any initiator can read from the device.
+
+ - PR_EXCLUSIVE_ACCESS
+
+	Only the initiator that owns the reservation can access the
+	device.
+
+ - PR_WRITE_EXCLUSIVE_REG_ONLY
+
+	Only initiators with a registered key can write to the device.
+	Any initiator can read from the device.
+
+ - PR_EXCLUSIVE_ACCESS_REG_ONLY
+
+	Only initiators with a registered key can access the device.
+
+ - PR_WRITE_EXCLUSIVE_ALL_REGS
+
+	Only initiators with a registered key can write to the device.
+	Any initiator can read from the device.
+	All initiators with a registered key are considered reservation
+	holders.
+	Please reference the SPC spec on the meaning of a reservation
+	holder if you want to use this type.
+
+ - PR_EXCLUSIVE_ACCESS_ALL_REGS
+
+	Only initiators with a registered key can access the device.
+	All initiators with a registered key are considered reservation
+	holders.
+	Please reference the SPC spec on the meaning of a reservation
+	holder if you want to use this type.
+
+
+The following ioctls are supported:
+-----------------------------------
+
+1. IOC_PR_REGISTER
+^^^^^^^^^^^^^^^^^^
+
+This ioctl command registers a new reservation if the new_key argument
+is non-null. If no reservation exists, old_key must be zero; if an
+existing reservation should be replaced, old_key must contain the old
+reservation key.
+
+If the new_key argument is 0 it unregisters the existing reservation passed
+in old_key.
+
+
+2. IOC_PR_RESERVE
+^^^^^^^^^^^^^^^^^
+
+This ioctl command reserves the device and thus restricts access for other
+devices based on the type argument. The key argument must be the existing
+reservation key for the device as acquired by the IOC_PR_REGISTER,
+IOC_PR_REGISTER_IGNORE, IOC_PR_PREEMPT or IOC_PR_PREEMPT_ABORT commands.
+
+
+3. IOC_PR_RELEASE
+^^^^^^^^^^^^^^^^^
+
+This ioctl command releases the reservation specified by key and flags
+and thus removes any access restriction implied by it.
+
+
+4. IOC_PR_PREEMPT
+^^^^^^^^^^^^^^^^^
+
+This ioctl command releases the existing reservation referred to by
+old_key and replaces it with a new reservation of type for the
+reservation key new_key.
+
+
+5. IOC_PR_PREEMPT_ABORT
+^^^^^^^^^^^^^^^^^^^^^^^
+
+This ioctl command works like IOC_PR_PREEMPT except that it also aborts
+any outstanding command sent over a connection identified by old_key.
+
+6. IOC_PR_CLEAR
+^^^^^^^^^^^^^^^
+
+This ioctl command unregisters both the key passed in and any other
+reservation key registered with the device and drops any existing
+reservation.
+
+
+Flags
+-----
+
+All the ioctls have a flag field. Currently only one flag is supported:
+
+ - PR_FL_IGNORE_KEY
+
+	Ignore the existing reservation key. This is commonly supported for
+	IOC_PR_REGISTER, and some implementations may support the flag for
+	IOC_PR_RESERVE.
+
+For all unknown flags the kernel will return -EOPNOTSUPP.
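+
+
+Example
+-------
+
+A minimal user space sketch (illustration only, error handling omitted;
+/dev/sdX is a placeholder) that registers a key and then takes a
+write-exclusive reservation could look like::
+
+	#include <fcntl.h>
+	#include <sys/ioctl.h>
+	#include <linux/pr.h>
+
+	int main(void)
+	{
+		int fd = open("/dev/sdX", O_RDWR);
+		struct pr_registration reg = {
+			.old_key = 0,		/* no previous registration */
+			.new_key = 0x123abc,	/* our reservation key */
+		};
+		struct pr_reservation rsv = {
+			.key  = 0x123abc,
+			.type = PR_WRITE_EXCLUSIVE,
+		};
+
+		ioctl(fd, IOC_PR_REGISTER, &reg);
+		ioctl(fd, IOC_PR_RESERVE, &rsv);
+		return 0;
+	}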
diff --git a/Documentation/block/pr.txt b/Documentation/block/pr.txt deleted file mode 100644 index ac9b8e70e64b..000000000000 --- a/Documentation/block/pr.txt +++ /dev/null @@ -1,119 +0,0 @@ - -Block layer support for Persistent Reservations -=============================================== - -The Linux kernel supports a user space interface for simplified -Persistent Reservations which map to block devices that support -these (like SCSI). Persistent Reservations allow restricting -access to block devices to specific initiators in a shared storage -setup. - -This document gives a general overview of the support ioctl commands. -For a more detailed reference please refer the the SCSI Primary -Commands standard, specifically the section on Reservations and the -"PERSISTENT RESERVE IN" and "PERSISTENT RESERVE OUT" commands. - -All implementations are expected to ensure the reservations survive -a power loss and cover all connections in a multi path environment. -These behaviors are optional in SPC but will be automatically applied -by Linux. - - -The following types of reservations are supported: --------------------------------------------------- - - - PR_WRITE_EXCLUSIVE - - Only the initiator that owns the reservation can write to the - device. Any initiator can read from the device. - - - PR_EXCLUSIVE_ACCESS - - Only the initiator that owns the reservation can access the - device. - - - PR_WRITE_EXCLUSIVE_REG_ONLY - - Only initiators with a registered key can write to the device, - Any initiator can read from the device. - - - PR_EXCLUSIVE_ACCESS_REG_ONLY - - Only initiators with a registered key can access the device. - - - PR_WRITE_EXCLUSIVE_ALL_REGS - - Only initiators with a registered key can write to the device, - Any initiator can read from the device. - All initiators with a registered key are considered reservation - holders. - Please reference the SPC spec on the meaning of a reservation - holder if you want to use this type. - - - PR_EXCLUSIVE_ACCESS_ALL_REGS - - Only initiators with a registered key can access the device. - All initiators with a registered key are considered reservation - holders. - Please reference the SPC spec on the meaning of a reservation - holder if you want to use this type. - - -The following ioctl are supported: ----------------------------------- - -1. IOC_PR_REGISTER - -This ioctl command registers a new reservation if the new_key argument -is non-null. If no existing reservation exists old_key must be zero, -if an existing reservation should be replaced old_key must contain -the old reservation key. - -If the new_key argument is 0 it unregisters the existing reservation passed -in old_key. - - -2. IOC_PR_RESERVE - -This ioctl command reserves the device and thus restricts access for other -devices based on the type argument. The key argument must be the existing -reservation key for the device as acquired by the IOC_PR_REGISTER, -IOC_PR_REGISTER_IGNORE, IOC_PR_PREEMPT or IOC_PR_PREEMPT_ABORT commands. - - -3. IOC_PR_RELEASE - -This ioctl command releases the reservation specified by key and flags -and thus removes any access restriction implied by it. - - -4. IOC_PR_PREEMPT - -This ioctl command releases the existing reservation referred to by -old_key and replaces it with a new reservation of type for the -reservation key new_key. - - -5. IOC_PR_PREEMPT_ABORT - -This ioctl command works like IOC_PR_PREEMPT except that it also aborts -any outstanding command sent over a connection identified by old_key. - -6. 
IOC_PR_CLEAR - -This ioctl command unregisters both key and any other reservation key -registered with the device and drops any existing reservation. - - -Flags ------ - -All the ioctls have a flag field. Currently only one flag is supported: - - - PR_FL_IGNORE_KEY - - Ignore the existing reservation key. This is commonly supported for - IOC_PR_REGISTER, and some implementation may support the flag for - IOC_PR_RESERVE. - -For all unknown flags the kernel will return -EOPNOTSUPP. diff --git a/Documentation/block/queue-sysfs.rst b/Documentation/block/queue-sysfs.rst new file mode 100644 index 000000000000..6a8513af9201 --- /dev/null +++ b/Documentation/block/queue-sysfs.rst @@ -0,0 +1,254 @@ +================= +Queue sysfs files +================= + +This text file will detail the queue files that are located in the sysfs tree +for each block device. Note that stacked devices typically do not export +any settings, since their queue merely functions as a remapping target. +These files are the ones found in the /sys/block/xxx/queue/ directory. + +Files denoted with a RO postfix are readonly and the RW postfix means +read-write. + +add_random (RW) +--------------- +This file allows one to turn off the disk entropy contribution. Default +value of this file is '1' (on). + +chunk_sectors (RO) +------------------ +This has a different meaning depending on the type of the block device. +For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors +of the RAID volume stripe segment. For a zoned block device, either host-aware +or host-managed, chunk_sectors indicates the size in 512B sectors of the zones +of the device, with the possible exception of the last zone of the device, which +may be smaller. + +dax (RO) +-------- +This file indicates whether the device supports Direct Access (DAX), +used by CPU-addressable storage to bypass the pagecache. It shows '1' +if true, '0' if not. + +discard_granularity (RO) +------------------------ +This shows the size of internal allocation of the device in bytes, if +reported by the device. A value of '0' means the device does not support +the discard functionality. + +discard_max_hw_bytes (RO) +------------------------- +Devices that support discard functionality may have internal limits on +the number of bytes that can be trimmed or unmapped in a single operation. +The discard_max_hw_bytes parameter is set by the device driver to the maximum +number of bytes that can be discarded in a single operation. Discard +requests issued to the device must not exceed this limit. A discard_max_hw_bytes +value of 0 means that the device does not support discard functionality. + +discard_max_bytes (RW) +---------------------- +While discard_max_hw_bytes is the hardware limit for the device, this +setting is the software limit. Some devices exhibit large latencies when +large discards are issued; setting this value lower will make Linux issue +smaller discards and potentially help reduce latencies induced by large +discard operations. + +discard_zeroes_data (RO) +------------------------ +Obsolete. Always zero. + +fua (RO) +-------- +Whether or not the block driver supports the FUA flag for write requests. +FUA stands for Force Unit Access. If the FUA flag is set, write requests +must bypass the volatile cache of the storage device. + +hw_sector_size (RO) +------------------- +This is the hardware sector size of the device, in bytes. + +io_poll (RW) +------------ +When read, this file shows whether polling is enabled (1) or disabled +(0).
Writing '0' to this file will disable polling for this device. +Writing any non-zero value will enable this feature. + +io_poll_delay (RW) +------------------ +If polling is enabled, this controls what kind of polling will be +performed. It defaults to -1, which is classic polling. In this mode, +the CPU will repeatedly ask for completions without giving up any time. +If set to 0, a hybrid polling mode is used, where the kernel will attempt +to make an educated guess at when the IO will complete. Based on this +guess, the kernel will put the process issuing IO to sleep for an amount +of time, before entering a classic poll loop. This mode might be a +little slower than pure classic polling, but it will be more efficient. +If set to a value larger than 0, the kernel will put the process issuing +IO to sleep for this number of microseconds before entering classic +polling. + +io_timeout (RW) +--------------- +io_timeout is the request timeout in milliseconds. If a request does not +complete in this time, then the block driver timeout handler is invoked. +That timeout handler can decide to retry the request, to fail it or to start +a device recovery strategy. + +iostats (RW) +------------- +This file is used to control (on/off) the iostats accounting of the +disk. + +logical_block_size (RO) +----------------------- +This is the logical block size of the device, in bytes. + +max_discard_segments (RO) +------------------------- +The maximum number of DMA scatter/gather entries in a discard request. + +max_hw_sectors_kb (RO) +---------------------- +This is the maximum number of kilobytes supported in a single data transfer. + +max_integrity_segments (RO) +--------------------------- +Maximum number of elements in a DMA scatter/gather list with integrity +data that will be submitted by the block layer core to the associated +block driver. + +max_sectors_kb (RW) +------------------- +This is the maximum number of kilobytes that the block layer will allow +for a filesystem request. Must be smaller than or equal to the maximum +size allowed by the hardware. + +max_segments (RO) +----------------- +Maximum number of elements in a DMA scatter/gather list that is submitted +to the associated block driver. + +max_segment_size (RO) +--------------------- +Maximum size in bytes of a single element in a DMA scatter/gather list. + +minimum_io_size (RO) +-------------------- +This is the smallest preferred IO size reported by the device. + +nomerges (RW) +------------- +This enables the user to disable the lookup logic involved with IO +request merging in the block layer. By default (0) all merges are +enabled. When set to 1 only simple one-hit merges will be tried. When +set to 2 no merge algorithms will be tried (including one-hit or more +complex tree/hash lookups). + +nr_requests (RW) +---------------- +This controls how many requests may be allocated in the block layer for +read or write requests. Note that the total allocated number may be twice +this amount, since it applies only to reads or writes (not the accumulated +sum). + +To avoid priority inversion through request starvation, a request +queue maintains a separate request pool for each cgroup when +CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such +per-block-cgroup request pool. IOW, if there are N block cgroups, +each request queue may have up to N request pools, each independently +regulated by nr_requests.
+ +nr_zones (RO) +------------- +For zoned block devices (zoned attribute indicating "host-managed" or +"host-aware"), this indicates the total number of zones of the device. +This is always 0 for regular block devices. + +optimal_io_size (RO) +-------------------- +This is the optimal IO size reported by the device. + +physical_block_size (RO) +------------------------ +This is the physical block size of the device, in bytes. + +read_ahead_kb (RW) +------------------ +Maximum number of kilobytes to read-ahead for filesystems on this block +device. + +rotational (RW) +--------------- +This file is used to state if the device is of rotational type or +non-rotational type. + +rq_affinity (RW) +---------------- +If this option is '1', the block layer will migrate request completions to the +cpu "group" that originally submitted the request. For some workloads this +provides a significant reduction in CPU cycles due to caching effects. + +For storage configurations that need to maximize distribution of completion +processing, setting this option to '2' forces the completion to run on the +requesting cpu (bypassing the "group" aggregation logic). + +scheduler (RW) +-------------- +When read, this file will display the current and available IO schedulers +for this block device. The currently active IO scheduler will be enclosed +in [] brackets. Writing an IO scheduler name to this file will switch +control of this block device to that new IO scheduler. Note that writing +an IO scheduler name to this file will attempt to load that IO scheduler +module, if it isn't already present in the system. + +write_cache (RW) +---------------- +When read, this file will display whether the device has write back +caching enabled or not. It will return "write back" for the former +case, and "write through" for the latter. Writing to this file can +change the kernel's view of the device, but it doesn't alter the +device state. This means that it might not be safe to toggle the +setting from "write back" to "write through", since that will also +eliminate cache flushes issued by the kernel. + +write_same_max_bytes (RO) +------------------------- +This is the number of bytes the device can write in a single write-same +command. A value of '0' means write-same is not supported by this +device. + +wbt_lat_usec (RW) +----------------- +If the device is registered for writeback throttling, then this file shows +the target minimum read latency. If this latency is exceeded in a given +window of time (see wb_window_usec), then the writeback throttling will start +scaling back writes. Writing a value of '0' to this file disables the +feature. Writing a value of '-1' to this file resets the value to the +default setting. + +throttle_sample_time (RW) +------------------------- +This is the time window that blk-throttle samples data, in milliseconds. +blk-throttle makes decisions based on the samples. A lower time means cgroups +have smoother throughput, but higher CPU overhead. This exists only when +CONFIG_BLK_DEV_THROTTLING_LOW is enabled. + +write_zeroes_max_bytes (RO) +--------------------------- +For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of +bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES +is not supported. + +zoned (RO) +---------- +This indicates if the device is a zoned block device and the zone model of the +device if it is indeed zoned.
The possible values indicated by zoned are +"none" for regular block devices and "host-aware" or "host-managed" for zoned +block devices. The characteristics of host-aware and host-managed zoned block +devices are described in the ZBC (Zoned Block Commands) and ZAC +(Zoned Device ATA Command Set) standards. These standards also define the +"drive-managed" zone model. However, since drive-managed zoned block devices +do not support zone commands, they will be treated as regular block devices +and zoned will report "none". + +Jens Axboe , February 2009 diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt deleted file mode 100644 index b40b5b7cebd9..000000000000 --- a/Documentation/block/queue-sysfs.txt +++ /dev/null @@ -1,253 +0,0 @@ -Queue sysfs files -================= - -This text file will detail the queue files that are located in the sysfs tree -for each block device. Note that stacked devices typically do not export -any settings, since their queue merely functions are a remapping target. -These files are the ones found in the /sys/block/xxx/queue/ directory. - -Files denoted with a RO postfix are readonly and the RW postfix means -read-write. - -add_random (RW) ----------------- -This file allows to turn off the disk entropy contribution. Default -value of this file is '1'(on). - -chunk_sectors (RO) ------------------- -This has different meaning depending on the type of the block device. -For a RAID device (dm-raid), chunk_sectors indicates the size in 512B sectors -of the RAID volume stripe segment. For a zoned block device, either host-aware -or host-managed, chunk_sectors indicates the size in 512B sectors of the zones -of the device, with the eventual exception of the last zone of the device which -may be smaller. - -dax (RO) --------- -This file indicates whether the device supports Direct Access (DAX), -used by CPU-addressable storage to bypass the pagecache. It shows '1' -if true, '0' if not. - -discard_granularity (RO) ------------------------ -This shows the size of internal allocation of the device in bytes, if -reported by the device. A value of '0' means device does not support -the discard functionality. - -discard_max_hw_bytes (RO) ----------------------- -Devices that support discard functionality may have internal limits on -the number of bytes that can be trimmed or unmapped in a single operation. -The discard_max_bytes parameter is set by the device driver to the maximum -number of bytes that can be discarded in a single operation. Discard -requests issued to the device must not exceed this limit. A discard_max_bytes -value of 0 means that the device does not support discard functionality. - -discard_max_bytes (RW) ----------------------- -While discard_max_hw_bytes is the hardware limit for the device, this -setting is the software limit. Some devices exhibit large latencies when -large discards are issued, setting this value lower will make Linux issue -smaller discards and potentially help reduce latencies induced by large -discard operations. - -discard_zeroes_data (RO) ------------------------- -Obsolete. Always zero. - -fua (RO) --------- -Whether or not the block driver supports the FUA flag for write requests. -FUA stands for Force Unit Access. If the FUA flag is set that means that -write requests must bypass the volatile cache of the storage device. - -hw_sector_size (RO) -------------------- -This is the hardware sector size of the device, in bytes. 
- -io_poll (RW) ------------- -When read, this file shows whether polling is enabled (1) or disabled -(0). Writing '0' to this file will disable polling for this device. -Writing any non-zero value will enable this feature. - -io_poll_delay (RW) ------------------- -If polling is enabled, this controls what kind of polling will be -performed. It defaults to -1, which is classic polling. In this mode, -the CPU will repeatedly ask for completions without giving up any time. -If set to 0, a hybrid polling mode is used, where the kernel will attempt -to make an educated guess at when the IO will complete. Based on this -guess, the kernel will put the process issuing IO to sleep for an amount -of time, before entering a classic poll loop. This mode might be a -little slower than pure classic polling, but it will be more efficient. -If set to a value larger than 0, the kernel will put the process issuing -IO to sleep for this amount of microseconds before entering classic -polling. - -io_timeout (RW) ---------------- -io_timeout is the request timeout in milliseconds. If a request does not -complete in this time then the block driver timeout handler is invoked. -That timeout handler can decide to retry the request, to fail it or to start -a device recovery strategy. - -iostats (RW) -------------- -This file is used to control (on/off) the iostats accounting of the -disk. - -logical_block_size (RO) ------------------------ -This is the logical block size of the device, in bytes. - -max_discard_segments (RO) -------------------------- -The maximum number of DMA scatter/gather entries in a discard request. - -max_hw_sectors_kb (RO) ----------------------- -This is the maximum number of kilobytes supported in a single data transfer. - -max_integrity_segments (RO) ---------------------------- -Maximum number of elements in a DMA scatter/gather list with integrity -data that will be submitted by the block layer core to the associated -block driver. - -max_sectors_kb (RW) -------------------- -This is the maximum number of kilobytes that the block layer will allow -for a filesystem request. Must be smaller than or equal to the maximum -size allowed by the hardware. - -max_segments (RO) ------------------ -Maximum number of elements in a DMA scatter/gather list that is submitted -to the associated block driver. - -max_segment_size (RO) ---------------------- -Maximum size in bytes of a single element in a DMA scatter/gather list. - -minimum_io_size (RO) --------------------- -This is the smallest preferred IO size reported by the device. - -nomerges (RW) -------------- -This enables the user to disable the lookup logic involved with IO -merging requests in the block layer. By default (0) all merges are -enabled. When set to 1 only simple one-hit merges will be tried. When -set to 2 no merge algorithms will be tried (including one-hit or more -complex tree/hash lookups). - -nr_requests (RW) ----------------- -This controls how many requests may be allocated in the block layer for -read or write requests. Note that the total allocated number may be twice -this amount, since it applies only to reads or writes (not the accumulated -sum). - -To avoid priority inversion through request starvation, a request -queue maintains a separate request pool per each cgroup when -CONFIG_BLK_CGROUP is enabled, and this parameter applies to each such -per-block-cgroup request pool. IOW, if there are N block cgroups, -each request queue may have up to N request pools, each independently -regulated by nr_requests. 
- -nr_zones (RO) -------------- -For zoned block devices (zoned attribute indicating "host-managed" or -"host-aware"), this indicates the total number of zones of the device. -This is always 0 for regular block devices. - -optimal_io_size (RO) --------------------- -This is the optimal IO size reported by the device. - -physical_block_size (RO) ------------------------- -This is the physical block size of device, in bytes. - -read_ahead_kb (RW) ------------------- -Maximum number of kilobytes to read-ahead for filesystems on this block -device. - -rotational (RW) ---------------- -This file is used to stat if the device is of rotational type or -non-rotational type. - -rq_affinity (RW) ----------------- -If this option is '1', the block layer will migrate request completions to the -cpu "group" that originally submitted the request. For some workloads this -provides a significant reduction in CPU cycles due to caching effects. - -For storage configurations that need to maximize distribution of completion -processing setting this option to '2' forces the completion to run on the -requesting cpu (bypassing the "group" aggregation logic). - -scheduler (RW) --------------- -When read, this file will display the current and available IO schedulers -for this block device. The currently active IO scheduler will be enclosed -in [] brackets. Writing an IO scheduler name to this file will switch -control of this block device to that new IO scheduler. Note that writing -an IO scheduler name to this file will attempt to load that IO scheduler -module, if it isn't already present in the system. - -write_cache (RW) ----------------- -When read, this file will display whether the device has write back -caching enabled or not. It will return "write back" for the former -case, and "write through" for the latter. Writing to this file can -change the kernels view of the device, but it doesn't alter the -device state. This means that it might not be safe to toggle the -setting from "write back" to "write through", since that will also -eliminate cache flushes issued by the kernel. - -write_same_max_bytes (RO) -------------------------- -This is the number of bytes the device can write in a single write-same -command. A value of '0' means write-same is not supported by this -device. - -wbt_lat_usec (RW) ------------------ -If the device is registered for writeback throttling, then this file shows -the target minimum read latency. If this latency is exceeded in a given -window of time (see wb_window_usec), then the writeback throttling will start -scaling back writes. Writing a value of '0' to this file disables the -feature. Writing a value of '-1' to this file resets the value to the -default setting. - -throttle_sample_time (RW) -------------------------- -This is the time window that blk-throttle samples data, in millisecond. -blk-throttle makes decision based on the samplings. Lower time means cgroups -have more smooth throughput, but higher CPU overhead. This exists only when -CONFIG_BLK_DEV_THROTTLING_LOW is enabled. - -write_zeroes_max_bytes (RO) ---------------------------- -For block drivers that support REQ_OP_WRITE_ZEROES, the maximum number of -bytes that can be zeroed at once. The value 0 means that REQ_OP_WRITE_ZEROES -is not supported. - -zoned (RO) ----------- -This indicates if the device is a zoned block device and the zone model of the -device if it is indeed zoned. 
The possible values indicated by zoned are -"none" for regular block devices and "host-aware" or "host-managed" for zoned -block devices. The characteristics of host-aware and host-managed zoned block -devices are described in the ZBC (Zoned Block Commands) and ZAC -(Zoned Device ATA Command Set) standards. These standards also define the -"drive-managed" zone model. However, since drive-managed zoned block devices -do not support zone commands, they will be treated as regular block devices -and zoned will report "none". - -Jens Axboe , February 2009 diff --git a/Documentation/block/request.rst b/Documentation/block/request.rst new file mode 100644 index 000000000000..747021e1ffdb --- /dev/null +++ b/Documentation/block/request.rst @@ -0,0 +1,99 @@ +============================ +struct request documentation +============================ + +Jens Axboe 27/05/02 + + +.. FIXME: + No idea about what this means - seems to be just some noise, so comment it + + 1.0 + Index + + 2.0 Struct request members classification + + 2.1 struct request members explanation + + 3.0 + + + 2.0 + + + +Short explanation of request members +==================================== + +Classification flags: + + = ==================== + D driver member + B block layer member + I I/O scheduler member + = ==================== + +Unless an entry contains a D classification, a device driver must not access +this member. Some members may contain D classifications, but should only be +accessed through certain macros or functions (e.g. ->flags). + + + +=============================== ======= ======================================= +Member Flag Comment +=============================== ======= ======================================= +struct list_head queuelist BI Organization on various internal + queues + +``void *elevator_private`` I I/O scheduler private data + +unsigned char cmd[16] D Driver can use this for setting up + a cdb before execution, see + blk_queue_prep_rq + +unsigned long flags DBI Contains info about data direction, + request type, etc.
+ +int rq_status D Request status bits + +kdev_t rq_dev DBI Target device + +int errors DB Error counts + +sector_t sector DBI Target location + +unsigned long hard_nr_sectors B Used to keep sector sane + +unsigned long nr_sectors DBI Total number of sectors in request + +unsigned long hard_nr_sectors B Used to keep nr_sectors sane + +unsigned short nr_phys_segments DB Number of physical scatter gather + segments in a request + +unsigned short nr_hw_segments DB Number of hardware scatter gather + segments in a request + +unsigned int current_nr_sectors DB Number of sectors in first segment + of request + +unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane + +int tag DB TCQ tag, if assigned + +``void *special`` D Free to be used by driver + +``char *buffer`` D Map of first segment, also see + section on bouncing SECTION + +``struct completion *waiting`` D Can be used by driver to get signalled + on request completion + +``struct bio *bio`` DBI First bio in request + +``struct bio *biotail`` DBI Last bio in request + +``struct request_queue *q`` DB Request queue this request belongs to + +``struct request_list *rl`` B Request list this request came from +=============================== ======= ======================================= diff --git a/Documentation/block/request.txt b/Documentation/block/request.txt deleted file mode 100644 index 754e104ed369..000000000000 --- a/Documentation/block/request.txt +++ /dev/null @@ -1,88 +0,0 @@ - -struct request documentation - -Jens Axboe 27/05/02 - -1.0 -Index - -2.0 Struct request members classification - - 2.1 struct request members explanation - -3.0 - - -2.0 -Short explanation of request members - -Classification flags: - - D driver member - B block layer member - I I/O scheduler member - -Unless an entry contains a D classification, a device driver must not access -this member. Some members may contain D classifications, but should only be -access through certain macros or functions (eg ->flags). - - - -2.1 -Member Flag Comment ------- ---- ------- - -struct list_head queuelist BI Organization on various internal - queues - -void *elevator_private I I/O scheduler private data - -unsigned char cmd[16] D Driver can use this for setting up - a cdb before execution, see - blk_queue_prep_rq - -unsigned long flags DBI Contains info about data direction, - request type, etc. 
- -int rq_status D Request status bits - -kdev_t rq_dev DBI Target device - -int errors DB Error counts - -sector_t sector DBI Target location - -unsigned long hard_nr_sectors B Used to keep sector sane - -unsigned long nr_sectors DBI Total number of sectors in request - -unsigned long hard_nr_sectors B Used to keep nr_sectors sane - -unsigned short nr_phys_segments DB Number of physical scatter gather - segments in a request - -unsigned short nr_hw_segments DB Number of hardware scatter gather - segments in a request - -unsigned int current_nr_sectors DB Number of sectors in first segment - of request - -unsigned int hard_cur_sectors B Used to keep current_nr_sectors sane - -int tag DB TCQ tag, if assigned - -void *special D Free to be used by driver - -char *buffer D Map of first segment, also see - section on bouncing SECTION - -struct completion *waiting D Can be used by driver to get signalled - on request completion - -struct bio *bio DBI First bio in request - -struct bio *biotail DBI Last bio in request - -struct request_queue *q DB Request queue this request belongs to - -struct request_list *rl B Request list this request came from diff --git a/Documentation/block/stat.rst b/Documentation/block/stat.rst new file mode 100644 index 000000000000..9c07bc22b0bc --- /dev/null +++ b/Documentation/block/stat.rst @@ -0,0 +1,93 @@ +=============================================== +Block layer statistics in /sys/block/<dev>/stat +=============================================== + +This file documents the contents of the /sys/block/<dev>/stat file. + +The stat file provides several statistics about the state of block +device <dev>. + +Q. + Why are there multiple statistics in a single file? Doesn't sysfs + normally contain a single value per file? + +A. + By having a single file, the kernel can guarantee that the statistics + represent a consistent snapshot of the state of the device. If the + statistics were exported as multiple files containing one statistic + each, it would be impossible to guarantee that a set of readings + represent a single point in time. + +The stat file consists of a single line of text containing 15 decimal +values separated by whitespace. The fields are summarized in the +following table, and described in more detail below.
+ + +=============== ============= ================================================= +Name units description +=============== ============= ================================================= +read I/Os requests number of read I/Os processed +read merges requests number of read I/Os merged with in-queue I/O +read sectors sectors number of sectors read +read ticks milliseconds total wait time for read requests +write I/Os requests number of write I/Os processed +write merges requests number of write I/Os merged with in-queue I/O +write sectors sectors number of sectors written +write ticks milliseconds total wait time for write requests +in_flight requests number of I/Os currently in flight +io_ticks milliseconds total time this block device has been active +time_in_queue milliseconds total wait time for all requests +discard I/Os requests number of discard I/Os processed +discard merges requests number of discard I/Os merged with in-queue I/O +discard sectors sectors number of sectors discarded +discard ticks milliseconds total wait time for discard requests +=============== ============= ================================================= + +read I/Os, write I/Os, discard I/Os +=================================== + +These values increment when an I/O request completes. + +read merges, write merges, discard merges +========================================= + +These values increment when an I/O request is merged with an +already-queued I/O request. + +read sectors, write sectors, discard sectors +============================================ + +These values count the number of sectors read from, written to, or +discarded from this block device. The "sectors" in question are the +standard UNIX 512-byte sectors, not any device- or filesystem-specific +block size. The counters are incremented when the I/O completes. + +read ticks, write ticks, discard ticks +====================================== + +These values count the number of milliseconds that I/O requests have +waited on this block device. If there are multiple I/O requests waiting, +these values will increase at a rate greater than 1000/second; for +example, if 60 read requests wait for an average of 30 ms, the read_ticks +field will increase by 60*30 = 1800. + +in_flight +========= + +This value counts the number of I/O requests that have been issued to +the device driver but have not yet completed. It does not include I/O +requests that are in the queue but not yet issued to the device driver. + +io_ticks +======== + +This value counts the number of milliseconds during which the device has +had I/O requests queued. + +time_in_queue +============= + +This value counts the number of milliseconds that I/O requests have waited +on this block device. If there are multiple I/O requests waiting, this +value will increase as the product of the number of milliseconds times the +number of requests waiting (see "read ticks" above for an example). diff --git a/Documentation/block/stat.txt b/Documentation/block/stat.txt deleted file mode 100644 index 0aace9cc536c..000000000000 --- a/Documentation/block/stat.txt +++ /dev/null @@ -1,86 +0,0 @@ -Block layer statistics in /sys/block/<dev>/stat -=============================================== - -This file documents the contents of the /sys/block/<dev>/stat file. - -The stat file provides several statistics about the state of block -device <dev>. - -Q. Why are there multiple statistics in a single file? Doesn't sysfs - normally contain a single value per file? -A.
By having a single file, the kernel can guarantee that the statistics - represent a consistent snapshot of the state of the device. If the - statistics were exported as multiple files containing one statistic - each, it would be impossible to guarantee that a set of readings - represent a single point in time. - -The stat file consists of a single line of text containing 11 decimal -values separated by whitespace. The fields are summarized in the -following table, and described in more detail below. - -Name units description ----- ----- ----------- -read I/Os requests number of read I/Os processed -read merges requests number of read I/Os merged with in-queue I/O -read sectors sectors number of sectors read -read ticks milliseconds total wait time for read requests -write I/Os requests number of write I/Os processed -write merges requests number of write I/Os merged with in-queue I/O -write sectors sectors number of sectors written -write ticks milliseconds total wait time for write requests -in_flight requests number of I/Os currently in flight -io_ticks milliseconds total time this block device has been active -time_in_queue milliseconds total wait time for all requests -discard I/Os requests number of discard I/Os processed -discard merges requests number of discard I/Os merged with in-queue I/O -discard sectors sectors number of sectors discarded -discard ticks milliseconds total wait time for discard requests - -read I/Os, write I/Os, discard I/0s -=================================== - -These values increment when an I/O request completes. - -read merges, write merges, discard merges -========================================= - -These values increment when an I/O request is merged with an -already-queued I/O request. - -read sectors, write sectors, discard_sectors -============================================ - -These values count the number of sectors read from, written to, or -discarded from this block device. The "sectors" in question are the -standard UNIX 512-byte sectors, not any device- or filesystem-specific -block size. The counters are incremented when the I/O completes. - -read ticks, write ticks, discard ticks -====================================== - -These values count the number of milliseconds that I/O requests have -waited on this block device. If there are multiple I/O requests waiting, -these values will increase at a rate greater than 1000/second; for -example, if 60 read requests wait for an average of 30 ms, the read_ticks -field will increase by 60*30 = 1800. - -in_flight -========= - -This value counts the number of I/O requests that have been issued to -the device driver but have not yet completed. It does not include I/O -requests that are in the queue but not yet issued to the device driver. - -io_ticks -======== - -This value counts the number of milliseconds during which the device has -had I/O requests queued. - -time_in_queue -============= - -This value counts the number of milliseconds that I/O requests have waited -on this block device. If there are multiple I/O requests waiting, this -value will increase as the product of the number of milliseconds times the -number of requests waiting (see "read ticks" above for an example). 
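To make the snapshot guarantee above concrete, here is a hedged user space sketch (not from the patch) that reads the first eleven fields of the stat file in a single read; the device name sda is an illustrative assumption, and kernels with discard support append four more fields::

    #include <stdio.h>

    int main(void)
    {
            unsigned long long v[11];
            FILE *f = fopen("/sys/block/sda/stat", "r");

            if (!f)
                    return 1;
            /* one fscanf over one line = one consistent snapshot */
            if (fscanf(f, "%llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7],
                       &v[8], &v[9], &v[10]) != 11) {
                    fclose(f);
                    return 1;
            }
            fclose(f);
            printf("read I/Os=%llu write I/Os=%llu in_flight=%llu io_ticks(ms)=%llu\n",
                   v[0], v[4], v[8], v[9]);
            return 0;
    }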
diff --git a/Documentation/block/switching-sched.rst b/Documentation/block/switching-sched.rst new file mode 100644 index 000000000000..42042417380e --- /dev/null +++ b/Documentation/block/switching-sched.rst @@ -0,0 +1,39 @@ +=================== +Switching Scheduler +=================== + +To choose IO schedulers at boot time, use the argument 'elevator=deadline'. +'noop' and 'cfq' (the default) are also available. Presently, IO schedulers +can only be assigned globally at boot time. + +Each io queue has a set of io scheduler tunables associated with it. These +tunables control how the io scheduler works. You can find these entries +in:: + + /sys/block/<device>/queue/iosched + +assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted, +you can do so by typing:: + + # mount none /sys -t sysfs + +It is possible to change the IO scheduler for a given block device on +the fly to select one of mq-deadline, none, bfq, or kyber schedulers - +which can improve that device's throughput. + +To set a specific scheduler, simply do this:: + + echo SCHEDNAME > /sys/block/DEV/queue/scheduler + +where SCHEDNAME is the name of a defined IO scheduler, and DEV is the +device name (hda, hdb, sga, or whatever you happen to have). + +The list of defined schedulers can be found by simply doing +a "cat /sys/block/DEV/queue/scheduler" - the list of valid names +will be displayed, with the currently selected scheduler in brackets:: + + # cat /sys/block/sda/queue/scheduler + [mq-deadline] kyber bfq none + # echo none >/sys/block/sda/queue/scheduler + # cat /sys/block/sda/queue/scheduler + [none] mq-deadline kyber bfq diff --git a/Documentation/block/switching-sched.txt b/Documentation/block/switching-sched.txt deleted file mode 100644 index 7977f6fb8b20..000000000000 --- a/Documentation/block/switching-sched.txt +++ /dev/null @@ -1,35 +0,0 @@ -To choose IO schedulers at boot time, use the argument 'elevator=deadline'. -'noop' and 'cfq' (the default) are also available. IO schedulers are assigned -globally at boot time only presently. - -Each io queue has a set of io scheduler tunables associated with it. These -tunables control how the io scheduler works. You can find these entries -in: - -/sys/block/<device>/queue/iosched - -assuming that you have sysfs mounted on /sys. If you don't have sysfs mounted, -you can do so by typing: - -# mount none /sys -t sysfs - -It is possible to change the IO scheduler for a given block device on -the fly to select one of mq-deadline, none, bfq, or kyber schedulers - -which can improve that device's throughput. - -To set a specific scheduler, simply do this: - -echo SCHEDNAME > /sys/block/DEV/queue/scheduler - -where SCHEDNAME is the name of a defined IO scheduler, and DEV is the -device name (hda, hdb, sga, or whatever you happen to have).
- -The list of defined schedulers can be found by simply doing -a "cat /sys/block/DEV/queue/scheduler" - the list of valid names -will be displayed, with the currently selected scheduler in brackets: - -# cat /sys/block/sda/queue/scheduler -[mq-deadline] kyber bfq none -# echo none >/sys/block/sda/queue/scheduler -# cat /sys/block/sda/queue/scheduler -[none] mq-deadline kyber bfq diff --git a/Documentation/block/writeback_cache_control.rst b/Documentation/block/writeback_cache_control.rst new file mode 100644 index 000000000000..2c752c57c14c --- /dev/null +++ b/Documentation/block/writeback_cache_control.rst @@ -0,0 +1,86 @@ +========================================== +Explicit volatile write back cache control +========================================== + +Introduction +------------ + +Many storage devices, especially in the consumer market, come with volatile +write back caches. That means the devices signal I/O completion to the +operating system before data has actually hit the non-volatile storage. This +behavior obviously speeds up various workloads, but it means the operating +system needs to force data out to the non-volatile storage when it performs +a data integrity operation like fsync, sync or an unmount. + +The Linux block layer provides two simple mechanisms that let filesystems +control the caching behavior of the storage device. These mechanisms are +a forced cache flush, and the Force Unit Access (FUA) flag for requests. + + +Explicit cache flushes +---------------------- + +The REQ_PREFLUSH flag can be ORed into the r/w flags of a bio submitted from +the filesystem and will make sure the volatile cache of the storage device +has been flushed before the actual I/O operation is started. This explicitly +guarantees that previously completed write requests are on non-volatile +storage before the flagged bio starts. In addition, the REQ_PREFLUSH flag can be +set on an otherwise empty bio structure, which causes only an explicit cache +flush without any dependent I/O. It is recommended to use +the blkdev_issue_flush() helper for a pure cache flush. + + +Forced Unit Access +------------------ + +The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the +filesystem and will make sure that I/O completion for this request is only +signaled after the data has been committed to non-volatile storage. + + +Implementation details for filesystems +-------------------------------------- + +Filesystems can simply set the REQ_PREFLUSH and REQ_FUA bits and do not have to +worry if the underlying devices need any explicit cache flushing and how +the Forced Unit Access is implemented. The REQ_PREFLUSH and REQ_FUA flags +may both be set on a single bio. + + +Implementation details for make_request_fn based block drivers +-------------------------------------------------------------- + +These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit +directly below the submit_bio interface. For remapping drivers the REQ_FUA +bits need to be propagated to underlying devices, and a global flush needs +to be implemented for bios with the REQ_PREFLUSH bit set. For real device +drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits +on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without +data can be completed successfully without doing any work. Drivers for +devices with volatile caches need to implement the support for these +flags themselves without any help from the block layer.
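For orientation, a hedged kernel-side sketch (not code from the patch) of how a filesystem or remapping driver might submit a write carrying both flags, using the bio API of this era; bdev, page and sector are assumed to come from the caller's context::

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    /* Write one page so that the volatile cache is flushed first and the
     * write itself bypasses it (REQ_PREFLUSH + REQ_FUA), as described above. */
    static int write_commit_block(struct block_device *bdev,
                                  struct page *page, sector_t sector)
    {
            struct bio *bio = bio_alloc(GFP_NOIO, 1);
            int ret;

            if (!bio)
                    return -ENOMEM;
            bio_set_dev(bio, bdev);
            bio->bi_iter.bi_sector = sector;
            bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA;
            bio_add_page(bio, page, PAGE_SIZE, 0);

            ret = submit_bio_wait(bio);   /* synchronous for simplicity */
            bio_put(bio);
            return ret;
    }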
+ + +Implementation details for request_fn based block drivers +--------------------------------------------------------- + +For devices that do not support volatile write caches no driver +support is required; the block layer completes empty REQ_PREFLUSH requests before +entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from +requests that have a payload. For devices with volatile write caches the +driver needs to tell the block layer that it supports flushing caches by +doing:: + + blk_queue_write_cache(sdkp->disk->queue, true, false); + +and handle empty REQ_OP_FLUSH requests in its prep_fn/request_fn. Note that +REQ_PREFLUSH requests with a payload are automatically turned into a sequence +of an empty REQ_OP_FLUSH request followed by the actual write by the block +layer. For devices that also support the FUA bit the block layer needs +to be told to pass through the REQ_FUA bit using:: + + blk_queue_write_cache(sdkp->disk->queue, true, true); + +and the driver must handle write requests that have the REQ_FUA bit set +in prep_fn/request_fn. If the FUA bit is not natively supported the block +layer turns it into an empty REQ_OP_FLUSH request after the actual write. diff --git a/Documentation/block/writeback_cache_control.txt b/Documentation/block/writeback_cache_control.txt deleted file mode 100644 index 8a6bdada5f6b..000000000000 --- a/Documentation/block/writeback_cache_control.txt +++ /dev/null @@ -1,86 +0,0 @@ - -Explicit volatile write back cache control -===================================== - -Introduction ------------- - -Many storage devices, especially in the consumer market, come with volatile -write back caches. That means the devices signal I/O completion to the -operating system before data actually has hit the non-volatile storage. This -behavior obviously speeds up various workloads, but it means the operating -system needs to force data out to the non-volatile storage when it performs -a data integrity operation like fsync, sync or an unmount. - -The Linux block layer provides two simple mechanisms that let filesystems -control the caching behavior of the storage device. These mechanisms are -a forced cache flush, and the Force Unit Access (FUA) flag for requests. - - -Explicit cache flushes ----------------------- - -The REQ_PREFLUSH flag can be OR ed into the r/w flags of a bio submitted from -the filesystem and will make sure the volatile cache of the storage device -has been flushed before the actual I/O operation is started. This explicitly -guarantees that previously completed write requests are on non-volatile -storage before the flagged bio starts. In addition the REQ_PREFLUSH flag can be -set on an otherwise empty bio structure, which causes only an explicit cache -flush without any dependent I/O. It is recommend to use -the blkdev_issue_flush() helper for a pure cache flush. - - -Forced Unit Access ------------------ - -The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the -filesystem and will make sure that I/O completion for this request is only -signaled after the data has been committed to non-volatile storage. - - -Implementation details for filesystems --------------------------------------- - -Filesystems can simply set the REQ_PREFLUSH and REQ_FUA bits and do not have to -worry if the underlying devices need any explicit cache flushing and how -the Forced Unit Access is implemented. The REQ_PREFLUSH and REQ_FUA flags -may both be set on a single bio.
- - -Implementation details for make_request_fn based block drivers --------------------------------------------------------------- - -These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit -directly below the submit_bio interface. For remapping drivers the REQ_FUA -bits need to be propagated to underlying devices, and a global flush needs -to be implemented for bios with the REQ_PREFLUSH bit set. For real device -drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits -on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without -data can be completed successfully without doing any work. Drivers for -devices with volatile caches need to implement the support for these -flags themselves without any help from the block layer. - - -Implementation details for request_fn based block drivers --------------------------------------------------------------- - -For devices that do not support volatile write caches there is no driver -support required, the block layer completes empty REQ_PREFLUSH requests before -entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from -requests that have a payload. For devices with volatile write caches the -driver needs to tell the block layer that it supports flushing caches by -doing: - - blk_queue_write_cache(sdkp->disk->queue, true, false); - -and handle empty REQ_OP_FLUSH requests in its prep_fn/request_fn. Note that -REQ_PREFLUSH requests with a payload are automatically turned into a sequence -of an empty REQ_OP_FLUSH request followed by the actual write by the block -layer. For devices that also support the FUA bit the block layer needs -to be told to pass through the REQ_FUA bit using: - - blk_queue_write_cache(sdkp->disk->queue, true, true); - -and the driver must handle write requests that have the REQ_FUA bit set -in prep_fn/request_fn. If the FUA bit is not natively supported the block -layer turns it into an empty REQ_OP_FLUSH request after the actual write. -- cgit v1.2.3-55-g7522 From df1b7ce784c220373d202ea9f8bc0c424f2c9f7c Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 18 Jun 2019 17:16:23 -0300 Subject: docs: add some documentation dirs to the driver-api book Those are subsystem docs, with a mix of kABI and user-facing docs. While they're not split, keep the dirs where they are, adding just a pointer to the main index. Signed-off-by: Mauro Carvalho Chehab --- Documentation/accounting/index.rst | 2 +- Documentation/block/index.rst | 2 +- Documentation/hid/index.rst | 2 +- Documentation/iio/index.rst | 2 +- Documentation/index.rst | 4 ++++ 5 files changed, 8 insertions(+), 4 deletions(-) (limited to 'Documentation/block') diff --git a/Documentation/accounting/index.rst b/Documentation/accounting/index.rst index e1f6284b5ff3..9369d8bf32be 100644 --- a/Documentation/accounting/index.rst +++ b/Documentation/accounting/index.rst @@ -1,4 +1,4 @@ -:orphan: +.. SPDX-License-Identifier: GPL-2.0 ========== Accounting diff --git a/Documentation/block/index.rst b/Documentation/block/index.rst index 8cd226a0e86e..3fa7a52fafa4 100644 --- a/Documentation/block/index.rst +++ b/Documentation/block/index.rst @@ -1,4 +1,4 @@ -:orphan: +.. SPDX-License-Identifier: GPL-2.0 ===== Block diff --git a/Documentation/hid/index.rst b/Documentation/hid/index.rst index af4324902622..737d66dc16a1 100644 --- a/Documentation/hid/index.rst +++ b/Documentation/hid/index.rst @@ -1,4 +1,4 @@ -:orphan: +..
SPDX-License-Identifier: GPL-2.0 ============================= Human Interface Devices (HID) diff --git a/Documentation/iio/index.rst b/Documentation/iio/index.rst index 0593dca89a94..58b7a4ebac51 100644 --- a/Documentation/iio/index.rst +++ b/Documentation/iio/index.rst @@ -1,4 +1,4 @@ -:orphan: +.. SPDX-License-Identifier: GPL-2.0 ============== Industrial I/O diff --git a/Documentation/index.rst b/Documentation/index.rst index a322c8721d13..dcdaaff71633 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -92,6 +92,10 @@ needed). driver-api/index core-api/index + accounting/index + block/index + hid/index + iio/index leds/index media/index networking/index -- cgit v1.2.3-55-g7522 From da82c92f1150f66afabf78d2c85ef9ac18dc6d38 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Thu, 27 Jun 2019 13:08:35 -0300 Subject: docs: cgroup-v1: add it to the admin-guide book Those files belong to the admin guide, so add them. Signed-off-by: Mauro Carvalho Chehab --- .../admin-guide/cgroup-v1/blkio-controller.rst | 302 ++++++ Documentation/admin-guide/cgroup-v1/cgroups.rst | 695 ++++++++++++++ Documentation/admin-guide/cgroup-v1/cpuacct.rst | 50 + Documentation/admin-guide/cgroup-v1/cpusets.rst | 866 +++++++++++++++++ Documentation/admin-guide/cgroup-v1/devices.rst | 132 +++ .../admin-guide/cgroup-v1/freezer-subsystem.rst | 127 +++ Documentation/admin-guide/cgroup-v1/hugetlb.rst | 50 + Documentation/admin-guide/cgroup-v1/index.rst | 28 + Documentation/admin-guide/cgroup-v1/memcg_test.rst | 355 +++++++ Documentation/admin-guide/cgroup-v1/memory.rst | 1003 ++++++++++++++++++++ Documentation/admin-guide/cgroup-v1/net_cls.rst | 44 + Documentation/admin-guide/cgroup-v1/net_prio.rst | 57 ++ Documentation/admin-guide/cgroup-v1/pids.rst | 92 ++ Documentation/admin-guide/cgroup-v1/rdma.rst | 117 +++ Documentation/admin-guide/cgroup-v2.rst | 2 +- Documentation/admin-guide/index.rst | 1 + Documentation/admin-guide/kernel-parameters.txt | 4 +- .../admin-guide/mm/numa_memory_policy.rst | 2 +- Documentation/block/bfq-iosched.rst | 2 +- Documentation/cgroup-v1/blkio-controller.rst | 302 ------ Documentation/cgroup-v1/cgroups.rst | 695 -------------- Documentation/cgroup-v1/cpuacct.rst | 50 - Documentation/cgroup-v1/cpusets.rst | 866 ----------------- Documentation/cgroup-v1/devices.rst | 132 --- Documentation/cgroup-v1/freezer-subsystem.rst | 127 --- Documentation/cgroup-v1/hugetlb.rst | 50 - Documentation/cgroup-v1/index.rst | 30 - Documentation/cgroup-v1/memcg_test.rst | 355 ------- Documentation/cgroup-v1/memory.rst | 1003 -------------------- Documentation/cgroup-v1/net_cls.rst | 44 - Documentation/cgroup-v1/net_prio.rst | 57 -- Documentation/cgroup-v1/pids.rst | 92 -- Documentation/cgroup-v1/rdma.rst | 117 --- Documentation/filesystems/tmpfs.txt | 2 +- Documentation/kernel-per-CPU-kthreads.txt | 2 +- Documentation/scheduler/sched-deadline.rst | 2 +- Documentation/scheduler/sched-design-CFS.rst | 2 +- Documentation/scheduler/sched-rt-group.rst | 2 +- Documentation/vm/numa.rst | 4 +- Documentation/vm/page_migration.rst | 2 +- Documentation/vm/unevictable-lru.rst | 2 +- Documentation/x86/x86_64/fake-numa-for-cpusets.rst | 4 +- MAINTAINERS | 4 +- block/Kconfig | 2 +- include/linux/cgroup-defs.h | 2 +- include/uapi/linux/bpf.h | 2 +- init/Kconfig | 4 +- kernel/cgroup/cpuset.c | 2 +- security/device_cgroup.c | 2 +- tools/include/uapi/linux/bpf.h | 2 +- 50 files changed, 3945 insertions(+), 3946 deletions(-) create mode 100644 Documentation/admin-guide/cgroup-v1/blkio-controller.rst create 
mode 100644 Documentation/admin-guide/cgroup-v1/cgroups.rst create mode 100644 Documentation/admin-guide/cgroup-v1/cpuacct.rst create mode 100644 Documentation/admin-guide/cgroup-v1/cpusets.rst create mode 100644 Documentation/admin-guide/cgroup-v1/devices.rst create mode 100644 Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst create mode 100644 Documentation/admin-guide/cgroup-v1/hugetlb.rst create mode 100644 Documentation/admin-guide/cgroup-v1/index.rst create mode 100644 Documentation/admin-guide/cgroup-v1/memcg_test.rst create mode 100644 Documentation/admin-guide/cgroup-v1/memory.rst create mode 100644 Documentation/admin-guide/cgroup-v1/net_cls.rst create mode 100644 Documentation/admin-guide/cgroup-v1/net_prio.rst create mode 100644 Documentation/admin-guide/cgroup-v1/pids.rst create mode 100644 Documentation/admin-guide/cgroup-v1/rdma.rst delete mode 100644 Documentation/cgroup-v1/blkio-controller.rst delete mode 100644 Documentation/cgroup-v1/cgroups.rst delete mode 100644 Documentation/cgroup-v1/cpuacct.rst delete mode 100644 Documentation/cgroup-v1/cpusets.rst delete mode 100644 Documentation/cgroup-v1/devices.rst delete mode 100644 Documentation/cgroup-v1/freezer-subsystem.rst delete mode 100644 Documentation/cgroup-v1/hugetlb.rst delete mode 100644 Documentation/cgroup-v1/index.rst delete mode 100644 Documentation/cgroup-v1/memcg_test.rst delete mode 100644 Documentation/cgroup-v1/memory.rst delete mode 100644 Documentation/cgroup-v1/net_cls.rst delete mode 100644 Documentation/cgroup-v1/net_prio.rst delete mode 100644 Documentation/cgroup-v1/pids.rst delete mode 100644 Documentation/cgroup-v1/rdma.rst (limited to 'Documentation/block') diff --git a/Documentation/admin-guide/cgroup-v1/blkio-controller.rst b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst new file mode 100644 index 000000000000..1d7d962933be --- /dev/null +++ b/Documentation/admin-guide/cgroup-v1/blkio-controller.rst @@ -0,0 +1,302 @@ +=================== +Block IO Controller +=================== + +Overview +======== +cgroup subsys "blkio" implements the block io controller. There seems to be +a need for various kinds of IO control policies (like proportional BW, max BW) +both at leaf nodes as well as at intermediate nodes in a storage hierarchy. +The plan is to use the same cgroup-based management interface for the blkio +controller and, based on user options, switch IO policies in the background. + +One IO control policy is the throttling policy, which can be used to +specify upper IO rate limits on devices. This policy is implemented in +the generic block layer and can be used on leaf nodes as well as higher +level logical devices like device mapper. + +HOWTO +===== +Throttling/Upper Limit policy +----------------------------- +- Enable Block IO controller:: + + CONFIG_BLK_CGROUP=y + +- Enable throttling in block layer:: + + CONFIG_BLK_DEV_THROTTLING=y + +- Mount the blkio controller (see cgroups.txt, Why are cgroups needed?):: + + mount -t cgroup -o blkio none /sys/fs/cgroup/blkio + +- Specify a bandwidth rate on a particular device for the root group. The format + for the policy is "<major>:<minor> <bytes_per_second>":: + + echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device + + The above will put a limit of 1MB/second on reads happening for the root group + on the device having major/minor number 8:16.
+
+- Run dd to read a file and see if rate is throttled to 1MB/s or not::
+
+	# dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
+	1024+0 records in
+	1024+0 records out
+	4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s
+
+  Limits for writes can be put using the blkio.throttle.write_bps_device file.
+
+Hierarchical Cgroups
+====================
+
+Throttling implements hierarchy support; however,
+throttling's hierarchy support is enabled iff "sane_behavior" is
+enabled from cgroup side, which currently is a development option and
+not publicly available.
+
+If somebody created a hierarchy like the following::
+
+			root
+			/  \
+		     test1 test2
+			|
+		     test3
+
+Throttling with "sane_behavior" will handle the
+hierarchy correctly. For throttling, all limits apply
+to the whole subtree while all statistics are local to the IOs
+directly generated by tasks in that cgroup.
+
+Throttling without "sane_behavior" enabled from cgroup side will
+practically treat all groups at the same level as if it looks like the
+following::
+
+				pivot
+			     /  /   \  \
+			root  test1 test2  test3
+
+Various user visible config options
+===================================
+CONFIG_BLK_CGROUP
+	- Block IO controller.
+
+CONFIG_BFQ_CGROUP_DEBUG
+	- Debug help. Right now some additional stats files show up in the
+	  cgroup if this option is enabled.
+
+CONFIG_BLK_DEV_THROTTLING
+	- Enable block device throttling support in block layer.
+
+Details of cgroup files
+=======================
+Proportional weight policy files
+--------------------------------
+- blkio.weight
+	- Specifies the per-cgroup weight. This is the default weight of the
+	  group on all the devices until and unless overridden by a
+	  per-device rule. (See blkio.weight_device).
+	  Currently allowed range of weights is from 10 to 1000.
+
+- blkio.weight_device
+	- One can specify per cgroup per device rules using this interface.
+	  These rules override the default value of group weight as specified
+	  by blkio.weight.
+
+	  Following is the format::
+
+	    # echo dev_maj:dev_minor weight > blkio.weight_device
+
+	  Configure weight=300 on /dev/sdb (8:16) in this cgroup::
+
+	    # echo 8:16 300 > blkio.weight_device
+	    # cat blkio.weight_device
+	    dev     weight
+	    8:16    300
+
+	  Configure weight=500 on /dev/sda (8:0) in this cgroup::
+
+	    # echo 8:0 500 > blkio.weight_device
+	    # cat blkio.weight_device
+	    dev     weight
+	    8:0     500
+	    8:16    300
+
+	  Remove specific weight for /dev/sda in this cgroup::
+
+	    # echo 8:0 0 > blkio.weight_device
+	    # cat blkio.weight_device
+	    dev     weight
+	    8:16    300
+
+- blkio.leaf_weight[_device]
+	- Equivalents of blkio.weight[_device] for the purpose of
+	  deciding how much weight tasks in the given cgroup have while
+	  competing with the cgroup's child cgroups. For details,
+	  please refer to Documentation/block/cfq-iosched.txt.
+
+- blkio.time
+	- disk time allocated to cgroup per device in milliseconds. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the disk time allocated to group in
+	  milliseconds.
+
+- blkio.sectors
+	- number of sectors transferred to/from disk by the group. First
+	  two fields specify the major and minor number of the device and
+	  third field specifies the number of sectors transferred by the
+	  group to/from the device.
+
+- blkio.io_service_bytes
+	- Number of bytes transferred to/from the disk by the group. These
+	  are further divided by the type of operation - read or write, sync
+	  or async.
First two fields specify the major and minor number of the + device, third field specifies the operation type and the fourth field + specifies the number of bytes. + +- blkio.io_serviced + - Number of IOs (bio) issued to the disk by the group. These + are further divided by the type of operation - read or write, sync + or async. First two fields specify the major and minor number of the + device, third field specifies the operation type and the fourth field + specifies the number of IOs. + +- blkio.io_service_time + - Total amount of time between request dispatch and request completion + for the IOs done by this cgroup. This is in nanoseconds to make it + meaningful for flash devices too. For devices with queue depth of 1, + this time represents the actual service time. When queue_depth > 1, + that is no longer true as requests may be served out of order. This + may cause the service time for a given IO to include the service time + of multiple IOs when served out of order which may result in total + io_service_time > actual time elapsed. This time is further divided by + the type of operation - read or write, sync or async. First two fields + specify the major and minor number of the device, third field + specifies the operation type and the fourth field specifies the + io_service_time in ns. + +- blkio.io_wait_time + - Total amount of time the IOs for this cgroup spent waiting in the + scheduler queues for service. This can be greater than the total time + elapsed since it is cumulative io_wait_time for all IOs. It is not a + measure of total time the cgroup spent waiting but rather a measure of + the wait_time for its individual IOs. For devices with queue_depth > 1 + this metric does not include the time spent waiting for service once + the IO is dispatched to the device but till it actually gets serviced + (there might be a time lag here due to re-ordering of requests by the + device). This is in nanoseconds to make it meaningful for flash + devices too. This time is further divided by the type of operation - + read or write, sync or async. First two fields specify the major and + minor number of the device, third field specifies the operation type + and the fourth field specifies the io_wait_time in ns. + +- blkio.io_merged + - Total number of bios/requests merged into requests belonging to this + cgroup. This is further divided by the type of operation - read or + write, sync or async. + +- blkio.io_queued + - Total number of requests queued up at any given instant for this + cgroup. This is further divided by the type of operation - read or + write, sync or async. + +- blkio.avg_queue_size + - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. + The average queue size for this cgroup over the entire time of this + cgroup's existence. Queue size samples are taken each time one of the + queues of this cgroup gets a timeslice. + +- blkio.group_wait_time + - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. + This is the amount of time the cgroup had to wait since it became busy + (i.e., went from 0 to 1 request queued) to get a timeslice for one of + its queues. This is different from the io_wait_time which is the + cumulative total of the amount of time spent by each IO in that cgroup + waiting in the scheduler queue. This is in nanoseconds. If this is + read when the cgroup is in a waiting (for timeslice) state, the stat + will only report the group_wait_time accumulated till the last time it + got a timeslice and will not include the current delta. 
+
+- blkio.empty_time
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+	  This is the amount of time a cgroup spends without any pending
+	  requests when not being served, i.e., it does not include any time
+	  spent idling for one of the queues of the cgroup. This is in
+	  nanoseconds. If this is read when the cgroup is in an empty state,
+	  the stat will only report the empty_time accumulated till the last
+	  time it had a pending request and will not include the current delta.
+
+- blkio.idle_time
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
+	  This is the amount of time spent by the IO scheduler idling for a
+	  given cgroup in anticipation of a better request than the existing ones
+	  from other queues/cgroups. This is in nanoseconds. If this is read
+	  when the cgroup is in an idling state, the stat will only report the
+	  idle_time accumulated till the last idle period and will not include
+	  the current delta.
+
+- blkio.dequeue
+	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
+	  gives the statistics about how many times a group was dequeued
+	  from the service tree of the device. First two fields specify the
+	  major and minor number of the device and third field specifies the
+	  number of times a group was dequeued from a particular device.
+
+- blkio.*_recursive
+	- Recursive version of various stats. These files show the
+	  same information as their non-recursive counterparts but
+	  include stats from all the descendant cgroups.
+
+Throttling/Upper limit policy files
+-----------------------------------
+- blkio.throttle.read_bps_device
+	- Specifies upper limit on READ rate from the device. IO rate is
+	  specified in bytes per second. Rules are per device. Following is
+	  the format::
+
+	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
+
+- blkio.throttle.write_bps_device
+	- Specifies upper limit on WRITE rate to the device. IO rate is
+	  specified in bytes per second. Rules are per device. Following is
+	  the format::
+
+	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device
+
+- blkio.throttle.read_iops_device
+	- Specifies upper limit on READ rate from the device. IO rate is
+	  specified in IOs per second. Rules are per device. Following is
+	  the format::
+
+	    echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device
+
+- blkio.throttle.write_iops_device
+	- Specifies upper limit on WRITE rate to the device. IO rate is
+	  specified in IOs per second. Rules are per device. Following is
+	  the format::
+
+	    echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.write_iops_device
+
+Note: If both BW and IOPS rules are specified for a device, then IO is
+      subjected to both constraints.
+
+- blkio.throttle.io_serviced
+	- Number of IOs (bio) issued to the disk by the group. These
+	  are further divided by the type of operation - read or write, sync
+	  or async. First two fields specify the major and minor number of the
+	  device, third field specifies the operation type and the fourth field
+	  specifies the number of IOs.
+
+- blkio.throttle.io_service_bytes
+	- Number of bytes transferred to/from the disk by the group. These
+	  are further divided by the type of operation - read or write, sync
+	  or async. First two fields specify the major and minor number of the
+	  device, third field specifies the operation type and the fourth field
+	  specifies the number of bytes.
+
+Common files among various policies
+-----------------------------------
+- blkio.reset_stats
+	- Writing an int to this file will result in resetting all the stats
+	  for that cgroup.
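+
+	  For example (the path assumes the blkio hierarchy is mounted at
+	  /sys/fs/cgroup/blkio, and "test1" is only an example cgroup)::
+
+	    # echo 1 > /sys/fs/cgroup/blkio/test1/blkio.reset_stats
+	    # cat /sys/fs/cgroup/blkio/test1/blkio.io_serviced  # counters restart from zero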
diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst new file mode 100644 index 000000000000..b0688011ed06 --- /dev/null +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -0,0 +1,695 @@ +============== +Control Groups +============== + +Written by Paul Menage based on +Documentation/admin-guide/cgroup-v1/cpusets.rst + +Original copyright statements from cpusets.txt: + +Portions Copyright (C) 2004 BULL SA. + +Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. + +Modified by Paul Jackson + +Modified by Christoph Lameter + +.. CONTENTS: + + 1. Control Groups + 1.1 What are cgroups ? + 1.2 Why are cgroups needed ? + 1.3 How are cgroups implemented ? + 1.4 What does notify_on_release do ? + 1.5 What does clone_children do ? + 1.6 How do I use cgroups ? + 2. Usage Examples and Syntax + 2.1 Basic Usage + 2.2 Attaching processes + 2.3 Mounting hierarchies by name + 3. Kernel API + 3.1 Overview + 3.2 Synchronization + 3.3 Subsystem API + 4. Extended attributes usage + 5. Questions + +1. Control Groups +================= + +1.1 What are cgroups ? +---------------------- + +Control Groups provide a mechanism for aggregating/partitioning sets of +tasks, and all their future children, into hierarchical groups with +specialized behaviour. + +Definitions: + +A *cgroup* associates a set of tasks with a set of parameters for one +or more subsystems. + +A *subsystem* is a module that makes use of the task grouping +facilities provided by cgroups to treat groups of tasks in +particular ways. A subsystem is typically a "resource controller" that +schedules a resource or applies per-cgroup limits, but it may be +anything that wants to act on a group of processes, e.g. a +virtualization subsystem. + +A *hierarchy* is a set of cgroups arranged in a tree, such that +every task in the system is in exactly one of the cgroups in the +hierarchy, and a set of subsystems; each subsystem has system-specific +state attached to each cgroup in the hierarchy. Each hierarchy has +an instance of the cgroup virtual filesystem associated with it. + +At any one time there may be multiple active hierarchies of task +cgroups. Each hierarchy is a partition of all tasks in the system. + +User-level code may create and destroy cgroups by name in an +instance of the cgroup virtual file system, specify and query to +which cgroup a task is assigned, and list the task PIDs assigned to +a cgroup. Those creations and assignments only affect the hierarchy +associated with that instance of the cgroup file system. + +On their own, the only use for cgroups is for simple job +tracking. The intention is that other subsystems hook into the generic +cgroup support to provide new attributes for cgroups, such as +accounting/limiting the resources which processes in a cgroup can +access. For example, cpusets (see Documentation/admin-guide/cgroup-v1/cpusets.rst) allow +you to associate a set of CPUs and a set of memory nodes with the +tasks in each cgroup. + +1.2 Why are cgroups needed ? +---------------------------- + +There are multiple efforts to provide process aggregations in the +Linux kernel, mainly for resource-tracking purposes. Such efforts +include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server +namespaces. These all require the basic notion of a +grouping/partitioning of processes, with newly forked processes ending +up in the same group (cgroup) as their parent process. 
+
+The kernel cgroup patch provides the minimum essential kernel
+mechanisms required to efficiently implement such groups. It has
+minimal impact on the system fast paths, and provides hooks for
+specific subsystems such as cpusets to provide additional behaviour as
+desired.
+
+Multiple hierarchy support is provided to allow for situations where
+the division of tasks into cgroups is distinctly different for
+different subsystems - having parallel hierarchies allows each
+hierarchy to be a natural division of tasks, without having to handle
+complex combinations of tasks that would be present if several
+unrelated subsystems needed to be forced into the same tree of
+cgroups.
+
+At one extreme, each resource controller or subsystem could be in a
+separate hierarchy; at the other extreme, all subsystems
+would be attached to the same hierarchy.
+
+As an example of a scenario (originally proposed by vatsa@in.ibm.com)
+that can benefit from multiple hierarchies, consider a large
+university server with various users - students, professors, system
+tasks, etc. The resource planning for this server could be along the
+following lines::
+
+       CPU :          "Top cpuset"
+                       /       \
+               CPUSet1         CPUSet2
+                  |               |
+               (Professors)    (Students)
+
+               In addition (system tasks) are attached to topcpuset (so
+               that they can run anywhere) with a limit of 20%
+
+       Memory : Professors (50%), Students (30%), system (20%)
+
+       Disk : Professors (50%), Students (30%), system (20%)
+
+       Network : WWW browsing (20%), Network File System (60%), others (20%)
+                               / \
+               Professors (15%)  students (5%)
+
+Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes
+into the NFS network class.
+
+At the same time Firefox/Lynx will share an appropriate CPU/Memory class
+depending on who launched it (prof/student).
+
+With the ability to classify tasks differently for different resources
+(by putting those resource subsystems in different hierarchies),
+the admin can easily set up a script which receives exec notifications
+and, depending on who is launching the browser, he can::
+
+    # echo browser_pid > /sys/fs/cgroup/<restype>/<userclass>/tasks
+
+With only a single hierarchy, he would now potentially have to create
+a separate cgroup for every browser launched and associate it with
+the appropriate network and other resource classes. This may lead to
+a proliferation of such cgroups.
+
+Also let's say that the administrator would like to give enhanced network
+access temporarily to a student's browser (since it is night and the user
+wants to do online gaming :)) OR give one of the student's simulation
+apps enhanced CPU power.
+
+With the ability to write PIDs directly to resource classes, it's just a
+matter of::
+
+    # echo pid > /sys/fs/cgroup/network/<new_class>/tasks
+    (after some time)
+    # echo pid > /sys/fs/cgroup/network/<orig_class>/tasks
+
+Without this ability, the administrator would have to split the cgroup into
+multiple separate ones and then associate the new cgroups with the
+new resource classes.
+
+
+
+1.3 How are cgroups implemented ?
+---------------------------------
+
+Control Groups extend the kernel as follows:
+
+ - Each task in the system has a reference-counted pointer to a
+   css_set.
+
+ - A css_set contains a set of reference-counted pointers to
+   cgroup_subsys_state objects, one for each cgroup subsystem
+   registered in the system. There is no direct link from a task to
+   the cgroup of which it's a member in each hierarchy, but this
+   can be determined by following pointers through the
+   cgroup_subsys_state objects.
This is because accessing the + subsystem state is something that's expected to happen frequently + and in performance-critical code, whereas operations that require a + task's actual cgroup assignments (in particular, moving between + cgroups) are less common. A linked list runs through the cg_list + field of each task_struct using the css_set, anchored at + css_set->tasks. + + - A cgroup hierarchy filesystem can be mounted for browsing and + manipulation from user space. + + - You can list all the tasks (by PID) attached to any cgroup. + +The implementation of cgroups requires a few, simple hooks +into the rest of the kernel, none in performance-critical paths: + + - in init/main.c, to initialize the root cgroups and initial + css_set at system boot. + + - in fork and exit, to attach and detach a task from its css_set. + +In addition, a new file system of type "cgroup" may be mounted, to +enable browsing and modifying the cgroups presently known to the +kernel. When mounting a cgroup hierarchy, you may specify a +comma-separated list of subsystems to mount as the filesystem mount +options. By default, mounting the cgroup filesystem attempts to +mount a hierarchy containing all registered subsystems. + +If an active hierarchy with exactly the same set of subsystems already +exists, it will be reused for the new mount. If no existing hierarchy +matches, and any of the requested subsystems are in use in an existing +hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy +is activated, associated with the requested subsystems. + +It's not currently possible to bind a new subsystem to an active +cgroup hierarchy, or to unbind a subsystem from an active cgroup +hierarchy. This may be possible in future, but is fraught with nasty +error-recovery issues. + +When a cgroup filesystem is unmounted, if there are any +child cgroups created below the top-level cgroup, that hierarchy +will remain active even though unmounted; if there are no +child cgroups then the hierarchy will be deactivated. + +No new system calls are added for cgroups - all support for +querying and modifying cgroups is via this cgroup file system. + +Each task under /proc has an added file named 'cgroup' displaying, +for each active hierarchy, the subsystem names and the cgroup name +as the path relative to the root of the cgroup file system. + +Each cgroup is represented by a directory in the cgroup file system +containing the following files describing that cgroup: + + - tasks: list of tasks (by PID) attached to that cgroup. This list + is not guaranteed to be sorted. Writing a thread ID into this file + moves the thread into this cgroup. + - cgroup.procs: list of thread group IDs in the cgroup. This list is + not guaranteed to be sorted or free of duplicate TGIDs, and userspace + should sort/uniquify the list if this property is required. + Writing a thread group ID into this file moves all threads in that + group into this cgroup. + - notify_on_release flag: run the release agent on exit? + - release_agent: the path to use for release notifications (this file + exists in the top cgroup only) + +Other subsystems such as cpusets may add additional files in each +cgroup dir. + +New cgroups are created using the mkdir system call or shell +command. The properties of a cgroup, such as its flags, are +modified by writing to the appropriate file in that cgroups +directory, as listed above. 
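+
+As a quick sketch of this (the mount point and the cgroup name "newjob"
+are only examples, and a non-cpuset hierarchy is assumed so that no
+further setup is needed before attaching tasks)::
+
+    # mkdir /sys/fs/cgroup/cpu/newjob       # create a child cgroup with mkdir
+    # echo $$ > /sys/fs/cgroup/cpu/newjob/tasks
+    # cat /proc/self/cgroup                 # the cpu hierarchy line now ends in /newjob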
+ +The named hierarchical structure of nested cgroups allows partitioning +a large system into nested, dynamically changeable, "soft-partitions". + +The attachment of each task, automatically inherited at fork by any +children of that task, to a cgroup allows organizing the work load +on a system into related sets of tasks. A task may be re-attached to +any other cgroup, if allowed by the permissions on the necessary +cgroup file system directories. + +When a task is moved from one cgroup to another, it gets a new +css_set pointer - if there's an already existing css_set with the +desired collection of cgroups then that group is reused, otherwise a new +css_set is allocated. The appropriate existing css_set is located by +looking into a hash table. + +To allow access from a cgroup to the css_sets (and hence tasks) +that comprise it, a set of cg_cgroup_link objects form a lattice; +each cg_cgroup_link is linked into a list of cg_cgroup_links for +a single cgroup on its cgrp_link_list field, and a list of +cg_cgroup_links for a single css_set on its cg_link_list. + +Thus the set of tasks in a cgroup can be listed by iterating over +each css_set that references the cgroup, and sub-iterating over +each css_set's task set. + +The use of a Linux virtual file system (vfs) to represent the +cgroup hierarchy provides for a familiar permission and name space +for cgroups, with a minimum of additional kernel code. + +1.4 What does notify_on_release do ? +------------------------------------ + +If the notify_on_release flag is enabled (1) in a cgroup, then +whenever the last task in the cgroup leaves (exits or attaches to +some other cgroup) and the last child cgroup of that cgroup +is removed, then the kernel runs the command specified by the contents +of the "release_agent" file in that hierarchy's root directory, +supplying the pathname (relative to the mount point of the cgroup +file system) of the abandoned cgroup. This enables automatic +removal of abandoned cgroups. The default value of +notify_on_release in the root cgroup at system boot is disabled +(0). The default value of other cgroups at creation is the current +value of their parents' notify_on_release settings. The default value of +a cgroup hierarchy's release_agent path is empty. + +1.5 What does clone_children do ? +--------------------------------- + +This flag only affects the cpuset controller. If the clone_children +flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its +configuration from the parent during initialization. + +1.6 How do I use cgroups ? +-------------------------- + +To start a new job that is to be contained within a cgroup, using +the "cpuset" cgroup subsystem, the steps are something like:: + + 1) mount -t tmpfs cgroup_root /sys/fs/cgroup + 2) mkdir /sys/fs/cgroup/cpuset + 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset + 4) Create the new cgroup by doing mkdir's and write's (or echo's) in + the /sys/fs/cgroup/cpuset virtual file system. + 5) Start a task that will be the "founding father" of the new job. + 6) Attach that task to the new cgroup by writing its PID to the + /sys/fs/cgroup/cpuset tasks file for that cgroup. + 7) fork, exec or clone the job tasks from this founding father task. 
+
+For example, the following sequence of commands will set up a cgroup
+named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
+and then start a subshell 'sh' in that cgroup::
+
+    mount -t tmpfs cgroup_root /sys/fs/cgroup
+    mkdir /sys/fs/cgroup/cpuset
+    mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
+    cd /sys/fs/cgroup/cpuset
+    mkdir Charlie
+    cd Charlie
+    /bin/echo 2-3 > cpuset.cpus
+    /bin/echo 1 > cpuset.mems
+    /bin/echo $$ > tasks
+    sh
+    # The subshell 'sh' is now running in cgroup Charlie
+    # The next line should display '/Charlie'
+    cat /proc/self/cgroup
+
+2. Usage Examples and Syntax
+============================
+
+2.1 Basic Usage
+---------------
+
+Creating, modifying, and using cgroups can be done through the cgroup
+virtual filesystem.
+
+To mount a cgroup hierarchy with all available subsystems, type::
+
+    # mount -t cgroup xxx /sys/fs/cgroup
+
+The "xxx" is not interpreted by the cgroup code, but will appear in
+/proc/mounts so may be any useful identifying string that you like.
+
+Note: Some subsystems do not work without some user input first. For instance,
+if cpusets are enabled, the user will have to populate the cpus and mems files
+for each new cgroup created before that group can be used.
+
+As explained in section `1.2 Why are cgroups needed?`, you should create
+different hierarchies of cgroups for each single resource or group of
+resources you want to control. Therefore, you should mount a tmpfs on
+/sys/fs/cgroup and create directories for each cgroup resource or resource
+group::
+
+    # mount -t tmpfs cgroup_root /sys/fs/cgroup
+    # mkdir /sys/fs/cgroup/rg1
+
+To mount a cgroup hierarchy with just the cpuset and memory
+subsystems, type::
+
+    # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
+
+While remounting cgroups is currently supported, it is not recommended
+to use it. Remounting allows changing bound subsystems and
+release_agent. Rebinding is hardly useful as it only works when the
+hierarchy is empty, and release_agent itself should be replaced with
+conventional fsnotify. The support for remounting will be removed in
+the future.
+
+To specify a hierarchy's release_agent::
+
+    # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
+      xxx /sys/fs/cgroup/rg1
+
+Note that specifying 'release_agent' more than once will result in failure.
+
+Note that changing the set of subsystems is currently only supported
+when the hierarchy consists of a single (root) cgroup. Supporting
+the ability to arbitrarily bind/unbind subsystems from an existing
+cgroup hierarchy is intended to be implemented in the future.
+
+Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
+tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
+is the cgroup that holds the whole system.
+
+If you want to change the value of release_agent::
+
+    # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
+
+It can also be changed via remount.
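+
+A sketch of the remount variant (discouraged, as noted above; the
+subsystem list should match the one given at mount time, and "rg1" and
+the agent path are only examples)::
+
+    # mount -o remount,cpuset,memory,release_agent="/sbin/new_release_agent" \
+      xxx /sys/fs/cgroup/rg1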
+
+If you want to create a new cgroup under /sys/fs/cgroup/rg1::
+
+    # cd /sys/fs/cgroup/rg1
+    # mkdir my_cgroup
+
+Now you want to do something with this cgroup::
+
+    # cd my_cgroup
+
+In this directory you can find several files::
+
+    # ls
+    cgroup.procs notify_on_release tasks
+    (plus whatever files are added by the attached subsystems)
+
+Now attach your shell to this cgroup::
+
+    # /bin/echo $$ > tasks
+
+You can also create cgroups inside your cgroup by using mkdir in this
+directory::
+
+    # mkdir my_sub_cs
+
+To remove a cgroup, just use rmdir::
+
+    # rmdir my_sub_cs
+
+This will fail if the cgroup is in use (has cgroups inside, or
+has processes attached, or is held alive by another subsystem-specific
+reference).
+
+2.2 Attaching processes
+-----------------------
+
+::
+
+    # /bin/echo PID > tasks
+
+Note that it is PID, not PIDs. You can only attach ONE task at a time.
+If you have several tasks to attach, you have to do it one after another::
+
+    # /bin/echo PID1 > tasks
+    # /bin/echo PID2 > tasks
+    ...
+    # /bin/echo PIDn > tasks
+
+You can attach the current shell task by echoing 0::
+
+    # echo 0 > tasks
+
+You can use the cgroup.procs file instead of the tasks file to move all
+threads in a threadgroup at once. Echoing the PID of any task in a
+threadgroup to cgroup.procs causes all tasks in that threadgroup to be
+attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
+in the writing task's threadgroup.
+
+Note: Since every task is always a member of exactly one cgroup in each
+mounted hierarchy, to remove a task from its current cgroup you must
+move it into a new cgroup (possibly the root cgroup) by writing to the
+new cgroup's tasks file.
+
+Note: Due to some restrictions enforced by some cgroup subsystems, moving
+a process to another cgroup can fail.
+
+2.3 Mounting hierarchies by name
+--------------------------------
+
+Passing the name= option when mounting a cgroup hierarchy
+associates the given name with the hierarchy. This can be used when
+mounting a pre-existing hierarchy, in order to refer to it by name
+rather than by its set of active subsystems. Each hierarchy is either
+nameless, or has a unique name.
+
+The name should match [\w.-]+
+
+When passing a name= option for a new hierarchy, you need to
+specify subsystems manually; the legacy behaviour of mounting all
+subsystems when none are explicitly specified is not supported when
+you give a hierarchy a name.
+
+The name of the subsystem appears as part of the hierarchy description
+in /proc/mounts and /proc/<pid>/cgroups.
+
+
+3. Kernel API
+=============
+
+3.1 Overview
+------------
+
+Each kernel subsystem that wants to hook into the generic cgroup
+system needs to create a cgroup_subsys object. This contains
+various methods, which are callbacks from the cgroup system, along
+with a subsystem ID which will be assigned by the cgroup system.
+
+Other fields in the cgroup_subsys object include:
+
+- subsys_id: a unique array index for the subsystem, indicating which
+  entry in cgroup->subsys[] this subsystem should be managing.
+
+- name: should be initialized to a unique subsystem name. Should be
+  no longer than MAX_CGROUP_TYPE_NAMELEN.
+
+- early_init: indicates whether the subsystem needs early initialization
+  at system boot.
+
+Each cgroup object created by the system has an array of pointers,
+indexed by subsystem ID; each pointer is entirely managed by its
+subsystem; the generic cgroup code will never touch these pointers.
+ +3.2 Synchronization +------------------- + +There is a global mutex, cgroup_mutex, used by the cgroup +system. This should be taken by anything that wants to modify a +cgroup. It may also be taken to prevent cgroups from being +modified, but more specific locks may be more appropriate in that +situation. + +See kernel/cgroup.c for more details. + +Subsystems can take/release the cgroup_mutex via the functions +cgroup_lock()/cgroup_unlock(). + +Accessing a task's cgroup pointer may be done in the following ways: +- while holding cgroup_mutex +- while holding the task's alloc_lock (via task_lock()) +- inside an rcu_read_lock() section via rcu_dereference() + +3.3 Subsystem API +----------------- + +Each subsystem should: + +- add an entry in linux/cgroup_subsys.h +- define a cgroup_subsys object called _cgrp_subsys + +Each subsystem may export the following methods. The only mandatory +methods are css_alloc/free. Any others that are null are presumed to +be successful no-ops. + +``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)`` +(cgroup_mutex held by caller) + +Called to allocate a subsystem state object for a cgroup. The +subsystem should allocate its subsystem state object for the passed +cgroup, returning a pointer to the new object on success or a +ERR_PTR() value. On success, the subsystem pointer should point to +a structure of type cgroup_subsys_state (typically embedded in a +larger subsystem-specific object), which will be initialized by the +cgroup system. Note that this will be called at initialization to +create the root subsystem state for this subsystem; this case can be +identified by the passed cgroup object having a NULL parent (since +it's the root of the hierarchy) and may be an appropriate place for +initialization code. + +``int css_online(struct cgroup *cgrp)`` +(cgroup_mutex held by caller) + +Called after @cgrp successfully completed all allocations and made +visible to cgroup_for_each_child/descendant_*() iterators. The +subsystem may choose to fail creation by returning -errno. This +callback can be used to implement reliable state sharing and +propagation along the hierarchy. See the comment on +cgroup_for_each_descendant_pre() for details. + +``void css_offline(struct cgroup *cgrp);`` +(cgroup_mutex held by caller) + +This is the counterpart of css_online() and called iff css_online() +has succeeded on @cgrp. This signifies the beginning of the end of +@cgrp. @cgrp is being removed and the subsystem should start dropping +all references it's holding on @cgrp. When all references are dropped, +cgroup removal will proceed to the next step - css_free(). After this +callback, @cgrp should be considered dead to the subsystem. + +``void css_free(struct cgroup *cgrp)`` +(cgroup_mutex held by caller) + +The cgroup system is about to free @cgrp; the subsystem should free +its subsystem state object. By the time this method is called, @cgrp +is completely unused; @cgrp->parent is still valid. (Note - can also +be called for a newly-created cgroup if an error occurs after this +subsystem's create() method has been called for the new cgroup). + +``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)`` +(cgroup_mutex held by caller) + +Called prior to moving one or more tasks into a cgroup; if the +subsystem returns an error, this will abort the attach operation. +@tset contains the tasks to be attached and is guaranteed to have at +least one task in it. 
+
+If there are multiple tasks in the taskset, then:
+  - it's guaranteed that all are from the same thread group
+  - @tset contains all tasks from the thread group whether or not
+    they're switching cgroups
+  - the first task is the leader
+
+Each @tset entry also contains the task's old cgroup, and tasks which
+aren't switching cgroups can be skipped easily using the
+cgroup_taskset_for_each() iterator. Note that this isn't called on a
+fork. If this method returns 0 (success) then this should remain valid
+while the caller holds cgroup_mutex and it is ensured that either
+attach() or cancel_attach() will be called in the future.
+
+``void css_reset(struct cgroup_subsys_state *css)``
+(cgroup_mutex held by caller)
+
+An optional operation which should restore @css's configuration to the
+initial state. This is currently only used on the unified hierarchy
+when a subsystem is disabled on a cgroup through
+"cgroup.subtree_control" but should remain enabled because other
+subsystems depend on it. cgroup core makes such a css invisible by
+removing the associated interface files and invokes this callback so
+that the hidden subsystem can return to the initial neutral state.
+This prevents unexpected resource control from a hidden css and
+ensures that the configuration is in the initial state when it is made
+visible again later.
+
+``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
+(cgroup_mutex held by caller)
+
+Called when a task attach operation has failed after can_attach() has
+succeeded. A subsystem whose can_attach() has side effects should provide
+this function so that it can implement a rollback; otherwise it is not
+necessary. This will be called only for subsystems whose can_attach()
+operation has succeeded. The parameters are identical to can_attach().
+
+``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
+(cgroup_mutex held by caller)
+
+Called after the task has been attached to the cgroup, to allow any
+post-attachment activity that requires memory allocations or blocking.
+The parameters are identical to can_attach().
+
+``void fork(struct task_struct *task)``
+
+Called when a task is forked into a cgroup.
+
+``void exit(struct task_struct *task)``
+
+Called during task exit.
+
+``void free(struct task_struct *task)``
+
+Called when the task_struct is freed.
+
+``void bind(struct cgroup *root)``
+(cgroup_mutex held by caller)
+
+Called when a cgroup subsystem is rebound to a different hierarchy
+and root cgroup. Currently this will only involve movement between
+the default hierarchy (which never has sub-cgroups) and a hierarchy
+that is being created/destroyed (and hence has no sub-cgroups).
+
+4. Extended attribute usage
+===========================
+
+The cgroup filesystem supports certain types of extended attributes in its
+directories and files. The currently supported types are:
+
+	- Trusted (XATTR_TRUSTED)
+	- Security (XATTR_SECURITY)
+
+Both require CAP_SYS_ADMIN capability to set.
+
+As with tmpfs, the extended attributes in the cgroup filesystem are stored
+using kernel memory and it's advised to keep usage to a minimum. This
+is the reason why user-defined extended attributes are not supported, since
+any user could set them and there is no limit on the value size.
+
+The currently known users of this feature are SELinux, to limit cgroup usage
+in containers, and systemd, for assorted metadata such as the main PID in a
+cgroup (systemd creates a cgroup per service).
+
+5.
Questions +============ + +:: + + Q: what's up with this '/bin/echo' ? + A: bash's builtin 'echo' command does not check calls to write() against + errors. If you use it in the cgroup file system, you won't be + able to tell whether a command succeeded or failed. + + Q: When I attach processes, only the first of the line gets really attached ! + A: We can only return one error code per call to write(). So you should also + put only ONE PID. diff --git a/Documentation/admin-guide/cgroup-v1/cpuacct.rst b/Documentation/admin-guide/cgroup-v1/cpuacct.rst new file mode 100644 index 000000000000..d30ed81d2ad7 --- /dev/null +++ b/Documentation/admin-guide/cgroup-v1/cpuacct.rst @@ -0,0 +1,50 @@ +========================= +CPU Accounting Controller +========================= + +The CPU accounting controller is used to group tasks using cgroups and +account the CPU usage of these groups of tasks. + +The CPU accounting controller supports multi-hierarchy groups. An accounting +group accumulates the CPU usage of all of its child groups and the tasks +directly present in its group. + +Accounting groups can be created by first mounting the cgroup filesystem:: + + # mount -t cgroup -ocpuacct none /sys/fs/cgroup + +With the above step, the initial or the parent accounting group becomes +visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in +the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. +/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained +by this group which is essentially the CPU time obtained by all the tasks +in the system. + +New accounting groups can be created under the parent group /sys/fs/cgroup:: + + # cd /sys/fs/cgroup + # mkdir g1 + # echo $$ > g1/tasks + +The above steps create a new group g1 and move the current shell +process (bash) into it. CPU time consumed by this bash and its children +can be obtained from g1/cpuacct.usage and the same is accumulated in +/sys/fs/cgroup/cpuacct.usage also. + +cpuacct.stat file lists a few statistics which further divide the +CPU time obtained by the cgroup into user and system times. Currently +the following statistics are supported: + +user: Time spent by tasks of the cgroup in user mode. +system: Time spent by tasks of the cgroup in kernel mode. + +user and system are in USER_HZ unit. + +cpuacct controller uses percpu_counter interface to collect user and +system times. This has two side effects: + +- It is theoretically possible to see wrong values for user and system times. + This is because percpu_counter_read() on 32bit systems isn't safe + against concurrent writes. +- It is possible to see slightly outdated values for user and system times + due to the batch processing nature of percpu_counter. diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst new file mode 100644 index 000000000000..86a6ae995d54 --- /dev/null +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -0,0 +1,866 @@ +======= +CPUSETS +======= + +Copyright (C) 2004 BULL SA. + +Written by Simon.Derr@bull.net + +- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. +- Modified by Paul Jackson +- Modified by Christoph Lameter +- Modified by Paul Menage +- Modified by Hidetoshi Seto + +.. CONTENTS: + + 1. Cpusets + 1.1 What are cpusets ? + 1.2 Why are cpusets needed ? + 1.3 How are cpusets implemented ? + 1.4 What are exclusive cpusets ? + 1.5 What is memory_pressure ? + 1.6 What is memory spread ? + 1.7 What is sched_load_balance ? 
+ 1.8 What is sched_relax_domain_level ? + 1.9 How do I use cpusets ? + 2. Usage Examples and Syntax + 2.1 Basic Usage + 2.2 Adding/removing cpus + 2.3 Setting flags + 2.4 Attaching processes + 3. Questions + 4. Contact + +1. Cpusets +========== + +1.1 What are cpusets ? +---------------------- + +Cpusets provide a mechanism for assigning a set of CPUs and Memory +Nodes to a set of tasks. In this document "Memory Node" refers to +an on-line node that contains memory. + +Cpusets constrain the CPU and Memory placement of tasks to only +the resources within a task's current cpuset. They form a nested +hierarchy visible in a virtual file system. These are the essential +hooks, beyond what is already present, required to manage dynamic +job placement on large systems. + +Cpusets use the generic cgroup subsystem described in +Documentation/admin-guide/cgroup-v1/cgroups.rst. + +Requests by a task, using the sched_setaffinity(2) system call to +include CPUs in its CPU affinity mask, and using the mbind(2) and +set_mempolicy(2) system calls to include Memory Nodes in its memory +policy, are both filtered through that task's cpuset, filtering out any +CPUs or Memory Nodes not in that cpuset. The scheduler will not +schedule a task on a CPU that is not allowed in its cpus_allowed +vector, and the kernel page allocator will not allocate a page on a +node that is not allowed in the requesting task's mems_allowed vector. + +User level code may create and destroy cpusets by name in the cgroup +virtual file system, manage the attributes and permissions of these +cpusets and which CPUs and Memory Nodes are assigned to each cpuset, +specify and query to which cpuset a task is assigned, and list the +task pids assigned to a cpuset. + + +1.2 Why are cpusets needed ? +---------------------------- + +The management of large computer systems, with many processors (CPUs), +complex memory cache hierarchies and multiple Memory Nodes having +non-uniform access times (NUMA) presents additional challenges for +the efficient scheduling and memory placement of processes. + +Frequently more modest sized systems can be operated with adequate +efficiency just by letting the operating system automatically share +the available CPU and Memory resources amongst the requesting tasks. + +But larger systems, which benefit more from careful processor and +memory placement to reduce memory access times and contention, +and which typically represent a larger investment for the customer, +can benefit from explicitly placing jobs on properly sized subsets of +the system. + +This can be especially valuable on: + + * Web Servers running multiple instances of the same web application, + * Servers running different applications (for instance, a web server + and a database), or + * NUMA systems running large HPC applications with demanding + performance characteristics. + +These subsets, or "soft partitions" must be able to be dynamically +adjusted, as the job mix changes, without impacting other concurrently +executing jobs. The location of the running jobs pages may also be moved +when the memory locations are changed. + +The kernel cpuset patch provides the minimum essential kernel +mechanisms required to efficiently implement such subsets. It +leverages existing CPU and Memory Placement facilities in the Linux +kernel to avoid any additional impact on the critical scheduler or +memory allocator code. + + +1.3 How are cpusets implemented ? 
+---------------------------------
+
+Cpusets provide a Linux kernel mechanism to constrain which CPUs and
+Memory Nodes are used by a process or set of processes.
+
+The Linux kernel already has a pair of mechanisms to specify on which
+CPUs a task may be scheduled (sched_setaffinity) and on which Memory
+Nodes it may obtain memory (mbind, set_mempolicy).
+
+Cpusets extend these two mechanisms as follows:
+
+ - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
+   kernel.
+ - Each task in the system is attached to a cpuset, via a pointer
+   in the task structure to a reference-counted cgroup structure.
+ - Calls to sched_setaffinity are filtered to just those CPUs
+   allowed in that task's cpuset.
+ - Calls to mbind and set_mempolicy are filtered to just
+   those Memory Nodes allowed in that task's cpuset.
+ - The root cpuset contains all of the system's CPUs and Memory
+   Nodes.
+ - For any cpuset, one can define child cpusets containing a subset
+   of the parent's CPU and Memory Node resources.
+ - The hierarchy of cpusets can be mounted at /dev/cpuset, for
+   browsing and manipulation from user space.
+ - A cpuset may be marked exclusive, which ensures that no other
+   cpuset (except direct ancestors and descendants) may contain
+   any overlapping CPUs or Memory Nodes.
+ - You can list all the tasks (by pid) attached to any cpuset.
+
+The implementation of cpusets requires a few simple hooks
+into the rest of the kernel, none in performance-critical paths:
+
+ - in init/main.c, to initialize the root cpuset at system boot.
+ - in fork and exit, to attach and detach a task from its cpuset.
+ - in sched_setaffinity, to mask the requested CPUs by what's
+   allowed in that task's cpuset.
+ - in sched.c migrate_live_tasks(), to keep migrating tasks within
+   the CPUs allowed by their cpuset, if possible.
+ - in the mbind and set_mempolicy system calls, to mask the requested
+   Memory Nodes by what's allowed in that task's cpuset.
+ - in page_alloc.c, to restrict memory to allowed nodes.
+ - in vmscan.c, to restrict page recovery to the current cpuset.
+
+You should mount the "cgroup" filesystem type in order to enable
+browsing and modifying the cpusets presently known to the kernel. No
+new system calls are added for cpusets - all support for querying and
+modifying cpusets is via this cpuset file system.
+
+The /proc/<pid>/status file for each task has four added lines,
+displaying the task's cpus_allowed (on which CPUs it may be scheduled)
+and mems_allowed (on which Memory Nodes it may obtain memory),
+in the two formats seen in the following example::
+
+  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
+  Cpus_allowed_list:      0-127
+  Mems_allowed:   ffffffff,ffffffff
+  Mems_allowed_list:      0-63
+
+Each cpuset is represented by a directory in the cgroup file system
+containing (on top of the standard cgroup files) the following
+files describing that cpuset:
+
+ - cpuset.cpus: list of CPUs in that cpuset
+ - cpuset.mems: list of Memory Nodes in that cpuset
+ - cpuset.memory_migrate flag: if set, move pages to the cpuset's nodes
+ - cpuset.cpu_exclusive flag: is cpu placement exclusive?
+ - cpuset.mem_exclusive flag: is memory placement exclusive?
+ - cpuset.mem_hardwall flag: is memory allocation hardwalled
+ - cpuset.memory_pressure: a measure of how much paging pressure there is
+   in the cpuset
+ - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
+ - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
+ - cpuset.sched_relax_domain_level: the searching range when migrating tasks
+
+In addition, only the root cpuset has the following file:
+
+ - cpuset.memory_pressure_enabled flag: compute memory_pressure?
+
+New cpusets are created using the mkdir system call or shell
+command. The properties of a cpuset, such as its flags, allowed
+CPUs and Memory Nodes, and attached tasks, are modified by writing
+to the appropriate file in that cpuset's directory, as listed above.
+
+The named hierarchical structure of nested cpusets allows partitioning
+a large system into nested, dynamically changeable, "soft-partitions".
+
+The attachment of each task, automatically inherited at fork by any
+children of that task, to a cpuset allows organizing the workload
+on a system into related sets of tasks such that each set is constrained
+to using the CPUs and Memory Nodes of a particular cpuset. A task
+may be re-attached to any other cpuset, if allowed by the permissions
+on the necessary cpuset file system directories.
+
+Such management of a system "in the large" integrates smoothly with
+the detailed placement done on individual tasks and memory regions
+using the sched_setaffinity, mbind and set_mempolicy system calls.
+
+The following rules apply to each cpuset:
+
+ - Its CPUs and Memory Nodes must be a subset of its parent's.
+ - It can't be marked exclusive unless its parent is.
+ - If its cpu or memory is exclusive, they may not overlap any sibling.
+
+These rules, and the natural hierarchy of cpusets, enable efficient
+enforcement of the exclusive guarantee, without having to scan all
+cpusets every time any of them changes to ensure nothing overlaps an
+exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
+to represent the cpuset hierarchy provides for a familiar permission
+and name space for cpusets, with a minimum of additional kernel code.
+
+The cpus and mems files in the root (top_cpuset) cpuset are
+read-only. The cpus file automatically tracks the value of
+cpu_online_mask using a CPU hotplug notifier, and the mems file
+automatically tracks the value of node_states[N_MEMORY]--i.e.,
+nodes with memory--using the cpuset_track_online_nodes() hook.
+
+
+1.4 What are exclusive cpusets ?
+--------------------------------
+
+If a cpuset is cpu or mem exclusive, no other cpuset, other than
+a direct ancestor or descendant, may share any of the same CPUs or
+Memory Nodes.
+
+A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
+i.e. it restricts kernel allocations for page, buffer and other data
+commonly shared by the kernel across multiple users. All cpusets,
+whether hardwalled or not, restrict allocations of memory for user
+space. This enables configuring a system so that several independent
+jobs can share common kernel data, such as file system pages, while
+isolating each job's user allocation in its own cpuset. To do this,
+construct a large mem_exclusive cpuset to hold all the jobs, and
+construct child, non-mem_exclusive cpusets for each individual job.
+Only a small amount of typical kernel memory, such as requests from +interrupt handlers, is allowed to be taken outside even a +mem_exclusive cpuset. + + +1.5 What is memory_pressure ? +----------------------------- +The memory_pressure of a cpuset provides a simple per-cpuset metric +of the rate that the tasks in a cpuset are attempting to free up in +use memory on the nodes of the cpuset to satisfy additional memory +requests. + +This enables batch managers monitoring jobs running in dedicated +cpusets to efficiently detect what level of memory pressure that job +is causing. + +This is useful both on tightly managed systems running a wide mix of +submitted jobs, which may choose to terminate or re-prioritize jobs that +are trying to use more memory than allowed on the nodes assigned to them, +and with tightly coupled, long running, massively parallel scientific +computing jobs that will dramatically fail to meet required performance +goals if they start to use more memory than allowed to them. + +This mechanism provides a very economical way for the batch manager +to monitor a cpuset for signs of memory pressure. It's up to the +batch manager or other user code to decide what to do about it and +take action. + +==> + Unless this feature is enabled by writing "1" to the special file + /dev/cpuset/memory_pressure_enabled, the hook in the rebalance + code of __alloc_pages() for this metric reduces to simply noticing + that the cpuset_memory_pressure_enabled flag is zero. So only + systems that enable this feature will compute the metric. + +Why a per-cpuset, running average: + + Because this meter is per-cpuset, rather than per-task or mm, + the system load imposed by a batch scheduler monitoring this + metric is sharply reduced on large systems, because a scan of + the tasklist can be avoided on each set of queries. + + Because this meter is a running average, instead of an accumulating + counter, a batch scheduler can detect memory pressure with a + single read, instead of having to read and accumulate results + for a period of time. + + Because this meter is per-cpuset rather than per-task or mm, + the batch scheduler can obtain the key information, memory + pressure in a cpuset, with a single read, rather than having to + query and accumulate results over all the (dynamically changing) + set of tasks in the cpuset. + +A per-cpuset simple digital filter (requires a spinlock and 3 words +of data per-cpuset) is kept, and updated by any task attached to that +cpuset, if it enters the synchronous (direct) page reclaim code. + +A per-cpuset file provides an integer number representing the recent +(half-life of 10 seconds) rate of direct page reclaims caused by +the tasks in the cpuset, in units of reclaims attempted per second, +times 1000. + + +1.6 What is memory spread ? +--------------------------- +There are two boolean flag files per cpuset that control where the +kernel allocates pages for the file system buffers and related in +kernel data structures. They are called 'cpuset.memory_spread_page' and +'cpuset.memory_spread_slab'. + +If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then +the kernel will spread the file system buffers (page cache) evenly +over all the nodes that the faulting task is allowed to use, instead +of preferring to put those pages on the node where the task is running. 
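+
+Enabling the flag is a single write (a sketch; the cpuset name "bigjob"
+is only an example, and the companion slab flag described next is set
+the same way)::
+
+    # echo 1 > /sys/fs/cgroup/cpuset/bigjob/cpuset.memory_spread_page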
+ +If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set, +then the kernel will spread some file system related slab caches, +such as for inodes and dentries evenly over all the nodes that the +faulting task is allowed to use, instead of preferring to put those +pages on the node where the task is running. + +The setting of these flags does not affect anonymous data segment or +stack segment pages of a task. + +By default, both kinds of memory spreading are off, and memory +pages are allocated on the node local to where the task is running, +except perhaps as modified by the task's NUMA mempolicy or cpuset +configuration, so long as sufficient free memory pages are available. + +When new cpusets are created, they inherit the memory spread settings +of their parent. + +Setting memory spreading causes allocations for the affected page +or slab caches to ignore the task's NUMA mempolicy and be spread +instead. Tasks using mbind() or set_mempolicy() calls to set NUMA +mempolicies will not notice any change in these calls as a result of +their containing task's memory spread settings. If memory spreading +is turned off, then the currently specified NUMA mempolicy once again +applies to memory page allocations. + +Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag +files. By default they contain "0", meaning that the feature is off +for that cpuset. If a "1" is written to that file, then that turns +the named feature on. + +The implementation is simple. + +Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag +PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently +joins that cpuset. The page allocation calls for the page cache +is modified to perform an inline check for this PFA_SPREAD_PAGE task +flag, and if set, a call to a new routine cpuset_mem_spread_node() +returns the node to prefer for the allocation. + +Similarly, setting 'cpuset.memory_spread_slab' turns on the flag +PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate +pages from the node returned by cpuset_mem_spread_node(). + +The cpuset_mem_spread_node() routine is also simple. It uses the +value of a per-task rotor cpuset_mem_spread_rotor to select the next +node in the current task's mems_allowed to prefer for the allocation. + +This memory placement policy is also known (in other contexts) as +round-robin or interleave. + +This policy can provide substantial improvements for jobs that need +to place thread local data on the corresponding node, but that need +to access large file system data sets that need to be spread across +the several nodes in the jobs cpuset in order to fit. Without this +policy, especially for jobs that might have one thread reading in the +data set, the memory allocation across the nodes in the jobs cpuset +can become very uneven. + +1.7 What is sched_load_balance ? +-------------------------------- + +The kernel scheduler (kernel/sched/core.c) automatically load balances +tasks. If one CPU is underutilized, kernel code running on that +CPU will look for tasks on other more overloaded CPUs and move those +tasks to itself, within the constraints of such placement mechanisms +as cpusets and sched_setaffinity. + +The algorithmic cost of load balancing and its impact on key shared +kernel data structures such as the task list increases more than +linearly with the number of CPUs being balanced. 
So the scheduler
+has support to partition the system's CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, including those
+marked isolated using the kernel boot time "isolcpus=" argument. However,
+the isolated CPUs will not participate in load balancing, and will not
+have tasks running on them unless explicitly assigned.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+
+When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwise pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the flag "cpuset.sched_load_balance"
+should be disabled in the top cpuset, and enabled in only some of the
+smaller, child cpusets.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendant cpusets.  Even if
+such a task could use spare CPU cycles on some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to an underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
+else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest.  Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding.  So if each of two partially
+overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
+form a single sched domain that is a superset of both. 
We won't move +a task to a CPU outside its cpuset, but the scheduler load balancing +code might waste some compute cycles considering that possibility. + +This mismatch is why there is not a simple one-to-one relation +between which cpusets have the flag "cpuset.sched_load_balance" enabled, +and the sched domain configuration. If a cpuset enables the flag, it +will get balancing across all its CPUs, but if it disables the flag, +it will only be assured of no load balancing if no other overlapping +cpuset enables the flag. + +If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only +one of them has this flag enabled, then the other may find its +tasks only partially load balanced, just on the overlapping CPUs. +This is just the general case of the top_cpuset example given a few +paragraphs above. In the general case, as in the top cpuset case, +don't leave tasks that might use non-trivial amounts of CPU in +such partially load balanced cpusets, as they may be artificially +constrained to some subset of the CPUs allowed to them, for lack of +load balancing to the other CPUs. + +CPUs in "cpuset.isolcpus" were excluded from load balancing by the +isolcpus= kernel boot option, and will never be load balanced regardless +of the value of "cpuset.sched_load_balance" in any cpuset. + +1.7.1 sched_load_balance implementation details. +------------------------------------------------ + +The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary +to most cpuset flags.) When enabled for a cpuset, the kernel will +ensure that it can load balance across all the CPUs in that cpuset +(makes sure that all the CPUs in the cpus_allowed of that cpuset are +in the same sched domain.) + +If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled, +then they will be (must be) both in the same sched domain. + +If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled, +then by the above that means there is a single sched domain covering +the whole system, regardless of any other cpuset settings. + +The kernel commits to user space that it will avoid load balancing +where it can. It will pick as fine a granularity partition of sched +domains as it can while still providing load balancing for any set +of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled. + +The internal kernel cpuset to scheduler interface passes from the +cpuset code to the scheduler code a partition of the load balanced +CPUs in the system. This partition is a set of subsets (represented +as an array of struct cpumask) of CPUs, pairwise disjoint, that cover +all the CPUs that must be load balanced. + +The cpuset code builds a new such partition and passes it to the +scheduler sched domain setup code, to have the sched domains rebuilt +as necessary, whenever: + + - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes, + - or CPUs come or go from a cpuset with this flag enabled, + - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs + and with this flag enabled changes, + - or a cpuset with non-empty CPUs and with this flag enabled is removed, + - or a cpu is offlined/onlined. + +This partition exactly defines what sched domains the scheduler should +setup - one sched domain for each element (struct cpumask) in the +partition. + +The scheduler remembers the currently active sched domain partitions. 
+
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current one, and updates its sched domains,
+removing the old and adding the new, for each change.
+
+
+1.8 What is sched_relax_domain_level ?
+--------------------------------------
+
+Within a sched domain, the scheduler migrates tasks in two ways: periodic
+load balancing on the tick, and at the time of certain scheduling events.
+
+When a task is woken up, the scheduler tries to move it to an idle CPU.
+For example, if a task A running on CPU X activates another task B
+on the same CPU X, and if CPU Y is X's sibling and is idle, then the
+scheduler migrates task B to CPU Y so that task B can start on CPU Y
+without waiting for task A on CPU X.
+
+And if a CPU runs out of tasks in its runqueue, it tries to pull
+extra tasks from other busy CPUs to help them before it goes idle.
+
+Of course it takes some search cost to find movable tasks and/or
+idle CPUs, so the scheduler might not search all CPUs in the domain
+every time.  In fact, on some architectures, the search range on
+these events is limited to the same socket or node where the CPU is
+located, while the load balance on the tick searches all of them.
+
+For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
+is idle while CPU X and its siblings are busy, the scheduler can't migrate
+the woken task B from X to Z since it is out of its search range.
+As a result, task B on CPU X needs to wait for task A, or for the load
+balance on the next tick.  For some applications in a special situation,
+waiting one tick may be too long.
+
+The 'cpuset.sched_relax_domain_level' file allows you to request changing
+this search range as you like.  The file takes an int value indicating
+the size of the search range in levels, ideally as follows; otherwise it
+holds the initial value -1, which indicates that the cpuset has no request.
+
+====== ===========================================================
+  -1   no request. use system default or follow request of others.
+   0   no search.
+   1   search siblings (hyperthreads in a core).
+   2   search cores in a package.
+   3   search cpus in a node [= system wide on non-NUMA system]
+   4   search nodes in a chunk of node [on NUMA system]
+   5   search system wide [on NUMA system]
+====== ===========================================================
+
+The system default is architecture dependent.  It can be changed using
+the relax_domain_level= boot parameter.
+
+This file is per-cpuset and affects the sched domain to which the cpuset
+belongs.  Therefore, if the flag 'cpuset.sched_load_balance' of a cpuset
+is disabled, then 'cpuset.sched_relax_domain_level' has no effect, since
+there is no sched domain belonging to the cpuset.
+
+If multiple cpusets overlap and hence form a single sched domain, the
+largest value among them is used.  Be careful: if one requests 0 and
+the others are -1, then 0 is used.
+
+Note that modifying this file will have both good and bad effects,
+and whether it is acceptable or not depends on your situation.
+Don't modify this file if you are not sure.
+
+If your situation is:
+
+ - the migration costs between each CPU can be assumed to be considerably
+   small (for you) due to your particular application's behavior or
+   special hardware support for CPU caches, etc., and
+ - the search cost doesn't have an impact (for you), or you can make
+   the search cost small enough by managing your cpusets to be compact,
+   etc., and
+ - low latency is required even at the cost of cache hit rate, etc.,
+
+then increasing 'cpuset.sched_relax_domain_level' would benefit you.
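+
+For example, the wake-up search range could be widened to a whole node
+(level 3) like this (a sketch; the cpuset name "my_cpuset" is an
+assumption)::
+
+    # /bin/echo 3 > /sys/fs/cgroup/cpuset/my_cpuset/cpuset.sched_relax_domain_level
+    # cat /sys/fs/cgroup/cpuset/my_cpuset/cpuset.sched_relax_domain_level
+    3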
+
+
+1.9 How do I use cpusets ?
+--------------------------
+
+In order to minimize the impact of cpusets on critical kernel
+code, such as the scheduler, and due to the fact that the kernel
+does not support one task updating the memory placement of another
+task directly, the impact on a task of changing its cpuset CPU
+or Memory Node placement, or of changing to which cpuset a task
+is attached, is subtle.
+
+If a cpuset has its Memory Nodes modified, then for each task attached
+to that cpuset, the next time that the kernel attempts to allocate
+a page of memory for that task, the kernel will notice the change
+in the task's cpuset, and update its per-task memory placement to
+remain within the new cpuset's memory placement.  If the task was using
+mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
+its new cpuset, then the task will continue to use whatever subset
+of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
+was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
+in the new cpuset, then the task will be essentially treated as if it
+was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
+as queried by get_mempolicy(), doesn't change).  If a task is moved
+from one cpuset to another, then the kernel will adjust the task's
+memory placement, as above, the next time that the kernel attempts
+to allocate a page of memory for that task.
+
+If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset
+will have its allowed CPU placement changed immediately.  Similarly,
+if a task's pid is written to another cpuset's 'tasks' file, then its
+allowed CPU placement is changed immediately.  If such a task had been
+bound to some subset of its cpuset using the sched_setaffinity() call,
+the task will be allowed to run on any CPU allowed in its new cpuset,
+negating the effect of the prior sched_setaffinity() call.
+
+In summary, the memory placement of a task whose cpuset is changed is
+updated by the kernel, on the next allocation of a page for that task,
+and the processor placement is updated immediately.
+
+Normally, once a page is allocated (given a physical page
+of main memory) then that page stays on whatever node it
+was allocated, so long as it remains allocated, even if the
+cpuset's memory placement policy 'cpuset.mems' subsequently changes.
+If the cpuset flag file 'cpuset.memory_migrate' is set true, then when
+tasks are attached to that cpuset, any pages that the task had
+allocated to it on nodes in its previous cpuset are migrated
+to the task's new cpuset. The relative placement of the page within
+the cpuset is preserved during these migration operations if possible.
+For example, if the page was on the second valid node of the prior cpuset
+then the page will be placed on the second valid node of the new cpuset.
+
+Also, if 'cpuset.memory_migrate' is set true, then if that cpuset's
+'cpuset.mems' file is modified, pages allocated to tasks in that
+cpuset, that were on nodes in the previous setting of 'cpuset.mems',
+will be moved to nodes in the new setting of 'cpuset.mems'.
+Pages that were not in the task's prior cpuset, or in the cpuset's
+prior 'cpuset.mems' setting, will not be moved.
+
+There is an exception to the above.  If hotplug functionality is used
+to remove all the CPUs that are currently assigned to a cpuset,
+then all the tasks in that cpuset will be moved to the nearest ancestor
+with non-empty cpus. 
But the moving of some (or all) tasks might fail if the
+cpuset is bound to another cgroup subsystem which has some restrictions
+on task attaching.  In this failing case, those tasks will stay
+in the original cpuset, and the kernel will automatically update
+their cpus_allowed to allow all online CPUs.  When memory hotplug
+functionality for removing Memory Nodes is available, a similar exception
+is expected to apply there as well.  In general, the kernel prefers to
+violate cpuset placement, over starving a task that has had all
+its allowed CPUs or Memory Nodes taken offline.
+
+There is a second exception to the above.  GFP_ATOMIC requests are
+kernel internal allocations that must be satisfied immediately.
+The kernel may drop some requests, in rare cases even panic, if a
+GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
+the current task's cpuset, then we relax the cpuset, and look for
+memory anywhere we can find it.  It's better to violate the cpuset
+than stress the kernel.
+
+To start a new job that is to be contained within a cpuset, the steps are:
+
+ 1) mkdir /sys/fs/cgroup/cpuset
+ 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
+ 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
+    the /sys/fs/cgroup/cpuset virtual file system.
+ 4) Start a task that will be the "founding father" of the new job.
+ 5) Attach that task to the new cpuset by writing its pid to the
+    /sys/fs/cgroup/cpuset tasks file for that cpuset.
+ 6) fork, exec or clone the job tasks from this founding father task.
+
+For example, the following sequence of commands will set up a cpuset
+named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
+and then start a subshell 'sh' in that cpuset::
+
+  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
+  cd /sys/fs/cgroup/cpuset
+  mkdir Charlie
+  cd Charlie
+  /bin/echo 2-3 > cpuset.cpus
+  /bin/echo 1 > cpuset.mems
+  /bin/echo $$ > tasks
+  sh
+  # The subshell 'sh' is now running in cpuset Charlie
+  # The next line should display '/Charlie'
+  cat /proc/self/cpuset
+
+There are ways to query or modify cpusets:
+
+ - via the cpuset file system directly, using the various cd, mkdir, echo,
+   cat, rmdir commands from the shell, or their equivalent from C.
+ - via the C library libcpuset.
+ - via the C library libcgroup.
+   (http://sourceforge.net/projects/libcg/)
+ - via the python application cset.
+   (http://code.google.com/p/cpuset/)
+
+The sched_setaffinity calls can also be done at the shell prompt using
+SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
+calls can be done at the shell prompt using the numactl command
+(part of Andi Kleen's numa package).
+
+2. Usage Examples and Syntax
+============================
+
+2.1 Basic Usage
+---------------
+
+Creating, modifying and using cpusets can be done through the cpuset
+virtual filesystem.
+
+To mount it, type::
+
+  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
+
+Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
+tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
+is the cpuset that holds the whole system.
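+
+For instance, the root cpuset can be inspected right after mounting
+(the output shown here is illustrative and depends on the machine)::
+
+    # cat /sys/fs/cgroup/cpuset/cpuset.cpus
+    0-7
+    # cat /sys/fs/cgroup/cpuset/cpuset.mems
+    0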
+ +If you want to create a new cpuset under /sys/fs/cgroup/cpuset:: + + # cd /sys/fs/cgroup/cpuset + # mkdir my_cpuset + +Now you want to do something with this cpuset:: + + # cd my_cpuset + +In this directory you can find several files:: + + # ls + cgroup.clone_children cpuset.memory_pressure + cgroup.event_control cpuset.memory_spread_page + cgroup.procs cpuset.memory_spread_slab + cpuset.cpu_exclusive cpuset.mems + cpuset.cpus cpuset.sched_load_balance + cpuset.mem_exclusive cpuset.sched_relax_domain_level + cpuset.mem_hardwall notify_on_release + cpuset.memory_migrate tasks + +Reading them will give you information about the state of this cpuset: +the CPUs and Memory Nodes it can use, the processes that are using +it, its properties. By writing to these files you can manipulate +the cpuset. + +Set some flags:: + + # /bin/echo 1 > cpuset.cpu_exclusive + +Add some cpus:: + + # /bin/echo 0-7 > cpuset.cpus + +Add some mems:: + + # /bin/echo 0-7 > cpuset.mems + +Now attach your shell to this cpuset:: + + # /bin/echo $$ > tasks + +You can also create cpusets inside your cpuset by using mkdir in this +directory:: + + # mkdir my_sub_cs + +To remove a cpuset, just use rmdir:: + + # rmdir my_sub_cs + +This will fail if the cpuset is in use (has cpusets inside, or has +processes attached). + +Note that for legacy reasons, the "cpuset" filesystem exists as a +wrapper around the cgroup filesystem. + +The command:: + + mount -t cpuset X /sys/fs/cgroup/cpuset + +is equivalent to:: + + mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset + echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent + +2.2 Adding/removing cpus +------------------------ + +This is the syntax to use when writing in the cpus or mems files +in cpuset directories:: + + # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 + # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 + +To add a CPU to a cpuset, write the new list of CPUs including the +CPU to be added. To add 6 to the above cpuset:: + + # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 + +Similarly to remove a CPU from a cpuset, write the new list of CPUs +without the CPU to be removed. + +To remove all the CPUs:: + + # /bin/echo "" > cpuset.cpus -> clear cpus list + +2.3 Setting flags +----------------- + +The syntax is very simple:: + + # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' + # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' + +2.4 Attaching processes +----------------------- + +:: + + # /bin/echo PID > tasks + +Note that it is PID, not PIDs. You can only attach ONE task at a time. +If you have several tasks to attach, you have to do it one after another:: + + # /bin/echo PID1 > tasks + # /bin/echo PID2 > tasks + ... + # /bin/echo PIDn > tasks + + +3. Questions +============ + +Q: + what's up with this '/bin/echo' ? + +A: + bash's builtin 'echo' command does not check calls to write() against + errors. If you use it in the cpuset file system, you won't be + able to tell whether a command succeeded or failed. + +Q: + When I attach processes, only the first of the line gets really attached ! + +A: + We can only return one error code per call to write(). So you should also + put only ONE pid. + +4. 
Contact +========== + +Web: http://www.bullopensource.org/cpuset diff --git a/Documentation/admin-guide/cgroup-v1/devices.rst b/Documentation/admin-guide/cgroup-v1/devices.rst new file mode 100644 index 000000000000..e1886783961e --- /dev/null +++ b/Documentation/admin-guide/cgroup-v1/devices.rst @@ -0,0 +1,132 @@ +=========================== +Device Whitelist Controller +=========================== + +1. Description +============== + +Implement a cgroup to track and enforce open and mknod restrictions +on device files. A device cgroup associates a device access +whitelist with each cgroup. A whitelist entry has 4 fields. +'type' is a (all), c (char), or b (block). 'all' means it applies +to all types and all major and minor numbers. Major and minor are +either an integer or * for all. Access is a composition of r +(read), w (write), and m (mknod). + +The root device cgroup starts with rwm to 'all'. A child device +cgroup gets a copy of the parent. Administrators can then remove +devices from the whitelist or add new entries. A child cgroup can +never receive a device access which is denied by its parent. + +2. User Interface +================= + +An entry is added using devices.allow, and removed using +devices.deny. For instance:: + + echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow + +allows cgroup 1 to read and mknod the device usually known as +/dev/null. Doing:: + + echo a > /sys/fs/cgroup/1/devices.deny + +will remove the default 'a *:* rwm' entry. Doing:: + + echo a > /sys/fs/cgroup/1/devices.allow + +will add the 'a *:* rwm' entry to the whitelist. + +3. Security +=========== + +Any task can move itself between cgroups. This clearly won't +suffice, but we can decide the best way to adequately restrict +movement as people get some experience with this. We may just want +to require CAP_SYS_ADMIN, which at least is a separate bit from +CAP_MKNOD. We may want to just refuse moving to a cgroup which +isn't a descendant of the current one. Or we may want to use +CAP_MAC_ADMIN, since we really are trying to lock down root. + +CAP_SYS_ADMIN is needed to modify the whitelist or move another +task to a new cgroup. (Again we'll probably want to change that). + +A cgroup may not be granted more permissions than the cgroup's +parent has. + +4. Hierarchy +============ + +device cgroups maintain hierarchy by making sure a cgroup never has more +access permissions than its parent. Every time an entry is written to +a cgroup's devices.deny file, all its children will have that entry removed +from their whitelist and all the locally set whitelist entries will be +re-evaluated. In case one of the locally set whitelist entries would provide +more access than the cgroup's parent, it'll be removed from the whitelist. + +Example:: + + A + / \ + B + + group behavior exceptions + A allow "b 8:* rwm", "c 116:1 rw" + B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" + +If a device is denied in group A:: + + # echo "c 116:* r" > A/devices.deny + +it'll propagate down and after revalidating B's entries, the whitelist entry +"c 116:2 rwm" will be removed:: + + group whitelist entries denied devices + A all "b 8:* rwm", "c 116:* rw" + B "c 1:3 rwm", "b 3:* rwm" all the rest + +In case parent's exceptions change and local exceptions are not allowed +anymore, they'll be deleted. 
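+
+The currently active entries of a cgroup can be read back from its
+devices.list file.  For example, for a cgroup that started from a
+deny-all behavior (a sketch; the cgroup name and the output shown are
+illustrative)::
+
+    # echo a > /sys/fs/cgroup/1/devices.deny
+    # echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow
+    # cat /sys/fs/cgroup/1/devices.list
+    c 1:3 mr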
+
+Notice that new whitelist entries will not be propagated::
+
+        A
+       / \
+      B
+
+    group        whitelist entries                        denied devices
+    A            "c 1:3 rwm", "c 1:5 r"                   all the rest
+    B            "c 1:3 rwm", "c 1:5 r"                   all the rest
+
+when adding ``c *:3 rwm``::
+
+    # echo "c *:3 rwm" >A/devices.allow
+
+the result::
+
+    group        whitelist entries                        denied devices
+    A            "c *:3 rwm", "c 1:5 r"                   all the rest
+    B            "c 1:3 rwm", "c 1:5 r"                   all the rest
+
+but now it'll be possible to add new entries to B::
+
+    # echo "c 2:3 rwm" >B/devices.allow
+    # echo "c 50:3 r" >B/devices.allow
+
+or even::
+
+    # echo "c *:3 rwm" >B/devices.allow
+
+Allowing or denying all by writing 'a' to devices.allow or devices.deny will
+not be possible once the device cgroup has children.
+
+4.1 Hierarchy (internal implementation)
+---------------------------------------
+
+device cgroups is implemented internally using a behavior (ALLOW, DENY) and a
+list of exceptions.  The internal state is controlled using the same user
+interface to preserve compatibility with the previous whitelist-only
+implementation.  Removal or addition of exceptions that will reduce the access
+to devices will be propagated down the hierarchy.
+For every propagated exception, the effective rules will be re-evaluated based
+on the current parent's access rules.
diff --git a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
new file mode 100644
index 000000000000..582d3427de3f
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst
@@ -0,0 +1,127 @@
+==============
+Cgroup Freezer
+==============
+
+The cgroup freezer is useful to batch job management systems which start
+and stop sets of tasks in order to schedule the resources of a machine
+according to the desires of a system administrator. This sort of program
+is often used on HPC clusters to schedule access to the cluster as a
+whole. The cgroup freezer uses cgroups to describe the set of tasks to
+be started/stopped by the batch job management system. It also provides
+a means to start and stop the tasks composing the job.
+
+The cgroup freezer will also be useful for checkpointing running groups
+of tasks. The freezer allows the checkpoint code to obtain a consistent
+image of the tasks by attempting to force the tasks in a cgroup into a
+quiescent state. Once the tasks are quiescent, another task can
+walk /proc or invoke a kernel interface to gather information about the
+quiesced tasks. Checkpointed tasks can be restarted later should a
+recoverable error occur. This also allows the checkpointed tasks to be
+migrated between nodes in a cluster by copying the gathered information
+to another node and restarting the tasks there.
+
+Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
+and resuming tasks in userspace. Both of these signals are observable
+from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
+blocked, or ignored, it can be seen by waiting or ptracing parent tasks.
+SIGCONT is especially unsuitable since it can be caught by the task. Any
+programs designed to watch for SIGSTOP and SIGCONT could be broken by
+attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
+demonstrate this problem using nested bash shells::
+
+	$ echo $$
+	16644
+	$ bash
+	$ echo $$
+	16690
+
+	From a second, unrelated bash shell:
+	$ kill -SIGSTOP 16690
+	$ kill -SIGCONT 16690
+
+	<at this point 16690 exits and causes 16644 to exit too>
+
+This happens because bash can observe both signals and choose how it
+responds to them.
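+
+The observability of SIGCONT is easy to demonstrate directly (an
+illustrative bash session)::
+
+	$ trap 'echo SIGCONT observed' CONT
+	$ kill -CONT $$
+	SIGCONT observed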
+
+Another example of a program which catches and responds to these
+signals is gdb. In fact any program designed to use ptrace is likely to
+have a problem with this method of stopping and resuming tasks.
+
+In contrast, the cgroup freezer uses the kernel freezer code to
+prevent the freeze/unfreeze cycle from becoming visible to the tasks
+being frozen. This allows the bash example above and gdb to run as
+expected.
+
+The cgroup freezer is hierarchical. Freezing a cgroup freezes all
+tasks belonging to the cgroup and all its descendant cgroups. Each
+cgroup has its own state (self-state) and the state inherited from the
+parent (parent-state). Iff both states are THAWED, the cgroup is
+THAWED.
+
+The following cgroupfs files are created by the cgroup freezer.
+
+* freezer.state: Read-write.
+
+  When read, returns the effective state of the cgroup - "THAWED",
+  "FREEZING" or "FROZEN". This is the combined self and parent-states.
+  If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
+
+  A FREEZING cgroup transitions into the FROZEN state when all tasks
+  belonging to the cgroup and its descendants become frozen. Note that
+  a cgroup reverts to FREEZING from FROZEN after a new task is added
+  to the cgroup or one of its descendant cgroups until the new task is
+  frozen.
+
+  When written, sets the self-state of the cgroup. Two values are
+  allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
+  if not already freezing, enters the FREEZING state along with all its
+  descendant cgroups.
+
+  If THAWED is written, the self-state of the cgroup is changed to
+  THAWED.  Note that the effective state may not change to THAWED if
+  the parent-state is still freezing. If a cgroup's effective state
+  becomes THAWED, all its descendants which are freezing because of
+  the cgroup also leave the freezing state.
+
+* freezer.self_freezing: Read only.
+
+  Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
+  This value is 1 iff the last write to freezer.state was "FROZEN".
+
+* freezer.parent_freezing: Read only.
+
+  Shows the parent-state.  0 if none of the cgroup's ancestors is
+  frozen; otherwise, 1.
+
+The root cgroup is non-freezable and the above interface files don't
+exist.
+
+* Examples of usage::
+
+   # mkdir /sys/fs/cgroup/freezer
+   # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer
+   # mkdir /sys/fs/cgroup/freezer/0
+   # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks
+
+to get status of the freezer subsystem::
+
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
+   THAWED
+
+to freeze all tasks in the container::
+
+   # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
+   FREEZING
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
+   FROZEN
+
+to unfreeze all tasks in the container::
+
+   # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state
+   # cat /sys/fs/cgroup/freezer/0/freezer.state
+   THAWED
+
+This is the basic mechanism which should do the right thing for user space
+tasks in a simple scenario.
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
new file mode 100644
index 000000000000..a3902aa253a9
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -0,0 +1,50 @@
+==================
+HugeTLB Controller
+==================
+
+The HugeTLB controller allows limiting the HugeTLB usage per control group and
+enforces the controller limit during page fault. 
Since HugeTLB doesn't
+support page reclaim, enforcing the limit at page fault time implies that
+the application will get a SIGBUS signal if it tries to access HugeTLB pages
+beyond its limit. This requires the application to know beforehand how many
+HugeTLB pages it would require for its use.
+
+The HugeTLB controller can be set up by first mounting the cgroup filesystem::
+
+  # mount -t cgroup -o hugetlb none /sys/fs/cgroup
+
+With the above step, the initial or the parent HugeTLB group becomes
+visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in
+the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup.
+
+New groups can be created under the parent group /sys/fs/cgroup::
+
+  # cd /sys/fs/cgroup
+  # mkdir g1
+  # echo $$ > g1/tasks
+
+The above steps create a new group g1 and move the current shell
+process (bash) into it.
+
+Brief summary of control files::
+
+  hugetlb.<hugepagesize>.limit_in_bytes     # set/show limit of "hugepagesize" hugetlb usage
+  hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+  hugetlb.<hugepagesize>.usage_in_bytes     # show current usage for "hugepagesize" hugetlb
+  hugetlb.<hugepagesize>.failcnt            # show the number of allocation failures due to the HugeTLB limit
+
+For a system supporting three hugepage sizes (64k, 32M and 1G), the control
+files include::
+
+  hugetlb.1GB.limit_in_bytes
+  hugetlb.1GB.max_usage_in_bytes
+  hugetlb.1GB.usage_in_bytes
+  hugetlb.1GB.failcnt
+  hugetlb.64KB.limit_in_bytes
+  hugetlb.64KB.max_usage_in_bytes
+  hugetlb.64KB.usage_in_bytes
+  hugetlb.64KB.failcnt
+  hugetlb.32MB.limit_in_bytes
+  hugetlb.32MB.max_usage_in_bytes
+  hugetlb.32MB.usage_in_bytes
+  hugetlb.32MB.failcnt
diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst
new file mode 100644
index 000000000000..10bf48bae0b0
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/index.rst
@@ -0,0 +1,28 @@
+========================
+Control Groups version 1
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   cgroups
+
+   blkio-controller
+   cpuacct
+   cpusets
+   devices
+   freezer-subsystem
+   hugetlb
+   memcg_test
+   memory
+   net_cls
+   net_prio
+   pids
+   rdma
+
+.. only:: subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
new file mode 100644
index 000000000000..3f7115e07b5d
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -0,0 +1,355 @@
+=====================================================
+Memory Resource Controller(Memcg) Implementation Memo
+=====================================================
+
+Last Updated: 2010/2
+
+Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34).
+
+Because the VM is getting complex (one of the reasons is memcg...), memcg's
+behavior is complex. This is a document for memcg's internal behavior.
+Please note that implementation details can be changed.
+
+(*) Topics on API should be in Documentation/admin-guide/cgroup-v1/memory.rst
+
+0. How to record usage ?
+========================
+
+ Two objects are used.
+
+ page_cgroup ....an object per page.
+
+	Allocated at boot or memory hotplug. Freed at memory hot removal.
+
+ swap_cgroup ... an entry per swp_entry.
+
+	Allocated at swapon(). Freed at swapoff().
+
+ The page_cgroup has a USED bit, and double counting against a page_cgroup
+ never occurs. swap_cgroup is used only when a charged page is swapped-out.
+
+1. Charge
+=========
+
+ a page/swp_entry may be charged (usage += PAGE_SIZE) at
+
+	mem_cgroup_try_charge()
+
+2. Uncharge
+===========
+
+ a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by
+
+	mem_cgroup_uncharge()
+	  Called when a page's refcount goes down to 0.
+
+	mem_cgroup_uncharge_swap()
+	  Called when swp_entry's refcnt goes down to 0. A charge against swap
+	  disappears.
+
+3. charge-commit-cancel
+=======================
+
+ Memcg pages are charged in two steps:
+
+	- mem_cgroup_try_charge()
+	- mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
+
+ At try_charge(), there are no flags to say "this page is charged".
+ At this point, usage += PAGE_SIZE.
+
+ At commit(), the page is associated with the memcg.
+
+ At cancel(), simply usage -= PAGE_SIZE.
+
+In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
+
+4. Anonymous
+============
+
+ An anonymous page is newly allocated at
+	- page fault into a MAP_ANONYMOUS mapping.
+	- Copy-On-Write.
+
+ 4.1 Swap-in.
+ At swap-in, the page is taken from swap-cache. There are two cases.
+
+ (a) If the SwapCache is newly allocated and read, it has no charges.
+ (b) If the SwapCache has been mapped by processes, it has been
+     charged already.
+
+ 4.2 Swap-out.
+ At swap-out, the typical state transition is as below.
+
+ (a) add to swap cache. (marked as SwapCache)
+     swp_entry's refcnt += 1.
+ (b) fully unmapped.
+     swp_entry's refcnt += # of ptes.
+ (c) write back to swap.
+ (d) delete from swap cache. (remove from SwapCache)
+     swp_entry's refcnt -= 1.
+
+
+ Finally, at task exit,
+ (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0.
+
+5. Page Cache
+=============
+
+ Page Cache is charged at
+ - add_to_page_cache_locked().
+
+ The logic is very clear. (About migration, see below)
+
+ Note:
+   __remove_from_page_cache() is called by remove_from_page_cache()
+   and __remove_mapping().
+
+6. Shmem(tmpfs) Page Cache
+===========================
+
+ The best way to understand shmem's page state transition is to read
+ mm/shmem.c.
+
+ But a brief explanation of the behavior of memcg around shmem will be
+ helpful to understand the logic.
+
+ Shmem's page (just leaf page, not direct/indirect block) can be on
+
+ - the radix-tree of shmem's inode.
+ - SwapCache.
+ - both the radix-tree and SwapCache. This happens at swap-in
+   and swap-out.
+
+ It's charged when...
+
+ - A new page is added to shmem's radix-tree.
+ - A swapped-out page is read in. (a charge moves from swap_cgroup to
+   page_cgroup)
+
+7. Page Migration
+=================
+
+	mem_cgroup_migrate()
+
+8. LRU
+======
+ Each memcg has its own private LRU. Now, its handling is under the global
+ VM's control (meaning that it's handled under the global pgdat->lru_lock).
+ Almost all routines around memcg's LRU are called by the global LRU's
+ list management functions under pgdat->lru_lock.
+
+ A special function is mem_cgroup_isolate_pages(). This scans the
+ memcg's private LRU and calls __isolate_lru_page() to extract a page
+ from the LRU.
+
+ (By __isolate_lru_page(), the page is removed from both the global and
+ the private LRU.)
+
+
+9. Typical Tests.
+=================
+
+ Tests for racy cases.
+
+9.1 Small limit to memcg.
+-------------------------
+
+ When testing racy cases, it's good to set the memcg's limit to be very
+ small, rather than in GB. Many races were found in tests under xKB or
+ xxMB limits.
+
+ (Memory behavior under GB limits and under MB limits shows very
+ different situations.)
+
+9.2 Shmem
+---------
+
+ Historically, memcg's shmem handling was poor, and we saw a fair amount
+ of trouble here. 
This is because shmem is page cache but can also be
+ SwapCache. Testing with shmem/tmpfs is always a good test.
+
+9.3 Migration
+-------------
+
+ For NUMA, migration is another special case. cpusets are useful for
+ easy testing. The following is a sample script to do migration::
+
+	mount -t cgroup -o cpuset none /opt/cpuset
+
+	mkdir /opt/cpuset/01
+	echo 1 > /opt/cpuset/01/cpuset.cpus
+	echo 0 > /opt/cpuset/01/cpuset.mems
+	echo 1 > /opt/cpuset/01/cpuset.memory_migrate
+	mkdir /opt/cpuset/02
+	echo 1 > /opt/cpuset/02/cpuset.cpus
+	echo 1 > /opt/cpuset/02/cpuset.mems
+	echo 1 > /opt/cpuset/02/cpuset.memory_migrate
+
+ In the above setup, when you move a task from 01 to 02, page migration
+ from node 0 to node 1 will occur. The following is a script to migrate
+ all tasks under a cpuset::
+
+	--
+	move_task()
+	{
+	for pid in $1
+	do
+		/bin/echo $pid >$2/tasks 2>/dev/null
+		echo -n $pid
+		echo -n " "
+	done
+	echo END
+	}
+
+	G1_TASK=`cat ${G1}/tasks`
+	G2_TASK=`cat ${G2}/tasks`
+	move_task "${G1_TASK}" ${G2} &
+	--
+
+9.4 Memory hotplug
+------------------
+
+ The memory hotplug test is a good test.
+
+ To offline memory, do the following::
+
+	# echo offline > /sys/devices/system/memory/memoryXXX/state
+
+ (XXX is the number of the memory section)
+
+ This is an easy way to test page migration, too.
+
+9.5 mkdir/rmdir
+---------------
+
+ When using hierarchy, the mkdir/rmdir test should be done.
+ Use tests like the following::
+
+	echo 1 >/opt/cgroup/01/memory/use_hierarchy
+	mkdir /opt/cgroup/01/child_a
+	mkdir /opt/cgroup/01/child_b
+
+	set limit to 01.
+	add limit to 01/child_b
+	run jobs under child_a and child_b
+
+ Create/delete the following groups at random while jobs are running::
+
+	/opt/cgroup/01/child_a/child_aa
+	/opt/cgroup/01/child_b/child_bb
+	/opt/cgroup/01/child_c
+
+ Running new jobs in a new group is also good.
+
+9.6 Mount with other subsystems
+-------------------------------
+
+ Mounting with other subsystems is a good test because there is a
+ race and lock dependency with other cgroup subsystems.
+
+ Example::
+
+	# mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices
+
+ and do task moves, mkdir, rmdir etc. under this.
+
+9.7 swapoff
+-----------
+
+ Besides the fact that swap management is one of the complicated parts
+ of memcg, the call path of swap-in at swapoff is not the same as the
+ usual swap-in path.  It's worth testing explicitly.
+
+ For example, a test like the following is good:
+
+ (Shell-A)::
+
+	# mount -t cgroup none /cgroup -o memory
+	# mkdir /cgroup/test
+	# echo 40M > /cgroup/test/memory.limit_in_bytes
+	# echo 0 > /cgroup/test/tasks
+
+ Run a malloc(100M) program under this. You'll see 60M of swap used.
+
+ (Shell-B)::
+
+	# move all tasks in /cgroup/test to /cgroup
+	# /sbin/swapoff -a
+	# rmdir /cgroup/test
+	# kill malloc task.
+
+ Of course, the tmpfs vs. swapoff case should be tested, too.
+
+9.8 OOM-Killer
+--------------
+
+ Out-of-memory caused by a memcg's limit will kill tasks under
+ the memcg. When hierarchy is used, a task under the hierarchy
+ will be killed by the kernel.
+
+ In this case, panic_on_oom shouldn't be invoked and tasks
+ in other groups shouldn't be killed.
+
+ It's not difficult to cause OOM under memcg, as follows.
+
+	Case A) when you can use swapoff::
+
+		#swapoff -a
+		#echo 50M > /memory.limit_in_bytes
+
+	run 51M of malloc
+
+	Case B) when you use mem+swap limitation::
+
+		#echo 50M > memory.limit_in_bytes
+		#echo 50M > memory.memsw.limit_in_bytes
+
+	run 51M of malloc
+
+9.9 Move charges at task migration
+----------------------------------
+
+	Charges associated with a task can be moved along with task migration.
+
+	(Shell-A)::
+
+		#mkdir /cgroup/A
+		#echo $$ >/cgroup/A/tasks
+
+	run some programs which use some amount of memory in /cgroup/A.
+
+	(Shell-B)::
+
+		#mkdir /cgroup/B
+		#echo 1 >/cgroup/B/memory.move_charge_at_immigrate
+		#echo "pid of the program running in group A" >/cgroup/B/tasks
+
+	You can see that charges have been moved by reading ``*.usage_in_bytes``
+	or memory.stat of both A and B.
+
+	See 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for what value
+	should be written to move_charge_at_immigrate.
+
+9.10 Memory thresholds
+----------------------
+
+	The memory controller implements memory thresholds using the cgroups
+	notification API. You can use tools/cgroup/cgroup_event_listener.c to
+	test it.
+
+	(Shell-A) Create a cgroup and run the event listener::
+
+		# mkdir /cgroup/A
+		# ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
+
+	(Shell-B) Add a task to the cgroup and try to allocate and free memory::
+
+		# echo $$ >/cgroup/A/tasks
+		# a="$(dd if=/dev/zero bs=1M count=10)"
+		# a=
+
+	You will see messages from cgroup_event_listener every time you cross
+	the thresholds.
+
+	Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
+
+	It's a good idea to test the root cgroup as well.
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
new file mode 100644
index 000000000000..41bdc038dad9
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -0,0 +1,1003 @@
+==========================
+Memory Resource Controller
+==========================
+
+NOTE:
+	This document is hopelessly outdated and it asks for a complete
+	rewrite. It still contains useful information, so we are keeping it
+	here, but make sure to check the current code if you need a deeper
+	understanding.
+
+NOTE:
+	The Memory Resource Controller has generically been referred to as the
+	memory controller in this document. Do not confuse the memory controller
+	used here with the memory controller that is used in hardware.
+
+(For editors) In this document:
+	When we mention a cgroup (cgroupfs's directory) with memory controller,
+	we call it "memory cgroup". When you see git-log and source code, you'll
+	see patch titles and function names tend to use "memcg".
+	In this document, we avoid using it.
+
+Benefits and Purpose of the memory controller
+=============================================
+
+The memory controller isolates the memory behaviour of a group of tasks
+from the rest of the system. The article on LWN [12] mentions some probable
+uses of the memory controller. The memory controller can be used to
+
+a. Isolate an application or a group of applications.
+   Memory-hungry applications can be isolated and limited to a smaller
+   amount of memory.
+b. Create a cgroup with a limited amount of memory; this can be used
+   as a good alternative to booting with mem=XXXX.
+c. Virtualization solutions can control the amount of memory they want
+   to assign to a virtual machine instance.
+d. A CD/DVD burner could control the amount of memory used by the
+   rest of the system to ensure that burning does not fail due to lack
+   of available memory.
+e. 
There are several other use cases; find one or use the controller just
+   for fun (to learn and hack on the VM subsystem).
+
+Current Status: linux-2.6.34-mmotm (development version of 2010/April)
+
+Features:
+
+ - accounting of anonymous pages, file caches and swap cache usage, and
+   limiting them.
+ - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
+ - optionally, memory+swap usage can be accounted and limited.
+ - hierarchical accounting
+ - soft limit
+ - moving (recharging) the account when a task moves is selectable.
+ - usage threshold notifier
+ - memory pressure notifier
+ - oom-killer disable knob and oom-notifier
+ - Root cgroup has no limit controls.
+
+ Kernel memory support is a work in progress, and the current version provides
+ basic functionality. (See Section 2.7)
+
+Brief summary of control files.
+
+==================================== ==========================================
+ tasks                               attach a task(thread) and show list of
+                                     threads
+ cgroup.procs                        show list of processes
+ cgroup.event_control                an interface for event_fd()
+ memory.usage_in_bytes               show current usage for memory
+                                     (See 5.5 for details)
+ memory.memsw.usage_in_bytes         show current usage for memory+Swap
+                                     (See 5.5 for details)
+ memory.limit_in_bytes               set/show limit of memory usage
+ memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
+ memory.failcnt                      show the number of memory usage hits limits
+ memory.memsw.failcnt                show the number of memory+Swap hits limits
+ memory.max_usage_in_bytes           show max memory usage recorded
+ memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
+ memory.soft_limit_in_bytes          set/show soft limit of memory usage
+ memory.stat                         show various statistics
+ memory.use_hierarchy                set/show hierarchical account enabled
+ memory.force_empty                  trigger forced page reclaim
+ memory.pressure_level               set memory pressure notifications
+ memory.swappiness                   set/show swappiness parameter of vmscan
+                                     (See sysctl's vm.swappiness)
+ memory.move_charge_at_immigrate     set/show controls of moving charges
+ memory.oom_control                  set/show oom controls.
+ memory.numa_stat                    show the number of memory usage per numa
+                                     node
+
+ memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
+ memory.kmem.usage_in_bytes          show current kernel memory allocation
+ memory.kmem.failcnt                 show the number of kernel memory usage
+                                     hits limits
+ memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded
+
+ memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
+ memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
+ memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
+                                     hits limits
+ memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
+==================================== ==========================================
+
+1. History
+==========
+
+The memory controller has a long history. A request for comments for the memory
+controller was posted by Balbir Singh [1]. At the time the RFC was posted
+there were several implementations for memory control. The goal of the
+RFC was to build consensus and agreement for the minimal features required
+for memory control. The first RSS controller was posted by Balbir Singh [2]
+in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
+RSS controller. At OLS, at the resource management BoF, everyone suggested
+that we handle both page cache and RSS together. Another request was raised
+to allow user space handling of OOM. 
The current memory controller is +at version 6; it combines both mapped (RSS) and unmapped Page +Cache Control [11]. + +2. Memory Control +================= + +Memory is a unique resource in the sense that it is present in a limited +amount. If a task requires a lot of CPU processing, the task can spread +its processing over a period of hours, days, months or years, but with +memory, the same physical memory needs to be reused to accomplish the task. + +The memory controller implementation has been divided into phases. These +are: + +1. Memory controller +2. mlock(2) controller +3. Kernel user memory accounting and slab control +4. user mappings length controller + +The memory controller is the first controller developed. + +2.1. Design +----------- + +The core of the design is a counter called the page_counter. The +page_counter tracks the current memory usage and limit of the group of +processes associated with the controller. Each cgroup has a memory controller +specific data structure (mem_cgroup) associated with it. + +2.2. Accounting +--------------- + +:: + + +--------------------+ + | mem_cgroup | + | (page_counter) | + +--------------------+ + / ^ \ + / | \ + +---------------+ | +---------------+ + | mm_struct | |.... | mm_struct | + | | | | | + +---------------+ | +---------------+ + | + + --------------+ + | + +---------------+ +------+--------+ + | page +----------> page_cgroup| + | | | | + +---------------+ +---------------+ + + (Figure 1: Hierarchy of Accounting) + + +Figure 1 shows the important aspects of the controller + +1. Accounting happens per cgroup +2. Each mm_struct knows about which cgroup it belongs to +3. Each page has a pointer to the page_cgroup, which in turn knows the + cgroup it belongs to + +The accounting is done as follows: mem_cgroup_charge_common() is invoked to +set up the necessary data structures and check if the cgroup that is being +charged is over its limit. If it is, then reclaim is invoked on the cgroup. +More details can be found in the reclaim section of this document. +If everything goes well, a page meta-data-structure called page_cgroup is +updated. page_cgroup has its own LRU on cgroup. +(*) page_cgroup structure is allocated at boot/memory-hotplug time. + +2.2.1 Accounting details +------------------------ + +All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. +Some pages which are never reclaimable and will not be on the LRU +are not accounted. We just account pages under usual VM management. + +RSS pages are accounted at page_fault unless they've already been accounted +for earlier. A file page will be accounted for as Page Cache when it's +inserted into inode (radix-tree). While it's mapped into the page tables of +processes, duplicate accounting is carefully avoided. + +An RSS page is unaccounted when it's fully unmapped. A PageCache page is +unaccounted when it's removed from radix-tree. Even if RSS pages are fully +unmapped (by kswapd), they may exist as SwapCache in the system until they +are really freed. Such SwapCaches are also accounted. +A swapped-in page is not accounted until it's mapped. + +Note: The kernel does swapin-readahead and reads multiple swaps at once. +This means swapped-in pages may contain pages for other tasks than a task +causing page fault. So, we avoid accounting at swap-in I/O. + +At page migration, accounting information is kept. + +Note: we just account pages-on-LRU because our purpose is to control amount +of used pages; not-on-LRU pages tend to be out-of-control from VM view. 
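+
+To see the distinction between cache and RSS accounting in practice, one
+can compare memory.stat before and after generating page cache (a sketch;
+the cgroup path "test" and the byte values are illustrative)::
+
+    # echo $$ > /sys/fs/cgroup/memory/test/tasks
+    # dd if=/dev/zero of=/tmp/pagecache.bin bs=1M count=16
+    # grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/test/memory.stat
+    cache 16777216
+    rss 385024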
+
+2.3 Shared Page Accounting
+--------------------------
+
+Shared pages are accounted on the basis of the first touch approach. The
+cgroup that first touches a page is accounted for the page. The principle
+behind this approach is that a cgroup that aggressively uses a shared
+page will eventually get charged for it (once it is uncharged from
+the cgroup that brought it in -- this will happen on memory pressure).
+
+But see section 8.2: when moving a task to another cgroup, its pages may
+be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
+
+Exception: If CONFIG_MEMCG_SWAP is not used.
+When you do swapoff and swapped-out pages of shmem(tmpfs) are forcibly
+brought back into memory, charges for those pages are accounted against the
+caller of swapoff rather than the users of shmem.
+
+2.4 Swap Extension (CONFIG_MEMCG_SWAP)
+--------------------------------------
+
+The Swap Extension allows you to record charges for swap. A swapped-in page
+is charged back to the original page allocator if possible.
+
+When swap is accounted, the following files are added.
+
+ - memory.memsw.usage_in_bytes.
+ - memory.memsw.limit_in_bytes.
+
+memsw means memory+swap. Usage of memory+swap is limited by
+memsw.limit_in_bytes.
+
+Example: Assume a system with 4G of swap. A task which allocates 6G of memory
+(by mistake) under a 2G memory limitation will use all the swap.
+In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
+By using the memsw limit, you can avoid a system OOM which can be caused by
+swap shortage.
+
+**why 'memory+swap' rather than swap**
+
+The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
+to move the account from memory to swap...there is no change in the usage of
+memory+swap. In other words, when we want to limit the usage of swap without
+affecting the global LRU, a memory+swap limit is better than just limiting
+swap from an OS point of view.
+
+**What happens when a cgroup hits memory.memsw.limit_in_bytes**
+
+When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
+in this cgroup. Then, swap-out will not be done by the cgroup routine and
+file caches are dropped. But as mentioned above, the global LRU can swap out
+memory from it for the sanity of the system's memory management state. You
+can't forbid it by cgroup.
+
+2.5 Reclaim
+-----------
+
+Each cgroup maintains a per-cgroup LRU which has the same structure as the
+global VM's. When a cgroup goes over its limit, we first try
+to reclaim memory from the cgroup so as to make space for the new
+pages that the cgroup has touched. If the reclaim is unsuccessful,
+an OOM routine is invoked to select and kill the bulkiest task in the
+cgroup. (See 10. OOM Control below.)
+
+The reclaim algorithm has not been modified for cgroups, except that
+pages that are selected for reclaiming come from the per-cgroup LRU
+list.
+
+NOTE:
+  Reclaim does not work for the root cgroup, since we cannot set any
+  limits on the root cgroup.
+
+Note2:
+  When panic_on_oom is set to "2", the whole system will panic.
+
+When the OOM event notifier is registered, events will be delivered.
+(See the oom_control section.)
+
+2.6 Locking
+-----------
+
+   lock_page_cgroup()/unlock_page_cgroup() should not be called under
+   the i_pages lock.
+
+   The other lock ordering is as follows:
+
+   PG_locked.
+     mm->page_table_lock
+       pgdat->lru_lock
+	 lock_page_cgroup.
+
+   In many cases, just lock_page_cgroup() is called.
+
+   The per-zone-per-cgroup LRU (the cgroup's private LRU) is just guarded by
+   pgdat->lru_lock; it has no lock of its own.
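+
+The memory+swap scenario from section 2.4 translates into the interface
+files like this (a sketch; the cgroup path "test" is an assumption)::
+
+    # echo 2G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
+    # echo 3G > /sys/fs/cgroup/memory/test/memory.memsw.limit_in_bytes
+    # cat /sys/fs/cgroup/memory/test/memory.memsw.failcnt
+    0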
+
+2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
+-----------------------------------------------
+
+With the Kernel memory extension, the Memory Controller is able to limit
+the amount of kernel memory used by the system. Kernel memory is fundamentally
+different from user memory, since it can't be swapped out, which makes it
+possible to DoS the system by consuming too much of this precious resource.
+
+Kernel memory accounting is enabled for all memory cgroups by default. But
+it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
+at boot time. In this case, kernel memory will not be accounted at all.
+
+Kernel memory limits are not imposed for the root cgroup. Usage for the root
+cgroup may or may not be accounted. The memory used is accumulated into
+memory.kmem.usage_in_bytes, or in a separate counter when it makes sense.
+(currently only for tcp).
+
+The main "kmem" counter is fed into the main counter, so kmem charges will
+also be visible from the user counter.
+
+Currently no soft limit is implemented for kernel memory. It is future work
+to trigger slab reclaim when those limits are reached.
+
+2.7.1 Current Kernel Memory resources accounted
+-----------------------------------------------
+
+stack pages:
+  every process consumes some stack pages. By accounting into
+  kernel memory, we prevent new processes from being created when the kernel
+  memory usage is too high.
+
+slab pages:
+  pages allocated by the SLAB or SLUB allocator are tracked. A copy
+  of each kmem_cache is created every time the cache is touched for the first
+  time from inside the memcg. The creation is done lazily, so some objects
+  can still be skipped while the cache is being created. All objects in a
+  slab page should belong to the same memcg. This only fails to hold when a
+  task is migrated to a different memcg during the page allocation by the
+  cache.
+
+sockets memory pressure:
+  some socket protocols have memory pressure
+  thresholds. The Memory Controller allows them to be controlled individually
+  per cgroup, instead of globally.
+
+tcp memory pressure:
+  sockets memory pressure for the tcp protocol.
+
+2.7.2 Common use cases
+----------------------
+
+Because the "kmem" counter is fed to the main user counter, kernel memory can
+never be limited completely independently of user memory. Say "U" is the user
+limit, and "K" the kernel limit. There are three possible ways limits can be
+set:
+
+U != 0, K = unlimited:
+  This is the standard memcg limitation mechanism already present before kmem
+  accounting. Kernel memory is completely ignored.
+
+U != 0, K < U:
+  Kernel memory is a subset of the user memory. This setup is useful in
+  deployments where the total amount of memory per-cgroup is overcommitted.
+  Overcommitting kernel memory limits is definitely not recommended, since the
+  box can still run out of non-reclaimable memory.
+  In this case, the admin could set up K so that the sum of all groups is
+  never greater than the total memory, and freely set U at the cost of the
+  QoS.
+
+WARNING:
+  In the current implementation, memory reclaim will NOT be
+  triggered for a cgroup when it hits K while staying below U, which makes
+  this setup impractical.
+
+U != 0, K >= U:
+  Since kmem charges are also fed to the user counter, reclaim will be
+  triggered for the cgroup for both kinds of memory. This setup gives the
+  admin a unified view of memory, and it is also useful for people who just
+  want to track kernel memory usage.
+
+3. User Interface
+=================
+
+3.0. Configuration
+------------------
+
+a. Enable CONFIG_CGROUPS
+b. Enable CONFIG_MEMCG
+c. Enable CONFIG_MEMCG_SWAP (to use the swap extension)
+d. Enable CONFIG_MEMCG_KMEM (to use the kmem extension)
+
+3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
+--------------------------------------------------------------------
+
+::
+
+	# mount -t tmpfs none /sys/fs/cgroup
+	# mkdir /sys/fs/cgroup/memory
+	# mount -t cgroup none /sys/fs/cgroup/memory -o memory
+
+3.2. Make the new group and move bash into it::
+
+	# mkdir /sys/fs/cgroup/memory/0
+	# echo $$ > /sys/fs/cgroup/memory/0/tasks
+
+Now that we're in the 0 cgroup, we can alter the memory limit::
+
+	# echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
+
+NOTE:
+  We can use a suffix (k, K, m, M, g or G) to indicate values in kilo-,
+  mega- or gigabytes. (Here, kilo, mega and giga actually mean kibibytes,
+  mebibytes and gibibytes.)
+
+NOTE:
+  We can write "-1" to reset a ``*.limit_in_bytes`` file to its default
+  (unlimited).
+
+NOTE:
+  We cannot set limits on the root cgroup any more.
+
+::
+
+	# cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
+	4194304
+
+We can check the usage::
+
+	# cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
+	1216512
+
+A successful write to this file does not guarantee that the limit was set
+to the exact value written into the file. This can be due to a
+number of factors, such as rounding up to page boundaries or the total
+availability of memory on the system. The user is required to re-read
+this file after a write to get the value committed by the kernel::
+
+	# echo 1 > memory.limit_in_bytes
+	# cat memory.limit_in_bytes
+	4096
+
+The memory.failcnt field gives the number of times that the cgroup limit was
+exceeded.
+
+The memory.stat file gives accounting information. It currently shows the
+number of cache, RSS and active/inactive pages.
+
+4. Testing
+==========
+
+For testing features and implementation, see memcg_test.txt.
+
+Performance testing is also important. To see the pure overhead of the
+memory controller, testing on tmpfs gives useful numbers, since the
+controller's overhead is small there. Example: do a kernel make on tmpfs.
+
+Page-fault scalability is also important. When measuring a parallel
+page-fault test, a multi-process test may be better than a multi-thread
+test because the latter adds noise from shared objects/status.
+
+But the above two test extreme situations.
+Trying your usual workloads under the memory controller is always helpful.
+
+4.1 Troubleshooting
+-------------------
+
+Sometimes a user might find that an application under a cgroup is
+terminated by the OOM killer. There are several causes for this:
+
+1. The cgroup limit is too low (just too low to do anything useful)
+2. The user is using anonymous memory and swap is turned off or too low
+
+A ``sync`` followed by ``echo 1 > /proc/sys/vm/drop_caches`` will help get
+rid of some of the pages cached in the cgroup (page cache pages).
+
+To see what happens, disable the OOM killer as described in "10. OOM
+Control" (below) and observe the behavior.
+
+4.2 Task migration
+------------------
+
+When a task migrates from one cgroup to another, its charge is not
+carried forward by default. The pages allocated from the original cgroup
+remain charged to it; the charge is dropped when the page is freed or
+reclaimed.
+
+You can move the charges of a task along with task migration; see section
+8, "Move charges at task migration".
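+
+A hedged sketch of this default behavior (the cgroup names "a" and "b" and
+the file some-large-file are illustrative)::
+
+	# mkdir /sys/fs/cgroup/memory/a /sys/fs/cgroup/memory/b
+	# echo $$ > /sys/fs/cgroup/memory/a/tasks    # shell charges go to "a"
+	# cat some-large-file > /dev/null            # populate page cache in "a"
+	# echo $$ > /sys/fs/cgroup/memory/b/tasks    # migrate the shell to "b"
+	# cat /sys/fs/cgroup/memory/a/memory.usage_in_bytes
+
+With move_charge_at_immigrate left at its default, the usage reported for
+"a" stays non-zero after the migration: the pages touched while in "a"
+remain charged to "a" until they are freed or reclaimed.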
"Move charges at task migration" + +4.3 Removing a cgroup +--------------------- + +A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a +cgroup might have some charge associated with it, even though all +tasks have migrated away from it. (because we charge against pages, not +against tasks.) + +We move the stats to root (if use_hierarchy==0) or parent (if +use_hierarchy==1), and no change on the charge except uncharging +from the child. + +Charges recorded in swap information is not updated at removal of cgroup. +Recorded information is discarded and a cgroup which uses swap (swapcache) +will be charged as a new owner of it. + +About use_hierarchy, see Section 6. + +5. Misc. interfaces +=================== + +5.1 force_empty +--------------- + memory.force_empty interface is provided to make cgroup's memory usage empty. + When writing anything to this:: + + # echo 0 > memory.force_empty + + the cgroup will be reclaimed and as many pages reclaimed as possible. + + The typical use case for this interface is before calling rmdir(). + Though rmdir() offlines memcg, but the memcg may still stay there due to + charged file caches. Some out-of-use page caches may keep charged until + memory pressure happens. If you want to avoid that, force_empty will be useful. + + Also, note that when memory.kmem.limit_in_bytes is set the charges due to + kernel pages will still be seen. This is not considered a failure and the + write will still return success. In this case, it is expected that + memory.kmem.usage_in_bytes == memory.usage_in_bytes. + + About use_hierarchy, see Section 6. + +5.2 stat file +------------- + +memory.stat file includes following statistics + +per-memory cgroup local status +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +=============== =============================================================== +cache # of bytes of page cache memory. +rss # of bytes of anonymous and swap cache memory (includes + transparent hugepages). +rss_huge # of bytes of anonymous transparent hugepages. +mapped_file # of bytes of mapped file (includes tmpfs/shmem) +pgpgin # of charging events to the memory cgroup. The charging + event happens each time a page is accounted as either mapped + anon page(RSS) or cache page(Page Cache) to the cgroup. +pgpgout # of uncharging events to the memory cgroup. The uncharging + event happens each time a page is unaccounted from the cgroup. +swap # of bytes of swap usage +dirty # of bytes that are waiting to get written back to the disk. +writeback # of bytes of file/anon cache that are queued for syncing to + disk. +inactive_anon # of bytes of anonymous and swap cache memory on inactive + LRU list. +active_anon # of bytes of anonymous and swap cache memory on active + LRU list. +inactive_file # of bytes of file-backed memory on inactive LRU list. +active_file # of bytes of file-backed memory on active LRU list. +unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). +=============== =============================================================== + +status considering hierarchy (see memory.use_hierarchy settings) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +========================= =================================================== +hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy + under which the memory cgroup is +hierarchical_memsw_limit # of bytes of memory+swap limit with regard to + hierarchy under which memory cgroup is. 
+
+The following additional stats are dependent on CONFIG_DEBUG_VM
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+========================= ========================================
+recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
+recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
+recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
+recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
+========================= ========================================
+
+Memo:
+  recent_rotated means the recent frequency of LRU rotation.
+  recent_scanned means the recent # of scans of the LRU.
+  These are shown for easier debugging; please see the code for the exact
+  meanings.
+
+Note:
+  Only anonymous and swap cache memory is listed as part of the 'rss' stat.
+  This should not be confused with the true 'resident set size' or the
+  amount of physical memory used by the cgroup.
+
+  'rss + mapped_file' will give you the resident set size of the cgroup.
+
+  (Note: file and shmem may be shared among other cgroups. In that case,
+  mapped_file is accounted only when the memory cgroup is the owner of the
+  page cache.)
+
+5.3 swappiness
+--------------
+
+Overrides /proc/sys/vm/swappiness for the particular group. The tunable
+in the root cgroup corresponds to the global swappiness setting.
+
+Please note that, unlike during global reclaim, limit reclaim
+enforces that a swappiness of 0 really prevents any swapping, even if
+swap storage is available. This might lead to the memcg OOM killer
+if there are no file pages to reclaim.
+
+5.4 failcnt
+-----------
+
+A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
+This failcnt (== failure count) shows the number of times that a usage
+counter hit its limit. When a memory cgroup hits a limit, failcnt increases
+and memory under it will be reclaimed.
+
+You can reset failcnt by writing 0 to the failcnt file::
+
+	# echo 0 > .../memory.failcnt
+
+5.5 usage_in_bytes
+------------------
+
+For efficiency, as with other kernel components, the memory cgroup uses some
+optimization to avoid unnecessary cacheline false sharing. usage_in_bytes is
+affected by this method and doesn't show the 'exact' value of memory (and
+swap) usage; it's a fuzzy value for efficient access. (Of course, when
+necessary, it's synchronized.) If you want to know the more exact memory
+usage, you should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
+
+5.6 numa_stat
+-------------
+
+This is similar to numa_maps but operates on a per-memcg basis. This is
+useful for providing visibility into the numa locality information within
+a memcg since the pages are allowed to be allocated from any physical
+node. One of the use cases is evaluating application performance by
+combining this information with the application's CPU allocation.
+
+Each memcg's numa_stat file includes "total", "file", "anon" and
+"unevictable" per-node page counts, including "hierarchical_<counter>"
+entries which sum up all hierarchical children's values in addition to the
+memcg's own value.
+
+The output format of memory.numa_stat is::
+
+  total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
+  file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
+  anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
+  unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
+  hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...
+
+The "total" count is the sum of file + anon + unevictable.
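+
+For instance, a hedged one-liner to see how a cgroup's anonymous pages are
+spread across nodes (the path is illustrative, and the output line is what a
+hypothetical two-node machine might print)::
+
+	# grep ^anon /sys/fs/cgroup/memory/0/memory.numa_stat
+	anon=82 N0=50 N1=32
+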
+6. Hierarchy support
+====================
+
+The memory controller supports a deep hierarchy and hierarchical accounting.
+The hierarchy is created by creating the appropriate cgroups in the
+cgroup filesystem. Consider, for example, the following cgroup filesystem
+hierarchy::
+
+		root
+	       / | \
+	      /  |  \
+	     a   b   c
+		     | \
+		     |  \
+		     d   e
+
+In the diagram above, with hierarchical accounting enabled, all memory
+usage of e is accounted to its ancestors up to the root (i.e., c and root)
+that have memory.use_hierarchy enabled. If one of the ancestors goes over
+its limit, the reclaim algorithm reclaims from the tasks in the ancestor and
+the children of the ancestor.
+
+6.1 Enabling hierarchical accounting and reclaim
+------------------------------------------------
+
+A memory cgroup by default disables the hierarchy feature. Support
+can be enabled by writing 1 to the memory.use_hierarchy file of the root
+cgroup::
+
+	# echo 1 > memory.use_hierarchy
+
+The feature can be disabled by::
+
+	# echo 0 > memory.use_hierarchy
+
+NOTE1:
+  Enabling/disabling will fail if either the cgroup already has other
+  cgroups created below it, or if the parent cgroup has use_hierarchy
+  enabled.
+
+NOTE2:
+  When panic_on_oom is set to "2", the whole system will panic in
+  case of an OOM event in any cgroup.
+
+7. Soft limits
+==============
+
+Soft limits allow for greater sharing of memory. The idea behind soft limits
+is to allow control groups to use as much of the memory as needed, provided
+
+a. There is no memory contention
+b. They do not exceed their hard limit
+
+When the system detects memory contention or low memory, control groups
+are pushed back to their soft limits. If the soft limit of each control
+group is very high, they are pushed back as much as possible to make
+sure that one control group does not starve the others of memory.
+
+Please note that soft limits are a best-effort feature; they come with
+no guarantees, but the kernel does its best to make sure that when memory is
+heavily contended for, memory is allocated based on the soft limit
+hints/setup. Currently soft-limit-based reclaim is set up such that
+it gets invoked from balance_pgdat (kswapd).
+
+7.1 Interface
+-------------
+
+Soft limits can be set up by using the following commands (in this example
+we assume a soft limit of 256 MiB)::
+
+	# echo 256M > memory.soft_limit_in_bytes
+
+If we want to change this to 1G, we can at any time use::
+
+	# echo 1G > memory.soft_limit_in_bytes
+
+NOTE1:
+  Soft limits take effect over a long period of time, since they involve
+  reclaiming memory for balancing between memory cgroups.
+NOTE2:
+  It is recommended to always set the soft limit below the hard limit,
+  otherwise the hard limit will take precedence.
+
+8. Move charges at task migration
+=================================
+
+Users can move the charges associated with a task along with task migration,
+that is, uncharge the task's pages from the old cgroup and charge them to
+the new cgroup. This feature is not supported in !CONFIG_MMU environments
+because of the lack of page tables.
+
+8.1 Interface
+-------------
+
+This feature is disabled by default. It can be enabled (and disabled again)
+by writing to memory.move_charge_at_immigrate of the destination cgroup.
+
+If you want to enable it::
+
+	# echo (some positive value) > memory.move_charge_at_immigrate
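+
+The value is a bitmask; see the table in 8.2 for what each bit selects. As a
+hedged example ("dst" is an illustrative destination cgroup), moving both
+anonymous charges (bit 0) and file charges (bit 1) means writing 3::
+
+	# echo 3 > /sys/fs/cgroup/memory/dst/memory.move_charge_at_immigrate
+	# echo <pid of the target task> > /sys/fs/cgroup/memory/dst/tasks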
+
+Note:
+  Each bit of move_charge_at_immigrate has its own meaning about what type
+  of charges should be moved. See 8.2 for details.
+Note:
+  Charges are moved only when you move mm->owner, in other words,
+  a leader of a thread group.
+Note:
+  If we cannot find enough space for the task in the destination cgroup, we
+  try to make space by reclaiming memory. Task migration may fail if we
+  cannot make enough space.
+Note:
+  It can take several seconds if you move many charges.
+
+And if you want to disable it again::
+
+	# echo 0 > memory.move_charge_at_immigrate
+
+8.2 Type of charges which can be moved
+--------------------------------------
+
+Each bit in move_charge_at_immigrate has its own meaning about what type of
+charges should be moved. But in any case, it must be noted that an account of
+a page or a swap can be moved only when it is charged to the task's current
+(old) memory cgroup.
+
++---+--------------------------------------------------------------------------+
+|bit| what type of charges would be moved ?                                    |
++===+==========================================================================+
+| 0 | A charge of an anonymous page (or swap of it) used by the target task.  |
+|   | You must enable Swap Extension (see 2.4) to enable move of swap charges.|
++---+--------------------------------------------------------------------------+
+| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)|
+|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of |
+|   | anonymous pages, file pages (and swaps) in the range mmapped by the task|
+|   | will be moved even if the task hasn't done a page fault, i.e. they might|
+|   | not be the task's "RSS", but another task's "RSS" that maps the same    |
+|   | file. The mapcount of the page is ignored (the page can be moved even if|
+|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to   |
+|   | enable move of swap charges.                                            |
++---+--------------------------------------------------------------------------+
+
+8.3 TODO
+--------
+
+- All moving-charge operations are done under cgroup_mutex. It's not good
+  behavior to hold the mutex too long, so we may need some trick.
+
+9. Memory thresholds
+====================
+
+The memory cgroup implements memory thresholds using the cgroups
+notification API (see cgroups.txt). It allows registering multiple memory
+and memsw thresholds, with notifications delivered when a threshold is
+crossed.
+
+To register a threshold, an application must:
+
+- create an eventfd using eventfd(2);
+- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
+- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
+  to cgroup.event_control.
+
+The application will be notified through the eventfd when memory usage
+crosses the threshold in either direction.
+
+This is applicable to both root and non-root cgroups.
+
+10. OOM Control
+===============
+
+The memory.oom_control file is for OOM notification and other controls.
+
+The memory cgroup implements an OOM notifier using the cgroup notification
+API (see cgroups.txt). It allows registering multiple OOM notification
+deliveries, with a notification sent when an OOM happens.
+
+To register a notifier, an application must:
+
+ - create an eventfd using eventfd(2)
+ - open the memory.oom_control file
+ - write a string like "<event_fd> <fd of memory.oom_control>" to
+   cgroup.event_control
+
+The application will be notified through the eventfd when an OOM happens.
+OOM notification doesn't work for the root cgroup.
+
+You can disable the OOM-killer by writing "1" to the memory.oom_control
+file, as::
+
+	# echo 1 > memory.oom_control
+
+If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
+in the memory cgroup's OOM-waitqueue when they request accountable memory.
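+
+A hedged sketch of what that looks like (run from inside an illustrative
+cgroup directory; "memory-hog" stands in for any program that keeps
+allocating and touching memory)::
+
+	# echo 1 > memory.oom_control
+	# echo 4M > memory.limit_in_bytes
+	# echo 4M > memory.memsw.limit_in_bytes
+	# ./memory-hog &
+	# cat memory.oom_control
+	oom_kill_disable 1
+	under_oom 1
+
+Instead of being killed, the allocator simply stalls once the limit is hit,
+and under_oom reads 1 until the situation is relaxed as described next.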
+
+To let them run again, you have to relax the memory cgroup's OOM status by
+
+	* enlarging the limit or reducing usage.
+
+To reduce usage,
+
+	* kill some tasks.
+	* move some tasks to another group with charge migration.
+	* remove some files (on tmpfs?).
+
+Then, the stopped tasks will work again.
+
+At reading, the current status of OOM is shown.
+
+	- oom_kill_disable 0 or 1
+	  (if 1, the oom-killer is disabled)
+	- under_oom	   0 or 1
+	  (if 1, the memory cgroup is under OOM, and tasks may be stopped.)
+
+11. Memory Pressure
+===================
+
+The pressure level notifications can be used to monitor the memory
+allocation cost; based on the pressure, applications can implement
+different strategies for managing their memory resources. The pressure
+levels are defined as follows:
+
+The "low" level means that the system is reclaiming memory for new
+allocations. Monitoring this reclaiming activity might be useful for
+maintaining cache levels. Upon notification, the program (typically an
+"Activity Manager") might analyze vmstat and act in advance (i.e.
+prematurely shut down unimportant services).
+
+The "medium" level means that the system is experiencing medium memory
+pressure; the system might be swapping, paging out active file caches,
+etc. Upon this event, applications may decide to further analyze
+vmstat/zoneinfo/memcg or internal memory usage statistics and free any
+resources that can be easily reconstructed or re-read from a disk.
+
+The "critical" level means that the system is actively thrashing, it is
+about to run out of memory (OOM), or the in-kernel OOM killer is even on
+its way to trigger. Applications should do whatever they can to help the
+system. It might be too late to consult with vmstat or any other
+statistics, so it's advisable to take immediate action.
+
+By default, events are propagated upward until the event is handled, i.e. the
+events are not pass-through. For example, say you have three cgroups: A->B->C.
+Now you set up an event listener on cgroups A, B and C, and suppose group C
+experiences some pressure. In this situation, only group C will receive the
+notification, i.e. groups A and B will not receive it. This is done to avoid
+excessive "broadcasting" of messages, which disturbs the system and which is
+especially bad if we are low on memory or thrashing. Group B will receive
+notification only if there are no event listeners for group C.
+
+There are three optional modes that specify different propagation behavior:
+
+ - "default": this is the default behavior specified above. This mode is the
+   same as omitting the optional mode parameter, preserved for backwards
+   compatibility.
+
+ - "hierarchy": events always propagate up to the root, similar to the default
+   behavior, except that propagation continues regardless of whether there are
+   event listeners at each level. In the above example, groups A, B, and C
+   will receive notification of memory pressure.
+
+ - "local": events are pass-through, i.e. a listener only receives
+   notifications when memory pressure is experienced in the memcg for which
+   the notification is registered. In the above example, group C will receive
+   notification if registered for "local" notification and the group
+   experiences memory pressure. However, group B will never receive
+   notification, regardless of whether there is an event listener for group
+   C, if group B is registered for local notification.
"low,hierarchy" specifies +hierarchical, pass-through, notification for all ancestor memcgs. Notification +that is the default, non pass-through behavior, does not specify a mode. +"medium,local" specifies pass-through notification for the medium level. + +The file memory.pressure_level is only used to setup an eventfd. To +register a notification, an application must: + +- create an eventfd using eventfd(2); +- open memory.pressure_level; +- write string as " " + to cgroup.event_control. + +Application will be notified through eventfd when memory pressure is at +the specific level (or higher). Read/write operations to +memory.pressure_level are no implemented. + +Test: + + Here is a small script example that makes a new cgroup, sets up a + memory limit, sets up a notification in the cgroup and then makes child + cgroup experience a critical pressure:: + + # cd /sys/fs/cgroup/memory/ + # mkdir foo + # cd foo + # cgroup_event_listener memory.pressure_level low,hierarchy & + # echo 8000000 > memory.limit_in_bytes + # echo 8000000 > memory.memsw.limit_in_bytes + # echo $$ > tasks + # dd if=/dev/zero | read x + + (Expect a bunch of notifications, and eventually, the oom-killer will + trigger.) + +12. TODO +======== + +1. Make per-cgroup scanner reclaim not-shared pages first +2. Teach controller to account for shared-pages +3. Start reclamation in the background when the limit is + not yet hit but the usage is getting closer + +Summary +======= + +Overall, the memory controller has been a stable controller and has been +commented and discussed quite extensively in the community. + +References +========== + +1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ +2. Singh, Balbir. Memory Controller (RSS Control), + http://lwn.net/Articles/222762/ +3. Emelianov, Pavel. Resource controllers based on process cgroups + http://lkml.org/lkml/2007/3/6/198 +4. Emelianov, Pavel. RSS controller based on process cgroups (v2) + http://lkml.org/lkml/2007/4/9/78 +5. Emelianov, Pavel. RSS controller based on process cgroups (v3) + http://lkml.org/lkml/2007/5/30/244 +6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ +7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control + subsystem (v3), http://lwn.net/Articles/235534/ +8. Singh, Balbir. RSS controller v2 test results (lmbench), + http://lkml.org/lkml/2007/5/17/232 +9. Singh, Balbir. RSS controller v2 AIM9 results + http://lkml.org/lkml/2007/5/18/1 +10. Singh, Balbir. Memory controller v6 test results, + http://lkml.org/lkml/2007/8/19/36 +11. Singh, Balbir. Memory controller introduction (v6), + http://lkml.org/lkml/2007/8/17/69 +12. Corbet, Jonathan, Controlling memory use in cgroups, + http://lwn.net/Articles/243795/ diff --git a/Documentation/admin-guide/cgroup-v1/net_cls.rst b/Documentation/admin-guide/cgroup-v1/net_cls.rst new file mode 100644 index 000000000000..a2cf272af7a0 --- /dev/null +++ b/Documentation/admin-guide/cgroup-v1/net_cls.rst @@ -0,0 +1,44 @@ +========================= +Network classifier cgroup +========================= + +The Network classifier cgroup provides an interface to +tag network packets with a class identifier (classid). + +The Traffic Controller (tc) can be used to assign +different priorities to packets from different cgroups. +Also, Netfilter (iptables) can use this tag to perform +actions on such packets. + +Creating a net_cls cgroups instance creates a net_cls.classid file. +This net_cls.classid value is initialized to 0. 
+
+You can write hexadecimal values to net_cls.classid; the format for these
+values is 0xAAAABBBB, where AAAA is the major handle number and BBBB
+is the minor handle number.
+Reading net_cls.classid yields a decimal result.
+
+Example::
+
+	mkdir /sys/fs/cgroup/net_cls
+	mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
+	mkdir /sys/fs/cgroup/net_cls/0
+	echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
+
+- reading back the 10:1 handle (0x100001 in decimal)::
+
+	cat /sys/fs/cgroup/net_cls/0/net_cls.classid
+	1048577
+
+- configuring tc, creating traffic class 10:1::
+
+	tc qdisc add dev eth0 root handle 10: htb
+	tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
+
+- attaching a cgroup filter, so packets are classified by net_cls.classid::
+
+	tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
+
+configuring iptables, basic example::
+
+	iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP
diff --git a/Documentation/admin-guide/cgroup-v1/net_prio.rst b/Documentation/admin-guide/cgroup-v1/net_prio.rst
new file mode 100644
index 000000000000..b40905871c64
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/net_prio.rst
@@ -0,0 +1,57 @@
+=======================
+Network priority cgroup
+=======================
+
+The Network priority cgroup provides an interface to allow an administrator
+to dynamically set the priority of network traffic generated by various
+applications.
+
+Nominally, an application would set the priority of its traffic via the
+SO_PRIORITY socket option. This, however, is not always possible because:
+
+1) The application may not have been coded to set this value
+2) The priority of application traffic is often a site-specific
+   administrative decision rather than an application-defined one.
+
+This cgroup allows an administrator to assign a process to a group which
+defines the priority of egress traffic on a given interface. Network
+priority groups can be created by first mounting the cgroup filesystem::
+
+	# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
+
+With the above step, the initial group acting as the parent accounting group
+becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks
+in the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this
+cgroup.
+
+Each net_prio cgroup contains two files that are subsystem specific:
+
+net_prio.prioidx
+  This file is read-only, and is simply informative. It contains a unique
+  integer value that the kernel uses as an internal representation of this
+  cgroup.
+
+net_prio.ifpriomap
+  This file contains a map of the priorities assigned to traffic originating
+  from processes in this group and egressing the system on various
+  interfaces. It contains a list of tuples in the form <ifname priority>.
+  Contents of this file can be modified by echoing a string into the file
+  using the same tuple format. For example::
+
+	echo "eth0 5" > /sys/fs/cgroup/net_prio/iscsi/net_prio.ifpriomap
+
+This command would force any traffic originating from processes belonging to
+the iscsi net_prio cgroup and egressing on interface eth0 to have the
+priority of said traffic set to the value 5. The parent accounting group
+also has a writeable 'net_prio.ifpriomap' file that can be used to set a
+system default priority.
+
+Priorities are set immediately prior to queueing a frame to the device
+queueing discipline (qdisc), so priorities are assigned before the hardware
+queue selection is made.
+
+One usage for the net_prio cgroup is with the mqprio qdisc, allowing
+application traffic to be steered to hardware/driver based traffic classes.
+These mappings can then be managed by administrators or other networking
+protocols such as DCBX.
+
+A new net_prio cgroup inherits the parent's configuration.
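+
+A hedged end-to-end sketch (the interface name, cgroup name and priority
+value are all illustrative)::
+
+	# mkdir /sys/fs/cgroup/net_prio/web
+	# echo "eth0 3" > /sys/fs/cgroup/net_prio/web/net_prio.ifpriomap
+	# echo $$ > /sys/fs/cgroup/net_prio/web/tasks
+	# cat /sys/fs/cgroup/net_prio/web/net_prio.ifpriomap
+	eth0 3
+
+From this point on, egress traffic of the shell (and its children) leaving
+via eth0 is queued with priority 3.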
diff --git a/Documentation/admin-guide/cgroup-v1/pids.rst b/Documentation/admin-guide/cgroup-v1/pids.rst
new file mode 100644
index 000000000000..6acebd9e72c8
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/pids.rst
@@ -0,0 +1,92 @@
+=========================
+Process Number Controller
+=========================
+
+Abstract
+--------
+
+The process number controller is used to allow a cgroup hierarchy to stop
+any new tasks from being fork()'d or clone()'d after a certain limit is
+reached.
+
+Since it is trivial to hit the task limit without hitting any kmemcg limits
+in place, PIDs are a fundamental resource. As such, PID exhaustion must be
+preventable in the scope of a cgroup hierarchy by allowing resource limiting
+of the number of tasks in a cgroup.
+
+Usage
+-----
+
+In order to use the `pids` controller, set the maximum number of tasks in
+pids.max (this is not available in the root cgroup for obvious reasons). The
+number of processes currently in the cgroup is given by pids.current.
+
+Organisational operations are not blocked by cgroup policies, so it is
+possible to have pids.current > pids.max. This can be done by either setting
+the limit to be smaller than pids.current, or attaching enough processes to
+the cgroup such that pids.current > pids.max. However, it is not possible to
+violate a cgroup policy through fork() or clone(). fork() and clone() will
+return -EAGAIN if the creation of a new process would cause a cgroup policy
+to be violated.
+
+To set a cgroup to have no limit, set pids.max to "max". This is the default
+for all new cgroups (N.B. that PID limits are hierarchical, so the most
+stringent limit in the hierarchy is followed).
+
+pids.current tracks all child cgroup hierarchies, so parent/pids.current is
+a superset of parent/child/pids.current.
+
+The pids.events file contains event counters:
+
+  - max: Number of times fork failed because the limit was hit.
+
+Example
+-------
+
+First, we mount the pids controller::
+
+	# mkdir -p /sys/fs/cgroup/pids
+	# mount -t cgroup -o pids none /sys/fs/cgroup/pids
+
+Then we create a hierarchy, set limits and attach processes to it::
+
+	# mkdir -p /sys/fs/cgroup/pids/parent/child
+	# echo 2 > /sys/fs/cgroup/pids/parent/pids.max
+	# echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs
+	# cat /sys/fs/cgroup/pids/parent/pids.current
+	2
+	#
+
+It should be noted that attempts to overcome the set limit (2 in this case)
+will fail::
+
+	# cat /sys/fs/cgroup/pids/parent/pids.current
+	2
+	# ( /bin/echo "Here's some processes for you." | cat )
+	sh: fork: Resource temporarily unavailable
+	#
+
+Even if we migrate to a child cgroup (which doesn't have a set limit), we
+will not be able to overcome the most stringent limit in the hierarchy (in
+this case, the parent's)::
+
+	# echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs
+	# cat /sys/fs/cgroup/pids/parent/pids.current
+	2
+	# cat /sys/fs/cgroup/pids/parent/child/pids.current
+	2
+	# cat /sys/fs/cgroup/pids/parent/child/pids.max
+	max
+	# ( /bin/echo "Here's some processes for you." | cat )
+	sh: fork: Resource temporarily unavailable
+	#
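+
+A hedged way to confirm how many forks were refused (the count shown is
+illustrative)::
+
+	# cat /sys/fs/cgroup/pids/parent/pids.events
+	max 2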
+
+We can set a limit that is smaller than pids.current, which will stop any
+new processes from being forked at all (note that the shell itself counts
+towards pids.current)::
+
+	# echo 1 > /sys/fs/cgroup/pids/parent/pids.max
+	# /bin/echo "We can't even spawn a single process now."
+	sh: fork: Resource temporarily unavailable
+	# echo 0 > /sys/fs/cgroup/pids/parent/pids.max
+	# /bin/echo "We can't even spawn a single process now."
+	sh: fork: Resource temporarily unavailable
+	#
diff --git a/Documentation/admin-guide/cgroup-v1/rdma.rst b/Documentation/admin-guide/cgroup-v1/rdma.rst
new file mode 100644
index 000000000000..2fcb0a9bf790
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/rdma.rst
@@ -0,0 +1,117 @@
+===============
+RDMA Controller
+===============
+
+.. Contents
+
+   1. Overview
+     1-1. What is the RDMA controller?
+     1-2. Why is the RDMA controller needed?
+     1-3. How is the RDMA controller implemented?
+   2. Usage Examples
+
+1. Overview
+===========
+
+1-1. What is the RDMA controller?
+----------------------------------
+
+The RDMA controller allows users to limit the RDMA/IB-specific resources
+that a given set of processes can use. These processes are grouped using
+the RDMA controller.
+
+The RDMA controller defines two resources which can be limited for the
+processes of a cgroup.
+
+1-2. Why is the RDMA controller needed?
+----------------------------------------
+
+Currently, user space applications can easily take away all the rdma-verb-
+specific resources such as AH, CQ, QP, MR, etc., so that other applications
+in other cgroups, or kernel space ULPs, may not even get a chance to
+allocate any rdma resources. This can lead to service unavailability.
+
+Therefore, an RDMA controller is needed, through which the resource
+consumption of processes can be limited. Through this controller, different
+rdma resources can be accounted.
+
+1-3. How is the RDMA controller implemented?
+---------------------------------------------
+
+The RDMA cgroup allows limit configuration of resources. The rdma cgroup
+maintains resource accounting per cgroup, per device, using a resource pool
+structure. Each such resource pool is limited to 64 resources by the rdma
+cgroup, which can be extended later if required.
+
+This resource pool object is linked to the cgroup css. Typically there
+are 0 to 4 resource pool instances per cgroup, per device in most use cases,
+but nothing prevents having more. At present, hundreds of RDMA devices per
+single cgroup may not be handled optimally; however, there is no
+known use case or requirement for such a configuration either.
+
+Since RDMA resources can be allocated by any process and can be freed by any
+of the child processes which share the address space, rdma resources are
+always owned by the creator cgroup css. This allows process migration from
+one cgroup to another without the major complexity of transferring resource
+ownership, because such ownership is not really present, due to the shared
+nature of rdma resources. Linking resources to the css also ensures that
+cgroups can be deleted after processes have migrated. This allows migration
+with active resources as well, even though that is not a primary use case.
+
+Whenever RDMA resource charging occurs, the owner rdma cgroup is returned to
+the caller. The same rdma cgroup should be passed while uncharging the
+resource. This also allows a process migrated with an active RDMA resource
+to charge new resources to the new owner cgroup.
+It also allows uncharging a resource of a process from the previously
+charged cgroup after the process has migrated to a new cgroup, even though
+that is not a primary use case.
+
+A resource pool object is created in the following situations:
+
+(a) The user sets a limit and no previous resource pool exists for the
+    device of interest for the cgroup.
+(b) No resource limits were configured, but the IB/RDMA stack tries to
+    charge the resource; the pool is created so that resources can be
+    correctly uncharged when applications that started without limits have
+    limits enforced later, otherwise the usage count would drop to negative
+    during uncharging.
+
+The resource pool is destroyed if all the resource limits are set to max and
+it is the last resource getting deallocated.
+
+Users should set all the limits to the max value if they intend to
+remove/unconfigure the resource pool for a particular device.
+
+The IB stack honors limits enforced by the rdma controller. When an
+application queries the maximum resource limits of an IB device, it gets the
+minimum of what is configured by the user for the given cgroup and what is
+supported by the IB device.
+
+The following resources can be accounted by the rdma controller:
+
+  ==========    =============================
+  hca_handle    Maximum number of HCA Handles
+  hca_object    Maximum number of HCA Objects
+  ==========    =============================
+
+2. Usage Examples
+=================
+
+(a) Configure resource limit::
+
+	echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
+	echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
+
+(b) Query resource limit::
+
+	cat /sys/fs/cgroup/rdma/2/rdma.max
+	#Output:
+	mlx4_0 hca_handle=2 hca_object=2000
+	ocrdma1 hca_handle=3 hca_object=max
+
+(c) Query current usage::
+
+	cat /sys/fs/cgroup/rdma/2/rdma.current
+	#Output:
+	mlx4_0 hca_handle=1 hca_object=20
+	ocrdma1 hca_handle=1 hca_object=23
+
+(d) Delete resource limit::
+
+	echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 080b18ce2a5d..ed4c5977d6e1 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -9,7 +9,7 @@ This is the authoritative documentation on the design, interface and
 conventions of cgroup v2. It describes all userland-visible aspects
 of cgroup including core and specific controller behaviors. All
 future changes must be reflected in this document. Documentation for
-v1 is available under Documentation/cgroup-v1/.
+v1 is available under Documentation/admin-guide/cgroup-v1/.
 
 .. CONTENTS
 
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1f0d9b939311..a5fdb1a846ce 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -59,6 +59,7 @@ configure specific aspects of kernel behavior to your liking.
 
    initrd
    cgroup-v2
+   cgroup-v1/index
    serial-console
    braille-console
    parport
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 78576aa45cce..a571a67e0c85 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4089,7 +4089,7 @@
 	relax_domain_level=
 			[KNL, SMP] Set scheduler's default relax_domain_level.
-			See Documentation/cgroup-v1/cpusets.rst.
+			See Documentation/admin-guide/cgroup-v1/cpusets.rst.
 
 	reserve=	[KNL,BUGS] Force kernel to ignore I/O ports or memory
 			Format: <base1>,<size1>[,<base2>,<size2>,...]
@@ -4599,7 +4599,7 @@ swapaccount=[0|1] [KNL] Enable accounting of swap in memory resource controller if no parameter or 1 is given or disable - it if 0 is given (See Documentation/cgroup-v1/memory.rst) + it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst) swiotlb= [ARM,IA-64,PPC,MIPS,X86] Format: { | force | noforce } diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index 546f174e5d6a..8463f5538fda 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -15,7 +15,7 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy support. Memory policies should not be confused with cpusets -(``Documentation/cgroup-v1/cpusets.rst``) +(``Documentation/admin-guide/cgroup-v1/cpusets.rst``) which is an administrative mechanism for restricting the nodes from which memory may be allocated by a set of processes. Memory policies are a programming interface that a NUMA-aware application can take advantage of. When diff --git a/Documentation/block/bfq-iosched.rst b/Documentation/block/bfq-iosched.rst index 2c13b2fc1888..0d237d402860 100644 --- a/Documentation/block/bfq-iosched.rst +++ b/Documentation/block/bfq-iosched.rst @@ -547,7 +547,7 @@ As for cgroups-v1 (blkio controller), the exact set of stat files created, and kept up-to-date by bfq, depends on whether CONFIG_BFQ_CGROUP_DEBUG is set. If it is set, then bfq creates all the stat files documented in -Documentation/cgroup-v1/blkio-controller.rst. If, instead, +Documentation/admin-guide/cgroup-v1/blkio-controller.rst. If, instead, CONFIG_BFQ_CGROUP_DEBUG is not set, then bfq creates only the files:: blkio.bfq.io_service_bytes diff --git a/Documentation/cgroup-v1/blkio-controller.rst b/Documentation/cgroup-v1/blkio-controller.rst deleted file mode 100644 index 1d7d962933be..000000000000 --- a/Documentation/cgroup-v1/blkio-controller.rst +++ /dev/null @@ -1,302 +0,0 @@ -=================== -Block IO Controller -=================== - -Overview -======== -cgroup subsys "blkio" implements the block io controller. There seems to be -a need of various kinds of IO control policies (like proportional BW, max BW) -both at leaf nodes as well as at intermediate nodes in a storage hierarchy. -Plan is to use the same cgroup based management interface for blkio controller -and based on user options switch IO policies in the background. - -One IO control policy is throttling policy which can be used to -specify upper IO rate limits on devices. This policy is implemented in -generic block layer and can be used on leaf nodes as well as higher -level logical devices like device mapper. - -HOWTO -===== -Throttling/Upper Limit policy ------------------------------ -- Enable Block IO controller:: - - CONFIG_BLK_CGROUP=y - -- Enable throttling in block layer:: - - CONFIG_BLK_DEV_THROTTLING=y - -- Mount blkio controller (see cgroups.txt, Why are cgroups needed?):: - - mount -t cgroup -o blkio none /sys/fs/cgroup/blkio - -- Specify a bandwidth rate on particular device for root group. The format - for policy is ": ":: - - echo "8:16 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device - - Above will put a limit of 1MB/second on reads happening for root group - on device having major/minor number 8:16. 
- -- Run dd to read a file and see if rate is throttled to 1MB/s or not:: - - # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024 - 1024+0 records in - 1024+0 records out - 4194304 bytes (4.2 MB) copied, 4.0001 s, 1.0 MB/s - - Limits for writes can be put using blkio.throttle.write_bps_device file. - -Hierarchical Cgroups -==================== - -Throttling implements hierarchy support; however, -throttling's hierarchy support is enabled iff "sane_behavior" is -enabled from cgroup side, which currently is a development option and -not publicly available. - -If somebody created a hierarchy like as follows:: - - root - / \ - test1 test2 - | - test3 - -Throttling with "sane_behavior" will handle the -hierarchy correctly. For throttling, all limits apply -to the whole subtree while all statistics are local to the IOs -directly generated by tasks in that cgroup. - -Throttling without "sane_behavior" enabled from cgroup side will -practically treat all groups at same level as if it looks like the -following:: - - pivot - / / \ \ - root test1 test2 test3 - -Various user visible config options -=================================== -CONFIG_BLK_CGROUP - - Block IO controller. - -CONFIG_BFQ_CGROUP_DEBUG - - Debug help. Right now some additional stats file show up in cgroup - if this option is enabled. - -CONFIG_BLK_DEV_THROTTLING - - Enable block device throttling support in block layer. - -Details of cgroup files -======================= -Proportional weight policy files --------------------------------- -- blkio.weight - - Specifies per cgroup weight. This is default weight of the group - on all the devices until and unless overridden by per device rule. - (See blkio.weight_device). - Currently allowed range of weights is from 10 to 1000. - -- blkio.weight_device - - One can specify per cgroup per device rules using this interface. - These rules override the default value of group weight as specified - by blkio.weight. - - Following is the format:: - - # echo dev_maj:dev_minor weight > blkio.weight_device - - Configure weight=300 on /dev/sdb (8:16) in this cgroup:: - - # echo 8:16 300 > blkio.weight_device - # cat blkio.weight_device - dev weight - 8:16 300 - - Configure weight=500 on /dev/sda (8:0) in this cgroup:: - - # echo 8:0 500 > blkio.weight_device - # cat blkio.weight_device - dev weight - 8:0 500 - 8:16 300 - - Remove specific weight for /dev/sda in this cgroup:: - - # echo 8:0 0 > blkio.weight_device - # cat blkio.weight_device - dev weight - 8:16 300 - -- blkio.leaf_weight[_device] - - Equivalents of blkio.weight[_device] for the purpose of - deciding how much weight tasks in the given cgroup has while - competing with the cgroup's child cgroups. For details, - please refer to Documentation/block/cfq-iosched.txt. - -- blkio.time - - disk time allocated to cgroup per device in milliseconds. First - two fields specify the major and minor number of the device and - third field specifies the disk time allocated to group in - milliseconds. - -- blkio.sectors - - number of sectors transferred to/from disk by the group. First - two fields specify the major and minor number of the device and - third field specifies the number of sectors transferred by the - group to/from the device. - -- blkio.io_service_bytes - - Number of bytes transferred to/from the disk by the group. These - are further divided by the type of operation - read or write, sync - or async. 
First two fields specify the major and minor number of the - device, third field specifies the operation type and the fourth field - specifies the number of bytes. - -- blkio.io_serviced - - Number of IOs (bio) issued to the disk by the group. These - are further divided by the type of operation - read or write, sync - or async. First two fields specify the major and minor number of the - device, third field specifies the operation type and the fourth field - specifies the number of IOs. - -- blkio.io_service_time - - Total amount of time between request dispatch and request completion - for the IOs done by this cgroup. This is in nanoseconds to make it - meaningful for flash devices too. For devices with queue depth of 1, - this time represents the actual service time. When queue_depth > 1, - that is no longer true as requests may be served out of order. This - may cause the service time for a given IO to include the service time - of multiple IOs when served out of order which may result in total - io_service_time > actual time elapsed. This time is further divided by - the type of operation - read or write, sync or async. First two fields - specify the major and minor number of the device, third field - specifies the operation type and the fourth field specifies the - io_service_time in ns. - -- blkio.io_wait_time - - Total amount of time the IOs for this cgroup spent waiting in the - scheduler queues for service. This can be greater than the total time - elapsed since it is cumulative io_wait_time for all IOs. It is not a - measure of total time the cgroup spent waiting but rather a measure of - the wait_time for its individual IOs. For devices with queue_depth > 1 - this metric does not include the time spent waiting for service once - the IO is dispatched to the device but till it actually gets serviced - (there might be a time lag here due to re-ordering of requests by the - device). This is in nanoseconds to make it meaningful for flash - devices too. This time is further divided by the type of operation - - read or write, sync or async. First two fields specify the major and - minor number of the device, third field specifies the operation type - and the fourth field specifies the io_wait_time in ns. - -- blkio.io_merged - - Total number of bios/requests merged into requests belonging to this - cgroup. This is further divided by the type of operation - read or - write, sync or async. - -- blkio.io_queued - - Total number of requests queued up at any given instant for this - cgroup. This is further divided by the type of operation - read or - write, sync or async. - -- blkio.avg_queue_size - - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. - The average queue size for this cgroup over the entire time of this - cgroup's existence. Queue size samples are taken each time one of the - queues of this cgroup gets a timeslice. - -- blkio.group_wait_time - - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. - This is the amount of time the cgroup had to wait since it became busy - (i.e., went from 0 to 1 request queued) to get a timeslice for one of - its queues. This is different from the io_wait_time which is the - cumulative total of the amount of time spent by each IO in that cgroup - waiting in the scheduler queue. This is in nanoseconds. If this is - read when the cgroup is in a waiting (for timeslice) state, the stat - will only report the group_wait_time accumulated till the last time it - got a timeslice and will not include the current delta. 
- -- blkio.empty_time - - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. - This is the amount of time a cgroup spends without any pending - requests when not being served, i.e., it does not include any time - spent idling for one of the queues of the cgroup. This is in - nanoseconds. If this is read when the cgroup is in an empty state, - the stat will only report the empty_time accumulated till the last - time it had a pending request and will not include the current delta. - -- blkio.idle_time - - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. - This is the amount of time spent by the IO scheduler idling for a - given cgroup in anticipation of a better request than the existing ones - from other queues/cgroups. This is in nanoseconds. If this is read - when the cgroup is in an idling state, the stat will only report the - idle_time accumulated till the last idle period and will not include - the current delta. - -- blkio.dequeue - - Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This - gives the statistics about how many a times a group was dequeued - from service tree of the device. First two fields specify the major - and minor number of the device and third field specifies the number - of times a group was dequeued from a particular device. - -- blkio.*_recursive - - Recursive version of various stats. These files show the - same information as their non-recursive counterparts but - include stats from all the descendant cgroups. - -Throttling/Upper limit policy files ------------------------------------ -- blkio.throttle.read_bps_device - - Specifies upper limit on READ rate from the device. IO rate is - specified in bytes per second. Rules are per device. Following is - the format:: - - echo ": " > /cgrp/blkio.throttle.read_bps_device - -- blkio.throttle.write_bps_device - - Specifies upper limit on WRITE rate to the device. IO rate is - specified in bytes per second. Rules are per device. Following is - the format:: - - echo ": " > /cgrp/blkio.throttle.write_bps_device - -- blkio.throttle.read_iops_device - - Specifies upper limit on READ rate from the device. IO rate is - specified in IO per second. Rules are per device. Following is - the format:: - - echo ": " > /cgrp/blkio.throttle.read_iops_device - -- blkio.throttle.write_iops_device - - Specifies upper limit on WRITE rate to the device. IO rate is - specified in io per second. Rules are per device. Following is - the format:: - - echo ": " > /cgrp/blkio.throttle.write_iops_device - -Note: If both BW and IOPS rules are specified for a device, then IO is - subjected to both the constraints. - -- blkio.throttle.io_serviced - - Number of IOs (bio) issued to the disk by the group. These - are further divided by the type of operation - read or write, sync - or async. First two fields specify the major and minor number of the - device, third field specifies the operation type and the fourth field - specifies the number of IOs. - -- blkio.throttle.io_service_bytes - - Number of bytes transferred to/from the disk by the group. These - are further divided by the type of operation - read or write, sync - or async. First two fields specify the major and minor number of the - device, third field specifies the operation type and the fourth field - specifies the number of bytes. - -Common files among various policies ------------------------------------ -- blkio.reset_stats - - Writing an int to this file will result in resetting all the stats - for that cgroup. 
diff --git a/Documentation/cgroup-v1/cgroups.rst b/Documentation/cgroup-v1/cgroups.rst deleted file mode 100644 index 46bbe7e022d4..000000000000 --- a/Documentation/cgroup-v1/cgroups.rst +++ /dev/null @@ -1,695 +0,0 @@ -============== -Control Groups -============== - -Written by Paul Menage based on -Documentation/cgroup-v1/cpusets.rst - -Original copyright statements from cpusets.txt: - -Portions Copyright (C) 2004 BULL SA. - -Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. - -Modified by Paul Jackson - -Modified by Christoph Lameter - -.. CONTENTS: - - 1. Control Groups - 1.1 What are cgroups ? - 1.2 Why are cgroups needed ? - 1.3 How are cgroups implemented ? - 1.4 What does notify_on_release do ? - 1.5 What does clone_children do ? - 1.6 How do I use cgroups ? - 2. Usage Examples and Syntax - 2.1 Basic Usage - 2.2 Attaching processes - 2.3 Mounting hierarchies by name - 3. Kernel API - 3.1 Overview - 3.2 Synchronization - 3.3 Subsystem API - 4. Extended attributes usage - 5. Questions - -1. Control Groups -================= - -1.1 What are cgroups ? ----------------------- - -Control Groups provide a mechanism for aggregating/partitioning sets of -tasks, and all their future children, into hierarchical groups with -specialized behaviour. - -Definitions: - -A *cgroup* associates a set of tasks with a set of parameters for one -or more subsystems. - -A *subsystem* is a module that makes use of the task grouping -facilities provided by cgroups to treat groups of tasks in -particular ways. A subsystem is typically a "resource controller" that -schedules a resource or applies per-cgroup limits, but it may be -anything that wants to act on a group of processes, e.g. a -virtualization subsystem. - -A *hierarchy* is a set of cgroups arranged in a tree, such that -every task in the system is in exactly one of the cgroups in the -hierarchy, and a set of subsystems; each subsystem has system-specific -state attached to each cgroup in the hierarchy. Each hierarchy has -an instance of the cgroup virtual filesystem associated with it. - -At any one time there may be multiple active hierarchies of task -cgroups. Each hierarchy is a partition of all tasks in the system. - -User-level code may create and destroy cgroups by name in an -instance of the cgroup virtual file system, specify and query to -which cgroup a task is assigned, and list the task PIDs assigned to -a cgroup. Those creations and assignments only affect the hierarchy -associated with that instance of the cgroup file system. - -On their own, the only use for cgroups is for simple job -tracking. The intention is that other subsystems hook into the generic -cgroup support to provide new attributes for cgroups, such as -accounting/limiting the resources which processes in a cgroup can -access. For example, cpusets (see Documentation/cgroup-v1/cpusets.rst) allow -you to associate a set of CPUs and a set of memory nodes with the -tasks in each cgroup. - -1.2 Why are cgroups needed ? ----------------------------- - -There are multiple efforts to provide process aggregations in the -Linux kernel, mainly for resource-tracking purposes. Such efforts -include cpusets, CKRM/ResGroups, UserBeanCounters, and virtual server -namespaces. These all require the basic notion of a -grouping/partitioning of processes, with newly forked processes ending -up in the same group (cgroup) as their parent process. - -The kernel cgroup patch provides the minimum essential kernel -mechanisms required to efficiently implement such groups. 
It has -minimal impact on the system fast paths, and provides hooks for -specific subsystems such as cpusets to provide additional behaviour as -desired. - -Multiple hierarchy support is provided to allow for situations where -the division of tasks into cgroups is distinctly different for -different subsystems - having parallel hierarchies allows each -hierarchy to be a natural division of tasks, without having to handle -complex combinations of tasks that would be present if several -unrelated subsystems needed to be forced into the same tree of -cgroups. - -At one extreme, each resource controller or subsystem could be in a -separate hierarchy; at the other extreme, all subsystems -would be attached to the same hierarchy. - -As an example of a scenario (originally proposed by vatsa@in.ibm.com) -that can benefit from multiple hierarchies, consider a large -university server with various users - students, professors, system -tasks etc. The resource planning for this server could be along the -following lines:: - - CPU : "Top cpuset" - / \ - CPUSet1 CPUSet2 - | | - (Professors) (Students) - - In addition (system tasks) are attached to topcpuset (so - that they can run anywhere) with a limit of 20% - - Memory : Professors (50%), Students (30%), system (20%) - - Disk : Professors (50%), Students (30%), system (20%) - - Network : WWW browsing (20%), Network File System (60%), others (20%) - / \ - Professors (15%) students (5%) - -Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd goes -into the NFS network class. - -At the same time Firefox/Lynx will share an appropriate CPU/Memory class -depending on who launched it (prof/student). - -With the ability to classify tasks differently for different resources -(by putting those resource subsystems in different hierarchies), -the admin can easily set up a script which receives exec notifications -and depending on who is launching the browser he can:: - - # echo browser_pid > /sys/fs/cgroup///tasks - -With only a single hierarchy, he now would potentially have to create -a separate cgroup for every browser launched and associate it with -appropriate network and other resource class. This may lead to -proliferation of such cgroups. - -Also let's say that the administrator would like to give enhanced network -access temporarily to a student's browser (since it is night and the user -wants to do online gaming :)) OR give one of the student's simulation -apps enhanced CPU power. - -With ability to write PIDs directly to resource classes, it's just a -matter of:: - - # echo pid > /sys/fs/cgroup/network//tasks - (after some time) - # echo pid > /sys/fs/cgroup/network//tasks - -Without this ability, the administrator would have to split the cgroup into -multiple separate ones and then associate the new cgroups with the -new resource classes. - - - -1.3 How are cgroups implemented ? ---------------------------------- - -Control Groups extends the kernel as follows: - - - Each task in the system has a reference-counted pointer to a - css_set. - - - A css_set contains a set of reference-counted pointers to - cgroup_subsys_state objects, one for each cgroup subsystem - registered in the system. There is no direct link from a task to - the cgroup of which it's a member in each hierarchy, but this - can be determined by following pointers through the - cgroup_subsys_state objects. 
This is because accessing the
-   subsystem state is something that's expected to happen frequently
-   and in performance-critical code, whereas operations that require a
-   task's actual cgroup assignments (in particular, moving between
-   cgroups) are less common. A linked list runs through the cg_list
-   field of each task_struct using the css_set, anchored at
-   css_set->tasks.
-
- - A cgroup hierarchy filesystem can be mounted for browsing and
-   manipulation from user space.
-
- - You can list all the tasks (by PID) attached to any cgroup.
-
-The implementation of cgroups requires a few simple hooks
-into the rest of the kernel, none in performance-critical paths:
-
- - in init/main.c, to initialize the root cgroups and initial
-   css_set at system boot.
-
- - in fork and exit, to attach and detach a task from its css_set.
-
-In addition, a new file system of type "cgroup" may be mounted, to
-enable browsing and modifying the cgroups presently known to the
-kernel. When mounting a cgroup hierarchy, you may specify a
-comma-separated list of subsystems to mount as the filesystem mount
-options. By default, mounting the cgroup filesystem attempts to
-mount a hierarchy containing all registered subsystems.
-
-If an active hierarchy with exactly the same set of subsystems already
-exists, it will be reused for the new mount. If no existing hierarchy
-matches, and any of the requested subsystems are in use in an existing
-hierarchy, the mount will fail with -EBUSY. Otherwise, a new hierarchy
-is activated, associated with the requested subsystems.
-
-It's not currently possible to bind a new subsystem to an active
-cgroup hierarchy, or to unbind a subsystem from an active cgroup
-hierarchy. This may be possible in the future, but is fraught with nasty
-error-recovery issues.
-
-When a cgroup filesystem is unmounted, if there are any
-child cgroups created below the top-level cgroup, that hierarchy
-will remain active even though unmounted; if there are no
-child cgroups then the hierarchy will be deactivated.
-
-No new system calls are added for cgroups - all support for
-querying and modifying cgroups is via this cgroup file system.
-
-Each task under /proc has an added file named 'cgroup' displaying,
-for each active hierarchy, the subsystem names and the cgroup name
-as the path relative to the root of the cgroup file system.
-
-Each cgroup is represented by a directory in the cgroup file system
-containing the following files describing that cgroup:
-
- - tasks: list of tasks (by PID) attached to that cgroup. This list
-   is not guaranteed to be sorted. Writing a thread ID into this file
-   moves the thread into this cgroup.
- - cgroup.procs: list of thread group IDs in the cgroup. This list is
-   not guaranteed to be sorted or free of duplicate TGIDs, and userspace
-   should sort/uniquify the list if this property is required.
-   Writing a thread group ID into this file moves all threads in that
-   group into this cgroup.
- - notify_on_release flag: run the release agent on exit?
- - release_agent: the path to use for release notifications (this file
-   exists in the top cgroup only)
-
-Other subsystems such as cpusets may add additional files in each
-cgroup dir.
-
-New cgroups are created using the mkdir system call or shell
-command. The properties of a cgroup, such as its flags, are
-modified by writing to the appropriate file in that cgroup's
-directory, as listed above.
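-
-As a quick sketch (the cgroup name here is only illustrative, and a
-hierarchy is assumed to be already mounted at /sys/fs/cgroup; see
-section 2 for mounting examples)::
-
-    # mkdir /sys/fs/cgroup/my_group
-    # /bin/echo 1 > /sys/fs/cgroup/my_group/notify_on_release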
- -The named hierarchical structure of nested cgroups allows partitioning -a large system into nested, dynamically changeable, "soft-partitions". - -The attachment of each task, automatically inherited at fork by any -children of that task, to a cgroup allows organizing the work load -on a system into related sets of tasks. A task may be re-attached to -any other cgroup, if allowed by the permissions on the necessary -cgroup file system directories. - -When a task is moved from one cgroup to another, it gets a new -css_set pointer - if there's an already existing css_set with the -desired collection of cgroups then that group is reused, otherwise a new -css_set is allocated. The appropriate existing css_set is located by -looking into a hash table. - -To allow access from a cgroup to the css_sets (and hence tasks) -that comprise it, a set of cg_cgroup_link objects form a lattice; -each cg_cgroup_link is linked into a list of cg_cgroup_links for -a single cgroup on its cgrp_link_list field, and a list of -cg_cgroup_links for a single css_set on its cg_link_list. - -Thus the set of tasks in a cgroup can be listed by iterating over -each css_set that references the cgroup, and sub-iterating over -each css_set's task set. - -The use of a Linux virtual file system (vfs) to represent the -cgroup hierarchy provides for a familiar permission and name space -for cgroups, with a minimum of additional kernel code. - -1.4 What does notify_on_release do ? ------------------------------------- - -If the notify_on_release flag is enabled (1) in a cgroup, then -whenever the last task in the cgroup leaves (exits or attaches to -some other cgroup) and the last child cgroup of that cgroup -is removed, then the kernel runs the command specified by the contents -of the "release_agent" file in that hierarchy's root directory, -supplying the pathname (relative to the mount point of the cgroup -file system) of the abandoned cgroup. This enables automatic -removal of abandoned cgroups. The default value of -notify_on_release in the root cgroup at system boot is disabled -(0). The default value of other cgroups at creation is the current -value of their parents' notify_on_release settings. The default value of -a cgroup hierarchy's release_agent path is empty. - -1.5 What does clone_children do ? ---------------------------------- - -This flag only affects the cpuset controller. If the clone_children -flag is enabled (1) in a cgroup, a new cpuset cgroup will copy its -configuration from the parent during initialization. - -1.6 How do I use cgroups ? --------------------------- - -To start a new job that is to be contained within a cgroup, using -the "cpuset" cgroup subsystem, the steps are something like:: - - 1) mount -t tmpfs cgroup_root /sys/fs/cgroup - 2) mkdir /sys/fs/cgroup/cpuset - 3) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset - 4) Create the new cgroup by doing mkdir's and write's (or echo's) in - the /sys/fs/cgroup/cpuset virtual file system. - 5) Start a task that will be the "founding father" of the new job. - 6) Attach that task to the new cgroup by writing its PID to the - /sys/fs/cgroup/cpuset tasks file for that cgroup. - 7) fork, exec or clone the job tasks from this founding father task. 
-
-For example, the following sequence of commands will set up a cgroup
-named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
-and then start a subshell 'sh' in that cgroup::
-
-  mount -t tmpfs cgroup_root /sys/fs/cgroup
-  mkdir /sys/fs/cgroup/cpuset
-  mount -t cgroup cpuset -ocpuset /sys/fs/cgroup/cpuset
-  cd /sys/fs/cgroup/cpuset
-  mkdir Charlie
-  cd Charlie
-  /bin/echo 2-3 > cpuset.cpus
-  /bin/echo 1 > cpuset.mems
-  /bin/echo $$ > tasks
-  sh
-  # The subshell 'sh' is now running in cgroup Charlie
-  # The next line should display '/Charlie'
-  cat /proc/self/cgroup
-
-2. Usage Examples and Syntax
-============================
-
-2.1 Basic Usage
----------------
-
-Creating, modifying, and using cgroups can be done through the cgroup
-virtual filesystem.
-
-To mount a cgroup hierarchy with all available subsystems, type::
-
-  # mount -t cgroup xxx /sys/fs/cgroup
-
-The "xxx" is not interpreted by the cgroup code, but will appear in
-/proc/mounts so may be any useful identifying string that you like.
-
-Note: Some subsystems do not work without some user input first. For instance,
-if cpusets are enabled the user will have to populate the cpus and mems files
-for each new cgroup created before that group can be used.
-
-As explained in section `1.2 Why are cgroups needed?`, you should create
-different hierarchies of cgroups for each single resource or group of
-resources you want to control. Therefore, you should mount a tmpfs on
-/sys/fs/cgroup and create directories for each cgroup resource or resource
-group::
-
-  # mount -t tmpfs cgroup_root /sys/fs/cgroup
-  # mkdir /sys/fs/cgroup/rg1
-
-To mount a cgroup hierarchy with just the cpuset and memory
-subsystems, type::
-
-  # mount -t cgroup -o cpuset,memory hier1 /sys/fs/cgroup/rg1
-
-While remounting cgroups is currently supported, it is not recommended
-to use it. Remounting allows changing bound subsystems and
-release_agent. Rebinding is hardly useful as it only works when the
-hierarchy is empty, and release_agent itself should be replaced with
-conventional fsnotify. The support for remounting will be removed in
-the future.
-
-To specify a hierarchy's release_agent::
-
-  # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \
-    xxx /sys/fs/cgroup/rg1
-
-Note that specifying 'release_agent' more than once will return failure.
-
-Note that changing the set of subsystems is currently only supported
-when the hierarchy consists of a single (root) cgroup. Supporting
-the ability to arbitrarily bind/unbind subsystems from an existing
-cgroup hierarchy is intended to be implemented in the future.
-
-Then under /sys/fs/cgroup/rg1 you can find a tree that corresponds to the
-tree of the cgroups in the system. For instance, /sys/fs/cgroup/rg1
-is the cgroup that holds the whole system.
-
-If you want to change the value of release_agent::
-
-  # echo "/sbin/new_release_agent" > /sys/fs/cgroup/rg1/release_agent
-
-It can also be changed via remount.
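-
-For instance, re-using the hierarchy mounted at /sys/fs/cgroup/rg1
-above, a remount might look like this (a sketch only: the exact option
-string is an assumption, and remount support is deprecated as noted
-above)::
-
-  # mount -o remount,release_agent="/sbin/new_release_agent" \
-    /sys/fs/cgroup/rg1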
-
-If you want to create a new cgroup under /sys/fs/cgroup/rg1::
-
-  # cd /sys/fs/cgroup/rg1
-  # mkdir my_cgroup
-
-Now you want to do something with this cgroup::
-
-  # cd my_cgroup
-
-In this directory you can find several files::
-
-  # ls
-  cgroup.procs notify_on_release tasks
-  (plus whatever files are added by the attached subsystems)
-
-Now attach your shell to this cgroup::
-
-  # /bin/echo $$ > tasks
-
-You can also create cgroups inside your cgroup by using mkdir in this
-directory::
-
-  # mkdir my_sub_cs
-
-To remove a cgroup, just use rmdir::
-
-  # rmdir my_sub_cs
-
-This will fail if the cgroup is in use (has cgroups inside, or
-has processes attached, or is held alive by other subsystem-specific
-references).
-
-2.2 Attaching processes
------------------------
-
-::
-
-  # /bin/echo PID > tasks
-
-Note that it is PID, not PIDs. You can only attach ONE task at a time.
-If you have several tasks to attach, you have to do it one after another::
-
-  # /bin/echo PID1 > tasks
-  # /bin/echo PID2 > tasks
-  ...
-  # /bin/echo PIDn > tasks
-
-You can attach the current shell task by echoing 0::
-
-  # echo 0 > tasks
-
-You can use the cgroup.procs file instead of the tasks file to move all
-threads in a threadgroup at once. Echoing the PID of any task in a
-threadgroup to cgroup.procs causes all tasks in that threadgroup to be
-attached to the cgroup. Writing 0 to cgroup.procs moves all tasks
-in the writing task's threadgroup.
-
-Note: Since every task is always a member of exactly one cgroup in each
-mounted hierarchy, to remove a task from its current cgroup you must
-move it into a new cgroup (possibly the root cgroup) by writing to the
-new cgroup's tasks file.
-
-Note: Due to some restrictions enforced by some cgroup subsystems, moving
-a process to another cgroup can fail.
-
-2.3 Mounting hierarchies by name
---------------------------------
-
-Passing the name= option when mounting a cgroups hierarchy
-associates the given name with the hierarchy. This can be used when
-mounting a pre-existing hierarchy, in order to refer to it by name
-rather than by its set of active subsystems. Each hierarchy is either
-nameless, or has a unique name.
-
-The name should match [\w.-]+
-
-When passing a name= option for a new hierarchy, you need to
-specify subsystems manually; the legacy behaviour of mounting all
-subsystems when none are explicitly specified is not supported when
-you give a subsystem a name.
-
-The name of the subsystem appears as part of the hierarchy description
-in /proc/mounts and /proc//cgroups.
-
-
-3. Kernel API
-=============
-
-3.1 Overview
-------------
-
-Each kernel subsystem that wants to hook into the generic cgroup
-system needs to create a cgroup_subsys object. This contains
-various methods, which are callbacks from the cgroup system, along
-with a subsystem ID which will be assigned by the cgroup system.
-
-Other fields in the cgroup_subsys object include:
-
-- subsys_id: a unique array index for the subsystem, indicating which
-  entry in cgroup->subsys[] this subsystem should be managing.
-
-- name: should be initialized to a unique subsystem name. Should be
-  no longer than MAX_CGROUP_TYPE_NAMELEN.
-
-- early_init: indicate if the subsystem needs early initialization
-  at system boot.
-
-Each cgroup object created by the system has an array of pointers,
-indexed by subsystem ID; this pointer is entirely managed by the
-subsystem; the generic cgroup code will never touch this pointer.
-
-3.2 Synchronization
--------------------
-
-There is a global mutex, cgroup_mutex, used by the cgroup
-system. This should be taken by anything that wants to modify a
-cgroup. It may also be taken to prevent cgroups from being
-modified, but more specific locks may be more appropriate in that
-situation.
-
-See kernel/cgroup.c for more details.
-
-Subsystems can take/release the cgroup_mutex via the functions
-cgroup_lock()/cgroup_unlock().
-
-Accessing a task's cgroup pointer may be done in the following ways:
-
-- while holding cgroup_mutex
-- while holding the task's alloc_lock (via task_lock())
-- inside an rcu_read_lock() section via rcu_dereference()
-
-3.3 Subsystem API
------------------
-
-Each subsystem should:
-
-- add an entry in linux/cgroup_subsys.h
-- define a cgroup_subsys object called _cgrp_subsys
-
-Each subsystem may export the following methods. The only mandatory
-methods are css_alloc/free. Any others that are null are presumed to
-be successful no-ops.
-
-``struct cgroup_subsys_state *css_alloc(struct cgroup *cgrp)``
-(cgroup_mutex held by caller)
-
-Called to allocate a subsystem state object for a cgroup. The
-subsystem should allocate its subsystem state object for the passed
-cgroup, returning a pointer to the new object on success or an
-ERR_PTR() value. On success, the subsystem pointer should point to
-a structure of type cgroup_subsys_state (typically embedded in a
-larger subsystem-specific object), which will be initialized by the
-cgroup system. Note that this will be called at initialization to
-create the root subsystem state for this subsystem; this case can be
-identified by the passed cgroup object having a NULL parent (since
-it's the root of the hierarchy) and may be an appropriate place for
-initialization code.
-
-``int css_online(struct cgroup *cgrp)``
-(cgroup_mutex held by caller)
-
-Called after @cgrp has successfully completed all allocations and been
-made visible to cgroup_for_each_child/descendant_*() iterators. The
-subsystem may choose to fail creation by returning -errno. This
-callback can be used to implement reliable state sharing and
-propagation along the hierarchy. See the comment on
-cgroup_for_each_descendant_pre() for details.
-
-``void css_offline(struct cgroup *cgrp);``
-(cgroup_mutex held by caller)
-
-This is the counterpart of css_online() and called iff css_online()
-has succeeded on @cgrp. This signifies the beginning of the end of
-@cgrp. @cgrp is being removed and the subsystem should start dropping
-all references it's holding on @cgrp. When all references are dropped,
-cgroup removal will proceed to the next step - css_free(). After this
-callback, @cgrp should be considered dead to the subsystem.
-
-``void css_free(struct cgroup *cgrp)``
-(cgroup_mutex held by caller)
-
-The cgroup system is about to free @cgrp; the subsystem should free
-its subsystem state object. By the time this method is called, @cgrp
-is completely unused; @cgrp->parent is still valid. (Note - can also
-be called for a newly-created cgroup if an error occurs after this
-subsystem's create() method has been called for the new cgroup).
-
-``int can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
-(cgroup_mutex held by caller)
-
-Called prior to moving one or more tasks into a cgroup; if the
-subsystem returns an error, this will abort the attach operation.
-@tset contains the tasks to be attached and is guaranteed to have at
-least one task in it.
-
-If there are multiple tasks in the taskset, then:
-
-  - it's guaranteed that all are from the same thread group
-  - @tset contains all tasks from the thread group whether or not
-    they're switching cgroups
-  - the first task is the leader
-
-Each @tset entry also contains the task's old cgroup, and tasks that
-aren't switching cgroups can be skipped easily using the
-cgroup_taskset_for_each() iterator. Note that this isn't called on a
-fork. If this method returns 0 (success) then this should remain valid
-while the caller holds cgroup_mutex and it is ensured that either
-attach() or cancel_attach() will be called in the future.
-
-``void css_reset(struct cgroup_subsys_state *css)``
-(cgroup_mutex held by caller)
-
-An optional operation which should restore @css's configuration to the
-initial state. This is currently only used on the unified hierarchy
-when a subsystem is disabled on a cgroup through
-"cgroup.subtree_control" but should remain enabled because other
-subsystems depend on it. cgroup core makes such a css invisible by
-removing the associated interface files and invokes this callback so
-that the hidden subsystem can return to the initial neutral state.
-This prevents unexpected resource control from a hidden css and
-ensures that the configuration is in the initial state when it is made
-visible again later.
-
-``void cancel_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
-(cgroup_mutex held by caller)
-
-Called when a task attach operation has failed after can_attach() has
-succeeded. A subsystem whose can_attach() has side effects should
-provide this function, so that the subsystem can implement a rollback;
-otherwise it is not necessary. This will be called only for subsystems
-whose can_attach() operation has succeeded. The parameters are
-identical to can_attach().
-
-``void attach(struct cgroup *cgrp, struct cgroup_taskset *tset)``
-(cgroup_mutex held by caller)
-
-Called after the task has been attached to the cgroup, to allow any
-post-attachment activity that requires memory allocations or blocking.
-The parameters are identical to can_attach().
-
-``void fork(struct task_struct *task)``
-
-Called when a task is forked into a cgroup.
-
-``void exit(struct task_struct *task)``
-
-Called during task exit.
-
-``void free(struct task_struct *task)``
-
-Called when the task_struct is freed.
-
-``void bind(struct cgroup *root)``
-(cgroup_mutex held by caller)
-
-Called when a cgroup subsystem is rebound to a different hierarchy
-and root cgroup. Currently this will only involve movement between
-the default hierarchy (which never has sub-cgroups) and a hierarchy
-that is being created/destroyed (and hence has no sub-cgroups).
-
-4. Extended attribute usage
-===========================
-
-The cgroup filesystem supports certain types of extended attributes in
-its directories and files. The currently supported types are:
-
-  - Trusted (XATTR_TRUSTED)
-  - Security (XATTR_SECURITY)
-
-Both require the CAP_SYS_ADMIN capability to set.
-
-As in tmpfs, the extended attributes in the cgroup filesystem are
-stored using kernel memory, so it is advisable to keep their use to a
-minimum. This is why user-defined extended attributes are not
-supported: any user could create them, and there is no limit on the
-value size.
-
-The currently known users of this feature are SELinux, to limit cgroup
-usage in containers, and systemd, for assorted metadata such as the
-main PID in a cgroup (systemd creates a cgroup per service).
-
-5.
Questions -============ - -:: - - Q: what's up with this '/bin/echo' ? - A: bash's builtin 'echo' command does not check calls to write() against - errors. If you use it in the cgroup file system, you won't be - able to tell whether a command succeeded or failed. - - Q: When I attach processes, only the first of the line gets really attached ! - A: We can only return one error code per call to write(). So you should also - put only ONE PID. diff --git a/Documentation/cgroup-v1/cpuacct.rst b/Documentation/cgroup-v1/cpuacct.rst deleted file mode 100644 index d30ed81d2ad7..000000000000 --- a/Documentation/cgroup-v1/cpuacct.rst +++ /dev/null @@ -1,50 +0,0 @@ -========================= -CPU Accounting Controller -========================= - -The CPU accounting controller is used to group tasks using cgroups and -account the CPU usage of these groups of tasks. - -The CPU accounting controller supports multi-hierarchy groups. An accounting -group accumulates the CPU usage of all of its child groups and the tasks -directly present in its group. - -Accounting groups can be created by first mounting the cgroup filesystem:: - - # mount -t cgroup -ocpuacct none /sys/fs/cgroup - -With the above step, the initial or the parent accounting group becomes -visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in -the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. -/sys/fs/cgroup/cpuacct.usage gives the CPU time (in nanoseconds) obtained -by this group which is essentially the CPU time obtained by all the tasks -in the system. - -New accounting groups can be created under the parent group /sys/fs/cgroup:: - - # cd /sys/fs/cgroup - # mkdir g1 - # echo $$ > g1/tasks - -The above steps create a new group g1 and move the current shell -process (bash) into it. CPU time consumed by this bash and its children -can be obtained from g1/cpuacct.usage and the same is accumulated in -/sys/fs/cgroup/cpuacct.usage also. - -cpuacct.stat file lists a few statistics which further divide the -CPU time obtained by the cgroup into user and system times. Currently -the following statistics are supported: - -user: Time spent by tasks of the cgroup in user mode. -system: Time spent by tasks of the cgroup in kernel mode. - -user and system are in USER_HZ unit. - -cpuacct controller uses percpu_counter interface to collect user and -system times. This has two side effects: - -- It is theoretically possible to see wrong values for user and system times. - This is because percpu_counter_read() on 32bit systems isn't safe - against concurrent writes. -- It is possible to see slightly outdated values for user and system times - due to the batch processing nature of percpu_counter. diff --git a/Documentation/cgroup-v1/cpusets.rst b/Documentation/cgroup-v1/cpusets.rst deleted file mode 100644 index b6a42cdea72b..000000000000 --- a/Documentation/cgroup-v1/cpusets.rst +++ /dev/null @@ -1,866 +0,0 @@ -======= -CPUSETS -======= - -Copyright (C) 2004 BULL SA. - -Written by Simon.Derr@bull.net - -- Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. -- Modified by Paul Jackson -- Modified by Christoph Lameter -- Modified by Paul Menage -- Modified by Hidetoshi Seto - -.. CONTENTS: - - 1. Cpusets - 1.1 What are cpusets ? - 1.2 Why are cpusets needed ? - 1.3 How are cpusets implemented ? - 1.4 What are exclusive cpusets ? - 1.5 What is memory_pressure ? - 1.6 What is memory spread ? - 1.7 What is sched_load_balance ? - 1.8 What is sched_relax_domain_level ? - 1.9 How do I use cpusets ? - 2. 
Usage Examples and Syntax - 2.1 Basic Usage - 2.2 Adding/removing cpus - 2.3 Setting flags - 2.4 Attaching processes - 3. Questions - 4. Contact - -1. Cpusets -========== - -1.1 What are cpusets ? ----------------------- - -Cpusets provide a mechanism for assigning a set of CPUs and Memory -Nodes to a set of tasks. In this document "Memory Node" refers to -an on-line node that contains memory. - -Cpusets constrain the CPU and Memory placement of tasks to only -the resources within a task's current cpuset. They form a nested -hierarchy visible in a virtual file system. These are the essential -hooks, beyond what is already present, required to manage dynamic -job placement on large systems. - -Cpusets use the generic cgroup subsystem described in -Documentation/cgroup-v1/cgroups.rst. - -Requests by a task, using the sched_setaffinity(2) system call to -include CPUs in its CPU affinity mask, and using the mbind(2) and -set_mempolicy(2) system calls to include Memory Nodes in its memory -policy, are both filtered through that task's cpuset, filtering out any -CPUs or Memory Nodes not in that cpuset. The scheduler will not -schedule a task on a CPU that is not allowed in its cpus_allowed -vector, and the kernel page allocator will not allocate a page on a -node that is not allowed in the requesting task's mems_allowed vector. - -User level code may create and destroy cpusets by name in the cgroup -virtual file system, manage the attributes and permissions of these -cpusets and which CPUs and Memory Nodes are assigned to each cpuset, -specify and query to which cpuset a task is assigned, and list the -task pids assigned to a cpuset. - - -1.2 Why are cpusets needed ? ----------------------------- - -The management of large computer systems, with many processors (CPUs), -complex memory cache hierarchies and multiple Memory Nodes having -non-uniform access times (NUMA) presents additional challenges for -the efficient scheduling and memory placement of processes. - -Frequently more modest sized systems can be operated with adequate -efficiency just by letting the operating system automatically share -the available CPU and Memory resources amongst the requesting tasks. - -But larger systems, which benefit more from careful processor and -memory placement to reduce memory access times and contention, -and which typically represent a larger investment for the customer, -can benefit from explicitly placing jobs on properly sized subsets of -the system. - -This can be especially valuable on: - - * Web Servers running multiple instances of the same web application, - * Servers running different applications (for instance, a web server - and a database), or - * NUMA systems running large HPC applications with demanding - performance characteristics. - -These subsets, or "soft partitions" must be able to be dynamically -adjusted, as the job mix changes, without impacting other concurrently -executing jobs. The location of the running jobs pages may also be moved -when the memory locations are changed. - -The kernel cpuset patch provides the minimum essential kernel -mechanisms required to efficiently implement such subsets. It -leverages existing CPU and Memory Placement facilities in the Linux -kernel to avoid any additional impact on the critical scheduler or -memory allocator code. - - -1.3 How are cpusets implemented ? ---------------------------------- - -Cpusets provide a Linux kernel mechanism to constrain which CPUs and -Memory Nodes are used by a process or set of processes. 
-
-The Linux kernel already has a pair of mechanisms to specify on which
-CPUs a task may be scheduled (sched_setaffinity) and on which Memory
-Nodes it may obtain memory (mbind, set_mempolicy).
-
-Cpusets extend these two mechanisms as follows:
-
- - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
-   kernel.
- - Each task in the system is attached to a cpuset, via a pointer
-   in the task structure to a reference counted cgroup structure.
- - Calls to sched_setaffinity are filtered to just those CPUs
-   allowed in that task's cpuset.
- - Calls to mbind and set_mempolicy are filtered to just
-   those Memory Nodes allowed in that task's cpuset.
- - The root cpuset contains all the system's CPUs and Memory
-   Nodes.
- - For any cpuset, one can define child cpusets containing a subset
-   of the parent's CPU and Memory Node resources.
- - The hierarchy of cpusets can be mounted at /dev/cpuset, for
-   browsing and manipulation from user space.
- - A cpuset may be marked exclusive, which ensures that no other
-   cpuset (except direct ancestors and descendants) may contain
-   any overlapping CPUs or Memory Nodes.
- - You can list all the tasks (by pid) attached to any cpuset.
-
-The implementation of cpusets requires a few simple hooks
-into the rest of the kernel, none in performance-critical paths:
-
- - in init/main.c, to initialize the root cpuset at system boot.
- - in fork and exit, to attach and detach a task from its cpuset.
- - in sched_setaffinity, to mask the requested CPUs by what's
-   allowed in that task's cpuset.
- - in sched.c migrate_live_tasks(), to keep migrating tasks within
-   the CPUs allowed by their cpuset, if possible.
- - in the mbind and set_mempolicy system calls, to mask the requested
-   Memory Nodes by what's allowed in that task's cpuset.
- - in page_alloc.c, to restrict memory to allowed nodes.
- - in vmscan.c, to restrict page recovery to the current cpuset.
-
-You should mount the "cgroup" filesystem type in order to enable
-browsing and modifying the cpusets presently known to the kernel. No
-new system calls are added for cpusets - all support for querying and
-modifying cpusets is via this cpuset file system.
-
-The /proc//status file for each task has four added lines,
-displaying the task's cpus_allowed (on which CPUs it may be scheduled)
-and mems_allowed (on which Memory Nodes it may obtain memory),
-in the two formats seen in the following example::
-
-  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
-  Cpus_allowed_list:      0-127
-  Mems_allowed:   ffffffff,ffffffff
-  Mems_allowed_list:      0-63
-
-Each cpuset is represented by a directory in the cgroup file system
-containing (on top of the standard cgroup files) the following
-files describing that cpuset:
-
- - cpuset.cpus: list of CPUs in that cpuset
- - cpuset.mems: list of Memory Nodes in that cpuset
- - cpuset.memory_migrate flag: if set, move pages to the cpuset's nodes
- - cpuset.cpu_exclusive flag: is cpu placement exclusive?
- - cpuset.mem_exclusive flag: is memory placement exclusive?
- - cpuset.mem_hardwall flag: is memory allocation hardwalled?
- - cpuset.memory_pressure: a measure of the paging pressure in the cpuset
- - cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- - cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- - cpuset.sched_relax_domain_level: the search range when migrating tasks
-
-In addition, only the root cpuset has the following file:
-
- - cpuset.memory_pressure_enabled flag: compute memory_pressure?
-
-New cpusets are created using the mkdir system call or shell
-command. The properties of a cpuset, such as its flags, allowed
-CPUs and Memory Nodes, and attached tasks, are modified by writing
-to the appropriate file in that cpuset's directory, as listed above.
-
-The named hierarchical structure of nested cpusets allows partitioning
-a large system into nested, dynamically changeable, "soft-partitions".
-
-The attachment of each task, automatically inherited at fork by any
-children of that task, to a cpuset allows organizing the work load
-on a system into related sets of tasks such that each set is constrained
-to using the CPUs and Memory Nodes of a particular cpuset. A task
-may be re-attached to any other cpuset, if allowed by the permissions
-on the necessary cpuset file system directories.
-
-Such management of a system "in the large" integrates smoothly with
-the detailed placement done on individual tasks and memory regions
-using the sched_setaffinity, mbind and set_mempolicy system calls.
-
-The following rules apply to each cpuset:
-
- - Its CPUs and Memory Nodes must be a subset of its parent's.
- - It can't be marked exclusive unless its parent is.
- - If its cpu or memory is exclusive, they may not overlap any sibling.
-
-These rules, and the natural hierarchy of cpusets, enable efficient
-enforcement of the exclusive guarantee, without having to scan all
-cpusets every time any of them change to ensure nothing overlaps an
-exclusive cpuset. Also, the use of a Linux virtual file system (vfs)
-to represent the cpuset hierarchy provides for a familiar permission
-and name space for cpusets, with a minimum of additional kernel code.
-
-The cpus and mems files in the root (top_cpuset) cpuset are
-read-only. The cpus file automatically tracks the value of
-cpu_online_mask using a CPU hotplug notifier, and the mems file
-automatically tracks the value of node_states[N_MEMORY]--i.e.,
-nodes with memory--using the cpuset_track_online_nodes() hook.
-
-
-1.4 What are exclusive cpusets ?
---------------------------------
-
-If a cpuset is cpu or mem exclusive, no other cpuset, other than
-a direct ancestor or descendant, may share any of the same CPUs or
-Memory Nodes.
-
-A cpuset that is cpuset.mem_exclusive *or* cpuset.mem_hardwall is "hardwalled",
-i.e. it restricts kernel allocations for page, buffer and other data
-commonly shared by the kernel across multiple users. All cpusets,
-whether hardwalled or not, restrict allocations of memory for user
-space. This enables configuring a system so that several independent
-jobs can share common kernel data, such as file system pages, while
-isolating each job's user allocation in its own cpuset. To do this,
-construct a large mem_exclusive cpuset to hold all the jobs, and
-construct child, non-mem_exclusive cpusets for each individual job.
-Only a small amount of typical kernel memory, such as requests from -interrupt handlers, is allowed to be taken outside even a -mem_exclusive cpuset. - - -1.5 What is memory_pressure ? ------------------------------ -The memory_pressure of a cpuset provides a simple per-cpuset metric -of the rate that the tasks in a cpuset are attempting to free up in -use memory on the nodes of the cpuset to satisfy additional memory -requests. - -This enables batch managers monitoring jobs running in dedicated -cpusets to efficiently detect what level of memory pressure that job -is causing. - -This is useful both on tightly managed systems running a wide mix of -submitted jobs, which may choose to terminate or re-prioritize jobs that -are trying to use more memory than allowed on the nodes assigned to them, -and with tightly coupled, long running, massively parallel scientific -computing jobs that will dramatically fail to meet required performance -goals if they start to use more memory than allowed to them. - -This mechanism provides a very economical way for the batch manager -to monitor a cpuset for signs of memory pressure. It's up to the -batch manager or other user code to decide what to do about it and -take action. - -==> - Unless this feature is enabled by writing "1" to the special file - /dev/cpuset/memory_pressure_enabled, the hook in the rebalance - code of __alloc_pages() for this metric reduces to simply noticing - that the cpuset_memory_pressure_enabled flag is zero. So only - systems that enable this feature will compute the metric. - -Why a per-cpuset, running average: - - Because this meter is per-cpuset, rather than per-task or mm, - the system load imposed by a batch scheduler monitoring this - metric is sharply reduced on large systems, because a scan of - the tasklist can be avoided on each set of queries. - - Because this meter is a running average, instead of an accumulating - counter, a batch scheduler can detect memory pressure with a - single read, instead of having to read and accumulate results - for a period of time. - - Because this meter is per-cpuset rather than per-task or mm, - the batch scheduler can obtain the key information, memory - pressure in a cpuset, with a single read, rather than having to - query and accumulate results over all the (dynamically changing) - set of tasks in the cpuset. - -A per-cpuset simple digital filter (requires a spinlock and 3 words -of data per-cpuset) is kept, and updated by any task attached to that -cpuset, if it enters the synchronous (direct) page reclaim code. - -A per-cpuset file provides an integer number representing the recent -(half-life of 10 seconds) rate of direct page reclaims caused by -the tasks in the cpuset, in units of reclaims attempted per second, -times 1000. - - -1.6 What is memory spread ? ---------------------------- -There are two boolean flag files per cpuset that control where the -kernel allocates pages for the file system buffers and related in -kernel data structures. They are called 'cpuset.memory_spread_page' and -'cpuset.memory_spread_slab'. - -If the per-cpuset boolean flag file 'cpuset.memory_spread_page' is set, then -the kernel will spread the file system buffers (page cache) evenly -over all the nodes that the faulting task is allowed to use, instead -of preferring to put those pages on the node where the task is running. 
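-
-For instance, to turn page-cache spreading on for an existing cpuset
-(a sketch; the path shown assumes the mount point and the "Charlie"
-cpuset used in the examples later in this document)::
-
-    # /bin/echo 1 > /sys/fs/cgroup/cpuset/Charlie/cpuset.memory_spread_page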
-
-If the per-cpuset boolean flag file 'cpuset.memory_spread_slab' is set,
-then the kernel will spread some file system related slab caches,
-such as those for inodes and dentries, evenly over all the nodes that
-the faulting task is allowed to use, instead of preferring to put those
-pages on the node where the task is running.
-
-The setting of these flags does not affect the anonymous data segment or
-stack segment pages of a task.
-
-By default, both kinds of memory spreading are off, and memory
-pages are allocated on the node local to where the task is running,
-except perhaps as modified by the task's NUMA mempolicy or cpuset
-configuration, so long as sufficient free memory pages are available.
-
-When new cpusets are created, they inherit the memory spread settings
-of their parent.
-
-Setting memory spreading causes allocations for the affected page
-or slab caches to ignore the task's NUMA mempolicy and be spread
-instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
-mempolicies will not notice any change in these calls as a result of
-their containing task's memory spread settings. If memory spreading
-is turned off, then the currently specified NUMA mempolicy once again
-applies to memory page allocations.
-
-Both 'cpuset.memory_spread_page' and 'cpuset.memory_spread_slab' are boolean flag
-files. By default they contain "0", meaning that the feature is off
-for that cpuset. If a "1" is written to that file, then that turns
-the named feature on.
-
-The implementation is simple.
-
-Setting the flag 'cpuset.memory_spread_page' turns on a per-process flag
-PFA_SPREAD_PAGE for each task that is in that cpuset or subsequently
-joins that cpuset. The page allocation calls for the page cache
-are modified to perform an inline check for this PFA_SPREAD_PAGE task
-flag, and if set, a call to a new routine cpuset_mem_spread_node()
-returns the node to prefer for the allocation.
-
-Similarly, setting 'cpuset.memory_spread_slab' turns on the flag
-PFA_SPREAD_SLAB, and appropriately marked slab caches will allocate
-pages from the node returned by cpuset_mem_spread_node().
-
-The cpuset_mem_spread_node() routine is also simple. It uses the
-value of a per-task rotor cpuset_mem_spread_rotor to select the next
-node in the current task's mems_allowed to prefer for the allocation.
-
-This memory placement policy is also known (in other contexts) as
-round-robin or interleave.
-
-This policy can provide substantial improvements for jobs that need
-to place thread-local data on the corresponding node, but that need
-to access large file system data sets that must be spread across
-the several nodes in the job's cpuset in order to fit. Without this
-policy, especially for jobs that might have one thread reading in the
-data set, the memory allocation across the nodes in the job's cpuset
-can become very uneven.
-
-1.7 What is sched_load_balance ?
---------------------------------
-
-The kernel scheduler (kernel/sched/core.c) automatically load balances
-tasks. If one CPU is underutilized, kernel code running on that
-CPU will look for tasks on other more overloaded CPUs and move those
-tasks to itself, within the constraints of such placement mechanisms
-as cpusets and sched_setaffinity.
-
-The algorithmic cost of load balancing and its impact on key shared
-kernel data structures such as the task list increases more than
-linearly with the number of CPUs being balanced. So the scheduler
-has support to partition the system's CPUs into a number of sched
-domains such that it only load balances within each sched domain.
-Each sched domain covers some subset of the CPUs in the system;
-no two sched domains overlap; some CPUs might not be in any sched
-domain and hence won't be load balanced.
-
-Put simply, it costs less to balance between two smaller sched domains
-than one big one, but doing so means that overloads in one of the
-two domains won't be load balanced to the other one.
-
-By default, there is one sched domain covering all CPUs, including those
-marked isolated using the kernel boot time "isolcpus=" argument. However,
-the isolated CPUs will not participate in load balancing, and will not
-have tasks running on them unless explicitly assigned.
-
-This default load balancing across all CPUs is not well suited for
-the following two situations:
-
- 1) On large systems, load balancing across many CPUs is expensive.
-    If the system is managed using cpusets to place independent jobs
-    on separate sets of CPUs, full load balancing is unnecessary.
- 2) Systems supporting realtime on some CPUs need to minimize
-    system overhead on those CPUs, including avoiding task load
-    balancing if that is not needed.
-
-When the per-cpuset flag "cpuset.sched_load_balance" is enabled (the default
-setting), it requests that all the CPUs in that cpuset's allowed 'cpuset.cpus'
-be contained in a single sched domain, ensuring that load balancing
-can move a task (not otherwise pinned, as by sched_setaffinity)
-from any CPU in that cpuset to any other.
-
-When the per-cpuset flag "cpuset.sched_load_balance" is disabled, then the
-scheduler will avoid load balancing across the CPUs in that cpuset,
---except-- in so far as is necessary because some overlapping cpuset
-has "sched_load_balance" enabled.
-
-So, for example, if the top cpuset has the flag "cpuset.sched_load_balance"
-enabled, then the scheduler will have one sched domain covering all
-CPUs, and the setting of the "cpuset.sched_load_balance" flag in any other
-cpusets won't matter, as we're already fully load balancing.
-
-Therefore in the above two situations, the top cpuset flag
-"cpuset.sched_load_balance" should be disabled, and only some of the smaller,
-child cpusets should have this flag enabled.
-
-When doing this, you don't usually want to leave any unpinned tasks in
-the top cpuset that might use non-trivial amounts of CPU, as such tasks
-may be artificially constrained to some subset of CPUs, depending on
-the particulars of this flag setting in descendant cpusets. Even if
-such a task could use spare CPU cycles in some other CPUs, the kernel
-scheduler might not consider the possibility of load balancing that
-task to that underused CPU.
-
-Of course, tasks pinned to a particular CPU can be left in a cpuset
-that disables "cpuset.sched_load_balance" as those tasks aren't going anywhere
-else anyway.
-
-There is an impedance mismatch here, between cpusets and sched domains.
-Cpusets are hierarchical and nest. Sched domains are flat; they don't
-overlap and each CPU is in at most one sched domain.
-
-It is necessary for sched domains to be flat because load balancing
-across partially overlapping sets of CPUs would risk unstable dynamics
-that would be beyond our understanding. So if each of two partially
-overlapping cpusets enables the flag 'cpuset.sched_load_balance', then we
-form a single sched domain that is a superset of both.
We won't move -a task to a CPU outside its cpuset, but the scheduler load balancing -code might waste some compute cycles considering that possibility. - -This mismatch is why there is not a simple one-to-one relation -between which cpusets have the flag "cpuset.sched_load_balance" enabled, -and the sched domain configuration. If a cpuset enables the flag, it -will get balancing across all its CPUs, but if it disables the flag, -it will only be assured of no load balancing if no other overlapping -cpuset enables the flag. - -If two cpusets have partially overlapping 'cpuset.cpus' allowed, and only -one of them has this flag enabled, then the other may find its -tasks only partially load balanced, just on the overlapping CPUs. -This is just the general case of the top_cpuset example given a few -paragraphs above. In the general case, as in the top cpuset case, -don't leave tasks that might use non-trivial amounts of CPU in -such partially load balanced cpusets, as they may be artificially -constrained to some subset of the CPUs allowed to them, for lack of -load balancing to the other CPUs. - -CPUs in "cpuset.isolcpus" were excluded from load balancing by the -isolcpus= kernel boot option, and will never be load balanced regardless -of the value of "cpuset.sched_load_balance" in any cpuset. - -1.7.1 sched_load_balance implementation details. ------------------------------------------------- - -The per-cpuset flag 'cpuset.sched_load_balance' defaults to enabled (contrary -to most cpuset flags.) When enabled for a cpuset, the kernel will -ensure that it can load balance across all the CPUs in that cpuset -(makes sure that all the CPUs in the cpus_allowed of that cpuset are -in the same sched domain.) - -If two overlapping cpusets both have 'cpuset.sched_load_balance' enabled, -then they will be (must be) both in the same sched domain. - -If, as is the default, the top cpuset has 'cpuset.sched_load_balance' enabled, -then by the above that means there is a single sched domain covering -the whole system, regardless of any other cpuset settings. - -The kernel commits to user space that it will avoid load balancing -where it can. It will pick as fine a granularity partition of sched -domains as it can while still providing load balancing for any set -of CPUs allowed to a cpuset having 'cpuset.sched_load_balance' enabled. - -The internal kernel cpuset to scheduler interface passes from the -cpuset code to the scheduler code a partition of the load balanced -CPUs in the system. This partition is a set of subsets (represented -as an array of struct cpumask) of CPUs, pairwise disjoint, that cover -all the CPUs that must be load balanced. - -The cpuset code builds a new such partition and passes it to the -scheduler sched domain setup code, to have the sched domains rebuilt -as necessary, whenever: - - - the 'cpuset.sched_load_balance' flag of a cpuset with non-empty CPUs changes, - - or CPUs come or go from a cpuset with this flag enabled, - - or 'cpuset.sched_relax_domain_level' value of a cpuset with non-empty CPUs - and with this flag enabled changes, - - or a cpuset with non-empty CPUs and with this flag enabled is removed, - - or a cpu is offlined/onlined. - -This partition exactly defines what sched domains the scheduler should -setup - one sched domain for each element (struct cpumask) in the -partition. - -The scheduler remembers the currently active sched domain partitions. 
-When the scheduler routine partition_sched_domains() is invoked from
-the cpuset code to update these sched domains, it compares the new
-partition requested with the current one, and updates its sched domains,
-removing the old and adding the new, for each change.
-
-
-1.8 What is sched_relax_domain_level ?
---------------------------------------
-
-Within a sched domain, the scheduler migrates tasks in two ways: periodic
-load balancing on the tick, and at the time of certain scheduling events.
-
-When a task is woken up, the scheduler tries to move the task to an idle
-CPU. For example, if a task A running on CPU X activates another task B
-on the same CPU X, and if CPU Y is X's sibling and is idle, then the
-scheduler migrates task B to CPU Y so that task B can start on CPU Y
-without waiting for task A on CPU X.
-
-And if a CPU runs out of tasks in its runqueue, the CPU tries to pull
-extra tasks from other busy CPUs to help them before it goes idle.
-
-Of course it takes some search cost to find movable tasks and/or idle
-CPUs, so the scheduler might not search all CPUs in the domain every
-time. In fact, on some architectures, the search ranges for these
-events are limited to the same socket or node where the CPU is located,
-while the load balance on tick searches all of them.
-
-For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
-is idle while CPU X and its siblings are busy, the scheduler can't
-migrate the woken task B from X to Z since it is out of its search
-range. As a result, task B on CPU X needs to wait for task A or for
-the load balance on the next tick. For some applications in special
-situations, waiting one tick may be too long.
-
-The 'cpuset.sched_relax_domain_level' file allows you to request changing
-this search range as you like. This file takes an integer value which
-ideally indicates the size of the search range in levels, as follows;
-otherwise, the initial value of -1 indicates that the cpuset has no
-request.
-
-====== ===========================================================
-  -1   no request. use system default or follow request of others.
-   0   no search.
-   1   search siblings (hyperthreads in a core).
-   2   search cores in a package.
-   3   search cpus in a node [= system wide on non-NUMA system]
-   4   search nodes in a chunk of node [on NUMA system]
-   5   search system wide [on NUMA system]
-====== ===========================================================
-
-The system default is architecture dependent. The system default
-can be changed using the relax_domain_level= boot parameter.
-
-This file is per-cpuset and affects the sched domain to which the
-cpuset belongs. Therefore if the flag 'cpuset.sched_load_balance' of a
-cpuset is disabled, then 'cpuset.sched_relax_domain_level' has no effect
-since there is no sched domain belonging to the cpuset.
-
-If multiple cpusets are overlapping and hence they form a single sched
-domain, the largest value among those is used. Be careful: if one
-requests 0 and the others are -1, then 0 is used.
-
-Note that modifying this file can have both good and bad effects, and
-whether it is acceptable or not depends on your situation. Don't modify
-this file if you are not sure.
-
-If your situation is:
-
- - The migration costs between each cpu can be assumed to be considerably
-   small (for you) due to your special application's behavior or special
-   hardware support for CPU cache etc.
- - The search cost doesn't have an impact (for you), or you can make the
-   search cost small enough by managing your cpusets to be compact etc.
- - Low latency is required even if it sacrifices cache hit rate etc.
- then increasing 'sched_relax_domain_level' would benefit you. - - -1.9 How do I use cpusets ? --------------------------- - -In order to minimize the impact of cpusets on critical kernel -code, such as the scheduler, and due to the fact that the kernel -does not support one task updating the memory placement of another -task directly, the impact on a task of changing its cpuset CPU -or Memory Node placement, or of changing to which cpuset a task -is attached, is subtle. - -If a cpuset has its Memory Nodes modified, then for each task attached -to that cpuset, the next time that the kernel attempts to allocate -a page of memory for that task, the kernel will notice the change -in the task's cpuset, and update its per-task memory placement to -remain within the new cpusets memory placement. If the task was using -mempolicy MPOL_BIND, and the nodes to which it was bound overlap with -its new cpuset, then the task will continue to use whatever subset -of MPOL_BIND nodes are still allowed in the new cpuset. If the task -was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed -in the new cpuset, then the task will be essentially treated as if it -was MPOL_BIND bound to the new cpuset (even though its NUMA placement, -as queried by get_mempolicy(), doesn't change). If a task is moved -from one cpuset to another, then the kernel will adjust the task's -memory placement, as above, the next time that the kernel attempts -to allocate a page of memory for that task. - -If a cpuset has its 'cpuset.cpus' modified, then each task in that cpuset -will have its allowed CPU placement changed immediately. Similarly, -if a task's pid is written to another cpuset's 'tasks' file, then its -allowed CPU placement is changed immediately. If such a task had been -bound to some subset of its cpuset using the sched_setaffinity() call, -the task will be allowed to run on any CPU allowed in its new cpuset, -negating the effect of the prior sched_setaffinity() call. - -In summary, the memory placement of a task whose cpuset is changed is -updated by the kernel, on the next allocation of a page for that task, -and the processor placement is updated immediately. - -Normally, once a page is allocated (given a physical page -of main memory) then that page stays on whatever node it -was allocated, so long as it remains allocated, even if the -cpusets memory placement policy 'cpuset.mems' subsequently changes. -If the cpuset flag file 'cpuset.memory_migrate' is set true, then when -tasks are attached to that cpuset, any pages that task had -allocated to it on nodes in its previous cpuset are migrated -to the task's new cpuset. The relative placement of the page within -the cpuset is preserved during these migration operations if possible. -For example if the page was on the second valid node of the prior cpuset -then the page will be placed on the second valid node of the new cpuset. - -Also if 'cpuset.memory_migrate' is set true, then if that cpuset's -'cpuset.mems' file is modified, pages allocated to tasks in that -cpuset, that were on nodes in the previous setting of 'cpuset.mems', -will be moved to nodes in the new setting of 'mems.' -Pages that were not in the task's prior cpuset, or in the cpuset's -prior 'cpuset.mems' setting, will not be moved. - -There is an exception to the above. If hotplug functionality is used -to remove all the CPUs that are currently assigned to a cpuset, -then all the tasks in that cpuset will be moved to the nearest ancestor -with non-empty cpus. 
But the moving of some (or all) tasks might fail if the
-cpuset is bound to another cgroup subsystem which has some restrictions
-on task attaching. In this failing case, those tasks will stay
-in the original cpuset, and the kernel will automatically update
-their cpus_allowed to allow all online CPUs. When memory hotplug
-functionality for removing Memory Nodes is available, a similar exception
-is expected to apply there as well. In general, the kernel prefers to
-violate cpuset placement, over starving a task that has had all
-its allowed CPUs or Memory Nodes taken offline.
-
-There is a second exception to the above. GFP_ATOMIC requests are
-kernel internal allocations that must be satisfied immediately.
-The kernel may drop some requests, or in rare cases even panic, if a
-GFP_ATOMIC alloc fails. If the request cannot be satisfied within
-the current task's cpuset, then we relax the cpuset, and look for
-memory anywhere we can find it. It's better to violate the cpuset
-than stress the kernel.
-
-To start a new job that is to be contained within a cpuset, the steps are:
-
- 1) mkdir /sys/fs/cgroup/cpuset
- 2) mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
- 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
-    the /sys/fs/cgroup/cpuset virtual file system.
- 4) Start a task that will be the "founding father" of the new job.
- 5) Attach that task to the new cpuset by writing its pid to the
-    /sys/fs/cgroup/cpuset tasks file for that cpuset.
- 6) fork, exec or clone the job tasks from this founding father task.
-
-For example, the following sequence of commands will set up a cpuset
-named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
-and then start a subshell 'sh' in that cpuset::
-
-  mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
-  cd /sys/fs/cgroup/cpuset
-  mkdir Charlie
-  cd Charlie
-  /bin/echo 2-3 > cpuset.cpus
-  /bin/echo 1 > cpuset.mems
-  /bin/echo $$ > tasks
-  sh
-  # The subshell 'sh' is now running in cpuset Charlie
-  # The next line should display '/Charlie'
-  cat /proc/self/cpuset
-
-There are ways to query or modify cpusets:
-
- - via the cpuset file system directly, using the various cd, mkdir, echo,
-   cat, rmdir commands from the shell, or their equivalent from C.
- - via the C library libcpuset.
- - via the C library libcgroup.
-   (http://sourceforge.net/projects/libcg/)
- - via the python application cset.
-   (http://code.google.com/p/cpuset/)
-
-The sched_setaffinity calls can also be done at the shell prompt using
-SGI's runon or Robert Love's taskset. The mbind and set_mempolicy
-calls can be done at the shell prompt using the numactl command
-(part of Andi Kleen's numa package).
-
-2. Usage Examples and Syntax
-============================
-
-2.1 Basic Usage
----------------
-
-Creating, modifying, and using cpusets can be done through the cpuset
-virtual filesystem.
-
-To mount it, type::
-
-  # mount -t cgroup -o cpuset cpuset /sys/fs/cgroup/cpuset
-
-Then under /sys/fs/cgroup/cpuset you can find a tree that corresponds to the
-tree of the cpusets in the system. For instance, /sys/fs/cgroup/cpuset
-is the cpuset that holds the whole system.
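-
-A quick way to confirm the mount and inspect the whole-system cpuset
-(the output shown is only illustrative, here for an 8-CPU machine)::
-
-  # grep cpuset /proc/mounts
-  cpuset /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
-  # cat /sys/fs/cgroup/cpuset/cpuset.cpus
-  0-7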
- -If you want to create a new cpuset under /sys/fs/cgroup/cpuset:: - - # cd /sys/fs/cgroup/cpuset - # mkdir my_cpuset - -Now you want to do something with this cpuset:: - - # cd my_cpuset - -In this directory you can find several files:: - - # ls - cgroup.clone_children cpuset.memory_pressure - cgroup.event_control cpuset.memory_spread_page - cgroup.procs cpuset.memory_spread_slab - cpuset.cpu_exclusive cpuset.mems - cpuset.cpus cpuset.sched_load_balance - cpuset.mem_exclusive cpuset.sched_relax_domain_level - cpuset.mem_hardwall notify_on_release - cpuset.memory_migrate tasks - -Reading them will give you information about the state of this cpuset: -the CPUs and Memory Nodes it can use, the processes that are using -it, its properties. By writing to these files you can manipulate -the cpuset. - -Set some flags:: - - # /bin/echo 1 > cpuset.cpu_exclusive - -Add some cpus:: - - # /bin/echo 0-7 > cpuset.cpus - -Add some mems:: - - # /bin/echo 0-7 > cpuset.mems - -Now attach your shell to this cpuset:: - - # /bin/echo $$ > tasks - -You can also create cpusets inside your cpuset by using mkdir in this -directory:: - - # mkdir my_sub_cs - -To remove a cpuset, just use rmdir:: - - # rmdir my_sub_cs - -This will fail if the cpuset is in use (has cpusets inside, or has -processes attached). - -Note that for legacy reasons, the "cpuset" filesystem exists as a -wrapper around the cgroup filesystem. - -The command:: - - mount -t cpuset X /sys/fs/cgroup/cpuset - -is equivalent to:: - - mount -t cgroup -ocpuset,noprefix X /sys/fs/cgroup/cpuset - echo "/sbin/cpuset_release_agent" > /sys/fs/cgroup/cpuset/release_agent - -2.2 Adding/removing cpus ------------------------- - -This is the syntax to use when writing in the cpus or mems files -in cpuset directories:: - - # /bin/echo 1-4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 - # /bin/echo 1,2,3,4 > cpuset.cpus -> set cpus list to cpus 1,2,3,4 - -To add a CPU to a cpuset, write the new list of CPUs including the -CPU to be added. To add 6 to the above cpuset:: - - # /bin/echo 1-4,6 > cpuset.cpus -> set cpus list to cpus 1,2,3,4,6 - -Similarly to remove a CPU from a cpuset, write the new list of CPUs -without the CPU to be removed. - -To remove all the CPUs:: - - # /bin/echo "" > cpuset.cpus -> clear cpus list - -2.3 Setting flags ------------------ - -The syntax is very simple:: - - # /bin/echo 1 > cpuset.cpu_exclusive -> set flag 'cpuset.cpu_exclusive' - # /bin/echo 0 > cpuset.cpu_exclusive -> unset flag 'cpuset.cpu_exclusive' - -2.4 Attaching processes ------------------------ - -:: - - # /bin/echo PID > tasks - -Note that it is PID, not PIDs. You can only attach ONE task at a time. -If you have several tasks to attach, you have to do it one after another:: - - # /bin/echo PID1 > tasks - # /bin/echo PID2 > tasks - ... - # /bin/echo PIDn > tasks - - -3. Questions -============ - -Q: - what's up with this '/bin/echo' ? - -A: - bash's builtin 'echo' command does not check calls to write() against - errors. If you use it in the cpuset file system, you won't be - able to tell whether a command succeeded or failed. - -Q: - When I attach processes, only the first of the line gets really attached ! - -A: - We can only return one error code per call to write(). So you should also - put only ONE pid. - -4. 
Contact -========== - -Web: http://www.bullopensource.org/cpuset diff --git a/Documentation/cgroup-v1/devices.rst b/Documentation/cgroup-v1/devices.rst deleted file mode 100644 index e1886783961e..000000000000 --- a/Documentation/cgroup-v1/devices.rst +++ /dev/null @@ -1,132 +0,0 @@ -=========================== -Device Whitelist Controller -=========================== - -1. Description -============== - -Implement a cgroup to track and enforce open and mknod restrictions -on device files. A device cgroup associates a device access -whitelist with each cgroup. A whitelist entry has 4 fields. -'type' is a (all), c (char), or b (block). 'all' means it applies -to all types and all major and minor numbers. Major and minor are -either an integer or * for all. Access is a composition of r -(read), w (write), and m (mknod). - -The root device cgroup starts with rwm to 'all'. A child device -cgroup gets a copy of the parent. Administrators can then remove -devices from the whitelist or add new entries. A child cgroup can -never receive a device access which is denied by its parent. - -2. User Interface -================= - -An entry is added using devices.allow, and removed using -devices.deny. For instance:: - - echo 'c 1:3 mr' > /sys/fs/cgroup/1/devices.allow - -allows cgroup 1 to read and mknod the device usually known as -/dev/null. Doing:: - - echo a > /sys/fs/cgroup/1/devices.deny - -will remove the default 'a *:* rwm' entry. Doing:: - - echo a > /sys/fs/cgroup/1/devices.allow - -will add the 'a *:* rwm' entry to the whitelist. - -3. Security -=========== - -Any task can move itself between cgroups. This clearly won't -suffice, but we can decide the best way to adequately restrict -movement as people get some experience with this. We may just want -to require CAP_SYS_ADMIN, which at least is a separate bit from -CAP_MKNOD. We may want to just refuse moving to a cgroup which -isn't a descendant of the current one. Or we may want to use -CAP_MAC_ADMIN, since we really are trying to lock down root. - -CAP_SYS_ADMIN is needed to modify the whitelist or move another -task to a new cgroup. (Again we'll probably want to change that). - -A cgroup may not be granted more permissions than the cgroup's -parent has. - -4. Hierarchy -============ - -device cgroups maintain hierarchy by making sure a cgroup never has more -access permissions than its parent. Every time an entry is written to -a cgroup's devices.deny file, all its children will have that entry removed -from their whitelist and all the locally set whitelist entries will be -re-evaluated. In case one of the locally set whitelist entries would provide -more access than the cgroup's parent, it'll be removed from the whitelist. - -Example:: - - A - / \ - B - - group behavior exceptions - A allow "b 8:* rwm", "c 116:1 rw" - B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" - -If a device is denied in group A:: - - # echo "c 116:* r" > A/devices.deny - -it'll propagate down and after revalidating B's entries, the whitelist entry -"c 116:2 rwm" will be removed:: - - group whitelist entries denied devices - A all "b 8:* rwm", "c 116:* rw" - B "c 1:3 rwm", "b 3:* rwm" all the rest - -In case parent's exceptions change and local exceptions are not allowed -anymore, they'll be deleted. 
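 - -The effect of such propagation can be verified by reading back devices.list, the controller's read-only listing file (a sketch; the mount point is an assumption, and groups A and B follow the example above):: - - # cat /sys/fs/cgroup/devices/A/B/devices.list - c 1:3 rwm - b 3:* rwm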
- -Notice that new whitelist entries will not be propagated:: - - A - / \ - B - - group whitelist entries denied devices - A "c 1:3 rwm", "c 1:5 r" all the rest - B "c 1:3 rwm", "c 1:5 r" all the rest - -when adding ``c *:3 rwm``:: - - # echo "c *:3 rwm" >A/devices.allow - -the result:: - - group whitelist entries denied devices - A "c *:3 rwm", "c 1:5 r" all the rest - B "c 1:3 rwm", "c 1:5 r" all the rest - -but now it'll be possible to add new entries to B:: - - # echo "c 2:3 rwm" >B/devices.allow - # echo "c 50:3 r" >B/devices.allow - -or even:: - - # echo "c *:3 rwm" >B/devices.allow - -Allowing or denying all by writing 'a' to devices.allow or devices.deny will -not be possible once the device cgroup has children. - -4.1 Hierarchy (internal implementation) --------------------------------------- - -The device cgroup is implemented internally using a behavior (ALLOW, DENY) and a -list of exceptions. The internal state is controlled using the same user -interface to preserve compatibility with the previous whitelist-only -implementation. Removal or addition of exceptions that will reduce the access -to devices will be propagated down the hierarchy. -For every propagated exception, the effective rules will be re-evaluated based -on the current parent's access rules. diff --git a/Documentation/cgroup-v1/freezer-subsystem.rst b/Documentation/cgroup-v1/freezer-subsystem.rst deleted file mode 100644 index 582d3427de3f..000000000000 --- a/Documentation/cgroup-v1/freezer-subsystem.rst +++ /dev/null @@ -1,127 +0,0 @@ -============== -Cgroup Freezer -============== - -The cgroup freezer is useful to batch job management systems which start -and stop sets of tasks in order to schedule the resources of a machine -according to the desires of a system administrator. This sort of program -is often used on HPC clusters to schedule access to the cluster as a -whole. The cgroup freezer uses cgroups to describe the set of tasks to -be started/stopped by the batch job management system. It also provides -a means to start and stop the tasks composing the job. - -The cgroup freezer will also be useful for checkpointing running groups -of tasks. The freezer allows the checkpoint code to obtain a consistent -image of the tasks by attempting to force the tasks in a cgroup into a -quiescent state. Once the tasks are quiescent, another task can -walk /proc or invoke a kernel interface to gather information about the -quiesced tasks. Checkpointed tasks can be restarted later should a -recoverable error occur. This also allows the checkpointed tasks to be -migrated between nodes in a cluster by copying the gathered information -to another node and restarting the tasks there. - -Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping -and resuming tasks in userspace. Both of these signals are observable -from within the tasks we wish to freeze. While SIGSTOP cannot be caught, -blocked, or ignored, it can be seen by waiting or ptracing parent tasks. -SIGCONT is especially unsuitable since it can be caught by the task. Any -programs designed to watch for SIGSTOP and SIGCONT could be broken by -attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can -demonstrate this problem using nested bash shells:: - - $ echo $$ - 16644 - $ bash - $ echo $$ - 16690 - - From a second, unrelated bash shell: - $ kill -SIGSTOP 16690 - $ kill -SIGCONT 16690 - - - -This happens because bash can observe both signals and choose how it -responds to them.
- -Another example of a program which catches and responds to these -signals is gdb. In fact any program designed to use ptrace is likely to -have a problem with this method of stopping and resuming tasks. - -In contrast, the cgroup freezer uses the kernel freezer code to -prevent the freeze/unfreeze cycle from becoming visible to the tasks -being frozen. This allows the bash example above and gdb to run as -expected. - -The cgroup freezer is hierarchical. Freezing a cgroup freezes all -tasks belonging to the cgroup and all its descendant cgroups. Each -cgroup has its own state (self-state) and the state inherited from the -parent (parent-state). Iff both states are THAWED, the cgroup is -THAWED. - -The following cgroupfs files are created by cgroup freezer. - -* freezer.state: Read-write. - - When read, returns the effective state of the cgroup - "THAWED", - "FREEZING" or "FROZEN". This is the combined self and parent-states. - If any is freezing, the cgroup is freezing (FREEZING or FROZEN). - - FREEZING cgroup transitions into FROZEN state when all tasks - belonging to the cgroup and its descendants become frozen. Note that - a cgroup reverts to FREEZING from FROZEN after a new task is added - to the cgroup or one of its descendant cgroups until the new task is - frozen. - - When written, sets the self-state of the cgroup. Two values are - allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup, - if not already freezing, enters FREEZING state along with all its - descendant cgroups. - - If THAWED is written, the self-state of the cgroup is changed to - THAWED. Note that the effective state may not change to THAWED if - the parent-state is still freezing. If a cgroup's effective state - becomes THAWED, all its descendants which are freezing because of - the cgroup also leave the freezing state. - -* freezer.self_freezing: Read only. - - Shows the self-state. 0 if the self-state is THAWED; otherwise, 1. - This value is 1 iff the last write to freezer.state was "FROZEN". - -* freezer.parent_freezing: Read only. - - Shows the parent-state. 0 if none of the cgroup's ancestors is - frozen; otherwise, 1. - -The root cgroup is non-freezable and the above interface files don't -exist. - -* Examples of usage:: - - # mkdir /sys/fs/cgroup/freezer - # mount -t cgroup -ofreezer freezer /sys/fs/cgroup/freezer - # mkdir /sys/fs/cgroup/freezer/0 - # echo $some_pid > /sys/fs/cgroup/freezer/0/tasks - -to get status of the freezer subsystem:: - - # cat /sys/fs/cgroup/freezer/0/freezer.state - THAWED - -to freeze all tasks in the container:: - - # echo FROZEN > /sys/fs/cgroup/freezer/0/freezer.state - # cat /sys/fs/cgroup/freezer/0/freezer.state - FREEZING - # cat /sys/fs/cgroup/freezer/0/freezer.state - FROZEN - -to unfreeze all tasks in the container:: - - # echo THAWED > /sys/fs/cgroup/freezer/0/freezer.state - # cat /sys/fs/cgroup/freezer/0/freezer.state - THAWED - -This is the basic mechanism which should do the right thing for user space task -in a simple scenario. diff --git a/Documentation/cgroup-v1/hugetlb.rst b/Documentation/cgroup-v1/hugetlb.rst deleted file mode 100644 index a3902aa253a9..000000000000 --- a/Documentation/cgroup-v1/hugetlb.rst +++ /dev/null @@ -1,50 +0,0 @@ -================== -HugeTLB Controller -================== - -The HugeTLB controller allows to limit the HugeTLB usage per control group and -enforces the controller limit during page fault. 
Since HugeTLB doesn't -support page reclaim, enforcing the limit at page fault time implies that -the application will get a SIGBUS signal if it tries to access HugeTLB pages -beyond its limit. This requires the application to know beforehand how many -HugeTLB pages it would require for its use. - -The HugeTLB controller can be created by first mounting the cgroup filesystem:: - - # mount -t cgroup -o hugetlb none /sys/fs/cgroup - -With the above step, the initial or the parent HugeTLB group becomes -visible at /sys/fs/cgroup. At bootup, this group includes all the tasks in -the system. /sys/fs/cgroup/tasks lists the tasks in this cgroup. - -New groups can be created under the parent group /sys/fs/cgroup:: - - # cd /sys/fs/cgroup - # mkdir g1 - # echo $$ > g1/tasks - -The above steps create a new group g1 and move the current shell -process (bash) into it. - -Brief summary of control files:: - - hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage - hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded - hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb - hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to the HugeTLB limit - -For a system supporting three hugepage sizes (64k, 32M and 1G), the control -files include:: - - hugetlb.1GB.limit_in_bytes - hugetlb.1GB.max_usage_in_bytes - hugetlb.1GB.usage_in_bytes - hugetlb.1GB.failcnt - hugetlb.64KB.limit_in_bytes - hugetlb.64KB.max_usage_in_bytes - hugetlb.64KB.usage_in_bytes - hugetlb.64KB.failcnt - hugetlb.32MB.limit_in_bytes - hugetlb.32MB.max_usage_in_bytes - hugetlb.32MB.usage_in_bytes - hugetlb.32MB.failcnt diff --git a/Documentation/cgroup-v1/index.rst b/Documentation/cgroup-v1/index.rst deleted file mode 100644 index fe76d42edc11..000000000000 --- a/Documentation/cgroup-v1/index.rst +++ /dev/null @@ -1,30 +0,0 @@ -:orphan: - -======================== -Control Groups version 1 -======================== - -.. toctree:: - :maxdepth: 1 - - cgroups - - blkio-controller - cpuacct - cpusets - devices - freezer-subsystem - hugetlb - memcg_test - memory - net_cls - net_prio - pids - rdma - -.. only:: subproject and html - - Indices - ======= - - * :ref:`genindex` diff --git a/Documentation/cgroup-v1/memcg_test.rst b/Documentation/cgroup-v1/memcg_test.rst deleted file mode 100644 index 91bd18c6a514..000000000000 --- a/Documentation/cgroup-v1/memcg_test.rst +++ /dev/null @@ -1,355 +0,0 @@ -===================================================== -Memory Resource Controller(Memcg) Implementation Memo -===================================================== - -Last Updated: 2010/2 - -Base Kernel Version: based on 2.6.33-rc7-mm(candidate for 34). - -Because VM is getting complex (one of the reasons is memcg...), memcg's behavior -is complex. This is a document for memcg's internal behavior. -Please note that implementation details can be changed. - -(*) Topics on the API should be in Documentation/cgroup-v1/memory.rst - -0. How to record usage ? -======================== - - 2 objects are used. - - page_cgroup ....an object per page. - - Allocated at boot or memory hotplug. Freed at memory hot removal. - - swap_cgroup ... an entry per swp_entry. - - Allocated at swapon(). Freed at swapoff(). - - The page_cgroup has a USED bit, and a double count against a page_cgroup never - occurs. swap_cgroup is used only when a charged page is swapped out. - -1. Charge -========= - - a page/swp_entry may be charged (usage += PAGE_SIZE) at - - mem_cgroup_try_charge() - -2.
Uncharge -=========== - - a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by - - mem_cgroup_uncharge() - Called when a page's refcount goes down to 0. - - mem_cgroup_uncharge_swap() - Called when swp_entry's refcnt goes down to 0. A charge against swap - disappears. - -3. charge-commit-cancel -======================= - - Memcg pages are charged in two steps: - - - mem_cgroup_try_charge() - - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge() - - At try_charge(), there are no flags to say "this page is charged". - at this point, usage += PAGE_SIZE. - - At commit(), the page is associated with the memcg. - - At cancel(), simply usage -= PAGE_SIZE. - -Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. - -4. Anonymous -============ - - Anonymous page is newly allocated at - - page fault into MAP_ANONYMOUS mapping. - - Copy-On-Write. - - 4.1 Swap-in. - At swap-in, the page is taken from swap-cache. There are 2 cases. - - (a) If the SwapCache is newly allocated and read, it has no charges. - (b) If the SwapCache has been mapped by processes, it has been - charged already. - - 4.2 Swap-out. - At swap-out, typical state transition is below. - - (a) add to swap cache. (marked as SwapCache) - swp_entry's refcnt += 1. - (b) fully unmapped. - swp_entry's refcnt += # of ptes. - (c) write back to swap. - (d) delete from swap cache. (remove from SwapCache) - swp_entry's refcnt -= 1. - - - Finally, at task exit, - (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. - -5. Page Cache -============= - - Page Cache is charged at - - add_to_page_cache_locked(). - - The logic is very clear. (About migration, see below) - - Note: - __remove_from_page_cache() is called by remove_from_page_cache() - and __remove_mapping(). - -6. Shmem(tmpfs) Page Cache -=========================== - - The best way to understand shmem's page state transition is to read - mm/shmem.c. - - But brief explanation of the behavior of memcg around shmem will be - helpful to understand the logic. - - Shmem's page (just leaf page, not direct/indirect block) can be on - - - radix-tree of shmem's inode. - - SwapCache. - - Both on radix-tree and SwapCache. This happens at swap-in - and swap-out, - - It's charged when... - - - A new page is added to shmem's radix-tree. - - A swp page is read. (move a charge from swap_cgroup to page_cgroup) - -7. Page Migration -================= - - mem_cgroup_migrate() - -8. LRU -====== - Each memcg has its own private LRU. Now, its handling is under global - VM's control (means that it's handled under global pgdat->lru_lock). - Almost all routines around memcg's LRU is called by global LRU's - list management functions under pgdat->lru_lock. - - A special function is mem_cgroup_isolate_pages(). This scans - memcg's private LRU and call __isolate_lru_page() to extract a page - from LRU. - - (By __isolate_lru_page(), the page is removed from both of global and - private LRU.) - - -9. Typical Tests. -================= - - Tests for racy cases. - -9.1 Small limit to memcg. -------------------------- - - When you do test to do racy case, it's good test to set memcg's limit - to be very small rather than GB. Many races found in the test under - xKB or xxMB limits. - - (Memory behavior under GB and Memory behavior under MB shows very - different situation.) - -9.2 Shmem ---------- - - Historically, memcg's shmem handling was poor and we saw some amount - of troubles here. This is because shmem is page-cache but can be - SwapCache. Test with shmem/tmpfs is always good test. 
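 - -A minimal sketch of such a shmem test (the mount points, group name and sizes here are illustrative assumptions, not part of the original memo):: - - # mount -t cgroup -o memory none /cgroup - # mkdir /cgroup/shm_test - # echo 32M > /cgroup/shm_test/memory.limit_in_bytes - # echo $$ > /cgroup/shm_test/tasks - # mount -t tmpfs -o size=64M none /mnt/shm - # dd if=/dev/zero of=/mnt/shm/file bs=1M count=48 # pushes shmem into swap cache - # grep -E 'cache|swap' /cgroup/shm_test/memory.stat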
- -9.3 Migration -------------- - - For NUMA, migration is another special case. Cpusets are useful for an easy -test. The following is a sample script to do migration:: - - mount -t cgroup -o cpuset none /opt/cpuset - - mkdir /opt/cpuset/01 - echo 1 > /opt/cpuset/01/cpuset.cpus - echo 0 > /opt/cpuset/01/cpuset.mems - echo 1 > /opt/cpuset/01/cpuset.memory_migrate - mkdir /opt/cpuset/02 - echo 1 > /opt/cpuset/02/cpuset.cpus - echo 1 > /opt/cpuset/02/cpuset.mems - echo 1 > /opt/cpuset/02/cpuset.memory_migrate - - With the above setup, when you move a task from 01 to 02, page migration from -node 0 to node 1 will occur. The following is a script to migrate all tasks - under a cpuset:: - - -- - move_task() - { - for pid in $1 - do - /bin/echo $pid >$2/tasks 2>/dev/null - echo -n $pid - echo -n " " - done - echo END - } - - G1_TASK=`cat ${G1}/tasks` - G2_TASK=`cat ${G2}/tasks` - move_task "${G1_TASK}" ${G2} & - -- - -9.4 Memory hotplug ------------------- - - The memory hotplug test is another good test. - - To offline memory, do the following:: - - # echo offline > /sys/devices/system/memory/memoryXXX/state - - (XXX identifies the memory section) - - This is an easy way to test page migration, too. - -9.5 mkdir/rmdir --------------- - - When using hierarchy, the mkdir/rmdir test should be done. - Use tests like the following:: - - echo 1 >/opt/cgroup/01/memory/use_hierarchy - mkdir /opt/cgroup/01/child_a - mkdir /opt/cgroup/01/child_b - - set a limit on 01. - add a limit to 01/child_b - run jobs under child_a and child_b - - create/delete the following groups at random while jobs are running:: - - /opt/cgroup/01/child_a/child_aa - /opt/cgroup/01/child_b/child_bb - /opt/cgroup/01/child_c - - Running new jobs in a new group is also good. - -9.6 Mount with other subsystems ------------------------------- - - Mounting with other subsystems is a good test because there are race and - lock dependencies with other cgroup subsystems. - - example:: - - # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices - - and do task moves, mkdir, rmdir, etc. under this. - -9.7 swapoff ----------- - - Besides the management of swap being one of the complicated parts of memcg, - the call path of swap-in at swapoff is not the same as the usual swap-in path. - It's worth testing explicitly. - - For example, a test like the following is good: - - (Shell-A):: - - # mount -t cgroup none /cgroup -o memory - # mkdir /cgroup/test - # echo 40M > /cgroup/test/memory.limit_in_bytes - # echo 0 > /cgroup/test/tasks - - Run a malloc(100M) program under this. You'll see 60M of swap usage. - - (Shell-B):: - - # move all tasks in /cgroup/test to /cgroup - # /sbin/swapoff -a - # rmdir /cgroup/test - # kill malloc task. - - Of course, the tmpfs vs. swapoff case should be tested, too. - -9.8 OOM-Killer -------------- - - Out-of-memory caused by a memcg's limit will kill tasks under - the memcg. When hierarchy is used, a task under the hierarchy - will be killed by the kernel. - - In this case, panic_on_oom shouldn't be invoked and tasks - in other groups shouldn't be killed. - - It's not difficult to cause OOM under memcg, as follows. - - Case A) when you can swapoff:: - - # swapoff -a - # echo 50M > /memory.limit_in_bytes - - run 51M of malloc - - Case B) when you use mem+swap limitation:: - - # echo 50M > memory.limit_in_bytes - # echo 50M > memory.memsw.limit_in_bytes - - run 51M of malloc - -9.9 Move charges at task migration ---------------------------------- - - Charges associated with a task can be moved along with task migration.
- - (Shell-A):: - - #mkdir /cgroup/A - #echo $$ >/cgroup/A/tasks - - run some programs which uses some amount of memory in /cgroup/A. - - (Shell-B):: - - #mkdir /cgroup/B - #echo 1 >/cgroup/B/memory.move_charge_at_immigrate - #echo "pid of the program running in group A" >/cgroup/B/tasks - - You can see charges have been moved by reading ``*.usage_in_bytes`` or - memory.stat of both A and B. - - See 8.2 of Documentation/cgroup-v1/memory.rst to see what value should - be written to move_charge_at_immigrate. - -9.10 Memory thresholds ----------------------- - - Memory controller implements memory thresholds using cgroups notification - API. You can use tools/cgroup/cgroup_event_listener.c to test it. - - (Shell-A) Create cgroup and run event listener:: - - # mkdir /cgroup/A - # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M - - (Shell-B) Add task to cgroup and try to allocate and free memory:: - - # echo $$ >/cgroup/A/tasks - # a="$(dd if=/dev/zero bs=1M count=10)" - # a= - - You will see message from cgroup_event_listener every time you cross - the thresholds. - - Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds. - - It's good idea to test root cgroup as well. diff --git a/Documentation/cgroup-v1/memory.rst b/Documentation/cgroup-v1/memory.rst deleted file mode 100644 index 41bdc038dad9..000000000000 --- a/Documentation/cgroup-v1/memory.rst +++ /dev/null @@ -1,1003 +0,0 @@ -========================== -Memory Resource Controller -========================== - -NOTE: - This document is hopelessly outdated and it asks for a complete - rewrite. It still contains a useful information so we are keeping it - here but make sure to check the current code if you need a deeper - understanding. - -NOTE: - The Memory Resource Controller has generically been referred to as the - memory controller in this document. Do not confuse memory controller - used here with the memory controller that is used in hardware. - -(For editors) In this document: - When we mention a cgroup (cgroupfs's directory) with memory controller, - we call it "memory cgroup". When you see git-log and source code, you'll - see patch's title and function names tend to use "memcg". - In this document, we avoid using it. - -Benefits and Purpose of the memory controller -============================================= - -The memory controller isolates the memory behaviour of a group of tasks -from the rest of the system. The article on LWN [12] mentions some probable -uses of the memory controller. The memory controller can be used to - -a. Isolate an application or a group of applications - Memory-hungry applications can be isolated and limited to a smaller - amount of memory. -b. Create a cgroup with a limited amount of memory; this can be used - as a good alternative to booting with mem=XXXX. -c. Virtualization solutions can control the amount of memory they want - to assign to a virtual machine instance. -d. A CD/DVD burner could control the amount of memory used by the - rest of the system to ensure that burning does not fail due to lack - of available memory. -e. There are several other use cases; find one or use the controller just - for fun (to learn and hack on the VM subsystem). - -Current Status: linux-2.6.34-mmotm(development version of 2010/April) - -Features: - - - accounting anonymous pages, file caches, swap caches usage and limiting them. - - pages are linked to per-memcg LRU exclusively, and there is no global LRU. - - optionally, memory+swap usage can be accounted and limited. 
- - hierarchical accounting - - soft limit - - moving (recharging) account at moving a task is selectable. - - usage threshold notifier - - memory pressure notifier - - oom-killer disable knob and oom-notifier - - Root cgroup has no limit controls. - - Kernel memory support is a work in progress, and the current version provides - basically functionality. (See Section 2.7) - -Brief summary of control files. - -==================================== ========================================== - tasks attach a task(thread) and show list of - threads - cgroup.procs show list of processes - cgroup.event_control an interface for event_fd() - memory.usage_in_bytes show current usage for memory - (See 5.5 for details) - memory.memsw.usage_in_bytes show current usage for memory+Swap - (See 5.5 for details) - memory.limit_in_bytes set/show limit of memory usage - memory.memsw.limit_in_bytes set/show limit of memory+Swap usage - memory.failcnt show the number of memory usage hits limits - memory.memsw.failcnt show the number of memory+Swap hits limits - memory.max_usage_in_bytes show max memory usage recorded - memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded - memory.soft_limit_in_bytes set/show soft limit of memory usage - memory.stat show various statistics - memory.use_hierarchy set/show hierarchical account enabled - memory.force_empty trigger forced page reclaim - memory.pressure_level set memory pressure notifications - memory.swappiness set/show swappiness parameter of vmscan - (See sysctl's vm.swappiness) - memory.move_charge_at_immigrate set/show controls of moving charges - memory.oom_control set/show oom controls. - memory.numa_stat show the number of memory usage per numa - node - - memory.kmem.limit_in_bytes set/show hard limit for kernel memory - memory.kmem.usage_in_bytes show current kernel memory allocation - memory.kmem.failcnt show the number of kernel memory usage - hits limits - memory.kmem.max_usage_in_bytes show max kernel memory usage recorded - - memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory - memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation - memory.kmem.tcp.failcnt show the number of tcp buf memory usage - hits limits - memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded -==================================== ========================================== - -1. History -========== - -The memory controller has a long history. A request for comments for the memory -controller was posted by Balbir Singh [1]. At the time the RFC was posted -there were several implementations for memory control. The goal of the -RFC was to build consensus and agreement for the minimal features required -for memory control. The first RSS controller was posted by Balbir Singh[2] -in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the -RSS controller. At OLS, at the resource management BoF, everyone suggested -that we handle both page cache and RSS together. Another request was raised -to allow user space handling of OOM. The current memory controller is -at version 6; it combines both mapped (RSS) and unmapped Page -Cache Control [11]. - -2. Memory Control -================= - -Memory is a unique resource in the sense that it is present in a limited -amount. If a task requires a lot of CPU processing, the task can spread -its processing over a period of hours, days, months or years, but with -memory, the same physical memory needs to be reused to accomplish the task. 
- -The memory controller implementation has been divided into phases. These -are: - -1. Memory controller -2. mlock(2) controller -3. Kernel user memory accounting and slab control -4. user mappings length controller - -The memory controller is the first controller developed. - -2.1. Design ------------ - -The core of the design is a counter called the page_counter. The -page_counter tracks the current memory usage and limit of the group of -processes associated with the controller. Each cgroup has a memory controller -specific data structure (mem_cgroup) associated with it. - -2.2. Accounting ---------------- - -:: - - +--------------------+ - | mem_cgroup | - | (page_counter) | - +--------------------+ - / ^ \ - / | \ - +---------------+ | +---------------+ - | mm_struct | |.... | mm_struct | - | | | | | - +---------------+ | +---------------+ - | - + --------------+ - | - +---------------+ +------+--------+ - | page +----------> page_cgroup| - | | | | - +---------------+ +---------------+ - - (Figure 1: Hierarchy of Accounting) - - -Figure 1 shows the important aspects of the controller - -1. Accounting happens per cgroup -2. Each mm_struct knows about which cgroup it belongs to -3. Each page has a pointer to the page_cgroup, which in turn knows the - cgroup it belongs to - -The accounting is done as follows: mem_cgroup_charge_common() is invoked to -set up the necessary data structures and check if the cgroup that is being -charged is over its limit. If it is, then reclaim is invoked on the cgroup. -More details can be found in the reclaim section of this document. -If everything goes well, a page meta-data-structure called page_cgroup is -updated. page_cgroup has its own LRU on cgroup. -(*) page_cgroup structure is allocated at boot/memory-hotplug time. - -2.2.1 Accounting details ------------------------- - -All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. -Some pages which are never reclaimable and will not be on the LRU -are not accounted. We just account pages under usual VM management. - -RSS pages are accounted at page_fault unless they've already been accounted -for earlier. A file page will be accounted for as Page Cache when it's -inserted into inode (radix-tree). While it's mapped into the page tables of -processes, duplicate accounting is carefully avoided. - -An RSS page is unaccounted when it's fully unmapped. A PageCache page is -unaccounted when it's removed from radix-tree. Even if RSS pages are fully -unmapped (by kswapd), they may exist as SwapCache in the system until they -are really freed. Such SwapCaches are also accounted. -A swapped-in page is not accounted until it's mapped. - -Note: The kernel does swapin-readahead and reads multiple swaps at once. -This means swapped-in pages may contain pages for other tasks than a task -causing page fault. So, we avoid accounting at swap-in I/O. - -At page migration, accounting information is kept. - -Note: we just account pages-on-LRU because our purpose is to control amount -of used pages; not-on-LRU pages tend to be out-of-control from VM view. - -2.3 Shared Page Accounting --------------------------- - -Shared pages are accounted on the basis of the first touch approach. The -cgroup that first touches a page is accounted for the page. The principle -behind this approach is that a cgroup that aggressively uses a shared -page will eventually get charged for it (once it is uncharged from -the cgroup that brought it in -- this will happen on memory pressure). 
- -But see section 8.2: when moving a task to another cgroup, its pages may -be recharged to the new cgroup, if move_charge_at_immigrate has been chosen. - -Exception: If CONFIG_MEMCG_SWAP is not used. -When you do swapoff and make swapped-out pages of shmem(tmpfs) to -be backed into memory in force, charges for pages are accounted against the -caller of swapoff rather than the users of shmem. - -2.4 Swap Extension (CONFIG_MEMCG_SWAP) --------------------------------------- - -Swap Extension allows you to record charge for swap. A swapped-in page is -charged back to original page allocator if possible. - -When swap is accounted, following files are added. - - - memory.memsw.usage_in_bytes. - - memory.memsw.limit_in_bytes. - -memsw means memory+swap. Usage of memory+swap is limited by -memsw.limit_in_bytes. - -Example: Assume a system with 4G of swap. A task which allocates 6G of memory -(by mistake) under 2G memory limitation will use all swap. -In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap. -By using the memsw limit, you can avoid system OOM which can be caused by swap -shortage. - -**why 'memory+swap' rather than swap** - -The global LRU(kswapd) can swap out arbitrary pages. Swap-out means -to move account from memory to swap...there is no change in usage of -memory+swap. In other words, when we want to limit the usage of swap without -affecting global LRU, memory+swap limit is better than just limiting swap from -an OS point of view. - -**What happens when a cgroup hits memory.memsw.limit_in_bytes** - -When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out -in this cgroup. Then, swap-out will not be done by cgroup routine and file -caches are dropped. But as mentioned above, global LRU can do swapout memory -from it for sanity of the system's memory management state. You can't forbid -it by cgroup. - -2.5 Reclaim ------------ - -Each cgroup maintains a per cgroup LRU which has the same structure as -global VM. When a cgroup goes over its limit, we first try -to reclaim memory from the cgroup so as to make space for the new -pages that the cgroup has touched. If the reclaim is unsuccessful, -an OOM routine is invoked to select and kill the bulkiest task in the -cgroup. (See 10. OOM Control below.) - -The reclaim algorithm has not been modified for cgroups, except that -pages that are selected for reclaiming come from the per-cgroup LRU -list. - -NOTE: - Reclaim does not work for the root cgroup, since we cannot set any - limits on the root cgroup. - -Note2: - When panic_on_oom is set to "2", the whole system will panic. - -When oom event notifier is registered, event will be delivered. -(See oom_control section) - -2.6 Locking ------------ - - lock_page_cgroup()/unlock_page_cgroup() should not be called under - the i_pages lock. - - Other lock order is following: - - PG_locked. - mm->page_table_lock - pgdat->lru_lock - lock_page_cgroup. - - In many cases, just lock_page_cgroup() is called. - - per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by - pgdat->lru_lock, it has no lock of its own. - -2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) ------------------------------------------------ - -With the Kernel memory extension, the Memory Controller is able to limit -the amount of kernel memory used by the system. Kernel memory is fundamentally -different than user memory, since it can't be swapped out, which makes it -possible to DoS the system by consuming too much of this precious resource. 
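 - -As a sketch of the interface described below (the group name and the limit value are hypothetical; the memory.kmem.* files are those listed in the summary table above):: - - # mkdir /sys/fs/cgroup/memory/kmem_test - # echo 100M > /sys/fs/cgroup/memory/kmem_test/memory.kmem.limit_in_bytes - # cat /sys/fs/cgroup/memory/kmem_test/memory.kmem.usage_in_bytes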
- -Kernel memory accounting is enabled for all memory cgroups by default. But -it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel -at boot time. In this case, kernel memory will not be accounted at all. - -Kernel memory limits are not imposed for the root cgroup. Usage for the root -cgroup may or may not be accounted. The memory used is accumulated into -memory.kmem.usage_in_bytes, or in a separate counter when it makes sense. -(currently only for tcp). - -The main "kmem" counter is fed into the main counter, so kmem charges will -also be visible from the user counter. - -Currently no soft limit is implemented for kernel memory. It is future work -to trigger slab reclaim when those limits are reached. - -2.7.1 Current Kernel Memory resources accounted ------------------------------------------------ - -stack pages: - every process consumes some stack pages. By accounting into - kernel memory, we prevent new processes from being created when the kernel - memory usage is too high. - -slab pages: - pages allocated by the SLAB or SLUB allocator are tracked. A copy - of each kmem_cache is created every time the cache is touched by the first time - from inside the memcg. The creation is done lazily, so some objects can still be - skipped while the cache is being created. All objects in a slab page should - belong to the same memcg. This only fails to hold when a task is migrated to a - different memcg during the page allocation by the cache. - -sockets memory pressure: - some sockets protocols have memory pressure - thresholds. The Memory Controller allows them to be controlled individually - per cgroup, instead of globally. - -tcp memory pressure: - sockets memory pressure for the tcp protocol. - -2.7.2 Common use cases ----------------------- - -Because the "kmem" counter is fed to the main user counter, kernel memory can -never be limited completely independently of user memory. Say "U" is the user -limit, and "K" the kernel limit. There are three possible ways limits can be -set: - -U != 0, K = unlimited: - This is the standard memcg limitation mechanism already present before kmem - accounting. Kernel memory is completely ignored. - -U != 0, K < U: - Kernel memory is a subset of the user memory. This setup is useful in - deployments where the total amount of memory per-cgroup is overcommited. - Overcommiting kernel memory limits is definitely not recommended, since the - box can still run out of non-reclaimable memory. - In this case, the admin could set up K so that the sum of all groups is - never greater than the total memory, and freely set U at the cost of his - QoS. - -WARNING: - In the current implementation, memory reclaim will NOT be - triggered for a cgroup when it hits K while staying below U, which makes - this setup impractical. - -U != 0, K >= U: - Since kmem charges will also be fed to the user counter and reclaim will be - triggered for the cgroup for both kinds of memory. This setup gives the - admin a unified view of memory, and it is also useful for people who just - want to track kernel memory usage. - -3. User Interface -================= - -3.0. Configuration ------------------- - -a. Enable CONFIG_CGROUPS -b. Enable CONFIG_MEMCG -c. Enable CONFIG_MEMCG_SWAP (to use swap extension) -d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) - -3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) 
-------------------------------------------------------------------- - -:: - - # mount -t tmpfs none /sys/fs/cgroup - # mkdir /sys/fs/cgroup/memory - # mount -t cgroup none /sys/fs/cgroup/memory -o memory - -3.2. Make the new group and move bash into it:: - - # mkdir /sys/fs/cgroup/memory/0 - # echo $$ > /sys/fs/cgroup/memory/0/tasks - -Since now we're in the 0 cgroup, we can alter the memory limit:: - - # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes - -NOTE: - We can use a suffix (k, K, m, M, g or G) to indicate values in kilo, - mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes, - Gibibytes.) - -NOTE: - We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``. - -NOTE: - We cannot set limits on the root cgroup any more. - -:: - - # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes - 4194304 - -We can check the usage:: - - # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes - 1216512 - -A successful write to this file does not guarantee a successful setting of -this limit to the value written into the file. This can be due to a -number of factors, such as rounding up to page boundaries or the total -availability of memory on the system. The user is required to re-read -this file after a write to guarantee the value committed by the kernel:: - - # echo 1 > memory.limit_in_bytes - # cat memory.limit_in_bytes - 4096 - -The memory.failcnt field gives the number of times that the cgroup limit was -exceeded. - -The memory.stat file gives accounting information. Now, the number of -caches, RSS and Active pages/Inactive pages are shown. - -4. Testing -========== - -For testing features and implementation, see memcg_test.txt. - -Performance test is also important. To see pure memory controller's overhead, -testing on tmpfs will give you good numbers of small overheads. -Example: do kernel make on tmpfs. - -Page-fault scalability is also important. At measuring parallel -page fault test, multi-process test may be better than multi-thread -test because it has noise of shared objects/status. - -But the above two are testing extreme situations. -Trying usual test under memory controller is always helpful. - -4.1 Troubleshooting -------------------- - -Sometimes a user might find that the application under a cgroup is -terminated by the OOM killer. There are several causes for this: - -1. The cgroup limit is too low (just too low to do anything useful) -2. The user is using anonymous memory and swap is turned off or too low - -A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of -some of the pages cached in the cgroup (page cache pages). - -To know what happens, disabling OOM_Kill as per "10. OOM Control" (below) and -seeing what happens will be helpful. - -4.2 Task migration ------------------- - -When a task migrates from one cgroup to another, its charge is not -carried forward by default. The pages allocated from the original cgroup still -remain charged to it, the charge is dropped when the page is freed or -reclaimed. - -You can move charges of a task along with task migration. -See 8. "Move charges at task migration" - -4.3 Removing a cgroup ---------------------- - -A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a -cgroup might have some charge associated with it, even though all -tasks have migrated away from it. (because we charge against pages, not -against tasks.) 
- -We move the stats to root (if use_hierarchy==0) or parent (if -use_hierarchy==1), with no change in the charge except for uncharging -from the child. - -Charges recorded in swap information are not updated at removal of a cgroup. -Recorded information is discarded, and a cgroup which uses swap (swapcache) -will be charged as a new owner of it. - -About use_hierarchy, see Section 6. - -5. Misc. interfaces -=================== - -5.1 force_empty ---------------- - The memory.force_empty interface is provided to make a cgroup's memory usage empty. - When writing anything to this:: - - # echo 0 > memory.force_empty - - the cgroup will be reclaimed and as many pages reclaimed as possible. - - The typical use case for this interface is before calling rmdir(). - Though rmdir() offlines the memcg, the memcg may still stay there due to - charged file caches. Some out-of-use page caches may stay charged until - memory pressure happens. If you want to avoid that, force_empty will be useful. - - Also, note that when memory.kmem.limit_in_bytes is set the charges due to - kernel pages will still be seen. This is not considered a failure and the - write will still return success. In this case, it is expected that - memory.kmem.usage_in_bytes == memory.usage_in_bytes. - - About use_hierarchy, see Section 6. - -5.2 stat file ------------- - -The memory.stat file includes the following statistics - -per-memory cgroup local status -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -=============== =============================================================== -cache # of bytes of page cache memory. -rss # of bytes of anonymous and swap cache memory (includes - transparent hugepages). -rss_huge # of bytes of anonymous transparent hugepages. -mapped_file # of bytes of mapped file (includes tmpfs/shmem) -pgpgin # of charging events to the memory cgroup. The charging - event happens each time a page is accounted as either mapped - anon page(RSS) or cache page(Page Cache) to the cgroup. -pgpgout # of uncharging events to the memory cgroup. The uncharging - event happens each time a page is unaccounted from the cgroup. -swap # of bytes of swap usage -dirty # of bytes that are waiting to get written back to the disk. -writeback # of bytes of file/anon cache that are queued for syncing to - disk. -inactive_anon # of bytes of anonymous and swap cache memory on inactive - LRU list. -active_anon # of bytes of anonymous and swap cache memory on active - LRU list. -inactive_file # of bytes of file-backed memory on inactive LRU list. -active_file # of bytes of file-backed memory on active LRU list. -unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). -=============== =============================================================== - -status considering hierarchy (see memory.use_hierarchy settings) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -========================= =================================================== -hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy - under which the memory cgroup is -hierarchical_memsw_limit # of bytes of memory+swap limit with regard to - hierarchy under which memory cgroup is. - -total_<counter> # hierarchical version of <counter>, which in - addition to the cgroup's own value includes the - sum of all hierarchical children's values of - <counter>, i.e. total_cache -========================= =================================================== - -The following additional stats are dependent on CONFIG_DEBUG_VM -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -========================= ======================================== -recent_rotated_anon VM internal parameter. (see mm/vmscan.c) -recent_rotated_file VM internal parameter. (see mm/vmscan.c) -recent_scanned_anon VM internal parameter. (see mm/vmscan.c) -recent_scanned_file VM internal parameter. (see mm/vmscan.c) -========================= ======================================== - -Memo: - recent_rotated means the recent frequency of LRU rotation. - recent_scanned means the recent # of scans of the LRU. - These are shown for easier debugging; please see the code for the exact meanings. - -Note: - Only anonymous and swap cache memory is listed as part of the 'rss' stat. - This should not be confused with the true 'resident set size' or the - amount of physical memory used by the cgroup. - - 'rss + mapped_file' will give you the resident set size of the cgroup. - - (Note: file and shmem may be shared among other cgroups. In that case, - mapped_file is accounted only when the memory cgroup is the owner of the page - cache.) - -5.3 swappiness -------------- - -Overrides /proc/sys/vm/swappiness for the particular group. The tunable -in the root cgroup corresponds to the global swappiness setting. - -Please note that unlike during the global reclaim, limit reclaim -enforces that a swappiness of 0 really prevents any swapping even if -swap storage is available. This might lead to the memcg OOM killer -if there are no file pages to reclaim. - -5.4 failcnt ----------- - -A memory cgroup provides memory.failcnt and memory.memsw.failcnt files. -This failcnt (== failure count) shows the number of times that a usage counter -hit its limit. When a memory cgroup hits a limit, failcnt increases and -memory under it will be reclaimed. - -You can reset failcnt by writing 0 to the failcnt file:: - - # echo 0 > .../memory.failcnt - -5.5 usage_in_bytes ------------------ - -For efficiency, like other kernel components, the memory cgroup uses some -optimizations to avoid unnecessary cacheline false sharing. usage_in_bytes is -affected by this method and doesn't show the 'exact' value of memory (and swap) -usage; it's a fuzz value for efficient access. (Of course, when necessary, it's -synchronized.) If you want to know the more exact memory usage, you should use -the RSS+CACHE(+SWAP) value in memory.stat (see 5.2). - -5.6 numa_stat ------------- - -This is similar to numa_maps but operates on a per-memcg basis. This is -useful for providing visibility into the numa locality information within -a memcg since the pages are allowed to be allocated from any physical -node. One of the use cases is evaluating application performance by -combining this information with the application's CPU allocation. - -Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable" -per-node page counts including "hierarchical_<counter>" which sums up all -hierarchical children's values in addition to the memcg's own value. - -The output format of memory.numa_stat is:: - - total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ... - file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ... - anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ... - unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ... - hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ... - -The "total" count is the sum of file + anon + unevictable. - -6. Hierarchy support -==================== - -The memory controller supports a deep hierarchy and hierarchical accounting. -The hierarchy is created by creating the appropriate cgroups in the -cgroup filesystem.
Consider for example, the following cgroup filesystem -hierarchy:: - - root - / | \ - / | \ - a b c - | \ - | \ - d e - -In the diagram above, with hierarchical accounting enabled, all memory -usage of e, is accounted to its ancestors up until the root (i.e, c and root), -that has memory.use_hierarchy enabled. If one of the ancestors goes over its -limit, the reclaim algorithm reclaims from the tasks in the ancestor and the -children of the ancestor. - -6.1 Enabling hierarchical accounting and reclaim ------------------------------------------------- - -A memory cgroup by default disables the hierarchy feature. Support -can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup:: - - # echo 1 > memory.use_hierarchy - -The feature can be disabled by:: - - # echo 0 > memory.use_hierarchy - -NOTE1: - Enabling/disabling will fail if either the cgroup already has other - cgroups created below it, or if the parent cgroup has use_hierarchy - enabled. - -NOTE2: - When panic_on_oom is set to "2", the whole system will panic in - case of an OOM event in any cgroup. - -7. Soft limits -============== - -Soft limits allow for greater sharing of memory. The idea behind soft limits -is to allow control groups to use as much of the memory as needed, provided - -a. There is no memory contention -b. They do not exceed their hard limit - -When the system detects memory contention or low memory, control groups -are pushed back to their soft limits. If the soft limit of each control -group is very high, they are pushed back as much as possible to make -sure that one control group does not starve the others of memory. - -Please note that soft limits is a best-effort feature; it comes with -no guarantees, but it does its best to make sure that when memory is -heavily contended for, memory is allocated based on the soft limit -hints/setup. Currently soft limit based reclaim is set up such that -it gets invoked from balance_pgdat (kswapd). - -7.1 Interface -------------- - -Soft limits can be setup by using the following commands (in this example we -assume a soft limit of 256 MiB):: - - # echo 256M > memory.soft_limit_in_bytes - -If we want to change this to 1G, we can at any time use:: - - # echo 1G > memory.soft_limit_in_bytes - -NOTE1: - Soft limits take effect over a long period of time, since they involve - reclaiming memory for balancing between memory cgroups -NOTE2: - It is recommended to set the soft limit always below the hard limit, - otherwise the hard limit will take precedence. - -8. Move charges at task migration -================================= - -Users can move charges associated with a task along with task migration, that -is, uncharge task's pages from the old cgroup and charge them to the new cgroup. -This feature is not supported in !CONFIG_MMU environments because of lack of -page tables. - -8.1 Interface -------------- - -This feature is disabled by default. It can be enabled (and disabled again) by -writing to memory.move_charge_at_immigrate of the destination cgroup. - -If you want to enable it:: - - # echo (some positive value) > memory.move_charge_at_immigrate - -Note: - Each bits of move_charge_at_immigrate has its own meaning about what type - of charges should be moved. See 8.2 for details. -Note: - Charges are moved only when you move mm->owner, in other words, - a leader of a thread group. -Note: - If we cannot find enough space for the task in the destination cgroup, we - try to make space by reclaiming memory. 
Task migration may fail if we - cannot make enough space. -Note: - It can take several seconds if you move many charges. - -And if you want to disable it again:: - - # echo 0 > memory.move_charge_at_immigrate - -8.2 Type of charges which can be moved -------------------------------------- - -Each bit in move_charge_at_immigrate has its own meaning about what type of -charges should be moved. But in any case, it must be noted that an account of -a page or a swap can be moved only when it is charged to the task's current -(old) memory cgroup. - -+---+--------------------------------------------------------------------------+ -|bit| what type of charges would be moved ? | -+===+==========================================================================+ -| 0 | A charge of an anonymous page (or swap of it) used by the target task. | -| | You must enable Swap Extension (see 2.4) to enable move of swap charges. | -+---+--------------------------------------------------------------------------+ -| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) | -| | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | -| | anonymous pages, file pages (and swaps) in the range mmapped by the task | -| | will be moved even if the task hasn't done page fault, i.e. they might | -| | not be the task's "RSS", but other task's "RSS" that maps the same file. | -| | And mapcount of the page is ignored (the page can be moved even if | -| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | -| | enable move of swap charges. | -+---+--------------------------------------------------------------------------+ - -8.3 TODO -------- - -- All moving-charge operations are done under cgroup_mutex. It's not good - behavior to hold the mutex for too long, so we may need some trick. - -9. Memory thresholds -==================== - -The memory cgroup implements memory thresholds using the cgroups notification -API (see cgroups.txt). It allows registering multiple memory and memsw -thresholds and getting notifications when a threshold is crossed. - -To register a threshold, an application must: - -- create an eventfd using eventfd(2); -- open memory.usage_in_bytes or memory.memsw.usage_in_bytes; -- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to - cgroup.event_control. - -The application will be notified through eventfd when memory usage crosses -the threshold in either direction. - -This is applicable to both root and non-root cgroups. - -10. OOM Control -=============== - -The memory.oom_control file is for OOM notification and other controls. - -The memory cgroup implements an OOM notifier using the cgroup notification -API (see cgroups.txt). It allows registering multiple OOM notification -deliveries and getting a notification when an OOM happens. - -To register a notifier, an application must: - - - create an eventfd using eventfd(2) - - open the memory.oom_control file - - write a string like "<event_fd> <fd of memory.oom_control>" to - cgroup.event_control - -The application will be notified through eventfd when OOM happens. -OOM notification doesn't work for the root cgroup. - -You can disable the OOM-killer by writing "1" to the memory.oom_control file, as:: - - # echo 1 > memory.oom_control - -If the OOM-killer is disabled, tasks under the cgroup will hang/sleep -on the memory cgroup's OOM waitqueue when they request accountable memory. - -To make them run again, you have to relax the memory cgroup's OOM status by - - * enlarging the limit or reducing usage. - -To reduce usage, - - * kill some tasks. - * move some tasks to another group with account migration. - * remove some files (on tmpfs?)
- -Then, stopped tasks will work again. - -Reading the file shows the current OOM status: - - - oom_kill_disable 0 or 1 - (if 1, the oom-killer is disabled) - - under_oom 0 or 1 - (if 1, the memory cgroup is under OOM, and tasks may be stopped.) - -11. Memory Pressure -=================== - -The pressure level notifications can be used to monitor the memory -allocation cost; based on the pressure, applications can implement -different strategies for managing their memory resources. The pressure -levels are defined as follows: - -The "low" level means that the system is reclaiming memory for new -allocations. Monitoring this reclaiming activity might be useful for -maintaining the cache level. Upon notification, the program (typically an -"Activity Manager") might analyze vmstat and act in advance (e.g. -shut down unimportant services early). - -The "medium" level means that the system is experiencing medium memory -pressure; the system might be swapping, paging out active file caches, -etc. Upon this event, applications may decide to further analyze -vmstat/zoneinfo/memcg or internal memory usage statistics and free any -resources that can be easily reconstructed or re-read from disk. - -The "critical" level means that the system is actively thrashing, is -about to run out of memory (OOM), or the in-kernel OOM killer is about -to trigger. Applications should do whatever they can to help the -system. It might be too late to consult vmstat or any other -statistics, so it's advisable to take immediate action. - -By default, events are propagated upward until the event is handled, i.e. the -events are not pass-through. For example, suppose you have three cgroups: A->B->C. Now -you set up an event listener on cgroups A, B and C, and suppose group C -experiences some pressure. In this situation, only group C will receive the -notification, i.e. groups A and B will not receive it. This is done to avoid -excessive "broadcasting" of messages, which disturbs the system and which is -especially bad if we are low on memory or thrashing. Group B will receive a -notification only if there are no event listeners for group C. - -There are three optional modes that specify different propagation behavior: - - - "default": this is the default behavior specified above. This mode is the - same as omitting the optional mode parameter, preserved for backwards - compatibility. - - - "hierarchy": events always propagate up to the root, similar to the default - behavior, except that propagation continues regardless of whether there are - event listeners at each level. In the above - example, groups A, B, and C will receive notification of memory pressure. - - - "local": events are pass-through, i.e. listeners only receive notifications when - memory pressure is experienced in the memcg for which the notification is - registered. In the above example, group C will receive a notification if it is - registered for "local" notification and the group experiences memory - pressure. However, group B will never receive a notification, regardless of - whether there is an event listener for group C, if group B is registered for - local notification. - -The level and event notification mode ("hierarchy" or "local", if necessary) are -specified by a comma-delimited string, i.e. "low,hierarchy" specifies -hierarchical, pass-through notification for all ancestor memcgs. The default, -non-pass-through behavior does not specify a mode.
-"medium,local" specifies pass-through notification for the medium level. - -The file memory.pressure_level is only used to setup an eventfd. To -register a notification, an application must: - -- create an eventfd using eventfd(2); -- open memory.pressure_level; -- write string as " " - to cgroup.event_control. - -Application will be notified through eventfd when memory pressure is at -the specific level (or higher). Read/write operations to -memory.pressure_level are no implemented. - -Test: - - Here is a small script example that makes a new cgroup, sets up a - memory limit, sets up a notification in the cgroup and then makes child - cgroup experience a critical pressure:: - - # cd /sys/fs/cgroup/memory/ - # mkdir foo - # cd foo - # cgroup_event_listener memory.pressure_level low,hierarchy & - # echo 8000000 > memory.limit_in_bytes - # echo 8000000 > memory.memsw.limit_in_bytes - # echo $$ > tasks - # dd if=/dev/zero | read x - - (Expect a bunch of notifications, and eventually, the oom-killer will - trigger.) - -12. TODO -======== - -1. Make per-cgroup scanner reclaim not-shared pages first -2. Teach controller to account for shared-pages -3. Start reclamation in the background when the limit is - not yet hit but the usage is getting closer - -Summary -======= - -Overall, the memory controller has been a stable controller and has been -commented and discussed quite extensively in the community. - -References -========== - -1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/ -2. Singh, Balbir. Memory Controller (RSS Control), - http://lwn.net/Articles/222762/ -3. Emelianov, Pavel. Resource controllers based on process cgroups - http://lkml.org/lkml/2007/3/6/198 -4. Emelianov, Pavel. RSS controller based on process cgroups (v2) - http://lkml.org/lkml/2007/4/9/78 -5. Emelianov, Pavel. RSS controller based on process cgroups (v3) - http://lkml.org/lkml/2007/5/30/244 -6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/ -7. Vaidyanathan, Srinivasan, Control Groups: Pagecache accounting and control - subsystem (v3), http://lwn.net/Articles/235534/ -8. Singh, Balbir. RSS controller v2 test results (lmbench), - http://lkml.org/lkml/2007/5/17/232 -9. Singh, Balbir. RSS controller v2 AIM9 results - http://lkml.org/lkml/2007/5/18/1 -10. Singh, Balbir. Memory controller v6 test results, - http://lkml.org/lkml/2007/8/19/36 -11. Singh, Balbir. Memory controller introduction (v6), - http://lkml.org/lkml/2007/8/17/69 -12. Corbet, Jonathan, Controlling memory use in cgroups, - http://lwn.net/Articles/243795/ diff --git a/Documentation/cgroup-v1/net_cls.rst b/Documentation/cgroup-v1/net_cls.rst deleted file mode 100644 index a2cf272af7a0..000000000000 --- a/Documentation/cgroup-v1/net_cls.rst +++ /dev/null @@ -1,44 +0,0 @@ -========================= -Network classifier cgroup -========================= - -The Network classifier cgroup provides an interface to -tag network packets with a class identifier (classid). - -The Traffic Controller (tc) can be used to assign -different priorities to packets from different cgroups. -Also, Netfilter (iptables) can use this tag to perform -actions on such packets. - -Creating a net_cls cgroups instance creates a net_cls.classid file. -This net_cls.classid value is initialized to 0. - -You can write hexadecimal values to net_cls.classid; the format for these -values is 0xAAAABBBB; AAAA is the major handle number and BBBB -is the minor handle number. -Reading net_cls.classid yields a decimal result. 
- -Example (setting a 10:1 handle):: - - mkdir /sys/fs/cgroup/net_cls - mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls - mkdir /sys/fs/cgroup/net_cls/0 - echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid - -- reading the classid back yields its decimal value:: - - cat /sys/fs/cgroup/net_cls/0/net_cls.classid - 1048577 - -- configuring tc and creating traffic class 10:1:: - - tc qdisc add dev eth0 root handle 10: htb - tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit - -- attaching a filter that classifies packets by cgroup:: - - tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup - -configuring iptables, basic example:: - - iptables -A OUTPUT -m cgroup ! --cgroup 0x100001 -j DROP diff --git a/Documentation/cgroup-v1/net_prio.rst b/Documentation/cgroup-v1/net_prio.rst deleted file mode 100644 index b40905871c64..000000000000 --- a/Documentation/cgroup-v1/net_prio.rst +++ /dev/null @@ -1,57 +0,0 @@ -======================= -Network priority cgroup -======================= - -The Network priority cgroup provides an interface to allow an administrator to -dynamically set the priority of network traffic generated by various -applications. - -Nominally, an application would set the priority of its traffic via the -SO_PRIORITY socket option. This, however, is not always possible because: - -1) The application may not have been coded to set this value -2) The priority of application traffic is often a site-specific administrative - decision rather than an application-defined one. - -This cgroup allows an administrator to assign a process to a group which defines -the priority of egress traffic on a given interface. Network priority groups can -be created by first mounting the cgroup filesystem:: - - # mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio - -With the above step, the initial group acting as the parent accounting group -becomes visible at '/sys/fs/cgroup/net_prio'. This group includes all tasks in -the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup. - -Each net_prio cgroup contains two subsystem-specific files: - -net_prio.prioidx - This file is read-only, and is simply informative. It contains a unique - integer value that the kernel uses as an internal representation of this - cgroup. - -net_prio.ifpriomap - This file contains a map of the priorities assigned to traffic originating - from processes in this group and egressing the system on various interfaces. - It contains a list of tuples in the form <ifname priority>. Contents of this - file can be modified by echoing a string into the file using the same tuple - format. For example:: - - echo "eth0 5" > /sys/fs/cgroup/net_prio/iscsi/net_prio.ifpriomap - -This command would force any traffic originating from processes belonging to the -iscsi net_prio cgroup and egressing on interface eth0 to have its -priority set to the value 5. The parent accounting group also has a -writeable 'net_prio.ifpriomap' file that can be used to set a system default -priority. - -Priorities are set immediately prior to queueing a frame to the device's -queueing discipline (qdisc), so priorities will be assigned before the hardware -queue selection is made. - -One use of the net_prio cgroup is with the mqprio qdisc, allowing application -traffic to be steered to hardware/driver-based traffic classes. These mappings -can then be managed by administrators or other networking protocols such as -DCBX. - -A new net_prio cgroup inherits the parent's configuration.
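To illustrate that last point (an editorial sketch, not part of the deleted file; the "iscsi" group continues the hypothetical example above), a newly created child group starts out with its parent's priority map::

    mkdir /sys/fs/cgroup/net_prio/iscsi/child
    cat /sys/fs/cgroup/net_prio/iscsi/child/net_prio.ifpriomap
        eth0 5
    # (entries for other interfaces elided)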
diff --git a/Documentation/cgroup-v1/pids.rst b/Documentation/cgroup-v1/pids.rst deleted file mode 100644 index 6acebd9e72c8..000000000000 --- a/Documentation/cgroup-v1/pids.rst +++ /dev/null @@ -1,92 +0,0 @@ -========================= -Process Number Controller -========================= - -Abstract -------- - -The process number controller is used to allow a cgroup hierarchy to stop any -new tasks from being fork()'d or clone()'d after a certain limit is reached. - -Since it is trivial to hit the task limit without hitting any kmemcg limits in -place, PIDs are a fundamental resource. As such, PID exhaustion must be -preventable in the scope of a cgroup hierarchy by allowing resource limiting of -the number of tasks in a cgroup. - -Usage ----- - -In order to use the `pids` controller, set the maximum number of tasks in -pids.max (this is not available in the root cgroup for obvious reasons). The -number of processes currently in the cgroup is given by pids.current. - -Organisational operations are not blocked by cgroup policies, so it is possible -to have pids.current > pids.max. This can be done by either setting the limit to -be smaller than pids.current, or attaching enough processes to the cgroup such -that pids.current > pids.max. However, it is not possible to violate a cgroup -policy through fork() or clone(). fork() and clone() will return -EAGAIN if the -creation of a new process would cause a cgroup policy to be violated. - -To set a cgroup to have no limit, set pids.max to "max". This is the default for -all new cgroups (N.B. that PID limits are hierarchical, so the most stringent -limit in the hierarchy is followed). - -pids.current tracks all child cgroup hierarchies, so parent/pids.current is a -superset of parent/child/pids.current. - -The pids.events file contains event counters: - - - max: Number of times fork failed because limit was hit. - -Example ------- - -First, we mount the pids controller:: - - # mkdir -p /sys/fs/cgroup/pids - # mount -t cgroup -o pids none /sys/fs/cgroup/pids - -Then we create a hierarchy, set limits and attach processes to it:: - - # mkdir -p /sys/fs/cgroup/pids/parent/child - # echo 2 > /sys/fs/cgroup/pids/parent/pids.max - # echo $$ > /sys/fs/cgroup/pids/parent/cgroup.procs - # cat /sys/fs/cgroup/pids/parent/pids.current - 2 - # - -It should be noted that attempts to overcome the set limit (2 in this case) will -fail:: - - # cat /sys/fs/cgroup/pids/parent/pids.current - 2 - # ( /bin/echo "Here's some processes for you." | cat ) - sh: fork: Resource temporarily unavailable - # - -Even if we migrate to a child cgroup (which doesn't have a set limit), we will -not be able to overcome the most stringent limit in the hierarchy (in this case, -parent's):: - - # echo $$ > /sys/fs/cgroup/pids/parent/child/cgroup.procs - # cat /sys/fs/cgroup/pids/parent/pids.current - 2 - # cat /sys/fs/cgroup/pids/parent/child/pids.current - 2 - # cat /sys/fs/cgroup/pids/parent/child/pids.max - max - # ( /bin/echo "Here's some processes for you." | cat ) - sh: fork: Resource temporarily unavailable - # - -We can set a limit that is smaller than pids.current, which will stop any new -processes from being forked at all (note that the shell itself counts towards -pids.current):: - - # echo 1 > /sys/fs/cgroup/pids/parent/pids.max - # /bin/echo "We can't even spawn a single process now." - sh: fork: Resource temporarily unavailable - # echo 0 > /sys/fs/cgroup/pids/parent/pids.max - # /bin/echo "We can't even spawn a single process now."
- sh: fork: Resource temporarily unavailable - # diff --git a/Documentation/cgroup-v1/rdma.rst b/Documentation/cgroup-v1/rdma.rst deleted file mode 100644 index 2fcb0a9bf790..000000000000 --- a/Documentation/cgroup-v1/rdma.rst +++ /dev/null @@ -1,117 +0,0 @@ -=============== -RDMA Controller -=============== - -.. Contents - - 1. Overview - 1-1. What is RDMA controller? - 1-2. Why is RDMA controller needed? - 1-3. How is RDMA controller implemented? - 2. Usage Examples - -1. Overview -=========== - -1-1. What is RDMA controller? ----------------------------- - -The RDMA controller allows users to limit the RDMA/IB-specific resources that a given -set of processes can use. These processes are grouped using the RDMA controller. - -The RDMA controller defines two resources which can be limited for the processes of a -cgroup. - -1-2. Why is RDMA controller needed? ------------------------------------ - -Currently, user-space applications can easily take away all the rdma-verb-specific -resources such as AH, CQ, QP, MR, etc. As a result, other applications in another -cgroup, or kernel-space ULPs, may not even get a chance to allocate any -rdma resources. This can lead to service unavailability. - -Therefore the RDMA controller is needed, through which the resource consumption -of processes can be limited. Through this controller, different rdma -resources can be accounted. - -1-3. How is RDMA controller implemented? ---------------------------------------- - -The RDMA cgroup allows configuring resource limits. It maintains -resource accounting per cgroup, per device, using a resource pool structure. -Each such resource pool is limited to 64 resources -by the rdma cgroup, which can be extended later if required. - -This resource pool object is linked to the cgroup css. Typically there -are 0 to 4 resource pool instances per cgroup, per device in most use cases, -but nothing prevents having more. At present, hundreds of RDMA devices per -single cgroup may not be handled optimally; however, there is no -known use case or requirement for such a configuration either. - -Since RDMA resources can be allocated by any process and freed by any -of the child processes which share the address space, rdma resources are -always owned by the creator cgroup css. This allows process migration from one -cgroup to another without the major complexity of transferring resource ownership, -because such ownership is not really present due to the shared nature of -rdma resources. Linking resources to the css also ensures that cgroups can be -deleted after processes have migrated. This allows process migration with -active resources as well, even though that is not a primary use case. - -Whenever RDMA resource charging occurs, the owner rdma cgroup is returned to -the caller. The same rdma cgroup should be passed while uncharging the resource. -This also allows a process migrated with an active RDMA resource to charge -the new owner cgroup for a new resource. It also allows uncharging a resource of -a process from the previously charged cgroup after it has migrated to a new cgroup, -even though that is not a primary use case. - -A resource pool object is created in the following situations: -(a) The user sets a limit and no previous resource pool exists for the device -of interest for the cgroup. -(b) No resource limits were configured, but the IB/RDMA stack tries to -charge the resource; the pool is created so that resources are correctly uncharged -when applications run without limits and limits are enforced later during uncharging, -otherwise the usage count would drop below zero.
- -A resource pool is destroyed if all the resource limits are set to max and -it is the last resource being deallocated. - -Users should set all limits to the max value if they intend to remove/unconfigure -the resource pool for a particular device. - -The IB stack honors limits enforced by the rdma controller. When an application -queries the maximum resource limits of an IB device, it returns the minimum of -what is configured by the user for the given cgroup and what is supported by the -IB device. - -The following resources can be accounted by the rdma controller: - - ========== ============================= - hca_handle Maximum number of HCA Handles - hca_object Maximum number of HCA Objects - ========== ============================= - -2. Usage Examples -================= - -(a) Configure resource limit:: - - echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max - echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max - -(b) Query resource limit:: - - cat /sys/fs/cgroup/rdma/2/rdma.max - #Output: - mlx4_0 hca_handle=2 hca_object=2000 - ocrdma1 hca_handle=3 hca_object=max - -(c) Query current usage:: - - cat /sys/fs/cgroup/rdma/2/rdma.current - #Output: - mlx4_0 hca_handle=1 hca_object=20 - ocrdma1 hca_handle=1 hca_object=23 - -(d) Delete resource limit:: - - echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index cad797a8a39e..5ecbc03e6b2f 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -98,7 +98,7 @@ A memory policy with a valid NodeList will be saved, as specified, for use at file creation time. When a task allocates a file in the file system, the mount option memory policy will be applied with a NodeList, if any, modified by the calling task's cpuset constraints -[See Documentation/cgroup-v1/cpusets.rst] and any optional flags, listed +[See Documentation/admin-guide/cgroup-v1/cpusets.rst] and any optional flags, listed below. If the resulting NodeLists is the empty set, the effective memory policy for the file will revert to "default" policy. diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt index 5623b9916411..4f18456dd3b1 100644 --- a/Documentation/kernel-per-CPU-kthreads.txt +++ b/Documentation/kernel-per-CPU-kthreads.txt @@ -12,7 +12,7 @@ References - Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs. -- Documentation/cgroup-v1: Using cgroups to bind tasks to sets of CPUs. +- Documentation/admin-guide/cgroup-v1: Using cgroups to bind tasks to sets of CPUs. - man taskset: Using the taskset command to bind tasks to sets of CPUs. diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst index 3391e86d810c..14a2f7bf63fe 100644 --- a/Documentation/scheduler/sched-deadline.rst +++ b/Documentation/scheduler/sched-deadline.rst @@ -669,7 +669,7 @@ Deadline Task Scheduling -deadline tasks cannot have an affinity mask smaller that the entire root_domain they are created on. However, affinities can be specified - through the cpuset facility (Documentation/cgroup-v1/cpusets.rst). + through the cpuset facility (Documentation/admin-guide/cgroup-v1/cpusets.rst).
5.1 SCHED_DEADLINE and cpusets HOWTO ------------------------------------ diff --git a/Documentation/scheduler/sched-design-CFS.rst b/Documentation/scheduler/sched-design-CFS.rst index 53b30d1967cf..a96c72651877 100644 --- a/Documentation/scheduler/sched-design-CFS.rst +++ b/Documentation/scheduler/sched-design-CFS.rst @@ -222,7 +222,7 @@ SCHED_BATCH) tasks. These options need CONFIG_CGROUPS to be defined, and let the administrator create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See - Documentation/cgroup-v1/cgroups.rst for more information about this filesystem. + Documentation/admin-guide/cgroup-v1/cgroups.rst for more information about this filesystem. When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each group created using the pseudo filesystem. See example steps below to create diff --git a/Documentation/scheduler/sched-rt-group.rst b/Documentation/scheduler/sched-rt-group.rst index d27d3f3712fd..655a096ec8fb 100644 --- a/Documentation/scheduler/sched-rt-group.rst +++ b/Documentation/scheduler/sched-rt-group.rst @@ -133,7 +133,7 @@ This uses the cgroup virtual file system and "/cpu.rt_runtime_us" to control the CPU time reserved for each control group. For more information on working with control groups, you should read -Documentation/cgroup-v1/cgroups.rst as well. +Documentation/admin-guide/cgroup-v1/cgroups.rst as well. Group settings are checked against the following limits in order to keep the configuration schedulable: diff --git a/Documentation/vm/numa.rst b/Documentation/vm/numa.rst index 130f3cfa1c19..99fdeca917ca 100644 --- a/Documentation/vm/numa.rst +++ b/Documentation/vm/numa.rst @@ -67,7 +67,7 @@ nodes. Each emulated node will manage a fraction of the underlying cells' physical memory. NUMA emluation is useful for testing NUMA kernel and application features on non-NUMA platforms, and as a sort of memory resource management mechanism when used together with cpusets. -[see Documentation/cgroup-v1/cpusets.rst] +[see Documentation/admin-guide/cgroup-v1/cpusets.rst] For each node with memory, Linux constructs an independent memory management subsystem, complete with its own free page lists, in-use page lists, usage @@ -114,7 +114,7 @@ allocation behavior using Linux NUMA memory policy. [see System administrators can restrict the CPUs and nodes' memories that a non- privileged user can specify in the scheduling or NUMA commands and functions -using control groups and CPUsets. [see Documentation/cgroup-v1/cpusets.rst] +using control groups and CPUsets. [see Documentation/admin-guide/cgroup-v1/cpusets.rst] On architectures that do not hide memoryless nodes, Linux will include only zones [nodes] with memory in the zonelists. This means that for a memoryless diff --git a/Documentation/vm/page_migration.rst b/Documentation/vm/page_migration.rst index 35bba27d5fff..1d6cd7db4e43 100644 --- a/Documentation/vm/page_migration.rst +++ b/Documentation/vm/page_migration.rst @@ -41,7 +41,7 @@ locations. Larger installations usually partition the system using cpusets into sections of nodes. Paul Jackson has equipped cpusets with the ability to move pages when a task is moved to another cpuset (See -Documentation/cgroup-v1/cpusets.rst). +Documentation/admin-guide/cgroup-v1/cpusets.rst). Cpusets allows the automation of process locality. If a task is moved to a new cpuset then also all its pages are moved with it so that the performance of the process does not sink dramatically. 
Also the pages diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst index 109052215bce..17d0861b0f1d 100644 --- a/Documentation/vm/unevictable-lru.rst +++ b/Documentation/vm/unevictable-lru.rst @@ -98,7 +98,7 @@ Memory Control Group Interaction -------------------------------- The unevictable LRU facility interacts with the memory control group [aka -memory controller; see Documentation/cgroup-v1/memory.rst] by extending the +memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the lru_list enum. The memory controller data structure automatically gets a per-zone unevictable diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst index 30108684ae87..ff9bcfd2cc14 100644 --- a/Documentation/x86/x86_64/fake-numa-for-cpusets.rst +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets.rst @@ -15,7 +15,7 @@ assign them to cpusets and their attached tasks. This is a way of limiting the amount of system memory that are available to a certain class of tasks. For more information on the features of cpusets, see -Documentation/cgroup-v1/cpusets.rst. +Documentation/admin-guide/cgroup-v1/cpusets.rst. There are a number of different configurations you can use for your needs. For more information on the numa=fake command line option and its various ways of configuring fake nodes, see Documentation/x86/x86_64/boot-options.rst. @@ -40,7 +40,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:: On node 3 totalpages: 131072 Now following the instructions for mounting the cpusets filesystem from -Documentation/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory +Documentation/admin-guide/cgroup-v1/cpusets.rst, you can assign fake nodes (i.e. contiguous memory address spaces) to individual cpusets:: [root@xroads /]# mkdir exampleset diff --git a/MAINTAINERS b/MAINTAINERS index 0c603ea73034..c1593a668f80 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4158,7 +4158,7 @@ L: cgroups@vger.kernel.org T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git S: Maintained F: Documentation/admin-guide/cgroup-v2.rst -F: Documentation/cgroup-v1/ +F: Documentation/admin-guide/cgroup-v1/ F: include/linux/cgroup* F: kernel/cgroup/ @@ -4169,7 +4169,7 @@ W: http://www.bullopensource.org/cpuset/ W: http://oss.sgi.com/projects/cpusets/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git S: Maintained -F: Documentation/cgroup-v1/cpusets.rst +F: Documentation/admin-guide/cgroup-v1/cpusets.rst F: include/linux/cpuset.h F: kernel/cgroup/cpuset.c diff --git a/block/Kconfig b/block/Kconfig index b16b3e075d31..8b5f8e560eb4 100644 --- a/block/Kconfig +++ b/block/Kconfig @@ -89,7 +89,7 @@ config BLK_DEV_THROTTLING one needs to mount and use blkio cgroup controller for creating cgroups and specifying per device IO rate policies. - See Documentation/cgroup-v1/blkio-controller.rst for more information. + See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information. config BLK_DEV_THROTTLING_LOW bool "Block throttling .low limit interface support (EXPERIMENTAL)" diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index c5311935239d..430e219e3aba 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -624,7 +624,7 @@ struct cftype { /* * Control Group subsystem type. 
- * See Documentation/cgroup-v1/cgroups.rst for details + * See Documentation/admin-guide/cgroup-v1/cgroups.rst for details */ struct cgroup_subsys { struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 6f68438aa4ed..82699845ef79 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -806,7 +806,7 @@ union bpf_attr { * based on a user-provided identifier for all traffic coming from * the tasks belonging to the related cgroup. See also the related * kernel documentation, available from the Linux sources in file - * *Documentation/cgroup-v1/net_cls.rst*. + * *Documentation/admin-guide/cgroup-v1/net_cls.rst*. * * The Linux kernel has two versions for cgroups: there are * cgroups v1 and cgroups v2. Both are available to users, who can diff --git a/init/Kconfig b/init/Kconfig index 9eb92ee52d40..381cdfee6e0e 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -821,7 +821,7 @@ menuconfig CGROUPS controls or device isolation. See - Documentation/scheduler/sched-design-CFS.rst (CFS) - - Documentation/cgroup-v1/ (features for grouping, isolation + - Documentation/admin-guide/cgroup-v1/ (features for grouping, isolation and resource control) Say N if unsure. @@ -883,7 +883,7 @@ config BLK_CGROUP CONFIG_CFQ_GROUP_IOSCHED=y; for enabling throttling policy, set CONFIG_BLK_DEV_THROTTLING=y. - See Documentation/cgroup-v1/blkio-controller.rst for more information. + See Documentation/admin-guide/cgroup-v1/blkio-controller.rst for more information. config CGROUP_WRITEBACK bool diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index b3b02b9c4405..863e434a6020 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -729,7 +729,7 @@ static inline int nr_cpusets(void) * load balancing domains (sched domains) as specified by that partial * partition. * - * See "What is sched_load_balance" in Documentation/cgroup-v1/cpusets.rst + * See "What is sched_load_balance" in Documentation/admin-guide/cgroup-v1/cpusets.rst * for a background explanation of this. * * Does not return errors, on the theory that the callers of this diff --git a/security/device_cgroup.c b/security/device_cgroup.c index c07196502577..725674f3276d 100644 --- a/security/device_cgroup.c +++ b/security/device_cgroup.c @@ -509,7 +509,7 @@ static inline int may_allow_all(struct dev_cgroup *parent) * This is one of the three key functions for hierarchy implementation. * This function is responsible for re-evaluating all the cgroup's active * exceptions due to a parent's exception change. - * Refer to Documentation/cgroup-v1/devices.rst for more details. + * Refer to Documentation/admin-guide/cgroup-v1/devices.rst for more details. */ static void revalidate_active_exceptions(struct dev_cgroup *devcg) { diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index f506c68b2612..17e2b1713702 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -806,7 +806,7 @@ union bpf_attr { * based on a user-provided identifier for all traffic coming from * the tasks belonging to the related cgroup. See also the related * kernel documentation, available from the Linux sources in file - * *Documentation/cgroup-v1/net_cls.rst*. + * *Documentation/admin-guide/cgroup-v1/net_cls.rst*. * * The Linux kernel has two versions for cgroups: there are * cgroups v1 and cgroups v2. 
Both are available to users, who can -- cgit v1.2.3-55-g7522 From 8bb0776b8b27d548c7e65828ec3a02cb31fe3eed Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Tue, 9 Jul 2019 12:36:09 -0300 Subject: docs: block: fix pdf output Add an extra blank line and use proper markup for the enumerated list, in order to make it possible to build the block book in pdf format. Signed-off-by: Mauro Carvalho Chehab --- Documentation/block/biodoc.rst | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'Documentation/block') diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst index d6e30b680405..c8eb28401332 100644 --- a/Documentation/block/biodoc.rst +++ b/Documentation/block/biodoc.rst @@ -9,6 +9,7 @@ Notes on the Generic Block Layer Rewrite in Linux 2.5 here might still be useful. Notes Written on Jan 15, 2002: + - Jens Axboe - Suparna Bhattacharya @@ -172,8 +173,8 @@ Some new queue property settings: New queue flags: - QUEUE_FLAG_CLUSTER (see 3.2.2) - QUEUE_FLAG_QUEUED (see 3.2.4) + - QUEUE_FLAG_CLUSTER (see 3.2.2) + - QUEUE_FLAG_QUEUED (see 3.2.4) ii. High-mem i/o capabilities are now considered the default @@ -478,7 +479,7 @@ With this multipage bio design: - Splitting of an i/o request across multiple devices (as in the case of lvm or raid) is achieved by cloning the bio (where the clone points to the same bi_io_vec array, but with the index and size accordingly modified) -- A linked list of bios is used as before for unrelated merges [*]_ - this +- A linked list of bios is used as before for unrelated merges [#]_ - this avoids reallocs and makes independent completions easier to handle. - Code that traverses the req list can find all the segments of a bio by using rq_for_each_segment. This handles the fact that a request @@ -489,7 +490,7 @@ With this multipage bio design: [TBD: Should preferably also have a bi_voffset and bi_vlen to avoid modifying bi_offset an len fields] -.. [*] +.. [#] unrelated merges -- a request ends up containing two or more bios that didn't originate from the same place. -- cgit v1.2.3-55-g7522
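A markup fix like this one can be checked locally by rebuilding just the block book (an editorial sketch, assuming a kernel source tree with the Sphinx and LaTeX toolchains installed; output lands under Documentation/output)::

    make SPHINXDIRS="block" pdfdocs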