From 570432470275c3da15b85362bc1461945b9c1919 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 22 Apr 2019 16:48:00 -0300 Subject: docs: admin-guide: move sysctl directory to it The stuff under sysctl describes /sys interface from userspace point of view. So, add it to the admin-guide and remove the :orphan: from its index file. Signed-off-by: Mauro Carvalho Chehab --- Documentation/admin-guide/index.rst | 1 + Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/mm/index.rst | 2 +- Documentation/admin-guide/mm/ksm.rst | 2 +- Documentation/admin-guide/sysctl/abi.rst | 67 ++ Documentation/admin-guide/sysctl/fs.rst | 384 ++++++++ Documentation/admin-guide/sysctl/index.rst | 98 ++ Documentation/admin-guide/sysctl/kernel.rst | 1177 +++++++++++++++++++++++ Documentation/admin-guide/sysctl/net.rst | 461 +++++++++ Documentation/admin-guide/sysctl/sunrpc.rst | 25 + Documentation/admin-guide/sysctl/user.rst | 78 ++ Documentation/admin-guide/sysctl/vm.rst | 964 +++++++++++++++++++ 12 files changed, 3258 insertions(+), 3 deletions(-) create mode 100644 Documentation/admin-guide/sysctl/abi.rst create mode 100644 Documentation/admin-guide/sysctl/fs.rst create mode 100644 Documentation/admin-guide/sysctl/index.rst create mode 100644 Documentation/admin-guide/sysctl/kernel.rst create mode 100644 Documentation/admin-guide/sysctl/net.rst create mode 100644 Documentation/admin-guide/sysctl/sunrpc.rst create mode 100644 Documentation/admin-guide/sysctl/user.rst create mode 100644 Documentation/admin-guide/sysctl/vm.rst (limited to 'Documentation/admin-guide') diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 64e97a969857..5c6ae1ccee1a 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -16,6 +16,7 @@ etc. README kernel-parameters devices + sysctl/index This section describes CPU vulnerabilities and their mitigations. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e8e28cac32a3..b323f5d4366a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3144,7 +3144,7 @@ numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. 'node', 'default' can be specified This can be set from sysctl after boot. - See Documentation/sysctl/vm.rst for details. + See Documentation/admin-guide/sysctl/vm.rst for details. ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.txt for more diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index f5e92f33f96e..5f61a6c429e0 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -11,7 +11,7 @@ processes address space and many other cool things. Linux memory management is a complex system with many configurable settings. Most of these settings are available via ``/proc`` filesystem and can be quired and adjusted using ``sysctl``. These APIs -are described in Documentation/sysctl/vm.rst and in `man 5 proc`_. +are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index 7b2b8767c0b4..874eb0c77d34 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -59,7 +59,7 @@ MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. If a region of memory must be split into at least one new MADV_MERGEABLE or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process -will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.rst). +will exceed ``vm.max_map_count`` (see Documentation/admin-guide/sysctl/vm.rst). Like other madvise calls, they are intended for use on mapped areas of the user address space: they will report ENOMEM if the specified range diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst new file mode 100644 index 000000000000..599bcde7f0b7 --- /dev/null +++ b/Documentation/admin-guide/sysctl/abi.rst @@ -0,0 +1,67 @@ +================================ +Documentation for /proc/sys/abi/ +================================ + +kernel version 2.6.0.test2 + +Copyright (c) 2003, Fabian Frederick + +For general info: index.rst. + +------------------------------------------------------------------------------ + +This path is binary emulation relevant aka personality types aka abi. +When a process is executed, it's linked to an exec_domain whose +personality is defined using values available from /proc/sys/abi. +You can find further details about abi in include/linux/personality.h. + +Here are the files featuring in 2.6 kernel: + +- defhandler_coff +- defhandler_elf +- defhandler_lcall7 +- defhandler_libcso +- fake_utsname +- trace + +defhandler_coff +--------------- + +defined value: + PER_SCOSVR3:: + + 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE + +defhandler_elf +-------------- + +defined value: + PER_LINUX:: + + 0 + +defhandler_lcall7 +----------------- + +defined value : + PER_SVR4:: + + 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +defhandler_libsco +----------------- + +defined value: + PER_SVR4:: + + 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +fake_utsname +------------ + +Unused + +trace +----- + +Unused diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst new file mode 100644 index 000000000000..2a45119e3331 --- /dev/null +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -0,0 +1,384 @@ +=============================== +Documentation for /proc/sys/fs/ +=============================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in intro.rst. + +------------------------------------------------------------------------------ + +This file contains documentation for the sysctl files in +/proc/sys/fs/ and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files _can_ be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. + +1. /proc/sys/fs +=============== + +Currently, these files are in /proc/sys/fs: + +- aio-max-nr +- aio-nr +- dentry-state +- dquot-max +- dquot-nr +- file-max +- file-nr +- inode-max +- inode-nr +- inode-state +- nr_open +- overflowuid +- overflowgid +- pipe-user-pages-hard +- pipe-user-pages-soft +- protected_fifos +- protected_hardlinks +- protected_regular +- protected_symlinks +- suid_dumpable +- super-max +- super-nr + + +aio-nr & aio-max-nr +------------------- + +aio-nr is the running total of the number of events specified on the +io_setup system call for all currently active aio contexts. If aio-nr +reaches aio-max-nr then io_setup will fail with EAGAIN. Note that +raising aio-max-nr does not result in the pre-allocation or re-sizing +of any kernel data structures. + + +dentry-state +------------ + +From linux/include/linux/dcache.h:: + + struct dentry_stat_t dentry_stat { + int nr_dentry; + int nr_unused; + int age_limit; /* age in seconds */ + int want_pages; /* pages requested by system */ + int nr_negative; /* # of unused negative dentries */ + int dummy; /* Reserved for future use */ + }; + +Dentries are dynamically allocated and deallocated. + +nr_dentry shows the total number of dentries allocated (active ++ unused). nr_unused shows the number of dentries that are not +actively used, but are saved in the LRU list for future reuse. + +Age_limit is the age in seconds after which dcache entries +can be reclaimed when memory is short and want_pages is +nonzero when shrink_dcache_pages() has been called and the +dcache isn't pruned yet. + +nr_negative shows the number of unused dentries that are also +negative dentries which do not map to any files. Instead, +they help speeding up rejection of non-existing files provided +by the users. + + +dquot-max & dquot-nr +-------------------- + +The file dquot-max shows the maximum number of cached disk +quota entries. + +The file dquot-nr shows the number of allocated disk quota +entries and the number of free disk quota entries. + +If the number of free cached disk quotas is very low and +you have some awesome number of simultaneous system users, +you might want to raise the limit. + + +file-max & file-nr +------------------ + +The value in file-max denotes the maximum number of file- +handles that the Linux kernel will allocate. When you get lots +of error messages about running out of file handles, you might +want to increase this limit. + +Historically,the kernel was able to allocate file handles +dynamically, but not to free them again. The three values in +file-nr denote the number of allocated file handles, the number +of allocated but unused file handles, and the maximum number of +file handles. Linux 2.6 always reports 0 as the number of free +file handles -- this is not an error, it just means that the +number of allocated file handles exactly matches the number of +used file handles. + +Attempts to allocate more file descriptors than file-max are +reported with printk, look for "VFS: file-max limit +reached". + + +nr_open +------- + +This denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + + +inode-max, inode-nr & inode-state +--------------------------------- + +As with file handles, the kernel allocates the inode structures +dynamically, but can't free them yet. + +The value in inode-max denotes the maximum number of inode +handlers. This value should be 3-4 times larger than the value +in file-max, since stdin, stdout and network sockets also +need an inode struct to handle them. When you regularly run +out of inodes, you need to increase this value. + +The file inode-nr contains the first two items from +inode-state, so we'll skip to that file... + +Inode-state contains three actual numbers and four dummies. +The actual numbers are, in order of appearance, nr_inodes, +nr_free_inodes and preshrink. + +Nr_inodes stands for the number of inodes the system has +allocated, this can be slightly more than inode-max because +Linux allocates them one pageful at a time. + +Nr_free_inodes represents the number of free inodes (?) and +preshrink is nonzero when the nr_inodes > inode-max and the +system needs to prune the inode list instead of allocating +more. + + +overflowgid & overflowuid +------------------------- + +Some filesystems only support 16-bit UIDs and GIDs, although in Linux +UIDs and GIDs are 32 bits. When one of these filesystems is mounted +with writes enabled, any UID or GID that would exceed 65535 is translated +to a fixed value before being written to disk. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + + +pipe-user-pages-hard +-------------------- + +Maximum total number of pages a non-privileged user may allocate for pipes. +Once this limit is reached, no new pipes may be allocated until usage goes +below the limit again. When set to 0, no limit is applied, which is the default +setting. + + +pipe-user-pages-soft +-------------------- + +Maximum total number of pages a non-privileged user may allocate for pipes +before the pipe size gets limited to a single page. Once this limit is reached, +new pipes will be limited to a single page in size for this user in order to +limit total memory usage, and trying to increase them using fcntl() will be +denied until usage goes below the limit again. The default value allows to +allocate up to 1024 pipes at their default size. When set to 0, no limit is +applied. + + +protected_fifos +--------------- + +The intent of this protection is to avoid unintentional writes to +an attacker-controlled FIFO, where a program expected to create a regular +file. + +When set to "0", writing to FIFOs is unrestricted. + +When set to "1" don't allow O_CREAT open on FIFOs that we don't own +in world writable sticky directories, unless they are owned by the +owner of the directory. + +When set to "2" it also applies to group writable sticky directories. + +This protection is based on the restrictions in Openwall. + + +protected_hardlinks +-------------------- + +A long-standing class of security issues is the hardlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given hardlink (i.e. a +root process follows a hardlink created by another user). Additionally, +on systems without separated partitions, this stops unauthorized users +from "pinning" vulnerable setuid/setgid files against being upgraded by +the administrator, or linking to special files. + +When set to "0", hardlink creation behavior is unrestricted. + +When set to "1" hardlinks cannot be created by users if they do not +already own the source file, or do not have read/write access to it. + +This protection is based on the restrictions in Openwall and grsecurity. + + +protected_regular +----------------- + +This protection is similar to protected_fifos, but it +avoids writes to an attacker-controlled regular file, where a program +expected to create one. + +When set to "0", writing to regular files is unrestricted. + +When set to "1" don't allow O_CREAT open on regular files that we +don't own in world writable sticky directories, unless they are +owned by the owner of the directory. + +When set to "2" it also applies to group writable sticky directories. + + +protected_symlinks +------------------ + +A long-standing class of security issues is the symlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given symlink (i.e. a +root process follows a symlink belonging to another user). For a likely +incomplete list of hundreds of examples across the years, please see: +http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp + +When set to "0", symlink following behavior is unrestricted. + +When set to "1" symlinks are permitted to be followed only when outside +a sticky world-writable directory, or when the uid of the symlink and +follower match, or when the directory owner matches the symlink's owner. + +This protection is based on the restrictions in Openwall and grsecurity. + + +suid_dumpable: +-------------- + +This value can be used to query and set the core dump mode for setuid +or otherwise protected/tainted binaries. The modes are + += ========== =============================================================== +0 (default) traditional behaviour. Any process which has changed + privilege levels or is execute only will not be dumped. +1 (debug) all processes dump core when possible. The core dump is + owned by the current user and no security is applied. This is + intended for system debugging situations only. + Ptrace is unchecked. + This is insecure as it allows regular users to examine the + memory contents of privileged processes. +2 (suidsafe) any binary which normally would not be dumped is dumped + anyway, but only if the "core_pattern" kernel sysctl is set to + either a pipe handler or a fully qualified path. (For more + details on this limitation, see CVE-2006-2451.) This mode is + appropriate when administrators are attempting to debug + problems in a normal environment, and either have a core dump + pipe handler that knows to treat privileged core dumps with + care, or specific directory defined for catching core dumps. + If a core dump happens without a pipe handler or fully + qualified path, a message will be emitted to syslog warning + about the lack of a correct setting. += ========== =============================================================== + + +super-max & super-nr +-------------------- + +These numbers control the maximum number of superblocks, and +thus the maximum number of mounted filesystems the kernel +can have. You only need to increase super-max if you need to +mount more filesystems than the current value in super-max +allows you to. + + +aio-nr & aio-max-nr +------------------- + +aio-nr shows the current system-wide number of asynchronous io +requests. aio-max-nr allows you to change the maximum value +aio-nr can grow to. + + +mount-max +--------- + +This denotes the maximum number of mounts that may exist +in a mount namespace. + + + +2. /proc/sys/fs/binfmt_misc +=========================== + +Documentation for the files in /proc/sys/fs/binfmt_misc is +in Documentation/admin-guide/binfmt-misc.rst. + + +3. /proc/sys/fs/mqueue - POSIX message queues filesystem +======================================================== + + +The "mqueue" filesystem provides the necessary kernel features to enable the +creation of a user space library that implements the POSIX message queues +API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System +Interfaces specification.) + +The "mqueue" filesystem contains values for determining/setting the amount of +resources used by the file system. + +/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the +maximum number of message queues allowed on the system. + +/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the +maximum number of messages in a queue value. In fact it is the limiting value +for another (user) limit which is set in mq_open invocation. This attribute of +a queue must be less or equal then msg_max. + +/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the +maximum message size value (it is every message queue's attribute set during +its creation). + +/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the +default number of messages in a queue value if attr parameter of mq_open(2) is +NULL. If it exceed msg_max, the default value is initialized msg_max. + +/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting +the default message size value if attr parameter of mq_open(2) is NULL. If it +exceed msgsize_max, the default value is initialized msgsize_max. + +4. /proc/sys/fs/epoll - Configuration options for the epoll interface +===================================================================== + +This directory contains configuration options for the epoll(7) interface. + +max_user_watches +---------------- + +Every epoll file descriptor can store a number of files to be monitored +for event readiness. Each one of these monitored files constitutes a "watch". +This configuration option sets the maximum number of "watches" that are +allowed for each user. +Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes +on a 64bit one. +The current default value for max_user_watches is the 1/32 of the available +low memory, divided for the "watch" cost in bytes. diff --git a/Documentation/admin-guide/sysctl/index.rst b/Documentation/admin-guide/sysctl/index.rst new file mode 100644 index 000000000000..03346f98c7b9 --- /dev/null +++ b/Documentation/admin-guide/sysctl/index.rst @@ -0,0 +1,98 @@ +=========================== +Documentation for /proc/sys +=========================== + +Copyright (c) 1998, 1999, Rik van Riel + +------------------------------------------------------------------------------ + +'Why', I hear you ask, 'would anyone even _want_ documentation +for them sysctl files? If anybody really needs it, it's all in +the source...' + +Well, this documentation is written because some people either +don't know they need to tweak something, or because they don't +have the time or knowledge to read the source code. + +Furthermore, the programmers who built sysctl have built it to +be actually used, not just for the fun of programming it :-) + +------------------------------------------------------------------------------ + +Legal blurb: + +As usual, there are two main things to consider: + +1. you get what you pay for +2. it's free + +The consequences are that I won't guarantee the correctness of +this document, and if you come to me complaining about how you +screwed up your system because of wrong documentation, I won't +feel sorry for you. I might even laugh at you... + +But of course, if you _do_ manage to screw up your system using +only the sysctl options used in this file, I'd like to hear of +it. Not only to have a great laugh, but also to make sure that +you're the last RTFMing person to screw up. + +In short, e-mail your suggestions, corrections and / or horror +stories to: + +Rik van Riel. + +-------------------------------------------------------------- + +Introduction +============ + +Sysctl is a means of configuring certain aspects of the kernel +at run-time, and the /proc/sys/ directory is there so that you +don't even need special tools to do it! +In fact, there are only four things needed to use these config +facilities: + +- a running Linux system +- root access +- common sense (this is especially hard to come by these days) +- knowledge of what all those values mean + +As a quick 'ls /proc/sys' will show, the directory consists of +several (arch-dependent?) subdirs. Each subdir is mainly about +one part of the kernel, so you can do configuration on a piece +by piece basis, or just some 'thematic frobbing'. + +This documentation is about: + +=============== =============================================================== +abi/ execution domains & personalities +debug/ +dev/ device specific information (eg dev/cdrom/info) +fs/ specific filesystems + filehandle, inode, dentry and quota tuning + binfmt_misc +kernel/ global kernel info / tuning + miscellaneous stuff +net/ networking stuff, for documentation look in: + +proc/ +sunrpc/ SUN Remote Procedure Call (NFS) +vm/ memory management tuning + buffer and cache management +user/ Per user per user namespace limits +=============== =============================================================== + +These are the subdirs I have on my system. There might be more +or other subdirs in another setup. If you see another dir, I'd +really like to hear about it :-) + +.. toctree:: + :maxdepth: 1 + + abi + fs + kernel + net + sunrpc + user + vm diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst new file mode 100644 index 000000000000..a0c1d4ce403a --- /dev/null +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -0,0 +1,1177 @@ +=================================== +Documentation for /proc/sys/kernel/ +=================================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains documentation for the sysctl files in +/proc/sys/kernel/ and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files _can_ be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. + +Currently, these files might (depending on your configuration) +show up in /proc/sys/kernel: + +- acct +- acpi_video_flags +- auto_msgmni +- bootloader_type [ X86 only ] +- bootloader_version [ X86 only ] +- cap_last_cap +- core_pattern +- core_pipe_limit +- core_uses_pid +- ctrl-alt-del +- dmesg_restrict +- domainname +- hostname +- hotplug +- hardlockup_all_cpu_backtrace +- hardlockup_panic +- hung_task_panic +- hung_task_check_count +- hung_task_timeout_secs +- hung_task_check_interval_secs +- hung_task_warnings +- hyperv_record_panic_msg +- kexec_load_disabled +- kptr_restrict +- l2cr [ PPC only ] +- modprobe ==> Documentation/debugging-modules.txt +- modules_disabled +- msg_next_id [ sysv ipc ] +- msgmax +- msgmnb +- msgmni +- nmi_watchdog +- osrelease +- ostype +- overflowgid +- overflowuid +- panic +- panic_on_oops +- panic_on_stackoverflow +- panic_on_unrecovered_nmi +- panic_on_warn +- panic_print +- panic_on_rcu_stall +- perf_cpu_time_max_percent +- perf_event_paranoid +- perf_event_max_stack +- perf_event_mlock_kb +- perf_event_max_contexts_per_stack +- pid_max +- powersave-nap [ PPC only ] +- printk +- printk_delay +- printk_ratelimit +- printk_ratelimit_burst +- pty ==> Documentation/filesystems/devpts.txt +- randomize_va_space +- real-root-dev ==> Documentation/admin-guide/initrd.rst +- reboot-cmd [ SPARC only ] +- rtsig-max +- rtsig-nr +- sched_energy_aware +- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst +- sem +- sem_next_id [ sysv ipc ] +- sg-big-buff [ generic SCSI device (sg) ] +- shm_next_id [ sysv ipc ] +- shm_rmid_forced +- shmall +- shmmax [ sysv ipc ] +- shmmni +- softlockup_all_cpu_backtrace +- soft_watchdog +- stack_erasing +- stop-a [ SPARC only ] +- sysrq ==> Documentation/admin-guide/sysrq.rst +- sysctl_writes_strict +- tainted ==> Documentation/admin-guide/tainted-kernels.rst +- threads-max +- unknown_nmi_panic +- watchdog +- watchdog_thresh +- version + + +acct: +===== + +highwater lowwater frequency + +If BSD-style process accounting is enabled these values control +its behaviour. If free space on filesystem where the log lives +goes below % accounting suspends. If free space gets +above % accounting resumes. determines +how often do we check the amount of free space (value is in +seconds). Default: +4 2 30 +That is, suspend accounting if there left <= 2% free; resume it +if we got >=4%; consider information about amount of free space +valid for 30 seconds. + + +acpi_video_flags: +================= + +flags + +See Doc*/kernel/power/video.txt, it allows mode of video boot to be +set during run time. + + +auto_msgmni: +============ + +This variable has no effect and may be removed in future kernel +releases. Reading it always returns 0. +Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni +upon memory add/remove or upon ipc namespace creation/removal. +Echoing "1" into this file enabled msgmni automatic recomputing. +Echoing "0" turned it off. auto_msgmni default value was 1. + + +bootloader_type: +================ + +x86 bootloader identification + +This gives the bootloader type number as indicated by the bootloader, +shifted left by 4, and OR'd with the low four bits of the bootloader +version. The reason for this encoding is that this used to match the +type_of_loader field in the kernel header; the encoding is kept for +backwards compatibility. That is, if the full bootloader type number +is 0x15 and the full version number is 0x234, this file will contain +the value 340 = 0x154. + +See the type_of_loader and ext_loader_type fields in +Documentation/x86/boot.rst for additional information. + + +bootloader_version: +=================== + +x86 bootloader version + +The complete bootloader version number. In the example above, this +file will contain the value 564 = 0x234. + +See the type_of_loader and ext_loader_ver fields in +Documentation/x86/boot.rst for additional information. + + +cap_last_cap: +============= + +Highest valid capability of the running kernel. Exports +CAP_LAST_CAP from the kernel. + + +core_pattern: +============= + +core_pattern is used to specify a core dumpfile pattern name. + +* max length 127 characters; default value is "core" +* core_pattern is used as a pattern template for the output filename; + certain string patterns (beginning with '%') are substituted with + their actual values. +* backward compatibility with core_uses_pid: + + If core_pattern does not include "%p" (default does not) + and core_uses_pid is set, then .PID will be appended to + the filename. + +* corename format specifiers:: + + % '%' is dropped + %% output one '%' + %p pid + %P global pid (init PID namespace) + %i tid + %I global tid (init PID namespace) + %u uid (in initial user namespace) + %g gid (in initial user namespace) + %d dump mode, matches PR_SET_DUMPABLE and + /proc/sys/fs/suid_dumpable + %s signal number + %t UNIX time of dump + %h hostname + %e executable filename (may be shortened) + %E executable path + % both are dropped + +* If the first character of the pattern is a '|', the kernel will treat + the rest of the pattern as a command to run. The core dump will be + written to the standard input of that program instead of to a file. + + +core_pipe_limit: +================ + +This sysctl is only applicable when core_pattern is configured to pipe +core files to a user space helper (when the first character of +core_pattern is a '|', see above). When collecting cores via a pipe +to an application, it is occasionally useful for the collecting +application to gather data about the crashing process from its +/proc/pid directory. In order to do this safely, the kernel must wait +for the collecting process to exit, so as not to remove the crashing +processes proc files prematurely. This in turn creates the +possibility that a misbehaving userspace collecting process can block +the reaping of a crashed process simply by never exiting. This sysctl +defends against that. It defines how many concurrent crashing +processes may be piped to user space applications in parallel. If +this value is exceeded, then those crashing processes above that value +are noted via the kernel log and their cores are skipped. 0 is a +special value, indicating that unlimited processes may be captured in +parallel, but that no waiting will take place (i.e. the collecting +process is not guaranteed access to /proc//). This +value defaults to 0. + + +core_uses_pid: +============== + +The default coredump filename is "core". By setting +core_uses_pid to 1, the coredump filename becomes core.PID. +If core_pattern does not include "%p" (default does not) +and core_uses_pid is set, then .PID will be appended to +the filename. + + +ctrl-alt-del: +============= + +When the value in this file is 0, ctrl-alt-del is trapped and +sent to the init(1) program to handle a graceful restart. +When, however, the value is > 0, Linux's reaction to a Vulcan +Nerve Pinch (tm) will be an immediate reboot, without even +syncing its dirty buffers. + +Note: + when a program (like dosemu) has the keyboard in 'raw' + mode, the ctrl-alt-del is intercepted by the program before it + ever reaches the kernel tty layer, and it's up to the program + to decide what to do with it. + + +dmesg_restrict: +=============== + +This toggle indicates whether unprivileged users are prevented +from using dmesg(8) to view messages from the kernel's log buffer. +When dmesg_restrict is set to (0) there are no restrictions. When +dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use +dmesg(8). + +The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the +default value of dmesg_restrict. + + +domainname & hostname: +====================== + +These files can be used to set the NIS/YP domainname and the +hostname of your box in exactly the same way as the commands +domainname and hostname, i.e.:: + + # echo "darkstar" > /proc/sys/kernel/hostname + # echo "mydomain" > /proc/sys/kernel/domainname + +has the same effect as:: + + # hostname "darkstar" + # domainname "mydomain" + +Note, however, that the classic darkstar.frop.org has the +hostname "darkstar" and DNS (Internet Domain Name Server) +domainname "frop.org", not to be confused with the NIS (Network +Information Service) or YP (Yellow Pages) domainname. These two +domain names are in general different. For a detailed discussion +see the hostname(1) man page. + + +hardlockup_all_cpu_backtrace: +============================= + +This value controls the hard lockup detector behavior when a hard +lockup condition is detected as to whether or not to gather further +debug information. If enabled, arch-specific all-CPU stack dumping +will be initiated. + +0: do nothing. This is the default behavior. + +1: on detection capture more debug information. + + +hardlockup_panic: +================= + +This parameter can be used to control whether the kernel panics +when a hard lockup is detected. + + 0 - don't panic on hard lockup + 1 - panic on hard lockup + +See Documentation/lockup-watchdogs.txt for more information. This can +also be set using the nmi_watchdog kernel parameter. + + +hotplug: +======== + +Path for the hotplug policy agent. +Default value is "/sbin/hotplug". + + +hung_task_panic: +================ + +Controls the kernel's behavior when a hung task is detected. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: continue operation. This is the default behavior. + +1: panic immediately. + + +hung_task_check_count: +====================== + +The upper bound on the number of tasks that are checked. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + + +hung_task_timeout_secs: +======================= + +When a task in D state did not get scheduled +for more than this value report a warning. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: means infinite timeout - no checking done. + +Possible values to set are in range {0..LONG_MAX/HZ}. + + +hung_task_check_interval_secs: +============================== + +Hung task check interval. If hung task checking is enabled +(see hung_task_timeout_secs), the check is done every +hung_task_check_interval_secs seconds. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0 (default): means use hung_task_timeout_secs as checking interval. +Possible values to set are in range {0..LONG_MAX/HZ}. + + +hung_task_warnings: +=================== + +The maximum number of warnings to report. During a check interval +if a hung task is detected, this value is decreased by 1. +When this value reaches 0, no more warnings will be reported. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +-1: report an infinite number of warnings. + + +hyperv_record_panic_msg: +======================== + +Controls whether the panic kmsg data should be reported to Hyper-V. + +0: do not report panic kmsg data. + +1: report the panic kmsg data. This is the default behavior. + + +kexec_load_disabled: +==================== + +A toggle indicating if the kexec_load syscall has been disabled. This +value defaults to 0 (false: kexec_load enabled), but can be set to 1 +(true: kexec_load disabled). Once true, kexec can no longer be used, and +the toggle cannot be set back to false. This allows a kexec image to be +loaded before disabling the syscall, allowing a system to set up (and +later use) an image without it being altered. Generally used together +with the "modules_disabled" sysctl. + + +kptr_restrict: +============== + +This toggle indicates whether restrictions are placed on +exposing kernel addresses via /proc and other interfaces. + +When kptr_restrict is set to 0 (the default) the address is hashed before +printing. (This is the equivalent to %p.) + +When kptr_restrict is set to (1), kernel pointers printed using the %pK +format specifier will be replaced with 0's unless the user has CAP_SYSLOG +and effective user and group ids are equal to the real ids. This is +because %pK checks are done at read() time rather than open() time, so +if permissions are elevated between the open() and the read() (e.g via +a setuid binary) then %pK will not leak kernel pointers to unprivileged +users. Note, this is a temporary solution only. The correct long-term +solution is to do the permission checks at open() time. Consider removing +world read permissions from files that use %pK, and using dmesg_restrict +to protect against uses of %pK in dmesg(8) if leaking kernel pointer +values to unprivileged users is a concern. + +When kptr_restrict is set to (2), kernel pointers printed using +%pK will be replaced with 0's regardless of privileges. + + +l2cr: (PPC only) +================ + +This flag controls the L2 cache of G3 processor boards. If +0, the cache is disabled. Enabled if nonzero. + + +modules_disabled: +================= + +A toggle value indicating if modules are allowed to be loaded +in an otherwise modular kernel. This toggle defaults to off +(0), but can be set true (1). Once true, modules can be +neither loaded nor unloaded, and the toggle cannot be set back +to false. Generally used with the "kexec_load_disabled" toggle. + + +msg_next_id, sem_next_id, and shm_next_id: +========================================== + +These three toggles allows to specify desired id for next allocated IPC +object: message, semaphore or shared memory respectively. + +By default they are equal to -1, which means generic allocation logic. +Possible values to set are in range {0..INT_MAX}. + +Notes: + 1) kernel doesn't guarantee, that new object will have desired id. So, + it's up to userspace, how to handle an object with "wrong" id. + 2) Toggle with non-default value will be set back to -1 by kernel after + successful IPC object allocation. If an IPC object allocation syscall + fails, it is undefined if the value remains unmodified or is reset to -1. + + +nmi_watchdog: +============= + +This parameter can be used to control the NMI watchdog +(i.e. the hard lockup detector) on x86 systems. + +0 - disable the hard lockup detector + +1 - enable the hard lockup detector + +The hard lockup detector monitors each CPU for its ability to respond to +timer interrupts. The mechanism utilizes CPU performance counter registers +that are programmed to generate Non-Maskable Interrupts (NMIs) periodically +while a CPU is busy. Hence, the alternative name 'NMI watchdog'. + +The NMI watchdog is disabled by default if the kernel is running as a guest +in a KVM virtual machine. This default can be overridden by adding:: + + nmi_watchdog=1 + +to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). + + +numa_balancing: +=============== + +Enables/disables automatic page fault based NUMA memory +balancing. Memory is moved automatically to nodes +that access it often. + +Enables/disables automatic NUMA memory balancing. On NUMA machines, there +is a performance penalty if remote memory is accessed by a CPU. When this +feature is enabled the kernel samples what task thread is accessing memory +by periodically unmapping pages and later trapping a page fault. At the +time of the page fault, it is determined if the data being accessed should +be migrated to a local memory node. + +The unmapping of pages and trapping faults incur additional overhead that +ideally is offset by improved memory locality but there is no universal +guarantee. If the target workload is already bound to NUMA nodes then this +feature should be disabled. Otherwise, if the system overhead from the +feature is too high then the rate the kernel samples for NUMA hinting +faults may be controlled by the numa_balancing_scan_period_min_ms, +numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, +numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. + +numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb +=============================================================================================================================== + + +Automatic NUMA balancing scans tasks address space and unmaps pages to +detect if pages are properly placed or if the data should be migrated to a +memory node local to where the task is running. Every "scan delay" the task +scans the next "scan size" number of pages in its address space. When the +end of the address space is reached the scanner restarts from the beginning. + +In combination, the "scan delay" and "scan size" determine the scan rate. +When "scan delay" decreases, the scan rate increases. The scan delay and +hence the scan rate of every task is adaptive and depends on historical +behaviour. If pages are properly placed then the scan delay increases, +otherwise the scan delay decreases. The "scan size" is not adaptive but +the higher the "scan size", the higher the scan rate. + +Higher scan rates incur higher system overhead as page faults must be +trapped and potentially data must be migrated. However, the higher the scan +rate, the more quickly a tasks memory is migrated to a local node if the +workload pattern changes and minimises performance impact due to remote +memory accesses. These sysctls control the thresholds for scan delays and +the number of pages scanned. + +numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +scan a tasks virtual memory. It effectively controls the maximum scanning +rate for each task. + +numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +when it initially forks. + +numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +scan a tasks virtual memory. It effectively controls the minimum scanning +rate for each task. + +numa_balancing_scan_size_mb is how many megabytes worth of pages are +scanned for a given scan. + + +osrelease, ostype & version: +============================ + +:: + + # cat osrelease + 2.1.88 + # cat ostype + Linux + # cat version + #5 Wed Feb 25 21:49:24 MET 1998 + +The files osrelease and ostype should be clear enough. Version +needs a little more clarification however. The '#5' means that +this is the fifth kernel built from this source base and the +date behind it indicates the time the kernel was built. +The only way to tune these values is to rebuild the kernel :-) + + +overflowgid & overflowuid: +========================== + +if your architecture did not always support 32-bit UIDs (i.e. arm, +i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to +applications that use the old 16-bit UID/GID system calls, if the +actual UID or GID would exceed 65535. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + + +panic: +====== + +The value in this file represents the number of seconds the kernel +waits before rebooting on a panic. When you use the software watchdog, +the recommended setting is 60. + + +panic_on_io_nmi: +================ + +Controls the kernel's behavior when a CPU receives an NMI caused by +an IO error. + +0: try to continue operation (default) + +1: panic immediately. The IO error triggered an NMI. This indicates a + serious system condition which could result in IO data corruption. + Rather than continuing, panicking might be a better choice. Some + servers issue this sort of NMI when the dump button is pushed, + and you can use this option to take a crash dump. + + +panic_on_oops: +============== + +Controls the kernel's behaviour when an oops or BUG is encountered. + +0: try to continue operation + +1: panic immediately. If the `panic` sysctl is also non-zero then the + machine will be rebooted. + + +panic_on_stackoverflow: +======================= + +Controls the kernel's behavior when detecting the overflows of +kernel, IRQ and exception stacks except a user stack. +This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. + +0: try to continue operation. + +1: panic immediately. + + +panic_on_unrecovered_nmi: +========================= + +The default Linux behaviour on an NMI of either memory or unknown is +to continue operation. For many environments such as scientific +computing it is preferable that the box is taken out and the error +dealt with than an uncorrected parity/ECC error get propagated. + +A small number of systems do generate NMI's for bizarre random reasons +such as power management so the default is off. That sysctl works like +the existing panic controls already in that directory. + + +panic_on_warn: +============== + +Calls panic() in the WARN() path when set to 1. This is useful to avoid +a kernel rebuild when attempting to kdump at the location of a WARN(). + +0: only WARN(), default behaviour. + +1: call panic() after printing out WARN() location. + + +panic_print: +============ + +Bitmask for printing system info when panic happens. User can chose +combination of the following bits: + +===== ======================================== +bit 0 print all tasks info +bit 1 print system memory info +bit 2 print timer info +bit 3 print locks info if CONFIG_LOCKDEP is on +bit 4 print ftrace buffer +===== ======================================== + +So for example to print tasks and memory info on panic, user can:: + + echo 3 > /proc/sys/kernel/panic_print + + +panic_on_rcu_stall: +=================== + +When set to 1, calls panic() after RCU stall detection messages. This +is useful to define the root cause of RCU stalls using a vmcore. + +0: do not panic() when RCU stall takes place, default behavior. + +1: panic() after printing RCU stall messages. + + +perf_cpu_time_max_percent: +========================== + +Hints to the kernel how much CPU time it should be allowed to +use to handle perf sampling events. If the perf subsystem +is informed that its samples are exceeding this limit, it +will drop its sampling frequency to attempt to reduce its CPU +usage. + +Some perf sampling happens in NMIs. If these samples +unexpectedly take too long to execute, the NMIs can become +stacked up next to each other so much that nothing else is +allowed to execute. + +0: + disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how CPU time it takes. + +1-100: + attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. + + +perf_event_paranoid: +==================== + +Controls use of the performance events system by unprivileged +users (without CAP_SYS_ADMIN). The default value is 2. + +=== ================================================================== + -1 Allow use of (almost) all events by all users + + Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK + +>=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN + + Disallow raw tracepoint access by users without CAP_SYS_ADMIN + +>=1 Disallow CPU event access by users without CAP_SYS_ADMIN + +>=2 Disallow kernel profiling by users without CAP_SYS_ADMIN +=== ================================================================== + + +perf_event_max_stack: +===================== + +Controls maximum number of stack frames to copy for (attr.sample_type & +PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using +'perf record -g' or 'perf trace --call-graph fp'. + +This can only be done when no events are in use that have callchains +enabled, otherwise writing to this file will return -EBUSY. + +The default value is 127. + + +perf_event_mlock_kb: +==================== + +Control size of per-cpu ring buffer not counted agains mlock limit. + +The default value is 512 + 1 page + + +perf_event_max_contexts_per_stack: +================================== + +Controls maximum number of stack frame context entries for +(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for +instance, when using 'perf record -g' or 'perf trace --call-graph fp'. + +This can only be done when no events are in use that have callchains +enabled, otherwise writing to this file will return -EBUSY. + +The default value is 8. + + +pid_max: +======== + +PID allocation wrap value. When the kernel's next PID value +reaches this value, it wraps back to a minimum PID value. +PIDs of value pid_max or larger are not allocated. + + +ns_last_pid: +============ + +The last pid allocated in the current (the one task using this sysctl +lives in) pid namespace. When selecting a pid for a next task on fork +kernel tries to allocate a number starting from this one. + + +powersave-nap: (PPC only) +========================= + +If set, Linux-PPC will use the 'nap' mode of powersaving, +otherwise the 'doze' mode will be used. + +============================================================== + +printk: +======= + +The four values in printk denote: console_loglevel, +default_message_loglevel, minimum_console_loglevel and +default_console_loglevel respectively. + +These values influence printk() behavior when printing or +logging error messages. See 'man 2 syslog' for more info on +the different loglevels. + +- console_loglevel: + messages with a higher priority than + this will be printed to the console +- default_message_loglevel: + messages without an explicit priority + will be printed with this priority +- minimum_console_loglevel: + minimum (highest) value to which + console_loglevel can be set +- default_console_loglevel: + default value for console_loglevel + + +printk_delay: +============= + +Delay each printk message in printk_delay milliseconds + +Value from 0 - 10000 is allowed. + + +printk_ratelimit: +================= + +Some warning messages are rate limited. printk_ratelimit specifies +the minimum length of time between these messages (in jiffies), by +default we allow one every 5 seconds. + +A value of 0 will disable rate limiting. + + +printk_ratelimit_burst: +======================= + +While long term we enforce one message per printk_ratelimit +seconds, we do allow a burst of messages to pass through. +printk_ratelimit_burst specifies the number of messages we can +send before ratelimiting kicks in. + + +printk_devkmsg: +=============== + +Control the logging to /dev/kmsg from userspace: + +ratelimit: + default, ratelimited + +on: unlimited logging to /dev/kmsg from userspace + +off: logging to /dev/kmsg disabled + +The kernel command line parameter printk.devkmsg= overrides this and is +a one-time setting until next reboot: once set, it cannot be changed by +this sysctl interface anymore. + + +randomize_va_space: +=================== + +This option can be used to select the type of process address +space randomization that is used in the system, for architectures +that support this feature. + +== =========================================================================== +0 Turn the process address space randomization off. This is the + default for architectures that do not support this feature anyways, + and kernels that are booted with the "norandmaps" parameter. + +1 Make the addresses of mmap base, stack and VDSO page randomized. + This, among other things, implies that shared libraries will be + loaded to random addresses. Also for PIE-linked binaries, the + location of code start is randomized. This is the default if the + CONFIG_COMPAT_BRK option is enabled. + +2 Additionally enable heap randomization. This is the default if + CONFIG_COMPAT_BRK is disabled. + + There are a few legacy applications out there (such as some ancient + versions of libc.so.5 from 1996) that assume that brk area starts + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known + non-legacy applications that would be broken this way, so for most + systems it is safe to choose full randomization. + + Systems with ancient and/or broken binaries should be configured + with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + address space randomization. +== =========================================================================== + + +reboot-cmd: (Sparc only) +======================== + +??? This seems to be a way to give an argument to the Sparc +ROM/Flash boot loader. Maybe to tell it what to do after +rebooting. ??? + + +rtsig-max & rtsig-nr: +===================== + +The file rtsig-max can be used to tune the maximum number +of POSIX realtime (queued) signals that can be outstanding +in the system. + +rtsig-nr shows the number of RT signals currently queued. + + +sched_energy_aware: +=================== + +Enables/disables Energy Aware Scheduling (EAS). EAS starts +automatically on platforms where it can run (that is, +platforms with asymmetric CPU topologies and having an Energy +Model available). If your platform happens to meet the +requirements for EAS but you do not want to use it, change +this value to 0. + + +sched_schedstats: +================= + +Enables/disables scheduler statistics. Enabling this feature +incurs a small amount of overhead in the scheduler but is +useful for debugging and performance tuning. + + +sg-big-buff: +============ + +This file shows the size of the generic SCSI (sg) buffer. +You can't tune it just yet, but you could change it on +compile time by editing include/scsi/sg.h and changing +the value of SG_BIG_BUFF. + +There shouldn't be any reason to change this value. If +you can come up with one, you probably know what you +are doing anyway :) + + +shmall: +======= + +This parameter sets the total amount of shared memory pages that +can be used system wide. Hence, SHMALL should always be at least +ceil(shmmax/PAGE_SIZE). + +If you are not sure what the default PAGE_SIZE is on your Linux +system, you can run the following command: + + # getconf PAGE_SIZE + + +shmmax: +======= + +This value can be used to query and set the run time limit +on the maximum shared memory segment size that can be created. +Shared memory segments up to 1Gb are now supported in the +kernel. This value defaults to SHMMAX. + + +shm_rmid_forced: +================ + +Linux lets you set resource limits, including how much memory one +process can consume, via setrlimit(2). Unfortunately, shared memory +segments are allowed to exist without association with any process, and +thus might not be counted against any resource limits. If enabled, +shared memory segments are automatically destroyed when their attach +count becomes zero after a detach or a process termination. It will +also destroy segments that were created, but never attached to, on exit +from the process. The only use left for IPC_RMID is to immediately +destroy an unattached segment. Of course, this breaks the way things are +defined, so some applications might stop working. Note that this +feature will do you no good unless you also configure your resource +limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't +need this. + +Note that if you change this from 0 to 1, already created segments +without users and with a dead originative process will be destroyed. + + +sysctl_writes_strict: +===================== + +Control how file position affects the behavior of updating sysctl values +via the /proc/sys interface: + + == ====================================================================== + -1 Legacy per-write sysctl value handling, with no printk warnings. + Each write syscall must fully contain the sysctl value to be + written, and multiple writes on the same sysctl file descriptor + will rewrite the sysctl value, regardless of file position. + 0 Same behavior as above, but warn about processes that perform writes + to a sysctl file descriptor when the file position is not 0. + 1 (default) Respect file position when writing sysctl strings. Multiple + writes will append to the sysctl value buffer. Anything past the max + length of the sysctl value buffer will be ignored. Writes to numeric + sysctl entries must always be at file position 0 and the value must + be fully contained in the buffer sent in the write syscall. + == ====================================================================== + + +softlockup_all_cpu_backtrace: +============================= + +This value controls the soft lockup detector thread's behavior +when a soft lockup condition is detected as to whether or not +to gather further debug information. If enabled, each cpu will +be issued an NMI and instructed to capture stack trace. + +This feature is only applicable for architectures which support +NMI. + +0: do nothing. This is the default behavior. + +1: on detection capture more debug information. + + +soft_watchdog: +============== + +This parameter can be used to control the soft lockup detector. + + 0 - disable the soft lockup detector + + 1 - enable the soft lockup detector + +The soft lockup detector monitors CPUs for threads that are hogging the CPUs +without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads +from running. The mechanism depends on the CPUs ability to respond to timer +interrupts which are needed for the 'watchdog/N' threads to be woken up by +the watchdog timer function, otherwise the NMI watchdog - if enabled - can +detect a hard lockup condition. + + +stack_erasing: +============== + +This parameter can be used to control kernel stack erasing at the end +of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. + +That erasing reduces the information which kernel stack leak bugs +can reveal and blocks some uninitialized stack variable attacks. +The tradeoff is the performance impact: on a single CPU system kernel +compilation sees a 1% slowdown, other systems and workloads may vary. + + 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. + + 1: kernel stack erasing is enabled (default), it is performed before + returning to the userspace at the end of syscalls. + + +tainted +======= + +Non-zero if the kernel has been tainted. Numeric values, which can be +ORed together. The letters are seen in "Tainted" line of Oops reports. + +====== ===== ============================================================== + 1 `(P)` proprietary module was loaded + 2 `(F)` module was force loaded + 4 `(S)` SMP kernel oops on an officially SMP incapable processor + 8 `(R)` module was force unloaded + 16 `(M)` processor reported a Machine Check Exception (MCE) + 32 `(B)` bad page referenced or some unexpected page flags + 64 `(U)` taint requested by userspace application + 128 `(D)` kernel died recently, i.e. there was an OOPS or BUG + 256 `(A)` an ACPI table was overridden by user + 512 `(W)` kernel issued warning + 1024 `(C)` staging driver was loaded + 2048 `(I)` workaround for bug in platform firmware applied + 4096 `(O)` externally-built ("out-of-tree") module was loaded + 8192 `(E)` unsigned module was loaded + 16384 `(L)` soft lockup occurred + 32768 `(K)` kernel has been live patched + 65536 `(X)` Auxiliary taint, defined and used by for distros +131072 `(T)` The kernel was built with the struct randomization plugin +====== ===== ============================================================== + +See Documentation/admin-guide/tainted-kernels.rst for more information. + + +threads-max: +============ + +This value controls the maximum number of threads that can be created +using fork(). + +During initialization the kernel sets this value such that even if the +maximum number of threads is created, the thread structures occupy only +a part (1/8th) of the available RAM pages. + +The minimum value that can be written to threads-max is 20. + +The maximum value that can be written to threads-max is given by the +constant FUTEX_TID_MASK (0x3fffffff). + +If a value outside of this range is written to threads-max an error +EINVAL occurs. + +The value written is checked against the available RAM pages. If the +thread structures would occupy too much (more than 1/8th) of the +available RAM pages threads-max is reduced accordingly. + + +unknown_nmi_panic: +================== + +The value in this file affects behavior of handling NMI. When the +value is non-zero, unknown NMI is trapped and then panic occurs. At +that time, kernel debugging information is displayed on console. + +NMI switch that most IA32 servers have fires unknown NMI up, for +example. If a system hangs up, try pressing the NMI switch. + + +watchdog: +========= + +This parameter can be used to disable or enable the soft lockup detector +_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. + + 0 - disable both lockup detectors + + 1 - enable both lockup detectors + +The soft lockup detector and the NMI watchdog can also be disabled or +enabled individually, using the soft_watchdog and nmi_watchdog parameters. +If the watchdog parameter is read, for example by executing:: + + cat /proc/sys/kernel/watchdog + +the output of this command (0 or 1) shows the logical OR of soft_watchdog +and nmi_watchdog. + + +watchdog_cpumask: +================= + +This value can be used to control on which cpus the watchdog may run. +The default cpumask is all possible cores, but if NO_HZ_FULL is +enabled in the kernel config, and cores are specified with the +nohz_full= boot argument, those cores are excluded by default. +Offline cores can be included in this mask, and if the core is later +brought online, the watchdog will be started based on the mask value. + +Typically this value would only be touched in the nohz_full case +to re-enable cores that by default were not running the watchdog, +if a kernel lockup was suspected on those cores. + +The argument value is the standard cpulist format for cpumasks, +so for example to enable the watchdog on cores 0, 2, 3, and 4 you +might say:: + + echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask + + +watchdog_thresh: +================ + +This value can be used to control the frequency of hrtimer and NMI +events and the soft and hard lockup thresholds. The default threshold +is 10 seconds. + +The softlockup threshold is (2 * watchdog_thresh). Setting this +tunable to zero will disable lockup detection altogether. diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst new file mode 100644 index 000000000000..a7d44e71019d --- /dev/null +++ b/Documentation/admin-guide/sysctl/net.rst @@ -0,0 +1,461 @@ +================================ +Documentation for /proc/sys/net/ +================================ + +Copyright + +Copyright (c) 1999 + + - Terrehon Bowden + - Bodo Bauer + +Copyright (c) 2000 + + - Jorge Nerin + +Copyright (c) 2009 + + - Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/net + +The interface to the networking parts of the kernel is located in +/proc/sys/net. The following table shows all possible subdirectories. You may +see only some of them, depending on your kernel's configuration. + + +Table : Subdirectories in /proc/sys/net + + ========= =================== = ========== ================== + Directory Content Directory Content + ========= =================== = ========== ================== + core General parameter appletalk Appletalk protocol + unix Unix domain sockets netrom NET/ROM + 802 E802 protocol ax25 AX25 + ethernet Ethernet protocol rose X.25 PLP layer + ipv4 IP version 4 x25 X.25 protocol + ipx IPX token-ring IBM token ring + bridge Bridging decnet DEC net + ipv6 IP version 6 tipc TIPC + ========= =================== = ========== ================== + +1. /proc/sys/net/core - Network core options +============================================ + +bpf_jit_enable +-------------- + +This enables the BPF Just in Time (JIT) compiler. BPF is a flexible +and efficient infrastructure allowing to execute bytecode at various +hook points. It is used in a number of Linux kernel subsystems such +as networking (e.g. XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints) +and security (e.g. seccomp). LLVM has a BPF back end that can compile +restricted C into a sequence of BPF instructions. After program load +through bpf(2) and passing a verifier in the kernel, a JIT will then +translate these BPF proglets into native CPU instructions. There are +two flavors of JITs, the newer eBPF JIT currently supported on: + + - x86_64 + - x86_32 + - arm64 + - arm32 + - ppc64 + - sparc64 + - mips64 + - s390x + - riscv + +And the older cBPF JIT supported on the following archs: + + - mips + - ppc + - sparc + +eBPF JITs are a superset of cBPF JITs, meaning the kernel will +migrate cBPF instructions into eBPF instructions and then JIT +compile them transparently. Older cBPF JITs can only translate +tcpdump filters, seccomp rules, etc, but not mentioned eBPF +programs loaded through bpf(2). + +Values: + + - 0 - disable the JIT (default value) + - 1 - enable the JIT + - 2 - enable the JIT and ask the compiler to emit traces on kernel log. + +bpf_jit_harden +-------------- + +This enables hardening for the BPF JIT compiler. Supported are eBPF +JIT backends. Enabling hardening trades off performance, but can +mitigate JIT spraying. + +Values: + + - 0 - disable JIT hardening (default value) + - 1 - enable JIT hardening for unprivileged users only + - 2 - enable JIT hardening for all users + +bpf_jit_kallsyms +---------------- + +When BPF JIT compiler is enabled, then compiled images are unknown +addresses to the kernel, meaning they neither show up in traces nor +in /proc/kallsyms. This enables export of these addresses, which can +be used for debugging/tracing. If bpf_jit_harden is enabled, this +feature is disabled. + +Values : + + - 0 - disable JIT kallsyms export (default value) + - 1 - enable JIT kallsyms export for privileged users only + +bpf_jit_limit +------------- + +This enforces a global limit for memory allocations to the BPF JIT +compiler in order to reject unprivileged JIT requests once it has +been surpassed. bpf_jit_limit contains the value of the global limit +in bytes. + +dev_weight +---------- + +The maximum number of packets that kernel can handle on a NAPI interrupt, +it's a Per-CPU variable. For drivers that support LRO or GRO_HW, a hardware +aggregated packet is counted as one packet in this context. + +Default: 64 + +dev_weight_rx_bias +------------------ + +RPS (e.g. RFS, aRFS) processing is competing with the registered NAPI poll function +of the driver for the per softirq cycle netdev_budget. This parameter influences +the proportion of the configured netdev_budget that is spent on RPS based packet +processing during RX softirq cycles. It is further meant for making current +dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack. +(see dev_weight_tx_bias) It is effective on a per CPU basis. Determination is based +on dev_weight and is calculated multiplicative (dev_weight * dev_weight_rx_bias). + +Default: 1 + +dev_weight_tx_bias +------------------ + +Scales the maximum number of packets that can be processed during a TX softirq cycle. +Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric +net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog. + +Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias). + +Default: 1 + +default_qdisc +------------- + +The default queuing discipline to use for network devices. This allows +overriding the default of pfifo_fast with an alternative. Since the default +queuing discipline is created without additional parameters so is best suited +to queuing disciplines that work well without configuration like stochastic +fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use +queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin +which require setting up classes and bandwidths. Note that physical multiqueue +interfaces still use mq as root qdisc, which in turn uses this default for its +leaves. Virtual devices (like e.g. lo or veth) ignore this setting and instead +default to noqueue. + +Default: pfifo_fast + +busy_read +--------- + +Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for packets on the device queue. +This sets the default value of the SO_BUSY_POLL socket option. +Can be set or overridden per socket by setting socket option SO_BUSY_POLL, +which is the preferred method of enabling. If you need to enable the feature +globally via sysctl, a value of 50 is recommended. + +Will increase power usage. + +Default: 0 (off) + +busy_poll +---------------- +Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for events. +Recommended value depends on the number of sockets you poll on. +For several sockets 50, for several hundreds 100. +For more than that you probably want to use epoll. +Note that only sockets with SO_BUSY_POLL set will be busy polled, +so you want to either selectively set SO_BUSY_POLL on those sockets or set +sysctl.net.busy_read globally. + +Will increase power usage. + +Default: 0 (off) + +rmem_default +------------ + +The default setting of the socket receive buffer in bytes. + +rmem_max +-------- + +The maximum receive socket buffer size in bytes. + +tstamp_allow_data +----------------- +Allow processes to receive tx timestamps looped together with the original +packet contents. If disabled, transmit timestamp requests from unprivileged +processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set. + +Default: 1 (on) + + +wmem_default +------------ + +The default setting (in bytes) of the socket send buffer. + +wmem_max +-------- + +The maximum send socket buffer size in bytes. + +message_burst and message_cost +------------------------------ + +These parameters are used to limit the warning messages written to the kernel +log from the networking code. They enforce a rate limit to make a +denial-of-service attack impossible. A higher message_cost factor, results in +fewer messages that will be written. Message_burst controls when messages will +be dropped. The default settings limit warning messages to one every five +seconds. + +warnings +-------- + +This sysctl is now unused. + +This was used to control console messages from the networking stack that +occur because of problems on the network like duplicate address or bad +checksums. + +These messages are now emitted at KERN_DEBUG and can generally be enabled +and controlled by the dynamic_debug facility. + +netdev_budget +------------- + +Maximum number of packets taken from all interfaces in one polling cycle (NAPI +poll). In one polling cycle interfaces which are registered to polling are +probed in a round-robin manner. Also, a polling cycle may not exceed +netdev_budget_usecs microseconds, even if netdev_budget has not been +exhausted. + +netdev_budget_usecs +--------------------- + +Maximum number of microseconds in one NAPI polling cycle. Polling +will exit when either netdev_budget_usecs have elapsed during the +poll cycle or the number of packets processed reaches netdev_budget. + +netdev_max_backlog +------------------ + +Maximum number of packets, queued on the INPUT side, when the interface +receives packets faster than kernel can process them. + +netdev_rss_key +-------------- + +RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is +randomly generated. +Some user space might need to gather its content even if drivers do not +provide ethtool -x support yet. + +:: + + myhost:~# cat /proc/sys/net/core/netdev_rss_key + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) + +File contains nul bytes if no driver ever called netdev_rss_key_fill() function. + +Note: + /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, + but most drivers only use 40 bytes of it. + +:: + + myhost:~# ethtool -x eth0 + RX flow hash indirection table for eth0 with 8 RX ring(s): + 0: 0 1 2 3 4 5 6 7 + RSS hash key: + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89 + +netdev_tstamp_prequeue +---------------------- + +If set to 0, RX packet timestamps can be sampled after RPS processing, when +the target CPU processes packets. It might give some delay on timestamps, but +permit to distribute the load on several cpus. + +If set to 1 (default), timestamps are sampled as soon as possible, before +queueing. + +optmem_max +---------- + +Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence +of struct cmsghdr structures with appended data. + +fb_tunnels_only_for_init_net +---------------------------- + +Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0, +sit0, ip6tnl0, ip6gre0) are automatically created when a new +network namespace is created, if corresponding tunnel is present +in initial network namespace. +If set to 1, these devices are not automatically created, and +user space is responsible for creating them if needed. + +Default : 0 (for compatibility reasons) + +devconf_inherit_init_net +------------------------ + +Controls if a new network namespace should inherit all current +settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. By +default, we keep the current behavior: for IPv4 we inherit all current +settings from init_net and for IPv6 we reset all settings to default. + +If set to 1, both IPv4 and IPv6 settings are forced to inherit from +current ones in init_net. If set to 2, both IPv4 and IPv6 settings are +forced to reset to their default values. + +Default : 0 (for compatibility reasons) + +2. /proc/sys/net/unix - Parameters for Unix domain sockets +---------------------------------------------------------- + +There is only one file in this directory. +unix_dgram_qlen limits the max number of datagrams queued in Unix domain +socket's buffer. It will not take effect unless PF_UNIX flag is specified. + + +3. /proc/sys/net/ipv4 - IPV4 settings +------------------------------------- +Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for +descriptions of these entries. + + +4. Appletalk +------------ + +The /proc/sys/net/appletalk directory holds the Appletalk configuration data +when Appletalk is loaded. The configurable parameters are: + +aarp-expiry-time +---------------- + +The amount of time we keep an ARP entry before expiring it. Used to age out +old hosts. + +aarp-resolve-time +----------------- + +The amount of time we will spend trying to resolve an Appletalk address. + +aarp-retransmit-limit +--------------------- + +The number of times we will retransmit a query before giving up. + +aarp-tick-time +-------------- + +Controls the rate at which expires are checked. + +The directory /proc/net/appletalk holds the list of active Appletalk sockets +on a machine. + +The fields indicate the DDP type, the local address (in network:node format) +the remote address, the size of the transmit pending queue, the size of the +received queue (bytes waiting for applications to read) the state and the uid +owning the socket. + +/proc/net/atalk_iface lists all the interfaces configured for appletalk.It +shows the name of the interface, its Appletalk address, the network range on +that address (or network number for phase 1 networks), and the status of the +interface. + +/proc/net/atalk_route lists each known network route. It lists the target +(network) that the route leads to, the router (may be directly connected), the +route flags, and the device the route is using. + + +5. IPX +------ + +The IPX protocol has no tunable values in proc/sys/net. + +The IPX protocol does, however, provide proc/net/ipx. This lists each IPX +socket giving the local and remote addresses in Novell format (that is +network:node:port). In accordance with the strange Novell tradition, +everything but the port is in hex. Not_Connected is displayed for sockets that +are not tied to a specific remote address. The Tx and Rx queue sizes indicate +the number of bytes pending for transmission and reception. The state +indicates the state the socket is in and the uid is the owning uid of the +socket. + +The /proc/net/ipx_interface file lists all IPX interfaces. For each interface +it gives the network number, the node number, and indicates if the network is +the primary network. It also indicates which device it is bound to (or +Internal for internal networks) and the Frame Type if appropriate. Linux +supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for +IPX. + +The /proc/net/ipx_route table holds a list of IPX routes. For each route it +gives the destination network, the router node (or Directly) and the network +address of the router (or Connected) for internal networks. + +6. TIPC +------- + +tipc_rmem +--------- + +The TIPC protocol now has a tunable for the receive memory, similar to the +tcp_rmem - i.e. a vector of 3 INTEGERs: (min, default, max) + +:: + + # cat /proc/sys/net/tipc/tipc_rmem + 4252725 34021800 68043600 + # + +The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values +are scaled (shifted) versions of that same value. Note that the min value +is not at this point in time used in any meaningful way, but the triplet is +preserved in order to be consistent with things like tcp_rmem. + +named_timeout +------------- + +TIPC name table updates are distributed asynchronously in a cluster, without +any form of transaction handling. This means that different race scenarios are +possible. One such is that a name withdrawal sent out by one node and received +by another node may arrive after a second, overlapping name publication already +has been accepted from a third node, although the conflicting updates +originally may have been issued in the correct sequential order. +If named_timeout is nonzero, failed topology updates will be placed on a defer +queue until another event arrives that clears the error, or until the timeout +expires. Value is in milliseconds. diff --git a/Documentation/admin-guide/sysctl/sunrpc.rst b/Documentation/admin-guide/sysctl/sunrpc.rst new file mode 100644 index 000000000000..09780a682afd --- /dev/null +++ b/Documentation/admin-guide/sysctl/sunrpc.rst @@ -0,0 +1,25 @@ +=================================== +Documentation for /proc/sys/sunrpc/ +=================================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/sunrpc and is valid for Linux kernel version 2.2. + +The files in this directory can be used to (re)set the debug +flags of the SUN Remote Procedure Call (RPC) subsystem in +the Linux kernel. This stuff is used for NFS, KNFSD and +maybe a few other things as well. + +The files in there are used to control the debugging flags: +rpc_debug, nfs_debug, nfsd_debug and nlm_debug. + +These flags are for kernel hackers only. You should read the +source code in net/sunrpc/ for more information. diff --git a/Documentation/admin-guide/sysctl/user.rst b/Documentation/admin-guide/sysctl/user.rst new file mode 100644 index 000000000000..650eaa03f15e --- /dev/null +++ b/Documentation/admin-guide/sysctl/user.rst @@ -0,0 +1,78 @@ +================================= +Documentation for /proc/sys/user/ +================================= + +kernel version 4.9.0 + +Copyright (c) 2016 Eric Biederman + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/user. + +The files in this directory can be used to override the default +limits on the number of namespaces and other objects that have +per user per user namespace limits. + +The primary purpose of these limits is to stop programs that +malfunction and attempt to create a ridiculous number of objects, +before the malfunction becomes a system wide problem. It is the +intention that the defaults of these limits are set high enough that +no program in normal operation should run into these limits. + +The creation of per user per user namespace objects are charged to +the user in the user namespace who created the object and +verified to be below the per user limit in that user namespace. + +The creation of objects is also charged to all of the users +who created user namespaces the creation of the object happens +in (user namespaces can be nested) and verified to be below the per user +limits in the user namespaces of those users. + +This recursive counting of created objects ensures that creating a +user namespace does not allow a user to escape their current limits. + +Currently, these files are in /proc/sys/user: + +max_cgroup_namespaces +===================== + + The maximum number of cgroup namespaces that any user in the current + user namespace may create. + +max_ipc_namespaces +================== + + The maximum number of ipc namespaces that any user in the current + user namespace may create. + +max_mnt_namespaces +================== + + The maximum number of mount namespaces that any user in the current + user namespace may create. + +max_net_namespaces +================== + + The maximum number of network namespaces that any user in the + current user namespace may create. + +max_pid_namespaces +================== + + The maximum number of pid namespaces that any user in the current + user namespace may create. + +max_user_namespaces +=================== + + The maximum number of user namespaces that any user in the current + user namespace may create. + +max_uts_namespaces +================== + + The maximum number of user namespaces that any user in the current + user namespace may create. diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst new file mode 100644 index 000000000000..5aceb5cd5ce7 --- /dev/null +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -0,0 +1,964 @@ +=============================== +Documentation for /proc/sys/vm/ +=============================== + +kernel version 2.6.29 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2008 Peter W. Morreale + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/vm and is valid for Linux kernel version 2.6.29. + +The files in this directory can be used to tune the operation +of the virtual memory (VM) subsystem of the Linux kernel and +the writeout of dirty data to disk. + +Default values and initialization routines for most of these +files can be found in mm/swap.c. + +Currently, these files are in /proc/sys/vm: + +- admin_reserve_kbytes +- block_dump +- compact_memory +- compact_unevictable_allowed +- dirty_background_bytes +- dirty_background_ratio +- dirty_bytes +- dirty_expire_centisecs +- dirty_ratio +- dirtytime_expire_seconds +- dirty_writeback_centisecs +- drop_caches +- extfrag_threshold +- hugetlb_shm_group +- laptop_mode +- legacy_va_layout +- lowmem_reserve_ratio +- max_map_count +- memory_failure_early_kill +- memory_failure_recovery +- min_free_kbytes +- min_slab_ratio +- min_unmapped_ratio +- mmap_min_addr +- mmap_rnd_bits +- mmap_rnd_compat_bits +- nr_hugepages +- nr_hugepages_mempolicy +- nr_overcommit_hugepages +- nr_trim_pages (only if CONFIG_MMU=n) +- numa_zonelist_order +- oom_dump_tasks +- oom_kill_allocating_task +- overcommit_kbytes +- overcommit_memory +- overcommit_ratio +- page-cluster +- panic_on_oom +- percpu_pagelist_fraction +- stat_interval +- stat_refresh +- numa_stat +- swappiness +- unprivileged_userfaultfd +- user_reserve_kbytes +- vfs_cache_pressure +- watermark_boost_factor +- watermark_scale_factor +- zone_reclaim_mode + + +admin_reserve_kbytes +==================== + +The amount of free memory in the system that should be reserved for users +with the capability cap_sys_admin. + +admin_reserve_kbytes defaults to min(3% of free pages, 8MB) + +That should provide enough for the admin to log in and kill a process, +if necessary, under the default overcommit 'guess' mode. + +Systems running under overcommit 'never' should increase this to account +for the full Virtual Memory Size of programs used to recover. Otherwise, +root may not be able to log in to recover the system. + +How do you calculate a minimum useful reserve? + +sshd or login + bash (or some other shell) + top (or ps, kill, etc.) + +For overcommit 'guess', we can sum resident set sizes (RSS). +On x86_64 this is about 8MB. + +For overcommit 'never', we can take the max of their virtual sizes (VSZ) +and add the sum of their RSS. +On x86_64 this is about 128MB. + +Changing this takes effect whenever an application requests memory. + + +block_dump +========== + +block_dump enables block I/O debugging when set to a nonzero value. More +information on block I/O debugging is in Documentation/laptops/laptop-mode.rst. + + +compact_memory +============== + +Available only when CONFIG_COMPACTION is set. When 1 is written to the file, +all zones are compacted such that free memory is available in contiguous +blocks where possible. This can be important for example in the allocation of +huge pages although processes will also directly compact memory as required. + + +compact_unevictable_allowed +=========================== + +Available only when CONFIG_COMPACTION is set. When set to 1, compaction is +allowed to examine the unevictable lru (mlocked pages) for pages to compact. +This should be used on systems where stalls for minor page faults are an +acceptable trade for large contiguous free memory. Set to 0 to prevent +compaction from moving pages that are unevictable. Default value is 1. + + +dirty_background_bytes +====================== + +Contains the amount of dirty memory at which the background kernel +flusher threads will start writeback. + +Note: + dirty_background_bytes is the counterpart of dirty_background_ratio. Only + one of them may be specified at a time. When one sysctl is written it is + immediately taken into account to evaluate the dirty memory limits and the + other appears as 0 when read. + + +dirty_background_ratio +====================== + +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which the background kernel +flusher threads will start writing out dirty data. + +The total available memory is not equal to total system memory. + + +dirty_bytes +=========== + +Contains the amount of dirty memory at which a process generating disk writes +will itself start writeback. + +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be +specified at a time. When one sysctl is written it is immediately taken into +account to evaluate the dirty memory limits and the other appears as 0 when +read. + +Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any +value lower than this limit will be ignored and the old configuration will be +retained. + + +dirty_expire_centisecs +====================== + +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the kernel flusher threads. It is expressed in 100'ths +of a second. Data which has been dirty in-memory for longer than this +interval will be written out next time a flusher thread wakes up. + + +dirty_ratio +=========== + +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which a process which is +generating disk writes will itself start writing out dirty data. + +The total available memory is not equal to total system memory. + + +dirtytime_expire_seconds +======================== + +When a lazytime inode is constantly having its pages dirtied, the inode with +an updated timestamp will never get chance to be written out. And, if the +only thing that has happened on the file system is a dirtytime inode caused +by an atime update, a worker will be scheduled to make sure that inode +eventually gets pushed out to disk. This tunable is used to define when dirty +inode is old enough to be eligible for writeback by the kernel flusher threads. +And, it is also used as the interval to wakeup dirtytime_writeback thread. + + +dirty_writeback_centisecs +========================= + +The kernel flusher threads will periodically wake up and write `old` data +out to disk. This tunable expresses the interval between those wakeups, in +100'ths of a second. + +Setting this to zero disables periodic writeback altogether. + + +drop_caches +=========== + +Writing to this will cause the kernel to drop clean caches, as well as +reclaimable slab objects like dentries and inodes. Once dropped, their +memory becomes free. + +To free pagecache:: + + echo 1 > /proc/sys/vm/drop_caches + +To free reclaimable slab objects (includes dentries and inodes):: + + echo 2 > /proc/sys/vm/drop_caches + +To free slab objects and pagecache:: + + echo 3 > /proc/sys/vm/drop_caches + +This is a non-destructive operation and will not free any dirty objects. +To increase the number of objects freed by this operation, the user may run +`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the +number of dirty objects on the system and create more candidates to be +dropped. + +This file is not a means to control the growth of the various kernel caches +(inodes, dentries, pagecache, etc...) These objects are automatically +reclaimed by the kernel when memory is needed elsewhere on the system. + +Use of this file can cause performance problems. Since it discards cached +objects, it may cost a significant amount of I/O and CPU to recreate the +dropped objects, especially if they were under heavy use. Because of this, +use outside of a testing or debugging environment is not recommended. + +You may see informational messages in your kernel log when this file is +used:: + + cat (1234): drop_caches: 3 + +These are informational only. They do not mean that anything is wrong +with your system. To disable them, echo 4 (bit 2) into drop_caches. + + +extfrag_threshold +================= + +This parameter affects whether the kernel will compact memory or direct +reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in +debugfs shows what the fragmentation index for each order is in each zone in +the system. Values tending towards 0 imply allocations would fail due to lack +of memory, values towards 1000 imply failures are due to fragmentation and -1 +implies that the allocation will succeed as long as watermarks are met. + +The kernel will not compact memory in a zone if the +fragmentation index is <= extfrag_threshold. The default value is 500. + + +highmem_is_dirtyable +==================== + +Available only for systems with CONFIG_HIGHMEM enabled (32b systems). + +This parameter controls whether the high memory is considered for dirty +writers throttling. This is not the case by default which means that +only the amount of memory directly visible/usable by the kernel can +be dirtied. As a result, on systems with a large amount of memory and +lowmem basically depleted writers might be throttled too early and +streaming writes can get very slow. + +Changing the value to non zero would allow more memory to be dirtied +and thus allow writers to write more data which can be flushed to the +storage more effectively. Note this also comes with a risk of pre-mature +OOM killer because some writers (e.g. direct block device writes) can +only use the low memory and they can fill it up with dirty data without +any throttling. + + +hugetlb_shm_group +================= + +hugetlb_shm_group contains group id that is allowed to create SysV +shared memory segment using hugetlb page. + + +laptop_mode +=========== + +laptop_mode is a knob that controls "laptop mode". All the things that are +controlled by this knob are discussed in Documentation/laptops/laptop-mode.rst. + + +legacy_va_layout +================ + +If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel +will use the legacy (2.4) layout for all processes. + + +lowmem_reserve_ratio +==================== + +For some specialised workloads on highmem machines it is dangerous for +the kernel to allow process memory to be allocated from the "lowmem" +zone. This is because that memory could then be pinned via the mlock() +system call, or by unavailability of swapspace. + +And on large highmem machines this lack of reclaimable lowmem memory +can be fatal. + +So the Linux page allocator has a mechanism which prevents allocations +which *could* use highmem from using too much lowmem. This means that +a certain amount of lowmem is defended from the possibility of being +captured into pinned user memory. + +(The same argument applies to the old 16 megabyte ISA DMA region. This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is +in defending these lower zones. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should change the lowmem_reserve_ratio setting. + +The lowmem_reserve_ratio is an array. You can see them by reading this file:: + + % cat /proc/sys/vm/lowmem_reserve_ratio + 256 256 32 + +But, these values are not used directly. The kernel calculates # of protection +pages for each zones from them. These are shown as array of protection pages +in /proc/zoneinfo like followings. (This is an example of x86-64 box). +Each zone has an array of protection pages like this:: + + Node 0, zone DMA + pages free 1355 + min 3 + low 3 + high 4 + : + : + numa_other 0 + protection: (0, 2004, 2004, 2004) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + pagesets + cpu: 0 pcp: 0 + : + +These protections are added to score to judge whether this zone should be used +for page allocation or should be reclaimed. + +In this example, if normal pages (index=2) are required to this DMA zone and +watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] +(4 + 2004 = 2008). If this protection value is 0, this zone would be used for +normal page requirement. If requirement is DMA zone(index=0), protection[0] +(=0) is used. + +zone[i]'s protection[j] is calculated by following expression:: + + (i < j): + zone[i]->protection[j] + = (total sums of managed_pages from zone[i+1] to zone[j] on the node) + / lowmem_reserve_ratio[i]; + (i = j): + (should not be protected. = 0; + (i > j): + (not necessary, but looks 0) + +The default values of lowmem_reserve_ratio[i] are + + === ==================================== + 256 (if zone[i] means DMA or DMA32 zone) + 32 (others) + === ==================================== + +As above expression, they are reciprocal number of ratio. +256 means 1/256. # of protection pages becomes about "0.39%" of total managed +pages of higher zones on the node. + +If you would like to protect more pages, smaller values are effective. +The minimum value is 1 (1/1 -> 100%). The value less than 1 completely +disables protection of the pages. + + +max_map_count: +============== + +This file contains the maximum number of memory map areas a process +may have. Memory map areas are used as a side-effect of calling +malloc, directly by mmap, mprotect, and madvise, and also when loading +shared libraries. + +While most applications need less than a thousand maps, certain +programs, particularly malloc debuggers, may consume lots of them, +e.g., up to one or two maps per allocation. + +The default value is 65536. + + +memory_failure_early_kill: +========================== + +Control how to kill processes when uncorrected memory error (typically +a 2bit error in a memory module) is detected in the background by hardware +that cannot be handled by the kernel. In some cases (like the page +still having a valid copy on disk) the kernel will handle the failure +transparently without affecting any applications. But if there is +no other uptodate copy of the data it will kill to prevent any data +corruptions from propagating. + +1: Kill all processes that have the corrupted and not reloadable page mapped +as soon as the corruption is detected. Note this is not supported +for a few types of pages, like kernel internally allocated data or +the swap cache, but works for the majority of user pages. + +0: Only unmap the corrupted page from all processes and only kill a process +who tries to access it. + +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can +handle this if they want to. + +This is only active on architectures/platforms with advanced machine +check handling and depends on the hardware capabilities. + +Applications can override this setting individually with the PR_MCE_KILL prctl + + +memory_failure_recovery +======================= + +Enable memory failure recovery (when supported by the platform) + +1: Attempt recovery. + +0: Always panic on a memory failure. + + +min_free_kbytes +=============== + +This is used to force the Linux VM to keep a minimum number +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. + +Some minimal amount of memory is needed to satisfy PF_MEMALLOC +allocations; if you set this to lower than 1024KB, your system will +become subtly broken, and prone to deadlock under high loads. + +Setting this too high will OOM your machine instantly. + + +min_slab_ratio +============== + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. On Zone reclaim +(fallback from the local zone occurs) slabs will be reclaimed if more +than this percentage of pages in a zone are reclaimable slab pages. +This insures that the slab growth stays under control even in NUMA +systems that rarely perform global reclaim. + +The default is 5 percent. + +Note that slab reclaim is triggered in a per zone / node fashion. +The process of reclaiming slab memory is currently not node specific +and may not be fast. + + +min_unmapped_ratio +================== + +This is available only on NUMA kernels. + +This is a percentage of the total pages in each zone. Zone reclaim will +only occur if more than this percentage of pages are in a state that +zone_reclaim_mode allows to be reclaimed. + +If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared +against all file-backed unmapped pages including swapcache pages and tmpfs +files. Otherwise, only unmapped pages backed by normal files but not tmpfs +files and similar are considered. + +The default is 1 percent. + + +mmap_min_addr +============= + +This file indicates the amount of address space which a user process will +be restricted from mmapping. Since kernel null dereference bugs could +accidentally operate based on the information in the first couple of pages +of memory userspace processes should not be allowed to write to them. By +default this value is set to 0 and no protections will be enforced by the +security module. Setting this value to something like 64k will allow the +vast majority of applications to work correctly and provide defense in depth +against future potential kernel bugs. + + +mmap_rnd_bits +============= + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations on architectures which support +tuning address space randomization. This value will be bounded +by the architecture's minimum and maximum supported values. + +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_bits tunable + + +mmap_rnd_compat_bits +==================== + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations for applications run in +compatibility mode on architectures which support tuning address +space randomization. This value will be bounded by the +architecture's minimum and maximum supported values. + +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_compat_bits tunable + + +nr_hugepages +============ + +Change the minimum size of the hugepage pool. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_hugepages_mempolicy +====================== + +Change the size of the hugepage pool at run-time on a specific +set of NUMA nodes. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_overcommit_hugepages +======================= + +Change the maximum size of the hugepage pool. The maximum is +nr_hugepages + nr_overcommit_hugepages. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_trim_pages +============= + +This is available only on NOMMU kernels. + +This value adjusts the excess page trimming behaviour of power-of-2 aligned +NOMMU mmap allocations. + +A value of 0 disables trimming of allocations entirely, while a value of 1 +trims excess pages aggressively. Any value >= 1 acts as the watermark where +trimming of allocations is initiated. + +The default value is 1. + +See Documentation/nommu-mmap.txt for more information. + + +numa_zonelist_order +=================== + +This sysctl is only for NUMA and it is deprecated. Anything but +Node order will fail! + +'where the memory is allocated from' is controlled by zonelists. + +(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. +you may be able to read ZONE_DMA as ZONE_DMA32...) + +In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. +ZONE_NORMAL -> ZONE_DMA +This means that a memory allocation request for GFP_KERNEL will +get memory from ZONE_DMA only when ZONE_NORMAL is not available. + +In NUMA case, you can think of following 2 types of order. +Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL:: + + (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL + (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. + +Type(A) offers the best locality for processes on Node(0), but ZONE_DMA +will be used before ZONE_NORMAL exhaustion. This increases possibility of +out-of-memory(OOM) of ZONE_DMA because ZONE_DMA is tend to be small. + +Type(B) cannot offer the best locality but is more robust against OOM of +the DMA zone. + +Type(A) is called as "Node" order. Type (B) is "Zone" order. + +"Node order" orders the zonelists by node, then by zone within each node. +Specify "[Nn]ode" for node order + +"Zone Order" orders the zonelists by zone type, then by node within each +zone. Specify "[Zz]one" for zone order. + +Specify "[Dd]efault" to request automatic configuration. + +On 32-bit, the Normal zone needs to be preserved for allocations accessible +by the kernel, so "zone" order will be selected. + +On 64-bit, devices that require DMA32/DMA are relatively rare, so "node" +order will be selected. + +Default order is recommended unless this is causing problems for your +system/application. + + +oom_dump_tasks +============== + +Enables a system-wide task dump (excluding kernel threads) to be produced +when the kernel performs an OOM-killing and includes such information as +pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj +score, and name. This is helpful to determine why the OOM killer was +invoked, to identify the rogue task that caused it, and to determine why +the OOM killer chose the task it did to kill. + +If this is set to zero, this information is suppressed. On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. + +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 1 (enabled). + + +oom_kill_allocating_task +======================== + +This enables or disables killing the OOM-triggering task in +out-of-memory situations. + +If this is set to zero, the OOM killer will scan through the entire +tasklist and select a task based on heuristics to kill. This normally +selects a rogue memory-hogging task that frees up a large amount of +memory when killed. + +If this is set to non-zero, the OOM killer simply kills the task that +triggered the out-of-memory condition. This avoids the expensive +tasklist scan. + +If panic_on_oom is selected, it takes precedence over whatever value +is used in oom_kill_allocating_task. + +The default value is 0. + + +overcommit_kbytes +================= + +When overcommit_memory is set to 2, the committed address space is not +permitted to exceed swap plus this amount of physical RAM. See below. + +Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one +of them may be specified at a time. Setting one disables the other (which +then appears as 0 when read). + + +overcommit_memory +================= + +This value contains a flag that enables memory overcommitment. + +When this flag is 0, the kernel attempts to estimate the amount +of free memory left when userspace requests more memory. + +When this flag is 1, the kernel pretends there is always enough +memory until it actually runs out. + +When this flag is 2, the kernel uses a "never overcommit" +policy that attempts to prevent any overcommit of memory. +Note that user_reserve_kbytes affects this policy. + +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. + +The default value is 0. + +See Documentation/vm/overcommit-accounting.rst and +mm/util.c::__vm_enough_memory() for more information. + + +overcommit_ratio +================ + +When overcommit_memory is set to 2, the committed address +space is not permitted to exceed swap plus this percentage +of physical RAM. See above. + + +page-cluster +============ + +page-cluster controls the number of pages up to which consecutive pages +are read in from swap in a single attempt. This is the swap counterpart +to page cache readahead. +The mentioned consecutivity is not in terms of virtual/physical addresses, +but consecutive on swap space - that means they were swapped out together. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. +Zero disables swap readahead completely. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +Lower values mean lower latencies for initial faults, but at the same time +extra faults and I/O delays for following faults if they would have been part of +that consecutive pages readahead would have brought in. + + +panic_on_oom +============ + +This enables or disables panic on out-of-memory feature. + +If this is set to 0, the kernel will kill some rogue process, +called oom_killer. Usually, oom_killer can kill rogue processes and +system will survive. + +If this is set to 1, the kernel panics when out-of-memory happens. +However, if a process limits using nodes by mempolicy/cpusets, +and those nodes become memory exhaustion status, one process +may be killed by oom-killer. No panic occurs in this case. +Because other nodes' memory may be free. This means system total status +may be not fatal yet. + +If this is set to 2, the kernel panics compulsorily even on the +above-mentioned. Even oom happens under memory cgroup, the whole +system panics. + +The default value is 0. + +1 and 2 are for failover of clustering. Please select either +according to your policy of failover. + +panic_on_oom=2+kdump gives you very strong tool to investigate +why oom happens. You can get snapshot. + + +percpu_pagelist_fraction +======================== + +This is the fraction of pages at most (high mark pcp->high) in each zone that +are allocated for each per cpu page list. The min value for this is 8. It +means that we don't allow more than 1/8th of pages in each zone to be +allocated in any single per_cpu_pagelist. This entry only changes the value +of hot per cpu pagelists. User can specify a number like 100 to allocate +1/100th of each zone to each per cpu page list. + +The batch value of each per cpu pagelist is also updated as a result. It is +set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) + +The initial value is zero. Kernel does not use this value at boot time to set +the high water marks for each per cpu page list. If the user writes '0' to this +sysctl, it will revert to this default behavior. + + +stat_interval +============= + +The time interval between which vm statistics are updated. The default +is 1 second. + + +stat_refresh +============ + +Any read or write (by root only) flushes all the per-cpu vm statistics +into their global totals, for more accurate reports when testing +e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo + +As a side-effect, it also checks for negative totals (elsewhere reported +as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. +(At time of writing, a few stats are known sometimes to be found negative, +with no ill effects: errors and warnings on these stats are suppressed.) + + +numa_stat +========= + +This interface allows runtime configuration of numa statistics. + +When page allocation performance becomes a bottleneck and you can tolerate +some possible tool breakage and decreased numa counter precision, you can +do:: + + echo 0 > /proc/sys/vm/numa_stat + +When page allocation performance is not a bottleneck and you want all +tooling to work, you can do:: + + echo 1 > /proc/sys/vm/numa_stat + + +swappiness +========== + +This control is used to define how aggressive the kernel will swap +memory pages. Higher values will increase aggressiveness, lower values +decrease the amount of swap. A value of 0 instructs the kernel not to +initiate swap until the amount of free and file-backed pages is less +than the high water mark in a zone. + +The default value is 60. + + +unprivileged_userfaultfd +======================== + +This flag controls whether unprivileged users can use the userfaultfd +system calls. Set this to 1 to allow unprivileged users to use the +userfaultfd system calls, or set this to 0 to restrict userfaultfd to only +privileged users (with SYS_CAP_PTRACE capability). + +The default value is 1. + + +user_reserve_kbytes +=================== + +When overcommit_memory is set to 2, "never overcommit" mode, reserve +min(3% of current process size, user_reserve_kbytes) of free memory. +This is intended to prevent a user from starting a single memory hogging +process, such that they cannot recover (kill the hog). + +user_reserve_kbytes defaults to min(3% of the current process size, 128MB). + +If this is reduced to zero, then the user will be allowed to allocate +all free memory with a single process, minus admin_reserve_kbytes. +Any subsequent attempts to execute a command will result in +"fork: Cannot allocate memory". + +Changing this takes effect whenever an application requests memory. + + +vfs_cache_pressure +================== + +This percentage value controls the tendency of the kernel to reclaim +the memory which is used for caching of directory and inode objects. + +At the default value of vfs_cache_pressure=100 the kernel will attempt to +reclaim dentries and inodes at a "fair" rate with respect to pagecache and +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer +to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will +never reclaim dentries and inodes due to memory pressure and this can easily +lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 +causes the kernel to prefer to reclaim dentries and inodes. + +Increasing vfs_cache_pressure significantly beyond 100 may have negative +performance impact. Reclaim code needs to take various locks to find freeable +directory and inode objects. With vfs_cache_pressure=1000, it will look for +ten times more freeable objects than there are. + + +watermark_boost_factor +====================== + +This factor controls the level of reclaim when memory is being fragmented. +It defines the percentage of the high watermark of a zone that will be +reclaimed if pages of different mobility are being mixed within pageblocks. +The intent is that compaction has less work to do in the future and to +increase the success rate of future high-order allocations such as SLUB +allocations, THP and hugetlbfs pages. + +To make it sensible with respect to the watermark_scale_factor +parameter, the unit is in fractions of 10,000. The default value of +15,000 on !DISCONTIGMEM configurations means that up to 150% of the high +watermark will be reclaimed in the event of a pageblock being mixed due +to fragmentation. The level of reclaim is determined by the number of +fragmentation events that occurred in the recent past. If this value is +smaller than a pageblock then a pageblocks worth of pages will be reclaimed +(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature. + + +watermark_scale_factor +====================== + +This factor controls the aggressiveness of kswapd. It defines the +amount of memory left in a node/system before kswapd is woken up and +how much memory needs to be free before kswapd goes back to sleep. + +The unit is in fractions of 10,000. The default value of 10 means the +distances between watermarks are 0.1% of the available memory in the +node/system. The maximum value is 1000, or 10% of memory. + +A high rate of threads entering direct reclaim (allocstall) or kswapd +going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate +that the number of free pages kswapd maintains for latency reasons is +too small for the allocation bursts occurring in the system. This knob +can then be used to tune kswapd aggressiveness accordingly. + + +zone_reclaim_mode +================= + +Zone_reclaim_mode allows someone to set more or less aggressive approaches to +reclaim memory when a zone runs out of memory. If it is set to zero then no +zone reclaim occurs. Allocations will be satisfied from other zones / nodes +in the system. + +This is value OR'ed together of + += =================================== +1 Zone reclaim on +2 Zone reclaim writes dirty pages out +4 Zone reclaim swaps pages += =================================== + +zone_reclaim_mode is disabled by default. For file servers or workloads +that benefit from having their data cached, zone_reclaim_mode should be +left disabled as the caching effect is likely to be more important than +data locality. + +zone_reclaim may be enabled if it's known that the workload is partitioned +such that each partition fits within a NUMA node and that accessing remote +memory would cause a measurable performance reduction. The page allocator +will then reclaim easily reusable pages (those page cache pages that are +currently not used) before allocating off node pages. + +Allowing zone reclaim to write out pages stops processes that are +writing large amounts of data from dirtying pages on other nodes. Zone +reclaim will write out dirty pages if a zone fills up and so effectively +throttle the process. This may decrease the performance of a single process +since it cannot use all of system memory to buffer the outgoing writes +anymore but it preserve the memory on other nodes so that the performance +of other processes running on other nodes will not be affected. + +Allowing regular swap effectively restricts allocations to the local +node unless explicitly overridden by memory policies or cpuset +configurations. -- cgit v1.2.3-55-g7522