History log of /linux-master/ipc/ipc_sysctl.c
Revision Date Author Comments
# 029c45bb 28-Mar-2024 Joel Granados <j.granados@samsung.com>

ipc: remove the now superfluous sentinel element from ctl_table array

This commit comes at the tail end of a greater effort to remove the empty
elements at the end of the ctl_table arrays (sentinels) which will reduce
the overall build time size of the kernel and run time memory bloat by ~64
bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)

Remove the sentinels from ipc_sysctls and mq_sysctls

Link: https://lkml.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-5-47c1463b3af2@samsung.com
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# 795f90c6 15-Mar-2024 Thomas Weißschuh <linux@weissschuh.net>

sysctl: treewide: constify argument ctl_table_root::permissions(table)

The permissions callback should not modify the ctl_table. Enforce this
expectation via the typesystem. This is a step to put "struct ctl_table"
into .rodata throughout the kernel.

The patch was created with the following coccinelle script:

@@
identifier func, head, ctl;
@@

int func(
struct ctl_table_header *head,
- struct ctl_table *ctl)
+ const struct ctl_table *ctl)
{ ... }

(insert_entry() from fs/proc/proc_sysctl.c is a false-positive)

No additional occurrences of '.permissions =' were found after a
tree-wide search for places missed by the conccinelle script.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>


# 520713a9 15-Mar-2024 Thomas Weißschuh <linux@weissschuh.net>

sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table)

Remove the 'table' argument from set_ownership as it is never used. This
change is a step towards putting "struct ctl_table" into .rodata and
eventually having sysctl core only use "const struct ctl_table".

The patch was created with the following coccinelle script:

@@
identifier func, head, table, uid, gid;
@@

void func(
struct ctl_table_header *head,
- struct ctl_table *table,
kuid_t *uid, kgid_t *gid)
{ ... }

No additional occurrences of 'set_ownership' were found after doing a
tree-wide search.

Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <j.granados@samsung.com>


# bfa858f2 18-Apr-2024 Thomas Weißschuh <linux@weissschuh.net>

sysctl: treewide: constify ctl_table_header::ctl_table_arg

To be able to constify instances of struct ctl_tables it is necessary to
remove ways through which non-const versions are exposed from the
sysctl core.
One of these is the ctl_table_arg member of struct ctl_table_header.

Constify this reference as a prerequisite for the full constification of
struct ctl_table instances.
No functional change.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>


# 8e882910 19-Feb-2024 Thomas Weißschuh <linux@weissschuh.net>

ipc: remove linebreaks from arguments of __register_sysctl_table

Calls to __register_sysctl_table will be validated by
scripts/check-sysctl-docs. As this script is line-based remove the
linebreak which would confuse the script.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>


# 50ec499b 15-Jan-2024 Alexey Gladkov <legion@kernel.org>

sysctl: allow change system v ipc sysctls inside ipc namespace

Patch series "Allow to change ipc/mq sysctls inside ipc namespace", v3.

Right now ipc and mq limits count as per ipc namespace, but only real root
can change them. By default, the current values of these limits are such
that it can only be reduced. Since only root can change the values, it is
impossible to reduce these limits in the rootless container.

We can allow limit changes within ipc namespace because mq parameters are
limited by RLIMIT_MSGQUEUE and ipc parameters are not limited to anything
other than cgroups.


This patch (of 3):

Rootless containers are not allowed to modify kernel IPC parameters.

All default limits are set to such high values that in fact there are no
limits at all. All limits are not inherited and are initialized to
default values when a new ipc_namespace is created.

For new ipc_namespace:

size_t ipc_ns.shm_ctlmax = SHMMAX; // (ULONG_MAX - (1UL << 24))
size_t ipc_ns.shm_ctlall = SHMALL; // (ULONG_MAX - (1UL << 24))
int ipc_ns.shm_ctlmni = IPCMNI; // (1 << 15)
int ipc_ns.shm_rmid_forced = 0;
unsigned int ipc_ns.msg_ctlmax = MSGMAX; // 8192
unsigned int ipc_ns.msg_ctlmni = MSGMNI; // 32000
unsigned int ipc_ns.msg_ctlmnb = MSGMNB; // 16384

The shm_tot (total amount of shared pages) has also ceased to be global,
it is located in ipc_namespace and is not inherited from anywhere.

In such conditions, it cannot be said that these limits limit anything.
The real limiter for them is cgroups.

If we allow rootless containers to change these parameters, then it can
only be reduced.

Link: https://lkml.kernel.org/r/cover.1705333426.git.legion@kernel.org
Link: https://lkml.kernel.org/r/d2f4603305cbfed58a24755aa61d027314b73a45.1705333426.git.legion@kernel.org
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Link: https://lkml.kernel.org/r/e2d84d3ec0172cfff759e6065da84ce0cc2736f8.1663756794.git.legion@kernel.org
Cc: Christian Brauner <brauner@kernel.org>
Cc: Joel Granados <joel.granados@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>


# bff97cf1 08-Aug-2023 Joel Granados <joel.granados@gmail.com>

sysctl: Add a size arg to __register_sysctl_table

We make these changes in order to prepare __register_sysctl_table and
its callers for when we remove the sentinel element (empty element at
the end of ctl_table arrays). We don't actually remove any sentinels in
this commit, but we *do* make sure to use ARRAY_SIZE so the table_size
is available when the removal occurs.

We add a table_size argument to __register_sysctl_table and adjust
callers, all of which pass ctl_table pointers and need an explicit call
to ARRAY_SIZE. We implement a size calculation in register_net_sysctl in
order to forward the size of the array pointer received from the network
register calls.

The new table_size argument does not yet have any effect in the
init_header call which is still dependent on the sentinel's presence.
table_size *does* however drive the `kzalloc` allocation in
__register_sysctl_table with no adverse effects as the allocated memory
is either one element greater than the calculated ctl_table array (for
the calls in ipc_sysctl.c, mq_sysctl.c and ucount.c) or the exact size
of the calculated ctl_table array (for the call from sysctl_net.c and
register_sysctl). This approach will allows us to "just" remove the
sentinel without further changes to __register_sysctl_table as
table_size will represent the exact size for all the callers at that
point.

Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>


# 38cd5b12 03-May-2022 Alexey Gladkov <legion@kernel.org>

ipc: Remove extra braces

Fix coding style. In the previous commit, I added braces because,
in addition to changing .data, .extra1 also changed. Now this is not
needed.

Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace")
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/37687827f630bc150210f5b8abeeb00f1336814e.1651584847.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 0889f44e 03-May-2022 Alexey Gladkov <legion@kernel.org>

ipc: Check permissions for checkpoint_restart sysctls at open time

As Eric Biederman pointed out, it is possible not to use a custom
proc_handler and check permissions for every write, but to use a
.permission handler. That will allow the checkpoint_restart sysctls to
perform all of their permission checks at open time, and not need any
other special code.

Link: https://lore.kernel.org/lkml/87czib9g38.fsf@email.froward.int.ebiederm.org/
Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/65fa8459803830608da4610a39f33c76aa933eb9.1651584847.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# dd141a49 03-May-2022 Alexey Gladkov <legion@kernel.org>

ipc: Remove extra1 field abuse to pass ipc namespace

Eric Biederman pointed out that using .extra1 to pass ipc namespace
looks like an ugly hack and there is a better solution. We can get the
ipc_namespace using the .data field.

Link: https://lore.kernel.org/lkml/87czib9g38.fsf@email.froward.int.ebiederm.org/
Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/93df64a8fe93ba20ebbe1d9f8eda484b2f325426.1651584847.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# def7343f 03-May-2022 Alexey Gladkov <legion@kernel.org>

ipc: Use the same namespace to modify and validate

In the 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace") I
missed that in addition to the modification of sem_ctls[3], the change
is validated. This validation must occur in the same namespace.

Link: https://lore.kernel.org/lkml/875ymnvryb.fsf@email.froward.int.ebiederm.org/
Fixes: 1f5c135ee509 ("ipc: Store ipc sysctls in the ipc namespace")
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/b3cb9a25cce6becbef77186bc1216071a08a969b.1651584847.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 1f5c135e 14-Feb-2022 Alexey Gladkov <legion@kernel.org>

ipc: Store ipc sysctls in the ipc namespace

The ipc sysctls are not available for modification inside the user
namespace. Following the mqueue sysctls, we changed the implementation
to be more userns friendly.

So far, the changes do not provide additional access to files. This
will be done in a future patch.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lkml.kernel.org/r/be6f9d014276f4dddd0c3aa05a86052856c1c555.1644862280.git.legion@kernel.org
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 0e9beb8a 08-Nov-2021 Manfred Spraul <manfred@colorfullife.com>

ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL

Compilation of ipc/ipc_sysctl.c is controlled by
obj-$(CONFIG_SYSVIPC_SYSCTL)
[see ipc/Makefile]

And CONFIG_SYSVIPC_SYSCTL depends on SYSCTL
[see init/Kconfig]

An SYSCTL is selected by PROC_SYSCTL.
[see fs/proc/Kconfig]

Thus: #ifndef CONFIG_PROC_SYSCTL in ipc/ipc_sysctl.c is impossible, the
fallback can be removed.

Link: https://lkml.kernel.org/r/20210918145337.3369-1-manfred@colorfullife.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5563cabd 08-Nov-2021 Michal Clapinski <mclapinski@google.com>

ipc: check checkpoint_restore_ns_capable() to modify C/R proc files

This commit removes the requirement to be root to modify sem_next_id,
msg_next_id and shm_next_id and checks checkpoint_restore_ns_capable
instead.

Since those files are specific to the IPC namespace, there is no reason
they should require root privileges. This is similar to ns_last_pid,
which also only checks checkpoint_restore_ns_capable.

[akpm@linux-foundation.org: ipc/ipc_sysctl.c needs capability.h for checkpoint_restore_ns_capable()]

Link: https://lkml.kernel.org/r/20210916163717.3179496-1-mclapinski@google.com
Signed-off-by: Michal Clapinski <mclapinski@google.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Manfred Spraul <manfred@colorfullife.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# fff1662c 04-Sep-2020 Tobias Klauser <tklauser@distanz.ch>

ipc: adjust proc_ipc_sem_dointvec definition to match prototype

Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
changed ctl_table.proc_handler to take a kernel pointer. Adjust the
signature of proc_ipc_sem_dointvec to match ctl_table.proc_handler which
fixes the following sparse error/warning:

ipc/ipc_sysctl.c:94:47: warning: incorrect type in argument 3 (different address spaces)
ipc/ipc_sysctl.c:94:47: expected void *buffer
ipc/ipc_sysctl.c:94:47: got void [noderef] __user *buffer
ipc/ipc_sysctl.c:194:35: warning: incorrect type in initializer (incompatible argument 3 (different address spaces))
ipc/ipc_sysctl.c:194:35: expected int ( [usertype] *proc_handler )( ... )
ipc/ipc_sysctl.c:194:35: got int ( * )( ... )

Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Link: https://lkml.kernel.org/r/20200825105846.5193-1-tklauser@distanz.ch
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 32927393 24-Apr-2020 Christoph Hellwig <hch@lst.de>

sysctl: pass kernel pointers to ->proc_handler

Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>


# eec4844f 18-Jul-2019 Matteo Croce <mcroce@redhat.com>

proc/sysctl: add shared variables for range check

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range. This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

$ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

# scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
Data old new delta
sysctl_vals - 12 +12
__kstrtab_sysctl_vals - 12 +12
max 14 10 -4
int_max 16 - -16
one 68 - -68
zero 128 28 -100
Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b886d83c 01-Jun-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


# 99db46ea 14-May-2019 Manfred Spraul <manfred@colorfullife.com>

ipc: do cyclic id allocation for the ipc object.

For ipcmni_extend mode, the sequence number space is only 7 bits. So
the chance of id reuse is relatively high compared with the non-extended
mode.

To alleviate this id reuse problem, this patch enables cyclic allocation
for the index to the radix tree (idx). The disadvantage is that this
can cause a slight slow-down of the fast path, as the radix tree could
be higher than necessary.

To limit the radix tree height, I have chosen the following limits:
1) The cycling is done over in_use*1.5.
2) At least, the cycling is done over
"normal" ipcnmi mode: RADIX_TREE_MAP_SIZE elements
"ipcmni_extended": 4096 elements

Result:
- for normal mode:
No change for <= 42 active ipc elements. With more than 42
active ipc elements, a 2nd level would be added to the radix
tree.
Without cyclic allocation, a 2nd level would be added only with
more than 63 active elements.

- for extended mode:
Cycling creates always at least a 2-level radix tree.
With more than 2730 active objects, a 3rd level would be
added, instead of > 4095 active objects until the 3rd level
is added without cyclic allocation.

For a 2-level radix tree compared to a 1-level radix tree, I have
observed < 1% performance impact.

Notes:
1) Normal "x=semget();y=semget();" is unaffected: Then the idx
is e.g. a and a+1, regardless if idr_alloc() or idr_alloc_cyclic()
is used.

2) The -1% happens in a microbenchmark after this situation:
x=semget();
for(i=0;i<4000;i++) {t=semget();semctl(t,0,IPC_RMID);}
y=semget();
Now perform semget calls on x and y that do not sleep.

3) The worst-case reuse cycle time is unfortunately unaffected:
If you have 2^24-1 ipc objects allocated, and get/remove the last
possible element in a loop, then the id is reused after 128
get/remove pairs.

Performance check:
A microbenchmark that performes no-op semop() randomly on two IDs,
with only these two IDs allocated.
The IDs were set using /proc/sys/kernel/sem_next_id.
The test was run 5 times, averages are shown.

1 & 2: Base (6.22 seconds for 10.000.000 semops)
1 & 40: -0.2%
1 & 3348: - 0.8%
1 & 27348: - 1.6%
1 & 15777204: - 3.2%

Or: ~12.6 cpu cycles per additional radix tree level.
The cpu is an Intel I3-5010U. ~1300 cpu cycles/syscall is slower
than what I remember (spectre impact?).

V2 of the patch:
- use "min" and "max"
- use RADIX_TREE_MAP_SIZE * RADIX_TREE_MAP_SIZE instead of
(2<<12).

[akpm@linux-foundation.org: fix max() warning]
Link: http://lkml.kernel.org/r/20190329204930.21620-3-longman@redhat.com
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 5ac893b8 14-May-2019 Waiman Long <longman@redhat.com>

ipc: allow boot time extension of IPCMNI from 32k to 16M

The maximum number of unique System V IPC identifiers was limited to
32k. That limit should be big enough for most use cases.

However, there are some users out there requesting for more, especially
those that are migrating from Solaris which uses 24 bits for unique
identifiers. To satisfy the need of those users, a new boot time kernel
option "ipcmni_extend" is added to extend the IPCMNI value to 16M. This
is a 512X increase which should be big enough for users out there that
need a large number of unique IPC identifier.

The use of this new option will change the pattern of the IPC
identifiers returned by functions like shmget(2). An application that
depends on such pattern may not work properly. So it should only be
used if the users really need more than 32k of unique IPC numbers.

This new option does have the side effect of reducing the maximum number
of unique sequence numbers from 64k down to 128. So it is a trade-off.

The computation of a new IPC id is not done in the performance critical
path. So a little bit of additional overhead shouldn't have any real
performance impact.

Link: http://lkml.kernel.org/r/20190329204930.21620-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 8c81ddd2 30-Oct-2018 Waiman Long <longman@redhat.com>

ipc: IPCMNI limit check for semmni

For SysV semaphores, the semmni value is the last part of the 4-element
sem number array. To make semmni behave in a similar way to msgmni and
shmmni, we can't directly use the _minmax handler. Instead, a special sem
specific handler is added to check the last argument to make sure that it
is limited to the [0, IPCMNI] range. An error will be returned if this is
not the case.

Link: http://lkml.kernel.org/r/1536352137-12003-3-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6730e658 30-Oct-2018 Waiman Long <longman@redhat.com>

ipc: IPCMNI limit check for msgmni and shmmni

Patch series "ipc: IPCMNI limit check for *mni & increase that limit", v9.

The sysctl parameters msgmni, shmmni and semmni have an inherent limit of
IPC_MNI (32k). However, users may not be aware of that because they can
write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.

The real IPCMNI limit is now enforced to make sure that users won't put in
an unrealistic value. The first 2 patches enforce the limits.

There are also users out there requesting increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 8M if the users
really want the extended value.

This patch (of 4):

A user can write arbitrary integer values to msgmni and shmmni sysctl
parameters without getting error, but the actual limit is really IPCMNI
(32k). This can mislead users as they think they can get a value that is
not real.

The right limits are now set for msgmni and shmmni so that the users will
become aware if they set a value outside of the acceptable range.

Link: http://lkml.kernel.org/r/1536352137-12003-2-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Luis R. Rodriguez <mcgrof@kernel.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0050ee05 12-Dec-2014 Manfred Spraul <manfred@colorfullife.com>

ipc/msg: increase MSGMNI, remove scaling

SysV can be abused to allocate locked kernel memory. For most systems, a
small limit doesn't make sense, see the discussion with regards to SHMMAX.

Therefore: increase MSGMNI to the maximum supported.

And: If we ignore the risk of locking too much memory, then an automatic
scaling of MSGMNI doesn't make sense. Therefore the logic can be removed.

The code preserves auto_msgmni to avoid breaking any user space applications
that expect that the value exists.

Notes:
1) If an administrator must limit the memory allocations, then he can set
MSGMNI as necessary.

Or he can disable sysv entirely (as e.g. done by Android).

2) MSGMAX and MSGMNB are intentionally not increased, as these values are used
to control latency vs. throughput:
If MSGMNB is large, then msgsnd() just returns and more messages can be queued
before a task switch to a task that calls msgrcv() is forced.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 1195d94e 13-Oct-2014 Andrey Vagin <avagin@openvz.org>

ipc: always handle a new value of auto_msgmni

proc_dointvec_minmax() returns zero if a new value has been set. So we
don't need to check all charecters have been handled.

Below you can find two examples. In the new value has not been handled
properly.

$ strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n\0", 3) = 2
close(3) = 0
exit_group(0)
$ cat /sys/kernel/debug/tracing/trace

$strace ./a.out
open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3
write(3, "0\n", 2) = 2
close(3) = 0

$ cat /sys/kernel/debug/tracing/trace
a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax

Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin")
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Joe Perches <joe@perches.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5c5928b 06-Jun-2014 Joe Perches <joe@perches.com>

ipc: convert use of typedef ctl_table to struct ctl_table

This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6d08a256 07-Apr-2014 Davidlohr Bueso <davidlohr@hp.com>

ipc: use device_initcall

... since __initcall is now deprecated.

Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 239521f3 27-Jan-2014 Manfred Spraul <manfred@colorfullife.com>

ipc: whitespace cleanup

The ipc code does not adhere the typical linux coding style.
This patch fixes lots of simple whitespace errors.

- mostly autogenerated by
scripts/checkpatch.pl -f --fix \
--types=pointer_location,spacing,space_before_tab
- one manual fixup (keep structure members tab-aligned)
- removal of additional space_before_tab that were not found by --fix

Tested with some of my msg and sem test apps.

Andrew: Could you include it in -mm and move it towards Linus' tree?

Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Suggested-by: Li Bin <huawei.libin@huawei.com>
Cc: Joe Perches <joe@perches.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Davidlohr Bueso <davidlohr@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9bf76ca3 02-Nov-2013 Mathias Krause <minipli@googlemail.com>

ipc, msg: forbid negative values for "msg{max,mnb,mni}"

Negative message lengths make no sense -- so don't do negative queue
lenghts or identifier counts. Prevent them from getting negative.

Also change the underlying data types to be unsigned to avoid hairy
surprises with sign extensions in cases where those variables get
evaluated in unsigned expressions with bigger data types, e.g size_t.

In case a user still wants to have "unlimited" sizes she could just use
INT_MAX instead.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 03f59566 04-Jan-2013 Stanislav Kinsbursky <skinsbursky@parallels.com>

ipc: add sysctl to specify desired next object id

Add 3 new variables and sysctls to tune them (by one "next_id" variable
for messages, semaphores and shared memory respectively). This variable
can be used to set desired id for next allocated IPC object. By default
it's equal to -1 and old behaviour is preserved. If this variable is
non-negative, then desired idr will be extracted from it and used as a
start value to search for free IDR slot.

Notes:

1) this patch doesn't guarantee that the new object will have desired
id. So it's up to user space how to handle new object with wrong id.

2) After a sucessful id allocation attempt, "next_id" will be set back
to -1 (if it was non-negative).

[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# b34a6b1d 26-Jul-2011 Vasiliy Kulikov <segoon@openwall.com>

ipc: introduce shm_rmid_forced sysctl

Add support for the shm_rmid_forced sysctl. If set to 1, all shared
memory objects in current ipc namespace will be automatically forced to
use IPC_RMID.

The POSIX way of handling shmem allows one to create shm objects and
call shmdt(), leaving shm object associated with no process, thus
consuming memory not counted via rlimits.

With shm_rmid_forced=1 the shared memory object is counted at least for
one process, so OOM killer may effectively kill the fat process holding
the shared memory.

It obviously breaks POSIX - some programs relying on the feature would
stop working. So set shm_rmid_forced=1 only if you're sure nobody uses
"orphaned" memory. Use shm_rmid_forced=0 by default for compatability
reasons.

The feature was previously impemented in -ow as a configure option.

[akpm@linux-foundation.org: fix documentation, per Randy]
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: readability/conventionality tweaks]
[akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 2bc4657c 03-Apr-2009 Eric W. Biederman <ebiederm@xmission.com>

sysctl ipc: Remove dead binary sysctl support code.

Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name
and .strategy members of sysctl tables are dead code. Remove them.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>


# 8d65af78 23-Sep-2009 Alexey Dobriyan <adobriyan@gmail.com>

sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 11dea190 02-Apr-2009 Serge E. Hallyn <serue@us.ibm.com>

proc_sysctl: use CONFIG_PROC_SYSCTL around ipc and utsname proc_handlers

As pointed out by Cedric Le Goater (in response to Alexey's original
comment wrt mqns), ipc_sysctl.c and utsname_sysctl.c are using
CONFIG_PROC_FS, not CONFIG_PROC_SYSCTL, to determine whether to define
the proc_handlers. Change that.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Acked-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 4c2c3b4a 06-Jan-2009 Andrew Morton <akpm@linux-foundation.org>

ipc/ipc_sysctl.c: move the definition of ipc_auto_callback()

proc_ipcauto_dointvec_minmax() is the only user of ipc_auto_callback(),
since the former function is protected by CONFIG_PROC_FS, so should be the
latter one.

Just move its definition down.

Signed-off-by: WANG Cong <wangcong@zeuux.org>
Cc: Eric Biederman <ebiederm@xmision.com>
Cc: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# f221e726 15-Oct-2008 Alexey Dobriyan <adobriyan@gmail.com>

sysctl: simplify ->strategy

name and nlen parameters passed to ->strategy hook are unused, remove
them. In general ->strategy hook should know what it's doing, and don't
do something tricky for which, say, pointer to original userspace array
may be needed (name).

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net> [ networking bits ]
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 9eefe520 25-Jul-2008 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: do not use a negative value to re-enable msgmni automatic recomputing

This patch proposes an alternative to the "magical
positive-versus-negative number trick" Andrew complained about last week
in http://lkml.org/lkml/2008/6/24/418.

This had been introduced with the patches that scale msgmni to the amount
of lowmem. With these patches, msgmni has a registered notification
routine that recomputes msgmni value upon memory add/remove or ipc
namespace creation/ removal.

When msgmni is changed from user space (i.e. value written to the proc
file), that notification routine is unregistered, and the way to make it
registered back is to write a negative value into the proc file. This is
the "magical positive-versus-negative number trick".

To fix this, a new proc file is introduced: /proc/sys/kernel/auto_msgmni.
This file acts as ON/OFF for msgmni automatic recomputing.

With this patch, the process is the following:
1) kernel boots in "automatic recomputing mode"
/proc/sys/kernel/msgmni contains the value that has been computed (depends
on lowmem)
/proc/sys/kernel/automatic_msgmni contains "1"

2) echo <val> > /proc/sys/kernel/msgmni
. sets msg_ctlmni to <val>
. de-activates automatic recomputing (i.e. if, say, some memory is added
msgmni won't be recomputed anymore)
. /proc/sys/kernel/automatic_msgmni now contains "0"

3) echo "0" > /proc/sys/kernel/automatic_msgmni
. de-activates msgmni automatic recomputing
this has the same effect as 2) except that msg_ctlmni's value stays
blocked at its current value)

3) echo "1" > /proc/sys/kernel/automatic_msgmni
. recomputes msgmni's value based on the current available memory size
and number of ipc namespaces
. re-activates automatic recomputing for msgmni.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Solofo Ramangalahy <Solofo.Ramangalahy@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 6546bc42 29-Apr-2008 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: re-enable msgmni automatic recomputing msgmni if set to negative

The enhancement as asked for by Yasunori: if msgmni is set to a negative
value, register it back into the ipcns notifier chain.

A new interface has been added to the notification mechanism:
notifier_chain_cond_register() registers a notifier block only if not already
registered. With that new interface we avoid taking care of the states
changes in procfs.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 91cfb2b4 29-Apr-2008 Nadia Derbey <Nadia.Derbey@bull.net>

ipc: do not recompute msgmni anymore if explicitly set by user

Make msgmni not recomputed anymore upon ipc namespace creation / removal or
memory add/remove, as soon as it has been set from userland.

As soon as msgmni is explicitly set via procfs or sysctl(), the associated
callback routine is unregistered from the ipc namespace notifier chain.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# ae5e1b22 08-Feb-2008 Pavel Emelyanov <xemul@openvz.org>

namespaces: move the IPC namespace under IPC_NS option

Currently the IPC namespace management code is spread over the ipc/*.c files.
I moved this code into ipc/namespace.c file which is compiled out when needed.

The linux/ipc_namespace.h file is used to store the prototypes of the
functions in namespace.c and the stubs for NAMESPACES=n case. This is done
so, because the stub for copy_ipc_namespace requires the knowledge of the
CLONE_NEWIPC flag, which is in sched.h. But the linux/ipc.h file itself in
included into many many .c files via the sys.h->sem.h sequence so adding the
sched.h into it will make all these .c depend on sched.h which is not that
good. On the other hand the knowledge about the namespaces stuff is required
in 4 .c files only.

Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
msg.c and shm.c files. It turned out that moving these functions into
namespaces.c is not that easy because they use many other calls and macros
from the original file. Moving them would make this patch complicated. On
the other hand all these functions can be consolidated, so I will send a
separate patch doing this a bit later.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a08b4be7 17-Oct-2007 Cedric Le Goater <clg@fr.ibm.com>

ipc namespace: remove config ipc ns fix

Finish the work : kill all #ifdef CONFIG_IPC_NS.

Thanks Robert !

Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
Cc: Eric Biederman <ebiederm@xmision.com>
Cc: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 0b4d4147 14-Feb-2007 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] sysctl: remove insert_at_head from register_sysctl

The semantic effect of insert_at_head is that it would allow new registered
sysctl entries to override existing sysctl entries of the same name. Which is
pain for caching and the proc interface never implemented.

I have done an audit and discovered that none of the current users of
register_sysctl care as (excpet for directories) they do not register
duplicate sysctl entries.

So this patch simply removes the support for overriding existing entries in
the sys_sysctl interface since no one uses it or cares and it makes future
enhancments harder.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Corey Minyard <minyard@acm.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jan Kara <jack@ucw.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# a5494dcd 14-Feb-2007 Eric W. Biederman <ebiederm@xmission.com>

[PATCH] sysctl: move SYSV IPC sysctls to their own file

This is just a simple cleanup to keep kernel/sysctl.c from getting to crowded
with special cases, and by keeping all of the ipc logic to together it makes
the code a little more readable.

[gcoady.lk@gmail.com: build fix]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Signed-off-by: Grant Coady <gcoady.lk@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>