Cross Reference: /linux-master/kernel/signal.c

History log of /linux-master/kernel/signal.c
Revision	Date	Author	Comments
# 240a1853	15-Mar-2024	Mike Christie <michael.christie@oracle.com>	kernel: Remove signal hacks for vhost_tasks This removes the signal/coredump hacks added for vhost_tasks in: Commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") When that patch was added vhost_tasks did not handle SIGKILL and would try to ignore/clear the signal and continue on until the device's close function was called. In the previous patches vhost_tasks and the vhost drivers were converted to support SIGKILL by cleaning themselves up and exiting. The hacks are no longer needed so this removes them. Signed-off-by: Mike Christie <michael.christie@oracle.com> Message-Id: <20240316004707.45557-10-michael.christie@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# 11a92190	27-Jun-2023	Joel Granados <j.granados@samsung.com>	kernel misc: Remove the now superfluous sentinel elements from ctl_table array This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove the sentinel from ctl_table arrays. Reduce by one the values used to compare the size of the adjusted arrays. Signed-off-by: Joel Granados <j.granados@samsung.com>
# a436184e	26-Feb-2024	Oleg Nesterov <oleg@redhat.com>	get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task This initialization is incomplete and unnecessary, neither do_group_exit() nor PF_USER_WORKER need ksig->info. Link: https://lkml.kernel.org/r/20240226165653.GA20834@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Wen Yang <wenyang.linux@foxmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# dd69edd6	26-Feb-2024	Oleg Nesterov <oleg@redhat.com>	get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig ksig->ka and ksig->info are not initialized if get_signal() returns 0 or if the caller is PF_USER_WORKER. Check signr != 0 before SA_EXPOSE_TAGBITS and move the "out" label down. The latter means that ksig->sig won't be initialized if a PF_USER_WORKER thread gets a fatal signal but this is fine, PF_USER_WORKER's don't use ksig. And there is nothing new, in this case ksig->ka and ksig-info are not initialized anyway. Add a comment. Link: https://lkml.kernel.org/r/20240226165650.GA20829@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Wen Yang <wenyang.linux@foxmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 49fd5f5a	26-Feb-2024	Oleg Nesterov <oleg@redhat.com>	get_signal: don't abuse ksig->info.si_signo and ksig->sig Patch series "get_signal: minor cleanups and fix". Lets remove this clear_siginfo() right now. It is incomplete (and thus looks confusing) and unnecessary. Also, PF_USER_WORKER's already don't get a fully initialized ksig anyway. This patch (of 3): Cleanup and preparation for the next changes. get_signal() uses signr or ksig->info.si_signo or ksig->sig in a chaotic way, this looks confusing. Change it to always use signr. Link: https://lkml.kernel.org/r/20240226165612.GA20787@redhat.com Link: https://lkml.kernel.org/r/20240226165647.GA20826@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Wen Yang <wenyang.linux@foxmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# e1fb1dc0	09-Feb-2024	Christian Brauner <brauner@kernel.org>	pidfd: allow to override signal scope in pidfd_send_signal() Right now we determine the scope of the signal based on the type of pidfd. There are use-cases where it's useful to override the scope of the signal. For example in [1]. Add flags to determine the scope of the signal: (1) PIDFD_SIGNAL_THREAD: send signal to specific thread reference by @pidfd (2) PIDFD_SIGNAL_THREAD_GROUP: send signal to thread-group of @pidfd (2) PIDFD_SIGNAL_PROCESS_GROUP: send signal to process-group of @pidfd Since we now allow specifying PIDFD_SEND_PROCESS_GROUP for pidfd_send_signal() to send signals to process groups we need to adjust the check restricting si_code emulation by userspace to account for PIDTYPE_PGID. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://github.com/systemd/systemd/issues/31093 [1] Link: https://lore.kernel.org/r/20240210-chihuahua-hinzog-3945b6abd44a@brauner Link: https://lore.kernel.org/r/20240214123655.GB16265@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
# 81b9d8ac	09-Feb-2024	Oleg Nesterov <oleg@redhat.com>	pidfd: change pidfd_send_signal() to respect PIDFD_THREAD Turn kill_pid_info() into kill_pid_info_type(), this allows to pass any pid_type to group_send_sig_info(), despite its name it should work fine even if type = PIDTYPE_PID. Change pidfd_send_signal() to use PIDTYPE_PID or PIDTYPE_TGID depending on PIDFD_THREAD. While at it kill another TODO comment in pidfd_show_fdinfo(). As Christian expains fdinfo reports f_flags, userspace can already detect PIDFD_THREAD. Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240209130650.GA8048@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
# c044a950	09-Feb-2024	Oleg Nesterov <oleg@redhat.com>	signal: fill in si_code in prepare_kill_siginfo() So that do_tkill() can use this helper too. This also simplifies the next patch. TODO: perhaps we can kill prepare_kill_siginfo() and change the callers to use SEND_SIG_NOINFO, but this needs some changes in __send_signal_locked() and TP_STORE_SIGINFO(). Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240209130620.GA8039@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>
# 9ed52108	05-Feb-2024	Oleg Nesterov <oleg@redhat.com>	pidfd: change do_notify_pidfd() to use __wake_up(poll_to_key(EPOLLIN)) rather than wake_up_all(). This way do_notify_pidfd() won't wakeup the POLLHUP-only waiters which wait for pid_task() == NULL. TODO: - as Christian pointed out, this asks for the new wake_up_all_poll() helper, it can already have other users. - we can probably discriminate the PIDFD_THREAD and non-PIDFD_THREAD waiters, but this needs more work. See https://lore.kernel.org/all/20240205140848.GA15853@redhat.com/ Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240205141348.GA16539@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
# 64bef697	31-Jan-2024	Oleg Nesterov <oleg@redhat.com>	pidfd: implement PIDFD_THREAD flag for pidfd_open() With this flag: - pidfd_open() doesn't require that the target task must be a thread-group leader - pidfd_poll() succeeds when the task exits and becomes a zombie (iow, passes exit_notify()), even if it is a leader and thread-group is not empty. This means that the behaviour of pidfd_poll(PIDFD_THREAD, pid-of-group-leader) is not well defined if it races with exec() from its sub-thread; pidfd_poll() can succeed or not depending on whether pidfd_task_exited() is called before or after exchange_tids(). Perhaps we can improve this behaviour later, pidfd_poll() can probably take sig->group_exec_task into account. But this doesn't really differ from the case when the leader exits before other threads (so pidfd_poll() succeeds) and then another thread execs and pidfd_poll() will block again. thread_group_exited() is no longer used, perhaps it can die. Co-developed-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240131132602.GA23641@redhat.com Tested-by: Tycho Andersen <tandersen@netflix.com> Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
# 21e25205	27-Jan-2024	Oleg Nesterov <oleg@redhat.com>	pidfd: don't do_notify_pidfd() if !thread_group_empty() do_notify_pidfd() makes no sense until the whole thread group exits, change do_notify_parent() to check thread_group_empty(). This avoids the unnecessary do_notify_pidfd() when tsk is not a leader, or it exits before other threads, or it has a ptraced EXIT_ZOMBIE sub-thread. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240127132407.GA29136@redhat.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
# b454ec29	20-Nov-2023	Oleg Nesterov <oleg@redhat.com>	kernel/signal.c: simplify force_sig_info_to_task(), kill recalc_sigpending_and_wake() The purpose of recalc_sigpending_and_wake() is not clear, it looks "obviously unneeded" because we are going to send the signal which can't be blocked or ignored. Add the comment to explain why we can't rely on send_signal_locked() and make this logic more simple/explicit. recalc_sigpending_and_wake() has no other users, it can die. In fact I think we don't even need signal_wake_up(), the target task must be either current or a TASK_TRACED child, otherwise the usage of siglock is not safe. But this needs another change. Link: https://lkml.kernel.org/r/20231120151649.GA15995@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 61a7a5e2	30-Oct-2023	Oleg Nesterov <oleg@redhat.com>	introduce for_other_threads(p, t) Cosmetic, but imho it makes the usage look more clear and simple, the new helper doesn't require to initialize "t". After this change while_each_thread() has only 3 users, and it is only used in the do/while loops. Link: https://lkml.kernel.org/r/20231030155710.GA9095@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# a287116a	25-Sep-2023	Li kunyu <kunyu@nfschina.com>	kernel/signal: remove unnecessary NULL values from ucounts ucounts is assigned first, so it does not need to initialize the assignment. Link: https://lkml.kernel.org/r/20230926022410.4280-1-kunyu@nfschina.com Signed-off-by: Li kunyu <kunyu@nfschina.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# e5ecf29c	09-Sep-2023	Oleg Nesterov <oleg@redhat.com>	signal: complete_signal: use __for_each_thread() do/while_each_thread should be avoided when possible. Link: https://lkml.kernel.org/r/20230909164537.GA11633@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 39835204	23-Aug-2023	Oleg Nesterov <oleg@redhat.com>	__kill_pgrp_info: simplify the calculation of return value No need to calculate/check the "success" variable, we can kill it and update retval in the main loop unless it is zero. Link: https://lkml.kernel.org/r/20230823171455.GA12188@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: David Laight <David.Laight@ACULAB.COM> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# f5e83688	13-Jan-2023	Ard Biesheuvel <ardb@kernel.org>	kernel: Drop IA64 support from sig_fault handlers Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
# 1aabbc53	02-Aug-2023	Sebastian Andrzej Siewior <bigeasy@linutronix.de>	signal: Don't disable preemption in ptrace_stop() on PREEMPT_RT On PREEMPT_RT keeping preemption disabled during the invocation of cgroup_enter_frozen() is a problem because the function acquires css_set_lock which is a sleeping lock on PREEMPT_RT and must not be acquired with disabled preemption. The preempt-disabled section is only for performance optimisation reasons and can be avoided. Extend the comment and don't disable preemption before scheduling on PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230803100932.325870-3-bigeasy@linutronix.de
# a20d6f63	02-Aug-2023	Sebastian Andrzej Siewior <bigeasy@linutronix.de>	signal: Add a proper comment about preempt_disable() in ptrace_stop() Commit 53da1d9456fe7 ("fix ptrace slowness") added a preempt-disable section between read_unlock() and the following schedule() invocation without explaining why it is needed. Replace the existing contentless comment with a proper explanation to clarify that it is not needed for correctness but for performance reasons. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20230803100932.325870-2-bigeasy@linutronix.de
# b0b88e02	07-Jul-2023	Vincent Whitchurch <vincent.whitchurch@axis.com>	signal: print comm and exe name on fatal signals Make the print-fatal-signals message more useful by printing the comm and the exe name for the process which received the fatal signal: Before: potentially unexpected fatal signal 4 potentially unexpected fatal signal 11 After: buggy-program: pool: potentially unexpected fatal signal 4 some-daemon: gdbus: potentially unexpected fatal signal 11 comm used to be present but was removed in commit 681a90ffe829b8ee25d ("arc, print-fatal-signals: reduce duplicated information") because it's also included as part of the later stack trace. Having the comm as part of the main "unexpected fatal..." print is rather useful though when analysing logs, and the exe name is also valuable as shown in the examples above where the comm ends up having some generic name like "pool". [akpm@linux-foundation.org: don't include linux/file.h twice] Link: https://lkml.kernel.org/r/20230707-fatal-comm-v1-1-400363905d5e@axis.com Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vineet Gupta <vgupta@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# 5f0bc0b0	25-Jul-2023	Linus Torvalds <torvalds@linux-foundation.org>	mm: suppress mm fault logging if fatal signal already pending Commit eda0047296a1 ("mm: make the page fault mmap locking killable") intentionally made it much easier to trigger the "page fault fails because a fatal signal is pending" situation, by having the mmap locking fail early in that case. We have long aborted page faults in other fatal cases when the actual IO for a page is interrupted by SIGKILL - which is particularly useful for the traditional case of NFS hanging due to network issues, but local filesystems could cause it too if you happened to get the SIGKILL while waiting for a page to be faulted in (eg lock_folio_maybe_drop_mmap()). So aborting the page fault wasn't a new condition - but it now triggers earlier, before we even get to 'handle_mm_fault()'. And as a result the error doesn't go through our 'fault_signal_pending()' logic, and doesn't get filtered away there. Normally you'd never even notice, because if a fatal signal is pending, the new SIGSEGV we send ends up being ignored anyway. But it turns out that there is one very noticeable exception: if you enable 'show_unhandled_signals', the aborted page fault will be logged in the kernel messages, and you'll get a scary line looking something like this in your logs: pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0) which is rather misleading. It's not really a segfault at all, it's just "the thread was killed before the page fault completed, so we aborted the page fault". Fix this by just making it clear that a pending fatal signal means that any new signal coming in after that is implicitly handled. This will avoid the misleading logging, since now the signal isn't 'unhandled' any more. Reported-and-tested-by: Fiona Ebner <f.ebner@proxmox.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Link: https://lore.kernel.org/lkml/8d063a26-43f5-0bb7-3203-c6a04dc159f8@proxmox.com/ Acked-by: Oleg Nesterov <oleg@redhat.com> Fixes: eda0047296a1 ("mm: make the page fault mmap locking killable") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# f9010dbd	01-Jun-2023	Mike Christie <michael.christie@oracle.com>	fork, vhost: Use CLONE_THREAD to fix freezer/ps regression When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps or ps a would not incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's didn't disable or add support for them. To fix both bugs, this switches the vhost task to be thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND which requires those 2 signals to be supported. This is a modified version of the patch written by Mike Christie <michael.christie@oracle.com> which was a modified version of patch originally written by Linus. Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER. Including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit. Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c. Making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done. The main loop has been updated to call get_signal which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending() so clear that explicitly so the schedule() sleeps. For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file. To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec. The coredump code and de_thread have been modified to ignore vhost threads. Remvoing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running. Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Co-developed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 01e6aac7	18-May-2023	Luis Chamberlain <mcgrof@kernel.org>	signal: move show_unhandled_signals sysctl to its own file The show_unhandled_signals sysctl is the only sysctl for debug left on kernel/sysctl.c. We've been moving the syctls out from kernel/sysctl.c so to help avoid merge conflicts as the shared array gets out of hand. This change incurs simplifies sysctl registration by localizing it where it should go for a penalty in size of increasing the kernel by 23 bytes, we accept this given recent cleanups have actually already saved us 1465 bytes in the prior commits. ./scripts/bloat-o-meter vmlinux.3-remove-dev-table vmlinux.4-remove-debug-table add/remove: 3/1 grow/shrink: 0/1 up/down: 177/-154 (23) Function old new delta signal_debug_table - 128 +128 init_signal_sysctls - 33 +33 __pfx_init_signal_sysctls - 16 +16 sysctl_init_bases 85 59 -26 debug_table 128 - -128 Total: Before=21256967, After=21256990, chg +0.00% Reviewed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
# bcb7ee79	16-Mar-2023	Dmitry Vyukov <dvyukov@google.com>	posix-timers: Prefer delivery of signals to the current thread POSIX timers using the CLOCK_PROCESS_CPUTIME_ID clock prefer the main thread of a thread group for signal delivery. However, this has a significant downside: it requires waking up a potentially idle thread. Instead, prefer to deliver signals to the current thread (in the same thread group) if SIGEV_THREAD_ID is not set by the user. This does not change guaranteed semantics, since POSIX process CPU time timers have never guaranteed that signal delivery is to a specific thread (without SIGEV_THREAD_ID set). The effect is that queueing the signal no longer wakes up potentially idle threads, and the kernel is no longer biased towards delivering the timer signal to any particular thread (which better distributes the timer signals esp. when multiple timers fire concurrently). Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20230316123028.2890338-1-elver@google.com
# af7f588d	22-Nov-2022	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>	sched: Introduce per-memory-map concurrency ID This feature allows the scheduler to expose a per-memory map concurrency ID to user-space. This concurrency ID is within the possible cpus range, and is temporarily (and uniquely) assigned while threads are actively running within a memory map. If a memory map has fewer threads than cores, or is limited to run on few cores concurrently through sched affinity or cgroup cpusets, the concurrency IDs will be values close to 0, thus allowing efficient use of user-space memory for per-cpu data structures. This feature is meant to be exposed by a new rseq thread area field. The primary purpose of this feature is to do the heavy-lifting needed by memory allocators to allow them to use per-cpu data structures efficiently in the following situations: - Single-threaded applications, - Multi-threaded applications on large systems (many cores) with limited cpu affinity mask, - Multi-threaded applications on large systems (many cores) with restricted cgroup cpuset per container. One of the key concern from scheduler maintainers is the overhead associated with additional spin locks or atomic operations in the scheduler fast-path. This is why the following optimization is implemented. On context switch between threads belonging to the same memory map, transfer the mm_cid from prev to next without any atomic ops. This takes care of use-cases involving frequent context switch between threads belonging to the same memory map. Additional optimizations can be done if the spin locks added when context switching between threads belonging to different memory maps end up being a performance bottleneck. Those are left out of this patch though. A performance impact would have to be clearly demonstrated to justify the added complexity. The credit goes to Paul Turner (Google) for the original virtual cpu id idea. This feature is implemented based on the discussions with Paul Turner and Peter Oskolkov (Google), but I took the liberty to implement scheduler fast-path optimizations and my own NUMA-awareness scheme. The rumor has it that Google have been running a rseq vcpu_id extension internally in production for a year. The tcmalloc source code indeed has comments hinting at a vcpu_id prototype extension to the rseq system call [1]. The following benchmarks do not show any significant overhead added to the scheduler context switch by this feature: * perf bench sched messaging (process) Baseline: 86.5±0.3 ms With mm_cid: 86.7±2.6 ms * perf bench sched messaging (threaded) Baseline: 84.3±3.0 ms With mm_cid: 84.7±2.6 ms * hackbench (process) Baseline: 82.9±2.7 ms With mm_cid: 82.9±2.9 ms * hackbench (threaded) Baseline: 85.2±2.6 ms With mm_cid: 84.4±2.9 ms [1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26 Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com
# 3a017d63	27-Nov-2022	haifeng.xu <haifeng.xu@shopee.com>	signal: Initialize the info in ksignal When handing the SIGNAL_GROUP_EXIT flag, the info in ksignal isn't cleared. However, the info acquired by dequeue_synchronous_signal/dequeue_signal is initialized and can be safely used. Fortunately, the fatal signal process just uses the si_signo and doesn't use any other member. Even so, the initialization before use is more safer. Signed-off-by: haifeng.xu <haifeng.xu@shopee.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221128065606.19570-1-haifeng.xu@shopee.com
# 6a542d1d	07-Jun-2020	Al Viro <viro@zeniv.linux.org.uk>	kill signal_pt_regs() Once upon at it was used on hot paths, but that had not been true since 2013. IOW, there's no point for arch-optimized equivalent of task_pt_regs(current) - remaining two users are not worth bothering with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
# f5d39b02	22-Aug-2022	Peter Zijlstra <peterz@infradead.org>	freezer,sched: Rewrite core freezer logic Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states can be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
# 9a95f78e	08-Jan-2022	Eric W. Biederman <ebiederm@xmission.com>	signal: Drop signals received after a fatal signal has been processed In 403bad72b67d ("coredump: only SIGKILL should interrupt the coredumping task") Oleg modified the kernel to drop all signals that come in during a coredump except SIGKILL, and suggested that it might be a good idea to generalize that to other cases after the process has received a fatal signal. Semantically it does not make sense to perform any signal delivery after the process has already been killed. When a signal is sent while a process is dying today the signal is placed in the signal queue by __send_signal and a single task of the process is woken up with signal_wake_up, if there are any tasks that have not set PF_EXITING. Take things one step farther and have prepare_signal report that all signals that come after a process has been killed should be ignored. While retaining the historical exception of allowing SIGKILL to interrupt coredumps. Update the comment in fs/coredump.c to make it clear coredumps are special in being able to receive SIGKILL. This changes things so that a process stopped in PTRACE_EVENT_EXIT can not be made to escape it's ptracer and finish exiting by sending it SIGKILL. That a process can be made to leave PTRACE_EVENT_EXIT and escape it's tracer by sending the process a SIGKILL has been complicating tracer's for no apparent advantage. If the process needs to be made to leave PTRACE_EVENT_EXIT all that needs to happen is to kill the proceses's tracer. This differs from the coredump code where there is no other mechanism besides honoring SIGKILL to expedite the end of coredumping. Link: https://lkml.kernel.org/r/875yksd4s9.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
# a382f8fe	06-Jul-2022	Linus Torvalds <torvalds@linux-foundation.org>	signal handling: don't use BUG_ON() for debugging These are indeed "should not happen" situations, but it turns out recent changes made the 'task_is_stopped_or_trace()' case trigger (fix for that exists, is pending more testing), and the BUG_ON() makes it unnecessarily hard to actually debug for no good reason. It's been that way for a long time, but let's make it clear: BUG_ON() is not good for debugging, and should never be used in situations where you could just say "this shouldn't happen, but we can continue". Use WARN_ON_ONCE() instead to make sure it gets logged, and then just continue running. Instead of making the system basically unusuable because you crashed the machine while potentially holding some very core locks (eg this function is commonly called while holding 'tasklist_lock' for writing). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
# 31cae1ea	03-May-2022	Peter Zijlstra <peterz@infradead.org>	sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state Currently ptrace_stop() / do_signal_stop() rely on the special states TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this state exists only in task->__state and nowhere else. There's two spots of bother with this: - PREEMPT_RT has task->saved_state which complicates matters, meaning task_is_{traced,stopped}() needs to check an additional variable. - An alternative freezer implementation that itself relies on a special TASK state would loose TASK_TRACED/TASK_STOPPED and will result in misbehaviour. As such, add additional state to task->jobctl to track this state outside of task->__state. NOTE: this doesn't actually fix anything yet, just adds extra state. --EWB * didn't add a unnecessary newline in signal.h * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up instead of in signal_wake_up_state. This prevents the clearing of TASK_STOPPED and TASK_TRACED from getting lost. * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20220421150654.757693825@infradead.org Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-12-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
# 2500ad1c	29-Apr-2022	Eric W. Biederman <ebiederm@xmission.com>	ptrace: Don't change __state Stop playing with tsk->__state to remove TASK_WAKEKILL while a ptrace command is executing. Instead remove TASK_WAKEKILL from the definition of TASK_TRACED, and implement a new jobctl flag TASK_PTRACE_FROZEN. This new flag is set in jobctl_freeze_task and cleared when ptrace_stop is awoken or in jobctl_unfreeze_task (when ptrace_stop remains asleep). In signal_wake_up add __TASK_TRACED to state along with TASK_WAKEKILL when the wake up is for a fatal signal. Skip adding __TASK_TRACED when TASK_PTRACE_FROZEN is not set. This has the same effect as changing TASK_TRACED to __TASK_TRACED as all of the wake_ups that use TASK_KILLABLE go through signal_wake_up. Handle a ptrace_stop being called with a pending fatal signal. Previously it would have been handled by schedule simply failing to sleep. As TASK_WAKEKILL is no longer part of TASK_TRACED schedule will sleep with a fatal_signal_pending. The code in signal_wake_up guarantees that the code will be awaked by any fatal signal that codes after TASK_TRACED is set. Previously the __state value of __TASK_TRACED was changed to TASK_RUNNING when woken up or back to TASK_TRACED when the code was left in ptrace_stop. Now when woken up ptrace_stop now clears JOBCTL_PTRACE_FROZEN and when left sleeping ptrace_unfreezed_traced clears JOBCTL_PTRACE_FROZEN. Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-10-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
# 57b6de08	04-May-2022	Eric W. Biederman <ebiederm@xmission.com>	ptrace: Admit ptrace_stop can generate spuriuos SIGTRAPs Long ago and far away there was a BUG_ON at the start of ptrace_stop that did "BUG_ON(!(current->ptrace & PT_PTRACED));" [1]. The BUG_ON had never triggered but examination of the code showed that the BUG_ON could actually trigger. To complement removing the BUG_ON an attempt to better handle the race was added. The code detected the tracer had gone away and did not call do_notify_parent_cldstop. The code also attempted to prevent ptrace_report_syscall from sending spurious SIGTRAPs when the tracer went away. The code to detect when the tracer had gone away before sending a signal to tracer was a legitimate fix and continues to work to this date. The code to prevent sending spurious SIGTRAPs is a failure. At the time and until today the code only catches it when the tracer goes away after siglock is dropped and before read_lock is acquired. If the tracer goes away after read_lock is dropped a spurious SIGTRAP can still be sent to the tracee. The tracer going away after read_lock is dropped is the far likelier case as it is the bigger window. Given that the attempt to prevent the generation of a SIGTRAP was a failure and continues to be a failure remove the code that attempts to do that. This simplifies the code in ptrace_stop and makes ptrace_stop much easier to reason about. To successfully deal with the tracer going away, all of the tracer's instrumentation of the child would need to be removed, and reliably detecting when the tracer has set a signal to continue with would need to be implemented. [1] 66519f549ae5 ("[PATCH] fix ptracer death race yielding bogus BUG_ON") History-Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-9-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
# cb3c19c9	29-Apr-2022	Eric W. Biederman <ebiederm@xmission.com>	signal: Use lockdep_assert_held instead of assert_spin_locked The distinction is that assert_spin_locked() checks if the lock is held byanyone* whereas lockdep_assert_held() asserts the current context holds the lock. Also, the check goes away if you build without lockdep. Suggested-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/Ympr/+PX4XgT/UKU@hirez.programming.kicks-ass.net Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-6-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
# e71ba124	22-Apr-2022	Eric W. Biederman <ebiederm@xmission.com>	signal: Replace __group_send_sig_info with send_signal_locked The function __group_send_sig_info is just a light wrapper around send_signal_locked with one parameter fixed to a constant value. As the wrapper adds no real value update the code to directly call the wrapped function. Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-2-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
# 157cc181	22-Apr-2022	Eric W. Biederman <ebiederm@xmission.com>	signal: Rename send_signal send_signal_locked Rename send_signal and __send_signal to send_signal_locked and __send_signal_locked to make send_signal usable outside of signal.c. Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/20220505182645.497868-1-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
# 78ed93d7	04-Apr-2022	Marco Elver <elver@google.com>	signal: Deliver SIGTRAP on perf event asynchronously if blocked With SIGTRAP on perf events, we have encountered termination of processes due to user space attempting to block delivery of SIGTRAP. Consider this case: <set up SIGTRAP on a perf event> ... sigset_t s; sigemptyset(&s); sigaddset(&s, SIGTRAP \| <and others>); sigprocmask(SIG_BLOCK, &s, ...); ... <perf event triggers> When the perf event triggers, while SIGTRAP is blocked, force_sig_perf() will force the signal, but revert back to the default handler, thus terminating the task. This makes sense for error conditions, but not so much for explicitly requested monitoring. However, the expectation is still that signals generated by perf events are synchronous, which will no longer be the case if the signal is blocked and delivered later. To give user space the ability to clearly distinguish synchronous from asynchronous signals, introduce siginfo_t::si_perf_flags and TRAP_PERF_FLAG_ASYNC (opted for flags in case more binary information is required in future). The resolution to the problem is then to (a) no longer force the signal (avoiding the terminations), but (b) tell user space via si_perf_flags if the signal was synchronous or not, so that such signals can be handled differently (e.g. let user space decide to ignore or consider the data imprecise). The alternative of making the kernel ignore SIGTRAP on perf events if the signal is blocked may work for some usecases, but likely causes issues in others that then have to revert back to interception of sigprocmask() (which we want to avoid). [ A concrete example: when using breakpoint perf events to track data-flow, in a region of code where signals are blocked, data-flow can no longer be tracked accurately. When a relevant asynchronous signal is received after unblocking the signal, the data-flow tracking logic needs to know its state is imprecise. ] Fixes: 97ba62b27867 ("perf: Add support for SIGTRAP on perf events") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Dmitry Vyukov <dvyukov@google.com> Link: https://lore.kernel.org/r/20220404111204.935357-1-elver@google.com
# 7dd5ad2d	31-Mar-2022	Thomas Gleixner <tglx@linutronix.de>	Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels" Revert commit bf9ad37dc8a. It needs to be better encapsulated and generalized. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
# 6487d1da	26-Jan-2022	Eric W. Biederman <ebiederm@xmission.com>	ptrace: Return the signal to continue with from ptrace_stop The signal a task should continue with after a ptrace stop is inconsistently read, cleared, and sent. Solve this by reading and clearing the signal to be sent in ptrace_stop. In an ideal world everything except ptrace_signal would share a common implementation of continuing with the signal, so ptracers could count on the signal they ask to continue with actually being delivered. For now retain bug compatibility and just return with the signal number the ptracer requested the code continue with. Link: https://lkml.kernel.org/r/875yoe7qdp.fsf_-_@email.froward.int.ebiederm.org Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
# 336d4b81	26-Jan-2022	Eric W. Biederman <ebiederm@xmission.com>	ptrace: Move setting/clearing ptrace_message into ptrace_stop Today ptrace_message is easy to overlook as it not a core part of ptrace_stop. It has been overlooked so much that there are places that set ptrace_message and don't clear it, and places that never set it. So if you get an unlucky sequence of events the ptracer may be able to read a ptrace_message that does not apply to the current ptrace stop. Move setting of ptrace_message into ptrace_stop so that it always gets set before the stop, and always gets cleared after the stop. This prevents non-sense from being reported to userspace and makes ptrace_message more visible in the ptrace helper functions so that kernel developers can see it. Link: https://lkml.kernel.org/r/87bky67qfv.fsf_-_@email.froward.int.ebiederm.org Acked-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>