Cross Reference: /freebsd-current/sys/kern/kern

History log of /freebsd-current/sys/kern/kern_fork.c
Revision	Date	Author	Comments
# 800da341	22-Apr-2024	Mark Johnston <markj@FreeBSD.org>	thread: Simplify sanitizer integration with thread creation fork() may allocate a new thread in one of two ways: from UMA, or cached in a freed proc that was just allocated from UMA. In either case, KASAN and KMSAN need to initialize some state; in particular they need to initialize the shadow mapping of the new thread's stack. This is done differently between KASAN and KMSAN, which is confusing. This patch improves things a bit: - Add a new thread_recycle() function, which moves all kernel stack handling out of kern_fork.c, since it doesn't really belong there. - Then, thread_alloc_stack() has only one local caller, so just inline it. - Avoid redundant shadow stack initialization: thread_alloc() initializes the KMSAN shadow stack (via kmsan_thread_alloc()) even through vm_thread_new() already did that. - Add kasan_thread_alloc(), for consistency with kmsan_thread_alloc(). No functional change intended. Reviewed by: khng MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D44891
# 68a3a7fc	19-Apr-2024	Ka Ho Ng <khng@FreeBSD.org>	kasan: fix false-positive kasan_report upon thread reuse In fork1(), if a thread is reused and thread_alloc_stack() is not called, mark the reused thread's kstack pages clean in the KASAN shadow buffer. Sponsored by: Juniper Networks, Inc. MFC after: 3 days Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D44875
# 171f0832	28-Nov-2023	Konstantin Belousov <kib@FreeBSD.org>	EVFILT_TIMER: intialize stop timer list in type-stable proc init, instead of fork Since kqueue timer may exist after the process that created it exited (same scenario with rfork(2) as in PR 275286), make the tailq p_kqtim_stop accessed by filt_timerdetach() type-stable. Noted and reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D42777
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixup to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# eac62420	10-Oct-2023	Olivier Certner <olce.freebsd@certner.fr>	Ensure "init" (PID 1) also executes userret() initially Calling userret() from fork_return() misses the first return to userspace of the "init" (PID 1) process. The latter is indeed created by fork1() followed by a call to cpu_fork_kthread_handler() call that replaces fork_return() by start_init() as the function to execute after fork. A new process' initial return to userspace in the end always happens through returning from fork_exit(), so move userret() there instead to fix the omission. This problem was discovered as part of a revamp of scheduling priorities that lead to experimenting with asserting and sometimes resetting priorities in sched_userret(), in the course of which the author stumbled on panics being triggered only in init() or only in other processes, depending on the modifications to sched_userret(). This change currently has no practical effect but will have some in the near future. Reviewed by: markj, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D42257
# 92541c12	25-Sep-2023	Olivier Certner <olce.freebsd@certner.fr>	Open-code proc_set_cred_init() This function is to be called only when initializing a new process (so, 'proc0' and at fork), and not in any other circumstances. Setting the process' 'p_ucred' field to the result of crcowget() on the original credentials is the only thing it does, hiding the fact that the process' 'p_ucred' field is crushed by the call. Moreover, most of the code it executes is already encapsulated in crcowget(). To prevent misuse and improve code readability, just remove this function and replace it with a direct assignment to 'p_ucred'. Reviewed by: markj (earlier version), kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D42255
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s]__FBSDID$"\$FreeBSD\$"$;?\s*\n/
# 474708c3	26-Jul-2023	Konstantin Belousov <kib@FreeBSD.org>	fork1(): properly track the state of the pg_killsx lock Reported by: dchagin Fixes: 232b922cb363e01ac0dd2a277d93cf74d8485e79 Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 232b922c	20-Jul-2023	Konstantin Belousov <kib@FreeBSD.org>	killpg(): close a race with fork(), part 2 When we are sending terminating signal to the group, killpg() needs to guarantee that all group members are to be terminated (it does not need to ensure that they are terminated on return from killpg()). The pg_killsx change eliminates the largest window there, but still, if a multithreaded process is signalled, the following could happen: - thread 1 is selected for the signal delivery and gets descheduled - thread 2 waits for pg_killsx lock, obtains it and forks - thread 1 continue executing and terminates the process This scenario allows the child to escape still. Fix it by single-threading forking parent if a conflict with pg_killsx is noted. We try to lock pg_killsx without sleeping, and failure to acquire it means that a parallel killpg(2) is executed. Then, stop other threads for running and in particular, receive signals, to avoid the situation explained above. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D41128
# aaa92413	20-Jul-2023	Konstantin Belousov <kib@FreeBSD.org>	Revert "killpg(): close a race with fork(), part 2" This reverts commits 81a37995c757b4e3ad8a5c699864197fd1ebdcf5 and 565a343ae3a30bc2973182ff8dfd2fa37d7f615f. There is still a leakage of the p_killpg_cnt, some but not all sources of which were identified. Second, and more important, is that there is a fundamental issue with blocked signals having KSI_KILLPG flag set. Queueing of such signal increments p_killpg_cnt, but it cannot be decremented until the signal is delivered. If, for instance, a single-threaded process with blocked signal receives killpg-kill and executes fork(2), the fork enter check returns with ERESTART. And since signal is blocked, the condition cannot be cleared. Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D41128
# 81a37995	15-Jun-2023	Konstantin Belousov <kib@FreeBSD.org>	killpg(): close a race with fork(), part 2 When we are sending terminating signal to the group, killpg() needs to guarantee that all group members are to be terminated (it does not need to ensure that they are terminated on return from killpg()). The pg_killsx change eliminates the largest window there, but still, if a multithreaded process is signalled, the following could happen: - thread 1 is selected for the signal delivery and gets descheduled - thread 2 waits for pg_killsx lock, obtains it and forks - thread 1 continue executing and terminates the process This scenario allows the child to escape still. To fix it, count the number of signals sent to the process with killpg(2), in p_killpg_cnt variable, which is incremented in killpg() and decremented after signal handler frame is created or in exit1() after single-threading. This way we avoid forking if the termination is due. Noted and reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D40493
# 3360b485	12-Jun-2023	Konstantin Belousov <kib@FreeBSD.org>	killpg(2): close a race with fork(2), part1 If the process group member performs fork(), the child could escape signalling from killpg(). Prevent it by introducing an sx process group lock pg_killsx which is taken interruptibly shared around fork. If there is a pending signal, do the trip through userspace with ERESTART to handle signal ASTs. The lock is taken exclusively during killpg(). The lock is also locked exclusive when the process changes group membership, to avoid escaping a signal by this means, by ensuring that the process group is stable during fork. Note that the new lock is before proctree lock, so in some situations we could only do trylocking to obtain it. This relatively simple approach cannot work for REAP_KILL, because process potentially belongs to more than one reaper tree by having sub-reapers. Reported by: dchagin Tested by: dchagin, pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D40493
# 653738e8	07-Jun-2023	John Baldwin <jhb@FreeBSD.org>	ptrace: Clear TDB_BORN during PT_DETACH. If a debugger detaches from a process that has a new thread that has not yet executed, the new thread will raise a SIGTRAP signal to report it's thread birth event even after the detach. With the debugger detached, this results in a SIGTRAP sent to the process and typically a core dump. Fix this by clearing TDB_BORN from any new threads during detach. Bump __FreeBSD_version for debuggers to notice when the fix is present. Reported by: GDB's testsuite Reviewed by: kib, markj (previous version) Differential Revision: https://reviews.freebsd.org/D39856
# 0282f875	16-May-2023	Dmitry Chagin <dchagin@FreeBSD.org>	ktrace: Simplify ae6ac587, drop the sa var declaration Suggested by: kib MFC after: 5 days
# ae6ac587	14-May-2023	Dmitry Chagin <dchagin@FreeBSD.org>	ktrace: Fix syscall number on a child return path from fork Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D40078 MFC after 1 week
# 5ecb5444	10-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	jail: add process linkage It allows iteration over processes belonging to given jail instead of having to walk the entire allproc list. Note the iteration can miss processes which remains bug-compatible with previous code. Reviewed by: jamie (previous version), markj (previous version) Differential Revision: https://reviews.freebsd.org/D34522
# fce3b1c3	21-Aug-2022	Konstantin Belousov <kib@FreeBSD.org>	fork_exit(): style comment Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D36302
# 5e5675cb	12-Aug-2022	Konstantin Belousov <kib@FreeBSD.org>	Remove struct proc p_singlethr member It does not serve any purpose after we stopped doing thread_single(SINGLE_ALLPROC) from stoppable user processes. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207
# 5e9bba94	10-Aug-2022	Konstantin Belousov <kib@FreeBSD.org>	fork_norfproc(): unlock p1 before retrying Reported and reviewed by: markj Tested by: pho Syzkaller: 647212368c3f32c6f13f Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D36207
# bd76586b	11-Aug-2022	Konstantin Belousov <kib@FreeBSD.org>	fork_norfproc(): style Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 3 days Differential revision: https://reviews.freebsd.org/D36207
# d07675a9	04-Aug-2022	Mark Johnston <markj@FreeBSD.org>	file: Move code to share fdtol structs into kern_descrip.c This ensures the filedesc-to-leader code is consistently encapsulated in kern_descrip.c. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D35988
# c6d31b83	18-Jul-2022	Konstantin Belousov <kib@FreeBSD.org>	AST: rework Make most AST handlers dynamically registered. This allows to have subsystem-specific handler source located in the subsystem files, instead of making subr_trap.c aware of it. For instance, signal delivery code on return to userspace is now moved to kern_sig.c. Also, it allows to have some handlers designated as the cleanup (kclear) type, which are called both at AST and on thread/process exit. For instance, ast(), exit1(), and NFS server no longer need to be aware about UFS softdep processing. The dynamic registration also allows third-party modules to register AST handlers if needed. There is one caveat with loadable modules: the code does not make any effort to ensure that the module is not unloaded before all threads processed through AST handler in it. In fact, this is already present behavior for hwpmc.ko and ufs.ko. I do not think it is worth the efforts and the runtime overhead to try to fix it. Reviewed by: markj Tested by: emaste (arm64), pho Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888
# 4493a13e	15-May-2022	Konstantin Belousov <kib@FreeBSD.org>	Do not single-thread itself when the process single-threaded some another process Since both self single-threading and remote single-threading rely on suspending the thread doing thread_single(), it cannot be mixed: thread doing thread_suspend_switch() might be subject to thread_suspend_one() and vice versa. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D35310
# 893d20c9	29-Jan-2022	Mateusz Guzik <mjg@FreeBSD.org>	fd: move fd table sizing out of fdinit now it is placed with the rest of actual initialisation
# 626d6992	26-Dec-2021	Edward Tomasz Napierala <trasz@FreeBSD.org>	Move fork_rfppwait() check into ast() This will always sleep at least once, so it's a slow path by definition. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D33387
# 351d5f7f	23-Oct-2021	Konstantin Belousov <kib@FreeBSD.org>	exec: store parent directory and hardlink name of the binary in struct proc While doing it, also move all the code to resolve pathnames and obtain text vp and dvp, into single place. Besides simplifying the code, it avoids spurious vnode relocks and validates the explanation why a transient text reference on the script vnode is not harmful. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32611
# 46dd801a	16-Oct-2021	Colin Percival <cperciva@FreeBSD.org>	Add userland boot profiling to TSLOG On kernels compiled with 'options TSLOG', record for each process ID: * The timestamp of the fork() which creates it and the parent process ID, * The first path passed to execve(), if any, * The first path resolved by namei, if any, and * The timestamp of the exit() which terminates the process. Expose this information via a new sysctl, debug.tslog_user. On kernels lacking 'options TSLOG' (the default), no information is recorded and the sysctl does not exist. Note that recording namei is needed in order to obtain the names of rc.d scripts being launched, as the rc system sources them in a subshell rather than execing the scripts. With this commit it is now possible to generate flamecharts of the entire boot process from the start of the loader to the end of /etc/rc. The code needed to perform this processing is currently found in github: https://github.com/cperciva/freebsd-boot-profiling Reviewed by: mhorne Sponsored by: https://www.patreon.com/cperciva Differential Revision: https://reviews.freebsd.org/D32493
# a0558fe9	28-Apr-2021	Mateusz Guzik <mjg@FreeBSD.org>	Retire code added to support CloudABI CloudABI was removed in cf0ee8738e31aa9e6fbf4dca4dac56d89226a71a
# 796a8e1a	01-Sep-2021	Konstantin Belousov <kib@FreeBSD.org>	procctl(2): Add PROC_WXMAP_CTL/STATUS It allows to override kern.elf{32,64}.allow_wx on per-process basis. In particular, it makes it possible to run binaries without PT_GNU_STACK and without elfctl note while allow_wx = 0. Reviewed by: brooks, emaste, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D31779
# 71854d9b	12-Aug-2021	Dmitry Chagin <dchagin@FreeBSD.org>	fork: Remove the unnecessary spaces. MFC after: 2 weeks
# b0f71f1b	10-Aug-2021	Mark Johnston <markj@FreeBSD.org>	amd64: Add MD bits for KMSAN Interrupt and exception handlers must call kmsan_intr_enter() prior to calling any C code. This is because the KMSAN runtime maintains some TLS in order to track initialization state of function parameters and return values across function calls. Then, to ensure that this state is kept consistent in the face of asynchronous kernel-mode excpeptions, the runtime uses a stack of TLS blocks, and kmsan_intr_enter() and kmsan_intr_leave() push and pop that stack, respectively. Use these functions in amd64 interrupt and exception handlers. Note that handlers for user->kernel transitions need not be annotated. Also ensure that trap frames pushed by the CPU and by handlers are marked as initialized before they are used. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31467
# 5dda15ad	10-Aug-2021	Mark Johnston <markj@FreeBSD.org>	kern: Ensure that thread-local KMSAN state is available Sponsored by: The FreeBSD Foundation
# db8d680e	01-Jul-2021	Edward Tomasz Napierala <trasz@FreeBSD.org>	procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS This introduces a new, per-process flag, "NO_NEW_PRIVS", which is inherited, preserved on exec, and cannot be cleared. The flag, when set, makes subsequent execs ignore any SUID and SGID bits, instead executing those binaries as if they not set. The main purpose of the flag is implementation of Linux PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged chroot. Reviewed By: kib Sponsored By: EPSRC Differential Revision: https://reviews.freebsd.org/D30939
# 9246b309	13-May-2021	Mark Johnston <markj@FreeBSD.org>	fork: Suspend other threads if both RFPROC and RFMEM are not set Otherwise, a multithreaded parent process may trigger races in vm_forkproc() if one thread calls rfork() with RFMEM set and another calls rfork() without RFMEM. Also simplify vm_forkproc() a bit, vmspace_unshare() already checks to see if the address space is shared. Reported by: syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com Reported by: syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30220
# 2fd1ffef	05-Mar-2021	Konstantin Belousov <kib@FreeBSD.org>	Stop arming kqueue timers on knote owner suspend or terminate This way, even if the process specified very tight reschedule intervals, it should be stoppable/killable. Reported and reviewed by: markj Tested by: markj, pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29106
# 640d5404	12-Mar-2021	John Baldwin <jhb@FreeBSD.org>	Set TDP_KTHREAD before calling cpu_fork() and cpu_copy_thread(). This permits these routines to use special logic for initializing MD kthread state. For the kproc case, this required moving the logic to set these flags from kproc_create() into do_fork(). Reviewed by: kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29207
# cc7b7306	16-Feb-2021	Jamie Gritton <jamie@FreeBSD.org>	jail: Handle a possible race between jail_remove(2) and fork(2) jail_remove(2) includes a loop that sends SIGKILL to all processes in a jail, but skips processes in PRS_NEW state. Thus it is possible the a process in mid-fork(2) during jail removal can survive the jail being removed. Add a prison flag PR_REMOVE, which is checked before the new process returns. If the jail is being removed, the process will then exit. Also check this flag in jail_attach(2) which has a similar issue. Reported by: trasz Approved by: kib MFC after: 3 days
# f8f74aaa	17-Nov-2020	Conrad Meyer <cem@FreeBSD.org>	linux(4) clone(2): Correctly handle CLONE_FS and CLONE_FILES The two flags are distinct and it is impossible to correctly handle clone(2) without the assistance of fork1(). This change depends on the pwddesc split introduced in r367777. I've added a fork_req flag, FR2_SHARE_PATHS, which indicates that p_pd should be treated the opposite way p_fd is (based on RFFDG flag). This is a little ugly, but the benefit is that existing RFFDG API is preserved. Holding FR2_SHARE_PATHS disabled, RFFDG indicates both p_fd and p_pd are copied, while !RFFDG indicates both should be cloned. In Chrome, clone(2) is used with CLONE_FS, without CLONE_FILES, and expects independent fd tables. The previous conflation of CLONE_FS and CLONE_FILES was introduced in r163371 (2006). Discussed with: markj, trasz (earlier version) Differential Revision: https://reviews.freebsd.org/D27016
# 85078b85	17-Nov-2020	Conrad Meyer <cem@FreeBSD.org>	Split out cwd/root/jail, cmask state from filedesc table No functional change intended. Tracking these structures separately for each proc enables future work to correctly emulate clone(2) in linux(4). __FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof. Reviewed by: kib Discussed with: markj, mjg Differential Revision: https://reviews.freebsd.org/D27037
# 6fed89b1	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	kern: clean up empty lines in .c and .h files
# d8bc2a17	15-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	fd: remove fd_lastfile It keeps recalculated way more often than it is needed. Provide a routine (fdlastfile) to get it if necessary. Consumers may be better off with a bitmap iterator instead.
# 1724c563	09-Jun-2020	Mateusz Guzik <mjg@FreeBSD.org>	cred: distribute reference count per thread This avoids dirtying creds in the common case, see the comment in kern_prot.c for details. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D24007
# 757a5642	15-May-2020	Christian S.J. Peron <csjp@FreeBSD.org>	Add BSM record conversion for a number of syscalls: - thr_kill(2) and thr_exit(2) generally (no argument auditing here. - A set of syscalls for the process descriptor family, specifically: pdfork(2), pdgetpid(2) and pdkill(2) For these syscalls, audit the file descriptor. In the case of pdfork(2) a pointer to an integer (file descriptor) is passed in as an argument. We audit the post initialized file descriptor (not the random garbage that would have been passed in). We will also audit the child process which was created from the fork operation (similar to what is done for the fork(2) syscall). pdkill(2) we audit the signal value and fd, and finally pdgetpid(2) just the file descriptor: - Following is a sample of the produced audit trails: header,111,11,pdfork(2),0,Sat May 16 03:07:50 2020, + 394 msec argument,0,0x39d,child PID argument,2,0x2,flags argument,1,0x8,fd subject,root,root,0,root,0,924,0,0,0.0.0.0 return,success,925 header,79,11,pdgetpid(2),0,Sat May 16 03:07:50 2020, + 394 msec argument,1,0x8,fd subject,root,root,0,root,0,924,0,0,0.0.0.0 return,success,0 trailer,79 header,135,11,pdkill(2),0,Sat May 16 03:07:50 2020, + 395 msec argument,1,0x8,fd argument,2,0xf,signal process_ex,root,root,0,root,0,925,0,0,0.0.0.0 subject,root,root,0,root,0,924,0,0,0.0.0.0 return,success,0 trailer,135 MFC after: 1 week
# 59838c1a	01-Apr-2020	John Baldwin <jhb@FreeBSD.org>	Retire procfs-based process debugging. Modern debuggers and process tracers use ptrace() rather than procfs for debugging. ptrace() has a supserset of functionality available via procfs and new debugging features are only added to ptrace(). While the two debugging services share some fields in struct proc, they each use dedicated fields and separate code. This results in extra complexity to support a feature that hasn't been enabled in the default install for several years. PR: 244939 (exp-run) Reviewed by: kib, mjg (earlier version) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D23837
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# 146fc63f	09-Feb-2020	Konstantin Belousov <kib@FreeBSD.org>	Add a way to manage thread signal mask using shared word, instead of syscall. A new syscall sigfastblock(2) is added which registers a uint32_t variable as containing the count of blocks for signal delivery. Its content is read by kernel on each syscall entry and on AST processing, non-zero count of blocks is interpreted same as the signal mask blocking all signals. The biggest downside of the feature that I see is that memory corruption that affects the registered fast sigblock location, would cause quite strange application misbehavior. For instance, the process would be immune to ^C (but killable by SIGKILL). With consumers (rtld and libthr added), benchmarks do not show a slow-down of the syscalls in micro-measurements, and macro benchmarks like buildworld do not demonstrate a difference. Part of the reason is that buildworld time is dominated by compiler, and clang already links to libthr. On the other hand, small utilities typically used by shell scripts have the total number of syscalls cut by half. The syscall is not exported from the stable libc version namespace on purpose. It is intended to be used only by our C runtime implementation internals. Tested by: pho Disscussed with: cem, emaste, jilles Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D12773
# 61a74c5c	15-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	schedlock 1/4 Eliminate recursion from most thread_lock consumers. Return from sched_add() without the thread_lock held. This eliminates unnecessary atomics and lock word loads as well as reducing the hold time for scheduler locks. This will eventually allow for lockless remote adds. Discussed with: kib Reviewed by: jhb Tested by: pho Differential Revision: https://reviews.freebsd.org/D22626
# 079c5b9e	25-Sep-2019	Kyle Evans <kevans@FreeBSD.org>	rfork(2): add RFSPAWN flag When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets signal handlers in the child during creation to avoid a point of corruption of parent state from the child. This flag will be used by posix_spawn(3) to handle potential signal issues. Reviewed by: jilles, kib Differential Revision: https://reviews.freebsd.org/D19058
# fe69291f	03-Sep-2019	Konstantin Belousov <kib@FreeBSD.org>	Add procctl(PROC_STACKGAP_CTL) It allows a process to request that stack gap was not applied to its stacks, retroactively. Also it is possible to control the gaps in the process after exec. PR: 239894 Reviewed by: alc Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21352
# 50c7615f	17-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	fork: rework locking around do_fork - move allproc lock into the func, it is of no use prior to it - the code would lock p1 and p2 while holding allproc to partially construct it after it gets added to the list. instead we can do the work prior to adding anything. - protect lastpid with procid_lock As a side effect we do less work with allproc held. Sponsored by: The FreeBSD Foundation
# 60cdcb64	17-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	fork: bump process count before checking for permission to cross the limit The limit is almost never reached. Do the check only on failure to see if we can override it. No change in user-visible behavior. Sponsored by: The FreeBSD Foundation
# b05641b6	17-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	fork: stop skipping < 100 ids on wrap around Code doing this is commented with a claim that these IDs are occupied by daemons, but that's demonstrably false. To an extent the range is used by init and kernel processes (and on sufficiently big machines it indeed is fully populated). On a sample box 40-way box the highest id in the range is 63. On a different one it is 23. Just use the range. Sponsored by: The FreeBSD Foundation
# 2ffee5c1	10-Jul-2019	Mark Johnston <markj@FreeBSD.org>	Inherit P2_PROTMAX_{ENABLE,DISABLE} across fork(). Thus, when using proccontrol(1) to disable implicit application of PROT_MAX within a process, child processes will inherit this setting. Discussed with: kib MFC with: r349609 Sponsored by: The FreeBSD Foundation
# e2e050c8	19-May-2019	Conrad Meyer <cem@FreeBSD.org>	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
# 7d43b5c9	07-May-2019	Mark Johnston <markj@FreeBSD.org>	Simplify the test against maxproc in fork1(). Previously nprocs_new would be tested against maxprocs twice when nprocs_new < maxprocs - 10. Eliminate the unnecessary comparison. Submitted by: Wuyang Chung <wuyang.chung1@gmail.com> GitHub PR: https://github.com/freebsd/freebsd/pull/397 MFC after: 1 week
# 37d2b1f3	04-May-2019	Mateusz Guzik <mjg@FreeBSD.org>	Annotate nprocs with __exclusive_cache_line Sponsored by: The FreeBSD Foundation
# fa50a355	10-Feb-2019	Konstantin Belousov <kib@FreeBSD.org>	Implement Address Space Layout Randomization (ASLR) With this change, randomization can be enabled for all non-fixed mappings. It means that the base address for the mapping is selected with a guaranteed amount of entropy (bits). If the mapping was requested to be superpage aligned, the randomization honours the superpage attributes. Although the value of ASLR is diminshing over time as exploit authors work out simple ASLR bypass techniques, it elimintates the trivial exploitation of certain vulnerabilities, at least in theory. This implementation is relatively small and happens at the correct architectural level. Also, it is not expected to introduce regressions in existing cases when turned off (default for now), or cause any significant maintaince burden. The randomization is done on a best-effort basis - that is, the allocator falls back to a first fit strategy if fragmentation prevents entropy injection. It is trivial to implement a strong mode where failure to guarantee the requested amount of entropy results in mapping request failure, but I do not consider that to be usable. I have not fine-tuned the amount of entropy injected right now. It is only a quantitive change that will not change the implementation. The current amount is controlled by aslr_pages_rnd. To not spoil coalescing optimizations, to reduce the page table fragmentation inherent to ASLR, and to keep the transient superpage promotion for the malloced memory, locality clustering is implemented for anonymous private mappings, which are automatically grouped until fragmentation kicks in. The initial location for the anon group range is, of course, randomized. This is controlled by vm.cluster_anon, enabled by default. The default mode keeps the sbrk area unpopulated by other mappings, but this can be turned off, which gives much more breathing bits on architectures with small address space, such as i386. This is tied with the question of following an application's hint about the mmap(2) base address. Testing shows that ignoring the hint does not affect the function of common applications, but I would expect more demanding code could break. By default sbrk is preserved and mmap hints are satisfied, which can be changed by using the kern.elf{32,64}.aslr.honor_sbrk sysctl. ASLR is enabled on per-ABI basis, and currently it is only allowed on FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support for additional architectures will be added after further testing. Both per-process and per-image controls are implemented: - procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS; - NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible to force ASLR off for the given binary. (A tool to edit the feature control note is in development.) Global controls are: - kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2); - kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings; - kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2); - vm.cluster_anon - enables anon mapping clustering. PR: 208580 (exp runs) Exp-runs done by: antoine Reviewed by: markj (previous version) Discussed with: emaste Tested by: pho MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D5603
# be8dd142	16-Jan-2019	Konstantin Belousov <kib@FreeBSD.org>	Re-wrap long line after r341827. Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 19b75ef5	19-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Microoptimize corner case of ID bitmap handling. Prior to the change we would avoidably test more possibly used IDs. While here update the comment: there is no pidchecked variable anymore.
# 7d065d87	19-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Deinline vfork handling out of the syscall return path. vfork is rarely called (comparatively to other syscalls) and it avoidably pollutes the fast path. Sponsored by: The FreeBSD Foundation
# cc426dd3	11-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Remove unused argument to priv_check_cred. Patch mostly generated with cocinnelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation
# eab2132a	08-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Fix a corner case in ID bitmap management. If all IDs from trypid to pid_max were used as pids, the code would enter a loop which would be infinite if none of the IDs could become free (e.g. they all belong to processes which did not transitioned to zombie). Fixes: r341684 ("Manage process-related IDs with bitmaps") Sponsored by: The FreeBSD Foundation
# e52327e3	07-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	proc: postpone proc unlock until after reporting with kqueue kqueue would always relock immediately afterwards. While here drop the NULL check for list itself. The list is always allocated. Sponsored by: The FreeBSD Foundation
# 34ebdcea	06-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Manage process-related IDs with bitmaps Currently unique pid allocation on fork often requires a full walk of process, group, session lists to make sure it is not used by anything. This has a side effect of requiring proctree to be held along with allproc, which adds more contention in poudriere -j 128. The patch below implements trivial bitmaps which gets rid of the problem. Dedicated lock is introduced to manage IDs. While here a bug was discovered: all processes would inherit reap id from the first process spawned by init. This had a side effect of keeping the ID used and when allocation rolls over to the beginning it keeps being skipped. The patch is loosely based on initial work by mjoras@. Reviewed by: kib Sponsored by: The FreeBSD Foundation
# 1e9a1bf5	28-Nov-2018	Mateusz Guzik <mjg@FreeBSD.org>	proc: create a dedicated lock for zombproc to ligthen the load on allproc_lock waitpid always takes proctree to evaluate the list, but only takes allproc if it can reap. With this patch allproc is no longer taken, which helps during poudriere -j 128. Discussed with: kib Sponsored by: The FreeBSD Foundation
# e3d3e828	22-Nov-2018	Mateusz Guzik <mjg@FreeBSD.org>	Revert "fork: fix use-after-free with vfork" This unreliably breaks libc handling of vfork where forking succeded, but execve did not. vfork code in libc performs waitpid with WNOHANG in case of failed exec. With the fix exit codepath was waking up the parent before the child fully transitioned to a zombie. Woken up parent would waitpid, which could find a not-yet-zombie child and fail to reap it due to the WNOHANG flag. While removing the flag fixes the problem, it is not an option due to older releases which would still suffer from the kernel change. Revert the fix until a solution can be worked out. Note that while use-after-free which gets back due to the revert is a real bug, it's side-effects are limited due to the fact that struct proc memory is never released by UMA.
# a5ac8272	22-Nov-2018	Mateusz Guzik <mjg@FreeBSD.org>	fork: remove avoidable proc lock/unlock pair We don't have to access the process after making it runnable, so there is no need to hold it either. Sponsored by: The FreeBSD Foundation
# b00b27e9	22-Nov-2018	Mateusz Guzik <mjg@FreeBSD.org>	fork: fix use-after-free with vfork The pointer to the child is stored without any reference held. Then it is blindly used to wait until P_PPWAIT is cleared. However, if the child is autoreaped it could have exited and get freed before the parent started waiting. Use the existing hold mechanism to mitigate the problem. Most common case of doing exec remains unchanged. The corner case of doing exit performs wake up before waiting for holds to clear. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18295
# 3d3e6793	21-Nov-2018	Mateusz Guzik <mjg@FreeBSD.org>	proc: implement pid hash locks and an iterator forks, exits and waits are frequently stalled during poudriere -j 128 runs due to killpg and process list exports performed for each package. Both uses take the allproc lock. The latter case can be modified to iterate over the hash with finer grained locking instead. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17817
# 2c054ce9	16-Nov-2018	Mateusz Guzik <mjg@FreeBSD.org>	proc: always store parent pid in p_oppid Doing so removes the dependency on proctree lock from sysctl process list export which further reduces contention during poudriere -j 128 runs. Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17825
# 6e22bbf6	21-Jun-2018	Konstantin Belousov <kib@FreeBSD.org>	fork: avoid endless wait with PTRACE_FORK and RFSTOPPED. An RFSTOPPED thread can't clean TDB_STOPATFORK, which is done in the fork_return() in its context, so parent is stuck forever. Triggered when trying to ptrace linux process. Instead of waiting for the new thread to clear TDB_STOPATFORK, tag it as traced and reparent to the debugger in do_fork(), and let it only notify the debugger when run. Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com> Reviewed by: jhb MFC after: 1 week X-MFC-Note: keep p_dbgwait placeholder intact Differential revision: https://reviews.freebsd.org/D15857
# 81d68271	19-Feb-2018	Mateusz Guzik <mjg@FreeBSD.org>	Reduce contention on the proctree lock during heavy package build. There is a proctree -> allproc ordering established. Most of the time it is either xlock -> xlock or slock -> slock. On fork however there is a slock -> xlock pair which results in pathological wait times due to threads keeping proctree held for reading and all waiting on allproc. Switch this to xlock -> xlock. Longer term fix would get rid of proctree in this place to begin with. Right now it is necessary to walk the session/process group lists to determine which id is free. The walk can be avoided e.g. with bitmaps. The exit path used to have one place which dealt with allproc and then with proctree. Move the allproc acquire into the section protected by proctree. This reduces contention against threads waiting on proctree in the fork codepath - the fork proctree holder does not have to wait for allproc as often. Finally, move tidhash manipulation outside of the area protected by either of these locks. The removal from the hash was already unprotected. There is no legitimate reason to look up thread ids for a process still under construction. This results in about 50% wait time reduction during -j 128 package build.
# 65f29b9c	16-Feb-2018	Mateusz Guzik <mjg@FreeBSD.org>	Postpone sx_sunlock(&proctree_lock) on fork until after allproc is dropped. There is a significant contention on the lock during -j 128 package build. This change drops total wait time on this lock by 60%.
# 8e23158a	14-Jan-2018	Bjoern A. Zeeb <bz@FreeBSD.org>	Remove trailing whitespace. No functional change.
# 3f289c3f	12-Jan-2018	Jeff Roberson <jeff@FreeBSD.org>	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header polution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 2ca45184	09-Nov-2017	Matt Joras <mjoras@FreeBSD.org>	Introduce EVENTHANDLER_LIST and some users. This introduces a facility to EVENTHANDLER(9) for explicitly defining a reference to an event handler list. This is useful since previously all invokers of events had to do a locked traversal of the global list of event handler lists in order to find the appropriate event handler list. By keeping a pointer to the appropriate list an invoker can avoid this traversal completely. The pointer is initialized with SYSINIT(9) during the eventhandler stage. Users registering interest in events do not need to know if the event is backed by such a list, since the list is added to the global list of lists. As with lists that are not pre-defined it is safe to register for the events before the list has been created. This converts the process_* and thread_* events to using the new facility, as these are events whose locked traversals end up showing up significantly in ports build workflows (and presumably other workflows with many short lived threads/procs). It may be advantageous to convert other events to using the new facility. The el_flags field is now unused, but leave it be so that this revision can be MFC'd. Reviewed by: bdrewery, markj, mjg Approved by: rstone (mentor) In collaboration with: ian MFC after: 4 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12814
# 008a0935	10-Sep-2017	Dag-Erling Smørgrav <des@FreeBSD.org>	If the user tries to set kern.randompid to 1 (which is meaningless), set it to a random value between 100 and 1123, rather than 0 as before. Submitted by: Marie Helene Kvello-Aune <marieheleneka@gmail.com> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D5336
# 2d88da2f	12-Jun-2017	Konstantin Belousov <kib@FreeBSD.org>	Move struct syscall_args syscall arguments parameters container into struct thread. For all architectures, the syscall trap handlers have to allocate the structure on the stack. The structure takes 88 bytes on 64bit arches which is not negligible. Also, it cannot be easily found by other code, which e.g. caused duplication of some members of the structure to struct thread already. The change removes td_dbg_sc_code and td_dbg_sc_nargs which were directly copied from syscall_args. The structure is put into the copied on fork part of the struct thread to make the syscall arguments information correct in the child after fork. This move will also allow several more uses shortly. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 3 weeks X-Differential revision: https://reviews.freebsd.org/D11080
# 83c9dea1	17-Apr-2017	Gleb Smirnoff <glebius@FreeBSD.org>	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156
# 82a4538f	20-Feb-2017	Eric Badger <badger@FreeBSD.org>	Defer ptracestop() signals that cannot be delivered immediately When a thread is stopped in ptracestop(), the ptrace(2) user may request a signal be delivered upon resumption of the thread. Heretofore, those signals were discarded unless ptracestop()'s caller was issignal(). Fix this by modifying ptracestop() to queue up signals requested by the ptrace user that will be delivered when possible. Take special care when the signal is SIGKILL (usually generated from a PT_KILL request); no new stop events should be triggered after a PT_KILL. Add a number of tests for the new functionality. Several tests were authored by jhb. PR: 212607 Reviewed by: kib Approved by: kib (mentor) MFC after: 2 weeks Sponsored by: Dell EMC In collaboration with: jhb Differential Revision: https://reviews.freebsd.org/D9260
# 5afb134c	12-Dec-2016	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add vrefact, to be used when the vnode has to be already active This allows blind increment of relevant counters which under contention is cheaper than inc-not-zero loops at least on amd64. Use it in some of the places which are guaranteed to see already active vnodes. Reviewed by: kib (previous version)
# 643f6f47	21-Sep-2016	Konstantin Belousov <kib@FreeBSD.org>	Add PROC_TRAPCAP procctl(2) controls and global sysctl kern.trap_enocap. Both can be used to cause processes in capability mode to receive SIGTRAP when ENOTCAPABLE or ECAPMODE errors are returned from syscalls. Idea by: emaste Reviewed by: oshogbo (previous version), emaste Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D7965
# 69a28758	15-Sep-2016	Ed Maste <emaste@FreeBSD.org>	Renumber license clauses in sys/kern to avoid skipping #3
# e5574e09	19-Aug-2016	Mark Johnston <markj@FreeBSD.org>	Don't set P2_PTRACE_FSTP in a process that invokes ptrace(PT_TRACE_ME). Such processes are stopped synchronously by a direct call to ptracestop(SIGTRAP) upon exec. P2_PTRACE_FSTP causes the exec()ing thread to suspend itself while waiting for a SIGSTOP that never arrives. Reviewed by: kib MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D7576
# e69ba32f	03-Aug-2016	Konstantin Belousov <kib@FreeBSD.org>	Remove mention of the Giant from the fork_return() description. Making emphasis on this lock in the core function comment is confusing for the modern kernel. Sponsored by: The FreeBSD Foundation MFC after: 3 days
# b7a25e63	28-Jul-2016	Konstantin Belousov <kib@FreeBSD.org>	When a debugger attaches to the process, SIGSTOP is sent to the target. Due to a way issignal() selects the next signal to deliver and report, if the simultaneous or already pending another signal exists, that signal might be reported by the next waitpid(2) call. This causes minor annoyance for debuggers, which must be prepared to take any signal as the first event, then filter SIGSTOP later. More importantly, for tools like gcore(1), which attach and then detach without processing events, SIGSTOP might leak to be delivered after PT_DETACH. This results in the process being unintentionally stopped after detach, which is fatal for automatic tools. The solution is to force SIGSTOP to be the first signal reported after the attach. Attach code is modified to set P2_PTRACE_FSTP to indicate that the attaching ritual was not yet finished, and issignal() prefers SIGSTOP in that condition. Also, the thread which handles P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first waitpid(2). All that ensures that SIGSTOP is consumed first. Additionally, if P2_PTRACE_FSTP is still set on detach, which means that waitpid(2) was not called at all, SIGSTOP is removed from the queue, ensuring that the process is resumed on detach. In issignal(), when acting on STOPing signals, remove the signal from queue before suspending. Otherwise parallel attach could result in ptracestop() acting on that STOP as if it was the STOP signal from the attach. Then SIGSTOP from attach leaks again. As a minor refactoring, some bits of the common attach code is moved to new helper proc_set_traced(). Reported by: markj Reviewed by: jhb, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D7256
# fc4f075a	18-Jul-2016	John Baldwin <jhb@FreeBSD.org>	Add PTRACE_VFORK to trace vfork events. First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when the new child was created via vfork() rather than fork(). Second, a new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK event mask. This new stop is reported after the vfork parent resumes due to the child calling exit or exec. Debuggers can use this stop to reinsert breakpoints in the vfork parent process before it resumes. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7045
# 8d570f64	15-Jul-2016	John Baldwin <jhb@FreeBSD.org>	Add a mask of optional ptrace() events. ptrace() now stores a mask of optional events in p_ptevents. Currently this mask is a single integer, but it can be expanded into an array of integers in the future. Two new ptrace requests can be used to manipulate the event mask: PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK sets the current event mask. The current set of events include: - PTRACE_EXEC: trace calls to execve(). - PTRACE_SCE: trace system call entries. - PTRACE_SCX: trace syscam call exits. - PTRACE_FORK: trace forks and auto-attach to new child processes. - PTRACE_LWP: trace LWP events. The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have been replaced by PTRACE_SCE and PTRACE_SCX. PTRACE_FORK replaces P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS. The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for compatibility but now simply toggle corresponding flags in the event mask. While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both modify the event mask and continue the traced process. Reviewed by: kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D7044
# 9e590ff0	27-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	When filt_proc() removes event from the knlist due to the process exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen: - knote kn_knlist pointer is reset - INFLUX knote is removed from the process knlist. And, there are two consequences: - KN_LIST_UNLOCK() on such knote is nop - there is nothing which would block exit1() from processing past the knlist_destroy() (and knlist_destroy() resets knlist lock pointers). Both consequences result either in leaked process lock, or dereferencing NULL function pointers for locking. Handle this by stopping embedding the process knlist into struct proc. Instead, the knlist is allocated together with struct proc, but marked as autodestroy on the zombie reap, by knlist_detach() function. The knlist is freed when last kevent is removed from the list, in particular, at the zombie reap time if the list is empty. As result, the knlist_remove_inevent() is no longer needed and removed. Other changes: In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events from kn_sfflags for knote registered by kernel to only get NOTE_CHILD notifications. The flags leak resulted in excessive NOTE_EXEC/NOTE_FORK reports. Fix immediate note activation in filt_procattach(). Condition should be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT report for the exiting process. In knote_fork(), do not perform racy check for KN_INFLUX before kq lock is taken. Besides being racy, it did not accounted for notes just added by scan (KN_SCAN). Some minor and incomplete style fixes. Analyzed and tested by: Eric Badger <eric@badgerio.us> Reviewed by: jhb Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (gjb) Differential revision: https://reviews.freebsd.org/D6859
# 5c2cf818	15-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	Update comments for the MD functions managing contexts for new threads, to make it less confusing and using modern kernel terms. Rename the functions to reflect current use of the functions, instead of the historic KSE conventions: cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads) cpu_set_upcall -> cpu_copy_thread (for forks) cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation) Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (hrs) Differential revision: https://reviews.freebsd.org/D6731
# b3a73448	07-Jun-2016	Mariusz Zaborski <oshogbo@FreeBSD.org>	Introduce the PD_CLOEXEC for pdfork(2). Reviewed by: mjg
# 93ccd6bf	05-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	Get rid of struct proc p_sched and struct thread td_sched pointers. p_sched is unused. The struct td_sched is always co-allocated with the struct thread, except for the thread0. Avoid useless indirection, instead calculate td_sched location using simple pointer arithmetic in td_get_sched(9). For thread0, which is statically allocated, create a structure to emulate layout of the dynamic allocation. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D6711
# e3043798	29-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/kern: spelling fixes in comments. No functional change.
# db57c70a	09-Feb-2016	Konstantin Belousov <kib@FreeBSD.org>	Rename P_KTHREAD struct proc p_flag to P_KPROC. I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL() definition. Suggested by: jhb Sponsored by: The FreeBSD Foundation
# fb1f4582	08-Feb-2016	John Baldwin <jhb@FreeBSD.org>	Call kthread_exit() rather than kproc_exit() for a premature kthread exit. Kernel threads (and processes) are supposed to call kthread_exit() (or kproc_exit()) to terminate. However, the kernel includes a fallback in fork_exit() to force a kthread exit if a kernel thread's "main" routine returns. This fallback was added back when the kernel only had processes and was not updated to call kthread_exit() instead of kproc_exit() when threads were added to the kernel. This mistake was particular exciting when the errant thread belonged to proc0. Due to the missing P_KTHREAD flag the fallback did not kick in and instead tried to return to userland via whatever garbage was in the trapframe. With P_KTHREAD set it tried to terminate proc0 resulting in other amusements. PR: 204999 MFC after: 1 week
# 0c829a30	06-Feb-2016	Mateusz Guzik <mjg@FreeBSD.org>	fork: ansify sys_pdfork No functional changes.
# 4732ae43	04-Feb-2016	Konstantin Belousov <kib@FreeBSD.org>	Guard against runnable td2 exiting and than being reused for unrelated process when the parent sleeps waiting for the debugger attach on fork. Diagnosed and reviewed by: mjg Sponsored by: The FreeBSD Foundation
# 813361c1	03-Feb-2016	Mateusz Guzik <mjg@FreeBSD.org>	fork: plug a use after free of the returned process fork1 required its callers to pass a pointer to struct proc * which would be set to the new process (if any). procdesc and racct manipulation also used said pointer. However, the process could have exited prior to do_fork return and be automatically reaped, thus making this a use-after-free. Fix the problem by letting callers indicate whether they want the pid or the struct proc, return the process in stopped state for the latter case. Reviewed by: kib
# 33fd9b9a	03-Feb-2016	Mateusz Guzik <mjg@FreeBSD.org>	fork: pass arguments to fork1 in a dedicated structure Suggested by: kib
# 5fcfab6e	29-Dec-2015	John Baldwin <jhb@FreeBSD.org>	Add ptrace(2) reporting for LWP events. Add two new LWPINFO flags: PL_FLAG_BORN and PL_FLAG_EXITED for reporting thread creation and destruction. Newly created threads will stop to report PL_FLAG_BORN before returning to userland and exiting threads will stop to report PL_FLAG_EXIT before exiting completely. Both of these events are only enabled and reported if PT_LWP_EVENTS is enabled on a process.
# 36160958	16-Dec-2015	Mark Johnston <markj@FreeBSD.org>	Fix style issues around existing SDT probes. - Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect at the moment, but will be needed for some future changes. - Don't hardcode the module component of the probe identifier. This is set automatically by the SDT framework. MFC after: 1 week
# aff57357	22-Oct-2015	Ed Schouten <ed@FreeBSD.org>	Add a way to distinguish between forking and thread creation in schedtail. For CloudABI we need to initialize the registers of new threads differently based on whether the thread got created through a fork or through simple thread creation. Add a flag, TDP_FORKING, that is set by do_fork() and cleared by fork_exit(). This can be tested against in schedtail. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3973
# d8f3dc78	16-Oct-2015	Konstantin Belousov <kib@FreeBSD.org>	If falloc_caps() failed, cleanup needs to be performed. This is a bug in r289026. Sponsored by: The FreeBSD Foundation
# 4b48959f	08-Oct-2015	Konstantin Belousov <kib@FreeBSD.org>	Enforce the maxproc limitation before allocating struct proc, initial struct thread and kernel stack for the thread. Otherwise, a load similar to a fork bomb would exhaust KVA and possibly kmem, mostly due to the struct proc being type-stable. The nprocs counter is changed from being protected by allproc_lock sx to be an atomic variable. Note that ddb/db_ps.c:db_ps() use of nprocs was unsafe before, and is still unsafe, but it seems that the only possible undesired consequence is the harmless warning printed when allproc linked list length does not match nprocs. Diagnosed by: Svatopluk Kraus <onwahe@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 189ac973	06-Oct-2015	John Baldwin <jhb@FreeBSD.org>	Fix various edge cases related to system call tracing. - Always set td_dbg_sc_* when P_TRACED is set on system call entry even if the debugger is not tracing system call entries. This ensures the fields are valid when reporting other stops that occur at system call boundaries such as for PT_FOLLOW_FORKS or when only tracing system call exits. - Set TDB_SCX when reporting the stop for a new child process in fork_return(). This causes the event to be reported as a system call exit. - Report a system call exit event in fork_return() for new threads in a traced process. - Copy td_dbg_sc_* to new threads instead of zeroing. This ensures that td_dbg_sc_code in particular will report the system call that created the new thread or process when it reports a system call exit event in fork_return(). - Add new ptrace tests to verify that new child processes and threads report system call exit events with a valid pl_syscall_code via PT_LWPINFO. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D3822
# 2f2f522b	27-Sep-2015	Andriy Gapon <avg@FreeBSD.org>	save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters where n is typically smaller than 5. Perhaps SDT_PROBE should be made a private implementation detail. MFC after: 20 days
# edc82223	10-Aug-2015	Konstantin Belousov <kib@FreeBSD.org>	Make kstack_pages a tunable on arm, x86, and powepc. On i386, the initial thread stack is not adjusted by the tunable, the stack is allocated too early to get access to the kernel environment. See TD0_KSTACK_PAGES for the thread0 stack sizing on i386. The tunable was tested on x86 only. From the visual inspection, it seems that it might work on arm and powerpc. The arm USPACE_SVC_STACK_TOP and powerpc USPACE macros seems to be already incorrect for the threads with non-default kstack size. I only changed the macros to use variable instead of constant, since I cannot test. On arm64, mips and sparc64, some static data structures are sized by KSTACK_PAGES, so the tunable is disabled. Sponsored by: The FreeBSD Foundation MFC after: 2 week
# 6236e71b	31-Jul-2015	Ed Schouten <ed@FreeBSD.org>	Fix accidental line wrapping introduced in r286122.
# 367a13f9	31-Jul-2015	Ed Schouten <ed@FreeBSD.org>	Limit rights on process descriptors. On CloudABI, the rights bits returned by cap_rights_get() match up with the operations that you can actually perform on the file descriptor. Limiting the rights is good, because it makes it easier to get uniform behaviour across different operating systems. If process descriptors on FreeBSD would suddenly gain support for any new file operation, this wouldn't become exposed to CloudABI processes without first extending the rights. Extend fork1() to gain a 'struct filecaps' argument that allows you to construct process descriptors with custom rights. Use this in cloudabi_sys_proc_fork() to limit the rights to just fstat() and pdwait(). Obtained from: https://github.com/NuxiNL/freebsd
# 6520495a	11-Jul-2015	Adrian Chadd <adrian@FreeBSD.org>	Add an initial NUMA affinity/policy configuration for threads and processes. This is based on work done by jeff@ and jhb@, as well as the numa.diff patch that has been circulating when someone asks for first-touch NUMA on -10 or -11. * Introduce a simple set of VM policy and iterator types. * tie the policy types into the vm_phys path for now, mirroring how the initial first-touch allocation work was enabled. * add syscalls to control changing thread and process defaults. * add a global NUMA VM domain policy. * implement a simple cascade policy order - if a thread policy exists, use it; if a process policy exists, use it; use the default policy. * processes inherit policies from their parent processes, threads inherit policies from their parent threads. * add a simple tool (numactl) to query and modify default thread/process policities. * add documentation for the new syscalls, for numa and for numactl. * re-enable first touch NUMA again by default, as now policies can be set in a variety of methods. This is only relevant for very specific workloads. This doesn't pretend to be a final NUMA solution. The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by 'sysctl vm.default_policy=rr'. This is only relevant if MAXMEMDOM is set to something other than 1. Ie, if you're using GENERIC or a modified kernel with non-NUMA, then this is a glorified no-op for you. Thank you to Norse Corp for giving me access to rather large (for FreeBSD!) NUMA machines in order to develop and verify this. Thank you to Dell for providing me with dual socket sandybridge and westmere v3 hardware to do NUMA development with. Thank you to Scott Long at Netflix for providing me with access to the two-socket, four-domain haswell v3 hardware. Thank you to Peter Holm for running the stress testing suite against the NUMA branch during various stages of development! Tested: * MIPS (regression testing; non-NUMA) * i386 (regression testing; non-NUMA GENERIC) * amd64 (regression testing; non-NUMA GENERIC) * westmere, 2 socket (thankyou norse!) * sandy bridge, 2 socket (thankyou dell!) * ivy bridge, 2 socket (thankyou norse!) * westmere-EX, 4 socket / 1TB RAM (thankyou norse!) * haswell, 2 socket (thankyou norse!) * haswell v3, 2 socket (thankyou dell) * haswell v3, 2x18 core (thankyou scott long / netflix!) * Peter Holm ran a stress test suite on this work and found one issue, but has not been able to verify it (it doesn't look NUMA related, and he only saw it once over many testing runs.) * I've tested bhyve instances running in fixed NUMA domains and cpusets; all seems to work correctly. Verified: * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different NUMA policies for processes under test. Review: This was reviewed through phabricator (https://reviews.freebsd.org/D2559) as well as privately and via emails to freebsd-arch@. The git history with specific attributes is available at https://github.com/erikarn/freebsd/ in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy). This has been reviewed by a number of people (stas, rpaulo, kib, ngie, wblock) but not achieved a clear consensus. My hope is that with further exposure and testing more functionality can be implemented and evaluated. Notes: * The VM doesn't handle unbalanced domains very well, and if you have an overly unbalanced memory setup whilst under high memory pressure, VM page allocation may fail leading to a kernel panic. This was a problem in the past, but it's much more easily triggered now with these tools. * This work only controls the path through vm_phys; it doesn't yet strongly/predictably affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory isn't really guaranteed in any way. That's next on my plate. Sponsored by: Norse Corp, Inc.; Dell
# f6f6d240	10-Jun-2015	Mateusz Guzik <mjg@FreeBSD.org>	Implement lockless resource limits. Use the same scheme implemented to manage credentials. Code needing to look at process's credentials (as opposed to thred's) is provided with *_proc variants of relevant functions. Places which possibly had to take the proc lock anyway still use the proc pointer to access limits.
# 4ea6a9a2	10-Jun-2015	Mateusz Guzik <mjg@FreeBSD.org>	Generalised support for copy-on-write structures shared by threads. Thread credentials are maintained as follows: each thread has a pointer to creds and a reference on them. The pointer is compared with proc's creds on userspace<->kernel boundary and updated if needed. This patch introduces a counter which can be compared instead, so that more structures can use this scheme without adding more comparisons on the boundary.
# 515b7a0b	25-May-2015	John Baldwin <jhb@FreeBSD.org>	Add KTR tracing for some MI ptrace events. Differential Revision: https://reviews.freebsd.org/D2643 Reviewed by: kib
# edf1796d	06-May-2015	Mateusz Guzik <mjg@FreeBSD.org>	Fix up panics when fork fails due to hitting proc limit The function clearning credentials on failure asserts the process is a zombie, which is not true when fork fails. Changing creds to NULL is unnecessary, but is still being done for consistency with other code. Pointy hat: mjg Reported by: pho
# 90f54cbf	11-Apr-2015	Mateusz Guzik <mjg@FreeBSD.org>	fd: remove filedesc argument from fdclose Just accept a thread instead. This makes it consistent with fdalloc. No functional changes.
# ffb34484	21-Mar-2015	Mateusz Guzik <mjg@FreeBSD.org>	cred: add proc_set_cred_init helper proc_set_cred_init can be used to set first credentials of a new process. Update proc_set_cred assertions so that it only expects already used processes. This fixes panics where p_ucred of a new process happens to be non-NULL. Reviewed by: kib
# 12cec311	21-Mar-2015	Mateusz Guzik <mjg@FreeBSD.org>	fork: assign refed credentials earlier Prior to this change the kernel would take p1's credentials and assign them tempororarily to p2. But p1 could change credentials at that time and in effect give us a use-after-free. No objections from: kib
# daf63fd2	15-Mar-2015	Mateusz Guzik <mjg@FreeBSD.org>	cred: add proc_set_cred helper The goal here is to provide one place altering process credentials. This eases debugging and opens up posibilities to do additional work when such an action is performed.
# 677258f7	18-Jan-2015	Konstantin Belousov <kib@FreeBSD.org>	Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger attachment to the process. Note that the command is not intended to be a security measure, rather it is an obfuscation feature, implemented for parity with other operating systems. Discussed with: jilles, rwatson Man page fixes by: rwatson Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 237623b0	14-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	Add a facility for non-init process to declare itself the reaper of the orphaned descendants. Base of the API is modelled after the same feature from the DragonFlyBSD. Requested by: bapt Reviewed by: jilles (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
# 6ddcc233	13-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	Add facility to stop all userspace processes. The supposed use of the feature is to quisce the system before suspend. Stop is implemented by reusing the thread_single(9) with the special mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing single-threading modes by allowing (requiring) caller to operate on other process. Interruptible sleeps for !TDF_SBDRY threads are suspended like SIGSTOP does it, instead of aborting the sleep, like SINGLE_NO_EXIT, to avoid spurious EINTRs on resume. Provide debugging sysctl debug.stop_all_proc, which causes total stop and suspends syncer, while waiting for variable reset for resume. It is used for debugging; should be removed after the real use of the interface is added. In collaboration with: pho Discussed with: avg Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# eb48fbd9	13-Nov-2014	Mateusz Guzik <mjg@FreeBSD.org>	filedesc: fixup fdinit to lock fdp and preapare files conditinally Not all consumers providing fdp to copy from want files. Perhaps these functions should be reorganized to better express the outcome. This fixes up panics after r273895 . Reported by: markj
# b9d32c36	27-Jun-2014	Mateusz Guzik <mjg@FreeBSD.org>	Make fdunshare accept only td parameter. Proc had to match the thread anyway and 2 parameters were inconsistent with the rest. MFC after: 1 week
# 7159310f	17-Dec-2013	Mark Johnston <markj@FreeBSD.org>	The fasttrap fork handler is responsible for removing tracepoints in the child process that were inherited from its parent. However, this should not be done in the case of a vfork, since the fork handler ends up removing the tracepoints from the shared vm space, and userland DTrace probes in the parent will no longer fire as a result. Now the child of a vfork may trigger userland DTrace probes enabled in its parent, so modify the fasttrap probe handler to handle this case and handle the child process in the same way that it would handle the traced process. In particular, if once traces function foo() in a process that vforks, and the child calls foo(), fasttrap will treat this call as having come from the parent. This is the behaviour of the upstream code. While here, add #ifdef guards to some code that isn't present upstream. MFC after: 1 month
# f2b525e6	30-Nov-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Make process descriptors standard part of the kernel. rwhod(8) already requires process descriptors to work and having PROCDESC in GENERIC seems not enough, especially that we hope to have more and more consumers in the base. MFC after: 3 days
# d9fae5ab	26-Nov-2013	Andriy Gapon <avg@FreeBSD.org>	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks
# 54366c0b	25-Nov-2013	Attilio Rao <attilio@FreeBSD.org>	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invokation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip
# 55648840	19-Sep-2013	John Baldwin <jhb@FreeBSD.org>	Extend the support for exempting processes from being killed when swap is exhausted. - Add a new protect(1) command that can be used to set or revoke protection from arbitrary processes. Similar to ktrace it can apply a change to all existing descendants of a process as well as future descendants. - Add a new procctl(2) system call that provides a generic interface for control operations on processes (as opposed to the debugger-specific operations provided by ptrace(2)). procctl(2) uses a combination of idtype_t and an id to identify the set of processes on which to operate similar to wait6(). - Add a PROC_SPROTECT control operation to manage the protection status of a set of processes. MADV_PROTECT still works for backwards compatability. - Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc) the first bit of which is used to track if P_PROTECT should be inherited by new child processes. Reviewed by: kib, jilles (earlier version) Approved by: re (delphij) MFC after: 1 month
# 7b77e1fe	14-Aug-2013	Mark Johnston <markj@FreeBSD.org>	Specify SDT probe argument types in the probe definition itself rather than using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API to allow probes with dynamically-translated types. There is no functional change. MFC after: 2 weeks
# a208417c	19-Apr-2013	Jaakko Heinonen <jh@FreeBSD.org>	Include PID in the error message which is printed when the maxproc limit is exceeded. Improve formatting of the message while here. PR: kern/60550 Submitted by: Lowell Gilbert, bde
# 2609222a	01-Mar-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Merge Capsicum overhaul: - Capability is no longer separate descriptor type. Now every descriptor has set of its own capability rights. - The cap_new(2) system call is left, but it is no longer documented and should not be used in new code. - The new syscall cap_rights_limit(2) should be used instead of cap_new(2), which limits capability rights of the given descriptor without creating a new one. - The cap_getrights(2) syscall is renamed to cap_rights_get(2). - If CAP_IOCTL capability right is present we can further reduce allowed ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed ioctls can be retrived with cap_ioctls_get(2) syscall. - If CAP_FCNTL capability right is present we can further reduce fcntls that can be used with the new cap_fcntls_limit(2) syscall and retrive them with cap_fcntls_get(2). - To support ioctl and fcntl white-listing the filedesc structure was heavly modified. - The audit subsystem, kdump and procstat tools were updated to recognize new syscalls. - Capability rights were revised and eventhough I tried hard to provide backward API and ABI compatibility there are some incompatible changes that are described in detail below: CAP_CREATE old behaviour: - Allow for openat(2)+O_CREAT. - Allow for linkat(2). - Allow for symlinkat(2). CAP_CREATE new behaviour: - Allow for openat(2)+O_CREAT. Added CAP_LINKAT: - Allow for linkat(2). ABI: Reuses CAP_RMDIR bit. - Allow to be target for renameat(2). Added CAP_SYMLINKAT: - Allow for symlinkat(2). Removed CAP_DELETE. Old behaviour: - Allow for unlinkat(2) when removing non-directory object. - Allow to be source for renameat(2). Removed CAP_RMDIR. Old behaviour: - Allow for unlinkat(2) when removing directory. Added CAP_RENAMEAT: - Required for source directory for the renameat(2) syscall. Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR): - Allow for unlinkat(2) on any object. - Required if target of renameat(2) exists and will be removed by this call. Removed CAP_MAPEXEC. CAP_MMAP old behaviour: - Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and PROT_WRITE. CAP_MMAP new behaviour: - Allow for mmap(2)+PROT_NONE. Added CAP_MMAP_R: - Allow for mmap(PROT_READ). Added CAP_MMAP_W: - Allow for mmap(PROT_WRITE). Added CAP_MMAP_X: - Allow for mmap(PROT_EXEC). Added CAP_MMAP_RW: - Allow for mmap(PROT_READ \| PROT_WRITE). Added CAP_MMAP_RX: - Allow for mmap(PROT_READ \| PROT_EXEC). Added CAP_MMAP_WX: - Allow for mmap(PROT_WRITE \| PROT_EXEC). Added CAP_MMAP_RWX: - Allow for mmap(PROT_READ \| PROT_WRITE \| PROT_EXEC). Renamed CAP_MKDIR to CAP_MKDIRAT. Renamed CAP_MKFIFO to CAP_MKFIFOAT. Renamed CAP_MKNODE to CAP_MKNODEAT. CAP_READ old behaviour: - Allow pread(2). - Disallow read(2), readv(2) (if there is no CAP_SEEK). CAP_READ new behaviour: - Allow read(2), readv(2). - Disallow pread(2) (CAP_SEEK was also required). CAP_WRITE old behaviour: - Allow pwrite(2). - Disallow write(2), writev(2) (if there is no CAP_SEEK). CAP_WRITE new behaviour: - Allow write(2), writev(2). - Disallow pwrite(2) (CAP_SEEK was also required). Added convinient defines: #define CAP_PREAD (CAP_SEEK \| CAP_READ) #define CAP_PWRITE (CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_R (CAP_MMAP \| CAP_SEEK \| CAP_READ) #define CAP_MMAP_W (CAP_MMAP \| CAP_SEEK \| CAP_WRITE) #define CAP_MMAP_X (CAP_MMAP \| CAP_SEEK \| 0x0000000000000008ULL) #define CAP_MMAP_RW (CAP_MMAP_R \| CAP_MMAP_W) #define CAP_MMAP_RX (CAP_MMAP_R \| CAP_MMAP_X) #define CAP_MMAP_WX (CAP_MMAP_W \| CAP_MMAP_X) #define CAP_MMAP_RWX (CAP_MMAP_R \| CAP_MMAP_W \| CAP_MMAP_X) #define CAP_RECV CAP_READ #define CAP_SEND CAP_WRITE #define CAP_SOCK_CLIENT \ (CAP_CONNECT \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| CAP_GETSOCKOPT \| \ CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| CAP_SETSOCKOPT \| CAP_SHUTDOWN) #define CAP_SOCK_SERVER \ (CAP_ACCEPT \| CAP_BIND \| CAP_GETPEERNAME \| CAP_GETSOCKNAME \| \ CAP_GETSOCKOPT \| CAP_LISTEN \| CAP_PEELOFF \| CAP_RECV \| CAP_SEND \| \ CAP_SETSOCKOPT \| CAP_SHUTDOWN) Added defines for backward API compatibility: #define CAP_MAPEXEC CAP_MMAP_X #define CAP_DELETE CAP_UNLINKAT #define CAP_MKDIR CAP_MKDIRAT #define CAP_RMDIR CAP_UNLINKAT #define CAP_MKFIFO CAP_MKFIFOAT #define CAP_MKNOD CAP_MKNODAT #define CAP_SOCK_ALL (CAP_SOCK_CLIENT \| CAP_SOCK_SERVER) Sponsored by: The FreeBSD Foundation Reviewed by: Christoph Mallon <christoph.mallon@gmx.de> Many aspects discussed with: rwatson, benl, jonathan ABI compatibility discussed with: kib
# de265498	17-Feb-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Remove redundant parenthesis.
# d4015944	14-Dec-2012	Konstantin Belousov <kib@FreeBSD.org>	Remove a special case for XEN, which is erronous and makes vfork(2) behaviour to differ from the documented, only on XEN. If there are any issues with XEN pmap left, they should be fixed in pmap. MFC after: 2 weeks
# f7e50ea7	04-Dec-2012	Konstantin Belousov <kib@FreeBSD.org>	Fix a race between kern_setitimer() and realitexpire(), where the callout is started before kern_setitimer() acquires process mutex, but looses a race and kern_setitimer() gets the process mutex before the callout. Then, assuming that new specified struct itimerval has it_interval zero, but it_value non-zero, the callout, after it starts executing again, clears p->p_realtimer.it_value, but kern_setitimer() already rescheduled the callout. As the result of the race, both p_realtimer is zero, and the callout is rescheduled. Then, in the exit1(), the exit code sees that it_value is zero and does not even try to stop the callout. This allows the struct proc to be reused and eventually the armed callout is re-initialized. The consequence is the corrupted callwheel tailq. Use process mutex to interlock the callout start, which fixes the race. Reported and tested by: pho Reviewed by: jhb MFC after: 2 weeks
# 324e5715	08-Sep-2012	Attilio Rao <attilio@FreeBSD.org>	userret() already checks for td_locks when INVARIANTS is enabled, so there is no need to check if Giant is acquired after it. Reviewed by: kib MFC after: 1 week
# 02c6fc21	15-Aug-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a sysctl kern.pid_max, which limits the maximum pid the system is allowed to allocate, and corresponding tunable with the same name. Note that existing processes with higher pids are left intact. MFC after: 1 week
# 0a7007b9	19-Jun-2012	Pawel Jakub Dawidek <pjd@FreeBSD.org>	The falloc() function obtains two references to newly created 'fp'. On success we have to drop one after procdesc_finit() and on failure we have to close allocated slot with fdclose(), which also drops one reference for us and drop the remaining reference with fdrop(). Without this change closing process descriptor didn't result in killing pdfork(2)ed child. Reviewed by: rwatson MFC after: 1 month
# 97681567	26-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Stop treating td_sigmask specially for the purposes of new thread creation. Move it into the copied region of the struct thread. Update some comments. Requested by: bde X-MFC after: never
# ab27d5d8	22-May-2012	Edward Tomasz Napierala <trasz@FreeBSD.org>	Fix panic with RACCT that could occur in low memory (or out of swap) situations, due to fork1() calling racct_proc_exit() without calling racct_proc_fork() first. Submitted by: Mateusz Guzik <mjguzik at gmail dot com> (earlier version) Reviewed by: Mateusz Guzik <mjguzik at gmail dot com>
# 1d7ca9bb	27-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Currently, the debugger attached to the process executing vfork() does not get syscall exit notification until the child performed exec of exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT clearing, by postponing the wait into syscallret after ptracestop() notification is done. Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net> MFC after: 2 weeks
# dcd43281	23-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Allow the parent to gather the exit status of the children reparented to the debugger. When reparenting for debugging, keep the child in the new orphan list of old parent. When looping over the children in kern_wait(), iterate over both children list and orphan list to search for the process by pid. Submitted by: Dmitry Mikulin <dmitrym juniper.net> MFC after: 2 weeks
# db327339	09-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Mark the automatically attached child with PL_FLAG_CHILD in struct lwpinfo flags, for PT_FOLLOWFORK auto-attachment. In collaboration with: Dmitry Mikulin <dmitrym juniper net> MFC after: 1 week
# 2d8696d1	03-Oct-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Move some code inside the racct_proc_fork(); it spares a few lock operations and it's more logical this way. MFC after: 3 days
# 72a401d9	03-Oct-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Fix another bug introduced in r225641, which caused rctl to access certain fields in 'struct proc' before they got initialized in do_fork(). MFC after: 3 days
# 1dbf9dcc	17-Sep-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Fix error handling bug that would prevent MAC structures from getting freed properly if resource limit got exceeded. Approved by: re (kib)
# b38520f0	17-Sep-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Fix long-standing thinko regarding maxproc accounting. Basically, we were accounting the newly created process to its parent instead of the child itself. This caused problems later, when the child changed its credentials - the per-uid, per-jail etc counters were not properly updated, because the maxproc counter in the child process was 0. Approved by: re (kib)
# 8451d0dd	16-Sep-2011	Kip Macy <kmacy@FreeBSD.org>	In order to maximize the re-usability of kernel code in user space this patch modifies makesyscalls.sh to prefix all of the non-compatibility calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel entry points and all places in the code that use them. It also fixes an additional name space collision between the kernel function psignal and the libc function of the same name by renaming the kernel psignal kern_psignal(). By introducing this change now we will ease future MFCs that change syscalls. Reviewed by: rwatson Approved by: re (bz)
# cfb5f768	18-Aug-2011	Jonathan Anderson <jonathan@FreeBSD.org>	Add experimental support for process descriptors A "process descriptor" file descriptor is used to manage processes without using the PID namespace. This is required for Capsicum's Capability Mode, where the PID namespace is unavailable. New system calls pdfork(2) and pdkill(2) offer the functional equivalents of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote process for debugging purposes. The currently-unimplemented pdwait(2) will, in the future, allow querying rusage/exit status. In the interim, poll(2) may be used to check (and wait for) process termination. When a process is referenced by a process descriptor, it does not issue SIGCHLD to the parent, making it suitable for use in libraries---a common scenario when using library compartmentalisation from within large applications (such as web browsers). Some observers may note a similarity to Mach task ports; process descriptors provide a subset of this behaviour, but in a UNIX style. This feature is enabled by "options PROCDESC", but as with several other Capsicum kernel features, is not enabled by default in GENERIC 9.0. Reviewed by: jhb, kib Approved by: re (kib), mentor (rwatson) Sponsored by: Google Inc
# f49d8202	12-Jul-2011	Konstantin Belousov <kib@FreeBSD.org>	Implement an RFTSIGZMB flag to rfork(2) to specify a signal that is delivered to parent when the child exists. Submitted by: Petr Salinger <Petr.Salinger seznam cz> (Debian/kFreeBSD) MFC after: 1 week X-MFC-note: bump __FreeBSD_version
# afcc55f3	06-Jul-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	All the racct_*() calls need to happen with the proc locked. Fixing this won't happen before 9.0. This commit adds "#ifdef RACCT" around all the "PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order to avoid useless locking/unlocking in kernels built without "options RACCT".
# 58c77a9d	31-Mar-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Enable accounting for RACCT_NPROC and RACCT_NTHR. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
# 097055e2	29-Mar-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add racct. It's an API to keep per-process, per-jail, per-loginclass and per-loginclass resource accounting information, to be used by the new resource limits code. It's connected to the build, but the code that actually calls the new functions will come later. Sponsored by: The FreeBSD Foundation Reviewed by: kib (earlier version)
# 8e6fa660	24-Mar-2011	John Baldwin <jhb@FreeBSD.org>	Fix some locking nits with the p_state field of struct proc: - Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL in fork to honor the locking requirements. While here, expand the scope of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously the code was locking the new child process (p2) after it had locked the parent process (p1). However, when locking two processes, the safe order is to lock the child first, then the parent. - Fix various places that were checking p_state against PRS_NEW without having the process locked to use PROC_LOCK(). Every place was already locking the process, just after the PRS_NEW check. - Remove or reduce the use of PROC_SLOCK() for places that were checking p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading the current state. - Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once. MFC after: 1 week
# e5d81ef1	08-Mar-2011	Dmitry Chagin <dchagin@FreeBSD.org>	Extend struct sysvec with new method sv_schedtail, which is used for an explicit process at fork trampoline path instead of eventhadler(schedtail) invocation for each child process. Remove eventhandler(schedtail) code and change linux ABI to use newly added sysvec method. While here replace explicit comparing of module sysentvec structure with the newly created process sysentvec to detect the linux ABI. Discussed with: kib MFC after: 2 Week
# 7705d4b2	25-Feb-2011	Dmitry Chagin <dchagin@FreeBSD.org>	Introduce preliminary support of the show description of the ABI of traced process by adding two new events which records value of process sv_flags to the trace file at process creation/execing/exiting time. MFC after: 1 Month.
# 6fa39a73	25-Jan-2011	Konstantin Belousov <kib@FreeBSD.org>	Allow debugger to specify that children of the traced process should be automatically traced. Extend the ptrace(PL_LWPINFO) to report that child just forked. Reviewed by: davidxu, jhb MFC after: 2 weeks
# 22d19207	06-Jan-2011	John Baldwin <jhb@FreeBSD.org>	- Move sched_fork() later in fork() after the various sections of the new thread and proc have been copied and zeroed from the old thread and proc. Otherwise attempts to modify thread or process data in sched_fork() could be undone. - Don't copy td_{base,}_user_pri from the old thread to the new thread in sched_fork_thread() in ULE. This is already done courtesy the bcopy() of the thread copy region. - Always initialize the real priority (td_priority) of new threads to the new thread's base priority (td_base_pri) to avoid bogusly inheriting a borrowed priority from the parent thread. MFC after: 2 weeks
# 3e73ff1e	01-Jan-2011	Edward Tomasz Napierala <trasz@FreeBSD.org>	Finishing touches to fork1() - ANSIfy missed function definition, style(9) fixes, removal of few comments that didn't really make sense and addition of fork_findpid() locking requirements.
# afd01097	10-Dec-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Refactor fork1() to make it easier to follow. No functional changes. Reviewed by: kib (earlier version) Tested by: pho
# acbe332a	08-Dec-2010	David Xu <davidxu@FreeBSD.org>	MFp4: It is possible a lower priority thread lending priority to higher priority thread, in old code, it is ignored, however the lending should always be recorded, add field td_lend_user_pri to fix the problem, if a thread does not have borrowed priority, its value is PRI_MAX. MFC after: 1 week
# 087bfb0e	06-Dec-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add a KASSERT to make it obvious when fork_norfproc() is to be called, and set *procp to NULL in all cases. Previously, it was not being set in the ERESTART case. This is effectively no-op, since its value is ignored by callers in the error case. Reviewed by: kib@
# f68c74bb	06-Dec-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Fix style bug introduced by previous commit.
# 1d845e86	06-Dec-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	Improve readability by factoring out the !RFPROC case. While here, turn K&R function definitions into ANSI. No functional changes. Reviewed by: kib@
# d680caab	21-Oct-2010	John Baldwin <jhb@FreeBSD.org>	- When disabling ktracing on a process, free any pending requests that may be left. This fixes a memory leak that can occur when tracing is disabled on a process via disabling tracing of a specific file (or if an I/O error occurs with the tracefile) if the process's next system call is exit(). The trace disabling code clears p_traceflag, so exit1() doesn't do any KTRACE-related cleanup leading to the leak. I chose to make the free'ing of pending records synchronous rather than patching exit1(). - Move KTRACE-specific logic out of kern_(exec\|exit\|fork).c and into kern_ktrace.c instead. Make ktrace_mtx private to kern_ktrace.c as a result. MFC after: 1 month
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# cf7d9a8c	08-Oct-2010	David Xu <davidxu@FreeBSD.org>	Create a global thread hash table to speed up thread lookup, use rwlock to protect the table. In old code, thread lookup is done with process lock held, to find a thread, kernel has to iterate through process and thread list, this is quite inefficient. With this change, test shows in extreme case performance is dramatically improved. Earlier patch was reviewed by: jhb, julian
# d3555b6f	09-Sep-2010	Rui Paulo <rpaulo@FreeBSD.org>	Fix two bugs in DTrace: * when the process exits, remove the associated USDT probes * when the process forks, duplicate the USDT probes. Sponsored by: The FreeBSD Foundation
# 79856499	22-Aug-2010	Rui Paulo <rpaulo@FreeBSD.org>	Add an extra comment to the SDT probes definition. This allows us to get use '-' in probe names, matching the probe names in Solaris.[1] Add userland SDT probes definitions to sys/sdt.h. Sponsored by: The FreeBSD Foundation Discussed with: rwaston [1]
# c193de56	11-May-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r207468: Extract thread_lock()/ruxagg()/thread_unlock() fragment into utility function ruxagg_tlock(). Convert the definition of kern_getrusage() to ANSI C. MFC r207602: Implement RUSAGE_THREAD. Add td_rux to keep extended runtime and ticks information for thread to allow calcru1() (re)use. Rename ruxagg()->ruxagg_locked(), ruxagg_tlock()->ruxagg() [1]. The ruxagg_locked() function no longer clears thread ticks nor td_incruntime. Not an MFC: the td_rux is added to the end of struct thread to keep the KBI. Explicit bzero() of td_rux is added to new thread initialization points.
# 2af00dec	08-Sep-2009	Konstantin Belousov <kib@FreeBSD.org>	MFC r196730: Remove the altkstacks, instead instantiate threads with kernel stack allocated with the right size from the start. For the thread that has kernel stack cached, verify that requested stack size is equial to the actual, and reallocate the stack if sizes differ. Introduce separate kernel stack cache that keeps some limited amount of preallocated kernel stacks to lower the latency of thread allocation. Not a merge: instead of removing td_altkstack* members of struct thread, replace them with placeholders to keep struct thread layout on the stable branch. Also, record r196640, r196644 and r196648 as merged. Approved by: re (kensmith)
# 8a945d10	01-Sep-2009	Konstantin Belousov <kib@FreeBSD.org>	Reintroduce the r196640, after fixing the problem with my testing. Remove the altkstacks, instead instantiate threads with kernel stack allocated with the right size from the start. For the thread that has kernel stack cached, verify that requested stack size is equial to the actual, and reallocate the stack if sizes differ [1]. This fixes the bug introduced by r173361 that was committed several days after r173004 and consisted of kthread_add(9) ignoring the non-default kernel stack size. Also, r173361 removed the caching of the kernel stacks for a non-first thread in the process. Introduce separate kernel stack cache that keeps some limited amount of preallocated kernel stacks to lower the latency of thread allocation. Add vm_lowmem handler to prune the cache on low memory condition. This way, system with reasonable amount of the threads get lower latency of thread creation, while still not exhausting significant portion of KVA for unused kstacks. Submitted by: peter [1] Discussed with: jhb, julian, peter Reviewed by: jhb Tested by: pho (and retested according to new test scenarious) MFC after: 1 week
# f25fa6ab	29-Aug-2009	Konstantin Belousov <kib@FreeBSD.org>	Reverse r196640 and r196644 for now.
# b6b2d1bf	29-Aug-2009	Konstantin Belousov <kib@FreeBSD.org>	Dispose the kernel stack of the proper thread. Submitted by: alc MFC after: 1 week
# c3cf0b47	29-Aug-2009	Konstantin Belousov <kib@FreeBSD.org>	Remove the altkstacks, instead instantiate threads with kernel stack allocated with the right size from the start. For the thread that has kernel stack cached, verify that requested stack size is equial to the actual, and reallocate the stack if sizes differ [1]. This fixes the bug introduced by r173361 that was committed several days after r173004 and consisted of kthread_add(9) ignoring the non-default kernel stack size. Also, r173361 removed the caching of the kernel stacks for a non-first thread in the process. Introduce separate kernel stack cache that keeps some limited amount of preallocated kernel stacks to lower the latency of thread allocation. Add vm_lowmem handler to prune the cache on low memory condition. This way, system with reasonable amount of the threads get lower latency of thread creation, while still not exhausting significant portion of KVA for unused kstacks. Submitted by: peter [1] Discussed with: jhb, julian, peter Reviewed by: jhb Tested by: pho MFC after: 1 week
# 7afcbc18	17-Jul-2009	Jamie Gritton <jamie@FreeBSD.org>	Remove the interim vimage containers, struct vimage and struct procg, and the ioctl-based interface that supported them. Approved by: re (kib), bz (mentor)
# 14961ba7	27-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Replace AUDIT_ARG() with variable argument macros with a set more more specific macros for each audit argument type. This makes it easier to follow call-graphs, especially for automated analysis tools (such as fxr). In MFC, we should leave the existing AUDIT_ARG() macros as they may be used by third-party kernel modules. Suggested by: brooks Approved by: re (kib) Obtained from: TrustedBSD Project MFC after: 1 week
# 3364c323	23-Jun-2009	Konstantin Belousov <kib@FreeBSD.org>	Implement global and per-uid accounting of the anonymous memory. Add rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved for the uid. The accounting information (charge) is associated with either map entry, or vm object backing the entry, assuming the object is the first one in the shadow chain and entry does not require COW. Charge is moved from entry to object on allocation of the object, e.g. during the mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup. The per-entry granularity of accounting makes the charge process fair for processes that change uid during lifetime, and decrements charge for proper uid when region is unmapped. The interface of vm_pager_allocate(9) is extended by adding struct ucred *, that is used to charge appropriate uid when allocation if performed by kernel, e.g. md(4). Several syscalls, among them is fork(2), may now return ENOMEM when global or per-uid limits are enforced. In collaboration with: pho Reviewed by: alc Approved by: re (kensmith)
# d8b0556c	10-Jun-2009	Konstantin Belousov <kib@FreeBSD.org>	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# 0304c731	27-May-2009	Jamie Gritton <jamie@FreeBSD.org>	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
# 29b02909	08-May-2009	Marko Zec <zec@FreeBSD.org>	Introduce a new virtualization container, provisionally named vprocg, to hold virtualized instances of hostname and domainname, as well as a new top-level virtualization struct vimage, which holds pointers to struct vnet and struct vprocg. Struct vprocg is likely to become replaced in the near future with a new jail management API import. As a consequence of this change, change struct ucred to point to a struct vimage, instead of directly pointing to a vnet. Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage branch. Permit kldload / kldunload operations to be executed only from the default vimage context. This change should have no functional impact on nooptions VIMAGE kernel builds. Reviewed by: bz Approved by: julian (mentor)
# 21ca7b57	05-May-2009	Marko Zec <zec@FreeBSD.org>	Change the curvnet variable from a global const struct vnet , previously always pointing to the default vnet context, to a dynamically changing thread-local one. The currvnet context should be set on entry to networking code via CURVNET_SET() macros, and reverted to previous state via CURVNET_RESTORE(). Recursions on curvnet are permitted, though strongly discuouraged. This change should have no functional impact on nooptions VIMAGE kernel builds, where CURVNET_ macros expand to whitespace. The curthread->td_vnet (aka curvnet) variable's purpose is to be an indicator of the vnet context in which the current network-related operation takes place, in case we cannot deduce the current vnet context from any other source, such as by looking at mbuf's m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so far curvnet has turned out to be an invaluable consistency checking aid: it helps to catch cases when sockets, ifnets or any other vnet-aware structures may have leaked from one vnet to another. The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros was a result of an empirical iterative process, whith an aim to reduce recursions on CURVNET_SET() to a minimum, while still reducing the scope of CURVNET_SET() to networking only operations - the alternative would be calling CURVNET_SET() on each system call entry. In general, curvnet has to be set in three typicall cases: when processing socket-related requests from userspace or from within the kernel; when processing inbound traffic flowing from device drivers to upper layers of the networking stack, and when executing timer-driven networking functions. This change also introduces a DDB subcommand to show the list of all vnet instances. Approved by: julian (mentor)
# aeb32571	05-Dec-2008	Konstantin Belousov <kib@FreeBSD.org>	Several threads in a process may do vfork() simultaneously. Then, all parent threads sleep on the parent' struct proc until corresponding child releases the vmspace. Each sleep is interlocked with proc mutex of the child, that triggers assertion in the sleepq_add(). The assertion requires that at any time, all simultaneous sleepers for the channel use the same interlock. Silent the assertion by using conditional variable allocated in the child. Broadcast the variable event on exec() and exit(). Since struct proc * sleep wait channel is overloaded for several unrelated events, I was unable to remove wakeups from the places where cv_broadcast() is added, except exec(). Reported and tested by: ganbold Suggested and reviewed by: jhb MFC after: 2 week
# 413628a7	29-Nov-2008	Bjoern A. Zeeb <bz@FreeBSD.org>	MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 50d6e424	18-Oct-2008	Kip Macy <kmacy@FreeBSD.org>	- Forward port flush of page table updates on context switch or userret - Forward port vfork XEN hack
# 8b4a2800	23-Jul-2008	Konstantin Belousov <kib@FreeBSD.org>	Do the pargs_hold() on the copy of the pointer to the p_args of the child process immediately after bulk bcopy() without dropping the process lock. Since process is not single-threaded when forking, dropping and reacquiring the lock allows an other thread to change the process title of the parent in between, and results in hold being done on the invalid pointer. The problem manifested itself as the double free of the old p_args. Reported by: kris Reviewed by: jhb MFC after: 1 week
# 7054ee4e	07-Jul-2008	Konstantin Belousov <kib@FreeBSD.org>	The kqueue_register() function assumes that it is called from the top of the syscall code and acquires various event subsystem locks as needed. The handling of the NOTE_TRACK for EVFILT_PROC is currently done by calling the kqueue_register() from filt_proc() filter, causing recursive entrance of the kqueue code. This results in the LORs and recursive acquisition of the locks. Implement the variant of the knote() function designed to only handle the fork() event. It mostly copies the knote() body, but also handles the NOTE_TRACK, removing the handling from the filt_proc(), where it causes problems described above. The function is called from the fork1() instead of knote(). When encountering NOTE_TRACK knote, it marks the knote as influx and drops the knlist and kqueue lock. In this context call to kqueue_register is safe from the problems. An error from the kqueue_register() is reported to the observer as NOTE_TRACKERR fflag. PR: 108201 Reviewed by: jhb, Pramod Srinivasan <pramod juniper net> (previous version) Discussed with: jmg Tested by: pho MFC after: 2 weeks
# 5d217f17	24-May-2008	John Birrell <jb@FreeBSD.org>	Add DTrace 'proc' provider probes using the Statically Defined Trace (sdt) mechanism.
# 69aa768a	20-Mar-2008	Konstantin Belousov <kib@FreeBSD.org>	Fix the leak of the vmspace on the fork when the process limits are exceeded. Pointy hat to: me MFC after: 3 days
# 0ac213ef	19-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Don't call the empty sched_newproc() function. sched_newproc() already existed as sched_fork() which is a non empty function in both schedulers.
# 6617724c	12-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	Remove kernel support for M:N threading. While the KSE project was quite successful in bringing threading to FreeBSD, the M:N approach taken by the kse library was never developed to its full potential. Backwards compatibility will be provided via libmap.conf for dynamically linked binaries and static binaries will be broken.
# 4b9322ae	14-Nov-2007	Julian Elischer <julian@FreeBSD.org>	When forking, the new thread deserves a name too. Don't just use the td_startcopy section as it is not the right thing to do in other cases (e.g. if starting a new thread from one that is already named).
# e01eafef	13-Nov-2007	Julian Elischer <julian@FreeBSD.org>	A bunch more files that should probably print out a thread name instead of a process name.
# 89b57fcf	05-Nov-2007	Konstantin Belousov <kib@FreeBSD.org>	Fix for the panic("vm_thread_new: kstack allocation failed") and silent NULL pointer dereference in the i386 and sparc64 pmap_pinit() when the kmem_alloc_nofault() failed to allocate address space. Both functions now return error instead of panicing or dereferencing NULL. As consequence, vmspace_exec() and vmspace_unshare() returns the errno int. struct vmspace arg was added to vm_forkproc() to avoid dealing with failed allocation when most of the fork1() job is already done. The kernel stack for the thread is now set up in the thread_alloc(), that itself may return NULL. Also, allocation of the first process thread is performed in the fork1() to properly deal with stack allocation failure. proc_linkup() is separated into proc_linkup() called from fork1(), and proc_linkup0(), that is used to set up the kernel process (was known as swapper). In collaboration with: Peter Holm Reviewed by: jhb
# f4bb4fc8	02-Nov-2007	Julian Elischer <julian@FreeBSD.org>	Completely remove the code for single threading the mainline fork code. Put in a little comment explaining why it went away. Re-enable it in the case there an exisiting process is just splitting off its address space and file descriptors. (I donpt think anything uses that code but it needs some sort of locking and this does the job. Reviewed by: Davidxu, alc, others MFC after: 3 days
# 30d239bc	24-Oct-2007	Robert Watson <rwatson@FreeBSD.org>	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updates as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# e9271f53	23-Oct-2007	Julian Elischer <julian@FreeBSD.org>	Take out the single-threading code in fork. After discussions with jeff, alc, (various Ironport people), david Xu, and mostly Alfred (who found the problem) it has been demonstrated that this is not needed for our implementations of threads and represents a real (as in we've seen it happen a lot) deadlock danger. Several points: Since forking multiple threads is not allowed, and posix states that any mutexes owned by othre threads wilol be owned in the child by phantom threads, and therads shouldn't ba accessing shared structures without protection, It can be proved that if this leads to the child process accessing inconsistent data, it's a programming error. The mode of thread_single() being used in fork() is the wrong one. It is using SINGLE_NO_EXIT when it should be using SINGLE_BOUNDARY. Even if this we used, System processes have no need to do it as they have no userland to get inconsistent. This commmit first fixes the above bugs to get tehm correct in CVS. then removes them with #ifdef. This is so that history contains the corrected version should it be needed in the future. This code may be needed if we implement the forkall() syscall from Solaris. It may be needed for other non-posix thread libraries at some time in the future, so let the code sit for a short while while I do some work on it anyhow. This removes a reproducible lockup in NFS. It may be argued that maybe doing a fork while holding a vnode lock may not be the best idea in th efirst place but it shouldn't cause a deadlock. The removal has been running under soak test for several days now. This removal should be seriously considered for 7.0 and RELENG_6. Note. There is code in the core-dumping code that may have a similar problem with coredumping threaded processes MFC After: 4 days
# 3745c395	20-Oct-2007	Julian Elischer <julian@FreeBSD.org>	Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. Thos makes way for us to add REAL kthread_create() and friends that actually make theads. it turns out that most of these calls actually end up being moved back to the thread version when it's added. but we need to make this cosmetic change first. I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.
# 54b0e65f	20-Sep-2007	Jeff Roberson <jeff@FreeBSD.org>	- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This changes the units from seconds to the value of 'ticks' when swapped in/out. ULE does not have a periodic timer that scans all threads in the system and as such maintaining a per-second counter is difficult. - Change computations requiring the unit in seconds to subtract ticks and divide by hz. This does make the wraparound condition hz times more frequent but this is still in the range of several months to years and the adverse effects are minimal. Approved by: re
# b61ce5b0	16-Sep-2007	Jeff Roberson <jeff@FreeBSD.org>	- Move all of the PS_ flags into either p_flag or td_flags. - p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or previously the sched_lock. These bugs have existed for some time. - Allow swapout to try each thread in a process individually and then swapin the whole process if any of these fail. This allows us to move most scheduler related swap flags into td_flags. - Keep ki_sflag for backwards compat but change all in source tools to use the new and more correct location of P_INMEM. Reported by: pho Reviewed by: attilio, kib Approved by: re (kensmith)
# 7251b786	16-Jun-2007	Robert Watson <rwatson@FreeBSD.org>	Rather than passing SUSER_RUID into priv_check_cred() to specify when a privilege is checked against the real uid rather than the effective uid, instead decide which uid to use in priv_check_cred() based on the privilege passed in. We use the real uid for PRIV_MAXFILES, PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of SUSER_RUID; there are now no flags defined for priv_check_cred(). Obtained from: TrustedBSD Project
# fe54587f	12-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	- Move some common code out of sched_fork_exit() and back into fork_exit().
# 32f9753c	11-Jun-2007	Robert Watson <rwatson@FreeBSD.org>	Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present. Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c. We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h. Reviewed by: csjp Obtained from: TrustedBSD Project
# 393a081d	10-Jun-2007	Attilio Rao <attilio@FreeBSD.org>	Optimize vmmeter locking. In particular: - Add an explicative table for locking of struct vmmeter members - Apply new rules for some of those members - Remove some unuseful comments Heavily reviewed by: alc, bde, jeff Approved by: jeff (mentor)
# faef5371	07-Jun-2007	Robert Watson <rwatson@FreeBSD.org>	Move per-process audit state from a pointer in the proc structure to embedded storage in struct ucred. This allows audit state to be cached with the thread, avoiding locking operations with each system call, and makes it available in asynchronous execution contexts, such as deep in the network stack or VFS. Reviewed by: csjp Approved by: re (kensmith) Obtained from: TrustedBSD Project
# 11bda9b8	04-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	Commit 6/14 of sched_lock decomposition. - Use thread_lock() rather than sched_lock for per-thread scheduling sychronization. - Use the per-process spinlock rather than the sched_lock for per-process scheduling synchronization. - Replace the tail-end of fork_exit() with a scheduler specific routine which can do the appropriate lock manipulations. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
# 1c4bcd05	31-May-2007	Jeff Roberson <jeff@FreeBSD.org>	- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. reads proceed without locks. - Aggregate exiting threads rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch() to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to it's exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits. Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
# 2feb50bf	31-May-2007	Attilio Rao <attilio@FreeBSD.org>	Revert VMCNT_* operations introduction. Probabilly, a general approach is not the better solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)
# eefc4972	18-May-2007	Robert Watson <rwatson@FreeBSD.org>	Remove unnecessary assignment. CID: 2227 Found with: Coverity Prevent(tm)
# 222d0195	18-May-2007	Jeff Roberson <jeff@FreeBSD.org>	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>
# 5e3f7694	04-Apr-2007	Robert Watson <rwatson@FreeBSD.org>	Replace custom file descriptor array sleep lock constructed using a mutex and flags with an sxlock. This leads to a significant and measurable performance improvement as a result of access to shared locking for frequent lookup operations, reduced general overhead, and reduced overhead in the event of contention. All of these are imported for threaded applications where simultaneous access to a shared file descriptor array occurs frequently. Kris has reported 2x-4x transaction rate improvements on 8-core MySQL benchmarks; smaller improvements can be expected for many workloads as a result of reduced overhead. - Generally eliminate the distinction between "fast" and regular acquisisition of the filedesc lock; the plan is that they will now all be fast. Change all locking instances to either shared or exclusive locks. - Correct a bug (pointed out by kib) in fdfree() where previously msleep() was called without the mutex held; sx_sleep() is now always called with the sxlock held exclusively. - Universally hold the struct file lock over changes to struct file, rather than the filedesc lock or no lock. Always update the f_ops field last. A further memory barrier is required here in the future (discussed with jhb). - Improve locking and reference management in linux_at(), which fails to properly acquire vnode references before using vnode pointers. Annotate improper use of vn_fullpath(), which will be replaced at a future date. In fcntl(), we conservatively acquire an exclusive lock, even though in some cases a shared lock may be sufficient, which should be revisited. The dropping of the filedesc lock in fdgrowtable() is no longer required as the sxlock can be held over the sleep operation; we should consider removing that (pointed out by attilio). Tested by: kris Discussed with: jhb, kris, attilio, jeff
# 0c14ff0e	04-Mar-2007	Robert Watson <rwatson@FreeBSD.org>	Remove 'MPSAFE' annotations from the comments above most system calls: all system calls now enter without Giant held, and then in some cases, acquire Giant explicitly. Remove a number of other MPSAFE annotations in the credential code and tweak one or two other adjacent comments.
# 84d37a46	27-Feb-2007	John Baldwin <jhb@FreeBSD.org>	Use pause() rather than tsleep() on explicit global dummy variables.
# 1ad9ee86	25-Feb-2007	Xin LI <delphij@FreeBSD.org>	Close race conditions between fork() and [sg]etpriority()'s PRIO_USER case, possibly also other places that deferences p_ucred. In the past, we insert a new process into the allproc list right after PID allocation, and release the allproc_lock sx. Because most content in new proc's structure is not yet initialized, this could lead to undefined result if we do not handle PRS_NEW with care. The problem with PRS_NEW state is that it does not provide fine grained information about how much initialization is done for a new process. By defination, after PRIO_USER setpriority(), all processes that belongs to given user should have their nice value set to the specified value. Therefore, if p_{start,end}copy section was done for a PRS_NEW process, we can not safely ignore it because p_nice is in this area. On the other hand, we should be careful on PRS_NEW processes because we do not allow non-root users to lower their nice values, and without a successful copy of the copy section, we can get stale values that is inherted from the uninitialized area of the process structure. This commit tries to close the race condition by grabbing proc mutex before we release allproc_lock xlock, and do copy as well as zero immediately after the allproc_lock xunlock. This guarantees that the new process would have its p_copy and p_zero sections, as well as user credential informaion initialized. In getpriority() case, instead of grabbing PROC_LOCK for a PRS_NEW process, we just skip the process in question, because it does not affect the final result of the call, as the p_nice value would be copied from its parent, and we will see it during allproc traverse. Other potential solutions are still under evaluation. Discussed with: davidxu, jhb, rwatson PR: kern/108071 MFC after: 2 weeks
# f0393f06	23-Jan-2007	Jeff Roberson <jeff@FreeBSD.org>	- Remove setrunqueue and replace it with direct calls to sched_add(). setrunqueue() was mostly empty. The few asserts and thread state setting were moved to the individual schedulers. sched_add() was chosen to displace it for naming consistency reasons. - Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be different on all three schedulers where it was only called in one place each. - Remove the long ifdef'd out remrunqueue code. - Remove the now redundant ts_state. Inspect the thread state directly. - Don't set TSF_* flags from kern_switch.c, we were only doing this to support a feature in one scheduler. - Change sched_choose() to return a thread rather than a td_sched. Also, rely on the schedulers to return the idlethread. This simplifies the logic in choosethread(). Aside from the run queue links kern_switch.c mostly does not care about the contents of td_sched. Discussed with: julian - Move the idle thread loop into the per scheduler area. ULE wants to do something different from the other schedulers. Suggested by: jhb Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.
# ad1e7d28	05-Dec-2006	Julian Elischer <julian@FreeBSD.org>	Threading cleanup.. part 2 of several. Make part of John Birrell's KSE patch permanent.. Specifically, remove: Any reference of the ksegrp structure. This feature was never fully utilised and made things overly complicated. All code in the scheduler that tried to make threaded programs fair to unthreaded programs. Libpthread processes will already do this to some extent and libthr processes already disable it. Also: Since this makes such a big change to the scheduler(s), take the opportunity to rename some structures and elements that had to be moved anyhow. This makes the code a lot more readable. The ULE scheduler compiles again but I have no idea if it works. The 4bsd scheduler still reqires a little cleaning and some functions that now do ALMOST nothing will go away, but I thought I'd do that as a separate commit. Tested by David Xu, and Dan Eischen using libthr and libpthread.
# acd3428b	06-Nov-2006	Robert Watson <rwatson@FreeBSD.org>	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# 8460a577	26-Oct-2006	John Birrell <jb@FreeBSD.org>	Make KSE a kernel option, turned on by default in all GENERIC kernel configs except sun4v (which doesn't process signals properly with KSE). Reviewed by: davidxu@
# aed55708	22-Oct-2006	Robert Watson <rwatson@FreeBSD.org>	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 993182e5	14-Aug-2006	Alexander Leidinger <netchild@FreeBSD.org>	- Change process_exec function handlers prototype to include struct image_params arg. - Change struct image_params to include struct sysentvec pointer and initialize it. - Change all consumers of process_exit/process_exec eventhandlers to new prototypes (includes splitting up into distinct exec/exit functions). - Add eventhandler to userret. Sponsored by: Google SoC 2006 Submitted by: rdivacky Parts suggested by: jhb (on hackers@)
# 38affe13	01-Aug-2006	John Baldwin <jhb@FreeBSD.org>	Don't lock each of the processes while looking for a pid. The allproc and proctree locks that we already hold provide sufficient protection.
# 2905ade2	27-Jun-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	- Use suser_cred(9) instead of checking cr_ruid directly. - For privileged processes safe two mutex operations. We may want to consider if this is good idea to use SUSER_ALLOWJAIL here, but for now I didn't wanted to change the original behaviour. Reviewed by: rwatson
# 795a11d0	15-Mar-2006	David Xu <davidxu@FreeBSD.org>	Fix a race between file operations and rfork(RFCFDG) by parking all other threads at user boundary, the race can crash kernel under stress testing. Reviewed by: jhb MFC after: 3 days
# eb2da9a5	08-Feb-2006	Poul-Henning Kamp <phk@FreeBSD.org>	Simplify system time accounting for profiling. Rename struct thread's td_sticks to td_pticks, we will need the other name for more appropriately named use shortly. Reduce it from uint64_t to u_int. Clear td_pticks whenever we enter the kernel instead of recording its value as reference for userret(). Use the absolute value of td->pticks in userret() and eliminate third argument.
# 2c9d9d39	06-Feb-2006	John Baldwin <jhb@FreeBSD.org>	We don't need the proc lock to check P_KTHREAD on curthread since it is only set before the kthread starts executing and is never cleared.
# ad20c8f3	05-Feb-2006	Wayne Salamon <wsalamon@FreeBSD.org>	Audit the args to rfork(), and the child PID for all fork system calls. Obtained from: TrustedBSD Project Approved by: rwatson (mentor)
# fcf7f27a	01-Feb-2006	Robert Watson <rwatson@FreeBSD.org>	Hook up audit to fork() and exit() events. These changes manage the audit state on processes, not auditing of these events. Much work by: wsalamon Obtained from: TrustedBSD Project
# 2c255e9d	13-Nov-2005	Robert Watson <rwatson@FreeBSD.org>	Moderate rewrite of kernel ktrace code to attempt to generally improve reliability when tracing fast-moving processes or writing traces to slow file systems by avoiding unbounded queueuing and dropped records. Record loss was previously possible when the global pool of records become depleted as a result of record generation outstripping record commit, which occurred quickly in many common situations. These changes partially restore the 4.x model of committing ktrace records at the point of trace generation (synchronous), but maintain the 5.x deferred record commit behavior (asynchronous) for situations where entering VFS and sleeping is not possible (i.e., in the scheduler). Records are now queued per-process as opposed to globally, with processes responsible for committing records from their own context as required. - Eliminate the ktrace worker thread and global record queue, as they are no longer used. Keep the global free record list, as records are still used. - Add a per-process record queue, which will hold any asynchronously generated records, such as from context switches. This replaces the global queue as the place to submit asynchronous records to. - When a record is committed asynchronously, simply queue it to the process. - When a record is committed synchronously, first drain any pending per-process records in order to maintain ordering as best we can. Currently ordering between competing threads is provided via a global ktrace_sx, but a per-process flag or lock may be desirable in the future. - When a process returns to user space following a system call, trap, signal delivery, etc, flush any pending records. - When a process exits, flush any pending records. - Assert on process tear-down that there are no pending records. - Slightly abstract the notion of being "in ktrace", which is used to prevent the recursive generation of records, as well as generating traces for ktrace events. Future work here might look at changing the set of events marked for synchronous and asynchronous record generation, re-balancing queue depth, timeliness of commit to disk, and so on. I.e., performing a drain every (n) records. MFC after: 1 month Discussed with: jhb Requested by: Marc Olzheim <marcolz at stack dot nl>
# 571dcd15	01-Jul-2005	Suleiman Souhlal <ssouhlal@FreeBSD.org>	Fix the recent panics/LORs/hangs created by my kqueue commit by: - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone
# 3d5c30f7	20-Apr-2005	David Xu <davidxu@FreeBSD.org>	Inherit signal mask for child process in fork1(), RELENG_4 and other *BSD have this behaviour, also it is required by POSIX. PR: kern/80130 Submitted by: Kostik Belousov konstantin.belousov at zoral dot com dot ua
# c6a37e84	04-Apr-2005	John Baldwin <jhb@FreeBSD.org>	Divorce critical sections from spinlocks. Critical sections as denoted by critical_enter() and critical_exit() are now solely a mechanism for deferring kernel preemptions. They no longer have any affect on interrupts. This means that standalone critical sections are now very cheap as they are simply unlocked integer increments and decrements for the common case. Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter() and spinlock_exit(). This KPI is responsible for providing whatever MD guarantees are needed to ensure that a thread holding a spin lock won't be preempted by any other code that will try to lock the same lock. For now all archs continue to block interrupts in a "spinlock section" as they did formerly in all critical sections. Note that I've also taken this opportunity to push a few things into MD code rather than MI. For example, critical_fork_exit() no longer exists. Instead, MD code ensures that new threads have the correct state when they are created. Also, we no longer try to fixup the idlethreads for APs in MI code. Instead, each arch sets the initial curthread and adjusts the state of the idle thread it borrows in order to perform the initial context switch. This change is largely a big NOP, but the cleaner separation it provides will allow for more efficient alternative locking schemes in other parts of the kernel (bare critical sections rather than per-CPU spin mutexes for per-CPU data for example). Reviewed by: grehan, cognet, arch@, others Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
# 9454b2d8	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for copyright notices, minor format tweaks as necessary
# c113083c	14-Dec-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Add new function fdunshare() which encapsulates the necessary light magic for ensuring that a process' filedesc is not shared with anybody. Use it in the two places which previously had private implmentations. This collects all fd_refcnt handling in kern_descrip.c
# 6004362e	26-Nov-2004	David Schultz <das@FreeBSD.org>	Don't include sys/user.h merely for its side-effect of recursively including other headers.
# 6db36923	20-Nov-2004	David Schultz <das@FreeBSD.org>	Remove local definitions of RANGEOF() and use __rangeof() instead. Also remove a few bogus casts.
# 8b059651	19-Nov-2004	David Schultz <das@FreeBSD.org>	Malloc p_stats instead of putting it in the U area. We should consider simply embedding it in struct proc. Reviewed by: arch@
# 124e4c3b	13-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST. Use this in all the places where sleeping with the lock held is not an issue. The distinction will become significant once we finalize the exact lock-type to use for this kind of case.
# 598b7ec8	07-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Use more intuitive pointer for fdinit() and fdcopy(). Change fdcopy() to take unlocked filedesc.
# 8ec21e3a	06-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Allow fdinit() to be called with a NULL fdp argument so we can use it when setting up init. Make fdinit() lock the fdp argument as needed.
# cda5aba4	06-Oct-2004	David Schultz <das@FreeBSD.org>	Back out rev 1.240; it is unnecessary. In particular, p1 == curthread, so _PHOLD(p1) will not have to block to swap in p1. Noticed by: jhb
# 299bc736	30-Sep-2004	David Schultz <das@FreeBSD.org>	Avoid calling _PHOLD(p1) with p2's lock held, since _PHOLD() may block to swap in p1. Instead, call _PHOLD earlier, at a point where the only lock held happens to be p1's.
# a3aa5592	13-Sep-2004	Julian Elischer <julian@FreeBSD.org>	make some of these conditions apply equally to both threading systems.
# ed062c8d	04-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Refactor a bunch of scheduler code to give basically the same behaviour but with slightly cleaned up interfaces. The KSE structure has become the same as the "per thread scheduler private data" structure. In order to not make the diffs too great one is #defined as the other at this time. The KSE (or td_sched) structure is now allocated per thread and has no allocation code of its own. Concurrency for a KSEGRP is now kept track of via a simple pair of counters rather than using KSE structures as tokens. Since the KSE structure is different in each scheduler, kern_switch.c is now included at the end of each scheduler. Nothing outside the scheduler knows the contents of the KSE (aka td_sched) structure. The fields in the ksegrp structure that are to do with the scheduler's queueing mechanisms are now moved to the kg_sched structure. (per ksegrp scheduler private data structure). In other words how the scheduler queues and keeps track of threads is no-one's business except the scheduler's. This should allow people to write experimental schedulers with completely different internal structuring. A scheduler call sched_set_concurrency(kg, N) has been added that notifies teh scheduler that no more than N threads from that ksegrp should be allowed to be on concurrently scheduled. This is also used to enforce 'fainess' at this time so that a ksegrp with 10000 threads can not swamp a the run queue and force out a process with 1 thread, since the current code will not set the concurrency above NCPU, and both schedulers will not allow more than that many onto the system run queue at a time. Each scheduler should eventualy develop their own methods to do this now that they are effectively separated. Rejig libthr's kernel interface to follow the same code paths as linkse for scope system threads. This has slightly hurt libthr's performance but I will work to recover as much of it as I can. Thread exit code has been cleaned up greatly. exit and exec code now transitions a process back to 'standard non-threaded mode' before taking the next step. Reviewed by: scottl, peter MFC after: 1 week
# 94ddc707	02-Sep-2004	Alan Cox <alc@FreeBSD.org>	Push Giant deep into vm_forkproc(), acquiring it only if the process has mapped System V shared memory segments (see shmfork_myhook()) or requires the allocation of an ldt (see vm_fault_wire()).
# 2630e4c9	31-Aug-2004	Julian Elischer <julian@FreeBSD.org>	Give setrunqueue() and sched_add() more of a clue as to where they are coming from and what is expected from them. MFC after: 2 days
# 99e9dcb8	31-Aug-2004	Julian Elischer <julian@FreeBSD.org>	Remove sched_free_thread() which was only used in diagnostics. It has outlived its usefulness and has started causing panics for people who turn on DIAGNOSTIC, in what is otherwise good code. MFC after: 2 days
# ad3b9257	15-Aug-2004	John-Mark Gurney <jmg@FreeBSD.org>	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowlege of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using it's filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to aquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# 732d9528	09-Aug-2004	Julian Elischer <julian@FreeBSD.org>	Increase the amount of data exported by KTR in the KTR_RUNQ setting. This extra data is needed to really follow what is going on in the threaded case.
# 0047b9a9	26-Jul-2004	Bosko Milekic <bmilekic@FreeBSD.org>	Move the schedlock owner state update following the context switch in fork_exit() to before anything else is done (but keep schedlock for the deadthread check). This means one less nasty bug if ever in the future whatever might have been called before the update played with schedlock or critical sections. Discussed with: tjr
# 66d5c640	26-Jul-2004	Colin Percival <cperciva@FreeBSD.org>	In revision 1.228, I accidentally broke the "total number of processes in the system" resource limit code: When checking if the caller has superuser privileges, we should be checking the real user, not the effective user. (In general, resource limiting is done based on the real user, in order to avoid resource-exhaustion-by-setuid-program attacks.) Now that a SUSER_RUID flag to suser_cred exists, use it here to return this code to its correct behaviour. Pointed out by: rwatson
# 55d44f79	18-Jul-2004	Julian Elischer <julian@FreeBSD.org>	When calling scheduler entrypoints for creating new threads and processes, specify "us" as the thread not the process/ksegrp/kse. You can always find the others from the thread but the converse is not true. Theorotically this would lead to runtime being allocated to the wrong entity in some cases though it is not clear how often this actually happenned. (would only affect threaded processes and would probably be pretty benign, but it WAS a bug..) Reviewed by: peter
# 49bddf0c	13-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	fix compilation.
# 65bba83f	13-Jul-2004	Colin Percival <cperciva@FreeBSD.org>	Replace "uid != 0" with "suser(td->td_ucred) != 0" when checking if we've hit the maximum number of processes. The last ten processes are reserved for the non-jailed superuser.
# 247aba24	26-Jun-2004	Marcel Moolenaar <marcel@FreeBSD.org>	Allocate TIDs in thread_init() and deallocate them in thread_fini(). The overhead of unconditionally allocating TIDs (and likewise, unconditionally deallocating them), is amortized across multiple thread creations by the way UMA makes it possible to have type-stable storage. Previously the cost was kept down by having threads created as part of a fork operation use the process' PID as the TID. While this had some nice properties, it also introduced complexity in the way TIDs were allocated. Most importantly, by using the type-stable storage that UMA gives us this was also unnecessary. This change affects how core dumps are created and in particular how the PRSTATUS notes are dumped. Since we don't have a thread with a TID equalling the PID, we now need a different way to preserve the old and previous behavior. We do this by having the given thread (i.e. the thread passed to the core dump code in td) dump it's state first and fill in pr_pid with the actual PID. All other threads will have pr_pid contain their TIDs. The upshot of all this is that the debugger will now likely select the right LWP (=TID) as the initial thread. Credits to: julian@ for spotting how we can utilize UMA. Thanks to: all who provided julian@ with test results.
# 7f8a436f	05-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# fdcac928	03-Apr-2004	Marcel Moolenaar <marcel@FreeBSD.org>	Assign thread IDs to kernel threads. The purpose of the thread ID (tid) is twofold: 1. When a 1:1 or M:N threaded process dumps core, we need to put the register state of each of its kernel threads in the core file. This can only be done by differentiating the pid field in the respective note. For this we need the tid. 2. When thread support is present for remote debugging the kernel with gdb(1), threads need to be identified by an integer due to limitations in the remote protocol. This requires having a tid. To minimize the impact of having thread IDs, threads that are created as part of a fork (i.e. the initial thread in a process) will inherit the process ID (i.e. tid=pid). Subsequent threads will have IDs larger than PID_MAX to avoid interference with the pid allocation algorithm. The assignment of tids is handled by thread_new_tid(). The thread ID allocation algorithm has been written with 3 assumptions in mind: 1. IDs need to be created as fast a possible, 2. Reuse of IDs may happen instantaneously, 3. Someone else will write a better algorithm.
# a5bdcb2a	13-Mar-2004	Peter Wemm <peter@FreeBSD.org>	Make the process_exit eventhandler run without Giant. Add Giant hooks in the two consumers that need it.. processes using AIO and netncp. Update docs. Say that process_exec is called with Giant, but not to depend on it. All our consumers can handle it without Giant.
# 8a412f31	13-Mar-2004	Peter Wemm <peter@FreeBSD.org>	Move the process_fork event out from under Giant. This one is easy, since there are no consumers in the tree. Document this.
# 37814395	13-Mar-2004	Peter Wemm <peter@FreeBSD.org>	Push Giant down a little further: - no longer serialize on Giant for thread_single*() and family in fork, exit and exec - thread_wait() is mpsafe, assert no Giant - reduce scope of Giant in exit to not cover thread_wait and just do vm_waitproc(). - assert that thread_single() family are not called with Giant - remove the DROP/PICKUP_GIANT macros from thread_single() family - assert that thread_suspend_check() s not called with Giant - remove manual drop_giant hack in thread_suspend_check since we know it isn't held. - remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family - mark kse_create() mpsafe
# 0235bf02	09-Mar-2004	John-Mark Gurney <jmg@FreeBSD.org>	make sure we had the filedesc lock when calling fdinit when RFCFDG is set on call to rfork. Submitted by: Brian Buchanan Semi-Reviewed by: rwatson
# a69d88af	07-Mar-2004	Peter Wemm <peter@FreeBSD.org>	Move a vref call outside of proc locks and Giant. By virtue of the fact that we (p1) are currently running, we hold a reference on p_textvp which means the vnode cannot go away. p2 cannot run yet (and hence cannot exit) so this should be safe to do at this point. As a bonus, it removes a block of under-Giant code that was there to support the vref.
# 5fadbfea	06-Mar-2004	Alan Cox <alc@FreeBSD.org>	Giant is not required for vm_thread_new_altkstack().
# 5750ee29	05-Mar-2004	Peter Wemm <peter@FreeBSD.org>	Add a missing part of jhb's previous commit. It looks like he had a patch chunk rejected that he missed. This would manifest as a lock assertion panic at boot (Giant not locked in kern_fork.c). Obtained from: jhb
# 5ce2f678	05-Mar-2004	John Baldwin <jhb@FreeBSD.org>	- Grab a share lock of the proctree lock while looking for a pid due to the process group and session dereferences. Also, check that p_pgrp and p_sesssion are NULL before dereferencing them. - Push down Giant in fork1(). Requested by: peter
# c8564ad4	04-Mar-2004	Bruce Evans <bde@FreeBSD.org>	Fixed some style bugs (mainly English usage errors in comments).
# 47934cef	25-Feb-2004	Don Lewis <truckman@FreeBSD.org>	Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms
# 4c3558aa	05-Feb-2004	John Baldwin <jhb@FreeBSD.org>	Always set a process' state to normal when it is fully constructed in fork1() rather than only doing it for the RFSTOPPED case and then having to fix it up in other places later on.
# 91d5354a	04-Feb-2004	John Baldwin <jhb@FreeBSD.org>	Locking for the per-process resource limits structure. - struct plimit includes a mutex to protect a reference count. The plimit structure is treated similarly to struct ucred in that is is always copy on write, so having a reference to a structure is sufficient to read from it without needing a further lock. - The proc lock protects the p_limit pointer and must be held while reading limits from a process to keep the limit structure from changing out from under you while reading from it. - Various global limits that are ints are not protected by a lock since int writes are atomic on all the archs we support and thus a lock wouldn't buy us anything. - All accesses to individual resource limits from a process are abstracted behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return either an rlimit, or the current or max individual limit of the specified resource from a process. - dosetrlimit() was renamed to kern_setrlimit() to match existing style of other similar syscall helper functions. - The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit() (it didn't used the stackgap when it should have) but uses lim_rlimit() and kern_setrlimit() instead. - The svr4 compat no longer uses the stackgap for resource limits calls, but uses lim_rlimit() and kern_setrlimit() instead. - The ibcs2 compat no longer uses the stackgap for resource limits. It also no longer uses the stackgap for accessing sysctl's for the ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result, ibcs2_sysconf() no longer needs Giant. - The p_rlimit macro no longer exists. Submitted by: mtm (mostly, I only did a few cleanups and catchups) Tested on: i386 Compiled on: alpha, amd64
# 6bea667f	25-Jan-2004	Robert Watson <rwatson@FreeBSD.org>	When aborting fork() due to a failure, if using MAC, make sure to clean up the p_label field. Obtained from: TrustedBSD Project Sponsored by: DARPA, McAfee Research
# 679365e7	21-Jan-2004	Robert Watson <rwatson@FreeBSD.org>	Reduce gratuitous includes: don't include jail.h if it's not needed. Presumably, at some point, you had to include jail.h if you included proc.h, but that is no longer required. Result of: self injury involving adding something to struct prison
# 5cded904	09-Jan-2004	Olivier Houchard <cognet@FreeBSD.org>	Prevent a race condition between fork1() and whatever changes the pgrp by setting the new process' p_pgrp again before inserting it in the p_pglist. Without it we can get the new process to be inserted in a different p_pglist than the one p2->p_pgrp points to, and this is not something we want to happen. This is not a fix, merely a bandaid, but it will work until someone finds a better way to do it. Discussed with: jhb (a long time ago)
# a30ec4b9	02-Jan-2004	David Xu <davidxu@FreeBSD.org>	Make sigaltstack as per-threaded, because per-process sigaltstack state is useless for threaded programs, multiple threads can not share same stack. The alternative signal stack is private for thread, no lock is needed, the orignal P_ALTSTACK is now moved into td_pflags and renamed to TDP_ALTSTACK. For single thread or Linux clone() based threaded program, there is no semantic changed, because those programs only have one kernel thread in every process. Reviewed by: deischen, dfr
# b3aeaf2e	29-Oct-2003	Bruce Evans <bde@FreeBSD.org>	Removed mostly-dead code for setting switchtime after the idle loop clobbers this variable. Long ago, when the idle loop wasn't in a process, it set switchtime.tv_sec to zero to indicate that the time needs to be read after the idle loop finishes. The special case for this isn't needed now that there is an idle process (for each CPU). The time is read in the normal way when the idle process is switched away from. The seconds component of the time is only zero for the first second after the uptime is set, and the mostly-dead code was only executed during this time. (This was slightly broken by using uptimes instead of times relative to the Epoch -- in the original version the seconds component of the time was only 0 for the first second after the Epoch.) In mi_switch(), moved the setting of switchticks to just after the first (and now only) setting of switchtime. This setting used to be delayed since a late setting was needed for the idle case and an early setting was not needed. Now the early setting is needed so that fork_exit() doesn't need to set either switchtime or switchticks. Removed now-completely-rotted comment attached to this. Most of the code described by the comment had already moved to sched_switch().
# 89674a9f	29-Oct-2003	Bruce Evans <bde@FreeBSD.org>	Removed sched_nest variable in sched_switch(). Context switches always begin with sched_lock held but not recursed, so this variable was always 0. Removed fixup of sched_lock.mtx_recurse after context switches in sched_switch(). Context switches always end with this variable in the same state that it began in, so there is no need to fix it up. Only sched_lock.mtx_lock really needs a fixup. Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion that sched_lock is owned and not recursed after it is fixed up. This assertion much match the one in mi_switch(), and if sched_lock were recursed then a non-null fixup of sched_lock.mtx_recurse would probably be needed again, unlike in sched_switch(), since fork_exit() doesn't return to its caller in the normal way.
# c06eb4e2	19-Aug-2003	Sam Leffler <sam@FreeBSD.org>	Change instances of callout_init that specify MPSAFE behaviour to use CALLOUT_MPSAFE instead of "1" for the second parameter. This does not change the behaviour; it just makes the intent more clear.
# 70fca427	15-Aug-2003	John Baldwin <jhb@FreeBSD.org>	- Various style fixes in both code and comments. - Update some stale comments. - Sort a couple of includes. - Only set 'newcpu' in updatepri() if we use it. - No functional changes. Obtained from: bde (via an old diff I got a long time ago)
# bfe65982	04-Aug-2003	John Baldwin <jhb@FreeBSD.org>	Adjust a comment to remove staleness and take slightly less implementation specific perspective.
# b083ea51	18-Jun-2003	Mike Silbersack <silby@FreeBSD.org>	Add a ratelimited message of the form "maxproc limit exceeded by uid %i, please see tuning(7) and login.conf(5)." Which will be triggered whenever a user hits his/her maxproc limit or the systemwide maxproc limit is reached. MFC after: 1 week
# 0e2a4d3a	14-Jun-2003	David Xu <davidxu@FreeBSD.org>	Rename P_THREADED to P_SA. P_SA means a process is using scheduler activations.
# 89f4fca2	14-Jun-2003	Alan Cox <alc@FreeBSD.org>	Move the _new_altkstack() and _dispose_altkstack() functions out of the various pmap implementations into the machine-independent vm. They were all identical.
# 677b542e	10-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# ad05d580	02-Jun-2003	Tor Egge <tegge@FreeBSD.org>	Add tracking of process leaders sharing a file descriptor table and allow a file descriptor table to be shared between multiple process leaders. PR: 50923
# 90af4afa	13-May-2003	John Baldwin <jhb@FreeBSD.org>	- Merge struct procsig with struct sigacts. - Move struct sigacts out of the u-area and malloc() it using the M_SUBPROC malloc bucket. - Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(), sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared(). - Remove the p_sigignore, p_sigacts, and p_sigcatch macros. - Add a mutex to struct sigacts that protects all the members of the struct. - Add sigacts locking. - Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now that sigacts is locked. - Several in-kernel functions such as psignal(), tdsignal(), trapsignal(), and thread_stopped() are now MP safe. Reviewed by: arch@ Approved by: re (rwatson)
# 7d447c95	01-May-2003	John Baldwin <jhb@FreeBSD.org>	Initialize and destroy the struct proc mutex in the proc zone's init and fini routines instead of in fork() and wait(). This has the nice side benefit that the proc lock of any process on the allproc list is always valid and sched_lock doesn't have to be used to test against PRS_NEW anymore.
# 87ccef7b	01-May-2003	Dag-Erling Smørgrav <des@FreeBSD.org>	Instead of recording the Unix time in a process when it starts, record the uptime. Where necessary, convert it back to Unix time by adding boottime to it. This fixes a potential problem in the accounting code, which would compute the elapsed time incorrectly if the Unix time was stepped during the lifetime of the process.
# 428eb576	30-Apr-2003	John Baldwin <jhb@FreeBSD.org>	Axe a stale comment.
# 51da11a2	29-Apr-2003	Mark Murray <markm@FreeBSD.org>	Fix some easy, global, lint warnings. In most cases, this means making some local variables static. In a couple of cases, this means removing an unused variable.
# 9752f794	22-Apr-2003	John Baldwin <jhb@FreeBSD.org>	- Move PS_PROFIL and its new cousin PS_STOPPROF back over to p_flag and rename them appropriately. Protect both flags with both the proc lock and the sched_lock. - Protect p_profthreads with the proc lock. - Remove Giant from profil(2).
# bb0e8070	17-Apr-2003	John Baldwin <jhb@FreeBSD.org>	- Push Giant down into the fork1() function a small bit. - Set p_acflag earlier while already hold the proc lock in fork1(). - Mark the realitexpire() callout MPSAFE for new processes. It was already marked safe for proc0 a long while ago.
# f6f230fe	10-Apr-2003	Jeff Roberson <jeff@FreeBSD.org>	- Adjust sched hooks for fork and exec to take processes as arguments instead of ksegs since they primarily operation on processes. - KSEs take ticks so pass the kse through sched_clock(). - Add a sched_class() routine that adjusts a ksegrp pri class. - Define a sched_fork_{kse,thread,ksegrp} and sched_exit_{kse,thread,ksegrp} that will be used to tell the scheduler about new instances of these structures within the same process. These will be used by THR and KSE. - Change sched_4bsd to reflect this API update.
# 060563ec	10-Apr-2003	Julian Elischer <julian@FreeBSD.org>	Move the _oncpu entry from the KSE to the thread. The entry in the KSE still exists but it's purpose will change a bit when we add the ability to lock a KSE to a cpu.
# 2c10d16a	31-Mar-2003	Jeff Roberson <jeff@FreeBSD.org>	- Borrow the KSE single threading code for exec and exit. We use the check if (p->p_numthreads > 1) and not a flag because action is only necessary if there are other threads. The rest of the system has no need to identify thr threaded processes. - In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED is not set.
# 75b8b3b2	24-Mar-2003	John Baldwin <jhb@FreeBSD.org>	Replace the at_fork, at_exec, and at_exit functions with the slightly more flexible process_fork, process_exec, and process_exit eventhandlers. This reduces code duplication and also means that I don't have to go duplicate the eventhandler locking three more times for each of at_fork, at_exec, and at_exit. Reviewed by: phk, jake, almost complete silence on arch@
# a5881ea5	13-Mar-2003	John Baldwin <jhb@FreeBSD.org>	- Cache a reference to the credential of the thread that starts a ktrace in struct proc as p_tracecred alongside the current cache of the vnode in p_tracep. This credential is then used for all later ktrace operations on this file rather than using the credential of the current thread at the time of each ktrace event. - Now that we have multiple ktrace-related items in struct proc that are pointers, rename p_tracep to p_tracevp to make it less ambiguous. Requested by: rwatson (1)
# ac2e4153	26-Feb-2003	Julian Elischer <julian@FreeBSD.org>	Change the process flags P_KSES to be P_THREADED. This is just a cosmetic change but I've been meaning to do it for about a year.
# 27e39ae4	19-Feb-2003	Tim J. Robbins <tjr@FreeBSD.org>	Remove the PL_SHAREMOD flag from struct plimit, which could have been used to share resource limits between rfork threads, but never was. Removing it makes resource limit locking much simpler -- only the current process can change the contents of the structure that p_limit points to.
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 5215b187	16-Feb-2003	Jeff Roberson <jeff@FreeBSD.org>	- Split the struct kse into struct upcall and struct kse. struct kse will soon be visible only to schedulers. This greatly simplifies much the KSE code. Submitted by: davidxu
# 218a01e0	15-Feb-2003	Tor Egge <tegge@FreeBSD.org>	Avoid file lock leakage when linuxthreads port or rfork is used: - Mark the process leader as having an advisory lock - Check if process leader is marked as having advisory lock when closing file - Check that file is still open after lock has been obtained - Don't allow file descriptor table sharing between processes with different leaders PR: 10265 Reviewed by: alfred
# 6f8132a8	31-Jan-2003	Julian Elischer <julian@FreeBSD.org>	Reversion of commit by Davidxu plus fixes since applied. I'm not convinced there is anything major wrong with the patch but them's the rules.. I am using my "David's mentor" hat to revert this as he's offline for a while.
# 0dbb100b	26-Jan-2003	David Xu <davidxu@FreeBSD.org>	Move UPCALL related data structure out of kse, introduce a new data structure called kse_upcall to manage UPCALL. All KSE binding and loaning code are gone. A thread owns an upcall can collect all completed syscall contexts in its ksegrp, turn itself into UPCALL mode, and takes those contexts back to userland. Any thread without upcall structure has to export their contexts and exit at user boundary. Any thread running in user mode owns an upcall structure, when it enters kernel, if the kse mailbox's current thread pointer is not NULL, then when the thread is blocked in kernel, a new UPCALL thread is created and the upcall structure is transfered to the new UPCALL thread. if the kse mailbox's current thread pointer is NULL, then when a thread is blocked in kernel, no UPCALL thread will be created. Each upcall always has an owner thread. Userland can remove an upcall by calling kse_exit, when all upcalls in ksegrp are removed, the group is atomatically shutdown. An upcall owner thread also exits when process is in exiting state. when an owner thread exits, the upcall it owns is also removed. KSE is a pure scheduler entity. it represents a virtual cpu. when a thread is running, it always has a KSE associated with it. scheduler is free to assign a KSE to thread according thread priority, if thread priority is changed, KSE can be moved from one thread to another. When a ksegrp is created, there is always N KSEs created in the group. the N is the number of physical cpu in the current system. This makes it is possible that even an userland UTS is single CPU safe, threads in kernel still can execute on different cpu in parallel. Userland calls kse_create to add more upcall structures into ksegrp to increase concurrent in userland itself, kernel is not restricted by number of upcalls userland provides. The code hasn't been tested under SMP by author due to lack of hardware. Reviewed by: julian
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# c522c1bf	31-Dec-2002	Alfred Perlstein <alfred@FreeBSD.org>	fdcopy() only needs a filedesc pointer.
# c7f1c11b	31-Dec-2002	Alfred Perlstein <alfred@FreeBSD.org>	Since fdshare() and fdinit() only operate on filedescs, make them take pointers to filedesc structures instead of threads. This makes it more clear that they do not do any voodoo with the thread/proc or anything other than the filedesc passed in or returned. Remove some XXX KSE's as this resolves the issue.
# 93a7aa79	27-Dec-2002	Julian Elischer <julian@FreeBSD.org>	Add code to ddb to allow backtracing an arbitrary thread. (show thread {address}) Remove the IDLE kse state and replace it with a change in the way threads sahre KSEs. Every KSE now has a thread, which is considered its "owner" however a KSE may also be lent to other threads in the same group to allow completion of in-kernel work. n this case the owner remains the same and the KSE will revert to the owner when the other work has been completed. All creations of upcalls etc. is now done from kse_reassign() which in turn is called from mi_switch or thread_exit(). This means that special code can be removed from msleep() and cv_wait(). kse_release() does not leave a KSE with no thread any more but converts the existing thread into teh KSE's owner, and sets it up for doing an upcall. It is just inhibitted from being scheduled until there is some reason to do an upcall. Remove all trace of the kse_idle queue since it is no-longer needed. "Idle" KSEs are now on the loanable queue.
# 696058c3	09-Dec-2002	Julian Elischer <julian@FreeBSD.org>	Unbreak the KSE code. Keep track of zobie threads using the Per-CPU storage during the context switch. Rearrange thread cleanups to avoid problems with Giant. Clean threads when freed or when recycled. Approved by: re (jhb)
# 2555374c	20-Nov-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce p_label, extensible security label storage for the MAC framework in struct proc. While the process label is actually stored in the struct ucred pointed to by p_ucred, there is a need for transient storage that may be used when asynchronous (deferred) updates need to be performed on the "real" label for locking reasons. Unlike other label storage, this label has no locking semantics, relying on policies to provide their own protection for the label contents, meaning that a policy leaf mutex may be used, avoiding lock order issues. This permits policies that act based on historical process behavior (such as audit policies, the MAC Framework port of LOMAC, etc) can update process properties even when many existing locks are held without violating the lock order. No currently committed policies implement use of this label storage. Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 293d2d22	18-Nov-2002	Robert Watson <rwatson@FreeBSD.org>	We leaked a process lock reference in the event an RFTHREAD process leader wasn't exiting during a fork; instead, do remember to release the lock avoiding lock order reversals and recursion panic. Reported by: "Joel M. Baldwin" <qumqats@outel.org>
# 62220473	18-Oct-2002	John Baldwin <jhb@FreeBSD.org>	Do not lock the process when calling fdfree() (this would have recursed on a non-recursive lock, the proc lock, before) since we don't need it to change p_fd.
# c6544064	14-Oct-2002	John Baldwin <jhb@FreeBSD.org>	- Add a new global mutex 'ppeers_lock' to protect the p_peers list of processes forked with RFTHREAD. - Use a goto to a label for common code when exiting from fork1() in case of an error. - Move the RFTHREAD linkage setup code later in fork since the ppeers_lock cannot be locked while holding a proc lock. Handle the race of a task leader exiting and killing its peers while a peer is forking a new child. In that case, go ahead and let the peer process proceed normally as the parent is about to kill it. However, the task leader may have already gone to sleep to wait for the peers to die, so the new child process may not receive a SIGKILL from the task leader. Rather than try to destruct the new child process, just go ahead and send it a SIGKILL directly and add it to the p_peers list. This ensures that the task leader will wait until both the peer process doing the fork() and the new child process have received their KILL signals and exited. Discussed with: truckman (earlier versions)
# b43179fb	11-Oct-2002	Jeff Roberson <jeff@FreeBSD.org>	- Create a new scheduler api that is defined in sys/sched.h - Begin moving scheduler specific functionality into sched_4bsd.c - Replace direct manipulation of scheduler data with hooks provided by the new api. - Remove KSE specific state modifications and single runq assumptions from kern_switch.c Reviewed by: -arch
# 48bfcddd	08-Oct-2002	Julian Elischer <julian@FreeBSD.org>	Round out the facilty for a 'bound' thread to loan out its KSE in specific situations. The owner thread must be blocked, and the borrower can not proceed back to user space with the borrowed KSE. The borrower will return the KSE on the next context switch where teh owner wants it back. This removes a lot of possible race conditions and deadlocks. It is consceivable that the borrower should inherit the priority of the owner too. that's another discussion and would be simple to do. Also, as part of this, the "preallocatd spare thread" is attached to the thread doing a syscall rather than the KSE. This removes the need to lock the scheduler when we want to access it, as it's now "at hand". DDB now shows a lot mor info for threaded proceses though it may need some optimisation to squeeze it all back into 80 chars again. (possible JKH project) Upcalls are now "bound" threads, but "KSE Lending" now means that other completing syscalls can be completed using that KSE before the upcall finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is one of the highest priorities in the KSE system.) The upcall when it happens will present all the completed syscalls to the KSE for selection.
# 316ec49a	02-Oct-2002	Scott Long <scottl@FreeBSD.org>	Some kernel threads try to do significant work, and the default KSTACK_PAGES doesn't give them enough stack to do much before blowing away the pcb. This adds MI and MD code to allow the allocation of an alternate kstack who's size can be speficied when calling kthread_create. Passing the value 0 prevents the alternate kstack from being created. Note that the ia64 MD code is missing for now, and PowerPC was only partially written due to the pmap.c being incomplete there. Though this patch does not modify anything to make use of the alternate kstack, acpi and usb are good candidates. Reviewed by: jake, peter, jhb
# 1d9c5696	01-Oct-2002	Juli Mallett <jmallett@FreeBSD.org>	Back our kernel support for reliable signal queues. Requested by: rwatson, phk, and many others
# 1226f694	30-Sep-2002	Juli Mallett <jmallett@FreeBSD.org>	First half of implementation of ksiginfo, signal queues, and such. This gets signals operating based on a TailQ, and is good enough to run X11, GNOME, and do job control. There are some intricate parts which could be more refined to match the sigset_t versions, but those require further evaluation of directions in which our signal system can expand and contract to fit our needs. After this has been in the tree for a while, I will make in kernel API changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo more robustly, such that we can actually pass information with our (queued) signals to the userland. That will also result in using a struct ksiginfo pointer, rather than a signal number, in a lot of kern_sig.c, to refer to an individual pending signal queue member, but right now there is no defined behaviour for such. CODAFS is unfinished in this regard because the logic is unclear in some places. Sponsored by: New Gold Technology Reviewed by: bde, tjr, jake [an older version, logic similar]
# c76e33b6	16-Sep-2002	Jonathan Mini <mini@FreeBSD.org>	Add kernel support needed for the KSE-aware libpthread: - Use ucontext_t's to store KSE thread state. - Synthesize state for the UTS upon each upcall, rather than saving and copying a trapframe. - Deliver signals to KSE-aware processes via upcall. - Rename kse mailbox structure fields to be more BSD-like. - Store the UTS's stack in struct proc in a stack_t. Reviewed by: bde, deischen, julian Approved by: -arch
# 4f0db5e0	15-Sep-2002	Julian Elischer <julian@FreeBSD.org>	Allocate KSEs and KSEGRPs separatly and remove them from the proc structure. next step is to allow > 1 to be allocated per process. This would give multi-processor threads. (when the rest of the infrastructure is in place) While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc are diverging more than they should.. corrective action needed soon.
# 71fad9fd	11-Sep-2002	Julian Elischer <julian@FreeBSD.org>	Completely redo thread states. Reviewed by: davidxu@freebsd.org
# 1faf202e	06-Sep-2002	Julian Elischer <julian@FreeBSD.org>	Use UMA as a complex object allocator. The process allocator now caches and hands out complete process structures including substructures . i.e. it get's the process structure with the first thread (and soon KSE) already allocated and attached, all in one hit. For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is unused and in the process cache. This saves having to allocate and attach it later, effectively bringing us (hopefully) close to the efficiency of pre-KSE systems where these were a single structure. Reviewed by: davidxu@freebsd.org, peter@freebsd.org
# 1279572a	05-Sep-2002	David Xu <davidxu@FreeBSD.org>	s/SGNL/SIG/ s/SNGL/SINGLE/ s/SNGLE/SINGLE/ Fix abbreviation for P_STOPPED_* etc flags, in original code they were inconsistent and difficult to distinguish between them. Approved by: julian (mentor)
# 49539972	22-Aug-2002	Julian Elischer <julian@FreeBSD.org>	slight cleanup of single-threading code for KSE processes
# df95311a	07-Aug-2002	Matthew N. Dodd <mdodd@FreeBSD.org>	Move code block added in 1.157 to a safer part of fork1(). Submitted by: jake
# 9ccba881	03-Aug-2002	Matthew N. Dodd <mdodd@FreeBSD.org>	Kernel modifications necessary to allow to follow fork()ed children. PR: bin/25587 (in part) MFC after: 3 weeks
# c4441bc7	29-Jul-2002	Mike Silbersack <silby@FreeBSD.org>	Update docs to reflect change in count of procs reserved for root from 1 to 10. PR: kern/40515 Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU> MFC after: 1 day
# 5c38b6db	28-Jul-2002	Don Lewis <truckman@FreeBSD.org>	Wire the sysctl output buffer before grabbing any locks to prevent SYSCTL_OUT() from blocking while locks are held. This should only be done when it would be inconvenient to make a temporary copy of the data and defer calling SYSCTL_OUT() until after the locks are released.
# 66d59314	14-Jul-2002	Julian Elischer <julian@FreeBSD.org>	part of a greater patch set.. 1/ don't need to set td_state to TDS_RUNNING in fork_return. it's already set in choosethread(). 2/ Set a child process state to "normal" as opposed to "new" when we allow it to be put on the run queue. Allows child to receive signals from the parent if the parent runs first and tries to immediatly signal he child. Submitted by: (part 2) Thomas Moestl <tmoestl@gmx.net>
# c3b98db0	13-Jul-2002	Julian Elischer <julian@FreeBSD.org>	Thinking about it I came to the conclusion that the KSE states were incorrectly formulated. The correct states should be: IDLE: On the idle KSE list for that KSEG RUNQ: Linked onto the system run queue. THREAD: Attached to a thread and slaved to whatever state the thread is in. This means that most places where we were adjusting kse state can go away as it is just moving around because the thread is.. The only places we need to adjust the KSE state is in transition to and from the idle and run queues. Reviewed by: jhb@freebsd.org
# aaa1c771	10-Jul-2002	Jonathan Mini <mini@FreeBSD.org>	Revert removal of cred_free_thread(): It is used to ensure that a thread's credentials are not improperly borrowed when the thread is not current in the kernel. Requested by: jhb, alfred
# e602ba25	29-Jun-2002	Julian Elischer <julian@FreeBSD.org>	Part 1 of KSE-III The ability to schedule multiple threads per process (one one cpu) by making ALL system calls optionally asynchronous. to come: ia64 and power-pc patches, patches for gdb, test program (in tools) Reviewed by: Almost everyone who counts (at various times, peter, jhb, matt, alfred, mini, bernd, and a cast of thousands) NOTE: this is still Beta code, and contains lots of debugging stuff. expect slight instability in signals..
# 01ad8a53	24-Jun-2002	Jonathan Mini <mini@FreeBSD.org>	Remove unused diagnostic function cread_free_thread(). Approved by: alfred
# af300f23	06-Jun-2002	John Baldwin <jhb@FreeBSD.org>	- Proper locking for p_tracep and p_traceflag. - Catch up to new ktrace API.
# 3fc755c1	02-May-2002	John Baldwin <jhb@FreeBSD.org>	- Protect randompid and nprocs with the allproc_lock. - Reorder fork1() to do malloc() and other blocking operations prior to acquiring the needed process locks. - The new process inherit's the credentials of curthread, not the credentials of the old process. - Document a really weird race that will come up with KSE allows multiple kernel threads per process.
# ba626c1d	16-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Lock proctree_lock instead of pgrpsess_lock.
# 9b28af91	09-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Whitespace changes to wrap long lines.
# 6008862b	04-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
# 2a60b9b9	02-Apr-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Fix leakage of p_pgrp lock.
# 182da820	01-Apr-2002	Matthew Dillon <dillon@FreeBSD.org>	Stage-2 commit of the critical*() code. This re-inlines cpu_critical_enter() and cpu_critical_exit() and moves associated critical prototypes into their own header file, <arch>/<arch>/critical.h, which is only included by the three MI source files that need it. Backout and re-apply improperly comitted syntactical cleanups made to files that were still under active development. Backout improperly comitted program structure changes that moved localized declarations to the top of two procedures. Partially re-apply one of the program structure changes to move 'mask' into an intermediate block rather then in three separate sub-blocks to make the code more readable. Re-integrate bug fixes that Jake made to the sparc64 code. Note: In general, developers should not gratuitously move declarations out of sub-blocks. They are where they are for reasons of structure, grouping, readability, compiler-localizability, and to avoid developer-introduced bugs similar to several found in recent years in the VFS and VM code. Reviewed by: jake
# 8899023f	27-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Make the reference counting of 'struct pargs' SMP safe. There is still some locations where the PROC lock should be held in order to prevent inconsistent views from outside (like the proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be fixed later. Submitted by: Jonathan Mini <mini@haikugeek.com>
# f22a4b62	27-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks with this flag. Remove the dup_list and dup_ok code from subr_witness. Now we just check for the flag instead of doing string compares. Also, switch the process lock, process group lock, and uma per cpu locks over to this interface. The original mechanism did not work well for uma because per cpu lock names are unique to each zone. Approved by: jhb
# d74ac681	26-Mar-2002	Matthew Dillon <dillon@FreeBSD.org>	Compromise for critical()/cpu_critical() recommit. Cleanup the interrupt disablement assumptions in kern_fork.c by adding another API call, cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it from MI to MD. Temporarily move cpu_critical() from <arch>/include/cpufunc.h to <arch>/<arch>/critical.c (stage-2 will clean this up). Implement interrupt deferral for i386 that allows interrupts to remain enabled inside critical sections. This also fixes an IPI interlock bug, and requires uses of icu_lock to be enclosed in a true interrupt disablement. This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized, and will move cpu_critical() into its own header file(s) + other things. This commit may break non-i386 architectures in trivial ways. This should be temporary. Reviewed by: core Approved by: core
# 565ab9395	20-Mar-2002	Benno Rice <benno@FreeBSD.org>	Add a change mirroring that made to kern/subr_trap.c and others. This makes kernel builds with DIAGNOSTIC work again. Apparently forgotten by: jhb Might want to be checked by: jhb
# c897b813	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Remove references to vm_zone.h and switch over to the new uma API. Also, remove maxsockets. If you look carefully you'll notice that the old zone allocator never honored this anyway.
# 181df8c9	26-Feb-2002	Matthew Dillon <dillon@FreeBSD.org>	revert last commit temporarily due to whining on the lists.
# f96ad4c2	26-Feb-2002	Matthew Dillon <dillon@FreeBSD.org>	STAGE-1 of 3 commit - allow (but do not require) interrupts to remain enabled in critical sections and streamline critical_enter() and critical_exit(). This commit allows an architecture to leave interrupts enabled inside critical sections if it so wishes. Architectures that do not wish to do this are not effected by this change. This commit implements the feature for the I386 architecture and provides a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For now you can turn the sysctl on and off at any time in order to test the architectural changes or track down bugs. This commit is just the first stage. Some areas of the code, specifically the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will be cleaned up in the STAGE-2 commit when the critical_() functions are moved entirely into MD files. The following changes have been made: critical_enter() and critical_exit() for I386 now simply increment and decrement curthread->td_critnest. They no longer disable hard interrupts. When critical_exit() decrements the counter to 0 it effectively calls a routine to deal with whatever interrupts were deferred during the time the code was operating in a critical section. Other architectures are unaffected. * fork_exit() has been conditionalized to remove MD assumptions for the new code. Old code will still use the old MD assumptions in regards to hard interrupt disablement. In STAGE-2 this will be turned into a subroutine call into MD code rather then hardcoded in MI code. The new code places the burden of entering the critical section in the trampoline code where it belongs. * I386: interrupts are now enabled while we are in a critical section. The interrupt vector code has been adjusted to deal with the fact. If it detects that we are in a critical section it currently defers the interrupt by adding the appropriate bit to an interrupt mask. * In order to accomplish the deferral, icu_lock is required. This is i386-specific. Thus icu_lock can only be obtained by mainline i386 code while interrupts are hard disabled. This change has been made. * Because interrupts may or may not be hard disabled during a context switch, cpu_switch() can no longer simply assume that PSL_I will be in a consistent state. Therefore, it now saves and restores eflags. * FAST INTERRUPT PROVISION. Fast interrupts are currently deferred. The intention is to eventually allow them to operate either while we are in a critical section or, if we are able to restrict the use of sched_lock, while we are not holding the sched_lock. * ICU and APIC vector assembly for I386 cleaned up. The ICU code has been cleaned up to match the APIC code in regards to format and macro availability. Additionally, the code has been adjusted to deal with deferred interrupts. * Deferred interrupts use a per-cpu boolean int_pending, and masks ipending, spending, and fpending. Being per-cpu variables it is not currently necessary to lock; bus cycles modifying them. Note that the same mechanism will enable preemption to be incorporated as a true software interrupt without having to further hack up the critical nesting code. * Note: the old critical_enter() code in kern/kern_switch.c is currently #ifdef to be compatible with both the old and new methodology. In STAGE-2 it will be moved entirely to MD code. Performance issues: One of the purposes of this commit is to enhance critical section performance, specifically to greatly reduce bus overhead to allow the critical section code to be used to protect per-cpu caches. These caches, such as Jeff's slab allocator work, can potentially operate very quickly making the effective savings of the new critical section code's performance very significant. The second purpose of this commit is to allow architectures to enable certain interrupts while in a critical section. Specifically, the intention is to eventually allow certain FAST interrupts to operate rather then defer. The third purpose of this commit is to begin to clean up the critical_enter()/critical_exit()/cpu_critical_enter()/ cpu_critical_exit() API which currently has serious cross pollution in MI code (in fork_exit() and ast() for example). The fourth purpose of this commit is to provide a framework that allows kernel-preempting software interrupts to be implemented cleanly. This is currently used for two forward interrupts in I386. Other architectures will have the choice of using this infrastructure or building the functionality directly into critical_enter()/ critical_exit(). Finally, this commit is designed to greatly improve the flexibility of various architectures to manage critical section handling, software interrupts, preemption, and other highly integrated architecture-specific details.
# f591779b	23-Feb-2002	Seigo Tanimura <tanimura@FreeBSD.org>	Lock struct pgrp, session and sigio. New locks are: - pgrpsess_lock which locks the whole pgrps and sessions, - pg_mtx which protects the pgrp members, and - s_mtx which protects the session members. Please refer to sys/proc.h for the coverage of these locks. Changes on the pgrp/session interface: - pgfind() needs the pgrpsess_lock held. - The caller of enterpgrp() is responsible to allocate a new pgrp and session. - Call enterthispgrp() in order to enter an existing pgrp. - pgsignal() requires a pgrp lock held. Reviewed by: jhb, alfred Tested on: cvsup.jp.FreeBSD.org (which is a quad-CPU machine running -current)
# 77c40664	22-Feb-2002	Julian Elischer <julian@FreeBSD.org>	Add some DIAGNOSTIC code. While in userland, keep the thread's ucred reference in a shadow field so that the usual place to store it is NULL. If DIAGNOSTIC is not set, the thread ucred is kept valid until the next kernel entry, at which time it is checked against the process cred and possibly corrected. Produces a BIG speedup in kernels with INVARIANTS set. (A previous commit corrected it for the non INVARIANTS case already) Reviewed by: dillon@freebsd.org
# 1cbb9c3b	22-Feb-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Convert p->p_runtime and PCPU(switchtime) to bintime format.
# cc6712ea	18-Feb-2002	Mike Silbersack <silby@FreeBSD.org>	A few misc forkbomb defenses: - Leave 10 processes for root-only use, the previous value of 1 was insufficient to run ps ax \| more. - Remove the printing of "proc: table full". When the table really is full, this would flood the screen/logs, making the problem tougher to deal with. - Force any process trying to fork beyond its user's maximum number of processes to sleep for .5 seconds before returning failure. This turns 2000 rampaging fork monsters into 2000 harmlessly snoozing fork monsters. Reviewed by: dillon, peter MFC after: 1 week
# 2eb927e2	16-Feb-2002	Julian Elischer <julian@FreeBSD.org>	If the credential on an incoming thread is correct, don't bother reaquiring it. In the same vein, don't bother dropping the thread cred when goinf ot userland. We are guaranteed to nned it when we come back, (which we are guaranteed to do). Reviewed by: jhb@freebsd.org, bde@freebsd.org (slightly different version)
# 2b8a08af	07-Feb-2002	Peter Wemm <peter@FreeBSD.org>	Fix a couple of style bugs introduced (or touched by) previous commit.
# 079b7bad	07-Feb-2002	Julian Elischer <julian@FreeBSD.org>	Pre-KSE/M3 commit. this is a low-functionality change that changes the kernel to access the main thread of a process via the linked list of threads rather than assuming that it is embedded in the process. It IS still embeded there but remove all teh code that assumes that in preparation for the next commit which will actually move it out. Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
# 426da3bc	13-Jan-2002	Alfred Perlstein <alfred@FreeBSD.org>	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file * fhold(struct file fp); / increments reference count on a file / struct file fhold_locked(struct file fp); / like fhold but expects file to locked / struct file ffind_hold(struct thread , int fd); / finds the struct file in thread, adds one reference and returns it unlocked / struct file ffind_lock(struct thread , int fd); / ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.
# fdba8cf4	08-Jan-2002	Mike Silbersack <silby@FreeBSD.org>	GC fast_vfork; it's not actually referenced anywhere. MFC after: 3 weeks
# 885ccc61	18-Dec-2001	John Baldwin <jhb@FreeBSD.org>	Return EINVAL if kernel only flags are passed to the rfork syscall rather than silently masking them.
# 7e1f6dfe	17-Dec-2001	John Baldwin <jhb@FreeBSD.org>	Modify the critical section API as follows: - The MD functions critical_enter/exit are renamed to start with a cpu_ prefix. - MI wrapper functions critical_enter/exit maintain a per-thread nesting count and a per-thread critical section saved state set when entering a critical section while at nesting level 0 and restored when exiting to nesting level 0. This moves the saved state out of spin mutexes so that interlocking spin mutexes works properly. - Most low-level MD code that used critical_enter/exit now use cpu_critical_enter/exit. MI code such as device drivers and spin mutexes use the MI wrappers. Note that since the MI wrappers store the state in the current thread, they do not have any return values or arguments. - mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is assigned to curthread->td_savecrit during fork_exit(). Tested on: i386, alpha
# 201b0ea8	14-Dec-2001	John Baldwin <jhb@FreeBSD.org>	Fix some nits in fork_exit() so it more properly duplicates the backend of mi_switch: - Set the oncpu value for the current thread. - Always set switchticks, not just in the SMP case. - Add a KTR entry for fork_exit that is the same as the "new proc" entry in mi_switch(). - Release sched_lock a bit later like we do with mi_switch().
# 8e2e767b	26-Oct-2001	John Baldwin <jhb@FreeBSD.org>	Add a per-thread ucred reference for syscalls and synchronous traps from userland. The per thread ucred reference is immutable and thus needs no locks to be read. However, until all the proc locking associated with writes to p_ucred are completed, it is still not safe to use the per-thread reference. Tested on: x86 (SMP), alpha, sparc64
# 79deba82	23-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	Fix ktrace enablement/disablement races that can result in a vnode ref count panic. Bug noticed by: ps Reviewed by: ps MFC after: 1 day
# bd78cece	11-Oct-2001	John Baldwin <jhb@FreeBSD.org>	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# eb30c1c0	09-Sep-2001	Peter Wemm <peter@FreeBSD.org>	Rip some well duplicated code out of cpu_wait() and cpu_exit() and move it to the MI area. KSE touched cpu_wait() which had the same change replicated five ways for each platform. Now it can just do it once. The only MD parts seemed to be dealing with fpu state cleanup and things like vm86 cleanup on x86. The rest was identical. XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional stub in place. Reviewed by: jake, tmm, dillon
# 116734c4	31-Aug-2001	Matthew Dillon <dillon@FreeBSD.org>	Pushdown Giant for acct(), kqueue(), kevent(), execve(), fork(), vfork(), rfork(), jail().
# 9b956e98	09-Jul-2001	Guido van Rooij <guido@FreeBSD.org>	Get rid of useless bcopy (the next statement was equivalent)
# 0cddd8f0	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
# aa3cefd0	29-Jun-2001	John Baldwin <jhb@FreeBSD.org>	Remove the p_spinlocks spin lock count that was obsoleted by the per-CPU spinlocks list.
# 8f7e4eb5	11-Jun-2001	Dag-Erling Smørgrav <des@FreeBSD.org>	Rename nextpid to lastpid and externalize it.
# b1fc0ec1	25-May-2001	Robert Watson <rwatson@FreeBSD.org>	o Merge contents of struct pcred into struct ucred. Specifically, add the real uid, saved uid, real gid, and saved gid to ucred, as well as the pcred->pc_uidinfo, which was associated with the real uid, only rename it to cr_ruidinfo so as not to conflict with cr_uidinfo, which corresponds to the effective uid. o Remove p_cred from struct proc; add p_ucred to struct proc, replacing original macro that pointed. p->p_ucred to p->p_cred->pc_ucred. o Universally update code so that it makes use of ucred instead of pcred, p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo, cr_{r,sv}{u,g}id instead of p_*, etc. o Remove pcred0 and its initialization from init_main.c; initialize cr_ruidinfo there. o Restruction many credential modification chunks to always crdup while we figure out locking and optimizations; generally speaking, this means moving to a structure like this: newcred = crdup(oldcred); ... p->p_ucred = newcred; crfree(oldcred); It's not race-free, but better than nothing. There are also races in sys_process.c, all inter-process authorization, fork, exec, and exit. o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid; remove comments indicating that the old arrangement was a problem. o Restructure exec1() a little to use newcred/oldcred arrangement, and use improved uid management primitives. o Clean up exit1() so as to do less work in credential cleanup due to pcred removal. o Clean up fork1() so as to do less work in credential cleanup and allocation. o Clean up ktrcanset() to take into account changes, and move to using suser_xxx() instead of performing a direct uid==0 comparision. o Improve commenting in various kern_prot.c credential modification calls to better document current behavior. In a couple of places, current behavior is a little questionable and we need to check POSIX.1 to make sure it's "right". More commenting work still remains to be done. o Update credential management calls, such as crfree(), to take into account new ruidinfo reference. o Modify or add the following uid and gid helper routines: change_euid() change_egid() change_ruid() change_rgid() change_svuid() change_svgid() In each case, the call now acts on a credential not a process, and as such no longer requires more complicated process locking/etc. They now assume the caller will do any necessary allocation of an exclusive credential reference. Each is commented to document its reference requirements. o CANSIGIO() is simplified to require only credentials, not processes and pcreds. o Remove lots of (p_pcred==NULL) checks. o Add an XXX to authorization code in nfs_lock.c, since it's questionable, and needs to be considered carefully. o Simplify posix4 authorization code to require only credentials, not processes and pcreds. Note that this authorization, as well as CANSIGIO(), needs to be updated to use the p_cansignal() and p_cansched() centralized authorization routines, as they currently do not take into account some desirable restrictions that are handled by the centralized routines, as well as being inconsistent with other similar authorization instances. o Update libkvm to take these changes into account. Obtained from: TrustedBSD Project Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
# 23955314	18-May-2001	Alfred Perlstein <alfred@FreeBSD.org>	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
# 3b26be6a	07-May-2001	Akinori MUSHA <knu@FreeBSD.org>	Properly copy the P_ALTSTACK flag in struct proc::p_flag to the child process on fork(2). It is the supposed behavior stated in the manpage of sigaction(2), and Solaris, NetBSD and FreeBSD 3-STABLE correctly do so. The previous fix against libc_r/uthread/uthread_fork.c fixed the problem only for the programs linked with libc_r, so back it out and fix fork(2) itself to help those not linked with libc_r as well. PR: kern/26705 Submitted by: KUROSAWA Takahiro <fwkg7679@mb.infoweb.ne.jp> Tested by: knu, GOTOU Yuuzou <gotoyuzo@notwork.org>, and some other people Not objected by: hackers MFC in: 3 days
# 1005a129	28-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Convert the allproc and proctree locks from lockmgr locks to sx locks.
# 19284646	28-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Rework the witness code to work with sx locks as well as mutexes. - Introduce lock classes and lock objects. Each lock class specifies a name and set of flags (or properties) shared by all locks of a given type. Currently there are three lock classes: spin mutexes, sleep mutexes, and sx locks. A lock object specifies properties of an additional lock along with a lock name and all of the extra stuff needed to make witness work with a given lock. This abstract lock stuff is defined in sys/lock.h. The lockmgr constants, types, and prototypes have been moved to sys/lockmgr.h. For temporary backwards compatability, sys/lock.h includes sys/lockmgr.h. - Replace proc->p_spinlocks with a per-CPU list, PCPU(spinlocks), of spin locks held. By making this per-cpu, we do not have to jump through magic hoops to deal with sched_lock changing ownership during context switches. - Replace proc->p_heldmtx, formerly a list of held sleep mutexes, with proc->p_sleeplocks, which is a list of held sleep locks including sleep mutexes and sx locks. - Add helper macros for logging lock events via the KTR_LOCK KTR logging level so that the log messages are consistent. - Add some new flags that can be passed to mtx_init(): - MTX_NOWITNESS - specifies that this lock should be ignored by witness. This is used for the mutex that blocks a sx lock for example. - MTX_QUIET - this is not new, but you can pass this to mtx_init() now and no events will be logged for this lock, so that one doesn't have to change all the individual mtx_lock/unlock() operations. - All lock objects maintain an initialized flag. Use this flag to export a mtx_initialized() macro that can be safely called from drivers. Also, we on longer walk the all_mtx list if MUTEX_DEBUG is defined as witness performs the corresponding checks using the initialized flag. - The lock order reversal messages have been improved to output slightly more accurate file and line numbers.
# 486b8ac0	27-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Don't explicitly zero p_intr_nesting_level and p_aioinfo in fork.
# 35a47246	27-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Use mtx_intr_enable() on sched_lock to ensure child processes always start with interrupts enabled rather than calling the no-longer MI function enable_intr(). This is bogus anyways and in theory shouldn't even be needed.
# 5db078a9	09-Mar-2001	John Baldwin <jhb@FreeBSD.org>	Fix mtx_legal2block. The only time that it is bad to block on a mutex is if we hold a spin mutex, since we can trivially get into deadlocks if we start switching out of processes that hold spinlocks. Checking to see if interrupts were disabled was a sort of cheap way of doing this since most of the time interrupts were only disabled when holding a spin lock. At least on the i386. To fix this properly, use a per-process counter p_spinlocks that counts the number of spin locks currently held, and instead of checking to see if interrupts are disabled in the witness code, check to see if we hold any spin locks. Since child processes always start up with the sched lock magically held in fork_exit(), we initialize p_spinlocks to 1 for child processes. Note that proc0 doesn't go through fork_exit(), so it starts with no spin locks held. Consulting from: cp
# 5641ae5d	06-Mar-2001	John Baldwin <jhb@FreeBSD.org>	- Don't hold the proc lock across VREF and the fd* functions to avoid lock order reversals. - Add some preliminary locking in the !RF_PROC case. - Protect p_estcpu with sched_lock.
# 57934cd3	06-Mar-2001	John Baldwin <jhb@FreeBSD.org>	- Lock the forklist with an sx lock. - Add proc locking to fork1(). Always lock the child procoess (new process) first when both processes need to be locked at the same time. - Remove unneeded spl()'s as the data they protected is now locked. - Ensure that the proctree is exclusively locked and the new process is locked when setting up the parent process pointer. - Lock the check for P_KTHREAD in p_flag in fork_exit().
# 5b270b2a	27-Feb-2001	Jake Burkholder <jake@FreeBSD.org>	Sigh. Try to get priorities sorted out. Don't bother trying to update native priority, it is diffcult to get right and likely to end up horribly wrong. Use an honestly wrong fixed value that seems to work; PUSER for user threads, and the interrupt priority for ithreads. Set it once when the process is created and forget about it. Suggested by: bde Pointy hat: me
# be15bfc0	26-Feb-2001	Jake Burkholder <jake@FreeBSD.org>	Initialize native priority to PRI_MAX. It was usually 0 which made a process's priority go through the roof when it released a (contested) mutex. Only set the native priority in mtx_lock if hasn't already been set. Reviewed by: jhb
# 9764c9d3	21-Feb-2001	John Baldwin <jhb@FreeBSD.org>	Quiet a warning with a uintptr_t cast. Noticed by: bde
# 91421ba2	20-Feb-2001	Robert Watson <rwatson@FreeBSD.org>	o Move per-process jail pointer (p->pr_prison) to inside of the subject credential structure, ucred (cr->cr_prison). o Allow jail inheritence to be a function of credential inheritence. o Abstract prison structure reference counting behind pr_hold() and pr_free(), invoked by the similarly named credential reference management functions, removing this code from per-ABI fork/exit code. o Modify various jail() functions to use struct ucred arguments instead of struct proc arguments. o Introduce jailed() function to determine if a credential is jailed, rather than directly checking pointers all over the place. o Convert PRISON_CHECK() macro to prison_check() function. o Move jail() function prototypes to jail.h. o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the flag in the process flags field itself. o Eliminate that "const" qualifier from suser/p_can/etc to reflect mutex use. Notes: o Some further cleanup of the linux/jail code is still required. o It's now possible to consider resolving some of the process vs credential based permission checking confusion in the socket code. o Mutex protection of struct prison is still not present, and is required to protect the reference count plus some fields in the structure. Reviewed by: freebsd-arch Obtained from: TrustedBSD Project
# 5813dc03	19-Feb-2001	John Baldwin <jhb@FreeBSD.org>	- Don't call clear_resched() in userret(), instead, clear the resched flag in mi_switch() just before calling cpu_switch() so that the first switch after a resched request will satisfy the request. - While I'm at it, move a few things into mi_switch() and out of cpu_switch(), specifically set the p_oncpu and p_lastcpu members of proc in mi_switch(), and handle the sched_lock state change across a context switch in mi_switch(). - Since cpu_switch() no longer handles the sched_lock state change, we have to setup an initial state for sched_lock in fork_exit() before we release it.
# d941d475	12-Feb-2001	Robert Watson <rwatson@FreeBSD.org>	o Export the nextpid variable via SYSCTL as kern.lastpid, decreasing by one the number of variables needed for top and other setgid kmem utilities that could only be accessed via /dev/kmem previously. Submitted by: Thomas Moestl <tmoestl@gmx.net> Reviewed by: freebsd-audit
# 9ed346ba	08-Feb-2001	Bosko Milekic <bmilekic@FreeBSD.org>	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# 8865286b	26-Jan-2001	John Baldwin <jhb@FreeBSD.org>	Fix fork_exit() to take a pointer to a function that returns void as its first argument rather than a function that returns a void *. Noticed by: jake
# 2a36ec35	24-Jan-2001	John Baldwin <jhb@FreeBSD.org>	- Change fork_exit() to take a pointer to a trapframe as its 3rd argument instead of a trapframe directly. (Requested by bde.) - Convert the alpha switch_trampoline to call fork_exit() and use the MI fork_return() instead of child_return(). - Axe child_return().
# a7b124c3	24-Jan-2001	John Baldwin <jhb@FreeBSD.org>	- Catch up to proc flag changes. - Add new fork_exit() and fork_return() MI C functions.
# 5d22597f	23-Jan-2001	Hajimu UMEMOTO <ume@FreeBSD.org>	Add mibs to hold the number of forks since boot. New mibs are: vm.stats.vm.v_forks vm.stats.vm.v_vforks vm.stats.vm.v_rforks vm.stats.vm.v_kthreads vm.stats.vm.v_forkpages vm.stats.vm.v_vforkpages vm.stats.vm.v_rforkpages vm.stats.vm.v_kthreadpages Submitted by: Paul Herman <pherman@frenchfries.net> Reviewed by: alfred
# a448b62a	21-Jan-2001	Jake Burkholder <jake@FreeBSD.org>	Make intr_nesting_level per-process, rather than per-cpu. Setup interrupt threads to run with it always >= 1, so that malloc can detect M_WAITOK from "interrupt" context. This is also necessary in order to context switch from sched_ithd() directly. Reviewed By: peter
# 98f03f90	23-Dec-2000	Jake Burkholder <jake@FreeBSD.org>	Protect proc.p_pptr and proc.p_children/p_sibling with the proctree_lock. linprocfs not locked pending response from informal maintainer. Reviewed by: jhb, -smp@
# c0c25570	12-Dec-2000	Jake Burkholder <jake@FreeBSD.org>	- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead of explicit calls to lockmgr. Also provides macros for the flags pased to specify shared, exclusive or release which map to the lockmgr flags. This is so that the use of lockmgr can be easily replaced with optimized reader-writer locks. - Add some locking that I missed the first time.
# 8dd431fc	04-Dec-2000	Jake Burkholder <jake@FreeBSD.org>	Whitespace. Fix indentation, align comments.
# 4971f62a	02-Dec-2000	John Baldwin <jhb@FreeBSD.org>	- Add a mutex to the proc structure p_mtx that will be used to lock accesses to each individual proc. - Initialize the lock during fork1(), and destroy it in wait1().
# 86360fee	01-Dec-2000	Jake Burkholder <jake@FreeBSD.org>	Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup from struct proc, which are now unused (p_nthread already was). Remove process flag P_KTHREADP which was untested and only set in vfs_aio.c (it should use kthread_create). Move the yield system call to kern_synch.c as kern_threads.c has been removed completely. moral support from: alfred, jhb
# 1512b5d6	30-Nov-2000	Jake Burkholder <jake@FreeBSD.org>	Use an mp-safe callout for endtsleep.
# 4f559836	27-Nov-2000	Jake Burkholder <jake@FreeBSD.org>	Use callout_reset instead of timeout(9). Most callouts are statically allocated, 2 have been added to struct proc for setitimer and sleep. Reviewed by: jhb, jlemon
# 553629eb	22-Nov-2000	Jake Burkholder <jake@FreeBSD.org>	Protect the following with a lockmgr lock: allproc zombproc pidhashtbl proc.p_list proc.p_hash nextpid Reviewed by: jhb Obtained from: BSD/OS and netbsd
# 35e0e5b3	20-Oct-2000	John Baldwin <jhb@FreeBSD.org>	Catch up to moving headers: - machine/ipl.h -> sys/ipl.h - machine/mutex.h -> sys/mutex.h
# 42fd51ce	14-Sep-2000	Don Lewis <truckman@FreeBSD.org>	Enforce process limit policy in one place to keep proccnt from diverging from reality.
# 0384fff8	06-Sep-2000	Jason Evans <jasone@FreeBSD.org>	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
# f535380c	05-Sep-2000	Don Lewis <truckman@FreeBSD.org>	Remove uidinfo hash table lookup and maintenance out of chgproccnt() and chgsbsize(), which are called rather frequently and may be called from an interrupt context in the case of chgsbsize(). Instead, do the hash table lookup and maintenance when credentials are changed, which is a lot less frequent. Add pointers to the uidinfo structures to the ucred and pcred structures for fast access. Pass a pointer to the credential to chgproccnt() and chgsbsize() instead of passing the uid. Add a reference count to the uidinfo structure and use it to decide when to free the structure rather than freeing the structure when the resource consumption drops to zero. Move the resource tracking code from kern_proc.c to kern_resource.c. Move some duplicate code sequences in kern_prot.c to separate helper functions. Change KASSERTs in this code to unconditional tests and calls to panic().
# 77978ab8	04-Jul-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Previous commit changing SYSCTL_HANDLER_ARGS violated KNF. Pointed out by: bde
# 82d9ae4e	03-Jul-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Style police catches up with rev 1.26 of src/sys/sys/sysctl.h: Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our sources: -sysctl_vm_zone SYSCTL_HANDLER_ARGS +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
# 47fdd692	26-Jun-2000	Neil Blakey-Milner <nbm@FreeBSD.org>	Add sysctl descriptions to a few sysctls. Simply "documentation". PR: kern/8015 Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>
# c6362551	22-Jun-2000	Alfred Perlstein <alfred@FreeBSD.org>	fix races in the uidinfo subsystem, several problems existed: 1) while allocating a uidinfo struct malloc is called with M_WAITOK, it's possible that while asleep another process by the same user could have woken up earlier and inserted an entry into the uid hash table. Having redundant entries causes inconsistancies that we can't handle. fix: do a non-waiting malloc, and if that fails then do a blocking malloc, after waking up check that no one else has inserted an entry for us already. 2) Because many checks for sbsize were done as "test then set" in a non atomic manner it was possible to exceed the limits put up via races. fix: instead of querying the count then setting, we just attempt to set the count and leave it up to the function to return success or failure. 3) The uidinfo code was inlining and repeating, lookups and insertions and deletions needed to be in their own functions for clarity. Reviewed by: green
# e3975643	25-May-2000	Jake Burkholder <jake@FreeBSD.org>	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 740a1973	23-May-2000	Jake Burkholder <jake@FreeBSD.org>	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# cb679c38	16-Apr-2000	Jonathan Lemon <jlemon@FreeBSD.org>	Introduce kqueue() and kevent(), a kernel event notification facility.
# bb6a234e	06-Dec-1999	Peter Wemm <peter@FreeBSD.org>	Put on my asbestos underwear and commit the patch that I posted to -arch some time ago that changes kern.randompid from a boolean to a randomness range for the next pid assigment. Too high causes a lot of extra work to scan for free pids, and too low merely wastes randomness entropy. It's still possible to select a completely random range by using PID_MAX (100k) or -1 as a shortcut to mean "the whole range". Also, don't waste randomness when doing a wraparound.
# 91c28bfd	05-Dec-1999	Luoqi Chen <luoqi@FreeBSD.org>	User ldt sharing.
# ee3fd601	28-Nov-1999	Dan Moschuk <dan@FreeBSD.org>	Introduce OpenBSD-like Random PIDs. Controlled by a sysctl knob (kern.randompid), which is currently defaulted off. Use ARC4 (RC4) for our random number generation, which will not get me executed for violating crypto laws; a Good Thing(tm). Reviewed and Approved by: bde, imp
# 93efcae8	19-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	The at_exit and at_fork functions currently use a 'roll your own' linked list to store the callbak routines. The patch converts the lists to queue(3) TAILQs, making the code slightly clearer and ensuring that callbacks are executed in FIFO order. Man page also updated as necesary. (discontinued use of M_TEMP malloc type while here anyway /phk) Submitted by: Jake Burkholder jake@checker.org PR: 14912
# b9df5231	16-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce commandline caching in the kernel. This fixes some nasty procfs problems for SMP, makes ps(1) run much faster, and makes ps(1) even less dependent on /proc which will aid chroot and jails alike. To disable this facility and revert to previous behaviour: sysctl -w kern.ps_arg_cache_limit=0 For full details see the current@FreeBSD.org mail-archives.
# 2e3c8fcb	16-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	This is a partial commit of the patch from PR 14914: Alot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. This batch of changes compile to the same object files. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# d1f088da	11-Oct-1999	Peter Wemm <peter@FreeBSD.org>	Trim unused options (or #ifdef for undoc options). Submitted by: phk
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# d4da2dba	21-Jul-1999	Alan Cox <alc@FreeBSD.org>	Fix the following problem: When creating new processes (or performing exec), the new page directory is initialized too early. The kernel might grow before p_vmspace is initialized for the new process. Since pmap_growkernel doesn't yet know about the new page directory, it isn't updated, and subsequent use causes a failure. The fix is (1) to clear p_vmspace early, to stop pmap_growkernel from stomping on memory, and (2) to defer part of the initialization of new page directories until p_vmspace is initialized. PR: kern/12378 Submitted by: tegge Reviewed by: dfr
# 1943af61	03-Jul-1999	Peter Wemm <peter@FreeBSD.org>	Stop rfork(0) from panicing. (oops!!) Submitted by: Peter Holm <peter@holm.cc>
# df8abd0b	30-Jun-1999	Peter Wemm <peter@FreeBSD.org>	Slight tweak to fork1() calling conventions. Add a third argument so the caller can easily find the child proc struct. fork(), rfork() etc syscalls set p->p_retval[] themselves. Simplify the SYSINIT_KT() code and other kernel thread creators to not need to use pfind() to find the child based on the pid. While here, partly tidy up some of the fork1() code for RF_SIGSHARE etc.
# 75c13541	28-Apr-1999	Poul-Henning Kamp <phk@FreeBSD.org>	This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no privisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/
# 5206bca1	27-Apr-1999	Luoqi Chen <luoqi@FreeBSD.org>	Enable vmspace sharing on SMP. Major changes are, - %fs register is added to trapframe and saved/restored upon kernel entry/exit. - Per-cpu pages are no longer mapped at the same virtual address. - Each cpu now has a separate gdt selector table. A new segment selector is added to point to per-cpu pages, per-cpu global variables are now accessed through this new selector (%fs). The selectors in gdt table are rearranged for cache line optimization. - fask_vfork is now on as default for both UP and SMP. - Some aio code cleanup. Reviewed by: Alan Cox <alc@cs.rice.edu> John Dyson <dyson@iquest.net> Julian Elischer <julian@whistel.com> Bruce Evans <bde@zeta.org.au> David Greenman <dg@root.com>
# 0dd9741e	24-Apr-1999	Dmitrij Tejblum <dt@FreeBSD.org>	Use pointer arithmetic to do pointer arithmetic.
# e9189611	17-Apr-1999	Peter Wemm <peter@FreeBSD.org>	Well folks, this is it - The second stage of the removal for build support for LKM's..
# af8ad83e	05-Apr-1999	Peter Wemm <peter@FreeBSD.org>	Use the reference-counted PHOLD()/PRELE() rather than P_NOSWAP.
# 4ac9ae70	01-Mar-1999	Julian Elischer <julian@FreeBSD.org>	Fix thread/process tracking and differentiation for Linux threads emulation. Submitted by: Richard Seaman, Jr." <dick@tar.com> Also clean some compiler warnings in surrounding code.
# 88c5ea45	25-Jan-1999	Julian Elischer <julian@FreeBSD.org>	Enable Linux threads support by default. This takes the conditionals out of the code that has been tested by various people for a while. ps and friends (libkvm) will need a recompile as some proc structure changes are made. Submitted by: "Richard Seaman, Jr." <dick@tar.com>
# dc9c271a	07-Jan-1999	Julian Elischer <julian@FreeBSD.org>	Changes to the LINUX_THREADS support to only allocate extra memory for shared signal handling when there is shared signal handling being used. This removes the main objection to making the shared signal handling a standard ability in rfork() and friends and 'unconditionalising' this code. (i.e. the allocation of an extra 328 bytes per process). Signal handling information remains in the U area until such a time as it's reference count would be incremented to > 1. At that point a new struct is malloc'd and maintained in KVM so that it can be shared between the processes (threads) using it. A function to check the reference count and move the struct back to the U area when it drops back to 1 is also supplied. Signal information is therefore now swapable for all processes that are not sharing that information with other processes. THis should addres the concerns raised by Garrett and others. Submitted by: "Richard Seaman, Jr." <dick@tar.com>
# 6626c604	18-Dec-1998	Julian Elischer <julian@FreeBSD.org>	Reviewed by: Luoqi Chen, Jordan Hubbard Submitted by: "Richard Seaman, Jr." <lists@tar.com> Obtained from: linux :-) Code to allow Linux Threads to run under FreeBSD. By default not enabled This code is dependent on the conditional COMPAT_LINUX_THREADS (suggested by Garret) This is not yet a 'real' option but will be within some number of hours.
# 643a8daa	09-Nov-1998	Don Lewis <truckman@FreeBSD.org>	If the session leader dies, s_leader is set to NULL and getsid() may dereference a NULL pointer, causing a panic. Instead of following s_leader to find the session id, store it in the session structure. Jukka found the following info: BTW - I just found what I have been looking for. Std 1003.1 Part 1: SYSTEM API [C LANGUAGE] section 2.2.2.80 states quite explicitly... Session lifetime: The period between when a session is created and the end of lifetime of all the process groups that remain as members of the session. So, this quite clearly tells that while there is any single process in any process group which is a member of the session, the session remains as an independent entity. Reviewed by: peter Submitted by: "Jukka A. Ukkonen" <jau@jau.tmt.tele.fi>
# 2d8acc0f	22-Jan-1998	John Dyson <dyson@FreeBSD.org>	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
# 74b2192a	11-Dec-1997	John Dyson <dyson@FreeBSD.org>	We have had support for running the kernel daemons as threads for quite a while, but forgot to do so. For now, this code supports most daemons running as kernel threads in UP kernels, and as full processes in SMP. We will soon be able to run them as threads in SMP, but not yet.
# be67169a	20-Nov-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused includes. Staticized. Avoid passing a `retval' to fork1(). Fixed some style bugs.
# cb226aaa	06-Nov-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Move the "retval" (3rd) parameter from all syscall functions and put it in struct proc instead. This fixes a boatload of compiler warning, and removes a lot of cruft from the sources. I have not removed the /ARGSUSED/, they will require some looking at. libkvm, ps and other userland struct proc frobbing programs will need recompiled.
# eb776aea	25-Aug-1997	Bruce Evans <bde@FreeBSD.org>	Fixed some gratuitous ANSIisms.
# e384a980	22-Aug-1997	Peter Wemm <peter@FreeBSD.org>	Print a warning if an unsupported (under SMP) shared address space fork is attempted rather than just failing with an errno.
# 2244ea07	05-Jul-1997	John Dyson <dyson@FreeBSD.org>	This is an upgrade so that the kernel supports the AIO calls from POSIX.4. Additionally, there is some initial code that supports LIO. This code supports AIO/LIO for all types of file descriptors, with few if any restrictions. There will be a followup very soon that will support significantly more efficient operation for VCHR type files (raw.) This code is also dependent on some kernel features that don't work under SMP yet. After I commit the changes to the kernel to support proper address space sharing on SMP, this code will also work under SMP.
# b3196e4b	22-Jun-1997	Peter Wemm <peter@FreeBSD.org>	Preliminary support for per-cpu data pages. This eliminates a lot of #ifdef SMP type code. Things like _curproc reside in a data page that is unique on each cpu, eliminating the expensive macros like: #define curproc (SMPcurproc[cpunumber()]) There are some unresolved bootstrap and address space sharing issues at present, but Steve is waiting on this for other work. There is still some strictly temporary code present that isn't exactly pretty. This is part of a larger change that has run into some bumps, this part is standalone so it should be safe. The temporary code goes away when the full idle cpu support is finished. Reviewed by: fsmp, dyson
# 2c1011f7	15-Jun-1997	John Dyson <dyson@FreeBSD.org>	Modifications to existing files to support the initial AIO/LIO and kernel based threading support.
# 8f453f3e	28-May-1997	Peter Wemm <peter@FreeBSD.org>	Don't need "opt_smp.h" on these files
# c76e95c3	26-Apr-1997	Peter Wemm <peter@FreeBSD.org>	Create sysctl kern.fast_vfork, on for uniprocessor by default, off for SMP.
# c32ba248	26-Apr-1997	Peter Wemm <peter@FreeBSD.org>	Disable RFMEM in vfork for smp case.. It doesn't seem to work too well yet..
# 0eaa559c	23-Apr-1997	Andrey A. Chernov <ache@FreeBSD.org>	Restore memory space separation (RFMEM) for vfork() after shell imgact memory clobbering fixed
# 6b707440	22-Apr-1997	John Dyson <dyson@FreeBSD.org>	Give up on the fast vfork() for a while.
# c58494e4	20-Apr-1997	John Dyson <dyson@FreeBSD.org>	Re-institute the efficent version of vfork. It appears to make a difference of approx 3mins in make world on my P6!!! This means that vfork now has full address space sharing, so beware with sloppy vfork programming. Also, you really do need to apply the previously committed popen fix in libc.
# d7f7f3f2	13-Apr-1997	John Dyson <dyson@FreeBSD.org>	Make a problem that I cannot reproduce go away for now. This commit is to decrease the inconvienience of other developers until I can really fix the code. Reviewed by: Donald J. Maddox <dmaddox@scsn.net>
# 5856e12e	12-Apr-1997	John Dyson <dyson@FreeBSD.org>	Fully implement vfork. Vfork is now much much faster than even our fork. (On my machine, fork is about 240usecs, vfork is 78usecs.) Implement rfork(!RFPROC !RFMEM), which allows a thread to divorce its memory from the other threads of a group. Implement rfork(!RFPROC RFCFDG), which closes all file descriptors, eliminating possible existing shares with other threads/processes. Implement rfork(!RFPROC RFFDG), which divorces the file descriptors for a thread from the rest of the group. Fix the case where a thread does an exec. It is almost nonsense for a thread to modify the other threads address space by an exec, so we now automatically divorce the address space before modifying it.
# 263a3392	07-Apr-1997	Peter Wemm <peter@FreeBSD.org>	Remove explicit zero of p_vmspace on creation, it's now in the startzero section of the proc struct.
# a2a1c95c	07-Apr-1997	Peter Wemm <peter@FreeBSD.org>	The biggie: Get rid of the UPAGES from the top of the per-process address space. (!) Have each process use the kernel stack and pcb in the kvm space. Since the stacks are at a different address, we cannot copy the stack at fork() and allow the child to return up through the function call tree to return to user mode - create a new execution context and have the new process begin executing from cpu_switch() and go to user mode directly. In theory this should speed up fork a bit. Context switch the tss_esp0 pointer in the common tss. This is a lot simpler since than swithching the gdt[GPROC0_SEL].sd.sd_base pointer to each process's tss since the esp0 pointer is a 32 bit pointer, and the sd_base setting is split into three different bit sections at non-aligned boundaries and requires a lot of twiddling to reset. The 8K of memory at the top of the process space is now empty, and unmapped (and unmappable, it's higher than VM_MAXUSER_ADDRESS). Simplity the pmap code to manage process contexts, we no longer have to double map the UPAGES, this simplifies and should measuably speed up fork(). The following parts came from John Dyson: Set PG_G on the UPAGES that are now in kernel context, and invalidate them when swapping them out. Move the upages object (upobj) from the vmspace to the proc structure. Now that the UPAGES (pcb and kernel stack) are out of user space, make rfork(..RFMEM..) do what was intended by sharing the vmspace entirely via reference counting rather than simply inheriting the mappings.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 70e534e7	17-Feb-1997	David Greenman <dg@FreeBSD.org>	Pass P_SUGID on to the child of a fork(). It was possible to get rlogin to coredump previously since it (somewhat uniquely) is setuid and forks without execing, and thus without passing P_SUGID the child could coredump and possibly divulge sensitive information (such as encrypted passwords from the passwd database).
# 996c772f	09-Feb-1997	John Dyson <dyson@FreeBSD.org>	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 3e2bca9e	15-Jan-1997	Bruce Evans <bde@FreeBSD.org>	Fixed interrupt unmasking for child processes which I broke in rev.1.10 two years ago. Children continued to run at splhigh() after returning from vm_fork(). This mainly affected kernel processes and init. For ordinary processes, interrupts are normally unmasked a few instructions later after fork() returns (it may be important for syscall() not to reschedule the child processes). Kernel processes had workarounds for the problem. Init manages to start because some routines "know" that it is safe to go to sleep despite their caller starting them at a high ipl. Then its ipl gets fixed on its first normal return from a syscall.
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 51068190	27-Oct-1996	Wolfram Schneider <wosch@FreeBSD.org>	Move static variable nextpid out from fork1(). Now top(1) can print last pid value.
# b71fec07	03-Sep-1996	Bruce Evans <bde@FreeBSD.org>	Eliminated nested include of <sys/unistd.h> in <sys/file.h> in the kernel. Include it directly in the few places where it is used. Reduced some #includes of <sys/file.h> to #includes of <sys/fcntl.h> or nothing.
# e0d898b4	21-Aug-1996	Julian Elischer <julian@FreeBSD.org>	Some cleanups to the callout lists recently added. note that at_shutdown has a new parameter to indicate When during a shutdown the callout should be made. also add a RB_POWEROFF flag to reboot "howto" parameter.. tells the reboot code in our at_shutdown module to turn off the UPS and kill the power. bound to be useful eventually on laptops
# fed06968	18-Aug-1996	Julian Elischer <julian@FreeBSD.org>	add callout lists for exit() and fork() I've been meaning to do this for AGES as I keep having to patch those routines whenever I write a proprietary package or similar.. any module that assigns resources to processes needs to know when these events occur. there are existsing modules that should be modified to take advantage of these.. e.g. SYSV IPC primatives presently have #ifdef entries in exit() this also helps with making LKMs out of such things.. (see the man pages at_exit(9) and at_fork(9))
# b1508c72	31-Jul-1996	David Greenman <dg@FreeBSD.org>	Converted timer/run queues to 4.4BSD queue style. Removed old and unused sleep(). Implemented wakeup_one() which may be used in the future to combat the "thundering herd" problem for some special cases. Reviewed by: dyson
# c23670e2	11-Jun-1996	Gary Palmer <gpalmer@FreeBSD.org>	Clean up -Wunused warnings. Reviewed by: bde
# 88d1b642	02-May-1996	Peter Wemm <peter@FreeBSD.org>	Fix a nasty bug that causes random crashes and lockups particularly on very busy servers (eg: news, web). This is an interaction between embryonic processes that have not yet finished forking, and happen to cause the kernel VM space to grow, hitting the uninitialised variable. It was possible for this to strike at any time, depending on the size of your kernel and load patterns. One machine had paniced occasionally when cron launches a job since before the 2.1 release. If you had "options DIAGNOSTIC", you may have seen references to bogus addresses like 0xdeadc142 and the like. This is a minimal change to fix the problem, it will probably be done better by reordering p_vmspace to be in the startzero section, but it becomes harder to validate then. It's been vulnerable since pmap.c rev 1.40 (Jan 9, 1995), so it's been a cause of problems since well before 2.0.5. This was when the merged VM/buffer cache and the dynamic growing kernel VM space were first committed. This probably fixes a few of PR's.
# 0e3eb7ee	17-Apr-1996	Sujal Patel <smpatel@FreeBSD.org>	Implement the RFNOWAIT flag for rfork(). If set this flag will cause the forked child to be dissociated from the parent). Cleanup fork1(), implement vfork() and fork() in terms of rfork() flags. Remove RFENVG, RFNOTEG, RFCNAMEG, RFCENVG which are Plan9 specific and cannot possibly be implemented in FreeBSD. Renumbered the flags to make up for the removal of the above flags. Reviewed by: peter, smpatel Submitted by: Mike Grupenhoff <kashmir@umiacs.umd.edu>
# edbfedac	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change. [note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]
# b75356e1	10-Mar-1996	Jeffrey Hsu <hsu@FreeBSD.org>	From Lite2: proc LIST changes. Reviewed by: david & bde
# ef5dc8a9	03-Mar-1996	John Dyson <dyson@FreeBSD.org>	Keep fork from over extending the number of processes. Since u_map is sized exactly for maxproc, the occasional overrunning the maxproc limit can cause problems.
# dabee6fe	23-Feb-1996	Peter Wemm <peter@FreeBSD.org>	kern_descrip.c: add fdshare()/fdcopy() kern_fork.c: add the tiny bit of code for rfork operation. kern/sysv_: shmfork() takes one less arg, it was never used. sys/shm.h: drop "isvfork" arg from shmfork() prototype sys/param.h: declare rfork args.. (this is where OpenBSD put it..) sys/filedesc.h: protos for fdshare/fdcopy. vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where it makes sense. vm/: drop unused isvfork arg. Note: this rfork() implementation copies the address space mappings, it does not connect the mappings together. ie: once the two processes have split, the pages may be shared, but the address space is not. If one does a mmap() etc, it does not appear in the other. This makes it not useful for pthreads, but it is useful in it's own right for having light-weight threads in a static shared address space. Obtained from: Original by Ron Minnich, extended by OpenBSD
# db6a20e2	03-Jan-1996	Garrett Wollman <wollman@FreeBSD.org>	Converted two options over to the new scheme: USER_LDT and KTRACE.
# efeaf95a	06-Dec-1995	David Greenman <dg@FreeBSD.org>	Untangled the vm.h include file spaghetti.
# d2d3e875	11-Nov-1995	Bruce Evans <bde@FreeBSD.org>	Included <sys/sysproto.h> to get central declarations for syscall args structs and prototypes for syscalls. Ifdefed duplicated decentralized declarations of args structs. It's convenient to have this visible but they are hard to maintain. Some are already different from the central declarations. 4.4lite2 puts them in comments in the function headers but I wanted to avoid the large changes for that.
# ad7507e2	07-Oct-1995	Steven Wallace <swallace@FreeBSD.org>	Remove prototype definitions from <sys/systm.h>. Prototypes are located in <sys/sysproto.h>. Add appropriate #include <sys/sysproto.h> to files that needed protos from systm.h. Add structure definitions to appropriate files that relied on sys/systm.h, right before system call definition, as in the rest of the kernel source. In kern_prot.c, instead of using the dummy structure "args", create individual dummy structures named <syscall>_args. This makes life easier for prototype generation.
# 9b2e5354	30-May-1995	Rodney W. Grimes <rgrimes@FreeBSD.org>	Remove trailing whitespace.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# d7e3a89f	21-Jan-1995	Bruce Evans <bde@FreeBSD.org>	Don't count the parent's previous timeslice in the child's resource usage (it was counted twice). Set the start time more accurately.
# d93f860c	09-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	Cosmetics. related to getting prototypes into view.
# 35c10d22	09-Oct-1994	David Greenman <dg@FreeBSD.org>	Got rid of map.h. It's a leftover from the rmap code, and we use rlists. Changed swapmap into swaplist.
# 7216391e	01-Oct-1994	David Greenman <dg@FreeBSD.org>	"idle priority" support. Based on code from Henrik Vestergaard Draboel, but substantially rewritten by me.
# e8fb0b2c	31-Aug-1994	David Greenman <dg@FreeBSD.org>	Realtime priority scheduling support. Submitted by: Henrik Vestergaard Draboel
# f23b4c91	18-Aug-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 0d2afcee	06-Aug-1994	David Greenman <dg@FreeBSD.org>	Process scheduling changes - adapted from FreeBSD 1.1.5. Basically, charge scheduling CPU of child process to the parent and have child inherit scheduling CPU from parent on fork. Makes a big difference in the feel of the system to interactive users. Submitted by: John Dyson
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources