History log of /freebsd-current/sys/kern/kern_fork.c
Revision Date Author Comments
# 800da341 22-Apr-2024 Mark Johnston <markj@FreeBSD.org>

thread: Simplify sanitizer integration with thread creation

fork() may allocate a new thread in one of two ways: from UMA, or cached
in a freed proc that was just allocated from UMA. In either case, KASAN
and KMSAN need to initialize some state; in particular they need to
initialize the shadow mapping of the new thread's stack.

This is done differently between KASAN and KMSAN, which is confusing.
This patch improves things a bit:
- Add a new thread_recycle() function, which moves all kernel stack
handling out of kern_fork.c, since it doesn't really belong there.
- Then, thread_alloc_stack() has only one local caller, so just inline
it.
- Avoid redundant shadow stack initialization: thread_alloc()
initializes the KMSAN shadow stack (via kmsan_thread_alloc()) even
through vm_thread_new() already did that.
- Add kasan_thread_alloc(), for consistency with kmsan_thread_alloc().

No functional change intended.

Reviewed by: khng
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D44891


# 68a3a7fc 19-Apr-2024 Ka Ho Ng <khng@FreeBSD.org>

kasan: fix false-positive kasan_report upon thread reuse

In fork1(), if a thread is reused and thread_alloc_stack() is not
called, mark the reused thread's kstack pages clean in the KASAN shadow
buffer.

Sponsored by: Juniper Networks, Inc.
MFC after: 3 days
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D44875


# 171f0832 28-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

EVFILT_TIMER: intialize stop timer list in type-stable proc init, instead of fork

Since kqueue timer may exist after the process that created it exited
(same scenario with rfork(2) as in PR 275286), make the tailq
p_kqtim_stop accessed by filt_timerdetach() type-stable.

Noted and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42777


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# eac62420 10-Oct-2023 Olivier Certner <olce.freebsd@certner.fr>

Ensure "init" (PID 1) also executes userret() initially

Calling userret() from fork_return() misses the first return to
userspace of the "init" (PID 1) process. The latter is indeed created
by fork1() followed by a call to cpu_fork_kthread_handler() call that
replaces fork_return() by start_init() as the function to execute after
fork.

A new process' initial return to userspace in the end always happens
through returning from fork_exit(), so move userret() there instead to
fix the omission.

This problem was discovered as part of a revamp of scheduling priorities
that lead to experimenting with asserting and sometimes resetting
priorities in sched_userret(), in the course of which the author
stumbled on panics being triggered only in init() or only in other
processes, depending on the modifications to sched_userret(). This
change currently has no practical effect but will have some in the near
future.

Reviewed by: markj, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42257


# 92541c12 25-Sep-2023 Olivier Certner <olce.freebsd@certner.fr>

Open-code proc_set_cred_init()

This function is to be called only when initializing a new process (so,
'proc0' and at fork), and not in any other circumstances. Setting the
process' 'p_ucred' field to the result of crcowget() on the original
credentials is the only thing it does, hiding the fact that the process'
'p_ucred' field is crushed by the call. Moreover, most of the code it
executes is already encapsulated in crcowget().

To prevent misuse and improve code readability, just remove this
function and replace it with a direct assignment to 'p_ucred'.

Reviewed by: markj (earlier version), kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D42255


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 474708c3 26-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

fork1(): properly track the state of the pg_killsx lock

Reported by: dchagin
Fixes: 232b922cb363e01ac0dd2a277d93cf74d8485e79
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 232b922c 20-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

killpg(): close a race with fork(), part 2

When we are sending terminating signal to the group, killpg() needs
to guarantee that all group members are to be terminated (it does not
need to ensure that they are terminated on return from killpg()). The
pg_killsx change eliminates the largest window there, but still, if
a multithreaded process is signalled, the following could happen:
- thread 1 is selected for the signal delivery and gets descheduled
- thread 2 waits for pg_killsx lock, obtains it and forks
- thread 1 continue executing and terminates the process
This scenario allows the child to escape still.

Fix it by single-threading forking parent if a conflict with pg_killsx
is noted. We try to lock pg_killsx without sleeping, and failure to
acquire it means that a parallel killpg(2) is executed. Then, stop
other threads for running and in particular, receive signals, to avoid
the situation explained above.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41128


# aaa92413 20-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

Revert "killpg(): close a race with fork(), part 2"

This reverts commits 81a37995c757b4e3ad8a5c699864197fd1ebdcf5 and
565a343ae3a30bc2973182ff8dfd2fa37d7f615f.

There is still a leakage of the p_killpg_cnt, some but not all sources
of which were identified.

Second, and more important, is that there is a fundamental issue with
blocked signals having KSI_KILLPG flag set. Queueing of such signal
increments p_killpg_cnt, but it cannot be decremented until the signal
is delivered. If, for instance, a single-threaded process with blocked
signal receives killpg-kill and executes fork(2), the fork enter check
returns with ERESTART. And since signal is blocked, the condition
cannot be cleared.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41128


# 81a37995 15-Jun-2023 Konstantin Belousov <kib@FreeBSD.org>

killpg(): close a race with fork(), part 2

When we are sending terminating signal to the group, killpg() needs to
guarantee that all group members are to be terminated (it does not need
to ensure that they are terminated on return from killpg()). The
pg_killsx change eliminates the largest window there, but still, if a
multithreaded process is signalled, the following could happen:
- thread 1 is selected for the signal delivery and gets descheduled
- thread 2 waits for pg_killsx lock, obtains it and forks
- thread 1 continue executing and terminates the process
This scenario allows the child to escape still.

To fix it, count the number of signals sent to the process with
killpg(2), in p_killpg_cnt variable, which is incremented in killpg()
and decremented after signal handler frame is created or in exit1()
after single-threading. This way we avoid forking if the termination is
due.

Noted and reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D40493


# 3360b485 12-Jun-2023 Konstantin Belousov <kib@FreeBSD.org>

killpg(2): close a race with fork(2), part1

If the process group member performs fork(), the child could escape
signalling from killpg(). Prevent it by introducing an sx process group
lock pg_killsx which is taken interruptibly shared around fork. If there
is a pending signal, do the trip through userspace with ERESTART to
handle signal ASTs. The lock is taken exclusively during killpg().

The lock is also locked exclusive when the process changes group
membership, to avoid escaping a signal by this means, by ensuring that
the process group is stable during fork.

Note that the new lock is before proctree lock, so in some situations we
could only do trylocking to obtain it.

This relatively simple approach cannot work for REAP_KILL, because
process potentially belongs to more than one reaper tree by having
sub-reapers.

Reported by: dchagin
Tested by: dchagin, pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D40493


# 653738e8 07-Jun-2023 John Baldwin <jhb@FreeBSD.org>

ptrace: Clear TDB_BORN during PT_DETACH.

If a debugger detaches from a process that has a new thread that has
not yet executed, the new thread will raise a SIGTRAP signal to report
it's thread birth event even after the detach. With the debugger
detached, this results in a SIGTRAP sent to the process and typically
a core dump. Fix this by clearing TDB_BORN from any new threads
during detach.

Bump __FreeBSD_version for debuggers to notice when the fix is
present.

Reported by: GDB's testsuite
Reviewed by: kib, markj (previous version)
Differential Revision: https://reviews.freebsd.org/D39856


# 0282f875 16-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

ktrace: Simplify ae6ac587, drop the sa var declaration

Suggested by: kib
MFC after: 5 days


# ae6ac587 14-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

ktrace: Fix syscall number on a child return path from fork

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D40078
MFC after 1 week


# 5ecb5444 10-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

jail: add process linkage

It allows iteration over processes belonging to given jail instead of
having to walk the entire allproc list.

Note the iteration can miss processes which remains bug-compatible
with previous code.

Reviewed by: jamie (previous version), markj (previous version)
Differential Revision: https://reviews.freebsd.org/D34522


# fce3b1c3 21-Aug-2022 Konstantin Belousov <kib@FreeBSD.org>

fork_exit(): style comment

Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36302


# 5e5675cb 12-Aug-2022 Konstantin Belousov <kib@FreeBSD.org>

Remove struct proc p_singlethr member

It does not serve any purpose after we stopped doing
thread_single(SINGLE_ALLPROC) from stoppable user processes.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36207


# 5e9bba94 10-Aug-2022 Konstantin Belousov <kib@FreeBSD.org>

fork_norfproc(): unlock p1 before retrying

Reported and reviewed by: markj
Tested by: pho
Syzkaller: 647212368c3f32c6f13f
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36207


# bd76586b 11-Aug-2022 Konstantin Belousov <kib@FreeBSD.org>

fork_norfproc(): style

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D36207


# d07675a9 04-Aug-2022 Mark Johnston <markj@FreeBSD.org>

file: Move code to share fdtol structs into kern_descrip.c

This ensures the filedesc-to-leader code is consistently encapsulated in
kern_descrip.c.

No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35988


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# 4493a13e 15-May-2022 Konstantin Belousov <kib@FreeBSD.org>

Do not single-thread itself when the process single-threaded some another process

Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), it cannot be mixed: thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.

In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310


# 893d20c9 29-Jan-2022 Mateusz Guzik <mjg@FreeBSD.org>

fd: move fd table sizing out of fdinit

now it is placed with the rest of actual initialisation


# 626d6992 26-Dec-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move fork_rfppwait() check into ast()

This will always sleep at least once, so it's a slow path by definition.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D33387


# 351d5f7f 23-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

exec: store parent directory and hardlink name of the binary in struct proc

While doing it, also move all the code to resolve pathnames and obtain
text vp and dvp, into single place. Besides simplifying the code, it
avoids spurious vnode relocks and validates the explanation why
a transient text reference on the script vnode is not harmful.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 46dd801a 16-Oct-2021 Colin Percival <cperciva@FreeBSD.org>

Add userland boot profiling to TSLOG

On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.

Expose this information via a new sysctl, debug.tslog_user.

On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.

Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.

With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc. The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling

Reviewed by: mhorne
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D32493


# a0558fe9 28-Apr-2021 Mateusz Guzik <mjg@FreeBSD.org>

Retire code added to support CloudABI

CloudABI was removed in cf0ee8738e31aa9e6fbf4dca4dac56d89226a71a


# 796a8e1a 01-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

procctl(2): Add PROC_WXMAP_CTL/STATUS

It allows to override kern.elf{32,64}.allow_wx on per-process basis.
In particular, it makes it possible to run binaries without PT_GNU_STACK
and without elfctl note while allow_wx = 0.

Reviewed by: brooks, emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31779


# 71854d9b 12-Aug-2021 Dmitry Chagin <dchagin@FreeBSD.org>

fork: Remove the unnecessary spaces.

MFC after: 2 weeks


# b0f71f1b 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

amd64: Add MD bits for KMSAN

Interrupt and exception handlers must call kmsan_intr_enter() prior to
calling any C code. This is because the KMSAN runtime maintains some
TLS in order to track initialization state of function parameters and
return values across function calls. Then, to ensure that this state is
kept consistent in the face of asynchronous kernel-mode excpeptions, the
runtime uses a stack of TLS blocks, and kmsan_intr_enter() and
kmsan_intr_leave() push and pop that stack, respectively.

Use these functions in amd64 interrupt and exception handlers. Note
that handlers for user->kernel transitions need not be annotated.

Also ensure that trap frames pushed by the CPU and by handlers are
marked as initialized before they are used.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31467


# 5dda15ad 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

kern: Ensure that thread-local KMSAN state is available

Sponsored by: The FreeBSD Foundation


# db8d680e 01-Jul-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS

This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared. The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.

The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30939


# 9246b309 13-May-2021 Mark Johnston <markj@FreeBSD.org>

fork: Suspend other threads if both RFPROC and RFMEM are not set

Otherwise, a multithreaded parent process may trigger races in
vm_forkproc() if one thread calls rfork() with RFMEM set and another
calls rfork() without RFMEM.

Also simplify vm_forkproc() a bit, vmspace_unshare() already checks to
see if the address space is shared.

Reported by: syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com
Reported by: syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30220


# 2fd1ffef 05-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Stop arming kqueue timers on knote owner suspend or terminate

This way, even if the process specified very tight reschedule
intervals, it should be stoppable/killable.

Reported and reviewed by: markj
Tested by: markj, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29106


# 640d5404 12-Mar-2021 John Baldwin <jhb@FreeBSD.org>

Set TDP_KTHREAD before calling cpu_fork() and cpu_copy_thread().

This permits these routines to use special logic for initializing MD
kthread state.

For the kproc case, this required moving the logic to set these flags
from kproc_create() into do_fork().

Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29207


# cc7b7306 16-Feb-2021 Jamie Gritton <jamie@FreeBSD.org>

jail: Handle a possible race between jail_remove(2) and fork(2)

jail_remove(2) includes a loop that sends SIGKILL to all processes
in a jail, but skips processes in PRS_NEW state. Thus it is possible
the a process in mid-fork(2) during jail removal can survive the jail
being removed.

Add a prison flag PR_REMOVE, which is checked before the new process
returns. If the jail is being removed, the process will then exit.
Also check this flag in jail_attach(2) which has a similar issue.

Reported by: trasz
Approved by: kib
MFC after: 3 days


# f8f74aaa 17-Nov-2020 Conrad Meyer <cem@FreeBSD.org>

linux(4) clone(2): Correctly handle CLONE_FS and CLONE_FILES

The two flags are distinct and it is impossible to correctly handle clone(2)
without the assistance of fork1(). This change depends on the pwddesc split
introduced in r367777.

I've added a fork_req flag, FR2_SHARE_PATHS, which indicates that p_pd
should be treated the opposite way p_fd is (based on RFFDG flag). This is a
little ugly, but the benefit is that existing RFFDG API is preserved.
Holding FR2_SHARE_PATHS disabled, RFFDG indicates both p_fd and p_pd are
copied, while !RFFDG indicates both should be cloned.

In Chrome, clone(2) is used with CLONE_FS, without CLONE_FILES, and expects
independent fd tables.

The previous conflation of CLONE_FS and CLONE_FILES was introduced in
r163371 (2006).

Discussed with: markj, trasz (earlier version)
Differential Revision: https://reviews.freebsd.org/D27016


# 85078b85 17-Nov-2020 Conrad Meyer <cem@FreeBSD.org>

Split out cwd/root/jail, cmask state from filedesc table

No functional change intended.

Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).

__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.

Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# d8bc2a17 15-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

fd: remove fd_lastfile

It keeps recalculated way more often than it is needed.

Provide a routine (fdlastfile) to get it if necessary.

Consumers may be better off with a bitmap iterator instead.


# 1724c563 09-Jun-2020 Mateusz Guzik <mjg@FreeBSD.org>

cred: distribute reference count per thread

This avoids dirtying creds in the common case, see the comment in kern_prot.c
for details.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D24007


# 757a5642 15-May-2020 Christian S.J. Peron <csjp@FreeBSD.org>

Add BSM record conversion for a number of syscalls:

- thr_kill(2) and thr_exit(2) generally (no argument auditing here.
- A set of syscalls for the process descriptor family, specifically:
pdfork(2), pdgetpid(2) and pdkill(2)

For these syscalls, audit the file descriptor. In the case of pdfork(2)
a pointer to an integer (file descriptor) is passed in as an argument.
We audit the post initialized file descriptor (not the random garbage
that would have been passed in). We will also audit the child process
which was created from the fork operation (similar to what is done for
the fork(2) syscall).

pdkill(2) we audit the signal value and fd, and finally pdgetpid(2)
just the file descriptor:

- Following is a sample of the produced audit trails:

header,111,11,pdfork(2),0,Sat May 16 03:07:50 2020, + 394 msec
argument,0,0x39d,child PID
argument,2,0x2,flags
argument,1,0x8,fd
subject,root,root,0,root,0,924,0,0,0.0.0.0
return,success,925

header,79,11,pdgetpid(2),0,Sat May 16 03:07:50 2020, + 394 msec
argument,1,0x8,fd
subject,root,root,0,root,0,924,0,0,0.0.0.0
return,success,0
trailer,79

header,135,11,pdkill(2),0,Sat May 16 03:07:50 2020, + 395 msec
argument,1,0x8,fd
argument,2,0xf,signal
process_ex,root,root,0,root,0,925,0,0,0.0.0.0
subject,root,root,0,root,0,924,0,0,0.0.0.0
return,success,0
trailer,135

MFC after: 1 week


# 59838c1a 01-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Retire procfs-based process debugging.

Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.

PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 146fc63f 09-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Add a way to manage thread signal mask using shared word, instead of syscall.

A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.

The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).

With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.

The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.

Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773


# 61a74c5c 15-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

schedlock 1/4

Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.

Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626


# 079c5b9e 25-Sep-2019 Kyle Evans <kevans@FreeBSD.org>

rfork(2): add RFSPAWN flag

When RFSPAWN is passed, rfork exhibits vfork(2) semantics but also resets
signal handlers in the child during creation to avoid a point of corruption
of parent state from the child.

This flag will be used by posix_spawn(3) to handle potential signal issues.

Reviewed by: jilles, kib
Differential Revision: https://reviews.freebsd.org/D19058


# fe69291f 03-Sep-2019 Konstantin Belousov <kib@FreeBSD.org>

Add procctl(PROC_STACKGAP_CTL)

It allows a process to request that stack gap was not applied to its
stacks, retroactively. Also it is possible to control the gaps in the
process after exec.

PR: 239894
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21352


# 50c7615f 17-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

fork: rework locking around do_fork

- move allproc lock into the func, it is of no use prior to it
- the code would lock p1 and p2 while holding allproc to partially
construct it after it gets added to the list. instead we can do the
work prior to adding anything.
- protect lastpid with procid_lock

As a side effect we do less work with allproc held.

Sponsored by: The FreeBSD Foundation


# 60cdcb64 17-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

fork: bump process count before checking for permission to cross the limit

The limit is almost never reached. Do the check only on failure to see if
we can override it.

No change in user-visible behavior.

Sponsored by: The FreeBSD Foundation


# b05641b6 17-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

fork: stop skipping < 100 ids on wrap around

Code doing this is commented with a claim that these IDs are occupied by
daemons, but that's demonstrably false. To an extent the range is used by init
and kernel processes (and on sufficiently big machines it indeed is fully
populated).

On a sample box 40-way box the highest id in the range is 63. On a different one
it is 23. Just use the range.

Sponsored by: The FreeBSD Foundation


# 2ffee5c1 10-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Inherit P2_PROTMAX_{ENABLE,DISABLE} across fork().

Thus, when using proccontrol(1) to disable implicit application of
PROT_MAX within a process, child processes will inherit this setting.

Discussed with: kib
MFC with: r349609
Sponsored by: The FreeBSD Foundation


# e2e050c8 19-May-2019 Conrad Meyer <cem@FreeBSD.org>

Extract eventfilter declarations to sys/_eventfilter.h

This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.


# 7d43b5c9 07-May-2019 Mark Johnston <markj@FreeBSD.org>

Simplify the test against maxproc in fork1().

Previously nprocs_new would be tested against maxprocs twice when
nprocs_new < maxprocs - 10. Eliminate the unnecessary comparison.

Submitted by: Wuyang Chung <wuyang.chung1@gmail.com>
GitHub PR: https://github.com/freebsd/freebsd/pull/397
MFC after: 1 week


# 37d2b1f3 04-May-2019 Mateusz Guzik <mjg@FreeBSD.org>

Annotate nprocs with __exclusive_cache_line

Sponsored by: The FreeBSD Foundation


# fa50a355 10-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Implement Address Space Layout Randomization (ASLR)

With this change, randomization can be enabled for all non-fixed
mappings. It means that the base address for the mapping is selected
with a guaranteed amount of entropy (bits). If the mapping was
requested to be superpage aligned, the randomization honours the
superpage attributes.

Although the value of ASLR is diminshing over time as exploit authors
work out simple ASLR bypass techniques, it elimintates the trivial
exploitation of certain vulnerabilities, at least in theory. This
implementation is relatively small and happens at the correct
architectural level. Also, it is not expected to introduce
regressions in existing cases when turned off (default for now), or
cause any significant maintaince burden.

The randomization is done on a best-effort basis - that is, the
allocator falls back to a first fit strategy if fragmentation prevents
entropy injection. It is trivial to implement a strong mode where
failure to guarantee the requested amount of entropy results in
mapping request failure, but I do not consider that to be usable.

I have not fine-tuned the amount of entropy injected right now. It is
only a quantitive change that will not change the implementation. The
current amount is controlled by aslr_pages_rnd.

To not spoil coalescing optimizations, to reduce the page table
fragmentation inherent to ASLR, and to keep the transient superpage
promotion for the malloced memory, locality clustering is implemented
for anonymous private mappings, which are automatically grouped until
fragmentation kicks in. The initial location for the anon group range
is, of course, randomized. This is controlled by vm.cluster_anon,
enabled by default.

The default mode keeps the sbrk area unpopulated by other mappings,
but this can be turned off, which gives much more breathing bits on
architectures with small address space, such as i386. This is tied
with the question of following an application's hint about the mmap(2)
base address. Testing shows that ignoring the hint does not affect the
function of common applications, but I would expect more demanding
code could break. By default sbrk is preserved and mmap hints are
satisfied, which can be changed by using the
kern.elf{32,64}.aslr.honor_sbrk sysctl.

ASLR is enabled on per-ABI basis, and currently it is only allowed on
FreeBSD native i386 and amd64 (including compat 32bit) ABIs. Support
for additional architectures will be added after further testing.

Both per-process and per-image controls are implemented:
- procctl(2) adds PROC_ASLR_CTL/PROC_ASLR_STATUS;
- NT_FREEBSD_FCTL_ASLR_DISABLE feature control note bit makes it possible
to force ASLR off for the given binary. (A tool to edit the feature
control note is in development.)
Global controls are:
- kern.elf{32,64}.aslr.enable - for non-fixed mappings done by mmap(2);
- kern.elf{32,64}.aslr.pie_enable - for PIE image activation mappings;
- kern.elf{32,64}.aslr.honor_sbrk - allow to use sbrk area for mmap(2);
- vm.cluster_anon - enables anon mapping clustering.

PR: 208580 (exp runs)
Exp-runs done by: antoine
Reviewed by: markj (previous version)
Discussed with: emaste
Tested by: pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D5603


# be8dd142 16-Jan-2019 Konstantin Belousov <kib@FreeBSD.org>

Re-wrap long line after r341827.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 19b75ef5 19-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Microoptimize corner case of ID bitmap handling.

Prior to the change we would avoidably test more possibly used IDs.

While here update the comment: there is no pidchecked variable anymore.


# 7d065d87 19-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Deinline vfork handling out of the syscall return path.

vfork is rarely called (comparatively to other syscalls) and it avoidably
pollutes the fast path.

Sponsored by: The FreeBSD Foundation


# cc426dd3 11-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove unused argument to priv_check_cred.

Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by: The FreeBSD Foundation


# eab2132a 08-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Fix a corner case in ID bitmap management.

If all IDs from trypid to pid_max were used as pids, the code would enter
a loop which would be infinite if none of the IDs could become free (e.g.
they all belong to processes which did not transitioned to zombie).

Fixes: r341684 ("Manage process-related IDs with bitmaps")

Sponsored by: The FreeBSD Foundation


# e52327e3 07-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: postpone proc unlock until after reporting with kqueue

kqueue would always relock immediately afterwards.

While here drop the NULL check for list itself. The list is
always allocated.

Sponsored by: The FreeBSD Foundation


# 34ebdcea 06-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Manage process-related IDs with bitmaps

Currently unique pid allocation on fork often requires a full walk of
process, group, session lists to make sure it is not used by anything.
This has a side effect of requiring proctree to be held along with allproc,
which adds more contention in poudriere -j 128.

The patch below implements trivial bitmaps which gets rid of the problem.
Dedicated lock is introduced to manage IDs.

While here a bug was discovered: all processes would inherit reap id from
the first process spawned by init. This had a side effect of keeping the
ID used and when allocation rolls over to the beginning it keeps being
skipped.

The patch is loosely based on initial work by mjoras@.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation


# 1e9a1bf5 28-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: create a dedicated lock for zombproc to ligthen the load on allproc_lock

waitpid always takes proctree to evaluate the list, but only takes allproc
if it can reap. With this patch allproc is no longer taken, which helps during
poudriere -j 128.

Discussed with: kib
Sponsored by: The FreeBSD Foundation


# e3d3e828 22-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

Revert "fork: fix use-after-free with vfork"

This unreliably breaks libc handling of vfork where forking succeded,
but execve did not.

vfork code in libc performs waitpid with WNOHANG in case of failed exec.
With the fix exit codepath was waking up the parent before the child
fully transitioned to a zombie. Woken up parent would waitpid, which
could find a not-yet-zombie child and fail to reap it due to the WNOHANG
flag.

While removing the flag fixes the problem, it is not an option due to older
releases which would still suffer from the kernel change.

Revert the fix until a solution can be worked out.

Note that while use-after-free which gets back due to the revert is a real
bug, it's side-effects are limited due to the fact that struct proc memory
is never released by UMA.


# a5ac8272 22-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

fork: remove avoidable proc lock/unlock pair

We don't have to access the process after making it runnable, so there
is no need to hold it either.

Sponsored by: The FreeBSD Foundation


# b00b27e9 22-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

fork: fix use-after-free with vfork

The pointer to the child is stored without any reference held. Then it is
blindly used to wait until P_PPWAIT is cleared. However, if the child is
autoreaped it could have exited and get freed before the parent started
waiting.

Use the existing hold mechanism to mitigate the problem. Most common case
of doing exec remains unchanged. The corner case of doing exit performs
wake up before waiting for holds to clear.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18295


# 3d3e6793 21-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: implement pid hash locks and an iterator

forks, exits and waits are frequently stalled during poudriere -j 128 runs
due to killpg and process list exports performed for each package.

Both uses take the allproc lock. The latter case can be modified to iterate
over the hash with finer grained locking instead.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17817


# 2c054ce9 16-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: always store parent pid in p_oppid

Doing so removes the dependency on proctree lock from sysctl process list
export which further reduces contention during poudriere -j 128 runs.

Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17825


# 6e22bbf6 21-Jun-2018 Konstantin Belousov <kib@FreeBSD.org>

fork: avoid endless wait with PTRACE_FORK and RFSTOPPED.

An RFSTOPPED thread can't clean TDB_STOPATFORK, which is done in the
fork_return() in its context, so parent is stuck forever. Triggered
when trying to ptrace linux process. Instead of waiting for the new
thread to clear TDB_STOPATFORK, tag it as traced and reparent to the
debugger in do_fork(), and let it only notify the debugger when run.

Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by: jhb
MFC after: 1 week
X-MFC-Note: keep p_dbgwait placeholder intact
Differential revision: https://reviews.freebsd.org/D15857


# 81d68271 19-Feb-2018 Mateusz Guzik <mjg@FreeBSD.org>

Reduce contention on the proctree lock during heavy package build.

There is a proctree -> allproc ordering established.

Most of the time it is either xlock -> xlock or slock -> slock.

On fork however there is a slock -> xlock pair which results in
pathological wait times due to threads keeping proctree held for
reading and all waiting on allproc. Switch this to xlock -> xlock.
Longer term fix would get rid of proctree in this place to begin with.
Right now it is necessary to walk the session/process group lists to
determine which id is free. The walk can be avoided e.g. with bitmaps.

The exit path used to have one place which dealt with allproc and
then with proctree. Move the allproc acquire into the section protected
by proctree. This reduces contention against threads waiting on proctree
in the fork codepath - the fork proctree holder does not have to wait
for allproc as often.

Finally, move tidhash manipulation outside of the area protected by
either of these locks. The removal from the hash was already unprotected.
There is no legitimate reason to look up thread ids for a process still
under construction.

This results in about 50% wait time reduction during -j 128 package build.


# 65f29b9c 16-Feb-2018 Mateusz Guzik <mjg@FreeBSD.org>

Postpone sx_sunlock(&proctree_lock) on fork until after allproc is dropped.

There is a significant contention on the lock during -j 128 package build.
This change drops total wait time on this lock by 60%.


# 8e23158a 14-Jan-2018 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove trailing whitespace.
No functional change.


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 2ca45184 09-Nov-2017 Matt Joras <mjoras@FreeBSD.org>

Introduce EVENTHANDLER_LIST and some users.

This introduces a facility to EVENTHANDLER(9) for explicitly defining a
reference to an event handler list. This is useful since previously all
invokers of events had to do a locked traversal of the global list of
event handler lists in order to find the appropriate event handler list.
By keeping a pointer to the appropriate list an invoker can avoid this
traversal completely. The pointer is initialized with SYSINIT(9) during
the eventhandler stage. Users registering interest in events do not need
to know if the event is backed by such a list, since the list is added
to the global list of lists. As with lists that are not pre-defined it
is safe to register for the events before the list has been created.

This converts the process_* and thread_* events to using the new
facility, as these are events whose locked traversals end up showing up
significantly in ports build workflows (and presumably other workflows
with many short lived threads/procs). It may be advantageous to convert
other events to using the new facility.

The el_flags field is now unused, but leave it be so that this revision
can be MFC'd.

Reviewed by: bdrewery, markj, mjg
Approved by: rstone (mentor)
In collaboration with: ian
MFC after: 4 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12814


# 008a0935 10-Sep-2017 Dag-Erling Smørgrav <des@FreeBSD.org>

If the user tries to set kern.randompid to 1 (which is meaningless), set
it to a random value between 100 and 1123, rather than 0 as before.

Submitted by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D5336


# 2d88da2f 12-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Move struct syscall_args syscall arguments parameters container into
struct thread.

For all architectures, the syscall trap handlers have to allocate the
structure on the stack. The structure takes 88 bytes on 64bit arches
which is not negligible. Also, it cannot be easily found by other
code, which e.g. caused duplication of some members of the structure
to struct thread already. The change removes td_dbg_sc_code and
td_dbg_sc_nargs which were directly copied from syscall_args.

The structure is put into the copied on fork part of the struct thread
to make the syscall arguments information correct in the child after
fork.

This move will also allow several more uses shortly.

Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks
X-Differential revision: https://reviews.freebsd.org/D11080


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156


# 82a4538f 20-Feb-2017 Eric Badger <badger@FreeBSD.org>

Defer ptracestop() signals that cannot be delivered immediately

When a thread is stopped in ptracestop(), the ptrace(2) user may request
a signal be delivered upon resumption of the thread. Heretofore, those signals
were discarded unless ptracestop()'s caller was issignal(). Fix this by
modifying ptracestop() to queue up signals requested by the ptrace user that
will be delivered when possible. Take special care when the signal is SIGKILL
(usually generated from a PT_KILL request); no new stop events should be
triggered after a PT_KILL.

Add a number of tests for the new functionality. Several tests were authored
by jhb.

PR: 212607
Reviewed by: kib
Approved by: kib (mentor)
MFC after: 2 weeks
Sponsored by: Dell EMC
In collaboration with: jhb
Differential Revision: https://reviews.freebsd.org/D9260


# 5afb134c 12-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vrefact, to be used when the vnode has to be already active

This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by: kib (previous version)


# 643f6f47 21-Sep-2016 Konstantin Belousov <kib@FreeBSD.org>

Add PROC_TRAPCAP procctl(2) controls and global sysctl kern.trap_enocap.

Both can be used to cause processes in capability mode to receive
SIGTRAP when ENOTCAPABLE or ECAPMODE errors are returned from
syscalls.

Idea by: emaste
Reviewed by: oshogbo (previous version), emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D7965


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# e5574e09 19-Aug-2016 Mark Johnston <markj@FreeBSD.org>

Don't set P2_PTRACE_FSTP in a process that invokes ptrace(PT_TRACE_ME).

Such processes are stopped synchronously by a direct call to
ptracestop(SIGTRAP) upon exec. P2_PTRACE_FSTP causes the exec()ing thread
to suspend itself while waiting for a SIGSTOP that never arrives.

Reviewed by: kib
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D7576


# e69ba32f 03-Aug-2016 Konstantin Belousov <kib@FreeBSD.org>

Remove mention of the Giant from the fork_return() description.
Making emphasis on this lock in the core function comment is confusing
for the modern kernel.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# b7a25e63 28-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

When a debugger attaches to the process, SIGSTOP is sent to the
target. Due to a way issignal() selects the next signal to deliver
and report, if the simultaneous or already pending another signal
exists, that signal might be reported by the next waitpid(2) call.
This causes minor annoyance for debuggers, which must be prepared to
take any signal as the first event, then filter SIGSTOP later.

More importantly, for tools like gcore(1), which attach and then
detach without processing events, SIGSTOP might leak to be delivered
after PT_DETACH. This results in the process being unintentionally
stopped after detach, which is fatal for automatic tools.

The solution is to force SIGSTOP to be the first signal reported after
the attach. Attach code is modified to set P2_PTRACE_FSTP to indicate
that the attaching ritual was not yet finished, and issignal() prefers
SIGSTOP in that condition. Also, the thread which handles
P2_PTRACE_FSTP is made to guarantee to own p_xthread during the first
waitpid(2). All that ensures that SIGSTOP is consumed first.

Additionally, if P2_PTRACE_FSTP is still set on detach, which means
that waitpid(2) was not called at all, SIGSTOP is removed from the
queue, ensuring that the process is resumed on detach.

In issignal(), when acting on STOPing signals, remove the signal from
queue before suspending. Otherwise parallel attach could result in
ptracestop() acting on that STOP as if it was the STOP signal from the
attach. Then SIGSTOP from attach leaks again.

As a minor refactoring, some bits of the common attach code is moved
to new helper proc_set_traced().

Reported by: markj
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7256


# fc4f075a 18-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Add PTRACE_VFORK to trace vfork events.

First, PL_FLAG_FORKED events now also set a PL_FLAG_VFORKED flag when
the new child was created via vfork() rather than fork(). Second, a
new PL_FLAG_VFORK_DONE event can now be enabled via the PTRACE_VFORK
event mask. This new stop is reported after the vfork parent resumes
due to the child calling exit or exec. Debuggers can use this stop to
reinsert breakpoints in the vfork parent process before it resumes.

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7045


# 8d570f64 15-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Add a mask of optional ptrace() events.

ptrace() now stores a mask of optional events in p_ptevents. Currently
this mask is a single integer, but it can be expanded into an array of
integers in the future.

Two new ptrace requests can be used to manipulate the event mask:
PT_GET_EVENT_MASK fetches the current event mask and PT_SET_EVENT_MASK
sets the current event mask.

The current set of events include:
- PTRACE_EXEC: trace calls to execve().
- PTRACE_SCE: trace system call entries.
- PTRACE_SCX: trace syscam call exits.
- PTRACE_FORK: trace forks and auto-attach to new child processes.
- PTRACE_LWP: trace LWP events.

The S_PT_SCX and S_PT_SCE events in the procfs p_stops flags have
been replaced by PTRACE_SCE and PTRACE_SCX. PTRACE_FORK replaces
P_FOLLOW_FORK and PTRACE_LWP replaces P2_LWP_EVENTS.

The PT_FOLLOW_FORK and PT_LWP_EVENTS ptrace requests remain for
compatibility but now simply toggle corresponding flags in the
event mask.

While here, document that PT_SYSCALL, PT_TO_SCE, and PT_TO_SCX both
modify the event mask and continue the traced process.

Reviewed by: kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D7044


# 9e590ff0 27-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

When filt_proc() removes event from the knlist due to the process
exiting (NOTE_EXIT->knlist_remove_inevent()), two things happen:
- knote kn_knlist pointer is reset
- INFLUX knote is removed from the process knlist.
And, there are two consequences:
- KN_LIST_UNLOCK() on such knote is nop
- there is nothing which would block exit1() from processing past the
knlist_destroy() (and knlist_destroy() resets knlist lock pointers).
Both consequences result either in leaked process lock, or
dereferencing NULL function pointers for locking.

Handle this by stopping embedding the process knlist into struct proc.
Instead, the knlist is allocated together with struct proc, but marked
as autodestroy on the zombie reap, by knlist_detach() function. The
knlist is freed when last kevent is removed from the list, in
particular, at the zombie reap time if the list is empty. As result,
the knlist_remove_inevent() is no longer needed and removed.

Other changes:

In filt_procattach(), clear NOTE_EXEC and NOTE_FORK desired events
from kn_sfflags for knote registered by kernel to only get NOTE_CHILD
notifications. The flags leak resulted in excessive
NOTE_EXEC/NOTE_FORK reports.

Fix immediate note activation in filt_procattach(). Condition should
be either the immediate CHILD_NOTE activation, or immediate NOTE_EXIT
report for the exiting process.

In knote_fork(), do not perform racy check for KN_INFLUX before kq
lock is taken. Besides being racy, it did not accounted for notes
just added by scan (KN_SCAN).

Some minor and incomplete style fixes.

Analyzed and tested by: Eric Badger <eric@badgerio.us>
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)
Differential revision: https://reviews.freebsd.org/D6859


# 5c2cf818 15-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Update comments for the MD functions managing contexts for new
threads, to make it less confusing and using modern kernel terms.

Rename the functions to reflect current use of the functions, instead
of the historic KSE conventions:
cpu_set_fork_handler -> cpu_fork_kthread_handler (for kthreads)
cpu_set_upcall -> cpu_copy_thread (for forks)
cpu_set_upcall_kse -> cpu_set_upcall (for new threads creation)

Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (hrs)
Differential revision: https://reviews.freebsd.org/D6731


# b3a73448 07-Jun-2016 Mariusz Zaborski <oshogbo@FreeBSD.org>

Introduce the PD_CLOEXEC for pdfork(2).

Reviewed by: mjg


# 93ccd6bf 05-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Get rid of struct proc p_sched and struct thread td_sched pointers.

p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for the thread0. Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.

Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D6711


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# db57c70a 09-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Rename P_KTHREAD struct proc p_flag to P_KPROC.

I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL()
definition.

Suggested by: jhb
Sponsored by: The FreeBSD Foundation


# fb1f4582 08-Feb-2016 John Baldwin <jhb@FreeBSD.org>

Call kthread_exit() rather than kproc_exit() for a premature kthread exit.

Kernel threads (and processes) are supposed to call kthread_exit() (or
kproc_exit()) to terminate. However, the kernel includes a fallback in
fork_exit() to force a kthread exit if a kernel thread's "main" routine
returns. This fallback was added back when the kernel only had processes
and was not updated to call kthread_exit() instead of kproc_exit() when
threads were added to the kernel.

This mistake was particular exciting when the errant thread belonged to
proc0. Due to the missing P_KTHREAD flag the fallback did not kick in
and instead tried to return to userland via whatever garbage was in the
trapframe. With P_KTHREAD set it tried to terminate proc0 resulting in
other amusements.

PR: 204999
MFC after: 1 week


# 0c829a30 06-Feb-2016 Mateusz Guzik <mjg@FreeBSD.org>

fork: ansify sys_pdfork

No functional changes.


# 4732ae43 04-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Guard against runnable td2 exiting and than being reused for unrelated
process when the parent sleeps waiting for the debugger attach on
fork.

Diagnosed and reviewed by: mjg
Sponsored by: The FreeBSD Foundation


# 813361c1 03-Feb-2016 Mateusz Guzik <mjg@FreeBSD.org>

fork: plug a use after free of the returned process

fork1 required its callers to pass a pointer to struct proc * which would
be set to the new process (if any). procdesc and racct manipulation also
used said pointer.

However, the process could have exited prior to do_fork return and be
automatically reaped, thus making this a use-after-free.

Fix the problem by letting callers indicate whether they want the pid or
the struct proc, return the process in stopped state for the latter case.

Reviewed by: kib


# 33fd9b9a 03-Feb-2016 Mateusz Guzik <mjg@FreeBSD.org>

fork: pass arguments to fork1 in a dedicated structure

Suggested by: kib


# 5fcfab6e 29-Dec-2015 John Baldwin <jhb@FreeBSD.org>

Add ptrace(2) reporting for LWP events.

Add two new LWPINFO flags: PL_FLAG_BORN and PL_FLAG_EXITED for reporting
thread creation and destruction. Newly created threads will stop to report
PL_FLAG_BORN before returning to userland and exiting threads will stop to
report PL_FLAG_EXIT before exiting completely. Both of these events are
only enabled and reported if PT_LWP_EVENTS is enabled on a process.


# 36160958 16-Dec-2015 Mark Johnston <markj@FreeBSD.org>

Fix style issues around existing SDT probes.

- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.

MFC after: 1 week


# aff57357 22-Oct-2015 Ed Schouten <ed@FreeBSD.org>

Add a way to distinguish between forking and thread creation in schedtail.

For CloudABI we need to initialize the registers of new threads
differently based on whether the thread got created through a fork or
through simple thread creation.

Add a flag, TDP_FORKING, that is set by do_fork() and cleared by
fork_exit(). This can be tested against in schedtail.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D3973


# d8f3dc78 16-Oct-2015 Konstantin Belousov <kib@FreeBSD.org>

If falloc_caps() failed, cleanup needs to be performed. This is a bug
in r289026.

Sponsored by: The FreeBSD Foundation


# 4b48959f 08-Oct-2015 Konstantin Belousov <kib@FreeBSD.org>

Enforce the maxproc limitation before allocating struct proc, initial
struct thread and kernel stack for the thread. Otherwise, a load
similar to a fork bomb would exhaust KVA and possibly kmem, mostly due
to the struct proc being type-stable.

The nprocs counter is changed from being protected by allproc_lock sx
to be an atomic variable. Note that ddb/db_ps.c:db_ps() use of nprocs
was unsafe before, and is still unsafe, but it seems that the only
possible undesired consequence is the harmless warning printed when
allproc linked list length does not match nprocs.

Diagnosed by: Svatopluk Kraus <onwahe@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 189ac973 06-Oct-2015 John Baldwin <jhb@FreeBSD.org>

Fix various edge cases related to system call tracing.
- Always set td_dbg_sc_* when P_TRACED is set on system call entry
even if the debugger is not tracing system call entries. This
ensures the fields are valid when reporting other stops that
occur at system call boundaries such as for PT_FOLLOW_FORKS or
when only tracing system call exits.
- Set TDB_SCX when reporting the stop for a new child process in
fork_return(). This causes the event to be reported as a system
call exit.
- Report a system call exit event in fork_return() for new threads in
a traced process.
- Copy td_dbg_sc_* to new threads instead of zeroing. This ensures
that td_dbg_sc_code in particular will report the system call that
created the new thread or process when it reports a system call
exit event in fork_return().
- Add new ptrace tests to verify that new child processes and threads
report system call exit events with a valid pl_syscall_code via
PT_LWPINFO.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D3822


# 2f2f522b 27-Sep-2015 Andriy Gapon <avg@FreeBSD.org>

save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE

SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after: 20 days


# edc82223 10-Aug-2015 Konstantin Belousov <kib@FreeBSD.org>

Make kstack_pages a tunable on arm, x86, and powepc. On i386, the
initial thread stack is not adjusted by the tunable, the stack is
allocated too early to get access to the kernel environment. See
TD0_KSTACK_PAGES for the thread0 stack sizing on i386.

The tunable was tested on x86 only. From the visual inspection, it
seems that it might work on arm and powerpc. The arm
USPACE_SVC_STACK_TOP and powerpc USPACE macros seems to be already
incorrect for the threads with non-default kstack size. I only
changed the macros to use variable instead of constant, since I cannot
test.

On arm64, mips and sparc64, some static data structures are sized by
KSTACK_PAGES, so the tunable is disabled.

Sponsored by: The FreeBSD Foundation
MFC after: 2 week


# 6236e71b 31-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Fix accidental line wrapping introduced in r286122.


# 367a13f9 31-Jul-2015 Ed Schouten <ed@FreeBSD.org>

Limit rights on process descriptors.

On CloudABI, the rights bits returned by cap_rights_get() match up with
the operations that you can actually perform on the file descriptor.

Limiting the rights is good, because it makes it easier to get uniform
behaviour across different operating systems. If process descriptors on
FreeBSD would suddenly gain support for any new file operation, this
wouldn't become exposed to CloudABI processes without first extending
the rights.

Extend fork1() to gain a 'struct filecaps' argument that allows you to
construct process descriptors with custom rights. Use this in
cloudabi_sys_proc_fork() to limit the rights to just fstat() and
pdwait().

Obtained from: https://github.com/NuxiNL/freebsd


# 6520495a 11-Jul-2015 Adrian Chadd <adrian@FreeBSD.org>

Add an initial NUMA affinity/policy configuration for threads and processes.

This is based on work done by jeff@ and jhb@, as well as the numa.diff
patch that has been circulating when someone asks for first-touch NUMA
on -10 or -11.

* Introduce a simple set of VM policy and iterator types.
* tie the policy types into the vm_phys path for now, mirroring how
the initial first-touch allocation work was enabled.
* add syscalls to control changing thread and process defaults.
* add a global NUMA VM domain policy.
* implement a simple cascade policy order - if a thread policy exists, use it;
if a process policy exists, use it; use the default policy.
* processes inherit policies from their parent processes, threads inherit
policies from their parent threads.
* add a simple tool (numactl) to query and modify default thread/process
policities.
* add documentation for the new syscalls, for numa and for numactl.
* re-enable first touch NUMA again by default, as now policies can be
set in a variety of methods.

This is only relevant for very specific workloads.

This doesn't pretend to be a final NUMA solution.

The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
'sysctl vm.default_policy=rr'.

This is only relevant if MAXMEMDOM is set to something other than 1.
Ie, if you're using GENERIC or a modified kernel with non-NUMA, then
this is a glorified no-op for you.

Thank you to Norse Corp for giving me access to rather large
(for FreeBSD!) NUMA machines in order to develop and verify this.

Thank you to Dell for providing me with dual socket sandybridge
and westmere v3 hardware to do NUMA development with.

Thank you to Scott Long at Netflix for providing me with access
to the two-socket, four-domain haswell v3 hardware.

Thank you to Peter Holm for running the stress testing suite
against the NUMA branch during various stages of development!

Tested:

* MIPS (regression testing; non-NUMA)
* i386 (regression testing; non-NUMA GENERIC)
* amd64 (regression testing; non-NUMA GENERIC)
* westmere, 2 socket (thankyou norse!)
* sandy bridge, 2 socket (thankyou dell!)
* ivy bridge, 2 socket (thankyou norse!)
* westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
* haswell, 2 socket (thankyou norse!)
* haswell v3, 2 socket (thankyou dell)
* haswell v3, 2x18 core (thankyou scott long / netflix!)

* Peter Holm ran a stress test suite on this work and found one
issue, but has not been able to verify it (it doesn't look NUMA
related, and he only saw it once over many testing runs.)

* I've tested bhyve instances running in fixed NUMA domains and cpusets;
all seems to work correctly.

Verified:

* intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
NUMA policies for processes under test.

Review:

This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
as well as privately and via emails to freebsd-arch@. The git history
with specific attributes is available at https://github.com/erikarn/freebsd/
in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).

This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
wblock) but not achieved a clear consensus. My hope is that with further
exposure and testing more functionality can be implemented and evaluated.

Notes:

* The VM doesn't handle unbalanced domains very well, and if you have an overly
unbalanced memory setup whilst under high memory pressure, VM page allocation
may fail leading to a kernel panic. This was a problem in the past, but it's
much more easily triggered now with these tools.

* This work only controls the path through vm_phys; it doesn't yet strongly/predictably
affect contigmalloc, KVA placement, UMA, etc. So, driver placement of memory
isn't really guaranteed in any way. That's next on my plate.

Sponsored by: Norse Corp, Inc.; Dell


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 4ea6a9a2 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Generalised support for copy-on-write structures shared by threads.

Thread credentials are maintained as follows: each thread has a pointer to
creds and a reference on them. The pointer is compared with proc's creds on
userspace<->kernel boundary and updated if needed.

This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.


# 515b7a0b 25-May-2015 John Baldwin <jhb@FreeBSD.org>

Add KTR tracing for some MI ptrace events.

Differential Revision: https://reviews.freebsd.org/D2643
Reviewed by: kib


# edf1796d 06-May-2015 Mateusz Guzik <mjg@FreeBSD.org>

Fix up panics when fork fails due to hitting proc limit

The function clearning credentials on failure asserts the process is a
zombie, which is not true when fork fails.

Changing creds to NULL is unnecessary, but is still being done for
consistency with other code.

Pointy hat: mjg
Reported by: pho


# 90f54cbf 11-Apr-2015 Mateusz Guzik <mjg@FreeBSD.org>

fd: remove filedesc argument from fdclose

Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.


# ffb34484 21-Mar-2015 Mateusz Guzik <mjg@FreeBSD.org>

cred: add proc_set_cred_init helper

proc_set_cred_init can be used to set first credentials of a new
process.

Update proc_set_cred assertions so that it only expects already used
processes.

This fixes panics where p_ucred of a new process happens to be non-NULL.

Reviewed by: kib


# 12cec311 21-Mar-2015 Mateusz Guzik <mjg@FreeBSD.org>

fork: assign refed credentials earlier

Prior to this change the kernel would take p1's credentials and assign
them tempororarily to p2. But p1 could change credentials at that time
and in effect give us a use-after-free.

No objections from: kib


# daf63fd2 15-Mar-2015 Mateusz Guzik <mjg@FreeBSD.org>

cred: add proc_set_cred helper

The goal here is to provide one place altering process credentials.

This eases debugging and opens up posibilities to do additional work when such
an action is performed.


# 677258f7 18-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

Add procctl(2) PROC_TRACE_CTL command to enable or disable debugger
attachment to the process. Note that the command is not intended to
be a security measure, rather it is an obfuscation feature,
implemented for parity with other operating systems.

Discussed with: jilles, rwatson
Man page fixes by: rwatson
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 237623b0 14-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add a facility for non-init process to declare itself the reaper of
the orphaned descendants. Base of the API is modelled after the same
feature from the DragonFlyBSD.

Requested by: bapt
Reviewed by: jilles (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 6ddcc233 13-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume. It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# eb48fbd9 13-Nov-2014 Mateusz Guzik <mjg@FreeBSD.org>

filedesc: fixup fdinit to lock fdp and preapare files conditinally

Not all consumers providing fdp to copy from want files.

Perhaps these functions should be reorganized to better express the outcome.

This fixes up panics after r273895 .

Reported by: markj


# b9d32c36 27-Jun-2014 Mateusz Guzik <mjg@FreeBSD.org>

Make fdunshare accept only td parameter.

Proc had to match the thread anyway and 2 parameters were inconsistent
with the rest.

MFC after: 1 week


# 7159310f 17-Dec-2013 Mark Johnston <markj@FreeBSD.org>

The fasttrap fork handler is responsible for removing tracepoints in the
child process that were inherited from its parent. However, this should
not be done in the case of a vfork, since the fork handler ends up removing
the tracepoints from the shared vm space, and userland DTrace probes in the
parent will no longer fire as a result.

Now the child of a vfork may trigger userland DTrace probes enabled in its
parent, so modify the fasttrap probe handler to handle this case and handle
the child process in the same way that it would handle the traced process.
In particular, if once traces function foo() in a process that vforks, and
the child calls foo(), fasttrap will treat this call as having come from the
parent. This is the behaviour of the upstream code.

While here, add #ifdef guards to some code that isn't present upstream.

MFC after: 1 month


# f2b525e6 30-Nov-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after: 3 days


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 55648840 19-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# 7b77e1fe 14-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after: 2 weeks


# a208417c 19-Apr-2013 Jaakko Heinonen <jh@FreeBSD.org>

Include PID in the error message which is printed when the maxproc limit
is exceeded. Improve formatting of the message while here.

PR: kern/60550
Submitted by: Lowell Gilbert, bde


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# de265498 17-Feb-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove redundant parenthesis.


# d4015944 14-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove a special case for XEN, which is erronous and makes vfork(2)
behaviour to differ from the documented, only on XEN. If there are
any issues with XEN pmap left, they should be fixed in pmap.

MFC after: 2 weeks


# f7e50ea7 04-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix a race between kern_setitimer() and realitexpire(), where the
callout is started before kern_setitimer() acquires process mutex, but
looses a race and kern_setitimer() gets the process mutex before the
callout. Then, assuming that new specified struct itimerval has
it_interval zero, but it_value non-zero, the callout, after it starts
executing again, clears p->p_realtimer.it_value, but kern_setitimer()
already rescheduled the callout.

As the result of the race, both p_realtimer is zero, and the callout
is rescheduled. Then, in the exit1(), the exit code sees that it_value
is zero and does not even try to stop the callout. This allows the
struct proc to be reused and eventually the armed callout is
re-initialized. The consequence is the corrupted callwheel tailq.

Use process mutex to interlock the callout start, which fixes the race.

Reported and tested by: pho
Reviewed by: jhb
MFC after: 2 weeks


# 324e5715 08-Sep-2012 Attilio Rao <attilio@FreeBSD.org>

userret() already checks for td_locks when INVARIANTS is enabled, so
there is no need to check if Giant is acquired after it.

Reviewed by: kib
MFC after: 1 week


# 02c6fc21 15-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a sysctl kern.pid_max, which limits the maximum pid the system is
allowed to allocate, and corresponding tunable with the same
name. Note that existing processes with higher pids are left intact.

MFC after: 1 week


# 0a7007b9 19-Jun-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

The falloc() function obtains two references to newly created 'fp'.
On success we have to drop one after procdesc_finit() and on failure
we have to close allocated slot with fdclose(), which also drops one
reference for us and drop the remaining reference with fdrop().

Without this change closing process descriptor didn't result in killing
pdfork(2)ed child.

Reviewed by: rwatson
MFC after: 1 month


# 97681567 26-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Stop treating td_sigmask specially for the purposes of new thread
creation. Move it into the copied region of the struct thread.

Update some comments.

Requested by: bde
X-MFC after: never


# ab27d5d8 22-May-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix panic with RACCT that could occur in low memory (or out of swap)
situations, due to fork1() calling racct_proc_exit() without calling
racct_proc_fork() first.

Submitted by: Mateusz Guzik <mjguzik at gmail dot com> (earlier version)
Reviewed by: Mateusz Guzik <mjguzik at gmail dot com>


# 1d7ca9bb 27-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Currently, the debugger attached to the process executing vfork() does
not get syscall exit notification until the child performed exec of
exit. Swap the order of doing ptracestop() and waiting for P_PPWAIT
clearing, by postponing the wait into syscallret after ptracestop()
notification is done.

Reported, tested and reviewed by: Dmitry Mikulin <dmitrym juniper net>
MFC after: 2 weeks


# dcd43281 23-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Allow the parent to gather the exit status of the children reparented
to the debugger. When reparenting for debugging, keep the child in
the new orphan list of old parent. When looping over the children in
kern_wait(), iterate over both children list and orphan list to search
for the process by pid.

Submitted by: Dmitry Mikulin <dmitrym juniper.net>
MFC after: 2 weeks


# db327339 09-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Mark the automatically attached child with PL_FLAG_CHILD in struct
lwpinfo flags, for PT_FOLLOWFORK auto-attachment.

In collaboration with: Dmitry Mikulin <dmitrym juniper net>
MFC after: 1 week


# 2d8696d1 03-Oct-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move some code inside the racct_proc_fork(); it spares a few lock operations
and it's more logical this way.

MFC after: 3 days


# 72a401d9 03-Oct-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix another bug introduced in r225641, which caused rctl to access certain
fields in 'struct proc' before they got initialized in do_fork().

MFC after: 3 days


# 1dbf9dcc 17-Sep-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix error handling bug that would prevent MAC structures from getting
freed properly if resource limit got exceeded.

Approved by: re (kib)


# b38520f0 17-Sep-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix long-standing thinko regarding maxproc accounting. Basically,
we were accounting the newly created process to its parent instead
of the child itself. This caused problems later, when the child
changed its credentials - the per-uid, per-jail etc counters were
not properly updated, because the maxproc counter in the child
process was 0.

Approved by: re (kib)


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# cfb5f768 18-Aug-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Add experimental support for process descriptors

A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc


# f49d8202 12-Jul-2011 Konstantin Belousov <kib@FreeBSD.org>

Implement an RFTSIGZMB flag to rfork(2) to specify a signal that is
delivered to parent when the child exists.

Submitted by: Petr Salinger <Petr.Salinger seznam cz> (Debian/kFreeBSD)
MFC after: 1 week
X-MFC-note: bump __FreeBSD_version


# afcc55f3 06-Jul-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".


# 58c77a9d 31-Mar-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Enable accounting for RACCT_NPROC and RACCT_NTHR.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 097055e2 29-Mar-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add racct. It's an API to keep per-process, per-jail, per-loginclass
and per-loginclass resource accounting information, to be used by the new
resource limits code. It's connected to the build, but the code that
actually calls the new functions will come later.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)


# 8e6fa660 24-Mar-2011 John Baldwin <jhb@FreeBSD.org>

Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after: 1 week


# e5d81ef1 08-Mar-2011 Dmitry Chagin <dchagin@FreeBSD.org>

Extend struct sysvec with new method sv_schedtail, which is used for an
explicit process at fork trampoline path instead of eventhadler(schedtail)
invocation for each child process.

Remove eventhandler(schedtail) code and change linux ABI to use newly added
sysvec method.

While here replace explicit comparing of module sysentvec structure with the
newly created process sysentvec to detect the linux ABI.

Discussed with: kib

MFC after: 2 Week


# 7705d4b2 25-Feb-2011 Dmitry Chagin <dchagin@FreeBSD.org>

Introduce preliminary support of the show description of the ABI of
traced process by adding two new events which records value of process
sv_flags to the trace file at process creation/execing/exiting time.

MFC after: 1 Month.


# 6fa39a73 25-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Allow debugger to specify that children of the traced process should be
automatically traced. Extend the ptrace(PL_LWPINFO) to report that child
just forked.

Reviewed by: davidxu, jhb
MFC after: 2 weeks


# 22d19207 06-Jan-2011 John Baldwin <jhb@FreeBSD.org>

- Move sched_fork() later in fork() after the various sections of the new
thread and proc have been copied and zeroed from the old thread and
proc. Otherwise attempts to modify thread or process data in sched_fork()
could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
sched_fork_thread() in ULE. This is already done courtesy the bcopy()
of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
new thread's base priority (td_base_pri) to avoid bogusly inheriting a
borrowed priority from the parent thread.

MFC after: 2 weeks


# 3e73ff1e 01-Jan-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Finishing touches to fork1() - ANSIfy missed function definition, style(9)
fixes, removal of few comments that didn't really make sense and addition
of fork_findpid() locking requirements.


# afd01097 10-Dec-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Refactor fork1() to make it easier to follow. No functional changes.

Reviewed by: kib (earlier version)
Tested by: pho


# acbe332a 08-Dec-2010 David Xu <davidxu@FreeBSD.org>

MFp4:
It is possible a lower priority thread lending priority to higher priority
thread, in old code, it is ignored, however the lending should always be
recorded, add field td_lend_user_pri to fix the problem, if a thread does
not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week


# 087bfb0e 06-Dec-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add a KASSERT to make it obvious when fork_norfproc() is to be called,
and set *procp to NULL in all cases. Previously, it was not being set
in the ERESTART case. This is effectively no-op, since its value is
ignored by callers in the error case.

Reviewed by: kib@


# f68c74bb 06-Dec-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix style bug introduced by previous commit.


# 1d845e86 06-Dec-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Improve readability by factoring out the !RFPROC case. While here,
turn K&R function definitions into ANSI. No functional changes.

Reviewed by: kib@


# d680caab 21-Oct-2010 John Baldwin <jhb@FreeBSD.org>

- When disabling ktracing on a process, free any pending requests that
may be left. This fixes a memory leak that can occur when tracing is
disabled on a process via disabling tracing of a specific file (or if
an I/O error occurs with the tracefile) if the process's next system
call is exit(). The trace disabling code clears p_traceflag, so exit1()
doesn't do any KTRACE-related cleanup leading to the leak. I chose to
make the free'ing of pending records synchronous rather than patching
exit1().
- Move KTRACE-specific logic out of kern_(exec|exit|fork).c and into
kern_ktrace.c instead. Make ktrace_mtx private to kern_ktrace.c as a
result.

MFC after: 1 month


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# cf7d9a8c 08-Oct-2010 David Xu <davidxu@FreeBSD.org>

Create a global thread hash table to speed up thread lookup, use
rwlock to protect the table. In old code, thread lookup is done with
process lock held, to find a thread, kernel has to iterate through
process and thread list, this is quite inefficient.
With this change, test shows in extreme case performance is
dramatically improved.

Earlier patch was reviewed by: jhb, julian


# d3555b6f 09-Sep-2010 Rui Paulo <rpaulo@FreeBSD.org>

Fix two bugs in DTrace:
* when the process exits, remove the associated USDT probes
* when the process forks, duplicate the USDT probes.

Sponsored by: The FreeBSD Foundation


# 79856499 22-Aug-2010 Rui Paulo <rpaulo@FreeBSD.org>

Add an extra comment to the SDT probes definition. This allows us to get
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by: The FreeBSD Foundation
Discussed with: rwaston [1]


# c193de56 11-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207468:
Extract thread_lock()/ruxagg()/thread_unlock() fragment into utility
function ruxagg_tlock().
Convert the definition of kern_getrusage() to ANSI C.

MFC r207602:
Implement RUSAGE_THREAD. Add td_rux to keep extended runtime and ticks
information for thread to allow calcru1() (re)use.

Rename ruxagg()->ruxagg_locked(), ruxagg_tlock()->ruxagg() [1].
The ruxagg_locked() function no longer clears thread ticks nor
td_incruntime.

Not an MFC: the td_rux is added to the end of struct thread to keep
the KBI. Explicit bzero() of td_rux is added to new thread initialization
points.


# 2af00dec 08-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r196730:
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ.

Introduce separate kernel stack cache that keeps some limited amount of
preallocated kernel stacks to lower the latency of thread allocation.

Not a merge: instead of removing td_altkstack* members of struct thread,
replace them with placeholders to keep struct thread layout on the
stable branch.

Also, record r196640, r196644 and r196648 as merged.

Approved by: re (kensmith)


# 8a945d10 01-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

Reintroduce the r196640, after fixing the problem with my testing.

Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho (and retested according to new test scenarious)
MFC after: 1 week


# f25fa6ab 29-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Reverse r196640 and r196644 for now.


# b6b2d1bf 29-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Dispose the kernel stack of the proper thread.

Submitted by: alc
MFC after: 1 week


# c3cf0b47 29-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho
MFC after: 1 week


# 7afcbc18 17-Jul-2009 Jamie Gritton <jamie@FreeBSD.org>

Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by: re (kib), bz (mentor)


# 14961ba7 27-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# 3364c323 23-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding struct ucred *,
that is used to charge appropriate uid when allocation if performed by
kernel, e.g. md(4).

Several syscalls, among them is fork(2), may now return ENOMEM when
global or per-uid limits are enforced.

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


# d8b0556c 10-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 29b02909 08-May-2009 Marko Zec <zec@FreeBSD.org>

Introduce a new virtualization container, provisionally named vprocg, to hold
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg. Struct vprocg is likely to become replaced in the near future with
a new jail management API import.

As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.

Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.

Permit kldload / kldunload operations to be executed only from the default
vimage context.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by: bz
Approved by: julian (mentor)


# 21ca7b57 05-May-2009 Marko Zec <zec@FreeBSD.org>

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by: julian (mentor)


# aeb32571 05-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.

Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by: ganbold
Suggested and reviewed by: jhb
MFC after: 2 week


# 413628a7 29-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 50d6e424 18-Oct-2008 Kip Macy <kmacy@FreeBSD.org>

- Forward port flush of page table updates on context switch or userret
- Forward port vfork XEN hack


# 8b4a2800 23-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

Do the pargs_hold() on the copy of the pointer to the p_args of the
child process immediately after bulk bcopy() without dropping the
process lock.

Since process is not single-threaded when forking, dropping and
reacquiring the lock allows an other thread to change the process title
of the parent in between, and results in hold being done on the invalid
pointer. The problem manifested itself as the double free of the old
p_args.

Reported by: kris
Reviewed by: jhb
MFC after: 1 week


# 7054ee4e 07-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

The kqueue_register() function assumes that it is called from the top of
the syscall code and acquires various event subsystem locks as needed.
The handling of the NOTE_TRACK for EVFILT_PROC is currently done by
calling the kqueue_register() from filt_proc() filter, causing recursive
entrance of the kqueue code. This results in the LORs and recursive
acquisition of the locks.

Implement the variant of the knote() function designed to only handle
the fork() event. It mostly copies the knote() body, but also handles
the NOTE_TRACK, removing the handling from the filt_proc(), where it
causes problems described above. The function is called from the fork1()
instead of knote().

When encountering NOTE_TRACK knote, it marks the knote as influx
and drops the knlist and kqueue lock. In this context call to
kqueue_register is safe from the problems.

An error from the kqueue_register() is reported to the observer as
NOTE_TRACKERR fflag.

PR: 108201
Reviewed by: jhb, Pramod Srinivasan <pramod juniper net> (previous version)
Discussed with: jmg
Tested by: pho
MFC after: 2 weeks


# 5d217f17 24-May-2008 John Birrell <jb@FreeBSD.org>

Add DTrace 'proc' provider probes using the Statically Defined Trace
(sdt) mechanism.


# 69aa768a 20-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Fix the leak of the vmspace on the fork when the process limits
are exceeded.

Pointy hat to: me
MFC after: 3 days


# 0ac213ef 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Don't call the empty sched_newproc() function. sched_newproc() already
existed as sched_fork() which is a non empty function in both schedulers.


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 4b9322ae 14-Nov-2007 Julian Elischer <julian@FreeBSD.org>

When forking, the new thread deserves a name too. Don't just use the
td_startcopy section as it is not the right thing to do
in other cases (e.g. if starting a new thread from one that is already named).


# e01eafef 13-Nov-2007 Julian Elischer <julian@FreeBSD.org>

A bunch more files that should probably print out a thread name
instead of a process name.


# 89b57fcf 05-Nov-2007 Konstantin Belousov <kib@FreeBSD.org>

Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicing or dereferencing NULL.

As consequence, vmspace_exec() and vmspace_unshare() returns the errno
int. struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in the thread_alloc(),
that itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with: Peter Holm
Reviewed by: jhb


# f4bb4fc8 02-Nov-2007 Julian Elischer <julian@FreeBSD.org>

Completely remove the code for single threading the mainline fork code.
Put in a little comment explaining why it went away.
Re-enable it in the case there an exisiting process is just splitting
off its address space and file descriptors.
(I donpt think anything uses that code but it needs some sort of locking
and this does the job.

Reviewed by: Davidxu, alc, others
MFC after: 3 days


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# e9271f53 23-Oct-2007 Julian Elischer <julian@FreeBSD.org>

Take out the single-threading code in fork.
After discussions with jeff, alc, (various Ironport people), david Xu,
and mostly Alfred (who found the problem) it has been demonstrated that this
is not needed for our implementations of threads and represents a real
(as in we've seen it happen a lot) deadlock danger.

Several points:
Since forking multiple threads is not allowed, and posix states that
any mutexes owned by othre threads wilol be owned in the child by
phantom threads, and therads shouldn't ba accessing shared structures without
protection, It can be proved that if this leads to the child process accessing
inconsistent data, it's a programming error.

The mode of thread_single() being used in fork() is the wrong one.
It is using SINGLE_NO_EXIT when it should be using SINGLE_BOUNDARY.

Even if this we used, System processes have no need to do it as they have
no userland to get inconsistent.

This commmit first fixes the above bugs to get tehm correct in CVS.
then removes them with #ifdef.
This is so that history contains the corrected version should it
be needed in the future.
This code may be needed if we implement the forkall() syscall from
Solaris. It may be needed for other non-posix thread libraries
at some time in the future, so let the code sit for a short while
while I do some work on it anyhow.

This removes a reproducible lockup in NFS.
It may be argued that maybe doing a fork while holding a vnode lock may
not be the best idea in th efirst place but it shouldn't cause a deadlock.
The removal has been running under soak test for several days now.

This removal should be seriously considered for 7.0 and RELENG_6.

Note. There is code in the core-dumping code that may have a similar problem
with coredumping threaded processes

MFC After: 4 days


# 3745c395 20-Oct-2007 Julian Elischer <julian@FreeBSD.org>

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# 54b0e65f 20-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.

Approved by: re


# b61ce5b0 16-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 7251b786 16-Jun-2007 Robert Watson <rwatson@FreeBSD.org>

Rather than passing SUSER_RUID into priv_check_cred() to specify when
a privilege is checked against the real uid rather than the effective
uid, instead decide which uid to use in priv_check_cred() based on the
privilege passed in. We use the real uid for PRIV_MAXFILES,
PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of
SUSER_RUID; there are now no flags defined for priv_check_cred().

Obtained from: TrustedBSD Project


# fe54587f 12-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move some common code out of sched_fork_exit() and back into fork_exit().


# 32f9753c 11-Jun-2007 Robert Watson <rwatson@FreeBSD.org>

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


# 393a081d 10-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Optimize vmmeter locking.
In particular:
- Add an explicative table for locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some unuseful comments

Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)


# faef5371 07-Jun-2007 Robert Watson <rwatson@FreeBSD.org>

Move per-process audit state from a pointer in the proc structure to
embedded storage in struct ucred. This allows audit state to be cached
with the thread, avoiding locking operations with each system call, and
makes it available in asynchronous execution contexts, such as deep in
the network stack or VFS.

Reviewed by: csjp
Approved by: re (kensmith)
Obtained from: TrustedBSD Project


# 11bda9b8 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 6/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.
- Replace the tail-end of fork_exit() with a scheduler specific routine
which can do the appropriate lock manipulations.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 1c4bcd05 31-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# eefc4972 18-May-2007 Robert Watson <rwatson@FreeBSD.org>

Remove unnecessary assignment.

CID: 2227
Found with: Coverity Prevent(tm)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 5e3f7694 04-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# 0c14ff0e 04-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# 84d37a46 27-Feb-2007 John Baldwin <jhb@FreeBSD.org>

Use pause() rather than tsleep() on explicit global dummy variables.


# 1ad9ee86 25-Feb-2007 Xin LI <delphij@FreeBSD.org>

Close race conditions between fork() and [sg]etpriority()'s
PRIO_USER case, possibly also other places that deferences
p_ucred.

In the past, we insert a new process into the allproc list right
after PID allocation, and release the allproc_lock sx. Because
most content in new proc's structure is not yet initialized,
this could lead to undefined result if we do not handle PRS_NEW
with care.

The problem with PRS_NEW state is that it does not provide fine
grained information about how much initialization is done for a
new process. By defination, after PRIO_USER setpriority(), all
processes that belongs to given user should have their nice value
set to the specified value. Therefore, if p_{start,end}copy
section was done for a PRS_NEW process, we can not safely ignore
it because p_nice is in this area. On the other hand, we should
be careful on PRS_NEW processes because we do not allow non-root
users to lower their nice values, and without a successful copy
of the copy section, we can get stale values that is inherted
from the uninitialized area of the process structure.

This commit tries to close the race condition by grabbing proc
mutex *before* we release allproc_lock xlock, and do copy as
well as zero immediately after the allproc_lock xunlock. This
guarantees that the new process would have its p_copy and p_zero
sections, as well as user credential informaion initialized. In
getpriority() case, instead of grabbing PROC_LOCK for a PRS_NEW
process, we just skip the process in question, because it does
not affect the final result of the call, as the p_nice value
would be copied from its parent, and we will see it during
allproc traverse.

Other potential solutions are still under evaluation.

Discussed with: davidxu, jhb, rwatson
PR: kern/108071
MFC after: 2 weeks


# f0393f06 23-Jan-2007 Jeff Roberson <jeff@FreeBSD.org>

- Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.

Discussed with: julian

- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.

Suggested by: jhb

Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.


# ad1e7d28 05-Dec-2006 Julian Elischer <julian@FreeBSD.org>

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# 8460a577 26-Oct-2006 John Birrell <jb@FreeBSD.org>

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 993182e5 14-Aug-2006 Alexander Leidinger <netchild@FreeBSD.org>

- Change process_exec function handlers prototype to include struct
image_params arg.
- Change struct image_params to include struct sysentvec pointer and
initialize it.
- Change all consumers of process_exit/process_exec eventhandlers to
new prototypes (includes splitting up into distinct exec/exit functions).
- Add eventhandler to userret.

Sponsored by: Google SoC 2006
Submitted by: rdivacky
Parts suggested by: jhb (on hackers@)


# 38affe13 01-Aug-2006 John Baldwin <jhb@FreeBSD.org>

Don't lock each of the processes while looking for a pid. The allproc and
proctree locks that we already hold provide sufficient protection.


# 2905ade2 27-Jun-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Use suser_cred(9) instead of checking cr_ruid directly.
- For privileged processes safe two mutex operations.

We may want to consider if this is good idea to use SUSER_ALLOWJAIL here,
but for now I didn't wanted to change the original behaviour.

Reviewed by: rwatson


# 795a11d0 15-Mar-2006 David Xu <davidxu@FreeBSD.org>

Fix a race between file operations and rfork(RFCFDG) by parking
all other threads at user boundary, the race can crash kernel
under stress testing.

Reviewed by: jhb
MFC after: 3 days


# eb2da9a5 08-Feb-2006 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify system time accounting for profiling.

Rename struct thread's td_sticks to td_pticks, we will need the
other name for more appropriately named use shortly. Reduce it
from uint64_t to u_int.

Clear td_pticks whenever we enter the kernel instead of recording
its value as reference for userret(). Use the absolute value of
td->pticks in userret() and eliminate third argument.


# 2c9d9d39 06-Feb-2006 John Baldwin <jhb@FreeBSD.org>

We don't need the proc lock to check P_KTHREAD on curthread since it is
only set before the kthread starts executing and is never cleared.


# ad20c8f3 05-Feb-2006 Wayne Salamon <wsalamon@FreeBSD.org>

Audit the args to rfork(), and the child PID for all fork system calls.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)


# fcf7f27a 01-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Hook up audit to fork() and exit() events. These changes manage the
audit state on processes, not auditing of these events.

Much work by: wsalamon
Obtained from: TrustedBSD Project


# 2c255e9d 13-Nov-2005 Robert Watson <rwatson@FreeBSD.org>

Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueuing and dropped records.
Record loss was previously possible when the global pool of records
become depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as records
are still used.

- Add a per-process record queue, which will hold any asynchronously
generated records, such as from context switches. This replaces the
global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
process.

- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.

- When a process returns to user space following a system call, trap,
signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as generating
traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on. I.e., performing a
drain every (n) records.

MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>


# 571dcd15 01-Jul-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# 3d5c30f7 20-Apr-2005 David Xu <davidxu@FreeBSD.org>

Inherit signal mask for child process in fork1(), RELENG_4 and other
*BSD have this behaviour, also it is required by POSIX.

PR: kern/80130
Submitted by: Kostik Belousov konstantin.belousov at zoral dot com dot ua


# c6a37e84 04-Apr-2005 John Baldwin <jhb@FreeBSD.org>

Divorce critical sections from spinlocks. Critical sections as denoted by
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any affect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.

Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.

This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).

Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# c113083c 14-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add new function fdunshare() which encapsulates the necessary light magic
for ensuring that a process' filedesc is not shared with anybody.

Use it in the two places which previously had private implmentations.

This collects all fd_refcnt handling in kern_descrip.c


# 6004362e 26-Nov-2004 David Schultz <das@FreeBSD.org>

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


# 6db36923 20-Nov-2004 David Schultz <das@FreeBSD.org>

Remove local definitions of RANGEOF() and use __rangeof() instead.
Also remove a few bogus casts.


# 8b059651 19-Nov-2004 David Schultz <das@FreeBSD.org>

Malloc p_stats instead of putting it in the U area. We should consider
simply embedding it in struct proc.

Reviewed by: arch@


# 124e4c3b 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.

Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.


# 598b7ec8 07-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Use more intuitive pointer for fdinit() and fdcopy().

Change fdcopy() to take unlocked filedesc.


# 8ec21e3a 06-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Allow fdinit() to be called with a NULL fdp argument so we can use
it when setting up init.

Make fdinit() lock the fdp argument as needed.


# cda5aba4 06-Oct-2004 David Schultz <das@FreeBSD.org>

Back out rev 1.240; it is unnecessary. In particular,
p1 == curthread, so _PHOLD(p1) will not have to block
to swap in p1.

Noticed by: jhb


# 299bc736 30-Sep-2004 David Schultz <das@FreeBSD.org>

Avoid calling _PHOLD(p1) with p2's lock held, since _PHOLD()
may block to swap in p1. Instead, call _PHOLD earlier, at a
point where the only lock held happens to be p1's.


# a3aa5592 13-Sep-2004 Julian Elischer <julian@FreeBSD.org>

make some of these conditions apply equally to both threading systems.


# ed062c8d 04-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 94ddc707 02-Sep-2004 Alan Cox <alc@FreeBSD.org>

Push Giant deep into vm_forkproc(), acquiring it only if the process has
mapped System V shared memory segments (see shmfork_myhook()) or requires
the allocation of an ldt (see vm_fault_wire()).


# 2630e4c9 31-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after: 2 days


# 99e9dcb8 31-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Remove sched_free_thread() which was only used
in diagnostics. It has outlived its usefulness and has started
causing panics for people who turn on DIAGNOSTIC, in what is otherwise
good code.

MFC after: 2 days


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 732d9528 09-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Increase the amount of data exported by KTR in the KTR_RUNQ setting.
This extra data is needed to really follow what is going on in the
threaded case.


# 0047b9a9 26-Jul-2004 Bosko Milekic <bmilekic@FreeBSD.org>

Move the schedlock owner state update following the context
switch in fork_exit() to before anything else is done (but keep
schedlock for the deadthread check). This means one less
nasty bug if ever in the future whatever might have been called
before the update played with schedlock or critical sections.

Discussed with: tjr


# 66d5c640 26-Jul-2004 Colin Percival <cperciva@FreeBSD.org>

In revision 1.228, I accidentally broke the "total number of processes in
the system" resource limit code: When checking if the caller has superuser
privileges, we should be checking the *real* user, not the *effective*
user. (In general, resource limiting is done based on the real user, in
order to avoid resource-exhaustion-by-setuid-program attacks.)

Now that a SUSER_RUID flag to suser_cred exists, use it here to return
this code to its correct behaviour.

Pointed out by: rwatson


# 55d44f79 18-Jul-2004 Julian Elischer <julian@FreeBSD.org>

When calling scheduler entrypoints for creating new threads and processes,
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theorotically this would lead to runtime being allocated to the wrong
entity in some cases though it is not clear how often this actually happenned.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)

Reviewed by: peter


# 49bddf0c 13-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

fix compilation.


# 65bba83f 13-Jul-2004 Colin Percival <cperciva@FreeBSD.org>

Replace "uid != 0" with "suser(td->td_ucred) != 0" when checking if we've
hit the maximum number of processes. The last ten processes are reserved
for the *non-jailed* superuser.


# 247aba24 26-Jun-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Allocate TIDs in thread_init() and deallocate them in thread_fini().
The overhead of unconditionally allocating TIDs (and likewise,
unconditionally deallocating them), is amortized across multiple
thread creations by the way UMA makes it possible to have type-stable
storage.
Previously the cost was kept down by having threads created as part
of a fork operation use the process' PID as the TID. While this had
some nice properties, it also introduced complexity in the way TIDs
were allocated. Most importantly, by using the type-stable storage
that UMA gives us this was also unnecessary.

This change affects how core dumps are created and in particular how
the PRSTATUS notes are dumped. Since we don't have a thread with a
TID equalling the PID, we now need a different way to preserve the
old and previous behavior. We do this by having the given thread (i.e.
the thread passed to the core dump code in td) dump it's state first
and fill in pr_pid with the actual PID. All other threads will have
pr_pid contain their TIDs. The upshot of all this is that the debugger
will now likely select the right LWP (=TID) as the initial thread.

Credits to: julian@ for spotting how we can utilize UMA.
Thanks to: all who provided julian@ with test results.


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# fdcac928 03-Apr-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Assign thread IDs to kernel threads. The purpose of the thread ID (tid)
is twofold:
1. When a 1:1 or M:N threaded process dumps core, we need to put the
register state of each of its kernel threads in the core file.
This can only be done by differentiating the pid field in the
respective note. For this we need the tid.
2. When thread support is present for remote debugging the kernel
with gdb(1), threads need to be identified by an integer due to
limitations in the remote protocol. This requires having a tid.

To minimize the impact of having thread IDs, threads that are created
as part of a fork (i.e. the initial thread in a process) will inherit
the process ID (i.e. tid=pid). Subsequent threads will have IDs larger
than PID_MAX to avoid interference with the pid allocation algorithm.
The assignment of tids is handled by thread_new_tid().

The thread ID allocation algorithm has been written with 3 assumptions
in mind:
1. IDs need to be created as fast a possible,
2. Reuse of IDs may happen instantaneously,
3. Someone else will write a better algorithm.


# a5bdcb2a 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Make the process_exit eventhandler run without Giant. Add Giant hooks
in the two consumers that need it.. processes using AIO and netncp.
Update docs. Say that process_exec is called with Giant, but not to
depend on it. All our consumers can handle it without Giant.


# 8a412f31 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Move the process_fork event out from under Giant. This one is easy,
since there are no consumers in the tree. Document this.


# 37814395 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Push Giant down a little further:
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() s not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe


# 0235bf02 09-Mar-2004 John-Mark Gurney <jmg@FreeBSD.org>

make sure we had the filedesc lock when calling fdinit when RFCFDG is set
on call to rfork.

Submitted by: Brian Buchanan
Semi-Reviewed by: rwatson


# a69d88af 07-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Move a vref call outside of proc locks and Giant. By virtue of the fact
that we (p1) are currently running, we hold a reference on p_textvp which
means the vnode cannot go away. p2 cannot run yet (and hence cannot exit)
so this should be safe to do at this point. As a bonus, it removes a
block of under-Giant code that was there to support the vref.


# 5fadbfea 06-Mar-2004 Alan Cox <alc@FreeBSD.org>

Giant is not required for vm_thread_new_altkstack().


# 5750ee29 05-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Add a missing part of jhb's previous commit. It looks like he had a
patch chunk rejected that he missed. This would manifest as a lock
assertion panic at boot (Giant not locked in kern_fork.c).

Obtained from: jhb


# 5ce2f678 05-Mar-2004 John Baldwin <jhb@FreeBSD.org>

- Grab a share lock of the proctree lock while looking for a pid due to the
process group and session dereferences. Also, check that p_pgrp and
p_sesssion are NULL before dereferencing them.
- Push down Giant in fork1().

Requested by: peter


# c8564ad4 04-Mar-2004 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs (mainly English usage errors in comments).


# 47934cef 25-Feb-2004 Don Lewis <truckman@FreeBSD.org>

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 4c3558aa 05-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Always set a process' state to normal when it is fully constructed in
fork1() rather than only doing it for the RFSTOPPED case and then having
to fix it up in other places later on.


# 91d5354a 04-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 6bea667f 25-Jan-2004 Robert Watson <rwatson@FreeBSD.org>

When aborting fork() due to a failure, if using MAC, make sure to clean
up the p_label field.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, McAfee Research


# 679365e7 21-Jan-2004 Robert Watson <rwatson@FreeBSD.org>

Reduce gratuitous includes: don't include jail.h if it's not needed.
Presumably, at some point, you had to include jail.h if you included
proc.h, but that is no longer required.

Result of: self injury involving adding something to struct prison


# 5cded904 09-Jan-2004 Olivier Houchard <cognet@FreeBSD.org>

Prevent a race condition between fork1() and whatever changes the pgrp by
setting the new process' p_pgrp again before inserting it in the p_pglist.
Without it we can get the new process to be inserted in a different p_pglist
than the one p2->p_pgrp points to, and this is not something we want to happen.
This is not a fix, merely a bandaid, but it will work until someone finds a
better way to do it.

Discussed with: jhb (a long time ago)


# a30ec4b9 02-Jan-2004 David Xu <davidxu@FreeBSD.org>

Make sigaltstack as per-threaded, because per-process sigaltstack state
is useless for threaded programs, multiple threads can not share same
stack.
The alternative signal stack is private for thread, no lock is needed,
the orignal P_ALTSTACK is now moved into td_pflags and renamed to
TDP_ALTSTACK.
For single thread or Linux clone() based threaded program, there is no
semantic changed, because those programs only have one kernel thread
in every process.

Reviewed by: deischen, dfr


# b3aeaf2e 29-Oct-2003 Bruce Evans <bde@FreeBSD.org>

Removed mostly-dead code for setting switchtime after the idle loop
clobbers this variable. Long ago, when the idle loop wasn't in a
process, it set switchtime.tv_sec to zero to indicate that the time
needs to be read after the idle loop finishes. The special case for
this isn't needed now that there is an idle process (for each CPU).
The time is read in the normal way when the idle process is switched
away from. The seconds component of the time is only zero for the
first second after the uptime is set, and the mostly-dead code was only
executed during this time. (This was slightly broken by using uptimes
instead of times relative to the Epoch -- in the original version the
seconds component of the time was only 0 for the first second after
the Epoch.)

In mi_switch(), moved the setting of switchticks to just after the
first (and now only) setting of switchtime. This setting used to be
delayed since a late setting was needed for the idle case and an early
setting was not needed. Now the early setting is needed so that
fork_exit() doesn't need to set either switchtime or switchticks.
Removed now-completely-rotted comment attached to this. Most of the
code described by the comment had already moved to sched_switch().


# 89674a9f 29-Oct-2003 Bruce Evans <bde@FreeBSD.org>

Removed sched_nest variable in sched_switch(). Context switches always
begin with sched_lock held but not recursed, so this variable was
always 0.

Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch(). Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up. Only
sched_lock.mtx_lock really needs a fixup.

Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up. This
assertion much match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.


# c06eb4e2 19-Aug-2003 Sam Leffler <sam@FreeBSD.org>

Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter. This
does not change the behaviour; it just makes the intent more clear.


# 70fca427 15-Aug-2003 John Baldwin <jhb@FreeBSD.org>

- Various style fixes in both code and comments.
- Update some stale comments.
- Sort a couple of includes.
- Only set 'newcpu' in updatepri() if we use it.
- No functional changes.

Obtained from: bde (via an old diff I got a long time ago)


# bfe65982 04-Aug-2003 John Baldwin <jhb@FreeBSD.org>

Adjust a comment to remove staleness and take slightly less implementation
specific perspective.


# b083ea51 18-Jun-2003 Mike Silbersack <silby@FreeBSD.org>

Add a ratelimited message of the form
"maxproc limit exceeded by uid %i, please see tuning(7) and login.conf(5)."

Which will be triggered whenever a user hits his/her maxproc limit or
the systemwide maxproc limit is reached.

MFC after: 1 week


# 0e2a4d3a 14-Jun-2003 David Xu <davidxu@FreeBSD.org>

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 89f4fca2 14-Jun-2003 Alan Cox <alc@FreeBSD.org>

Move the *_new_altkstack() and *_dispose_altkstack() functions out of the
various pmap implementations into the machine-independent vm. They were
all identical.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# ad05d580 02-Jun-2003 Tor Egge <tegge@FreeBSD.org>

Add tracking of process leaders sharing a file descriptor table and
allow a file descriptor table to be shared between multiple process
leaders.

PR: 50923


# 90af4afa 13-May-2003 John Baldwin <jhb@FreeBSD.org>

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


# 7d447c95 01-May-2003 John Baldwin <jhb@FreeBSD.org>

Initialize and destroy the struct proc mutex in the proc zone's init and
fini routines instead of in fork() and wait(). This has the nice side
benefit that the proc lock of any process on the allproc list is always
valid and sched_lock doesn't have to be used to test against PRS_NEW
anymore.


# 87ccef7b 01-May-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

Instead of recording the Unix time in a process when it starts, record the
uptime. Where necessary, convert it back to Unix time by adding boottime
to it. This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.


# 428eb576 30-Apr-2003 John Baldwin <jhb@FreeBSD.org>

Axe a stale comment.


# 51da11a2 29-Apr-2003 Mark Murray <markm@FreeBSD.org>

Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.


# 9752f794 22-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- Move PS_PROFIL and its new cousin PS_STOPPROF back over to p_flag and
rename them appropriately. Protect both flags with both the proc lock
and the sched_lock.
- Protect p_profthreads with the proc lock.
- Remove Giant from profil(2).


# bb0e8070 17-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- Push Giant down into the fork1() function a small bit.
- Set p_acflag earlier while already hold the proc lock in fork1().
- Mark the realitexpire() callout MPSAFE for new processes. It was already
marked safe for proc0 a long while ago.


# f6f230fe 10-Apr-2003 Jeff Roberson <jeff@FreeBSD.org>

- Adjust sched hooks for fork and exec to take processes as arguments instead
of ksegs since they primarily operation on processes.
- KSEs take ticks so pass the kse through sched_clock().
- Add a sched_class() routine that adjusts a ksegrp pri class.
- Define a sched_fork_{kse,thread,ksegrp} and sched_exit_{kse,thread,ksegrp}
that will be used to tell the scheduler about new instances of these
structures within the same process. These will be used by THR and KSE.
- Change sched_4bsd to reflect this API update.


# 060563ec 10-Apr-2003 Julian Elischer <julian@FreeBSD.org>

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but it's purpose will change a bit
when we add the ability to lock a KSE to a cpu.


# 2c10d16a 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Borrow the KSE single threading code for exec and exit. We use the check
if (p->p_numthreads > 1) and not a flag because action is only necessary
if there are other threads. The rest of the system has no need to
identify thr threaded processes.
- In kern_thread.c use thr_exit1() instead of thread_exit() if P_THREADED
is not set.


# 75b8b3b2 24-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Replace the at_fork, at_exec, and at_exit functions with the slightly more
flexible process_fork, process_exec, and process_exit eventhandlers. This
reduces code duplication and also means that I don't have to go duplicate
the eventhandler locking three more times for each of at_fork, at_exec, and
at_exit.

Reviewed by: phk, jake, almost complete silence on arch@


# a5881ea5 13-Mar-2003 John Baldwin <jhb@FreeBSD.org>

- Cache a reference to the credential of the thread that starts a ktrace in
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.

Requested by: rwatson (1)


# ac2e4153 26-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 27e39ae4 19-Feb-2003 Tim J. Robbins <tjr@FreeBSD.org>

Remove the PL_SHAREMOD flag from struct plimit, which could have been
used to share resource limits between rfork threads, but never was.
Removing it makes resource limit locking much simpler -- only the current
process can change the contents of the structure that p_limit points to.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 5215b187 16-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much the
KSE code.

Submitted by: davidxu


# 218a01e0 15-Feb-2003 Tor Egge <tegge@FreeBSD.org>

Avoid file lock leakage when linuxthreads port or rfork is used:
- Mark the process leader as having an advisory lock
- Check if process leader is marked as having advisory lock when
closing file
- Check that file is still open after lock has been obtained
- Don't allow file descriptor table sharing between processes
with different leaders

PR: 10265
Reviewed by: alfred


# 6f8132a8 31-Jan-2003 Julian Elischer <julian@FreeBSD.org>

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 0dbb100b 26-Jan-2003 David Xu <davidxu@FreeBSD.org>

Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.

Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.

KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.

When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# c522c1bf 31-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

fdcopy() only needs a filedesc pointer.


# c7f1c11b 31-Dec-2002 Alfred Perlstein <alfred@FreeBSD.org>

Since fdshare() and fdinit() only operate on filedescs, make them
take pointers to filedesc structures instead of threads. This makes
it more clear that they do not do any voodoo with the thread/proc
or anything other than the filedesc passed in or returned.

Remove some XXX KSE's as this resolves the issue.


# 93a7aa79 27-Dec-2002 Julian Elischer <julian@FreeBSD.org>

Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads sahre KSEs. Every KSE now has a thread, which is
considered its "owner" however a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
n this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creations of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more but
converts the existing thread into teh KSE's owner, and sets it up
for doing an upcall. It is just inhibitted from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no-longer needed.
"Idle" KSEs are now on the loanable queue.


# 696058c3 09-Dec-2002 Julian Elischer <julian@FreeBSD.org>

Unbreak the KSE code. Keep track of zobie threads using the Per-CPU storage
during the context switch. Rearrange thread cleanups
to avoid problems with Giant. Clean threads when freed or
when recycled.

Approved by: re (jhb)


# 2555374c 20-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce p_label, extensible security label storage for the MAC framework
in struct proc. While the process label is actually stored in the
struct ucred pointed to by p_ucred, there is a need for transient
storage that may be used when asynchronous (deferred) updates need to
be performed on the "real" label for locking reasons. Unlike other
label storage, this label has no locking semantics, relying on policies
to provide their own protection for the label contents, meaning that
a policy leaf mutex may be used, avoiding lock order issues. This
permits policies that act based on historical process behavior (such
as audit policies, the MAC Framework port of LOMAC, etc) can update
process properties even when many existing locks are held without
violating the lock order. No currently committed policies implement use
of this label storage.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 293d2d22 18-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

We leaked a process lock reference in the event an RFTHREAD process
leader wasn't exiting during a fork; instead, do remember to release
the lock avoiding lock order reversals and recursion panic.

Reported by: "Joel M. Baldwin" <qumqats@outel.org>


# 62220473 18-Oct-2002 John Baldwin <jhb@FreeBSD.org>

Do not lock the process when calling fdfree() (this would have recursed on
a non-recursive lock, the proc lock, before) since we don't need it to
change p_fd.


# c6544064 14-Oct-2002 John Baldwin <jhb@FreeBSD.org>

- Add a new global mutex 'ppeers_lock' to protect the p_peers list of
processes forked with RFTHREAD.
- Use a goto to a label for common code when exiting from fork1() in case
of an error.
- Move the RFTHREAD linkage setup code later in fork since the ppeers_lock
cannot be locked while holding a proc lock. Handle the race of a task
leader exiting and killing its peers while a peer is forking a new child.
In that case, go ahead and let the peer process proceed normally as the
parent is about to kill it. However, the task leader may have already
gone to sleep to wait for the peers to die, so the new child process may
not receive a SIGKILL from the task leader. Rather than try to destruct
the new child process, just go ahead and send it a SIGKILL directly and
add it to the p_peers list. This ensures that the task leader will wait
until both the peer process doing the fork() and the new child process
have received their KILL signals and exited.

Discussed with: truckman (earlier versions)


# b43179fb 11-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch


# 48bfcddd 08-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Round out the facilty for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
teh owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is consceivable that the
borrower should inherit the priority of the owner too.
that's another discussion and would be simple to do.

Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot mor info for threaded proceses though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 316ec49a 02-Oct-2002 Scott Long <scottl@FreeBSD.org>

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by: jake, peter, jhb


# 1d9c5696 01-Oct-2002 Juli Mallett <jmallett@FreeBSD.org>

Back our kernel support for reliable signal queues.

Requested by: rwatson, phk, and many others


# 1226f694 30-Sep-2002 Juli Mallett <jmallett@FreeBSD.org>

First half of implementation of ksiginfo, signal queues, and such. This
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.

After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.

CODAFS is unfinished in this regard because the logic is unclear in
some places.

Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]


# c76e33b6 16-Sep-2002 Jonathan Mini <mini@FreeBSD.org>

Add kernel support needed for the KSE-aware libpthread:
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Deliver signals to KSE-aware processes via upcall.
- Rename kse mailbox structure fields to be more BSD-like.
- Store the UTS's stack in struct proc in a stack_t.

Reviewed by: bde, deischen, julian
Approved by: -arch


# 4f0db5e0 15-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Allocate KSEs and KSEGRPs separatly and remove them from the proc structure.
next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)

While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.


# 71fad9fd 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 1faf202e 06-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Use UMA as a complex object allocator.
The process allocator now caches and hands out complete process structures
*including substructures* .

i.e. it get's the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.

For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.

Reviewed by: davidxu@freebsd.org, peter@freebsd.org


# 1279572a 05-Sep-2002 David Xu <davidxu@FreeBSD.org>

s/SGNL/SIG/
s/SNGL/SINGLE/
s/SNGLE/SINGLE/

Fix abbreviation for P_STOPPED_* etc flags, in original code they were
inconsistent and difficult to distinguish between them.

Approved by: julian (mentor)


# 49539972 22-Aug-2002 Julian Elischer <julian@FreeBSD.org>

slight cleanup of single-threading code for KSE processes


# df95311a 07-Aug-2002 Matthew N. Dodd <mdodd@FreeBSD.org>

Move code block added in 1.157 to a safer part of fork1().

Submitted by: jake


# 9ccba881 03-Aug-2002 Matthew N. Dodd <mdodd@FreeBSD.org>

Kernel modifications necessary to allow to follow fork()ed children.

PR: bin/25587 (in part)
MFC after: 3 weeks


# c4441bc7 29-Jul-2002 Mike Silbersack <silby@FreeBSD.org>

Update docs to reflect change in count of procs reserved for root
from 1 to 10.

PR: kern/40515
Submitted by: David Schultz <dschultz@uclink.Berkeley.EDU>
MFC after: 1 day


# 5c38b6db 28-Jul-2002 Don Lewis <truckman@FreeBSD.org>

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# 66d59314 14-Jul-2002 Julian Elischer <julian@FreeBSD.org>

part of a greater patch set..
1/ don't need to set td_state to TDS_RUNNING in fork_return.
it's already set in choosethread().
2/ Set a child process state to "normal" as opposed to "new"
when we allow it to be put on the run queue.
Allows child to receive signals from the parent if the parent
runs first and tries to immediatly signal he child.

Submitted by: (part 2) Thomas Moestl <tmoestl@gmx.net>


# c3b98db0 13-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Thinking about it I came to the conclusion that the KSE states were incorrectly
formulated. The correct states should be:
IDLE: On the idle KSE list for that KSEG
RUNQ: Linked onto the system run queue.
THREAD: Attached to a thread and slaved to whatever state the thread is in.

This means that most places where we were adjusting kse state can go away
as it is just moving around because the thread is..
The only places we need to adjust the KSE state is in transition to and from
the idle and run queues.

Reviewed by: jhb@freebsd.org


# aaa1c771 10-Jul-2002 Jonathan Mini <mini@FreeBSD.org>

Revert removal of cred_free_thread(): It is used to ensure that a thread's
credentials are not improperly borrowed when the thread is not current in
the kernel.

Requested by: jhb, alfred


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 01ad8a53 24-Jun-2002 Jonathan Mini <mini@FreeBSD.org>

Remove unused diagnostic function cread_free_thread().

Approved by: alfred


# af300f23 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

- Proper locking for p_tracep and p_traceflag.
- Catch up to new ktrace API.


# 3fc755c1 02-May-2002 John Baldwin <jhb@FreeBSD.org>

- Protect randompid and nprocs with the allproc_lock.
- Reorder fork1() to do malloc() and other blocking operations prior to
acquiring the needed process locks.
- The new process inherit's the credentials of curthread, not the
credentials of the old process.
- Document a really weird race that will come up with KSE allows multiple
kernel threads per process.


# ba626c1d 16-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Lock proctree_lock instead of pgrpsess_lock.


# 9b28af91 09-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Whitespace changes to wrap long lines.


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 2a60b9b9 02-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Fix leakage of p_pgrp lock.


# 182da820 01-Apr-2002 Matthew Dillon <dillon@FreeBSD.org>

Stage-2 commit of the critical*() code. This re-inlines cpu_critical_enter()
and cpu_critical_exit() and moves associated critical prototypes into their
own header file, <arch>/<arch>/critical.h, which is only included by the
three MI source files that need it.

Backout and re-apply improperly comitted syntactical cleanups made to files
that were still under active development. Backout improperly comitted program
structure changes that moved localized declarations to the top of two
procedures. Partially re-apply one of the program structure changes to
move 'mask' into an intermediate block rather then in three separate
sub-blocks to make the code more readable. Re-integrate bug fixes that Jake
made to the sparc64 code.

Note: In general, developers should not gratuitously move declarations out
of sub-blocks. They are where they are for reasons of structure, grouping,
readability, compiler-localizability, and to avoid developer-introduced bugs
similar to several found in recent years in the VFS and VM code.

Reviewed by: jake


# 8899023f 27-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Make the reference counting of 'struct pargs' SMP safe.

There is still some locations where the PROC lock should be held
in order to prevent inconsistent views from outside (like the
proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be
fixed later.

Submitted by: Jonathan Mini <mini@haikugeek.com>


# f22a4b62 27-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by: jhb


# d74ac681 26-Mar-2002 Matthew Dillon <dillon@FreeBSD.org>

Compromise for critical*()/cpu_critical*() recommit. Cleanup the interrupt
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).

Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.

This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.

Reviewed by: core
Approved by: core


# 565ab9395 20-Mar-2002 Benno Rice <benno@FreeBSD.org>

Add a change mirroring that made to kern/subr_trap.c and others.

This makes kernel builds with DIAGNOSTIC work again.

Apparently forgotten by: jhb
Might want to be checked by: jhb


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 181df8c9 26-Feb-2002 Matthew Dillon <dillon@FreeBSD.org>

revert last commit temporarily due to whining on the lists.


# f96ad4c2 26-Feb-2002 Matthew Dillon <dillon@FreeBSD.org>

STAGE-1 of 3 commit - allow (but do not require) interrupts to remain
enabled in critical sections and streamline critical_enter() and
critical_exit().

This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not effected by this change.

This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.

This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.

The following changes have been made:

* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.

Other architectures are unaffected.

* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
in regards to hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather then hardcoded
in MI code.

The new code places the burden of entering the critical section
in the trampoline code where it belongs.

* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with the fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.

* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.

* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.

* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.

* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code in regards to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.

* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Being per-cpu variables
it is not currently necessary to lock; bus cycles modifying them.

Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.

* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.

Performance issues:

One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly making the effective savings of the new
critical section code's performance very significant.

The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather then defer.

The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API which currently has serious cross pollution
in MI code (in fork_exit() and ast() for example).

The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().

Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.


# f591779b 23-Feb-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 77c40664 22-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Add some DIAGNOSTIC code.
While in userland, keep the thread's ucred reference in a shadow
field so that the usual place to store it is NULL.
If DIAGNOSTIC is not set, the thread ucred is kept valid until the next
kernel entry, at which time it is checked against the process cred
and possibly corrected. Produces a BIG speedup in
kernels with INVARIANTS set. (A previous commit corrected it
for the non INVARIANTS case already)

Reviewed by: dillon@freebsd.org


# 1cbb9c3b 22-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Convert p->p_runtime and PCPU(switchtime) to bintime format.


# cc6712ea 18-Feb-2002 Mike Silbersack <silby@FreeBSD.org>

A few misc forkbomb defenses:

- Leave 10 processes for root-only use, the previous
value of 1 was insufficient to run ps ax | more.
- Remove the printing of "proc: table full". When the table
really is full, this would flood the screen/logs, making
the problem tougher to deal with.
- Force any process trying to fork beyond its user's maximum
number of processes to sleep for .5 seconds before returning
failure. This turns 2000 rampaging fork monsters into 2000
harmlessly snoozing fork monsters.

Reviewed by: dillon, peter
MFC after: 1 week


# 2eb927e2 16-Feb-2002 Julian Elischer <julian@FreeBSD.org>

If the credential on an incoming thread is correct, don't bother
reaquiring it. In the same vein, don't bother dropping the thread cred
when goinf ot userland. We are guaranteed to nned it when we come back,
(which we are guaranteed to do).

Reviewed by: jhb@freebsd.org, bde@freebsd.org (slightly different version)


# 2b8a08af 07-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Fix a couple of style bugs introduced (or touched by) previous commit.


# 079b7bad 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# fdba8cf4 08-Jan-2002 Mike Silbersack <silby@FreeBSD.org>

GC fast_vfork; it's not actually referenced anywhere.

MFC after: 3 weeks


# 885ccc61 18-Dec-2001 John Baldwin <jhb@FreeBSD.org>

Return EINVAL if kernel only flags are passed to the rfork syscall rather
than silently masking them.


# 7e1f6dfe 17-Dec-2001 John Baldwin <jhb@FreeBSD.org>

Modify the critical section API as follows:
- The MD functions critical_enter/exit are renamed to start with a cpu_
prefix.
- MI wrapper functions critical_enter/exit maintain a per-thread nesting
count and a per-thread critical section saved state set when entering
a critical section while at nesting level 0 and restored when exiting
to nesting level 0. This moves the saved state out of spin mutexes so
that interlocking spin mutexes works properly.
- Most low-level MD code that used critical_enter/exit now use
cpu_critical_enter/exit. MI code such as device drivers and spin
mutexes use the MI wrappers. Note that since the MI wrappers store
the state in the current thread, they do not have any return values or
arguments.
- mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is
assigned to curthread->td_savecrit during fork_exit().

Tested on: i386, alpha


# 201b0ea8 14-Dec-2001 John Baldwin <jhb@FreeBSD.org>

Fix some nits in fork_exit() so it more properly duplicates the backend
of mi_switch:
- Set the oncpu value for the current thread.
- Always set switchticks, not just in the SMP case.
- Add a KTR entry for fork_exit that is the same as the "new proc"
entry in mi_switch().
- Release sched_lock a bit later like we do with mi_switch().


# 8e2e767b 26-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Add a per-thread ucred reference for syscalls and synchronous traps from
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.

Tested on: x86 (SMP), alpha, sparc64


# 79deba82 23-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix ktrace enablement/disablement races that can result in a vnode
ref count panic.

Bug noticed by: ps
Reviewed by: ps
MFC after: 1 day


# bd78cece 11-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# eb30c1c0 09-Sep-2001 Peter Wemm <peter@FreeBSD.org>

Rip some well duplicated code out of cpu_wait() and cpu_exit() and move
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.

XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.

Reviewed by: jake, tmm, dillon


# 116734c4 31-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Pushdown Giant for acct(), kqueue(), kevent(), execve(), fork(),
vfork(), rfork(), jail().


# 9b956e98 09-Jul-2001 Guido van Rooij <guido@FreeBSD.org>

Get rid of useless bcopy (the next statement was equivalent)


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# aa3cefd0 29-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Remove the p_spinlocks spin lock count that was obsoleted by the
per-CPU spinlocks list.


# 8f7e4eb5 11-Jun-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

Rename nextpid to lastpid and externalize it.


# b1fc0ec1 25-May-2001 Robert Watson <rwatson@FreeBSD.org>

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 3b26be6a 07-May-2001 Akinori MUSHA <knu@FreeBSD.org>

Properly copy the P_ALTSTACK flag in struct proc::p_flag to the child
process on fork(2).

It is the supposed behavior stated in the manpage of sigaction(2), and
Solaris, NetBSD and FreeBSD 3-STABLE correctly do so.

The previous fix against libc_r/uthread/uthread_fork.c fixed the
problem only for the programs linked with libc_r, so back it out and
fix fork(2) itself to help those not linked with libc_r as well.

PR: kern/26705
Submitted by: KUROSAWA Takahiro <fwkg7679@mb.infoweb.ne.jp>
Tested by: knu, GOTOU Yuuzou <gotoyuzo@notwork.org>,
and some other people
Not objected by: hackers
MFC in: 3 days


# 1005a129 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Convert the allproc and proctree locks from lockmgr locks to sx locks.


# 19284646 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Rework the witness code to work with sx locks as well as mutexes.
- Introduce lock classes and lock objects. Each lock class specifies a
name and set of flags (or properties) shared by all locks of a given
type. Currently there are three lock classes: spin mutexes, sleep
mutexes, and sx locks. A lock object specifies properties of an
additional lock along with a lock name and all of the extra stuff needed
to make witness work with a given lock. This abstract lock stuff is
defined in sys/lock.h. The lockmgr constants, types, and prototypes have
been moved to sys/lockmgr.h. For temporary backwards compatability,
sys/lock.h includes sys/lockmgr.h.
- Replace proc->p_spinlocks with a per-CPU list, PCPU(spinlocks), of spin
locks held. By making this per-cpu, we do not have to jump through
magic hoops to deal with sched_lock changing ownership during context
switches.
- Replace proc->p_heldmtx, formerly a list of held sleep mutexes, with
proc->p_sleeplocks, which is a list of held sleep locks including sleep
mutexes and sx locks.
- Add helper macros for logging lock events via the KTR_LOCK KTR logging
level so that the log messages are consistent.
- Add some new flags that can be passed to mtx_init():
- MTX_NOWITNESS - specifies that this lock should be ignored by witness.
This is used for the mutex that blocks a sx lock for example.
- MTX_QUIET - this is not new, but you can pass this to mtx_init() now
and no events will be logged for this lock, so that one doesn't have
to change all the individual mtx_lock/unlock() operations.
- All lock objects maintain an initialized flag. Use this flag to export
a mtx_initialized() macro that can be safely called from drivers. Also,
we on longer walk the all_mtx list if MUTEX_DEBUG is defined as witness
performs the corresponding checks using the initialized flag.
- The lock order reversal messages have been improved to output slightly
more accurate file and line numbers.


# 486b8ac0 27-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Don't explicitly zero p_intr_nesting_level and p_aioinfo in fork.


# 35a47246 27-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Use mtx_intr_enable() on sched_lock to ensure child processes always start
with interrupts enabled rather than calling the no-longer MI function
enable_intr(). This is bogus anyways and in theory shouldn't even be
needed.


# 5db078a9 09-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Fix mtx_legal2block. The only time that it is bad to block on a mutex is
if we hold a spin mutex, since we can trivially get into deadlocks if we
start switching out of processes that hold spinlocks. Checking to see if
interrupts were disabled was a sort of cheap way of doing this since most
of the time interrupts were only disabled when holding a spin lock. At
least on the i386. To fix this properly, use a per-process counter
p_spinlocks that counts the number of spin locks currently held, and
instead of checking to see if interrupts are disabled in the witness code,
check to see if we hold any spin locks. Since child processes always
start up with the sched lock magically held in fork_exit(), we initialize
p_spinlocks to 1 for child processes. Note that proc0 doesn't go through
fork_exit(), so it starts with no spin locks held.

Consulting from: cp


# 5641ae5d 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

- Don't hold the proc lock across VREF and the fd* functions to avoid lock
order reversals.
- Add some preliminary locking in the !RF_PROC case.
- Protect p_estcpu with sched_lock.


# 57934cd3 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

- Lock the forklist with an sx lock.
- Add proc locking to fork1(). Always lock the child procoess (new
process) first when both processes need to be locked at the same
time.
- Remove unneeded spl()'s as the data they protected is now locked.
- Ensure that the proctree is exclusively locked and the new process is
locked when setting up the parent process pointer.
- Lock the check for P_KTHREAD in p_flag in fork_exit().


# 5b270b2a 27-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Sigh. Try to get priorities sorted out. Don't bother trying to
update native priority, it is diffcult to get right and likely
to end up horribly wrong. Use an honestly wrong fixed value
that seems to work; PUSER for user threads, and the interrupt
priority for ithreads. Set it once when the process is created
and forget about it.

Suggested by: bde
Pointy hat: me


# be15bfc0 26-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Initialize native priority to PRI_MAX. It was usually 0 which made a
process's priority go through the roof when it released a (contested)
mutex. Only set the native priority in mtx_lock if hasn't already
been set.

Reviewed by: jhb


# 9764c9d3 21-Feb-2001 John Baldwin <jhb@FreeBSD.org>

Quiet a warning with a uintptr_t cast.

Noticed by: bde


# 91421ba2 20-Feb-2001 Robert Watson <rwatson@FreeBSD.org>

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# 5813dc03 19-Feb-2001 John Baldwin <jhb@FreeBSD.org>

- Don't call clear_resched() in userret(), instead, clear the resched flag
in mi_switch() just before calling cpu_switch() so that the first switch
after a resched request will satisfy the request.
- While I'm at it, move a few things into mi_switch() and out of
cpu_switch(), specifically set the p_oncpu and p_lastcpu members of
proc in mi_switch(), and handle the sched_lock state change across a
context switch in mi_switch().
- Since cpu_switch() no longer handles the sched_lock state change, we
have to setup an initial state for sched_lock in fork_exit() before we
release it.


# d941d475 12-Feb-2001 Robert Watson <rwatson@FreeBSD.org>

o Export the nextpid variable via SYSCTL as kern.lastpid, decreasing by
one the number of variables needed for top and other setgid kmem
utilities that could only be accessed via /dev/kmem previously.

Submitted by: Thomas Moestl <tmoestl@gmx.net>
Reviewed by: freebsd-audit


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 8865286b 26-Jan-2001 John Baldwin <jhb@FreeBSD.org>

Fix fork_exit() to take a pointer to a function that returns void as its
first argument rather than a function that returns a void *.

Noticed by: jake


# 2a36ec35 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Change fork_exit() to take a pointer to a trapframe as its 3rd argument
instead of a trapframe directly. (Requested by bde.)
- Convert the alpha switch_trampoline to call fork_exit() and use the MI
fork_return() instead of child_return().
- Axe child_return().


# a7b124c3 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Catch up to proc flag changes.
- Add new fork_exit() and fork_return() MI C functions.


# 5d22597f 23-Jan-2001 Hajimu UMEMOTO <ume@FreeBSD.org>

Add mibs to hold the number of forks since boot. New mibs are:

vm.stats.vm.v_forks
vm.stats.vm.v_vforks
vm.stats.vm.v_rforks
vm.stats.vm.v_kthreads
vm.stats.vm.v_forkpages
vm.stats.vm.v_vforkpages
vm.stats.vm.v_rforkpages
vm.stats.vm.v_kthreadpages

Submitted by: Paul Herman <pherman@frenchfries.net>
Reviewed by: alfred


# a448b62a 21-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Make intr_nesting_level per-process, rather than per-cpu. Setup
interrupt threads to run with it always >= 1, so that malloc can
detect M_WAITOK from "interrupt" context. This is also necessary
in order to context switch from sched_ithd() directly.

Reviewed By: peter


# 98f03f90 23-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Protect proc.p_pptr and proc.p_children/p_sibling with the
proctree_lock.

linprocfs not locked pending response from informal maintainer.

Reviewed by: jhb, -smp@


# c0c25570 12-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 8dd431fc 04-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Whitespace. Fix indentation, align comments.


# 4971f62a 02-Dec-2000 John Baldwin <jhb@FreeBSD.org>

- Add a mutex to the proc structure p_mtx that will be used to lock accesses
to each individual proc.
- Initialize the lock during fork1(), and destroy it in wait1().


# 86360fee 01-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Remove thr_sleep and thr_wakeup. Remove fields p_nthread and p_wakeup
from struct proc, which are now unused (p_nthread already was).
Remove process flag P_KTHREADP which was untested and only set
in vfs_aio.c (it should use kthread_create). Move the yield
system call to kern_synch.c as kern_threads.c has been removed
completely.

moral support from: alfred, jhb


# 1512b5d6 30-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Use an mp-safe callout for endtsleep.


# 4f559836 27-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Use callout_reset instead of timeout(9). Most callouts are statically
allocated, 2 have been added to struct proc for setitimer and sleep.

Reviewed by: jhb, jlemon


# 553629eb 22-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 42fd51ce 14-Sep-2000 Don Lewis <truckman@FreeBSD.org>

Enforce process limit policy in one place to keep proccnt from diverging
from reality.


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# f535380c 05-Sep-2000 Don Lewis <truckman@FreeBSD.org>

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 77978ab8 04-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 82d9ae4e 03-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 47fdd692 26-Jun-2000 Neil Blakey-Milner <nbm@FreeBSD.org>

Add sysctl descriptions to a few sysctls. Simply "documentation".

PR: kern/8015
Submitted by: Stefan Eggers <seggers@semyam.dinoco.de>


# c6362551 22-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

fix races in the uidinfo subsystem, several problems existed:

1) while allocating a uidinfo struct malloc is called with M_WAITOK,
it's possible that while asleep another process by the same user
could have woken up earlier and inserted an entry into the uid
hash table. Having redundant entries causes inconsistancies that
we can't handle.

fix: do a non-waiting malloc, and if that fails then do a blocking
malloc, after waking up check that no one else has inserted an entry
for us already.

2) Because many checks for sbsize were done as "test then set" in a non
atomic manner it was possible to exceed the limits put up via races.

fix: instead of querying the count then setting, we just attempt to
set the count and leave it up to the function to return success or
failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
and deletions needed to be in their own functions for clarity.

Reviewed by: green


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# cb679c38 16-Apr-2000 Jonathan Lemon <jlemon@FreeBSD.org>

Introduce kqueue() and kevent(), a kernel event notification facility.


# bb6a234e 06-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Put on my asbestos underwear and commit the patch that I posted to -arch
some time ago that changes kern.randompid from a boolean to a randomness
range for the next pid assigment. Too high causes a lot of extra work
to scan for free pids, and too low merely wastes randomness entropy. It's
still possible to select a completely random range by using PID_MAX (100k)
or -1 as a shortcut to mean "the whole range".
Also, don't waste randomness when doing a wraparound.


# 91c28bfd 05-Dec-1999 Luoqi Chen <luoqi@FreeBSD.org>

User ldt sharing.


# ee3fd601 28-Nov-1999 Dan Moschuk <dan@FreeBSD.org>

Introduce OpenBSD-like Random PIDs. Controlled by a sysctl knob
(kern.randompid), which is currently defaulted off. Use ARC4 (RC4) for our
random number generation, which will not get me executed for violating
crypto laws; a Good Thing(tm).

Reviewed and Approved by: bde, imp


# 93efcae8 19-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

The at_exit and at_fork functions currently use a 'roll your own'
linked list to store the callbak routines. The patch converts the
lists to queue(3) TAILQs, making the code slightly clearer and ensuring
that callbacks are executed in FIFO order.

Man page also updated as necesary.

(discontinued use of M_TEMP malloc type while here anyway /phk)

Submitted by: Jake Burkholder jake@checker.org
PR: 14912


# b9df5231 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce commandline caching in the kernel.

This fixes some nasty procfs problems for SMP, makes ps(1) run much faster,
and makes ps(1) even less dependent on /proc which will aid chroot and
jails alike.

To disable this facility and revert to previous behaviour:
sysctl -w kern.ps_arg_cache_limit=0

For full details see the current@FreeBSD.org mail-archives.


# 2e3c8fcb 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This is a partial commit of the patch from PR 14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# d1f088da 11-Oct-1999 Peter Wemm <peter@FreeBSD.org>

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# d4da2dba 21-Jul-1999 Alan Cox <alc@FreeBSD.org>

Fix the following problem:

When creating new processes (or performing exec), the new page
directory is initialized too early. The kernel might grow before
p_vmspace is initialized for the new process. Since pmap_growkernel
doesn't yet know about the new page directory, it isn't updated, and
subsequent use causes a failure.

The fix is (1) to clear p_vmspace early, to stop pmap_growkernel
from stomping on memory, and (2) to defer part of the initialization
of new page directories until p_vmspace is initialized.

PR: kern/12378
Submitted by: tegge
Reviewed by: dfr


# 1943af61 03-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Stop rfork(0) from panicing. (oops!!)

Submitted by: Peter Holm <peter@holm.cc>


# df8abd0b 30-Jun-1999 Peter Wemm <peter@FreeBSD.org>

Slight tweak to fork1() calling conventions. Add a third argument so
the caller can easily find the child proc struct. fork(), rfork() etc
syscalls set p->p_retval[] themselves. Simplify the SYSINIT_KT() code
and other kernel thread creators to not need to use pfind() to find the
child based on the pid. While here, partly tidy up some of the fork1()
code for RF_SIGSHARE etc.


# 75c13541 28-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# 5206bca1 27-Apr-1999 Luoqi Chen <luoqi@FreeBSD.org>

Enable vmspace sharing on SMP. Major changes are,
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fask_vfork is now on as default for both UP and SMP.
- Some aio code cleanup.

Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>


# 0dd9741e 24-Apr-1999 Dmitrij Tejblum <dt@FreeBSD.org>

Use pointer arithmetic to do pointer arithmetic.


# e9189611 17-Apr-1999 Peter Wemm <peter@FreeBSD.org>

Well folks, this is it - The second stage of the removal for build support
for LKM's..


# af8ad83e 05-Apr-1999 Peter Wemm <peter@FreeBSD.org>

Use the reference-counted PHOLD()/PRELE() rather than P_NOSWAP.


# 4ac9ae70 01-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Fix thread/process tracking and differentiation for Linux threads emulation.

Submitted by: Richard Seaman, Jr." <dick@tar.com>

Also clean some compiler warnings in surrounding code.


# 88c5ea45 25-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Enable Linux threads support by default.
This takes the conditionals out of the code that has been tested by
various people for a while.
ps and friends (libkvm) will need a recompile as some proc structure
changes are made.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# dc9c271a 07-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Changes to the LINUX_THREADS support to only allocate extra memory for
shared signal handling when there is shared signal handling being
used.

This removes the main objection to making the shared signal handling
a standard ability in rfork() and friends and 'unconditionalising'
this code. (i.e. the allocation of an extra 328 bytes per process).

Signal handling information remains in the U area until such a time as
it's reference count would be incremented to > 1. At that point a new
struct is malloc'd and maintained in KVM so that it can be shared between
the processes (threads) using it.

A function to check the reference count and move the struct back to the U
area when it drops back to 1 is also supplied. Signal information is
therefore now swapable for all processes that are not sharing that
information with other processes. THis should addres the concerns raised
by Garrett and others.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 6626c604 18-Dec-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: Luoqi Chen, Jordan Hubbard
Submitted by: "Richard Seaman, Jr." <lists@tar.com>
Obtained from: linux :-)

Code to allow Linux Threads to run under FreeBSD.

By default not enabled
This code is dependent on the conditional
COMPAT_LINUX_THREADS (suggested by Garret)
This is not yet a 'real' option but will be within some number of hours.


# 643a8daa 09-Nov-1998 Don Lewis <truckman@FreeBSD.org>

If the session leader dies, s_leader is set to NULL and getsid() may
dereference a NULL pointer, causing a panic. Instead of following
s_leader to find the session id, store it in the session structure.

Jukka found the following info:

BTW - I just found what I have been looking for. Std 1003.1
Part 1: SYSTEM API [C LANGUAGE] section 2.2.2.80 states quite
explicitly...

Session lifetime: The period between when a session is created
and the end of lifetime of all the process groups that remain
as members of the session.

So, this quite clearly tells that while there is any single
process in any process group which is a member of the session,
the session remains as an independent entity.

Reviewed by: peter
Submitted by: "Jukka A. Ukkonen" <jau@jau.tmt.tele.fi>


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 74b2192a 11-Dec-1997 John Dyson <dyson@FreeBSD.org>

We have had support for running the kernel daemons as threads for
quite a while, but forgot to do so. For now, this code supports
most daemons running as kernel threads in UP kernels, and as
full processes in SMP. We will soon be able to run them as
threads in SMP, but not yet.


# be67169a 20-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.

Staticized.

Avoid passing a `retval' to fork1().

Fixed some style bugs.


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# eb776aea 25-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Fixed some gratuitous ANSIisms.


# e384a980 22-Aug-1997 Peter Wemm <peter@FreeBSD.org>

Print a warning if an unsupported (under SMP) shared address space fork
is attempted rather than just failing with an errno.


# 2244ea07 05-Jul-1997 John Dyson <dyson@FreeBSD.org>

This is an upgrade so that the kernel supports the AIO calls from
POSIX.4. Additionally, there is some initial code that supports LIO.
This code supports AIO/LIO for all types of file descriptors, with
few if any restrictions. There will be a followup very soon that
will support significantly more efficient operation for VCHR type
files (raw.) This code is also dependent on some kernel features
that don't work under SMP yet. After I commit the changes to the
kernel to support proper address space sharing on SMP, this code
will also work under SMP.


# b3196e4b 22-Jun-1997 Peter Wemm <peter@FreeBSD.org>

Preliminary support for per-cpu data pages.

This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])

There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.

This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.

Reviewed by: fsmp, dyson


# 2c1011f7 15-Jun-1997 John Dyson <dyson@FreeBSD.org>

Modifications to existing files to support the initial AIO/LIO and
kernel based threading support.


# 8f453f3e 28-May-1997 Peter Wemm <peter@FreeBSD.org>

Don't need "opt_smp.h" on these files


# c76e95c3 26-Apr-1997 Peter Wemm <peter@FreeBSD.org>

Create sysctl kern.fast_vfork, on for uniprocessor by default, off for
SMP.


# c32ba248 26-Apr-1997 Peter Wemm <peter@FreeBSD.org>

Disable RFMEM in vfork for smp case.. It doesn't seem to work too well
yet..


# 0eaa559c 23-Apr-1997 Andrey A. Chernov <ache@FreeBSD.org>

Restore memory space separation (RFMEM) for vfork() after
shell imgact memory clobbering fixed


# 6b707440 22-Apr-1997 John Dyson <dyson@FreeBSD.org>

Give up on the fast vfork() for a while.


# c58494e4 20-Apr-1997 John Dyson <dyson@FreeBSD.org>

Re-institute the efficent version of vfork. It appears to make a
difference of approx 3mins in make world on my P6!!! This means
that vfork now has full address space sharing, so beware with
sloppy vfork programming. Also, you really do need to apply
the previously committed popen fix in libc.


# d7f7f3f2 13-Apr-1997 John Dyson <dyson@FreeBSD.org>

Make a problem that I cannot reproduce go away for now. This commit
is to decrease the inconvienience of other developers until I can
really fix the code.
Reviewed by: Donald J. Maddox <dmaddox@scsn.net>


# 5856e12e 12-Apr-1997 John Dyson <dyson@FreeBSD.org>

Fully implement vfork. Vfork is now much much faster than even our
fork. (On my machine, fork is about 240usecs, vfork is 78usecs.)

Implement rfork(!RFPROC !RFMEM), which allows a thread to divorce its memory
from the other threads of a group.

Implement rfork(!RFPROC RFCFDG), which closes all file descriptors, eliminating
possible existing shares with other threads/processes.

Implement rfork(!RFPROC RFFDG), which divorces the file descriptors for a
thread from the rest of the group.

Fix the case where a thread does an exec. It is almost nonsense for a thread
to modify the other threads address space by an exec, so we
now automatically divorce the address space before modifying it.


# 263a3392 07-Apr-1997 Peter Wemm <peter@FreeBSD.org>

Remove explicit zero of p_vmspace on creation, it's now in the startzero
section of the proc struct.


# a2a1c95c 07-Apr-1997 Peter Wemm <peter@FreeBSD.org>

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler since than swithching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplity the pmap code to manage process contexts, we no longer have to
double map the UPAGES, this simplifies and should measuably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 70e534e7 17-Feb-1997 David Greenman <dg@FreeBSD.org>

Pass P_SUGID on to the child of a fork(). It was possible to get rlogin
to coredump previously since it (somewhat uniquely) is setuid and forks
without execing, and thus without passing P_SUGID the child could
coredump and possibly divulge sensitive information (such as encrypted
passwords from the passwd database).


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 3e2bca9e 15-Jan-1997 Bruce Evans <bde@FreeBSD.org>

Fixed interrupt unmasking for child processes which I broke in
rev.1.10 two years ago. Children continued to run at splhigh()
after returning from vm_fork(). This mainly affected kernel
processes and init. For ordinary processes, interrupts are normally
unmasked a few instructions later after fork() returns (it may be
important for syscall() not to reschedule the child processes).
Kernel processes had workarounds for the problem. Init manages to
start because some routines "know" that it is safe to go to sleep
despite their caller starting them at a high ipl. Then its ipl
gets fixed on its first normal return from a syscall.


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 51068190 27-Oct-1996 Wolfram Schneider <wosch@FreeBSD.org>

Move static variable nextpid out from fork1(). Now top(1) can print
last pid value.


# b71fec07 03-Sep-1996 Bruce Evans <bde@FreeBSD.org>

Eliminated nested include of <sys/unistd.h> in <sys/file.h> in the kernel.
Include it directly in the few places where it is used.

Reduced some #includes of <sys/file.h> to #includes of <sys/fcntl.h> or
nothing.


# e0d898b4 21-Aug-1996 Julian Elischer <julian@FreeBSD.org>

Some cleanups to the callout lists recently added.
note that at_shutdown has a new parameter to indicate When
during a shutdown the callout should be made. also
add a RB_POWEROFF flag to reboot "howto" parameter..
tells the reboot code in our at_shutdown module to turn off the UPS
and kill the power. bound to be useful eventually on laptops


# fed06968 18-Aug-1996 Julian Elischer <julian@FreeBSD.org>

add callout lists for exit() and fork()

I've been meaning to do this for AGES as I keep having to patch those routines
whenever I write a proprietary package or similar..

any module that assigns resources to processes needs to know when
these events occur. there are existsing modules that should be modified
to take advantage of these.. e.g. SYSV IPC primatives
presently have #ifdef entries in exit()


this also helps with making LKMs out of such things..

(see the man pages at_exit(9) and at_fork(9))


# b1508c72 31-Jul-1996 David Greenman <dg@FreeBSD.org>

Converted timer/run queues to 4.4BSD queue style. Removed old and unused
sleep(). Implemented wakeup_one() which may be used in the future to combat
the "thundering herd" problem for some special cases.

Reviewed by: dyson


# c23670e2 11-Jun-1996 Gary Palmer <gpalmer@FreeBSD.org>

Clean up -Wunused warnings.

Reviewed by: bde


# 88d1b642 02-May-1996 Peter Wemm <peter@FreeBSD.org>

Fix a nasty bug that causes random crashes and lockups particularly on
very busy servers (eg: news, web). This is an interaction between
embryonic processes that have not yet finished forking, and happen to
cause the kernel VM space to grow, hitting the uninitialised variable.

It was possible for this to strike at any time, depending on the size of
your kernel and load patterns. One machine had paniced occasionally
when cron launches a job since before the 2.1 release.

If you had "options DIAGNOSTIC", you may have seen references to bogus
addresses like 0xdeadc142 and the like.

This is a minimal change to fix the problem, it will probably be done
better by reordering p_vmspace to be in the startzero section, but it
becomes harder to validate then.

It's been vulnerable since pmap.c rev 1.40 (Jan 9, 1995), so it's been a
cause of problems since well before 2.0.5. This was when the merged
VM/buffer cache and the dynamic growing kernel VM space were first
committed. This probably fixes a few of PR's.


# 0e3eb7ee 17-Apr-1996 Sujal Patel <smpatel@FreeBSD.org>

Implement the RFNOWAIT flag for rfork(). If set this flag will cause the
forked child to be dissociated from the parent).

Cleanup fork1(), implement vfork() and fork() in terms of rfork() flags.

Remove RFENVG, RFNOTEG, RFCNAMEG, RFCENVG which are Plan9 specific and cannot
possibly be implemented in FreeBSD.

Renumbered the flags to make up for the removal of the above flags.

Reviewed by: peter, smpatel
Submitted by: Mike Grupenhoff <kashmir@umiacs.umd.edu>


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# b75356e1 10-Mar-1996 Jeffrey Hsu <hsu@FreeBSD.org>

From Lite2: proc LIST changes.
Reviewed by: david & bde


# ef5dc8a9 03-Mar-1996 John Dyson <dyson@FreeBSD.org>

Keep fork from over extending the number of processes. Since u_map is
sized exactly for maxproc, the occasional overrunning the maxproc limit
can cause problems.


# dabee6fe 23-Feb-1996 Peter Wemm <peter@FreeBSD.org>

kern_descrip.c: add fdshare()/fdcopy()
kern_fork.c: add the tiny bit of code for rfork operation.
kern/sysv_*: shmfork() takes one less arg, it was never used.
sys/shm.h: drop "isvfork" arg from shmfork() prototype
sys/param.h: declare rfork args.. (this is where OpenBSD put it..)
sys/filedesc.h: protos for fdshare/fdcopy.
vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where
it makes sense.
vm/*: drop unused isvfork arg.

Note: this rfork() implementation copies the address space mappings,
it does not connect the mappings together. ie: once the two processes
have split, the pages may be shared, but the address space is not. If one
does a mmap() etc, it does not appear in the other. This makes it not
useful for pthreads, but it is useful in it's own right for having
light-weight threads in a static shared address space.

Obtained from: Original by Ron Minnich, extended by OpenBSD


# db6a20e2 03-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Converted two options over to the new scheme: USER_LDT and KTRACE.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# d2d3e875 11-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.


# ad7507e2 07-Oct-1995 Steven Wallace <swallace@FreeBSD.org>

Remove prototype definitions from <sys/systm.h>.
Prototypes are located in <sys/sysproto.h>.

Add appropriate #include <sys/sysproto.h> to files that needed
protos from systm.h.

Add structure definitions to appropriate files that relied on sys/systm.h,
right before system call definition, as in the rest of the kernel source.

In kern_prot.c, instead of using the dummy structure "args", create
individual dummy structures named <syscall>_args. This makes
life easier for prototype generation.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# d7e3a89f 21-Jan-1995 Bruce Evans <bde@FreeBSD.org>

Don't count the parent's previous timeslice in the child's resource usage
(it was counted twice).

Set the start time more accurately.


# d93f860c 09-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics. related to getting prototypes into view.


# 35c10d22 09-Oct-1994 David Greenman <dg@FreeBSD.org>

Got rid of map.h. It's a leftover from the rmap code, and we use rlists.
Changed swapmap into swaplist.


# 7216391e 01-Oct-1994 David Greenman <dg@FreeBSD.org>

"idle priority" support. Based on code from Henrik Vestergaard Draboel,
but substantially rewritten by me.


# e8fb0b2c 31-Aug-1994 David Greenman <dg@FreeBSD.org>

Realtime priority scheduling support.

Submitted by: Henrik Vestergaard Draboel


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 0d2afcee 06-Aug-1994 David Greenman <dg@FreeBSD.org>

Process scheduling changes - adapted from FreeBSD 1.1.5. Basically,
charge scheduling CPU of child process to the parent and have child
inherit scheduling CPU from parent on fork. Makes a **big** difference
in the feel of the system to interactive users.

Submitted by: John Dyson


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources