History log of /freebsd-current/sys/kern/kern_proc.c
Revision Date Author Comments
# 235436d6 04-Apr-2024 Konstantin Belousov <kib@FreeBSD.org>

stop_all_proc(): skip traced or signal-stoped processes

Since thread_single(SINGLE_ALLPROC) ignores them since 9241ebc796c,
and there is not much we can do for the debugger-controlled process.

Noted by: olce
Reviewed by: markj, olce
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D44638


# 171f0832 28-Nov-2023 Konstantin Belousov <kib@FreeBSD.org>

EVFILT_TIMER: intialize stop timer list in type-stable proc init, instead of fork

Since kqueue timer may exist after the process that created it exited
(same scenario with rfork(2) as in PR 275286), make the tailq
p_kqtim_stop accessed by filt_timerdetach() type-stable.

Noted and reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D42777


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 35b3be81 09-Oct-2023 Zhenlei Huang <zlei@FreeBSD.org>

proc: Add sysctl flag CTLFLAG_TUN to loader tunable

The sysctl variable 'kern.kstack_pages' is actually a loader tunable.
Add sysctl flag CTLFLAG_TUN to it so that `sysctl -T` will report it
correctly.

No functional change intended.

Note that on arm64 the thread0 stack size can not be controlled with it,
kib@ suggested that arm64 maintainers can fix it eventually so let's
enable it also on arm64 right now.

Reviewed by: kib, imp
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D42113


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 7a70f17a 07-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

killpg(): more carefully avoid LoR

otherwise we could end up with the livelock. When pg_killsx trylock
failed, ensure that we do wait for lock availability before retry.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# be30fd3a 07-Jul-2023 Mike Karels <karels@FreeBSD.org>

KERN_PROC_VM_LAYOUT sysctl: fix bug in 32-bit-compatible path

vmspace_free() is called redundantly in the 32-bit-compatible
path in sysctl_kern_proc_vm_layout(), causing a premature free
(possibly for the current address space). Remove the extra call.

PR: 272401
Reported by: marklmi at yahoo.com
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D40908


# 3360b485 12-Jun-2023 Konstantin Belousov <kib@FreeBSD.org>

killpg(2): close a race with fork(2), part1

If the process group member performs fork(), the child could escape
signalling from killpg(). Prevent it by introducing an sx process group
lock pg_killsx which is taken interruptibly shared around fork. If there
is a pending signal, do the trip through userspace with ERESTART to
handle signal ASTs. The lock is taken exclusively during killpg().

The lock is also locked exclusive when the process changes group
membership, to avoid escaping a signal by this means, by ensuring that
the process group is stable during fork.

Note that the new lock is before proctree lock, so in some situations we
could only do trylocking to obtain it.

This relatively simple approach cannot work for REAP_KILL, because
process potentially belongs to more than one reaper tree by having
sub-reapers.

Reported by: dchagin
Tested by: dchagin, pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D40493


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# 939f0b63 10-May-2022 Kornel Dulęba <kd@FreeBSD.org>

Implement shared page address randomization

It used to be mapped at the top of the UVA.
If the randomization is enabled any address above .data section will be
randomly chosen and a guard page will be inserted in the shared page
default location.
The shared page is now mapped in exec_map_stack, instead of
exec_new_vmspace. The latter function is called before image activator
has a chance to parse ASLR related flags.
The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page
address.
The feature is enabled by default for 64 bit applications on all
architectures.
It can be toggled kern.elf64.aslr.shared_page sysctl.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35349


# 361971fb 02-Jun-2022 Kornel Dulęba <kd@FreeBSD.org>

Rework how shared page related data is stored

Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35393


# f6ac79fb 02-Jun-2022 Kornel Dulęba <kd@FreeBSD.org>

Introduce the PROC_SIGCODE() macro

Use a getter macro instead of fetching the sigcode address directly
from a sysent of a given process. It assumes that the sigcode is stored
in the shared page, which is true in all cases, except for a.out
binaries. This will be later useful when the shared page address
randomization is introduced.
No functional change intended.

Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35392


# 008b2e65 30-Apr-2022 Konstantin Belousov <kib@FreeBSD.org>

Make stop_all_proc_block interruptible to avoid deadlock with parallel suspension

If we try to single-thread a process which thread entered
procctl(REAP_KILL_SUBTREE), and sleeping waiting for us unlocking
stop_all_proc_blocker, we must be able to finish single-threading. This
requires the sleep to be interruptible.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310


# d3000939 04-May-2022 Konstantin Belousov <kib@FreeBSD.org>

P2_WEXIT: avoid thread_single() for exiting process earlier

before the process itself does thread_single(SINGLE_EXIT). We cannot
single-thread such process in ALLPROC (external) mode, and properly
detect and report the failure to do so due to the process becoming
zombie is easier to prevent than handle.

In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310


# 2e7595ef 23-Apr-2022 Konstantin Belousov <kib@FreeBSD.org>

Add stop_all_proc_block(9)

It allows to have more than one consumer of thread_signle(SIGNLE_ALLPROC) by
serializing them.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# 3ce04aca 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

proc: Add a sysctl to fetch virtual address space layout info

This provides information about fixed regions of the target process'
user memory map.

Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33708


# 706f4a81 17-Jan-2022 Mark Johnston <markj@FreeBSD.org>

exec: Introduce the PROC_PS_STRINGS() macro

Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro. With stack address randomization the
ps_strings address is no longer fixed.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33704


# ab4524b3 05-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

amd64: wrap 64bit sigtramp into vdso

Reviewed by: emaste
Discussed with: jrtc27
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D32960


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# 7ac82c96 03-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

proc_get_binpath(): provide syntaxically correct value for unused NDINIT arg

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 02de91d7 03-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

proc_get_binpath(): return empty string instead of NULL

for strange case where queried process does not have text.

Reported by: Michael Butler <imb@protected-networks.net>
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# f34fc6ba 28-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Extract proc_get_binpath() from sysctl_kern_proc_pathname()

Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32738


# ee92c8a8 23-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

sysctl kern.proc.procname: report right hardlink name

PR: 248184
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 4ccaa87f 11-Aug-2021 Mitchell Horne <mhorne@FreeBSD.org>

kdb: Handle process enumeration before procinit()

Make kdb_thr_first() and kdb_thr_next() return sane values if the
allproc list and pidhashtbl haven't been initialized yet. This can
happen if the debugger is entered very early on, for example with the
'-d' boot flag.

This allows remote gdb to attach at such a time, and fixes some ddb
commands like 'show threads'.

Be explicit about the static initialization of these variables. This
part has no functional change.

Reviewed by: markj, imp (previous version)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D31495


# 0dcef81d 23-Jul-2021 Mark Johnston <markj@FreeBSD.org>

Add required sysctl name length checks to various handlers

Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 1762f674 14-May-2021 Konstantin Belousov <kib@FreeBSD.org>

ktrace: pack all ktrace parameters into allocated structure ktr_io_params

Ref-count the ktr_io_params structure instead of vnode/cred.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30257


# ecfbddf0 14-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

sysctl vm.objects: report backing object and swap use

For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object. This allows userspace
to deconstruct the shadow chain. Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field. I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29771


# d7671ad8 25-Feb-2021 Ryan Libby <rlibby@FreeBSD.org>

Close races in vm object chain traversal for unlock

We were unlocking the vm object before reading the backing_object field.
In the meantime, the object could be freed and reused. This could cause
us to go off the rails in the object chain traversal, failing to unlock
the rest of the objects in the original chain and corrupting the lock
state of the victim chain.

Reviewed by: bdrewery, kib, markj, vangyzen
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28926


# 25c6318c 13-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

procstat: distinguish vm map guards in procstat vm output.

Requested and reviewed by: rwatson (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28658


# edc374e7 02-Feb-2021 Ed Maste <emaste@FreeBSD.org>

Correct description for kern.proc.proc_td

kern.proc.proc_td returns the process table with an entry for each
thread. Previously the description included "no threads", presumably
a cut-and-pasteo in 2648efa621748.

Description suggested by PauAmma.

PR: 253146
MFC after: 3 days
Sponsored by: The FreeBSD Foundation


# fe258f23 16-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

Save on getpid in setproctitle by supporting -1 as curproc.


# 5844bd05 28-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

jobc: rework detection of orphaned groups.

Instead of trying to maintain pg_jobc counter on each process group
update (and sometimes before), just calculate the counter when needed.
Still, for the benefit of the signal delivery code, explicitly mark
orphaned groups as such with the new process group flag.

This way we prevent bugs in the corner cases where updates to the counter
were missed due to complicated configuration of p_pptr/p_opptr/real_parent
(debugger).

Since we need to iterate over all children of the process on exit, this
change mostly affects the process group entry and leave, where we need
to iterate all process group members to detect orpaned status.

(For MFC, keep pg_jobc around but unused).

Reported by: jhb
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871


# cf4f802e 31-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

kinfo_proc: move job-control related data collection into a new helper.

This improves code structure and allows to put the lock asserts right
into place where the locks are needed.

Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only()
to fill_kinfo_proc(), this looks more symmetrical.

Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871


# 4daea938 31-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Lock proctree in around fill_kinfo_proc().

Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc. There was even an XXX comment
about it.

Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.

Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871


# ef739c73 31-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

pgrp: Prevent use after free.

Often, we have a process locked and need to get locked process group.
In this case, because progress group lock is before process lock,
unlocking process allows the group to be freed. See for instance
tty_wait_background().

Make pgrp structures allocated from nofree zone, and ensure type stability
of the pgrp mutex.

Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871


# 993a1699 30-Dec-2020 Konstantin Belousov <kib@FreeBSD.org>

Style. Improve some KASSERTs messages.

Reviewed by: jilles
Tested by: pho
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871


# 4e7d1b52 31-Dec-2020 John Baldwin <jhb@FreeBSD.org>

Add a proc_off_p_hash helper variable.

This is used by kernel debuggers to enumerate processes via the pid
hash table.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27825


# 598f2b81 23-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

dtrace: stop using eventhandlers for the part compiled into the kernel

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D27311


# 85078b85 17-Nov-2020 Conrad Meyer <cem@FreeBSD.org>

Split out cwd/root/jail, cmask state from filedesc table

No functional change intended.

Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).

__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.

Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037


# f5297909 11-Nov-2020 Mark Johnston <markj@FreeBSD.org>

Fix a pair of races in SIGIO registration

First, funsetownlst() list looks at the first element of the list to see
whether it's processing a process or a process group list. Then it
acquires the global sigio lock and processes the list. However, nothing
prevents the first sigio tracker from being freed by a concurrent
funsetown() before the sigio lock is acquired.

Fix this by acquiring the global sigio lock immediately after checking
whether the list is empty. Callers of funsetownlst() ensure that new
sigio trackers cannot be added concurrently.

Second, fsetown() uses funsetown() to remove an existing sigio structure
from a file object. However, funsetown() uses a racy check to avoid the
sigio lock, so two threads may call fsetown() on the same file object,
both observe that no sigio tracker is present, and enqueue two sigio
trackers for the same file object. However, if the file object is
destroyed, funsetown() will only remove one sigio tracker, and
funsetownlst() may later trigger a use-after-free when it clears the
file object reference for each entry in the list.

Fix this by introducing funsetown_locked(), which avoids the racy check.

Reviewed by: kib
Reported by: pho
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27157


# 40aad3e4 10-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

thread: tidy up r367543

"locked" variable is spurious in the committed version.


# f837888a 09-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

thread: use tdfind in sysctl_kern_proc_kstack

This treads linear scans for locked lookup, but more importantly removes
the only consumer of thread_find.


# dd90d963 16-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Put calls to check_pgrp_jobc() in fixjobc_kill() under INVARIANTS.

Reported by: Michael Butler <imb@protected-networks.net>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 182cfe6f 16-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add check_pgrp_jobc() calls into process exit path.

Both before and after job control adjustments.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26416


# 2f5f11f5 16-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix fixjobc+orhpanage.

Orphans affect job control state, we must account for them when
changing pg_jobc.

Instead of p_pptr, use proc_realparent() to get parent relevant for
job control.

Use correct calculation of the parent for exiting process. For jobc
purposes, we must use realparent, but if it is also exiting, we should
fall to reaper, then recursively find non-exiting reaper.

Reported by: trasz
PR: 249257
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26416


# 928b8538 16-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Assert that P_TREE_GRPEXITED is set only once.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26416


# 82207cd2 16-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Improve ddb 'show pgrpdump' command.

Use ddb pager.
Make lines more compact.
Eliminate unneeded casts.
Print more job-control related info when reporting process group.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26416


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# feabaaf9 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the always curthread argument from reverse lookup routines

Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by: pho


# c5bc28b2 22-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix several issues with process group orphanage.

Attempt of adding assertions that pgrp->pg_jobc counters do not
underflow in r361967, reverted in r362910, points out bugs in the
handling of job control. Peter Holm was able to narrow down the
problem to very easy reproduction with timeout(1) which uses reaping.

The following list of problems with calculation of pg_jobs which
directs SIGHUP/SIGCONT delivery for orphaned process group was
identified:
- Re-calculation of the orphaned status for children of exiting parent
was wrong, but mostly unnoticed when all children were reparented to
init(8). When child can be reparented to a different process which
could affect the child' job control state, it was not properly
accounted for in pg_jobc.
- Lockless check for exiting process' parent process group is racy
because nothing prevents the parent from changing its group
membership.
- Exited process is left in the process group, until waited. This
affects other calculations of pg_jobc.

Split handling of job control status on process changing its process
group, and process exiting. Calculate increments and decrements for
pg_jobs by exact checking the orphanage instead of assuming process
group membership for children and parent. Move the call to killjobc()
later under the proctree_lock. Mark exiting process in killjobc()
with a new flag P_TREE_GRPEXITED and skip it for all pg_jobc
calculations after the flag is set.

Add checker that independently recalculates pg_jobc value and compares
it with the memoized process group state. This is enabled under INVARIANTS.

Reviewed by: jilles
Discussed with: kevans
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D26116


# 3b444436 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

devfs: rework si_usecount to track opens

This removes a lot of special casing from the VFS layer.

Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D25612


# 58199a70 03-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

ifdef out pg_jobc assertions added in r361967

They trigger for some people, the bug is not obvious, there are no takers
for fixing it, the issue already had to be there for years beforehand and
is low priority.


# 4bc5ce2c 02-Jul-2020 Konstantin Belousov <kib@FreeBSD.org>

Use tdfind() in pget().

Reviewed by: jhb, hselasky
Sponsored by: Mellanox Technologies
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25532


# 90a08d6c 09-Jun-2020 Mateusz Guzik <mjg@FreeBSD.org>

Assert on pg_jobc state.

Stolen from NetBSD.


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 48fcb463 08-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Add sysctl kern.proc.sigfastblk for reporting sigfastblock word address.

Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773


# 1c29da02 31-Jan-2020 Mark Johnston <markj@FreeBSD.org>

Reimplement stack capture of running threads on i386 and amd64.

After r355784 the td_oncpu field is no longer synchronized by the thread
lock, so the stack capture interrupt cannot be delievered precisely.
Fix this using a loop which drops the thread lock and restarts if the
wrong thread was sampled from the stack capture interrupt handler.

Change the implementation to use a regular interrupt instead of an NMI.
Now that we drop the thread lock, there is no advantage to the latter.

Simplify the KPIs. Remove stack_save_td_running() and add a return
value to stack_save_td(). On platforms that do not support stack
capture of running threads, stack_save_td() returns EOPNOTSUPP. If the
target thread is running in user mode, stack_save_td() returns EBUSY.

Reviewed by: kib
Reported by: mjg, pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23355


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# fea73412 24-Dec-2019 Conrad Meyer <cem@FreeBSD.org>

sleep(9), sleepqueue(9): const'ify wchan pointers

_sleep(9), wakeup(9), sleepqueue(9), et al do not dereference or modify the
channel pointers provided in any way; they are merely used as intptrs into a
dictionary structure to match waiters with wakers. Correctly annotate this
such that _sleep() and wakeup() may be used on const pointers without
invoking ugly patterns like __DECONST(). Plumb const through all of the
underlying sleepqueue bits.

No functional change.

Reviewed by: rlibby
Discussed with: kib, markj
Differential Revision: https://reviews.freebsd.org/D22914


# 01cef4ca 16-Oct-2019 Mark Johnston <markj@FreeBSD.org>

Remove page locking from pmap_mincore().

After r352110 the page lock no longer protects a page's identity, so
there is no purpose in locking the page in pmap_mincore(). Instead,
if vm.mincore_mapped is set to the non-default value of 0, re-lookup
the page after acquiring its object lock, which holds the page's
identity stable.

The change removes the last callers of vm_page_pa_tryrelock(), so
remove it.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21823


# 2288078c 08-Oct-2019 Doug Moore <dougm@FreeBSD.org>

Define macro VM_MAP_ENTRY_FOREACH for enumerating the entries in a vm_map.
In case the implementation ever changes from using a chain of next pointers,
then changing the macro definition will be necessary, but changing all the
files that iterate over vm_map entries will not.

Drop a counter in vm_object.c that would have an effect only if the
vm_map entry count was wrong.

Discussed with: alc
Reviewed by: markj
Tested by: pho (earlier version)
Differential Revision: https://reviews.freebsd.org/D21882


# 88cc62e5 28-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

proc: eliminate the zombproc list

It is not needed by anything in the kernel and it slightly drives up contention
on both proctree and allproc locks.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21447


# 2319489b 27-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

proc: remove zpfind

It is not used by anything. If someone wants it back it should be reimplemented
to use the proc hash.

Sponsored by: The FreeBSD Foundation


# e2e050c8 19-May-2019 Conrad Meyer <cem@FreeBSD.org>

Extract eventfilter declarations to sys/_eventfilter.h

This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.

EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).

As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.

LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).

No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.


# eeb1f3ed 14-Apr-2019 Rick Macklem <rmacklem@FreeBSD.org>

Fix the NFSv4 client to safely find processes.

r340744 broke the NFSv4 client, because it replaced pfind_locked() with a
call to pfind(), since pfind() acquires the sx lock for the pid hash and
the NFSv4 already holds a mutex when it does the call.
The patch fixes the problem by recreating a pfind_any_locked() and adding the
functions pidhash_slockall() and pidhash_sunlockall to acquire/release
all of the pid hash locks.
These functions are then used by the NFSv4 client instead of acquiring
the allproc_lock and calling pfind().

Reviewed by: kib, mjg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19887


# 6a855903 05-Mar-2019 Mark Johnston <markj@FreeBSD.org>

Show wiring state of map entries in procstat -v.

Note that only entries wired by userspace are shown as such. In
particular, entries transiently wired by sysctl_wire_old_buffer() are
not flagged as wired in procstat -v output.

Reviewed by: kib (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19461


# f02bc51c 03-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Do not call PHOLD() while owning the allproc_lock sx.

Otherwise the lock might recurse in faultin() if the process is
swapped out.

Reported by: zeising
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 0abc7e41 28-Dec-2018 Jilles Tjoelker <jilles@FreeBSD.org>

pfind, pfind_any: Correct zombie logic

SVN r340744 erroneously changed pfind() to return any process including
zombies and pfind_any() to return only non-zombie processes.

In particular, this caused kill() on a zombie process to fail with [ESRCH].
There is no direct test case for this but /usr/tests/bin/sh/builtins/kill1.0
occasionally triggers it (as reported by lwhsu).

Conversely, returning zombies from pfind() seems likely to violate
invariants and cause panics, but I have not looked at this.

PR: 233646
Reviewed by: mjg, kib, ngie
Differential Revision: https://reviews.freebsd.org/D18665


# 34ebdcea 06-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Manage process-related IDs with bitmaps

Currently unique pid allocation on fork often requires a full walk of
process, group, session lists to make sure it is not used by anything.
This has a side effect of requiring proctree to be held along with allproc,
which adds more contention in poudriere -j 128.

The patch below implements trivial bitmaps which gets rid of the problem.
Dedicated lock is introduced to manage IDs.

While here a bug was discovered: all processes would inherit reap id from
the first process spawned by init. This had a side effect of keeping the
ID used and when allocation rolls over to the beginning it keeps being
skipped.

The patch is loosely based on initial work by mjoras@.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation


# 5e38e3f5 29-Nov-2018 Eric van Gyzen <vangyzen@FreeBSD.org>

Include path for tmpfs objects in vm.objects sysctl

This applies the fix in r283924 to the vm.objects sysctl
added by r283624 so the output will include the vnode
information (i.e. path) for tmpfs objects.

Reviewed by: kib, dab
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D2724


# 1e9a1bf5 28-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: create a dedicated lock for zombproc to ligthen the load on allproc_lock

waitpid always takes proctree to evaluate the list, but only takes allproc
if it can reap. With this patch allproc is no longer taken, which helps during
poudriere -j 128.

Discussed with: kib
Sponsored by: The FreeBSD Foundation


# 53011553 21-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: convert pfind & friends to use pidhash locks and other cleanup

pfind_locked is retired as it relied on allproc which unnecessarily
restricts locking of the hash.

Sponsored by: The FreeBSD Foundation


# 3d3e6793 21-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: implement pid hash locks and an iterator

forks, exits and waits are frequently stalled during poudriere -j 128 runs
due to killpg and process list exports performed for each package.

Both uses take the allproc lock. The latter case can be modified to iterate
over the hash with finer grained locking instead.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17817


# 2c054ce9 16-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

proc: always store parent pid in p_oppid

Doing so removes the dependency on proctree lock from sysctl process list
export which further reduces contention during poudriere -j 128 runs.

Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17825


# aeb7a84e 15-Nov-2018 Mark Johnston <markj@FreeBSD.org>

Remove mostly-useless proc provider probes.

For some reason the proc UMA zone's ctor, dtor and init functions are
instrumented, but these functions are always available through FBT.
Moreover, the probes are not part of the original Solaris proc
provider, aren't documented, have no uses (e.g., in dwatch(8)) and
have no clear use to begin with. Therefore, remove them.

Reviewed by: rpaulo
Differential Revision: https://reviews.freebsd.org/D2169


# d6eff083 04-Jul-2018 Konstantin Belousov <kib@FreeBSD.org>

Add a way for the process to request cleanup of the kernel cache of
the process arguments. New arguments length zero causes the drop of
the pargs instead of allocation of useless zero-length buffer.

Submitted by: Thomas Munro
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D16111


# 6e22bbf6 21-Jun-2018 Konstantin Belousov <kib@FreeBSD.org>

fork: avoid endless wait with PTRACE_FORK and RFSTOPPED.

An RFSTOPPED thread can't clean TDB_STOPATFORK, which is done in the
fork_return() in its context, so parent is stuck forever. Triggered
when trying to ptrace linux process. Instead of waiting for the new
thread to clear TDB_STOPATFORK, tag it as traced and reparent to the
debugger in do_fork(), and let it only notify the debugger when run.

Submitted by: Yanko Yankulov <yanko.yankulov@gmail.com>
Reviewed by: jhb
MFC after: 1 week
X-MFC-Note: keep p_dbgwait placeholder intact
Differential revision: https://reviews.freebsd.org/D15857


# 7938a442 20-Jun-2018 Bjoern A. Zeeb <bz@FreeBSD.org>

Instead of using hand-rolled loops where not needed switch them
to FOREACH_PROC_IN_SYSTEM() to have a single pattern to look for.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D15916


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# 9d4e369a 25-Feb-2018 Mateusz Guzik <mjg@FreeBSD.org>

Don't generate data in sysctl_out_proc unless we intend to copy out.

The first call is used to gauge how much spaces is needed. Just computing
the size instead of generating the output allows to not take the proctree
lock.


# 31608624 04-Jan-2018 John Baldwin <jhb@FreeBSD.org>

Report offset relative to the backing object for kinfo_vmentry structures.

For the pathname reported in kinfo_vmentry structures (kve_path), the
sysctl handlers walk the object chain to find the bottom-most VM object.
This permits a COW mapping of a file with dirty pages to report the
pathname of the originally mapped file. Do the same for the object
offset (kve_offset) computing a cumulative offset during the same object
walk so that the reported offset is relative to the reported pathname.

Note that ptrace(PT_VM_ENTRY) already returns a cumulative offset
rather than the raw offset of the VM map entry.

Note also that this does not affect procstat -v output (even structured
output) since that output does not include the kve_offset field.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D13767


# 1b25176c 01-Jan-2018 Antoine Brodin <antoine@FreeBSD.org>

sysctl_kern_proc_args: do not take the fast path if p_args is NULL
In this case it falls back to reading ps_strings


# baaa7969 28-Dec-2017 Konstantin Belousov <kib@FreeBSD.org>

Make kern_proc_vmmap_resident() externally accesible, and move the
vmmap_skip_res_cnt control check inside it.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D13595


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 99713164 16-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

Check for PRS_NEW without locking the proc in sysctl_kern_proc


# 712dda7f 13-Nov-2017 Xin LI <delphij@FreeBSD.org>

Be more careful when doing calculation with request from userland.

MFC after: 2 weeks


# baaa6ec7 11-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

Avoid locking and refing in sysctl_kern_proc_args if possible.

Turns out the sysctl is called a lot e.g. by pkg-static.


# 6e1619da 11-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

Add pfind_any

It looks for both regular and zombie processes. This avoids allproc relocking
previously seen with pfind -> zpfind calls.


# 272640b7 11-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

Avoid allproc lock in pfind if curproc->pid == pid


# 9b57bf75 11-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

Remove useless proc lookup from sysctl_out_proc


# 2ca45184 09-Nov-2017 Matt Joras <mjoras@FreeBSD.org>

Introduce EVENTHANDLER_LIST and some users.

This introduces a facility to EVENTHANDLER(9) for explicitly defining a
reference to an event handler list. This is useful since previously all
invokers of events had to do a locked traversal of the global list of
event handler lists in order to find the appropriate event handler list.
By keeping a pointer to the appropriate list an invoker can avoid this
traversal completely. The pointer is initialized with SYSINIT(9) during
the eventhandler stage. Users registering interest in events do not need
to know if the event is backed by such a list, since the list is added
to the global list of lists. As with lists that are not pre-defined it
is safe to register for the events before the list has been created.

This converts the process_* and thread_* events to using the new
facility, as these are events whose locked traversals end up showing up
significantly in ports build workflows (and presumably other workflows
with many short lived threads/procs). It may be advantageous to convert
other events to using the new facility.

The el_flags field is now unused, but leave it be so that this revision
can be MFC'd.

Reviewed by: bdrewery, markj, mjg
Approved by: rstone (mentor)
In collaboration with: ian
MFC after: 4 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12814


# cee09850 07-Nov-2017 Bartek Rutkowski <robak@FreeBSD.org>

Make sysctl_kern_proc_umask execute fast path when requested pid in
curproc->p_pid or 0, avoiding unnecessary locking. Update libc consumer
to skip calling getpid().

Submitted by: Pawel Biernacki <pawel.biernacki@gmail.com>
Reviewed by: mjg, robak
Approved by: mjg
Sponsored by: Mysterious Code Ltd.
Differential Revision: D12972


# a2c36a24 03-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

Special-case pget lookups where pid == curproc->pid

Saves on allproc_lock acquires during buildworld, poudriere etc.

Submitted by: Pawel Biernacki <pawel.biernacki@gmail.com>
Sponsored by: Mysterious Code Ltd.
Differential Revision: D12929


# f38c0c46 06-Oct-2017 Mark Johnston <markj@FreeBSD.org>

Let stack_create(9) take a malloc flags argument.

Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12614


# 3e72c844 08-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

Annotate global process locks with __exclusive_cache_line

MFC after: 1 week


# 6aacc698 14-Jul-2017 Brooks Davis <brooks@FreeBSD.org>

Add 32-bit compat for kinfo_proc's ki_tdaddr.

This appears to have been an oversight in r213536.

Reviewed by: markj
MFC after: 1 week
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11521


# a7ca2c6a 03-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Ensure that cached struct thread does not keep spurious td_su
reference on an UFS mount point.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# d1780e8d 14-Mar-2017 Konstantin Belousov <kib@FreeBSD.org>

Use atop() instead of OFF_TO_IDX() for convertion of addresses or
addresses offsets, as intended.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 316e092a 19-Feb-2017 Hans Petter Selasky <hselasky@FreeBSD.org>

Make sure the thread constructor and destructor eventhandlers are
called for all threads belonging to a procedure. Currently the first
thread in a procedure is kept around as an optimisation step and is
never freed. Because the first thread in a procedure is never freed
nor allocated, its destructor and constructor callbacks are never
called which means per thread structures allocated by dtrace and the
Linux emulation layers for example, might be present for threads which
don't need these structures.

This patch adds a thread construction and destruction call for the
first thread in a procedure.

Tested: dtrace, linux emulation
Reviewed by: kib @
MFC after: 1 week
Sponsored by: Mellanox Technologies


# 3d32d4a7 07-Dec-2016 Eric van Gyzen <vangyzen@FreeBSD.org>

Export the whole thread name in kinfo_proc

kinfo_proc::ki_tdname is three characters shorter than
thread::td_name. Add a ki_moretdname field for these three
extra characters. Add the new field to kinfo_proc32, as well.
Update all in-tree consumers to read the new field and assemble
the full name, except for lldb's HostThreadFreeBSD.cpp, which
I will handle separately. Bump __FreeBSD_version.

Reviewed by: kib
MFC after: 1 week
Relnotes: yes
Sponsored by: Dell EMC
Differential Revision: https://reviews.freebsd.org/D8722


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# 18c23dff 27-Jul-2016 Ed Maste <emaste@FreeBSD.org>

ANSIfy kern_proc.c and delete register keyword

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D6478


# 584b675e 27-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Hide the boottime and bootimebin globals, provide the getboottime(9)
and getboottimebin(9) KPI. Change consumers of boottime to use the
KPI. The variables were renamed to avoid shadowing issues with local
variables of the same name.

Issue is that boottime* should be adjusted from tc_windup(), which
requires them to be members of the timehands structure. As a
preparation, this commit only introduces the interface.

Some uses of boottime were found doubtful, e.g. NLM uses boottime to
identify the system boot instance. Arguably the identity should not
change on the leap second adjustment, but the commit is about the
timekeeping code and the consumers were kept bug-to-bug compatible.

Tested by: pho (as part of the bigger patch)
Reviewed by: jhb (same)
Discussed with: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
X-Differential revision: https://reviews.freebsd.org/D7302


# 93ccd6bf 05-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Get rid of struct proc p_sched and struct thread td_sched pointers.

p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for the thread0. Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.

Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D6711


# 314381b5 05-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Use ANSI function definition.

Sponsored by: The FreeBSD Foundation


# 11748dae 17-Apr-2016 Mark Johnston <markj@FreeBSD.org>

Use a loop instead of a goto in sysctl_kern_proc_kstack().

MFC after: 3 days


# ccd0ec40 17-Apr-2016 Konstantin Belousov <kib@FreeBSD.org>

The struct thread td_estcpu member is only used by the 4BSD scheduler.
Move it to the struct td_sched for 4BSD, removing always present
field, otherwise unused for ULE.

New scheduler method sched_estcpu() returns the estimation for
kinfo_proc consumption. As before, it always returns 0 for ULE.

Remove sched_tick() scheduler method, unused both by 4BSD and ULE.

Update locking comment for the 4BSD struct td_sched, copying it from
the same comment for ULE.

Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment.

Based on some notes from, and reviewed by: bde
Sponsored by: The FreeBSD Foundation


# b85f65af 15-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

kern: for pointers replace 0 with NULL.

These are mostly cosmetical, no functional change.

Found with devel/coccinelle.


# db57c70a 09-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Rename P_KTHREAD struct proc p_flag to P_KPROC.

I left as is an apparent bug in ntoskrnl_var.h:AT_PASSIVE_LEVEL()
definition.

Suggested by: jhb
Sponsored by: The FreeBSD Foundation


# 2dab579b 08-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Remove the assert which outlived its usefulness, and, by default,
disable compilation of the code which made it possible to call
stop_all_proc() from usermode at all.

Move the comment to the preamble of stop_all_proc() and reword it to
give overview of the function intent.

proc0 has P_HADTHREADS flag set due to kthread_add(), but no
P_KTHREAD, which triggered the assert, which does not serve a purpose
now.

Reported by: Oliver Pinter
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 17182b13 20-Jan-2016 Mateusz Guzik <mjg@FreeBSD.org>

session: avoid proctree lock on proc exit when possible

We can get away with the common case with only proc lock held.

Reviewed by: kib


# 6760182c 20-Jan-2016 Mateusz Guzik <mjg@FreeBSD.org>

session: tidy up fixjobc

This stops abusing the 'p' pointer for iteration over children processes
and gets rid of useless locking around PRS_ZOMBIE check.

Suggested by: kib


# aa0241d6 18-Dec-2015 Mateusz Guzik <mjg@FreeBSD.org>

proc: fix a race which could result in dereference of bad p_pgrp pointer on fork

During fork p_starcopy - p_endcopy area of a process is populated with bcopy
with only proc lock held. Another forking thread can find such a process and
proceed to access p_pgrp included in said area.

Fix the problem by moving the field outside. It is being properly assigned
later.

Reviewed by: kib
Diagnosed by: kib
Tested by: Fabian Keil <freebsd-listen fabiankeil.de>
MFC after: 10 days


# 36160958 16-Dec-2015 Mark Johnston <markj@FreeBSD.org>

Fix style issues around existing SDT probes.

- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.

MFC after: 1 week


# 711fbd17 07-Dec-2015 Mark Johnston <markj@FreeBSD.org>

Add helper functions proc_readmem() and proc_writemem().

These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().

This change also adds a manual page for proc_rwmem() and the new functions.

Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D4245


# 645743ea 12-Nov-2015 John Baldwin <jhb@FreeBSD.org>

Export various helper variables describing the layout and size of
certain kernel structures for use by debuggers. This mostly aids
in examining cores from a kernel without debug symbols as a debugger
can infer these values if debug symbols are available.

One set of variables describes the layout of 'struct linker_file' to
walk the list of loaded kernel modules.

A second set of variables describes the layout of 'struct proc' and
'struct thread' to walk the list of processes in the kernel and the
threads in each process.

The 'pcb_size' variable is used to index into the stoppcbs[] array.

The 'vm_maxuser_address' is used to distinguish kernel virtual addresses
from user addresses. This doesn't have to be perfect, and
'vm_maxuser_address' is a cheap and simple way to differentiate kernel
pointers from simple values like TIDs and PIDs.

While here, annotate the fields in struct pcb used by kgdb on amd64
and i386 to note that their ABI should be preserved. Annotations for
other platforms will be added in the future.

Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3773


# e6b95927 06-Oct-2015 Conrad Meyer <cem@FreeBSD.org>

Fix core corruption caused by race in note_procstat_vmmap

This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath. As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption
and truncation, even if names change, at the cost of up to PATH_MAX
bytes per mapped object. The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass. This
addresses corruption, at the cost of sometimes producing a truncated
result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
to grok the new zero padding.

Reported by: pho (https://people.freebsd.org/~pho/stress/log/datamove4-2.txt)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3824


# 2f2f522b 27-Sep-2015 Andriy Gapon <avg@FreeBSD.org>

save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE

SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after: 20 days


# 4295bec1 16-Sep-2015 John Baldwin <jhb@FreeBSD.org>

When a process group leader exits, all of the processes in the group are
sent SIGHUP and SIGCONT if any of the processes are stopped. Currently this
behavior is triggered for any type of process stop including ptrace() stops
and transient stops for single threading during exit() and execve().
Thus, if a debugger is attached to a process in a group when the leader
exits, the entire group can be HUPed. Instead, only send the signals if a
process in the group is stopped due to SIGSTOP.

PR: 201149
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3681


# 610141ce 10-Sep-2015 Mark Johnston <markj@FreeBSD.org>

Add stack_save_td_running(), a function to trace the kernel stack of a
running thread.

It is currently implemented only on amd64 and i386; on these
architectures, it is implemented by raising an NMI on the CPU on which
the target thread is currently running. Unlike stack_save_td(), it may
fail, for example if the thread is running in user mode.

This change also modifies the kern.proc.kstack sysctl to use this function,
so that stacks of running threads are shown in the output of "procstat -kk".
This is handy for debugging threads that are stuck in a busy loop.

Reviewed by: bdrewery, jhb, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3256


# b4490c6e 18-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

The si_status field of the siginfo_t, provided by the waitid(2) and
SIGCHLD signal, should keep full 32 bits of the status passed to the
_exit(2).

Split the combined p_xstat of the struct proc into the separate exit
status p_xexit for normal process exit, and signalled termination
information p_xsig. Kernel-visible macro KW_EXITCODE() reconstructs
old p_xstat from p_xexit and p_xsig. p_xexit contains complete status
and copied out into si_status.

Requested by: Joerg Schilling
Reviewed by: jilles (previous version), pho
Tested by: pho
Sponsored by: The FreeBSD Foundation


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 63e4c6cd 02-Jun-2015 Eric van Gyzen <vangyzen@FreeBSD.org>

Provide vnode in memory map info for files on tmpfs

When providing memory map information to userland, populate the vnode pointer
for tmpfs files. Set the memory mapping to appear as a vnode type, to match
FreeBSD 9 behavior.

This fixes the use of tmpfs files with the dtrace pid provider,
procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY).

Submitted by: Eric Badger <eric@badgerio.us> (initial revision)
Obtained from: Dell Inc.
PR: 198431
MFC after: 2 weeks
Reviewed by: jhb
Approved by: kib (mentor)


# aa32b5e0 27-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make setproctitle(3) work in Capsicum capability mode. This makes
ctld(8) child processes to indicate initiator address and name in
their titles, similar to what iscsid(8) child processes do.

PR: 181352
Differential Revision: https://reviews.freebsd.org/D2363
Reviewed by: rwatson@, mjg@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 296f235d 22-Mar-2015 Ian Lepore <ian@FreeBSD.org>

The sysctls that return process argv and envv return binary data, so clear
the SBUF_INCLUDENUL flag.

Pointed out by: tijl@


# f97af970 21-Mar-2015 Mateusz Guzik <mjg@FreeBSD.org>

proc: use MTX_NEW flag in proc_init

This allows us to get rid of bzero which was added specifically to make
mtx_init on p_mtx reliable.

This also fixes a potential problem where mtx_init on other mutexes
could trip over on unitialized memory and fire an assertion.

Reviewed by: kib


# 1eafc078 14-Mar-2015 Ian Lepore <ian@FreeBSD.org>

Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl
strings returned to userland include the nulterm byte.

Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)

Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.

PR: 195668


# 83d7eb54 14-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix gcc build.

Sponsored by: The FreeBSD Foundation
MFC after: 13 days


# 6ddcc233 13-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add facility to stop all userspace processes. The supposed use of the
feature is to quisce the system before suspend.

Stop is implemented by reusing the thread_single(9) with the special
mode SINGLE_ALLPROC. SINGLE_ALLPROC differs from the existing
single-threading modes by allowing (requiring) caller to operate on
other process. Interruptible sleeps for !TDF_SBDRY threads are
suspended like SIGSTOP does it, instead of aborting the sleep, like
SINGLE_NO_EXIT, to avoid spurious EINTRs on resume.

Provide debugging sysctl debug.stop_all_proc, which causes total stop
and suspends syncer, while waiting for variable reset for resume. It
is used for debugging; should be removed after the real use of the
interface is added.

In collaboration with: pho
Discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 6afb32fc 01-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Disable recursion for the process spinlock.

Tested by: pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 month


# 5c7bebf9 26-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

The process spin lock currently has the following distinct uses:

- Threads lifetime cycle, in particular, counting of the threads in
the process, and interlocking with process mutex and thread lock.
The main reason of this is that turnstile locks are after thread
locks, so you e.g. cannot unlock blockable mutex (think process
mutex) while owning thread lock.

- Virtual and profiling itimers, since the timers activation is done
from the clock interrupt context. Replace the p_slock by p_itimmtx
and PROC_ITIMLOCK().

- Profiling code (profil(2)), for similar reason. Replace the p_slock
by p_profmtx and PROC_PROFLOCK().

- Resource usage accounting. Need for the spinlock there is subtle,
my understanding is that spinlock blocks context switching for the
current thread, which prevents td_runtime and similar fields from
changing (updates are done at the mi_switch()). Replace the p_slock
by p_statmtx and PROC_STATLOCK().

The split is done mostly for code clarity, and should not affect
scalability.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e77f9fed 18-Oct-2014 Adrian Chadd <adrian@FreeBSD.org>

Update the ULE scheduler + thread and kinfo structs to use int for cpuid
rather than u_char.

To try and play nice with the ABI, the u_char CPU ID values are clamped
at 254. The new fields now contain the full CPU ID, or -1 for no cpu.

Differential Revision: D955
Reviewed by: jhb, kib
Sponsored by: Norse Corp, Inc.


# 57c2505e 05-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

On error, sbuf_bcat() returns -1. Some callers returned this -1 to
the upper layers, which interpret it as errno value, which happens to
be ERESTART. The result was spurious restarts of the sysctls in loop,
e.g. kern.proc.proc, instead of returning ENOMEM to caller.

Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers
expecting errno.

In collaboration with: pho
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week


# 2570cdd6 03-Sep-2014 Mateusz Guzik <mjg@FreeBSD.org>

Plug a hypothetical use after free in sysctl kern.proc.groups.

MFC after: 1 week


# 5b5477d7 03-Sep-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Fix dereference after NULL check.

CID: 1234607
Sponsored by: Nginx, Inc.


# 8b04bbef 28-Aug-2014 Mateusz Guzik <mjg@FreeBSD.org>

Return real parent pid in kinfo (used by e.g. ps)

Add a separate field which exports tracer pid and add a new keyword
("tracer") for ps to display it.

This is a follow up to r270444.

Reviewed by: kib
MFC after: 1 week
Relnotes: yes


# d7359980 06-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Correct the problems with the ptrace(2) making the debuggee an orphan.
One problem is inferior(9) looping due to the process tree becoming a
graph instead of tree if the parent is traced by child. Another issue
is due to the use of p_oppid to restore the original parent/child
relationship, because real parent could already exited and its pid
reused (noted by mjg).

Add the function proc_realparent(9), which calculates the parent for
given process. It uses the flag P_TREE_FIRST_ORPHAN to detect the head
element of the p_orphan list and than stepping back to its container
to find the parent process. If the parent has already exited, the
init(8) is returned.

Move the P_ORPHAN and the new helper flag from the p_flag* to new
p_treeflag field of struct proc, which is protected by proctree lock
instead of proc lock, since the orphans relationship is managed under
the proctree_lock already.

The remaining uses of p_oppid in ptrace(PT_DETACH) and process
reapping are replaced by proc_realparent(9).

Phabric: D417
Reviewed by: jhb
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d3a3b8b0 28-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Simplify the expression, by removing redundand calculation.

Noted by: "O'Connor, Daniel" <Daniel.O'Connor@emc.com>
MFC after: 3 days


# a62eb139 15-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Followup to r268466.

- Move the code to calculate resident count into separate function.
It reduces the indent level and makes the operation of
vmmap_skip_res_cnt tunable more clear.
- Optimize the calculation of the resident page count for map entry.
Skip directly to the next lowest available index and page among the
whole shadow chain.
- Restore the use of pmap_incore(9), only to verify that current
mapping is indeed superpage.
- Note the issue with the invalid pages.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3760e341 15-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Change the calculation of the kinfo_vmentry field kve_private_resident
to reflect its name.

Noted and reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 479fcb4e 10-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Unconditionally initialize addr to handle the case of changed map
timestamp while the map is unlocked.

Reported by: bz
Sponsored by: The FreeBSD Foundation
MFC after: 6 days


# a91831a2 09-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Current code in sysctl proc.vmmap, which intent is to calculate the
amount of resident pages, in fact calculates the amount of installed
pte entries in the region. Resident pages which were not soft-faulted
yet are not counted.

Calculate the amount of resident pages by looking in the objects chain
backing the region.

Add a knob to disable the residency calculation at all. For large
sparce regions, either previous or updated algorithm runs for too long
time, while several introspection tools do not need the (advisory) RSS
value at all.

PR: kern/188911
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 2db08c03 11-Feb-2014 John Baldwin <jhb@FreeBSD.org>

Expose OBJT_MGTDEVICE VM objects used for GEM/TTM with drm2 as an
explicit object type.

Reviewed by: kib
MFC after: 1 week


# 80c3af4e 26-Nov-2013 Konstantin Belousov <kib@FreeBSD.org>

Add an kinfo sysctl to retrieve signal trampoline location for the
given process.

Note that the correctness of the trampoline length returned for ABIs
which do not use shared page depends on the correctness of the struct
sysvec sv_szsigcodebase member, which will be fixed on as-need basis.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 55648840 19-Sep-2013 John Baldwin <jhb@FreeBSD.org>

Extend the support for exempting processes from being killed when swap is
exhausted.
- Add a new protect(1) command that can be used to set or revoke protection
from arbitrary processes. Similar to ktrace it can apply a change to all
existing descendants of a process as well as future descendants.
- Add a new procctl(2) system call that provides a generic interface for
control operations on processes (as opposed to the debugger-specific
operations provided by ptrace(2)). procctl(2) uses a combination of
idtype_t and an id to identify the set of processes on which to operate
similar to wait6().
- Add a PROC_SPROTECT control operation to manage the protection status
of a set of processes. MADV_PROTECT still works for backwards
compatability.
- Add a p_flag2 to struct proc (and a corresponding ki_flag2 to kinfo_proc)
the first bit of which is used to track if P_PROTECT should be inherited
by new child processes.

Reviewed by: kib, jilles (earlier version)
Approved by: re (delphij)
MFC after: 1 month


# 5e9ccc87 26-Aug-2013 Will Andrews <will@FreeBSD.org>

Add the ability to display the default FIB number for a process to the
ps(1) utility, e.g. "ps -O fib".

bin/ps/keyword.c:
Add the "fib" keyword and default its column name to "FIB".

bin/ps/ps.1:
Add "fib" as a supported keyword.

sys/compat/freebsd32/freebsd32.h:
sys/kern/kern_proc.c:
sys/sys/user.h:
Add the default fib number for a process (p->p_fibnum)
to the user land accessible process data of struct kinfo_proc.

Submitted by: Oliver Fromme <olli@fromme.com>, gibbs


# 7b77e1fe 14-Aug-2013 Mark Johnston <markj@FreeBSD.org>

Specify SDT probe argument types in the probe definition itself rather than
using SDT_PROBE_ARGTYPE(). This will make it easy to extend the SDT(9) API
to allow probes with dynamically-translated types.

There is no functional change.

MFC after: 2 weeks


# 5ea21e69 14-Apr-2013 Mikolaj Golub <trociny@FreeBSD.org>

Similarly to proc_getargv() and proc_getenvv(), export proc_getauxv()
to be able to reuse the code.

MFC after: 3 weeks


# fe52cf54 14-Apr-2013 Mikolaj Golub <trociny@FreeBSD.org>

Re-factor the code to provide kern_proc_filedesc_out(), kern_proc_out(),
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.

The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.

Reviewed by: kib
MFC after: 3 weeks


# bc403f03 08-Apr-2013 Attilio Rao <attilio@FreeBSD.org>

Switch some "low-hanging fruit" to acquire read lock on vmobjects
rather than write locks.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 4f666417 25-Nov-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Look for zombie process only if we were given process id.

Reviewed by: kib
MFC after: 2 weeks
X-MFC-after-or-with: 243142


# 134eb42e 16-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

In pget(9), if PGET_NOTWEXIT flag is not specified, also search the
zombie list for the pid. This allows several kern.proc sysctls to
report useful information for zombies.

Hold the allproc_lock around all searches instead of relocking it.
Remove private pfind_locked() from the new nfs client code.

Requested and reviewed by: pjd
Tested by: pho
MFC after: 3 weeks


# 4419a8a8 13-Nov-2012 Mateusz Guzik <mjg@FreeBSD.org>

enterpgrp: get rid of pgrp2 variable and use KASSERT directly on pgfind result.

pgrp2 was used only for debugging, but pgrp2 = pgfind(..) was present in compiled code even for kernels without INVARIANTS

Approved by: trasz (mentor)
MFC after: 1 week


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 60ee4338 15-Aug-2012 David E. O'Brien <obrien@FreeBSD.org>

Don't include opt_ddb.h & <ddb/ddb.h> twice.


# 1c771f92 05-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

After the PHYS_TO_VM_PAGE() function was de-inlined, the main reason
to pull vm_param.h was removed. Other big dependency of vm_page.h on
vm_param.h are PA_LOCK* definitions, which are only needed for
in-kernel code, because modules use KBI-safe functions to lock the
pages.

Stop including vm_param.h into vm_page.h. Include vm_param.h
explicitely for the kernel code which needs it.

Suggested and reviewed by: alc
MFC after: 2 weeks


# 599fc82b 16-Jul-2012 Gabor Pali <pgj@FreeBSD.org>

- Add support for displaying process stack memory regions.

Approved by: rwatson
MFC after: 3 days


# 371778a3 26-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix ki_cow for compat32 binaries.

MFC after: 3 days


# 4d34e019 23-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Calculate the count of per-process cow faults. Export the count to
userspace using the obscure spare int field in struct kinfo_proc.

Submitted by: Andrey Zonov <andrey zonov org>
MFC after: 1 week


# b3bfb267 23-Apr-2012 Konstantin Belousov <kib@FreeBSD.org>

Allow for the process information sysctls to accept a thread id in addition
to the process id. It follows the ptrace(2) interface and allows debugging
libraries to use thread ids directly, without slow and verbose conversion
of thread id into pid.

The PGET_NOTID flag is provided to allow a specific sysctl to disallow
this behaviour. All current callers of pget(9) have useful semantic to
operate on tid and do not need this flag.

Reviewed by: jhb, trocini
MFC after: 1 week


# 903712c9 23-Mar-2012 Mikolaj Golub <trociny@FreeBSD.org>

Add a sysctl to set and retrieve binary osreldate of another process.

Suggested by: kib
Reviewed by: kib
MFC after: 2 weeks


# e0fcf639 03-Mar-2012 Mikolaj Golub <trociny@FreeBSD.org>

Make kern.proc.umask sysctl readonly.

Requested by: src
MFC after: 1 week


# 6ce13747 26-Feb-2012 Mikolaj Golub <trociny@FreeBSD.org>

Add sysctl to retrieve or set umask of another process.

Submitted by: Dmitry Banschikov <me ubique spb ru>
Discussed with: kib, rwatson
Reviewed by: kib
MFC after: 2 weeks


# 45efc9b4 25-Jan-2012 Mikolaj Golub <trociny@FreeBSD.org>

Fix CTL flags in the declarations of KERN_PROC_ENV, AUXV and
PS_STRINGS sysctls: they are read only.

MFC after: 1 week


# 8854fe39 22-Jan-2012 Mikolaj Golub <trociny@FreeBSD.org>

Change kern.proc.rlimit sysctl to:

- retrive only one, specified limit for a process, not the whole
array, as it was previously (the sysctl has been added recently and
has not been backported to stable yet, so this change is ok);

- allow to set a resource limit for another process.

Submitted by: Andrey Zonov <andrey at zonov.org>
Discussed with: kib
Reviewed by: kib
MFC after: 2 weeks


# fe7f89b7 15-Jan-2012 Mikolaj Golub <trociny@FreeBSD.org>

Abrogate nchr argument in proc_getargv() and proc_getenvv(): we always want
to read strings completely to know the actual size.

As a side effect it fixes the issue with kern.proc.args and kern.proc.env
sysctls, which didn't return the size of available data when calling
sysctl(3) with the NULL argument for oldp.

Note, in get_ps_strings(), which does actual work for proc_getargv() and
proc_getenvv(), we still have a safety limit on the size of data read in
case of a corrupted procces stack.

Suggested by: kib
MFC after: 3 days


# 547b155e 17-Dec-2011 Mikolaj Golub <trociny@FreeBSD.org>

Fix style and white spaces.

MFC after: 1 week


# fa3935bc 17-Dec-2011 Mikolaj Golub <trociny@FreeBSD.org>

On start most of sysctl_kern_proc functions use the same pattern:
locate a process calling pfind() and do some additional checks like
p_candebug(). To reduce this code duplication a new function pget() is
introduced and used.

As the function may be useful not only in kern_proc.c it is in the
kernel name space.

Suggested by: kib
Reviewed by: kib
MFC after: 2 weeks


# 9e94d5b8 05-Dec-2011 Mikolaj Golub <trociny@FreeBSD.org>

Really protect kern.proc.ps_strings sysctls with p_candebug(). This
was intended to be in r228288.

Spotted by: many
MFC after: 1 week


# c65932be 05-Dec-2011 Mikolaj Golub <trociny@FreeBSD.org>

Protect kern.proc.auxv and kern.proc.ps_strings sysctls with p_candebug().

Citing jilles:

If we are ever going to do ASLR, the AUXV information tells an attacker
where the stack, executable and RTLD are located, which defeats much of
the point of randomizing the addresses in the first place.

Given that the AUXV information seems to be used by debuggers only anyway,
I think it would be good to move it to p_candebug() now.

The full virtual memory maps (KERN_PROC_VMMAP, procstat -v) are already
under p_candebug().

Suggested by: jilles
Discussed with: rwatson
MFC after: 1 week


# 0f60ecda 04-Dec-2011 Mikolaj Golub <trociny@FreeBSD.org>

In sysctl_kern_proc_ps_strings() there is no much sense in checking
for P_WEXIT and P_SYSTEM flags.

Reviewed by: kib


# 9732458f 27-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

Add sysctl to retrieve ps_strings structure location of another process.

Suggested by: kib
Reviewed by: kib


# 4fd6053b 27-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

In sysctl_kern_proc_auxv the process was released too early: we still
need to hold it when checking process sv_flags.

MFC after: 2 weeks


# 9e7d0583 24-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

Add sysctl to get process resource limits.

Reviewed by: kib
MFC after: 2 weeks


# 7ad9baae 23-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

Fix build without INVARIANTS.

Discussed with: kib


# c5cfcb1c 22-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

Add new sysctls, KERN_PROC_ENV and KERN_PROC_AUXV, to return
environment strings and ELF auxiliary vectors from a process stack.

Make sysctl_kern_proc_args to read not cached arguments from the
process stack.

Export proc_getargv() and proc_getenvv() so they can be reused by
procfs and linprocfs.

Suggested by: kib
Reviewed by: kib
Discussed with: kib, rwatson, jilles
Tested by: pho
MFC after: 2 weeks


# ca4aa8c3 20-Nov-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Remove no more relevant XXXRW comments since accessing the vmspace is now
properly done with the acquired vmspace reference.

Pointed out by: kib


# 18be8527 21-Nov-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Use the acquired reference to the vmspace instead of direct dereferencing
of p->p_vmspace like it is done in sysctl_kern_proc_vmmap().


# 5384d089 07-Nov-2011 Mikolaj Golub <trociny@FreeBSD.org>

Add KVME_FLAG_SUPER and use it in sysctl_kern_proc_vmmap for marking
entries with superpages.

Submitted by: Mel Flynn <mel.flynn+fbsd.hackers@mailing.thruhere.net>
Reviewed by: alc, rwatson


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# f55d3fbe 18-Aug-2011 John Baldwin <jhb@FreeBSD.org>

One of the general principles of the sysctl(3) API is that a user can
query the needed size for a sysctl result by passing in a NULL old
pointer and a valid oldsize. The kern.proc.args sysctl handler broke
this assumption by not calling SYSCTL_OUT() if the old pointer was
NULL.

Approved by: re (kib)
MFC after: 3 days


# 925af544 18-Jul-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Rename ki_ocomm to ki_tdname and OCOMMLEN to TDNAMLEN.
Provide backward compatibility defines under BURN_BRIDGES.

Suggested by: jhb
Reviewed by: emaste
Sponsored by: Sandvine Incorporated
Approved by: re (kib)


# 2417d97e 18-Jul-2011 John Baldwin <jhb@FreeBSD.org>

- Export each thread's individual resource usage in in struct kinfo_proc's
ki_rusage member when KERN_PROC_INC_THREAD is passed to one of the
process sysctls.
- Correctly account for the current thread's cputime in the thread when
doing the runtime fixup in calcru().
- Use TIDs as the key to lookup the previous thread to compute IO stat
deltas in IO mode in top when thread display is enabled.

Reviewed by: kib
Approved by: re (kib)


# 0daf62d9 12-May-2011 Stanislav Sedov <stas@FreeBSD.org>

- Commit work from libprocstat project. These patches add support for runtime
file and processes information retrieval from the running kernel via sysctl
in the form of new library, libprocstat. The library also supports KVM backend
for analyzing memory crash dumps. Both procstat(1) and fstat(1) utilities have
been modified to take advantage of the library (as the bonus point the fstat(1)
utility no longer need superuser privileges to operate), and the procstat(1)
utility is now able to display information from memory dumps as well.

The newly introduced fuser(1) utility also uses this library and able to operate
via sysctl and kvm backends.

The library is by no means complete (e.g. KVM backend is missing vnode name
resolution routines, and there're no manpages for the library itself) so I
plan to improve it further. I'm commiting it so it will get wider exposure
and review.

We won't be able to MFC this work as it relies on changes in HEAD, which
was introduced some time ago, that break kernel ABI. OTOH we may be able
to merge the library with KVM backend if we really need it there.

Discussed with: rwatson


# 8e6fa660 24-Mar-2011 John Baldwin <jhb@FreeBSD.org>

Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after: 1 week


# 7123f4cd6 05-Mar-2011 Edward Tomasz Napierala <trasz@FreeBSD.org>

Export login class information via kinfo and make it possible to view
it using "ps -o class".


# 96fcc75f 01-Mar-2011 Robert Watson <rwatson@FreeBSD.org>

Add initial support for Capsicum's Capability Mode to the FreeBSD kernel,
compiled conditionally on options CAPABILITIES:

Add a new credential flag, CRED_FLAG_CAPMODE, which indicates that a
subject (typically a process) is in capability mode.

Add two new system calls, cap_enter(2) and cap_getmode(2), which allow
setting and querying (but never clearing) the flag.

Export the capability mode flag via process information sysctls.

Sponsored by: Google, Inc.
Reviewed by: anderson
Discussed with: benl, kris, pjd
Obtained from: Capsicum Project
MFC after: 3 months


# 6fa39a73 25-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Allow debugger to specify that children of the traced process should be
automatically traced. Extend the ptrace(PL_LWPINFO) to report that child
just forked.

Reviewed by: davidxu, jhb
MFC after: 2 weeks


# 8d065a39 14-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix some more style(9) issues.


# b389be97 14-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix style(9) issues from r215281 and r215282.

MFC after: 1 week


# 2baa5cdd 13-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Add some descriptions to sys/kern sysctls.

PR: kern/148710
Tested by: Chip Camden <sterling at camdensoftware.com>
MFC after: 1 week


# a37e14e1 11-Nov-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix style.

Submitted by: bde


# 98e0196a 11-Nov-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unneeded conditional.

Discussed with: kib


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 6239ef1d 07-Oct-2010 Ed Maste <emaste@FreeBSD.org>

Make a thread's address available via the kern proc sysctl, just like the
process address.

Add "tdaddr" keyword to ps(1) to display this thread address.

Distilled from Sandvine's patch set by Mark Johnston.


# 79856499 22-Aug-2010 Rui Paulo <rpaulo@FreeBSD.org>

Add an extra comment to the SDT probes definition. This allows us to get
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by: The FreeBSD Foundation
Discussed with: rwaston [1]


# ba50d597 19-Aug-2010 John Baldwin <jhb@FreeBSD.org>

There isn't really a need to hold the ktrace mutex just to read the value
of p_traceflag that is stored in the kinfo_proc structure. It is still
racey even with the lock and the code will read a consistent snapshot of
the flag without the lock.


# 937912ea 27-May-2010 Attilio Rao <attilio@FreeBSD.org>

Add the support for reporting the NOCOREDUMP flag from
sysctl_kern_proc_vmmap().

Sponsored by: Sandvine Incorporated
Reviewed by: kib, emaste
MFC after: 1 week


# e11f17f3 11-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207603
Use td_rux.rux_runtime for ki_runtime instead of redoing calculation.

MFC r207659:
Fix a mistake in r207603. td_rux.rux_runtime still needs conversion.


# 43042bba 08-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207363:
Remove caddr_t casts.


# 19effccd 08-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r204051 (by imp):
n64 has a different size for KINFO_PROC_SIZE.

Approved by: imp

MFC r207152:
Move the constants specifying the size of struct kinfo_proc into
machine-specific header files. Add KINFO_PROC32_SIZE for struct
kinfo_proc32 for architectures providing COMPAT_FREEBSD32. Add
CTASSERT for the size of struct kinfo_proc32.

MFC r207269:
Style: use #define<TAB> instead of #define<SPACE>.


# 213c077f 05-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix a mistake in r207603. td_rux.rux_runtime still needs conversion.

Reported and tested by: nwhitehorn
Pointy hat to: kib
MFC after: 6 days


# 03d13670 04-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Use td_rux.rux_runtime for ki_runtime instead of redoing calculation.

Submitted by: bde
MFC after: 1 week


# e68d26fd 29-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove caddr_t casts.

Requested by: bde
MFC after: 10 days


# 517c07c8 28-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r207008:
Provide compat32 shims for kinfo_proc sysctl.

MFC r207016:
Fix typo.


# ed780687 23-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Move the constants specifying the size of struct kinfo_proc into
machine-specific header files. Add KINFO_PROC32_SIZE for struct
kinfo_proc32 for architectures providing COMPAT_FREEBSD32. Add
CTASSERT for the size of struct kinfo_proc32.

Submitted by: pluknet
Reviewed by: imp, jhb, nwhitehorn
MFC after: 2 weeks


# b4bf2ac1 21-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix typo.

Submitted by: emaste
Pointy hat to: kib (who needs much bigger wardrobe)
MFC after: 1 week


# 05e06d11 21-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Provide compat32 shims for kinfo_proc sysctl. This allows 32bit ps(1) to
mostly work on 64bit host.

The work is based on an original patch submitted by emaste, obtained
from Sandvine's source tree.

Reviewed by: jhb
MFC after: 1 week


# 873d2744 14-Mar-2010 Jilles Tjoelker <jilles@FreeBSD.org>

MFC r204410: Include terminated threads in ps's process cpu time field.

When a kinfo_proc is filled, first fill_kinfo_proc_only() fills in
ki_runtime using p->p_rux.rux_runtime (all cpu time used by the process
including terminated threads). If information for a specific thread is
requested, fill_kinfo_thread() then overwrites this with the thread's
td->td_runtime (good). If not, fill_kinfo_aggregate() overwrote it with
the sum of all threads' td->td_runtime which does not include terminated
threads.

This affects ps(1)'s TIME field, not its %CPU field nor anything in
top(1).


# d3fe3690 05-Mar-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r204413:
For kinfo_proc in kp->ki_siglist, return the set of the signals pending
in the process queue when gathering information for the process, and set
of signals pending for the thread, when gathering information for the
thread.


# 00305546 27-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

For kinfo_proc in kp->ki_siglist, return the set of the signals pending
in the process queue when gathering information for the process, and set
of signals pending for the thread, when gathering information for the
thread. Previously, the sysctl returned a union of the process and some
arbitrary thread pending set for the process, and union of the process
and the thread pending set for the thread.

MFC after: 1 week


# bbb17d16 26-Feb-2010 Jilles Tjoelker <jilles@FreeBSD.org>

Include terminated threads in ps's process cpu time field.

MFC after: 2 weeks


# 8aeffdc3 28-Dec-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

MFC r200995:

Remove an unused global.


# 095809b0 25-Dec-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Remove an unused global.

MFC after: 3 days


# 8dc9b4cf 19-Dec-2009 Ed Schouten <ed@FreeBSD.org>

Let access overriding to TTYs depend on the cdev_priv, not the vnode.

Basically this commit changes two things, which improves access to TTYs
in exceptional conditions. Basically the problem was that when you ran
jexec(8) to attach to a jail, you couldn't use /dev/tty (well, also the
node of the actual TTY, e.g. /dev/pts/X). This is very inconvenient if
you want to attach to screens quickly, use ssh(1), etc.

The fixes:

- Cache the cdev_priv of the controlling TTY in struct session. Change
devfs_access() to compare against the cdev_priv instead of the vnode.
This allows you to bypass UNIX permissions, even across different
mounts of devfs.

- Extend devfs_prison_check() to unconditionally expose the device node
of the controlling TTY, even if normal prison nesting rules normally
don't allow this. This actually allows you to interact with this
device node.

To be honest, I'm not really happy with this solution. We now have to
store three pointers to a controlling TTY (s_ttyp, s_ttyvp, s_ttydp).
In an ideal world, we should just get rid of the latter two and only use
s_ttyp, but this makes certian pieces of code very impractical (e.g.
devfs, kern_exit.c).

Reported by: Many people


# b561979f 02-Nov-2009 Ed Maste <emaste@FreeBSD.org>

MFC r197692:

In fill_kinfo_thread, copy the thread's name into struct kinfo_proc even
if it is empty. Otherwise the previous thread's name would remain in the
struct and then be reported for this thread.

Submitted by: Ryan Stone


# e380cc73 01-Oct-2009 Ed Maste <emaste@FreeBSD.org>

In fill_kinfo_thread, copy the thread's name into struct kinfo_proc even
if it is empty. Otherwise the previous thread's name would remain in the
struct and then be reported for this thread.

Submitted by: Ryan Stone
MFC after: 1 week


# 2af00dec 08-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r196730:
Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ.

Introduce separate kernel stack cache that keeps some limited amount of
preallocated kernel stacks to lower the latency of thread allocation.

Not a merge: instead of removing td_altkstack* members of struct thread,
replace them with placeholders to keep struct thread layout on the
stable branch.

Also, record r196640, r196644 and r196648 as merged.

Approved by: re (kensmith)


# 8a945d10 01-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

Reintroduce the r196640, after fixing the problem with my testing.

Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho (and retested according to new test scenarious)
MFC after: 1 week


# f25fa6ab 29-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Reverse r196640 and r196644 for now.


# c3cf0b47 29-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Remove the altkstacks, instead instantiate threads with kernel stack
allocated with the right size from the start. For the thread that has
kernel stack cached, verify that requested stack size is equial to the
actual, and reallocate the stack if sizes differ [1].

This fixes the bug introduced by r173361 that was committed several days
after r173004 and consisted of kthread_add(9) ignoring the non-default
kernel stack size.

Also, r173361 removed the caching of the kernel stacks for a non-first
thread in the process. Introduce separate kernel stack cache that keeps
some limited amount of preallocated kernel stacks to lower the latency
of thread allocation. Add vm_lowmem handler to prune the cache on
low memory condition. This way, system with reasonable amount of the
threads get lower latency of thread creation, while still not exhausting
significant portion of KVA for unused kstacks.

Submitted by: peter [1]
Discussed with: jhb, julian, peter
Reviewed by: jhb
Tested by: pho
MFC after: 1 week


# 254d03c5 24-Jul-2009 Brooks Davis <brooks@FreeBSD.org>

Introduce a new sysctl process mib, kern.proc.groups which adds the
ability to retrieve the group list of each process.

Modify procstat's -s option to query this mib when the kinfo_proc
reports that the field has been truncated. If the mib does not exist,
fall back to the truncated list.

Reviewed by: rwatson
Approved by: re (kib)
MFC after: 2 weeks


# 1b5768be 24-Jul-2009 Brooks Davis <brooks@FreeBSD.org>

Revert the changes to struct kinfo_proc in r194498. Instead, fill
in up to 16 (KI_NGROUPS) values and steal a bit from ki_cr_flags
(all bits currently unused) to indicate overflow with the new flag
KI_CRF_GRP_OVERFLOW.

This fixes procstat -s.

Approved by: re (kib)


# 01381811 24-Jul-2009 John Baldwin <jhb@FreeBSD.org>

Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# 838d9858 19-Jun-2009 Brooks Davis <brooks@FreeBSD.org>

Rework the credential code to support larger values of NGROUPS and
NGROUPS_MAX, eliminate ABI dependencies on them, and raise the to 1024
and 1023 respectively. (Previously they were equal, but under a close
reading of POSIX, NGROUPS_MAX was defined to be too large by 1 since it
is the number of supplemental groups, not total number of groups.)

The bulk of the change consists of converting the struct ucred member
cr_groups from a static array to a pointer. Do the equivalent in
kinfo_proc.

Introduce new interfaces crcopysafe() and crsetgroups() for duplicating
a process credential before modifying it and for setting group lists
respectively. Both interfaces take care for the details of allocating
groups array. crsetgroups() takes care of truncating the group list
to the current maximum (NGROUPS) if necessary. In the future,
crsetgroups() may be responsible for insuring invariants such as sorting
the supplemental groups to allow groupmember() to be implemented as a
binary search.

Because we can not change struct xucred without breaking application
ABIs, we leave it alone and introduce a new XU_NGROUPS value which is
always 16 and is to be used or NGRPS as appropriate for things such as
NFS which need to use no more than 16 groups. When feasible, truncate
the group list rather than generating an error.

Minor changes:
- Reduce the number of hand rolled versions of groupmember().
- Do not assign to both cr_gid and cr_groups[0].
- Modify ipfw to cache ucreds instead of part of their contents since
they are immutable once referenced by more than one entity.

Submitted by: Isilon Systems (initial implementation)
X-MFC after: never
PR: bin/113398 kern/133867


# 820c5181 01-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Add a flags field to struct ucred, and export that via kinfo_proc,
consuming one of its spare fields. The cr_flags field is currently
unused, but will be used for features, including capability mode and
pay-as-you-go audit.

Discussed with: jhb, sson


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# f8d90480 18-Feb-2009 Attilio Rao <attilio@FreeBSD.org>

- Add a function (fill_kinfo_aggregate()) which aggregates relevant
members for a kinfo entry on a process-wide system.
- Use the newly introduced function in order to fix cases like
KERN_PROC_PROC where aggregating stats are broken because they just
consider the first thread in the pool for each process.
(Note, additively, that KERN_PROC_PROC is rather inaccurate on
thread-wide informations like the 'state' of the process. Such
informations should maybe be invalidated and being forceably discarded
by the consumers?).
- Simplify the logic of sysctl_out_proc() and adjust the
fill_kinfo_thread() accordingly.
- Remove checks on the FIRST_THREAD_IN_PROC() being NULL but add
assertives.

This patch should fix aggregate statistics for KERN_PROC_PROC.
This is one of the reasons why top doesn't use this option and now it
can be use it safely.
ps, when launched in order to display just processes, now should report
correct cpu utilization percentages and times (as opposed by the old
code).

Reviewed by: jhb, emaste
Sponsored by: Sandvine Incorporated


# 24f87fdb 23-Jan-2009 John Baldwin <jhb@FreeBSD.org>

- Add conditional Giant locking around the vrele() in
sysctl_kern_proc_pathname().
- Mark all the kern.proc.* sysctls as MPSAFE.

Submitted by: csjp (2)


# 22a448c4 28-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

vm_map_lock_read() does not increment map->timestamp, so we should
compare map->timestamp with saved timestamp after map read lock is
reacquired, not with saved timestamp + 1. The only consequence of the +1
was unconditional lookup of the next map entry, though.

Tested by: pho
Approved by: des
MFC after: 2 weeks


# c7462f43 11-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Reference the vmspace of the process being inspected by procfs, linprocfs
and sysctl kern_proc_vmmap handlers.

Reported and tested by: pho
Reviewed by: rwatson, des
MFC after: 1 week


# 118d0afa 07-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Do drop vm map lock earlier in the sysctl_kern_proc_vmmap(), to avoid
locking a vnode while having vm map locked.

Reported and tested by: pho
MFC after: 1 week


# aeb32571 05-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Several threads in a process may do vfork() simultaneously. Then, all
parent threads sleep on the parent' struct proc until corresponding
child releases the vmspace. Each sleep is interlocked with proc mutex of
the child, that triggers assertion in the sleepq_add(). The assertion
requires that at any time, all simultaneous sleepers for the channel use
the same interlock.

Silent the assertion by using conditional variable allocated in the
child. Broadcast the variable event on exec() and exit().

Since struct proc * sleep wait channel is overloaded for several
unrelated events, I was unable to remove wakeups from the places where
cv_broadcast() is added, except exec().

Reported and tested by: ganbold
Suggested and reviewed by: jhb
MFC after: 2 week


# 43151ee6 01-Dec-2008 Peter Wemm <peter@FreeBSD.org>

Merge user/peter/kinfo branch as of r185547 into head.

This changes struct kinfo_filedesc and kinfo_vmentry such that they are
same on both 32 and 64 bit platforms like i386/amd64 and won't require
sysctl wrapping.

Two new OIDs are assigned. The old ones are available under
COMPAT_FREEBSD7 - but it isn't that simple. The superceded interface
was never actually released on 7.x.

The other main change is to pack the data passed to userland via the
sysctl. kf_structsize and kve_structsize are reduced for the copyout.
If you have a process with 100,000+ sockets open, the unpacked records
require a 132MB+ copyout. With packing, it is "only" ~35MB. (Still
seriously unpleasant, but not quite as devastating). A similar problem
exists for the vmentry structure - have lots and lots of shared libraries
and small mmaps and its copyout gets expensive too.

My immediate problem is valgrind. It traditionally achieves this
functionality by parsing procfs output, in a packed format. Secondly, when
tracing 32 bit binaries on amd64 under valgrind, it uses a cross compiled
32 bit binary which ran directly into the differing data structures in 32
vs 64 bit mode. (valgrind uses this to track file descriptor operations
and this therefore affected every single 32 bit binary)

I've added two utility functions to libutil to unpack the structures into
a fixed record length and to make it a little more convenient to use.


# 1ba4a712 17-Nov-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 2ff47c5f 04-Nov-2008 John Baldwin <jhb@FreeBSD.org>

Remove unnecessary locking around vn_fullpath(). The vnode lock for the
vnode in question does not need to be held. All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.

In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).

For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.

MFC after: 1 month


# 7a9c4d24 30-Oct-2008 Peter Wemm <peter@FreeBSD.org>

Add three extra to the kinfo_proc_vmmap data. kve_offset - the offset
within an object that a mapping refers to. fileid and fsid are inode/dev
for vnodes. (Linux procfs has these and valgrind is really unhappy
without them.) I believe I didn't change the size of the struct.


# 1ede983c 23-Oct-2008 Dag-Erling Smørgrav <des@FreeBSD.org>

Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 42ff2756 16-Sep-2008 Ed Schouten <ed@FreeBSD.org>

Fix minor TTY API inconsistency.

Unlike tty_rel_gone() and tty_rel_sess(), the tty_rel_pgrp() routine
does not unlock the TTY. I once had the idea to make the code call
tty_rel_pgrp() and tty_rel_sess(), picking up the TTY lock once. This
turned out a little harder than I expected, so this is how it works now.

It's a lot easier if we just let tty_rel_pgrp() unlock the TTY, because
the other routines do this anyway.


# f308bddd 04-Sep-2008 Kevin Lo <kevlo@FreeBSD.org>

If the process id specified is invalid, the system call returns ESRCH


# bc093719 20-Aug-2008 Ed Schouten <ed@FreeBSD.org>

Integrate the new MPSAFE TTY layer to the FreeBSD operating system.

The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.

If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).

The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan


# 58e8af1b 25-Jul-2008 Konstantin Belousov <kib@FreeBSD.org>

Call pargs_drop() unconditionally in do_execve(), the function correctly
handles the NULL argument.
Make pargs_free() static.

MFC after: 1 week


# 5d217f17 24-May-2008 John Birrell <jb@FreeBSD.org>

Add DTrace 'proc' provider probes using the Statically Defined Trace
(sdt) mechanism.


# 374ae2a3 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# d92909c1 10-Jan-2008 Robert Watson <rwatson@FreeBSD.org>

Don't zero td_runtime when billing thread CPU usage to the process;
maintain a separate td_incruntime to hold unbilled CPU usage for
the thread that has the previous properties of td_runtime.

When thread information is requested using the thread monitoring
sysctls, export thread td_runtime instead of process rusage runtime
in kinfo_proc.

This restores the display of individual ithread and other kernel
thread CPU usage since inception in ps -H and top -SH, as well for
libthr user threads, valuable debugging information lost with the
move to try kthreads since they are no longer independent processes.

There is universal agreement that we should rewrite the process and
thread export sysctls, but this commit gets things going a bit
better in the mean time. Likewise, there are resevations about the
continued validity of statclock given the speed of modern processors.

Reviewed by: attilio, emaste, jhb, julian


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 0417fe54 27-Dec-2007 Robert Watson <rwatson@FreeBSD.org>

Return ESRCH when a kernel stack is queried on a process in execve() --
p_candebug() will return EAGAIN which, if the other process never
leaves execve(), will result in the sysctl spinning and never returning
to userspace. Processes should always eventually leave execve(), but
spinning in kernel while we wait is bad for countless reasons, and
particularly harmful if execve() itself is deadlocked.

Possibly we should return another error, or return a marker indicating
the thread is in execve() so it can be reported that way in userspace.

Reported by: kris


# 63d79c4f 09-Dec-2007 Robert Watson <rwatson@FreeBSD.org>

Check for P_WEXIT before PHOLD() on a process in kstack and vm query
sysctls, as PHOLD() asserts !P_WEXIT.

Reported by: Michael Plass <mfp49_freebsd at plass-family dot net>


# 1cc8c45c 02-Dec-2007 Robert Watson <rwatson@FreeBSD.org>

Add another new sysctl in support of the forthcoming procstat(1) to
support its -k argument:

kern.proc.kstack - dump the kernel stack of a process, if debugging
is permitted.

This sysctl is present if either "options DDB" or "options STACK" is
compiled into the kernel. Having support for tracing the kernel
stacks of processes from user space makes it much easier to debug
(or understand) specific wmesg's while avoiding the need to enter
DDB in order to determine the path by which a process came to be
blocked on a particular wait channel or lock.


# cc43c38c 02-Dec-2007 Robert Watson <rwatson@FreeBSD.org>

Add two new sysctls in support of the forthcoming procstat(1) to support
its -f and -v arguments:

kern.proc.filedesc - dump file descriptor information for a process, if
debugging is permitted, including socket addresses, open flags, file
offsets, file paths, etc.

kern.proc.vmmap - dump virtual memory mapping information for a process,
if debugging is permitted, including layout and information on
underlying objects, such as the type of object and path.

These provide a superset of the information historically available
through the now-deprecated procfs(4), and are intended to be exported
in an ABI-robust form.


# 965b55e2 20-Nov-2007 Robert Watson <rwatson@FreeBSD.org>

Test that p_textvp is non-NULL be dereferencing, as no executable vnode is
set for kernel processes.

Reported by: Skip Ford <skip at menantico dot com>
MFC after: 3 days


# 4a62a3e5 15-Nov-2007 Randall Stewart <rrs@FreeBSD.org>

Adds an event handler for:
- process_ctor,dtor, init and fini
- thread_ctor,dtor, init and fini
This allows the ability to add on additional things
during construction/destruction of threads and processes.

Reviewed by: rwatson


# 89b57fcf 05-Nov-2007 Konstantin Belousov <kib@FreeBSD.org>

Fix for the panic("vm_thread_new: kstack allocation failed") and
silent NULL pointer dereference in the i386 and sparc64 pmap_pinit()
when the kmem_alloc_nofault() failed to allocate address space. Both
functions now return error instead of panicing or dereferencing NULL.

As consequence, vmspace_exec() and vmspace_unshare() returns the errno
int. struct vmspace arg was added to vm_forkproc() to avoid dealing
with failed allocation when most of the fork1() job is already done.

The kernel stack for the thread is now set up in the thread_alloc(),
that itself may return NULL. Also, allocation of the first process
thread is performed in the fork1() to properly deal with stack
allocation failure. proc_linkup() is separated into proc_linkup()
called from fork1(), and proc_linkup0(), that is used to set up the
kernel process (was known as swapper).

In collaboration with: Peter Holm
Reviewed by: jhb


# 54b0e65f 20-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.

Approved by: re


# b61ce5b0 16-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# a1fe14bc 09-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

rufetch and calcru sometimes should be called atomically together.
This patch fixes places where they should be called atomically changing
their locking requirements (both assume per-proc spinlock held) and
introducing rufetchcalc which wrappers both calls to be performed in
atomic way.

Reviewed by: jeff
Approved by: jeff (mentor)


# 982d11f8 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 1c4bcd05 31-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 13b762a3 22-Mar-2007 Ed Maste <emaste@FreeBSD.org>

Stop setting ki_ocomm (thread name) to the proc name by default, as nothing
in the base system relies on this any longer.


# ad1e7d28 05-Dec-2006 Julian Elischer <julian@FreeBSD.org>

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# 8460a577 26-Oct-2006 John Birrell <jb@FreeBSD.org>

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# 2342d521 30-Sep-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove duplicated $FreeBSD$.


# 8be56372 27-Sep-2006 Martin Blapp <mbr@FreeBSD.org>

Move Giant up even further since P_CONTROLT isn't really fully locked
yet (p_flag is, but P_CONTROLT isn't really).

Submitted by: jhb


# 45e68191 23-Sep-2006 Martin Blapp <mbr@FreeBSD.org>

Protect enterpgrp() against another tty/proc race case until the tty locking work
has been fixed.

MFC after: 1 week


# d7b167b5 19-Sep-2006 Martin Blapp <mbr@FreeBSD.org>

Fix races between tty.c and sessrele() / doenterpgrp() / leavepgrp(). The tty
code is still under giant lock, but the session/pgrp release code just used
proctree_locks. This explains why moving the proctree_lock in sys/kern/tty.c
rev. 1.258 did fix the panics in our SMP systems.

This should also fix some race panics with revoked ttys.

Reviewed by: jhb
MFC after: 1 week


# e8444a7e 11-Feb-2006 Poul-Henning Kamp <phk@FreeBSD.org>

CPU time accounting speedup (step 2)

Keep accounting time (in per-cpu) cputicks and the statistics counts
in the thread and summarize into struct proc when at context switch.

Don't reach across CPUs in calcru().

Add code to calibrate the top speed of cpu_tickrate() for variable
cpu_tick hardware (like TSC on power managed machines).

Don't enforce monotonicity (at least for now) in calcru. While the
calibrated cpu_tickrate ramps up it may not be true.

Use 27MHz counter on i386/Geode.

Use TSC on amd64 & i386 if present.

Use tick counter on sparc64


# 5b1a8eb3 07-Feb-2006 Poul-Henning Kamp <phk@FreeBSD.org>

Modify the way we account for CPU time spent (step 1)

Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.

For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines. (The main reason for this first milestone commit is
to verify that hypothesis.)

On slower machines, the avoided multiplications to normalize timestams
at every context switch, comes out as a 5-7% better score on the
unixbench/context1 microbenchmark. On more modern hardware no change
in performance is seen.


# 11f4763d 18-Jan-2006 Julian Elischer <julian@FreeBSD.org>

Return the thread name in the kinfo_proc structure.
Also correct the comment describing what the value is.


# b241b0a2 17-Jan-2006 Juli Mallett <jmallett@FreeBSD.org>

Since p_cansee will end up dereferencing p_ucred, don't check for p_ucred
equal to NULL several times later. p_ucred "should probably not" be NULL
if the process isn't PRS_NEW anyway. This is strongly reinforced by the fact
that we don't see frequent crashes here. Remove the checks after p_cansee and
add a KASSERT right before it.

Found by: Coverity Prevent (tm)

Also trim one nearby trailing space.


# 3357835a 29-Dec-2005 David Xu <davidxu@FreeBSD.org>

Add code to report zombie state.

PR: threads/91044
MFC after: 3 days


# 2c255e9d 13-Nov-2005 Robert Watson <rwatson@FreeBSD.org>

Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueuing and dropped records.
Record loss was previously possible when the global pool of records
become depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as records
are still used.

- Add a per-process record queue, which will hold any asynchronously
generated records, such as from context switches. This replaces the
global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
process.

- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.

- When a process returns to user space following a system call, trap,
signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as generating
traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on. I.e., performing a
drain every (n) records.

MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>


# ebceaf6d 08-Nov-2005 David Xu <davidxu@FreeBSD.org>

Add support for queueing SIGCHLD same as other UNIX systems did.

For each child process whose status has been changed, a SIGCHLD instance
is queued, if the signal is stilling pending, and process changed status
several times, signal information is updated to reflect latest process
status. If wait() returns because the status of a child process is
available, pending SIGCHLD signal associated with the child process is
discarded. Any other pending SIGCHLD signals remain pending.

The signal information is allocated at the same time when proc structure
is allocated, if process signal queue is fully filled or there is a memory
shortage, it can still send the signal to process.

There is a booting time tunable kern.sigqueue.queue_sigchild which
can control the behavior, setting it to zero disables the SIGCHLD queueing
feature, the tunable will be removed if the function is proved that it is
stable enough.

Tested on: i386 (SMP and UP)


# f55ab994 24-Oct-2005 John Baldwin <jhb@FreeBSD.org>

Document in #ifdef notnow code the actions that proc_fini would need to
take if struct procs were actually freed.


# 5032ff81 02-Oct-2005 Don Lewis <truckman@FreeBSD.org>

Always wire the sysctl output buffer in sysctl_kern_proc() before
calling sysctl_out_proc(). -- fix from jhb

Move the code in fill_kinfo_thread() that gathers data from struct proc
into the new function fill_kinfo_proc_only().

Change all callers of fill_kinfo_thread() to call both
fill_kinfo_proc_only() and fill_kinfo() thread. When gathering
data from a multi-threaded process, fill_kinfo_proc_only() only needs
to be called once.

Grab sched_lock before accessing the process thread list or calling
fill_kinfo_thread().

PR: kern/84684
MFC after: 3 days


# 55b4a5ae 27-Sep-2005 John Baldwin <jhb@FreeBSD.org>

Use the refcount API to implement reference counts on process argument
structures rather than using a global mutex to protect the reference
counts.

Tested on: i386, alpha, sparc64


# fe769cdd 17-Apr-2005 David Schultz <das@FreeBSD.org>

Add a sysctl that returns the full path of a process' text file.
This information is needed by things like `gdb -p' and Sun's javac,
and previously it could only be obtained via procfs


# c6a37e84 04-Apr-2005 John Baldwin <jhb@FreeBSD.org>

Divorce critical sections from spinlocks. Critical sections as denoted by
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any affect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.

Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.

This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).

Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more


# c78941e6 20-Mar-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add ki_jid field to the kinfo_proc structure and store jail ID there.

Reviewed by: gad
MFC after: 3 days


# 572b4402 17-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

In stange circumstances we may end up being the last reference to a
session in tprintf(). SESSRELE() needs to properly dispose of the
sessions mutex.

Add sessrele() which does the proper cleanup and have SESSRELE() call it.

Use SESSRELE also in pgdelete().

Found by: Coverity (ID:526)


# cefcecbe 12-Mar-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Function jailed() looks into ucred strcture, so be sure ucred is not NULL.

Reviewed by: rwatson
MFC after: 1 week


# d079d0a0 12-Mar-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Clean up a bit.

Reviewed by: rwatson
MFC after: 1 week


# 0c898376 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make a bunch of SYSCTL_NODEs static.


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# 1eecfae3 26-Nov-2004 David Schultz <das@FreeBSD.org>

Axe a.out core dump support. Neither older gdb binaries nor current
bfd sources understand the present format.


# 6db36923 20-Nov-2004 David Schultz <das@FreeBSD.org>

Remove local definitions of RANGEOF() and use __rangeof() instead.
Also remove a few bogus casts.


# 8b059651 19-Nov-2004 David Schultz <das@FreeBSD.org>

Malloc p_stats instead of putting it in the U area. We should consider
simply embedding it in struct proc.

Reviewed by: arch@


# 9b036bdf 09-Oct-2004 Julian Elischer <julian@FreeBSD.org>

Remove duplicate line.


# 78c85e8d 05-Oct-2004 John Baldwin <jhb@FreeBSD.org>

Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.

Submitted by: bde (mostly)
MFC after: 1 month


# 8daa8c60 19-Sep-2004 David Schultz <das@FreeBSD.org>

The zone from which proc structures are allocated is marked
UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should
never be called. Move an assertion from proc_fini() to proc_dtor()
and garbage-collect the rest of the unreachable code. I have retained
vm_proc_dispose(), since I consider its disuse a bug.


# ed062c8d 04-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 6cbea71c 14-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Cause pfind() not to return processes in the PRS_NEW state. As a result,
threads consuming the result of pfind() will not need to check for a NULL
credential pointer or other signs of an incompletely created process.
However, this also means that pfind() cannot be used to test for the
existence or find such a process. Annotate pfind() to indicate that this
is the case. A review of curent consumers seems to indicate that this is
not a problem for any of them. This closes a number of race conditions
that could result in NULL pointer dereferences and related failure modes.
Other related races continue to exist, especially during iteration of the
allproc list without due caution.

Discussed with: tjr, green


# 332e72dd 09-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Remove typos on KASSERT messages.


# b23f72e9 01-Aug-2004 Brian Feldman <green@FreeBSD.org>

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialation functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


# cebabef0 29-Jul-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fill some informations about zombie processes as well.
Before this change every zombie process were reported as an owner of PID 0 in
ps(1) output.

Reviewed by: julian


# 7638fa19 20-Jun-2004 Garance A Drosehn <gad@FreeBSD.org>

Fill in the values for the ki_tid and ki_numthreads which have been
added to kproc_info.

PR: bin/65803 (a tiny part...)
Submitted by: Cyrille Lefevre


# 99d2ecbc 19-Jun-2004 Garance A Drosehn <gad@FreeBSD.org>

Add a call to calcru() to update the kproc_info fields of ki_rusage.ru_utime
and ki_rusage.ru_stime. This greatly improves the accuracy of those fields.

Suggested by: bde


# 078842c5 19-Jun-2004 Garance A Drosehn <gad@FreeBSD.org>

Fill in the some new fields 'struct kinfo_proc', namely ki_childstime,
ki_childutime, and ki_emul. Also uses the timevaladd() routine to
correct the calculation of ki_childtime. That will correct the value
returned when ki_childtime.tv_usec > 1,000,000.

This also implements a new KERN_PROC_GID option for kvm_getprocs().
(there will be a similar update to lib/libkvm/kvm_proc.c)

Submitted by: Cyrille Lefevre


# f3732fd1 17-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# fa885116 15-Jun-2004 Julian Elischer <julian@FreeBSD.org>

Nice, is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


# 2195e420 09-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Reference count struct tty.

Add two new functions: ttyref() and ttyrel(). ttymalloc() creates a struct
tty with a reference count of one. when ttyrel sees the count go to zero,
struct tty is freed.

Hold references for open ttys and for ttys which are controlling terminal
for sessions.

Until drivers start using ttyrel(), this commit will make no difference.


# a59df4e1 09-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a race in destruction of sessions.


# b8fdc89d 22-May-2004 Garance A Drosehn <gad@FreeBSD.org>

Implement the new KERN_PROC_RGID option, and also implement the
KERN_PROC_SESSION option which had been previously defined but
never implemented.

PR: bin/65803 (a very tiny piece of the PR)`
Submitted by: Cyrille Lefevre


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 5e2c0c0b 31-Mar-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove ps_argsopen check. It is was bogus in the past and was corrected
not quite well by me - if kern.ps_argsopen was set to 0, users weren't
permitted to see arguments of even own processes.
But kern.ps_argsopen is going away, so just remove this check and leave
security checks for p_cansee() function.


# 9cdb6216 17-Mar-2004 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix information leakage.
Without this fix it is possible to cheat policies like:
- sysctl security.bsd.see_other_[gu]ids=0,
- mac_seeotheruids(4),
- jail(2)
and get full processes list with their arguments.

This problem exists from revision 1.62 of kern_proc.c when it was
introduced.

Reviewed by: nectar, rwatson.


# 47934cef 25-Feb-2004 Don Lewis <truckman@FreeBSD.org>

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# 2648efa6 22-Feb-2004 Daniel Eischen <deischen@FreeBSD.org>

Add sysctls to allow showing threads for pgrp, tty, uid, ruid,
and pid.


# 7cf90fb3 16-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Update the sched api. sched_{add,rem,clock,pctcpu} now all accept a td
argument rather than a kse.


# 25e247af 14-Oct-2003 Peter Wemm <peter@FreeBSD.org>

The KERN_PROC_PROC sysctl took 4 args in 5.0-REL and 5.1-REL. We need to
accept this for a bit longer. Requiring the new order of 3 args only
was not very helpful.


# a50f62fd 05-Oct-2003 Tim J. Robbins <tjr@FreeBSD.org>

Remove support for the unused 4th component of the KERN_PROC_PROC sysctl.


# 3ddaef40 19-Sep-2003 Tim J. Robbins <tjr@FreeBSD.org>

Allow the KERN_PROC_PROC sysctl to be used without the useless 4th
name component, for consistency with KERN_PROC_ALL. Support for the
4-argument form will be removed some time before 5.2-R.


# 75ea65e3 04-Aug-2003 David Xu <davidxu@FreeBSD.org>

kse.h is not needed for these files.


# e76bad96 17-Jul-2003 Robert Drehmel <robert@FreeBSD.org>

Correct six return statements which returned zero instead of
an appropriate error number after a failure condition.

In particular, three of the changed statements return ESRCH for a
failed pfind(), and in also three places a non-zero return
from p_cansee() will be passed back,

Also noticed by: rwatson


# baf731e6 11-Jul-2003 Robert Drehmel <robert@FreeBSD.org>

Make the system call vector name of a process accessible to user
land applications by introducing the KERN_PROC_SV_NAME sysctl node,
which is searchable by PID.


# 04d2f20f 17-Jun-2003 Scott Long <scottl@FreeBSD.org>

Drop the proc lock around SYSCTL_OUT in the no-threads case.

Submitted by: truckman


# 89f4fca2 14-Jun-2003 Alan Cox <alc@FreeBSD.org>

Move the *_new_altkstack() and *_dispose_altkstack() functions out of the
various pmap implementations into the machine-independent vm. They were
all identical.


# 30c6f34e 12-Jun-2003 Scott Long <scottl@FreeBSD.org>

Add support to sysctl_kern_proc to return all threads in a proc, not just the
first one. The old behaviour can be switched by specifying KERN_PROC_PROC.

Submitted by: julian, tweaks and added functionality by myself


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 90af4afa 13-May-2003 John Baldwin <jhb@FreeBSD.org>

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


# 7d447c95 01-May-2003 John Baldwin <jhb@FreeBSD.org>

Initialize and destroy the struct proc mutex in the proc zone's init and
fini routines instead of in fork() and wait(). This has the nice side
benefit that the proc lock of any process on the allproc list is always
valid and sched_lock doesn't have to be used to test against PRS_NEW
anymore.


# 87ccef7b 01-May-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

Instead of recording the Unix time in a process when it starts, record the
uptime. Where necessary, convert it back to Unix time by adding boottime
to it. This fixes a potential problem in the accounting code, which would
compute the elapsed time incorrectly if the Unix time was stepped during
the lifetime of the process.


# 913fc94d 24-Apr-2003 Tim J. Robbins <tjr@FreeBSD.org>

Include altkstack pages in the RSS regardless of whether the process
is swapped out. Pointed out by jhb.


# 013466aa 23-Apr-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

It seems that 1 was not a magic value as I thought, but a coincidence.
Instead of applying the adjustment to processes with a start time of 1,
apply it to all processes with a start time of less than 3600.

None of this would be necessary if the start times were recorded in ticks
instead of seconds and microseconds.


# ceff7f2a 24-Apr-2003 Tim J. Robbins <tjr@FreeBSD.org>

Do a better job of calculating the RSS for swapped-out processes:
don't include the kernel stacks of swapped-out threads in the page count,
but do include the alternate kernel stack. jhb provided some helpful
comments on this.

PR: 49102


# 1f7440d9 23-Apr-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

When filling out a kinfo_proc structure, if we come across a process
whose p_stats->p_start has the magic value 1, replace it with boottime.
Some users were apparently confused by the fact that ps(1) reported a
start time in early 1970 for system processes.


# 02e878d9 18-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- Add a static function pgadjustjobc() to adjust the job control count for
a process group.
- Call pgadjustjobc() twice in fixjobc() to avoid code duplication and
improve readability.
- Use the proc lock to protect P_SHOULDSTOP() instead of sched_lock.
- Check to see if a process is PRS_NEW with sched_lock before trying to
lock its proc lock since the lock may not be constructed yet.


# 060563ec 10-Apr-2003 Julian Elischer <julian@FreeBSD.org>

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but it's purpose will change a bit
when we add the ability to lock a KSE to a cpu.


# 4093529d 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move p->p_sigmask to td->td_sigmask. Signal masks will be per thread with
a follow on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.


# a5881ea5 13-Mar-2003 John Baldwin <jhb@FreeBSD.org>

- Cache a reference to the credential of the thread that starts a ktrace in
struct proc as p_tracecred alongside the current cache of the vnode in
p_tracep. This credential is then used for all later ktrace operations on
this file rather than using the credential of the current thread at the
time of each ktrace event.
- Now that we have multiple ktrace-related items in struct proc that are
pointers, rename p_tracep to p_tracevp to make it less ambiguous.

Requested by: rwatson (1)


# 8510f2a8 12-Mar-2003 John Baldwin <jhb@FreeBSD.org>

- Various little style fixes.
- If SYSCTL_OUT() fails in sysctl_kern_proc_args(), return the error
instead of ignoring it if we have new arguments for the process.
- If the new arguments for a process are too long, return ENOMEM instead of
returning success but not doing the actual copy.

Submitted by: bde


# 4bc6471b 12-Mar-2003 John Baldwin <jhb@FreeBSD.org>

- Avoid dropping the proc lock around a simple permissions check and just
hold hold it across the check to avoid extra lock operations in the
common case.
- Copy in the new args to a temporary pargs structure before we drop the
reference to the old one. Thus, if the copyin() fails, the process
arguments are unchanged rather than being deleted. Also, p_args is no
longer NULL during the sysctl operation.


# ac2e4153 26-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 98ab1489 04-Jan-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Remove unnecessary lock assertion.


# d64ada50 30-Dec-2002 Jens Schweikhardt <schweikh@FreeBSD.org>

Fix typos, mostly s/ an / a / where appropriate and a few s/an/and/
Add FreeBSD Id tag where missing.


# 79acfc49 21-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add the new sched_pctcpu() function to the sched_* api.
- Provide a routine in sched_4bsd to add this functionality.
- Use sched_pctcpu() in kern_proc, which is the one place outside of
sched_4bsd where the old pctcpu value was accessed directly.

Approved by: re


# de028f5a 20-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Implement a mechanism for allowing schedulers to place scheduler dependant
data in the scheduler independant structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.

Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha


# 5c8329ed 24-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Move thread related code from kern_proc.c to kern_thread.c.
Add code to free KSEs and KSEGRPs on exit.
Sort KSE prototypes in proc.h.
Add the missing kse_exit() syscall.

ksetest now does not leak KSEs and KSEGRPS.

Submitted by: (parts) davidxu


# 81fd4892 21-Oct-2002 David Xu <davidxu@FreeBSD.org>

detect idle kse correctly.


# 2f030624 20-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Add an actual implementation of kse_wakeup()
Submitted by: Davidxu


# e80fb434 17-Oct-2002 Robert Drehmel <robert@FreeBSD.org>

Use strlcpy() instead of strncpy() to copy NUL terminated strings
for safety and consistency.


# c6544064 14-Oct-2002 John Baldwin <jhb@FreeBSD.org>

- Add a new global mutex 'ppeers_lock' to protect the p_peers list of
processes forked with RFTHREAD.
- Use a goto to a label for common code when exiting from fork1() in case
of an error.
- Move the RFTHREAD linkage setup code later in fork since the ppeers_lock
cannot be locked while holding a proc lock. Handle the race of a task
leader exiting and killing its peers while a peer is forking a new child.
In that case, go ahead and let the peer process proceed normally as the
parent is about to kill it. However, the task leader may have already
gone to sleep to wait for the peers to die, so the new child process may
not receive a SIGKILL from the task leader. Rather than try to destruct
the new child process, just go ahead and send it a SIGKILL directly and
add it to the p_peers list. This ensures that the task leader will wait
until both the peer process doing the fork() and the new child process
have received their KILL signals and exited.

Discussed with: truckman (earlier versions)


# 48bfcddd 08-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Round out the facilty for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
teh owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is consceivable that the
borrower should inherit the priority of the owner too.
that's another discussion and would be simple to do.

Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot mor info for threaded proceses though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 551cf4e1 02-Oct-2002 John Baldwin <jhb@FreeBSD.org>

Rename the mutex thread and process states to use a more generic 'LOCK'
name instead. (e.g., SLOCK instead of SMTX, TD_ON_LOCK() instead of
TD_ON_MUTEX()) Eventually a turnstile abstraction will be added that
will be shared with mutexes and other types of locks. SLOCK/TDI_LOCK will
be used internally by the turnstile code and will not be specific to
mutexes. Making the change now ensures that turnstiles can be dropped
in at a later date without affecting the ABI of userland applications.


# 36a8dac1 02-Oct-2002 Archie Cobbs <archie@FreeBSD.org>

Let kse_wakeup() take a KSE mailbox pointer argument.

Reviewed by: julian


# 316ec49a 02-Oct-2002 Scott Long <scottl@FreeBSD.org>

Some kernel threads try to do significant work, and the default KSTACK_PAGES
doesn't give them enough stack to do much before blowing away the pcb.
This adds MI and MD code to allow the allocation of an alternate kstack
who's size can be speficied when calling kthread_create. Passing the
value 0 prevents the alternate kstack from being created. Note that the
ia64 MD code is missing for now, and PowerPC was only partially written
due to the pmap.c being incomplete there.
Though this patch does not modify anything to make use of the alternate
kstack, acpi and usb are good candidates.

Reviewed by: jake, peter, jhb


# 1d9c5696 01-Oct-2002 Juli Mallett <jmallett@FreeBSD.org>

Back our kernel support for reliable signal queues.

Requested by: rwatson, phk, and many others


# 1226f694 30-Sep-2002 Juli Mallett <jmallett@FreeBSD.org>

First half of implementation of ksiginfo, signal queues, and such. This
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.

After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.

CODAFS is unfinished in this regard because the logic is unclear in
some places.

Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]


# 9eb1fdea 29-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Implement basic KSE loaning. This stops a hread that is blocked in BOUND mode
from stopping another thread from completing a syscall, and this allows it to
release its resources etc. Probably more related commits to follow (at least
one I know of)

Initial concept by: julian, dillon
Submitted by: davidxu


# 165d2b99 28-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Rewrite the kse_create() function to better aproach the semantics we
have specified in the design.


# 89def71c 25-Sep-2002 Archie Cobbs <archie@FreeBSD.org>

Make the following name changes to KSE related functions, etc., to better
represent their purpose and minimize namespace conflicts:

kse_fn_t -> kse_func_t
struct thread_mailbox -> struct kse_thr_mailbox
thread_interrupt() -> kse_thr_interrupt()
kse_yield() -> kse_release()
kse_new() -> kse_create()

Add missing declaration of kse_thr_interrupt() to <sys/kse.h>.
Regenerate the various generated syscall files. Minor style fixes.

Reviewed by: julian


# 10b33e6b 23-Sep-2002 Julian Elischer <julian@FreeBSD.org>

oops don't do dthe copy range in a new KSE. There isn't one any more.


# acb46062 23-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Add code to create > 1 KSe per process.
(support code not yet complete)

Submitted by: davidxu


# c76e33b6 16-Sep-2002 Jonathan Mini <mini@FreeBSD.org>

Add kernel support needed for the KSE-aware libpthread:
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Deliver signals to KSE-aware processes via upcall.
- Rename kse mailbox structure fields to be more BSD-like.
- Store the UTS's stack in struct proc in a stack_t.

Reviewed by: bde, deischen, julian
Approved by: -arch


# 4f0db5e0 15-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Allocate KSEs and KSEGRPs separatly and remove them from the proc structure.
next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)

While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.


# 71fad9fd 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# b9f009b0 07-Sep-2002 Peter Wemm <peter@FreeBSD.org>

Make UAREA_PAGES and KSTACK_PAGES visible to userland via sysctl, like
PS_STRINGS and USRSTACK is. This is necessary in order to decode a.out
core dumps. kern_proc.c was already referring to both of these values
but was missing the #include "opt_kstack_pages.h". Make the sysctl
variables visible so that certain kld modules can see how their parent
kernel was configured.


# 6f22742b 06-Sep-2002 Robert Watson <rwatson@FreeBSD.org>

Minor spelling tweak: assume "his" is actually "This".


# 1faf202e 06-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Use UMA as a complex object allocator.
The process allocator now caches and hands out complete process structures
*including substructures* .

i.e. it get's the process structure with the first thread (and soon KSE)
already allocated and attached, all in one hit.

For the average non threaded program (non KSE that is) the allocated thread and its stack remain attached to the process, even when the process is
unused and in the process cache. This saves having to allocate and attach it
later, effectively bringing us (hopefully) close to the efficiency
of pre-KSE systems where these were a single structure.

Reviewed by: davidxu@freebsd.org, peter@freebsd.org


# 2b239dd1 11-Aug-2002 Jens Schweikhardt <schweikh@FreeBSD.org>

Fix typos; each file has at least one s/seperat/separat/
(I skipped those in contrib/, gnu/ and crypto/)
While I was at it, fixed a lot more found by ispell that I
could identify with certainty to be errors. All of these
were in comments or text, not in actual code.

Suggested by: bde
MFC after: 3 days


# 5c38b6db 28-Jul-2002 Don Lewis <truckman@FreeBSD.org>

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# c3b98db0 13-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Thinking about it I came to the conclusion that the KSE states were incorrectly
formulated. The correct states should be:
IDLE: On the idle KSE list for that KSEG
RUNQ: Linked onto the system run queue.
THREAD: Attached to a thread and slaved to whatever state the thread is in.

This means that most places where we were adjusting kse state can go away
as it is just moving around because the thread is..
The only places we need to adjust the KSE state is in transition to and from
the idle and run queues.

Reviewed by: jhb@freebsd.org


# a136efe9 07-Jul-2002 Peter Wemm <peter@FreeBSD.org>

Collect all the (now equivalent) pmap_new_proc/pmap_dispose_proc/
pmap_swapin_proc/pmap_swapout_proc functions from the MD pmap code
and use a single equivalent MI version. There are other cleanups
needed still.

While here, use the UMA zone hooks to keep a cache of preinitialized
proc structures handy, just like the thread system does. This eliminates
one dependency on 'struct proc' being persistent even after being freed.
There are some comments about things that can be factored out into
ctor/dtor functions if it is worth it. For now they are mostly just
doing statistics to get a feel of how it is working.


# 7c7a6f22 30-Jun-2002 Julian Elischer <julian@FreeBSD.org>

If the process is a zombie, then you must not try dereference the thread
because there isn't one. Of course this code only possibly works
for single threaded processes anyhow..


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 9718382d 22-Jun-2002 Jonathan Mini <mini@FreeBSD.org>

Always drop the p_args reference we held for copyout, even if we're about
to change it. This fixes a leak triggered by setproctitle(3).

Approved by: alfred
Noticed by: Peter Jeremy <peter.jeremy@alcatel.com.au>


# 6c84de02 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

Properly lock accesses to p_tracep and p_traceflag. Also make a few
ktrace-only things #ifdef KTRACE that were not before.


# f44d9e24 18-May-2002 John Baldwin <jhb@FreeBSD.org>

Change p_can{debug,see,sched,signal}()'s first argument to be a thread
pointer instead of a proc pointer and require the process pointed to
by the second argument to be locked. We now use the thread ucred reference
for the credential checks in p_can*() as a result. p_canfoo() should now
no longer need Giant.


# e649887b 06-May-2002 Alfred Perlstein <alfred@FreeBSD.org>

Make funsetown() take a 'struct sigio **' so that the locking can
be done internally.

Ensure that no one can fsetown() to a dying process/pgrp. We need
to check the process for P_WEXIT to see if it's exiting. Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.

Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.

Seigo Tanimura helped with this.


# 6041fa0a 03-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

As malloc(9) and free(9) are now Giant-free, remove the Giant lock
across malloc(9) and free(9) of a pgrp or a session.


# c8d8a686 02-May-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Fix the lock order reversal between the sigio lock and a process/pgrp lock in
funsetownlst() by locking the sigio lock across funsetownlst().


# ce00aebe 24-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Free(9) should be Giant-free.

Suggested by: jhb


# 1c2451c2 19-Apr-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Push down Giant for setpgid(), setsid() and aio_daemon(). Giant protects only
malloc(9) and free(9).


# f089b570 16-Apr-2002 John Baldwin <jhb@FreeBSD.org>

- Merge the pgrpsess_lock and proctree_lock sx locks into one proctree_lock
sx lock. Trying to get the lock order between these locks was getting
too complicated as the locking in wait1() was being fixed.
- leavepgrp() now requires an exclusive lock of proctree_lock to be held
when it is called.
- fixjobc() no longer gets a shared lock of proctree_lock now that it
requires an xlock be held by the caller.
- Locking notes in sys/proc.h are adjusted to note that everything that
used to be protected by the pgrpsess_lock is now protected by the
proctree_lock.


# 65c9b430 09-Apr-2002 John Baldwin <jhb@FreeBSD.org>

- Change fill_kinfo_proc() to require that the process is locked when it
is called.
- Change sysctl_out_proc() to require that the process is locked when it
is called and to drop the lock before it returns. If this proves too
complex we can change sysctl_out_proc() to simply acquire the lock at
the very end and have the calling code drop the lock right after it
returns.
- Lock the process we are going to export before the p_cansee() in the
loop in sysctl_kern_proc() and hold the lock until we call
sysctl_out_proc().
- Don't call p_cansee() on the process about to be exported twice in
the aforementioned loop.


# a30d7c60 06-Apr-2002 Jake Burkholder <jake@FreeBSD.org>

Use CTASSERT rather than a runtime check to detect kinfo_proc size changes.
Remove the ugly yuck code to busy wait for 20 seconds.


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 182da820 01-Apr-2002 Matthew Dillon <dillon@FreeBSD.org>

Stage-2 commit of the critical*() code. This re-inlines cpu_critical_enter()
and cpu_critical_exit() and moves associated critical prototypes into their
own header file, <arch>/<arch>/critical.h, which is only included by the
three MI source files that need it.

Backout and re-apply improperly comitted syntactical cleanups made to files
that were still under active development. Backout improperly comitted program
structure changes that moved localized declarations to the top of two
procedures. Partially re-apply one of the program structure changes to
move 'mask' into an intermediate block rather then in three separate
sub-blocks to make the code more readable. Re-integrate bug fixes that Jake
made to the sparc64 code.

Note: In general, developers should not gratuitously move declarations out
of sub-blocks. They are where they are for reasons of structure, grouping,
readability, compiler-localizability, and to avoid developer-introduced bugs
similar to several found in recent years in the VFS and VM code.

Reviewed by: jake


# 7b11fea6 31-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Close some holes with p->p_args by NULL'ing out the p->p_args pointer
while holding the proc lock, and by holding the pargs structure when
accessing it from outside of the owner.

Submitted by: Jonathan Mini <mini@haikugeek.com>


# c1508b28 28-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

To remove nested include of sys/lock.h and sys/mutex.h from sys/proc.h
make the pargs_* functions into non-inlines in kern/kern_proc.c.

Requested by: bde


# 8899023f 27-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Make the reference counting of 'struct pargs' SMP safe.

There is still some locations where the PROC lock should be held
in order to prevent inconsistent views from outside (like the
proc->p_fd fix for kern/vfs_syscalls.c:checkdirs()) that can be
fixed later.

Submitted by: Jonathan Mini <mini@haikugeek.com>


# f22a4b62 27-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Add a new mtx_init option "MTX_DUPOK" which allows duplicate acquires of locks
with this flag. Remove the dup_list and dup_ok code from subr_witness. Now
we just check for the flag instead of doing string compares.

Also, switch the process lock, process group lock, and uma per cpu locks over
to this interface. The original mechanism did not work well for uma because
per cpu lock names are unique to each zone.

Approved by: jhb


# e6bbfd40 27-Mar-2002 Matthew Dillon <dillon@FreeBSD.org>

oops, forgot to commit this. td->td_savecrit = 0 replaced by API
call cpu_thread_link().


# f2a79bb9 26-Mar-2002 Jake Burkholder <jake@FreeBSD.org>

Make this compile.

Pointy hat to: dillon


# 70f52b48 23-Mar-2002 Bruce Evans <bde@FreeBSD.org>

Fixed some style bugs in the removal of __P(()). The main ones were
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# f591779b 23-Feb-2002 Seigo Tanimura <tanimura@FreeBSD.org>

Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)


# 1cbb9c3b 22-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Convert p->p_runtime and PCPU(switchtime) to bintime format.


# fd21c2b5 20-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Oops, used wrong error value for unimplemented syscalls.


# c28841c1 18-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Add stub syscalls and definitions for KSE calls.
"Book'em Danno"


# 96347d1e 11-Feb-2002 Alan Cox <alc@FreeBSD.org>

The previous commit included a change to fill_kinfo_proc() that results
in a NULL pointer dereference. Repair this mistake.


# 2c100766 11-Feb-2002 Julian Elischer <julian@FreeBSD.org>

In a threaded world, differnt priorirites become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


# de9ac44a 07-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Fix a fatal trap when using ksched_setscheduler() (eg: mozilla, netscape
etc) which use: td->td_last_kse->ke_flags |= KEF_NEEDRESCHED;


# 045e8541 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

remove superfluous blank line


# 2b8a08af 07-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Fix a couple of style bugs introduced (or touched by) previous commit.


# 079b7bad 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# b8e6bf1e 05-Jan-2002 John Baldwin <jhb@FreeBSD.org>

Fix a bug where the mutex name wasn't always displayed for processes in
SMTX in utils such as ps and top. The KI_CTTY flag was assigned to
kinfo_proc->ki_kiflag rather than or'd into the flag, thus clobbering
any flags set earlier, including KI_MTXBLOCK.

Prodding by: peter


# 00f13cb3 13-Nov-2001 John Baldwin <jhb@FreeBSD.org>

As a followup to the previous fixes to inferior, revert some of the
changes in 1.80 that were needed for locking that are no longer needed now
that a lock is simply asserted.

Submitted by: bde


# 5b29d6e9 12-Nov-2001 John Baldwin <jhb@FreeBSD.org>

Clean up breakage in inferior() I introduced in 1.92 of kern_proc.c:
- Restore inferior() to being iterative rather than recursive.
- Assert that the proctree_lock is held in inferior() and change the one
caller to get a shared lock of it. This also ensures that we hold the
lock after performing the check so the check can't be made invalid out
from under us after the check but before we act on it.

Requested by: bde


# 8a7d8cc6 09-Oct-2001 Robert Watson <rwatson@FreeBSD.org>

- Combine kern.ps_showallprocs and kern.ipc.showallsockets into
a single kern.security.seeotheruids_permitted, describes as:
"Unprivileged processes may see subjects/objects with different real uid"
NOTE: kern.ps_showallprocs exists in -STABLE, and therefore there is
an API change. kern.ipc.showallsockets does not.
- Check kern.security.seeotheruids_permitted in cr_cansee().
- Replace visibility calls to socheckuid() with cr_cansee() (retain
the change to socheckuid() in ipfw, where it is used for rule-matching).
- Remove prison_unpcb() and make use of cr_cansee() against the UNIX
domain socket credential instead of comparing root vnodes for the
UDS and the process. This allows multiple jails to share the same
chroot() and not see each others UNIX domain sockets.
- Remove unused socheckproc().

Now that cr_cansee() is used universally for socket visibility, a variety
of policies are more consistently enforced, including uid-based
restrictions and jail-based restrictions. This also better-supports
the introduction of additional MAC models.

Reviewed by: ps, billf
Obtained from: TrustedBSD Project


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# e414d9aa 10-Sep-2001 Peter Wemm <peter@FreeBSD.org>

Add on UPAGES to ki_rssize since it is there as result of the process
and can be swapped out with the process.


# 0ecd57ad 16-Aug-2001 Peter Wemm <peter@FreeBSD.org>

Fix part of another problem that bde pointed out. This is different
to what bde suggested though.


# 5a66a253 16-Aug-2001 Peter Wemm <peter@FreeBSD.org>

Remove redundant null-termination. The buffer is already explicitly
zeroed, and we intentionally leave -1 on the strncpy length to leave
the original \0.

Submitted by: bde


# 77330eeb 16-Aug-2001 Peter Wemm <peter@FreeBSD.org>

Use the backwards compatability mechanisms so that ps/top etc dont have
unnecessary breakage.

While here, use explicit sizes for the string fields so that we dont
have unintentional changes again in the future when key tunables change.

This still is not quite right, but a june userland is happy with
a -current kernel with these tweaks.


# a0f75161 05-Jul-2001 Robert Watson <rwatson@FreeBSD.org>

o Replace calls to p_can(..., P_CAN_xxx) with calls to p_canxxx().
The p_can(...) construct was a premature (and, it turns out,
awkward) abstraction. The individual calls to p_canxxx() better
reflect differences between the inter-process authorization checks,
such as differing checks based on the type of signal. This has
a side effect of improving code readability.
o Replace direct credential authorization checks in ktrace() with
invocation of p_candebug(), while maintaining the special case
check of KTR_ROOT. This allows ktrace() to "play more nicely"
with new mandatory access control schemes, as well as making its
authorization checks consistent with other "debugging class"
checks.
o Eliminate "privused" construct for p_can*() calls which allowed the
caller to determine if privilege was required for successful
evaluation of the access control check. This primitive is currently
unused, and as such, serves only to complicate the API.

Approved by: ({procfs,linprocfs} changes) des
Obtained from: TrustedBSD Project


# fbd26f75 20-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Fix some lock order reversals where we called free() while holding a proc
lock. We now use temporary variables to save the process argument pointer
and just update the pointer while holding the lock. We then perform the
free on the cached pointer after releasing the lock.


# b1fc0ec1 25-May-2001 Robert Watson <rwatson@FreeBSD.org>

o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
original macro that pointed.
p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 33a9ed9d 23-Apr-2001 John Baldwin <jhb@FreeBSD.org>

Change the pfind() and zpfind() functions to lock the process that they
find before releasing the allproc lock and returning.

Reviewed by: -smp, dfr, jake


# 1005a129 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Convert the allproc and proctree locks from lockmgr locks to sx locks.


# d1cadeb0 27-Mar-2001 David Malone <dwmalone@FreeBSD.org>

Don't leak the memory we've just malloced if we can't find the
process we're looking for. (I don't think this can currently
happen, but it depends how the function is called).

PR: 25932
Submitted by: David Xu <davidx@viasoft.com.cn>


# 393d77ff 06-Mar-2001 Kirk McKusick <mckusick@FreeBSD.org>

Bitch more loudly when someone botches changes to kinfo_proc
in the hopes that they will actually *read* the comment above
it and *follow* the instructions so as to cause all the rest
of us less a lot less grief.


# 15e9ec51 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Proc locking including using proc lock in place of proctree where
appropriate and locking processes while we signal them.


# 91421ba2 20-Feb-2001 Robert Watson <rwatson@FreeBSD.org>

o Move per-process jail pointer (p->pr_prison) to inside of the subject
credential structure, ucred (cr->cr_prison).
o Allow jail inheritence to be a function of credential inheritence.
o Abstract prison structure reference counting behind pr_hold() and
pr_free(), invoked by the similarly named credential reference
management functions, removing this code from per-ABI fork/exit code.
o Modify various jail() functions to use struct ucred arguments instead
of struct proc arguments.
o Introduce jailed() function to determine if a credential is jailed,
rather than directly checking pointers all over the place.
o Convert PRISON_CHECK() macro to prison_check() function.
o Move jail() function prototypes to jail.h.
o Emulate the P_JAILED flag in fill_kinfo_proc() and no longer set the
flag in the process flags field itself.
o Eliminate that "const" qualifier from suser/p_can/etc to reflect
mutex use.

Notes:

o Some further cleanup of the linux/jail code is still required.
o It's now possible to consider resolving some of the process vs
credential based permission checking confusion in the socket code.
o Mutex protection of struct prison is still not present, and is
required to protect the reference count plus some fields in the
structure.

Reviewed by: freebsd-arch
Obtained from: TrustedBSD Project


# d5a08a60 11-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propogated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propogate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at apprioriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


# 1aa97cde 09-Feb-2001 John Baldwin <jhb@FreeBSD.org>

Work around some sizeof(long) != sizeof(int) bogons.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 6d3e7b9b 23-Jan-2001 John Baldwin <jhb@FreeBSD.org>

Add a new item to kinfo_proc: ki_sflag to mirror p_sflag.


# 42a4ed99 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Proc locking.
- Catch up to proc flag changes.
- Reorder the way we get things in fill_kinfoproc() to minimize the
number of locking operations.


# b947e934 13-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Use sched_lock to prevent the mutex name from changing out from under us
while we are copying it to the kinfo_proc structure.
- Test against p_stat to see if we are blocked on a mutex.
- Terminate ki_mtxname with a null char rather than ki_wmesg.


# 98f03f90 23-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

Protect proc.p_pptr and proc.p_children/p_sibling with the
proctree_lock.

linprocfs not locked pending response from informal maintainer.

Reviewed by: jhb, -smp@


# c0c25570 12-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
pased to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 1f7d2501 12-Dec-2000 Kirk McKusick <mckusick@FreeBSD.org>

Change the proc information returned from the kernel so that it
no longer contains kernel specific data structures, but rather
only scalar values and structures that are already part of the
kernel/user interface, specifically rusage and rtprio. It no
longer contains proc, session, pcred, ucred, procsig, vmspace,
pstats, mtx, sigiolst, klist, callout, pasleep, or mdproc. If
any of these changed in size, ps, w, fstat, gcore, systat, and
top would all stop working. The new structure has over 200 bytes
of unassigned space for future values to be added, yet is nearly
100 bytes smaller per entry than the structure that it replaced.


# 62ca2477 29-Nov-2000 John Baldwin <jhb@FreeBSD.org>

Save a copy of p_mtxname in e_mtxname when creating an eproc.


# 553629eb 22-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# f535380c 05-Sep-2000 Don Lewis <truckman@FreeBSD.org>

Remove uidinfo hash table lookup and maintenance out of chgproccnt() and
chgsbsize(), which are called rather frequently and may be called from an
interrupt context in the case of chgsbsize(). Instead, do the hash table
lookup and maintenance when credentials are changed, which is a lot less
frequent. Add pointers to the uidinfo structures to the ucred and pcred
structures for fast access. Pass a pointer to the credential to chgproccnt()
and chgsbsize() instead of passing the uid. Add a reference count to the
uidinfo structure and use it to decide when to free the structure rather
than freeing the structure when the resource consumption drops to zero.
Move the resource tracking code from kern_proc.c to kern_resource.c. Move
some duplicate code sequences in kern_prot.c to separate helper functions.
Change KASSERTs in this code to unconditional tests and calls to panic().


# 468ddc8a 31-Aug-2000 Brian Feldman <green@FreeBSD.org>

Casts are needed to subtract u_longs.

Submitted by: tor


# 387d2c03 29-Aug-2000 Robert Watson <rwatson@FreeBSD.org>

o Centralize inter-process access control, introducing:

int p_can(p1, p2, operation, privused)

which allows specification of subject process, object process,
inter-process operation, and an optional call-by-reference privused
flag, allowing the caller to determine if privilege was required
for the call to succeed. This allows jail, kern.ps_showallprocs and
regular credential-based interaction checks to occur in one block of
code. Possible operations are P_CAN_SEE, P_CAN_SCHED, P_CAN_KILL,
and P_CAN_DEBUG. p_can currently breaks out as a wrapper to a
series of static function checks in kern_prot, which should not
be invoked directly.

o Commented out capabilities entries are included for some checks.

o Update most inter-process authorization to make use of p_can() instead
of manual checks, PRISON_CHECK(), P_TRESPASS(), and
kern.ps_showallprocs.

o Modify suser{,_xxx} to use const arguments, as it no longer modifies
process flags due to the disabling of ASU.

o Modify some checks/errors in procfs so that ENOENT is returned instead
of ESRCH, further improving concealment of processes that should not
be visible to other processes. Also introduce new access checks to
improve hiding of processes for procfs_lookup(), procfs_getattr(),
procfs_readdir(). Correct a bug reported by bp concerning not
handling the CREATE case in procfs_lookup(). Remove volatile flag in
procfs that caused apparently spurious qualifier warnigns (approved by
bde).

o Add comment noting that ktrace() has not been updated, as its access
control checks are different from ptrace(), whereas they should
probably be the same. Further discussion should happen on this topic.

Reviewed by: bde, green, phk, freebsd-security, others
Approved by: bde
Obtained from: TrustedBSD Project


# 6aef685f 29-Aug-2000 Brian Feldman <green@FreeBSD.org>

Remove any possibility of hiwat-related race conditions by changing
the chgsbsize() call to use a "subject" pointer (&sb.sb_hiwat) and
a u_long target to set it to. The whole thing is splnet().

This fixes a problem that jdp has been able to provoke.


# 03f808c5 23-Aug-2000 Paul Saab <ps@FreeBSD.org>

Add a sysctl which hides all process except those that belong to
the user asking for the process list.

Reviewed by: peter


# 77978ab8 04-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# 82d9ae4e 03-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# 1a432a2f 23-Jun-2000 Dima Ruban <dima@FreeBSD.org>

Fix typo (inT -> int)


# c6362551 22-Jun-2000 Alfred Perlstein <alfred@FreeBSD.org>

fix races in the uidinfo subsystem, several problems existed:

1) while allocating a uidinfo struct malloc is called with M_WAITOK,
it's possible that while asleep another process by the same user
could have woken up earlier and inserted an entry into the uid
hash table. Having redundant entries causes inconsistancies that
we can't handle.

fix: do a non-waiting malloc, and if that fails then do a blocking
malloc, after waking up check that no one else has inserted an entry
for us already.

2) Because many checks for sbsize were done as "test then set" in a non
atomic manner it was possible to exceed the limits put up via races.

fix: instead of querying the count then setting, we just attempt to
set the count and leave it up to the function to return success or
failure.

3) The uidinfo code was inlining and repeating, lookups and insertions
and deletions needed to be in their own functions for clarity.

Reviewed by: green


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 9b6d9dba 08-Feb-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Also allow non-rot processes to setproctitle()

Submitted by: Paul Saab <paul@mu.org>
Approved by: jkh


# a8704f89 26-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Add a sysctl to control if argv is disclosed to the world:
kern.ps_argsopen
It defaults to 1 which means that all users can see all argvs in ps(1).

Reviewed by: Warner


# a9e0361b 21-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce the new function
p_trespass(struct proc *p1, struct proc *p2)
which returns zero or an errno depending on the legality of p1 trespassing
on p2.

Replace kern_sig.c:CANSIGNAL() with call to p_trespass() and one
extra signal related check.

Replace procfs.h:CHECKIO() macros with calls to p_trespass().

Only show command lines to process which can trespass on the target
process.


# e29e202c 16-Nov-1999 Peter Wemm <peter@FreeBSD.org>

Add e_stats (p->p_stats, from struct user->u_stats) to eproc so it's
fetchable via sysctl. This saves ps having to read the u-area for stats.
Be sure to recompile libkvm, ps, w, top and the usual suspects.


# b9df5231 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce commandline caching in the kernel.

This fixes some nasty procfs problems for SMP, makes ps(1) run much faster,
and makes ps(1) even less dependent on /proc which will aid chroot and
jails alike.

To disable this facility and revert to previous behaviour:
sysctl -w kern.ps_arg_cache_limit=0

For full details see the current@FreeBSD.org mail-archives.


# 1b727751 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Commit the remaining part of PR14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 56f0bef7 24-Oct-1999 Brian Feldman <green@FreeBSD.org>

Remove a KASSERT() that has fulfilled its purpose. Note that it did
cause problems by tripping on shutdown (reboot(), not the socket
operation :). Cause is still uncertain, but the panic isn't really
necessary here.


# ecf72308 09-Oct-1999 Brian Feldman <green@FreeBSD.org>

Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# f33a7ade 18-Aug-1999 Peter Wemm <peter@FreeBSD.org>

Run queue heads have moved to TAILQ's.


# 68f7448f 17-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Reverse the sense of a test, dev2udev() will be much cheaper than
udev2dev().


# 6a0ce002 17-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Use NOUDEV for udev_t's


# f40ddd55 17-May-1999 Doug Rabson <dfr@FreeBSD.org>

Change the definition of e_tdev in struct kinfo_proc from dev_t to udev_t

Reviewed by: Poul-Henning Kamp <phk@critter.freebsd.dk>


# bfbb9ce6 11-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


# dfd5dee1 06-May-1999 Peter Wemm <peter@FreeBSD.org>

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


# 3d177f46 03-May-1999 Bill Fumerola <billf@FreeBSD.org>

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 75c13541 28-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no privisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/


# b1028ad1 19-Feb-1999 Luoqi Chen <luoqi@FreeBSD.org>

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillion <dillon@apollo.backplane.com>


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 88c5ea45 25-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Enable Linux threads support by default.
This takes the conditionals out of the code that has been tested by
various people for a while.
ps and friends (libkvm) will need a recompile as some proc structure
changes are made.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# d8c85307 12-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Re-enable the options in ps(1) that were disabled with the Linux
threads support.

Submitted by: "Richard Seaman, Jr." <dick@tar.com>


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# 62d6ce3a 11-Nov-1998 Don Lewis <truckman@FreeBSD.org>

I got another batch of suggestions for cosmetic changes from bde.


# 831d27a9 11-Nov-1998 Don Lewis <truckman@FreeBSD.org>

Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind


# 643a8daa 09-Nov-1998 Don Lewis <truckman@FreeBSD.org>

If the session leader dies, s_leader is set to NULL and getsid() may
dereference a NULL pointer, causing a panic. Instead of following
s_leader to find the session id, store it in the session structure.

Jukka found the following info:

BTW - I just found what I have been looking for. Std 1003.1
Part 1: SYSTEM API [C LANGUAGE] section 2.2.2.80 states quite
explicitly...

Session lifetime: The period between when a session is created
and the end of lifetime of all the process groups that remain
as members of the session.

So, this quite clearly tells that while there is any single
process in any process group which is a member of the session,
the session remains as an independent entity.

Reviewed by: peter
Submitted by: "Jukka A. Ukkonen" <jau@jau.tmt.tele.fi>


# ac1e407b 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# 876a94ee 20-Feb-1998 Bruce Evans <bde@FreeBSD.org>

Staticized.

Don't depend on "implicit int".


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 5abb66d2 01-Feb-1998 John Dyson <dyson@FreeBSD.org>

Return the vm_map in the eproc structure, so we can support more accurate
VSZ display in PS.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 55166637 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# 1fd0b058 02-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# b747f8bc 27-Jun-1997 Tor Egge <tegge@FreeBSD.org>

Fill in some extra fields in the eproc structure. gdb uses this information
to determine where the data segment in core dumps should be mapped.
Reviewed by: Peter Wemm <peter@spinner.dialix.com.au>


# fce002fd 24-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Don't include <sys/ioctl.h> in the kernel. Stage 1: don't include
it when it is not used. In most cases, the reasons for including it
went away when the special ioctl headers became self-sufficient.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 831031ce 14-Sep-1996 Bruce Evans <bde@FreeBSD.org>

Attached simple external ddb commands `show rtc', `show pgrpdump'
and `show cbstat'. The pgrpdump code was previously controlled by
`#ifdef DEBUG'.


# 5c17ec63 09-Jul-1996 Garrett Wollman <wollman@FreeBSD.org>

Quiet a couple of -Wunused warnings.


# c23670e2 11-Jun-1996 Gary Palmer <gpalmer@FreeBSD.org>

Clean up -Wunused warnings.

Reviewed by: bde


# 3ce93e4e 06-Jun-1996 Poul-Henning Kamp <phk@FreeBSD.org>

Fix the same problem that davidg fixed in -stable some days ago and
restructure sysctl stuff a bit. KERN_PROC_PID now uses pfind().


# cd73303c 29-May-1996 David Greenman <dg@FreeBSD.org>

Fix a panic caused by (proc)->p_session being dereferenced for a process
that was exiting.


# 92ff4bb0 07-Apr-1996 Bruce Evans <bde@FreeBSD.org>

Declared pgrpdump() properly.


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# b75356e1 10-Mar-1996 Jeffrey Hsu <hsu@FreeBSD.org>

From Lite2: proc LIST changes.
Reviewed by: david & bde


# 8735cebc 01-Jan-1996 Peter Wemm <peter@FreeBSD.org>

fill in kinfo_eproc.e_login - otherwise a sysctl to read the eprocs wont
get the login names, and "ps -ax -O login" will return an empty column
under the login name.


# 87b6de2b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# 98d93822 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed function declarations and/or added prototypes.


# 972f9b20 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Hmm, I seem to have got all my patches screwed up anyway. Too bad.
this is where the proctable stuff went.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# d93f860c 09-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Cosmetics. related to getting prototypes into view.


# 35c10d22 09-Oct-1994 David Greenman <dg@FreeBSD.org>

Got rid of map.h. It's a leftover from the rmap code, and we use rlists.
Changed swapmap into swaplist.


# 7216391e 01-Oct-1994 David Greenman <dg@FreeBSD.org>

"idle priority" support. Based on code from Henrik Vestergaard Draboel,
but substantially rewritten by me.


# bb56ec4a 25-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# e8fb0b2c 31-Aug-1994 David Greenman <dg@FreeBSD.org>

Realtime priority scheduling support.

Submitted by: Henrik Vestergaard Draboel


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources