History log of /freebsd-current/sys/kern/kern_cpuset.c
Revision Date Author Comments
# 96c8b3e5 06-Apr-2024 Jake Freeland <jfree@FreeBSD.org>

ktrace: Record cpuset violations with KTR_CAPFAIL

Report Capsicum violations in the cpuset namespace with CAPFAIL_CPUSET.

Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40677


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 2058f075 30-Jan-2023 Dmitry Chagin <dchagin@FreeBSD.org>

cpuset: Handle CPU_WHICH_TIDPID wherever cpuset_which() is called.

cpuset_which() resolves the argument pair which and id and returns references
to an appropriate resources. To avoid leaking resources or accessing unresolved
references to a resources handle new which CPU_WHICH_TIDPID wherever
cpuset_which() is called.
To avoid code duplication cpuset_which2() has been added.

Reported by: syzbot+331e8402e0f7347f0f2a@syzkaller.appspotmail.com
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D38272
MFC after: 2 weeks


# c21b080f 29-Jan-2023 Dmitry Chagin <dchagin@FreeBSD.org>

cpuset: Fix sched_[g|s]etaffinity() for better compatibility with Linux.

Under Linux to sched_[g|s]etaffinity() functions the value returned from a call
to gettid(2) (thread id) can be passed in the argument pid. Specifying pid as 0
will set the attribute for the calling thread, and passing the value returned
from a call to getpid(2) (process id) will set the attribute for the main thread
of the thread group.

Native cpuset(2) family of system calls has "which" argument to determine how
the value of id argument is interpreted, i.e., CPU_WHICH_TID is used to pass
a thread id and CPU_WHICH_PID - to pass a process id.

For now native sched_[g|s]etaffinity() implementation is wrong as uses "which"
CPU_WHICH_PID to pass both (process and thread id) to the kernel. To fix this
adding a new "which" CPU_WHICH_TIDPID intended to handle both id's.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D38209
MFC after: 1 week


# 01f74ccd 29-Jan-2023 Dmitry Chagin <dchagin@FreeBSD.org>

libthr: Fix pthread_attr_[g|s]etaffinity_np to match it's manual and the kernel.

Since f35093f8 semantics of a thread affinity functions is changed to be a
compatible with Linux:

In case of getaffinity(), the minimum cpuset_t size that the kernel permits is
the maximum CPU id, present in the system, / NBBY bytes, the maximum size is not
limited.
In case of setaffinity(), the kernel does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where the upper
bound is the maximum CPU id, present in the system, no larger than the size of
the kernel cpuset_t.

To match pthread_attr_[g|s]etaffinity_np checks of the user-provided cpusets to
the kernel behavior export the minimum cpuset_t size allowed by running kernel
via new sysctl kern.sched.cpusetsizemin and use it in checks.

Reviewed by:
Differential Revision: https://reviews.freebsd.org/D38112
MFC after: 1 week


# db79bf75 03-Oct-2022 Alfredo Dal'Ava Junior <alfredo@FreeBSD.org>

powerpc: cpuset: add local functions for copyin/copyout

Add local functions to workaround an instruction segment trap (panic)
when the indirect functions copyin and copyout are called by an external
loadable kernel module (i.e. pfsync, zfs and linuxulator). The crash
was triggered by change 47a57144af25a7bd768b29272d50a36fdf2874ba, but
kernel binary linked with LLD 9 works fine. LLVM bisect points that LLD
behavior chaged after dc06b0bc9ad055d06535462d91bfc2a744b2f589.

This is know to affect powerpc targets only and the final fix is still
being discussed with the LLVM community.

PR: 266730
Reviewed by: luporl, jhibbits (on IRC, previous version)
MFC after: 2 days
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D36234


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# d46174cd 28-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Finish cpuset_getaffinity() after f35093f8

Split cpuset_getaffinity() into a two counterparts, where the
user_cpuset_getaffinity() is intended to operate on the cpuset_t from
user va, while kern_cpuset_getaffinity() expects the cpuset from kernel
va.
Accordingly, the code that clears the high bits is moved to the
user_cpuset_getaffinity(). Linux sched_getaffinity() syscall returns
the size of set copied to the user-space and then glibc wrapper clears
the high bits.

MFC after: 2 weeks


# 4a3e5133 20-May-2022 Mark Johnston <markj@FreeBSD.org>

cpuset: Fix the KASAN and KMSAN builds

Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.

Reported by: syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes: 47a57144af25 ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by: The FreeBSD Foundation


# 47a57144 12-May-2022 Justin Hibbits <jhibbits@FreeBSD.org>

cpuset: Byte swap cpuset for compat32 on big endian architectures

Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits. On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode. To
demonstrate:

32-bit mode:

BIT_SET(foo, 0): 0x00000001

64-bit sees: 0x0000000100000000

cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.

Reviewed by: kib
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D35225


# 586ed321 11-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

kdump: Decode cpuset_t.

Reviewed by: jhb
Differential revision: https://reviews.freebsd.org/D34982
MFC after: 2 weeks


# f35093f8 11-May-2022 Dmitry Chagin <dchagin@FreeBSD.org>

Use Linux semantics for the thread affinity syscalls.

Linux has more tolerant checks of the user supplied cpuset_t's.

Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.

Reviewed by: Pau Amma (man pages)
In collaboration with: jhb
Differential revision: https://reviews.freebsd.org/D34849
MFC after: 2 weeks


# e2650af1 29-Dec-2021 Stefan Eßer <se@FreeBSD.org>

Make CPU_SET macros compliant with other implementations

The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.

Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.

The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).

The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.

This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.

One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.

Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.

The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.

Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D33451


# 29bb6c19 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

domainset: Define additional global policies

Add global definitions for first-touch and interleave policies. The
former may be useful for UMA, which implements a similar policy without
using domainset iterators.

No functional change intended.

Reviewed by: mav
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29104


# 60c4ec80 26-Feb-2021 Kyle Evans <kevans@FreeBSD.org>

jail: allow root to implicitly widen its cpuset to attach

The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.

This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.

With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.

A test has been added to demonstrate the issue; cpuset of a process
down to just the first CPU and attempting to attach to a jail without
access to any of the same CPUs previously resulted in EDEADLK and now
results in taking on the jail's mask for privileged users.

PR: 253724
Reviewed by: jamie (also discussed with)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28952


# 54a837c8 18-Dec-2020 Kyle Evans <kevans@FreeBSD.org>

kern: cpuset: allow jails to modify child jails' roots

This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.

As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.

Reviewed by: jamie
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27352


# f1b18a66 08-Dec-2020 Kyle Evans <kevans@FreeBSD.org>

cpuset_set{affinity,domain}: do not allow empty masks

cpuset_modify() would not currently catch this, because it only checks that
the new mask is a subset of the root set and circumvents the EDEADLK check
in cpuset_testupdate().

This change both directly validates the mask coming in since we can
trivially detect an empty mask, and it updates cpuset_testupdate to catch
stuff like this going forward by always ensuring we don't end up with an
empty mask.

The check_mask argument has been renamed because the 'check' verbiage does
not imply to me that it's actually doing a different operation. We're either
augmenting the existing mask, or we are replacing it entirely.

Reported by: syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com
Discussed with: andrew
Reviewed by: andrew, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27511


# b2780e85 08-Dec-2020 Kyle Evans <kevans@FreeBSD.org>

kern: cpuset: resolve race between cpuset_lookup/cpuset_rel

The race plays out like so between threads A and B:

1. A ref's cpuset 10
2. B does a lookup of cpuset 10, grabs the cpuset lock and searches
cpuset_ids
3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock
while B is still searching and not yet ref'd
4. B ref's cpuset 10 and drops the cpuset lock
5. A proceeds to free the cpuset out from underneath B

Resolve the race by only releasing the last reference under the cpuset lock.
Thread A now picks up the spinlock and observes that the cpuset has been
revived, returning immediately for B to deal with later.

Reported by: syzbot+92dff413e201164c796b@syzkaller.appspotmail.com
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27498


# 9c83dab9 08-Dec-2020 Kyle Evans <kevans@FreeBSD.org>

kern: cpuset: plug a unr leak

cpuset_rel_defer() is supposed to be functionally equivalent to
cpuset_rel() but with anything that might sleep deferred until
cpuset_rel_complete -- this setup is used specifically for cpuset_setproc.

Add in the missing unr free to match cpuset_rel. This fixes a leak that
was observed when I wrote a small userland application to try and debug
another issue, which effectively did:

cpuset(&newid);
cpuset(&scratch);

newid gets leaked when scratch is created; it's off the list, so there's
no mechanism for anything else to relinquish it. A more realistic reproducer
would likely be a process that inherits some cpuset that it's the only ref
for, but it creates a new one to modify. Alternatively, administratively
reassigning a process' cpuset that it's the last ref for will have the same
effect.

Discovered through D27498.

MFC after: 1 week


# e07e3fa3 27-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

kern: cpuset: drop the lock to allocate domainsets

Restructure the loop a little bit to make it a little more clear how it
really operates: we never allocate any domains at the beginning of the first
iteration, and it will run until we've satisfied the amount we need or we
encounter an error.

The lock is now taken outside of the loop to make stuff inside the loop
easier to evaluate w.r.t. locking.

This fixes it to not try and allocate any domains for the freelist under the
spinlock, which would have happened before if we needed any new domains.

Reported by: syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com
Reviewed by: markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D27371


# d431dea5 24-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

kern: cpuset: properly rebase when attaching to a jail

The current logic is a fine choice for a system administrator modifying
process cpusets or a process creating a new cpuset(2), but not ideal for
processes attaching to a jail.

Currently, when a process attaches to a jail, it does exactly what any other
process does and loses any mask it might have applied in the process of
doing so because cpuset_setproc() is entirely based around the assumption
that non-anonymous cpusets in the process can be replaced with the new
parent set.

This approach slightly improves the jail attach integration by modifying
cpuset_setproc() callers to indicate if they should rebase their cpuset to
the indicated set or not (i.e. cpuset_setproc_update_set).

If we're rebasing and the process currently has a cpuset assigned that is
not the containing jail's root set, then we will now create a new base set
for it hanging off the jail's root with the existing mask applied instead of
using the jail's root set as the new base set.

Note that the common case will be that the process doesn't have a cpuset
within the jail root, but the system root can freely assign a cpuset from
a jail to a process outside of the jail with no restriction. We assume that
that may have happened or that it could happen due to a race when we drop
the proc lock, so we must recheck both within the loop to gather up
sufficient freed cpusets and after the loop.

To recap, here's how it worked before in all cases:

0 4 <-- jail 0 4 <-- jail / process
| |
1 -> 1
|
3 <-- process

Here's how it works now:

0 4 <-- jail 0 4 <-- jail
| | |
1 -> 1 5 <-- process
|
3 <-- process

or

0 4 <-- jail 0 4 <-- jail / process
| |
1 <-- process -> 1

More importantly, in both cases, the attaching process still retains the
mask it had prior to attaching or the attach fails with EDEADLK if it's
left with no CPUs to run on or the domain policy is incompatible. The
author of this patch considers this almost a security feature, because a MAC
policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's
restricted to some subset of available CPUs the ability to attach to a jail,
which might lift the user's restrictions if they attach to a jail with a
wider mask.

In most cases, it's anticipated that admins will use this to be able to,
for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`,
and avoid the need for contortions to spawn a command inside a jail with a
more limited cpuset than the jail.

Reviewed by: jamie
MFC after: 1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D27298


# 30b7c6f9 24-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

kern: cpuset: rename _cpuset_create() to cpuset_init()

cpuset_init() is better descriptor for what the function actually does. The
name was previously taken by a sysinit that setup cpuset_zero's mask
from all_cpus, it was removed in r331698 before stable/12 branched.

A comment referencing the removed sysinit has now also been removed, since
the setup previously done was moved into cpuset_thread0().

Suggested by: markj
MFC after: 1 week


# 29d04ea8 24-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

kern: cpuset: allow cpuset_create() to take an allocated *setp

Currently, it must always allocate a new set to be used for passing to
_cpuset_create, but it doesn't have to. This is purely kern_cpuset.c
internal and it's sparsely used, so just change it to use *setp if it's
not-NULL and modify the two consumers to pass in the address of a NULL
cpuset.

This paves the way for consumers that want the unr allocation without the
possibility of sleeping as long as they've done their due diligence to
ensure that the mask will properly apply atop the supplied parent
(i.e. avoiding the free_unr() in the last failure path).

Reviewed by: jamie, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27297


# dac521eb 22-Nov-2020 Kyle Evans <kevans@FreeBSD.org>

cpuset_setproc: use the appropriate parent for new anonymous sets

As far as I can tell, this has been the case since initially committed in
2008. cpuset_setproc is the executor of cpuset reassignment; note this
excerpt from the description:

* 1) Set is non-null. This reparents all anonymous sets to the provided
* set and replaces all non-anonymous td_cpusets with the provided set.

However, reviewing cpuset_setproc_setthread() for some jail related work
unearthed the error: if tdset was not anonymous, we were replacing it with
`set`. If it was anonymous, then we'd rebase it onto `set` (i.e. copy the
thread's mask over and AND it with `set`) but give the new anonymous set
the original tdset as the parent (i.e. the base of the set we're supposed to
be leaving behind).

The primary visible consequences were that:

1.) cpuset_getid() following such assignment returns the wrong result, the
setid that we left behind rather than the one we joined.
2.) When a process attached to the jail, the base set of any anonymous
threads was a set outside of the jail.

This was initially bundled in D27298, but it's a minor fix that's fairly
easy to verify the correctness of.

A test is included in D27307 ("badparent"), which demonstrates the issue
with, effectively:

osetid = cpuset_getid()
newsetid = cpuset()
cpuset_setaffinity(thread)
cpuset_setid(osetid)
cpuset_getid(thread) -> observe that it matches newsetid instead of osetid.

MFC after: 1 week


# 1a7bb896 16-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

cpuset: refcount-clean


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# 69b565d7 06-Jul-2020 Mark Johnston <markj@FreeBSD.org>

Allow accesses of the caller's CPU and domain sets in capability mode.

cpuset_(get|set)(affinity|domain)(2) permit a get or set of the calling
thread or process' CPU and domain set in capability mode, but only when
the thread or process ID is specified as -1. Extend this to cover the
case where the ID actually matches the caller's TID or PID, since some
code, such as our pthread_attr_get_np() implementation, always provides
an explicit ID.

It was not and still is not permitted to access CPU and domain sets for
other threads in the same process when the process is in capability
mode. This might change in the future.

Submitted by: Greg V <greg@unrelenting.technology> (original version)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D25552


# 9eb997cb 06-Jul-2020 Mark Johnston <markj@FreeBSD.org>

Lift cpuset Capsicum checks into a subroutine.

Otherwise the same checks are duplicated across four different system
call implementations, cpuset_(get|set)(affinity|domain)(). No
functional change intended.

MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 9825eadf 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

bitset: rename confusing macro NAND to ANDNOT

s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.

Discussed with: jeff
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22791


# 45cdd437 12-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Remove a redundant NULL pointer check in cpuset_modify_domain().

cpuset_getroot() is guaranteed to return a non-NULL pointer.

Reported by: Mark Millard <marklmi@yahoo.com>
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# d57cd5cc 05-Sep-2019 Stephen J. Kiernan <stevek@FreeBSD.org>

Bump up the low range of cpuset numbers to account for the kernel cpuset.

Reviewed by: jeff
Obtained from: Juniper Networks, Inc.


# 63cdd18e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Restrict the input domain set in cpuset_setdomain(2) to all_domains.

To permit larger values of MAXMEMDOM, which is currently 8 on amd64,
cpuset_setdomain(2) accepts a mask of size 256. In the kernel, domain
set masks are 64 bits wide, but can only represent a set of MAXMEMDOM
domains due to the use of the ds_order table.

Domain sets passed to cpuset_setdomain(2) are restricted to a subset
of their parent set, which is typically the root set, but before this
happens we modify the input set to exclude empty domains.
domainset_empty_vm() and other code which manipulates domain sets
expect the mask to be a subset of all_domains, so enforce that when
performing validation of cpuset_setdomain(2) parameters.

Reported and tested by: pho
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21477


# 44e4def7 27-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Remove an extraneous + 1 in _domainset_create().

DOMAINSET_FLS, like our fls(), is 1-indexed.

Reported by: alc
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# 8e697504 27-Aug-2019 Mark Johnston <markj@FreeBSD.org>

Fix several logic issues in domainset_empty_vm().

- Don't add 1 to the result of DOMAINSET_FLS.
- Do not modify domainsets containing only empty domains.
- Always flatten a _PREFER policy to _ROUNDROBIN if the preferred
domain is empty. Previously we were doing this only when ds_cnt > 1.

These bugs could cause hangs during boot if a VM domain is empty.

Tested by: hselasky
Reviewed by: hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21420


# 9978bd99 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Add malloc_domainset(9) and _domainset variants to other allocator KPIs.

Remove malloc_domain(9) and most other _domain KPIs added in r327900.
The new functions allow the caller to specify a general NUMA domain
selection policy, rather than specifically requesting an allocation from
a specific domain. The latter policy tends to interact poorly with
M_WAITOK, resulting in situations where a caller is blocked indefinitely
because the specified domain is depleted. Most existing consumers of
the _domain KPIs are converted to instead use a DOMAINSET_PREF() policy,
in which we fall back to other domains to satisfy the allocation
request.

This change also defines a set of DOMAINSET_FIXED() policies, which
only permit allocations from the specified domain.

Discussed with: gallatin, jeff
Reported and tested by: pho (previous version)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17418


# 920239ef 30-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Fix some problems that manifest when NUMA domain 0 is empty.

- In uma_prealloc(), we need to check for an empty domain before the
first allocation attempt, not after. Fix this by switching
uma_prealloc() to use a vm_domainset iterator, which addresses the
secondary issue of using a signed domain identifier in round-robin
iteration.
- Don't automatically create a page daemon for domain 0.
- In domainset_empty_vm(), recompute ds_cnt and ds_order after
excluding empty domains; otherwise we may frequently specify an empty
domain when calling in to the page allocator, wasting CPU time.
Convert DOMAINSET_PREF() policies for empty domains to round-robin.
- When freeing bootstrap pages, don't count them towards the per-domain
total page counts for now: some vm_phys segments are created before
the SRAT is parsed and are thus always identified as being in domain 0
even when they are not. Then, when bootstrap pages are freed, they
are added to a domain that we had previously thought was empty. Until
this is corrected, we simply exclude them from the per-domain page
count.

Reported and tested by: Rajesh Kumar <rajfbsd@gmail.com>
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17704


# b61f3142 22-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Make it possible to disable NUMA support with a tunable.

This provides a chicken switch for anyone negatively impacted by
enabling NUMA in the amd64 GENERIC kernel configuration. With
NUMA disabled at boot-time, information about the NUMA topology
is not exposed to the rest of the kernel, and all of physical
memory is viewed as coming from a single domain.

This method still has some performance overhead relative to disabling
NUMA support at compile time.

PR: 231460
Reviewed by: alc, gallatin, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17439


# 662e7fa8 20-Oct-2018 Mark Johnston <markj@FreeBSD.org>

Create some global domainsets and refactor NUMA registration.

Pre-defined policies are useful when integrating the domainset(9)
policy machinery into various kernel memory allocators.

The refactoring will make it easier to add NUMA support for other
architectures.

No functional change intended.

Reviewed by: alc, gallatin, jeff, kib
Tested by: pho (part of a larger patch)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17416


# 30c5525b 01-Oct-2018 Andrew Gallatin <gallatin@FreeBSD.org>

Allow empty NUMA memory domains to support Threadripper2

The AMD Threadripper 2990WX is basically a slightly crippled Epyc.
Rather than having 4 memory controllers, one per NUMA domain, it has
only 2 memory controllers enabled. This means that only 2 of the
4 NUMA domains can be populated with physical memory, and the
others are empty.

Add support to FreeBSD for empty NUMA domains by:

- creating empty memory domains when parsing the SRAT table,
rather than failing to parse the table
- not running the pageout deamon threads in empty domains
- adding defensive code to UMA to avoid allocating from empty domains
- adding defensive code to cpuset to avoid binding to an empty domain
Thanks to Jeff for suggesting this strategy.

Reviewed by: alc, markj
Approved by: re (gjb@)
Differential Revision: https://reviews.freebsd.org/D1683


# 70d66bcf 26-May-2018 Eric van Gyzen <vangyzen@FreeBSD.org>

kern_cpuset: fix small leak on error path

The "mask" was leaked on some error paths.

Reported by: Coverity
CID: 1384683
Sponsored by: Dell EMC


# a6c7423a 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

cpuset: revert and annotate instead


# 39eef2f4 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

cpuset_thread0: avoid unused assignment on non debug build


# e5818a53 28-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement several enhancements to NUMA policies.

Add a new "interleave" allocation policy which stripes pages across
domains with a stride or width keeping contiguity within a multi-page
region.

Move the kernel to the dedicated numbered cpuset #2 making it possible
to assign kernel threads and memory policy separately from user. This
also eliminates the need for the complicated interrupt binding code.

Add a sysctl API for viewing and manipulating domainsets. Refactor some
of the cpuset_t manipulation code using the generic bitset type so that
it can be used for both. This probably belongs in a dedicated subr file.

Attempt to improve the include situation.

Reviewed by: kib
Discussed with: jhb (cpuset parts)
Tested by: pho (before review feedback)
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14839


# 27a3c9d7 28-Mar-2018 Jeff Roberson <jeff@FreeBSD.org>

Restore r331606 with a bugfix to setup cpuset_domain[] earlier on all
platforms. Original commit message as follows:

Only use CPUs in the domain the device is attached to for default
assignment. Device drivers are able to override the default assignment
if they bind directly. There are severe performance penalties for
handling interrupts on remote CPUs and this should only be done in
very controlled circumstances.

Reviewed by: jhb, kib
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14838


# dd51fec3 08-Mar-2018 Brooks Davis <brooks@FreeBSD.org>

Copyout a whole int to cpuset_domain's policy pointer.

The previous code only copied 16-bits and corrupted the target int.

Reviewed by: kib, markj
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14611


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# a1d0659c 22-Aug-2017 Jung-uk Kim <jkim@FreeBSD.org>

Fix size to copyout(9) for cpuset_getid(2).

MFC after: 3 days


# f299c47b 23-May-2017 Allan Jude <allanjude@FreeBSD.org>

Allow cpuset_{get,set}affinity in capabilities mode

bhyve was recently sandboxed with capsicum, and needs to be able to
control the CPU sets of its vcpu threads

Reviewed by: emaste, oshogbo, rwatson
MFC after: 2 weeks
Sponsored by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D10170


# 29dfb631 03-May-2017 Conrad Meyer <cem@FreeBSD.org>

Extend cpuset_get/setaffinity() APIs

Add IRQ placement-only and ithread-only API variants. intr_event_bind
has been extended with sibling methods, as it has many more callsites in
existing code.

Reviewed by: kib@, adrian@ (earlier version)
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10586


# 6286dc78 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

Remove unneeded include of vm_phys.h.


# 96ee4310 05-Feb-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_cpuset_getaffinity() and kern_cpuset_getaffinity(),
and use it in compats instead of their sys_*() counterparts.

Reviewed by: kib, jhb, dchagin
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9383


# ea2ebdc1 31-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern_cpuset_getid() and kern_cpuset_setid(), and use them
in compat32 instead of their sub_*() counterparts.

Reviewed by: jhb@, kib@
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9382


# 62d70a81 09-Apr-2016 John Baldwin <jhb@FreeBSD.org>

Add more fine-grained kernel options for NUMA support.

VM_NUMA_ALLOC is used to enable use of domain-aware memory allocation in
the virtual memory system. DEVICE_NUMA is used to enable affinity
reporting for devices such as bus_get_domain().

MAXMEMDOM must still be set to a value greater than for any NUMA support
to be effective. Note that 'cpuset -gd' always works if MAXMEMDOM is
enabled and the system supports NUMA.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D5782


# 5bbb2169 25-Jun-2015 Adrian Chadd <adrian@FreeBSD.org>

Un-static cpuset_which() - it's useful in other contexts, such as some
CPU set operations in my upcoming NUMA work.

Tested/compiled:

* i386 (run)
* amd64 (run)
* mips (run)
* mips64 (run)
* armv6 (built)

Sponsored by: Norse Corp, Inc.


# 60aa2c85 14-May-2015 Jonathan Anderson <jonathan@FreeBSD.org>

Allow sizeof(cpuset_t) to be queried in capability mode.

This allows functions that retrieve and inspect pthread_attr_t objects to
work correctly: querying the cpuset_t size is part of querying CPU
affinity information, which is part of creating a complete pthread_attr_t.

Approved by: rwatson (mentor)
Reviewed by: pjd
Sponsored by: NSERC


# bbf686ed 08-Jan-2015 John Baldwin <jhb@FreeBSD.org>

Reject attempts to read the cpuset mask of a negative domain ID.


# c0ae6688 08-Jan-2015 John Baldwin <jhb@FreeBSD.org>

Create a cpuset mask for each NUMA domain that is available in the
kernel via the global cpuset_domain[] array. To export these to userland,
add a CPU_WHICH_DOMAIN level that can be used to fetch the mask for a
specific domain. Add a -d flag to cpuset(1) that can be used to fetch
the mask for a given domain.

Differential Revision: https://reviews.freebsd.org/D1232
Submitted by: jeff (kernel bits)
Reviewed by: adrian, jeff


# f0188618 21-Oct-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix multiple incorrect SYSCTL arguments in the kernel:

- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after: 3 days
Sponsored by: Mellanox Technologies


# 7f7528fc 15-Sep-2014 Adrian Chadd <adrian@FreeBSD.org>

Modify cpuset_setithread() to take a CPU ID as an integer, not a char.

We're going to end up having > 254 CPUs at some point.


# c1d9ecf2 13-Sep-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Fix error handling in cpuset_setithread() introduced in r267716.

Noted by: kib
MFC after: 1 week


# 81198539 22-Jun-2014 Alexander V. Chernikov <melifaro@FreeBSD.org>

Permit changing cpu mask for cpu set 1 in presence of drivers
binding their threads to particular CPU.

Changing ithread cpu mask is now performed by special cpuset_setithread().
It creates additional cpuset root group on first bind invocation.

No objection: jhb
Tested by: hiren
MFC after: 2 weeks
Sponsored by: Yandex LLC


# cd32bd7a 25-Jun-2013 John Baldwin <jhb@FreeBSD.org>

Several improvements to rmlock(9). Many of these are based on patches
provided by Isilon.
- Add an rm_assert() supporting various lock assertions similar to other
locking primitives. Because rmlocks track readers the assertions are
always fully accurate unlike rw_assert() and sx_assert().
- Flesh out the lock class methods for rmlocks to support sleeping via
condvars and rm_sleep() (but only while holding write locks), rmlock
details in 'show lock' in DDB, and the lc_owner method used by
dtrace.
- Add an internal destroyed cookie so that API functions can assert
that an rmlock is not destroyed.
- Make use of rm_assert() to add various assertions to the API (e.g.
to assert locks are held when an unlock routine is called).
- Give RM_SLEEPABLE locks their own lock class and always use the
rmlock's own lock_object with WITNESS.
- Use THREAD_NO_SLEEPING() / THREAD_SLEEPING_OK() to disallow sleeping
while holding a read lock on an rmlock.

Submitted by: andre
Obtained from: EMC/Isilon


# 17a27377 13-Jun-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a BIT_FFS() macro and use it to replace cpusetffs_obj()

Discussed with: attilio
Sponsored by: EMC / Isilon Storage Division


# c9813d0a 06-Jun-2013 John Baldwin <jhb@FreeBSD.org>

Do not compare the existing mask of a cpuset with a new mask when changing
the mask of a cpuset. Also, change the cpuset's mask before updating the
masks of all children. Previously changing a cpuset's mask first required
setting the mask to a super-set of both the old and new masks and then
changing it a second time to the new mask.


# d4a2ab8c 30-Aug-2012 Attilio Rao <attilio@FreeBSD.org>

Post r222812 KTR_CPUMASK started being initialized only as a tunable
handler and not more statically.

Unfortunately, it seems that this is not ideal for new platform bringup
and boot low level development (which needs ktr_cpumask to be effective
before tunables can be setup).

Because of this, add a way to statically initialize cpusets, by passing
an list of initializers, divided by commas. Also, provide a way to enforce
an all-set mask, for above mentioned initializers.

This imposes some differences on how KTR_CPUMASK is setup now as a
kernel option, and in particular this makes the words specifications
backward wrt. what is currently in -CURRENT. In order to avoid mismatches
between KTR_CPUMASK definition and other way to setup the mask
(tunable, sysctl) and to print it, change the ordering how
cpusetobj_print() and cpusetobj_scan() acquire the words belonging
to the set.
Please give a look to sys/conf/NOTES in order to understand how the
new format is supposed to work.

Also, ktr manpages will be updated shortly by gjb which volountereed
for this.

This patch won't be merged because it changes a POLA (at least
from the theoretical standpoint) and this is however a patch that
proves to be effective only in development environments.

Requested by: rpaulo
Reviewed by: jeff, rpaulo


# 2b69bb1f 05-Dec-2011 Kevin Lo <kevlo@FreeBSD.org>

Add a missing curly bracket


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# e3709597 31-May-2011 Attilio Rao <attilio@FreeBSD.org>

Fix KTR_CPUMASK in order to accept a string representing a cpuset_t.
This introduce all the underlying support for making this possible (via
the function cpusetobj_strscan() and keeps ktr_cpumask exported. sparc64
implements its own assembly primitives for tracing events and needs to
properly check it. Anyway the sparc64 logic is not implemented yet due
to lack of knowledge (by me) and time (by marius), but it is just a
matter of using ktr_cpumask when possible.

Tested and fixed by: pluknet
Reviewed by: marius


# d0984adc 31-May-2011 Attilio Rao <attilio@FreeBSD.org>

Revert a change that crept in during MFC.


# 217e1c0e 23-May-2011 Attilio Rao <attilio@FreeBSD.org>

Revert a patch that unvolountary sneaked in while I was MFCing.


# e3071102 22-May-2011 Attilio Rao <attilio@FreeBSD.org>

Merge r221912 from largeSMP project branch:
Fix a long-standing bug in cpuset_thread0() where only the first part
of cs_mask is set full.

Submitted by: anonymous
MFC after: 1 week


# 34a1e065 22-May-2011 Attilio Rao <attilio@FreeBSD.org>

Make cpusetobj_strprint() prepare the string in order to print the
least significant cpuset_t word at the outmost right part of the string
(more far from the beginning of it). This follows the natural build of
bits rappresentation in the words.


# f27aed53 14-May-2011 Attilio Rao <attilio@FreeBSD.org>

Fix a longstanding bug where only the first part of the cpumask was
correctly set full.

Submitted by: anonymous


# faa0e911 14-May-2011 Attilio Rao <attilio@FreeBSD.org>

Simplify the code here.

Submitted by: jhb


# 71a19bdc 05-May-2011 Attilio Rao <attilio@FreeBSD.org>

Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno


# e84c2db1 08-Mar-2011 John Baldwin <jhb@FreeBSD.org>

When constructing a new cpuset, apply the parent cpuset's mask to the new
set's mask rather than the root mask. This was causing the root mask to
be modified incorrectly.

Reviewed by: jeff
MFC after: 1 week


# 444528c0 31-Oct-2010 David Xu <davidxu@FreeBSD.org>

Use integer for size of cpuset, as it won't be bigger than INT_MAX,
This is requested by bge.
Also move the sysctl into file kern_cpuset.c, because it should
always be there, it is independent of thread scheduler.


# 4a547870 27-Oct-2010 David Xu <davidxu@FreeBSD.org>

- Revert r214409.
- Use long word to figure out sizeof kernel cpuset, hope it works.


# 1676b425 26-Oct-2010 David Xu <davidxu@FreeBSD.org>

If input parameter cpusetsize is zero, give userland size of cpuset mask
kernel is using.


# 42fe684c 25-Oct-2010 David Xu <davidxu@FreeBSD.org>

Use function tdfind() to find a thread.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 86855bf5 26-Oct-2009 John Baldwin <jhb@FreeBSD.org>

Another nit that both I and ispell missed.

Submitted by: Ben Kaduk minimarmot of gmail


# 93902625 26-Oct-2009 John Baldwin <jhb@FreeBSD.org>

Fix some spelling nits.


# 399645e1 23-Jun-2009 Jamie Gritton <jamie@FreeBSD.org>

Remove unnecessary/redundant includes.

Approved by: bz (mentor)


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 6aaa0b3c 28-Apr-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Prevent a superuser inside a jail from modifying the dedicated
root cpuset of that jail.
Processes inside the jail will still be able to change child sets.
A superuser outside of a jail will still be able to change the jail cpuset
and thus limit the number of cpus available to the jail.

Problem reported by: 000.fbsd@quip.cz (Miroslav Lachman)
PR: kern/134050
Reviewed by: jeff
MFC after: 3 weeks
X-MFC: backout r191596


# 47479a8c 22-Apr-2009 Bjoern A. Zeeb <bz@FreeBSD.org>

Correct a comment: the function name given had never existed in any
(relevant) version of this file orany of my patches.

MFC after: 1 month


# 413628a7 29-Nov-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# dea0ed66 07-Jul-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Add a `show cpusets' DDB command to print numbered root and
assigned CPU affinity sets.

Reviewed by: brooks


# 7a8f695a 07-Jul-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Move cpuset_refroot and cpuset_refbase functions up, grouping the
cpuset_ref* functions together. Will make it easier to read and
add code without forward declarations.
No functional changes.


# ba931c08 29-Jun-2008 Bjoern A. Zeeb <bz@FreeBSD.org>

Add a new priv 'PRIV_SCHED_CPUSET' to check if manipulating cpusets is
allowed and replace the suser() call. Do not allow it in jails.

Reviewed by: rwatson


# 887aedc6 26-May-2008 Konstantin Belousov <kib@FreeBSD.org>

Take into account possible overflow when multiplying. The casuality is
the malloc call later, panicing kernel due to the oversized allocation.

Reported by: pho
Reviewed by: jeff


# 9b33b154 10-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Add the interrupt vector number to intr_event_create so MI code can
lookup hard interrupt events by number. Ignore the irq# for soft intrs.
- Add support to cpuset for binding hardware interrupts. This has the
side effect of binding any ithread associated with the hard interrupt.
As per restrictions imposed by MD code we can only bind interrupts to
a single cpu presently. Interrupts can be 'unbound' by binding them
to all cpus.

Reviewed by: jhb
Sponsored by: Nokia


# 3bc8c68d 03-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Add a Nokia copyright to cpuset to reflect their generous
contribution to this work.


# a03ee000 30-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Consistently return EDEADLK when presented with a new set that is
incompatible with existing bindings.
- Try to copyout the setid in cpuset() before migrating the proc to the
setid in case the user has supplied a bad buffer.
- Rename cpuset_root() and cpuset_base() to cpuset_ref{root,base} to
be more descriptive and free cpuset_root to be used as a different
type of symbol.
- Make cpuset_root the cpuset_t set of all cpus in the system. This
should contain the same bitmask as all_cpus presently.
- Add a CPU_CMP() macro to compare two sets.


# 7f64829a 25-Mar-2008 Ruslan Ermilov <ru@FreeBSD.org>

Fixed type of the fourth argument of cpuset_{get,set}affinity(2) to be size_t.

Prodded by: davidxu


# 374ae2a3 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# c6440f72 06-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Add a missing unlock to cpuset_setaffinity(CPU_LEVEL_CPUSET, CPU_WHICH_PID)

Found by: gallatin


# 8bd75bdd 05-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Don't overwrite the recently allocated 'nset' in cpuset_setthread() by
passing it to cpuset_which(). Pass in 'set' instead. This argument
is not used but for convenience cpuset_which() nulls all incoming
parameters.

Submitted by: davidxu


# 73c40187 04-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Verify that when a user supplies a mask that is bigger than the kernel
mask none of the upper bits are set.
- Be more careful about enforcing the boundaries of masks and child sets.
- Introduce a few more CPU_* macros for implementing these tests.
- Change the cpusetsize argument to be bytes rather than bits to match
other apis.

Sponsored by: Nokia


# d7f687fc 02-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Add cpuset, an api for thread to cpu binding and cpu resource grouping
and assignment.
- Add a reference to a struct cpuset in each thread that is inherited from
the thread that created it.
- Release the reference when the thread is destroyed.
- Add prototypes for syscalls and macros for manipulating cpusets in
sys/cpuset.h
- Add syscalls to create, get, and set new numbered cpusets:
cpuset(), cpuset_{get,set}id()
- Add syscalls for getting and setting affinity masks for cpusets or
individual threads: cpuid_{get,set}affinity()
- Add types for the 'level' and 'which' parameters for the cpuset. This
will permit expansion of the api to cover cpu masks for other objects
identifiable with an id_t integer. For example, IRQs and Jails may be
coming soon.
- The root set 0 contains all valid cpus. All thread initially belong to
cpuset 1. This permits migrating all threads off of certain cpus to
reserve them for special applications.

Sponsored by: Nokia
Discussed with: arch, rwatson, brooks, davidxu, deischen
Reviewed by: antoine