History log of /freebsd-current/sys/kern/sched_4bsd.c
Revision Date Author Comments
# aeff15b3 09-Feb-2024 Olivier Certner <olce@FreeBSD.org>

sched: Simplify sched_lend_user_prio_cond()

If 'td_lend_user_pri' has the expected value, there is no need to check
the fields that sched_lend_user_prio() modifies, they either are already
good or soon will be ('td->td_lend_user_pri' has just been changed by
a concurrent update).

Reviewed by: kib
Approved by: emaste (mentor)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D44050


# 6a3c02bc 16-Jan-2024 Olivier Certner <olce@FreeBSD.org>

sched: sched_switch(): Factorize sleepqueue flags

Avoid duplicating common flags for the preempted and non-preempted
cases, making it clear that they are the same without resorting to
formatting.

No functional change.

Approved by: markj (mentor)
MFC after: 3 days
Sponsored by: The FreeBSD Foundation


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 1029dab6 09-Feb-2023 Mitchell Horne <mhorne@FreeBSD.org>

mi_switch(): clean up switch types and their usage

Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.

1. Ensure that every caller provides a type. In most cases, we upgrade
the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
will be documented there.

Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38184


# bff02948 09-Feb-2023 Mitchell Horne <mhorne@FreeBSD.org>

sched_4bsd: use the same switch flags as ULE

ULE uses the more specific SWT_REMOTEPREEMPT and SWT_REMOTEWAKEIDLE
switch types, let's do that here as well. SWT_PREEMPT is somewhat
redundant when we also have the SW_PREEMPT flag.

This only has an effect for kernels built with SCHED_STATS.

Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38183


# c2d27b0e 23-Sep-2022 Mark Johnston <markj@FreeBSD.org>

sched_4bsd: Fix a racy thread state modification

When a thread switching off-CPU is migrating to a remote CPU,
sched_switch() may trigger a rescheduling of the thread currently
running on that CPU. When doing so, it must ensure that that thread is
locked before modifying thread state. If the thread's lock is not the
scheduler lock, then the thread is in the process of switching off-CPU
and no extra effort is needed, and the initiator does not hold the
thread's lock and thus should not modify any thread state.

Reported and tested by: Steve Kargl
MFC after: 1 week


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# 40efe743 14-Jul-2022 John Baldwin <jhb@FreeBSD.org>

4bsd: Simplistic time-sharing for interrupt threads.

If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35645


# fea89a28 14-Jul-2022 John Baldwin <jhb@FreeBSD.org>

Add sched_ithread_prio to set the base priority of an interrupt thread.

Use it instead of sched_prio when setting the priority of an interrupt
thread.

Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35642


# 8758ac75 13-Apr-2022 John Baldwin <jhb@FreeBSD.org>

sched_4bsd: ts is only used in sched_bind for SMP.


# 72ff256c 12-Apr-2022 John Baldwin <jhb@FreeBSD.org>

sched_4bsd: Remove unused variables.


# ec3af9d0 01-Jan-2022 Stefan Eßer <se@FreeBSD.org>

sys/kern/sched_4bsd.c: fix typo introduced in previous commit


# a19bd8e3 01-Jan-2022 Stefan Eßer <se@FreeBSD.org>

Restore variable aliasing in the context of cpu set operations

A simplification of set operations removed side-effects of the
previous code, which are restored by this commit.


# e2650af1 29-Dec-2021 Stefan Eßer <se@FreeBSD.org>

Make CPU_SET macros compliant with other implementations

The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.

Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.

The differences between the FreeBSD version of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-adderess
scheme (2 source operands and a separately passed destination).

The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.

This patch set allows to unconditionally provide functions and macros
expected by 3rd party software written for GLIBC based systems, but
breaks builds of externally maintained sources that use any of the
following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.

One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.

Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT do
no longer require that option.

The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.

Reviewed by: kib
MFC after: 2 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D33451


# 6a8ea6d1 03-Nov-2021 Kyle Evans <kevans@FreeBSD.org>

sched: split sched_ap_entry() out of sched_throw()

sched_throw() can no longer take a NULL thread, APs enter through
sched_ap_entry() instead. This completely removes branching in the
common case and cleans up both paths. No functional change intended.

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D32829


# 589aed00 02-Nov-2021 Kyle Evans <kevans@FreeBSD.org>

sched: separate out schedinit_ap()

schedinit_ap() sets up an AP for a later call to sched_throw(NULL).

Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
setup in platforms' init_secondary(), but it has the wrong td_lock.
Typical platform AP startup procedure looks something like:

- Setup curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler

cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().

Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.

Reviewed by: alfredo, kib, markj
Triage help from: andrew, markj
Smoke-tested by: alfredo (ppc), kevans (arm64, x86), mhorne (arm)
Differential Revision: https://reviews.freebsd.org/D32797


# af29f399 28-Jul-2021 Dmitry Chagin <dchagin@FreeBSD.org>

umtx: Split umtx.h on two counterparts.

To prevent umtx.h polluting by future changes split it on two headers:
umtx.h - ABI header for userspace;
umtxvar.h - the kernel staff.

While here fix umtx_key_match style.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31248
MFC after: 2 weeks


# 6a467cc5 23-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

lockprof: pass lock type as an argument instead of reading the spin flag


# fa2528ac 18-Feb-2021 Alex Richardson <arichardson@FreeBSD.org>

Use atomic loads/stores when updating td->td_state

KCSAN complains about racy accesses in the locking code. Those races are
fine since they are inside a TD_SET_RUNNING() loop that expects the value
to be changed by another CPU.

Use relaxed atomic stores/loads to indicate that this variable can be
written/read by multiple CPUs at the same time. This will also prevent
the compiler from doing unexpected re-ordering.

Reported by: GENERIC-KCSAN
Test Plan: KCSAN no longer complains, kernel still runs fine.
Reviewed By: markj, mjg (earlier version)
Differential Revision: https://reviews.freebsd.org/D28569


# b77594bb 14-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

sched: fix an incorrect comparison in sched_lend_user_prio_cond

Compare with sched_lend_user_prio.


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# b05ca429 02-Mar-2020 Pawel Biernacki <kaktus@FreeBSD.org>

sys/: Document few more sysctls.

Submitted by: Antranig Vartanian <antranigv@freebsd.am>
Reviewed by: kaktus
Commented by: jhb
Approved by: kib (mentor)
Sponsored by: illuria security
Differential Revision: https://reviews.freebsd.org/D23759


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 3ff65f71 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Remove duplicated empty lines from kern/*.c

No functional changes.


# 879e0604 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Add KERNEL_PANICKED macro for use in place of direct panicstr tests


# 686bcb5c 15-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

schedlock 4/4

Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.

This change has not been made to 4BSD but in principle it would be
more straightforward.

Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778


# 61a74c5c 15-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

schedlock 1/4

Eliminate recursion from most thread_lock consumers. Return from
sched_add() without the thread_lock held. This eliminates unnecessary
atomics and lock word loads as well as reducing the hold time for
scheduler locks. This will eventually allow for lockless remote adds.

Discussed with: kib
Reviewed by: jhb
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22626


# 9825eadf 13-Dec-2019 Ryan Libby <rlibby@FreeBSD.org>

bitset: rename confusing macro NAND to ANDNOT

s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual
implementation is "and not" (or "but not"), i.e. A but not B.
Fortunately this does appear to be what all existing callers want.

Don't supply a NAND (not (A and B)) operation at this time.

Discussed with: jeff
Reviewed by: cem
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D22791


# c3cccf95 07-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

Handle multiple clock interrupts simultaneously in sched_clock().

Reviewed by: kib, markj, mav
Differential Revision: https://reviews.freebsd.org/D22625


# 61322a0a 04-Dec-2019 Alexander Motin <mav@FreeBSD.org>

Mark some more hot global variables with __read_mostly.

MFC after: 1 week


# ac97da9a 08-May-2019 Mateusz Guzik <mjg@FreeBSD.org>

Reduce umtx-related work on exec and exit

- there is no need to take the process lock to iterate the thread
list after single-threading is enforced
- typically there are no mutexes to clean up (testable without taking
the global umtx lock)
- typically there is no need to adjust the priority (testable without
taking thread lock)

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20160


# 2bf95012 05-Jul-2018 Andrew Turner <andrew@FreeBSD.org>

Create a new macro for static DPCPU data.

On arm64 (and possible other architectures) we are unable to use static
DPCPU data in kernel modules. This is because the compiler will generate
PC-relative accesses, however the runtime-linker expects to be able to
relocate these.

In preparation to fix this create two macros depending on if the data is
global or static.

Reviewed by: bz, emaste, markj
Sponsored by: ABT Systems Ltd
Differential Revision: https://reviews.freebsd.org/D16140


# 28240885 07-May-2018 Mateusz Guzik <mjg@FreeBSD.org>

Inlined sched_userret.

The tested condition is rarely true and it induces a function call
on each return to userspace.

Bumps getuid rate by about 1% on Broadwell.


# 3f289c3f 12-Jan-2018 Jeff Roberson <jeff@FreeBSD.org>

Implement 'domainset', a cpuset based NUMA policy mechanism. This allows
userspace to control NUMA policy administratively and programmatically.

Implement domainset based iterators in the page layer.

Remove the now legacy numa_* syscalls.

Cleanup some header polution created by having seq.h in proc.h.

Reviewed by: markj, kib
Discussed with: alc
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13403


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 3e85b721 16-May-2017 Ed Maste <emaste@FreeBSD.org>

Remove register keyword from sys/ and ANSIfy prototypes

A long long time ago the register keyword told the compiler to store
the corresponding variable in a CPU register, but it is not relevant
for any compiler used in the FreeBSD world today.

ANSIfy related prototypes while here.

Reviewed by: cem, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D10193


# afa0a46c 23-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

move thread switch tracing from mi_switch to sched_switch

This is done so that the thread state changes during the switch
are not confused with the thread state changes reported when the thread
spins on a lock.

Here is an example, three consecutive entries for the same thread (from top to
bottom):

KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)"
KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1"
KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none

The above trace could leave an impression that the final state of
the thread was "running".
After this change the sleep state will be reported after the "spinning"
and "running" states reported for the sched lock.

Reviewed by: jhb, markj
MFC after: 1 week
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9961


# 28ef18b8 11-Mar-2017 Andriy Gapon <avg@FreeBSD.org>

trace thread running state when a thread is run for the first time

This applies to both KTR_SCHED and DTrace sched:::on-cpu tracing.

MFC after: 10 days


# 27ee18ad 16-Feb-2017 Ryan Stone <rstone@FreeBSD.org>

Revert r313814 and r313816

Something evidently got mangled in my git tree in between testing and
review, as an old and broken version of the patch was apparently submitted
to svn. Revert this while I work out what went wrong.

Reported by: tuexen
Pointy hat to: rstone


# 09ae7c48 16-Feb-2017 Ryan Stone <rstone@FreeBSD.org>

Check for preemption after lowering a thread's priority

When a high-priority thread is waiting for a mutex held by a
low-priority thread, it temporarily lends its priority to the
low-priority thread to prevent priority inversion. When the mutex
is released, the lent priority is revoked and the low-priority
thread goes back to its original priority.

When the priority of that thread is lowered (through a call to
sched_priority()), the schedule was not checking whether
there is now a high-priority thread in the run queue. This can
cause threads with real-time priority to be starved in the run
queue while the low-priority thread finishes its quantum.

Fix this by explicitly checking whether preemption is necessary
when a thread's priority is lowered.

Sponsored by: Dell EMC Isilon
Obtained from: Sandvine Inc
Differential Revision: https://reviews.freebsd.org/D9518
Reviewed by: Jeff Roberson (ule)
MFC after: 1 month


# ad9dadc4 19-Jan-2017 Andriy Gapon <avg@FreeBSD.org>

fix a thread preemption regression in schedulers introduced in r270423

Commit r270423 fixed a regression in sched_yield() that was introduced
in earlier changes. Unfortunately, at the same time it introduced an
new regression. The problem is that SWT_RELINQUISH (6), like all other
SWT_* constants and unlike SW_* flags, is not a bit flag. So, (flags &
SWT_RELINQUISH) is true in cases where that was not really indended,
for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11).

A straight forward fix would be to use (flags & SW_TYPE_MASK) ==
SWT_RELINQUISH, but my impression is that the switch types are designed
mostly for gathering statistics, not for influencing scheduling
decisions.

So, I decided that it would be better to check for SW_PREEMPT flag
instead. That's also the same flag that was checked before r239157.
I double-checked how that flag is used and I am confident that the flag
is set only in the places where we really have the preemption:
- critical_exit + td_owepreempt
- sched_preempt in the ULE scheduler
- sched_preempt in the 4BSD scheduler

Reviewed by: kib, mav
MFC after: 4 days
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9230


# 892f0ab0 11-Nov-2016 John Baldwin <jhb@FreeBSD.org>

Allow scheduling during early boot.

- Send IPI wakeups once SMP is started even if cold is true.
- Permit preemptions when cold is true.

These changes are needed for EARLY_AP_STARTUP.

MFC after: 2 weeks
Sponsored by: Netflix


# a6b91f0f 11-Nov-2016 John Baldwin <jhb@FreeBSD.org>

Don't place threads on the run queue after waking up other CPUs.

The other CPU might resume and see a still-empty runq and go back to
sleep before sched_add() adds the thread to the runq. This results
in a lost wakeup and a potential hang if the system is otherwise
completely idle.

The race originated due to a micro-optimization (my fault) in 4BSD in
that it avoided putting a thread on the run queue if the scheduler was
going to preempt to the new thread. To avoid complexity while fixing
this race, just drop this optimization. 4BSD now always sets the
"owepreempt" flag when a preemption is warranted and defers the actual
preemption to the thread_unlock of the caller the same as ULE.

MFC after: 2 weeks
Sponsored by: Netflix


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# e2325d82 29-Jul-2016 John Baldwin <jhb@FreeBSD.org>

Don't treat NOCPU as a valid CPU to CPU_ISSET.

If a thread is created bound to a cpuset it might already be bound before
it's very first timeslice, and td_lastcpu will be NOCPU in that case.

MFC after: 1 week


# 93ccd6bf 05-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Get rid of struct proc p_sched and struct thread td_sched pointers.

p_sched is unused.

The struct td_sched is always co-allocated with the struct thread,
except for the thread0. Avoid useless indirection, instead calculate
td_sched location using simple pointer arithmetic in td_get_sched(9).
For thread0, which is statically allocated, create a structure to
emulate layout of the dynamic allocation.

Reviewed by: jhb (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D6711


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# ccd0ec40 17-Apr-2016 Konstantin Belousov <kib@FreeBSD.org>

The struct thread td_estcpu member is only used by the 4BSD scheduler.
Move it to the struct td_sched for 4BSD, removing always present
field, otherwise unused for ULE.

New scheduler method sched_estcpu() returns the estimation for
kinfo_proc consumption. As before, it always returns 0 for ULE.

Remove sched_tick() scheduler method, unused both by 4BSD and ULE.

Update locking comment for the 4BSD struct td_sched, copying it from
the same comment for ULE.

Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment.

Based on some notes from, and reviewed by: bde
Sponsored by: The FreeBSD Foundation


# 92de34df 03-Aug-2015 John Baldwin <jhb@FreeBSD.org>

kgdb uses td_oncpu to determine if a thread is running and should use
a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads
retained td_oncpu from the last time they ran, and newborn threads had their
CPU fields cleared to zero during fork and thread creation since they are
in the set of fields zeroed when threads are setup. To fix, explicitly
update the CPU fields for exiting threads in sched_throw() to reflect the
switch out and reset the CPU fields for new threads in sched_fork_thread()
to NOCPU.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3193


# 4b5c9cf6 29-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 2e7d7bb2 23-Aug-2014 Alexander Motin <mav@FreeBSD.org>

Restore pre-r239157 handling of sched_yield(), when thread time slice was
aborted, allowing other threads to run. Without this change thread is just
rescheduled again, that was illustrated by provided test tool.

PR: 192926
Submitted by: eric@vangyzen.net
MFC after: 2 weeks


# 0d13d5fc 29-Apr-2014 Marius Strobl <marius@FreeBSD.org>

Given that as of r258002 the last external user is gone, make sched_lock
static.


# 7dba7849 29-Dec-2013 Mark Johnston <markj@FreeBSD.org>

The arguments to sched:::off-cpu are the thread and associated process of
the thread selected to run, not the currently running thread. This fix has
already been made for ULE in r252070.

PR: 177706
MFC after: 1 week


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 785797c3 24-Jul-2013 Andriy Gapon <avg@FreeBSD.org>

rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST

Also directly call swapper() at the end of mi_startup instead of
relying on swapper being the last thing in sysinits order.

Rationale:

- "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage
- "scheduler" was misleading, the function swaps in the swapped out processes
- another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be
invoked depending on its relative order with scheduler; this was not obvious
and the bug actually used to exist

Reviewed by: kib (ealier version)
MFC after: 14 days


# 36af9869 26-Oct-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add CPU percentage limit enforcement to RCTL. The resouce name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.


# ba96d2d8 22-Aug-2012 John Baldwin <jhb@FreeBSD.org>

Mark the idle threads as non-sleepable and also assert that an idle
thread never blocks on a turnstile.


# 37f4e025 11-Aug-2012 Alexander Motin <mav@FreeBSD.org>

Some more minor tunings inspired by bde@.


# 579895df 10-Aug-2012 Alexander Motin <mav@FreeBSD.org>

Some minor tunings/cleanups inspired by bde@ after previous commits:
- remove extra dynamic variable initializations;
- restore (4BSD) and implement (ULE) hogticks variable setting;
- make sched_rr_interval() more tolerant to options;
- restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more
user-friendly wrapper for sched_slice;
- tune some sysctl descriptions;
- make some style fixes.


# 3d7f4117 09-Aug-2012 Alexander Motin <mav@FreeBSD.org>

Rework r220198 change (by fabient). I believe it solves the problem from
the wrong direction. Before it, if preemption and end of time slice happen
same time, thread was put to the head of the queue as for only preemption.
It could cause single thread to run for indefinitely long time. r220198
handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that
causes delayed context switch every time preemption happens, even when not
needed.

Solve problem by introducing scheduler-specifoc thread flag TDF_SLICEEND,
set when thread's time slice is over and it should be put to the tail of
queue. Using SW_PREEMPT flag for that purpose as it was before just not
enough informative to work correctly.

On my tests this by 2-3 times reduces run time deviation (improves fairness)
in cases when several threads share one CPU.

Reviewed by: fabient
MFC after: 2 months
Sponsored by: iXsystems, Inc.


# 48317e9e 09-Aug-2012 Alexander Motin <mav@FreeBSD.org>

SCHED_4BSD scheduling quantum mechanism appears to be broken for some time.
With switchticks variable being reset each time thread preempted (that is
done regularly by interrupt threads) scheduling quantum may never expire.
It was not noticed in time because several other factors still regularly
trigger context switches.

Handle the problem by replacing that mechanism with its equivalent from
SCHED_ULE called time slice. It is effectively the same, just measured in
context of stathz instead of hz. Some unification is probably not bad.


# 2aaae99d 15-May-2012 Sergey Kandaurov <pluknet@FreeBSD.org>

Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP.


# b3e9e682 14-May-2012 Ryan Stone <rstone@FreeBSD.org>

Implement the DTrace sched provider. This implementation aims to be
compatible with the sched provider implemented by Solaris and its open-
source derivatives. Full documentation of the sched provider can be found
on Oracle's DTrace wiki pages.

Note that for compatibility with scripts originally written for Solaris,
serveral probes are defined that will never fire. These probes are defined
to fire when Solaris-specific features perform certain actions. As these
features are not present in FreeBSD, the probes can never fire.

Also, I have added a two probes that are not defined in Solaris, lend-pri
and load-change. These probes have been added to make it possible to
collect schedgraph data with DTrace.

Finally, a few probes are defined in Solaris to take a cpuinfo_t *
argument. As it was not immediately clear to me how to translate that to
FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *.

Sponsored by: Sandvine Incorporated
MFC after: 2 weeks


# 44ad5475 08-Mar-2012 John Baldwin <jhb@FreeBSD.org>

Add a new sched_clear_name() method to the scheduler interface to clear
the cached name used for KTR_SCHED traces when a thread's name changes.
This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's
name changes, most commonly via execve().

MFC after: 2 weeks


# 7e3a96ea 03-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Some small fixes to CPU accounting for threads:
- Only initialize the per-cpu switchticks and switchtime in sched_throw()
for the very first context switch on APs during boot. This avoids a
small gap between the middle of thread_exit() and sched_throw() where
time is not accounted to any thread.
- In thread_exit(), update the timestamp bookkeeping to track the changes
to mi_switch() introduced by td_rux so that the code once again matches
the comment claiming it is mimicing mi_switch(). Specifically, only
update the per-thread stats directly and depend on ruxagg() to update
p_rux rather than adjusting p_rux directly. While here, move the
timestamp bookkeeping as late in the function as possible.

Reviewed by: bde, kib
MFC after: 1 week


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# cd39bb09 26-Aug-2011 Xin LI <delphij@FreeBSD.org>

Fix format strings for KTR_STATE in 4BSD ad ULE schedulers.

Submitted by: Ivan Klymenko <fidaj@ukr.net>
PR: kern/159904, kern/159905
MFC after: 2 weeks
Approved by: re (kib)


# a38f1f26 13-Jun-2011 Attilio Rao <attilio@FreeBSD.org>

Remove pc_cpumask and pc_other_cpus usage from MI code.

Tested by: pluknet


# d098f930 31-May-2011 Nathan Whitehorn <nwhitehorn@FreeBSD.org>

On multi-core, multi-threaded PPC systems, it is important that the threads
be brought up in the order they are enumerated in the device tree (in
particular, that thread 0 on each core be brought up first). The SLIST
through which we loop to start the CPUs has all of its entries added with
SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration
and so AP startup would always fail in such situations (causing a machine
check or RTAS failure). Fix this by changing the SLIST into an STAILQ,
and inserting new CPUs at the end.

Reviewed by: jhb


# a0a43452 17-May-2011 Attilio Rao <attilio@FreeBSD.org>

Merge r221285 from largeSMP project:
- Remove the following sysctl:
kern.sched.ipiwakeup.onecpu
kern.sched.ipiwakeup.htt2

Because they are absolutely obsolete. Probabilly the whole wakeup
forward mechanism should be revisited for a better fitting in modern
hw, in the future.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
arguments passing.

Reviewed by: julian
Tested by: several


# d59dd76c 16-May-2011 Attilio Rao <attilio@FreeBSD.org>

Merge r221278 from largeSMP project:
idle_cpus_mask is just used in sched_4bsd, thus make it private for it.

Tested by: several


# 71a19bdc 05-May-2011 Attilio Rao <attilio@FreeBSD.org>

Commit the support for removing cpumask_t and replacing it directly with
cpuset_t objects.
That is going to offer the underlying support for a simple bump of
MAXCPU and then support for number of cpus > 32 (as it is today).

Right now, cpumask_t is an int, 32 bits on all our supported architecture.
cpumask_t on the other side is implemented as an array of longs, and
easilly extendible by definition.

The architectures touched by this commit are the following:
- amd64
- i386
- pc98
- arm
- ia64
- XEN

while the others are still missing.
Userland is believed to be fully converted with the changes contained
here.

Some technical notes:
- This commit may be considered an ABI nop for all the architectures
different from amd64 and ia64 (and sparc64 in the future)
- per-cpu members, which are now converted to cpuset_t, needs to be
accessed avoiding migration, because the size of cpuset_t should be
considered unknown
- size of cpuset_t objects is different from kernel and userland (this is
primirally done in order to leave some more space in userland to cope
with KBI extensions). If you need to access kernel cpuset_t from the
userland please refer to example in this patch on how to do that
correctly (kgdb may be a good source, for example).
- Support for other architectures is going to be added soon
- Only MAXCPU for amd64 is bumped now

The patch has been tested by sbruno and Nicholas Esborn on opteron
4 x 12 pack CPUs. More testing on big SMP is expected to came soon.
pluknet tested the patch with his 8-ways on both amd64 and i386.

Tested by: pluknet, sbruno, gianni, Nicholas Esborn
Reviewed by: jeff, jhb, sbruno


# f0283a73 30-Apr-2011 Attilio Rao <attilio@FreeBSD.org>

- Remove the following sysctl:
kern.sched.ipiwakeup.onecpu
kern.sched.ipiwakeup.htt2

Because they are absolutely obsolete. Probabilly the whole wakeup
forward mechanism should be revisited for a better fitting in modern
hw.
- As map2 variable is no longer used rename map3 to map2
- Fix a string by making more informative the msg and removing the
arguments passing

Approved by: julian


# 3121f534 30-Apr-2011 Attilio Rao <attilio@FreeBSD.org>

idle_cpus_mask is just used in the SMP case and within sched_4BSD.
Declare appropriately.


# 60dd73b7 26-Apr-2011 Ryan Stone <rstone@FreeBSD.org>

If the 4BSD scheduler tries to schedule a thread that has been pinned or
bound to an AP before SMP has started, the system will panic when we try
to touch per-CPU state for that AP because that state has not been
initialized yet. Fix this in the same way as ULE: place all threads in
the global run queue before SMP has started.

Reviewed by: jhb
MFC after: 1 month


# e806d352 06-Apr-2011 John Baldwin <jhb@FreeBSD.org>

Fix several places to ignore processes that are not yet fully constructed.

MFC after: 1 week


# 586cb6ec 31-Mar-2011 Fabien Thomas <fabient@FreeBSD.org>

Clearing the flag when preempting will let the preempted thread run
too much time. This can finish in a scheduler deadlock with ping-pong
between two threads.

One sample of this is:
- device lapic (to have a preemption point on critical_exit())
- options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq)
- running a cpu intensive task (that does not enter the kernel)
- only one CPU on SMP or no SMP.

As requested by jhb@ 4BSD have received the same type of fix instead of
propagating the flag to the new thread.

Reviewed by: jhb, jeff
MFC after: 1 month


# 2dc29adb 14-Jan-2011 John Baldwin <jhb@FreeBSD.org>

Rework realtime priority support:
- Move the realtime priority range up above kernel sleep priorities and
just below interrupt thread priorities.
- Contract the interrupt and kernel sleep priority ranges a bit so that
the timesharing priority band can be increased. The new timeshare range
is now slightly larger than the old realtime + timeshare ranges.
- Change the ULE scheduler to no longer use realtime priorities for
interactive threads. Instead, the larger timeshare range is now split
into separate subranges for interactive and non-interactive ("batch")
threads. The end result is that interactive threads and non-interactive
threads still use the same priority ranges as before, but realtime
threads now have a separate, dedicated priority range.
- Do not modify the priority of non-timeshare threads in sched_sleep()
or via cv_broadcastpri(). Realtime and idle priority threads will
no longer have their priorities affected by sleeping in the kernel.

Reviewed by: jeff


# 52c0b557 13-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

One more sysctl(9) type-safety that I missed before.


# 22d19207 06-Jan-2011 John Baldwin <jhb@FreeBSD.org>

- Move sched_fork() later in fork() after the various sections of the new
thread and proc have been copied and zeroed from the old thread and
proc. Otherwise attempts to modify thread or process data in sched_fork()
could be undone.
- Don't copy td_{base,}_user_pri from the old thread to the new thread in
sched_fork_thread() in ULE. This is already done courtesy the bcopy()
of the thread copy region.
- Always initialize the real priority (td_priority) of new threads to the
new thread's base priority (td_base_pri) to avoid bogusly inheriting a
borrowed priority from the parent thread.

MFC after: 2 weeks


# c8e368a9 29-Dec-2010 David Xu <davidxu@FreeBSD.org>

- Follow r216313, the sched_unlend_user_prio is no longer needed, always
use sched_lend_user_prio to set lent priority.
- Improve pthread priority-inherit mutex, when a contender's priority is
lowered, repropagete priorities, this may cause mutex owner's priority
to be lowerd, in old code, mutex owner's priority is rise-only.


# acbe332a 08-Dec-2010 David Xu <davidxu@FreeBSD.org>

MFp4:
It is possible a lower priority thread lending priority to higher priority
thread, in old code, it is ignored, however the lending should always be
recorded, add field td_lend_user_pri to fix the problem, if a thread does
not have borrowed priority, its value is PRI_MAX.

MFC after: 1 week


# 3e288e62 22-Nov-2010 Dimitry Andric <dim@FreeBSD.org>

After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.


# 31c6a003 14-Nov-2010 Dimitry Andric <dim@FreeBSD.org>

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# a157e425 13-Sep-2010 Alexander Motin <mav@FreeBSD.org>

Refactor timer management code with priority to one-shot operation mode.
The main goal of this is to generate timer interrupts only when there is
some work to do. When CPU is busy interrupts are generating at full rate
of hz + stathz to fullfill scheduler and timekeeping requirements. But
when CPU is idle, only minimum set of interrupts (down to 8 interrupts per
second per CPU now), needed to handle scheduled callouts is executed.
This allows significantly increase idle CPU sleep time, increasing effect
of static power-saving technologies. Also it should reduce host CPU load
on virtualized systems, when guest system is idle.

There is set of tunables, also available as writable sysctls, allowing to
control wanted event timer subsystem behavior:
kern.eventtimer.timer - allows to choose event timer hardware to use.
On x86 there is up to 4 different kinds of timers. Depending on whether
chosen timer is per-CPU, behavior of other options slightly differs.
kern.eventtimer.periodic - allows to choose periodic and one-shot
operation mode. In periodic mode, current timer hardware taken as the only
source of time for time events. This mode is quite alike to previous kernel
behavior. One-shot mode instead uses currently selected time counter
hardware to schedule all needed events one by one and program timer to
generate interrupt exactly in specified time. Default value depends of
chosen timer capabilities, but one-shot mode is preferred, until other is
forced by user or hardware.
kern.eventtimer.singlemul - in periodic mode specifies how much times
higher timer frequency should be, to not strictly alias hardclock() and
statclock() events. Default values are 2 and 4, but could be reduced to 1
if extra interrupts are unwanted.
kern.eventtimer.idletick - makes each CPU to receive every timer interrupt
independently of whether they busy or not. By default this options is
disabled. If chosen timer is per-CPU and runs in periodic mode, this option
has no effect - all interrupts are generating.

As soon as this patch modifies cpu_idle() on some platforms, I have also
refactored one on x86. Now it makes use of MONITOR/MWAIT instrunctions
(if supported) under high sleep/wakeup rate, as fast alternative to other
methods. It allows SMP scheduler to wake up sleeping CPUs much faster
without using IPI, significantly increasing performance on some highly
task-switching loads.

Tested by: many (on i386, amd64, sparc64 and powerc)
H/W donated by: Gheorghe Ardelean
Sponsored by: iXsystems, Inc.


# b722ad00 11-Sep-2010 Alexander Motin <mav@FreeBSD.org>

Merge some SCHED_ULE features to SCHED_4BSD:
- Teach SCHED_4BSD to inform cpu_idle() about high sleep/wakeup rate to
choose optimized handler. In case of x86 it is MONITOR/MWAIT. Also it
will be needed to bypass forthcoming idle tick skipping logic to not
consume resources on events rescheduling when it won't give any benefits.
- Teach SCHED_4BSD to wake up idle CPUs without using IPI. In case of x86,
when MONITOR/MWAIT is active, it require just single memory write. This
doubles performance on some heavily switching test loads.


# d9d8d144 06-Aug-2010 John Baldwin <jhb@FreeBSD.org>

Add a new ipi_cpu() function to the MI IPI API that can be used to send an
IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that
constructed a mask for a single CPU with calls to ipi_cpu() instead. This
will matter more in the future when we transition from cpumask_t to
cpuset_t for CPU masks in which case building a CPU mask is more expensive.

Submitted by: peter, sbruno
Reviewed by: rookie
Obtained from: Yahoo! (x86)
MFC after: 1 month


# 3aa6d94e 11-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Update several places that iterate over CPUs to use CPU_FOREACH().


# 3da35a0a 03-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Assert that the thread lock is held in sched_pctcpu() instead of
recursively acquiring it. All of the current callers already hold the
lock.

MFC after: 1 month


# 1d7830ed 21-May-2010 John Baldwin <jhb@FreeBSD.org>

Assert that the thread passed to sched_bind() and sched_unbind() is
curthread as those routines are only supported for curthread currently.

MFC after: 1 month


# 95e1c1fa 08-Feb-2010 Attilio Rao <attilio@FreeBSD.org>

MC r202889, r202940:
- Fix a race in sched_switch() of sched_4bsd.
Block the td_lock when acquiring explicitly sched_lock in order to prevent
races with other td_lock contenders.
- Merge the ULE's internal function thread_block_switch() into the global
thread_lock_block() and make the former semantic as the default for
thread_lock_block().
- Split out an invariant in order to have better checks.


# aa16019d 25-Jan-2010 Attilio Rao <attilio@FreeBSD.org>

MFC r201790:
- Set td_slptick to 0 when moving threads out of sleepqueues.
- Move td_slptick from u_int to int in order to follow 'ticks' signedness
and wrap up accordingly.

Sponsored by: Sandvine Incorporated


# 58060789 24-Jan-2010 Attilio Rao <attilio@FreeBSD.org>

Split out an invariant in order to better check that newtd, when
provided, must be on a runqueue.

Tested by: Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
MFC: 2 weeks
X-MFC: r202889


# b0b9dee5 23-Jan-2010 Attilio Rao <attilio@FreeBSD.org>

- Fix a race in sched_switch() of sched_4bsd.
In the case of the thread being on a sleepqueue or a turnstile, the
sched_lock was acquired (without the aid of the td_lock interface) and
the td_lock was dropped. This was going to break locking rules on other
threads willing to access to the thread (via the td_lock interface) and
modify his flags (allowed as long as the container lock was different
by the one used in sched_switch).
In order to prevent this situation, while sched_lock is acquired there
the td_lock gets blocked. [0]
- Merge the ULE's internal function thread_block_switch() into the global
thread_lock_block() and make the former semantic as the default for
thread_lock_block(). This means that thread_lock_block() will not
disable interrupts when called (and consequently thread_unlock_block()
will not re-enabled them when called). This should be done manually
when necessary.
Note, however, that ULE's thread_unblock_switch() is not reaped
because it does reflect a difference in semantic due in ULE (the
td_lock may not be necessarilly still blocked_lock when calling this).
While asymmetric, it does describe a remarkable difference in semantic
that is good to keep in mind.

[0] Reported by: Kohji Okuno
<okuno dot kohji at jp dot panasonic dot com>
Tested by: Giovanni Trematerra
<giovanni dot trematerra at gmail dot com>
MFC: 2 weeks


# 6eac7e57 08-Jan-2010 Attilio Rao <attilio@FreeBSD.org>

- Fix a bug in sched_4bsd where the timestamp for the sleeping operation
is not cleaned up on the wakeup but reset.
This is harmless mostly because td_slptick (and ki_slptime from
userland) should be analyzed only with the assumption that the thread
is actually sleeping (thus while the td_slptick is correctly set) but
without this invariant the number is nomore consistent.
- Move td_slptick from u_int to int in order to follow 'ticks' signedness
and wrap up accordingly [0]

[0] Submitted by: emaste
Sponsored by: Sandvine Incorporated
MFC 1 week


# e1c0f124 07-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r201347:
Allow swap out of the kernel stack for the thread with priority greater
or equial then PSOCK, not less or equial.


# 17c4c356 31-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

Allow swap out of the kernel stack for the thread with priority greater
or equial then PSOCK, not less or equial. Higher priority has lesser
numerical value.

Existing test does not allow for swapout of the thread waiting for
advisory lock, for exiting child or sleeping for timeout. On the other
hand, high-priority waiters of VFS/VM events can be swapped out.

Tested by: pho
Reviewed by: jhb
MFC after: 1 week


# 1b9d701f 03-Nov-2009 Attilio Rao <attilio@FreeBSD.org>

Split P_NOLOAD into a per-thread flag (TDF_NOLOAD).
This improvements aims for avoiding further cache-misses in scheduler
specific functions which need to keep track of average thread running
time and further locking in places setting for this flag.

Reported by: jeff (originally), kris (currently)
Reviewed by: jhb
Tested by: Giuseppe Cocomazzi <sbudella at email dot it>


# 0d2cf837 25-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- Use __XSTRING where I want the define to be expanded. This resulted in
sizeof("MAXCPU") being used to calculate a string length rather than
something more reasonable such as sizeof("32"). This shouldn't have
caused any ill effect until we run on machines with 1000000 or more
cpus.


# 8f51ad55 17-Jan-2009 Jeff Roberson <jeff@FreeBSD.org>

- Implement generic macros for producing KTR records that are compatible
with src/tools/sched/schedgraph.py. This allows developers to quickly
create a graphical view of ktr data for any resource in the system.
- Add sched_tdname() and the pcpu field 'name' for quickly and uniformly
identifying records associated with a thread or cpu.
- Reimplement the KTR_SCHED traces using the new generic facility.

Obtained from: attilio
Discussed with: jhb
Sponsored by: Nokia


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# c3ea3378 28-Jul-2008 John Baldwin <jhb@FreeBSD.org>

When choosing a CPU for a thread in a cpuset, prefer the last CPU that the
thread ran on if there are no other CPUs in the set with a shorter per-CPU
runqueue.


# f200843b 28-Jul-2008 John Baldwin <jhb@FreeBSD.org>

Implement support for cpusets in the 4BSD scheduler.
- When a cpuset is applied to a thread, walk the cpuset to see if it is a
"full" cpuset (includes all available CPUs). If not, set a new
TDS_AFFINITY flag to indicate that this thread can't run on all CPUs.
When inheriting a cpuset from another thread during thread creation, the
new thread also inherits this flag. It is in a new ts_flags field in
td_sched rather than using one of the TDF_SCHEDx flags because fork()
clears td_flags after invoking sched_fork().
- When placing a thread on a runqueue via sched_add(), if the thread is not
pinned or bound but has the TDS_AFFINITY flag set, then invoke a new
routine (sched_pickcpu()) to pick a CPU for the thread to run on next.
sched_pickcpu() walks the cpuset and picks the CPU with the shortest
per-CPU runqueue length. Note that the reason for the TDS_AFFINITY flag
is to avoid having to walk the cpuset and examine runq lengths in the
common case.
- To avoid walking the per-CPU runqueues in sched_pickcpu(), add an array
of counters to hold the length of the per-CPU runqueues and update them
when adding and removing threads to per-CPU runqueues.

MFC after: 2 weeks


# 8aa3d7ff 28-Jul-2008 John Baldwin <jhb@FreeBSD.org>

Various and sundry style and whitespace fixes.


# 6f5f25e5 24-May-2008 John Birrell <jb@FreeBSD.org>

Add the vtime (virtual time) hooks for DTrace.


# 6c47aaae 24-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Add an integer argument to idle to indicate how likely we are to wake
from idle over the next tick.
- Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are
suspended in cpu specific states. This function can fail and cause the
scheduler to fall back to another mechanism (ipi).
- Implement support for mwait in cpu_idle() on i386/amd64 machines that
support it. mwait is a higher performance way to synchronize cpus
as compared to hlt & ipis.
- Allow selecting the idle routine by name via sysctl machdep.idle. This
replaces machdep.cpu_idle_hlt. Only idle routines supported by the
current machine are permitted.

Sponsored by: Nokia


# 8df78c41 16-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia


# 9727e637 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
- Compile kern_switch.c independently again and stop #include'ing it from
schedulers.
- Remove the ts_thread backpointers and convert most code to go from
struct thread to struct td_sched.
- Cleanup the ts_flags #define garbage that was causing us to sometimes
do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
- Export the kern.sched sysctl node in sysctl.h


# 8b16c208 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- ULE and 4BSD share only one line of code from sched_newthread() so implement
the required pieces in sched_fork_thread(). The td_sched pointer is already
setup by thread_init anyway.


# a90f3f25 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Move maybe_preempt() from kern_switch.c to sched_4bsd.c. This is function
is only used by 4bsd.
- Create a new runq_choose_fuzz() function rather than polluting runq_choose()
with 4BSD specific code.
- Move the fuzz sysctl into sched_4bsd.c
- Remove some dead code from kern_switch.c


# a564bfc7 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Directly include opt_sched.h in sched_4bsd.


# 374ae2a3 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from
requiring the per-process spinlock to only requiring the process lock.
- Reflect these changes in the proc.h documentation and consumers throughout
the kernel. This is a substantial reduction in locking cost for these
fields and was made possible by recent changes to threading support.


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# c5aa6b58 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Pass the priority argument from *sleep() into sleepq and down into
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.

Reviewed by: jhb, peter


# 1e24c28f 09-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Add a sched_preempt() routine to be called by md code after IPI_PREEMPT is
delivered.
- Add a simple implementation to 4bsd.


# f5a3ef99 02-Mar-2008 Marcel Moolenaar <marcel@FreeBSD.org>

Unbreak after cpuset: initialize td_cpuset in sched_fork_thread().


# 885d51a3 02-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Add a new sched_affinity() api to be used in the upcoming cpuset
implementation.
- Add empty implementations of sched_affinity() to 4BSD and ULE.

Sponsored by: Nokia


# eea4f254 15-Dec-2007 Jeff Roberson <jeff@FreeBSD.org>

- Re-implement lock profiling in such a way that it no longer breaks
the ABI when enabled. There is no longer an embedded lock_profile_object
in each lock. Instead a list of lock_profile_objects is kept per-thread
for each lock it may own. The cnt_hold statistic is now always 0 to
facilitate this.
- Support shared locking by tracking individual lock instances and
statistics in the per-thread per-instance lock_profile_object.
- Make the lock profiling hash table a per-cpu singly linked list with a
per-cpu static lock_prof allocator. This removes the need for an array
of spinlocks and reduces cache contention between cores.
- Use a seperate hash for spinlocks and other locks so that only a
critical_enter() is required and not a spinlock_enter() to modify the
per-cpu tables.
- Count time spent spinning in the lock statistics.
- Remove the LOCK_PROFILE_SHARED option as it is always supported now.
- Specifically drop and release the scheduler locks in both schedulers
since we track owners now.

In collaboration with: Kip Macy
Sponsored by: Nokia


# 435806d3 11-Dec-2007 David Xu <davidxu@FreeBSD.org>

Fix LOR of thread lock and umtx's priority propagation mutex due
to the reworking of scheduler lock.

MFC: after 3 days


# 431f8906 13-Nov-2007 Julian Elischer <julian@FreeBSD.org>

generally we are interested in what thread did something as
opposed to what process. Since threads by default have teh name of the
process unless over-written with more useful information, just print the
thread name instead.


# 088f5849 04-Nov-2007 Robert Watson <rwatson@FreeBSD.org>

Remove unused variable td from sched_idletd().

MFC after: 3 days
Found with: Coverity Prevent(tm)
CID: 3561


# 9dddab6f 27-Oct-2007 John Baldwin <jhb@FreeBSD.org>

Change the roundrobin implementation in the 4BSD scheduler to trigger a
userland preemption directly from hardclock() via sched_clock() when a
thread uses up a full quantum instead of using a periodic timeout to cause
a userland preemption every so often. This fixes a potential deadlock
when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held
by a thread pinned or bound to another CPU. The current thread on that
CPU will never be preempted while softclock is blocked.

Note that ULE already drives its round-robin userland preemption from
sched_clock() as well and always enables IPI_PREEMPT.

MFC after: 1 week


# 7ab24ea3 26-Oct-2007 Julian Elischer <julian@FreeBSD.org>

Introduce a way to make pure kernal threads.
kthread_add() takes the same parameters as the old kthread_create()
plus a pointer to a process structure, and adds a kernel thread
to that process.

kproc_kthread_add() takes the parameters for kthread_add,
plus a process name and a pointer to a pointer to a process instead of just
a pointer, and if the proc * is NULL, it creates the process to the
specifications required, before adding the thread to it.

All other old kthread_xxx() calls return, but act on (struct thread *)
instead of (struct proc *). One reason to change the name is so that
any old kernel modules that are lying around and expect kthread_create()
to make a process will not just accidentally link.

fix top to show kernel threads by their thread name in -SH mode
add a tdnam formatting option to ps to show thread names.

make all idle threads actual kthreads and put them into their own idled process.
make all interrupt threads kthreads and put them in an interd process
(mainly for aesthetic and accounting reasons)
rename proc 0 to be 'kernel' and it's swapper thread is now 'swapper'

man page fixes to follow.


# 05dc0eb2 08-Oct-2007 Jeff Roberson <jeff@FreeBSD.org>

- Restore historical sched_yield() behavior by changing sched_relinquish()
to simply switch rather than lowering priority and switching. This allows
threads of equal priority to run but not lesser priority.

Discussed with: davidxu
Reported by: NIIMI Satoshi <sa2c@sa2c.net>
Approved by: re


# 54b0e65f 20-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This
changes the units from seconds to the value of 'ticks' when swapped
in/out. ULE does not have a periodic timer that scans all threads in
the system and as such maintaining a per-second counter is difficult.
- Change computations requiring the unit in seconds to subtract ticks
and divide by hz. This does make the wraparound condition hz times
more frequent but this is still in the range of several months to
years and the adverse effects are minimal.

Approved by: re


# b61ce5b0 16-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 6ea38de8 18-Jul-2007 Jeff Roberson <jeff@FreeBSD.org>

- Remove the global definition of sched_lock in mutex.h to break
new code and third party modules which try to depend on it.
- Initialize sched_lock in sched_4bsd.c.
- Declare sched_lock in sparc64 pmap.c and assert that we're compiling
with SCHED_4BSD to prevent accidental crashes from running ULE. This
is the sole remaining file outside of the scheduler that uses the
global sched_lock.

Approved by: re


# fe54587f 12-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move some common code out of sched_fork_exit() and back into fork_exit().


# 710eacdc 05-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

- Placing the 'volatile' on the right side of the * in the td_lock
declaration removes the need for __DEVOLATILE().

Pointed out by: tegge


# 95e3a0bc 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

- Better fix for previous error; use DEVOLATILE on the td_lock pointer
it can actually sometimes be something other than sched_lock even on
schedulers which rely on a global scheduler lock.

Tested by: kan


# c219b097 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

- Pass &sched_lock as the third argument to cpu_switch() as this will
always be the correct lock and we don't get volatile warnings this
way.

Pointed out by: kan


# 7b20fb19 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 4d70511a 27-Feb-2007 John Baldwin <jhb@FreeBSD.org>

Use pause() rather than tsleep() on stack variables and function pointers.


# c6226eea 01-Feb-2007 Julian Elischer <julian@FreeBSD.org>

Move the seting of the idle_mask bits to a place where they
can't be wrong.
Also use the IDLETD bit in the thread mask to test if its an idle thread
rather than doing a PCPU access.


# f0393f06 23-Jan-2007 Jeff Roberson <jeff@FreeBSD.org>

- Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.

Discussed with: julian

- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.

Suggested by: jhb

Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.


# 2da78e38 31-Dec-2006 Robert Watson <rwatson@FreeBSD.org>

Prefer a more traditional spelling of inhibited in comments and panic
messages.


# 34e1241b 23-Dec-2006 David Xu <davidxu@FreeBSD.org>

Fix typo, p_slptime should be td_slptime.


# ad1e7d28 05-Dec-2006 Julian Elischer <julian@FreeBSD.org>

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# de38cd9d 20-Nov-2006 Julian Elischer <julian@FreeBSD.org>

whitespace fix only


# 65338575 13-Nov-2006 David Xu <davidxu@FreeBSD.org>

Fix a copy-paste bug in NON-KSE case.


# 5a215147 11-Nov-2006 David Xu <davidxu@FreeBSD.org>

Unbreak userland priority inheriting in NO_KSE case.


# 8460a577 26-Oct-2006 John Birrell <jb@FreeBSD.org>

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# 3db720fd 25-Aug-2006 David Xu <davidxu@FreeBSD.org>

Add user priority loaning code to support priority propagation for
1:1 threading's POSIX priority mutexes, the code is no-op unless
priority-aware umtx code is committed.


# 1f36c876 02-Jul-2006 Maxim Konovalov <maxim@FreeBSD.org>

o Fix grammar in the comment, indent macros. No functional changes.


# 75d960eb 02-Jul-2006 Maxim Konovalov <maxim@FreeBSD.org>

o Remove rev. 1.57 leftover, not reached code.


# 2e4db89c 29-Jun-2006 David E. O'Brien <obrien@FreeBSD.org>

Fix building with GCC 4.2: define data types before referring to them.


# 36ec198b 15-Jun-2006 David Xu <davidxu@FreeBSD.org>

Add scheduler API sched_relinquish(), the API is used to implement
yield() and sched_yield() syscalls. Every scheduler has its own way
to relinquish cpu, the ULE and CORE schedulers have two internal run-
queues, a timesharing thread which calls yield() syscall should be
moved to inactive queue.


# b41f1452 13-Jun-2006 David Xu <davidxu@FreeBSD.org>

Add scheduler CORE, the work I have done half a year ago, recent,
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different with ULE, it comes from Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use same word
"score" as a priority boost in ULE scheduler.

Briefly, the scheduler has following characteristic:
1. Timesharing process's nice value is seriously respected,
timeslice and interaction detecting algorithm are based
on nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler
uses 256 priority queues. Unlike ULE which using pull and push, the
scheduelr uses pull method, the main reason is to let relative idle
cpu do the work, but current the whole scheduler is protected by the
big sched_lock, so the benefit is not visible, it really can be worse
than nothing because all other cpu are locked out when we are doing
balancing work, which the 4BSD scheduelr does not have this problem.
The scheduler does not support hyperthreading very well, in fact,
the scheduler does not make the difference between physical CPU and
logical CPU, this should be improved in feature. The scheduler has
priority inversion problem on MP machine, it is not good for
realtime scheduling, it can cause realtime process starving.
As a result, it seems the MySQL super-smack runs better on my
Pentium-D machine when using libthr, despite on UP or SMP kernel.


# 0ae716e5 05-Jun-2006 David Xu <davidxu@FreeBSD.org>

Make ke_rqindex unsigned.


# 5c06d111 27-Apr-2006 John-Mark Gurney <jmg@FreeBSD.org>

back out for now... revert ccpu to being kern.ccpu...


# c71ce6a4 26-Apr-2006 John-Mark Gurney <jmg@FreeBSD.org>

move remaining sysctl into the kern.sched tree...


# 0f180a7c 17-Apr-2006 John Baldwin <jhb@FreeBSD.org>

Change msleep() and tsleep() to not alter the calling thread's priority
if the specified priority is zero. This avoids a race where the calling
thread could read a snapshot of it's current priority, then a different
thread could change the first thread's priority, then the original thread
would call sched_prio() inside msleep() undoing the change made by the
second thread. I used a priority of zero as no thread that calls msleep()
or tsleep() should be specifying a priority of zero anyway.

The various places that passed 'curthread->td_priority' or some variant
as the priority now pass 0.


# 4da0d332 23-Jun-2005 Peter Wemm <peter@FreeBSD.org>

Move HWPMC_HOOKS into its own opt_hwpmc_hooks.h file. It doesn't merit
being in opt_global.h and forcing a global recompile when only a few files
reference it.

Approved by: re


# a3f2d842 09-Jun-2005 Stephan Uphoff <ups@FreeBSD.org>

Lots of whitespace cleanup.
Fix for broken if condition.

Submitted by: nate@


# f3a0f873 09-Jun-2005 Stephan Uphoff <ups@FreeBSD.org>

Fix some race conditions for pinned threads that may cause them to run
on the wrong CPU.

Add IPI support for preempting a thread on another CPU.

MFC after:3 weeks


# ebccf1e3 18-Apr-2005 Joseph Koshy <jkoshy@FreeBSD.org>

Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities
and documentation into -CURRENT.

Bump FreeBSD_version.

Reviewed by: alc, jhb (kernel changes)


# f3050486 15-Apr-2005 Maxim Konovalov <maxim@FreeBSD.org>

Fix a typo in the comment.

Noticed by: Samy Al Bahra


# 77918643 07-Apr-2005 Stephan Uphoff <ups@FreeBSD.org>

Sprinkle some volatile magic and rearrange things a bit to avoid race
conditions in critical_exit now that it no longer blocks interrupts.

Reviewed by: jhb


# f5c157d9 30-Dec-2004 John Baldwin <jhb@FreeBSD.org>

Rework the interface between priority propagation (lending) and the
schedulers a bit to ensure more correct handling of priorities and fewer
priority inversions:
- Add two functions to the sched(9) API to handle priority lending:
sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these
functions to ask the scheduler to lend a thread a set priority and to
tell the scheduler when it thinks it is ok for a thread to stop borrowing
priority. The unlend case is slightly complex in that the turnstile code
tells the scheduler what the minimum priority of the thread needs to be
to satisfy the requirements of any other threads blocked on locks owned
by the thread in question. The scheduler then decides where the thread
can go back to normal mode (if it's normal priority is high enough to
satisfy the pending lock requests) or it it should continue to use the
priority specified to the sched_unlend_prio() call. This involves adding
a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag
for priority elevation.
- Schedulers now refuse to lower the priority of a thread that is currently
borrowing another therad's priority.
- If a scheduler changes the priority of a thread that is currently sitting
on a turnstile, it will call a new function turnstile_adjust() to inform
the turnstile code of the change. This function resorts the thread on
the priority list of the turnstile if needed, and if the thread ends up
at the head of the list (due to having the highest priority) and its
priority was raised, then it will propagate that new priority to the
owner of the lock it is blocked on.

Some additional fixes specific to the 4BSD scheduler include:
- Common code for updating the priority of a thread when the user priority
of its associated kse group has been consolidated in a new static
function resetpriority_thread(). One change to this function is that
it will now only adjust the priority of a thread if it already has a
time sharing priority, thus preserving any boosts from a tsleep() until
the thread returns to userland. Also, resetpriority() no longer calls
maybe_resched() on each thread in the group. Instead, the code calling
resetpriority() is responsible for calling resetpriority_thread() on
any threads that need to be updated.
- schedcpu() now uses resetpriority_thread() instead of just calling
sched_prio() directly after it updates a kse group's user priority.
- sched_clock() now uses resetpriority_thread() rather than writing
directly to td_priority.
- sched_nice() now updates all the priorities of the threads after the
group priority has been adjusted.

Discussed with: bde
Reviewed by: ups, jeffr
Tested on: 4bsd, ule
Tested on: i386, alpha, sparc64


# 907bdbc2 25-Dec-2004 Jeff Roberson <jeff@FreeBSD.org>

- Wrap the thread count adjustment in sched_load_add() and sched_load_rem()
so that we may place some ktr entries nearby.
- Define other KTR_SCHED tracepoints so that we may graph the operation
of the scheduler.


# 7842f65e 14-Dec-2004 Jeff Roberson <jeff@FreeBSD.org>

- Garbage collect several unused members of struct kse and struce ksegrp.
As best as I can tell, some of these were never used.


# 56564741 07-Dec-2004 Stephan Uphoff <ups@FreeBSD.org>

Propagate TDF_NEEDRESCHED to replacement thread in sched_switch().

Reviewed by: julian, jhb (in October)
Approved by: sam (mentor)
MFC after: 4 weeks


# c20c691b 05-Oct-2004 Julian Elischer <julian@FreeBSD.org>

When preempting a thread, put it back on the HEAD of its run queue.
(Only really implemented in 4bsd)

MFC after: 4 days


# d39063f2 05-Oct-2004 Julian Elischer <julian@FreeBSD.org>

Use some macros to trach available scheduler slots to allow
easier debugging.

MFC after: 4 days


# 14f0e2e9 16-Sep-2004 Julian Elischer <julian@FreeBSD.org>

clean up thread runq accounting a bit.

MFC after: 3 days


# b2578c6c 13-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Add some kasserts


# 1e7fad6b 11-Sep-2004 Scott Long <scottl@FreeBSD.org>

Revert the previous round of changes to td_pinned. The scheduler isn't
fully initialed when the pmap layer tries to call sched_pini() early in the
boot and results in an quick panic. Use ke_pinned instead as was originally
done with Tor's patch.

Approved by: julian


# 5c854acc 10-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Make up my mind if cpu pinning is stored in the thread structure or the
scheduler specific extension to it. Put it in the extension as
the implimentation details of how the pinning is done needn't be visible
outside the scheduler.

Submitted by: tegge (of course!) (with changes)
MFC after: 3 days


# 3389af30 10-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Add some code to allow threads to nominat a sibling to run if theyu are going to sleep.

MFC after: 1 week


# 6a574b2a 06-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Don't do IPIs on behalf of interrupt threads.
just punt straight on through to teh preemption code.

Make a KASSSERT out of a condition that can no longer occur.
MFC after: 1 week


# 0fe38d47 05-Sep-2004 Julian Elischer <julian@FreeBSD.org>

slight code cleanup

MFC after: 1 week


# bce73aed 04-Sep-2004 Julian Elischer <julian@FreeBSD.org>

turn on IPIs for 4bsd scheduler by default.

MFC after: 1 week


# ed062c8d 04-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies teh scheduler that no more than N threads from that ksegrp
should be allowed to be on concurrently scheduled. This is also
used to enforce 'fainess' at this time so that a ksegrp with
10000 threads can not swamp a the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventualy develop
their own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
linkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 00b0483d 03-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Don't declare a function we are not defining.


# 37c28a02 03-Sep-2004 Julian Elischer <julian@FreeBSD.org>

fix compile for UP


# 293968d8 03-Sep-2004 Julian Elischer <julian@FreeBSD.org>

ooops finish last commit.
moved the variables but not the declarations.


# 82a1dfc1 03-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Move 4bsd specific experimental IP code into the 4bsd file.
Move the sysctls into kern.sched


# 6804a3ab 01-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Give the 4bsd scheduler the ability to wake up idle processors
when there is new work to be done.

MFC after: 5 days


# 2630e4c9 31-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after: 2 days


# ad59c36b 21-Aug-2004 Julian Elischer <julian@FreeBSD.org>

diff reduction for upcoming patch. Use a macro that masks
some of the odd goings on with sub-structures, because they will
go away anyhow.


# 0f54f482 11-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Properly keep track of how many kses are on the system run queue(s).


# 732d9528 09-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Increase the amount of data exported by KTR in the KTR_RUNQ setting.
This extra data is needed to really follow what is going on in the
threaded case.


# e038d354 23-Jul-2004 Scott Long <scottl@FreeBSD.org>

Clean up whitespace, increase consistency and correctness.

Submitted by: bde


# 55d44f79 18-Jul-2004 Julian Elischer <julian@FreeBSD.org>

When calling scheduler entrypoints for creating new threads and processes,
specify "us" as the thread not the process/ksegrp/kse.
You can always find the others from the thread but the converse is not true.
Theorotically this would lead to runtime being allocated to the wrong
entity in some cases though it is not clear how often this actually happenned.
(would only affect threaded processes and would probably be pretty benign,
but it WAS a bug..)

Reviewed by: peter


# 52eb8464 16-Jul-2004 John Baldwin <jhb@FreeBSD.org>

- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).


# 6942d433 13-Jul-2004 John Baldwin <jhb@FreeBSD.org>

Set TDF_NEEDRESCHED when a higher priority thread is scheduled in
sched_add() rather than just doing it in sched_wakeup(). The old
ithread preemption code used to set NEEDRESCHED unconditionally if it
didn't preempt which masked this bug in SCHED_4BSD.

Noticed by: jake
Reported by: kensmith, marcel


# 0c0b25ae 02-Jul-2004 John Baldwin <jhb@FreeBSD.org>

Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switch to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by: scottl (with his re@ hat)


# bf0acc27 02-Jul-2004 John Baldwin <jhb@FreeBSD.org>

- Change mi_switch() and sched_switch() to accept an optional thread to
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to choose to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.


# 36c6fd1c 21-Jun-2004 Scott Long <scottl@FreeBSD.org>

Fix another typo in the previous commit.


# c38dd4b6 21-Jun-2004 Scott Long <scottl@FreeBSD.org>

Fix typo that somehow crept into the previous commit


# dc095794 21-Jun-2004 Scott Long <scottl@FreeBSD.org>

Add the sysctl node 'kern.sched.name' that has the name of the scheduler
currently in use. Move the 4bsd kern.quantum node to kern.sched.quantum
for consistency.


# fa885116 15-Jun-2004 Julian Elischer <julian@FreeBSD.org>

Nice, is a property of a process as a whole..
I mistakenly moved it to the ksegroup when breaking up the process
structure. Put it back in the proc structure.


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 7d5ea13f 05-Apr-2004 Doug Rabson <dfr@FreeBSD.org>

Try not to crash instantly when signalling a libthr program to death.


# 8cbec0c8 05-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

The roundrobin callout from sched_4bsd is MPSAFE, so set up the
callout as MPSAFE to avoid grabbing Giant.

Reviewed by: jhb


# 44f3b092 27-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.


# f2f51f8a 31-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Disable ithread binding in all cases for now. This doesn't make as much
sense with sched_4bsd as it does with sched_ule.
- Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or
not a thread should be accounted for in sched_tdcnt.


# ca59f152 31-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Keep a variable 'sched_tdcnt' that is used for the local implementation
of sched_load(). This variable tracks the number of running and runnable
non ithd threads. This removes the need to traverse the proc table and
discover how many threads are runnable.


# 5a2b158d 25-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Correct function names listed in KASSERTs. These were copied from other
code and it was sloppy of me not to adjust these sooner.


# e17c57b1 25-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Implement cpu pinning and binding. This is acomplished by keeping a per-
cpu run queue that is only used for pinned or bound threads.

Submitted by: Chris Bradfield <chrisb@ation.org>


# c55bbb6c 26-Dec-2003 John Baldwin <jhb@FreeBSD.org>

Create a separate kthread that executes sched_cpu() once a second. Because
sched_cpu() locks an sx lock (allproc_lock) which can sleep if it fails to
acquire the lock, it is not safe to execute this in a callout handler from
softclock().


# b698380f 09-Nov-2003 Bruce Evans <bde@FreeBSD.org>

Quick fix for scaling of statclock ticks in the SMP case. As explained
in the log message for kern_sched.c 1.83 (which should have been
repo-copied to preserve history for this file), the (4BSD) scheduler
algorithm only works right if stathz is nearly 128 Hz. The old
commit lock said 64 Hz; the scheduler actually wants nearly 16 Hz
but there was a scale factor of 4 to give the requirement of 64 Hz,
and rev.1.83 changed the scale factor so that the requirement became
128 Hz. The change of the scale factor was incomplete in the SMP
case. Then scheduling ticks are provided by smp_ncpu CPUs, and the
scheduler cannot tell the difference between this and 1 CPU providing
scheduling ticks smp_ncpu times faster, so we need another scale
factor of smp_ncp or an algorithm change.

This quick fix uses the scale factor without even trying to optimize
the runtime divisions required for this as is done for the other
scale factor.

The main algorithmic problem is the clamp on the scheduling tick counts.
This was 295; it is now approximately 295 * smp_ncpu. When the limit
is reached, threads get free timeslices and scheduling becomes very
unfair to the threads that don't hit the limit. The limit can be
reached and maintained in the worst case if the load average is larger
than (limit / effective_stathz - 1) / 2 = 0.65 now (was just 0.08 with
2 CPUs before this change), so there are algorithmic problems even for
a load average of 1. Fortunately, the worst case isn't common enough
for the problem to be very noticeable (it is mainly for niced CPU hogs
competing with less nice CPU hogs).


# 685a6c44 07-Nov-2003 David Xu <davidxu@FreeBSD.org>

Return a reasonable number for top or ps to display for M:N thread,
since there is no direct association between M:N thread and kse,
sometimes, a thread does not have a kse, in that case, return a pctcpu
from its last kse, it is not perfect, but gives a good number to be
displayed.


# 89674a9f 29-Oct-2003 Bruce Evans <bde@FreeBSD.org>

Removed sched_nest variable in sched_switch(). Context switches always
begin with sched_lock held but not recursed, so this variable was
always 0.

Removed fixup of sched_lock.mtx_recurse after context switches in
sched_switch(). Context switches always end with this variable in the
same state that it began in, so there is no need to fix it up. Only
sched_lock.mtx_lock really needs a fixup.

Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion
that sched_lock is owned and not recursed after it is fixed up. This
assertion much match the one in mi_switch(), and if sched_lock were
recursed then a non-null fixup of sched_lock.mtx_recurse would probably
be needed again, unlike in sched_switch(), since fork_exit() doesn't
return to its caller in the normal way.


# 55f2099a 16-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- The kse may be null in sched_pctcpu().

Reported by: kris


# ae53b483 16-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Collapse sched_switchin() and sched_switchout() into sched_switch(). Now
mi_switch() calls sched_switch() which calls cpu_switch(). This is
actually one less function call than it had been.


# 7cf90fb3 16-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Update the sched api. sched_{add,rem,clock,pctcpu} now all accept a td
argument rather than a kse.


# c06eb4e2 19-Aug-2003 Sam Leffler <sam@FreeBSD.org>

Change instances of callout_init that specify MPSAFE behaviour to
use CALLOUT_MPSAFE instead of "1" for the second parameter. This
does not change the behaviour; it just makes the intent more clear.


# 70fca427 15-Aug-2003 John Baldwin <jhb@FreeBSD.org>

- Various style fixes in both code and comments.
- Update some stale comments.
- Sort a couple of includes.
- Only set 'newcpu' in updatepri() if we use it.
- No functional changes.

Obtained from: bde (via an old diff I got a long time ago)


# 0e2a4d3a 14-Jun-2003 David Xu <davidxu@FreeBSD.org>

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 51da11a2 29-Apr-2003 Mark Murray <markm@FreeBSD.org>

Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.


# 2056d0a1 23-Apr-2003 John Baldwin <jhb@FreeBSD.org>

Add lock assertions for various proc/thread/kse/ksegroup fields to the
scheduler functions.


# 0b5318c8 22-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- Assert that the proc lock and sched_lock are held in sched_nice().
- For the 4BSD scheduler, this means that all callers of the static
function resetpriority() now always hold sched_lock, so don't lock
sched_lock explicitly in that function.


# f7f9e7f3 10-Apr-2003 Jeff Roberson <jeff@FreeBSD.org>

- Catch up with sched api changes.


# 060563ec 10-Apr-2003 Julian Elischer <julian@FreeBSD.org>

Move the _oncpu entry from the KSE to the thread.
The entry in the KSE still exists but it's purpose will change a bit
when we add the ability to lock a KSE to a cpu.


# 4974b53e 24-Mar-2003 Maxime Henrion <mux@FreeBSD.org>

Remove a trailing semicolon in SCHED_QUANTUM definition.
Luckily this didn't cause any bugs.

Spotted by: Samy Al Bahra <samy@kerneled.com>


# ac2e4153 26-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 4f6cfa45 19-Feb-2003 David Xu <davidxu@FreeBSD.org>

Update comments to reflect new KSE code.


# 4a338afd 17-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case..
I should have listenned to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# 8fb913fa 12-Jan-2003 Jeff Roberson <jeff@FreeBSD.org>

- Unbreak world. I did not notice that libkvm was still used in some places
to access the pctcpu. This will have to be sorted out more later as the
new scheduler requires a procedural interface for this data. A more
complete solution will follow.


# bcb06d59 12-Jan-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move ke_pctcpu and ke_cpticks into the scheduler specific datastructure.
This will prevent access through mechanisms other than the published
interfaces.


# 93a7aa79 27-Dec-2002 Julian Elischer <julian@FreeBSD.org>

Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads sahre KSEs. Every KSE now has a thread, which is
considered its "owner" however a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
n this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creations of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more but
converts the existing thread into teh KSE's owner, and sets it up
for doing an upcall. It is just inhibitted from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no-longer needed.
"Idle" KSEs are now on the loanable queue.


# 79acfc49 21-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add the new sched_pctcpu() function to the sched_* api.
- Provide a routine in sched_4bsd to add this functionality.
- Use sched_pctcpu() in kern_proc, which is the one place outside of
sched_4bsd where the old pctcpu value was accessed directly.

Approved by: re


# 06439a04 21-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move scheduler specific macros and defines out of proc.h

Approved by: re


# 148302c9 21-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move FSCALE back to kern_sync. This is not scheduler specific.
- Create a new callout for lbolt and move it out of schedcpu(). This is not
scheduler specific either.

Approved by: re


# de028f5a 20-Nov-2002 Jeff Roberson <jeff@FreeBSD.org>

- Implement a mechanism for allowing schedulers to place scheduler dependant
data in the scheduler independant structures (proc, ksegrp, kse, thread).
- Implement unused stubs for this mechanism in sched_4bsd.

Approved by: re
Reviewed by: luigi, trb
Tested on: x86, alpha


# 1f955e2d 14-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Tidy up the scheduler's code for changing the priority of a thread.
Logically pretty much a NOP.


# b43179fb 11-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch