History log of /freebsd-current/sys/kern/kern_switch.c
# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix
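
In practice this is a one-line identifier swap in each license header;
a sketch of before and after (not an actual hunk from the commit):

        /* before */
        /* SPDX-License-Identifier: BSD-2-Clause-FreeBSD */
        /* after */
        /* SPDX-License-Identifier: BSD-2-Clause */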


# 1029dab6 09-Feb-2023 Mitchell Horne <mhorne@FreeBSD.org>

mi_switch(): clean up switch types and their usage

Overall, this is a non-functional change, except for kernels built with
SCHED_STATS. However, the switch types are useful for communicating the
intent of the caller.

1. Ensure that every caller provides a type. In most cases, we upgrade
the basic yield to sched_relinquish() aka SWT_RELINQUISH.
2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND.
3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO.
4. Remove SWT_NONE altogether and assert that callers always provide
a type flag.
5. Reference the mi_switch(9) man page in the comments, as these flags
will be documented there.

Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38184
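
A minimal sketch of the convention this enforces, assuming the present-day
mi_switch() signature that takes only a flags word (illustrative, not code
from the commit):

        /* kernel context: every switch now names its reason via a SWT_* type */
        thread_lock(td);
        mi_switch(SW_VOL | SWT_RELINQUISH);     /* voluntary yield, typed */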


# 6622e299 26-Sep-2022 Alan Somers <asomers@FreeBSD.org>

Fix the build with SCHED_STATS after d3f96f661050

MFC with: d3f96f661050e9bd21fe29931992a8b9e67ff189
Sponsored by: Axcient


# c6c52d8e 26-Dec-2021 Alexander Motin <mav@FreeBSD.org>

kern: Remove CTLFLAG_NEEDGIANT from some more sysctls.

MFC after: 2 weeks


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
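
A sketch of the annotation pattern, with hypothetical node and handler names
(the real commit annotates existing nodes throughout the tree):

        /* known MP-safe: marked so explicitly */
        SYSCTL_NODE(_kern, OID_AUTO, example, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
            "hypothetical MP-safe node");

        /* not yet reviewed: conservatively marked as needing Giant */
        SYSCTL_PROC(_kern, OID_AUTO, legacy,
            CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT, NULL, 0,
            sysctl_legacy_handler, "I", "hypothetical handler still under Giant");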


# 3ff65f71 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Remove duplicated empty lines from kern/*.c

No functional changes.


# 879e0604 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Add KERNEL_PANICKED macro for use in place of direct panicstr tests
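
As introduced, the macro is a thin wrapper in sys/systm.h (quoted from
memory; verify against the tree):

        #define KERNEL_PANICKED()       __predict_false(panicstr != NULL)

        /* so open-coded tests like */
        if (panicstr != NULL)
                return;
        /* become */
        if (KERNEL_PANICKED())
                return;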


# 686bcb5c 15-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

schedlock 4/4

Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on a
blocked lock in switch.

This change has not been made to 4BSD but in principle it would be
more straightforward.

Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778
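
The resulting caller contract, sketched (illustrative, not a hunk from the
commit):

        thread_lock(td);                /* enter mi_switch() with the thread locked */
        mi_switch(SW_VOL | SWT_SLEEPQ); /* the lock is dropped inside the switch,   */
                                        /* within a spinlock section that keeps     */
                                        /* interrupts and preemption disabled       */
        /* returns with the thread unlocked */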


# 6443773d 02-Jul-2018 Matt Macy <mmacy@FreeBSD.org>

make critical_{enter, exit} inline

Avoid pulling in all of the <sys/proc.h> dependencies by
automatically generating a stripped down thread_lite exporting
only the fields of interest. The field declarations are type checked
against the original, and the offsets of the generated result are
automatically checked.

kib has expressed disagreement and would have preferred to simply
use genassym style offsets (which loses type check enforcement).
jhb has expressed dislike of it due to header pollution and a
duplicate structure. He would have preferred to just have defined
thread in _thread.h. Nonetheless, he admits that this is the only
viable solution at the moment.

The impetus for this came from mjg's D15331:
"Inline critical_enter/exit for amd64"

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D16078


# 6fee84e3 23-May-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove incorrect owepreempt assertion added in r334062

Yet another preemption request hitting between the counter being 0
and the check being reached will result in the flag no longer being
set.

Note the situation was already present prior to r334062 and is harmless.

Reported by: pho
Reviewed by: kib


# 748b15fc 22-May-2018 Mateusz Guzik <mjg@FreeBSD.org>

Move preemption handling out of critical_exit.

In preparation for making the enter/exit pair inline.

Reviewed by: kib


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use the BSD 2-Clause license; however, the tool
I was using misidentified many licenses, so this was mostly a manual,
error-prone task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.


# 32aef9ff 16-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

sched: move panic handling code out of choosethread

This avoids jumps in the common case of the kernel not being panicked.


# 3467f88c 22-Jan-2017 Konstantin Belousov <kib@FreeBSD.org>

Add comments explaining unobvious td_critnest adjustments in
critical_exit().

Based on the discussion with: jhb
Reviewed by: imp
Sponsored by: The FreeBSD Foundation
Differential revision: D9276
MFC after: 1 week


# a115fb62 22-Jan-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Revert r277213:

FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.

Bump the __FreeBSD_version instead of reverting it.

Suggested by: kmacy, adrian, glebius and kib
Differential Revision: https://reviews.freebsd.org/D1438


# 1a26c3c0 15-Jan-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
spinlocks.
- Switching the callout CPU number cannot always be done on a
per-callout basis. See the updated timeout(9) manual page for more
information.
- The timeout(9) manual page has been updated to reflect how all the
functions inside the callout API work. The manual page has
been made function-oriented to make it easier to deduce how each of
the functions making up the callout API works without having
to first read the whole manual page. Group all functions into a
handful of sections which should give a quick top-level overview of
when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality have been removed
to reduce the complexity in the callout code and to avoid problems
with atomically stopping callouts via callout_stop(). If someone
needs it, it can be re-added. From my quick grep there are no
CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
added. See the updated timeout(9) manual page for a complete
description.
- Update the callout clients in the "kern/" folder to use the callout
API properly, like cv_timedwait(). Previously there was some custom
sleepqueue code in the callout subsystem, which has been removed,
because we now allow callouts to be protected by spinlocks. This
allows us to tear down the callout like done with regular mutexes,
and a "td_slpmutex" has been added to "struct thread" to atomically
teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
"SWT_SLEEPQTIMO" states can now be completely removed. Currently
they are marked as available and will be cleaned up in a follow up
commit.
- Bump the __FreeBSD_version to indicate kernel modules need
recompilation.
- There have been several reports that this patch "seems to squash a
serious bug leading to a callout timeout and panic".

Kernel build testing: all architectures were built
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D1438
Sponsored by: Mellanox Technologies
Reviewed by: jhb, adrian, sbruno and emaste


# e68ccbe8 08-Dec-2012 Attilio Rao <attilio@FreeBSD.org>

Add a comment on why inlining critical_enter() may not be a good idea
for the general case.

Reviewed by: bde
MFC after: 1 week


# 5e27a603 04-Dec-2011 Andriy Gapon <avg@FreeBSD.org>

critical_exit: ignore td_owepreempt if kdb_active is set

calling mi_switch in such a context results in a recursion via
kdb_switch

Suggested by: jhb
Reviewed by: jhb
MFC after: 5 weeks


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 3aa6d94e 11-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Update several places that iterate over CPUs to use CPU_FOREACH().
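
CPU_FOREACH() (sys/smp.h) packages the open-coded loop it replaces; roughly:

        #define CPU_FOREACH(i)                                  \
                for ((i) = 0; (i) <= mp_maxid; (i)++)           \
                        if (!CPU_ABSENT((i)))

        /* so the old pattern */
        for (i = 0; i <= mp_maxid; i++) {
                if (CPU_ABSENT(i))
                        continue;
                /* ... per-CPU work ... */
        }
        /* becomes */
        CPU_FOREACH(i) {
                /* ... per-CPU work ... */
        }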


# 791d9a6e 24-Jun-2009 Jeff Roberson <jeff@FreeBSD.org>

- Use DPCPU for SCHED_STATS. This is somewhat awkward because the
offset of the stat is not known until link time so we must emit a
function to call SYSCTL_ADD_PROC rather than using SYSCTL_PROC
directly.
- Eliminate the atomic from SCHED_STAT_INC now that it's using per-cpu
variables. Sched stats are always incremented while we're holding
a spinlock so no further protection is required.

Reviewed by: sam
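
A sketch of the shape this takes with the DPCPU KPI from sys/pcpu.h; the
counter and macro names here are illustrative, not the real SCHED_STATS
symbols:

        DPCPU_DEFINE(unsigned long, example_stat);      /* one copy per CPU */

        /* no atomic needed: the counter is per-CPU and callers hold a spinlock */
        #define EXAMPLE_STAT_INC()      (DPCPU_GET(example_stat)++)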


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 681e4062 12-May-2008 Julian Elischer <julian@FreeBSD.org>

fix typo in runq_fuzz
Noticed by: Elijah Buck


# 8df78c41 16-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset, walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia


# 9727e637 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Restore runq to manipulating threads directly by putting runq links and
rqindex back in struct thread.
- Compile kern_switch.c independently again and stop #include'ing it from
schedulers.
- Remove the ts_thread backpointers and convert most code to go from
struct thread to struct td_sched.
- Cleanup the ts_flags #define garbage that was causing us to sometimes
do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD.
- Export the kern.sched sysctl node in sysctl.h


# 52e95411 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Remove the unused and redundant sched_newproc() function.
- Remove the unused and redundant sched_newthread(), which peeks into
scheduler-private structures.


# a90f3f25 19-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Move maybe_preempt() from kern_switch.c to sched_4bsd.c. This function
is only used by 4BSD.
- Create a new runq_choose_fuzz() function rather than polluting runq_choose()
with 4BSD specific code.
- Move the fuzz sysctl into sched_4bsd.c
- Remove some dead code from kern_switch.c


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink
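
That is, each invocation gains a trailing semicolon (illustrative names):

        /* before */
        SYSINIT(example, SI_SUB_EXEC, SI_ORDER_ANY, example_init, NULL)
        /* after */
        SYSINIT(example, SI_SUB_EXEC, SI_ORDER_ANY, example_init, NULL);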


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 431f8906 13-Nov-2007 Julian Elischer <julian@FreeBSD.org>

generally we are interested in what thread did something as
opposed to what process. Since threads by default have the name of the
process unless overwritten with more useful information, just print the
thread name instead.


# 5bce4ae3 08-Oct-2007 Jeff Roberson <jeff@FreeBSD.org>

- Fix ULE in kernels without PREEMPTION compiled in by always enabling the
critical_exit() owepreempt check. ULE will always use owepreempt to
preempt the idle thread. This change does not affect 4BSD since it will
never set owepreempt without PREEMPTION enabled.
- Remove some unused code from choosethread().

Discussed with: jhb
Approved by: re


# c8790f5d 20-Sep-2007 Attilio Rao <attilio@FreeBSD.org>

Fix some entries in the locks static table of witness.
In particular:
- smp_tlb_mtx is no longer used, so it is axed.
- smp rendezvous lock isn't really a leaf spin-mutex. Its bad placement in
the table, however, has been the source of false positive LOR reports
involving the dt_lock. However, under the older locking the smp rendezvous
lock would have had sched_lock ahead of it, so it wasn't a leaf lock then
either.
- allpmaps is only used in ia32 architecture, so it is inserted in the
appropriate stub.

Additionally:
- kse_zombie_lock is no longer present, so its definition is axed out.
- zombie_lock doesn't need to have an exported symbol, so just let it be
declared as static.

Tested by: kris
Approved by: jeff (mentor)
Approved by: re


# b61ce5b0 16-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 67e20930 20-Aug-2007 Jeff Roberson <jeff@FreeBSD.org>

- Improve runq_findbit_from(), which is used by ULE's circular queue. Mask
off the bits we want to ignore on the first pass rather than doing a
linear scan. This puts us within a few instructions of the cost of
runq_findbit() and removes this function from the top of profiling output
for context switch heavy workloads.

Approved by: re
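
A standalone model of the masking trick (my sketch, not the kernel's runq
code, which operates on the runq status bit array):

        #include <stdint.h>

        /*
         * Find the lowest set bit at or above idx (0 <= idx < 64), wrapping
         * to the start if there is none; -1 if mask is empty.  One ffs per
         * pass replaces the old linear scan.
         */
        static int
        findbit_from(uint64_t mask, int idx)
        {
                uint64_t m;

                m = mask & ~((1ULL << idx) - 1);  /* mask off bits below idx */
                if (m != 0)
                        return (__builtin_ffsll(m) - 1);
                return (mask != 0 ? __builtin_ffsll(mask) - 1 : -1);
        }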


# 413ea6f5 03-Aug-2007 Jeff Roberson <jeff@FreeBSD.org>

- Set SW_PREEMPT when we preempt in critical_exit().

Approved by: re


# 56696bd1 19-Jul-2007 Jeff Roberson <jeff@FreeBSD.org>

- Remove explicit references to sched_lock. A simpler assert will do.

Approved by: re


# 671f2709 12-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

- Garbage collect unused concurrency functions.


# 7b20fb19 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 1/14 of sched_lock decomposition.
- Move all scheduler locking into the schedulers utilizing a technique
similar to solaris's container locking.
- A per-process spinlock is now used to protect the queue of threads,
thread count, suspension count, p_sflags, and other process
related scheduling fields.
- The new thread lock is actually a pointer to a spinlock for the
container that the thread is currently owned by. The container may
be a turnstile, sleepqueue, or run queue.
- thread_lock() is now used to protect access to thread related scheduling
fields. thread_unlock() unlocks the lock and thread_set_lock()
implements the transition from one lock to another.
- A new "blocked_lock" is used in cases where it is not safe to hold the
actual thread's lock yet we must prevent access to the thread.
- sched_throw() and sched_fork_exit() are introduced to allow the
schedulers to fix-up locking at these points.
- Add some minor infrastructure for optionally exporting scheduler
statistics that were invaluable in solving performance problems with
this patch. Generally these statistics allow you to differentiate
between different causes of context switches.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# ed0e8f2f 07-Feb-2007 Jeff Roberson <jeff@FreeBSD.org>

- Change types for recent runq additions to u_char rather than int.
- Fix these types in ULE as well. This fixes bugs in priority index
calculations in certain edge cases. (int)-1 % 64 != (uint)-1 % 64.

Reported by: kkenn using pho's stress2.
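
The edge case is plain C arithmetic; a standalone demo (mine, not from the
commit):

        #include <stdio.h>

        int
        main(void)
        {
                /* C truncates division toward zero, so the remainder of a
                   negative value is negative: a bogus queue index. */
                printf("%d\n", (int)-1 % 64);           /* -1 */
                printf("%u\n", (unsigned int)-1 % 64);  /* 63 */
                printf("%d\n", (unsigned char)-1 % 64); /* 63 (255 % 64) */
                return (0);
        }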


# f0393f06 23-Jan-2007 Jeff Roberson <jeff@FreeBSD.org>

- Remove setrunqueue and replace it with direct calls to sched_add().
setrunqueue() was mostly empty. The few asserts and thread state
setting were moved to the individual schedulers. sched_add() was
chosen to displace it for naming consistency reasons.
- Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be
different on all three schedulers where it was only called in one place
each.
- Remove the long ifdef'd out remrunqueue code.
- Remove the now redundant ts_state. Inspect the thread state directly.
- Don't set TSF_* flags from kern_switch.c, we were only doing this to
support a feature in one scheduler.
- Change sched_choose() to return a thread rather than a td_sched. Also,
rely on the schedulers to return the idlethread. This simplifies the
logic in choosethread(). Aside from the run queue links kern_switch.c
mostly does not care about the contents of td_sched.

Discussed with: julian

- Move the idle thread loop into the per scheduler area. ULE wants to
do something different from the other schedulers.

Suggested by: jhb

Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.


# cd49bb70 03-Jan-2007 Jeff Roberson <jeff@FreeBSD.org>

- Don't pass a pointer into runq_choose_from(). The caller can adjust the
index if it chooses to.


# 3fed7d23 04-Jan-2007 Jeff Roberson <jeff@FreeBSD.org>

- Add three new functions to support circular run queues.
- runq_add_pri allows the caller to position the thread at any rqindex
regardless of priority.
- runq_choose_from() chooses the lowest priority thread starting from a given
index. The index is updated with the rqindex of the chosen thread. This
routine is used to pick the lowest priority relative to a given index.
- runq_remove_idx() updates the index if the run queue that held the removed
thread is now empty.


# 2da78e38 31-Dec-2006 Robert Watson <rwatson@FreeBSD.org>

Prefer a more traditional spelling of inhibited in comments and panic
messages.


# ad1e7d28 05-Dec-2006 Julian Elischer <julian@FreeBSD.org>

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still requires a little cleaning, and some functions that
now do ALMOST nothing will go away, but I thought I'd do that as a separate
commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# 8460a577 26-Oct-2006 John Birrell <jb@FreeBSD.org>

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# b41f1452 13-Jun-2006 David Xu <davidxu@FreeBSD.org>

Add scheduler CORE, work I did half a year ago; recently
I picked it up again. The scheduler is forked from ULE, but the
algorithm to detect an interactive process is almost completely
different from ULE's; it comes from the Linux paper "Understanding the
Linux 2.6.8.1 CPU Scheduler", although I still use the same word
"score" as a priority boost as in the ULE scheduler.

Briefly, the scheduler has the following characteristics:
1. A timesharing process's nice value is seriously respected;
the timeslice and interaction-detection algorithms are based
on the nice value.
2. per-cpu scheduling queue and load balancing.
3. O(1) scheduling.
4. Some cpu affinity code in wakeup path.
5. Support POSIX SCHED_FIFO and SCHED_RR.
Unlike the 4BSD and ULE schedulers, which use the fuzzy RQ_PPQ, this
scheduler uses 256 priority queues. Unlike ULE, which uses both pull and
push, this scheduler uses the pull method only; the main reason is to let
relatively idle CPUs do the work. Currently, however, the whole scheduler is
protected by the big sched_lock, so the benefit is not visible; it can
really be worse than nothing, because all other CPUs are locked out while
we are doing balancing work, a problem the 4BSD scheduler does not have.
The scheduler does not support hyperthreading very well; in fact,
the scheduler does not distinguish between physical CPUs and
logical CPUs. This should be improved in the future. The scheduler has
a priority inversion problem on MP machines; it is not good for
realtime scheduling, and it can cause realtime processes to starve.
That said, it seems MySQL super-smack runs better on my
Pentium-D machine when using libthr, whether on a UP or SMP kernel.


# 4bb0f51d 01-Jun-2006 Olivier Houchard <cognet@FreeBSD.org>

sched_rem() already sets ke->ke_state to KES_THREAD, so there's no need
to redo it.


# 3f349776 28-Dec-2005 Alexander Kabaev <kan@FreeBSD.org>

Trim trailing whitespace.


# 1335c4df 18-Dec-2005 Nate Lawson <njl@FreeBSD.org>

Restore KTR_CRITICAL but conditionally compile it in as KTR_SCHED.

Requested by: scottl, jhb


# 8615fd86 16-Dec-2005 Nate Lawson <njl@FreeBSD.org>

Clean up unused or poorly utilized KTR values. Remove KTR_FS, KTR_KGDB,
and KTR_IO as they were never used. Remove KTR_CLK since it was only
used for hardclock firing and use KTR_INTR there instead. Remove
KTR_CRITICAL since it was only used for crit enter/exit and use
KTR_CONTENTION instead.


# 3c424d14 02-Aug-2005 David Xu <davidxu@FreeBSD.org>

In adjustrunqueue(), add code to handle the thread-migrating case for
the ULE scheduler. In the original code, the local run queue of a threaded
ksegrp is corrupted if adjustrunqueue() is called while a thread is migrating.


# 3ea6bbc5 09-Jun-2005 Stephan Uphoff <ups@FreeBSD.org>

Restore preemption of idle threads.

Submitted by: jhb


# a3f2d842 09-Jun-2005 Stephan Uphoff <ups@FreeBSD.org>

Lots of whitespace cleanup.
Fix for broken if condition.

Submitted by: nate@


# f3a0f873 09-Jun-2005 Stephan Uphoff <ups@FreeBSD.org>

Fix some race conditions for pinned threads that may cause them to run
on the wrong CPU.

Add IPI support for preempting a thread on another CPU.

MFC after: 3 weeks


# d13ec713 23-May-2005 Stephan Uphoff <ups@FreeBSD.org>

Use low level constructs borrowed from interrupt threads to wait for
work in proc0.
Remove the TDP_WAKEPROC0 workaround.


# 503c2ea3 18-May-2005 Stephan Uphoff <ups@FreeBSD.org>

Fix a bug that caused preemption to happen for a thread in the same
ksegrp with the same priority as the currently running thread.
This can cause propagate_priority() to panic.

Pointy hat to: ups


# 77918643 07-Apr-2005 Stephan Uphoff <ups@FreeBSD.org>

Sprinkle some volatile magic and rearrange things a bit to avoid race
conditions in critical_exit now that it no longer blocks interrupts.

Reviewed by: jhb


# c6a37e84 04-Apr-2005 John Baldwin <jhb@FreeBSD.org>

Divorce critical sections from spinlocks. Critical sections as denoted by
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any affect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.

Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.

This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).

Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
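
Usage of the divorced KPIs, side by side (a usage sketch; the spinlock_*
implementations are MD):

        critical_enter();   /* now just an integer increment: defers preemption */
        /* ... work that must not be preempted, e.g. per-CPU data ... */
        critical_exit();

        spinlock_enter();   /* MD: also blocks interrupts, as spin locks require */
        /* ... code that must exclude interrupt handlers ... */
        spinlock_exit();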


# 6220dcba 20-Mar-2005 Robert Watson <rwatson@FreeBSD.org>

Add a read-only kern.sched.preemption sysctl so that user space can tell
if "options PREEMPTION" is compiled into the kernel.


# bc608306 17-Mar-2005 Robert Watson <rwatson@FreeBSD.org>

A further step on the journey of making panics and debugging more reliable:
in the window between the beginning of panic() and entering the debugger,
it's possible to receive interrupts. If we receive an interrupt, don't
preempt if panicstr != NULL, as the system is in the process of failing, and
the preempting thread is likely to stumble over the failure. The typical
scenario is during the printf() in panic() prior to entering the debugger,
but when running with a slower console type such as serial console.

It could be that the panic string should be passed to the debugger to print,
so that it can run from the debugger's environment rather than a regular
kernel printf.

Glanced at by: jhb


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# 85da7a56 25-Dec-2004 Jeff Roberson <jeff@FreeBSD.org>

- Define KTR points for KTR_SCHED.


# 7842f65e 14-Dec-2004 Jeff Roberson <jeff@FreeBSD.org>

- Garbage collect several unused members of struct kse and struct ksegrp.
As best as I can tell, some of these were never used.


# 6db36923 20-Nov-2004 David Schultz <das@FreeBSD.org>

Remove local definitions of RANGEOF() and use __rangeof() instead.
Also remove a few bogus casts.


# f42a43fa 07-Nov-2004 Robert Watson <rwatson@FreeBSD.org>

Add basic critical section tracing to KTR using event type KTR_CRITICAL.
This generates a KTR event for each critical section entered and exited.

It would be desirable to also log the filename and line number of the
source entering or exiting the critical section, but this requires
hacking up the critical section API, so I've not done that yet.


# b96741f4 16-Oct-2004 Scott Long <scottl@FreeBSD.org>

If a process needs to be swapped in, wakeup the swapper from within
critical_exit as the process is getting scheduled to run. This is suboptimal
but for now avoids the LOR between the scheduler and the sleepq systems.
This is a 5.3 candidate.

Submitted by: davidxu
MFC After: 3 days


# 7c71b645 13-Oct-2004 Stephan Uphoff <ups@FreeBSD.org>

Fix maybe_preempt_in_ksegrp for !SMP.

Tested by: tegge
Reviewed by: julian
Approved by: sam (mentor)
MFC after: 3 days


# 13e7430f 12-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Make !SMP kernels compile, and as far as I can tell, work again.


# 84f9d4b1 12-Oct-2004 Stephan Uphoff <ups@FreeBSD.org>

Prevent preemption in slot_fill.
Implement preemption between threads in the same ksegrp in out-of-slot
situations to prevent priority inversion.

Tested by: pho
Reviewed by: jhb, julian
Approved by: sam (mentor)
MFC: ASAP


# 042b7b1a 09-Oct-2004 Julian Elischer <julian@FreeBSD.org>

Don't release the slot twice.. sched_rem() has already done it.

Submitted by: stephan uphoff (ups at tree dot com)
MFC after: 3 days


# c20c691b 05-Oct-2004 Julian Elischer <julian@FreeBSD.org>

When preempting a thread, put it back on the HEAD of its run queue.
(Only really implemented in 4bsd)

MFC after: 4 days


# d39063f2 05-Oct-2004 Julian Elischer <julian@FreeBSD.org>

Use some macros to track available scheduler slots to allow
easier debugging.

MFC after: 4 days


# 8daa8c60 19-Sep-2004 David Schultz <das@FreeBSD.org>

The zone from which proc structures are allocated is marked
UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should
never be called. Move an assertion from proc_fini() to proc_dtor()
and garbage-collect the rest of the unreachable code. I have retained
vm_proc_dispose(), since I consider its disuse a bug.


# 14f0e2e9 16-Sep-2004 Julian Elischer <julian@FreeBSD.org>

clean up thread runq accounting a bit.

MFC after: 3 days


# 9da3e923 15-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Use specific code to revert a partial add to the run queue, not
remrunqueue(), which can't handle a partially added thread.

MFC after: 1 week


# e8807f22 14-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Oops, accidentally removed #ifdef SCHED_4BSD
as part of another commit.
This function is not yet used in ULE.


# 1f9f5df6 13-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Commit a fix for some panics we've been seeing with preemption.

MFC after: 2 days


# b2578c6c 13-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Add some kasserts


# 3389af30 10-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Add some code to allow threads to nominate a sibling to run if they are going to sleep.

MFC after: 1 week


# 54983505 07-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Make debug printf less threatening and make it only print out once.

MFC after: 2 days


# 6a574b2a 06-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Don't do IPIs on behalf of interrupt threads.
just punt straight on through to the preemption code.

Make a KASSERT out of a condition that can no longer occur.
MFC after: 1 week


# ed062c8d 04-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Refactor a bunch of scheduler code to give basically the same behaviour
but with slightly cleaned up interfaces.

The KSE structure has become the same as the "per thread scheduler
private data" structure. In order to not make the diffs too great
one is #defined as the other at this time.

The KSE (or td_sched) structure is now allocated per thread and has no
allocation code of its own.

Concurrency for a KSEGRP is now kept track of via a simple pair of counters
rather than using KSE structures as tokens.

Since the KSE structure is different in each scheduler, kern_switch.c
is now included at the end of each scheduler. Nothing outside the
scheduler knows the contents of the KSE (aka td_sched) structure.

The fields in the ksegrp structure that are to do with the scheduler's
queueing mechanisms are now moved to the kg_sched structure.
(per ksegrp scheduler private data structure). In other words how the
scheduler queues and keeps track of threads is no-one's business except
the scheduler's. This should allow people to write experimental
schedulers with completely different internal structuring.

A scheduler call sched_set_concurrency(kg, N) has been added that
notifies the scheduler that no more than N threads from that ksegrp
should be allowed to be concurrently scheduled. This is also
used to enforce 'fairness' at this time, so that a ksegrp with
10000 threads cannot swamp the run queue and force out a process
with 1 thread, since the current code will not set the concurrency above
NCPU, and both schedulers will not allow more than that many
onto the system run queue at a time. Each scheduler should eventually develop
its own methods to do this now that they are effectively separated.

Rejig libthr's kernel interface to follow the same code paths as
libkse for scope system threads. This has slightly hurt libthr's performance
but I will work to recover as much of it as I can.

Thread exit code has been cleaned up greatly.
exit and exec code now transitions a process back to
'standard non-threaded mode' before taking the next step.
Reviewed by: scottl, peter
MFC after: 1 week


# 44692526 02-Sep-2004 Julian Elischer <julian@FreeBSD.org>

remove unused code

MFC after: 2 days


# 9923b511 02-Sep-2004 Scott Long <scottl@FreeBSD.org>

Turn PREEMPTION into a kernel option. Make sure that it's defined if
FULL_PREEMPTION is defined. Add a runtime warning to ULE if PREEMPTION is
enabled (code inspired by the PREEMPTION warning in kern_switch.c). This
is a possible MT5 candidate.


# 6804a3ab 01-Sep-2004 Julian Elischer <julian@FreeBSD.org>

Give the 4bsd scheduler the ability to wake up idle processors
when there is new work to be done.

MFC after: 5 days


# 2630e4c9 31-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Give setrunqueue() and sched_add() more of a clue as to
where they are coming from and what is expected from them.

MFC after: 2 days


# 6f96710c 27-Aug-2004 Peter Wemm <peter@FreeBSD.org>

Backout the previous backout (with scott's ok). sched_ule.c:1.122 is
believed to fix the problem with ULE that this change triggered.


# 2384290c 19-Aug-2004 Scott Long <scottl@FreeBSD.org>

Revert the previous change. It works great for 4BSD but causes major
problems for ULE. The reason is quite unknown and worrisome.


# 2c86298c 19-Aug-2004 Scott Long <scottl@FreeBSD.org>

In maybe_preempt(), ignore threads that are in an inconsistent state. This
is an effective band-aid for at least some of the scheduler corruption seen
recently. The real fix will involve protecting threads while they are
inconsistent, and will come later.

Submitted by: julian


# 0f4ad918 09-Aug-2004 Scott Long <scottl@FreeBSD.org>

Add a temporary debugging hack to detect a deadlock in setrunqueue(). This
is here so that we can gather stats on the nature of the recent rash of
hard lockups, and in this particular case panic the machine instead of
letting it deadlock forever.


# 1a5cd27b 09-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Make kg->kg_runnable actually count runnable threads in the ksegrp run queue
instead of only doing it sometimes. This is not used outside of debugging code
in the current code, but that will probably change.


# 732d9528 09-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Increase the amount of data exported by KTR in the KTR_RUNQ setting.
This extra data is needed to really follow what is going on in the
threaded case.


# 44fe3c1f 06-Aug-2004 John Baldwin <jhb@FreeBSD.org>

Don't scare users with a warning about preemption being off when it isn't
yet safe to have on by default.


# 1a8cfbc4 27-Jul-2004 Robert Watson <rwatson@FreeBSD.org>

Pass a thread argument into cpu_critical_{enter,exit}() rather than
dereference curthread. It is called only from critical_{enter,exit}(),
which already dereferences curthread. This doesn't seem to affect SMP
performance in my benchmarks, but improves MySQL transaction throughput
by about 1% on UP on my Xeon.

Head nodding: jhb, bmilekic


# 18f480f8 23-Jul-2004 Scott Long <scottl@FreeBSD.org>

Remove the previous hack since it doesn't make a difference and is getting
in the way of debugging.


# 9493183e 22-Jul-2004 Scott Long <scottl@FreeBSD.org>

Disable the PREEMPTION-enabled code in critical_exit() that encourages
switching to a different thread. This is just a hack to try to improve
stability some more, but likely points closer to the real culprit.


# 52eb8464 16-Jul-2004 John Baldwin <jhb@FreeBSD.org>

- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).


# 2d50560a 10-Jul-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Update for the KDB framework:
o Make debugging code conditional upon KDB instead of DDB.
o Call kdb_enter() instead of Debugger().
o Call kdb_backtrace() instead of db_print_backtrace() or backtrace().

kern_mutex.c:
o Replace checks for db_active with checks for kdb_active and make
them unconditional.

kern_shutdown.c:
o s/DDB_UNATTENDED/KDB_UNATTENDED/g
o s/DDB_TRACE/KDB_TRACE/g
o Save the TID of the thread doing the kernel dump so the debugger
knows which thread to select as the current when debugging the
kernel core file.
o Clear kdb_active instead of db_active and do so unconditionally.
o Remove backtrace() implementation.

kern_synch.c:
o Call kdb_reenter() instead of db_error().


# 8b44a2e2 02-Jul-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Unbreak build for the !PREEMPTION case: don't define variables
that aren't used in that case.


# 0c0b25ae 02-Jul-2004 John Baldwin <jhb@FreeBSD.org>

Implement preemption of kernel threads natively in the scheduler rather
than as one-off hacks in various other parts of the kernel:
- Add a function maybe_preempt() that is called from sched_add() to
determine if a thread about to be added to a run queue should be
preempted to directly. If it is not safe to preempt or if the new
thread does not have a high enough priority, then the function returns
false and sched_add() adds the thread to the run queue. If the thread
should be preempted to but the current thread is in a nested critical
section, then the flag TDF_OWEPREEMPT is set and the thread is added
to the run queue. Otherwise, mi_switch() is called immediately and the
thread is never added to the run queue since it is switched to directly.
When exiting an outermost critical section, if TDF_OWEPREEMPT is set,
then clear it and call mi_switch() to perform the deferred preemption.
- Remove explicit preemption from ithread_schedule() as calling
setrunqueue() now does all the correct work. This also removes the
do_switch argument from ithread_schedule().
- Do not use the manual preemption code in mtx_unlock if the architecture
supports native preemption.
- Don't call mi_switch() in a loop during shutdown to give ithreads a
chance to run if the architecture supports native preemption since
the ithreads will just preempt DELAY().
- Don't call mi_switch() from the page zeroing idle thread for
architectures that support native preemption as it is unnecessary.
- Native preemption is enabled on the same archs that supported ithread
preemption, namely alpha, i386, and amd64.

This change should largely be a NOP for the default case as committed
except that we will do fewer context switches in a few cases and will
avoid the run queues completely when preempting.

Approved by: scottl (with his re@ hat)
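
A toy model of the deferred-preemption bookkeeping described above (a
userspace sketch of the idea only; it glosses over the races that later
commits in this log wrestle with):

        static int critnest;            /* stands in for td_critnest */
        static int owepreempt;          /* stands in for TDF_OWEPREEMPT */

        static void
        do_switch(void)
        {
                /* the kernel would call mi_switch() here */
        }

        void
        critical_enter(void)
        {
                critnest++;
        }

        void
        critical_exit(void)
        {
                /* leaving the outermost section: pay any deferred preemption */
                if (critnest == 1 && owepreempt) {
                        owepreempt = 0;
                        do_switch();
                }
                critnest--;
        }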


# b209e5e3 02-Feb-2004 Jeff Roberson <jeff@FreeBSD.org>

- style fixes to the critical_exit() KASSERT().

Submitted by: bde


# fca542bc 31-Jan-2004 Robert Watson <rwatson@FreeBSD.org>

Move KASSERT regarding td_critnest to after the value of td is set to
curthread, to avoid warning and incorrect behavior.

Hoped not to mind: jeff


# 6767c654 31-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Assert that td_critnest > 0 in critical_exit() to catch cases of
unbalanced uses of the critical_* api.


# 09a4a69c 12-Dec-2003 Robert Watson <rwatson@FreeBSD.org>

Although sometimes to the uninitiated, it may seem like goup, KSEGOUP
is actually spelt KSEGROUP. Go figure.

Reported by: samy@kerneled.com


# 0d2a2989 17-Nov-2003 Peter Wemm <peter@FreeBSD.org>

Initial landing of SMP support for FreeBSD/amd64.

- This is heavily derived from John Baldwin's apic/pci cleanup on i386.
- I have completely rewritten or drastically cleaned up some other parts.
(in particular, bootstrap)
- This is still a WIP. It seems that there are some highly bogus bioses
on nVidia nForce3-150 boards. I can't stress how broken these boards
are. I have a workaround in mind, but right now the Asus SK8N is broken.
The Gigabyte K8NPro (nVidia based) is also mind-numbingly hosed.
- Most of my testing has been with SCHED_ULE. SCHED_4BSD works.
- the apic and acpi components are 'standard'.
- If you have an nVidia nForce3-150 board, you are stuck with 'device
atpic' in addition, because they somehow managed to forget to connect the
8254 timer to the apic, even though it's in the same silicon! ARGH!
This directly violates the ACPI spec.


# 94816f6d 17-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove the correct thread from the run queue in setrunqueue(). This
fixes ULE + KSE.


# 7cf90fb3 16-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Update the sched api. sched_{add,rem,clock,pctcpu} now all accept a td
argument rather than a kse.


# 0e2a4d3a 14-Jun-2003 David Xu <davidxu@FreeBSD.org>

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# faaa20f6 21-May-2003 Julian Elischer <julian@FreeBSD.org>

When we are spilling threads out of the run queue during panic, make sure we
keep the thread state variable consistent with its real state.
i.e. Don't say it's on the run queue when it isn't.

Also clarify the associated comment.

Turns a double panic back to a single panic :-/

Approved by: re@ (jhb)


# cc66ebe2 02-Apr-2003 Peter Wemm <peter@FreeBSD.org>

Commit a partial lazy thread switch mechanism for i386. It isn't as lazy
as it could be and can do with some more cleanup. Currently it's under
options LAZY_SWITCH. What this does is avoid %cr3 reloads for short
context switches that do not involve another user process. ie: we can
take an interrupt, switch to a kthread and return to the user without
explicitly flushing the tlb. However, this isn't as exciting as it could
be, the interrupt overhead is still high and too much blocks on Giant
still. There are some debug sysctls, for stats and for an on/off switch.

The main problem with doing this has been "what if the process that you're
running on exits while we're borrowing its address space?" - in this case
we use an IPI to give it a kick when we're about to reclaim the pmap.

Its not compiled in unless you add the LAZY_SWITCH option. I want to fix a
few more things and get some more feedback before turning it on by default.

This is NOT a replacement for Bosko's lazy interrupt stuff. This was more
meant for the kthread case, while his was for interrupts. Mine helps a
little for interrupts, but his helps a lot more.

The stats are enabled with options SWTCH_OPTIM_STATS - this has been a
pseudo-option for years, I just added a bunch of stuff to it.

One non-trivial change was to select a new thread before calling
cpu_switch() in the first place. This allows us to catch the silly
case of doing a cpu_switch() to the current process. This happens
uncomfortably often. This simplifies a bit of the asm code in cpu_switch
(no longer have to call choosethread() in the middle). This has been
implemented on i386 and (thanks to jake) sparc64. The others will come
soon. This is actually separate from the lazy switch stuff.

Glanced at by: jake, jhb


# 6ce75196 18-Mar-2003 David Xu <davidxu@FreeBSD.org>

Adjust code for userland preemption. Userland can set a quantum in
kse_mailbox to schedule an upcall, this is useful for userland timeout
routine, for example pthread_cond_timedwait().

Also extract upcall scheduling code from kse_reassign and create
a new function called thread_switchout to include these code.

Reviewed by: julian


# d03c79ee 08-Mar-2003 David Xu <davidxu@FreeBSD.org>

Cosmetic change, make it QUEUE_MACRO_DEBUG friendly


# ac2e4153 26-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 4f6cfa45 19-Feb-2003 David Xu <davidxu@FreeBSD.org>

Update comments to reflect new KSE code.


# 02bbffaf 17-Feb-2003 David Xu <davidxu@FreeBSD.org>

Move code for detecting PS_NEEDSIGCHK into thread_schedule_upcall,
I think it is a better place to handle it.


# 4a338afd 17-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case..
I should have listened to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# 5215b187 16-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much of the
KSE code.

Submitted by: davidxu


# 6f8132a8 31-Jan-2003 Julian Elischer <julian@FreeBSD.org>

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 0dbb100b 26-Jan-2003 David Xu <davidxu@FreeBSD.org>

Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread that owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and take those contexts back
to userland. Any thread without an upcall structure has to export its
contexts and exit at the user boundary.

Any thread running in user mode owns an upcall structure. When it enters
the kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in the kernel a new UPCALL thread is created and
the upcall structure is transferred to the new UPCALL thread. If the kse
mailbox's current thread pointer is NULL, then no UPCALL thread will be
created when a thread is blocked in the kernel.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit; when all upcalls in a ksegrp are removed, the group is
automatically shut down. An upcall owner thread also exits when the process
is in the exiting state. When an owner thread exits, the upcall it owns is
also removed.

KSE is a pure scheduler entity; it represents a virtual CPU. When a thread
is running, it always has a KSE associated with it. The scheduler is free to
assign a KSE to a thread according to thread priority; if the thread priority
changes, the KSE can be moved from one thread to another.

When a ksegrp is created, N KSEs are always created in the group, where
N is the number of physical CPUs in the current system. This makes it
possible that, even if a userland UTS is only single-CPU safe, threads in
the kernel can still execute on different CPUs in parallel. Userland calls
kse_create to add more upcall structures to a ksegrp to increase concurrency
in userland itself; the kernel is not restricted by the number of upcalls
userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 67f7c1bb 19-Jan-2003 Julian Elischer <julian@FreeBSD.org>

Remove a KASSERT that can now happen and add a missing setrunnable.


# 93a7aa79 27-Dec-2002 Julian Elischer <julian@FreeBSD.org>

Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads share KSEs. Every KSE now has a thread, which is
considered its "owner"; however, a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
In this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creation of upcalls etc. is now done from
kse_reassign(), which in turn is called from mi_switch() or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more but
converts the existing thread into the KSE's owner, and sets it up
for doing an upcall. It is just inhibited from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no longer needed.
"Idle" KSEs are now on the loanable queue.


# 24c5baae 14-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Did you ever notice how stupid bugs show up much clearer
when you see them in a commit message?


# 1f955e2d 14-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Tidy up the scheduler's code for changing the priority of a thread.
Logically pretty much a NOP.


# b43179fb 11-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch


# 48bfcddd 08-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Round out the facility for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower cannot proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
the owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is conceivable that the
borrower should inherit the priority of the owner too;
that's another discussion and would be simple to do.

Also, as part of this, the "preallocated spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot more info for threaded processes, though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 5da2b58a 02-Oct-2002 David Xu <davidxu@FreeBSD.org>

set ke_bound to NULL when kse owner thread becomes runnable.

Reviewed by: julian (mentor)


# 9eb1fdea 29-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Implement basic KSE loaning. This stops a thread that is blocked in BOUND mode
from stopping another thread from completing a syscall, and this allows it to
release its resources etc. Probably more related commits to follow (at least
one I know of)

Initial concept by: julian, dillon
Submitted by: davidxu


# 33c06e1d 22-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Indentation does not define a block.. you need braces {} as well..
also add a mutex assert. (threaded path only)

Submitted by: davidxu


# 4f0db5e0 15-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Allocate KSEs and KSEGRPs separately and remove them from the proc structure.
The next step is to allow > 1 to be allocated per process. This would give
multi-processor threads. (when the rest of the infrastructure is
in place)

While doing this I noticed libkvm and sys/kern/kern_proc.c:fill_kinfo_proc
are diverging more than they should.. corrective action needed soon.


# 71fad9fd 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 472be958 29-Aug-2002 Julian Elischer <julian@FreeBSD.org>

Rejig the code to figure out estcpu and work out how long a KSEGRP has been
idle. What was there before was surprisingly ALMOST correct.

Peter and I fried our brains on this for a couple of hours figuring out
what this actually means in the context of multiple threads.

Reviewed by: peter@freebsd.org


# 9eb881f8 30-Jul-2002 Seigo Tanimura <tanimura@FreeBSD.org>

- Optimize wakeup() and its friends; if a thread woken up is being
swapped in, we do not have to ask the scheduler thread to do
that.

- Assert that a process is not swapped out in runq functions and
swapout().

- Introduce thread_safetoswapout() for readability.

- In swapout_procs(), perform a test that may block (check of a
thread working on its vm map) first. This lets us call swapout()
with the sched_lock held, providing a better atomicity.


# fe799533 16-Jul-2002 Andrew Gallatin <gallatin@FreeBSD.org>

Allow alphas to do crashdumps: Refuse to run anything in choosethread()
after a panic which is not an interrupt thread, or the thread which
caused the panic. Also, remove panicstr checks from msleep() and from
cv_wait() in order to allow threads to go to sleep and yield the cpu
to the panicking thread, or to an interrupt thread which might
be doing the crashdump.

Reviewed by: jhb (and it was mostly his idea too)


# c3b98db0 13-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Thinking about it I came to the conclusion that the KSE states were incorrectly
formulated. The correct states should be:
IDLE: On the idle KSE list for that KSEG
RUNQ: Linked onto the system run queue.
THREAD: Attached to a thread and slaved to whatever state the thread is in.

This means that most places where we were adjusting kse state can go away
as it is just moving around because the thread is..
The only places we need to adjust the KSE state is in transition to and from
the idle and run queues.

Reviewed by: jhb@freebsd.org


# 40e55026 12-Jul-2002 Julian Elischer <julian@FreeBSD.org>

also set the KSE state for the idle KSE/thread case.


# 33d7ad1a 12-Jul-2002 John Baldwin <jhb@FreeBSD.org>

Set the thread state of the newly chosen to run thread to TDS_RUNNING in
choosethread() in MI C code instead of doing it in in assembly in all the
various cpu_switch() functions. This fixes problems on ia64 and sparc64.

Reviewed by: julian, peter, benno
Tested on: i386, alpha, sparc64


# 5e3da64e 11-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Remove debugging code that I originally only wanted to be there for a couple of days after merge.

Reminded with pointy stick by: jhb


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 2f9267ec 20-Jun-2002 Peter Wemm <peter@FreeBSD.org>

Move the "- 1" into the RQB_FFS(mask) macro itself so that
implementations can provide a base zero ffs function if they wish.
This changes

        #define RQB_FFS(mask) (ffs64(mask))
        foo = RQB_FFS(mask) - 1;

to

        #define RQB_FFS(mask) (ffs64(mask) - 1)
        foo = RQB_FFS(mask);

On some platforms we can get the "- 1" for free, eg: those that use the
C code for ffs64().

Reviewed by: jake (in principle)


# d2ac2316 24-May-2002 Jake Burkholder <jake@FreeBSD.org>

Make the run queue parameters machine dependent. Optimize 64 bit
architectures by using a 64 bit word for the bit array which keeps
track of non-empty queues.

Reviewed by: peter


# 0cce52f8 07-May-2002 Jake Burkholder <jake@FreeBSD.org>

Remove runq_findproc. This never worked right in the first place and can
be prohibitively expensive.


# 182da820 01-Apr-2002 Matthew Dillon <dillon@FreeBSD.org>

Stage-2 commit of the critical*() code. This re-inlines cpu_critical_enter()
and cpu_critical_exit() and moves associated critical prototypes into their
own header file, <arch>/<arch>/critical.h, which is only included by the
three MI source files that need it.

Backout and re-apply improperly committed syntactical cleanups made to files
that were still under active development. Backout improperly committed program
structure changes that moved localized declarations to the top of two
procedures. Partially re-apply one of the program structure changes to
move 'mask' into an intermediate block rather than in three separate
sub-blocks to make the code more readable. Re-integrate bug fixes that Jake
made to the sparc64 code.

Note: In general, developers should not gratuitously move declarations out
of sub-blocks. They are where they are for reasons of structure, grouping,
readability, compiler-localizability, and to avoid developer-introduced bugs
similar to several found in recent years in the VFS and VM code.

Reviewed by: jake


# d74ac681 26-Mar-2002 Matthew Dillon <dillon@FreeBSD.org>

Compromise for critical*()/cpu_critical*() recommit. Clean up the interrupt
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Clean up the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).

Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.

This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.

Reviewed by: core
Approved by: core


# e97c3e3d 06-Mar-2002 Dag-Erling Smørgrav <des@FreeBSD.org>

Rename runq_find() to runq_findproc(), and hide it behind #ifdef DIAGNOSTIC,
as it can have a severe impact on performance under high load, and the bug
it was meant to catch was fixed ages ago.


# 181df8c9 26-Feb-2002 Matthew Dillon <dillon@FreeBSD.org>

revert last commit temporarily due to whining on the lists.


# f96ad4c2 26-Feb-2002 Matthew Dillon <dillon@FreeBSD.org>

STAGE-1 of 3 commit - allow (but do not require) interrupts to remain
enabled in critical sections and streamline critical_enter() and
critical_exit().

This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not affected by this change.

This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.

This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, are strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.

The following changes have been made:

* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.

Other architectures are unaffected. (A simplified sketch of this
behavior appears after this list.)

* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
regarding hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather than hardcoded
in MI code.

The new code places the burden of entering the critical section
in the trampoline code where it belongs.

* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with this fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.

* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.

* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.

* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.

* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code with regard to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.

* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Because these are per-cpu
variables, it is not currently necessary to use locked bus cycles
when modifying them.

Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.

* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.

Performance issues:

One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly, making the effective performance savings of the
new critical section code very significant.

The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather than being deferred.

The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API, which currently has serious cross-pollution
in MI code (in fork_exit() and ast() for example).

The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().

Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.


# 2c100766 11-Feb-2002 Julian Elischer <julian@FreeBSD.org>

In a threaded world, different priorities become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


# 7e1f6dfe 17-Dec-2001 John Baldwin <jhb@FreeBSD.org>

Modify the critical section API as follows:
- The MD functions critical_enter/exit are renamed to start with a cpu_
prefix.
- MI wrapper functions critical_enter/exit maintain a per-thread nesting
count and a per-thread critical section saved state set when entering
a critical section while at nesting level 0 and restored when exiting
to nesting level 0. This moves the saved state out of spin mutexes so
that interlocking spin mutexes works properly.
- Most low-level MD code that used critical_enter/exit now use
cpu_critical_enter/exit. MI code such as device drivers and spin
mutexes use the MI wrappers. Note that since the MI wrappers store
the state in the current thread, they do not have any return values or
arguments.
- mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is
assigned to curthread->td_savecrit during fork_exit().

Tested on: i386, alpha
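
The wrapper behavior described above might look roughly like this
sketch (simplified; critical_t and the cpu_critical_*() primitives are
the MD pieces the commit describes):

    void
    critical_enter(void)
    {
            struct thread *td = curthread;

            if (td->td_critnest == 0)
                    td->td_savecrit = cpu_critical_enter();
            td->td_critnest++;
    }

    void
    critical_exit(void)
    {
            struct thread *td = curthread;

            if (td->td_critnest == 1) {
                    td->td_critnest = 0;
                    cpu_critical_exit(td->td_savecrit);
            } else
                    td->td_critnest--;
    }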


# 6a494eeb 17-Sep-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Change p into ke->ke_proc; this was hidden behind INVARIANTS.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
Make the kernel aware that there are smaller units of scheduling than the
process (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doozy!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# f583b1d9 04-Jul-2001 John Baldwin <jhb@FreeBSD.org>

Spelling fix in a KASSERT: runq_chose -> runq_choose.


# f34fa851 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Catch up to header include changes:
- <sys/mutex.h> now requires <sys/systm.h>
- <sys/mutex.h> and <sys/sx.h> now require <sys/lock.h>


# 6fe01250 14-Mar-2001 Peter Wemm <peter@FreeBSD.org>

Jake essentially rewrote this. It is not by any stretch of the
imagination a derivative of what I did before.


# 9cbd0393 11-Mar-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

Assert that the process we're trying to enqueue isn't already there.


# 3a3f6082 08-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Add a new informative KASSERT to ensure that a process is in the SRUN state
before we return it to cpu_switch().
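
Such an assertion might look like the following; the exact wording of
the message is hypothetical, but KASSERT(expr, (fmt, ...)) is the
kernel's standard form:

    KASSERT(p->p_stat == SRUN,
        ("runq_choose: process %p not in SRUN", p));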


# f32ded2f 24-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

- Assert that the proc to return is not NULL in runq_choose the
same as runq_remove.
- bzero the whole struct runq in runq_init just in case it's not
statically allocated.


# d5a08a60 11-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propagated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propagation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propagate_priority(). It should now have the desired
effect without needing to also propagate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propagate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propagate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at appropriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. The spare
fields were deliberately adjusted; the struct remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.
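
A hedged sketch of the unified indexing: all classes share one array of
64 queues, each queue covering a small band of priority levels. The
constants mirror the idea; they are not necessarily the exact values
the commit used:

    #define RQ_NQS  64      /* number of run queues */
    #define RQ_PPQ  4       /* priority levels per queue */

    /* Map a 0..255 priority to its queue index. */
    static inline int
    runq_index(int pri)
    {
            return (pri / RQ_PPQ);
    }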


# ef73ae4b 09-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other than curproc.
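
Typical usage of these accessors, with illustrative field names:

    static void
    example(void)
    {
            int cpu;

            cpu = PCPU_GET(cpuid);          /* read a per-cpu field */
            PCPU_SET(switchticks, ticks);   /* write a per-cpu field */
            /* PCPU_PTR(name) yields a pointer to the field instead. */
    }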


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# f6a0af80 15-Sep-2000 John Baldwin <jhb@FreeBSD.org>

Idle processes are always runnable, so let them stay at SRUN.


# 4a6404df 11-Sep-2000 John Baldwin <jhb@FreeBSD.org>

Fix some printf format string warnings due to sizeof(int) != sizeof(long) on
the alpha.


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 36e9f877 28-Mar-2000 Matthew Dillon <dillon@FreeBSD.org>

Commit major SMP cleanups and move the BGL (big giant lock) in the
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).

A great deal of conditional SMP code for various dead-ended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.

Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.

This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.

This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.

Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch


# 42cef09b 19-Aug-1999 Peter Wemm <peter@FreeBSD.org>

Fix a typo and a bug.
- One RTP_PRIO_REALTIME was meant to be RTP_PRIO_IDLE.
- RTP_PRIO_FIFO was not handled.
- Move the usual case first for setrunqueue() etc.


# dba6c5a6 18-Aug-1999 Peter Wemm <peter@FreeBSD.org>

Extract the next runnable process selection out of cpu_switch() into a
fairly machine-independent C routine. gcc actually does a pretty good
job of this.

Reviewed by: msmith (in principle)
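
The extracted routine might be sketched like this, scanning queue heads
from highest priority down; names and fields are illustrative rather
than the 1999 sources:

    #include <sys/queue.h>

    /* Pick the first process from the best non-empty queue. */
    struct proc *
    chooseproc(void)
    {
            struct proc *p;
            int i;

            for (i = 0; i < NQS; i++) {     /* lower index = higher pri */
                    p = TAILQ_FIRST(&queues[i]);
                    if (p != NULL) {
                            TAILQ_REMOVE(&queues[i], p, p_procq);
                            return (p);
                    }
            }
            return (NULL);  /* caller falls back to the idle process */
    }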