History log of /freebsd-current/sys/kern/kern_condvar.c
Revision Date Author Comments
# a58813fd 08-Mar-2024 Mark Johnston <markj@FreeBSD.org>

ktrace: Fix the build when options KTRACE is not configured

MFC after: 1 week
Reported by: John Nielsen <lists@jnielsen.net>


# 6b353101 18-Jan-2024 Olivier Certner <olce@FreeBSD.org>

SCHEDULER_STOPPED(): Rely on a global variable

A commit from 2012 (5d7380f8e34f0083, r228424) introduced
'td_stopsched', on the ground that a global variable would cause all
CPUs to have a copy of it in their cache, and consequently of all other
variables sharing the same cache line.

This is really a problem only if that cache line sees relatively
frequent modifications. This was unlikely to be the case back then
because nearby variables are almost never modified as well. In any
case, today we have a new tool at our disposal to ensure that this
variable goes into a read-mostly section containing frequently-accessed
variables ('__read_frequently'). Most of the cache lines covering this
section are likely to always be in every CPU cache. This makes the
second reason stated in the commit message (ensuring the field is in the
same cache line as some lock-related fields, since these are accessed in
close proximity) moot, as well as the second order effect of requiring
an additional line to be present in the cache (the one containing the
new 'scheduler_stopped' boolean, see below).

From a pure logical point of view, whether the scheduler is stopped is
a global state and is certainly not a per-thread quality.

Consequently, remove 'td_stopsched', which immediately frees a byte in
'struct thread'. Currently, the latter's size (and layout) stays
unchanged, but some of the later re-orderings will probably benefit from
this removal. Available bytes at the original position for
'td_stopsched' have been made explicit with the addition of the
'_td_pad0' member.

Store the global state in the new 'scheduler_stopped' boolean, which is
annotated with '__read_frequently'.

Replace uses of SCHEDULER_STOPPED_TD() with SCHEDULER_STOPPER() and
remove the former as it is now unnecessary.

Reviewed by: markj, kib
Approved by: markj (mentor)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D43572


# a5ef95cd 14-Jan-2024 Mark Johnston <markj@FreeBSD.org>

condvar: Fix a user-after-free in _cv_wait() when ktrace is enabled

When a thread wakes up after sleeping on a CV, it must not dereference
the CV structure, as it may already have been freed. At least ZFS
relies on this invariant, see commit
c636f94bd2ff15be5b904939872b4bce31456c18 for example.

Thus, when logging context-switch events, copy the wmesg into a stack
buffer while it is still safe to do so, and log that after waking up.

While here, move the initial ktrcsw() call later, after assertions and
the SCHEDULER_STOPPED_TD() condition are checked.

Reported by: syzkaller
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D43450


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 4d846d26 10-May-2023 Warner Losh <imp@FreeBSD.org>

spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop -FreeBSD

The SPDX folks have obsoleted the BSD-2-Clause-FreeBSD identifier. Catch
up to that fact and revert to their recommended match of BSD-2-Clause.

Discussed with: pfg
MFC After: 3 days
Sponsored by: Netflix


# 63ca9ea4 09-Jul-2021 Alexander Motin <mav@FreeBSD.org>

Use sleepq_signal(SLEEPQ_DROP) in cv_signal().

Same as wakeup_one()/wakeup_any() commit before it reduces the lock
hold time and so contention.

MFC after: 1 week


# 8a36da99 27-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.


# 91fa4707 16-Feb-2017 Mateusz Guzik <mjg@FreeBSD.org>

Introduce SCHEDULER_STOPPED_TD for use when the thread pointer was already read

Sprinkle in few places.


# 5b7d9ae2 06-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

cv: do a lockless check for no waiters in cv_signal and cv_broadcastpri

In case of some consumers like zfs there are no waiters vast majority of
the time

Reviewed by: jhb
MFC after: 1 week


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# b4f1d267 31-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Rework handling of thread sleeps before timers are working.

Previously, calls to *sleep() and cv_*wait*() immediately returned during
early boot. Instead, permit threads that request a sleep without a
timeout to sleep as wakeup() works during early boot. Sleeps with
timeouts are harder to emulate without working timers, so just punt and
panic explicitly if any thread tries to use those before timers are
working. Any threads that depend on timeouts should either wait until
SI_SUB_KICK_SCHEDULER to start or they should use DELAY() until timers
are available.

Until APs are started earlier this should be a no-op as other kthreads
shouldn't get a chance to start running until after timers are working
regardless of when they were created.

Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D5724


# 7b8cfe26 01-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Use SCHEDULER_STOPPED() in cv_*wait*() instead of checking panicstr.

Reviewed by: kib
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D5516


# 11f9ca69 08-Jan-2016 Mark Johnston <markj@FreeBSD.org>

Prevent cv_waiters wraparound.

r282971 attempted to fix this problem by decrementing cv_waiters after
waking up from sleeping on a condition variable, but this can result in
a use-after-free if the CV is freed before all woken threads have had a
chance to run. Instead, avoid incrementing cv_waiters past INT_MAX, and
have cv_signal() explicitly check for sleeping threads once cv_waiters has
reached this bound.

Reviewed by: jhb
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4822


# c636f94b 21-May-2015 John Baldwin <jhb@FreeBSD.org>

Revert r282971. It depends on condvar consumers not destroying condvars
until all threads sleeping on a condvar have resumed execution after being
awakened. However, there are cases where that guarantee is very hard to
provide.


# 5c894ee2 15-May-2015 John Baldwin <jhb@FreeBSD.org>

Previously, cv_waiters was only updated by cv_signal or cv_wait. If a
thread awakened due to a time out, then cv_waiters was not decremented.
If INT_MAX threads timed out on a cv without an intervening cv_broadcast,
then cv_waiters could overflow. To fix this, have each sleeping thread
decrement cv_waiters when it resumes.

Note that previously cv_waiters was protected by the sleepq chain lock.
However, that lock is not held when threads resume from sleep. In
addition, the interlock is also not always reacquired after resuming
(cv_wait_unlock), nor is it always held by callers of cv_signal() or
cv_broadcast(). Instead, use atomic ops to update cv_waiters. Since
the sleepq chain lock is still held on every increment, it should
still be safe to compare cv_waiters against zero while holding the
lock in the wakeup routines as the only way the race should be lost
would result in extra calls to sleepq_signal() or sleepq_broadcast().

Differential Revision: https://reviews.freebsd.org/D2427
Reviewed by: benno
Reported by: benno (wrap of cv_waiters in the field)
MFC after: 2 weeks


# a115fb62 22-Jan-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Revert for r277213:

FreeBSD developers need more time to review patches in the surrounding
areas like the TCP stack which are using MPSAFE callouts to restore
distribution of callouts on multiple CPUs.

Bump the __FreeBSD_version instead of reverting it.

Suggested by: kmacy, adrian, glebius and kib
Differential Revision: https://reviews.freebsd.org/D1438


# 1a26c3c0 15-Jan-2015 Hans Petter Selasky <hselasky@FreeBSD.org>

Major callout subsystem cleanup and rewrite:
- Close a migration race where callout_reset() failed to set the
CALLOUT_ACTIVE flag.
- Callout callback functions are now allowed to be protected by
spinlocks.
- Switching the callout CPU number cannot always be done on a
per-callout basis. See the updated timeout(9) manual page for more
information.
- The timeout(9) manual page has been updated to reflect how all the
functions inside the callout API are working. The manual page has
been made function oriented to make it easier to deduce how each of
the functions making up the callout API are working without having
to first read the whole manual page. Group all functions into a
handful of sections which should give a quick top-level overview
when the different functions should be used.
- The CALLOUT_SHAREDLOCK flag and its functionality has been removed
to reduce the complexity in the callout code and to avoid problems
about atomically stopping callouts via callout_stop(). If someone
needs it, it can be re-added. From my quick grep there are no
CALLOUT_SHAREDLOCK clients in the kernel.
- A new callout API function named "callout_drain_async()" has been
added. See the updated timeout(9) manual page for a complete
description.
- Update the callout clients in the "kern/" folder to use the callout
API properly, like cv_timedwait(). Previously there was some custom
sleepqueue code in the callout subsystem, which has been removed,
because we now allow callouts to be protected by spinlocks. This
allows us to tear down the callout like done with regular mutexes,
and a "td_slpmutex" has been added to "struct thread" to atomically
teardown the "td_slpcallout". Further the "TDF_TIMOFAIL" and
"SWT_SLEEPQTIMO" states can now be completely removed. Currently
they are marked as available and will be cleaned up in a follow up
commit.
- Bump the __FreeBSD_version to indicate kernel modules need
recompilation.
- There has been several reports that this patch "seems to squash a
serious bug leading to a callout timeout and panic".

Kernel build testing: all architectures were built
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D1438
Sponsored by: Mellanox Technologies
Reviewed by: jhb, adrian, sbruno and emaste


# 7faf4d90 20-Sep-2013 Davide Italiano <davide@FreeBSD.org>

Fix lc_lock/lc_unlock() support for rmlocks held in shared mode. With
current lock classes KPI it was really difficult because there was no
way to pass an rmtracker object to the lock/unlock routines. In order
to accomplish the task, modify the aforementioned functions so that
they can return (or pass as argument) an uinptr_t, which is in the rm
case used to hold a pointer to struct rm_priotracker for current
thread. As an added bonus, this fixes rm_sleep() in the rm shared
case, which right now can communicate priotracker structure between
lc_unlock()/lc_lock().

Suggested by: jhb
Reviewed by: jhb
Approved by: re (delphij)


# 46153735 03-Mar-2013 Davide Italiano <davide@FreeBSD.org>

MFcalloutng:
Extend condvar(9) KPI introducing sbt variant of cv_timedwait. This
rely on the previously committed sleepq_set_timeout_sbt().

Sponsored by: Google Summer of Code 2012, iXsystems inc.
Tested by: flo, marius, ian, markj, Fabian Keil


# 0a15e5d3 13-Sep-2012 Attilio Rao <attilio@FreeBSD.org>

Remove all the checks on curthread != NULL with the exception of some MD
trap checks (eg. printtrap()).

Generally this check is not needed anymore, as there is not a legitimate
case where curthread != NULL, after pcpu 0 area has been properly
initialized.

Reviewed by: bde, jhb
MFC after: 1 week


# 88bf5036 20-Apr-2012 John Baldwin <jhb@FreeBSD.org>

Include the associated wait channel message for context switch ktrace
records. kdump supports both the old and new messages.

Submitted by: Andrey Zonov andrey zonov org
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 52255936 26-Feb-2009 Ed Schouten <ed@FreeBSD.org>

Remove unused variables `p' and unneeded assignments of `rval'.

Found by: LLVM's scan-build


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 7d43ca69 25-Sep-2008 John Baldwin <jhb@FreeBSD.org>

- Don't do a WITNESS_SAVE() on the interlock if it is Giant in the condition
variable wait routines. DROP_GIANT() already manages that state in the
Giant interlock case.
- Assert that Giant is held when it is passed as a sleep interlock.


# 414e7679 07-Aug-2008 John Baldwin <jhb@FreeBSD.org>

Permit Giant to be passed as the explicit interlock either to
msleep/mtx_sleep or the various cv_*wait*() routines. Currently, the
"unlock" behavior of PDROP and cv_wait_unlock() with Giant is not
permitted as it is will be confusing since Giant is fully unrecursed and
unlocked during a thread sleep.

This is handy for subsystems which wish to allow unlocked drivers to
continue to use Giant such as CAM, the new TTY layer, and the new USB
stack. CAM currently uses a hack that I told Scott to use because I
really didn't want to permit this behavior, and the TTY and USB patches
both have various patches to permit this.

MFC after: 2 weeks


# da7bbd2c 05-Aug-2008 John Baldwin <jhb@FreeBSD.org>

If a thread that is swapped out is made runnable, then the setrunnable()
routine wakes up proc0 so that proc0 can swap the thread back in.
Historically, this has been done by waking up proc0 directly from
setrunnable() itself via a wakeup(). When waking up a sleeping thread
that was swapped out (the usual case when waking proc0 since only sleeping
threads are eligible to be swapped out), this resulted in a bit of
recursion (e.g. wakeup() -> setrunnable() -> wakeup()).

With sleep queues having separate locks in 6.x and later, this caused a
spin lock LOR (sleepq lock -> sched_lock/thread lock -> sleepq lock).
An attempt was made to fix this in 7.0 by making the proc0 wakeup use
the ithread mechanism for doing the wakeup. However, this required
grabbing proc0's thread lock to perform the wakeup. If proc0 was asleep
elsewhere in the kernel (e.g. waiting for disk I/O), then this degenerated
into the same LOR since the thread lock would be some other sleepq lock.

Fix this by deferring the wakeup of the swapper until after the sleepq
lock held by the upper layer has been locked. The setrunnable() routine
now returns a boolean value to indicate whether or not proc0 needs to be
woken up. The end result is that consumers of the sleepq API such as
*sleep/wakeup, condition variables, sx locks, and lockmgr, have to wakeup
proc0 if they get a non-zero return value from sleepq_abort(),
sleepq_broadcast(), or sleepq_signal().

Discussed with: jeff
Glanced at by: sam
Tested by: Jurgen Weber jurgen - ish com au
MFC after: 2 weeks


# c5aa6b58 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Pass the priority argument from *sleep() into sleepq and down into
sched_sleep(). This removes extra thread_lock() acquisition and
allows the scheduler to decide what to do with the static boost.
- Change the priority arguments to cv_* to match sleepq/msleep/etc.
where 0 means no priority change. Catch -1 in cv_broadcastpri() and
convert it to 0 for now.
- Set a flag when sleeping in a way that is compatible with swapping
since direct priority comparisons are meaningless now.
- Add a sysctl to ule, kern.sched.static_boost, that defaults to on which
controls the boost behavior. Turning it off gives better performance
in some workloads but needs more investigation.
- While we're modifying sleepq, change signal and broadcast to both
return with the lock held as the lock was held on enter.

Reviewed by: jhb, peter


# d72e80f0 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 2/14 of sched_lock decomposition.
- Adapt sleepqueues to the new thread_lock() mechanism.
- Delay assigning the sleep queue spinlock as the thread lock until after
we've checked for signals. It is illegal for a thread to return in
mi_switch() with any lock assigned to td_lock other than the scheduler
locks.
- Change sleepq_catch_signals() to do the switch if necessary to simplify
the callers.
- Simplify timeout handling now that locking a sleeping thread has the
side-effect of locking the sleepqueue. Some previous races are no
longer possible.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 9fa7ce0f 08-May-2007 John Baldwin <jhb@FreeBSD.org>

Fix a potential LOR with sx_sleep() and cv_wait() with sx locks by
1) adding the thread to the sleepq via sleepq_add() before dropping the
lock, and 2) dropping the sleepq lock around calls to lc_unlock() for
sleepable locks (i.e. locks that use sleepq's in their implementation).


# 8f27b08e 21-Mar-2007 John Baldwin <jhb@FreeBSD.org>

Rename the cv_*wait*() functions to _cv_*wait*() and change their second
argument from a mutex to a lock_object. Add cv_*wait*() wrapper macros
that accept either a mutex, rwlock, or sx lock as the second argument and
convert it to a lock_object and then call _cv_*wait*(). Basically, the
visible difference is that you can now use rwlocks and sx locks with
condition variables using the same API as with mutexes.


# aa89d8cd 21-Mar-2007 John Baldwin <jhb@FreeBSD.org>

Rename the 'mtx_object', 'rw_object', and 'sx_object' members of mutexes,
rwlocks, and sx locks to 'lock_object'.


# 503916a7 21-Mar-2007 John Baldwin <jhb@FreeBSD.org>

Don't use cv_wait_unlock() to implement cv_wait(). Instead, implement
cv_wait() fully and add missing KTRACE context switch traces.


# 6cbb70e2 15-Dec-2006 Kip Macy <kmacy@FreeBSD.org>

Add second sleep queue so that sx and lockmgr can have separate sleep
queues for shared and exclusive acquisitions

Submitted by: Attilio Rao
Approved by: jhb


# 7ee07175 15-Nov-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change sleepq_add(9) argument from 'struct mtx *' to 'struct lock_object *',
which allows to use it with different kinds of locks. For example it allows
to implement Solaris conditions variables which will be used in ZFS port on
top of sx(9) locks.

Reviewed by: jhb


# c008d517 22-Feb-2006 David Xu <davidxu@FreeBSD.org>

Fix a sleep queue race for KSE thread.

Reviewed by: jhb


# 94f0972b 15-Feb-2006 David Xu <davidxu@FreeBSD.org>

Fix a long standing race between sleep queue and thread
suspension code. When a thread A is going to sleep, it calls
sleepq_catch_signals() to detect any pending signals or thread
suspension request, if nothing happens, it returns without
holding process lock or scheduler lock, this opens a race
window which allows thread B to come in and do process
suspension work, however since A is still at running state,
thread B can do nothing to A, thread A continues, and puts
itself into actually sleeping state, but B has never seen it,
and it sits there forever until B is woken up by other threads
sometimes later(this can be very long delay or never
happen). Fix this bug by forcing sleepq_catch_signals to
return with scheduler lock held.
Fix sleepq_abort() by passing it an interrupted code, previously,
it worked as wakeup_one(), and the interruption can not be
identified correctly by sleep queue code when the sleeping
thread is resumed.
Let thread_suspend_check() returns EINTR or ERESTART, so sleep
queue no longer has to use SIGSTOP as a hack to build a return
value.

Reviewed by: jhb
MFC after: 1 week


# 92f44a3f 11-Dec-2005 Craig Rodrigues <rodrigc@FreeBSD.org>

Contributions from XFS for FreeBSD project:
- Implement cv_wait_unlock() method which has semantics compatible
with the sv_wait() method in IRIX. For cv_wait_unlock(), the lock
must be held before entering the function, but is not held when the
function is exited.

- Implement the existing cv_wait() function in terms of cv_wait_unlock().

Submitted by: kan
Feedback from: jhb, trhodes, Christoph Hellwig <hch at infradead dot org>


# 2ff0e645 12-Oct-2004 John Baldwin <jhb@FreeBSD.org>

Refine the turnstile and sleep queue interfaces just a bit:
- Add a new _lock() call to each API that locks the associated chain lock
for a lock_object pointer or wait channel. The _lookup() functions now
require that the chain lock be locked via _lock() when they are called.
- Change sleepq_add(), turnstile_wait() and turnstile_claim() to lookup
the associated queue structure internally via _lookup() rather than
accepting a pointer from the caller. For turnstiles, this means that
the actual lookup of the turnstile in the hash table is only done when
the thread actually blocks rather than being done on each loop iteration
in _mtx_lock_sleep(). For sleep queues, this means that sleepq_lookup()
is no longer used outside of the sleep queue code except to implement an
assertion in cv_destroy().
- Change sleepq_broadcast() and sleepq_signal() to require that the chain
lock is already required. For condition variables, this lets the
cv_broadcast() and cv_signal() functions lock the sleep queue chain lock
while testing the waiters count. This means that the waiters count
internal to condition variables is no longer protected by the interlock
mutex and cv_broadcast() and cv_signal() now no longer require that the
interlock be held when they are called. This lets consumers of condition
variables drop the lock before waking other threads which can result in
fewer context switches.

MFC after: 1 month


# 007ddf7e 19-Aug-2004 John Baldwin <jhb@FreeBSD.org>

Now that the return value semantics of cv's for multithreaded processes
have been unified with that of msleep(9), further refine the sleepq
interface and consolidate some duplicated code:
- Move the pre-sleep checks for theaded processes into a
thread_sleep_check() function in kern_thread.c.
- Move all handling of TDF_SINTR to be internal to subr_sleepqueue.c.
Specifically, if a thread is awakened by something other than a signal
while checking for signals before going to sleep, clear TDF_SINTR in
sleepq_catch_signals(). This removes a sched_lock lock/unlock combo in
that edge case during an interruptible sleep. Also, fix
sleepq_check_signals() to properly handle the condition if TDF_SINTR is
clear rather than requiring the callers of the sleepq API to notice
this edge case and call a non-_sig variant of sleepq_wait().
- Clarify the flags arguments to sleepq_add(), sleepq_signal() and
sleepq_broadcast() by creating an explicit submask for sleepq types.
Also, add an explicit SLEEPQ_MSLEEP type rather than a magic number of
0. Also, add a SLEEPQ_INTERRUPTIBLE flag for use with sleepq_add() and
move the setting of TDF_SINTR to sleepq_add() if this flag is set rather
than sleepq_catch_signals(). Note that it is the caller's responsibility
to ensure that sleepq_catch_signals() is called if and only if this flag
is passed to the preceeding sleepq_add(). Note that this also removes a
sched_lock lock/unlock pair from sleepq_catch_signals(). It also ensures
that for an interruptible sleep, TDF_SINTR is always set when
TD_ON_SLEEPQ() is true.


# 274f8f48 10-Aug-2004 John Baldwin <jhb@FreeBSD.org>

Synchronize the extra SA threading checks and return value handling of
condition variables with that of msleep().

Reviewed by: davidxu


# a5471e4e 28-Jun-2004 John Baldwin <jhb@FreeBSD.org>

Remove the signal_caught argument from sleepq_timedwait() as it was
effectively always zero.


# 9000d57d 06-Apr-2004 John Baldwin <jhb@FreeBSD.org>

Associate a simple count of waiters with each condition variable. The
count is protected by the mutex that protects the condition, so the count
does not require any extra locking or atomic operations. It serves as an
optimization to avoid calling into the sleepqueue code at all if there are
no waiters.

Note that the count can get temporarily out of sync when threads sleeping
on a condition variable time out or are aborted. However, it doesn't hurt
to call the sleepqueue code for either a signal or a broadcast when there
are no waiters, and the count is never out of sync in the opposite
direction unless we have more than INT_MAX sleeping threads.


# 1ed3e44f 12-Mar-2004 John Baldwin <jhb@FreeBSD.org>

- Remove old sleep queues.
- Remove sleepqueue argument from sleepq_set_timeout() since it is not
used.


# 44f3b092 27-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.


# 29bcc451 24-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL. Assert that one of these is set in mi_switch() and propery
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate paramter and
remove direct references to the process statistics.


# 512824f8 09-Nov-2003 Seigo Tanimura <tanimura@FreeBSD.org>

- Implement selwakeuppri() which allows raising the priority of a
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current


# 34178711 01-Jul-2003 David Xu <davidxu@FreeBSD.org>

Allow SA process unblocks a thread blocked in condition variable.

Reviewed by: deischen


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 90af4afa 13-May-2003 John Baldwin <jhb@FreeBSD.org>

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


# 53862173 17-Apr-2003 John Baldwin <jhb@FreeBSD.org>

Test the P_WEXIT flag while already hold the proc lock instead of right
after dropping it.


# 0d49bb4b 31-Mar-2003 Julian Elischer <julian@FreeBSD.org>

Do NOT return from an non-interruptable cv_wait, falsely
claiming to have timed out. I don't know what I was thinking..


# 26306795 04-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls
to WITNESS_WARN().


# b89bc9e6 27-Feb-2003 Hartmut Brandt <harti@FreeBSD.org>

When a process has been waiting on a condition variable or mutex the
td_wmesg field in the thread structure points to the description string of
the condition variable or mutex. If the condvar or the mutex had been
initialized from a loadable module that was unloaded in the meantime,
td_wmesg may now point to invalid memory. Retrieving the process table now
may panic the kernel (or access junk). Setting the td_wmesg field to NULL
after unblocking on the condvar/mutex prevents this panic.

PR: kern/47408
Approved by: jake (mentor)


# 4e997f4b 25-Jan-2003 Jeff Roberson <jeff@FreeBSD.org>

- Call sched_sleep() instead of rolling our own in cv_waitq_add().


# 93a7aa79 27-Dec-2002 Julian Elischer <julian@FreeBSD.org>

Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads sahre KSEs. Every KSE now has a thread, which is
considered its "owner" however a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
n this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creations of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more but
converts the existing thread into teh KSE's owner, and sets it up
for doing an upcall. It is just inhibitted from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no-longer needed.
"Idle" KSEs are now on the loanable queue.


# 9d102777 25-Oct-2002 Julian Elischer <julian@FreeBSD.org>

More work on the interaction between suspending and sleeping threads.
Also clean up some code used with 'single-threading'.

Reviewed by: davidxu


# 48bfcddd 08-Oct-2002 Julian Elischer <julian@FreeBSD.org>

Round out the facilty for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
teh owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is consceivable that the
borrower should inherit the priority of the owner too.
that's another discussion and would be simple to do.

Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot mor info for threaded proceses though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 71fad9fd 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# 67bdda97 02-Sep-2002 David Xu <davidxu@FreeBSD.org>

fix bogus CTR3 message.

Reviewed by: julian@freebsd.org (mentor)


# d13947c3 28-Aug-2002 Peter Wemm <peter@FreeBSD.org>

updatepri() works on a ksegrp (where the scheduling parameters are), so
directly give it the ksegrp instead of the thread. The only thing it used
to use in the thread was the ksegrp.

Reviewed by: julian


# b8e45df7 30-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Remove code that removes thread from sleep queue before
adding it to a condvar wait.
We do not have asleep() any more so this can not happen.


# 13326777 30-Jul-2002 Seigo Tanimura <tanimura@FreeBSD.org>

In endtsleep() and cv_timedwait_end(), a thread marked TDF_TIMEOUT may
be swapped out. Do not put such the thread directly back to the run
queue.

Spotted by: David Xu <davidx@viasoft.com.cn>

While I am here, s/PS_TIMEOUT/TDF_TIMEOUT/.


# 9eb881f8 30-Jul-2002 Seigo Tanimura <tanimura@FreeBSD.org>

- Optimize wakeup() and its friends; if a thread waken up is being
swapped in, we do not have to ask for the scheduler thread to do
that.

- Assert that a process is not swapped out in runq functions and
swapout().

- Introduce thread_safetoswapout() for readability.

- In swapout_procs(), perform a test that may block (check of a
thread working on its vm map) first. This lets us call swapout()
with the sched_lock held, providing a better atomicity.


# 1d7b9ed2 29-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Create a new thread state to describe threads that would be ready to run
except for the fact tha they are presently swapped out. Also add a process
flag to indicate that the process has started the struggle to swap
back in. This will be needed for the case where multiple threads
start the swapin action top a collision. Also add code to stop
a process fropm being swapped out if one of the threads in this
process is actually off running on another CPU.. that might hurt...

Submitted by: Seigo Tanimura <tanimura@r.dl.itc.u-tokyo.ac.jp>


# fe799533 16-Jul-2002 Andrew Gallatin <gallatin@FreeBSD.org>

Allow alphas to do crashdumps: Refuse to run anything in choosethread()
after a panic which is not an interrupt thread, or the thread which
caused the panic. Also, remove panicstr checks from msleep() and from
cv_wait() in order to allow threads to go to sleep and yeild the cpu
to the panicing thread, or to an interrupt thread which might
be doing the crashdump.

Reviewed by: jhb (and it was mostly his idea too)


# d5cb7e14 01-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Fix failure to correctly transition back to sleep mode.


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 9ba7fe1b 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

- Catch up to new ktrace API.
- ktrace trace points in msleep() and cv_wait() no longer need Giant.


# 628855e7 29-May-2002 Julian Elischer <julian@FreeBSD.org>

CURSIG() is not a macro so rename it cursig().

Obtained from: KSE tree


# 4bc37205 23-Apr-2002 Jeffrey Hsu <hsu@FreeBSD.org>

The cold and panicstr variables do not need to be protected by sched_lock.

Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Reviewed by: jake & jhb in principle


# e7876c09 29-Mar-2002 Dan Moschuk <dan@FreeBSD.org>

Nuke CV_DEBUG in favour of INVARIANTS.

Approved by: jhb


# 2c100766 11-Feb-2002 Julian Elischer <julian@FreeBSD.org>

In a threaded world, differnt priorirites become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


# c86b6ff5 05-Jan-2002 John Baldwin <jhb@FreeBSD.org>

Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex releease and swi schedule,
respectively when that switch is not safe. Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer. This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs. Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called. (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha. I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine. PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken. Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by: peter
Tested on: i386, alpha


# a48740b6 09-Dec-2001 David E. O'Brien <obrien@FreeBSD.org>

Update to C99, s/__FUNCTION__/__func__/.


# 66f769fe 18-Sep-2001 Peter Wemm <peter@FreeBSD.org>

Add missing ; in last commit

Pointy-hat-to: jhb


# 9ef3a985 18-Sep-2001 John Baldwin <jhb@FreeBSD.org>

Use a 'p' variable instead of repetitively indirecting td->td_proc for
signal things that are still per-process and won't be per-thread.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 91a4536f 21-Aug-2001 John Baldwin <jhb@FreeBSD.org>

- Fix a bug in the previous workaround for the tsleep/endtsleep race.
callout_stop() would fail in two cases:
1) The timeout was currently executing, and
2) The timeout had already executed.
We only needed to work around the race for 1). We caught some instances
of 2) via the PS_TIMEOUT flag, however, if endtsleep() fired after the
process had been woken up but before it had resumed execution,
PS_TIMEOUT would not be set, but callout_stop() would fail, so we
would block the process until endtsleep() resumed it. Except that
endtsleep() had already run and couldn't resume it. This adds a new flag
PS_TIMOFAIL to indicate the case of 2) when PS_TIMEOUT isn't set.
- Implement this race fix for condition variables as well.

Tested by: sos


# d652b3d9 05-Jul-2001 Jake Burkholder <jake@FreeBSD.org>

Backout mwakeup, etc.


# 9316aed2 03-Jul-2001 Jake Burkholder <jake@FreeBSD.org>

Implement mwakeup, mwakeup_one, cv_signal_drop and cv_broadcast_drop.
These take an additional mutex argument, which is dropped before any
processes are made runnable. This can avoid contention on the mutex
if the processes would immediately acquire it, and is done in such a
way that wakeups will not be lost.

Reviewed by: jhb


# 87f9ffb8 22-Jun-2001 John Baldwin <jhb@FreeBSD.org>

- Lock CURSIG with the proc lock and don't release the proc lock until
after grabbing the sched lock to close a race.
- Lock ktrace points with Giant.


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# c739adbf 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Pass in a pointer to the mutex's lock_object as the second argument to
WITNESS_SLEEP() rather than the mutex itself.


# 19284646 28-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Rework the witness code to work with sx locks as well as mutexes.
- Introduce lock classes and lock objects. Each lock class specifies a
name and set of flags (or properties) shared by all locks of a given
type. Currently there are three lock classes: spin mutexes, sleep
mutexes, and sx locks. A lock object specifies properties of an
additional lock along with a lock name and all of the extra stuff needed
to make witness work with a given lock. This abstract lock stuff is
defined in sys/lock.h. The lockmgr constants, types, and prototypes have
been moved to sys/lockmgr.h. For temporary backwards compatability,
sys/lock.h includes sys/lockmgr.h.
- Replace proc->p_spinlocks with a per-CPU list, PCPU(spinlocks), of spin
locks held. By making this per-cpu, we do not have to jump through
magic hoops to deal with sched_lock changing ownership during context
switches.
- Replace proc->p_heldmtx, formerly a list of held sleep mutexes, with
proc->p_sleeplocks, which is a list of held sleep locks including sleep
mutexes and sx locks.
- Add helper macros for logging lock events via the KTR_LOCK KTR logging
level so that the log messages are consistent.
- Add some new flags that can be passed to mtx_init():
- MTX_NOWITNESS - specifies that this lock should be ignored by witness.
This is used for the mutex that blocks a sx lock for example.
- MTX_QUIET - this is not new, but you can pass this to mtx_init() now
and no events will be logged for this lock, so that one doesn't have
to change all the individual mtx_lock/unlock() operations.
- All lock objects maintain an initialized flag. Use this flag to export
a mtx_initialized() macro that can be safely called from drivers. Also,
we on longer walk the all_mtx list if MUTEX_DEBUG is defined as witness
performs the corresponding checks using the initialized flag.
- The lock order reversal messages have been improved to output slightly
more accurate file and line numbers.


# 28aa95b6 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

Use the proc lock to protect access to p_sigacts->ps_sigintr.


# d5a08a60 11-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propogated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propogate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at apprioriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 981808d1 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

Catch up to P_FOO -> PS_FOO changes in proc flags.


# 238510fc 15-Jan-2001 Jason Evans <jasone@FreeBSD.org>

Implement condition variables.