History log of /freebsd-current/sys/kern/sched_4bsd.c
Revision	Date	Author	Comments
# aeff15b3	09-Feb-2024	Olivier Certner <olce@FreeBSD.org>	sched: Simplify sched_lend_user_prio_cond() If 'td_lend_user_pri' has the expected value, there is no need to check the fields that sched_lend_user_prio() modifies, they either are already good or soon will be ('td->td_lend_user_pri' has just been changed by a concurrent update). Reviewed by: kib Approved by: emaste (mentor) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D44050
# 6a3c02bc	16-Jan-2024	Olivier Certner <olce@FreeBSD.org>	sched: sched_switch(): Factorize sleepqueue flags Avoid duplicating common flags for the preempted and non-preempted cases, making it clear that they are the same without resorting to formatting. No functional change. Approved by: markj (mentor) MFC after: 3 days Sponsored by: The FreeBSD Foundation
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# 1029dab6	09-Feb-2023	Mitchell Horne <mhorne@FreeBSD.org>	mi_switch(): clean up switch types and their usage Overall, this is a non-functional change, except for kernels built with SCHED_STATS. However, the switch types are useful for communicating the intent of the caller. 1. Ensure that every caller provides a type. In most cases, we upgrade the basic yield to sched_relinquish() aka SWT_RELINQUISH. 2. The case of sched_bind() is distinct, so add a new switch type SWT_BIND. 3. Remove the two unused types, SWT_PREEMPT and SWT_SLEEPQTIMO. 4. Remove SWT_NONE altogether and assert that callers always provide a type flag. 5. Reference the mi_switch(9) man page in the comments, as these flags will be documented there. Reviewed by: kib, markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38184
# bff02948	09-Feb-2023	Mitchell Horne <mhorne@FreeBSD.org>	sched_4bsd: use the same switch flags as ULE ULE uses the more specific SWT_REMOTEPREEMPT and SWT_REMOTEWAKEIDLE switch types, let's do that here as well. SWT_PREEMPT is somewhat redundant when we also have the SW_PREEMPT flag. This only has an effect for kernels built with SCHED_STATS. Reviewed by: kib, markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38183
# c2d27b0e	23-Sep-2022	Mark Johnston <markj@FreeBSD.org>	sched_4bsd: Fix a racy thread state modification When a thread switching off-CPU is migrating to a remote CPU, sched_switch() may trigger a rescheduling of the thread currently running on that CPU. When doing so, it must ensure that that thread is locked before modifying thread state. If the thread's lock is not the scheduler lock, then the thread is in the process of switching off-CPU and no extra effort is needed, and the initiator does not hold the thread's lock and thus should not modify any thread state. Reported and tested by: Steve Kargl MFC after: 1 week
# c6d31b83	18-Jul-2022	Konstantin Belousov <kib@FreeBSD.org>	AST: rework Make most AST handlers dynamically registered. This allows to have subsystem-specific handler source located in the subsystem files, instead of making subr_trap.c aware of it. For instance, signal delivery code on return to userspace is now moved to kern_sig.c. Also, it allows to have some handlers designated as the cleanup (kclear) type, which are called both at AST and on thread/process exit. For instance, ast(), exit1(), and NFS server no longer need to be aware about UFS softdep processing. The dynamic registration also allows third-party modules to register AST handlers if needed. There is one caveat with loadable modules: the code does not make any effort to ensure that the module is not unloaded before all threads processed through AST handler in it. In fact, this is already present behavior for hwpmc.ko and ufs.ko. I do not think it is worth the efforts and the runtime overhead to try to fix it. Reviewed by: markj Tested by: emaste (arm64), pho Discussed with: jhb Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D35888
# 40efe743	14-Jul-2022	John Baldwin <jhb@FreeBSD.org>	4bsd: Simplistic time-sharing for interrupt threads. If an interrupt thread runs for a full quantum without yielding the CPU, demote its priority and schedule a preemption to give other ithreads a turn. Reviewed by: kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35645
# fea89a28	14-Jul-2022	John Baldwin <jhb@FreeBSD.org>	Add sched_ithread_prio to set the base priority of an interrupt thread. Use it instead of sched_prio when setting the priority of an interrupt thread. Reviewed by: kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D35642
# 8758ac75	13-Apr-2022	John Baldwin <jhb@FreeBSD.org>	sched_4bsd: ts is only used in sched_bind for SMP.
# 72ff256c	12-Apr-2022	John Baldwin <jhb@FreeBSD.org>	sched_4bsd: Remove unused variables.
# ec3af9d0	01-Jan-2022	Stefan Eßer <se@FreeBSD.org>	sys/kern/sched_4bsd.c: fix typo introduced in previous commit
# a19bd8e3	01-Jan-2022	Stefan Eßer <se@FreeBSD.org>	Restore variable aliasing in the context of cpu set operations A simplification of set operations removed side-effects of the previous code, which are restored by this commit.
# e2650af1	29-Dec-2021	Stefan Eßer <se@FreeBSD.org>	Make CPU_SET macros compliant with other implementations The introduction of <sched.h> improved compatibility with some 3rd party software, but caused the configure scripts of some ports to assume that they were run in a GLIBC compatible environment. Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being added to ports, but there still were compatibility issues due to invalid assumptions made in autoconfigure scripts. The difference between the FreeBSD version of macros like CPU_AND, CPU_OR, etc. and the GLIBC versions was in the number of arguments: FreeBSD used a 2-address scheme (one source argument is also used as the destination of the operation), while GLIBC uses a 3-address scheme (2 source operands and a separately passed destination). The GLIBC scheme provides a super-set of the functionality of the FreeBSD macros, since it does not prevent passing the same variable as source and destination arguments. In code that wanted to preserve both source arguments, the FreeBSD macros required a temporary copy of one of the source arguments. This patch set allows to unconditionally provide functions and macros expected by 3rd party software written for GLIBC based systems, but breaks builds of externally maintained sources that use any of the following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR. One contributed driver (contrib/ofed/libmlx5) has been patched to support both the old and the new CPU_OR signatures. If this commit is merged to -STABLE, the version test will have to be extended to cover more ranges. Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT no longer require that option. The FreeBSD version has been bumped to 1400046 to reflect this incompatible change. Reviewed by: kib MFC after: 2 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D33451
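The 2-address versus 3-address distinction described in e2650af1 can be sketched with a toy single-word set. The names `toy_set_t`, `TOY_AND2` and `TOY_AND3` are hypothetical and exist only for this illustration; they are not the FreeBSD or GLIBC macros.

```c
#include <stdint.h>

/* Toy stand-in for a CPU set: a single 64-bit mask (hypothetical type). */
typedef struct { uint64_t bits; } toy_set_t;

/* 2-address form: dst is both a source and the destination. */
#define TOY_AND2(dst, src)	((dst)->bits &= (src)->bits)

/* 3-address form: two sources and a separately passed destination. */
#define TOY_AND3(dst, s1, s2)	((dst)->bits = (s1)->bits & (s2)->bits)

/* Intersect a and b, preserving both, using the 2-address form. */
static uint64_t
intersect_2addr(const toy_set_t *a, const toy_set_t *b)
{
	toy_set_t tmp = *a;	/* temporary copy keeps 'a' intact */

	TOY_AND2(&tmp, b);
	return (tmp.bits);
}

/* Same result with the 3-address form: no copy of a source is needed. */
static uint64_t
intersect_3addr(const toy_set_t *a, const toy_set_t *b)
{
	toy_set_t out;

	TOY_AND3(&out, a, b);
	return (out.bits);
}

/* The 3-address form also covers in-place use: dst may alias a source. */
static uint64_t
intersect_in_place(toy_set_t *a, const toy_set_t *b)
{
	TOY_AND3(a, a, b);
	return (a->bits);
}
```

The last helper shows why the 3-address scheme is a superset: passing the same variable as source and destination reproduces the 2-address behavior.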
# 6a8ea6d1	03-Nov-2021	Kyle Evans <kevans@FreeBSD.org>	sched: split sched_ap_entry() out of sched_throw() sched_throw() can no longer take a NULL thread, APs enter through sched_ap_entry() instead. This completely removes branching in the common case and cleans up both paths. No functional change intended. Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D32829
# 589aed00	02-Nov-2021	Kyle Evans <kevans@FreeBSD.org>	sched: separate out schedinit_ap() schedinit_ap() sets up an AP for a later call to sched_throw(NULL). Currently, ULE sets up some pcpu bits and fixes the idlethread lock with a call to sched_throw(NULL); this results in a window where curthread is setup in platforms' init_secondary(), but it has the wrong td_lock. Typical platform AP startup procedure looks something like: - Setup curthread - ... other stuff, including cpu_initclocks_ap() - Signal smp_started - sched_throw(NULL) to enter the scheduler cpu_initclocks_ap() may have callouts to process (e.g., nvme) and attempt to sched_add() for this AP, but this attempt fails because of the noted violated assumption leading to locking heartburn in sched_setpreempt(). Interrupts are still disabled until cpu_throw() so we're not really at risk of being preempted -- just let the scheduler in on it a little earlier as part of setting up curthread. Reviewed by: alfredo, kib, markj Triage help from: andrew, markj Smoke-tested by: alfredo (ppc), kevans (arm64, x86), mhorne (arm) Differential Revision: https://reviews.freebsd.org/D32797
# af29f399	28-Jul-2021	Dmitry Chagin <dchagin@FreeBSD.org>	umtx: Split umtx.h into two counterparts. To prevent future changes from polluting umtx.h, split it into two headers: umtx.h - ABI header for userspace; umtxvar.h - the kernel stuff. While here fix umtx_key_match style. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31248 MFC after: 2 weeks
# 6a467cc5	23-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	lockprof: pass lock type as an argument instead of reading the spin flag
# fa2528ac	18-Feb-2021	Alex Richardson <arichardson@FreeBSD.org>	Use atomic loads/stores when updating td->td_state KCSAN complains about racy accesses in the locking code. Those races are fine since they are inside a TD_SET_RUNNING() loop that expects the value to be changed by another CPU. Use relaxed atomic stores/loads to indicate that this variable can be written/read by multiple CPUs at the same time. This will also prevent the compiler from doing unexpected re-ordering. Reported by: GENERIC-KCSAN Test Plan: KCSAN no longer complains, kernel still runs fine. Reviewed By: markj, mjg (earlier version) Differential Revision: https://reviews.freebsd.org/D28569
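The idea in fa2528ac, marking an intentionally racy field with relaxed atomic accesses, might look like the following in portable C11. This is a sketch only: it uses `<stdatomic.h>` as a stand-in for the kernel's own atomic primitives, and `toy_thread`, `toy_set_state` and `toy_get_state` are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical thread states for the illustration. */
enum toy_state { TOY_INACTIVE, TOY_RUNNING };

struct toy_thread {
	_Atomic int state;	/* may be read and written by several CPUs */
};

/* Relaxed store: no ordering is implied, but the access is atomic. */
static void
toy_set_state(struct toy_thread *td, int state)
{
	atomic_store_explicit(&td->state, state, memory_order_relaxed);
}

/* Relaxed load: tells the compiler another CPU may change the value. */
static int
toy_get_state(struct toy_thread *td)
{
	return (atomic_load_explicit(&td->state, memory_order_relaxed));
}
```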
# b77594bb	14-Nov-2020	Mateusz Guzik <mjg@FreeBSD.org>	sched: fix an incorrect comparison in sched_lend_user_prio_cond Compare with sched_lend_user_prio.
# 6fed89b1	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	kern: clean up empty lines in .c and .h files
# b05ca429	02-Mar-2020	Pawel Biernacki <kaktus@FreeBSD.org>	sys/: Document few more sysctls. Submitted by: Antranig Vartanian <antranigv@freebsd.am> Reviewed by: kaktus Commented by: jhb Approved by: kib (mentor) Sponsored by: illuria security Differential Revision: https://reviews.freebsd.org/D23759
# 7029da5c	26-Feb-2020	Pawel Biernacki <kaktus@FreeBSD.org>	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718
# 3ff65f71	30-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	Remove duplicated empty lines from kern/*.c No functional changes.
# 879e0604	11-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	Add KERNEL_PANICKED macro for use in place of direct panicstr tests
# 686bcb5c	15-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	schedlock 4/4 Don't hold the scheduler lock while doing context switches. Instead we unlock after selecting the new thread and switch within a spinlock section leaving interrupts and preemption disabled to prevent local concurrency. This means that mi_switch() is entered with the thread locked but returns without. This dramatically simplifies scheduler locking because we will not hold the schedlock while spinning on blocked lock in switch. This change has not been made to 4BSD but in principle it would be more straightforward. Discussed with: markj Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22778
# 61a74c5c	15-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	schedlock 1/4 Eliminate recursion from most thread_lock consumers. Return from sched_add() without the thread_lock held. This eliminates unnecessary atomics and lock word loads as well as reducing the hold time for scheduler locks. This will eventually allow for lockless remote adds. Discussed with: kib Reviewed by: jhb Tested by: pho Differential Revision: https://reviews.freebsd.org/D22626
# 9825eadf	13-Dec-2019	Ryan Libby <rlibby@FreeBSD.org>	bitset: rename confusing macro NAND to ANDNOT s/BIT_NAND/BIT_ANDNOT/, and for CPU and DOMAINSET too. The actual implementation is "and not" (or "but not"), i.e. A but not B. Fortunately this does appear to be what all existing callers want. Don't supply a NAND (not (A and B)) operation at this time. Discussed with: jeff Reviewed by: cem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D22791
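The naming issue in 9825eadf comes down to two different bit operations. A minimal sketch on 32-bit words follows; `toy_andnot` and `toy_nand` are hypothetical helpers, not the BIT_* macros themselves.

```c
#include <stdint.h>

/* "and not" (A but not B): bits set in a that are clear in b. */
static uint32_t
toy_andnot(uint32_t a, uint32_t b)
{
	return (a & ~b);
}

/* True NAND: complement of the intersection, not (A and B). */
static uint32_t
toy_nand(uint32_t a, uint32_t b)
{
	return (~(a & b));
}
```

For a = 0xC and b = 0xA the first yields 0x4 while the second yields 0xFFFFFFF7, which is why the old name was misleading.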
# c3cccf95	07-Dec-2019	Jeff Roberson <jeff@FreeBSD.org>	Handle multiple clock interrupts simultaneously in sched_clock(). Reviewed by: kib, markj, mav Differential Revision: https://reviews.freebsd.org/D22625
# 61322a0a	04-Dec-2019	Alexander Motin <mav@FreeBSD.org>	Mark some more hot global variables with __read_mostly. MFC after: 1 week
# ac97da9a	08-May-2019	Mateusz Guzik <mjg@FreeBSD.org>	Reduce umtx-related work on exec and exit - there is no need to take the process lock to iterate the thread list after single-threading is enforced - typically there are no mutexes to clean up (testable without taking the global umtx lock) - typically there is no need to adjust the priority (testable without taking thread lock) Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20160
# 2bf95012	05-Jul-2018	Andrew Turner <andrew@FreeBSD.org>	Create a new macro for static DPCPU data. On arm64 (and possibly other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140
# 28240885	07-May-2018	Mateusz Guzik <mjg@FreeBSD.org>	Inlined sched_userret. The tested condition is rarely true and it induces a function call on each return to userspace. Bumps getuid rate by about 1% on Broadwell.
# 3f289c3f	12-Jan-2018	Jeff Roberson <jeff@FreeBSD.org>	Implement 'domainset', a cpuset based NUMA policy mechanism. This allows userspace to control NUMA policy administratively and programmatically. Implement domainset based iterators in the page layer. Remove the now legacy numa_* syscalls. Cleanup some header pollution created by having seq.h in proc.h. Reviewed by: markj, kib Discussed with: alc Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D13403
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# 3e85b721	16-May-2017	Ed Maste <emaste@FreeBSD.org>	Remove register keyword from sys/ and ANSIfy prototypes A long long time ago the register keyword told the compiler to store the corresponding variable in a CPU register, but it is not relevant for any compiler used in the FreeBSD world today. ANSIfy related prototypes while here. Reviewed by: cem, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10193
# afa0a46c	23-Mar-2017	Andriy Gapon <avg@FreeBSD.org>	move thread switch tracing from mi_switch to sched_switch This is done so that the thread state changes during the switch are not confused with the thread state changes reported when the thread spins on a lock. Here is an example, three consecutive entries for the same thread (from top to bottom): KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"sleep", attributes: prio:84, wmesg:"-", lockname:"(null)" KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"spinning", attributes: lockname:"sched lock 1" KTRGRAPH group:"thread", id:"zio_write_intr_3 tid 100260", state:"running", attributes: none The above trace could leave an impression that the final state of the thread was "running". After this change the sleep state will be reported after the "spinning" and "running" states reported for the sched lock. Reviewed by: jhb, markj MFC after: 1 week Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9961
# 28ef18b8	11-Mar-2017	Andriy Gapon <avg@FreeBSD.org>	trace thread running state when a thread is run for the first time This applies to both KTR_SCHED and DTrace sched:::on-cpu tracing. MFC after: 10 days
# 27ee18ad	16-Feb-2017	Ryan Stone <rstone@FreeBSD.org>	Revert r313814 and r313816 Something evidently got mangled in my git tree in between testing and review, as an old and broken version of the patch was apparently submitted to svn. Revert this while I work out what went wrong. Reported by: tuexen Pointy hat to: rstone
# 09ae7c48	16-Feb-2017	Ryan Stone <rstone@FreeBSD.org>	Check for preemption after lowering a thread's priority When a high-priority thread is waiting for a mutex held by a low-priority thread, it temporarily lends its priority to the low-priority thread to prevent priority inversion. When the mutex is released, the lent priority is revoked and the low-priority thread goes back to its original priority. When the priority of that thread is lowered (through a call to sched_priority()), the scheduler was not checking whether there is now a high-priority thread in the run queue. This can cause threads with real-time priority to be starved in the run queue while the low-priority thread finishes its quantum. Fix this by explicitly checking whether preemption is necessary when a thread's priority is lowered. Sponsored by: Dell EMC Isilon Obtained from: Sandvine Inc Differential Revision: https://reviews.freebsd.org/D9518 Reviewed by: Jeff Roberson (ule) MFC after: 1 month
# ad9dadc4	19-Jan-2017	Andriy Gapon <avg@FreeBSD.org>	fix a thread preemption regression in schedulers introduced in r270423 Commit r270423 fixed a regression in sched_yield() that was introduced in earlier changes. Unfortunately, at the same time it introduced a new regression. The problem is that SWT_RELINQUISH (6), like all other SWT_* constants and unlike SW_* flags, is not a bit flag. So, (flags & SWT_RELINQUISH) is true in cases where that was not really intended, for example, with SWT_OWEPREEMPT (2) and SWT_REMOTEPREEMPT (11). A straightforward fix would be to use (flags & SW_TYPE_MASK) == SWT_RELINQUISH, but my impression is that the switch types are designed mostly for gathering statistics, not for influencing scheduling decisions. So, I decided that it would be better to check for SW_PREEMPT flag instead. That's also the same flag that was checked before r239157. I double-checked how that flag is used and I am confident that the flag is set only in the places where we really have the preemption: - critical_exit + td_owepreempt - sched_preempt in the ULE scheduler - sched_preempt in the 4BSD scheduler Reviewed by: kib, mav MFC after: 4 days Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9230
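The bug described in ad9dadc4 can be reproduced in a few lines. The three switch-type values (2, 6, 11) are the ones quoted in the commit message; the `TOY_SW_TYPE_MASK` and `TOY_SW_PREEMPT` values are assumptions chosen for this sketch and should not be read as the kernel's definitions.

```c
#include <stdbool.h>

/* Switch-type values as quoted in the commit message. */
#define TOY_SWT_OWEPREEMPT	2
#define TOY_SWT_RELINQUISH	6
#define TOY_SWT_REMOTEPREEMPT	11

/* Assumed layout: type in the low byte, bit flags above it. */
#define TOY_SW_TYPE_MASK	0xff
#define TOY_SW_PREEMPT		0x0400

/* Buggy test: treats an enumerated type value as if it were a bit flag. */
static bool
is_relinquish_buggy(int flags)
{
	return ((flags & TOY_SWT_RELINQUISH) != 0);
}

/* The straightforward fix from the message: mask, then compare. */
static bool
is_relinquish_masked(int flags)
{
	return ((flags & TOY_SW_TYPE_MASK) == TOY_SWT_RELINQUISH);
}

/* The approach the commit chose: test a real bit flag instead. */
static bool
is_preempt(int flags)
{
	return ((flags & TOY_SW_PREEMPT) != 0);
}
```

Because 6 is binary 110, it shares bit 1 with both 2 (010) and 11 (1011), so the buggy test matches all three types.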
# 892f0ab0	11-Nov-2016	John Baldwin <jhb@FreeBSD.org>	Allow scheduling during early boot. - Send IPI wakeups once SMP is started even if cold is true. - Permit preemptions when cold is true. These changes are needed for EARLY_AP_STARTUP. MFC after: 2 weeks Sponsored by: Netflix
# a6b91f0f	11-Nov-2016	John Baldwin <jhb@FreeBSD.org>	Don't place threads on the run queue after waking up other CPUs. The other CPU might resume and see a still-empty runq and go back to sleep before sched_add() adds the thread to the runq. This results in a lost wakeup and a potential hang if the system is otherwise completely idle. The race originated due to a micro-optimization (my fault) in 4BSD in that it avoided putting a thread on the run queue if the scheduler was going to preempt to the new thread. To avoid complexity while fixing this race, just drop this optimization. 4BSD now always sets the "owepreempt" flag when a preemption is warranted and defers the actual preemption to the thread_unlock of the caller the same as ULE. MFC after: 2 weeks Sponsored by: Netflix
# 69a28758	15-Sep-2016	Ed Maste <emaste@FreeBSD.org>	Renumber license clauses in sys/kern to avoid skipping #3
# e2325d82	29-Jul-2016	John Baldwin <jhb@FreeBSD.org>	Don't treat NOCPU as a valid CPU to CPU_ISSET. If a thread is created bound to a cpuset it might already be bound before it's very first timeslice, and td_lastcpu will be NOCPU in that case. MFC after: 1 week
# 93ccd6bf	05-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	Get rid of struct proc p_sched and struct thread td_sched pointers. p_sched is unused. The struct td_sched is always co-allocated with the struct thread, except for the thread0. Avoid useless indirection, instead calculate td_sched location using simple pointer arithmetic in td_get_sched(9). For thread0, which is statically allocated, create a structure to emulate layout of the dynamic allocation. Reviewed by: jhb (previous version) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D6711
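The co-allocation scheme from 93ccd6bf can be sketched with toy structures. All names below (`toy_thread`, `toy_td_sched`, `toy_get_sched`, `toy_thread0_storage`) are hypothetical, and the real structures are far larger; the sketch only shows the pointer arithmetic and the static wrapper that mimics the dynamic layout.

```c
#include <stdlib.h>

struct toy_thread { int tid; };
struct toy_td_sched { int estcpu; };

/* Scheduler data lives directly after the thread structure. */
static struct toy_td_sched *
toy_get_sched(struct toy_thread *td)
{
	return ((struct toy_td_sched *)&td[1]);
}

/* Dynamic case: one allocation holds both structures back to back. */
static struct toy_thread *
toy_thread_alloc(void)
{
	return (calloc(1, sizeof(struct toy_thread) +
	    sizeof(struct toy_td_sched)));
}

/* Static case: a wrapper structure emulates the same layout. */
struct toy_thread0_storage {
	struct toy_thread t0st_thread;
	struct toy_td_sched t0st_sched;
};
```

In this toy both structures hold a single `int`, so no padding separates them; a real implementation has to make sure the alignment of the trailing structure is respected.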
# e3043798	29-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/kern: spelling fixes in comments. No functional change.
# ccd0ec40	17-Apr-2016	Konstantin Belousov <kib@FreeBSD.org>	The struct thread td_estcpu member is only used by the 4BSD scheduler. Move it to the struct td_sched for 4BSD, removing always present field, otherwise unused for ULE. New scheduler method sched_estcpu() returns the estimation for kinfo_proc consumption. As before, it always returns 0 for ULE. Remove sched_tick() scheduler method, unused both by 4BSD and ULE. Update locking comment for the 4BSD struct td_sched, copying it from the same comment for ULE. Spell MAXPRI as PRI_MAX_TIMESHARE in the 4BSD comment. Based on some notes from, and reviewed by: bde Sponsored by: The FreeBSD Foundation
# 92de34df	03-Aug-2015	John Baldwin <jhb@FreeBSD.org>	kgdb uses td_oncpu to determine if a thread is running and should use a pcb from stoppcbs[] rather than the thread's PCB. However, exited threads retained td_oncpu from the last time they ran, and newborn threads had their CPU fields cleared to zero during fork and thread creation since they are in the set of fields zeroed when threads are setup. To fix, explicitly update the CPU fields for exiting threads in sched_throw() to reflect the switch out and reset the CPU fields for new threads in sched_fork_thread() to NOCPU. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D3193
# 4b5c9cf6	29-Apr-2015	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add kern.racct.enable tunable and RACCT_DISABLED config option. The point of this is to be able to add RACCT (with RACCT_DISABLED) to GENERIC, to avoid having to rebuild the kernel to use rctl(8). Differential Revision: https://reviews.freebsd.org/D2369 Reviewed by: kib@ MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation
# 2e7d7bb2	23-Aug-2014	Alexander Motin <mav@FreeBSD.org>	Restore pre-r239157 handling of sched_yield(), when thread time slice was aborted, allowing other threads to run. Without this change thread is just rescheduled again, that was illustrated by provided test tool. PR: 192926 Submitted by: eric@vangyzen.net MFC after: 2 weeks
# 0d13d5fc	29-Apr-2014	Marius Strobl <marius@FreeBSD.org>	Given that as of r258002 the last external user is gone, make sched_lock static.
# 7dba7849	29-Dec-2013	Mark Johnston <markj@FreeBSD.org>	The arguments to sched:::off-cpu are the thread and associated process of the thread selected to run, not the currently running thread. This fix has already been made for ULE in r252070. PR: 177706 MFC after: 1 week
# d9fae5ab	26-Nov-2013	Andriy Gapon <avg@FreeBSD.org>	dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE In its stead use the Solaris / illumos approach of emulating '-' (dash) in probe names with '__' (two consecutive underscores). Reviewed by: markj MFC after: 3 weeks
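The naming convention adopted in d9fae5ab, writing '__' in the C identifier where the probe name has '-', amounts to a simple string mapping. `toy_probe_name` is a hypothetical helper written for this illustration and is not part of the SDT implementation.

```c
/*
 * Copy 'in' to 'out', turning each "__" pair into a single '-'.
 * 'out' must have room for strlen(in) + 1 bytes.
 */
static void
toy_probe_name(const char *in, char *out)
{
	while (*in != '\0') {
		if (in[0] == '_' && in[1] == '_') {
			*out++ = '-';
			in += 2;
		} else
			*out++ = *in++;
	}
	*out = '\0';
}
```

With this mapping an identifier such as `off__cpu` corresponds to the probe name `off-cpu`.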
# 54366c0b	25-Nov-2013	Attilio Rao <attilio@FreeBSD.org>	- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging option, unbreak the lock tracing release semantic by embedding calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined version of the releasing functions for mutex, rwlock and sxlock. Failing to do so skips the lockstat_probe_func invocation for unlocking. - As part of the LOCKSTAT support is inlined in mutex operation, for kernel compiled without lock debugging options, potentially every consumer must be compiled including opt_kdtrace.h. Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES is linked there and it is only used as a compile-time stub [0]. [0] immediately shows some new bug as DTRACE-derived support for debug in sfxge is broken and it was never really tested. As it was not including correctly opt_kdtrace.h before it was never enabled so it was kept broken for a while. Fix this by using a protection stub, leaving sfxge driver authors the responsibility for fixing it appropriately [1]. Sponsored by: EMC / Isilon storage division Discussed with: rstone [0] Reported by: rstone [1] Discussed with: philip
# 785797c3	24-Jul-2013	Andriy Gapon <avg@FreeBSD.org>	rename scheduler->swapper and SI_SUB_RUN_SCHEDULER->SI_SUB_LAST Also directly call swapper() at the end of mi_startup instead of relying on swapper being the last thing in sysinits order. Rationale: - "RUN_SCHEDULER" was misleading, scheduling already takes place at that stage - "scheduler" was misleading, the function swaps in the swapped out processes - another SYSINIT(SI_SUB_RUN_SCHEDULER, SI_ORDER_ANY) could never be invoked depending on its relative order with scheduler; this was not obvious and the bug actually used to exist Reviewed by: kib (earlier version) MFC after: 14 days
# 36af9869	26-Oct-2012	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add CPU percentage limit enforcement to RCTL. The resource name is "pcpu". It was implemented by Rudolf Tomori during Google Summer of Code 2012.
# ba96d2d8	22-Aug-2012	John Baldwin <jhb@FreeBSD.org>	Mark the idle threads as non-sleepable and also assert that an idle thread never blocks on a turnstile.
# 37f4e025	11-Aug-2012	Alexander Motin <mav@FreeBSD.org>	Some more minor tunings inspired by bde@.
# 579895df	10-Aug-2012	Alexander Motin <mav@FreeBSD.org>	Some minor tunings/cleanups inspired by bde@ after previous commits: - remove extra dynamic variable initializations; - restore (4BSD) and implement (ULE) hogticks variable setting; - make sched_rr_interval() more tolerant to options; - restore (4BSD) and implement (ULE) kern.sched.quantum sysctl, a more user-friendly wrapper for sched_slice; - tune some sysctl descriptions; - make some style fixes.
# 3d7f4117	09-Aug-2012	Alexander Motin <mav@FreeBSD.org>	Rework r220198 change (by fabient). I believe it solves the problem from the wrong direction. Before it, if preemption and end of time slice happened at the same time, the thread was put at the head of the queue as for preemption only. That could cause a single thread to run for an indefinitely long time. r220198 handles it by not clearing TDF_NEEDRESCHED in case of preemption. But that causes a delayed context switch every time preemption happens, even when not needed. Solve the problem by introducing the scheduler-specific thread flag TDF_SLICEEND, set when a thread's time slice is over and it should be put at the tail of the queue. Using the SW_PREEMPT flag for that purpose, as before, was not informative enough to work correctly. In my tests this reduces run time deviation by 2-3 times (improves fairness) in cases when several threads share one CPU. Reviewed by: fabient MFC after: 2 months Sponsored by: iXsystems, Inc.
# 48317e9e	09-Aug-2012	Alexander Motin <mav@FreeBSD.org>	SCHED_4BSD scheduling quantum mechanism appears to be broken for some time. With switchticks variable being reset each time thread preempted (that is done regularly by interrupt threads) scheduling quantum may never expire. It was not noticed in time because several other factors still regularly trigger context switches. Handle the problem by replacing that mechanism with its equivalent from SCHED_ULE called time slice. It is effectively the same, just measured in context of stathz instead of hz. Some unification is probably not bad.
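The failure mode described in 48317e9e can be illustrated with a deliberately simplified tick-by-tick model. This is not the kernel code: the two functions below are hypothetical, and they only contrast a "ticks since last switch" check, which is reset by every preemption, with a per-thread slice counter, which is not.

```c
/*
 * Old-style model: the quantum expires when 'quantum' ticks have passed
 * since the last switch. A brief preemption every 'preempt_every' ticks
 * resets the reference point, as described for switchticks.
 */
static int
expiries_switchticks(int nticks, int quantum, int preempt_every)
{
	int switchticks = 0, expired = 0;

	for (int ticks = 1; ticks <= nticks; ticks++) {
		if (ticks % preempt_every == 0) {
			switchticks = ticks;	/* reset by the preemption */
			continue;
		}
		if (ticks - switchticks >= quantum) {
			expired++;
			switchticks = ticks;
		}
	}
	return (expired);
}

/*
 * Slice model: a counter is decremented on each tick the thread runs
 * and is left alone when the thread is preempted.
 */
static int
expiries_slice(int nticks, int quantum, int preempt_every)
{
	int slice = quantum, expired = 0;

	for (int ticks = 1; ticks <= nticks; ticks++) {
		if (ticks % preempt_every == 0)
			continue;	/* preempted; slice is preserved */
		if (--slice == 0) {
			expired++;
			slice = quantum;
		}
	}
	return (expired);
}
```

With 100 ticks, a quantum of 10 and a preemption every 5 ticks, the first model never reports an expiry, while the slice model reports one for every 10 ticks the thread actually ran.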
# 2aaae99d	15-May-2012	Sergey Kandaurov <pluknet@FreeBSD.org>	Fix typo in function name SDT_PROBE4 and unbreak 4BSD UP.
# b3e9e682	14-May-2012	Ryan Stone <rstone@FreeBSD.org>	Implement the DTrace sched provider. This implementation aims to be compatible with the sched provider implemented by Solaris and its open-source derivatives. Full documentation of the sched provider can be found on Oracle's DTrace wiki pages. Note that for compatibility with scripts originally written for Solaris, several probes are defined that will never fire. These probes are defined to fire when Solaris-specific features perform certain actions. As these features are not present in FreeBSD, the probes can never fire. Also, I have added two probes that are not defined in Solaris, lend-pri and load-change. These probes have been added to make it possible to collect schedgraph data with DTrace. Finally, a few probes are defined in Solaris to take a cpuinfo_t * argument. As it was not immediately clear to me how to translate that to FreeBSD, currently those probes are passed NULL in place of a cpuinfo_t *. Sponsored by: Sandvine Incorporated MFC after: 2 weeks
# 44ad5475	08-Mar-2012	John Baldwin <jhb@FreeBSD.org>	Add a new sched_clear_name() method to the scheduler interface to clear the cached name used for KTR_SCHED traces when a thread's name changes. This way KTR_SCHED traces (and thus schedgraph) will notice when a thread's name changes, most commonly via execve(). MFC after: 2 weeks
# 7e3a96ea	03-Jan-2012	John Baldwin <jhb@FreeBSD.org>	Some small fixes to CPU accounting for threads: - Only initialize the per-cpu switchticks and switchtime in sched_throw() for the very first context switch on APs during boot. This avoids a small gap between the middle of thread_exit() and sched_throw() where time is not accounted to any thread. - In thread_exit(), update the timestamp bookkeeping to track the changes to mi_switch() introduced by td_rux so that the code once again matches the comment claiming it is mimicking mi_switch(). Specifically, only update the per-thread stats directly and depend on ruxagg() to update p_rux rather than adjusting p_rux directly. While here, move the timestamp bookkeeping as late in the function as possible. Reviewed by: bde, kib MFC after: 1 week
# 6472ac3d	07-Nov-2011	Ed Schouten <ed@FreeBSD.org>	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.
# cd39bb09	26-Aug-2011	Xin LI <delphij@FreeBSD.org>	Fix format strings for KTR_STATE in 4BSD and ULE schedulers. Submitted by: Ivan Klymenko <fidaj@ukr.net> PR: kern/159904, kern/159905 MFC after: 2 weeks Approved by: re (kib)
# a38f1f26	13-Jun-2011	Attilio Rao <attilio@FreeBSD.org>	Remove pc_cpumask and pc_other_cpus usage from MI code. Tested by: pluknet
# d098f930	31-May-2011	Nathan Whitehorn <nwhitehorn@FreeBSD.org>	On multi-core, multi-threaded PPC systems, it is important that the threads be brought up in the order they are enumerated in the device tree (in particular, that thread 0 on each core be brought up first). The SLIST through which we loop to start the CPUs has all of its entries added with SLIST_INSERT_HEAD(), which means it is in reverse order of enumeration and so AP startup would always fail in such situations (causing a machine check or RTAS failure). Fix this by changing the SLIST into an STAILQ, and inserting new CPUs at the end. Reviewed by: jhb
# a0a43452	17-May-2011	Attilio Rao <attilio@FreeBSD.org>	Merge r221285 from largeSMP project: - Remove the following sysctl: kern.sched.ipiwakeup.onecpu kern.sched.ipiwakeup.htt2 Because they are absolutely obsolete. Probably the whole wakeup forward mechanism should be revisited for a better fitting in modern hw, in the future. - As map2 variable is no longer used rename map3 to map2 - Fix a string by making more informative the msg and removing the arguments passing. Reviewed by: julian Tested by: several
# d59dd76c	16-May-2011	Attilio Rao <attilio@FreeBSD.org>	Merge r221278 from largeSMP project: idle_cpus_mask is just used in sched_4bsd, thus make it private for it. Tested by: several
# 71a19bdc	05-May-2011	Attilio Rao <attilio@FreeBSD.org>	Commit the support for removing cpumask_t and replacing it directly with cpuset_t objects. That is going to offer the underlying support for a simple bump of MAXCPU and then support for number of cpus > 32 (as it is today). Right now, cpumask_t is an int, 32 bits on all our supported architectures. cpuset_t, on the other hand, is implemented as an array of longs, and is easily extendible by definition. The architectures touched by this commit are the following: - amd64 - i386 - pc98 - arm - ia64 - XEN while the others are still missing. Userland is believed to be fully converted with the changes contained here. Some technical notes: - This commit may be considered an ABI nop for all the architectures different from amd64 and ia64 (and sparc64 in the future) - per-cpu members, which are now converted to cpuset_t, need to be accessed avoiding migration, because the size of cpuset_t should be considered unknown - size of cpuset_t objects is different from kernel and userland (this is primarily done in order to leave some more space in userland to cope with KBI extensions). If you need to access kernel cpuset_t from the userland please refer to example in this patch on how to do that correctly (kgdb may be a good source, for example). - Support for other architectures is going to be added soon - Only MAXCPU for amd64 is bumped now The patch has been tested by sbruno and Nicholas Esborn on opteron 4 x 12 pack CPUs. More testing on big SMP is expected to come soon. pluknet tested the patch with his 8-ways on both amd64 and i386. Tested by: pluknet, sbruno, gianni, Nicholas Esborn Reviewed by: jeff, jhb, sbruno
# f0283a73	30-Apr-2011	Attilio Rao <attilio@FreeBSD.org>	- Remove the following sysctl: kern.sched.ipiwakeup.onecpu kern.sched.ipiwakeup.htt2 Because they are absolutely obsolete. Probably the whole wakeup forward mechanism should be revisited for a better fitting in modern hw. - As map2 variable is no longer used rename map3 to map2 - Fix a string by making more informative the msg and removing the arguments passing Approved by: julian
# 3121f534	30-Apr-2011	Attilio Rao <attilio@FreeBSD.org>	idle_cpus_mask is just used in the SMP case and within sched_4BSD. Declare appropriately.
# 60dd73b7	26-Apr-2011	Ryan Stone <rstone@FreeBSD.org>	If the 4BSD scheduler tries to schedule a thread that has been pinned or bound to an AP before SMP has started, the system will panic when we try to touch per-CPU state for that AP because that state has not been initialized yet. Fix this in the same way as ULE: place all threads in the global run queue before SMP has started. Reviewed by: jhb MFC after: 1 month
# e806d352	06-Apr-2011	John Baldwin <jhb@FreeBSD.org>	Fix several places to ignore processes that are not yet fully constructed. MFC after: 1 week
# 586cb6ec	31-Mar-2011	Fabien Thomas <fabient@FreeBSD.org>	Clearing the flag when preempting will let the preempted thread run too much time. This can finish in a scheduler deadlock with ping-pong between two threads. One sample of this is: - device lapic (to have a preemption point on critical_exit()) - options DEVICE_POLLING with HZ>1499 (to have lapic freq = hardclock freq) - running a cpu intensive task (that does not enter the kernel) - only one CPU on SMP or no SMP. As requested by jhb@ 4BSD have received the same type of fix instead of propagating the flag to the new thread. Reviewed by: jhb, jeff MFC after: 1 month
# 2dc29adb	14-Jan-2011	John Baldwin <jhb@FreeBSD.org>	Rework realtime priority support: - Move the realtime priority range up above kernel sleep priorities and just below interrupt thread priorities. - Contract the interrupt and kernel sleep priority ranges a bit so that the timesharing priority band can be increased. The new timeshare range is now slightly larger than the old realtime + timeshare ranges. - Change the ULE scheduler to no longer use realtime priorities for interactive threads. Instead, the larger timeshare range is now split into separate subranges for interactive and non-interactive ("batch") threads. The end result is that interactive threads and non-interactive threads still use the same priority ranges as before, but realtime threads now have a separate, dedicated priority range. - Do not modify the priority of non-timeshare threads in sched_sleep() or via cv_broadcastpri(). Realtime and idle priority threads will no longer have their priorities affected by sleeping in the kernel. Reviewed by: jeff
# 52c0b557	13-Jan-2011	Matthew D Fleming <mdf@FreeBSD.org>	One more sysctl(9) type-safety that I missed before.
# 22d19207	06-Jan-2011	John Baldwin <jhb@FreeBSD.org>	- Move sched_fork() later in fork() after the various sections of the new thread and proc have been copied and zeroed from the old thread and proc. Otherwise attempts to modify thread or process data in sched_fork() could be undone. - Don't copy td_{base,}_user_pri from the old thread to the new thread in sched_fork_thread() in ULE. This is already done courtesy the bcopy() of the thread copy region. - Always initialize the real priority (td_priority) of new threads to the new thread's base priority (td_base_pri) to avoid bogusly inheriting a borrowed priority from the parent thread. MFC after: 2 weeks
# c8e368a9	29-Dec-2010	David Xu <davidxu@FreeBSD.org>	- Follow r216313, the sched_unlend_user_prio is no longer needed, always use sched_lend_user_prio to set lent priority. - Improve pthread priority-inherit mutex, when a contender's priority is lowered, repropagate priorities, this may cause mutex owner's priority to be lowered, in old code, mutex owner's priority is rise-only.
# acbe332a	08-Dec-2010	David Xu <davidxu@FreeBSD.org>	MFp4: It is possible a lower priority thread lending priority to higher priority thread, in old code, it is ignored, however the lending should always be recorded, add field td_lend_user_pri to fix the problem, if a thread does not have borrowed priority, its value is PRI_MAX. MFC after: 1 week
# 3e288e62	22-Nov-2010	Dimitry Andric <dim@FreeBSD.org>	After some off-list discussion, revert a number of changes to the DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
# 31c6a003	14-Nov-2010	Dimitry Andric <dim@FreeBSD.org>	Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree.
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# a157e425	13-Sep-2010	Alexander Motin <mav@FreeBSD.org>	Refactor timer management code with priority to one-shot operation mode. The main goal of this is to generate timer interrupts only when there is some work to do. When CPU is busy interrupts are generated at full rate of hz + stathz to fulfill scheduler and timekeeping requirements. But when CPU is idle, only minimum set of interrupts (down to 8 interrupts per second per CPU now), needed to handle scheduled callouts is executed. This allows significantly increase idle CPU sleep time, increasing effect of static power-saving technologies. Also it should reduce host CPU load on virtualized systems, when guest system is idle. There is set of tunables, also available as writable sysctls, allowing to control wanted event timer subsystem behavior: kern.eventtimer.timer - allows to choose event timer hardware to use. On x86 there is up to 4 different kinds of timers. Depending on whether chosen timer is per-CPU, behavior of other options slightly differs. kern.eventtimer.periodic - allows to choose periodic and one-shot operation mode. In periodic mode, current timer hardware taken as the only source of time for time events. This mode is quite alike to previous kernel behavior. One-shot mode instead uses currently selected time counter hardware to schedule all needed events one by one and program timer to generate interrupt exactly in specified time. Default value depends on chosen timer capabilities, but one-shot mode is preferred, until other is forced by user or hardware. kern.eventtimer.singlemul - in periodic mode specifies how many times higher timer frequency should be, to not strictly alias hardclock() and statclock() events. Default values are 2 and 4, but could be reduced to 1 if extra interrupts are unwanted. kern.eventtimer.idletick - makes each CPU to receive every timer interrupt independently of whether they busy or not. By default this option is disabled. If chosen timer is per-CPU and runs in periodic mode, this option has no effect - all interrupts are generated. As soon as this patch modifies cpu_idle() on some platforms, I have also refactored one on x86. Now it makes use of MONITOR/MWAIT instructions (if supported) under high sleep/wakeup rate, as fast alternative to other methods. It allows SMP scheduler to wake up sleeping CPUs much faster without using IPI, significantly increasing performance on some highly task-switching loads. Tested by: many (on i386, amd64, sparc64 and powerpc) H/W donated by: Gheorghe Ardelean Sponsored by: iXsystems, Inc.
# b722ad00	11-Sep-2010	Alexander Motin <mav@FreeBSD.org>	Merge some SCHED_ULE features to SCHED_4BSD: - Teach SCHED_4BSD to inform cpu_idle() about high sleep/wakeup rate to choose optimized handler. In case of x86 it is MONITOR/MWAIT. Also it will be needed to bypass forthcoming idle tick skipping logic to not consume resources on events rescheduling when it won't give any benefits. - Teach SCHED_4BSD to wake up idle CPUs without using IPI. In case of x86, when MONITOR/MWAIT is active, it requires just a single memory write. This doubles performance on some heavily switching test loads.
# d9d8d144	06-Aug-2010	John Baldwin <jhb@FreeBSD.org>	Add a new ipi_cpu() function to the MI IPI API that can be used to send an IPI to a specific CPU by its cpuid. Replace calls to ipi_selected() that constructed a mask for a single CPU with calls to ipi_cpu() instead. This will matter more in the future when we transition from cpumask_t to cpuset_t for CPU masks in which case building a CPU mask is more expensive. Submitted by: peter, sbruno Reviewed by: rookie Obtained from: Yahoo! (x86) MFC after: 1 month
# 3aa6d94e	11-Jun-2010	John Baldwin <jhb@FreeBSD.org>	Update several places that iterate over CPUs to use CPU_FOREACH().
# 3da35a0a	03-Jun-2010	John Baldwin <jhb@FreeBSD.org>	Assert that the thread lock is held in sched_pctcpu() instead of recursively acquiring it. All of the current callers already hold the lock. MFC after: 1 month
# 1d7830ed	21-May-2010	John Baldwin <jhb@FreeBSD.org>	Assert that the thread passed to sched_bind() and sched_unbind() is curthread as those routines are only supported for curthread currently. MFC after: 1 month
# 95e1c1fa	08-Feb-2010	Attilio Rao <attilio@FreeBSD.org>	MFC r202889, r202940: - Fix a race in sched_switch() of sched_4bsd. Block the td_lock when acquiring explicitly sched_lock in order to prevent races with other td_lock contenders. - Merge the ULE's internal function thread_block_switch() into the global thread_lock_block() and make the former semantic as the default for thread_lock_block(). - Split out an invariant in order to have better checks.
# aa16019d	25-Jan-2010	Attilio Rao <attilio@FreeBSD.org>	MFC r201790: - Set td_slptick to 0 when moving threads out of sleepqueues. - Move td_slptick from u_int to int in order to follow 'ticks' signedness and wrap up accordingly. Sponsored by: Sandvine Incorporated
# 58060789	24-Jan-2010	Attilio Rao <attilio@FreeBSD.org>	Split out an invariant in order to better check that newtd, when provided, must be on a runqueue. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com> MFC: 2 weeks X-MFC: r202889
# b0b9dee5	23-Jan-2010	Attilio Rao <attilio@FreeBSD.org>	- Fix a race in sched_switch() of sched_4bsd. In the case of the thread being on a sleepqueue or a turnstile, the sched_lock was acquired (without the aid of the td_lock interface) and the td_lock was dropped. This was going to break locking rules on other threads wanting to access the thread (via the td_lock interface) and modify its flags (allowed as long as the container lock was different from the one used in sched_switch). In order to prevent this situation, while sched_lock is acquired there the td_lock gets blocked. [0] - Merge the ULE's internal function thread_block_switch() into the global thread_lock_block() and make the former semantic as the default for thread_lock_block(). This means that thread_lock_block() will not disable interrupts when called (and consequently thread_unlock_block() will not re-enable them when called). This should be done manually when necessary. Note, however, that ULE's thread_unblock_switch() is not reaped because it does reflect a difference in semantic due in ULE (the td_lock may not be necessarily still blocked_lock when calling this). While asymmetric, it does describe a remarkable difference in semantic that is good to keep in mind. [0] Reported by: Kohji Okuno <okuno dot kohji at jp dot panasonic dot com> Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com> MFC: 2 weeks
# 6eac7e57	08-Jan-2010	Attilio Rao <attilio@FreeBSD.org>	- Fix a bug in sched_4bsd where the timestamp for the sleeping operation is not cleaned up on the wakeup but reset. This is harmless mostly because td_slptick (and ki_slptime from userland) should be analyzed only with the assumption that the thread is actually sleeping (thus while the td_slptick is correctly set) but without this invariant the number is no longer consistent. - Move td_slptick from u_int to int in order to follow 'ticks' signedness and wrap up accordingly [0] [0] Submitted by: emaste Sponsored by: Sandvine Incorporated MFC 1 week
# e1c0f124	07-Jan-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r201347: Allow swap out of the kernel stack for the thread with priority greater than or equal to PSOCK, not less than or equal.
# 17c4c356	31-Dec-2009	Konstantin Belousov <kib@FreeBSD.org>	Allow swap out of the kernel stack for the thread with priority greater than or equal to PSOCK, not less than or equal. Higher priority has lesser numerical value. Existing test does not allow for swapout of the thread waiting for advisory lock, for exiting child or sleeping for timeout. On the other hand, high-priority waiters of VFS/VM events can be swapped out. Tested by: pho Reviewed by: jhb MFC after: 1 week
# 1b9d701f	03-Nov-2009	Attilio Rao <attilio@FreeBSD.org>	Split P_NOLOAD into a per-thread flag (TDF_NOLOAD). This improvement aims to avoid further cache-misses in scheduler specific functions which need to keep track of average thread running time and further locking in places setting this flag. Reported by: jeff (originally), kris (currently) Reviewed by: jhb Tested by: Giuseppe Cocomazzi <sbudella at email dot it>
# 0d2cf837	25-Jan-2009	Jeff Roberson <jeff@FreeBSD.org>	- Use __XSTRING where I want the define to be expanded. This resulted in sizeof("MAXCPU") being used to calculate a string length rather than something more reasonable such as sizeof("32"). This shouldn't have caused any ill effect until we run on machines with 1000000 or more cpus.
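The stringification pitfall fixed in 0d2cf837 can be reproduced in isolation. The sketch below is illustrative only: `MY_STRING`, `MY_XSTRING`, `DEMO_MAXCPU` and `name_buffer_sizes()` are hypothetical stand-ins defined here, not the kernel's macros or values. It shows that stringifying a macro name directly measures the name itself, while going through a second macro level expands the value first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for one-level and two-level stringify macros. */
#define MY_STRING(x)	#x		/* stringify the token as written */
#define MY_XSTRING(x)	MY_STRING(x)	/* expand the argument, then stringify */

#define DEMO_MAXCPU	32		/* illustrative value only */

/*
 * Report the buffer sizes (including the terminating NUL) produced by
 * each macro: sizeof("DEMO_MAXCPU") versus sizeof("32").
 */
static void
name_buffer_sizes(size_t *unexpanded, size_t *expanded)
{
	assert(unexpanded != NULL && expanded != NULL);
	*unexpanded = sizeof(MY_STRING(DEMO_MAXCPU));
	*expanded = sizeof(MY_XSTRING(DEMO_MAXCPU));
}
```

The unexpanded form sizes the buffer by the length of the macro's name, which happens to be large enough for small values and explains why the bug was harmless in practice.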
# 8f51ad55	17-Jan-2009	Jeff Roberson <jeff@FreeBSD.org>	- Implement generic macros for producing KTR records that are compatible with src/tools/sched/schedgraph.py. This allows developers to quickly create a graphical view of ktr data for any resource in the system. - Add sched_tdname() and the pcpu field 'name' for quickly and uniformly identifying records associated with a thread or cpu. - Reimplement the KTR_SCHED traces using the new generic facility. Obtained from: attilio Discussed with: jhb Sponsored by: Nokia
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# c3ea3378	28-Jul-2008	John Baldwin <jhb@FreeBSD.org>	When choosing a CPU for a thread in a cpuset, prefer the last CPU that the thread ran on if there are no other CPUs in the set with a shorter per-CPU runqueue.
# f200843b	28-Jul-2008	John Baldwin <jhb@FreeBSD.org>	Implement support for cpusets in the 4BSD scheduler. - When a cpuset is applied to a thread, walk the cpuset to see if it is a "full" cpuset (includes all available CPUs). If not, set a new TDS_AFFINITY flag to indicate that this thread can't run on all CPUs. When inheriting a cpuset from another thread during thread creation, the new thread also inherits this flag. It is in a new ts_flags field in td_sched rather than using one of the TDF_SCHEDx flags because fork() clears td_flags after invoking sched_fork(). - When placing a thread on a runqueue via sched_add(), if the thread is not pinned or bound but has the TDS_AFFINITY flag set, then invoke a new routine (sched_pickcpu()) to pick a CPU for the thread to run on next. sched_pickcpu() walks the cpuset and picks the CPU with the shortest per-CPU runqueue length. Note that the reason for the TDS_AFFINITY flag is to avoid having to walk the cpuset and examine runq lengths in the common case. - To avoid walking the per-CPU runqueues in sched_pickcpu(), add an array of counters to hold the length of the per-CPU runqueues and update them when adding and removing threads to per-CPU runqueues. MFC after: 2 weeks
# 8aa3d7ff	28-Jul-2008	John Baldwin <jhb@FreeBSD.org>	Various and sundry style and whitespace fixes.
# 6f5f25e5	24-May-2008	John Birrell <jb@FreeBSD.org>	Add the vtime (virtual time) hooks for DTrace.
# 6c47aaae	24-Apr-2008	Jeff Roberson <jeff@FreeBSD.org>	- Add an integer argument to idle to indicate how likely we are to wake from idle over the next tick. - Add a new MD routine, cpu_wake_idle() to wakeup idle threads who are suspended in cpu specific states. This function can fail and cause the scheduler to fall back to another mechanism (ipi). - Implement support for mwait in cpu_idle() on i386/amd64 machines that support it. mwait is a higher performance way to synchronize cpus as compared to hlt & ipis. - Allow selecting the idle routine by name via sysctl machdep.idle. This replaces machdep.cpu_idle_hlt. Only idle routines supported by the current machine are permitted. Sponsored by: Nokia
# 8df78c41	16-Apr-2008	Jeff Roberson <jeff@FreeBSD.org>	- Make SCHED_STATS more generic by adding a wrapper to create the variables and sysctl nodes. - In reset walk the children of kern_sched_stats and reset the counters via the oid_arg1 pointer. This allows us to add arbitrary counters to the tree and still reset them properly. - Define a set of switch types to be passed with flags to mi_switch(). These types are named SWT_*. These types correspond to SCHED_STATS counters and are automatically handled in this way. - Make the new SWT_ types more specific than the older switch stats. There are now stats for idle switches, remote idle wakeups, remote preemption ithreads idling, etc. - Add switch statistics for ULE's pickcpu algorithm. These stats include how much migration there is, how often affinity was successful, how often threads were migrated to the local cpu on wakeup, etc. Sponsored by: Nokia
# 9727e637	19-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Restore runq to manipulating threads directly by putting runq links and rqindex back in struct thread. - Compile kern_switch.c independently again and stop #include'ing it from schedulers. - Remove the ts_thread backpointers and convert most code to go from struct thread to struct td_sched. - Cleanup the ts_flags #define garbage that was causing us to sometimes do things that expanded to td->td_sched->ts_thread->td_flags in 4BSD. - Export the kern.sched sysctl node in sysctl.h
# 8b16c208	19-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- ULE and 4BSD share only one line of code from sched_newthread() so implement the required pieces in sched_fork_thread(). The td_sched pointer is already setup by thread_init anyway.
# a90f3f25	19-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Move maybe_preempt() from kern_switch.c to sched_4bsd.c. This function is only used by 4bsd. - Create a new runq_choose_fuzz() function rather than polluting runq_choose() with 4BSD specific code. - Move the fuzz sysctl into sched_4bsd.c - Remove some dead code from kern_switch.c
# a564bfc7	19-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Directly include opt_sched.h in sched_4bsd.
# 374ae2a3	19-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Relax requirements for p_numthreads, p_threads, p_swtick, and p_nice from requiring the per-process spinlock to only requiring the process lock. - Reflect these changes in the proc.h documentation and consumers throughout the kernel. This is a substantial reduction in locking cost for these fields and was made possible by recent changes to threading support.
# 237fdd78	16-Mar-2008	Robert Watson <rwatson@FreeBSD.org>	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
# 6617724c	12-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	Remove kernel support for M:N threading. While the KSE project was quite successful in bringing threading to FreeBSD, the M:N approach taken by the kse library was never developed to its full potential. Backwards compatibility will be provided via libmap.conf for dynamically linked binaries and static binaries will be broken.
# c5aa6b58	12-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Pass the priority argument from sleep() into sleepq and down into sched_sleep(). This removes extra thread_lock() acquisition and allows the scheduler to decide what to do with the static boost. - Change the priority arguments to cv_ to match sleepq/msleep/etc. where 0 means no priority change. Catch -1 in cv_broadcastpri() and convert it to 0 for now. - Set a flag when sleeping in a way that is compatible with swapping since direct priority comparisons are meaningless now. - Add a sysctl to ule, kern.sched.static_boost, that defaults to on which controls the boost behavior. Turning it off gives better performance in some workloads but needs more investigation. - While we're modifying sleepq, change signal and broadcast to both return with the lock held as the lock was held on enter. Reviewed by: jhb, peter
# 1e24c28f	09-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Add a sched_preempt() routine to be called by md code after IPI_PREEMPT is delivered. - Add a simple implementation to 4bsd.
# f5a3ef99	02-Mar-2008	Marcel Moolenaar <marcel@FreeBSD.org>	Unbreak after cpuset: initialize td_cpuset in sched_fork_thread().
# 885d51a3	02-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Add a new sched_affinity() api to be used in the upcoming cpuset implementation. - Add empty implementations of sched_affinity() to 4BSD and ULE. Sponsored by: Nokia
# eea4f254	15-Dec-2007	Jeff Roberson <jeff@FreeBSD.org>	- Re-implement lock profiling in such a way that it no longer breaks the ABI when enabled. There is no longer an embedded lock_profile_object in each lock. Instead a list of lock_profile_objects is kept per-thread for each lock it may own. The cnt_hold statistic is now always 0 to facilitate this. - Support shared locking by tracking individual lock instances and statistics in the per-thread per-instance lock_profile_object. - Make the lock profiling hash table a per-cpu singly linked list with a per-cpu static lock_prof allocator. This removes the need for an array of spinlocks and reduces cache contention between cores. - Use a separate hash for spinlocks and other locks so that only a critical_enter() is required and not a spinlock_enter() to modify the per-cpu tables. - Count time spent spinning in the lock statistics. - Remove the LOCK_PROFILE_SHARED option as it is always supported now. - Specifically drop and release the scheduler locks in both schedulers since we track owners now. In collaboration with: Kip Macy Sponsored by: Nokia
# 435806d3	11-Dec-2007	David Xu <davidxu@FreeBSD.org>	Fix LOR of thread lock and umtx's priority propagation mutex due to the reworking of scheduler lock. MFC: after 3 days
# 431f8906	13-Nov-2007	Julian Elischer <julian@FreeBSD.org>	generally we are interested in what thread did something as opposed to what process. Since threads by default have the name of the process unless over-written with more useful information, just print the thread name instead.
# 088f5849	04-Nov-2007	Robert Watson <rwatson@FreeBSD.org>	Remove unused variable td from sched_idletd(). MFC after: 3 days Found with: Coverity Prevent(tm) CID: 3561
# 9dddab6f	27-Oct-2007	John Baldwin <jhb@FreeBSD.org>	Change the roundrobin implementation in the 4BSD scheduler to trigger a userland preemption directly from hardclock() via sched_clock() when a thread uses up a full quantum instead of using a periodic timeout to cause a userland preemption every so often. This fixes a potential deadlock when IPI_PREEMPTION isn't enabled where softclock blocks on a lock held by a thread pinned or bound to another CPU. The current thread on that CPU will never be preempted while softclock is blocked. Note that ULE already drives its round-robin userland preemption from sched_clock() as well and always enables IPI_PREEMPT. MFC after: 1 week
# 7ab24ea3	26-Oct-2007	Julian Elischer <julian@FreeBSD.org>	Introduce a way to make pure kernel threads. kthread_add() takes the same parameters as the old kthread_create() plus a pointer to a process structure, and adds a kernel thread to that process. kproc_kthread_add() takes the parameters for kthread_add, plus a process name and a pointer to a pointer to a process instead of just a pointer, and if the proc * is NULL, it creates the process to the specifications required, before adding the thread to it. All other old kthread_xxx() calls return, but act on (struct thread *) instead of (struct proc *). One reason to change the name is so that any old kernel modules that are lying around and expect kthread_create() to make a process will not just accidentally link. fix top to show kernel threads by their thread name in -SH mode add a tdnam formatting option to ps to show thread names. make all idle threads actual kthreads and put them into their own idled process. make all interrupt threads kthreads and put them in an interd process (mainly for aesthetic and accounting reasons) rename proc 0 to be 'kernel' and its swapper thread is now 'swapper' man page fixes to follow.
# 05dc0eb2	08-Oct-2007	Jeff Roberson <jeff@FreeBSD.org>	- Restore historical sched_yield() behavior by changing sched_relinquish() to simply switch rather than lowering priority and switching. This allows threads of equal priority to run but not lesser priority. Discussed with: davidxu Reported by: NIIMI Satoshi <sa2c@sa2c.net> Approved by: re
# 54b0e65f	20-Sep-2007	Jeff Roberson <jeff@FreeBSD.org>	- Redefine p_swtime and td_slptime as p_swtick and td_slptick. This changes the units from seconds to the value of 'ticks' when swapped in/out. ULE does not have a periodic timer that scans all threads in the system and as such maintaining a per-second counter is difficult. - Change computations requiring the unit in seconds to subtract ticks and divide by hz. This does make the wraparound condition hz times more frequent but this is still in the range of several months to years and the adverse effects are minimal. Approved by: re
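The "subtract ticks and divide by hz" computation mentioned in 54b0e65f can be sketched as follows. This is an assumption-laden illustration rather than the kernel's code: `ticks_to_seconds()` is a hypothetical helper, and the unsigned subtraction is one common way to tolerate a single wrap of a signed tick counter.

```c
#include <assert.h>

/*
 * Hypothetical helper: whole seconds elapsed between two samples of a
 * tick counter. 'now' and 'then' are tick values, 'hz' is the number of
 * ticks per second.
 */
static int
ticks_to_seconds(int now, int then, int hz)
{
	unsigned int delta;

	assert(hz > 0);
	/* Unsigned arithmetic keeps the delta correct across one wraparound. */
	delta = (unsigned int)now - (unsigned int)then;
	return ((int)(delta / (unsigned int)hz));
}
```

For example, with hz set to 1000, samples of 2000 and 5000 ticks are 3 seconds apart. Because the counter advances hz times per second, it wraps hz times sooner than a per-second counter would, which is the trade-off the commit message describes.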
# b61ce5b0	16-Sep-2007	Jeff Roberson <jeff@FreeBSD.org>	- Move all of the PS_ flags into either p_flag or td_flags. - p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or previously the sched_lock. These bugs have existed for some time. - Allow swapout to try each thread in a process individually and then swapin the whole process if any of these fail. This allows us to move most scheduler related swap flags into td_flags. - Keep ki_sflag for backwards compat but change all in source tools to use the new and more correct location of P_INMEM. Reported by: pho Reviewed by: attilio, kib Approved by: re (kensmith)
# 6ea38de8	18-Jul-2007	Jeff Roberson <jeff@FreeBSD.org>	- Remove the global definition of sched_lock in mutex.h to break new code and third party modules which try to depend on it. - Initialize sched_lock in sched_4bsd.c. - Declare sched_lock in sparc64 pmap.c and assert that we're compiling with SCHED_4BSD to prevent accidental crashes from running ULE. This is the sole remaining file outside of the scheduler that uses the global sched_lock. Approved by: re
# fe54587f	12-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	- Move some common code out of sched_fork_exit() and back into fork_exit().
# 710eacdc	05-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	- Placing the 'volatile' on the right side of the * in the td_lock declaration removes the need for __DEVOLATILE(). Pointed out by: tegge
# 95e3a0bc	04-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	- Better fix for previous error; use DEVOLATILE on the td_lock pointer it can actually sometimes be something other than sched_lock even on schedulers which rely on a global scheduler lock. Tested by: kan
# c219b097	04-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	- Pass &sched_lock as the third argument to cpu_switch() as this will always be the correct lock and we don't get volatile warnings this way. Pointed out by: kan
# 7b20fb19	04-Jun-2007	Jeff Roberson <jeff@FreeBSD.org>	Commit 1/14 of sched_lock decomposition. - Move all scheduler locking into the schedulers utilizing a technique similar to solaris's container locking. - A per-process spinlock is now used to protect the queue of threads, thread count, suspension count, p_sflags, and other process related scheduling fields. - The new thread lock is actually a pointer to a spinlock for the container that the thread is currently owned by. The container may be a turnstile, sleepqueue, or run queue. - thread_lock() is now used to protect access to thread related scheduling fields. thread_unlock() unlocks the lock and thread_set_lock() implements the transition from one lock to another. - A new "blocked_lock" is used in cases where it is not safe to hold the actual thread's lock yet we must prevent access to the thread. - sched_throw() and sched_fork_exit() are introduced to allow the schedulers to fix-up locking at these points. - Add some minor infrastructure for optionally exporting scheduler statistics that were invaluable in solving performance problems with this patch. Generally these statistics allow you to differentiate between different causes of context switches. Tested by: kris, current@ Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc. Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)
# 4d70511a	27-Feb-2007	John Baldwin <jhb@FreeBSD.org>	Use pause() rather than tsleep() on stack variables and function pointers.
# c6226eea	01-Feb-2007	Julian Elischer <julian@FreeBSD.org>	Move the setting of the idle_mask bits to a place where they can't be wrong. Also use the IDLETD bit in the thread mask to test if it's an idle thread rather than doing a PCPU access.
# f0393f06	23-Jan-2007	Jeff Roberson <jeff@FreeBSD.org>	- Remove setrunqueue and replace it with direct calls to sched_add(). setrunqueue() was mostly empty. The few asserts and thread state setting were moved to the individual schedulers. sched_add() was chosen to displace it for naming consistency reasons. - Remove adjustrunqueue, it was 4 lines of code that was ifdef'd to be different on all three schedulers where it was only called in one place each. - Remove the long ifdef'd out remrunqueue code. - Remove the now redundant ts_state. Inspect the thread state directly. - Don't set TSF_* flags from kern_switch.c, we were only doing this to support a feature in one scheduler. - Change sched_choose() to return a thread rather than a td_sched. Also, rely on the schedulers to return the idlethread. This simplifies the logic in choosethread(). Aside from the run queue links kern_switch.c mostly does not care about the contents of td_sched. Discussed with: julian - Move the idle thread loop into the per scheduler area. ULE wants to do something different from the other schedulers. Suggested by: jhb Tested on: x86/amd64 sched_{4BSD, ULE, CORE}.
# 2da78e38	31-Dec-2006	Robert Watson <rwatson@FreeBSD.org>	Prefer a more traditional spelling of inhibited in comments and panic messages.
# 34e1241b	23-Dec-2006	David Xu <davidxu@FreeBSD.org>	Fix typo, p_slptime should be td_slptime.
# ad1e7d28	05-Dec-2006	Julian Elischer <julian@FreeBSD.org>	Threading cleanup.. part 2 of several. Make part of John Birrell's KSE patch permanent.. Specifically, remove: Any reference of the ksegrp structure. This feature was never fully utilised and made things overly complicated. All code in the scheduler that tried to make threaded programs fair to unthreaded programs. Libpthread processes will already do this to some extent and libthr processes already disable it. Also: Since this makes such a big change to the scheduler(s), take the opportunity to rename some structures and elements that had to be moved anyhow. This makes the code a lot more readable. The ULE scheduler compiles again but I have no idea if it works. The 4bsd scheduler still requires a little cleaning and some functions that now do ALMOST nothing will go away, but I thought I'd do that as a separate commit. Tested by David Xu, and Dan Eischen using libthr and libpthread.
# de38cd9d	20-Nov-2006	Julian Elischer <julian@FreeBSD.org>	whitespace fix only
# 65338575	13-Nov-2006	David Xu <davidxu@FreeBSD.org>	Fix a copy-paste bug in NON-KSE case.
# 5a215147	11-Nov-2006	David Xu <davidxu@FreeBSD.org>	Unbreak userland priority inheriting in NO_KSE case.
# 8460a577	26-Oct-2006	John Birrell <jb@FreeBSD.org>	Make KSE a kernel option, turned on by default in all GENERIC kernel configs except sun4v (which doesn't process signals properly with KSE). Reviewed by: davidxu@
# 3db720fd	25-Aug-2006	David Xu <davidxu@FreeBSD.org>	Add user priority loaning code to support priority propagation for 1:1 threading's POSIX priority mutexes, the code is no-op unless priority-aware umtx code is committed.
# 1f36c876	02-Jul-2006	Maxim Konovalov <maxim@FreeBSD.org>	o Fix grammar in the comment, indent macros. No functional changes.
# 75d960eb	02-Jul-2006	Maxim Konovalov <maxim@FreeBSD.org>	o Remove rev. 1.57 leftover, not reached code.
# 2e4db89c	29-Jun-2006	David E. O'Brien <obrien@FreeBSD.org>	Fix building with GCC 4.2: define data types before referring to them.
# 36ec198b	15-Jun-2006	David Xu <davidxu@FreeBSD.org>	Add scheduler API sched_relinquish(), the API is used to implement yield() and sched_yield() syscalls. Every scheduler has its own way to relinquish cpu, the ULE and CORE schedulers have two internal run-queues, a timesharing thread which calls yield() syscall should be moved to inactive queue.
# b41f1452	13-Jun-2006	David Xu <davidxu@FreeBSD.org>	Add scheduler CORE, the work I have done half a year ago, recent, I picked it up again. The scheduler is forked from ULE, but the algorithm to detect an interactive process is almost completely different with ULE, it comes from Linux paper "Understanding the Linux 2.6.8.1 CPU Scheduler", although I still use same word "score" as a priority boost in ULE scheduler. Briefly, the scheduler has following characteristic: 1. Timesharing process's nice value is seriously respected, timeslice and interaction detecting algorithm are based on nice value. 2. per-cpu scheduling queue and load balancing. 3. O(1) scheduling. 4. Some cpu affinity code in wakeup path. 5. Support POSIX SCHED_FIFO and SCHED_RR. Unlike scheduler 4BSD and ULE which using fuzzy RQ_PPQ, the scheduler uses 256 priority queues. Unlike ULE which using pull and push, the scheduler uses pull method, the main reason is to let relative idle cpu do the work, but current the whole scheduler is protected by the big sched_lock, so the benefit is not visible, it really can be worse than nothing because all other cpu are locked out when we are doing balancing work, which the 4BSD scheduler does not have this problem. The scheduler does not support hyperthreading very well, in fact, the scheduler does not make the difference between physical CPU and logical CPU, this should be improved in the future. The scheduler has priority inversion problem on MP machine, it is not good for realtime scheduling, it can cause realtime process starving. As a result, it seems the MySQL super-smack runs better on my Pentium-D machine when using libthr, despite on UP or SMP kernel.
# 0ae716e5	05-Jun-2006	David Xu <davidxu@FreeBSD.org>	Make ke_rqindex unsigned.
# 5c06d111	27-Apr-2006	John-Mark Gurney <jmg@FreeBSD.org>	back out for now... revert ccpu to being kern.ccpu...
# c71ce6a4	26-Apr-2006	John-Mark Gurney <jmg@FreeBSD.org>	move remaining sysctl into the kern.sched tree...
# 0f180a7c	17-Apr-2006	John Baldwin <jhb@FreeBSD.org>	Change msleep() and tsleep() to not alter the calling thread's priority if the specified priority is zero. This avoids a race where the calling thread could read a snapshot of its current priority, then a different thread could change the first thread's priority, then the original thread would call sched_prio() inside msleep() undoing the change made by the second thread. I used a priority of zero as no thread that calls msleep() or tsleep() should be specifying a priority of zero anyway. The various places that passed 'curthread->td_priority' or some variant as the priority now pass 0.
# 4da0d332	23-Jun-2005	Peter Wemm <peter@FreeBSD.org>	Move HWPMC_HOOKS into its own opt_hwpmc_hooks.h file. It doesn't merit being in opt_global.h and forcing a global recompile when only a few files reference it. Approved by: re
# a3f2d842	09-Jun-2005	Stephan Uphoff <ups@FreeBSD.org>	Lots of whitespace cleanup. Fix for broken if condition. Submitted by: nate@
# f3a0f873	09-Jun-2005	Stephan Uphoff <ups@FreeBSD.org>	Fix some race conditions for pinned threads that may cause them to run on the wrong CPU. Add IPI support for preempting a thread on another CPU. MFC after: 3 weeks
# ebccf1e3	18-Apr-2005	Joseph Koshy <jkoshy@FreeBSD.org>	Bring a working snapshot of hwpmc(4), its associated libraries, userland utilities and documentation into -CURRENT. Bump FreeBSD_version. Reviewed by: alc, jhb (kernel changes)
# f3050486	15-Apr-2005	Maxim Konovalov <maxim@FreeBSD.org>	Fix a typo in the comment. Noticed by: Samy Al Bahra
# 77918643	07-Apr-2005	Stephan Uphoff <ups@FreeBSD.org>	Sprinkle some volatile magic and rearrange things a bit to avoid race conditions in critical_exit now that it no longer blocks interrupts. Reviewed by: jhb
# f5c157d9	30-Dec-2004	John Baldwin <jhb@FreeBSD.org>	Rework the interface between priority propagation (lending) and the schedulers a bit to ensure more correct handling of priorities and fewer priority inversions: - Add two functions to the sched(9) API to handle priority lending: sched_lend_prio() and sched_unlend_prio(). The turnstile code uses these functions to ask the scheduler to lend a thread a set priority and to tell the scheduler when it thinks it is ok for a thread to stop borrowing priority. The unlend case is slightly complex in that the turnstile code tells the scheduler what the minimum priority of the thread needs to be to satisfy the requirements of any other threads blocked on locks owned by the thread in question. The scheduler then decides whether the thread can go back to normal mode (if its normal priority is high enough to satisfy the pending lock requests) or if it should continue to use the priority specified to the sched_unlend_prio() call. This involves adding a new per-thread flag TDF_BORROWING that replaces the ULE-only kse flag for priority elevation. - Schedulers now refuse to lower the priority of a thread that is currently borrowing another thread's priority. - If a scheduler changes the priority of a thread that is currently sitting on a turnstile, it will call a new function turnstile_adjust() to inform the turnstile code of the change. This function resorts the thread on the priority list of the turnstile if needed, and if the thread ends up at the head of the list (due to having the highest priority) and its priority was raised, then it will propagate that new priority to the owner of the lock it is blocked on. Some additional fixes specific to the 4BSD scheduler include: - Common code for updating the priority of a thread when the user priority of its associated kse group has been consolidated in a new static function resetpriority_thread(). 
One change to this function is that it will now only adjust the priority of a thread if it already has a time sharing priority, thus preserving any boosts from a tsleep() until the thread returns to userland. Also, resetpriority() no longer calls maybe_resched() on each thread in the group. Instead, the code calling resetpriority() is responsible for calling resetpriority_thread() on any threads that need to be updated. - schedcpu() now uses resetpriority_thread() instead of just calling sched_prio() directly after it updates a kse group's user priority. - sched_clock() now uses resetpriority_thread() rather than writing directly to td_priority. - sched_nice() now updates all the priorities of the threads after the group priority has been adjusted. Discussed with: bde Reviewed by: ups, jeffr Tested on: 4bsd, ule Tested on: i386, alpha, sparc64
# 907bdbc2	25-Dec-2004	Jeff Roberson <jeff@FreeBSD.org>	- Wrap the thread count adjustment in sched_load_add() and sched_load_rem() so that we may place some ktr entries nearby. - Define other KTR_SCHED tracepoints so that we may graph the operation of the scheduler.
# 7842f65e	14-Dec-2004	Jeff Roberson <jeff@FreeBSD.org>	- Garbage collect several unused members of struct kse and struct ksegrp. As best as I can tell, some of these were never used.
# 56564741	07-Dec-2004	Stephan Uphoff <ups@FreeBSD.org>	Propagate TDF_NEEDRESCHED to replacement thread in sched_switch(). Reviewed by: julian, jhb (in October) Approved by: sam (mentor) MFC after: 4 weeks
# c20c691b	05-Oct-2004	Julian Elischer <julian@FreeBSD.org>	When preempting a thread, put it back on the HEAD of its run queue. (Only really implemented in 4bsd) MFC after: 4 days
# d39063f2	05-Oct-2004	Julian Elischer <julian@FreeBSD.org>	Use some macros to track available scheduler slots to allow easier debugging. MFC after: 4 days
# 14f0e2e9	16-Sep-2004	Julian Elischer <julian@FreeBSD.org>	clean up thread runq accounting a bit. MFC after: 3 days
# b2578c6c	13-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Add some kasserts
# 1e7fad6b	11-Sep-2004	Scott Long <scottl@FreeBSD.org>	Revert the previous round of changes to td_pinned. The scheduler isn't fully initialized when the pmap layer tries to call sched_pin() early in the boot and results in a quick panic. Use ke_pinned instead as was originally done with Tor's patch. Approved by: julian
# 5c854acc	10-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Make up my mind if cpu pinning is stored in the thread structure or the scheduler specific extension to it. Put it in the extension as the implementation details of how the pinning is done needn't be visible outside the scheduler. Submitted by: tegge (of course!) (with changes) MFC after: 3 days
# 3389af30	10-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Add some code to allow threads to nominate a sibling to run if they are going to sleep. MFC after: 1 week
# 6a574b2a	06-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Don't do IPIs on behalf of interrupt threads. Just punt straight on through to the preemption code. Make a KASSERT out of a condition that can no longer occur. MFC after: 1 week
# 0fe38d47	05-Sep-2004	Julian Elischer <julian@FreeBSD.org>	slight code cleanup MFC after: 1 week
# bce73aed	04-Sep-2004	Julian Elischer <julian@FreeBSD.org>	turn on IPIs for 4bsd scheduler by default. MFC after: 1 week
# ed062c8d	04-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Refactor a bunch of scheduler code to give basically the same behaviour but with slightly cleaned up interfaces. The KSE structure has become the same as the "per thread scheduler private data" structure. In order to not make the diffs too great one is #defined as the other at this time. The KSE (or td_sched) structure is now allocated per thread and has no allocation code of its own. Concurrency for a KSEGRP is now kept track of via a simple pair of counters rather than using KSE structures as tokens. Since the KSE structure is different in each scheduler, kern_switch.c is now included at the end of each scheduler. Nothing outside the scheduler knows the contents of the KSE (aka td_sched) structure. The fields in the ksegrp structure that are to do with the scheduler's queueing mechanisms are now moved to the kg_sched structure. (per ksegrp scheduler private data structure). In other words how the scheduler queues and keeps track of threads is no-one's business except the scheduler's. This should allow people to write experimental schedulers with completely different internal structuring. A scheduler call sched_set_concurrency(kg, N) has been added that notifies the scheduler that no more than N threads from that ksegrp should be allowed to be concurrently scheduled. This is also used to enforce 'fairness' at this time so that a ksegrp with 10000 threads can not swamp the run queue and force out a process with 1 thread, since the current code will not set the concurrency above NCPU, and both schedulers will not allow more than that many onto the system run queue at a time. Each scheduler should eventually develop their own methods to do this now that they are effectively separated. Rejig libthr's kernel interface to follow the same code paths as libkse for scope system threads. This has slightly hurt libthr's performance but I will work to recover as much of it as I can. 
Thread exit code has been cleaned up greatly. exit and exec code now transitions a process back to 'standard non-threaded mode' before taking the next step. Reviewed by: scottl, peter MFC after: 1 week
# 00b0483d	03-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Don't declare a function we are not defining.
# 37c28a02	03-Sep-2004	Julian Elischer <julian@FreeBSD.org>	fix compile for UP
# 293968d8	03-Sep-2004	Julian Elischer <julian@FreeBSD.org>	ooops finish last commit. moved the variables but not the declarations.
# 82a1dfc1	03-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Move 4bsd specific experimental IP code into the 4bsd file. Move the sysctls into kern.sched
# 6804a3ab	01-Sep-2004	Julian Elischer <julian@FreeBSD.org>	Give the 4bsd scheduler the ability to wake up idle processors when there is new work to be done. MFC after: 5 days
# 2630e4c9	31-Aug-2004	Julian Elischer <julian@FreeBSD.org>	Give setrunqueue() and sched_add() more of a clue as to where they are coming from and what is expected from them. MFC after: 2 days
# ad59c36b	21-Aug-2004	Julian Elischer <julian@FreeBSD.org>	diff reduction for upcoming patch. Use a macro that masks some of the odd goings on with sub-structures, because they will go away anyhow.
# 0f54f482	11-Aug-2004	Julian Elischer <julian@FreeBSD.org>	Properly keep track of how many kses are on the system run queue(s).
# 732d9528	09-Aug-2004	Julian Elischer <julian@FreeBSD.org>	Increase the amount of data exported by KTR in the KTR_RUNQ setting. This extra data is needed to really follow what is going on in the threaded case.
# e038d354	23-Jul-2004	Scott Long <scottl@FreeBSD.org>	Clean up whitespace, increase consistency and correctness. Submitted by: bde
# 55d44f79	18-Jul-2004	Julian Elischer <julian@FreeBSD.org>	When calling scheduler entrypoints for creating new threads and processes, specify "us" as the thread not the process/ksegrp/kse. You can always find the others from the thread but the converse is not true. Theoretically this would lead to runtime being allocated to the wrong entity in some cases though it is not clear how often this actually happened. (would only affect threaded processes and would probably be pretty benign, but it WAS a bug..) Reviewed by: peter
# 52eb8464	16-Jul-2004	John Baldwin <jhb@FreeBSD.org>	- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags since they are only accessed by curthread and thus do not need any locking. - Move pr_addr and pr_ticks out of struct uprof (which is per-process) and directly into struct thread as td_profil_addr and td_profil_ticks as these variables are really per-thread. (They are used to defer an addupc_intr() that was too "hard" until ast()).
# 6942d433	13-Jul-2004	John Baldwin <jhb@FreeBSD.org>	Set TDF_NEEDRESCHED when a higher priority thread is scheduled in sched_add() rather than just doing it in sched_wakeup(). The old ithread preemption code used to set NEEDRESCHED unconditionally if it didn't preempt which masked this bug in SCHED_4BSD. Noticed by: jake Reported by: kensmith, marcel
# 0c0b25ae	02-Jul-2004	John Baldwin <jhb@FreeBSD.org>	Implement preemption of kernel threads natively in the scheduler rather than as one-off hacks in various other parts of the kernel: - Add a function maybe_preempt() that is called from sched_add() to determine if a thread about to be added to a run queue should be preempted to directly. If it is not safe to preempt or if the new thread does not have a high enough priority, then the function returns false and sched_add() adds the thread to the run queue. If the thread should be preempted to but the current thread is in a nested critical section, then the flag TDF_OWEPREEMPT is set and the thread is added to the run queue. Otherwise, mi_switch() is called immediately and the thread is never added to the run queue since it is switched to directly. When exiting an outermost critical section, if TDF_OWEPREEMPT is set, then clear it and call mi_switch() to perform the deferred preemption. - Remove explicit preemption from ithread_schedule() as calling setrunqueue() now does all the correct work. This also removes the do_switch argument from ithread_schedule(). - Do not use the manual preemption code in mtx_unlock if the architecture supports native preemption. - Don't call mi_switch() in a loop during shutdown to give ithreads a chance to run if the architecture supports native preemption since the ithreads will just preempt DELAY(). - Don't call mi_switch() from the page zeroing idle thread for architectures that support native preemption as it is unnecessary. - Native preemption is enabled on the same archs that supported ithread preemption, namely alpha, i386, and amd64. This change should largely be a NOP for the default case as committed except that we will do fewer context switches in a few cases and will avoid the run queues completely when preempting. Approved by: scottl (with his re@ hat)
# bf0acc27	02-Jul-2004	John Baldwin <jhb@FreeBSD.org>	- Change mi_switch() and sched_switch() to accept an optional thread to switch to. If a non-NULL thread pointer is passed in, then the CPU will switch to that thread directly rather than calling choosethread() to pick a thread to choose to. - Make sched_switch() aware of idle threads and know to do TD_SET_CAN_RUN() instead of sticking them on the run queue rather than requiring all callers of mi_switch() to know to do this if they can be called from an idlethread. - Move constants for arguments to mi_switch() and thread_single() out of the middle of the function prototypes and up above into their own section.
# 36c6fd1c	21-Jun-2004	Scott Long <scottl@FreeBSD.org>	Fix another typo in the previous commit.
# c38dd4b6	21-Jun-2004	Scott Long <scottl@FreeBSD.org>	Fix typo that somehow crept into the previous commit
# dc095794	21-Jun-2004	Scott Long <scottl@FreeBSD.org>	Add the sysctl node 'kern.sched.name' that has the name of the scheduler currently in use. Move the 4bsd kern.quantum node to kern.sched.quantum for consistency.
# fa885116	15-Jun-2004	Julian Elischer <julian@FreeBSD.org>	Nice, is a property of a process as a whole.. I mistakenly moved it to the ksegroup when breaking up the process structure. Put it back in the proc structure.
# 7f8a436f	05-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 7d5ea13f	05-Apr-2004	Doug Rabson <dfr@FreeBSD.org>	Try not to crash instantly when signalling a libthr program to death.
# 8cbec0c8	05-Mar-2004	Robert Watson <rwatson@FreeBSD.org>	The roundrobin callout from sched_4bsd is MPSAFE, so set up the callout as MPSAFE to avoid grabbing Giant. Reviewed by: jhb
# 44f3b092	27-Feb-2004	John Baldwin <jhb@FreeBSD.org>	Switch the sleep/wakeup and condition variable implementations to use the sleep queue interface: - Sleep queues attempt to merge some of the benefits of both sleep queues and condition variables. Having sleep queues in a hash table avoids having to allocate a queue head for each wait channel. Thus, struct cv has shrunk down to just a single char * pointer now. However, the hash table does not hold threads directly, but queue heads. This means that once you have located a queue in the hash bucket, you no longer have to walk the rest of the hash chain looking for threads. Instead, you have a list of all the threads sleeping on that wait channel. - Outside of the sleepq code and the sleep/cv code the kernel no longer differentiates between cv's and sleep/wakeup. For example, calls to abortsleep() and cv_abort() are replaced with a call to sleepq_abort(). Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and cv_waitq_remove() have been replaced with calls to sleepq_remove(). - The sched_sleep() function no longer accepts a priority argument as sleeps no longer inherently bump the priority. Instead, this is solely a property of msleep() which explicitly calls sched_prio() before blocking. - The TDF_ONSLEEPQ flag has been dropped as it was never used. The associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been dropped and replaced with a single explicit clearing of td_wchan. TD_SET_ONSLEEPQ() would really have only made sense if it had taken the wait channel and message as arguments anyway. Now that that only happens in one place, a macro would be overkill.
# f2f51f8a	31-Jan-2004	Jeff Roberson <jeff@FreeBSD.org>	- Disable ithread binding in all cases for now. This doesn't make as much sense with sched_4bsd as it does with sched_ule. - Use P_NOLOAD instead of the absence of td->td_ithd to determine whether or not a thread should be accounted for in sched_tdcnt.
# ca59f152	31-Jan-2004	Jeff Roberson <jeff@FreeBSD.org>	- Keep a variable 'sched_tdcnt' that is used for the local implementation of sched_load(). This variable tracks the number of running and runnable non ithd threads. This removes the need to traverse the proc table and discover how many threads are runnable.
# 5a2b158d	25-Jan-2004	Jeff Roberson <jeff@FreeBSD.org>	- Correct function names listed in KASSERTs. These were copied from other code and it was sloppy of me not to adjust these sooner.
# e17c57b1	25-Jan-2004	Jeff Roberson <jeff@FreeBSD.org>	- Implement cpu pinning and binding. This is accomplished by keeping a per-cpu run queue that is only used for pinned or bound threads. Submitted by: Chris Bradfield <chrisb@ation.org>
# c55bbb6c	26-Dec-2003	John Baldwin <jhb@FreeBSD.org>	Create a separate kthread that executes sched_cpu() once a second. Because sched_cpu() locks an sx lock (allproc_lock) which can sleep if it fails to acquire the lock, it is not safe to execute this in a callout handler from softclock().
# b698380f	09-Nov-2003	Bruce Evans <bde@FreeBSD.org>	Quick fix for scaling of statclock ticks in the SMP case. As explained in the log message for kern_sched.c 1.83 (which should have been repo-copied to preserve history for this file), the (4BSD) scheduler algorithm only works right if stathz is nearly 128 Hz. The old commit log said 64 Hz; the scheduler actually wants nearly 16 Hz but there was a scale factor of 4 to give the requirement of 64 Hz, and rev.1.83 changed the scale factor so that the requirement became 128 Hz. The change of the scale factor was incomplete in the SMP case. Then scheduling ticks are provided by smp_ncpu CPUs, and the scheduler cannot tell the difference between this and 1 CPU providing scheduling ticks smp_ncpu times faster, so we need another scale factor of smp_ncpu or an algorithm change. This quick fix uses the scale factor without even trying to optimize the runtime divisions required for this as is done for the other scale factor. The main algorithmic problem is the clamp on the scheduling tick counts. This was 295; it is now approximately 295 * smp_ncpu. When the limit is reached, threads get free timeslices and scheduling becomes very unfair to the threads that don't hit the limit. The limit can be reached and maintained in the worst case if the load average is larger than (limit / effective_stathz - 1) / 2 = 0.65 now (was just 0.08 with 2 CPUs before this change), so there are algorithmic problems even for a load average of 1. Fortunately, the worst case isn't common enough for the problem to be very noticeable (it is mainly for niced CPU hogs competing with less nice CPU hogs).
# 685a6c44	07-Nov-2003	David Xu <davidxu@FreeBSD.org>	Return a reasonable number for top or ps to display for M:N thread, since there is no direct association between M:N thread and kse, sometimes, a thread does not have a kse, in that case, return a pctcpu from its last kse, it is not perfect, but gives a good number to be displayed.
# 89674a9f	29-Oct-2003	Bruce Evans <bde@FreeBSD.org>	Removed sched_nest variable in sched_switch(). Context switches always begin with sched_lock held but not recursed, so this variable was always 0. Removed fixup of sched_lock.mtx_recurse after context switches in sched_switch(). Context switches always end with this variable in the same state that it began in, so there is no need to fix it up. Only sched_lock.mtx_lock really needs a fixup. Replaced fixup of sched_lock.mtx_recurse in fork_exit() by an assertion that sched_lock is owned and not recursed after it is fixed up. This assertion must match the one in mi_switch(), and if sched_lock were recursed then a non-null fixup of sched_lock.mtx_recurse would probably be needed again, unlike in sched_switch(), since fork_exit() doesn't return to its caller in the normal way.
# 55f2099a	16-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- The kse may be null in sched_pctcpu(). Reported by: kris
# ae53b483	16-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Collapse sched_switchin() and sched_switchout() into sched_switch(). Now mi_switch() calls sched_switch() which calls cpu_switch(). This is actually one less function call than it had been.
# 7cf90fb3	16-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Update the sched api. sched_{add,rem,clock,pctcpu} now all accept a td argument rather than a kse.
# c06eb4e2	19-Aug-2003	Sam Leffler <sam@FreeBSD.org>	Change instances of callout_init that specify MPSAFE behaviour to use CALLOUT_MPSAFE instead of "1" for the second parameter. This does not change the behaviour; it just makes the intent more clear.
# 70fca427	15-Aug-2003	John Baldwin <jhb@FreeBSD.org>	- Various style fixes in both code and comments. - Update some stale comments. - Sort a couple of includes. - Only set 'newcpu' in updatepri() if we use it. - No functional changes. Obtained from: bde (via an old diff I got a long time ago)
# 0e2a4d3a	14-Jun-2003	David Xu <davidxu@FreeBSD.org>	Rename P_THREADED to P_SA. P_SA means a process is using scheduler activations.
# 677b542e	10-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# 51da11a2	29-Apr-2003	Mark Murray <markm@FreeBSD.org>	Fix some easy, global, lint warnings. In most cases, this means making some local variables static. In a couple of cases, this means removing an unused variable.
# 2056d0a1	23-Apr-2003	John Baldwin <jhb@FreeBSD.org>	Add lock assertions for various proc/thread/kse/ksegroup fields to the scheduler functions.
# 0b5318c8	22-Apr-2003	John Baldwin <jhb@FreeBSD.org>	- Assert that the proc lock and sched_lock are held in sched_nice(). - For the 4BSD scheduler, this means that all callers of the static function resetpriority() now always hold sched_lock, so don't lock sched_lock explicitly in that function.
# f7f9e7f3	10-Apr-2003	Jeff Roberson <jeff@FreeBSD.org>	- Catch up with sched api changes.
# 060563ec	10-Apr-2003	Julian Elischer <julian@FreeBSD.org>	Move the _oncpu entry from the KSE to the thread. The entry in the KSE still exists but its purpose will change a bit when we add the ability to lock a KSE to a cpu.
# 4974b53e	24-Mar-2003	Maxime Henrion <mux@FreeBSD.org>	Remove a trailing semicolon in SCHED_QUANTUM definition. Luckily this didn't cause any bugs. Spotted by: Samy Al Bahra <samy@kerneled.com>
# ac2e4153	26-Feb-2003	Julian Elischer <julian@FreeBSD.org>	Change the process flags P_KSES to be P_THREADED. This is just a cosmetic change but I've been meaning to do it for about a year.
# 4f6cfa45	19-Feb-2003	David Xu <davidxu@FreeBSD.org>	Update comments to reflect new KSE code.
# 4a338afd	17-Feb-2003	Julian Elischer <julian@FreeBSD.org>	Move a bunch of flags from the KSE to the thread. I was in two minds as to where to put them in the first case.. I should have listened to the other mind. Submitted by: parts by davidxu@ Reviewed by: jeff@ mini@
# 8fb913fa	12-Jan-2003	Jeff Roberson <jeff@FreeBSD.org>	- Unbreak world. I did not notice that libkvm was still used in some places to access the pctcpu. This will have to be sorted out more later as the new scheduler requires a procedural interface for this data. A more complete solution will follow.
# bcb06d59	12-Jan-2003	Jeff Roberson <jeff@FreeBSD.org>	- Move ke_pctcpu and ke_cpticks into the scheduler specific datastructure. This will prevent access through mechanisms other than the published interfaces.
# 93a7aa79	27-Dec-2002	Julian Elischer <julian@FreeBSD.org>	Add code to ddb to allow backtracing an arbitrary thread. (show thread {address}) Remove the IDLE kse state and replace it with a change in the way threads share KSEs. Every KSE now has a thread, which is considered its "owner" however a KSE may also be lent to other threads in the same group to allow completion of in-kernel work. In this case the owner remains the same and the KSE will revert to the owner when the other work has been completed. All creations of upcalls etc. is now done from kse_reassign() which in turn is called from mi_switch or thread_exit(). This means that special code can be removed from msleep() and cv_wait(). kse_release() does not leave a KSE with no thread any more but converts the existing thread into the KSE's owner, and sets it up for doing an upcall. It is just inhibited from being scheduled until there is some reason to do an upcall. Remove all trace of the kse_idle queue since it is no longer needed. "Idle" KSEs are now on the loanable queue.
# 79acfc49	21-Nov-2002	Jeff Roberson <jeff@FreeBSD.org>	- Add the new sched_pctcpu() function to the sched_* api. - Provide a routine in sched_4bsd to add this functionality. - Use sched_pctcpu() in kern_proc, which is the one place outside of sched_4bsd where the old pctcpu value was accessed directly. Approved by: re
# 06439a04	21-Nov-2002	Jeff Roberson <jeff@FreeBSD.org>	- Move scheduler specific macros and defines out of proc.h Approved by: re
# 148302c9	21-Nov-2002	Jeff Roberson <jeff@FreeBSD.org>	- Move FSCALE back to kern_sync. This is not scheduler specific. - Create a new callout for lbolt and move it out of schedcpu(). This is not scheduler specific either. Approved by: re
# de028f5a	20-Nov-2002	Jeff Roberson <jeff@FreeBSD.org>	- Implement a mechanism for allowing schedulers to place scheduler dependent data in the scheduler independent structures (proc, ksegrp, kse, thread). - Implement unused stubs for this mechanism in sched_4bsd. Approved by: re Reviewed by: luigi, trb Tested on: x86, alpha
# 1f955e2d	14-Oct-2002	Julian Elischer <julian@FreeBSD.org>	Tidy up the scheduler's code for changing the priority of a thread. Logically pretty much a NOP.
# b43179fb	11-Oct-2002	Jeff Roberson <jeff@FreeBSD.org>	- Create a new scheduler api that is defined in sys/sched.h - Begin moving scheduler specific functionality into sched_4bsd.c - Replace direct manipulation of scheduler data with hooks provided by the new api. - Remove KSE specific state modifications and single runq assumptions from kern_switch.c Reviewed by: -arch