History log of /freebsd-current/sys/kern/subr_trap.c
Revision Date Author Comments
# b7b78c1c 28-Mar-2024 Randall Stewart <rrs@FreeBSD.org>

Optimize HPTS so that little work is done until we have a hpts thread that is over the connection threshold

HPTS inserts a softclock for system call return that optimizes performance. However when
no HPTS threads need the help (i.e. when they have less than 100 or so connections) then
there should be little work done i.e. check the counter and return instead of running through
all the threads getting locks etc.ptimize HPTS so that little work is done until we have a hpts
thread that is over the connection threshold.

Reported by: eduardo
Reviewed by: gallatin, glebius, tuexen
Tested by: gallatin
Differential Revision: https://reviews.freebsd.org/D44420


# e3cbc572 04-Dec-2023 Gleb Smirnoff <glebius@FreeBSD.org>

kern/subr_trap.c: repair the HPTS performance hack in userret()

It wasn't functional as subr_trap.c doesn't include opt_inet.h. Put a
better comment provided by gallatin@ in place of the old one. The idea
is to use userret() as a cheap place to call a soft clock. This approach
saves CPU on busy machines and saves power on idle machines.
An alternative would be to constantly schedule callouts. Running with
neither callouts nor the soft clock ruins HPTS precision.

Reviewed by: tuexen, rrs
Differential Revision: https://reviews.freebsd.org/D42860


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# c159f767 13-Apr-2023 Gordon Bergling <gbe@FreeBSD.org>

kern: remove a double word in a KASSERT in subr_trap

- s/with with/with/

MFC after: 5 days


# c46771a7 25-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

kern/subr_trap.c: cleanup no longer needed headers

Also bump Foundation' copyright year

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# c6d31b83 18-Jul-2022 Konstantin Belousov <kib@FreeBSD.org>

AST: rework

Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888


# b53133a7 12-Feb-2022 Mateusz Guzik <mjg@FreeBSD.org>

proc: load/store p_cowgen using atomic primitives


# 626d6992 26-Dec-2021 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move fork_rfppwait() check into ast()

This will always sleep at least once, so it's a slow path by definition.

Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D33387


# 98168a6e 06-Sep-2021 Konstantin Belousov <kib@FreeBSD.org>

kqueue: drain kqueue taskqueue if syscall tickled it

Otherwise return from the syscall and next syscall, which could be
kevent(2) on the kqueue that should be notified, races with the kqueue
taskqueue thread, and potentially misses the wakeup. This is reliably
visible when kevent(2) only peeks into events using zeroed timeout.

PR: 258310
Reported by: arichardson, Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by: arichardson, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31858


# b0f71f1b 10-Aug-2021 Mark Johnston <markj@FreeBSD.org>

amd64: Add MD bits for KMSAN

Interrupt and exception handlers must call kmsan_intr_enter() prior to
calling any C code. This is because the KMSAN runtime maintains some
TLS in order to track initialization state of function parameters and
return values across function calls. Then, to ensure that this state is
kept consistent in the face of asynchronous kernel-mode excpeptions, the
runtime uses a stack of TLS blocks, and kmsan_intr_enter() and
kmsan_intr_leave() push and pop that stack, respectively.

Use these functions in amd64 interrupt and exception handlers. Note
that handlers for user->kernel transitions need not be annotated.

Also ensure that trap frames pushed by the CPU and by handlers are
marked as initialized before they are used.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31467


# d7955cc0 06-Jul-2021 Randall Stewart <rrs@FreeBSD.org>

tcp: HPTS performance enhancements

HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083


# 57f22c82 10-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

sigfastblock: do not skip cursig/postsig loop in ast()

Even if sigfastblock block is non-zero, non-blockable signals must be
checked on ast and delivered now. This also affects debugger ability
to attach, because issignal() also calls ptracestop() if there is
a pending stop for debugee.

Instead of checking for sigfastblock, and either setting PENDING flag
for usermode or doing signal delivery loop, always do the loop after
checking, and then handle PENDING bit. issignal() already does the right
thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to
happen.

Reported by: Vasily Postnicov <shamaz.mazum@gmail.com>, markj
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28089


# 46588778 02-Oct-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move KTRUSERRET() from userret() to ast(). It's a really long
detour - it writes ktrace entries to the filesystem - so the overhead
of ast() won't make any difference.

Reviewed by: kib
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26404


# fecb19e4 14-Sep-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move td_softdep_cleanup() from userret() to ast(); it's infrequent
at best. The schedule_cleanup() function already sets TDF_ASTPENDING.

Reviewed by: kib, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26375


# 60f083ef 14-Sep-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move TDP_GEOM check from userret() to ast(); this code path is quite
infrequent.

Reviewed by: kib
No objections: mav
Tested by: pho
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26374


# 30d158ee 14-Sep-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move racct/rctl throttling from userret() to ast(). There's no reason
for it to sit in the syscall fast path.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D26368


# 022c2f55 09-Sep-2020 Gleb Smirnoff <glebius@FreeBSD.org>

In r354148 the goal was to check THREAD_CAN_SLEEP() only once for the
purpose of epoch_trace() and for calling subsequent panic, but to keep
code fully under INVARIANTS, so don't use bare function call to panic().
However, at the last stage of review a true value slipped in, while
always false was assumed. I checked that in email archive with kib@.

Noticed by: trasz


# 59838c1a 01-Apr-2020 John Baldwin <jhb@FreeBSD.org>

Retire procfs-based process debugging.

Modern debuggers and process tracers use ptrace() rather than procfs
for debugging. ptrace() has a supserset of functionality available
via procfs and new debugging features are only added to ptrace().
While the two debugging services share some fields in struct proc,
they each use dedicated fields and separate code. This results in
extra complexity to support a feature that hasn't been enabled in the
default install for several years.

PR: 244939 (exp-run)
Reviewed by: kib, mjg (earlier version)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D23837


# 0bc52b0b 10-Mar-2020 Konstantin Belousov <kib@FreeBSD.org>

Return reschedule_signals() to being static again.

It was used after sigfastblock_setpend() call in in ast() when current
thread fast-blocks signals. Add a flag to sigfastblock_setpend() to
request reschedule, and remove the direct use of the function from
subr_trap.c

Tested by: pho
Sponsored by: The FreeBSD Foundation


# 74cb9a53 20-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Fix a bug in r358168, do not call sigfastblock_setpend() under a mutex.

PR: 244250
Reported and tested by: lwhsu
Sponsored by: The FreeBSD Foundation


# a113b17f 20-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not read sigfastblock word on syscall entry.

On machines with SMAP, fueword executes two serializing instructions
which can be seen in microbenchmarks.

As a measure to restore microbenchmark numbers, only read the word on
the attempt to deliver signal in ast(). If the word is set, signal is
not delivered and word is kept, preventing interruption of
interruptible sleeps by signals until userspace calls
sigfastblock(UNBLOCK) which clears the word.

This way, the spurious EINTR that userspace can see while in critical
section is on first interruptible sleep, if a signal is pending, and
on signal posting. It is believed that it is not important for rtld
and lbithr critical sections. It might be visible for the application
code e.g. for the callback of dl_iterate_phdr(3), but again the belief
is that the non-compliance is acceptable. Most important is that the
retry of the sleeping syscall does not interrupt unless additional
signal is posted.

For now I added the knob kern.sigfastblock_fetch_always to enable the
word read on syscall entry to be able to diagnose possible issues due
to spurious EINTR.

While there, do some code restructuting to have all sigfastblock()
handling located in kern_sig.c.

Reviewed by: jeff
Discussed with: mjg
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D23622


# 0e84a878 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

Annotate branches in the syscall path

This in particular significantly shortens amd64_syscall, which otherwise
keeps jumping forward over 2KB of code in total.

Note some of these branches should be either eliminated altogether or
coalesced.


# 146fc63f 09-Feb-2020 Konstantin Belousov <kib@FreeBSD.org>

Add a way to manage thread signal mask using shared word, instead of syscall.

A new syscall sigfastblock(2) is added which registers a uint32_t
variable as containing the count of blocks for signal delivery. Its
content is read by kernel on each syscall entry and on AST processing,
non-zero count of blocks is interpreted same as the signal mask
blocking all signals.

The biggest downside of the feature that I see is that memory
corruption that affects the registered fast sigblock location, would
cause quite strange application misbehavior. For instance, the process
would be immune to ^C (but killable by SIGKILL).

With consumers (rtld and libthr added), benchmarks do not show a
slow-down of the syscalls in micro-measurements, and macro benchmarks
like buildworld do not demonstrate a difference. Part of the reason is
that buildworld time is dominated by compiler, and clang already links
to libthr. On the other hand, small utilities typically used by shell
scripts have the total number of syscalls cut by half.

The syscall is not exported from the stable libc version namespace on
purpose. It is intended to be used only by our C runtime
implementation internals.

Tested by: pho
Disscussed with: cem, emaste, jilles
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D12773


# b52d50cf 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prealloc vnodes in getnewvnode_reserve

Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.


# 686bcb5c 15-Dec-2019 Jeff Roberson <jeff@FreeBSD.org>

schedlock 4/4

Don't hold the scheduler lock while doing context switches. Instead we
unlock after selecting the new thread and switch within a spinlock
section leaving interrupts and preemption disabled to prevent local
concurrency. This means that mi_switch() is entered with the thread
locked but returns without. This dramatically simplifies scheduler
locking because we will not hold the schedlock while spinning on
blocked lock in switch.

This change has not been made to 4BSD but in principle it would be
more straightforward.

Discussed with: markj
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22778


# 5757b59f 29-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Merge td_epochnest with td_no_sleeping.

Epoch itself doesn't rely on the counter and it is provided
merely for sleeping subsystems to check it.

- In functions that sleep use THREAD_CAN_SLEEP() to assert
correctness. With EPOCH_TRACE compiled print epoch info.
- _sleep() was a wrong place to put the assertion for epoch,
right place is sleepq_add(), as there ways to call the
latter bypassing _sleep().
- Do not increase td_no_sleeping in non-preemptible epochs.
The critical section would trigger all possible safeguards,
no sleeping counter is extraneous.

Reviewed by: kib


# ed9d69b5 24-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

Use THREAD_CAN_SLEEP() macro to check if thread can sleep. There is no
functional change.

Discussed with: kib


# bac06038 15-Oct-2019 Gleb Smirnoff <glebius@FreeBSD.org>

When assertion for a thread not being in an epoch fails also print all
entered epochs. Works with EPOCH_TRACE only.

Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D22017


# 0db7afd0 19-Aug-2019 Andriy Gapon <avg@FreeBSD.org>

assert that td_lk_slocks is not leaked upon return from kernel

This is similar to checks for td_sx_slocks and td_rw_rlocks.
Although td_lk_slocks is an implementation detail, it still makes sense
to validate it.

MFC after: 1 week
Sponsored by: Panzura


# 64cf6a62 28-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

Deinline racct throttling out of syscall exit path.

racct is not enabled by default and even when it is enabled processes are
typically not throttled. The order of checks is left unchanged since
racct_enable will be annotated as __read_frequently, while checking for the
flag in the processes would probably require an extra fetch.

Sponsored by: The FreeBSD Foundation


# 5de96e33 03-Jun-2018 Matt Macy <mmacy@FreeBSD.org>

hwpmc: support sampling both kernel and user stacks when interrupted in kernel

This adds the -U options to pmcstat which will attribute in-kernel samples
back to the user stack that invoked the system call. It is not the default,
because when looking at kernel profiles it is generally more desirable to
merge all instances of a given system call together.

Although heavily revised, this change is directly derived from D7350 by
Jonathan T. Looney.

Obtained from: jtl
Sponsored by: Juniper Networks, Limelight Networks


# 2466d12b 22-May-2018 Mateusz Guzik <mjg@FreeBSD.org>

sx: port over writer starvation prevention measures from rwlock

A constant stream of readers could completely starve writers and this is not
a hypothetical scenario.

The 'poll2_threads' test from the will-it-scale suite reliably starves writers
even with concurrency < 10 threads.

The problem was run into and diagnosed by dillon@backplane.com

There was next to no change in lock contention profile during -j 128 pkg build,
despite an sx lock being at the top.

Tested by: pho


# 06bf2a6a 10-May-2018 Matt Macy <mmacy@FreeBSD.org>

Add simple preempt safe epoch API

Read locking is over used in the kernel to guarantee liveness. This API makes
it easy to provide livenes guarantees without atomics.

Includes epoch_test kernel module to stress test the API.

Documentation will follow initial use case.

Test case and improvements to preemption handling in response to discussion
with mjg@

Reviewed by: imp@, shurd@
Approved by: sbruno@


# ed9e8bc4 24-Mar-2018 Konstantin Belousov <kib@FreeBSD.org>

Account the size of the vslock-ed memory by the thread.
Assert that all such memory is unwired on return to usermode.

The count of the wired memory will be used to detect the copyout mode.

Tested by: pho (as part of the larger patch)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# df57947f 18-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

spdx: initial adoption of licensing ID tags.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.

Initially, only tag files that use BSD 4-Clause "Original" license.

RelNotes: yes
Differential Revision: https://reviews.freebsd.org/D13133


# 83c9dea1 17-Apr-2017 Gleb Smirnoff <glebius@FreeBSD.org>

- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156


# aca4bb91 25-Feb-2017 Konstantin Belousov <kib@FreeBSD.org>

Do not leak mount references for dying threads.

Thread might create a condition for delayed SU cleanup, which creates
a reference to the mount point in td_su, but exit without returning
through userret(), e.g. when terminating due to single-threading or
process exit. In this case, td_su reference is not dropped and mount
point cannot be freed.

Handle the situation by clearing td_su also in the thread destructor
and in exit1(). softdep_ast_cleanup() has to receive the thread as
argument, since e.g. thread destructor is executed in different
context.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 77d68094 18-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

The assertion re-added in r302614 was triggered when stopping signal
is delivered to vforked child. Issue is that we avoid stopping such
children in issignal() to not block parents. But executed AST, which
ignored stops, leaves the child with the signal pending but no AST
pending.

On first exec after vfork(), call signotify() to handle pending
reenabled signals. Adjust the assert to not check vfork children
until exec.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 8f01bee4 11-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Revive the check, disabled in r197963.

Despite the implication (process has pending signals -> the current
thread marked for AST and has TDF_NEEDSIGCHK set) is not true due to
other thread might manipulate its signal blocking mask, it should still
hold for the single-threaded processes. Enable check for the condition
for single-threaded case, and replicate it from userret() to ast() as
well, where we check that ast indeed has no signal to deliver.

Note that the check is under DIAGNOSTIC, it is not enabled for INVARIANTS
but !DIAGNOSTIC since it imposes too heavy-weight locking for day-to-day
used debugging kernel.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4f4c35bb 11-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Add assert to complement r302328.

AST must not execute with TDF_SBDRY or TDF_SEINTR/TDF_SERESTART thread
flags set, which is asserted in userret(). As the consequence, -1 return
from cursig() must not be possible.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3a1e5dd8 26-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Rewrite sigdeferstop(9) and sigallowstop(9) into more flexible
framework allowing to set the suspension policy for the dynamic block.
Extend the currently possible policies of stopping on interruptible
sleeps and ignoring such sleeps by two more: do not suspend at
interruptible sleeps, but interrupt them with either EINTR or ERESTART.

Reviewed by: jilles
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Approved by: re (gjb)


# ae34b6ff 06-Apr-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add four new RCTL resources - readbps, readiops, writebps and writeiops,
for limiting disk (actually filesystem) IO.

Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.

Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.

MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080


# e94e50af 13-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

racct: perform a lockless check for p_throttled

This reduces proc lock contention.

Reviewed by: trasz


# 4ea6a9a2 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Generalised support for copy-on-write structures shared by threads.

Thread credentials are maintained as follows: each thread has a pointer to
creds and a reference on them. The pointer is compared with proc's creds on
userspace<->kernel boundary and updated if needed.

This patch introduces a counter which can be compared instead, so that more
structures can use this scheme without adding more comparisons on the boundary.


# 1bc93bb7 27-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Currently, softupdate code detects overstepping on the workitems
limits in the code which is deep in the call stack, and owns several
critical system resources, like vnode locks. Attempt to wait while
the per-mount softupdate thread cleans up the backlog may deadlock,
because the thread might need to lock the same vnode which is owned by
the waiting thread.

Instead of synchronously waiting for the worker, perform the worker'
tickle and pause until the backlog is cleaned, at the safe point
during return from kernel to usermode. A new ast request to call
softdep_ast_cleanup() is created, the SU code now only checks the size
of queue and schedules ast.

There is no ast delivery for the kernel threads, so they are exempted
from the mechanism, except NFS daemon threads. NFS server loop
explicitely checks for the request, and informs the schedule_cleanup()
that it is capable of handling the requests by the process P2_AST_SU
flag. This is needed because nfsd may be the sole cause of the SU
workqueue overflow. But, to not cause nsfd to spawn additional
threads just because we slow down existing workers, only tickle su
threads, without waiting for the backlog cleanup.

Reviewed by: jhb, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# ed95805e 30-Apr-2015 John Baldwin <jhb@FreeBSD.org>

Remove support for Xen PV domU kernels. Support for HVM domU kernels
remains. Xen is planning to phase out support for PV upstream since it
is harder to maintain and has more overhead. Modern x86 CPUs include
virtualization extensions that support HVM guests instead of PV guests.
In addition, the PV code was i386 only and not as well maintained recently
as the HVM code.
- Remove the i386-only NATIVE option that was used to disable certain
components for PV kernels. These components are now standard as they
are on amd64.
- Remove !XENHVM bits from PV drivers.
- Remove various shims required for XEN (e.g. PT_UPDATES_FLUSH, LOAD_CR3,
etc.)
- Remove duplicate copy of <xen/features.h>.
- Remove unused, i386-only xenstored.h.

Differential Revision: https://reviews.freebsd.org/D2362
Reviewed by: royger
Tested by: royger (i386/amd64 HVM domU and amd64 PVH dom0)
Relnotes: yes


# 4b5c9cf6 29-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation


# 18cc2ff0 12-Jan-2015 Konstantin Belousov <kib@FreeBSD.org>

Revert r263475: TDP_DEVMEMIO no longer needed, since amd64 /dev/kmem
does not access kernel mappings directly.

Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 52f3c44e 21-Mar-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix two issues with /dev/mem access on amd64, both causing kernel page
faults.

First, for accesses to direct map region should check for the limit by
which direct map is instantiated.

Second, for accesses to the kernel map, success returned from the
kernacc(9) does not guarantee that consequent attempt to read or write
to the checked address succeed, since other thread might invalidate
the address meantime. Add a new thread private flag TDP_DEVMEMIO,
which instructs vm_fault() to return error when fault happens on the
MAP_ENTRY_NOFAULT entry, instead of panicing. The trap handler would
then see a page fault from access, and recover in normal way, making
/dev/mem access safer.

Remove GIANT_REQUIRED from the amd64 memrw(), since it is not needed
and having Giant locked does not solve issues for amd64.

Note that at least the second issue exists on other architectures, and
requires similar patching for md code.

Reported and tested by: clusteradm (gjb, sbruno)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# e7a9eed7 17-Dec-2013 Attilio Rao <attilio@FreeBSD.org>

- Assert for not leaking readers rw locks counter on userland return.
- Use a correct spin_cnt for KDTRACE_HOOK case in rw read lock.

Sponsored by: EMC / Isilon storage division


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 3cf3b9f0 18-Mar-2013 John Baldwin <jhb@FreeBSD.org>

Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.


# a8efb534 14-Mar-2013 Edward Tomasz Napierala <trasz@FreeBSD.org>

When throttling a process to enforce RACCT limits, do not use neither
PBDRY (which simply doesn't make any sense) nor PCATCH (which could
be used by a malicious process to work around the PCPU limit).

Submitted by: Rudo Tomori
Reviewed by: kib


# f9379dc4 01-Mar-2013 John Baldwin <jhb@FreeBSD.org>

Replace the TDP_NOSLEEPING flag with a counter so that the
THREAD_NO_SLEEPING() and THREAD_SLEEPING_OK() macros can nest.

Reviewed by: attilio


# 593efaf9 21-Feb-2013 John Baldwin <jhb@FreeBSD.org>

Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep(). In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing. Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag. Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed. For now, only the NFS clients are set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
the request as being on behalf of the thread performing the lookup
(cnp_thread) rather than using a NULL thread pointer. This causes
NFS to properly handle signals during this VOP on an interruptible
mount.

PR: kern/176179
Reported by: Russell Cattelan (sigdeferstop() recursion)
Reviewed by: kib
MFC after: 1 month


# 5584e917 30-Oct-2012 Attilio Rao <attilio@FreeBSD.org>

Fixup r240246: hwpmc needs to retain the pinning until ASTs are not
executed. This means past the point where userret() is generally
executed.

Skip the td_pinned check if a callchain tracing is currently happening
and add a more robust check to pmc_capture_user_callchain() in order to
catch td_pinned leak past ast() in hwpmc case.

Reported and tested by: fabient
MFC after: 1 week
X-MFC: r240246


# 36af9869 26-Oct-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add CPU percentage limit enforcement to RCTL. The resouce name is "pcpu".
It was implemented by Rudolf Tomori during Google Summer of Code 2012.


# 9b233e23 14-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a KPI to allow to reserve some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).

Reviewed by: avg
Tested by: pho
MFC after: 1 week


# 16cbf13b 08-Sep-2012 Attilio Rao <attilio@FreeBSD.org>

Move the checks for td_pinned, td_critnest, TDP_NOFAULTING and
TDP_NOSLEEPING leaking from syscallret() to userret() so that also
trap handling is covered. Also, the check on td_locks is not duplicated
between the two functions.

Reported by: avg
Reviewed by: kib
MFC after: 1 week


# fbe18392 08-Sep-2012 Attilio Rao <attilio@FreeBSD.org>

Move PT_UPDATED_FLUSH() before td_locks check in order to have more
coverage also in the XEN case.

Reviewed by: kib
MFC after: 1 week


# 324e5715 08-Sep-2012 Attilio Rao <attilio@FreeBSD.org>

userret() already checks for td_locks when INVARIANTS is enabled, so
there is no need to check if Giant is acquired after it.

Reviewed by: kib
MFC after: 1 week


# effb6326 10-Jun-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove redundant include.

MFC after: 1 month


# 88bf5036 20-Apr-2012 John Baldwin <jhb@FreeBSD.org>

Include the associated wait channel message for context switch ktrace
records. kdump supports both the old and new messages.

Submitted by: Andrey Zonov andrey zonov org
MFC after: 1 week


# f5f9340b 28-Mar-2012 Fabien Thomas <fabient@FreeBSD.org>

Add software PMC support.

New kernel events can be added at various location for sampling or counting.
This will for example allow easy system profiling whatever the processor is
with known tools like pmcstat(8).

Simultaneous usage of software PMC and hardware PMC is possible, for example
looking at the lock acquire failure, page fault while sampling on
instructions.

Sponsored by: NETASQ
MFC after: 1 month


# 24f3dcfe 03-Oct-2011 Konstantin Belousov <kib@FreeBSD.org>

Assert that exiting process does not return to usermode.

Reviewed by: avg, jhb
MFC after: 1 week


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 26ccf4f1 11-Sep-2011 Konstantin Belousov <kib@FreeBSD.org>

Inline the syscallenter() and syscallret(). This reduces the time measured
by the syscall entry speed microbenchmarks by ~10% on amd64.

Submitted by: jhb
Approved by: re (bz)
MFC after: 2 weeks


# 24c1c3bf 29-Jun-2011 Jonathan Anderson <jonathan@FreeBSD.org>

We may split today's CAPABILITIES into CAPABILITY_MODE (which has
to do with global namespaces) and CAPABILITIES (which has to do with
constraining file descriptors). Just in case, and because it's a better
name anyway, let's move CAPABILITIES out of the way.

Also, change opt_capabilities.h to opt_capsicum.h; for now, this will
only hold CAPABILITY_MODE, but it will probably also hold the new
CAPABILITIES (implying constrained file descriptors) in the future.

Approved by: rwatson
Sponsored by: Google UK Ltd


# fc94e447 01-Mar-2011 Robert Watson <rwatson@FreeBSD.org>

Continue introducing Capsicum capability mode support:

If a system call wasn't listed in capabilities.conf, return ECAPMODE at
syscall entry.

Reviewed by: anderson
Discussed with: benl, kris, pjd
Sponsored by: Google, Inc.
Obtained from: Capsicum Project
MFC after: 3 months


# bf9ce95b 14-Feb-2011 Bjoern A. Zeeb <bz@FreeBSD.org>

Mfp4 CH=177256:

Catch a set vnet upon return to user space. This usually
means return paths with CURVNET_RESTORE() missing.

If VNET_DEBUG is turned on we can even tell the function
that did the CURVNET_SET() which is really helpful; else
we print "N/A".

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb

MFC after: 11 days


# 6fa39a73 25-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

Allow debugger to specify that children of the traced process should be
automatically traced. Extend the ptrace(PL_LWPINFO) to report that child
just forked.

Reviewed by: davidxu, jhb
MFC after: 2 weeks


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 25e45560 27-Sep-2010 Ed Maste <emaste@FreeBSD.org>

Remove extra braces for style(9) (found while cleaning up an old work tree).


# b3d354c9 22-Aug-2010 Rui Paulo <rpaulo@FreeBSD.org>

Call the systrace_probe_func() when the error value.

Sponsored by: The FreeBSD Foundation


# f2a664ac 15-Jul-2010 John Baldwin <jhb@FreeBSD.org>

Retire td_syscalls now that it is no longer needed.


# 34a39b7b 04-Jul-2010 Konstantin Belousov <kib@FreeBSD.org>

Obey sv_syscallnames bounds in syscallname().

Reported and tested by: pho


# fc0de8f0 30-Jun-2010 John Baldwin <jhb@FreeBSD.org>

Move prototypes for kern_sigtimedwait() and kern_sigprocmask() to
<sys/syscallsubr.h> where all other kern_<syscall> prototypes live.


# 153ac44c 28-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

Count number of threads that enter and leave dynamically registered
syscalls. On the dynamic syscall deregistration, wait until all
threads leave the syscall code. This somewhat increases the safety
of the loadable modules unloading.

Reviewed by: jhb
Tested by: pho
MFC after: 1 month


# 699d648a 23-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for int13 FPU exception reporting on i386. It is
believed that all 486-class CPUs FreeBSD is capable to run on, either
have no FPU and cannot use external coprocessor, or have FPU on the
package and can use #MF.

Reviewed by: bde
Tested by: pho (previous version)


# f05a9476 17-Jun-2010 Rui Paulo <rpaulo@FreeBSD.org>

Make DTrace syscall provider work again by including opt_kdtrace.h here.


# b2318c28 26-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Allow to use syscallname(9) outside subr_trap.c.

MFC after: 1 month


# afe1a688 23-May-2010 Konstantin Belousov <kib@FreeBSD.org>

Reorganize syscall entry and leave handling.

Extend struct sysvec with three new elements:
sv_fetch_syscall_args - the method to fetch syscall arguments from
usermode into struct syscall_args. The structure is machine-depended
(this might be reconsidered after all architectures are converted).
sv_set_syscall_retval - the method to set a return value for usermode
from the syscall. It is a generalization of
cpu_set_syscall_retval(9) to allow ABIs to override the way to set a
return value.
sv_syscallnames - the table of syscall names.

Use sv_set_syscall_retval in kern_sigsuspend() instead of hardcoding
the call to cpu_set_syscall_retval().

The new functions syscallenter(9) and syscallret(9) are provided that
use sv_*syscall* pointers and contain the common repeated code from
the syscall() implementations for the architecture-specific syscall
trap handlers.

Syscallenter() fetches arguments, calls syscall implementation from
ABI sysent table, and set up return frame. The end of syscall
bookkeeping is done by syscallret().

Take advantage of single place for MI syscall handling code and
implement ptrace_lwpinfo pl_flags PL_FLAG_SCE, PL_FLAG_SCX and
PL_FLAG_EXEC. The SCE and SCX flags notify the debugger that the
thread is stopped at syscall entry or return point respectively. The
EXEC flag augments SCX and notifies debugger that the process address
space was changed by one of exec(2)-family syscalls.

The i386, amd64, sparc64, sun4v, powerpc and ia64 syscall()s are
changed to use syscallenter()/syscallret(). MIPS and arm are not
converted and use the mostly unchanged syscall() implementation.

Reviewed by: jhb, marcel, marius, nwhitehorn, stas
Tested by: marcel (ia64), marius (sparc64), nwhitehorn (powerpc),
stas (mips)
MFC after: 1 month


# 7e767511 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r198508, r198509:
Reimplement pselect() in kernel, making change of sigmask and sleep atomic.

MFC r198538:
Move pselect(3) man page to section 2.


# 6ddf1cd2 19-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197963:
Put process-directed signals to the process queue unconditionally,
selecting the thread to deliver the signal only by the thread returning
to usermode.
Change cursig() and postsig() to look both into the thread and process
signal queues.

MFC r197976:
Fix typo.

MFC r200082:
Remove wrong assertion. Debugee is allowed to lose a signal


# 066d836b 27-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Current pselect(3) is implemented in usermode and thus vulnerable to
well-known race condition, which elimination was the reason for the
function appearance in first place. If sigmask supplied as argument to
pselect() enables a signal, the signal might be delivered before thread
called select(2), causing lost wakeup. Reimplement pselect() in kernel,
making change of sigmask and sleep atomic.

Since signal shall be delivered to the usermode, but sigmask restored,
set TDP_OLDMASK and save old mask in td_oldsigmask. The TDP_OLDMASK
should be cleared by ast() in case signal was not gelivered during
syscall execution.

Reviewed by: davidxu
Tested by: pho
MFC after: 1 month


# 6b286ee8 11-Oct-2009 Konstantin Belousov <kib@FreeBSD.org>

Currently, when signal is delivered to the process and there is a thread
not blocking the signal, signal is placed on the thread sigqueue. If
the selected thread is in kernel executing thr_exit() or sigprocmask()
syscalls, then signal might be not delivered to usermode for arbitrary
amount of time, and for exiting thread it is lost.

Put process-directed signals to the process queue unconditionally,
selecting the thread to deliver the signal only by the thread returning
to usermode, since only then the thread can handle delivery of signal
reliably. For exiting thread or thread that has blocked some signals,
check whether the newly blocked signal is queued for the process, and
try to find a thread to wakeup for delivery, in reschedule_signal(). For
exiting thread, assume that all signals are blocked.

Change cursig() and postsig() to look both into the thread and process
signal queues. When there is a signal that thread returning to usermode
could consume, TDF_NEEDSIGCHK flag is not neccessary set now. Do
unlocked read of p_siglist and p_pendingcnt to check for queued signals.

Note that thread that has a signal unblocked might get spurious wakeup
and EINTR from the interruptible system call now, due to the possibility
of being selected by reschedule_signals(), while other thread returned
to usermode earlier and removed the signal from process queue. This
should not cause compliance issues, since the thread has not blocked a
signal and thus should be ready to receive it anyway.

Reported by: Justin Teller <justin.teller gmail com>
Reviewed by: davidxu, jilles
MFC after: 1 month


# f33a947b 14-Jul-2009 Konstantin Belousov <kib@FreeBSD.org>

Add new msleep(9) flag PBDY that shall be specified together with
PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.

Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.

Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 6fe00c78 13-Dec-2008 Joseph Koshy <jkoshy@FreeBSD.org>

- Bug fix: prevent a thread from migrating between CPUs between the
time it is marked for user space callchain capture in the NMI
handler and the time the callchain capture callback runs.

- Improve code and control flow clarity by invoking hwpmc(4)'s user
space callchain capture callback directly from low-level code.

Reviewed by: jhb (kern/subr_trap.c)
Testing (various patch revisions): gnn,
Fabien Thomas <fabien dot thomas at netasq dot com>,
Artem Belevich <artemb at gmail dot com>


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 50d6e424 18-Oct-2008 Kip Macy <kmacy@FreeBSD.org>

- Forward port flush of page table updates on context switch or userret
- Forward port vfork XEN hack


# 8df78c41 16-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Make SCHED_STATS more generic by adding a wrapper to create the
variables and sysctl nodes.
- In reset walk the children of kern_sched_stats and reset the counters
via the oid_arg1 pointer. This allows us to add arbitrary counters to
the tree and still reset them properly.
- Define a set of switch types to be passed with flags to mi_switch().
These types are named SWT_*. These types correspond to SCHED_STATS
counters and are automatically handled in this way.
- Make the new SWT_ types more specific than the older switch stats.
There are now stats for idle switches, remote idle wakeups, remote
preemption ithreads idling, etc.
- Add switch statistics for ULE's pickcpu algorithm. These stats include
how much migration there is, how often affinity was successful, how
often threads were migrated to the local cpu on wakeup, etc.

Sponsored by: Nokia


# b7edba77 21-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Add a new td flag TDF_NEEDSUSPCHK that is set whenever a thread needs
to enter thread_suspend_check().
- Set TDF_ASTPENDING along with TDF_NEEDSUSPCHK so we can move the
thread_suspend_check() to ast() rather than userret().
- Check TDF_NEEDSUSPCHK in the sleepq_catch_signals() optimization so
that we don't miss a suspend request. If this is set use the
expensive signal path.
- Set NEEDSUSPCHK when creating a new thread in thr in case the
creating thread is due to be suspended as well but has not yet.

Reviewed by: davidxu (Authored original patch)


# 6617724c 12-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# d07f36b0 07-Dec-2007 Joseph Koshy <jkoshy@FreeBSD.org>

Kernel and hwpmc(4) support for callchain capture.

Sponsored by: FreeBSD Foundation and Google Inc.


# e01eafef 13-Nov-2007 Julian Elischer <julian@FreeBSD.org>

A bunch more files that should probably print out a thread name
instead of a process name.


# b61ce5b0 16-Sep-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move all of the PS_ flags into either p_flag or td_flags.
- p_sflag was mostly protected by PROC_LOCK rather than the PROC_SLOCK or
previously the sched_lock. These bugs have existed for some time.
- Allow swapout to try each thread in a process individually and then
swapin the whole process if any of these fail. This allows us to move
most scheduler related swap flags into td_flags.
- Keep ki_sflag for backwards compat but change all in source tools to
use the new and more correct location of P_INMEM.

Reported by: pho
Reviewed by: attilio, kib
Approved by: re (kensmith)


# 3036ab79 12-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

- Include opt_sched.h for SCHED_STATS.


# 982d11f8 04-Jun-2007 Jeff Roberson <jeff@FreeBSD.org>

Commit 14/14 of sched_lock decomposition.
- Use thread_lock() rather than sched_lock for per-thread scheduling
sychronization.
- Use the per-process spinlock rather than the sched_lock for per-process
scheduling synchronization.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# b4b70819 04-Jun-2007 Attilio Rao <attilio@FreeBSD.org>

Do proper "locking" for missing vmmeters part.
Now, we assume no more sched_lock protection for some of them and use the
distribuited loads method for vmmeter (distribuited through CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 1c4bcd05 31-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- Move rusage from being per-process in struct pstats to per-thread in
td_ru. This removes the requirement for per-process synchronization in
statclock() and mi_switch(). This was previously supported by
sched_lock which is going away. All modifications to rusage are now
done in the context of the owning thread. reads proceed without locks.
- Aggregate exiting threads rusage in thread_exit() such that the exiting
thread's rusage is not lost.
- Provide a new routine, rufetch() to fetch an aggregate of all rusage
structures from all threads in a process. This routine must be used
in any place requiring a rusage from a process prior to it's exit. The
exited process's rusage is still available via p_ru.
- Aggregate tick statistics only on demand via rufetch() or when a thread
exits. Tick statistics are kept in the thread and protected by sched_lock
until it exits.

Initial patch by: attilio
Reviewed by: attilio, bde (some objections), arch (mostly silent)


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 0c14ff0e 04-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.


# ad1e7d28 05-Dec-2006 Julian Elischer <julian@FreeBSD.org>

Threading cleanup.. part 2 of several.

Make part of John Birrell's KSE patch permanent..
Specifically, remove:
Any reference of the ksegrp structure. This feature was
never fully utilised and made things overly complicated.
All code in the scheduler that tried to make threaded programs
fair to unthreaded programs. Libpthread processes will already
do this to some extent and libthr processes already disable it.

Also:
Since this makes such a big change to the scheduler(s), take the opportunity
to rename some structures and elements that had to be moved anyhow.
This makes the code a lot more readable.

The ULE scheduler compiles again but I have no idea if it works.

The 4bsd scheduler still reqires a little cleaning and some functions that now do
ALMOST nothing will go away, but I thought I'd do that as a separate commit.

Tested by David Xu, and Dan Eischen using libthr and libpthread.


# 8460a577 26-Oct-2006 John Birrell <jb@FreeBSD.org>

Make KSE a kernel option, turned on by default in all GENERIC
kernel configs except sun4v (which doesn't process signals properly
with KSE).

Reviewed by: davidxu@


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 1ca2c018 17-Oct-2006 Bruce Evans <bde@FreeBSD.org>

kern_intr.c:
- Count (scheduling of) software interrupts (SWIs) as SWIs, not as
hardware interrupts.
- Don't count (scheduling of) delayed SWIs as interrupts at all, since
in the delayed case it is expected that there are many more scheduling
calls than handling calls. Perhaps all interrupts should be counted
only when they are handled, but it is only counts of delayed SWIs that
shouldn never be combined with the other counts.

subr_trap.c:
- Count (handling of) Asynchronous System Traps (ASTs) as traps, not as
software interrupts.

Before these changes, the counter for SWIs only counted ASTs, and SWIs
weren't counted separately, but a subcounter for ASTs alone is less
needed than for most other exception sources.

4.4BSD-Lite uses the counters for similar things (actually matching
their names) on its main arches (hp300, ..., !i386) where more of the
exceptions are in hardware.


# 42925630 10-Feb-2006 David Xu <davidxu@FreeBSD.org>

Test before modifying p_sflag to avoid unconditionally cache line
ping-pong on SMP.


# eb2da9a5 08-Feb-2006 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify system time accounting for profiling.

Rename struct thread's td_sticks to td_pticks, we will need the
other name for more appropriately named use shortly. Reduce it
from uint64_t to u_int.

Clear td_pticks whenever we enter the kernel instead of recording
its value as reference for userret(). Use the absolute value of
td->pticks in userret() and eliminate third argument.


# 5b1a8eb3 07-Feb-2006 Poul-Henning Kamp <phk@FreeBSD.org>

Modify the way we account for CPU time spent (step 1)

Keep track of time spent by the cpu in various contexts in units of
"cputicks" and scale to real-world microsec^H^H^H^H^H^H^H^Hclock_t
only when somebody wants to inspect the numbers.

For now "cputicks" are still derived from the current timecounter
and therefore things should by definition remain sensible also on
SMP machines. (The main reason for this first milestone commit is
to verify that hypothesis.)

On slower machines, the avoided multiplications to normalize timestams
at every context switch, comes out as a 5-7% better score on the
unixbench/context1 microbenchmark. On more modern hardware no change
in performance is seen.


# 2c255e9d 13-Nov-2005 Robert Watson <rwatson@FreeBSD.org>

Moderate rewrite of kernel ktrace code to attempt to generally improve
reliability when tracing fast-moving processes or writing traces to
slow file systems by avoiding unbounded queueuing and dropped records.
Record loss was previously possible when the global pool of records
become depleted as a result of record generation outstripping record
commit, which occurred quickly in many common situations.

These changes partially restore the 4.x model of committing ktrace
records at the point of trace generation (synchronous), but maintain
the 5.x deferred record commit behavior (asynchronous) for situations
where entering VFS and sleeping is not possible (i.e., in the
scheduler). Records are now queued per-process as opposed to
globally, with processes responsible for committing records from their
own context as required.

- Eliminate the ktrace worker thread and global record queue, as they
are no longer used. Keep the global free record list, as records
are still used.

- Add a per-process record queue, which will hold any asynchronously
generated records, such as from context switches. This replaces the
global queue as the place to submit asynchronous records to.

- When a record is committed asynchronously, simply queue it to the
process.

- When a record is committed synchronously, first drain any pending
per-process records in order to maintain ordering as best we can.
Currently ordering between competing threads is provided via a global
ktrace_sx, but a per-process flag or lock may be desirable in the
future.

- When a process returns to user space following a system call, trap,
signal delivery, etc, flush any pending records.

- When a process exits, flush any pending records.

- Assert on process tear-down that there are no pending records.

- Slightly abstract the notion of being "in ktrace", which is used to
prevent the recursive generation of records, as well as generating
traces for ktrace events.

Future work here might look at changing the set of events marked for
synchronous and asynchronous record generation, re-balancing queue
depth, timeliness of commit to disk, and so on. I.e., performing a
drain every (n) records.

MFC after: 1 month
Discussed with: jhb
Requested by: Marc Olzheim <marcolz at stack dot nl>


# 9104847f 13-Oct-2005 David Xu <davidxu@FreeBSD.org>

1. Change prototype of trapsignal and sendsig to use ksiginfo_t *, most
changes in MD code are trivial, before this change, trapsignal and
sendsig use discrete parameters, now they uses member fields of
ksiginfo_t structure. For sendsig, this change allows us to pass
POSIX realtime signal value to user code.

2. Remove cpu_thread_siginfo, it is no longer needed because we now always
generate ksiginfo_t data and feed it to libpthread.

3. Add p_sigqueue to proc structure to hold shared signals which were
blocked by all threads in the proc.

4. Add td_sigqueue to thread structure to hold all signals delivered to
thread.

5. i386 and amd64 now return POSIX standard si_code, other arches will
be fixed.

6. In this sigqueue implementation, pending signal set is kept as before,
an extra siginfo list holds additional siginfo_t data for signals.
kernel code uses psignal() still behavior as before, it won't be failed
even under memory pressure, only exception is when deleting a signal,
we should call sigqueue_delete to remove signal from sigqueue but
not SIGDELSET. Current there is no kernel code will deliver a signal
with additional data, so kernel should be as stable as before,
a ksiginfo can carry more information, for example, allow signal to
be delivered but throw away siginfo data if memory is not enough.
SIGKILL and SIGSTOP have fast path in sigqueue_add, because they can
not be caught or masked.
The sigqueue() syscall allows user code to queue a signal to target
process, if resource is unavailable, EAGAIN will be returned as
specification said.
Just before thread exits, signal queue memory will be freed by
sigqueue_flush.
Current, all signals are allowed to be queued, not only realtime signals.

Earlier patch reviewed by: jhb, deischen
Tested on: i386, amd64


# 9d65cdf6 27-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Rev 1.83 of kern_lock.c fixes the td_locks assert, reenable it here.

Sponsored by: Isilon Systems, Inc.


# eb8d0e01 25-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- The td_locks check is currently broken with snapshots and possibly
some case in unmount. Disable the KASSERT until these problems can
be diagnosed.

Sponsored by: Isilon Systems, Inc.


# 61ef09d1 24-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Fail an assert if we attempt to return with any lockmgr locks held in
userret().

Sponsored by: Isilon Systems, Inc.


# 99b808f4 30-Dec-2004 John Baldwin <jhb@FreeBSD.org>

Whitespace fix.


# 6a987020 26-Dec-2004 Jeff Roberson <jeff@FreeBSD.org>

- Run sched_userret() after thread_userret(). Before, sched_userret() would
lower the priority of the returning thread to a user priority before
calling into thread_userret() which would call wakeup() which in turn would
cause the returning thread to eventually context switch rather than
completing its slice. Allowing this thread to complete its slice first
yields a 15% performance improvement in super-smack on my dual opteron with
4BSD.


# 9197ce2e 23-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add a new per-thread private flag: TDP_GEOM.

This flag gets set whenever the thread posts an event on the GEOM
event queue, and if the flag is set when the thread is prepared
to return to userland from the kernel, g_waitidle() will be called
to make sure that the posted events have completed.

This can replace an insufficient number of g_waitidle() calls in
various other places, and has the advantage of being failsafe: Any
system call which does a VOP_OPEN()/VOP_CLOSE will now correctly
wait for any geom events it posted as part of spoils or tastes.

Assert that topology and Giant is not held in g_waitidle().


# 78c85e8d 05-Oct-2004 John Baldwin <jhb@FreeBSD.org>

Rework how we store process times in the kernel such that we always store
the raw values including for child process statistics and only compute the
system and user timevals on demand.

- Fix the various kern_wait() syscall wrappers to only pass in a rusage
pointer if they are going to use the result.
- Add a kern_getrusage() function for the ABI syscalls to use so that they
don't have to play stackgap games to call getrusage().
- Fix the svr4_sys_times() syscall to just call calcru() to calculate the
times it needs rather than calling getrusage() twice with associated
stackgap, etc.
- Add a new rusage_ext structure to store raw time stats such as tick counts
for user, system, and interrupt time as well as a bintime of the total
runtime. A new p_rux field in struct proc replaces the same inline fields
from struct proc (i.e. p_[isu]ticks, p_[isu]u, and p_runtime). A new p_crux
field in struct proc contains the "raw" child time usage statistics.
ruadd() has been changed to handle adding the associated rusage_ext
structures as well as the values in rusage. Effectively, the values in
rusage_ext replace the ru_utime and ru_stime values in struct rusage. These
two fields in struct rusage are no longer used in the kernel.
- calcru() has been split into a static worker function calcru1() that
calculates appropriate timevals for user and system time as well as updating
the rux_[isu]u fields of a passed in rusage_ext structure. calcru() uses a
copy of the process' p_rux structure to compute the timevals after updating
the runtime appropriately if any of the threads in that process are
currently executing. It also now only locks sched_lock internally while
doing the rux_runtime fixup. calcru() now only requires the caller to
hold the proc lock and calcru1() only requires the proc lock internally.
calcru() also no longer allows callers to ask for an interrupt timeval
since none of them actually did.
- calcru() now correctly handles threads executing on other CPUs.
- A new calccru() function computes the child system and user timevals by
calling calcru1() on p_crux. Note that this means that any code that wants
child times must now call this function rather than reading from p_cru
directly. This function also requires the proc lock.
- This finishes the locking for rusage and friends so some of the Giant locks
in exit1() and kern_wait() are now gone.
- The locking in ttyinfo() has been tweaked so that a shared lock of the
proctree lock is used to protect the process group rather than the process
group lock. By holding this lock until the end of the function we now
ensure that the process/thread that we pick to dump info about will no
longer vanish while we are trying to output its info to the console.

Submitted by: bde (mostly)
MFC after: 1 month


# ea73c1ea 23-Sep-2004 John Baldwin <jhb@FreeBSD.org>

Don't try to protect td_sticks with sched_lock. It doesn't need it as it
is only accessed by curthread.


# 7eaec467 22-Sep-2004 John Baldwin <jhb@FreeBSD.org>

Various small style fixes.


# 5995adc2 31-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Remove an unneeded argument..
The removed argument could trivially be derived from the remaining one.
That in turn should be the same as curthread, but it is possible that curthread could be expensive to derive on some syste,s so leave it as an argument.
Having both proc and thread as an argumen tjust gives an opportunity for
them to get out sync.

MFC after: 3 days


# 99e9dcb8 31-Aug-2004 Julian Elischer <julian@FreeBSD.org>

Remove sched_free_thread() which was only used
in diagnostics. It has outlived its usefulness and has started
causing panics for people who turn on DIAGNOSTIC, in what is otherwise
good code.

MFC after: 2 days


# 2b70a83a 08-Aug-2004 David Xu <davidxu@FreeBSD.org>

Call thread_user_enter for M:N thread, ast() should be treated as another
entrance of kernel.


# 52eb8464 16-Jul-2004 John Baldwin <jhb@FreeBSD.org>

- Move TDF_OWEPREEMPT, TDF_OWEUPC, and TDF_USTATCLOCK over to td_pflags
since they are only accessed by curthread and thus do not need any
locking.
- Move pr_addr and pr_ticks out of struct uprof (which is per-process)
and directly into struct thread as td_profil_addr and td_profil_ticks
as these variables are really per-thread. (They are used to defer an
addupc_intr() that was too "hard" until ast()).


# bf0acc27 02-Jul-2004 John Baldwin <jhb@FreeBSD.org>

- Change mi_switch() and sched_switch() to accept an optional thread to
switch to. If a non-NULL thread pointer is passed in, then the CPU will
switch to that thread directly rather than calling choosethread() to pick
a thread to choose to.
- Make sched_switch() aware of idle threads and know to do
TD_SET_CAN_RUN() instead of sticking them on the run queue rather than
requiring all callers of mi_switch() to know to do this if they can be
called from an idlethread.
- Move constants for arguments to mi_switch() and thread_single() out of
the middle of the function prototypes and up above into their own
section.


# a3a70178 01-Jul-2004 John Baldwin <jhb@FreeBSD.org>

Tidy up uprof locking. Mostly the fields are protected by both the proc
lock and sched_lock so they can be read with either lock held. Document
the locking as well. The one remaining bogosity is that pr_addr and
pr_ticks should be per-thread but profiling of multithreaded apps is
currently undefined.


# 4ccbe07e 31-Mar-2004 Julian Elischer <julian@FreeBSD.org>

Remove unused variable.


# 37814395 13-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Push Giant down a little further:
- no longer serialize on Giant for thread_single*() and family in fork,
exit and exec
- thread_wait() is mpsafe, assert no Giant
- reduce scope of Giant in exit to not cover thread_wait and just do
vm_waitproc().
- assert that thread_single() family are not called with Giant
- remove the DROP/PICKUP_GIANT macros from thread_single() family
- assert that thread_suspend_check() s not called with Giant
- remove manual drop_giant hack in thread_suspend_check since we know it
isn't held.
- remove the DROP/PICKUP_GIANT macros from thread_suspend_check() family
- mark kse_create() mpsafe


# 16df17d0 05-Mar-2004 Robert Watson <rwatson@FreeBSD.org>

Put "failed to set signal flags properly for ast()" check under
DIAGNOSTIC instead of INVARIANTS. INVARIANTS is intended for tests
that don't substantially change code flow or behavior (passive), but
this test required locking both the proc lock and scheduler lock
in order to execute. It also appears to be a very advisory diagnostic
as opposed to an invariant violation.

Following discussion with: bde


# 91d5354a 04-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't used the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64


# 29bcc451 24-Jan-2004 Jeff Roberson <jeff@FreeBSD.org>

- Add a flags parameter to mi_switch. The value of flags may be SW_VOL or
SW_INVOL. Assert that one of these is set in mi_switch() and propery
adjust the rusage statistics. This is to simplify the large number of
users of this interface which were previously all required to adjust the
proper counter prior to calling mi_switch(). This also facilitates more
switch and locking optimizations.
- Change all callers of mi_switch() to pass the appropriate paramter and
remove direct references to the process statistics.


# 917cf8d2 05-Sep-2003 Peter Wemm <peter@FreeBSD.org>

Log involuntary context switches correctly.


# 75ea65e3 04-Aug-2003 David Xu <davidxu@FreeBSD.org>

kse.h is not needed for these files.


# aeaead20 30-Jul-2003 Peter Wemm <peter@FreeBSD.org>

When ktracing context switches, make sure we record involuntary switches.
Otherwise, when we get a evicted from the cpu, there is no record of it.
This is not a default ktrace flag.


# 9dde3bc9 28-Jun-2003 David Xu <davidxu@FreeBSD.org>

o Change kse_thr_interrupt to allow send a signal to a specified thread,
or unblock a thread in kernel, and allow UTS to specify whether syscall
should be restarted.
o Add ability for UTS to monitor signal comes in and removed from process,
the flag PS_SIGEVENT is used to indicate the events.
o Add a KMF_WAITSIGEVENT for KSE mailbox flag, UTS call kse_release with
this flag set to wait for above signal event.
o For SA based thread, kernel masks all signal in its signal mask, let
UTS to use kse_thr_interrupt interrupt a thread, and install a signal
frame in userland for the thread.
o Add a tm_syncsig in thread mailbox, when a hardware trap occurs,
it is used to deliver synchronous signal to userland, and upcall
is schedule, so UTS can process the synchronous signal for the thread.

Reviewed by: julian (mentor)


# cd4f6ebb 14-Jun-2003 David Xu <davidxu@FreeBSD.org>

1. Add code to support bound thread. when blocked, a bound thread never
schedules an upcall. Signal delivering to a bound thread is same as
non-threaded process. This is intended to be used by libpthread to
implement PTHREAD_SCOPE_SYSTEM thread.
2. Simplify kse_release() a bit, remove sleep loop.


# 0e2a4d3a 14-Jun-2003 David Xu <davidxu@FreeBSD.org>

Rename P_THREADED to P_SA. P_SA means a process is using scheduler
activations.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# 90af4afa 13-May-2003 John Baldwin <jhb@FreeBSD.org>

- Merge struct procsig with struct sigacts.
- Move struct sigacts out of the u-area and malloc() it using the
M_SUBPROC malloc bucket.
- Add a small sigacts_*() API for managing sigacts structures: sigacts_alloc(),
sigacts_free(), sigacts_copy(), sigacts_share(), and sigacts_shared().
- Remove the p_sigignore, p_sigacts, and p_sigcatch macros.
- Add a mutex to struct sigacts that protects all the members of the struct.
- Add sigacts locking.
- Remove Giant from nosys(), kill(), killpg(), and kern_sigaction() now
that sigacts is locked.
- Several in-kernel functions such as psignal(), tdsignal(), trapsignal(),
and thread_stopped() are now MP safe.

Reviewed by: arch@
Approved by: re (rwatson)


# 5eac9e2d 23-Apr-2003 John Baldwin <jhb@FreeBSD.org>

The signotify() sanity check in userret() doesn't need Giant anymore.


# 9752f794 22-Apr-2003 John Baldwin <jhb@FreeBSD.org>

- Move PS_PROFIL and its new cousin PS_STOPPROF back over to p_flag and
rename them appropriately. Protect both flags with both the proc lock
and the sched_lock.
- Protect p_profthreads with the proc lock.
- Remove Giant from profil(2).


# f5d5cb3c 17-Apr-2003 John Baldwin <jhb@FreeBSD.org>

Tweak locking in the PS_XCPU handler to hold the sched_lock while reading
p_runtime.


# 4093529d 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move p->p_sigmask to td->td_sigmask. Signal masks will be per thread with
a follow on commit to kern_sig.c
- signotify() now operates on a thread since unmasked pending signals are
stored in the thread.
- PS_NEEDSIGCHK moves to TDF_NEEDSIGCHK.


# 1bf4700b 31-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Change trapsignal() to accept a thread and not a proc.
- Change all consumers to pass in a thread.

Right now this does not cause any functional changes but it will be important
later when signals can be delivered to specific threads.


# 21e0492a 10-Mar-2003 David Xu <davidxu@FreeBSD.org>

Fix signal delivering bug for threaded process.


# 26306795 04-Mar-2003 John Baldwin <jhb@FreeBSD.org>

Replace calls to WITNESS_SLEEP() and witness_list() with equivalent calls
to WITNESS_WARN().


# ac2e4153 26-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Change the process flags P_KSES to be P_THREADED.
This is just a cosmetic change but I've been meaning to do it for about a year.


# 58a3c273 17-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add a new function, thread_signal_add(), that is called from postsig to
add a signal to a mailbox's pending set.
- Add a new function, thread_signal_upcall(), this causes the current thread
to upcall so that we can deliver pending signals.

Reviewed by: mini


# 4a338afd 17-Feb-2003 Julian Elischer <julian@FreeBSD.org>

Move a bunch of flags from the KSE to the thread.
I was in two minds as to where to put them in the first case..
I should have listenned to the other mind.

Submitted by: parts by davidxu@
Reviewed by: jeff@ mini@


# e4625663 16-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move ke_sticks, ke_iticks, ke_uticks, ke_uu, ke_su, and ke_iu back into
the proc. These counters are only examined through calcru.

Submitted by: davidxu
Tested on: x86, alpha, UP/SMP


# 6f8132a8 31-Jan-2003 Julian Elischer <julian@FreeBSD.org>

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 48ed1432 31-Jan-2003 Tim J. Robbins <tjr@FreeBSD.org>

Use a local variable to store the number of ticks that elapsed in
kernel mode instead of (unintentionally) using the global `ticks'.
This error completely broke profiling.


# 0dbb100b 26-Jan-2003 David Xu <davidxu@FreeBSD.org>

Move UPCALL related data structure out of kse, introduce a new
data structure called kse_upcall to manage UPCALL. All KSE binding
and loaning code are gone.

A thread owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and takes those contexts back
to userland. Any thread without upcall structure has to export their
contexts and exit at user boundary.

Any thread running in user mode owns an upcall structure, when it enters
kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in kernel, a new UPCALL thread is created and
the upcall structure is transfered to the new UPCALL thread. if the kse
mailbox's current thread pointer is NULL, then when a thread is blocked
in kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit, when all upcalls in ksegrp are removed, the group is
atomatically shutdown. An upcall owner thread also exits when process is
in exiting state. when an owner thread exits, the upcall it owns is also
removed.

KSE is a pure scheduler entity. it represents a virtual cpu. when a thread
is running, it always has a KSE associated with it. scheduler is free to
assign a KSE to thread according thread priority, if thread priority is changed,
KSE can be moved from one thread to another.

When a ksegrp is created, there is always N KSEs created in the group. the
N is the number of physical cpu in the current system. This makes it is
possible that even an userland UTS is single CPU safe, threads in kernel still
can execute on different cpu in parallel. Userland calls kse_create to add more
upcall structures into ksegrp to increase concurrent in userland itself, kernel
is not restricted by number of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 93a7aa79 27-Dec-2002 Julian Elischer <julian@FreeBSD.org>

Add code to ddb to allow backtracing an arbitrary thread.
(show thread {address})

Remove the IDLE kse state and replace it with a change in
the way threads sahre KSEs. Every KSE now has a thread, which is
considered its "owner" however a KSE may also be lent to other
threads in the same group to allow completion of in-kernel work.
n this case the owner remains the same and the KSE will revert to the
owner when the other work has been completed.

All creations of upcalls etc. is now done from
kse_reassign() which in turn is called from mi_switch or
thread_exit(). This means that special code can be removed from
msleep() and cv_wait().

kse_release() does not leave a KSE with no thread any more but
converts the existing thread into teh KSE's owner, and sets it up
for doing an upcall. It is just inhibitted from being scheduled until
there is some reason to do an upcall.

Remove all trace of the kse_idle queue since it is no-longer needed.
"Idle" KSEs are now on the loanable queue.


# 52378b8a 08-Nov-2002 Robert Watson <rwatson@FreeBSD.org>

To reduce per-return overhead of userret(), call into
mac_thread_userret() only if PS_MACPEND is set in the process AST mask.
This avoids the cost of the entry point in the common case, but
requires policies interested in the userret event to set the flag
(protected by the scheduler lock) if they do want the event. Since
all the policies that we're working with which use mac_thread_userret()
use the entry point only selectively to perform operations deferred
for locking reasons, this maintains the desired semantics.

Approved by: re
Requested by: bde
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 053effc6 25-Oct-2002 Julian Elischer <julian@FreeBSD.org>

iBack out david's last commit. the suspension code needs to be called
for non KSE processes too.


# 3139ada5 25-Oct-2002 David Xu <davidxu@FreeBSD.org>

Move suspension checking code from userret() into thread_userret().


# b43179fb 11-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Create a new scheduler api that is defined in sys/sched.h
- Begin moving scheduler specific functionality into sched_4bsd.c
- Replace direct manipulation of scheduler data with hooks provided by the
new api.
- Remove KSE specific state modifications and single runq assumptions from
kern_switch.c

Reviewed by: -arch


# 5715307f 09-Oct-2002 John Baldwin <jhb@FreeBSD.org>

- Move p_cpulimit to struct proc from struct plimit and protect it with
sched_lock. This means that we no longer access p_limit in mi_switch()
and the p_limit pointer can be protected by the proc lock.
- Remove PRS_ZOMBIE check from CPU limit test in mi_switch(). PRS_ZOMBIE
processes don't call mi_switch(), and even if they did there is no longer
the danger of p_limit being NULL (which is what the original zombie check
was added for).
- When we bump the current processes soft CPU limit in ast(), just bump the
private p_cpulimit instead of the shared rlimit. This fixes an XXX for
some value of fix. There is still a (probably benign) bug in that this
code doesn't check that the new soft limit exceeds the hard limit.

Inspired by: bde (2)


# 289e1e23 02-Oct-2002 Juli Mallett <jmallett@FreeBSD.org>

Access td->td_kse inside sched_lock.

Submitted by: julian


# bc7b9f1d 02-Oct-2002 Juli Mallett <jmallett@FreeBSD.org>

De-obfuscate local use of members of 'struct thread', for which we have
local variables, and group assignment.


# 92dbb82a 01-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Add a new MAC entry point, mac_thread_userret(td), which permits policy
modules to perform MAC-related events when a thread returns to user
space. This is required for policies that have floating process labels,
as it's not always possible to acquire the process lock at arbitrary
points in the stack during system call processing; process labels might
represent traditional authentication data, process history information,
or other data.

LOMAC will use this entry point to perform the process label update
prior to the thread returning to userspace, when plugged into the MAC
framework.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 1d9c5696 01-Oct-2002 Juli Mallett <jmallett@FreeBSD.org>

Back our kernel support for reliable signal queues.

Requested by: rwatson, phk, and many others


# feb24496 01-Oct-2002 John Baldwin <jhb@FreeBSD.org>

Minor style nits in a comment.


# 6cae6dac 01-Oct-2002 John Baldwin <jhb@FreeBSD.org>

Various style fixups.

Submitted by: bde (mostly)


# f6ccde83 01-Oct-2002 John Baldwin <jhb@FreeBSD.org>

Actually clear PS_XCPU in ast() when we handle it.

Submitted by: bde
Pointy hat to: jhb


# dc183990 30-Sep-2002 John Baldwin <jhb@FreeBSD.org>

- Add a new per-process flag PS_XCPU to indicate that at least one thread
has exceeded its CPU time limit.
- In mi_switch(), set PS_XCPU when the CPU time limit is exceeded.
- Perform actual CPU time limit exceeded work in ast() when PS_XCPU is set.

Requested by: many


# 1226f694 30-Sep-2002 Juli Mallett <jmallett@FreeBSD.org>

First half of implementation of ksiginfo, signal queues, and such. This
gets signals operating based on a TailQ, and is good enough to run X11,
GNOME, and do job control. There are some intricate parts which could be
more refined to match the sigset_t versions, but those require further
evaluation of directions in which our signal system can expand and contract
to fit our needs.

After this has been in the tree for a while, I will make in kernel API
changes, most notably to trapsignal(9) and sendsig(9), to use ksiginfo
more robustly, such that we can actually pass information with our
(queued) signals to the userland. That will also result in using a
struct ksiginfo pointer, rather than a signal number, in a lot of
kern_sig.c, to refer to an individual pending signal queue member, but
right now there is no defined behaviour for such.

CODAFS is unfinished in this regard because the logic is unclear in
some places.

Sponsored by: New Gold Technology
Reviewed by: bde, tjr, jake [an older version, logic similar]


# 253fdd5b 23-Sep-2002 Julian Elischer <julian@FreeBSD.org>

slightly clean up the thread_userret() and thread_consider_upcall() calls.
also some slight changes for TDF_BOUND testing and small style changes
Should ONLY affect KSE programs

Submitted by: davidxu


# 1c39a774 22-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Spell proprly properly:

failed to set signal flags proprly for ast()
failed to set signal flags proprly for ast()
failed to set signal flags proprly for ast()
failed to set signal flags proprly for ast()


# aaa1c771 10-Jul-2002 Jonathan Mini <mini@FreeBSD.org>

Revert removal of cred_free_thread(): It is used to ensure that a thread's
credentials are not improperly borrowed when the thread is not current in
the kernel.

Requested by: jhb, alfred


# ad22735e 10-Jul-2002 Julian Elischer <julian@FreeBSD.org>

Don't slow every syscall and trap by doing locks and stuff if the
'stop' bits are not set. This is a temporary thing.. I think this code probably
needs to be rewritten anyhow.


# e602ba25 29-Jun-2002 Julian Elischer <julian@FreeBSD.org>

Part 1 of KSE-III

The ability to schedule multiple threads per process
(one one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 01ad8a53 24-Jun-2002 Jonathan Mini <mini@FreeBSD.org>

Remove unused diagnostic function cread_free_thread().

Approved by: alfred


# d0c149fc 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

We no longer need to acqure Giant in ast() for ktrpsig() in postsig() now
that ktrace no longer needs Giant.


# 628855e7 29-May-2002 Julian Elischer <julian@FreeBSD.org>

CURSIG() is not a macro so rename it cursig().

Obtained from: KSE tree


# 79065dba 04-Apr-2002 Bruce Evans <bde@FreeBSD.org>

Moved signal handling and rescheduling from userret() to ast() so that
they aren't in the usual path of execution for syscalls and traps.
The main complication for this is that we have to set flags to control
ast() everywhere that changes the signal mask.

Avoid locking in userret() in most of the remaining cases.

Submitted by: luoqi (first part only, long ago, reorganized by me)
Reminded by: dillon


# b454c6dd 29-Mar-2002 Jake Burkholder <jake@FreeBSD.org>

Style fixes purposefully left out of last commit. I checked the kse tree
and didn't see any changes that this conflicts with.


# d0ce9a7e 29-Mar-2002 Jake Burkholder <jake@FreeBSD.org>

Remove abuse of intr_disable/restore in MI code by moving the loop in ast()
back into the calling MD code. The MD code must ensure no races between
checking the astpening flag and returning to usermode.

Submitted by: peter (ia64 bits)
Tested on: alpha (peter, jeff), i386, ia64 (peter), sparc64


# cb9a238a 20-Mar-2002 Warner Losh <imp@FreeBSD.org>

Remove last two abuses of cpu_critical_{enter,exit} in the MI code.

Reviewed by: jake, jhb, rwatson


# 01c04d2d 20-Mar-2002 John Baldwin <jhb@FreeBSD.org>

Change the way we ensure td_ucred is NULL if DIAGNOSTIC is defined.
Instead of caching the ucred reference, just go ahead and eat the
decerement and increment of the refcount. Now that Giant is pushed down
into crfree(), we no longer have to get Giant in the common case. In the
case when we are actually free'ing the ucred, we would normally free it on
the next kernel entry, so the cost there is not new, just in a different
place. This also removse td_cache_ucred from struct thread. This is
still only done #ifdef DIAGNOSTIC.

[ missed this file in the previous commit ]

Tested on: i386, alpha


# 39dda4e3 22-Feb-2002 Jake Burkholder <jake@FreeBSD.org>

Make this compile.

Pointy hat to: julian


# 77c40664 22-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Add some DIAGNOSTIC code.
While in userland, keep the thread's ucred reference in a shadow
field so that the usual place to store it is NULL.
If DIAGNOSTIC is not set, the thread ucred is kept valid until the next
kernel entry, at which time it is checked against the process cred
and possibly corrected. Produces a BIG speedup in
kernels with INVARIANTS set. (A previous commit corrected it
for the non INVARIANTS case already)

Reviewed by: dillon@freebsd.org


# 2eb927e2 16-Feb-2002 Julian Elischer <julian@FreeBSD.org>

If the credential on an incoming thread is correct, don't bother
reaquiring it. In the same vein, don't bother dropping the thread cred
when goinf ot userland. We are guaranteed to nned it when we come back,
(which we are guaranteed to do).

Reviewed by: jhb@freebsd.org, bde@freebsd.org (slightly different version)


# 2c100766 11-Feb-2002 Julian Elischer <julian@FreeBSD.org>

In a threaded world, differnt priorirites become properties of
different entities. Make it so.

Reviewed by: jhb@freebsd.org (john baldwin)


# e744f309 17-Jan-2002 Bruce Evans <bde@FreeBSD.org>

Changed the type of pcb_flags from u_char to u_int and adjusted things.
This removes the only atomic operation on a char type in the entire
kernel.


# c86b6ff5 05-Jan-2002 John Baldwin <jhb@FreeBSD.org>

Change the preemption code for software interrupt thread schedules and
mutex releases to not require flags for the cases when preemption is
not allowed:

The purpose of the MTX_NOSWITCH and SWI_NOSWITCH flags is to prevent
switching to a higher priority thread on mutex releease and swi schedule,
respectively when that switch is not safe. Now that the critical section
API maintains a per-thread nesting count, the kernel can easily check
whether or not it should switch without relying on flags from the
programmer. This fixes a few bugs in that all current callers of
swi_sched() used SWI_NOSWITCH, when in fact, only the ones called from
fast interrupt handlers and the swi_sched of softclock needed this flag.
Note that to ensure that swi_sched()'s in clock and fast interrupt
handlers do not switch, these handlers have to be explicitly wrapped
in critical_enter/exit pairs. Presently, just wrapping the handlers is
sufficient, but in the future with the fully preemptive kernel, the
interrupt must be EOI'd before critical_exit() is called. (critical_exit()
can switch due to a deferred preemption in a fully preemptive kernel.)

I've tested the changes to the interrupt code on i386 and alpha. I have
not tested ia64, but the interrupt code is almost identical to the alpha
code, so I expect it will work fine. PowerPC and ARM do not yet have
interrupt code in the tree so they shouldn't be broken. Sparc64 is
broken, but that's been ok'd by jake and tmm who will be fixing the
interrupt code for sparc64 shortly.

Reviewed by: peter
Tested on: i386, alpha


# 9d234f99 04-Jan-2002 John Baldwin <jhb@FreeBSD.org>

Axe a stale comment. Holding sched_lock across both setrunqueue() and
mi_switch() is sufficient.


# 48fd1f38 18-Dec-2001 John Baldwin <jhb@FreeBSD.org>

- Change all callers of addupc_task() to check PS_PROFIL explicitly and
remove the check from addupc_task(). It would need sched_lock while
testing the flag anyways.
- Always read sticks while holding sched_lock using a temporary variable
where needed.
- Always init prticks to 0 in ast() to quiet a warning.


# 7e1f6dfe 17-Dec-2001 John Baldwin <jhb@FreeBSD.org>

Modify the critical section API as follows:
- The MD functions critical_enter/exit are renamed to start with a cpu_
prefix.
- MI wrapper functions critical_enter/exit maintain a per-thread nesting
count and a per-thread critical section saved state set when entering
a critical section while at nesting level 0 and restored when exiting
to nesting level 0. This moves the saved state out of spin mutexes so
that interlocking spin mutexes works properly.
- Most low-level MD code that used critical_enter/exit now use
cpu_critical_enter/exit. MI code such as device drivers and spin
mutexes use the MI wrappers. Note that since the MI wrappers store
the state in the current thread, they do not have any return values or
arguments.
- mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is
assigned to curthread->td_savecrit during fork_exit().

Tested on: i386, alpha


# 8e2e767b 26-Oct-2001 John Baldwin <jhb@FreeBSD.org>

Add a per-thread ucred reference for syscalls and synchronous traps from
userland. The per thread ucred reference is immutable and thus needs no
locks to be read. However, until all the proc locking associated with
writes to p_ucred are completed, it is still not safe to use the per-thread
reference.

Tested on: x86 (SMP), alpha, sparc64


# 278da511 21-Sep-2001 John Baldwin <jhb@FreeBSD.org>

Remove a bogus comment. "atomic" doesn't mean that the operation is done
as a physical atomic operation. That would require the code to use the
atomic API, which it does not. Instead, the operation is made psuedo
atomic (hence the quotes) by use of the lock to protect clearing all of the
flags in question.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 356861db 30-Aug-2001 Matthew Dillon <dillon@FreeBSD.org>

Remove the MPSAFE keyword from the parser for syscalls.master.
Instead introduce the [M] prefix to existing keywords. e.g.
MSTD is the MP SAFE version of STD. This is prepatory for a
massive Giant lock pushdown. The old MPSAFE keyword made
syscalls.master too messy.

Begin comments MP-Safe procedures with the comment:
/*
* MPSAFE
*/
This comments means that the procedure may be called without
Giant held (The procedure itself may still need to obtain
Giant temporarily to do its thing).

sv_prepsyscall() is now MP SAFE and assumed to be MP SAFE
sv_transtrap() is now MP SAFE and assumed to be MP SAFE

ktrsyscall() and ktrsysret() are now MP SAFE (Giant Pushdown)
trapsignal() is now MP SAFE (Giant Pushdown)

Places which used to do the if (mtx_owned(&Giant)) mtx_unlock(&Giant)
test in syscall[2]() in */*/trap.c now do not. Instead they
explicitly unlock Giant if they previously obtained it, and then
assert that it is no longer held to catch broken system calls.

Rebuild syscall tables.


# 688ebe12 10-Aug-2001 John Baldwin <jhb@FreeBSD.org>

- Close races with signals and other AST's being triggered while we are in
the process of exiting the kernel. The ast() function now loops as long
as the PS_ASTPENDING or PS_NEEDRESCHED flags are set. It returns with
preemption disabled so that any further AST's that arrive via an
interrupt will be delayed until the low-level MD code returns to user
mode.
- Use u_int's to store the tick counts for profiling purposes so that we
do not need sched_lock just to read p_sticks. This also closes a
problem where the call to addupc_task() could screw up the arithmetic
due to non-atomic reads of p_sticks.
- Axe need_proftick(), aston(), astoff(), astpending(), need_resched(),
clear_resched(), and resched_wanted() in favor of direct bit operations
on p_sflag.
- Fix up locking with sched_lock some. In addupc_intr(), use sched_lock
to ensure pr_addr and pr_ticks are updated atomically with setting
PS_OWEUPC. In ast() we clear pr_ticks atomically with clearing
PS_OWEUPC. We also do not grab the lock just to test a flag.
- Simplify the handling of Giant in ast() slightly.

Reviewed by: bde (mostly)


# 085be199 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

postsig() currently requires Giant to be held. Giant is held properly at
the first postsig() call, but not always held at the second place,
resulting in an occassional panic.


# 64acb05b 02-Jul-2001 John Baldwin <jhb@FreeBSD.org>

Grab Giant around postsig() since sendsig() can call into the vm to
grow the stack and we already needed Giant for KTRACE.


# 7aa7260e 29-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Move ast() and userret() to sys/kern/subr_trap.c now that they are MI.


# 6be523bc 29-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Add a new MI pointer to the process' trapframe p_frame instead of using
various differently named pointers buried under p_md.

Reviewed by: jake (in principle)


# 92809bc0 28-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Grab Giant around trap_pfault() for now.


# 06c836bb 22-Jun-2001 John Baldwin <jhb@FreeBSD.org>

- Grab the proc lock around CURSIG and postsig(). Don't release the proc
lock until after grabbing the sched_lock to avoid CURSIG racing with
psignal.
- Don't grab Giant for addupc_task() as it isn't needed.

Reported by: tegge (signal race), bde (addupc_task a while back)


# 262c9f8a 05-Jun-2001 John Baldwin <jhb@FreeBSD.org>

Don't hold sched_lock across addupc_task().

Reported by: David Taylor <davidt@yadt.co.uk>
Submitted by: bde


# 0dfefe68 23-May-2001 John Baldwin <jhb@FreeBSD.org>

Don't acquire Giant just to call trap_fatal(), we are about to panic
anyway so we'd rather see the printf's then block if the system is
hosed.


# 1c1771cb 22-May-2001 Bruce Evans <bde@FreeBSD.org>

Convert npx interrupts into traps instead of vice versa. This is much
simpler for npx exceptions that start as traps (no assembly required...)
and works better for npx exceptions that start as interrupts (there is
no longer a problem for nested interrupts).

Submitted by: original (pre-SMPng) version by luoqi


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 8bd57f8f 15-May-2001 John Baldwin <jhb@FreeBSD.org>

Remove unneeded includes of sys/ipl.h and machine/ipl.h.


# 1efb92b7 11-May-2001 John Baldwin <jhb@FreeBSD.org>

Simplify the vm fault trap handling code a bit by using if-else instead of
duplicating code in the then case and then using a goto to jump around
the else case.


# 6caa8a15 27-Apr-2001 John Baldwin <jhb@FreeBSD.org>

Overhaul of the SMP code. Several portions of the SMP kernel support have
been made machine independent and various other adjustments have been made
to support Alpha SMP.

- It splits the per-process portions of hardclock() and statclock() off
into hardclock_process() and statclock_process() respectively. hardclock()
and statclock() call the *_process() functions for the current process so
that UP systems will run as before. For SMP systems, it is simply necessary
to ensure that all other processors execute the *_process() functions when the
main clock functions are triggered on one CPU by an interrupt. For the alpha
4100, clock interrupts are delievered in a staggered broadcast fashion, so
we simply call hardclock/statclock on the boot CPU and call the *_process()
functions on the secondaries. For x86, we call statclock and hardclock as
usual and then call forward_hardclock/statclock in the MD code to send an IPI
to cause the AP's to execute forwared_hardclock/statclock which then call the
*_process() functions.
- forward_signal() and forward_roundrobin() have been reworked to be MI and to
involve less hackery. Now the cpu doing the forward sets any flags, etc. and
sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically
return so that they can execute ast() and don't bother with setting the
astpending or needresched flags themselves. This also removes the loop in
forward_signal() as sched_lock closes the race condition that the loop worked
around.
- need_resched(), resched_wanted() and clear_resched() have been changed to take
a process to act on rather than assuming curproc so that they can be used to
implement forward_roundrobin() as described above.
- Various other SMP variables have been moved to a MI subr_smp.c and a new
header sys/smp.h declares MI SMP variables and API's. The IPI API's from
machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h.
- The globaldata_register() and globaldata_find() functions as well as the
SLIST of globaldata structures has become MI and moved into subr_smp.c.
Also, the globaldata list is only available if SMP support is compiled in.

Reviewed by: jake, peter
Looked over by: eivind


# f227364a 06-Mar-2001 John Baldwin <jhb@FreeBSD.org>

- Release Giant a bit earlier on syscall exit.
- Don't try to grab Giant before postsig() in userret() as it is no longer
needed.
- Don't grab Giant before psignal() in ast() but get the proc lock instead.


# 631d7bf3 24-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

- Rename the lcall system call handler from Xsyscall to Xlcall_syscall
to be more like Xint0x80_syscall and less like c function syscall().
- Reduce code duplication between the int0x80 and lcall handlers by
shuffling the elfags into the right place, saving the sizeof the
instruction in tf_err and jumping into the common int0x80 code.

Reviewed by: peter


# feb43c5f 22-Feb-2001 John Baldwin <jhb@FreeBSD.org>

The p_md.md_regs member of proc is used in signal handling to reference
the the original trapframe of the syscall, trap, or interrupt that entered
the kernel. Before SMPng, ast's were handled via a psuedo trap at the
end of doerti. With the SMPng commit, ast's were broken out into a
separate ast() function that was called from doreti to match the behavior
of other architectures. Unfortunately, when this was done, the
p_md.md_regs member of curproc was not updateda in ast(), thus when
signals are handled by userret() after an interrupt that returns to
userland, we end up using a stale trapframe that will result in the
registers from the old trapframe overwriting the real trapframe and
smashing all the registers right before we return to usermode. The saved
%cs:%eip from where we were in usermode are saved in the trapframe for
example.


# f308e0d7 22-Feb-2001 John Baldwin <jhb@FreeBSD.org>

- Change ast() to take a pointer to a trapframe like other architectures.
- Don't use an atomic operation to update cnt.v_soft in ast(). This is
the only place the variable is written to, and sched_lock is always
held when it is written, so it is already protected and the mutex release
of sched_lock asserts a memory barrier that ensures the value will be
updated in a timely fashion.


# 26f9f5c7 22-Feb-2001 John Baldwin <jhb@FreeBSD.org>

- Use TRAPF_PC() on the alpha to acess the PC in the trap frame.
- Don't hold sched_lock around addupc_task() as this apparently breaks
profiling badly due to sched_lock being held across copyin().

Reported by: bde (2)


# 5813dc03 19-Feb-2001 John Baldwin <jhb@FreeBSD.org>

- Don't call clear_resched() in userret(), instead, clear the resched flag
in mi_switch() just before calling cpu_switch() so that the first switch
after a resched request will satisfy the request.
- While I'm at it, move a few things into mi_switch() and out of
cpu_switch(), specifically set the p_oncpu and p_lastcpu members of
proc in mi_switch(), and handle the sched_lock state change across a
context switch in mi_switch().
- Since cpu_switch() no longer handles the sched_lock state change, we
have to setup an initial state for sched_lock in fork_exit() before we
release it.


# 0ad74739 19-Feb-2001 Bruce Evans <bde@FreeBSD.org>

Removed all traces of T_ASTFLT (except for gaps where it was). It became
unused except in dead code when ast() was split off from trap().


# 86654610 18-Feb-2001 Bruce Evans <bde@FreeBSD.org>

Changed the aston() family to operate on a specified process instead of
always on curproc. This is needed to implement signal delivery properly
(see a future log message for kern_sig.c).

Debogotified the definition of aston(). aston() was defined in terms
of signotify() (perhaps because only the latter already operated on
a specified process), but aston() is the primitive.

Similar changes are needed in the ia64 versions of cpu.h and trap.c.
I didn't make them because the ia64 is missing the prerequisite changes
to make astpending and need_resched per-process and those changes are
too large to make without testing.


# d5a08a60 11-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Implement a unified run queue and adjust priority levels accordingly.

- All processes go into the same array of queues, with different
scheduling classes using different portions of the array. This
allows user processes to have their priorities propogated up into
interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
32. We used to have 4 separate arrays of 32 queues each, so this
may not be optimal. The new run queue code was written with this
in mind; changing the number of run queues only requires changing
constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter. This
is intended to be used to create per-cpu run queues. Implement
wrappers for compatibility with the old interface which pass in
the global run queue structure.
- Group the priority level, user priority, native priority (before
propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
This was used to detect when a process' priority had lowered and
it should yield. We now effectively yield on every interrupt.
- Activate propogate_priority(). It should now have the desired
effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
idle loop. It interfered with propogate_priority() because
the idle process needed to do a non-blocking acquire of Giant
and then other processes would try to propogate their priority
onto it. The idle process should not do anything except idle.
vm_page_zero_idle() will return in the form of an idle priority
kernel thread which is woken up at apprioriate times by the vm
system.
- Update struct kinfo_proc to the new priority interface. Deliberately
change its size by adjusting the spare fields. It remained the same
size, but the layout has changed, so userland processes that use it
would parse the data incorrectly. The size constraint should really
be changed to an arbitrary version number. Also add a debug.sizeof
sysctl node for struct kinfo_proc.


# 3cbe75a4 10-Feb-2001 Jake Burkholder <jake@FreeBSD.org>

Clear the reschedule flag after finding it set in userret(). This
used to be in cpu_switch(), but I don't see any difference between
doing it here.


# 142ba5f3 09-Feb-2001 John Baldwin <jhb@FreeBSD.org>

- Make astpending and need_resched process attributes rather than CPU
attributes. This is needed for AST's to be properly posted in a preemptive
kernel. They are backed by two new flags in p_sflag: PS_ASTPENDING and
PS_NEEDRESCHED. They are still accesssed by their old macros:
aston(), astoff(), etc. For completeness, an astpending() macro has been
added to check for a pending AST, and clear_resched() has been added to
clear need_resched().
- Rename syscall2() on the x86 back to syscall() to be consistent with
other architectures.


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 297c46b6 07-Feb-2001 John Baldwin <jhb@FreeBSD.org>

Don't enable interrupts for a kernel breakpoint or trace trap. Otherwise,
this negates the explicit disabling of interrupts when entering the
debugger in Debugger().


# 1a6e52d0 06-Feb-2001 Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>

Fix typo: seperate -> separate.

Seperate does not exist in the english language.


# 03927d3c 29-Jan-2001 Peter Wemm <peter@FreeBSD.org>

Send "#if NISA > 0" to the bit-bucket and replace it with an option.
These were compile-time "is the isa code present?" tests and not
'how many isa busses' tests.


# 28df158b 25-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Push Giant down into the trap handlers that need it, instead of
acquiring it unconditionally.

Reviewed by: jhb


# 625c76db 24-Jan-2001 John Baldwin <jhb@FreeBSD.org>

- Kill the have_giant parameter to userret() along with all instances of
that name as a variable. Use mtx_owned(&Giant) where appropriate
instead.
- Proc locking.
- P_FOO -> PS_FOO.
- Update comments about enable interrupts during trap and why this may be
bad if we trap while holding a spin mutex.
- Don't bother resetting p to curproc in syscall() in case we are the child
returning from fork. The child hasn't returned from fork through syscall
in a while.
- Remove fork_return() as it has been superseded by the MI version.


# a448b62a 21-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Make intr_nesting_level per-process, rather than per-cpu. Setup
interrupt threads to run with it always >= 1, so that malloc can
detect M_WAITOK from "interrupt" context. This is also necessary
in order to context switch from sched_ithd() directly.

Reviewed By: peter


# 558226ea 19-Jan-2001 Peter Wemm <peter@FreeBSD.org>

Use #ifdef DEV_NPX from opt_npx.h instead of #if NNPX > 0 from npx.h


# ef73ae4b 09-Jan-2001 Jake Burkholder <jake@FreeBSD.org>

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other then curproc.


# 05f9877c 13-Dec-2000 John Baldwin <jhb@FreeBSD.org>

If we fail to emulate a vm86 trap in kernel mode, then we use
vm86_trap() to return to the calling program directly. vm86_trap()
doesn't return, thus it was never returning to trap() to release
Giant. Thus, release Giant before calling vm86_trap().


# 92cf772d 11-Dec-2000 Jake Burkholder <jake@FreeBSD.org>

- Add code to detect if a system call returns with locks other than Giant
held and panic if so (conditional on witness).
- Change witness_list to return the number of locks held so this is easier.
- Add kern/syscalls.c to the kernel build if witness is defined so that the
panic message can contain the name of the offending system call.
- Add assertions that Giant and sched_lock are not held when returning from
a system call, which were missing for alpha and ia64.


# 7da6f977 17-Nov-2000 Jake Burkholder <jake@FreeBSD.org>

- Split the run queue and sleep queue linkage, so that a process
may block on a mutex while on the sleep queue without corrupting
it.
- Move dropping of Giant to after the acquire of sched_lock.

Tested by: John Hay <jhay@icomtek.csir.co.za>
jhb


# 20cdcc5b 15-Nov-2000 John Baldwin <jhb@FreeBSD.org>

Don't release and acquire Giant in mi_switch(). Instead, release and
acquire Giant as needed in functions that call mi_switch(). The releases
need to be done outside of the sched_lock to avoid potential deadlocks
from trying to acquire Giant while interrupts are disabled.

Submitted by: witness


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 6c567274 05-Oct-2000 John Baldwin <jhb@FreeBSD.org>

- Change fast interrupts on x86 to push a full interrupt frame and to
return through doreti to handle ast's. This is necessary for the
clock interrupts to work properly.
- Change the clock interrupts on the x86 to be fast instead of threaded.
This is needed because both hardclock() and statclock() need to run in
the context of the current process, not in a separate thread context.
- Kill the prevproc hack as it is no longer needed.
- We really need Giant when we call psignal(), but we don't want to block
during the clock interrupt. Instead, use two p_flag's in the proc struct
to mark the current process as having a pending SIGVTALRM or a SIGPROF
and let them be delivered during ast() when hardclock() has finished
running.
- Remove CLKF_BASEPRI, which was #ifdef'd out on the x86 anyways. It was
broken on the x86 if it was turned on since cpl is gone. It's only use
was to bogusly run softclock() directly during hardclock() rather than
scheduling an SWI.
- Remove the COM_LOCK simplelock and replace it with a clock_lock spin
mutex. Since the spin mutex already handles disabling/restoring
interrupts appropriately, this also lets us axe all the *_intr() fu.
- Back out the hacks in the APIC_IO x86 cpu_initclocks() code to use
temporary fast interrupts for the APIC trial.
- Add two new process flags P_ALRMPEND and P_PROFPEND to mark the pending
signals in hardclock() that are to be delivered in ast().

Submitted by: jakeb (making statclock safe in a fast interrupt)
Submitted by: cp (concept of delaying signals until ast())


# a91b7dc1 05-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Various whitespace cleanups after the SMPng commit, which jumbled things
around a bit in the trap handling code.


# 0e2aab12 05-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Don't treat a kernel stack fault the same as a general protect fault or
a segment not present fault in the non-vm86 case.


# 9c15b3c1 12-Sep-2000 Bruce Evans <bde@FreeBSD.org>

Fixed hang on booting with -d. mtx_enter() was called on an uninitialized
lock. The quick fix in trap.c was not quite the version tested and had no
effect; back it out.


# bbbb2579 12-Sep-2000 Bruce Evans <bde@FreeBSD.org>

Quick fix for hang on booting with -d. mtx_enter() was called before
curproc was initialized. curproc == NULL was interpreted as matching
the process holding Giant... Just skip mtx_enter() and mtx_exit() in
trap() if (curproc == NULL && cold) (&& cold for safety).


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# c206a860 06-Aug-2000 Paul Saab <ps@FreeBSD.org>

Change the behavior of isa_nmi to log an error message instead of
panicing and return a status so that we can decide whether to drop
into DDB or panic. If the status from isa_nmi is true, panic the
kernel based on machdep.panic_on_nmi, otherwise if DDB is
enabled, drop to DDB based on machdep.ddb_on_nmi.

Reviewed by: peter, phk


# 3fb50adb 31-Jul-2000 Luoqi Chen <luoqi@FreeBSD.org>

Handle write page faults (both write only or read-modify-write) as MI vm
write-only faults. This would allow write-only mmapped regions to function
correctly.


# 88f675ba 14-Jul-2000 Paul Saab <ps@FreeBSD.org>

Change the way NMI's are handled. Before, if DDB was enabled and
a NMI occured, you could type continue in DDB and the kernel would
not attempt to detect what type of NMI was recieved. Now we check
for the type of NMI first and then go to DDB if it is enabled.

This will solve the problem with having DDB enabled and getting an
NMI due to some possibly bad error and being able to continue the
operation of the kernel when you really want to panic and know
what happened.

Submitted by: jhb


# c6d3f3bf 30-Jun-2000 Brian S. Dean <bsd@FreeBSD.org>

Fix my own style bugs (use of spaces instead of tabs for indentation).
This is a style-only change.


# 36e9f877 28-Mar-2000 Matthew Dillon <dillon@FreeBSD.org>

Commit major SMP cleanups and move the BGL (big giant lock) in the
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).

A great deal of conditional SMP code for various deadended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.

Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.

This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.

This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.

Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch


# 6d9a8d3e 02-Mar-2000 Peter Dufault <dufault@FreeBSD.org>

I applied the wrong patch set. Back out anything associated
with the known bogus currtpriority. This undoes the previous changes to
sys/i386/i386/trap.c, sys/alpha/alpha/trap.c, sys/sys/systm.h

Now we have the patch set approved by bde.

Approved by: bde


# 383774c4 02-Mar-2000 Peter Dufault <dufault@FreeBSD.org>

Patches that eliminate extra context switches in FIFO case.
Fixes p1003_1b regression test in the simple case of no RR and
FIFO processes competing.

Reviewed by: jkh, bde


# de8050f9 20-Feb-2000 Brian S. Dean <bsd@FreeBSD.org>

Don't forget to reset the hardware debug registers when a process that
was using them exits.

Don't allow a user process to cause the kernel to take a TRCTRAP on a
user space address.

Reviewed by: jlemon, sef
Approved by: jkh


# 35e61cbd 11-Jan-2000 Kazutaka YOKOTA <yokota@FreeBSD.org>

Add a new mechanism, cndbctl(), to tell the console driver that
ddb is entered. Don't refer to `in_Debugger' to see if we
are in the debugger. (The variable used to be static in Debugger()
and wasn't updated if ddb is entered via traps and panic anyway.)

- Don't refer to `in_Debugger'.
- Add `db_active' to i386/i386/db_interface.d (as in
alpha/alpha/db_interface.c).
- Remove cnpollc() stub from ddb/db_input.c.
- Add the dbctl function to syscons, pcvt, and sio. (The function for
pcvt and sio is noop at the moment.)

Jointly developed by: bde and me

(The final version was tweaked by me and not reviewed by bde. Thus,
if there is any error in this commit, that is entirely of mine, not
his.)

Some changes were obtained from: NetBSD


# b5616833 08-Nov-1999 Alan Cox <alc@FreeBSD.org>

Passing "0" or "FALSE" as the fourth argument to vm_fault is wrong. It
should be "VM_FAULT_NORMAL".


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# a7674320 25-Jul-1999 Martin Cracauer <cracauer@FreeBSD.org>

On FPU exceptions, pass a useful error code (one of the FPE_...
macros) to the signal handler, for old-style BSD signal handlers as
the second (int) argument, for SA_SIGINFO signal handlers as
siginfo_t->si_code. This is source-compatible with Solaris, except
that we have no <siginfo.h> (which isn't even mentioned in POSIX
1003.1b).

An rather complete example program is at
http://www3.cons.org/cracauer/freebsd-signal.c
This will be added to the regression tests in src/.

This commit also adds code to disable the (hardware) FPU from
userconfig, so that you can use a software FP emulator on a machine
that has hardware floating point. See LINT.


# 50045fbc 18-Jun-1999 Bruce Evans <bde@FreeBSD.org>

Changed the global `idt' from an array to a pointer so that npx.c
automatically hacks on the active copy of the IDT if f00f_hack()
has changed it. This also allows simplifications in setidt().
This fixes breakage of FP exception handling by rev.1.55 of
sys/kernel.h. FP exceptions were sent to npx.c's probe handlers
because npx.c "restored" the old handlers to the wrong copy of the
IDT. The SYSINIT for f00f_hack() was purposely run quite late to
avoid problems like this, but it is bogusly associated with the
SYSINIT for proc0 so it was moved with the latter.

Problem reported and fix tested by: Martin Cracauer <cracauer@cons.org>


# eb9d435a 01-Jun-1999 Jonathan Lemon <jlemon@FreeBSD.org>

Unifdef VM86.

Reviewed by: silence on on -current


# dfd5dee1 06-May-1999 Peter Wemm <peter@FreeBSD.org>

Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.


# 5206bca1 27-Apr-1999 Luoqi Chen <luoqi@FreeBSD.org>

Enable vmspace sharing on SMP. Major changes are,
- %fs register is added to trapframe and saved/restored upon kernel entry/exit.
- Per-cpu pages are no longer mapped at the same virtual address.
- Each cpu now has a separate gdt selector table. A new segment selector
is added to point to per-cpu pages, per-cpu global variables are now
accessed through this new selector (%fs). The selectors in gdt table are
rearranged for cache line optimization.
- fask_vfork is now on as default for both UP and SMP.
- Some aio code cleanup.

Reviewed by: Alan Cox <alc@cs.rice.edu>
John Dyson <dyson@iquest.net>
Julian Elischer <julian@whistel.com>
Bruce Evans <bde@zeta.org.au>
David Greenman <dg@root.com>


# db42d908 19-Apr-1999 Peter Wemm <peter@FreeBSD.org>

unifdef -DVM_STACK - it's been on for a while for x86 and was checked
and appeared to be working for the Alpha some time ago.


# a2210fe1 09-Mar-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Make TIMER_FREQ a normal, undocumented option. Raise confusion to
a higher level with example in LINT.

Clarify comment about PPS_SYNC. Ignore for now that it doesn't
work in FLL mode, it will in a few days.


# 2267af78 06-Jan-1999 Julian Elischer <julian@FreeBSD.org>

Add (but don't activate) code for a special VM option to make
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.

The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.

Submitted by: Richard Seaman <dick@tar.com>


# 9959b1a8 28-Dec-1998 Mike Smith <msmith@FreeBSD.org>

Improved DDB_UNATTENDED behaviour. From the submitter:

There's something that's been bugging me for a while, so I decided to fix it.
FreeBSD now will DTRT WRT DDB and DDB_UNATTENDED (!debugger_on_panic), at least
in my opinion. The behavior change is such that:

1. Nothing changes when debugger_on_panic != 0.
2. When DDB_UNATTENDED (!debugger_on_panic), if a panic occurs, the
machine will reboot. Also, if a trap occurs, the machine will
panic and reboot, unlike how it broke to DDB before. HOWEVER,
a trap inside DDB will not cause a panic, allowing full use
of DDB without having to worry about the machine being stuck
at a DDB prompt if something goes wrong during the day.
Patches for this behavior follow my signature, and it would
be a boon to anyone (like me) who uses DDB_UNATTENDED, but
actually wants the machine to panic on a trap (otherwise,
what's the use, if the machine causes a fatal trap rather than
a true panic, of debugger_on_panic?). The changes cause no
adverse behavior, but do involve two symbols becoming global

Submitted by: Brian Feldman <green@unixhelp.org>


# 4f2129fa 16-Dec-1998 Bruce Evans <bde@FreeBSD.org>

Removed bogus casts of USRSTACK and/or the other operand in binary
expressions involving USRSTACK.


# 2326715f 05-Dec-1998 Archie Cobbs <archie@FreeBSD.org>

Avoid compiler warning (printf arg type mismatch) when compiling #ifdef DEBUG


# 9ad861ed 02-Dec-1998 KATO Takenori <kato@FreeBSD.org>

- For some old Cyrix CPUs, %cr2 is clobbered by interrupts. This
problem is worked around by using an interrupt gate for the page
fault handler. This code was originally made for NetBSD/pc98 by
Naofumi Honda <honda@kururu.math.sci.hokudai.ac.jp> and has already
been in PC98 tree. Because of this bug, trap_fatal cannot show
correct page fault address if %cr2 is obtained in this function.
Therefore, trap_fatal uses the value from trap() function.
- The trap handler always enables interruption when buggy application
or kernel code has disabled interrupts and then trapped. This code
was prepared by Bruce Evans <bde@FreeBSD.org>.

Submitted by: Bruce Evans <bde@FreeBSD.org>
Naofumi Honda <honda@kururu.math.sci.hokudai.ac.jp>


# 1fcee469 23-Aug-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# 288078be 28-Apr-1998 Eivind Eklund <eivind@FreeBSD.org>

Translate T_PROTFLT to SIGSEGV instead of SIGBUS when running under
Linux emulation. This make Allegro Common Lisp 4.3 work under
FreeBSD!

Submitted by: Fred Gilham <gilham@csl.sri.com>
Commented on by: bde, dg, msmith, tg
Hoping he got everything right: eivind


# c1087c13 15-Apr-1998 Bruce Evans <bde@FreeBSD.org>

Support compiling with `gcc -ansi'.


# 227ee8a1 30-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 08637435 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# 640c4313 23-Mar-1998 Jonathan Lemon <jlemon@FreeBSD.org>

Add the ability to make real-mode BIOS calls from the kernel. Currently,
everything is contained inside #ifdef VM86, so this option must be
present in the config file to use this functionality.

Thanks to Tor Egge, these changes should work on SMP machines. However,
it may not be throughly SMP-safe.

Currently, the only BIOS calls made are memory-sizing routines at bootup,
these replace reading the RTC values.


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# e0d781f3 30-Jan-1998 Eivind Eklund <eivind@FreeBSD.org>

Make POWERFAIL_NMI, PPS_SYNC and NATM new style options.

This also fixes a couple of defunct options; submitted by bde.


# 2a024a2b 05-Dec-1997 Sean Eric Fagan <sef@FreeBSD.org>

Changes to allow event-based process monitoring and control.


# 4d9deedb 04-Dec-1997 John-Mark Gurney <jmg@FreeBSD.org>

document and make the NO_F00F_HACK a proper option...

also, sort some option includes while I'm here..

Forgotten by: sef


# e41b6f2d 04-Dec-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

After consultation with David, change
#ifndef NO_F00F_HACK
to
#if defined(I586_CPU) && !defined(NO_F00F_HACK)


# c4fbf277 02-Dec-1997 Sean Eric Fagan <sef@FreeBSD.org>

Work around for the Intel Pentium F00F bug; this is Intel's recommended
workaround. Note that this currently eats up two pages extra in the system;
this could be alleviated by aligning idt correctly, and then only dealing with
that (as opposed to the current method of allocated two pages and copying the
IDT table to that, and then setting that to be the IDT table).


# 21e52415 24-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Fixed some #include messes.

Hid the check of the user %cs in syscall() under `#ifdef DIAGNOSTIC'.


# cb226aaa 06-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warning, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
recompiled.


# b67dffda 09-Oct-1997 Peter Wemm <peter@FreeBSD.org>

Compensate for pcb.h tweaks.

(Bruce pointed out the nesting)


# 98823b23 10-Oct-1997 Peter Wemm <peter@FreeBSD.org>

Convert the VM86 option from a global option to an option only depended
on by the files that use it. Changing the VM86 option now only causes
a recompile of a dozen files or so rather than the entire kernel.


# 91942903 21-Sep-1997 Justin T. Gibbs <gibbs@FreeBSD.org>

autoconf.c:
Add cpu_rootconf and cpu_dumpconf so that configuring these
two devices can be better controlled by the MI configuration
code.

machdep.c:
MD initialization code for the new callout interface.

trap.c:
Add support for printing out whether cam interrupts are masked
during a panic.


# 279a6932 05-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Cosmetic adjustment for the trap/double fault/panic cpu id listing.
It now prints the apic id in hex rather than decimal.


# 5f073933 28-Aug-1997 Jonathan Lemon <jlemon@FreeBSD.org>

Remove the vm86 support as an LKM, and link it directly into the kernel
if 'options "VM86"' is in the config file. The LKM was really for
development, and has probably outlived its usefulness.


# 9a3b3e8b 26-Aug-1997 Peter Wemm <peter@FreeBSD.org>

Clean up the SMP AP bootstrap and eliminate the wretched idle procs.

- We now have enough per-cpu idle context, the real idle loop has been
revived (cpu's halt now with nothing to do).
- Some preliminary support for running some operations outside the
global lock (eg: zeroing "free but not yet zeroed pages") is present
but appears to cause problems. Off by default.
- the smp_active sysctl now behaves differently. It's merely a 'true/false'
option. Setting smp_active to zero causes the AP's to halt in the idle
loop and stop scheduling processes.
- bootstrap is a lot safer. Instead of sharing a statically compiled in
stack a number of times (which has caused lots of problems) and then
abandoning it, we use the idle context to boot the AP's directly. This
should help >2 cpu support since the bootlock stuff was in doubt.
- print physical apic id in traps.. helps identify private pages getting
out of sync. (You don't want to know how much hair I tore out with this!)

More cleanup to follow, this is more of a checkpoint than a
'finished' thing.


# 40d50994 21-Aug-1997 Philippe Charnier <charnier@FreeBSD.org>

Revert my previous commit about using CS_SECURE macro.
Requested by: Bruce.


# 7b185ef8 19-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Preperation for moving cpl into critical region access.
Several new fine-grained locks.
New FAST_INTR() methods:
- separate simplelock for FAST_INTR, no more giant lock.
- FAST_INTR()s no longer checks ipending on way out of ISR.
sio made MP-safe (I hope).


# 15f35491 18-Aug-1997 Philippe Charnier <charnier@FreeBSD.org>

Use CS_SECURE macro.
Reviewed by: John Dyson


# 0b6e0f74 12-Aug-1997 John Dyson <dyson@FreeBSD.org>

Back out a part of the disk scheduling "improvements" :-(. Let me know
how the system works now!!!


# c0ecffb9 09-Aug-1997 John Dyson <dyson@FreeBSD.org>

Modify the scheduling policy to take into account disk I/O waits
as chargeable CPU usage. This should mitigate the problem of processes
doing disk I/O hogging the CPU. Various users have reported the
problem, and test code shows that the problem should now be gone.


# 48a09cf2 08-Aug-1997 John Dyson <dyson@FreeBSD.org>

VM86 kernel support.
Work done by BSDI, Jonathan Lemon <jlemon@americantv.com>,
Mike Smith <msmith@gsoft.com.au>, Sean Eric Fagan <sef@kithrup.com>,
and probably alot of others.
Submitted by: Jnathan Lemon <jlemon@americantv.com>


# e31521c3 20-Jul-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# b3196e4b 22-Jun-1997 Peter Wemm <peter@FreeBSD.org>

Preliminary support for per-cpu data pages.

This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])

There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.

This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.

Reviewed by: fsmp, dyson


# 7b3c8424 06-Jun-1997 Bruce Evans <bde@FreeBSD.org>

Preserve %fs and %gs across context switches. This has a relatively low
cost since it is only done in cpu_switch(), not for every exception.
The extra state is kept in the pcb, and handled much like the npx state,
with similar deficiencies (the state is not preserved across signal
handlers, and error handling loses state).


# 68352337 02-Jun-1997 Doug Rabson <dfr@FreeBSD.org>

Move interrupt handling code from isa.c to a new file. This should make
isa.c (slightly) more portable and will make my life developing the really
portable version much easier.

Reviewed by: peter, fsmp


# 5400ed3b 31-May-1997 Peter Wemm <peter@FreeBSD.org>

Include file updates.. <machine/spl.h> -> <machine/ipl.h>, add
<machine/ipl.h> to those files that were depending on getting SWI_*
implicitly via <machine/cpufunc.h>


# ae924961 28-May-1997 Peter Wemm <peter@FreeBSD.org>

remove opt_smp.h and fix the reason it was needed.


# 835834c0 07-May-1997 Peter Wemm <peter@FreeBSD.org>

md_regs is now a struct trapframe *


# b332d9a6 04-May-1997 John Dyson <dyson@FreeBSD.org>

Make sure that *fork() always returns with %edx == 1 in the
child. This was sometimes not happening correctly during my
threads code work.


# 477a642c 26-Apr-1997 Peter Wemm <peter@FreeBSD.org>

Man the liferafts! Here comes the long awaited SMP -> -current merge!

There are various options documented in i386/conf/LINT, there is more to
come over the next few days.

The kernel should run pretty much "as before" without the options to
activate SMP mode.

There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.

This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!


# 58611a61 14-Apr-1997 Bruce Evans <bde@FreeBSD.org>

Fixed printing of registers in dbflalt_handler(). The registers
were always in a tss; that tss just changed from the one in the
pcb to common_tss (who knows where it was when there was no curpcb?).
Not using the pcb also fixed the problem that there is no pcb in
idle(), so we now always get useful register values.


# a2a1c95c 07-Apr-1997 Peter Wemm <peter@FreeBSD.org>

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler since than swithching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplity the pmap code to manage process contexts, we no longer have to
double map the UPAGES, this simplifies and should measuably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


# 271b264e 07-Apr-1997 Peter Wemm <peter@FreeBSD.org>

No longer use an i386tss as the basis of our pcb - it wasn't particularly
convenient and makes life difficult for my next commit. We still need
an i386tss to point to for the tss slot in the gdt, so we use a common
tss shared between all processes.

Note that this is going to break debugging until this series of commits
is finished. core dumps will change again too. :-( we really need
a more modern core dump format that doesn't depend on the pcb/upages.

This change makes VM86 mode harder, but the following commits will remove
a lot of constraints for the VM86 system, including the possibility of
extending the pcb for an IO port map etc.

Obtained from: bde


# a04c970a 05-Apr-1997 John Dyson <dyson@FreeBSD.org>

Fix the gdb executable modify problem. Thanks to the detective work
by Alan Cox <alc@cs.rice.edu>, and his description of the problem.

The bug was primarily in procfs_mem, but the mistake likely happened
due to the lack of vm system support for the operation. I added
better support for selective marking of page dirty flags so that
vm_map_pageable(wiring) will not cause this problem again.

The code in procfs_mem is now less bogus (but maybe still a little
so.)


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 7e64cb7a 22-Jan-1997 John Dyson <dyson@FreeBSD.org>

Remove some dead code from trapwrite.
Submitted by: Stephen McKay <syssgm@devetir.qld.gov.au>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 959c0278 18-Dec-1996 Bruce Evans <bde@FreeBSD.org>

Only handle copyin/out/etc faults when not in an interrupt handler.
This makes unexpected faults (in an interrupt handler) more likely
to crash properly. It could be done even better (more robustly and
more efficiently) using lazy fault handling.


# f313170d 10-Sep-1996 Bruce Evans <bde@FreeBSD.org>

Updated #includes to 4.4Lite style.


# eaed8903 01-Sep-1996 David Greenman <dg@FreeBSD.org>

Change an splclock that needs to be an splhigh into an splhigh.

Reviewed by: bde


# 11282a57 11-Aug-1996 David Greenman <dg@FreeBSD.org>

Add support for i686 machine check trap.


# f3460ead 12-Jul-1996 Bruce Evans <bde@FreeBSD.org>

Fixed cloned comments about npx traps to match context.


# 79df6d85 25-Jun-1996 Bruce Evans <bde@FreeBSD.org>

trap.c:
Fixed profiling of system times. It was pre-4.4Lite and didn't support
statclocks. System times were too small by a factor of 8.

Handle deferred profiling ticks the 4.4Lite way: use addupc_task() instead
of addupc(). Call addupc_task() directly instead of using the ADDUPC()
macro.

Removed vestigial support for PROFTIMER.

switch.s:
Removed addupc().

resourcevar.h:
Removed ADDUPC() and declarations of addupc().

cpu.h:
Updated a comment. i386's never were tahoe's, and the deferred profiling
tick became (possibly) multiple ticks in 4.4Lite.

Obtained from: mostly from NetBSD


# d7629dff 13-Jun-1996 Satoshi Asami <asami@FreeBSD.org>

A fast memory copy for Pentiums using floating point registers.
It is called from copyin and copyout.

The new routine is conditioned on I586_CPU and I586_FAST_BCOPY, so you
need

options "I586_FAST_BCOPY"

(quotes essenstial) in your kernel config file.

Also, if you have other kernel types configured in your kernel, an
additional check to make sure it is running on a Pentium is inserted.
(It is not clear why it doesn't help on P6s, it may be just that the
Orion chipset doesn't prefetch as efficiently as Tritons and friends.)

Bruce can now hack this away. :)


# c23670e2 11-Jun-1996 Gary Palmer <gpalmer@FreeBSD.org>

Clean up -Wunused warnings.

Reviewed by: bde


# b18bfc3d 17-May-1996 John Dyson <dyson@FreeBSD.org>

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# 4e489ec4 27-Mar-1996 John Dyson <dyson@FreeBSD.org>

Remove a now unnecessary prototype from pmap.c. Also remove now
unnecessary vm_fault's of page table pages in trap.c.


# ba00d77a 27-Mar-1996 Bruce Evans <bde@FreeBSD.org>

Print stack pointer and frame pointer in trap messages.

Fixed "trace/trap" message.

Reviewed by: davidg


# d66a5066 02-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Mega-commit for Linux emulator update.. This has been stress tested under
netscape-2.0 for Linux running all the Java stuff. The scrollbars are now
working, at least on my machine. (whew! :-)

I'm uncomfortable with the size of this commit, but it's too
inter-dependant to easily seperate out.

The main changes:

COMPAT_LINUX is *GONE*. Most of the code has been moved out of the i386
machine dependent section into the linux emulator itself. The int 0x80
syscall code was almost identical to the lcall 7,0 code and a minor tweak
allows them to both be used with the same C code. All kernels can now
just modload the lkm and it'll DTRT without having to rebuild the kernel
first. Like IBCS2, you can statically compile it in with "options LINUX".

A pile of new syscalls implemented, including getdents(), llseek(),
readv(), writev(), msync(), personality(). The Linux-ELF libraries want
to use some of these.

linux_select() now obeys Linux semantics, ie: returns the time remaining
of the timeout value rather than leaving it the original value.

Quite a few bugs removed, including incorrect arguments being used in
syscalls.. eg: mixups between passing the sigset as an int, vs passing
it as a pointer and doing a copyin(), missing return values, unhandled
cases, SIOC* ioctls, etc.

The build for the code has changed. i386/conf/files now knows how
to build linux_genassym and generate linux_assym.h on the fly.

Supporting changes elsewhere in the kernel:

The user-mode signal trampoline has moved from the U area to immediately
below the top of the stack (below PS_STRINGS). This allows the different
binary emulations to have their own signal trampoline code (which gets rid
of the hardwired syscall 103 (sigreturn on BSD, syslog on Linux)) and so
that the emulator can provide the exact "struct sigcontext *" argument to
the program's signal handlers.

The sigstack's "ss_flags" now uses SS_DISABLE and SS_ONSTACK flags, which
have the same values as the re-used SA_DISABLE and SA_ONSTACK which are
intended for sigaction only. This enables the support of a SA_RESETHAND
flag to sigaction to implement the gross SYSV and Linux SA_ONESHOT signal
semantics where the signal handler is reset when it's triggered.

makesyscalls.sh no longer appends the struct sysentvec on the end of the
generated init_sysent.c code. It's a lot saner to have it in a seperate
file rather than trying to update the structure inside the awk script. :-)

At exec time, the dozen bytes or so of signal trampoline code are copied
to the top of the user's stack, rather than obtaining the trampoline code
the old way by getting a clone of the parent's user area. This allows
Linux and native binaries to freely exec each other without getting
trampolines mixed up.


# 3eb77c83 24-Feb-1996 John Dyson <dyson@FreeBSD.org>

Fix a problem with tracking the modified bit. Eliminate the
ugly inline-asm code, and speed up the page-table-page tracking.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 0e41ee30 04-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert DDB to new-style option.


# db6a20e2 03-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Converted two options over to the new scheme: USER_LDT and KTRACE.


# c9681930 19-Dec-1995 David Greenman <dg@FreeBSD.org>

Corrected a typo in a comment.


# 2838c968 19-Dec-1995 David Greenman <dg@FreeBSD.org>

Implemented a (sorely needed for years) double fault handler to catch stack
overflows.
It sure would be nice if there was an unmapped page between the PCB and
the stack (and that the size of the stack was configurable!). With the
way things are now, the PCB will get clobbered before the double fault
handler gets control, making somewhat of a mess of things. Despite this,
it is still fairly easy to poke around in the overflowed stack to figure
out the cause.


# b1529bda 14-Dec-1995 Peter Wemm <peter@FreeBSD.org>

GENERIC/LINT: Remove redundant quoting on some option lines.
LINT: add a couple of new/missing/undocumented options
files.i386: add linux code so that you can compile a kernel with static
linux emulation ("options LINUX")
i386/*: use #if defined(COMPAT_LINUX) || defined(LINUX) to enable static
support of linux emulation (just like "IBCS2" makes ibcs2 static)

The main thing this is going to make obvious, is that the LINUX code
(when compiled from LINT) has a lot of warnings, some of which dont look
too pleasant..


# 5e463408 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Make math_emulators LKMable.


# 7dfe504f 09-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove various unused symbols and procedures.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# 4ccc87c5 28-Oct-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused functions and variables, make things static, and other cleanups.


# 029b0fc8 08-Oct-1995 Bruce Evans <bde@FreeBSD.org>

Fix tracing of syscalls. The previous fix required the undocumented
option DDB_NO_LCALLS to stop ddb getting control and broke all ddb
tracing. Now there is no option and no way for ddb to trace at
address _Xsyscall or to _Xsyscall, but tracing everywhere else
works. The previous fix did unnecessary things for Linux syscalls.

Don't bother checking that syscall frames are for user mode.

Make debugger traps inside the kernel (except at addresses _Xsyscall
and _Xsyscall+1) fatal if ddb is not configured. They "can't happen".

Add prototypes.

Remove stupid comments, e.g., /*ARGSUSED*/ for args that are used.


# 00c6cada 04-Oct-1995 Julian Elischer <julian@FreeBSD.org>

Submitted by: Juergen Lock <nox@jelal.hb.north.de>
Obtained from: other people on the net ?

1. stepping over syscalls (gdb ni) sends you to DDB, and returned
to the wrong address afterwards, with or without DDB. patch in
i386/i386/trap.c below.

2. the linux emulator (modload'ed) still causes panics with DIAGNOSTIC,
re-applied a patch posted to one of the lists...


# 4219d2b2 21-Aug-1995 David Greenman <dg@FreeBSD.org>

A couple of micro optimizations to improve NULL syscall performance by
about 2%.


# a705fd3e 30-Jul-1995 David Greenman <dg@FreeBSD.org>

Fix a bug in my disabled version of trap_pfault()...curpcb may be NULL even
when curproc isn't. This condition occurs at system startup and perhaps
at other times.


# 1174c7d1 16-Jul-1995 Peter Wemm <peter@FreeBSD.org>

This fixes a compiler warning, and a cosmetic problem with the linux
emul code when compiling with "options KTRACE".
ktrsyscall() was expecting an array of integers, this was passing the
address of a structure containing an array of integers..
The cosmetic problem was that it was calling the "enter syscall"
trace hook twice - this looks like a cut/paste error/typo.


# 446cee6e 16-Jul-1995 Joerg Wunsch <joerg@FreeBSD.org>

Include ``options POWERFAIL_NMI'' for owners of older (non-apm)
notebooks where a powerfail condition (external power drop; battery
state low) is signalled by an NMI. Makes it beep instead of panicing.

Reviewed by: davidg


# 9e951f36 15-Jul-1995 David Greenman <dg@FreeBSD.org>

Truncate the fault address to a page boundry when calling vm_fault(). The
last change to fix the fault-twice bug with page tables wasn't quite
complete.


# 4a67eb71 14-Jul-1995 David Greenman <dg@FreeBSD.org>

Fixed bug that caused page tables to be faulted twice instead of once.

Submitted by: John Dyson


# d3628763 11-Jun-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Merge RELENG_2_0_5 into HEAD


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# f550a707 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Added a new version of trap_pfault() that disallows kernel page faults
to the user address space unless pcb_onfault is set. The code is currently
commented out because iBCS2 and process debugging parts of the kernel
need to be changed/fixed first.


# c6d5f3ac 21-Mar-1995 David Greenman <dg@FreeBSD.org>

Changed some #ifdef DIAGNOSTIC code that I added to be #ifdef DEBUG.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 1e1e0b44 14-Feb-1995 Søren Schmidt <sos@FreeBSD.org>

First attempt to run linux binaries. This is only the changes needed to
the generic kernel. The actual emulator is a separate LKM. (not finished
yet, sorry).
Submitted by: sos@freebsd.org & sef@kithrup.com


# b1e4a738 09-Feb-1995 David Greenman <dg@FreeBSD.org>

Removed unnecessary check for pr_scale in the AST/OWEUPC case.


# 5a32829d 09-Feb-1995 David Greenman <dg@FreeBSD.org>

Check P_PROFIL flag for profiling rather than pr_scale as it makes more
sense.


# fbdfe8ac 24-Jan-1995 David Greenman <dg@FreeBSD.org>

Changed buffer allocation policy (machdep.c)
Moved various pmap 'bit' test/set functions back into real functions; gcc
generates better code at the expense of more of it. (pmap.c)
Fixed a deadlock problem with pv entry allocations (pmap.c)
Added a new, optional function 'pmap_prefault' that does clustered page
table preloading (pmap.c)
Changed the way that page tables are held onto (trap.c).

Submitted by: John Dyson


# 20415301 14-Jan-1995 Bruce Evans <bde@FreeBSD.org>

Fix security holes in sigreturn(), ptrace() and procfs. sigreturn()
attempted to check for insecure and fatal eflags and segment
selectors, but missed many cases and got the IOPL check back to
front. The other syscalls didn't check at all.

sys_process.c, machdep.c:
Only allow PT_WRITE_U to write to the registers (ordinary and FP).

psl.h, locore.s, machdep.c:
Eliminate PSL_MBZ, PSL_MBO and PSL_USERCLR. We are not supposed
to assume anything about the reserved bits. Use PSL_USERCHANGE
and PSL_KERNEL instead. Rename PSL_USERSET to PSL_USER.

exception.s:
Define a private label for use by doreti when returning to user
mode fails.

machdep.c:
In syscalls, allow changing only the eflags that can be changed on
486's in user mode (no longer attempt to allow benign IOPL changes;
allow changing the nasty PSL_NT; don't allow changing the i586
bits).

Don't attempt to check all the cases involving invalid selectors
and %eip's. Just check for privilege violations and let the invalid
things cause a trap.

procfs_machdep.c:
Call the ptrace register functions to do all the work for reading
and writing ordinary registers and for single stepping.

trap.c:
Ignore traps caused by PSL_NT being set. Previously, users could
cause a fatal trap in user mode by setting PSL_NT and executing an
iret, and a fatal trap in kernel mode by setting PSL_NT and making
a syscall. PSL_NT was cleared too late and not in enough modes to
fix the problem.

Make all traps in user mode (except T_NMI) nonfatal.

Recover from traps caused by attempting to load invalid user
registers in doreti by restarting the traps so that they appear to
occur in user mode.
---

Fix bogons that I noticed while fixing the above:

psl.h:
Fix some comments.

Uniformize idempotency ifdef.

exception.s, machdep.c:
Remove rsvd[0-14]. rsvd0 hasn't been reserved since the 486 came
out. Replace rsvd0 by `align'. rsvd[0-11] used wrong (magic
non-unique) trap numbers. Replace rsvd[1-14] by rsvd.

locore.s:
Enable alignment check flag on 486's and 586's.

machdep.c:
Use a better type for kstack[].

Use TFREGP() to find the registers.

Reformat ptrace functions from SEF to something closer to KNF.

procfs_machdep.c:
The wrong pointer to the registers got fixed as a side effect.

Implement reading and writing of FP registers.

/proc/*/*regs now work (only) for processes that are in memory.

Clean up comments.

trap.c, trap.h:
Remove unused trap types.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 13d7c724 24-Dec-1994 Bruce Evans <bde@FreeBSD.org>

Obtained from: 1.1.5

Fix single-stepping of emulated FPU instructions.

Don't panic if an FPU instruction is attempted but there is no FPU
and no FPU emulator is configured.


# ab4bc4b2 30-Oct-1994 Bruce Evans <bde@FreeBSD.org>

Fix selector arg to match the (missing) prototype for sdtossd().
Cosmetic.

Return from trap() if trap_fatal() returns. trap_fatal() isn't
fatal if you have ddb. Returning from trap() is usually the right
thing to do and much better than falling through.


# 09f7992a 20-Oct-1994 Garrett Wollman <wollman@FreeBSD.org>

Make my ALLDEVS kernel compile (basically, LINT minus a lot of options).


# fabbd9b7 11-Oct-1994 Søren Schmidt <sos@FreeBSD.org>

Ouch, fixed bug in errno translation (ibcs2 support).


# 76d121f2 10-Oct-1994 Søren Schmidt <sos@FreeBSD.org>

Hmm, only translate errno when doing an actual return.

Reviewed by: sef@freefall.cdrom.com


# c96f1293 09-Oct-1994 Søren Schmidt <sos@FreeBSD.org>

Updated to convert errno return in syscall if conversion tabel present.


# 3fb3086e 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

db_disasm.c: Unused var zapped.
pmap.c: tons of unused vars zapped, various other warnings silenced.
trap.c: unused vars zapped.
vm_machdep.c: A wrong argument, which by chance did the right thing, was
corrected.


# 22414e53 30-Sep-1994 David Greenman <dg@FreeBSD.org>

Laptop Advanced Power Management support by HOSOKAWA Tatsumi.

Submitted by: HOSOKAWA Tatsumi


# f7d6afc6 11-Sep-1994 David Greenman <dg@FreeBSD.org>

Be more careful about dereferencing curproc, p_vmspace, and curpcb,
otherwise the machine will overflow the stack in a recursive fault loop
(causing the machine to spontaneously reboot because of the stack fault
that ultimately happens).

Submitted by: Inspired by Bruce Evans, but this change is different
than what he suggested.


# fe7bb84c 08-Sep-1994 Bruce Evans <bde@FreeBSD.org>

Remove <machine/eflags.h> and all dependencies on it. eflags.h is just
the Mach/i386 version of the BSD/vax(?) <machine/psl.h>. The Mach
version has slightly better names for many macros but is now out of
date and little used. It was originally used even less (for spelling
PSL_T as EFL_TF in <machine/db_machdep.h>).


# 406d45e0 28-Aug-1994 Bruce Evans <bde@FreeBSD.org>

Don't test if a u_int is < 0. The remaining test is sufficient and the
extra one caused a warning.


# 8a129cae 27-Aug-1994 David Greenman <dg@FreeBSD.org>

1) Changed ddb into a option rather than a pseudo-device (use options DDB
in your kernel config now).
2) Added ps ddb function from 1.1.5. Cleaned it up a bit and moved into its
own file.
3) Added \r handing in db_printf.
4) Added missing memory usage stats to statclock().
5) Added dummy function to pseudo_set so it will be emitted if there
are no other pseudo declarations.


# f3f0ca60 24-Aug-1994 Søren Schmidt <sos@FreeBSD.org>

Changes preparing for iBCS support
Reviewed by:
Submitted by:


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 5c8b38d4 09-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Handle NMI's in accordance with data in van Gilluwe book.


# 03e6c253 01-Aug-1994 David Greenman <dg@FreeBSD.org>

Removed all code related to the pagescan daemon, and changed 'act_count'
adjustments to compensate for a world without the pagescan daemon.


# 8c481329 10-Jun-1994 David Greenman <dg@FreeBSD.org>

Fixed minor spelling error.


# 3c256f53 06-Jun-1994 David Greenman <dg@FreeBSD.org>

trap.c:
Vastly improved trap.c from me. This rewritten version has a variety of
features, amoung them: higher performance and much higher code quality.

support.s, cpufunc.h:
No longer use gs override to enforce range limits - compare directly
against VM_MAXUSER_ADDRESS instead. The old way caused problems in
preserving the gs selector...and this method is just as fast or faster.


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 9f82ad3d 29-Apr-1994 Gary Clark II <gclarkii@FreeBSD.org>

Added ifdef for GPL_MATH_EMULATE to keep the sytem from panicing when
using it.


# 28626748 07-Apr-1994 David Greenman <dg@FreeBSD.org>

Make Bruce happy: silently enter ddb on a BPT or trace trap if ddb is
configured in the kernel.


# d2306226 02-Apr-1994 David Greenman <dg@FreeBSD.org>

New interrupt code from Bruce Evans. In additional to Bruce's attached
list of changes, I've made the following additional changes:

1) i386/include/ipl.h renamed to spl.h as the name conflicts with the
file of the same name in i386/isa/ipl.h.
2) changed all use of *mask (i.e. netmask, biomask, ttymask, etc) to
*_imask (net_imask, etc).
3) changed vestige of splnet use in if_is to splimp.
4) got rid of "impmask" completely (Bruce had gotten rid of netmask),
and are now using net_imask instead.
5) dozens of minor cruft to glue in Bruce's changes.

These require changes I made to config(8) as well, and thus it must
be rebuilt.

-DG

from Bruce Evans:

sio:
o No diff is supplied. Remove the define of setsofttty(). I hope
that is enough.

*.s:
o i386/isa/debug.h no longer exists. The event counters became too
much trouble to maintain. All function call entry and exception
entry counters can be recovered by using profiling kernel (the new
profiling supports all entry points; however, it is too slow to
leave enabled all the time; it also). Only BDBTRAP() from debug.h
is now used. That is moved to exception.s. It might be worth
preserving SHOW_BITS() and calling it from _mcount() (if enabled).
o T_ASTFLT is now only set just before calling trap().
o All exception handlers set SWI_AST_MASK in cpl as soon as possible
after entry and arrange for _doreti to restore it atomically with
exiting. It is not possible to set it atomically with entering
the kernel, so it must be checked against the user mode bits in
the trap frame before committing to using it. There is no place
to store the old value of cpl for syscalls or traps, so there are
some complications restoring it.

Profiling stuff (mostly in *.s):
o Changes to kern/subr_mcount.c, gcc and gprof are not supplied yet.
o All interesting labels `foo' are renamed `_foo' and all
uninteresting labels `_bar' are renamed `bar'. A small change
to gprof allows ignoring labels not starting with underscores.
o MCOUNT_LABEL() is to provide names for counters for times spent
in exception handlers.
o FAKE_MCOUNT() is a version of MCOUNT() suitable for exception
handlers. Its arg is the pc where the exception occurred. The
new mcount() pretends that this was a call from that pc to a
suitable MCOUNT_LABEL().
o MEXITCOUNT is to turn off any timer started by MCOUNT().

/usr/src/sys/i386/i386/exception.s:
o The non-BDB BPTTRAP() macros were doing a sti even when interrupts
were disabled when the trap occurred. The sti (fixed) sti is
actually a no-op unless you have my changes to machdep.c that make
the debugger trap gates interrupt gates, but fixing that would
make the ifdefs messier. ddb seems to be unharmed by both
interrupts always disabled and always enabled (I had the branch in
the fix back to front for some time :-().
o There is no known pushal bug.
o tf_err can be left as garbage for syscalls.

/usr/src/sys/i386/i386/locore.s:
o Fix and update BDE_DEBUGGER support.
o ENTRY(btext) before initialization was dangerous.
o Warm boot shot was longer than intended.

/usr/src/sys/i386/i386/machdep.c:
o DON'T APPLY ALL OF THIS DIFF. It's what I'm using, but may require
other changes.
Use the following:
o Remove aston() and setsoftclock().
Maybe use the following:
o No netisr.h.
o Spelling fix.
o Delay to read the Rebooting message.
o Fix for vm system unmapping a reduced area of memory
after bounds_check_with_label() reduces the size of
a physical i/o for a partition boundary. A similar
fix is required in kern_physio.c.
o Correct use of __CONCAT. It never worked here for non-
ANSI cpp's. Is it time to drop support for non-ANSI?
o gdt_segs init. 0xffffffffUL is bogus because ssd_limit
is not 32 bits. The replacement may have the same
value :-), but is more natural.
o physmem was one page too low. Confusing variable names.
Don't use the following:
o Better numbers of buffers. Each 8K page requires up to
16 buffer headers. On my system, this results in 5576
buffers containing [up to] 2854912 bytes of memory.
The usual allocation of about 384 buffers only holds
192K of disk if you use it on an fs with a block size
of 512.
o gdt changes for bdb.
o *TGT -> *IDT changes for bdb.
o #ifdefed changes for bdb.

/usr/src/sys/i386/i386/microtime.s:
o Use the correct asm macros. I think asm.h was copied from Mach
just for microtime and isn't used now. It certainly doesn't
belong in <sys>. Various macros are also duplicated in
sys/i386/boot.h and libc/i386/*.h.
o Don't switch to and from the IRR; it is guaranteed to be selected
(default after ICU init and explicitly selected in isa.c too, and
never changed until the old microtime clobbered it).

/usr/src/sys/i386/i386/support.s:
o Non-essential changes (none related to spls or profiling).
o Removed slow loads of %gs again. The LDT support may require
not relying on %gs, but loading it is not the way to fix it!
Some places (copyin ...) forgot to load it. Loading it clobbers
the user %gs. trap() still loads it after certain types of
faults so that fuword() etc can rely on it without loading it
explicitly. Exception handlers don't restore it. If we want
to preserve the user %gs, then the fastest method is to not
touch it except for context switches. Comparing with
VM_MAXUSER_ADDRESS and branching takes only 2 or 4 cycles on
a 486, while loading %gs takes 9 cycles and using it takes
another.
o Fixed a signed branch to unsigned.

/usr/src/sys/i386/i386/swtch.s:
o Move spl0() outside of idle loop.
o Remove cli/sti from idle loop. sw1 does a cli, and in the
unlikely event of an interrupt occurring and whichqs becoming
zero, sw1 will just jump back to _idle.
o There's no spl0() function in asm any more, so use splz().
o swtch() doesn't need to be superaligned, at least with the
new mcounting.
o Fixed a signed branch to unsigned.
o Removed astoff().

/usr/src/sys/i386/i386/trap.c:
o The decentralized extern decls were inconsistent, of course.
o Fixed typo MATH_EMULTATE in comments. */
o Removed unused variables.
o Old netmask is now impmask; print it instead. Perhaps we
should print some of the new masks.
o BTW, trap() should not print anything for normal debugger
traps.

/usr/src/sys/i386/include/asmacros.h:
o DON'T APPLY ALL OF THIS DIFF. Just use some of the null macros
as necessary.

/usr/src/sys/i386/include/cpu.h:
o CLKF_BASEPRI() changes since cpl == SWI_AST_MASK is now normal
while the kernel is running.
o Don't use var++ to set boolean variables. It fails after a mere
4G times :-) and is slower than storing a constant on [3-4]86s.

/usr/src/sys/i386/include/cpufunc.h:
o DON'T APPLY ALL OF THIS DIFF. You need mainly the include of
<machine/ipl.h>. Unfortunately, <machine/ipl.h> is needed by
almost everything for the inlines.

/usr/src/sys/i386/include/ipl.h:
o New file. Defines spl inlines and SWI macros and declares most
variables related to hard and soft interrupt masks.

/usr/src/sys/i386/isa/icu.h:
o Moved definitions to <machine/ipl.h>

/usr/src/sys/i386/isa/icu.s:
o Software interrupts (SWIs) and delayed hardware interrupts (HWIs)
are now handled uniformally, and dispatching them from splx() is
more like dispatching them from _doreti. The dispatcher is
essentially *(handler[ffs(ipending & ~cpl)]().
o More care (not quite enough) is taken to avoid unbounded nesting
of interrupts.
o The interface to softclock() is changed so that a trap frame is
not required.
o Fast interrupt handlers are now handled more uniformally.
Configuration is still too early (new handlers would require
bits in <machine/ipl.h> and functions to vector.s).
o splnnn() and splx() are no longer here; they are inline functions
(could be macros for other compilers). splz() is the nontrivial
part of the old splx().

/usr/src/sys/i386/isa/ipl.h
o New file. Supposed to have only bus-dependent stuff. Perhaps
the h/w masks should be declared here.

/usr/src/sys/i386/isa/isa.c:
o DON'T APPLY ALL OF THIS DIFF. You need only things involving
*mask and *MASK and comments about them. netmask is now a pure
software mask. It works like the softclock mask.

/usr/src/sys/i386/isa/vector.s:
o Reorganize AUTO_EOI* macros.
o Option FAST_INTR_HANDLER_USERS_ES for people who don't trust
fastintr handlers.
o fastintr handlers need to metamorphose into ordinary interrupt
handlers if their SWI bit has become set. Previously, sio had
unintended latency for handling output completions and input
of SLIP framing characters because this was not done.

/usr/src/sys/net/netisr.h:
o The machine-dependent stuff is now imported from <machine/ipl.h>.

/usr/src/sys/sys/systm.h
o DON'T APPLY ALL OF THIS DIFF. You need mainly the different
splx() prototype. The spl*() prototypes are duplicated as
inlines in <machine/ipl.h> but they need to be duplicated here
in case there are no inlines. I sent systm.h and cpufunc.h
to Garrett. We agree that spl0 should be replaced by splnone
and not the other way around like I've done.

/usr/src/sys/kern/kern_clock.c
o splsoftclock() now lowers cpl so the direct call to softclock()
works as intended.
o softclock() interface changed to avoid passing the whole frame
(some machines may need another change for profile_tick()).
o profiling renamed _profiling to avoid ANSI namespace pollution.
(I had to improve the mcount() interface and may as well fix it.)
The GUPROF variant doesn't actually reference profiling here,
but the 'U' in GUPROF should mean to select the microtimer
mcount() and not change the interface.


# ed7fcbd0 24-Mar-1994 David Greenman <dg@FreeBSD.org>

From John Dyson: performance improvements to the new bounce buffer
code.


# 943a66f3 14-Mar-1994 David Greenman <dg@FreeBSD.org>

Performance improvements from John Dyson.

1) A new mechanism has been added to prevent pages from being paged
out called "vm_page_hold". Similar to vm_page_wire, but
much lower overhead.
2) Scheduling algorithm has been changed to improve interactive
performance.
3) Paging algorithm improved.
4) Some vnode and swap pager bugs fixed.


# 04f18356 07-Mar-1994 David Greenman <dg@FreeBSD.org>

1) "Pre-faulting" in of pages into process address space
Eliminates vm_fault overhead on process startup and
mmap referenced data for in-memory pages.

(process startup time using in-memory segments *much* faster)

2) Even more efficient pmap code. Code partially cleaned up.
More comments yet to follow.

(generally more efficient pte management)

3) Pageout clustering ( in addition to the FreeBSD V1.1 pagein
clustering.)

(much faster paging performance on non-write behind disk
subsystems, slightly faster performance on other systems.)

4) Slightly changed vm_pageout code for more efficiency and
better statistics. Also, resist swapout a little more.

(less likely to pageout a recently used page)

5) Slight improvement to the page table page trap efficiency.

(generally faster system VM fault performance)

6) Defer creation of unnamed anonymous regions pager until needed.

(speeds up shared memory bss creation)

7) Remove possible deadlock from swap_pager initialization.

8) Enhanced procfs to provide "vminfo" about vm objects and user
pmaps.

9) Increased MCLSHIFT/MCLBYTES from 2K to 4K to improve net &
socket performance and to prepare for things to come.

John Dyson
dyson@implode.root.com
David Greenman
davidg@root.com


# b9d60b3f 08-Feb-1994 David Greenman <dg@FreeBSD.org>

Fixed bugs in stack grow code, and moved it back into a seperate function
like it was originally. Also added back call to "grow" in sendsig now
that this routine actually works.


# 0172c219 01-Feb-1994 David Greenman <dg@FreeBSD.org>

Minor cleanup. Decode state information better in the case of a fatal
trap.


# d64f660f 17-Jan-1994 David Greenman <dg@FreeBSD.org>

Improvements mostly from John Dyson, with a little bit from me.

* Removed pmap_is_wired
* added extra cli/sti protection in idle (swtch.s)
* slight code improvement in trap.c
* added lots of comments
* improved paging and other algorithms in VM system


# 7f8cb368 14-Jan-1994 David Greenman <dg@FreeBSD.org>

"New" VM system from John Dyson & myself. For a run-down of the
major changes, see the log of any effected file in the sys/vm
directory (swap_pager.c for instance).


# c8a13ecd 03-Jan-1994 David Greenman <dg@FreeBSD.org>

Convert syscall to trapframe. Based on work done by John Brezak.


# aaf08d94 18-Dec-1993 Garrett Wollman <wollman@FreeBSD.org>

Make everything compile with -Wtraditional. Make it easier to distribute
a binary link-kit. Make all non-optional options (pagers, procfs) standard,
and update LINT to reflect new symtab requirements.

NB: -Wtraditional will henceforth be forgotten. This editing pass was
primarily intended to detect any constructions where the old code might
have been relying on traditional C semantics or syntax. These were all
fixed, and the result of fixing some of them means that -Wall is now a
realistic possibility within a few weeks.


# 6d01f02e 11-Dec-1993 David Greenman <dg@FreeBSD.org>

1) Added proc file system from Paul Kranenburg with changes from
John Dyson to make it reliably work under FreeBSD.
2) Added and enabled PROCFS in the GENERICxx and LINT kernels.
3) New execve() from me. Still work to be done here, but this version
works well and is needed before other changes can be made. For
a description of the design behind this, see freebsd-arch or
ask me.
4) Rewrote stack fault code; made user stack VM grow as needed rather
than all up front; improves performance a little and reduces
process memory requirements.
5) Incorporated fix from Gene Stark to fault/wire a user page table
page to fix a problem in copyout. This is a temporary fix and
is not appropriate for pageable page tables. For a description
of the problem, see Gene's post to the freebsd-hackers mailing
list.
6) Tighten up vm_page struct to reduce memory requirements for it. ifdef
pager page lock code as it's not being used currently.
7) Introduced new element to vmspace struct - vm_minsaddr; initial
(minimum) stack address. Compliment to vm_maxsaddr.
8) Added a panic if the allocation for process u-pages fails.
9) Improve performance and accuracy of kernel profiling by putting in
a little inline assembly instead of spl().
10) Made serial console with sio driver work. Still has problems with
serial input, but is almost useable.
11) Added -Bstatic to SYSTEM_LD in Makefile.i386 so that kernels will
build properly with the new ld.


# 05e634ef 02-Dec-1993 Andrew Moore <alm@FreeBSD.org>

From: Jeffrey Hsu <hsu@soda.berkeley.edu>

The following patch adds the addr argument to signal handlers.

The kernel with the patch is no more and no less in compliance or in
violation of POSIX and ANSI C than the kernel before the patch.

The added functionality this addr argument provides is quite useful. It
enables an entire class of algorithms which use mprotect to trace memory
references. Beside garbage collectors, I have heard of this technique being
applied to debuggers and profilers. The only benchmarking I've performed is
using akcl to compile maxima: without the kernel patch, it takes 7 hours to
compile maxima, while with stratified garbage collection, it only takes 50
minutes.

Basically, I can't think of a reason not to add the addr argument and there
is a compelling need for it.

If you find the patch acceptable, please let me know so I can send my
FreeBSD akcl config files to wfs for inclusion in the core akcl release.
The old 386BSD config files there won't work on either NetBSD or FreeBSD.


# a9627169 28-Nov-1993 David Greenman <dg@FreeBSD.org>

Patch from Gene Stark:

Subject: Page fault in PTE area fails in copyout
Index: sys/i386/i386/trap.c FreeBSD-1.0.2

Description:
Reading files of several megabytes into Emacs, or many small
files all at once, would fail with "IO error - bad address".

Repeat-By:
The bug can be exercised by a test program that malloc()'s
a 5MB chunk of memory, and then, without accessing the memory
first, filling it with data from a file using read().
(I read 64k chunks from /dev/wd0d into successive 64k regions
of the 5MB chunk.) The read() will fail with EFAULT at the first
virtual address boundary that is a multiple of 0x400000.

Fix:
The problem was code in sys/i386/i386/trap.c that tries to
figure out what kind of trap occurred and to handle it appropriately.
It was interpreting any page fault with virtual address
>= vm->vm_maxsaddr as being a user stack segment fault.
In fact, addresses >= USRSTACK are in the user structure/PTE area,
and if they are handled as stack faults, the proper PTE will
not be paged in when it is supposed to be. This situation comes
up in copyout() and copyoutstr(), if PTE's are accessed for the
first time ever. The page fault on accessing the nonexistent PTE
is mishandled as a stack fault, and then the fault that occurs on
the subsequent access to the page itself causes copyout to fail
with EFAULT.


# 381fe1aa 24-Nov-1993 Garrett Wollman <wollman@FreeBSD.org>

Make the LINT kernel compile with -W -Wreturn-type -Wcomment -Werror, and
add same (sans -Werror) to Makefile for future compilations.


# 0967373e 12-Nov-1993 David Greenman <dg@FreeBSD.org>

First steps in rewriting locore.s, and making info useful
when the machine panics.

i386/i386/locore.s:
1) got rid of most .set directives that were being used like
#define's, and replaced them with appropriate #define's in
the appropriate header files (accessed via genassym).
2) added comments to header inclusions and global definitions,
and global variables
3) replaced some hardcoded constants with cpp defines (such as
PDESIZE and others)
4) aligned all comments to the same column to make them easier to
read
5) moved macro definitions for ENTRY, ALIGN, NOP, etc. to
/sys/i386/include/asmacros.h
6) added #ifdef BDE_DEBUGGER around all of Bruce's debugger code
7) added new global '_KERNend' to store last location+1 of kernel
8) cleaned up zeroing of bss so that only bss is zeroed
9) fix zeroing of page tables so that it really does zero them all
- not just if they follow the bss.
10) rewrote page table initialization code so that 1) works correctly
and 2) write protects the kernel text by default
11) properly initialize the kernel page directory, upages, p0stack PT,
and page tables. The previous scheme was more than a bit
screwy.
12) change allocation of virtual area of IO hole so that it is
fixed at KERNBASE + 0xa0000. The previous scheme put it
right after the kernel page tables and then later expected
it to be at KERNBASE +0xa0000
13) change multiple bogus settings of user read/write of various
areas of kernel VM - including the IO hole; we should never
be accessing the IO hole in user mode through the kernel
page tables
14) split kernel support routines such as bcopy, bzero, copyin,
copyout, etc. into a seperate file 'support.s'
15) split swtch and related routines into a seperate 'swtch.s'
16) split routines related to traps, syscalls, and interrupts
into a seperate file 'exception.s'
17) remove some unused global variables from locore that got
inserted by Garrett when he pulled them out of some .h
files.

i386/isa/icu.s:
1) clean up global variable declarations
2) move in declaration of astpending and netisr

i386/i386/pmap.c:
1) fix calculation of virtual_avail. It previously was calculated
to be right in the middle of the kernel page tables - not
a good place to start allocating kernel VM.
2) properly allocate kernel page dir/tables etc out of kernel map
- previously only took out 2 pages.

i386/i386/machdep.c:
1) modify boot() to print a warning that the system will reboot in
PANIC_REBOOT_WAIT_TIME amount of seconds, and let the user
abort with a key on the console. The machine will wait for
ever if a key is typed before the reboot. The default is
15 seconds, but can be set to 0 to mean don't wait at all,
-1 to mean wait forever, or any positive value to wait for
that many seconds.
2) print "Rebooting..." just before doing it.

kern/subr_prf.c:
1) remove PANICWAIT as it is deprecated by the change to machdep.c

i386/i386/trap.c:
1) add table of trap type strings and use it to print a real trap/
panic message rather than just a number. Lot's of work to
be done here, but this is the first step. Symbolic traceback
is in the TODO.

i386/i386/Makefile.i386:
1) add support in to build support.s, exception.s and swtch.s

...and various changes to various header files to make all of the
above happen.


# d19eeaea 04-Nov-1993 David Greenman <dg@FreeBSD.org>

splnone()'s in the trap code can be deadly. Save/restore previous priority
instead.


# d7354624 01-Nov-1993 Christoph Robitschko <chmr@FreeBSD.org>

Modified the "rude stack hack" that it only applies to addresses within
the stack area and not memory above VM_MAXUSER_ADDRESS.
That way, copyout and friends now work for pages whose page table entries
have not yet been allocated/been paged out.


# 960173b9 15-Oct-1993 Rodney W. Grimes <rgrimes@FreeBSD.org>

genassym.c:
Remove NKMEMCLUSTERS, it is no longer define or used.

locores.s:
Fix comment on PTDpde and APTDpde to be pde instead of pte
Add new equation for calculating location of Sysmap
Remove Bill's old #ifdef garbage for counting up memory,
that stuff will never be made to work and was just cluttering
up the file.

Add code that places the PTD, page table pages, and kernel
stack below the 640k ISA hole if there is room for it, otherwise
put this stuff all at 1MB. This fixes the 28K bogusity in
the boot blocks, that can now go away!

Fix the caclulation of where first is to be dependent on
NKPDE so that we can skip over the above mentioned areas.
The 28K thing is now 44K in size due to the increase in
kernel virtual memory space, but since we no longer have
to worry about that this is no big deal.

Use if NNPX > 0 instead of ifdef NPX for floating point code.

machdep.c
Change the calculation of for the buffer cache to be
20% of all memory above 2MB and add back the upper limit
of 2/5's of the VM_KMEM_SIZE so that we do not eat ALL
of the kernel memory space on large memory machines, note
that this will not even come into effect unless you have
more than 32MB. The current buffer cache limit is 6.7MB
due to this caclulation.

It seems that we where erroniously allocating bufpages pages
for buffer_map. buffer_map is UNUSED in this implementation
of the buffer cache, but since the map is referenced in
several if statements a quick fix was to simply allocate
1 vm page (but no real memory) to it.

pmap.h
Remove rcsid, don't want them in the kernel files!

Removed some cruft inside an #ifdef DEBUGx that caused
compiler errors if you where compiling this for debug.

Use the #defines for PD_SHIFT and PG_SHIFT in place of
constants.

trap.c:
Remove patch kit header and rcsid, fix $Id$.
Now include "npx.h" and use NNPX for controlling the
floating point code.

Remove a now completly invalid check for a maximum virtual
address, the virtual address now ends at 0xFFFFFFFF so
there is no more MAX!! (Thanks David, I completly missed
that one!)

vm_machdep.c
Remove patch kit header and rcsid, fix $Id$.
Now include "npx.h" and use NNPX for controlling the
floating point code.

Replace several 0xFE00000 constants with KERNBASE


# a0ad3760 28-Aug-1993 Rodney W. Grimes <rgrimes@FreeBSD.org>

Changed trap.c so that a panic will occur if we do not have hardware
FP and we try to call the emulator when it is not compiled in.
Removed the #if defined(i486) || defined(i387) that use to call the
panic if we did not have a math emulator.
Removed an extranious include of i386/i386/math_emu.h from math_emulate.c.


# 26931201 27-Jul-1993 David Greenman <dg@FreeBSD.org>

* Applied fixes from Bruce Evans to fix COW bugs, >1MB kernel loading,
profiling, and various protection checks that cause security holes
and system crashes.
* Changed min/max/bcmp/ffs/strlen to be static inline functions
- included from cpufunc.h in via systm.h. This change
improves performance in many parts of the kernel - up to 5% in the
networking layer alone. Note that this requires systm.h to be included
in any file that uses these functions otherwise it won't be able to
find them during the load.
* Fixed incorrect call to splx() in if_is.c
* Fixed bogus variable assignment to splx() in if_ed.c


# 5b81b6b3 12-Jun-1993 Rodney W. Grimes <rgrimes@FreeBSD.org>

Initial import, 0.1 + pk 0.2.4-B1