History log of /freebsd-10.0-release/sys/i386/i386/vm_machdep.c
Revision Date Author Comments
# 259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255827 23-Sep-2013 kib

Free both KVA and backing pages when freeing TSS memory.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (marius)


# 255786 22-Sep-2013 glebius

- Create kern.ipc.sendfile namespace, and put the new "readahead" OID
there as "kern.ipc.sendfile.readahead".
- Push all nsfbuf related tunables into MD code. Don't move them
to new namespace in favor of POLA.

Reviewed by: scottl
Approved by: re (gjb)


# 254025 07-Aug-2013 jeff

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 253351 15-Jul-2013 ae

Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for updating these statistics.

Discussed with: glebius


# 250624 13-May-2013 ed

Improve readability of static assertions for OFFSET_* macros.

Instead of doing all sorts of weird casting of constants to
pointer-pointers, simply use the standard C offsetof() macro to obtain
the offset of the respective fields in the structures.
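
A minimal illustration of the technique this commit describes, using the
kernel's CTASSERT() with a hypothetical structure and offset constant
(the real OFFSET_* names and structures differ):

    /* Hypothetical layout constant that assembly code relies on. */
    #define OFFSET_FOO_BAR  4

    struct foo {
            int     foo_a;
            int     foo_bar;
    };

    /*
     * Instead of casting a constant to a pointer-to-pointer type to
     * recover a field offset, let offsetof() compute it directly and
     * assert equality at compile time.
     */
    CTASSERT(OFFSET_FOO_BAR == offsetof(struct foo, foo_bar));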


# 238792 26-Jul-2012 kib

MFamd64 r238623:
Introduce curpcb magic variable.

Requested and reviewed by: bde
MFC after: 3 weeks


# 223758 04-Jul-2011 attilio

With retirement of cpumask_t and usage of cpuset_t for representing a
mask of CPUs, pc_other_cpus and pc_cpumask become highly inefficient.

Remove them and replace their usage with custom pc_cpuid magic (as,
atm, pc_cpumask can be easily represented by (1 << pc_cpuid) and
pc_other_cpus by (all_cpus & ~(1 << pc_cpuid))).

This change is not targeted for MFC because of the struct pcpu members
removal and the dependency on cpumask_t retirement.

MD review by: marcel, marius, alc
Tested by: pluknet
MD testing by: marcel, marius, gonzo, andreast
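
A sketch of the pc_cpuid substitution described above, assuming the old
single-word cpumask representation (illustrative only; the cpuset_t code
uses the CPU_* set macros rather than plain shifts):

    #include <sys/param.h>
    #include <sys/pcpu.h>

    /* What pc_cpumask used to hold for the current CPU. */
    static __inline u_int
    my_cpu_mask(void)
    {
            return (1u << PCPU_GET(cpuid));
    }

    /* What pc_other_cpus used to hold: every CPU except this one. */
    static __inline u_int
    other_cpus_mask(u_int all_cpus_mask)
    {
            return (all_cpus_mask & ~(1u << PCPU_GET(cpuid)));
    }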


# 222813 07-Jun-2011 attilio

Retire the cpumask_t type and replace it with cpuset_t usage.

This is intended to fix the bug where cpu mask objects are
capped at 32. MAXCPU can now be arbitrarily bumped to whatever
value. However, as long as several structures in the kernel are
statically allocated and sized by MAXCPU, it is suggested to keep it
as low as possible for the time being.

Technical notes on this commit itself:
- More functions for handling cpuset_t objects are introduced.
The most notable are cpusetobj_ffs() (which calculates a ffs(3)
for a cpuset_t object), cpusetobj_strprint() (which prepares a string
representing a cpuset_t object) and cpusetobj_strscan() (which
creates a valid cpuset_t starting from a string representation).
- pc_cpumask and pc_other_cpus are targeted to be removed soon.
With the move from cpumask_t to cpuset_t they are now inefficient
and not really useful. For the time being, please note that
access to pcpu data is protected by sched_pin() in order to avoid
migrating the CPU while possibly reading more than one word.
- Please note that size of cpuset_t objects may differ between kernel
and userland. While this is not directly related to the patch itself,
it is good to understand that concept and possibly use the patch
as a reference on how to deal with cpuset_t objects in userland, when
accessing kernel-side members.
- KTR_CPUMASK is changed and now is represented through a string, to be
set as the example reported in NOTES.

Please additionally note that MAXCPU is not bumped in this patch, but
private testing has been done up to MAXCPU=128 on a real 8x8x2(htt)
machine (amd64).

Please note that the FreeBSD version is not yet bumped because of
the upcoming pcpu changes. However, note that this patch is not
targeted for MFC.

People to thank for the time spent on this patch:
- sbruno, pluknet and Nicholas Esborn (nick AT desert DOT net) tested
several revisions of the patches and really helped in improving
stability of this work.
- marius fixed several bugs in the sparc64 implementation and reviewed
patches related to ktr.
- jeff and jhb discussed the basic approach followed.
- kib and marcel made targeted review on some specific part of the
patch.
- marius, art, nwhitehorn and andreast reviewed MD specific part of
the patch.
- marius, andreast, gonzo, nwhitehorn and jceel tested MD specific
implementations of the patch.
- Other people have made contributions on other patches that have been
already committed and have been listed separately.

Companies that should be mentioned for having participated at several
degrees:
- Yahoo! for having offered the machines used for testing on big
count of CPUs.
- The FreeBSD Foundation for having sponsored my devsummit attendance,
which has been instrumental.
- Sandvine for having offered offices and infrastructure during
development.

(I really hope I didn't forget anyone; if I did, I apologize in
advance).


# 217561 18-Jan-2011 kib

For architectures not using a direct map, and requiring a real KVA page
for sf buf allocation, use wakeup() instead of wakeup_one() to notify sf
buffer waiters about a free buffer.

sf_buf_alloc() calls msleep(PCATCH) when the SFB_CATCH flag was given,
and for simultaneous wakeup and signal delivery, msleep() returns
EINTR/ERESTART even though the thread was selected for wakeup_one(). As
a result, we lose a wakeup, and some other waiter will not be woken up.

Reported and tested by: az
Reviewed by: alc, jhb
MFC after: 1 week


# 211197 11-Aug-2010 jhb

Update various places that store or manipulate CPU masks to use cpumask_t
instead of int or u_int. Since cpumask_t is currently u_int on all
platforms this should just be a cosmetic change.


# 209462 23-Jun-2010 kib

Now that FPU use requires a working #MF, due to the removal of INT13 FPU
exception handling, MFi386 r209198:
Use critical sections instead of disabling local interrupts to ensure
the consistency between PCPU fpcurthread and the state of FPU.

Reviewed by: bde
Tested by: pho


# 209461 23-Jun-2010 kib

Remove the support for int13 FPU exception reporting on i386. It is
believed that all 486-class CPUs FreeBSD is capable of running on either
have no FPU and cannot use an external coprocessor, or have the FPU on
the package and can use #MF.

Reviewed by: bde
Tested by: pho (previous version)


# 208833 05-Jun-2010 kib

Introduce the x86 kernel interfaces to allow kernel code to use
FPU/SSE hardware. Caller should provide a save area that is chained
into the stack of the areas; pcb save_area for usermode FPU state is
on top. The pcb now contains a pointer to the current FPU saved area,
used during FPUDNA handling and context switches. There is also a
facility to allow the kernel thread to use pcb save_area.

Change the dreaded "npxdna in kernel mode!" warnings into panics
when FPU usage is not registered.

KPI discussed with: fabient
Tested by: pho, fabient
Hardware provided by: Sentex Communications
MFC after: 1 month


# 204309 25-Feb-2010 attilio

Introduce the new kernel sub-tree x86 which should contain all the code
shared and generalized between our current amd64, i386 and pc98.

This is just an initial step that should lead to a more complete effort.
For the moment, a very simple porting of cpufreq modules, BIOS calls and
the whole MD specific ISA bus part is added to the sub-tree but ideally
a lot of code might be added and more shared support should grow.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib, jhb, imp
Discussed on: arch
MFC: 3 weeks


# 199135 10-Nov-2009 kib

Extract the code that records syscall results in the frame into MD
function cpu_set_syscall_retval().

Suggested by: marcel
Reviewed by: marcel, davidxu
PowerPC, ARM, ia64 changes: marcel
Sparc64 tested and reviewed by: marius, also sunv reviewed
MIPS tested by: gonzo
MFC after: 1 month


# 197693 01-Oct-2009 kmacy

make read_eflags and write_eflags accomplish the same effect on PVM as native,
simplifying interrupt handling


# 195940 29-Jul-2009 kib

As was done in r195820 for amd64, use clflush for flushing cache lines
on i386 when memory page caching attributes change and the CPU does not
support self-snoop but does implement clflush.

Take care of a possible mapping of the page by an sf buffer by utilizing
that mapping for clflush; otherwise map the page transiently. Amd64
uses the direct map.

Proposed and reviewed by: alc
Approved by: re (kensmith)


# 190237 22-Mar-2009 alc

Eliminate the recomputation of pcb_cr3 from cpu_set_upcall(). The
bcopy()ed value from the old thread is the correct value because the new
thread and the old thread will share a page table.


# 188200 05-Feb-2009 kmacy

halt APs on reboot


# 188199 05-Feb-2009 kmacy

reboot instance on reset


# 186557 29-Dec-2008 kmacy

merge 186535, 186537, and 186538 from releng_7_xen

Log:
- merge in latest xenbus from dfr's xenhvm
- fix race condition in xs_read_reply by converting tsleep to mtx_sleep

Log:
unmask evtchn in bind_{virq, ipi}_to_irq

Log:
- remove code for handling case of not being able to sleep
- eliminate tsleep - make sleeps atomic


# 183615 05-Oct-2008 davidxu

If the current thread has the trap bit set (i.e. a debugger had
single stepped the process to the system call), we need to clear
the trap flag from the new frame. Otherwise, the new thread will
receive a (likely unexpected) SIGTRAP when it executes the first
instruction after returning to userland.
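
A minimal sketch of the fix described above, assuming i386 trapframe
field names (the helper is hypothetical; the real change is inline in
the thread-creation path):

    #include <sys/param.h>
    #include <sys/proc.h>
    #include <machine/psl.h>

    static void
    clear_single_step(struct thread *newtd)
    {
            /*
             * Drop the trace flag inherited from the parent so the new
             * thread does not take a spurious SIGTRAP on its first
             * instruction back in userland.
             */
            newtd->td_frame->tf_eflags &= ~PSL_T;
    }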


# 182961 12-Sep-2008 kib

When doing rfork(0), i.e. separating curproc VM from any other user of
the same vmspace, decrement the reference count of the shared LDT instead
of a newly-made copy. The code effectively removed the LDT from the
process that did rfork(0).

Introduce user_ldt_deref() function that does decrement of refcount for
the struct proc_ldt, and call it in the rfork(0) case on the shared LDT.

Reviewed by: jhb
MFC after: 1 week


# 182936 11-Sep-2008 jhb

Update the comments above the 0xcf9 register reset attempt to match the
code. We only attempt a single reset using this method (a "hard" reset),
and we use two writes to ensure there is a 0 -> 1 transition in bit 2 to
force a reset.

MFC after: 1 week
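
A sketch of the two-write sequence the updated comment describes
(illustrative; see cpu_reset_real() for the actual code):

    #include <machine/cpufunc.h>    /* outb() */

    static void
    reset_via_0xcf9(void)
    {
            /* Select a hard reset (bit 1) with the reset bit (bit 2) low. */
            outb(0xcf9, 0x2);
            /* Raise bit 2: the 0 -> 1 transition forces the reset. */
            outb(0xcf9, 0x6);
    }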


# 181938 20-Aug-2008 kmacy

fix typo in previous commit breaking bootup

pointed out by: Takahashi Yoshihiro nyan@


# 181911 20-Aug-2008 kmacy

- clean up interrupt handling for xen a tiny bit
- parse the command line into kenv
- defer shutdown watcher until later in boot

MFC after: 1 month


# 181775 15-Aug-2008 kmacy

Integrate support for xen into i386 common code.

MFC after: 1 month


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 173615 14-Nov-2007 marcel

o Rename cpu_thread_setup() to cpu_thread_alloc() to better
communicate that it relates to (is called by) thread_alloc()
o Add cpu_thread_free() which is called from thread_free()
to counter-act cpu_thread_alloc().

i386: Have cpu_thread_free() call cpu_thread_clean() to
preserve behaviour.
ia64: Have cpu_thread_free() call mtx_destroy() for the
mutex initialized in cpu_thread_alloc().

PR: ia64/118024


# 171295 07-Jul-2007 attilio

The current code shows several problems in ia32 LDT handling:
- When a LDT entry changes, the old one is freed while it is still
referenced by gdt and ldtr. This can lead to disruptive behaviours in
particular on SMP machines.
- When a LDT entry changes, it is assumed that the only entities sharing
the same LDT are threads in the same proc. It doesn't take into account
edge cases where two processes share the same VM (rfork'ed ones, for
example).

This patch addresses these two problems and additionally fixes the
refcount usage, switching it back to the old manually-grown refcount
(since in this case it is faster).

Diagnosed by: tegge
Tested by: pho (a former version)
Reviewed by: kib
Approved by: jeff (mentor)
Approved by: re


# 170305 04-Jun-2007 jeff

- Change comments and asserts to reflect the removal of the global
scheduler lock.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 170110 29-May-2007 attilio

Fix some problems introduced with the last descriptor tables locking
patch:
- Do the correct test for ldt allocation
- Drop dt_lock just before calling kmem_free (since it acquires blocking
locks inside)
- Solve a deadlock with smp_rendezvous() where another CPU would wait
indefinitely for dt_lock acquisition.
- Add dt_lock to the WITNESS list of spinlocks

While applying these modifications, change the requirements of
user_ldt_free(), making it return without dt_lock held.

Tested by: marcus, tegge
Reviewed by: tegge
Approved by: jeff (mentor)


# 169802 20-May-2007 jeff

- Move GDT/LDT locking into a separate spinlock, removing the global
scheduler lock from this responsibility.

Contributed by: Attilio Rao <attilio@FreeBSD.org>
Tested by: jeff, kkenn


# 169030 24-Apr-2007 jhb

Fix the triple fault used as a last resort during a reboot to actually
fault. The previous method zero'd out the page tables, invalidated the
TLB, and then entered a spin loop. The idea was that the instruction after
the TLB invalidate would result in a page fault and the page fault and
subsequent double fault wouldn't be able to determine the physical page
for their fault handlers' first instruction. This stopped working when
PGE (PG_G PTE/PDE bit) support was added as a TLB invalidate via %cr3
reload doesn't clear TLB entries with PG_G set. Thus, the CPU was still
able to map the virtual address for the spin loop and happily performed
its infinite loop.

The triple fault now uses a much more deterministic sledge-hammer approach
to generate a triple fault. First, the IDT descriptor is set to point to
an empty IDT, so any interrupts (including a double fault) will instantly
fault. Second, we trigger an int 3 breakpoint to force an interrupt and
kick off a triple fault.

MFC after: 3 days
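
A sketch of the sledge-hammer sequence described above (simplified; the
real code lives in cpu_reset()):

    #include <machine/cpufunc.h>    /* lidt() */
    #include <machine/segments.h>   /* struct region_descriptor */

    static void
    force_triple_fault(void)
    {
            /* An empty IDT: limit 0, so every vector is out of bounds. */
            static struct region_descriptor null_idt = { 0, 0 };

            /* Point the CPU at the empty IDT... */
            lidt(&null_idt);
            /* ...then raise int 3: fault, double fault, triple fault. */
            __asm __volatile("int $3");
    }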


# 169020 24-Apr-2007 jhb

Update comments for the 0xcf9 and 0x92 reset methods to explain what we are
actually doing and what the various bits mean.


# 168996 23-Apr-2007 jhb

Tweak printf string.


# 167250 05-Mar-2007 alc

Acquiring smp_ipi_mtx on every call to pmap_invalidate_*() is wasteful.
For example, during a buildworld more than half of the calls do not
generate an IPI because the only TLB entry invalidated is on the calling
processor. This revision pushes down the acquisition and release of
smp_ipi_mtx into smp_tlb_shootdown() and smp_targeted_tlb_shootdown() and
instead uses sched_pin() and sched_unpin() in pmap_invalidate_*() so that
thread migration doesn't lead to a missed TLB invalidation.

Reviewed by: jhb
MFC after: 3 weeks


# 166315 28-Jan-2007 rwatson

As we now have an SFB_NOWAIT flag, change 'will' to 'may' where the
comment for sf_buf_alloc(9) talks about sleeping.


# 159027 29-May-2006 davidxu

Back out the changes trying to inherit the floating-point environment.
Although POSIX (SUSv3) requires this, it is unclear what should be
inherited, and duplicating the whole 387 stack for a new thread seems
unnecessary and dangerous. Revert to the previous code, forcing a new
thread to be started with clean FP state.


# 158997 28-May-2006 davidxu

PCB_NPXINITDONE is cleared by npx_fork_thread.


# 158995 28-May-2006 davidxu

When creating a new thread, inherit the floating-point environment from
the current thread; this is required by the POSIX pthread_create
documentation.


# 158097 28-Apr-2006 sobomax

Unbreak pc98. Sorry...


# 158068 27-Apr-2006 sobomax

In the case when reset via the keyboard controller doesn't work for some
reason (i.e. no keyboard controller present), try two other common methods
for resetting an i386 machine - PCI reset and port 0x92 fast reset. Only if
neither works, warn the user and resort to the "unmap the entire address
space and hope for the best" hack. This makes my MacBook Pro reboot just
fine and should also help other legacy-free hardware out there.

Also, disable interrupts unconditionally in cpu_reset_real(), since we don't
want any interference.

MFC after: 1 week


# 156523 10-Mar-2006 davidxu

It is not necessary to read %gs twice.


# 156522 10-Mar-2006 davidxu

Fix stack offset to allow gcc's stack alignment code to work correctly.

MFC after: 3 days


# 152404 13-Nov-2005 imp

Provide a dummy NO_XBOX option that lives in opt_xbox.h for pc98.
This allows us to eliminate three #ifdef PC98 instances.


# 152238 09-Nov-2005 nyan

Fix pc98 build.


# 152219 09-Nov-2005 imp

Add support for XBOX to the FreeBSD port. The xbox architecture is
nearly identical to wintel/ia32, with a couple of tweaks. Since it is
so similar to ia32, it is optionally added to a i386 kernel. This
port is preliminary, but seems to work well. Further improvements
will improve the interaction with syscons(4), port Linux nforce driver
and future versions of the xbox.

This supports the 64MB and 128MB boxes. You'll need the most recent
CVS version of Cromwell (the Linux BIOS for the XBOX) to boot.

Rink will be maintaining this port, and is interested in feedback.
He's set up a website http://xbox-bsd.nl to report the latest
developments.

Any silly mistakes are my fault.

Submitted by: Rink P.W. Springer rink at stack dot nl and
Ed Schouten ed at fxq dot nl


# 151633 24-Oct-2005 jhb

When restarting the BSP during cpu_reset() use a membar to ensure that
the updated cpustop_restartfunc is seen when the BSP resumes execution.
This matches the membar already present in restart_cpus().


# 151302 13-Oct-2005 alc

Restore the UP optimization to reduce the number of TLB invalidations. The
previous revision only restored the MP optimization.

Describe the optimization strategy for TLB invalidations in a comment.

Reviewed by: ups@
MFC after: 3 days


# 151274 13-Oct-2005 ups

Restore optimizations to reduce TLB shootdowns.
Alan Cox pointed out that they are really useful for
sendfile().

MFC after: 3 days


# 151247 11-Oct-2005 ups

Ensure that a thread stays on the same CPU when calculating per-CPU
TLB shootdown requirements. Otherwise a CPU may not get the needed
TLB invalidation.

The PTE valid and access flags cannot be used here to avoid TLB
shootdowns unless sf->cpumask == all_cpus.
(Otherwise some CPUs may still hold an even older entry in the TLB.)
Since sf_buf_alloc mappings are normally always used, this is
also not really useful, and presetting accessed and modified
allows the CPU to speculatively load the entry into the TLB.

Both bugs can cause random data corruption.

MFC after: 3 days


# 150041 12-Sep-2005 nyan

opt_pc98.h is not needed.


# 147889 10-Jul-2005 davidxu

Validate that the value written into {FS,GS}.base is a canonical
address; writing a non-canonical address can cause a kernel panic.
Restrict base values to 0..VM_MAXUSER_ADDRESS, ensuring
only canonical values get written to the registers.

Reviewed by: peter, Josepha Koshy < joseph.koshy at gmail dot com >
Approved by: re (scottl)
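
A sketch of the range check described above (hypothetical helper; the
real check sits in the descriptor/base-setting path):

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <machine/vmparam.h>    /* VM_MAXUSER_ADDRESS */

    static int
    check_segbase(unsigned long base)
    {
            /*
             * Restricting the base to user addresses guarantees that
             * only canonical values reach the register, so writing it
             * cannot panic the kernel.
             */
            if (base >= VM_MAXUSER_ADDRESS)
                    return (EINVAL);
            return (0);
    }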


# 147558 23-Jun-2005 jhb

Various and sundry style fixes and comment cleanups.

Approved by: re (scottl)


# 146049 10-May-2005 nyan

Change the directory layout for pc98.
- Move MD files into <arch>/<arch>.
- Move bus dependent files into <arch>/<bus>.
Rename some files to more suitable names.

Repo-copied by: peter
Discussed with: imp


# 145433 23-Apr-2005 davidxu

Change cpu_set_kse_upcall to a more generic style, so we can reuse it
in other code. Add cpu_set_user_tls and use it to tweak user registers
and set up user TLS. I wanted to merge it into cpu_set_kse_upcall,
but since cpu_set_kse_upcall is also used by M:N threads which may
not need this feature, I wrote a separate cpu_set_user_tls.


# 144969 12-Apr-2005 jhb

Use NULL rather than 0 in a couple of places.


# 144637 04-Apr-2005 jhb

Divorce critical sections from spinlocks. Critical sections as denoted by
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any effect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.

Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.

This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).

Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
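
A sketch of the new MD pair in the style of the i386 implementation
(simplified; field names follow the i386 struct mdthread, but treat
this as an outline rather than the committed code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/proc.h>
    #include <machine/cpufunc.h>

    void
    spinlock_enter(void)
    {
            struct thread *td = curthread;

            /* Block interrupts only for the outermost spin lock. */
            if (td->td_md.md_spinlock_count == 0)
                    td->td_md.md_saved_flags = intr_disable();
            td->td_md.md_spinlock_count++;
            critical_enter();
    }

    void
    spinlock_exit(void)
    {
            struct thread *td = curthread;

            critical_exit();
            td->td_md.md_spinlock_count--;
            /* Restore the interrupt state when the last spin lock drops. */
            if (td->td_md.md_spinlock_count == 0)
                    intr_restore(td->td_md.md_saved_flags);
    }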


# 141784 13-Feb-2005 alc

Implement support for CPU private mappings within sf_buf_alloc().


# 140260 14-Jan-2005 jhb

Bah, another whitespace fix.


# 140259 14-Jan-2005 jhb

Remove an extraneous space.


# 140257 14-Jan-2005 jhb

Remove redundant code to drop per-thread debug register state from
cpu_exit() as this is already performed in cpu_thread_exit() and the
debug state is per-thread rather than per-process.


# 139450 30-Dec-2004 jhb

Use NULL instead of 0 in a few places as well as various whitespace fixes.


# 139343 27-Dec-2004 njl

Restore the cpu_reset proxy code. It is needed if you want to reset the
system from an AP at runtime (i.e., calling cpu_reset from ddb). Someday,
if we move to an NMI for stopping cpus instead, we can do away with this.

Requested by: jhb


# 138592 08-Dec-2004 kbyanc

If the parent process has the trap bit set (i.e. a debugger had single
stepped the process to the system call), we need to clear the trap flag
from the new frame unless the debugger had set PF_FORK on the parent.
Otherwise, the child will receive a (likely unexpected) SIGTRAP when it
executes the first instruction after returning to userland.

Reviewed by: bde
MFC after: 3 days


# 138236 30-Nov-2004 njl

Remove now unused variable.

Pointy hat: njl from nskyline_r35 at yahoo com


# 138216 30-Nov-2004 njl

MFamd64: Remove the cpu_reset_proxy cruft now that we run boot() on
cpu 0. Also, restructure cpu_reset to be cleaner (no functional change.)


# 138129 27-Nov-2004 das

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


# 137372 07-Nov-2004 alc

Introduce two new options, "CPU private" and "no wait", to sf_buf_alloc().
Change the spelling of the "catch" option to be consistent with the new
options. Implement the "no wait" option. An implementation of the "CPU
private" for i386 will be committed at a later date.


# 136601 16-Oct-2004 alc

When sf_buf_alloc() replaces a virtual-to-physical mapping, it needn't
invalidate the TLB(s) if the old mapping wasn't used by the CPU. With
network interfaces that implement checksum off-loading, the old mapping is
almost never used by the CPU, only by the device driver for setting up the
DMA operation.

Reviewed by: tegge@


# 132422 19-Jul-2004 davidxu

Mark the end of the frame chain for KSE threads and system-scope threads;
without this change, the debugger will dump a weird stack backtrace.


# 130036 03-Jun-2004 phk

The NatSemi (now AMD) Geode SC1100 needs special treatment here and there
because it is an embedded gadget. Give it its own value for the "cpu"
variable and add code to reset it, since it lacks a keyboard controller.


# 129906 31-May-2004 bmilekic

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues; it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 129876 30-May-2004 phk

Add some missing <sys/module.h> includes which are masked by the
one on death-row in <sys/kernel.h>


# 129750 26-May-2004 tmm

Retire cpu_sched_exit(); it is not used any more.


# 128100 11-Apr-2004 alc

- is_physical_memory()'s parameter, which is a physical address, should be
a vm_paddr_t not a vm_offset_t.


# 127788 03-Apr-2004 alc

In some cases, sf_buf_alloc() should sleep with pri PCATCH; in others, it
should not. Add a new parameter so that the caller can specify which is
the case.

Reported by: dillon


# 127582 29-Mar-2004 peter

Finish tidying up a couple of leftovers from the KSTACK_PAGES stuff. Some
files still #included the opt_ file. powerpc hadn't been updated yet.


# 127283 21-Mar-2004 wpaul

The kthread_create() API is supposed to allow you to create threads
with more than the normal amount of stack pages, however the stack
pointer always wound up being initialized using KSTACK_PAGES. It
should be using td->td_kstack_pages instead. This means that although
the vm subsystem would give you all the stack pages you asked for,
%esp would always be initialized as if you had just 2 pages, and
the rest would go to waste.

I wanted to use the 'give me more stack pages' feature of kthread_create()
because the Intel 2200BG NDIS driver does an alloca() of about 5000 bytes,
which wrecks the stack with the default 2 page size, and I was baffled
that no matter how much code I shoved into thread contexts with
allegedly larger stacks, the thing would still crash unless I changed
KSTACK_PAGES.

Note: this bug is present in _ALL_ arches at this point. Peter has
promised to merge this fix into all of them.


# 127086 16-Mar-2004 alc

Refactor the existing machine-dependent sf_buf_free() into a machine-
dependent function by the same name and a machine-independent function,
sf_buf_mext(). Aside from the virtue of making more of the code machine-
independent, this change also makes the interface more logical. Before,
sf_buf_free() did more than simply undo an sf_buf_alloc(); it also
unwired and if necessary freed the page. That is now the purpose of
sf_buf_mext(). Thus, sf_buf_alloc() and sf_buf_free() can now be used
as a general-purpose ephemeral map cache.


# 126942 14-Mar-2004 alc

Simplify sf_buf_alloc().


# 126793 10-Mar-2004 alc

- Make the acquisition of Giant in vm_fault_unwire() conditional on the
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)


# 126785 09-Mar-2004 jb

Remove duplicate code.

Requested by: bde


# 126761 09-Mar-2004 jb

AMD's ELAN documentation says that you write to the SYS_RST register
in the Memory Mapped Configuration Region (MMCR) to reset the CPU.
If CPU_ELAN is set, try this first to reset the CPU before the
traditional way.

Without this change, my Compulab board powers down on 'reset' instead
of rebooting.


# 126738 07-Mar-2004 peter

Add back Giant locks around kmem_free() call from user_ldt cleanup path
during exit. Apparently it isn't safe after all. See uma_large_free().

Pointed out by: alc


# 126736 07-Mar-2004 peter

Other parts of the tree do not protect calls to kmem_free() with Giant,
so remove it from here. The most notable examples include vm_mmap().
This removes one more Giant event from exit(2).


# 124144 05-Jan-2004 phk

Add struct definition of the Elan MMCR registers (from jb@)

Put a CTASSERT() on the size of the struct.

Use the struct where it is easy to do so in elan_mmcr.c

Add the Elan specific hardware reset code (also from jb@).


# 123955 29-Dec-2003 bde

Sorted includes.


# 123929 28-Dec-2003 silby

Track three new sendfile-related statistics:
- The number of times sendfile had to do disk I/O
- The number of times sfbuf allocation failed
- The number of times sfbuf allocation had to wait


# 123920 27-Dec-2003 silby

Move the declaration of sfbufspeak and sfbufsused to mbuf.h,
and use imax instead of max, as sfbufspeak and sfbufsused
are signed.

Submitted by: bde


# 123884 27-Dec-2003 silby

Track current and peak sfbuf usage, export the values via sysctl.


# 123266 07-Dec-2003 alc

Don't remove the virtual-to-physical mapping when an sf_buf is freed.
Instead, allow the mapping to persist, but add the sf_buf to a free list.
If a later sendfile(2) or zero-copy send resends the same physical page,
perhaps with the same or different contents, then the mapping overhead is
avoided and the sf_buf is simply removed from the free list.

In other words, the i386 sf_buf implementation now behaves as a cache of
virtual-to-physical translations using an LRU replacement policy on
inactive sf_bufs. This is similar in concept to a part of
http://www.cs.princeton.edu/~yruan/debox/ patch, but much simpler in
implementation. Note: none of this is required on alpha, amd64, or ia64.
They now use their direct virtual-to-physical mapping to avoid any
ephemeral mapping overheads in their sf_buf implementations.


# 122860 17-Nov-2003 alc

- Change the i386's sf_buf implementation so that it never allocates
more than one sf_buf for one vm_page. To accomplish this, we add
a global hash table mapping vm_pages to sf_bufs and a reference
count to each sf_buf. (This is similar to the patches for RELENG_4
at http://www.cs.princeton.edu/~yruan/debox/.)

For the uninitiated, an sf_buf is nothing more than a kernel virtual
address that is used for temporary virtual-to-physical mappings by
sendfile(2) and zero-copy sockets. As such, there is no reason for
one vm_page to have several sf_bufs mapping it. In fact, using more
than one sf_buf for a single vm_page increases the likelihood that
sendfile(2) blocks, hurting throughput.
(See http://www.cs.princeton.edu/~yruan/debox/.)
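
A simplified sketch of the lookup-before-allocate idea described above
(the hash-table names are hypothetical and the field names follow the
i386 struct sf_buf; the real allocator also handles mapping setup, the
free list, and sleeping when no sf_buf is available):

    #include <sys/param.h>
    #include <sys/queue.h>
    #include <sys/sf_buf.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    /* One bucket per hash value, sized and filled in at sf_buf_init(). */
    static LIST_HEAD(sf_head, sf_buf) *sf_buf_hashtable;
    static u_long sf_buf_hashmask;

    #define SF_BUF_HASH(m) \
            (&sf_buf_hashtable[((uintptr_t)(m) >> PAGE_SHIFT) & sf_buf_hashmask])

    /* Reuse an existing sf_buf for this page, bumping its refcount. */
    static struct sf_buf *
    sf_buf_lookup(vm_page_t m)
    {
            struct sf_buf *sf;

            LIST_FOREACH(sf, SF_BUF_HASH(m), list_entry)
                    if (sf->m == m) {
                            sf->ref_count++;
                            return (sf);
                    }
            return (NULL);
    }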


# 122821 16-Nov-2003 alc

- Remove unnecessary synchronization from sf_buf_init(). (There is only
one active CPU when sf_buf_init() is performed.)


# 122780 16-Nov-2003 alc

- Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
- Move the sf_buf API to its own header file; make struct sf_buf's
definition machine dependent. In this commit, we remove an
unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
Ultimately, we may eliminate struct sf_buf on those architectures
except as an opaque pointer that references a vm page.


# 121235 18-Oct-2003 davidxu

Use npxdrop in cpu_thread_exit to save some cycles.
Clear the FPU pcb flags for the new upcall thread; these flags needn't
be inherited, and the new thread should start with clean FPU state.


# 119563 29-Aug-2003 alc

Migrate the sf_buf allocator that is used by sendfile(2) and zero-copy
sockets into machine-dependent files. The rationale for this
migration is illustrated by the modified amd64 allocator. It uses the
amd64's direct map to avoid emphemeral mappings in the kernel's
address space. On an SMP, the ephemeral mappings result in an IPI
for TLB shootdown for each transmitted page. Yuck.

Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.


# 119004 16-Aug-2003 marcel

In vm_thread_swap{in|out}(), remove the alpha specific conditional
compilation and replace it with a call to cpu_thread_swap{in|out}().
This allows us to add similar code on ia64 without cluttering the
code even more.


# 117285 06-Jul-2003 bsd

Woops - don't forget to declare locals needed for that last commit.


# 117284 06-Jul-2003 bsd

Don't unconditionally reset the hardware debug registers in cpu_exit(),
reset them only if they were previously in use. Unconditionally
resetting the registers wipes them out frequently, which interferes
with their use for kernel debugging.

While I'm here, be less verbose in the associated comment of a
neighboring function.

Noticed by: bde


# 116188 11-Jun-2003 peter

GC unused cpu_wait() function


# 115858 04-Jun-2003 marcel

Change the second (and last) argument of cpu_set_upcall(). Previously
we were passing in a void* representing the PCB of the parent thread.
Now we pass a pointer to the parent thread itself.
The prime reason for this change is to allow cpu_set_upcall() to copy
(parts of) the trapframe instead of having it done in MI code in each
caller of cpu_set_upcall(). Copying the trapframe cannot always be
done with a simple bcopy() or may not always be optimal that way. On
ia64 specifically the trapframe contains information that is specific
to an entry into the kernel and can only be used by the corresponding
exit from the kernel. A trapframe copied verbatim from another frame
is in most cases useless without some additional normalization.

Note that this change removes the assignment to td->td_frame in some
implementations of cpu_set_upcall(). The assignment is redundant.
A previous call to cpu_thread_setup() already did the exact same
assignment. An added benefit of removing the redundant assignment is
that we can now change td_pcb without nasty side-effects.

This change officially marks the ability on ia64 for 1:1 threading.

Not tested on: amd64, powerpc
Compile & boot tested on: alpha, sparc64
Functionally tested on: i386, ia64


# 115728 02-Jun-2003 tegge

Initialize td->td_pcb->pcb_ext in cpu_thread_setup() since a garbage
value (e.g. 0xd0d0d0d0) can cause a kernel panic.


# 115683 02-Jun-2003 obrien

Use __FBSDID().


# 115044 15-May-2003 tmm

In cpu_fork(), initialize pcb_psl for the new process to PSL_KERNEL,
instead of taking the (userland) eflags from the trap frame and masking
out PSL_I. There is no need to inherit any flags from the forking process;
the old method however can cause flags set in userland for the forking
process to be bogusly set in kernel mode when the newly forked process
runs for the first time (in particular PSL_T, which is set for userland
when the process is single-stepped; this would cause trace traps in
kernel mode).

Approved by: re (jhb)
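
The core of the change, as a fragment in the spirit of the commit
(pcb_psl is the i386 PCB field holding the saved eflags; pcb2 stands
for the child's new PCB):

    /*
     * Start the child with a known kernel eflags value instead of the
     * parent's userland eflags with PSL_I masked off; this keeps stray
     * user flags such as PSL_T out of kernel mode.
     */
    pcb2->pcb_psl = PSL_KERNEL;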


# 113796 21-Apr-2003 davidxu

Reset pcb_gs and %gs before possibly invalidating it.


# 113364 11-Apr-2003 davidxu

Copy %gs from current CPU not from a stale PCB backup.


# 112841 30-Mar-2003 jake

- Add support for PAE and more than 4 gigs of ram on x86, dependent on the
kernel option 'options PAE'. This will only work with device drivers which
either use busdma, or are able to handle 64 bit physical addresses.

Thanks to Lanny Baron from FreeBSD Systems for the loan of a test machine
with 6 gigs of ram.

Sponsored by: DARPA, Network Associates Laboratories, FreeBSD Systems


# 112569 24-Mar-2003 jake

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses are larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


# 111363 23-Feb-2003 jake

- Added macros NPGPTD, NBPTD, and NPDEPTD, for dealing with the size of the
page directory.
- Use these instead of the magic constants 1 or PAGE_SIZE where appropriate.
There are still numerous assumptions that the page directory is exactly
1 page.

Sponsored by: DARPA, Network Associates Laboratories
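
A sketch of how such size macros relate (values shown for the PAE
configuration; treat the exact definitions as illustrative):

    /* Number of pages making up the page directory (4 under PAE). */
    #define NPGPTD          4
    /* Bytes spanned by the page directory. */
    #define NBPTD           (NPGPTD << PAGE_SHIFT)
    /* Number of page-directory entries in the page directory. */
    #define NPDEPTD         (NBPTD / sizeof(pd_entry_t))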


# 111028 17-Feb-2003 jeff

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much of the
KSE code.

Submitted by: davidxu


# 110190 01-Feb-2003 julian

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 109877 26-Jan-2003 davidxu

Move the UPCALL related data structures out of kse and introduce a new
data structure called kse_upcall to manage UPCALLs. All KSE binding
and loaning code is gone.

A thread that owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and take those contexts back
to userland. Any thread without an upcall structure has to export its
contexts and exit at the user boundary.

Any thread running in user mode owns an upcall structure. When it enters
the kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in the kernel, a new UPCALL thread is created
and the upcall structure is transferred to the new UPCALL thread. If the
kse mailbox's current thread pointer is NULL, then when a thread is
blocked in the kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit; when all upcalls in a ksegrp are removed, the group is
automatically shut down. An upcall owner thread also exits when the
process is in the exiting state. When an owner thread exits, the upcall
it owns is also removed.

KSE is a pure scheduler entity; it represents a virtual cpu. When a thread
is running, it always has a KSE associated with it. The scheduler is free
to assign a KSE to a thread according to thread priority; if the thread
priority changes, the KSE can be moved from one thread to another.

When a ksegrp is created, there are always N KSEs created in the group,
where N is the number of physical cpus in the current system. This makes
it possible that, even if a userland UTS is only single-CPU safe, threads
in the kernel can still execute on different cpus in parallel. Userland
calls kse_create to add more upcall structures into a ksegrp to increase
concurrency in userland itself; the kernel is not restricted by the number
of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 109342 15-Jan-2003 dillon

Merge all the various copies of vm_fault_quick() into a single
portable copy.


# 109340 15-Jan-2003 dillon

Merge all the various copies of vmapbuf() and vunmapbuf() into a single
portable copy. Note that pmap_extract() must be used instead of
pmap_kextract().

This is precursor work to a reorganization of vmapbuf() to close remaining
user/kernel races (which can lead to a panic).


# 107719 10-Dec-2002 julian

Unbreak the KSE code. Keep track of zombie threads using the per-CPU
storage during the context switch. Rearrange thread cleanups
to avoid problems with Giant. Clean threads when freed or
when recycled.

Approved by: re (jhb)


# 107212 24-Nov-2002 alc

Add page queues locking to vunmapbuf(); reduce differences with respect
to the sparc64 implementation. (Note: With modest effort on the alpha and
ia64 this function could migrate to the MI part of the kernel.)

Approved by: re (blanket)


# 107180 22-Nov-2002 mux

Under certain circumstances, we were calling kmem_free() from
i386 cpu_thread_exit(). This resulted in a panic with WITNESS
since we need to hold Giant to call kmem_free(), and we weren't
holding it anymore in cpu_thread_exit(). We now do this from a
new MD function, cpu_thread_dtor(), called by thread_dtor().

Approved by: re@
Suggested by: jhb


# 104695 09-Oct-2002 julian

Round out the facilty for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
the owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is conceivable that the
borrower should inherit the priority of the owner too;
that's another discussion and would be simple to do.

Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot more info for threaded processes, though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 103407 16-Sep-2002 mini

Add kernel support needed for the KSE-aware libpthread:
- Maintain fpu state across signals.
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Save and restore FPU state properly in ucontext_t's.

Reviewed by: deischen, julian
Approved by: -arch


# 103049 06-Sep-2002 peter

Zap the implementations of the i386-aout specific cpu_coredump function.
Most of the non-i386 platforms had rather broken implementations anyway.


# 101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to clarify which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
- pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 98765 24-Jun-2002 jake

Add an MD callout like cpu_exit, but which is called after sched_lock is
obtained, when all other scheduling activity is suspended. This is needed
on sparc64 to deactivate the vmspace of the exiting process on all cpus.
Otherwise if another unrelated process gets the exact same vmspace structure
allocated to it (same address), its address space will not be activated
properly. This seems to fix some spontaneous signal 11 problems with smp
on sparc64.


# 93264 27-Mar-2002 dillon

Compromise for critical*()/cpu_critical*() recommit. Cleanup the interrupt
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).

Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.

This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.

Reviewed by: core
Approved by: core


# 92890 21-Mar-2002 alc

o Use the MI vm_map_growstack() instead of grow_stack() in trap_pfault()
and trapwrite().
o On i386/pc98, remove the (now) unused grow_stack().


# 92860 21-Mar-2002 imp

Fix abuses of cpu_critical_{enter,exit} by converting to
intr_{disable,restore} as well as providing an implementation of
intr_{disable,restore}.

Reviewed by: jake, rwatson, jhb


# 92770 20-Mar-2002 alfred

Remove __P.


# 91328 26-Feb-2002 dillon

revert last commit temporarily due to whining on the lists.


# 91315 26-Feb-2002 dillon

STAGE-1 of 3 commit - allow (but do not require) interrupts to remain
enabled in critical sections and streamline critical_enter() and
critical_exit().

This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not affected by this change.

This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.

This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.

The following changes have been made:

* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.

Other architectures are unaffected.

* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
in regards to hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather than hardcoded
in MI code.

The new code places the burden of entering the critical section
in the trampoline code where it belongs.

* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with the fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.

* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.

* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.

* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.

* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code in regards to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.

* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Being per-cpu variables,
it is not currently necessary to lock the bus cycles modifying them.

Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.

* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.

Performance issues:

One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly making the effective savings of the new
critical section code's performance very significant.

The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather than defer.

The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API which currently has serious cross pollution
in MI code (in fork_exit() and ast() for example).

The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().

Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.


# 90562 12-Feb-2002 alc

Remove an unused (but initialized) variable from vmapbuf().


# 90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
This is a low-functionality change that changes the kernel to access the
main thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embedded there,
but remove all the code that assumes that, in preparation for the next
commit which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 88146 18-Dec-2001 julian

In a couple of places, we recalculated addresses we already had in local
pointer variables.


# 88088 17-Dec-2001 jhb

Modify the critical section API as follows:
- The MD functions critical_enter/exit are renamed to start with a cpu_
prefix.
- MI wrapper functions critical_enter/exit maintain a per-thread nesting
count and a per-thread critical section saved state set when entering
a critical section while at nesting level 0 and restored when exiting
to nesting level 0. This moves the saved state out of spin mutexes so
that interlocking spin mutexes works properly.
- Most low-level MD code that used critical_enter/exit now use
cpu_critical_enter/exit. MI code such as device drivers and spin
mutexes use the MI wrappers. Note that since the MI wrappers store
the state in the current thread, they do not have any return values or
arguments.
- mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is
assigned to curthread->td_savecrit during fork_exit().

Tested on: i386, alpha


# 87702 11-Dec-2001 jhb

Overhaul the per-CPU support a bit:

- The MI portions of struct globaldata have been consolidated into a MI
struct pcpu. The MD per-CPU data are specified via a macro defined in
machine/pcpu.h. A macro was chosen over a struct mdpcpu so that the
interface would be cleaner (PCPU_GET(my_md_field) vs.
PCPU_GET(md.md_my_md_field)).
- All references to globaldata are changed to pcpu instead. In a UP kernel,
this data was stored as global variables which is where the original name
came from. In an SMP world this data is per-CPU and ideally private to each
CPU outside of the context of debuggers. This also included combining
machine/globaldata.h and machine/globals.h into machine/pcpu.h.
- The pointer to the thread using the FPU on i386 was renamed from
npxthread to fpcurthread to be identical with other architectures.
- Make the show pcpu ddb command MI with a MD callout to display MD
fields.
- The globaldata_register() function was renamed to pcpu_init() and now
initializes the MI fields of a struct pcpu in addition to registering it with
the internal array and list.
- A pcpu_destroy() function was added to remove a struct pcpu from the
internal array and list.

Tested on: alpha, i386
Reviewed by: peter, jake
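
As a rough usage illustration (not part of the commit), MI and MD per-CPU
fields are reached through the same accessors; fpcurthread is the MI field
mentioned above, while currentldt and new_ldt_sel stand in for an MD field
declared via the machine/pcpu.h macro:

    struct thread *fputd;

    /* Read an MI per-CPU field (the FPU owner, formerly npxthread). */
    fputd = PCPU_GET(fpcurthread);

    /* Write an assumed MD per-CPU field declared in machine/pcpu.h. */
    PCPU_SET(currentldt, new_ldt_sel);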


# 85449 24-Oct-2001 jhb

Split the per-process Local Descriptor Table out of the PCB and into
struct mdproc.

Submitted by: Andrew R. Reiter <arr@watson.org>
Silence on: -current


# 84935 14-Oct-2001 tegge

Change vmapbuf() to use pmap_qenter() and vunmapbuf() to use pmap_qremove().

This significantly reduces the number of TLB shootdowns caused by
vmapbuf/vunmapbuf when performing many large reads from raw disk devices.

Reviewed by: dillon
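
Roughly, the pattern this moves to (a sketch, not the actual diff; kva,
pages and npages are illustrative names):

    /* Map npages wired pages contiguously at kva in a single call ... */
    pmap_qenter(kva, pages, npages);

    /* ... perform the raw I/O through the kva mapping ... */

    /* ... then remove the temporary mappings as one ranged operation,
       instead of paying a TLB shootdown per page. */
    pmap_qremove(kva, npages);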


# 84381 02-Oct-2001 mjacob

Fix problem where a user buffer outside of the area being tested
will be corrupted.

PR: 29194
Obtained from: Tor.Egge@fast.no
MFC after: 2 weeks


# 83660 19-Sep-2001 peter

Reserve an extra 16 bytes in case we have to grow the trapframe into
a vm86trapframe for switching to vm86 [unlikely] while exiting.
I lost this when doing the pcb move that went in with the KSE commit.

Reviewed by: jake


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doozy!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 83276 10-Sep-2001 peter

Rip some well duplicated code out of cpu_wait() and cpu_exit() and move
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.

XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.

Reviewed by: jake, tmm, dillon


# 83223 08-Sep-2001 peter

Missing part of dillon's coredump commit. cpu_coredump() was still
passing IO_NODELOCKED to vn_rdwr(), this would cause operations on the
unlocked core vnode and softupdates nastiness if an a.out binary cored.


# 82938 04-Sep-2001 peter

Nuke #if 0'ed "setredzone()" stub. We never used it, and probably
never will. I've implemented an optional redzone as part of the KSE
upage breakup.


# 82309 25-Aug-2001 peter

Optionize UPAGES for the i386. As part of this I split some of the low
level implementation stuff out of machine/globaldata.h to avoid exposing
UPAGES to lots more places. The end result is that we can double
the kernel stack size with 'options UPAGES=4' etc.

This is mainly being done for the benefit of a MFC to RELENG_4 at some
point. -current doesn't really need this so much since each interrupt
runs on its own kstack.


# 79609 12-Jul-2001 peter

Activate SSE/SIMD. This is the extra context switching support that
we are required to do if we let user processes use the extra 128 bit
registers etc.

This is the base part of the diff I got from:
http://www.issei.org/issei/FreeBSD/sse.html
I believe this is by: Mr. SUZUKI Issei <issei@issei.org>
SMP support apparently by: Takekazu KATO <kato@chino.it.okayama-u.ac.jp>
Test code by: NAKAMURA Kazushi <kaz@kobe1995.net>, see
http://kobe1995.net/~kaz/FreeBSD/SSE.en.html

I have fixed a couple of style(9) deviations. I have some followup
commits to fix a couple of non-style things.


# 79265 04-Jul-2001 dillon

Move vm_page_zero_idle() from machine-dependent sections to a
machine-independent source file, vm/vm_zeroidle.c. It was exactly the
same for all platforms and updating them all was getting annoying.


# 79263 04-Jul-2001 dillon

Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc).
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.

vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and acquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 79123 03-Jul-2001 jhb

Allow Giant to be recursed when a process terminates.


# 78962 29-Jun-2001 jhb

Add a new MI pointer to the process' trapframe p_frame instead of using
various differently named pointers buried under p_md.

Reviewed by: jake (in principle)


# 77486 30-May-2001 jhb

We can't grab the sched_lock in set_user_ldt() because when it is called
from cpu_switch(), curproc has been changed, but the sched_lock owner will
not be updated until we return to mi_switch(), thus we deadlock against
ourselves. As a workaround, push the acquire and release of sched_lock out
to the callers of set_user_ldt(). Note that we can't use a mtx_assert() in
set_user_ldt for the same reason.

Sleuthing by: tmm
Tested by: tmm, dougb


# 76947 21-May-2001 jhb

Remove a few more spl's I missed earlier.

Reported by: Michael Harnois <mdharnois@home.com>
Pointy hat: me


# 76939 21-May-2001 jhb

Axe unneeded spl()'s.


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76546 13-May-2001 bde

Use a critical region to protect pushing of the parent's npx state to the
pcb for fork(). It was possible for the state to be saved twice when an
interrupt handler saved it concurrently. This corrupted (reset) the state
because fnsave has the (in)convenient side effect of doing an implicit
fninit. Mundane null pointer bugs were not possible, because we save to
an "arbitrary" process's pcb and not to the "right" place (npxproc).

Push the parent's %gs to the pcb for fork(). Changes to %gs before
fork() were not preserved in the child unless an accidental context
switch did the pushing. Updated the list of pcb contents which is
supposed to inhibit bugs like this. pcb_dr*, pcb_gs and pcb_ext were
missing. Copying is correct for pcb_dr*, and pcb_ext is already
handled specially (although XXX'ly).

Reducing the savectx() call to an npxsave() call in rev.1.80 was a
mistake. The above bugs are duplicated in many places, including in
savectx() itself.

The arbitrariness of the parent process pointer for the fork()
subroutines, the pcb pointer for savectx(), and the save87 pointer
for npxsave(), is illusory. These functions don't work "right" unless
the pointers are precisely curproc, curpcb, and the address of npxproc's
save87 area, respectively, although the special context in which they
are called allows savectx(&dumppcb) to sort of work and npxsave(&dummy)
to work. cpu_fork() just doesn't work unless the parent process
pointer is curproc, or the caller has pushed %gs to the pcb, or %gs
happens to already be in the pcb.


# 76434 10-May-2001 jhb

- Use sched_lock and critical regions to ensure that LDT updates are thread
safe from preemption and concurrent access to the LDT.
- Move the prototype for i386_extend_pcb() to <machine/pcb_ext.h>.

Reviewed by: silence on -hackers


# 76078 27-Apr-2001 jhb

Overhaul of the SMP code. Several portions of the SMP kernel support have
been made machine independent and various other adjustments have been made
to support Alpha SMP.

- It splits the per-process portions of hardclock() and statclock() off
into hardclock_process() and statclock_process() respectively. hardclock()
and statclock() call the *_process() functions for the current process so
that UP systems will run as before. For SMP systems, it is simply necessary
to ensure that all other processors execute the *_process() functions when the
main clock functions are triggered on one CPU by an interrupt. For the alpha
4100, clock interrupts are delivered in a staggered broadcast fashion, so
we simply call hardclock/statclock on the boot CPU and call the *_process()
functions on the secondaries. For x86, we call statclock and hardclock as
usual and then call forward_hardclock/statclock in the MD code to send an IPI
to cause the AP's to execute forward_hardclock/statclock which then call the
*_process() functions.
- forward_signal() and forward_roundrobin() have been reworked to be MI and to
involve less hackery. Now the cpu doing the forward sets any flags, etc. and
sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically
return so that they can execute ast() and don't bother with setting the
astpending or needresched flags themselves. This also removes the loop in
forward_signal() as sched_lock closes the race condition that the loop worked
around.
- need_resched(), resched_wanted() and clear_resched() have been changed to take
a process to act on rather than assuming curproc so that they can be used to
implement forward_roundrobin() as described above.
- Various other SMP variables have been moved to a MI subr_smp.c and a new
header sys/smp.h declares MI SMP variables and API's. The IPI API's from
machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h.
- The globaldata_register() and globaldata_find() functions as well as the
SLIST of globaldata structures have become MI and moved into subr_smp.c.
Also, the globaldata list is only available if SMP support is compiled in.

Reviewed by: jake, peter
Looked over by: eivind


# 73922 07-Mar-2001 jhb

Use the proc lock to protect p_pptr when waking up our parent in cpu_exit()
and remove the mpfixme() message that is now fixed.


# 72930 22-Feb-2001 peter

Activate USER_LDT by default. The new thread libraries are going to
depend on this. The linux ABI emulator tries to use it for some linux
binaries too. VM86 had a bigger cost than this and it was made default
a while ago.

Reviewed by: jhb, imp


# 72746 20-Feb-2001 jhb

- Don't call clear_resched() in userret(), instead, clear the resched flag
in mi_switch() just before calling cpu_switch() so that the first switch
after a resched request will satisfy the request.
- While I'm at it, move a few things into mi_switch() and out of
cpu_switch(), specifically set the p_oncpu and p_lastcpu members of
proc in mi_switch(), and handle the sched_lock state change across a
context switch in mi_switch().
- Since cpu_switch() no longer handles the sched_lock state change, we
have to setup an initial state for sched_lock in fork_exit() before we
release it.


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarly, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
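
For a concrete before/after feel (illustrative only; foo_mtx stands in for
whatever lock a caller holds, sched_lock is the spin lock named above):

    /* Old interface: */
    mtx_enter(&foo_mtx, MTX_DEF);
    mtx_exit(&foo_mtx, MTX_DEF);

    /* New interface, sleep (MTX_DEF) locks: */
    mtx_lock(&foo_mtx);
    mtx_unlock(&foo_mtx);

    /* New interface, spin (MTX_SPIN) locks: */
    mtx_lock_spin(&sched_lock);
    mtx_unlock_spin(&sched_lock);

    /* Flags, where still needed, go through the _flags wrappers: */
    mtx_lock_flags(&foo_mtx, MTX_QUIET);
    mtx_unlock_flags(&foo_mtx, MTX_QUIET);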


# 71785 29-Jan-2001 peter

Send "#if NISA > 0" to the bit-bucket and replace it with an option.
These were compile-time "is the isa code present?" tests and not
'how many isa busses' tests.


# 71528 24-Jan-2001 jhb

Setup the return values for a child process in the trapframe when we setup
the rest of the trapframe instead of doing it in fork_return().


# 71257 19-Jan-2001 peter

Use #ifdef DEV_NPX from opt_npx.h instead of #if NNPX > 0 from npx.h
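
Roughly, the resulting pattern (the #ifdef body is a placeholder):

    #include "opt_npx.h"            /* replaces #include "npx.h" */

    #ifdef DEV_NPX                  /* was: #if NNPX > 0 */
            /* FPU (npx) specific code */
    #endif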


# 70861 10-Jan-2001 jake

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other than curproc.


# 70317 23-Dec-2000 jake

Protect proc.p_pptr and proc.p_children/p_sibling with the
proctree_lock.

linprocfs not locked pending response from informal maintainer.

Reviewed by: jhb, -smp@


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>
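
The conversion pattern, with illustrative names (sc, M_DEVBUF):

    /* Before: */
    sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK);
    bzero(sc, sizeof(*sc));

    /* After: */
    sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK | M_ZERO);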


# 69779 08-Dec-2000 jake

Revert the previous change I made to cpu_switch. It doesn't help as
much as I thought it would and according to bde was a pessimization.


# 69536 02-Dec-2000 jake

Change cpu_switch to explicitly popl the callers program counter and
pushl that of the new process, rather than doing a movl (%esp) and
assuming that the stack has been set up right. This makes the initial
stack setup slightly more sane, and will make it easier to stick
an interrupted process onto the run queue without its knowing.


# 67551 25-Oct-2000 jhb

- Overhaul the software interrupt code to use interrupt threads for each
type of software interrupt. Roughly, what used to be a bit in spending
now maps to a swi thread. Each thread can have multiple handlers, just
like a hardware interrupt thread.
- Instead of using a bitmask of pending interrupts, we schedule the specific
software interrupt thread to run, so spending, NSWI, and the shandlers
array are no longer needed. We can now have an arbitrary number of
software interrupt threads. When you register a software interrupt
thread via sinthand_add(), you get back a struct intrhand that you pass
to sched_swi() when you wish to schedule your swi thread to run.
- Convert the name of 'struct intrec' to 'struct intrhand' as it is a bit
more intuitive. Also, prefix all the members of struct intrhand with
'ih_'.
- Make swi_net() a MI function since there is now no point in it being
MD.

Submitted by: cp


# 67477 23-Oct-2000 jhb

Don't dink with interrupts in vm_page_zero_idle(). This code assumed it
was being called with interrupts disabled, when it was actually being called
with them enabled.

Pointed out by: tegge


# 67360 20-Oct-2000 jhb

- machine/mutex.h -> sys/mutex.h
- Use cpu_throw() instead of cpu_switch() during cpu_exit() since we don't
need to save our previous state.


# 67164 15-Oct-2000 phk

Remove unneeded #include <machine/clock.h>


# 65557 06-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 64529 11-Aug-2000 peter

Clean up some low level bootstrap code:

- stop using the evil 'struct trapframe' argument for mi_startup()
(formerly main()). There are much better ways of doing it.
- do not use prepare_usermode() - setregs() in execve() will do it
all for us as long as the p_md.md_regs pointer is set. (which is
now done in machdep.c rather than init_main.c. The Alpha port did it
this way all along and is much cleaner).
- collect all the magic %cr0 etc register settings into one place and
have the AP's call that instead of using magic numbers (!!) that keep
changing over and over again.
- Make it safe to call kthread_create() earlier, including during the
device probe sequence. It doesn't need the callback mechanism that
NetBSD's version uses.
- kthreads created this way are root-less as they exist before the root
filesystem is mounted. init(1) is set up so that it acquires the root
pointers prior to running. If other kthreads want filesystem access
we can make this code more generic.
- set all threads start times once we have decided what time it is.
- init uses a trampoline rather than the evil prepare_usermode() hack.
- kern_descrip.c has a couple of tweaks to deal with forking when there
is no rootdir or cwd etc.
- adjust the early SYSINIT() sequence so that a few prerequisites are in
place. eg: make sure the run queue is initialized before doing forks.

With this, the USB code can easily create a kthread to do the device
tree discovery. (I have tested it, it works nicely).

There are still some open issues before this is truly useful.
- tsleep() does not like working before the clock is running. It
sort-of tries to spin wait, but it can do more useful things now.
- stopping a kthread in kld code at unload time is "interesting" but
we have a solution for that.

The Alpha code needs no changes for this. It already uses pretty much the
same strategies, but a little cleaner.


# 61469 10-Jun-2000 peter

Add option BROKEN_KEYBOARD_RESET to an opt_*.h file


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include, according to bde's teachings on the
subject of nested includes.

Disk drivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.h> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter
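
So a consumer that still needs the buffer cache now includes both headers
explicitly, e.g. (illustrative):

    #include <sys/bio.h>    /* prerequisite, no longer nested */
    #include <sys/buf.h>    /* only if buffer-cache access is needed */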


# 58717 28-Mar-2000 dillon

Commit major SMP cleanups and move the BGL (big giant lock) in the
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).

A great deal of conditional SMP code for various dead-ended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.

Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.

This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.

This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.

Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch


# 58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch was machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.
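
A sketch of the new convention (bp stands for a struct buf pointer; the
BIO_READ/BIO_WRITE command names are assumed here):

    if (bp->b_iocmd == BIO_READ) {
            /* read path */
    } else {
            /* write path: exactly one command bit is set */
    }

    /* B_CALL is gone; test the completion handler directly: */
    if (bp->b_iodone != NULL)
            (*bp->b_iodone)(bp);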


# 57362 20-Feb-2000 bsd

Don't forget to reset the hardware debug registers when a process that
was using them exits.

Don't allow a user process to cause the kernel to take a TRCTRAP on a
user space address.

Reviewed by: jlemon, sef
Approved by: jkh


# 54188 06-Dec-1999 luoqi

User ldt sharing.


# 52647 30-Oct-1999 alc

The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.

Reviewed by: dillon


# 52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial, bordering-on-silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 52121 11-Oct-1999 peter

Zap unneeded #includes

Submitted by: phk


# 51474 20-Sep-1999 dillon

Fix bug in pipe code relating to writes of mmap'd but illegal address
spaces which cross a segment boundary in the page table. pmap_kextract()
is not designed for access to the user space portion of the page
table and cannot handle the null-page-directory-entry case.

The fix is to have vm_fault_quick() return a success or failure which
is then used to avoid calling pmap_kextract().


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49326 31-Jul-1999 alc

Change the type of vpgqueues::lcnt from "int *" to "int". The indirection
served no purpose.


# 48974 22-Jul-1999 alc

Reduce the number of "magic constants" used for page coloring
by one: PQ_PRIME2 and PQ_PRIME3 are used to accomplish the same
thing at different places in the kernel. Drop PQ_PRIME3.


# 48391 01-Jul-1999 peter

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


# 47678 01-Jun-1999 jlemon

Unifdef VM86.

Reviewed by: silence on -current


# 45821 19-Apr-1999 peter

unifdef -DVM_STACK - it's been on for a while for x86 and was checked
and appeared to be working for the Alpha some time ago.


# 44146 19-Feb-1999 luoqi

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillon <dillon@apollo.backplane.com>


# 44078 16-Feb-1999 dfr

* Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)


# 43758 08-Feb-1999 dillon

Adjust idle zero-page fill hysteresis based on tests. Use 2/3 and 4/5
zero-fill levels.

Adjust comment for ozfod in vmmeter.h - this counter represents
non-optimal ( on the fly ) zero fills, not prefills.


# 43752 07-Feb-1999 dillon

Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with
PQ_FREE. There is little operational difference other than the kernel
being a few kilobytes smaller and the code being more readable.

* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)

PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guaranteed to remain
tracked throughout a page's life ( if it isn't already ).

When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.

Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.


# 43387 29-Jan-1999 dillon

More -Wall / -Wcast-qual cleanup. Also, EXEC_SET can't use
C_DECLARE_MODULE due to the linker_file_sysinit() function
making modifications to the data.


# 42360 06-Jan-1999 julian

Add (but don't activate) code for a special VM option to make
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.

The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.

Submitted by: Richard Seaman <dick@tar.com>


# 41868 16-Dec-1998 bde

Removed bogus casts of USRSTACK and/or the other operand in binary
expressions involving USRSTACK.


# 40794 31-Oct-1998 peter

Add John Dyson's SYSCTL descriptions, and an export of more stats to
a sysctl hierarchy (vm.stats.*). SYSCTL descriptions are only present
in source; they are not compiled into the binaries, so they take up no memory.
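
The shape of such an export, as a hedged example (the node, variable and
description text are illustrative, not the actual additions):

    SYSCTL_INT(_vm_stats_misc, OID_AUTO, zero_page_count, CTLFLAG_RD,
        &zero_page_count, 0,
        "Number of pre-zeroed pages kept on the free queue");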


# 40286 13-Oct-1998 dg

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# 39703 28-Sep-1998 tegge

Initialize pcb_mpnest to 1 in the child process in cpu_fork(). This should
fix the 50% idle problem that the ELF /sbin/init triggered. The problem
appeared when the last context switch before a fork() call was due to
the kernel faulting in user pages via normal page faults (e.g. copyin).
Reviewed by: Peter Wemm <peter@netplex.com.au>


# 39648 25-Sep-1998 peter

Goodbye BOUNCE_BUFFERS, for a hack it has served us well.

The last consumer of this code (the old SCSI system) has left us and
the CAM code does its own bouncing. The isa dma system has been
doing its own bouncing for a while too.

Reviewed by: core


# 38422 18-Aug-1998 msmith

Presently there is only one `currentldt' variable for all cpus
in a SMP system. Unexpected things could happen if each cpu
has a different ldt setting and one cpu tries to use the value
of currentldt set by another cpu.

The fix is to move currentldt to the per-cpu area. It includes
patches I filed in PR i386/6219 which are also user ldt related.

PR: i386/7591, i386/6219
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>


# 36168 18-May-1998 tegge

Disallow reading the current kernel stack. Only the user structure and
the current registers should be accessible.
Reviewed by: David Greenman <dg@root.com>


# 36135 17-May-1998 tegge

Add forwarding of roundrobin to other cpus. This gives a more regular
update of cpu usage as shown by top when one process is cpu bound
(no system calls) while the system is otherwise idle (except for top).

Don't attempt to switch to the BSP in boot(). If the system was idle when
an interrupt caused a panic, this won't work. Instead, switch to the BSP
in cpu_reset.

Remove some spurious forward_statclock/forward_hardclock warnings.


# 36095 16-May-1998 kato

Some newer PC-98 machines may cause a "Windows Protection Fault" when booting
Windows 95 after rebooting FreeBSD without power off. In PC-98
system, reboot mode is set via I/O port 0x37 in cpu_reset(), and
accessing this port is the reason for the problem. To avoid the
fault, the current status of the reboot mode should be checked before
accessing the I/O port.


# 34840 23-Mar-1998 jlemon

Add the ability to make real-mode BIOS calls from the kernel. Currently,
everything is contained inside #ifdef VM86, so this option must be
present in the config file to use this functionality.

Thanks to Tor Egge, these changes should work on SMP machines. However,
it may not be thoroughly SMP-safe.

Currently, the only BIOS calls made are memory-sizing routines at bootup,
these replace reading the RTC values.


# 34643 17-Mar-1998 kato

Make EPSON_BOUNCEDMA a new-style option.


# 34569 14-Mar-1998 tegge

Don't use the standard macros for disabling/enabling interrupts.
On SMP systems, this left the mpintr_lock simplelock locked, causing
further calls to disable_intr to deadlock or panic.


# 34507 12-Mar-1998 bde

Fixed breakage of the !SMP case in vm_page_zero_idle() in the
previous commit. Opportunities to clean pages were often missed,
and leaving of the idle state was sometimes delayed until the next
interrupt (after any that occurred while cleaning).

Fixed an unstaticization, a syntax error and a style bug in the
previous commit.


# 33817 25-Feb-1998 dyson

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


# 33307 13-Feb-1998 bde

Ifdefed some npx code. npx should be optional again.


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32884 30-Jan-1998 dyson

Make the bounce buffer code a little more robust when space isn't
available. If there isn't bounce space available, the bounce code
is disabled. This will allow most large systems to run properly
when the bounce space is mistakenly allocated above 16MB.


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, clean up, and slightly speed up the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 32617 19-Jan-1998 tegge

The removal of a page from the free queue in vm_page_zero_idle was
incomplete. Also set m->queue, in order to prevent vm_page_select_free
from selecting the page being zeroed.


# 32516 15-Jan-1998 gibbs

Implementation of Bus DMA for FreeBSD-x86. This is sufficient to do
page level bounce buffering, but there are still some issues left to
address.


# 32010 27-Dec-1997 peter

#include "opt_user_ldt.h" so that the #ifdef USER_LDT checks can work, as
commented about at length in the PR audit trail.

PR: 2412


# 31322 20-Nov-1997 bde

Removed a duplicate (sloppy common-style) definition.

Fixed some style bugs.


# 31249 18-Nov-1997 bde

Don't #include <machine/smp.h> even in the SMP case. Fixed the one
place that depended on it. The "bazillion warnings" mentioned in the
log for rev.1.45 apparently aren't a problem any more. It is hard
to be sure because the SIMPLELOCK_DEBUG option turns off (and breaks)
things in the SMP case.


# 30265 10-Oct-1997 peter

Convert the VM86 option from a global option to an option only depended
on by the files that use it. Changing the VM86 option now only causes
a recompile of a dozen files or so rather than the entire kernel.


# 29330 13-Sep-1997 joerg

Revert the logic behind my last change, and use a function called
`is_physical_memory()' now for the decision whether to dump some
region of memory or not.

Suggested by: davidg


# 29280 10-Sep-1997 joerg

Do not ever try to coredump adapter memory regions.

PR: 4486
Submitted by: tegge@idi.ntnu.no (Tor Egge)

Implement a function is_adapter_memory() in order to determine what
should nto be dumped at all. Currently, only populated with the ``ISA
memory hole''. Adapter regions of other busses should be added.


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 28808 26-Aug-1997 peter

Clean up the SMP AP bootstrap and eliminate the wretched idle procs.

- We now have enough per-cpu idle context, the real idle loop has been
revived (cpu's halt now with nothing to do).
- Some preliminary support for running some operations outside the
global lock (eg: zeroing "free but not yet zeroed pages") is present
but appears to cause problems. Off by default.
- the smp_active sysctl now behaves differently. It's merely a 'true/false'
option. Setting smp_active to zero causes the AP's to halt in the idle
loop and stop scheduling processes.
- bootstrap is a lot safer. Instead of sharing a statically compiled in
stack a number of times (which has caused lots of problems) and then
abandoning it, we use the idle context to boot the AP's directly. This
should help >2 cpu support since the bootlock stuff was in doubt.
- print physical apic id in traps.. helps identify private pages getting
out of sync. (You don't want to know how much hair I tore out with this!)

More cleanup to follow, this is more of a checkpoint than a
'finished' thing.


# 27993 08-Aug-1997 dyson

VM86 kernel support.
Work done by BSDI, Jonathan Lemon <jlemon@americantv.com>,
Mike Smith <msmith@gsoft.com.au>, Sean Eric Fagan <sef@kithrup.com>,
and probably a lot of others.
Submitted by: Jonathan Lemon <jlemon@americantv.com>


# 27535 20-Jul-1997 bde

Removed unused #includes.


# 26954 26-Jun-1997 tegge

Back out a bad commit.


# 26943 25-Jun-1997 tegge

Block some interrupts during the call to pmap_zero_page in
vm_page_zero_idle. This fixes some occurrences of the problem
reported in PR kern/3216: "panic: pmap_zero_page: CMAP busy"


# 26812 22-Jun-1997 peter

Preliminary support for per-cpu data pages.

This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])

There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.

This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.

Reviewed by: fsmp, dyson


# 25557 07-May-1997 peter

clean up forked child creation. This is simplified also by having
md_regs being struct trapframe *. Do a npxsave() if needed and copy the
pcb rather than use the increasingly defunct savectx(). Copy %edi and
%ebp explicitly.

Submitted by: bde

XXX npxproc could be declared in npx.h so the externs with smp fruit
are not needed.


# 24980 16-Apr-1997 kato

Use reset port before clearing page table in cpu_reset if PC98 is
defined. Clearing page table could hang some new PC-98.


# 24691 07-Apr-1997 peter

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler than switching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplify the pmap code that manages process contexts; we no longer have to
double map the UPAGES, which simplifies and should measurably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


# 24361 29-Mar-1997 bde

Don't keep cpu interrupts enabled during the lookup in vm_page_zero_idle().
Lookup isn't done every time the system goes idle now, but it can still
take > 1800 instructions in the worst case, so if cpu interrupts are kept
disabled then it might lose 20 characters of sio input at 115200 bps.

Fixed style in vm_page_zero_idle().


# 24099 22-Mar-1997 dyson

Decrease the latency/overhead in the prezero code when there is
an adequate number of prezeroed pages.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 21181 01-Jan-1997 se

Add code to copy the LDT, if required.

This code was sent to me by Bruce Evans, and seems to fix some
possible kernel panic in case of an execution error. It did not
cause any problems on my system, but I never observed the
problem this patch is supposed to fix, anyway.

This patch is a NOP, unless the kernel is built with "options
USER_LDT", and doesn't affect the GENERIC kernel for this reason.

I want to have it in 2.2: it fixes a bug ...

Submitted by: bde


# 19269 30-Oct-1996 asami

More merge and update.

(1) deleted #if 0

pc98/pc98/mse.c

(2) hold per-unit I/O ports in ed_softc

pc98/pc98/if_ed.c
pc98/pc98/if_ed98.h

(3) merge more files by segregating changes into headers.

new file (moved from pc98/pc98):

i386/isa/aic_98.h

deleted:

well, it's already in the commit message so I won't repeat the
long list here ;)

Submitted by: The FreeBSD(98) Development Team


# 18937 15-Oct-1996 dyson

Move much of the machine dependent code from vm_glue.c into
pmap.c. Along with the improved organization, small proc fork
performance is now about 5%-10% faster.


# 18548 28-Sep-1996 dyson

Essentially rename pmap_update to be invltlb. It is a very machine
dependent operation, and not really a correct name. invltlb and invlpg
are more descriptive, and in the case of invlpg, a real opcode.

Additionally, fix the tlb management code for 386 machines.


# 18169 08-Sep-1996 dyson

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 17108 12-Jul-1996 bde

Don't use NULL in non-pointer contexts.


# 16532 20-Jun-1996 dg

Properly account for non-page aligned buffers.


# 16530 19-Jun-1996 dg

Minor KNF formatting change to vmapbuf() and vunmapbuf().


# 16499 19-Jun-1996 dyson

Clean up vmapbuf and vunmapbuf significantly. The previous code was
very rough.


# 15809 18-May-1996 dyson

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existent condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# 15543 02-May-1996 phk

removed:
CLBYTES PD_SHIFT PGSHIFT NBPG PGOFSET CLSIZELOG2 CLSIZE pdei()
ptei() kvtopte() ptetov() ispt() ptetoav() &c &c
new:
NPDEPG

Major macro cleanup.


# 15538 02-May-1996 phk

First pass at cleaning up macros relating to pages, clusters and all that.


# 15379 25-Apr-1996 phk

Fix cpu_fork for real.

Suggested by: bde


# 15301 18-Apr-1996 phk

Fix a bogon. cpu_fork & savectx expected cpu_switch to restore %eax,
they shouldn't.


# 15117 07-Apr-1996 bde

Removed never-used #includes of <machine/cpu.h>. Many were apparently
copied from bad examples.


# 14348 02-Mar-1996 jkh

USER_LDT changes for the Willows TwinXPDK toolkit. Only tested with WINE
since that's the only other USER_LDT using code that I know of.
Submitted by: Gary Jennejohn <Gary.Jennejohn@munich.netsurf.de>
Obtained from: {Origin of diffs may be someone else - I only rec'd them from
Gary}


# 13915 05-Feb-1996 dg

Unspam my changes in rev 1.54 that John spammed in rev 1.55.


# 13909 04-Feb-1996 dyson

Changed vm_fault_quick in vm_machdep.c to be global. Needed for
new pipe code.


# 13908 04-Feb-1996 dg

Rewrote cpu_fork so that it doesn't use pmap_activate, and removed
pmap_activate since it's not used anymore. Changed cpu_fork so that
it uses one line of inline assembly rather than calling mvesp() to
get the current stack pointer. Removed mvesp() since it is no longer
being used.


# 13740 30-Jan-1996 dg

savectx() strikes again: the saved stack pointer wasn't properly adjusted
to remove the return address. It's only the frame pointer and luck that
allowed the code to work at all.


# 13580 23-Jan-1996 dg

Simplified savectx() a little and fixed a bug that caused it to return
garbage in the child process rather than "1" like it is supposed to.

Reviewed by: bde


# 13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 13265 05-Jan-1996 wollman

Convert BOUNCE_BUFFERS and BOUNCEPAGES to new option scheme.


# 12819 14-Dec-1995 phk

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# 12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# 12722 10-Dec-1995 phk

Staticize and cleanup.
remove a TON of #includes from machdep.


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12417 20-Nov-1995 phk

Remove unused vars.


# 12357 18-Nov-1995 bde

Fixed the type of vm_fault_quick() - don't convert types back and forth
through bogus immediate types.

Added prototypes.


# 11116 01-Oct-1995 dg

Insert zeroed pages at the head of the zero queue rather than at the tail.
A measurable performance improvement results from the potential for the
page to be partially cached when it is eventually used.


# 10546 03-Sep-1995 dyson

Machine dependent routines to support pre-zeroed free pages. This
significantly improves demand zero performance.


# 9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


# 9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Its marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8590 18-May-1995 dg

Added "BROKEN_KEYBOARD_RESET" option to disable using the keyboard reset
in cpu_reset(). Some MBs don't deal with this properly.

Submitted by: Rod Grimes


# 8211 01-May-1995 dyson

Fixed a problem that could cause left-over pv_entries and, as a
side effect, removed some legacy code that was necessary
when we called vm_fault inside of vm_fault_quick instead of using
the kernel/user space byte move routines.


# 8074 26-Apr-1995 rgrimes

Add outb to keyboard controller to do a cpu_reset; this fixes 2 known
cases of motherboards that failed to reboot.
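
For context, the reset described here (and disabled by the
BROKEN_KEYBOARD_RESET option above) works by pulsing the CPU reset line
through the 8042 keyboard controller. The sketch below illustrates only
the technique; the helper and function names are stand-ins, not the
actual cpu_reset() code.

#include <stdint.h>

/*
 * Minimal port-output helper (i386 inline assembly).  In the kernel
 * this comes from <machine/cpufunc.h>; it is open-coded here so the
 * sketch stands alone.
 */
static inline void
outb_sketch(uint16_t port, uint8_t val)
{
        __asm__ __volatile__("outb %b0, %w1" : : "a" (val), "Nd" (port));
}

/*
 * Writing command 0xFE to the 8042 command port (0x64) pulses the CPU
 * reset line.  Boards covered by BROKEN_KEYBOARD_RESET skip this path.
 */
static void
keyboard_reset_sketch(void)
{
        outb_sketch(0x64, 0xFE);
        for (;;)
                ;       /* spin until the reset takes effect */
}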


# 7170 19-Mar-1995 dg

Removed redundant newlines that were in some panic strings.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 6817 01-Mar-1995 dg

Use su/fubyte instead of directly touching the user's address space.
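
The fetch/store primitives referred to here go through the kernel's
fault-handling path instead of dereferencing a user pointer directly.
A rough usage sketch follows; the wrapper is hypothetical and only
fubyte() itself is the real primitive.

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/systm.h>          /* fubyte() prototype */

/*
 * Illustrative wrapper (not the vm_machdep.c change itself): fubyte()
 * fetches one byte from a user address through the fault-handling
 * machinery and returns -1 if the address is unmapped, so a bad user
 * pointer never takes the kernel down.
 */
static int
peek_user_byte(const void *uaddr, u_char *valp)
{
        int b;

        b = fubyte(uaddr);
        if (b == -1)
                return (EFAULT);
        *valp = (u_char)b;
        return (0);
}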


# 6579 20-Feb-1995 dg

Use of vm_allocate() and vm_deallocate() has been deprecated.


# 5771 21-Jan-1995 bde

Don't use mi_switch() to terminate cpu_exit(). Calling it just happened to
work (mi_switch() counted the last timeslice again but this didn't affect
the exiting process' rusage because the rusage had already been finalized).

Remove stale comment.


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. They
represent the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it; it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 3436 08-Oct-1994 phk

db_disasm.c: Unused var zapped.
pmap.c: tons of unused vars zapped, various other warnings silenced.
trap.c: unused vars zapped.
vm_machdep.c: A wrong argument, which by chance did the right thing, was
corrected.


# 2455 02-Sep-1994 dg

Removed all vestiges of tlbflush(). Replaced them with calls to pmap_update().
Made pmap_update an inline assembly function.
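
On the i386, this kind of global TLB flush reduces to reloading %cr3,
so inlining it removes a function call from a very hot path.  A minimal
stand-alone sketch of the technique (not the pmap_update() of that era):

/*
 * Reloading %cr3 with its current value invalidates the (non-global)
 * TLB entries, which is all a global TLB flush needs on the i386.
 */
static inline void
tlb_flush_sketch(void)
{
        unsigned long cr3;

        __asm__ __volatile__("movl %%cr3, %0" : "=r" (cr3));
        __asm__ __volatile__("movl %0, %%cr3" : : "r" (cr3) : "memory");
}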


# 2422 31-Aug-1994 dg

Rather than exclude bounce buffers support with NOBOUNCE, include it
with BOUNCE_BUFFERS. This is more intuitive, and is better for future
multiplatform support. Added BOUNCE_BUFFERS option to the GENERIC and
LINT kernel config files.


# 1896 07-Aug-1994 dg

Made pmap_kenter "TLB safe". ...and then removed all the pmap_updates that
are no longer needed because of this.


# 1894 07-Aug-1994 dg

Don't kremove process VM pages (oops!). This was the cause of the instability
that was introduced last night.

Submitted by: John Dyson


# 1890 06-Aug-1994 dg

Fixed various prototype problems with the pmap functions and the subsequent
problems that fixing them caused.


# 1889 06-Aug-1994 dg

Incorporated 1.1.5 improvements to the bounce buffer code (i.e. make it
actually work), and additionally improved its performance via new pmap
routines and "pbuf" allocation policy.

Submitted by: John Dyson


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1415 25-Apr-1994 dg

From John Dyson:

Fixed physio in the 386 case - write faults weren't properly implemented.


# 1379 20-Apr-1994 dg

Bug fixes and performance improvements from John Dyson and myself:

1) check va before clearing the page clean flag. Not doing so was
causing the vnode pager error 5 messages when paging from
NFS. (pmap.c)
2) put back interrupt protection in idle_loop. Bruce didn't think
it was necessary, John insists that it is (and I agree). (swtch.s)
3) various improvements to the clustering code (vm_machdep.c). It's
now enabled/used by default.
4) bad disk blocks are now handled properly when doing clustered IOs.
(wd.c, vm_machdep.c)
5) bogus bad block handling fixed in wd.c.
6) algorithm improvements to the pageout/pagescan daemons. It's amazing
how well 4MB machines work now.


# 1362 14-Apr-1994 dg

Changes from John Dyson and myself:

1) Removed all instances of disable_intr()/enable_intr() and changed
them back to splimp/splx. The previous method was done to improve
the performance, but Bruce's recent changes to inline spl* have
made this unnecessary. (A usage sketch of the spl pattern follows this list.)
2) Cleaned up vm_machdep.c considerably. Probably fixed a few bugs, too.
3) Added a new mechanism for collecting page statistics - now done by
a new system process "pagescan". Previously this was done by the
pageout daemon, but this proved to be impractical.
4) Improved the page usage statistics gathering mechanism - performance is
much improved in small memory machines.
5) Modified mbuf.h to enable the support for an external free routine when
using mbuf clusters. Added appropriate glue in various places to
allow this to work.
6) Adapted a suggested change to the NFS code from Yuval Yurom to take
advantage of #5.
7) Added fault/swap statistics support.
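
Regarding item 1: the spl pattern brackets a critical section with a
priority raise and a restore, and once the spl routines were inlined it
cost little more than the cli/sti pair it replaced.  A sketch of the
usage pattern is below; the spl interface was retired in later FreeBSD,
so the prototypes are restated here and the shared counter is made up
for illustration.

int splimp(void);               /* raise IPL, return the previous level */
void splx(int s);               /* restore a saved IPL */

static int pending_events;      /* hypothetical counter shared with an ISR */

static void
bump_pending_sketch(void)
{
        int s;

        s = splimp();           /* block interrupts that also touch the counter */
        pending_events++;
        splx(s);                /* drop back to the previous level */
}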


# 1335 05-Apr-1994 dg

from John Dyson:

1) fixed some bugs related to the bounce buffer code
2) vnode pager now supports clustered pageouts
3) experimental code for clustering all I/O via a new "cldisksort"
4) added >16MB check to Bustek driver
5) made some experimental algorithmic changes to the pageout daemon
6) fixed bugs in truncating mapped files (esp when mapped via NFS)
7) reorganized vnode pager I/O code


# 1314 30-Mar-1994 dg

Eliminated the "physstrat" wart and merged it into kern_physio.c. This
patch also fixes a bug which causes a kernel VM leak.


# 1312 30-Mar-1994 dg

New routine "pmap_kenter", designed to take advantage of the special
case of the kernel pmap.
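
The special case being exploited is that the kernel pmap is always
resident and its page tables are preallocated, so a kernel-only mapping
can be entered by writing the PTE in place, with no allocation and no
pv_entry bookkeeping.  A rough sketch of the idea; the names follow
later FreeBSD i386 conventions and this is not the historical routine.

#include <sys/types.h>
#include <vm/vm.h>
#include <machine/pmap.h>       /* pt_entry_t, vtopte(), PG_V, PG_RW */
#include <machine/cpufunc.h>    /* invlpg() */

/*
 * Illustrative kernel-only mapping: kernel page tables are always
 * present, so the PTE can be written directly and the single stale
 * TLB entry invalidated.
 */
static void
kenter_sketch(vm_offset_t va, vm_paddr_t pa)
{
        pt_entry_t *pte;

        pte = vtopte(va);               /* kernel PTEs are permanently mapped */
        *pte = pa | PG_RW | PG_V;       /* valid + writable */
        invlpg(va);                     /* flush the stale TLB entry for va */
}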


# 1307 24-Mar-1994 dg

From John Dyson: performance improvements to the new bounce buffer
code.


# 1298 23-Mar-1994 dg

Bounce buffers. From John Dyson with help from me.


# 1288 21-Mar-1994 dg

Changed dynamic stack grow code to grow by "SGROWSIZ" amount. Initially
allocate SGROWSIZ amount of stack. Also set vm_ssize to the initial
stack VM size. Increased DFLSSIZ stack rlimit default to 8MB.


# 1246 07-Mar-1994 dg

1) "Pre-faulting" in of pages into process address space
Eliminates vm_fault overhead on process startup and
mmap referenced data for in-memory pages.

(process startup time using in-memory segments *much* faster)

2) Even more efficient pmap code. Code partially cleaned up.
More comments yet to follow.

(generally more efficient pte management)

3) Pageout clustering (in addition to the FreeBSD V1.1 pagein
clustering).

(much faster paging performance on non-write behind disk
subsystems, slightly faster performance on other systems.)

4) Slightly changed vm_pageout code for more efficiency and
better statistics. Also, resist swapout a little more.

(less likely to pageout a recently used page)

5) Slight improvement to the page table page trap efficiency.

(generally faster system VM fault performance)

6) Defer creation of unnamed anonymous regions pager until needed.

(speeds up shared memory bss creation)

7) Remove possible deadlock from swap_pager initialization.

8) Enhanced procfs to provide "vminfo" about vm objects and user
pmaps.

9) Increased MCLSHIFT/MCLBYTES from 2K to 4K to improve net &
socket performance and to prepare for things to come.

John Dyson
dyson@implode.root.com
David Greenman
davidg@root.com


# 1127 08-Feb-1994 dg

Fixed bugs in stack grow code, and moved it back into a separate function
like it was originally. Also added back call to "grow" in sendsig now
that this routine actually works.


# 991 21-Jan-1994 dg

Remove some old, unused, major UGLY code.


# 974 14-Jan-1994 dg

"New" VM system from John Dyson & myself. For a run-down of the
major changes, see the log of any affected file in the sys/vm
directory (swap_pager.c for instance).


# 879 18-Dec-1993 wollman

Make everything compile with -Wtraditional. Make it easier to distribute
a binary link-kit. Make all non-optional options (pagers, procfs) standard,
and update LINT to reflect new symtab requirements.

NB: -Wtraditional will henceforth be forgotten. This editing pass was
primarily intended to detect any constructions where the old code might
have been relying on traditional C semantics or syntax. These were all
fixed, and the result of fixing some of them means that -Wall is now a
realistic possibility within a few weeks.


# 798 24-Nov-1993 wollman

Make the LINT kernel compile with -W -Wreturn-type -Wcomment -Werror, and
add same (sans -Werror) to Makefile for future compilations.


# 608 15-Oct-1993 rgrimes

genassym.c:
Remove NKMEMCLUSTERS; it is no longer defined or used.

locores.s:
Fix comment on PTDpde and APTDpde to be pde instead of pte
Add new equation for calculating location of Sysmap
Remove Bill's old #ifdef garbage for counting up memory;
that stuff will never be made to work and was just cluttering
up the file.

Add code that places the PTD, page table pages, and kernel
stack below the 640k ISA hole if there is room for it, otherwise
put this stuff all at 1MB. This fixes the 28K bogusity in
the boot blocks, that can now go away!

Fix the calculation of where first is to be dependent on
NKPDE so that we can skip over the above-mentioned areas.
The 28K thing is now 44K in size due to the increase in
kernel virtual memory space, but since we no longer have
to worry about that this is no big deal.

Use if NNPX > 0 instead of ifdef NPX for floating point code.

machdep.c
Change the calculation for the buffer cache to be
20% of all memory above 2MB and add back the upper limit
of 2/5's of the VM_KMEM_SIZE so that we do not eat ALL
of the kernel memory space on large memory machines, note
that this will not even come into effect unless you have
more than 32MB. The current buffer cache limit is 6.7MB
due to this calculation.

It seems that we were erroneously allocating bufpages pages
for buffer_map. buffer_map is UNUSED in this implementation
of the buffer cache, but since the map is referenced in
several if statements a quick fix was to simply allocate
1 vm page (but no real memory) to it.

pmap.h
Remove rcsid, don't want them in the kernel files!

Removed some cruft inside an #ifdef DEBUGx that caused
compiler errors if you were compiling this for debug.

Use the #defines for PD_SHIFT and PG_SHIFT in place of
constants.

trap.c:
Remove patch kit header and rcsid, fix $Id$.
Now include "npx.h" and use NNPX for controlling the
floating point code.

Remove a now completely invalid check for a maximum virtual
address; the virtual address now ends at 0xFFFFFFFF so
there is no more MAX!! (Thanks David, I completely missed
that one!)

vm_machdep.c
Remove patch kit header and rcsid, fix $Id$.
Now include "npx.h" and use NNPX for controlling the
floating point code.

Replace several 0xFE00000 constants with KERNBASE


# 434 10-Sep-1993 nate

Removed volatile functions which were causing grief in the system, since
volatile functions are undefined, and there is no reason to have them
in our kernel.


# 433 10-Sep-1993 rgrimes

This is just to shut the compiler up
===================================================================
RCS file: /a/cvs/386BSD/src/sys/i386/i386/vm_machdep.c,v
retrieving revision 1.3
diff -c -r1.3 vm_machdep.c
*** 1.3 1993/07/27 10:52:21
--- vm_machdep.c 1993/09/10 20:12:53
***************
*** 179,184 ****
--- 179,186 ----
#endif
splclock();
swtch();
+ /*NOTREACHED*/
+ for(;;);
}

cpu_wait(p) struct proc *p; {


# 200 27-Jul-1993 dg

* Applied fixes from Bruce Evans to fix COW bugs, >1MB kernel loading,
profiling, and various protection checks that cause security holes
and system crashes.
* Changed min/max/bcmp/ffs/strlen to be static inline functions
- included from cpufunc.h via systm.h. This change
improves performance in many parts of the kernel - up to 5% in the
networking layer alone. Note that this requires systm.h to be included
in any file that uses these functions; otherwise the symbols won't be
found at link time. (A minimal sketch of such inline helpers follows
this entry.)
* Fixed incorrect call to splx() in if_is.c
* Fixed bogus variable assignment to splx() in if_ed.c
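
A minimal sketch of the kind of static inline helpers the second item
above refers to (illustrative only, not the cpufunc.h/systm.h text of
the time):

/*
 * Inline versions compile down to a compare and a conditional move or
 * branch at each call site, avoiding the call overhead of the old
 * out-of-line versions while keeping ordinary function semantics.
 */
static inline int
imin_sketch(int a, int b)
{
        return (a < b ? a : b);
}

static inline int
imax_sketch(int a, int b)
{
        return (a > b ? a : b);
}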


# 140 18-Jul-1993 paul

Added volatile void to cpu_exit() in the hope that it would
stop gcc's warning about the function returning.

It hasn't, but the declaration is still correct.
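
"volatile void" on a function declaration was the pre-ANSI gcc spelling
of "this function does not return"; later compilers express the same
intent with a noreturn attribute.  A small sketch of the two spellings
(the body is a placeholder, not cpu_exit()):

/*
 * Old gcc:     volatile void cpu_exit();
 * Later gcc:   a noreturn attribute, as below.
 */
#include <stdlib.h>

static void never_returns_sketch(void) __attribute__((noreturn));

static void
never_returns_sketch(void)
{
        abort();        /* placeholder; a real cpu_exit() switches away */
}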


# 5 12-Jun-1993 rgrimes

This commit was generated by cvs2svn to compensate for changes in r4,
which included commits to RCS files with non-trunk default branches.


# 4 12-Jun-1993 rgrimes

Initial import, 0.1 + pk 0.2.4-B1