History log of /freebsd-10.0-release/sys/i386/i386/vm_machdep.c
Revision Date Author Comments
# 259065 07-Dec-2013 gjb

- Copy stable/10 (r259064) to releng/10.0 as part of the
10.0-RELEASE cycle.
- Update __FreeBSD_version [1]
- Set branch name to -RC1

[1] 10.0-CURRENT __FreeBSD_version value ended at '55', so
start releng/10.0 at '100' so the branch is started with
a value ending in zero.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255827 23-Sep-2013 kib

Free both KVA and backing pages when freeing TSS memory.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (marius)


# 255786 22-Sep-2013 glebius

- Create kern.ipc.sendfile namespace, and put the new "readahead" OID
there as "kern.ipc.sendfile.readahead".
- Push all nsfbuf related tunables into MD code. Don't move them
to new namespace in favor of POLA.

Reviewed by: scottl
Approved by: re (gjb)


# 254025 07-Aug-2013 jeff

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 253351 15-Jul-2013 ae

Introduce new structure sfstat for collecting sendfile's statistics
and remove corresponding fields from struct mbstat. Use PCPU counters
and SFSTAT_INC() macro for updating these statistics.

Discussed with: glebius


# 250624 13-May-2013 ed

Improve readability of static assertions for OFFSET_* macros.

Instead of doing all sorts of weird casting of constants to
pointer-pointers, simply use the standard C offsetof() macro to obtain
the offset of the respective fields in the structures.
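
A minimal illustration of the technique this commit describes, using the
kernel's CTASSERT() with a hypothetical structure and offset constant
(the real OFFSET_* names and structures differ):

    /* Hypothetical layout constant that assembly code relies on. */
    #define OFFSET_FOO_BAR  4

    struct foo {
            int     foo_a;
            int     foo_bar;
    };

    /*
     * Instead of casting a constant to a pointer-to-pointer type to
     * recover a field offset, let offsetof() compute it directly and
     * assert equality at compile time.
     */
    CTASSERT(OFFSET_FOO_BAR == offsetof(struct foo, foo_bar));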


# 238792 26-Jul-2012 kib

MFamd64 r238623:
Introduce curpcb magic variable.

Requested and reviewed by: bde
MFC after: 3 weeks


# 223758 04-Jul-2011 attilio

With retirement of cpumask_t and usage of cpuset_t for representing a
mask of CPUs, pc_other_cpus and pc_cpumask become highly inefficient.

Remove them and replace their usage with custom pc_cpuid magic (as,
atm, pc_cpumask can be easily represented by (1 << pc_cpuid) and
pc_other_cpus by (all_cpus & ~(1 << pc_cpuid))).

This change is not targeted for MFC because of the struct pcpu members
removal and the dependency on cpumask_t retirement.

MD review by: marcel, marius, alc
Tested by: pluknet
MD testing by: marcel, marius, gonzo, andreast
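
A sketch of the pc_cpuid substitution described above, assuming the old
single-word cpumask representation (illustrative only; the cpuset_t code
uses the CPU_* set macros rather than plain shifts):

    #include <sys/param.h>
    #include <sys/pcpu.h>

    /* What pc_cpumask used to hold for the current CPU. */
    static __inline u_int
    my_cpu_mask(void)
    {
            return (1u << PCPU_GET(cpuid));
    }

    /* What pc_other_cpus used to hold: every CPU except this one. */
    static __inline u_int
    other_cpus_mask(u_int all_cpus_mask)
    {
            return (all_cpus_mask & ~(1u << PCPU_GET(cpuid)));
    }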


# 222813 07-Jun-2011 attilio

Retire the cpumask_t type and replace it with cpuset_t usage.

This is intended to fix the bug where cpu mask objects are
capped at 32. MAXCPU can now be arbitrarily bumped to whatever
value. However, as long as several structures in the kernel are
statically allocated and sized by MAXCPU, it is suggested to keep it
as low as possible for the time being.

Technical notes on this commit itself:
- More functions for handling cpuset_t objects are introduced.
The most notable are cpusetobj_ffs() (which calculates a ffs(3)
for a cpuset_t object), cpusetobj_strprint() (which prepares a string
representing a cpuset_t object) and cpusetobj_strscan() (which
creates a valid cpuset_t starting from a string representation).
- pc_cpumask and pc_other_cpus are targeted to be removed soon.
With the move from cpumask_t to cpuset_t they are now inefficient
and not really useful. For the time being, please note that
access to pcpu data is protected by sched_pin() in order to avoid
migrating the CPU while possibly reading more than one word.
- Please note that size of cpuset_t objects may differ between kernel
and userland. While this is not directly related to the patch itself,
it is good to understand that concept and possibly use the patch
as a reference on how to deal with cpuset_t objects in userland, when
accessing kernel-side members.
- KTR_CPUMASK is changed and now is represented through a string, to be
set as the example reported in NOTES.

Please additionally note that MAXCPU is not bumped in this patch, but
private testing has been done up to MAXCPU=128 on a real 8x8x2(htt)
machine (amd64).

Please note that the FreeBSD version is not yet bumped because of
the upcoming pcpu changes. However, note that this patch is not
targeted for MFC.

People to thank for the time spent on this patch:
- sbruno, pluknet and Nicholas Esborn (nick AT desert DOT net) tested
several revisions of the patches and really helped in improving
stability of this work.
- marius fixed several bugs in the sparc64 implementation and reviewed
patches related to ktr.
- jeff and jhb discussed the basic approach followed.
- kib and marcel made targeted review on some specific part of the
patch.
- marius, art, nwhitehorn and andreast reviewed MD specific part of
the patch.
- marius, andreast, gonzo, nwhitehorn and jceel tested MD specific
implementations of the patch.
- Other people have made contributions on other patches that have been
already committed and have been listed separately.

Companies that should be mentioned for having participated at several
degrees:
- Yahoo! for having offered the machines used for testing on big
count of CPUs.
- The FreeBSD Foundation for having sponsored my devsummit attendance,
which has been instrumental.
- Sandvine for having offered offices and infrastructure during
development.

(I really hope I didn't forget anyone; if I did, I apologize in
advance).


# 217561 18-Jan-2011 kib

For architectures not using a direct map, and requiring a real KVA page
for sf buf allocation, use wakeup() instead of wakeup_one() to notify sf
buffer waiters about a free buffer.

sf_buf_alloc() calls msleep(PCATCH) when the SFB_CATCH flag was given,
and for simultaneous wakeup and signal delivery, msleep() returns
EINTR/ERESTART even though the thread was selected for wakeup_one(). As
a result, we lose a wakeup, and some other waiter will not be woken up.

Reported and tested by: az
Reviewed by: alc, jhb
MFC after: 1 week


# 211197 11-Aug-2010 jhb

Update various places that store or manipulate CPU masks to use cpumask_t
instead of int or u_int. Since cpumask_t is currently u_int on all
platforms this should just be a cosmetic change.


# 209462 23-Jun-2010 kib

Now that FPU use requires a working #MF, due to the removal of INT13 FPU
exception handling, MFi386 r209198:
Use critical sections instead of disabling local interrupts to ensure
the consistency between PCPU fpcurthread and the state of FPU.

Reviewed by: bde
Tested by: pho


# 209461 23-Jun-2010 kib

Remove the support for int13 FPU exception reporting on i386. It is
believed that all 486-class CPUs FreeBSD is capable of running on either
have no FPU and cannot use an external coprocessor, or have the FPU on
the package and can use #MF.

Reviewed by: bde
Tested by: pho (previous version)


# 208833 05-Jun-2010 kib

Introduce the x86 kernel interfaces to allow kernel code to use
FPU/SSE hardware. Caller should provide a save area that is chained
into the stack of the areas; pcb save_area for usermode FPU state is
on top. The pcb now contains a pointer to the current FPU saved area,
used during FPUDNA handling and context switches. There is also a
facility to allow the kernel thread to use pcb save_area.

Change the dreaded "npxdna in kernel mode!" warnings into panics
when FPU usage is not registered.

KPI discussed with: fabient
Tested by: pho, fabient
Hardware provided by: Sentex Communications
MFC after: 1 month


# 204309 25-Feb-2010 attilio

Introduce the new kernel sub-tree x86 which should contain all the code
shared and generalized between our current amd64, i386 and pc98.

This is just an initial step that should lead to a more complete effort.
For the moment, a very simple porting of cpufreq modules, BIOS calls and
the whole MD specific ISA bus part is added to the sub-tree but ideally
a lot of code might be added and more shared support should grow.

Sponsored by: Sandvine Incorporated
Reviewed by: emaste, kib, jhb, imp
Discussed on: arch
MFC: 3 weeks


# 199135 10-Nov-2009 kib

Extract the code that records syscall results in the frame into MD
function cpu_set_syscall_retval().

Suggested by: marcel
Reviewed by: marcel, davidxu
PowerPC, ARM, ia64 changes: marcel
Sparc64 tested and reviewed by: marius, also sunv reviewed
MIPS tested by: gonzo
MFC after: 1 month


# 197693 01-Oct-2009 kmacy

make read_eflags and write_eflags accomplish the same effect on PVM as native,
simplifying interrupt handling


# 195940 29-Jul-2009 kib

As was done in r195820 for amd64, use clflush for flushing cache lines
on i386 when memory page caching attributes change and the CPU does not
support self-snoop but does implement clflush.

Take care of a possible mapping of the page by an sf buffer by utilizing
that mapping for clflush; otherwise map the page transiently. Amd64
uses the direct map.

Proposed and reviewed by: alc
Approved by: re (kensmith)


# 190237 22-Mar-2009 alc

Eliminate the recomputation of pcb_cr3 from cpu_set_upcall(). The
bcopy()ed value from the old thread is the correct value because the new
thread and the old thread will share a page table.


# 188200 05-Feb-2009 kmacy

halt APs on reboot


# 188199 05-Feb-2009 kmacy

reboot instance on reset


# 186557 29-Dec-2008 kmacy

merge 186535, 186537, and 186538 from releng_7_xen

Log:
- merge in latest xenbus from dfr's xenhvm
- fix race condition in xs_read_reply by converting tsleep to mtx_sleep

Log:
unmask evtchn in bind_{virq, ipi}_to_irq

Log:
- remove code for handling case of not being able to sleep
- eliminate tsleep - make sleeps atomic


# 183615 05-Oct-2008 davidxu

If the current thread has the trap bit set (i.e. a debugger had
single stepped the process to the system call), we need to clear
the trap flag from the new frame. Otherwise, the new thread will
receive a (likely unexpected) SIGTRAP when it executes the first
instruction after returning to userland.
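
A minimal sketch of the fix described above, assuming i386 trapframe
field names (the helper is hypothetical; the real change is inline in
the thread-creation path):

    #include <sys/param.h>
    #include <sys/proc.h>
    #include <machine/psl.h>

    static void
    clear_single_step(struct thread *newtd)
    {
            /*
             * Drop the trace flag inherited from the parent so the new
             * thread does not take a spurious SIGTRAP on its first
             * instruction back in userland.
             */
            newtd->td_frame->tf_eflags &= ~PSL_T;
    }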


# 182961 12-Sep-2008 kib

When doing rfork(0), i.e. separating curproc VM from any other user of
the same vmspace, decrement the reference count of the shared LDT instead
of a newly-made copy. The code effectively removed the LDT from the
process that did rfork(0).

Introduce user_ldt_deref() function that does decrement of refcount for
the struct proc_ldt, and call it in the rfork(0) case on the shared LDT.

Reviewed by: jhb
MFC after: 1 week


# 182936 11-Sep-2008 jhb

Update the comments above the 0xcf9 register reset attempt to match the
code. We only attempt a single reset using this method (a "hard" reset),
and we use two writes to ensure there is a 0 -> 1 transition in bit 2 to
force a reset.

MFC after: 1 week
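
A sketch of the two-write sequence the updated comment describes
(illustrative; see cpu_reset_real() for the actual code):

    #include <machine/cpufunc.h>    /* outb() */

    static void
    reset_via_0xcf9(void)
    {
            /* Select a hard reset (bit 1) with the reset bit (bit 2) low. */
            outb(0xcf9, 0x2);
            /* Raise bit 2: the 0 -> 1 transition forces the reset. */
            outb(0xcf9, 0x6);
    }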


# 181938 20-Aug-2008 kmacy

fix typo in previous commit breaking bootup

pointed out by: Takahashi Yoshihiro nyan@


# 181911 20-Aug-2008 kmacy

- clean up interrupt handling for xen a tiny bit
- parse the command line into kenv
- defer shutdown watcher until later in boot

MFC after: 1 month


# 181775 15-Aug-2008 kmacy

Integrate support for xen into i386 common code.

MFC after: 1 month


# 177253 16-Mar-2008 rwatson

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 177091 12-Mar-2008 jeff

Remove kernel support for M:N threading.

While the KSE project was quite successful in bringing threading to
FreeBSD, the M:N approach taken by the kse library was never developed
to its full potential. Backwards compatibility will be provided via
libmap.conf for dynamically linked binaries and static binaries will
be broken.


# 173615 14-Nov-2007 marcel

o Rename cpu_thread_setup() to cpu_thread_alloc() to better
communicate that it relates to (is called by) thread_alloc()
o Add cpu_thread_free() which is called from thread_free()
to counter-act cpu_thread_alloc().

i386: Have cpu_thread_free() call cpu_thread_clean() to
preserve behaviour.
ia64: Have cpu_thread_free() call mtx_destroy() for the
mutex initialized in cpu_thread_alloc().

PR: ia64/118024


# 171295 07-Jul-2007 attilio

The current code shows several problems in ia32 LDT handling:
- When a LDT entry changes, the old one is freed while it is still
referenced by gdt and ldtr. This can lead to disruptive behaviours in
particular on SMP machines.
- When a LDT entry changes, it is assumed that the only entities sharing
the same LDT are threads in the same proc. It doesn't take into account
edge cases where two processes share the same VM (rfork'ed ones, for
example).

This patch addresses these two problems and additionally fixes the
refcount usage, switching it back to the old manually-grown refcount
(since in this case it is faster).

Diagnosed by: tegge
Tested by: pho (a former version)
Reviewed by: kib
Approved by: jeff (mentor)
Approved by: re


# 170305 04-Jun-2007 jeff

- Change comments and asserts to reflect the removal of the global
scheduler lock.

Tested by: kris, current@
Tested on: i386, amd64, ULE, 4BSD, libthr, libkse, PREEMPTION, etc.
Discussed with: kris, attilio, kmacy, jhb, julian, bde (small parts each)


# 170110 29-May-2007 attilio

Fix some problems introduced with the last descriptor tables locking
patch:
- Do the correct test for ldt allocation
- Drop dt_lock just before calling kmem_free (since it acquires blocking
locks inside)
- Solve a deadlock with smp_rendezvous() where another CPU would wait
indefinitely for dt_lock acquisition.
- Add dt_lock to the WITNESS list of spinlocks

While applying these modifications, change the requirements of
user_ldt_free(), making it return without dt_lock held.

Tested by: marcus, tegge
Reviewed by: tegge
Approved by: jeff (mentor)


# 169802 20-May-2007 jeff

- Move GDT/LDT locking into a separate spinlock, removing the global
scheduler lock from this responsibility.

Contributed by: Attilio Rao <attilio@FreeBSD.org>
Tested by: jeff, kkenn


# 169030 24-Apr-2007 jhb

Fix the triple fault used as a last resort during a reboot to actually
fault. The previous method zero'd out the page tables, invalidated the
TLB, and then entered a spin loop. The idea was that the instruction after
the TLB invalidate would result in a page fault and the page fault and
subsequent double fault wouldn't be able to determine the physical page
for their fault handlers' first instruction. This stopped working when
PGE (PG_G PTE/PDE bit) support was added as a TLB invalidate via %cr3
reload doesn't clear TLB entries with PG_G set. Thus, the CPU was still
able to map the virtual address for the spin loop and happily performed
its infinite loop.

The triple fault now uses a much more deterministic sledge-hammer approach
to generate a triple fault. First, the IDT descriptor is set to point to
an empty IDT, so any interrupts (including a double fault) will instantly
fault. Second, we trigger an int 3 breakpoint to force an interrupt and
kick off a triple fault.

MFC after: 3 days
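
A sketch of the sledge-hammer sequence described above (simplified; the
real code lives in cpu_reset()):

    #include <machine/cpufunc.h>    /* lidt() */
    #include <machine/segments.h>   /* struct region_descriptor */

    static void
    force_triple_fault(void)
    {
            /* An empty IDT: limit 0, so every vector is out of bounds. */
            static struct region_descriptor null_idt = { 0, 0 };

            /* Point the CPU at the empty IDT... */
            lidt(&null_idt);
            /* ...then raise int 3: fault, double fault, triple fault. */
            __asm __volatile("int $3");
    }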


# 169020 24-Apr-2007 jhb

Update comments for the 0xcf9 and 0x92 reset methods to explain what we are
actually doing and what the various bits mean.


# 168996 23-Apr-2007 jhb

Tweak printf string.


# 167250 05-Mar-2007 alc

Acquiring smp_ipi_mtx on every call to pmap_invalidate_*() is wasteful.
For example, during a buildworld more than half of the calls do not
generate an IPI because the only TLB entry invalidated is on the calling
processor. This revision pushes down the acquisition and release of
smp_ipi_mtx into smp_tlb_shootdown() and smp_targeted_tlb_shootdown() and
instead uses sched_pin() and sched_unpin() in pmap_invalidate_*() so that
thread migration doesn't lead to a missed TLB invalidation.

Reviewed by: jhb
MFC after: 3 weeks


# 166315 28-Jan-2007 rwatson

As we now have an SFB_NOWAIT flag, change 'will' to 'may' where the
comment for sf_buf_alloc(9) talks about sleeping.


# 159027 29-May-2006 davidxu

Back out the changes trying to inherit the floating-point environment.
Although POSIX (SUSv3) requires this, it is unclear what should be
inherited, and duplicating the whole 387 stack for a new thread seems
unnecessary and dangerous. Revert to the previous code, forcing a new
thread to be started with clean FP state.


# 158997 28-May-2006 davidxu

PCB_NPXINITDONE is cleared by npx_fork_thread.


# 158995 28-May-2006 davidxu

When creating a new thread, inherit the floating-point environment from
the current thread; this is required by the POSIX pthread_create
documentation.


# 158097 28-Apr-2006 sobomax

Unbreak pc98. Sorry...


# 158068 27-Apr-2006 sobomax

In the case when reset via the keyboard controller doesn't work for some
reason (i.e. no keyboard controller present), try two other common methods
for resetting an i386 machine - PCI reset and port 0x92 fast reset. Only if
neither works, warn the user and resort to the "unmap the entire address
space and hope for the best" hack. This makes my MacBook Pro reboot just
fine and should also help other legacy-free hardware out there.

Also, disable interrupts unconditionally in cpu_reset_real(), since we don't
want any interference.

MFC after: 1 week


# 156523 10-Mar-2006 davidxu

It is not necessary to read %gs twice.


# 156522 10-Mar-2006 davidxu

Fix stack offset to allow gcc's stack alignment code to work correctly.

MFC after: 3 days


# 152404 13-Nov-2005 imp

Provide a dummy NO_XBOX option that lives in opt_xbox.h for pc98.
This allows us to eliminate three #ifdef PC98 instances.


# 152238 09-Nov-2005 nyan

Fix pc98 build.


# 152219 09-Nov-2005 imp

Add support for XBOX to the FreeBSD port. The xbox architecture is
nearly identical to wintel/ia32, with a couple of tweaks. Since it is
so similar to ia32, it is optionally added to a i386 kernel. This
port is preliminary, but seems to work well. Further improvements
will improve the interaction with syscons(4), port Linux nforce driver
and future versions of the xbox.

This supports the 64MB and 128MB boxes. You'll need the most recent
CVS version of Cromwell (the Linux BIOS for the XBOX) to boot.

Rink will be maintaining this port, and is interested in feedback.
He's set up a website http://xbox-bsd.nl to report the latest
developments.

Any silly mistakes are my fault.

Submitted by: Rink P.W. Springer rink at stack dot nl and
Ed Schouten ed at fxq dot nl


# 151633 24-Oct-2005 jhb

When restarting the BSP during cpu_reset() use a membar to ensure that
the updated cpustop_restartfunc is seen when the BSP resumes execution.
This matches the membar already present in restart_cpus().


# 151302 13-Oct-2005 alc

Restore the UP optimization to reduce the number of TLB invalidations. The
previous revision only restored the MP optimization.

Describe the optimization strategy for TLB invalidations in a comment.

Reviewed by: ups@
MFC after: 3 days


# 151274 13-Oct-2005 ups

Restore optimizations to reduce TLB shootdowns.
Alan Cox pointed out that they are really useful for
sendfile().

MFC after: 3 days


# 151247 11-Oct-2005 ups

Ensure that a thread stays on the same CPU when calculating per-CPU
TLB shootdown requirements. Otherwise a CPU may not get the needed
TLB invalidation.

The PTE valid and access flags cannot be used here to avoid TLB
shootdowns unless sf->cpumask == all_cpus.
(Otherwise some CPUs may still hold an even older entry in the TLB.)
Since sf_buf_alloc mappings are normally always used, this is
also not really useful, and presetting accessed and modified
allows the CPU to speculatively load the entry into the TLB.

Both bugs can cause random data corruption.

MFC after: 3 days


# 150041 12-Sep-2005 nyan

opt_pc98.h is not needed.


# 147889 10-Jul-2005 davidxu

Validate that the value written into {FS,GS}.base is a canonical
address; writing a non-canonical address can cause a kernel panic.
Restrict base values to 0..VM_MAXUSER_ADDRESS, ensuring
only canonical values get written to the registers.

Reviewed by: peter, Josepha Koshy < joseph.koshy at gmail dot com >
Approved by: re (scottl)
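
A sketch of the range check described above (hypothetical helper; the
real check sits in the descriptor/base-setting path):

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <machine/vmparam.h>    /* VM_MAXUSER_ADDRESS */

    static int
    check_segbase(unsigned long base)
    {
            /*
             * Restricting the base to user addresses guarantees that
             * only canonical values reach the register, so writing it
             * cannot panic the kernel.
             */
            if (base >= VM_MAXUSER_ADDRESS)
                    return (EINVAL);
            return (0);
    }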


# 147558 23-Jun-2005 jhb

Various and sundry style fixes and comment cleanups.

Approved by: re (scottl)


# 146049 10-May-2005 nyan

Change the directory layout for pc98.
- Move MD files into <arch>/<arch>.
- Move bus dependent files into <arch>/<bus>.
Rename some files to more suitable names.

Repo-copied by: peter
Discussed with: imp


# 145433 23-Apr-2005 davidxu

Change cpu_set_kse_upcall to a more generic style, so we can reuse it
in other code. Add cpu_set_user_tls and use it to tweak user registers
and set up user TLS. I wanted to merge it into cpu_set_kse_upcall,
but since cpu_set_kse_upcall is also used by M:N threads which may
not need this feature, I wrote a separate cpu_set_user_tls.


# 144969 12-Apr-2005 jhb

Use NULL rather than 0 in a couple of places.


# 144637 04-Apr-2005 jhb

Divorce critical sections from spinlocks. Critical sections as denoted by
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any effect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.

Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fixup the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.

This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).

Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
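
A sketch of the new MD pair in the style of the i386 implementation
(simplified; field names follow the i386 struct mdthread, but treat
this as an outline rather than the committed code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/proc.h>
    #include <machine/cpufunc.h>

    void
    spinlock_enter(void)
    {
            struct thread *td = curthread;

            /* Block interrupts only for the outermost spin lock. */
            if (td->td_md.md_spinlock_count == 0)
                    td->td_md.md_saved_flags = intr_disable();
            td->td_md.md_spinlock_count++;
            critical_enter();
    }

    void
    spinlock_exit(void)
    {
            struct thread *td = curthread;

            critical_exit();
            td->td_md.md_spinlock_count--;
            /* Restore the interrupt state when the last spin lock drops. */
            if (td->td_md.md_spinlock_count == 0)
                    intr_restore(td->td_md.md_saved_flags);
    }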


# 141784 13-Feb-2005 alc

Implement support for CPU private mappings within sf_buf_alloc().


# 140260 14-Jan-2005 jhb

Bah, another whitespace fix.


# 140259 14-Jan-2005 jhb

Remove an extraneous space.


# 140257 14-Jan-2005 jhb

Remove redundant code to drop per-thread debug register state from
cpu_exit() as this is already performed in cpu_thread_exit() and the
debug state is per-thread rather than per-process.


# 139450 30-Dec-2004 jhb

Use NULL instead of 0 in a few places as well as various whitespace fixes.


# 139343 27-Dec-2004 njl

Restore the cpu_reset proxy code. It is needed if you want to reset the
system from an AP at runtime (i.e., calling cpu_reset from ddb). Someday,
if we move to an NMI for stopping cpus instead, we can do away with this.

Requested by: jhb


# 138592 08-Dec-2004 kbyanc

If the parent process has the trap bit set (i.e. a debugger had single
stepped the process to the system call), we need to clear the trap flag
from the new frame unless the debugger had set PF_FORK on the parent.
Otherwise, the child will receive a (likely unexpected) SIGTRAP when it
executes the first instruction after returning to userland.

Reviewed by: bde
MFC after: 3 days


# 138236 30-Nov-2004 njl

Remove now unused variable.

Pointy hat: njl from nskyline_r35 at yahoo com


# 138216 30-Nov-2004 njl

MFamd64: Remove the cpu_reset_proxy cruft now that we run boot() on
cpu 0. Also, restructure cpu_reset to be cleaner (no functional change.)


# 138129 27-Nov-2004 das

Don't include sys/user.h merely for its side-effect of recursively
including other headers.


# 137372 07-Nov-2004 alc

Introduce two new options, "CPU private" and "no wait", to sf_buf_alloc().
Change the spelling of the "catch" option to be consistent with the new
options. Implement the "no wait" option. An implementation of the "CPU
private" for i386 will be committed at a later date.


# 136601 16-Oct-2004 alc

When sf_buf_alloc() replaces a virtual-to-physical mapping, it needn't
invalidate the TLB(s) if the old mapping wasn't used by the CPU. With
network interfaces that implement checksum off-loading, the old mapping is
almost never used by the CPU, only by the device driver for setting up the
DMA operation.

Reviewed by: tegge@


# 132422 19-Jul-2004 davidxu

Mark the end of the frame chain for KSE threads and system-scope threads;
without this change, the debugger will dump a weird stack backtrace.


# 130036 03-Jun-2004 phk

The NatSemi (now AMD) Geode SC1100 needs special treatment here and there
because it is an embedded gadget. Give it its own value for the "cpu"
variable and add code to reset it, since it lacks a keyboard controller.


# 129906 31-May-2004 bmilekic

Bring in mbuma to replace mballoc.

mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.

Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.

mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer to how stats will work
within the modified framework.

From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.

Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.

This change removes more code than it adds.

A paper is available detailing the change and considering
various performance issues; it was presented at BSDCan2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.

Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)


# 129876 30-May-2004 phk

Add some missing <sys/module.h> includes which are masked by the
one on death-row in <sys/kernel.h>


# 129750 26-May-2004 tmm

Retire cpu_sched_exit(); it is not used any more.


# 128100 11-Apr-2004 alc

- is_physical_memory()'s parameter, which is a physical address, should be
a vm_paddr_t not a vm_offset_t.


# 127788 03-Apr-2004 alc

In some cases, sf_buf_alloc() should sleep with pri PCATCH; in others, it
should not. Add a new parameter so that the caller can specify which is
the case.

Reported by: dillon


# 127582 29-Mar-2004 peter

Finish tidying up a couple of leftovers from the KSTACK_PAGES stuff. Some
files still #included the opt_ file. powerpc hadn't been updated yet.


# 127283 21-Mar-2004 wpaul

The kthread_create() API is supposed to allow you to create threads
with more than the normal amount of stack pages, however the stack
pointer always wound up being initialized using KSTACK_PAGES. It
should be using td->td_kstack_pages instead. This means that although
the vm subsystem would give you all the stack pages you asked for,
%esp would always be initialized as if you had just 2 pages, and
the rest would go to waste.

I wanted to use the 'give me more stack pages' feature of kthread_create()
because the Intel 2200BG NDIS driver does an alloca() of about 5000 bytes,
which wrecks the stack with the default 2 page size, and I was baffled
that no matter how much code I shoved into thread contexts with
allegedly larger stacks, the thing would still crash unless I changed
KSTACK_PAGES.

Note: this bug is present in _ALL_ arches at this point. Peter has
promised to merge this fix into all of them.


# 127086 16-Mar-2004 alc

Refactor the existing machine-dependent sf_buf_free() into a machine-
dependent function by the same name and a machine-independent function,
sf_buf_mext(). Aside from the virtue of making more of the code machine-
independent, this change also makes the interface more logical. Before,
sf_buf_free() did more than simply undo an sf_buf_alloc(); it also
unwired and if necessary freed the page. That is now the purpose of
sf_buf_mext(). Thus, sf_buf_alloc() and sf_buf_free() can now be used
as a general-purpose ephemeral map cache.


# 126942 14-Mar-2004 alc

Simplify sf_buf_alloc().


# 126793 10-Mar-2004 alc

- Make the acquisition of Giant in vm_fault_unwire() conditional on the
pmap. For the kernel pmap, Giant is not required. In general, for
other pmaps, Giant is required by i386's pmap_pte() implementation.
Specifically, the use of PMAP2/PADDR2 is synchronized by Giant.
Note: In principle, updates to the kernel pmap's wired count could be
lost without Giant. However, in practice, we never use the kernel
pmap's wired count. This will be resolved when pmap locking appears.
- With the above change, cpu_thread_clean() and uma_large_free() need
not acquire Giant. (The first case is simply the revival of
i386/i386/vm_machdep.c's revision 1.226 by peter.)


# 126785 09-Mar-2004 jb

Remove duplicate code.

Requested by: bde


# 126761 09-Mar-2004 jb

AMD's ELAN documentation says that you write to the SYS_RST register
in the Memory Mapped Configuration Region (MMCR) to reset the CPU.
If CPU_ELAN is set, try this first to reset the CPU before the
traditional way.

Without this change, my Compulab board powers down on 'reset' instead
of rebooting.


# 126738 07-Mar-2004 peter

Add back Giant locks around kmem_free() call from user_ldt cleanup path
during exit. Apparently it isn't safe after all. See uma_large_free().

Pointed out by: alc


# 126736 07-Mar-2004 peter

Other parts of the tree do not protect calls to kmem_free() with Giant,
so remove it from here. The most notable examples include vm_mmap().
This removes one more Giant event from exit(2).


# 124144 05-Jan-2004 phk

Add struct definition of the Elan MMCR registers (from jb@)

Put a CTASSERT() on the size of the struct.

Use the struct where it is easy to do so in elan_mmcr.c

Add the Elan specific hardware reset code (also from jb@).


# 123955 29-Dec-2003 bde

Sorted includes.


# 123929 28-Dec-2003 silby

Track three new sendfile-related statistics:
- The number of times sendfile had to do disk I/O
- The number of times sfbuf allocation failed
- The number of times sfbuf allocation had to wait


# 123920 27-Dec-2003 silby

Move the declaration of sfbufspeak and sfbufsused to mbuf.h,
and use imax instead of max, as sfbufspeak and sfbufsused
are signed.

Submitted by: bde


# 123884 27-Dec-2003 silby

Track current and peak sfbuf usage, export the values via sysctl.


# 123266 07-Dec-2003 alc

Don't remove the virtual-to-physical mapping when an sf_buf is freed.
Instead, allow the mapping to persist, but add the sf_buf to a free list.
If a later sendfile(2) or zero-copy send resends the same physical page,
perhaps with the same or different contents, then the mapping overhead is
avoided and the sf_buf is simply removed from the free list.

In other words, the i386 sf_buf implementation now behaves as a cache of
virtual-to-physical translations using an LRU replacement policy on
inactive sf_bufs. This is similar in concept to a part of
http://www.cs.princeton.edu/~yruan/debox/ patch, but much simpler in
implementation. Note: none of this is required on alpha, amd64, or ia64.
They now use their direct virtual-to-physical mapping to avoid any
ephemeral mapping overheads in their sf_buf implementations.


# 122860 17-Nov-2003 alc

- Change the i386's sf_buf implementation so that it never allocates
more than one sf_buf for one vm_page. To accomplish this, we add
a global hash table mapping vm_pages to sf_bufs and a reference
count to each sf_buf. (This is similar to the patches for RELENG_4
at http://www.cs.princeton.edu/~yruan/debox/.)

For the uninitiated, an sf_buf is nothing more than a kernel virtual
address that is used for temporary virtual-to-physical mappings by
sendfile(2) and zero-copy sockets. As such, there is no reason for
one vm_page to have several sf_bufs mapping it. In fact, using more
than one sf_buf for a single vm_page increases the likelihood that
sendfile(2) blocks, hurting throughput.
(See http://www.cs.princeton.edu/~yruan/debox/.)
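
A simplified sketch of the lookup-before-allocate idea described above
(the hash-table names are hypothetical and the field names follow the
i386 struct sf_buf; the real allocator also handles mapping setup, the
free list, and sleeping when no sf_buf is available):

    #include <sys/param.h>
    #include <sys/queue.h>
    #include <sys/sf_buf.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>

    /* One bucket per hash value, sized and filled in at sf_buf_init(). */
    static LIST_HEAD(sf_head, sf_buf) *sf_buf_hashtable;
    static u_long sf_buf_hashmask;

    #define SF_BUF_HASH(m) \
            (&sf_buf_hashtable[((uintptr_t)(m) >> PAGE_SHIFT) & sf_buf_hashmask])

    /* Reuse an existing sf_buf for this page, bumping its refcount. */
    static struct sf_buf *
    sf_buf_lookup(vm_page_t m)
    {
            struct sf_buf *sf;

            LIST_FOREACH(sf, SF_BUF_HASH(m), list_entry)
                    if (sf->m == m) {
                            sf->ref_count++;
                            return (sf);
                    }
            return (NULL);
    }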


# 122821 16-Nov-2003 alc

- Remove unnecessary synchronization from sf_buf_init(). (There is only
one active CPU when sf_buf_init() is performed.)


# 122780 16-Nov-2003 alc

- Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
- Move the sf_buf API to its own header file; make struct sf_buf's
definition machine dependent. In this commit, we remove an
unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
Ultimately, we may eliminate struct sf_buf on those architectures
except as an opaque pointer that references a vm page.


# 121235 18-Oct-2003 davidxu

Use npxdrop in cpu_thread_exit to save some cycles.
Clear the FPU pcb flags for the new upcall thread; these flags needn't
be inherited, and the new thread should start with clean FPU state.


# 119563 29-Aug-2003 alc

Migrate the sf_buf allocator that is used by sendfile(2) and zero-copy
sockets into machine-dependent files. The rationale for this
migration is illustrated by the modified amd64 allocator. It uses the
amd64's direct map to avoid emphemeral mappings in the kernel's
address space. On an SMP, the ephemeral mappings result in an IPI
for TLB shootdown for each transmitted page. Yuck.

Maintainers of other 64-bit platforms with direct maps should be able
to use the amd64 allocator as a reference implementation.


# 119004 16-Aug-2003 marcel

In vm_thread_swap{in|out}(), remove the alpha specific conditional
compilation and replace it with a call to cpu_thread_swap{in|out}().
This allows us to add similar code on ia64 without cluttering the
code even more.


# 117285 06-Jul-2003 bsd

Woops - don't forget to declare locals needed for that last commit.


# 117284 06-Jul-2003 bsd

Don't unconditionally reset the hardware debug registers in cpu_exit(),
reset them only if they were previously in use. Unconditionally
resetting the registers wipes them out frequently, which interferes
with their use for kernel debugging.

While I'm here, be less verbose in the associated comment of a
neighboring function.

Noticed by: bde


# 116188 11-Jun-2003 peter

GC unused cpu_wait() function


# 115858 04-Jun-2003 marcel

Change the second (and last) argument of cpu_set_upcall(). Previously
we were passing in a void* representing the PCB of the parent thread.
Now we pass a pointer to the parent thread itself.
The prime reason for this change is to allow cpu_set_upcall() to copy
(parts of) the trapframe instead of having it done in MI code in each
caller of cpu_set_upcall(). Copying the trapframe cannot always be
done with a simple bcopy() or may not always be optimal that way. On
ia64 specifically the trapframe contains information that is specific
to an entry into the kernel and can only be used by the corresponding
exit from the kernel. A trapframe copied verbatim from another frame
is in most cases useless without some additional normalization.

Note that this change removes the assignment to td->td_frame in some
implementations of cpu_set_upcall(). The assignment is redundant.
A previous call to cpu_thread_setup() already did the exact same
assignment. An added benefit of removing the redundant assignment is
that we can now change td_pcb without nasty side-effects.

This change officially marks the ability on ia64 for 1:1 threading.

Not tested on: amd64, powerpc
Compile & boot tested on: alpha, sparc64
Functionally tested on: i386, ia64


# 115728 02-Jun-2003 tegge

Initialize td->td_pcb->pcb_ext in cpu_thread_setup() since a garbage
value (e.g. 0xd0d0d0d0) can cause a kernel panic.


# 115683 02-Jun-2003 obrien

Use __FBSDID().


# 115044 15-May-2003 tmm

In cpu_fork(), initialize pcb_psl for the new process to PSL_KERNEL,
instead of taking the (userland) eflags from the trap frame and masking
out PSL_I. There is no need to inherit any flags from the forking process;
the old method however can cause flags set in userland for the forking
process to be bogusly set in kernel mode when the newly forked process
runs for the first time (in particular PSL_T, which is set for userland
when the process is single-stepped; this would cause trace traps in
kernel mode).

Approved by: re (jhb)
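
The core of the change, as a fragment in the spirit of the commit
(pcb_psl is the i386 PCB field holding the saved eflags; pcb2 stands
for the child's new PCB):

    /*
     * Start the child with a known kernel eflags value instead of the
     * parent's userland eflags with PSL_I masked off; this keeps stray
     * user flags such as PSL_T out of kernel mode.
     */
    pcb2->pcb_psl = PSL_KERNEL;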


# 113796 21-Apr-2003 davidxu

Reset pcb_gs and %gs before possibly invalidating it.


# 113364 11-Apr-2003 davidxu

Copy %gs from current CPU not from a stale PCB backup.


# 112841 30-Mar-2003 jake

- Add support for PAE and more than 4 gigs of ram on x86, dependent on the
kernel option 'options PAE'. This will only work with device drivers which
either use busdma, or are able to handle 64 bit physical addresses.

Thanks to Lanny Baron from FreeBSD Systems for the loan of a test machine
with 6 gigs of ram.

Sponsored by: DARPA, Network Associates Laboratories, FreeBSD Systems


# 112569 24-Mar-2003 jake

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses are larger than virtual addresses, such as i386s
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


# 111363 23-Feb-2003 jake

- Added macros NPGPTD, NBPTD, and NPDEPTD, for dealing with the size of the
page directory.
- Use these instead of the magic constants 1 or PAGE_SIZE where appropriate.
There are still numerous assumptions that the page directory is exactly
1 page.

Sponsored by: DARPA, Network Associates Laboratories
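
A sketch of how such size macros relate (values shown for the PAE
configuration; treat the exact definitions as illustrative):

    /* Number of pages making up the page directory (4 under PAE). */
    #define NPGPTD          4
    /* Bytes spanned by the page directory. */
    #define NBPTD           (NPGPTD << PAGE_SHIFT)
    /* Number of page-directory entries in the page directory. */
    #define NPDEPTD         (NBPTD / sizeof(pd_entry_t))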


# 111028 17-Feb-2003 jeff

- Split the struct kse into struct upcall and struct kse. struct kse will
soon be visible only to schedulers. This greatly simplifies much of the
KSE code.

Submitted by: davidxu


# 110190 01-Feb-2003 julian

Reversion of commit by Davidxu plus fixes since applied.

I'm not convinced there is anything major wrong with the patch but
them's the rules..

I am using my "David's mentor" hat to revert this as he's
offline for a while.


# 109877 26-Jan-2003 davidxu

Move the UPCALL related data structures out of kse and introduce a new
data structure called kse_upcall to manage UPCALLs. All KSE binding
and loaning code is gone.

A thread that owns an upcall can collect all completed syscall contexts in
its ksegrp, turn itself into UPCALL mode, and take those contexts back
to userland. Any thread without an upcall structure has to export its
contexts and exit at the user boundary.

Any thread running in user mode owns an upcall structure. When it enters
the kernel, if the kse mailbox's current thread pointer is not NULL, then
when the thread is blocked in the kernel, a new UPCALL thread is created
and the upcall structure is transferred to the new UPCALL thread. If the
kse mailbox's current thread pointer is NULL, then when a thread is
blocked in the kernel, no UPCALL thread will be created.

Each upcall always has an owner thread. Userland can remove an upcall by
calling kse_exit; when all upcalls in a ksegrp are removed, the group is
automatically shut down. An upcall owner thread also exits when the
process is in the exiting state. When an owner thread exits, the upcall
it owns is also removed.

KSE is a pure scheduler entity; it represents a virtual cpu. When a thread
is running, it always has a KSE associated with it. The scheduler is free
to assign a KSE to a thread according to thread priority; if the thread
priority changes, the KSE can be moved from one thread to another.

When a ksegrp is created, there are always N KSEs created in the group,
where N is the number of physical cpus in the current system. This makes
it possible that, even if a userland UTS is only single-CPU safe, threads
in the kernel can still execute on different cpus in parallel. Userland
calls kse_create to add more upcall structures into a ksegrp to increase
concurrency in userland itself; the kernel is not restricted by the number
of upcalls userland provides.

The code hasn't been tested under SMP by author due to lack of hardware.

Reviewed by: julian


# 109342 15-Jan-2003 dillon

Merge all the various copies of vm_fault_quick() into a single
portable copy.


# 109340 15-Jan-2003 dillon

Merge all the various copies of vmapbuf() and vunmapbuf() into a single
portable copy. Note that pmap_extract() must be used instead of
pmap_kextract().

This is precursor work to a reorganization of vmapbuf() to close remaining
user/kernel races (which can lead to a panic).


# 107719 10-Dec-2002 julian

Unbreak the KSE code. Keep track of zombie threads using the per-CPU
storage during the context switch. Rearrange thread cleanups
to avoid problems with Giant. Clean threads when freed or
when recycled.

Approved by: re (jhb)


# 107212 24-Nov-2002 alc

Add page queues locking to vunmapbuf(); reduce differences with respect
to the sparc64 implementation. (Note: With modest effort on the alpha and
ia64 this function could migrate to the MI part of the kernel.)

Approved by: re (blanket)


# 107180 22-Nov-2002 mux

Under certain circumstances, we were calling kmem_free() from
i386 cpu_thread_exit(). This resulted in a panic with WITNESS
since we need to hold Giant to call kmem_free(), and we weren't
holding it anymore in cpu_thread_exit(). We now do this from a
new MD function, cpu_thread_dtor(), called by thread_dtor().

Approved by: re@
Suggested by: jhb


# 104695 09-Oct-2002 julian

Round out the facilty for a 'bound' thread to loan out its KSE
in specific situations. The owner thread must be blocked, and the
borrower can not proceed back to user space with the borrowed KSE.
The borrower will return the KSE on the next context switch where
the owner wants it back. This removes a lot of possible
race conditions and deadlocks. It is conceivable that the
borrower should inherit the priority of the owner too;
that's another discussion and would be simple to do.

Also, as part of this, the "preallocatd spare thread" is attached to the
thread doing a syscall rather than the KSE. This removes the need to lock
the scheduler when we want to access it, as it's now "at hand".

DDB now shows a lot more info for threaded processes, though it may need
some optimisation to squeeze it all back into 80 chars again.
(possible JKH project)

Upcalls are now "bound" threads, but "KSE Lending" now means that
other completing syscalls can be completed using that KSE before the upcall
finally makes it back to the UTS. (getting threads OUT OF THE KERNEL is
one of the highest priorities in the KSE system.) The upcall when it happens
will present all the completed syscalls to the KSE for selection.


# 103407 16-Sep-2002 mini

Add kernel support needed for the KSE-aware libpthread:
- Maintain fpu state across signals.
- Use ucontext_t's to store KSE thread state.
- Synthesize state for the UTS upon each upcall, rather than
saving and copying a trapframe.
- Save and restore FPU state properly in ucontext_t's.

Reviewed by: deischen, julian
Approved by: -arch


# 103049 06-Sep-2002 peter

Zap the implementations of the i386-aout specific cpu_coredump function.
Most of the non-i386 platforms had rather broken implementations anyway.


# 101941 15-Aug-2002 rwatson

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to clarify which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
- pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 99072 29-Jun-2002 julian

Part 1 of KSE-III

The ability to schedule multiple threads per process
(on one cpu) by making ALL system calls optionally asynchronous.
to come: ia64 and power-pc patches, patches for gdb, test program (in tools)

Reviewed by: Almost everyone who counts
(at various times, peter, jhb, matt, alfred, mini, bernd,
and a cast of thousands)

NOTE: this is still Beta code, and contains lots of debugging stuff.
expect slight instability in signals..


# 98765 24-Jun-2002 jake

Add an MD callout like cpu_exit, but which is called after sched_lock is
obtained, when all other scheduling activity is suspended. This is needed
on sparc64 to deactivate the vmspace of the exiting process on all cpus.
Otherwise if another unrelated process gets the exact same vmspace structure
allocated to it (same address), its address space will not be activated
properly. This seems to fix some spontaneous signal 11 problems with smp
on sparc64.


# 93264 27-Mar-2002 dillon

Compromise for critical*()/cpu_critical*() recommit. Cleanup the interrupt
disablement assumptions in kern_fork.c by adding another API call,
cpu_critical_fork_exit(). Cleanup the td_savecrit field by moving it
from MI to MD. Temporarily move cpu_critical*() from <arch>/include/cpufunc.h
to <arch>/<arch>/critical.c (stage-2 will clean this up).

Implement interrupt deferral for i386 that allows interrupts to remain
enabled inside critical sections. This also fixes an IPI interlock bug,
and requires uses of icu_lock to be enclosed in a true interrupt disablement.

This is the stage-1 commit. Stage-2 will occur after stage-1 has stabilized,
and will move cpu_critical*() into its own header file(s) + other things.
This commit may break non-i386 architectures in trivial ways. This should
be temporary.

Reviewed by: core
Approved by: core


# 92890 21-Mar-2002 alc

o Use the MI vm_map_growstack() instead of grow_stack() in trap_pfault()
and trapwrite().
o On i386/pc98, remove the (now) unused grow_stack().


# 92860 21-Mar-2002 imp

Fix abuses of cpu_critical_{enter,exit} by converting to
intr_{disable,restore} as well as providing an implementation of
intr_{disable,restore}.

Reviewed by: jake, rwatson, jhb


# 92770 20-Mar-2002 alfred

Remove __P.


# 91328 26-Feb-2002 dillon

revert last commit temporarily due to whining on the lists.


# 91315 26-Feb-2002 dillon

STAGE-1 of 3 commit - allow (but do not require) interrupts to remain
enabled in critical sections and streamline critical_enter() and
critical_exit().

This commit allows an architecture to leave interrupts enabled inside
critical sections if it so wishes. Architectures that do not wish to do
this are not affected by this change.

This commit implements the feature for the I386 architecture and provides
a sysctl, debug.critical_mode, which defaults to 1 (use the feature). For
now you can turn the sysctl on and off at any time in order to test the
architectural changes or track down bugs.

This commit is just the first stage. Some areas of the code, specifically
the MACHINE_CRITICAL_ENTER #ifdef'd code, is strictly temporary and will
be cleaned up in the STAGE-2 commit when the critical_*() functions are
moved entirely into MD files.

The following changes have been made:

* critical_enter() and critical_exit() for I386 now simply increment
and decrement curthread->td_critnest. They no longer disable
hard interrupts. When critical_exit() decrements the counter to
0 it effectively calls a routine to deal with whatever interrupts
were deferred during the time the code was operating in a critical
section.

Other architectures are unaffected.

* fork_exit() has been conditionalized to remove MD assumptions for
the new code. Old code will still use the old MD assumptions
in regards to hard interrupt disablement. In STAGE-2 this will
be turned into a subroutine call into MD code rather than hardcoded
in MI code.

The new code places the burden of entering the critical section
in the trampoline code where it belongs.

* I386: interrupts are now enabled while we are in a critical section.
The interrupt vector code has been adjusted to deal with the fact.
If it detects that we are in a critical section it currently defers
the interrupt by adding the appropriate bit to an interrupt mask.

* In order to accomplish the deferral, icu_lock is required. This
is i386-specific. Thus icu_lock can only be obtained by mainline
i386 code while interrupts are hard disabled. This change has been
made.

* Because interrupts may or may not be hard disabled during a
context switch, cpu_switch() can no longer simply assume that
PSL_I will be in a consistent state. Therefore, it now saves and
restores eflags.

* FAST INTERRUPT PROVISION. Fast interrupts are currently deferred.
The intention is to eventually allow them to operate either while
we are in a critical section or, if we are able to restrict the
use of sched_lock, while we are not holding the sched_lock.

* ICU and APIC vector assembly for I386 cleaned up. The ICU code
has been cleaned up to match the APIC code in regards to format
and macro availability. Additionally, the code has been adjusted
to deal with deferred interrupts.

* Deferred interrupts use a per-cpu boolean int_pending, and
masks ipending, spending, and fpending. Being per-cpu variables,
it is not currently necessary to lock the bus cycles modifying them.

Note that the same mechanism will enable preemption to be
incorporated as a true software interrupt without having to
further hack up the critical nesting code.

* Note: the old critical_enter() code in kern/kern_switch.c is
currently #ifdef to be compatible with both the old and new
methodology. In STAGE-2 it will be moved entirely to MD code.

Performance issues:

One of the purposes of this commit is to enhance critical section
performance, specifically to greatly reduce bus overhead to allow
the critical section code to be used to protect per-cpu caches.
These caches, such as Jeff's slab allocator work, can potentially
operate very quickly making the effective savings of the new
critical section code's performance very significant.

The second purpose of this commit is to allow architectures to
enable certain interrupts while in a critical section. Specifically,
the intention is to eventually allow certain FAST interrupts to
operate rather than defer.

The third purpose of this commit is to begin to clean up the
critical_enter()/critical_exit()/cpu_critical_enter()/
cpu_critical_exit() API which currently has serious cross pollution
in MI code (in fork_exit() and ast() for example).

The fourth purpose of this commit is to provide a framework that
allows kernel-preempting software interrupts to be implemented
cleanly. This is currently used for two forward interrupts in I386.
Other architectures will have the choice of using this infrastructure
or building the functionality directly into critical_enter()/
critical_exit().

Finally, this commit is designed to greatly improve the flexibility
of various architectures to manage critical section handling,
software interrupts, preemption, and other highly integrated
architecture-specific details.


# 90562 12-Feb-2002 alc

Remove an unused (but initialized) variable from vmapbuf().


# 90361 07-Feb-2002 julian

Pre-KSE/M3 commit.
This is a low-functionality change that changes the kernel to access the
main thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embedded there,
but remove all the code that assumes that, in preparation for the next
commit which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 88146 18-Dec-2001 julian

In a couple of places, we recalculated addresses we already had in local
pointer variables.


# 88088 17-Dec-2001 jhb

Modify the critical section API as follows:
- The MD functions critical_enter/exit are renamed to start with a cpu_
prefix.
- MI wrapper functions critical_enter/exit maintain a per-thread nesting
count and a per-thread critical section saved state set when entering
a critical section while at nesting level 0 and restored when exiting
to nesting level 0. This moves the saved state out of spin mutexes so
that interlocking spin mutexes works properly.
- Most low-level MD code that used critical_enter/exit now use
cpu_critical_enter/exit. MI code such as device drivers and spin
mutexes use the MI wrappers. Note that since the MI wrappers store
the state in the current thread, they do not have any return values or
arguments.
- mtx_intr_enable() is replaced with a constant CRITICAL_FORK which is
assigned to curthread->td_savecrit during fork_exit().

Tested on: i386, alpha


# 87702 11-Dec-2001 jhb

Overhaul the per-CPU support a bit:

- The MI portions of struct globaldata have been consolidated into a MI
struct pcpu. The MD per-CPU data are specified via a macro defined in
machine/pcpu.h. A macro was chosen over a struct mdpcpu so that the
interface would be cleaner (PCPU_GET(my_md_field) vs.
PCPU_GET(md.md_my_md_field)).
- All references to globaldata are changed to pcpu instead. In a UP kernel,
this data was stored as global variables which is where the original name
came from. In an SMP world this data is per-CPU and ideally private to each
CPU outside of the context of debuggers. This also included combining
machine/globaldata.h and machine/globals.h into machine/pcpu.h.
- The pointer to the thread using the FPU on i386 was renamed from
npxthread to fpcurthread to be identical with other architectures.
- Make the show pcpu ddb command MI with a MD callout to display MD
fields.
- The globaldata_register() function was renamed to pcpu_init() and now
initializes the MI fields of a struct pcpu in addition to registering it with
the internal array and list.
- A pcpu_destroy() function was added to remove a struct pcpu from the
internal array and list.

Tested on: alpha, i386
Reviewed by: peter, jake
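
As a rough usage illustration (not part of the commit), MI and MD per-CPU
fields are reached through the same accessors; fpcurthread is the MI field
mentioned above, while currentldt and new_ldt_sel stand in for an MD field
declared via the machine/pcpu.h macro:

    struct thread *fputd;

    /* Read an MI per-CPU field (the FPU owner, formerly npxthread). */
    fputd = PCPU_GET(fpcurthread);

    /* Write an assumed MD per-CPU field declared in machine/pcpu.h. */
    PCPU_SET(currentldt, new_ldt_sel);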


# 85449 24-Oct-2001 jhb

Split the per-process Local Descriptor Table out of the PCB and into
struct mdproc.

Submitted by: Andrew R. Reiter <arr@watson.org>
Silence on: -current


# 84935 14-Oct-2001 tegge

Change vmapbuf() to use pmap_qenter() and vunmapbuf() to use pmap_qremove().

This significantly reduces the number of TLB shootdowns caused by
vmapbuf/vunmapbuf when performing many large reads from raw disk devices.

Reviewed by: dillon
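
Roughly, the pattern this moves to (a sketch, not the actual diff; kva,
pages and npages are illustrative names):

    /* Map npages wired pages contiguously at kva in a single call ... */
    pmap_qenter(kva, pages, npages);

    /* ... perform the raw I/O through the kva mapping ... */

    /* ... then remove the temporary mappings as one ranged operation,
       instead of paying a TLB shootdown per page. */
    pmap_qremove(kva, npages);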


# 84381 02-Oct-2001 mjacob

Fix problem where a user buffer outside of the area being tested
will be corrupted.

PR: 29194
Obtained from: Tor.Egge@fast.no
MFC after: 2 weeks


# 83660 19-Sep-2001 peter

Reserve an extra 16 bytes in case we have to grow the trapframe into
a vm86trapframe for switching to vm86 [unlikely] while exiting.
I lost this when doing the pcb move that went in with the KSE commit.

Reviewed by: jake


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doozy!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 83276 10-Sep-2001 peter

Rip some well duplicated code out of cpu_wait() and cpu_exit() and move
it to the MI area. KSE touched cpu_wait() which had the same change
replicated five ways for each platform. Now it can just do it once.
The only MD parts seemed to be dealing with fpu state cleanup and things
like vm86 cleanup on x86. The rest was identical.

XXX: ia64 and powerpc did not have cpu_throw(), so I've put a functional
stub in place.

Reviewed by: jake, tmm, dillon


# 83223 08-Sep-2001 peter

Missing part of dillon's coredump commit. cpu_coredump() was still
passing IO_NODELOCKED to vn_rdwr(), this would cause operations on the
unlocked core vnode and softupdates nastiness if an a.out binary cored.


# 82938 04-Sep-2001 peter

Nuke #if 0'ed "setredzone()" stub. We never used it, and probably
never will. I've implemented an optional redzone as part of the KSE
upage breakup.


# 82309 25-Aug-2001 peter

Optionize UPAGES for the i386. As part of this I split some of the low
level implementation stuff out of machine/globaldata.h to avoid exposing
UPAGES to lots more places. The end result is that we can double
the kernel stack size with 'options UPAGES=4' etc.

This is mainly being done for the benefit of a MFC to RELENG_4 at some
point. -current doesn't really need this so much since each interrupt
runs on its own kstack.


# 79609 12-Jul-2001 peter

Activate SSE/SIMD. This is the extra context switching support that
we are required to do if we let user processes use the extra 128 bit
registers etc.

This is the base part of the diff I got from:
http://www.issei.org/issei/FreeBSD/sse.html
I believe this is by: Mr. SUZUKI Issei <issei@issei.org>
SMP support apparently by: Takekazu KATO <kato@chino.it.okayama-u.ac.jp>
Test code by: NAKAMURA Kazushi <kaz@kobe1995.net>, see
http://kobe1995.net/~kaz/FreeBSD/SSE.en.html

I have fixed a couple of style(9) deviations. I have some followup
commits to fix a couple of non-style things.


# 79265 04-Jul-2001 dillon

Move vm_page_zero_idle() from machine-dependent sections to a
machine-independent source file, vm/vm_zeroidle.c. It was exactly the
same for all platforms and updating them all was getting annoying.


# 79263 04-Jul-2001 dillon

Reorg vm_page.c into vm_page.c, vm_pageq.c, and vm_contig.c (for contigmalloc).
Also removed some spl's and added some VM mutexes, but they are not actually
used yet, so this commit does not really make any operational changes
to the system.

vm_page.c relates to vm_page_t manipulation, including high level deactivation,
activation, etc... vm_pageq.c relates to finding free pages and acquiring
exclusive access to a page queue (exclusivity part not yet implemented).
And the world still builds... :-)


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 79123 03-Jul-2001 jhb

Allow Giant to be recursed when a process terminates.


# 78962 29-Jun-2001 jhb

Add a new MI pointer to the process' trapframe p_frame instead of using
various differently named pointers buried under p_md.

Reviewed by: jake (in principle)


# 77486 30-May-2001 jhb

We can't grab the sched_lock in set_user_ldt() because when it is called
from cpu_switch(), curproc has been changed, but the sched_lock owner will
not be updated until we return to mi_switch(), thus we deadlock against
ourselves. As a workaround, push the acquire and release of sched_lock out
to the callers of set_user_ldt(). Note that we can't use a mtx_assert() in
set_user_ldt for the same reason.

Sleuthing by: tmm
Tested by: tmm, dougb


# 76947 21-May-2001 jhb

Remove a few more spl's I missed earlier.

Reported by: Michael Harnois <mdharnois@home.com>
Pointy hat: me


# 76939 21-May-2001 jhb

Axe unneeded spl()'s.


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76546 13-May-2001 bde

Use a critical region to protect pushing of the parent's npx state to the
pcb for fork(). It was possible for the state to be saved twice when an
interrupt handler saved it concurrently. This corrupted (reset) the state
because fnsave has the (in)convenient side effect of doing an implicit
fninit. Mundane null pointer bugs were not possible, because we save to
an "arbitrary" process's pcb and not to the "right" place (npxproc).

Push the parent's %gs to the pcb for fork(). Changes to %gs before
fork() were not preserved in the child unless an accidental context
switch did the pushing. Updated the list of pcb contents which is
supposed to inhibit bugs like this. pcb_dr*, pcb_gs and pcb_ext were
missing. Copying is correct for pcb_dr*, and pcb_ext is already
handled specially (although XXX'ly).

Reducing the savectx() call to an npxsave() call in rev.1.80 was a
mistake. The above bugs are duplicated in many places, including in
savectx() itself.

The arbitrariness of the parent process pointer for the fork()
subroutines, the pcb pointer for savectx(), and the save87 pointer
for npxsave(), is illusory. These functions don't work "right" unless
the pointers are precisely curproc, curpcb, and the address of npxproc's
save87 area, respectively, although the special context in which they
are called allows savectx(&dumppcb) to sort of work and npxsave(&dummy)
to work. cpu_fork() just doesn't work unless the parent process
pointer is curproc, or the caller has pushed %gs to the pcb, or %gs
happens to already be in the pcb.


# 76434 10-May-2001 jhb

- Use sched_lock and critical regions to ensure that LDT updates are thread
safe from preemption and concurrent access to the LDT.
- Move the prototype for i386_extend_pcb() to <machine/pcb_ext.h>.

Reviewed by: silence on -hackers


# 76078 27-Apr-2001 jhb

Overhaul of the SMP code. Several portions of the SMP kernel support have
been made machine independent and various other adjustments have been made
to support Alpha SMP.

- It splits the per-process portions of hardclock() and statclock() off
into hardclock_process() and statclock_process() respectively. hardclock()
and statclock() call the *_process() functions for the current process so
that UP systems will run as before. For SMP systems, it is simply necessary
to ensure that all other processors execute the *_process() functions when the
main clock functions are triggered on one CPU by an interrupt. For the alpha
4100, clock interrupts are delivered in a staggered broadcast fashion, so
we simply call hardclock/statclock on the boot CPU and call the *_process()
functions on the secondaries. For x86, we call statclock and hardclock as
usual and then call forward_hardclock/statclock in the MD code to send an IPI
to cause the AP's to execute forward_hardclock/statclock which then call the
*_process() functions.
- forward_signal() and forward_roundrobin() have been reworked to be MI and to
involve less hackery. Now the cpu doing the forward sets any flags, etc. and
sends a very simple IPI_AST to the other cpu(s). AST IPIs now just basically
return so that they can execute ast() and don't bother with setting the
astpending or needresched flags themselves. This also removes the loop in
forward_signal() as sched_lock closes the race condition that the loop worked
around.
- need_resched(), resched_wanted() and clear_resched() have been changed to take
a process to act on rather than assuming curproc so that they can be used to
implement forward_roundrobin() as described above.
- Various other SMP variables have been moved to a MI subr_smp.c and a new
header sys/smp.h declares MI SMP variables and API's. The IPI API's from
machine/ipl.h have moved to machine/smp.h which is included by sys/smp.h.
- The globaldata_register() and globaldata_find() functions as well as the
SLIST of globaldata structures have become MI and moved into subr_smp.c.
Also, the globaldata list is only available if SMP support is compiled in.

Reviewed by: jake, peter
Looked over by: eivind


# 73922 07-Mar-2001 jhb

Use the proc lock to protect p_pptr when waking up our parent in cpu_exit()
and remove the mpfixme() message that is now fixed.


# 72930 22-Feb-2001 peter

Activate USER_LDT by default. The new thread libraries are going to
depend on this. The linux ABI emulator tries to use it for some linux
binaries too. VM86 had a bigger cost than this and it was made default
a while ago.

Reviewed by: jhb, imp


# 72746 20-Feb-2001 jhb

- Don't call clear_resched() in userret(), instead, clear the resched flag
in mi_switch() just before calling cpu_switch() so that the first switch
after a resched request will satisfy the request.
- While I'm at it, move a few things into mi_switch() and out of
cpu_switch(), specifically set the p_oncpu and p_lastcpu members of
proc in mi_switch(), and handle the sched_lock state change across a
context switch in mi_switch().
- Since cpu_switch() no longer handles the sched_lock state change, we
have to setup an initial state for sched_lock in fork_exit() before we
release it.


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarly, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
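
For a concrete before/after feel (illustrative only; foo_mtx stands in for
whatever lock a caller holds, sched_lock is the spin lock named above):

    /* Old interface: */
    mtx_enter(&foo_mtx, MTX_DEF);
    mtx_exit(&foo_mtx, MTX_DEF);

    /* New interface, sleep (MTX_DEF) locks: */
    mtx_lock(&foo_mtx);
    mtx_unlock(&foo_mtx);

    /* New interface, spin (MTX_SPIN) locks: */
    mtx_lock_spin(&sched_lock);
    mtx_unlock_spin(&sched_lock);

    /* Flags, where still needed, go through the _flags wrappers: */
    mtx_lock_flags(&foo_mtx, MTX_QUIET);
    mtx_unlock_flags(&foo_mtx, MTX_QUIET);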


# 71785 29-Jan-2001 peter

Send "#if NISA > 0" to the bit-bucket and replace it with an option.
These were compile-time "is the isa code present?" tests and not
'how many isa busses' tests.


# 71528 24-Jan-2001 jhb

Setup the return values for a child process in the trapframe when we setup
the rest of the trapframe instead of doing it in fork_return().


# 71257 19-Jan-2001 peter

Use #ifdef DEV_NPX from opt_npx.h instead of #if NNPX > 0 from npx.h
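
Roughly, the resulting pattern (the #ifdef body is a placeholder):

    #include "opt_npx.h"            /* replaces #include "npx.h" */

    #ifdef DEV_NPX                  /* was: #if NNPX > 0 */
            /* FPU (npx) specific code */
    #endif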


# 70861 10-Jan-2001 jake

Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables
other than curproc.


# 70317 23-Dec-2000 jake

Protect proc.p_pptr and proc.p_children/p_sibling with the
proctree_lock.

linprocfs not locked pending response from informal maintainer.

Reviewed by: jhb, -smp@


# 69781 08-Dec-2000 dwmalone

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>
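
The conversion pattern, with illustrative names (sc, M_DEVBUF):

    /* Before: */
    sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK);
    bzero(sc, sizeof(*sc));

    /* After: */
    sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK | M_ZERO);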


# 69779 08-Dec-2000 jake

Revert the previous change I made to cpu_switch. It doesn't help as
much as I thought it would and according to bde was a pessimization.


# 69536 02-Dec-2000 jake

Change cpu_switch to explicitly popl the callers program counter and
pushl that of the new process, rather than doing a movl (%esp) and
assuming that the stack has been set up right. This makes the initial
stack setup slightly more sane, and will make it easier to stick
an interrupted process onto the run queue without its knowing.


# 67551 25-Oct-2000 jhb

- Overhaul the software interrupt code to use interrupt threads for each
type of software interrupt. Roughly, what used to be a bit in spending
now maps to a swi thread. Each thread can have multiple handlers, just
like a hardware interrupt thread.
- Instead of using a bitmask of pending interrupts, we schedule the specific
software interrupt thread to run, so spending, NSWI, and the shandlers
array are no longer needed. We can now have an arbitrary number of
software interrupt threads. When you register a software interrupt
thread via sinthand_add(), you get back a struct intrhand that you pass
to sched_swi() when you wish to schedule your swi thread to run.
- Convert the name of 'struct intrec' to 'struct intrhand' as it is a bit
more intuitive. Also, prefix all the members of struct intrhand with
'ih_'.
- Make swi_net() a MI function since there is now no point in it being
MD.

Submitted by: cp


# 67477 23-Oct-2000 jhb

Don't dink with interrupts in vm_page_zero_idle(). This code assumed it
was being called with interrupts disabled, when it was actually being called
with them enabled.

Pointed out by: tegge


# 67360 20-Oct-2000 jhb

- machine/mutex.h -> sys/mutex.h
- Use cpu_throw() instead of cpu_switch() during cpu_exit() since we don't
need to save our previous state.


# 67164 15-Oct-2000 phk

Remove unneeded #include <machine/clock.h>


# 65557 06-Sep-2000 jasone

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 64529 11-Aug-2000 peter

Clean up some low level bootstrap code:

- stop using the evil 'struct trapframe' argument for mi_startup()
(formerly main()). There are much better ways of doing it.
- do not use prepare_usermode() - setregs() in execve() will do it
all for us as long as the p_md.md_regs pointer is set. (which is
now done in machdep.c rather than init_main.c. The Alpha port did it
this way all along and is much cleaner).
- collect all the magic %cr0 etc register settings into one place and
have the AP's call that instead of using magic numbers (!!) that keep
changing over and over again.
- Make it safe to call kthread_create() earlier, including during the
device probe sequence. It doesn't need the callback mechanism that
NetBSD's version uses.
- kthreads created this way are root-less as they exist before the root
filesystem is mounted. init(1) is set up so that it acquires the root
pointers prior to running. If other kthreads want filesystem access
we can make this code more generic.
- set all threads start times once we have decided what time it is.
- init uses a trampoline rather than the evil prepare_usermode() hack.
- kern_descrip.c has a couple of tweaks to deal with forking when there
is no rootdir or cwd etc.
- adjust the early SYSINIT() sequence so that a few prerequisites are in
place. eg: make sure the run queue is initialized before doing forks.

With this, the USB code can easily create a kthread to do the device
tree discovery. (I have tested it, it works nicely).

There are still some open issues before this is truly useful.
- tsleep() does not like working before the clock is running. It
sort-of tries to spin wait, but it can do more useful things now.
- stopping a kthread in kld code at unload time is "interesting" but
we have a solution for that.

The Alpha code needs no changes for this. It already uses pretty much the
same strategies, but a little cleaner.


# 61469 10-Jun-2000 peter

Add option BROKEN_KEYBOARD_RESET to an opt_*.h file


# 60041 05-May-2000 phk

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include, according to bde's teachings on the
subject of nested includes.

Disk drivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.h> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter
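
So a consumer that still needs the buffer cache now includes both headers
explicitly, e.g. (illustrative):

    #include <sys/bio.h>    /* prerequisite, no longer nested */
    #include <sys/buf.h>    /* only if buffer-cache access is needed */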


# 58717 28-Mar-2000 dillon

Commit major SMP cleanups and move the BGL (big giant lock) in the
syscall path inward. A system call may select whether it needs the MP
lock or not (the default being that it does need it).

A great deal of conditional SMP code for various dead-ended experiments
has been removed. 'cil' and 'cml' have been removed entirely, and the
locking around the cpl has been removed. The conditional
separately-locked fast-interrupt code has been removed, meaning that
interrupts must hold the CPL now (but they pretty much had to anyway).
Another reason for doing this is that the original separate-lock for
interrupts just doesn't apply to the interrupt thread mechanism being
contemplated.

Modifications to the cpl may now ONLY occur while holding the MP
lock. For example, if an otherwise MP safe syscall needs to mess with
the cpl, it must hold the MP lock for the duration and must (as usual)
save/restore the cpl in a nested fashion.

This is precursor work for the real meat coming later: avoiding having
to hold the MP lock for common syscalls and I/O's and interrupt threads.
It is expected that the spl mechanisms and new interrupt threading
mechanisms will be able to run in tandem, allowing a slow piecemeal
transition to occur.

This patch should result in a moderate performance improvement due to
the considerable amount of code that has been removed from the critical
path, especially the simplification of the spl*() calls. The real
performance gains will come later.

Approved by: jkh
Reviewed by: current, bde (exception.s)
Some work taken from: luoqi's patch


# 58345 20-Mar-2000 phk

Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd. The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue. It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch was machine generated (Thanks to style(9) compliance!)

Vinum users: Greg has not had time to test this yet, be careful.
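
A sketch of the new convention (bp stands for a struct buf pointer; the
BIO_READ/BIO_WRITE command names are assumed here):

    if (bp->b_iocmd == BIO_READ) {
            /* read path */
    } else {
            /* write path: exactly one command bit is set */
    }

    /* B_CALL is gone; test the completion handler directly: */
    if (bp->b_iodone != NULL)
            (*bp->b_iodone)(bp);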


# 57362 20-Feb-2000 bsd

Don't forget to reset the hardware debug registers when a process that
was using them exits.

Don't allow a user process to cause the kernel to take a TRCTRAP on a
user space address.

Reviewed by: jlemon, sef
Approved by: jkh


# 54188 06-Dec-1999 luoqi

User ldt sharing.


# 52647 30-Oct-1999 alc

The core of this patch is to vm/vm_page.h. The effects are two-fold: (1) to
eliminate an extra (useless) level of indirection in half of the page
queue accesses and (2) to use a single name for each queue throughout,
instead of, e.g., "vm_page_queue_active" in some places and
"vm_page_queues[PQ_ACTIVE]" in others.

Reviewed by: dillon


# 52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial, bordering-on-silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 52121 11-Oct-1999 peter

Zap unneeded #includes

Submitted by: phk


# 51474 20-Sep-1999 dillon

Fix bug in pipe code relating to writes of mmap'd but illegal address
spaces which cross a segment boundary in the page table. pmap_kextract()
is not designed for access to the user space portion of the page
table and cannot handle the null-page-directory-entry case.

The fix is to have vm_fault_quick() return a success or failure which
is then used to avoid calling pmap_kextract().


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49326 31-Jul-1999 alc

Change the type of vpgqueues::lcnt from "int *" to "int". The indirection
served no purpose.


# 48974 22-Jul-1999 alc

Reduce the number of "magic constants" used for page coloring
by one: PQ_PRIME2 and PQ_PRIME3 are used to accomplish the same
thing at different places in the kernel. Drop PQ_PRIME3.


# 48391 01-Jul-1999 peter

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


# 47678 01-Jun-1999 jlemon

Unifdef VM86.

Reviewed by: silence on -current


# 45821 19-Apr-1999 peter

unifdef -DVM_STACK - it's been on for a while for x86 and was checked
and appeared to be working for the Alpha some time ago.


# 44146 19-Feb-1999 luoqi

Hide access to vmspace:vm_pmap with inline function vmspace_pmap(). This
is the preparation step for moving pmap storage out of vmspace proper.

Reviewed by: Alan Cox <alc@cs.rice.edu>
Matthew Dillon <dillon@apollo.backplane.com>


# 44078 16-Feb-1999 dfr

* Change sysctl from using linker_set to construct its tree using SLISTs.
This makes it possible to change the sysctl tree at runtime.

* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.

Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)


# 43758 08-Feb-1999 dillon

Adjust idle zero-page fill hysteresis based on tests. Use 2/3 and 4/5
zero-fill levels.

Adjust comment for ozfod in vmmeter.h - this counter represents
non-optimal ( on the fly ) zero fills, not prefills.


# 43752 07-Feb-1999 dillon

Rip out PQ_ZERO queue. PQ_ZERO functionality is now combined in with
PQ_FREE. There is little operational difference other than the kernel
being a few kilobytes smaller and the code being more readable.

* vm_page_select_free() has been *greatly* simplified.
* The PQ_ZERO page queue and supporting structures have been removed
* vm_page_zero_idle() revamped (see below)

PG_ZERO setting and clearing has been migrated from vm_page_alloc()
to vm_page_free[_zero]() and will eventually be guaranteed to remain
tracked throughout a page's life ( if it isn't already ).

When a page is freed, PG_ZERO pages are appended to the appropriate
tailq in the PQ_FREE queue while non-PG_ZERO pages are prepended.
When locating a new free page, PG_ZERO selection operates from within
vm_page_list_find() ( get page from end of queue instead of beginning
of queue ) and then only occurs in the nominal critical path case. If
the nominal case misses, both normal and zero-page allocation devolves
into the same _vm_page_list_find() select code without any specific
zero-page optimizations.

Additionally, vm_page_zero_idle() has been revamped. Hysteresis has been
added and zero-page tracking adjusted to conform with the other changes.
Currently hysteresis is set at 1/3 (lo) and 1/2 (hi) the number of free
pages. We may wish to increase both parameters as time permits. The
hysteresis is designed to avoid silly zeroing in borderline allocation/free
situations.


# 43387 29-Jan-1999 dillon

More -Wall / -Wcast-qual cleanup. Also, EXEC_SET can't use
C_DECLARE_MODULE due to the linker_file_sysinit() function
making modifications to the data.


# 42360 06-Jan-1999 julian

Add (but don't activate) code for a special VM option to make
downward growing stacks more general.
Add (but don't activate) code to use the new stack facility
when running threads, (specifically the linux threads support).
This allows people to use both linux compiled linuxthreads, and also the
native FreeBSD linux-threads port.

The code is conditional on VM_STACK. Not using this will
produce the old heavily tested system.

Submitted by: Richard Seaman <dick@tar.com>


# 41868 16-Dec-1998 bde

Removed bogus casts of USRSTACK and/or the other operand in binary
expressions involving USRSTACK.


# 40794 31-Oct-1998 peter

Add John Dyson's SYSCTL descriptions, and an export of more stats to
a sysctl hierarchy (vm.stats.*). SYSCTL descriptions are only present
in source; they are not compiled into the binaries, so they take up no memory.
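
The shape of such an export, as a hedged example (the node, variable and
description text are illustrative, not the actual additions):

    SYSCTL_INT(_vm_stats_misc, OID_AUTO, zero_page_count, CTLFLAG_RD,
        &zero_page_count, 0,
        "Number of pre-zeroed pages kept on the free queue");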


# 40286 13-Oct-1998 dg

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# 39703 28-Sep-1998 tegge

Initialize pcb_mpnest to 1 in the child process in cpu_fork(). This should
fix the 50% idle problem that the ELF /sbin/init triggered. The problem
appeared when the last context switch before a fork() call was due to
the kernel faulting in user pages via normal page faults (e.g. copyin).
Reviewed by: Peter Wemm <peter@netplex.com.au>


# 39648 25-Sep-1998 peter

Goodbye BOUNCE_BUFFERS, for a hack it has served us well.

The last consumer of this code (the old SCSI system) has left us and
the CAM code does its own bouncing. The isa dma system has been
doing its own bouncing for a while too.

Reviewed by: core


# 38422 18-Aug-1998 msmith

Presently there is only one `currentldt' variable for all cpus
in a SMP system. Unexpected things could happen if each cpu
has a different ldt setting and one cpu tries to use the value
of currentldt set by another cpu.

The fix is to move currentldt to the per-cpu area. It includes
patches I filed in PR i386/6219 which are also user ldt related.

PR: i386/7591, i386/6219
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>


# 36168 18-May-1998 tegge

Disallow reading the current kernel stack. Only the user structure and
the current registers should be accessible.
Reviewed by: David Greenman <dg@root.com>


# 36135 17-May-1998 tegge

Add forwarding of roundrobin to other cpus. This gives a more regular
update of cpu usage as shown by top when one process is cpu bound
(no system calls) while the system is otherwise idle (except for top).

Don't attempt to switch to the BSP in boot(). If the system was idle when
an interrupt caused a panic, this won't work. Instead, switch to the BSP
in cpu_reset.

Remove some spurious forward_statclock/forward_hardclock warnings.


# 36095 16-May-1998 kato

Some newer PC-98 machines may cause a "Windows Protection Fault" when booting
Windows 95 after rebooting FreeBSD without power off. In PC-98
system, reboot mode is set via I/O port 0x37 in cpu_reset(), and
accessing this port is the reason for the problem. To avoid the
fault, the current status of the reboot mode should be checked before
accessing the I/O port.


# 34840 23-Mar-1998 jlemon

Add the ability to make real-mode BIOS calls from the kernel. Currently,
everything is contained inside #ifdef VM86, so this option must be
present in the config file to use this functionality.

Thanks to Tor Egge, these changes should work on SMP machines. However,
it may not be thoroughly SMP-safe.

Currently, the only BIOS calls made are memory-sizing routines at bootup,
these replace reading the RTC values.


# 34643 17-Mar-1998 kato

Make EPSON_BOUNCEDMA a new-style option.


# 34569 14-Mar-1998 tegge

Don't use the standard macros for disabling/enabling interrupts.
On SMP systems, this left the mpintr_lock simplelock locked, causing
further calls to disable_intr to deadlock or panic.


# 34507 12-Mar-1998 bde

Fixed breakage of the !SMP case in vm_page_zero_idle() in the
previous commit. Opportunities to clean pages were often missed,
and leaving of the idle state was sometimes delayed until the next
interrupt (after any that occurred while cleaning).

Fixed an unstaticization, a syntax error and a style bug in the
previous commit.


# 33817 25-Feb-1998 dyson

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


# 33307 13-Feb-1998 bde

Ifdefed some npx code. npx should be optional again.


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32884 30-Jan-1998 dyson

Make the bounce buffer code a little more robust when space isn't
available. If there isn't bounce space available, the bounce code
is disabled. This will allow most large systems to run properly
when the bounce space is mistakenly allocated above 16MB.


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, clean up, and slightly speed up the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 32617 19-Jan-1998 tegge

The removal of a page from the free queue in vm_page_zero_idle was
incomplete. Also set m->queue, in order to prevent vm_page_select_free
from selecting the page being zeroed.


# 32516 15-Jan-1998 gibbs

Implementation of Bus DMA for FreeBSD-x86. This is sufficient to do
page level bounce buffering, but there are still some issues left to
address.


# 32010 27-Dec-1997 peter

#include "opt_user_ldt.h" so that the #ifdef USER_LDT checks can work, as
commented about at length in the PR audit trail.

PR: 2412


# 31322 20-Nov-1997 bde

Removed a duplicate (sloppy common-style) definition.

Fixed some style bugs.


# 31249 18-Nov-1997 bde

Don't #include <machine/smp.h> even in the SMP case. Fixed the one
place that depended on it. The "bazillion warnings" mentioned in the
log for rev.1.45 apparently aren't a problem any more. It is hard
to be sure because the SIMPLELOCK_DEBUG option turns off (and breaks)
things in the SMP case.


# 30265 10-Oct-1997 peter

Convert the VM86 option from a global option to an option only depended
on by the files that use it. Changing the VM86 option now only causes
a recompile of a dozen files or so rather than the entire kernel.


# 29330 13-Sep-1997 joerg

Revert the logic behind my last change, and use a function called
`is_physical_memory()' now for the decision whether to dump some
region of memory or not.

Suggested by: davidg


# 29280 10-Sep-1997 joerg

Do not ever try to coredump adapter memory regions.

PR: 4486
Submitted by: tegge@idi.ntnu.no (Tor Egge)

Implement a function is_adapter_memory() in order to determine what
should nto be dumped at all. Currently, only populated with the ``ISA
memory hole''. Adapter regions of other busses should be added.


# 29041 02-Sep-1997 bde

Removed unused #includes.


# 28808 26-Aug-1997 peter

Clean up the SMP AP bootstrap and eliminate the wretched idle procs.

- We now have enough per-cpu idle context, the real idle loop has been
revived (cpu's halt now with nothing to do).
- Some preliminary support for running some operations outside the
global lock (eg: zeroing "free but not yet zeroed pages") is present
but appears to cause problems. Off by default.
- the smp_active sysctl now behaves differently. It's merely a 'true/false'
option. Setting smp_active to zero causes the AP's to halt in the idle
loop and stop scheduling processes.
- bootstrap is a lot safer. Instead of sharing a statically compiled in
stack a number of times (which has caused lots of problems) and then
abandoning it, we use the idle context to boot the AP's directly. This
should help >2 cpu support since the bootlock stuff was in doubt.
- print physical apic id in traps.. helps identify private pages getting
out of sync. (You don't want to know how much hair I tore out with this!)

More cleanup to follow, this is more of a checkpoint than a
'finished' thing.


# 27993 08-Aug-1997 dyson

VM86 kernel support.
Work done by BSDI, Jonathan Lemon <jlemon@americantv.com>,
Mike Smith <msmith@gsoft.com.au>, Sean Eric Fagan <sef@kithrup.com>,
and probably a lot of others.
Submitted by: Jonathan Lemon <jlemon@americantv.com>


# 27535 20-Jul-1997 bde

Removed unused #includes.


# 26954 26-Jun-1997 tegge

Back out a bad commit.


# 26943 25-Jun-1997 tegge

Block some interrupts during the call to pmap_zero_page in
vm_page_zero_idle. This fixes some occurrences of the problem
reported in PR kern/3216: "panic: pmap_zero_page: CMAP busy"


# 26812 22-Jun-1997 peter

Preliminary support for per-cpu data pages.

This eliminates a lot of #ifdef SMP type code. Things like _curproc reside
in a data page that is unique on each cpu, eliminating the expensive macros
like: #define curproc (SMPcurproc[cpunumber()])

There are some unresolved bootstrap and address space sharing issues at
present, but Steve is waiting on this for other work. There is still some
strictly temporary code present that isn't exactly pretty.

This is part of a larger change that has run into some bumps, this part is
standalone so it should be safe. The temporary code goes away when the
full idle cpu support is finished.

Reviewed by: fsmp, dyson


# 25557 07-May-1997 peter

clean up forked child creation. This is simplified also by having
md_regs being struct trapframe *. Do a npxsave() if needed and copy the
pcb rather than use the increasingly defunct savectx(). Copy %edi and
%ebp explicitly.

Submitted by: bde

XXX npxproc could be declared in npx.h so the externs with smp fruit
are not needed.


# 24980 16-Apr-1997 kato

Use reset port before clearing page table in cpu_reset if PC98 is
defined. Clearing page table could hang some new PC-98.


# 24691 07-Apr-1997 peter

The biggie: Get rid of the UPAGES from the top of the per-process address
space. (!)

Have each process use the kernel stack and pcb in the kvm space. Since
the stacks are at a different address, we cannot copy the stack at fork()
and allow the child to return up through the function call tree to return
to user mode - create a new execution context and have the new process
begin executing from cpu_switch() and go to user mode directly.
In theory this should speed up fork a bit.

Context switch the tss_esp0 pointer in the common tss. This is a lot
simpler than switching the gdt[GPROC0_SEL].sd.sd_base pointer
to each process's tss since the esp0 pointer is a 32 bit pointer, and the
sd_base setting is split into three different bit sections at non-aligned
boundaries and requires a lot of twiddling to reset.

The 8K of memory at the top of the process space is now empty, and unmapped
(and unmappable, it's higher than VM_MAXUSER_ADDRESS).

Simplify the pmap code that manages process contexts; we no longer have to
double map the UPAGES, which simplifies and should measurably speed up fork().

The following parts came from John Dyson:

Set PG_G on the UPAGES that are now in kernel context, and invalidate
them when swapping them out.

Move the upages object (upobj) from the vmspace to the proc structure.

Now that the UPAGES (pcb and kernel stack) are out of user space, make
rfork(..RFMEM..) do what was intended by sharing the vmspace
entirely via reference counting rather than simply inheriting the mappings.


# 24361 29-Mar-1997 bde

Don't keep cpu interrupts enabled during the lookup in vm_page_zero_idle().
Lookup isn't done every time the system goes idle now, but it can still
take > 1800 instructions in the worst case, so if cpu interrupts are kept
disabled then it might lose 20 characters of sio input at 115200 bps.

Fixed style in vm_page_zero_idle().


# 24099 22-Mar-1997 dyson

Decrease the latency/overhead in the prezero code when there is
an adequate number of prezeroed pages.


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 21181 01-Jan-1997 se

Add code to copy the LDT, if required.

This code was sent to me by Bruce Evans, and seems to fix some
possible kernel panic in case of an execution error. It did not
cause any problems on my system, but I never observed the
problem this patch is supposed to fix, anyway.

This patch is a NOP, unless the kernel is built with "options
USER_LDT", and doesn't affect the GENERIC kernel for this reason.

I want to have it in 2.2: it fixes a bug ...

Submitted by: bde


# 19269 30-Oct-1996 asami

More merge and update.

(1) deleted #if 0

pc98/pc98/mse.c

(2) hold per-unit I/O ports in ed_softc

pc98/pc98/if_ed.c
pc98/pc98/if_ed98.h

(3) merge more files by segregating changes into headers.

new file (moved from pc98/pc98):

i386/isa/aic_98.h

deleted:

well, it's already in the commit message so I won't repeat the
long list here ;)

Submitted by: The FreeBSD(98) Development Team


# 18937 15-Oct-1996 dyson

Move much of the machine dependent code from vm_glue.c into
pmap.c. Along with the improved organization, small proc fork
performance is now about 5%-10% faster.


# 18548 28-Sep-1996 dyson

Essentially rename pmap_update to be invltlb. It is a very machine
dependent operation, and not really a correct name. invltlb and invlpg
are more descriptive, and in the case of invlpg, a real opcode.

Additionally, fix the tlb management code for 386 machines.


# 18169 08-Sep-1996 dyson

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 17108 12-Jul-1996 bde

Don't use NULL in non-pointer contexts.


# 16532 20-Jun-1996 dg

Properly account for non-page aligned buffers.


# 16530 19-Jun-1996 dg

Minor KNF formatting change to vmapbuf() and vunmapbuf().


# 16499 19-Jun-1996 dyson

Clean up vmapbuf and vunmapbuf significantly. The previous code was
very rough.


# 15809 18-May-1996 dyson

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existent condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# 15543 02-May-1996 phk

removed:
CLBYTES PD_SHIFT PGSHIFT NBPG PGOFSET CLSIZELOG2 CLSIZE pdei()
ptei() kvtopte() ptetov() ispt() ptetoav() &c &c
new:
NPDEPG

Major macro cleanup.


# 15538 02-May-1996 phk

First pass at cleaning up macros relating to pages, clusters and all that.


# 15379 25-Apr-1996 phk

Fix cpu_fork for real.

Suggested by: bde


# 15301 18-Apr-1996 phk

Fix a bogon. cpu_fork & savectx expected cpu_switch to restore %eax,
they shouldn't.


# 15117 07-Apr-1996 bde

Removed never-used #includes of <machine/cpu.h>. Many were apparently
copied from bad examples.


# 14348 02-Mar-1996 jkh

USER_LDT changes for the Willows TwinXPDK toolkit. Only tested with WINE
since that's the only other USER_LDT using code that I know of.
Submitted by: Gary Jennejohn <Gary.Jennejohn@munich.netsurf.de>
Obtained from: {Origin of diffs may be someone else - I only rec'd them from
Gary}


# 13915 05-Feb-1996 dg

Unspam my changes in rev 1.54 that John spammed in rev 1.55.


# 13909 04-Feb-1996 dyson

Changed vm_fault_quick in vm_machdep.c to be global. Needed for
new pipe code.


# 13908 04-Feb-1996 dg

Rewrote cpu_fork so that it doesn't use pmap_activate, and removed
pmap_activate since it's not used anymore. Changed cpu_fork so that
it uses one line of inline assembly rather than calling mvesp() to
get the current stack pointer. Removed mvesp() since it is no longer
being used.


# 13740 30-Jan-1996 dg

savectx() strikes again: the saved stack pointer wasn't properly adjusted
to remove the return address. It's only the frame pointer and luck that
allowed the code to work at all.


# 13580 23-Jan-1996 dg

Simplified savectx() a little and fixed a bug that caused it to return
garbage in the child process rather than "1" like it is supposed to.

Reviewed by: bde


# 13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 13265 05-Jan-1996 wollman

Convert BOUNCE_BUFFERS and BOUNCEPAGES to new option scheme.


# 12819 14-Dec-1995 phk

A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.


# 12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# 12722 10-Dec-1995 phk

Staticize and cleanup.
remove a TON of #includes from machdep.


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12417 20-Nov-1995 phk

Remove unused vars.


# 12357 18-Nov-1995 bde

Fixed the type of vm_fault_quick() - don't convert types back and forth
through bogus immediate types.

Added prototypes.


# 11116 01-Oct-1995 dg

Insert zeroed pages at the head of the zero queue rather than at the tail.
A measurable performance improvement results from the potential for the
page to be partially cached when it is eventually used.


# 10546 03-Sep-1995 dyson

Machine dependent routines to support pre-zeroed free pages. This
significantly improves demand zero performance.


# 9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


# 9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Its marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8590 18-May-1995 dg

Added "BROKEN_KEYBOARD_RESET" option to disable using the keyboard reset
in cpu_reset(). Some MBs don't deal with this properly.

Submitted by: Rod Grimes


# 8211 01-May-1995 dyson

Fixed a problem that could cause left-over pv_entries and, as a
side effect, removed some legacy code that was necessary
when we called vm_fault inside of vm_fault_quick instead of using
the kernel/user space byte move routines.


# 8074 26-Apr-1995 rgrimes

Add outb to keyboard controller to do a cpu_reset; this fixes 2 known
cases of motherboards that failed to reboot.
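
For context, the reset described here (and disabled by the
BROKEN_KEYBOARD_RESET option above) works by pulsing the CPU reset line
through the 8042 keyboard controller. The sketch below illustrates only
the technique; the helper and function names are stand-ins, not the
actual cpu_reset() code.

#include <stdint.h>

/*
 * Minimal port-output helper (i386 inline assembly).  In the kernel
 * this comes from <machine/cpufunc.h>; it is open-coded here so the
 * sketch stands alone.
 */
static inline void
outb_sketch(uint16_t port, uint8_t val)
{
        __asm__ __volatile__("outb %b0, %w1" : : "a" (val), "Nd" (port));
}

/*
 * Writing command 0xFE to the 8042 command port (0x64) pulses the CPU
 * reset line.  Boards covered by BROKEN_KEYBOARD_RESET skip this path.
 */
static void
keyboard_reset_sketch(void)
{
        outb_sketch(0x64, 0xFE);
        for (;;)
                ;       /* spin until the reset takes effect */
}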


# 7170 19-Mar-1995 dg

Removed redundant newlines that were in some panic strings.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 6817 01-Mar-1995 dg

Use su/fubyte instead of directly touching the user's address space.
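
The fetch/store primitives referred to here go through the kernel's
fault-handling path instead of dereferencing a user pointer directly.
A rough usage sketch follows; the wrapper is hypothetical and only
fubyte() itself is the real primitive.

#include <sys/types.h>
#include <sys/errno.h>
#include <sys/systm.h>          /* fubyte() prototype */

/*
 * Illustrative wrapper (not the vm_machdep.c change itself): fubyte()
 * fetches one byte from a user address through the fault-handling
 * machinery and returns -1 if the address is unmapped, so a bad user
 * pointer never takes the kernel down.
 */
static int
peek_user_byte(const void *uaddr, u_char *valp)
{
        int b;

        b = fubyte(uaddr);
        if (b == -1)
                return (EFAULT);
        *valp = (u_char)b;
        return (0);
}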


# 6579 20-Feb-1995 dg

Use of vm_allocate() and vm_deallocate() has been deprecated.


# 5771 21-Jan-1995 bde

Don't use mi_switch() to terminate cpu_exit(). Calling it just happened to
work (mi_switch() counted the last timeslice again but this didn't affect
the exiting process' rusage because the rusage had already been finalized).

Remove stale comment.


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. They
represent the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it; it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size; now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 3436 08-Oct-1994 phk

db_disasm.c: Unused var zapped.
pmap.c: tons of unused vars zapped, various other warnings silenced.
trap.c: unused vars zapped.
vm_machdep.c: A wrong argument, which by chance did the right thing, was
corrected.


# 2455 02-Sep-1994 dg

Removed all vestiges of tlbflush(). Replaced them with calls to pmap_update().
Made pmap_update an inline assembly function.
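
On the i386, this kind of global TLB flush reduces to reloading %cr3,
so inlining it removes a function call from a very hot path.  A minimal
stand-alone sketch of the technique (not the pmap_update() of that era):

/*
 * Reloading %cr3 with its current value invalidates the (non-global)
 * TLB entries, which is all a global TLB flush needs on the i386.
 */
static inline void
tlb_flush_sketch(void)
{
        unsigned long cr3;

        __asm__ __volatile__("movl %%cr3, %0" : "=r" (cr3));
        __asm__ __volatile__("movl %0, %%cr3" : : "r" (cr3) : "memory");
}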


# 2422 31-Aug-1994 dg

Rather than exclude bounce buffers support with NOBOUNCE, include it
with BOUNCE_BUFFERS. This is more intuitive, and is better for future
multiplatform support. Added BOUNCE_BUFFERS option to the GENERIC and
LINT kernel config files.


# 1896 07-Aug-1994 dg

Made pmap_kenter "TLB safe". ...and then removed all the pmap_updates that
are no longer needed because of this.


# 1894 07-Aug-1994 dg

Don't kremove process VM pages (oops!). This was the cause of the instability
that was introduced last night.

Submitted by: John Dyson


# 1890 06-Aug-1994 dg

Fixed various prototype problems with the pmap functions and the subsequent
problems that fixing them caused.


# 1889 06-Aug-1994 dg

Incorporated 1.1.5 improvements to the bounce buffer code (i.e. make it
actually work), and additionally improved its performance via new pmap
routines and "pbuf" allocation policy.

Submitted by: John Dyson


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1415 25-Apr-1994 dg

From John Dyson:

Fixed physio in the 386 case - write faults weren't properly implemented.


# 1379 20-Apr-1994 dg

Bug fixes and performance improvements from John Dyson and myself:

1) check va before clearing the page clean flag. Not doing so was
causing the vnode pager error 5 messages when paging from
NFS. (pmap.c)
2) put back interrupt protection in idle_loop. Bruce didn't think
it was necessary, John insists that it is (and I agree). (swtch.s)
3) various improvements to the clustering code (vm_machdep.c). It's
now enabled/used by default.
4) bad disk blocks are now handled properly when doing clustered IOs.
(wd.c, vm_machdep.c)
5) bogus bad block handling fixed in wd.c.
6) algorithm improvements to the pageout/pagescan daemons. It's amazing
how well 4MB machines work now.


# 1362 14-Apr-1994 dg

Changes from John Dyson and myself:

1) Removed all instances of disable_intr()/enable_intr() and changed
them back to splimp/splx. The previous method was done to improve
the performance, but Bruce's recent changes to inline spl* have
made this unnecessary. (A usage sketch of the spl pattern follows this list.)
2) Cleaned up vm_machdep.c considerably. Probably fixed a few bugs, too.
3) Added a new mechanism for collecting page statistics - now done by
a new system process "pagescan". Previously this was done by the
pageout daemon, but this proved to be impractical.
4) Improved the page usage statistics gathering mechanism - performance is
much improved in small memory machines.
5) Modified mbuf.h to enable the support for an external free routine when
using mbuf clusters. Added appropriate glue in various places to
allow this to work.
6) Adapted a suggested change to the NFS code from Yuval Yurom to take
advantage of #5.
7) Added fault/swap statistics support.
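
Regarding item 1: the spl pattern brackets a critical section with a
priority raise and a restore, and once the spl routines were inlined it
cost little more than the cli/sti pair it replaced.  A sketch of the
usage pattern is below; the spl interface was retired in later FreeBSD,
so the prototypes are restated here and the shared counter is made up
for illustration.

int splimp(void);               /* raise IPL, return the previous level */
void splx(int s);               /* restore a saved IPL */

static int pending_events;      /* hypothetical counter shared with an ISR */

static void
bump_pending_sketch(void)
{
        int s;

        s = splimp();           /* block interrupts that also touch the counter */
        pending_events++;
        splx(s);                /* drop back to the previous level */
}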


# 1335 05-Apr-1994 dg

from John Dyson:

1) fixed some bugs related to the bounce buffer code
2) vnode pager now supports clustered pageouts
3) experimental code for clustering all I/O via a new "cldisksort"
4) added >16MB check to Bustek driver
5) made some experimental algorithmic changes to the pageout daemon
6) fixed bugs in truncating mapped files (esp when mapped via NFS)
7) reorganized vnode pager I/O code


# 1314 30-Mar-1994 dg

Eliminated the "physstrat" wart and merged it into kern_physio.c. This
patch also fixes a bug which causes a kernel VM leak.


# 1312 30-Mar-1994 dg

New routine "pmap_kenter", designed to take advantage of the special
case of the kernel pmap.
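
The special case being exploited is that the kernel pmap is always
resident and its page tables are preallocated, so a kernel-only mapping
can be entered by writing the PTE in place, with no allocation and no
pv_entry bookkeeping.  A rough sketch of the idea; the names follow
later FreeBSD i386 conventions and this is not the historical routine.

#include <sys/types.h>
#include <vm/vm.h>
#include <machine/pmap.h>       /* pt_entry_t, vtopte(), PG_V, PG_RW */
#include <machine/cpufunc.h>    /* invlpg() */

/*
 * Illustrative kernel-only mapping: kernel page tables are always
 * present, so the PTE can be written directly and the single stale
 * TLB entry invalidated.
 */
static void
kenter_sketch(vm_offset_t va, vm_paddr_t pa)
{
        pt_entry_t *pte;

        pte = vtopte(va);               /* kernel PTEs are permanently mapped */
        *pte = pa | PG_RW | PG_V;       /* valid + writable */
        invlpg(va);                     /* flush the stale TLB entry for va */
}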


# 1307 24-Mar-1994 dg

From John Dyson: performance improvements to the new bounce buffer
code.


# 1298 23-Mar-1994 dg

Bounce buffers. From John Dyson with help from me.


# 1288 21-Mar-1994 dg

Changed dynamic stack grow code to grow by "SGROWSIZ" amount. Initially
allocate SGROWSIZ amount of stack. Also set vm_ssize to the initial
stack VM size. Increased DFLSSIZ stack rlimit default to 8MB.


# 1246 07-Mar-1994 dg

1) "Pre-faulting" in of pages into process address space
Eliminates vm_fault overhead on process startup and
mmap referenced data for in-memory pages.

(process startup time using in-memory segments *much* faster)

2) Even more efficient pmap code. Code partially cleaned up.
More comments yet to follow.

(generally more efficient pte management)

3) Pageout clustering (in addition to the FreeBSD V1.1 pagein
clustering).

(much faster paging performance on non-write behind disk
subsystems, slightly faster performance on other systems.)

4) Slightly changed vm_pageout code for more efficiency and
better statistics. Also, resist swapout a little more.

(less likely to pageout a recently used page)

5) Slight improvement to the page table page trap efficiency.

(generally faster system VM fault performance)

6) Defer creation of unnamed anonymous regions pager until needed.

(speeds up shared memory bss creation)

7) Remove possible deadlock from swap_pager initialization.

8) Enhanced procfs to provide "vminfo" about vm objects and user
pmaps.

9) Increased MCLSHIFT/MCLBYTES from 2K to 4K to improve net &
socket performance and to prepare for things to come.

John Dyson
dyson@implode.root.com
David Greenman
davidg@root.com


# 1127 08-Feb-1994 dg

Fixed bugs in stack grow code, and moved it back into a separate function
like it was originally. Also added back call to "grow" in sendsig now
that this routine actually works.


# 991 21-Jan-1994 dg

Remove some old, unused, major UGLY code.


# 974 14-Jan-1994 dg

"New" VM system from John Dyson & myself. For a run-down of the
major changes, see the log of any affected file in the sys/vm
directory (swap_pager.c for instance).


# 879 18-Dec-1993 wollman

Make everything compile with -Wtraditional. Make it easier to distribute
a binary link-kit. Make all non-optional options (pagers, procfs) standard,
and update LINT to reflect new symtab requirements.

NB: -Wtraditional will henceforth be forgotten. This editing pass was
primarily intended to detect any constructions where the old code might
have been relying on traditional C semantics or syntax. These were all
fixed, and the result of fixing some of them means that -Wall is now a
realistic possibility within a few weeks.


# 798 24-Nov-1993 wollman

Make the LINT kernel compile with -W -Wreturn-type -Wcomment -Werror, and
add same (sans -Werror) to Makefile for future compilations.


# 608 15-Oct-1993 rgrimes

genassym.c:
Remove NKMEMCLUSTERS; it is no longer defined or used.

locores.s:
Fix comment on PTDpde and APTDpde to be pde instead of pte
Add new equation for calculating location of Sysmap
Remove Bill's old #ifdef garbage for counting up memory;
that stuff will never be made to work and was just cluttering
up the file.

Add code that places the PTD, page table pages, and kernel
stack below the 640k ISA hole if there is room for it, otherwise
put this stuff all at 1MB. This fixes the 28K bogusity in
the boot blocks, that can now go away!

Fix the calculation of where first is to be dependent on
NKPDE so that we can skip over the above-mentioned areas.
The 28K thing is now 44K in size due to the increase in
kernel virtual memory space, but since we no longer have
to worry about that this is no big deal.

Use if NNPX > 0 instead of ifdef NPX for floating point code.

machdep.c
Change the calculation for the buffer cache to be
20% of all memory above 2MB and add back the upper limit
of 2/5's of the VM_KMEM_SIZE so that we do not eat ALL
of the kernel memory space on large memory machines, note
that this will not even come into effect unless you have
more than 32MB. The current buffer cache limit is 6.7MB
due to this calculation.

It seems that we were erroneously allocating bufpages pages
for buffer_map. buffer_map is UNUSED in this implementation
of the buffer cache, but since the map is referenced in
several if statements a quick fix was to simply allocate
1 vm page (but no real memory) to it.

pmap.h
Remove rcsid, don't want them in the kernel files!

Removed some cruft inside an #ifdef DEBUGx that caused
compiler errors if you were compiling this for debug.

Use the #defines for PD_SHIFT and PG_SHIFT in place of
constants.

trap.c:
Remove patch kit header and rcsid, fix $Id$.
Now include "npx.h" and use NNPX for controlling the
floating point code.

Remove a now completely invalid check for a maximum virtual
address; the virtual address now ends at 0xFFFFFFFF so
there is no more MAX!! (Thanks David, I completely missed
that one!)

vm_machdep.c
Remove patch kit header and rcsid, fix $Id$.
Now include "npx.h" and use NNPX for controlling the
floating point code.

Replace several 0xFE00000 constants with KERNBASE


# 434 10-Sep-1993 nate

Removed volatile functions which were causing grief in the system, since
volatile functions are undefined, and there is no reason to have them
in our kernel.


# 433 10-Sep-1993 rgrimes

This is just to shut the compiler up
===================================================================
RCS file: /a/cvs/386BSD/src/sys/i386/i386/vm_machdep.c,v
retrieving revision 1.3
diff -c -r1.3 vm_machdep.c
*** 1.3 1993/07/27 10:52:21
--- vm_machdep.c 1993/09/10 20:12:53
***************
*** 179,184 ****
--- 179,186 ----
#endif
splclock();
swtch();
+ /*NOTREACHED*/
+ for(;;);
}

cpu_wait(p) struct proc *p; {


# 200 27-Jul-1993 dg

* Applied fixes from Bruce Evans to fix COW bugs, >1MB kernel loading,
profiling, and various protection checks that cause security holes
and system crashes.
* Changed min/max/bcmp/ffs/strlen to be static inline functions
- included from cpufunc.h via systm.h. This change
improves performance in many parts of the kernel - up to 5% in the
networking layer alone. Note that this requires systm.h to be included
in any file that uses these functions; otherwise the symbols won't be
found at link time. (A minimal sketch of such inline helpers follows
this entry.)
* Fixed incorrect call to splx() in if_is.c
* Fixed bogus variable assignment to splx() in if_ed.c
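
A minimal sketch of the kind of static inline helpers the second item
above refers to (illustrative only, not the cpufunc.h/systm.h text of
the time):

/*
 * Inline versions compile down to a compare and a conditional move or
 * branch at each call site, avoiding the call overhead of the old
 * out-of-line versions while keeping ordinary function semantics.
 */
static inline int
imin_sketch(int a, int b)
{
        return (a < b ? a : b);
}

static inline int
imax_sketch(int a, int b)
{
        return (a > b ? a : b);
}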


# 140 18-Jul-1993 paul

Added volatile void to cpu_exit() in the hope that it would
stop gcc's warning about the function returning.

It hasn't, but the declaration is still correct.
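
"volatile void" on a function declaration was the pre-ANSI gcc spelling
of "this function does not return"; later compilers express the same
intent with a noreturn attribute.  A small sketch of the two spellings
(the body is a placeholder, not cpu_exit()):

/*
 * Old gcc:     volatile void cpu_exit();
 * Later gcc:   a noreturn attribute, as below.
 */
#include <stdlib.h>

static void never_returns_sketch(void) __attribute__((noreturn));

static void
never_returns_sketch(void)
{
        abort();        /* placeholder; a real cpu_exit() switches away */
}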


# 5 12-Jun-1993 rgrimes

This commit was generated by cvs2svn to compensate for changes in r4,
which included commits to RCS files with non-trunk default branches.


# 4 12-Jun-1993 rgrimes

Initial import, 0.1 + pk 0.2.4-B1