History log of /freebsd-10.1-release/sys/vm/vm_object.c
Revision Date Author Comments
# 272461 02-Oct-2014 gjb

Copy stable/10@r272459 to releng/10.1 as part of
the 10.1-RELEASE process.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation

# 270920 01-Sep-2014 kib

Fix a leak of wired pages when unwiring a PROT_NONE-mapped wired
region. Rework the handling of unwire to do it in batch, at both the
pmap and object level.

All commits below are by alc.

MFC r268327:
Introduce pmap_unwire().

MFC r268591:
Implement pmap_unwire() for powerpc.

MFC r268776:
Implement pmap_unwire() for arm.

MFC r268806:
pmap_unwire(9) man page.

MFC r269134:
When unwiring a region of an address space, do not assume that the
underlying physical pages are mapped by the pmap. This fixes a leak
of the wired pages on the unwiring of the region mapped with no access
allowed.

MFC r269339:
In the implementation of the new function pmap_unwire(), the call to
MOEA64_PVO_TO_PTE() must be performed before any changes are made to the
PVO. Otherwise, MOEA64_PVO_TO_PTE() will panic.

MFC r269365:
Correct a long-standing problem in moea{,64}_pvo_enter() that was revealed
by the combination of r268591 and r269134: When we attempt to add the
wired attribute to an existing mapping, moea{,64}_pvo_enter() do nothing.
(They only set the wired attribute on newly created mappings.)

MFC r269433:
Handle wiring failures in vm_map_wire() with the new functions
pmap_unwire() and vm_object_unwire().
Retire vm_fault_{un,}wire(), since they are no longer used.

MFC r269438:
Rewrite a loop in vm_map_wire() so that gcc doesn't think that the variable
"rv" is uninitialized.

MFC r269485:
Retire pmap_change_wiring().
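
For reference, a sketch of the interface change described above, with the
prototypes as documented in the pmap_unwire(9) page added in r268806: the
per-page pmap_change_wiring() is replaced by a range-based pmap_unwire().

    /* Old, per-page interface (retired by r269485). */
    void pmap_change_wiring(pmap_t pmap, vm_offset_t va, boolean_t wired);

    /* New, batched interface: drop the wired attribute for [start, end). */
    void pmap_unwire(pmap_t pmap, vm_offset_t start, vm_offset_t end);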

Reviewed by: alc


# 269174 27-Jul-2014 kib

MFC r268615:
Add OBJ_TMPFS_NODE flag.

MFC r268616:
Set the OBJ_TMPFS_NODE flag for vm_object of VREG tmpfs node.

MFC r269053:
Correct assertion. tmpfs vm object is always at the bottom of
the shadow chain.


# 263359 19-Mar-2014 kib

MFC r263092:
Do not vdrop() the tmpfs vnode until it is unlocked. The hold
reference might be the last, and then vdrop() would free the vnode.


# 262291 21-Feb-2014 attilio

MFC r261867:
Use the right index to free swapspace after vm_page_rename().


# 258037 12-Nov-2013 kib

MFC r257680:
Do not coalesce if the swap object belongs to tmpfs vnode.

Approved by: re (glebius)


# 256281 10-Oct-2013 gjb

Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle.

Approved by: re (implicit)
Sponsored by: The FreeBSD Foundation


# 255396 08-Sep-2013 kib

Drain the xbusy state in two places which potentially do
pmap_remove_all(). Not doing the drain allows pmap_enter() to
proceed in parallel, making the effects of pmap_remove_all() void.

The race results in an invalidated page mapped wired by usermode.

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
Approved by: re (glebius)


# 254649 22-Aug-2013 kib

Remove the deprecated VM_ALLOC_RETRY flag from vm_page_grab(9).
The flag has been mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc-retry semantics.

Suggested and reviewed by: alc
Sponsored by: The FreeBSD Foundation


# 254141 09-Aug-2013 attilio

On all architectures, avoid preallocating the physical memory
for nodes used in vm_radix.
On architectures supporting direct mapping, also avoid preallocating
the KVA for such nodes.

In order to do so, allow the operations derived from vm_radix_insert()
to fail and handle all the resulting failures.

On the vm_radix side, introduce a new function, vm_radix_replace(),
which can replace an already present leaf node with a new one,
and take into account the possibility, during vm_radix_insert()
allocation, that the operations on the radix trie can recurse.
This means that if operations in vm_radix_insert() recursed,
vm_radix_insert() will start from scratch again.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (older version)
Reviewed by: jeff
Tested by: pho, scottl


# 254138 09-Aug-2013 attilio

The soft and hard busy mechanisms rely on the vm object lock to work.
Unify the two concepts into a real, minimal sxlock where the shared
acquisition represents the soft busy and the exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in either read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumptions about the busy state unless it is also busying it.

Also:
- Add a new flag to directly share-busy pages while vm_page_alloc
and vm_page_grab are being executed. This will be very helpful
once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag.

The KPI is heavily changed; this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff, kib
Tested by: gavin, bapt (older version)
Tested by: pho, scottl


# 254025 07-Aug-2013 jeff

Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

- Normalize functions that allocate memory to use kmem_*
- Those that allocate address space are named kva_*
- Those that operate on maps are named kmap_*
- Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by: alc
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 253189 11-Jul-2013 kib

Never remove user-wired pages from an object when doing
msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that the object
range which corresponds to the user-wired vm_map_entry is always
fully populated.

Add the OBJPR_NOTWIRED flag for vm_object_page_remove() to request the
preserving behaviour; use it when calling vm_object_page_remove() from
vm_object_sync().

Reported and tested by: pho
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 252330 28-Jun-2013 jeff

- Add a general purpose resource allocator, vmem, from NetBSD. It was
originally inspired by the Solaris vmem detailed in the proceedings
of usenix 2001. The NetBSD version was heavily refactored for bugs
and simplicity.
- Use this resource allocator to allocate the buffer and transient maps.
Buffer cache defrags are reduced by 25% when used by filesystems with
mixed block sizes. Ultimately this may permit dynamic buffer cache
sizing on low KVA machines.

Discussed with: alc, kib, attilio
Tested by: pho
Sponsored by: EMC / Isilon Storage Division


# 251591 09-Jun-2013 alc

Revise the interface between vm_object_madvise() and vm_page_dontneed() so
that pointless calls to pmap_is_modified() can be easily avoided when
performing madvise(..., MADV_FREE).

Sponsored by: EMC / Isilon Storage Division


# 251397 04-Jun-2013 attilio

In vm_object_split(), busy and consequently unbusy the pages only when
swap_pager_copy() is invoked; otherwise there is no reason to do so.
This eliminates the need to busy pages most of the time.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc


# 251151 30-May-2013 kib

After the object lock was dropped, the object's reference count could
change. Retest the ref_count and return from the function, so as not to
execute the further code which assumes that ref_count == 1 when it is
not. Also, do not leak the vnode lock if another thread cleared the
OBJ_TMPFS flag in the meantime.

Reported by: bdrewery
Tested by: bdrewery, pho
Sponsored by: The FreeBSD Foundation


# 251150 30-May-2013 kib

Remove the capitalization in the assertion message. Print the address
of the object to get useful information from dumps of optimized kernels.


# 250030 28-Apr-2013 kib

Rework the handling of the tmpfs node backing swap object and tmpfs
vnode v_object to avoid double-buffering. Use the same object both as
the backing store for the tmpfs node and as the v_object.

Besides reducing memory use by up to 2x when mapping files from tmpfs,
it also makes tmpfs read and write operations copy half as many bytes.

The VM subsystem was already slightly adapted to tolerate an OBJT_SWAP
object as the v_object. Now vm_object_deallocate() is modified to not
reinstantiate the OBJ_ONEMAPPING flag and to help the VFS correctly
handle the VV_TEXT flag on the last dereference of the tmpfs backing
object.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 month


# 250029 28-Apr-2013 kib

Make vm_object_page_clean() and vm_mmap_vnode() tolerate a vnode's
v_object of non-OBJT_VNODE type.

For vm_object_page_clean(), simply do not assert that the object type
must be OBJT_VNODE, and add a comment explaining how the check for
OBJ_MIGHTBEDIRTY prevents the rest of the function from operating on
such objects.

For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it
to be for swap pager (or default), handle the bypass filesystems, and
correctly acquire the object reference in this case.

Reviewed by: alc
Tested by: pho, bf
MFC after: 1 week


# 248449 17-Mar-2013 attilio

Sync back vmcontention branch into HEAD:
Replace the per-object resident and cached pages splay tree with a
path-compressed multi-digit radix trie.
Along with this, also switch the x86-specific handling of idle page
tables to using the radix trie.

This change is supposed to do the following:
- Allow the acquisition of read locks for lookup operations on the
resident/cached page collections, as the per-vm_page_t splay iterators
are now removed.
- Increase the scalability of the operations on the page collections.

The radix trie relies on the consumers' locking to ensure atomicity of
its operations. In order to avoid deadlocks the bisection nodes are
pre-allocated in the UMA zone. This can be done safely because the
algorithm needs at most one new node per insert, which means the
maximum number of nodes needed is the number of available physical
frames itself. However, a new bisection node is not always really
needed.

The radix trie implements path compression because UFS indirect blocks
can lead to several objects with a very sparse trie, increasing the number
of levels that usually must be scanned. It also helps in node pre-fetching
by introducing the single-node-per-insert property.

This code is not generalized (yet) because of the possible loss of
performance from having many of the sizes in play configurable.
However, efforts to make this code more general, and thus reusable by
other consumers, may well be made.

The only KPI change is the removal of the function vm_page_splay() which
is now reaped.
The only KBI change, instead, is the removal of the left/right iterators
from struct vm_page, which are now reaped.

Further technical notes, broken into piecemeal commits, can be retrieved
from the svn branch:
http://svn.freebsd.org/base/user/attilio/vmcontention/

Sponsored by: EMC / Isilon storage division
In collaboration with: alc, jeff
Tested by: flo, pho, jhb, davide
Tested by: ian (arm)
Tested by: andreast (powerpc)


# 248084 09-Mar-2013 attilio

Switch the vm_object mutex to be an rwlock. This will enable further
optimizations in the future where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
is accessed for reading purposes.

The change is mostly mechanical, but a few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (which forced
consumers to include sys/mutex.h directly to cater to its inline
functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between the FreeBSD and Solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
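
For illustration, a minimal sketch of the rename from a consumer's point of
view, assuming a vm_object pointer "obj"; whether a particular operation may
run under the new read mode is settled by later commits:

    VM_OBJECT_WLOCK(obj);            /* was VM_OBJECT_LOCK(obj) */
    vm_object_set_flag(obj, OBJ_MIGHTBEDIRTY);
    VM_OBJECT_WUNLOCK(obj);          /* was VM_OBJECT_UNLOCK(obj) */

    VM_OBJECT_RLOCK(obj);            /* new read-mode acquisition */
    VM_OBJECT_ASSERT_LOCKED(obj);    /* satisfied by read or write mode */
    VM_OBJECT_RUNLOCK(obj);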

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 248082 09-Mar-2013 attilio

Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object. This KPI does not make assumptions about the
locking, so that it can also be used for building assertions at init
and destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: alc (vm_radix based version)
Tested by: flo, pho, jhb, davide


# 247788 04-Mar-2013 attilio

Merge from vmcontention:
As vm objects are type-stable, there is no need to initialize the
resident splay tree pointer and the cache splay tree pointer in
_vm_object_allocate(); this can instead be done in the UMA zone init
handler.

The UMA zone destructor handler will further check that the condition
holds at every destruction, to catch bugs.

Sponsored by: EMC / Isilon storage division
Submitted by: alc


# 247659 02-Mar-2013 alc

The value held by the vm object's field pg_color is only considered
valid if the flag OBJ_COLORED is set. Since _vm_object_allocate()
doesn't set this flag, it needn't initialize pg_color.

Sponsored by: EMC / Isilon Storage Division


# 247360 26-Feb-2013 attilio

Merge from vmc-playground branch:
Replace the sub-optimal uma_zone_set_obj() primitive with the more modern
uma_zone_reserve_kva(). The new primitive reserves beforehand
the KVA space necessary to cater to the zone allocations and allocates
pages with ALLOC_NOOBJ. More specifically:
- uma_zone_reserve_kva() does not need an object to cater to the backend
allocator.
- uma_zone_reserve_kva() can cater to M_WAITOK requests, in order to
serve zones which need to do uma_prealloc() too.
- When possible, uma_zone_reserve_kva() directly uses the direct mapping
via uma_small_alloc() rather than relying on the KVA / offset
combination.

The removal of the object attribute allows 2 further changes:
1) _vm_object_allocate() becomes static within vm_object.c.
2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by
direct calls to mtx_init(), as there is no need to export it anymore
and the calls are no longer homogeneous: there are now small
differences between the arguments passed to mtx_init().

Sponsored by: EMC / Isilon storage division
Reviewed by: alc (who also offered almost all the comments)
Tested by: pho, jhb, davide


# 247346 26-Feb-2013 attilio

Remove white spaces.

Sponsored by: EMC / Isilon storage division


# 247323 26-Feb-2013 attilio

Wrap the sleeps synchronized by the vm_object lock into the specific
macro VM_OBJECT_SLEEP().
This hides some implementation details like the usage of the msleep()
primitive and the necessity to access the lock address directly.
For this reason the VM_OBJECT_MTX() macro is now retired.

Sponsored by: EMC / Isilon storage division
Reviewed by: alc
Tested by: pho


# 244043 08-Dec-2012 alc

In the past four years, we've added two new vm object types. Each time,
similar changes had to be made in various places throughout the machine-
independent virtual memory layer to support the new vm object type.
However, in most of these places, it's actually not the type of the vm
object that matters to us but instead certain attributes of its pages.
For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain
fictitious pages. In other words, in most of these places, we were
testing the vm object's type to determine if it contained fictitious (or
unmanaged) pages.

To both simplify the code in these places and make the addition of future
vm object types easier, this change introduces two new vm object flags
that describe attributes of the vm object's pages, specifically, whether
they are fictitious or unmanaged.

Reviewed and tested by: kib


# 243659 28-Nov-2012 alc

Add support for the (relatively) new object type OBJT_MGTDEVICE to
vm_object_set_memattr(). Also, add a "safety belt" so that
vm_object_set_memattr() doesn't silently modify undefined object types.

Reviewed by: kib
MFC after: 10 days


# 241896 22-Oct-2012 kib

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change, which does
not result in interface signature changes.

Conducted and reviewed by: attilio
Tested by: pho


# 241512 13-Oct-2012 alc

Eliminate the conditional for releasing the page queues lock in
vm_page_sleep(). vm_page_sleep() is no longer called with this lock
held.

Eliminate assertions that the page queues lock is NOT held. These
assertions won't translate well to having distinct locks on the active
and inactive page queues, and they really aren't that useful.

MFC after: 3 weeks


# 241025 28-Sep-2012 kib

Fix the mis-handling of the VV_TEXT on the nullfs vnodes.

If you have a binary on a filesystem which is also mounted over by
nullfs, you could execute the binary from the lower filesystem, or
from the nullfs mount. When executed from lower filesystem, the lower
vnode gets VV_TEXT flag set, and the file cannot be modified while the
binary is active. But, if executed as the nullfs alias, only the
nullfs vnode gets VV_TEXT set, and you still can open the lower vnode
for write.

Add a set of VOPs for the VV_TEXT query, set and clear operations,
which are correctly bypassed to lower vnode.

Tested by: pho (previous version)
MFC after: 2 weeks


# 240741 20-Sep-2012 kib

Plug the accounting leak for wired pages when msync(MS_INVALIDATE)
is performed on a vnode mapping which is wired in another address space.

While there, explicitly assert that the page is unwired and zero the
wire_count instead of subtracting. The condition is already rechecked
later in vm_page_free(_toq).

Reported and tested by: zont
Reviewed by: alc (previous version)
MFC after: 1 week


# 238359 10-Jul-2012 attilio

Document the object type movements, related to swp_pager_copy(),
in vm_object_collapse() and vm_object_split().

In collaboration with: alc
MFC after: 3 days


# 233191 19-Mar-2012 jhb

Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger
than 4GB. Specifically, the inlined version of 'ptoa' of the 'int'
count of pages overflowed on 64-bit platforms. While here, change
vm_object_madvise() to accept two vm_pindex_t parameters (start and end)
rather than a (start, count) tuple to match other VM APIs, as suggested
by alc@.
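
A small userland-style sketch (not the kernel code) of the class of overflow
described above, assuming 4KB pages; the names are illustrative:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12

    int
    main(void)
    {
            unsigned int npages = 1u << 20;     /* pages backing a 4GB mapping */
            uint64_t bad = npages << PAGE_SHIFT;    /* shifted as 32-bit: wraps to 0 */
            uint64_t good = (uint64_t)npages << PAGE_SHIFT; /* widen before shifting */

            printf("bad=%ju good=%ju\n", (uintmax_t)bad, (uintmax_t)good);
            return (0);
    }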


# 233100 17-Mar-2012 kib

In vm_object_page_clean(), do not clear the OBJ_MIGHTBEDIRTY object flag
if the filesystem performed a short write and we are skipping the page
due to this.

Propagate the write error from the pager back to the callers of
vm_pageout_flush(). Report the failure to write a page from the
requested range as a FALSE return value from vm_object_page_clean(),
and propagate it back to msync(2) to return EIO to usermode.

While there, convert the clearobjflags variable in
vm_object_page_clean() and the arguments of the helper functions to
boolean.

PR: kern/165927
Reviewed by: alc
MFC after: 2 weeks


# 229495 04-Jan-2012 kib

Do not restart the scan in vm_object_page_clean() on the object
generation change if the requested mode is async. The object generation
is only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async
mode it is enough to write each dirty page; there is no need to guarantee
that all pages are clean after vm_object_page_clean() returns.

Diagnosed by: truckman
Tested by: flo
Reviewed by: alc, truckman
MFC after: 2 weeks


# 228936 28-Dec-2011 alc

Optimize vm_object_split()'s handling of reservations.


# 228838 23-Dec-2011 kib

Optimize the common case of msyncing the whole file mapping with the
MS_SYNC flag. The system must guarantee that all writes are finished
before the syscall returns. Schedule the writes in async mode, which is
much faster and allows clustering to occur. Wait for the writes using
VOP_FSYNC(), since we are syncing the whole file mapping.

Potentially, the restriction on applying the optimization can be
relaxed by not requiring that the mapping cover the whole file, as is
done by other OSes.
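
For reference, the userland pattern that hits the optimized path is the
ordinary whole-file-mapping sync; a minimal sketch:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Flush a whole MAP_SHARED file mapping; MS_SYNC waits for the writes. */
    static int
    sync_whole_mapping(void *addr, size_t len)
    {
            return (msync(addr, len, MS_SYNC));
    }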

Reported and tested by: az
Reviewed by: alc
MFC after: 2 weeks


# 227309 07-Nov-2011 ed

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 227070 04-Nov-2011 jhb

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provides hints about data access
patterns and is used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.
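
A brief usage sketch of the new syscall (illustrative only; note that
posix_fadvise(2) returns an error number rather than setting errno):

    #include <fcntl.h>

    /* Hint that [off, off + len) of the descriptor will not be needed soon. */
    static int
    drop_cached_range(int fd, off_t off, off_t len)
    {
            return (posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED));
    }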

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 225418 06-Sep-2011 kib

Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into an atomic
flags field. Updates to the atomic flags are performed using the atomic
ops on the containing word, do not require any vm lock to be held, and
are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9)
functions are provided to modify aflags.

Document the changes to flags field to only require the page lock.

Introduce vm_page_reference(9) function to provide a stable KPI and
KBI for filesystems like tmpfs and zfs which need to mark a page as
referenced.

Reviewed by: alc, attilio
Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64)
Approved by: re (bz)


# 224746 09-Aug-2011 kib

- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag
to VPO_UNMANAGED (and also making the flag protected by the vm object
lock, instead of vm page queue lock).
- Mark the fake pages with both PG_FICTITIOUS (as it is now) and
VPO_UNMANAGED. As a consequence, pmap code now can use just
VPO_UNMANAGED to decide whether the page is unmanaged.

Reviewed by: alc
Tested by: pho (x86, previous version), marius (sparc64),
marcel (arm, ia64, powerpc), ray (mips)
Sponsored by: The FreeBSD Foundation
Approved by: re (bz)


# 223677 29-Jun-2011 alc

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


# 222586 01-Jun-2011 kib

In the VOP_PUTPAGES() implementations, change the default error from
VM_PAGER_AGAIN to VM_PAGER_ERROR for the unwritten pages. Return
VM_PAGER_AGAIN for the partially written page. Always forward at least
one page in the loop of vm_object_page_clean().

VM_PAGER_ERROR causes the page reactivation and does not clear the
page dirty state, so the write is not lost.

The change fixes an infinite loop in vm_object_page_clean() when the
filesystem returns permanent errors for some page writes.

Reported and tested by: gavin
Reviewed by: alc, rmacklem
MFC after: 1 week


# 221714 09-May-2011 mlaier

Another long standing vm bug found at Isilon:
Fix a race between vm_object_collapse and vm_fault.

Reviewed by: alc@
MFC after: 3 days


# 220977 23-Apr-2011 kib

Fix two bugs in r218670.

Hold the vnode around the region where the object lock is dropped, until
the vnode lock is acquired.

Do not drop the vnode reference in the case when the object was
deallocated while unlocked. Note that in this case, VV_TEXT is cleared
by vnode_pager_dealloc().

Reported and tested by: pho
Reviewed by: alc
MFC after: 3 days


# 218670 13-Feb-2011 kib

Lock the vnode around clearing of VV_TEXT flag. Remove mp_fixme() note
mentioning that vnode lock is needed.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


# 218345 05-Feb-2011 alc

Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are
incorrectly calling vm_object_page_clean(). They are passing the length of
the range rather than the ending offset of the range.

Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the
callers.

Reviewed by: kib
MFC after: 3 weeks


# 218304 04-Feb-2011 alc

Since the last parameter to vm_object_shadow() is a vm_size_t and not a
vm_pindex_t, it makes no sense for its callers to perform atop(). Let
vm_object_shadow() do that instead.


# 217463 15-Jan-2011 kib

For consistency, use kernel_object instead of &kernel_object_store
when initializing the object mutex. Do the same for kmem_object.

Discussed with: alc
MFC after: 1 week


# 216874 01-Jan-2011 alc

Make a couple refinements to r216799 and r216810. In particular, revise
a comment and move it to its proper place.

Reviewed by: kib


# 216810 29-Dec-2010 kib

Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only
consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY
was cleared early in vm_object_page_clean, before the cleaning pass
was done. This is no longer true after r216799.

Moreover, since OBJ_CLEANING is a flag, and not a counter, it could
be reset prematurely when parallel vm_object_page_clean() calls are
performed.

Reviewed by: alc (as a part of the bigger patch)
MFC after: 1 month (after r216799 is merged)


# 216799 29-Dec-2010 kib

Move the increment of vm object generation count into
vm_object_set_writeable_dirty().

Fix an issue where a restart of the scan in vm_object_page_clean() did
not remove write permissions for newly added pages, or for pages whose
mapping changed to writeable due to a fault after they had already been
scanned. Merge the two loops in vm_object_page_clean(), doing the
removal of write permission and the cleaning in the same loop. The
restart of the loop then correctly downgrades writeable mappings.

Fix an issue where a second caller to msync() might actually return
before the first caller had actually completed flushing the
pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not
before.

Calls to pmap_is_modified() are not needed after pmap_remove_write()
there.

Proposed, reviewed and tested by: alc
MFC after: 1 week


# 216128 02-Dec-2010 trasz

Replace pointer to "struct uidinfo" with pointer to "struct ucred"
in "struct vm_object". This is required to make it possible to account
for per-jail swap usage.

Reviewed by: kib@
Tested by: pho@
Sponsored by: FreeBSD Foundation


# 215796 24-Nov-2010 kib

After the sleep caused by encountering a busy page, relookup the page.

Submitted and reviewed by: alc
Reported and tested by: pho
MFC after: 5 days


# 215610 21-Nov-2010 kib

Eliminate the mab, maf arrays and related variables.

The change also fixes an off-by-one error in the calculation of mreq.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 5 days


# 215597 20-Nov-2010 alc

Optimize vm_object_terminate().

Reviewed by: kib
MFC after: 1 week


# 215574 20-Nov-2010 kib

The runlen returned from vm_pageout_flush() might be zero legitimately,
when mreq page has status VM_PAGER_AGAIN.

MFC after: 5 days


# 215471 18-Nov-2010 kib

vm_pageout_flush() might cache the pages that finished their write to the
backing storage. Such pages might then be reused, racing with the
assert in vm_object_page_collect_flush() that verified that dirty
pages from the run (most likely, pages with VM_PAGER_AGAIN status) are
still write-protected. In fact, the page indexes for the pages that
were removed from the object page list should be ignored by
vm_object_page_clean().

Return the length of the successfully written run from vm_pageout_flush(),
that is, the count of pages between the requested page and the first page
after the requested one with status VM_PAGER_AGAIN. Supply the requested
page index in the array to vm_pageout_flush(). Use the returned run length
to advance the index of the next page to clean in vm_object_page_clean().

Reported by: avg
Reviewed by: alc
MFC after: 1 week


# 215469 18-Nov-2010 kib

Only increment the object generation count when inserting the page into
the object page list. The only use of the object generation count now is
to restart the scan in vm_object_page_clean(), which makes sense to do
on page addition. Page removals do not affect the dirtiness of the
object, nor do manipulations of the shadow chain.

Suggested and reviewed by: alc
MFC after: 1 week


# 209702 04-Jul-2010 kib

Several cleanups for the r209686:
- remove unused defines;
- remove unused curgeneration argument for vm_object_page_collect_flush();
- always assert that vm_object_page_clean() is called for OBJT_VNODE;
- move vm_page_find_least() into for() statement initial clause.

Submitted by: alc


# 209686 04-Jul-2010 kib

Reimplement vm_object_page_clean(), using the fact that vm object memq
is ordered by page index. This greatly simplifies the implementation,
since we no longer need to mark the pages with VPO_CLEANCHK to denote
the progress. It is enough to remember the current position by index
before dropping the object lock.

Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused.
Garbage-collect vm.msync_flush_flags sysctl.

Suggested and reviewed by: alc
Tested by: pho


# 209685 04-Jul-2010 kib

Introduce a helper function vm_page_find_least(). Use it in several places,
which inline the function.

Reviewed by: alc
Tested by: pho
MFC after: 1 week


# 208504 24-May-2010 alc

Roughly half of a typical pmap_mincore() implementation is machine-
independent code. Move this code into mincore(), and eliminate the
page queues lock from pmap_mincore().

Push down the page queues lock into pmap_clear_modify(),
pmap_clear_reference(), and pmap_is_modified(). Assert that these
functions are never passed an unmanaged page.

Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m:
Contrary to what the comment says, pmap_mincore() is not simply an
optimization. Without a complete pmap_mincore() implementation,
mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED
because only the pmap can provide this information.

Eliminate the page queues lock from vfs_setdirty_locked_object(),
vm_pageout_clean(), vm_object_page_collect_flush(), and
vm_object_page_clean(). Generally speaking, these are all accesses
to the page's dirty field, which are synchronized by the containing
vm object's lock.

Reduce the scope of the page queues lock in vm_object_madvise() and
vm_page_dontneed().

Reviewed by: kib (an earlier version)


# 208159 16-May-2010 alc

Add a comment about the proper use of vm_object_page_remove().

MFC after: 1 week


# 207796 08-May-2010 alc

Push down the page queues lock into vm_page_cache(), vm_page_try_to_cache(), and
vm_page_try_to_free(). Consequently, push down the page queues lock into
pmap_enter_quick(), pmap_page_wired_mappings(), pmap_remove_all(), and
pmap_remove_write().

Push down the page queues lock into Xen's pmap_page_is_mapped(). (I
overlooked the Xen pmap in r207702.)

Switch to a per-processor counter for the total number of pages cached.


# 207739 07-May-2010 alc

Eliminate acquisitions of the page queues lock that are no longer needed.

Switch to a per-processor counter for the number of pages freed during
process termination.


# 207728 06-May-2010 alc

Eliminate page queues locking around most calls to vm_page_free().


# 207669 05-May-2010 alc

Acquire the page lock around all remaining calls to vm_page_free() on
managed pages that didn't already have that lock held. (Freeing an
unmanaged page, such as the various pmaps use, doesn't require the page
lock.)

This allows a change in vm_page_remove()'s locking requirements. It now
expects the page lock to be held instead of the page queues lock.
Consequently, the page queues lock is no longer required at all by callers
to vm_page_rename().

Discussed with: kib


# 207531 02-May-2010 alc

Correct an error in r207410: Remove an unlock of a lock that is no longer
held.


# 207452 30-Apr-2010 kmacy

push up dropping of the page queue lock to avoid holding it in vm_pageout_flush


# 207451 30-Apr-2010 kmacy

don't call vm_pageout_flush with the page queue mutex held

Reported by: Michael Butler


# 207410 29-Apr-2010 kmacy

On Alan's advice, rather than do a wholesale conversion on a single
architecture from page queue lock to a hashed array of page locks
(based on a patch by Jeff Roberson), I've implemented page lock
support in the MI code and have only moved vm_page's hold_count
out from under page queue mutex to page lock. This changes
pmap_extract_and_hold on all pmaps.

Supported by: Bitgravity Inc.

Discussed with: alc, jeffr, and kib


# 207306 28-Apr-2010 alc

Change vm_object_madvise() so that it checks whether the page is invalid
or unmanaged before acquiring the page queues lock. Neither of these
tests require that lock. Moreover, a better way of testing if the page
is unmanaged is to test the type of vm object. This avoids a pointless
vm_page_lookup().

MFC after: 3 weeks


# 206801 18-Apr-2010 alc

There is no justification for vm_object_split() setting PG_REFERENCED on a
page that it is going to sleep on. Eliminate it.

MFC after: 3 weeks


# 206770 17-Apr-2010 alc

In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on
that page only makes sense if the advice is MADV_WILLNEED. In that case,
the intention is to activate the page, so discouraging the page daemon
from reclaiming the page makes sense. In contrast, in the other cases,
MADV_DONTNEED and MADV_FREE, it makes no sense whatsoever to discourage
the page daemon from reclaiming the page by setting PG_REFERENCED.

Wrap a nearby line.

Discussed with: kib
MFC after: 3 weeks


# 206768 17-Apr-2010 alc

In vm_object_backing_scan(), setting PG_REFERENCED on a page before
sleeping on that page is nonsensical. Doing so reduces the likelihood
that the page daemon will reclaim the page before the thread waiting in
vm_object_backing_scan() is reawakened. However, it does not guarantee
that the page is not reclaimed, so vm_object_backing_scan() restarts
after reawakening. More importantly, this muddles the meaning of
PG_REFERENCED. There is no reason to believe that the caller of
vm_object_backing_scan() is going to use (i.e., access) the contents of
the page. There is especially no reason to believe that an access is
more likely because vm_object_backing_scan() had to sleep on the page.

Discussed with: kib
MFC after: 3 weeks


# 200770 21-Dec-2009 kib

The VI_OBJDIRTY vnode flag mirrors the state of the OBJ_MIGHTBEDIRTY vm
object flag. Besides providing redundant information, the need to update
both vnode and object flags causes more acquisitions of the vnode
interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


# 195840 24-Jul-2009 jhb

Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to
a device pager (OBJT_DEVICE) object in that it uses fictitious pages to
provide aliases to other memory addresses. The primary difference is that
it uses an sglist(9) to determine the physical addresses for a given offset
into the object instead of invoking the d_mmap() method in a device driver.

Reviewed by: alc
Approved by: re (kensmith)
MFC after: 2 weeks


# 195649 12-Jul-2009 alc

Add support to the virtual memory system for configuring machine-
dependent memory attributes:

Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the
fact that there are machine-dependent memory attributes that have
nothing to do with controlling the cache's behavior.

Introduce vm_object_set_memattr() for setting the default memory
attributes that will be given to an object's pages.

Introduce and use pmap_page_{get,set}_memattr() for getting and
setting a page's machine-dependent memory attributes. Add full
support for these functions on amd64 and i386 and stubs for them on
the other architectures. The function pmap_page_set_memattr() is also
responsible for any other machine-dependent aspects of changing a
page's memory attributes, such as flushing the cache or updating the
direct map. The uses include kmem_alloc_contig(), vm_page_alloc(),
and the device pager:

kmem_alloc_contig() can now be used to allocate kernel memory with
non-default memory attributes on amd64 and i386.

vm_page_alloc() and the device pager will set the memory attributes
for the real or fictitious page according to the object's default
memory attributes.

Update the various pmap functions on amd64 and i386 that map pages to
incorporate each page's memory attributes in the mapping.

Notes: (1) Inherent to this design are safety features that prevent
the specification of inconsistent memory attributes by different
mappings on amd64 and i386. In addition, the device pager provides a
warning when a device driver creates a fictitious page with memory
attributes that are inconsistent with the real page that the
fictitious page is an alias for. (2) Storing the machine-dependent
memory attributes for amd64 and i386 as a dedicated "int" in "struct
md_page" represents a compromise between space efficiency and the ease
of MFCing these changes to RELENG_7.

In collaboration with: jhb

Approved by: re (kib)


# 195131 28-Jun-2009 kib

Eliminate code duplication by calling vm_object_destroy()
from vm_object_collapse().

Requested and reviewed by: alc
Approved by: re (kensmith)


# 194806 24-Jun-2009 alc

The bits set in a page's dirty mask are a subset of the bits set in its
valid mask. Consequently, there is no need to perform a bit-wise and of
the page's dirty and valid masks in order to determine which parts of a
page are dirty and valid.

Eliminate an unnecessary #include.


# 194766 23-Jun-2009 kib

Implement global and per-uid accounting of the anonymous memory. Add
rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved
for the uid.

The accounting information (charge) is associated with either map entry,
or vm object backing the entry, assuming the object is the first one
in the shadow chain and entry does not require COW. Charge is moved
from entry to object on allocation of the object, e.g. during the mmap,
assuming the object is allocated, or on the first page fault on the
entry. It moves back to the entry on forks due to COW setup.

The per-entry granularity of accounting makes the charge process fair
for processes that change uid during lifetime, and decrements charge
for proper uid when region is unmapped.

The interface of vm_pager_allocate(9) is extended by adding a struct
ucred *, which is used to charge the appropriate uid when the allocation
is performed by the kernel, e.g. md(4).

Several syscalls, among them fork(2), may now return ENOMEM when
global or per-uid limits are enforced.
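
A userland sketch of the new limit (illustrative values; the limit caps the
swap reservation charged to the uid, as described above):

    #include <sys/resource.h>

    /* Cap the swap reservation charged to this uid at 512MB. */
    static int
    limit_swap_reservation(void)
    {
            struct rlimit rl;

            rl.rlim_cur = rl.rlim_max = 512UL * 1024 * 1024;
            return (setrlimit(RLIMIT_SWAP, &rl));
    }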

In collaboration with: pho
Reviewed by: alc
Approved by: re (kensmith)


# 194209 14-Jun-2009 alc

Long, long ago in r27464 special case code for mapping device-backed
memory with 4MB pages was added to pmap_object_init_pt(). This code
assumes that the pages of a OBJT_DEVICE object are always physically
contiguous. Unfortunately, this is not always the case. For example,
jhb@ informs me that the recently introduced /dev/ksyms driver creates
a OBJT_DEVICE object that violates this assumption. Thus, this
revision modifies pmap_object_init_pt() to abort the mapping if the
OBJT_DEVICE object's pages are not physically contiguous. This
revision also changes some inconsistent if not buggy behavior. For
example, the i386 version aborts if the first 4MB virtual page that
would be mapped is already valid. However, it incorrectly replaces
any subsequent 4MB virtual page mappings that it encounters,
potentially leaking a page table page. The amd64 version has a bug of
my own creation. It potentially busies the wrong page and always an
insufficient number of pages if it blocks allocating a page table page.

To my knowledge, there have been no reports of these bugs, hence,
their persistence. I suspect that the existing restrictions that
pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would
choose to map, for example, that the first page must be aligned on a 2
or 4MB physical boundary and that the size of the mapping must be a
multiple of the large page size, were enough to avoid triggering the
bug for drivers like ksyms. However, one side effect of testing the
OBJT_DEVICE object's pages for physical contiguity is that a dubious
difference between pmap_object_init_pt() and the standard path for
mapping devices pages, i.e., vm_fault(), has been eliminated.
Previously, pmap_object_init_pt() would only instantiate the first
PG_FICTITIOUS page being mapped because it never examined the rest.
Now, however, pmap_object_init_pt() uses the new function
vm_object_populate() to instantiate them all (in order to support
testing their physical contiguity). These pages need to be
instantiated for the mechanism that I have prototyped for
automatically maintaining the consistency of the PAT settings across
multiple mappings, particularly, amd64's direct mapping, to work.
(Translation: This change is also being made to support jhb@'s work on
the Nvidia feature requests.)

Discussed with: jhb@


# 192968 28-May-2009 alc

Change vm_object_page_remove() such that it clears the page's dirty bits
when it invalidates the page.

Suggested by: tegge


# 191439 23-Apr-2009 kib

Do not call vm_page_lookup() from the ddb routine, namely from the "show
vmopag" implementation. The vm_page_lookup() code modifies the splay tree
of the object pages, and asserts that the object lock is taken. The first
issue could cause kernel data corruption, and the second one instantly
panics an INVARIANTS-enabled kernel.

Take advantage of the fact that object->memq is ordered by page index,
and iterate over memq to calculate the runs.

While there, make the code slightly more style-compliant by moving
variable declarations to the right place.

Discussed with: jhb, alc
Reviewed by: alc
MFC after: 2 weeks


# 188900 21-Feb-2009 alc

Reduce the scope of the page queues lock in vm_object_page_remove().

MFC after: 1 week


# 188348 08-Feb-2009 alc

Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a
redundant assertion in vm_fault().

Reviewed by: kib


# 186374 21-Dec-2008 rnoland

Fix printing of KASSERT message missed in r163604.

Approved by: kib


# 183754 10-Oct-2008 attilio

Remove the useless struct thread argument from the bufobj interface.
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()

and the BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, the functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider the passing of the 'curthread' argument
to VOP_SYNC() (in bufsync()) as just temporary; it will be axed out ASAP.

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 181239 03-Aug-2008 trhodes

Fill in a few sysctl descriptions.

Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com>
Approved by: alc


# 181024 30-Jul-2008 jhb

One more whitespace nit.


# 181020 30-Jul-2008 jhb

A few more whitespace fixes.


# 179159 20-May-2008 ups

Allow VM object creation in ufs_lookup (if vfs.vmiodirenable is set).
Directory IO without a VM object will store data in 'malloced' buffers,
severely limiting caching of the data. Without this change VM objects for
directories are only created on an open() of the directory.
TODO: Inline test if VM object already exists to avoid locking/function call
overhead.

Tested by: kris@
Reviewed by: jeff@
Reported by: David Filo


# 177704 29-Mar-2008 jeff

- Use vm_object_reference_locked() directly from
vm_object_reference(). This is intended to get rid of vget()
consumers who don't wish to acquire a lock. This is functionally
the same as calling vref(). vm_object_reference_locked() already
uses vref.

Discussed with: alc


# 176596 26-Feb-2008 alc

Correct a long-standing error in vm_object_page_remove(). Specifically,
pmap_remove_all() must not be called on fictitious pages. To date,
fictitious pages have been allocated from zeroed memory, effectively
hiding this problem because the fictitious pages appear to have an empty
pv list. Submitted by: Kostik Belousov

Rewrite the comments describing vm_object_page_remove() to better
describe what it does. Add an assertion. Reviewed by: Kostik Belousov

MFC after: 1 week


# 176526 24-Feb-2008 alc

Correct a long-standing error in vm_object_deallocate(). Specifically,
only anonymous default (OBJT_DEFAULT) and swap (OBJT_SWAP) objects should
ever have OBJ_ONEMAPPING set. However, vm_object_deallocate() was
setting it on device (OBJT_DEVICE) objects. As a result,
vm_object_page_remove() could be called on a device object and if that
occurred pmap_remove_all() would be called on the device object's pages.
However, a device object's pages are fictitious, and fictitious pages do
not have an initialized pv list (struct md_page).

To date, fictitious pages have been allocated from zeroed memory,
effectively hiding this problem. Now, however, the conversion of rotting
diagnostics to invariants in the amd64 and i386 pmaps has revealed the
problem. Specifically, assertion failures have occurred during the
initialization phase of the X server on some hardware.

MFC after: 1 week
Discussed with: Kostik Belousov
Reported by: Michiel Boland


# 175294 13-Jan-2008 attilio

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with a 'thread' argument which is always curthread.
Remove the useless extra argument and explicitly pass curthread to lower
layer functions, when necessary.

The KPI is broken by this change, which should affect several ports, so
a version bump and manpage update will be committed later.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# 175202 09-Jan-2008 attilio

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This change makes the code cleaner and in
particular removes an annoying dependence, helping the next lockmgr()
cleanup. The KPI, obviously, changes.

The manpage and __FreeBSD_version will be updated through further commits.

As a side note, it is worth noting that the next commits will address
a similar cleanup of VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 174982 29-Dec-2007 alc

Add the superpage reservation system. This is "part 2 of 2" of the
machine-independent support for superpages. (The earlier part was
the rewrite of the physical memory allocator.) The remainder of the
code required for superpages support is machine-dependent and will
be added to the various pmap implementations at a later date.

Initially, I am only supporting one large page size per architecture.
Moreover, I am only enabling the reservation system on amd64. (In
an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0
in amd64/include/vmparam.h or your kernel configuration file.)


# 173708 17-Nov-2007 alc

Prevent the leakage of wired pages in the following circumstances:
First, a file is mmap(2)ed and then mlock(2)ed. Later, it is truncated.
Under "normal" circumstances, i.e., when the file is not mlock(2)ed, the
pages beyond the EOF are unmapped and freed. However, when the file is
mlock(2)ed, the pages beyond the EOF are unmapped but not freed because
they have a non-zero wire count. This can be a mistake. Specifically,
it is a mistake if the sole reason why the pages are wired is because of
wired, managed mappings. Previously, unmapping the pages destroys these
wired, managed mappings, but does not reduce the pages' wire count.
Consequently, when the file is unmapped, the pages are not unwired
because the wired mapping has been destroyed. Moreover, when the vm
object is finally destroyed, the pages are leaked because they are still
wired. The fix is to reduce the pages' wired count by the number of
wired, managed mappings destroyed. To do this, I introduce a new pmap
function pmap_page_wired_mappings() that returns the number of managed
mappings to the given physical page that are wired, and I use this
function in vm_object_page_remove().

Reviewed by: tegge
MFC after: 6 weeks


# 172780 18-Oct-2007 alc

The previous revision, updating vm_object_page_remove() for the new page
cache, did not account for the case where the vm object has nothing but
cached pages.

Reported by: kris, tegge
Reviewed by: tegge
MFC after: 3 days


# 172341 27-Sep-2007 alc

Correct an error of omission in the reimplementation of the page
cache: vm_object_page_remove() should convert any cached pages that
fall with the specified range to free pages. Otherwise, there could
be a problem if a file is first truncated and then regrown.
Specifically, some old data from prior to the truncation might reappear.

Generalize vm_page_cache_free() to support the conversion of either a
subset or the entirety of an object's cached pages.

Reported by: tegge
Reviewed by: tegge
Approved by: re (kensmith)


# 172322 25-Sep-2007 alc

Correct an error in the previous revision, specifically,
vm_object_madvise() should request that the reactivated, cached page
not be busied.

Reported by: Rink Springer
Approved by: re (kensmith)


# 172317 25-Sep-2007 alc

Change the management of cached pages (PQ_CACHE) in two fundamental
ways:

(1) Cached pages are no longer kept in the object's resident page
splay tree and memq. Instead, they are kept in a separate per-object
splay tree of cached pages. However, access to this new per-object
splay tree is synchronized by the _free_ page queues lock, not to be
confused with the heavily contended page queues lock. Consequently, a
cached page can be reclaimed by vm_page_alloc(9) without acquiring the
object's lock or the page queues lock.

This solves a problem independently reported by tegge@ and Isilon.
Specifically, they observed the page daemon consuming a great deal of
CPU time because of pages bouncing back and forth between the cache
queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of
this problem turned out to be a deadlock avoidance strategy employed
when selecting a cached page to reclaim in vm_page_select_cache().
However, the root cause was really that reclaiming a cached page
required the acquisition of an object lock while the page queues lock
was already held. Thus, this change addresses the problem at its
root, by eliminating the need to acquire the object's lock.

Moreover, keeping cached pages in the object's primary splay tree and
memq was, in effect, optimizing for the uncommon case. Cached pages
are reclaimed far, far more often than they are reactivated. Instead,
this change makes reclamation cheaper, especially in terms of
synchronization overhead, and reactivation more expensive, because
reactivated pages will have to be reentered into the object's primary
splay tree and memq.

(2) Cached pages are now stored alongside free pages in the physical
memory allocator's buddy queues, increasing the likelihood that large
allocations of contiguous physical memory (i.e., superpages) will
succeed.

Finally, as a result of this change long-standing restrictions on when
and where a cached page can be reclaimed and returned by
vm_page_alloc(9) are eliminated. Specifically, calls to
vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and
return a formerly cached page. Consequently, a call to malloc(9)
specifying M_NOWAIT is less likely to fail.

Discussed with: many over the course of the summer, including jeff@,
Justin Husted @ Isilon, peter@, tegge@
Tested by: an earlier version by kris@
Approved by: re (kensmith)


# 170816 16-Jun-2007 alc

Enable the new physical memory allocator.

This allocator uses a binary buddy system with a twist. First and
foremost, this allocator is required to support the implementation of
superpages. As a side effect, it enables a more robust implementation
of contigmalloc(9). Moreover, this reimplementation of
contigmalloc(9) eliminates the acquisition of Giant by
contigmalloc(..., M_NOWAIT, ...).

The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages. Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space. The performance benefits vary. In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld.

This allocator does not implement page coloring. The reason is that
superpages have much the same effect. The contiguous physical memory
allocation necessary for a superpage is inherently colored.

Finally, the one caveat is that this allocator does not effectively
support prezeroed pages. I hope this is temporary. On i386, this is
a slight pessimization. However, on amd64, the beneficial effects of
the direct-map optimization outweigh the ill effects. I speculate
that this is true in general of machines with a direct map.

Approved by: re


# 170517 10-Jun-2007 attilio

Optimize vmmeter locking.
In particular:
- Add an explanatory table for the locking of struct vmmeter members
- Apply new rules for some of those members
- Remove some useless comments

Heavily reviewed by: alc, bde, jeff
Approved by: jeff (mentor)


# 170292 04-Jun-2007 attilio

Do proper "locking" for the missing vmmeter parts.
Now, we no longer assume sched_lock protection for some of them and use
the distributed loads method for vmmeter (distributed across CPUs).

Reviewed by: alc, bde
Approved by: jeff (mentor)


# 170170 31-May-2007 attilio

Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# 169667 18-May-2007 jeff

- Define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. These can be used to abstract away pcpu details, but this change also
switches all counters to atomics, which means the sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 167939 27-Mar-2007 alc

Prevent a race between vm_object_collapse() and vm_object_split() from
causing a crash.

Suppose that we have two objects, obj and backing_obj, where
backing_obj is obj's backing object. Further, suppose that
backing_obj has a reference count of two. One being the reference
held by obj and the other by a map entry. Now, suppose that the map
entry is deallocated and its reference removed by
vm_object_deallocate(). vm_object_deallocate() recognizes that the
only remaining reference is from a shadow object, obj, and calls
vm_object_collapse() on obj. vm_object_collapse() executes

    if (backing_object->ref_count == 1) {
            /*
             * If there is exactly one reference to the backing
             * object, we can collapse it into the parent.
             */
            vm_object_backing_scan(object, OBSC_COLLAPSE_WAIT);

vm_object_backing_scan(OBSC_COLLAPSE_WAIT) executes

    if (op & OBSC_COLLAPSE_WAIT) {
            vm_object_set_flag(backing_object, OBJ_DEAD);
    }

Finally, suppose that either vm_object_backing_scan() or
vm_object_collapse() sleeps releasing its locks. At this instant,
another thread executes vm_object_split(). It crashes in
vm_object_reference_locked() on the assertion that the object is not
dead. If, however, assertions are not enabled, it crashes much later,
after the object has been recycled, in vm_object_deallocate() because
the shadow count and shadow list are inconsistent.

Reviewed by: tegge
Reported by: jhb
MFC after: 1 week
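
The assertion referred to above looks roughly like the following sketch (the
exact message text is illustrative, not verbatim):

    VM_OBJECT_LOCK_ASSERT(object, MA_OWNED);
    KASSERT((object->flags & OBJ_DEAD) == 0,
        ("vm_object_reference_locked: dead object referenced"));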


# 167795 22-Mar-2007 alc

Change the order of lock reacquisition in vm_object_split() in order to
simplify the code slightly. Add a comment concerning lock ordering.


# 167091 27-Feb-2007 jhb

Use pause() in vm_object_deallocate() to yield the CPU to the lock holder
rather than a tsleep() on &proc0. The only wakeup on &proc0 is intended
to awaken the swapper, not random threads blocked in
vm_object_deallocate().
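
In code, the change amounts to roughly the following (the wmesg, priority,
and timeout are illustrative):

    /* old: sleeps on &proc0, whose only wakeup is meant for the swapper */
    tsleep(&proc0, PVM, "vmo_de", 1);

    /* new: simply yield the CPU to the lock holder for one tick */
    pause("vmo_de", 1);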


# 166964 25-Feb-2007 alc

Change the way that unmanaged pages are created. Specifically,
immediately flag any page that is allocated to an OBJT_PHYS object as
unmanaged in vm_page_alloc() rather than waiting for a later call to
vm_page_unmanage(). This allows for the elimination of some uses of
the page queues lock.

Change the type of the kernel and kmem objects from OBJT_DEFAULT to
OBJT_PHYS. This allows us to take advantage of the above change to
simplify the allocation of unmanaged pages in kmem_alloc() and
kmem_malloc().

Remove vm_page_unmanage(). It is no longer used.


# 166882 22-Feb-2007 alc

Change the page's CLEANCHK flag from being a page queue mutex synchronized
flag to a vm object mutex synchronized flag.


# 166074 17-Jan-2007 delphij

Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.


# 165309 17-Dec-2006 alc

Optimize vm_object_split(). Specifically, make the number of iterations
equal to the number of physical pages that are renamed to the new object
rather than the new object's virtual size.


# 165278 16-Dec-2006 alc

Simplify the computation of the new object's size in vm_object_split().


# 163614 22-Oct-2006 alc

The page queues lock is no longer required by vm_page_busy() or
vm_page_wakeup(). Reduce or eliminate its use accordingly.


# 163604 22-Oct-2006 alc

Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's
busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object
lock instead of the global page queues lock.


# 161492 21-Aug-2006 alc

Add _vm_stats and _vm_stats_misc to the sysctl declarations in sysctl.h and
eliminate their declarations from various source files.


# 161257 12-Aug-2006 alc

Reimplement the page's NOSYNC flag as an object-synchronized instead of a
page queues-synchronized flag. Reduce the scope of the page queues lock in
vm_fault() accordingly.

Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the
scope of the page queues lock.
Reviewed by: tegge

Additionally, eliminate an unnecessary dereference in computing the
argument that is passed to vm_object_set_writeable_dirty().


# 161125 09-Aug-2006 alc

Introduce a field to struct vm_page for storing flags that are
synchronized by the lock on the object containing the page.

Transition PG_WANTED and PG_SWAPINPROG to use the new field,
eliminating the need for holding the page queues lock when setting
or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to
VPO_WANTED and VPO_SWAPINPROG, respectively.

Eliminate the assertion that the page queues lock is held in
vm_page_io_finish().

Eliminate the acquisition and release of the page queues lock
around calls to vm_page_io_finish() in kern_sendfile() and
vfs_unbusy_pages().


# 160960 03-Aug-2006 alc

When sleeping on a busy page, use the lock from the containing object
rather than the global page queues lock.


# 160889 01-Aug-2006 alc

Complete the transition from pmap_page_protect() to pmap_remove_write().
Originally, I had adopted sparc64's name, pmap_clear_write(), for the
function that is now pmap_remove_write(). However, this function is more
like pmap_remove_all() than like pmap_clear_modify() or
pmap_clear_reference(), hence, the name change.

The higher-level rationale behind this change is described in
src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm
trying to clean up and fix our support for execute access.

Reviewed by: marcel@ (ia64)


# 160585 22-Jul-2006 alc

Export the number of object bypasses and collapses through sysctl.


# 160540 21-Jul-2006 alc

Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.


# 160421 17-Jul-2006 alc

Ensure that vm_object_deallocate() doesn't dereference a stale object
pointer: When vm_object_deallocate() sleeps because of a non-zero
paging in progress count on either object or object's shadow,
vm_object_deallocate() must ensure that object is still the shadow's
backing object when it reawakens. In fact, object may have been
deallocated while vm_object_deallocate() slept. If so, reacquiring
the lock on object can lead to a deadlock.

Submitted by: ups@
MFC after: 3 weeks


# 156225 02-Mar-2006 tegge

Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.


# 155884 21-Feb-2006 jhb

Lock the vm_object while checking its type to see if it is a vnode-backed
object that requires Giant in vm_object_deallocate(). This is somewhat
hairy in that if we can't obtain Giant directly, we have to drop the
object lock, then lock Giant, then relock the object lock and verify that
we still need Giant. If we don't (because the object changed to OBJT_DEAD
for example), then we drop Giant before continuing.

Reviewed by: alc
Tested by: kris


# 155169 01-Feb-2006 jeff

- Install a temporary bandaid in vm_object_reference() that will stop
mtx_assert()s from triggering until I find a real long-term solution.


# 154889 27-Jan-2006 alc

Use the new macros abstracting the page coloring/queues implementation.
(There are no functional changes.)


# 154805 25-Jan-2006 jeff

- Avoid calling vm_object_backing_scan() when collapsing an object whose
resident page count matches the object size. We know it fully backs
its parent in this case.

Reviewed by: alc, tegge
Sponsored by: Isilon Systems, Inc.


# 154694 22-Jan-2006 alc

Make vm_object_vndeallocate() static. The external calls to it were
eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.


# 153940 31-Dec-2005 netchild

MI changes:
- provide an interface (macros) to the page coloring part of the VM system;
this allows trying different coloring algorithms without the need to
touch every file [1]
- make the page queue tuning values readable: sysctl vm.stats.pagequeue
- autotune the page coloring values based upon the cache size instead
of options in the kernel config (disabling page coloring as a
kernel option is still possible)

MD changes:
- detection of the cache size: only IA32 and AMD64 (untested) contain
cache size detection code; every other arch just comes with a dummy
function (this results in the use of default values, as was the
case before the autotuning of the page coloring)
- print some more info on Intel CPUs (like we do on AMD and Transmeta
CPUs)

Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue"
and report if the cache* values are zero (= bug in the cache detection code)
or not.

Based upon work by: Chad David <davidc@acns.ab.ca> [1]
Reviewed by: alc, arch (in 2004)
Discussed with: alc, Chad David, arch (in 2004)


# 153060 03-Dec-2005 alc

Eliminate unneeded preallocation at initialization.

Reviewed by: tegge


# 151558 22-Oct-2005 alc

Use of the ZERO_COPY_SOCKETS option can result in an unusual state that
vm_object_backing_scan() was not written to handle. Specifically, a wired
page within a backing object that is shadowed by a page within the shadow
object. Handle this state by removing the wired page from the backing
object. The wired page will be freed by socow_iodone().

Stop masking errors: If a page is being freed by vm_object_backing_scan(),
assert that it is no longer mapped rather than quietly destroying any
mappings.

Tested by: Harald Schmalzbauer


# 148909 09-Aug-2005 tegge

Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE
due to the vm object being locked.

When a process writes large amounts of data to a file, the vm object associated
with that file can contain most of the physical pages on the machine. If the
process is preempted while holding the lock on the vm object, pagedaemon would
be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE
to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to
other vm objects.

Temporarily unlock the page queues lock while locking vm objects to avoid lock
order violation. Detect and handle relevant page queue changes.

This change depends on both the lock portion of struct vm_object and normal
struct vm_page being type stable.

Reviewed by: alc


# 145888 04-May-2005 jeff

- We need to inherit the OBJ_NEEDSGIANT flag from the original object in
vm_object_split().

Spotted by: alc


# 145826 03-May-2005 jeff

- Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the
underlying vnode requires Giant.
- In vm_fault only acquire Giant if the underlying object has NEEDSGIANT
set.
- In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.


# 144322 30-Mar-2005 alc

Eliminate (now) unnecessary acquisition and release of the global page
queues lock in vm_object_backing_scan(). Updates to the page's PG_BUSY
flag and busy field are synchronized by the containing object's lock.

Testing the page's hold_count and wire_count in vm_object_backing_scan()'s
OBSC_COLLAPSE_NOWAIT case is unnecessary. There is no reason why the held
or wired pages cannot be migrated to the shadow object.

Reviewed by: tegge


# 143745 17-Mar-2005 jeff

- Don't lock the vnode interlock in vm_object_set_writeable_dirty() if
we've already set the object flags.

Reviewed by: alc


# 141068 30-Jan-2005 alc

Update the text of an assertion to reflect changes made in revision 1.148.
Submitted by: tegge

Eliminate an unnecessary, temporary increment of the backing object's
reference count in vm_object_qcollapse(). Reviewed by: tegge


# 140723 24-Jan-2005 jeff

- Remove GIANT_REQUIRED where giant is no longer required.
- Use VFS_LOCK_GIANT() rather than directly acquiring giant in places
where giant is only held because vfs requires it.

Sponsored By: Isilon Systems, Inc.


# 140319 15-Jan-2005 alc

Consider three objects, O, BO, and BBO, where BO is O's backing object
and BBO is BO's backing object. Now, suppose that O and BO are being
collapsed. Furthermore, suppose that BO has been marked dead
(OBJ_DEAD) by vm_object_backing_scan() and that either
vm_object_backing_scan() has been forced to sleep due to encountering
a busy page or vm_object_collapse() has been forced to sleep due to
memory allocation in the swap pager. If vm_object_deallocate() is
then called on BBO and BO is BBO's only shadow object,
vm_object_deallocate() will collapse BO and BBO. In doing so, it adds
a necessary temporary reference to BO. If this collapse also sleeps
and the prior collapse resumes first, the temporary reference will
cause vm_object_collapse to panic with the message "backing_object %p
was somehow re-referenced during collapse!"

Resolve this race by changing vm_object_deallocate() such that it
doesn't collapse BO and BBO if BO is marked dead. Once O and BO are
collapsed, vm_object_collapse() will attempt to collapse O and BBO.
So, vm_object_deallocate() on BBO need do nothing.

Reported by: Peter Holm on 20050107
URL: http://www.holm.cc/stress/log/cons102.html

In collaboration with: tegge@
Candidate for RELENG_4 and RELENG_5
MFC after: 2 weeks


# 140220 14-Jan-2005 phk

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# 140048 11-Jan-2005 phk

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing, it would logically have to be the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 139921 08-Jan-2005 alc

Move the acquisition and release of the page queues lock outside of a loop
in vm_object_split() to avoid repeated acquisition and release.


# 139825 07-Jan-2005 imp

/* -> /*- for license, minor formatting changes


# 138986 17-Dec-2004 alc

Eliminate another unnecessary call to vm_page_busy(). (See revision 1.333
for a detailed explanation.)


# 138538 08-Dec-2004 alc

With the removal of kern/uipc_jumbo.c and sys/jumbo.h,
vm_object_allocate_wait() is not used. Remove it.


# 137324 06-Nov-2004 alc

Eliminate an unnecessary atomic operation. Articulate the rationale in
a comment.


# 137297 06-Nov-2004 alc

Move a call to wakeup() from vm_object_terminate() to vnode_pager_dealloc()
because this call is only needed to wake threads that slept when they
discovered a dead object connected to a vnode. To eliminate unnecessary
calls to wakeup() by vnode_pager_dealloc(), introduce a new flag,
OBJ_DISCONNECTWNT.

Reviewed by: tegge@


# 137242 05-Nov-2004 alc

Eliminate another unnecessary call to vm_page_busy() that immediately
precedes a call to vm_page_rename(). (See the previous revision for a
detailed explanation.)


# 137168 03-Nov-2004 alc

The synchronization provided by vm object locking has eliminated the
need for most calls to vm_page_busy(). Specifically, most calls to
vm_page_busy() occur immediately prior to a call to vm_page_remove().
In such cases, the containing vm object is locked across both calls.
Consequently, the setting of the vm page's PG_BUSY flag is not even
visible to other threads that are following the synchronization
protocol.

This change (1) eliminates the calls to vm_page_busy() that
immediately precede a call to vm_page_remove() or functions, such as
vm_page_free() and vm_page_rename(), that call it and (2) relaxes the
requirement in vm_page_remove() that the vm page's PG_BUSY flag is
set. Now, the vm page's PG_BUSY flag is set only when the vm object
lock is released while the vm page is still in transition. Typically,
this is when it is undergoing I/O.
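
The now-redundant pattern, sketched with hypothetical arguments; the point is
that the object lock is held across both calls, so the PG_BUSY setting is
never visible to other threads following the locking protocol:

    VM_OBJECT_LOCK(object);
    vm_page_busy(m);                        /* eliminated by this change */
    vm_page_rename(m, new_object, new_pindex);
    VM_OBJECT_UNLOCK(object);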


# 134496 29-Aug-2004 alc

Move the acquisition and release of the lock on the object at the head of
the shadow chain outside of the loop in vm_object_madvise(), reducing the
number of times that this lock is acquired and released.


# 132987 01-Aug-2004 green

* Add a "how" argument to uma_zone constructors and initialization functions
so that they know whether the allocation is supposed to be able to sleep
or not.
* Allow uma_zone constructors and initialization functions to return either
success or error. Almost all of the ones in the tree currently return
success unconditionally, but mbuf is a notable exception: the packet
zone constructor wants to be able to fail if it cannot suballocate an
mbuf cluster, and the mbuf allocators want to be able to fail in general
in a MAC kernel if the MAC mbuf initializer fails. This fixes the
panics people are seeing when they run out of memory for mbuf clusters.
* Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing
the default.

Both bmilekic and jeff have reviewed the changes made to make failable
zone allocations work.


# 132883 30-Jul-2004 dfr

Fix handling of msync(2) for character special files.

Submitted by: nvidia


# 132804 28-Jul-2004 alc

Correct a very old error in both vm_object_madvise() (originating in
vm/vm_object.c revision 1.88) and vm_object_sync() (originating in
vm/vm_map.c revision 1.36): When descending a chain of backing objects,
both use the wrong object's backing offset. Consequently, both may
operate on the wrong pages.

Quoting Matt, "This could be responsible for all of the sporadic madvise
oddness that has been reported over the years."

Reviewed by: Matt Dillon


# 132636 25-Jul-2004 alc

Remove spl calls.


# 132627 25-Jul-2004 alc

Make the code and comments for vm_object_coalesce() consistent.


# 132550 22-Jul-2004 alc

- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of
kmem_alloc_pageable(). The difference between these is that an errant
memory access to the zone will be detected sooner with
kmem_alloc_nofault().

The following changes serve to eliminate the following lock-order
reversal reported by witness:

1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311
2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797
3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931

There is no potential deadlock in this case. However, witness is unable
to recognize this because vm objects used by UMA have the same type as
ordinary vm objects. To remedy this, we make the following changes:

- Add a mutex type argument to VM_OBJECT_LOCK_INIT().
- Use the mutex type argument to assign distinct types to special
vm objects such as the kernel object, kmem object, and UMA objects.
- Define a static swap zone object for use by UMA. (Only static
objects are assigned a special mutex type.)
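
The shape of the resulting macro and its use is roughly as follows (a sketch,
not the verbatim definition; the type strings are illustrative):

    #define VM_OBJECT_LOCK_INIT(object, type) \
            mtx_init(&(object)->mtx, "vm object", (type), MTX_DEF | MTX_DUPOK)

    /* special objects receive their own witness type */
    VM_OBJECT_LOCK_INIT(kernel_object, "kernel vm object");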


# 131256 28-Jun-2004 tegge

Initialize result->backing_object_offset before linking result onto the list of
vm objects shadowing source in vm_object_shadow(). This closes a race where
vm_object_collapse() could be called with a partially uninitialized object
argument causing symptoms that looked like hardware problems, e.g. signal 6,
10, 11 or a /bin/sh busy-waiting for a nonexistent child process.


# 129729 25-May-2004 des

MFS: vm_map.c rev 1.187.2.27 through 1.187.2.29, fix MS_INVALIDATE
semantics but provide a sysctl knob for reverting to old ones.


# 127961 06-Apr-2004 imp

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 126739 08-Mar-2004 alc

Implement a work around for the deadlock avoidance case in
vm_object_deallocate() so that it doesn't spin forever either.

Submitted by: bde


# 126108 22-Feb-2004 alc

Correct a long-standing race condition in vm_object_page_remove() that
could result in a dirty page being unintentionally freed.

Reviewed by: tegge
MFC after: 7 days


# 124646 18-Jan-2004 alc

Don't acquire Giant in vm_object_deallocate() unless the object is vnode-
backed.


# 124084 02-Jan-2004 alc

Revision 1.74 of vm_meter.c ("Avoid lock-order reversal") makes the release
and subsequent reacquisition of the same vm object lock in
vm_object_collapse() unnecessary.


# 124008 30-Dec-2003 alc

- Modify vm_object_split() to expect a locked vm object on entry and
return with a locked vm object on exit. Remove GIANT_REQUIRED.
- Eliminate some unnecessary local variables from vm_object_split().


# 122349 09-Nov-2003 alc

- Rename vm_map_clean() to vm_map_sync(). This better reflects the fact
that msync(2) is its only caller.
- Migrate the parts of the old vm_map_clean() that examined the internals
of a vm object to a new function vm_object_sync() that is implemented in
vm_object.c. At the same, introduce the necessary vm object locking so
that vm_map_sync() and vm_object_sync() can be called without Giant.

Reviewed by: tegge


# 121913 02-Nov-2003 alc

- Increase the scope of two vm object locks in vm_object_split().


# 121907 02-Nov-2003 alc

- Introduce and use vm_object_reference_locked(). Unlike
vm_object_reference(), this function must not be used to reanimate dead
vm objects. This restriction simplifies locking.

Reviewed by: tegge


# 121866 01-Nov-2003 alc

- Increase the scope of two vm object locks in vm_object_collapse().
- Remove the acquisition and release of Giant from vm_object_coalesce().


# 121854 01-Nov-2003 alc

- Modify swap_pager_copy() and its callers such that the source and
destination objects are locked on entry and exit. Add comments to
the callers noting that the locks can be released by swap_pager_copy().
- Remove several instances of GIANT_REQUIRED.


# 121844 01-Nov-2003 alc

- Additional vm object locking in vm_object_split()
- New vm object locking assertions in vm_page_insert() and
vm_object_set_writeable_dirty()


# 121821 31-Oct-2003 alc

- Revert a part of revision 1.73: Make vm_object_set_flag() an inline
function. This function is so trivial that inlining reduces the size
of the kernel.


# 121815 31-Oct-2003 alc

- Take advantage of the swap pager locking: Eliminate the use of Giant
from vm_object_madvise().
- Remove excessive blank lines from vm_object_madvise().


# 121562 26-Oct-2003 alc

- Simplify vm_object_collapse()'s collapse case, reducing the number
of lock acquires and releases performed.
- Move an assertion from vm_object_collapse() to vm_object_zdtor()
because it applies to all cases of object destruction.


# 121226 18-Oct-2003 alc

- Increase the object lock's scope in vm_contig_launder() so that access
to the object's type field and the call to vm_pageout_flush() are
synchronized.
- The above change allows for the eliminaton of the last parameter
to vm_pageout_flush().
- Synchronize access to the page's valid field in vm_pageout_flush()
using the containing object's lock.


# 120739 04-Oct-2003 jeff

- Use the UMA_ZONE_VM flag on the fakepg and object zones to prevent
vm recursion and LORs. This may be necessary for other zones created in
the vm but this needs to be verified.


# 120152 17-Sep-2003 alc

Remove GIANT_REQUIRED from vm_object_shadow().


# 120086 15-Sep-2003 alc

Eliminate the use of Giant from vm_object_reference().


# 120035 13-Sep-2003 alc

There is no need for an atomic increment on the vm object's generation
count in _vm_object_allocate(). (Access to the generation count is
governed by the vm object's lock.) Note: the introduction of the
atomic increment in revision 1.238 appears to be an accident. The
purpose of that commit was to fix an Alpha-specific bug in UMA's
debugging code.


# 118537 06-Aug-2003 phk

Remove an unused variable.


# 118074 27-Jul-2003 alc

Allow vm_object_reference() on kernel_object without Giant.


# 117876 22-Jul-2003 phk

Don't inline very large functions.

Gcc has silently not been doing this for a long time.


# 116662 22-Jun-2003 alc

Complete the vm object locking in vm_object_backing_scan(); specifically,
deal with the case where we need to sleep on a busy page with two vm object
locks held.


# 116645 21-Jun-2003 alc

- Increase the scope of the vm object lock in vm_object_collapse().
- Assert that the vm object and its backing vm object are both locked in
vm_object_qcollapse().


# 116226 11-Jun-2003 obrien

Use __FBSDID().


# 116079 09-Jun-2003 alc

Don't use vm_object_set_flag() to initialize the vm object's flags.


# 116067 08-Jun-2003 alc

- Properly handle the paging_in_progress case on two vm objects in
vm_object_deallocate().
- Remove vm_object_pip_sleep().


# 115931 07-Jun-2003 alc

Pass the vm object to vm_object_collapse() with its lock held.


# 115879 05-Jun-2003 alc

- Extend the scope of the backing object's lock in vm_object_collapse().


# 115856 04-Jun-2003 alc

- Add further vm object locking to vm_object_deallocate(), specifically,
for accessing a vm object's shadows.


# 115818 04-Jun-2003 alc

- Add vm object locking to vm_object_deallocate(). (Still more
changes are required.)
- Remove special-case macros for kmem object locking. They are
no longer used.


# 115782 03-Jun-2003 alc

Add vm object locking to vm_object_coalesce().


# 115655 01-Jun-2003 alc

Change kernel_object and kmem_object to (&kernel_object_store) and
(&kmem_object_store), respectively. This allows the address of these
objects to be resolved at link-time rather than run-time.


# 115516 31-May-2003 alc

Add vm object locking to vm_object_madvise().


# 115127 18-May-2003 alc

Reduce the size of a vm object by converting its shadow list from a TAILQ
to a LIST.

Approved by: re (rwatson)


# 114850 09-May-2003 alc

Give the kmem object's mutex a unique name, instead of "vm object",
to avoid false reports of lock-order reversal with a system map mutex.

Approved by: re (jhb)


# 114774 06-May-2003 alc

Lock the vm_object when performing vm_pager_deallocate().


# 114669 04-May-2003 alc

Extend the scope of the vm_object lock in vm_object_terminate().


# 114599 03-May-2003 alc

Lock the vm_object on entry to vm_object_vndeallocate().


# 114570 03-May-2003 alc

- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.

Reported by: iedowse


# 114564 03-May-2003 alc

Move a declaration to its proper place.


# 114489 02-May-2003 alc

Lock the vm_object when updating its shadow list.


# 114487 02-May-2003 alc

Simplify the removal of a shadow object in vm_object_collapse().


# 114387 01-May-2003 alc

Extend the scope of the vm_object locking in vm_object_split().


# 114372 01-May-2003 alc

- Update the vm_object locking in vm_object_reference().
- Convert some dead code in vm_object_reference() into a comment.


# 114145 28-Apr-2003 alc

- Define VM_OBJECT_LOCK_INIT().
- Avoid repeatedly mtx_init()ing and mtx_destroy()ing the vm_object's lock
using UMA's uminit callback, in this case, vm_object_zinit().


# 114128 27-Apr-2003 alc

- Tell witness that holding two or more vm_object locks is okay.
- In vm_object_deallocate(), lock the child when removing the parent
from the child's shadow list.


# 114112 27-Apr-2003 alc

Various changes to vm_object_shadow(): (1) update the vm_object locking,
(2) remove a pointless assertion, and (3) make a trivial change to a
comment.


# 114091 26-Apr-2003 alc

Various changes to vm_object_page_remove():
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
size.
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.


# 114080 26-Apr-2003 alc

- Lock the vm_object on entry to vm_object_terminate().


# 114074 26-Apr-2003 alc

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


# 114053 26-Apr-2003 alc

- Extend the scope of two existing vm_object locks to cover
swap_pager_freespace().


# 113955 24-Apr-2003 alc

- Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)


# 113791 21-Apr-2003 alc

- Assert that the vm_object is locked in vm_object_clear_flag(),
vm_object_pip_add() and vm_object_pip_wakeup().
- Remove GIANT_REQUIRED from vm_object_pip_subtract() and
vm_object_pip_subtract().
- Lock the vm_object when performing vm_object_page_remove().


# 113775 20-Apr-2003 alc

- Lock the vm_object when performing either vm_object_clear_flag() or
vm_object_pip_wakeup().


# 113739 20-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_add().


# 113722 19-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_subtract().
- Assert that the vm_object lock is held in vm_object_pip_subtract().


# 113721 19-Apr-2003 alc

- Lock the vm_object when performing vm_object_pip_wakeupn().
- Assert that the vm_object lock is held in vm_object_pip_wakeupn().
- Add a new macro VM_OBJECT_LOCK_ASSERT().
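
A plausible shape for the new assertion macro (sketch, not the verbatim
definition):

    #define VM_OBJECT_LOCK_ASSERT(object, type) \
            mtx_assert(&(object)->mtx, (type))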


# 113457 13-Apr-2003 alc

Lock some manipulations of the vm object's flags.


# 113449 13-Apr-2003 alc

Lock some manipulations of the vm object's flags.


# 113419 12-Apr-2003 alc

Permit vm_object_pip_add() and vm_object_pip_wakeup() on the kmem_object
without Giant held.


# 112569 24-Mar-2003 jake

- Add vm_paddr_t, a physical address type. This is required for systems
where physical addresses are larger than virtual addresses, such as i386
with PAE.
- Use this to represent physical addresses in the MI vm system and in the
i386 pmap code. This also changes the paddr parameter to d_mmap_t.
- Fix printf formats to handle physical addresses >4G in the i386 memory
detection code, and due to kvtop returning vm_paddr_t instead of u_long.

Note that this is a name change only; vm_paddr_t is still the same as
vm_offset_t on all currently supported platforms.

Sponsored by: DARPA, Network Associates Laboratories
Discussed with: re, phk (cdevsw change)


# 112367 18-Mar-2003 phk

Including <sys/stdint.h> is (almost?) universally only to be able to use
%j in printfs, so put a nested include in <sys/systm.h> where the printf
prototype lives and save everybody else the trouble.


# 111937 06-Mar-2003 alc

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


# 111119 19-Feb-2003 imp

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 109912 26-Jan-2003 alc

Simplify vm_object_page_remove(): The object's memq is now ordered. The
two cases that existed before for performance optimization purposes can
be reduced to one.


# 109623 21-Jan-2003 alfred

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 108676 04-Jan-2003 alc

Use vm_object_lock() and vm_object_unlock() in vm_object_deallocate().
(This procedure needs further work, but this change is sufficient for
locking the kmem_object.)


# 108610 03-Jan-2003 alc

Refine the assertion in vm_object_clear_flag() to allow operation on the
kmem_object without Giant. In that case, assert that the kmem_object's
mutex is held.


# 108533 01-Jan-2003 schweikh

Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup,
especially in troff files.


# 108413 29-Dec-2002 alc

- Remove vm_object_init2(). It is unused.
- Add a mtx_destroy() to vm_object_collapse(). (This allows a bzero()
to migrate from _vm_object_allocate() to vm_object_zinit(), where it
will be performed less often.)


# 108384 29-Dec-2002 alc

Reduce the number of times that we acquire and release the page queues
lock by making vm_page_rename()'s caller, rather than vm_page_rename(),
responsible for acquiring it.


# 108358 28-Dec-2002 dillon

Allow the VM object flushing code to cluster. When the filesystem syncer
comes along and flushes a file which has been mmap()'d SHARED/RW, with
dirty pages, it was flushing the underlying VM object asynchronously,
resulting in thousands of 8K writes. With this change the VM Object flushing
code will cluster dirty pages in 64K blocks.

Note that until the low memory deadlock issue is reviewed, it is not safe
to allow the pageout daemon to use this feature. Forced pageouts still
use fs block size'd ops for the moment.

MFC after: 3 days


# 108334 27-Dec-2002 alc

- Change vm_object_page_collect_flush() to assert rather than
acquire the page queues lock.
- Acquire the page queues lock in vm_object_page_clean().


# 108251 24-Dec-2002 alc

- Hold the page queues lock around vm_page_wakeup().


# 108117 20-Dec-2002 alc

Add a mutex to struct vm_object. Initialize and destroy that mutex
at appropriate times. For the moment, the mutex is only used on
the kmem_object.


# 108101 19-Dec-2002 alc

Remove the hash_rand field from struct vm_object. As of revision 1.215 of
vm/vm_page.c, it is unused.


# 108012 18-Dec-2002 alc

- Hold the page queues lock when performing vm_page_busy().
- Replace vm_page_sleep_busy() with proper page queues locking
and vm_page_sleep_if_busy().


# 107893 15-Dec-2002 alc

As per the comments, vm_object_page_remove() now expects its caller to lock
the object (i.e., acquire Giant).


# 107304 27-Nov-2002 alc

Hold the page queues lock while performing pmap_page_protect().

Approved by: re (blanket)


# 107200 24-Nov-2002 alc

Extend the scope of the page queues/fields locking in vm_freeze_copyopts()
to cover pmap_remove_all().

Approved by: re


# 107039 18-Nov-2002 alc

Remove vm_page_protect(). Instead, use pmap_page_protect() directly.


# 106981 16-Nov-2002 alc

Now that pmap_remove_all() is exported by our pmap implementations
use it directly.


# 106871 13-Nov-2002 alc

Remove dead code that hasn't been needed since the demise of share maps
in various revisions of vm/vm_map.c between 1.148 and 1.153.


# 106773 11-Nov-2002 mjacob

atomic_set_8 isn't MI. Instead, follow Jake's suggestions about
ZONE_LOCK.


# 106720 10-Nov-2002 alc

When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than
indirectly through vm_page_protect(). The one remaining page flag that
is updated by vm_page_protect() is already being updated by our various
pmap implementations.

Note: A later commit will similarly change the VM_PROT_READ case and
eliminate vm_page_protect().


# 106602 07-Nov-2002 mux

Some more printf() format fixes.


# 105407 18-Oct-2002 dillon

Replace the vm_page hash table with a per-vmobject splay tree. There should
be no major change in performance from this change at this time but this
will allow other work to progress: Giant lock removal around VM system
in favor of per-object mutexes, ranged fsyncs, more optimal COMMIT rpc's for
NFS, partial filesystem syncs by the syncer, more optimal object flushing,
etc. Note that the buffer cache is already using a similar splay tree
mechanism.

Note that a good chunk of the old hash table code is still in the tree.
Alan or I will remove it prior to the release if the new code does not
introduce unsolvable bugs, else we can revert more easily.

Submitted by: alc (this is Alan's code)
Approved by: re
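
Roughly how the lookup path changes; the old helper name below is invented,
and the splay handling is a sketch of the approach rather than the committed
code:

    /* before: one global hash table keyed by (object, pindex) */
    m = vm_page_hash_lookup(object, pindex);    /* hypothetical old helper */

    /* after: each object keeps its own splay tree of resident pages */
    m = NULL;
    if (object->root != NULL) {
            object->root = vm_page_splay(pindex, object->root);
            if (object->root->pindex == pindex)
                    m = object->root;
    }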


# 103925 24-Sep-2002 jeff

- Get rid of the unused LK_NOOBJ.


# 102372 24-Aug-2002 alc

o Use vm_object_lock() in place of directly locking Giant.

Reviewed by: md5


# 101656 10-Aug-2002 alc

o Lock page queue accesses by vm_page_activate().


# 101308 04-Aug-2002 jeff

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 101236 02-Aug-2002 alc

o Convert two instances of vm_page_sleep_busy() into vm_page_sleep_if_busy()
with appropriate page queue locking.


# 101196 02-Aug-2002 alc

o Lock page queue accesses by vm_page_deactivate().


# 100915 30-Jul-2002 alc

o In vm_object_madvise() and vm_object_page_remove() replace
vm_page_sleep_busy() with vm_page_sleep_if_busy(). At the same time,
increase the scope of the page queues lock. (This should significantly
reduce the locking overhead in vm_object_page_remove().)
o Apply some style fixes.


# 100829 28-Jul-2002 alc

o Lock page queue accesses by vm_page_free().


# 100779 27-Jul-2002 alc

o Require that the page queues lock is held on entry to vm_pageout_clean()
and vm_pageout_flush().
o Acquire the page queues lock before calling vm_pageout_clean()
or vm_pageout_flush().


# 100686 25-Jul-2002 alc

o Remove a vm_page_deactivate() that is immediately followed by a
vm_page_rename() from vm_object_backing_scan(). vm_page_rename()
also performs vm_page_deactivate() on pages in the cache queues,
making the removed vm_page_deactivate() redundant.


# 100545 23-Jul-2002 alc

o Lock page queue accesses by vm_page_dontneed().
o Assert that the page queue lock is held in vm_page_dontneed().


# 99683 09-Jul-2002 alc

o Lock accesses to the page queues in vm_object_terminate().
o Eliminate some unnecessary 64-bit arithmetic in vm_object_split().


# 99514 07-Jul-2002 alc

o Traverse the object's memq rather than repeatedly calling vm_page_lookup()
in vm_object_split().


# 99093 29-Jun-2002 iedowse

Change the type of `tscan' in vm_object_page_clean() to vm_pindex_t,
as it stores an absolute page index that may not fit in a vm_offset_t.


# 98892 26-Jun-2002 iedowse

Avoid using the 64-bit vm_pindex_t in a few places where 64-bit
types are not required, as the overhead is unnecessary:

o In the i386 pmap_protect(), `sindex' and `eindex' represent page
indices within the 32-bit virtual address space.
o In swp_pager_meta_build() and swp_pager_meta_ctl(), use a temporary
variable to store the low few bits of a vm_pindex_t that gets used
as an array index.
o vm_uiomove() uses `osize' and `idx' for page offsets within a
map entry.
o In vm_object_split(), `idx' is a page offset within a map entry.


# 98849 26-Jun-2002 ken

At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.

NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c: Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in
sys/tiio.h.

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object_allocate() does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structure.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault
routines.

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.


# 98824 25-Jun-2002 iedowse

Complete the initial set of VM changes required to support full
64-bit file sizes. This step simply addresses the remaining overflows,
and does not attempt to optimise performance. The details are:

o Use a 64-bit type for the vm_object `size' and the size argument
to vm_object_allocate().
o Use the correct type for index variables in dev_pager_getpages(),
vm_object_page_clean() and vm_object_page_remove().
o Avoid an overflow in the i386 pmap_object_init_pt().


# 98414 19-Jun-2002 alc

o Replace GIANT_REQUIRED in vm_object_coalesce() by the acquisition and
release of Giant.
o Reduce the scope of GIANT_REQUIRED in vm_map_insert().

These changes will enable us to remove the acquisition and release
of Giant from obreak().


# 97753 02-Jun-2002 alc

o Migrate vm_map_split() from vm_map.c to vm_object.c, renaming it
to vm_object_split(). Its interface should still be changed
to resemble vm_object_shadow().


# 97729 02-Jun-2002 alc

o Condition vm_object_pmap_copy_1()'s compilation on the kernel
option ENABLE_VFS_IOOPT. Unless this option is in effect,
vm_object_pmap_copy_1() is not used.


# 97648 31-May-2002 alc

Further work on pushing Giant out of the vm_map layer and down
into the vm_object layer:
o Acquire and release Giant in vm_object_shadow() and
vm_object_page_remove().
o Remove the GIANT_REQUIRED assertion preceding vm_map_delete()'s call
to vm_object_page_remove().
o Remove the acquisition and release of Giant around vm_map_lookup()'s
call to vm_object_shadow().


# 96839 18-May-2002 alc

o Remove GIANT_REQUIRED from vm_map_madvise(). Instead, acquire and
release Giant around vm_map_madvise()'s call to pmap_object_init_pt().
o Replace GIANT_REQUIRED in vm_object_madvise() with the acquisition
and release of Giant.
o Remove the acquisition and release of Giant from madvise().


# 96441 12-May-2002 alc

o Acquire and release Giant in vm_object_reference() and
vm_object_deallocate(), replacing the assertion GIANT_REQUIRED.
o Remove GIANT_REQUIRED from vm_map_protect() and vm_map_simplify_entry().
o Acquire and release Giant around vm_map_protect()'s call to pmap_protect().

Altogether, these changes eliminate the need for mprotect() to acquire
and release Giant.


# 96095 06-May-2002 alc

o Condition the compilation and use of vm_freeze_copyopts()
on ENABLE_VFS_IOOPT.


# 96091 06-May-2002 alc

o Some improvements to the page coloring of vm objects, particularly,
for shadow objects.

Submitted by: bde


# 96087 05-May-2002 alc

o Move vm_freeze_copyopts() from vm_map.{c.h} to vm_object.{c,h}. It's plainly
an operation on a vm_object and belongs in the latter place.


# 96042 04-May-2002 alc

o Make _vm_object_allocate() and vm_object_allocate() callable
without holding Giant.
o Begin documenting the trivial cases of the locking protocol
on vm_object.


# 95112 20-Apr-2002 alc

Reintroduce locking on accesses to vm_object_list.


# 93818 04-Apr-2002 jhb

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 92748 20-Mar-2002 jeff

Remove references to vm_zone.h and switch over to the new uma API.


# 92654 19-Mar-2002 jeff

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 92511 17-Mar-2002 alc

Remove vm_object_count: It's unused, incorrectly maintained and duplicates
information maintained by the zone allocator.


# 92029 10-Mar-2002 eivind

- Remove a number of extra newlines that do not belong here according to
style(9)
- Minor space adjustment in cases where we have "( ", " )", if(), return(),
while(), for(), etc.
- Add /* SYMBOL */ after a few #endifs.

Reviewed by: alc


# 91724 06-Mar-2002 dillon

Add a sequential iteration optimization to vm_object_page_clean(). This
moderately improves msync's and VM object flushing for objects containing
randomly dirtied pages (fsync(), msync(), filesystem update daemon),
and improves cpu use for small-ranged sequential msync()s in the face of
very large mmap()ings from O(N) to O(1) as might be performed by a database.

A sysctl, vm.msync_flush_flag, has been added and defaults to 3 (the two
committed optimizations are turned on by default). 0 will turn off both
optimizations.

This code has already been tested under stable and is one in a series of
memq / vp->v_dirtyblkhd / fsync optimizations to remove O(N^2) restart
conditions that will be coming down the pipe.

MFC after: 3 days


# 85541 26-Oct-2001 dillon

Move recently added procedure which was incorrectly placed within an
#ifdef DDB block.


# 85517 25-Oct-2001 dillon

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# 83366 12-Sep-2001 julian

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 80704 31-Jul-2001 jake

Remove the use of atomic ops to manipulate vm_object and vm_page flags.
Giant is required here, so they are superfluous.

Discussed with: dillon


# 79248 04-Jul-2001 dillon

Change inlines back into mainline code in preparation for mutexing. Also,
most of these inlines had been bloated in -current far beyond their
original intent. Normalize prototypes and function declarations to be ANSI
only (half already were). And do some general cleanup.

(kernel size also reduced by 50-100K, but that isn't the prime intent)


# 79242 04-Jul-2001 dillon

whitespace / register cleanup


# 79224 04-Jul-2001 dillon

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# 78592 22-Jun-2001 bmilekic

Introduce numerous SMP friendly changes to the mbuf allocator. Namely,
introduce a modified allocation mechanism for mbufs and mbuf clusters; one
which can scale under SMP and which offers the possibility of resource
reclamation to be implemented in the future. Notable advantages:

o Reduce contention for SMP by offering per-CPU pools and locks.
o Better use of data cache due to per-CPU pools.
o Much less code cache pollution due to excessively large allocation macros.
o Framework for `grouping' objects from same page together so as to be able
to possibly free wired-down pages back to the system if they are no longer
needed by the network stacks.

Additional things changed with this addition:

- Moved some mbuf specific declarations and initializations from
sys/conf/param.c into mbuf-specific code where they belong.
- m_getclr() has been renamed to m_get_clrd() because the old name is really
confusing. m_getclr() HAS been preserved though and is defined to the new
name. No tree sweep has been done "to change the interface," as the old
name will continue to be supported and is not deprecated. The change was
merely done because m_getclr() sounds too much like "m_get a cluster."
- TEMPORARILY disabled mbtypes statistics displaying in netstat(1) and
systat(1) (see TODO below).
- Fixed systat(1) to display number of "free mbufs" based on new per-CPU
stat structures.
- Fixed netstat(1) to display new per-CPU stats based on sysctl-exported
per-CPU stat structures. All infos are fetched via sysctl.

TODO (in order of priority):

- Re-enable mbtypes statistics in both netstat(1) and systat(1) after
introducing an SMP friendly way to collect the mbtypes stats under the
already introduced per-CPU locks (i.e. hopefully don't use atomic() - it
seems too costly for a mere stat update, especially when other locks are
already present).
- Optionally have systat(1) display not only "total free mbufs" but also
"total free mbufs per CPU pool."
- Fix minor length-fetching issues in netstat(1) related to recently
re-enabled option to read mbuf stats from a core file.
- Move reference counters at least for mbuf clusters into an unused portion
of the cluster itself, to save space and need to allocate a counter.
- Look into introducing resource freeing possibly from a kproc.

Reviewed by (in parts): jlemon, jake, silby, terry
Tested by: jlemon (Intel & Alpha), mjacob (Intel & Alpha)
Preliminary performance measurements: jlemon (and me, obviously)
URL: http://people.freebsd.org/~bmilekic/mb_alloc/


# 77091 23-May-2001 jhb

- Assert that the vm lock is held for all of _vm_object_allocate().
- Restore the previous order of setting up a new vm_object. The previous
had a small bug where we zero'd out the flags after we set the
OBJ_ONEMAPPING flag.
- Add several asserts of vm_mtx.
- Assert Giant is held rather than locking and unlocking it in a few
places.
- Add in some #ifdef objlocks code to lock individual vm objects when
vm objects each have their own lock someday.
- Don't bother acquiring the allproc lock for a ddb command. If DDB
blocked on the lock, that would be worse than having an inconsistent
allproc list.


# 76827 18-May-2001 alfred

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 76166 01-May-2001 markm

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 76117 29-Apr-2001 grog

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# 75858 23-Apr-2001 grog

Correct #includes to work with fixed sys/mount.h.


# 75523 15-Apr-2001 alfred

use TAILQ_FOREACH, fix a comment's location


# 75477 13-Apr-2001 alfred

if/panic -> KASSERT


# 74927 28-Mar-2001 jhb

Convert the allproc and proctree locks from lockmgr locks to sx locks.


# 73534 04-Mar-2001 alfred

Simplify vm_object_deallocate(), by decrementing the refcount first.
This allows some of the conditionals to be combined.


# 72200 09-Feb-2001 bmilekic

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# 71999 04-Feb-2001 phk

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# 71576 24-Jan-2001 jasone

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 69972 13-Dec-2000 tanimura

- If swap metadata does not fit into the KVM, reduce the number of
struct swblock entries by dividing the number of the entries by 2
until the swap metadata fits.

- Reject swapon(2) upon failure of swap_zone allocation.

This is just a temporary fix. Better solutions include:
(suggested by: dillon)

o reserving swap in SWAP_META_PAGES chunks, and
o swapping the swblock structures themselves.

Reviewed by: alfred, dillon


# 69947 12-Dec-2000 jake

- Change the allproc_lock to use a macro, ALLPROC_LOCK(how), instead
of explicit calls to lockmgr. Also provides macros for the flags
passed to specify shared, exclusive or release which map to the
lockmgr flags. This is so that the use of lockmgr can be easily
replaced with optimized reader-writer locks.
- Add some locking that I missed the first time.


# 69022 22-Nov-2000 jake

Protect the following with a lockmgr lock:

allproc
zombproc
pidhashtbl
proc.p_list
proc.p_hash
nextpid

Reviewed by: jhb
Obtained from: BSD/OS and netbsd


# 61081 29-May-2000 dillon

This is a cleanup patch to Peter's new OBJT_PHYS VM object type
and sysv shared memory support for it. It implements a new
PG_UNMANAGED flag that has slightly different characteristics
from PG_FICTICIOUS.

A new sysctl, kern.ipc.shm_use_phys has been added to enable the
use of physically-backed sysv shared memory rather than swap-backed.
Physically backed shm segments are not tracked with PV entries,
allowing programs which use a large shm segment as a rendezvous
point to operate without eating an insane amount of KVM in the
PV entry management. Read: Oracle.

Peter's OBJT_PHYS object will also allow us to eventually implement
page-table sharing and/or 4MB physical page support for such segments.
We're half way there.


# 60755 21-May-2000 peter

Implement an optimization of the VM<->pmap API. Pass vm_page_t's directly
to various pmap_*() functions instead of looking up the physical address
and passing that. In many cases, the first thing the pmap code was doing
was going to a lot of trouble to get back the original vm_page_t, or
its shadow pv_table entry.

Inspired by: John Dyson's 1998 patches.

Also:
Eliminate pv_table as a separate thing and build it into a machine
dependent part of vm_page_t. This eliminates having a separate set of
structures that shadow each other in a 1:1 fashion that we often went to
a lot of trouble to translate from one to the other. (see above)
This happens to save 4 bytes of physical memory for each page in the
system. (8 bytes on the Alpha).

Eliminate the use of the phys_avail[] array to determine if a page is
managed (ie: it has pv_entries etc). Store this information in a flag.
Things like device_pager set it because they create vm_page_t's on the
fly that do not have pv_entries. This makes it easier to "unmanage" a
page of physical memory (this will be taken advantage of in subsequent
commits).

Add a function to add a new page to the freelist. This could be used
for reclaiming the previously wasted pages left over from preloaded
loader(8) files.

Reviewed by: dillon


# 59395 19-Apr-2000 alc

vm_object_shadow: Remove an incorrect assertion. In obscure circumstances
vm_object_shadow can be called on an object with ref_count > 1 and
OBJ_ONEMAPPING set. This isn't really a problem for vm_object_shadow.


# 58705 27-Mar-2000 charnier

Revert spelling mistake I made in the previous commit
Requested by: Alan and Bruce


# 58634 26-Mar-2000 charnier

Spelling


# 58132 16-Mar-2000 phk

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


# 54467 12-Dec-1999 dillon

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incurring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg
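
[Minimal userland sketch of the new knobs; file name and sizes are examples
and error handling is omitted.]

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void
    nosync_example(void)
    {
            int fd = open("/tmp/scratch", O_RDWR);          /* example path */
            char *p = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_NOSYNC, fd, 0);

            /* Individual page ranges can later be flipped either way. */
            madvise(p, 65536, MADV_AUTOSYNC);               /* normal flushing */
            madvise(p + 65536, 65536, MADV_NOSYNC);         /* keep these lazy */

            munmap(p, 1024 * 1024);
            close(fd);
    }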


# 52635 29-Oct-1999 phk

useracc() the prequel:

Merge the contents (less some trivial bordering on the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# 52617 29-Oct-1999 alc

Remove the last vestiges of "vm_map_t phys_map". It's been unused
since i386/i386/machdep.c rev 1.45 (or 1994 :-) ).


# 51343 17-Sep-1999 dillon

Remove inappropriate VOP_FSYNC from vm_object_page_clean(). The fsync
syncs the entire underlying file rather than just the requested range,
resulting in huge inefficiencies when the VM system is articulated in
a certain way. The VOP_FSYNC was also found to massively reduce NFS
performance in certain cases.

Change MADV_DONTNEED and MADV_FREE to call vm_page_dontneed() instead
of vm_page_deactivate(). Using vm_page_deactivate() causes all
inactive and cache pages to be recycled before the dontneed/free page
is recycled, effectively flushing our entire VM inactive & cache
queues continuously even if only a few pages are being actively MADV
free'd and reused (such as occurs with a sequential scan of a
memory-mapped file).

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# 50477 27-Aug-1999 peter

$Id$ -> $FreeBSD$


# 49858 15-Aug-1999 alc

Remove the declarations for "vm_map_t io_map". It's been unused
since i386/i386/machdep rev 1.310, i.e., the demise of BOUNCE_BUFFERS.


# 49852 15-Aug-1999 alc

Remove the declarations for "vm_map_t u_map". It's been unused
since i386/i386/pmap rev 1.190. (The alpha never used it.)


# 49655 12-Aug-1999 alc

vm_object_madvise:
Update the comments to match the implementation.

Submitted by: dillon


# 49654 12-Aug-1999 alc

vm_object_madvise:
Support MADV_DONTNEED and MADV_WILLNEED on object types
besides OBJT_DEFAULT and OBJT_SWAP.

Submitted by: dillon


# 49558 09-Aug-1999 phk

Merge the cons.c and cons.h to the best of my ability. alpha may or
may not compile, I can't test it.


# 49338 01-Aug-1999 alc

Move the memory access behavior information provided by madvise
from the vm_object to the vm_map.

Submitted by: dillon


# 48833 16-Jul-1999 alc

Remove vm_object::last_read. It is used by the old swap pager, but
not by the new one, i.e., vm/swap_pager.c rev 1.108.

Reviewed by: dillon@backplane.com


# 48757 11-Jul-1999 alc

Cleanup OBJ_ONEMAPPING management.

vm_map.c:
Don't set OBJ_ONEMAPPING on arbitrary vm objects. Only default
and swap type vm objects should have it set. vm_object_deallocate
already handles these cases.

vm_object.c:
If OBJ_ONEMAPPING isn't already clear in vm_object_shadow,
we are in trouble. Instead of clearing it, make it
an assertion that it is already clear.


# 48409 01-Jul-1999 peter

Fix some int/long printf problems for the Alpha


# 48059 20-Jun-1999 alc

Remove vm_object::cache_count and vm_object::wired_count. They are
not used. (Nor is there any planned use by John who introduced them.)

Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>


# 47607 29-May-1999 alc

Addendum to 1.155. Verify the existence of the object before checking
its reference count.


# 47568 28-May-1999 alc

Avoid the creation of unnecessary shadow objects.


# 47243 16-May-1999 alc

Remove prototypes for functions that don't exist anymore (vm_map.h).

Remove a useless argument from vm_map_madvise's interface (vm_map.c,
vm_map.h, and vm_mmap.c).

Remove a redundant test in vm_uiomove (vm_map.c).

Make two changes to vm_object_coalesce:

1. Determine whether the new range of pages actually overlaps
the existing object's range of pages before calling vm_object_page_remove.
(Prior to this change almost 90% of the calls to vm_object_page_remove
were to remove pages that were beyond the end of the object.)

2. Free any swap space allocated to removed pages.
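
[The guard in change (1) amounts to something like the following editorial
sketch; the variable names are assumed and this is not the literal diff.]

    /* next_pindex: page index where the appended range starts,
     * next_size: its length in pages (assumed names). */
    if (next_pindex < prev_object->size) {
            vm_object_page_remove(prev_object, next_pindex,
                next_pindex + next_size, FALSE);
            /* change (2): also release swap backing the removed pages */
    }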


# 44733 14-Mar-1999 alc

Correct two optimization errors in vm_object_page_remove:

1. The size of vm_object::memq is vm_object::resident_page_count,
not vm_object::size.

2. The "size > 4" test sometimes results in the traversal of a ~1000 page
memq in order to locate ~10 pages.


# 44245 24-Feb-1999 dillon

Remove unnecessary page protects on map_split and collapse operations.
Fix bug where an object's OBJ_WRITEABLE/OBJ_MIGHTBEDIRTY flags do
not get set under certain circumstances ( page rename case ).

Reviewed by: Alan Cox <alc@cs.rice.edu>, John Dyson


# 44034 15-Feb-1999 dillon

Fix a bug in the new madvise() code that would possibly (improperly)
free swap space out from under a busy page. This is not legal because
the swap may be reallocated and I/O issued while I/O is still in
progress on the same swap page from the madvise()'d object. This bug
could only occur under extreme paging conditions but might not cause
an error until much later. As a side-benefit, madvise() is now even
smaller.


# 43941 12-Feb-1999 dillon

Minor optimization to madvise() MADV_FREE to make page as freeable as
possible without actually unmapping it from the process.

As of now, I declare madvise() on OBJT_DEFAULT/OBJT_SWAP objects to be
'working and complete'.


# 43923 12-Feb-1999 dillon

Fix non-fatal bug in vm_map_insert() which improperly cleared
OBJ_ONEMAPPING in the case where an object is extended but an
additional vm_map_entry must be allocated.

In vm_object_madvise(), remove call to vm_page_cache() in MADV_FREE
case in order to avoid a page fault on page reuse. However, we still
mark the page as clean and destroy any swap backing store.

Submitted by: Alan Cox <alc@cs.rice.edu>


# 43777 08-Feb-1999 dillon

Revamp vm_object_[q]collapse(). Despite the complexity of this patch,
no major operational changes were made. The three core object->memq loops
were moved into a single inline procedure and various operational
characteristics of the collapse function were documented.


# 43761 08-Feb-1999 dillon

General cleanup. Remove #if 0's and remove useless register qualifiers.


# 43748 07-Feb-1999 dillon

Remove MAP_ENTRY_IS_A_MAP 'share' maps. These maps were once used to
attempt to optimize forks but were essentially given-up on due to
problems and replaced with an explicit dup of the vm_map_entry structure.
Prior to the removal, they were entirely unused.


# 43729 07-Feb-1999 dillon

When shadowing objects, adjust the page coloring of the shadowing object
such that pages in the combined/shadowed object are consistently
colored.

Submitted by: "John S. Dyson" <dyson@iquest.net>


# 43616 04-Feb-1999 dillon

Fix bug in a KASSERT I introduced in vm_page_qcollapse() rev 1.139.

Since paging is in progress, page scan in vm_page_qcollapse() must be
protected at least at splbio() to prevent pages from being ripped out from
under the scan.


# 43547 02-Feb-1999 dillon

Submitted by: Alan Cox

The vm_map_insert()/vm_object_coalesce() optimization has been extended
to include OBJT_SWAP objects as well as OBJT_DEFAULT objects. This is
possible because it costs nothing to extend an OBJT_SWAP object with
the new swapper. We can't do this with the old swapper. The old swapper
used a linear array that would have had to have been reallocated, costing
time as well as a potential low-memory deadlock.


# 43311 27-Jan-1999 dillon

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 43120 23-Jan-1999 dillon

Deprecate vm_object_pmap_copy() - nobody uses it. Everyone uses
vm_object_pmap_copy_1() now, apparently.


# 42972 21-Jan-1999 dillon

object->id was badly implemented. It has simply been removed.

object->paging_offset has been removed - it was used to optimize a
single OBJT_SWAP collapse case yet introduced massive confusion throughout
vm_object.c. The optimization was inconsequential except for the
claim that it didn't have to allocate any memory. The optimization
has been removed.

madvise() has been fixed. The old madvise() could be made to operate
on shared objects which is a big no-no. The new one is much more careful
in what it modifies. MADV_FREE was totally broken and has now been fixed.

vm_page_rename() now automatically dirties a page, so explicit dirtying
of the page prior to calling vm_page_rename() has been removed.


# 42957 21-Jan-1999 dillon

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 42453 09-Jan-1999 eivind

KNFize, by bde.


# 42408 08-Jan-1999 eivind

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# 42248 02-Jan-1999 bde

Ifdefed conditionally used simplock variables.


# 40931 05-Nov-1998 dg

Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.
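
[For reference, a hedged userland sketch of the call; the descriptors are
placeholders and the signature shown is the one documented in sendfile(2).]

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int
    send_whole_file(int filefd, int sockfd)
    {
            off_t sent = 0;

            /* nbytes == 0 means "send until end of file". */
            return (sendfile(filefd, sockfd, 0, 0, NULL, &sent, 0));
    }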


# 40673 27-Oct-1998 dg

Added needed splvm() protection around object page traversal in
vm_object_terminate().


# 40648 25-Oct-1998 phk

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 40605 23-Oct-1998 dg

Oops, revert part of last fix. vm_pager_dealloc() can't be called until
after the pages are removed from the object...so fix the problem by
not printing the diagnostic for wired fictitious pages (which is normal).


# 40604 23-Oct-1998 dg

Fixed two bugs in recent commit: in vm_object_terminate, vm_pager_dealloc
needs to be called prior to freeing remaining pages in the object so that
the device pager has an opportunity to grab its "fake" pages. Also, in
the case of wired pages, the page must be made busy prior to calling
vm_page_remove. This is a difference from 2.2.x that I overlooked when
I brought these changes forward.


# 40560 22-Oct-1998 dg

Make the VM system handle the case where a terminating object contains
legitimately wired pages. Currently we print a diagnostic when this
happens, but this will be removed soon when it will be common for this
to occur with zero-copy TCP/IP buffers.


# 39700 28-Sep-1998 dg

Be more selective about when we clear p->valid.
Submitted by: John Dyson <toor@dyson.iquest.net>


# 38799 04-Sep-1998 dfr

Cosmetic changes to the PAGE_XXX macros to make them consistent with
the other objects in vm.


# 38517 24-Aug-1998 dfr

Change various syscalls to use size_t arguments instead of u_int.

Add some overflow checks to read/write (from bde).

Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags
and vm_object::paging_in_progress to use operations which are not
interruptible.

Reviewed by: Bruce Evans <bde@zeta.org.au>


# 38135 06-Aug-1998 dfr

Protect all modifications to paging_in_progress with splvm(). The i386
managed to avoid corruption of this variable by luck (the compiler used a
memory read-modify-write instruction which wasn't interruptible) but other
architectures cannot.

With this change, I am now able to 'make buildworld' on the alpha (sfx: the
crowd goes wild...)


# 37641 14-Jul-1998 bde

Print pointers using %p instead of attempting to print them by
casting them to long, etc. Fixed some nearby printf bogons (sign
errors not warned about by gcc, and style bugs, but not truncation
of vm_ooffset_t's).


# 37562 11-Jul-1998 bde

Fixed printf format errors.


# 37094 21-Jun-1998 bde

Removed unused includes.


# 36735 07-Jun-1998 dfr

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
in line with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 36275 21-May-1998 dyson

Make flushing dirty pages work correctly on filesystems that
unexpectedly do not complete writes even with sync I/O requests.
This should help the behavior of mmaped files when using
softupdates (and perhaps in other circumstances also.)


# 35497 29-Apr-1998 dyson

Tighten up management of memory and swap space during map allocation,
deallocation cycles. This should provide a measurable improvement
on swap and memory allocation on loaded systems. It is unlikely a
complete solution. Also, provide more map info with procfs.
Chuck Cranor spurred on this improvement.


# 34611 15-Mar-1998 dyson

Some VM improvements, including elimination of a lot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliability
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, have page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# 34320 08-Mar-1998 dyson

Remove a very ill advised vm_page_protect. This was being called
for a non-managed page. That is a big no-no.


# 34235 08-Mar-1998 dyson

Several minor fixes:
1) When freeing pages, it is a good idea to protect them off.
(This is probably gratuitious, but good form.)
2) Allow collapsing pages in the backing object that are
PQ_CACHE. This will improve memory utilization.
3) Correct the collapse code so that pages that were on the
cache queue are moved to the inactive queue. This is
done when pages are marked dirty (so that those pages
will be properly paged out instead of freed), so that
cached pages will not be paradoxically marked dirty.


# 34206 07-Mar-1998 dyson

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifested themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistency, and as a side-effect broke
the vfs.ioopt code. This code might have been committed separately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups associated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistency problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and its descendants to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 33936 01-Mar-1998 dyson

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# 33817 25-Feb-1998 dyson

Fix page prezeroing for SMP, and fix some potential paging-in-progress
hangs. The paging-in-progress diagnosis was a result of Tor Egge's
excellent detective work.
Submitted by: Partially from Tor Egge.


# 33181 09-Feb-1998 eivind

Staticize.


# 33134 06-Feb-1998 eivind

Back out DIAGNOSTIC changes.


# 33109 05-Feb-1998 dyson

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 33108 04-Feb-1998 eivind

Turn DIAGNOSTIC into a new-style option.


# 32937 31-Jan-1998 dyson

Change the busy page mgmt, so that when pages are freed, they
MUST be PG_BUSY. It is bogus to free a page that isn't busy,
because it is in a state of being "unavailable" when being
freed. The additional advantage is that the page_remove code
has a better cross-check that the page should be busy and
unavailable for other use. There were some minor problems
with the collapse code, and this plugs those subtle "holes."

Also, the vfs_bio code wasn't checking correctly for PG_BUSY
pages. I am going to develop a more consistent scheme for
grabbing pages, busy or otherwise. For now, we are stuck
with the current morass.


# 32702 22-Jan-1998 dyson

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneeded collapse operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 32585 17-Jan-1998 dyson

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtle bugs aren't fixed yet.


# 32454 11-Jan-1998 dyson

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 32305 07-Jan-1998 dyson

Turn off the VTEXT flag when an object is no longer referenced, so
that an executable that is no longer running can be written to. Also,
clear the OBJ_OPT flag more often, when appropriate.


# 32286 06-Jan-1998 dyson

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

Vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon as reasonably possible.


# 32071 28-Dec-1997 dyson

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 31853 19-Dec-1997 dyson

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


# 31252 18-Nov-1997 bde

Removed unused #include of <sys/malloc.h>. This file now uses only
zalloc(). Many more cases like this are probably obscured by not
including <vm/zone.h> explicitly (it is spammed into <sys/malloc.h>).


# 31017 07-Nov-1997 phk

Rename some local variables to avoid shadowing other local variables.

Found by: -Wshadow


# 30700 24-Oct-1997 dyson

Decrease the initial allocation for the zone allocations.


# 29653 21-Sep-1997 dyson

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


# 28992 01-Sep-1997 bde

Removed unused #includes.


# 28991 01-Sep-1997 bde

Some staticized variables were still declared to be extern.


# 27899 04-Aug-1997 dyson

Get rid of the ad-hoc memory allocator for vm_map_entries, in lieu of
a simple, clean zone type allocator. This new allocator will also be
used for machine dependent pmap PV entries.


# 26811 22-Jun-1997 peter

Kill some stale leftovers from the earlier attempts at SMP per-cpu pages


# 26780 22-Jun-1997 dyson

Remove a window during the rundown of a file vnode. Also, the OBJ_DEAD
flag wasn't being respected during vref(), et al. Note that this
isn't the eventual fix for the locking problem. Fine grained SMP
in the VM and VFS code will require (lots) more work.


# 26258 29-May-1997 peter

Update the #include "opt_smpxxx.h" includes - opt_smp.h isn't needed
very much in the generic parts of the kernel now.


# 25164 26-Apr-1997 peter

Man the liferafts! Here comes the long awaited SMP -> -current merge!

There are various options documented in i386/conf/LINT, there is more to
come over the next few days.

The kernel should run pretty much "as before" without the options to
activate SMP mode.

There are a handful of known "loose ends" that need to be fixed, but
have been put off since the SMP kernel is in a moderately good condition
at the moment.

This commit is the result of the tinkering and testing over the last 14
months by many people. A special thanks to Steve Passe for implementing
the APIC code!


# 22975 22-Feb-1997 peter

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 22521 10-Feb-1997 dyson

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 21881 20-Jan-1997 dyson

Make MADV_FREE work better. Specifically, it did not wait for
the page to be unbusy, and it caused some algorithmic problems
as a result. There were some other problems with it also, so
this is a general cleanup of the code.
Submitted by: Douglas Crosher <dtc@scrooge.ee.swin.oz.au> and myself.


# 21754 16-Jan-1997 dyson

Change the map entry flags from bitfields to bitmasks. Allows
for some code simplification.
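
[Schematically (names and values invented for illustration), the change
replaces per-entry C bitfields with a single flags word tested via masks.]

    /* Before: one C bitfield per property. */
    struct example_entry_old {
            unsigned int    copy_on_write:1;
            unsigned int    needs_copy:1;
            unsigned int    nofault:1;
    };

    /* After: a flags word plus mask definitions. */
    #define EX_ENTRY_COW            0x0001
    #define EX_ENTRY_NEEDS_COPY     0x0002
    #define EX_ENTRY_NOFAULT        0x0004

    struct example_entry_new {
            unsigned int    eflags;
    };

    /* Tests and updates become plain mask operations. */
    static int
    entry_needs_copy(const struct example_entry_new *entry)
    {
            return ((entry->eflags & EX_ENTRY_NEEDS_COPY) != 0);
    }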


# 21673 14-Jan-1997 jkh

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 21258 03-Jan-1997 dyson

Undo the collapse breakage (swap space usage problem.)


# 21157 01-Jan-1997 dyson

Guess what? We left a lot of the old collapse code that is not needed
anymore with the "full" collapse fix that we added about 1yr ago!!! The
code has been removed by optioning it out for now, so we can put it back
in ASAP if any problems are found.


# 21134 31-Dec-1996 dyson

A very significant improvement in the management of process maps
and objects. Previously, "fancy" memory management techniques
such as that used by the M3 RTS would have the tendency of chopping
up a process's allocated memory into lots of little objects. Alan
has come up with some improvements to mitigate the situation to
the point where even the M3 RTS only has one object for bss and
its managed memory (when running CVSUP.) (There are still cases where the
situation isn't improved when the system pages -- but this is much much
better for the vast majority of cases.) The system will now be able
to much more effectively merge map entries.

Submitted by: Alan Cox <alc@cs.rice.edu>


# 18526 28-Sep-1996 dyson

Reviewed by:
Submitted by:
Obtained from:


# 18298 14-Sep-1996 bde

Attached vm ddb commands `show map', `show vmochk', `show object',
`show vmopag', `show page' and `show pageq'. Moved all vm ddb stuff
to the ends of the vm source files.

Changed printf() to db_printf(), `indent' to db_indent, and iprintf()
to db_iprintf() in ddb commands. Moved db_indent and db_iprintf()
from vm to ddb.

vm_page.c:
Don't use __pure. Staticized.

db_output.c:
Reduced page width from 80 to 79 to inhibit double spacing for long
lines (there are still some problems if words are printed across
column 79).


# 18169 08-Sep-1996 dyson

Addition of page coloring support. Various levels of coloring are afforded.
The default level works with minimal overhead, but one can also enable
full, efficient use of a 512K cache. (Parameters can be generated
to support arbitrary cache sizes also.)


# 17761 21-Aug-1996 dyson

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistent
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 17334 30-Jul-1996 dyson

Backed out the recent changes/enhancements to the VM code. The
problem with the 'shell scripts' was found, but there was a 'strange'
problem found with a 486 laptop that we could not find. This commit
backs the code back to 25-jul, and will be re-entered after the snapshot
in smaller (more easily tested) chunks.


# 17294 27-Jul-1996 dyson

This commit is meant to solve a couple of VM system problems or
performance issues.

1) The pmap module has had too many inlines, and so the
object file is simply bigger than it needs to be.
Some common code is also merged into subroutines.
2) Removal of some *evil* PHYS_TO_VM_PAGE macro calls.
Unfortunately, a few have needed to be added also.
The removal caused the need for more vm_page_lookups.
I added lookup hints to minimize the need for the
page table lookup operations.
3) Removal of some bogus performance improvements, that
mostly made the code more complex (tracking individual
page table page updates unnecessarily). Those improvements
actually hurt 386 processors perf (not that people who
worry about perf use 386 processors anymore :-)).
4) Changed pv queue manipulations/structures to be TAILQ's.
5) The pv queue code has had some performance problems since
day one. Some significant scalability issues are resolved
by threading the pv entries from the pmap AND the physical
address instead of just the physical address. This makes
certain pmap operations run much faster. This does
not affect most micro-benchmarks, but should help loaded system
performance *significantly*. DG helped and came up with most
of the solution for this one.
6) Most if not all pmap bit operations follow the pattern:
pmap_test_bit();
pmap_clear_bit();
That made for twice the necessary pv list traversal. The
pmap interface now supports only pmap_tc_bit type operations:
pmap_[test/clear]_modified, pmap_[test/clear]_referenced.
Additionally, the modified routine now takes a vm_page_t arg
instead of a phys address. This eliminates a PHYS_TO_VM_PAGE
operation.
7) Several rewrites of routines that contain redundant code to
use common routines, so that there is a greater likelihood of
keeping the cache footprint smaller.


# 16409 16-Jun-1996 dyson

Various bugfixes/cleanups from me and others:
1) Remove potential race conditions on waking up in vm_page_free_wakeup
by making sure that it is at splvm().
2) Fix another bug in vm_map_simplify_entry.
3) Be more complete about converting from default to swap pager
when an object grows to be large enough that there can be
a problem with data structure allocation under low memory
conditions.
4) Make some madvise code more efficient.
5) Added some comments.


# 16026 30-May-1996 dyson

This commit is dual-purpose, to fix more of the pageout daemon
queue corruption problems, and to apply Gary Palmer's code cleanups.
David Greenman helped with these problems also. There is still
a hang problem using X in small memory machines.


# 15888 24-May-1996 dyson

Eliminate inefficient check for dirty pages for pages in the PQ_CACHE
queue. Also, modify the MADV_FREE policy (it probably still isn't the final
version.)


# 15873 22-May-1996 dyson

Initial support for MADV_FREE, support for pages that we don't care
about the contents anymore. This gives us a lot of the advantage of
freeing individual pages through munmap, but with almost none of the
overhead.


# 15841 21-May-1996 dyson

After reviewing the previous commit to vm_object, the page protection
is never necessary, not just for PG_FICTICIOUS.


# 15836 21-May-1996 dyson

Don't protect non-managed pages off during object rundown. This fixes
a hang that occurs under certain circumstances when exiting X.


# 15819 19-May-1996 dyson

Initial support for mincore and madvise. Both are almost fully
supported, except madvise does not page in with MADV_WILLNEED, and
MADV_DONTNEED doesn't force dirty pages out.
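
[Small userland sketch of the two calls; the region is assumed to span at most
the vec[] capacity, and error handling is omitted.]

    #include <stddef.h>
    #include <sys/mman.h>

    static void
    residency_hint_example(char *region, size_t npages, size_t pagesize)
    {
            char vec[64];   /* one status byte per page; example capacity */

            /* Hint that the region will be needed soon. */
            madvise(region, npages * pagesize, MADV_WILLNEED);

            /* Ask which of those pages are currently resident. */
            mincore(region, npages * pagesize, vec);
    }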


# 15809 18-May-1996 dyson

This set of commits to the VM system does the following, and contain
contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>,
Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me:

More usage of the TAILQ macros. Additional minor fix to queue.h.
Performance enhancements to the pageout daemon.
Addition of a wait in the case that the pageout daemon
has to run immediately.
Slightly modify the pageout algorithm.
Significant revamp of the pmap/fork code:
1) PTE's and UPAGES's are NO LONGER in the process's map.
2) PTE's and UPAGES's reside in their own objects.
3) TOTAL elimination of recursive page table pagefaults.
4) The page directory now resides in the PTE object.
5) Implemented pmap_copy, thereby speeding up fork time.
6) Changed the pv entries so that the head is a pointer
and not an entire entry.
7) Significant cleanup of pmap_protect, and pmap_remove.
8) Removed significant amounts of machine dependent
fork code from vm_glue. Pushed much of that code into
the machine dependent pmap module.
9) Support more completely the reuse of already zeroed
pages (Page table pages and page directories) as being
already zeroed.
Performance and code cleanups in vm_map:
1) Improved and simplified allocation of map entries.
2) Improved vm_map_copy code.
3) Corrected some minor problems in the simplify code.
Implemented splvm (combo of splbio and splimp.) The VM code now
seldom uses splhigh.
Improved the speed of and simplified kmem_malloc.
Minor mod to vm_fault to avoid using pre-zeroed pages in the case
of objects with backing objects along with the already
existant condition of having a vnode. (If there is a backing
object, there will likely be a COW... With a COW, it isn't
necessary to start with a pre-zeroed page.)
Minor reorg of source to perhaps improve locality of ref.


# 15367 24-Apr-1996 dyson

This fixes kmem_malloc/kmem_free (and malloc/free of objects of > 8K).
A page index was calculated incorrectly in vm_kern, and vm_object_page_remove
removed pages that should not have been.


# 14900 29-Mar-1996 dg

Revert to previous calculation of vm_object_cache_max: it simply works
better in most real-world cases.


# 14865 28-Mar-1996 dyson

VM performance improvements, and reorder some operations in VM fault
in anticipation of a fix in pmap that will allow the mlock system call to work
without panicing the system.


# 14531 11-Mar-1996 hsu

For Lite2: proc LIST changes.
Reviewed by: davidg & bde


# 14316 02-Mar-1996 dyson

1) Eliminate unnecessary bzero of UPAGES.
2) Eliminate unnecessary copying of pages during/after forks.
3) Add user map simplification.


# 13490 19-Jan-1996 dyson

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 13228 04-Jan-1996 wollman

Convert DDB to new-style option.


# 13223 04-Jan-1996 dg

Increased vm_object_cache_max by about 50% to yield better utilization of
memory when lots of small files are cached.

Reviewed by: dyson


# 12820 14-Dec-1995 phk

Another mega commit to staticize things.


# 12767 11-Dec-1995 dyson

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.
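
[Illustration of the renaming, with the conversion macro and lookup routine as
they appear in later FreeBSD sources; object and offset are assumed to be in
scope. Byte offsets are converted to page indices before any page lookup.]

    vm_pindex_t pindex;
    vm_page_t m;

    pindex = OFF_TO_IDX(offset);            /* byte offset -> page index */
    m = vm_page_lookup(object, pindex);     /* was keyed on (object, offset) */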


# 12662 07-Dec-1995 dg

Untangled the vm.h include file spaghetti.


# 12591 03-Dec-1995 bde

Completed function declarations and/or added prototypes.

Staticized some functions.

__purified some functions. Some functions were bogusly declared as
returning `const'. This hasn't done anything since gcc-2.5. For
later versions of gcc, the equivalent is __attribute__((const)) at
the end of function declarations.


# 12423 20-Nov-1995 phk

Remove unused vars & funcs, make things static, protoize a little bit.


# 12110 05-Nov-1995 dyson

Greatly simplify the msync code. Eliminate complications in vm_pageout
for msyncing. Remove a bug that manifests itself primarily on NFS
(the dirty range on the buffers is not set on msync.)


# 11705 23-Oct-1995 dyson

First phase of removing the PG_COPYONWRITE flag, and an architectural
cleanup of mapping files.


# 10345 26-Aug-1995 bde

Change vm_object_print() to have the correct number and type of args
for a ddb command.


# 10080 16-Aug-1995 bde

Make everything except the unsupported network sources compile cleanly
with -Wnested-externs.


# 9759 29-Jul-1995 bde

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


# 9548 16-Jul-1995 dg

1) Merged swpager structure into vm_object.
2) Changed swap_pager internal interfaces to cope w/#1.
3) Eliminated object->copy as we no longer have copy objects.
4) Minor stylistic changes.


# 9507 13-Jul-1995 dg

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
essentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure essentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. Its marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# 9202 11-Jun-1995 rgrimes

Merge RELENG_2_0_5 into HEAD


# 8876 30-May-1995 rgrimes

Remove trailing whitespace.


# 8692 21-May-1995 dg

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# 8216 02-May-1995 dg

Changed object hash list to be a list rather than a tailq. This saves
space for the hash list buckets and is a little faster. The features
of tailq aren't needed. Increased the size of the object hash table
to improve performance. In the future, this will be changed so that
the table is sized dynamically.


# 7968 21-Apr-1995 dyson

Fixed a problem in _vm_object_page_clean that could cause an
infinite loop.


# 7883 16-Apr-1995 dg

Moved some zero-initialized variables into .bss. Made code intended to be
called only from DDB #ifdef DDB. Removed some completely unused globals.


# 7870 16-Apr-1995 dg

Fixed a few bugs in vm_object_page_clean, mostly related to not syncing
pages that are in FS buffers. This fixes the (believed to already have been
fixed) problem with msync() not doing its job...in other words, the
stuff that Andrew has continuously been complaining about.

Submitted by: John Dyson, w/minor changes by me.


# 7695 09-Apr-1995 dg

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previously not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# 7350 25-Mar-1995 dg

Removed (almost) meaningless "object cache lookups/hits" statistic. In
our framework, these numbers will usually be nearly the same, and not
because of any sort of high 'hit rate'.


# 7346 25-Mar-1995 dg

Removed cnt.v_nzfod: In our current scheme of things it is not possible
to accurately track this. It isn't an indicator of resource consumption
anyway.
Removed cnt.v_kernel_pages: We don't implement this and doing so accurately
would be very difficult (and ambiguous - since process pages are often
double mapped in the kernel and the process address spaces).


# 7263 23-Mar-1995 dg

Fixed warning caused by returning a value in a void function (introduced
in a recent commit by me). Relaxed checks before calling vm_object_remove;
a non-internal object always has a pager.


# 7246 22-Mar-1995 dg

Removed unused fifth argument to vm_object_page_clean(). Fixed bug with
VTEXT not always getting cleared when it is supposed to. Added check to
make sure that vm_object_remove() isn't called with a NULL pager or for
a pager for an OBJ_INTERNAL object (neither of which will be on the hash
list). Clear OBJ_CANPERSIST if we decide to terminate it because of no
resident pages.


# 7243 22-Mar-1995 dg

Fixed a potential sleep/wakeup race condition with splhigh().

Submitted by: John Dyson


# 7204 20-Mar-1995 dg

Added a new boolean argument to vm_object_page_clean that causes it to
only toss out clean pages if TRUE.


# 7187 20-Mar-1995 dg

Don't gain/lose an object reference in vnode_pager_setsize(). It will
cause vnode locking problems in vm_object_terminate().
Implement proper vnode locking in vm_object_terminate().


# 7180 20-Mar-1995 dg

Removed an unnecessary call to vinvalbuf after the page clean.


# 7090 16-Mar-1995 bde

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 7017 12-Mar-1995 dg

Fixed obsolete comment.


# 7015 12-Mar-1995 dg

Deleted vm_object_setpager().


# 6943 07-Mar-1995 dg

Don't attempt to reverse collapse non-OBJ_INTERNAL objects.


# 6816 01-Mar-1995 dg

Various changes from John and myself that do the following:

New functions created - vm_object_pip_wakeup and pagedaemon_wakeup - that
are used to reduce the actual number of wakeups.
New function vm_page_protect, which is used in conjunction with some new
page flags to reduce the number of calls to pmap_page_protect.
Minor changes to reduce unnecessary spl nesting.
Rewrote vm_page_alloc() to improve readability.
Various other mostly cosmetic changes.


# 6622 22-Feb-1995 dg

Removed bogus copy object collapse check (the idea is right, but the
specific check was bogus).
Removed old copy of vm_object_page_clean and took out the #if 1 around
the remaining one.

Submitted by: John Dyson


# 6618 22-Feb-1995 dg

Only do object paging_in_progress wakeups if someone is waiting on this
condition.

Submitted by: John Dyson


# 6585 20-Feb-1995 dg

Deprecated remaining use of vm_deallocate. Deprecated
vm_allocate_with_pager(). Almost completely rewrote vm_mmap(); when John
gets done with
the bottom half, it will be a complete rewrite. Deprecated most use of
vm_object_setpager(). Removed side effect of setting object persist
in vm_object_enter and moved this into the pager(s). A few other
cosmetic changes.


# 6567 20-Feb-1995 dg

Panic if object is deallocated too many times.
Slight change to reverse collapsing so that vm_object_deallocate doesn't
have to be called recursively.
Removed half of a previous fix - the renamed page during a collapse doesn't
need to be marked dirty because the pager backing store pointers are copied
- thus preserving the page's data. This assumes that pages without backing
store are always dirty (except perhaps for when they are first zeroed, but
this doesn't matter).
Switch order of two lines of code so that the correct pager is removed
from the hash list. The previous code bogusly passed a NULL pointer to
vm_object_remove(). The call to vm_object_remove() should be unnecessary
if named anonymous objects were being dealt with correctly. They are
currently marked as OBJ_INTERNAL, which really screws up things (such as
this).


# 6541 18-Feb-1995 dg

1) Added protection against collapsing OBJ_DEAD objects.
2) Bumped reference counts by 2 instead of 1 so that an object deallocate
doesn't try to recursively collapse the object.
3) Marked pages renamed during the collapse as dirty so that their contents
are preserved.

Submitted by: John and me.


# 6326 12-Feb-1995 dg

Carefully choose the value for vm_object_cache_max. The previous calculation
was rather bogus in most cases; the new value works very well for both
large and small memory machines.


# 6129 02-Feb-1995 dg

swap_pager.c:
Fixed a long-standing bug in freeing swap space during object collapses.
Fixed 'out of space' messages from printing out too often.
Modified to use new kmem_malloc() calling convention.
Implemented an additional stat in the swap pager struct to count the
amount of space allocated to that pager. This may be removed at some
point in the future.
Minimized unnecessary wakeups.

vm_fault.c:
Don't try to collect fault stats on 'swapped' processes - there aren't
any upages to store the stats in.
Changed read-ahead policy (again!).

vm_glue.c:
Be sure to gain a reference to the process's map before swapping.
Be sure to lose it when done.

kern_malloc.c:
Added the ability to specify if allocations are at interrupt time or
are 'safe'; this affects what types of pages can be allocated.

vm_map.c:
Fixed a variety of map lock problems; there's still a lurking bug that
will eventually bite.

vm_object.c:
Explicitly initialize the object fields rather than bzeroing the struct.
Eliminated the 'rcollapse' code and folded its functionality into the
"real" collapse routine.
Moved an object_unlock() so that the backing_object is protected in
the qcollapse routine.
Make sure nobody fools with the backing_object when we're destroying it.
Added some diagnostic code which can be called from the debugger that
looks through all the internal objects and makes certain that they
all belong to someone.

vm_page.c:
Fixed a rather serious logic bug that would result in random system
crashes. Changed pagedaemon wakeup policy (again!).

vm_pageout.c:
Removed unnecessary page rotations on the inactive queue.
Changed the number of pages to explicitly free to just the free_reserved
level.

Submitted by: John Dyson


# 5903 25-Jan-1995 dg

Don't attempt to clean device_pager backed objects at terminate time.
There is similar bogusness in the pageout daemon that will be fixed soon.
This fixes a panic pointed out to me by Bruce Evans that occurs when
/dev/mem is used to map managed memory.


# 5841 24-Jan-1995 dg

Added ability to detect sequential faults and DTRT. (swap_pager.c)
Added hook for pmap_prefault() and use symbolic constant for new third
argument to vm_page_alloc() (vm_fault.c, various)
Changed the way that upages and page tables are held. (vm_glue.c)
Fixed architectural flaw in allocating pages at interrupt time that was
introduced with the merged cache changes. (vm_page.c, various)
Adjusted some algorithms to achieve better paging performance and to
accommodate the fix for the architectural flaw mentioned above. (vm_pageout.c)
Fixed pbuf handling problem, changed policy on handling read-behind page.
(vnode_pager.c)

Submitted by: John Dyson


# 5571 13-Jan-1995 dg

Protect a qcollapse call with an object lock before calling. The locks
need to be moved into the qcollapse and rcollapse routines, but I don't
have time at the moment to make all the required changes...this will do
for now.


# 5520 11-Jan-1995 dg

Improve my previous change to use the same tests as are used in qcollapse.


# 5519 11-Jan-1995 dg

Fixed a panic that Garrett reported to me...the OBJ_INTERNAL flag wasn't
being cleared in some cases for vnode backed objects; we now do this in
vnode_pager_alloc proper to guarantee it. Also be more careful in the
rcollapse code about messing with busy/bmapped pages.


# 5455 09-Jan-1995 dg

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a separate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 5404 05-Jan-1995 dg

Make sure that the object being collapsed doesn't go away on us...by
gaining extra references to it.

Submitted by: John Dyson


# 5203 23-Dec-1994 dg

Do vm_page_rename more conservatively in rcollapse and qcollapse, and
change list walk so that it doesn't get stuck in an infinite loop.

Submitted by: John Dyson


# 5033 10-Dec-1994 dg

Don't put objects that have no parent on the reverse_shadow_list. Problem
identified and explained by Gene Stark (thanks Gene!).

Submitted by: John Dyson


# 4810 25-Nov-1994 dg

These changes fix a couple of lingering VM problems:

1. The pageout daemon used to block under certain
circumstances, and we needed to add new functionality
that would cause the pageout daemon to block more often.
Now, the pageout daemon mostly just gets rid of pages
and kills processes when the system is out of swap.
The swapping, rss limiting and object cache trimming
have been folded into a new daemon called "vmdaemon".
This new daemon does things that need to be done for
the VM system, but can block. For example, if the
vmdaemon blocks for memory, the pageout daemon
can take care of it. If the pageout daemon had
blocked for memory, it was difficult to handle
the situation correctly (and in some cases, was
impossible).

2. The collapse problem has now been entirely fixed.
It now appears to be impossible to accumulate unnecessary
vm objects. The object collapsing now occurs when ref counts
drop to one (where it is more likely to be simpler anyway
because fewer pages would be out on disk.) The original
fixes were incomplete in that pathological circumstances
could still be contrived to cause uncontrolled growth
of swap. Also, the old code still, under steady state
conditions, used more swap space than necessary. When
using the new code, users will generally notice a
significant decrease in swap space usage, and theoretically,
the system should be leaving fewer unused pages around
competing for memory.

Submitted by: John Dyson


# 4203 06-Nov-1994 dg

Added support for starting the experimental "vmdaemon" system process.
Enabled via REL2_1.

Added support for doing object collapses "on the fly". Enabled via REL2_1a.

Improved object collapses so that they can happen in more cases. Improved
sensing of modified pages to fix an apparent race condition and improve
clustered pageout opportunities. Fixed an "oops" with not restarting page
scan after a potential block in vm_pageout_clean() (not doing this can result
in strange behavior in some cases).

Submitted by: John Dyson & David Greenman


# 3610 15-Oct-1994 dg

Properly count object lookups and hits.


# 3449 08-Oct-1994 phk

Cosmetics: unused vars, ()'s, #include's &c &c to silence gcc.
Reviewed by: davidg


# 3374 05-Oct-1994 dg

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# 2320 27-Aug-1994 dg

1) Changed ddb into an option rather than a pseudo-device (use options DDB
in your kernel config now).
2) Added ps ddb function from 1.1.5. Cleaned it up a bit and moved into its
own file.
3) Added \r handling in db_printf.
4) Added missing memory usage stats to statclock().
5) Added dummy function to pseudo_set so it will be emitted if there
are no other pseudo declarations.


# 2112 18-Aug-1994 wollman

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 1895 07-Aug-1994 dg

Provide support for upcoming merged VM/buffer cache, and fixed a few bugs
that haven't appeared to manifest themselves (yet).

Submitted by: John Dyson


# 1817 02-Aug-1994 dg

Added $Id$


# 1549 25-May-1994 rgrimes

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# 1542 24-May-1994 rgrimes

This commit was generated by cvs2svn to compensate for changes in r1541,
which included commits to RCS files with non-trunk default branches.


# 1541 24-May-1994 rgrimes

BSD 4.4 Lite Kernel Sources