History log of /freebsd-10.0-release/sys/kern/vfs_bio.c
Revision	Date	Author	Comments
# 259065	07-Dec-2013	gjb	- Copy stable/10 (r259064) to releng/10.0 as part of the 10.0-RELEASE cycle. - Update __FreeBSD_version [1] - Set branch name to -RC1 [1] 10.0-CURRENT __FreeBSD_version value ended at '55', so start releng/10.0 at '100' so the branch is started with a value ending in zero. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
# 256281	10-Oct-2013	gjb	Copy head (r256279) to stable/10 as part of the 10.0-RELEASE cycle. Approved by: re (implicit) Sponsored by: The FreeBSD Foundation
# 256213	09-Oct-2013	kib	The device vnodes are often unlocked when bread() or bwrite() is called. This probably should be fixed eventually, but for now it is not needed to try to flush such vnodes from the buffer allocation context. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
# 255986	02-Oct-2013	kib	When helping the bufdaemon from the buffer allocation context, there is no sense to walk the whole dirty buffer queue. We are only interested in, and can operate on, the buffers owned by the current vnode [1]. Instead of calling generic queue flush routine, do VOP_FSYNC() if possible. Holding the dirty buffer queue lock in the bufdaemon, without dropping it, can cause starvation of buffer writes from other threads. This is esp. easy to reproduce on the big memory machines, where large files are written, causing almost all dirty buffers accumulating in several big files, which vnodes are locked by writers. Bufdaemon cannot flush any buffer, but is iterating over the whole dirty queue continuously. Since dirty queue mutex is not dropped, bufdone() in g_up thread is starved, usually deadlocking the machine [2]. Mitigate this by dropping the queue lock after the vnode is locked, allowing other queue lock contenders to make a progress. Discussed with: Jeff [1] Reported by: pho [2] Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Approved by: re (hrs)
# 255941	29-Sep-2013	kib	Reimplement r255797 using LK_TRYUPGRADE. The r255797 was: Increase the chance of the buffer write from the bufdaemon helper context to succeed. If the locked vnode which owns the buffer to be written is shared locked, try the non-blocking upgrade of the lock to exclusive. PR: kern/178997 Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)
# 255800	22-Sep-2013	kib	Revert r255797. The LK_UPGRADE | LK_NOWAIT drops the lock. Approved by: re (marius, implicit)
# 255797	22-Sep-2013	kib	Increase the chance of the buffer write from the bufdaemon helper context to succeed. If the locked vnode which owns the buffer to be written is shared locked, try the non-blocking upgrade of the lock to exclusive. PR: kern/178997 Reported and tested by: Klaus Weber <fbsd-bugs-2013-1@unix-admin.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)
# 255396	08-Sep-2013	kib	Drain for the xbusy state for two places which potentially do pmap_remove_all(). Not doing the drain allows the pmap_enter() to proceed in parallel, making the pmap_remove_all() effects void. The race results in an invalidated page mapped wired by usermode. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (glebius)
# 255245	05-Sep-2013	kib	The vm_pageout_flush() function sbusies pages in the passed pages run. After that, the pager put method is called, usually translated to VOP_WRITE(). For the filesystems which use buffer cache, bufwrite() sbusies the buffer pages again, waiting for the xbusy state to drain. The latter is done in vfs_drain_busy_pages(), which is called with the buffer pages already sbusied (by vm_pageout_flush()). Since vfs_drain_busy_pages() can only wait for one page at a time, and during the wait, the object lock is dropped, previous pages in the buffer must be protected from other threads busying them. Until now, this was done by xbusying the pages, which is incompatible with the sbusy state in the new implementation of busy. Switch to sbusy. Reported and tested by: pho Sponsored by: The FreeBSD Foundation
# 254668	22-Aug-2013	kib	Both cluster_rbuild() and cluster_wbuild() sometimes set the pages shared busy without first draining the hard busy state. Previously it went unnoticed since VPO_BUSY and m->busy fields were distinct, and vm_page_io_start() did not verify that the passed page has VPO_BUSY flag cleared, but such page state is wrong. New implementation is more strict and caught this case. Drain the busy state as needed, before calling vm_page_sbusy(). Tested by: pho, jkim Sponsored by: The FreeBSD Foundation
# 254649	22-Aug-2013	kib	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation
# 254138	09-Aug-2013	attilio	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the two concepts into a real, minimal, sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed; this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl
# 254025	07-Aug-2013	jeff	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division
# 253327	13-Jul-2013	kib	Assert that runningbufspace does not underflow. Sponsored by: The FreeBSD Foundation
# 253326	13-Jul-2013	kib	There is no need to count waiters for the runningbufspace. Sponsored by: The FreeBSD Foundation
# 253187	11-Jul-2013	kib	Do not invalidate page of the B_NOCACHE buffer or buffer after an I/O error if any user wired mappings exist. Doing the invalidation destroys the user wiring. The change is a temporary measure to close the bug; the more proper fix is to always delegate the invalidation of the page to upper layers. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 253007	07-Jul-2013	alfred	Make kassert_printf use __printflike. Fix associated errors/warnings while I'm here. Requested by: avg
# 252330	28-Jun-2013	jeff	- Add a general purpose resource allocator, vmem, from NetBSD. It was originally inspired by the Solaris vmem detailed in the proceedings of usenix 2001. The NetBSD version was heavily refactored for bugs and simplicity. - Use this resource allocator to allocate the buffer and transient maps. Buffer cache defrags are reduced by 25% when used by filesystems with mixed block sizes. Ultimately this may permit dynamic buffer cache sizing on low KVA machines. Discussed with: alc, kib, attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division
# 251446	05-Jun-2013	jeff	- Consolidate duplicate code into support functions. - Split the bqlock into bqclean and bqdirty locks. - Only acquire the wakeup synchronization locks when we cross a threshold requiring them. - Restructure the way flushbufqueues() targets work so they are more smp friendly and sane. Reviewed by: kib Discussed with: mckusick, attilio Sponsored by: EMC / Isilon Storage Division
# 251282	03-Jun-2013	kib	When auto-sizing the buffer cache, limit the amount of physical memory used as the estimation of size, to 32GB. This provides around 100K of buffer headers and corresponding KVA for buffer map at the peak. Sizing the cache larger is not useful, also resulting in the wasting and exhausting of KVA for large machines. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation
# 251257	02-Jun-2013	alc	Reduce the scope of the VM object locking in brelse(). In my tests, this change reduced the total number of VM object lock acquisitions by brelse() by 74%. Sponsored by: EMC / Isilon Storage Division
# 251171	30-May-2013	jeff	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf
# 250885	21-May-2013	attilio	vm_object locking is not needed there as pages are already wired. Sponsored by: EMC / Isilon storage division Submitted by: alc
# 250751	17-May-2013	attilio	Use readlocking now that assertions on vm_page_lookup() are relaxed. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: flo, pho
# 248792	27-Mar-2013	kib	Add dev_strategy_csw() function, which is similar to dev_strategy() but assumes that a thread reference was already obtained on the passed device. Use the function from physio(), to avoid two extra dev_mtx lock and unlock. Note that physio() is always used as the cdevsw method, or is called from a cdevsw method, and the caller already owns the reference. dev_strategy() is left to keep KPI intact, but now it is implemented as a wrapper around dev_strategy_csw(). Do some style cleanup in physio(). Requested and reviewed by: kan (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 248790	27-Mar-2013	kib	On i386, double the default size of the bio transient map. With the maxbcache size fixed, the auto-tuned transient map is too small for real-world load on i386. Tested by: David Wolfskill Sponsored by: The FreeBSD Foundation
# 248569	21-Mar-2013	kib	Only size and create the bio_transient_map when unmapped buffers are enabled. Now, disabling the unmapped buffers should result in the kernel memory map identical to pre-r248550. Sponsored by: The FreeBSD Foundation
# 248563	20-Mar-2013	kib	In bufwrite(), a dirty buffer is moved to the clean queue before the bufobj counter of the writes in progress is incremented. Another thread inspecting the bufobj would consider it clean. For the regular vnodes, the vnode lock is typically held both by the thread performing the bufwrite() and another thread doing syncing, which prevents the situation. On the other hand, writes to the VCHR vnodes are done without holding vnode lock. Increment the write ref counter for the buffer object before calling bundirty(). Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 2 weeks
# 248515	19-Mar-2013	kib	Do not remap usermode pages into KVA for physio. Sponsored by: The FreeBSD Foundation Tested by: pho
# 248510	19-Mar-2013	kib	Add a helper function vfs_bio_bzero_buf() to zero the portion of the buffer, transparently handling mapped or unmapped buffers. Its intent is to replace the use of bzero(bp->b_data) in cases where the buffer might be unmapped, to avoid unneeded upgrades. Sponsored by: The FreeBSD Foundation Tested by: pho
# 248508	19-Mar-2013	kib	Implement the concept of the unmapped VMIO buffers, i.e. buffers which do not map the b_pages pages into buffer_map KVA. The use of the unmapped buffers eliminates the need to perform TLB shootdown for mapping on the buffer creation and reuse, greatly reducing the amount of IPIs for shootdown on big-SMP machines and eliminating up to 25-30% of the system time on i/o intensive workloads. The unmapped buffer should be explicitly requested by the GB_UNMAPPED flag by the consumer. For unmapped buffer, no KVA reservation is performed at all. The consumer might request unmapped buffer which does have a KVA reserve, to manually map it without recursing into buffer cache and blocking, with the GB_KVAALLOC flag. When the mapped buffer is requested and unmapped buffer already exists, the cache performs an upgrade, possibly reusing the KVA reservation. Unmapped buffer is translated into unmapped bio in g_vfs_strategy(). Unmapped bio carry a pointer to the vm_page_t array, offset and length instead of the data pointer. The provider which processes the bio should explicitly specify a readiness to accept unmapped bio, otherwise g_down geom thread performs the transient upgrade of the bio request by mapping the pages into the new bio_transient_map KVA submap. The bio_transient_map submap claims up to 10% of the buffer map, and the total buffer_map + bio_transient_map KVA usage stays the same. Still, it could be manually tuned by kern.bio_transient_maxcnt tunable, in the units of the transient mappings. Eventually, the bio_transient_map could be removed after all geom classes and drivers can accept unmapped i/o requests. Unmapped support can be turned off by the vfs.unmapped_buf_allowed tunable, disabling which makes the buffer (or cluster) creation requests to ignore GB_UNMAPPED and GB_KVAALLOC flags. Unmapped buffers are only enabled by default on the architectures where pmap_copy_page() was implemented and tested.
In the rework, filesystem metadata is no longer subject to the maxbufspace limit. Since the metadata buffers are always mapped, the buffers still have to fit into the buffer map, which provides a reasonable (but practically unreachable) upper bound on it. The non-metadata buffer allocations, both mapped and unmapped, are accounted against maxbufspace, as before. Effectively, this means that the maxbufspace is forced on mapped and unmapped buffers separately. The pre-patch bufspace limiting code did not work, because buffer_map fragmentation does not allow the limit to be reached. By Jeff Roberson request, the getnewbuf() function was split into smaller single-purpose functions. Sponsored by: The FreeBSD Foundation Discussed with: jeff (previous version) Tested by: pho, scottl (previous version), jhb, bf MFC after: 2 weeks
# 248283	14-Mar-2013	kib	Some style fixes. Sponsored by: The FreeBSD Foundation
# 248282	14-Mar-2013	kib	Add currently unused flag argument to the cluster_read(), cluster_write() and cluster_wbuild() functions. The flags to be allowed are a subset of the GB_* flags for getblk(). Sponsored by: The FreeBSD Foundation Tested by: pho
# 248276	14-Mar-2013	kib	Rewrite the vfs_bio_clrbuf(9) to not access the b_data for B_VMIO buffers directly, use pmap_zero_page_area(9) for each zeroing page region instead. Sponsored by: The FreeBSD Foundation Tested by: pho MFC after: 2 weeks
# 248084	09-Mar-2013	attilio	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# 247389	27-Feb-2013	kib	Make recursive getblk() slightly more useful. Keep the buffer state intact if getblk() is done on the already owned buffer. Exit from brelse() early when the lock recursion is detected, otherwise brelse() might prematurely destroy the buffer under some circumstances. Sponsored by: The FreeBSD Foundation Noted by: mckusick Tested by: pho MFC after: 2 weeks
# 246876	16-Feb-2013	mckusick	Add barrier write capability to the VFS buffer interface. A barrier write is a disk write request that tells the disk that the buffer being written must be committed to the media along with any writes that preceded it before any future blocks may be written to the drive. Barrier writes are provided by adding the functions bbarrierwrite (bwrite with barrier) and babarrierwrite (bawrite with barrier). Following a bbarrierwrite the client knows that the requested buffer is on the media. It does not ensure that buffers written before that buffer are on the media. It only ensures that buffers written before that buffer will get to the media before any buffers written after that buffer. A flush command must be sent to the disk to ensure that all earlier written buffers are on the media. Reviewed by: kib Tested by: Peter Holm
# 244534	21-Dec-2012	attilio	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There is no evidence that consumers could cope with such a change in semantics, and this situation could finally lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week
# 244076	10-Dec-2012	kib	Do not ignore zero address, possibly returned by the vm_map_find() call. The function indicates a failure by the TRUE return value. To be extra safe, assert that the return value from the following vm_map_insert() indicates success. Fix style issues in the nearby lines, reformulate the comment. Reviewed by: alc (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 244054	09-Dec-2012	kib	Remove useless comment. MFC after: 3 days
# 241896	22-Oct-2012	kib	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
# 236487	02-Jun-2012	kib	Fix typo [1]. Use commas to separate flag printouts, in style with other parts of function. Submitted by: bf [1] MFC after: 1 week
# 236465	02-Jun-2012	kib	Update the print mask for decoding b_flags. Add print masks for b_vflags and b_xflags_t and print them as well. MFC after: 1 week
# 235469	15-May-2012	gber	Do not call bremfree for managed buffers. Calling bremfree for these buffers results in panic: "bremfree: buffer %p not on a queue." Approved by: kib
# 232351	01-Mar-2012	mckusick	This change avoids a kernel deadlock on "snaplk" when using snapshots on UFS filesystems running with journaled soft updates. This is the first of several bugs that need to be fixed before removing the restriction added in -r230250 to prevent the use of snapshots on filesystems running with journaled soft updates. The deadlock occurs when holding the snapshot lock (snaplk) and then trying to flush an inode via ffs_update(). We become blocked by another process trying to flush a different inode contained in the same inode block that we need. It holds the inode block for which we are waiting locked. When it tries to write the inode block, it gets blocked waiting for our snaplk when it calls ffs_copyonwrite() to see if the inode block needs to be copied in our snapshot. The most obvious place that this deadlock arises is in the ffs_copyonwrite() routine when it updates critical metadata in a snapshot and tries to write it out before proceeding. The fix here is to write the data and indirect block pointer for the snapshot, but to skip the call to ffs_update() to write the snapshot inode. To ensure that we will never have to update a pointer in the inode itself, the ffs_snapshot() routine that creates the snapshot has to ensure that all the direct blocks are allocated as part of the creation of the snapshot. A less obvious place that this deadlock occurs is when we hold the snaplk because we are deleting a snapshot. In the course of doing the deletion, we need to allocate various soft update dependency structures and allocate some journal space. If we hit a resource limit while doing this we decrease the resources in use by flushing out an existing dirty file to get it to give up the soft dependency resources that it holds.
The flush can cause an ffs_update() to be done on the inode for the file that we have selected to flush resulting in the same deadlock as described above when the inode that we have chosen to flush resides in the same inode block as the snapshot inode that we hold. The fix is to defer cleaning up any time that the inode on which we are operating is a snapshot. Help and review by: Jeff Roberson Tested by: Peter Holm MFC (to 9 only) after: 2 weeks
# 232192	26-Feb-2012	alc	Fix typo. MFC after: 1 week
# 228156	30-Nov-2011	kib	Rename vm_page_set_valid() to vm_page_set_valid_range(). The vm_page_set_valid() is the most reasonable name for the m->valid accessor. Reviewed by: attilio, alc
# 226843	27-Oct-2011	alc	Eliminate vestiges of page coloring in VM_ALLOC_NOOBJ calls to vm_page_alloc(). While I'm here, for the sake of consistency, always specify the allocation class, such as VM_ALLOC_NORMAL, as the first of the flags.
# 225448	08-Sep-2011	attilio	Improve the information reported in case of busy buffers during the shutdown: - Axe out the SHOW_BUSYBUFS option and use a tunable to selectively enable/disable it, which defaults to not printing anything (0 value) but can be changed for printing (1 value) and be verbose (2 value) - Improve the information output: right now, there is no track of the actual struct buf object or vnode which are referenced by the shutdown process, but the related struct bufobj object is printed, which is not really helpful - Add more verbosity about the state of the struct buf lock and the vnode information, with the latter to be activated separately by the sysctl Sponsored by: Sandvine Incorporated Reviewed by: emaste, kib Approved by: re (ksmith) MFC after: 10 days
# 223795	05-Jul-2011	marius	Call pmap_qremove() before freeing or unwiring the pages, otherwise there's a window during which a page can be re-used before its previous mapping is removed. Reviewed by: alc MFC after: 1 week
# 222953	10-Jun-2011	jeff	- When printing bufs with show buf the lblkno is often more useful than the blkno. Print them both.
# 222220	23-May-2011	ru	BKVASIZE was bumped to 16k more than a decade ago.
# 221829	13-May-2011	mdf	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
# 218589	11-Feb-2011	alc	Retire VFS_BIO_DEBUG. Convert those checks that were still valid into KASSERT()s and eliminate the rest. Replace excessive printf()s and a panic() in bufdone_finish() with a KASSERT() in vm_page_io_finish(). Reviewed by: kib
# 218424	07-Feb-2011	mdf	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yield() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
# 218223	03-Feb-2011	alc	Eliminate unnecessary page hold_count checks. These checks predate r90944, which introduced a general mechanism for handling the freeing of held pages. Reviewed by: kib@
# 216810	29-Dec-2010	kib	Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY was cleared early in vm_object_page_clean, before the cleaning pass was done. This is no longer true after r216799. Moreover, since OBJ_CLEANING is a flag, and not the counter, it could be reset too prematurely when parallel vm_object_page_clean() are performed. Reviewed by: alc (as a part of the bigger patch) MFC after: 1 month (after r216799 is merged)
# 216699	25-Dec-2010	alc	Introduce and use a new VM interface for temporarily pinning pages. This new interface replaces the combined use of vm_fault_quick() and pmap_extract_and_hold() throughout the kernel. In collaboration with: kib@
# 216511	17-Dec-2010	alc	Implement and use a single optimized function for unholding a set of pages. Reviewed by: kib@
# 214342	25-Oct-2010	ivoras	Reduce the difference between hirunningspace and lorunningspace, it should help interactivity in edge cases.
# 211213	12-Aug-2010	kib	The buffer's b_vflags field is not always properly protected by bufobj lock. If b_bufobj is not NULL, then bufobj lock should be held when manipulating the flags. Not doing this sometimes leaves BV_BKGRDINPROG erroneously set, causing softdep's getdirtybuf() to get stuck indefinitely in "getbuf" sleep, waiting for background write to finish which is not actually performed. Add BO_LOCK() in the cases where it was missed. In collaboration with: pho Tested by: bz Reviewed by: jeff MFC after: 1 month
# 211129	09-Aug-2010	ivoras	Fix (hopefully) the spelling of "queuing." Submitted by: bf1783 at gmail com
# 211123	09-Aug-2010	ivoras	Elaborate on how hirunningspace was chosen.
# 210923	06-Aug-2010	kib	Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created cdev will never be destroyed. Propagate the flag to devfs vnodes as VV_ETERNALDEV. Use the flags to avoid acquiring devmtx and taking a thread reference on such nodes. In collaboration with: pho MFC after: 1 month
# 210410	23-Jul-2010	ivoras	Make lorunningspace catch up with hirunningspace. While there, add comment about the magic numbers. Prodded by: alc
# 210295	20-Jul-2010	ivoras	Fix expression style. Prodded by: jhb
# 210217	18-Jul-2010	ivoras	In keeping with the Age-of-the-fruitbat theme, scale up hirunningspace on machines which can clearly afford the memory. This is a somewhat conservative version of the patch - more fine tuning may be necessary. Idea from: Thread on hackers@ Discussed with: alc
# 209902	11-Jul-2010	alc	Change the implementation of vm_hold_free_pages() so that it performs at most one call to pmap_qremove(), and thus one TLB shootdown, instead of one call and TLB shootdown per page. Simplify the interface to vm_hold_free_pages(). MFC after: 3 weeks
# 209861	09-Jul-2010	alc	Add support for the VM_ALLOC_COUNT() hint to vm_page_alloc(). Consequently, the maintenance of vm_pageout_deficit can be localized to just two places: vm_page_alloc() and vm_pageout_scan(). This change also corrects an off-by-one error in the maintenance of vm_pageout_deficit. Historically, the buffer cache functions, allocbuf() and vm_hold_load_pages(), have not taken into account that vm_page_alloc() already increments vm_pageout_deficit by one. Reviewed by: kib
# 209713	05-Jul-2010	kib	Add the ability for the allocflag argument of the vm_page_grab() to specify the increment of vm_pageout_deficit when sleeping due to page shortage. Then, in allocbuf(), the code to allocate pages when extending vmio buffer can be replaced by a call to vm_page_grab(). Suggested and reviewed by: alc MFC after: 2 weeks
# 209605	30-Jun-2010	alc	Improve bufdone_finish()'s handling of the bogus page. Specifically, if one or more mappings to the bogus page must be replaced, call pmap_qenter() just once. Previously, pmap_qenter() was called for each mapping to the bogus page. MFC after: 3 weeks
# 209053	11-Jun-2010	mdf	Add INVARIANTS checking that numfreebufs values are sane. Also add a per-buf flag to catch if a buf is double-counted in the free count. This code was useful to debug an instance where a local patch at Isilon was incorrectly managing numfreebufs for a new buf state. Reviewed by: jeff Approved by: zml (mentor)
# 208920	08-Jun-2010	kib	Reorganize the code in bdwrite() which handles move of dirtiness from the buffer pages to buffer. Combine the code to set buffer dirty range (previously in vfs_setdirty()) and to clean the pages (vfs_clean_pages()) into new function vfs_clean_pages_dirty_buf(). Now the vm object lock is acquired only once. Drain the VPO_BUSY bit of the buffer pages before setting valid and clean bits in vfs_clean_pages_dirty_buf() with new helper vfs_drain_busy_pages(). pmap_clear_modify() asserts that page is not busy. In vfs_busy_pages(), move the wait for draining of VPO_BUSY before the dirtyness handling, to follow the structure of vfs_clean_pages_dirty_buf(). Reported and tested by: pho Suggested and reviewed by: alc MFC after: 2 weeks
# 208745	02-Jun-2010	alc	Minimize the use of the page queues lock for synchronizing access to the page's dirty field. With the exception of one case, access to this field is now synchronized by the object lock.
# 208524	25-May-2010	alc	Eliminate the acquisition and release of the page queues lock from vfs_busy_pages(). It is no longer needed. Submitted by: kib
# 208504	24-May-2010	alc	Roughly half of a typical pmap_mincore() implementation is machine- independent code. Move this code into mincore(), and eliminate the page queues lock from pmap_mincore(). Push down the page queues lock into pmap_clear_modify(), pmap_clear_reference(), and pmap_is_modified(). Assert that these functions are never passed an unmanaged page. Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m: Contrary to what the comment says, pmap_mincore() is not simply an optimization. Without a complete pmap_mincore() implementation, mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED because only the pmap can provide this information. Eliminate the page queues lock from vfs_setdirty_locked_object(), vm_pageout_clean(), vm_object_page_collect_flush(), and vm_object_page_clean(). Generally speaking, these are all accesses to the page's dirty field, which are synchronized by the containing vm object's lock. Reduce the scope of the page queues lock in vm_object_madvise() and vm_page_dontneed(). Reviewed by: kib (an earlier version)
# 208264	18-May-2010	alc	The page queues lock is no longer required by vm_page_set_invalid(), so eliminate it. Assert that the object containing the page is locked in vm_page_test_dirty(). Perform some style clean up while I'm here. Reviewed by: kib
# 207796	08-May-2010	alc	Push down the page queues lock into vm_page_cache(), vm_page_try_to_cache(), and vm_page_try_to_free(). Consequently, push down the page queues lock into pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and pmap_remove_write(). Push down the page queues lock into Xen's pmap_page_is_mapped(). (I overlooked the Xen pmap in r207702.) Switch to a per-processor counter for the total number of pages cached.
# 207644	05-May-2010	alc	Push down the acquisition of the page queues lock into vm_page_unwire(). Update the comment describing which lock should be held on entry to vm_page_wire(). Reviewed by: kib
# 207617	04-May-2010	alc	Add page locking to the vm_page_cow* functions. Push down the acquisition and release of the page queues lock into vm_page_wire(). Reviewed by: kib
# 207573	03-May-2010	alc	Acquire the page lock around vm_page_unwire() and vm_page_wire(). Reviewed by: kib
# 207534	02-May-2010	alc	Properly synchronize access to the page's hold_count in vfs_vmio_release(). Reviewed by: kib
# 207530	02-May-2010	alc	It makes no sense for vm_page_sleep_if_busy()'s helper, vm_page_sleep(), to unconditionally set PG_REFERENCED on a page before sleeping. In many cases, it's perfectly ok for the page to disappear, i.e., be reclaimed by the page daemon, before the caller to vm_page_sleep() is reawakened. Instead, we now explicitly set PG_REFERENCED in those cases where having the page persist until the caller is awakened is clearly desirable. Note, however, that setting PG_REFERENCED on the page is still only a hint, and not a guarantee that the page should persist.
# 207410	29-Apr-2010	kmacy	On Alan's advice, rather than do a wholesale conversion on a single architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps. Supported by: Bitgravity Inc. Discussed with: alc, jeffr, and kib
# 207141	24-Apr-2010	jeff	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm
# 206097	02-Apr-2010	avg	bo_bsize: revert r205860 and take an alternative approach in getblk In r205860 I missed the fact that there is code that strongly assumes that devvp bo_bsize is equal to underlying provider's sectorsize. In those places it is hard to obtain the sectorsize in an alternative way if devvp bo_bsize is set to something else. So, I am reverting bo_bsize assignment in g_vfs_open. Instead, in getblk I use DEV_BSIZE block size for b_offset calculation if vp is a disk vp as reported by vn_isdisk. This should coincide with vp being a devvp. Reported by: Mykola Dzham <i@levsha.me> Tested by: Mykola Dzham <i@levsha.me> Pointyhat to: avg MFC after: 2 weeks X-ToDo: convert bread(devvp) in all fs to use bo_bsize-d blocks
# 195773	19-Jul-2009	kib	When buffer write is failed, it is wrong for brelse() to invalidate portion of the page that was written. Among other problems, this page might be picked up by pagedaemon, with failed assertion in vm_pageout_flush() about validity of the page. Reported and tested by: pho Approved by: re (kensmith) MFC after: 3 weeks
# 193637	07-Jun-2009	alc	Eliminate an unused variable from allocbuf(). Eliminate the unnecessary setting of page valid bits from a non-VMIO buffer in vm_hold_load_pages().
# 193201	01-Jun-2009	alc	Eliminate a comment describing code that was deleted over eight years ago. Move another comment to its proper place. Fix a typo in a third comment.
# 193187	31-May-2009	alc	nfs_write() can use the recently introduced vfs_bio_set_valid() instead of vfs_bio_set_validclean(), thereby avoiding the page queues lock. Garbage collect vfs_bio_set_validclean(). Nothing uses it any longer.
# 193044	29-May-2009	alc	Modify vm_hold_load_pages() to allocate pages using VM_ALLOC_NOOBJ rather than using the kernel object. This allows the elimination of page queues locking from vm_hold_free_pages().
# 192908	27-May-2009	zml	fail(9) support: Add support for kernel fault injection using KFAIL_POINT_* macros and fail_point_* infrastructure. Add example fail point in vfs_bio.c to simulate VM buf pressure. Approved by: dfr (mentor)
# 192543	21-May-2009	jhb	Only use the ABI compat shim for vfs.bufspace if the old buffer is smaller than a long. PR: amd64/134786 Submitted by: Emil Mikulic emikulic\| gmail MFC after: 3 days
# 192270	17-May-2009	alc	Several changes to vfs_bio_clrbuf(): Provide a more descriptive comment. Eliminate dead code. The page cannot possibly have PG_ZERO set. Eliminate unnecessary blank lines. Reviewed by: tegge
# 192260	17-May-2009	alc	Introduce vfs_bio_set_valid() and use it from ffs_realloccg(). This eliminates the misuse of vfs_bio_clrbuf() by ffs_realloccg(). In collaboration with: tegge
# 192034	13-May-2009	alc	Eliminate page queues locking from bufdone_finish() through the following changes: Rename vfs_page_set_valid() to vfs_page_set_validclean() to reflect what this function actually does. Suggested by: tegge Introduce a new version of vfs_page_set_valid() that does no more than what the function's name implies. Specifically, it does not update the page's dirty mask, and thus it does not require the page queues lock to be held. Update two of the three callers to the old vfs_page_set_valid() to call vfs_page_set_validclean() instead because they actually require the page's dirty mask to be cleared. Introduce vm_page_set_valid(). Reviewed by: tegge
# 191986	11-May-2009	alc	Revert CVS revision 1.94 (svn r16840). Current pmap implementations don't suffer from the race condition that motivated revision 1.94. Consequently, the work-around that was implemented by revision 1.94 is no longer needed. Moreover, reverting this work-around eliminates the need for vfs_busy_pages() to acquire the page queues lock when preparing a buffer for read. Reviewed by: tegge
# 191220	17-Apr-2009	kan	Undo private changes that should never have been committed.
# 191218	17-Apr-2009	kan	More fallout from negative dotdot caching. Negative entries should be removed from and reinserted to proper ncneg list. Reported by: pho Submitted by: kib
# 191136	16-Apr-2009	kib	In flushbufqueues(), do not allocate sentinel buffer on the stack, struct buf is large. Use sleeping malloc(9) call, and zero the allocated buf as a debugging feature.
# 191135	16-Apr-2009	kib	Export the number of times bufdaemon got help from the normal threads.
# 190331	23-Mar-2009	jhb	Improve the description of a few sysctls. Submitted by: bde (partially) MFC after: 3 days
# 189933	17-Mar-2009	attilio	Fix an old-standing bug that crept in along the several revisions: B_DELWRI cleanup and vnode disassociation should happen just before to assign the buffer to a queue. Reported by: miwi, Volker <volker at vwsoft dot com>, Ben Kaduk <minimarmot at gmail dot com>, Christopher Mallon <christoph dot mallon at gmx dot de> Tested by: lulf, miwi
# 189878	16-Mar-2009	kib	Fix two issues with bufdaemon, often causing the processes to hang in the "nbufkv" sleep. First, ffs background cg group block write requests a new buffer for the shadow copy. When ffs_bufwrite() is called from the bufdaemon due to buffers shortage, requesting the buffer deadlock bufdaemon. Introduce a new flag for getnewbuf(), GB_NOWAIT_BD, to request getblk to not block while allocating the buffer, and return failure instead. Add a flag argument to the geteblk to allow to pass the flags to getblk(). Do not repeat the getnewbuf() call from geteblk if buffer allocation failed and either GB_NOWAIT_BD is specified, or geteblk() is called from bufdaemon (or its helper, see below). In ffs_bufwrite(), fall back to synchronous cg block write if shadow block allocation failed. Since r107847, buffer write assumes that vnode owning the buffer is locked. The second problem is that buffer cache may accumulate many buffers belonging to limited number of vnodes. With such workload, quite often threads that own the mentioned vnodes locks are trying to read another block from the vnodes, and, due to buffer cache exhaustion, are asking bufdaemon for help. Bufdaemon is unable to make any substantial progress because the vnodes are locked. Allow the threads owning vnode locks to help the bufdaemon by doing the flush pass over the buffer cache before getnewbuf() is going to uninterruptible sleep. Move the flushing code from buf_daemon() to new helper function buf_do_flush(), that is called from getnewbuf(). The number of buffers flushed by single call to buf_do_flush() from getnewbuf() is limited by new sysctl vfs.flushbufqtarget. Prevent recursive calls to buf_do_flush() by marking the bufdaemon and threads that temporarily help bufdaemon by TDP_BUFNEED flag. In collaboration with: pho Reviewed by: tegge (previous version) Tested by: glebius, yandex ... MFC after: 3 weeks
# 189648	10-Mar-2009	jhb	In the ABI shim for vfs.bufspace, rather than truncating values larger than INT_MAX to INT_MAX, just go ahead and write out the full long to give an error of ENOMEM to the user process. Requested by: bde
# 189627	10-Mar-2009	jhb	Add an ABI compat shim for the vfs.bufspace sysctl for sysctl requests that try to fetch it as an int rather than a long. If the current value is greater than INT_MAX it reports a value of INT_MAX.
# 189595	09-Mar-2009	jhb	Adjust some variables (mostly related to the buffer cache) that hold address space sizes to be longs instead of ints. Specifically, the following values are now longs: runningbufspace, bufspace, maxbufspace, bufmallocspace, maxbufmallocspace, lobufspace, hibufspace, lorunningspace, hirunningspace, maxswzone, maxbcache, and maxpipekva. Previously, a relatively small number (~ 44000) of buffers set in kern.nbuf would result in integer overflows resulting either in hangs or bogus values of hidirtybuffers and lodirtybuffers. Now one has to overflow a long to see such problems. There was a check for a nbuf setting that would cause overflows in the auto-tuning of nbuf. I've changed it to always check and cap nbuf but warn if a user-supplied tunable would cause overflow. Note that this changes the ABI of several sysctls that are used by things like top(1), etc., so any MFC would probably require some gross shims to allow for that. MFC after: 1 month
# 188244	06-Feb-2009	jhb	Tweak the output of VOP_PRINT/vn_printf() some. - Align the fifo output in fifo_print() with other vn_printf() output. - Remove the leading space from lockmgr_printinfo() so its output lines up in vn_printf(). - lockmgr_printinfo() now ends with a newline, so remove an extra newline from vn_printf().
# 183754	10-Oct-2008	attilio	Remove the struct thread unuseful argument from bufobj interface. In particular following functions KPI results modified: - bufobj_invalbuf() - bufsync() and BO_SYNC() "virtual method" of the buffer objects set. Main consumers of bufobj functions are affected by this change too and, in particular, functions which changed their KPI are: - vinvalbuf() - g_vfs_close() Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit. As a side note, please consider just temporary the 'curthread' argument passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
# 183072	16-Sep-2008	kib	Add the ffs structures introspection functions for ddb. Show the b_dep value for the buffer in the show buffer command. Add a command to dump the dirty/clean buffer list for vnode. Reviewed by: tegge Tested and used by: pho MFC after: 1 month
# 181868	19-Aug-2008	kib	In brelse, put the B_NEEDSGIANT buffer on the QUEUE_DIRTY_GIANT queue, instead of QUEUE_DIRTY. Tested by: pho Reviewed by: attilio MFC after: 3 days
# 180625	20-Jul-2008	alc	Eliminate dead code. (The commit message for revision 1.287 explains why this code is dead.)
# 177687	28-Mar-2008	attilio	b_waiters cannot be adequately protected by the interlock because it is dropped after the call to lockmgr() so just revert this approach using something similar to the precedent one: BUF_LOCKWAITERS() just checks if there are waiters (not the actual number of them) and it is based on newly introduced lockmgr_waiters() which returns if the lockmgr has waiters or not. The name has been chosen differently from the old lockwaiters() in order to not confuse them. KPI results enriched by this commit so __FreeBSD_version bumping and manpage update will be happening soon. 'struct buf' also changes, so kernel ABI is disturbed. Bug found by: jeff Approved by: jeff, kib
# 177493	22-Mar-2008	jeff	- Complete part of the unfinished bufobj work by consistently using BO_LOCK/UNLOCK/MTX when manipulating the bufobj. - Create a new lock in the bufobj to lock bufobj fields independently. This leaves the vnode interlock as an 'identity' lock while the bufobj is an io lock. The bufobj lock is ordered before the vnode interlock and also before the mnt ilock. - Exploit this new lock order to simplify softdep_check_suspend(). - A few sync related functions are marked with a new XXX to note that we may not properly interlock against a non-zero bv_cnt when attempting to sync all vnodes on a mountlist. I do not believe this race is important. If I'm wrong this will make these locations easier to find. Reviewed by: kib (earlier diff) Tested by: kris, pho (earlier diff)
# 177475	21-Mar-2008	kib	Reduce contention on the vnode interlock by not acquiring the BO_LOCK around the check for the BV_BKGRDINPROG in the brelse() and bqrelse(). See the comment for the explanation why it is safe. Tested by: pho Submitted by: jeff
# 177472	21-Mar-2008	jeff	- Reduce contention on the global bdonelock and bpinlock by using a pool mutex to protect these sleep/wakeup/counter races. This still is preferable to bloating each bio with a mtx.
# 177253	16-Mar-2008	rwatson	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
# 176708	01-Mar-2008	attilio	- Handle buffer lock waiters count directly in the buffer cache rather than rely on the lockmgr support [1]: * bump the waiters only if the interlock is held * let brelvp() return the waiters count * rely on brelvp() rather than BUF_LOCKWAITERS() in order to check for the waiters number - Remove a namespace pollution introduced recently with lockmgr.h including lock.h by including lock.h directly in the consumers and making it mandatory for using lockmgr. - Modify flags accepted by lockinit(): * introduce LK_NOPROFILE which disables lock profiling for the specified lockmgr * introduce LK_QUIET which disables ktr tracing for the specified lockmgr [2] * disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it can only be used on a per-instance basis - Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer used This patch breaks KPI so __FreeBSD_version will be bumped and manpages updated by further commits. Additionally, 'struct buf' changes result in a disturbed ABI also. [2] Really, currently there is no ktr tracing in the lockmgr, but it will be added soon. [1] Submitted by: kib Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
# 176249	13-Feb-2008	attilio	- Add real assertions to lockmgr locking primitives. A couple of notes for this: * WITNESS support, when enabled, is only used for shared locks in order to avoid problems with the "disowned" locks * KA_HELD and KA_UNHELD only exists in the lockmgr namespace in order to assert for a generic thread (not curthread) owning or not the lock. Really, this kind of check is bogus but it seems very widespread in the consumers code. So, for the moment, we cater this untrusted behaviour, until the consumers are not fixed and the options could be removed (hopefully during 8.0-CURRENT lifecycle) * Implementing KA_HELD and KA_UNHELD (not supported natively by WITNESS) made necessary the introduction of LA_MASKASSERT which specifies the range for default lock assertion flags * About other aspects, lockmgr_assert() follows exactly what other locking primitives offer about this operation. - Build real assertions for buffer cache locks on the top of lockmgr_assert(). They can be used with the BUF_ASSERT_*(bp) paradigm. - Add checks at lock destruction time and use a cookie for verifying lock integrity at any operation. - Redefine BUF_LOCKFREE() in order to not use a direct assert but let it rely on the aforementioned destruction time check. KPI results evidently broken, so __FreeBSD_version bumping and manpage update result necessary and will be committed soon. Side note: lockmgr_assert() will be used soon in order to implement real assertions in the vnode namespace replacing the legacy and still bogus "VOP_ISLOCKED()" way. Tested by: kris (earlier version) Reviewed by: jhb
# 175486	19-Jan-2008	attilio	- Introduce the function lockmgr_recursed() which returns true if the lockmgr lkp, when held in exclusive mode, is recursed - Introduce the function BUF_RECURSED() which does the same for bufobj locks based on the top of lockmgr_recursed() - Introduce the function BUF_ISLOCKED() which works like the counterpart VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus BUF_REFCNT() in a more explicative and SMP-compliant way. This allows us to axe out BUF_REFCNT() and leaving the function lockcount() totally unused in our stock kernel. Further commits will axe lockcount() as well as part of lockmgr() cleanup. KPI results, obviously, broken so further commits will update manpages and freebsd version. Tested by: kris (on UFS and NFS)
# 175294	13-Jan-2008	attilio	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
# 175202	09-Jan-2008	attilio	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
# 174992	30-Dec-2007	imp	Rather than not redirtying the bp when we get ENXIO, only redirty it when the error is EIO. This catches a much larger class of errors that are unlikely to succeed if retried. Submitted by: bde
# 174937	27-Dec-2007	imp	A partial solution to some of the 'pull the umass device with a mounted FS' problems. These are more along the lines of 'avoiding an avoidable panic' than a complete solution to removable devices. We now close the barn door after the horse has gotten loose and has been hit by a truck, as it were. The barn no longer catches fire in this case, but the horse is still dead :-). The vfs_bio.c fix causes us not to put a failed write back into the dirty pool if the error returned was ENXIO. In that case, the buffer is treated like any other clean buffer that's being returned. ENXIO means the device isn't there anymore and will never be there again in the future, so retrying is futile. The vfs_mount.c fix treats 'ENXIO' as success for unmounting a file system. If the device is gone, retrying later won't help and we'll never be able to unmount the device. These two are part of a larger patch set submitted by the author. The other patches will be forthcoming. I added comments to these two patches. Submitted by: Henrik Gulbrandsen Reviewed by: phk@ PR: usb/46176 (partial)
# 174140	01-Dec-2007	alc	Eliminate vfs_page_set_valid()'s unused argument.
# 172836	20-Oct-2007	julian	Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. This makes way for us to add REAL kthread_create() and friends that actually make threads. It turns out that most of these calls actually end up being moved back to the thread version when it's added, but we need to make this cosmetic change first. I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.
# 172329	26-Sep-2007	ru	Fix the description of the formula used to autosize the number of buffers in the buffer cache. Approved by: re (kensmith)
# 172317	25-Sep-2007	alc	Change the management of cached pages (PQ_CACHE) in two fundamental ways: (1) Cached pages are no longer kept in the object's resident page splay tree and memq. Instead, they are kept in a separate per-object splay tree of cached pages. However, access to this new per-object splay tree is synchronized by the _free_ page queues lock, not to be confused with the heavily contended page queues lock. Consequently, a cached page can be reclaimed by vm_page_alloc(9) without acquiring the object's lock or the page queues lock. This solves a problem independently reported by tegge@ and Isilon. Specifically, they observed the page daemon consuming a great deal of CPU time because of pages bouncing back and forth between the cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of this problem turned out to be a deadlock avoidance strategy employed when selecting a cached page to reclaim in vm_page_select_cache(). However, the root cause was really that reclaiming a cached page required the acquisition of an object lock while the page queues lock was already held. Thus, this change addresses the problem at its root, by eliminating the need to acquire the object's lock. Moreover, keeping cached pages in the object's primary splay tree and memq was, in effect, optimizing for the uncommon case. Cached pages are reclaimed far, far more often than they are reactivated. Instead, this change makes reclamation cheaper, especially in terms of synchronization overhead, and reactivation more expensive, because reactivated pages will have to be reentered into the object's primary splay tree and memq. (2) Cached pages are now stored alongside free pages in the physical memory allocator's buddy queues, increasing the likelihood that large allocations of contiguous physical memory (i.e., superpages) will succeed. 
Finally, as a result of this change long-standing restrictions on when and where a cached page can be reclaimed and returned by vm_page_alloc(9) are eliminated. Specifically, calls to vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and return a formerly cached page. Consequently, a call to malloc(9) specifying M_NOWAIT is less likely to fail. Discussed with: many over the course of the summer, including jeff@, Justin Husted @ Isilon, peter@, tegge@ Tested by: an earlier version by kris@ Approved by: re (kensmith)
# 170475	09-Jun-2007	marcel	Work around an integer overflow in expression `3 * maxbufspace / 4', when maxbufspace is larger than INT_MAX / 3. The overflow causes a hard hang on ia64 when physical memory is sufficiently large (8GB).
# 170424	08-Jun-2007	delphij	In getblk(), before gbincore(), use BO_LOCK directly when locking the bufobj, rather than using VI_LOCK, like what was done with revision 1.453.
# 170174	31-May-2007	jeff	- Move rusage from being per-process in struct pstats to per-thread in td_ru. This removes the requirement for per-process synchronization in statclock() and mi_switch(). This was previously supported by sched_lock which is going away. All modifications to rusage are now done in the context of the owning thread. Reads proceed without locks. - Aggregate exiting threads' rusage in thread_exit() such that the exiting thread's rusage is not lost. - Provide a new routine, rufetch() to fetch an aggregate of all rusage structures from all threads in a process. This routine must be used in any place requiring a rusage from a process prior to its exit. The exited process's rusage is still available via p_ru. - Aggregate tick statistics only on demand via rufetch() or when a thread exits. Tick statistics are kept in the thread and protected by sched_lock until it exits. Initial patch by: attilio Reviewed by: attilio, bde (some objections), arch (mostly silent)
# 170170	31-May-2007	attilio	Revert VMCNT_* operations introduction. Probably, a general approach is not the best solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)
# 169667	18-May-2007	jeff	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>
# 169006	24-Apr-2007	kib	Disable nesting of BOP_BDFLUSH(). VOP_FSYNC() call in bdwrite() could result in bdwrite() being reentered, thus causing infinite recursion. Reported and tested by: Peter Holm Reviewed by: tegge MFC after: 2 weeks
# 168024	29-Mar-2007	wkoszek	vm_map_delete should be used only internally, by the VM subsystem. Replace it with vm_map_remove, which not only embeds additional check, but also takes care of locking. Reviewed by: alc Approved by: alc, cognet (mentor)
# 167877	25-Mar-2007	kris	Correct a comment typo
# 167327	08-Mar-2007	julian	Instead of doing comparisons using the pcpu area to see if a thread is an idle thread, just see if it has the IDLETD flag set. That flag will probably move to the pflags word as it's permanent and never changes for the life of the system so it doesn't need locking.
# 166889	22-Feb-2007	delphij	Use LIST_EMPTY() instead of unrolled version (LIST_FIRST() [!=]= NULL)
# 166193	23-Jan-2007	kib	Cylinder group bitmaps and blocks containing inode for a snapshot file are after snaplock, while other ffs device buffers are before snaplock in global lock order. By itself, this could cause deadlock when bdwrite() tries to flush dirty buffers on snapshotted ffs. If, during the flush, COW activity for snapshot needs to allocate block and ffs_alloccg() selects the cylinder group that is being written by bdwrite(), then kernel would panic due to recursive buffer lock acquisition. Avoid dealing with buffers in bdwrite() that are from other side of snaplock divisor in the lock order than the buffer being written. Add new BOP, bop_bdwrite(), to do dirty buffer flushing for same vnode in the bdwrite(). Default implementation, bufbdflush(), refactors the code from bdwrite(). For ffs device buffers, specialized implementation is used. Reviewed by: tegge, jeff, Russell Cattelan (cattelan xfs org, xfs changes) Tested by: Peter Holm X-MFC after: 3 weeks (if ever: it changes ABI)
# 165375	20-Dec-2006	kib	In rev. 1.514, iodone on async buffer may happen before code checks the vnode v_flag. For cluster buffers this would result in dereferencing NULL b_vp. To prevent the panic, cache relevant vnode flag before calling bstrategy. Reported by: Peter Holm, kris Tested by: Peter Holm Reviewed by: tegge Pointy hat to: kib
# 165203	14-Dec-2006	kib	Resolve two deadlocks that could be caused by busy md device backed by vnode. Allow for md thread and the thread that owns lock on vnode backing the md device to do the write even when runningbufspace is exhausted. Tested by: Peter Holm Reviewed by: tegge MFC after: 2 weeks
# 163750	28-Oct-2006	alc	Refactor vfs_setdirty(), creating vfs_setdirty_locked_object(). Call vfs_setdirty_locked_object() from vfs_busy_pages() instead of vfs_setdirty(), thereby eliminating a second acquisition and release of the same vm object lock.
# 163745	28-Oct-2006	alc	In bufdone_finish() restrict the acquisition and release of the page queues lock to BIO_READ operations. Recent changes to the implementation of the per-page flags have eliminated the need for the page queues lock in the other cases.
# 163604	22-Oct-2006	alc	Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object lock instead of the global page queues lock.
# 162941	02-Oct-2006	tegge	If the buffer lock has waiters after the buffer has changed identity then getnewbuf() needs to drop the buffer in order to wake waiters that might sleep on the buffer in the context of the old identity.
# 161125	09-Aug-2006	alc	Introduce a field to struct vm_page for storing flags that are synchronized by the lock on the object containing the page. Transition PG_WANTED and PG_SWAPINPROG to use the new field, eliminating the need for holding the page queues lock when setting or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to VPO_WANTED and VPO_SWAPINPROG, respectively. Eliminate the assertion that the page queues lock is held in vm_page_io_finish(). Eliminate the acquisition and release of the page queues lock around calls to vm_page_io_finish() in kern_sendfile() and vfs_unbusy_pages().
# 161070	08-Aug-2006	alc	Reduce the scope of the page queues lock in vfs_busy_pages() now that vm_page_sleep_if_busy() no longer requires the caller to hold the page queues lock.
# 160540	21-Jul-2006	alc	Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.
# 157469	04-Apr-2006	jeff	- Properly check against B_DELWRI and B_NEEDSGIANT. This check was incorrectly written and caused some !NEEDSGIANT buffers to be put in the NEEDSGIANT queue. Sponsored by: Isilon Systems, Inc.
# 157319	31-Mar-2006	jeff	- Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf requires Giant. It is set in bgetvp and cleared in brelvp. - Create QUEUE_DIRTY_GIANT for dirty buffers that require giant. - In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and only if we think there are buffers in that queue. Sponsored by: Isilon Systems, Inc.
# 156980	21-Mar-2006	pjd	Destroy "bip" bio in error case. Found by: Coverity Prevent analysis tool Coverity ID: 795 MFC after: 3 days
# 155229	02-Feb-2006	tegge	For low memory situations, non-VMIO buffers didn't release pages back to the system when brelse() was called with B_RELBUF set on the buffer. This could be a problem when the system was low on memory, had many buffers on QUEUE_EMPTYKVA and started to traverse directories. For each getnewbuf(), pages were allocated from the system, driving the free reserve downwards. For each brelse(), the system put the buffer on QUEUE_CLEAN, with B_INVAL set. This commit changes the semantics of B_RELBUF to also free pages from non-VMIO buffers. Reviewed by: alc
# 154695	22-Jan-2006	alc	Remove an unnecessary call to pmap_remove_all(). The given page is not mapped because its contents are invalid. Reviewed by: tegge
# 154441	16-Jan-2006	tegge	Set flag in needsbuffer while still holding bqlock to avoid lost wakeup.
# 153940	31-Dec-2005	netchild	MI changes: - provide an interface (macros) to the page coloring part of the VM system, this allows to try different coloring algorithms without the need to touch every file [1] - make the page queue tuning values readable: sysctl vm.stats.pagequeue - autotuning of the page coloring values based upon the cache size instead of options in the kernel config (disabling of the page coloring as a kernel option is still possible) MD changes: - detection of the cache size: only IA32 and AMD64 (untested) contains cache size detection code, every other arch just comes with a dummy function (this results in the use of default values like it was the case without the autotuning of the page coloring) - print some more info on Intel CPU's (like we do on AMD and Transmeta CPU's) Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue" and report if the cache* values are zero (= bug in the cache detection code) or not. Based upon work by: Chad David <davidc@acns.ab.ca> [1] Reviewed by: alc, arch (in 2004) Discussed with: alc, Chad David, arch (in 2004)
# 153192	07-Dec-2005	rodrigc	Changes imported from XFS for FreeBSD project: - add fields to struct buf (needed by XFS) - 3 private fields: b_fsprivate1, b_fsprivate2, b_fsprivate3 - b_pin_count, count of pinned buffer - add new B_MANAGED flag - add breada() function to initiate asynchronous I/O on read-ahead blocks. - add bufdone_finish(), bpin(), bunpin_wait() functions Patches provided by: kan Reviewed by: phk Silence on: arch@
# 151897	31-Oct-2005	rwatson	Normalize a significant number of kernel malloc type names: - Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat. - Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters. - Disambiguate some collisions by adding subsystem prefixes to some memory types. - Generally prefer lower case to upper case. - If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases. Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.
# 151187	09-Oct-2005	tegge	Release clean buffer with wrong size and no dependencies also for non-VMIO case.
# 150760	30-Sep-2005	truckman	Un-staticize waitrunningbufspace() and call it before returning from ffs_copyonwrite() if any async writes were launched. Restore the thread's previous TDP_NORUNNINGBUF state before returning from ffs_copyonwrite().
# 150741	29-Sep-2005	truckman	Un-staticize runningbufwakeup() and staticize updateproc. Add a new private thread flag to indicate that the thread should not sleep if runningbufspace is too large. Set this flag on the bufdaemon and syncer threads so that they skip the waitrunningbufspace() call in bufwrite() rather than checking the proc pointer vs. the known proc pointers for these two threads. A way of preventing these threads from being starved for I/O but still placing limits on their outstanding I/O would be desirable. Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from blocking on the runningbufspace check while holding snaplk. This prevents snaplk from being held for an arbitrarily long period of time if runningbufspace is high and greatly reduces the contention for snaplk. The disadvantage is that ffs_copyonwrite() can start a large amount of I/O if there are a large number of snapshots, which could cause a deadlock in other parts of the code. Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace before attempting to grab snaplk so that I/O requests waiting on snaplk are not counted in runningbufspace as being in-progress. Increment runningbufspace again before actually launching the original I/O request. Prior to the above two changes, the system could deadlock if enough I/O requests were blocked by snaplk to prevent runningbufspace from falling below lorunningspace and one of the bawrite() calls in ffs_copyonwrite() blocked in waitrunningbufspace() while holding snaplk. See <http://www.holm.cc/stress/log/cons143.html>
# 150705	29-Sep-2005	peadar	Close a race in biodone(), whereby the bio_done field of the passed bio may have been freed and reassigned by the wakeup before being tested after releasing the bdonelock. There's a non-zero chance this is the cause of a few of the crashes knocking around with biodone() sitting in the stack backtrace. Reviewed By: phk@
# 148670	03-Aug-2005	jeff	- Use lockmgr_printinfo rather than rolling our own. This introduces a slight problem by using printf instead of db_printf however 'show lockedvnods' does the same so I believe it is ok for now.
# 148200	20-Jul-2005	alc	Eliminate inconsistency in the setting of the B_DONE flag. Specifically, make the b_iodone callback responsible for setting it if it is needed. Previously, it was set unconditionally by bufdone() without holding whichever lock is shared by the b_iodone callback and the corresponding top-half function. Consequently, in a race, the top-half function could conclude that operation was done before the b_iodone callback finished. See, for example, aio_physwakeup() and aio_fphysio(). Note: I don't believe that the other, more widely-used b_iodone callbacks are affected. Discussed with: jeff Reviewed by: phk MFC after: 2 weeks
# 147388	14-Jun-2005	jeff	- Add and enhance asserts related to the wrong bufobj panic. Sponsored by: Isilon Systems, Inc. Approved by: re (blanket vfs)
# 147325	12-Jun-2005	jeff	- Split one KASSERT in bremfree() into two to aid in debugging. Sponsored by: Isilon Systems, Inc.
# 147280	10-Jun-2005	green	Fix a serious deadlock with the NFS client. Given a large enough atomic write request, it can fill the buffer cache with the entirety of that write in order to handle retries. However, it never drops the vnode lock, or else it wouldn't be atomic, so it ends up waiting indefinitely for more buf memory that cannot be gotten as it has it all, and it waits in an uncancellable state. To fix this, hibufspace is exported and scaled to a reasonable fraction. This is used as the limit of how much of an atomic write request by the NFS client will be handled asynchronously. If the request is larger than this, it will be turned into a synchronous request which won't deadlock the system. It's possible this value is far off from what is required by some, so it shall be tunable as soon as mount_nfs(8) learns of the new field. The slowdown between an asynchronous and a synchronous write on NFS appears to be on the order of 2x-4x. General nod by: gad MFC after: 2 weeks More testing: wes PR: kern/79208
# 147154	09-Jun-2005	jeff	- My sub-par public school education has been exposed. s/sentinal/sentinel/ Noticed by: Emil Mikulic
# 147140	08-Jun-2005	jeff	- Under heavy IO load the buf daemon can run for many hundreds of milliseconds due to what is essentially n^2 algorithmic complexity. This change makes the algorithm N*2 instead. This heavy processing manifested itself as skipping in audio and video playback due to the long scheduling latencies and contention on giant by pcm. - flushbufqueues() is now responsible for flushing multiple buffers rather than one at a time. This allows us to save our progress in the list by using a sentinel. We must do the numdirtywakeup() and waitrunningbufspace() here now rather than in buf_daemon(). - Also add a uio_yield() after we have processed the list once for bufs without deps and again for bufs with deps. This is to release Giant and allow any other giant locked code to proceed. Tested by: Many users on current@ Revealed by: schedgraph traces sent by Emil Mikulic & Anthony Ginepro
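The sentinel idea in r147140 can be illustrated outside the kernel. The following is a minimal userland sketch, not the actual flushbufqueues() code: `struct item`, `flush_queue()`, and the `dirty` field are hypothetical names, and locking is only indicated in a comment. A marker element is inserted into the queue and moved forward past each entry as it is examined, so a scan that pauses can resume from the marker instead of restarting at the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical queue element; a real buffer carries far more state. */
struct item {
	TAILQ_ENTRY(item) link;
	bool is_sentinel;	/* marker entries are skipped by scanners */
	bool dirty;		/* stands in for "needs to be written" */
	int id;
};
TAILQ_HEAD(itemq, item);

/*
 * Flush up to `target` dirty items in one pass over the queue.
 * The sentinel records how far the scan has progressed, so each
 * element is visited once per call. Returns the number flushed.
 */
static int
flush_queue(struct itemq *q, int target)
{
	struct item sentinel = { .is_sentinel = true };
	struct item *it;
	int flushed = 0;

	assert(q != NULL);
	TAILQ_INSERT_HEAD(q, &sentinel, link);
	while (flushed < target) {
		it = TAILQ_NEXT(&sentinel, link);
		if (it == NULL)
			break;
		/* Advance the marker past the element being examined. */
		TAILQ_REMOVE(q, &sentinel, link);
		TAILQ_INSERT_AFTER(q, it, &sentinel, link);
		if (it->is_sentinel || !it->dirty)
			continue;
		/*
		 * Assumption: a real implementation would drop the queue
		 * lock here, issue the write, and retake the lock. The
		 * marker keeps our position valid across that window.
		 */
		it->dirty = false;
		flushed++;
	}
	TAILQ_REMOVE(q, &sentinel, link);
	return (flushed);
}
```

A caller would build the queue with TAILQ_INIT() and TAILQ_INSERT_TAIL() and then call flush_queue() with the number of buffers it wants written in this pass.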
# 146801	30-May-2005	jeff	- Add bufobj_wrefl() to add a write ref to a bufobj that is already locked.
# 145704	30-Apr-2005	jeff	- Remove long dead splbio() calls and comments relating to the old synchronization mechanism.
# 145703	30-Apr-2005	jeff	- Don't acquire Giant before calling b_biodone, individual consumers are now required to do so themselves. Sponsored by: Isilon Systems, Inc.
# 145384	21-Apr-2005	jeff	- Add two KASSERTs to prevent us from recycling a buf that is still on a bufobj list. Sponsored by: Isilon Systems, Inc.
# 144084	24-Mar-2005	jeff	- Add information about the buf lock to db_show_buffer. - Add a 'show lockedbufs' command that is similar to show lockedvnods. Sponsored by: Isilon Systems, Inc.
# 143281	08-Mar-2005	jeff	- Lock access to the buffer_map with the vm_map lock. In 4.x this was done with splbio, in 5.x this was done with Giant. Discussed with: alc Reported by: julian, pho
# 141637	10-Feb-2005	phk	Make various vnode related functions static
# 141597	10-Feb-2005	jeff	- Add more information to the getnewbuf() recycling KTR. Sponsored by: Isilon Systems, Inc.
# 141546	08-Feb-2005	jeff	- Remove an invalid KASSERT added in recent background write reshuffling. Sponsored by: Isilon Systems, Inc.
# 141539	08-Feb-2005	phk	Background writes are entirely an FFS/Softupdates thing. Give FFS vnodes a specific bufwrite method which contains all the background write stuff and then calls into the default bufwrite() for the rest of the job. Remove all the background write related stuff from the normal bufwrite. This drags the softdep_move_dependencies() back into FFS. Long term, it is worth looking at simply copying the data into allocated memory and issuing the bio directly and not create the "shadow buf" in the first place (just like copy-on-write is done in snapshots for instance). I don't think we really gain anything but complexity from doing this with a buf.
# 141337	04-Feb-2005	jeff	- Don't release BKGRDINPROG until after we've bufdone'd the copy. Sponsored by: Isilon Systems, Inc.
# 140946	28-Jan-2005	jeff	- Don't drop the wref on the bufobj until after bufdone() has completed. Without this, threads waiting in bufobj_wwait() may wakeup prior to bufdone() completing. Sponsored by: Isilon Systems, Inc.
# 140782	24-Jan-2005	phk	Don't use VOP_GETVOBJECT, use vp->v_object directly.
# 140734	24-Jan-2005	phk	Kill the VV_OBJBUF and test the v_object for NULL instead.
# 140721	24-Jan-2005	jeff	- Add CTR calls to trace the lifecycle of a buffer. - Remove some KASSERTs which are invalid if the appropriate lock is not held. - Slightly restructure bremfree() so that it is more sane. - Change the flush code in bdwrite() to avoid acquiring a mutex whenever possible. - Change the flush code in bdwrite() to avoid holding the bufobj mutex while calling buf_countdeps(). This introduces a lock-order relationship with the softdep lock that can not otherwise be resolved. - Don't set B_DONE until bufdone() is complete, otherwise another processor may believe the buf is done before it is. - Only acquire Giant if the caller has set b_iodone. Don't grab giant around normal bufdone() calls. Sponsored By: Isilon Systems, Inc.
# 140056	11-Jan-2005	phk	Add BO_SYNC() and add a default which uses the secret vnode pointer and VOP_FSYNC() for now.
# 140048	11-Jan-2005	phk	Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC(). I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense: The credentials for syncing a file (ability to write to the file) should be checked at the system call level. Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well. If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to be the cached mount credential, or a credential cached along with any delayed write data. Discussed with: rwatson
# 137846	18-Nov-2004	jeff	- Eliminate the acquisition and release of the bqlock in bremfree() by setting the B_REMFREE flag in the buf. This is done to prevent lock order reversals with code that must call bremfree() with a local lock held. This also reduces overhead by removing two lock operations per buf for fsync() and similar. - Check for the B_REMFREE flag in brelse() and bqrelse() after the bqlock has been acquired so that we may remove ourself from the free-list. - Provide a bremfreef() function to immediately remove a buf from a free-list for use only by NFS. This is done because the nfsclient code overloads the b_freelist queue for its own async. io queue. - Simplify the numfreebuffers accounting by removing a switch statement that executed the same code in every possible case. - getnewbuf() can encounter locked bufs on free-lists once Giant is removed. Remove a panic associated with this condition and delay asserts that inspect the buf until after it is locked. Reviewed by: phk Sponsored by: Isilon Systems, Inc.
# 137197	04-Nov-2004	phk	Retire b_magic now, we have the bufobj containing the same hint.
# 137193	04-Nov-2004	phk	Change buf->b_object to buf->b_bufobj->bo_object some whitespace fixes.
# 137188	04-Nov-2004	phk	whitespace
# 137186	04-Nov-2004	phk	Remove buf->b_dev field.
# 137168	03-Nov-2004	alc	The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol. This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.
# 137042	29-Oct-2004	phk	Remove the last call in the system to VOP_SPECSTRATEGY(): We can no longer come through the VNODE layer to the disks since all the filesystems now go via geom_vfs to GEOM.
# 137029	29-Oct-2004	phk	Give dev_strategy() an explicit cdev argument in preparation for removing buf->b_dev. Put a bio between the buf passed to dev_strategy() and the device driver strategy routine in order to not clobber fields in the buf. Assert copyright on vfs_bio.c and update copyright message to canonical text. There is no legal difference between John Dyson's two-clause abbreviated BSD license and the canonical text.
# 137010	28-Oct-2004	phk	Lock bp->b_bufobj->b_object instead of bp->b_object
# 136969	26-Oct-2004	phk	The island council met and voted buf_prewrite() home. Give ffs its own bufobj->bo_ops vector and create a private strategy routine, (currently misnamed for forwards compatibility), which is just a copy of the generic bufstrategy routine except we call softdep_disk_prewrite() directly instead of through the buf_prewrite() indirection. Teach UFS about the need for softdep_disk_prewrite() and call the function directly in FFS. Remove buf_prewrite() from the default bufstrategy() and from the global bio_ops method vector.
# 136966	26-Oct-2004	phk	Put the I/O block size in bufobj->bo_bsize. We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot when filesystems sit on GEOM, so don't bother for now.
# 136965	26-Oct-2004	alc	Hold the lock on the containing vm object when calling vm_page_sleep_if_busy().
# 136941	25-Oct-2004	phk	Remove vnode->v_bsize. This was a dead-end.
# 136939	25-Oct-2004	alc	Use VM_ALLOC_NOBUSY to eliminate vm_page_wakeup() calls and the acquisition and release of the global page queues lock required to make the call. Remove GIANT_REQUIRED from vm_hold_free_pages(). All of its VM operations are properly synchronized.
# 136938	25-Oct-2004	phk	Collapse vnode->v_object and buf->b_object into bufobj->bo_object.
# 136927	24-Oct-2004	phk	Move the buffer method vector (buf->b_op) to the bufobj. Extend it with a strategy method. Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY song and dance. Rename ibwrite to bufwrite(). Move the two NFS buf_ops to more sensible places, add bufstrategy to them. Add inlines for bwrite() and bstrategy() which call through buf->b_bufobj->b_ops->b_{write,strategy}(). Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
# 136767	22-Oct-2004	phk	Add b_bufobj to struct buf which eventually will eliminate the need for b_vp. Initialize b_bufobj for all buffers. Make incore() and gbincore() take a bufobj instead of a vnode. Make inmem() local to vfs_bio.c Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj) also VI_MTX() to BO_MTX(), Make buf_vlist_add() take a bufobj instead of a vnode. Eliminate other uses of bp->b_vp where bp->b_bufobj will do. Various minor polishing: remove "register", turn panic into KASSERT, use new function declarations, TAILQ_FOREACH_SAFE() etc.
# 136751	21-Oct-2004	phk	Move the VI_BWAIT flag into the bo_flag element of bufobj and call it BO_WWAIT. Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write count on a bufobj. Bufobj_wdrop() replaces vwakeup(). Use these functions in all relevant places except in ffs_softdep.c where the use of interlocked_sleep() makes this impossible. Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
# 135705	24-Sep-2004	phk	use dev_re[fl]thread() rather than home rolled versions.
# 135617	23-Sep-2004	phk	Eliminate DEV_STRATEGY() macro: call dev_strategy() directly. Make dev_strategy() handle errors and departing devices properly.
# 135600	23-Sep-2004	phk	Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount of the number of threads which are inside whatever is behind the cdevsw for this particular cdev. Make the device mutex visible through dev_lock() and dev_unlock(). We may want finer granularity later. Replace spechash_mtx use with dev_lock()/dev_unlock().
# 135280	15-Sep-2004	phk	Remove unused B_WRITEINPROG flag
# 135277	15-Sep-2004	phk	undent some functions a bit.
# 135276	15-Sep-2004	phk	stylistic polishing.
# 135135	13-Sep-2004	phk	Remove the buffercache/vnode side of BIO_DELETE processing in preparation for integration of p4::phk_bufwork. In the future, local filesystems will talk to GEOM directly and they will consequently be able to issue BIO_DELETE directly. Since the removal of the fla driver, BIO_DELETE has effectively been a no-op anyway.
# 132640	25-Jul-2004	phk	Eliminate unused second argument to reassignbuf() and simplify it accordingly.
# 132628	25-Jul-2004	phk	Neuter this warning for now, I think I know the remaining issues.
# 132337	18-Jul-2004	alc	Remove GIANT_REQUIRED from vmapbuf().
# 131729	06-Jul-2004	peadar	Fix bug introduced in rev 1.434: When avoiding the zeroing of "bogus_page" when it appears in a buf, be sure to advance the pointers into the data for successive pages. The bug caused file corruption when read(2)ing from a "hole" in a file where a previous page of the read block had already been faulted in: fsx tripped up on this pretty quickly. The particular access pattern is probably pretty unusual, so other applications probably wouldn't have had problems, but you'd never know. Reviewed By: alc@
# 131590	04-Jul-2004	phk	Make the last commit handle non-phk root devices better.
# 131575	04-Jul-2004	stefanf	Consistently use __inline instead of __inline__ as the former is an empty macro in <sys/cdefs.h> for compilers without support for inline.
# 131565	04-Jul-2004	phk	Blocksize for I/O should be a property of the vnode and not found by groping around in the vnode's surroundings when we allocate a block. Assign a blocksize when we create a vnode, and yell a warning (and ignore it) if we got the wrong size. Please email all such warnings to me.
# 131533	03-Jul-2004	phk	Remove stale comment
# 130640	17-Jun-2004	phk	Second half of the dev_t cleanup. The big lines are: NODEV -> NULL NOUDEV -> NODEV udev_t -> dev_t udev2dev() -> findcdev() Various minor adjustments including handling of userland access to kernel space struct cdev etc.
# 130585	16-Jun-2004	phk	Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
# 129048	08-May-2004	alc	Avoid pointless zeroing of the bogus page in vfs_bio_clrbuf(). Suggested by: tegge@ (from October of last year)
# 128992	06-May-2004	alc	Make vm_page's PG_ZERO flag immutable between the time of the page's allocation and deallocation. This flag's principal use is shortly after allocation. For such cases, clearing the flag is pointless. The only unusual use of PG_ZERO is in vfs_bio_clrbuf(). However, allocbuf() never requests a prezeroed page. So, vfs_bio_clrbuf() never sees a prezeroed page. Reviewed by: tegge@
# 126872	12-Mar-2004	des	Replace a manual check of a VMIO candidate with vn_canvmio(). This silences an annoying warning in getblk() when VMIO'ing on a directory vnode, which can happen when vfs.vmiodirenable is 1. Bring the warning message in line with reality at the same time. Submitted by: hmp
# 126858	11-Mar-2004	phk	When I was a kid my work table was one cluttered mess and cleaning it up was a rather overwhelming task. I soon learned that if you don't know where you're going to store something, at least try to pile it next to something slightly related in the hope that a pattern emerges. Apply the same principle to the ffs/snapshot/softupdates code which has leaked into specfs: Add yet another buf-quasi-method and call it from the only two places I can see it can make a difference and implement the magic in ffs_softdep.c where it belongs. It's not pretty, but at least it's one less layer violated.
# 126853	11-Mar-2004	phk	Properly vector all bwrite() and BUF_WRITE() calls through the same path and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
# 126704	06-Mar-2004	alc	Remove GIANT_REQUIRED from vunmapbuf().
# 126082	21-Feb-2004	phk	Device megapatch 6/6: This is what we came here for: Hang dev_t's from their cdevsw, refcount cdevsw and dev_t and generally keep track of things a lot better than we used to: Hold a cdevsw reference around all entrances into the device driver, this will be necessary to safely determine when we can unload driver code. Hold a dev_t reference while the device is open. KASSERT that we do not enter the driver on a non-referenced dev_t. Remove old D_NAG code, anonymous dev_t's are not a problem now. When destroy_dev() is called on a referenced dev_t, move it to dead_cdevsw's list. When the refcount drops, free it. Check that cdevsw->d_version is correct. If not, set all methods to the dead_*() methods to prevent entrance into driver. Print warning on console to this effect. The device driver may still explode if it is also incompatible with newbus, but in that case we probably didn't get this far in the first place.
# 125558	07-Feb-2004	alc	swp_pager_async_iodone() no longer requires Giant. Modify bufdone() and swapgeom_done() to perform swp_pager_async_iodone() without Giant. Reviewed by: tegge
# 123691	20-Dec-2003	alc	Remove a variable that has been initialized but otherwise unused since revision 1.315.
# 122747	15-Nov-2003	phk	Send B_PHYS out to pasture, it no longer serves any function.
# 122746	15-Nov-2003	alc	- Remove the remaining now unnecessary checks for the buf's b_object being NULL. See revision 1.421 for more detail. - Remove GIANT_REQUIRED from vfs_unbusy_pages(). Discussed with: jeff
# 122553	12-Nov-2003	phk	Replace B_PHYS conditional assignment to bio_offset with KASSERT check to see that the originating code already did it right.
# 122537	12-Nov-2003	mckusick	Update the statfs structure with 64-bit fields to allow accurate reporting of multi-terabyte filesystem sizes. You should build and boot a new kernel BEFORE doing a `make world' as the new kernel will know about binaries using the old statfs structure, but an old kernel will not know about the new system calls that support the new statfs structure. Running an old kernel after a `make world' will cause programs such as `df' that do a statfs system call to fail with a bad system call. Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Tim Robbins <tjr@freebsd.org> Reviewed by: Julian Elischer <julian@elischer.org> Reviewed by: the hoards of <arch@freebsd.org> Sponsored by: DARPA & NAI Labs.
# 122455	11-Nov-2003	alc	- Revision 1.469 of vfs_subr.c resulted in the buf's b_object field being consistently initialized. Consequently, a number of conditionals that checked the validity of b_object before passing it to VM_OBJECT_LOCK() and VM_OBJECT_UNLOCK() are no longer needed.
# 122031	04-Nov-2003	mckusick	Allow the bufdaemon and update daemon processes to skip the waitrunningbufspace() calls so that they are always able to proceed and clean up buffer space. Submitted by: Brian Fundakowski Feldman <green@freebsd.org>
# 121443	23-Oct-2003	jhb	Move the P_COWINPROGRESS flag from being a per-process p_flag to being a per-thread td_pflag which doesn't require any locks to read or write as it is only read or written by curthread on itself. Glanced at by: mckusick
# 121296	21-Oct-2003	phk	Remove KASSERTS on B_PHYS for vmapbuf() and vunmapbuf(), B_PHYS is going away.
# 121255	19-Oct-2003	alc	- Add vm object locking to vfs_clean_pages() and vfs_bio_set_validclean(). This is to synchronize access to the vm page's valid field by vm_page_set_validclean().
# 121224	18-Oct-2003	phk	Initialize b_iooffset before calling VOP_[SPEC]STRATEGY
# 121218	18-Oct-2003	phk	Don't report b_pblkno, it is going away.
# 121190	18-Oct-2003	phk	Convert some if(bla) panic("foo") to KASSERTS to improve grep-ability.
# 121188	18-Oct-2003	phk	The size and contents of the DEV_STRATEGY() macro has progressed to the point where it being a macro is no longer sensible, and it will only be more so in days to come. BIO_STRATEGY() is now only used from DEV_STRATEGY() and should not be used directly anymore. Put the contents of both in the new function dev_strategy() and make DEV_STRATEGY() call that function. In addition, this allows us to make the rather magic bufdonebio() helper function static. This also saves hundred-and-some bytes of code in a typical kernel.
# 121075	13-Oct-2003	jeff	- Add a missing vn_finished_write() Pointy hat: jeff Found by: robert Obtained from: kirk
# 121045	12-Oct-2003	alc	In vfs_bio_clrbuf(), ignore the state of the object lock if the page is the "bogus" page. Found by: tegge
# 120963	10-Oct-2003	alc	- Synchronize access to a page's valid field in vfs_bio_clrbuf() by using the lock from its containing object. - Remove GIANT_REQUIRED from vm_hold_load_pages().
# 120823	05-Oct-2003	jeff	- Add a missing vn_start_write() to flushbufqueues(). This could have caused snapshot related problems. - The vp can not be NULL here or we would panic in vfs_bio_awrite(). Stop confusing the logic by checking for it in several places. Submitted by: kirk and then rototilled by me to remove vp == NULL checks.
# 120769	04-Oct-2003	alc	Eliminate some unnecessary uses of the vm page queues lock around the vm page's valid field. This field is being synchronized using the containing vm object's lock.
# 120762	04-Oct-2003	alc	- Extend the scope of the vm object lock to cover calls to vm_page_is_valid(). - Assert that the lock on the containing vm object is held in vm_page_is_valid().
# 120328	22-Sep-2003	alc	- vm_hold_free_pages() should lock the kernel object. (The pages being freed belong to the kernel object.) - Increase the granularity of the vm object locking in vm_hold_load_pages() in order to reduce the number of times that we acquire and release the same lock.
# 120080	15-Sep-2003	alc	Correct a typo in the previous revision.
# 120020	13-Sep-2003	alc	Convert vmapbuf() from using pmap_extract() to using pmap_extract_and_hold(). Note, however, that GIANT_REQUIRED should not be removed until all platforms fully implement the "prot" parameter to pmap_extract_and_hold(). Reviewed by: tegge
# 119603	31-Aug-2003	jeff	- Define a new flag for getblk(): GB_NOCREAT. This flag causes getblk() to bail out if the buffer is not already present. - The buffer returned by incore() is not locked and should not be sent to brelse(). Use getblk() with the new GB_NOCREAT flag to preserve the desired semantics.
# 119599	30-Aug-2003	jeff	- If there is no vp assume that BKGRDINPROG is not set and set RELPBUF in brelse().
# 119597	30-Aug-2003	jeff	- In some cases bp->b_vp can be NULL in brelse, don't try to lock the interlock in that case. Found by: alc
# 119536	28-Aug-2003	marcel	In bufdone(), change the format specifier for m->valid and m->dirty to a long type and explicitly cast m->valid and m->dirty to unsigned long. When PAGE_SIZE is 32K, these fields are in fact unsigned long.
# 119528	28-Aug-2003	kan	Do not return with vnode interlock held. Reviewed by: rwatson
# 119521	28-Aug-2003	jeff	- Move BX_BKGRDWAIT and BX_BKGRDINPROG to BV_ and the b_vflags field. - Surround all accesses of the BKGRD{WAIT,INPROG} flags with the vnode interlock. - Don't use the B_LOCKED flag and QUEUE_LOCKED for background write buffers. Check for the BKGRDINPROG flag before recycling or throwing away a buffer. We do this instead because it is not safe for us to move the original buffer to a new queue from the callback on the background write buffer. - Remove the B_LOCKED flag and the locked buffer queue. They are no longer used. - The vnode interlock is used around checks for BKGRDINPROG where it may not be strictly necessary. If we hold the buf lock then a background write will not be started without our knowledge, one may only be completed while we're not looking. Rather than remove the code, document two of the places where this extra locking is done. A pass should be done to verify and minimize the locking later.
# 119370	23-Aug-2003	alc	Hold the page queues lock when performing vm_page_clear_dirty() and vm_page_set_invalid().
# 118354	02-Aug-2003	phk	Grab Giant in bufdonebio() since drivers may not hold it. This only protects the "struct buf" consumers (ie: DEV_STRATEGY()), but does not protect BIO_STRATEGY() users.
# 118337	02-Aug-2003	alc	Eliminate an abuse of kmem_alloc_pageable() in bufinit() by using VM_ALLOC_NOOBJ to allocate the bogus page. Reviewed by: tegge
# 116604	20-Jun-2003	phk	Initialize b_saveaddr when we hand out buffers
# 116203	11-Jun-2003	alc	Lock the vm object when removing a page.
# 116182	10-Jun-2003	obrien	Use __FBSDID().
# 115456	31-May-2003	phk	The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent deadlocks with vnode backed md(4) devices because md now uses a kthread to run the bio requests instead of doing it directly from the bio down path.
# 114147	28-Apr-2003	alc	Finish the vm_object locking for this file, including holding the vm_object lock when accessing the vm_object's flags or calling vm_page_lookup().
# 114060	26-Apr-2003	alc	- Lock the vm_object when performing vm_page_alloc() in allocbuf().
# 113725	19-Apr-2003	alc	Lock the vm_object in vfs_busy_pages().
# 113722	19-Apr-2003	alc	- Lock the vm_object when performing vm_object_pip_subtract(). - Assert that the vm_object lock is held in vm_object_pip_subtract().
# 113721	19-Apr-2003	alc	- Lock the vm_object when performing vm_object_pip_wakeupn(). - Assert that the vm_object lock is held in vm_object_pip_wakeupn(). - Add a new macro VM_OBJECT_LOCK_ASSERT().
# 113458	13-Apr-2003	alc	Update locking on the kernel_object to use the new macros.
# 113152	05-Apr-2003	alc	Remove an unnecessary trunc_page() from vmapbuf(). Reviewed by: tegge
# 113046	04-Apr-2003	alc	o Check the b_bufsize passed to vmapbuf() returning an error if it is invalid. o Remove a debugging printf() from vmapbuf(). Suggested by: tegge
# 112846	30-Mar-2003	phk	Preparation commit before I start on the bioqueue lockdown: Collect all the bits of bioqueue handing in subr_disk.c, vfs_bio.c is big enough as it is and disksort already lives in subr_disk.c.
# 112694	26-Mar-2003	tegge	Add support for reading directly from file to userland buffer when the O_DIRECT descriptor status flag is set and both offset and length are multiples of the physical media sector size.
# 112569	24-Mar-2003	jake	- Add vm_paddr_t, a physical address type. This is required for systems where physical addresses are larger than virtual addresses, such as i386s with PAE. - Use this to represent physical addresses in the MI vm system and in the i386 pmap code. This also changes the paddr parameter to d_mmap_t. - Fix printf formats to handle physical addresses >4G in the i386 memory detection code, and due to kvtop returning vm_paddr_t instead of u_long. Note that this is a name change only; vm_paddr_t is still the same as vm_offset_t on all currently supported platforms. Sponsored by: DARPA, Network Associates Laboratories Discussed with: re, phk (cdevsw change)
# 112367	18-Mar-2003	phk	Including <sys/stdint.h> is (almost?) universally only to be able to use %j in printfs, so put a nested include in <sys/systm.h> where the printf prototype lives and save everybody else the trouble.
# 112183	13-Mar-2003	jeff	- Add a lock for protecting against msleep(bp, ...) wakeup(bp) races. - Create a new function bdone() which sets B_DONE and calls wakeup(bp). This is suitable for use as b_iodone for buf consumers who are not going through the buf cache. - Create a new function bwait() which waits for the buf to be done at a set priority and with a specific wmesg. - Replace several cases where the above functionality was implemented without locking with the new functions.
# 112181	13-Mar-2003	jeff	- Remove a race between fsync like functions and flushbufqueues() by requiring locked bufs in vfs_bio_awrite(). Previously the buf could have been written out by fsync before we acquired the buf lock if it weren't for giant. The cluster_wbuild() handles this race properly but the single write at the end of vfs_bio_awrite() would not. - Modify flushbufqueues() so there is only one copy of the loop. Pass a parameter in that says whether or not we should sync bufs with deps. - Call flushbufqueues() a second time and then break if we couldn't find any bufs without deps.
# 111856	03-Mar-2003	jeff	- Add a new 'flags' parameter to getblk(). - Define one flag GB_LOCK_NOWAIT that tells getblk() to pass the LK_NOWAIT flag to the initial BUF_LOCK(). This will eventually be used in cases where we want to use a buffer only if it is not currently in use. - Convert all consumers of the getblk() api to use this extra parameter. Reviewed by: arch Not objected to by: mckusick
# 111723	02-Mar-2003	jeff	- Hold the vnode interlock across calls to bgetvp instead of acquiring it internally. This is required to stop multiple bufs from being associated with a single lblkno.
# 111694	01-Mar-2003	jeff	- gc USE_BUFHASH. The smp locking of the buf cache renders this useless.
# 111511	25-Feb-2003	mckusick	When doing cleanup of excessive buffers in bdwrite (see kern/vfs_bio.c delta 1.371) we must ensure that we do not get ourselves into a recursive trap endlessly trying to clean up after ourselves. Reported by: Attila Nagy <bra@fsn.hu> Sponsored by: DARPA & NAI Labs.
# 111474	25-Feb-2003	jeff	- Add the missing NULL interlock argument to a recently added BUF_LOCK.
# 111466	25-Feb-2003	mckusick	Prevent large files from monopolizing the system buffers. Keep track of the number of dirty buffers held by a vnode. When a bdwrite is done on a buffer, check the existing number of dirty buffers associated with its vnode. If the number rises above vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one of the other (hopefully older) dirty buffers associated with the vnode is written (using bawrite). In the event that this approach fails to curb the growth in the vnode's number of dirty buffers (due to soft updates rollback dependencies), the more drastic approach of doing a VOP_FSYNC on the vnode is used. This code primarily affects very large and actively written files such as snapshots. This change should eliminate hanging when taking snapshots or doing background fsck on very large filesystems. Hopefully, one day it will be possible to cache filesystem metadata in the VM cache as is done with file data. As it stands, only the buffer cache can be used which limits total metadata storage to about 20Mb no matter how much memory is available on the system. This rather small memory gets badly thrashed causing a lot of extra I/O. For example, taking a snapshot of a 1Tb filesystem minimally requires about 35,000 write operations, but because of the cache thrashing (we only have about 350 buffers at our disposal) ends up doing about 237,540 I/O's thus taking twenty-five minutes instead of four if it could run entirely in the cache. Reported by: Attila Nagy <bra@fsn.hu> Sponsored by: DARPA & NAI Labs.
# 111463	25-Feb-2003	jeff	- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK. - Remove the buftimelock mutex and acquire the buf's interlock to protect these fields instead. - Hold the vnode interlock while locking bufs on the clean/dirty queues. This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another BUF_LOCK with a LK_TIMEFAIL to a single lock. Reviewed by: arch, mckusick
# 111119	19-Feb-2003	imp	Back out M_* changes, per decision of the TRB. Approved by: trb
# 110987	16-Feb-2003	jeff	- Introduce a new function bremfreel() that does a bremfree with the buf queue lock already held. - In getblk() and flushbufqueues() use bremfreel() while we still have the buf queue lock held to keep the lists consistent. - Add LK_NOWAIT to two cases where we're essentially asserting that the bufs are not locked while acquiring the locks. This will make sure that we get the appropriate panic() and not another one for sleeping with a lock held.
# 110658	10-Feb-2003	jeff	- Add a comment about a race that will happen without Giant.
# 110657	10-Feb-2003	jeff	- Unlock the nblock after the loop in bwillwrite().
# 110625	10-Feb-2003	jeff	- In getnewbuf() unlock the bq lock prior to sleeping when we're out of buffers. Submitted by: tegge
# 110602	09-Feb-2003	jeff	- Correct another atomic op. Spotted by: alc
# 110586	09-Feb-2003	jeff	- Move some code out from #ifdef INVARIANTS.
# 110584	09-Feb-2003	jeff	- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member that is protected by the vnode lock. - Move B_SCANNED into b_vflags and call it BV_SCANNED. - Create a vop_stdfsync() modeled after spec's sync. - Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some fs specific processing. This gives all of these filesystems proper behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag. - Annotate the locking in buf.h
# 110583	09-Feb-2003	jeff	- spell add 'add' and not 'subtract' in an atomic op. Spotted by: alc Pointy hat to: jeff
# 110581	09-Feb-2003	jeff	- Lock down the buffer cache's infrastructure code. This includes locks on buf lists, synchronization variables, and atomic ops for the counters. This change does not remove giant from any code although some pushdown may be possible. - In vfs_bio_awrite() don't access buf fields without the buf lock.
# 109623	21-Jan-2003	alfred	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 109572	20-Jan-2003	dillon	Close the remaining user address mapping races for physical I/O, CAM, and AIO. Still TODO: streamline useracc() checks. Reviewed by: alc, tegge MFC after: 7 days
# 109554	20-Jan-2003	alc	- Hold the page queues lock around vm_page_hold(). - Assert that the page queues lock rather than Giant is held in vm_page_hold().
# 109373	16-Jan-2003	alc	Fix two long-standing, but likely harmless, errors in the use of vm_pageout_deficit: 1. Update vm_pageout_deficit before VM_WAIT. There is no sense in delaying the update; the sooner the pageout daemon receives this information the better. Reviewed by: tegge 2. Update vm_pageout_deficit according to the number of pages still needed to complete the allocation, not the original size of the allocation. Submitted by: tegge (These errors have existed since the introduction of vm_pageout_deficit in revision 1.144.)
# 109340	15-Jan-2003	dillon	Merge all the various copies of vmapbuf() and vunmapbuf() into a single portable copy. Note that pmap_extract() must be used instead of pmap_kextract(). This is precursor work to a reorganization of vmapbuf() to close remaining user/kernel races (which can lead to a panic).
# 109223	14-Jan-2003	alc	- Update vm_pageout_deficit using atomic operations. It's a simple counter outside the scope of existing locks. - Eliminate a redundant clearing of vm_pageout_deficit.
# 109128	12-Jan-2003	alc	vm_hold_load_pages() needn't clear PG_ZERO because it didn't pass VM_ALLOC_ZERO to vm_page_alloc(). (PG_ZERO is clear by default.)
# 108895	07-Jan-2003	alc	Make bogus_offset local to bufinit().
# 108735	05-Jan-2003	phk	Fix cut&paste bug which would result in a panic because buffer was being biodone'ed multiple times.
# 108715	05-Jan-2003	alc	Allocate bogus_page with VM_ALLOC_WIRED. (Previously, bogus_page's allocation incremented the global count of wired pages, but not the page's own wire count. This inconsistency was introduced in revision 1.230.)
# 108686	04-Jan-2003	phk	Temporarily introduce a new VOP_SPECSTRATEGY operation while I try to sort out disk-io from file-io in the vm/buffer/filesystem space. The intent is to sort VOP_STRATEGY calls into those which operate on "real" vnodes and those which operate on VCHR vnodes. For the latter kind, the call will be changed to VOP_SPECSTRATEGY, possibly conditionally for those places where dual-use happens. Add a default VOP_SPECSTRATEGY method which will call the normal VOP_STRATEGY. First time it is called it will print debugging information. This will only happen if a normal vnode is passed to VOP_SPECSTRATEGY by mistake. Add a real VOP_SPECSTRATEGY in specfs, which does what VOP_STRATEGY does on a VCHR vnode today. Add a new VOP_STRATEGY method in specfs to catch instances where the conversion to VOP_SPECSTRATEGY has not yet happened. Handle the request just like we always did, but first time called print debugging information. Apart from up to two instances of console messages per boot, this amounts to a glorified no-op commit. If you get any of the messages on your console I would very much like a copy of them mailed to phk@freebsd.org
# 108651	04-Jan-2003	phk	Don't call VOP_BMAP on VCHR vnodes when the logical and physical block numbers are identical: it cannot even hope to accomplish anything.
# 108589	03-Jan-2003	phk	Convert calls to BUF_STRATEGY to VOP_STRATEGY calls. This is a no-op since all BUF_STRATEGY did in the first place was call VOP_STRATEGY.
# 108533	01-Jan-2003	schweikh	Correct typos, mostly s/ a / an / where appropriate. Some whitespace cleanup, especially in troff files.
# 108307	27-Dec-2002	alc	Hold the page queues lock when calling vm_page_flag_clear().
# 108232	23-Dec-2002	alc	- Hold the kernel_object's lock around vm_page_alloc(kernel_object,...). - Hold the page queues lock around vm_page_wakeup().
# 107847	13-Dec-2002	mckusick	The buffer daemon cannot skip over buffers owned by locked inodes as they may be the only viable ones to flush. Thus it will now wait for an inode lock if the other alternatives will result in rollbacks (and immediate redirtying of the buffer). If only buffers with rollbacks are available, one will be flushed, but then the buffer daemon will wait briefly before proceeding. Failing to wait briefly effectively deadlocks a uniprocessor since every other process writing to that filesystem will wait for the buffer daemon to clean up which takes close enough to forever to feel like a deadlock. Reported by: Archie Cobbs <archie@dellroad.org> Sponsored by: DARPA & NAI Labs. Approved by: re
# 107189	23-Nov-2002	alc	Hold the page queues/flags lock when calling vm_page_set_validclean(). Approved by: re
# 106981	16-Nov-2002	alc	Now that pmap_remove_all() is exported by our pmap implementations use it directly.
# 106720	10-Nov-2002	alc	When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than indirectly through vm_page_protect(). The one remaining page flag that is updated by vm_page_protect() is already being updated by our various pmap implementations. Note: A later commit will similarly change the VM_PROT_READ case and eliminate vm_page_protect().
# 105369	17-Oct-2002	mckusick	When the number of dirty buffers rises too high, the buf_daemon runs to help clean up. After selecting a potential buffer to write, this patch has it acquire a lock on the vnode that owns the buffer before trying to write it. The vnode lock is necessary to avoid a race with some other process holding the vnode locked and trying to flush its dirty buffers. In particular, if the vnode in question is a snapshot file, then the race can lead to a deadlock. To avoid slowing down the buf_daemon, it does a non-blocking lock request when trying to lock the vnode. If it fails to get the lock it skips over the buffer and continues down its queue looking for buffers to flush. Sponsored by: DARPA & NAI Labs.
# 104100	28-Sep-2002	phk	Remove unused includes. Clarify the intention of a while(); Move a local variable to avoid potential name-confusion.
# 104094	28-Sep-2002	phk	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
# 104088	28-Sep-2002	phk	Correctly order VI_UNLOCK(), local variables and block comment.
# 104008	26-Sep-2002	phk	Make biowait() check bio_error before the BIO_ERROR flag, to properly catch internal GEOM use of bio_error. Sponsored by: DARPA & NAI Labs.
# 103930	25-Sep-2002	jeff	- Lock accesses to v_numoutput. - Lock calls to gbincore.
# 103353	15-Sep-2002	phk	s/Danglish/English/ Some style issues. Change the timeout to be hz/10 instead of hz. Brucification by: bde.
# 103330	14-Sep-2002	phk	Un-inline the non-trivial "trivial" bio* functions. Untangle devstat_end_transaction_bio()
# 103314	14-Sep-2002	njl	Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)
# 103281	13-Sep-2002	phk	Oops, broke the build there. Uninline biodone() now that it is non-trivial. Introduce biowait() function. Currently there is a race condition and the mitigation is a timeout/retry. It is not obvious what kind of locking (if any) is suitable for BIO_DONE, since the majority of users take care of this themselves, and only a few places actually rely on the wakeup. Sponsored by: DARPA & NAI Labs.
# 102600	30-Aug-2002	peter	Change hw.physmem and hw.usermem to unsigned long like they used to be in the original hardwired sysctl implementation. The buf size calculator still overflows an integer on machines with large KVA (eg: ia64) where the number of pages does not fit into an int. Use 'long' there. Change Maxmem and physmem and related variables to 'long', mostly for completeness. Machines are not likely to overflow 'int' pages in the near term, but then again, 640K ought to be enough for anybody. This comes for free on 32 bit machines, so why not?
# 102412	25-Aug-2002	charnier	Replace various spelling with FALLTHROUGH which is lint()able
# 101308	04-Aug-2002	jeff	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
# 101279	03-Aug-2002	alc	o Convert two instances of vm_page_sleep_busy() to vm_page_sleep_if_busy() with appropriate page queue locking.
# 101174	01-Aug-2002	alc	o Acquire the page queues lock before calling vm_page_io_finish(). o Assert that the page queues lock is held in vm_page_io_finish().
# 100972	30-Jul-2002	alc	o Replace vm_page_sleep_busy() with vm_page_sleep_if_busy() in vfs_busy_pages().
# 100378	19-Jul-2002	alc	o Use vm_page_alloc(... \| VM_ALLOC_WIRED) in place of vm_page_wire().
# 100344	19-Jul-2002	mckusick	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags \|= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, let's call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended attribute storage. Sponsored by: DARPA & NAI Labs.
# 99986	14-Jul-2002	alc	o Lock page queue accesses by vm_page_wire().
# 99926	13-Jul-2002	alc	o Lock some page queue accesses, in particular, those by vm_page_unwire().
# 99737	10-Jul-2002	dillon	Replace the global buffer hash table with per-vnode splay trees using a methodology similar to the vm_map_entry splay and the VM splay that Alan Cox is working on. Extensive testing has appeared to have shown no increase in overhead. Disadvantages: Dirties more cache lines during lookups. Not as fast as a hash table lookup (but still N log N and optimal when there is locality of reference). Advantages: vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem syncer operate more efficiently. I get to rip out all the old hacks (some of which were mine) that tried to keep the v_dirtyblkhd tailq sorted. The per-vnode splay tree should be easier to lock / SMPng pushdown on vnodes will be easier. This commit along with another that Alan is working on for the VM page global hash table will allow me to implement ranged fsync(), optimize server-side nfs commit rpcs, and implement partial syncs by the filesystem syncer (aka filesystem syncer would detect that someone is trying to get the vnode lock, remembers its place, and skip to the next vnode). Note that the buffer cache splay is somewhat more complex than other splays due to special handling of background bitmap writes (multiple buffers with the same lblkno in the same vnode), and B_INVAL discontinuities between the old hash table and the existence of the buffer on the v_cleanblkhd list. Suggested by: alc
# 99589	08-Jul-2002	bde	Fixed some printf format errors (one new one reported by gcc and 3 nearby old ones not reported by gcc). This helps unbreak LINT.
# 99512	07-Jul-2002	jeff	Add two asserts that prove & document getblk and geteblk's behavior of returning locked bufs.
# 99508	06-Jul-2002	jeff	Fix a mistake in my last commit. Don't grab an extra reference to the object in bp->b_object.
# 99489	06-Jul-2002	jeff	Fixup uses of GETVOBJECT. - Cache a pointer to the vnode's object in the buf. - Hold a reference to that object in addition to the vnode's reference just to be consistent. - Cleanup code that got the object indirectly through the vp and VOP calls. This fixes at least one case where we were calling GETVOBJECT without a lock. It also avoids an expensive layered call at the cost of another pointer in struct buf.
# 98690	23-Jun-2002	mux	More 64 bits platforms warning fixes. Reviewed by: rwatson
# 98631	22-Jun-2002	dillon	Fix a bug in vfs_bio_clrbuf(). The single-page-clrbuf optimization was improperly clearing more than just the invalid portions of the page. (This bug is not known to have been triggered by anything). Submitted by: tegge MFC after: 7 days
# 98542	21-Jun-2002	mckusick	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>
# 97919	06-Jun-2002	phk	Use "bwrbg" as description when we sleep for background writing, "biord" was misleading in every possible way.
# 96037	04-May-2002	phk	Remove a six year old undocumented #ifdef : NO_B_MALLOC.
# 93818	04-Apr-2002	jhb	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
# 93707	02-Apr-2002	dillon	brelse() was improperly clearing B_DELWRI in the B_DELWRI\|B_INVAL case without removing the buffer from the vnode's dirty buffer list, which can result in a panic in NFS. Replaced the code with a call to bundirty() which deals with it properly. PR: kern/36108, kern/36174 Submitted by: various people Special mention: to Danny Schales <dan@coes.LaTech.edu> for providing a core dump that helped me track this down. MFC after: 1 day
# 92723	19-Mar-2002	alfred	Remove __P.
# 92640	19-Mar-2002	bde	Fixed some printf format errors (hopefully all of the remaining daddr64_t ones for GENERIC, and all others on the same line as those). Reformat the printfs if necessary to avoid new long lines or old format printf errors.
# 92461	16-Mar-2002	jake	Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/ pmap_qremove. pmap_kenter is not safe to use in MI code because it is not guaranteed to flush the mapping from the tlb on all cpus. If the process in question is preempted and migrates cpus between the call to pmap_kenter and pmap_kremove, the original cpu will be left with stale mappings in its tlb. This is currently not a problem for i386 because we do not use PG_G on SMP, and thus all mappings are flushed from the tlb on context switches, not just user mappings. This is not the case on all architectures, and if PG_G is to be used with SMP on i386 it will be a problem. This was committed by peter earlier as part of his fine grained tlb shootdown work for i386, which was backed out for other reasons. Reviewed by: peter
# 92363	15-Mar-2002	mckusick	Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.
# 92310	15-Mar-2002	alfred	Giant pushdown for read/write/pread/pwrite syscalls. kern/kern_descrip.c: Acquire Giant in fdrop_locked when file refcount hits zero, this removes the requirement for the caller to own Giant for the most part. kern/kern_ktrace.c: Acquire Giant in ktrgenio, simplifies locking in upper read/write syscalls. kern/vfs_bio.c: Acquire Giant in bwillwrite if needed. kern/sys_generic.c Giant pushdown, remove Giant for: read, pread, write and pwrite. readv and writev aren't done yet because of the possible malloc calls for iov to uio processing. kern/sys_socket.c Grab giant in the socket fo_read/write functions. kern/vfs_vnops.c Grab giant in the vnode fo_read/write functions.
# 91700	05-Mar-2002	eivind	* Move bswlist declaration and initialization from kern/vfs_bio.c to vm/vm_pager.c, which is the only place it is used. * Make the QUEUE_* definitions and bufqueues local to vfs_bio.c. * constify buf_wmesg.
# 91690	05-Mar-2002	eivind	Document all functions, global and static variables, and sysctls. Includes some minor whitespace changes, and re-ordering to be able to document properly (e.g, grouping of variables and the SYSCTL macro calls for them, where the documentation has been added.) Reviewed by: phk (but all errors are mine)
# 91367	27-Feb-2002	peter	Back out all the pmap related stuff I've touched over the last few days. There is some unresolved badness that has been eluding me, particularly affecting uniprocessor kernels. Turning off PG_G helped (which is a bad sign) but didn't solve it entirely. Userland programs still crashed.
# 91344	27-Feb-2002	peter	Jake further reduced IPI shootdowns on sparc64 in loops by using ranged shootdowns in a couple of key places. Do the same for i386. This also hides some physical addresses from higher levels and has it use the generic vm_page_t's instead. This will help for PAE down the road. Obtained from: jake (MI code, suggestions for MD part)
# 91063	22-Feb-2002	phk	GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.
# 91060	22-Feb-2002	phk	Replace bowrite() with BUF_WRITE in ufs. Remove bowrite(), it is now unused. This is the first step in getting entirely rid of BIO_ORDERED which is a generally accepted evil thing. Approved by: mckusick
# 90033	31-Jan-2002	dillon	GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer cache lockups for over a year now. MFC after: 0 days
# 87834	13-Dec-2001	dillon	This fixes a large number of bugs in our NFS client side code. A recent commit by Kirk also fixed a softupdates bug that could easily be triggered by server side NFS. * An edge case with shared R+W mmap()'s and truncate whereby the system would inappropriately clear the dirty bits on still-dirty data. (applicable to all filesystems) THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING. see vm/vm_page.c line 1641 * The straddle case for VM pages and buffer cache buffers when truncating. (applicable to NFS client side) * Possible SMP database corruption due to vm_pager_unmap_page() not clearing the TLB for the other cpu's. (applicable to NFS client side but could affect all filesystems). Note: not considered serious since the corruption occurs beyond the file EOF. * When flushing a dirty buffer due to B_CACHE getting cleared, we were accidentally setting B_CACHE again (that is, bwrite() sets B_CACHE), when we really want it to stay clear after the write is complete. This resulted in a corrupt buffer. (applicable to all filesystems but probably only triggered by NFS) * We have to call vtruncbuf() when ftruncate()ing to remove any buffer cache buffers. This is still tentative, I may be able to remove it due to the second bug fix. (applicable to NFS client side) * vnode_pager_setsize() race against nfs_vinvalbuf()... we have to set n_size before calling nfs_vinvalbuf or the NFS code may recursively vnode_pager_setsize() to the original value before the truncate. This is what was causing the user mmap bus faults in the nfs tester program. (applicable to NFS client side) * Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made by Kirk). Testing program written by: Avadis Tevanian, Jr. Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step') MFC after: 1 week
# 87535	08-Dec-2001	dillon	The nbuf calculation was assuming that PAGE_SIZE = 4096 bytes, which is bogus. The calculation has been adjusted to use units of kilobytes. Noticed by: Chad David <davidc@acns.ab.ca> MFC after: 1 week
# 86194	08-Nov-2001	dillon	Placemark an interrupt race in -current which is currently protected by Giant. -stable will get spl*() fixes for the race. Reported by: Rob Anderson <rob@isilon.com> MFC after: 0 days
# 86089	05-Nov-2001	dillon	Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking in wdrain during a write. This flag needs to be used in devices whose strategy routines turn-around and issue another high level I/O, such as when MD turns around and issues a VOP_WRITE to vnode backing store, in order to avoid deadlocking the dirty buffer draining code. Remove a vprintf() warning from MD when the backing vnode is found to be in-use. The syncer of buf_daemon could be flushing the backing vnode at the time of an MD operation so the warning is not correct. MFC after: 1 week
# 85274	21-Oct-2001	dillon	Documentation MFC after: 1 day
# 84827	11-Oct-2001	jhb	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.
# 83966	26-Sep-2001	dillon	Enable vmiodirenable by default. Remove incorrect comment from sysctl.conf. MFC after: 1 week
# 83366	12-Sep-2001	julian	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 82144	22-Aug-2001	dillon	Remove the code that limited the buffer_map to 1/2 the size of the kernel_map. maxbcache takes care of this now and the 1/2 limit can interfere with testing. Suggested by: bde
# 82127	22-Aug-2001	dillon	Move most of the kernel submap initialization code, including the timeout callwheel and buffer cache, out of the platform specific areas and into the machine independent area. i386 and alpha adjusted here. Other cpus can be fixed piecemeal. Reviewed by: freebsd-smp, jake
# 80448	27-Jul-2001	peter	Revert previous accidental commit. FWIW, it was part of enabling VM caching of disks through mmap() and stopping syncing of open files that had their last reference in the fs removed (ie: their unsync'ed pages get discarded on close already, so I made it stop syncing too).
# 80447	27-Jul-2001	peter	Fix cut/paste blunder. Serves me right for doing a last minute tweak to what I had for some time. Submitted by: bde
# 79224	04-Jul-2001	dillon	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
# 77115	24-May-2001	dillon	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon
# 77085	23-May-2001	jhb	- Always call bfreekva() w/o vm_mtx held. - Always call vfs_setdirty() with vm_mtx held. - Fix an old comment: vm_hold_unload_pages is called vm_hold_free_pages() nowadays. - Always call vm_hold_free_pages() w/o vm_mtx held.
# 76827	18-May-2001	alfred	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
# 76117	29-Apr-2001	grog	Revert consequences of changes to mount.h, part 2. Requested by: bde
# 75858	23-Apr-2001	grog	Correct #includes to work with fixed sys/mount.h.
# 75648	18-Apr-2001	phk	bread() is a special case of breadn(), so don't replicate code.
# 75629	17-Apr-2001	phk	Write a switch statement as less obscure if statements.
# 75580	17-Apr-2001	phk	This patch removes the VOP_BWRITE() vector. VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing. This patch takes a more general approach and adds a bp->b_op vector where more methods can be added. The success of this patch depends on bp->b_op being initialized all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.
# 75573	17-Apr-2001	mckusick	Add debugging option to always read/write cylinder groups as full sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'. Add debugging option to disable background writes of cylinder groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'. These debugging options should be tried on systems that are panicking with corrupted cylinder group maps to see if it makes the problem go away. The set of panics in question are: ffs_clusteralloc: map mismatch ffs_nodealloccg: map corrupted ffs_nodealloccg: block not in map ffs_alloccg: map corrupted ffs_alloccg: block not in map ffs_alloccgblk: cyl groups corrupted ffs_alloccgblk: can't find blk in cyl ffs_checkblk: partially free fragment The following panics are less likely to be related to this problem, but might be helped by these debugging options: ffs_valloc: dup alloc ffs_blkfree: freeing free block ffs_blkfree: freeing free frag ffs_vfree: freeing free inode If you try these options, please report whether they helped reduce your bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com> and to Matt Dillon <dillon@earth.backplane.com>.
# 73211	28-Feb-2001	dillon	Fix lockup for loopback NFS mounts. The pipelined I/O limitations could be hit on the client side and prevent the server side from retiring writes. Pipeline operations turned off for all READs (no big loss since reads are usually synchronous) and for NFS writes, and left on for the default bwrite(). (MFC expected prior to 4.3 freeze) Testing by: mjacob, dillon
# 72200	09-Feb-2001	bmilekic	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarly, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. 
Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# 71983	04-Feb-2001	dillon	This commit represents work mainly submitted by Tor and slightly modified by myself. It solves a serious vm_map corruption problem that can occur with the buffer cache when block sizes > 64K are used. This code has been heavily tested in -stable but only tested somewhat on -current. An MFC will occur in a few days. My additions include the vm_map_simplify_entry() and minor buffer cache boundary case fix. Make the buffer cache use a system map for buffer cache KVM rather than a normal map. Ensure that VM objects are not allocated for system maps. There were cases where a buffer map could wind up with a backing VM object -- normally harmless, but this could also result in the buffer cache blocking in places where it assumes no blocking will occur, possibly resulting in corrupted maps. Fix a minor boundary case when the buffer cache size limit is reached that could result in non-optimal code. Add vm_map_simplify_entry() calls to prevent 'creeping proliferation' of vm_map_entry's in the buffer cache's vm_map. Previously only a simple linear optimization was made. (The buffer vm_map typically has only a handful of vm_map_entry's. This stabilizes it at that level permanently). PR: 20609 Submitted by: (Tor Egge) tegge
# 70861	10-Jan-2001	jake	Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables other than curproc.
# 70374	26-Dec-2000	dillon	This implements a better launder limiting solution. There was a solution in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient then successive passes will be run without any limit. Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, among other things. NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.
# 70063	15-Dec-2000	jhb	Stick the kthread API in a kthread_* namespace, and the specialized kproc functions in a kproc_* namespace. Reviewed by: -arch
# 68885	18-Nov-2000	dillon	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather than block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmetic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. 
There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often; it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
# 68259	02-Nov-2000	phk	Take VBLK devices further out of their misery. This should fix the panic I introduced in my previous commit on this topic.
# 67365	20-Oct-2000	jhb	Catch up to moving headers: - machine/ipl.h -> sys/ipl.h - machine/mutex.h -> sys/mutex.h
# 66615	03-Oct-2000	jasone	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
# 65770	12-Sep-2000	bp	Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT. They will be used by nullfs and other stacked filesystems to support full cache coherency. Reviewed in general by: mckusick, dillon
# 65557	06-Sep-2000	jasone	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
# 63850	25-Jul-2000	mckusick	Now that buffer locks can be recursive, we need to delete the panics that complain about them. Obtained from: Brian Fundakowski Feldman <green@FreeBSD.org>
# 62976	11-Jul-2000	mckusick	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timeliness are not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
# 61724	16-Jun-2000	phk	Virtualizes & untangles the bioops operations vector. Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
# 60938	26-May-2000	jake	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 60833	23-May-2000	jake	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# 60041	05-May-2000	phk	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bde's teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.h> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
# 59840	01-May-2000	phk	Give struct bio its own call back mechanism.
# 59773	30-Apr-2000	phk	Hmm, diff/patch still doesn't like me. Missed one s/biowait/bufwait/g
# 59762	29-Apr-2000	phk	s/biowait/bufwait/g Prodded by: several.
# 59761	29-Apr-2000	phk	Remove a leftover dysonism.
# 59391	19-Apr-2000	phk	Remove ~25 unneeded #include <sys/conf.h> Remove ~60 unneeded #include <sys/malloc.h>
# 59358	18-Apr-2000	phk	Don't declare common variables in include files: move buftimelock to vfs_bio.c where it is initialized.
# 59249	15-Apr-2000	phk	Complete the bio/buf divorce for all code below devfs::strategy Exceptions: Vinum untouched. This means that it cannot be compiled. Greg Lehey is on the case. CCD not converted yet, casts to struct buf (still safe) atapi-cd casts to struct buf to examine B_PHYS
# 58934	02-Apr-2000	phk	Move B_ERROR flag to b_ioflags and call it BIO_ERROR. (Much of this done by script) Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED. Move b_pblkno and b_iodone_chain to struct bio while we transition, they will be obsoleted once bio structs chain/stack. Add bio_queue field for struct bio aware disksort. Address a lot of stylistic issues brought up by bde.
# 58926	02-Apr-2000	phk	Draw the outline of "struct bio". Struct bio is the future carrier of I/O requests for "struct buf".
# 58706	27-Mar-2000	dillon	Commit the buffer cache cleanup patch to 4.x and 5.x. This patch fixes a fragmentation problem due to geteblk() reserving too much space for the buffer and imposes a larger granularity (16K) on KVA reservations for the buffer cache to avoid fragmentation issues. The buffer cache size calculations have been redone to simplify them (fewer defines, better comments, less chance of running out of KVA). The geteblk() fix solves a performance problem that DG was able to reproduce. This patch does not completely fix the KVA fragmentation problems, but it goes a long way. Mostly Reviewed by: bde and others Approved by: jkh
# 58349	20-Mar-2000	phk	Rename the existing BUF_STRATEGY() to DEV_STRATEGY() substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo) substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo) This patch is machine generated except for the ccd.c and buf.h parts.
# 58345	20-Mar-2000	phk	Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new field in struct buf: b_iocmd. The b_iocmd is enforced to have exactly one bit set. B_WRITE was bogusly defined as zero giving rise to obvious coding mistakes. Also eliminate the redundant struct buf flag B_CALL, it can just as efficiently be done by comparing b_iodone to NULL. Should you get a panic or drop into the debugger, complaining about "b_iocmd", don't continue. It is likely to write on your disk where it should have been reading. This change is a step in the direction towards a stackable BIO capability. A lot of this patch was machine generated (Thanks to style(9) compliance!) Vinum users: Greg has not had time to test this yet, be careful.
# 58132	16-Mar-2000	phk	Eliminate the undocumented, experimental, non-delivering and highly dangerous MAX_PERF option.
# 56212	18-Jan-2000	mckusick	Need to reset the buffer pointer to avoid reconsidering the same buffer again (without this the rollback analysis was being lost). Should reduce the write count for most workloads. Submitted by: Craig A Soules <soules+@andrew.cmu.edu>
# 55756	10-Jan-2000	phk	Give vn_isdisk() a second argument where it can return a suitable errno. Suggested by: bde
# 55697	09-Jan-2000	mckusick	Several performance improvements for soft updates have been added: 1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode. 2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for further allocations. 3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written. 4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.
# 55539	07-Jan-2000	luoqi	Introduce a mechanism to suspend/resume system processes. Suspend syncer and bufdaemon prior to disk sync during system shutdown.
# 54911	20-Dec-1999	dillon	Reimplement buf_daemon / getnewbuf() interaction for dealing with stressful situations. buf_daemon now makes a distinction between being woken up and its sleep timing out, and as a consequence is now much better able to dynamically tune itself to its environment. Reviewed by: Alfred Perlstein <bright@wintelcom.net>
# 53975	01-Dec-1999	mckusick	Collect read and write counts for filesystems. This new code drops the counting in bwrite and puts it all in spec_strategy. I did some tests and verified that the counts collected for writes in spec_strategy is identical to the counts that we previously collected in bwrite. We now also get read counts (async reads come from requests for read-ahead blocks). Note that you need to compile a new version of mount to get the read counts printed out. The old mount binary is completely compatible, the only reason to install a new mount is to get the read counts printed. Submitted by: Craig A Soules <soules+@andrew.cmu.edu> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 53577	22-Nov-1999	phk	Convert various pieces of code to use vn_isdisk() rather than checking for vp->v_type == VBLK. In ccd: we don't need to call VOP_GETATTR to find the type of a vnode. Reviewed by: sos
# 53212	16-Nov-1999	phk	This is a partial commit of the patch from PR 14914: A lot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. This batch of changes compile to the same object files. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# 52652	30-Oct-1999	phk	Remove a #define which doesn't do miracles anymore.
# 52635	29-Oct-1999	phk	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE} as argument.
# 52452	24-Oct-1999	dillon	Adjust the buffer cache to better handle small-memory machines. A slightly older version of this code was tested by BDE and I. Also fixes a lockup situation when kva gets too fragmented. Remove the maxvmiobufspace variable and sysctl, they are no longer used. Also cleanup (remove) #if 0 sections from prior commits. This code is more of a hack, but presumably the whole buffer cache implementation is going to be rewritten in the next year so it's no big deal.
# 52128	11-Oct-1999	peter	Trim unused options (or #ifdef for undoc options). Submitted by: phk
# 51811	30-Sep-1999	dt	Count bogus_page as wired.
# 51465	20-Sep-1999	dillon	Fix bug in brelse() regarding redirtying buffers on B_ERROR. brelse() improperly ignored the B_INVAL flag when acting on the B_ERROR. If both B_INVAL and B_ERROR are set the buffer is typically out of the underlying device's block range and must be destroyed. If only B_ERROR is set (for a write), a write error occurred and operation remains as it was before: the buffer must be redirtied to avoid corrupting the filesystem state. Reviewed by: David Greenman <dg@root.com> Submitted by: Tor.Egge@fast.no
# 50849	03-Sep-1999	luoqi	Allow getblk() to be called from an idle context (by panic() inside an interrupt handler). Reviewed by: dillon
# 50477	27-Aug-1999	peter	$Id$ -> $FreeBSD$
# 50275	23-Aug-1999	bde	Cast pointers to uintptr_t instead of casting them to u_long, and/or vice versa. Cosmetic.
# 49535	08-Aug-1999	phk	Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.
# 49101	26-Jul-1999	alc	Add sysctl and support code to allow directories to be VMIO'd. The default setting for the sysctl is OFF, which is the historical operation. Submitted by: dillon
# 48710	09-Jul-1999	peter	bufhashinit() is called with a caddr_t and is expected to return the same in both the alpha and i386 ports.
# 48686	08-Jul-1999	mckusick	Condition in KASSERT was reversed.
# 48677	08-Jul-1999	mckusick	These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bringing the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propagate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added. 
They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() PRIOR to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 48544	03-Jul-1999	mckusick	The buffer queue mechanism has been reformulated. Instead of having QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truly empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache). Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block. The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future. 
reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occurred when write_behind was turned off in the system. The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks. A small race condition was fixed in getpbuf() in vm/vm_pager.c. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 48333	29-Jun-1999	peter	Hopefully fix the remaining glitches with the BUF_*() changes. This should (really this time) fix pageout to swap and a couple of clustering cases. This simplifies BUF_KERNPROC() so that it unconditionally reassigns the lock owner rather than testing B_ASYNC and having the caller decide when to do the reassign. At present this is required because some places use B_CALL/b_iodone to free the buffers without B_ASYNC being set. Also, vfs_cluster.c explicitly calls BUF_KERNPROC() when attaching the buffers rather than the parent walking the cluster_head tailq. Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 48326	28-Jun-1999	peter	Fix a bug that was almost certainly making breadn() fail. BUF_KERNPROC() was being called on the wrong bp - it should be called on the one that's just about to be fed to VOP_STRATEGY().
# 48251	26-Jun-1999	peter	GC the remnants of the old pre-softupdates update daemon. It's been #if 0'd for a fair while now.
# 48225	26-Jun-1999	mckusick	Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
# 48088	21-Jun-1999	mckusick	When allocating new buffers in getnewbuf, there are several points at which we may sleep. So, after completing our buffer allocation we must ensure that another process has not come along and allocated a different buffer with the same identity. We do this by keeping a global counter of the number of buffers that getnewbuf has allocated. We save this count when we enter getnewbuf and check it when we are about to return. If it has changed, then other buffers were allocated while we were in getnewbuf, so we must return NULL to let our parent know that it must recheck to see if it still needs the new buffer. Hopefully this fix will eliminate the creation of duplicate buffers with the same identity and the obscure corruptions that they cause.
# 47964	16-Jun-1999	mckusick	Add a vnode argument to VOP_BWRITE to get rid of the last vnode operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.
# 47941	16-Jun-1999	tegge	If we still haven't got a sufficient number of free buffers after the call to flushdirtybuffers() then sleep in waitfreebuffers(). PR: 11697 Reviewed by: David Greenman, Matt Dillon
# 47940	15-Jun-1999	mckusick	Get rid of the global variable rushjob and replace it with a function in kern/vfs_subr.c named speedup_syncer() which handles the speedup request. Change the various clients of rushjob to use the new function.
# 47084	12-May-1999	peter	Try and fix a couple of dev_t/major/minor etc. nits.
# 46580	06-May-1999	phk	remove b_proc from struct buf, it's (now) unused. Reviewed by: dillon, bde
# 46566	06-May-1999	phk	Remove unused fields from struct buf: b_savekva b_validoff b_validend Reviewed by: dillon, bde
# 46349	02-May-1999	alc	The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most notably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistency issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vice versa). biodone() is now responsible for setting B_CACHE when a successful read completes. 
B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
# 46181	29-Apr-1999	alc	Address a performance problem in getnewbuf: In heavy-writing situations, QUEUE_LRU can contain a large number of DELWRI buffers at its head. These buffers must be moved to the tail if they cannot be written async in order to reduce the scanning time required to skip past these buffers in later getnewbuf() calls. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
# 45685	14-Apr-1999	dt	getnewbuf(): check return value from tsleep(). Interruptible NFS may pass PCATCH to slpflag.
# 45397	07-Apr-1999	alc	Fix a performance problem with the new getnewbuf() code: in an outofspace condition ( bufspace > hibufspace ), an inappropriate scan of the empty queue was performed looking for buffer space to free up. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
# 45347	05-Apr-1999	julian	Catch a case spotted by Tor where files mmapped could leave garbage in the unallocated parts of the last page when the file ended on a frag but not a page boundary. Delimited by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF, in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c ufs/ufs/ufs_readwrite.c kern/vfs_bio.c Submitted by: Matt Dillon <dillon@freebsd.org> Reviewed by: Alan Cox <alc@freebsd.org>
# 44893	19-Mar-1999	bde	Fixed a serious bug in rev.1.202. getnewbuf() sometimes didn't initialise bp->b_data. This tended to cause panics for file systems whose block size is smaller than one page.
# 44679	12-Mar-1999	julian	Reviewed by: Many at different times in different parts, including alan, john, me, luoqi, and kirk Submitted by: Matt Dillon <dillon@frebsd.org> This change implements a relatively sophisticated fix to getnewbuf(). There were two problems with getnewbuf(). First, the write recursion can lead to a system stack overflow when you have NFS and/or VN devices in the system. Second, the free/dirty buffer accounting was completely broken. Not only did the nfs routines blow it trying to manually account for the buffer state, but the accounting that was done did not work well with the purpose of their existence: figuring out when getnewbuf() needs to sleep. The meat of the change is to kern/vfs_bio.c. The remaining diffs are all minor except for NFS, which includes both the fixes for bp interaction AND fixes for a 'biodone(): buffer already done' lockup. Sys/buf.h also contains a chaining structure which is not used by this patchset but is used by other patches that are coming soon. This patch is delineated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF. (sorry for the missing T matt)
# 44435	02-Mar-1999	julian	Make comment match code.
# 44433	02-Mar-1999	julian	Remove inappropriate use of VOP_ISLOCKED(). This produced races resulting in panics and filesystem corruptions under some circumstances. Reviewed by: luoqi chen <luoqi@freebsd.org> Reviewed by: Kirk McKusick <mckusick@mckusick.com> Submitted by: Matt Dillon <dillon@freebsd.org>
# 43301	27-Jan-1999	dillon	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 43118	23-Jan-1999	dillon	Don't try to calculate B_CACHE for an NFS related bp that has a > 0 b_validend. This will screw up small-writes, causing lots of little writes out the network. We will assume that NFS handles B_CACHE properly.
# 43087	23-Jan-1999	dillon	Fix an expression parenthesization typo in a conditional. It should not have any operational effects other than to make the code in question a little faster. Also added a more involved comment.
# 43043	22-Jan-1999	dg	Don't throw away the buffer contents on a fatal write error; just mark the buffer as still being dirty. This isn't a perfect solution, but throwing away the buffer contents will often result in filesystem corruption and this solution will at least correctly deal with transient errors. Submitted by: Kirk McKusick <mckusick@mckusick.com>
# 42961	21-Jan-1999	dillon	The main operational changes are in getblk()'s handling of the B_DELWRI and B_CACHE flags, fixing a bug that showed up with NFS. Also, a number of cases where manually inserted code has been removed and replaced with an inline function call giving us better functional isolation in the source.
# 42957	21-Jan-1999	dillon	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
# 42827	19-Jan-1999	dillon	Obtained from: Luoqi Fix NFS file corruption problem introduced in 1.188. The valid range was not being set properly, causing a later reference to the buffer to clear the B_CACHE bit.
# 42569	12-Jan-1999	eivind	Silence warnings.
# 42453	09-Jan-1999	eivind	KNFize, by bde.
# 42408	08-Jan-1999	eivind	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith
# 42014	22-Dec-1998	dillon	Adjust some comments to prevent future confusion on the implementation. Also add a reference to the buf(9) manual page.
# 42007	22-Dec-1998	luoqi	Correctly handle misaligned VMIO buffer (whose start or end offset in the VM object are not page aligned). This should fix the mount_msdos panic after a failed attempt to mount as ffs. Reviewed By: Matthew Dillon <dillon@apollo.backplane.com> Archie Cobbs <archie@whistle.com> Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>
# 41803	14-Dec-1998	dillon	fix intermediate overflow in 'quad = int * int' situation by casting the arguments to the multiply to a quad equivalent. In this case, vm_ooffset_t. Reviewed by: Archie Cobbs <archie@whistle.com>
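The 'quad = int * int' problem in r41803 can be illustrated with a small user-space sketch. The function names are hypothetical and `int64_t` stands in for the kernel's `vm_ooffset_t`; the truncating variant uses unsigned 32-bit arithmetic so that the wraparound is well defined in C rather than relying on signed overflow.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's 64-bit vm_ooffset_t. */
typedef int64_t ooffset_sketch_t;

/*
 * Models the bug: the product is formed in 32 bits and only the
 * already-truncated result is widened to 64 bits.
 */
static ooffset_sketch_t
offset_truncated(int blkno, int bsize)
{
	return ((ooffset_sketch_t)((uint32_t)blkno * (uint32_t)bsize));
}

/*
 * Models the fix: widen both operands first, so the multiply itself
 * is carried out in 64 bits.
 */
static ooffset_sketch_t
offset_widened(int blkno, int bsize)
{
	return ((ooffset_sketch_t)blkno * (ooffset_sketch_t)bsize);
}
```

With a block number of 1 << 20 and a block size of 8192, the true product is 2^33: the widened version returns it intact, while the truncated version loses the high bits and yields 0.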
# 41589	07-Dec-1998	eivind	Fix grouping of statements. This removes a potential panic in the soft updates code. While I'm here, remove an unintended trigraph. Reviewed by: Kirk McKusick <kirk@freebsd.org>
# 41237	18-Nov-1998	dg	Closed a very narrow and rare race condition that involved net interrupts, bio interrupts, and a truncated file that along with the precise alignment of the planets could result in a page being freed multiple times or a just-freed page being put onto the inactive queue.
# 40790	31-Oct-1998	peter	Use TAILQ macros for clean/dirty block list processing. Set b_xflags rather than abusing the list next pointer with a magic number.
# 40764	30-Oct-1998	dg	Unwire everything to the inactive queue in order to preserve LRU ordering.
# 40726	29-Oct-1998	dg	Fixed editing error. Pointed out by bde.
# 40700	28-Oct-1998	dg	Added a second argument, "activate" to the vm_page_unwire() call so that the caller can select either inactive or active queue to put the page on.
# 40648	25-Oct-1998	phk	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
# 40286	13-Oct-1998	dg	Fixed two potentially serious classes of bugs: 1) The vnode pager wasn't properly tracking the file size due to "size" being page rounded in some cases and not in others. This sometimes resulted in corrupted files. First noticed by Terry Lambert. Fixed by changing the "size" pager_alloc parameter to be a 64bit byte value (as opposed to a 32bit page index) and changing the pagers and their callers to deal with this properly. 2) Fixed a bogus type cast in round_page() and trunc_page() that caused some 64bit offsets and sizes to be scrambled. Removing the cast required adding casts at a few dozen callers. There may be problems with other bogus casts in close-by macros. A quick check seemed to indicate that those were okay, however.
# 39658	25-Sep-1998	dillon	PR: kern/7418 Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com> Fixed problem where write()s can get lost due to buffers flagged B_DELWRI being improperly released in brelse().
# 39648	25-Sep-1998	peter	Goodbye BOUNCE_BUFFERS, for a hack it has served us well. The last consumer of this code (the old SCSI system) has left us and the CAM code does its own bouncing. The isa dma system has been doing its own bouncing for a while too. Reviewed by: core
# 39245	15-Sep-1998	gibbs	kern_clock.c: Remove old disk statistics variables. vfs_bio.c: Enable bowrite now that B_ORDERED works for all buffer devices.
# 38862	05-Sep-1998	phk	Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform device drivers about sectors no longer in use. Device-drivers receive the call through d_strategy, if they have D_CANFREE in d_flags. This allows flash based devices to erase the sectors and avoid pointlessly carrying them around in compactions. Reviewed by: Kirk McKusick, bde Sponsored by: M-Systems (www.m-sys.com)
# 38799	04-Sep-1998	dfr	Cosmetic changes to the PAGE_XXX macros to make them consistent with the other objects in vm.
# 38611	28-Aug-1998	luoqi	Close a race window for getnewbuf() between shared lock holders of the vnode. Reviewed by: Mike Smith
# 38543	25-Aug-1998	phk	Fix DDBs printing of buf-flags after I changed them yesterday.
# 38525	24-Aug-1998	phk	Remove the last remaining evidence of B_TAPE. Reclaim 3 unused bits in b_flags
# 38517	24-Aug-1998	dfr	Change various syscalls to use size_t arguments instead of u_int. Add some overflow checks to read/write (from bde). Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags and vm_object::paging_in_progress to use operations which are not interruptible. Reviewed by: Bruce Evans <bde@zeta.org.au>
# 38299	13-Aug-1998	dfr	Protect all modifications to v_numoutput with splbio().
# 38135	06-Aug-1998	dfr	Protect all modifications to paging_in_progress with splvm(). The i386 managed to avoid corruption of this variable by luck (the compiler used a memory read-modify-write instruction which wasn't interruptible) but other architectures cannot. With this change, I am now able to 'make buildworld' on the alpha (sfx: the crowd goes wild...)
# 37615	13-Jul-1998	bde	Fixed printf format errors.
# 37490	07-Jul-1998	julian	Catch a few corner cases where FreeBSD differs enough from BSD 4.4 to confuse Soft updates. Should solve several "dangling deps" panics.
# 37384	04-Jul-1998	julian	VOP_STRATEGY grows an (struct vnode *) argument as the value in b_vp is often not really what you want. (and needs to be frobbed). more cleanups will follow this. Reviewed by: Bruce Evans <bde@freebsd.org>
# 35590	01-May-1998	peter	vm_page_is_valid() wasn't expecting a large offset argument, it's expecting a sub-page offset. We were passing the file position, and vm_page_bits() could do some interesting things when base was larger than PAGE_SIZE. if (size > PAGE_SIZE - base) size = PAGE_SIZE - base; is interesting when (PAGE_SIZE - base) is negative. I could imagine that this could have interesting consequences for memory page -> device block bit validation.
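The clamp quoted in r35590 is easy to try in isolation. The sketch below is a user-space illustration only: `PAGE_SIZE_SKETCH`, `clamp_size()` and `subpage_offset()` are hypothetical names, a 4096-byte page is assumed, and masking the file position down to a sub-page offset is shown as one plausible way to satisfy the "sub-page offset" expectation, not necessarily what the commit did.

```c
/* Assumed page size for the illustration; must be a power of two. */
#define PAGE_SIZE_SKETCH 4096

/* The clamp from the commit message, applied to whatever base it is given. */
static int
clamp_size(int base, int size)
{
	if (size > PAGE_SIZE_SKETCH - base)
		size = PAGE_SIZE_SKETCH - base;
	return (size);
}

/* Reduce a file position to an offset within its page. */
static int
subpage_offset(long long foff)
{
	return ((int)(foff & (PAGE_SIZE_SKETCH - 1)));
}
```

Passing a raw file position of 10000 as `base` makes `PAGE_SIZE_SKETCH - base` negative, so a 512-byte request is "clamped" to a negative size; passing `subpage_offset(10000)` instead leaves the request at 512 bytes.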
# 35589	01-May-1998	peter	Fix one problem with NFSv3 > 2GB file support. Submitted by: bde
# 35256	17-Apr-1998	des	Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.
# 35210	15-Apr-1998	bde	Support compiling with `gcc -ansi'.
# 34908	27-Mar-1998	dyson	Correct a problem where buffers might not be zeroed when needed. The B_MALLOC buffers might not have been properly zeroed.
# 34694	19-Mar-1998	dyson	In kern_physio.c fix tsleep priority messup. In vfs_bio.c, remove b_generation count usage, remove redundant reassignbuf, remove redundant spl(s), manage page PG_ZERO flags more correctly, utilize an invalid value for b_offset until it is properly initialized. Add asserts for #ifdef DIAGNOSTIC, when b_offset is improperly used. when a process is not performing I/O, and just waiting on a buffer generally, make the sleep priority low. only check page validity in getblk for B_VMIO buffers. In vfs_cluster, add b_offset asserts, correct pointer calculation for clustered reads. Improve readability of certain parts of the code. Remove redundant spl(s). In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin <gallatin@cs.duke.edu>). More vtruncbuf problems fixed.
# 34646	17-Mar-1998	dyson	Correct a problem where data OR metadata could be thrown away if a buffer is grown.
# 34640	17-Mar-1998	kato	Deleted PC-98 code because (1) machine dependent code should not be in here, and (2) the flag used in PC-98 code has been assigned to another purpose.
# 34611	15-Mar-1998	dyson	Some VM improvements, including elimination of a lot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!! pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.) pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances. vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.) vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust. vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true. vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.) vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities. vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.) vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. 
The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliability and performance issue.) 10) Modify FFS to use vtruncbuf. vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.) vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly. vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.) vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
# 34266	08-Mar-1998	julian	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: Whistle development tree
# 34206	07-Mar-1998	dyson	This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifested themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistency, and as a side-effect broke the vfs.ioopt code. This code might have been committed separately, but almost everything is interrelated. 1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitous buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups associated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistency problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify its status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and its descendants to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
# 34022	04-Mar-1998	dyson	Fix a rounding error for the NFS buffer validend. Submitted by: John W. De Boskey <jwd@unx.sas.com>
# 33936	01-Mar-1998	dyson	1) Use a more consistent page wait methodology. 2) Do not unnecessarily force page blocking when paging pages out. 3) Further improve swap pager performance and correctness, including fixing the paging in progress deadlock (except in severe I/O error conditions.) 4) Enable vfs_ioopt=1 as a default. 5) Fix and enable the page prezeroing in SMP mode. All in all, SMP systems especially should show a significant improvement in "snappyness."
# 33255	11-Feb-1998	dg	Fix a && that should be an &. Reviewed by: "John S. Dyson" <dyson@FreeBSD.ORG> Submitted by: jwd@unx.sas.com (John W. DeBoskey)
# 33181	09-Feb-1998	eivind	Staticize.
# 33134	06-Feb-1998	eivind	Back out DIAGNOSTIC changes.
# 33108	04-Feb-1998	eivind	Turn DIAGNOSTIC into a new-style option.
# 32937	31-Jan-1998	dyson	Change the busy page mgmt, so that when pages are freed, they MUST be PG_BUSY. It is bogus to free a page that isn't busy, because it is in a state of being "unavailable" when being freed. The additional advantage is that the page_remove code has a better cross-check that the page should be busy and unavailable for other use. There were some minor problems with the collapse code, and this plugs those subtle "holes." Also, the vfs_bio code wasn't checking correctly for PG_BUSY pages. I am going to develop a more consistent scheme for grabbing pages, busy or otherwise. For now, we are stuck with the current morass.
# 32755	25-Jan-1998	dyson	Various NFS fixes: Make vfs_bio buffer mgmt work better. Buffers were being used after brelse. Make nfs_getpages work independently of other NFS interfaces. This eliminates some difficult recursion problems and decreases pagefault overhead. Remove an erroneous vfs_unbusy_pages. Fix a reentrancy problem, with nfs_vinvalbuf when vnode is already being rundown. Reassignbuf wasn't being called when needed under certain circumstances. (Thanks to Bill Paul for help.)
# 32724	24-Jan-1998	dyson	Add better support for larger I/O clusters, including larger physical I/O. The support is not mature yet, and some of the underlying implementation needs help. However, support does exist for IDE devices now.
# 32702	22-Jan-1998	dyson	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneeded collapse operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
# 32585	17-Jan-1998	dyson	Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management. The system should be much more stable, but not all subtle bugs are fixed yet.
# 32454	11-Jan-1998	dyson	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.
# 32286	06-Jan-1998	dyson	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. Vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon as reasonably possible.
# 31936	22-Dec-1997	dyson	Improve my copyright.
# 31596	07-Dec-1997	dyson	Slight performance improvement, removal of unneeded SPLs.
# 31561	05-Dec-1997	bde	Don't include <sys/lock.h> in headers when only `struct simplelock' is required. Fixed everything that depended on the pollution.
# 31493	02-Dec-1997	phk	In all such uses of struct buf: 's/b_un.b_addr/b_data/g'
# 31478	01-Dec-1997	dyson	Fix a serious problem during resizing buffers where old buffers address space wasn't being properly reclaimed. Submitted by: Bruce Evans <bde@freebsd.org>
# 31380	24-Nov-1997	dyson	Avoid manipulating the buffer map at interrupt time by deferring bfreekva to getnewbuf, and remove from brelse. Reviewed by: dg@root.com
# 31016	07-Nov-1997	phk	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
# 30994	06-Nov-1997	phk	Move the "retval" (3rd) parameter from all syscall functions and put it in struct proc instead. This fixes a boatload of compiler warnings, and removes a lot of cruft from the sources. I have not removed the /ARGSUSED/, they will require some looking at. libkvm, ps and other userland struct proc frobbing programs will need to be recompiled.
# 30813	28-Oct-1997	bde	Removed unused #includes.
# 30743	26-Oct-1997	phk	VFS interior redecoration. Rename vn_default_error to vop_defaultop all over the place. Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite. Use vop_null instead of nullop. Move vop_nopoll from vfs_subr.c to vfs_default.c Move vop_sharedlock from vfs_subr.c to vfs_default.c Move vop_nolock from vfs_subr.c to vfs_default.c Move vop_nounlock from vfs_subr.c to vfs_default.c Move vop_noislocked from vfs_subr.c to vfs_default.c Use vop_ebadf instead of *_ebadf. Add vop_defaultop for getpages on master vnode in MFS.
# 30354	12-Oct-1997	phk	Last major round (Unless Bruce thinks of something :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
# 30309	11-Oct-1997	phk	Distribute and staticize a lot of the malloc M_* types. Substantial input from: bde
# 29680	21-Sep-1997	gibbs	init_main.c subr_autoconf.c: Add support for "interrupt driven configuration hooks". A component of the kernel can register a hook, most likely during auto-configuration, and receive a callback once interrupt services are available. This callback will occur before the root and dump devices are configured, so the configuration task can affect the selection of those two devices or complete any tasks that need to be performed prior to launching init. System boot is postponed so long as a hook is registered. The hook owner is responsible for removing the hook once their task is complete or the system boot can continue. kern_acct.c kern_clock.c kern_exit.c kern_synch.c kern_time.c: Change the interface and implementation for the kernel callout service. The new implementation is based on the work of Adam M. Costello and George Varghese, published in a technical report entitled "Redesigning the BSD Callout and Timer Facilities". The interface used in FreeBSD is a little different than the one outlined in the paper. The new function prototypes are: struct callout_handle timeout(void (*func)(void *), void *arg, int ticks); void untimeout(void (*func)(void *), void *arg, struct callout_handle handle); If a client wishes to remove a timeout, it must store the callout_handle returned by timeout and pass it to untimeout. The new implementation gives O(1) insert and removal of callouts making this interface scale well even for applications that keep 100s of callouts outstanding. See the updated timeout.9 man page for more details.
# 29654	21-Sep-1997	dyson	Re-institute a bugfix in allocation of anonymous buffer memory.
# 29289	10-Sep-1997	phk	The patch is needed in order to not throw away unmodified local filesystem metadata at the first brelse call when the block device vnode has v_tag set to VT_NFS. Reviewed by: phk Submitted by: Tor Egge <tegge@idi.ntnu.no>
# 29211	07-Sep-1997	bde	Some staticized variables were still declared to be extern.
# 28774	26-Aug-1997	dyson	Back out some incorrect changes that were worse than the original bug.
# 28465	20-Aug-1997	dyson	Some corrections to the anonymous page management. Submitted by: Peter Chen <pmchen@eecs.umich.edu>
# 28013	09-Aug-1997	dyson	Modify the scheduling policy to take into account disk I/O waits as chargeable CPU usage. This should mitigate the problem of processes doing disk I/O hogging the CPU. Various users have reported the problem, and test code shows that the problem should now be gone.
# 26664	15-Jun-1997	dyson	Fix a problem with the VN device. Specifically, the VN device can cause a problem of spiraling death due to buffer resource limitations. The vfs_bio code in general had little ability to handle buffer resource management, and now it does. Also, there are a lot more knobs for tuning the vfs_bio code now. The knobs came free because of the need that there always be some immediately available buffers (non-delayed or locked) for use. Note that the buffer cache code is much less likely to get bogged down with lots of delayed writes, even more so than before.
# 26599	13-Jun-1997	bde	Fixed livelock in getnewbuf(). It is possible for multiple processes to sleep concurrently waiting for a buffer. When the buffer shortage is a shortage of space but not a shortage of buffer headers, the processes took turns creating empty buffers and waking each other to advertise the brelse() of the empties; progress was never made because tsleep() always found another high-priority process to run and everything was done at splbio(), so vfs_update never had a chance to flush delayed writes, not to mention that i/o never had a chance to complete. The problem seems to be rare in practice, but it can easily be reproduced by misusing block devices, at least for sufficiently slow devices on machines with a sufficiently small buffer cache. E.g., `tar cvf /dev/fd0 /kernel' on an 8MB system with no disk in fd0 causes the problem quickly; the same command with a disk in fd0 causes the problem not quite as quickly; and people have reported problems newfs'ing file systems on block devices. Block devices only cause this problem indirectly. They are pessimized for time and space, and the space pessimization causes the shortage (it manifests as internal fragmentation in buffer_map). This should be fixed in 2.2.
# 26471	06-Jun-1997	dfr	Don't throw NFS B_DELWRI buffers back to the vm system in brelse. Make sure that b_validoff..b_validend is at least as big as b_dirtyoff..b_dirtyend.
# 26409	03-Jun-1997	dfr	Fix some performance problems with the NFS mmap fixes.
# 26290	30-May-1997	dfr	The previous fix didn't work properly for small block size filesystems, which caused very slow file access for cd9660 and some ext2fs filesystems. Reviewed by: bde
# 25930	19-May-1997	dfr	Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully can be tidied up later by a slight redesign. PR: kern/2573, kern/2754, kern/3046 (possibly) Reviewed by: dyson
# 25649	10-May-1997	joerg	Add a DDB command `show buffer', to display a struct buf. It's impossible to display everything, so i've chosen a small subset. Add more to this as you think seems useful.
# 24850	13-Apr-1997	dyson	Improve the buffer cache memory policy by moving pages over to the cache queue more often. The pageout daemon had to be woken up more often than necessary since pages were not put on the cache queue, when they should have been. Submitted by: David Greenman <dg@freebsd.org>
# 24478	01-Apr-1997	bde	Removed potentially harmful garbage <vm/lock.h> and fixed bogus use of it. It was actually harmless because the use was null due to fortuitous include orders and identical (wrong) idempotency macros.
# 22975	22-Feb-1997	peter	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 21733	15-Jan-1997	bde	Removed redundant spl0()'s from kernel processes. They were work-arounds for a bug in fork().
# 21673	14-Jan-1997	jkh	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 20146	05-Dec-1996	dyson	Clean-up of the new buffer kva allocation code. Also, there was an error in the !BOUNCE_BUFFERS case.
# 20068	01-Dec-1996	dyson	Fix a problem with the new buffer_map management code. Additionally, decrease the size of buffer_map to approx 2/3 of what it used to be (buffer_map can be smaller now.) The original commit of these changes increased the size of buffer_map to the point where the system would not boot on large systems -- now large systems with large caches will have even less problems than before.
# 20054	30-Nov-1996	dyson	Implement a new totally dynamic (up to MAXPHYS) buffer kva allocation scheme. Additionally, add the capability for checking for unexpected kernel page faults. The maximum amount of kva space for buffers hasn't been decreased from where it is, but it will now be possible to do so. This scheme manages the kva space similar to the buffers themselves. If there isn't enough kva space because of usage or fragmentation, buffers will be reclaimed until a buffer allocation is successful. This scheme should be very resistant to fragmentation problems until/if the LFS code is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS is not likely to use such a scheme. Now there should be NO problem allocating buffers up to MAXPHYS.
# 19996	28-Nov-1996	dyson	Potentially fix a problem, whereby MSDOSFS can request buffers larger than the vfs layer can provide. We now automatically support 32K clusters if MSDOSFS is installed, and panic if a filesystem tries to allocate a buffer larger than MAXBSIZE. This commit is a result of some "prodding" by BDE.
# 19828	17-Nov-1996	dyson	Improve the caching of small files like directories, while not substantially increasing buffer space. Specifically, we double the number of buffers, but allocate only half the amount of memory per buffer. Note that VDIR files aren't cached unless instantiated in a buffer. This will significantly improve caching.
# 18975	17-Oct-1996	dyson	Fix a problem that could cause msync (or many other things) to deadlock. The heuristic for management of memory backing the buffer cache was nice, but didn't work due to some architectural problems. Simplify and improve the algorithm.
# 18737	06-Oct-1996	dyson	Fix 4 problems: Major: When blocking occurs in allocbuf() for VMIO files, excess wire counts could accumulate. Major: Pages are incorrectly accumulated into the physical buffer for clustered reads. This happens when bogus page is needed. Minor: When reclaiming buffers, the async flag on the buffer needs to be zero, or the reclaim is not optimal. Minor: The age flag should be cleared, if a buffer is wanted.
# 18401	20-Sep-1996	dyson	Fix an spl window, a page manipulation at interrupt time that was incorrect, and correct the support for B_ORDERED. The spl window fix was from Peter Wemm, and his questions led me to find the problem with the interrupt time page manipulation.
# 18358	18-Sep-1996	dyson	Add needed spl protection, and some minor cleanups in vfs_vmio_release. Submitted by: Peter Wemm <peter@spinner.dialix.com> and me.
# 18291	14-Sep-1996	dyson	Clean up some more problems with freeing busy or wired pages. The vfs_bio code was not waiting properly for page state until manipulating it.
# 18271	13-Sep-1996	dyson	A modification that allows the driver strategy to modify the B_ASYNC flag broke things pretty badly (freeing buffer already on queue or other weird buffer queue errors.) The broken code is left in, commented out, but this makes the problem go away for now.
# 18169	08-Sep-1996	dyson	Addition of page coloring support. Various levels of coloring are afforded. The default level works with minimal overhead, but one can also enable full, efficient use of a 512K cache. (Parameters can be generated to support arbitrary cache sizes also.)
# 18070	06-Sep-1996	gibbs	Add bowrite. Bowrite guarantees that buffers queued after a call to bowrite will be written after the specified buffer (on a particular device). Bowrite does this either by taking advantage of hardware ordering support (e.g. tagged queueing on SCSI devices) or resorting to a synchronous write.
# 17761	21-Aug-1996	dyson	Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistent and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices is simpler and more correct. This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc. Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility. Reviewed by: davidg
# 17429	04-Aug-1996	phk	Add separate kmalloc classes for BIO buffers and Ktrace info.
# 16840	30-Jun-1996	dg	Fixed a major bug that caused various pmap related panics, hangs, and reboots. The i386 pmap module uses a special area of kernel virtual memory for mapping of page table pages when it needs to modify another process's virtual address space. It's called the 'alternate page table map'. There is only one of them and it's expected that only one process will be using it at once and that the operation is atomic. When the merged VM/buffer cache was implemented over a year ago, it became necessary to rundown VM pages at I/O completion. The unfortunate and unforeseen side effect of this is that pmap functions are now called at bio interrupt time. If there happened to be a process using the alternate page table map when this I/O completion occurred, it was possible for a different process's address space to be switched into the alternate page table map - leaving the current pmap process with the wrong address space mapped when the interrupt completed. This resulted in BAD things happening like pages being mapped or removed from the wrong address space, etc. Since a very common case of a process modifying another process's address space is during fork when the kernel stack is inserted, one of the most common manifestations of this bug was the kernel stack not being mapped properly, resulting in a silent hang or reboot. This made it VERY difficult to troubleshoot this bug (I've been trying to figure out the cause of this for >6 months). Fortunately, the set of conditions that must be true before this problem occurs is sufficiently rare that most people never saw the bug occur. As I/O rates increase, however, so does the frequency of the crashes. This problem used to kill wcarchive about every 10 days, but in more recent times when the traffic exceeded 100GB/day, the machine could barely manage 6 hours of uptime. The fix is to make certain that no process has the pages mapped that are involved in the I/O, before the I/O is started.
The pages are made busy, so no process will be able to map them, either, until the I/O has finished. This side-steps the issue by still allowing the pmap functions to be called at interrupt time, but also assuring that the alternate page table map won't be switched. Unfortunately, this appears to not be the only cause of this problem. :-( Reviewed by: dyson
# 16363	14-Jun-1996	asami	The Great PC98 Merge. All new code is "#ifdef PC98"ed so this should make no difference to PC/AT (and its clones) users. Ok'd by: core Submitted by: FreeBSD(98) development team
# 16027	30-May-1996	dyson	Keep brelse from freeing busy pages.
# 15891	24-May-1996	dyson	Make sure that we don't place a busy or held page onto the PQ_CACHE queue.
# 15809	18-May-1996	dyson	This set of commits to the VM system does the following, and contains contributions or ideas from Stephen McKay <syssgm@devetir.qld.gov.au>, Alan Cox <alc@cs.rice.edu>, David Greenman <davidg@freebsd.org> and me: More usage of the TAILQ macros. Additional minor fix to queue.h. Performance enhancements to the pageout daemon. Addition of a wait in the case that the pageout daemon has to run immediately. Slightly modify the pageout algorithm. Significant revamp of the pmap/fork code: 1) PTE's and UPAGES's are NO LONGER in the process's map. 2) PTE's and UPAGES's reside in their own objects. 3) TOTAL elimination of recursive page table pagefaults. 4) The page directory now resides in the PTE object. 5) Implemented pmap_copy, thereby speeding up fork time. 6) Changed the pv entries so that the head is a pointer and not an entire entry. 7) Significant cleanup of pmap_protect, and pmap_remove. 8) Removed significant amounts of machine dependent fork code from vm_glue. Pushed much of that code into the machine dependent pmap module. 9) Support more completely the reuse of already zeroed pages (Page table pages and page directories) as being already zeroed. Performance and code cleanups in vm_map: 1) Improved and simplified allocation of map entries. 2) Improved vm_map_copy code. 3) Corrected some minor problems in the simplify code. Implemented splvm (combo of splbio and splimp.) The VM code now seldom uses splhigh. Improved the speed of and simplified kmem_malloc. Minor mod to vm_fault to avoid using pre-zeroed pages in the case of objects with backing objects along with the already existent condition of having a vnode. (If there is a backing object, there will likely be a COW... With a COW, it isn't necessary to start with a pre-zeroed page.) Minor reorg of source to perhaps improve locality of ref.
# 15583	03-May-1996	phk	Another sweep over the pmap/vm macros, this time with more focus on the usage. I'm not satisfied with the naming, but now at least there is less bogus stuff around.
# 14426	09-Mar-1996	dyson	Correct handling of dirty pages in I/O buffers. The case where pages residing in a buffer that had been dirtied by a process was being handled incorrectly. The pages were mistakenly placed into the cache queue. This would likely have the effect of mmaped page modifications being lost when I/O system calls were being used simultaneously to the same locations in a file. Submitted by: davidg
# 14347	02-Mar-1996	dyson	Fix the buffer queue problem differently. The previous fix could panic with a buffer not on queue panic.
# 14319	02-Mar-1996	dyson	1) Fix a bug that a buffer is removed from a queue, but the queue type is not set to QUEUE_NONE. This appears to have caused a hang bug that has been lurking. 2) Fix bugs that brelse'ing locked buffers do not "free" them, but the code assumes so. This can cause hangs when LFS is used. 3) Use malloced memory for directories when applicable. The amount of malloced memory is seriously limited, but should decrease the amount of memory used by an average directory to 1/4 - 1/2 previous. This capability is fully tunable. (Note that there is no config parameter, and might never be.) 4) Bias slightly the buffer cache usage towards non-VMIO buffers. Since the data in VMIO buffers is not lost when the buffer is reclaimed, this will help performance. This is adjustable also.
# 14317	02-Mar-1996	dyson	Enable VMIO for non-VDIR metadata and block device.
# 13490	19-Jan-1996	dyson	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do a lot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30 seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
# 13294	06-Jan-1996	dg	Print out the queue index if it's found to be inconsistent.
# 13292	06-Jan-1996	dg	Rework vm_hold_{load,free}_pages to calculate an index once and use that. At the same time, be sure to page-truncate bp->b_data so that the result of the calculation isn't negative.
# 13265	05-Jan-1996	wollman	Convert BOUNCE_BUFFERS and BOUNCEPAGES to new option scheme.
# 13211	04-Jan-1996	dg	Fixed minor struct cred leak. Discovered while looking for the opposite condition - too many frees, which has yet to be found. Reviewed by: dyson
# 12819	14-Dec-1995	phk	A Major staticize sweep. Generates a couple of warnings that I'll deal with later. A number of unused vars removed. A number of unused procs removed or #ifdefed.
# 12799	13-Dec-1995	dyson	Fix a problem that was caused by new (partial) support for merged cache metadata and VBLK type devices. The code is currently mostly disabled, and a work-around has been added to disabled attempted clustered writes for VBLK type device buffers. Clustered write of meta-data is currently a work in progress.
# 12787	12-Dec-1995	dyson	This should have fixed some conditions that could cause the "getblk" hang. The B_WANTED flag was being cleared gratuitously, also the optimization of gbincore for ignoring the B_INVAL flag was incorrect. There is no place in the code where buffers are on the hash list that are B_INVAL and not B_BUSY.
# 12767	11-Dec-1995	dyson	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
# 12662	07-Dec-1995	dg	Untangled the vm.h include file spaghetti.
# 12623	04-Dec-1995	phk	A major sweep over the sysctl stuff. Move a lot of variables home to their own code (In good time before xmas :-) Introduce the string description of format. Add a couple more functions to poke into these marvels, while I try to decide what the correct interface should look like. Next is adding vars on the fly, and sysctl looking at them too. Removed a tiny bit of defunct and #ifdefed notused code in swapgeneric.
# 12577	02-Dec-1995	bde	Completed function declarations and/or added prototypes.
# 12569	02-Dec-1995	bde	Finished (?) cleaning up sysinit stuff.
# 12404	19-Nov-1995	dyson	General fixes to the vfs clustering code: 1) Make cluster buffer list be a non-malloced chain. This eliminates yet another 'evil' M_WAITOK and generally cleans up the code. 2) Fix write clustering for ext2fs. It was just broken. Also, ffs clustering had an efficiency problem that more bawrites were happening than should have been. 3) Make changes to buf.h to support the above, plus remove b_pfcent at the request of David Greenman. Reviewed by: davidg (partially)
# 12379	18-Nov-1995	dyson	Added a missing splx(s).
# 12110	05-Nov-1995	dyson	Greatly simplify the msync code. Eliminate complications in vm_pageout for msyncing. Remove a bug that manifests itself primarily on NFS (the dirty range on the buffers is not set on msync.)
# 11921	29-Oct-1995	phk	Second batch of cleanup changes. This time mostly making a lot of things static and some unused variables here and there.
# 11577	19-Oct-1995	dyson	If we clear the B_CACHE flag because a buffer isn't composed fully of valid bytes, we must also clear the B_DONE flag. Some filesystems depend on this (incl NFS) and is probably the cause of the biodone error and subsequent crash. Anyway this change needs to be made.
# 11332	07-Oct-1995	swallace	Remove prototype definitions from <sys/systm.h>. Prototypes are located in <sys/sysproto.h>. Add appropriate #include <sys/sysproto.h> to files that needed protos from systm.h. Add structure definitions to appropriate files that relied on sys/systm.h, right before system call definition, as in the rest of the kernel source. In kern_prot.c, instead of using the dummy structure "args", create individual dummy structures named <syscall>_args. This makes life easier for prototype generation.
# 11101	01-Oct-1995	dg	Two critical bugfixes: 1) "obj" wasn't initialized properly, resulting in an important vm_page_lookup always failing (resulting in a panic). 2) busy pages could be put on the cache queue or freed (resulting in a panic).
# 10978	23-Sep-1995	dyson	These changes fix a bug in the clustering code that I made worse when adding support for EXT2FS. Note that the Sig-11 problems appear to be caused by this, but there is still probably an underlying VM problem that let this clustering bug cause vnode objects to appear to be corrupted. The direct manifestation of this bug would have been severely mis-read files. It is possible that processes would Sig-11 on very damaged input files and might explain the mysterious differences in system behaviour when phk's malloc is being used.
# 10653	09-Sep-1995	dg	Fixed init functions argument type - caddr_t -> void *. Fixed a couple of compiler warnings.
# 10551	03-Sep-1995	dyson	Added VOP_GETPAGES/VOP_PUTPAGES and also the "backwards" block count for VOP_BMAP. Updated affected filesystems...
# 10541	03-Sep-1995	dyson	Improvements to the cluster code, minor vfs_bio efficiency: Better performance -- more aggressive read-ahead under certain circumstances. Mods to support clustering on small ( < PAGE_SIZE) block size filesystems (e.g. ext2fs, msdosfs.)
# 10358	28-Aug-1995	julian	Reviewed by: julian with quick glances by bruce and others Submitted by: terry (terry lambert) This is a composite of 3 patch sets submitted by terry. they are: New low-level init code that supports loadable modules better some cleanups in the namei code to help terry in 16-bit character support some changes to the mount-root code to make it a little more modular.. NOTE: mounting root off cdrom or NFS MIGHT be broken as I haven't been able to test those cases.. certainly mounting root off disk still works just fine.. mfs should work but is untested. (tomorrow's task) The low level init stuff includes a total rewrite of init_main.c to make it possible for new modules to have an init phase by simply adding an entry to a TEXT_SET (or is it DATA_SET) list. thus a new module can be added to the kernel without editing any other files other than the 'files' file.
# 10228	24-Aug-1995	dg	Another minor optimization, this time to incore().
# 10227	24-Aug-1995	dg	Minor optimization.
# 9969	06-Aug-1995	dg	Resize both VMIO and non-VMIO buffers if the size changes.
# 9759	29-Jul-1995	bde	Eliminate sloppy common-style declarations. There should be none left for the LINT configuration.
# 9708	25-Jul-1995	dg	Killed bogus casts in tsleep/wakeup calls.
# 9706	25-Jul-1995	dg	Fixed broken offset use in vfs_unbusy_pages() which resulted in several different types of panics/inconsistencies with NFS clients. Cleared PG_WANTED where appropriate. Added checks for buffer busy in allocbuf and biodone. Reviewed by: John Dyson
# 9676	24-Jul-1995	dg	Panic if no object in biodone. Slightly optimized allocbuf() again.
# 9670	23-Jul-1995	dg	Added some additional diagnostic information output when panicking in biodone().
# 9668	23-Jul-1995	dg	Fixed two cases where some parens were missing, resulting in some bogus logic. Slightly simplified allocbuf().
# 9602	21-Jul-1995	dg	Re-lookup the buffer if the vnode isn't locked. The previous check for VBLK vnodes isn't adequate since all NFS nodes aren't locked, either. The result is a race condition that would lead to duplicate buffers at the same block offset. Submitted by: John Dyson
# 9558	17-Jul-1995	dg	Fixed "bufspace" calculation. It was lossy in some circumstances of the buffer resizing and caused a "newbuf" deadlock. Reviewed by: John Dyson & David Greenman Submitted by: Peter Wemm
# 9530	15-Jul-1995	dg	Resize buffers if they aren't the correct size. Several months ago we made a change to NFS that caused buffers at EOF to be variable size. This had the undesired side-effect of breaking delayed writes on NFS. This fixes it. Submitted by: John Dyson
# 9356	28-Jun-1995	dg	1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.
# 8876	30-May-1995	rgrimes	Remove trailing whitespace.
# 8692	21-May-1995	dg	Changes to fix the following bugs: 1) Files weren't properly synced on filesystems other than UFS. In some cases, this led to lost data. Most likely would be noticed on NFS. The fix is to make the VM page sync/object_clean general rather than in each filesystem. 2) Mixing regular and mmaped file I/O on NFS was very broken. It caused chunks of files to end up as zeroes rather than the intended contents. The fix was to fix several race conditions and to kludge up the "b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention to page modifications that occurred via the mmapping. Reviewed by: David Greenman Submitted by: John Dyson
# 8456	11-May-1995	rgrimes	Fix -Wformat warnings from LINT kernel.
# 8176	30-Apr-1995	dg	Check for curproc != NULL before dereferencing it.
# 7880	16-Apr-1995	dg	Removed unused & empty bufstats() function.
# 7878	16-Apr-1995	dg	Killed gratuitous b_vp=NULL in bufinit. The entire buffer is already bzero()'d.
# 7872	16-Apr-1995	dg	1) Check for curproc != NULL in bread/bwrite. John convinced me that this is necessary in order for panic+sync to work. Will also gloss over a panic that Jordan was having with the install floppies that remains unexplainable. 2) Handle "bogus_page" a little better. 3) Set page protection to VM_PROT_NONE if the entire page has become !valid. Submitted by: John Dyson (2&3), me (1).
# 7694	09-Apr-1995	dg	Changes from John Dyson and myself: Fixed remaining known bugs in the buffer IO and VM system. vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR). (various) Replaced calls to clrbuf() with calls to an optimized routine called vfs_bio_clrbuf(). (various FS sync) Sync out modified vnode_pager backed pages. ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks. vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order. vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previously not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES(). vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted. vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
# 7404	26-Mar-1995	dg	Removed some redundant 'vmio' checks.
# 7399	26-Mar-1995	dg	Removed third arg (vmio) to allocbuf() that was added with the original merged cache changes, and figure it out based on the B_VMIO buffer flag. Fixes a problem where delayed write VMIO buffers would sometimes get recopied into kernel-alloced memory. Submitted by: John Dyson
# 7090	16-Mar-1995	bde	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 6948	07-Mar-1995	dg	Removed most of the special policy regarding the separation of VMIO and dir/metadata buffers as it seems to have anomalous effects.
# 6884	04-Mar-1995	dg	Added some more of John's "anti-chatter" fixes - set the page activation count to 0 after activating the page; the previous behavior biased the pages too high in some cases. Submitted by: John Dyson
# 6864	03-Mar-1995	dg	Fixes from John Dyson to work around vnode lock hang. Basically, remove the VOP_BMAP calls, and add one to bdwrite. Submitted by: John Dyson
# 6807	01-Mar-1995	dg	Various changes from John and myself that do the following: New functions created - vm_object_pip_wakeup and pagedaemon_wakeup that are used to reduce the actual number of wakeups. New function vm_page_protect which is used in conjunction with some new page flags to reduce the number of calls to pmap_page_protect. Minor changes to reduce unnecessary spl nesting. Rewrote vm_page_alloc() to improve readability. Various other mostly cosmetic changes.
# 6692	24-Feb-1995	dg	Fixed thrashing buffer problem. Submitted by: John Dyson
# 6620	22-Feb-1995	dg	Added some code to make sure that buffers associated with directories and metadata aren't thrashed by regular file I/O. Added mechanism to limit the amount of outstanding I/O on a given vnode. Pagedaemon wakeup policy changed to skew priority a little in favor of file caching. Slight code reorganization to improve clarity. Added a few more comments. Submitted by: John Dyson
# 6619	22-Feb-1995	dg	Only do object paging_in_progress wakeups if someone is waiting on this condition. Added some comments. Submitted by: John Dyson
# 6539	18-Feb-1995	dg	Only clear B_VMIO in brelse() - a bunch of special processing is required whenever this happens, and that wasn't occurring in some cases.
# 6147	03-Feb-1995	dg	Make B_NOCACHE and B_INVAL buffers work correctly - throw away the data in the page cache. Submitted by: John Dyson
# 5918	26-Jan-1995	dg	Fix problem with freeing busy pages reported by Nick Sayer. Submitted by: John Dyson
# 5839	24-Jan-1995	dg	Fixed a variety of deadlock and panic bugs, removed the bypass code, and implemented the ability to limit bufferspace by memory consumed. (vfs_bio.c) Fixed recently introduced bugs that caused extra I/O to happen in some cases. (vfs_cluster.c) Submitted by: John Dyson
# 5762	21-Jan-1995	ache	Restore original fix from ohki, not check m for NULL it is already done in the code above. Submitted by: ohki@gssm.otsuka.tsukuba.ac.jp
# 5759	20-Jan-1995	ache	Change if (m->valid == 0) to if (m && m->valid == 0)
# 5748	20-Jan-1995	wpaul	Submitted by: ohki@gssm.otsuka.tsukuba.ac.jp When using cp to copy a file under the following circumstances: - original file is on an NFS filesystem - destination file is on the same NFS filesystem - the file is less than 8Mbytes in size - the file is larger than 65536 bytes in size the cp process can get frozen in device-wait and never wake up (cp uses mmap() in this case). A small change to allocbuf() fixes this.
# 5645	15-Jan-1995	dg	Attempt to close a hole using splhigh/splx. There still appears to be a serious one in the same area that I don't have time to fix.
# 5484	10-Jan-1995	dg	MFS doesn't bother to associate a struct mount with the vnode...so work around this by not trying to cluster this type of I/O. Submitted by: John Dyson
# 5466	10-Jan-1995	dg	PG_FAKE is no longer used - so don't bother to clear it.
# 5464	10-Jan-1995	dg	Fixed some formatting weirdness that I overlooked in the previous commit.
# 5455	09-Jan-1995	dg	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a separate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# 3813	23-Oct-1994	dg	Only VM_WAIT if curproc != pageproc. A deadlock can occur otherwise. Submitted by: John Dyson
# 3688	18-Oct-1994	dg	Removed references to bclnlist which we don't use/support/need.
# 3374	05-Oct-1994	dg	Stuff object into v_vmdata rather than pager. Not important which at the moment, but will be in the future. Other changes mostly cosmetic, but are made for future VMIO considerations. Submitted by: John Dyson
# 3349	04-Oct-1994	dg	Commented out anti-paging code as it was found to be the cause of a buffer deadlock.
# 3098	25-Sep-1994	phk	While in the real world, I had a bad case of being swapped out for a lot of cycles. While waiting there I added a lot of the extra ()'s I have, (I have never used LISP to any extent). So I compiled the kernel with -Wall and shut up a lot of "suggest you add ()'s", removed a bunch of unused var's and added a couple of declarations here and there. Having a lap-top is highly recommended. My kernel still runs, yell at me if your kernel breaks.
# 2422	31-Aug-1994	dg	Rather than exclude bounce buffers support with NOBOUNCE, include it with BOUNCE_BUFFERS. This is more intuitive, and is better for future multiplatform support. Added BOUNCE_BUFFERS option to the GENERIC and LINT kernel config files.
# 2411	30-Aug-1994	dg	Changed to reclaim memory from other buffers to eliminate memory thrashing. Submitted by: John Dyson
# 2112	18-Aug-1994	wollman	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 1952	08-Aug-1994	wollman	Run-time configuration of VFS update interval. Old UPDATE_INTERVAL configuration option is no longer supported.
# 1896	07-Aug-1994	dg	Made pmap_kenter "TLB safe". ...and then removed all the pmap_updates that are no longer needed because of this.
# 1887	06-Aug-1994	dg	Incorporated post 1.1.5 work from John Dyson. This includes performance improvements via the new routines pmap_qenter/pmap_qremove and pmap_kenter/pmap_kremove. These routines allow fast mapping of pages for those architectures that have "normal" MMUs. Also included is a fix to the pageout daemon to properly check a queue end condition. Submitted by: John Dyson
# 1836	04-Aug-1994	dg	Fixed bug that would cause free memory reserves to be depleted and cause a panic in some cases. Submitted by: John Dyson
# 1817	02-Aug-1994	dg	Added $Id$
# 1564	26-May-1994	dg	Moved header definitions to buf.h, and added missing splx() - found by Johannes Helander.
# 1549	25-May-1994	rgrimes	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# 1542	24-May-1994	rgrimes	This commit was generated by cvs2svn to compensate for changes in r1541, which included commits to RCS files with non-trunk default branches.
# 1541	24-May-1994	rgrimes	BSD 4.4 Lite Kernel Sources