History log of /freebsd-current/sys/kern/vfs_subr.c
Revision	Date	Author	Comments
# 9530182e	25-Dec-2023	Jason A. Harmening <jah@FreeBSD.org>	VFS: update VOP_FSYNC() debug check to reflect actual locking policy Shared vs. exclusive locking is determined not by MNT_EXTENDED_SHARED but by MNT_SHARED_WRITES (although there are several places that ignore this and simply always use an exclusive lock). Also add a comment on the possible difference between VOP_GETWRITEMOUNT(vp) and vp->v_mount on this path. Found by local testing of unionfs atop ZFS with DEBUG_VFS_LOCKS. MFC after: 2 weeks Reviewed by: kib, olce Differential Revision: https://reviews.freebsd.org/D43816
# b068bb09	07-Jan-2024	Konstantin Belousov <kib@FreeBSD.org>	Add vnode_pager_clean_{a,}sync(9) Bump __FreeBSD_version for ZFS use. Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D43356
# 2d33ad48	31-Dec-2023	Konstantin Belousov <kib@FreeBSD.org>	vtruncbuf: improve the check for meta buffer Revision e99215a614675 reorganized the code in vtruncbuf(), and moved the logic to flush meta buffers into a dedicated loop. While doing it, the condition was changed from bp->b_lblkno < 0 (to handle) into bp->b_lblkno > 0 (to skip), which causes the buffer at lblkno 0 to be flushed needlessly. Reviewed by: chs, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D43261
# 4c41d10f	31-Dec-2023	Konstantin Belousov <kib@FreeBSD.org>	vtruncbuf: add a comment explaining the purpose of the loop Reviewed by: chs, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D43261
# 27f4eda3	04-Jan-2024	Mark Johnston <markj@FreeBSD.org>	vfs: Simplify vrefact() refcount_acquire() returns the old value, just use that. No functional change intended. Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D43255
# 29363fb4	23-Nov-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove ancient SCCS tags. Remove ancient SCCS tags from the tree, automated scripting, with two minor fixups to keep things compiling. All the common forms in the tree were removed with a perl script. Sponsored by: Netflix
# 0c5cd045	01-Nov-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove majority of stale commentary about free list There has been no "free list" for a long time now. While here slightly tidy up affected comments in other ways. Note that the "free vnode" term is a misnomer at best and will also need to get sorted out.
# 3943698c	20-Oct-2023	Kirk McKusick <mckusick@FreeBSD.org>	Minor sysctl description cleanup. No functional change. Agreed-by: Mateusz Guzik
# 37544d97	12-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: convert recycles_count and recycles_free_count to mere u_long Only vnlru ever updates them. This also removes recycles_count updates from hand-rolled debug vnode recycling via sysctl. Sponsored by: Rubicon Communications, LLC ("Netgate")
# a92fc312	12-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: count recycles by vnlru and by vn_alloc separately Sponsored by: Rubicon Communications, LLC ("Netgate")
# bb679b0c	11-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: count calls to uma_reclaim in vnlru
# 281a9715	11-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add max_vnlru_free to the vfs.vnode.vnlru tree While here rename the var internally.
# 054f45e0	11-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: further speed up continuous free vnode recycle The primary bottleneck was vnode_list mtx, which got artificially worsened due to the following work done with the lock held: (1) the global heavily modified numvnodes counter was being read, inducing massive cache line ping pong; (2) should the value fit limits (which it normally did) there would be an avoidable write to vn_alloc_cyclecount, which is being read outside of the lock, once more inducing traffic. But if vn_alloc_cyclecount is 0, which it normally is even when facing vnode shortage, there is no need to check numvnodes nor set it to 0 again. Another problem was numvnodes adjustment (which made the locked read much worse). While it fundamentally does not scale as it is not distributed in any fashion, it was avoidably slow. When bumping over the vnode limit, it would be modified with atomics 3 times: inc + dec to backpedal in vn_alloc, then final inc in vn_alloc_hard. One can let some slop persist over calls to vnlru_free instead. In principle each thread in the system could get here and bump it, so a limit is put in place to keep things sane. Bench setup same as in prior commits: zfs, 20 separate directory trees each with 1 million files in total and 20 find(1) processes stating them in parallel (one per each tree). Total run time (in seconds) goes down as follows: with vnode limit 8388608, before ~20, after ~8; with vnode limit 400000, before ~35, after ~15. With this in place the primary bottleneck is now ZFS. Sponsored by: Rubicon Communications, LLC ("Netgate")
# a4f753e8	11-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: don't recycle transiently excess vnodes Sponsored by: Rubicon Communications, LLC ("Netgate")
# 90a008e9	14-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: prefix regular vnlru with a special case for free vnodes Works around severe performance problems in certain corner cases, see the commentary added. Modifying vnlru logic has proven rather error prone in the past and a release is near, thus take the easy way out and fix it without having to dig into the current machinery.
# 23ef25d2	10-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: consult freevnodes in vnlru_kick_cond If the count is high enough there is no point trying to produce more. Not going there reduces traffic on the vnode_list mtx. This further shaves total real time in a test mentioned in: 74be676d87745eb7 ("vfs: drop one vnode list lock trip during vnlru free recycle") -- 20 instances of find each creating 1 million vnodes, while total limit is set to 400k. Time goes down from ~41 to ~35 seconds. Sponsored by: Rubicon Communications, LLC ("Netgate")
# 1bf55a73	10-Oct-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: be less eager to call uma_reclaim(UMA_RECLAIM_DRAIN) In the face of a vnode shortage the count can very easily go a few units above the limit before going back down. Calling uma_reclaim results in a massive amount of work which in this case is not warranted. Sponsored by: Rubicon Communications, LLC ("Netgate")
# 8733bc27	14-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: don't provoke recycling non-free vnodes without a good reason If the total number of free vnodes is at or above target, there is no point creating more of them. Tested by: pho (in a bigger patch)
# 9080190b	16-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: count how many times vnlru got woken up due to vnode shortage
# ef89b78b	16-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: stabilize freevnodes_old In face of parallel callers.
# 509d843a	16-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: s/u_long vstir/bool vstir/
# d3e64789	15-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: group vnode-related sysctls under vfs.vnode Instead of having things scattered through vfs, debug and kern trees. Old names remain for compatibility. Sample output of "sysctl vfs.vnode": vfs.vnode.vnlru.failed_runs: 0; vfs.vnode.vnlru.recycles_free: 0; vfs.vnode.vnlru.recycles: 0; vfs.vnode.stats.alloc_sleeps: 0; vfs.vnode.stats.free: 1310; vfs.vnode.stats.skipped_requeues: 0; vfs.vnode.stats.created: 1686; vfs.vnode.stats.count: 1641; vfs.vnode.param.wantfree: 2097152; vfs.vnode.param.limit: 8388608
# 2a689cad	16-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire kern.minvnodes It was marked as legacy in 2005.
# 03bfee17	14-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use vnlru_read_freevnodes for the freevnodes sysctl For a more accurate result.
# ba5dc166	14-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire vnlru_under_unlocked It only looks at the centralized value which in corner cases can end up being negative.
# 9dc0c983	14-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix stale comment about freevnodes management
# 76f11537	14-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: don't kick vnlru if it is already running Further shaves some lock trips.
# 74be676d	14-Sep-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop one vnode list lock trip during vnlru free recycle vnlru_free_impl would take the lock prior to returning even though most frequent caller does not need it. Unsurprisingly vnode_list mtx is the primary bottleneck when recycling and avoiding the useless lock trip helps. Setting maxvnodes to 400000 and running 20 parallel finds each with a dedicated directory tree of 1 million vnodes in total: before: 4.50s user 1225.71s system 1979% cpu 1:02.14 total; after: 4.20s user 806.23s system 1973% cpu 41.059 total. That's a 34% reduction in total real time. With this the lock remains the primary bottleneck when running on ZFS.
# 712806fc	24-Aug-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retried++ -> retried = true for the boolean No real changes. Noted by: rpokala
# c1d85ac3	23-Aug-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: try harder to find free vnodes when recycling The free vnode marker can slide past eligible entries. Artificially reducing vnode limit to 300k and spawning 104 workers each creating a million files results in all of them trying to recycle, which often fails when it should not have to. Because of the excessive traffic in this scenario, the trylock to requeue is virtually guaranteed to fail, meaning nothing gets pushed forward. Since no vnodes were found, the most unfortunate sleep for 1 second is induced (see vn_alloc_hard, the "vlruwk" msleep). Without the fix the machine is mostly idle with almost everyone stuck off CPU waiting for the sleep to finish. With the fix it is busy creating files. Unrelated to the above problem the marker could have landed in a similarly problematic spot because of any failure in vtryrecycle. Originally reported as poudriere builders stalling in a vnode-count restricted setup. Fixes: 138a5dafba31 ("vfs: trylock vnode requeue") Reported by: Mark Millard
# 64e881f2	18-Aug-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: track how many times vn_alloc blocked on hitting the vnode limit
# 685dc743	16-Aug-2023	Warner Losh <imp@FreeBSD.org>	sys: Remove $FreeBSD$: one-line .c pattern Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
# 9c3bfe2a	10-Jul-2023	Konstantin Belousov <kib@FreeBSD.org>	Revert "VFS: Remove VV_READLINK flag" and "fdescfs: improve linrdlnk mount option" This reverts commits 4a402dfe0bc44770c9eac6e58a501e4805e29413 and 3bffa2262328e4ff1737516f176107f607e7bc76. The fix will be implemented in somewhat different manner. The semantic adjustment is incompatible with linuxolator expectations. Reported and reviewed by: dchagin Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D40969
# ba8cc6d7	12-Mar-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use __enum_uint8 for vtype and vstate This whacks hackery around only reading v_type once. Bump __FreeBSD_version to 1400093
# 4a402dfe	21-Jun-2023	Konstantin Belousov <kib@FreeBSD.org>	VFS: Remove VV_READLINK flag since its only reason to exist is removed. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D40700
# 2544b8e0	28-Apr-2023	Olivier Certner <olce.freebsd@certner.fr>	vfs: Rename vfs_emptydir() to vn_dir_check_empty() No functional change. While here, adapt comments to style(9). Reviewed by: kib MFC after: 1 week
# 6450e7bb	22-Apr-2023	Olivier Certner <olce.freebsd@certner.fr>	vfs: Fix "emptydir" mount option Fix vfs_emptydir(). It would consider directories containing directories with name of the form 'X.' (X being any authorized byte) as empty. Also, it would cause VOP_READDIR() to return an error on directories containing enough whiteouts. While here, use a more decently sized buffer as done elsewhere. Remove ad-hoc iteration on the directory's content and instead use the newly exported vn_dir_next_dirent() function (this is what fixes the second problem mentioned above). PR: 270988 Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D39775
# 7aeea73e	16-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	syncer vnode: add VOP_GETWRITEMOUNT() definition explicitly Since syncer vnode vector does not provide a fallback to the default one, its VOP_GETWRITEMOUNT() implementation implicitly returned EOPNOTSUPP, which means that syncer ignored suspension. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# d8a09662	15-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	sync_vnode(): add assert to check vn_start_write() correctness Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# c53e990b	10-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	DEBUG_VFS_LOCKS: restore diagnostic for the witness use case Reviewed by: jah, markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39477
# 7b6fe242	08-Apr-2023	Konstantin Belousov <kib@FreeBSD.org>	DEBUG_VFS_LOCKS: use witness if available The assert_vop_locked messages are ignored, and file/line information is not too useful. Fixing this without changing both witness and VFS asserts KPIs is not possible. Reviewed by: markj (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D39464
# 02e6e8d2	07-Apr-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: extend vn_printf with vop vector
# 26b96487	07-Apr-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: more informative panic for missing fplookup ops
# f87a9f51	05-Apr-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: validate that a mount point with FPLOOKUP has vop_fplookup ops
# e237e2ba	03-Nov-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: only allow doomed vnodes to return EOPNOTSUPP for fplookup vops This helps asserting that they are provided by filesystems indicating they do it.
# 0baef43e	06-Apr-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add missing vop_fplookup ops to syncer
# 8495fa49	06-Apr-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: whack spurious comments from syncer's vop_vector
# 138a5daf	20-Mar-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: trylock vnode requeue The quasi-LRU still gets in the way for example when doing an incremental bzImage build, with vnode_list lock being at the top of the profile. Further damage control the problem by trylocking. Note the entire mechanism desperately wants to be reaped out in favor of something(tm) which both scales in a multicore setting and provides sensible replacement policy. With this change everything vfs almost disappears from the on CPU flamegraph, what is left is tons of contention in the VM.
# 245767c2	25-Mar-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: flip deferred_inact to atomic Turns out it is very rarely triggered, making a per-cpu counter a waste. Examples from real life boxes (uptime: counter): 135 days: 847; 138 days: 2190; 141 days: 1
# e5eb1d29	25-Mar-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: replace some spelled out VNASSERTs with VNPASS nfc
# b5d43972	21-Mar-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: decouple freevnodes from vnode batching In principle one cpu can keep vholding vnodes, while another vdrops them. In this case it may be the local count will keep growing in an unbounded manner. Roll it up after a threshold instead. While here move it out of dpcpu into struct pcpu. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D39195
# 62a573d9	16-Mar-2023	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire KERN_VNODE It got disabled in 2003 by commit acb18acfec97aa7fe26ff48f80a5c3f89c9b542d (Author: Poul-Henning Kamp <phk@FreeBSD.org>, Date: Sun Feb 23 18:09:05 2003 +0000): "Bracket the kern.vnode sysctl in #ifdef notyet because it results in massive locking issues on diskless systems. It is also not clear that this sysctl is non-dangerous in its requirements for locked down memory on large RAM systems." There does not seem to be practical use for it and the disabled routine does not work anyway. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D39127
# dc9b1373	09-Feb-2023	Mitchell Horne <mhorne@FreeBSD.org>	Use maybe_yield() in a few more places Reviewed by: kib, markj MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D38186
# 456f0575	19-Jan-2023	Konstantin Belousov <kib@FreeBSD.org>	Handle int rank issues in vn_getsize_locked() and vn_seek() In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into off_t size, we must avoid overflow. Then, the check for fsize < 0, introduced in the commit f45feecfb27ca51067d6789eaa43547cadc4990b 'vfs: add vn_getsize', is nop [1]. Reported and reviewed by: jhb Coverity CID: 1502346 Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38133
# 0f80d5eb	15-Jan-2023	Konstantin Belousov <kib@FreeBSD.org>	Require INVARIANTS and WITNESS if DEBUG_VFS_LOCKS is set Reported by: pho Reviewed by: markj, mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D38070
# f45feecf	22-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add vn_getsize getattr is very expensive and in important cases only gets called to get the size. This can be optimized with a dedicated routine which obtains that statistic. As a step towards that goal make size-only consumers use a dedicated routine. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D37885
# 829f0bcb	19-Dec-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add the concept of vnode state transitions To quote from a comment above vput_final: <quote> * XXX Some filesystems pass in an exclusively locked vnode and strongly depend * on the lock being held all the way until VOP_INACTIVE. This in particular * happens with UFS which adds half-constructed vnodes to the hash, where they * can be found by other code. </quote> As is there is no mechanism which allows filesystems to denote that a vnode is fully initialized, consequently problems like the above are only found the hard way(tm). Add rudimentary support for state transitions, which in particular allow to assert the vnode is not legally unlocked until its fate is decided (either construction finishes or vgone is called to abort it). The new field lands in a 1-byte hole, thus it does not grow the struct. Bump __FreeBSD_version to 1400077 Reviewed by: kib (previous version) Tested by: pho Differential Revision: https://reviews.freebsd.org/D37759
# 94267fc9	22-Dec-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use designated initializers for the typename array While here prefix with v for better consistency with the vnode stuff. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D37759
# 85dac03e	17-Nov-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: stop using NDFREE It provides nothing but a branchfest and next to no consumers want it anyway. Tested by: pho
# a134a12b	17-Nov-2022	Eric van Gyzen <vangyzen@FreeBSD.org>	Mark the debug.vnlru_nowhere sysctl as CTLFLAG_STATS The kernel doesn't read it. It's only writable so it can be cleared. Sponsored by: Dell EMC Isilon
# 4390622c	19-Oct-2022	Jason A. Harmening <jah@FreeBSD.org>	vfs_busy(): fix wording in comment Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35054
# 080ef8a4	04-Aug-2022	Jason A. Harmening <jah@FreeBSD.org>	Add VV_CROSSLOCK vnode flag to avoid cross-mount lookup LOR When a lookup operation crosses into a new mountpoint, the mountpoint must first be busied before the root vnode can be locked. When a filesystem is unmounted, the vnode covered by the mountpoint must first be locked, and then the busy count for the mountpoint drained. Ordinarily, these two operations work fine if executed concurrently, but with a stacked filesystem the root vnode may in fact use the same lock as the covered vnode. By design, this will always be the case for unionfs (with either the upper or lower root vnode depending on mount options), and can also be the case for nullfs if the target and mount point are the same (which admittedly is very unlikely in practice). In this case, we have LOR. The lookup path holds the mountpoint busy while waiting on what is effectively the covered vnode lock, while a concurrent unmount holds the covered vnode lock and waits for the mountpoint's busy count to drain. Attempt to resolve this LOR by allowing the stacked filesystem to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary. Upon observing this flag, the vfs_lookup() will leave the covered vnode lock held while crossing into the mountpoint. Employ this flag for unionfs with the caveat that it can't be used for '-o below' mounts until other unionfs locking issues are resolved. Reported by: pho Tested by: pho Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D35054
# d346e3ac	26-Oct-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use cache_assert_no_entries instead of open-coding it
# a3ab1102	12-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: silence a bogus LOR in freevnode Reported by: imp
# 5b5b7e2c	17-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: always retain path buffer after lookup This removes some of the complexity needed to maintain HASBUF and allows for removing injecting SAVENAME by filesystems. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D36542
# d04c7f10	14-Sep-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: make delmntque return with the interlock held saves on relocking dance -- the lock is taken immediately afterwards anyway.
# c84c5e00	18-Jul-2022	Mitchell Horne <mhorne@FreeBSD.org>	ddb: annotate some commands with DB_CMD_MEMSAFE This is not completely exhaustive, but covers a large majority of commands in the tree. Reviewed by: markj Sponsored by: Juniper Networks, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D35583
# c6487446	11-Apr-2022	Dmitry Chagin <dchagin@FreeBSD.org>	getdirentries: return ENOENT for unlinked but still open directory. To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”). Reviewed by: mjg, Pau Amma (doc) Differential revision: https://reviews.freebsd.org/D34680 MFC after: 2 weeks
# 768f9b8b	09-Apr-2022	Gordon Bergling <gbe@FreeBSD.org>	kern: Fix a typo in a source code comment - s/is is/is/ MFC after: 3 days
# 2533b5dc	27-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add missing bits to vdropl_impl This completes the patch which was originally meant to go in. Spotted by: mhorne Fixes: c35ec1efdcb2978b ("vfs: [1/2] fix stalls in vnode reclaim by not requeieing from vnlru")
# eb574ba0	19-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: replace VFS_NOTIFY_UPPER_* macros with an enum
# cceb91b0	18-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add missing flags to db show mount
# 93a0ba8f	17-Sep-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire the no longer used MNTK_LOOKUP_EXCL_DOTDOT flag Reviewed by: markj Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D34466
# 1cb0045c	07-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add MNTK_UNLOCKED_INSMNTQUE Can be used when the fs at hand can synchronize insmntque with other means than the vnode lock. Reviewed by: markj Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D34466
# 3a4c5dab	07-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: [2/2] fix stalls in vnode reclaim by only counting attempts ... and ignoring if they succeeded, which matches historical behavior. Reported by: pho
# c35ec1ef	07-Mar-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: [1/2] fix stalls in vnode reclaim by not requeieing from vnlru Reported by: pho
# 9af41803	03-Mar-2022	John Baldwin <jhb@FreeBSD.org>	Use vnsz2log directly in assertion on its relation to sizeof(struct vnode). This reduces the size of diffs required to support different values of vnsz2log. In CheriBSD, kernels for CHERI architectures have vnodes larger than 512 bytes and require a value of 9. Reviewed by: mjg Obtained from: CheriBSD Sponsored by: University of Cambridge, Google, Inc. Differential Revision: https://reviews.freebsd.org/D34418
# 6aa246e6	12-Feb-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: convert vnsz2log to a macro
# 66c5fbca	27-Jan-2022	Konstantin Belousov <kib@FreeBSD.org>	insmntque1(): remove useless arguments Also remove once-used functions to clean up after failed insmntque1(), which were destructor callbacks in previous life. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D34071
# 3d68c4e1	21-Jan-2022	Konstantin Belousov <kib@FreeBSD.org>	syncer VOP_FSYNC(): unlock syncer vnode around call to VFS_SYNC() The lock is unnecessary since the mount point is busied, which prevents unmount and syncer vnode deallocation. Having the vnode locked causes innocent LoRs and complicates debugging. Also stop starting write accounting around it. Any caller of VOP_FSYNC() must do it already, and sync_vnode() does. Reported and tested by: pho Reviewed by: markj, mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D34072
# 2a7e4cf8	27-Jan-2022	Mateusz Guzik <mjg@FreeBSD.org>	Revert b58ca5df0bb7 ("vfs: remove the now unused insmntque1") I was somehow convinced that insmntque calls insmntque1 with a NULL destructor. Unfortunately this worked well enough to not immediately blow up in simple testing. Keep not using the destructor in previously patched filesystems though as it avoids unnecessary casts. Noted by: kib Reported by: pho
# b58ca5df	26-Jan-2022	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the now unused insmntque1 Bump __FreeBSD_version to 1400052.
# b214fcce	13-Dec-2021	Alan Somers <asomers@FreeBSD.org>	Change VOP_READDIR's cookies argument to a **uint64_t The cookies argument is only used by the NFS server. NFSv2 defines the cookie as 32 bits on the wire, but NFSv3 increased it to 64 bits. Our VOP_READDIR, however, has always defined it as u_long, which is 32 bits on some architectures. Change it to 64 bits on all architectures. This doesn't matter for any in-tree file systems, but it matters for some FUSE file systems that use 64-bit directory cookies. PR: 260375 Reviewed by: rmacklem Differential Revision: https://reviews.freebsd.org/D33404
# 4dd23ae1	10-Dec-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire MNTK_NOKNOTE and VV_NOKNOTE MNTK_NOKNOTE was introduced in 679985d03a64f5dfb4355538ae6e3b70f8347f38 (dated 2005), VV_NOKNOTE in 34cc826ae8999f454dd6cb9c77d17ce83b169f92 few months later. Neither was ever used by anything in the tree.
# 4dcdf398	17-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: replace the MNTK_TEXT_REFS flag with VIRF_TEXT_REF This allows us to stop maintaining the VI_TEXT_REF flag and consequently opens up fully lockless v_writecount adjustment. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D33127
# 7e1d3eef	25-Nov-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the unused thread argument from NDINIT* See b4a58fbf640409a1 ("vfs: remove cn_thread") Bump __FreeBSD_version to 1400043.
# d032cda0	31-Oct-2021	Konstantin Belousov <kib@FreeBSD.org>	DEBUG_VFS_LOCKS: stop excluding devfs and doomed vnode from asserts We do not require devvp vnode locked for metadata io. It is typically not needed indeed, since correctness of the file system using corresponding block device ensures that there are no incorrect or racy manipulations. But right now DEBUG_VFS_LOCKS option excludes both character device vnodes and completely destroyed (VBAD) vnodes from asserts. This is not too bad since WITNESS still ensures that we do not leak locks. On the other hand, asserts do not mean what they should, to the reader, and reliance on them being enforced might result in wrong code. Note that ASSERT_VOP_LOCKED() still silently accepts NULLVP, I think it is worth fixing as well, in the next round. In collaboration with: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D32761
# 47b248ac	03-Nov-2021	Konstantin Belousov <kib@FreeBSD.org>	Make locking assertions for VOP_FSYNC() and VOP_FDATASYNC() more correct For devfs vnodes, it is fine to not lock vnodes for VOP_FSYNC(). Otherwise vnode must be locked exclusively, except for MNT_SHARED_WRITES() where the shared lock is enough. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32761
# d1d675cb	01-Nov-2021	Konstantin Belousov <kib@FreeBSD.org>	freevnode(): lock the freeing vnode around destroy_vpollinfo() to satisfy locking requirements of knlist manipulations. Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D32761
# 03d5820f	12-Oct-2021	Mark Johnston <markj@FreeBSD.org>	mount: Check for !VDIR mount points before handling -o emptydir To implement -o emptydir, vfs_emptydir() checks that the passed directory is empty. This should be done after checking whether the vnode is of type VDIR, though, or vfs_emptydir() may end up calling VOP_READDIR on a non-directory. Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com Reviewed by: kib, imp MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32475
# 7259ca31	01-Oct-2021	Kyle Evans <kevans@FreeBSD.org>	fifos: delegate unhandled kqueue filters to underlying filesystem This gives the vfs layer a chance to provide handling for EVFILT_VNODE, for instance. Change pipe_specops to use the default vop_kqfilter to accommodate fifoops that don't specify the method (i.e. all in-tree). Based on a patch by Jan Kokemüller. PR: 225934 Reviewed by: kib, markj (both pre-KASSERT) Differential Revision: https://reviews.freebsd.org/D32271
# b4a58fbf	01-Oct-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove cn_thread It is always curthread. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D32453
# 5d8e32a6	18-Sep-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire VNODE_REFCOUNT_FENCE_* macros They are unused as of last year.
# 2bc16e8a	17-Jul-2021	Jason A. Harmening <jah@FreeBSD.org>	VFS: remove MNTK_MARKER We no longer allow upper filesystems to be unregistered from the base mount while vfs_notify_upper() or any other upper operation is pending. New upper mounts can still be registered during this period, but they will be added at the end of the upper mount tailq. We therefore no longer need to allocate marker nodes during vfs_notify_upper() to keep our place in the iteration. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016
# c746ed72	12-Jun-2021	Jason A. Harmening <jah@FreeBSD.org>	Allow stacked filesystems to be recursively unmounted In certain emergency cases such as media failure or removal, UFS will initiate a forced unmount in order to prevent dirty buffers from accumulating against the no-longer-usable filesystem. The presence of a stacked filesystem such as nullfs or unionfs above the UFS mount will prevent this forced unmount from succeeding. This change addresses the situation by allowing stacked filesystems to be recursively unmounted on a taskqueue thread when the MNT_RECURSE flag is specified to dounmount(). This call will block until all upper mounts have been removed unless the caller specifies the MNT_DEFERRED flag to indicate the base filesystem should also be unmounted from the taskqueue. To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs have been combined with the existing 'mnt_uppers' list used by nullfs and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper(). The format of the mnt_uppers list has also been changed to accommodate filesystems such as unionfs in which a given mount may be stacked atop more than one lower mount. Additionally, management of lower FS reclaim/unlink notifications has been split into a separate list managed by a separate set of KPIs, as registration of an upper FS no longer implies interest in these notifications. Reviewed by: kib, mckusick Tested by: pho Differential Revision: https://reviews.freebsd.org/D31016
# 59409cb9	17-May-2021	Jason A. Harmening <jah@FreeBSD.org>	Add a generic mechanism for preventing forced unmount This is aimed at preventing stacked filesystems like nullfs and unionfs from "losing" their lower mounts due to forced unmount. Otherwise, VFS operations that are passed through to the lower filesystem(s) may crash or otherwise cause unpredictable behavior. Introduce two new functions: vfs_pin_from_vp() and vfs_unpin(), which are intended to be called on the lower mount(s) when the stacked filesystem is mounted and unmounted, respectively. Much as registration in the mnt_uppers list previously did, pinning will prevent even forced unmount of the lower FS and will allow the stacked FS to freely operate on the lower mount either by direct use of the struct mount* or indirect use through a properly-referenced vnode's v_mount field. vfs_pin_from_vp() is modeled after vfs_ref_from_vp() in that it uses the mount interlock coupled with re-checking vp->v_mount to ensure that it will fail in the face of a pending unmount request, even if the concurrent unmount fully completes. Adopt these new functions in both nullfs and unionfs. Reviewed By: kib, markj Differential Revision: https://reviews.freebsd.org/D30401
# 27006229	30-May-2021	Konstantin Belousov <kib@FreeBSD.org>	vinvalbuf: do not panic if we were unable to flush dirty buffers Return EBUSY instead and let the caller handle the issue. For vgone()/vnode reclamation, the caller first does vinvalbuf(V_SAVE), which returns EBUSY in case dirty buffers were not flushed. Then the caller calls vinvalbuf(0) due to the non-zero return, which gets rid of all dirty buffers without dependencies. PR: 238565 Reviewed by: asomers, mckusick Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D30555
# 3cf75ca2	28-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire unused vn_seqc_write_begin_unheld*
# cf74b2be	22-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire the now unused vnlru_free routine
# f784da88	17-May-2021	Konstantin Belousov <kib@FreeBSD.org>	Move mnt_maxsymlinklen into appropriate fs mount data structures Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-MFC-Note: struct mount layout Differential revision: https://reviews.freebsd.org/D30325
# d713bf79	21-May-2021	Konstantin Belousov <kib@FreeBSD.org>	vn_need_pageq_flush(): simplify There is no need to own vnode interlock, since v_object is type stable and can only change to/from NULL, and no other checks in the function access fields protected by the interlock. Remove the need variable, the result of the test is directly usable as return value. Tested by: mav, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# cc6f46ac	14-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: refactor vdrop In particular move vunlazy into its own routine.
# 715fcc0d	14-May-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: change vn_freevnodes_* prefix to idiomatic vfs_freevnodes_*
# 9a2fac6b	16-May-2021	Kirk McKusick <mckusick@FreeBSD.org>	Fix handling of embedded symbolic links (and history lesson). The original filesystem release (4.2BSD) had no embedded symlinks. Historically symbolic links were just a different type of file, so the content of the symbolic link was contained in a single disk block fragment. We observed that most symbolic links were short enough that they could fit in the area of the inode that normally holds the block pointers. So we created embedded symlinks where the content of the link was held in the inode's pointer area thus avoiding the need to seek and read a data fragment and reducing the pressure on the block cache. At the time we had only UFS1 with 32-bit block pointers, so the test for a fastlink was: di_size < (NDADDR + NIADDR) * sizeof(daddr_t) (where daddr_t would be ufs1_daddr_t today). When embedded symlinks were added, a spare field in the superblock with a known zero value became fs_maxsymlinklen. New filesystems set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus filesystems that preceded this change always read from blocks (since fs->fs_maxsymlinklen == 0) and newer ones used embedded symlinks if they fit. Similarly symlinks created on pre-embedded symlink filesystems always spill into blocks while newer ones will embed if they fit. At the same time that the embedded symbolic links were added, the on-disk directory structure was changed splitting the former u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen. Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can be used to distinguish old directory formats. In retrospect that should have just been an added flag, but we did not realize we needed to know about that change until it was already in production. Code was split into ufs/ffs so that the log structured filesystem could use ufs functionality while doing its own disk layout. 
This meant that no ffs superblock fields could be used in the ufs code. Thus ffs superblock fields that were needed in ufs code had to be copied to fields in the mount structure. Since ufs_readlink needed to know if a link was embedded, fs_maxlinklen gets copied to mnt_maxsymlinklen. The kernel panic that gave rise to this fix was triggered when a disk error created an inode of type symlink with no allocated data blocks but a large size. When readlink was called the uiomove was attempted which segment faulted. static int ufs_readlink(ap) struct vop_readlink_args /* { struct vnode *a_vp; struct uio *a_uio; struct ucred *a_cred; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); doff_t isize; isize = ip->i_size; if ((isize < vp->v_mount->mnt_maxsymlinklen) || DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */ return (uiomove(SHORTLINK(ip), isize, ap->a_uio)); } return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred)); } The second part of the "if" statement that adds DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */ is problematic. It never appeared in BSD released by Berkeley because as noted above mnt_maxsymlinklen is 0 for old format filesystems, so will always fall through to the VOP_READ as it should. I had to dig back through `git blame' to find that Rodney Grimes added it as part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.'' He must have brought it across from an earlier FreeBSD. Unfortunately the source-control logs for FreeBSD up to the merger with the AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the agreement to let FreeBSD remain unencumbered, so I cannot pin-point where that line got added on the FreeBSD side. The one change needed here is that mnt_maxsymlinklen is declared as an `int' and should be changed to be `u_int64_t'. This discovery led us to check out the code that deletes symbolic links. 
Specifically if (vp->v_type == VLNK && (ip->i_size < vp->v_mount->mnt_maxsymlinklen || datablocks == 0)) { if (length != 0) panic("ffs_truncate: partial truncate of symlink"); bzero(SHORTLINK(ip), (u_int)ip->i_size); ip->i_size = 0; DIP_SET(ip, i_size, 0); UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE); if (needextclean) goto extclean; return (ffs_update(vp, waitforupdate)); } Here too our broken symlink inode with no data blocks allocated and a large size will segment fault as we are incorrectly using the test that we have no data blocks to decide that it is an embedded symbolic link and attempting to bzero past the end of the inode. The test for datablocks == 0 is unnecessary as the test for ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right thing in all cases. The test for datablocks == 0 was added by David Greenman in this commit: Author: David Greenman <dg@FreeBSD.org> Date: Tue Aug 2 13:51:05 1994 +0000 Completed (hopefully) the kernel support for old style "fastlinks". Notes: svn path=/head/; revision=1821 I am guessing that he likely earlier added the incorrect test in the ufs_readlink code. I asked David if he had any recollection of why he made this change. Amazingly, he still had a recollection of why he had made a one-line change more than twenty years ago. And unsurprisingly it was because he had been stuck between a rock and a hard place. FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code base. Prior to that, there were three years of development in all areas of the kernel, including the filesystem code, from the combined set of people including Bill Jolitz, Patchkit contributors, and FreeBSD Project members. The compatibility issue at hand was caused by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite changes David had to find a way to provide compatibility with both the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite. 
He felt that these changes would provide compatibility with both systems. In his words: ``My recollection is that the 'FASTLINKS' symlinks support in FreeBSD-1.x, as implemented by Curt Mayer, worked differently than 4.4BSD. He used a spare field in the inode to duplicately store the length. When the 4.4BSD-Lite merge was done, the optimized symlinks support for existing filesystems (those that were initialized in FreeBSD-1.x) were broken due to the FFS on-disk structure of 4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to restore the backward compatibility with FreeBSD-1.x filesystems. I think it was the best that could be done in the somewhat urgent circumstances of the post Berkeley-USL settlement. Also, regarding Rod's massive commit with little explanation, some context: John Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to the 386 platform in just 10 days. It was by far the most intense hacking effort of my life. In addition to the porting of tons of FreeBSD-1 code, I think we wrote more than 30,000 lines of new code in that time to deal with the missing pieces and architectural changes of 4.4BSD-Lite. We didn't make many notes along the way. There was a lot of pressure to get something out to the rest of the developer community as fast as possible, so detailed discrete commits didn't happen - it all came as a giant wad, which is why Rod's commit message was worded the way it was.'' Reported by: Chuck Silvers Tested by: Chuck Silvers History by: David Greenman Lawrence MFC after: 1 week Sponsored by: Netflix
# b261bb40	13-Apr-2021	Mark Johnston <markj@FreeBSD.org>	vfs: Add KASAN state transitions for vnodes vnodes are a bit special in that they may exist on per-CPU lists even while free. Add a KASAN-only destructor that poisons regions of each vnode that are not expected to be accessed after a free. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29459
# e9272225	17-Mar-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix vnlru marker handling for filtered/unfiltered cases The global list has a marker with an invariant that free vnodes are placed somewhere past that. A caller which performs filtering (like ZFS) can move said marker all the way to the end, across free vnodes which don't match. Then a caller which does not perform filtering will fail to find them. This makes vn_alloc_hard sleep for 1 second instead of reclaiming, resulting in significant stalls. Fix the problem by requiring an explicit marker by callers which do filtering. As a temporary measure extend vnlru_free to restart if it fails to reclaim anything. Big thanks go to the reporter for testing several iterations of the patch. Reported by: Yamagi <lists yamagi.org> Tested by: Yamagi <lists yamagi.org> Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D29324
# 44691b33	06-Mar-2021	Konstantin Belousov <kib@FreeBSD.org>	vlrureclaim: only skip vnode with resident pages if it owns the pages Nullfs vnode which shares vm_object and pages with the lower vnode should not be exempt from the reclaim just because lower vnode cached a lot. Their reclamation is actually very cheap and should be preferred over real fs vnodes, but this change is already useful. Reported and tested by: pho Reviewed by: mckusick Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D29178
# 0cc746f1	02-Mar-2021	Robert Wing <rew@FreeBSD.org>	filt_fsevent: only record interested events Respect filter-specific flags for the EVFILT_FS filter. When a kevent is registered with the EVFILT_FS filter, it is always triggered when an EVFILT_FS event occurs, regardless of the filter-specific flags used. Fix that. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D28974
# 2bfd8992	14-Feb-2021	Konstantin Belousov <kib@FreeBSD.org>	vnode: move write cluster support data to inodes. The data is only needed by filesystems that 1. use buffer cache 2. utilize clustering write support. Requested by: mjg Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D28679
# 662283b1	18-Feb-2021	Konstantin Belousov <kib@FreeBSD.org>	vn_printf: handle VI_FOPENING Noted by: mjg Sponsored by: The FreeBSD Foundation MFC after: 6 days Fixes: fa3bd463cee
# b59a8e63	30-Jan-2021	Konstantin Belousov <kib@FreeBSD.org>	Stop ignoring ERELOOKUP from VOP_INACTIVE() When possible, relock the vnode and retry inactivation. Only vunref() is required not to drop the vnode lock, so handle it specially by not retrying. This is a part of the efforts to ensure that unlinked not referenced vnode does not prevent inode from reusing. Reviewed by: chs, mckusick Tested by: pho MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# 739ecbcf	23-Jan-2021	Mateusz Guzik <mjg@FreeBSD.org>	cache: add symlink support to lockless lookup Reviewed by: kib (previous version) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D27488
# 33a195ba	03-Jan-2021	Mateusz Guzik <mjg@FreeBSD.org>	vfs: keep seqc unchanged as long as the vnode is accessible via SMR
# 82397d79	31-Dec-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: denote vnode being a mount point with VIRF_MOUNTPOINT Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27794
# 3e506a67	27-Dec-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add v_irflag accessors Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D27793
# 0c23d262	04-Dec-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: keep bad ops on vnode reclaim They were only modified to accommodate a redundant assertion. This runs into problems as lockless lookup can still try to use the vnode and crash instead of getting an error. The bug was only present in kernels with INVARIANTS. Reported by: kevans
# 5335f643	18-Nov-2020	John Baldwin <jhb@FreeBSD.org>	Fix a few nits in vn_printf(). - Mask out recently added VV_* bits to avoid printing them twice. - Keep VI_LOCKed on the same line as the rest of the flags. Reviewed by: kib Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D27261
# 441eb16a	13-Nov-2020	Konstantin Belousov <kib@FreeBSD.org>	Allow some VOPs to return ERELOOKUP to indicate VFS operation restart at top level. Restart syscalls and some sync operations when the filesystem indicates an ERELOOKUP condition, mostly for VOPs operating on metadata. In particular, lookup results cached in the inode/v_data are no longer valid and need recalculating. Right now this should be a nop. Assert that ERELOOKUP is caught everywhere and not returned to userspace, by asserting that td_errno != ERELOOKUP on syscall return path. In collaboration with: pho Reviewed by: mckusick (previous version), markj Tested by: markj (syzkaller), pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26136
# f6dd1aef	09-Nov-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: group mount per-cpu vars into one struct While here move frequently read stuff into the same cacheline. This shrinks struct mount by 64 bytes. Tested by: pho
# e90afaa0	08-Nov-2020	Mateusz Guzik <mjg@FreeBSD.org>	kqueue: save space by using only one func pointer for assertions
# 06855749	30-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: change vnode poll to just a malloc type The size is 120, close fit for 128 and rarely used. The infrequent use avoidably populates per-CPU caches and ends up with more memory.
# 11743b6e	27-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: tidy up vnlru_free Apart from cosmetic changes make sure to only decrease the recycled counter if vtryrecycle succeeded. Tested by: pho
# 68ac2b80	27-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix vnode reclaim races against getnewvnode All vnodes allocated by UMA are present on the global list used by vnlru. getnewvnode modifies the state of the vnode (most notably altering v_holdcnt) but never locks it. Moreover filesystems also modify it in arbitrary manners sometimes before taking the vnode lock or adding any other indicator that the vnode can be used. Picking up such a vnode by vnlru would be problematic. To that end there are 2 fixes: - vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the interlock and verifies that v_mount has been set. It is an invariant that the vnode lock is held by that point, providing the necessary serialisation against locking after vhold. - vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes, now makes sure to only transition the count 0->1 and newly allocated vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only transition VHOLD_NO_SMR->1 once more making the hold fail Tested by: pho
# 7cc17186	24-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix a race where reclaim vholds freed vnodes Reported by: pho Tested by: pho (previous version) Fixes: r366974 ("vfs: stop taking the interlock in vnode reclaim")
# 703f3faf	23-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: stop taking the interlock in vnode reclaim It no longer protects any of tested fields, keeping all the checks racy. While here make vtryrecycle drop the vnode on its own. Avoids an additional lock trip.
# c7520caa	22-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: prevent avoidable evictions on mkdir of existing directories mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST. The NOCACHE lookup will make the namecache unnecessarily evict the existing entry, and then fallback to the fs lookup routine eventually leading namei to return an error as the directory is already there. For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers fallbacks to the slowpath for concurrently executing lookups. Tested by: pho Discussed with: kib
# ab21ed17	20-Oct-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop the de facto curthread argument from VOP_INACTIVE
# c0baa3dc	19-Oct-2020	Konstantin Belousov <kib@FreeBSD.org>	vgonel(): avoid recursing into VOP_INACTIVE(). It is a common pattern for filesystems' VOP_INACTIVE() implementation to forcibly reclaim the vnode when its state is final. For instance, UFS vnode with zero link count is removed, and since it is inactivated, the last open reference on it is dropped. On the other hand, vnode might get spurious usecount reference for many reasons. If the spurious reference exists while vgonel() checks for active state of the vnode, it would recurse into VOP_INACTIVE(). Fix it by checking and not doing inactivation when vgone() was called from inactive VOP. Reported and tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 3c484f32	15-Sep-2020	Konstantin Belousov <kib@FreeBSD.org>	Convert page cache read to VOP. There are several negative side-effects of not calling into VOP layer at all for page cache reads. The biggest is the missed activation of EVFILT_READ knotes. Also, it allows filesystem to make more fine grained decision to refuse read from page cache. Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for asserts. Reviewed by: markj Tested by: pho Discussed with: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D26346
# 2bcfa5ba	08-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop a write-only var in vfs_periodic_msync_inactive
# b1a824b6	02-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire vholdl as a symbol Similarly to vrefl in r364283.
# 2b4632ae	02-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: purge cache entries early on vgone There is no reason for them to linger across reclaim and it is an invariant that doomed vnodes are not added to the namecache.
# 6fed89b1	01-Sep-2020	Mateusz Guzik <mjg@FreeBSD.org>	kern: clean up empty lines in .c and .h files
# a459a6cf	25-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: respect PRIV_VFS_LOOKUP in vaccess_smr Reported by: novel
# 9ce9158b	23-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: support denying access in vaccess_vexec_smr
# ba3b0991	23-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: factor away doomed vnode handling into vdropl_final
# 2ca83b5c	23-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: mark freevnode as noinline
# 19337211	21-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix freevnode accounting Most notably add the missing decrement to vhold_smr. .-'---`-. ,' `. \| \ \| \ \ _ \ ,\ _ ,'-,/-)\ ( * \ \,' ,' ,'-) `._,) -',-') \/ ''/ ) / / / ,'-' Reported by: Dan Nelson <dnelson_1901@yahoo.com> Fixes: r362827 ("vfs: protect vnodes with smr")
# 8f226f4c	19-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the always-curthread td argument from VOP_RECLAIM
# 7ad2a82d	18-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error Most consumers pass NULL.
# fbca789f	16-Aug-2020	Konstantin Belousov <kib@FreeBSD.org>	VMIO read If possible, i.e. if the requested range is resident valid in the vm object queue, and some secondary conditions hold, copy data for read(2) directly from the valid cached pages, avoiding vnode lock and instantiating buffers. I intentionally do not start read-ahead, nor handle the advises on the cached range. Filesystems indicate support for VMIO reads by setting VIRF_PGREAD flag, which must not be cleared until vnode reclamation. Currently only filesystems that use vnode pager for v_objects can enable it, due to reliance on vnp_size. There is a WIP to handle it for tmpfs. Reviewed by: markj Discussed with: jeff Tested by: pho Benchmarked by: mjg Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D25968
# 60414088	16-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: retire vrefl as a symbol vrefl calls vref and there is only one in-tree consumer. Keep it as a macro for assertion purposes.
# a92a971b	16-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the thread argument from vget It was already asserted to be curthread. Semantic patch: @@ expression arg1, arg2, arg3; @@ - vget(arg1, arg2, arg3) + vget(arg1, arg2)
# 36f47512	11-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: inline vrefcnt
# 4c2d103a	11-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: garbage collect vrefactn
# 6883f07e	11-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: reimplement vref on top of vget No change in generated assembly.
# 3b444436	11-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	devfs: rework si_usecount to track opens This removes a lot of special casing from the VFS layer. Reviewed by: kib (previous version) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D25612
# 7f700801	10-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: disallow NOCACHE with LOOKUP This means there is no expectation lookup will purge the terminal entry, which simplifies lockless lookup. Tested by: pho Sponsored by: The FreeBSD Foundation
# 1ff80a34	07-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: release the interlock after failing to set VHOLD_NO_SMR While here add more comments. Diagnosed by: markj Reported by: pho Fixes: r362827 ("vfs: protect vnodes with smr")
# 0ffec1b0	06-Aug-2020	Mark Johnston <markj@FreeBSD.org>	Clean up reassignbuf() and buf_vlist_remove() a bit. - Convert panic() calls to INVARIANTS-only assertions. The PCTRIE code provides some of the same protection since it will panic upon an attempt to remove a non-resident buffer. - Update the comment above reassignbuf() to reflect reality. Reviewed by: cem, kib, mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25965
# 7013797e	06-Aug-2020	Mark Johnston <markj@FreeBSD.org>	Remove the vfs.reassignbufcalls counter and sysctl. As the 20-year old comment above it suggests, the counter is of dubious value. Moreover, the (global) counter was not updated precisely and hurts scalability. Reviewed by: cem, kib, mjg MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D25965
# d292b194	05-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the obsolete privused argument from vaccess This brings argument count down to 6, which is passable without the stack on amd64.
# db99ec56	04-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: support lockless dotdot lookup Tested by: pho
# 6e10434c	04-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	cache: add cache_purge_vgone cache_purge locklessly checks whether the vnode at hand has any namecache entries. This can race with a concurrent purge which managed to remove the last entry, but may not be done touching the vnode. Make sure we observe the relevant vnode lock as not taken before proceeding with vgone. Paired with the fact that doomed vnodes cannot receive entries this restores the invariant that there are no namecache-related writing users past cache_purge in vgone. Reported by: pho
# 838984de	02-Aug-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: move namecache initialisation into cache_vnode_init
# 848f8eff	30-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: inline vops if there are no pre/post associated calls This removes a level of indirection from frequently used methods, most notably VOP_LOCK1 and VOP_UNLOCK1. Tested by: pho
# 07d2145a	25-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add the infrastructure for lockless lookup Reviewed by: kib Tested by: pho (in a patchset) Differential Revision: https://reviews.freebsd.org/D25577
# 0379ff6a	25-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: introduce vnode sequence counters Modified on each permission change and link/unlink. Reviewed by: kib Tested by: pho (in a patchset) Differential Revision: https://reviews.freebsd.org/D25573
# 68ee1dda	24-Jul-2020	Conrad Meyer <cem@FreeBSD.org>	Add unlocked/SMR fast path to getblk() Convert the bufobj tries to an SMR zone/PCTRIE and add a gbincore_unlocked() API wrapping this functionality. Use it for a fast path in getblkx(), falling back to locked lookup if we raced a thread changing the buf's identity. Reported by: Attilio Reviewed by: kib, markj Testing: pho (in progress) Sponsored by: Isilon Differential Revision: https://reviews.freebsd.org/D25782
# 422f38d8	10-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix trivial whitespace issues which don't interfere with blame .. even without the -w switch
# 9b0c2e59	05-Jul-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: expand on vhold_smr comment
# f8022be3	30-Jun-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: protect vnodes with smr vget_prep_smr and vhold_smr can be used to ref a vnode while within vfs_smr section, allowing consumers to get away without locking. See vhold_smr and vdropl for comments explaining caveats. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23913
# 245bfd34	20-May-2020	Ryan Moeller <freqlabs@FreeBSD.org>	Deduplicate fsid comparisons Comparing fsid_t objects requires internal knowledge of the fsid structure and yet this is duplicated across a number of places in the code. Simplify by creating a fsidcmp function (macro). Reviewed by: mjg, rmacklem Approved by: mav (mentor) MFC after: 1 week Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D24749
# f15ccf88	06-Mar-2020	Chuck Silvers <chs@FreeBSD.org>	Add a new "mntfs" pseudo file system which provides private device vnodes for file systems to safely access their disk devices, and adapt FFS to use it. Also add a new BO_NOBUFS flag to allow enforcing that file systems using mntfs vnodes do not accidentally use the original devfs vnode to create buffers. Reviewed by: kib, mckusick Approved by: imp (mentor) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23787
# 2782c00c	22-Feb-2020	Ryan Libby <rlibby@FreeBSD.org>	vfs: quiet -Wwrite-strings Reviewed by: kib, markj Differential Revision: https://reviews.freebsd.org/D23797
# 6c5f36ff	19-Feb-2020	Jeff Roberson <jeff@FreeBSD.org>	Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in virtual address or physical page allocation need to be marked with this flag. Reviewed by: markj Tested by: pho Differential Revision: https://reviews.freebsd.org/D23712
# 3403d524	15-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix vlrureclaim ->v_object access The routine was checking for ->v_type == VBAD. Since vgone drops the interlock early and sets this type only at the end of the process of dooming a vnode, this opens a time window where it can clear the pointer while the interlock holder is accessing it. Another note is that the code was: (vp->v_object != NULL && vp->v_object->resident_page_count > trigger) With the compiler being fully allowed to emit another read to get the pointer, and in fact it did on the kernel used by pho. Use atomic_load_ptr and remember the result. Note that this depends on type-safety of vm_object. Reported by: pho
# c6150094	15-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: check early for VCHR in vput_final to short-circuit in the common case Otherwise the compiler inlines v_decr_devcount which keeps getting jumped over in the common case of not dealing with a device.
# df0d5a2a	14-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove no longer needed atomic_load_ptr casts
# 46022147	12-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: refactor vputx and add more comment Reviewed by: jeff (previous version) Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23530
# 123c5197	12-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit In particular on amd64 this eliminates an atomic op in the common case, trading it for IPIs in the uncommon case of catching CPUs executing the code while the filesystem is getting suspended or unmounted.
# 57349a4f	11-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix vhold race in mnt_vnode_next_lazy_relock vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the count has to be > 0. Reported by: pho Tested by: pho
# 2e57c8fd	10-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix device count leak on vrele racing with vgone The race is: CPU1 enters devfs_reclaim_vchr; CPU2 makes v_usecount 0; CPU1 takes VI_LOCK, sees v_usecount == 0 and makes no updates, sets vp->v_rdev = NULL, ..., and does VI_UNLOCK; CPU2 then takes VI_LOCK, calls v_decr_devcount, sees v_rdev == NULL and makes no updates. In this scenario si_devcount decrement is not performed. Note this can only happen if the vnode lock is not held. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23529
# cd951a0d	10-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix lock recursion in vrele vrele is supposed to be called with an unlocked vnode, but this was never asserted for if v_usecount was > 0. For such counts the lock is never touched by the routine. As a result the kernel has several consumers which expect vunref semantics and get away with calling vrele since they happen to never do it when this is the last reference (and for some of them this may happen to be a guarantee). Work around the problem by changing vrele semantics to tolerate being called with a lock. This eliminates a possible bug where the lock is already held and vputx takes it anyway. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23528
# 2f7f11b7	08-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: tidy up vget_finish and vn_lock - remove assertion which duplicates vn_lock - use VNPASS instead of retyping the failure - report what flags were passed if panicking on them
# 6a5abb1e	02-Feb-2020	Kyle Evans <kevans@FreeBSD.org>	Provide O_SEARCH O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping permissions checks on the directory itself after the initial open(). This is close to the semantics we've historically applied for O_EXEC on a directory, which is UB according to POSIX. Conveniently, O_SEARCH on a file is also explicitly undefined behavior according to POSIX, so O_EXEC would be a fine choice. The spec goes on to state that O_SEARCH and O_EXEC need not be distinct values, but they're not defined to be the same value. This was pointed out as an incompatibility with other systems that had made its way into libarchive, which had assumed that O_EXEC was an alias for O_SEARCH. This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a directory is checked in vn_open_vnode already, so for completeness we add a NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not re-check that when descending in namei. [0] https://pubs.opengroup.org/onlinepubs/9699919799/ Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23247
# 6698e11f	02-Feb-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the now empty vop_unlock_post
# 643656cf	31-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: replace VOP_MARKATIME with VOP_MMAPPED The routine is only provided by ufs and is only used on mmap and exec. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23422
# 21c4f104	31-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add vrefactn Differential Revision: https://reviews.freebsd.org/D23427
# 0f4d8b77	31-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: revert the overzealous assert added in r357285 to vgone The intent was to make it more likely to catch filesystems with custom need_inactive routines which fail to call vn_need_pageq_flush (or do an equivalent). One immediate case which is missed is vgone called by inactive itself. A better assertion may land later. The routine is not added to vputx because it is of no use to tmpfs et al. Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com
# 3ff65f71	30-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	Remove duplicated empty lines from kern/*.c No functional changes.
# c2ef6aa3	29-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: assert that doomed vnodes don't need to call vm_object_page_clean ... after the optional inactive processing.
# 07c6e2f4	29-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: unlazy before dooming the vnode With this change having the listmtx lock held postpones dooming the vnode. Use this fact to simplify iteration over the lazy list. It also allows filters to safely access ->v_data. Reviewed by: kib (early version) Differential Revision: https://reviews.freebsd.org/D23397
# 79674264	29-Jan-2020	Gleb Smirnoff <glebius@FreeBSD.org>	Fix text format definition for kern.maxvnodes, vfs.wantfreevnodes. This is a regression from r356642, r356645.
# 1513f803	26-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: do an unlocked check before iterating the lazy list For most filesystems it is expected to be empty most of the time.
# 6d69e665	25-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fix freevnodes count update race against preemption vdbatch_process leaves the critical section too early, opening a time window where another thread can get scheduled and modify vd->freevnodes. Once the preempted thread gets back it overrides the value with 0. Just move critical_exit to the end of the function.
# dc9a1cb6	25-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: predict vn_lock failure as unlikely in vget
# 28eb39a5	24-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: allow v_usecount to transition 0->1 without the interlock There is nothing to do but to bump the count even during said transition. There are 2 places which can do it: - vget only does this after locking the vnode, meaning there is no change in contract versus inactive or reclamation - vref only ever did it with the interlock held which did not protect against either (that is, it would always succeed) VCHR vnodes retain special casing due to the need to maintain dev use count. Reviewed by: jeff, kib Tested by: pho (previous version) Differential Revision: https://reviews.freebsd.org/D23185
# d93762b9	24-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: stop handling VI_OWEINACT in vget vget is almost always called with LK_SHARED, meaning the flag (if present) is almost guaranteed to get cleared. Stop handling it in the first place and instead let the thread which wanted to do inactive handle the bumped usecount. Reviewed by: jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D23184
# 74c4b7cc	24-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: stop unlocking the vnode upfront in vput Doing so runs into races with filesystems which make half-constructed vnodes visible to other users, while depending on the chain vput -> vinactive -> vrecycle to be executed without dropping the vnode lock. Impediments for making this work got cleared up (notably vop_unlock_post now does not do anything and lockmgr stops touching the lock after the final write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can now be eliminated. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23344
# 28479aaa	19-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: allow v_holdcnt to transition 0->1 without the interlock Since r356672 ("vfs: rework vnode list management") there is nothing to do apart from altering freevnodes count, but this much can be safely done based on the result of atomic_fetchadd. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D23186
# 512fa9a4	18-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: plug a conditional assignment of lo_name in getnewvnode It only matters for witness. No functional changes.
# 2d0c6202	17-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: distribute freevnodes counter per-cpu It gets rolled up to the global when deferred requeueing is performed. A dedicated read routine makes sure to return a value only off by a certain amount. This soothes a global serialisation point for all 0<->1 hold count transitions. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23235
# 1ad72b27	17-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: shorten lock hold time in vdbatch_process
# 66f67d5e	16-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: increment numvnodes without the vnode list lock unless under pressure The vnode list lock is only needed to reclaim free vnodes or kick the vnlru thread (or to block and not miss a wake up (but note the sleep has a timeout so this would not be a correctness issue)). Try to get away without the lock by just doing an atomic increment. The lock is contended e.g., during poudriere -j 104 where about half of all acquires come from vnode allocation code. Note the entire scheme needs a rewrite, the above just reduces its SMP impact. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D23140
# b7f50b9a	16-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: refactor vnode allocation Semantics are almost identical. Some code is deduplicated and there are fewer memory accesses. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D23158
# 875cfc08	16-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: reimplement vlrureclaim to actually use LRU Take advantage of global ordering introduced in r356672. Reviewed by: mckusick (previous version) Differential Revision: https://reviews.freebsd.org/D23067
# 0c236d3d	12-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: per-cpu batched requeuing of free vnodes Constant requeuing adds significant lock contention in certain workloads. Lessen the problem by batching it. Per-cpu areas are locked in order to synchronize against UMA freeing memory. vnode's v_mflag is converted to short to prevent the struct from growing. Sample result from an incremental make -s -j 104 bzImage on tmpfs: stock: 122.38s user 1780.45s system 6242% cpu 30.480 total patched: 144.84s user 985.90s system 4856% cpu 23.282 total Reviewed by: jeff Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D22998
# cc3593fb	12-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: rework vnode list management The current notion of an active vnode is eliminated. Vnodes transition between 0<->1 hold counts all the time and the associated traversal between different lists induces significant scalability problems in certain workloads. Introduce a global list containing all allocated vnodes. They get unlinked only when UMA reclaims memory and are only requeued when hold count reaches 0. Sample result from an incremental make -s -j 104 bzImage on tmpfs: stock: 118.55s user 3649.73s system 7479% cpu 50.382 total patched: 122.38s user 1780.45s system 6242% cpu 30.480 total Reviewed by: jeff Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D22997
# 57083d25	12-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add per-mount vnode lazy list and use it for deferred inactive + msync This obviates the need to scan the entire active list looking for vnodes of interest. msync is handled by adding all vnodes with write count to the lazy list. deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag. Vnodes get dequeued from the list when their hold count reaches 0. Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that spurious locking is avoided in the common case. Reviewed by: jeff Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D22995
# 879e0604	11-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	Add KERNEL_PANICKED macro for use in place of direct panicstr tests
# 91de98e6	11-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: only recalculate watermarks when limits are changing Previously they would get recalculated all the time, in particular in: getnewvnode -> vcheckspace -> vspace
# e6ae744e	11-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: deduplicate vnode allocation logic This creates a dedicated routine (vn_alloc) to allocate vnodes. As a side effect code duplication with getnewvnode_reserve is eliminated. Add vn_free for symmetry.
# b52d50cf	11-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: prealloc vnodes in getnewvnode_reserve Having a reserved vnode count does not guarantee that getnewvnodes won't block later. Said blocking partially defeats the purpose of reserving in the first place. Preallocate instead. The only consumer was always passing "1" as count and never nesting reservations.
# 69283067	11-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: incomplete pass at converting more ints to u_long Most notably numvnodes and freevnodes were u_long, but parameters used to govern them remained as ints.
# bf62296f	11-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add missing CTLFLAG_MPSAFE annotations This covers all kern/vfs_*.c files.
# a9a047bc	07-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: handle doomed vnodes in vdefer_inactive vgone dooms the vnode while keeping VI_OWEINACT set and then drops the interlock. vputx can pick up the interlock and pass it to vdefer_inactive since the flag is set. The race is harmless, just don't defer anything as vgone will take care of it. Reported by: pho
# c8b3463d	07-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT) The previous behavior of leaving VI_OWEINACT vnodes on the active list without a hold count is eliminated. Hold count is kept and inactive processing gets explicitly deferred by setting the VI_DEFINACT flag. The syncer is then responsible for vdrop. Reviewed by: kib (previous version) Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D23036
# b7cc9d18	07-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: trylock in vfs_msync and refactor the func - use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock - evaluate flags before looping over vnodes Reviewed by: kib Tested by: pho (in a larger patch, previous version) Differential Revision: https://reviews.freebsd.org/D23035
# c92fe112	07-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use a dedicated counter for free vnode recycling Otherwise vlrureclaim activity is mixed in and it is hard to tell which vnodes got reclaimed.
# cc2b586d	06-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: prevent numvnodes and freevnodes re-reads when appropriate Otherwise in code like this: if (numvnodes > desiredvnodes) vnlru_free_locked(numvnodes - desiredvnodes, NULL); numvnodes can drop below desiredvnodes prior to the call and if the compiler generated another read the subtraction would get a negative value.
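The re-read hazard described in cc2b586d can be illustrated with a minimal userspace sketch. It assumes C11 atomics rather than the kernel's own primitives; `excess_vnodes` and the local `rnumvnodes` are hypothetical names, and the two globals are stand-ins for the kernel variables of the same name. The point is only that the comparison and the subtraction use one snapshot of the counter.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for the kernel globals named in the commit message. */
static _Atomic unsigned long numvnodes;
static unsigned long desiredvnodes = 100;

/*
 * Return how many vnodes exceed the target. numvnodes is loaded exactly
 * once into a local, so a concurrent decrement between the comparison and
 * the subtraction cannot make the result wrap around.
 */
static unsigned long
excess_vnodes(void)
{
	unsigned long rnumvnodes;

	rnumvnodes = atomic_load_explicit(&numvnodes, memory_order_relaxed);
	if (rnumvnodes > desiredvnodes)
		return (rnumvnodes - desiredvnodes);
	return (0);
}
```

With a plain global read twice, the compiler is free to emit two loads; the explicit single load removes that freedom.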
# 37fe521a	06-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: annotate numvnodes and vnode_free_list_mtx with __exclusive_cache_line
# 478368ca	06-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: eliminate v_tag from struct vnode There was only one consumer and it was using it incorrectly. It is given an equivalent hack. Reviewed by: jeff Differential Revision: https://reviews.freebsd.org/D23037
# a91190c6	06-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add a helper for allocating marker vnodes
# 8dbc6352	04-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop thread argument from vinactive
# 867fd730	04-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: patch up vnode count assertions to report found value
# b249ce48	03-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop the mostly unused flags argument from VOP_UNLOCK Filesystems which want to use it in limited capacity can employ the VOP_UNLOCK_FLAGS macro. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D21427
# 57db0e12	01-Jan-2020	Mateusz Guzik <mjg@FreeBSD.org>	vfs: drop an always-false check from vlrureclaim The vnode gets held few lines prior, making the VI_FREE condition illegal.
# eb976461	27-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove production kernel checks and mp == NULL support from vdrop 1. The only place in the tree which calls getnewvnode with mp == NULL does it for vp_crossmp which will never execute this codepath. Any vnode which legally has ->v_mount == NULL is also doomed, which once more won't execute this code. 2. Remove an assertion for v_holdcnt from production kernels. It gets taken care of by refcount macros in debug kernels. Any code which would want to pass NULL mp can construct a fake one instead. Reviewed by: kib (previous version) Differential Revision: https://reviews.freebsd.org/D22722
# 6fa079fc	15-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: flatten vop vectors This eliminates the following loop from all VOP calls: while(vop != NULL && \ vop->vop_spare2 == NULL && vop->vop_bypass == NULL) vop = vop->vop_default; Reviewed by: jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D22738
# ff4486e8	09-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: refactor vhold and vdrop No functional changes.
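The idea behind flattening in 6fa079fc can be sketched in plain C. This is a simplified illustration, not the kernel code: `struct opvec`, the `OP_*` slots, and all function names are hypothetical, and the real loop also checks a bypass slot. Each empty slot is resolved through the default chain once, so later calls index the table directly instead of walking the chain.

```c
#include <stddef.h>

enum { OP_LOCK, OP_READ, OP_COUNT };	/* hypothetical operation slots */
typedef int (*op_fn)(void);

struct opvec {
	const struct opvec *def;	/* fallback vector, like vop_default */
	op_fn ops[OP_COUNT];		/* NULL means "inherit from def" */
};

/* Resolve one slot by walking the default chain (the per-call loop). */
static op_fn
opvec_lookup_slow(const struct opvec *v, int op)
{
	while (v != NULL && v->ops[op] == NULL)
		v = v->def;
	return (v != NULL ? v->ops[op] : NULL);
}

/* Fill every empty slot once so later calls can use ops[] directly. */
static void
opvec_flatten(struct opvec *v)
{
	for (int op = 0; op < OP_COUNT; op++)
		if (v->ops[op] == NULL)
			v->ops[op] = opvec_lookup_slow(v->def, op);
}

/* Sample vectors: fs_ops overrides read and inherits lock. */
static int default_lock(void) { return (1); }
static int default_read(void) { return (2); }
static int fs_read(void) { return (3); }

static const struct opvec default_ops = { NULL, { default_lock, default_read } };
static struct opvec fs_ops = { &default_ops, { NULL, fs_read } };
```

After `opvec_flatten(&fs_ops)`, `fs_ops.ops[OP_LOCK]` points at `default_lock` and a call no longer needs the chain walk.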
# abd80ddb	08-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: introduce v_irflag and make v_type smaller The current vnode layout is not smp-friendly by having frequently read data avoidably sharing cachelines with very frequently modified fields. In particular v_iflag inspected for VI_DOOMED can be found in the same line with v_usecount. Instead make it available in the same cacheline as the v_op, v_data and v_type which all get read all the time. v_type is avoidably 4 bytes while the necessary data will easily fit in 1. Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new flag field with a new value: VIRF_DOOMED. Reviewed by: kib, jeff Differential Revision: https://reviews.freebsd.org/D22715
# 791a24c7	08-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: clean up vputx a little 1. replace hand-rolled macros for operation type with enum 2. unlock the vnode in vput itself, there is no need to branch on it. existence of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT to locking request. 3. remove the useless v_usecount assertion. few lines above the checks if v_usecount > 0 and leaves. should the value be negative, refcount would fail. 4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the vnode with holdcnt > 1. if the like should exist, it should be moved there 5. no need to error = 0 for everyone Reviewed by: kib, jeff (previous version) Differential Revision: https://reviews.freebsd.org/D22718
# fd6e0c43	08-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: factor out vnode destruction out of vdrop Sponsored by: The FreeBSD Foundation
# 12e483e5	06-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: clean up delmntque similarly to vdrop r355414
# 4f4d9a08	06-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: catch vn_printf up with reality - add the missing VV_VMSIZEVNLOCK and VV_READLINK flags - add decoding v_mflag While here sort flags.
# 3eeb8a1f	05-Dec-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove 'active' variable from _vdrop No functional changes.
# d957f3a4	19-Nov-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: perform a more racy check in vfs_notify_upper Locking mp does not buy anything in terms of correctness and only contributes to contention.
# 1fccb43c	19-Nov-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: change si_usecount management to count used vnodes Currently si_usecount is effectively a sum of usecounts from all associated vnodes. This is maintained by special-casing for VCHR every time usecount is modified. Apart from complicating the code a little bit, it has a scalability impact since it forces a read from a cacheline shared with said count. There are no consumers of the feature in the ports tree. In head there are only 2: revoke and devfs_close. Both can get away with a weaker requirement than the exact usecount, namely just the count of active vnodes. Changing the meaning to the latter means we only need to modify it on 0<->1 transitions, avoiding the check plenty of times (and entirely in something like vrefact). Reviewed by: kib, jeff Tested by: pho Differential Revision: https://reviews.freebsd.org/D22202
# 67d0e293	29-Oct-2019	Jeff Roberson <jeff@FreeBSD.org>	Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY flag and use the same system. This enables further fault locking improvements by allowing more faults to proceed with a shared lock. Reviewed by: kib Tested by: pho Differential Revision: https://reviews.freebsd.org/D22116
# c92f1304	23-Oct-2019	Konstantin Belousov <kib@FreeBSD.org>	Fix undefined behavior. Create a sequence point by ending a full expression for call to vspace() and use of the globals which are modified by vspace(). Reported and reviewed by: imp Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D22126
# 8076c4e7	23-Oct-2019	Konstantin Belousov <kib@FreeBSD.org>	vn_printf(): Decode VI_TEXT_REF. Sponsored by: The FreeBSD Foundation MFC after: 3 days
# d1cbf3ee	13-Oct-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add MNTK_NOMSYNC On many filesystems the traversal is effectively a no-op. Add a way to avoid the overhead. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22009
# 737241cd	13-Oct-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: return free vnode batches in sync instead of vfs_msync It is a more natural fit. vfs_msync only deals with active vnodes. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D22008
# dc20b834	06-Oct-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add optional root vnode caching Root vnodes are looked up all the time, e.g. when crossing a mount point. Currently used routines always perform a costly lookup which can be trivially avoided. Reviewed by: jeff (previous version), kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21646
# e61e783b	04-Oct-2019	Eric van Gyzen <vangyzen@FreeBSD.org>	Add CTLFLAG_STATS to some vfs sysctl OIDs Add CTLFLAG_STATS to the following OIDs: vfs.altbufferflushes vfs.recursiveflushes vfs.barrierwrites vfs.flushwithdeps vfs.reassignbufcalls Refer to r353111. MFC after: 2 weeks Sponsored by: Dell EMC Isilon
# f91dd609	02-Oct-2019	Ed Maste <emaste@FreeBSD.org>	simplify path handling in sysctl_try_reclaim_vnode MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei verifies the presence of the NUL. Thus there is no need to increase the buffer size here. The sysctl passes the string excluding the NUL, so req->newlen equal to PATH_MAX is too long. Reviewed by: kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21876
# ba7a55d9	22-Sep-2019	Sean Eric Fagan <sef@FreeBSD.org>	Add two options to allow mount to avoid covering up existing mount points. The two options are * nocover/cover: Prevent/allow mounting over an existing root mountpoint. E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local is already a mountpoint. * emptydir/noemptydir: Prevent/allow mounting on a non-empty directory. E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail. Neither of these options is intended to be a default, for historical and compatibility reasons. Reviewed by: allanjude, kib Differential Revision: https://reviews.freebsd.org/D21458
# 4cace859	16-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: convert struct mount counters to per-cpu There are 3 counters modified all the time in this structure - one for keeping the structure alive, one for preventing unmount and one for tracking active writers. Exact values of these counters are very rarely needed, which makes them a prime candidate for conversion to a per-cpu scheme, resulting in much better performance. Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on a 104-way 2 socket Skylake system: before: 852393 ops/s after: 76682077 ops/s Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21637
# ee831b25	16-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: manage mnt_lockref with atomics See r352424. Reviewed by: kib, jeff Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21574
# a8c8e44b	16-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: manage mnt_ref with atomics New primitive is introduced to denote sections can operate locklessly on aspects of struct mount, but which can also be disabled if necessary. This provides an opportunity to start scaling common case modifications while providing stable state of the struct when facing unmount, write suspension or other events. mnt_ref is the first counter to start being managed in this manner with the intent to make it per-cpu. Reviewed by: kib, jeff Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21425
# ce3ba63f	13-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: release usecount using fetchadd 1. If we release the last usecount we take ownership of the hold count, which means the vnode will remain allocated until we vdrop it. 2. If someone else vrefs they will find no usecount and will proceed to add their own hold count. 3. No code has a problem with v_usecount transitioning to 0 without the interlock These facts combined mean we can fetchadd instead of having a cmpset loop. Reviewed by: kib (previous version) Tested by: pho Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21528
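The difference between a cmpset loop and fetchadd in ce3ba63f can be shown with a small userspace sketch. It assumes C11 atomics in place of the kernel's atomic(9) routines; `struct obj` and both function names are hypothetical. Both variants report whether the caller dropped the last reference, but the fetch-and-add form does it with a single unconditional atomic operation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a vnode's use count. */
struct obj {
	atomic_uint usecount;
};

/* Compare-and-swap loop: retries whenever another thread changes the count. */
static bool
release_cmpset(struct obj *o)
{
	unsigned int old = atomic_load(&o->usecount);

	while (!atomic_compare_exchange_weak(&o->usecount, &old, old - 1))
		;	/* on failure, old now holds the current value */
	return (old == 1);
}

/* Fetch-and-add: one atomic op that returns the value before the decrement. */
static bool
release_fetchadd(struct obj *o)
{
	return (atomic_fetch_sub(&o->usecount, 1) == 1);
}
```

The fetch-and-add form is only valid when, as the commit message argues, nothing needs to be checked or prevented before the count reaches zero.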
# 68c3c1ab	05-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: temporarily revert r351825 There are 2 problems: - it introduces a funny bug where it can end up trylocking the same vnode [1] - it exposes a pre-existing softdep deadlock [2] Both are easier to run into than the bug which got fixed, so revert until a complete solution is worked out. Reported by: cy [1], pho [2] Sponsored by: The FreeBSD Foundation
# c07d4a0a	04-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: fully hold vnodes in vnlru_free_locked Currently the code only bumps holdcnt and clears the VI_FREE flag, not performing actual vhold. Since the vnode is still visible elsewhere, a potential new user can find it and incorrectly assume it is properly held. Use vholdl instead to correctly hold the vnode. Another place recycling (vlrureclaim) does this already. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21522
# e3c3248c	03-Sep-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: implement usecount implying holdcnt vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting freed and usecount to denote it is actively used. Previously all operations bumping usecount would also bump holdcnt, which is not necessary. We can detect if usecount is already > 1 (in which case holdcnt is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves on atomic ops. Reviewed by: kib Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21471
# 08cfa56e	01-Sep-2019	Mark Johnston <markj@FreeBSD.org>	Extend uma_reclaim() to permit different reclamation targets. The page daemon periodically invokes uma_reclaim() to reclaim cached items from each zone when the system is under memory pressure. This is important since the size of these caches is unbounded by default. However it also results in bursts of high latency when allocating from heavily used zones as threads miss in the per-CPU caches and must access the keg in order to allocate new items. With r340405 we maintain an estimate of each zone's usage of its (per-NUMA domain) cache of full buckets. Start making use of this estimate to avoid reclaiming the entire cache when under memory pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only items in excess of the estimate are reclaimed. Draining a zone reclaims all of the cached full buckets (the previous behaviour of uma_reclaim()), and may further drain the per-CPU caches in extreme cases. Now, when under memory pressure, the page daemon will trim zones rather than draining them. As a result, heavily used zones do not incur bursts of bucket cache misses following reclamation, but large, unused caches will be reclaimed as before. Reviewed by: jeff Tested by: pho (an earlier version) MFC after: 2 months Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16667
# c2b600f9	30-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add a missing VNODE_REFCOUNT_FENCE_REL to v_incr_usecount_locked Sponsored by: The FreeBSD Foundation
# 3bb8d8d8	29-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: tidy up assertions in vfs_subr - assert unlocked vnode interlock in vref - assert right counts in vputx - print debug info for panic in vdrop Sponsored by: The FreeBSD Foundation
# 6470c8d3	29-Aug-2019	Konstantin Belousov <kib@FreeBSD.org>	Rework v_object lifecycle for vnodes. Current implementation of vnode_create_vobject() and vnode_destroy_vobject() is written so that it is prepared to handle the vm object destruction for live vnode. Practically, no filesystems use this, except for some remnants that were present in UFS till today. One of the consequences of that model is that each filesystem must call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as a result all of them get rid of the v_object in reclaim. Move the call to vnode_destroy_vobject() to vgonel() before VOP_RECLAIM(). This makes v_object stable: either the object is NULL, or it is valid vm object till the vnode reclamation. Remove code from vnode_create_vobject() to handle races with the parallel destruction. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D21412
# 1e2f0ceb	28-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add VOP_NEED_INACTIVE vnode usecount drops to 0 all the time (e.g. for directories during path lookup). When that happens the kernel would always lock the exclusive lock for the vnode in order to call vinactive(). This blocks other threads who want to use the vnode for lookup. vinactive is very rarely needed and can be tested for without the vnode lock held. This patch gives filesystems an opportunity to do it, sample total wait time for tmpfs over 500 minutes of poudriere -j 104: before: 557563641706 (lockmgr:tmpfs) after: 46309603301 (lockmgr:tmpfs) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21371
# 368cabbc	27-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: stop passing LK_INTERLOCK to VOP_UNLOCK The plan is to drop the flags argument. There is also a temporary bug now that nullfs ignores the flag. Reviewed by: kib Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21252
# 0256405e	24-Aug-2019	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add vholdnz (for already held vnodes) Reviewed by: kib (previous version) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21358
# e671edac	23-Aug-2019	Konstantin Belousov <kib@FreeBSD.org>	De-commission the MNTK_NOINSMNTQ kernel mount flag. After all the changes, its dynamic scope is same as for MNTK_UNMOUNT, but to allow the syncer vnode to be re-installed on unmount failure. But the case of syncer was already handled by using the VV_FORCEINSMQ flag for quite some time. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# cf27e0d1	19-Aug-2019	Jeff Roberson <jeff@FreeBSD.org>	Use an atomic reference count for paging in progress so that callers do not require the object lock. Reviewed by: markj Tested by: pho (as part of a larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21311
# e7d8ebc8	28-Jul-2019	Alan Somers <asomers@FreeBSD.org>	Better comments for vlrureclaim MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# 2240d8c4	27-Jul-2019	Alan Somers <asomers@FreeBSD.org>	Add v_inval_buf_range, like vtruncbuf but for a range of a file v_inval_buf_range invalidates all buffers within a certain LBA range of a file. It will be used by fusefs(5). This commit is a partial merge of r346162, r346606, and r346756 from projects/fuse2. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D21032
# d10b7578	06-Jun-2019	Alan Somers <asomers@FreeBSD.org>	[skip ci] Better comments for vlrureclaim Sponsored by: The FreeBSD Foundation
# 46f8169a	06-Jun-2019	Alan Somers <asomers@FreeBSD.org>	Add a testing facility to manually reclaim a vnode Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, it will be reclaimed, as long as it isn't already or doomed. The purpose is to gain test coverage for vnode reclamation, which is otherwise hard to achieve. Add the debug.ftry_reclaim_vnode sysctl. It does the same thing, except that its argument is a file descriptor instead of a pathname. Reviewed by: kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20519
# 65417f5e	24-May-2019	Alan Somers <asomers@FreeBSD.org>	Remove "struct ucred" argument from vtruncbuf vtruncbuf takes a "struct ucred" argument. AFAICT, it's been unused ever since that function was first added in r34611. Remove it. Also, remove some "struct ucred" arguments from fuse and nfs functions that were only used by vtruncbuf. Reviewed by: cem MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20377
# daec9284	21-May-2019	Conrad Meyer <cem@FreeBSD.org>	Include ktr.h in more compilation units Similar to r348026, exhaustive search for uses of CTRn() and cross reference ktr.h includes. Where it was obvious that an OS compat header of some kind included ktr.h indirectly, .c files were left alone. Some of these files clearly got ktr.h via header pollution in some scenarios, or tinderbox would not be passing prior to this revision, but go ahead and explicitly include it in files using it anyway. Like r348026, these CUs did not show up in tinderbox as missing the include. Reported by: peterj (arm64/mp_machdep.c) X-MFC-With: r347984 Sponsored by: Dell EMC Isilon
# 5422b0e6	19-May-2019	Konstantin Belousov <kib@FreeBSD.org>	Fix rw->ro remount when there is a text vnode mapping. Reported and tested by: hrs Sponsored by: The FreeBSD Foundation MFC after: 16 days
# 78022527	05-May-2019	Konstantin Belousov <kib@FreeBSD.org>	Switch to use shared vnode locks for text files during image activation. kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition. The change removes VV_TEXT, replacing it with the condition v_writecount <= -1, and puts v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference when the segment is unmapped, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal. The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access. Atomic check for writeability and increment of v_writecount is performed by the VOP. vn_truncate() now increments v_writecount around VOP_SETATTR() call, lack of which is arguably a bug on its own. nullfs bypasses v_writecount to the lower vnode always, so nullfs vnode has its own v_writecount correct, and lower vnode gets all references, since object->handle is always lower vnode. On the text vnode's vm object dealloc, the v_writecount value is reset to zero, and deadfs vop_unset_text short-circuits the operation. Reclamation of lowervp always reclaims all nullfs vnodes referencing lowervp first, so no stray references are left. Reviewed by: markj, trasz Tested by: mjg, pho Sponsored by: The FreeBSD Foundation MFC after: 1 month Differential revision: https://reviews.freebsd.org/D19923
# 75d5cb29	26-Apr-2019	Alan Somers <asomers@FreeBSD.org>	fusefs: fix cache invalidation error from r346162 An off-by-one error led to the last page of a write not being removed from its object, even though that page's buffer was marked as invalid. PR: 235774 Sponsored by: The FreeBSD Foundation
# 77b82478	23-Apr-2019	Alan Somers <asomers@FreeBSD.org>	Fix bug in vtruncbuf introduced by r346162 r346162 factored out v_inval_buf_range from vtruncbuf, but it made an error in the interface between the two. The result was a failure to remove buffers past the first. Surprisingly, I couldn't reproduce the failure with file systems other than fuse. Also, modify fusefs's truncate_discards_cached_data test to catch this bug. PR: 346162 Sponsored by: The FreeBSD Foundation
# 6af6fdce	12-Apr-2019	Alan Somers <asomers@FreeBSD.org>	fusefs: evict invalidated cache contents during write-through fusefs's default cache mode is "writethrough", although it currently works more like "write-around"; writes bypass the cache completely. Since writes bypass the cache, they were leaving stale previously-read data in the cache. This commit invalidates that stale data. It also adds a new global v_inval_buf_range method, like vtruncbuf but for a range of a file. PR: 235774 Reported by: cem Sponsored by: The FreeBSD Foundation
# c11cbfd9	11-Mar-2019	Kirk McKusick <mckusick@FreeBSD.org>	Update the main loop in the flushbuflist() routine to properly select buffers for flushing when requested to flush both normal and extended attributes buffers. Sponsored by: Netflix
# 735835ed	08-Jan-2019	Michael Tuexen <tuexen@FreeBSD.org>	Avoid overflow in vtruncbuf() Using daddr_t instead of int prevents trunclbn from becoming negative when it shouldn't. This issue was found by running syzkaller. Reviewed by: mckusick, kib, markj MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18763
# c0029546	27-Dec-2018	Kirk McKusick <mckusick@FreeBSD.org>	When loading an inode from disk, verify that its mode is valid. If invalid, return EINVAL. Note that inode check-hashes greatly reduce the chance that these errors will go undetected. Reported by: Christopher Krah <krah@protonmail.com> Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read) Reviewed by: kib MFC after: 1 week Sponsored by: Netflix M sys/fs/ext2fs/ext2_vnops.c M sys/kern/vfs_subr.c M sys/ufs/ffs/ffs_snapshot.c M sys/ufs/ufs/ufs_vnops.c
# 6c59824b	23-Dec-2018	Konstantin Belousov <kib@FreeBSD.org>	Properly test for vmio buffer in bnoreuselist(). The presence of an allocated v_object does not imply that the buffer is necessarily of the VMIO kind. The buffer might have been allocated before the object was created, in which case the buffer is malloced. Although we try to avoid such a situation, it seems to be still legitimate. Reported and tested by: pho MFC after: 1 week Sponsored by: The FreeBSD Foundation
# cc426dd3	11-Dec-2018	Mateusz Guzik <mjg@FreeBSD.org>	Remove unused argument to priv_check_cred. Patch mostly generated with coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation
# 1436ff1e	17-Aug-2018	Mark Johnston <markj@FreeBSD.org>	Typo. X-MFC with: r337974
# 3ccbdc82	17-Aug-2018	Mark Johnston <markj@FreeBSD.org>	Add INVARIANTS-only fences around lockless vnode refcount updates. Some internal KASSERTs access the v_iflag field without the vnode interlock held after such a refcount update. The fences are needed for the assertions to be correct in the face of store reordering. Reported and tested by: jhibbits Reviewed by: kib, mjg MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16756
# 3f9e1fc8	06-Jun-2018	Justin Hibbits <jhibbits@FreeBSD.org>	Revert r334708 This is the wrong place to put the barrier. Requested by: kib,mjg
# 32c369f4	05-Jun-2018	Justin Hibbits <jhibbits@FreeBSD.org>	Add a memory barrier after taking a reference on the vnode holdcnt in _vhold This is needed to avoid a race between the VNASSERT() below, and another thread updating the VI_FREE flag, on weakly-ordered architectures. On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would panic on the assert regularly. It may be possible to use a weaker barrier, and I'll investigate that once all stability issues are worked out on POWER9.
# 84482abd	18-May-2018	Matt Macy <mmacy@FreeBSD.org>	vfs: annotate variables only used by debug builds as __unused
# 0e5c6bd4	04-May-2018	Jamie Gritton <jamie@FreeBSD.org>	Make it easier for filesystems to count themselves as jail-enabled, by doing most of the work in a new function prison_add_vfs in kern_jail.c Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and the rest is taken care of. This includes adding a jail parameter like allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed. Both of these used to be a static list of known filesystems, with predefined permission bits. Reviewed by: kib Differential Revision: D14681
# 6469bdcd	06-Apr-2018	Brooks Davis <brooks@FreeBSD.org>	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compatibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941
# f4043145	28-Mar-2018	Andriy Gapon <avg@FreeBSD.org>	ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count It's neither sufficient nor required to use the vnode interlock when checking if we are going to drop the last use count as the code in vputx() uses refcount (atomic) operations for both checking and decrementing the use count. Apply the same method to vn_rele_async(). While here, remove vn_rele_inactive(), a wrapper around vrele() that didn't add any value. Also, the change required making vfs_refcount_release_if_not_last() public. I've made vfs_refcount_acquire_if_not_zero() public as well. They are in sys/refcount.h now. While making the move I've dropped the vfs_ prefix. Reviewed by: mjg MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D14869
# 06220fa7	19-Feb-2018	Jeff Roberson <jeff@FreeBSD.org>	Further parallelize the buffer cache. Provide multiple clean queues partitioned into 'domains'. Each domain manages its own bufspace and has its own bufspace daemon. Each domain has a set of subqueues indexed by the current cpuid to reduce lock contention on the cleanq. Refine the sleep/wakeup around the bufspace daemon to use atomics as much as possible. Add a B_REUSE flag that is used to requeue bufs during the scan to approximate LRU rather than locking the queue on every use of a frequently accessed buf. Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts. Reviewed by: markj Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14274
# 5a35a042	31-Jan-2018	Kirk McKusick <mckusick@FreeBSD.org>	One of the vnode fields listed by vn_printf is the union of pointers whose type depends on the type of vnode. Correct vn_printf so that it correctly identifies the name of the pointer that it is printing. Submitted by: Andreas Longwitz <longwitz at incore.de> MFC after: 1 week
# 31c2c6e9	12-Jan-2018	Mateusz Guzik <mjg@FreeBSD.org>	vfs: tidy up vdrop Skip vfs_refcount_release_if_not_last if the interlock is held and just go straight to refcount_release. While here do cosmetic rearrangement of _vhold to better show it contains equivalent behaviour.
# caa7e52f	26-Dec-2017	Eitan Adler <eadler@FreeBSD.org>	kernel: Fix several typos and minor errors - duplicate words - typos - references to old versions of FreeBSD Reviewed by: imp, benno
# 151ba793	24-Dec-2017	Alexander Kabaev <kan@FreeBSD.org>	Do pass removing some write-only variables from the kernel. This reduces noise when kernel is compiled by newer GCC versions, such as one used by external toolchain ports. Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial) Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c) Differential Revision: https://reviews.freebsd.org/D10385
# 51369649	20-Nov-2017	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, supersede or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.
# a3e8a25a	20-Oct-2017	Mark Johnston <markj@FreeBSD.org>	Avoid the nbp lookup in the final loop iteration in flushbuflist(). The end of the loop must re-lookup the next buf since the bufobj lock is dropped in the loop body. If the lookup fails, the loop is restarted. This mechanism non-obviously also terminates the loop when the end of the buf list is reached. Split up the two loop termination cases to make the code a bit less fragile. No functional change intended. Reviewed by: kib MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12730
# fa00affd	17-Oct-2017	Mark Johnston <markj@FreeBSD.org>	Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL(). MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes, but the VI_DOOMED check it used was done without the vnode interlock held, so it could race with a concurrent vgone(). Submitted by: Don Morris <don.morris@isilon.com> Reviewed by: kib, mckusick MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12704
# 5bf94937	19-Sep-2017	Konstantin Belousov <kib@FreeBSD.org>	For unlinked files, do not msync(2) or sync on the vnode deactivation. One consequence of the patch is that msyncing unlinked file mappings no longer reduces the amount of the dirty memory in the system, but I do not think that there are users of msync(2) that utilize it for such side-effect. Reported and tested by: tjil PR: 222356 Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential revision: https://reviews.freebsd.org/D12411
# 8359a6b7	28-Aug-2017	Bryan Drewery <bdrewery@FreeBSD.org>	Allow vdrop() of a vnode not yet on the per-mount list after r306512. The old code allowed calling vdrop() before insmntque() to place the vnode back onto the freelist for later recycling. Some downstream consumers may rely on this support. Normally insmntque() failing is fine since it uses vgone() and immediately frees the vnode rather than attempting to add it to the freelist if vdrop() were used instead. Also assert that vhold() cannot be used on such a vnode. Reviewed by: kib, cem, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D12126
# b59ea730	20-Aug-2017	Konstantin Belousov <kib@FreeBSD.org>	Allow vinvalbuf() to operate with the shared vnode lock. This mode allows other clean buffers to arrive while we flush the buf lists for the vnode, which is fine for the targeted use. We only need all buffers that existed at the time of the function start to be flushed. In fact, only one assert has to be relaxed. In collaboration with: pho Reviewed by: rmacklem Sponsored by: The FreeBSD Foundation MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D12083
# 0c3c207f	02-Jun-2017	Gleb Smirnoff <glebius@FreeBSD.org>	For UNIX sockets make vnode point not to the socket, but to the UNIX PCB, since the latter is the thing that links together VFS and sockets. While here, make the union in the struct vnode anonymous.
# 391aba32	15-May-2017	Konstantin Belousov <kib@FreeBSD.org>	mnt_vnode_next_active: use conventional lock order when trylock fails. Previously, when the VI_TRYLOCK failed, we would spin under the mutex that protects the vnode active list until we either succeeded or noticed that we had hogged the CPU. Since we were violating the lock order, this would guarantee that we would become a hog under any deadlock condition (e.g. a race with vdrop(9) on the same vnode). In the presence of many concurrent threads in sync(2) or vdrop etc, the victim could hang for a long time. Now, avoid spinning by dropping and reacquiring the locks in the conventional lock order when the trylock fails. This requires a dance with the vnode hold count. Submitted by: Tom Rix <trix@juniper.net> Tested by: pho Differential revision: https://reviews.freebsd.org/D10692
# 0226f659	05-Apr-2017	Konstantin Belousov <kib@FreeBSD.org>	Add V_VMIO flag for vinvalbuf(9) to indicate that the flush request was issued during VM-initiated i/o (pageout), so that the function does not try to flush or remove pages or wait for the vm object paging-in-progress counter. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week X-Differential revision: https://reviews.freebsd.org/D10241
# db362553	04-Apr-2017	Brooks Davis <brooks@FreeBSD.org>	Correct a kernel stack leak in 32-bit compat when vfc_name is short. Don't zero unused pointer members again. Per discussion with secteam we are not issuing an advisory for this issue as we have no current evidence it leaks exploitable information. Reviewed by: rwatson, glebius, delphij MFC after: 1 day Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D10227
# e3f87f6c	12-Mar-2017	Ian Lepore <ian@FreeBSD.org>	Change 'Hz' back to 'HZ'... it's referring to the kernel config option named HZ, not being used as an abbreviation of the unit of measure.
# 8a396640	12-Mar-2017	Ian Lepore <ian@FreeBSD.org>	Correct the abbreviations for microseconds (us, not ms), and for Hz (not HZ).
# 2d78a553	04-Feb-2017	Mateusz Guzik <mjg@FreeBSD.org>	vfs: use atomic_fcmpset in vfs_refcount_*
# 8acac5a9	22-Jan-2017	Edward Tomasz Napierala <trasz@FreeBSD.org>	Improve debugging printf.
# 067115e0	21-Jan-2017	Mateusz Guzik <mjg@FreeBSD.org>	vfs: hide the getvnode NULL mp message behind DIAGNOSTIC Since crossmp vnode changes the message was being printed on each boot. Reported by: trasz Discussed with: kib
# 41b0046a	31-Dec-2016	Mateusz Guzik <mjg@FreeBSD.org>	vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9) Reviewed by: kib
# 5afb134c	12-Dec-2016	Mateusz Guzik <mjg@FreeBSD.org>	vfs: add vrefact, to be used when the vnode has to be already active This allows blind increment of relevant counters which under contention is cheaper than inc-not-zero loops at least on amd64. Use it in some of the places which are guaranteed to see already active vnodes. Reviewed by: kib (previous version)
# 64910ddb	26-Nov-2016	Mark Johnston <markj@FreeBSD.org>	Launder VPO_NOSYNC pages upon vnode deactivation. As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be laundered. This behaviour has two problems: 1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be reclaimed, and this work may be unfairly deferred to the vnlru process or an unrelated application when the system is under vnode pressure. 2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if the laundry thread needs to launder pages from an unreferenced such vnode, it will reactivate and deactivate the vnode with each laundering, potentially resulting in a large number of expensive scans. Therefore, ensure that all dirty pages are laundered upon deactivation, i.e., when all maps of the vnode are removed and all references are released. Reviewed by: alc, kib MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8641
# c6c44ff7	08-Oct-2016	Mateusz Guzik <mjg@FreeBSD.org>	vfs: clear the tmp free list flag before taking the free vnode list lock Safe access is already guaranteed because of the mnt_listmx lock.
# 32641585	06-Oct-2016	Bryan Drewery <bdrewery@FreeBSD.org>	vrefl: Assert that the interlock is held. Sponsored by: Dell EMC Isilon MFC after: 2 weeks
# 5a22c958	06-Oct-2016	Bryan Drewery <bdrewery@FreeBSD.org>	Add vrecyclel() to vrecycle() a vnode with the interlock already held. Obtained from: OneFS Sponsored by: Dell EMC Isilon MFC after: 2 weeks
# 0617f64e	04-Oct-2016	Bryan Drewery <bdrewery@FreeBSD.org>	Correct some comments after r294299. Sponsored by: Dell EMC Isilon
# 5bb81f9b	30-Sep-2016	Mateusz Guzik <mjg@FreeBSD.org>	vfs: batch free vnodes in per-mnt lists Previously free vnodes would always be directly returned to the global LRU list. With this change up to mnt_free_list_batch vnodes are collected first. syncer runs always return the batch regardless of its size. While vnodes on per-mnt lists are not counted as free, they can be returned in case of vnode shortage. Reviewed by: kib Tested by: pho
# 8660b707	30-Sep-2016	Mateusz Guzik <mjg@FreeBSD.org>	vfs: remove the __bo_vnode field from struct vnode The pointer can be obtained using __containerof instead. Reviewed by: kib
# 69a28758	15-Sep-2016	Ed Maste <emaste@FreeBSD.org>	Renumber license clauses in sys/kern to avoid skipping #3
# f83cc0aa	12-Aug-2016	Edward Tomasz Napierala <trasz@FreeBSD.org>	Print vnode details when vnode locking assertion gets triggered. MFC after: 1 month
# 411455a8	10-Aug-2016	Edward Tomasz Napierala <trasz@FreeBSD.org>	Replace all remaining calls to vprint(9) with vn_printf(9), and remove the old macro. MFC after: 1 month
# 7b255097	04-Aug-2016	Edward Tomasz Napierala <trasz@FreeBSD.org>	Remove unused - never actually implemented - vnode lock types from vnode_if.src. MFC after: 1 month
# 725496ce	11-Jul-2016	Konstantin Belousov <kib@FreeBSD.org>	Fix grammar. Submitted by: alc MFC after: 2 weeks
# 19efd8a5	11-Jul-2016	Konstantin Belousov <kib@FreeBSD.org>	In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called, if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate(). The vnode_destroy_object(), when calling into vm_object_terminate(), must be able to flush buffers. The purpose of BO_DEAD is to quickly destroy buffers on write when the underlying vnode is not operable any more (one example is the devfs node after geom is gone). Setting BO_DEAD for reclaiming vnode before object is terminated is premature, and results in inability to flush buffers with live SU dependencies from vinvalbuf() in vm_object_terminate(). Reported by: David Cross <dcrosstech@gmail.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# d12d6a84	02-Jul-2016	Konstantin Belousov <kib@FreeBSD.org>	Remove racy assert. The thread which changes vnode usecount from 0 to 1 does it under the vnode interlock, but the interlock is not owned by the asserting thread. As a result, we might read increased use counter but also still see VI_OWEINACT. In collaboration with: nwhitehorn Hardware donated by: IBM LTC Sponsored by: The FreeBSD Foundation (kib) Approved by: re (gjb)
# d9a503be	20-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	Fix typo. Note that atomic is still required even for interlocked case. Sponsored by: The FreeBSD Foundation Approved by: re (marius)
# e896fb3b	17-Jun-2016	Mateusz Guzik <mjg@FreeBSD.org>	vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels This removes calls to empty functions like vop_lock_{pre/post} from common vfs routines. Approved by: re (gjb)
# f8a75278	17-Jun-2016	Konstantin Belousov <kib@FreeBSD.org>	Add VFS interface to flush specified amount of free vnodes belonging to mount points with the given filesystem type, specified by mount vfs_ops pointer. Based on patch by: mckusick Reviewed by: avg, mckusick Tested by: allanjude, madpilot Sponsored by: The FreeBSD Foundation Approved by: re (gjb)
# f7bd2217	31-May-2016	Edward Tomasz Napierala <trasz@FreeBSD.org>	Cosmetics - add missing space after ellipses in shutdown messages. MFC after: 1 month Sponsored by: The FreeBSD Foundation
# 27d4b35f	16-May-2016	Andriy Gapon <avg@FreeBSD.org>	vfs_read_dirent: increment ncookies after adding a cookie It seems that at present vfs_read_dirent() is used only with filesystems that do not support cookies, so the bug never manifested itself. MFC after: 1 week
# c89e1b87	03-May-2016	Konstantin Belousov <kib@FreeBSD.org>	Add EVFILT_VNODE open, read and close notifications. While there, order EVFILT_VNODE notes descriptions alphabetically. Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru> MFC after: 2 weeks
# f7b71c8a	02-May-2016	Konstantin Belousov <kib@FreeBSD.org>	Issue NOTE_EXTEND when a directory entry is added to or removed from the monitored directory as the result of rename(2) operation. The renames staying in the directory are not reported. Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru> MFC after: 2 weeks
# bd2ead6b	02-May-2016	Konstantin Belousov <kib@FreeBSD.org>	Fix reporting of NOTE_LINK when directory link count changes due to rename removing or adding subdirectory entry. Discussed with and tested by: Vladimir Kondratyev <wulf@cicgroup.ru> NetBSD PR: 48958 (http://gnats.netbsd.org/48958) MFC after: 2 weeks Sponsored by: The FreeBSD Foundation
# e3043798	29-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys/kern: spelling fixes in comments. No functional change.
# 55e0987a	26-Apr-2016	Pedro F. Giffuni <pfg@FreeBSD.org>	sys: extend use of the howmany() macro when available. We have a howmany() macro in the <sys/param.h> header that is convenient to re-use as it makes things easier to read.
# 0791e0c0	24-Feb-2016	Konstantin Belousov <kib@FreeBSD.org>	Provide more correct sizing of the KVA consumed by a vnode, used by the virtvnodes calculation. Include the size of fs-specific v_data as the nfs nclnode inline, the NFS nclnode is bigger than either ZFS znode or UFS inode. Include the size of namecache_ts and short cache path element, multiplied by the name cache population factor, again inline. Inline defines are used to avoid pollution of the vnode.h with the subsystem-private objects. Non-significant unsynchronized changes of the definitions are fine, we do not care about that precision, and e.g. ZFS consumes much malloced memory per vnode for reasons unaccounted in the formula. Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10. The measures reduce vnode cache pressure on kmem and bring the vnode cache memory use below some apparent thresholds that were exceeded by r291244 due to more robust vnode reuse. Reported and tested by: marius (i386, previous version) Reviewed by: bde Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# fa48f413	17-Feb-2016	Konstantin Belousov <kib@FreeBSD.org>	In bnoreuselist(), check both ends of the specified logical block numbers range. This effectively skips indirect and extdata blocks on the buffer queue. Since their logical block numbers are negative, bnoreuselist() could loop infinitely. Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 793c3817	18-Jan-2016	Mark Johnston <markj@FreeBSD.org>	Add vrefl(), a locked variant of vref(9). This API has no in-tree consumers at the moment but is useful to at least one out-of-tree consumer, and naturally complements existing vnode refcount functions (vholdl(9), vdropl(9)). Obtained from: kib (sys/ portion) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D4947 Differential Revision: https://reviews.freebsd.org/D4953
# 1041e090	05-Jan-2016	Konstantin Belousov <kib@FreeBSD.org>	Two fixes for excessive iterations after r292326. Advance the logical block number to the lblkno of the found block plus one, instead of incrementing the block number which was used for lookup. This change skips sparsely populated buffer ranges, similar to r292325, instead of doing useless lookups. Do not restart the bnoreuselist() from the start of the range if buffer lock cannot be obtained without sleep. Only retry lookup and lock for the same queue and same logical block number. Reported by: benno Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 106ebb76	16-Dec-2015	Konstantin Belousov <kib@FreeBSD.org>	Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a buffer for each block number in the range with gbincore(), look up the next instantiated buffer with the logical block number which is greater or equal to the next lblkno. This significantly speeds up the iteration for sparsely populated ranges. Move the iteration into new helper bnoreuselist(), which is structured similarly to flushbuflist(). Reported and tested by: pho Reviewed by: markj Sponsored by: The FreeBSD Foundation
# 8549b4b9	16-Dec-2015	Konstantin Belousov <kib@FreeBSD.org>	Simplify the loop step in the flushbuflist() and make it independent of the type stability of the buffers memory. Instead of memoizing pointer to the next buffer and validating it, remember the next logical block number in the bo list and re-lookup. Reviewed by: markj Tested by: pho Sponsored by: The FreeBSD Foundation
# d9ea698c	03-Dec-2015	Kirk McKusick <mckusick@FreeBSD.org>	We need to zero out the clustering variables in a freed vnode structure. For completeness add a VNASSERT that there are no threads waiting on a range lock (this was previously checked on every vnode free). Reported by: Rick Macklem Fix from: Mateusz Guzik PR: 204949
# 003a7c2b	02-Dec-2015	Kirk McKusick <mckusick@FreeBSD.org>	We need to zero out the union of pointers in a freed vnode structure. PR: 204949 Fix from: Mateusz Guzik Tested by: Jason Unovitch
# 41d4f103	29-Nov-2015	Kirk McKusick <mckusick@FreeBSD.org>	As the kernel allocates and frees vnodes, it fully initializes them on every allocation and fully releases them on every free. These are not trivial costs: it starts by zeroing a large structure then initializes a mutex, a lock manager lock, an rw lock, four lists, and six pointers. And looking at vfs.vnodes_created, these operations are being done millions of times an hour on a busy machine. As a performance optimization, this code update uses the uma_init and uma_fini routines to do these initializations and cleanups only as the vnodes enter and leave the vnode_zone. With this change the initializations are only done kern.maxvnodes times at system startup and then only rarely again. The frees are done only if the vnode_zone shrinks which never happens in practice. For those curious about the avoided work, look at the vnode_init() and vnode_fini() functions in kern/vfs_subr.c to see the code that has been removed from the main vnode allocation/free path. Reviewed by: kib Tested by: Peter Holm
# f186a80d	26-Nov-2015	Konstantin Belousov <kib@FreeBSD.org>	Remove VI_AGE vnode iflag, it is unused. Noted by: bde Sponsored by: The FreeBSD Foundation
# b3162b45	26-Nov-2015	Konstantin Belousov <kib@FreeBSD.org>	Move the comment about resident pages preventing vnode from leaving active list, into the header comment for vdrop(), which is the function that decides whether to leave the vnode on the list. Note that dirty page write-out in vinactive() is asynchronous. Discussed with: alc Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 547831b6	24-Nov-2015	Konstantin Belousov <kib@FreeBSD.org>	Rework the vnode cache recycling to meet free and unused vnodes targets. See the comment above wantfreevnodes variable for the description of the algorithm. The vfs.vlru_alloc_cache_src sysctl is removed. New code frees namecache sources as the last chance to satisfy the highest watermark, instead of selecting the source vnodes randomly. This provides good enough behaviour to keep vn_fullpath() working in most situations. The filesystem layout with deep trees, where the removed knob was required, is thus handled automatically. Submitted by: bde Discussed with: mckusick Tested by: pho MFC after: 1 month
# 09c837b8	20-Nov-2015	Gleb Smirnoff <glebius@FreeBSD.org>	Remove remnants of the old NFS from vnode pager. Reviewed by: kib Sponsored by: Netflix
# 0a805de6	26-Sep-2015	Mark Johnston <markj@FreeBSD.org>	Remove a check for a condition that is always false by a preceding KASSERT that was added in r144704.
# d925c2e8	26-Sep-2015	Mark Johnston <markj@FreeBSD.org>	Fix argument ordering in vn_printf(). MFC after: 3 days
# 55d33667	15-Sep-2015	Conrad Meyer <cem@FreeBSD.org>	kevent(2): Note DOOMED vnodes with NOTE_REVOKE In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at registration will never be woken by the RECLAIM trigger.) Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that were fine at registration but are vgoned while being monitored should signal waiters.) Reviewed by: kib Approved by: markj (mentor) Sponsored by: EMC / Isilon Storage Division Differential Revision: https://reviews.freebsd.org/D3675
# 17518b1a	05-Sep-2015	Kirk McKusick <mckusick@FreeBSD.org>	Track changes to kern.maxvnodes and appropriately increase or decrease the size of the name cache hash table (mapping file names to vnodes) and the vnode hash table (mapping mount point and inode number to vnode). An appropriate locking strategy is the key to changing hash table sizes while they are in active use. Reviewed by: kib Tested by: Peter Holm Differential Revision: https://reviews.freebsd.org/D2265 MFC after: 2 weeks
# c9ba6504	24-Aug-2015	Edward Tomasz Napierala <trasz@FreeBSD.org>	Make vfs_unmountall() unmount /dev after /, not before. The only reason this didn't result in an unclean shutdown is that devfs ignores MNT_FORCE flag. Reviewed by: kib@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D3467
# 6e572e08	23-Aug-2015	Edward Tomasz Napierala <trasz@FreeBSD.org>	After r286237 it should be fine to call vgone(9) on a busy GEOM vnode; remove KASSERT that would prevent forced devfs unmount from working. MFC after: 1 month Sponsored by: The FreeBSD Foundation
# 2433a4eb	05-Aug-2015	Ed Schouten <ed@FreeBSD.org>	Make it possible to implement poll(2) on top of kqueue(2). It looks like EVFILT_READ and EVFILT_WRITE trigger under the same conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX. The only difference is that POLLRDNORM has to be triggered on regular files unconditionally, whereas EVFILT_READ only triggers when not EOF. Introduce a new flag, NOTE_FILE_POLL, that can be used to make EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag will be used by cloudlibc's poll() function. Reviewed by: jmg Differential Revision: https://reviews.freebsd.org/D3303
# 57a73b26	04-Aug-2015	Edward Tomasz Napierala <trasz@FreeBSD.org>	Mark vgonel() as static. It was already declared static earlier; no idea why compilers don't warn about this. MFC after: 1 month Sponsored by: The FreeBSD Foundation
# 752fc07d	16-Jul-2015	Mateusz Guzik <mjg@FreeBSD.org>	vfs: implement v_holdcnt/v_usecount manipulation using atomic ops Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free list) of either counter are still guarded with vnode interlock. Reviewed by: kib (earlier version) Tested by: pho
# c634b752	11-Jul-2015	Mateusz Guzik <mjg@FreeBSD.org>	vfs: always clear VI_OWEINACT in consumers bumping v_usecount Previously vputx would detect the condition and clear the flag. With this change it is invalid to have both v_usecount > 0 and the flag set. Assert the condition is met in all relevant places. Reviewed by: kib
# 2d1ca3cd	11-Jul-2015	Mateusz Guzik <mjg@FreeBSD.org>	vfs: move si_usecount manipulation to dedicated functions Reviewed by: kib
# cf88021a	11-Jul-2015	Konstantin Belousov <kib@FreeBSD.org>	Do not allow creation of the dirty buffers for the dead buffer objects, i.e. for buffer objects which vnode was reclaimed. Buffer cache cannot write such buffers. Return the error and discard the buffer immediately on write attempt. BO_DIRTY now always set during vnode reclamation, since it is used not only for the INVARIANTS checks. Do allow placement of the clean buffers on dead bufobj list, otherwise filesystems cannot use bufcache at all after the devvp reclaim. Reported and tested by: trasz Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 8bbd1f25	05-Jul-2015	Mark Johnston <markj@FreeBSD.org>	Remove a stale descriptive comment for gbincore(). The splay trees referenced in the comment were converted to path-compressed tries in r250551. MFC after: 3 days
# 1977bd23	23-Jun-2015	John-Mark Gurney <jmg@FreeBSD.org>	zero this struct as it depends upon it... Reviewed by: mjg Differential Revision: https://reviews.freebsd.org/D2890
# 1eabd967	16-Jun-2015	Konstantin Belousov <kib@FreeBSD.org>	vfs_msync(), called from syncer vnode fsync VOP, only iterates over the active vnode list for the given mount point, with the assumption that vnodes with dirty pages are active. This is enforced by vinactive() doing vm_object_page_clean() pass over the vnode pages. The issue is, if vinactive() cannot be called during vput() due to the vnode being only shared-locked, we might end up with the dirty pages for the vnode on the free list. Such vnode is invisible to syncer, and pages are only cleaned on the vnode reactivation. In other words, the race results in the broken guarantee that user data, written through the mmap(2), is written to the disk not later than in 30 seconds after the write. Fix this by keeping the vnode which is freed but still owing inactivation, on the active list. When syncer loops find such vnode, it is deactivated and cleaned by the final vput() call. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 780dca1b	27-May-2015	Konstantin Belousov <kib@FreeBSD.org>	Right now, dounmount() is called with an unreferenced mount point. Nothing stops a parallel unmount from succeeding before the given call to dounmount() checks and locks the covered vnode. Prevent dounmount() from acting on the freed (although type-stable) memory by changing the interface to require the mount point to be referenced. dounmount() consumes the reference on return, regardless of whether the result is successful or erroneous. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# dda11d4a	15-Apr-2015	Rick Macklem <rmacklem@FreeBSD.org>	File systems that do not use the buffer cache (such as ZFS) must use VOP_FSYNC() to perform the NFS server's Commit operation. This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which is set by file systems that use the buffer cache. If this flag is not set, the NFS server always does a VOP_FSYNC(). This should be ok for old file system modules that do not set MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although it might not be optimal for file systems that use the buffer cache. Reviewed by: kib MFC after: 2 weeks
# 08189ed6	27-Feb-2015	Konstantin Belousov <kib@FreeBSD.org>	The VNASSERT in vflush() FORCECLOSE case is trying to panic early to prevent errors from yanking devices out from under filesystems. Only care about special vnodes on devfs, special nodes on other kinds of filesystems do not have special properties. Sponsored by: EMC / Isilon Storage Division Submitted by: Conrad Meyer MFC after: 1 week
# c514f051	17-Feb-2015	Enji Cooper <ngie@FreeBSD.org>	Add the mnt_lockref field to the ddb(4) 'show mount' command MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D1688 Submitted by: Conrad Meyer <conrad.meyer@isilon.com> Sponsored by: EMC / Isilon Storage Division
# 1b76e0b7	14-Feb-2015	John Baldwin <jhb@FreeBSD.org>	Add two new counters for vnode life cycle events: - vfs.recycles counts the number of vnodes forcefully recycled to avoid exceeding kern.maxvnodes. - vfs.vnodes_created counts the number of vnodes created by successful calls to getnewvnode(). Differential Revision: https://reviews.freebsd.org/D1671 Reviewed by: kib MFC after: 1 week
# a77a1234	25-Jan-2015	John Baldwin <jhb@FreeBSD.org>	Change the default VFS timestamp precision from seconds to microseconds. Discussed on: arch@ MFC after: 2 weeks
# ea117d17	13-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	The vinactive() call in vgonel() may start writes for the dirty pages, creating delayed write buffers belonging to the reclaimed vnode. Put the buffer cleanup code after inactivation. Add asserts that ensure that buffer queues are empty and add BO_DEAD flag for bufobj to check that no buffers are added after the cleanup. BO_DEAD is only used by INVARIANTS-enabled kernels. Reported and tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 1 week
# a77c72f5	09-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	Apply chunk forgotten in r275620. Remove local variable for real. CID: 1257462 Sponsored by: The FreeBSD Foundation
# a25100c5	08-Dec-2014	Konstantin Belousov <kib@FreeBSD.org>	Add functions syncer_suspend() and syncer_resume(), which are supposed to be called before suspension and after resume, correspondingly. The syncer_suspend() ensures that all filesystems dirty data and metadata are saved to the permanent storage, and stops kernel threads which might modify filesystems. The syncer_resume() restores stopped threads. For now, only syncer is stopped. This is needed, because each sync loop causes superblock updates for UFS. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 32e7f8e4	14-Oct-2014	Mateusz Guzik <mjg@FreeBSD.org>	Don't take devmtx unnecessarily in vn_isdisk. MFC after: 1 week
# 9832a24d	01-Oct-2014	Will Andrews <will@FreeBSD.org>	In the syncer, drop the sync mutex while patting the watchdog. Some watchdog drivers (like ipmi) need to sleep while patting the watchdog. See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK). Submitted by: asomers MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 637548 on 2012/10/04
# 168f4ee0	02-Aug-2014	Konstantin Belousov <kib@FreeBSD.org>	Remove Giant acquisition from the mount and unmount paths. It could be claimed that two things were reasonably protected by Giant. One is vfsconf list links, which is converted to the new dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is now updated with atomics. Note that vfc_refcount still has the same races now as it had under Giant: the unload of filesystem modules can happen while the module is still in use. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# 634012b9	29-Jul-2014	Konstantin Belousov <kib@FreeBSD.org>	Remove one-time use macros which check for the vnode lifecycle. Moreover, some parts of the checks are in fact redundant in the surrounding code, and it is clearer what the conditions are by direct testing of the flags. Two of the three macros were only used in assertions. In vnlru_free(), all relevant parts of vholdl() were already inlined, except the increment of v_holdcnt itself. Do not call vholdl() to do the increment as well; this allows making the assertions in vholdl()/vhold() more strict. In v_incr_usecount(), call vholdl() before incrementing other ref counters. The change is a no-op, but it makes the vnode state less surprising to see in the debugger if interrupted inside v_incr_usecount(). Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week
# 781c93d4	11-Jun-2014	Alexander Motin <mav@FreeBSD.org>	Implement simple direct-mapped cache for popular filesystem identifiers to avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while traversing through the list of mount points. This change significantly improves NFS server scalability, since it had to do this translation for every request, and the global lock becomes quite congested. This code is more optimized for a relatively small number of mount points. On systems with hundreds of active mount points this simple cache may have many collisions. But the original traversal code in that case should also behave much worse, so we are not losing much. Reviewed by: attilio MFC after: 2 weeks Sponsored by: iXsystems, Inc.
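A minimal user-space sketch of the direct-mapped cache idea described in 781c93d4, not the committed kernel code. All names here (struct mount_entry, slot_index(), lookup(), CACHE_SLOTS) are hypothetical, the hash is an arbitrary illustrative choice, and the locking that the real vfs_busyfs() needs is omitted on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 256			/* assumed size; must be a power of two */

struct fsid { int32_t val[2]; };	/* stand-in for a filesystem identifier */

struct mount_entry {			/* hypothetical stand-in for a mount point */
	struct fsid id;
	struct mount_entry *next;
};

static struct mount_entry *mount_list;		/* full list, slow path */
static struct mount_entry *slot[CACHE_SLOTS];	/* direct-mapped cache */

/* Map an identifier to exactly one slot; colliding ids evict each other. */
static unsigned
slot_index(const struct fsid *id)
{
	uint32_t h;

	assert(id != NULL);
	h = (uint32_t)id->val[0] ^ (uint32_t)id->val[1];
	return (((h >> 16) ^ h) & (CACHE_SLOTS - 1));
}

static int
fsid_equal(const struct fsid *a, const struct fsid *b)
{
	return (a->val[0] == b->val[0] && a->val[1] == b->val[1]);
}

/* Try the single cache slot first; fall back to walking the whole list. */
static struct mount_entry *
lookup(const struct fsid *id)
{
	unsigned i = slot_index(id);
	struct mount_entry *m = slot[i];

	if (m != NULL && fsid_equal(&m->id, id))
		return (m);		/* fast path: no list traversal */
	for (m = mount_list; m != NULL; m = m->next) {
		if (fsid_equal(&m->id, id)) {
			slot[i] = m;	/* remember for the next request */
			return (m);
		}
	}
	return (NULL);
}
```

With few mount points most lookups hit the slot and never touch the list; with many mount points collisions send more lookups down the slow path, which matches the trade-off the commit message describes.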
# 4f655310	10-Jun-2014	Alexander Motin <mav@FreeBSD.org>	Remove unneeded mountlist_mtx acquisition from sync_fsync(). All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.
# 3345d73c	08-Jun-2014	Alexander Motin <mav@FreeBSD.org>	Remove extra branching from r267232. MFC after: 2 weeks
# 590d6363	08-Jun-2014	Alexander Motin <mav@FreeBSD.org>	Use atomics to modify numvnodes variable. This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(), that reduces number of global vnode_free_list_mtx mutex acquisitions from 4 to 2 per NFS request on ZFS, improving SMP scalability. Reviewed by: kib MFC after: 2 weeks Sponsored by: iXsystems, Inc.
# bf09eca2	20-May-2014	Benjamin Kaduk <bjk@FreeBSD.org>	Check for mismatched vref()/vdrop() Assert that the hold count has not fallen below the use count, a situation that would only happen when a vref() (or similar) is erroneously paired with a vdrop(). This situation has not been observed in the wild, but could be helpful for someone implementing a new filesystem. Reviewed by: kib Approved by: hrs (mentor)
# 44f1c916	22-Mar-2014	Bryan Drewery <bdrewery@FreeBSD.org>	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcpu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division
# 2c1531e7	09-Oct-2013	Konstantin Belousov <kib@FreeBSD.org>	Do not flush buffers when the v_object of the passed vnode does not really belong to it. Such vnodes, with the pointers to other vnodes v_objects, are typically instantiated by the bypass filesystems. Invalidating mappings of other vnode pages and the pages is wrong, since reclamation of the upper vnode does not imply that lower vnode is reclaimed too. One of the consequences of the improper reclamation was destruction of the wired mappings of the lower vnode pages, triggering miscellaneous assertions in the VM system. Reported by: John Marshall <john.marshall@riverwillow.com.au> Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
# d6498b15	01-Oct-2013	Konstantin Belousov <kib@FreeBSD.org>	When printing the vnode information from ddb, print the lengths of the dirty and clean buffer queues. Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (gjb)
# fe39412e	29-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	For vunref(), try to upgrade the vnode lock if the function was called with the vnode shared-locked. If upgrade succeeded, the inactivation can be done immediately, instead of being postponed. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (glebius)
# 27884e3b	26-Sep-2013	Konstantin Belousov <kib@FreeBSD.org>	Acquire a hold reference on the vnode when a knote is instantiated. Otherwise, knote keeps a pointer to a vnode which could become invalid any time. Reported by: many Tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Discussed with: jmg Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: re (marius)
# 4593c0ad	17-Aug-2013	Pawel Jakub Dawidek <pjd@FreeBSD.org>	In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated. Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().
# 2c38cc79	28-Jul-2013	Konstantin Belousov <kib@FreeBSD.org>	When creation of the v_pollinfo raced and our instance of vpollinfo must be destroyed, knlist_clear() and seldrain() calls could be avoided, since vpollinfo was not used. More, the knlist_clear() calling protocol requires the knlist locked, which is not true at the call site. Split the destruction into the helper destroy_vpollinfo_free(), and call it when raced, instead of destroy_vpollinfo(). Reported and tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
# a8b0523a	17-Jul-2013	Konstantin Belousov <kib@FreeBSD.org>	Clear the vnode knotes before destroying vpollinfo. Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org> Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# d39116f5	03-Jun-2013	Konstantin Belousov <kib@FreeBSD.org>	Be more generous when donating the current thread time to the owner of the vnode lock while iterating over the free vnode list. Instead of yielding, pause for 1 tick. The change is reported to help in some virtualized environments. Submitted by: Roger Pau Monné <roger.pau@citrix.com> Discussed with: jilles Tested by: pho MFC after: 2 weeks
# 22a72260	30-May-2013	Jeff Roberson <jeff@FreeBSD.org>	- Convert the bufobj lock to rwlock. - Use a shared bufobj lock in getblk() and inmem(). - Convert softdep's lk to rwlock to match the bufobj lock. - Move INFREECNT to b_flags and protect it with the buf lock. - Remove unnecessary locking around bremfree() and BKGRDINPROG. Sponsored by: EMC / Isilon Storage Division Discussed with: mckusick, kib, mdf
# f2cc1285	11-May-2013	Jeff Roberson <jeff@FreeBSD.org>	- Add a new general purpose path-compressed radix trie which can be used with any structure containing a uint64_t index. The tree code auto-generates type safe wrappers. - Eliminate the buf splay and replace it with pctrie. This is not only significantly faster with large files but also allows for the possibility of shared locking. Reviewed by: alc, attilio Sponsored by: EMC / Isilon Storage Division
# 0fc6daa7	11-May-2013	Konstantin Belousov <kib@FreeBSD.org>	- Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The null_hashget() obtains the reference on the nullfs vnode, which must be dropped. - Fix a wart which existed from the introduction of the nullfs caching, do not unlock lower vnode in the nullfs_reclaim_lowervp(). It should be innocent, but now it is also formally safe. Inform the nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on nullfs inode. - Add a callback to the upper filesystems for the lower vnode unlinking. When inactivating a nullfs vnode, check if the lower vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC on the lower vnode, and reclaim upper vnode if so. This allows nullfs to purge cached vnodes for the unlinked lower vnode, avoiding excessive caching. Reported by: Göran Löwkrantz <goran.lowkrantz@ismobile.com> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks
# e63091ea	09-May-2013	Marcel Moolenaar <marcel@FreeBSD.org>	Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE locks. To support this, VNODE locks are created with the LK_IS_VNODE flag. This flag is propagated down using the LO_IS_VNODE flag. Note that WITNESS still records the LOR. Only the printing and the optional entering into the kernel debugger is bypassed with the WITNESS_NO_VNODE option.
# 770b41b3	04-May-2013	Matthew D Fleming <mdf@FreeBSD.org>	Add missing vdrop() in error case. Submitted by: Fahad (mohd.fahadullah@isilon.com) MFC after: 1 week
# 64fa8df6	16-Apr-2013	Rick Macklem <rmacklem@FreeBSD.org>	Allow the vnode to be unlocked for the weird case of LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a usecount on a vnode during NFSv4 recovery from an expired lease. Reported and tested by: pho MFC after: 2 weeks
# 26089666	06-Apr-2013	Jeff Roberson <jeff@FreeBSD.org>	Prepare to replace the buf splay with a trie: - Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists. No consumers need to find them there and it complicates the tree. These flags are all FFS specific and could be moved out of the buf cache. - Use pbgetvp() and pbrelvp() to associate the background and journal bufs with the vp. Not only is this much cheaper it makes more sense for these transient bufs. - Fix the assertions in pbget* and pbrel*. It's not safe to check list pointers which were never initialized. Use the BX flags instead. We also check B_PAGING in reassignbuf() so this should cover all cases. Discussed with: kib, mckusick, attilio Sponsored by: EMC / Isilon Storage Division
# 89f6b863	08-Mar-2013	Attilio Rao <attilio@FreeBSD.org>	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but a few notes are reported: * The KPI changes as follows: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. For this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI is heavily broken by this commit. Third-party ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
# 10b4bb0b	13-Jan-2013	Konstantin Belousov <kib@FreeBSD.org>	Add a trivial comment to record the proper commit log for r245407: Set the v_hash for a new vnode in the getnewvnode() to the value calculated based on the vnode structure address. Filesystems using vfs_hash_insert() override the v_hash using the standard formula of (inode_number + mnt_hashseed). For other filesystems, the initialization allows the vfs_hash_index() to provide useful hash too. Suggested, reviewed and tested by: peter Sponsored by: The FreeBSD Foundation MFC after: 5 days
# a41df848	13-Jan-2013	Konstantin Belousov <kib@FreeBSD.org>	diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c index 7c243b6..0bdaf36 100644 --- a/sys/kern/vfs_subr.c +++ b/sys/kern/vfs_subr.c @@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW, #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt) #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt) +static int vnsz2log; /* * Initialize the vnode management data structures. @@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW, static void vntblinit(void *dummy __unused) { + u_int i; int physvnodes, virtvnodes; /* @@ -332,6 +334,9 @@ vntblinit(void *dummy __unused) syncer_maxdelay = syncer_mask + 1; mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF); cv_init(&sync_wakeup, "syncer"); + for (i = 1; i <= sizeof(struct vnode); i <<= 1) + vnsz2log++; + vnsz2log--; } SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL); @@ -1067,6 +1072,14 @@ alloc: } rangelock_init(&vp->v_rl); + /* + * For the filesystems which do not use vfs_hash_insert(), + * still initialize v_hash to have vfs_hash_index() useful. + * E.g., nullfs uses vfs_hash_index() on the lower vnode for + * its own hashing. + */ + vp->v_hash = (uintptr_t)vp >> vnsz2log; + *vpp = vp; return (0); }
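The loop in the diff above leaves vnsz2log equal to floor(log2(sizeof(struct vnode))), so shifting a vnode address right by that amount discards the low bits that barely vary between objects of that size. A standalone sketch of the same arithmetic follows; the helper names size_to_log2() and addr_hash() are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: floor(log2(size)), using the same loop shape as
 * the diff (count the powers of two <= size, then subtract one).
 */
static unsigned
size_to_log2(size_t size)
{
	unsigned log = 0;
	size_t i;

	/* Guard against i <<= 1 wrapping to zero for huge sizes. */
	assert(size > 0 && size <= SIZE_MAX / 2);
	for (i = 1; i <= size; i <<= 1)
		log++;
	return (log - 1);
}

/* Hypothetical helper: derive a hash from an object's address. */
static uint32_t
addr_hash(const void *obj, unsigned shift)
{
	return ((uint32_t)((uintptr_t)obj >> shift));
}
```

For a 500-byte structure the shift is 8, so two objects laid out back to back generally land on different hash values, whereas the raw addresses would share their low-order bits.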
# c92c859b	26-Dec-2012	Attilio Rao <attilio@FreeBSD.org>	Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1 case. There is no point in optimizing further the code and use a TRUE literal for a path that does heavyweight stuff anyway (like lock acq), at the price of obfuscated code. Use the appropriate check where necessary and remove a macro. Sponsored by: EMC / Isilon storage division MFC after: 3 days
# b1308d72	21-Dec-2012	Attilio Rao <attilio@FreeBSD.org>	Fixup r218424: uio_yield() was scaling directly to userland priority. When kern_yield() was introduced with the possibility to specify a new priority, the behaviour changed by not lowering priority at all in the consumers, making the yielding mechanism highly ineffective for high priority kthreads like bufdaemon, syncer, vlrudaemon, etc. There is no evidence that consumers could bear such a change in semantics, and this situation could finally lead to bugs similar to the ones fixed in r244240. Re-specify userland pri for kthreads involved. Tested by: pho Reviewed by: kib, mdf MFC after: 1 week
# 14df601e	14-Dec-2012	Konstantin Belousov <kib@FreeBSD.org>	When mnt_vnode_next_active iterator cannot lock the next vnode and yields, specify the user priority for the yield. Otherwise, a higher-priority (kernel) thread could fall into the priority-inversion with the thread owning the mutex lock. On single-processor machines or UP kernels, do not loop adaptively when the next vnode cannot be locked, instead yield unconditionally. Restructure the iteration initializer and the iterator to remove code duplication. Put the code to fetch and lock a vnode next to the current marker, into the mnt_vnode_next_active() function, and use it instead of repeating the loop. Reported by: hrs, rmacklem Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 days
# 686ffcac	10-Dec-2012	Konstantin Belousov <kib@FreeBSD.org>	Do not yield while owning a mutex. The Giant reacquire in the kern_yield() is problematic then. The owned mutex is the mount interlock, and it is in fact not needed to guarantee the stability of the mount list of active vnodes, so fix the issue by only taking the mount interlock for MNT_REF and MNT_REL operations. While there, augment the unconditional yield by some amount of spinning [1]. Reported and tested by: pho Reviewed by: attilio Submitted by: attilio [1] MFC after: 3 days
# 07840861	03-Dec-2012	Konstantin Belousov <kib@FreeBSD.org>	The vnode_free_list_mtx is required unconditionally when iterating over the active list. The mount interlock is not enough to guarantee the validity of the tailq link pointers. The __mnt_vnode_next_active() and __mnt_vnode_first_active() active lists iterators helper functions did not provide the necessary stability for the list, allowing the iterators to pick garbage. This was uncovered after the r243599 made the active list iterators non-nop. Since a vnode interlock is before the vnode_free_list_mtx, obtain the vnode ilock in the non-blocking manner when under vnode_free_list_mtx, and restart iteration after the yield if the lock attempt failed. Assert that a vnode found on the list is active, and assert that the helpers return the vnode with interlock owned. Reported and tested by: pho MFC after: 1 week
# 3da9ab75	26-Nov-2012	David Xu <davidxu@FreeBSD.org>	Take first active vnode correctly. Reviewed by: kib MFC after: 3 days
# 6b991098	24-Nov-2012	Andriy Gapon <avg@FreeBSD.org>	assert_vop_locked: make the assertion race-free and more efficient this is really a minor improvement for the sake of correctness MFC after: 6 days
# 4f15bb67	22-Nov-2012	Andriy Gapon <avg@FreeBSD.org>	remove vop_lookup_pre and vop_lookup_post Suggested by: kib MFC after: 5 days
# 973b795b	19-Nov-2012	Attilio Rao <attilio@FreeBSD.org>	insmntque() is always called with the lock held in exclusive mode, then: - assume the lock is held in exclusive mode and remove a moot check about the lock acquisition. - in the destructor remove !MPSAFE specific chunk. Reviewed by: kib MFC after: 2 weeks
# ab49c952	19-Nov-2012	Andriy Gapon <avg@FreeBSD.org>	assert_vop_locked should treat LK_EXCLOTHER as the not locked case ... from a perspective of the current thread. Spotted by: mjg Discussed with: kib MFC after: 18 days
# c496727c	19-Nov-2012	Andriy Gapon <avg@FreeBSD.org>	vnode_if: fix locking protocol description for lookup and cachedlookup Also remove the checks from vop_lookup_pre and vop_lookup_post, which are now completely redundant (before this change they were partially redundant). Discussed with: kib MFC after: 10 days
# bc2258da	09-Nov-2012	Attilio Rao <attilio@FreeBSD.org>	Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag. Porters should refer to __FreeBSD_version 1000021 for this change as it may have happened at the same timeframe.
# 76fd782c	05-Nov-2012	Konstantin Belousov <kib@FreeBSD.org>	A clarification to the behaviour of the active vnode list management regarding the vnode page cleaning. In collaboration with: pho MFC after: 1 week
# 90af5793	04-Nov-2012	Konstantin Belousov <kib@FreeBSD.org>	Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command. MFC after: 3 weeks
# fb819415	04-Nov-2012	Konstantin Belousov <kib@FreeBSD.org>	Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command. MFC after: 3 days
# df3161c7	04-Nov-2012	Konstantin Belousov <kib@FreeBSD.org>	Order the enumeration of the MNT_ flags to be the same as the order of their definitions. MFC after: 3 days
# 5050aa86	22-Oct-2012	Konstantin Belousov <kib@FreeBSD.org>	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho
# 9b233e23	14-Oct-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a KPI to allow to reserve some amount of space in the numvnodes counter, without actually allocating the vnodes. The supposed use of the getnewvnode_reserve(9) is to reclaim enough free vnodes while the code still does not hold any resources that might be needed during the reclamation, and to consume the slack later for getnewvnode() calls made from the innards. After the critical block is finished, the caller shall free any reserve left, by getnewvnode_drop_reserve(9). Reviewed by: avg Tested by: pho MFC after: 1 week
# 0a15e5d3	13-Sep-2012	Attilio Rao <attilio@FreeBSD.org>	Remove all the checks on curthread != NULL with the exception of some MD trap checks (eg. printtrap()). Generally this check is not needed anymore, as there is not a legitimate case where curthread == NULL, after pcpu 0 area has been properly initialized. Reviewed by: bde, jhb MFC after: 1 week
# bcd5bb8e	09-Sep-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a facility for vgone() to inform the set of subscribed mounts about vnode reclamation. Typical use is for the bypass mounts like nullfs to get a notification about lower vnode going away. Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument lowervp which is reclaimed. It is possible to register several reclamation event listeners, to correctly handle the case of several nullfs mounts over the same directory. For the filesystem not having nullfs mounts over it, the overhead added is a single mount interlock lock/unlock in the vnode reclamation path. In collaboration with: pho MFC after: 3 weeks
# 258f9442	22-Aug-2012	Konstantin Belousov <kib@FreeBSD.org>	Provide some compat32 shims for sysctl vfs.conflist. It is required for getvfsbyname(3) operation when called from 32bit process, and getvfsbyname(3) is used by recent bsdtar import. Reported by: many Tested by: David Naylor <naylor.b.david@gmail.com> MFC after: 5 days
# 7adc598a	03-Jun-2012	Andriy Gapon <avg@FreeBSD.org>	free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG Those calls are useful with hardware watchdog drivers too. MFC after: 3 weeks
# 8f0e9130	30-May-2012	Konstantin Belousov <kib@FreeBSD.org>	Add a rangelock implementation, intended to be used to range-locking the i/o regions of the vnode data space. The implementation is quite simple-minded, it uses the list of the lock requests, ordered by arrival time. Each request may be for read or for write. The implementation is fair FIFO. MFC after: 2 month
# af6e6b87	23-Apr-2012	Edward Tomasz Napierala <trasz@FreeBSD.org>	Remove unused thread argument to vrecycle(). Reviewed by: kib
# c52fd858	23-Apr-2012	Edward Tomasz Napierala <trasz@FreeBSD.org>	Remove unused thread argument from vtruncbuf(). Reviewed by: kib
# dca5e0ec	20-Apr-2012	Kirk McKusick <mckusick@FreeBSD.org>	This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point to replace MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync routines. The vfs_msync routine is run every 30 seconds for every writably mounted filesystem. It ensures that any files mmap'ed from the filesystem with modified pages have those pages queued to be written back to the file from which they are mapped. The ffs_lazy_sync and qsync routines are run every 30 seconds for every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine ensures that any files that have been accessed in the previous 30 seconds have had their access times queued for updating in the filesystem. The qsync routine ensures that any files with modified quotas have those quotas queued to be written back to their associated quota file. In a system configured with 250,000 vnodes, less than 1000 are typically active at any point in time. Prior to this change all 250,000 vnodes would be locked and inspected twice every minute by the syncer. For UFS/FFS filesystems they would be locked and inspected six times every minute (twice by each of these three routines since each of these routines does its own pass over the vnodes associated with a mount point). With this change the syncer now locks and inspects only the tiny set of vnodes that are active. Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
# f257ebbb	20-Apr-2012	Kirk McKusick <mckusick@FreeBSD.org>	This change creates a new list of active vnodes associated with a mount point. Active vnodes are those with a non-zero use or hold count, e.g., those vnodes that are not on the free list. Note that this list is in addition to the list of all the vnodes associated with a mount point. To avoid adding another set of linkage pointers to the vnode structure, the active list uses the existing linkage pointers used by the free list (previously named v_freelist, now renamed v_actfreelist). This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
# 16165fee	18-Apr-2012	Kirk McKusick <mckusick@FreeBSD.org>	Delete a no longer useful VNASSERT missed during changes in 234400. Suggested by: kib
# 60005d66	18-Apr-2012	Kirk McKusick <mckusick@FreeBSD.org>	Fix a memory leak of M_VNODE_MARKER introduced in 234386. Found by: Peter Holm
# 73305eb8	17-Apr-2012	Kirk McKusick <mckusick@FreeBSD.org>	Drop export of vdestroy() function from kern/vfs_subr.c as it is used only as a helper function in that file. Replace sole call to vbusy() with inline code in vholdl(). Replace sole calls to vfree() and vdestroy() with inline code in vdropl(). The Clang compiler already inlines these functions, so they do not show up in a kernel backtrace which is confusing. Also you cannot set their frame in kgdb which means that it is impossible to view their local variables. So, while the produced code is unchanged, the debugging should be easier. Discussed with: kib MFC after: 2 weeks
# 71469bb3	17-Apr-2012	Kirk McKusick <mckusick@FreeBSD.org>	Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL. The primary changes are that the user of the interface no longer needs to manage the mount-mutex locking and that the vnode that is returned has its mutex locked (thus avoiding the need to check to see if it is DOOMED or other possible end-of-life scenarios). To minimize compatibility issues for third-party developers, the old MNT_VNODE_FOREACH interface will remain available so that this change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH will be removed in head. The reason for this update is to prepare for the addition of the MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the active vnodes associated with a mount point (typically less than 1% of the vnodes associated with the mount point). Reviewed by: kib Tested by: Peter Holm MFC after: 2 weeks
# ecb6e528	11-Apr-2012	Kirk McKusick <mckusick@FreeBSD.org>	Export vinactive() from kern/vfs_subr.c (e.g., make it no longer static and declare its prototype in sys/vnode.h) so that it can be called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c) instead of the body of vinactive() being cut and pasted into process_deferred_inactive(). Reviewed by: kib MFC after: 2 weeks
# 38ddb572	08-Mar-2012	Konstantin Belousov <kib@FreeBSD.org>	Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which allows a filesystem to request VFS to not allow MNTK_ASYNC. MFC after: 1 week
# 662c901c	25-Feb-2012	Mikolaj Golub <trociny@FreeBSD.org>	When detaching a unix domain socket, uipc_detach() checks the unp->unp_vnode pointer to detect if there is a vnode associated with (bound to) this socket and does the necessary cleanup if there is. The issue is that after a forced unmount this check may be too late, as the unp_vnode is reclaimed and the reference is stale. To fix this, provide a helper function that is called on socket vnode reclamation to do the necessary cleanup. Pointed by: kib Reviewed by: kib MFC after: 2 weeks
# c480f781	06-Feb-2012	Konstantin Belousov <kib@FreeBSD.org>	Current implementations of sync(2) and syncer vnode fsync() VOP use the mnt_noasync counter to temporarily remove the MNTK_ASYNC mount option, which is needed to guarantee a synchronous completion of the initiated i/o before syscall or VOP return. Global removal of the MNTK_ASYNC option is harmful because not only i/o started from the corresponding thread becomes synchronous, but all i/o is synchronous on the filesystem which is initiated during sync(2) or syncer activity. Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local thread flag to disable async i/o for the current thread only. Use the opportunity to move the DOINGASYNC() macro into sys/vnode.h and consistently use it through places which tested for MNTK_ASYNC. Some testing demonstrated 60-70% improvements in run time for the metadata-intensive operations on async-mounted UFS volumes, but still with great deviation due to other reasons. Reviewed by: mckusick Tested by: scottl MFC after: 2 weeks
# abc942b5	25-Jan-2012	Konstantin Belousov <kib@FreeBSD.org>	When doing vflush(WRITECLOSE), clean vnode pages. Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is still a race allowing a process to dirty pages after msync finished. Remounts rw->ro just left dirty pages in system. Reviewed by: alc, tegge (long time ago) Tested by: pho MFC after: 2 weeks
# cc672d35	16-Jan-2012	Kirk McKusick <mckusick@FreeBSD.org>	Make sure all intermediate variables holding mount flags (mnt_flag) and that all internal kernel calls passing mount flags are declared as uint64_t so that flags in the top 32-bits are not lost. MFC after: 2 weeks
# 908cac07	06-Jan-2012	John Baldwin <jhb@FreeBSD.org>	Use proper argument structure types for the extattr post-VOP hooks. The wrong structure happened to work since the only argument used was the vnode which is in the same place in both VOP_SETATTR() and the two extattr VOPs. MFC after: 3 days
# f0d6c5ca	23-Dec-2011	John Baldwin <jhb@FreeBSD.org>	Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended attributes of a vnode are changed. Note that OS X already implements this behavior. Reviewed by: rwatson MFC after: 2 weeks
# 936c09ac	03-Nov-2011	John Baldwin <jhb@FreeBSD.org>	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month
# 62238a67	27-Oct-2011	John Baldwin <jhb@FreeBSD.org>	Whitespace fix.
# 4c11f091	25-Oct-2011	Pawel Jakub Dawidek <pjd@FreeBSD.org>	The v_data field is a pointer, so set it to NULL, not 0. MFC after: 3 days
# 25e33e62	07-Oct-2011	Jonathan Anderson <jonathan@FreeBSD.org>	Change one printf() to log(). As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console. Approved by: mentor (rwatson) Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru> MFC after: 5 days
# 837b4d46	04-Oct-2011	Konstantin Belousov <kib@FreeBSD.org>	Move parts of the commit log for r166167, where Tor explained the interaction between vnode locks and vfs_busy(), into comment. MFC after: 1 week
# 6aba400a	25-Aug-2011	Attilio Rao <attilio@FreeBSD.org>	Fix a deficiency in the selinfo interface: If a selinfo object is recorded (via selrecord()) and then it is quickly destroyed, with the waiters missing the opportunity to awake, at the next iteration they will find the selinfo object destroyed, causing a PF#. That happens because the selinfo interface has no way to drain the waiters before destroying the registered selinfo object. Also this race is quite rare to get in practice, because it would require a selrecord(), a poll request by another thread and a quick destruction of the selrecord()'ed selinfo object. Fix this by adding the seldrain() routine which should be called before destroying the selinfo objects (in order to avoid such case), and fix the present cases where it might have already been called. Sometimes, the context is safe enough to prevent this type of race, like it happens in device drivers which install selinfo objects on poll callbacks. There, the destruction of the selinfo object happens at driver detach time, when all the filedescriptors should be already closed, thus there cannot be a race. For this case, mfi(4) device driver can be set as an example, as it implements a full correct logic for preventing this from happening. Sponsored by: Sandvine Incorporated Reported by: rstone Tested by: pluknet Reviewed by: jhb, kib Approved by: re (bz) MFC after: 3 weeks
# d716efa9	24-Jul-2011	Kirk McKusick <mckusick@FreeBSD.org>	Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag so that it is visible to userland programs. This change enables the `mount' command with no arguments to be able to show if a filesystem is mounted using journaled soft updates as opposed to just normal soft updates. Approved by: re (bz)
# 6bbee8e2	29-Jun-2011	Alan Cox <alc@FreeBSD.org>	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib
# 5e26234f	24-Jun-2011	Jonathan Anderson <jonathan@FreeBSD.org>	Tidy up a capabilities-related comment. This comment refers to an #ifdef that hasn't been merged [yet?]; remove it. Approved by: rwatson
# 3d08a76b	12-May-2011	Matthew D Fleming <mdf@FreeBSD.org>	Use a name instead of a magic number for kern_yield(9) when the priority should not change. Fetch the td_user_pri under the thread lock. This is probably not necessary but a magic number also seems preferable to knowing the implementation details here. Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
# 2be767e0	28-Apr-2011	Attilio Rao <attilio@FreeBSD.org>	Add the watchdogs patting during the (shutdown time) disk syncing and disk dumping. With the option SW_WATCHDOG on, these operations are doomed to let watchdog fire, if they take too long. I implemented the stubs this way because I really want wdog_kern_* KPI to not be dependent on SW_WATCHDOG being on (and really, the option only enables watchdog activation in hardclock) and also avoid calling them when not necessary (avoiding involuntary watchdog activations). Sponsored by: Sandvine Incorporated Discussed with: emaste, des MFC after: 2 weeks
# c65c068a	23-Apr-2011	Rick Macklem <rmacklem@FreeBSD.org>	Fix a LOR in vfs_busy() where, after msleeping, it would lock the mutexes in the wrong order for the case where the MBF_MNTLSTLOCK is set. I believe this did have the potential for deadlock. For example, if multiple nfsd threads called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK. Thanks go to pho for catching this during his testing. Tested by: pho Submitted by: kib MFC after: 2 weeks
# 443db695	04-Apr-2011	Sergey Kandaurov <pluknet@FreeBSD.org>	Remove malloc type M_NETADDR unused since splitting into vfs_subr.c and vfs_export.c. MFC after: 1 week
# fd7032e1	08-Mar-2011	Konstantin Belousov <kib@FreeBSD.org>	Do not assert buffer lock in VFS_STRATEGY() when kernel already panicked. Sponsored by: The FreeBSD Foundation MFC after: 1 week
# e7ceb1e9	07-Feb-2011	Matthew D Fleming <mdf@FreeBSD.org>	Based on discussions on the svn-src mailing list, rework r218195: - entirely eliminate some calls to uio_yield() as being unnecessary, such as in a sysctl handler. - move should_yield() and maybe_yield() to kern_synch.c and move the prototypes from sys/uio.h to sys/proc.h - add a slightly more generic kern_yield() that can replace the functionality of uio_yield(). - replace source uses of uio_yield() with the functional equivalent, or in some cases do not change the thread priority when switching. - fix a logic inversion bug in vlrureclaim(), pointed out by bde@. - instead of using the per-cpu last switched ticks, use a per thread variable for should_yield(). With PREEMPTION, the only reasonable use of this is to determine if a lock has been held a long time and relinquish it. Without PREEMPTION, this is essentially the same as the per-cpu variable.
# 08b163fa	02-Feb-2011	Matthew D Fleming <mdf@FreeBSD.org>	Put the general logic for being a CPU hog into a new function should_yield(). Use this in various places. Encapsulate the common case of check-and-yield into a new function maybe_yield(). Change several checks for a magic number of iterations to use should_yield() instead. MFC after: 1 week
# dbccdf76	25-Jan-2011	Konstantin Belousov <kib@FreeBSD.org>	When vtruncbuf() iterates over the vnode buffer list, lock buffer object before checking the validity of the next buffer pointer. Otherwise, the buffer might be reclaimed after the check, causing iteration to run into wrong buffer. Reported and tested by: pho MFC after: 1 week
# 2fee06f0	18-Jan-2011	Matthew D Fleming <mdf@FreeBSD.org>	Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need to rely on the format string.
# fbbb13f9	12-Jan-2011	Matthew D Fleming <mdf@FreeBSD.org>	sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly. Commit the kernel changes.
# a8f4344f	06-Jan-2011	John Baldwin <jhb@FreeBSD.org>	- Restore dropping the priority of syncer down to PPAUSE when it is idle. This was lost when it was converted to using a condition variable instead of lbolt. - Drop the priority of flowtable down to PPAUSE when it is idle as well since it is a similar background task. MFC after: 2 weeks
# 7dbb59c7	26-Dec-2010	Konstantin Belousov <kib@FreeBSD.org>	Teach ddb "show mount" about MNTK_SUJ flag.
# f5eb95b1	23-Nov-2010	Konstantin Belousov <kib@FreeBSD.org>	Allow shared-locked vnode to be passed to vunref(9). When shared-locked vnode is supplied as an argument to vunref(9) and resulting usecount is 0, set VI_OWEINACT and do not try to upgrade vnode lock. The latter could cause vnode unlock, allowing the vnode to be reclaimed meantime. Tested by: pho MFC after: 1 week
# 730b63b0	19-Nov-2010	Konstantin Belousov <kib@FreeBSD.org>	Remove prtactive variable and related printf()s in the vop_inactive and vop_reclaim() methods. They seem to be unused, and the reported situation is normal for the forced unmount. MFC after: 1 week X-MFC-note: keep prtactive symbol in vfs_subr.c
# 8d065a39	14-Nov-2010	Rebecca Cran <brucec@FreeBSD.org>	Fix some more style(9) issues.
# b389be97	14-Nov-2010	Rebecca Cran <brucec@FreeBSD.org>	Fix style(9) issues from r215281 and r215282. MFC after: 1 week
# 5d7abc87	14-Nov-2010	Rebecca Cran <brucec@FreeBSD.org>	Add descriptions to some more sysctls. PR: kern/148510 MFC after: 1 week
# a7d5f7eb	19-Oct-2010	Jamie Gritton <jamie@FreeBSD.org>	A new jail(8) with a configuration file, to replace the work currently done by /etc/rc.d/jail.
# 9a24dc07	11-Sep-2010	Konstantin Belousov <kib@FreeBSD.org>	Protect mnt_syncer with the sync_mtx. This prevents a (rare) vnode leak when mount and update are executed in parallel. Encapsulate syncer vnode deallocation into the helper function vfs_deallocate_syncvnode(), to not externalize sync_mtx from vfs_subr.c. Found and reviewed by: jh (previous version of the patch) Tested by: pho MFC after: 3 weeks
# e5ddf115	01-Sep-2010	Ed Maste <emaste@FreeBSD.org>	As long as we are going to panic anyway, there's no need to hide additional information behind DIAGNOSTIC.
# de478dd4	30-Aug-2010	Jaakko Heinonen <jh@FreeBSD.org>	execve(2) has a special check for file permissions: a file must have at least one execute bit set, otherwise execve(2) will return EACCES even for a user with PRIV_VFS_EXEC privilege. Add the check also to vaccess(9), vaccess_acl_nfs4(9) and vaccess_acl_posix1e(9). This makes access(2) better agree with execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check to zfs_freebsd_access() too. There may be other file systems which are not using vaccess*() functions and need to be handled separately. PR: kern/125009 Reviewed by: bde, trasz Approved by: pjd (ZFS part)
# c87f1ad4	28-Aug-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	There is a bug in vfs_allocate_syncvnode() failure handling in mount code. Actually it is hard to properly handle such a failure, especially in MNT_UPDATE case. The only reason for the vfs_allocate_syncvnode() function to fail is getnewvnode() failure. Fortunately it is impossible for current implementation of getnewvnode() to fail, so we can assert this and make vfs_allocate_syncvnode() void. This in turn frees us from handling its failures in the mount code. Reviewed by: kib MFC after: 1 month
# 3beb1b72	12-Aug-2010	Konstantin Belousov <kib@FreeBSD.org>	The buffer's b_vflags field is not always properly protected by bufobj lock. If b_bufobj is not NULL, then bufobj lock should be held when manipulating the flags. Not doing this sometimes leaves BV_BKGRDINPROG erroneously set, causing softdep's getdirtybuf() to get stuck indefinitely in "getbuf" sleep, waiting for background write to finish which is not actually performed. Add BO_LOCK() in the cases where it was missed. In collaboration with: pho Tested by: bz Reviewed by: jeff MFC after: 1 month
# 3b156706	03-Aug-2010	Alan Cox <alc@FreeBSD.org>	In order for MAXVNODES_MAX to be an "int" on powerpc and sparc, we must cast PAGE_SIZE to an "int". (Powerpc and sparc, unlike the other architectures, define PAGE_SIZE as a "long".) Submitted by: Andreas Tobler
# 1d7fe4b5	02-Aug-2010	Alan Cox <alc@FreeBSD.org>	Update the "desiredvnodes" calculation. In particular, make the part of the calculation that is based on the kernel's heap size more conservative. Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the time being set MAXVNODES_MAX to a large value. Reviewed by: jhb@ MFC after: 6 weeks
# 60ae52f7	21-Jun-2010	Ed Schouten <ed@FreeBSD.org>	Use ISO C99 integer types in sys/kern where possible. There are only about 100 occurrences of the BSD-specific u_int*_t datatypes in sys/kern. The ISO C99 integer types are used here more often.
# 52e50b42	18-Jun-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	MFC r209265: r209260: Backout r207970 for now, it can lead to deadlocks. Reported by: kan r209261: Turn off UMA allocations on all archs by default. It isn't stable even on amd64. Reported by: many Approved by: re (kib)
# d32ef791	17-Jun-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Backout r207970 for now, it can lead to deadlocks. Reported by: kan MFC after: 3 days
# 882da14c	03-Jun-2010	Konstantin Belousov <kib@FreeBSD.org>	Sometimes vnodes share the lock despite being different vnodes on different mount points, e.g. the nullfs vnode and the covered vnode from the lower filesystem. In this case, existing assertion in vop_rename_pre() may be triggered. Check for vnode lock equality instead of the vnodes themselves to not trip over the situation. Submitted by: Mikolaj Golub <to.my.trociny@gmail.com> Tested by: pho MFC after: 2 weeks
# 665b912a	24-May-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	MFC r207920,r207934,r207936,r207937,r207970,r208142,r208147,r208148,r208166, r208454,r208455,r208458: r207920: Back out r205134. It is not stable. r207934: Add missing new line characters to the warnings. r207936: Even though r203504 eliminates taste traffic provoked by vdev_geom.c, ZFS still likes to open all vdevs, close them and open them again, which in turn provokes taste traffic anyway. I don't know of any clean way to fix it, so do it the hard way - if we can't open provider for writing just retry 5 times with 0.5 pauses. This should eliminate accidental races caused by other classes tasting providers created on top of our vdevs. Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org> Reported by: Yuri Pankov <yuri.pankov@gmail.com> r207937: I added vfs_lowvnodes event, but it was only used for a short while and now it is totally unused. Remove it. r207970: When there is no memory or KVA, try to help by reclaiming some vnodes. This helps with 'kmem_map too small' panics. No objections from: kib Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua> r208142: The whole point of having dedicated worker thread for each leaf VDEV was to avoid calling zio_interrupt() from geom_up thread context. It turns out that when provider is forcibly removed from the system and we kill worker thread there can still be some ZIOs pending. To complete pending ZIOs when there is no worker thread anymore we still have to call zio_interrupt() from geom_up context. To avoid this race just remove use of worker threads altogether. This should be more or less fine, because I also thought that zio_interrupt() does more work, but it only makes small UMA allocation with M_WAITOK. It also saves one context switch per I/O request. PR: kern/145339 Reported by: Alex Bakhtin <Alex.Bakhtin@gmail.com> r208147: Add task structure to zio and use it instead of allocating one. 
This eliminates the only place where we can sleep when calling zio_interrupt(). As a side-effect this can actually improve performance a little as we allocate one less thing for every I/O. Prodded by: kib r208148: Allow to configure UMA usage for ZIO data via loader and turn it on by default for amd64. On i386 I saw performance degradation when UMA was used, but for amd64 it should help. r208166: Fix userland build by making io_task available only for the kernel and by providing taskq_dispatch_safe() macro. r208454: Remove ZIO_USE_UMA from arc.c as well. r208455: ZIO_USE_UMA is no longer used. r208458: Create UMA zones unconditionally.
# 7fd32ea9	12-May-2010	Zachary Loafman <zml@FreeBSD.org>	Add VOP_ADVLOCKPURGE so that the file system is called when purging locks (in the case where the VFS impl isn't using lf_*) Submitted by: Matthew Fleming <matthew.fleming@isilon.com> Reviewed by: zml, dfr
# 408a7c50	12-May-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	When there is no memory or KVA, try to help by reclaiming some vnodes. This helps with 'kmem_map too small' panics. No objections from: kib Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua> MFC after: 1 week
# c60c36a7	11-May-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	I added vfs_lowvnodes event, but it was only used for a short while and now it is totally unused. Remove it. MFC after: 3 days
# 113db2dd	24-Apr-2010	Jeff Roberson <jeff@FreeBSD.org>	- Merge soft-updates journaling from projects/suj/head into head. This brings in support for an optional intent log which eliminates the need for background fsck on unclean shutdown. Sponsored by: iXsystems, Yahoo!, and Juniper. With help from: McKusick and Peter Holm
# a515de67	18-Apr-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	MFC r206160 by jh@: Add missing MNT_NFS4ACLS.
# bb9a8424	09-Apr-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r206093: Add function vop_rename_fail(9) that performs needed cleanup for locks and references of the VOP_RENAME(9) arguments. Use vop_rename_fail() in deadfs_rename().
# 0e9bd417	04-Apr-2010	Jaakko Heinonen <jh@FreeBSD.org>	Add missing MNT_NFS4ACLS.
# b9d8d691	03-Apr-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Fix some whitespace nits.
# 000026c8	03-Apr-2010	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Add missing mnt_kern_flag flags in 'show mount' output.
# ea015880	02-Apr-2010	Konstantin Belousov <kib@FreeBSD.org>	Add function vop_rename_fail(9) that performs needed cleanup for locks and references of the VOP_RENAME(9) arguments. Use vop_rename_fail() in deadfs_rename(). Tested by: Mikolaj Golub MFC after: 1 week
# bd7ae209	27-Mar-2010	Edward Tomasz Napierala <trasz@FreeBSD.org>	MFC r197680: Provide default implementation for VOP_ACCESS(9), so that filesystems which want to provide VOP_ACCESSX(9) don't have to implement both. Note that this commit makes implementation of either of these two mandatory. Reviewed by: kib
# e9560b40	07-Feb-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r202528: Add vunref(9).
# d2f334bf	17-Jan-2010	Konstantin Belousov <kib@FreeBSD.org>	Add new function vunref(9) that decrements vnode use count (and hold count) while vnode is exclusively locked. The code for vput(9), vrele(9) and vunref(9) is merged. In collaboration with: pho Reviewed by: alc MFC after: 3 weeks
# 2d63cbda	10-Jan-2010	Konstantin Belousov <kib@FreeBSD.org>	MFC r200770: Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects.
# e6b37c3a	31-Dec-2009	Konstantin Belousov <kib@FreeBSD.org>	MFC r201134: Add a knob to allow reclaim of the directory vnodes that are source of the namecache records.
# a4117865	28-Dec-2009	Konstantin Belousov <kib@FreeBSD.org>	Add a knob to allow reclaim of the directory vnodes that are source of the namecache records. The reclamation is not enabled by default because for typical workload it would make namecache unusable, but large nested directory tree easily puts any process that accesses filesystem into 1 second wait for vlru. Reported by: yar (long time ago) MFC after: 3 days
# 558e9b5c	26-Dec-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND is not being used improperly.
# 49e3050e	20-Dec-2009	Konstantin Belousov <kib@FreeBSD.org>	VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object flag. Besides providing redundant information, the need to update both vnode and object flags causes more acquisitions of the vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects. Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects. Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks
# a9533616	04-Dec-2009	Jaakko Heinonen <jh@FreeBSD.org>	MFC r199529: Extend ddb(4) "show mount" command to print active string mount options. Note that only option names are printed, not values. Approved by: trasz (mentor)
# 10d843a4	19-Nov-2009	Jaakko Heinonen <jh@FreeBSD.org>	Extend ddb(4) "show mount" command to print active string mount options. Note that only option names are printed, not values. Reviewed by: pjd Approved by: trasz (mentor) MFC after: 2 weeks
# 2c29cfa0	01-Oct-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	Provide default implementation for VOP_ACCESS(9), so that filesystems which want to provide VOP_ACCESSX(9) don't have to implement both. Note that this commit makes implementation of either of these two mandatory. Reviewed by: kib
# e76d823b	12-Sep-2009	Robert Watson <rwatson@FreeBSD.org>	Use C99 initialization for struct filterops. Obtained from: Mac OS X Sponsored by: Apple Inc. MFC after: 3 weeks
# 3c9d279b	12-Sep-2009	Konstantin Belousov <kib@FreeBSD.org>	MFC r197030: In vfs_mark_atime(9), be resistant against reclaimed vnodes. Assert that necessary locks are taken, since vop might not be called. Approved by: re (kensmith)
# 427992ec	09-Sep-2009	Konstantin Belousov <kib@FreeBSD.org>	In vfs_mark_atime(9), be resistant against reclaimed vnodes. Assert that necessary locks are taken, since vop might not be called. Tested by: pho MFC after: 3 days
# f0899a34	02-Jul-2009	Jamie Gritton <jamie@FreeBSD.org>	Call prison_check from vfs_suser rather than re-implementing it. Approved by: re (kib), bz (mentor)
# d8b0556c	10-Jun-2009	Konstantin Belousov <kib@FreeBSD.org>	Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use vnode interlock to protect the knote fields [1]. The locking assumes that shared vnode lock is held, thus we get exclusive access to knote either by exclusive vnode lock protection, or by shared vnode lock + vnode interlock. Do not use kl_locked() method to assert either lock ownership or the fact that curthread does not own the lock. For shared locks, ownership is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared lock not owned by curthread, causing false positives in kqueue subsystem assertions about knlist lock. Remove kl_locked method from knlist lock vector, and add two separate assertion methods kl_assert_locked and kl_assert_unlocked, that are supposed to use proper asserts. Change knlist_init accordingly. Add convenience function knlist_init_mtx to reduce number of arguments for typical knlist initialization. Submitted by: jhb [1] Noted by: jhb [2] Reviewed by: jhb Tested by: rnoland
# bcf11e8d	05-Jun-2009	Robert Watson <rwatson@FreeBSD.org>	Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd
# faef64cc	30-May-2009	Attilio Rao <attilio@FreeBSD.org>	Remove the now invalid (and possibly unused) debug.mpsafevfs sysctl/tunable. Reviewed by: emaste Sponsored by: Sandvine Incorporated
# c97fcdba	30-May-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add VOP_ACCESSX, which can be used to query for newly added V* permissions, such as VWRITE_ACL. For filesystems that don't implement it, there is a default implementation, which works as a wrapper around VOP_ACCESS. Reviewed by: rwatson@
# 0304c731	27-May-2009	Jamie Gritton <jamie@FreeBSD.org>	Add hierarchical jails. A jail may further virtualize its environment by creating a child jail, which is visible to that jail and to any parent jails. Child jails may be restricted more than their parents, but never less. Jail names reflect this hierarchy, being MIB-style dot-separated strings. Every thread now points to a jail, the default being prison0, which contains information about the physical system. Prison0's root directory is the same as rootvnode; its hostname is the same as the global hostname, and its securelevel replaces the global securelevel. Note that the variable "securelevel" has actually gone away, which should not cause any problems for code that properly uses securelevel_gt() and securelevel_ge(). Some jail-related permissions that were kept in global variables and set via sysctls are now per-jail settings. The sysctls still exist for backward compatibility, used only by the now-deprecated jail(2) system call. Approved by: bz (mentor)
# dfd233ed	11-May-2009	Attilio Rao <attilio@FreeBSD.org>	Remove the thread argument from the FSD (File-System Dependent) parts of the VFS. Now all the VFS_* functions and related parts don't want the context as long as it always refers to curthread. In some points, in particular when dealing with VOPs and functions living in the same namespace (eg. vflush) which still need to be converted, pass curthread explicitly in order to retain the old behaviour. Such loose ends will be fixed ASAP. While here fix a bug: now, UFS_EXTATTR can be compiled alone without the UFS_EXTATTR_AUTOSTART option. VFS KPI is heavily changed by this commit so third-party modules need to be recompiled. Bump __FreeBSD_version in order to signal such situation.
# 607fc40b	29-Mar-2009	Alexander Kabaev <kan@FreeBSD.org>	Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache in directory vnodes. Allow namecache dotdot entry to be created pointing from child vnode to parent vnode if no existing links in opposite direction exist. Use direct link from parent to child for dotdot lookups otherwise. This restores more efficient dotdot caching in NFS filesystems which was lost when vnodes stopped being type stable. Reviewed by: kib
# 5ab4bb35	02-Mar-2009	Alexander Kabaev <kan@FreeBSD.org>	Change vfs_busy to wait until an outcome of pending unmount operation is known and to retry or fail accordingly to that outcome. This fixes the problem with namespace traversing programs failing with random ENOENT errors if someone just happened to try to unmount that same filesystem at the same time. Reported by: dhw Reviewed by: kib, attilio Sponsored by: Juniper Networks, Inc.
# 8941aad1	06-Feb-2009	John Baldwin <jhb@FreeBSD.org>	Tweak the output of VOP_PRINT/vn_printf() some. - Align the fifo output in fifo_print() with other vn_printf() output. - Remove the leading space from lockmgr_printinfo() so its output lines up in vn_printf(). - lockmgr_printinfo() now ends with a newline, so remove an extra newline from vn_printf().
# ec48c16f	06-Feb-2009	Edward Tomasz Napierala <trasz@FreeBSD.org>	Add KASSERTs to make it easier to debug problems like the one fixed in r188141. Reviewed by: kib,attilio Approved by: rwatson (mentor) Tested by: pho Sponsored by: FreeBSD Foundation
# feabc903	05-Feb-2009	Attilio Rao <attilio@FreeBSD.org>	Add more KTR_VFS logging point in order to have a more effective tracing. Reviewed by: brueffer, kib Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>
# 91082624	23-Jan-2009	John Baldwin <jhb@FreeBSD.org>	Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not updating va_atime via VOP_SETATTR() isn't always synchronous. For some filesystems it is asynchronous. Suggested by: bde
# 645f1f4e	23-Jan-2009	John Baldwin <jhb@FreeBSD.org>	Push down Giant in the vnlru kproc main loop so that it is only acquired around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically, vnlru no longer needs Giant for the common case of waking up and deciding there is nothing for it to do. MFC after: 2 weeks
# 1c570a0c	21-Jan-2009	John Baldwin <jhb@FreeBSD.org>	Fix a few style bogons. Submitted by: bde
# beace176	21-Jan-2009	John Baldwin <jhb@FreeBSD.org>	Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP: VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME can be performed while holding a shared vnode lock (the same functionality is done internally by VOP_READ which can run with a shared vnode lock). Add missing locking of the vnode interlock to the ufs implementation and remove a special note and test from the NFS client about not supporting the feature. Inspired by: ups Tested by: pho
# 9316467d	20-Jan-2009	Konstantin Belousov <kib@FreeBSD.org>	FFS puts the extended attributes blocks at the negative blocks for the vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it incorrectly does vm_object_page_remove(0, 0), removing all pages from the underlying vm object, not only the pages that back the extended attributes data. Change vinvalbuf() to not remove any pages from the object when V_NORMAL or V_ALT are specified. Instead, the only in-tree caller in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly removes the corresponding page range. The V_NORMAL caller does vnode_pager_setsize(vp, 0) immediately after the call to vinvalbuf(V_NORMAL) already. Reported by: csjp Reviewed by: ups MFC after: 3 weeks
# 4a0f8076	16-Dec-2008	Attilio Rao <attilio@FreeBSD.org>	1) Fix a deadlock in the VFS: - threadA runs vfs_rel(mp1) - threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drops MNT_ILOCK() - threadA runs vfs_busy(mp1) and, as long as MNTK_UNMOUNT is set, sleeps waiting for threadB to complete the unmount - threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting for the refcount to expire. Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the unmounter is waiting for mnt_ref to expire. The vfs_busy contenders are awakened, fail, and if they retry the MNTK_REFEXPIRE won't allow them to sleep again. 2) Simplify significantly the code of vfs_mount_destroy() trimming unnecessary code: - as long as any reference exited, it is no longer possible to have write-ops (primary and secondary) in progress. - it is not needed to drop and reacquire the mount lock. - filling the structures with dummy values is useless as long as it is going to be freed. Tested by: pho, Andrea Barberio <insomniac at slackware dot it> Discussed with: kib
# 61791644	29-Nov-2008	Konstantin Belousov <kib@FreeBSD.org>	In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer to the fs, but before a vnode on the fs is locked, unmount may free fs structures, causing access to destroyed data and freed memory. Introduce a vfs_busymp() function that looks up and busies found fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the implementation of the handle syscalls. Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular, sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl handler, that prevents Giant-locked unmount code to interfere with it. Noted by: tegge Reviewed by: dfr Tested by: pho MFC after: 1 month
# 1ba4a712	17-Nov-2008	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes. This bring huge amount of changes, I'll enumerate only user-visible changes: - Delegated Administration Allows regular users to perform ZFS operations, like file system creation, snapshot creation, etc. - L2ARC Level 2 cache for ZFS - allows to use additional disks for cache. Huge performance improvements mostly for random read of mostly static content. - slog Allow to use additional disks for ZFS Intent Log to speed up operations like fsync(2). - vfs.zfs.super_owner Allows regular users to perform privileged operations on files stored on ZFS file systems owned by him. Very careful with this one. - chflags(2) Not all the flags are supported. This still needs work. - ZFSBoot Support to boot off of ZFS pool. Not finished, AFAIK. Submitted by: dfr - Snapshot properties - New failure modes Before if write requested failed, system paniced. Now one can select from one of three failure modes: - panic - panic on write error - wait - wait for disk to reappear - continue - serve read requests if possible, block write requests - Refquota, refreservation properties Just quota and reservation properties, but don't count space consumed by children file systems, clones and snapshots. - Sparse volumes ZVOLs that don't reserve space in the pool. - External attributes Compatible with extattr(2). - NFSv4-ACLs Not sure about the status, might not be complete yet. Submitted by: trasz - Creation-time properties - Regression tests for zpool(8) command. Obtained from: OpenSolaris
# 30f60d8c	03-Nov-2008	Attilio Rao <attilio@FreeBSD.org>	Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless. Really, the concept of holdcnt in the struct mount is represented by the mnt_ref (which prevents the type-stable structure from being "recycled") handled through vfs_ref() and vfs_rel(). In this light, switch the holdcnt acquisition into an emulated vfs_ref() (and subsequent release into vfs_rel()). Discussed with: kib Tested by: pho
# 83b3bdbc	02-Nov-2008	Attilio Rao <attilio@FreeBSD.org>	Improve VFS locking: - Implement real draining for vfs consumers by not relying on the mnt_lock and using instead a refcount in order to keep track of lock requesters. - Due to the change above, remove the mnt_lock lockmgr because it is now useless. - Due to the change above, vfs_busy() is no more linked to a lockmgr. Change its KPI by removing the interlock argument and defining 2 new flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the old version (which was unlinked from the lockmgr already) and MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx once the mnt interlock is held (ability still desired by most consumers). - The stub used in vfs_mount_destroy(), that allows to override the mnt_ref if running for more than 3 seconds, makes it totally useless. Remove it as it was thought to work in older versions. If a problem of "refcount held never going away" should appear, we will need to fix it properly rather than trust such a hackish solution. - Fix a bug where returning (with an error) from dounmount() was still leaving the MNTK_MWAIT flag on even if the waiters were actually woken up. Just a place in vfs_mount_destroy() is left because it is going to recycle the structure in any case, so it doesn't matter. - Remove the markercnt refcount as it is useless. This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and __FreeBSD_version will be modified accordingly. Discussed with: kib Tested by: pho
# 15bc6b2b	28-Oct-2008	Edward Tomasz Napierala <trasz@FreeBSD.org>	Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary to add more V* constants, and the variables changed by this patch were often being assigned to mode_t variables, which is 16 bit. Approved by: rwatson (mentor)
# 7cd5a03a	27-Oct-2008	Konstantin Belousov <kib@FreeBSD.org>	Style return statements in vn_pollrecord().
# ae53539e	27-Oct-2008	Konstantin Belousov <kib@FreeBSD.org>	Protect check for v_pollinfo == NULL and assignment of the newly allocated vpollinfo with vnode interlock. Fully initialize vpollinfo before putting pointer to it into vp->v_pollinfo. Discussed with: dwhite Tested by: pho MFC after: 1 week
# 3cfc3089	20-Oct-2008	Konstantin Belousov <kib@FreeBSD.org>	In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always tries to obtain a shared lock on mnt_lock, the other user is unmount who tries to drain it, setting MNTK_UNMOUNT before. Reviewed by: tegge, attilio Tested by: pho MFC after: 2 weeks
# d7f03759	19-Oct-2008	Ulf Lilleengen <lulf@FreeBSD.org>	- Import the HEAD csup code which is the basis for the cvsmode work.
# 0d7935fd	10-Oct-2008	Attilio Rao <attilio@FreeBSD.org>	Remove the useless struct thread argument from the bufobj interface. In particular, the KPI of the following functions is modified: - bufobj_invalbuf() - bufsync() and BO_SYNC() "virtual method" of the buffer objects set. Main consumers of bufobj functions are affected by this change too and, in particular, functions which changed their KPI are: - vinvalbuf() - g_vfs_close() Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit. As a side note, please consider the 'curthread' argument passing to VOP_SYNC() (in bufsync()) just temporary, as it will be axed out ASAP Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
# 59d49325	31-Aug-2008	Attilio Rao <attilio@FreeBSD.org>	Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions. Manpages are updated accordingly. Tested by: Diego Sardina <siarodx at gmail dot com>
# 0359a12e	28-Aug-2008	Attilio Rao <attilio@FreeBSD.org>	Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread was always curthread and totally useless. Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
# a888d54d	28-Aug-2008	Konstantin Belousov <kib@FreeBSD.org>	Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmntque() function to ignore the unmounting and forces insertion of the vnode into the mount vnode list. Change insmntque() to fail when forced unmount is in progress and VV_FORCEINSMQ is not specified. Add an assertion to the insmntque(), requiring the vnode to be exclusively locked for mp-safe filesystems. Use the VV_FORCEINSMQ for the creation of the syncvnode. Tested by: pho Reviewed by: tegge MFC after: 1 month
# e4517337	24-Aug-2008	Christian S.J. Peron <csjp@FreeBSD.org>	Remove worrying printf warning on bootup when processing vnodes which have NULL mount-points. This is the case for special vnodes, such as the one used in nameiinit() which is used for crossing mount points in lookup() to avoid lock ordering issues. MFC after: 2 weeks Discussed with: rwatson, kib
# e7ea30e4	29-Jul-2008	Ed Schouten <ed@FreeBSD.org>	Remove the use of lbolt from the VFS syncer. It seems we only use `lbolt' inside the VFS syncer and the TTY layer now. Because I'm planning to replace the TTY layer next month, there's no reason to keep `lbolt' if it's only used in a single thread inside the kernel. Because the syncer code wanted to wake up the syncer thread before the timeout, it called sleepq_remove(). Because we now just use a condvar(9) with a timeout value of `hz', we can wake it up using cv_broadcast() without waking up any unrelated threads. Reviewed by: phk
# 5573021d	27-Jul-2008	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel() functions. Reviewed by: kib
# 610507ae	27-Jul-2008	Pawel Jakub Dawidek <pjd@FreeBSD.org>	- Move vp test for being NULL under IGNORE_LOCK(). - Check if panicstr isn't set, if it is ignore the lock. This helps to avoid confusion, because lockmgr is a no-op when panicstr isn't NULL, so asserting anything at this point doesn't make sense and can just race with other panic. Discussed with: kib
# 09400d5a	21-Jul-2008	Attilio Rao <attilio@FreeBSD.org>	- Disallow XFS mounting in write mode. The write support never worked really and there is no need to maintain it. - Fix vn_get() in order to let it call vget(9) with a valid locking request. vget(9) returns the vnode locked in order to prevent recycling, but in this case internal XFS locks already prevent it from happening, so it is safe to drop the vnode lock before returning from vn_get(). - Add a VNASSERT() in vget(9) in order to catch malformed locking requests. Discussed with: kan, kib Tested by: Lothar Braun <lothar at lobraun dot de>
# 988f0e19	18-May-2008	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Be more friendly for DDB pager. Educated by: jhb's BSDCan presentation
# 60e2edce	04-May-2008	Attilio Rao <attilio@FreeBSD.org>	sync_vnode() has some messy code about locking in order to deal with mount fs needing Giant to be held when processing bufobjs. Use a different subqueue for pending workitems on filesystems requiring Giant. This simplifies the code notably and also reduces the number of Giant acquisitions (and the whole processing cost). Suggested by: jeff Reviewed by: kib Tested by: pho
# 3800322f	26-Apr-2008	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Implement 'show mount' command in DDB. Without argument, it prints short info about all currently mounted file systems. When an address is given as an argument, prints detailed info about the given mount point. MFC after: 2 weeks
# 12e79a9b	24-Apr-2008	Konstantin Belousov <kib@FreeBSD.org>	Allow the vnode zone to return the unused memory. The vnode reference count is/shall be properly maintained for the long time, and VFS shall be safe against the vnode memory reclamation. Proposed by: jeff Tested by: pho
# eab626f1	16-Apr-2008	Konstantin Belousov <kib@FreeBSD.org>	Move the head of byte-level advisory lock list from the filesystem-specific vnode data to the struct vnode. Provide the default implementation for the vop_advlock and vop_advlockasync. Purge the locks on the vnode reclaim by using the lf_purgelocks(). The default implementation is augmented for the nfs and smbfs. In the nfs_advlock, push the Giant inside the nfs_dolock. Before the change, the vop_advlock and vop_advlockasync have taken the unlocked vnode and dereferenced the fs-private inode data, racing with the vnode reclamation due to forced unmount. Now, the vop_getattr under the shared vnode lock is used to obtain the inode size, and later, in the lf_advlockasync, after locking the vnode interlock, the VI_DOOMED flag is checked to prevent an operation on the doomed vnode. The implementation of the lf_purgelocks() is submitted by dfr. Reported by: kris Tested by: kris, pho Discussed with: jeff, dfr MFC after: 2 weeks
# 1fd9b6a5	02-Apr-2008	Jeff Roberson <jeff@FreeBSD.org>	- Destroy the bo mtx when the vnode is destroyed.
# 71072af5	27-Mar-2008	Attilio Rao <attilio@FreeBSD.org>	b_waiters cannot be adequately protected by the interlock because it is dropped after the call to lockmgr() so just revert this approach using something similar to the preceding one: BUF_LOCKWAITERS() just checks if there are waiters (not the actual number of them) and it is based on newly introduced lockmgr_waiters() which returns if the lockmgr has waiters or not. The name has been chosen to differ from the old lockwaiters() in order to not confuse them. KPI results enriched by this commit so __FreeBSD_version bumping and manpage update will be happening soon. 'struct buf' also changes, so kernel ABI is disturbed. Bug found by: jeff Approved by: jeff, kib
# 0ee6cecc	23-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Greatly simplify vget() by removing the guarantee that any new references to a vnode with VI_OWEINACT set will force the vinactive() call. The kernel makes no guarantees about which reference was the last to close a file or when the actual inactive processing will happen. The previous code was designed to preserve existing semantics in the face of shared locks, however, this was unnecessary. Discussed with: mckusick
# e6b2545b	22-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Only return 1 from sync_vnode() in cases where the vnode is still at the head of the sync list. This prevents sched_sync() from re-queueing a vnode which may have been freed already. Discussed with: kib
# f6a8cecf	22-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode interlock here anymore. Reported by: kris
# 698b1a66	22-Mar-2008	Jeff Roberson <jeff@FreeBSD.org>	- Complete part of the unfinished bufobj work by consistently using BO_LOCK/UNLOCK/MTX when manipulating the bufobj. - Create a new lock in the bufobj to lock bufobj fields independently. This leaves the vnode interlock as an 'identity' lock while the bufobj is an io lock. The bufobj lock is ordered before the vnode interlock and also before the mnt ilock. - Exploit this new lock order to simplify softdep_check_suspend(). - A few sync related functions are marked with a new XXX to note that we may not properly interlock against a non-zero bv_cnt when attempting to sync all vnodes on a mountlist. I do not believe this race is important. If I'm wrong this will make these locations easier to find. Reviewed by: kib (earlier diff) Tested by: kris, pho (earlier diff)
# 237fdd78	16-Mar-2008	Robert Watson <rwatson@FreeBSD.org>	In keeping with style(9)'s recommendations on macros, use a ';' after each SYSINIT() macro invocation. This makes a number of lightweight C parsers much happier with the FreeBSD kernel source, including cflow's prcc and lxr. MFC after: 1 month Discussed with: imp, rink
# 7fbfba7b	01-Mar-2008	Attilio Rao <attilio@FreeBSD.org>	- Handle buffer lock waiters count directly in the buffer cache instead of relying on the lockmgr support [1]: * bump the waiters only if the interlock is held * let brelvp() return the waiters count * rely on brelvp() instead of BUF_LOCKWAITERS() in order to check for the waiters number - Remove a namespace pollution introduced recently with lockmgr.h including lock.h by including lock.h directly in the consumers and making it mandatory for using lockmgr. - Modify flags accepted by lockinit(): * introduce LK_NOPROFILE which disables lock profiling for the specified lockmgr * introduce LK_QUIET which disables ktr tracing for the specified lockmgr [2] * disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it can only be used on a per-instance basis - Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer used This patch breaks KPI so __FreeBSD_version will be bumped and manpages updated by further commits. Additionally, 'struct buf' changes result in a disturbed ABI also. [2] Really, currently there is no ktr tracing in the lockmgr, but it will be added soon. [1] Submitted by: kib Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
# 81c794f9	25-Feb-2008	Attilio Rao <attilio@FreeBSD.org>	Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is always curthread. As KPI gets broken by this patch, manpages and __FreeBSD_version will be updated by further commits. Tested by: Andrea Barberio <insomniac at slackware dot it>
# 2433c488	08-Feb-2008	Attilio Rao <attilio@FreeBSD.org>	Convert all explicit instances of VOP_ISLOCKED(arg, NULL) into VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should only accept curthread as argument; this will lead to axing the additional argument from both functions, making the code cleaner. Reviewed by: jeff, kib
# 0e9eb108	23-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	Cleanup lockmgr interface and exported KPI: - Remove the "thread" argument from the lockmgr() function as it is always curthread now - Axe lockcount() function as it is no longer used - Axe LOCKMGR_ASSERT() as it is bogus really and not currently used. Hopefully this will soon be replaced by something suitable for it. - Remove the prototype for dumplockinfo() as the function is no longer present Additionally: - Introduce a KASSERT() in lockstatus() in order to let it accept only curthread or NULL as they should only be passed - Do a little bit of style(9) cleanup on lockmgr.h KPI results heavily broken by this change, so manpages and FreeBSD_version will be modified accordingly by further commits. Tested by: matteo
# d638e093	19-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	- Introduce the function lockmgr_recursed() which returns true if the lockmgr lkp, when held in exclusive mode, is recursed - Introduce the function BUF_RECURSED() which does the same for bufobj locks based on the top of lockmgr_recursed() - Introduce the function BUF_ISLOCKED() which works like the counterpart VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus BUF_REFCNT() in a more explicative and SMP-compliant way. This allows us to axe out BUF_REFCNT() and leaving the function lockcount() totally unused in our stock kernel. Further commits will axe lockcount() as well as part of lockmgr() cleanup. KPI results, obviously, broken so further commits will update manpages and freebsd version. Tested by: kris (on UFS and NFS)
# 22db15c0	13-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjunction with 'thread' argument passing which is always curthread. Remove the useless extra argument and pass curthread explicitly to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
# cb05b60a	09-Jan-2008	Attilio Rao <attilio@FreeBSD.org>	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>
# c5f1beb0	27-Dec-2007	Robert Watson <rwatson@FreeBSD.org>	In "show lockedvnods" DDB command, use db_printf() rather than printf() so that the results end up in the DDB output stream rather than the console output stream. This should likely also be done for the vprint() function it calls. MFC after: 3 months
# 98e4f2e2	27-Dec-2007	Attilio Rao <attilio@FreeBSD.org>	As LK_EXCLUPGRADE is used in conjunction with LK_NOWAIT, LK_UPGRADE becomes equivalent to it, so make the switch. That call is the only remaining LK_EXCLUPGRADE consumer and removing it will prepare the ground for LK_EXCLUPGRADE axing and further lockmgr improvements. Discussed with: jeff, ups
# 3de213cc	25-Dec-2007	Robert Watson <rwatson@FreeBSD.org>	Add a new 'why' argument to kdb_enter(), and a set of constants to use for that argument. This will allow DDB to detect the broad category of reason why the debugger has been entered, which it can use for the purposes of deciding which DDB script to run. Assign approximate why values to all current consumers of the kdb_enter() interface.
# 973bdaa0	05-Dec-2007	Konstantin Belousov <kib@FreeBSD.org>	Use curthread instead of the FIRST_THREAD_IN_PROC for vnlru and syncer, when applicable. Acquire Giant slightly later for vnlru. In the syncer, acquire the Giant only when a vnode belongs to the non-MPsafe fs. In both speedup_syncer() and syncer_shutdown(), remove the syncer thread from the lbolt sleep queue after the syncer state is modified, not before. Herded by: attilio Tested by: Peter Holm Reviewed by: ups MFC after: 1 week
# 30d239bc	24-Oct-2007	Robert Watson <rwatson@FreeBSD.org>	Merge first in a series of TrustedBSD MAC Framework KPI changes from Mac OS X Leopard--rationalize naming for entry points to the following general forms: mac_<object>_<method/action> mac_<object>_check_<method/action> The previous naming scheme was inconsistent and mostly reversed from the new scheme. Also, make object types more consistent and remove spaces from object types that contain multiple parts ("posix_sem" -> "posixsem") to make mechanical parsing easier. Introduce a new "netinet" object type for certain IPv4/IPv6-related methods. Also simplify, slightly, some entry point names. All MAC policy modules will need to be recompiled, and modules not updated as part of this commit will need to be modified to conform to the new KPI. Sponsored by: SPARTA (original patches against Mac OS X) Obtained from: TrustedBSD Project, Apple Computer
# 3745c395	20-Oct-2007	Julian Elischer <julian@FreeBSD.org>	Rename the kthread_xxx (e.g. kthread_create()) calls to kproc_xxx as they actually make whole processes. This makes way for us to add REAL kthread_create() and friends that actually make threads. It turns out that most of these calls actually end up being moved back to the thread version when it's added, but we need to make this cosmetic change first. I'd LOVE to do this rename in 7.0 so that we can eventually MFC the new kthread_xxx() calls.
# 245b2044	12-Sep-2007	Konstantin Belousov <kib@FreeBSD.org>	When restoring the mount after umount failed, the MNTK_UNMOUNT flag prevents insmntque() from placing reallocated syncer vnode on mount list, that causes panic in vfs_allocate_syncvnode(). Introduce MNTK_NOINSMNTQ flag, that marks the period when insmntque() is not allowed to succeed, instead of MNTK_UNMOUNT. The MNTK_NOINSMNTQ is set and cleared simultaneously with MNTK_UNMOUNT, except on umount error path, where it is cleared just before the syncer vnode is going to be allocated. Reported by: Peter Jeremy <peterjeremy optushome com au> Suggested by: tegge Approved by: re (rwatson)
# 354eb801	13-Aug-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Improve vn_printf() by: - adding missing vnode flags, - printing unknown flags as numbers, - using strlcat() instead of strcat(). Approved by: re (bmah)
# 32f9753c	11-Jun-2007	Robert Watson <rwatson@FreeBSD.org>	Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in some cases, move to priv_check() if it was an operation on a thread and no other flags were present. Eliminate caller-side jail exception checking (also now-unused); jail privilege exception code now goes solely in kern_jail.c. We can't yet eliminate suser() due to some cases in the KAME code where a privilege check is performed and then used in many different deferred paths. Do, however, move those prototypes to priv.h. Reviewed by: csjp Obtained from: TrustedBSD Project
# 2feb50bf	31-May-2007	Attilio Rao <attilio@FreeBSD.org>	Revert VMCNT_* operations introduction. Probably, a general approach is not the best solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)
# e1e8f51b	27-May-2007	Robert Watson <rwatson@FreeBSD.org>	Universally adopt most conventional spelling of acquire.
# d413d210	18-May-2007	Konstantin Belousov <kib@FreeBSD.org>	Since renaming of vop_lock to _vop_lock, pre- and post-condition function calls are no more generated for vop_lock. Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption about vop naming conventions. This restores pre/post-condition calls.
# 222d0195	18-May-2007	Jeff Roberson <jeff@FreeBSD.org>	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>
# 24b0502e	13-Apr-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Fix jails and jail-friendly file systems handling: - We need to allow for PRIV_VFS_MOUNT_OWNER inside a jail. - Move security checks to vfs_suser() and deny unmounting and updating for jailed root from different jails, etc. OK'ed by: rwatson
# 6bc3ab25	13-Apr-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	When we are running low on vnodes, there is currently no way to ask other subsystems to release some vnodes. Implement backpressure based on vfs_lowvnodes event (similar to vm_lowmem for memory).
# 08be8194	10-Apr-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Minor style cleanups (mostly removal of trailing whitespaces).
# 21ff8c67	10-Apr-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Correct typos.
# def72fbb	01-Apr-2007	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Now that the vdropl() function is public, assert that the vnode interlock is held.
# e6534b36	31-Mar-2007	Dag-Erling Smørgrav <des@FreeBSD.org>	Make vdropl() public; zfs needs it. There is also plenty of existing file system code (mostly *_reclaim()) which look like this: VOP_LOCK(vp); /* examine vp */ VOP_UNLOCK(vp); vdrop(vp); This can now be rewritten to: VOP_LOCK(vp); /* examine vp */ vdropl(vp); /* will unlock vp */ MFC after: 1 week
# f3ea971b	26-Mar-2007	Marcel Moolenaar <marcel@FreeBSD.org>	PowerPC is the only architecture with mpsafe_vfs=0. This is now broken. Rudimentary tests show that PowerPC can run with mpsafe_vfs=1. Make it so...
# 61b9d89f	12-Mar-2007	Tor Egge <tegge@FreeBSD.org>	Make insmntque() externally visible and allow it to fail (e.g. during late stages of unmount). On failure, the vnode is recycled. Add insmntque1(), to allow for file system specific cleanup when recycling vnode on failure. Change getnewvnode() to no longer call insmntque(). Previously, embryonic vnodes were put onto the list of vnodes belonging to a file system, which is unsafe for a file system marked MPSAFE. Change vfs_hash_insert() to no longer lock the vnode. The caller now has that responsibility. Change most file systems to lock the vnode and call insmntque() or insmntque1() after a new vnode has been sufficiently setup. Handle failed insmntque*() calls by propagating errors to callers, possibly after some file system specific cleanup. Approved by: re (kensmith) Reviewed by: kib In collaboration with: kib
# 2f6a774b	12-Nov-2006	Kip Macy <kmacy@FreeBSD.org>	change vop_lock handling to allow tracking of callers' file and line for acquisition of lockmgr locks Approved by: scottl (standing in for mentor rwatson)
# 6b8de13a	07-Nov-2006	John Baldwin <jhb@FreeBSD.org>	Simplify operations with sync_mtx in sched_sync(): - Don't drop the lock just to reacquire it again to check rushjob, this only wastes time. - Use msleep() to drop the mutex while sleeping instead of explicitly unlocking around tsleep. Reviewed by: pjd
# 8064e5d7	07-Nov-2006	John Baldwin <jhb@FreeBSD.org>	Fix comment typo and function declaration.
# acd3428b	06-Nov-2006	Robert Watson <rwatson@FreeBSD.org>	Sweep kernel replacing suser(9) calls with priv(9) calls, assigning specific privilege names to a broad range of privileges. These may require some future tweaking. Sponsored by: nCircle Network Security, Inc. Obtained from: TrustedBSD Project Discussed on: arch@ Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri, Alex Lyashkov <umka at sevcity dot net>, Skip Ford <skip dot ford at verizon dot net>, Antoine Brodin <antoine dot brodin at laposte dot net>
# a2ca03b3	04-Nov-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Typo, 'from' vnode is locked here, not 'to' vnode.
# 1a60c7fc	31-Oct-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Add gjournal specific code to the UFS file system: - Add FS_GJOURNAL flag which enables gjournal support on a file system. - Add cg_unrefs field to the cylinder group structure which holds number of unreferenced (orphaned) inodes in the given cylinder group. - Add fs_unrefs field to the super block structure which holds total number of unreferenced (orphaned) inodes. - When file or a directory is orphaned (last reference is removed, but object is still open), increase fs_unrefs and cg_unrefs fields, which is a hint for fsck about which cylinder groups to search for such (orphaned) objects. - When file is last closed, decrease {fs,cg}_unrefs fields. - Add VV_DELETED vnode flag which points at orphaned objects. Sponsored by: home.pl
# aed55708	22-Oct-2006	Robert Watson <rwatson@FreeBSD.org>	Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now contains the userspace and user<->kernel API and definitions, with all in-kernel interfaces moved to mac_framework.h, which is now included across most of the kernel instead. This change is the first step in a larger cleanup and sweep of MAC Framework interfaces in the kernel, and will not be MFC'd. Obtained from: TrustedBSD Project Sponsored by: SPARTA
# 45ea8737	02-Oct-2006	Konstantin Belousov <kib@FreeBSD.org>	Correct the comment: numvnodes is decreased on vdestroying the vnode. OKed by: tegge Approved by: pjd (mentor) MFC after: 1 week
# a1e363f2	25-Sep-2006	Tor Egge <tegge@FreeBSD.org>	Add mnt_noasync counter to better handle interleaved calls to nmount(), sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag which is set only when MNT_ASYNC is set and mnt_noasync is zero, and check that flag instead of MNT_ASYNC before initiating async io.
# 5da56ddb	25-Sep-2006	Tor Egge <tegge@FreeBSD.org>	Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag. This eliminates a race where MNT_UPDATE flag could be lost when nmount() raced against sync(), sync_fsync() or quotactl().
# c37789fe	04-Sep-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Add 'show vnode <addr>' DDB command.
# 04d9e255	10-Aug-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	getnewvnode() can be called with NULL mp. Found by: Coverity Prevent (tm) Coverity ID: 1521 Confirmed by: phk
# 13c85d33	08-Aug-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Add a bandaid to avoid a deadlock in a situation, when we are trying to suspend a file system, but need to obtain a vnode. We may not be able to do it, because all vnodes could be already in use and other processes cannot release them, because they are waiting in "suspfs" state. In such situation, we allow to allocate a vnode anyway. This is a temporary fix - there is no backpressure to free vnodes allocated in those circumstances. MFC after: 1 week Reviewed by: tegge
# ccdebe46	06-Aug-2006	Robert Watson <rwatson@FreeBSD.org>	Improve commenting of vaccess(), making sure to be clear that the ifdef capabilities code is there for reference and never actually used. Slight style tweak.
# 27ea2953	15-Jul-2006	Alan Cox <alc@FreeBSD.org>	Enable debug.mpsafevfs by default on arm. Since every architecture except powerpc has debug.mpsafevfs enabled by default, it is shorter to enumerate the architectures on which debug.mpsafevfs is off. Tested by: cognet@
# c8d3bc1f	05-Jul-2006	Konstantin Belousov <kib@FreeBSD.org>	Back out my rev. 1.674. The better fix (rev. 1.637) is already in tree. Approved by: kan (mentor)
# d81175c7	26-Jun-2006	Sergey Babkin <babkin@FreeBSD.org>	Backed out the change by request from rwatson. PR: kern/14584
# 7a799f1e	25-Jun-2006	Sergey Babkin <babkin@FreeBSD.org>	The common UID/GID space implementation. It has been discussed on -arch in 1999, and there are changes to the sysctl names compared to PR, according to that discussion. The description is in sys/conf/NOTES. Lines in the GENERIC files are added in commented-out form. I'll attach the test script I've used to PR. PR: kern/14584 Submitted by: babkin
# 55aef263	08-Jun-2006	Konstantin Belousov <kib@FreeBSD.org>	Fix the LOR that occurs when MAC is compiled into the kernel and a vnode is destroyed. Reviewed by: rwatson LOR: 189 MFC after: 2 weeks Approved by: kan (mentor)
# dcf67e65	24-May-2006	Stephan Uphoff <ups@FreeBSD.org>	Do not set B_NOCACHE on buffers when releasing them in flushbuflist(). If B_NOCACHE is set the pages of vm backed buffers will be invalidated. However clean buffers can be backed by dirty VM pages so invalidating them can lead to data loss. Add support for flushing dirty pages in the data invalidation function of some network file systems. This fixes data losses during vnode recycling (and other code paths using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file. Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@ Reviewed by: tegge@ MFC after: 7 days
# 73dbd3da	11-May-2006	John Baldwin <jhb@FreeBSD.org>	Remove various bits of conditional Alpha code and fixup a few comments.
# 643df192	29-Apr-2006	Pawel Jakub Dawidek <pjd@FreeBSD.org>	vn_start_write()/vn_finished_write() is not needed here, because vn_start_write() is always called earlier in the code path and calling the function recursively may lead to a deadlock. Confirmed by: tegge MFC after: 2 weeks
# 6ca9fcc5	27-Apr-2006	Jeff Roberson <jeff@FreeBSD.org>	- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child buffers to go on the buf daemon's DIRTYGIANT queue. - Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler runs in the context of the buf daemon and may require Giant.
# b53bf126	04-Apr-2006	Jeff Roberson <jeff@FreeBSD.org>	- VFS_LOCK_GIANT when recycling a vnode via getnewvnode. We may be recycling for an unrelated filesystem. I really don't like potentially acquiring giant in the context of a giantless filesystem but there are reasonable objections to removing the recycling from this path. Sponsored by: Isilon Systems, Inc.
# 0af24721	31-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Add an assert to vgone. It is illegal to call vgone without a reference to the vnode. Without a reference the vnode will never be vdestroy'd and the memory will never be reclaimed. Sponsored by: Isilon Systems, Inc.
# 94bc95db	30-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Hold a reference from the time vfs_busy starts until vfs_unbusy is called. - vfs_getvfs has to return a reference to prevent the returned mountpoint from changing identities. - Release references acquired via vfs_getvfs. Discussed with: tegge Tested by: kris Sponsored by: Isilon Systems, Inc.
# 084d64ac	30-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf requires Giant. It is set in bgetvp and cleared in brelvp. - Create QUEUE_DIRTY_GIANT for dirty buffers that require giant. - In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and only if we think there are buffers in that queue. Sponsored by: Isilon Systems, Inc.
# e44270a7	19-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Correct an assert in vop_rename_pre. fdvp may be locked if it is either the target directory or file. This case should fail in the filesystem anyway and perhaps kern_rename() should catch it. Sponsored by: Isilon Systems, Inc.
# 791dd2fa	08-Mar-2006	Tor Egge <tegge@FreeBSD.org>	Use vn_start_secondary_write() and vn_finished_secondary_write() as a replacement for vn_write_suspend_wait() to better account for secondary write processing. Close race where secondary writes could be started after ffs_sync() returned but before the file system was marked as suspended. Detect if secondary writes or softdep processing occurred during vnode sync loop in ffs_sync() and retry the loop if needed.
# 3b582b4e	02-Mar-2006	Tor Egge <tegge@FreeBSD.org>	Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must be called without any vnode locks held. Remove calls to vn_start_write() and vn_finished_write() in vnode_pager_putpages() and add these calls before the vnode lock is obtained to most of the callers that don't already have them.
# b983aac7	02-Mar-2006	Tor Egge <tegge@FreeBSD.org>	Don't try to show marker nodes.
# eb2ea105	01-Mar-2006	Jeff Roberson <jeff@FreeBSD.org>	- Move softdep from using a global worklist to per-mount worklists. This has many positive effects including improved smp locking, reducing interdependencies between mounts that can lead to deadlocks, etc. - Add the softdep worklist and various counters to the ufsmnt structure. - Add a mount pointer to the workitem and remove mount pointers from the various structures derived from the workitem as they are now redundant. - Remove the poor-man's semaphore protecting softdep_process_worklist and softdep_flushworklist. Several threads may now process the list simultaneously. - Add softdep_waitidle() to block the thread until all pending dependencies being operated on by other threads have been flushed. - Use softdep_waitidle() in unmount and snapshots to block either operation until the fs is stable. - Remove softdep worklist processing from the syncer and move it into the softdep_flush() thread. This thread processes all softdep mounts once each second and when it is called via the new softdep_speedup() when there is a resource shortage. This removes the softdep hook from the kernel and various hacks in header files to support it. Reviewed by/Discussed with: tegge, truckman, mckusick Tested by: kris
# a1db11fc	22-Feb-2006	Jeff Roberson <jeff@FreeBSD.org>	- Release the mount ref once the vnode has been recycled rather than once the last reference is dropped. I forgot that vnodes can stick around for a very long time until processes discover that they are dead. This means that a vnode reference is not sufficient to keep the mount referenced and even more code will be required to ref mount points. Discovered by: kris
# 8a7cd2fd	21-Feb-2006	Jeff Roberson <jeff@FreeBSD.org>	- Grab a mnt ref in vfs_busy() before dropping the interlock. This will prevent the mount point from going away while we're waiting on the lock. The ref does not need to persist once we have the lock because the lock prevents the mount point from being unmounted. MFC After: 1 week
# 04f6d3ef	06-Feb-2006	Jeff Roberson <jeff@FreeBSD.org>	- Add a ref count to the mount structure. Sleep for up to 3 seconds in vfs_mount_destroy waiting for this ref to hit 0. We don't print an error if we are rebooting as the root mount always retains some references by init proc. - Acquire a mnt ref for every vnode allocated to a mount point. Drop this ref only once vdestroy() has been called and the mount has been freed. - No longer NULL the v_mount pointer in delmntque() so that we may release the ref after vgone() has been called. This allows us to guarantee that the mount point structure will be valid until the last vnode has lost its last ref. - Fix a few places that rely on checking v_mount to detect recycling. Sponsored by: Isilon Systems, Inc. MFC After: 1 week
# b099db58	31-Jan-2006	Jeff Roberson <jeff@FreeBSD.org>	- Solve a race where we could lose a call to VOP_INACTIVE. If vget() waiting on a lock held the last usecount ref on a vnode and the lock failed we would not call INACTIVE. Solve this by only holding a holdcnt to prevent the vnode from disappearing while we wait on vn_lock. Other callers may now VOP_INACTIVE while we are waiting on the lock, however this race is acceptable, while losing INACTIVE is not. Discussed with: kan, pjd Tested by: kkenn Sponsored by: Isilon Systems, Inc. MFC After: 1 week
# d5e5528a	27-Jan-2006	Kris Kennaway <kris@FreeBSD.org>	Back out r1.653; it turns out that the race (or at least the printf) is actually not hard to trigger, and it can cause a lot of console spam. Approved by: kan
# 6be2c41a	21-Jan-2006	Robert Watson <rwatson@FreeBSD.org>	Convert remaining functions in vfs_subr.c from K&R prototypes to ANSI C prototypes, as the majority of new functions added have been in this style. Changing prototype style now results in gcc noticing that the implementation of vn_pollrecord() has a 'short' argument instead of 'int' as prototyped in vnode.h, so correct that definition. In practice this didn't matter as only poll flags in the lower 16 bits are used. MFC after: 1 week
# 82be0a5a	09-Jan-2006	Tor Egge <tegge@FreeBSD.org>	Add marker vnodes to ensure that all vnodes associated with the mount point are iterated over when using MNT_VNODE_FOREACH. Reviewed by: truckman
# e7736557	29-Dec-2005	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Print a warning when we miss vinactive() call, because of race in vget(). The race is very real, but conditions needed for triggering it are rather hard to meet now. When gjournal will be committed (where it is quite easy to trigger) we need to fix it. For now, verify if it is really hard to trigger. Discussed with: kan
# 16e35dcc	09-Nov-2005	Doug White <dwhite@FreeBSD.org>	This is a workaround for a complicated issue involving VFS cookies and devfs. The PR and patch have the details. The ultimate fix requires architectural changes and clarifications to the VFS API, but this will prevent the system from panicking when someone does "ls /dev" while running in a shell under the linuxulator. This issue affects HEAD and RELENG_6 only. PR: 88249 Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com> MFC after: 3 days
# 5bb84bc8	31-Oct-2005	Robert Watson <rwatson@FreeBSD.org>	Normalize a significant number of kernel malloc type names: - Prefer '_' to ' ', as it results in more easily parsed results in memory monitoring tools such as vmstat. - Remove punctuation that is incompatible with using memory type names as file names, such as '/' characters. - Disambiguate some collisions by adding subsystem prefixes to some memory types. - Generally prefer lower case to upper case. - If the same type is defined in multiple architecture directories, attempt to use the same name in additional cases. Not all instances were caught in this change, so more work is required to finish this conversion. Similar changes are required for UMA zone names.
# 14cdc364	14-Oct-2005	Kris Kennaway <kris@FreeBSD.org>	mpsafevm has been stable and defaulted to 1 on sparc64 for over 6 months, so we are ready for mpsafevfs=1 by default on sparc64 too. I have been running this on all my sparc64 machines for over 6 months, and have not encountered MD problems. MFC after: 1 week
# 9f5c1d19	12-Oct-2005	Diomidis Spinellis <dds@FreeBSD.org>	Move execve's access time update functionality into a new vfs_mark_atime() function, and use the new function for performing efficient atime updates in mmap(). Reviewed by: bde MFC after: 2 weeks
# 6c8b634f	29-Sep-2005	Don Lewis <truckman@FreeBSD.org>	Un-staticize runningbufwakeup() and staticize updateproc. Add a new private thread flag to indicate that the thread should not sleep if runningbufspace is too large. Set this flag on the bufdaemon and syncer threads so that they skip the waitrunningbufspace() call in bufwrite() rather than checking the proc pointer vs. the known proc pointers for these two threads. A way of preventing these threads from being starved for I/O but still placing limits on their outstanding I/O would be desirable. Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from blocking on the runningbufspace check while holding snaplk. This prevents snaplk from being held for an arbitrarily long period of time if runningbufspace is high and greatly reduces the contention for snaplk. The disadvantage is that ffs_copyonwrite() can start a large amount of I/O if there are a large number of snapshots, which could cause a deadlock in other parts of the code. Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace before attempting to grab snaplk so that I/O requests waiting on snaplk are not counted in runningbufspace as being in-progress. Increment runningbufspace again before actually launching the original I/O request. Prior to the above two changes, the system could deadlock if enough I/O requests were blocked by snaplk to prevent runningbufspace from falling below lorunningspace and one of the bawrite() calls in ffs_copyonwrite() blocked in waitrunningbufspace() while holding snaplk. See <http://www.holm.cc/stress/log/cons143.html>
# 61ac14da	16-Sep-2005	Tor Egge <tegge@FreeBSD.org>	Break out of loop if next buffer pointer has become invalid while flushing current buffer. Reviewed by: kan
# fd1a469b	12-Sep-2005	Robert Watson <rwatson@FreeBSD.org>	In vfs_kqfilter(), return EINVAL instead of 1 (EPERM) when an unsupported kqueue filter type is requested on a vnode. MFC after: 3 days
# 9ed448b2	12-Sep-2005	Jung-uk Kim <jkim@FreeBSD.org>	use monotonic `time_uptime' instead of `time_second' Approved by: anholt (mentor) Discussed on: arch
# 2883ba66	12-Sep-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce vfs_read_dirent() which can help VOP_READDIR() implementations by handling all the cookie stuff.
# a6c109d6	28-Aug-2005	Suleiman Souhlal <ssouhlal@FreeBSD.org>	Fix a typo in vop_rename_pre() where we ended up using vholdl() instead of vhold(), even though the vnode interlock is unlocked. MFC after: 3 days
# ad9f1801	22-Aug-2005	Don Lewis <truckman@FreeBSD.org>	Back out the removal of LK_NOWAIT from the VOP_LOCK() call in vlrureclaim() in vfs_subr.c 1.636 because waiting for the vnode lock aggravates an existing race condition. It is also undesirable according to the commit log for 1.631. Fix the tiny race condition that remains by rechecking the vnode state after grabbing the vnode lock and grabbing the vnode interlock. Fix the problem of other threads being starved (which 1.636 attempted to fix by removing LK_NOWAIT) by calling uio_yield() periodically in vlrureclaim(). This should be more deterministic than hoping that VOP_LOCK() without LK_NOWAIT will block, which may not happen in this loop. Reviewed by: kan MFC after: 5 days
# 6cd8dee3	20-Aug-2005	Robert Watson <rwatson@FreeBSD.org>	Silence "busy" warnings when unmounting devfs at system shutdown. This is a workaround for non-symmetric teardown of the file systems at shutdown with respect to the mount order at boot. The proper long term fix is to properly detach devfs from the root mount before unmounting each, and should be implemented, but since the problem is non-harmful, this temporary band-aid will prevent false positive bug reports and unnecessary error output for 6.0-RELEASE. MFC after: 3 days Tested by: pav, pjd
# fd65baf8	13-Aug-2005	Marcel Moolenaar <marcel@FreeBSD.org>	Make mpsafe_vfs=1 the default on ia64.
# 45a0d1ed	10-Aug-2005	Alexander Kabaev <kan@FreeBSD.org>	Do not drop the vnode interlock if vdropl is called on already doomed vnode. vdropl callers expect it to return with interlock still being held. MFC after: 2 days
# 34cc826a	05-Aug-2005	Suleiman Souhlal <ssouhlal@FreeBSD.org>	Holding a vnode doesn't prevent v_mount from disappearing (when the vnode is inactivated), possibly leading to a NULL dereference when checking if the mount wants knotes to be activated in the VOP hooks. So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(), if necessary, and check it when activating knotes. Since the flags are not erased when a vnode is being held, we can safely read them. Reviewed by: kris@ MFC after: 3 days
# 40a49585	02-Aug-2005	Jeff Roberson <jeff@FreeBSD.org>	- Unlock before we call mac_destroy_vnode to prevent a lock order reversal. Found by: trhodes
# 39b24068	19-Jul-2005	Jeff Roberson <jeff@FreeBSD.org>	- Allow vnlru to drop giant if the filesystem does not require it. The vnlru proc is extremely inefficient, potentially iterating over tens of thousands of vnodes without blocking. Dropping Giant allows other threads to preempt us although we should revisit the algorithm to fix the runtime problems especially since this may hold up all vnode allocations. - Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim. This provides a natural blocking point to help alleviate the situation described above although it may not technically be desirable. - yield after we make a pass on all mount points to prevent us from blocking other threads which require Giant. MFC after: 2 weeks
# c23c87bd	05-Jul-2005	Pawel Jakub Dawidek <pjd@FreeBSD.org>	Fix one "wrong b_bufobj" panic in reassignbuf() by moving VI_UNLOCK(vp) below KASSERT()s, which means there was no real problem here, we just needed better locking for assertions. OK'ed by: jeff Approved by: re (scottl)
# 571dcd15	01-Jul-2005	Suleiman Souhlal <ssouhlal@FreeBSD.org>	Fix the recent panics/LORs/hangs created by my kqueue commit by: - Introducing the possibility of using locks different than mutexes for the knlist locking. In order to do this, we add three arguments to knlist_init() to specify the functions to use to lock, unlock and check if the lock is owned. If these arguments are NULL, we assume mtx_lock, mtx_unlock and mtx_owned, respectively. - Using the vnode lock for the knlist locking, when doing kqueue operations on a vnode. This way, we don't have to lock the vnode while holding a mutex, in filt_vfsread. Reviewed by: jmg Approved by: re (scottl), scottl (mentor override) Pointyhat to: ssouhlal Will be happy: everyone
# b770ff6e	18-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Try to catch the wrong bufobj panics a little earlier. I believe they are actually caused by a buf with both VNCLEAN and VNDIRTY set. In the traces it is clear that the buf is removed from the dirty queue while it is actually on the clean queue which leaves the tail pointer set. Assert that both flags are not set in buf_vlist_add and buf_vlist_remove. Sponsored by: Isilon Systems, Inc. Approved by: re (blanket vfs)
# 114a1006	15-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Change holdcnt use around vnode recycling. We now always keep a holdcnt ref while we're calling vgone(). This prevents transient refs from re-adding us to the free list. Previously, a vfree() triggered via vinvalbuf() getting rid of all of a vnode's pages could place a partially destructed vnode on the free list where vtryrecycle() could find it. The first call to vtryrecycle would hang up on the vnode lock, but when it failed it would place a now dead vnode onto the free list, and another call to vtryrecycle() would free an already free vnode. There were many complications of having a zero ref count while freeing which can now go away. - Change vdropl() to release the interlock before returning. All callers now respect this, so vdropl() directly frees VI_DOOMED vnodes once the last ref is dropped. This means that we'll never have VI_DOOMED vnodes on the free list. - Separate v_incr_usecount() into v_incr_usecount(), v_decr_usecount() and v_decr_useonly(). The incr/decr split is so that incr usecount can return with the interlock still held while decr drops the interlock so it can call vdropl() which will potentially free the vnode. The calling function can't drop the lock of an already free'd node. v_decr_useonly() drops a usecount without dropping the hold count. This is done so the usecount reaches zero in vput() before we recycle, however the holdcount is still 1 which prevents any new references from placing the vnode back on the free list. - Fix vnlrureclaim() to vhold the vnode since it doesn't do a vget(). We wouldn't want vnlrureclaim() to bump the usecount since this has different semantics. Also change vnlrureclaim() to do a NOWAIT on the vn_lock. When this function runs we're usually in a desperate situation and we wouldn't want to wait for any specific vnode to be released. - Fix a bunch of misc comments to reflect the new behavior. - Add vhold() and vdrop() to vflush() for the same reasons that we do in vlrureclaim(). Previously we held no reference and a vnode could have been freed while we were waiting on the lock. - Get rid of vlruvp() and vfreehead(). Neither are used. vlruvp() should really be rethought before it's reintroduced. - vgonel() always returns with the vnode locked now and never puts the vnode back on a free list. The vnode will be freed as soon as the last reference is released. Sponsored by: Isilon Systems, Inc. Debugging help from: Kris Kennaway, Peter Holm Approved by: re (blanket vfs)
# 12c2dcde	14-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- In reassignbuf() add many asserts to validate the head and tail pointers of the clean and dirty lists. This is in an attempt to catch the wrong bufobj problem sooner. - In vgonel() don't acquire an extra reference in the active case, the vnode lock and VI_DOOMED protect us from recursively cleaning. - Also in vgonel() clean up some stale comments. Sponsored by: Isilon Systems, Inc. Approved by: re (blanket vfs)
# b930d853	13-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Don't make vgonel() globally visible, we want to change its prototype anyway and it's not used outside of vfs_subr.c. - Change vgonel() to accept a parameter which determines whether or not we'll put the vnode on the free list when we're done. - Use the new vgonel() parameter rather than VI_DOOMED to signal our intentions in vtryrecycle(). - In vgonel() return if VI_DOOMED is already set, this vnode has already been reclaimed. Sponsored by: Isilon Systems, Inc.
# d2ad9baa	12-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Add KTR_VFS events to vdestroy, vtruncbuf, vinvalbuf, vfreehead. Sponsored by: Isilon Systems, Inc.
# d6dbf760	11-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Assert that we're not in the name cache anymore in vdestroy(). Sponsored by: Isilon Systems, Inc.
# 9aa0eba4	10-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Add KTR_VFS tracing to track the life of vnodes. Eventually KTR_VFS events could be added to cover other interesting details. - Add some VNASSERTs to discover places where we access vnodes after they have been uma_zfree'd before we try to free them again. - Add a few more VNASSERTs to vdestroy() to be certain that the vnode is really unused. Sponsored by: Isilon Systems, Inc.
# 679985d0	09-Jun-2005	Suleiman Souhlal <ssouhlal@FreeBSD.org>	Allow EVFILT_VNODE events to work on every filesystem type, not just UFS by: - Making the pre and post hooks for the VOP functions work even when DEBUG_VFS_LOCKS is not defined. - Moving the KNOTE activations into the corresponding VOP hooks. - Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct mount that permits filesystems to disable the new behavior. - Creating a default VOP_KQFILTER function: vfs_kqfilter() My benchmarks have not revealed any performance degradation. Reviewed by: jeff, bde Approved by: rwatson, jmg (kqueue changes), grehan (mentor)
# fae89dce	07-Jun-2005	Jeff Roberson <jeff@FreeBSD.org>	- Clear OWEINACT prior to calling VOP_INACTIVE to remove the possibility of a vget causing another call to INACTIVE before we're finished.
# fd94099e	05-May-2005	Colin Percival <cperciva@FreeBSD.org>	If we are going to 1. Copy a NULL-terminated string into a fixed-length buffer, and 2. copyout that buffer to userland, we really ought to 0. Zero the entire buffer first. Security: FreeBSD-SA-05:08.kmem
# 059f090f	03-May-2005	Jeff Roberson <jeff@FreeBSD.org>	- A vnode may have made its way onto the free list while it was being vgone'd. We must remove it from the freelist before returning in vtryrecycle() or we may get a duplicate free. Reported by: kkenn
# 02fe1744	01-May-2005	Christian S.J. Peron <csjp@FreeBSD.org>	Since it is not possible for curthread to be NULL in this context, drop the check+initialization for a straight initialization. Also assert that curthread will never be NULL just to be sure. Discussed with: rwatson, peter MFC after: 1 week
# b2e21664	30-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- All buffers should either be clean or dirty. If neither of these flags are set when we attempt to remove a buffer from a queue we should panic. Hopefully this will catch the source of the wrong bufobj panics. Sponsored by: Isilon Systems, Inc.
# b2183bfe	30-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- In vnlru_free() remove the vnode from the free list before we call vtryrecycle(). We could sometimes get into situations where two threads could try to recycle the same vnode before this. - vtryrecycle() is now responsible for returning the vnode to the free list if it fails and someone else hasn't done it. - Make a new function vfreehead() which moves a vnode to the head of the free list and use it in vgone() to clean up that code a bit. Sponsored by: Isilon Systems, Inc. Reported by: pho, kkenn
# 0dd02d67	27-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Don't vgonel() via vgone() or vrecycle() if the vnode is already doomed. This fixes forced unmounts via nullfs. Reported by: kkenn Sponsored by: Isilon Systems, Inc.
# 6c317bc4	27-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Stop setting vxthread, we've asserted that it was useless for several weeks now.
# 7d60dc52	21-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Disable code which allows getnewvnode() to fail. Many ffs_vget() callers do not correctly deal with failures. This presently risks deadlock problems if dependency processing is held up by failures to allocate a vnode, however, this is better than the situation with the failures. Sponsored by: Isilon Systems, Inc.
# bdb35646	18-Apr-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Initialize mountlist_mtx with an MTX_SYSINIT(), we need it to be ready earlier.
# 374df05f	13-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Change vop_lookup_post assertions to reflect recent vfs_lookup changes. Sponsored by: Isilon Systems, Inc.
# 539de9ed	11-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Enable ASSERT_VOP_ELOCKED and assert_vop_elocked() now that vnode_if.awk uses it. Sponsored by: Isilon Systems, Inc.
# 070898b1	11-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Change the VOP_LOCK UPGRADE in vput() to do a LK_NOWAIT to avoid a potential lock order reversal. Also, don't unlock the vnode if this fails, lockmgr has already unlocked it for us. - Restructure vget() now that vn_lock() does all of VI_DOOMED checking for us and also handles the case where there is no real lock type. - If VI_OWEINACT is set, we need to upgrade the lock request to EXCLUSIVE so that we can call inactive. It's not legal to vget a vnode that hasn't had INACTIVE called yet. Sponsored by: Isilon Systems, Inc.
# d78e0ee9	06-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Assert that the bufobj matches in flushbuflists. I still haven't gotten to root cause on exactly how this happens. - If the assert is disabled, we presently try to handle this case, but the BUF_UNLOCK was missing. Thus, if this condition ever hit we would leak a buf lock. Many thanks to Peter Holm for all his help in finding this bug. He really put more effort into it than I did.
# 2bbd6c98	05-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Move NDFREE() from vfs_subr to vfs_lookup where namei() is.
# d1cc6041	03-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Add a missing unlock of the vnode_free_list_mtx. Spotted by: Antoine Brodin
# 92b8231d	04-Apr-2005	Jeff Roberson <jeff@FreeBSD.org>	- Instead of waiting forever to get a vnode in getnewvnode() wait for one to become available for one second and then return ENFILE. We can run out of vnodes, and there must be a hard limit because without one we can quickly run out of KVA on x86. Presently the system can deadlock if there are maxvnodes directories in the namecache. The original 4.x BSD behavior was to return ENFILE if we reached the max, but 4.x BSD did not have the vnlru proc so it was less profitable to wait.
# e451d879	30-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Disable vfs shared locks by default. They must be specifically enabled on filesystems which safely support them. It appears that many network filesystems specifically are not shared lock safe. Sponsored by: Isilon Systems, Inc.
# f247a524	30-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- LK_NOPAUSE is a nop now. Sponsored by: Isilon Systems, Inc.
# 7ce7f713	29-Mar-2005	David Schultz <das@FreeBSD.org>	Eliminate v_id and v_ddid. The name cache now holds references to vnodes whose names it caches, so we no longer need a `generation number' to tell us if a referenced vnode is invalid. Replace the use of the parent's v_id in the hash function with the address of the parent vnode. Tested by: Peter Holm Glanced at by: jeff, phk
# 0fbc3b7d	29-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Don't clear OWEINACT in vbusy(), we still owe an inactive call if someone vhold()s us. - Avoid an extra mutex acquire and release in the common case of vgonel() by checking for OWEINACT at the start of the function. - Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our shared lock if it fails. Sponsored by: Isilon Systems, Inc.
# cb34b95b	29-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Don't initialize v_dd here, let cache_purge() do it for us. Sponsored by: Isilon Systems, Inc.
# 9dcc5da3	28-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Move code that should probably be an assert above the main body of vrele so that we can decrease the indentation of the real work and make things slightly more clear. Sponsored by: Isilon Systems, Inc.
# d36f0a4f	28-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Adjust asserts in vop_lookup_post() to match the new post PDIRUNLOCK vfs. Sponsored by: Isilon Systems, Inc.
# 3b73a3c0	27-Mar-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Remove another ';' after if(). Also spotted by: bz
# 2d8dfb28	27-Mar-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Remove extra ; at end of if(). Found by: bz
# 228ea9d2	24-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Don't recycle vnodes anymore. Free them once they are dead. getnewvnode now always allocates a new vnode. - Define a new function, vnlru_free, which frees vnodes from the free list. It takes as a parameter the number of vnodes to free, which is wantfreevnodes - freevnodes when called from vnlru_proc or 1 when called from getnewvnode(). For now, getnewvnode() still tries to reclaim a free vnode before creating a new one when we are near the limit. - Define a function, vdestroy, which handles the actual release of memory and teardown of locks, etc. This could become a uma_dtor() routine. - Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This keeps more unreferenced vnodes around so that files which have only been stat'd are less likely to be kicked out of the system before we have a chance to read them, etc. These vnodes may still be freed via the normal vnlru_proc() routines which may some day become a real lru.
# d830f828	24-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Pass LK_EXCLUSIVE to VFS_ROOT() to satisfy the new flags argument. For now, all calls to VFS_ROOT() should still acquire exclusive locks. Sponsored by: Isilon Systems, Inc.
# c167961e	23-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- If vput() is called with a shared lock it must upgrade to an exclusive before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path because we may violate some lock order with another locked vnode if we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the vnode with VI_OWEINACT. This case should be very rare. - Clear VI_OWEINACT in vinactive() and vbusy(). - If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well. Sponsored by: Isilon Systems, Inc.
# b172f6c5	15-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Now that there are no external users of vfree() make it static. - Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so no one else attempts to grow a dependency on them. - Now that objects with pages hold the vnode we don't have to do unlocked checks for the page count in the vm object in VSHOULDFREE. These three macros could simply check for holdcnt state transitions to determine whether the vnode is on the free list already, but the extra safety the flag affords us is probably worth the minimal cost. - The leafonly sysctl and code have been dead for several years now, remove the sysctl and the code that employed it from vtryrecycle(). - vtryrecycle() also no longer has to check the object's page count as the object holds the vnode until it reaches 0. Sponsored by: Isilon Systems, Inc.
# c178628d	15-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Expose vholdl() so it may be used outside of vfs_subr.c
# 8045557f	14-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Increment the holdcnt once for each usecount reference. This allows us to use only the holdcnt to determine whether a vnode may be recycled, simplifying the V* macros as well as vtryrecycle(), etc. Sponsored by: Isilon Systems, Inc.
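The accounting rule in the entry above (one hold per usecount reference, so that holdcnt alone decides whether a vnode may be recycled) can be illustrated with a small user-space sketch. All `sketch_*` names are hypothetical and the locking is omitted; this is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified vnode: only the two counters matter here. */
struct sketch_vnode {
	int usecount;	/* active users */
	int holdcnt;	/* holds, including one per usecount reference */
};

static void
sketch_vhold(struct sketch_vnode *vp)
{
	vp->holdcnt++;
}

static void
sketch_vdrop(struct sketch_vnode *vp)
{
	assert(vp->holdcnt > 0);
	vp->holdcnt--;
}

/* Taking a use reference also takes a hold. */
static void
sketch_vref(struct sketch_vnode *vp)
{
	vp->usecount++;
	sketch_vhold(vp);
}

/* Dropping a use reference also drops the matching hold. */
static void
sketch_vrele(struct sketch_vnode *vp)
{
	assert(vp->usecount > 0);
	vp->usecount--;
	sketch_vdrop(vp);
}

/* With the rule above, the recycle test needs to look at holdcnt only. */
static bool
sketch_can_recycle(const struct sketch_vnode *vp)
{
	return (vp->holdcnt == 0);
}
```

A vnode with a pure hold (no usecount) stays unrecyclable until that hold is dropped, which is why a single counter is enough for the check.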
# 159b4548	14-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- We do not have to check the object's ref_count in VSHOULDFREE or vtryrecycle(). All obj refs also ref the vnode. - Consistently use v_incr_usecount() to increment the usecount. This will be more important later. Sponsored by: Isilon Systems, Inc.
# 8f13a540	14-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Slightly rearrange vrele() to move the common case in one indentation level. Sponsored by: Isilon Systems, Inc.
# 6fc16a83	14-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Rework vget() so we drop the usecount in two failure cases that were missed by my last commit. Sponsored by: Isilon Systems, Inc.
# 6703c30b	13-Mar-2005	Jeff Roberson <jeff@FreeBSD.org>	- Remove vx_lock, vx_unlock, vx_wait, etc. - Add a vn_start_write/vn_finished_write around vlrureclaim so we don't do writing ops without suspending. This could suspend the vlruproc which should not be a problem under normal circumstances. - Manually implement VMIGHTFREE in vlrureclaim as this was the only instance where it was used. - Acquire a lock before calling vgone() as it now requires it. - Move the acquisition of the vnode interlock from vtryrecycle() to getnewvnode() so that if it fails we don't drop and reacquire the vnode_free_list_mtx. - Check for a usecount or holdcount at the end of vtryrecycle() in case someone grabbed a ref while we were recycling. Abort the recycle, and on the final ref drop this vnode will be placed on the head of the free list. - Move the redundant VOP_INACTIVE protection code into the local vinactive() routine to avoid code bloat. - Keep the vnode lock held across calls to vgone() in several places. - vgonel() no longer uses XLOCK, instead callers must hold an exclusive vnode lock. The VI_DOOMED flag is set to allow other threads to detect a vnode which is no longer valid. This flag is set until the last reference is gone, and there are no chances for a new ref. vgonel() holds this lock across the entire function, which greatly simplifies logic. - Only vfree() in one place in vgone() not three. - Adjust vget() to check the VI_DOOMED flag prior to waiting on the lock in the LK_NOWAIT case. In other cases, check after we have slept and acquired an exclusive lock. This will simulate the old vx_wait() behavior. Sponsored by: Isilon Systems, Inc.
# d9a9c2c2	23-Feb-2005	Jeff Roberson <jeff@FreeBSD.org>	- Enable SMP VFS by default on current. More users are needed to turn up any remaining bugs. Anyone inconvenienced by this can still disable it in the loader. Sponsored by: Isilon Systems, Inc.
# d8a7c99a	22-Feb-2005	Jeff Roberson <jeff@FreeBSD.org>	- Only the xlock holder should be calling VOP_LOCK on a vp once VI_XLOCK has been set. Assert that this is the case so that we catch filesystems who are using naked VOP_LOCKs in illegal cases. Sponsored by: Isilon Systems, Inc.
# 4c11620b	22-Feb-2005	Jeff Roberson <jeff@FreeBSD.org>	- Add a check for xlock in vop_lock_assert. Presently the xlock is considered to be as good as an exclusive lock, although there is still a possibility of someone acquiring a VOP LOCK while xlock is held. Sponsored by: Isilon Systems, Inc.
# 767056c0	22-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Zero the v_un container field to make sure everything is gone.
# aa2f6ddc	22-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Reap more benefits from DEVFS: List devfs_dirents rather than vnodes off their shared struct cdev, this saves a pointer field in the vnode at the expense of a field in the devfs_dirent. There are often 100 times more vnodes so this is a bargain. In addition it makes it harder for people to try to do stupid things like "finding the vnode from cdev". Since DEVFS handles all VCHR nodes now, we can do the vnode related cleanup in devfs_reclaim() instead of in dev_rel() and vgonel(). Similarly, we can do the struct cdev related cleanup in dev_rel() instead of devfs_reclaim(). Rename idestroy_dev() to destroy_devl() for consistency. Add LIST_ENTRY de_alias to struct devfs_dirent. Remove v_specnext from struct vnode. Change si_hlist to si_alist in struct cdev. String new devfs vnodes' devfs_dirent on si_alist when we create them and take them off in devfs_reclaim(). Fix devfs_revoke() accordingly. Also don't clear fields devfs_reclaim() will clear when called from vgone(); Let devfs_reclaim() call dev_rel() instead of vgonel(). Move the usecount tracking from dev_rel() to devfs_reclaim(), and let dev_rel() take a struct cdev argument instead of vnode. Destroy SI_CHEAPCLONE devices in dev_rel() (instead of devfs_reclaim()) when they are no longer used. (This should maybe happen in devfs_close() instead.)
# 7fc940b2	22-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Remove vfinddev(), it is generally bogus when faced with jails and chroot and has no legitimate use(r)s in the tree.
# dfd4be14	19-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Try to unbreak the vnode locking around vop_reclaim() (based mostly on patch from kan@). Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on close. This is not yet a generally safe function, but for this very specific use it is safe. This solves the problem with buffers not being flushed by unmount or after failed mount attempts.
# 900b7e26	18-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Make sure to drop the VI_LOCK in vgonel(); Spotted by: Taku YAMAMOTO <taku@tackymt.homeip.net>
# 4d8ac58b	17-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce vx_wait{l}() and use it instead of home-rolled versions.
# 58aac128	17-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Convert KASSERTS to VNASSERTS
# 1ba21282	09-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Make various vnode related functions static
# fe019877	10-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Don't pass NULL to vprint()
# 68f2274d	08-Feb-2005	Jeff Roberson <jeff@FreeBSD.org>	- Add a new assert in the getnewvnode(). Assert that the usecount is still 0 to detect getnewvnode() races. - Add the vnode address to a few panics near by to help in debugging. Sponsored by: Isilon Systems, Inc.
# b9489a44	07-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Access vmobject via the bufobj instead of the vnode
# b348abd6	07-Feb-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Don't call VOP_DESTROYVOBJECT(), trust that VOP_RECLAIM() did what was necessary.
# d4eb29ba	28-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused argument to vrecycle()
# 1fdfaafb	28-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Integrate vclean() into vgonel(). Various associated polishing.
# 3fc8dd06	27-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Remove register keyword
# 8516dd18	24-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Don't use VOP_GETVOBJECT, use vp->v_object directly.
# b5b6ec5f	24-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Eliminate the constant flags argument to vclean()
# 7c93282e	24-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Change vprint() to vn_printf() which takes varargs. Add #define for vprint() to call vn_printf().
# 35764be3	24-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Kill the VV_OBJBUF and test the v_object for NULL instead.
# d1fcf3bb	24-Jan-2005	Jeff Roberson <jeff@FreeBSD.org>	- Add the tunable and sysctl for the mpsafevfs. It currently defaults to off. - Protect access to mnt_kern_flag with the mountpoint mutex. - Remove some KASSERTs which are not legal checks without the appropriate locks held. - Use VCANRECYCLE() rather than rolling several slightly different checks together. - Return from vtryrecycle() with a recycled vnode rather than a locked vnode. This simplifies some locking. - Remove several GIANT_REQUIRED lines. - Add a few KASSERTs to help with INACT debugging. Sponsored By: Isilon Systems, Inc.
# 7bf38aea	16-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Fix a bug I introduced in 1.561 which has caused considerable filesystem unhappiness lately. As far as I can tell, no files that have made it safely to disk have been endangered, but stuff in transit has been in peril. Pointy hat: phk
# 7c0745ee	14-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Eliminate unused and unnecessary "cred" argument from vinvalbuf()
# e39db32a	12-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT() directly.
# 6ef8480a	11-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Add BO_SYNC() and add a default which uses the secret vnode pointer and VOP_FSYNC() for now.
# 6afa350d	11-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	More vnode -> bufobj migration.
# 8d785753	11-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Give flushbuflist() a struct bufv as first argument and avoid home-rolling TAILQ_FOREACH_SAFE(). Lose the error pointer argument and return any errors the normal way. Return EAGAIN for the case where more work needs to be done.
# 8df6bac4	11-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC(). I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense: The credentials for syncing a file (ability to write to the file) should be checked at the system call level. Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well. If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to be the cached mount credential, or a credential cached along with any delayed write data. Discussed with: rwatson
# 9454b2d8	06-Jan-2005	Warner Losh <imp@FreeBSD.org>	/* -> /*- for copyright notices, minor format tweaks as necessary
# 0b3e4fe2	04-Jan-2005	Poul-Henning Kamp <phk@FreeBSD.org>	Since we do not support forceful unmount of DEVFS we can do away with the partially implemented vnode-readoption code in vgonechrl().
# e87047b4	20-Dec-2004	Poul-Henning Kamp <phk@FreeBSD.org>	We can only ever get to vgonechrl() from a devfs vnode, so we do not need to reassign the vp->v_op to devfs_specops, we know that is the value already. Make devfs_specops private to devfs.
# 20a92a18	07-Dec-2004	Poul-Henning Kamp <phk@FreeBSD.org>	The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly split the conversion of the remaining three filesystems out from the root mounting changes, so in one go: cd9660: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom() Rely on vfs_mount.c calling VFS_STATFS() nfs(client): Convert to nmount (the simple way, mount_nfs(8) is still necessary). Add omount compat shims. Drop COMPAT_PRELITE2 mount arg compatibility. ffs: Convert to nmount. Add omount compat shims. Remove dedicated rootfs mounting code. Use vfs_mountedfrom() Rely on vfs_mount.c calling VFS_STATFS() Remove vfs_omount() method, all filesystems are now converted. Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem task, and they all do it now. Change rootmounting to use DEVFS trampoline: vfs_mount.c: Mount devfs on /. Devfs needs no 'from' so this is clean. symlink /dev to /. This makes it possible to lookup /dev/foo. Mount "real" root filesystem on /. Surgically move the devfs mountpoint from under the real root filesystem onto /dev in the real root filesystem. Remove now unnecessary getdiskbyname(). kern_init.c: Don't do devfs mounting and rootvnode assignment here, it was already handled by vfs_mount.c. Remove now unused bdevvp(), addaliasu() and addalias(). Put the few necessary lines in devfs where they belong. This eliminates the second-last source of bogo vnodes, leaving only the lemming-syncer. Remove rootdev variable, it doesn't give meaning in a global context and was not trustworthy anyway. Correct information is provided by statfs(/).
# f76fedd2	02-Dec-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Improve vprint() a little bit: break long lines, reduce indent and tell if the VI_LOCK() is held.
# aec0fb7b	01-Dec-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Back when VOP_* was introduced, we did not have new-style struct initializations but we did have lofty goals and big ideals. Adjust to more contemporary circumstances and gain type checking. Replace the entire vop_t frobbing thing with properly typed structures. The only casualty is that we can not add a new VOP_ method with a loadable module. History has not given us reason to believe this would ever be feasible in the first place. Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc. Give coda correct prototypes and function definitions for all vop_()s. Generate a bit more data from the vnode_if.src file: a struct vop_vector and prototype typedefs for all vop methods. Add a new vop_bypass() and make vop_default be a pointer to another struct vop_vector. Remove a lot of vfs_init since vop_vector is ready to use from the compiler. Cast various vop_mumble() to void * with uppercase name, for instance VOP_PANIC, VOP_NULL etc. Implement VCALL() by making vdesc_offset the offsetof() the relevant function pointer in vop_vector. This is disgusting but since the code is generated by a script comparatively safe. The alternative for nullfs etc. would be much worse. Fix up all vnode method vectors to remove casts so they become typesafe. (The bulk of this is generated by scripts)
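As a rough illustration of the offsetof()-based dispatch described in the entry above, the following user-space sketch looks up a method by its byte offset in a vector of function pointers and falls back through a default vector when the slot is empty. Everything named `sketch_*` is hypothetical; the single shared method signature and the fallback loop are simplifying assumptions, not the generated kernel code.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: every method shares one signature to keep the sketch short. */
typedef int sketch_vop_t(int arg);

struct sketch_vop_vector {
	const struct sketch_vop_vector *vop_default;	/* fallback vector */
	sketch_vop_t *vop_open;
	sketch_vop_t *vop_close;
};

/*
 * Call the method stored at byte offset 'offset' inside the vector,
 * walking the vop_default chain until a non-NULL slot is found.
 * Returns -1 if no vector in the chain implements the method.
 */
static int
sketch_vcall(const struct sketch_vop_vector *vec, size_t offset, int arg)
{
	sketch_vop_t *fn;

	for (; vec != NULL; vec = vec->vop_default) {
		/* Read the function pointer that lives at the given offset. */
		memcpy(&fn, (const char *)vec + offset, sizeof(fn));
		if (fn != NULL)
			return (fn(arg));
	}
	return (-1);
}

static int sketch_default_open(int arg) { return (arg + 1); }
static int sketch_default_close(int arg) { return (arg + 2); }
static int sketch_fs_open(int arg) { return (arg + 100); }

static const struct sketch_vop_vector sketch_default_vops = {
	.vop_default = NULL,
	.vop_open = sketch_default_open,
	.vop_close = sketch_default_close,
};

/* A filesystem that overrides open and inherits close. */
static const struct sketch_vop_vector sketch_fs_vops = {
	.vop_default = &sketch_default_vops,
	.vop_open = sketch_fs_open,
};
```

A caller would pass `offsetof(struct sketch_vop_vector, vop_close)` as the offset; because the designated initializers leave unset slots NULL, the lookup falls through to the default vector.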
# a752aa8f	15-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.
# 11bcbee1	14-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Move the bit of the syncer which deals with vnodes into a separate function.
# db442506	13-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Eliminate vop_revoke() function now that devfs_revoke() does the entire job.
# c5b846fe	10-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Slim vnodes by another four bytes by eliminating the (now) unused field v_cachedid.
# c13a4e88	10-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove vn_todev()
# b797084e	09-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove vnode->v_cachedfs. It was only used for the highly dangerous "export all vnodes with a sysctl" function.
# c5690651	04-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove buf->b_dev field.
# e0b687d3	03-Nov-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Always initialize bo_private along with bo_ops in getnewvnode(). Spotted by: tegge
# 996b2c82	29-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Lose vfs_mountedon()
# e1f355fe	29-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Give the bufobj a private __bo_vnode for now to keep the syncer floating [1] At some point later the syncer will unlearn about vnodes and the filesystems method called by the syncer will know enough about what's in bo_private to do the right thing. [1] Ok, I know, but I couldn't resist the pun.
# 20eba72f	27-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Move the syncer linkage from vnode to bufobj. This is not quite a perfect separation: the syncer still thinks it knows that everything is a vnode.
# 5d9d81e7	26-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Put the I/O block size in bufobj->bo_bsize. We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot when filesystems sit on GEOM, so don't bother for now.
# 156cb265	25-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Lose the v_dirty* and v_clean* alias macros. Check the count field where we just want to know the full/empty state, rather than using TAILQ_EMPTY() or TAILQ_FIRST().
# ee1d0eb3	25-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove vnode->v_bsize. This was a dead-end.
# 4dcd0ac4	25-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Collapse vnode->v_object and buf->b_object into bufobj->bo_object.
# b792bebe	24-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Move the buffer method vector (buf->b_op) to the bufobj. Extend it with a strategy method. Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY song and dance. Rename ibwrite to bufwrite(). Move the two NFS buf_ops to more sensible places, add bufstrategy to them. Add inlines for bwrite() and bstrategy() which call through buf->b_bufobj->b_ops->b_{write,strategy}(). Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
# b0e86f6a	22-Oct-2004	Robert Watson <rwatson@FreeBSD.org>	When MAC is enabled, warn if getnewvnode() is asked to produce a vnode without a mountpoint. In this scenario, there's no useful source for a label on the vnode, since we can't query the mountpoint for the labeling strategy or default label.
# ff7c5a48	22-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite jest, of most excellent fancy: he hath taught me lessons a thousand times; and now, how abhorred in my imagination it is! my gorge rises at it. Here were those hacks that I have curs'd I know not how oft. Where be your kludges now? your workarounds? your layering violations, that were wont to set the table on a roar? Move the skeleton of specfs into devfs where it now belongs and bury the rest.
# 494eb176	22-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Add b_bufobj to struct buf which eventually will eliminate the need for b_vp. Initialize b_bufobj for all buffers. Make incore() and gbincore() take a bufobj instead of a vnode. Make inmem() local to vfs_bio.c Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj) also VI_MTX() to BO_MTX(), Make buf_vlist_add() take a bufobj instead of a vnode. Eliminate other uses of bp->b_vp where bp->b_bufobj will do. Various minor polishing: remove "register", turn panic into KASSERT, use new function declarations, TAILQ_FOREACH_SAFE() etc.
# a76d8f4e	21-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Move the VI_BWAIT flag into a bo_flag element of bufobj and call it BO_WWAIT. Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write count on a bufobj. Bufobj_wdrop() replaces vwakeup(). Use these functions in all relevant places except in ffs_softdep.c where the use of interlocked_sleep() makes this impossible. Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
# 1bca607b	21-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx. Initialize the bo_mtx when we allocate a vnode in getnewvnode(). For now we point to the vnode's interlock mutex, that retains the exact same locking semantics. Move v_numoutput from vnode to bufobj. Add renaming macro to postpone code sweep.
# 67647b23	21-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Polish vtruncbuf() to improve readability and style a bit.
# e1633956	21-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Simplify buf_vlist_remove(). Now that we have encapsulated the splaytree related information into a structure we can eliminate the half of this function.
# 57259f28	05-Oct-2004	Greg Lehey <grog@FreeBSD.org>	vtryrecycle: Don't rely on type VBAD alone to mean that we don't need to clean the vnode. If v_data is set, we still need to clean it. This code change should catch all incidents of the previous commit (INVARIANTS only).
# f2154b33	05-Oct-2004	Greg Lehey <grog@FreeBSD.org>	getnewvnode: Weaken the panic "cleaned vnode isn't" to a warning. Discussion: this panic (or warning) only occurs when the kernel is compiled with INVARIANTS. Otherwise the problem (which means that the vp->v_data field isn't NULL, and represents a coding error and possibly a memory leak) is silently ignored by setting it to NULL later on. Panicking here isn't very helpful: by this time, we can only find the symptoms. The panic occurs long after the reason for "not cleaning" has been forgotten; in the case in point, it was the result of severe file system corruption which left the v_type field set to VBAD. That issue will be addressed by a separate commit.
# ba285125	01-Oct-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Fix a LOR relating to freeing cdevs.
# 70526ca6	24-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Hold dev_lock and check for NULL devsw pointer when we determine if a vnode is a disk.
# a0e78d2e	23-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount of the number of threads which are inside whatever is behind the cdevsw for this particular cdev. Make the device mutex visible through dev_lock() and dev_unlock(). We may want finer granularity later. Replace spechash_mtx use with dev_lock()/dev_unlock().
# 08dbd671	15-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused B_WRITEINPROG flag
# 1affa3ad	07-Sep-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Create simple function init_va_filerev() for initializing a va_filerev field. Replace three instances of longhaired initialization of va_filerev fields. Added XXX comment wondering why we don't use random bits instead of uptime of the system for this purpose.
# 8ded6540	20-Aug-2004	Don Lewis <truckman@FreeBSD.org>	Don't attempt to trigger the syncer thread final sync code in the shutdown_pre_sync state if the RB_NOSYNC flag is set. This is the likely cause of hangs after a system panic that are keeping crash dumps from being done. This is a MFC candidate for RELENG_5. MFC after: 3 days
# 78c37b0d	16-Aug-2004	David E. O'Brien <obrien@FreeBSD.org>	s/MAX_SAFE_MAXVNODES/MAXVNODES_MAX/g
# ad3b9257	15-Aug-2004	John-Mark Gurney <jmg@FreeBSD.org>	Add locking to the kqueue subsystem. This also makes the kqueue subsystem a more complete subsystem, and removes the knowledge of how things are implemented from the drivers. Include locking around filter ops, so a module like aio will know when not to be unloaded if there are outstanding knotes using its filter ops. Currently, it uses the MTX_DUPOK even though it is not always safe to acquire duplicate locks. Witness currently doesn't support the ability to discover if a dup lock is ok (in some cases). Reviewed by: green, rwatson (both earlier versions)
# 87e83e7d	10-Aug-2004	Robert Watson <rwatson@FreeBSD.org>	In v_addpollinfo(), we allocate storage to back vp->v_pollinfo. However, we may sleep when doing so; check that we didn't race with another thread allocating storage for the vnode after allocation is made to a local pointer, and only update the vnode pointer if it's still NULL. Otherwise, accept that another thread got there first, and release the local storage. Discussed with: jmg
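The allocate-then-recheck pattern from the entry above can be sketched in user space as follows. The allocation happens with no lock held (it may block), and the result is published only if the pointer is still NULL once the lock is taken; otherwise the local copy is released. The `sketch_*` names and the use of a pthread mutex in place of the vnode interlock are assumptions for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct sketch_pollinfo {
	int events;
};

/* Hypothetical stand-in for a vnode with lazily allocated poll state. */
struct sketch_pollvnode {
	pthread_mutex_t interlock;
	struct sketch_pollinfo *pollinfo;
};

static void
sketch_addpollinfo(struct sketch_pollvnode *vp)
{
	struct sketch_pollinfo *vi;

	assert(vp != NULL);

	/* Allocate into a local pointer first; no lock is held here. */
	vi = calloc(1, sizeof(*vi));
	if (vi == NULL)
		return;

	pthread_mutex_lock(&vp->interlock);
	if (vp->pollinfo == NULL) {
		/* We won the race: publish our allocation. */
		vp->pollinfo = vi;
		vi = NULL;
	}
	pthread_mutex_unlock(&vp->interlock);

	/* If another thread got there first, release the local storage. */
	free(vi);	/* free(NULL) is a no-op */
}
```

Calling the function twice on the same object leaves the first allocation in place and frees the second, which is the behaviour the commit message describes for the losing thread.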
# c8c216d5	09-Aug-2004	Nate Lawson <njl@FreeBSD.org>	Skip the syncing disks loop if there are no dirty buffers. Remove a variable used to flag the initial printf. Submitted by: truckman (earlier version)
# 64298d52	02-Aug-2004	David E. O'Brien <obrien@FreeBSD.org>	Put a cap on the auto-tuning of kern.maxvnodes. Cap value chosen by: scottl
# b1c81391	29-Jul-2004	Nate Lawson <njl@FreeBSD.org>	Minor message cleanup.
# 3dfe213e	27-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Convert the vfsconf list to a TAILQ. Introduce vfs_byname() function to find things on it. Staticize vfs_nmount() function under the name vfs_donmount(). Various cleanups.
# 56f21b9d	26-Jul-2004	Colin Percival <cperciva@FreeBSD.org>	Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is somewhat clearer, but more importantly allows for a consistent naming scheme for suser_cred flags. The old name is still defined, but will be removed in a few days (unless I hear any complaints...) Discussed with: rwatson, scottl Requested by: jhb
# cf95b5c3	25-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Eliminate unused second argument to reassignbuf() and simplify it accordingly.
# 05656b6e	21-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	put several of the options for DEBUG_VFS_LOCKS under control of sysctls.
# bb5faea3	15-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Cleanup shutdown output.
# da6303ba	14-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Tidy up system shutdown.
# f257b7a5	12-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Make VFS_ROOT() and vflush() take a thread argument. This is to allow filesystems to decide based on the passed thread which vnode to return. Several filesystems used curthread, they now use the passed thread.
# 7ae8ce5d	11-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Dump the actual bad values when this assertion is tripped.
# 32240d08	10-Jul-2004	Marcel Moolenaar <marcel@FreeBSD.org>	Update for the KDB framework: o Call kdb_enter() instead of Debugger().
# 057589c4	08-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	fixup sysctl by fsid node
# ea0104b0	06-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Introduce vfs_suser(), used to test if a user should have special privs for a mount.
# c713aaae	06-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	NFS mobility PHASE I, II & III (phase VI, and V pending): Rebind the client socket when we experience a timeout. This fixes the case where our IP changes for some reason. Signal a VFS event when NFS transitions from up to down and vice versa. Add a placeholder vfs_sysctl where we will put status reporting shortly. Also: Make down NFS mounts return EIO instead of EINTR when there is a soft timeout or force unmount in progress.
# 27875d9c	05-Jul-2004	Don Lewis <truckman@FreeBSD.org>	Unconditionally set last_work_seen while in the SYNCER_RUNNING state so that last_work_seen has a reasonable value at the transition to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened to be zero at the time. Initialize last_work_seen to zero as a safety measure in case the syncer never ran in the SYNCER_RUNNING state. Tested by: phk
# faf1b66d	04-Jul-2004	Don Lewis <truckman@FreeBSD.org>	Rework syncer termination code: Speed up the syncer when shutting down by sleeping for a shorter period of time instead of cranking up rushjob and using the normal one second sleep. Skip empty worklist slots when shutting down to avoid lengthy intervals of inactivity. Give I/O more time to complete between steps by not speeding the syncer quite as much. Terminate the syncer after one full pass through the worklist plus one second with the worklist containing nothing but syncer vnodes. Print an indication of shutdown progress to the console. Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist to be monitored.
# c555963f	04-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE.
# 2d1dca73	04-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Pass the operation in with the fsidctl. Remove some fsidctls that we will not be using. Correct prototypes for fs sysctls.
# 7f6599fe	04-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Make the last commit handle non-phk root devices better.
# 1cbb1e02	03-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Blocksize for I/O should be a property of the vnode and not found by groping around in the vnode's surroundings when we allocate a block. Assign a blocksize when we create a vnode, and yell a warning (and ignore it) if we got the wrong size. Please email all such warnings to me.
# 94ed9c8a	04-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Introduce a new kevent filter. EVFILT_FS that will be used to signal generic filesystem events to userspace. Currently only mount and unmount of filesystems are signalled. Soon to be added, up/down status of NFS. Introduce a sysctl node used to route requests to/from filesystems based on filesystem ids. Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/ entrypoint by the sysctl code to change individual filesystems.
# 903ac7c2	04-Jul-2004	Alfred Perlstein <alfred@FreeBSD.org>	Revision 1.496 would not boot on my system due to ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) -> insmntqueue(vp, mp = NULL) -> KASSERT -> panic Make getnewvnode() only call insmntqueue() if the mountpoint parameter is not NULL.
# e3c5a7a4	04-Jul-2004	Poul-Henning Kamp <phk@FreeBSD.org>	When we traverse the vnodes on a mountpoint we need to look out for our cached 'next vnode' being removed from this mountpoint. If we find that it was recycled, we restart our traversal from the start of the list. Code to do that is in all local disk filesystems (and a few other places) and looks roughly like this: MNT_ILOCK(mp); loop: for (vp = TAILQ_FIRST(&mp...); (vp = nvp) != NULL; nvp = TAILQ_NEXT(vp,...)) { if (vp->v_mount != mp) goto loop; MNT_IUNLOCK(mp); ... MNT_ILOCK(mp); } MNT_IUNLOCK(mp); The code which takes vnodes off a mountpoint looks like this: MNT_ILOCK(vp->v_mount); ... TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes); ... MNT_IUNLOCK(vp->v_mount); ... vp->v_mount = something; (Take a moment and try to spot the locking error before you read on.) On an SMP system, one CPU could have removed nvp from our mountlist but not yet gotten to assign a new value to vp->v_mount while another CPU simultaneously gets to the top of the traversal loop where it finds that (vp->v_mount != mp) is not true despite the fact that the vnode has indeed been removed from our mountpoint. Fix: Introduce the macro MNT_VNODE_FOREACH() to traverse the list of vnodes on a mountpoint while taking into account that vnodes may be removed from the list as we go. This saves approx 65 lines of duplicated code. Split the insmntque() which potentially moves a vnode from one mount point to another into delmntque() and insmntque() which do just what the names say. Fix delmntque() to set vp->v_mount to NULL while holding the mountpoint lock.
# e06500dd	01-Jul-2004	Don Lewis <truckman@FreeBSD.org>	When shutting down the syncer kernel thread, first tell it to run faster and iterate over its work list a few times in an attempt to empty the work list before the syncer terminates. This leaves fewer dirty blocks to be written at the "syncing disks" stage and keeps the "giving up on N buffers" problem from being triggered by the presence of a large soft updates work list at system shutdown time. The downside is that the syncer takes noticeably longer to terminate. Tested by: "Arjan van Leeuwen" <avleeuwen AT piwebs DOT com> Approved by: mckusick
# f3732fd1	17-Jun-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Second half of the dev_t cleanup. The big lines are: NODEV -> NULL NOUDEV -> NODEV udev_t -> dev_t udev2dev() -> findcdev() Various minor adjustments including handling of userland access to kernel space struct cdev etc.
# 89c9c53d	16-Jun-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Do the dreaded s/dev_t/struct cdev */ Bump __FreeBSD_version accordingly.
# 170593a9	14-Jun-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove a left over from userland buffer-cache access to disks.
# 9e6127fe	31-May-2004	Robert Watson <rwatson@FreeBSD.org>	Assert Giant in vrele().
# a0b5a679	11-Apr-2004	Maxime Henrion <mux@FreeBSD.org>	Put deprecated sysctl code inside BURN_BRIDGES.
# 7f8a436f	05-Apr-2004	Warner Losh <imp@FreeBSD.org>	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999. Approved by: core
# 39d3505a	29-Mar-2004	Peter Wemm <peter@FreeBSD.org>	Kill some XXXKSE's. vnlru/syncer are single threaded.
# 4d453ef1	11-Mar-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Properly vector all bwrite() and BUF_WRITE() calls through the same path and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
# 651b11ea	11-Mar-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused second arg to vfinddev(). Don't call addaliasu() on VBLK nodes.
# ff85a3f0	05-Mar-2004	Alexander Kabaev <kan@FreeBSD.org>	Always call vn_finished_write after vn_start_write was called. All occurrences of 'goto done' after vn_start_write invocation were cleaning up incompletely.
# 44f3b092	27-Feb-2004	John Baldwin <jhb@FreeBSD.org>	Switch the sleep/wakeup and condition variable implementations to use the sleep queue interface: - Sleep queues attempt to merge some of the benefits of both sleep queues and condition variables. Having sleep queues in a hash table avoids having to allocate a queue head for each wait channel. Thus, struct cv has shrunk down to just a single char * pointer now. However, the hash table does not hold threads directly, but queue heads. This means that once you have located a queue in the hash bucket, you no longer have to walk the rest of the hash chain looking for threads. Instead, you have a list of all the threads sleeping on that wait channel. - Outside of the sleepq code and the sleep/cv code the kernel no longer differentiates between cv's and sleep/wakeup. For example, calls to abortsleep() and cv_abort() are replaced with a call to sleepq_abort(). Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and cv_waitq_remove() have been replaced with calls to sleepq_remove(). - The sched_sleep() function no longer accepts a priority argument as sleeps no longer inherently bump the priority. Instead, this is solely a property of msleep() which explicitly calls sched_prio() before blocking. - The TDF_ONSLEEPQ flag has been dropped as it was never used. The associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been dropped and replaced with a single explicit clearing of td_wchan. TD_SET_ONSLEEPQ() would really have only made sense if it had taken the wait channel and message as arguments anyway. Now that that only happens in one place, a macro would be overkill.
# 47934cef	25-Feb-2004	Don Lewis <truckman@FreeBSD.org>	Split the mlock() kernel code into two parts, mlock(), which unpacks the syscall arguments and does the suser() permission check, and kern_mlock(), which does the resource limit checking and calls vm_map_wire(). Split munlock() in a similar way. Enable the RLIMIT_MEMLOCK checking code in kern_mlock(). Replace calls to vslock() and vsunlock() in the sysctl code with calls to kern_mlock() and kern_munlock() so that the sysctl code will obey the wired memory limits. Nuke the vslock() and vsunlock() implementations, which are no longer used. Add a member to struct sysctl_req to track the amount of memory that is wired to handle the request. Modify sysctl_wire_old_buffer() to return an error if its call to kern_mlock() fails. Only wire the minimum of the length specified in the sysctl request and the length specified in its argument list. It is recommended that sysctl handlers that use sysctl_wire_old_buffer() should specify reasonable estimates for the amount of data they want to return so that only the minimum amount of memory is wired no matter what length has been specified by the request. Modify the callers of sysctl_wire_old_buffer() to look for the error return. Modify sysctl_old_user to obey the wired buffer length and clean up its implementation. Reviewed by: bms
# ded67d0f	21-Feb-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Check for NODEV return from udev2dev()
# cd690b60	21-Feb-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Device megapatch 6/6: This is what we came here for: Hang dev_t's from their cdevsw, refcount cdevsw and dev_t and generally keep track of things a lot better than we used to: Hold a cdevsw reference around all entrances into the device driver, this will be necessary to safely determine when we can unload driver code. Hold a dev_t reference while the device is open. KASSERT that we do not enter the driver on a non-referenced dev_t. Remove old D_NAG code, anonymous dev_t's are not a problem now. When destroy_dev() is called on a referenced dev_t, move it to dead_cdevsw's list. When the refcount drops, free it. Check that cdevsw->d_version is correct. If not, set all methods to the dead_*() methods to prevent entrance into driver. Print warning on console to this effect. The device driver may still explode if it is also incompatible with newbus, but in that case we probably didn't get this far in the first place.
# 816d62bb	21-Feb-2004	Poul-Henning Kamp <phk@FreeBSD.org>	Device megapatch 5/6: Remove the unused second argument from udev2dev(). Convert all remaining users of makedev() to use udev2dev(). The semantic difference is that udev2dev() will only locate a pre-existing dev_t, it will not like makedev() create a new one. Apart from the tiny well controlled window in D_PSEUDO drivers, there should no longer be any "anonymous" dev_t's in the system now, only dev_t's created with make_dev() and make_dev_alias()
# 580ddfa6	05-Jan-2004	Alexander Kabaev <kan@FreeBSD.org>	More style fixes. Obtained from: bde
# b0fdf716	05-Jan-2004	Alexander Kabaev <kan@FreeBSD.org>	style(9): Add empty line before first code line in functions with no local variables. Properly terminate comment sentences. Indent lines which are longer than 80 characters. Move v_addpollinfo closer to the rest of poll-related functions. Move DEBUG_VFS_LOCKS ifdefed block to the end of file. Obtained from: bde (partly)
# 3ff1b7c2	03-Jan-2004	Alexander Kabaev <kan@FreeBSD.org>	Cosmetics: strip '\n' from a string passed to Debugger().
# 9efe7d9d	28-Dec-2003	Bruce Evans <bde@FreeBSD.org>	v_vxproc was a bogus name for a thread (pointer).
# 958557e9	16-Dec-2003	Jeff Roberson <jeff@FreeBSD.org>	- In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT. Submitted by: Stephan Uphoff <ups@stups.com>
# d8521366	16-Dec-2003	Jeff Roberson <jeff@FreeBSD.org>	- When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by reassigning their v_ops field to specfs, detaching from the mountpoint, etc. However, this is not sufficient. If we vclean() the vnode the pages owned by the vnode are lost, potentially while buffers reference them. Implement parts of vclean() separately in vgonechrl() so that the pages and bufs associated with a device vnode are not destroyed while in use.
# a6c6a93c	30-Nov-2003	Jeff Roberson <jeff@FreeBSD.org>	- Don't forget to unlock the vnode interlock in the LK_NOWAIT case. Submitted by: Stephan Uphoff <ups@stups.com> Approved by: re (rwatson)
# 512824f8	09-Nov-2003	Seigo Tanimura <tanimura@FreeBSD.org>	- Implement selwakeuppri() which allows raising the priority of a thread being woken up. The thread woken up can run at a priority as high as after tsleep(). - Replace selwakeup()s with selwakeuppri()s and pass appropriate priorities. - Add cv_broadcastpri() which raises the priority of the broadcast threads. Used by selwakeuppri() if collision occurs. Not objected in: -arch, -current
# ca430f2e	04-Nov-2003	Alexander Kabaev <kan@FreeBSD.org>	Remove mntvnode_mtx and replace it with per-mountpoint mutex. Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to operate on this mutex transparently. Eventually new mutex will be protecting more fields in struct mount, not only vnode list. Discussed with: jeff
# 06cb76bd	23-Oct-2003	Garrett Wollman <wollman@FreeBSD.org>	Add appropriate const poisoning to the assert_*locked() family so that I can call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic. Inspired by: the evil and rude OpenAFS cache manager code
# f2b1200d	20-Oct-2003	Alan Cox <alc@FreeBSD.org>	Initialize the buf's b_object in pbgetvp(). Clear it in pbrelvp(). (This facilitates synchronization of the vm page's valid field using the vm object's lock.) Suggested by: tegge
# 3da2d6a4	17-Oct-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Simplify count_dev()
# 5108cd36	12-Oct-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Simplify vn_isdisk() a bit.
# 7dd1328c	11-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Fix a typo, I meant & and not \|. This was causing lockups from the syncer looping forever due to list corruption. Solved by: tegge
# bdcfcdec	05-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Fix an XXX. Check the error of vn_lock() in vflush(). Don't specify LK_RETRY either, we don't want this vnode if it turns into another. - Remove the code that checks the mount point after acquiring the lock we are guaranteed to either fail or get the vnode that we wanted.
# 45503a37	04-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Rename vcanrecycle() to vtryrecycle() to reflect its new role. - In vtryrecycle() try to vgonel the vnode if all of the previous checks passed. We won't vgonel if someone has either acquired a hold or usecount or started the vgone process elsewhere. This is because we may have been removed from the free list while we were inspecting the vnode for recycling. - The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling the same vnode. To further reduce the likelihood of this event, requeue the vnode on the tail of the list prior to calling vtryrecycle(). We can not actually remove the vnode from the list until we know that it's going to be recycled because other interlock holders may see the VI_FREE flag and try to remove it from the free list. - Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it regardless of MNT_WAIT because the vnode does not actually belong to this filesystem.
# 85311d4b	04-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Don't cache_purge() in getnewvnode. It's done in vclean(). With this purge, the purge in vclean, and the filesystems purge, we had 3 purges per vnode. - Move the insmntque(vp, 0) to vclean() so that we may remove it from the two vgone() functions and reduce the number of lock operations required.
# ce13b187	04-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Solve a LOR with the sync_mtx by using the VI_ONWORKLST flag to determine whether or not the sync failed. This could potentially get set between the time that we VOP_UNLOCK and VI_LOCK() but the race would harmlessly lead to the sync being delayed by an extra 30 seconds. If we do not move the vnode it could cause an endless loop if it continues to fail to sync. - Use vhold and vdrop to stop the vnode from changing identities while we have it unlocked. Other internal vfs lists are likely to follow this scheme.
# 894fbf97	04-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- Move the xlock 'locking' code into vx_lock() and vx_unlock(). - Create a new function, vgonechrl(), which performs vgone for an in-use character device. Move the code from vflush() that did this into vgonechrl(). - Hold the xlock across the entirety of vgonel() and vgonechrl() so that at no point will an invalid vnode exist on any list without XLOCK set. - Move the xlock code out of vclean() now that it is in the vgone*() functions.
# 6f4b0863	04-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- In sched_sync() test our preconditions prior to dropping the sync_mtx. This is so that we may grab the interlock while still holding the sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock order runs the other way. - If we don't meet any of the preconditions, reinsert the vp into the list for the next second. - We don't need to panic if we fail to sync here because each FSYNC function handles this case. Removing this redundant code also simplifies locking.
# e4c49d2b	04-Oct-2003	Jeff Roberson <jeff@FreeBSD.org>	- In a Giantless world, the vn_lock() in vcanrecycle() could legitimately fail. Remove the panic from that case and document why it might fail. - Document the reason for calling cache_purge() on a newly created vnode. - In insmntque() order the operations so that we can call mtx_unlock() one fewer times. This makes the code somewhat clearer as well. - Add XXX comments in sched_sync() and vflush(). - In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is set. - In vclean() we don't need to acquire a lock around a single TAILQ_FIRST call. It's ok if we race here, the vinvalbuf will just do nothing. - Increase the scope of the lock in vgonel() to reduce the number of lock operations that are performed.
# 51b57549	19-Sep-2003	Jeff Roberson <jeff@FreeBSD.org>	- In reassignbuf() don't unlock vp and lock newvp if they are the same. Doing so creates a race where the buf is on neither list. - Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we should. - Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this really should not happen and is fast to check.
# 6b6c163a	19-Sep-2003	Jeff Roberson <jeff@FreeBSD.org>	- Remove spls(). The locking that has replaced them is in place and they no longer serve as guidelines for future work.
# aebbeee8	19-Sep-2003	Alexander Kabaev <kan@FreeBSD.org>	Eliminate one case of VI_UNLOCK followed by an immediate VI_LOCK.
# 8b149b51	07-Aug-2003	John Baldwin <jhb@FreeBSD.org>	Consistently use the BSD u_int and u_short instead of the SYSV uint and ushort. In most of these files, there was a mixture of both styles and this change just makes them self-consistent. Requested by: bde (kern_ktrace.c)
# 68f2d20b	22-Jul-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Revert stuff which accidentally ended up in the previous commit.
# 55d1d703	22-Jul-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Don't attempt to inline large functions mb_alloc() and mb_free(), it more than doubles the text size of this file. GCC has wisely ignored us on this previously
# 677b542e	10-Jun-2003	David E. O'Brien <obrien@FreeBSD.org>	Use __FBSDID().
# a62f80f8	31-May-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Remove unused variable and now unbalanced call to splbio(); Found by: FlexeLint
# 2e05d898	23-May-2003	Alan Cox <alc@FreeBSD.org>	Make the maximum number of vnodes a function of both the physical memory size and the kernel's heap size, specifically, vm_kmem_size. This function allows a maximum of 40% of the vm_kmem_size to be used for vnodes and vm objects. This is a conservative bound based upon recent problem reports. (In other words, a slight increase in this percentage may be safe.) Finally, machines with less than ~3GB of RAM should be unaffected by this change, i.e., the maximum number of vnodes should remain the same. If necessary, machines with 3GB or more of RAM can increase the maximum number of vnodes by increasing vm_kmem_size. Desired by: scottl Tested by: jake Approved by: re (rwatson,scottl)
# 1e9bc9f8	16-May-2003	Don Lewis <truckman@FreeBSD.org>	Detect that a vnode has been reclaimed while vflush() was waiting to lock the vnode and restart the loop. Vflush() is vulnerable since it does not hold a reference to the vnode and it holds no other locks while waiting for the vnode lock. The vnode will no longer be on the list when the loop is restarted. Approved by: re (rwatson)
# 099e981a	12-May-2003	Alan Cox <alc@FreeBSD.org>	Optimize the use of splay in gbincore(). During a "make buildworld" the desired buffer is found at one of the roots more than 60% of the time. Thus, checking both roots before performing either splay eliminates unnecessary splays on the first tree splayed. Approved by: re (jhb)
# 1964fb9b	12-May-2003	Robert Watson <rwatson@FreeBSD.org>	Remove bogus locking from DDB's "show lockedvnods" command: using synchronization primitives from inside DDB is generally a bad idea, and in this case it frequently results in panics due to DDB commands being executed from the sio fast interrupt context on a serial console. Replace the locking with a note that a lack of locking means that DDB may see inconsistent views of the mount and vnode lists, which could also result in a panic. More frequently, though, this avoids a panic than causes it. Discussed with ages ago: bde Approved by: re (scottl)
# bff99f0d	03-May-2003	Alan Cox <alc@FreeBSD.org>	- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't trustworthy for vnode-backed objects. - Restore the old behavior of vm_object_page_remove() when the end of the given range is zero. Add a comment to vm_object_page_remove() regarding this behavior. Reported by: iedowse
# ebba1b25	30-Apr-2003	Alan Cox <alc@FreeBSD.org>	Lock accesses to the vm_object's ref_count and resident_page_count.
# ecde4b32	26-Apr-2003	Alan Cox <alc@FreeBSD.org>	Various changes to vm_object_page_remove(): - Eliminate an odd, special-case feature: if start == end == 0 then all pages are removed. Only one caller used this feature and that caller can trivially pass the object's size. - Assert that the vm_object is locked on entry; don't bother testing for a NULL vm_object. - Style: Fix lines that are longer than 80 characters.
# 1ca58953	26-Apr-2003	Alan Cox <alc@FreeBSD.org>	- Convert vm_object_pip_wait() from using tsleep() to msleep(). - Make vm_object_pip_sleep() static. - Lock the vm_object when performing vm_object_pip_wait().
# b6e48e03	23-Apr-2003	Alan Cox <alc@FreeBSD.org>	- Acquire the vm_object's lock when performing vm_object_page_clean(). - Add a parameter to vm_pageout_flush() that tells vm_pageout_flush() whether its caller has locked the vm_object. (This is a temporary measure to bootstrap vm_object locking.)
# 49281fbf	18-Apr-2003	Alan Cox <alc@FreeBSD.org>	Update locking around vm_object_page_remove() to use the new macros.
# e96c181d	12-Apr-2003	Alan Cox <alc@FreeBSD.org>	Use vm_object_pip_wait() rather than reimplementing it.
# 6b080461	26-Mar-2003	Tor Egge <tegge@FreeBSD.org>	Adjust the number of vnodes scanned by vlrureclaim() according to the size of the vnode list.
# 17ce5b94	22-Mar-2003	Yaroslav Tykhiy <ytykhiy@gmail.com>	We shouldn't assert that a vnode is locked in vop_lock_post() if VOP_LOCK() has failed. Reviewed by: jeff
# e99215a6	13-Mar-2003	Jeff Roberson <jeff@FreeBSD.org>	- Remove a dead check for bp->b_vp == vp in vtruncbuf(). This has not been possible for some time. - Lock the buf before accessing fields. This should very rarely be locked. - Assert that B_DELWRI is set after we acquire the buf. This should always be the case now.
# 09f11da5	13-Mar-2003	Jeff Roberson <jeff@FreeBSD.org>	- Remove a race between fsync like functions and flushbufqueues() by requiring locked bufs in vfs_bio_awrite(). Previously the buf could have been written out by fsync before we acquired the buf lock if it weren't for giant. The cluster_wbuild() handles this race properly but the single write at the end of vfs_bio_awrite() would not. - Modify flushbufqueues() so there is only one copy of the loop. Pass a parameter in that says whether or not we should sync bufs with deps. - Call flushbufqueues() a second time and then break if we couldn't find any bufs without deps.
# 09c80124	05-Mar-2003	Alan Cox <alc@FreeBSD.org>	Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress. Discussed on: arch@
# 99648386	03-Mar-2003	Nate Lawson <njl@FreeBSD.org>	Finish cleanup of vprint() which was begun with changing v_tag to a string. Remove extraneous uses of vop_null, instead deferring to the default op. Rename vnode type "vfs" to the more descriptive "syncer". Fix formatting for various filesystems that use vop_print.
# 491081fa	01-Mar-2003	Jeff Roberson <jeff@FreeBSD.org>	- Hold the vnode interlock across calls to bgetvp instead of acquiring it internally. This is required to stop multiple bufs from being associated with a single lblkno.
# bff5362b	28-Feb-2003	Jeff Roberson <jeff@FreeBSD.org>	- gc USE_BUFHASH. The smp locking of the buf cache renders this useless.
# 3a7053cb	24-Feb-2003	Kirk McKusick <mckusick@FreeBSD.org>	Prevent large files from monopolizing the system buffers. Keep track of the number of dirty buffers held by a vnode. When a bdwrite is done on a buffer, check the existing number of dirty buffers associated with its vnode. If the number rises above vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one of the other (hopefully older) dirty buffers associated with the vnode is written (using bawrite). In the event that this approach fails to curb the growth in the vnode's number of dirty buffers (due to soft updates rollback dependencies), the more drastic approach of doing a VOP_FSYNC on the vnode is used. This code primarily affects very large and actively written files such as snapshots. This change should eliminate hanging when taking snapshots or doing background fsck on very large filesystems. Hopefully, one day it will be possible to cache filesystem metadata in the VM cache as is done with file data. As it stands, only the buffer cache can be used which limits total metadata storage to about 20Mb no matter how much memory is available on the system. This rather small memory gets badly thrashed causing a lot of extra I/O. For example, taking a snapshot of a 1Tb filesystem minimally requires about 35,000 write operations, but because of the cache thrashing (we only have about 350 buffers at our disposal) ends up doing about 237,540 I/O's thus taking twenty-five minutes instead of four if it could run entirely in the cache. Reported by: Attila Nagy <bra@fsn.hu> Sponsored by: DARPA & NAI Labs.
# 17661e5a	24-Feb-2003	Jeff Roberson <jeff@FreeBSD.org>	- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK. - Remove the buftimelock mutex and acquire the buf's interlock to protect these fields instead. - Hold the vnode interlock while locking bufs on the clean/dirty queues. This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another BUF_LOCK with a LK_TIMEFAIL to a single lock. Reviewed by: arch, mckusick
# acb18acf	23-Feb-2003	Poul-Henning Kamp <phk@FreeBSD.org>	Bracket the kern.vnode sysctl in #ifdef notyet because it results in massive locking issues on diskless systems. It is also not clear that this sysctl is non-dangerous in its requirements for locked down memory on large RAM systems.
# a163d034	18-Feb-2003	Warner Losh <imp@FreeBSD.org>	Back out M_* changes, per decision of the TRB. Approved by: trb
# 44956c98	21-Jan-2003	Alfred Perlstein <alfred@FreeBSD.org>	Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0. Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
# 6a1b2a22	29-Dec-2002	Ian Dowse <iedowse@FreeBSD.org>	Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE call is in progress on the vnode. When vput() or vrele() sees a 1->0 reference count transition, it now returns without any further action if this flag is set. This flag is necessary to avoid recursion into VOP_INACTIVE if the filesystem inactive routine causes the reference count to increase and then drop back to zero. It is also used to guarantee that an unlocked vnode will not be recycled while blocked in VOP_INACTIVE(). There are at least two cases where the recursion can occur: one is that the softupdates code called by ufs_inactive() via ffs_truncate() can call vput() on the vnode. This has been reported by many people as "lockmgr: draining against myself" panics. The other case is that nfs_inactive() can call vget() and then vrele() on the vnode to clean up a sillyrename file. Reviewed by: mckusick (an older version of the patch)
# 371400cf	29-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Use a timeout of one second while we wait for the vnode washer, this prevents a potential race and makes the system a little bit less jerky under extreme loads.
# 851a87ea	29-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Vnodes pull in 800-900 bytes these days, all things counted, so we need to treat desiredvnodes much more like a limit than as a vague concept. On a 2GB RAM machine where desired vnodes is 130k, we run out of kmem_map space when we hit about 190k vnodes. If we wake up the vnode washer in getnewvnode(), sleep until it is done, so that it has a chance to offer us a washed vnode. If we don't sleep here we'll just race ahead and allocate yet another vnode which will never get freed. In the vnodewasher, instead of doing 10 vnodes per mountpoint per rotation, do 10% of the vnodes distributed evenly across the mountpoints.
# 9f162827	28-Dec-2002	Poul-Henning Kamp <phk@FreeBSD.org>	KASSERT that vop_revoke() gets a VCHR.
# 475e8011	14-Dec-2002	Alan Cox <alc@FreeBSD.org>	Perform vm_object_lock() and vm_object_unlock() around vm_object_page_remove().
# 2e29a1f2	07-Dec-2002	Alan Cox <alc@FreeBSD.org>	To avoid lock order reversals in getnewvnode(), the call to uma_zfree() must be delayed until the vnode interlock is released. Reported by: kris@ Approved by: re (jhb)
# f85a9619	27-Nov-2002	Robert Drehmel <robert@FreeBSD.org>	Do not set a variable (vp->p_pollinfo) to NULL if we know it already has that value. Approved by: re
# 763bbd2f	26-Oct-2002	Robert Watson <rwatson@FreeBSD.org>	Slightly change the semantics of vnode labels for MAC: rather than "refreshing" the label on the vnode before use, just get the label right from inception. For single-label file systems, set the label in the generic VFS getnewvnode() code; for multi-label file systems, leave the labeling up to the file system. With UFS1/2, this means reading the extended attribute during vfs_vget() as the inode is pulled off disk, rather than hitting the extended attributes frequently during operations later, improving performance. This also corrects semantics for shared vnode locks, which were not previously present in the system. This changes the cache coherency properties WRT out-of-band access to label data, but in an acceptable form. With UFS1, there is a small race condition during automatic extended attribute start -- this is not present with UFS2, and occurs because EAs aren't available at vnode inception. We'll introduce a work around for this shortly. Approved by: re Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 0d6dc414	25-Oct-2002	Poul-Henning Kamp <phk@FreeBSD.org>	In vrele() we can actually have a VCHR with v_rdev == NULL if we came from the bottom of addaliasu(). Don't panic.
# 9ab73fd1	24-Oct-2002	Kirk McKusick <mckusick@FreeBSD.org>	Within ufs, the ffs_sync and ffs_fsync functions did not always check for and/or report I/O errors. The result is that a VFS_SYNC or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in the presence of a hard error writing a disk sector or in a filesystem full condition. This patch ensures that I/O errors will always be checked and returned. This patch also ensures that every call to VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes appropriate action when an error is returned. Sponsored by: DARPA & NAI Labs.
# a2fb4fed	24-Oct-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Fix the spechash lock order reversal by keeping an updated sum of v_usecount in the dev_t which vcount() can return without locking any vnodes. Seen by: jhb
# a6b9f47b	14-Oct-2002	Kirk McKusick <mckusick@FreeBSD.org>	When scanning the freelist looking for candidate vnodes to recycle, be sure to exit the loop with vp == NULL if no candidates are found. Formerly, this bug would cause the last vnode inspected to be used, even if it was not available. The result was a panic "vn_finished_write: neg cnt". Sponsored by: DARPA & NAI Labs.
# e04a0200	14-Oct-2002	Kirk McKusick <mckusick@FreeBSD.org>	Unconditionally reset vp->v_vnlock back to the default in the vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather than requiring filesystems that use alternate locks to do so in their vop_reclaim functions. This change is a further cleanup of the vop_stdlock interface. Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk> Sponsored by: DARPA & NAI Labs.
# a5b65058	13-Oct-2002	Kirk McKusick <mckusick@FreeBSD.org>	Regularize the vop_stdlock'ing protocol across all the filesystems that use it. Specifically, vop_stdlock uses the lock pointed to by vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to reference vp->v_lock. Filesystems that wish to use the default do not need to allocate a lock at the front of their node structure (as some still did) or do a lockinit. They can simply start using vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks, but still use the vop_stdlock functions (such as nullfs) can simply replace vp->v_vnlock with a pointer to the lock that they wish to have used for the vnode. Such filesystems are responsible for setting the vp->v_vnlock back to the default in their vop_reclaim routine (e.g., vp->v_vnlock = &vp->v_lock). In theory, this set of changes cleans up the existing filesystem lock interface and should have no function change to the existing locking scheme. Sponsored by: DARPA & NAI Labs.
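The v_vnlock protocol described in a5b65058 might be sketched as follows. Only the field names `v_lock` and `v_vnlock` come from the commit message; the `ex_` types and functions are hypothetical simplifications, and the "lock" is reduced to a flag so the pointer indirection is easy to follow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much simplified lock: just a held flag. */
struct ex_lock {
	int held;
};

struct ex_vnode {
	struct ex_lock	 v_lock;	/* default lock embedded in the vnode */
	struct ex_lock	*v_vnlock;	/* lock the std lock routines operate on */
};

/* Stands in for getnewvnode(): point v_vnlock at the embedded lock. */
static void
ex_vnode_init(struct ex_vnode *vp)
{
	vp->v_lock.held = 0;
	vp->v_vnlock = &vp->v_lock;
}

/* Stand-ins for the std lock/unlock routines: always go through v_vnlock. */
static void
ex_stdlock(struct ex_vnode *vp)
{
	assert(vp->v_vnlock != NULL);
	vp->v_vnlock->held = 1;
}

static void
ex_stdunlock(struct ex_vnode *vp)
{
	assert(vp->v_vnlock != NULL);
	vp->v_vnlock->held = 0;
}

/* A filesystem managing its own lock substitutes it here. */
static void
ex_use_lock(struct ex_vnode *vp, struct ex_lock *lk)
{
	vp->v_vnlock = lk;
}

/* On reclaim, restore the default. */
static void
ex_reclaim(struct ex_vnode *vp)
{
	vp->v_vnlock = &vp->v_lock;
}
```

In this sketch a layered filesystem would call `ex_use_lock(upper, &lower->v_lock)` so that locking the upper vnode actually takes the lower vnode's lock, and `ex_reclaim()` models the reset to `&vp->v_lock` that the commit message requires at reclaim time.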
# 192e439e	10-Oct-2002	Kirk McKusick <mckusick@FreeBSD.org>	When considering a vnode for reuse in getnewvnode, we call vcanrecycle to check a free vnode's availability. If it is available, vcanrecycle returns an error code of zero and the vnode in question locked. The getnewvnode routine then used to call vn_start_write with the V_NOWAIT flag. If the filesystem was suspended while taking a snapshot, the vn_start_write would fail but getnewvnode would fail to unlock the vnode, instead leaving it locked on the freelist. The result would be that the vnode would be locked forever and would eventually hang the system with a race to the root when it was attempted to recycle it. This fix moves the vn_start_write check into vcanrecycle where it will properly unlock the vnode if it is unavailable for recycling due to filesystem suspension. Sponsored by: DARPA & NAI Labs.
# 790a8088	04-Oct-2002	Maxim Sobolev <sobomax@FreeBSD.org>	Fix problem introduced in rev.1.406, which can cause already unlocked mutex being unlocked again causing system panic.
# 8d3574c7	01-Oct-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Fix some harmless mis-indents. Spotted by: FlexeLint
# 0626774f	30-Sep-2002	Robert Watson <rwatson@FreeBSD.org>	Move vnode MAC label initialization to after the release of the vnode interlock in getnewvnode() to avoid possible sleeps while holding the mutex. Note that the warning from Witness is a slight false positive since we know there will be no contention on the interlock since we haven't made the vnode available for use yet, but the theory is not a bad one. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories
# 37c84183	28-Sep-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512
# 6423c943	25-Sep-2002	Jeff Roberson <jeff@FreeBSD.org>	- Move ASSERT_VOP_LOCK functionality into functions in vfs_subr.c - Make the VI asserts more orthogonal to the rest of the asserts by using a new, common vfs_badlock() function and adding a 'str' arg. - Adjust generated ASSERTS to match the new prototype. - Adjust explicit ASSERTS to match the new prototype.
# 6cb8bf20	24-Sep-2002	Jeff Roberson <jeff@FreeBSD.org>	- Lock down the syncer with sync_mtx. - Enable vfs_badlock_mutex by default. - Assert that the vp is locked in VOP_UNLOCK. - Use standard interlock macros in remaining code. - Correct a race in getnewvnode(). - Lock access to v_numoutput with interlock. - Lock access to buf lists and splay tree with interlock. - Add VOP and VI asserts. - Lock b_vnbufs with the vnode interlock. - Add vrefcnt() for callers who want to retrieve the vnode ref without holding a lock. Add a comment that describes when this is safe. - Add vholdl() and vdropl() so that callers who already own the interlock can avoid race conditions and unnecessary unlocking. - Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case. - Hold the interlock before dropping the mntlist_mtx in vflush() to avoid a race. - Fix locking in vfs_msync().
# 86ed6d45	18-Sep-2002	Nate Lawson <njl@FreeBSD.org>	Remove any VOP_PRINT that redundantly prints the tag. Move lockmgr_printinfo() into vprint() for everyone's benefit. Suggested by: bde
# 06be2aaa	14-Sep-2002	Nate Lawson <njl@FreeBSD.org>	Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)
# 85e40eaf	11-Sep-2002	Julian Elischer <julian@FreeBSD.org>	Indentation does not make a block.. need curly braces too. Submitted by: Eagle-eyes evans <bde@freebsd.org>
# 71fad9fd	11-Sep-2002	Julian Elischer <julian@FreeBSD.org>	Completely redo thread states. Reviewed by: davidxu@freebsd.org
# f8b66361	05-Sep-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Fix an inherited style bug: compare with NOCRED instead of NULL. Sponsored by: DARPA & NAI Labs.
# c1a925a6	05-Sep-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce new extattr_check_cred() function which implements the canonical credential washing for extended attributes. Sponsored by: DARPA & NAI Labs.
# 93b0017f	25-Aug-2002	Philippe Charnier <charnier@FreeBSD.org>	Replace various spelling with FALLTHROUGH which is lint()able
# ad32f726	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Fix a mistake in my last few commits. The PDROP flag stops msleep from re-acquiring the mutex. Pointy hat to: me Noticed by: tegge
# 9abf54f0	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT LK_INTERLOCK. The interlock will never be held on return from these functions even when there is an error. Errors typically only occur when the XLOCK is held which means this isn't the vnode we want anyway. Almost all users of these interfaces expected this behavior even though it was not provided before.
# 18315848	22-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Fix interlock handling in vn_lock(). Previously, vn_lock() could return with interlock held in error conditions when the caller did not specify LK_INTERLOCK. - Add several comments to vn_lock() describing the rationale behind the code flow since it was not immediately obvious.
# 0b600db4	21-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Document two cases, one in vget and the other in vn_lock, where the state of interlock on exit is not consistent. There are probably several bugs relating to this.
# 88cf6b94	21-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- If vn_lock fails with the LK_INTERLOCK flag set, interlock will not be released. vcanrecycle() failed to unlock interlock under this condition. - Remove an extra VOP_UNLOCK from a failure case in vcanrecycle(). Pointed out by: rwatson
# 71ea4ba5	21-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED - Use the new VI asserts in place of the old mtx_assert checks. - Add the VI asserts to the automated lock checking in the VOP calls. The interlock should not be held across vops with a few exceptions. - Add the vop_(un)lock_{pre,post} functions to assert that interlock is held when LK_INTERLOCK is set.
# 055c0123	12-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Extend the vnode_free_list_mtx to cover numvnodes and freevnodes. This was done only some of the time before, and now it is uniformly applied.
# 5965373e	10-Aug-2002	Maxime Henrion <mux@FreeBSD.org>	- Introduce a new struct xvfsconf, the userland version of struct vfsconf. - Make getvfsbyname() take a struct xvfsconf *. - Convert several consumers of getvfsbyname() to use struct xvfsconf. - Correct the getvfsbyname.3 manpage. - Create a new vfs.conflist sysctl to dump all the struct xvfsconf in the kernel, and rewrite getvfsbyname() to use this instead of the weird existing API. - Convert some {set,get,end}vfsent() consumers to use the new vfs.conflist sysctl. - Convert a vfsload() call in nfsiod.c to kldload() and remove the useless vfsisloadable() and endvfsent() calls. - Add a warning printf() in vfs_sysctl() to tell people they are using an old userland. After these changes, it's possible to modify struct vfsconf without breaking the binary compatibility. Please note that these changes don't break this compatibility either. When bp will have updated mount_smbfs(8) with the patch I sent him, there will be no more consumers of the {set,get,end}vfsent(), vfsisloadable() and vfsload() API, and I will promptly delete it.
# 8947be9b	05-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Move some logic from getnewvnode() to a new function vcanrecycle() - Unlock the free list mutex around vcanrecycle to prevent a lock order reversal.
# e6e370a7	04-Aug-2002	Jeff Roberson <jeff@FreeBSD.org>	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS
# f9d0d524	01-Aug-2002	Robert Watson <rwatson@FreeBSD.org>	Include file cleanup; mac.h and malloc.h at one point had ordering relationship requirements, and no longer do. Reminded by: bde
# 30721972	30-Jul-2002	Dag-Erling Smørgrav <des@FreeBSD.org>	Nit in previous commit: the correct sysctl type is "S,xvnode"
# 217b2a0b	30-Jul-2002	Dag-Erling Smørgrav <des@FreeBSD.org>	Initialize v_cachedid to -1 in getnewvnode(). Reintroduce the kern.vnode sysctl and make it export xvnodes rather than vnodes. Sponsored by: DARPA, NAI Labs
# 07bdba7e	30-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Note that the privilege indicating flag to vaccess() originally used by the process accounting system is now deprecated.
# a0ee6ed1	30-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Introduce support for Mandatory Access Control and extensible kernel access control. Invoke the necessary MAC entry points to maintain labels on vnodes. In particular, initialize the label when the vnode is allocated or reused, and destroy the label when the vnode is going to be released, or reused. Wow, an object where there really is exactly one place where it's allocated, and one other where it's freed. Amazing. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# a562685f	29-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	- Backout the patch made in revision 1.75 of vfs_mount.c. The vputs here were hiding the real problem of the missing unlock in sync_inactive. - Add the missing unlock in sync_inactive. Submitted by: iedowse
# 5c38b6db	28-Jul-2002	Don Lewis <truckman@FreeBSD.org>	Wire the sysctl output buffer before grabbing any locks to prevent SYSCTL_OUT() from blocking while locks are held. This should only be done when it would be inconvenient to make a temporary copy of the data and defer calling SYSCTL_OUT() until after the locks are released.
# b02aac46	21-Jul-2002	Robert Watson <rwatson@FreeBSD.org>	Teach discretionary access control methods for files about VAPPEND and VALLPERM. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs
# 7aca6291	19-Jul-2002	Kirk McKusick <mckusick@FreeBSD.org>	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word are reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags |= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, let's call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended attribute storage. Sponsored by: DARPA & NAI Labs.
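The flag layout described in 7aca6291 can be sketched as below. All `EX_` constants and functions are hypothetical and their numeric values are chosen for illustration only; they are not the kernel's definitions. The sketch shows the two points the message makes: the IO_ and BA_ namespaces occupy disjoint halves of one word, and a request naming neither area defaults to the normal data area.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration; not the kernel's definitions. */
#define EX_IO_MASK	0x0000ffffu	/* low sixteen bits: IO_ flags */
#define EX_BA_MASK	0xffff0000u	/* high sixteen bits: BA_ flags */
#define EX_IO_EXT	0x00000400u	/* I/O to the extended attribute area */
#define EX_IO_NORMAL	0x00000800u	/* I/O to the normal data area */
#define EX_BA_CLRBUF	0x00010000u	/* some buffer-allocation flag */

/* Backward compatibility: neither area named means the normal area. */
static uint32_t
ex_default_area(uint32_t ioflags)
{
	if ((ioflags & (EX_IO_NORMAL | EX_IO_EXT)) == 0)
		ioflags |= EX_IO_NORMAL;
	return (ioflags);
}

/* For read/write the two areas are mutually exclusive. */
static int
ex_rw_flags_ok(uint32_t ioflags)
{
	uint32_t area;

	area = ex_default_area(ioflags) & (EX_IO_NORMAL | EX_IO_EXT);
	return (area != (EX_IO_NORMAL | EX_IO_EXT));
}

/* Because the halves are disjoint, both sets fit in one flags word. */
static uint32_t
ex_merge(uint32_t ioflags, uint32_t baflags)
{
	assert((ioflags & EX_BA_MASK) == 0);
	assert((baflags & EX_IO_MASK) == 0);
	return (ioflags | baflags);
}
```

A truncate-style caller in this sketch could pass `EX_IO_NORMAL | EX_IO_EXT` together, while `ex_rw_flags_ok()` would reject that combination for a read or write.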
# fb36a3d8	16-Jul-2002	Kirk McKusick <mckusick@FreeBSD.org>	Change utimes to set the file creation time (for filesystems that support creation times such as UFS2) to the value of the modification time if the value of the modification time is older than the current creation time. See utimes(2) for further details. Sponsored by: DARPA & NAI Labs.
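The rule stated in fb36a3d8 amounts to a single comparison. The sketch below is an assumption-laden illustration: `ex_times` and `ex_set_mtime` are hypothetical names, and timestamps are reduced to `time_t` seconds rather than the structures a real filesystem would use.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical per-file timestamps, in seconds. */
struct ex_times {
	time_t mtime;		/* modification time */
	time_t birthtime;	/* creation time */
};

/*
 * Set the modification time; if it is older than the recorded creation
 * time, pull the creation time back to match.
 */
static void
ex_set_mtime(struct ex_times *t, time_t new_mtime)
{
	assert(t != NULL);
	t->mtime = new_mtime;
	if (new_mtime < t->birthtime)
		t->birthtime = new_mtime;
}
```

Setting a newer modification time leaves the creation time alone; only an older value moves it.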
# d331c5d4	10-Jul-2002	Matthew Dillon <dillon@FreeBSD.org>	Replace the global buffer hash table with per-vnode splay trees using a methodology similar to the vm_map_entry splay and the VM splay that Alan Cox is working on. Extensive testing has appeared to have shown no increase in overhead. Disadvantages: Dirties more cache lines during lookups. Not as fast as a hash table lookup (but still N log N and optimal when there is locality of reference). Advantages: vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem syncer operate more efficiently. I get to rip out all the old hacks (some of which were mine) that tried to keep the v_dirtyblkhd tailq sorted. The per-vnode splay tree should be easier to lock / SMPng pushdown on vnodes will be easier. This commit along with another that Alan is working on for the VM page global hash table will allow me to implement ranged fsync(), optimize server-side nfs commit rpcs, and implement partial syncs by the filesystem syncer (aka filesystem syncer would detect that someone is trying to get the vnode lock, remembers its place, and skip to the next vnode). Note that the buffer cache splay is somewhat more complex than other splays due to special handling of background bitmap writes (multiple buffers with the same lblkno in the same vnode), and B_INVAL discontinuities between the old hash table and the existence of the buffer on the v_cleanblkhd list. Suggested by: alc
# 25b286d6	09-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	- Use standard locking functions in syncer's opv - vput instead of vrele syncer vnodes in vfs_mount - Add vop_lookup_{pre,post} to verify locking in VOP_LOOKUP
# 18c48f43	07-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	- Don't hold the vn lock while calling VOP_CLOSE in vclean().
# bed75d46	06-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	- BUF_REFCNT() seems to be the preferred method for verifying a locked buf. Tell vop_strategy_pre() to use this instead. - Ignore B_CLUSTER bufs. Their components are locked but they don't really exist so they don't have to be. This isn't ideal but it is safe.
# c031d11b	06-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	Fix a mistake in my last commit. Don't grab an extra reference to the object in bp->b_object.
# 9a236af3	06-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	Fixup uses of GETVOBJECT. - Cache a pointer to the vnode's object in the buf. - Hold a reference to that object in addition to the vnode's reference just to be consistent. - Cleanup code that got the object indirectly through the vp and VOP calls. This fixes at least one case where we were calling GETVOBJECT without a lock. It also avoids an expensive layered call at the cost of another pointer in struct buf.
# 302c7aaa	05-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	- Add vop_strategy_pre to validate VOP_STRATEGY locking. - Disable original vop_strategy lock specification. - Switch to the new vop_strategy_pre for lock validation. VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need to be translated. There may be other reasons, but as long as the underlying layer uses a VOP to perform the operations they will be caught later.
# cc8662b0	05-Jul-2002	Jeff Roberson <jeff@FreeBSD.org>	Add "vop_rename_pre" to do pre rename lock verification. This is enabled only with DEBUG_VFS_LOCKS.
# d7f9ecc8	03-Jul-2002	Maxime Henrion <mux@FreeBSD.org>	Move vfs_rootmountalloc() in vfs_mount.c and remove lite2_vfs_mountroot() which was #if 0'd and is not likely to be used now.
# 2b4edb69	02-Jul-2002	Maxime Henrion <mux@FreeBSD.org>	Move every code related to mount(2) in a new file, vfs_mount.c. The file vfs_conf.c which was dealing with root mounting has been repo-copied into vfs_mount.c to preserve history. This makes nmount related development easier, and help reducing the size of vfs_syscalls.c, which is still an enormous file. Reviewed by: rwatson Repo-copy by: peter
# 6bd521df	01-Jul-2002	Ian Dowse <iedowse@FreeBSD.org>	Use indirect function pointer hooks instead of #ifdef SOFTUPDATES direct calls for the two places where the kernel calls into soft updates code. Set up the hooks in softdep_initialize() and NULL them out in softdep_uninitialize(). This change allows soft updates to function correctly when ufs is loaded as a module. Reviewed by: mckusick
# 87e1503e	28-Jun-2002	David E. O'Brien <obrien@FreeBSD.org>	Rename the db command lockedvnodes to lockedvnods so that it fits on the help screen and one doesn't think we have a lockedvnodesmap command.
# 210a5a71	28-Jun-2002	Alfred Perlstein <alfred@FreeBSD.org>	nuke caddr_t.
# 90769c9e	28-Jun-2002	Jeff Roberson <jeff@FreeBSD.org>	Improve the VOP locking asserts - Add vfs_badlock_print to control whether or not we print lock violations - Add vfs_badlock_panic to control whether we panic on lock violations Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.
# aac12bcf	28-Jun-2002	Brian Feldman <green@FreeBSD.org>	Fix a case where a vnode got explicitly unlocked after the pointer to it got set to NULL. Revision 1.355: in the box
# 7d2d4409	20-Jun-2002	Maxime Henrion <mux@FreeBSD.org>	Change the way we internally store the mount options to a linked list. This is to allow the merging of the mount options in the MNT_UPDATE case, as the current data structure is unsuitable for this. There are no functional differences in this commit. Reviewed by: phk
# fe937506	14-Jun-2002	Maxime Henrion <mux@FreeBSD.org>	Change vfs_copyopt() so that the length argument passed to it must be the exact same size as the mount option. This makes vfs_copyopt() much more useful.
# edad3af2	06-Jun-2002	Dag-Erling Smørgrav <des@FreeBSD.org>	Move some sysctls from the debug tree to the vfs tree.
# 4a357a32	06-Jun-2002	Dag-Erling Smørgrav <des@FreeBSD.org>	Gratuitous whitespace cleanup.
# d394511d	16-May-2002	Tom Rhodes <trhodes@FreeBSD.org>	More s/file system/filesystem/g
# 34e53231	16-May-2002	Maxime Henrion <mux@FreeBSD.org>	o Fix vfs_copyopt(), the first argument to bcopy() is the source, not the destination. o Remove some code from vfs_getopt() which was making the interface more complicated to use for a very slight gain.
# f0d73b3e	06-May-2002	Jeff Roberson <jeff@FreeBSD.org>	Switch from just holding the interlock to holding the standard lock throughout getnewvnode(). This is safer. In the future, we should investigate requiring only the interlock to get the vnode object.
# 6953f5da	05-May-2002	Jeff Roberson <jeff@FreeBSD.org>	Hold the currently selected vnode's lock across the call to VOP_GETVOBJECT. Don't try to create a vm object before the file system has a chance to finish initializing it. This is incorrect for a number of reasons. Firstly, that VOP requires a lock which the file system may not have initialized yet. Also, open and others will create a vm object if it is necessary later.
# 81e01743	05-May-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Expand the one-line function pbreassignbuf() the only place it is or could be used.
# 9f943554	04-May-2002	Matthew Dillon <dillon@FreeBSD.org>	Remove obsolete code (that was already #if 0'd out). Requested by: Hiten Pandya <hitmaster2k@yahoo.com>
# 6008862b	04-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64
# 44731cab	01-Apr-2002	John Baldwin <jhb@FreeBSD.org>	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@
# 17594b93	26-Mar-2002	Maxime Henrion <mux@FreeBSD.org>	As discussed in -arch, add the new nmount(2) system call and the new vfs_getopt()/vfs_copyopt() API. This is intended to be used later, when there will be filesystems implementing the VFS_NMOUNT operation. The mount(2) system call will disappear when all filesystems will be converted to the new API. Documentation will be committed in a while. Reviewed by: phk
# c897b813	19-Mar-2002	Jeff Roberson <jeff@FreeBSD.org>	Remove references to vm_zone.h and switch over to the new uma API. Also, remove maxsockets. If you look carefully you'll notice that the old zone allocator never honored this anyway.
# 4d77a549	19-Mar-2002	Alfred Perlstein <alfred@FreeBSD.org>	Remove __P.
# 89e1164e	05-Mar-2002	Robert Watson <rwatson@FreeBSD.org>	Three p_ucred -> td_ucred's missed in jhb's earlier pass; all appear to be safe.
# a854ed98	27-Feb-2002	John Baldwin <jhb@FreeBSD.org>	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.
# 68edc1b9	18-Feb-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Make v_addpollinfo() visible and non-inline. Have callers only call it as needed. Add necessary call in ufs_kqfilter(). Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>
# 90737495	18-Feb-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Remove yet a redundant VN_KNOTE() macro.
# 4b55dbe3	17-Feb-2002	Poul-Henning Kamp <phk@FreeBSD.org>	Move the stuff related to select and poll out of struct vnode. The use of the zone allocator may or may not be overkill. There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need to revisit. This shaves about 60 bytes of struct vnode which on my laptop means 600k less RAM used for vnodes.
# 2b8a08af	07-Feb-2002	Peter Wemm <peter@FreeBSD.org>	Fix a couple of style bugs introduced (or touched by) previous commit.
# 079b7bad	07-Feb-2002	Julian Elischer <julian@FreeBSD.org>	Pre-KSE/M3 commit. this is a low-functionality change that changes the kernel to access the main thread of a process via the linked list of threads rather than assuming that it is embedded in the process. It IS still embedded there but remove all the code that assumes that in preparation for the next commit which will actually move it out. Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
# 64011154	01-Feb-2002	Kirk McKusick <mckusick@FreeBSD.org>	In the routines vrele() and vput(), we must lock the vnode and call VOP_INACTIVE before placing the vnode back on the free list. Otherwise there is a race condition on SMP machines between getnewvnode() locking the vnode to reclaim it and vrele() locking the vnode to inactivate it. This window of vulnerability becomes exaggerated in the presence of filesystems that have been suspended as the inactive routine may need to temporarily release the lock on the vnode to avoid deadlock with the syncer process.
# c73df808	18-Jan-2002	Matthew Dillon <dillon@FreeBSD.org>	Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal operation. The vgonel() code has always called vclean() but until we started proactively freeing vnodes it would never actually be called with a dirty vnode, so this situation did not occur prior to the vnlru() code. Now that we proactively free vnodes when kern.maxvnodes is hit, however, vclean() winds up with work to do and improperly generates the warnings. Reviewed by: peter Approved by: re (for MFC) MFC after: 1 day
# cd600596	15-Jan-2002	Kirk McKusick <mckusick@FreeBSD.org>	When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded. This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read. Submitted by: "Sam Leffler" <sam@errno.com>
# e61ab5fc	10-Jan-2002	Matthew Dillon <dillon@FreeBSD.org>	Add vlruvp() routine - implements LRU operation for vnode recycling. We calculate a trigger point that both guarantees we will find a sufficient number of vnodes to recycle and prevents us from recycling vnodes with lots of resident pages. This particular section of code is designed to recycle vnodes, not do unnecessary frees of cached VM pages.
# 9dd4281d	24-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	Fix typo in previous commit (tsleep was using wrong rendezvous point)
# 23b59018	20-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout. Hopefully MFC: before the 4.5 release
# 9f2f52d6	18-Dec-2001	Peter Wemm <peter@FreeBSD.org>	Do not initialize static/global variables to 0. Use bss instead of taking up space in the data section.
# 8f0d41d3	18-Dec-2001	Peter Wemm <peter@FreeBSD.org>	Use a different mechanism to get the vnlru process to wake up and notice the shutdown request at reboot/halt time. Disable the printf 'vnlru process getting nowhere, pausing...' and instead export the count to the debug.vnlru_nowhere sysctl.
# fdb33f08	18-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	This is a forward port of Peter's vlrureclaim() fix, with some minor mods by me to make it more efficient. The original code had serious balancing problems and could also deadlock easily. This code relegates the vnode reclamation to its own kproc and relaxes the vnode reclamation requirements to better maintain kern.maxvnodes. This code still doesn't balance as well as it could, but it does a much better job than the original code. Approved by: re@freebsd.org Obtained from: ps, peter, dillon MFS Assuming: Assuming no problems crop up in Yahoo testing MFC after: 7 days
# 873a4904	14-Dec-2001	Matthew Dillon <dillon@FreeBSD.org>	A slightly different version of the vlrureclaim fix. Reported by: peter, ps
# 9446b36b	13-Dec-2001	Peter Wemm <peter@FreeBSD.org>	If we were called to allocate a vnode that is not associated with a mount point, do not dereference the NULL mp argument.
# 6b8bd2ef	04-Nov-2001	Matthew Dillon <dillon@FreeBSD.org>	Add mnt_reservedvnlist so we can MFC to 4.x, in order to make all mount structure changes now rather than piecemeal later on. mnt_nvnodelist currently holds all the vnodes under the mount point. This will eventually be split into a 'dirty' and 'clean' list. This way we only break kld's once rather than twice. nvnodelist will eventually turn into the dirty list and should remain compatible with the klds.
# bcc0dc3d	02-Nov-2001	Robert Watson <rwatson@FreeBSD.org>	Merge from POSIX.1e Capabilities development tree: o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE. Add appropriate conditionals to vaccess() to take that into account. o Synchronization cap_check_xxx() -> cap_check() change. Obtained from: TrustedBSD Project
# 4ffa210b	27-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's, and can also be made static.
# 245df27c	25-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a real effect. Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%. Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%. (more optimization work is needed on top of these fixes) MFC after: 1 week
# f92dcd3e	25-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	Add missing TAILQ_INSERT_TAIL's which somehow didn't get committed with the recent vnode cleanup.
# c72ccd01	22-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	Change the vnode list under the mount point from a LIST to a TAILQ in preparation for an implementation of limiting code for kern.maxvnodes. MFC after: 3 days
# 2210e5d9	16-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	fix minor bug in kern.minvnodes sysctl. Use OID_AUTO.
# 917efbaa	08-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	WS Cleanup
# 845bd795	05-Oct-2001	Matthew Dillon <dillon@FreeBSD.org>	vinvalbuf() was only waiting for write-I/O to complete. It really has to wait for both read AND write I/O to complete. Only NFS calls vinvalbuf() on an active vnode (when the server indicates that the file is stale), so this bug fix only affects NFS clients. MFC after: 3 days
# b5810bab	30-Sep-2001	Matthew Dillon <dillon@FreeBSD.org>	After extensive testing it has been determined that adding complexity to avoid removing higher level directory vnodes from the namecache has no perceivable effect and will be removed. This is especially true when vmiodirenable is turned on, which it is by default now. ( vmiodirenable makes a huge difference in directory caching ). The vfs.vmiodirenable and vfs.nameileafonly sysctls have been left in to allow further testing, but I expect to rip out vfs.nameileafonly soon too. I have also determined through testing that the real problem with numvnodes getting too large is due to the VM Page cache preventing the vnode from being reclaimed. The directory stuff made only a tiny dent relative to Poul's original code, enough so that some tests succeeded. But tests with several million small files show that the bigger problem is the VM Page cache. This will have to be addressed by a future commit. MFC after: 3 days
# b40ce416	12-Sep-2001	Julian Elischer <julian@FreeBSD.org>	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha
# 0f728902	27-Aug-2001	Peter Wemm <peter@FreeBSD.org>	If a file has been completely unlinked, stop automatically syncing the file. ffs will discard any pending dirty pages when it is closed, so we may as well not waste time trying to clean them. This doesn't stop other things from writing it out, eg: pageout, fsync(2) etc.
# b219758f	27-Jul-2001	Peter Wemm <peter@FreeBSD.org>	Revert previous accidental commit. FWIW, it was part of enabling VM caching of disks through mmap() and stopping syncing of open files that had their last reference in the fs removed (ie: their unsync'ed pages get discarded on close already, so I made it stop syncing too).
# 24a590a0	27-Jul-2001	Peter Wemm <peter@FreeBSD.org>	Fix cut/paste blunder. Serves me right for doing a last minute tweak to what I had for some time. Submitted by: bde
# 0cddd8f0	04-Jul-2001	Matthew Dillon <dillon@FreeBSD.org>	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.
# cd2f7215	27-Jun-2001	John Baldwin <jhb@FreeBSD.org>	- Fix a mntvnode and vnode interlock reversal. - Protect the mnt_vnode list with the mntvnode lock.
# 23955314	18-May-2001	Alfred Perlstein <alfred@FreeBSD.org>	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb
# 0864ef1e	16-May-2001	Ian Dowse <iedowse@FreeBSD.org>	Change the second argument of vflush() to an integer that specifies the number of references on the filesystem root vnode to be both expected and released. Many filesystems hold an extra reference on the filesystem root vnode, which must be accounted for when determining if the filesystem is busy and then released if it isn't busy. The old `skipvp' approach required individual filesystem xxx_unmount functions to re-implement much of vflush()'s logic to deal with the root vnode. All 9 filesystems that hold an extra reference on the root vnode got the logic wrong in the case of forced unmounts, so `umount -f' would always fail if there were any extra root vnode references. Fix this issue centrally in vflush(), now that we can. This commit also fixes a vnode reference leak in devfs, which could result in idle devfs filesystems that refuse to unmount. Reviewed by: phk, bp
# 1feb7a6e	11-May-2001	Ian Dowse <iedowse@FreeBSD.org>	In vrele() and vput(), avoid triggering the confusing "missed vn_close" KASSERT when vp->v_usecount is zero or negative. In this case, the "v*: negative ref cnt" panic that follows is much more appropriate. Reviewed by: mckusick
# 8ee8b21b	26-Apr-2001	Poul-Henning Kamp <phk@FreeBSD.org>	vfs_subr.c is getting rather fat. The underlying repocopy and this commit moves the filesystem export handling code to vfs_export.c
# a13234bb	25-Apr-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Move the netexport structure from the fs-specific mountstructure to struct mount. This makes the "struct netexport *" parameter to the vfs_export and vfs_checkexport interface unneeded. Consequently all non-stacking filesystems can use vfs_stdcheckexp(). At the same time, make it a pointer to a struct netexport in struct mount, so that we can remove the bogus AF_MAX and #include <net/radix.h> from <sys/mount.h>
# d98dc34f	23-Apr-2001	Greg Lehey <grog@FreeBSD.org>	Correct #includes to work with fixed sys/mount.h.
# 759cb263	18-Apr-2001	Seigo Tanimura <tanimura@FreeBSD.org>	Reclaim directory vnodes held in namecache if few free vnodes are available. Only directory vnodes holding no child directory vnodes held in v_cache_src are recycled, so that directory vnodes near the root of the filesystem hierarchy remain in namecache and directory vnodes are not reclaimed in cascade. The period of vnode reclaiming attempt and the number of vnodes attempted to reclaim can be tuned via sysctl(2). Suggested by: tegge Approved by: phk
# f84e29a0	17-Apr-2001	Poul-Henning Kamp <phk@FreeBSD.org>	This patch removes the VOP_BWRITE() vector. VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing. This patch takes a more general approach and adds a bp->b_op vector where more methods can be added. The success of this patch depends on bp->b_op being initialized all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.
# 7df2842d	23-Feb-2001	Jonathan Lemon <jlemon@FreeBSD.org>	Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean(). Use this to tell a filter attached to a vnode that the underlying vnode is no longer valid, by returning EV_EOF. PR: kern/25309, kern/25206
# c0511d3b	18-Feb-2001	Brian Feldman <green@FreeBSD.org>	Switch to using a struct xucred instead of a struct ucred when not actually in the kernel. This structure is a different size than what is currently in -CURRENT, but should hopefully be the last time any application breakage is caused there. As soon as any major inconveniences are removed, the definition of the in-kernel struct ucred should be conditionalized upon defined(_KERNEL). This also changes struct export_args to remove dependency on the constantly-changing struct ucred, as well as limiting the bounds of the size fields to the correct size. This means: a) mountd and friends won't break all the time, b) mountd and friends won't crash the kernel all the time if they don't know what they're doing wrt actual struct export_args layout. Reviewed by: bde
# 9ed346ba	08-Feb-2001	Bosko Milekic <bmilekic@FreeBSD.org>	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. 
Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)
# fc2ffbe6	04-Feb-2001	Poul-Henning Kamp <phk@FreeBSD.org>	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)
# f3f1af39	30-Jan-2001	Boris Popov <bp@FreeBSD.org>	Properly lock new vnode. Reminded by: tegge
# 1b367556	23-Jan-2001	Jason Evans <jasone@FreeBSD.org>	Convert all simplelocks to mutexes and remove the simplelock implementations.
# 02b65ffb	22-Jan-2001	Robert Watson <rwatson@FreeBSD.org>	o The move to using VADMIN under vaccess() resulted in some system calls returning EACCES instead of EPERM. This patch modifies vaccess() to return EPERM instead of EACCES if VADMIN is among the requested rights. This affects functions normally limited to the owners of a file, such as chmod(), as EPERM is the error indicating that privilege would allow the operation, rather than a change in mandatory or discretionary rights. Reported by: bde
# ffc831da	15-Dec-2000	John Baldwin <jhb@FreeBSD.org>	Stick the kthread API in a kthread_* namespace, and the specialized kproc functions in a kproc_* namespace. Reviewed by: -arch
# 0bf3b91d	12-Dec-2000	Kirk McKusick <mckusick@FreeBSD.org>	Use proper mutex locking when calling setrunnable from speedup_syncer(). Submitted by: Tor.Egge@fast.no
# 7cc0979f	08-Dec-2000	David Malone <dwmalone@FreeBSD.org>	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>
# 138e514c	06-Dec-2000	Peter Wemm <peter@FreeBSD.org>	Untangle vfsinit() a bit. Use separate sysinit functions rather than having a super-function calling bits all over the place.
# 19f08522	02-Dec-2000	Andrew Gallatin <gallatin@FreeBSD.org>	Correct int/long type mismatch in the proper place this time. freevnodes and numvnodes are longs in the kernel. They should remain longs in systat, what really needs to change is that they should be using SYSCTL_LONG rather than SYSCTL_INT. I also changed wantfreevnodes to SYSCTL_LONG because I happened to notice it. I wish there was a way to find all of these automatically.. Pointed out by: bde
# 21913407	30-Nov-2000	John Baldwin <jhb@FreeBSD.org>	Use msleep() instead of mtx_exit()/tsleep() so that we release the lock and go to sleep as an "atomic" operation.
# 6d984dfa	30-Nov-2000	Kirk McKusick <mckusick@FreeBSD.org>	Get rid of a bogus mtx_exit (it was attempting to release an already released mutex). Submitted by: "Chris Knight" <chris@aims.com.au>
# 936524aa	18-Nov-2000	Matthew Dillon <dillon@FreeBSD.org>	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather then block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmatic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. 
The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather than continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
# a2d1480c	02-Nov-2000	Tor Egge <tegge@FreeBSD.org>	Clear the VFREE flag when the vnode is removed from the free list in getnewvnode(). Otherwise routines called from VOP_INACTIVE() might attempt to remove the vnode from a free list the vnode isn't on, causing corruption. PR: 18012
# 1d7e3e42	02-Nov-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Take VBLK devices further out of their misery. This should fix the panic I introduced in my previous commit on this topic.
# 35e0e5b3	20-Oct-2000	John Baldwin <jhb@FreeBSD.org>	Catch up to moving headers: - machine/ipl.h -> sys/ipl.h - machine/mutex.h -> sys/mutex.h
# 47460a23	19-Oct-2000	Robert Watson <rwatson@FreeBSD.org>	o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform "administrative" authorization checks. In most cases, the VADMIN test checks to make sure the credential effective uid is the same as the file owner. o Modify vaccess() to set VADMIN as an available right if the uid is appropriate. o Modify references to uid-based access control operations such that they now always invoke VOP_ACCESS() instead of using hard-coded policy checks. o This allows alternative UFS policies to be implemented by replacing only ufs_access() (such as mandatory system policies). o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the vnode: I believe that new invocations of VOP_ACCESS() are always called with the lock held. o Some direct checks of the uid remain, largely associated with the QUOTA and SUIDDIR code. Reviewed by: eivind Obtained from: TrustedBSD Project
# 7eb9fca5	09-Oct-2000	Eivind Eklund <eivind@FreeBSD.org>	Blow away the v_specmountpoint define, replacing it with what it was defined as (rdev->si_mountpoint)
# 39df8608	06-Oct-2000	Jason Evans <jasone@FreeBSD.org>	Do not call lockdestroy() for v_vnlock, which may point to a lock in a deeper vfs stacking layer. Submitted by: bp
# a863c0fb	05-Oct-2000	Eivind Eklund <eivind@FreeBSD.org>	Style fixes based on comments by bde
# a18b1f1d	03-Oct-2000	Jason Evans <jasone@FreeBSD.org>	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.
# f8be809e	02-Oct-2000	Boris Popov <bp@FreeBSD.org>	Move KASSERTs which check the value of v_usecount to after vnode locking, so they will not produce false alarms.
# 02a1e48f	27-Sep-2000	Kirk McKusick <mckusick@FreeBSD.org>	Do the right thing if bdevvp is called twice for the same device. Obtained from: Poul-Henning Kamp <phk@freebsd.org>
# 67e87166	25-Sep-2000	Boris Popov <bp@FreeBSD.org>	Add a lock structure to vnode structure. Previously it was either allocated separately (nfs, cd9660 etc) or kept as a first element of structure referenced by v_data pointer (ffs). Such organization leads to known problems with stacked filesystems. From this point vop_nolock() functions maintain only interlock lock. vop_stdlock() functions maintain built-in v_lock structure using lockmgr(). vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared lock on vnode. If filesystem wishes to export lockmgr compatible lock, it can put an address of this lock to v_vnlock field. This indicates that the upper filesystem can take advantage of it and use single lock structure for entire (or part) of stack of vnodes. This field shouldn't be examined or modified by VFS code except for initialization purposes. Reviewed in general by: mckusick
# 453aaa0d	21-Sep-2000	Eivind Eklund <eivind@FreeBSD.org>	Style fixes: * Add lots of comments * Convert a couple of assertions to KASSERT() * Minimal whitespace & misapplied {} fixes * Convert #if 0 to #if COMPILING_LINT for code we presently do not support, but want to keep available. Reviewed by: adrian, markm
# bba25953	22-Sep-2000	Eivind Eklund <eivind@FreeBSD.org>	Staticize addalias()
# 21a90397	21-Sep-2000	Alfred Perlstein <alfred@FreeBSD.org>	comment vfs_export functions, requested by: eivind
# e0848358	20-Sep-2000	Robert Watson <rwatson@FreeBSD.org>	o Add additional comment describing vaccess() behavior. Requested by: eivind Reviewed by: eivind, adrian
# b0d17ba6	19-Sep-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Rename lminor() to dev2unit(). This function gives a linear unit number which hides the 'hole' in the minor bits. Introduce unit2minor() to do the reverse operation. Fix some make_dev() calls which didn't use UID_* or GID_* macros. Kill the v_hashchain alias macro, it hides the real relationship. Introduce experimental SI_CHEAPCLONE flag; set it on cloned bpfs.
# 9ff5ce6b	12-Sep-2000	Boris Popov <bp@FreeBSD.org>	Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT. They will be used by nullfs and other stacked filesystems to support full cache coherency. Reviewed in general by: mckusick, dillon
# 0384fff8	06-Sep-2000	Jason Evans <jasone@FreeBSD.org>	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh
# 728783c2	05-Sep-2000	Robert Watson <rwatson@FreeBSD.org>	o Synchronize vaccess() capability access control checks with TrustedBSD tree. Obtained from: TrustedBSD Project
# 64dc16df	05-Sep-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Move extern declaration of dead_vnodeop_p to a .h file. Remove race condition in vn_isdisk().
# 012c643d	29-Aug-2000	Robert Watson <rwatson@FreeBSD.org>	o Restructure vaccess() so as to check for DAC permission to modify the object before falling back on privilege. Make vaccess() accept an additional optional argument, privused, to determine whether privilege was required for vaccess() to return 0. Add commented out capability checks for reference. Rename some variables to make it more clear which modes/uids/etc are associated with the object, and which with the access mode. o Update file system use of vaccess() to pass NULL as the optional privused argument. Once additional patches are applied, suser() will no longer set ASU, so privused will permit passing of privilege information up the stack to the caller. Reviewed by: bde, green, phk, -security, others Obtained from: TrustedBSD Project
# 4fe6d437	20-Aug-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Fix typo in last commit.
# e39c53ed	20-Aug-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Centralize the canonical vop_access user/group/other check in vaccess(). Discussed with: bde
# 9b971133	23-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	This patch corrects the first round of panics and hangs reported with the new snapshot code. Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem. Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write(). Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations. Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot. Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation. Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress. Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occationally read them which can cause unexpected behavior. Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic. 
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation. Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.
# f2a2857b	11-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timeliness are not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).
# 3660ebc2	07-Jul-2000	Boris Popov <bp@FreeBSD.org>	Fix support for more than 256 simultaneous mounts. Theoretical limit is 2^16 mounts per fs type. Reported by: Troy Arie Cobb <tcobb@staff.circle.net> via phk Reviewed by: bde
# 77978ab8	04-Jul-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Previous commit changing SYSCTL_HANDLER_ARGS violated KNF. Pointed out by: bde
# c904bbbd	03-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	Simplify and rationalise the management of the vnode free list (preparing the code to add snapshots).
# 37642196	03-Jul-2000	Kirk McKusick <mckusick@FreeBSD.org>	If a buffer flush fails when trying to reclaim a vnode, it is too late to save the vnode, so just toss any remaining unwritten buffers rather than leaving them lying around to make trouble in the future.
# 3275cf73	03-Jul-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES, that is way cleaner than using the softupdates_stub stunt, which should be killed when convenient. Discussed with: mckusick
# 82d9ae4e	03-Jul-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Style police catches up with rev 1.26 of src/sys/sys/sysctl.h: Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our sources: -sysctl_vm_zone SYSCTL_HANDLER_ARGS +sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
# a8b1f9d2	27-Jun-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Move prtactive to vfs from ufs. It is used all over the place.
# a2e7a027	16-Jun-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Virtualizes & untangles the bioops operations vector. Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
# e3975643	25-May-2000	Jake Burkholder <jake@FreeBSD.org>	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others
# 740a1973	23-May-2000	Jake Burkholder <jake@FreeBSD.org>	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd
# 01f76720	14-May-2000	Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>	Fix the rootmount code for now. This function will probably be rewritten/renamed to devpp. Submitted by: Assar Westerlund <assar@sics.se> on -current Confirmed to work: Steinar Haug <sthaug@nethelp.no>, Manfred Antar <mantar@pacbell.net> Reviewed by: phk
# 9626b608	05-May-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bde's teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.h> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter
# b99c307a	20-Mar-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Rename the existing BUF_STRATEGY() to DEV_STRATEGY() substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo) substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo) This patch is machine generated except for the ccd.c and buf.h parts.
# b081a64a	17-Mar-2000	Chris Costello <chris@FreeBSD.org>	In vn_isdisk(), check whether vp->v_rdev is NULL. If it is, then return ENXIO (Device not configured). Without this, vn_isdisk() could (and did in the case of lstat() under fdesc) pass a NULL pointer to devsw(), which caused a page fault. Reviewed by: alfred
# db5f635a	16-Mar-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Eliminate the undocumented, experimental, non-delivering and highly dangerous MAX_PERF option.
# 05ecdd70	14-Mar-2000	Bruce Evans <bde@FreeBSD.org>	Don't try so hard to make the lower 16 bits of fsids unique. It tended to recycle full fsids after only 16 mount/unmount's. This is probably too often for exported fsids. Now we recycle the full fsids only after 2^16 mount/ umount's and only ensure uniqueness in the lower 16 bits if there have been <= 256 calls to vfs_getnewfsid() since the system started.
# 61214975	12-Mar-2000	Bruce Evans <bde@FreeBSD.org>	Try harder to make the lower 16 bits of fsids unique. The vfs type number was packed very wastefully, giving perfect non-uniqueness in the lower 16 bits of fsids for filesystems with the same vfs type. This made linux_stat() return perfectly non-unique (broken) 16-bit st_dev's for nfs mount points, and effectively reduced mntid_base to 8 bits so that the vfs_getnewfsid() looped endlessly when there are already 256 mounted filesystems with the required vfs type. Approved by: jkh
# e8359a57	07-Feb-2000	Søren Schmidt <sos@FreeBSD.org>	Do refcounting of open devices (more) correctly. count_dev function by phk.
# b7a5f3ca	02-Feb-2000	Robert Watson <rwatson@FreeBSD.org>	Remove static qualifier from vgonel, as it is needed by the Arla folk outside of vfs_subr.c. Submitted by: Assar Westerlund <assar@sics.se> Reviewed by: rwatson Approved by: jkh
# 9a2b8fca	29-Jan-2000	Robert Watson <rwatson@FreeBSD.org>	This patch fixes a locking bug that can result in deadlock if the codepath is followed. From the PR: vclean calls vrele leading to deadlock (if usecount > 0) vclean() calls vrele() if v_usecount of the node was higher than one. But before calling it, it sets the VXLOCK flag, which will make vn_lock called from vrele dead-lock. PR: kern/15117 Submitted by: Assar Westerlund <assar@stacken.kth.se> Reviewed by: rwatson Obtained from: NetBSD
# ba4ad1fc	09-Jan-2000	Poul-Henning Kamp <phk@FreeBSD.org>	Give vn_isdisk() a second argument where it can return a suitable errno. Suggested by: bde
# 411e1480	09-Jan-2000	Kirk McKusick <mckusick@FreeBSD.org>	Remove the P_BUFEXHAUST flag from the syncer process (leaving it only on the buf_daemon process). The problem is that when the syncer process starts running the worklist, it wants to delete lots of files. It does this by VFS_VGET'ing the vnodes, clearing the blocks in them and bdwrite'ing the buffer. It can process close to a thousand files per second which generates a large number of dirty buffers. So, giving it special privilege at the buffer trough leads to trouble as the buf_daemon does occasionally need a free buffer to proceed and if the syncer has used every last one up, we are toast.
# e12d97d2	08-Jan-2000	Eivind Eklund <eivind@FreeBSD.org>	Change NDFREE() from a macro to a function for the time being; the macro version caused intolerable bloat (30k). I'm likely to revisit this with an attempt at a smarter macro. Bloat noticed by: bde
# 5e950839	07-Jan-2000	Luoqi Chen <luoqi@FreeBSD.org>	Introduce a mechanism to suspend/resume system processes. Suspend syncer and bufdaemon prior to disk sync during system shutdown.
# c37c9620	04-Jan-2000	Matthew Dillon <dillon@FreeBSD.org>	Enhance reassignbuf(). When a buffer cannot be time-optimally inserted into vnode dirtyblkhd we append it to the list instead of prepend it to the list in order to maintain a 'forward' locality of reference, which is arguably better than 'reverse'. The original algorithm did things this way too but at a huge time cost. Enhance the append interlock for NFS writes to handle intr/soft mounts better. Fix the hysteresis for NFS async daemon I/O requests to reduce the number of unnecessary context switches. Modify handling of NFS mount options. Any given user option that is too high now defaults to the kernel maximum for that option rather than the kernel default for that option. Reviewed by: Alfred Perlstein <bright@wintelcom.net>
# 02b00854	21-Dec-1999	Kirk McKusick <mckusick@FreeBSD.org>	Prettiness police: Identify flags in b_xflags with BX_ to distinguish them from flags in b_flags which are prefixed with B_
# 4f79d873	11-Dec-1999	Matthew Dillon <dillon@FreeBSD.org>	Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to madvise(). This feature prevents the update daemon from gratuitously flushing dirty pages associated with a mapped file-backed region of memory. The system pager will still page the memory as necessary and the VM system will still be fully coherent with the filesystem. Modifications made by other means to the same area of memory, for example by write(), are unaffected. The feature works on a page-granularity basis. MAP_NOSYNC allows one to use mmap() to share memory between processes without incurring any significant filesystem overhead, putting it in the same performance category as SysV Shared memory and anonymous memory. Reviewed by: julian, alc, dg
# 6bdfe06a	11-Dec-1999	Eivind Eklund <eivind@FreeBSD.org>	Lock reporting and assertion changes. * lockstatus() and VOP_ISLOCKED() gets a new process argument and a new return value: LK_EXCLOTHER, when the lock is held exclusively by another process. * The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them * Extend the vnode_if.src format to allow more exact specification than locked/unlocked. This commit should not do any semantic changes unless you are using DEBUG_VFS_LOCKS. Discussed with: grog, mch, peter, phk Reviewed by: peter
# 245efbba	29-Nov-1999	Matthew Dillon <dillon@FreeBSD.org>	Remove vfs_getrootfsid() function (a temporary hack added a few months ago to make BOOTP work again). It is no longer required by BOOTP and no longer used.
# 38224dcd	22-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Convert various pieces of code to use vn_isdisk() rather than checking for vp->v_type == VBLK. In ccd: we don't need to call VOP_GETATTR to find the type of a vnode. Reviewed by: sos
# 0429e37a	20-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	struct mountlist and struct mount.mnt_list have no business being a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively. This removes ugly mp != (void*)&mountlist comparisons. Requested by: phk Submitted by: Jake Burkholder jake@checker.org PR: 14967
# 1b727751	16-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Commit the remaining part of PR14914: A lot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914
# 698f9cf8	09-Nov-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Next step in the device cleanup process. Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code. Unify spec_open() for bdev and cdev cases. Remove the disabled bdev specific read/write code.
# 923502ff	29-Oct-1999	Poul-Henning Kamp <phk@FreeBSD.org>	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ\|WRITE} rather than B_{READ\|WRITE} as argument.
# d1f088da	11-Oct-1999	Peter Wemm <peter@FreeBSD.org>	Trim unused options (or #ifdef for undoc options). Submitted by: phk
# aa4f4b69	04-Oct-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Move the buffered read/write code out of spec_{read\|write} and into two new functions spec_buf{read\|write}. Add sysctl vfs.bdev_buffered which defaults to 1 == true. This sysctl can be used to experimentally turn buffered behaviour for bdevs off. It should not be changed while any blockdevices are open. Remove the misplaced sysctl vfs.enable_userblk_io. No other changes in behaviour.
# 1b5464ef	29-Sep-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Remove v_maxio from struct vnode. Replace it with mnt_iosize_max in struct mount. Nits from: bde
# 40360b1b	20-Sep-1999	Matthew Dillon <dillon@FreeBSD.org>	Final commit to remove vnode->v_lastr. vm_fault now handles read clustering issues (replacing code that used to be in ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter inlines. This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr for VM, and fp->f_nextread and fp->f_seqcount (which have been in the tree for a while). Determination of the I/O strategy (sequential, random, and so forth) is now handled on a descriptor-by-descriptor basis for base I/O calls, and on a memory-region-by-memory-region and process-by-process basis for VM faults. Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
# 552f337f	20-Sep-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Initialize vp->v_maxio to its default in getnewvnode() rather than four different places in vfs_cluster.c
# e6f71111	19-Sep-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix BOOTP root FS mounts. Also cleanup vfs_getnewfsid() and collapse addaliasu() into addalias() (no operational change) and clarify comments relating to a trick that vclean() uses. The fix to BOOTP is yet another hack. Actually, rootfsid handling is already a major hack. The whole thing needs to be cleaned up. Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>
# bb01f28e	17-Sep-1999	Matthew Dillon <dillon@FreeBSD.org>	Add vfs.enable_userblk_io sysctl to control whether user reads and writes to buffered block devices are allowed. The default is to be backwards compatible, i.e. reads and writes are allowed. The idea is for a larger crowd to start running with this disabled and see what problems, if any, crop up, and then to change the default to off and see if any problems crop up in the next 6 months prior to potentially removing support entirely. There are still a few people, Julian and myself included, who believe the buffered block device access from usermode to be useful. Remove use of vnode->v_lastr from buffered block device I/O in preparation for removal of vnode->v_lastr field, replacing it with the already existing seqcount metric to detect sequential operation. Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>
# d137accc	29-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Add dev_t freeing code. Controlled by sysctl debug.free_devt, default is off.
# 96267288	28-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	remove unused variables.
# c3aac50f	27-Aug-1999	Peter Wemm <peter@FreeBSD.org>	$Id$ -> $FreeBSD$
# dbafb366	26-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Simplify the handling of VCHR and VBLK vnodes using the new dev_t: Make the alias list a SLIST. Drop the "fast recycling" optimization of vnodes (including the returning of a preexisting but stale vnode from checkalias). It doesn't buy us anything now that we don't hardlimit vnodes anymore. Rename checkalias2() and checkalias() to addalias() and addaliasu() - which takes dev_t and udev_t arg respectively. Make the revoke syscalls use vcount() instead of VALIASED. Remove VALIASED flag, we don't need it now and it is faster to traverse the much shorter lists than to maintain the flag. vfs_mountedon() can check the dev_t directly, all the vnodes point to the same one. Print the devicename in specfs/vprint(). Remove a couple of stale LFS vnode flags. Remove unimplemented/unused LK_DRAINED;
# 41d2e3e0	24-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.
# 0ff7b13a	24-Aug-1999	Julian Elischer <julian@FreeBSD.org>	Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw. In lookup() however it's the other way around as we need to supply the dev_t for the vnode, so devfs still has a copy of it stashed away. Sourcing it from the vnode in the vnops however is useful as it makes a lot of the code almost the same as that in specfs.
# a2801b77	21-Aug-1999	John Polstra <jdp@FreeBSD.org>	Support full-precision file timestamps. Until now, only the seconds have been maintained, and that is still the default. A new sysctl variable "vfs.timestamp_precision" can be used to enable higher levels of precision: 0 = seconds only; nanoseconds zeroed (default). 1 = seconds and nanoseconds, accurate within 1/HZ. 2 = seconds and nanoseconds, truncated to microseconds. >=3 = seconds and nanoseconds, maximum precision. Level 1 uses getnanotime(), which is fast but can be wrong by up to 1/HZ. Level 2 uses microtime(). It might be desirable for consistency with utimes() and friends, which take timeval structures rather than timespecs. Level 3 uses nanotime() for the highest precision. I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with "cpio -pdu". There was almost negligible difference in the system times -- much less than 1%, and less than the variation among multiple runs at the same level. Bruce Evans dreamed up a torture test involving 1-byte reads with intervening fstat() calls, but the cpio test seems more realistic to me. This feature is currently implemented only for the UFS (FFS and MFS) filesystems. But I think it should be easy to support it in the others as well. An earlier version of this was reviewed by Bruce. He's not to blame for any breakage I've introduced since then. Reviewed by: bde (an earlier version of the code)
# 7dc5cd04	13-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	The bdevsw() and cdevsw() are now identical, so kill the former.
# 4d4f9323	13-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	s/v_specinfo/v_rdev/
# 0ef1c826	08-Aug-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Decommission miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.
# 67452993	26-Jul-1999	Alan Cox <alc@FreeBSD.org>	Add sysctl and support code to allow directories to be VMIO'd. The default setting for the sysctl is OFF, which is the historical operation. Submitted by: dillon
# 698bfad7	20-Jul-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Now a dev_t is a pointer to struct specinfo which is shared by all specdev vnodes referencing this device. Details: cdevsw->d_parms has been removed, the specinfo is available now (== dev_t) and the driver should modify it directly when applicable, and the only driver doing so, does so: vn.c. I am not sure the logic in checking for "<" was right before, and it looks even less so now. An initial pool of 50 struct specinfo are depleted during early boot, after that malloc had better work. It is likely that fewer than 50 would do. Hashing is done from udev_t to dev_t with a prime number remainder hash, experiments show no better hash available for decent cost (MD5 is only marginally better) The prime number used should not be close to a power of two, we use 83 for now. Add new checkalias2() to get around the loss of info from dev2udev() in bdevvp(); The aliased vnodes are hung on a list straight off the dev_t, and speclisth[SPECSZ] is unused. The sharing of struct specinfo means that the v_specnext moves into the vnode which grows by 4 bytes. Don't use a VBLK dev_t which doesn't make sense in MFS, now we hang a dummy cdevsw on B/Cmaj 253 so that things look sane. Storage overhead from all of this is O(50k). Bump __FreeBSD_version to 400009 The next step will add the stuff needed so device-drivers can start to hang things from struct specinfo
# 3de280c4	19-Jul-1999	Poul-Henning Kamp <phk@FreeBSD.org>	[click] Now all dev_t's in the kernel have their char device major. The only known casualty of this is swapinfo/pstat, which should be fixed the right way: Store the actual pathname in the kernel like mount does. [Volunteers sought for this task] The road map from here is roughly: expand struct specinfo into struct based dev_t. Add dev_t registration facilities for device drivers and start to use them.
# 6ca54864	18-Jul-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Introduce the vn_todev(struct vnode*) function, which returns the dev_t corresponding to a VBLK or VCHR node, or NODEV.
# c7119ea7	17-Jul-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Fix 2nd arg to udev2dev().
# f008cfcc	17-Jul-1999	Poul-Henning Kamp <phk@FreeBSD.org>	I have not one single time remembered the name of this function correctly so obviously I gave it the wrong name. s/umakedev/makeudev/g
# e7647e6c	12-Jul-1999	Kris Kennaway <kris@FreeBSD.org>	Correct a couple of spelling errors in comments.
# ad8ac923	08-Jul-1999	Kirk McKusick <mckusick@FreeBSD.org>	These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bringing the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propagate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added.
They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() PRIOR to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# e929c00d	03-Jul-1999	Kirk McKusick <mckusick@FreeBSD.org>	The buffer queue mechanism has been reformulated. Instead of having QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truly empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache). Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block. The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future.
reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occurred when write_behind was turned off in the system. The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks. A small race condition was fixed in getpbuf() in vm/vm_pager.c. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>
# 8947a90a	02-Jul-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Make sure that stat(2) and friends always return a valid st_dev field. Pseudo-FS need not fill in the va_fsid anymore, the syscall code will use the first half of the fsid, which now looks like a udev_t with major 255.
# 9c8b8baa	01-Jul-1999	Peter Wemm <peter@FreeBSD.org>	Slight reorganization of kernel thread/process creation. Instead of using SYSINIT_KT() etc (which is a static, compile-time procedure), use a NetBSD-style kthread_create() interface. kproc_start is still available as a SYSINIT() hook. This allowed simplification of chunks of the sysinit code in the process. This kthread_create() is our old kproc_start internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work the same as the NetBSD one. One thing I'd like to do shortly is get rid of nfsiod as a user initiated process. It makes sense for the nfs client code to create them on the fly as needed up to a user settable limit. This means that nfsiod doesn't need to be in /sbin and is always "available". This is a fair bit easier to do outside of the SYSINIT_KT() framework.
# 67812eac	25-Jun-1999	Kirk McKusick <mckusick@FreeBSD.org>	Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.
# f9c8cab5	16-Jun-1999	Kirk McKusick <mckusick@FreeBSD.org>	Add a vnode argument to VOP_BWRITE to get rid of the last vnode operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.
# e4ab40bc	15-Jun-1999	Kirk McKusick <mckusick@FreeBSD.org>	Get rid of the global variable rushjob and replace it with a function in kern/vfs_subr.c named speedup_syncer() which handles the speedup request. Change the various clients of rushjob to use the new function.
# 2447bec8	31-May-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Simplify cdevsw registration. The cdevsw_add() function now finds the major number(s) in the struct cdevsw passed to it. cdevsw_add_generic() is no longer needed, cdevsw_add() does the same thing. cdevsw_add() will print a message if the d_maj field looks bogus. Remove nblkdev and nchrdev variables. Most places they were used bogusly. Instead check a dev_t for validity by seeing if devsw() or bdevsw() returns NULL. Move bdevsw() and devsw() functions to kern/kern_conf.c Bump __FreeBSD_version to 400006 This commit removes: 72 bogus makedev() calls 26 bogus SYSINIT functions if_xe.c bogusly accessed cdevsw[], author/maintainer please fix. I4b and vinum not changed. Patches emailed to authors. LINT probably broken until they catch up.
# 02013ff8	23-May-1999	John Birrell <jb@FreeBSD.org>	Remove the test for bdevsw(dev) == NULL from bdevvp() because it fails if there is no character device associated with the block device. In this case that doesn't matter because bdevvp() doesn't use the character device structure. I can use the pointy bit of the axe too.
# 0ce54cbb	14-May-1999	Luoqi Chen <luoqi@FreeBSD.org>	Legally acquire a major number for mfs.
# eaea7a9e	13-May-1999	Kirk McKusick <mckusick@FreeBSD.org>	Previously directories were sync'ed every 10 seconds while bitmaps & inodes were synced every 15 seconds. This is now reversed as during directory create, we cannot commit the directory entry until its inode has been written. With this switch, the inodes will be more likely to be written by the time that the directory is written thus reducing the number of directory rollbacks that are needed.
# cc5881cf	12-May-1999	Peter Wemm <peter@FreeBSD.org>	Fix (?) SPECHASH dev_t/major/minor/etc args
# 8bee45c4	12-May-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Don't peek into dev_t
# bfbb9ce6	11-May-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Divorce "dev_t" from the "major\|minor" bitmap, which is now called udev_t in the kernel but still called dev_t in userland. Provide functions to manipulate both types: major() umajor() minor() uminor() makedev() umakedev() dev2udev() udev2dev() For now they're functions, they will become in-line functions after one of the next two steps in this process. Return major/minor/makedev to macro-hood for userland. Register a name in cdevsw[] for the "filedescriptor" driver. In the kernel the udev_t appears in places where we have the major/minor number combination, (ie: a potential device: we may not have the driver nor the device), like in inodes, vattr, cdevsw registration and so on, whereas the dev_t appears where we carry around a reference to an actual device. In the future the cdevsw and the aliased-from vnode will be hung directly from the dev_t, along with up to two softc pointers for the device driver and a few housekeeping bits. This will essentially replace the current "alias" check code (same buck, bigger bang). A little stunt has been provided to try to catch places where the wrong type is being used (dev_t vs udev_t), if you see something not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if it makes a difference. If it does, please try to track it down (many hands make light work) or at least try to reproduce it as simply as possible, and describe how to do that. Without DEVT_FASCIST I believe this patch is a no-op. Stylistic/posixoid comments about the userland view of the <sys/*.h> files welcome now, from userland they now contain the end result. Next planned step: make all dev_t's refer to the same devsw[] which means convert BLK's to CHR's at the perimeter of the vnodes and other places where they enter the game (bootdev, mknod, sysctl).
# 1637aa4b	08-May-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Fix some of the places where too much inside knowledge about major/minor layout and dev_t structure is being (ab)used.
# 4be2eb8c	08-May-1999	Poul-Henning Kamp <phk@FreeBSD.org>	I got tired of seeing all the cdevsw[major(foo)] all over the place. Made a new (inline) function devsw(dev_t dev) and substituted it. Changed to the BDEV variant to this format as well: bdevsw(dev_t dev) DEVFS will eventually benefit from this change too.
# 46eede00	07-May-1999	Poul-Henning Kamp <phk@FreeBSD.org>	Continue where Julian left off in July 1998: Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline) function. Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention to the order of the cmaj/bmaj arguments!) Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE (ditto!) (Next step will be to convert all bdev dev_t's to cdev dev_t's before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
# 3d177f46	03-May-1999	Bill Fumerola <billf@FreeBSD.org>	Add sysctl descriptions to many SYSCTL_XXXs PR: kern/11197 Submitted by: Adrian Chadd <adrian@FreeBSD.org> Reviewed by: billf(spelling/style/minor nits) Looked at by: bde(style)
# 4ef2094e	11-Mar-1999	Julian Elischer <julian@FreeBSD.org>	Reviewed by: Many at different times in different parts, including alan, john, me, luoqi, and kirk Submitted by: Matt Dillon <dillon@frebsd.org> This change implements a relatively sophisticated fix to getnewbuf(). There were two problems with getnewbuf(). First, the write recursion can lead to a system stack overflow when you have NFS and/or VN devices in the system. Second, the free/dirty buffer accounting was completely broken. Not only did the nfs routines blow it trying to manually account for the buffer state, but the accounting that was done did not work well with the purpose of their existence: figuring out when getnewbuf() needs to sleep. The meat of the change is to kern/vfs_bio.c. The remaining diffs are all minor except for NFS, which includes both the fixes for bp interaction AND fixes for a 'biodone(): buffer already done' lockup. Sys/buf.h also contains a chaining structure which is not used by this patchset but is used by other patches that are coming soon. This patch delineated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF. (sorry for the missing T matt)
# 155f87da	24-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Reviewed by: Julian Elischer <julian@whistle.com> Add d_parms() to {c,b}devsw[]. If non-NULL this function points to a device routine that will properly fill in the specinfo structure. vfs_subr.c's checkalias() supplies appropriate defaults. This change should be fully backwards compatible with existing devices.
# 42e26d47	19-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	Protect vn worklist and vn->v_{clean,dirty}blkhd at splbio(). Get rid of extra LIST_REMOVE() Reviewed by: hsu@FreeBSD.ORG (Jeffrey Hsu), mckusick@McKusick.COM Submitted by: hsu@FreeBSD.ORG (Jeffrey Hsu), dillon@backplane.com ( Matthew Dillon )
# 82b23b53	04-Feb-1999	Matthew Dillon <dillon@FreeBSD.org>	vp->v_object must be valid after normal flow of vfs_object_create() completes, change if() to KASSERT(). This is not a bug, we are simply clarifying and optimizing the code. In if/else in vfs_object_create(), the failure of both conditionals will lead to a NULL object. Exit gracefully if this case occurs. ( this case does not normally occur, but needed to be handled ). Obtained from: Eivind Eklund <eivind@FreeBSD.org>
# bc814931	29-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	More const fixes for -Wall, -Wcast-qual
# 8aef1712	27-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile
# 1c7c3c6a	21-Jan-1999	Matthew Dillon <dillon@FreeBSD.org>	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>
# 219cbf59	09-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	KNFize, by bde.
# 5526d2d9	08-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith
# fb116777	05-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	Remove the 'waslocked' parameter to vfs_object_create().
# 0df45b5a	05-Jan-1999	Eivind Eklund <eivind@FreeBSD.org>	Finish staticization.
# 289bdf33	02-Jan-1999	Bruce Evans <bde@FreeBSD.org>	Ifdefed conditionally used simplock variables.
# 4d948813	23-Dec-1998	Bruce Evans <bde@FreeBSD.org>	Restored rev.1.31 which was clobbered by rev.1.69 (the big Lite2 merge). This fixes at least hanging in revoke(2) when a somewhat active slave pty is revoked. The hang made the window for the null pointer bug in ufsspec_{read,write} much larger. There are many other bugs in this area (revoke of an active fifo at best leaks memory...).
# 29c98cd8	21-Dec-1998	Eivind Eklund <eivind@FreeBSD.org>	Check return value of tsleep(). I've checked all call points - there does not seem to be a problem with this. PR: kern/8732 Analysis by: David G Andersen <danderse@cs.utah.edu> Tested by: Alfred Perlstein <bright@hotjobs.com>
# db878ba4	21-Dec-1998	Eivind Eklund <eivind@FreeBSD.org>	Staticize.
# 2127f260	04-Dec-1998	Archie Cobbs <archie@FreeBSD.org>	Examine all occurrences of sprintf(), strcat(), and str[n]cpy() for possible buffer overflow problems. Replaced most sprintf()'s with snprintf(); for others cases, added terminating NUL bytes where appropriate, replaced constants like "16" with sizeof(), etc. These changes include several bug fixes, but most changes are for maintainability's sake. Any instance where it wasn't "immediately obvious" that a buffer overflow could not occur was made safer. Reviewed by: Bruce Evans <bde@zeta.org.au> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Mike Spengler <mks@networkcs.com>
# 16e9e530	31-Oct-1998	Peter Wemm <peter@FreeBSD.org>	Convert lists for bufs attached to vnodes from a LIST to a TAILQ. - Use TAILQ_* macros extensively instead of internal names - use b_xflags instead of the NOLIST magic number hack in the next pointer - clean bufs are inserted at the tail rather than the head. - redo dirty buffer insert so that metadata (negative lbn) goes to the tail directly rather than at the HEAD. This makes a difference when inserting dirty data blocks in lbn sorted order since data block insertion will not have to bypass all the metadata cruft. data is lbn sorted since it makes sense for clustering and writeback ordering, while metadata sorting doesn't help much since the lbn's are meaningless when walking the list for writebacks. Small systems will not notice much (if any) benefit from this, but really busy systems with large dirty block lists should get a lot more. I've tested this with softdep, and it doesn't seem to mind the change of queueing of metadata. Reviewed (in principle) by: dg Obtained from: partly from John Dyson's work-in-progress patches in June.
# b421db37	31-Oct-1998	Peter Wemm <peter@FreeBSD.org>	The last argument to vm_object_page_clean() is now a set of bit flags, rather than the old true/false. While here, have vfs_msync() only call vm_object_page_clean() with OBJPC_SYNC if called with MNT_WAIT flags. vfs_msync() is called at unmount time (with MNT_WAIT) and from the syncer process (formerly update). This should make dirty mmap writebacks a little less nasty. I have tested this a little with SOFTUPDATES enabled, but I don't normally use it since I've been badly burned too many times.
# cbbbd4c3	29-Oct-1998	Bruce Evans <bde@FreeBSD.org>	Oops, rev.1.167 made the device number checking in bdevvp() too strict for mfs root mounts. Don't require major 255 to be in bdevsw[].
# 20f02ef5	29-Oct-1998	Peter Wemm <peter@FreeBSD.org>	Remove the V_SAVEMETA flag, nothing uses it any more now that msdosfs and ext2fs call vtruncbuf() directly. This simplifies and cleans up vinvalbuf() a little.
# 885bf0b5	26-Oct-1998	Bruce Evans <bde@FreeBSD.org>	Updated the major number check in vfs_object_create(). It's not clear if the check is necessary, but vfs_object_create() is called for all vnodes and it was silly to create objects for VBLK vnodes that don't even have a driver.
# f5ef029e	25-Oct-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.
# 37906c68	25-Oct-1998	Bruce Evans <bde@FreeBSD.org>	Fixed device number checking in bdevvp(): - dev != NODEV was checked for, but 0 was returned on failure. This was fixed in Lite2 (except the return code was still slightly wrong (ENODEV instead of ENXIO)) but the changes were not merged. This case probably doesn't actually occur under FreeBSD. - major(dev) was not checked to have a valid non-NULL bdevsw entry. This caused panics when the driver for the root device didn't exist. Fixed minor misformattings in bdevvp(). Rev.1.14 consisted mainly of gratuitous reformattings that seem to have caused many Lite2 merge errors. PR: 8417
# f74d75a2	14-Oct-1998	Dmitrij Tejblum <dt@FreeBSD.org>	Backed out rev. 1.164. It caused problems on SMP. PR: 8309
# 6cde7a16	13-Oct-1998	David Greenman <dg@FreeBSD.org>	Fixed two potentially serious classes of bugs: 1) The vnode pager wasn't properly tracking the file size due to "size" being page rounded in some cases and not in others. This sometimes resulted in corrupted files. First noticed by Terry Lambert. Fixed by changing the "size" pager_alloc parameter to be a 64bit byte value (as opposed to a 32bit page index) and changing the pagers and their callers to deal with this properly. 2) Fixed a bogus type cast in round_page() and trunc_page() that caused some 64bit offsets and sizes to be scrambled. Removing the cast required adding casts at a few dozen callers. There may be problems with other bogus casts in close-by macros. A quick check seemed to indicate that those were okay, however.
# 9bbd8a24	12-Oct-1998	Dmitrij Tejblum <dt@FreeBSD.org>	UnVMIO vnodes of block devices when they are no longer in use. (Some things, like msdosfs, do not work (panic) on devices with VMIO enabled. FFS enable VMIO on mounted devices, and nothing previously disabled it, so, after you mounted FFS floppy, you could not mount msdosfs floppy anymore...) This is mostly a quick before-release fix. Reviewed by: bde
# d024c955	14-Sep-1998	Søren Schmidt <sos@FreeBSD.org>	Remove the SLICE code. This clearly needs a lot more thought, and we don't need this to hunt us down in 3.0-RELEASE.
# 500b04a2	05-Sep-1998	Bruce Evans <bde@FreeBSD.org>	Instantiate `nfs_mount_type' in a standard file so that it is present when nfs is an LKM. Declare it in a header file. Don't forget to use it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0 will soon be the mount type number for the first vfs loaded. NetBSD uses strcmp() to avoid this ugly global.
# f5ce6752	29-Aug-1998	Bruce Evans <bde@FreeBSD.org>	Oops, the previous revision unconfigured too much pre-Lite2 compatibility cruft. At least lsvfs(1) was broken.
# 13950bd2	12-Aug-1998	Bruce Evans <bde@FreeBSD.org>	Don't configure compatibility code for pre-Lite2 mount() calls by default. This code should go away soon.
# 7a6c46b5	12-Jul-1998	Doug Rabson <dfr@FreeBSD.org>	Initialise all the fields separately in vattr_null since on the alpha they are not all the same width.
# ac1e407b	11-Jul-1998	Bruce Evans <bde@FreeBSD.org>	Fixed printf format errors.
# be160d60	21-Jun-1998	Bruce Evans <bde@FreeBSD.org>	Removed unused includes.
# 32f5d4d8	10-Jun-1998	Julian Elischer <julian@FreeBSD.org>	Replace 'sleep()' with 'tsleep()' Accidentally imported from Kirk's codebase. Pointed out by: various.
# 28913ebe	10-Jun-1998	Julian Elischer <julian@FreeBSD.org>	Submitted by: Kirk McKusick <mckusick@McKusick.COM> Fix for potential hang when trying to reboot the system or to forcibly unmount a soft update enabled filesystem. FreeBSD already handled the reboot case differently, this is however a better fix.
# ecbb00a2	07-Jun-1998	Doug Rabson <dfr@FreeBSD.org>	This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us inline with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change. The prototype FreeBSD/alpha machdep will follow in a couple of days time.
# cb87a87c	17-May-1998	Tor Egge <tegge@FreeBSD.org>	Supply the correct process argument to dounmount when possible.
# 3e425b96	19-Apr-1998	Julian Elischer <julian@FreeBSD.org>	Add changes and code to implement a functional DEVFS. This code will be turned on with the TWO options DEVFS and SLICE. (see LINT) Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will delineate these changes. /dev will be automatically mounted by init (thanks phk) on bootup. See /sys/dev/slice/slice.4 for more info. All code should act the same without these options enabled. Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others This code does not support the following: bad144 handling. Persistence. (My head is still hurting from the last time we discussed this) ATAPI floppies are not handled by the SLICE code yet. When this code is running, all major numbers are arbitrary and COULD be dynamically assigned. (this is not done, for POLA only) Minor numbers for disk slices ARE arbitrary and dynamically assigned.
# 37b8ccd3	18-Apr-1998	Peter Wemm <peter@FreeBSD.org>	In vfs_msync(), test to see if the vnode being examined is "interesting" (ie: it has a vm_object attached and is marked as OBJ_MIGHTBEDIRTY) before attempting to lock it. This should reduce the cpu hit that is incurred when doing a sync(2) and when the syncer process is doing the 30-second writeback of dirty mmap() data to disk. Skip this speedup if we are doing an unmount() to be sure to get everything - we can afford to occasionally miss a msync while the system is running, but not at unmount. I'm not sure about the VXLOCK and MNT_WAIT case, it seems a bit odd to skip doing a page_clean at unmount time just because a vnode is VXLOCKed, but that's what was being done before...
# efdc5523	15-Apr-1998	Peter Wemm <peter@FreeBSD.org>	When the softdep conversion took place, the periodic vfs_msync() from update got lost. This is responsible for ensuring that dirty mmap() pages get periodically written to disk. Without it, long time mmap's might not have their dirty pages written out at all if the system crashes or isn't cleanly shut down. This could be nasty if you've got a long-running writing via mmap(), dirty pages used to get written to disk within 30 seconds or so.
# 71033a8c	15-Apr-1998	Tor Egge <tegge@FreeBSD.org>	Unlock mountlist_slock if the mount point was busy (unmount in progress) during the attempt at lazy fsync.
# 227ee8a1	30-Mar-1998	Poul-Henning Kamp <phk@FreeBSD.org>	Eradicate the variable "time" from the kernel, using various measures. "time" wasn't an atomic variable, so splfoo() protection was needed around any access to it, unless you just wanted the seconds part. Most uses of time.tv_sec now use the new variable time_second instead. gettime() changed to getmicrotime(). Remove a couple of unneeded splfoo() protections, the new getmicrotime() is atomic, (until Bruce sets a breakpoint in it). A couple of places needed random data, so use read_random() instead of mucking about with time which isn't random. Add a new nfs_curusec() function. Mark a couple of bogosities involving the now disappeared time variable. Update ffs_update() to avoid the weird "== &time" checks, by fixing the one remaining call that passed &time as args. Change profiling in ncr.c to use ticks instead of time. Resolution is the same. Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call hzto() which subtracts time" sequences. Reviewed by: bde
# 3c1300a6	28-Mar-1998	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# 771b51ef	27-Mar-1998	Bruce Evans <bde@FreeBSD.org>	Don't depend on <sys/mount.h> including <sys/socket.h>.
# 52c64c95	19-Mar-1998	John Dyson <dyson@FreeBSD.org>	In kern_physio.c fix tsleep priority messup. In vfs_bio.c, remove b_generation count usage, remove redundant reassignbuf, remove redundant spl(s), manage page PG_ZERO flags more correctly, utilize in invalid value for b_offset until it is properly initialized. Add asserts for #ifdef DIAGNOSTIC, when b_offset is improperly used. when a process is not performing I/O, and just waiting on a buffer generally, make the sleep priority low. only check page validity in getblk for B_VMIO buffers. In vfs_cluster, add b_offset asserts, correct pointer calculation for clustered reads. Improve readability of certain parts of the code. Remove redundant spl(s). In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin <gallatin@cs.duke.edu>). More vtruncbuf problems fixed.
# 1c77c6b7	19-Mar-1998	John Dyson <dyson@FreeBSD.org>	Fix an embarrassing problem in vtruncbuf.
# 2deb5d04	16-Mar-1998	John Dyson <dyson@FreeBSD.org>	Correct a severely evil bug in the vtruncbuf code. It didn't cause me any problems until after the previous commit. This problem then caused a severe case of creeping crud on my diskdrive, and hosed my system so bad, that I needed to do a complete reinstall. Sorry!!! I assume that others have manifest this bug.
# e85c1afb	15-Mar-1998	John Dyson <dyson@FreeBSD.org>	Allow vfs_ioopt to be enabled with a (temporary) config option.
# bef608bd	15-Mar-1998	John Dyson <dyson@FreeBSD.org>	Some VM improvements, including elimination of alot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!! pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.) pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances. vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.) vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust. vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true. vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.) vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities. vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.) vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. 
The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliabiliy and performance issue.) 10) Modify FFS to use vtruncbuf. vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.) vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly. vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.) vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.
# 26300b34	14-Mar-1998	John Dyson <dyson@FreeBSD.org>	Disable the vfs.ioopt option for now, so that we don't get gratuitous bug reports. I might not be able to fix the problems before 3.0, due to other, more important things.
# 8293f20a	13-Mar-1998	Tor Egge <tegge@FreeBSD.org>	Don't misuse vnode interlocks in routines that can be called from interrupts. PR: 5893
# b1897c19	08-Mar-1998	Julian Elischer <julian@FreeBSD.org>	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree
# 8f9110f6	07-Mar-1998	John Dyson <dyson@FreeBSD.org>	This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifest themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistancy, and as a side-effect broke the vfs.ioopt code. This code might have been committed seperately, but almost everything is interrelated. 1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitious buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups assocated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistancy problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify it's status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and it's descendents to support an int flag instead of a boolean, so that we can pass down the invalidate bit.
# 59228495	01-Mar-1998	John Dyson <dyson@FreeBSD.org>	Change vfs.ioopt default back to '0'.
# ffc82b0a	28-Feb-1998	John Dyson <dyson@FreeBSD.org>	1) Use a more consistent page wait methodology. 2) Do not unnecessarily force page blocking when paging pages out. 3) Further improve swap pager performance and correctness, including fixing the paging in progress deadlock (except in severe I/O error conditions.) 4) Enable vfs_ioopt=1 as a default. 5) Fix and enable the page prezeroing in SMP mode. All in all, SMP systems especially should show a significant improvement in "snappiness."
# 64d3c7e3	22-Feb-1998	John Dyson <dyson@FreeBSD.org>	Clean-up the vget mechanism by permanently attaching VM objects to vnodes, therefore vget doesn't need to do so anymore. Other minor improvements include the temp free vnode queue obeying the VAGE flag and a printf that warns of to-be-removed code being executed.
# 1b11919b	09-Feb-1998	KATO Takenori <kato@FreeBSD.org>	Fixed vnode interlock handling. Reviewed by: Bruce Evans <bde@zeta.org.au> Tor Egge <Tor.Egge@idi.ntnu.no>
# 303b270b	08-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Staticize.
# 16e3b0b6	07-Feb-1998	KATO Takenori <kato@FreeBSD.org>	When the vp is locked, vget() calls vfs_object_create() with waslocked = TRUE. This change may fix lockmgr panic in umapfs/nullfs. PR: 5634 Reviewed by: "John S. Dyson" <toor@dyson.iquest.net> Suggested by: Bruce Evans <bde@zeta.org.au>
# 0b08f5f7	05-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Back out DIAGNOSTIC changes.
# 95461b45	04-Feb-1998	John Dyson <dyson@FreeBSD.org>	1) Start using a cleaner and more consistant page allocator instead of the various ad-hoc schemes. 2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup. 3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some processor errata, and to minimize redundant processor updating of page tables. 4) Modify pmap_protect so that it can only remove permissions (as it originally supported.) The additional capability is not needed. 5) Streamline read-only to read-write page mappings. 6) For pmap_copy_page, don't enable write mapping for source page. 7) Correct and clean-up pmap_incore. 8) Cluster initial kern_exec pagin. 9) Removal of some minor lint from kern_malloc. 10) Correct some ioopt code. 11) Remove some dead code from the MI swapout routine. 12) Correct vm_object_deallocate (to remove backing_object ref.) 13) Fix dead object handling, that had problems under heavy memory load. 14) Add minor vm_page_lookup improvements. 15) Some pages are not in objects, and make sure that the vm_page.c can properly support such pages. 16) Add some more page deficit handling. 17) Some minor code readability improvements.
# 47cfdb16	04-Feb-1998	Eivind Eklund <eivind@FreeBSD.org>	Turn DIAGNOSTIC into a new-style option.
# d09a16d8	30-Jan-1998	Tor Egge <tegge@FreeBSD.org>	Update freevnodes when adding a vnode to the head of the free list.
# 50ce7ff4	23-Jan-1998	John Dyson <dyson@FreeBSD.org>	Add better support for larger I/O clusters, including larger physical I/O. The support is not mature yet, and some of the underlying implementation needs help. However, support does exist for IDE devices now.
# 2d8acc0f	22-Jan-1998	John Dyson <dyson@FreeBSD.org>	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. 
Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
# 47221757	17-Jan-1998	John Dyson <dyson@FreeBSD.org>	Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management. The system should be much more stable, but all subtle bugs aren't fixed yet.
# 53f6f085	11-Jan-1998	John Dyson <dyson@FreeBSD.org>	Fix another vnode leak.
# 925a3a41	11-Jan-1998	John Dyson <dyson@FreeBSD.org>	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.
# 857d737e	07-Jan-1998	John Dyson <dyson@FreeBSD.org>	Disable io optimizations again, minor bug found, and will be fixed in a few days.
# 95e5e988	05-Jan-1998	John Dyson <dyson@FreeBSD.org>	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. Vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon as reasonably possible.
# 483140ea	29-Dec-1997	John Dyson <dyson@FreeBSD.org>	Add the vnode interlock back around vref.
# 60f8d464	28-Dec-1997	John Dyson <dyson@FreeBSD.org>	Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem with the object cache removal.
# 2be70f79	28-Dec-1997	John Dyson <dyson@FreeBSD.org>	Lots of improvements, including restructuring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.
# 1efb74fb	19-Dec-1997	John Dyson <dyson@FreeBSD.org>	Some performance improvements, and code cleanups (including changing our expensive OFF_TO_IDX to btoc whenever possible.)
# 1cbbd625	14-Dec-1997	Garrett Wollman <wollman@FreeBSD.org>	Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL if one of the new poll types is requested; hopefully this will not break any existing code. (This is done so that programs have a dependable way of determining whether a filesystem supports the extended poll types or not.) The new poll types added are: POLLWRITE - file contents may have been modified POLLNLINK - file was linked, unlinked, or renamed POLLATTRIB - file's attributes may have been changed POLLEXTEND - file was extended Note that the internal operation of poll() means that it is impossible for two processes to reliably poll for the same event (this could be fixed but may not be worth it), so it is not possible to rewrite `tail -f' to use poll at this time.
# cb451ebd	22-Nov-1997	Bruce Evans <bde@FreeBSD.org>	Staticized.
# b1f4a44b	11-Nov-1997	Julian Elischer <julian@FreeBSD.org>	Reviewed by: various. Ever since I first saw the way the mount flags were used I've hated the fact that modes, and events, internal and exported, and short-term and long term flags are all thrown together. Finally it's annoyed me enough.. This patch to the entire FreeBSD tree adds a second mount flag word to the mount struct. it is not exported to userspace. I have moved some of the non exported flags over to this word. this means that we now have 8 free bits in the mount flags. There are another two that might well move over, but which I'm not sure about. The only user visible change would have been in pstat -v, except that davidg has disabled it anyhow. I'd still like to move the state flags and the 'command' flags apart from each other.. e.g. MNT_FORCE really doesn't have the same semantics as MNT_RDONLY, but that's left for another day.
# 4a11ca4e	07-Nov-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused
# dba3870c	26-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	VFS interior redecoration. Rename vn_default_error to vop_defaultop all over the place. Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite. Use vop_null instead of nullop. Move vop_nopoll from vfs_subr.c to vfs_default.c Move vop_sharedlock from vfs_subr.c to vfs_default.c Move vop_nolock from vfs_subr.c to vfs_default.c Move vop_nounlock from vfs_subr.c to vfs_default.c Move vop_noislocked from vfs_subr.c to vfs_default.c Use vop_ebadf instead of *_ebadf. Add vop_defaultop for getpages on master vnode in MFS.
# a1c995b6	12-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Last major round (Unless Bruce thinks of something :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde
# 55166637	11-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Distribute and staticize a lot of the malloc M_* types. Substantial input from: bde
# f7891f9a	11-Oct-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Dike out a weird warning.
# d047b580	26-Sep-1997	Poul-Henning Kamp <phk@FreeBSD.org>	I lost a bit of my change in the last commit, this is more like it. Noticed by: bde
# 87b1940a	25-Sep-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Reduce the target number of vnodes on the freelist from desiredvnodes (usually a couple of thousand) to 25. The measured impact on cache-hits doesn't justify spending memory this way: Target number of free vnodes versus namecache hit rate in % during a make world: 10 98.5316 200 98.5479 500 98.5546 1000 98.5709 3000 98.6006 4000 98.6126
# 00544193	24-Sep-1997	Poul-Henning Kamp <phk@FreeBSD.org>	A couple of handles to tweak, more statistics.
# 514ede09	16-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Fixed gratuitous ANSIisms.
# 7fab7799	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	Provide a 'return true' poll vnode op rather than duplicating the 'do nothing' case all over the various filesystems.
# 557fe2c5	13-Sep-1997	Peter Wemm <peter@FreeBSD.org>	print correct function name in a panic (vop_nolock -> vop_sharedlock)
# 41fadeeb	07-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Removed yet more vestiges of config-time swap configuration and/or cleaned up nearby cruft.
# a910e75c	07-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Removed vestiges of config-time "argument processing" configuration.
# fd9d9ff1	03-Sep-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Hmm, this is hopefully better.
# 7cb22688	03-Sep-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Revert the v_usecount handling in relation to VOP_INACTIVE.
# e4ba6a82	02-Sep-1997	Bruce Evans <bde@FreeBSD.org>	Removed unused #includes.
# a051452a	31-Aug-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Change the 0xdeadb hack to a flag called VDOOMED. Introduce VFREE which indicates that vnode is on freelist. Rename vholdrele() to vdrop(). Create vfree() and vbusy() to add/delete vnode from freelist. Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0) vnodes off the freelist. Generalize vhold()/v_holdcnt to mean "do not recycle". Fix reassignbuf()s lack of use of vhold(). Use vhold() instead of checking v_cache_src list. Remove vtouch(), the vnodes are always vget'ed soon enough after for it to have any measurable effect. Add sysctl debug.freevnodes to keep track of things. Move cache_purge() up in getnewvnodes to avoid race. Decrement v_usecount after VOP_INACTIVE(), put a vhold() on it during VOP_INACTIVE() Unmacroize vhold()/vdrop() Print out VDOOMED and VFREE flags (XXX: should use %b) Reviewed by: dyson
# d3114049	26-Aug-1997	Bruce Evans <bde@FreeBSD.org>	Restored rev.1.92 which was clobbered by the previous commit.
# a5db4bf4	25-Aug-1997	John Dyson <dyson@FreeBSD.org>	Back out some incorrect changes that were worse than the original bug.
# 89721f6f	21-Aug-1997	John Dyson <dyson@FreeBSD.org>	This is a trial improvement for the vnode reference count while on the vnode free list problem. Also, the vnode age flag is no longer used by the vnode pager. (It is actually incorrect to use then.) Constructive feedback welcome -- just be kind.
# b1037dcd	21-Aug-1997	Bruce Evans <bde@FreeBSD.org>	#include <machine/limits.h> explicitly in the few places that it is required.
# 57bf258e	16-Aug-1997	Garrett Wollman <wollman@FreeBSD.org>	Fix all areas of the system (or at least all those in LINT) to avoid storing socket addresses in mbufs. (Socket buffers are the one exception.) A number of kernel APIs needed to get fixed in order to make this happen. Also, fix three protocol families which kept PCBs in mbufs to not malloc them instead. Delete some old compatibility cruft while we're at it, and add some new routines in the in_cksum family.
# 23be6be8	04-Aug-1997	John Dyson <dyson@FreeBSD.org>	Fix a problem with the vfs vnode caching that it doesn't grow quickly enough and can cause some strange performance problems. Specifically, at or near startup time is when the problem is worst. To reproduce the problem, run "lat_syscall stat" from the alpha lmbench code right after bootup. A positive side effect of this mod is that the name cache can be set to grow again by sysctl. A noticeable positive performance impact is realized due to a larger namecache being available as needed (or tuned.)
# f6b4c285	17-Jul-1997	Doug Rabson <dfr@FreeBSD.org>	Merge WebNFS support from NetBSD Obtained from: NetBSD
# 3c631446	21-Jun-1997	John Dyson <dyson@FreeBSD.org>	Remove a window during running down a file vnode. Also, the OBJ_DEAD flag wasn't being respected during vref(), et. al. Note that this isn't the eventual fix for the locking problem. Fine grained SMP in the VM and VFS code will require (lots) more work.
# 2e58c0f8	09-Jun-1997	David Greenman <dg@FreeBSD.org>	Disabled the kern.vnode sysctl variable. It's causing system crashes on large systems and needs to be re-thinked or removed wholesale.
# 8670684a	06-May-1997	Poul-Henning Kamp <phk@FreeBSD.org>	Fix a race condition that did, after all, exist. Reviewed by: phk Submitted by: dfr
# b15a966e	04-May-1997	Poul-Henning Kamp <phk@FreeBSD.org>	1. Add a {pointer, v_id} pair to the vnode to store the reference to the ".." vnode. This is cheaper storagewise than keeping it in the namecache, and it makes more sense since it's a 1:1 mapping. 2. Also handle the case of "." more intelligently rather than stuff the namecache with pointless entries. 3. Add two lists to the vnode and hang namecache entries which go from or to this vnode. When cleaning a vnode, delete all namecache entries it invalidates. 4. Never reuse namecache entries, malloc new ones when we need it, free old ones when they die. No longer a hard limit on how many we can have. 5. Remove the upper limit on namelength of namecache entries. 6. Make a global list for negative namecache entries, limit their number to a sysctl'able (debug.ncnegfactor) fraction of the total namecache. Currently the default fraction is 1/16th. (Suggestions for better default wanted!) 7. Assign v_id correctly in the face of 32bit rollover. 8. Remove the LRU list for namecache entries, not needed. Remove the #ifdef NCH_STATISTICS stuff, it's not needed either. 9. Use the vnode freelist as a true LRU list, also for namecache accesses. 10. Reuse vnodes more aggressively but also more selectively, if we can't reuse, malloc a new one. There is no longer a hard limit on their number, they grow to the point where we don't reuse potentially usable vnodes. A vnode will not get recycled if still has pages in core or if it is the source of namecache entries (Yes, this does indeed work :-) "." and ".." are not namecache entries any longer...) 11. Do not overload the v_id field in namecache entries with whiteout information, use a char sized flags field instead, so we can get rid of the vpid and v_id fields from the namecache struct. Since we're linked to the vnodes and purged when they're cleaned, we don't have to check the v_id any more. 12. NFS knew about the limitation on name length in the namecache, it shouldn't and doesn't now. Bugs: The namecache statistics no longer includes the hits for ".." and "." hits. Performance impact: Generally in the +/- 0.5% for "normal" workstations, but I hope this will allow the system to be selftuning over a bigger range of "special" applications. The case where RAM is available but unused for cache because we don't have any vnodes should be gone. Future work: Straighten out the namecache statistics. "desiredvnodes" is still used to (bogusly ?) size hash tables in the filesystems. I have still to find a way to safely free unused vnodes back so their number can shrink when not needed. There is a few uses of the v_id field left in the filesystems, scheduled for demolition at a later time. Maybe a one slot cache for unused namecache entries should be implemented to decrease the malloc/free frequency.
# 82b8e119	29-Apr-1997	John Dyson <dyson@FreeBSD.org>	Staticize an unnecessarily global function: vputrele. Submitted by: Michael Hancock <michaelh@cet.co.jp>
# 5f61c81d	25-Apr-1997	Peter Wemm <peter@FreeBSD.org>	copyin the export network mask to the correct variable. Submitted by: Mike Hibler <mike@marker.cs.utah.edu>, PR#3380
# de15ef6a	04-Apr-1997	Doug Rabson <dfr@FreeBSD.org>	Add a function vop_sharedlock which is a copy of vop_nolock without the implementation #ifdef'ed out. This can be used for now by NFS. As soon as all the other filesystems' locking is fixed, this can go away. Print the vnode address in vprint for easier debugging.
# 0f1adf65	01-Apr-1997	Bruce Evans <bde@FreeBSD.org>	Use OID_AUTO instead of magic number for the Lite2 sysctl debug.busyprt. Removed declaration of vfs_unmountroot() again. Staticized vgonel().
# 2f2160da	04-Mar-1997	David Greenman <dg@FreeBSD.org>	Fixed splbio problems in vinvalbuf. Closes PR#2875, although fixed differently by me.
# 4a8b9660	04-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Attach vfs_sysctl() one level lower so that only the levels below VFS_GENERIC aren't done in the FreeBSD way. The previous commit broke the nfs sysctls.
# 3a76a594	02-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Merged Lite2's vfs_sysctl(). It doesn't fit very well into FreeBSD's (phk's) sysctl framework, and I needed special code to disambiguate the VFS_GENERIC node from the VFS_VFSCONF leaf, so I only converted the leaves to the FreeBSD framework. The error handling isn't quite right. CSRGS's sysctls seem to return ENOTDIR too much and FreeBSD's sysctls don't agree with the man page.
# dc91a89e	02-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Restored some pre-Lite2-merge source-level compatibility to the mount() and getvfsbyname() interfaces. The new interfaces are now hidden from applications unless _NEW_VFSCONF is defined. The new vfsconf interfaces don't work yet.
# a896f025	02-Mar-1997	Bruce Evans <bde@FreeBSD.org>	Moved vfs sysctls to where Lite2 put them. No code changes yet.
# b98afd0d	27-Feb-1997	Bruce Evans <bde@FreeBSD.org>	Fixed Lite2 merge of spechash simplelocking. It was misplaced in checkalias() and missing in vfinddev() and vcount().
# fd7f690f	26-Feb-1997	John Dyson <dyson@FreeBSD.org>	Fix the previous simple_lock fix breakage in the combined vput/vrele routine. Fix a panic message. Fix the vop_nounlock routine so that "special" filesystems that use it work correctly.
# 0d955f71	26-Feb-1997	John Dyson <dyson@FreeBSD.org>	Fix the simple_lock problem with the physical I/O buffer code, and also fix the missing simple_unlock in vrele, and improve vrele/vput by merging them into one routine. BDE pointed these problems out.
# 7c1557c4	26-Feb-1997	Bruce Evans <bde@FreeBSD.org>	Fixed unmounting of the root fs. vfs_unmountroot() wasn't fully updated to do Lite2 locking and vfs_unmountall() wasn't as simple as the Lite2 version.
# c35e283a	25-Feb-1997	Bruce Evans <bde@FreeBSD.org>	Merged some missing locking from Lite2: - getnewvnode() and vref() were missing one simple_unlock() each. - the Lite2 locking changes weren't merged at all in printlockedvnodes() or sysctl_vnode(). Merging these undid some KNF style regressions.
# 6875d254	22-Feb-1997	Peter Wemm <peter@FreeBSD.org>	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.
# 996c772f	09-Feb-1997	John Dyson <dyson@FreeBSD.org>	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>
# 5131d64e	16-Jan-1997	Bruce Evans <bde@FreeBSD.org>	Removed option EXTRAVNODES. All versions of FreeBSD-2.x have a sysctl variable `kern.maxvnodes' which gives much better control over vnode allocation than EXTRAVNODES (except in -current between 1995/10/28 and 1996/11/12, kern.maxvnodes was read-only and thus useless).
# 1130b656	14-Jan-1997	Jordan K. Hubbard <jkh@FreeBSD.org>	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.
# 8b612c4b	28-Dec-1996	John Dyson <dyson@FreeBSD.org>	This commit is the embodiment of some VFS read clustering improvements. Firstly, now our read-ahead clustering is on a file descriptor basis and not on a per-vnode basis. This will allow multiple processes reading the same file to take advantage of read-ahead clustering. Secondly, there previously was a problem with large reads still using the ramp-up algorithm. Of course, that was bogus, and now we read the entire "chunk" off of the disk in one operation. The read-ahead clustering algorithm should use less CPU than the previous also (I hope :-)). NOTE: THAT LKMS MUST BE REBUILT!!!
# b83ddf9c	12-Nov-1996	Bruce Evans <bde@FreeBSD.org>	Restored writability of kern.maxvnodes. It was broken a year ago in rev.1.29 of kern_sysctl.c. Should be in 2.2.
# 19060a3a	28-Oct-1996	Poul-Henning Kamp <phk@FreeBSD.org>	init_main.c: pass -d to init if DEVFS_ROOT kern_conf.c: gd driver is a disk. vfs_subr.c: include opt_devfs.h
# 0082fb46	17-Oct-1996	Jordan K. Hubbard <jkh@FreeBSD.org>	I'm not sure why, but Netcon's TFS filesystem code doesn't want to add free vnodes back to the freelist. They must do their own vnode management. Anyway, this change is only activated with their filesystem and doesn't affect anyone else. Whoops, forgot the submitted-by lines in my previous commits too.. :-( Submitted-By: Tony Ardolino <tony@netcon.com>
# ad980522	16-Oct-1996	John Dyson <dyson@FreeBSD.org>	Clean up the rundown of the object backing a vnode. This should fix NFS problems associated with forcible dismounts.
# a8f42fa9	27-Sep-1996	John Dyson <dyson@FreeBSD.org>	Correct vget by removing a window where a vnode can potentially go away.
# 030e2e9e	19-Sep-1996	Nate Williams <nate@FreeBSD.org>	In sys/time.h, struct timespec is defined as: /* Structure defined by POSIX.4 to be like a timeval. */ struct timespec { time_t ts_sec; /* seconds */ long ts_nsec; /* and nanoseconds */ }; The correct names of the fields are tv_sec and tv_nsec. Reminded by: James Drobina <jdrobina@infinet.com>
# 6476c0d2	21-Aug-1996	John Dyson <dyson@FreeBSD.org>	Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistent and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices is simpler and more correct. This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc. Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility. Reviewed by: davidg
# 619594e8	15-Aug-1996	John Dyson <dyson@FreeBSD.org>	Certain vnode buffer list operations were not being spl protected, and they needed to be. Brelse for example can be called at interrupt level, and the buffer list operations were not being protected from it.
# 8c2ff396	30-Jul-1996	Bruce Evans <bde@FreeBSD.org>	Only use the special bdevvp() for DEVFS if DEVFS_ROOT is defined. This makes option DEVFS safe to use again (although mounting devfs is unsafe).
# e83cf165	24-Jul-1996	Poul-Henning Kamp <phk@FreeBSD.org>	DEVFS needs a special bdevvp().
# cba2a7c6	12-Jul-1996	Bruce Evans <bde@FreeBSD.org>	Staticized a few variables. Fixed warnings about unused variables.
# 114a8cff	30-May-1996	Peter Wemm <peter@FreeBSD.org>	Add an option "EXTRA_VNODES" to cause an extra number of vnode structures to be allocated at boot time. This is an expensive option, as they consume physical ram and are not pageable etc. In certain situations, this kind of option is quite useful, especially for news servers that access a large number of directories at random and torture the name cache. Defining 5000 or 10000 extra vnodes should cut down the amount of vnode recycling somewhat, which should allow better name and directory caching etc. This is a "your mileage may vary" option, with no real indication of what works best for your machine except trial and error. Too many will cost you ram that you could otherwise use for disk buffers etc. This is based on something John Dyson mentioned to me a while ago.
# edbfedac	11-Mar-1996	Peter Wemm <peter@FreeBSD.org>	Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all files are off the vendor branch, so this should not change anything. A "U" marker generally means that the file was not changed in between the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally means that there was a change. [note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]
# e5fadd05	08-Mar-1996	John Dyson <dyson@FreeBSD.org>	Put the "free vnode isn't" check back in the right place.
# bd7e5f99	18-Jan-1996	John Dyson <dyson@FreeBSD.org>	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do alot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.
# 0e41ee30	04-Jan-1996	Garrett Wollman <wollman@FreeBSD.org>	Convert DDB to new-style option.
# 864ef7d1	02-Jan-1996	David Greenman <dg@FreeBSD.org>	Moved the #ifdef DIAGNOSTIC in vrele() so that the check for negative v_usecount is always performed and only the call to vprint is conditional.
# 27a0b398	17-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Staticize. Unstaticize a function in scsi/scsi_base that was used, with an undocumented option. My last count on the LINT kernel shows: Total symbols: 3647 unref symbols: 463 undef symbols: 4 1 ref symbols: 1751 2 ref symbols: 485 Approaching the pain threshold now.
# a316d390	10-Dec-1995	John Dyson <dyson@FreeBSD.org>	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.
# efeaf95a	06-Dec-1995	David Greenman <dg@FreeBSD.org>	Untangled the vm.h include file spaghetti.
# 65d0bc13	06-Dec-1995	Poul-Henning Kamp <phk@FreeBSD.org>	A couple of minor tweaks to the sysctl stuff.
# 98d93822	02-Dec-1995	Bruce Evans <bde@FreeBSD.org>	Completed function declarations and/or added prototypes.
# 2d0b1d70	29-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	A test was backwards. Noticed by: Cheng, Hsiao-Yang <sycheng@cis.ufl.edu>
# 4b2af45f	19-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Mega commit for sysctl. Convert the remaining sysctl stuff to the new way of doing things. the devconf stuff is the reason for the large number of files. Cleaned up some compiler warnings while I were there.
# 986f4ce7	16-Nov-1995	Bruce Evans <bde@FreeBSD.org>	Fixed support for DIAGNOSTIC option. SYSCTL_INT() depends on kernel.h.
# 395e6735	14-Nov-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Change some of the debug sysctl vars. The semantics of these will change.
# ceba6236	10-Nov-1995	Bruce Evans <bde@FreeBSD.org>	Fixed type of vfs_free_netcred(). Removed redundant declaration of insmntque().
# f57e6547	09-Nov-1995	Bruce Evans <bde@FreeBSD.org>	Introduced a type `vop_t' for vnode operation functions and used it 1138 times (:-() in casts and a few more times in declarations. This change is null for the i386. The type has to be `typedef int vop_t(void *)' and not `typedef int vop_t()' because `gcc -Wstrict-prototypes' warns about the latter. Since vnode op functions are called with args of different (struct pointer) types, neither of these function types is any use for type checking of the arg, so it would be preferable not to use the complete function type, especially since using the complete type requires adding 1138 casts to avoid compiler warnings and another 40+ casts to reverse the function pointer conversions before calling the functions.
# 5e527f65	06-Nov-1995	John Dyson <dyson@FreeBSD.org>	This is a modification missed by me in the msync fixes a few days ago.
# e887950a	28-Oct-1995	Bruce Evans <bde@FreeBSD.org>	Call vfs_unbusy() before error returns from sysctl_vnode(). This fixes PR 795. Set the size before one error return from sysctl_vnode() the same as before the other. The caller might want to know about the amount successfully read although the current caller doesn't.
# 430179f0	25-Aug-1995	Bruce Evans <bde@FreeBSD.org>	Don't compile the diagnostic functions vhold() and holdrele() unless DIAGNOSTIC is defined.
# 628641f8	11-Aug-1995	David Greenman <dg@FreeBSD.org>	Converted mountlist to a CIRCLEQ. Partially obtained from: 4.4BSD-Lite2
# 24a1cce3	13-Jul-1995	David Greenman <dg@FreeBSD.org>	NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has escentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. 
The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure escentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non- standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. 
Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 14) (dyson) Sharing maps have been removed. It's marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintain- ability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
# e0dca2b9	07-Jul-1995	David Greenman <dg@FreeBSD.org>	Improve negative usecount diagnostic a little.
# aa2cabb9	27-Jun-1995	David Greenman <dg@FreeBSD.org>	1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.
# 6acceb40	27-Jun-1995	Bruce Evans <bde@FreeBSD.org>	Pass the correct nonblocking flag to VOP_CLOSE() in vclean(). VOP_CLOSE() takes `F' (file) flags, not `IO' flags. At least that's what close() passes. I previously fixed ttylclose() to check FNONBLOCK instead of IO_NDELAY. This broke the call from vclean() and cleaning of ptys sometimes deadlocked.
# 61f5d510	21-May-1995	David Greenman <dg@FreeBSD.org>	Changes to fix the following bugs: 1) Files weren't properly synced on filesystems other than UFS. In some cases, this lead to lost data. Most likely would be noticed on NFS. The fix is to make the VM page sync/object_clean general rather than in each filesystem. 2) Mixing regular and mmaped file I/O on NFS was very broken. It caused chunks of files to end up as zeroes rather than the intended contents. The fix was to fix several race conditions and to kludge up the "b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention to page modifications that occurred via the mmapping. Reviewed by: David Greenman Submitted by: John Dyson
# 15f1b096	11-May-1995	David Greenman <dg@FreeBSD.org>	Increased ratio of allowed vnodes on freelist to 1/4th of the total. This is more representative of worst case situations of 4 files/directory. (If that last sentence doesn't make any sense, I'm not surprised. It's rather complicated how this all fits together....). This should fix a problem that Ed Hudson has been complaining about where directories with lots of symlinks could cause excessive disk I/O.
# 1a477b0c	16-Apr-1995	David Greenman <dg@FreeBSD.org>	Changed #ifdef around printlockedvnodes() from DEBUG to DDB.
# 213fd1b6	09-Apr-1995	David Greenman <dg@FreeBSD.org>	Changes from John Dyson and myself: Fixed remaining known bugs in the buffer IO and VM system. vfs_bio.c: Fixed some race conditions and locking bugs. Improved performance by removing some (now) unnecessary code and fixing some broken logic. Fixed process accounting of # of FS outputs. Properly handle NFS interrupts (B_EINTR). (various) Replaced calls to clrbuf() with calls to an optimized routine call vfs_bio_clrbuf(). (various FS sync) Sync out modified vnode_pager backed pages. ffs_vnops.c: Do two passes: Sync out file data first, then indirect blocks. vm_fault.c: Fixed deadly embrace caused by acquiring locks in the wrong order. vnode_pager.c: Changed to use buffer I/O system for writing out modified pages. This should fix the problem with the modification date previous not getting updated. Also dramatically simplifies the code. Note that this is going to change in the future and be implemented via VOP_PUTPAGES(). vm_object.c: Fixed a pile of bugs related to cleaning (vnode) objects. The performance of vm_object_page_clean() is terrible when dealing with huge objects, but this will change when we implement a binary tree to keep the object pages sorted. vm_pageout.c: Fixed broken clustering of pageouts. Fixed race conditions and other lockup style bugs in the scanning of pages. Improved performance.
# b9461930	20-Mar-1995	David Greenman <dg@FreeBSD.org>	Fixed vinvalbuf() to work like NFS wants it to. The previous code wouldn't flush pages in the vm object if V_SAVE was true.
# 62b71ed6	20-Mar-1995	David Greenman <dg@FreeBSD.org>	Don't gain/lose a reference to the object when yanking its pages in vinvalbuf()...it will cause vnode locking problems in vm_object_terminate, and isn't necessary anyway.
# ff769afc	19-Mar-1995	David Greenman <dg@FreeBSD.org>	Don't attempt to sync pages in the V_SAVE case of vinvalbuf; doing so can lead to a deadlock. Just let the VM system deal with it.
# b5e8ce9f	16-Mar-1995	Bruce Evans <bde@FreeBSD.org>	Add and move declarations to fix all of the warnings from `gcc -Wimplicit' (except in netccitt, netiso and netns) and most of the warnings from `gcc -Wnested-externs'. Fix all the bugs found. There were no serious ones.
# 3d2a8cf3	11-Mar-1995	David Greenman <dg@FreeBSD.org>	Added a comment.
# f8a0b2dd	10-Mar-1995	David Greenman <dg@FreeBSD.org>	Reorganized an if() expression for efficiency.
# fbd6e6c9	09-Mar-1995	Poul-Henning Kamp <phk@FreeBSD.org>	Clean up and improve the namecache. 1. We always keep one 16th of the vnodes on the freelist, so that the namecache doesn't get trashed. It used to be that it wasn't a problem, but the only vnodes getting released these days are directories and things which gets forced out of the VM/cache. The latter is not numerous enough to keep the pool of vnodes needed for the namecache sufficiently big. 2. Purge invalid entries in the namecache as soon as we notice them. This avoids a stale entry pushing out a valid entry on the LRU list. 3. Speed up the lookup in the namecache by avoid a special case branch. 4. Make the cache purge routines do the thing they're supposed to, and in a decently efficient manner. 5. Make the size of the namecache follow the number of vnodes, so that we can always point to all the vnodes we have in core. 6. Readability has gone way up. 7. Added a "options NCH_STATISTICS" feature that will gather more detailed statistics on the performance of the namecache. Reviewed by: davidg (cvs is dumping core on me :-( )
# acc835fd	07-Mar-1995	David Greenman <dg@FreeBSD.org>	Put VAGE vnodes at the head of the free list.
# 7500ed1d	27-Feb-1995	David Greenman <dg@FreeBSD.org>	Backed out previous change. I forgot (for about the fourth time) that v_rdev is a #define which is dereferenced through v_specinfo->si_rdev, and that isn't initialized until later in checkalias().
# f9ceb7c7	26-Feb-1995	David Greenman <dg@FreeBSD.org>	Initialize v_rdev in getnewvnode() - it appears that some filesystems may not properly initialize this field in all cases, and this would result in very anti-social behavior (overwriting on some other random device/location). Submitted by: John Dyson
# a3a8bb29	22-Feb-1995	David Greenman <dg@FreeBSD.org>	vfs_cluster.c: Various more tweaks from John Dyson to improve read ahead calculations. vfs_subr.c: Only wakeup if numoutput is 0 in vwakeup(). Submitted by: John Dyson
# 480dff54	10-Jan-1995	David Greenman <dg@FreeBSD.org>	Fixed some formatting weirdness that I overlooked in the previous commit.
# 0d94caff	09-Jan-1995	David Greenman <dg@FreeBSD.org>	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we don't have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a separate submap for temporary exec string space and another one to contain process upages. 
This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
# 602d2b48	22-Dec-1994	David Greenman <dg@FreeBSD.org>	Protect vnode buffer chain manipulation with splbio to prevent list corruption.
# 82478919	06-Oct-1994	David Greenman <dg@FreeBSD.org>	Use tsleep() rather than sleep so that 'ps' is more informative about the wait.
# 8e58bf68	05-Oct-1994	David Greenman <dg@FreeBSD.org>	Stuff object into v_vmdata rather than pager. Not important which at the moment, but will be in the future. Other changes mostly cosmetic, but are made for future VMIO considerations. Submitted by: John Dyson
# 797f2d22	02-Oct-1994	Poul-Henning Kamp <phk@FreeBSD.org>	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.
# bb56ec4a	25-Sep-1994	Poul-Henning Kamp <phk@FreeBSD.org>	While in the real world, I had a bad case of being swapped out for a lot of cycles. While waiting there I added a lot of the extra ()'s I have, (I have never used LISP to any extent). So I compiled the kernel with -Wall and shut up a lot of "suggest you add ()'s", removed a bunch of unused var's and added a couple of declarations here and there. Having a lap-top is highly recommended. My kernel still runs, yell at me if your kernel breaks.
# 1cdeb653	29-Aug-1994	David Greenman <dg@FreeBSD.org>	"bogus" fixes from 1.1.5 to work around some cache coherency problems.
# e0c02154	23-Aug-1994	David Greenman <dg@FreeBSD.org>	Initialized v_writecount.
# 4f5a3fef	22-Aug-1994	David Greenman <dg@FreeBSD.org>	print "BUSY" instead of error number if filesystem was busy during vfs_unmountall() - this is the most common case. If it was a different error, then print the error number.
# e0e9c421	20-Aug-1994	David Greenman <dg@FreeBSD.org>	Implemented filesystem clean bit via: machdep.c: Changed printf's a little and call vfs_unmountall() if the sync was successful. cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c: Allow dismount of root FS. It is now disallowed at a higher level. vfs_conf.c: Removed unused rootfs global. vfs_subr.c: Added new routines vfs_unmountall and vfs_unmountroot. Filesystems are now dismounted if the machine is properly rebooted. ffs_vfsops.c: Toggle clean bit at the appropriate places. Print warning if an unclean FS is mounted. ffs_vfsops.c, lfs_vfsops.c: Fix bug in selecting proper flags for VOP_CLOSE(). vfs_syscalls.c: Disallow dismounting root FS via umount syscall.
# f23b4c91	18-Aug-1994	Garrett Wollman <wollman@FreeBSD.org>	Fix up some sloppy coding practices: - Delete redundant declarations. - Add -Wredundant-declarations to Makefile.i386 so they don't come back. - Delete sloppy COMMON-style declarations of uninitialized data in header files. - Add a few prototypes. - Clean up warnings resulting from the above. NB: ioconf.c will still generate a redundant-declaration warning, which is unavoidable unless somebody volunteers to make `config' smarter.
# 3c4dd356	02-Aug-1994	David Greenman <dg@FreeBSD.org>	Added $Id$
# 26f9a767	25-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman
# df8bae1d	24-May-1994	Rodney W. Grimes <rgrimes@FreeBSD.org>	BSD 4.4 Lite Kernel Sources