History log of /freebsd-current/sys/kern/vfs_subr.c
Revision Date Author Comments
# 9530182e 25-Dec-2023 Jason A. Harmening <jah@FreeBSD.org>

VFS: update VOP_FSYNC() debug check to reflect actual locking policy

Shared vs. exclusive locking is determined not by MNT_EXTENDED_SHARED
but by MNT_SHARED_WRITES (although there are several places that
ignore this and simply always use an exclusive lock). Also add a
comment on the possible difference between VOP_GETWRITEMOUNT(vp)
and vp->v_mount on this path.

Found by local testing of unionfs atop ZFS with DEBUG_VFS_LOCKS.

MFC after: 2 weeks
Reviewed by: kib, olce
Differential Revision: https://reviews.freebsd.org/D43816


# b068bb09 07-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

Add vnode_pager_clean_{a,}sync(9)

Bump __FreeBSD_version for ZFS use.

Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43356


# 2d33ad48 31-Dec-2023 Konstantin Belousov <kib@FreeBSD.org>

vtruncbuf: improve the check for meta buffer

Revision e99215a614675 reorganized the code in vtruncbuf(), and moved
the logic to flush meta buffers into a dedicated loop. While doing it,
the condition was changed from bp->b_lblkno < 0 (to handle) into
bp->b_lblkno > 0 (to skip), which causes buffer at lblkno to needlessly
flush.

Reviewed by: chs, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43261


# 4c41d10f 31-Dec-2023 Konstantin Belousov <kib@FreeBSD.org>

vtruncbuf: add a comment explaining the purpose of the loop

Reviewed by: chs, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43261


# 27f4eda3 04-Jan-2024 Mark Johnston <markj@FreeBSD.org>

vfs: Simplify vrefact()

refcount_acquire() returns the old value, just use that. No functional
change intended.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43255


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 0c5cd045 01-Nov-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove majority of stale commentary about free list

There is no "free list" for a long time now.

While here slightly tidy up affected comments in other ways.

Note that the "free vnode" term is a misnomer at best and will also need
to get sorted out.


# 3943698c 20-Oct-2023 Kirk McKusick <mckusick@FreeBSD.org>

Minor sysctl description cleanup.

No functional change.

Agreed-by: Mateusz Guzik


# 37544d97 12-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: convert recycles_count and recycles_free_count to mere u_long

Only vnlru ever updates them.

This also removes recycles_count updates from hand-rolled debug vnode
recycling via sysctl.

Sponsored by: Rubicon Communications, LLC ("Netgate")


# a92fc312 12-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: count recycles by vnlru and by vn_alloc separately

Sponsored by: Rubicon Communications, LLC ("Netgate")


# bb679b0c 11-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: count calls to uma_reclaim in vnlru


# 281a9715 11-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add max_vnlru_free to the vfs.vnode.vnlru tree

While here rename the var internally.


# 054f45e0 11-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: further speed up continuous free vnode recycle

The primary bottleneck *was* vnode_list mtx, which got artificially
worsened due to the following work done with the lock held:
1. the global heavily modified numvnodes counter was being read,
inducing massive cache line ping pong
2. should the value fit limits (which it normally did) there would be an
avoidable write to vn_alloc_cyclecount, which is being read outside
of the lock, once more inducing traffic

But if vn_alloc_cyclecount is 0, which it normally is even when facing
vnode shortage, there is no need to check numvnodes nor set it to 0 again.

Another problem was numvnodes adjustment (which made the locked read
much worse). While it fundamentally does not scale as it is not
distributed in any fashion, it was avoidably slow. When bumping over the
vnode limit, it would be modified with atomics 3 times: inc + dec to
backpedal in vn_alloc, then final inc in vn_alloc_hard.

One can let some slop persist over calls to vnlru_free instead.

In principle each thread in the system could get here and bump it, so a
limit is put in place to keep things sane.

Bench setup same as in prior commits: zfs, 20 separate directory trees
each with 1 million files in total and 20 find(1) processes stating them
in parallel (one per each tree).

Total run time (in seconds) goes down as follows:
vnode limit 8388608 400000
before ~20 ~35
after ~8 ~15

With this in place the primary bottleneck is now ZFS.

Sponsored by: Rubicon Communications, LLC ("Netgate")


# a4f753e8 11-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: don't recycle transiently excess vnodes

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 90a008e9 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prefix regular vnlru with a special case for free vnodes

Works around severe performance problems in certain corner cases, see
the commentary added.

Modifying vnlru logic has proven rather error prone in the past and a
release is near, thus take the easy way out and fix it without having to
dig into the current machinery.


# 23ef25d2 10-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: consult freevnodes in vnlru_kick_cond

If the count is high enough there is no point trying to produce more.
Not going there reduces traffic on the vnode_list mtx.

This further shaves total real time in a test mentioned in:
74be676d87745eb7 ("vfs: drop one vnode list lock trip during vnlru free
recycle") -- 20 instances of find each creating 1 million vnodes, while
total limit is set to 400k.

Time goes down from ~41 to ~35 seconds.

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 1bf55a73 10-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: be less eager to call uma_reclaim(UMA_RECLAIM_DRAIN)

In face of vnode shortage the count very easily can go few units above
the limit before going back down.

Calling uma_reclaim results in massive amount of work which in this case
is not warranted.

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 8733bc27 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: don't provoke recycling non-free vnodes without a good reason

If the total number of free vnodes is at or above target, there is no
point creating more of them.

Tested by: pho (in a bigger patch)


# 9080190b 16-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: count how many times vnlru got woken up due to vnode shortage


# ef89b78b 16-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stabilize freevnodes_old

In face of parallel callers.


# 509d843a 16-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: s/u_long vstir/bool vstir/


# d3e64789 15-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: group vnode-related sysctls under vfs.vnode

Instead of having things scattered through vfs, debug and kern trees.

Old names remain for compatibility.

Sample output of "sysctl vfs.vnode":
vfs.vnode.vnlru.failed_runs: 0
vfs.vnode.vnlru.recycles_free: 0
vfs.vnode.vnlru.recycles: 0
vfs.vnode.stats.alloc_sleeps: 0
vfs.vnode.stats.free: 1310
vfs.vnode.stats.skipped_requeues: 0
vfs.vnode.stats.created: 1686
vfs.vnode.stats.count: 1641
vfs.vnode.param.wantfree: 2097152
vfs.vnode.param.limit: 8388608


# 2a689cad 16-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire kern.minvnodes

It was marked as legacy in 2005.


# 03bfee17 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use vnlru_read_freevnodes for the freevnodes sysctl

For a more accurate result.


# ba5dc166 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire vnlru_under_unlocked

It only looks at the centralized value which in corner cases can end up
being negative.


# 9dc0c983 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix stale comment about freevnodes management


# 76f11537 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: don't kick vnlru if it is already running

Further shaves some lock trips.


# 74be676d 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop one vnode list lock trip during vnlru free recycle

vnlru_free_impl would take the lock prior to returning even though most
frequent caller does not need it.

Unsurprisingly vnode_list mtx is the primary bottleneck when recycling
and avoiding the useless lock trip helps.

Setting maxvnodes to 400000 and running 20 parallel finds each with a
dedicated directory tree of 1 million vnodes in total:
before: 4.50s user 1225.71s system 1979% cpu 1:02.14 total
after: 4.20s user 806.23s system 1973% cpu 41.059 total

That's 34% reduction in total real time.

With this the block *remains* the primary bottleneck when running on
ZFS.


# 712806fc 24-Aug-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retried++ -> retried = true for the boolean

No real changes.

Noted by: rpokala


# c1d85ac3 23-Aug-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: try harder to find free vnodes when recycling

The free vnode marker can slide past eligible entries.

Artificially reducing vnode limit to 300k and spawning 104 workers each
creating a million files results in all of them trying to recycle, which
often fails when it should not have to.

Because of the excessive traffic in this scenario, the trylock to
requeue is virtually guaranteed to fail, meaning nothing gets pushed
forward.

Since no vnodes were found, the most unfortunate sleep for 1 second is
induced (see vn_alloc_hard, the "vlruwk" msleep).

Without the fix the machine is mostly idle with almost everyone stuck
off CPU waiting for the sleep to finish. With the fix it is busy
creating files.

Unrelated to the above problem the marker could have landed in a
similarly problematic spot for because of any failure in vtryrecycle.

Originally reported as poudriere builders stalling in a vnode-count
restricted setup.

Fixes: 138a5dafba31 ("vfs: trylock vnode requeue")
Reported by: Mark Millard


# 64e881f2 18-Aug-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: track how many times vn_alloc blocked on hitting the vnode limit


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# 9c3bfe2a 10-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

Revert "VFS: Remove VV_READLINK flag" and "fdescfs: improve linrdlnk mount option"

This reverts commits 4a402dfe0bc44770c9eac6e58a501e4805e29413 and
3bffa2262328e4ff1737516f176107f607e7bc76.

The fix will be implemented in somewhat different manner. The semantic
adjustment is incompatible with linuxolator expectations.

Reported and reviewed by: dchagin
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40969


# ba8cc6d7 12-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use __enum_uint8 for vtype and vstate

This whacks hackery around only reading v_type once.

Bump __FreeBSD_version to 1400093


# 4a402dfe 21-Jun-2023 Konstantin Belousov <kib@FreeBSD.org>

VFS: Remove VV_READLINK flag

since its only reason to exist is removed.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40700


# 2544b8e0 28-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: Rename vfs_emptydir() to vn_dir_check_empty()

No functional change. While here, adapt comments to style(9).

Reviewed by: kib
MFC after: 1 week


# 6450e7bb 22-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: Fix "emptydir" mount option

Fix vfs_emptydir(). It would consider directories containing directories
with name of the form 'X.' (X being any authorized byte) as empty. Also,
it would cause VOP_READDIR() to return an error on directories
containing enough whiteouts. While here, use a more decently sized
buffer as done elsewhere.

Remove ad-hoc iteration on the directory's content and instead use the
newly exported vn_dir_next_dirent() function (this is what fixes the
second problem mentioned above).

PR: 270988
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39775


# 7aeea73e 16-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

syncer vnode: add VOP_GETWRITEMOUNT() definition explicitly

Since syncer vnode vector does not provide a fallback to the default
one, its VOP_GETWRITEMOUNT() implementation implicitly returned
EOPNOTSUPP, which means that syncer ignored suspension.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# d8a09662 15-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

sync_vnode(): add assert to check vn_start_write() correctness

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# c53e990b 10-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

DEBUG_VFS_LOCKS: restore diagnostic for the witness use case

Reviewed by: jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39477


# 7b6fe242 08-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

DEBUG_VFS_LOCKS: use witness if available

The assert_vop_locked messages are ignored, and file/line information
is not too useful. Fixing this without changing both witness and VFS
asserts KPIs is not possible.

Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39464


# 02e6e8d2 07-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: extend vn_printf with vop vector


# 26b96487 07-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: more informative panic for missing fplookup ops


# f87a9f51 05-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: validate that a mount point with FPLOOKUP has vop_fplookup ops


# e237e2ba 03-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: only allow doomed vnodes to return EOPNOTSUPP for fplookup vops

This helps asserting that they are provided by filesystems indicating
they do it.


# 0baef43e 06-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add missing vop_fplookup ops to syncer


# 8495fa49 06-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: whack spurious comments from syncer's vop_vector


# 138a5daf 20-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: trylock vnode requeue

The quasi-LRU still gets in the way for example when doing an
incremental bzImage build, with vnode_list lock being at the
top of the profile. Further damage control the problem by trylocking.

Note the entire mechanism desperately wants to be reaped out in favor
of something(tm) which both scales in a multicore setting and provides
sensible replacement policy.

With this change everything vfs almost disappears from the on CPU
flamegraph, what is left is tons of contention in the VM.


# 245767c2 25-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: flip deferred_inact to atomic

Turns out it is very rarely triggered, making a per-cpu
counter a waste.

Examples from real life boxes:
uptime counter
135 days 847
138 days 2190
141 days 1


# e5eb1d29 25-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace some spelled out VNASSERTs with VNPASS

nfc


# b5d43972 21-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: decouple freevnodes from vnode batching

In principle one cpu can keep vholding vnodes, while another vdrops
them. In this case it may be the local count will keep growing in an
unbounded manner. Roll it up after a threshold instead.

While here move it out of dpcpu into struct pcpu.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D39195


# 62a573d9 16-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire KERN_VNODE

It got disabled in 2003:

commit acb18acfec97aa7fe26ff48f80a5c3f89c9b542d
Author: Poul-Henning Kamp <phk@FreeBSD.org>
Date: Sun Feb 23 18:09:05 2003 +0000

Bracket the kern.vnode sysctl in #ifdef notyet because it results
in massive locking issues on diskless systems.

It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.

There does not seem to be practical use for it and the disabled routine
does not work anyway.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D39127


# dc9b1373 09-Feb-2023 Mitchell Horne <mhorne@FreeBSD.org>

Use maybe_yield() in a few more places

Reviewed by: kib, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D38186


# 456f0575 19-Jan-2023 Konstantin Belousov <kib@FreeBSD.org>

Handle int rank issues in in vn_getsize_locked() and vn_seek()

In vn_getsize_locked(), when storing vattr.va_size of type u_quad_t into
off_t size, we must avoid overflow.

Then, the check for fsize < 0, introduced in the commit
f45feecfb27ca51067d6789eaa43547cadc4990b 'vfs: add vn_getsize', is nop [1].

Reported and reviewed by: jhb
Coverity CID: 1502346
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38133


# 0f80d5eb 15-Jan-2023 Konstantin Belousov <kib@FreeBSD.org>

Require INVARIANTS and WITNESS if DEBUG_VFS_LOCKS is set

Reported by: pho
Reviewed by: markj, mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D38070


# f45feecf 22-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vn_getsize

getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.

As a step towards that goal make size-only consumers use a dedicated
routine.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D37885


# 829f0bcb 19-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add the concept of vnode state transitions

To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>

As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).

Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).

The new field lands in a 1-byte hole, thus it does not grow the struct.

Bump __FreeBSD_version to 1400077

Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759


# 94267fc9 22-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use designated initializers for the typename array

While here prefix with v for better consistency with the vnode stuff.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D37759


# 85dac03e 17-Nov-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop using NDFREE

It provides nothing but a branchfest and next to no consumers want it
anyway.

Tested by: pho


# a134a12b 17-Nov-2022 Eric van Gyzen <vangyzen@FreeBSD.org>

Mark the debug.vnlru_nowhere sysctl as CTLFLAG_STATS

The kernel doesn't read it. It's only writable so it can be cleared.

Sponsored by: Dell EMC Isilon


# 4390622c 19-Oct-2022 Jason A. Harmening <jah@FreeBSD.org>

vfs_busy(): fix wording in comment

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35054


# 080ef8a4 04-Aug-2022 Jason A. Harmening <jah@FreeBSD.org>

Add VV_CROSSLOCK vnode flag to avoid cross-mount lookup LOR

When a lookup operation crosses into a new mountpoint, the mountpoint
must first be busied before the root vnode can be locked. When a
filesystem is unmounted, the vnode covered by the mountpoint must
first be locked, and then the busy count for the mountpoint drained.
Ordinarily, these two operations work fine if executed concurrently,
but with a stacked filesystem the root vnode may in fact use the
same lock as the covered vnode. By design, this will always be
the case for unionfs (with either the upper or lower root vnode
depending on mount options), and can also be the case for nullfs
if the target and mount point are the same (which admittedly is
very unlikely in practice).

In this case, we have LOR. The lookup path holds the mountpoint
busy while waiting on what is effectively the covered vnode lock,
while a concurrent unmount holds the covered vnode lock and waits
for the mountpoint's busy count to drain.

Attempt to resolve this LOR by allowing the stacked filesystem
to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary.
Upon observing this flag, the vfs_lookup() will leave the covered
vnode lock held while crossing into the mountpoint. Employ this flag
for unionfs with the caveat that it can't be used for '-o below' mounts
until other unionfs locking issues are resolved.

Reported by: pho
Tested by: pho
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35054


# d346e3ac 26-Oct-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use cache_assert_no_entries instead of open-coding it


# a3ab1102 12-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: silence a bogus LOR in freevnode

Reported by: imp


# 5b5b7e2c 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: always retain path buffer after lookup

This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36542


# d04c7f10 14-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: make delmntque return with the interlock held

saves on relocking dance -- the lock is taken immediately
afterwards anyway.


# c84c5e00 18-Jul-2022 Mitchell Horne <mhorne@FreeBSD.org>

ddb: annotate some commands with DB_CMD_MEMSAFE

This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583


# c6487446 11-Apr-2022 Dmitry Chagin <dchagin@FreeBSD.org>

getdirentries: return ENOENT for unlinked but still open directory.

To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”).

Reviewed by: mjg, Pau Amma (doc)
Differential revision: https://reviews.freebsd.org/D34680
MFC after: 2 weeks


# 768f9b8b 09-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

kern: Fix a typo in a source code comment

- s/is is/is/

MFC after: 3 days


# 2533b5dc 27-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add missing bits to vdropl_impl

This completes the patch which was originally meant to go in.

Spotted by: mhorne
Fixes: c35ec1efdcb2978b ("vfs: [1/2] fix stalls in vnode reclaim by not
requeieing from vnlru")


# eb574ba0 19-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace VFS_NOTIFY_UPPER_* macros with an enum


# cceb91b0 18-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add missing flags to db show mount


# 93a0ba8f 17-Sep-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the no longer used MNTK_LOOKUP_EXCL_DOTDOT flag

Reviewed by: markj
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D34466


# 1cb0045c 07-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add MNTK_UNLOCKED_INSMNTQUE

Can be used when the fs at hand can synchronize insmntque with other
means than the vnode lock.

Reviewed by: markj
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D34466


# 3a4c5dab 07-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: [2/2] fix stalls in vnode reclaim by only counting attempts

... and ignoring if they succeded, which matches historical behavior.

Reported by: pho


# c35ec1ef 07-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: [1/2] fix stalls in vnode reclaim by not requeieing from vnlru

Reported by: pho


# 9af41803 03-Mar-2022 John Baldwin <jhb@FreeBSD.org>

Use vnsz2log directly in assertion on its relation to sizeof(struct vnode).

This reduces the size of diffs required to support different values of
vnsz2log. In CheriBSD, kernels for CHERI architectures have vnodes
larger than 512 bytes and require a value of 9.

Reviewed by: mjg
Obtained from: CheriBSD
Sponsored by: University of Cambridge, Google, Inc.
Differential Revision: https://reviews.freebsd.org/D34418


# 6aa246e6 12-Feb-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: convert vnsz2log to a macro


# 66c5fbca 27-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

insmntque1(): remove useless arguments

Also remove once-used functions to clean up after failed insmntque1(),
which were destructor callbacks in previous life.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D34071


# 3d68c4e1 21-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

syncer VOP_FSYNC(): unlock syncer vnode around call to VFS_SYNC()

The lock is unneccessary since the mount point is busied, which prevents
unmount and syncer vnode deallocation. Having the vnode locked causes
innocent LoRs and complicates debugging.

Also stop starting write accounting around it. Any caller of
VOP_FSYNC() must do it already, and sync_vnode() does.

Reported and tested by: pho
Reviewed by: markj, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34072


# 2a7e4cf8 27-Jan-2022 Mateusz Guzik <mjg@FreeBSD.org>

Revert b58ca5df0bb7 ("vfs: remove the now unused insmntque1")

I was somehow convinced that insmntque calls insmntque1 with a NULL
destructor. Unfortunately this worked well enough to not immediately
blow up in simple testing.

Keep not using the destructor in previously patched filesystems though
as it avoids unnecessary casts.

Noted by: kib
Reported by: pho


# b58ca5df 26-Jan-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the now unused insmntque1

Bump __FreeBSD_version to 1400052.


# b214fcce 13-Dec-2021 Alan Somers <asomers@FreeBSD.org>

Change VOP_READDIR's cookies argument to a **uint64_t

The cookies argument is only used by the NFS server. NFSv2 defines the
cookie as 32 bits on the wire, but NFSv3 increased it to 64 bits. Our
VOP_READDIR, however, has always defined it as u_long, which is 32 bits
on some architectures. Change it to 64 bits on all architectures. This
doesn't matter for any in-tree file systems, but it matters for some
FUSE file systems that use 64-bit directory cookies.

PR: 260375
Reviewed by: rmacklem
Differential Revision: https://reviews.freebsd.org/D33404


# 4dd23ae1 10-Dec-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire MNTK_NOKNOTE and VV_NOKNOTE

MNTK_NOKNOTE was introduced in 679985d03a64f5dfb4355538ae6e3b70f8347f38
(dated 2005), VV_NOKNOTE in 34cc826ae8999f454dd6cb9c77d17ce83b169f92 few
months later.

Neither was ever used by anything in the tree.


# 4dcdf398 17-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace the MNTK_TEXT_REFS flag with VIRF_TEXT_REF

This allows to stop maintaing the VI_TEXT_REF flag and consequently
opens up fully lockless v_writecount adjustment.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33127


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# d032cda0 31-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

DEBUG_VFS_LOCKS: stop excluding devfs and doomed vnode from asserts

We do not require devvp vnode locked for metadata io. It is typically
not needed indeed, since correctness of the file system using
corresponding block device ensures that there is no incorrect or racy
manipulations.

But right now DEBUG_VFS_LOCKS option excludes both character device
vnodes and completely destroyed (VBAD) vnodes from asserts. This is not
too bad since WITNESS still ensures that we do not leak locks. On the
other hand, asserts do not mean what they should, to the reader, and
reliance on them being enforced might result in wrong code.

Note that ASSERT_VOP_LOCKED() still silently accepts NULLVP, I think it
is worth fixing as well, in the next round.

In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D32761


# 47b248ac 03-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

Make locking assertions for VOP_FSYNC() and VOP_FDATASYNC() more correct

For devfs vnodes, it is fine to not lock vnodes for VOP_FSYNC().
Otherwise vnode must be locked exclusively, except for MNT_SHARED_WRITES()
where the shared lock is enough.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761


# d1d675cb 01-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

freevnode(): lock the freeing vnode around destroy_vpollinfo()

to satisfy locking requirements of knlist manipulations.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761


# 03d5820f 12-Oct-2021 Mark Johnston <markj@FreeBSD.org>

mount: Check for !VDIR mount points before handling -o emptydir

To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty. This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.

Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by: kib, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32475


# 7259ca31 01-Oct-2021 Kyle Evans <kevans@FreeBSD.org>

fifos: delegate unhandled kqueue filters to underlying filesystem

This gives the vfs layer a chance to provide handling for EVFILT_VNODE,
for instance. Change pipe_specops to use the default vop_kqfilter to
accommodate fifoops that don't specify the method (i.e. all in-tree).

Based on a patch by Jan Kokemüller.

PR: 225934
Reviewed by: kib, markj (both pre-KASSERT)
Differential Revision: https://reviews.freebsd.org/D32271


# b4a58fbf 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove cn_thread

It is always curthread.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32453


# 5d8e32a6 18-Sep-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire VNODE_REFCOUNT_FENCE_* macros

They are unused as of last year.


# 2bc16e8a 17-Jul-2021 Jason A. Harmening <jah@FreeBSD.org>

VFS: remove MNTK_MARKER

We no longer allow upper filesystems to be unregistered from the base
mount while vfs_notify_upper() or any other upper operation is pending.
New upper mounts can still be registered during this period, but they
will be added at the end of the upper mount tailq. We therefore no
longer need to allocate marker nodes during vfs_notify_upper() to keep
our place in the iteration.

Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016


# c746ed72 12-Jun-2021 Jason A. Harmening <jah@FreeBSD.org>

Allow stacked filesystems to be recursively unmounted

In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.

This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.

To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.

Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016


# 59409cb9 17-May-2021 Jason A. Harmening <jah@FreeBSD.org>

Add a generic mechanism for preventing forced unmount

This is aimed at preventing stacked filesystems like nullfs and unionfs
from "losing" their lower mounts due to forced unmount. Otherwise,
VFS operations that are passed through to the lower filesystem(s) may
crash or otherwise cause unpredictable behavior.

Introduce two new functions: vfs_pin_from_vp() and vfs_unpin().
which are intended to be called on the lower mount(s) when the stacked
filesystem is mounted and unmounted, respectively.
Much as registration in the mnt_uppers list previously did, pinning
will prevent even forced unmount of the lower FS and will allow the
stacked FS to freely operate on the lower mount either by direct
use of the struct mount* or indirect use through a properly-referenced
vnode's v_mount field.

vfs_pin_from_vp() is modeled after vfs_ref_from_vp() in that it uses
the mount interlock coupled with re-checking vp->v_mount to ensure
that it will fail in the face of a pending unmount request, even if
the concurrent unmount fully completes.

Adopt these new functions in both nullfs and unionfs.

Reviewed By: kib, markj
Differential Revision: https://reviews.freebsd.org/D30401


# 27006229 30-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vinvalbuf: do not panic if we were unable to flush dirty buffers

Return EBUSY instead and let caller to handle the issue.

For vgone()/vnode reclamation, caller first does vinvalbuf(V_SAVE),
which return EBUSY in case dirty buffers where not flushed. Then caller
calls vinvalbuf(0) due to non-zero return, which gets rid of all dirty
buffers without dependencies.

PR: 238565
Reviewed by: asomers, mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30555


# 3cf75ca2 28-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire unused vn_seqc_write_begin_unheld*


# cf74b2be 22-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the now unused vnlru_free routine


# f784da88 17-May-2021 Konstantin Belousov <kib@FreeBSD.org>

Move mnt_maxsymlinklen into appropriate fs mount data structures

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-Note: struct mount layout
Differential revision: https://reviews.freebsd.org/D30325


# d713bf79 21-May-2021 Konstantin Belousov <kib@FreeBSD.org>

vn_need_pageq_flush(): simplify

There is no need to own vnode interlock, since v_object is type stable
and can only change to/from NULL, and no other checks in the function
access fields protected by the interlock. Remove the need variable, the
result of the test is directly usable as return value.

Tested by: mav, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# cc6f46ac 14-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: refactor vdrop

In particular move vunlazy into its own routine.


# 715fcc0d 14-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: change vn_freevnodes_* prefix to idiomatic vfs_freevnodes_*


# 9a2fac6b 16-May-2021 Kirk McKusick <mckusick@FreeBSD.org>

Fix handling of embedded symbolic links (and history lesson).

The original filesystem release (4.2BSD) had no embedded sysmlinks.
Historically symbolic links were just a different type of file, so
the content of the symbolic link was contained in a single disk block
fragment. We observed that most symbolic links were short enough that
they could fit in the area of the inode that normally holds the block
pointers. So we created embedded symlinks where the content of the
link was held in the inode's pointer area thus avoiding the need to
seek and read a data fragment and reducing the pressure on the block
cache. At the time we had only UFS1 with 32-bit block pointers,
so the test for a fastlink was:

di_size < (NDADDR + NIADDR) * sizeof(daddr_t)

(where daddr_t would be ufs1_daddr_t today).

When embedded symlinks were added, a spare field in the superblock
with a known zero value became fs_maxsymlinklen. New filesystems
set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded
symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus
filesystems that preceeded this change always read from blocks
(since fs->fs_maxsymlinklen == 0) and newer ones used embedded
symlinks if they fit. Similarly symlinks created on pre-embedded
symlink filesystems always spill into blocks while newer ones will
embed if they fit.

At the same time that the embedded symbolic links were added, the
on-disk directory structure was changed splitting the former
u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen.
Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can
be used to distinguish old directory formats. In retrospect that
should have just been an added flag, but we did not realize we
needed to know about that change until it was already in production.

Code was split into ufs/ffs so that the log structured filesystem could
use ufs functionality while doing its own disk layout. This meant
that no ffs superblock fields could be used in the ufs code. Thus
ffs superblock fields that were needed in ufs code had to be copied
to fields in the mount structure. Since ufs_readlink needed to know
if a link was embedded, fs_maxlinklen gets copied to mnt_maxsymlinklen.

The kernel panic that arose to making this fix was triggered when a
disk error created an inode of type symlink with no allocated data
blocks but a large size. When readlink was called the uiomove was
attempted which segment faulted.

static int
ufs_readlink(ap)
struct vop_readlink_args /* {
struct vnode *a_vp;
struct uio *a_uio;
struct ucred *a_cred;
} */ *ap;
{
struct vnode *vp = ap->a_vp;
struct inode *ip = VTOI(vp);
doff_t isize;

isize = ip->i_size;
if ((isize < vp->v_mount->mnt_maxsymlinklen) ||
DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */
return (uiomove(SHORTLINK(ip), isize, ap->a_uio));
}
return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred));
}

The second part of the "if" statement that adds

DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */

is problematic. It never appeared in BSD released by Berkeley because
as noted above mnt_maxsymlinklen is 0 for old format filesystems, so
will always fall through to the VOP_READ as it should. I had to dig
back through `git blame' to find that Rodney Grimes added it as
part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.''
He must have brought it across from an earlier FreeBSD. Unfortunately
the source-control logs for FreeBSD up to the merger with the
AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the
agreement to let FreeBSD remain unencumbered, so I cannot pin-point
where that line got added on the FreeBSD side.

The one change needed here is that mnt_maxsymlinklen is declared as
an `int' and should be changed to be `u_int64_t'.

This discovery led us to check out the code that deletes symbolic
links. Specifically

if (vp->v_type == VLNK &&
(ip->i_size < vp->v_mount->mnt_maxsymlinklen ||
datablocks == 0)) {
if (length != 0)
panic("ffs_truncate: partial truncate of symlink");
bzero(SHORTLINK(ip), (u_int)ip->i_size);
ip->i_size = 0;
DIP_SET(ip, i_size, 0);
UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE);
if (needextclean)
goto extclean;
return (ffs_update(vp, waitforupdate));
}

Here too our broken symlink inode with no data blocks allocated
and a large size will segment fault as we are incorrectly using the
test that we have no data blocks to decide that it is an embdedded
symbolic link and attempting to bzero past the end of the inode.
The test for datablocks == 0 is unnecessary as the test for
ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right
thing in all cases.

The test for datablocks == 0 was added by David Greenman in this commit:

Author: David Greenman <dg@FreeBSD.org>
Date: Tue Aug 2 13:51:05 1994 +0000

Completed (hopefully) the kernel support for old style "fastlinks".

Notes:
svn path=/head/; revision=1821

I am guessing that he likely earlier added the incorrect test in the
ufs_readlink code.

I asked David if he had any recollection of why he made this change.
Amazingly, he still had a recollection of why he had made a one-line
change more than twenty years ago. And unsurpisingly it was because
he had been stuck between a rock and a hard place.

FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code
base. Prior to that, there were three years of development in all
areas of the kernel, including the filesystem code, from the combined
set of people including Bill Jolitz, Patchkit contributors, and
FreeBSD Project members. The compatibility issue at hand was caused
by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite
changes David had to find a way to provide compatibility with both
the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite.
He felt that these changes would provide compatibility with both systems.

In his words:
``My recollection is that the 'FASTLINKS' symlinks support in
FreeBSD-1.x, as implemented by Curt Mayer, worked differently than
4.4BSD. He used a spare field in the inode to duplicately store the
length. When the 4.4BSD-Lite merge was done, the optimized symlinks
support for existing filesystems (those that were initialized in
FreeBSD-1.x) were broken due to the FFS on-disk structure of
4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to
restore the backward compatibility with FreeBSD-1.x filesystems.
I think it was the best that could be done in the somewhat urgent
circumstances of the post Berkeley-USL settlement. Also, regarding
Rod's massive commit with little explanation, some context: John
Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to
the 386 platform in just 10 days. It was by far the most intense
hacking effort of my life. In addition to the porting of tons of
FreeBSD-1 code, I think we wrote more than 30,000 lines of new code
in that time to deal with the missing pieces and architectural
changes of 4.4BSD-Lite. We didn't make many notes along the way.
There was a lot of pressure to get something out to the rest of the
developer community as fast as possible, so detailed discrete commits
didn't happen - it all came as a giant wad, which is why Rod's
commit message was worded the way it was.''

Reported by: Chuck Silvers
Tested by: Chuck Silvers
History by: David Greenman Lawrence
MFC after: 1 week
Sponsored by: Netflix


# b261bb40 13-Apr-2021 Mark Johnston <markj@FreeBSD.org>

vfs: Add KASAN state transitions for vnodes

vnodes are a bit special in that they may exist on per-CPU lists even
while free. Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29459


# e9272225 17-Mar-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix vnlru marker handling for filtered/unfiltered cases

The global list has a marker with an invariant that free vnodes are
placed somewhere past that. A caller which performs filtering (like ZFS)
can move said marker all the way to the end, across free vnodes which
don't match. Then a caller which does not perform filtering will fail to
find them. This makes vn_alloc_hard sleep for 1 second instead of
reclaiming, resulting in significant stalls.

Fix the problem by requiring an explicit marker by callers which do
filtering.

As a temporary measure extend vnlru_free to restart if it fails to
reclaim anything.

Big thanks go to the reporter for testing several iterations of the
patch.

Reported by: Yamagi <lists yamagi.org>
Tested by: Yamagi <lists yamagi.org>
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29324


# 44691b33 06-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

vlrureclaim: only skip vnode with resident pages if it own the pages

Nullfs vnode which shares vm_object and pages with the lower vnode should
not be exempt from the reclaim just because lower vnode cached a lot.
Their reclamation is actually very cheap and should be preferred over
real fs vnodes, but this change is already useful.

Reported and tested by: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29178


# 0cc746f1 02-Mar-2021 Robert Wing <rew@FreeBSD.org>

filt_fsevent: only record interested events

Respect filter-specific flags for the EVFILT_FS filter.

When a kevent is registered with the EVFILT_FS filter, it is always
triggered when an EVFILT_FS event occurs, regardless of the
filter-specific flags used. Fix that.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28974


# 2bfd8992 14-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

vnode: move write cluster support data to inodes.

The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679


# 662283b1 18-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

vn_printf: handle VI_FOPENING

Noted by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 6 days
Fixes: fa3bd463cee


# b59a8e63 30-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

Stop ignoring ERELOOKUP from VOP_INACTIVE()

When possible, relock the vnode and retry inactivation. Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 739ecbcf 23-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add symlink support to lockless lookup

Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D27488


# 33a195ba 03-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: keep seqc unchanged as long as the vnode is accessible via SMR


# 82397d79 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: denote vnode being a mount point with VIRF_MOUNTPOINT

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27794


# 3e506a67 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add v_irflag accessors

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27793


# 0c23d262 04-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: keep bad ops on vnode reclaim

They were only modified to accomodate a redundant assertion.

This runs into problems as lockless lookup can still try to use the vnode
and crash instead of getting an error.

The bug was only present in kernels with INVARIANTS.

Reported by: kevans


# 5335f643 18-Nov-2020 John Baldwin <jhb@FreeBSD.org>

Fix a few nits in vn_printf().

- Mask out recently added VV_* bits to avoid printing them twice.

- Keep VI_LOCKed on the same line as the rest of the flags.

Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27261


# 441eb16a 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Allow some VOPs to return ERELOOKUP to indicate VFS operation restart at top level.

Restart syscalls and some sync operations when filesystem indicated
ERELOOKUP condition, mostly for VOPs operating on metdata. In
particular, lookup results cached in the inode/v_data is no longer
valid and needs recalculating. Right now this should be nop.

Assert that ERELOOKUP is catched everywhere and not returned to
userspace, by asserting that td_errno != ERELOOKUP on syscall return
path.

In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# f6dd1aef 09-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: group mount per-cpu vars into one struct

While here move frequently read stuff into the same cacheline.

This shrinks struct mount by 64 bytes.

Tested by: pho


# e90afaa0 08-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

kqueue: save space by using only one func pointer for assertions


# 06855749 30-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: change vnode poll to just a malloc type

The size is 120, close fit for 128 and rarely used. The infrequent use
avoidably populates per-CPU caches and ends up with more memory.


# 11743b6e 27-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: tidy up vnlru_free

Apart from cosmeatic changes make sure to only decrease the recycled counter
if vtryrecycle succeeded.

Tested by: pho


# 68ac2b80 27-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix vnode reclaim races against getnwevnode

All vnodes allocated by UMA are present on the global list used by
vnlru. getnewvnode modifies the state of the vnode (most notably
altering v_holdcnt) but never locks it. Moreover filesystems also
modify it in arbitrary manners sometimes before taking the vnode
lock or adding any other indicator that the vnode can be used.

Picking up such a vnode by vnlru would be problematic.

To that end there are 2 fixes:
- vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
interlock and verifies that v_mount has been set. It is an
invariant that the vnode lock is held by that point, providing
the necessary serialisation against locking after vhold.
- vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
now makes sure to only transition the count 0->1 and newly allocated
vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only
transition VHOLD_NO_SMR->1 once more making the hold fail

Tested by: pho


# 7cc17186 24-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix a race where reclaim vholds freed vnodes

Reported by: pho
Tested by: pho (previous version)
Fixes: r366974 ("vfs: stop taking the interlock in vnode reclaim")


# 703f3faf 23-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop taking the interlock in vnode reclaim

It no longer protects any of tested fields, keeping all the checks racy.

While here make vtryrecycle drop the vnode on its own. Avoids an additional
lock trip.


# c7520caa 22-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prevent avoidable evictions on mkdir of existing directories

mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.

The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.

For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.

Tested by: pho
Discussed with: kib


# ab21ed17 20-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the de facto curthread argument from VOP_INACTIVE


# c0baa3dc 19-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

vgonel(): avoid recursing into VOP_INACTIVE().

It is a common pattern for filesystems' VOP_INACTIVE() implementation
to forcibly reclaim the vnode when its state is final. For instance,
UFS vnode with zero link count is removed, and since it is
inactivated, the last open reference on it is dropped.

On the other hand, vnode might get spurious usecount reference for
many reasons. If the spurious reference exists while vgonel() checks
for active state of the vnode, it would recurse into VOP_INACTIVE().

Fix it by checking and not doing inactivation when vgone() was called
from inactive VOP.

Reported and tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 3c484f32 15-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Convert page cache read to VOP.

There are several negative side-effects of not calling into VOP layer
at all for page cache reads. The biggest is the missed activation of
EVFILT_READ knotes.

Also, it allows filesystem to make more fine grained decision to
refuse read from page cache.

Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for
asserts.

Reviewed by: markj
Tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346


# 2bcfa5ba 08-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop a write-only var in vfs_periodic_msync_inactive


# b1a824b6 02-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire vholdl as a symbol

Similarly to vrefl in r364283.


# 2b4632ae 02-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: purge cache entries early on vgone

There is no reason for them to linger across reclaim and it is an
invariant that doomed vnodes are not added to the namecache.


# 6fed89b1 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

kern: clean up empty lines in .c and .h files


# a459a6cf 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: respect PRIV_VFS_LOOKUP in vaccess_smr

Reported by: novel


# 9ce9158b 23-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: support denying access in vaccess_vexec_smr


# ba3b0991 23-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: factor away doomed vnode handling into vdropl_final


# 2ca83b5c 23-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: mark freevnode as noinline


# 19337211 21-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix freevnode accounting

Most notably add the missing decrement to vhold_smr.

.-'---`-.
,' `.
| \
| \
\ _ \
,\ _ ,'-,/-)\
( * \ \,' ,' ,'-)
`._,) -',-')
\/ ''/
) / /
/ ,'-'

Reported by: Dan Nelson <dnelson_1901@yahoo.com>
Fixes: r362827 ("vfs: protect vnodes with smr")


# 8f226f4c 19-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the always-curthread td argument from VOP_RECLAIM


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# fbca789f 16-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

VMIO read

If possible, i.e. if the requested range is resident valid in the vm
object queue, and some secondary conditions hold, copy data for
read(2) directly from the valid cached pages, avoiding vnode lock and
instantiating buffers. I intentionally do not start read-ahead, nor
handle the advises on the cached range.

Filesystems indicate support for VMIO reads by setting VIRF_PGREAD
flag, which must not be cleared until vnode reclamation.

Currently only filesystems that use vnode pager for v_objects can
enable it, due to reliance on vnp_size. There is a WIP to handle it
for tmpfs.

Reviewed by: markj
Discussed with: jeff
Tested by: pho
Benchmarked by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968


# 60414088 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire vrefl as a symbol

vrefl calls vref and there is only one in-tree consumer.

Keep it as a macro for assertion purposes.


# a92a971b 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the thread argument from vget

It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)


# 36f47512 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: inline vrefcnt


# 4c2d103a 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: garbage collect vrefactn


# 6883f07e 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: reimplement vref on top of vget

No change in generated assembly.


# 3b444436 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

devfs: rework si_usecount to track opens

This removes a lot of special casing from the VFS layer.

Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D25612


# 7f700801 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: disallow NOCACHE with LOOKUP

This means there is no expectation lookup will purge the terminal entry,
which simplifies lockless lookup.

Tested by: pho
Sponsored by: The FreeBSD Foundation


# 1ff80a34 07-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: release the interlock after failing to set VHOLD_NO_SMR

While here add more comments.

Diagnosed by: markj
Reported by: pho
Fixes: r362827 ("vfs: protect vnodes with smr")


# 0ffec1b0 06-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Clean up reassignbuf() and buf_vlist_remove() a bit.

- Convert panic() calls to INVARIANTS-only assertions. The PCTRIE code
provides some of the same protection since it will panic upon an
attempt to remove a non-resident buffer.
- Update the comment above reassignbuf() to reflect reality.

Reviewed by: cem, kib, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25965


# 7013797e 06-Aug-2020 Mark Johnston <markj@FreeBSD.org>

Remove the vfs.reassignbufcalls counter and sysctl.

As the 20-year old comment above it suggests, the counter is of dubious
value. Moreover, the (global) counter was not updated precisely and
hurts scalability.

Reviewed by: cem, kib, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D25965


# d292b194 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the obsolete privused argument from vaccess

This brings argument count down to 6, which is passable without the
stack on amd64.


# db99ec56 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: support lockless dotdot lookup

Tested by: pho


# 6e10434c 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_purge_vgone

cache_purge locklessly checks whether the vnode at hand has any namecache
entries. This can race with a concurrent purge which managed to remove
the last entry, but may not be done touching the vnode.

Make sure we observe the relevant vnode lock as not taken before proceeding
with vgone.

Paired with the fact that doomed vnodes cannnot receive entries this restores
the invariant that there are no namecache-related writing users past cache_purge
in vgone.

Reported by: pho


# 838984de 02-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: move namecache initialisation into cache_vnode_init


# 848f8eff 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: inline vops if there are no pre/post associated calls

This removes a level of indirection from frequently used methods, most notably
VOP_LOCK1 and VOP_UNLOCK1.

Tested by: pho


# 07d2145a 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add the infrastructure for lockless lookup

Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25577


# 0379ff6a 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce vnode sequence counters

Modified on each permission change and link/unlink.

Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25573


# 68ee1dda 24-Jul-2020 Conrad Meyer <cem@FreeBSD.org>

Add unlocked/SMR fast path to getblk()

Convert the bufobj tries to an SMR zone/PCTRIE and add a gbincore_unlocked()
API wrapping this functionality. Use it for a fast path in getblkx(),
falling back to locked lookup if we raced a thread changing the buf's
identity.

Reported by: Attilio
Reviewed by: kib, markj
Testing: pho (in progress)
Sponsored by: Isilon
Differential Revision: https://reviews.freebsd.org/D25782


# 422f38d8 10-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix trivial whitespace issues which don't interefere with blame

.. even without the -w switch


# 9b0c2e59 05-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: expand on vhold_smr comment


# f8022be3 30-Jun-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: protect vnodes with smr

vget_prep_smr and vhold_smr can be used to ref a vnode while within vfs_smr
section, allowing consumers to get away without locking.

See vhold_smr and vdropl for comments explaining caveats.

Reviewed by: kib
Testec by: pho
Differential Revision: https://reviews.freebsd.org/D23913


# 245bfd34 20-May-2020 Ryan Moeller <freqlabs@FreeBSD.org>

Deduplicate fsid comparisons

Comparing fsid_t objects requires internal knowledge of the fsid structure
and yet this is duplicated across a number of places in the code.

Simplify by creating a fsidcmp function (macro).

Reviewed by: mjg, rmacklem
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D24749


# f15ccf88 06-Mar-2020 Chuck Silvers <chs@FreeBSD.org>

Add a new "mntfs" pseudo file system which provides private device vnodes for
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.

Reviewed by: kib, mckusick
Approved by: imp (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23787


# 2782c00c 22-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

vfs: quiet -Wwrite-strings

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23797


# 6c5f36ff 19-Feb-2020 Jeff Roberson <jeff@FreeBSD.org>

Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in
virtual address or physical page allocation need to be marked with this
flag.

Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23712


# 3403d524 15-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix vlrureclaim ->v_object access

The routine was checking for ->v_type == VBAD. Since vgone drops the interlock
early sets this type at the end of the process of dooming a vnode, this opens
a time window where it can clear the pointer while the inerlock-holders is
accessing it.

Another note is that the code was:
(vp->v_object != NULL &&
vp->v_object->resident_page_count > trigger)

With the compiler being fully allowed to emit another read to get the pointer,
and in fact it did on the kernel used by pho.

Use atomic_load_ptr and remember the result.

Note that this depends on type-safety of vm_object.

Reported by: pho


# c6150094 15-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: check early for VCHR in vput_final to short-circuit in the common case

Otherwise the compiler inlines v_decr_devcount which keps getting jumped over
in the common case of not dealing with a device.


# df0d5a2a 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove no longer needed atomic_load_ptr casts


# 46022147 12-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: refactor vputx and add more comment

Reviewed by: jeff (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23530


# 123c5197 12-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit

In particular on amd64 this eliminates an atomic op in the common case,
trading it for IPIs in the uncommon case of catching CPUs executing the
code while the filesystem is getting suspended or unmounted.


# 57349a4f 11-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix vhold race in mnt_vnode_next_lazy_relock

vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by
mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the
count has to be > 0.

Reported by: pho
Tested by: pho


# 2e57c8fd 10-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix device count leak on vrele racing with vgone

The race is:

CPU1 CPU2
devfs_reclaim_vchr
make v_usecount 0
VI_LOCK
sees v_usecount == 0, no updates
vp->v_rdev = NULL;
...
VI_UNLOCK
VI_LOCK
v_decr_devcount
sees v_rdev == NULL, no updates

In this scenario si_devcount decrement is not performed.

Note this can only happen if the vnode lock is not held.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23529


# cd951a0d 10-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix lock recursion in vrele

vrele is supposed to be called with an unlocked vnode, but this was never
asserted for if v_usecount was > 0. For such counts the lock is never touched
by the routine. As a result the kernel has several consumers which expect
vunref semantics and get away with calling vrele since they happen to never do
it when this is the last reference (and for some of them this may happen to be
a guarantee).

Work around the problem by changing vrele semantics to tolerate being called
with a lock. This eliminates a possible bug where the lock is already held and
vputx takes it anyway.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23528


# 2f7f11b7 08-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: tidy up vget_finish and vn_lock

- remove assertion which duplicates vn_lock
- use VNPASS instead of retyping the failure
- report what flags were passed if panicking on them


# 6a5abb1e 02-Feb-2020 Kyle Evans <kevans@FreeBSD.org>

Provide O_SEARCH

O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247


# 6698e11f 02-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the now empty vop_unlock_post


# 643656cf 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace VOP_MARKATIME with VOP_MMAPPED

The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23422


# 21c4f104 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vrefactn

Differential Revision: https://reviews.freebsd.org/D23427


# 0f4d8b77 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: revert the overzealous assert added in r357285 to vgone

The intent was to make it more likely to catch filesystems with custom
need_inactive routines which fail to call vn_need_pageq_flush (or do an
equivalent).

One immediate case which is missed is vgone from called by inactive itself.

A better assertion may land later. The routine is not added to vputx because
it is of no use to tmpfs et al.

Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com


# 3ff65f71 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Remove duplicated empty lines from kern/*.c

No functional changes.


# c2ef6aa3 29-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: assert that doomed vnodes don't need to call vm_object_page_clean

... after the optional inactive processing.


# 07c6e2f4 29-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: unlazy before dooming the vnode

With this change having the listmtx lock held postpones dooming the vnode.
Use this fact to simplify iteration over the lazy list. It also allows
filters to safely access ->v_data.

Reviewed by: kib (early version)
Differential Revision: https://reviews.freebsd.org/D23397


# 79674264 29-Jan-2020 Gleb Smirnoff <glebius@FreeBSD.org>

Fix text format definition for kern.maxvnodes, vfs.wantfreevnodes. This
is a regression from r356642, r356645.


# 1513f803 26-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: do an unlocked check before iterating the lazy list

For most filesystems it is expected to be empty most of the time.


# 6d69e665 25-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix freevnodes count update race against preemption

vdbatch_process leaves the critical section too early, openign a time
window where another thread can get scheduled and modify vd->freevnodes.
Once it the preempted thread gets back it overrides the value with 0.

Just move critical_exit to the end of the function.


# dc9a1cb6 25-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: predict vn_lock failure as unlikely in vget


# 28eb39a5 24-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: allow v_usecount to transition 0->1 without the interlock

There is nothing to do but to bump the count even during said transition.
There are 2 places which can do it:
- vget only does this after locking the vnode, meaning there is no change in
contract versus inactive or reclamantion
- vref only ever did it with the interlock held which did not protect against
either (that is, it would always succeed)

VCHR vnodes retain special casing due to the need to maintain dev use count.

Reviewed by: jeff, kib
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23185


# d93762b9 24-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop handling VI_OWEINACT in vget

vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumepd usecount.

Reviewed by: jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23184


# 74c4b7cc 24-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop unlocking the vnode upfront in vput

Doing so runs into races with filesystems which make half-constructed vnodes
visible to other users, while depending on the chain vput -> vinactive ->
vrecycle to be executed without dropping the vnode lock.

Impediments for making this work got cleared up (notably vop_unlock_post now
does not do anything and lockmgr stops touching the lock after the final
write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can
now be eliminated.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23344


# 28479aaa 19-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: allow v_holdcnt to transition 0->1 without the interlock

Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23186


# 512fa9a4 18-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: plug a conditional assigment of lo_name in getnewvnode

It only matters for witness. No functional changes.


# 2d0c6202 17-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: distribute freevnodes counter per-cpu

It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.

This soothes a global serialisation point for all 0<->1 hold count transitions.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23235


# 1ad72b27 17-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: shorten lock hold time in vdbatch_process


# 66f67d5e 16-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: increment numvnodes without the vnode list lock unless under pressure

The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.

The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.

Note the entire scheme needs a rewrite, the above just reduces it's SMP impact.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23140


# b7f50b9a 16-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: refcator vnode allocation

Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D23158


# 875cfc08 16-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: reimplement vlrureclaim to actually use LRU

Take advantage of global ordering introduced in r356672.

Reviewed by: mckusick (previous version)
Differential Revision: https://reviews.freebsd.org/D23067


# 0c236d3d 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: per-cpu batched requeuing of free vnodes

Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.

Per-cpu areas are locked in order to synchronize against UMA freeing
memory.

vnode's v_mflag is converted to short to prevent the struct from
growing.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998


# cc3593fb 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: rework vnode list management

The current notion of an active vnode is eliminated.

Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.

Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997


# 57083d25 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add per-mount vnode lazy list and use it for deferred inactive + msync

This obviates the need to scan the entire active list looking for vnodes
of interest.

msync is handled by adding all vnodes with write count to the lazy list.

deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.

Vnodes get dequeued from the list when their hold count reaches 0.

Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22995


# 879e0604 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

Add KERNEL_PANICKED macro for use in place of direct panicstr tests


# 91de98e6 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: only recalculate watermarks when limits are changing

Previously they would get recalculated all the time, in particular in:
getnewvnode -> vcheckspace -> vspace


# e6ae744e 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: deduplicate vnode allocation logic

This creates a dedicated routine (vn_alloc) to allocate vnodes.

As a side effect code duplicationw with getnewvnode_reserve is eleminated.

Add vn_free for symmetry.


# b52d50cf 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prealloc vnodes in getnewvnode_reserve

Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.


# 69283067 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: incomplete pass at converting more ints to u_long

Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.


# bf62296f 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add missing CLTFLA_MPSAFE annotations

This covers all kern/vfs_*.c files.


# a9a047bc 07-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: handle doomed vnodes in vdefer_inactive

vgone dooms the vnode while keeping VI_OWEINACT set and then drops the
interlock.

vputx can pick up the interlock and pass it to vdefer_inactive since the
flag is set.

The race is harmless, just don't defer anything as vgone will take care of it.

Reported by: pho


# c8b3463d 07-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT)

The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.

Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036


# b7cc9d18 07-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: trylock in vfs_msync and refactor the func

- use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock
- evaluate flags before looping over vnodes

Reviewed by: kib
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23035


# c92fe112 07-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use a dedicated counter for free vnode recycling

Otherwise vlrureclaim activitity is mixed in and it is hard to tell which
vnodes got reclaimed.


# cc2b586d 06-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prevent numvnodes and freevnodes re-reads when appropriate

Otherwise in code like this:
if (numvnodes > desiredvnodes)
vnlru_free_locked(numvnodes - desiredvnodes, NULL);

numvnodes can drop below desiredvnodes prior to the call and if the
compiler generated another read the subtraction would get a negative
value.


# 37fe521a 06-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: annotate numvnodes and vnode_free_list_mtx with __exclusive_cache_line


# 478368ca 06-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: eliminate v_tag from struct vnode

There was only one consumer and it was using it incorrectly.

It is given an equivalent hack.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23037


# a91190c6 06-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add a helper for allocating marker vnodes


# 8dbc6352 04-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop thread argument from vinactive


# 867fd730 04-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: patch up vnode count assertions to report found value


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 57db0e12 01-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop an always-false check from vlrureclaim

The vnode gets held few lines prior, making the VI_FREE condition
illegal.


# eb976461 27-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove production kernel checks and mp == NULL support from vdrop

1. The only place in the tree which calls getnewvnode with mp == NULL does it
for vp_crossmp which will never execute this codepath. Any vnode which legally
has ->v_mount == NULL is also doomed, which once more wont execute this code.
2. Remove an assertion for v_holdcnt from production kernels. It gets taken care
of by refcount macros in debug kernels.

Any code which would want to pass NULL mp can construct a fake one instead.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D22722


# 6fa079fc 15-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: flatten vop vectors

This eliminates the following loop from all VOP calls:

while(vop != NULL && \
vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
vop = vop->vop_default;

Reviewed by: jeff
Tesetd by: pho
Differential Revision: https://reviews.freebsd.org/D22738


# ff4486e8 09-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: refactor vhold and vdrop

No fuctional changes.


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# 791a24c7 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: clean up vputx a little

1. replace hand-rolled macros for operation type with enum
2. unlock the vnode in vput itself, there is no need to branch on it. existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to locking request.
3. remove the useless v_usecount assertion. few lines above the checks if
v_usecount > 0 and leaves. should the value be negative, refcount would fail.
4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the
vnode with holdcnt > 1. if the like should exist, it should be moved there
5. no need to error = 0 for everyone

Reviewed by: kib, jeff (previous version)
Differential Revision: https://reviews.freebsd.org/D22718


# fd6e0c43 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: factor out vnode destruction out of vdrop

Sponsored by: The FreeBSD Foundation


# 12e483e5 06-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: clean up delmntque similarly to vdrop r355414


# 4f4d9a08 06-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: catch vn_printf up with reality

- add the missing VV_VMSIZEVNLOCK and VV_READLINK flags
- add decoding v_mflag

While here sort flags.


# 3eeb8a1f 05-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove 'active' variable from _vdrop

No functional changes.


# d957f3a4 19-Nov-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: perform a more racy check in vfs_notify_upper

Locking mp does not buy anything interms of correctness and only contributes to
contention.


# 1fccb43c 19-Nov-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: change si_usecount management to count used vnodes

Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.

There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).

Reviewed by: kib, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22202


# 67d0e293 29-Oct-2019 Jeff Roberson <jeff@FreeBSD.org>

Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116


# c92f1304 23-Oct-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix undefined behavior.

Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22126


# 8076c4e7 23-Oct-2019 Konstantin Belousov <kib@FreeBSD.org>

vn_printf(): Decode VI_TEXT_REF.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# d1cbf3ee 13-Oct-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add MNTK_NOMSYNC

On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22009


# 737241cd 13-Oct-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: return free vnode batches in sync instead of vfs_msync

It is a more natural fit. vfs_msync only deals with active vnodes.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22008


# dc20b834 06-Oct-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add optional root vnode caching

Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by: jeff (previous version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21646


# e61e783b 04-Oct-2019 Eric van Gyzen <vangyzen@FreeBSD.org>

Add CTLFLAG_STATS to some vfs sysctl OIDs

Add CTLFLAG_STATS to the following OIDs:

vfs.altbufferflushes
vfs.recursiveflushes
vfs.barrierwrites
vfs.flushwithdeps
vfs.reassignbufcalls

Refer to r353111.

MFC after: 2 weeks
Sponsored by: Dell EMC Isilon


# f91dd609 02-Oct-2019 Ed Maste <emaste@FreeBSD.org>

simplify path handling in sysctl_try_reclaim_vnode

MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL. Thus there is no need to increase the
buffer size here.

The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21876


# ba7a55d9 22-Sep-2019 Sean Eric Fagan <sef@FreeBSD.org>

Add two options to allow mount to avoid covering up existing mount points.
The two options are

* nocover/cover: Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir: Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.

Neither of these options is intended to be a default, for historical and
compatibility reasons.

Reviewed by: allanjude, kib
Differential Revision: https://reviews.freebsd.org/D21458


# 4cace859 16-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: convert struct mount counters to per-cpu

There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637


# ee831b25 16-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: manage mnt_lockref with atomics

See r352424.

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21574


# a8c8e44b 16-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: manage mnt_ref with atomics

New primitive is introduced to denote sections can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspendion or other events.

mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.

Reviewed by: kib, jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21425


# ce3ba63f 13-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: release usecount using fetchadd

1. If we release the last usecount we take ownership of the hold count, which
means the vnode will remain allocated until we vdrop it.
2. If someone else vrefs they will find no usecount and will proceed to add
their own hold count.
3. No code has a problem with v_usecount transitioning to 0 without the
interlock

These facts combined mean we can fetchadd instead of having a cmpset loop.

Reviewed by: kib (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21528


# 68c3c1ab 05-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: temporarily revert r351825

There are 2 problems:
- it introduces a funny bug where it can end up trylocking the same vnode [1]
- it exposes a pre-existing softdep deadlock [2]

Both are easier to run into that the bug which got fixed, so revert until
a complete solution is worked out.

Reported by: cy [1], pho [2]
Sponsored by: The FreeBSD Foundation


# c07d4a0a 04-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fully hold vnodes in vnlru_free_locked

Currently the code only bumps holdcnt and clears the VI_FREE flag, not
performing actual vhold. Since the vnode is still visible elsewhere, a
potential new user can find it and incorrectly assume it is properly held.

Use vholdl instead to correctly hold the vnode. Another place recycling
(vlrureclaim) does this already.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21522


# e3c3248c 03-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: implement usecount implying holdcnt

vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.

Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.

Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471


# 08cfa56e 01-Sep-2019 Mark Johnston <markj@FreeBSD.org>

Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667


# c2b600f9 30-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add a missing VNODE_REFCOUNT_FENCE_REL to v_incr_usecount_locked

Sponsored by: The FreeBSD Foundation


# 3bb8d8d8 29-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: tidy up assertions in vfs_subr

- assert unlocked vnode interlock in vref
- assert right counts in vputx
- print debug info for panic in vdrop

Sponsored by: The FreeBSD Foundation


# 6470c8d3 29-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Rework v_object lifecycle for vnodes.

Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412


# 1e2f0ceb 28-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add VOP_NEED_INACTIVE

vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always lock the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads who want to use the vnode
for looukp.

vinactive is very rarely needed and can be tested for without the vnode lock held.

This patch gives filesytems an opportunity to do it, sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:

before: 557563641706 (lockmgr:tmpfs)
after: 46309603301 (lockmgr:tmpfs)

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21371


# 368cabbc 27-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop passing LK_INTERLOCK to VOP_UNLOCK

The plan is to drop the flags argument. There is also a temporary bug
now that nullfs ignores the flag.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21252


# 0256405e 24-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vholdnz (for already held vnodes)

Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21358


# e671edac 23-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

De-commision the MNTK_NOINSMNTQ kernel mount flag.

After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# cf27e0d1 19-Aug-2019 Jeff Roberson <jeff@FreeBSD.org>

Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by: markj
Tested by: pho (as part of a larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21311


# e7d8ebc8 28-Jul-2019 Alan Somers <asomers@FreeBSD.org>

Better comments for vlrureclaim

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 2240d8c4 27-Jul-2019 Alan Somers <asomers@FreeBSD.org>

Add v_inval_buf_range, like vtruncbuf but for a range of a file

v_inval_buf_range invalidates all buffers within a certain LBA range of a
file. It will be used by fusefs(5). This commit is a partial merge of
r346162, r346606, and r346756 from projects/fuse2.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21032


# d10b7578 06-Jun-2019 Alan Somers <asomers@FreeBSD.org>

[skip ci] Better comments for vlrureclaim

Sponsored by: The FreeBSD Foundation


# 46f8169a 06-Jun-2019 Alan Somers <asomers@FreeBSD.org>

Add a testing facility to manually reclaim a vnode

Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, it
will be reclaimed, as long as it isn't already or doomed. The purpose is to
gain test coverage for vnode reclamation, which is otherwise hard to
achieve.

Add the debug.ftry_reclaim_vnode sysctl. It does the same thing, except
that its argument is a file descriptor instead of a pathname.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20519


# 65417f5e 24-May-2019 Alan Somers <asomers@FreeBSD.org>

Remove "struct ucred*" argument from vtruncbuf

vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377


# daec9284 21-May-2019 Conrad Meyer <cem@FreeBSD.org>

Include ktr.h in more compilation units

Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon


# 5422b0e6 19-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix rw->ro remount when there is a text vnode mapping.

Reported and tested by: hrs
Sponsored by: The FreeBSD Foundation
MFC after: 16 days


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# 75d5cb29 26-Apr-2019 Alan Somers <asomers@FreeBSD.org>

fusefs: fix cache invalidation error from r346162

An off-by-one error led to the last page of a write not being removed from
its object, even though that page's buffer was marked as invalid.

PR: 235774
Sponsored by: The FreeBSD Foundation


# 77b82478 23-Apr-2019 Alan Somers <asomers@FreeBSD.org>

Fix bug in vtruncbuf introduced by r346162

r346162 factored out v_inval_buf_range from vtruncbuf, but it made an error
in the interface between the two. The result was a failure to remove
buffers past the first. Surprisingly, I couldn't reproduce the failure with
file systems other than fuse.

Also, modify fusefs's truncate_discards_cached_data test to catch this bug.

PR: 346162
Sponsored by: The FreeBSD Foundation


# 6af6fdce 12-Apr-2019 Alan Somers <asomers@FreeBSD.org>

fusefs: evict invalidated cache contents during write-through

fusefs's default cache mode is "writethrough", although it currently works
more like "write-around"; writes bypass the cache completely. Since writes
bypass the cache, they were leaving stale previously-read data in the cache.
This commit invalidates that stale data. It also adds a new global
v_inval_buf_range method, like vtruncbuf but for a range of a file.

PR: 235774
Reported by: cem
Sponsored by: The FreeBSD Foundation


# c11cbfd9 11-Mar-2019 Kirk McKusick <mckusick@FreeBSD.org>

Update the main loop in the flushbuflist() routine to properly select
buffers for flushing when requested to flush both normal and extended
attributes buffers.

Sponsored by: Netflix


# 735835ed 08-Jan-2019 Michael Tuexen <tuexen@FreeBSD.org>

Avoid overfow in vtruncbuf()

Using daddr_t instead of int avoids trunclbn to become negative when it
shouldn't.
This isssue was found by running syzkaller.

Reviewed by: mckusick, kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18763


# c0029546 27-Dec-2018 Kirk McKusick <mckusick@FreeBSD.org>

When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by: Christopher Krah <krah@protonmail.com>
Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix

M sys/fs/ext2fs/ext2_vnops.c
M sys/kern/vfs_subr.c
M sys/ufs/ffs/ffs_snapshot.c
M sys/ufs/ufs/ufs_vnops.c


# 6c59824b 23-Dec-2018 Konstantin Belousov <kib@FreeBSD.org>

Properly test for vmio buffer in bnoreuselist().

The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind. Buffer might has been allocated before the
object created, then the buffer is malloced. Although we try to avoid
such situation, it seems to be still legitimate.

Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation


# cc426dd3 11-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

Remove unused argument to priv_check_cred.

Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by: The FreeBSD Foundation


# 1436ff1e 17-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Typo.

X-MFC with: r337974


# 3ccbdc82 17-Aug-2018 Mark Johnston <markj@FreeBSD.org>

Add INVARIANTS-only fences around lockless vnode refcount updates.

Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update. The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by: jhibbits
Reviewed by: kib, mjg
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16756


# 3f9e1fc8 06-Jun-2018 Justin Hibbits <jhibbits@FreeBSD.org>

Revert r334708

This is the wrong place to put the barrier.
Requested by: kib,mjg


# 32c369f4 05-Jun-2018 Justin Hibbits <jhibbits@FreeBSD.org>

Add a memory barrier after taking a reference on the vnode holdcnt in _vhold

This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.

On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.

It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.


# 84482abd 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

vfs: annotate variables only used by debug builds as __unused


# 0e5c6bd4 04-May-2018 Jamie Gritton <jamie@FreeBSD.org>

Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by: kib
Differential Revision: D14681


# 6469bdcd 06-Apr-2018 Brooks Davis <brooks@FreeBSD.org>

Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941


# f4043145 28-Mar-2018 Andriy Gapon <avg@FreeBSD.org>

ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count

It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.

Reviewed by: mjg
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D14869


# 06220fa7 19-Feb-2018 Jeff Roberson <jeff@FreeBSD.org>

Further parallelize the buffer cache.

Provide multiple clean queues partitioned into 'domains'. Each domain manages
its own bufspace and has its own bufspace daemon. Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.

Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14274


# 5a35a042 31-Jan-2018 Kirk McKusick <mckusick@FreeBSD.org>

One of the vnode fields listed by vn_printf is the union of pointers
whose type depends on the type of vnode. Correct vn_printf so that
it correctly identifies the name of the pointer that it is printing.

Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week


# 31c2c6e9 12-Jan-2018 Mateusz Guzik <mjg@FreeBSD.org>

vfs: tidy up vdrop

Skip vfs_refcount_release_if_not_last if the interlock is held and just
go straight to refcount_release.

While here do cosmetic rearrangement of _vhold to better show it contains
equivalent behaviour.


# caa7e52f 26-Dec-2017 Eitan Adler <eadler@FreeBSD.org>

kernel: Fix several typos and minor errors

- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by: imp, benno


# 151ba793 24-Dec-2017 Alexander Kabaev <kan@FreeBSD.org>

Do pass removing some write-only variables from the kernel.

This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# a3e8a25a 20-Oct-2017 Mark Johnston <markj@FreeBSD.org>

Avoid the nbp lookup in the final loop iteration in flushbuflist().

The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loops termination cases to
make the code a bit less fragile. No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12730


# fa00affd 17-Oct-2017 Mark Johnston <markj@FreeBSD.org>

Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL().

MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes,
but the VI_DOOMED check it used was done without the vnode interlock
held, so it could race with a concurrent vgone().

Submitted by: Don Morris <don.morris@isilon.com>
Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12704


# 5bf94937 19-Sep-2017 Konstantin Belousov <kib@FreeBSD.org>

For unlinked files, do not msync(2) or sync on the vnode deactivation.

One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.

Reported and tested by: tjil
PR: 222356
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12411


# 8359a6b7 28-Aug-2017 Bryan Drewery <bdrewery@FreeBSD.org>

Allow vdrop() of a vnode not yet on the per-mount list after r306512.

The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling. Some downstream consumers may rely on
this support. Normally insmntque() failing is fine since is uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.

Also assert that vhold() cannot be used on such a vnode.

Reviewed by: kib, cem, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12126


# b59ea730 20-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Allow vinvalbuf() to operate with the shared vnode lock.

This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use. We only need
that all buffers existed at the time of the function start were
flushed. In fact, only one assert has to be relaxed.

In collaboration with: pho
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12083


# 0c3c207f 02-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

For UNIX sockets make vnode point not to the socket, but to the UNIX PCB,
since the latter is the thing that links together VFS and sockets.

While here, make the union in the struct vnode anonymous.


# 391aba32 15-May-2017 Konstantin Belousov <kib@FreeBSD.org>

mnt_vnode_next_active: use conventional lock order when trylock fails.

Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.

Submitted by: Tom Rix <trix@juniper.net>
Tested by: pho
Differential revision: https://reviews.freebsd.org/D10692


# 0226f659 05-Apr-2017 Konstantin Belousov <kib@FreeBSD.org>

Add V_VMIO flag for vinvalbuf(9) to indicate that the flush request
was issued during VM-initiated i/o (pageout), so that the function
does not try to flush or remove pages or wait for the vm object
paging-in-progress counter.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241


# db362553 04-Apr-2017 Brooks Davis <brooks@FreeBSD.org>

Correct a kernel stack leak in 32-bit compat when vfc_name is short.

Don't zero unused pointer members again.

Per discussion with secteam we are not issuing an advisory for this
issue as we have no current evidence it leaks exploitable information.

Reviewed by: rwatson, glebius, delphij
MFC after: 1 day
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10227


# e3f87f6c 12-Mar-2017 Ian Lepore <ian@FreeBSD.org>

Change 'Hz' back to 'HZ'... it's referring to the kernel config option
named HZ, not being used as an abbreviation of the unit of measure.


# 8a396640 12-Mar-2017 Ian Lepore <ian@FreeBSD.org>

Correct the abbreviations for microseconds (us, not ms), and for Hz (not HZ).


# 2d78a553 04-Feb-2017 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use atomic_fcmpset in vfs_refcount_*


# 8acac5a9 22-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Improve debugging printf.


# 067115e0 21-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

vfs: hide the getvnode NULL mp message behind DIAGNOSTIC

Since crossmp vnode changes the message was being printed on each boot.

Reported by: trasz
Discussed with: kib


# 41b0046a 31-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9)

Reviewed by: kib


# 5afb134c 12-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vrefact, to be used when the vnode has to be already active

This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by: kib (previous version)


# 64910ddb 26-Nov-2016 Mark Johnston <markj@FreeBSD.org>

Launder VPO_NOSYNC pages upon vnode deactivation.

As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:

1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
reclaimed, and this work may be unfairly deferred to the vnlru process
or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
the laundry thread needs to launder pages from an unreferenced such
vnode, it will reactivate and deactivate the vnode with each laundering,
potentially resulting in a large number of expensive scans.

Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.

Reviewed by: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8641


# c6c44ff7 08-Oct-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: clear the tmp free list flag before taking the free vnode list lock

Safe access is already guaranteed because of the mnt_listmx lock.


# 32641585 06-Oct-2016 Bryan Drewery <bdrewery@FreeBSD.org>

vrefl: Assert that the interlock is held.

Sponsored by: Dell EMC Isilon
MFC after: 2 weeks


# 5a22c958 06-Oct-2016 Bryan Drewery <bdrewery@FreeBSD.org>

Add vrecyclel() to vrecycle() a vnode with the interlock already held.

Obtained from: OneFS
Sponsored by: Dell EMC Isilon
MFC after: 2 weeks


# 0617f64e 04-Oct-2016 Bryan Drewery <bdrewery@FreeBSD.org>

Correct some comments after r294299.

Sponsored by: Dell EMC Isilon


# 5bb81f9b 30-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: batch free vnodes in per-mnt lists

Previously free vnodes would always by directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.

syncer runs always return the batch regardless of its size.

While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.

Reviewed by: kib
Tested by: pho


# 8660b707 30-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the __bo_vnode field from struct vnode

The pointer can be obtained using __containerof instead.

Reviewed by: kib


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# f83cc0aa 12-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Print vnode details when vnode locking assertion gets triggered.

MFC after: 1 month


# 411455a8 10-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after: 1 month


# 7b255097 04-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused - never actually implemented - vnode lock types
from vnode_if.src.

MFC after: 1 month


# 725496ce 11-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix grammar.

Submitted by: alc
MFC after: 2 weeks


# 19efd8a5 11-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers. BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone). Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by: David Cross <dcrosstech@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d12d6a84 02-Jul-2016 Konstantin Belousov <kib@FreeBSD.org>

Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread. As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by: The FreeBSD Foundation (kib)
Approved by: re (gjb)


# d9a503be 20-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix typo. Note that atomic is still required even for interlocked case.

Sponsored by: The FreeBSD Foundation
Approved by: re (marius)


# e896fb3b 17-Jun-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels

This removes calls to empty functions like vop_lock_{pre/post} from
common vfs routines.

Approved by: re (gjb)


# f8a75278 17-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Add VFS interface to flush specified amount of free vnodes belonging
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.

Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)


# f7bd2217 31-May-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Cosmetics - add missing space after ellipses in shutdown messages.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 27d4b35f 16-May-2016 Andriy Gapon <avg@FreeBSD.org>

vfs_read_dirent: increment ncookies after adding a cookie

It seems that at present vfs_read_dirent() is used only with filesystems
that do not support cookies, so the bug never manifested itself.

MFC after: 1 week


# c89e1b87 03-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Add EVFILT_VNODE open, read and close notifications.

While there, order EVFILT_VNODE notes descriptions alphabetically.

Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks


# f7b71c8a 02-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Issue NOTE_EXTEND when a directory entry is added to or removed from
the monitored directory as the result of rename(2) operation. The
renames staying in the directory are not reported.

Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks


# bd2ead6b 02-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Fix reporting of NOTE_LINK when directory link count changes due to
rename removing or adding subdirectory entry.

Discussed with and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
NetBSD PR: 48958 (http://gnats.netbsd.org/48958)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# 55e0987a 26-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: extend use of the howmany() macro when available.

We have a howmany() macro in the <sys/param.h> header that is
convenient to re-use as it makes things easier to read.


# 0791e0c0 24-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation. Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode. Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects. Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by: marius (i386, previous version)
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# fa48f413 17-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

In bnoreuselist(), check both ends of the specified logical block
numbers range.

This effectively skips indirect and extdata blocks on the buffer
queue. Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 793c3817 18-Jan-2016 Mark Johnston <markj@FreeBSD.org>

Add vrefl(), a locked variant of vref(9).

This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from: kib (sys/ portion)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4947
Differential Revision: https://reviews.freebsd.org/D4953


# 1041e090 05-Jan-2016 Konstantin Belousov <kib@FreeBSD.org>

Two fixes for excessive iterations after r292326.

Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup. This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep. Only retry lookup and
lock for the same queue and same logical block number.

Reported by: benno
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 106ebb76 16-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno. This significantly speeds up the
iteration for sparce-populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation


# 8549b4b9 16-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

Simplify the loop step in the flushbuflist() and make it independed on
the type stability of the buffers memory. Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation


# d9ea698c 03-Dec-2015 Kirk McKusick <mckusick@FreeBSD.org>

We need to zero out the clustering variables in a freed vnode structure.
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).

Reported by; Rick Macklem
Fix from: Mateusz Guzik
PR: 204949


# 003a7c2b 02-Dec-2015 Kirk McKusick <mckusick@FreeBSD.org>

We need to zero out the union of pointers in a freed vnode structure.

PR: 204949
Fix from: Mateusz Guzik
Tested by: Jason Unovitch


# 41d4f103 29-Nov-2015 Kirk McKusick <mckusick@FreeBSD.org>

As the kernel allocates and frees vnodes, it fully initializes them
on every allocation and fully releases them on every free. These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.

As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.

Reviewed by: kib
Tested by: Peter Holm


# f186a80d 26-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Remove VI_AGE vnode iflag, it is unused.

Noted by: bde
Sponsored by: The FreeBSD Foundation


# b3162b45 26-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Move the comment about resident pages preventing vnode from leaving
active list, into the header comment for vdrop(), which is the
function that decides whether to leave the vnode on the list. Note
that dirty page write-out in vinactive() is asynchronous.

Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 547831b6 24-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Rework the vnode cache recycling to meet free and unused vnodes
targets. See the comment above wantfreevnodes variable for the
description of the algorithm.

The vfs.vlru_alloc_cache_src sysctl is removed. New code frees
namecache sources as the last chance to satisfy the highest watermark,
instead of selecting the source vnodes randomly. This provides good
enough behaviour to keep vn_fullpath() working in most situations.
The filesystem layout with deep trees, where the removed knob was
required, is thus handled automatically.

Submitted by: bde
Discussed with: mckusick
Tested by: pho
MFC after: 1 month


# 09c837b8 20-Nov-2015 Gleb Smirnoff <glebius@FreeBSD.org>

Remove remnants of the old NFS from vnode pager.

Reviewed by: kib
Sponsored by: Netflix


# 0a805de6 26-Sep-2015 Mark Johnston <markj@FreeBSD.org>

Remove a check for a condition that is always false by a preceding KASSERT
that was added in r144704.


# d925c2e8 26-Sep-2015 Mark Johnston <markj@FreeBSD.org>

Fix argument ordering in vn_printf().

MFC after: 3 days


# 55d33667 15-Sep-2015 Conrad Meyer <cem@FreeBSD.org>

kevent(2): Note DOOMED vnodes with NOTE_REVOKE

In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)

Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3675


# 17518b1a 05-Sep-2015 Kirk McKusick <mckusick@FreeBSD.org>

Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks


# c9ba6504 24-Aug-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make vfs_unmountall() unmount /dev after /, not before. The only
reason this didn't result in an unclean shutdown is that devfs ignores
MNT_FORCE flag.

Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3467


# 6e572e08 23-Aug-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
remove KASSERT that would prevent forced devfs unmount from working.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 2433a4eb 05-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Make it possible to implement poll(2) on top of kqueue(2).

It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.

Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.

Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3303


# 57a73b26 04-Aug-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Mark vgonel() as static. It was already declared static earlier;
no idea why compilers don't warn about this.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 752fc07d 16-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: implement v_holdcnt/v_usecount manipulation using atomic ops

Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by: kib (earlier version)
Tested by: pho


# c634b752 11-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: always clear VI_OWEINACT in consumers bumping v_usecount

Previously vputx would detect the condition and clear the flag.

With this change it is invalid to have both v_usecount > 0 and the flag
set. Assert the condition is met in all revlevant places.

Reviewed by: kib


# 2d1ca3cd 11-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: move si_usecount manipulation to dedicated functions

Reviewed by: kib


# cf88021a 11-Jul-2015 Konstantin Belousov <kib@FreeBSD.org>

Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed. Buffer
cache cannot write such buffers. Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks. Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 8bbd1f25 05-Jul-2015 Mark Johnston <markj@FreeBSD.org>

Remove a stale descriptive comment for gbincore().

The splay trees referenced in the comment were converted to
path-compressed tries in r250551.

MFC after: 3 days


# 1977bd23 23-Jun-2015 John-Mark Gurney <jmg@FreeBSD.org>

zero this struct as it depends upon it...

Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D2890


# 1eabd967 16-Jun-2015 Konstantin Belousov <kib@FreeBSD.org>

vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active. This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list. Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation. In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list. When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 780dca1b 27-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Right now, dounmount() is called with unreferenced mount point.
Nothing stops a parallel unmount to suceed before the given call to
dounmount() checks and locks the covered vnode. Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced. dounmount()
consumes the reference on return, regardless of the sucessfull or
erronous result.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# dda11d4a 15-Apr-2015 Rick Macklem <rmacklem@FreeBSD.org>

File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by: kib
MFC after: 2 weeks


# 08189ed6 27-Feb-2015 Konstantin Belousov <kib@FreeBSD.org>

The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
prevent errors from yanking devices out from under filesystems. Only
care about special vnodes on devfs, special nodes on other kinds of
filesystems do not have special properties.

Sponsored by: EMC / Isilon Storage Division
Submitted by: Conrad Meyer
MFC after: 1 week


# c514f051 17-Feb-2015 Enji Cooper <ngie@FreeBSD.org>

Add the mnt_lockref field to the ddb(4) 'show mount' command

MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D1688
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC / Isilon Storage Division


# 1b76e0b7 14-Feb-2015 John Baldwin <jhb@FreeBSD.org>

Add two new counters for vnode life cycle events:
- vfs.recycles counts the number of vnodes forcefully recycled to avoid
exceeding kern.maxvnodes.
- vfs.vnodes_created counts the number of vnodes created by successful
calls to getnewvnode().

Differential Revision: https://reviews.freebsd.org/D1671
Reviewed by: kib
MFC after: 1 week


# a77a1234 25-Jan-2015 John Baldwin <jhb@FreeBSD.org>

Change the default VFS timestamp precision from seconds to microseconds.

Discussed on: arch@
MFC after: 2 weeks


# ea117d17 13-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

The vinactive() call in vgonel() may start writes for the dirty pages,
creating delayed write buffers belonging to the reclaimed vnode. Put
the buffer cleanup code after inactivation.

Add asserts that ensure that buffer queues are empty and add BO_DEAD
flag for bufobj to check that no buffers are added after the cleanup.
BO_DEAD is only used by INVARIANTS-enabled kernels.

Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# a77c72f5 09-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Apply chunk forgotten in r275620. Remove local variable for real.

CID: 1257462
Sponsored by: The FreeBSD Foundation


# a25100c5 08-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add functions syncer_suspend() and syncer_resume(), which are supposed
to be called before suspension and after resume, correspondingly. The
syncer_suspend() ensures that all filesystems dirty data and metadata
are saved to the permanent storage, and stops kernel threads which
might modify filesystems. The syncer_resume() restores stopped
threads.

For now, only syncer is stopped. This is needed, because each sync
loop causes superblock updates for UFS.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 32e7f8e4 14-Oct-2014 Mateusz Guzik <mjg@FreeBSD.org>

Don't take devmtx unnecessarily in vn_isdisk.

MFC after: 1 week


# 9832a24d 01-Oct-2014 Will Andrews <will@FreeBSD.org>

In the syncer, drop the sync mutex while patting the watchdog.

Some watchdog drivers (like ipmi) need to sleep while patting the watchdog.
See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK).

Submitted by: asomers
MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 637548 on 2012/10/04


# 168f4ee0 02-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Remove Giant acquisition from the mount and unmount pathes.

It could be claimed that two things were reasonable protected by
Giant. One is vfsconf list links, which is converted to the new
dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is
now updated with atomics.

Note that vfc_refcount still has the same races now as it has under
the Giant, the unload of filesystem modules can happen while the
module is still in use.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 634012b9 29-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Remove one-time use macros which check for the vnode lifecycle. More,
some parts of the checks are in fact redundand in the surrounding
code, and it is more clear what the conditions are by direct testing
of the flags. Two of the three macros were only used in assertions.

In vnlru_free(), all relevant parts of vholdl() were already inlined,
except the increment of v_holdcnt itself. Do not call vholdl() to do
the increment as well, this allows to make assertions in
vholdl()/vhold() more strict.

In v_incr_usecount(), call vholdl() before incrementing other ref
counters. The change is no-op, but it makes less surprising to see
the vnode state in debugger if interrupted inside v_incr_usecount().

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 781c93d4 11-Jun-2014 Alexander Motin <mav@FreeBSD.org>

Implement simple direct-mapped cache for popular filesystem identifiers to
avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
traversing through the list of mount points.

This change significantly improves NFS server scalability, since it had
to do this translation for every request, and the global lock becomes quite
congested.

This code is more optimized for relatively small number of mount points.
On systems with hundreds of active mount points this simple cache may have
many collisions. But the original traversal code in that case should also
behave much worse, so we are not loosing much.

Reviewed by: attilio
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# 4f655310 10-Jun-2014 Alexander Motin <mav@FreeBSD.org>

Remove unneeded mountlist_mtx acquisition from sync_fsync().

All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.


# 3345d73c 08-Jun-2014 Alexander Motin <mav@FreeBSD.org>

Remove extra branching from r267232.

MFC after: 2 weeks


# 590d6363 08-Jun-2014 Alexander Motin <mav@FreeBSD.org>

Use atomics to modify numvnodes variable.

This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(),
that reduces number of global vnode_free_list_mtx mutex acquisitions
from 4 to 2 per NFS request on ZFS, improving SMP scalability.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.


# bf09eca2 20-May-2014 Benjamin Kaduk <bjk@FreeBSD.org>

Check for mismatched vref()/vdrop()

Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop(). This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.

Reviewed by: kib
Approved by: hrs (mentor)


# 44f1c916 22-Mar-2014 Bryan Drewery <bdrewery@FreeBSD.org>

Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division


# 2c1531e7 09-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not flush buffers when the v_object of the passed vnode does not
really belong to it. Such vnodes, with the pointers to other vnodes
v_objects, are typically instantiated by the bypass filesystems.
Invalidating mappings of other vnode pages and the pages is wrong,
since reclamation of the upper vnode does not imply that lower vnode
is reclaimed too.

One of the consequences of the improper reclamation was destruction of
the wired mappings of the lower vnode pages, triggering miscellaneous
assertions in the VM system.

Reported by: John Marshall <john.marshall@riverwillow.com.au>
Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)


# d6498b15 01-Oct-2013 Konstantin Belousov <kib@FreeBSD.org>

When printing the vnode information from ddb, print the lengths of the
dirty and clean buffer queues.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)


# fe39412e 29-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

For vunref(), try to upgrade the vnode lock if the function was called
with the vnode shared-locked. If upgrade succeeded, the inactivation
can be done immediately, instead of being postponed.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (glebius)


# 27884e3b 26-Sep-2013 Konstantin Belousov <kib@FreeBSD.org>

Acquire a hold reference on the vnode when a knote is instantiated.
Otherwise, knote keeps a pointer to a vnode which could become invalid
any time.

Reported by: many
Tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)


# 4593c0ad 17-Aug-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated.
Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().


# 2c38cc79 28-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

When creation of the v_pollinfo raced and our instance of vpollinfo
must be destroyed, knlist_clear() and seldrain() calls could be
avoided, since vpollinfo was not used. More, the knlist_clear()
calling protocol requires the knlist locked, which is not true at the
call site.

Split the destruction into the helper destroy_vpollinfo_free(), and
call it when raced, instead of destroy_vpollinfo().

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# a8b0523a 17-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

Clear the vnode knotes before destroying vpollinfo.

Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d39116f5 03-Jun-2013 Konstantin Belousov <kib@FreeBSD.org>

Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list. Instead of
yielding, pause for 1 tick. The change is reported to help in some
virtualized environments.

Submitted by: Roger Pau Monn? <roger.pau@citrix.com>
Discussed with: jilles
Tested by: pho
MFC after: 2 weeks


# 22a72260 30-May-2013 Jeff Roberson <jeff@FreeBSD.org>

- Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf


# f2cc1285 11-May-2013 Jeff Roberson <jeff@FreeBSD.org>

- Add a new general purpose path-compressed radix trie which can be used
with any structure containing a uint64_t index. The tree code
auto-generates type safe wrappers.
- Eliminate the buf splay and replace it with pctrie. This is not only
significantly faster with large files but also allows for the possibility
of shared locking.

Reviewed by: alc, attilio
Sponsored by: EMC / Isilon Storage Division


# 0fc6daa7 11-May-2013 Konstantin Belousov <kib@FreeBSD.org>

- Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
be dropped.

- Fix a wart which existed from the introduction of the nullfs
caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
It should be innocent, but now it is also formally safe. Inform the
nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
unlinking. When inactivating a nullfs vnode, check if the lower
vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
on the lower vnode, and reclaim upper vnode if so. This allows
nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
excessive caching.

Reported by: G??ran L??wkrantz <goran.lowkrantz@ismobile.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# e63091ea 09-May-2013 Marcel Moolenaar <marcel@FreeBSD.org>

Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
WITNESS_NO_VNODE option.


# 770b41b3 04-May-2013 Matthew D Fleming <mdf@FreeBSD.org>

Add missing vdrop() in error case.

Submitted by: Fahad (mohd.fahadullah@isilon.com)
MFC after: 1 week


# 64fa8df6 16-Apr-2013 Rick Macklem <rmacklem@FreeBSD.org>

Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by: pho
MFC after: 2 weeks


# 26089666 06-Apr-2013 Jeff Roberson <jeff@FreeBSD.org>

Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
cache.
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division


# 89f6b863 08-Mar-2013 Attilio Rao <attilio@FreeBSD.org>

Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
sys/mutex.h in consumers directly to cater its inlining functions
using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
At this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit. Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho


# 10b4bb0b 13-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Add a trivial comment to record the proper commit log for r245407:

Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address. Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed). For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.

Suggested, reviewed and tested by: peter
Sponsored by: The FreeBSD Foundation
MFC after: 5 days


# a41df848 13-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
#define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
#define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)

+static int vnsz2log;

/*
* Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
static void
vntblinit(void *dummy __unused)
{
+ u_int i;
int physvnodes, virtvnodes;

/*
@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
syncer_maxdelay = syncer_mask + 1;
mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
cv_init(&sync_wakeup, "syncer");
+ for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+ vnsz2log++;
+ vnsz2log--;
}
SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL);

@@ -1067,6 +1072,14 @@ alloc:
}
rangelock_init(&vp->v_rl);

+ /*
+ * For the filesystems which do not use vfs_hash_insert(),
+ * still initialize v_hash to have vfs_hash_index() useful.
+ * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+ * its own hashing.
+ */
+ vp->v_hash = (uintptr_t)vp >> vnsz2log;
+
*vpp = vp;
return (0);
}


# c92c859b 26-Dec-2012 Attilio Rao <attilio@FreeBSD.org>

Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1
case. There is no point in optimizing further the code and use a TRUE
litteral for a path that does heavyweight stuff anyway (like lock acq),
at the price of obfuscated code.

Use the appropriate check where necessary and remove a macro.

Sponsored by: EMC / Isilon storage division
MFC after: 3 days


# b1308d72 21-Dec-2012 Attilio Rao <attilio@FreeBSD.org>

Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There are no evidences that consumers could bear with such change in
semantic and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week


# 14df601e 14-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

When mnt_vnode_next_active iterator cannot lock the next vnode and
yields, specify the user priority for the yield. Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.

On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.

Restructure the iteration initializer and the iterator to remove code
duplication. Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.

Reported by: hrs, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days


# 686ffcac 10-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Do not yield while owning a mutex. The Giant reacquire in the
kern_yield() is problematic than.

The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.

While there, augment the unconditional yield by some amount of
spinning [1].

Reported and tested by: pho
Reviewed by: attilio
Submitted by: attilio [1]
MFC after: 3 days


# 07840861 03-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

The vnode_free_list_mtx is required unconditionally when iterating
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active lists iterators helper functions
did not provided the neccessary stability for the list, allowing the
iterators to pick garbage.

This was uncovered after the r243599 made the active list iterators
non-nop.

Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.

Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.

Reported and tested by: pho
MFC after: 1 week


# 3da9ab75 26-Nov-2012 David Xu <davidxu@FreeBSD.org>

Take first active vnode correctly.

Reviewed by: kib
MFC after: 3 days


# 6b991098 24-Nov-2012 Andriy Gapon <avg@FreeBSD.org>

assert_vop_locked: make the assertion race-free and more efficient

this is really a minor improvement for the sake of correctness

MFC after: 6 days


# 4f15bb67 22-Nov-2012 Andriy Gapon <avg@FreeBSD.org>

remove vop_lookup_pre and vop_lookup_post

Suggested by: kib
MFC after: 5 days


# 973b795b 19-Nov-2012 Attilio Rao <attilio@FreeBSD.org>

insmntque() is always called with the lock held in exclusive mode,
then:
- assume the lock is held in exclusive mode and remove a moot check
about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.

Reviewed by: kib
MFC after: 2 weeks


# ab49c952 19-Nov-2012 Andriy Gapon <avg@FreeBSD.org>

assert_vop_locked should treat LK_EXCLOTHER as the not locked case

... from a perspective of the current thread.

Spotted by: mjg
Discussed with: kib
MFC after: 18 days


# c496727c 19-Nov-2012 Andriy Gapon <avg@FreeBSD.org>

vnode_if: fix locking protocol description for lookup and cachedlookup

Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially
redundant).

Discussed with: kib
MFC after: 10 days


# bc2258da 09-Nov-2012 Attilio Rao <attilio@FreeBSD.org>

Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.


# 76fd782c 05-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

A clarification to the behaviour of the active vnode list management
regarding the vnode page cleaning.

In collaboration with: pho
MFC after: 1 week


# 90af5793 04-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command.

MFC after: 3 weeks


# fb819415 04-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command.

MFC after: 3 days


# df3161c7 04-Nov-2012 Konstantin Belousov <kib@FreeBSD.org>

Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after: 3 days


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 9b233e23 14-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a KPI to allow to reserve some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).

Reviewed by: avg
Tested by: pho
MFC after: 1 week


# 0a15e5d3 13-Sep-2012 Attilio Rao <attilio@FreeBSD.org>

Remove all the checks on curthread != NULL with the exception of some MD
trap checks (eg. printtrap()).

Generally this check is not needed anymore, as there is not a legitimate
case where curthread != NULL, after pcpu 0 area has been properly
initialized.

Reviewed by: bde, jhb
MFC after: 1 week


# bcd5bb8e 09-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a facility for vgone() to inform the set of subscribed mounts
about vnode reclamation. Typical use is for the bypass mounts like
nullfs to get a notification about lower vnode going away.

Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument
lowervp which is reclaimed. It is possible to register several
reclamation event listeners, to correctly handle the case of several
nullfs mounts over the same directory.

For the filesystem not having nullfs mounts over it, the overhead
added is a single mount interlock lock/unlock in the vnode reclamation
path.

In collaboration with: pho
MFC after: 3 weeks


# 258f9442 22-Aug-2012 Konstantin Belousov <kib@FreeBSD.org>

Provide some compat32 shims for sysctl vfs.conflist. It is required
for getvfsbyname(3) operation when called from 32bit process, and
getvfsbyname(3) is used by recent bsdtar import.

Reported by: many
Tested by: David Naylor <naylor.b.david@gmail.com>
MFC after: 5 days


# 7adc598a 03-Jun-2012 Andriy Gapon <avg@FreeBSD.org>

free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG

Those calls are useful with hardware watchdog drivers too.

MFC after: 3 weeks


# 8f0e9130 30-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a rangelock implementation, intended to be used to range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after: 2 month


# af6e6b87 23-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused thread argument to vrecycle().

Reviewed by: kib


# c52fd858 23-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# dca5e0ec 20-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync
routines.

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# f257ebbb 20-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 16165fee 18-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Delete a no longer useful VNASSERT missed during changes in 234400.

Suggested by: kib


# 60005d66 18-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Fix a memory leak of M_VNODE_MARKER introduced in 234386.

Found by: Peter Holm


# 73305eb8 17-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after: 2 weeks


# 71469bb3 17-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if its is DOOMED or other possible end of life senarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# ecb6e528 11-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after: 2 weeks


# 38ddb572 08-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Decomission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after: 1 week


# 662c901c 25-Feb-2012 Mikolaj Golub <trociny@FreeBSD.org>

When detaching an unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by: kib
Reviewed by: kib
MFC after: 2 weeks


# c480f781 06-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks


# abc942b5 25-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

When doing vflush(WRITECLOSE), clean vnode pages.

Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is
still a race allowing a process to dirty pages after msync
finished. Remounts rw->ro just left dirty pages in system.

Reviewed by: alc, tegge (long time ago)
Tested by: pho
MFC after: 2 weeks


# cc672d35 16-Jan-2012 Kirk McKusick <mckusick@FreeBSD.org>

Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks


# 908cac07 06-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Use proper argument structure types for the extattr post-VOP hooks.
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.

MFC after: 3 days


# f0d6c5ca 23-Dec-2011 John Baldwin <jhb@FreeBSD.org>

Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by: rwatson
MFC after: 2 weeks


# 936c09ac 03-Nov-2011 John Baldwin <jhb@FreeBSD.org>

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 62238a67 27-Oct-2011 John Baldwin <jhb@FreeBSD.org>

Whitespace fix.


# 4c11f091 25-Oct-2011 Pawel Jakub Dawidek <pjd@FreeBSD.org>

The v_data field is a pointer, so set it to NULL, not 0.

MFC after: 3 days


# 25e33e62 07-Oct-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Change one printf() to log().

As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console.

Approved by: mentor (rwatson)
Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru>
MFC after: 5 days


# 837b4d46 04-Oct-2011 Konstantin Belousov <kib@FreeBSD.org>

Move parts of the commit log for r166167, where Tor explained the
interaction between vnode locks and vfs_busy(), into comment.

MFC after: 1 week


# 6aba400a 25-Aug-2011 Attilio Rao <attilio@FreeBSD.org>

Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before to destroy the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before to destroy the selinfo objects (in order to avoid such case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
like it happens in device drivers which installs selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the filedescriptors should be already
closed, thus there cannot be a race.
For this case, mfi(4) device driver can be set as an example, as it
implements a full correct logic for preventing this from happening.

Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks


# d716efa9 24-Jul-2011 Kirk McKusick <mckusick@FreeBSD.org>

Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)


# 6bbee8e2 29-Jun-2011 Alan Cox <alc@FreeBSD.org>

Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this
change.

Reviewed by: kib


# 5e26234f 24-Jun-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Tidy up a capabilities-related comment.

This comment refers to an #ifdef that hasn't been merged [yet?]; remove it.

Approved by: rwatson


# 3d08a76b 12-May-2011 Matthew D Fleming <mdf@FreeBSD.org>

Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >


# 2be767e0 28-Apr-2011 Attilio Rao <attilio@FreeBSD.org>

Add the watchdogs patting during the (shutdown time) disk syncing and
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, fi they take too long.

I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependant by SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding not-volountary watchdog
activations).

Sponsored by: Sandvine Incorporated
Discussed with: emaste, des
MFC after: 2 weeks


# c65c068a 23-Apr-2011 Rick Macklem <rmacklem@FreeBSD.org>

Fix a LOR in vfs_busy() where, after msleeping, it would lock
the mutexes in the wrong order for the case where the
MBF_MNTLSTLOCK is set. I believe this did have the
potential for deadlock. For example, if multiple nfsd threads
called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK.
Thanks go to pho for catching this during his testing.

Tested by: pho
Submitted by: kib
MFC after: 2 weeks


# 443db695 04-Apr-2011 Sergey Kandaurov <pluknet@FreeBSD.org>

Remove malloc type M_NETADDR unused since splitting into vfs_subr.c
and vfs_export.c.

MFC after: 1 week


# fd7032e1 08-Mar-2011 Konstantin Belousov <kib@FreeBSD.org>

Do not assert buffer lock in VFS_STRATEGY() when kernel already paniced.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e7ceb1e9 07-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yeild() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.


# 08b163fa 02-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week


# dbccdf76 25-Jan-2011 Konstantin Belousov <kib@FreeBSD.org>

When vtruncbuf() iterates over the vnode buffer list, lock buffer object
before checking the validity of the next buffer pointer. Otherwise, the
buffer might be reclaimed after the check, causing iteration to run into
wrong buffer.

Reported and tested by: pho
MFC after: 1 week


# 2fee06f0 18-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string.


# fbbb13f9 12-Jan-2011 Matthew D Fleming <mdf@FreeBSD.org>

sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the kernel changes.


# a8f4344f 06-Jan-2011 John Baldwin <jhb@FreeBSD.org>

- Restore dropping the priority of syncer down to PPAUSE when it is idle.
This was lost when it was converted to using a condition variable instead
of lbolt.
- Drop the priority of flowtable down to PPAUSE when it is idle as well
since it is a similar background task.

MFC after: 2 weeks


# 7dbb59c7 26-Dec-2010 Konstantin Belousov <kib@FreeBSD.org>

Teach ddb "show mount" about MNTK_SUJ flag.


# f5eb95b1 23-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Allow shared-locked vnode to be passed to vunref(9).
When shared-locked vnode is supplied as an argument to vunref(9) and
resulting usecount is 0, set VI_OWEINACT and do not try to upgrade vnode
lock. The later could cause vnode unlock, allowing the vnode to be
reclaimed meantime.

Tested by: pho
MFC after: 1 week


# 730b63b0 19-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seems to be unused, and the reported
situation is normal for the forced unmount.

MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c


# 8d065a39 14-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix some more style(9) issues.


# b389be97 14-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix style(9) issues from r215281 and r215282.

MFC after: 1 week


# 5d7abc87 14-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Add descriptions to some more sysctls.

PR: kern/148510
MFC after: 1 week


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 9a24dc07 11-Sep-2010 Konstantin Belousov <kib@FreeBSD.org>

Protect mnt_syncer with the sync_mtx. This prevents a (rare) vnode leak
when mount and update are executed in parallel.

Encapsulate syncer vnode deallocation into the helper function
vfs_deallocate_syncvnode(), to not externalize sync_mtx from vfs_subr.c.

Found and reviewed by: jh (previous version of the patch)
Tested by: pho
MFC after: 3 weeks


# e5ddf115 01-Sep-2010 Ed Maste <emaste@FreeBSD.org>

As long as we are going to panic anyway, there's no need to hide additional
information behind DIAGNOSTIC.


# de478dd4 30-Aug-2010 Jaakko Heinonen <jh@FreeBSD.org>

execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for an user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) to better agree with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.

PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)


# c87f1ad4 28-Aug-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

There is a bug in vfs_allocate_syncvnode() failure handling in mount code.
Actually it is hard to properly handle such a failure, especially in MNT_UPDATE
case. The only reason for the vfs_allocate_syncvnode() function to fail is
getnewvnode() failure. Fortunately it is impossible for current implementation
of getnewvnode() to fail, so we can assert this and make
vfs_allocate_syncvnode() void. This in turn free us from handling its failures
in the mount code.

Reviewed by: kib
MFC after: 1 month


# 3beb1b72 12-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

The buffers b_vflags field is not always properly protected by
bufobj lock. If b_bufobj is not NULL, then bufobj lock should be
held when manipulating the flags. Not doing this sometimes leaves
BV_BKGRDINPROG to be erronously set, causing softdep' getdirtybuf() to
stuck indefinitely in "getbuf" sleep, waiting for background write to
finish which is not actually performed.

Add BO_LOCK() in the cases where it was missed.

In collaboration with: pho
Tested by: bz
Reviewed by: jeff
MFC after: 1 month


# 3b156706 03-Aug-2010 Alan Cox <alc@FreeBSD.org>

In order for MAXVNODES_MAX to be an "int" on powerpc and sparc, we must
cast PAGE_SIZE to an "int". (Powerpc and sparc, unlike the other
architectures, define PAGE_SIZE as a "long".)

Submitted by: Andreas Tobler


# 1d7fe4b5 02-Aug-2010 Alan Cox <alc@FreeBSD.org>

Update the "desiredvnodes" calculation. In particular, make the part of
the calculation that is based on the kernel's heap size more conservative.
Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the
time being set MAXVNODES_MAX to a large value.

Reviewed by: jhb@
MFC after: 6 weeks


# 60ae52f7 21-Jun-2010 Ed Schouten <ed@FreeBSD.org>

Use ISO C99 integer types in sys/kern where possible.

There are only about 100 occurences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.


# 52e50b42 18-Jun-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

MFC r209265:

r209260:

Backout r207970 for now, it can lead to deadlocks.

Reported by: kan

r209261:

Turn off UMA allocations on all archs by default. It isn't stable even
on amd64.

Reported by: many

Approved by: re (kib)


# d32ef791 17-Jun-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Backout r207970 for now, it can lead to deadlocks.

Reported by: kan
MFC after: 3 days


# 882da14c 03-Jun-2010 Konstantin Belousov <kib@FreeBSD.org>

Sometimes vnodes share the lock despite being different vnodes on
different mount points, e.g. the nullfs vnode and the covered vnode
from the lower filesystem. In this case, existing assertion in
vop_rename_pre() may be triggered.

Check for vnode locks equiality instead of the vnodes itself to
not trip over the situation.

Submitted by: Mikolaj Golub <to.my.trociny@gmail.com>
Tested by: pho
MFC after: 2 weeks


# 665b912a 24-May-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

MFC r207920,r207934,r207936,r207937,r207970,r208142,r208147,r208148,r208166,
r208454,r208455,r208458:

r207920:

Back out r205134. It is not stable.

r207934:

Add missing new line characters to the warnings.

r207936:

Eventhough r203504 eliminates taste traffic provoked by vdev_geom.c,
ZFS still like to open all vdevs, close them and open them again,
which in turn provokes taste traffic anyway.

I don't know of any clean way to fix it, so do it the hard way - if we can't
open provider for writing just retry 5 times with 0.5 pauses. This should
elimitate accidental races caused by other classes tasting providers created on
top of our vdevs.

Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org>
Reported by: Yuri Pankov <yuri.pankov@gmail.com>

r207937:

I added vfs_lowvnodes event, but it was only used for a short while and now
it is totally unused. Remove it.

r207970:

When there is no memory or KVA, try to help by reclaiming some vnodes.
This helps with 'kmem_map too small' panics.

No objections from: kib
Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua>

r208142:

The whole point of having dedicated worker thread for each leaf VDEV was to
avoid calling zio_interrupt() from geom_up thread context. It turns out that
when provider is forcibly removed from the system and we kill worker thread
there can still be some ZIOs pending. To complete pending ZIOs when there is
no worker thread anymore we still have to call zio_interrupt() from geom_up
context. To avoid this race just remove use of worker threads altogether.
This should be more or less fine, because I also thought that zio_interrupt()
does more work, but it only makes small UMA allocation with M_WAITOK.
It also saves one context switch per I/O request.

PR: kern/145339
Reported by: Alex Bakhtin <Alex.Bakhtin@gmail.com>

r208147:

Add task structure to zio and use it instead of allocating one.
This eliminates the only place where we can sleep when calling zio_interrupt().
As a side-effect this can actually improve performance a little as we
allocate one less thing for every I/O.

Prodded by: kib

r208148:

Allow to configure UMA usage for ZIO data via loader and turn it on by
default for amd64. On i386 I saw performance degradation when UMA was used,
but for amd64 it should help.

r208166:

Fix userland build by making io_task available only for the kernel and by
providing taskq_dispatch_safe() macro.

r208454:

Remove ZIO_USE_UMA from arc.c as well.

r208455:

ZIO_USE_UMA is no longer used.

r208458:

Create UMA zones unconditionally.


# 7fd32ea9 12-May-2010 Zachary Loafman <zml@FreeBSD.org>

Add VOP_ADVLOCKPURGE so that the file system is called when purging
locks (in the case where the VFS impl isn't using lf_*)

Submitted by: Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by: zml, dfr


# 408a7c50 12-May-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

When there is no memory or KVA, try to help by reclaiming some vnodes.
This helps with 'kmem_map too small' panics.

No objections from: kib
Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua>
MFC after: 1 week


# c60c36a7 11-May-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

I added vfs_lowvnodes event, but it was only used for a short while and now
it is totally unused. Remove it.

MFC after: 3 days


# 113db2dd 24-Apr-2010 Jeff Roberson <jeff@FreeBSD.org>

- Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm


# a515de67 18-Apr-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

MFC r206160 by jh@:

Add missing MNT_NFS4ACLS.


# bb9a8424 09-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206093:
Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().


# 0e9bd417 04-Apr-2010 Jaakko Heinonen <jh@FreeBSD.org>

Add missing MNT_NFS4ACLS.


# b9d8d691 03-Apr-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix some whitespace nits.


# 000026c8 03-Apr-2010 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add missing mnt_kern_flag flags in 'show mount' output.


# ea015880 02-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().

Tested by: Mikolaj Golub
MFC after: 1 week


# bd7ae209 27-Mar-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

MFC r197680:

Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both. Note that
this commit makes implementation of either of these two mandatory.

Reviewed by: kib


# e9560b40 07-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r202528:
Add vunref(9).


# d2f334bf 17-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

Add new function vunref(9) that decrements vnode use count (and hold
count) while vnode is exclusively locked.

The code for vput(9), vrele(9) and vunref(9) is merged.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks


# 2d63cbda 10-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r200770:
Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.


# e6b37c3a 31-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r201134:
Add a knob to allow reclaim of the directory vnodes that are source of
the namecache records.


# a4117865 28-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

Add a knob to allow reclaim of the directory vnodes that are source of
the namecache records. The reclamation is not enabled by default because
for typical workload it would make namecache unusable, but large nested
directory tree easily puts any process that accesses filesystem into 1
second wait for vlru.

Reported by: yar (long time ago)
MFC after: 3 days


# 558e9b5c 26-Dec-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND
is not being used improperly.


# 49e3050e 20-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


# a9533616 04-Dec-2009 Jaakko Heinonen <jh@FreeBSD.org>

MFC r199529:

Extend ddb(4) "show mount" command to print active string mount options.
Note that only option names are printed, not values.

Approved by: trasz (mentor)


# 10d843a4 19-Nov-2009 Jaakko Heinonen <jh@FreeBSD.org>

Extend ddb(4) "show mount" command to print active string mount options.
Note that only option names are printed, not values.

Reviewed by: pjd
Approved by: trasz (mentor)
MFC after: 2 weeks


# 2c29cfa0 01-Oct-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both. Note that
this commit makes implementation of either of these two mandatory.

Reviewed by: kib


# e76d823b 12-Sep-2009 Robert Watson <rwatson@FreeBSD.org>

Use C99 initialization for struct filterops.

Obtained from: Mac OS X
Sponsored by: Apple Inc.
MFC after: 3 weeks


# 3c9d279b 12-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r197030:
In vfs_mark_atime(9), be resistent against reclaimed vnodes.
Assert that neccessary locks are taken, since vop might not be called.

Approved by: re (kensmith)


# 427992ec 09-Sep-2009 Konstantin Belousov <kib@FreeBSD.org>

In vfs_mark_atime(9), be resistent against reclaimed vnodes.
Assert that neccessary locks are taken, since vop might not be called.

Tested by: pho
MFC after: 3 days


# f0899a34 02-Jul-2009 Jamie Gritton <jamie@FreeBSD.org>

Call prison_check from vfs_suser rather than re-implementing it.

Approved by: re (kib), bz (mentor)


# d8b0556c 10-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# faef64cc 30-May-2009 Attilio Rao <attilio@FreeBSD.org>

Remove the now invalid (and possibly unused) debug.mpsafevfs
sysctl/tunable.

Reviewed by: emaste
Sponsored by: Sandvine Incorporated


# c97fcdba 30-May-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add VOP_ACCESSX, which can be used to query for newly added V*
permissions, such as VWRITE_ACL. For a filsystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.

Reviewed by: rwatson@


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# dfd233ed 11-May-2009 Attilio Rao <attilio@FreeBSD.org>

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


# 607fc40b 29-Mar-2009 Alexander Kabaev <kan@FreeBSD.org>

Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stoppped being type stable.

Reviewed by: kib


# 5ab4bb35 02-Mar-2009 Alexander Kabaev <kan@FreeBSD.org>

Change vfs_busy to wait until an outcome of pending unmount
operation is known and to retry or fail accordingly to that
outcome. This fixes the problem with namespace traversing
programs failing with random ENOENT errors if someone just
happened to try to unmount that same filesystem at the same
time.

Reported by: dhw
Reviewed by: kib, attilio
Sponsored by: Juniper Networks, Inc.


# 8941aad1 06-Feb-2009 John Baldwin <jhb@FreeBSD.org>

Tweak the output of VOP_PRINT/vn_printf() some.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
from vn_printf().


# ec48c16f 06-Feb-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add KASSERTs to make it easier to debug problems like the one fixed
in r188141.

Reviewed by: kib,attilio
Approved by: rwatson (mentor)
Tested by: pho
Sponsored by: FreeBSD Foundation


# feabc903 05-Feb-2009 Attilio Rao <attilio@FreeBSD.org>

Add more KTR_VFS logging point in order to have a more effective tracing.

Reviewed by: brueffer, kib
Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>


# 91082624 23-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not
updating va_atime via VOP_SETATTR() isn't always synchronous. For some
filesystems it is asynchronous.

Suggested by: bde


# 645f1f4e 23-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Push down Giant in the vlnru kproc main loop so that it is only acquired
around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically,
vnlru no longer needs Giant for the common case of waking up and deciding
there is nothing for it to do.

MFC after: 2 weeks


# 1c570a0c 21-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Fix a few style bogons.

Submitted by: bde


# beace176 21-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the
feature.

Inspired by: ups
Tested by: pho


# 9316467d 20-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitely
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by: csjp
Reviewed by: ups
MFC after: 3 weeks


# 4a0f8076 16-Dec-2008 Attilio Rao <attilio@FreeBSD.org>

1) Fix a deadlock in the VFS:
- threadA runs vfs_rel(mp1)
- threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps
waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
for the refcount to expire.

Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders got awake, fails, and if they retry the
MNTK_REFEXPIRE won't allow them to sleep again.

2) Simplify significantly the code of vfs_mount_destroy() trimming
unnecessary codes:
- as long as any reference exited, it is no-more possible to have
write-op (primarty and secondary) in progress.
- it is no needed to drop and reacquire the mount lock.
- filling the structures with dummy values is unuseful as long as
it is going to be freed.

Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with: kib


# 61791644 29-Nov-2008 Konstantin Belousov <kib@FreeBSD.org>

In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.

Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.

Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.

Noted by: tegge
Reviewed by: dfr
Tested by: pho
MFC after: 1 month


# 1ba4a712 17-Nov-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 30f60d8c 03-Nov-2008 Attilio Rao <attilio@FreeBSD.org>

Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless.
Really, the concept of holdcnt in the struct mount is rappresented by
the mnt_ref (which prevents the type-stable structure from being
"recycled) handled through vfs_ref() and vfs_rel().
On this optic, switch the holdcnt acquisition into an emulated vfs_ref()
(and subsequent release into vfs_rel()).

Discussed with: kib
Tested by: pho


# 83b3bdbc 02-Nov-2008 Attilio Rao <attilio@FreeBSD.org>

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


# 15bc6b2b 28-Oct-2008 Edward Tomasz Napierala <trasz@FreeBSD.org>

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


# 7cd5a03a 27-Oct-2008 Konstantin Belousov <kib@FreeBSD.org>

Style return statements in vn_pollrecord().


# ae53539e 27-Oct-2008 Konstantin Belousov <kib@FreeBSD.org>

Protect check for v_pollinfo == NULL and assignment of the newly allocated
vpollinfo with vnode interlock. Fully initialize vpollinfo before putting
pointer to it into vp->v_pollinfo.

Discussed with: dwhite
Tested by: pho
MFC after: 1 week


# 3cfc3089 20-Oct-2008 Konstantin Belousov <kib@FreeBSD.org>

In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked
MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always
tries to obtain a shared lock on mnt_lock, the other user is unmount who
tries to drain it, setting MNTK_UNMOUNT before.

Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0d7935fd 10-Oct-2008 Attilio Rao <attilio@FreeBSD.org>

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# 59d49325 31-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# a888d54d 28-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmnque() function
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.

Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.

Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.

Use the VV_FORCEINSMQ for the creation of the syncvnode.

Tested by: pho
Reviewed by: tegge
MFC after: 1 month


# e4517337 24-Aug-2008 Christian S.J. Peron <csjp@FreeBSD.org>

Remove worrying printf warning on bootup when processing vnodes which
have NULL mount-points. This is the case for special vnodes, such as the
one used in nameiinit() which is used for crossing mount points in lookup()
to avoid lock ordering issues.

MFC after: 2 weeks
Discussed with: rwatson, kib


# e7ea30e4 29-Jul-2008 Ed Schouten <ed@FreeBSD.org>

Remove the use of lbolt from the VFS syncer.

It seems we only use `lbolt' inside the VFS syncer and the TTY layer
now. Because I'm planning to replace the TTY layer next month, there's
no reason to keep `lbolt' if it's only used in a single thread inside
the kernel.

Because the syncer code wanted to wake up the syncer thread before the
timeout, it called sleepq_remove(). Because we now just use a condvar(9)
with a timeout value of `hz', we can wake it up using cv_broadcast()
without waking up any unrelated threads.

Reviewed by: phk


# 5573021d 27-Jul-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel()
functions.

Reviewed by: kib


# 610507ae 27-Jul-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Move vp test for beeing NULL under IGNORE_LOCK().
- Check if panicstr isn't set, if it is ignore the lock. This helps to avoid
confusion, because lockmgr is a no-op when panicstr isn't NULL, so
asserting anything at this point doesn't make sense and can just race with
other panic.

Discussed with: kib


# 09400d5a 21-Jul-2008 Attilio Rao <attilio@FreeBSD.org>

- Disallow XFS mounting in write mode. The write support never worked really
and there is no need to maintain it.
- Fix vn_get() in order to let it call vget(9) with a valid locking
request. vget(9) returns the vnode locked in order to prevent recycling,
but in this case internal XFS locks alredy prevent it from happening, so
it is safe to drop the vnode lock before to return by vn_get().
- Add a VNASSERT() in vget(9) in order to catch malformed locking requests.

Discussed with: kan, kib
Tested by: Lothar Braun <lothar at lobraun dot de>


# 988f0e19 18-May-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Be more friendly for DDB pager.

Educated by: jhb's BSDCan presentation


# 60e2edce 04-May-2008 Attilio Rao <attilio@FreeBSD.org>

sync_vnode() has some messy code about locking in order to deal with
mount fs needing Giant to be held when processing bufobjs.
Use a different subqueue for pending workitems on filesystems requiring
Giant. This simplifies the code notably and also reduces the number of
Giant acquisitions (and the whole processing cost).

Suggested by: jeff
Reviewed by: kib
Tested by: pho


# 3800322f 26-Apr-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Implement 'show mount' command in DDB. Without argument, it prints short
info about all currently mounted file systems. When an address is given
as an argument, prints detailed info about the given mount point.

MFC after: 2 weeks


# 12e79a9b 24-Apr-2008 Konstantin Belousov <kib@FreeBSD.org>

Allow the vnode zone to return the unused memory. The vnode reference
count is/shall be properly maintained for the long time, and VFS
shall be safe against the vnode memory reclamation.

Proposed by: jeff
Tested by: pho


# eab626f1 16-Apr-2008 Konstantin Belousov <kib@FreeBSD.org>

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


# 1fd9b6a5 02-Apr-2008 Jeff Roberson <jeff@FreeBSD.org>

- Destroy the bo mtx when the vnode is destroyed.


# 71072af5 27-Mar-2008 Attilio Rao <attilio@FreeBSD.org>

b_waiters cannot be adequately protected by the interlock because it is
dropped after the call to lockmgr() so just revert this approach using
something similar to the precedent one:
BUF_LOCKWAITERS() just checks if there are waiters (not the actual number
of them) and it is based on newly introduced lockmgr_waiters() which
returns if the lockmgr has waiters or not. The name has been choosen
differently by old lockwaiters() in order to not confuse them.

KPI results enriched by this commit so __FreeBSD_version bumping and
manpage update will be happening soon.
'struct buf' also changes, so kernel ABI is disturbed.

Bug found by: jeff
Approved by: jeff, kib


# 0ee6cecc 23-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Greatly simplify vget() by removing the guarantee that any new
references to a vnode with VI_OWEINACT set will force the vinactive()
call. The kernel makes no guarantees about which reference was the
last to close a file or when the actual inactive processing will
happen. The previous code was designed to preserve existing semantics
in the face of shared locks, however, this was unnecessary.

Discussed with: mckusick


# e6b2545b 22-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list. This prevents sched_sync() from
re-queueing a vnode which may have been freed already.

Discussed with: kib


# f6a8cecf 22-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode
interlock here anymore.

Reported by: kris


# 698b1a66 22-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 7fbfba7b 01-Mar-2008 Attilio Rao <attilio@FreeBSD.org>

- Handle buffer lock waiters count directly in the buffer cache instead
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() instead than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used

This patch breaks KPI so __FreBSD_version will be bumped and manpages
updated by further commits. Additively, 'struct buf' changes results in
a disturbed ABI also.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>


# 81c794f9 25-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


# 2433c488 08-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Conver all explicit instances to VOP_ISLOCKED(arg, NULL) into
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only acquire curthread as argument; this will lead in axing the additional
argument from both functions, making the code cleaner.

Reviewed by: jeff, kib


# 0e9eb108 23-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is bogus really and no currently used.
Hopefully this will be soonly replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer
present

Addictionally:
- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

KPI results heavilly broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo


# d638e093 19-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

- Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of bogus
BUF_REFCNT() in a more explicative and SMP-compliant way.
This allows us to axe out BUF_REFCNT() and leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of lockmgr() cleanup.

KPI results, obviously, broken so further commits will update manpages
and freebsd version.

Tested by: kris (on UFS and NFS)


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# c5f1beb0 27-Dec-2007 Robert Watson <rwatson@FreeBSD.org>

In "show lockedvnods" DDB command, use db_printf() rather than printf()
so that the results end up in the DDB output stream rather than the
console output stream.

This should likely also be done for the vprint() function it calls.

MFC after: 3 months


# 98e4f2e2 27-Dec-2007 Attilio Rao <attilio@FreeBSD.org>

As LK_EXCLUPGRADE is used in conjuction with LK_NOWAIT, LK_UPGRADE becames
equivalent with this and so operate the switch.

That call is the only one remaining LK_EXCLUPGRADE consumer and removing
it will prepare the ground for LK_EXCLUPGRADE axing and further
lockmgr improvements.

Discussed with: jeff, ups


# 3de213cc 25-Dec-2007 Robert Watson <rwatson@FreeBSD.org>

Add a new 'why' argument to kdb_enter(), and a set of constants to use
for that argument. This will allow DDB to detect the broad category of
reason why the debugger has been entered, which it can use for the
purposes of deciding which DDB script to run.

Assign approximate why values to all current consumers of the
kdb_enter() interface.


# 973bdaa0 05-Dec-2007 Konstantin Belousov <kib@FreeBSD.org>

Use curthread instead of the FIRST_THREAD_IN_PROC for vnlru and syncer,
when applicable.

Aquire Giant slightly later for vnlru.

In the syncer, aquire the Giant only when a vnode belongs to the
non-MPsafe fs.

In both speedup_syncer() and syncer_shutdown(), remove the syncer thread from
the lbolt sleep queue after the syncer state is modified, not before.

Herded by: attilio
Tested by: Peter Holm
Reviewed by: ups
MFC after: 1 week


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# 3745c395 20-Oct-2007 Julian Elischer <julian@FreeBSD.org>

Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
Thos makes way for us to add REAL kthread_create() and friends
that actually make theads. it turns out that most of these
calls actually end up being moved back to the thread version
when it's added. but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.


# 245b2044 12-Sep-2007 Konstantin Belousov <kib@FreeBSD.org>

When restoring the mount after umount failed, the MNTK_UNMOUNT flag
prevents insmntque() from placing reallocated syncer vnode on mount
list, that causes panic in vfs_allocate_syncvnode().

Introduce MNTK_NOINSMNTQ flag, that marks the period when instmntque is
not allowed to success, instead of MNTK_UNMOUNT. The MNTK_NOINSMNTQ is
set and cleared simultaneously with MNTK_UNMOUNT, except on umount error
path, where it is cleaned just before the syncer vnode is going to be
allocated.

Reported by: Peter Jeremy <peterjeremy optushome com au>
Suggested by: tegge
Approved by: re (rwatson)


# 354eb801 13-Aug-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Improve vn_printf() by:
- adding missing vnode flags,
- printing unknown flags as numbers,
- using strlcat() instead of strcat().

Approved by: re (bmah)


# 32f9753c 11-Jun-2007 Robert Watson <rwatson@FreeBSD.org>

Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project


# 2feb50bf 31-May-2007 Attilio Rao <attilio@FreeBSD.org>

Revert VMCNT_* operations introduction.
Probabilly, a general approach is not the better solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)


# e1e8f51b 27-May-2007 Robert Watson <rwatson@FreeBSD.org>

Universally adopt most conventional spelling of acquire.


# d413d210 18-May-2007 Konstantin Belousov <kib@FreeBSD.org>

Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.


# 222d0195 18-May-2007 Jeff Roberson <jeff@FreeBSD.org>

- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>


# 24b0502e 13-Apr-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix jails and jail-friendly file systems handling:
- We need to allow for PRIV_VFS_MOUNT_OWNER inside a jail.
- Move security checks to vfs_suser() and deny unmounting and updating
for jailed root from different jails, etc.

OK'ed by: rwatson


# 6bc3ab25 13-Apr-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

When we are running low on vnodes, there is currently no way to ask other
subsystems to release some vnodes. Implement backpressure based on
vfs_lowvnodes event (similar to vm_lowmem for memory).


# 08be8194 10-Apr-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Minor style cleanups (mostly removal of trailing whitespaces).


# 21ff8c67 10-Apr-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Correct typos.


# def72fbb 01-Apr-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Now that the vdropl() function is public, assert that the vnode interlock
is held.


# e6534b36 31-Mar-2007 Dag-Erling Smørgrav <des@FreeBSD.org>

Make vdropl() public; zfs needs it. There is also plenty of existing
file system code (mostly *_reclaim()) which look like this:

VOP_LOCK(vp);
/* examine vp */
VOP_UNLOCK(vp);
vdrop(vp);

This can now be rewritten to:

VOP_LOCK(vp);
/* examine vp */
vdropl(vp); /* will unlock vp */

MFC after: 1 week


# f3ea971b 26-Mar-2007 Marcel Moolenaar <marcel@FreeBSD.org>

PowerPC is the only architecture with mpsafe_vfs=0. This is now
broken. Rudimentary tests show that PowerPC can run with
mpsafe_vfs=1. Make it so...


# 61b9d89f 12-Mar-2007 Tor Egge <tegge@FreeBSD.org>

Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib


# 2f6a774b 12-Nov-2006 Kip Macy <kmacy@FreeBSD.org>

change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)


# 6b8de13a 07-Nov-2006 John Baldwin <jhb@FreeBSD.org>

Simplify operations with sync_mtx in sched_sync():
- Don't drop the lock just to reacquire it again to check rushjob, this
only wastes time.
- Use msleep() to drop the mutex while sleeping instead of explicitly
unlocking around tsleep.

Reviewed by: pjd


# 8064e5d7 07-Nov-2006 John Baldwin <jhb@FreeBSD.org>

Fix comment typo and function declaration.


# acd3428b 06-Nov-2006 Robert Watson <rwatson@FreeBSD.org>

Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>


# a2ca03b3 04-Nov-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Typo, 'from' vnode is locked here, not 'to' vnode.


# 1a60c7fc 31-Oct-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 45ea8737 02-Oct-2006 Konstantin Belousov <kib@FreeBSD.org>

Correct the comment: numvnodes is decreased on vdestroying the vnode.

OKed by: tegge
Approved by: pjd (mentor)
MFC after: 1 week


# a1e363f2 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.


# 5da56ddb 25-Sep-2006 Tor Egge <tegge@FreeBSD.org>

Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().


# c37789fe 04-Sep-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add 'show vnode <addr>' DDB command.


# 04d9e255 10-Aug-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

getnewvnode() can be called with NULL mp.

Found by: Coverity Prevent (tm)
Coverity ID: 1521
Confirmed by: phk


# 13c85d33 08-Aug-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add a bandaid to avoid a deadlock in a situation, when we are trying to suspend
a file system, but need to obtain a vnode. We may not be able to do it, because
all vnodes could be already in use and other processes cannot release them,
because they are waiting in "suspfs" state.

In such situation, we allow to allocate a vnode anyway.

This is a temporary fix - there is no backpressure to free vnodes allocated in
those circumstances.

MFC after: 1 week
Reviewed by: tegge


# ccdebe46 06-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

Improve commenting of vaccess(), making sure to be clear that the ifdef
capabilities code is there for reference and never actually used. Slight
style tweak.


# 27ea2953 15-Jul-2006 Alan Cox <alc@FreeBSD.org>

Enable debug.mpsafevfs by default on arm. Since every architecture except
powerpc has debug.mpsafevfs enabled by default, it is shorter to enumerate
the architectures on which debug.mpsafevfs is off.

Tested by: cognet@


# c8d3bc1f 05-Jul-2006 Konstantin Belousov <kib@FreeBSD.org>

Back out my rev. 1.674. The better fix (rev. 1.637) is already in tree.

Approved by: kan (mentor)


# d81175c7 26-Jun-2006 Sergey Babkin <babkin@FreeBSD.org>

Backed out the change by request from rwatson.

PR: kern/14584


# 7a799f1e 25-Jun-2006 Sergey Babkin <babkin@FreeBSD.org>

The common UID/GID space implementation. It has been discussed on -arch
in 1999, and there are changes to the sysctl names compared to PR,
according to that discussion. The description is in sys/conf/NOTES.
Lines in the GENERIC files are added in commented-out form.
I'll attach the test script I've used to PR.

PR: kern/14584
Submitted by: babkin


# 55aef263 08-Jun-2006 Konstantin Belousov <kib@FreeBSD.org>

Fix the LOR that occurs when the MAC compiled into the kernel
and vnode is destroyed.

Reviewed by: rwatson
LOR: 189
MFC after: 2 weeks
Approved by: kan (mentor)


# dcf67e65 24-May-2006 Stephan Uphoff <ups@FreeBSD.org>

Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flush dirty page in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by: tegge@
MFC after: 7 days


# 73dbd3da 11-May-2006 John Baldwin <jhb@FreeBSD.org>

Remove various bits of conditional Alpha code and fixup a few comments.


# 643df192 29-Apr-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

vn_start_write()/vn_finished_write() is not needed here, because
vn_start_write() is always called earlier in the code path and calling
the function recursively may lead to a deadlock.

Confirmed by: tegge
MFC after: 2 weeks


# 6ca9fcc5 27-Apr-2006 Jeff Roberson <jeff@FreeBSD.org>

- Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.


# b53bf126 04-Apr-2006 Jeff Roberson <jeff@FreeBSD.org>

- VFS_LOCK_GIANT when recycling a vnode via getnewvnode. We may be
recycling for an unrelated filesystem. I really don't like potentially
acquiring giant in the context of a giantless filesystem but there
are reasonable objections to removing the recycling from this path.

Sponsored by: Isilon Systems, Inc.


# 0af24721 31-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Add an assert to vgone. It is illegal to call vgone without a reference
to the vnode. Without a reference the vnode will never be vdestroy'd
and the memory will never be reclaimed.

Sponsored by: Isilon Systems, Inc.


# 94bc95db 30-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Hold a reference from the time vfs_busy starts until vfs_unbusy is
called.
- vfs_getvfs has to return a reference to prevent the returned mountpoint
from changing identities.
- Release references acquired via vfs_getvfs.

Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.


# 084d64ac 30-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf
requires Giant. It is set in bgetvp and cleared in brelvp.
- Create QUEUE_DIRTY_GIANT for dirty buffers that require giant.
- In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and
only if we think there are buffers in that queue.

Sponsored by: Isilon Systems, Inc.


# e44270a7 19-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Correct an assert in vop_rename_pre. fdvp may be locked if it is either
the target directory or file. This case should fail in the filesystem
anyway and perhaps kern_rename() should catch it.

Sponsored by: Isilon Systems, Inc.


# 791dd2fa 08-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# 3b582b4e 02-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.


# b983aac7 02-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Don't try to show marker nodes.


# eb2ea105 01-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


# a1db11fc 22-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Release the mount ref once the vnode has been recycled rather than once
the last reference is dropped. I forgot that vnodes can stick around
for a very long time until processes discover that they are dead. This
means that a vnode reference is not sufficient to keep the mount
referenced and even more code will be required to ref mount points.

Discovered by: kris


# 8a7cd2fd 21-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Grab a mnt ref in vfs_busy() before dropping the interlock. This will
prevent the mount point from going away while we're waiting on the lock.
The ref does not need to persist once we have the lock because the
lock prevents the mount point from being unmounted.

MFC After: 1 week


# 04f6d3ef 06-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Add a ref count to the mount structure. Sleep for up to 3 seconds in
vfs_mount_destroy waiting for this ref to hit 0. We don't print an
error if we are rebooting as the root mount always retains some refernces
by init proc.
- Acquire a mnt ref for every vnode allocated to a mount point. Drop this
ref only once vdestroy() has been called and the mount has been freed.
- No longer NULL the v_mount pointer in delmntque() so that we may release
the ref after vgone() has been called. This allows us to guarantee
that the mount point structure will be valid until the last vnode has
lost its last ref.
- Fix a few places that rely on checking v_mount to detect recycling.

Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


# b099db58 31-Jan-2006 Jeff Roberson <jeff@FreeBSD.org>

- Solve a race where we could lose a call to VOP_INACTIVE. If vget() waiting
on a lock held the last usecount ref on a vnode and the lock failed we
would not call INACTIVE. Solve this by only holding a holdcnt to prevent
the vnode from disappearing while we wait on vn_lock. Other callers
may now VOP_INACTIVE while we are waiting on the lock, however this race
is acceptable, while losing INACTIVE is not.

Discussed with: kan, pjd
Tested by: kkenn
Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


# d5e5528a 27-Jan-2006 Kris Kennaway <kris@FreeBSD.org>

Back out r1.653; it turns out that the race (or at least the printf) is
actually not hard to trigger, and it can cause a lot of console spam.

Approved by: kan


# 6be2c41a 21-Jan-2006 Robert Watson <rwatson@FreeBSD.org>

Convert remaining functions in vfs_subr.c from K&R prototypes to ANSI C
prototypes, as the majority of new functions added have been in this
style. Changing prototype style now results in gcc noticing that the
implementation of vn_pollrecord() has a 'short' argument instead of
'int' as prototyped in vnode.h, so correct that definition. In practice
this didn't matter as only poll flags in the lower 16 bits are used.

MFC after: 1 week


# 82be0a5a 09-Jan-2006 Tor Egge <tegge@FreeBSD.org>

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


# e7736557 29-Dec-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Print a warning when we miss vinactive() call, because of race in vget().
The race is very real, but conditions needed for triggering it are rather
hard to meet now.
When gjournal will be committed (where it is quite easy to trigger) we need
to fix it.

For now, verify if it is really hard to trigger.

Discussed with: kan


# 16e35dcc 09-Nov-2005 Doug White <dwhite@FreeBSD.org>

This is a workaround for a complicated issue involving VFS cookies and devfs.
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the
linuxulator.

This issue affects HEAD and RELENG_6 only.

PR: 88249
Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com>
MFC after: 3 days


# 5bb84bc8 31-Oct-2005 Robert Watson <rwatson@FreeBSD.org>

Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.


# 14cdc364 14-Oct-2005 Kris Kennaway <kris@FreeBSD.org>

mpsafevm has been stable and defaulted to 1 on sparc64 for over 6 months,
so we are ready for mpsafevfs=1 by default on sparc64 too. I have been
running this on all my sparc64 machines for over 6 months, and have not
encountered MD problems.

MFC after: 1 week


# 9f5c1d19 12-Oct-2005 Diomidis Spinellis <dds@FreeBSD.org>

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


# 6c8b634f 29-Sep-2005 Don Lewis <truckman@FreeBSD.org>

Un-staticize runningbufwakeup() and staticize updateproc.

Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>


# 61ac14da 16-Sep-2005 Tor Egge <tegge@FreeBSD.org>

Break out of loop if next buffer pointer has become invalid while flushing
current buffer.

Reviewed by: kan


# fd1a469b 12-Sep-2005 Robert Watson <rwatson@FreeBSD.org>

In vfs_kqfilter(), return EINVAL instead of 1 (EPERM) when an unsupported
kqueue filter type is requested on a vnode.

MFC after: 3 days


# 9ed448b2 12-Sep-2005 Jung-uk Kim <jkim@FreeBSD.org>

use monotonic `time_uptime' instead of `time_second'

Approved by: anholt (mentor)
Discussed on: arch


# 2883ba66 12-Sep-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vfs_read_dirent() which can help VOP_READDIR() implementations
by handling all the cookie stuff.


# a6c109d6 28-Aug-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Fix a typo in vop_rename_pre() where we ended up using vholdl()
instead of vhold(), even though the vnode interlock is unlocked.

MFC after: 3 days


# ad9f1801 22-Aug-2005 Don Lewis <truckman@FreeBSD.org>

Back out the removal of LK_NOWAIT from the VOP_LOCK() call in
vlrureclaim() in vfs_subr.c 1.636 because waiting for the vnode
lock aggravates an existing race condition. It is also undesirable
according to the commit log for 1.631.

Fix the tiny race condition that remains by rechecking the vnode
state after grabbing the vnode lock and grabbing the vnode interlock.

Fix the problem of other threads being starved (which 1.636 attempted
to fix by removing LK_NOWAIT) by calling uio_yield() periodically
in vlrureclaim(). This should be more deterministic than hoping
that VOP_LOCK() without LK_NOWAIT will block, which may not happen
in this loop.

Reviewed by: kan
MFC after: 5 days


# 6cd8dee3 20-Aug-2005 Robert Watson <rwatson@FreeBSD.org>

Silence "busy" warnings when unmounting devfs at system shutdown. This
is a workaround for non-symetric teardown of the file systems at
shutdown with respect to the mount order at boot. The proper long term
fix is to properly detach devfs from the root mount before unmounting
each, and should be implemented, but since the problem is non-harmful,
this temporary band-aid will prevent false positive bug reports and
unnecessary error output for 6.0-RELEASE.

MFC after: 3 days
Tested by: pav, pjd


# fd65baf8 13-Aug-2005 Marcel Moolenaar <marcel@FreeBSD.org>

Make mpsafe_vfs=1 the default on ia64.


# 45a0d1ed 10-Aug-2005 Alexander Kabaev <kan@FreeBSD.org>

Do not drop the vnode interlock if vdropl is called on already doomed vnode.
vdropl callers expect it to return with interlock still being held.

MFC after: 2 days


# 34cc826a 05-Aug-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Holding a vnode doesn't prevent v_mount from disappearing (when the
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.

Reviewed by: kris@
MFC after: 3 days


# 40a49585 02-Aug-2005 Jeff Roberson <jeff@FreeBSD.org>

- Unlock before we call mac_destroy_vnode to prevent a lock order reversal.

Found by: trhodes


# 39b24068 19-Jul-2005 Jeff Roberson <jeff@FreeBSD.org>

- Allow vnlru to drop giant if the filesystem does not require it. The
vnlru proc is extremely inefficient, potentially iteration over tens of
thousands of vnodes without blocking. Droping Giant allows other threads
to preempt us although we should revisit the algorithm to fix the runtime
problems especially since this may hold up all vnode allocations.
- Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim. This provides
a natural blocking point to help alleviate the situation described above
although it may not technically be desirable.
- yield after we make a pass on all mount points to prevent us from
blocking other threads which require Giant.

MFC after: 2 weeks


# c23c87bd 05-Jul-2005 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix one "wrong b_bufobj" panic in reassignbuf() by moving VI_UNLOCK(vp)
below KASSERT()s, which means there was no real problem here, we just
needed better locking for assertions.

OK'ed by: jeff
Approved by: re (scottl)


# 571dcd15 01-Jul-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# b770ff6e 18-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Try to catch the wrong bufobj panics a little earlier. I believe they
are actually caused by a buf with both VNCLEAN and VNDIRTY set. In
the traces it is clear that the buf is removed from the dirty queue while
it is actually on the clean queue which leaves the tail pointer set.
Assert that both flags are not set in buf_vlist_add and buf_vlist_remove.

Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)


# 114a1006 15-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Change holdcnt use around vnode recycling. We now always keep a holdcnt
ref while we're calling vgone(). This prevents transient refs from
re-adding us to the free list. Previously, a vfree() triggered via
vinvalbuf() getting rid of all of a vnode's pages could place a partially
destructed vnode on the free list where vtryrecycle() could find it. The
first call to vtryrecycle would hang up on the vnode lock, but when it
failed it would place a now dead vnode onto the free list, and another
call to vtryrecycle() would free an already free vnode. There were many
complications of having a zero ref count while freeing which can now go
away.
- Change vdropl() to release the interlock before returning. All callers
now respect this, so vdropl() directly frees VI_DOOMED vnodes once the
last ref is dropped. This means that we'll never have VI_DOOMED vnodes
on the free list.
- Seperate v_incr_usecount() into v_incr_usecount(), v_decr_usecount() and
v_decr_useonly(). The incr/decr split is so that incr usecount can
return with the interlock still held while decr drops the interlock so
it can call vdropl() which will potentially free the vnode. The calling
function can't drop the lock of an already free'd node. v_decr_useonly()
drops a usecount without droping the hold count. This is done so the
usecount reaches zero in vput() before we recycle, however the holdcount
is still 1 which prevents any new references from placing the vnode
back on the free list.
- Fix vnlrureclaim() to vhold the vnode since it doesn't do a vget(). We
wouldn't want vnlrureclaim() to bump the usecount since this has
different semantics. Also change vnlrureclaim() to do a NOWAIT on the
vn_lock. When this function runs we're usually in a desperate situation
and we wouldn't want to wait for any specific vnode to be released.
- Fix a bunch of misc comments to reflect the new behavior.
- Add vhold() and vdrop() to vflush() for the same reasons that we do in
vlrureclaim(). Previously we held no reference and a vnode could have
been freed while we were waiting on the lock.
- Get rid of vlruvp() and vfreehead(). Neither are used. vlruvp() should
really be rethought before it's reintroduced.
- vgonel() always returns with the vnode locked now and never puts the
vnode back on a free list. The vnode will be freed as soon as the last
reference is released.

Sponsored by: Isilon Systems, Inc.
Debugging help from: Kris Kennaway, Peter Holm
Approved by: re (blanket vfs)


# 12c2dcde 14-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- In reassignbuf() add many asserts to validate the head and tail pointers
of the clean and dirty lists. This is in an attempt to catch the wrong
bufobj problem sooner.
- In vgonel() don't acquire an extra reference in the active case, the
vnode lock and VI_DOOMED protect us from recursively cleaning.
- Also in vgonel() clean up some stale comments.

Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)


# b930d853 13-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't make vgonel() globally visible, we want to change its prototype
anyway and it's not used outside of vfs_subr.c.
- Change vgonel() to accept a parameter which determines whether or not
we'll put the vnode on the free list when we're done.
- Use the new vgonel() parameter rather than VI_DOOMED to signal our
intentions in vtryrecycle().
- In vgonel() return if VI_DOOMED is already set, this vnode has already
been reclaimed.

Sponsored by: Isilon Systems, Inc.


# d2ad9baa 12-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add KTR_VFS events to vdestroy, vtruncbuf, vinvalbuf, vfreehead.

Sponsored by: Isilon Systems, Inc.


# d6dbf760 11-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Assert that we're not in the name cache anymore in vdestroy().

Sponsored by: Isilon Systems, Inc.


# 9aa0eba4 10-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add KTR_VFS tracing to track the life of vnodes. Eventually KTR_VFS
events could be added to cover other interesting details.
- Add some VNASSERTs to discover places where we access vnodes after
they have been uma_zfree'd before we try to free them again.
- Add a few more VNASSERTs to vdestroy() to be certain that the vnode is
really unused.

Sponsored by: Isilon Systems, Inc.


# 679985d0 09-Jun-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)


# fae89dce 07-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Clear OWEINACT prior to calling VOP_INACTIVE to remove the possibility
of a vget causing another call to INACTIVE before we're finished.


# fd94099e 05-May-2005 Colin Percival <cperciva@FreeBSD.org>

If we are going to
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.

Security: FreeBSD-SA-05:08.kmem


# 059f090f 03-May-2005 Jeff Roberson <jeff@FreeBSD.org>

- A vnode may have made its way onto the free list while it was being
vgone'd. We must remove it from the freelist before returning in
vtryrecycle() or we may get a duplicate free.

Reported by: kkenn


# 02fe1744 01-May-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Since it is not possible for curthread to be NULL in this context,
drop the check+initialization for a straight initialization. Also
assert that curthread will never be NULL just to be sure.

Discussed with: rwatson, peter
MFC after: 1 week


# b2e21664 30-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- All buffers should either be clean or dirty. If neither of these flags
are set when we attempt to remove a buffer from a queue we should panic.
Hopefully this will catch the source of the wrong bufobj panics.

Sponsored by: Isilon Systems, Inc.


# b2183bfe 30-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- In vnlru_free() remove the vnode from the free list before we call
vtryrecycle(). We could sometimes get into situations where two threads
could try to recycle the same vnode before this.
- vtryrecycle() is now responsible for returning the vnode to the free list
if it fails and someone else hasn't done it.
- Make a new function vfreehead() which moves a vnode to the head of the
free list and use it in vgone() to clean up that code a bit.

Sponsored by: Isilon Systems, Inc.
Reported by: pho, kkenn


# 0dd02d67 27-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't vgonel() via vgone() or vrecycle() if the vnode is already doomed.
This fixes forced unmounts via nullfs.

Reported by: kkenn
Sponsored by: Isilon Systems, Inc.


# 6c317bc4 27-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Stop setting vxthread, we've asserted that it was useless for several
weeks now.


# 7d60dc52 21-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Disable code which allows getnewvnode() to fail. Many ffs_vget() callers
do not correctly deal with failures. This presently risks deadlock
problems if dependency processing is held up by failures to allocate
a vnode, however, this is better than the situation with the failures.

Sponsored by: Isilon Systems, Inc.


# bdb35646 18-Apr-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize mountlist_mtx with an MTX_SYSINIT(), we need it to be ready
earlier.


# 374df05f 13-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Change vop_lookup_post assertions to reflect recent vfs_lookup changes.

Sponsored by: Isilon Systems, Inc.


# 539de9ed 11-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Enable ASSERT_VOP_ELOCKED and assert_vop_elocked() now that vnode_if.awk
uses it.

Sponsored by: Isilon Systems, Inc.


# 070898b1 11-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Change the VOP_LOCK UPGRADE in vput() to do a LK_NOWAIT to avoid a
potential lock order reversal. Also, don't unlock the vnode if this
fails, lockmgr has already unlocked it for us.
- Restructure vget() now that vn_lock() does all of VI_DOOMED checking
for us and also handles the case where there is no real lock type.
- If VI_OWEINACT is set, we need to upgrade the lock request to EXCLUSIVE
so that we can call inactive. It's not legal to vget a vnode that hasn't
had INACTIVE called yet.

Sponsored by: Isilon Systems, Inc.


# d78e0ee9 06-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Assert that the bufobj matches in flushbuflists. I still haven't gotten
to root cause on exactly how this happens.
- If the assert is disabled, we presently try to handle this case, but the
BUF_UNLOCK was missing. Thus, if this condition ever hit we would leak
a buf lock.

Many thanks to Peter Holm for all his help in finding this bug. He really
put more effort into it than I did.


# 2bbd6c98 05-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Move NDFREE() from vfs_subr to vfs_lookup where namei() is.


# d1cc6041 03-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add a missing unlock of the vnode_free_list_mtx.

Spotted by: Antoine Brodin


# 92b8231d 04-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Instead of waiting forever to get a vnode in getnewvnode() wait for
one to become available for one second and then return ENFILE. We
can run out of vnodes, and there must be a hard limit because without
one we can quickly run out of KVA on x86. Presently the system can
deadlock if there are maxvnodes directories in the namecache. The
original 4.x BSD behavior was to return ENFILE if we reached the max,
but 4.x BSD did not have the vnlru proc so it was less profitable to
wait.


# e451d879 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Disable vfs shared locks by default. They must be specifically enabled
on filesystems which safely support them. It appears that many
network filesystems specifically are not shared lock safe.

Sponsored by: Isilon Systems, Inc.


# f247a524 30-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.


# 7ce7f713 29-Mar-2005 David Schultz <das@FreeBSD.org>

Eliminate v_id and v_ddid. The name cache now holds references to
vnodes whose names it caches, so we no longer need a `generation
number' to tell us if a referenced vnode is invalid. Replace the use
of the parent's v_id in the hash function with the address of the
parent vnode.

Tested by: Peter Holm
Glanced at by: jeff, phk


# 0fbc3b7d 29-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Dont clear OWEINACT in vbusy(), we still owe an inactive call if someone
vhold()s us.
- Avoid an extra mutex acquire and release in the common case of vgonel()
by checking for OWEINACT at the start of the function.
- Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our
shared lock if it fails.

Sponsored by: Isilon Systems, Inc.


# cb34b95b 29-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't initial v_dd here, let cache_purge() do it for us.

Sponsored by: Isilon Systems, Inc.


# 9dcc5da3 28-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Move code that should probably be an assert above the main body of
vrele so that we can decrease the indentation of the real work and
make things slightly more clear.

Sponsored by: Isilon Systems, Inc.


# d36f0a4f 28-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Adjust asserts in vop_lookup_post() to match the new post PDIRUNLOCK
vfs.

Sponsored by: Isilon Systems, Inc.


# 3b73a3c0 27-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove another ';' after if().

Also spotted by: bz


# 2d8dfb28 27-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove extra ; at end of if().

Found by: bz


# 228ea9d2 24-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't recycle vnodes anymore. Free them once they are dead. getnewvnode
now always allocates a new vnode.
- Define a new function, vnlru_free, which frees vnodes from the free list.
It takes as a parameter the number of vnodes to free, which is
wantfreevnodes - freevnodes when called from vnlru_proc or 1 when
called from getnewvnode(). For now, getnewvnode() still tries to reclaim
a free vnode before creating a new one when we are near the limit.
- Define a function, vdestroy, which handles the actual release of memory
and teardown of locks, etc. This could become a uma_dtor() routine.
- Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This
keeps more unreferenced vnodes around so that files which have only
been stat'd are less likely to be kicked out of the system before we
have a chance to read them, etc. These vnodes may still be freed via
the normal vnlru_proc() routines which may some day become a real lru.


# d830f828 24-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Pass LK_EXCLUSIVE to VFS_ROOT() to satisfy the new flags argument. For
now, all calls to VFS_ROOT() should still acquire exclusive locks.

Sponsored by: Isilon Systems, Inc.


# c167961e 23-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- If vput() is called with a shared lock it must upgrade to an exclusive
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.

Sponsored by: Isilon Systems, Inc.


# b172f6c5 15-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Now that there are no external users of vfree() make it static.
- Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so
no one else attempts to grow a dependency on them.
- Now that objects with pages hold the vnode we don't have to do unlocked
checks for the page count in the vm object in VSHOULDFREE. These three
macros could simply check for holdcnt state transitions to determine
whether the vnode is on the free list already, but the extra safety
the flag affords us is probably worth the minimal cost.
- The leafonly sysctl and code have been dead for several years now,
remove the sysctl and the code that employed it from vtryrecycle().
- vtryrecycle() also no longer has to check the object's page count as
the object holds the vnode until it reaches 0.

Sponsored by: Isilon Systems, Inc.


# c178628d 15-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Expose vholdl() so it may be used outside of vfs_subr.c


# 8045557f 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Increment the holdcnt once for each usecount reference. This allows us
to use only the holdcnt to determine whether a vnode may be recycled,
simplifying the V* macros as well as vtryrecycle(), etc.

Sponsored by: Isilon Systems, Inc.


# 159b4548 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- We do not have to check the object's ref_count in VSHOULDFREE or
vtryrecycle(). All obj refs also ref the vnode.
- Consistently use v_incr_usecount() to increment the usecount. This will
be more important later.

Sponsored by: Isilon Systems, Inc.


# 8f13a540 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Slightly rearrange vrele() to move the common case in one indentation
level.

Sponsored by: Isilon Systems, Inc.


# 6fc16a83 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Rework vget() so we drop the usecount in two failure cases that were
missed by my last commit.

Sponsored by: Isilon Systems, Inc.


# 6703c30b 13-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove vx_lock, vx_unlock, vx_wait, etc.
- Add a vn_start_write/vn_finished_write around vlrureclaim so we don't do
writing ops without suspending. This could suspend the vlruproc which
should not be a problem under normal circumstances.
- Manually implement VMIGHTFREE in vlrureclaim as this was the only instance
where it was used.
- Acquire a lock before calling vgone() as it now requires it.
- Move the acquisition of the vnode interlock from vtryrecycle() to
getnewvnode() so that if it fails we don't drop and reacquire the
vnode_free_list_mtx.
- Check for a usecount or holdcount at the end of vtryrecycle() in case
someone grabbed a ref while we were recycling. Abort the recycle, and
on the final ref drop this vnode will be placed on the head of the free
list.
- Move the redundant VOP_INACTIVE protection code into the local
vinactive() routine to avoid code bloat.
- Keep the vnode lock held across calls to vgone() in several places.
- vgonel() no longer uses XLOCK, instead callers must hold an exclusive
vnode lock. The VI_DOOMED flag is set to allow other threads to detect
a vnode which is no longer valid. This flag is set until the last
reference is gone, and there are no chances for a new ref. vgonel()
holds this lock across the entire function, which greatly simplifies
logic.
_ Only vfree() in one place in vgone() not three.
- Adjust vget() to check the VI_DOOMED flag prior to waiting on the lock
in the LK_NOWAIT case. In other cases, check after we have slept and
acquired an exlusive lock. This will simulate the old vx_wait()
behavior.

Sponsored by: Isilon Systems, Inc.


# d9a9c2c2 23-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Enable SMP VFS by default on current. More users are needed to turn up
any remaining bugs. Anyone inconvenienced by this can still disable it
in the loader.

Sponsored by: Isilon Systems, Inc.


# d8a7c99a 22-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Only the xlock holder should be calling VOP_LOCK on a vp once VI_XLOCK
has been set. Assert that this is the case so that we catch filesystems
who are using naked VOP_LOCKs in illegal cases.

Sponsored by: Isilon Systems, Inc.


# 4c11620b 22-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add a check for xlock in vop_lock_assert. Presently the xlock is
considered to be as good as an exclusive lock, although there is still a
possibility of someone acquiring a VOP LOCK while xlock is held.

Sponsored by: Isilon Systems, Inc.


# 767056c0 22-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Zero the v_un container field to make sure everything is gone.


# aa2f6ddc 22-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Reap more benefits from DEVFS:

List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is bargain.
In addition it makes it harder for people to try to do stypid things like
"finding the vnode from cdev".

Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().

rename idestroy_dev() to destroy_devl() for consistency.

Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().

Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();

Let devfs_reclaim() call dev_rel() instead of vgonel().

Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.

Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)


# 7fc940b2 22-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vfinddev(), it is generally bogus when faced with jails and
chroot and has no legitimate use(r)s in the tree.


# dfd4be14 19-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.


# 900b7e26 18-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make sure to drop the VI_LOCK in vgonel();

Spotted by: Taku YAMAMOTO <taku@tackymt.homeip.net>


# 4d8ac58b 17-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vx_wait{l}() and use it instead of home-rolled versions.


# 58aac128 17-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Convert KASSERTS to VNASSERTS


# 1ba21282 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make various vnode related functions static


# fe019877 10-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't pass NULL to vprint()


# 68f2274d 08-Feb-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add a new assert in the getnewvnode(). Assert that the usecount is still
0 to detect getnewvnode() races.
- Add the vnode address to a few panics near by to help in debugging.

Sponsored by: Isilon Systems, Inc.


# b9489a44 07-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Access vmobject via the bufobj instead of the vnode


# b348abd6 07-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't call VOP_DESTROYVOBJECT(), trust that VOP_RECLAIM() did what
was necessary.


# d4eb29ba 28-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused argument to vrecycle()


# 1fdfaafb 28-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Integrate vclean() into vgonel().

Various associated polishing.


# 3fc8dd06 27-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove register keyword


# 8516dd18 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use VOP_GETVOBJECT, use vp->v_object directly.


# b5b6ec5f 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate the constant flags argument to vclean()


# 7c93282e 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Change vprint() to vn_printf() which takes varargs.
Add #define for vprint() to call vn_printf().


# 35764be3 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Kill the VV_OBJBUF and test the v_object for NULL instead.


# d1fcf3bb 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add the tunable and sysctl for the mpsafevfs. It currently defaults
to off.
- Protect access to mnt_kern_flag with the mointpoint mutex.
- Remove some KASSERTs which are not legal checks without the appropriate
locks held.
- Use VCANRECYCLE() rather than rolling several slightly different
checks together.
- Return from vtryrecycle() with a recycled vnode rather than a locked
vnode. This simplifies some locking.
- Remove several GIANT_REQUIRED lines.
- Add a few KASSERTs to help with INACT debugging.

Sponsored By: Isilon Systems, Inc.


# 7bf38aea 16-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a bug I introduced in 1.561 which has caused considerable filesystem
unhappiness lately.

As far as I can tell, no files that have made it safely to disk
have been endangered, but stuff in transit has been in peril.

Pointy hat: phk


# 7c0745ee 14-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# e39db32a 12-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


# 6ef8480a 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.


# 6afa350d 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

More vnode -> bufobj migration.


# 8d785753 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Give flushbuflist() a struct bufv as first argument and avoid home-rolling
TAILQ_FOREACH_SAFE().

Loose the error pointer argument and return any errors the normal way.

Return EAGAIN for the case where more work needs to be done.


# 8df6bac4 11-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# 0b3e4fe2 04-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Since we do not support forceful unmount of DEVFS we can do away with
the partially implemented vnode-readoption code in vgonechrl().


# e87047b4 20-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

We can only ever get to vgonechrl() from a devfs vnode, so we do not
need to reassign the vp->v_op to devfs_specops, we know that is the
value already.

Make devfs_specops private to devfs.


# 20a92a18 07-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).


# f76fedd2 02-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Improve vprint() a little bit: break long lines, reduce indent and tell
if the VI_LOCK() is held.


# aec0fb7b 01-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)


# a752aa8f 15-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.


# 11bcbee1 14-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the bit of the syncer which deals with vnodes into a separate
function.


# db442506 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate vop_revoke() function now that devfs_revoke() does the entire job.


# c5b846fe 10-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Slim vnodes by another four bytes by eliminating the (now) unused field
v_cachedid.


# c13a4e88 10-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vn_todev()


# b797084e 09-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vnode->v_cachedfs.

It was only used for the highly dangerous "export all vnodes with a sysctl"
function.


# c5690651 04-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove buf->b_dev field.


# e0b687d3 03-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Always initialize bo_private along with bo_ops in getnewvnode().

Spotted by: tegge


# 996b2c82 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Loose vfs_mountedon()


# e1f355fe 29-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Give the bufobj a private __bo_vnode for now to keep the syncer floating [1]
At some point later the syncer will unlearn about vnodes and the filesystems
method called by the syncer will know enough about what's in bo_private to
do the right thing.

[1] Ok, I know, but I couldn't resist the pun.


# 20eba72f 27-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the syncer linkage from vnode to bufobj.

This is not quite a perfect separation: the syncer still think it knows
that everything is a vnode.


# 5d9d81e7 26-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.


# 156cb265 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# ee1d0eb3 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vnode->v_bsize. This was a dead-end.


# 4dcd0ac4 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Collapse vnode->v_object and buf->b_object into bufobj->bo_object.


# b792bebe 24-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().


# b0e86f6a 22-Oct-2004 Robert Watson <rwatson@FreeBSD.org>

When MAC is enabled, warn if getnewvnode() is asked to produce a vnode
without a mountpoint. In this scenario, there's no useful source for
a label on the vnode, since we can't query the mountpoint for the
labeling strategy or default label.


# ff7c5a48 22-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite
jest, of most excellent fancy: he hath taught me lessons a thousand
times; and now, how abhorred in my imagination it is! my gorge rises
at it. Here were those hacks that I have curs'd I know not how
oft. Where be your kludges now? your workarounds? your layering
violations, that were wont to set the table on a roar?

Move the skeleton of specfs into devfs where it now belongs and
bury the rest.


# 494eb176 22-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.


# a76d8f4e 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 1bca607b 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx.

Initialize the bo_mtx when we allocate a vnode i getnewvnode() For
now we point to the vnodes interlock mutex, that retains the exact
same locking sematics.

Move v_numoutput from vnode to bufobj. Add renaming macro to
postpone code sweep.


# 67647b23 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Polish vtruncbuf() to improve readability and style a bit.


# e1633956 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify buf_vlist_remove().

Now that we have encapsulated the splaytree related information
into a structure we can eliminate the half of this function.


# 57259f28 05-Oct-2004 Greg Lehey <grog@FreeBSD.org>

vtryrecycle: Don't rely on type VBAD alone to mean that we don't need
to clean the vnode. If v_data is set, we still need to
clean it. This code change should catch all incidents of
the previous commit (INVARIANTS only).


# f2154b33 05-Oct-2004 Greg Lehey <grog@FreeBSD.org>

getnewvnode: Weaken the panic "cleaned vnode isn't" to a warning.

Discussion: this panic (or waning) only occurs when the kernel is
compiled with INVARIANTS. Otherwise the problem (which means that
the vp->v_data field isn't NULL, and represents a coding error and
possibly a memory leak) is silently ignored by setting it to NULL
later on.

Panicking here isn't very helpful: by this time, we can only find
the symptoms. The panic occurs long after the reason for "not
cleaning" has been forgotten; in the case in point, it was the
result of severe file system corruption which left the v_type field
set to VBAD. That issue will be addressed by a separate commit.


# ba285125 01-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a LOR relating to freeing cdevs.


# 70526ca6 24-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Hold dev_lock and check for NULL devsw pointer when we determine
if a vnode is a disk.


# a0e78d2e 23-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount
of the number of threads which are inside whatever is behind the
cdevsw for this particular cdev.

Make the device mutex visible through dev_lock() and dev_unlock().
We may want finer granularity later.

Replace spechash_mtx use with dev_lock()/dev_unlock().


# 08dbd671 15-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused B_WRITEINPROG flag


# 1affa3ad 07-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Create simple function init_va_filerev() for initializing a va_filerev
field.

Replace three instances of longhaired initialization va_filerev fields.

Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.


# 8ded6540 20-Aug-2004 Don Lewis <truckman@FreeBSD.org>

Don't attempt to trigger the syncer thread final sync code in the
shutdown_pre_sync state if the RB_NOSYNC flag is set. This is the
likely cause of hangs after a system panic that are keeping crash
dumps from being done.

This is a MFC candidate for RELENG_5.

MFC after: 3 days


# 78c37b0d 16-Aug-2004 David E. O'Brien <obrien@FreeBSD.org>

s/MAX_SAFE_MAXVNODES/MAXVNODES_MAX/g


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 87e83e7d 10-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

In v_addpollinfo(), we allocate storage to back vp->v_pollinfo. However,
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL. Otherwise,
accept that another thread got there first, and release the local storage.

Discussed with: jmg


# c8c216d5 09-Aug-2004 Nate Lawson <njl@FreeBSD.org>

Skip the syncing disks loop if there are no dirty buffers. Remove a
variable used to flag the initial printf.

Submitted by: truckman (earlier version)


# 64298d52 02-Aug-2004 David E. O'Brien <obrien@FreeBSD.org>

Put a cap on the auto-tuning of kern.maxvnodes.

Cap value chosen by: scottl


# b1c81391 29-Jul-2004 Nate Lawson <njl@FreeBSD.org>

Minor message cleanup.


# 3dfe213e 27-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Convert the vfsconf list to a TAILQ.

Introduce vfs_byname() function to find things on it.

Staticize vfs_nmount() function under the name vfs_donmount().

Various cleanups.


# 56f21b9d 26-Jul-2004 Colin Percival <cperciva@FreeBSD.org>

Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb


# cf95b5c3 25-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unused second argument to reassignbuf() and simplify it
accordingly.


# 05656b6e 21-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

put several of the options for DEBUG_VFS_LOCKS under control of sysctls.


# bb5faea3 15-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Cleanup shutdown output.


# da6303ba 14-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Tidy up system shutdown.


# f257b7a5 12-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


# 7ae8ce5d 11-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Dump the actual bad values when this assertion is tripped.


# 32240d08 10-Jul-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Update for the KDB framework:
o Call kdb_enter() instead of Debugger().


# 057589c4 08-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

fixup sysctl by fsid node


# ea0104b0 06-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Introduce vfs_suser(), used to test if a user should have special privs
for a mount.


# c713aaae 06-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

NFS mobility PHASE I, II & III (phase VI, and V pending):

Rebind the client socket when we experience a timeout. This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice
versa.

Add a placeholder vfs_sysctl where we will put status reporting
shortly.

Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.


# 27875d9c 05-Jul-2004 Don Lewis <truckman@FreeBSD.org>

Unconditionally set last_work_seen while in the SYNCER_RUNNING state
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.

Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.

Tested by: phk


# faf1b66d 04-Jul-2004 Don Lewis <truckman@FreeBSD.org>

Rework syncer termination code:

Speed up the syncer when shutting down by sleeping for a shorter
period of time instead of cranking up rushjob and using the
normal one second sleep.

Skip empty worklist slots when shutting down to avoid lengthy
intervals of inactivity.

Give I/O more time to complete between steps by not speeding the
syncer quite as much.

Terminate the syncer after one full pass through the worklist
plus one second with the worklist containing nothing but syncer
vnodes.

Print an indication of shutdown progress to the console.

Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.


# c555963f 04-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE.


# 2d1dca73 04-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Pass the operation in with the fsidctl.
Remove some fsidctls that we will not be using.
Correct prototypes for fs sysctls.


# 7f6599fe 04-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Make the last commit handle non-phk root devices better.


# 1cbb1e02 03-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.


# 94ed9c8a 04-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a new kevent filter. EVFILT_FS that will be used to signal
generic filesystem events to userspace. Currently only mount and unmount
of filesystems are signalled. Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.


# 903ac7c2 04-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Revision 1.496 would not boot on my system due to
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
insmntqueue(vp, mp = NULL) -> KASSERT -> panic

Make getnewvnode() only call insmntqueue() if the mountpoint parameter
is not NULL.


# e3c5a7a4 04-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);

The code which takes vnodes off a mountpoint looks like this:

MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.

Fix:

Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.


# e06500dd 01-Jul-2004 Don Lewis <truckman@FreeBSD.org>

When shutting down the syncer kernel thread, first tell it to run
faster and iterate to over its work list a few times in an attempt
to empty the work list before the syncer terminates. This leaves
fewer dirty blocks to be written at the "syncing disks" stage and
keeps the the "giving up on N buffers" problem from being triggered
by the presence of a large soft updates work list at system shutdown
time. The downside is that the syncer takes noticeably longer to
terminate.

Tested by: "Arjan van Leeuwen" <avleeuwen AT piwebs DOT com>
Approved by: mckusick


# f3732fd1 17-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# 89c9c53d 16-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# 170593a9 14-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a left over from userland buffer-cache access to disks.


# 9e6127fe 31-May-2004 Robert Watson <rwatson@FreeBSD.org>

Assert Giant in vrele().


# a0b5a679 11-Apr-2004 Maxime Henrion <mux@FreeBSD.org>

Put deprecated sysctl code inside BURN_BRIDGES.


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 39d3505a 29-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Kill some XXXKSE's. vnlru/syncer are single threaded.


# 4d453ef1 11-Mar-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().


# 651b11ea 11-Mar-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused second arg to vfinddev().
Don't call addaliasu() on VBLK nodes.


# ff85a3f0 05-Mar-2004 Alexander Kabaev <kan@FreeBSD.org>

Always call vn_finished_write after vn_start_write was called. All
occurences of 'goto done' after vn_start_write invocation were cleaning
up incompletely.


# 44f3b092 27-Feb-2004 John Baldwin <jhb@FreeBSD.org>

Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.


# 47934cef 25-Feb-2004 Don Lewis <truckman@FreeBSD.org>

Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms


# ded67d0f 21-Feb-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Check for NODEV return from udev2dev()


# cd690b60 21-Feb-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Device megapatch 6/6:

This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:

Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.

Hold a dev_t reference while the device is open.

KASSERT that we do not enter the driver on a non-referenced dev_t.

Remove old D_NAG code, anonymous dev_t's are not a problem now.

When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.

Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.


# 816d62bb 21-Feb-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Device megapatch 5/6:

Remove the unused second argument from udev2dev().

Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.

Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()


# 580ddfa6 05-Jan-2004 Alexander Kabaev <kan@FreeBSD.org>

More style fixes.

Obtained from: bde


# b0fdf716 05-Jan-2004 Alexander Kabaev <kan@FreeBSD.org>

style(9):

Add empty line before first code line in functions with no local
variables.
Properly terminate comment sentences.
Indent lines which are longer that 80 characters.
Move v_addpollinfo closer to the rest of poll-related functions.
Move DEBUG_VFS_LOCKS ifdefed block to the end of file.

Obtained from: bde (partly)


# 3ff1b7c2 03-Jan-2004 Alexander Kabaev <kan@FreeBSD.org>

Cosmetics: strip '\n' from a string passed to Debugger().


# 9efe7d9d 28-Dec-2003 Bruce Evans <bde@FreeBSD.org>

v_vxproc was a bogus name for a thread (pointer).


# 958557e9 16-Dec-2003 Jeff Roberson <jeff@FreeBSD.org>

- In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT.

Submitted by: Stephan Uphoff <ups@stups.com>


# d8521366 16-Dec-2003 Jeff Roberson <jeff@FreeBSD.org>

- When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
However, this is not sufficient. If we vclean() the vnode the pages owned
by the vnode are lost, potentially while buffers reference them. Implement
parts of vclean() seperately in vgonechrl() so that the pages and bufs
associated with a device vnode are not destroyed while in use.


# a6c6a93c 30-Nov-2003 Jeff Roberson <jeff@FreeBSD.org>

- Don't forget to unlock the vnode interlock in the LK_NOWAIT case.

Submitted by: Stephan Uphoff <ups@stups.com>
Approved by: re (rwatson)


# 512824f8 09-Nov-2003 Seigo Tanimura <tanimura@FreeBSD.org>

- Implement selwakeuppri() which allows raising the priority of a
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current


# ca430f2e 04-Nov-2003 Alexander Kabaev <kan@FreeBSD.org>

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


# 06cb76bd 23-Oct-2003 Garrett Wollman <wollman@FreeBSD.org>

Add appropriate const poisoning to the assert_*locked() family so that I can
call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic.

Inspired by: the evil and rude OpenAFS cache manager code


# f2b1200d 20-Oct-2003 Alan Cox <alc@FreeBSD.org>

Initialize the buf's b_object in pbgetvp(). Clear it in pbrelvp(). (This
facilitates synchronization of the vm page's valid field using the
vm object's lock.)

Suggested by: tegge


# 3da2d6a4 17-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify count_dev()


# 5108cd36 12-Oct-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify vn_isdisk() a bit.


# 7dd1328c 11-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix a typo, I meant & and not |. This was causing lockups from the syncer
looping forever due to list corruption.

Solved by: tegge


# bdcfcdec 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Fix an XXX. Check the error of vn_lock() in vflush(). Don't specify
LK_RETRY either, we don't want this vnode if it turns into another.
- Remove the code that checks the mount point after acquiring the lock
we are guaranteed to either fail or get the vnode that we wanted.


# 45503a37 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Rename vcanrecycle() to vtryrecycle() to reflect its new role.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
passed. We won't vgonel if someone has either acquired a hold or usecount
or started the vgone process elsewhere. This is because we may have been
removed from the free list while we were inspecting the vnode for
recycling.
- The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
the same vnode. To further reduce the likelyhood of this event, requeue
the vnode on the tail of the list prior to calling vtryrecycle(). We can
not actually remove the vnode from the list until we know that it's
going to be recycled because other interlock holders may see the VI_FREE
flag and try to remove it from the free list.
- Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it
regardless of MNT_WAIT because the vnode does not actually belong to
this filesystem.


# 85311d4b 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Don't cache_purge() in getnewvnode. It's done in vclean(). With this
purge, the purge in vclean, and the filesystems purge, we had 3 purges
per vnode.
- Move the insmntque(vp, 0) to vclean() so that we may remove it from the
two vgone() functions and reduce the number of lock operations required.


# ce13b187 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Solve a LOR with the sync_mtx by using the VI_ONWORKLST flag to determine
whether or not the sync failed. This could potentially get set between
the time that we VOP_UNLOCK and VI_LOCK() but the race would harmelssly
lead to the sync being delayed by an extra 30 seconds. If we do not move
the vnode it could cause an endless loop if it continues to fail to sync.
- Use vhold and vdrop to stop the vnode from changing identities while we
have it unlocked. Other internal vfs lists are likely to follow this
scheme.


# 894fbf97 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Move the xlock 'locking' code into vx_lock() and vx_unlock().
- Create a new function, vgonechrl(), which performs vgone for an in-use
character device. Move the code from vflush() that did this into
vgonechrl().
- Hold the xlock across the entirety of vgonel() and vgonechrl() so that
at no point will an invalid vnode exist on any list without XLOCK set.
- Move the xlock code out of vclean() now that it is in the vgone*()
functions.


# 6f4b0863 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- In sched_sync() test our preconditions prior to dropping the sync_mtx.
This is so that we may grab the interlock while still holding the
sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock
order runs the other way.
- If we don't meet any of the preconditions, reinsert the vp into the
list for the next second.
- We don't need to panic if we fail to sync here because each FSYNC
function handles this case. Removing this redundant code also
simplifies locking.


# e4c49d2b 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- In a Giantless world, the vn_lock() in vcanrecycle() could legitimately
fail. Remove the panic from that case and document why it might fail.
- Document the reason for calling cache_purge() on a newly created vnode.
- In insmntque() order the operations so that we can call mtx_unlock()
one fewer times. This makes the code somewhat clearer as well.
- Add XXX comments in sched_sync() and vflush().
- In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
set.
- In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
call. It's ok if we race here, the vinvalbuf will just do nothing.
- Increase the scope of the lock in vgonel() to reduce the number of lock
operations that are performed.


# 51b57549 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- In reassignbuf() don't unlock vp and lock newvp if they are the same.
Doing so creates a race where the buf is on neither list.
- Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we
should.
- Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this
really should not happen and is fast to check.


# 6b6c163a 19-Sep-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove spls(). The locking that has replaced them is in place and they
no longer serve as guidelines for future work.


# aebbeee8 19-Sep-2003 Alexander Kabaev <kan@FreeBSD.org>

Eliminate one case of VI_UNLOCK followed by an immediate
VI_LOCK.


# 8b149b51 07-Aug-2003 John Baldwin <jhb@FreeBSD.org>

Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort. In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by: bde (kern_ktrace.c)


# 68f2d20b 22-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Revert stuff which accidentally ended up in the previous commit.


# 55d1d703 22-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Don't attempt to inline large functions mb_alloc() and mb_free(),
it more than doubles the text size of this file.

GCC has wisely ignored us on this previously


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# a62f80f8 31-May-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused variable and now unbalanced call to splbio();

Found by: FlexeLint


# 2e05d898 23-May-2003 Alan Cox <alc@FreeBSD.org>

Make the maximum number of vnodes a function of both the physical memory
size and the kernel's heap size, specifically, vm_kmem_size. This
function allows a maximum of 40% of the vm_kmem_size to be used for
vnodes and vm objects. This is a conservative bound based upon recent
problem reports. (In other words, a slight increase in this percentage
may be safe.)

Finally, machines with less than ~3GB of RAM should be unaffected
by this change, i.e., the maximum number of vnodes should remain
the same. If necessary, machines with 3GB or more of RAM can increase
the maximum number of vnodes by increasing vm_kmem_size.

Desired by: scottl
Tested by: jake
Approved by: re (rwatson,scottl)


# 1e9bc9f8 16-May-2003 Don Lewis <truckman@FreeBSD.org>

Detect that a vnode has been reclaimed while vflush() was waiting to lock
the vnode and restart the loop. Vflush() is vulnerable since it does not
hold a reference to the vnode and it holds no other locks while waiting
for the vnode lock. The vnode will no longer be on the list when the
loop is restarted.

Approved by: re (rwatson)


# 099e981a 12-May-2003 Alan Cox <alc@FreeBSD.org>

Optimize the use of splay in gbincore(). During a "make buildworld" the
desired buffer is found at one of the roots more than 60% of the time.
Thus, checking both roots before performing either splay eliminates
unnecessary splays on the first tree splayed.

Approved by: re (jhb)


# 1964fb9b 12-May-2003 Robert Watson <rwatson@FreeBSD.org>

Remove bogus locking from DDB's "show lockedvnods" command: using
synchronization primitives from inside DDB is generally a bad idea,
and in this case it frequently results in panics due to DDB commands
being executed from the sio fast interrupt context on a serial
console. Replace the locking with a note that a lack of locking
means that DDB may get see inconsistent views of the mount and vnode
lists, which could also result in a panic. More frequently,
though, this avoids a panic than causes it.

Discussed with ages ago: bde
Approved by: re (scottl)


# bff99f0d 03-May-2003 Alan Cox <alc@FreeBSD.org>

- Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.

Reported by: iedowse


# ebba1b25 30-Apr-2003 Alan Cox <alc@FreeBSD.org>

Lock accesses to the vm_object's ref_count and resident_page_count.


# ecde4b32 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

Various changes to vm_object_page_remove():
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
size.
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.


# 1ca58953 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().


# b6e48e03 23-Apr-2003 Alan Cox <alc@FreeBSD.org>

- Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)


# 49281fbf 18-Apr-2003 Alan Cox <alc@FreeBSD.org>

Update locking around vm_object_page_remove() to use the new macros.


# e96c181d 12-Apr-2003 Alan Cox <alc@FreeBSD.org>

Use vm_object_pip_wait() rather than reimplementing it.


# 6b080461 26-Mar-2003 Tor Egge <tegge@FreeBSD.org>

Adjust the number of vnodes scanned by vlrureclaim() according to the
size of the vnode list.


# 17ce5b94 22-Mar-2003 Yaroslav Tykhiy <ytykhiy@gmail.com>

We shouldn't assert that a vode is locked in vop_lock_post()
if VOP_LOCK() has failed.

Reviewed by: jeff


# e99215a6 13-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove a dead check for bp->b_vp == vp in vtruncbuf(). This has not been
possible for some time.
- Lock the buf before accessing fields. This should very rarely be locked.
- Assert that B_DELWRI is set after we acquire the buf. This should always
be the case now.


# 09f11da5 13-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.


# 09c80124 05-Mar-2003 Alan Cox <alc@FreeBSD.org>

Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@


# 99648386 03-Mar-2003 Nate Lawson <njl@FreeBSD.org>

Finish cleanup of vprint() which was begun with changing v_tag to a string.
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.


# 491081fa 01-Mar-2003 Jeff Roberson <jeff@FreeBSD.org>

- Hold the vnode interlock across calls to bgetvp instead of acquiring it
internally. This is required to stop multiple bufs from being associated
with a single lblkno.


# bff5362b 28-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- gc USE_BUFHASH. The smp locking of the buf cache renders this useless.


# 3a7053cb 24-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


# 17661e5a 24-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick


# acb18acf 23-Feb-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Bracket the kern.vnode sysctl in #ifdef notyet because it results
in massive locking issues on diskless systems.

It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 6a1b2a22 29-Dec-2002 Ian Dowse <iedowse@FreeBSD.org>

Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().

There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.

Reviewed by: mckusick (an older version of the patch)


# 371400cf 29-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Use a timeout of one second while we wait for the vnode washer,
this prevents a potential race and makes the system a little bit
less jerky under extreme loads.


# 851a87ea 29-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Vnodes pull in 800-900 bytes these days, all things counted, so we need
to treat desiredvnodes much more like a limit than as a vague concept.

On a 2GB RAM machine where desired vnodes is 130k, we run out of
kmem_map space when we hit about 190k vnodes.

If we wake up the vnode washer in getnewvnode(), sleep until it is done,
so that it has a chance to offer us a washed vnode. If we don't sleep
here we'll just race ahead and allocate yet a vnode which will never
get freed.

In the vnodewasher, instead of doing 10 vnodes per mountpoint per
rotation, do 10% of the vnodes distributed evenly across the
mountpoints.


# 9f162827 28-Dec-2002 Poul-Henning Kamp <phk@FreeBSD.org>

KASSERT that vop_revoke() gets a VCHR.


# 475e8011 14-Dec-2002 Alan Cox <alc@FreeBSD.org>

Perform vm_object_lock() and vm_object_unlock() around
vm_object_page_remove().


# 2e29a1f2 07-Dec-2002 Alan Cox <alc@FreeBSD.org>

To avoid lock order reversals in getnewvnode(), the call to uma_zfree()
must be delayed until the vnode interlock is released.

Reported by: kris@
Approved by: re (jhb)


# f85a9619 27-Nov-2002 Robert Drehmel <robert@FreeBSD.org>

Do not set a variable (vp->p_pollinfo) to NULL if we know
it already has that value.

Approved by: re


# 763bbd2f 26-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Slightly change the semantics of vnode labels for MAC: rather than
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects sematics for shared vnode locks, which were not
previously present in the system. This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 0d6dc414 25-Oct-2002 Poul-Henning Kamp <phk@FreeBSD.org>

In vrele() we can actually have a VCHR with v_rdev == NULL if we
came from the bottom of addaliasu(). Don't panic.


# 9ab73fd1 24-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


# a2fb4fed 24-Oct-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Fix the spechash lock order reversal by keeping an updated sum
of v_usecount in the dev_t which vcount() can return without
locking any vnodes.

Seen by: jhb


# a6b9f47b 14-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

When scanning the freelist looking for candidate vnodes to recycle,
be sure to exit the loop with vp == NULL if no candidates are found.
Formerly, this bug would cause the last vnode inspected to be used,
even if it was not available. The result was a panic "vn_finished_write:
neg cnt".

Sponsored by: DARPA & NAI Labs.


# e04a0200 14-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Unconditionally reset vp->v_vnlock back to the default in the
vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather
than requiring filesystems that use alternate locks to do so
in their vop_reclaim functions. This change is a further cleanup
of the vop_stdlock interface.

Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sponsored by: DARPA & NAI Labs.


# a5b65058 13-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by: DARPA & NAI Labs.


# 192e439e 10-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

When considering a vnode for reuse in getnewvnode, we call
vcanrecycle to check a free vnode's availability. If it is
available, vcanrecycle returns an error code of zero and the
vnode in question locked. The getnewvnode routine then used
to call vn_start_write with the V_NOWAIT flag. If the filesystem
was suspended while taking a snapshot, the vn_start_write would
fail but getnewvnode would fail to unlock the vnode, instead
leaving it locked on the freelist. The result would be that the
vnode would be locked forever and would eventually hang the
system with a race to the root when it was attempted to recycle
it. This fix moves the vn_start_write check into vcanrecycle
where it will properly unlock the vnode if it is unavailable
for recycling due to filesystem suspension.

Sponsored by: DARPA & NAI Labs.


# 790a8088 04-Oct-2002 Maxim Sobolev <sobomax@FreeBSD.org>

Fix problem introduced in rev.1.406, which can cause already unlocked
mutex being unlocked again causing system panic.


# 8d3574c7 01-Oct-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Fix some harmless mis-indents.

Spotted by: FlexeLint


# 0626774f 30-Sep-2002 Robert Watson <rwatson@FreeBSD.org>

Move vnode MAC label initialization to after the release of the vnode
interlock in getnewvnode() to avoid possible sleeps while holding
the mutex. Note that the warning from Witness is a slight false
positive since we know there will be no contention on the interlock
since we haven't made the vnode available for use yet, but the theory
is not a bad one.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# 37c84183 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512


# 6423c943 25-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move ASSERT_VOP_*LOCK* functionality into functions in vfs_subr.c
- Make the VI asserts more orthogonal to the rest of the asserts by using a
new, common vfs_badlock() function and adding a 'str' arg.
- Adjust generated ASSERTS to match the new prototype.
- Adjust explicit ASSERTS to match the new prototype.


# 6cb8bf20 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Lock down the syncer with sync_mtx.
- Enable vfs_badlock_mutex by default.
- Assert that the vp is locked in VOP_UNLOCK.
- Use standard interlock macros in remaining code.
- Correct a race in getnewvnode().
- Lock access to v_numoutput with interlock.
- Lock access to buf lists and splay tree with interlock.
- Add VOP and VI asserts.
- Lock b_vnbufs with the vnode interlock.
- Add vrefcnt() for callers who want to retreive the vnode ref without
holding a lock. Add a comment that describes when this is safe.
- Add vholdl() and vdropl() so that callers who already own the interlock
can avoid race conditions and unnecessary unlocking.
- Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
- Hold the interlock before droping the mntlist_mtx in vflush() to avoid
a race.
- Fix locking in vfs_msync().


# 86ed6d45 18-Sep-2002 Nate Lawson <njl@FreeBSD.org>

Remove any VOP_PRINT that redundantly prints the tag.
Move lockmgr_printinfo() into vprint() for everyone's benefit.

Suggested by: bde


# 06be2aaa 14-Sep-2002 Nate Lawson <njl@FreeBSD.org>

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


# 85e40eaf 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Indentation does not make a block.. need curly braces too.
Submitted by: Eagle-eyes evans <bde@freebsd.org>


# 71fad9fd 11-Sep-2002 Julian Elischer <julian@FreeBSD.org>

Completely redo thread states.

Reviewed by: davidxu@freebsd.org


# f8b66361 05-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Fix an inherited style bug: compare with NOCRED instead of NULL.

Sponsored by: DARPA & NAI Labs.


# c1a925a6 05-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce new extattr_check_cred() function which implements the canonical
crential washing for extended attributes.

Sponsored by: DARPA & NAI Labs.


# 93b0017f 25-Aug-2002 Philippe Charnier <charnier@FreeBSD.org>

Replace various spelling with FALLTHROUGH which is lint()able


# ad32f726 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Fix a mistake in my last few commits. The PDROP flag stops msleep from
re-acquiring the mutex.

Pointy hat to: me
Noticed by: tegge


# 9abf54f0 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT
LK_INTERLOCK. The interlock will never be held on return from these
functions even when there is an error. Errors typically only occur when
the XLOCK is held which means this isn't the vnode we want anyway. Almost
all users of these interfaces expected this behavior even though it was
not provided before.


# 18315848 22-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Fix interlock handling in vn_lock(). Previously, vn_lock() could return
with interlock held in error conditions when the caller did not specify
LK_INTERLOCK.
- Add several comments to vn_lock() describing the rational behind the code
flow since it was not immediately obvious.


# 0b600db4 21-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Document two cases, one in vget and the other in vn_lock, where the state
of interlock on exit is not consistent. There are probably several bugs
relating to this.


# 88cf6b94 21-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- If vn_lock fails with the LK_INTERLOCK flag set, interlock will not be
released. vcanrecycle() failed to unlock interlock under this condition.
- Remove an extra VOP_UNLOCK from a failure case in vcanrecycle().

Pointed out by: rwatson


# 71ea4ba5 21-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED
- Use the new VI asserts in place of the old mtx_assert checks.
- Add the VI asserts to the automated lock checking in the VOP calls. The
interlock should not be held across vops with a few exceptions.
- Add the vop_(un)lock_{pre,post} functions to assert that interlock is held
when LK_INTERLOCK is set.


# 055c0123 12-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Extend the vnode_free_list_mtx to cover numvnodes and freevnodes. This
was done only some of the time before, and now it is uniformly applied.


# 5965373e 10-Aug-2002 Maxime Henrion <mux@FreeBSD.org>

- Introduce a new struct xvfsconf, the userland version of struct vfsconf.
- Make getvfsbyname() take a struct xvfsconf *.
- Convert several consumers of getvfsbyname() to use struct xvfsconf.
- Correct the getvfsbyname.3 manpage.
- Create a new vfs.conflist sysctl to dump all the struct xvfsconf in the
kernel, and rewrite getvfsbyname() to use this instead of the weird
existing API.
- Convert some {set,get,end}vfsent() consumers to use the new vfs.conflist
sysctl.
- Convert a vfsload() call in nfsiod.c to kldload() and remove the useless
vfsisloadable() and endvfsent() calls.
- Add a warning printf() in vfs_sysctl() to tell people they are using
an old userland.

After these changes, it's possible to modify struct vfsconf without
breaking the binary compatibility. Please note that these changes don't
break this compatibility either.

When bp will have updated mount_smbfs(8) with the patch I sent him, there
will be no more consumers of the {set,get,end}vfsent(), vfsisloadable()
and vfsload() API, and I will promptly delete it.


# 8947be9b 05-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move some logic from getnewvnode() to a new function vcanrecycle()
- Unlock the free list mutex around vcanrecycle to prevent a lock order
reversal.


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# f9d0d524 01-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by: bde


# 30721972 30-Jul-2002 Dag-Erling Smørgrav <des@FreeBSD.org>

Nit in previous commit: the correct sysctl type is "S,xvnode"


# 217b2a0b 30-Jul-2002 Dag-Erling Smørgrav <des@FreeBSD.org>

Initialize v_cachedid to -1 in getnewvnode().
Reintroduce the kern.vnode sysctl and make it export xvnodes rather than
vnodes.

Sponsored by: DARPA, NAI Labs


# 07bdba7e 30-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Note that the privilege indicating flag to vaccess() originally used
by the process accounting system is now deprecated.


# a0ee6ed1 30-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on vnodes.
In particular, initialize the label when the vnode is allocated or
reused, and destroy the label when the vnode is going to be released,
or reused. Wow, an object where there really is exactly one place
where it's allocated, and one other where it's freed. Amazing.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# a562685f 29-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Backout the patch made in revision 1.75 of vfs_mount.c. The vputs here
were hiding the real problem of the missing unlock in sync_inactive.
- Add the missing unlock in sync_inactive.

Submitted by: iedowse


# 5c38b6db 28-Jul-2002 Don Lewis <truckman@FreeBSD.org>

Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.


# b02aac46 21-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Teach discretionary access control methods for files about VAPPEND
and VALLPERM.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 7aca6291 19-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# fb36a3d8 16-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Change utimes to set the file creation time (for filesystems that
support creation times such as UFS2) to the value of the
modification time if the value of the modification time is older
than the current creation time. See utimes(2) for further details.

Sponsored by: DARPA & NAI Labs.


# d331c5d4 10-Jul-2002 Matthew Dillon <dillon@FreeBSD.org>

Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc


# 25b286d6 09-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Use standard locking functions in syncer's opv
- vput instead of vrele syncer vnodes in vfs_mount
- Add vop_lookup_{pre,post} to verify locking in VOP_LOOKUP


# 18c48f43 07-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Don't hold the vn lock while calling VOP_CLOSE in vclean().


# bed75d46 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- BUF_REFCNT() seems to be the preferred method for verifying a locked buf.
Tell vop_strategy_pre() to use this instead.
- Ignore B_CLUSTER bufs. Their components are locked but they don't really
exist so they don't have to be. This isn't ideal but it is safe.


# c031d11b 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Fix a mistake in my last commit. Don't grab an extra reference to the object
in bp->b_object.


# 9a236af3 06-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Fixup uses of GETVOBJECT.

- Cache a pointer to the vnode's object in the buf.
- Hold a reference to that object in addition to the vnode's reference just
to be consistent.
- Cleanup code that got the object indirectly through the vp and VOP calls.

This fixes at least one case where we were calling GETVOBJECT without a lock.
It also avoids an expensive layered call at the cost of another pointer in
struct buf.


# 302c7aaa 05-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add vop_strategy_pre to validate VOP_STRATEGY locking.
- Disable original vop_strategy lock specification.
- Switch to the new vop_strategy_pre for lock validation.

VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need
to be translated. There may be other reasons, but as long as the underlying
layer uses a VOP to perform the operations they will be caught later.


# cc8662b0 05-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Add "vop_rename_pre" to do pre rename lock verification. This is enabled only
with DEBUG_VFS_LOCKS.


# d7f9ecc8 03-Jul-2002 Maxime Henrion <mux@FreeBSD.org>

Move vfs_rootmountalloc() in vfs_mount.c and remove lite2_vfs_mountroot()
which was #if 0'd and is not likely to be used now.


# 2b4edb69 02-Jul-2002 Maxime Henrion <mux@FreeBSD.org>

Move every code related to mount(2) in a new file, vfs_mount.c.
The file vfs_conf.c which was dealing with root mounting has
been repo-copied into vfs_mount.c to preserve history.
This makes nmount related development easier, and help reducing
the size of vfs_syscalls.c, which is still an enormous file.

Reviewed by: rwatson
Repo-copy by: peter


# 6bd521df 01-Jul-2002 Ian Dowse <iedowse@FreeBSD.org>

Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick


# 87e1503e 28-Jun-2002 David E. O'Brien <obrien@FreeBSD.org>

Rename the db command lockedvnodes to lockedvnods so that it fits on the
help screen and one doens't think we have a lockedvnodesmap command.


# 210a5a71 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

nuke caddr_t.


# 90769c9e 28-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Improve the VOP locking asserts

- Add vfs_badlock_print to control whether or not we print lock violations
- Add vfs_badlock_panic to control whether we panic on lock violations

Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.


# aac12bcf 28-Jun-2002 Brian Feldman <green@FreeBSD.org>

Fix a case where a vnode got explicitly unlocked after the pointer to it
got set to NULL.

Revision 1.355: in the box


# 7d2d4409 20-Jun-2002 Maxime Henrion <mux@FreeBSD.org>

Change the way we internally store the mount options to
a linked list. This is to allow the merging of the mount
options in the MNT_UPDATE case, as the current data structure
is unsuitable for this.

There are no functional differences in this commit.

Reviewed by: phk


# fe937506 14-Jun-2002 Maxime Henrion <mux@FreeBSD.org>

Change vfs_copyopt() so that the length argument passed to it
must be the exact same size as the mount option. This makes
vfs_copyopt() much more useful.


# edad3af2 06-Jun-2002 Dag-Erling Smørgrav <des@FreeBSD.org>

Move some sysctls from the debug tree to the vfs tree.


# 4a357a32 06-Jun-2002 Dag-Erling Smørgrav <des@FreeBSD.org>

Gratuitous whitespace cleanup.


# d394511d 16-May-2002 Tom Rhodes <trhodes@FreeBSD.org>

More s/file system/filesystem/g


# 34e53231 16-May-2002 Maxime Henrion <mux@FreeBSD.org>

o Fix vfs_copyopt(), the first argument to bcopy() is the source,
not the destination.
o Remove some code from vfs_getopt() which was making the interface
more complicated to use for a very slight gain.


# f0d73b3e 06-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Switch from just holding the interlock to holding the standard lock throughout
getnewvnode(). This is safer. In the future, we should investigate requiring
only the interlock to get the vnode object.


# 6953f5da 05-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Hold the currently selected vnode's lock across the call to VOP_GETVOBJECT.

Don't try to create a vm object before the file system has a chance to finish
initializing it. This is incorrect for a number of reasons. Firstly, that
VOP requires a lock which the file system may not have initialized yet. Also,
open and others will create a vm object if it is necessary later.


# 81e01743 05-May-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Expand the one-line function pbreassignbuf() the only place it is or could
be used.


# 9f943554 04-May-2002 Matthew Dillon <dillon@FreeBSD.org>

Remove obsolete code (that was already #if 0'd out).
Requested by: Hiten Pandya <hitmaster2k@yahoo.com>


# 6008862b 04-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64


# 44731cab 01-Apr-2002 John Baldwin <jhb@FreeBSD.org>

Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@


# 17594b93 26-Mar-2002 Maxime Henrion <mux@FreeBSD.org>

As discussed in -arch, add the new nmount(2) system call and the
new vfs_getopt()/vfs_copyopt() API. This is intended to be used
later, when there will be filesystems implementing the VFS_NMOUNT
operation. The mount(2) system call will disappear when all
filesystems will be converted to the new API. Documentation will
be committed in a while.

Reviewed by: phk


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 89e1164e 05-Mar-2002 Robert Watson <rwatson@FreeBSD.org>

Three p_ucred -> td_ucred's missed in jhb's earlier pass; all appear to
be safe.


# a854ed98 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.


# 68edc1b9 18-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>


# 90737495 18-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove yet a redundant VN_KNOTE() macro.


# 4b55dbe3 17-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.


# 2b8a08af 07-Feb-2002 Peter Wemm <peter@FreeBSD.org>

Fix a couple of style bugs introduced (or touched by) previous commit.


# 079b7bad 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 64011154 01-Feb-2002 Kirk McKusick <mckusick@FreeBSD.org>

In the routines vrele() and vput(), we must lock the vnode and
call VOP_INACTIVE before placing the vnode back on the free list.
Otherwise there is a race condition on SMP machines between
getnewvnode() locking the vnode to reclaim it and vrele()
locking the vnode to inactivate it. This window of vulnerability
becomes exaggerated in the presence of filesystems that have
been suspended as the inactive routine may need to temporarily
release the lock on the vnode to avoid deadlock with the syncer
process.


# c73df808 18-Jan-2002 Matthew Dillon <dillon@FreeBSD.org>

Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal
operation. The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.

Reviewed by: peter
Approved by: re (for MFC)
MFC after: 1 day


# cd600596 15-Jan-2002 Kirk McKusick <mckusick@FreeBSD.org>

When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by: "Sam Leffler" <sam@errno.com>


# e61ab5fc 10-Jan-2002 Matthew Dillon <dillon@FreeBSD.org>

Add vlruvp() routine - implements LRU operation for vnode recycling.

We calculate a trigger point that both guarentees we will find a
sufficient number of vnodes to recycle and prevents us from recycling
vnodes with lots of resident pages. This particular section of
code is designed to recycle vnodes, not do unnecessary frees of
cached VM pages.


# 9dd4281d 24-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix type-o in previous commit (tsleep was using wrong rendezvous point)


# 23b59018 20-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


# 9f2f52d6 18-Dec-2001 Peter Wemm <peter@FreeBSD.org>

Do not initialize static/global variables to 0. Use bss instead of
taking up space in the data section.


# 8f0d41d3 18-Dec-2001 Peter Wemm <peter@FreeBSD.org>

Use a different mechanism to get the vnlru process to wake up and notice
the shutdown request at reboot/halt time.
Disable the printf 'vnlru process getting nowhere, pausing...' and instead
export the count to the debug.vnlru_nowhere sysctl.


# fdb33f08 18-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

This is a forward port of Peter's vlrureclaim() fix, with some minor mods
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job then the original code.

Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days


# 873a4904 14-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

A slightly different version of the vlrureclaim fix.

Reported by: peter, ps


# 9446b36b 13-Dec-2001 Peter Wemm <peter@FreeBSD.org>

If we were called to allocate a vnode that is not associated with a
mount point, do not dereference the NULL mp argument.


# 6b8bd2ef 04-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

Add mnt_reservedvnlist so we can MFC to 4.x, in order to make all mount
structure changes now rather then piecemeal later on. mnt_nvnodelist
currently holds all the vnodes under the mount point. This will eventually
be split into a 'dirty' and 'clean' list. This way we only break kld's once
rather then twice. nvnodelist will eventually turn into the dirty list
and should remain compatible with the klds.


# bcc0dc3d 02-Nov-2001 Robert Watson <rwatson@FreeBSD.org>

Merge from POSIX.1e Capabilities development tree:

o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based
on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE. Add
appropriate conditionals to vaccess() to take that into account.
o Synchronization cap_check_xxx() -> cap_check() change.

Obtained from: TrustedBSD Project


# 4ffa210b 27-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's,
and can also be made static.


# 245df27c 25-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# f92dcd3e 25-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Add missing TAILQ_INSERT_TAIL's which somehow didn't get comitted with
the recent vnode cleanup.


# c72ccd01 22-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days


# 2210e5d9 16-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

fix minor bug in kern.minvnodes sysctl. Use OID_AUTO.


# 917efbaa 08-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

WS Cleanup


# 845bd795 05-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

vinvalbuf() was only waiting for write-I/O to complete. It really has to
wait for both read AND write I/O to complete. Only NFS calls vinvalbuf()
on an active vnode (when the server indicates that the file is stale), so
this bug fix only effects NFS clients.

MFC after: 3 days


# b5810bab 30-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

After extensive testing it has been determined that adding complexity
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.

I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.

MFC after: 3 days


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 0f728902 27-Aug-2001 Peter Wemm <peter@FreeBSD.org>

If a file has been completely unlinked, stop automatically syncing the
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.


# b219758f 27-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Revert previous accidental commit. FWIW, it was part of enabling
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).


# 24a590a0 27-Jul-2001 Peter Wemm <peter@FreeBSD.org>

Fix cut/paste blunder. Serves me right for doing a last minute tweak
to what I had for some time.

Submitted by: bde


# 0cddd8f0 04-Jul-2001 Matthew Dillon <dillon@FreeBSD.org>

With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.


# cd2f7215 27-Jun-2001 John Baldwin <jhb@FreeBSD.org>

- Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.


# 23955314 18-May-2001 Alfred Perlstein <alfred@FreeBSD.org>

Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb


# 0864ef1e 16-May-2001 Ian Dowse <iedowse@FreeBSD.org>

Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp


# 1feb7a6e 11-May-2001 Ian Dowse <iedowse@FreeBSD.org>

In vrele() and vput(), avoid triggering the confusing "missed vn_close"
KASSERT when vp->v_usecount is zero or negative. In this case, the
"v*: negative ref cnt" panic that follows is much more appropriate.

Reviewed by: mckusick


# 8ee8b21b 26-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

vfs_subr.c is getting rather fat. The underlying repocopy and this
commit moves the filesystem export handling code to vfs_export.c


# a13234bb 25-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 759cb263 18-Apr-2001 Seigo Tanimura <tanimura@FreeBSD.org>

Reclaim directory vnodes held in namecache if few free vnodes are
available.

Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.

The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).

Suggested by: tegge
Approved by: phk


# f84e29a0 17-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


# 7df2842d 23-Feb-2001 Jonathan Lemon <jlemon@FreeBSD.org>

Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206


# c0511d3b 18-Feb-2001 Brian Feldman <green@FreeBSD.org>

Switch to using a struct xucred instead of a struct xucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde


# 9ed346ba 08-Feb-2001 Bosko Milekic <bmilekic@FreeBSD.org>

Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)


# fc2ffbe6 04-Feb-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)


# f3f1af39 30-Jan-2001 Boris Popov <bp@FreeBSD.org>

Properly lock new vnode.

Reminded by: tegge


# 1b367556 23-Jan-2001 Jason Evans <jasone@FreeBSD.org>

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 02b65ffb 22-Jan-2001 Robert Watson <rwatson@FreeBSD.org>

o The move to using VADMIN under vaccess() resulted in some system
calls returning EACCES instead of EPERM. This patch modifies vaccess()
to return EPERM instead of EACCES if VADMIN is among the requested
rights. This affects functions normally limited to the owners of
a file, such as chmod(), as EPERM is the error indicating that
privilege would allow the operation, rather than a chance in mandatory
or discretionary rights.

Reported by: bde


# ffc831da 15-Dec-2000 John Baldwin <jhb@FreeBSD.org>

Stick the kthread API in a kthread_* namespace, and the specialized kproc
functions in a kproc_* namespace.

Reviewed by: -arch


# 0bf3b91d 12-Dec-2000 Kirk McKusick <mckusick@FreeBSD.org>

Use proper mutex locking when calling setrunnable from speedup_syncer().

Submitted by: Tor.Egge@fast.no


# 7cc0979f 08-Dec-2000 David Malone <dwmalone@FreeBSD.org>

Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>


# 138e514c 06-Dec-2000 Peter Wemm <peter@FreeBSD.org>

Untangle vfsinit() a bit. Use seperate sysinit functions rather than
having a super-function calling bits all over the place.


# 19f08522 02-Dec-2000 Andrew Gallatin <gallatin@FreeBSD.org>

Correct int/long type mismatch in the proper place this time. freevnodes
and numvnodes are longs in the kernel. They should remain longs in systat,
what really needs to change is that they should be using SYSCTL_LONG rather
than SYSCTL_INT. I also changed wantfreevnodes to SYSCTL_LONG because I
happened to notice it.

I wish there was a way to find all of these automatically..

Pointed out by: bde


# 21913407 30-Nov-2000 John Baldwin <jhb@FreeBSD.org>

Use msleep() instead of mtx_exit()/tsleep() so that we release the lock and
go to sleep as an "atomic" operation.


# 6d984dfa 30-Nov-2000 Kirk McKusick <mckusick@FreeBSD.org>

Get rid of a bogus mtx_exit (it was attempting to release an
already released mutex).

Submitted by: "Chris Knight" <chris@aims.com.au>


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# a2d1480c 02-Nov-2000 Tor Egge <tegge@FreeBSD.org>

Clear the VFREE flag when the vnode is removed from the free list in
getnewvnode(). Otherwise routines called from VOP_INACTIVE() might
attempt to remove the vnode from a free list the vnode isn't on,
causing corruption.
PR: 18012


# 1d7e3e42 02-Nov-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Take VBLK devices further out of their missery.

This should fix the panic I introduced in my previous commit on this topic.


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 47460a23 19-Oct-2000 Robert Watson <rwatson@FreeBSD.org>

o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.

Reviewed by: eivind
Obtained from: TrustedBSD Project


# 7eb9fca5 09-Oct-2000 Eivind Eklund <eivind@FreeBSD.org>

Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)


# 39df8608 06-Oct-2000 Jason Evans <jasone@FreeBSD.org>

Do not call lockdestroy() for v_vnlock, which may point to a lock in a
deeper vfs stacking layer.

Submitted by: bp


# a863c0fb 05-Oct-2000 Eivind Eklund <eivind@FreeBSD.org>

Style fixes based on comments by bde


# a18b1f1d 03-Oct-2000 Jason Evans <jasone@FreeBSD.org>

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# f8be809e 02-Oct-2000 Boris Popov <bp@FreeBSD.org>

Move KASSERTs which checks value of v_usecount after vnode locking, so
it will not produce wrong alarms.


# 02a1e48f 27-Sep-2000 Kirk McKusick <mckusick@FreeBSD.org>

Do the right thing if bdevvp is called twice for the same device.

Obtained from: Poul-Henning Kamp <phk@freebsd.org>


# 67e87166 25-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by: mckusick


# 453aaa0d 21-Sep-2000 Eivind Eklund <eivind@FreeBSD.org>

Style fixes:
* Add lots of comments
* Convert a couple of assertions to KASSERT()
* Minimal whitespace & misapplied {} fixes
* Convert #if 0 to #if COMPILING_LINT for code we presently do not
support, but want to keep available.

Reviewed by: adrian, markm


# bba25953 22-Sep-2000 Eivind Eklund <eivind@FreeBSD.org>

Staticize addalias()


# 21a90397 21-Sep-2000 Alfred Perlstein <alfred@FreeBSD.org>

comment vfs_export functions, requested by: eivind


# e0848358 20-Sep-2000 Robert Watson <rwatson@FreeBSD.org>

o Add additional comment describing vaccess() behavior.

Requested by: eivind
Reviewed by: eivind, adrian


# b0d17ba6 19-Sep-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename lminor() to dev2unit(). This function gives a linear unit number
which hides the 'hole' in the minor bits.

Introduce unit2minor() to do the reverse operation.

Fix some some make_dev() calls which didn't use UID_* or GID_* macros.

Kill the v_hashchain alias macro, it hides the real relationship.

Introduce experimental SI_CHEAPCLONE flag set it on cloned bpfs.


# 9ff5ce6b 12-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 0384fff8 06-Sep-2000 Jason Evans <jasone@FreeBSD.org>

Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*(). See mutex(9). (Note: The
alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
preempted (i386 only).

Partially contributed by: BSDi (BSD/OS)
Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh


# 728783c2 05-Sep-2000 Robert Watson <rwatson@FreeBSD.org>

o Synchronize vaccess() capability access control checks with TrustedBSD
tree.

Obtained from: TrustedBSD Project


# 64dc16df 05-Sep-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move extern declaration of dead_vnodeop_p to a .h file.

Remove race condition in vn_isdisk().


# 012c643d 29-Aug-2000 Robert Watson <rwatson@FreeBSD.org>

o Restructure vaccess() so as to check for DAC permission to modify the
object before falling back on privilege. Make vaccess() accept an
additional optional argument, privused, to determine whether
privilege was required for vaccess() to return 0. Add commented
out capability checks for reference. Rename some variables to make
it more clear which modes/uids/etc are associated with the object,
and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
privused argument. Once additional patches are applied, suser()
will no longer set ASU, so privused will permit passing of
privilege information up the stack to the caller.

Reviewed by: bde, green, phk, -security, others
Obtained from: TrustedBSD Project


# 4fe6d437 20-Aug-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Fix typo in last commit.


# e39c53ed 20-Aug-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Centralize the canonical vop_access user/group/other check in vaccess().

Discussed with: bde


# 9b971133 23-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# 3660ebc2 07-Jul-2000 Boris Popov <bp@FreeBSD.org>

Fix support for more than 256 simultaneous mounts. Theoretical limit
is 2^16 mounts per fs type.

Reported by: Troy Arie Cobb <tcobb@staff.circle.net> via phk
Reviewed by: bde


# 77978ab8 04-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde


# c904bbbd 03-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).


# 37642196 03-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

If a buffer flush fails when trying to reclaim a vnode, it is too
late to save the vnode, so just toss any remaining unwritten buffers
rather than leaving them lying around to make trouble in the future.


# 3275cf73 03-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES,
that is way cleaner than using the softupdates_stub stunt, which
should be killed when convenient.

Discussed with: mckusick


# 82d9ae4e 03-Jul-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grog our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)


# a8b1f9d2 27-Jun-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move prtactive to vfs from ufs. It is used all over the place.


# a2e7a027 16-Jun-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Virtualizes & untangles the bioops operations vector.

Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# 01f76720 14-May-2000 Jeroen Ruigrok van der Werven <asmodai@FreeBSD.org>

Fix the rootmount code for now.
This function will probably rewritten/renamed to devpp.

Submitted by: Assar Westerlund <assar@sics.se> on -current
Confirmed to work: Steinar Haug <sthaug@nethelp.no>,
Manfred Antar <mantar@pacbell.net>
Reviewed by: phk


# 9626b608 05-May-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by: peter


# b99c307a 20-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Rename the existing BUF_STRATEGY() to DEV_STRATEGY()

substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.


# b081a64a 17-Mar-2000 Chris Costello <chris@FreeBSD.org>

In vn_isdisk(), check whether vp->v_rdev is NULL. If it is, then
return ENXIO (Device not configured). Without this, vn_isdisk()
could (and did in the case of lstat() under fdesc) pass a NULL pointer
to devsw(), which caused a page fault.

Reviewed by: alfred


# db5f635a 16-Mar-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate the undocumented, experimental, non-delivering and highly
dangerous MAX_PERF option.


# 05ecdd70 14-Mar-2000 Bruce Evans <bde@FreeBSD.org>

Don't try so hard to make the lower 16 bits of fsids unique. It tended
to recycle full fsids after only 16 mount/unmount's. This is probably
too often for exported fsids. Now we recycle the full fsids only
after 2^16 mount/ umount's and only ensure uniqueness in the lower 16
bits if there have been <= 256 calls to vfs_getnewfsid() since the
system started.


# 61214975 12-Mar-2000 Bruce Evans <bde@FreeBSD.org>

Try harder to make the lower 16 bits of fsids unique. The vfs type
number was packed very wastefully, giving perfect non-uniqeness in
the lower 16 bits of fsids for filesystems with the same vfs type.
This made linux_stat() return perfectly non-unique (broken) 16-bit
st_dev's for nfs mount points, and effectively reduced mntid_base to
8 bits so that the vfs_getnewfsid() looped endlessly when there are
already 256 mounted filesystems with the required vfs type.

Approved by: jkh


# e8359a57 07-Feb-2000 Søren Schmidt <sos@FreeBSD.org>

Do refcounting of open devices (more) correctly.

count_dev funtion by phk.


# b7a5f3ca 02-Feb-2000 Robert Watson <rwatson@FreeBSD.org>

Remove static qualifier from vgonel, as it is needed by the Arla folk
outside of vfs_subr.c.

Submitted by: Assar Westerlund <assar@sics.se>
Reviewed by: rwatson
Approved by: jkh


# 9a2b8fca 29-Jan-2000 Robert Watson <rwatson@FreeBSD.org>

This patch fixes a locking bug that can result in deadlock if
the codepath is followed.

From the PR:

vclean calls vrele leading to deadlock (if usecount > 0)

vclean() calls vrele() if v_usecount of the node was higher than one.
But before calling it, it sets the VXLOCK flag, which will make
vn_lock called from vrele dead-lock.

PR: kern/15117
Submitted by: Assar Westerlund <assar@stacken.kth.se>
Reviewed by: rwatson
Obtained from: NetBSD


# ba4ad1fc 09-Jan-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 411e1480 09-Jan-2000 Kirk McKusick <mckusick@FreeBSD.org>

Remove the P_BUFEXHAUST flag from the syncer process (leaving
it only on the buf_daemon process). The problem is that when the
syncer process starts running the worklist, it wants to delete
lots of files. It does this by VFS_VGET'ing the vnodes, clearing
the blocks in them and bdwrite'ing the buffer. It can process close
to a thousand files per second which generates a large number of
dirty buffers. So, giving it special priviledge at the buffer trough
leads to trouble as the buf_daemon does occationally need a free
buffer to proceed and if the syncer has used every last one up,
we are toast.


# e12d97d2 08-Jan-2000 Eivind Eklund <eivind@FreeBSD.org>

Change NDFREE() from a macro to a function for the time being; the macro
version caused intolerable bloat (30k). I'm likely to revisit this with an
attempt at a smarter macro.

Bloat noticed by: bde


# 5e950839 07-Jan-2000 Luoqi Chen <luoqi@FreeBSD.org>

Introduce a mechanism to suspend/resume system processes. Suspend syncer
and bufdaemon prior to disk sync during system shutdown.


# c37c9620 04-Jan-2000 Matthew Dillon <dillon@FreeBSD.org>

Enhance reassignbuf(). When a buffer cannot be time-optimally inserted
into vnode dirtyblkhd we append it to the list instead of prepend it to
the list in order to maintain a 'forward' locality of reference, which
is arguably better then 'reverse'. The original algorithm did things this
way to but at a huge time cost.

Enhance the append interlock for NFS writes to handle intr/soft mounts
better.

Fix the hysteresis for NFS async daemon I/O requests to reduce the
number of unnecessary context switches.

Modify handling of NFS mount options. Any given user option that is
too high now defaults to the kernel maximum for that option rather then
the kernel default for that option.

Reviewed by: Alfred Perlstein <bright@wintelcom.net>


# 02b00854 21-Dec-1999 Kirk McKusick <mckusick@FreeBSD.org>

Prettyness police: Identify flags in b_xflags with BX_ to distinguish
them from flags in b_flags which are prefixed with B_


# 4f79d873 11-Dec-1999 Matthew Dillon <dillon@FreeBSD.org>

Add MAP_NOSYNC feature to mmap(), and MADV_NOSYNC and MADV_AUTOSYNC to
madvise().

This feature prevents the update daemon from gratuitously flushing
dirty pages associated with a mapped file-backed region of memory. The
system pager will still page the memory as necessary and the VM system
will still be fully coherent with the filesystem. Modifications made
by other means to the same area of memory, for example by write(), are
unaffected. The feature works on a page-granularity basis.

MAP_NOSYNC allows one to use mmap() to share memory between processes
without incuring any significant filesystem overhead, putting it in
the same performance category as SysV Shared memory and anonymous memory.

Reviewed by: julian, alc, dg


# 6bdfe06a 11-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


# 245efbba 29-Nov-1999 Matthew Dillon <dillon@FreeBSD.org>

Remove vfs_getrootfsid() function (a temporary hack added a few months
ago to make BOOTP work again). It is no longer required by BOOTP and
no longer used.


# 38224dcd 22-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by: sos


# 0429e37a 20-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ. Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly mp != (void*)&mountlist comparisons.

Requested by: phk
Submitted by: Jake Burkholder jake@checker.org
PR: 14967


# 1b727751 16-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Commit the remaining part of PR14914:

Alot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914


# 698f9cf8 09-Nov-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Next step in the device cleanup process.

Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.

Unify spec_open() for bdev and cdev cases.

Remove the disabled bdev specific read/write code.


# 923502ff 29-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

useracc() the prequel:

Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.


# d1f088da 11-Oct-1999 Peter Wemm <peter@FreeBSD.org>

Trim unused options (or #ifdef for undoc options).

Submitted by: phk


# aa4f4b69 04-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Move the buffered read/write code out of spec_{read|write} and into
two new functions spec_buf{read|write}.

Add sysctl vfs.bdev_buffered which defaults to 1 == true. This
sysctl can be used to experimentally turn buffered behaviour for
bdevs off. I should not be changed while any blockdevices are
open. Remove the misplaced sysctl vfs.enable_userblk_io.

No other changes in behaviour.


# 1b5464ef 29-Sep-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove v_maxio from struct vnode.

Replace it with mnt_iosize_max in struct mount.

Nits from: bde


# 40360b1b 20-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Final commit to remove vnode->v_lastr. vm_fault now handles read
clustering issues (replacing code that used to be in
ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter
inlines.

This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
tree for a while). Determination of the I/O strategy (sequential, random,
and so forth) is now handled on a descriptor-by-descriptor basis for
base I/O calls, and on a memory-region-by-memory-region and
process-by-process basis for VM faults.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


# 552f337f 20-Sep-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Initialize vp->v_maxio to its default in getnetvnode() rather than
four different places in vfs_cluster.c


# e6f71111 19-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix BOOTP root FS mounts. Also cleanup vfs_getnewfsid() and collapse
addaliasu() into addalias() (no operational change) and clarify comments
relating to a trick that vclean() uses.

The fix to BOOTP is yet another hack. Actually, rootfsid handling
is already a major hack. The whole thing needs to be cleaned up.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


# bb01f28e 17-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Add vfs.enable_userblk_io sysctl to control whether user reads and writes
to buffered block devices are allowed. The default is to be backwards
compatible, i.e. reads and writes are allowed.

The idea is for a larger crowd to start running with this disabled and
see what problems, if any, crop up, and then to change the default to
off and see if any problems crop up in the next 6 months prior to
potentially removing support entirely. There are still a few people,
Julian and myself included, who believe the buffered block device
access from usermode to be useful.

Remove use of vnode->v_lastr from buffered block device I/O in
preparation for removal of vnode->v_lastr field, replacing it with
the already existing seqcount metric to detect sequential operation.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# d137accc 29-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Add dev_t freeing code. Controlled by sysctl debug.free_devt, default
is off.


# 96267288 28-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

remove unused variables.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# dbafb366 26-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify the handling of VCHR and VBLK vnodes using the new dev_t:

Make the alias list a SLIST.

Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.

Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.

Make the revoke syscalls use vcount() instead of VALIASED.

Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.

vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;


# 41d2e3e0 24-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.


# 0ff7b13a 24-Aug-1999 Julian Elischer <julian@FreeBSD.org>

Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw.

In lookup() however it's the other way around as we need to supply the
dev_t for the vnode, so devfs still has a copy of it stashed away.

Sourcing it from the vnode in the vnops however is useful as it makes
a lot of the code almost the same as that in specfs.


# a2801b77 21-Aug-1999 John Polstra <jdp@FreeBSD.org>

Support full-precision file timestamps. Until now, only the seconds
have been maintained, and that is still the default. A new sysctl
variable "vfs.timestamp_precision" can be used to enable higher
levels of precision:

0 = seconds only; nanoseconds zeroed (default).
1 = seconds and nanoseconds, accurate within 1/HZ.
2 = seconds and nanoseconds, truncated to microseconds.
>=3 = seconds and nanoseconds, maximum precision.

Level 1 uses getnanotime(), which is fast but can be wrong by up
to 1/HZ. Level 2 uses microtime(). It might be desirable for
consistency with utimes() and friends, which take timeval structures
rather than timespecs. Level 3 uses nanotime() for the higest
precision.

I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with
"cpio -pdu". There was almost negligible difference in the system
times -- much less than 1%, and less than the variation among
multiple runs at the same level. Bruce Evans dreamed up a torture
test involving 1-byte reads with intervening fstat() calls, but
the cpio test seems more realistic to me.

This feature is currently implemented only for the UFS (FFS and
MFS) filesystems. But I think it should be easy to support it in
the others as well.

An earlier version of this was reviewed by Bruce. He's not to
blame for any breakage I've introduced since then.

Reviewed by: bde (an earlier version of the code)


# 7dc5cd04 13-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

The bdevsw() and cdevsw() are now identical, so kill the former.


# 4d4f9323 13-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

s/v_specinfo/v_rdev/


# 0ef1c826 08-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 67452993 26-Jul-1999 Alan Cox <alc@FreeBSD.org>

Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by: dillon


# 698bfad7 20-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Now a dev_t is a pointer to struct specinfo which is shared by all specdev
vnodes referencing this device.

Details:
cdevsw->d_parms has been removed, the specinfo is available
now (== dev_t) and the driver should modify it directly
when applicable, and the only driver doing so, does so:
vn.c. I am not sure the logic in checking for "<" was right
before, and it looks even less so now.

An intial pool of 50 struct specinfo are depleted during
early boot, after that malloc had better work. It is
likely that fewer than 50 would do.

Hashing is done from udev_t to dev_t with a prime number
remainder hash, experiments show no better hash available
for decent cost (MD5 is only marginally better) The prime
number used should not be close to a power of two, we use
83 for now.

Add new checkalias2() to get around the loss of info from
dev2udev() in bdevvp();

The aliased vnodes are hung on a list straight of the dev_t,
and speclisth[SPECSZ] is unused. The sharing of struct
specinfo means that the v_specnext moves into the vnode
which grows by 4 bytes.

Don't use a VBLK dev_t which doesn't make sense in MFS, now
we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.

Storage overhead from all of this is O(50k).

Bump __FreeBSD_version to 400009

The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo


# 3de280c4 19-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

[click] Now all dev_t's in the kernel have their char device major.

Only know casualy of this is swapinfo/pstat which should be fixes
the right way: Store the actual pathname in the kernel like mount
does. [Volounteers sought for this task]

The road map from here is roughly: expand struct specinfo into struct
based dev_t. Add dev_t registration facilities for device drivers and
start to use them.


# 6ca54864 18-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce the vn_todev(struct vnode*) function, which returns the dev_t
corresponding to a VBLK or VCHR node, or NODEV.


# c7119ea7 17-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Fix 2nd arg to udev2dev().


# f008cfcc 17-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

I have not one single time remembered the name of this function correctly
so obviously I gave it the wrong name. s/umakedev/makeudev/g


# e7647e6c 12-Jul-1999 Kris Kennaway <kris@FreeBSD.org>

Correct a couple of spelling errors in comments.


# ad8ac923 08-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

These changes appear to give us benefits with both small (32MB) and
large (1G) memory machine configurations. I was able to run 'dbench 32'
on a 32MB system without bring the machine to a grinding halt.

* buffer cache hash table now dynamically allocated. This will
have no effect on memory consumption for smaller systems and
will help scale the buffer cache for larger systems.

* minor enhancement to pmap_clearbit(). I noticed that
all the calls to it used constant arguments. Making
it an inline allows the constants to propogate to
deeper inlines and should produce better code.

* removal of inherent vfs_ioopt support through the emplacement
of appropriate #ifdef's, with John's permission. If we do not
find a use for it by the end of the year we will remove it entirely.

* removal of getnewbufloops* counters & sysctl's - no longer
necessary for debugging, getnewbuf() is now optimal.

* buffer hash table functions removed from sys/buf.h and localized
to vfs_bio.c

* VFS_BIO_NEED_DIRTYFLUSH flag and support code added
( bwillwrite() ), allowing processes to block when too many dirty
buffers are present in the system.

* removal of a softdep test in bdwrite() that is no longer necessary
now that bdwrite() no longer attempts to flush dirty buffers.

* slight optimization added to bqrelse() - there is no reason
to test for available buffer space on B_DELWRI buffers.

* addition of reverse-scanning code to vfs_bio_awrite().
vfs_bio_awrite() will attempt to locate clusterable areas
in both the forward and reverse direction relative to the
offset of the buffer passed to it. This will probably not
make much of a difference now, but I believe we will start
to rely on it heavily in the future if we decide to shift
some of the burden of the clustering closer to the actual
I/O initiation.

* Removal of the newbufcnt and lastnewbuf counters that Kirk
added. They do not fix any race conditions that haven't already
been fixed by the gbincore() test done after the only call
to getnewbuf(). getnewbuf() is a static, so there is no chance
of it being misused by other modules. ( Unless Kirk can think
of a specific thing that this code fixes. I went through it
very carefully and didn't see anything ).

* removal of VOP_ISLOCKED() check in flushbufqueues(). I do not
think this check is necessary, the buffer should flush properly
whether the vnode is locked or not. ( yes? ).

* removal of extra arguments passed to getnewbuf() that are not
necessary.

* missed cluster_wbuild() that had to be a cluster_wbuild_wb() in
vfs_cluster.c

* vn_write() now calls bwillwrite() *PRIOR* to locking the vnode,
which should greatly aid flushing operations in heavy load
situations - both the pageout and update daemons will be able
to operate more efficiently.

* removal of b_usecount. We may add it back in later but for now
it is useless. Prior implementations of the buffer cache never
had enough buffers for it to be useful, and current implementations
which make more buffers available might not benefit relative to
the amount of sophistication required to implement a b_usecount.
Straight LRU should work just as well, especially when most things
are VMIO backed. I expect that (even though John will not like
this assumption) directories will become VMIO backed some point soon.

Submitted by: Matthew Dillon <dillon@backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# e929c00d 03-Jul-1999 Kirk McKusick <mckusick@FreeBSD.org>

The buffer queue mechanism has been reformulated. Instead of having
QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN,
QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean
and dirty buffers have been separated. Empty buffers with KVM
assignments have been separated from truely empty buffers. getnewbuf()
has been rewritten and now operates in a 100% optimal fashion. That is,
it is able to find precisely the right kind of buffer it needs to
allocate a new buffer, defragment KVM, or to free-up an existing buffer
when the buffer cache is full (which is a steady-state situation for
the buffer cache).

Buffer flushing has been reorganized. Previously buffers were flushed
in the context of whatever process hit the conditions forcing buffer
flushing to occur. This resulted in processes blocking on conditions
unrelated to what they were doing. This also resulted in inappropriate
VFS stacking chains due to multiple processes getting stuck trying to
flush dirty buffers or due to a single process getting into a situation
where it might attempt to flush buffers recursively - a situation that
was only partially fixed in prior commits. We have added a new daemon
called the buf_daemon which is responsible for flushing dirty buffers
when the number of dirty buffers exceeds the vfs.hidirtybuffers limit.
This daemon attempts to dynamically adjust the rate at which dirty buffers
are flushed such that getnewbuf() calls (almost) never block.

The number of nbufs and amount of buffer space is now scaled past the
8MB limit that was previously imposed for systems with over 64MB of
memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed
somewhat. The number of physical buffers has been increased with the
intention that we will manage physical I/O differently in the future.

reassignbuf previously attempted to keep the dirtyblkhd list sorted which
could result in non-deterministic operation under certain conditions,
such as when a large number of dirty buffers are being managed. This
algorithm has been changed. reassignbuf now keeps buffers locally sorted
if it can do so cheaply, and otherwise gives up and adds buffers to
the head of the dirtyblkhd list. The new algorithm is deterministic but
not perfect. The new algorithm greatly reduces problems that previously
occured when write_behind was turned off in the system.

The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive
P_BUFEXHAUST bit. This bit allows processes working with filesystem
buffers to use available emergency reserves. Normal processes do not set
this bit and are not allowed to dig into emergency reserves. The purpose
of this bit is to avoid low-memory deadlocks.

A small race condition was fixed in getpbuf() in vm/vm_pager.c.

Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Kirk McKusick <mckusick@mckusick.com>


# 8947a90a 02-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Make sure that stat(2) and friends always return a valid st_dev field.

Pseudo-FS need not fill in the va_fsid anymore, the syscall code
will use the first half of the fsid, which now looks like a udev_t
with major 255.


# 9c8b8baa 01-Jul-1999 Peter Wemm <peter@FreeBSD.org>

Slight reorganization of kernel thread/process creation. Instead of using
SYSINIT_KT() etc (which is a static, compile-time procedure), use a
NetBSD-style kthread_create() interface. kproc_start is still available
as a SYSINIT() hook. This allowed simplification of chunks of the
sysinit code in the process. This kthread_create() is our old kproc_start
internals, with the SYSINIT_KT fork hooks grafted in and tweaked to work
the same as the NetBSD one.

One thing I'd like to do shortly is get rid of nfsiod as a user initiated
process. It makes sense for the nfs client code to create them on the
fly as needed up to a user settable limit. This means that nfsiod
doesn't need to be in /sbin and is always "available". This is a fair bit
easier to do outside of the SYSINIT_KT() framework.


# 67812eac 25-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.


# f9c8cab5 16-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.


# e4ab40bc 15-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


# 2447bec8 31-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print an message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.


# 02013ff8 23-May-1999 John Birrell <jb@FreeBSD.org>

Remove the test for bdevsw(dev) == NULL from bdevvp() because it fails
if there is no character device associated with the block device. In this
case that doesn't matter because bdevvp() doesn't use the character
device structure.

I can use the pointy bit of the axe too.


# 0ce54cbb 14-May-1999 Luoqi Chen <luoqi@FreeBSD.org>

Legally acquire a major number for mfs.


# eaea7a9e 13-May-1999 Kirk McKusick <mckusick@FreeBSD.org>

Previously directories were sync'ed every 10 seconds while bitmaps &
inodes were synced every 15 seconds. This is now reversed as during
directory create, we cannot commit the directory entry until its
inode has been written. With this switch, the inodes will be more
likely to be written by the time that the directory is written thus
reducing the number of directory rollbacks that are needed.


# cc5881cf 12-May-1999 Peter Wemm <peter@FreeBSD.org>

Fix (?) SPECHASH dev_t/major/minor/etc args


# 8bee45c4 12-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Don't peek into dev_t


# bfbb9ce6 11-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


# 1637aa4b 08-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Fix some of the places where too much inside knowledge about major/minor
layout and dev_t structure is being (ab)used.


# 4be2eb8c 08-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

I got tired of seeing all the cdevsw[major(foo)] all over the place.

Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.


# 46eede00 07-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Continue where Julian left off in July 1998:

Virtualize bdevsw[] from cdevsw. bdevsw() is now an (inline)
function.

Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
to the order of the cmaj/bmaj arguments!)

Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
(ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)


# 3d177f46 03-May-1999 Bill Fumerola <billf@FreeBSD.org>

Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)


# 4ef2094e 11-Mar-1999 Julian Elischer <julian@FreeBSD.org>

Reviewed by: Many at differnt times in differnt parts,
including alan, john, me, luoqi, and kirk
Submitted by: Matt Dillon <dillon@frebsd.org>

This change implements a relatively sophisticated fix to getnewbuf().
There were two problems with getnewbuf(). First, the writerecursion
can lead to a system stack overflow when you have NFS and/or VN
devices in the system. Second, the free/dirty buffer accounting was
completely broken. Not only did the nfs routines blow it trying to
manually account for the buffer state, but the accounting that was
done did not work well with the purpose of their existance: figuring
out when getnewbuf() needs to sleep.

The meat of the change is to kern/vfs_bio.c. The remaining diffs are
all minor except for NFS, which includes both the fixes for bp
interaction AND fixes for a 'biodone(): buffer already done' lockup.
Sys/buf.h also contains a chaining structure which is not used by
this patchset but is used by other patches that are coming soon.
This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF.
(sorry for the missing T matt)


# 155f87da 24-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Reviewed by: Julian Elischer <julian@whistle.com>

Add d_parms() to {c,b}devsw[]. If non-NULL this function points to
a device routine that will properly fill in the specinfo structure.
vfs_subr.c's checkalias() supplies appropriate defaults. This change
should be fully backwards compatible with existing devices.


# 42e26d47 19-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Protect vn worklist and vn->v_{clean,dirty}blkhd at splbio().

Get rid of extra LIST_REMOVE()

Reviewed by: hsu@FreeBSD.ORG (Jeffrey Hsu), mckusick@McKusick.COM
Submitted by: hsu@FreeBSD.ORG (Jeffrey Hsu), dillon@backplane.com ( Matthew Dillon )


# 82b23b53 04-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

vp->v_object must be valid after normal flow of vfs_object_create()
completes, change if() to KASSERT(). This is not a bug, we are
simplify clarifying and optimizing the code.

In if/else in vfs_object_create(), the failure of both conditionals
will lead to a NULL object. Exit gracefully if this case occurs.
( this case does not normally occur, but needed to be handled ).

Obtained from: Eivind Eklund <eivind@FreeBSD.org>


# bc814931 29-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

More const fixes for -Wall, -Wcast-qual


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 1c7c3c6a 21-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

This is a rather large commit that encompasses the new swapper,
changes to the VM system to support the new swapper, VM bug
fixes, several VM optimizations, and some additional revamping of the
VM code. The specific bug fixes will be documented with additional
forced commits. This commit is somewhat rough in regards to code
cleanup issues.

Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# fb116777 05-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Remove the 'waslocked' parameter to vfs_object_create().


# 0df45b5a 05-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Finish staticization.


# 289bdf33 02-Jan-1999 Bruce Evans <bde@FreeBSD.org>

Ifdefed conditionally used simplock variables.


# 4d948813 23-Dec-1998 Bruce Evans <bde@FreeBSD.org>

Restored rev.1.31 which was clobbered by rev.1.69 (the big Lite2
merge). This fixes at least hanging in revoke(2) when a somewhat
active slave pty is revoked. The hang made the window for the
null pointer bug in ufsspec_{read,write} much larger.

There are many other bugs in this area (revoke of an active fifo
at best leaks memory...).


# 29c98cd8 21-Dec-1998 Eivind Eklund <eivind@FreeBSD.org>

Check return value of tsleep(). I've checked of all call points -
there does not seem to be a problem with this.

PR: kern/8732
Analysis by: David G Andersen <danderse@cs.utah.edu>
Tested by: Alfred Perlstein <bright@hotjobs.com>


# db878ba4 21-Dec-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 2127f260 04-Dec-1998 Archie Cobbs <archie@FreeBSD.org>

Examine all occurrences of sprintf(), strcat(), and str[n]cpy()
for possible buffer overflow problems. Replaced most sprintf()'s
with snprintf(); for others cases, added terminating NUL bytes where
appropriate, replaced constants like "16" with sizeof(), etc.

These changes include several bug fixes, but most changes are for
maintainability's sake. Any instance where it wasn't "immediately
obvious" that a buffer overflow could not occur was made safer.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>
Reviewed by: Mike Spengler <mks@networkcs.com>


# 16e9e530 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Convert lists for bufs attached to vnodes from a LIST to a TAILQ.
- Use TAILQ_* macros extensively instead of internal names
- use b_xflags instead of the NOLIST magic number hack in the next pointer
- clean bufs are inserted at the tail rather than the head.
- redo dirty buffer insert so that metadata (negative lbn) goes to the
tail directly rather than at the HEAD. This makes a difference when
inserting dirty data blocks in lbn sorted order since data block
insertion will not have to bypass all the metadata cruft. data is
lbn sorted since it makes sense for clustering and writeback ordering,
while metadata sorting doesn't help much since the lbn's are
meaningless when walking the list for writebacks.

Small systems will not notice much (if any) benefit from this, but really
busy systems with large dirty block lists should get a lot more.

I've tested this with softdep, and it doesn't seem to mind the change of
queueing of metadata.

Reviewed (in princible) by: dg
Obtained from: partly from John Dyson's work-in-progress patches in June.


# b421db37 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

The last argument to vm_object_page_clean() are now bit flags, rather than
the old true/false.

While here, have vfs_msync() only call vm_object_page_clean() with
OBJPC_SYNC if called with MNT_WAIT flags. vfs_msync() is called at unmount
time (with MNT_WAIT) and from the syncer process (formerly update).
This should make dirty mmap writebacks a little less nasty.

I have tested this a little with SOFTUPDATES enabled, but I don't normally
use it since I've been badly burned too many times.


# cbbbd4c3 29-Oct-1998 Bruce Evans <bde@FreeBSD.org>

Oops, rev.1.167 made the device number checking in bdevvp() too strict
for mfs root mounts. Don't require major 255 to be in bdevsw[].


# 20f02ef5 29-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Remove the V_SAVEMETA flag, nothing uses it any more now that msdosfs and
ext2fs call vtruncbuf() directly. This simplifies and cleans up
vinvalbuf() a little.


# 885bf0b5 26-Oct-1998 Bruce Evans <bde@FreeBSD.org>

Updated the major number check in vfs_object_create(). It's not
clear if the check is necessary, but vfs_object_create() is called
for all vnodes and it was silly to create objects for VBLK vnodes
that don't even have a driver.


# f5ef029e 25-Oct-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.


# 37906c68 25-Oct-1998 Bruce Evans <bde@FreeBSD.org>

Fixed device number checking in bdevvp():
- dev != NODEV was checked for, but 0 was returned on failure. This was
fixed in Lite2 (except the return code was still slightly wrong (ENODEV
instead of ENXIO)) but the changes were not merged. This case probably
doesn't actually occur under FreeBSD.
- major(dev) was not checked to have a valid non-NULL bdevsw entry. This
caused panics when the driver for the root device didn't exist.

Fixed minor misformattings in bdevvp(). Rev.1.14 consisted mainly of
gratuitous reformattings that seem to have caused many Lite2 merge
errors.

PR: 8417


# f74d75a2 14-Oct-1998 Dmitrij Tejblum <dt@FreeBSD.org>

Backed out rev. 1.164. It caused problems on SMP.

PR: 8309


# 6cde7a16 13-Oct-1998 David Greenman <dg@FreeBSD.org>

Fixed two potentially serious classes of bugs:

1) The vnode pager wasn't properly tracking the file size due to
"size" being page rounded in some cases and not in others.
This sometimes resulted in corrupted files. First noticed by
Terry Lambert.
Fixed by changing the "size" pager_alloc parameter to be a 64bit
byte value (as opposed to a 32bit page index) and changing the
pagers and their callers to deal with this properly.
2) Fixed a bogus type cast in round_page() and trunc_page() that
caused some 64bit offsets and sizes to be scrambled. Removing
the cast required adding casts at a few dozen callers.
There may be problems with other bogus casts in close-by
macros. A quick check seemed to indicate that those were okay,
however.


# 9bbd8a24 12-Oct-1998 Dmitrij Tejblum <dt@FreeBSD.org>

UnVMIO vnodes of block devices when they are no longer in use. (Some
things, like msdosfs, do not work (panic) on devices with VMIO enabled.
FFS enable VMIO on mounted devices, and nothing previously disabled it, so,
after you mounted FFS floppy, you could not mount msdosfs floppy anymore...)

This is mostly a quick before-release fix.

Reviewed by: bde


# d024c955 14-Sep-1998 Søren Schmidt <sos@FreeBSD.org>

Remove the SLICE code.
This clearly needs alot more thought, and we dont need this to hunt
us down in 3.0-RELEASE.


# 500b04a2 05-Sep-1998 Bruce Evans <bde@FreeBSD.org>

Instantiate `nfs_mount_type' in a standard file so that it is present
when nfs is an LKM. Declare it in a header file. Don't forget to use
it in non-Lite2 code. Initialize it to -1 instead of to 0, since 0
will soon be the mount type number for the first vfs loaded.

NetBSD uses strcmp() to avoid this ugly global.


# f5ce6752 29-Aug-1998 Bruce Evans <bde@FreeBSD.org>

Oops, the previous revision unconfigured too much pre-Lite2 compatibilty
cruft. At least lsvfs(1) was broken.


# 13950bd2 12-Aug-1998 Bruce Evans <bde@FreeBSD.org>

Don't configure compatibility code for pre-Lite2 mount() calls by
default. This code should go away soon.


# 7a6c46b5 12-Jul-1998 Doug Rabson <dfr@FreeBSD.org>

Initialise all the fields separately in vattr_null since on the alpha
they are not all the same width.


# ac1e407b 11-Jul-1998 Bruce Evans <bde@FreeBSD.org>

Fixed printf format errors.


# be160d60 21-Jun-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused includes.


# 32f5d4d8 10-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Replace 'sleep()' with 'tsleep()'
Accidentally imported from Kirk's codebase.

Pointed out by: various.


# 28913ebe 10-Jun-1998 Julian Elischer <julian@FreeBSD.org>

Submitted by: Kirk McKusick <mckusick@McKusick.COM>

Fix for potential hang when trying to reboot the system or
to forcibly unmount a soft update enabled filesystem.
FreeBSD already handled the reboot case differently, this is however a better
fix.


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# cb87a87c 17-May-1998 Tor Egge <tegge@FreeBSD.org>

Supply the correct process argument to dounmount when possible.


# 3e425b96 19-Apr-1998 Julian Elischer <julian@FreeBSD.org>

Add changes and code to implement a functional DEVFS.
This code will be turned on with the TWO options
DEVFS and SLICE. (see LINT)
Two labels PRE_DEVFS_SLICE and POST_DEVFS_SLICE will deliniate these changes.

/dev will be automatically mounted by init (thanks phk)
on bootup. See /sys/dev/slice/slice.4 for more info.
All code should act the same without these options enabled.

Mike Smith, Poul Henning Kamp, Soeren, and a few dozen others

This code does not support the following:
bad144 handling.
Persistance. (My head is still hurting from the last time we discussed this)
ATAPI flopies are not handled by the SLICE code yet.

When this code is running, all major numbers are arbitrary and COULD
be dynamically assigned. (this is not done, for POLA only)
Minor numbers for disk slices ARE arbitray and dynamically assigned.


# 37b8ccd3 18-Apr-1998 Peter Wemm <peter@FreeBSD.org>

In vfs_msync(), test to see if the vnode being examined is "interesting"
(ie: it has a vm_object attached and is marked as OBJ_MIGHTBEDIRTY) before
attempting to lock it. This should reduce the cpu hit that is incurred
when doing a sync(2) and when the syncer process is doing the 30-second
writeback of dirty mmap() data to disk. Skip this speedup if we are
doing an unmount() to be sure to get everything - we can afford to
occasionally miss a msync while the system is running, but not at unmount.

I'm not sure about the VXLOCK and MNT_WAIT case, it seems a bit odd to skip
doing a page_clean at unmount time just because a vnode is VXLOCKed, but
that's what was being done before...


# efdc5523 15-Apr-1998 Peter Wemm <peter@FreeBSD.org>

When the softdep conversion took place, the periodic vfs_msync() from
update got lost. This is responsible for ensuring that dirty mmap() pages
get periodically written to disk. Without it, long time mmap's might not
have their dirty pages written out at all of the system crashes or isn't
cleanly shut down. This could be nasty if you've got a long-running
writing via mmap(), dirty pages used to get written to disk within 30
seconds or so.


# 71033a8c 15-Apr-1998 Tor Egge <tegge@FreeBSD.org>

Unlock mountlist_slock if the mount point was busy (unmount in progress)
during the attempt at lazy fsync.


# 227ee8a1 30-Mar-1998 Poul-Henning Kamp <phk@FreeBSD.org>

Eradicate the variable "time" from the kernel, using various measures.
"time" wasn't a atomic variable, so splfoo() protection were needed
around any access to it, unless you just wanted the seconds part.

Most uses of time.tv_sec now uses the new variable time_second instead.

gettime() changed to getmicrotime(0.

Remove a couple of unneeded splfoo() protections, the new getmicrotime()
is atomic, (until Bruce sets a breakpoint in it).

A couple of places needed random data, so use read_random() instead
of mucking about with time which isn't random.

Add a new nfs_curusec() function.

Mark a couple of bogosities involving the now disappeard time variable.

Update ffs_update() to avoid the weird "== &time" checks, by fixing the
one remaining call that passwd &time as args.

Change profiling in ncr.c to use ticks instead of time. Resolution is
the same.

Add new function "tvtohz()" to avoid the bogus "splfoo(), add time, call
hzto() which subtracts time" sequences.

Reviewed by: bde


# 3c1300a6 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 771b51ef 27-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Don't depend on <sys/mount.h> including <sys/socket.h>.


# 52c64c95 19-Mar-1998 John Dyson <dyson@FreeBSD.org>

In kern_physio.c fix tsleep priority messup.

In vfs_bio.c, remove b_generation count usage,
remove redundant reassignbuf,
remove redundant spl(s),
manage page PG_ZERO flags more correctly,
utilize in invalid value for b_offset until it
is properly initialized. Add asserts
for #ifdef DIAGNOSTIC, when b_offset is
improperly used.
when a process is not performing I/O, and just waiting
on a buffer generally, make the sleep priority
low.
only check page validity in getblk for B_VMIO buffers.

In vfs_cluster, add b_offset asserts, correct pointer calculation
for clustered reads. Improve readability of certain parts of
the code. Remove redundant spl(s).

In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin
<gallatin@cs.duke.edu>). More vtruncbuf problems fixed.


# 1c77c6b7 19-Mar-1998 John Dyson <dyson@FreeBSD.org>

Fix an embarassing problem in vtruncbuf.


# 2deb5d04 16-Mar-1998 John Dyson <dyson@FreeBSD.org>

Correct a severely evil bug in the vtruncbuf code. It didn't cause
me any problems until after the previous commit. This problem then
caused a severe case of creeping crud on my diskdrive, and hosed
my system so bad, that I needed to do a complete reinstall. Sorry!!!

I assume that others have manifest this bug.


# e85c1afb 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Allow vfs_ioopt to be enabled with a (temporary) config option.


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# 26300b34 14-Mar-1998 John Dyson <dyson@FreeBSD.org>

Disable the vfs.ioopt option for now, so that we don't get gratuitious
bugreports. I might not be able to fix the problems before 3.0, due
to other, more important things.


# 8293f20a 13-Mar-1998 Tor Egge <tegge@FreeBSD.org>

Don't misuse vnode interlocks in routines that can be called from interrupts.
PR: 5893


# b1897c19 08-Mar-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 59228495 01-Mar-1998 John Dyson <dyson@FreeBSD.org>

Change vfs.ioopt default back to '0'.


# ffc82b0a 28-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Use a more consistent page wait methodology.
2) Do not unnecessarily force page blocking when paging
pages out.
3) Further improve swap pager performance and correctness,
including fixing the paging in progress deadlock (except
in severe I/O error conditions.)
4) Enable vfs_ioopt=1 as a default.
5) Fix and enable the page prezeroing in SMP mode.

All in all, SMP systems especially should show a significant
improvement in "snappyness."


# 64d3c7e3 22-Feb-1998 John Dyson <dyson@FreeBSD.org>

Clean-up the vget mechanism by permanently attaching VM objects to
vnodes, therefore vget doesn't need to do so anymore. Other minor
improvements include the temp free vnode queue obeying the VAGE
flag and a printf that warns of to-be-removed code being executed.


# 1b11919b 09-Feb-1998 KATO Takenori <kato@FreeBSD.org>

Fixed vnode interlock handling.

Reviewed by: Bruce Evans <bde@zeta.org.au>
Tor Egge <Tor.Egge@idi.ntnu.no>


# 303b270b 08-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Staticize.


# 16e3b0b6 07-Feb-1998 KATO Takenori <kato@FreeBSD.org>

When the vp is lcoked, vget() calls vfs_object_create() with
waslocked = TRUE. This change may fix lockmgr panic in umapfs/nullfs.

PR: 5634
Reviewed by: "John S. Dyson" <toor@dyson.iquest.net>
Suggested by: Bruce Evans <bde@zeta.org.au>


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 95461b45 04-Feb-1998 John Dyson <dyson@FreeBSD.org>

1) Start using a cleaner and more consistant page allocator instead
of the various ad-hoc schemes.
2) When bringing in UPAGES, the pmap code needs to do another vm_page_lookup.
3) When appropriate, set the PG_A or PG_M bits a-priori to both avoid some
processor errata, and to minimize redundant processor updating of page
tables.
4) Modify pmap_protect so that it can only remove permissions (as it
originally supported.) The additional capability is not needed.
5) Streamline read-only to read-write page mappings.
6) For pmap_copy_page, don't enable write mapping for source page.
7) Correct and clean-up pmap_incore.
8) Cluster initial kern_exec pagin.
9) Removal of some minor lint from kern_malloc.
10) Correct some ioopt code.
11) Remove some dead code from the MI swapout routine.
12) Correct vm_object_deallocate (to remove backing_object ref.)
13) Fix dead object handling, that had problems under heavy memory load.
14) Add minor vm_page_lookup improvements.
15) Some pages are not in objects, and make sure that the vm_page.c can
properly support such pages.
16) Add some more page deficit handling.
17) Some minor code readability improvements.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# d09a16d8 30-Jan-1998 Tor Egge <tegge@FreeBSD.org>

Update freevnodes when adding a vnode to the head of the free list.


# 50ce7ff4 23-Jan-1998 John Dyson <dyson@FreeBSD.org>

Add better support for larger I/O clusters, including larger physical
I/O. The support is not mature yet, and some of the underlying implementation
needs help. However, support does exist for IDE devices now.


# 2d8acc0f 22-Jan-1998 John Dyson <dyson@FreeBSD.org>

VM level code cleanups.

1) Start using TSM.
Struct procs continue to point to upages structure, after being freed.
Struct vmspace continues to point to pte object and kva space for kstack.
u_map is now superfluous.
2) vm_map's don't need to be reference counted. They always exist either
in the kernel or in a vmspace. The vmspaces are managed by reference
counts.
3) Remove the "wired" vm_map nonsense.
4) No need to keep a cache of kernel stack kva's.
5) Get rid of strange looking ++var, and change to var++.
6) Change more data structures to use our "zone" allocator. Added
struct proc, struct vmspace and struct vnode. This saves a significant
amount of kva space and physical memory. Additionally, this enables
TSM for the zone managed memory.
7) Keep ioopt disabled for now.
8) Remove the now bogus "single use" map concept.
9) Use generation counts or id's for data structures residing in TSM, where
it allows us to avoid unneeded restart overhead during traversals, where
blocking might occur.
10) Account better for memory deficits, so the pageout daemon will be able
to make enough memory available (experimental.)
11) Fix some vnode locking problems. (From Tor, I think.)
12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp.
(experimental.)
13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c
code. Use generation counts, get rid of unneded collpase operations,
and clean up the cluster code.
14) Make vm_zone more suitable for TSM.

This commit is partially as a result of discussions and contributions from
other people, including DG, Tor Egge, PHK, and probably others that I
have forgotten to attribute (so let me know, if I forgot.)

This is not the infamous, final cleanup of the vnode stuff, but a necessary
step. Vnode mgmt should be correct, but things might still change, and
there is still some missing stuff (like ioopt, and physical backing of
non-merged cache files, debugging of layering concepts.)


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 53f6f085 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix another vnode leak.


# 925a3a41 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 857d737e 07-Jan-1998 John Dyson <dyson@FreeBSD.org>

Disable io optimizations again, minor bug found, and will be fixed in
a few days.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 483140ea 29-Dec-1997 John Dyson <dyson@FreeBSD.org>

Add the vnode interlock back around vref.


# 60f8d464 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 1efb74fb 19-Dec-1997 John Dyson <dyson@FreeBSD.org>

Some performance improvements, and code cleanups (including changing our
expensive OFF_TO_IDX to btoc whenever possible.)


# 1cbbd625 14-Dec-1997 Garrett Wollman <wollman@FreeBSD.org>

Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL
if one of the new poll types is requested; hopefully this will not break
any existing code. (This is done so that programs have a dependable
way of determining whether a filesystem supports the extended poll types
or not.)

The new poll types added are:

POLLWRITE - file contents may have been modified
POLLNLINK - file was linked, unlinked, or renamed
POLLATTRIB - file's attributes may have been changed
POLLEXTEND - file was extended

Note that the internal operation of poll() means that it is impossible
for two processes to reliably poll for the same event (this could
be fixed but may not be worth it), so it is not possible to rewrite
`tail -f' to use poll at this time.


# cb451ebd 22-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Staticized.


# b1f4a44b 11-Nov-1997 Julian Elischer <julian@FreeBSD.org>

Reviewed by: various.

Ever since I first say the way the mount flags were used I've hated the
fact that modes, and events, internal and exported, and short-term
and long term flags are all thrown together. Finally it's annoyed me enough..
This patch to the entire FreeBSD tree adds a second mount flag word
to the mount struct. it is not exported to userspace. I have moved
some of the non exported flags over to this word. this means that we now
have 8 free bits in the mount flags. There are another two that might
well move over, but which I'm not sure about.
The only user visible change would have been in pstat -v, except
that davidg has disabled it anyhow.
I'd still like to move the state flags and the 'command' flags
apart from each other.. e.g. MNT_FORCE really doesn't have the
same semantics as MNT_RDONLY, but that's left for another day.


# 4a11ca4e 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# dba3870c 26-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

VFS interior redecoration.

Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 55166637 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Distribute and statizice a lot of the malloc M_* types.

Substantial input from: bde


# f7891f9a 11-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Dike out a weird warning.


# d047b580 26-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

I lost a bit of my change in the last commit, this is more like it.
Noticed by: bde


# 87b1940a 25-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Reduce the target number of vnodes on the freelist from desiredvnodes
(usually a couple of thousand) to 25. The measured impact on cache-hits
doesn't justify spending memory this way:

Target number of free vnodes versus namecache hit rate in % during a
make world:
10 98.5316
200 98.5479
500 98.5546
1000 98.5709
3000 98.6006
4000 98.6126


# 00544193 24-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

A couple of handles to tweak, more statistics.


# 514ede09 16-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Fixed gratuitous ANSIisms.


# 7fab7799 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Provide a 'return true' poll vnode op rather than duplicating the
'do nothing' case all over the various filesystems.


# 557fe2c5 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

print correct function name in a panic (vop_nolock -> vop_sharedlock)


# 41fadeeb 07-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed yet more vestiges of config-time swap configuration and/or
cleaned up nearby cruft.


# a910e75c 07-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed vestiges of config-time "argument processing" configuration.


# fd9d9ff1 03-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Hmm, this is hopefully better.


# 7cb22688 03-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Revert the v_usecount handling in relation to VOP_INACTIVE.


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# a051452a 31-Aug-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Change the 0xdeadb hack to a flag called VDOOMED.
Introduce VFREE which indicates that vnode is on freelist.
Rename vholdrele() to vdrop().
Create vfree() and vbusy() to add/delete vnode from freelist.
Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0)
vnodes off the freelist.
Generalize vhold()/v_holdcnt to mean "do not recycle".
Fix reassignbuf()s lack of use of vhold().
Use vhold() instead of checking v_cache_src list.
Remove vtouch(), the vnodes are always vget'ed soon enough
after for it to have any measuable effect.
Add sysctl debug.freevnodes to keep track of things.
Move cache_purge() up in getnewvnodes to avoid race.
Decrement v_usecount after VOP_INACTIVE(), put a vhold() on
it during VOP_INACTIVE()
Unmacroize vhold()/vdrop()
Print out VDOOMED and VFREE flags (XXX: should use %b)

Reviewed by: dyson


# d3114049 26-Aug-1997 Bruce Evans <bde@FreeBSD.org>

Restored rev.1.92 which was clobbered by the previous commit.


# a5db4bf4 25-Aug-1997 John Dyson <dyson@FreeBSD.org>

Back out some incorrect changes that was worse than the original bug.


# 89721f6f 21-Aug-1997 John Dyson <dyson@FreeBSD.org>

This is a trial improvement for the vnode reference count while on the vnode
free list problem. Also, the vnode age flag is no longer used by the
vnode pager. (It is actually incorrect to use then.) Constructive
feedback welcome -- just be kind.


# b1037dcd 21-Aug-1997 Bruce Evans <bde@FreeBSD.org>

#include <machine/limits.h> explicitly in the few places that it is required.


# 57bf258e 16-Aug-1997 Garrett Wollman <wollman@FreeBSD.org>

Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.


# 23be6be8 04-Aug-1997 John Dyson <dyson@FreeBSD.org>

Fix a problem with the vfs vnode caching that it doesn't grow quickly
enough and can cause some strange performance problems. Specifically, at
or near startup time is when the problem is worst. To reproduce
the problem, run "lat_syscall stat" from the alpha lmbench code right
after bootup. A positive side effect of this mod is that the name
cache can be set to grow again by sysctl. A noticable positive
performance impact is realized due to a larger namecache being available
as needed (or tuned.)


# f6b4c285 17-Jul-1997 Doug Rabson <dfr@FreeBSD.org>

Merge WebNFS support from NetBSD

Obtained from: NetBSD


# 3c631446 21-Jun-1997 John Dyson <dyson@FreeBSD.org>

Remove a window during running down a file vnode. Also, the OBJ_DEAD
flag wasn't being respected during vref(), et. al. Note that this
isn't the eventual fix for the locking problem. Fine grained SMP
in the VM and VFS code will require (lots) more work.


# 2e58c0f8 09-Jun-1997 David Greenman <dg@FreeBSD.org>

Disabled the kern.vnode sysctl variable. It's causing system crashes on
large systems and needs to be re-thinked or removed wholesale.


# 8670684a 06-May-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a race condition that did, after all, exist.

Reviewed by: phk
Submitted by: dfr


# b15a966e 04-May-1997 Poul-Henning Kamp <phk@FreeBSD.org>

1. Add a {pointer, v_id} pair to the vnode to store the reference to the
".." vnode. This is cheaper storagewise than keeping it in the
namecache, and it makes more sense since it's a 1:1 mapping.

2. Also handle the case of "." more intelligently rather than stuff
the namecache with pointless entries.

3. Add two lists to the vnode and hang namecache entries which go from
or to this vnode. When cleaning a vnode, delete all namecache
entries it invalidates.

4. Never reuse namecache enties, malloc new ones when we need it, free
old ones when they die. No longer a hard limit on how many we can
have.

5. Remove the upper limit on namelength of namecache entries.

6. Make a global list for negative namecache entries, limit their number
to a sysctl'able (debug.ncnegfactor) fraction of the total namecache.
Currently the default fraction is 1/16th. (Suggestions for better
default wanted!)

7. Assign v_id correctly in the face of 32bit rollover.

8. Remove the LRU list for namecache entries, not needed. Remove the
#ifdef NCH_STATISTICS stuff, it's not needed either.

9. Use the vnode freelist as a true LRU list, also for namecache accesses.

10. Reuse vnodes more aggresively but also more selectively, if we can't
reuse, malloc a new one. There is no longer a hard limit on their
number, they grow to the point where we don't reuse potentially
usable vnodes. A vnode will not get recycled if still has pages in
core or if it is the source of namecache entries (Yes, this does
indeed work :-) "." and ".." are not namecache entries any longer...)

11. Do not overload the v_id field in namecache entries with whiteout
information, use a char sized flags field instead, so we can get
rid of the vpid and v_id fields from the namecache struct. Since
we're linked to the vnodes and purged when they're cleaned, we don't
have to check the v_id any more.

12. NFS knew about the limitation on name length in the namecache, it
shouldn't and doesn't now.

Bugs:
The namecache statistics no longer includes the hits for ".."
and "." hits.

Performance impact:
Generally in the +/- 0.5% for "normal" workstations, but
I hope this will allow the system to be selftuning over a
bigger range of "special" applications. The case where
RAM is available but unused for cache because we don't have
any vnodes should be gone.

Future work:
Straighten out the namecache statistics.

"desiredvnodes" is still used to (bogusly ?) size hash
tables in the filesystems.

I have still to find a way to safely free unused vnodes
back so their number can shrink when not needed.

There is a few uses of the v_id field left in the filesystems,
scheduled for demolition at a later time.

Maybe a one slot cache for unused namecache entries should
be implemented to decrease the malloc/free frequency.


# 82b8e119 29-Apr-1997 John Dyson <dyson@FreeBSD.org>

Staticize an unnecessarily global function: vputrele.
Submitted by: Michael Hancock <michaelh@cet.co.jp>


# 5f61c81d 25-Apr-1997 Peter Wemm <peter@FreeBSD.org>

copyin the export network mask to the correct variable.

Submitted by: Mike Hibler <mike@marker.cs.utah.edu>, PR#3380


# de15ef6a 04-Apr-1997 Doug Rabson <dfr@FreeBSD.org>

Add a function vop_sharedlock which a copy of vop_nolock without the
implementation #ifdef out. This can be used for now by NFS. As soon
as all the other filesystems' locking is fixed, this can go away.

Print the vnode address in vprint for easier debugging.


# 0f1adf65 01-Apr-1997 Bruce Evans <bde@FreeBSD.org>

Use OID_AUTO instead of magic number for the Lite2 sysctl debug.busyprt.

Removed declaration of vfs_unmountroot() again.

Staticized vgonel().


# 2f2160da 04-Mar-1997 David Greenman <dg@FreeBSD.org>

Fixed splbio problems in vinvalbuf. Closes PR#2875, although fixed
differently by me.


# 4a8b9660 04-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Attach vfs_sysctl() one level lower so that only the levels below
VFS_GENERIC aren't done in the FreeBSD way. The previous commit
broke the nfs sysctls.


# 3a76a594 02-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Merged Lite2's vfs_sysctl(). It doesn't fit very well into FreeBSD's
(phk's) sysctl framework, and I needed special code to disambiguate
the VFS_GENERIC node from the VFS_VFSCONF leaf, so I only converted
the leaves to the FreeBSD framework. The error handling isn't quite
right. CSRGS's sysctls seem to return ENOTDIR too much and FreeBSD's
sysctls don't agree with the man page.


# dc91a89e 02-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Restored some pre-Lite2-merge source-level compatibility to the mount()
and getvfsbyname() interfaces. The new interfaces are now hidden from
applications unless _NEW_VFSCONF is defined. The new vfsconf interfaces
don't work yet.


# a896f025 02-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Moved vfs sysctls to where Lite2 put them. No code changes yet.


# b98afd0d 27-Feb-1997 Bruce Evans <bde@FreeBSD.org>

Fixed Lite2 merge of spechash simplelocking. It was misplaced in
checkalias() and missing in vfinddev() and vcount().


# fd7f690f 26-Feb-1997 John Dyson <dyson@FreeBSD.org>

Fix the previous simple_lock fix breakage in the combined
vput/vrele routine. Fix a panic message. Fix the vop_nounlock
routine so that "special" filesystems that use it work correctly.


# 0d955f71 26-Feb-1997 John Dyson <dyson@FreeBSD.org>

Fix the simple_lock problem with the physical I/O buffer code, and
also fix the missing simple_unlock in vrele, and improve vrele/vput
by merging them into one routine. BDE pointed these problems out.


# 7c1557c4 26-Feb-1997 Bruce Evans <bde@FreeBSD.org>

Fixed unmounting of the root fs. vfs_unmountroot() wasn't fully updated
to do Lite2 locking and vfs_unmountall() wasn't as simple as the Lite2
version.


# c35e283a 25-Feb-1997 Bruce Evans <bde@FreeBSD.org>

Merged some missing locking from Lite2:
- getnewvnode() and vref() were missing one simple_unlock() each.
- the Lite2 locking changes weren't merged at all in
printlockedvnodes() or sysctl_vnode(). Merging these undid
some KNF style regressions.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 5131d64e 16-Jan-1997 Bruce Evans <bde@FreeBSD.org>

Removed option EXTRAVNODES. All versions of FreeBSD-2.x have a sysctl
variable `kern.maxvnodes' which gives much better control over vnode
allocation than EXTRAVNODES (except in -current between 1995/10/28 and
1996/11/12, kern.maxvnodes was read-only and thus useless).


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 8b612c4b 28-Dec-1996 John Dyson <dyson@FreeBSD.org>

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!


# b83ddf9c 12-Nov-1996 Bruce Evans <bde@FreeBSD.org>

Restored writability of kern.maxvnodes. It was broken a year ago in
rev.1.29 of kern_sysctl.c.

Should be in 2.2.


# 19060a3a 28-Oct-1996 Poul-Henning Kamp <phk@FreeBSD.org>

init_main.c: pass -d to init if DEVFS_ROOT
kern_conf.c: gd driver is a disk.
vfs_subr.c: include opt_devfs.h


# 0082fb46 17-Oct-1996 Jordan K. Hubbard <jkh@FreeBSD.org>

I'm not sure why, but Netcon's TFS filesystem code doesn't want to
add free vnodes back to the freelist. They must do their own vnode
management. Anyway, this change is *only* activated with their filesystem
and doesn't affect anyone else. Whoops, forgot the submitted-by lines
in my previous commits too.. :-(
Submitted-By: Tony Ardolino <tony@netcon.com>


# ad980522 16-Oct-1996 John Dyson <dyson@FreeBSD.org>

Clean up the rundown of the object backing a vnode. This should fix
NFS problems associated with forcible dismounts.


# a8f42fa9 27-Sep-1996 John Dyson <dyson@FreeBSD.org>

Correct vget by removing a window where a vnode can potentially go away.


# 030e2e9e 19-Sep-1996 Nate Williams <nate@FreeBSD.org>

In sys/time.h, struct timespec is defined as:

/*
* Structure defined by POSIX.4 to be like a timeval.
*/
struct timespec {
time_t ts_sec; /* seconds */
long ts_nsec; /* and nanoseconds */
};

The correct names of the fields are tv_sec and tv_nsec.

Reminded by: James Drobina <jdrobina@infinet.com>


# 6476c0d2 21-Aug-1996 John Dyson <dyson@FreeBSD.org>

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 619594e8 15-Aug-1996 John Dyson <dyson@FreeBSD.org>

Certain vnode buffer list operations were not being spl protected,
and they needed to be. Brelse for example can be called at interrupt
level, and the buffer list operations were not being protected from it.


# 8c2ff396 30-Jul-1996 Bruce Evans <bde@FreeBSD.org>

Only use the special bdevvp() for DEVFS if DEVFS_ROOT is defined. This
makes option DEVFS safe to use again (although mounting devfs is unsafe).


# e83cf165 24-Jul-1996 Poul-Henning Kamp <phk@FreeBSD.org>

DEVFS needs a special bdevvp().


# cba2a7c6 12-Jul-1996 Bruce Evans <bde@FreeBSD.org>

Staticized a few variables.

Fixed warnings about unused variables.


# 114a8cff 30-May-1996 Peter Wemm <peter@FreeBSD.org>

Add an option "EXTRA_VNODES" to cause an extra number of vnode structures
to be allocated at boot time. This is an expensive option, as they
consume physical ram and are not pageable etc. In certain situations,
this kind of option is quite useful, especially for news servers that
access a large number of directories at random and torture the name cache.
Defining 5000 or 10000 extra vnodes should cut down the amount of vnode
recycling somewhat, which should allow better name and directory caching
etc.

This is a "your mileage may vary" option, with no real indication of
what works best for your machine except trial and error. Too many will
cost you ram that you could otherwise use for disk buffers etc.

This is based on something John Dyson mentioned to me a while ago.


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# e5fadd05 08-Mar-1996 John Dyson <dyson@FreeBSD.org>

Put the "free vnode isn't" check back in the right place.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 0e41ee30 04-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Convert DDB to new-style option.


# 864ef7d1 02-Jan-1996 David Greenman <dg@FreeBSD.org>

Moved the #ifdef DIAGNOSTIC in vrele() so that the check for negative
v_usecount is always performed and only the call to vprint is conditional.


# 27a0b398 17-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Staticize.
Unstaticize a function in scsi/scsi_base that was used, with an undocumented
option.
My last count on the LINT kernel shows:
Total symbols: 3647
unref symbols: 463
undef symbols: 4
1 ref symbols: 1751
2 ref symbols: 485
Approaching the pain threshold now.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# efeaf95a 06-Dec-1995 David Greenman <dg@FreeBSD.org>

Untangled the vm.h include file spaghetti.


# 65d0bc13 06-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A couple of minor tweaks to the sysctl stuff.


# 98d93822 02-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed function declarations and/or added prototypes.


# 2d0b1d70 29-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

A test was backwards.
Noticed by: Cheng, Hsiao-Yang <sycheng@cis.ufl.edu>


# 4b2af45f 19-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Mega commit for sysctl.
Convert the remaining sysctl stuff to the new way of doing things.
the devconf stuff is the reason for the large number of files.
Cleaned up some compiler warnings while I were there.


# 986f4ce7 16-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Fixed support for DIAGNOSTIC option. SYSCTL_INT() depends on kernel.h.


# 395e6735 14-Nov-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Change some of the debug sysctl vars. The semantics of these will change.


# ceba6236 10-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Fixed type of vfs_free_netcred().
Removed redundant declaration of insmntque().


# f57e6547 09-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Introduced a type `vop_t' for vnode operation functions and used
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.

The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.


# 5e527f65 06-Nov-1995 John Dyson <dyson@FreeBSD.org>

This is a modification missed by me in the msync fixes a few days ago.


# e887950a 28-Oct-1995 Bruce Evans <bde@FreeBSD.org>

Call vfs_unbusy() before error returns from sysctl_vnode(). This fixes
PR 795.

Set the size before one error return from sysctl_vnode() the same as before
the other. The caller might want to know about the amount successfully
read although the current caller doesn't.


# 430179f0 25-Aug-1995 Bruce Evans <bde@FreeBSD.org>

Don't compile the diagnostic functions vhold() and holdrele() unless
DIAGNOSTIC is defined.


# 628641f8 11-Aug-1995 David Greenman <dg@FreeBSD.org>

Converted mountlist to a CIRCLEQ.

Partially obtained from: 4.4BSD-Lite2


# 24a1cce3 13-Jul-1995 David Greenman <dg@FreeBSD.org>

NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct
proc or any VM system structure will have to be rebuilt!!!

Much needed overhaul of the VM system. Included in this first round of
changes:

1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages,
haspage, and sync operations are supported. The haspage interface now
provides information about clusterability. All pager routines now take
struct vm_object's instead of "pagers".

2) Improved data structures. In the previous paradigm, there is constant
confusion caused by pagers being both a data structure ("allocate a
pager") and a collection of routines. The idea of a pager structure has
escentially been eliminated. Objects now have types, and this type is
used to index the appropriate pager. In most cases, items in the pager
structure were duplicated in the object data structure and thus were
unnecessary. In the few cases that remained, a un_pager structure union
was created in the object to contain these items.

3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now
be removed. For instance, vm_object_enter(), vm_object_lookup(),
vm_object_remove(), and the associated object hash list were some of the
things that were removed.

4) simple_lock's removed. Discussion with several people reveals that the
SMP locking primitives used in the VM system aren't likely the mechanism
that we'll be adopting. Even if it were, the locking that was in the code
was very inadequate and would have to be mostly re-done anyway. The
locking in a uni-processor kernel was a no-op but went a long way toward
making the code difficult to read and debug.

5) Places that attempted to kludge-up the fact that we don't have kernel
thread support have been fixed to reflect the reality that we are really
dealing with processes, not threads. The VM system didn't have complete
thread support, so the comments and mis-named routines were just wrong.
We now use tsleep and wakeup directly in the lock routines, for instance.

6) Where appropriate, the pagers have been improved, especially in the
pager_alloc routines. Most of the pager_allocs have been rewritten and
are now faster and easier to maintain.

7) The pagedaemon pageout clustering algorithm has been rewritten and
now tries harder to output an even number of pages before and after
the requested page. This is sort of the reverse of the ideal pagein
algorithm and should provide better overall performance.

8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup
have been removed. Some other unnecessary casts have also been removed.

9) Some almost useless debugging code removed.

10) Terminology of shadow objects vs. backing objects straightened out.
The fact that the vm_object data structure escentially had this
backwards really confused things. The use of "shadow" and "backing
object" throughout the code is now internally consistent and correct
in the Mach terminology.

11) Several minor bug fixes, including one in the vm daemon that caused
0 RSS objects to not get purged as intended.

12) A "default pager" has now been created which cleans up the transition
of objects to the "swap" type. The previous checks throughout the code
for swp->pg_data != NULL were really ugly. This change also provides
the rudiments for future backing of "anonymous" memory by something
other than the swap pager (via the vnode pager, for example), and it
allows the decision about which of these pagers to use to be made
dynamically (although will need some additional decision code to do
this, of course).

13) (dyson) MAP_COPY has been deprecated and the corresponding "copy
object" code has been removed. MAP_COPY was undocumented and non-
standard. It was furthermore broken in several ways which caused its
behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will
continue to work correctly, but via the slightly different semantics
of MAP_PRIVATE.

14) (dyson) Sharing maps have been removed. It's marginal usefulness in a
threads design can be worked around in other ways. Both #12 and #13
were done to simplify the code and improve readability and maintain-
ability. (As were most all of these changes)

TODO:

1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing
this will reduce the vnode pager to a mere fraction of its current size.

2) Rewrite vm_fault and the swap/vnode pagers to use the clustering
information provided by the new haspage pager interface. This will
substantially reduce the overhead by eliminating a large number of
VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be
improved to provide both a "behind" and "ahead" indication of
contiguousness.

3) Implement the extended features of pager_haspage in swap_pager_haspage().
It currently just says 0 pages ahead/behind.

4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps
via a much more general mechanism that could also be used for disk
striping of regular filesystems.

5) Do something to improve the architecture of vm_object_collapse(). The
fact that it makes calls into the swap pager and knows too much about
how the swap pager operates really bothers me. It also doesn't allow
for collapsing of non-swap pager objects ("unnamed" objects backed by
other pagers).


# e0dca2b9 07-Jul-1995 David Greenman <dg@FreeBSD.org>

Improve negative usecount diagnostic a little.


# aa2cabb9 27-Jun-1995 David Greenman <dg@FreeBSD.org>

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 6acceb40 27-Jun-1995 Bruce Evans <bde@FreeBSD.org>

Pass the correct nonblocking flag to VOP_CLOSE() in vclean().
VOP_CLOSE() takes `F' (file) flags, not `IO' flags. At least that's
what close() passes. I previously fixed ttylclose() to check
FNONBLOCK instead of IO_NDELAY. This broke the call from vclean()
and cleaning of ptys sometimes deadlocked.


# 61f5d510 21-May-1995 David Greenman <dg@FreeBSD.org>

Changes to fix the following bugs:

1) Files weren't properly synced on filesystems other than UFS. In some
cases, this lead to lost data. Most likely would be noticed on NFS.
The fix is to make the VM page sync/object_clean general rather than
in each filesystem.
2) Mixing regular and mmaped file I/O on NFS was very broken. It caused
chunks of files to end up as zeroes rather than the intended contents.
The fix was to fix several race conditions and to kludge up the
"b_dirtyoff" and "b_dirtyend" that NFS relies upon - paying attention
to page modifications that occurred via the mmapping.

Reviewed by: David Greenman
Submitted by: John Dyson


# 15f1b096 11-May-1995 David Greenman <dg@FreeBSD.org>

Increased ratio of allowed vnodes on freelist to 1/4th of the total. This
is more representative of worst case situations of 4 files/directory. (If
that last sentence doesn't make any sense, I'm not surprised. It's rather
compilcated how this all fits together....).
This should fix a problem that Ed Hudson has been complaining about where
directories with lots of symlinks could cause excessive disk I/O.


# 1a477b0c 16-Apr-1995 David Greenman <dg@FreeBSD.org>

Changed #ifdef around printlockedvnodes() from DEBUG to DDB.


# 213fd1b6 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
call vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# b9461930 20-Mar-1995 David Greenman <dg@FreeBSD.org>

Fixed vinvalbuf() to work like NFS wants it to. The previous code wouldn't
flush pages in the vm object if V_SAVE was true.


# 62b71ed6 20-Mar-1995 David Greenman <dg@FreeBSD.org>

Don't gain/lose a reference to the object when yanking its pages in
vinvalbuf()...it will cause vnode locking problems in vm_object_terminate,
and isn't necessary anyway.


# ff769afc 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Don't attempt to sync pages in the V_SAVE case of vinvalbuf; doing so can
lead to a deadlock. Just let the VM system deal with it.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 3d2a8cf3 11-Mar-1995 David Greenman <dg@FreeBSD.org>

Added a comment.


# f8a0b2dd 10-Mar-1995 David Greenman <dg@FreeBSD.org>

Reorganized an if() expression for efficiency.


# fbd6e6c9 09-Mar-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Clean up and improve the namecache.

1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
Clean up and improve the namecache.

1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
gets forced out of the VM/cache. The latter is not numerous enough to keep
the pool of vnodes needed for the namecache sufficiently big.

2. Purge invalid entries in the namecache as soon as we notice them. This
avoids a stale entry pushing out a valid entry on the LRU list.

3. Speed up the lookup in the namecache by avoid a special case branch.

4. Make the cache purge routines do the thing they're supposed to, and in
a decently efficient manner.

5. Make the size of the namecache follow the number of vnodes, so that we
can always point to all the vnodes we have in core.

6. Readability has gone way up.

7. Added a "options NCH_STATISTICS" feature that will gather more
detailed statistics on the performance of the namecache.

Reviewed by: davidg

(cvs is dumping core on me :-( )


# acc835fd 07-Mar-1995 David Greenman <dg@FreeBSD.org>

Put VAGE vnodes at the head of the free list.


# 7500ed1d 27-Feb-1995 David Greenman <dg@FreeBSD.org>

Backed out previous change. I forgot (for about the fourth time) that
v_rdev is a #define which is dereferenced through v_specinfo->si_rdev,
and that isn't initialized until later in checkalias().


# f9ceb7c7 26-Feb-1995 David Greenman <dg@FreeBSD.org>

Initialize v_rdev in getnewvnode() - it appears that some filesystems
may not properly initialize this field in all cases, and this would
result in very anti-social behavior (overwriting on some other random
device/location).

Submitted by: John Dyson


# a3a8bb29 22-Feb-1995 David Greenman <dg@FreeBSD.org>

vfs_cluster.c:
Various more tweaks from John Dyson to improve read ahead calculations.

vfs_subr.c:
Only wakeup if numoutput is 0 in vwakeup().

Submitted by: John Dyson


# 480dff54 10-Jan-1995 David Greenman <dg@FreeBSD.org>

Fixed some formatting weirdness that I overlooked in the previous commit.


# 0d94caff 09-Jan-1995 David Greenman <dg@FreeBSD.org>

These changes embody the support of the fully coherent merged VM buffer cache,
much higher filesystem I/O performance, and much better paging performance. It
represents the culmination of over 6 months of R&D.

The majority of the merged VM/cache work is by John Dyson.

The following highlights the most significant changes. Additionally, there are
(mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to
support the new VM/buffer scheme.

vfs_bio.c:
Significant rewrite of most of vfs_bio to support the merged VM buffer cache
scheme. The scheme is almost fully compatible with the old filesystem
interface. Significant improvement in the number of opportunities for write
clustering.

vfs_cluster.c, vfs_subr.c
Upgrade and performance enhancements in vfs layer code to support merged
VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff.

vm_object.c:
Yet more improvements in the collapse code. Elimination of some windows that
can cause list corruption.

vm_pageout.c:
Fixed it, it really works better now. Somehow in 2.0, some "enhancements"
broke the code. This code has been reworked from the ground-up.

vm_fault.c, vm_page.c, pmap.c, vm_object.c
Support for small-block filesystems with merged VM/buffer cache scheme.

pmap.c vm_map.c
Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of
kernel PTs.

vm_glue.c
Much simpler and more effective swapping code. No more gratuitous swapping.

proc.h
Fixed the problem that the p_lock flag was not being cleared on a fork.

swap_pager.c, vnode_pager.c
Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the
code doesn't need it anymore.

machdep.c
Changes to better support the parameter values for the merged VM/buffer cache
scheme.

machdep.c, kern_exec.c, vm_glue.c
Implemented a seperate submap for temporary exec string space and another one
to contain process upages. This eliminates all map fragmentation problems
that previously existed.

ffs_inode.c, ufs_inode.c, ufs_readwrite.c
Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on
busy buffers.

Submitted by: John Dyson and David Greenman


# 602d2b48 22-Dec-1994 David Greenman <dg@FreeBSD.org>

Protect vnode buffer chain manipulation with splbio to prevent list
corruption..


# 82478919 06-Oct-1994 David Greenman <dg@FreeBSD.org>

Use tsleep() rather than sleep so that 'ps' is more informative about
the wait.


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# 797f2d22 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# bb56ec4a 25-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# 1cdeb653 29-Aug-1994 David Greenman <dg@FreeBSD.org>

"bogus" fixes from 1.1.5 to work around some cache coherency problems.


# e0c02154 23-Aug-1994 David Greenman <dg@FreeBSD.org>

Initialized v_writecount.


# 4f5a3fef 22-Aug-1994 David Greenman <dg@FreeBSD.org>

print "BUSY" instead of error number if filesystem was busy during
vfs_unmountall() - this is the most common case. If it was a different
error, then print the error number.


# e0e9c421 20-Aug-1994 David Greenman <dg@FreeBSD.org>

Implemented filesystem clean bit via:

machdep.c:
Changed printf's a little and call vfs_unmountall() if the sync was
successful.

cd9660_vfsops.c, ffs_vfsops.c, nfs_vfsops.c, lfs_vfsops.c:
Allow dismount of root FS. It is now disallowed at a higher level.

vfs_conf.c:
Removed unused rootfs global.

vfs_subr.c:
Added new routines vfs_unmountall and vfs_unmountroot. Filesystems
are now dismounted if the machine is properly rebooted.

ffs_vfsops.c:
Toggle clean bit at the appropriate places. Print warning if an
unclean FS is mounted.

ffs_vfsops.c, lfs_vfsops.c:
Fix bug in selecting proper flags for VOP_CLOSE().

vfs_syscalls.c:
Disallow dismounting root FS via umount syscall.


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources