History log of /freebsd-current/sys/sys/vnode.h
Revision Date Author Comments
# 56a8aca8 18-May-2024 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Stop treating size 0 as unknown size in vnode_create_vobject().

Whenever file is created, the vnode_create_vobject() function will
try to determine its size by calling vn_getsize_locked() as size 0
is ambigious: it means either the file size is 0 or the file size
is unknown.

Introduce special value for the size argument: VNODE_NO_SIZE.
Only when it is given, the vnode_create_vobject() will try to obtain
file's size on its own.

Introduce dedicated vnode_disk_create_vobject() for use by
g_vfs_open(), so we don't have to call vn_isdisk() in the common case
(for regular files).

Handle the case of mediasize==0 in g_vfs_open().

Reviewed by: alc, kib, markj, olce
Approved by: oshogbo (mentor), allanjude (mentor)
Differential Revision: https://reviews.freebsd.org/D45244


# f04220c1 19-Jan-2024 Konstantin Belousov <kib@FreeBSD.org>

kcmp(2): implement for vnode files

Reviewed by: brooks, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D43518


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 2ff63af9 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .h pattern

Remove /^\s*\*+\s*\$FreeBSD\$.*$\n/


# 9c3bfe2a 10-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

Revert "VFS: Remove VV_READLINK flag" and "fdescfs: improve linrdlnk mount option"

This reverts commits 4a402dfe0bc44770c9eac6e58a501e4805e29413 and
3bffa2262328e4ff1737516f176107f607e7bc76.

The fix will be implemented in somewhat different manner. The semantic
adjustment is incompatible with linuxolator expectations.

Reported and reviewed by: dchagin
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40969


# ba8cc6d7 12-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use __enum_uint8 for vtype and vstate

This whacks hackery around only reading v_type once.

Bump __FreeBSD_version to 1400093


# 9def8ea6 21-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: list enums on separate lines

Requested by: kib


# 4a402dfe 21-Jun-2023 Konstantin Belousov <kib@FreeBSD.org>

VFS: Remove VV_READLINK flag

since its only reason to exist is removed.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D40700


# 2544b8e0 28-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: Rename vfs_emptydir() to vn_dir_check_empty()

No functional change. While here, adapt comments to style(9).

Reviewed by: kib
MFC after: 1 week


# 3d8450db 24-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: vn_dir_next_dirent(): Simplify interface and harden

Simplify the old interface (one less argument, simpler termination test)
and add documentation about it. Add more sanity checks (mostly under
INVARIANTS, but also in the general case to prevent infinite
loops). Drop the explicit test on minimum directory entry size (without
INVARIANTS).

Deal with the impacts in callers (dirent_exists() and vop_stdvptocnp()).
dirent_exists() has been simplified a bit, preserving the exact same
semantics but for the return code whose meaning has been reversed (0 now
means the entry exists, ENOENT that it doesn't and other values are
genuine errors). While here, suppress gratuitous casts of malloc return
values.

vn_dir_next_dirent() has been tested by a 'make -j4 buildkernel' with a
temporary modification to the VFS cache causing vn_vptocnp() to always
call VOP_VPTOCNP() and finally vop_stdvptocnp() (observed with temporary
debug counters).

Export new _GENERIC_MINDIRSIZ and _GENERIC_MAXDIRSIZ on __BSD_VISIBLE,
and GENERIC_MINDIRSIZ and GENERIC_MAXDIRSIZ on _KERNEL.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39764


# 6bce3f23 23-Apr-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: Export get_next_dirent() as vn_dir_next_dirent()

Move internal-to-'vfs_default.c' get_next_dirent() to 'vfs_vnops.c' and
export it for use by other parts of the VFS. This is a preparatory
change for using it in vfs_emptydir().

No functional change.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D39755


# 7b6fe242 08-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

DEBUG_VFS_LOCKS: use witness if available

The assert_vop_locked messages are ignored, and file/line information
is not too useful. Fixing this without changing both witness and VFS
asserts KPIs is not possible.

Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39464


# bb24eaea 05-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_lock_pair(): allow to request shared locking

If either of vnodes is shared locked, lock must not be recursed.

Requested by: rmacklem
Reviewed by: markj, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39444


# 26b96487 07-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: more informative panic for missing fplookup ops


# 5f6df177 03-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: validate that vop vectors provide all or none fplookup vops

In order to prevent later susprises.


# 62a573d9 16-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire KERN_VNODE

It got disabled in 2003:

commit acb18acfec97aa7fe26ff48f80a5c3f89c9b542d
Author: Poul-Henning Kamp <phk@FreeBSD.org>
Date: Sun Feb 23 18:09:05 2003 +0000

Bracket the kern.vnode sysctl in #ifdef notyet because it results
in massive locking issues on diskless systems.

It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.

There does not seem to be practical use for it and the disabled routine
does not work anyway.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D39127


# f45feecf 22-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vn_getsize

getattr is very expensive and in important cases only gets called to get
the size. This can be optimized with a dedicated routine which obtains
that statistic.

As a step towards that goal make size-only consumers use a dedicated
routine.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D37885


# 829f0bcb 19-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add the concept of vnode state transitions

To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>

As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).

Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).

The new field lands in a 1-byte hole, thus it does not grow the struct.

Bump __FreeBSD_version to 1400077

Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759


# 94267fc9 22-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use designated initializers for the typename array

While here prefix with v for better consistency with the vnode stuff.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D37759


# 78d35459 02-Dec-2022 Doug Rabson <dfr@FreeBSD.org>

Add vn_path_to_global_path_hardlink

This is similar to vn_path_to_global_path but allows for regular files
which may not be present in the cache.

Reviewed by: mjg, kib
Tested by: pho


# 080ef8a4 04-Aug-2022 Jason A. Harmening <jah@FreeBSD.org>

Add VV_CROSSLOCK vnode flag to avoid cross-mount lookup LOR

When a lookup operation crosses into a new mountpoint, the mountpoint
must first be busied before the root vnode can be locked. When a
filesystem is unmounted, the vnode covered by the mountpoint must
first be locked, and then the busy count for the mountpoint drained.
Ordinarily, these two operations work fine if executed concurrently,
but with a stacked filesystem the root vnode may in fact use the
same lock as the covered vnode. By design, this will always be
the case for unionfs (with either the upper or lower root vnode
depending on mount options), and can also be the case for nullfs
if the target and mount point are the same (which admittedly is
very unlikely in practice).

In this case, we have LOR. The lookup path holds the mountpoint
busy while waiting on what is effectively the covered vnode lock,
while a concurrent unmount holds the covered vnode lock and waits
for the mountpoint's busy count to drain.

Attempt to resolve this LOR by allowing the stacked filesystem
to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary.
Upon observing this flag, the vfs_lookup() will leave the covered
vnode lock held while crossing into the mountpoint. Employ this flag
for unionfs with the caveat that it can't be used for '-o below' mounts
until other unionfs locking issues are resolved.

Reported by: pho
Tested by: pho
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35054


# d653aaec 24-Oct-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_assert_no_entries


# 1b4b7517 18-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

Add vn_rlimit_fsizex() and vn_rlimit_fsizex_res()

The vn_rlimit_fsizex() function:
- checks that the write does not exceed RLIMIT_FSIZE limit and fs
maximum supported file size
- truncates write length if it exceeds the RLIMIT_FSIZE or max file size,
but there are some bytes to write
- sends SIGXFSZ if RLIMIT_FSIZE would be exceed otherwise

POSIX mandates the truncated write in case when some bytes can be
written but whole write request fails the RLIMIT_FSIZE check.

The function is supposed to be used from VOP_WRITE()s. Due to
pecularity in the VFS generic write syscall layer, uio_resid must
correctly reflect the written amount (noted by markj). Provide the dual
vn_rlimit_fsizex_res() function to correct uio_resid after the clamp
done in vn_rlimit_fsizex() on VOP_WRITE() return.

PR: 164793
Reviewed by: asomers, jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36625


# 2ac083f6 18-Sep-2022 Konstantin Belousov <kib@FreeBSD.org>

Add vn_rlimit_trunc()

Reviewed by: asomers, jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36625


# fa3eb3c9 18-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: indent V_VALID_FLAGS with a tab

Requested by: kib


# a75d1ddd 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce V_PCATCH to stop abusing PCATCH


# a755fb92 10-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the V_MNTREF flag

Reviewed by: kib, mckusick
Differential Revision: https://reviews.freebsd.org/D36521


# c6487446 11-Apr-2022 Dmitry Chagin <dchagin@FreeBSD.org>

getdirentries: return ENOENT for unlinked but still open directory.

To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”).

Reviewed by: mjg, Pau Amma (doc)
Differential revision: https://reviews.freebsd.org/D34680
MFC after: 2 weeks


# b7262756 02-Apr-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fixup WANTIOCTLCAPS on open

In some cases vn_open_cred overwrites cn_flags, effectively nullifying
initialisation done in NDINIT. This will have to be fixed.

In the meantime make sure the flag is passed.

Reported by: jenkins
Noted by: Mathieu <sigsys@gmail.com>


# 66b177e1 12-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: reduce spurious zeroing in VOP_STAT

clang fails to take advantage of the fact that majority of the struct
gets written to in the routine and decides to bzero the entire thing.

Explicitly zero padding and spare fields, relying on KMSAN to catch
problems should anything pop up later which also needs explicit
zeroing.

fstat on tmpfs (ops/s):
before: 8216636
after: 8508033


# 381dd12c 12-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop evaluating the argument multpile times in stat macros


# 66c5fbca 27-Jan-2022 Konstantin Belousov <kib@FreeBSD.org>

insmntque1(): remove useless arguments

Also remove once-used functions to clean up after failed insmntque1(),
which were destructor callbacks in previous life.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D34071


# 2a7e4cf8 27-Jan-2022 Mateusz Guzik <mjg@FreeBSD.org>

Revert b58ca5df0bb7 ("vfs: remove the now unused insmntque1")

I was somehow convinced that insmntque calls insmntque1 with a NULL
destructor. Unfortunately this worked well enough to not immediately
blow up in simple testing.

Keep not using the destructor in previously patched filesystems though
as it avoids unnecessary casts.

Noted by: kib
Reported by: pho


# b58ca5df 26-Jan-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the now unused insmntque1

Bump __FreeBSD_version to 1400052.


# 4dd23ae1 10-Dec-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire MNTK_NOKNOTE and VV_NOKNOTE

MNTK_NOKNOTE was introduced in 679985d03a64f5dfb4355538ae6e3b70f8347f38
(dated 2005), VV_NOKNOTE in 34cc826ae8999f454dd6cb9c77d17ce83b169f92 few
months later.

Neither was ever used by anything in the tree.


# 4dcdf398 17-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace the MNTK_TEXT_REFS flag with VIRF_TEXT_REF

This allows to stop maintaing the VI_TEXT_REF flag and consequently
opens up fully lockless v_writecount adjustment.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D33127


# 3ffcfa59 26-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vop_stdadd_writecount_nomsync

This avoids needing to inspect the mount point every time.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D33125


# 6eefabd4 22-Nov-2021 Brooks Davis <brooks@FreeBSD.org>

syscalls: improve nstat, nfstat, nlstat

Optionally return errors when truncating dev_t, ino_t, and nlink_t.
In the interest of code reuse, use freebsd11_cvtstat() to perform the
truncation and error handling and then convert the resulting struct
freebsd11_stat to struct nstat.

Add missing freebsd32 compat syscalls. These syscalls require
translation because struct nstat contains four instances of struct
timespec which in turn contains a time_t and a long.

Reviewed by: kib


# d28af1ab 15-Nov-2021 Mark Johnston <markj@FreeBSD.org>

vm: Add a mode to vm_object_page_remove() which skips invalid pages

This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().

Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.

PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32931


# 47b248ac 03-Nov-2021 Konstantin Belousov <kib@FreeBSD.org>

Make locking assertions for VOP_FSYNC() and VOP_FDATASYNC() more correct

For devfs vnodes, it is fine to not lock vnodes for VOP_FSYNC().
Otherwise vnode must be locked exclusively, except for MNT_SHARED_WRITES()
where the shared lock is enough.

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761


# 9a0bee9f 22-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Make vn_fullpath_hardlink() externally callable

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 2b68eb8e 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove thread argument from VOP_STAT

and fo_stat.


# c5128c48 07-Sep-2021 Rick Macklem <rmacklem@FreeBSD.org>

VOP_COPY_FILE_RANGE: Add a COPY_FILE_RANGE_TIMEO1SEC flag

Although it is not specified in the RFCs, the concept that
the NFSv4 server should reply to an RPC request within a
reasonable time is accepted practice within the NFSv4 community.

Without this patch, the NFSv4.2 server attempts to reply to
a Copy operation within 1second by limiting the copy to
vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at
best, given the large variation in I/O subsystem performance.

This patch adds a kernel only flag COPY_FILE_RANGE_TIMEO1SEC
that the NFSv4.2 can specify, which tells VOP_COPY_FILE_RANGE()
to return after approximately 1 second with a partial result and
implements this in vn_generic_copy_file_range(), used by
vop_stdcopyfilerange().

Modifying the NFSv4.2 server to set this flag will be done in
a separate patch. Also under consideration is exposing the
COPY_FILE_RANGE_TIMEO1SEC to userland for use on the FreeBSD
copy_file_range(2) syscall.

MFC after: 2 weeks
Reviewed by: khng
Differential Revision: https://reviews.freebsd.org/D31829


# da779f26 27-Aug-2021 Rick Macklem <rmacklem@FreeBSD.org>

vfs_default: Change vop_stddeallocate() from static to global

A future commit to the NFS client uses vop_stddeallocate() for
cases where the NFS server does not support a Deallocate operation.
Change vop_stddeallocate() from static to global so that it can
be called by the NFS client.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31640


# 0dc332bf 05-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).

fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347


# abbb57d5 04-Aug-2021 Ka Ho Ng <khng@FreeBSD.org>

vfs: Introduce vn_bmap_seekhole_locked()

vn_bmap_seekhole_locked() is factored out version of vn_bmap_seekhole().
This variant requires shared vnode lock being held around the call.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31404


# 0ef5eee9 03-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

Add vn_lktype_write()

and remove repetetive code that calculates vnode locking type for write.

Reviewed by: khng, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31405


# 844aa31c 08-Jul-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_enter_time_flags


# 3cf75ca2 28-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire unused vn_seqc_write_begin_unheld*


# cf74b2be 22-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the now unused vnlru_free routine


# 72b3b5a9 08-Apr-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace vfs_smr_quiesce with vfs_smr_synchronize

This ends up using a smr specific method.

Suggested by: markj
Tested by: pho


# 3f56bc79 30-Mar-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vfs_smr_quiesce

This can be used to observe all CPUs not executing while within
vfs_smr_enter.


# e9272225 17-Mar-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix vnlru marker handling for filtered/unfiltered cases

The global list has a marker with an invariant that free vnodes are
placed somewhere past that. A caller which performs filtering (like ZFS)
can move said marker all the way to the end, across free vnodes which
don't match. Then a caller which does not perform filtering will fail to
find them. This makes vn_alloc_hard sleep for 1 second instead of
reclaiming, resulting in significant stalls.

Fix the problem by requiring an explicit marker by callers which do
filtering.

As a temporary measure extend vnlru_free to restart if it fails to
reclaim anything.

Big thanks go to the reporter for testing several iterations of the
patch.

Reported by: Yamagi <lists yamagi.org>
Tested by: Yamagi <lists yamagi.org>
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D29324


# 2443068d 21-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: shrink struct vnode to 448 bytes on LP64

... by moving v_hash into a 4 byte hole.

Combined with several previous size reductions this makes the size small
enough to fit 9 vnodes per page as opposed to 8.

Add a compilation time assert so that this is not unknowingly worsened.

Note the structure still remains bigger than it should be.


# 2bfd8992 14-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

vnode: move write cluster support data to inodes.

The data is only needed by filesystems that
1. use buffer cache
2. utilize clustering write support.

Requested by: mjg
Reviewed by: asomers (previous version), fsu (ext2 parts), mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28679


# fa3bd463 29-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

lockf: ensure atomicity of lockf for open(O_CREAT|O_EXCL|O_EXLOCK)

or EX_SHLOCK. Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.

Requested by: mckusick
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28697


# b59a8e63 30-Jan-2021 Konstantin Belousov <kib@FreeBSD.org>

Stop ignoring ERELOOKUP from VOP_INACTIVE()

When possible, relock the vnode and retry inactivation. Only vunref() is
required not to drop the vnode lock, so handle it specially by not retrying.

This is a part of the efforts to ensure that unlinked not referenced vnode
does not prevent inode from reusing.

Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 8d2a230e 25-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use atomic_load_consume_ptr in vn_load_v_data_smr


# 739ecbcf 23-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add symlink support to lockless lookup

Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D27488


# a5e28403 07-Jan-2021 Thomas Munro <tmunro@FreeBSD.org>

open(2): Add O_DSYNC flag.

POSIX O_DSYNC means that writes include an implicit fdatasync(2), just
as O_SYNC implies fsync(2).

VOP_WRITE() functions that understand the new IO_DATASYNC flag can act
accordingly, but we'll still pass down IO_SYNC so that file systems that
don't understand it will continue to provide the stronger O_SYNC
behaviour.

Flag also applies to fcntl(2).

Reviewed by: kib, delphij
Differential Revision: https://reviews.freebsd.org/D25090


# c6d3272b 05-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vn_seqc_read_notmodify


# 33f3e81d 01-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: combine fast path enabled status into one flag

Tested by: pho


# 82397d79 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: denote vnode being a mount point with VIRF_MOUNTPOINT

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27794


# 3e506a67 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add v_irflag accessors

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27793


# 3b1f974b 26-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Make max ticks for pause in vn_lock_pair() adjustable at runtime.

Reduce default value from hz / 10 to hz / 100.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation


# 7cde2ec4 13-Nov-2020 Konstantin Belousov <kib@FreeBSD.org>

Implement vn_lock_pair().

In collaboration with: pho
Reviewed by: mckusick (previous version), markj (previous version)
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136


# 4bfebc8d 30-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename


# c7520caa 22-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prevent avoidable evictions on mkdir of existing directories

mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.

The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.

For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.

Tested by: pho
Discussed with: kib


# 8ecd87a3 20-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop spurious cred argument from VOP_VPTOCNP


# 214eccf4 14-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add VOP_EAGAIN

Can be used to stub fplookup for example.


# a3d9bf49 23-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the force flag from purgevfs

The optional scan is wasteful, thus it is removed altogether from unmount.

Callers which always want it anyway remain unaffected.


# 3c484f32 15-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Convert page cache read to VOP.

There are several negative side-effects of not calling into VOP layer
at all for page cache reads. The biggest is the missed activation of
EVFILT_READ knotes.

Also, it allows filesystem to make more fine grained decision to
refuse read from page cache.

Keep VIRF_PGREAD flag around, it is still useful for nullfs, and for
asserts.

Reviewed by: markj
Tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346


# 88863665 15-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

vfs_subr.c: export io_hold_cnt and vn_read_from_obj().

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346


# b1a824b6 02-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire vholdl as a symbol

Similarly to vrefl in r364283.


# f6e54eb3 01-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

sys: clean up empty lines in .c and .h files


# feabaaf9 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the always curthread argument from reverse lookup routines

Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by: pho


# 39f88150 20-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_rename, a dedicated helper to use for renames

While here make both tmpfs and ufs use it.

No fuctional changes.


# 7ad2a82d 18-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error

Most consumers pass NULL.


# fbca789f 16-Aug-2020 Konstantin Belousov <kib@FreeBSD.org>

VMIO read

If possible, i.e. if the requested range is resident valid in the vm
object queue, and some secondary conditions hold, copy data for
read(2) directly from the valid cached pages, avoiding vnode lock and
instantiating buffers. I intentionally do not start read-ahead, nor
handle the advises on the cached range.

Filesystems indicate support for VMIO reads by setting VIRF_PGREAD
flag, which must not be cleared until vnode reclamation.

Currently only filesystems that use vnode pager for v_objects can
enable it, due to reliance on vnp_size. There is a WIP to handle it
for tmpfs.

Reviewed by: markj
Discussed with: jeff
Tested by: pho
Benchmarked by: mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D25968


# 60414088 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire vrefl as a symbol

vrefl calls vref and there is only one in-tree consumer.

Keep it as a macro for assertion purposes.


# a92a971b 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the thread argument from vget

It was already asserted to be curthread.

Semantic patch:

@@

expression arg1, arg2, arg3;

@@

- vget(arg1, arg2, arg3)
+ vget(arg1, arg2)


# 36f47512 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: inline vrefcnt


# 4c2d103a 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: garbage collect vrefactn


# 3b444436 11-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

devfs: rework si_usecount to track opens

This removes a lot of special casing from the VFS layer.

Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D25612


# 51ea7bea 07-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add VOP_STAT

The current scheme of calling VOP_GETATTR adds avoidable overhead.

An example with tmpfs doing fstat (ops/s):
before: 7488958
after: 7913833

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D25910


# d292b194 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the obsolete privused argument from vaccess

This brings argument count down to 6, which is passable without the
stack on amd64.


# db99ec56 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: support lockless dotdot lookup

Tested by: pho


# 6e10434c 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_purge_vgone

cache_purge locklessly checks whether the vnode at hand has any namecache
entries. This can race with a concurrent purge which managed to remove
the last entry, but may not be done touching the vnode.

Make sure we observe the relevant vnode lock as not taken before proceeding
with vgone.

Paired with the fact that doomed vnodes cannnot receive entries this restores
the invariant that there are no namecache-related writing users past cache_purge
in vgone.

Reported by: pho


# b145e389 02-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: shorten v_iflag and v_vflag

While here renumber VI_* flags to remove the gaps.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D25921


# 838984de 02-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: move namecache initialisation into cache_vnode_init


# 848f8eff 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: inline vops if there are no pre/post associated calls

This removes a level of indirection from frequently used methods, most notably
VOP_LOCK1 and VOP_UNLOCK1.

Tested by: pho


# 07d2145a 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add the infrastructure for lockless lookup

Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25577


# 0379ff6a 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce vnode sequence counters

Modified on each permission change and link/unlink.

Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25573


# f8022be3 30-Jun-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: protect vnodes with smr

vget_prep_smr and vhold_smr can be used to ref a vnode while within vfs_smr
section, allowing consumers to get away without locking.

See vhold_smr and vdropl for comments explaining caveats.

Reviewed by: kib
Testec by: pho
Differential Revision: https://reviews.freebsd.org/D23913


# 2782c00c 22-Feb-2020 Ryan Libby <rlibby@FreeBSD.org>

vfs: quiet -Wwrite-strings

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23797


# 6a5abb1e 02-Feb-2020 Kyle Evans <kevans@FreeBSD.org>

Provide O_SEARCH

O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247


# 6698e11f 02-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the now empty vop_unlock_post


# 10a15df6 02-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the never set VDESC_VPP_WILLRELE flag


# 7739d927 01-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: replace kern___getcwd with vn_getcwd

The previous routine was resulting in extra data copies most notably in
linux_getcwd.


# 45757984 01-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: consistently use size_t for buflen around VOP_VPTOCNP


# 643656cf 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace VOP_MARKATIME with VOP_MMAPPED

The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23422


# 21c4f104 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vrefactn

Differential Revision: https://reviews.freebsd.org/D23427


# 3cfabd81 30-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the never set VDESC_NOMAP_VPP flag


# 0c236d3d 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: per-cpu batched requeuing of free vnodes

Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.

Per-cpu areas are locked in order to synchronize against UMA freeing
memory.

vnode's v_mflag is converted to short to prevent the struct from
growing.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998


# cc3593fb 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: rework vnode list management

The current notion of an active vnode is eliminated.

Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.

Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997


# 57083d25 12-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add per-mount vnode lazy list and use it for deferred inactive + msync

This obviates the need to scan the entire active list looking for vnodes
of interest.

msync is handled by adding all vnodes with write count to the lazy list.

deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.

Vnodes get dequeued from the list when their hold count reaches 0.

Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22995


# b52d50cf 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prealloc vnodes in getnewvnode_reserve

Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.


# 69283067 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: incomplete pass at converting more ints to u_long

Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.


# c8b3463d 07-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT)

The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.

Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036


# 478368ca 06-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: eliminate v_tag from struct vnode

There was only one consumer and it was using it incorrectly.

It is given an equivalent hack.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23037


# 8dbc6352 04-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop thread argument from vinactive


# 1cde9e38 04-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: predict VN_IS_DOOMED as false

The macro is used everywhere.


# 952f5953 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove CTASSERT from VOP_UNLOCK_FLAGS

gcc does not like it and it's not worth working around just for that
compiler.


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 3d59b89c 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add VOP_UNLOCK_FLAGS

The flags argument from VOP_UNLOCK is about to be removed and some
filesystems unlock the interlock as a convienience with it.

Add a helper to retain the behavior for the few cases it is needed.


# c400abe5 16-Dec-2019 Li-Wen Hsu <lwhsu@FreeBSD.org>

Fix gcc build after r355790

Sponsored by: The FreeBSD Foundation


# 6fa079fc 15-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: flatten vop vectors

This eliminates the following loop from all VOP calls:

while(vop != NULL && \
vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
vop = vop->vop_default;

Reviewed by: jeff
Tesetd by: pho
Differential Revision: https://reviews.freebsd.org/D22738


# ea9a16b2 12-Dec-2019 Rick Macklem <rmacklem@FreeBSD.org>

r355677 requires that vop_stdioctl() be global so it can be called from NFS.

r355677 modified the NFS client so that it does lseek(SEEK_DATA/SEEK_HOLE)
for NFSv4.2, but calls vop_stdioctl() otherwise. As such, vop_stdioctl()
needs to be a global function.

Missed during the code merge for r355677.


# c8b29d12 11-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: locking primitives which elide ->v_vnlock and shared locking disablement

Both of these features are not needed by many consumers and result in avoidable
reads which in turn puts them on profiles due to cache-line ping ponging.

On top of that the current lockgmr entry point is slower than necessary
single-threaded. As an attempted clean up preparing for other changes,
provide new routines which don't support any of the aforementioned features.

With these patches in place vop_stdlock and vop_stdunlock disappear from
flamegraphs during -j 104 buildkernel.

Reviewed by: jeff (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22665


# ff4486e8 09-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: refactor vhold and vdrop

No fuctional changes.


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# 89c4c2e5 30-Nov-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: swap placement between v_type and v_tag

The former is frequently accessed (e.g., in vfs_cache_lookup) and shares the
cacheline with v_usecount, avoidably adding to cache misses during concurrent
lookup. The latter is almost unused and probably can get garbage-collected.

The struct does not change in size despite enum vs char * discrepancy.
On 64-bit archs there used to be 4 bytes padding after v_type giving 480 bytes
in total.


# fdc6b10d 29-Nov-2019 Konstantin Belousov <kib@FreeBSD.org>

Add a VN_OPEN_INVFS flag.

vn_open_cred() assumes that it is called from the top-level of a VFS
syscall. Writers must call bwillwrite() before locking any VFS
resource to wait for cleanup of dirty buffers.

ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which
results in wait for unrelated buffers while owning ZFS vnode lock (and
ZFS does not use buffer cache). VN_OPEN_INVFS allows caller to skip
bwillwrite.

Note that ZFS is still incorrect there, because it starts write on an
mp and locks a vnode while holding another vnode lock.

Reported by: Willem Jan Withagen <wjw@digiware.nl>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 208b81bb 22-Oct-2019 Konstantin Belousov <kib@FreeBSD.org>

Add VV_VMSIZEVNLOCK flag.

The flag specifies that vm_fault() handler should check the vnode'
vm_object size under the vnode lock. It is converted into the object'
OBJ_SIZEVNLOCK flag in vnode_pager_alloc().

Tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883


# dc20b834 06-Oct-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add optional root vnode caching

Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by: jeff (previous version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21646


# ba7a55d9 22-Sep-2019 Sean Eric Fagan <sef@FreeBSD.org>

Add two options to allow mount to avoid covering up existing mount points.
The two options are

* nocover/cover: Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir: Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.

Neither of these options is intended to be a default, for historical and
compatibility reasons.

Reviewed by: allanjude, kib
Differential Revision: https://reviews.freebsd.org/D21458


# e3c3248c 03-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: implement usecount implying holdcnt

vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.

Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.

Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471


# 1e2f0ceb 28-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add VOP_NEED_INACTIVE

vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always lock the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads who want to use the vnode
for looukp.

vinactive is very rarely needed and can be tested for without the vnode lock held.

This patch gives filesytems an opportunity to do it, sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:

before: 557563641706 (lockmgr:tmpfs)
after: 46309603301 (lockmgr:tmpfs)

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21371


# 8795830c 25-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: swap vop_unlock_post and vop_unlock_pre definitions to the logical order

The change is no-op.

Sponsored by: The FreeBSD Foundation


# 0256405e 24-Aug-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vholdnz (for already held vnodes)

Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21358


# de4e1aeb 18-Aug-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix an issue with executing tmpfs binary.

Suppose that a binary was executed from tmpfs mount, and the text
vnode was reclaimed while the binary was still running. It is
possible during even the normal operations since tmpfs vnode'
vm_object has swap type, and no references on the vnode is held. Also
assume that the text vnode was revived for some reason. Then, on the
process exit or exec, unmapping of the text mapping tries to remove
the text reference from the vnode, but since it went from
recycle/instantiation cycle, there is no reference kept, and assertion
in VOP_UNSET_TEXT_CHECKED() triggers.

Fix this by keeping a use reference on the tmpfs vnode for each exec
reference. This prevents the vnode reclamation while executable map
entry is active.

Do it by adding per-mount flag MNTK_TEXT_REFS that directs
vop_stdset_text() to add use ref on first vnode text use, and
per-vnode VI_TEXT_REF flag, to record the need on unref in
vop_stdunset_text() on last vnode text use going away. Set
MNTK_TEXT_REFS for tmpfs mounts.

Reported by: bdrewery
Tested by: sbruno, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 520482f4 30-Jul-2019 Mark Johnston <markj@FreeBSD.org>

Use VNASSERT() in checked VOP wrappers.

Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21120


# 2240d8c4 27-Jul-2019 Alan Somers <asomers@FreeBSD.org>

Add v_inval_buf_range, like vtruncbuf but for a range of a file

v_inval_buf_range invalidates all buffers within a certain LBA range of a
file. It will be used by fusefs(5). This commit is a partial merge of
r346162, r346606, and r346756 from projects/fuse2.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21032


# bbbbeca3 24-Jul-2019 Rick Macklem <rmacklem@FreeBSD.org>

Add kernel support for a Linux compatible copy_file_range(2) syscall.

This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also me able to use the VOP_COPY_FILE_RANGE() method.

vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.

Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.

Reviewed by: kib, asomers (plus comments by brooks, jilles)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584


# 555d8f28 01-Jul-2019 Rick Macklem <rmacklem@FreeBSD.org>

Factor out the code that does a VOP_SETATTR(size) from vn_truncate().

This patch factors the code in vn_truncate() that does the actual
VOP_SETATTR() of size into a separate function called vn_truncate_locked().
This will allow the NFS server and the patch that adds a
copy_file_range(2) syscall to call this function instead of duplicating
the code and carrying over changes, such as the recent r347151.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D20808


# e3680954 27-Jun-2019 Rick Macklem <rmacklem@FreeBSD.org>

Add non-blocking trylock variants for the rangelock functions.

A future patch that will add a Linux compatible copy_file_range(2) syscall
needs to be able to lock the byte ranges of two files concurrently.
To do this without a risk of deadlock, a non-blocking variant of
vn_rangelock_rlock() called vn_rangelock_tryrlock() was needed.
This patch adds this, along with vn_rangelock_trywlock(), in order to
do this.
The patch also adds a couple of comments, that I hope clarify how the
algorithm used in kern_rangelock.c works.

Reviewed by: kib, asomers (previous version)
Differential Revision: https://reviews.freebsd.org/D20645


# 65417f5e 24-May-2019 Alan Somers <asomers@FreeBSD.org>

Remove "struct ucred*" argument from vtruncbuf

vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377


# 78022527 05-May-2019 Konstantin Belousov <kib@FreeBSD.org>

Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923


# 6af6fdce 12-Apr-2019 Alan Somers <asomers@FreeBSD.org>

fusefs: evict invalidated cache contents during write-through

fusefs's default cache mode is "writethrough", although it currently works
more like "write-around"; writes bypass the cache completely. Since writes
bypass the cache, they were leaving stale previously-read data in the cache.
This commit invalidates that stale data. It also adds a new global
v_inval_buf_range method, like vtruncbuf but for a range of a file.

PR: 235774
Reported by: cem
Sponsored by: The FreeBSD Foundation


# ae909414 09-Apr-2019 Konstantin Belousov <kib@FreeBSD.org>

Add vn_fsync_buf().

Provide a convenience function to avoid the hack with filling fake
struct vop_fsync_args and then calling vop_stdfsync().

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 8ff7fad1 23-Oct-2018 Konstantin Belousov <kib@FreeBSD.org>

Only call sigdeferstop() for NFS.

Use bypass to catch any NFS VOP dispatch and route it through the
wrapper which does sigdeferstop() and then dispatches original
VOP. NFS does not need a bypass below it, which is not supported.

The vop offset in the vop_vector is added since otherwise it is
impossible to get vop_op_t from the internal table, and I did not
wanted to create the layered fs only to wrap NFS VOPs.

VFS_OP()s wrap is straightforward.

Requested and reviewed by: mjg (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17658


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# b59ea730 20-Aug-2017 Konstantin Belousov <kib@FreeBSD.org>

Allow vinvalbuf() to operate with the shared vnode lock.

This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use. We only need
that all buffers existed at the time of the function start were
flushed. In fact, only one assert has to be relaxed.

In collaboration with: pho
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12083


# 77d3337c 31-Jul-2017 Dmitry Chagin <dchagin@FreeBSD.org>

Implement proper Linux /dev/fd and /proc/self/fd behavior by adding
Linux specific things to the native fdescfs file system.

Unlike FreeBSD, the Linux fdescfs is a directory containing a symbolic
links to the actual files, which the process has open.
A readlink(2) call on this file returns a full path in case of regular file
or a string in a special format (type:[inode], anon_inode:<file-type>, etc..).
As well as in a FreeBSD, opening the file in the Linux fdescfs directory is
equivalent to duplicating the corresponding file descriptor.

Here we have mutually exclusive requirements:
- in case of readlink(2) call fdescfs lookup() method should return VLNK
vnode otherwise our kern_readlink() fail with EINVAL error;
- in the other calls fdescfs lookup() method should return non VLNK vnode.

For what new vnode v_flag VV_READLINK was added, which is set if fdescfs has beed
mounted with linrdlnk option an modified kern_readlinkat() to properly handle it.

For now For Linux ABI compatibility mount fdescfs volume with linrdlnk option:

mount -t fdescfs -o linrdlnk null /compat/linux/dev/fd

Reviewed by: kib@
MFC after: 1 week
Relnotes: yes


# 3df7ebc4 05-Jun-2017 Konstantin Belousov <kib@FreeBSD.org>

Add sysctl vfs.ino64_trunc_error controlling action on truncating
inode number or link count for the ABI compat binaries.

Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits. Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures. For
getdirentries(2), knob skips the dirents which would cause non-trivial
truncation of d_ino.

EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').

Based on the discussion with: bde
Sponsored by: The FreeBSD Foundation


# 0c3c207f 02-Jun-2017 Gleb Smirnoff <glebius@FreeBSD.org>

For UNIX sockets make vnode point not to the socket, but to the UNIX PCB,
since the latter is the thing that links together VFS and sockets.

While here, make the union in the struct vnode anonymous.


# 03311f11 27-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Use whole mnt_stat.f_fsid bits for st_dev.

Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).

Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.

Note that the change is mostly cosmetic. Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..

Reviewed by: avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by: The FreeBSD Foundation


# 69921123 23-May-2017 Konstantin Belousov <kib@FreeBSD.org>

Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439


# 0226f659 05-Apr-2017 Konstantin Belousov <kib@FreeBSD.org>

Add V_VMIO flag for vinvalbuf(9) to indicate that the flush request
was issued during VM-initiated i/o (pageout), so that the function
does not try to flush or remove pages or wait for the vm object
paging-in-progress counter.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241


# fbbd9655 28-Feb-2017 Warner Losh <imp@FreeBSD.org>

Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96


# 5afb134c 12-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vrefact, to be used when the vnode has to be already active

This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by: kib (previous version)


# 99e6e193 23-Nov-2016 Mark Johnston <markj@FreeBSD.org>

Release laundered vnode pages to the head of the inactive queue.

The swap pager enqueues laundered pages near the head of the inactive queue
to avoid another trip through LRU before reclamation. This change adds
support for this behaviour to the vnode pager and makes use of it in UFS and
ext2fs. Some ioflag handling is consolidated into a common subroutine so
that this support can be easily extended to other filesystems which make use
of the buffer cache. No changes are needed for ZFS since its putpages
routine always undirties the pages before returning, and the laundry
thread requeues the pages appropriately in this case.

Reviewed by: alc, kib
Differential Revision: https://reviews.freebsd.org/D8589


# f71d0856 07-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

Reported and tested by: jkim
Discussed with: mjg
Sponsored by: The FreeBSD Foundation


# 5a22c958 06-Oct-2016 Bryan Drewery <bdrewery@FreeBSD.org>

Add vrecyclel() to vrecycle() a vnode with the interlock already held.

Obtained from: OneFS
Sponsored by: Dell EMC Isilon
MFC after: 2 weeks


# 5bb81f9b 30-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: batch free vnodes in per-mnt lists

Previously free vnodes would always by directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.

syncer runs always return the batch regardless of its size.

While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.

Reviewed by: kib
Tested by: pho


# 8660b707 30-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the __bo_vnode field from struct vnode

The pointer can be obtained using __containerof instead.

Reviewed by: kib


# 47e61f6c 15-Aug-2016 Konstantin Belousov <kib@FreeBSD.org>

Implement VOP_FDATASYNC() for msdosfs.

Standard VOP_FSYNC() implementation just syncs data buffers, and due
to this, is the correct and efficient implementation for msdosfs or
any other filesystem which uses bufer cache trivially. Provide
globally visible wrapper vop_stdfdatasync_buf() for future consumption
by other filesystems.

Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D7471


# d293d0c9 11-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused textvp_fullpath() macro.

MFC after: 1 month


# 411455a8 10-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after: 1 month


# 7b255097 04-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused - never actually implemented - vnode lock types
from vnode_if.src.

MFC after: 1 month


# e896fb3b 17-Jun-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels

This removes calls to empty functions like vop_lock_{pre/post} from
common vfs routines.

Approved by: re (gjb)


# f8a75278 17-Jun-2016 Konstantin Belousov <kib@FreeBSD.org>

Add VFS interface to flush specified amount of free vnodes belonging
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.

Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)


# 3f7ca894 17-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Ensure that ftruncate(2) is performed synchronously when file is
opened in O_SYNC mode, at least for UFS. This also handles
truncation, done due to the O_SYNC | O_TRUNC flags combination to
open(2), in synchronous way.

Noted by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 54a33d2f 11-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Add vfs_hash_ref(9) function, which finds a vnode by the hash value
and returns it referenced.

The function is similar to vfs_hash_get(9), but unlike the later,
returned vnode is not locked. This operation cannot be requested with
the vget(9) flags.

Reviewed and tested by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# cd85d599 11-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Style: wrap long lines.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# c89e1b87 03-May-2016 Konstantin Belousov <kib@FreeBSD.org>

Add EVFILT_VNODE open, read and close notifications.

While there, order EVFILT_VNODE notes descriptions alphabetically.

Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks


# 399e8c17 09-Mar-2016 John Baldwin <jhb@FreeBSD.org>

Simplify AIO initialization now that it is standard.

- Mark AIO system calls as STD and remove the helpers to dynamically
register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
the pathconf() system call. fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5589


# 0791e0c0 24-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation. Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode. Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects. Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by: marius (i386, previous version)
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 793c3817 18-Jan-2016 Mark Johnston <markj@FreeBSD.org>

Add vrefl(), a locked variant of vref(9).

This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from: kib (sys/ portion)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4947
Differential Revision: https://reviews.freebsd.org/D4953


# 106ebb76 16-Dec-2015 Konstantin Belousov <kib@FreeBSD.org>

Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno. This significantly speeds up the
iteration for sparce-populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation


# f186a80d 26-Nov-2015 Konstantin Belousov <kib@FreeBSD.org>

Remove VI_AGE vnode iflag, it is unused.

Noted by: bde
Sponsored by: The FreeBSD Foundation


# 412ce274 16-Sep-2015 Steven Hartland <smh@FreeBSD.org>

Fix kqueue write events for files > 2GB

Due to the use of int's for file offsets in the VOP_WRITE_(PRE|POST)
macros, kqueue write events for files greater 2GB where never fired.

This caused tail -f on a file greater 2GB to never see updates.

MFC after: 1 week
Relnotes: YES
Sponsored by: Multiplay


# 55d33667 15-Sep-2015 Conrad Meyer <cem@FreeBSD.org>

kevent(2): Note DOOMED vnodes with NOTE_REVOKE

In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)

Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3675


# 17518b1a 05-Sep-2015 Kirk McKusick <mckusick@FreeBSD.org>

Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks


# c9ba6504 24-Aug-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Make vfs_unmountall() unmount /dev after /, not before. The only
reason this didn't result in an unclean shutdown is that devfs ignores
MNT_FORCE flag.

Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3467


# 6e572e08 23-Aug-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
remove KASSERT that would prevent forced devfs unmount from working.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# 752fc07d 16-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: implement v_holdcnt/v_usecount manipulation using atomic ops

Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by: kib (earlier version)
Tested by: pho


# f0725a8e 11-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

Move chdir/chroot-related fdp manipulation to kern_descrip.c

Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by: kib


# f6f6d240 10-Jun-2015 Mateusz Guzik <mjg@FreeBSD.org>

Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thred's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.


# 2db0e1f5 27-May-2015 Konstantin Belousov <kib@FreeBSD.org>

Add V_MNTREF flag to the vn_start_write(9) and
vn_start_secondary_write(9) functions. The flag indicates that the
caller already owns a reference on the mount point, and the functions
can consume it. The reference is released by vn_finished_write(9) and
vn_finished_secondary_write(9) in due course.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# d5fec489 21-Apr-2015 Craig Rodrigues <rodrigc@FreeBSD.org>

Support file verification in MAC.

* Add VCREAT flag to indicate when a new file is being created
* Add VVERIFY to indicate verification is required
* Both VCREAT and VVERIFY are only passed on the MAC method vnode_check_open
and are removed from the accmode after
* Add O_VERIFY flag to rtld open of objects
* Add 'v' flag to __sflags to set O_VERIFY flag.

Submitted by: Steve Kiernan <stevek@juniper.net>
Obtained from: Juniper Networks, Inc.
GitHub Pull Request: https://github.com/freebsd/freebsd/pull/27
Relnotes: yes


# 08189ed6 27-Feb-2015 Konstantin Belousov <kib@FreeBSD.org>

The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
prevent errors from yanking devices out from under filesystems. Only
care about special vnodes on devfs, special nodes on other kinds of
filesystems do not have special properties.

Sponsored by: EMC / Isilon Storage Division
Submitted by: Conrad Meyer
MFC after: 1 week


# 8ee9765a 21-Dec-2014 Konstantin Belousov <kib@FreeBSD.org>

Add VN_OPEN_NAMECACHE flag for vn_open_cred(9), which requests that
the created file name was cached. Use the flag for core dumps.

Requested by: rpaulo
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 90effb23 22-Nov-2014 Gleb Smirnoff <glebius@FreeBSD.org>

Merge from projects/sendfile:

o Provide a new VOP_GETPAGES_ASYNC(), which works like VOP_GETPAGES(), but
doesn't sleep. It returns immediately, and will execute the I/O done handler
function that must be supplied as argument.
o Provide VOP_GETPAGES_ASYNC() for the FFS, which uses vnode_pager.
o Extend pagertab to support pgo_getpages_async method, and implement this
method for vnode_pager.

Reviewed by: kib
Tested by: pho
Sponsored by: Netflix
Sponsored by: Nginx, Inc.


# f12aa60c 15-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

When vnode bypass cannot be performed on the cdev file descriptor for
read/write/poll/ioctl, call standard vnode filedescriptor fop. This
restores the special handling for terminals by calling the deadfs VOP,
instead of always returning ENXIO for destroyed devices or revoked
terminals.

Since destroyed (and not revoked) device would use devfs_specops VOP
vector, make dead_read/write/poll non-static and fill VOP table with
pointers to the functions, to instead of VOP_PANIC.

Noted and reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# e3d6fece 04-Oct-2014 Konstantin Belousov <kib@FreeBSD.org>

Add IO_RANGELOCKED flag for vn_rdwr(9), which specifies that vnode is
not locked, but range is.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# bf66496e 01-Oct-2014 Will Andrews <will@FreeBSD.org>

Embellish a comment regarding the reliability of DEBUG_VFS_LOCKS.

Submitted by: kib


# 575e02d9 29-Aug-2014 Konstantin Belousov <kib@FreeBSD.org>

Add function and wrapper to switch lockmgr and vnode lock back to
auto-promotion of shared to exclusive.

Tested by: hrs, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 895b3782 14-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Extract the code to put a filesystem into the suspended state (at the
unmount time) in the helper vfs_write_suspend_umnt(). Use it instead
of two inline copies in FFS.

Fix the bug in the FFS unmount, when suspension failed, the ufs
extattrs were not reinitialized.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# a6945216 14-Jul-2014 Konstantin Belousov <kib@FreeBSD.org>

Generalize vn_get_ino() to allow filesystems to use custom vnode
producer, instead of hard-coding VFS_VGET(). New function, which
takes callback, is called vn_get_ino_gen(), standard callback for
vn_get_ino() is provided.

Convert inline copies of vn_get_ino() in msdosfs and cd9660 into the
uses of vn_get_ino_gen().

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 7b81a399 17-Jun-2014 Konstantin Belousov <kib@FreeBSD.org>

In msdosfs_setattr(), add a check for result of the utimes(2)
permissions test, forgotten in r164033.

Refactor the permission checks for utimes(2) into vnode helper
function vn_utimes_perm(9), and simplify its code comparing with the
UFS origin, by writing the call to VOP_ACCESSX only once. Use the
helper for UFS(5), tmpfs(5), devfs(5) and msdosfs(5).

Reported by: bde
Reviewed by: bde, trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# cc3d8c35 09-Jul-2013 Konstantin Belousov <kib@FreeBSD.org>

There are several code sequences like
vfs_busy(mp);
vfs_write_suspend(mp);
which are problematic if other thread starts unmount between two
calls. The unmount starts a write, while vfs_write_suspend() drain
writers. On the other hand, unmount drains busy references, causing
the deadlock.

Add a flag argument to vfs_write_suspend and require the callers of it
to specify VS_SKIP_UNMOUNT flag, when the call is performed not in the
mount path, i.e. the covered vnode is not locked. The suspension is
not attempted if VS_SKIP_UNMOUNT is specified and unmount is in
progress.

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Sponsored by: The FreeBSD Foundation
MFC after: 3 weeks


# 3289d587 20-Mar-2013 Kirk McKusick <mckusick@FreeBSD.org>

When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks


# 5f5f0554 15-Mar-2013 Konstantin Belousov <kib@FreeBSD.org>

Implement the helper function vn_io_fault_pgmove(), intended to use by
the filesystem VOP_READ() and VOP_WRITE() implementations in the same
way as vn_io_fault_uiomove() over the unmapped buffers. Helper
provides the convenient wrapper over the pmap_copy_pages() for struct
uio consumers, taking care of the TDP_UIOHELD situations.

Sponsored by: The FreeBSD Foundation
Tested by: pho
MFC after: 2 weeks


# ba05dec5 27-Feb-2013 Konstantin Belousov <kib@FreeBSD.org>

The softdep freeblks workitem might hold a reference on the dquot.
Current dqflush() panics when a dquot with with non-zero refcount is
encountered. The situation is possible, because quotas are turned off
before softdep workitem queue if flushed, due to the quota file writes
might create softdep workitems.

Make the encountering an active dquot in dqflush() not fatal, return
the error from quotaoff() instead. Ignore the quotaoff() failures
when ffs_flushfiles() is called in the course of softdep_flushfiles()
loop, until the last iteration. At the last loop, the quotas must be
closed, and because SU workitems should be already flushed, the
references to dquot are gone.

Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Reviewed by: mckusick
MFC after: 2 weeks


# ab52a230 13-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Rearrange the struct bufobj and struct vnode layouts to reduce
padding. On the amd64 kernel with INVARIANTS turned off, size of the
struct vnode is reduced from 496 to 472 bytes, saving 24 bytes of
memory and KVA per vnode.

Noted and reviewed by: peter
Tested by: pho
Sponsored by: The FreeBSD Foundation


# f6af8e37 13-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Add exported vfs_hash_index() function, which calculates the canonical
pre-masked hash for the given vnode. The function assumes that
vp->v_hash is initialized by the filesystem vnode instantiation
function. At the moment, it is only done if filesystem uses
vfs_hash_insert().

Reviewed by: peter
Tested by: peter, pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 5 days


# ddd6b3fc 10-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

Add flags argument to vfs_write_resume() and remove
vfs_write_resume_flags().

Sponsored by: The FreeBSD Foundation


# f99cb34c 01-Jan-2013 Konstantin Belousov <kib@FreeBSD.org>

The process_deferred_inactive() function locks the vnodes of the ufs
mount, which means that is must not be called while the snaplock is
owned. The vfs_write_resume(9) does call the function as the
VFS_SUSP_CLEAN() method, which is too early and falls into the region
still protected by snaplock.

Add yet another flag for the vfs_write_resume_flags() to avoid calling
suspension cleanup handler after the suspend is lifted, and use it in
the ffs_snapshot() call to vfs_write_resume.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 91e94745 28-Dec-2012 Konstantin Belousov <kib@FreeBSD.org>

Make it possible to atomically resume writes on the mount and account
the write start, by adding a variation of the vfs_write_resume(9)
which accepts flags.

Use the new function to prevent a deadlock between parallel suspension
and snapshotting a UFS mount. The ffs_snapshot() code performed
vfs_write_resume() followed by vn_start_write() while owning the
snaplock. If the suspension intervene between resume and
vn_start_write(), the deadlock occured after the suspending thread
tried to lock the snaplock, most typically during the write in the
ffs_copyonwrite().

Reported and tested by: Andreas Longwitz <longwitz@incore.de>
Reviewed by: mckusick
MFC after: 2 weeks
X-MFC-note: make the vfs_write_resume(9) function a macro after the MFC,
in HEAD


# f121e3e8 27-Nov-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 9b233e23 14-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a KPI to allow to reserve some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).

Reviewed by: avg
Tested by: pho
MFC after: 1 week


# 7ac1b61a 08-Jun-2012 John Baldwin <jhb@FreeBSD.org>

Split the second half of vn_open_cred() (after a vnode has been found via
a lookup or created via VOP_CREATE()) into a new vn_open_vnode() function
and use this function in fhopen() instead of duplicating code from
vn_open_cred() directly.

Tested by: pho
Reviewed by: kib
MFC after: 2 weeks


# 41014d99 30-May-2012 Konstantin Belousov <kib@FreeBSD.org>

vn_io_fault() is a facility to prevent page faults while filesystems
perform copyin/copyout of the file data into the usermode
buffer. Typical filesystem hold vnode lock and some buffer locks over
the VOP_READ() and VOP_WRITE() operations, and since page fault
handler may need to recurse into VFS to get the page content, a
deadlock is possible.

The facility works by disabling page faults handling for the current
thread and attempting to execute i/o while allowing uiomove() to
access the usermode mapping of the i/o buffer. If all buffer pages are
resident, uiomove() is successfull and request is finished. If EFAULT
is returned from uiomove(), the pages backing i/o buffer are faulted
in and held, and the copyin/out is performed using uiomove_fromphys()
over the held pages for the second attempt of VOP call.

Since pages are hold in chunks to prevent large i/o requests from
starving free pages pool, and since vnode lock is only taken for
i/o over the current chunk, the vnode lock no longer protect atomicity
of the whole i/o request. Use newly added rangelocks to provide the
required atomicity of i/o regardind other i/o and truncations.

Filesystems need to explicitely opt-in into the scheme, by setting the
MNTK_NO_IOPF struct mount flag, and optionally by using
vn_io_fault_uiomove(9) helper which takes care of calling uiomove() or
converting uio into request for uiomove_fromphys().

Reviewed by: bf (comments), mdf, pjd (previous version)
Tested by: pho
Tested by: flo, Gustau P?rez <gperez entel upc edu> (previous version)
MFC after: 2 months


# 8f0e9130 30-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a rangelock implementation, intended to be used to range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.

MFC after: 2 month


# c83f909c 30-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Clarify that the v_lockf is advisory lock list.

MFC after: 3 days


# 292520f7 25-May-2012 Konstantin Belousov <kib@FreeBSD.org>

Add a vn_bmap_seekhole(9) vnode helper which can be used by any
filesystem which supports VOP_BMAP(9) to implement SEEK_HOLE/SEEK_DATA
commands for lseek(2).

MFC after: 2 weeks


# af6e6b87 23-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused thread argument to vrecycle().

Reviewed by: kib


# c52fd858 23-Apr-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused thread argument from vtruncbuf().

Reviewed by: kib


# f257ebbb 20-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed
v_actfreelist).

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks


# 73305eb8 17-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after: 2 weeks


# ecb6e528 11-Apr-2012 Kirk McKusick <mckusick@FreeBSD.org>

Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into
process_deferred_inactive().

Reviewed by: kib
MFC after: 2 weeks


# bd944e01 24-Mar-2012 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove unused define.

Discussed with: kib


# b80dcb55 10-Mar-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove fifo.h. The only used function declaration from the header is
migrated to sys/vnode.h.

Submitted by: gianni


# 5e99212d 02-Mar-2012 Rick Macklem <rmacklem@FreeBSD.org>

Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks


# c7e41c8b 29-Feb-2012 Mikolaj Golub <trociny@FreeBSD.org>

Introduce VOP_UNP_BIND(), VOP_UNP_CONNECT(), and VOP_UNP_DETACH()
operations for setting and accessing vnode's v_socket field.

The operations are necessary to implement proper unix socket handling
on layered file systems like nullfs(5).

This change fixes the long standing issue with nullfs(5) being in that
unix sockets did not work between lower and upper layers: if we bound
to a socket on the lower layer we could connect only to the lower
path; if we bound to the upper layer we could connect only to the
upper path. The new behavior is one can connect to both the lower and
the upper paths regardless what layer path one binds to.

PR: kern/51583, kern/159663
Suggested by: kib
Reviewed by: arch
MFC after: 2 weeks


# 662c901c 25-Feb-2012 Mikolaj Golub <trociny@FreeBSD.org>

When detaching an unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(binded to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by: kib
Reviewed by: kib
MFC after: 2 weeks


# 526d0bd5 20-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix found places where uio_resid is truncated to int.

Add the sysctl debug.iosize_max_clamp, enabled by default. Setting the
sysctl to zero allows to perform the SSIZE_MAX-sized i/o requests from
the usermode.

Discussed with: bde, das (previous versions)
MFC after: 1 month


# c0ae88db 08-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Trim 8 unused bytes from struct vnode on 64-bit architectures.

Reviewed by: alc


# bf40d24a 06-Feb-2012 John Baldwin <jhb@FreeBSD.org>

Rename cache_lookup_times() to cache_lookup() and retire the old API and
ABI stub for cache_lookup().


# c480f781 06-Feb-2012 Konstantin Belousov <kib@FreeBSD.org>

Current implementations of sync(2) and syncer vnode fsync() VOP uses
mnt_noasync counter to temporary remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks


# 5aefb4cb 20-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Close a race in NFS lookup processing that could result in stale name cache
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.

MFC after: 2 weeks


# f6e633a9 14-Jan-2012 Martin Matuska <mm@FreeBSD.org>

Introduce vn_path_to_global_path()

This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.

In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.

Unbreaks jailed zfs(8) with enforce_statfs set to 1.

Reviewed by: kib
MFC after: 1 month


# f0d6c5ca 23-Dec-2011 John Baldwin <jhb@FreeBSD.org>

Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by: rwatson
MFC after: 2 weeks


# 936c09ac 03-Nov-2011 John Baldwin <jhb@FreeBSD.org>

Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provide hints about data access
patterns and are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if
possible.

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month


# 82378711 25-Aug-2011 Martin Matuska <mm@FreeBSD.org>

Generalize ffs_pages_remove() into vn_pages_remove().

Remove mapped pages for all dataset vnodes in zfs_rezget() using
new vn_pages_remove() to fix mmapped files changed by
zfs rollback or zfs receive -F.

PR: kern/160035, kern/156933
Reviewed by: kib, pjd
Approved by: re (kib)
MFC after: 1 week


# 9c00bb91 16-Aug-2011 Konstantin Belousov <kib@FreeBSD.org>

Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)


# 91538d78 10-Jul-2011 Konstantin Belousov <kib@FreeBSD.org>

Update locking annotations for the struct vnode.

MFC after: 3 days


# 280e091a 10-Jun-2011 Jeff Roberson <jeff@FreeBSD.org>

Implement fully asynchronous partial truncation with softupdates journaling
to resolve errors which can cause corruption on recovery with the old
synchronous mechanism.

- Append partial truncation freework structures to indirdeps while
truncation is proceeding. These prevent new block pointers from
becoming valid until truncation completes and serialize truncations.
- On completion of a partial truncate journal work waits for zeroed
pointers to hit indirects.
- softdep_journal_freeblocks() handles last frag allocation and last
block zeroing.
- vtruncbuf/ffs_page_remove moved into softdep_*_freeblocks() so it
is only implemented in one place.
- Block allocation failure handling moved up one level so it does not
proceed with buf locks held. This permits us to do more extensive
reclaims when filesystem space is exhausted.
- softdep_sync_metadata() is broken into two parts, the first executes
once at the start of ffs_syncvnode() and flushes truncations and
inode dependencies. The second is called on each locked buf. This
eliminates excessive looping and rollbacks.
- Improve the mechanism in process_worklist_item() that handles
acquiring vnode locks for handle_workitem_remove() so that it works
more generally and does not loop excessively over the same worklist
items on each call.
- Don't corrupt directories by zeroing the tail in fsck. This is only
done for regular files.
- Push a fsync complete record for files that need it so the checker
knows a truncation in the journal is no longer valid.

Discussed with: mckusick, kib (ffs_pages_remove and ffs_truncate parts)
Tested by: pho


# d91f88f7 18-Apr-2011 Matthew D Fleming <mdf@FreeBSD.org>

Add the posix_fallocate(2) syscall. The default implementation in
vop_stdallocate() is filesystem agnostic and will run as slow as a
read/write loop in userspace; however, it serves to correctly
implement the functionality for filesystems that do not implement a
VOP_ALLOCATE.

Note that __FreeBSD_version was already bumped today to 900036 for any
ports which would like to use this function.

Also reserve space in the syscall table for posix_fadvise(2).

Reviewed by: -arch (previous version)


# 08b163fa 02-Feb-2011 Matthew D Fleming <mdf@FreeBSD.org>

Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week


# 730b63b0 19-Nov-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seems to be unused, and the reported
situation is normal for the forced unmount.

MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 3634d5b2 20-Aug-2010 John Baldwin <jhb@FreeBSD.org>

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


# 3979450b 06-Aug-2010 Konstantin Belousov <kib@FreeBSD.org>

Add new make_dev_p(9) flag MAKEDEV_ETERNAL to inform devfs that created
cdev will never be destroyed. Propagate the flag to devfs vnodes as
VV_ETERNVALDEV. Use the flags to avoid acquiring devmtx and taking a
thread reference on such nodes.

In collaboration with: pho
MFC after: 1 month


# df17ff61 11-Jun-2010 Andriy Gapon <avg@FreeBSD.org>

vnode.h: expand debug macros to non-empty void statements when
DEBUG_VFS_LOCKS is disabled

MFC after: 2 weeks


# 7fd32ea9 12-May-2010 Zachary Loafman <zml@FreeBSD.org>

Add VOP_ADVLOCKPURGE so that the file system is called when purging
locks (in the case where the VFS impl isn't using lf_*)

Submitted by: Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by: zml, dfr


# 307d88b7 06-May-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Style fixes and removal of unneeded variable.

Submitted by: bde@


# b5f770bd 05-May-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

Move checking against RLIMIT_FSIZE into one place, vn_rlimit_fsize().

Reviewed by: kib


# bb9a8424 09-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206093:
Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().


# ea015880 02-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().

Tested by: Mikolaj Golub
MFC after: 1 week


# bd7ae209 27-Mar-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

MFC r197680:

Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both. Note that
this commit makes implementation of either of these two mandatory.

Reviewed by: kib


# 78e32164 27-Mar-2010 Edward Tomasz Napierala <trasz@FreeBSD.org>

MFC r197405:

Add pieces of infrastructure required for NFSv4 ACL support in UFS.

Reviewed by: rwatson


# e9560b40 07-Feb-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r202528:
Add vunref(9).


# d2f334bf 17-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

Add new function vunref(9) that decrements vnode use count (and hold
count) while vnode is exclusively locked.

The code for vput(9), vrele(9) and vunref(9) is merged.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks


# 2d63cbda 10-Jan-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r200770:
Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.


# 49e3050e 20-Dec-2009 Konstantin Belousov <kib@FreeBSD.org>

VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing the redundand information, need to update both
vnode and object flags causes more acquisition of vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks


# 2c29cfa0 01-Oct-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both. Note that
this commit makes implementation of either of these two mandatory.

Reviewed by: kib


# a9315dde 22-Sep-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add pieces of infrastructure required for NFSv4 ACL support in UFS.

Reviewed by: rwatson


# fe1d3f15 28-Jun-2009 Stanislav Sedov <stas@FreeBSD.org>

- Turn the third (islocked) argument of the knote call into flags parameter.
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag. This is required
for ZFS as its getattr implementation may sleep.

Approved by: re (rwatson)
Reviewed by: kib
MFC after: 2 weeks


# aec53722 25-Jun-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Tweak comment.


# c808c963 21-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Add explicit struct ucred * argument for VOP_VPTOCNP, to be used by
vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.

This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.

Reviewed by: rwatson


# e0c161b8 21-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Add another flags argument to vn_open_cred. Use it to specify that some
vn_open_cred invocations shall not audit namei path.

In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.

Reported, reviewed and tested by: rwatson


# f0830182 02-Jun-2009 Attilio Rao <attilio@FreeBSD.org>

Handle lock recursion differenty by always checking against LO_RECURSABLE
instead the lock own flag itself.

Tested by: pho


# 0449e6e1 31-May-2009 Konstantin Belousov <kib@FreeBSD.org>

Eliminate code duplication in vn_fullpath1() around the cache lookups
and calls to vn_vptocnp() by moving more of the common code to
vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that
cache is locked around the call.

Do not track buffer position by both the pointer and offset, use only
buflen to record the start of the free space.

Export vn_vptocnp() for external consumers as a wrapper around
vn_vptocnp_locked() that locks the cache and handles hold counts.

Tested by: pho


# c97fcdba 30-May-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add VOP_ACCESSX, which can be used to query for newly added V*
permissions, such as VWRITE_ACL. For a filsystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.

Reviewed by: rwatson@


# 885868cd 10-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Remove VOP_LEASE and supporting functions. This hasn't been used since
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.

Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.

Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon


# 607fc40b 29-Mar-2009 Alexander Kabaev <kan@FreeBSD.org>

Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stoppped being type stable.

Reviewed by: kib


# 6180d318 29-Mar-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Get rid of VSTAT and replace it with VSTAT_PERMS, which is somewhat
better defined.

Approved by: rwatson (mentor)


# 54377204 27-Mar-2009 Edward Tomasz Napierala <trasz@FreeBSD.org>

Add new V* constants, neccessary for granular permission checks
in NFSv4 ACLs. While here, get rid of VALLPERM; it wasn't used anyway.

Approved by: rwatson (mentor)


# f0ffa083 08-Mar-2009 Joe Marcus Clarke <marcus@FreeBSD.org>

Add a prototype for the new vop_stdvptocnp function.

Reviewed by: kib
Approved by: kib
Tested by: pho


# 03964c8e 19-Feb-2009 John Baldwin <jhb@FreeBSD.org>

Enable caching of negative pathname lookups in the NFS client. To avoid
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.

Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans


# 9a734a28 28-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Actually remove the VA_MARK_ATIME flag. This should have been in the
earlier commit to add VOP_MARKATIME().


# e9aff357 21-Jan-2009 Konstantin Belousov <kib@FreeBSD.org>

Move the code from ufs_lookup.c used to do dotdot lookup, into
the helper function. It is supposed to be useful for any filesystem
that has to unlock dvp to walk to the ".." entry in lookup routine.

Requested by: jhb
Tested by: pho
MFC after: 1 month


# 24b087e4 11-Dec-2008 Joe Marcus Clarke <marcus@FreeBSD.org>

Add a new error VOP, VOP_ENOENT. This function will simply return ENOENT.

Reviewed by: arch
Approved by: kib


# 1ba4a712 17-Nov-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 15bc6b2b 28-Oct-2008 Edward Tomasz Napierala <trasz@FreeBSD.org>

Introduce accmode_t. This is required for NFSv4 ACLs - it will be neccessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# 0d7935fd 10-Oct-2008 Attilio Rao <attilio@FreeBSD.org>

Remove the struct thread unuseful argument from bufobj interface.
In particular following functions KPI results modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later
commit.

As a side note, please consider just temporary the 'curthread' argument
passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# bdb80947 16-Sep-2008 Konstantin Belousov <kib@FreeBSD.org>

Garbage-collect vn_write_suspend_wait().

Suggested and reviewed by: tegge
Tested by: pho
MFC after: 1 month


# dfa7fd1d 10-Sep-2008 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove VSVTX, VSGID and VSUID. This should be a no-op,
as VSVTX == S_ISVTX, VSGID == S_ISGID and VSUID == S_ISUID.

Approved by: rwatson (mentor)


# 0359a12e 28-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally unuseful.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>


# a888d54d 28-Aug-2008 Konstantin Belousov <kib@FreeBSD.org>

Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmnque() function
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.

Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.

Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.

Use the VV_FORCEINSMQ for the creation of the syncvnode.

Tested by: pho
Reviewed by: tegge
MFC after: 1 month


# dfc714fb 31-Jul-2008 Christian S.J. Peron <csjp@FreeBSD.org>

Currently, BSM audit pathname token generation for chrooted or jailed
processes are not producing absolute pathname tokens. It is required
that audited pathnames are generated relative to the global root mount
point. This modification changes our implementation of audit_canon_path(9)
and introduces a new function: vn_fullpath_global(9) which performs a
vnode -> pathname translation relative to the global mount point based
on the contents of the name cache. Much like vn_fullpath,
vn_fullpath_global is a wrapper function which called vn_fullpath1.

Further, the string parsing routines have been converted to use the
sbuf(9) framework. This change also removes the conditional acquisition
of Giant, since the vn_fullpath1 method will not dip into file system
dependent code.

The vnode locking was modified to use vhold()/vdrop() instead the vref()
and vrele(). This will modify the hold count instead of modifying the
user count. This makes more sense since it's the kernel that requires
the reference to the vnode. This also makes sure that the vnode does not
get recycled we hold the reference to it. [1]

Discussed with: rwatson
Reviewed by: kib [1]
MFC after: 2 weeks


# eab626f1 16-Apr-2008 Konstantin Belousov <kib@FreeBSD.org>

Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
with the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks


# 047dd67e 06-Apr-2008 Attilio Rao <attilio@FreeBSD.org>

Optimize lockmgr in order to get rid of the pool mutex interlock, of the
state transitioning flags and of msleep(9) callings.
Use, instead, an algorithm very similar to what sx(9) and rwlock(9)
alredy do and direct accesses to the sleepqueue(9) primitive.

In order to avoid writer starvation a mechanism very similar to what
rwlock(9) uses now is implemented, with the correspective per-thread
shared lockmgrs counter.

This patch also adds 2 new functions to lockmgr KPI: lockmgr_rw() and
lockmgr_args_rw(). These two are like the 2 "normal" versions, but they
both accept a rwlock as interlock. In order to realize this, the general
lockmgr manager function "__lockmgr_args()" has been implemented through
the generic lock layer. It supports all the blocking primitives, but
currently only these 2 mappers live.

The patch drops the support for WITNESS atm, but it will be probabilly
added soon. Also, there is a little race in the draining code which is
also present in the current CVS stock implementation: if some sharers,
once they wakeup, are in the runqueue they can contend the lock with
the exclusive drainer. This is hard to be fixed but the now committed
code mitigate this issue a lot better than the (past) CVS version.
In addition assertive KA_HELD and KA_UNHELD have been made mute
assertions because they are dangerous and they will be nomore supported
soon.

In order to avoid namespace pollution, stack.h is splitted into two
parts: one which includes only the "struct stack" definition (_stack.h)
and one defining the KPI. In this way, newly added _lockmgr.h can
just include _stack.h.

Kernel ABI results heavilly changed by this commit (the now committed
version of "struct lock" is a lot smaller than the previous one) and
KPI results broken by lockmgr_rw() / lockmgr_args_rw() introduction,
so manpages and __FreeBSD_version will be updated accordingly.

Tested by: kris, pho, jeff, danger
Reviewed by: jeff
Sponsored by: Google, Summer of Code program 2007


# 0a3af16a 31-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the utility function vn_commname() to retrieve the command name
from the vfs namecache, when available.

Reviewed by: rwatson, rdivacky
Tested by: pho


# 97735db7 23-Mar-2008 Jeff Roberson <jeff@FreeBSD.org>

- Remove an old comment; vnodes have been working without Giant for
years now.
- Clarify the locking required for VI_DOOMED in preparation for
simplifications to vget() and vn_lock().


# 1be222e9 23-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Yield the cpu in the kernel while iterating the list of the
vnodes belonging to the mountpoint. Also, yield when in the
softdep_process_worklist() even when we are not going to sleep due to
buffer drain.

It is believed that the ULE fixed the problem [1], but the yielding
seems to be needed at least for the 4BSD case.

Discussed: on stable@, with bde
Reviewed by: tegge, jeff [1]
MFC after: 2 weeks


# 7fbfba7b 01-Mar-2008 Attilio Rao <attilio@FreeBSD.org>

- Handle buffer lock waiters count directly in the buffer cache instead
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() instead than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used

This patch breaks KPI so __FreBSD_version will be bumped and manpages
updated by further commits. Additively, 'struct buf' changes results in
a disturbed ABI also.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>


# 628f51d2 24-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 51662691 20-Oct-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Remove redundant prototypes.


# 9e223287 31-May-2007 Konstantin Belousov <kib@FreeBSD.org>

Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from being file descriptor index into the pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)


# 950afe99 25-May-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

The cache_leaf_test() function seems to be unused, so remove it.


# d413d210 18-May-2007 Konstantin Belousov <kib@FreeBSD.org>

Since renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no more generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.


# e6534b36 31-Mar-2007 Dag-Erling Smørgrav <des@FreeBSD.org>

Make vdropl() public; zfs needs it. There is also plenty of existing
file system code (mostly *_reclaim()) which look like this:

VOP_LOCK(vp);
/* examine vp */
VOP_UNLOCK(vp);
vdrop(vp);

This can now be rewritten to:

VOP_LOCK(vp);
/* examine vp */
vdropl(vp); /* will unlock vp */

MFC after: 1 week


# 61b9d89f 12-Mar-2007 Tor Egge <tegge@FreeBSD.org>

Make insmntque() externally visibile and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnode belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib


# 10bcafe9 15-Feb-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Move vnode-to-file-handle translation from vfs_vptofh to vop_vptofh method.
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.

Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.

VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.

Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs


# 8fd14511 14-Dec-2006 Konstantin Belousov <kib@FreeBSD.org>

Use tab after #define.

Pointed out by: pjd


# 3b7b5496 14-Dec-2006 Konstantin Belousov <kib@FreeBSD.org>

Resolve two deadlocks that could be caused by busy md device backed
by vnode. Allow for md thread and the thread that owns lock on vnode
backing the md device to do the write even when runningbufspace is
exhausted.

Tested by: Peter Holm
Reviewed by: tegge
MFC after: 2 weeks


# 2f6a774b 12-Nov-2006 Kip Macy <kmacy@FreeBSD.org>

change vop_lock handling to allowing tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)


# 1a60c7fc 31-Oct-2006 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups looks for such
(orphaned) objects.
- When file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl


# 4207c279 18-Apr-2006 Xin LI <delphij@FreeBSD.org>

In vfs_hash_get(): mount point should never be changed
so explicitly constify the mp parameter.

Reviewed by: phk


# 791dd2fa 08-Mar-2006 Tor Egge <tegge@FreeBSD.org>

Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.


# eb2ea105 01-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris


# 731959b1 31-Jan-2006 Yaroslav Tykhiy <ytykhiy@gmail.com>

Use off_t for file size passed to vnode_create_vobject().
The former type, size_t, was causing truncation to 32 bits on i386,
which immediately led to undersizing of VM objects backed by
files >4GB. In particular, sendfile(2) was broken for such files.

PR: kern/92243
MFC after: 5 days


# ea6d62f1 14-Jan-2006 Robert Watson <rwatson@FreeBSD.org>

Rename uid and gid arguments to vaccess() prototype to match vaccess()
implementation in vfs_subr.c. No functional change.

MFC after: 3 days


# 82be0a5a 09-Jan-2006 Tor Egge <tegge@FreeBSD.org>

Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman


# 0430a5e2 13-Dec-2005 Dag-Erling Smørgrav <des@FreeBSD.org>

Eradicate caddr_t from the VFS API.


# e26b05cf 13-Dec-2005 Dag-Erling Smørgrav <des@FreeBSD.org>

Nuke vnodeop_desc.vdesc_transports, which has been unused since the dawn
of time (or the inception of ncvs, whichever came last)


# 9f5c1d19 12-Oct-2005 Diomidis Spinellis <dds@FreeBSD.org>

Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks


# 2883ba66 12-Sep-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vfs_read_dirent() which can help VOP_READDIR() implementations
by handling all the cookie stuff.


# 34cc826a 05-Aug-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Holding a vnode doesn't prevent v_mount from disappearing (when the
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.

Reviewed by: kris@
MFC after: 3 days


# e8ddb61d 02-Aug-2005 Jeff Roberson <jeff@FreeBSD.org>

- Replace the series of DEBUG_LOCKS hacks which tried to save the vn_lock
caller by saving the stack of the last locker/unlocker in lockmgr. We
also put the stack in KTR at the moment.

Contributed by: Antoine Brodin <antoine.brodin@laposte.net>


# 2b0f687b 01-Jul-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Mistakingly undefined VN_KNOTE_LOCKED in my previous commit.

Noticed by: Antoine Brodin <antoine.brodin@laposte.net>
Approved by: re (scottl)


# 571dcd15 01-Jul-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone


# b930d853 13-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Don't make vgonel() globally visible, we want to change its prototype
anyway and it's not used outside of vfs_subr.c.
- Change vgonel() to accept a parameter which determines whether or not
we'll put the vnode on the free list when we're done.
- Use the new vgonel() parameter rather than VI_DOOMED to signal our
intentions in vtryrecycle().
- In vgonel() return if VI_DOOMED is already set, this vnode has already
been reclaimed.

Sponsored by: Isilon Systems, Inc.


# 679985d0 09-Jun-2005 Suleiman Souhlal <ssouhlal@FreeBSD.org>

Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)


# 6341095e 31-May-2005 Ken Smith <kensmith@FreeBSD.org>

This patch addresses a standards violation issue. The standards say a
file's access time should be updated when it gets executed. A while
ago the mechanism used to exec was changed to use a more mmap based
mechanism and this behavior was broken as a side-effect of that.

A new vnode flag is added that gets set when the file gets executed,
and the VOP_SETATTR() vnode operation gets called. The underlying
filesystem is expected to handle it based on its own semantics, some
filesystems don't support access time at all. Those that do should
handle it in a way that does not block, does not generate I/O if possible,
etc. In particular vn_start_write() has not been called. The UFS code
handles it the same way as it would normally handle the access time if
a file was read - the IN_ACCESS flag gets set in the inode but no other
action happens at this point. The actual time update will happen later
during a sync (which handles all the necessary locking).

Got me into this: cperciva
Discussed with: a lot with bde, a little with kan
Showed patches to: phk, jeffr, standards@, arch@
Minor discussion on: arch@


# fc8dfa75 27-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Changes to vgone() and related teardown code have meant that the vxthread
pointer is no longer needed.


# ab9707d7 22-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add a VI_LOCK_FLAGS so we can pass MTX_DUPOK in. This somewhat defeats
the purpose of having macros to hide the lock type as we may now be
dependent on MTX_ flags.

Sponsored by: Isilon Systems, Inc.


# cbc5da3a 11-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add the mising ASSERT_VOP_ELOCKED code in the !DEBUG_VFS_LOCKS case.

Pointy hat to: me


# 539de9ed 11-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Enable ASSERT_VOP_ELOCKED and assert_vop_elocked() now that vnode_if.awk
uses it.

Sponsored by: Isilon Systems, Inc.


# 878cdac0 29-Mar-2005 David Schultz <das@FreeBSD.org>

Eliminate v_id and v_ddid. This changes struct vnode, so all
filesystem modules must be recompiled. (Since struct vnode has
already changed in 6-CURRENT, there's little advantage to leaving
the unused fields around.)


# c167961e 23-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- If vput() is called with a shared lock it must upgrade to an exclusive
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.

Sponsored by: Isilon Systems, Inc.


# 51f5ce0c 16-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add two arguments to the vfs_hash() KPI so that filesystems which do
not have unique hashes (NFS) can also use it.


# 78bb3c21 16-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add mnt_hashseed to struct mount and initialize it witn PRNG bits, use
it to get better hashing in vfs_hash.

In case of an insert collision in vfs_hash_insert(), put the loosing vnode
on a special list so that vfs_hash_remove() can just assume that it is on
a list.

Drop the VI_HASHED flag.


# b172f6c5 15-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Now that there are no external users of vfree() make it static.
- Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so
no one else attempts to grow a dependency on them.
- Now that objects with pages hold the vnode we don't have to do unlocked
checks for the page count in the vm object in VSHOULDFREE. These three
macros could simply check for holdcnt state transitions to determine
whether the vnode is on the free list already, but the extra safety
the flag affords us is probably worth the minimal cost.
- The leafonly sysctl and code have been dead for several years now,
remove the sysctl and the code that employed it from vtryrecycle().
- vtryrecycle() also no longer has to check the object's page count as
the object holds the vnode until it reaches 0.

Sponsored by: Isilon Systems, Inc.


# c178628d 15-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Expose vholdl() so it may be used outside of vfs_subr.c


# 6c325a2a 14-Mar-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Currently (almost) all filesystems maintain a local inode hash table
to get from (mount + inode) to vnode. These tables are mostly
copy&pasted from UFS, sized based on desiredvnodes and therefore
quite large (128K-512K). Several filesystems are buggy enough that
they allocate the hash table even before they know if they will
ever be used or not.

Add "vfs_hash", a system wide hash table, which will replace all
the per-filesystem hash-tables.

The fields we add to struct vnode will more or less be saved in
the respective filesystems inodes.

Having one central implementation will save code and will allow us
to justify the complexity of code to dynamically (re)size the hash
at a later point.


# 8045557f 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Increment the holdcnt once for each usecount reference. This allows us
to use only the holdcnt to determine whether a vnode may be recycled,
simplifying the V* macros as well as vtryrecycle(), etc.

Sponsored by: Isilon Systems, Inc.


# 159b4548 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- We do not have to check the object's ref_count in VSHOULDFREE or
vtryrecycle(). All obj refs also ref the vnode.
- Consistently use v_incr_usecount() to increment the usecount. This will
be more important later.

Sponsored by: Isilon Systems, Inc.


# 1d39df3f 14-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Retire OLOCK and OWANT. All callers hold the vnode lock when creating
a vnode object. There has been an assert to prove this for some time.

Sponsored by: Isilon Systems, Inc.


# 5e7475a3 13-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of VXLOCK, VXWANT, and vx_*. The vnode lock now protects us
against recycling.
- Modify VSHOULDFREE, VCANRECYCLE, etc. now that certain flags are no
longer important. Remove VMIGHTFREE as it is only used in one place.

Sponsored by: Isilon Systems, Inc.


# bd8bb8e4 22-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Group the fields in struct vnode by their function and stick comments
there to tell what the function is.


# aa2f6ddc 22-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Reap more benefits from DEVFS:

List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is bargain.
In addition it makes it harder for people to try to do stypid things like
"finding the vnode from cdev".

Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().

rename idestroy_dev() to destroy_devl() for consistency.

Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().

Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();

Let devfs_reclaim() call dev_rel() instead of vgonel().

Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.

Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)


# 7fc940b2 22-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vfinddev(), it is generally bogus when faced with jails and
chroot and has no legitimate use(r)s in the tree.


# 4d8ac58b 17-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vx_wait{l}() and use it instead of home-rolled versions.


# 1ba21282 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make various vnode related functions static


# 8d63297e 10-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Add __printflike() to vn_printf()


# 88e5b12a 08-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Drag another softupdates tentacle back into FFS: Now that FFS's
vop_fsync is separate from the internal use we can do the full job
there.


# 7c5d36fb 07-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vop_stddestroyvobject()


# d4eb29ba 28-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused argument to vrecycle()


# 7146d6cb 28-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Move the contents of vop_stddestroyvobject() to the new vnode_pager
function vnode_destroy_vobject().

Make the new function zero the vp->v_object pointer so we can tell
if a call is missing.


# 729fcf7e 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Take VOP_GETVOBJECT() out to pasture. We use the direct pointer now.


# 69816ea3 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Kill VOP_CREATEVOBJECT(), it is now the responsibility of the filesystem
for a given vnode to create a vnode_pager object if one is needed.


# d07a6d3f 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Move the body of vop_stdcreatevobject() over to the vnode_pager under
the name Sande^H^H^H^H^Hvnode_create_vobject().

Make the new function take a size argument which removes the need for
a VOP_STAT() or a very pessimistic guess for disks.

Call that new function from vop_stdcreatevobject().

Make vnode_pager_alloc() private now that its only user came home.


# 7c93282e 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Change vprint() to vn_printf() which takes varargs.
Add #define for vprint() to call vn_printf().


# 35764be3 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Kill the VV_OBJBUF and test the v_object for NULL instead.


# 49bc5dce 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add a VCANRECYCLE() which performs all the checks required to ensure
that we are free to release a vnode.


# 7c0745ee 14-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate unused and unnecessary "cred" argument from vinvalbuf()


# e39db32a 12-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


# de8a6c06 13-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Get rid of the VDESC() macro while the pot is boiling anyway, it is
only used from generate files now, so we might as well generate the
right stuff from the start.


# 63f89abf 13-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Change the generated VOP_ macro implementations to improve type checking
and KASSERT coverage.

After this check there is only one "nasty" cast in this code but there
is a KASSERT to protect against the wrong argument structure behind
that cast.

Un-inlining the meat of VOP_FOO() saves 35kB of text segment on a typical
kernel with no change in performance.

We also now run the checking and tracing on VOP's which have been layered
by nullfs, umapfs, deadfs or unionfs.

Add new (non-inline) VOP_FOO_AP() functions which take a "struct
foo_args" argument and does everything the VOP_FOO() macros
used to do with checks and debugging code.

Add KASSERT to VOP_FOO_AP() check for argument type being
correct.

Slim down VOP_FOO() inline functions to just stuff arguments
into the struct foo_args and call VOP_FOO_AP().

Put function pointer to VOP_FOO_AP() into vop_foo_desc structure
and make VCALL() use it instead of the current offsetoff() hack.

Retire vcall() which implemented the offsetoff()

Make deadfs and unionfs use VOP_FOO_AP() calls instead of
VCALL(), we know which specific call we want already.

Remove unneeded arguments to VCALL() in nullfs and umapfs bypass
functions.

Remove unused vdesc_offset and VOFFSET().

Generally improve style/readability of the generated code.


# 60727d8b 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for license, minor formatting changes


# 10eee285 22-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Shuffle numeric values of the IO_* flags to match the O_* flags from
fcntl.h.

This is in preparation for making the flags passed to device drivers be
consistently from fcntl.h for all entrypoints.

Today open, close and ioctl uses fcntl.h flags, while read and write
uses vnode.h flags.


# e87047b4 20-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

We can only ever get to vgonechrl() from a devfs vnode, so we do not
need to reassign the vp->v_op to devfs_specops, we know that is the
value already.

Make devfs_specops private to devfs.


# d3ecc7aa 11-Dec-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Revert rev 1.259. The null-pointer function call (a dereference on
ia64) was not the result of a change in the vector operations. It
was caused by the NFS locking code using a FIFO and those bypassing
the vnode. This indirectly caused the panic. The NFS locking code
has been changed.

Requested by: phk


# 20a92a18 07-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).


# 061f5ec8 05-Dec-2004 Marcel Moolenaar <marcel@FreeBSD.org>

Fix null-pointer indirect function calls introduced in the previous
commit. In the new world order, the transitive closure on the vector
operations is not precomputed. As such, it's unsafe to actually use
any of the function pointers in an indirect function call. They can
be null, and we need to use the default vector in that case.
This is mostly a quick fix for the four function pointers that are
ed explicitly. A more generic or scalable solution is likely to see
the light of day.

No pathos on: current@


# aec0fb7b 01-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)


# db442506 13-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Eliminate vop_revoke() function now that devfs_revoke() does the entire job.


# c5b846fe 10-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Slim vnodes by another four bytes by eliminating the (now) unused field
v_cachedid.


# c13a4e88 10-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vn_todev()


# b797084e 09-Nov-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vnode->v_cachedfs.

It was only used for the highly dangerous "export all vnodes with a sysctl"
function.


# 20eba72f 27-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the syncer linkage from vnode to bufobj.

This is not quite a perfect separation: the syncer still think it knows
that everything is a vnode.


# 156cb265 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Loose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().


# ee1d0eb3 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove vnode->v_bsize. This was a dead-end.


# 4dcd0ac4 25-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Collapse vnode->v_object and buf->b_object into bufobj->bo_object.


# ff7c5a48 22-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite
jest, of most excellent fancy: he hath taught me lessons a thousand
times; and now, how abhorred in my imagination it is! my gorge rises
at it. Here were those hacks that I have curs'd I know not how
oft. Where be your kludges now? your workarounds? your layering
violations, that were wont to set the table on a roar?

Move the skeleton of specfs into devfs where it now belongs and
bury the rest.


# a76d8f4e 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.


# 1bca607b 21-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx.

Initialize the bo_mtx when we allocate a vnode i getnewvnode() For
now we point to the vnodes interlock mutex, that retains the exact
same locking sematics.

Move v_numoutput from vnode to bufobj. Add renaming macro to
postpone code sweep.


# 0bf87424 20-Oct-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Add new function ttyinitmode() which sets our systemwide default
modes on a tty structure. Both the ".init" and the current settings
are initialized allowing the function to be used both at attach and
open time.

The function takes an argument to decide if echoing should be enabled
by default. Echoing should not be enabled for regular physical
serial ports unless they are consoles, in which case they should
be configured by ttyconsolemode() instead.

Use the new function throughout.


# 1affa3ad 07-Sep-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Create simple function init_va_filerev() for initializing a va_filerev
field.

Replace three instances of longhaired initialization va_filerev fields.

Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.


# ad3b9257 15-Aug-2004 John-Mark Gurney <jmg@FreeBSD.org>

Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)


# 39dfb406 10-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Modify vnode locking key: the v_pollinfo pointer itself is protected
by Giant; the contents are protected by the pollinfo mutex. We rely
on Giant to prevent races in assigning the value of v_pollinfo.


# f257b7a5 12-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


# 1cbb1e02 03-Jul-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.


# f3732fd1 17-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Second half of the dev_t cleanup.

The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.


# 89c9c53d 16-Jun-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.


# f99619a0 04-Jun-2004 Tim J. Robbins <tjr@FreeBSD.org>

Change the types of vn_rdwr_inchunks()'s len and aresid arguments to
size_t and size_t *, respectively. Update callers for the new interface.
This is a better fix for overflows that occurred when dumping segments
larger than 2GB to core files.


# 82c6e879 06-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# b21126c6 29-Mar-2004 Peter Wemm <peter@FreeBSD.org>

Clean up the stub fake vnode locking implemenations. The main reason this
stuff was here (NFS) was fixed by Alfred in November. The only remaining
consumer of the stub functions was umapfs, which is horribly horribly
broken. It has missed out on about the last 5 years worth of maintenence
that was done on nullfs (from which umapfs is derived). It needs major
work to bring it up to date with the vnode locking protocol. umapfs really
needs to find a caretaker to bring it into the 21st century.

Functions GC'ed:
vop_noislocked, vop_nolock, vop_nounlock, vop_sharedlock.


# 651b11ea 11-Mar-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Remove unused second arg to vfinddev().
Don't call addaliasu() on VBLK nodes.


# 64f4c8e8 05-Jan-2004 Alexander Kabaev <kan@FreeBSD.org>

Properly ifdef support for vfs locking assertions based on DEBUG_VFS_LOCKS.

Obtained from: bde


# 89144d1a 05-Jan-2004 Alexander Kabaev <kan@FreeBSD.org>

Style fixes:
Remove double empty lines.
Add tab after #define's.
Properly terminate sentences in comments.

Obtained from: bde (mostly).


# 9efe7d9d 28-Dec-2003 Bruce Evans <bde@FreeBSD.org>

v_vxproc was a bogus name for a thread (pointer).


# eca8a663 11-Nov-2003 Robert Watson <rwatson@FreeBSD.org>

Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'
symbol.

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# ca430f2e 04-Nov-2003 Alexander Kabaev <kan@FreeBSD.org>

Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff


# 06cb76bd 23-Oct-2003 Garrett Wollman <wollman@FreeBSD.org>

Add appropriate const poisoning to the assert_*locked() family so that I can
call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic.

Inspired by: the evil and rude OpenAFS cache manager code


# c31e9345 04-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Document more of the vnode locking strategy.


# 7c89f162 27-Jul-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.


# b6f5d1d6 09-Jul-2003 Jeffrey Hsu <hsu@FreeBSD.org>

Replace custom field offset macro with the system __offsetof() macro.

Reviewed by: bde


# 17a13919 31-May-2003 Poul-Henning Kamp <phk@FreeBSD.org>

The IO_NOWDRAIN and B_NOWDRAIN hacks are no longer needed to prevent
deadlocks with vnode backed md(4) devices because md now uses a
kthread to run the bio requests instead of doing it directly from
the bio down path.


# adf4e1d5 26-Apr-2003 Alan Cox <alc@FreeBSD.org>

Remove an unused declaration.


# 9c926592 09-Apr-2003 Mike Barcroft <mike@FreeBSD.org>

Add prototypes for change_root() and change_dir().


# 3a7053cb 24-Feb-2003 Kirk McKusick <mckusick@FreeBSD.org>

Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.


# 767b9a52 09-Feb-2003 Jeff Roberson <jeff@FreeBSD.org>

- Cleanup unlocked accesses to buf flags by introducing a new b_vflag member
that is protected by the vnode lock.
- Move B_SCANNED into b_vflags and call it BV_SCANNED.
- Create a vop_stdfsync() modeled after spec's sync.
- Replace spec_fsync, msdos_fsync, and hpfs_fsync with the stdfsync and some
fs specific processing. This gives all of these filesystems proper
behavior wrt MNT_WAIT/NOWAIT and the use of the B_SCANNED flag.
- Annotate the locking in buf.h


# 6a1b2a22 29-Dec-2002 Ian Dowse <iedowse@FreeBSD.org>

Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().

There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.

Reviewed by: mckusick (an older version of the patch)


# 45587e25 28-Dec-2002 Matthew Dillon <dillon@FreeBSD.org>

Abstract-out the constants for the sequential heuristic.

No operational changes.

MFC after: 1 day


# c7047e52 27-Oct-2002 Garrett Wollman <wollman@FreeBSD.org>

Change the way support for asynchronous I/O is indicated to applications
to conform to 1003.1-2001. Make it possible for applications to actually
tell whether or not asynchronous I/O is supported.

Since FreeBSD's aio implementation works on all descriptor types, don't
call down into file or vnode ops when [f]pathconf() is asked about
_PC_ASYNC_IO; this avoids the need for every file and vnode op to know about
it.


# 9ab73fd1 24-Oct-2002 Kirk McKusick <mckusick@FreeBSD.org>

Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.


# 7d2eb23f 12-Oct-2002 Jeff Roberson <jeff@FreeBSD.org>

- Remove the do { } while(0) from the VOP lock assert macros. This was
not optimized away by the compiler in time for it to still leave the VOP
functions as inlines.

Submitted by: bde


# 5c5f6cfa 28-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Don't use unnamed anonymous structs: give it a name.


# ca916247 27-Sep-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Rename struct specinfo to the more appropriate struct cdev.

Agreed on: jake, rwatson, jhb


# 6423c943 25-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move ASSERT_VOP_*LOCK* functionality into functions in vfs_subr.c
- Make the VI asserts more orthogonal to the rest of the asserts by using a
new, common vfs_badlock() function and adding a 'str' arg.
- Adjust generated ASSERTS to match the new prototype.
- Adjust explicit ASSERTS to match the new prototype.


# 6cb8bf20 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Lock down the syncer with sync_mtx.
- Enable vfs_badlock_mutex by default.
- Assert that the vp is locked in VOP_UNLOCK.
- Use standard interlock macros in remaining code.
- Correct a race in getnewvnode().
- Lock access to v_numoutput with interlock.
- Lock access to buf lists and splay tree with interlock.
- Add VOP and VI asserts.
- Lock b_vnbufs with the vnode interlock.
- Add vrefcnt() for callers who want to retreive the vnode ref without
holding a lock. Add a comment that describes when this is safe.
- Add vholdl() and vdropl() so that callers who already own the interlock
can avoid race conditions and unnecessary unlocking.
- Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
- Hold the interlock before droping the mntlist_mtx in vflush() to avoid
a race.
- Fix locking in vfs_msync().


# 9a0a3322 24-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Finish the struct vnode lock annotation.
- Order fields by what lock is required to access them.


# cc8e7533 23-Sep-2002 Jeff Roberson <jeff@FreeBSD.org>

- Include sys/ktr.h so that vnode_if.h can define trace points.


# 06be2aaa 14-Sep-2002 Nate Lawson <njl@FreeBSD.org>

Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)


# bb77b793 10-Sep-2002 Bruce Evans <bde@FreeBSD.org>

Fixed namespace pollution in uma changes:
- use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.
- don't include <sys/uma.h>.
Namespace pollution makes "opaque" types like uma_zone_t perfectly
non-opaque. Such types should never be used (see style(9)).

"Fixed" subsequently grown dependencies of this header on its own
pollution by polluting explicitly:
- include <sys/mutex.h> and its prerequisite <sys/lock.h> instead of
depending on namespace pollution 2 layers deep in <sys/uma.h>.


# 8f19eb88 01-Sep-2002 Ian Dowse <iedowse@FreeBSD.org>

Split out a number of mostly VFS and signal related syscalls into
a kernel-internal kern_*() version and a wrapper that is called via
the syscall vector table. For paths and structure pointers, the
internal version either takes a uio_seg parameter or requires the
caller to copyin() the data to kernel memory as appropiate. This
will permit emulation layers to use these syscalls without having
to copy out translated arguments to the stack gap.

Discussed on: -arch
Review/suggestions: bde, jhb, peter, marcel


# 71ea4ba5 21-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED
- Use the new VI asserts in place of the old mtx_assert checks.
- Add the VI asserts to the automated lock checking in the VOP calls. The
interlock should not be held across vops with a few exceptions.
- Add the vop_(un)lock_{pre,post} functions to assert that interlock is held
when LK_INTERLOCK is set.


# ea6027a8 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 9ca43589 15-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 01abbb42 13-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Move to a nested include of _label.h instead of mac.h in sys/sys/*.h
(Most of the places where mac.h was recursively included from another
kernel header file. net/netinet to follow.)

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Suggested by: bde


# 62c0c263 11-Aug-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce IO_NOMACCHECK, a flag that will be passed to vn_rdwr() to
indicate that the calling code has already performed necessary MAC
checks (if any) for this operation. This flag will help resolve
layering problems that existing because vn_rdwr() is called both
on behalf of user processes directly (such as in system calls of
various sorts, during core dumps, etc), as well as deep in the file
system code on behalf of the file system (such as in UFS, ext2fs,
etc). Code that is acting on behalf of a kernel service rather
than explicitly on behalf of a user process will specify this flag.
By default, MAC checks will be performed (and generally should
be performed).

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 780bb93d 07-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Adjust locking markup to match the proc markup.
- Add a comment about the current, unfinished, state of vnode locking.

Suggested by: bde


# 973a3f4b 05-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Document more of the struct vnode locking protocol.
- Slightly reformat a comment block.


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 4eee8de7 30-Jul-2002 Dag-Erling Smørgrav <des@FreeBSD.org>

Introduce struct xvnode, which will be used instead of struct vnode for
sysctl purposes. Also add two fields to struct vnode, v_cachedfs and
v_cachedid, which hold the vnode's device and file id and are filled in
by vn_open_cred() and vn_stat().

Sponsored by: DARPA, NAI Labs


# f3cfa607 30-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Begin committing support for Mandatory Access Control and extensible
kernel access control. The MAC framework permits loadable kernel
modules to link to the kernel at compile-time, boot-time, or run-time,
and augment the system security policy. This commit includes the
initial kernel implementation, although the interface with the userland
components of the oeprating system is still under work, and not all
kernel subsystems are supported. Later in this commit sequence,
documentation of which kernel subsystems will not work correctly with
a kernel compiled with MAC support will be added.

Label vnodes, permitting security information to maintained at the
granularity of the individual file, directory (et al). This data is
protected by the vnode lock and may be read only when holding a shared
lock, or modified only when holding an exclusive lock. Label
information may be considered either the primary copy, or a cached
copy. Individual file systems or kernel services may use the
VCACHEDLABEL flag for accounting purposes to determine which it is.
New VOPs will be introduced to refresh this label on demand, or to
set the label value.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# c72f085d 30-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add vfs_badlock_{print,panic} support to the remaining VOP_ASSERT_*
macros.


# 3b82505b 29-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add VBAD to the list of vnodes that are ignored on locking operations.


# 586261a3 27-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Reserve VCACHEDLABEL vnode flag for use by the TrustedBSD MAC
implementation. This flag will indicate that the security label
in the vnode is currently valid, and therefore doesn't need to
be refreshed before an access control decision can be made. Most
file systems (or stdvops) will set this flag after they load the
MAC label from disk the first time to prevent redundant disk I/O;
some file synthetic file systems (procfs, for example) may not.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# ccac0940 21-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Add VALLPERM, which is a mask of all the access control request permission
bits for vnodes passed to vaccess() and friends.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 230d10f6 21-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Sort vnode access mode flags.
Add flags VSTAT, VAPPEND required for TrustedBSD.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# 7aca6291 19-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

if (ioflags & IO_SYNC)
flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by: DARPA & NAI Labs.


# faab4e27 16-Jul-2002 Kirk McKusick <mckusick@FreeBSD.org>

Change the name of st_createtime to st_birthtime. This change is
made to reduce confusion between st_ctime and st_createtime.

Submitted by: Eric Allman <eric@sendmail.org>
Sponsored by: DARPA & NAI Labs.


# d331c5d4 10-Jul-2002 Matthew Dillon <dillon@FreeBSD.org>

Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.

Disadvantages
Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc


# f00501bd 09-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Remove IS_LOCKING_VFS() all of our filesystems support locking now
- Add IGNORE_LOCK() that only ignores VCHR files for now since no one locks
their underlying device in the leaf filesystems. (devvp)
- Add prototypes for vop_lookup_{pre,post} that I forgot before.


# 9725e58e 07-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- VT_PSEUDOFS and VT_PROCFS support locking now
- Remove VBLK from the list of vtypes that are ignored for locking ops.


# 302c7aaa 05-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

- Add vop_strategy_pre to validate VOP_STRATEGY locking.
- Disable original vop_strategy lock specification.
- Switch to the new vop_strategy_pre for lock validation.

VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need
to be translated. There may be other reasons, but as long as the underlying
layer uses a VOP to perform the operations they will be caught later.


# cc8662b0 05-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Add "vop_rename_pre" to do pre rename lock verification. This is enabled only
with DEBUG_VFS_LOCKS.


# d0641e4e 04-Jul-2002 Jeff Roberson <jeff@FreeBSD.org>

Cleanups for vnode lock debugging.
- Tell IS_LOCKING_VFS to ignore block and character devices. specfs vnodes
aren't locked for io and they just generate lots of false positives.
- Add newlines to the badlock prints.


# 6bd521df 01-Jul-2002 Ian Dowse <iedowse@FreeBSD.org>

Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick


# 90769c9e 28-Jun-2002 Jeff Roberson <jeff@FreeBSD.org>

Improve the VOP locking asserts

- Add vfs_badlock_print to control whether or not we print lock violations
- Add vfs_badlock_panic to control whether we panic on lock violations

Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.


# 1c85e6a3 21-Jun-2002 Kirk McKusick <mckusick@FreeBSD.org>

This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by: Poul-Henning Kamp <phk@freebsd.org>


# d394511d 16-May-2002 Tom Rhodes <trhodes@FreeBSD.org>

More s/file system/filesystem/g


# 84f9ed84 29-Apr-2002 Robert Watson <rwatson@FreeBSD.org>

Since devfs now uses vnode locks, add devfs back to IS_LOCKING_VFS.


# ae6b9ab0 25-Apr-2002 Robert Watson <rwatson@FreeBSD.org>

Add UDF to the list of filesystems where locking assertions should be
evaluated.

Approved by: scottl


# dbb17987 25-Apr-2002 Robert Watson <rwatson@FreeBSD.org>

1.43 (dfr 04-Apr-97): /*
1.43 (dfr 04-Apr-97): * [dfr] Kludge until I get around to fixing all the vfs locking.
1.43 (dfr 04-Apr-97): */

The new devfs doesn't support VFS locking. So don't do locking
assertions for devfs vnodes.

With this change, a kernel with options DEBUG_VFS_LOCKS actually
gets to single-user mode.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# df263cbd 14-Apr-2002 Scott Long <scottl@FreeBSD.org>

Add a filesystem driver for the Universal Disk Format. For more info,
see http://people.freebsd.org/~scottl/udf

MFC after: when asmodai gets the backport done
Prodded by: phk asmodai des


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 789f12fe 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# 68edc1b9 18-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>


# 4b55dbe3 17-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.


# e8b26e99 17-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Collect the VN_KNOTE() macro definitions on vnode.h


# 038b6417 17-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

v_lease is unused, zap it.


# 079b7bad 07-Feb-2002 Julian Elischer <julian@FreeBSD.org>

Pre-KSE/M3 commit.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,


# 23b59018 20-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release


# fdb33f08 18-Dec-2001 Matthew Dillon <dillon@FreeBSD.org>

This is a forward port of Peter's vlrureclaim() fix, with some minor mods
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job then the original code.

Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days


# f03e89de 11-Nov-2001 Alfred Perlstein <alfred@FreeBSD.org>

turn vn_open() into a wrapper around vn_open_cred() which allows
one to perform a vn_open using temporary/other/fake credentials.

Modify the nfs client side locking code to use vn_open_cred() passing
proc0's ucred instead of the old way which was to temporary raise
privs while running vn_open(). This should close the race hopefully.


# 7e76bb56 05-Nov-2001 Matthew Dillon <dillon@FreeBSD.org>

Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write. This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use. The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after: 1 week


# 4ffa210b 27-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's,
and can also be made static.


# 245df27c 25-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week


# c72ccd01 22-Oct-2001 Matthew Dillon <dillon@FreeBSD.org>

Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days


# 45fb069a 21-Oct-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

Convert textvp_fullpath() into the more generic vn_fullpath() which takes a
struct thread * and a struct vnode * instead of a struct proc *.

Temporarily add a textvp_fullpath macro for compatibility.


# b5810bab 30-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

After extensive testing it has been determined that adding complexity
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.

I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.

MFC after: 3 days


# 2af8d76d 13-Sep-2001 David E. O'Brien <obrien@FreeBSD.org>

Re-apply rev 1.178 -- style(9) the structure definitions.
I have to wonder how many other changes were lost in the KSE mildstone 2 merge.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 06ae1e91 08-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

This brings in a Yahoo coredump patch from Paul, with additional mods by
me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that
if you have large process images core dumping, or you have a large number of
forked processes all core dumping at the same time, the original coredump code
would leave the vnode locked throughout. This can cause the directory vnode
to get locked up, which can cause the parent directory vnode to get locked
up, and so on all the way to the root node, locking the entire machine up
for extremely long periods of time.

This patch solves the problem in two ways. First it uses an advisory
non-blocking lock to abort multiple processes trying to core to the same
file. Second (my contribution) it chunks up the writes and uses bwillwrite()
to avoid holding the vnode locked while blocking in the buffer cache.

Submitted by: ps
Reviewed by: dillon
MFC after: 2 weeks


# 0f728902 27-Aug-2001 Peter Wemm <peter@FreeBSD.org>

If a file has been completely unlinked, stop automatically syncing the
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.


# 753d4978 29-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Remove MFS


# ac8f990b 24-May-2001 Matthew Dillon <dillon@FreeBSD.org>

This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%. For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by: tegge, dillon


# 0864ef1e 16-May-2001 Ian Dowse <iedowse@FreeBSD.org>

Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp


# a62615e5 01-May-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Implement vop_std{get|put}pages() and add them to the default vop[].

Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# b7ebffbc 29-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

Add a vop_stdbmap(), and make it part of the default vop vector.

Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.


# 1690d305 22-Apr-2001 David E. O'Brien <obrien@FreeBSD.org>

Removed old version of vaccess_acl_posix1e() that snuck back in rev 1.146.

Submitted by (with good eye): Niels Chr. Bank-Pedersen <ncbp@bank-pedersen.dk>


# ea88c01d 21-Apr-2001 David E. O'Brien <obrien@FreeBSD.org>

Style(9) fixes:
* get rid of space (0x20) before tab (^I)
* indent with ^I, not 0x20
* continuation line for prototypes is for 0x20's past function's name col.
* etc.


# 759cb263 18-Apr-2001 Seigo Tanimura <tanimura@FreeBSD.org>

Reclaim directory vnodes held in namecache if few free vnodes are
available.

Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.

The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).

Suggested by: tegge
Approved by: phk


# f84e29a0 17-Apr-2001 Poul-Henning Kamp <phk@FreeBSD.org>

This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.


# b114e127 16-Apr-2001 Robert Watson <rwatson@FreeBSD.org>

In my first reading of POSIX.1e, I misinterpreted handling of the
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.

o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
associated with the vnode, as those can no longer be extracted from
the ACL passed as an argument. Perform all comparisons against
the passed arguments. This actually has the effect of simplifying
a number of components of this call, as well as reducing the indent
level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP.

o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
other than ACL_UNDEFINED_ID. As a temporary work-around to allow
clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
each check so that this cannot cause a failure in the short term
(this work-around will be removed when the userland libraries and
utilities are updated to take this change into account).

o Modify ufs_sync_acl_from_inode() so that it forces
ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
when synchronizing the ACL from the inode.

o Modify ufs_sync_inode_from_acl to not propagate uid and gid
information to the inode from the ACL during ACL update. Also
modify the masking of permission bits that may be set from
ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT).

o Modify ufs_getacl() so that when it emulates an access ACL from
the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.

o Clean up ufs_setacl() substantially since it is no longer possible
to perform chown/chgrp operations using vop_setacl(), so all the
access control for that can be eliminated.

o Modify ufs_access() so that it passes owner uid and gid information
into vaccess_acl_posix1e().

Pointed out by: jedger
Obtained from: TrustedBSD Project


# 0fdabd3a 13-Apr-2001 Boris Popov <bp@FreeBSD.org>

Move VT_SMBFS definition to the proper place. Undefine VI_LOCK/VI_UNLOCK.


# d67fe1bd 28-Mar-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

Prepare for pseudofs.


# 30632071 18-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by: jkh
Obtained from: TrustedBSD Project


# 70f36851 14-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
character namespace indicator. This is in line with more recent
thinking on EA interfaces on various mailing lists, including the
posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces
are defined by default, EXTATTR_NAMESPACE_SYSTEM and
EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
access control model: user EAs are accessible based on the normal
MAC and DAC file/directory protections, and system attributes are
limited to kernel-originated or appropriately privileged userland
requests.

o These API changes occur at several levels: the namespace argument is
introduced in the extattr_{get,set}_file() system call interfaces,
at the vnode operation level in the vop_{get,set}extattr() interfaces,
and in the UFS extended attribute implementation. Changes are also
introduced in the VFS extattrctl() interface (system call, VFS,
and UFS implementation), where the arguments are modified to include
a namespace field, as well as modified to advoid direct access to
userspace variables from below the VFS layer (in the style of recent
changes to mount by adrian@FreeBSD.org). This required some cleanup
and bug fixing regarding VFS locks and the VFS interface, as a vnode
pointer may now be optionally submitted to the VFS_EXTATTRCTL()
call. Updated documentation for the VFS interface will be committed
shortly.

o In the near future, the auto-starting feature will be updated to
search two sub-directories to the ".attribute" directory in appropriate
file systems: "user" and "system" to locate attributes intended for
those namespaces, as the single filename is no longer sufficient
to indicate what namespace the attribute is intended for. Until this
is committed, all attributes auto-started by UFS will be placed in
the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
been updated to no longer include the '$' in their filename. As such,
if you're using these features, you'll need to rename the attribute
backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
be committed shortly. These include modifications to the extended
attribute utilities, as well as to libutil for new namespace
string conversion routines. Once the matching userland changes are
committed, a buildworld is recommended to update all the necessary
include files and verify that the kernel and userland environments
are in sync. Note: If you do not use extended attributes (most people
won't), upgrading is not imperative although since the system call
API has changed, the new userland extended attribute code will no longer
compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
conditional on FFS_EXTATTR, which should recover a bit of space on
kernels running without EA's, as well as update copyright dates.

Obtained from: TrustedBSD Project


# 5293465f 06-Mar-2001 Robert Watson <rwatson@FreeBSD.org>

o Introduce filesystem-independent POSIX.1e ACL utility routines to
support implementations of ACLs in file systems. Introduce the
following new functions:

vaccess_acl_posix1e() vaccess() that accepts an ACL
acl_posix1e_mode_to_perm() Convert mode bits to ACL rights
acl_posix1e_mode_to_entry() Build ACL entry from mode/uid/gid
acl_posix1e_perms_to_mode() Generate file mode from ACL
acl_posix1e_check() Syntax verification for ACL

These functions allow a file system to rely on central ACL evaluation
and syntax checking, as well as providing useful utilities to
allow ACL-based file systems to generate mode/owner/etc information
to return via VOP_GETATTR(), and to support file systems that split
their ACL information over their existing inode storage (mode, uid,
gid) and extended ACL into extended attributes (additional users,
groups, ACL mask).

o Add prototypes for exported functions to sys/acl.h, sys/vnode.h

Reviewed by: trustedbsd-discuss, freebsd-arch
Obtained from: TrustedBSD Project


# dfe53f15 21-Feb-2001 Boris Popov <bp@FreeBSD.org>

Add VI_LOCK(), VI_TRYLOCK() and VI_UNLOCK() macros to isolate implementation
details of v_interlock.

Reviewed by: jhb, phk, arch@


# 1b367556 23-Jan-2001 Jason Evans <jasone@FreeBSD.org>

Convert all simplelocks to mutexes and remove the simplelock implementations.


# 0a2c3d48 08-Jan-2001 Garrett Wollman <wollman@FreeBSD.org>

select() DKI is now in <sys/selinfo.h>.


# 2b6b0df7 26-Dec-2000 Matthew Dillon <dillon@FreeBSD.org>

This implements a better launder limiting solution. There was a solution
in 4.2-REL which I ripped out in -stable and -current when implementing the
low-memory handling solution. However, maxlaunder turns out to be the saving
grace in certain very heavily loaded systems (e.g. newsreader box). The new
algorithm limits the number of pages laundered in the first pageout daemon
pass. If that is not sufficient then suceessive will be run without any
limit.

Write I/O is now pipelined using two sysctls, vfs.lorunningspace and
vfs.hirunningspace. This prevents excessive buffered writes in the
disk queues which cause long (multi-second) delays for reads. It leads
to more stable (less jerky) and generally faster I/O streaming to disk
by allowing required read ops (e.g. for indirect blocks and such) to occur
without interrupting the write stream, amoung other things.

NOTE: eventually, filesystem write I/O pipelining needs to be done on a
per-device basis. At the moment it is globalized.


# 936524aa 18-Nov-2000 Matthew Dillon <dillon@FreeBSD.org>

Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather then block they instead continue to operate, then return resources
to the memory pool instead of cache them or leave them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmatic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking
op.

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather then continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>


# 35e0e5b3 20-Oct-2000 John Baldwin <jhb@FreeBSD.org>

Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h


# 47460a23 19-Oct-2000 Robert Watson <rwatson@FreeBSD.org>

o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform
"administrative" authorization checks. In most cases, the VADMIN test
checks to make sure the credential effective uid is the same as the file
owner.
o Modify vaccess() to set VADMIN as an available right if the uid is
appropriate.
o Modify references to uid-based access control operations such that they
now always invoke VOP_ACCESS() instead of using hard-coded policy checks.
o This allows alternative UFS policies to be implemented by replacing only
ufs_access() (such as mandatory system policies).
o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the
vnode: I believe that new invocations of VOP_ACCESS() are always called
with the lock held.
o Some direct checks of the uid remain, largely associated with the QUOTA
and SUIDDIR code.

Reviewed by: eivind
Obtained from: TrustedBSD Project


# a18b1f1d 03-Oct-2000 Jason Evans <jasone@FreeBSD.org>

Convert lockmgr locks from using simple locks to using mutexes.

Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.


# 67e87166 25-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by: mckusick


# 100d2c18 22-Sep-2000 Robert Watson <rwatson@FreeBSD.org>

o Introduce vn_extattr_rm(), a helper function in the style of
vn_extattr_get() and vn_extattr_set(). vn_extattr_rm() removes the
specified extended attribute from a vnode, authorizing the change as
the kernel (NULL cred).

Obtained from: TrustedBSD Project


# 210a5403 22-Sep-2000 Eivind Eklund <eivind@FreeBSD.org>

Remove addalias() prototype (staticized in kern/vfs_subr.c)


# 9ff5ce6b 12-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT.
They will be used by nullfs and other stacked filesystems to support full
cache coherency.

Reviewed in general by: mckusick, dillon


# 64dc16df 05-Sep-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Move extern declaration of dead_vnodeop_p to a .h file.

Remove race condition in vn_isdisk().


# 012c643d 29-Aug-2000 Robert Watson <rwatson@FreeBSD.org>

o Restructure vaccess() so as to check for DAC permission to modify the
object before falling back on privilege. Make vaccess() accept an
additional optional argument, privused, to determine whether
privilege was required for vaccess() to return 0. Add commented
out capability checks for reference. Rename some variables to make
it more clear which modes/uids/etc are associated with the object,
and which with the access mode.
o Update file system use of vaccess() to pass NULL as the optional
privused argument. Once additional patches are applied, suser()
will no longer set ASU, so privused will permit passing of
privilege information up the stack to the caller.

Reviewed by: bde, green, phk, -security, others
Obtained from: TrustedBSD Project


# e39c53ed 20-Aug-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Centralize the canonical vop_access user/group/other check in vaccess().

Discussed with: bde


# 39f70682 18-Aug-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vop_stdinactive() and make it the default if no vop_inactive
is declared.

Sort and prune a few vop_op[].


# e6a9ab52 08-Aug-2000 Robert Watson <rwatson@FreeBSD.org>

o Introduce vn_extattr_{get,set}, wrapper routines for VOP_GETEXTATTR
and VOP_SETEXTATTR to simplify calling from in-kernel consumers,
such as capability code. Both accept a vnode (optionally locked,
with ioflg to indicate that), attribute name, and a buffer + buffer
length in UIO_SYSSPACE. Both authorize the call as a kernel request,
with cred set to NULL for the actual VOP_ calls.

Obtained from: TrustedBSD Project


# 9b971133 23-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.


# f2a2857b 11-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).


# c904bbbd 03-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Simplify and rationalise the management of the vnode free list
(preparing the code to add snapshots).


# e6796b67 03-Jul-2000 Kirk McKusick <mckusick@FreeBSD.org>

Move the truncation code out of vn_open and into the open system call
after the acquisition of any advisory locks. This fix corrects a case
in which a process tries to open a file with a non-blocking exclusive
lock. Even if it fails to get the lock it would still truncate the
file even though its open failed. With this change, the truncation
is done only after the lock is successfully acquired.

Obtained from: BSD/OS


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# b7db1901 26-Apr-2000 Brian Feldman <green@FreeBSD.org>

Move procfs_fullpath() to vfs_cache.c, with a rename to textvp_fullpath().
There's no excuse to have code in synthetic filestores that allows direct
references to the textvp anymore.

Feature requested by: msmith
Feature agreed to by: warner
Move requested by: phk
Move agreed to by: bde


# 8a2852b1 21-Apr-2000 Brian Feldman <green@FreeBSD.org>

Move the declaration of "struct namecache" to vnode.h, as it can be useful
elsewhere. Note, of course, that in an ideal world nothing should need
to see our VFS implementation :-/


# e4649cfa 01-Apr-2000 Matthew Dillon <dillon@FreeBSD.org>

Change the write-behind code to take more care when starting
async I/O's. The sequential read heuristic has been extended to
cover writes as well. We continue to call cluster_write() normally,
thus blocks in the file will still be reallocated for large (but still
random) I/O's, but I/O will only be initiated for truely sequential
writes.

This solves a number of annoying situations, especially with DBM (hash
method) writes, and also has the side effect of fixing a number of
(stupid) benchmarks.

Reviewed-by: mckusick


# b7a5f3ca 02-Feb-2000 Robert Watson <rwatson@FreeBSD.org>

Remove static qualifier from vgonel, as it is needed by the Arla folk
outside of vfs_subr.c.

Submitted by: Assar Westerlund <assar@sics.se>
Reviewed by: rwatson
Approved by: jkh


# 6cca21b1 15-Jan-2000 Boris Popov <bp@FreeBSD.org>

Add VT_NWFS tag.


# ba4ad1fc 09-Jan-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Give vn_isdisk() a second argument where it can return a suitable errno.

Suggested by: bde


# 664a31e4 28-Dec-1999 Peter Wemm <peter@FreeBSD.org>

Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.


# 91f37dcb 18-Dec-1999 Robert Watson <rwatson@FreeBSD.org>

Second pass commit to introduce new ACL and Extended Attribute system
calls, vnops, vfsops, both in /kern, and to individual file systems that
require a vfsop_ array entry.

Reviewed by: eivind


# 1e12157c 12-Dec-1999 Alfred Perlstein <alfred@FreeBSD.org>

explain that ioflags can be used to give read-ahead hints to the underlying
filesystem.


# 6bdfe06a 11-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
return value: LK_EXCLOTHER, when the lock is held exclusively by another
process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with: grog, mch, peter, phk
Reviewed by: peter


# 0026ddba 09-Dec-1999 Semen Ustimenko <semenu@FreeBSD.org>

Added VT_HPFS vnode type.


# aa4f4b69 04-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Move the buffered read/write code out of spec_{read|write} and into
two new functions spec_buf{read|write}.

Add sysctl vfs.bdev_buffered which defaults to 1 == true. This
sysctl can be used to experimentally turn buffered behaviour for
bdevs off. I should not be changed while any blockdevices are
open. Remove the misplaced sysctl vfs.enable_userblk_io.

No other changes in behaviour.


# 1b5464ef 29-Sep-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Remove v_maxio from struct vnode.

Replace it with mnt_iosize_max in struct mount.

Nits from: bde


# 40360b1b 20-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Final commit to remove vnode->v_lastr. vm_fault now handles read
clustering issues (replacing code that used to be in
ufs/ufs/ufs_readwrite.c). vm_fault also now uses the new VM page counter
inlines.

This completes the changeover from vnode->v_lastr to vm_entry_t->v_lastr
for VM, and fp->f_nextread and fp->f_seqcount (which have been in the
tree for a while). Determination of the I/O strategy (sequential, random,
and so forth) is now handled on a descriptor-by-descriptor basis for
base I/O calls, and on a memory-region-by-memory-region and
process-by-process basis for VM faults.

Reviewed by: David Greenman <dg@root.com>, Alan Cox <alc@cs.rice.edu>


# bb01f28e 17-Sep-1999 Matthew Dillon <dillon@FreeBSD.org>

Add vfs.enable_userblk_io sysctl to control whether user reads and writes
to buffered block devices are allowed. The default is to be backwards
compatible, i.e. reads and writes are allowed.

The idea is for a larger crowd to start running with this disabled and
see what problems, if any, crop up, and then to change the default to
off and see if any problems crop up in the next 6 months prior to
potentially removing support entirely. There are still a few people,
Julian and myself included, who believe the buffered block device
access from usermode to be useful.

Remove use of vnode->v_lastr from buffered block device I/O in
preparation for removal of vnode->v_lastr field, replacing it with
the already existing seqcount metric to detect sequential operation.

Reviewed by: Alan Cox <alc@cs.rice.edu>, David Greenman <dg@root.com>


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# dbafb366 26-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify the handling of VCHR and VBLK vnodes using the new dev_t:

Make the alias list a SLIST.

Drop the "fast recycling" optimization of vnodes (including
the returning of a prexisting but stale vnode from checkalias).
It doesn't buy us anything now that we don't hardlimit
vnodes anymore.

Rename checkalias2() and checkalias() to addalias() and
addaliasu() - which takes dev_t and udev_t arg respectively.

Make the revoke syscalls use vcount() instead of VALIASED.

Remove VALIASED flag, we don't need it now and it is faster
to traverse the much shorter lists than to maintain the
flag.

vfs_mountedon() can check the dev_t directly, all the vnodes
point to the same one.

Print the devicename in specfs/vprint().

Remove a couple of stale LFS vnode flags.

Remove unimplemented/unused LK_DRAINED;


# 41d2e3e0 24-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness.


# 0ff7b13a 24-Aug-1999 Julian Elischer <julian@FreeBSD.org>

Make DEVFS use PHK's specinfo struct as the source of dev_t and devsw.

In lookup() however it's the other way around as we need to supply the
dev_t for the vnode, so devfs still has a copy of it stashed away.

Sourcing it from the vnode in the vnops however is useful as it makes
a lot of the code almost the same as that in specfs.


# a2801b77 21-Aug-1999 John Polstra <jdp@FreeBSD.org>

Support full-precision file timestamps. Until now, only the seconds
have been maintained, and that is still the default. A new sysctl
variable "vfs.timestamp_precision" can be used to enable higher
levels of precision:

0 = seconds only; nanoseconds zeroed (default).
1 = seconds and nanoseconds, accurate within 1/HZ.
2 = seconds and nanoseconds, truncated to microseconds.
>=3 = seconds and nanoseconds, maximum precision.

Level 1 uses getnanotime(), which is fast but can be wrong by up
to 1/HZ. Level 2 uses microtime(). It might be desirable for
consistency with utimes() and friends, which take timeval structures
rather than timespecs. Level 3 uses nanotime() for the higest
precision.

I benchmarked levels 0, 1, and 3 by copying a 550 MB tree with
"cpio -pdu". There was almost negligible difference in the system
times -- much less than 1%, and less than the variation among
multiple runs at the same level. Bruce Evans dreamed up a torture
test involving 1-byte reads with intervening fstat() calls, but
the cpio test seems more realistic to me.

This feature is currently implemented only for the UFS (FFS and
MFS) filesystems. But I think it should be easy to support it in
the others as well.

An earlier version of this was reviewed by Bruce. He's not to
blame for any breakage I've introduced since then.

Reviewed by: bde (an earlier version of the code)


# 4d4f9323 13-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

s/v_specinfo/v_rdev/


# 0ef1c826 08-Aug-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.


# 67452993 26-Jul-1999 Alan Cox <alc@FreeBSD.org>

Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by: dillon


# 698bfad7 20-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Now a dev_t is a pointer to struct specinfo which is shared by all specdev
vnodes referencing this device.

Details:
cdevsw->d_parms has been removed, the specinfo is available
now (== dev_t) and the driver should modify it directly
when applicable, and the only driver doing so, does so:
vn.c. I am not sure the logic in checking for "<" was right
before, and it looks even less so now.

An intial pool of 50 struct specinfo are depleted during
early boot, after that malloc had better work. It is
likely that fewer than 50 would do.

Hashing is done from udev_t to dev_t with a prime number
remainder hash, experiments show no better hash available
for decent cost (MD5 is only marginally better) The prime
number used should not be close to a power of two, we use
83 for now.

Add new checkalias2() to get around the loss of info from
dev2udev() in bdevvp();

The aliased vnodes are hung on a list straight of the dev_t,
and speclisth[SPECSZ] is unused. The sharing of struct
specinfo means that the v_specnext moves into the vnode
which grows by 4 bytes.

Don't use a VBLK dev_t which doesn't make sense in MFS, now
we hang a dummy cdevsw on B/Cmaj 253 so that things look sane.

Storage overhead from all of this is O(50k).

Bump __FreeBSD_version to 400009

The next step will add the stuff needed so device-drivers can start to
hang things from struct specinfo


# 6ca54864 18-Jul-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Introduce the vn_todev(struct vnode*) function, which returns the dev_t
corresponding to a VBLK or VCHR node, or NODEV.


# b008f92e 28-Jun-1999 Poul-Henning Kamp <phk@FreeBSD.org>

make va_fsid be of type udev_t


# e4ab40bc 15-Jun-1999 Kirk McKusick <mckusick@FreeBSD.org>

Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.


# bfbb9ce6 11-May-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to a actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few houskeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I belive this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).


# e9189611 17-Apr-1999 Peter Wemm <peter@FreeBSD.org>

Well folks, this is it - The second stage of the removal for build support
for LKM's..


# cd0fb97e 19-Feb-1999 Matthew Dillon <dillon@FreeBSD.org>

Make worklist add function a static, remove from sys/vnode.h


# 52e93a35 02-Feb-1999 Semen Ustimenko <semenu@FreeBSD.org>

Added vnode tag for NTFS.
Reviewed by: David O'Brien <obrien@NUXI.com>


# 5a24726b 28-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Clarify the SYSINIT problem by breaking SYSINIT's up into a void *
version and a const void * version. Currently the const void * version
simply calls the void * version ( i.e. no 'fix' is in place ).

A solution needs to be found for the C_SYSINIT ( etc...) family of
macros that allows const void * without generating a warning, but
does not allow non-const void *.


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 15a1057c 20-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Add 'options DEBUG_LOCKS', which stores extra information in struct
lock, and add some macros and function parameters to make sure that
the information get to the point where it can be put in the lock
structure.

While I'm here, add DEBUG_VFS_LOCKS to LINT.


# fb116777 05-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Remove the 'waslocked' parameter to vfs_object_create().


# 4e61198e 10-Nov-1998 Peter Wemm <peter@FreeBSD.org>

Make the vnode opv vector construction fully dynamic. Previously we
leaked memory on each unload and were limited to items referenced in
the kernel copy of vnode_if.c. Now a kernel module is free to create
it's own VOP_FOO() routines and the rest of the system will happily
deal with it, including passthrough layers like union/umap/etc.

Have VFS_SET() call a common vfs_modevent() handler rather than
inline duplicating the common code all over the place.

Have VNODEOP_SET() have the vnodeops removed at unload time (assuming a
module) so that the vop_t ** vector is reclaimed.

Slightly adjust the vop_t ** vectors so that calling slot 0 is a panic
rather than a page fault. This could happen if VOP_something() was called
without *any* handlers being present anywhere (including in vfs_default.c).
slot 1 becomes the default vector for the vnodeop table.

TODO: reclaim zones on unload (eg: nfs code)


# 630ff663 31-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Convert the vnode clean/dirty attached buffer lists from LISTs to TAILQs.
Add a new flags field (we get this for free because of struct packing)
for cleaner management of tailq membership.
We had two spare b_flags slots, but they are a precious resource and may
be needed for other things that are related to other b_flags bits. The two
new flags are convenient to use in a seperate location.

Reviewed (in principle) by: dg
Obtained from: John Dyson's old work-in-progress


# 20f02ef5 29-Oct-1998 Peter Wemm <peter@FreeBSD.org>

Remove the V_SAVEMETA flag, nothing uses it any more now that msdosfs and
ext2fs call vtruncbuf() directly. This simplifies and cleans up
vinvalbuf() a little.


# aa855a59 15-Oct-1998 Peter Wemm <peter@FreeBSD.org>

*gulp*. Jordan specifically OK'ed this..

This is the bulk of the support for doing kld modules. Two linker_sets
were replaced by SYSINIT()'s. VFS's and exec handlers are self registered.
kld is now a superset of lkm. I have converted most of them, they will
follow as a seperate commit as samples.
This all still works as a static a.out kernel using LKM's.


# 9afcea2f 11-Sep-1998 Robert V. Baron <rvb@FreeBSD.org>

All the references to cfs, in symbols, structs, and strings
have been changed to coda. (Same for CFS.)


# a58915fc 09-Sep-1998 Tor Egge <tegge@FreeBSD.org>

Don't keep the underlying directory locked while performing the file
system specific VFS_MOUNT operation.
PR: 1067


# 4c162420 26-Aug-1998 Jordan K. Hubbard <jkh@FreeBSD.org>

Add VT_CFS type.
Submitted by: Robert Baron <rvb@sicily.odyssey.cs.cmu.edu>


# 1f562172 10-May-1998 John Dyson <dyson@FreeBSD.org>

Fix the futimes/undelete/utrace conflict with other BSD's. Note that
the only common usage of utrace (the possible problem with this
commit) is with malloc, so this should be a real problem. Add
the various NetBSD syscalls that allow full emulation of their
development environment.


# 08637435 28-Mar-1998 Bruce Evans <bde@FreeBSD.org>

Moved some #includes from <sys/param.h> nearer to where they are actually
used.


# bef608bd 15-Mar-1998 John Dyson <dyson@FreeBSD.org>

Some VM improvements, including elimination of alot of Sig-11
problems. Tor Egge and others have helped with various VM bugs
lately, but don't blame him -- blame me!!!

pmap.c:
1) Create an object for kernel page table allocations. This
fixes a bogus allocation method previously used for such, by
grabbing pages from the kernel object, using bogus pindexes.
(This was a code cleanup, and perhaps a minor system stability
issue.)

pmap.c:
2) Pre-set the modify and accessed bits when prudent. This will
decrease bus traffic under certain circumstances.

vfs_bio.c, vfs_cluster.c:
3) Rather than calculating the beginning virtual byte offset
multiple times, stick the offset into the buffer header, so
that the calculated offset can be reused. (Long long multiplies
are often expensive, and this is a probably unmeasurable performance
improvement, and code cleanup.)

vfs_bio.c:
4) Handle write recursion more intelligently (but not perfectly) so
that it is less likely to cause a system panic, and is also
much more robust.

vfs_bio.c:
5) getblk incorrectly wrote out blocks that are incorrectly sized.
The problem is fixed, and writes blocks out ONLY when B_DELWRI
is true.

vfs_bio.c:
6) Check that already constituted buffers have fully valid pages. If
not, then make sure that the B_CACHE bit is not set. (This was
a major source of Sig-11 type problems.)

vfs_bio.c:
7) Fix a potential system deadlock due to an incorrectly specified
sleep priority while waiting for a buffer write operation. The
change that I made opens the system up to serious problems, and
we need to examine the issue of process sleep priorities.

vfs_cluster.c, vfs_bio.c:
8) Make clustered reads work more correctly (and more completely)
when buffers are already constituted, but not fully valid.
(This was another system reliability issue.)

vfs_subr.c, ffs_inode.c:
9) Create a vtruncbuf function, which is used by filesystems that
can truncate files. The vinvalbuf forced a file sync type operation,
while vtruncbuf only invalidates the buffers past the new end of file,
and also invalidates the appropriate pages. (This was a system reliabiliy
and performance issue.)

10) Modify FFS to use vtruncbuf.

vm_object.c:
11) Make the object rundown mechanism for OBJT_VNODE type objects work
more correctly. Included in that fix, create pager entries for
the OBJT_DEAD pager type, so that paging requests that might slip
in during race conditions are properly handled. (This was a system
reliability issue.)

vm_page.c:
12) Make some of the page validation routines be a little less picky
about arguments passed to them. Also, support page invalidation
change the object generation count so that we handle generation
counts a little more robustly.

vm_pageout.c:
13) Further reduce pageout daemon activity when the system doesn't
need help from it. There should be no additional performance
decrease even when the pageout daemon is running. (This was
a significant performance issue.)

vnode_pager.c:
14) Teach the vnode pager to handle race conditions during vnode
deallocations.


# b1897c19 08-Mar-1998 Julian Elischer <julian@FreeBSD.org>

Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman)
Submitted by: Kirk McKusick (mcKusick@mckusick.com)
Obtained from: WHistle development tree


# 8f9110f6 07-Mar-1998 John Dyson <dyson@FreeBSD.org>

This mega-commit is meant to fix numerous interrelated problems. There
has been some bitrot and incorrect assumptions in the vfs_bio code. These
problems have manifest themselves worse on NFS type filesystems, but can
still affect local filesystems under certain circumstances. Most of
the problems have involved mmap consistancy, and as a side-effect broke
the vfs.ioopt code. This code might have been committed seperately, but
almost everything is interrelated.

1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that
are fully valid.
2) Rather than deactivating erroneously read initial (header) pages in
kern_exec, we now free them.
3) Fix the rundown of non-VMIO buffers that are in an inconsistent
(missing vp) state.
4) Fix the disassociation of pages from buffers in brelse. The previous
code had rotted and was faulty in a couple of important circumstances.
5) Remove a gratuitious buffer wakeup in vfs_vmio_release.
6) Remove a crufty and currently unused cluster mechanism for VBLK
files in vfs_bio_awrite. When the code is functional, I'll add back
a cleaner version.
7) The page busy count wakeups assocated with the buffer cache usage were
incorrectly cleaned up in a previous commit by me. Revert to the
original, correct version, but with a cleaner implementation.
8) The cluster read code now tries to keep data associated with buffers
more aggressively (without breaking the heuristics) when it is presumed
that the read data (buffers) will be soon needed.
9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The
delay loop waiting is not useful for filesystem locks, due to the
length of the time intervals.
10) Correct and clean-up spec_getpages.
11) Implement a fully functional nfs_getpages, nfs_putpages.
12) Fix nfs_write so that modifications are coherent with the NFS data on
the server disk (at least as well as NFS seems to allow.)
13) Properly support MS_INVALIDATE on NFS.
14) Properly pass down MS_INVALIDATE to lower levels of the VM code from
vm_map_clean.
15) Better support the notion of pages being busy but valid, so that
fewer in-transit waits occur. (use p->busy more for pageouts instead
of PG_BUSY.) Since the page is fully valid, it is still usable for
reads.
16) It is possible (in error) for cached pages to be busy. Make the
page allocation code handle that case correctly. (It should probably
be a printf or panic, but I want the system to handle coding errors
robustly. I'll probably add a printf.)
17) Correct the design and usage of vm_page_sleep. It didn't handle
consistancy problems very well, so make the design a little less
lofty. After vm_page_sleep, if it ever blocked, it is still important
to relookup the page (if the object generation count changed), and
verify it's status (always.)
18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up.
19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush.
20) Fix vm_pager_put_pages and it's descendents to support an int flag
instead of a boolean, so that we can pass down the invalidate bit.


# 50ce7ff4 23-Jan-1998 John Dyson <dyson@FreeBSD.org>

Add better support for larger I/O clusters, including larger physical
I/O. The support is not mature yet, and some of the underlying implementation
needs help. However, support does exist for IDE devices now.


# 47221757 17-Jan-1998 John Dyson <dyson@FreeBSD.org>

Tie up some loose ends in vnode/object management. Remove an unneeded
config option in pmap. Fix a problem with faulting in pages. Clean-up
some loose ends in swap pager memory management.

The system should be much more stable, but all subtile bugs aren't fixed yet.


# 925a3a41 11-Jan-1998 John Dyson <dyson@FreeBSD.org>

Fix some vnode management problems, and better mgmt of vnode free list.
Fix the UIO optimization code.
Fix an assumption in vm_map_insert regarding allocation of swap pagers.
Fix an spl problem in the collapse handling in vm_object_deallocate.
When pages are freed from vnode objects, and the criteria for putting
the associated vnode onto the free list is reached, either put the
vnode onto the list, or put it onto an interrupt safe version of the
list, for further transfer onto the actual free list.
Some minor syntax changes changing pre-decs, pre-incs to post versions.
Remove a bogus timeout (that I added for debugging) from vn_lock.

PHK will likely still have problems with the vnode list management, and
so do I, but it is better than it was.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 82dc3896 29-Dec-1997 John Dyson <dyson@FreeBSD.org>

Add the vnode interlock back around vget.


# 60f8d464 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem
with the object cache removal.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 1cbbd625 14-Dec-1997 Garrett Wollman <wollman@FreeBSD.org>

Add support for poll(2) on files. vop_nopoll() now returns POLLNVAL
if one of the new poll types is requested; hopefully this will not break
any existing code. (This is done so that programs have a dependable
way of determining whether a filesystem supports the extended poll types
or not.)

The new poll types added are:

POLLWRITE - file contents may have been modified
POLLNLINK - file was linked, unlinked, or renamed
POLLATTRIB - file's attributes may have been changed
POLLEXTEND - file was extended

Note that the internal operation of poll() means that it is impossible
for two processes to reliably poll for the same event (this could
be fixed but may not be worth it), so it is not possible to rewrite
`tail -f' to use poll at this time.


# 1cd52ec3 05-Dec-1997 Bruce Evans <bde@FreeBSD.org>

Don't include <sys/lock.h> in headers when only `struct simplelock' is
required. Fixed everything that depended on the pollution.


# cb451ebd 22-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Staticized.


# f6fdfec4 18-Nov-1997 Bruce Evans <bde@FreeBSD.org>

Don't #include <machine/smp.h> even in the SMP case. Fixed the one
place that depended on it. The "bazillion warnings" mentioned in the
log for rev.1.45 apparently aren't a problem any more. It is hard
to be sure because the SIMPLELOCK_DEBUG option turns off (and breaks)
things in the SMP case.

Don't forward declare structs that are already implicitly forward declared.

Fixed a disordered declaration.


# dba3870c 26-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

VFS interior redecoration.

Rename vn_default_error to vop_defaultop all over the place.
Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite.
Use vop_null instead of nullop.
Move vop_nopoll from vfs_subr.c to vfs_default.c
Move vop_sharedlock from vfs_subr.c to vfs_default.c
Move vop_nolock from vfs_subr.c to vfs_default.c
Move vop_nounlock from vfs_subr.c to vfs_default.c
Move vop_noislocked from vfs_subr.c to vfs_default.c
Use vop_ebadf instead of *_ebadf.
Add vop_defaultop for getpages on master vnode in MFS.


# 1b09ae77 26-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Simplify the lease_check stuff.


# d54d34b5 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Make a set of VOP standard lock, unlock & islocked VOP operators, which
depend on the lock being located at vp->v_data. Saves 3x3 identical
vop procs, more as the other filesystems becomes lock aware.


# 987f5696 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Another VFS cleanup "kilo commit"

1. Remove VOP_UPDATE, it is (also) an UFS/{FFS,LFS,EXT2FS,MFS}
intereface function, and now lives in the ufsmount structure.

2. Remove VOP_SEEK, it was unused.

3. Add mode default vops:

VOP_ADVLOCK vop_einval
VOP_CLOSE vop_null
VOP_FSYNC vop_null
VOP_IOCTL vop_enotty
VOP_MMAP vop_einval
VOP_OPEN vop_null
VOP_PATHCONF vop_einval
VOP_READLINK vop_einval
VOP_REALLOCBLKS vop_eopnotsupp

And remove identical functionality from filesystems

4. Add vop_stdpathconf, which returns the canonical stuff. Use
it in the filesystems. (XXX: It's probably wrong that specfs
and fifofs sets this vop, shouldn't it come from the "host"
filesystem, for instance ufs or cd9660 ?)

5. Try to make system wide VOP functions have vop_* names.

6. Initialize the um_* vectors in LFS.

(Recompile your LKMS!!!)


# cec0f20c 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There are still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


# a1c995b6 12-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Last major round (Unless Bruce thinks of somthing :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde


# 99448ed1 20-Sep-1997 John Dyson <dyson@FreeBSD.org>

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


# 3a74593f 13-Sep-1997 Peter Wemm <peter@FreeBSD.org>

Update interfaces for poll()


# a051452a 31-Aug-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Change the 0xdeadb hack to a flag called VDOOMED.
Introduce VFREE which indicates that vnode is on freelist.
Rename vholdrele() to vdrop().
Create vfree() and vbusy() to add/delete vnode from freelist.
Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0)
vnodes off the freelist.
Generalize vhold()/v_holdcnt to mean "do not recycle".
Fix reassignbuf()s lack of use of vhold().
Use vhold() instead of checking v_cache_src list.
Remove vtouch(), the vnodes are always vget'ed soon enough
after for it to have any measuable effect.
Add sysctl debug.freevnodes to keep track of things.
Move cache_purge() up in getnewvnodes to avoid race.
Decrement v_usecount after VOP_INACTIVE(), put a vhold() on
it during VOP_INACTIVE()
Unmacroize vhold()/vdrop()
Print out VDOOMED and VFREE flags (XXX: should use %b)

Reviewed by: dyson


# 0fa2443f 26-Aug-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Uncut&paste cache_lookup().

This unifies several times in theory indentical 50 lines of code.

The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.

It's still the task of the individual filesystems to populate the
namecache with cache_enter().

Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.


# 7cbfd031 17-Aug-1997 Steve Passe <fsmp@FreeBSD.org>

Added includes of smp.h for SMP.
This eliminates a bazillion warnings about implicit s_lock & friends.


# b15a966e 04-May-1997 Poul-Henning Kamp <phk@FreeBSD.org>

1. Add a {pointer, v_id} pair to the vnode to store the reference to the
".." vnode. This is cheaper storagewise than keeping it in the
namecache, and it makes more sense since it's a 1:1 mapping.

2. Also handle the case of "." more intelligently rather than stuff
the namecache with pointless entries.

3. Add two lists to the vnode and hang namecache entries which go from
or to this vnode. When cleaning a vnode, delete all namecache
entries it invalidates.

4. Never reuse namecache enties, malloc new ones when we need it, free
old ones when they die. No longer a hard limit on how many we can
have.

5. Remove the upper limit on namelength of namecache entries.

6. Make a global list for negative namecache entries, limit their number
to a sysctl'able (debug.ncnegfactor) fraction of the total namecache.
Currently the default fraction is 1/16th. (Suggestions for better
default wanted!)

7. Assign v_id correctly in the face of 32bit rollover.

8. Remove the LRU list for namecache entries, not needed. Remove the
#ifdef NCH_STATISTICS stuff, it's not needed either.

9. Use the vnode freelist as a true LRU list, also for namecache accesses.

10. Reuse vnodes more aggresively but also more selectively, if we can't
reuse, malloc a new one. There is no longer a hard limit on their
number, they grow to the point where we don't reuse potentially
usable vnodes. A vnode will not get recycled if still has pages in
core or if it is the source of namecache entries (Yes, this does
indeed work :-) "." and ".." are not namecache entries any longer...)

11. Do not overload the v_id field in namecache entries with whiteout
information, use a char sized flags field instead, so we can get
rid of the vpid and v_id fields from the namecache struct. Since
we're linked to the vnodes and purged when they're cleaned, we don't
have to check the v_id any more.

12. NFS knew about the limitation on name length in the namecache, it
shouldn't and doesn't now.

Bugs:
The namecache statistics no longer includes the hits for ".."
and "." hits.

Performance impact:
Generally in the +/- 0.5% for "normal" workstations, but
I hope this will allow the system to be selftuning over a
bigger range of "special" applications. The case where
RAM is available but unused for cache because we don't have
any vnodes should be gone.

Future work:
Straighten out the namecache statistics.

"desiredvnodes" is still used to (bogusly ?) size hash
tables in the filesystems.

I have still to find a way to safely free unused vnodes
back so their number can shrink when not needed.

There is a few uses of the v_id field left in the filesystems,
scheduled for demolition at a later time.

Maybe a one slot cache for unused namecache entries should
be implemented to decrease the malloc/free frequency.


# 754c6d37 04-Apr-1997 Doug Rabson <dfr@FreeBSD.org>

Add some debugging macros for tracing VFS locking bugs.
Declare (hopefully short-lived) vop_sharedlock.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 670718e2 12-Feb-1997 Mike Pritchard <mpp@FreeBSD.org>

Remove function prototypes for vfs_mountroot and vgoneall, since
they were removed with the Lite2 merge.

Submitted by: bde


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 5131d64e 16-Jan-1997 Bruce Evans <bde@FreeBSD.org>

Removed option EXTRAVNODES. All versions of FreeBSD-2.x have a sysctl
variable `kern.maxvnodes' which gives much better control over vnode
allocation than EXTRAVNODES (except in -current between 1995/10/28 and
1996/11/12, kern.maxvnodes was read-only and thus useless).


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 8b612c4b 28-Dec-1996 John Dyson <dyson@FreeBSD.org>

This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!


# 17a6a9e3 17-Oct-1996 Jordan K. Hubbard <jkh@FreeBSD.org>

Some very small changes to support Netcon's TFS filesystem.
These patches were formerly applied by the Netcon installer
before rebuilding your kernel.


# 62c3734c 15-Oct-1996 Bruce Evans <bde@FreeBSD.org>

Updated #includes to 4.4lite style.


# 6476c0d2 21-Aug-1996 John Dyson <dyson@FreeBSD.org>

Even though this looks like it, this is not a complex code change.
The interface into the "VMIO" system has changed to be more consistant
and robust. Essentially, it is now no longer necessary to call vn_open
to get merged VM/Buffer cache operation, and exceptional conditions
such as merged operation of VBLK devices is simpler and more correct.

This code corrects a potentially large set of problems including the
problems with ktrace output and loaded systems, file create/deletes,
etc.

Most of the changes to NFS are cosmetic and name changes, eliminating
a layer of subroutine calls. The direct calls to vput/vrele have
been re-instituted for better cross platform compatibility.

Reviewed by: davidg


# 114a8cff 30-May-1996 Peter Wemm <peter@FreeBSD.org>

Add an option "EXTRA_VNODES" to cause an extra number of vnode structures
to be allocated at boot time. This is an expensive option, as they
consume physical ram and are not pageable etc. In certain situations,
this kind of option is quite useful, especially for news servers that
access a large number of directories at random and torture the name cache.
Defining 5000 or 10000 extra vnodes should cut down the amount of vnode
recycling somewhat, which should allow better name and directory caching
etc.

This is a "your mileage may vary" option, with no real indication of
what works best for your machine except trial and error. Too many will
cost you ram that you could otherwise use for disk buffers etc.

This is based on something John Dyson mentioned to me a while ago.


# e206cebf 28-Mar-1996 David Greenman <dg@FreeBSD.org>

Change v_usecount & v_writecount from a short to an int. As shorts they
can and will overflow on large machines - especially on machines with
filesystems with lots of files (like netnews servers), and the result
is a "free vnode isn't" panic or worse.
This fixes one of the causes of these panics that I've been experiancing on
wcarchive.


# 02e2c406 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[new sys/syscallargs.h file, to be "cvs rm"ed]


# d9c68230 03-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Add missing prototype for newly public vn_vmio_open function, next to
vn_vmio_close.


# 6c5e9bbd 30-Jan-1996 Mike Pritchard <mpp@FreeBSD.org>

Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do alot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 45f486a5 25-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Removed redundant (incompletely staticized) declararations.


# 27a0b398 17-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Staticize.
Unstaticize a function in scsi/scsi_base that was used, with an undocumented
option.
My last count on the LINT kernel shows:
Total symbols: 3647
unref symbols: 463
undef symbols: 4
1 ref symbols: 1751
2 ref symbols: 485
Approaching the pain threshold now.


# 68a3b3cd 15-Dec-1995 Bruce Evans <bde@FreeBSD.org>

Completed a function declaration.

Restored order to prototype list.

Restored tabs to #defines.


# a316d390 10-Dec-1995 John Dyson <dyson@FreeBSD.org>

Changes to support 1Tb filesizes. Pages are now named by an
(object,index) pair instead of (object,offset) pair.


# f57e6547 09-Nov-1995 Bruce Evans <bde@FreeBSD.org>

Introduced a type `vop_t' for vnode operation functions and used
it 1138 times (:-() in casts and a few more times in declarations.
This change is null for the i386.

The type has to be `typedef int vop_t(void *)' and not `typedef
int vop_t()' because `gcc -Wstrict-prototypes' warns about the
latter. Since vnode op functions are called with args of different
(struct pointer) types, neither of these function types is any use
for type checking of the arg, so it would be preferable not to use
the complete function type, especially since using the complete
type requires adding 1138 casts to avoid compiler warnings and
another 40+ casts to reverse the function pointer conversions before
calling the functions.


# ef27d1fe 07-Nov-1995 John Dyson <dyson@FreeBSD.org>

Export a symbol that ext2fs wants (insmntque.)


# 39d38f93 06-Jul-1995 David Greenman <dg@FreeBSD.org>

Fixed an object allocation race condition that was causing a "object
deallocated too many times" panic when using NFS.

Reviewed by: John Dyson


# aa2cabb9 27-Jun-1995 David Greenman <dg@FreeBSD.org>

1) Converted v_vmdata to v_object.
2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs
after vnode_pager_alloc() calls - the object is already guaranteed to be
persistent.
3) Removed some gratuitous casts.


# 999422d7 19-Apr-1995 Julian Elischer <julian@FreeBSD.org>

Reviewed by: no-one yet, but non-intrusive
Submitted by: julian@tfs.com
Obtained from: written from scratch

slight changes to make space for devfs..
(also conditional test code in i386/isa/fd.c)

===================================================================
RCS file: /home/ncvs/src/sys/sys/malloc.h,v
retrieving revision 1.7
diff -r1.7 malloc.h
113a114,117
> #define M_DEVFSMNT 62 /* DEVFS mount structure */
> #define M_DEVFSBACK 63 /* DEVFS Back node */
> #define M_DEVFSFRONT 64 /* DEVFS Front node */
> #define M_DEVFSNODE 65 /* DEVFS node */
184c188,192
< NULL, NULL, NULL, NULL, NULL, \
---
> "DEVFS mount", /* 62 M_DEVFSMNT */ \
> "DEVFS back", /* 63 M_DEVFSBACK */ \
> "DEVFS front", /* 64 M_DEVFSFRONT */ \
> "DEVFS node", /* 65 M_DEVFSNODE */ \
> NULL, \
Index: sys/mount.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/mount.h,v
retrieving revision 1.16
diff -r1.16 mount.h
100c100,101
< #define MOUNT_MAXTYPE 15
---
> #define MOUNT_DEVFS 16 /* existing device Filesystem */
> #define MOUNT_MAXTYPE 16
118a120
> "devfs", /* 15 MOUNT_DEVFS */ \
Index: sys/vnode.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/vnode.h,v
retrieving revision 1.19
diff -r1.19 vnode.h
61c61
< VT_UNION, VT_MSDOSFS
---
> VT_UNION, VT_MSDOSFS, VT_DEVFS


# f6b04d2b 09-Apr-1995 David Greenman <dg@FreeBSD.org>

Changes from John Dyson and myself:

Fixed remaining known bugs in the buffer IO and VM system.

vfs_bio.c:
Fixed some race conditions and locking bugs. Improved performance
by removing some (now) unnecessary code and fixing some broken
logic.
Fixed process accounting of # of FS outputs.
Properly handle NFS interrupts (B_EINTR).

(various)
Replaced calls to clrbuf() with calls to an optimized routine
called vfs_bio_clrbuf().

(various FS sync)
Sync out modified vnode_pager backed pages.

ffs_vnops.c:
Do two passes: Sync out file data first, then indirect blocks.

vm_fault.c:
Fixed deadly embrace caused by acquiring locks in the wrong order.

vnode_pager.c:
Changed to use buffer I/O system for writing out modified pages. This
should fix the problem with the modification date previous not getting
updated. Also dramatically simplifies the code. Note that this is
going to change in the future and be implemented via VOP_PUTPAGES().

vm_object.c:
Fixed a pile of bugs related to cleaning (vnode) objects. The performance
of vm_object_page_clean() is terrible when dealing with huge objects,
but this will change when we implement a binary tree to keep the object
pages sorted.

vm_pageout.c:
Fixed broken clustering of pageouts. Fixed race conditions and other
lockup style bugs in the scanning of pages. Improved performance.


# cb9db2b6 28-Mar-1995 David Greenman <dg@FreeBSD.org>

When NFS is compiled into the kernel, make NQNFS lease checking conditional
on a "NQNFS" kernel config option. NQNFS is a 4.4 wart and the performance
penalty of the lease checks on the client/server for _local_ I/O is too high
to have this occur all the time - especially when most people will never
use it.


# b5e8ce9f 16-Mar-1995 Bruce Evans <bde@FreeBSD.org>

Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
ones.


# 79f7a9e1 07-Mar-1995 David Greenman <dg@FreeBSD.org>

Added a new flag "VAGE" to indicate that the vnode should go on the head
of the free list.


# ff2e6a8b 05-Jan-1995 Justin T. Gibbs <gibbs@FreeBSD.org>

Add VNINACT flag. LFS has a habbit of skipping the ufs_inactive procedure.
It used to do this by setting a global <Yuck>. Now we set th VNINACT
flag in the vnode to force a skip of ufs_inactive.

Sorry for missing this file in my last commit folks.

Index: vnode.h
===================================================================
RCS file: /usr/cvs/src/sys/sys/vnode.h,v
retrieving revision 1.14
diff -c -r1.14 vnode.h
*** 1.14 1994/11/14 13:51:53
--- vnode.h 1994/12/03 01:06:27
***************
*** 116,121 ****
--- 116,122 ----
#define VALIASED 0x0800 /* vnode has an alias */
#define VDIROP 0x1000 /* LFS: vnode is involved in a directory op */
#define VVMIO 0x2000 /* VMIO flag */
+ #define VNINACT 0x4000 /* LFS: skip ufs_inactive() in lfs_vunref */

/*
* Vnode attributes. A field value of VNOVAL represents a field whose value


# 80d08a82 14-Nov-1994 Bruce Evans <bde@FreeBSD.org>

Add prototype for vfinddev().


# 091b0456 20-Oct-1994 Garrett Wollman <wollman@FreeBSD.org>

Make my ALLDEVS kernel compile (basically, LINT minus a lot of options).

This involves fixing a few things I broke last time.


# b4a8d575 08-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Added prototypes here and there. Moved pfctlinput into socket.h.


# 8e58bf68 05-Oct-1994 David Greenman <dg@FreeBSD.org>

Stuff object into v_vmdata rather than pager. Not important which at
the moment, but will be in the future. Other changes mostly cosmetic,
but are made for future VMIO considerations.

Submitted by: John Dyson


# f86eaaca 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Prototypes, prototypes and even more prototypes. Not quite done yet, but
getting closer all the time.


# bb56ec4a 25-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if you kernel breaks.


# e21fa31a 22-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Make NFS loadable.


# c901836c 20-Sep-1994 Garrett Wollman <wollman@FreeBSD.org>

Implemented loadable VFS modules, and made most existing filesystems
loadable. (NFS is a notable exception.)


# 27a0bc89 19-Sep-1994 Doug Rabson <dfr@FreeBSD.org>

Added msdosfs.

Obtained from: NetBSD


# d8f10c11 15-Sep-1994 Bruce Evans <bde@FreeBSD.org>

Add some prototypes.


# 1cdeb653 29-Aug-1994 David Greenman <dg@FreeBSD.org>

"bogus" fixes from 1.1.5 to work around some cache coherency problems.


# af9da405 20-Aug-1994 Paul Richards <paul@FreeBSD.org>

Made them all idempotent.
Reviewed by:
Submitted by:


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources