History log of /freebsd-current/sys/kern/vfs_lookup.c
Revision Date Author Comments
# 6b0cf2a2 24-Apr-2024 Konstantin Belousov <kib@FreeBSD.org>

vfs_lookup.c: only call ktrcapfail() if KTRACE is enabled

Reviewed by: emaste, imp, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D44931


# 66df8102 24-Apr-2024 Konstantin Belousov <kib@FreeBSD.org>

sys/namei.h: move NI_CAP_VIOLATION() macro from namei.h to vfs_lookup.c

Reviewed by: emaste, imp, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D44931


# 0cd9cde7 06-Apr-2024 Jake Freeland <jfree@FreeBSD.org>

ktrace: Record namei violations with KTR_CAPFAIL

Report namei path lookups while Capsicum violation tracing with
CAPFAIL_NAMEI. vfs caching is also ignored when tracing to mimic
capability mode behavior.

Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40680


# 55edc40e 04-Jan-2024 Mark Johnston <markj@FreeBSD.org>

file: Remove the fd parameter to fgetvp_lookup() and fgetvp_lookup_smr()

The fd is always obtained from nameidata, so just fetch it from there
instead. No functional change intended.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43257


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixup to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# 586fed0b 04-Nov-2023 Jason A. Harmening <jah@FreeBSD.org>

vfs_lookup_cross_mount(): restore previous do...while loop

When the cross-mount walking logic in vfs_lookup() was factored into
a separate function, the main cross-mount traversal loop was changed
from a do...while loop conditional on the current vnode having
VIRF_MOUNTPOINT set to an unconditional for(;;) loop. For the
unionfs 'crosslock' case in which the vnode may be re-locked, this
meant that continuing the loop upon finding inconsistent
v_mountedhere state would no longer branch to a check that the vnode
is in fact still a mountpoint. This would in turn lead to over-
iteration and, for INVARIANTS builds, a failed assert on the next
iteration.

Fix this by restoring the previous loop behavior.

Reported by: pho
Tested by: pho
Fixes: 80bd5ef0702562c546fa1717e8fe221058974eac
MFC after: 1 week


# 02cbc029 22-Sep-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: fix reference counting/locking on LK_UPGRADE error

Factoring out this code unfortunately introduced reference and lock leaks in
case of failure in the lock upgrade path under VV_CROSSLOCK. In terms of
practical use, this impacts unionfs (and nullfs in a corner case).

Fixes: 80bd5ef07025 ("vfs: factor out mount point traversal to a dedicated routine")
MFC after: 3 days
MFC to: stable/14 releng/14.0
Sponsored by: The FreeBSD Foundation
Reviewed by: mjg
[mjg: massaged the commit message a little bit]

Differential Revision: https://reviews.freebsd.org/D41731


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# b8b33f3b 09-Aug-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire NAMEI_DIAGNOSTIC

It is too spammy and information-deficient for practical use.

Also see https://reviews.freebsd.org/D41207


# 80bd5ef0 05-Jul-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: factor out mount point traversal to a dedicated routine

While here tidy up asserts in the area.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D40883


# ebf37c3f 05-Jul-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop LK_RETRY when crossing mount points in vfs_lookup

vn_lock already returns the expected error.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D40883


# 0724cf38 05-Jul-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: whack dpunlocked var in vfs_lookup

It is redundant given the bad_unlocked goto label.


# 5842f73d 05-Jul-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs: compute_lk_cnflags(): Remove unused argument 'cnflags'; Rename

Argument unused since commit 93a0ba8f4990785f.

Rename it to enforce_lkflags(), which seems to more aptly describe what it does.

[mjg: massaged the commit message a little]
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D40848


# 7b5a1c39 27-Jun-2023 Igor Ostapenko <pm@igoro.pro>

vfs: bring vfs_lookup() description comment up to date

Signed-off-by: Igor Ostapenko <pm@igoro.pro>
Reviewed by: imp, mhorne
Pull Request: https://github.com/freebsd/freebsd-src/pull/737


# 5958cd88 27-Jun-2023 Igor Ostapenko <pm@igoro.pro>

vfs: fix description comment of vfs_lookup()

Signed-off-by: Igor Ostapenko <pm@igoro.pro>
Reviewed by: imp, mhorne
Pull Request: https://github.com/freebsd/freebsd-src/pull/737


# cea7c564 13-Jun-2023 Dmitry Chagin <dchagin@FreeBSD.org>

namei: Reset the lookup to start from the real root for abs symlink target

Since fd745e1d Linux ABI specifies alternative root directory to reroot
lookups. First, an attempt is made to lookup the file in /ABI/original-path.
If that fails, the lookup is done in /original-path. In case of lookup
symbolic link with leading / in target namei() fails due to reroot reloads
original file name.
To avoid this handle restart in a special maner, without origin path name
reloading.

Reported by: Goran Mekić, Vincent Milum Jr
Tested by: Goran Mekić
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D40479


# 861abdad 13-Jun-2023 Dmitry Chagin <dchagin@FreeBSD.org>

namei: Add a comment explaining ISRESTARTED flag

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D40494


# 07c0b6e5 29-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

vfs: Retire kern_alternate_path() as unused anymore

From now a non-native ABI should use pwd_altroot() ability to tell
to the namei() its root directory to dynamically reroots lookups.

Differential Revision: https://reviews.freebsd.org/D40093
MFC after: 2 month


# 3d2fec7d 29-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

namei: Add the abilty for the ABI to specify an alternate root path

For now a non-native ABI (i.e., Linux) uses the kern_alternate_path()
facility to dynamically reroot lookups. First, an attempt is made to
lookup the file in /compat/linux/original-path. If that fails, the
lookup is done in /original-path. Thats requires a bit of code in
every ABI syscall implementation where path name translation is needed.
Also our kern_alternate_path() does not properly lookups absolute symlinks
in second attempt, i.e., does not append /compat/linux part to the resolved
link.
The change is intended to avoid this by specifiyng the ABI root directory
for namei(), using one call to pwd_altroot() during exec-time into the ABI.
In that case namei() will dynamically reroot lookups as mentioned above.

PR: 72920
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D38933
MFC after: 2 month


# cf0fc64b 03-May-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: reduce audit branching in namei_setup


# a718431c 23-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

lookup(): ensure that openat("/", "..", O_RESOLVE_BENEATH) fails

PR: 269780
Reported by: Dan Gohman <dev@sunfishcode.online>
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39773


# 0c01203e 28-Mar-2023 Jason A. Harmening <jah@FreeBSD.org>

vfs_lookup(): re-check v_mountedhere on lock upgrade

The VV_CROSSLOCK handling logic may need to upgrade the covered
vnode lock depending upon the requirements of the filesystem into
which vfs_lookup() is walking. This may involve transiently
dropping the lock, which can allow the target mount to be unmounted.

Tested by: pho
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D39272


# 7b6fe242 08-Apr-2023 Konstantin Belousov <kib@FreeBSD.org>

DEBUG_VFS_LOCKS: use witness if available

The assert_vop_locked messages are ignored, and file/line information
is not too useful. Fixing this without changing both witness and VFS
asserts KPIs is not possible.

Reviewed by: markj (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D39464


# 829f0bcb 19-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add the concept of vnode state transitions

To quote from a comment above vput_final:
<quote>
* XXX Some filesystems pass in an exclusively locked vnode and strongly depend
* on the lock being held all the way until VOP_INACTIVE. This in particular
* happens with UFS which adds half-constructed vnodes to the hash, where they
* can be found by other code.
</quote>

As is there is no mechanism which allows filesystems to denote that a
vnode is fully initialized, consequently problems like the above are
only found the hard way(tm).

Add rudimentary support for state transitions, which in particular allow
to assert the vnode is not legally unlocked until its fate is decided
(either construction finishes or vgone is called to abort it).

The new field lands in a 1-byte hole, thus it does not grow the struct.

Bump __FreeBSD_version to 1400077

Reviewed by: kib (previous version)
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37759


# 8f7859e8 14-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the now unused SAVESTART flag

Bump __FreeBSD_version to 1400075

Tested by: pho


# 8f874e92 09-Nov-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: make relookup take an additional argument

instead of looking at SAVESTART

This is a step towards removing the flag.

Reviewed by: mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D34468


# 269c564b 17-Nov-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire NDFREE

There are no consumers anymore. Interested parties can NDFREE_PNBUF
and vput or vrele relevant vnodes.

Tested by: pho


# 42442d7a 26-Oct-2022 Jason A. Harmening <jah@FreeBSD.org>

Generalize the VV_CROSSLOCK logic in vfs_lookup()

When VV_CROSSLOCK is present, the lock for the vnode at the current
stage of lookup must be held across the VFS_ROOT() call for the
filesystem mounted at the vnode. Since VV_CROSSLOCK implies that
the root vnode reuses the already-held lock, the possibility for
recursion should be made clear in the flags passed to VFS_ROOT().

For cases in which the lock is held exclusive, this means passing
LK_CANRECURSE. For cases in which the lock is held shared, it
means clearing LK_NODDLKTREAT to allow VFS_ROOT() to potentially
recurse on the shared lock even in the presence of an exclusive
waiter.

That the existing code works for unionfs is due to a coincidence
of the current unionfs implementation.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D37458


# f7833196 19-Oct-2022 Jason A. Harmening <jah@FreeBSD.org>

vfs_lookup(): Minor performance optimizations

Refactor the symlink and mountpoint traversal logic to avoid
repeatedly checking the vnode type; a symlink cannot be a mountpoint
and vice versa. Avoid repeatedly checking cn_flags for NOCROSSMOUNT
and simplify the check which determines whether the vnode is a
mountpoint.

Suggested by: mjg
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D35054


# 706f15c5 04-Aug-2022 Jason A. Harmening <jah@FreeBSD.org>

Remove witness directives from crossmp locking VOPs

These are of limited use since the crossmp vnode locking ops have not
actually used a lock since commit
a2d35545429117e68fbcbc68e14ad55e84265d69. We in fact require that
these operations are always issued with LK_SHARED. Additionally,
these directives can produce a false positive in certain VV_CROSSLOCK
cases which require upgrading of the covered vnode lock from shared
to exclusive.

While here, replace the runtime check of LK_SHARED with a KASSERT and
expand the check to include LK_NOWAIT, which all callers pass.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D35054


# 080ef8a4 04-Aug-2022 Jason A. Harmening <jah@FreeBSD.org>

Add VV_CROSSLOCK vnode flag to avoid cross-mount lookup LOR

When a lookup operation crosses into a new mountpoint, the mountpoint
must first be busied before the root vnode can be locked. When a
filesystem is unmounted, the vnode covered by the mountpoint must
first be locked, and then the busy count for the mountpoint drained.
Ordinarily, these two operations work fine if executed concurrently,
but with a stacked filesystem the root vnode may in fact use the
same lock as the covered vnode. By design, this will always be
the case for unionfs (with either the upper or lower root vnode
depending on mount options), and can also be the case for nullfs
if the target and mount point are the same (which admittedly is
very unlikely in practice).

In this case, we have LOR. The lookup path holds the mountpoint
busy while waiting on what is effectively the covered vnode lock,
while a concurrent unmount holds the covered vnode lock and waits
for the mountpoint's busy count to drain.

Attempt to resolve this LOR by allowing the stacked filesystem
to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary.
Upon observing this flag, the vfs_lookup() will leave the covered
vnode lock held while crossing into the mountpoint. Employ this flag
for unionfs with the caveat that it can't be used for '-o below' mounts
until other unionfs locking issues are resolved.

Reported by: pho
Tested by: pho
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35054


# b77bdfdb 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix non-INVARIANTS build after 5b5b7e2ca2fa9a2418dd51749f4ef6f881ae7179

Reported by: gj


# 5b5b7e2c 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: always retain path buffer after lookup

This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36542


# 3df3d88c 16-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: move cn_nameptr assignment out of namei_getpath


# f7dc4a71 13-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: plug spurious error checks in namei

error is guaranteed 0 at that point


# b4137c9e 12-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: make NDVALIDATE private to vfs_lookup.c

it is not used elsewhere.


# 14312394 27-Apr-2022 John Baldwin <jhb@FreeBSD.org>

Add a __witness_used for variables only used under #ifdef WITNESS.

__diagused is now solely used for variables only used under INVARIANTS.

Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D35085


# c9b04ee4 02-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

kern: Fix two typos in source code comments

- s/accomodate/accommodate/

MFC after: 3 days


# 0c805718 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix memory leak on lookup with fds with ioctl caps

Reviewed by: markj
PR: 262515
Noted by: firk@cantconnect.ru
Differential Revision: https://reviews.freebsd.org/D34667


# a4032e2a 26-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: assorted tidy ups to lookup

No functional changes.


# 0f600883 25-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: set cn_namelen when handling degenerate lookups

Turns out execve looks at it to store binary name, but in order to
trigger the problem one has to be trying to exec '/'. As is the value
would be left uninitialized (or rather set to -1 on debug kernels).

Fixes: 56244d35741a62e7 ("vfs: hoist degenerate path lookups out of the
loop")


# 4ef6e56a 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: hoist trailing slash handling out of the loop


# 3b6792d2 23-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: factor symlink traversal out of namei

The intent down the road is to eliminate the loop to begin with,
pushing traversal down to vfs_lookup, all while not allocating the
extra buffer.


# d9ea7e2b 13-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: factor FAILIFEXISTS handling out of vfs_lookup


# 56244d35 11-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: hoist degenerate path lookups out of the loop


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# 93a0ba8f 17-Sep-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the no longer used MNTK_LOOKUP_EXCL_DOTDOT flag

Reviewed by: markj
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D34466


# 0134bbe5 13-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prefix lookup and relookup with vfs_

Reviewed by: imp, mckusick
Differential Revision: https://reviews.freebsd.org/D34530


# 2cee5861 28-Dec-2021 John Baldwin <jhb@FreeBSD.org>

sys/kern: Use C99 fixed-width integer types.

No functional change.

Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33630


# 054f5815 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: plug a set-but-not-used var in kern_alternate_path

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# c40fee6f 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the always curthread argument from kern_alternate_path


# 7dd419ca 26-Sep-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add empty path support

This avoids spurious drop offs as EMPTY is passed regardless of the
actual path name.

Pushign the work inside the lookup instead of just ignorign the flag
allows avoid checking for empty pathname for all other lookups.


# 46dd801a 16-Oct-2021 Colin Percival <cperciva@FreeBSD.org>

Add userland boot profiling to TSLOG

On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.

Expose this information via a new sysctl, debug.tslog_user.

On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.

Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.

With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc. The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling

Reviewed by: mhorne
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D32493


# b4a58fbf 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove cn_thread

It is always curthread.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32453


# c9536389 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: hoist cn_thread assert in namei

Making it condtional on whether ktrace happens to be enabled makes no
sense.


# 7fd856ba 22-Aug-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: s/__unused/__diagused in crossmp_*

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 5d75ffdd 20-Aug-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove an unused variable from nameicap_tracker_add

Reported by cc --analyze

Sponsored by: Rubicon Communications, LLC ("Netgate")


# 9446d9e8 13-Aug-2021 Konstantin Belousov <kib@FreeBSD.org>

fstatat(2): handle non-vnode file descriptors for AT_EMPTY_PATH

Set NIRES_EMPTYPATH earlies, to have use of EMPTYPATH recorded even if
we are going to return error. When namei_setup() refused to accept dirfd,
which is not of the vnode type, and indicated by ENOTDIR error return,
fall back to kern_fstat(dirfd).

Reported by: dchagin
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31530


# d81aefa8 25-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use the sentinel trick in locked lookup path parsing


# cef8a95a 13-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix vnode use count leak in O_EMPTY_PATH support

The vnode returned by namei_setup is already referenced.

Reported by: pho


# a5970a52 03-Apr-2021 Konstantin Belousov <kib@FreeBSD.org>

Make files opened with O_PATH to not block non-forced unmount

by only keeping hold count on the vnode, instead of the use count.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# 8d9ed174 17-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

open(2): Implement O_PATH

Reviewed by: markj
Tested by: pho
Discussed with: walker.aj325_gmail.com, wulf
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323


# 509124b6 07-Mar-2021 Konstantin Belousov <kib@FreeBSD.org>

Add AT_EMPTY_PATH for several *at(2) syscalls

It is currently allowed to fchownat(2), fchmodat(2), fchflagsat(2),
utimensat(2), fstatat(2), and linkat(2).

For linkat(2), PRIV_VFS_FHOPEN privilege is required to exercise the flag.
It allows to link any open file.

Requested by: trasz
Tested by: pho, trasz
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D29111


# 71c160a8 27-Mar-2021 Mark Johnston <markj@FreeBSD.org>

vfs: Add an assertion around name length limits

Some filesystems assume that they can copy a name component, with length
bounded by NAME_MAX, into a dirent buffer of size MAXNAMLEN. These
constants have the same value; add a compile-time assertion to that
effect.

Reported by: Alexey Kulaev <alex.qart@gmail.com>
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29431


# 28cd3a67 28-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

O_RELATIVE_BENEATH: return ENOTCAPABLE instead of EINVAL for abs path

Requested and reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28907


# 49c98a4b 27-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

nameicap_check_dotdot: trim tracker on check

Tracker should contain exactly the path from the starting directory to
the current lookup point. Otherwise we might not detect some cases of
dotdot escape. Consequently, if we are walking up the tree by dotdot
lookup, we must remove an entries below the walked directory.

Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907


# e8a2862a 27-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

Add nameicap_cleanup_from(), to clean tracker list starting from some element

Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907


# 2388ad7c 27-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

nameicap_tracker_add: avoid duplicates in the tracker list

Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907


# 59e74942 27-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

Do not call nameicap_tracker_add() for dotdot case.

Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907


# 20e91ca3 15-Feb-2021 Konstantin Belousov <kib@FreeBSD.org>

open(2): Remove O_BENEATH and AT_BENEATH

with the reasoning that the flags did not worked properly, and were not
shipped in a release.

O_RESOLVE_BENEATH is kept as useful.

Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907


# 739ecbcf 23-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add symlink support to lockless lookup

Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D27488


# 70ba7770 12-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: extend vfs:namei:lookup:return probe with nameidata


# cdb62ab7 11-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add NDFREE_NOTHING and convert several NDFREE_PNBUF callers

Check the comment above the routine for reasoning.


# 002e18eb 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add FAILIFEXISTS flag

Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.

This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.

Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.

For simplicity the only supported semantics are as used by mkdir.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27789


# 8fcfd0e2 06-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add cleanup on error missed in r368375

Noted by: jrtc27


# 60e2a0d9 05-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: factor buffer allocation/copyin out of namei


# 9c8c797c 22-Nov-2020 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove the 'wantparent' variable, unused since r145004.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27193


# 2fbb45c6 04-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: change nt_zone into a malloc type

Elements are small in size and allocated for short periods.


# 62568e88 29-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add NAMEI_DBG_HADSTARTDIR handling lost in rewrite

Noted by: rpokala


# eebc2e45 28-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add NDREINIT to facilitate repeated namei calls

struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.

Recent addition of valdiating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.

Add bare minimium validation this does not happen. The real fix would
decouple aforementioned state.

Reported by: pho
Tested by: pho (different variant)


# d681c51d 26-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add missing NIRES_ABS handling


# 4ea49660 08-Oct-2020 Konstantin Belousov <kib@FreeBSD.org>

Do not allow to use O_BENEATH as an oracle.

Specifically, if lookup() returned any error and the topping directory
was not latched, which means that (non-existent) path did not returned
to the topping location, give ENOTCAPABLE a priority over the lookup()
error.

PR: 249960
Reviewed by: emaste, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D26695


# 1317da43 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add O_RESOLVE_BENEATH and AT_RESOLVE_BENEATH to mimic Linux' RESOLVE_BENEATH.

It is like O_BENEATH, but disables to walk out of the subtree rooted
in the starting directory. O_BENEATH does not care if path walks out
if it returned.

Requested by: Dan Gohman <dev@sunfishcode.online>
PR: 248335
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# 6a9c72d9 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Change O_BENEATH to handle relative paths same as absolute.

Do not care if path walks out of the topping directory if it returns back.

Requested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# 07e7ad2b 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Only clear latch for BENEATH when we walk out of the startdir,

not unconditionally on any dotdot component.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# c7de3d6f 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Add NIRES_STRICTREL.

Stop abusing internal namei flag NI_LCF_STRICTRELATIVE as indicator of
cap-restricted lookup. Add designated returned flag NIRES_STRICTREL
to inform kern_openat() that lookup was restricted.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# f9e46c9b 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

lookup: Track last lookup component if it is directory.

This makes open("/a/../a", O_BENEATH) with cwd == "/a" work.

Reviewed by: markj
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# 44619a5e 22-Sep-2020 Konstantin Belousov <kib@FreeBSD.org>

Improve comment above nameicap_check_dotdot().

Explain why tracker is needed at all.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886


# 66ac5b2c 27-Aug-2020 Kirk McKusick <mckusick@FreeBSD.org>

Add a comment to clarify when and why cached names are deleted
during pathname lookup.

Reviewed by: kib
MFC after: 3 days
Sponsored by: Netflix


# f0d9c77e 23-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: validate ndp state after the lookup

The intent is to remove known-to-be-nops NDFREE calls after many lookups.


# 4b500119 23-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: convert nameiop into an enum

While here change the field size from long to int and move it into the
gap next to cn_flags.

Shrinks struct componentname from 64 to 56 bytes on amd64.


# de0fcd3a 22-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: assert that HASBUF is only set with SAVENAME or SAVESTART

as requested by the caller. The intent is to eradicate the mostly
spurious NDFREE_PNBUF calls.


# 760a430b 22-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add a work around for vp_crossmp bug to realpath

The actual bug is not yet addressed as it will get much easier after other
problems are addressed (most notably rename contract).

The only affected in-tree consumer is realpath. Everyone else happens to be
performing lookups within a mount point, having a side effect of ni_dvp being
set to mount point's root vnode in the worst case.

Reported by: pho


# 494c0f2a 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: mark HASBUF as an internal flag

There is no setter for cn_pnbuf.


# b38ad268 13-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add missing pwd_drop on error in namei_setup

Reported by: pho


# 2d0631dd 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stricter validation for flags passed to namei in cn_flags

namei de facto expects that the naimeidata object is properly initialized,
but at the same time it mixes consumer-passable and internal flags, while
tolerating this part by explicitly clearing some of them.

Tighten the interface instead.

While here renumber the flags and denote the gap between the 2 variants.

Try to piggy back th renumber on the just bumped __FreeBSD_version.


# 25e42ee2 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the hello world stat probes from the vfs provider

Interested parties can get the same information by hoooking on vop_stat.


# 7f700801 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: disallow NOCACHE with LOOKUP

This means there is no expectation lookup will purge the terminal entry,
which simplifies lockless lookup.

Tested by: pho
Sponsored by: The FreeBSD Foundation


# 158ab70c 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: tidy up namei entry point

- predict for string copy errors
- reshuffle inititalistion of vars which are not needed


# 85cf3161 01-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: inline NDINIT_ALL

The routine takes more than 6 arguments, which on amd64 means some of
them have to be passed through the stack.


# 14576629 01-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: convert ni_rigthsneeded to a pointer

Shaves 8 bytes of struct nameidata on 64-bit platforms.


# 21c16260 01-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: make rights mandatory for NDINIT_ALL


# b1f910e0 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: short-circuit the common case NDFREE calls

Almost all consumers use the NDF_ONLY_PNBUF macro, making them avoidably branch
a lot in the NDFREE routine. Also note most of them should not need to call
any cleanup anyway as they don't request HASBUF.


# d3e63e8e 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: make sure startdir_used is always assigned to before use

CID: 1431070


# c42b77e6 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: lockless lookup

Provides full scalability as long as all visited filesystems support the
lookup and terminal vnodes are different.

Inner workings are explained in the comment above cache_fplookup.

Capabilities and fd-relative lookups are not supported and will result in
immediate fallback to regular code.

Symlinks, ".." in the path, mount points without support for lockless lookup
and mismatched counters will result in an attempt to get a reference to the
directory vnode and continue in regular lookup. If this fails, the entire
operation is aborted and regular lookup starts from scratch. However, care is
taken that data is not copied again from userspace.

Sample benchmark:
incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total

Sample microbenchmark: access calls to separate files in /tmpfs, 104 workers, ops/s:
before: 2165816
after: 151216530

Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25578


# 422f38d8 10-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix trivial whitespace issues which don't interefere with blame

.. even without the -w switch


# 2f423bce 01-Mar-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop taking additional refs on root vnode during lookup

They are spurious since introduction of struct pwd, which provides them
implicitly.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23885


# 8d03b99b 01-Mar-2020 Mateusz Guzik <mjg@FreeBSD.org>

fd: move vnodes out of filedesc into a dedicated structure

The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884


# 721a81c3 20-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop duplicating vnode work in audit during path lookup

Duplicating the work was putting an avoidable requirement that the filedesc
lock is held across the entire operation (otherwise by the time audit reads
vnode pointers another thread in the same process can chdir somewhere else,
making audit log things using different vnode than the one which will be
used for actual lookup).

Do the obvious thing and pass down vnodes which will be used.


# e126c5a3 14-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use new capsicum helpers


# 6ebab6ba 13-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use mac fastpath for lookup, open, read, write, mmap


# 3d62f685 03-Feb-2020 Kyle Evans <kevans@FreeBSD.org>

namei: preserve errors from fget_cap_locked

Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.

PR: 243839
Reviewed by: kib (committed form recommended by)
Tested by: lwhsu
Differential Revision: https://reviews.freebsd.org/D23479


# 6a5abb1e 02-Feb-2020 Kyle Evans <kevans@FreeBSD.org>

Provide O_SEARCH

O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247


# 90f4ec33 31-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: save on atomics on the root vnode for absolute lookups

There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23427


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# 6fa079fc 15-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: flatten vop vectors

This eliminates the following loop from all VOP calls:

while(vop != NULL && \
vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
vop = vop->vop_default;

Reviewed by: jeff
Tesetd by: pho
Differential Revision: https://reviews.freebsd.org/D22738


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# e28fa55a 21-May-2019 Konstantin Belousov <kib@FreeBSD.org>

NDFREE(): Fix unlocking for LOCKPARENT|LOCKLEAF and ndp->ni_dvp == ndp->ni_vp.

NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed
out. This makes the comparision of ni_dvp with ni_vp always fail.
Move the calculation of unlock_dvp right after unlock_vp, so that the
code sees correct ni_vp value.

Reproduced by
chdir("/usr");
open("/..", O_BENEATH | O_RDONLY);

Reported by: syzkaller
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20304


# 7cdb0b9d 07-Feb-2019 Konstantin Belousov <kib@FreeBSD.org>

Fix renameat(2) for CAPABILITIES kernels.

When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but
corresponding namei ni_filecap is not initialized at all because the
lookup is absolute. As result, the check was done against empty filecap
and syscall fails erronously.

Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.

PR: 222258
Reviewed by: jilles, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-note: KBI breakage
Differential revision: https://reviews.freebsd.org/D19096


# 24d64be4 13-Dec-2018 Mateusz Guzik <mjg@FreeBSD.org>

vfs: mostly depessimize NDINIT_ALL

1) filecaps_init was unnecesarily a function call
2) an asignment at the end was preventing tail calling of cap_rights_init

Sponsored by: The FreeBSD Foundation


# 7d2b0bd7 29-Nov-2018 Konstantin Belousov <kib@FreeBSD.org>

If BENEATH is specified, always latch the topping directory vnode.

It is possible that we started with a relative path but during the
lookup, found an absolute symlink. In this case, BENEATH handling
code needs the latch, but it is too late to calculate it.

While there, somewhat improve the assertions. Clear the NI_LCF_LATCH
flag when the latch vnode is released, so that asserts know the state.
Assert that there is a latch if we entered beneath+abs path mode,
after the starting point is processed.

Reported by: wulf
With more input from: pho
Sponsored by: The FreeBSD Foundation


# ade85c5e 10-Nov-2018 Konstantin Belousov <kib@FreeBSD.org>

Allow absolute paths for O_BENEATH.

The path must have a tail which does not escape starting/topping
directory. The documentation will come shortly, see the man pages
commit message for the reason of separate commit.

Reviewed by: jilles (previous version)
Discussed with: emaste
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17714


# 4f77f488 25-Oct-2018 Konstantin Belousov <kib@FreeBSD.org>

Implement O_BENEATH and AT_BENEATH.

Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory. Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547


# c396945b 20-Sep-2018 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove lookup_shared tunable

Reviewed by: kib, jhb
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17253


# 84482abd 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

vfs: annotate variables only used by debug builds as __unused


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# 38d84d68 17-Nov-2017 Conrad Meyer <cem@FreeBSD.org>

vfs_lookup: Allow PATH_MAX-1 symlinks

Previously, symlinks in FreeBSD were artificially limited to PATH_MAX-2.

Add a short test case to verify the change.

Submitted by: Gaurav Gangalwar <ggangalwar AT isilon.com>
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12589


# 11ce4d9f 15-Oct-2017 Tijl Coosemans <tijl@FreeBSD.org>

When a Linux program tries to access a /path the kernel tries
/compat/linux/path before /path. Stop following symbolic links when
looking up /compat/linux/path so dead symbolic links aren't ignored.
This allows syscalls like readlink(2) and lstat(2) to work on such links.
And open(2) will return an error now instead of trying /path.


# 03f7f178 15-Mar-2017 John Baldwin <jhb@FreeBSD.org>

Use UMA_ALIGN_PTR instead of sizeof(void *) for zone alignment.

uma_zcreate()'s alignment argument is supposed to be sizeof(foo) - 1,
and uma.h provides a set of helper macros for common types. Passing
sizeof(void *) results in all of the members being misaligned triggering
unaligned access faults on certain architectures (notably MIPS).

Reported by: brooks
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA / AFRL


# aec8391d 22-Jan-2017 Konstantin Belousov <kib@FreeBSD.org>

Provide fallback VOP methods for crossmp vnode.

In particular, crossmp vnode might leak into rename code.

PR: 216380
Reported by: fnacl@protonmail.com
Sponsored by: The FreeBSD Foundation
X-MFC with: r309425


# 5ec7cde4 04-Jan-2017 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix bug that would result in a kernel crash in some cases involving
a symlink and an autofs mount request. The crash was caused by namei()
calling bcopy() with a negative length, caused by numeric underflow:
in lookup(), in the relookup path, the ni_pathlen was decremented too
many times. The bug was introduced in r296715.

Big thanks to Alex Deiter for his help with debugging this.

Reviewed by: kib@
Tested by: Alex Deiter <alex.deiter at gmail.com>
MFC after: 1 month


# 5afb134c 12-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vrefact, to be used when the vnode has to be already active

This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by: kib (previous version)


# 778aa66a 12-Dec-2016 Konstantin Belousov <kib@FreeBSD.org>

Enable lookup_cap_dotdot and lookup_cap_dotdot_nonlocal.

Requested and reviewed by: cem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8746


# a2d35545 02-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: provide fake locking primitives for the crossmp vnode

Since the vnode is only expected to be shared locked, we can save a
little overhead by only pretending we are locking in the first place.

Reviewed by: kib
Tested by: pho


# a4ce25b5 29-Nov-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix a whitespace nit in r309307


# 1babea03 29-Nov-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: avoid VOP_ISLOCKED in the common case in lookup


# 7359fdcf 01-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

Allow some dotdot lookups in capability mode.

If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal. Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.

Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.

Disallow non-local filesystems for dotdot, since remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, provide the knob
vfs.lookup_cap_dotdot_nonlocal to override as well.

Idea by: rwatson
Discussed with: emaste, jonathan, rwatson
Reviewed by: mjg (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
Differential revision: https://reviews.freebsd.org/D8110


# 1bf6a090 01-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

Remove tautological casts.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# ec846935 01-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

Style fixes.

Discussed with: emaste
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# 11d3ad2e 27-Aug-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: provide a common exit point in namei for error cases

This shortens the function, adds the SDT_PROBE use for error cases and
consistenly unrefs rootdir last.

Reviewed by: kib
MFC after: 2 weeks


# 411455a8 10-Aug-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after: 1 month


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# b85f65af 15-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

kern: for pointers replace 0 with NULL.

These are mostly cosmetical, no functional change.

Found with devel/coccinelle.


# 2a5a08cb 12-Mar-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Refactor the way we restore cn_lkflags; no functional changes.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# f69db551 12-Mar-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Remove cn_consume from 'struct componentname'. It was never set to anything
other than 0.

Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5611


# 213ed838 12-Mar-2016 Edward Tomasz Napierala <trasz@FreeBSD.org>

Fix autofs triggering problem. Assume you have an NFS server,
192.168.1.1, with share "share". This commit fixes a problem
where "mkdir /net/192.168.1.1/share/meh" would return spurious
error instead of creating the directory if the target filesystem
wasn't mounted yet; subsequent attempts would work correctly.

The failure scenario is kind of complicated to explain, but it all
boils down to calling VOP_MKDIR() for the target filesystem (NFS)
with wrong dvp - the autofs vnode instead of the filesystem root
mounted over it.

Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5442


# 2f2f522b 27-Sep-2015 Andriy Gapon <avg@FreeBSD.org>

save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE

SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after: 20 days


# 8c43e4cc 12-Aug-2015 Ed Schouten <ed@FreeBSD.org>

Properly return ENOTDIR when calling *at() on a non-vnode.

We already properly return ENOTDIR when calling *at() on a non-directory
vnode, but it turns out that if you call it on a socket, we see EINVAL.
Patch up namei to properly translate this to ENOTDIR.


# 318b9463 09-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: cosmetic changes to namei and namei_handle_root

- don't initialize cnp during declaration
- don't test error/!error, compare to 0 instead


# d177f49f 09-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: simplify error handling in namei

The logic is reorganised so that there is one exit point prior to the
lookup loop. This is an intermediate step to making audit logging
functions use found vnode instead of translating ni_dirfd on their own.

ni_startdir validation is removed. The only in-tree consumer is nfs
which already makes sure it is a directory.

Reviewed by: kib


# d19ba50e 09-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: avoid spurious vref/vrele for absolute lookups

namei used to vref fd_cdir, which was immediatley vrele'd on entry to
the loop.

Check for absolute lookup and vref the right vnode the first time.

Reviewed by: kib


# a03f1b29 09-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: plug a use-after-free of fd_rdir in namei

fd_rdir vnode was stored in ni_rootdir without refing it in any way,
after which the filedsc lock was being dropped.

The vnode could have been freed by mountcheckdirs or another thread doing
chroot.

VREF the vnode while the lock is held.

Reviewed by: kib
MFC after: 1 week


# 947401dd 05-Jul-2015 Mark Johnston <markj@FreeBSD.org>

Move the comment describing namei(9) back to namei()'s definition.

MFC after: 3 days


# 72ba3c08 02-Nov-2014 Konstantin Belousov <kib@FreeBSD.org>

Fix two issues with lockmgr(9) LK_CAN_SHARE() test, which determines
whether the shared request for already shared-locked lock could be
granted. Both problems result in the exclusive locker starvation.

The concurrent exclusive request is indicated by either
LK_EXCLUSIVE_WAITERS or LK_EXCLUSIVE_SPINNERS flags. The reverse
condition, i.e. no exclusive waiters, must check that both flags are
cleared.

Add a flag LK_NODDLKTREAT for shared lock request to indicate that
current thread guarantees that it does not own the lock in shared
mode. This turns back the exclusive lock starvation avoidance code;
see man page update for detailed description.

Use LK_NODDLKTREAT when doing lookup(9).

Reported and tested by: pho
No objections from: attilio
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 8cc11167 23-Aug-2014 Mateusz Guzik <mjg@FreeBSD.org>

Plug a memory leak in case of failed lookups in capability mode.

Put common cnp cleanup into one function and use it for this purpose.

MFC after: 1 week


# af3b2549 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Pull in r267961 and r267973 again. Fix for issues reported will follow.


# 37a107a4 27-Jun-2014 Glen Barber <gjb@FreeBSD.org>

Revert r267961, r267973:

These changes prevent sysctl(8) from returning proper output,
such as:

1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory


# 3da1cf1e 27-Jun-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after: 2 weeks
Sponsored by: Mellanox Technologies


# 4a144410 16-Mar-2014 Robert Watson <rwatson@FreeBSD.org>

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 6272798a 09-Nov-2013 Konstantin Belousov <kib@FreeBSD.org>

Both vn_close() and VFS_PROLOGUE() evaluate vp->v_mount twice, without
holding the vnode lock; vp->v_mount is checked first for NULL
equiality, and then dereferenced if not NULL. If vnode is reclaimed
meantime, second dereference would still give NULL. Change
VFS_PROLOGUE() to evaluate the mp once, convert MNTK_SHARED_WRITES and
MNTK_EXTENDED_SHARED tests into inline functions.

Reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# 3fded357 18-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix panic in ktrcapfail() when no capability rights are passed.
While here, correct all consumers to pass NULL instead of 0 as we pass
capability rights as pointers now, not uint64_t.

Reported by: Daniel Peyrolon
Tested by: Daniel Peyrolon
Approved by: re (marius)


# 7008be5b 04-Sep-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation


# 456597e7 05-Aug-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not override the ENOENT error for the empty path, or EFAULT errors
from copyins, with the relative lookup check.

Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# c686ee46 01-Apr-2013 Konstantin Belousov <kib@FreeBSD.org>

Do not call the VOP_LOOKUP() for the doomed directory vnode. The
vnode could be reclaimed while lock upgrade was performed.

Sponsored by: The FreeBSD Foundation
Reported and tested by: pho
Diagnosed and reviewed by: rmacklem
MFC after: 1 week


# 2609222a 01-Mar-2013 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrive
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNODE to CAP_MKNODEAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).

Added convinient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib


# 593efaf9 21-Feb-2013 John Baldwin <jhb@FreeBSD.org>

Further refine the handling of stop signals in the NFS client. The
changes in r246417 were incomplete as they did not add explicit calls to
sigdeferstop() around all the places that previously passed SBDRY to
_sleep(). In addition, nfs_getcacheblk() could trigger a write RPC from
getblk() resulting in sigdeferstop() recursing. Rather than manually
deferring stop signals in specific places, change the VFS_*() and VOP_*()
methods to defer stop signals for filesystems which request this behavior
via a new VFCF_SBDRY flag. Note that this has to be a VFC flag rather than
a MNTK flag so that it works properly with VFS_MOUNT() when the mount is
not yet fully constructed. For now, only the NFS clients are set this new
flag in VFS_SET().

A few other related changes:
- Add an assertion to ensure that TDF_SBDRY doesn't leak to userland.
- When a lookup request uses VOP_READLINK() to follow a symlink, mark
the request as being on behalf of the thread performing the lookup
(cnp_thread) rather than using a NULL thread pointer. This causes
NFS to properly handle signals during this VOP on an interruptible
mount.

PR: kern/176179
Reported by: Russell Cattelan (sigdeferstop() recursion)
Reviewed by: kib
MFC after: 1 month


# 8909f88d 01-Dec-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix one more compilation issue.


# 499f0f4d 30-Nov-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

IFp4 @208451:

Fix path handling for *at() syscalls.

Before the change directory descriptor was totally ignored,
so the relative path argument was appended to current working
directory path and not to the path provided by descriptor, thus
wrong paths were stored in audit logs.

Now that we use directory descriptor in vfs_lookup, move
AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2() calls to the place where
we hold file descriptors table lock, so we are sure paths will
be resolved according to the same directory in audit record and
in actual operation.

Sponsored by: FreeBSD Foundation (auditdistd)
Reviewed by: rwatson
MFC after: 2 weeks


# f121e3e8 27-Nov-2012 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Add NOCAPCHECK flag to namei that allows lookup to work even if the process
is in capability mode.
- Add VN_OPEN_NOCAPCHECK flag for vn_open_cred() to will ne converted into
NOCAPCHECK namei flag.

This functionality will be used to enable core dumps for sandboxed processes.

Reviewed by: rwatson
Obtained from: WHEEL Systems
MFC after: 2 weeks


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 84c3cd4f 09-Sep-2012 Konstantin Belousov <kib@FreeBSD.org>

Add MNTK_LOOKUP_EXCL_DOTDOT struct mount flag, which specifies to the
lookup code that dotdot lookups shall override any shared lock
requests with the exclusive one. The flag is useful for filesystems
which sometimes need to upgrade shared lock to exclusive inside the
VOP_LOOKUP or later, which cannot be done safely for dotdot, due to
dvp also locked and causing LOR.

In collaboration with: pho
MFC after: 3 weeks


# cdb7a431 01-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

Avoid double-unlock or double unreference for ndp->ni_dvp when the vnode dp
lock upgrade right after the 'success' label fails.

In collaboration with: pho
MFC after: 1 week


# e141be6f 18-Oct-2011 Dag-Erling Smørgrav <des@FreeBSD.org>

Revisit the capability failure trace points. The initial implementation
only logged instances where an operation on a file descriptor required
capabilities which the file descriptor did not have. By adding a type enum
to struct ktr_cap_fail, we can catch other types of capability failures as
well, such as disallowed system calls or attempts to wrap a file descriptor
with more capabilities than it had to begin with.


# 69d377fe 13-Aug-2011 Jonathan Anderson <jonathan@FreeBSD.org>

Allow Capsicum capabilities to delegate constrained
access to file system subtrees to sandboxed processes.

- Use of absolute paths and '..' are limited in capability mode.
- Use of absolute paths and '..' are limited when looking up relative
to a capability.
- When a name lookup is performed, identify what operation is to be
performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP.

With these constraints, openat() and friends are now safe in capability
mode, and can then be used by code such as the capability-mode runtime
linker.

Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc


# a9d2f8d8 10-Aug-2011 Robert Watson <rwatson@FreeBSD.org>

Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 79856499 22-Aug-2010 Rui Paulo <rpaulo@FreeBSD.org>

Add an extra comment to the SDT probes definition. This allows us to get
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by: The FreeBSD Foundation
Discussed with: rwaston [1]


# 3634d5b2 20-Aug-2010 John Baldwin <jhb@FreeBSD.org>

Add dedicated routines to toggle lockmgr flags such as LK_NOSHARE and
LK_CANRECURSE after a lock is created. Use them to implement macros that
otherwise manipulated the flags directly. Assert that the associated
lockmgr lock is exclusively locked by the current thread when manipulating
these flags to ensure the flag updates are safe. This last change required
some minor shuffling in a few filesystems to exclusively lock a brand new
vnode slightly earlier.

Reviewed by: kib
MFC after: 3 days


# 1732ca8f 31-May-2010 Robert Watson <rwatson@FreeBSD.org>

Merge r203410 from head to stable/8:

Only audit pathnames in namei(9) if copying the directory string completes
successfully. Continue to do this before the empty path check so that the
ENOENT returned in that case gets an empty string token in the BSM record.

Approved by: re (kib)


# 246b6510 26-Mar-2010 Jaakko Heinonen <jh@FreeBSD.org>

Support only LOOKUP operation for "/" in relookup() because lookup()
can't succeed for CREATE, DELETE and RENAME.

Discussed with: bde


# b10c6cf4 02-Feb-2010 Robert Watson <rwatson@FreeBSD.org>

Only audit pathnames in namei(9) if copying the directory string completes
successfully. Continue to do this before the empty path check so that the
ENOENT returned in that case gets an empty string token in the BSM record.

MFC after: 3 days


# 5c303791 17-Nov-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r199137:
Detect the slashdot lookup for RENAME or REMOVE in lookup(), and return
EINVAL.


# 88e6f61a 10-Nov-2009 Konstantin Belousov <kib@FreeBSD.org>

When rename("a", "b/.") is performed, target namei() call returns
dvp == vp. Rename syscall does not check for the case, and at least
ufs_rename() cannot deal with it. POSIX explicitely requires that both
rename(2) and rmdir(2) return EINVAL when any of the pathes end in "/.".

Detect the slashdot lookup for RENAME or REMOVE in lookup(), and return
EINVAL.

Reported by: Jim Meyering <jim meyering net>
Tested by: simon, pho
MFC after: 1 week


# 791b0ad2 29-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Eliminate ARG_UPATH[12] arguments to AUDIT_ARG_UPATH() and instead
provide specific macros, AUDIT_ARG_UPATH1() and AUDIT_ARG_UPATH2()
to capture path information for audit records. This allows us to
move the definitions of ARG_* out of the public audit header file,
as they are an implementation detail of our current kernel-internal
audit record, which may change.

Approved by: re (kensmith)
Obtained from: TrustedBSD Project
MFC after: 1 month


# b146fc1b 28-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Rework vnode argument auditing to follow the same structure, in order
to avoid exposing ARG_ macros/flag values outside of the audit code in
order to name which one of two possible vnodes will be audited for a
system call.

Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 month


# e4b4bbb6 28-Jul-2009 Robert Watson <rwatson@FreeBSD.org>

Audit file descriptors passed to fooat(2) system calls, which are used
instead of the root/current working directory as the starting point for
lookups. Up to two such descriptors can be audited. Add audit record
BSM encoding for fooat(2).

Note: due to an error in the OpenBSM 1.1p1 configuration file, a
further change is required to that file in order to fix openat(2)
auditing.

Approved by: re (kib)
Reviewed by: rdivacky (fooat(2) portions)
Obtained from: TrustedBSD Project
MFC after: 1 month


# 14961ba7 27-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Replace AUDIT_ARG() with variable argument macros with a set more more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week


# 322ef7cc 05-Jun-2009 Dag-Erling Smørgrav <des@FreeBSD.org>

Eliminate trailing_slash, which was made redundant in r193028.
Remove a couple of 4-year-old "temporary" KASSERTs.
Improve comments.

MFC after: 1 week


# bcf11e8d 05-Jun-2009 Robert Watson <rwatson@FreeBSD.org>

Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd


# 32bf7cdf 29-May-2009 Dag-Erling Smørgrav <des@FreeBSD.org>

Let vfs_lookup() return ENOTDIR if the path has a trailing slash and
the last component is a symlink to something that isn't a directory.

We introduce a new namei flag, TRAILINGSLASH, which is set by lookup()
if the last component is followed by a slash. The trailing slash is
then stripped, as before. If the final component is a symlink,
lookup() will return to namei(), which will expand the symlink and
call lookup() with the new path. When all symlinks have been
resolved, lookup() checks if the TRAILINGSLASH flag is set, and if it
is, and the vnode it ended up with is not a directory, it returns
ENOTDIR.

PR: kern/21768
Submitted by: Eygene Ryabinkin <rea-fbsd@codelabs.ru>
MFC after: 3 weeks


# b181c8aa 29-May-2009 Dag-Erling Smørgrav <des@FreeBSD.org>

Fix misleading comment.

MFC after: 1 week


# 0304c731 27-May-2009 Jamie Gritton <jamie@FreeBSD.org>

Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)


# 22d7ae67 11-May-2009 Attilio Rao <attilio@FreeBSD.org>

Fix a kernel compilation error, introduced after r191990, by defining
thread with curthread in the AUDIT case.

Reported by: dchagin


# dfd233ed 11-May-2009 Attilio Rao <attilio@FreeBSD.org>

Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.

VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.


# 4b4e58ba 06-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Add SDT DTrace probes for namei():

vfs:namei:lookup:entry takes parent directory vnode pointer, path to
look up, and lookup flags.
vfs:namei:lookup:return takes an error value, and if successful, the
returned vnode pointer.

MFC after: 1 month


# 049ce093 24-Mar-2009 John Baldwin <jhb@FreeBSD.org>

When a file lookup fails due to encountering a doomed vnode from a forced
unmount, consistently return ENOENT rather than EBADF.

Reviewed by: kib
MFC after: 1 month


# a6b6eb6b 11-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Gah, fix the code to match the comment. For non-open lookups use a
shared vnode lock for the leaf vnode if LOCKSHARED is set.

Submitted by: rdivacky


# 33fc3625 11-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Add a new internal mount flag (MNTK_EXTENDED_SHARED) to indicate that a
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.

Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month


# 2cfddad7 18-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Do not return success and doomed vnode from lookup. LK_UPGRADE allows
the vnode to be reclaimed.

Tested by: pho
MFC after: 1 month


# 1ba4a712 17-Nov-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.

- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris


# 0f54f8c2 03-Nov-2008 John Baldwin <jhb@FreeBSD.org>

A few style nits.


# 83b3bdbc 02-Nov-2008 Attilio Rao <attilio@FreeBSD.org>

Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# b957a822 01-Oct-2008 John Baldwin <jhb@FreeBSD.org>

Enable shared locks for path name lookups on supported filesystems (NFS
client, UFS, and ZFS) by default.


# d59701d0 01-Oct-2008 John Baldwin <jhb@FreeBSD.org>

Remove the LOOKUP_SHARED kernel option. Instead, make vfs.lookup_shared
a loader tunable (it was already a sysctl).


# 59d49325 31-Aug-2008 Attilio Rao <attilio@FreeBSD.org>

Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>


# 48b05c3f 08-Apr-2008 Konstantin Belousov <kib@FreeBSD.org>

Implement the linux syscalls
openat, mkdirat, mknodat, fchownat, futimesat, fstatat, unlinkat,
renameat, linkat, symlinkat, readlinkat, fchmodat, faccessat.

Submitted by: rdivacky
Sponsored by: Google Summer of Code 2007
Tested by: pho


# 57b4252e 30-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the support for the AT_FDCWD and fd-relative name lookups to the
namei(9).

Based on the submission by rdivacky,
sponsored by Google Summer of Code 2007
Reviewed by: rwatson, rdivacky
Tested by: pho


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 81c794f9 25-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


# 628f51d2 24-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Introduce some functions in the vnode locks namespace and in the ffs
namespace in order to handle lockmgr fields in a controlled way instead
than spreading all around bogus stubs:
- VN_LOCK_AREC() allows lock recursion for a specified vnode
- VN_LOCK_ASHARE() allows lock sharing for a specified vnode

In FFS land:
- BUF_AREC() allows lock recursion for a specified buffer lock
- BUF_NOREC() disallows recursion for a specified buffer lock

Side note: union_subr.c::unionfs_node_update() is the only other function
directly handling lockmgr fields. As this is not simple to fix, it has
been left behind as "sole" exception.


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjuction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modify makes the code cleaner and in
particular remove an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# 30d239bc 24-Oct-2007 Robert Watson <rwatson@FreeBSD.org>

Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

mac_<object>_<method/action>
mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer


# b4d7e298 21-Sep-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix some locking cases where we ask for exclusively locked vnode, but we get
shared locked vnode in instead when vfs.lookup_shared is set to 1.

Discussed with: kib, kris
Tested by: kris
Approved by: re (kensmith)


# e1e8f51b 27-May-2007 Robert Watson <rwatson@FreeBSD.org>

Universally adopt most conventional spelling of acquire.


# 5e3f7694 04-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are imported for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# e92d773f 31-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Rather than ignoring any error return from getnewvnode() in nameiinit(),
explicitly test and panic. This should not ever happen, but if it does,
this is a preferred failure mode to a NULL pointer dereference in kernel.

Coverity CID: 1716
Found with: Coverity Prevent(tm)


# 478a8db4 15-Feb-2007 Konstantin Belousov <kib@FreeBSD.org>

If both ISDOTDOT and NOCROSSMOUNT are set then lookup() might breaks out
of the special handling for ".." and perform an ISDOTDOT VOP_LOOKUP()
for a filesystem root vnode. Handle this case inside lookup().

Submitted by: tegge
PR: 92785
MFC after: 1 week


# 7f92c4ee 22-Jan-2007 Konstantin Belousov <kib@FreeBSD.org>

Below is slightly edited description of the LOR by Tor Egge:

--------------------------
[Deadlock] is caused by a lock order reversal in vfs_lookup(), where
[some] process is trying to lock a directory vnode, that is the parent
directory of covered vnode) while holding an exclusive vnode lock on
covering vnode.

A simplified scenario:

root fs var fs
/ A / (/var) D
/var B /log (/var/log) E
vfs lock C vfs lock F

Within each file system, the lock order is clear: C->A->B and F->D->E

When traversing across mounts, the system can choose between two lock orders,
but everything must then follow that lock order:

L1: C->A->B
|
+->F->D->E

L2: F->D->E
|
+->C->A->B

The lookup() process for namei("/var") mixes those two lock orders:

VOP_LOOKUP() obtains B while A is held
vfs_busy() obtains a shared lock on F while A and B are held (follows L1,
violates L2)
vput() releases lock on B
VOP_UNLOCK() releases lock on A
VFS_ROOT() obtains lock on D while shared lock on F is held
vfs_unbusy() releases shared lock on F
vn_lock() obtains lock on A while D is held (violates L1, follows L2)

dounmount() follows L1 (B is locked while F is drained).

Without unmount activity, vfs_busy() will always succeed without blocking
and the deadlock isn't triggered (the system behaves as if L2 is followed).

With unmount, you can get 4 processes in a deadlock:

p1: holds D, want A (in lookup())
p2: holds shared lock on F, want D (in VFS_ROOT())
p3: holds B, want drain lock on F (in dounmount())
p4: holds A, want B (in VOP_LOOKUP())

You can have more than one instance of p2.

The reversal was introduced in revision 1.81 of src/sys/kern/vfs_lookup.c and
MFCed to revision 1.80.2.1, probably to avoid a cascade of vnode locks when nfs
servers are dead (VFS_ROOT() just hangs) spreading to the root fs root vnode.

- Tor Egge

To fix the LOR, ups@ noted that when crossing the mount point, ni_dvp
is actually not used by the callers of namei. Thus, placeholder deadfs
vnode vp_crossmp is introduced that is filled into ni_dvp.

Idea by: ups
Reviewed by: tegge, ups, jeff, rwatson (mac interaction)
Tested by: Peter Holm
MFC after: 2 weeks


# aed55708 22-Oct-2006 Robert Watson <rwatson@FreeBSD.org>

Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA


# 3c5b80d6 14-Sep-2006 Mohan Srinivasan <mohans@FreeBSD.org>

Fix for a potential bug caught by Coverity. Pointed out to me by Kris Kennaway.


# 7d7d9e22 13-Sep-2006 Mohan Srinivasan <mohans@FreeBSD.org>

Fixes up the handling of shared vnode lock lookups in the NFS client,
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.

- amd on 6.x is a non-starter (without this change). Using amd under
heavy load results in a deadlock (with cascading vnode locks all the
way to the root) very quickly.
- This change should also fix the more general problem of cascading
vnode deadlocks when an NFS server goes down.

Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example isn't
ready for shared vnode lock lookups, crashing pretty quickly.

This change is the result of discussions with Stephan Uphoff (ups@).

Reviewed by: ups@


# 5111b5e1 05-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

Remove register, use ANSI function headers.


# 12de4510 05-Aug-2006 Robert Watson <rwatson@FreeBSD.org>

We now spell "inode" as "vnode" in the VFS layer, so update comment
for new world order.

MFC after: 3 days
Pointed out by: mckusick


# cef31ff7 29-Apr-2006 Kris Kennaway <kris@FreeBSD.org>

Lock giant when assigning ni_vp and keep vfslocked state valid.

Committed for: jeff


# 4b5b8681 27-Apr-2006 Jeff Roberson <jeff@FreeBSD.org>

- Consistently track ni_dvp and ni_vp with dvfslocked and vfslocked rather
than trying to optimize it into a single lock. This adds more calls to
lock giant with non smpsafe filesystems but is the only way to reliably
hold the correct lock.
- Remove an invalid assert in the mountedhere case in lookup and fix the
code to properly deal with the scenario. We can actually have a lookup
that returns dp == dvp with mountedhere set with certain unmount races.

Tested by: kris
Reported by: kris/mohans


# fdf86b2d 30-Mar-2006 Jeff Roberson <jeff@FreeBSD.org>

- LK_RETRY means nothing when passed to VOP_LOCK. Call vn_lock instead.
- Move the vn_lock of the dvp until after we've unbusied the filesystem
to avoid a LOR with the mount point lock.
- In the v_mountedhere while loop we acquire a new instance of giant each
time through without releasing the first. This would cause us to leak
Giant.

Sponsored by: Isilon Systems, Inc.


# 2f0bca55 06-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Don't check v_mount for NULL to determine if a vnode has been recycled.
Use the more appropriate VI_DOOMED flag instead.

Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


# 95fea57c 05-Feb-2006 Robert Watson <rwatson@FreeBSD.org>

Add AUDITVNODE[12] flags to namei(), which cause namei() to audit path
and vnode attribute information for looked up vnodes during the lookup
operation. This will allow consumers of namei() to specify that this
information be added to the in-process audit record.

Submitted by: wsalamon
Obtained from: TrustedBSD Project


# 9157b485 01-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Solve a problem where a vput could be called on an outgoing directory
without Giant held. Do this by tracking the vfslocked state for
the directory seperate from the child. This is only important
in the case where we cross a mountpoint.

Sponsored by: Isilon Systems, Inc.
MFC After: 3 days


# 1dd5fc0f 22-Jan-2006 Don Lewis <truckman@FreeBSD.org>

Tweak previous vfs_lookup.c commit to return an EINVAL error from
lookup() instead of EPERM when a DELETE or RENAME operation is
attempted on "..".

In kern_unlink(), remap EINVAL errors returned from namei() to EPERM
to match existing (and POSIX required) behaviour.

Discussed with: bde
MFC after: 3 days


# bea7a8d7 21-Jan-2006 Don Lewis <truckman@FreeBSD.org>

Return EPERM from lookup() if cn_nameiop is DELETE or RENAME and
the last component of the path name is "..". This keeps VOP_LOOKUP()
from locking vnodes in reverse order.

Tested by: Denis Shaposhnikov <dsh AT vlink DOT ru>
MFC after: 3 days


# e12560dd 21-Sep-2005 John Baldwin <jhb@FreeBSD.org>

Use correct VFS locking rather than unconditionally grabbing Giant around
namei() calls in kern_alternate_path().

Reviewed by: csjp
MFC after: 1 week


# 68ff2a43 15-Sep-2005 Christian S.J. Peron <csjp@FreeBSD.org>

Improve the MP safeness associated with the creation of symbolic
links and the execution of ELF binaries. Two problems were found:

1) The link path wasn't tagged as being MP safe and thus was not properly
protected.
2) The ELF interpreter vnode wasnt being locked in namei(9) and thus was
insufficiently protected.

This commit makes the following changes:

-Sets the MPSAFE flag in NDINIT for symbolic link paths
-Sets the MPSAFE flag in NDINIT and introduce a vfslocked variable which
will be used to instruct VFS_UNLOCK_GIANT to unlock Giant if it has been
picked up.
-Drop in an assertion into vfs_lookup which ensures that if the MPSAFE
flag is NOT set, that we have picked up giant. If not panic (if WITNESS
compiled into the kernel). This should help us find conditions where vnode
operations are in-sufficiently protected.

This is a RELENG_6 candidate.

Discussed with: jeff
MFC after: 4 days


# 0c207975 14-Aug-2005 Alexander Kabaev <kan@FreeBSD.org>

Do not keep parent directory locked while calling VFS_ROOT to traverse mount
points in lookup(). The lock can be dropped safely around VFS_ROOT because
LOCKPARENT semantics with child and perent vnodes coming from different FSes
does not really have any meaningful use. On the other hard, this prevents
easily triggered deadlock on systems using automounter daemon.


# 74a51232 13-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove a debugging printf that slipped in.

Spotted by: Peter Wemm


# 18ef8344 13-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Further simplify lookup; Force all filesystems to relock in the DOTDOT
case. There are bugs in some which didn't unlock in the ISDOTDOT case
to begin with that need to be addressed seperately. This simplifies
things anyway.
- Fix relookup() to prevent it from vrele()'ing the dvp while the vp
is locked. Catch up to other lookup changes.

Sponsored by: Isilon Systems, Inc.
Reported by: Peter Wemm


# d3b78f73 09-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- If we vrele() a dvp while the child is locked we can potentially deadlock
when vrele() acquires the directory lock in the wrong order. Fix this
via the following changes:
- Keep the directory locked after VOP_LOOKUP() until we've determined
what we're going to do with the child. This allows us to remove the
complicated post LOOKUP code which determins whether we should lock or
unlock the parent. This means we may have to vput() in the appropriate
cases later, rather than doing an unsafe vrele.
- in NDFREE() keep two flags to indicate whether we need to unlock vp or
dvp. This allows us to vput rather than vrele in the appropriate
cases without rechecking the flags. Move the code to handle dvp after
we handle vp.
- Remove some dead code from namei() that was the result of changes to
VFS_LOCK_GIANT().

Sponsored by: Isilon Systems, Inc.


# 2bbd6c98 05-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Move NDFREE() from vfs_subr to vfs_lookup where namei() is.


# 9a6bb8ad 03-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Include opt_vfs.h for LOOKUP_SHARED.
- Control the behavior of shared lookups with the lookup_shared sysctl
which has its default behavior set via the LOOKUP_SHARED option.


# 99f3c870 29-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Set cn_lkflags to LK_SHARED in the LOOKUP_SHARED case so that we only
acquire shared locks on intermediate directories.
- For the LASTCN, we may have to LK_UPGRADE the parent directory before
we lookup the last component.
- Acquire VFS_ROOT and dp locks based on the cn_lkflag.

Sponsored by: Isilon Systems, Inc.


# ea9aa09d 28-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove an unused variable from relookup().
- Assert that REMOVE, CREATE, and RENAME callers have WANTPARENT
or LOCKPARENT set. You can't complete any of these operations without
at least a reference to the parent. Many filesystems check for this case
even though it isn't possible in the current system.


# 1e38e08e 28-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of PDIRUNLOCK, instead, we fixup the lock state immediately after
calling VOP_LOOKUP(). Rather than having each filesystem check the
LOCKPARENT flag, we simply check it once here and unlock as required.
The only unusual case is ISDOTDOT, where we require an unlocked vnode
on return. Relocking this vnode with the child locked is allowed since
the child is actually its parent.
- Add a few asserts for some unusual conditions that I do not believe can
happen. These will later go away and turn into implementations for these
conditions.

Sponsored by: Isilon Systems, Inc.


# d830f828 24-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Pass LK_EXCLUSIVE to VFS_ROOT() to satisfy the new flags argument. For
now, all calls to VFS_ROOT() should still acquire exclusive locks.

Sponsored by: Isilon Systems, Inc.


# ad09e57f 23-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Clear LOCKSHARED if LOOKUP_SHARED is not enabled. This is not strictly
necessary since we disable the shared locks in vfs_cache, but it is
prefered that the option not leak out into filesystems when it is
disabled.

Sponsored by: Isilon Systems, Inc.


# 76951d21 07-Feb-2005 John Baldwin <jhb@FreeBSD.org>

- Tweak kern_msgctl() to return a copy of the requested message queue id
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks and doing non-lock-safe things like
copyout() with the data anyways. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatability system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.


# dcff5b14 24-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Don't call VOP_CREATEVOBJECT(), it's the responsibility of the
filesystem which owns the vnode.


# 22a960a6 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Acquire and release Giant as we enter and leave filesystems which
require it.
- Track the status of Giant with the nd flag HASGIANT.
- Release giant on return of namei() callers are not marked MPSAFE as
they already own giant.

Sponsored By: Isilon Systems, Inc.


# e39db32a 12-Jan-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
directly.


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# 082d2122 02-Dec-2004 Poul-Henning Kamp <phk@FreeBSD.org>

Make NAMEI_DIAGNOSTIC compile again and add a stragic vprint()


# 7a36e1d6 04-Aug-2004 Robert Watson <rwatson@FreeBSD.org>

Assert Giant in namei(). Bugs have been reported in which, following
a sleep() call waking up in namei(), a later assertion triggers that
Giant is not held. By asserting Giant at the start of namei(), we can
know that if that assertion triggers, Giant is lost during the call to
namei(), and not before.


# f257b7a5 12-Jul-2004 Alfred Perlstein <alfred@FreeBSD.org>

Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# b614dd13 19-Oct-2002 Robert Watson <rwatson@FreeBSD.org>

Add a new 'NOMACCHECK' flag to namei() NDINIT flags, which permits the
caller to indicate that MAC checks are not required for the lookup.
Similar to IO_NOMACCHECK for vn_rdwr(), this indicates that the caller
has already performed all required protections and that this is an
internally generated operation. This will be used by the NFS server
code, as we don't currently enforce MAC protections against requests
delivered via NFS.

While here, add NOCROSSMOUNT to PARAMASK; apparently this was used at
one point for name lookup flag checking, but isn't any longer or it
would have triggered from the NFS server code passing it to indicate
that mountpoints shouldn't be crossed in lookups.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# d03db429 31-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Introduce support for Mandatory Access Control and extensible
kernel access control.

Authorize vop_readlink() and vop_lookup() activities during recursive
path lookup via namei() via calls to appropriate MAC entry points.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs


# eeb92518 24-Jul-2002 Robert Watson <rwatson@FreeBSD.org>

Under #ifdef DIAGNOSTIC, NULL out componentname pointers if we free the
pnbuf to increase the chances of detecting use of a free'd name buffer
if SAVENAME or SAVESTART wasn't passed in. Curiously, running with these
changes doesn't panic the kernel, and should.


# 60a9bb19 06-Jun-2002 John Baldwin <jhb@FreeBSD.org>

Catch up to changes in ktrace API.


# d394511d 16-May-2002 Tom Rhodes <trhodes@FreeBSD.org>

More s/file system/filesystem/g


# c897b813 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.


# 8355f576 19-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@


# bdd67d48 27-Feb-2002 John Baldwin <jhb@FreeBSD.org>

- Change namei() to use td_ucred instead of p_ucred.
- Change the hack in access() that uses a temporary credential to set
td_ucred to the temp cred instead of p_ucred.


# 9e209b12 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Include sys/_lock.h and sys/_mutex.h to reduce namespace pollution.

Requested by: jhb


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# c7503f60 23-Jun-2001 Matthew Dillon <dillon@FreeBSD.org>

After exhaustive discussions and some meandering and confusion, enough
people are on track with the cause and effect of this, and although
fixing this severely degenerate case appears to violate the letter of
POSIX.1-200x, Bruce and I (and enough others) agree that it should be
comitted.

So, this patch generates an ENOENT error for any attempt to do a path lookup
through an empty symlink (e.g. open(), stat()).

Submitted by: "Andrey A. Chernov" <ache@nagual.pp.ru>
Reviewed by: bde
Discussed exhaustively on: freebsd-current
Previously committed to: NetBSD 4 years ago


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 138e514c 06-Dec-2000 Peter Wemm <peter@FreeBSD.org>

Untangle vfsinit() a bit. Use seperate sysinit functions rather than
having a super-function calling bits all over the place.


# 23771027 30-Nov-2000 Alfred Perlstein <alfred@FreeBSD.org>

This is a fix for a problem described in PR kern/19572. It was
recently discussed at -hackers. The problem is a null-pointer
dereference that happens in kern/vfs_lookup.c when accessing ".."
with a v_mount entry for the current directory vnode of NULL. This
happens when a volume is forcibly unmounted, and the vnode for a
working directory in the mounted volume is cleared.

PR: 23191
Submitted by: Thomas Moestl <tmoestl@gmx.net>


# 3ff1a2f4 17-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add new flag PDIRUNLOCK to the component.cn_flags which should be set by
filesystem lookup() routine if it unlocks parent directory. This flag should
be carefully tracked by filesystems if they want to work properly with nullfs
and other stacked filesystems.

VFS takes advantage of this flag to perform symantically correct usage
of vrele() instead of vput() if parent directory already unlocked.

If filesystem fails to track this flag then previous codepath in VFS left
unchanged.

Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystmes will
be changed after some period of testing.

Reviewed in general by: mckusick, dillon, adrian
Obtained from: NetBSD


# 64134168 13-Sep-2000 Boris Popov <bp@FreeBSD.org>

Unlock current directory when calling VFS_ROOT() because underlying
filesystem may hold the lock. Otherwise unavoidable deadlock will occur.
This shouldn't have any side effects as long as we hold vfs lock.

Obtained from: NetBSD


# 762e6b85 15-Dec-1999 Eivind Eklund <eivind@FreeBSD.org>

Introduce NDFREE (and remove VOP_ABORTOP)


# 3b6fb885 02-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Before we start to mess with the VFS name-cache clean things up a little bit:
Isolate the namecache in its own file, and give it a dedicated malloc type.


# 2fe5bd8b 25-Sep-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a hole in jail(2).

Noticed by: Alexander Bezroutchko <abb@zenon.net>


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 67452993 26-Jul-1999 Alan Cox <alc@FreeBSD.org>

Add sysctl and support code to allow directories to be VMIO'd. The default
setting for the sysctl is OFF, which is the historical operation.

Submitted by: dillon


# 8aef1712 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# d254af07 27-Jan-1999 Matthew Dillon <dillon@FreeBSD.org>

Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile


# 219cbf59 09-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

KNFize, by bde.


# 5526d2d9 08-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith


# fb116777 05-Jan-1999 Eivind Eklund <eivind@FreeBSD.org>

Remove the 'waslocked' parameter to vfs_object_create().


# ecbb00a2 07-Jun-1998 Doug Rabson <dfr@FreeBSD.org>

This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
time.


# 5ddc8ded 08-Apr-1998 Wolfram Schneider <wosch@FreeBSD.org>

New mount option nosymfollow. If enabled, the kernel lookup()
function will not follow symbolic links on the mounted
file system and return EACCES (Permission denied).


# 9f24f214 14-Feb-1998 John Dyson <dyson@FreeBSD.org>

Make the rootdir handling more consistent. Now, processes always
have a root vnode associated with them, and no special checks for
the null case are needed.
Submitted by: terry@freebsd.org


# 0b08f5f7 05-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Back out DIAGNOSTIC changes.


# 47cfdb16 04-Feb-1998 Eivind Eklund <eivind@FreeBSD.org>

Turn DIAGNOSTIC into a new-style option.


# 95e5e988 05-Jan-1998 John Dyson <dyson@FreeBSD.org>

Make our v_usecount vnode reference count work identically to the
original BSD code. The association between the vnode and the vm_object
no longer includes reference counts. The major difference is that
vm_object's are no longer freed gratuitiously from the vnode, and so
once an object is created for the vnode, it will last as long as the
vnode does.

When a vnode object reference count is incremented, then the underlying
vnode reference count is incremented also. The two "objects" are now
more intimately related, and so the interactions are now much less
complex.

When vnodes are now normally placed onto the free queue with an object still
attached. The rundown of the object happens at vnode rundown time, and
happens with exactly the same filesystem semantics of the original VFS
code. There is absolutely no need for vnode_pager_uncache and other
travesties like that anymore.

A side-effect of these changes is that SMP locking should be much simpler,
the I/O copyin/copyout optimizations work, NFS should be more ponderable,
and further work on layered filesystems should be less frustrating, because
of the totally coherent management of the vnode objects and vnodes.

Please be careful with your system while running this code, but I would
greatly appreciate feedback as soon a reasonably possible.


# 2be70f79 28-Dec-1997 John Dyson <dyson@FreeBSD.org>

Lots of improvements, including restructring the caching and management
of vnodes and objects. There are some metadata performance improvements
that come along with this. There are also a few prototypes added when
the need is noticed. Changes include:

1) Cleaning up vref, vget.
2) Removal of the object cache.
3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore.
4) Correct some missing LK_RETRY's in vn_lock.
5) Correct the page range in the code for msync.

Be gentle, and please give me feedback asap.


# 675ea6f0 26-Dec-1997 Bruce Evans <bde@FreeBSD.org>

Unspammed nested include of <vm/vm_zone.h>.


# 99448ed1 20-Sep-1997 John Dyson <dyson@FreeBSD.org>

Change the M_NAMEI allocations to use the zone allocator. This change
plus the previous changes to use the zone allocator decrease the useage
of malloc by half. The Zone allocator will be upgradeable to be able
to use per CPU-pools, and has more intelligent usage of SPLs. Additionally,
it has reasonable stats gathering capabilities, while making most calls
inline.


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# 42146e37 04-Apr-1997 Doug Rabson <dfr@FreeBSD.org>

[Previous comment was incorrect for these files]
Added calls to VFS lock debugging macros to make fixing filesystems' locking
easier.


# de15ef6a 04-Apr-1997 Doug Rabson <dfr@FreeBSD.org>

Add a function vop_sharedlock which a copy of vop_nolock without the
implementation #ifdef out. This can be used for now by NFS. As soon
as all the other filesystems' locking is fixed, this can go away.

Print the vnode address in vprint for easier debugging.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 78fd7b3d 17-Feb-1997 Bruce Evans <bde@FreeBSD.org>

Fixed namei caching for LOOKUPs. It was broken for lstat() and olstat().
Successful lstat()s purged an existing entry as well as not caching the
result.

This bug was introduced in Lite1 by setting the LOCKPARENT flag for
[o]lstat() in order to support the inherit-attributes-from-parent-
directory misfeature for symlinks. LOCKPARENT was previously only set
for CREATEs and DELETEs. It is now set for LOOKUPs, but only for
[o]lstat(), so the problem wasn't very noticeable.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# 4958bbd1 01-Dec-1996 Bruce Evans <bde@FreeBSD.org>

Don't allow empty pathnames. POSIX standard.

Most of the standard utilities that depended on (or were broken in
a different way by) the old behaviour of interpreting "" as "."
were fixed a year or two ago. There is still a fairly harmless
bug in tar and a harmless bug in gzip. Tar apparently replaces
"/" by "" when it strips leading slashes.


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# db6a20e2 03-Jan-1996 Garrett Wollman <wollman@FreeBSD.org>

Converted two options over to the new scheme: USER_LDT and KTRACE.


# d68a4190 22-Oct-1995 David Greenman <dg@FreeBSD.org>

Moved the filesystem read-only check out of the syscalls and into the
filesystem layer, as was done in lite-2. Merged in some other cosmetic
changes while I was at it. Rewrote most of msdosfs_access() to be more
like ufs_access() and to include the FS read-only check.

Obtained from: partially from 4.4BSD-lite2


# 27df9774 24-Aug-1995 Doug Rabson <dfr@FreeBSD.org>

Add support for amd direct maps.

Reviewed by: Thomas Graichen <graichen@sirius.physik.fu-berlin.de>


# 70eec742 30-Jul-1995 Bruce Evans <bde@FreeBSD.org>

Ignore trailing slashes in pathnames that "refer to a directory",
as is required to be POSIXLY_CORRECT and "right". I interpret
"referring to a directory" as being a directory or becoming a
directory. E.g., the trailing slashes in mkdir("/nonesuch/"),
rename("/tmp", /nonesuch/") and link("/tmp", "/root_can_like_dirs/")
are ignored because the target will become a directory if the
syscall succeeds. A trailing slash on a symlink causes the symlink
to be followed (this is a bug if the symlink doesn't point to a
directory; fix later).


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# 82478919 06-Oct-1994 David Greenman <dg@FreeBSD.org>

Use tsleep() rather than sleep so that 'ps' is more informative about
the wait.


# 38103198 27-Sep-1994 Poul-Henning Kamp <phk@FreeBSD.org>

Moved the "relookup" routine into vfs_lookup.c from ufs/ufs/ufs_vnops.c.
Several FS's use this, so it doesn't belong in ufs. (unionfs, msdosfs and ufs)


# 7b42c960 19-Aug-1994 David Greenman <dg@FreeBSD.org>

1) cleaned up after Garrett - fixed more redundant declarations, changed
use of timeout_t -> timeout_func_t in aha1542 and aha1742 drivers.
2) fix a bug in the portalfs that was uncovered by better prototyping -
specifically, the time must be converted from timeval to timespec
before storing in va_atime.
3) fixed/added some miscellaneous prototypes


# f23b4c91 18-Aug-1994 Garrett Wollman <wollman@FreeBSD.org>

Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources