History log of /freebsd-current/sys/kern/vfs_cache.c
Revision Date Author Comments
# 0cd9cde7 06-Apr-2024 Jake Freeland <jfree@FreeBSD.org>

ktrace: Record namei violations with KTR_CAPFAIL

Report namei path lookups while Capsicum violation tracing with
CAPFAIL_NAMEI. vfs caching is also ignored when tracing to mimic
capability mode behavior.

Reviewed by: markj
Approved by: markj (mentor)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D40680


# 55edc40e 04-Jan-2024 Mark Johnston <markj@FreeBSD.org>

file: Remove the fd parameter to fgetvp_lookup() and fgetvp_lookup_smr()

The fd is always obtained from nameidata, so just fetch it from there
instead. No functional change intended.

Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D43257


# 29363fb4 23-Nov-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove ancient SCCS tags.

Remove ancient SCCS tags from the tree, automated scripting, with two
minor fixups to keep things compiling. All the common forms in the tree
were removed with a perl script.

Sponsored by: Netflix


# bb8ecf25 19-Oct-2023 Dmitry Chagin <dchagin@FreeBSD.org>

vfs cache: Fallback to namei to resolve symlinks with leading / in target for non-native ABI

This is a temporary solution to fix PR before release.
During 15.0 it's necessary to refactor symlinks handling
between vfs & namecache.

PR: 273414
Reported by: Vincent Milum Jr, Dan Kotowski, glebius
Tested by: Dan Kotowski, glebius
Reviewed by:
Differential Revision: https://reviews.freebsd.org/D41806
MFC after: 3 days


# 8b622172 04-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: add 2 more optimization ideas


# cd2105d6 04-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: denote a known bug in cache_remove_cnp


# 0f15054f 22-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: plug a hypothetical corner case when freeing

cache_zap_unlocked_bucket is called with a bunch of addresses and
without any locks held, forcing it to revalidate everything from
scratch.

It did not account for a case where the entry is reallocated with
everything the same except for the target vnode.

Should the target use a different lock than the one expected, freeing
would proceed without being properly synchronized.

Note this is almost impossible to happen in practice.


# 2749c222 04-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: sanitize debug counters

They are very rarely triggered, so no need for per-cpu distribution.

At the same time the non-cpu ones still should use atomics to not lose
any updates.
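
The point about atomics can be illustrated with a minimal userspace sketch using C11 <stdatomic.h>. It is not the kernel code, which uses its own atomic(9) primitives; the counter name and both helpers are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical debug counter shared by all CPUs (not a per-CPU counter). */
static atomic_ulong debug_zap_retries;

/*
 * A plain "counter++" is a separate load, add and store, so two CPUs
 * racing on it can lose an update.  An atomic read-modify-write cannot.
 * Relaxed ordering suffices because the value is only a statistic.
 */
static void
debug_counter_inc(atomic_ulong *c)
{
	assert(c != NULL);
	atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
}

static unsigned long
debug_counter_read(atomic_ulong *c)
{
	assert(c != NULL);
	return (atomic_load_explicit(c, memory_order_relaxed));
}
```

A single shared atomic is cheaper in memory than a per-CPU array, which is a reasonable trade when the counter is rarely bumped.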


# 4862e8ac 03-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: describe various optimization ideas

While here report a sample result from running on Sapphire Rapids:

An access(2) loop slapped into will-it-scale, like so:
while (1) {
	int error = access(tmpfile, R_OK);
	assert(error == 0);

	(*iterations)++;
}

.. operating on /usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c

In operations per second:
lockless: 3462164
locked: 1362376

While the over 3.4 mln may seem like a big number, a critical look shows
it should be significantly higher.

A poor man's profiler, counting how many times a given routine was sampled:
dtrace -w -n 'profile:::profile-4999 /execname == "a.out"/ {
@[sym(arg0)] = count(); } tick-5s { system("clear"); trunc(@, 40);
printa("%40a %@16d\n", @); clear(@); }'

[snip]
kernel`kern_accessat 231
kernel`cpu_fetch_syscall_args 324
kernel`cache_fplookup_cross_mount 340
kernel`namei 346
kernel`amd64_syscall 352
kernel`tmpfs_fplookup_vexec 388
kernel`vput 467
kernel`vget_finish 499
kernel`lockmgr_unlock 529
kernel`lockmgr_slock 558
kernel`vget_prep_smr 571
kernel`vput_final 578
kernel`vdropl 1070
kernel`memcmp 1174
kernel`0xffffffff80 2080
0x0 2231
kernel`copyinstr_smap 2492
kernel`cache_fplookup 9246


# 38a375c4 03-Oct-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: s/vfs.cache_fast_lookup/vfs.cache.param.fast_lookup


# bb124a0f 22-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: retire dothits and dotdothits counters

They demonstrate nothing, and in case of dotdot they are not even hits.
This is just a count of lookups with "..", which are not worth
mentioning.


# 33fdf1af 22-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: mark vfs.cache.param.size as read-only

It was not meant to be writable and writes don't work correctly as they
fail to resize the hash.


# 02ef039c 22-Sep-2023 Olivier Certner <olce.freebsd@certner.fr>

vfs cache: Drop known argument of internal cache_recalc_neg_min()

'ncnegminpct' is to be passed always, so just drop the unneeded parameter.

Sponsored by: The FreeBSD Foundation
Reviewed by: mjg

Differential Revision: https://reviews.freebsd.org/D41763


# 07f52c4b 14-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: garbage collect the fullpathfail2 counter

The conditions it checks cannot legally be true (modulo races against
forced unmount), so assert on it instead.


# 32988c14 02-Sep-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: fix a hang when bumping vnode limit too high

Overflow in cache_changesize would make the value flip to 0 and stay
there as 0 << 1 does not do anything.

Note callers limit the outcome to something below u_int.

Also note the entire vnode handling, both in the vfs layer as a whole and
in this file, can't decide whether to use long, u_long or u_int.
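
The failure mode described above (a doubled value wrapping to 0 and then staying 0) can be reproduced in a small userspace sketch. This is an illustration under assumptions, not the kernel's cache_changesize() or the committed fix; both helper names and the step bound are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper.  Doubles n until it reaches want.  If want exceeds
 * the largest power of two representable in unsigned int, n wraps to 0
 * and 0 << 1 stays 0, so the loop would never end; max_steps bounds it so
 * the demo terminates.  Returns the number of doublings, or -1 if the
 * bound was hit.
 */
static int
double_until_naive(unsigned int want, int max_steps, unsigned int *out)
{
	unsigned int n = 1;
	int steps = 0;

	assert(max_steps > 0);
	while (n < want) {
		if (steps == max_steps) {
			*out = n;
			return (-1);
		}
		n <<= 1;
		steps++;
	}
	*out = n;
	return (steps);
}

/* Same loop, but it refuses to shift out the top bit. */
static unsigned int
double_until_clamped(unsigned int want)
{
	const unsigned int top = 1u << (sizeof(unsigned int) * CHAR_BIT - 1);
	unsigned int n = 1;

	while (n < want && n < top)
		n <<= 1;
	return (n);
}
```

Without the step bound, the naive variant spins forever for large inputs, which matches the hang in the commit title.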


# 685dc743 16-Aug-2023 Warner Losh <imp@FreeBSD.org>

sys: Remove $FreeBSD$: one-line .c pattern

Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/


# dbac8474 29-Jul-2023 Dmitry Chagin <dchagin@FreeBSD.org>

vfs: Deleting a doubled inclusion of sys/capsicum.h

Reviewed by:
Differential Revision: https://reviews.freebsd.org/D41223
MFC after: 1 week


# ba8cc6d7 12-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use __enum_uint8 for vtype and vstate

This whacks hackery around only reading v_type once.

Bump __FreeBSD_version to 1400093


# d7614c01 04-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_path_to_global_path_hardlink(): initialize len

before calling vn_fullpath_hardlink(). Otherwise we get random failures
when the len is automatically clipped.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# d6b900c9 03-Jul-2023 Konstantin Belousov <kib@FreeBSD.org>

vn_path_to_global_path_hardlink(): avoid freeing non-initialized pointer

Reported by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# 60bd7f97 30-May-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: restore sorted order of CACHE_FPL_SUPPORTED_CN_FLAGS


# 3d2fec7d 29-May-2023 Dmitry Chagin <dchagin@FreeBSD.org>

namei: Add the ability for the ABI to specify an alternate root path

For now a non-native ABI (i.e., Linux) uses the kern_alternate_path()
facility to dynamically reroot lookups. First, an attempt is made to
look up the file in /compat/linux/original-path. If that fails, the
lookup is done in /original-path. That requires a bit of code in
every ABI syscall implementation where path name translation is needed.
Also, our kern_alternate_path() does not properly look up absolute symlinks
on the second attempt, i.e., it does not append the /compat/linux part to the
resolved link.
The change is intended to avoid this by specifying the ABI root directory
for namei(), using one call to pwd_altroot() during exec-time into the ABI.
In that case namei() will dynamically reroot lookups as mentioned above.

PR: 72920
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D38933
MFC after: 2 months


# 0e0c47ec 19-Apr-2023 Igor Ostapenko <pm@igoro.pro>

vfs cache: fix vfs.cache.stats.* name typos

Two vfs.cache.stats names are fixed:
- s/.dotdothis/.dotdothits/
- s/.posszaps/.poszaps/

Signed-off-by: Igor Ostapenko <pm@igoro.pro>
[mjg: massaged the header a little bit]


# 26b96487 07-Apr-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs: more informative panic for missing fplookup ops


# 5f6df177 03-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: validate that vop vectors provide all or none fplookup vops

In order to prevent later surprises.


# 22eb66d9 23-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: always assert on ndp->ni_resflags


# c16c4ea6 23-Mar-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: return ENOTDIR for not_a_dir/{.,..} lookups

Reported by: Oliver Kiddle
PR: 270419
MFC: 3 days


# dbcd7e7e 21-Feb-2023 Mateusz Guzik <mjg@FreeBSD.org>

vfs cache: whack set-but-not-used warn in cache_purgevfs

Reported by: kib
Sponsored by: Rubicon Communications, LLC ("Netgate")


# a1d74b2d 04-Dec-2022 Doug Rabson <dfr@FreeBSD.org>

Allow realpath to work for file mounts

For file mounts, the directory vnode is not available from namei and this
prevents the use of vn_fullpath_hardlink. In this case, we can use the
vnode which was covered by the file mount with vn_fullpath.

This also disallows file mounts over files with link counts greater than
one to ensure a deterministic path to the mount point.

Reviewed by: mjg, kib
Tested by: pho


# 78d35459 02-Dec-2022 Doug Rabson <dfr@FreeBSD.org>

Add vn_path_to_global_path_hardlink

This is similar to vn_path_to_global_path but allows for regular files
which may not be present in the cache.

Reviewed by: mjg, kib
Tested by: pho


# 8f7859e8 14-Dec-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: retire the now unused SAVESTART flag

Bump __FreeBSD_version to 1400075

Tested by: pho


# 85dac03e 17-Nov-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: stop using NDFREE

It provides nothing but a branchfest and next to no consumers want it
anyway.

Tested by: pho


# d653aaec 24-Oct-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_assert_no_entries


# 5b5b7e2c 17-Sep-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: always retain path buffer after lookup

This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36542


# 7388fb71 27-Jun-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the vfs.cache_rename_add tunable

The functionality has been in use since Jan 2021 -- long enough(tm).


# c9b04ee4 02-Apr-2022 Gordon Bergling <gbe@FreeBSD.org>

kern: Fix two typos in source code comments

- s/accomodate/accommodate/

MFC after: 3 days


# 0c805718 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: fix memory leak on lookup with fds with ioctl caps

Reviewed by: markj
PR: 262515
Noted by: firk@cantconnect.ru
Differential Revision: https://reviews.freebsd.org/D34667


# bb92cd7b 24-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd)


# 6ff3e8a3 19-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: add a comment about a realpath bug


# 02fc4e31 13-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: use flexible array member

... instead of 0-sizing the array


# afb08a6d 03-Mar-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: hide hash stats behind DEBUG_CACHE

They take a long time to dump and hinder sysctl -a when used with
DIAGNOSTIC.


# 1d65a9b4 09-Feb-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: improve vnode vs name assertion in cache_enter_time


# 611470a5 09-Feb-2022 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove NOCACHE handling from cache_fplookup_noentry

It was copy-pasted from locked lookup. As LOOKUP operation cannot have
the flag set it was always ending up setting MAKEENTRY.


# 7e1d3eef 25-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove the unused thread argument from NDINIT*

See b4a58fbf640409a1 ("vfs: remove cn_thread")

Bump __FreeBSD_version to 1400043.


# 7e9680d3 14-Nov-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: whack "set but not used" warnings


# 9a0bee9f 22-Oct-2021 Konstantin Belousov <kib@FreeBSD.org>

Make vn_fullpath_hardlink() externally callable

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611


# 628c3b30 27-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: only let non-dir descriptors through when doing EMPTYPATH lookups

Otherwise things like realpath against a file and '.' end up with an
illegal state of having a regular vnode for the parent.

Reported by: syzbot+9aa5439dd9c708aeb1a8@syzkaller.appspotmail.com


# 1045352f 17-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: only assert on flags when dealing with EMPTYPATH

Reported by: syzbot+bd48ee0843206a09e6b8@syzkaller.appspotmail.com
Fixes: 7dd419cabc6bb9e0 ("cache: add empty path support")


# 7dd419ca 26-Sep-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add empty path support

This avoids spurious drop offs as EMPTY is passed regardless of the
actual path name.

Pushing the work inside the lookup instead of just ignoring the flag
avoids checking for an empty pathname in all other lookups.


# b4a58fbf 01-Oct-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: remove cn_thread

It is always curthread.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32453


# a2cb65b8 18-Sep-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: count vnodes in cache_purgevfs


# b65ad701 23-Aug-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: retire cache_fast_revlookup sysctl

Sponsored by: Rubicon Communications, LLC ("Netgate")


# b30e7cb7 07-Aug-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add OPENREAD and OPENWRITE to fast path lookup


# 844aa31c 08-Jul-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_enter_time_flags


# 12288bd9 10-May-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix lockless absolute symlink traversal to non-fp mounts

Said lookups would incorrectly fail with EOPNOTSUP.

Reported by: kib


# c8bbb127 10-May-2021 Mark Johnston <markj@FreeBSD.org>

vfs: Fix error handling in vn_fullpath_hardlink()

vn_fullpath_any_smr() will return a positive error number if the
caller-supplied buffer isn't big enough. In this case the error must be
propagated up, otherwise we may copy out uninitialized bytes.

Reported by: syzkaller+KMSAN
Reviewed by: mjg, kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30198


# 074abacc 10-Apr-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove incomplete lockless lockout support during resize

This is already properly handled thanks to 2 step hash replacement.


# 4f0279e0 15-Apr-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: extend mismatch vnode assert print to include the name


# 72b3b5a9 08-Apr-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: replace vfs_smr_quiesce with vfs_smr_synchronize

This ends up using a smr specific method.

Suggested by: markj
Tested by: pho


# 13b3862e 06-Apr-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: update an assert on CACHE_FPL_STATUS_ABORTED

Since symlink support it can get upgraded to CACHE_FPL_STATUS_DESTROYED.

Reported by: bdrewery


# f79bd71d 11-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add high level overview

Differential Revision: https://reviews.freebsd.org/D28675


# dc532884 29-Mar-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix resizing in face of lockless lookup

Reported by: pho
Tested by: pho


# 1239a722 27-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: temporarily drop the assert that dvp != vp when adding an entry

Historically it was allowed for any names, but arguably should never be
even attempted. Allow it again since there is a release pending and
allowing it is bug-compatible with previous behavior.

Reported by: otis


# 39e0c3f6 09-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: assorted comment fixups


# 2f8a8446 05-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove the largely obsolete general description

Examples of inconsistencies with the current state:
- references LRU of all entries, removed years ago
- references a non-existent lock (neglist)
- claims negative entries have a NULL target

It will be replaced with a more accurate and more informative
description.

In the meantime take it out so it stops misleading.


# 0e1594e6 05-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix vfs:namecache:lookup:miss probe call sites


# 2e96132a 05-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop spurious arg from panic in cache_validate

vp is already reported when noting mismatch


# b54ed778 03-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: comment on FNV


# 45456abc 02-Feb-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix trailing slash support in face of permission problems

Reported by: Johan Hendriks <joh.hendriks gmail.com>
Tested by: kevans


# 6f19dc21 31-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add delayed degenerate path handling


# bbfb1edd 31-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: move hash computation into the parsing loop


# e027e24b 25-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add trailing slash support

Tested by: pho


# 8cbd164a 26-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: handle NOFOLLOW requests for symlinks

Tested by: pho


# 5c325977 27-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add missing MNT_NOSYMFOLLOW check to symlink traversal


# 5fc384d1 27-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: fallback when encountering a mount point during .. lookup

The current abort is overzealous.


# a098a831 26-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: tidy up handling of foo/bar lookups where foo is not a directory

The code was performing an avoidable check for doomed state to account
for foo being a VDIR but turning VBAD. Now that dooming puts a vnode
in a permanent "modify" state this is no longer necessary as the final
status check will catch it.


# a51eca79 26-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop referring to removing entries as invalidating them

Said use is a remnant from the old code and clashes with the NCF_INVALID
flag.


# 6943671b 25-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: convert cache_fplookup_parse to void now that it always succeeds


# e7cf562a 25-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: change ->v_cache_dd synchronisation rules

Instead of resorting to seqc modification take advantage of immutability
of entries and check if the entry still matches after everything got
prepared.


# 6f084276 25-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: make ->v_cache_dd accesses atomic-clean for lockless usage


# 6ef8fede 25-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: make ->nc_flag accesses atomic-clean for lockless usage


# ffcf8f97 23-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: store vnodes in local vars in cache_zap_locked


# 868643e7 24-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: assorted cleanups


# 1c7a65ad 24-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: track calls to cache_symlink_alloc with unsupported size

While here assert on size passed to free.


# 02ec31bd 23-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add back target entry on rename


# 739ecbcf 23-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: add symlink support to lockless lookup

Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D27488


# 2171b8e8 20-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: augment sdt probe in cache_fplookup_dot

Same as 6d386b4c ("cache: save a branch in cache_fplookup_next")


# aae03cfe 20-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: whitespace nit in cache_fplookup_modifying


# 57dab029 19-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix some typos


# 84ab77ad 19-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop write-only var from cache_fplookup_preparse


# 6d386b4c 19-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: save a branch in cache_fplookup_next

Previously the code would branch on top to find out whether it should
branch on SDT probe and bumping the numposhits counter, depending
on cache_fplookup_cross_mount.

Arguably it should be done regardless of what said function returns.


# 70ba7770 12-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

vfs: extend vfs:namei:lookup:return probe with nameidata


# 8ddea0b1 08-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: just assign ni_resflags = NIRES_ABS

It is guaranteed to be 0 on entry.


# fee405e0 06-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop checkpointing cn_flags

They are only modified, if ever, for the last component.


# ac771547 06-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop checkpointing cn_nameptr

For aborts cn_nameptr is the same as cn_pnbuf. For partial results
the same cn_nameptr is to be used.


# 0f1fc3a3 01-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop manipulating pathlen

It is a copy-pasto from regular lookup. Add debug to ensure the result
is the same.


# f2b794e1 06-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: unengrish the comment in previous commit

Reported by: rpokala, brd


# deabdc68 05-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop pre-checking seqc when starting the lookup

Tested by: pho


# 71a6a0b5 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: skip checking for spurious slashes if possible

Tested by: pho


# 33f3e81d 01-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: combine fast path enabled status into one flag

Tested by: pho


# dbbbc07c 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: split handling of 0 and non-0 error codes

Tested by: pho


# a1a8f8ad 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: deinline state handling

The intent is to reduce branchfest when finishing the lookup.

Tested by: pho


# 05803be0 05-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop setting cn_nameptr on entry as it matches cn_pnbuf already

While here tidy up other asserts.


# 3814bea0 03-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the now spurious doomed check when crossing a mount point


# 82397d79 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: denote vnode being a mount point with VIRF_MOUNTPOINT

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27794


# 51bf55fa 01-Jan-2021 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop checkpointing cn_namelen

The variable is recomputed by regular lookup from the get go.


# 7220a10b 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: predict on no spurious slashes in cache_fpl_handle_root

This is a step towards speculatively not handling them.


# 30a2fc91 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: postpone NAME_MAX check as it may be unnecessary


# eca899bd 31-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove spurious null check in sdt probe


# 1365b5f8 28-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fold NCF_WHITE check into the rest

Tested by: pho


# d7c62d98 28-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: call cache_fplookup_modifying in neg

Tested by: pho


# 6fe7de1a 28-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: refactor cache_fpl_handle_root to fit the rest of the code better

Tested by: pho


# e17e01bd 28-Dec-2020 Mateusz Guzik <mjguzik@gmail.com>

cache: refactor dot handling

Tested by: pho


# 4651db56 28-Dec-2020 Mateusz Guzik <mjguzik@gmail.com>

cache: remove a branch from mount point checking

Tested by: pho


# 0b5bd1af 27-Dec-2020 Mateusz Guzik <mjguzik@gmail.com>

cache: support lockless lookup of degenerate paths

Tested by: pho


# 1d6eb976 27-Dec-2020 Mateusz Guzik <mjguzik@gmail.com>

cache: save on branching when parsing the path by inserting a sentinel

Tested by: pho
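
The commit carries no description, so the following is only a generic sketch of the sentinel technique its title names, not the kernel's parsing code. It assumes the scan looks for '/' and that the terminating NUL can be overwritten temporarily; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Length of the path component starting at path[start].  The terminating
 * NUL at path[len] is temporarily replaced with a '/' sentinel, so the
 * scan tests for one character instead of two.  The buffer must be
 * writable.  All names here are illustrative.
 */
static size_t
component_len(char *path, size_t len, size_t start)
{
	size_t i = start;

	assert(start <= len && path[len] == '\0');
	path[len] = '/';		/* plant the sentinel */
	while (path[i] != '/')
		i++;
	path[len] = '\0';		/* restore the terminator */
	return (i - start);
}

/* Convenience wrapper for callers that do not track the length. */
static size_t
first_component_len(char *path)
{
	return (component_len(path, strlen(path), 0));
}
```

The saving is one comparison per character: without the sentinel the loop would have to test for both '/' and the end of the string.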


# 67297766 27-Dec-2020 Mateusz Guzik <mjguzik@gmail.com>

cache: hoist trailing slash and degenerate path handling out of the loop

Tested by: pho


# 0c09f4b0 28-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: work around corner case of dvp == tvp in cache_fplookup_final_modifying

Fixes a panic where the kernel would unlock an unheld lock coming from
rename looking up "foo/." as the source.

Reported by: markj (syzkaller)


# 4ab7d9f4 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: reduce engrish in previous commit


# 0714f921 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: save on some branching in common case mount point traversal


# 002e18eb 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add FAILIFEXISTS flag

Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.

This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.

Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.

For simplicity the only supported semantics are as used by mkdir.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27789


# ff97bc03 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: simplify lockless dot lookups


# abd7ded4 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: modification and last entry filling support in lockless lookup v2

The previous patch failed to set the ISDOTDOT flag when appropriate,
which in turn failed to properly handle degenerate lookups.

While here sprinkle some extra assertions.

Tested by: pho (previous version)


# 623daa69 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: assert internal flags are not passed by namei


# a1fc1f10 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

Revert "cache: modification and last entry filling support in lockless lookup"

This reverts commit 6dbb07ed6872ae7988b9b705e322c94658eba6d1.

Some ports unreliably fail to build with rmdir getting ENOTEMPTY.


# 6dbb07ed 27-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: modification and last entry filling support in lockless lookup

Tested by: pho (previous version)


# 906a73e7 23-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix up cache_hold_vnode comment


# 8ab96e26 13-Dec-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix up bad predicts

- last level fallback normally sees CREATE; the code should be optimized to not
get there for said case
- fast path commonly fails with ENOENT


# d3bbf8af 11-Dec-2020 Ryan Libby <rlibby@FreeBSD.org>

cache_fplookup: quiet gcc -Wreturn-type

Reviewed by: markj, mjg
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D27555


# f6dd1aef 09-Nov-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: group mount per-cpu vars into one struct

While here move frequently read stuff into the same cacheline.

This shrinks struct mount by 64 bytes.

Tested by: pho


# 4bfebc8d 30-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename


# d681c51d 26-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add missing NIRES_ABS handling


# eb65cde4 24-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: assorted typo fixes


# 029cfccc 24-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add the missing NC_NOMAKEENTRY and NC_KEEPPOSENTRY to lockless lookup

They are de facto ignored.


# acb41008 23-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: batch updates to numcache in case of mass removal


# 208cb7c4 23-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: refactor alloc/free

This in particular centralizes manipulation of numcache.


# 1d444056 23-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fold branch prediction into cache_ncp_canuse


# c13d7d1f 23-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix some typos


# f878526f 23-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop write-only vars


# 38628389 23-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: reduce memory waste in struct namecache

The previous scheme for calculating the total size was doing sizeof
on the struct and then adding the wanted space for the buffer.

nc_name is at offset 58 while sizeof(struct namecache) is 64.
With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were
allocating 104 bytes for the entry and never accounting for the 6
byte padding, wasting that space.
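
With the numbers above, the old scheme gives 64 + 39 + 1 = 104 bytes, while starting the name at its real offset would need 58 + 39 + 1 = 98 bytes before any allocator rounding. A minimal sketch of the two size calculations follows; struct entry is a hypothetical stand-in, and the real struct namecache layout and the committed code differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct namecache; the real layout differs. */
struct entry {
	void		*target;
	unsigned short	 namelen;
	char		 name[];	/* flexible array member */
};

/* Old scheme: trailing padding of the struct is counted and then wasted. */
static size_t
entry_size_sizeof(size_t namelen)
{
	return (sizeof(struct entry) + namelen + 1);
}

/* New scheme: the name starts where the padding would have been. */
static size_t
entry_size_offsetof(size_t namelen)
{
	size_t size = offsetof(struct entry, name) + namelen + 1;

	/* Never allocate less than the struct itself. */
	if (size < sizeof(struct entry))
		size = sizeof(struct entry);
	assert(size <= entry_size_sizeof(namelen));
	return (size);
}
```

The difference between the two results is exactly the trailing padding, sizeof(struct entry) - offsetof(struct entry, name), whenever the name is longer than that padding.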


# c7520caa 22-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: prevent avoidable evictions on mkdir of existing directories

mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.

The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.

For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.

Tested by: pho
Discussed with: kib


# 54f09403 22-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: assert the created entry does not point to itself


# 2f1c3505 20-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the spurious slash_prefixed argument


# 8ecd87a3 20-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop spurious cred argument from VOP_VPTOCNP


# 6d5d469f 19-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: promote negative entries based on more than one hit

During tinderbox and similar workloads negative entries get at least one
hit before they get evicted. In the current scheme this avoidably promotes
them.

Be conservative and stick to 2 hits for now.
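
The policy change can be sketched as a per-entry hit counter with a threshold of 2. This is an illustration only: struct neg_state, neg_record_hit() and the macro name are hypothetical, and the kernel's bookkeeping and locking are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	NEG_PROMOTE_HITS	2	/* threshold named in the commit */

/* Hypothetical per-entry state; the kernel's negative state differs. */
struct neg_state {
	unsigned char	hits;
	bool		hot;
};

/*
 * Record a hit on a negative entry.  Returns true exactly once, on the
 * hit that moves the entry to the hot list.
 */
static bool
neg_record_hit(struct neg_state *ns)
{
	assert(ns != NULL);
	if (ns->hot)
		return (false);
	if (++ns->hits < NEG_PROMOTE_HITS)
		return (false);
	ns->hot = true;
	return (true);
}
```

With a threshold of 1, an entry that is hit once and then evicted would still have been promoted, which is the avoidable work the commit describes.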


# 665c8c3e 19-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: refactor negative promotion/demotion handling

This will simplify policy changes.


# 4c4aa848 17-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: shorten names of debug stats


# 67655714 17-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: don't automatically evict negative entries if usage is low

The previous scheme only looked at negative entry count in relation to the
total count, leading to tons of spurious evictions if the cache is not
significantly populated.

Instead, only try the above if negative entry count goes beyond namecache
capacity.


# e98c3bc6 17-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: rework sysctl vfs.cache tree

Split everything into neg, debug, param and stat categories.

The legacy nchstats sysctl (queried e.g., by systat) remains untouched.

While here rename some vars to be easier on the eye.


# fa7c73d3 17-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: factor negative lookup out of cache_fplookup_next


# 41e6b184 17-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: avoid smr in cache_neg_evict in favor of the already held bucket lock


# c38d8e1e 17-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: rework parts of negative entry management

- declutter sysctl vfs.cache by moving relevant entries into
vfs.cache.neg
- add a little more parallelism to eviction by replacing the
global lock with an atomically modified counter
- track more statistics

The code needs further effort.


# b31b5e9c 17-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove entries before trying to add new ones, not after

Should allow positive entries to replace negative ones in case
the cache is full.


# d6eee350 16-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add a probe reporting addition of duplicate entries


# a59b0ac3 15-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: flip inverted condition in previous

It happened to not affect correctness in that the fallback code would
simply neglect to promote the entry.


# e7602e04 15-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: support negative entry promotion in slowpath smr


# 571bc3d1 15-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: elide vhold/vdrop around promoting negative entry


# 640e6162 15-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: dedup code for negative promotion


# c97c8746 15-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: neglist -> nl; negstate -> ns

No functional changes.


# 43777a20 15-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: split hotlist between existing negative lists

This simplifies the code while allowing for concurrent negative eviction
down the road.

Cache misses increased slightly due to higher rate of evictions allowed by
the change.

The current algorithm remains too aggressive.


# 430dc451 15-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: make neglist an array given the static size


# dd28b379 09-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: support lockless dirfd lookups


# eb88fed4 09-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix vexec panic when racing against vgone

Use of dead_vnodeops would result in a panic instead of returning the intended
EOPNOTSUPP error.

While here make sure to abort, not just try to return a partial result.
The former allows the regular lookup to restart from scratch, while the latter
makes it stuck with an unusable vnode.

Reported by: kevans


# 4e226610 05-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix pwd use-after-free in setting up fallback

Since the code exits smr section prior to calling pwd_hold, the used
pwd can be freed and a new one allocated with the same address, making
the comparison erroneously true.

Note it is very unlikely anyone ran into it.


# aa34e791 02-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: update the commentary for path parsing


# b5ab177a 01-Oct-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: properly report ENOTDIR on foo/bar lookups where foo is a file

Reported by: fernape


# 4301a5a7 30-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: push the lock into cache_purge_impl


# d4cac594 29-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: use cache_has_entries where appropriate instead of opencoding it


# 1b2edd6e 23-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: eliminate cache_zap_locked_vnode

It is only ever called for negative entries and for those it is
just a wrapper around cache_zap_negative_locked_vnode_kl which
always succeeds.

This also fixes a bug where cache_lookup_fallback should have been
calling cache_zap_locked_bucket instead. Note that in order to trigger
the bug NOCACHE must not be set, which currently only happens when
creating a new coredump (and then the coredump-to-be has to have a
negative entry).


# a3d9bf49 23-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the force flag from purgevfs

The optional scan is wasteful, thus it is removed altogether from unmount.

Callers which always want it anyway remain unaffected.


# a952feff 23-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: reimplement purgevfs to iterate vnodes instead of the entire hash

The entire cache scan was a leftover from the old implementation.

It is incredibly wasteful in presence of several mount points and does not
win much even for single ones.


# efeec5f0 23-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: clean up atomic ops on numneg and numcache

- use subtract instead of adding -1
- drop the useless _rel fence

Note this should be converted to a scalable scheme.


# da62ed4f 08-Sep-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop write-only tvp_seqc vars


# 84ecea90 27-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: don't update timestamps on found entry


# 5f08d440 27-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: assorted clean ups

In particular remove spurious comments, duplicate assertions and the
inconsistently done KTR support.


# 12441fcb 27-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: ncp = NULL early to account for sdt probes in failure path

CID: 1432106


# 1e9a0b39 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: relock on failure in cache_zap_locked_vnode

This gets rid of bogus scheme of yielding in hopes the blocking thread will
make progress.


# 075f58f2 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop null checking in cache_free


# 66fa11c8 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: make it mandatory to request both timestamps or neither


# eef63775 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: convert bucketlocks to a mutex

By now bucket locks are almost never taken for anything but writing and
converting to mutex simplifies the code.


# 32f3d082 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: only evict negative entries on CREATE when ISLASTCN is set


# 935e1518 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: decouple smr and locked lookup in the slowpath

Tested by: pho


# d3476dad 25-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: factor dotdot lookup out of cache_lookup

Tested by: pho


# f9cdb077 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove leftover assert in vn_fullpath_any_smr

It is only valid when !slash_prefixed. For slash_prefixed the length
is properly accounted for later.

Reported by: markj (syzkaller)


# e35406c8 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: lockless reverse lookup

This enables fully scalable operation for getcwd and significantly improves
realpath.

For example:
PATH_CUSTOM=/usr/src ./getcwd_processes -t 104
before: 1550851
after: 380135380

Tested by: pho


# feabaaf9 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the always curthread argument from reverse lookup routines

Note VOP_VPTOCNP keeps getting it as temporary compatibility for zfs.

Tested by: pho


# f0696c5e 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: perform reverse lookup using v_cache_dd if possible

Tested by: pho


# ce575cd0 24-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: populate v_cache_dd for non-VDIR entries

It makes v_cache_dd into a little bit of a misnomer and it may be addressed later.

Tested by: pho


# 1e448a15 22-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: stronger vnode asserts in cache_enter_time


# 760a430b 22-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add a work around for vp_crossmp bug to realpath

The actual bug is not yet addressed as it will get much easier after other
problems are addressed (most notably rename contract).

The only affected in-tree consumer is realpath. Everyone else happens to be
performing lookups within a mount point, having a side effect of ni_dvp being
set to mount point's root vnode in the worst case.

Reported by: pho


# 17838b58 20-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: don't use cache_purge_negative when renaming

It avoidably scans (and evicts) unrelated entries. Instead take
advantage of passed componentname and perform a hash lookup
for the exact one.

Sample data from buildworld probed on cache_purge_negative extended
to count both scanned and evicted entries on each call are below.
At most it has to evict 1.

evicted
   value  ------------- Distribution ------------- count
      -1 |                                         0
       0 |@@@@@@@@@@@@@@@                          19506
       1 |@@@@@                                    5820
       2 |@@@@@@                                   7751
       4 |@@@@@                                    6506
       8 |@@@@@                                    5996
      16 |@@@                                      4029
      32 |@                                        1489
      64 |                                         193
     128 |                                         109
     256 |                                         56
     512 |                                         16
    1024 |                                         7
    2048 |                                         3
    4096 |                                         1
    8192 |                                         1
   16384 |                                         0

scanned
   value  ------------- Distribution ------------- count
      -1 |                                         0
       0 |@@                                       2456
       1 |@                                        1496
       2 |@@                                       2728
       4 |@@@                                      4171
       8 |@@@@                                     5122
      16 |@@@@                                     5335
      32 |@@@@@                                    6279
      64 |@@@@                                     5671
     128 |@@@@                                     4558
     256 |@@                                       3123
     512 |@@                                       2790
    1024 |@@                                       2449
    2048 |@@                                       3021
    4096 |@                                        1398
    8192 |@                                        886
   16384 |                                         0


# 39f88150 20-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_rename, a dedicated helper to use for renames

While here make both tmpfs and ufs use it.

No functional changes.


# 16be9f99 20-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: reimplement cache_lookup_nomakeentry as cache_remove_cnp

This in particular removes unused arguments.


# 6c55d6e0 19-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: when adding an already existing entry assert on a complete match


# 7c75f14f 19-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: tidy up the comment above cache_prehash


# 3c5d2ed7 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add NOCAPCHECK to the list of supported flags for lockless lookup

It is de facto supported in that lockless lookup does not do any capability
checks.


# 8ab4beca 16-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use namei_zone for getcwd allocations

instead of malloc.

Note that this should probably be wrapped with a dedicated API and other
vn_getcwd callers did not get converted.


# 5e79447d 09-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: let SAVESTART passthrough

The flag is only passed for non-LOOKUP ops and those fall back to the slowpath.


# bb48255c 09-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: resize struct namecache to a multiple of alignment

For example struct namecache on amd64 is 100 bytes, but it has to occupy
104. Use the extra bytes to support longer names.
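
The arithmetic behind this can be sketched as below. This is only an
illustration, not the kernel code: the helper names are hypothetical, and
the 8-byte alignment is an assumption inferred from the 100 -> 104 figures
in the message.

```c
#include <stddef.h>

/*
 * Hypothetical helper: round size up to the next multiple of align.
 * align is assumed to be a power of two.
 */
size_t
round_up_to_align(size_t size, size_t align)
{
	return ((size + align - 1) & ~(align - 1));
}

/*
 * Bytes that would otherwise be wasted as padding and can instead be
 * used to store a longer name inline.
 */
size_t
spare_name_bytes(size_t size, size_t align)
{
	return (round_up_to_align(size, align) - size);
}
```

With a 100-byte struct and 8-byte alignment the allocation occupies 104
bytes, so 4 bytes become available for the name.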


# 8b62cebe 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove unused variables from cache_fplookup_parse


# 03337743 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: clean MNTK_FPLOOKUP if MNT_UNION is set

Elides checking it during lookup.


# c571b995 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: strlcpy -> memcpy


# 3ba0e517 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: partially support file create/delete/rename in lockless lookup

Perform the lookup until the last 2 elements and fall back to slowpath.

Tested by: pho
Sponsored by: The FreeBSD Foundation


# 21d5af2b 10-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the thread argument from vfs_fplookup_vexec

It is guaranteed curthread.

Tested by: pho
Sponsored by: The FreeBSD Foundation


# e910c93e 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add more predicts for failing conditions


# 95888901 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: plug uninitialized variable use

CID: 1431128


# e1b1971c 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: don't ignore size passed to nchinittbl


# 2b86f9d6 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: convert the hash from LIST to SLIST

This reduces struct namecache by sizeof(void *).

Negative side is that we have to find the previous element (if any) when
removing an entry, but since we normally don't expect collisions it should be
fine.

Note this adds cache_get_hash calls which can be eliminated.
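
The trade-off can be sketched as below. This is a simplified stand-alone
model, not the namecache code: struct entry and chain_remove are
hypothetical names, and the real code uses the queue(3) macros.

```c
#include <stddef.h>

/* One forward pointer per entry instead of the two a doubly-linked list needs. */
struct entry {
	struct entry *next;
	int key;
};

/*
 * Unlink e from the chain rooted at *headp. Without a back pointer the
 * chain has to be walked to find the link that points at e, which is
 * cheap as long as chains are short. Returns 1 if e was found and
 * removed, 0 otherwise.
 */
int
chain_remove(struct entry **headp, struct entry *e)
{
	struct entry **pp;

	for (pp = headp; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == e) {
			*pp = e->next;
			e->next = NULL;
			return (1);
		}
	}
	return (0);
}
```

Walking a pointer-to-pointer avoids special-casing removal of the first
element in the chain.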


# cf8ac0de 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: reduce zone alignment to 8 bytes

It used to be sizeof of the given struct to accommodate 32 bit mips
doing 64 bit loads, but the same can be achieved by requiring just
64 bit alignment.

While here reorder struct namecache so that most commonly used fields
are closer.


# d61ce7ef 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: convert ncneghash into a macro

It is a read-only var with value known at compilation time.


# 2840f07d 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: cleanup lockless entry point

- remove spurious bzero
- assert ni_lcf, it has to be set by namei by this point


# 8ccf01e0 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop messing with cn_lkflags

See r363882.


# 27c4618d 05-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop messing with cn_flags

This removes flag setting/unsetting carried over from regular lookup.
Flags still get set for compatibility when falling back.

Note .. and . handling can get partially folded together.


# db99ec56 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: support lockless dotdot lookup

Tested by: pho


# b403aa12 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add NCF_WIP flag

This allows making half-constructed entries visible to the lockless lookup,
which now can check for either "not yet fully constructed" or "no longer valid"
state.

This will be used for .. lookup.


# 6e10434c 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add cache_purge_vgone

cache_purge locklessly checks whether the vnode at hand has any namecache
entries. This can race with a concurrent purge which managed to remove
the last entry, but may not be done touching the vnode.

Make sure we observe the relevant vnode lock as not taken before proceeding
with vgone.

Paired with the fact that doomed vnodes cannot receive entries this restores
the invariant that there are no namecache-related writing users past cache_purge
in vgone.

Reported by: pho


# 1164f7a5 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: factor away failed vexec handling


# 0439b00e 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: assorted tidy ups


# 18bd02e2 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: factor away lockless dot lookup and add missing stat + sdt probe


# 17a66c70 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add vfs_op_thread_enter/exit _crit variants

and employ them in the namecache. Eliminates all spurious checks for preemption.


# 0311b05f 04-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add missing numcache decrement on insertion failure


# 7ad2f110 02-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: store precomputed namecache hash in the vnode

This significantly speeds up path lookup, Cascade Lake doing access(2) on ufs
on /usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c, ops/s:
before: 2535298
after: 2797621

Over +10%.

The reversed order of computation here does not seem to matter for hash
distribution.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D25921


# 838984de 02-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: move namecache initialisation into cache_vnode_init


# 8a7ec170 01-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: reshuffle struct cache_fpl and nameidata_saved

Shaves 16 bytes.


# 5a394433 01-Aug-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: mark climb_mount as __noinline


# cb90ef28 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the useless numchecks counter


# 40492735 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add support for WANTPARENT and LOCKPARENT to lockless lookup

This makes the realpath syscall operational with the new lookup. Note that the
walk to obtain the full path name still takes locks.

Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23917


# 8230d293 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: support negative entry promotion in lockless lookup

Tested by: pho


# 4057e3ea 30-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add NOMACCHECK and AUDITVNODE2 to lockless lookup

They are both nops since lookup does not progress with either mac or audit enabled.

Tested by: pho


# 9dbd12fb 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add support for !LOCKLEAF to lockless lookup

Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D23916


# c42b77e6 25-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: lockless lookup

Provides full scalability as long as all visited filesystems support the
lookup and terminal vnodes are different.

Inner workings are explained in the comment above cache_fplookup.

Capabilities and fd-relative lookups are not supported and will result in
immediate fallback to regular code.

Symlinks, ".." in the path, mount points without support for lockless lookup
and mismatched counters will result in an attempt to get a reference to the
directory vnode and continue in regular lookup. If this fails, the entire
operation is aborted and regular lookup starts from scratch. However, care is
taken that data is not copied again from userspace.

Sample benchmark:
incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total

Sample microbenchmark: access calls to separate files in /tmpfs, 104 workers, ops/s:
before: 2165816
after: 151216530

Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25578


# 29f3e5ea 14-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: make negative shrinker round robin on all lists every time

Previously it would check 4, 3, 2, 1 lists. In practice by the time
it is getting called all lists have some elements and consequently
this does not result in new evictions.

Nonetheless, the code is clearer.

Tested by: pho


# a110fa2e 14-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove numcalls

The counter is not very useful and if necessary the value can be
found by summing up other counters.


# 4516c7ee 14-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: count dropped entries


# 654e644e 14-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove neg_locked argument from cache_zap_locked

Tested by: pho


# ffb0abdd 14-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove a useless argument from cache_negative_insert


# 9f8d4521 14-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: create a dedicated struct for negative entries

.. and stuff it into the unused target vnode field

This gets rid of concurrent nc_flag modifications racing with the
shrinker and consequently fixes a bug where such a change could have
been missed when cache_ncp_invalidate was being issued.

Reported by: zeising
Tested by: pho, zeising
Fixes: r362828 ("cache: lockless forward lookup with smr")


# d2385020 01-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: add missing call to cache_ncp_invalid for negative hits

Note the dtrace probe can fire even if the entry is gone, but I don't think that's
worth fixing.


# d129e0eb 01-Jul-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix misplaced fence in cache_ncp_invalidate

The intent was to mark the entry as invalid before cache_zap starts messing
with it.

While here add some comments.


# 5d1c042d 30-Jun-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: lockless forward lookup with smr

This eliminates the need to take bucket locks in the common case.

Concurrent lookup utilizing the same vnodes is still bottlenecked on referencing
and locking path components, this will be taken care of separately.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23913


# d869a17e 06-Mar-2020 Mark Johnston <markj@FreeBSD.org>

Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D23978


# 8d03b99b 01-Mar-2020 Mateusz Guzik <mjg@FreeBSD.org>

fd: move vnodes out of filedesc into a dedicated structure

The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884


# 7029da5c 26-Feb-2020 Pawel Biernacki <kaktus@FreeBSD.org>

Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)

r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is a non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT.

Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718


# 0573d0a9 20-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: add realpathat syscall

realpath(3) is used a lot e.g., by clang and is a major source of getcwd
and fstatat calls. This can be done more efficiently in the kernel.

This works by performing a regular lookup while saving the name and found
parent directory. If the terminal vnode is a directory we can resolve it using
usual means. Otherwise we can use the name saved by lookup and resolve the
parent.

See the review for sample syscall counts.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23574


# 6a5abb1e 02-Feb-2020 Kyle Evans <kevans@FreeBSD.org>

Provide O_SEARCH

O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247


# 7739d927 01-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: replace kern___getcwd with vn_getcwd

The previous routine was resulting in extra data copies most notably in
linux_getcwd.


# 921e7210 01-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: return the total length from vn_fullpath1

This removes strlen from getcwd.


# 4511dd9d 01-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: remove vnode -> path lookup disablement

It seems to be of little to no use even when debugging.

Interested parties can resurrect it and gate compilation with a macro.


# 45757984 01-Feb-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: consistently use size_t for buflen around VOP_VPTOCNP


# 64034553 20-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: revert r352613 now that vhold does not take locks


# 8bba93c7 20-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: make numcachehv use counter(9) on all archs

Requested by: kib


# 059cb484 19-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: counter_u64_add_protected -> counter_u64_add

Fixes booting on RISC-V where it does happen to not be equivalent.

Reported by: lwhsu


# 13990335 18-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

cache: convert numcachehv to counter(9) on 64-bit platforms


# 69283067 11-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: incomplete pass at converting more ints to u_long

Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.


# b249ce48 03-Jan-2020 Mateusz Guzik <mjg@FreeBSD.org>

vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427


# abd80ddb 08-Dec-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715


# 588e69e2 26-Nov-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop reusing .. entries on enter

It almost never happens in practice anyway. With this eliminated ->nc_vp
cannot change vnodes, removing an obstacle on the road to lockless
lookup.


# 2ac930e3 26-Nov-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix numcache accounting on entry

. entries are never created and .. can reuse existing entries,
meaning the early count bump is both spurious and leading to
overcounting in certain cases.


# 36afce39 26-Nov-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: hide "doingcache" behind DEBUG_CACHE


# d578a425 19-Nov-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: minor stat cleanup

Remove duplicated stats and move numcachehv from debug to vfs.cache.


# 708cf7eb 27-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: decrease ncnegfactor to 5

The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary

Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.

Sample result after about 90 minutes of poudriere -j 104:

             current    no evict   % of the original
numneg        219737     2013157   916
numneghits 266711906   263544562   98 [1]

[1] this may look funny but there is a certain dose of variation to the
build

The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.

Sponsored by: The FreeBSD Foundation


# e6431418 27-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop requeuing negative entries on the hot list

Turns out it does not improve hit ratio, but it does come with a cost
stemming from dirtying hit entries.

Sample result: hit counts of evicted entries after 2 buildworlds

before:

   value  ------------- Distribution ------------- count
      -1 |                                         0
       0 |@@@@@@@@@@@@@@@@@@@@@@@@@                180865
       1 |@@@@@@@                                  49150
       2 |@@@                                      19067
       4 |@                                        9825
       8 |@                                        7340
      16 |@                                        5952
      32 |@                                        5243
      64 |@                                        4446
     128 |                                         3556
     256 |                                         3035
     512 |                                         1705
    1024 |                                         1078
    2048 |                                         365
    4096 |                                         95
    8192 |                                         34
   16384 |                                         26
   32768 |                                         23
   65536 |                                         8
  131072 |                                         6
  262144 |                                         0

after:
   value  ------------- Distribution ------------- count
      -1 |                                         0
       0 |@@@@@@@@@@@@@@@@@@@@@@@@@                184004
       1 |@@@@@@                                   47577
       2 |@@@                                      19446
       4 |@                                        10093
       8 |@                                        7470
      16 |@                                        5544
      32 |@                                        5475
      64 |@                                        5011
     128 |                                         3451
     256 |                                         3002
     512 |                                         1729
    1024 |                                         1086
    2048 |                                         363
    4096 |                                         86
    8192 |                                         26
   16384 |                                         25
   32768 |                                         24
   65536 |                                         7
  131072 |                                         5
  262144 |                                         0

Sponsored by: The FreeBSD Foundation


# 312196df 27-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: make negative list shrinking a little bit concurrent

Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.

While here count how many times we skipped shrinking due to the lock
being already taken.

Sponsored by: The FreeBSD Foundation


# 95c6dd89 27-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop recalculating upper limit each time a new entry is added

Sponsored by: The FreeBSD Foundation


# 93a85508 23-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: tidy up handling of negative entries

- track the total count of hot entries
- pre-read the lock when shrinking since it is typically already taken
- place the lock in its own cacheline
- shorten the hold time of hot lock list when zapping

Sponsored by: The FreeBSD Foundation


# afe257e3 23-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: count evictions of negative entries

Sponsored by: The FreeBSD Foundation


# 7505cffa 22-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: try to avoid vhold if locks held

Sponsored by: The FreeBSD Foundation


# cd2112c3 22-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: jump in negative success instead of positive

Sponsored by: The FreeBSD Foundation


# b088a4d6 10-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: avoid excessive relocking on entry removal during lookup

Due to lock ordering issues (bucket lock held, vnode locks wanted) the code
starts with trylocking which in face of contention often fails. Prior to
the change it would loop back with a possible yield.

Instead note we know what locks are needed and can take them in the right
order, avoiding retries. Then we can safely re-lookup and see if the entry
we are looking for is still there.

On a 104-way box poudriere would result in constant retries during an 11h
run as seen in the vfs.cache.zap_and_exit_bucket_fail counter.

before: 408866592
after : 0

However, a new stat reports:
vfs.cache.zap_and_exit_bucket_relock_success: 32638

Note this is only a bandaid over current design issues.

Tested by: pho
Sponsored by: The FreeBSD Foundation


# a6cacb0d 10-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: change the formula for calculating lock array sizes

It used to be mp_ncpus * 64, but this gives unnecessarily big values for small
machines and at the same time constrains bigger ones. In particular this helps
on a 104-way box for which the count is now doubled.

While here make cache_purgevfs less likely. Currently it is not efficient in
face of contention due to lock ordering issues. These are fixable but not worth
it at the moment.

Sponsored by: The FreeBSD Foundation


# 1214618c 10-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: assorted cleanups

Sponsored by: The FreeBSD Foundation


# e3c3248c 03-Sep-2019 Mateusz Guzik <mjg@FreeBSD.org>

vfs: implement usecount implying holdcnt

vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.

Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.

Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471


# caaa7cee 22-Jul-2019 Alan Somers <asomers@FreeBSD.org>

[skip ci] Fix the comment for cache_purge(9)

This is a merge of r348738 from projects/fuse2

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 571908e2 06-Jun-2019 Alan Somers <asomers@FreeBSD.org>

[skip ci] Fix the comment for cache_purge(9)

Sponsored by: The FreeBSD Foundation


# daec9284 21-May-2019 Conrad Meyer <cem@FreeBSD.org>

Include ktr.h in more compilation units

Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon


# 8ba6c139 12-May-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix a brainfart in r347505

If bumping the counter goes over the limit we have to decrement it back.

Previous code would only bump the counter after adding the entry (thus allowing
the cache to go over the limit).

Sponsored by: The FreeBSD Foundation
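
The bump-then-undo pattern described in this message can be sketched with
C11 atomics as below. This is a stand-alone illustration under assumed
names (reserve_slot, count, limit), not the kernel implementation, which
uses its own atomic(9) primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical helper: reserve room for one entry before adding it.
 * The counter is bumped first; if that takes it past the limit the
 * bump is undone and the caller must not add the entry.
 */
bool
reserve_slot(atomic_ulong *count, unsigned long limit)
{
	unsigned long n;

	/* atomic_fetch_add returns the old value, so add 1 for the new one. */
	n = atomic_fetch_add(count, 1) + 1;
	if (n > limit) {
		atomic_fetch_sub(count, 1);
		return (false);
	}
	return (true);
}
```

Reserving before insertion keeps the number of successfully added entries
at or below the limit, at the cost of a transient overshoot of the counter
while a failed reservation is being rolled back.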


# 5bf50787 12-May-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: bump numcache on entry, while here fix lnumcache type

Sponsored by: The FreeBSD Foundation


# 63ad3b65 12-May-2019 Mateusz Guzik <mjg@FreeBSD.org>

cache: push sdt probes in cache_zap_locked to code doing the work

Avoids branching to check which probe to evaluate. Very same check was
being done later to do the actual work.

Sponsored by: The FreeBSD Foundation


# 691d4ab6 10-Apr-2019 Alan Somers <asomers@FreeBSD.org>

fix cache_lookup's documentation

cache_lookup's documentation got dislocated by r324378. Relocate and expand
it.

Reviewed by: jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation


# 22443809 29-Nov-2018 Mateusz Guzik <mjg@FreeBSD.org>

cache: retire cache_enter compat shim

It was added over 6 years ago for binary compat. cache_enter macro remains
as it expands to cache_enter_time.

Sponsored by: The FreeBSD Foundation


# 7ffbcfe2 20-Jun-2018 Bjoern A. Zeeb <bz@FreeBSD.org>

Sometimes it is helpful to get the path for a vnode.
Implement a ddb function walking the namecache to do this.

Reviewed by: jhb, mjg
Inspired by: gdb macro from jhb (old version)
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D14898


# e9b1074b 18-May-2018 Matt Macy <mmacy@FreeBSD.org>

cache_lookup: remove unused variable and initialize the used one


# e1703ef5 01-Dec-2017 Mark Johnston <markj@FreeBSD.org>

Plug a name cache lock leak.

Reviewed by: mjg
MFC after: 1 week
Sponsored by: Dell EMC Isilon


# 51369649 20-Nov-2017 Pedro F. Giffuni <pfg@FreeBSD.org>

sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.


# ce80021f 05-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: bump numcache after dropping all locks

This makes no difference correctness-wise, but shortens total hold time.


# 119b826a 05-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: wlock buckets in cache_lookup_nomakeentry

Since the case of an empty chain was already covered, it is very likely
that the existing entry is matching. Skipping readlocking saves on lock
upgrade.


# ba324b59 05-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: skip locking in cache_lookup_nomakeentry if there is no entry


# a52058f0 05-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: skip locking in cache_purge_negative if there are no entries


# ac850e5a 01-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: fix .. check broken after r324378

wtf by: mjg
Diagnosed by: avg


# 5644fffa 01-Nov-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: ncnegfactor 16 -> 12

It is used on each new entry addition to decide whether to whack an existing
negative entry in order to prevent a blow out in size, but the parameter was
set years ago and never revisited.

Building with poudriere results in about 400 evictions per second which
unnecessarily grab entries from the hot list.

With the new parameter there are next to no evictions of the sort.


# 709939a7 06-Oct-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: factor out ~MAKEENTRY lookups from the common path

Lookups of the sort are rare compared to regular ones and successful ones
result in removing entries from the cache.

In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintainable the 2 cases have
to be split.

MFC after: 1 week


# c2dc6d5d 27-Sep-2017 John Baldwin <jhb@FreeBSD.org>

Use UMA_ALIGNOF() for name cache UMA zones.

This fixes kernel crashes due to misaligned accesses to the 64-bit
time_t embedded in struct namecache_ts in MIPS n32 kernels.

MFC after: 1 week
Sponsored by: DARPA / AFRL


# 0bbae6f3 10-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: clean up struct namecache_ts handling

namecache_ts differs from mere namecache by a few fields placed mid struct.
The access to the last element (the name) is thus special-cased.

The standard solution is to put new fields at the very beginning and
embed the original struct. The pointer shuffled around points to the
embedded part. If needed, access to new fields can be gained through
__containerof.

MFC after: 1 week
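
Illustrative note: the embedding technique described in 0bbae6f3 can be sketched in userland C as follows. The struct layouts and the names demo_nc, demo_nc_ts and DEMO_CONTAINEROF are hypothetical stand-ins, not the real definitions; the macro is a plain offsetof-based equivalent of what a containerof helper does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical base entry; the name is its last member. */
struct demo_nc {
	unsigned char	nlen;
	char		name[16];
};

/* Extended variant: new fields go first, the base struct is embedded last. */
struct demo_nc_ts {
	long		ts_sec;
	struct demo_nc	nc;
};

/* Recover the enclosing struct from a pointer to one of its members. */
#define	DEMO_CONTAINEROF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The pointer passed around always refers to the embedded base part. */
static struct demo_nc *
demo_ts_to_nc(struct demo_nc_ts *ts)
{
	return (&ts->nc);
}

/* Only valid for entries known to live inside a struct demo_nc_ts. */
static long
demo_nc_timestamp(struct demo_nc *ncp)
{
	struct demo_nc_ts *ts;

	ts = DEMO_CONTAINEROF(ncp, struct demo_nc_ts, nc);
	return (ts->ts_sec);
}
```

With this layout the name is reached the same way for both variants, so no special-casing is needed; some per-entry marker still has to tell callers when the containerof step is legal.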


# dad74ce9 08-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: fold the unlock label into the only consumer

No functional changes.

MFC after: 1 week


# da8f32a7 08-Sep-2017 Mateusz Guzik <mjg@FreeBSD.org>

namecache: factor out dot lookup into a dedicated function

The intent is to move uncommon cases out of the way.

MFC after: 1 week


# 8066a14a 03-May-2017 Mateusz Guzik <mjg@FreeBSD.org>

cache: stop holding the ncneg_hot lock across purging

Only non-hot entries are purged so the lock is not needed in the first place.
This saves one lock/unlock pair.

MFC after: 1 week


# a3b7d0fb 06-Apr-2017 Brooks Davis <brooks@FreeBSD.org>

Regen after r316594.


# dfecf51d 29-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

cache: use vrefact for '.' lookups and refing the rdir in fullpath


# 17071ff2 27-Jan-2017 Mateusz Guzik <mjg@FreeBSD.org>

cache: annotate with __read_mostly and __exclusive_cache_line

MFC after: 1 month


# 4938d867 29-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: sprinkle __predict_false


# b3770753 28-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: move shrink lock init to nchinit

This gets rid of unnecessary sysinit usage.

While here also rename the lock to be consistent with the rest.


# 0569bc9c 29-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: depessimize hashing macros/inlines

All hash sizes are power-of-2, but the compiler does not know that for sure
and 'foo % size' forces doing a division.

Store the size - 1 and use 'foo & hash' instead, which needs only a bitwise AND.
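
Illustrative note: the mask trick from 0569bc9c relies on the identity h % n == h & (n - 1) for power-of-2 n. A minimal sketch, with hypothetical names (demo_hashtab, demo_bucket_mod, demo_bucket_mask) that do not appear in the kernel source:

```c
#include <stdint.h>

/* Hypothetical table descriptor: 'mask' holds size - 1, size a power of 2. */
struct demo_hashtab {
	uint32_t mask;
};

/* Returns 1 when n is a nonzero power of two. */
static int
demo_is_pow2(uint32_t n)
{
	return (n != 0 && (n & (n - 1)) == 0);
}

/* Bucket index via modulo: size is a runtime value, so this may divide. */
static uint32_t
demo_bucket_mod(uint32_t hash, uint32_t size)
{
	return (hash % size);
}

/* Bucket index via the stored mask: a single AND, no division. */
static uint32_t
demo_bucket_mask(const struct demo_hashtab *t, uint32_t hash)
{
	return (hash & t->mask);
}
```

The two functions agree only while the size really is a power of 2, which is why a check such as demo_is_pow2() is worth asserting wherever the size is set.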


# 6dd9661b 29-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: drop the NULL check from VP2VNODELOCK

Now that negative entries are annotated with a dedicated flag, NULL vnodes
are no longer passed.


# 25e578de 12-Dec-2016 Mateusz Guzik <mjg@FreeBSD.org>

vfs: use vrefact in getcwd and fchdir


# 8b0e0c91 23-Nov-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: ensure that the number of bucket locks does not exceed hash size

The size can be changed by side effect of modifying kern.maxvnodes.

Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.

Force the number of buckets to be not smaller.

Note this should not matter for real world cases.

Reported and tested by: pho
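
Illustrative note: one way to see why 8b0e0c91 matters, assuming both the bucket and its lock are picked by masking the same hash (an assumption for this sketch; all demo_* names are hypothetical). With more locks than buckets, two hashes can land in the same bucket yet select different locks, so the bucket is no longer protected by a single lock.

```c
#include <stdint.h>

/* Both indices are derived from the same hash by masking (assumed scheme). */
static uint32_t
demo_bucket(uint32_t hash, uint32_t bucket_mask)
{
	return (hash & bucket_mask);
}

static uint32_t
demo_lock(uint32_t hash, uint32_t lock_mask)
{
	return (hash & lock_mask);
}

/*
 * Returns 1 if, for all hashes below 'limit', sharing a bucket implies
 * sharing a lock; returns 0 as soon as a counterexample is found.
 */
static int
demo_locks_cover_buckets(uint32_t bucket_mask, uint32_t lock_mask,
    uint32_t limit)
{
	uint32_t a, b;

	for (a = 0; a < limit; a++) {
		for (b = a + 1; b < limit; b++) {
			if (demo_bucket(a, bucket_mask) ==
			    demo_bucket(b, bucket_mask) &&
			    demo_lock(a, lock_mask) != demo_lock(b, lock_mask))
				return (0);
		}
	}
	return (1);
}

/* Keep the bucket count from dropping below the lock count. */
static uint32_t
demo_clamp_buckets(uint32_t nbuckets, uint32_t nlocks)
{
	return (nbuckets < nlocks ? nlocks : nbuckets);
}
```

For example, with 4 buckets and 16 locks, hashes 0 and 4 share bucket 0 but map to locks 0 and 4; clamping the bucket count to at least the lock count removes that case.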


# 6ce45c6a 14-Nov-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: plug a write-only variable in cache_negative_zap_one


# 317cac6d 14-Nov-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix a race between entry removal and demotion

The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
flag without the aforementioned locks held prior to detaching it from the
respective neglist, which can lose the update made by the shrinker.

Reported and tested by: truckman


# 9bd4f0a2 07-Nov-2016 Konstantin Belousov <kib@FreeBSD.org>

vn_fullpath1() checked VV_ROOT and then unreferenced
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag. See comments for
explanation why unlocked check for the flag is considered safe.

Reported and tested by: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# bb697a20 20-Oct-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: fix up a corner case in r307650

If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.

Fix the problem by initializing to NULL on entry.

Reported by: glebius


# a45a1a25 19-Oct-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: split negative entry LRU into multiple lists

This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.

Create N dedicated lists for new negative entries.

Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.

Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.

Reviewed by: kib


# f71d0856 07-Oct-2016 Konstantin Belousov <kib@FreeBSD.org>

Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

Reported and tested by: jkim
Discussed with: mjg
Sponsored by: The FreeBSD Foundation


# 4876636e 02-Oct-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: ignore purgevfs requests for filesystems with few vnodes

purgevfs is purely optional and induces lock contention in workloads
which frequently mount and unmount filesystems.

In particular, poudriere will do this for filesystems with 4 vnodes or
less. Full cache scan is clearly wasteful.

Since there is no explicit counter for namecache entries, the number of
vnodes used by the target fs is checked.

The default limit is the number of bucket locks.

Reviewed by: kib


# 1d2541fd 22-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: get rid of the global lock

Add a table of vnode locks and use them along with bucketlocks to provide
concurrent modification support. The approach taken is to preserve the
current behaviour of the namecache and just lock all relevant parts before
any changes are made.

Lookups still require the relevant bucket to be locked.

Discussed with: kib
Tested by: pho


# 69a28758 15-Sep-2016 Ed Maste <emaste@FreeBSD.org>

Renumber license clauses in sys/kern to avoid skipping #3


# a2781533 10-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: improve scalability by introducing bucket locks

An array of bucket locks is added.

All modifications still require the global cache_lock to be held for
writing. However, most readers only need the relevant bucket lock and in
effect can run concurrently to the writer as long as they use a
different lock. See the added comment for more details.

This is an intermediate step towards removal of the global lock.

Reviewed by: kib
Tested by: pho


# 591df145 04-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: defer freeing entries until after the global lock is dropped

This also defers vdrop for held vnodes.

Glanced at by: kib


# 31977b42 04-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: manage negative entry list with a dedicated lock

Since negative entries are managed with a LRU list, a hit requires a
modification.

Currently the code tries to upgrade the global lock if needed and is
forced to retry the lookup if it fails.

Provide a dedicated lock for use when the cache is only shared-locked.

Reviewed by: kib
MFC after: 1 week


# b9042ae1 04-Sep-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: put all negative entry management code into dedicated functions

Reviewed by: kib
MFC after: 1 week


# e3043798 29-Apr-2016 Pedro F. Giffuni <pfg@FreeBSD.org>

sys/kern: spelling fixes in comments.

No functional change.


# 0791e0c0 24-Feb-2016 Konstantin Belousov <kib@FreeBSD.org>

Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation. Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode. Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects. Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by: marius (i386, previous version)
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks


# b0632ab4 20-Jan-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: minor changes

1. vhold and zap immediately instead of postponing few lines later
2. increment numneg after new entry is added

No functional changes.

No objections: kib


# baa2bcf5 20-Jan-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: perform . lookup without the namecache lock

Reviewed by: kib


# db709ecb 20-Jan-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: provide a helper for computing the hash

Reviewed by: kib


# 76583fa2 20-Jan-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: use counter(9) API to maintain statistics

Previously the code would just increment statistics while only holding a
shared lock, in effect losing updates.

Separate tracking for nchstats is removed as values can be obtained from
existing counters. Note that some fields are updated by external
consumers and are left unfixed. This should not be a serious issue as
this structure looks quite obsolete.

No strong objections: kib


# 6b53d1bc 06-Jan-2016 Mateusz Guzik <mjg@FreeBSD.org>

cache: ansify functions and fix some style issues

No functional changes.


# 36160958 16-Dec-2015 Mark Johnston <markj@FreeBSD.org>

Fix style issues around existing SDT probes.

- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.

MFC after: 1 week


# 2f2f522b 27-Sep-2015 Andriy Gapon <avg@FreeBSD.org>

save some bytes by using more concise SDT_PROBE<n> instead of SDT_PROBE

SDT_PROBE requires 5 parameters whereas SDT_PROBE<n> requires n parameters
where n is typically smaller than 5.

Perhaps SDT_PROBE should be made a private implementation detail.

MFC after: 20 days


# 17518b1a 05-Sep-2015 Kirk McKusick <mckusick@FreeBSD.org>

Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks


# 752fc07d 16-Jul-2015 Mateusz Guzik <mjg@FreeBSD.org>

vfs: implement v_holdcnt/v_usecount manipulation using atomic ops

Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by: kib (earlier version)
Tested by: pho


# 6289b482 21-Apr-2015 Edward Tomasz Napierala <trasz@FreeBSD.org>

Modify kern___getcwd() to take max pathlen limit as an additional
argument. This will be used for the Linux emulation layer - for Linux,
PATH_MAX is 4096 and not 1024.

Differential Revision: https://reviews.freebsd.org/D2335
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation


# f3519155 17-Apr-2015 Kirk McKusick <mckusick@FreeBSD.org>

More accurately collect name-cache statistics in sysctl functions
sysctl_debug_hashstat_nchash() and sysctl_debug_hashstat_rawnchash().
These changes are in preparation for allowing changes in the size
of the vnode hash tables driven by increases and decreases in the
maximum number of vnodes in the system.

Reviewed by: kib@
Phabric: D2265


# 9f7a06f2 04-Jan-2015 Dmitry Chagin <dchagin@FreeBSD.org>

Indeed, instead of hiding the kern___getcwd() bug by bogus cast
in r276564, change path type to char * (pathnames are always char *).
And remove bogus casts of malloc().
kern___getcwd() internally doesn't actually use or support u_char *
paths, except to copy them to a normal char * path.

These changes are not visible to libc as libc/gen/getcwd.c misdeclares
__getcwd() as taking a plain char * path.

While here remove _SYS_SYSPROTO_H_ for __getcwd() syscall as
we always have sysproto.h.

Pointed out by: bde

MFC after: 1 week


# f0188618 21-Oct-2014 Hans Petter Selasky <hselasky@FreeBSD.org>

Fix multiple incorrect SYSCTL arguments in the kernel:

- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after: 3 days
Sponsored by: Mellanox Technologies


# bcdd3bce 03-Aug-2014 Sergey Kandaurov <pluknet@FreeBSD.org>

vn_path_to_global_path: update comment.


# fe200470 27-Dec-2013 Konstantin Belousov <kib@FreeBSD.org>

Fix accounting for the negative cache entries when reusing v_cache_dd.
Having ncneg diverge with the actual length of the ncneg tailq causes
NULL dereference.

Add assertion that an entry taken from ncneg queue is indeed negative.

Reported by and discussed with: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week


# d9fae5ab 26-Nov-2013 Andriy Gapon <avg@FreeBSD.org>

dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE

In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by: markj
MFC after: 3 weeks
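
Illustrative note: the '__' to '-' emulation mentioned in d9fae5ab amounts to a simple name rewrite. The function below is a hypothetical userland sketch of that mapping (demo_probe_name is not a kernel or DTrace API), shown only to make the convention concrete.

```c
#include <stddef.h>

/*
 * Copy 'src' into 'dst', turning every pair of consecutive underscores
 * into a single dash. Returns 0 on success, -1 if 'dst' is too small.
 */
static int
demo_probe_name(const char *src, char *dst, size_t dstlen)
{
	size_t o = 0;

	if (dstlen == 0)
		return (-1);
	while (*src != '\0') {
		/* Always leave room for the terminating NUL. */
		if (o + 1 >= dstlen)
			return (-1);
		if (src[0] == '_' && src[1] == '_') {
			dst[o++] = '-';
			src += 2;
		} else {
			dst[o++] = *src++;
		}
	}
	dst[o] = '\0';
	return (0);
}
```

A C identifier such as lookup__miss would thus be presented as lookup-miss, while a single underscore, as in enter_negative, is left untouched.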


# 54366c0b 25-Nov-2013 Attilio Rao <attilio@FreeBSD.org>

- For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() directly in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invocation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip


# 4633a4c3 09-Jul-2013 Andriy Gapon <avg@FreeBSD.org>

namecache sdt: freebsd doesn't support structured characters yet

:-)

MFC after: 7 days


# 3289d587 20-Mar-2013 Kirk McKusick <mckusick@FreeBSD.org>

When renaming a directory from one parent directory to another,
we need to call ufs_checkpath() to walk from our new location to
the root of the filesystem to ensure that we do not encounter
ourselves along the way. Until now, we accomplished this by reading
the ".." entries of each directory in our path until we reached
the root (or encountered an error). This change tries to avoid the
I/O of reading the ".." entries by first looking them up in the
name cache and only doing the I/O when the name cache lookup fails.

Reviewed by: kib
Tested by: Peter Holm
MFC after: 4 weeks


# 5050aa86 22-Oct-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho


# 5e99212d 02-Mar-2012 Rick Macklem <rmacklem@FreeBSD.org>

Post r230394, the Lookup RPC counts for both NFS clients increased
significantly. Upon investigation this was caused by name cache
misses for lookups of "..". For name cache entries for non-".."
directories, the cache entry serves double duty. It maps both the
named directory plus ".." for the parent of the directory. As such,
two ctime values (one for each of the directory and its parent) need
to be saved in the name cache entry.
This patch adds an entry for ctime of the parent directory to the
name cache. It also adds an additional uma zone for large entries
with this time value, in order to minimize memory wastage.
As well, it fixes a couple of cases where the mtime of the parent
directory was being saved instead of ctime for positive name cache
entries. With this patch, Lookup RPC counts return to values similar
to pre-r230394 kernels.

Reported by: bde
Discussed with: kib
Reviewed by: jhb
MFC after: 2 weeks


# 7dfdd83d 24-Feb-2012 Maxim Konovalov <maxim@FreeBSD.org>

o Reduce chances for integer overflow.
o More verbose sysctl description added.

MFC after: 2 weeks
Sponsored by: Nginx, Inc.


# bf40d24a 06-Feb-2012 John Baldwin <jhb@FreeBSD.org>

Rename cache_lookup_times() to cache_lookup() and retire the old API and
ABI stub for cache_lookup().


# d5210589 25-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

Fix remaining calls to cache_enter() in both NFS clients to provide
appropriate timestamps. Restore the assertions which verify that
NCF_TS is set when timestamp is asked for.

Reviewed by: jhb (previous version)
MFC after: 2 weeks


# 7a7e609a 23-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

Apparently, both nfs clients do not use cache_enter_time()
consistently, creating some namecache entries without NCF_TS flag.
This causes panic due to failed assertion.

As a temporary relief, remove the assert. Return epoch timestamp for
the entries without timestamp if asked.

While there, consolidate the code which returns timestamps, into a
helper cache_out_ts().

Discussed with: jhb
MFC after: 2 weeks


# c2b396f2 21-Jan-2012 Konstantin Belousov <kib@FreeBSD.org>

Remove the nc_time and nc_ticks elements from struct namecache, and
provide struct namecache_ts which is the old struct namecache. Only
allocate struct namecache_ts if non-null struct timespec *tsp was
passed to cache_enter_time, otherwise use struct namecache.

Change struct namecache allocation and deallocation macros into static
functions, since logic becomes somewhat twisty. Provide accessor for
the nc_name member of struct namecache to hide difference between
struct namecache and namecache_ts.

The aim of the change is to not waste 20 bytes per small namecache
entry.

Reviewed by: jhb
MFC after: 2 weeks
X-MFC-note: after r230394


# 5aefb4cb 20-Jan-2012 John Baldwin <jhb@FreeBSD.org>

Close a race in NFS lookup processing that could result in stale name cache
entries on one client when a directory was renamed on another client. The
root cause for the stale entry being trusted is that each per-vnode nfsnode
structure has a single 'n_ctime' timestamp used to validate positive name
cache entries. However, if there are multiple entries for a single vnode,
they all share a single timestamp. To fix this, extend the name cache
to allow filesystems to optionally store a timestamp value in each name
cache entry. The NFS clients now fetch the timestamp associated with
each name cache entry and use that to validate cache hits instead of the
timestamps previously stored in the nfsnode. Another part of the fix is
that the NFS clients now use timestamps from the post-op attributes of
RPCs when adding name cache entries rather than pulling the timestamps out
of the file's attribute cache. The latter is subject to races with other
lookups updating the attribute cache concurrently. Some more details:
- Add a variant of nfsm_postop_attr() to the old NFS client that can return
a vattr structure with a copy of the post-op attributes.
- Handle lookups of "." as a special case in the NFS clients since the name
cache does not store name cache entries for ".", so we cannot get a
useful timestamp. It didn't really make much sense to recheck the
attributes on the directory to validate the namecache hit for "."
anyway.
- ABI compat shims for the name cache routines are present in this commit
so that it is safe to MFC.

MFC after: 2 weeks


# 9cbe30e1 15-Jan-2012 Martin Matuska <mm@FreeBSD.org>

Fix missing in r230129:

kern_jail.c: initialize fullpath_disabled to zero
vfs_cache.c: add missing dot in comment

Reported by: kib
MFC after: 1 month


# f6e633a9 14-Jan-2012 Martin Matuska <mm@FreeBSD.org>

Introduce vn_path_to_global_path()

This function updates path string to vnode's full global path and checks
the size of the new path string against the pathlen argument.

In vfs_domount(), sys_unmount() and kern_jail_set() this new function
is used to update the supplied path argument to the respective global path.

Unbreaks jailed zfs(8) with enforce_statfs set to 1.

Reviewed by: kib
MFC after: 1 month


# 7a7ce668 12-Dec-2011 Andriy Gapon <avg@FreeBSD.org>

put sys/systm.h at its proper place or add it if missing

Reported by: lstewart, tinderbox
Pointyhat to: avg, attilio
MFC after: 1 week
MFC with: r228430


# f82360ac 19-Nov-2011 Konstantin Belousov <kib@FreeBSD.org>

Existing VOP_VPTOCNP() interface has a fatal flaw that is critical for
nullfs. The problem is that resulting vnode is only required to be
held on return from the successful call to vop, instead of being
referenced.

Nullfs VOP_INACTIVE() method reclaims the vnode, which in combination
with the VOP_VPTOCNP() interface means that the directory vnode
returned from VOP_VPTOCNP() is reclaimed in advance, causing
vn_fullpath() to error with EBADF or like.

Change the interface for VOP_VPTOCNP(), now the dvp must be
referenced. Convert all in-tree implementations of VOP_VPTOCNP(),
which is trivial, because vhold(9) and vref(9) are similar in the
locking prerequisites. Out-of-tree fs implementation of VOP_VPTOCNP(),
if any, should have no trouble with the fix.

Tested by: pho
Reviewed by: mckusick
MFC after: 3 weeks (subject of re approval)


# 6472ac3d 07-Nov-2011 Ed Schouten <ed@FreeBSD.org>

Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.

The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.


# 8451d0dd 16-Sep-2011 Kip Macy <kmacy@FreeBSD.org>

In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)


# 8d065a39 14-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix some more style(9) issues.


# b389be97 14-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Fix style(9) issues from r215281 and r215282.

MFC after: 1 week


# 2baa5cdd 13-Nov-2010 Rebecca Cran <brucec@FreeBSD.org>

Add some descriptions to sys/kern sysctls.

PR: kern/148710
Tested by: Chip Camden <sterling at camdensoftware.com>
MFC after: 1 week


# 3a40a00d 30-Oct-2010 Konstantin Belousov <kib@FreeBSD.org>

Remove sysctl debug.ncnegfactor, it is renamed to vfs.ncnegfactor.

MFC: do not


# a7d5f7eb 19-Oct-2010 Jamie Gritton <jamie@FreeBSD.org>

A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.


# 420cfbb4 16-Oct-2010 Konstantin Belousov <kib@FreeBSD.org>

Provide vfs.ncsizefactor instead of hard-coding namecache ratio.
Move debug.ncnegfactor to vfs.ncnegfactor [1].
Provide some descriptions for the namecache related sysctls [1].

Based on the submission by: Rogier R. Mulhuijzen <drwilco drwilco net> [1]
MFC after: 2 weeks
X-MFC-note: remove debug.ncnegfactor in HEAD after MFC


# 79856499 22-Aug-2010 Rui Paulo <rpaulo@FreeBSD.org>

Add an extra comment to the SDT probes definition. This allows us to
use '-' in probe names, matching the probe names in Solaris.[1]

Add userland SDT probes definitions to sys/sdt.h.

Sponsored by: The FreeBSD Foundation
Discussed with: rwatson [1]


# 60ae52f7 21-Jun-2010 Ed Schouten <ed@FreeBSD.org>

Use ISO C99 integer types in sys/kern where possible.

There are only about 100 occurrences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
often.


# 22df1496 03-May-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206894:
The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

Verify that dvp is not reclaimed before calling cache_enter().


# 5673e3cb 20-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

The cache_enter(9) function shall not be called for doomed dvp.
Assert this.

In the reported panic, vdestroy() fired the assertion "vp has namecache
for ..", because pseudofs may end up doing cache_enter() with reclaimed
dvp, after dotdot lookup temporarily unlocked dvp.
Similar problem exists in ufs_lookup() for "." lookup, when vnode
lock needs to be upgraded.

Verify that dvp is not reclaimed before calling cache_enter().

Reported and tested by: pho
Reviewed by: kan
MFC after: 2 weeks


# c9975476 17-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

MFC r206671:
Fix typo.


# 3e22320c 15-Apr-2010 Konstantin Belousov <kib@FreeBSD.org>

Fix typo.

MFC after: 3 days


# 106c3802 14-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

MFC r196203:
Correctly handle unlock for !MAKEENTRY case.

Approved by: re (rwatson)


# 8f408451 14-Aug-2009 Konstantin Belousov <kib@FreeBSD.org>

Correctly handle unlock for !MAKEENTRY case, after successful attempt of
lock upgrade cache shall be unlocked from write.

Reported by: Lucius Windschuh <lwindschuh googlemail com>
Reviewed by: kan
Approved by: re (rwatson)


# c808c963 21-Jun-2009 Konstantin Belousov <kib@FreeBSD.org>

Add explicit struct ucred * argument for VOP_VPTOCNP, to be used by
vn_open_cred in default implementation. Valid struct ucred is needed for
audit and MAC, and curthread credentials may be wrong.

This further requires modifying the interface of vn_fullpath(9), but it
is out of scope of this change.

Reviewed by: rwatson


# 8a444404 05-Jun-2009 Joe Marcus Clarke <marcus@FreeBSD.org>

Unlock the cache lock before returning when we run out of buffer space
trying to fill in the full path name.

Reported by: David Naylor <naylor.b.david@gmail.com>
Approved by: kib


# 1358a795 31-May-2009 Konstantin Belousov <kib@FreeBSD.org>

Unbreak the build. Add missed probes.

Reviewed by: rwatson
Pointy hat to: me


# 0449e6e1 31-May-2009 Konstantin Belousov <kib@FreeBSD.org>

Eliminate code duplication in vn_fullpath1() around the cache lookups
and calls to vn_vptocnp() by moving more of the common code to
vn_vptocnp(). Rename vn_vptocnp() to vn_vptocnp_locked() to signify that
cache is locked around the call.

Do not track buffer position by both the pointer and offset, use only
buflen to record the start of the free space.

Export vn_vptocnp() for external consumers as a wrapper around
vn_vptocnp_locked() that locks the cache and handles hold counts.

Tested by: pho


# 348496ad 17-Apr-2009 Alexander Kabaev <kan@FreeBSD.org>

More fallout from negative dotdot caching. Negative entries should
be removed from and reinserted to proper ncneg list.

Reported by: pho
Submitted by: kib


# 9cf67722 14-Apr-2009 Alexander Kabaev <kan@FreeBSD.org>

Redo previous change using simpler patch that happens to be also
more correct.

Submitted by: tor


# eed8a9ed 14-Apr-2009 Alexander Kabaev <kan@FreeBSD.org>

Fix yet another negative dotdot entry fallout.

Reported by: pho


# 9d75482f 11-Apr-2009 Alexander Kabaev <kan@FreeBSD.org>

Fix v_cache_dd handling for negative entries. v_cache_dd pointer was
not populated in parent directory if negative entry was being
created, yet entry itself was added to the nc_neg list. It was
possible for parent vnode to get discarded later, leaving negative
entry pointing to now unused memory block.

Reported by: dho
Reviewed by: kib


# fd409594 11-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

When zapping v_cache_dd for !MAKEENTRY case in cache_lookup(), we shall
lock cache as writer.

Reviewed by: kan


# 3f54086e 10-Apr-2009 Konstantin Belousov <kib@FreeBSD.org>

Cache_lookup() for DOTDOT drops dvp vnode lock, allowing dvp to be reclaimed.
Check the condition and return ENOENT then.

In nfs_lookup(), respect ENOENT return from cache_lookup() when it is caused
by dvp reclaim.

Reported and tested by: pho


# 5d5c1748 07-Apr-2009 Robert Watson <rwatson@FreeBSD.org>

Nul-terminate strings in the VFS name cache, which negligibly changes
the size and cost of name cache entries, but makes adding debugging
and tracing easier.

Add SDT DTrace probes for various namecache events:

vfs:namecache:enter:done - new entry in the name cache, passed parent
directory vnode pointer, name added to the cache, and child vnode
pointer.

vfs:namecache:enter_negative:done - new negative entry in the name cache,
passed parent vnode pointer, name added to the cache.

vfs:namecache:fullpath:enter - call to vn_fullpath1() is made, passed
the vnode to resolve to a name.

vfs:namecache:fullpath:hit - vn_fullpath1() successfully resolved a
search for the parent of an object using the namecache, passed the
discovered parent directory vnode pointer, name, and child vnode
pointer.

vfs:namecache:fullpath:miss - vn_fullpath1() failed to resolve a search
for the parent of an object using the namecache, passed the child
vnode pointer.

vfs:namecache:fullpath:return - vn_fullpath1() has completed, passed the
error number, and if that is zero, the vnode to resolve, and the
returned path.

vfs:namecache:lookup:hit - positive name cache entry hit, passed the
parent directory vnode pointer, name, and child vnode pointer.

vfs:namecache:lookup:hit_negative - negative name cache entry hit,
passed the parent directory vnode pointer and name.

vfs:namecache:lookup:miss - name cache miss, passed the parent directory
pointer and the full remaining component name (not terminated after the
cache miss component).

vfs:namecache:purge:done - name cache purge for a vnode, passed the vnode
pointer to purge.

vfs:namecache:purge_negative:done - name cache purge of negative entries
for children of a vnode, passed the vnode pointer to purge.

vfs:namecache:purgevfs - name cache purge for a mountpoint, passed the
mount pointer. Separate probes will also be invoked for each cache
entry zapped.

vfs:namecache:zap:done - name cache entry zapped, passed the parent
directory vnode pointer, name, and child vnode pointer.

vfs:namecache:zap_negative:done - negative name cache entry zapped,
passed the parent directory vnode pointer and name.

For any probes involving an extant name cache entry (enter, hit, zap),
we use the nul-terminated string for the name component. For misses,
the remainder of the path, including later components, is provided as
an argument instead since there is no handy nul-terminated version of
the string around. This is arguably a bug.

MFC after: 1 month
Sponsored by: Google, Inc.
Reviewed by: jhb, kan, kib (earlier version)


# bb6418cb 04-Apr-2009 Alexander Kabaev <kan@FreeBSD.org>

Revert change 190655 temporarily. It breaks many setups where nullfs is
used and needs to be revisited.


# 0e875eca 02-Apr-2009 Peter Wemm <peter@FreeBSD.org>

vn_vptocnp() unlocks the name cache and forgets to re-lock it before
returning in one error case, and mistakenly unlocks it for the
umount -f case.


# 607fc40b 29-Mar-2009 Alexander Kabaev <kan@FreeBSD.org>

Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups
otherwise.

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stopped being type stable.

Reviewed by: kib


# 049ce093 24-Mar-2009 John Baldwin <jhb@FreeBSD.org>

When a file lookup fails due to encountering a doomed vnode from a forced
unmount, consistently return ENOENT rather than EBADF.

Reviewed by: kib
MFC after: 1 month


# 15fb32c0 20-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not underflow the buffer and then report the problem. Check for the
condition before the buffer write.
Also, since buflen is unsigned, previous check was ignored.

Reviewed by: marcus
Tested by: pho
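
The check-before-write ordering described above can be illustrated with a
small userland sketch. This is not the kernel code: prepend_component() and
its arguments are hypothetical names chosen for illustration. It assumes a
path buffer that is filled from the end toward the start, with an unsigned
buflen recording where the free space ends.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: prepend '/' plus "name" (len bytes, not
 * nul-terminated) to a buffer that is filled from the end.  *buflen is
 * the offset of the first used byte, which is also the free space left.
 * Returns 0 on success, -1 if the component does not fit.
 */
static int
prepend_component(char *buf, size_t *buflen, const char *name, size_t len)
{
	assert(buf != NULL && buflen != NULL && name != NULL);

	/*
	 * Check before writing.  Because buflen is unsigned, subtracting
	 * first and then testing for "< 0" could never detect the problem.
	 */
	if (*buflen < 1 || len > *buflen - 1)
		return (-1);

	*buflen -= len;
	memcpy(buf + *buflen, name, len);
	*buflen -= 1;
	buf[*buflen] = '/';
	return (0);
}
```

With an 8-byte buffer, prepending "bin" and then "usr" fills it exactly with
"/usr/bin"; a further attempt is rejected without touching the buffer.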


# 83817ce3 20-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Remove unneeded braces to reduce used vertical screen space.
The location was missed in r190140.


# 91940072 20-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Do not forget to adjust buflen for the first resolution of the path
from namecache.
While there, compare pointers for equality.

Reviewed by: marcus
Tested by: pho


# 065fc451 20-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

The nc_nlen member of the struct namecache contains the length of the cached
name, not the length + 1.

PR: 132620, 132542
Reported by: bf2006a yahoo com
Tested by: bf2006a, pho
Reviewed by: marcus


# c4a8c2ee 20-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

When ktracing namei operations, log a result of the __getcwd().

MFC after: 1 week


# bf5c835e 20-Mar-2009 Konstantin Belousov <kib@FreeBSD.org>

Remove unneeded braces to reduce used vertical screen space.


# 4ab2a9a0 09-Mar-2009 John Baldwin <jhb@FreeBSD.org>

Move the debug.hashstat sysctl tree under DIAGNOSTIC. I measured the
debug.hashstat.rawnchash sysctl in particular as taking 7 milliseconds on
a 3GHz Intel Xeon (4x2) running 7.1. It accounted for almost a quarter of
the total runtime of 'sysctl -a'. It also performs lots of copyout's while
holding the namecache lock (this does not attempt to fix that).

MFC after: 2 weeks


# 03964c8e 19-Feb-2009 John Baldwin <jhb@FreeBSD.org>

Enable caching of negative pathname lookups in the NFS client. To avoid
stale entries, we save a copy of the directory's modification time when
the first negative cache entry was added in the directory's NFS node.
When a negative cache entry is hit during a pathname lookup, the parent
directory's modification time is checked. If it has changed, all of the
negative cache entries for that parent are purged and the lookup falls
back to using the RPC. This required adding a new cache_purge_negative()
method to the name cache to purge only negative cache entries for a given
directory.

Submitted by: mohans, Rick Macklem, Ricardo Labiaga @ NetApp
Reviewed by: mohans
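
The validation scheme described above can be sketched in userland C. This is
only an illustration under stated assumptions: struct toy_dir and both
functions are hypothetical, the per-directory state is reduced to a counter,
and resetting that counter stands in for cache_purge_negative().

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-directory state; not the real NFS node layout. */
struct toy_dir {
	time_t mtime;       /* current directory modification time */
	time_t neg_mtime;   /* mtime saved when the first negative entry was added */
	int neg_entries;    /* number of cached negative entries */
};

/* Record a negative entry; remember the mtime only for the first one. */
static void
toy_enter_negative(struct toy_dir *d)
{
	if (d->neg_entries == 0)
		d->neg_mtime = d->mtime;
	d->neg_entries++;
}

/*
 * On a negative hit, trust it only if the directory has not changed.
 * Otherwise drop all negative entries (stand-in for the purge) and
 * report that the caller must fall back to the RPC.
 */
static bool
toy_negative_hit_valid(struct toy_dir *d)
{
	if (d->neg_entries > 0 && d->neg_mtime == d->mtime)
		return (true);
	d->neg_entries = 0;
	return (false);
}
```

The point of saving the time at the first negative entry, rather than at
every entry, is that one comparison then covers all negative entries of the
directory.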


# 9078981a 28-Jan-2009 John Baldwin <jhb@FreeBSD.org>

Convert the global mutex protecting the directory lookup name cache from a
mutex to a reader/writer lock. Lookup operations first grab a read lock and
perform the lookup. If the operation results in a need to modify the cache,
then it tries to do an upgrade. If that fails, it drops the read lock,
obtains a write lock, and redoes the lookup.


# 8a7ef10b 23-Jan-2009 John Baldwin <jhb@FreeBSD.org>

- Mark all standalone INT/LONG/QUAD sysctl's MPSAFE. This is done
inside the SYSCTL() macros and thus does not need to be done for
all of the nodes scattered across the source tree.
- Mark the name-cache related sysctl's (including debug.hashstat.*) MPSAFE.
- Mark vm.loadavg MPSAFE.
- Remove GIANT_REQUIRED from vmtotal() (everything in this routine already
has sufficient locking) and mark vm.vmtotal MPSAFE.
- Mark the vm.stats.(sys|vm).* sysctls MPSAFE.


# 58c1607e 19-Jan-2009 Stephen McKay <mckay@FreeBSD.org>

Add a limit on namecache entries.

In normal operation, the number of cache entries is roughly equal to the
number of active vnodes. However, when most of the recently accessed
vnodes have many hard links, the number of cache entries can be 32000
times as large, exhausting kernel memory and provoking a panic in
kmem_malloc().

MFC after: 2 weeks


# 83e73926 29-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

In r185557, the check for existing negative entry for the given name
did not compare nc_dvp with supplied parent directory vnode pointer.
Add the check and note that now branches for vp != NULL and vp == NULL
are the same, thus can be merged.

Reported and reviewed by: kan
Tested by: pho
MFC after: 2 weeks


# 4769218f 23-Dec-2008 Joe Marcus Clarke <marcus@FreeBSD.org>

Do not KASSERT when vp->v_dd is NULL. Only directories which have had ".."
looked up would have v_dd set to a non-NULL value. This fixes a panic
seen when running installworld on a diskless system with a separate /usr
file system.

Submitted by: cracauer
Approved by: kib


# 86dcb537 23-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Keep the hold on the vnode during VOP_VPTOCNP() call, allowing the vop
implementation to drop vnode lock, if needed.

Reported and tested by: pho


# b9022449 11-Dec-2008 Joe Marcus Clarke <marcus@FreeBSD.org>

Add a new VOP, VOP_VPTOCNP, which translates a vnode to its component name
on a best-effort basis. Teach vn_fullpath to use this new VOP if a
regular VFS cache lookup fails. This VOP is designed to supplement the
VFS cache to provide a better chance that a vnode-to-name lookup will
succeed.

Currently, an implementation for devfs is being committed. The default
implementation is to return ENOENT.

A big thanks to kib for the mentorship on this, and to pho for running it
through his stress test suite.

Reviewed by: arch
Approved by: kib


# d6568724 02-Dec-2008 Konstantin Belousov <kib@FreeBSD.org>

Shared lookup makes it possible to create several negative cache
entries for one name. Then, creating inode with that name would remove
one entry, leaving others dormant. Reclaiming the vnode would uncover
negative entries, causing false return of ENOENT from the calls like
stat, that do not create inode.

Prevent creation of the duplicated negative entries.

Reported and debugged with: pho
Reviewed by: jhb
X-MFC: after shared lookup changes


# ef61995e 25-Nov-2008 Joe Marcus Clarke <marcus@FreeBSD.org>

Move vn_fullpath1() outside of FILEDESC locking. This is being done in
advance of teaching vn_fullpath1() how to query file systems for
vnode-to-name mappings when cache lookups fail.

Thanks to kib for guidance and patience on this process.

Reviewed by: kib
Approved by: kib


# d7f03759 19-Oct-2008 Ulf Lilleengen <lulf@FreeBSD.org>

- Import the HEAD csup code which is the basis for the cvsmode work.


# d2722d70 24-Sep-2008 John Baldwin <jhb@FreeBSD.org>

Part 1 of making shared lookups more resilient with respect to forced
unmounts. When we upgrade a vnode lock from shared to exclusive during
a name cache lookup, fail the lookup with EBADF if the vnode is invalidated
while we are waiting for the exclusive lock.

Also, for correctness (though I'm not sure it can occur in practice),
downgrade an exclusively locked vnode if it should be share locked.

Tested by: pho


# cbb598af 18-Sep-2008 John Baldwin <jhb@FreeBSD.org>

Sort includes.


# 969bf150 23-Aug-2008 John Baldwin <jhb@FreeBSD.org>

Fix a race condition with concurrent LOOKUP namecache operations for a vnode
not in the namecache when shared lookups are enabled (vfs.lookup_shared=1,
it is currently off by default) and the filesystem supports shared lookups
(e.g. NFS client). Specifically, if multiple concurrent LOOKUPs miss
in the name cache in parallel, each of the lookups may end up adding an
entry to the namecache resulting in duplicate entries in the namecache
for the same pathname. A subsequent removal of the mapping of that
pathname to that vnode (via remove or rename) would only evict one of the
entries from the name cache. As a result, subsequent lookups for that
pathname would still return the old vnode.

This race was observed with shared lookups over NFS where a file was updated
by writing a new file out to a temporary file name and then renaming that
temporary file to the "real" file to effect atomic updates of a file. Other
processes on the same client that were periodically reading the file would
occasionally receive an ESTALE error from open(2) because the VOP_GETATTR()
in nfs_open() would receive that error when given the stale vnode.

The fix here is to check for duplicates in cache_enter() and just return
if an entry for this same directory and leaf file name for this vnode is
already in the cache. The check for duplicates is done by walking the
per-vnode list of name cache entries. It is expected that this list should
be very small in the common case (usually 0 or 1 entries during a
cache_enter() since most files only have 1 "leaf" name).

Reviewed by: ups, scottl
MFC after: 2 months
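
The duplicate check described above can be sketched as follows. This is a
simplified userland model, not the kernel's cache_enter(): the toy_* types
and function are hypothetical, locking is omitted, and calloc() stands in for
the cache's allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct toy_vnode;

/* Hypothetical cache entry: one (parent directory, name) pair for a vnode. */
struct toy_ncentry {
	struct toy_ncentry *next;   /* next entry naming the same vnode */
	struct toy_vnode *dvp;      /* parent directory */
	char name[32];              /* leaf name, nul-terminated */
};

struct toy_vnode {
	struct toy_ncentry *names;  /* per-vnode list of names */
};

/*
 * Add (dvp, name) -> vp unless the same pair is already on vp's list.
 * Returns true if a new entry was linked in, false otherwise.
 */
static bool
toy_cache_enter(struct toy_vnode *dvp, struct toy_vnode *vp, const char *name)
{
	struct toy_ncentry *ncp;

	assert(dvp != NULL && vp != NULL && name != NULL);
	if (strlen(name) >= sizeof(ncp->name))
		return (false);

	/* Walk the (normally very short) per-vnode list looking for a duplicate. */
	for (ncp = vp->names; ncp != NULL; ncp = ncp->next)
		if (ncp->dvp == dvp && strcmp(ncp->name, name) == 0)
			return (false);

	ncp = calloc(1, sizeof(*ncp));
	if (ncp == NULL)
		return (false);
	ncp->dvp = dvp;
	strcpy(ncp->name, name);
	ncp->next = vp->names;
	vp->names = ncp;
	return (true);
}
```

Entering the same directory and name twice leaves a single entry, while a
second hard link (different name or different parent) is still recorded.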


# cbd3ba3e 16-Aug-2008 Alfred Perlstein <alfred@FreeBSD.org>

Prevent crashes due to unlocked access to hash buckets in two sysctls.
Use CACHE_LOCK to prevent crashes.

Sysctls fixed: debug.hashstat.nchash and debug.hashstat.rawnchash.

Obtained from: Juniper Networks
MFC After: 1 week


# dfc714fb 31-Jul-2008 Christian S.J. Peron <csjp@FreeBSD.org>

Currently, BSM audit pathname token generation for chrooted or jailed
processes is not producing absolute pathname tokens. It is required
that audited pathnames are generated relative to the global root mount
point. This modification changes our implementation of audit_canon_path(9)
and introduces a new function: vn_fullpath_global(9) which performs a
vnode -> pathname translation relative to the global mount point based
on the contents of the name cache. Much like vn_fullpath,
vn_fullpath_global is a wrapper function which calls vn_fullpath1.

Further, the string parsing routines have been converted to use the
sbuf(9) framework. This change also removes the conditional acquisition
of Giant, since the vn_fullpath1 method will not dip into file system
dependent code.

The vnode locking was modified to use vhold()/vdrop() instead of vref()
and vrele(). This will modify the hold count instead of modifying the
user count. This makes more sense since it's the kernel that requires
the reference to the vnode. This also makes sure that the vnode does not
get recycled while we hold the reference to it. [1]

Discussed with: rwatson
Reviewed by: kib [1]
MFC after: 2 weeks


# b03d7207 09-Apr-2008 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Use LK_TYPE_MASK where needed. Actually after sys/sys/lockmgr.h:1.69 it is
no longer needed, but for now we still want to be consistent with other
similar checks in the tree.
- Call ASSERT_VOP_ELOCKED() only when vget() returns 0.

Reviewed by: jeff


# 0a3af16a 31-Mar-2008 Konstantin Belousov <kib@FreeBSD.org>

Add the utility function vn_commname() to retrieve the command name
from the vfs namecache, when available.

Reviewed by: rwatson, rdivacky
Tested by: pho


# 237fdd78 16-Mar-2008 Robert Watson <rwatson@FreeBSD.org>

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink


# 81c794f9 25-Feb-2008 Attilio Rao <attilio@FreeBSD.org>

Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>


# 22db15c0 13-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>


# cb05b60a 09-Jan-2008 Attilio Rao <attilio@FreeBSD.org>

vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modification makes the code cleaner and in
particular removes an annoying dependence helping next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>


# e6d64a0f 22-Nov-2007 Kris Kennaway <kris@FreeBSD.org>

Remove remaining Giant acquisition around vn_fullpath1. This was missed
in r1.106 and has not been required for some years now.

Reviewed by: jeff
MFC After: 1 week


# b4d7e298 21-Sep-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

Fix some locking cases where we ask for exclusively locked vnode, but we get
shared locked vnode instead when vfs.lookup_shared is set to 1.

Discussed with: kib, kris
Tested by: kris
Approved by: re (kensmith)


# dfe97ff4 18-Jun-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

We only flush entries related to the given file system. Currently there are
no 'invalid' cache entries - file system is responsible for keeping it that
way. The comment should have been updated in rev.1.25.


# 6e042171 25-May-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

To avoid a deadlock when handling .. directory during a lookup, we unlock
parent vnode and relock it after locking child vnode. The problem was that
we always relock it exclusively, even when it was share-locked.

Discussed with: jeff


# b4c85af9 25-May-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

We no longer need to put namecache entries onto temporary mplist.
It was useful in revision 1.86, but should have been removed in 1.89.


# 950afe99 25-May-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

The cache_leaf_test() function seems to be unused, so remove it.


# f013ccb7 22-May-2007 Pawel Jakub Dawidek <pjd@FreeBSD.org>

- Remove redundant initialization.
- Compare pointer with NULL.


# 5e3f7694 04-Apr-2007 Robert Watson <rwatson@FreeBSD.org>

Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are important for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff


# 873fbcd7 05-Mar-2007 Robert Watson <rwatson@FreeBSD.org>

Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.


# 4f0840f3 15-Jun-2006 Christian S.J. Peron <csjp@FreeBSD.org>

Axe Giant from vn_fullpath(9). The vnode -> pathname lookup should be
filesystem agnostic. We are not touching any file system specific functions
in this code path. Since we have a cache lock, there is really no need to
keep Giant around here.

This eliminates Giant acquisitions for any syscall which is auditing pathnames.

Discussed with: jeff


# e98b5a89 16-Apr-2006 John-Mark Gurney <jmg@FreeBSD.org>

remove duplicate sizeof vnode entry (debug.sizeof.vnode already existed)...
move ncsize into debug.sizeof and rename to namecache...


# 2f0bca55 06-Feb-2006 Jeff Roberson <jeff@FreeBSD.org>

- Don't check v_mount for NULL to determine if a vnode has been recycled.
Use the more appropriate VI_DOOMED flag instead.

Sponsored by: Isilon Systems, Inc.
MFC After: 1 week


# 32b6dcd8 16-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Fix a leaked reference to a vnode via v_dd. We rely on cache_purge() and
cache_zap() to clear the v_dd pointers when a directory vnode is forcibly
discarded. For this to work, all vnodes with v_dd pointers to a directory
must also have name cache entries linked via v_cache_dst to that dvp
otherwise we could not find them at cache_purge() time. The following
code snippet could break this guarantee by unlinking a directory before
fetching its dotdot. The dotdot lookup would initialize the v_dd field
of the unlinked directory which could never be cleared. To fix this
we don't initialize v_dd for orphaned vnodes.
printf("rmdir: %d\n", rmdir("../foo")); /* foo is cwd */
printf("chdir: %d\n", chdir(".."));
printf("%s\n", getwd(NULL));

Sponsored by: Isilon Systems, Inc.
Discovered by: kkenn
Approved by: re (blanket vfs)


# 6bd8103d 12-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Clear v_dd in cache_zap() instead of cache_purge() as cache_purge() may
not be called in all cases where we free the cnp.

Sponsored by: Isilon Systems, Inc.


# eff2d126 12-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Add KTR_VFS messages for various name cache related events.

Sponsored by: Isilon Systems, Inc.


# 1b2da2d0 11-Jun-2005 Jeff Roberson <jeff@FreeBSD.org>

- Assert that we're not adding a doomed vnode to the name cache.

Sponsored by: Isilon Systems, Inc.


# 4585e3ac 13-Apr-2005 Jeff Roberson <jeff@FreeBSD.org>

- Change all filesystems and vfs_cache to relock the dvp once the child is
locked in the ISDOTDOT case. See vfs_lookup.c r1.79 for details.

Sponsored by: Isilon Systems, Inc.


# 7ce7f713 29-Mar-2005 David Schultz <das@FreeBSD.org>

Eliminate v_id and v_ddid. The name cache now holds references to
vnodes whose names it caches, so we no longer need a `generation
number' to tell us if a referenced vnode is invalid. Replace the use
of the parent's v_id in the hash function with the address of the
parent vnode.

Tested by: Peter Holm
Glanced at by: jeff, phk


# dd33f0d9 29-Mar-2005 David Schultz <das@FreeBSD.org>

Merge kern___cwd() and vn_fullpath(), which were virtually identical,
except for places where people forget to update one of them. We now
collect only one set of stats for both of these routines. Other
changes in this commit include:

- Start acquiring Giant again in vn_fullpath(), since it is required
when crossing a mount point.

- Expand the scope of the cache lock to avoid dropping it and
picking it up again for every pathname component. This also
makes it trivial to avoid races in stats collection.

- Assert that nc_dvp == v_dd for directories instead of returning
an error to userland when this is not true. AFAIK, it should
always be true when v_dd is non-null.

- For vn_fullpath(), handle the first (non-directory) vnode
separately.

Glanced at by: jeff, phk


# 5280e61f 28-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Move the logic that locks and refs the new vnode from vfs_cache_lookup()
to cache_lookup(). This allows us to acquire the vnode interlock before
dropping the cache lock. This protects the vnode's identity until we
have locked it.

Sponsored by: Isilon Systems, Inc.


# 571211c4 29-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Get rid of the old LOOKUP_SHARED code. namei() now supplies the
proper lock flags via cn_lkflag.

Sponsored by: Isilon Systems, Inc.


# b75719af 29-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Invalidate the children's v_dd pointers when we cache_purge() a directory.
Otherwise the stale pointer may be accessed after a vnode is freed.

Sponsored by: Isilon Systems, Inc.


# f7b404d8 28-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- Remove an unused variable.

Sponsored by: Isilon Systems, Inc.


# ee5a0a2d 28-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- We no longer have to bother with PDIRUNLOCK, lookup() handles it for us.

Sponsored by: Isilon Systems, Inc.


# fdd6a3ff 23-Mar-2005 Jeff Roberson <jeff@FreeBSD.org>

- All of the bugs which lead to the complication of the LOOKUP_SHARED
config option have now been fixed. All filesystems are properly locked
and checked via DEBUG_VFS_LOCKS. Remove the workaround code.

Sponsored by: Isilon Systems, Inc.


# 2adc2b87 09-Feb-2005 Poul-Henning Kamp <phk@FreeBSD.org>

Make a SYSCTL_NODE and a mutex static


# 799cc2dc 24-Jan-2005 Jeff Roberson <jeff@FreeBSD.org>

- Simplify the cache locking. The lock order relationship with the
vnode lock is much simpler than I originally thought it would be.
Now, the cache lock is always acquired before the vnode lock.
- Provide some gotos in __getcwd() to simplify the unlocking a bit.
- Move Giant acquisition down into __getcwd().

Sponsored By: Isilon Systems, Inc.


# 9454b2d8 06-Jan-2005 Warner Losh <imp@FreeBSD.org>

/* -> /*- for copyright notices, minor format tweaks as necessary


# 7f8a436f 05-Apr-2004 Warner Losh <imp@FreeBSD.org>

Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core


# 98d7d155 05-Oct-2003 Jeff Roberson <jeff@FreeBSD.org>

- Apply a big giant lock around the namecache. This has been sitting in
my tree since BSDcon.


# c2935410 13-Jun-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

Make the VFS cache use zones instead of malloc(9). This results in a
small but noticeable increase in performance for name lookup operations.

The code uses two zones, one for short names (less than 32 characters)
and one for long names (up to NAME_MAX). Since most file names are
fairly short, this saves a considerable amount of space that would
otherwise be wasted if we always allocated NAME_MAX bytes. The cutoff
value of 32 characters was picked arbitrarily and may benefit from some
tweaking; it could also be made into a tunable.

Submitted by: hmp
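
The two-size-class idea described above can be sketched in userland C. This
is an assumption-laden illustration rather than the committed code:
toy_entry_alloc() and struct toy_entry are hypothetical, malloc() stands in
for the zone allocator, and 255 is assumed for NAME_MAX. Only the 32-character
cutoff comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_SHORT_MAX 32    /* cutoff named in the commit message */
#define TOY_NAME_MAX 255    /* assumed value of NAME_MAX */

/* Hypothetical entry; the name is stored by length, not nul-terminated. */
struct toy_entry {
	size_t nlen;        /* length of the cached name */
	int is_short;       /* 1 if taken from the short size class */
	char name[];        /* TOY_SHORT_MAX or TOY_NAME_MAX bytes of storage */
};

/*
 * Allocate an entry from one of two fixed size classes, chosen by name
 * length.  Returns NULL if the name is too long or allocation fails.
 */
static struct toy_entry *
toy_entry_alloc(const char *name)
{
	struct toy_entry *e;
	size_t len, cap;

	assert(name != NULL);
	len = strlen(name);
	if (len > TOY_NAME_MAX)
		return (NULL);

	/* Names shorter than the cutoff use the small class. */
	cap = (len < TOY_SHORT_MAX) ? TOY_SHORT_MAX : TOY_NAME_MAX;
	e = malloc(sizeof(*e) + cap);
	if (e == NULL)
		return (NULL);
	e->nlen = len;
	e->is_short = (len < TOY_SHORT_MAX);
	memcpy(e->name, name, len);
	return (e);
}
```

Fixed size classes are what make a zone-style allocator applicable; the
trade-off is some wasted space inside each class, which is why the cutoff
value matters.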


# ffe92432 11-Jun-2003 Dag-Erling Smørgrav <des@FreeBSD.org>

Whitespace cleanup.


# 677b542e 10-Jun-2003 David E. O'Brien <obrien@FreeBSD.org>

Use __FBSDID().


# cc34e37e 20-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

Backout the getcwd changes, a more comprehensive effort will be needed.


# 9eaf5abc 16-Mar-2003 Poul-Henning Kamp <phk@FreeBSD.org>

(This commit certainly increases the need for a wash&clean of vfs_cache.c,
but I decided that it was important for this patch to not bit-rot, and
since it is mainly moving code around, the total amount of entropy is
epsilon /phk)

This is a patch to move the common parts of linux_getcwd() back into
kern/vfs_cache.c so that the standard FreeBSD libc getcwd() can use its
extended functionality. The linux syscall linux_getcwd() in
compat/linux/linux_getcwd.c has been rewritten to use it too. It should
be possible to simplify libc's getcwd() after this. No doubt this code
needs some cleaning up, since I've left in the sysctl variables I used
for debugging.

PR: 48169
Submitted by: James Whitwell <abacau@yahoo.com.au>


# a163d034 18-Feb-2003 Warner Losh <imp@FreeBSD.org>

Back out M_* changes, per decision of the TRB.

Approved by: trb


# 1f5a94d5 15-Feb-2003 Andrew R. Reiter <arr@FreeBSD.org>

- Update a couple of comments to make sense with what today's code is
doing (stale comments make arr something something ;)).


# da8f0c84 15-Feb-2003 Andrew R. Reiter <arr@FreeBSD.org>

- Remove old comment for PURGE() as it no longer exists and implied it
was a comment to cache_zap().
- Add a comment to quickly state what cache_zap() does.

Reviewed by: phk, mux


# 44956c98 21-Jan-2003 Alfred Perlstein <alfred@FreeBSD.org>

Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.


# 48b52b7a 02-Sep-2002 Ian Dowse <iedowse@FreeBSD.org>

Split up __getcwd so that kernel callers of the internal version
can specify whether the buffer is in user or system space.


# 18c6acee 05-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Move a VOP assert to the right place.

Spotted by: i386 tinderbox


# e6e370a7 04-Aug-2002 Jeff Roberson <jeff@FreeBSD.org>

- Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.

Idea stolen from: BSD/OS


# 210a5a71 28-Jun-2002 Alfred Perlstein <alfred@FreeBSD.org>

nuke caddr_t.


# 0e2d6cc8 14-May-2002 Jeff Roberson <jeff@FreeBSD.org>

Disable the shared locking namei() code for now. It breaks several stacking
filesystems. This is on hold until the rest of VFS Locking is reviewed and
deemed safe. It can be enabled with 'options LOOKUP_SHARED'.


# a59f8b9e 08-Apr-2002 Jeff Roberson <jeff@FreeBSD.org>

Turn #ifdef LOOKUP_SHARED into #ifndef LOOKUP_EXCLUSIVE to enable this
behavior by default. Also, change the options line to reflect this.

If there are no problems reported this will become the only behavior and the
knob will be removed in a month or so.

Demanded by: obrien


# cf4ce70b 07-Apr-2002 David Malone <dwmalone@FreeBSD.org>

Remove a comment which relates to the old name cache code, which
was replaced in 1997.

Approved by: phk


# 4d77a549 19-Mar-2002 Alfred Perlstein <alfred@FreeBSD.org>

Remove __P.


# 8de00f4a 11-Mar-2002 Jeff Roberson <jeff@FreeBSD.org>

This patch adds the "LOCKSHARED" option to namei which causes it to only acquire shared locks on leafs.
The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in
these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also,
this reduces the number of exclusive locks that are floating around in the system, which helps reduce the
number of deadlocks that occur.

A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for
testing, and should eventually go away once it is proven to be stable. I have personally been running this
patch for over a year now, so it is believed to be fully stable.

Reviewed by: jake, obrien
Approved by: jake


# eb8e6d52 05-Mar-2002 Eivind Eklund <eivind@FreeBSD.org>

Document all functions, global and static variables, and sysctls.
Includes some minor whitespace changes, and re-ordering to be able to document
properly (e.g., grouping of variables and the SYSCTL macro calls for them, where
the documentation has been added.)

Reviewed by: phk (but all errors are mine)


# 362912eb 17-Feb-2002 Poul-Henning Kamp <phk@FreeBSD.org>

Remove cache_purgeleafdirs(), it has been #if 0 for quite some time.


# 9e209b12 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

Include sys/_lock.h and sys/_mutex.h to reduce namespace pollution.

Requested by: jhb


# 426da3bc 13-Jan-2002 Alfred Perlstein <alfred@FreeBSD.org>

SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to be locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.


# 45fb069a 21-Oct-2001 Dag-Erling Smørgrav <des@FreeBSD.org>

Convert textvp_fullpath() into the more generic vn_fullpath() which takes a
struct thread * and a struct vnode * instead of a struct proc *.

Temporarily add a textvp_fullpath macro for compatibility.


# b5810bab 30-Sep-2001 Matthew Dillon <dillon@FreeBSD.org>

After extensive testing it has been determined that adding complexity
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.

I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.

MFC after: 3 days


# b40ce416 12-Sep-2001 Julian Elischer <julian@FreeBSD.org>

KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha


# 7476f7e8 04-Sep-2001 Ian Dowse <iedowse@FreeBSD.org>

Fix a memory leak in __getcwd() that can occur after a filesystem
has been forcibly unmounted. If the filesystem root vnode is reached
and it has no associated mountpoint (vp->v_mount == NULL), __getcwd
would return without freeing 'buf'. Add the missing free() call.

PR: kern/30306
Submitted by: Mike Potanin <potanin@mccme.ru>
MFC after: 1 week


# fb919e4d 01-May-2001 Mark Murray <markm@FreeBSD.org>

Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)


# 60fb0ce3 28-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Revert consequences of changes to mount.h, part 2.

Requested by: bde


# d98dc34f 23-Apr-2001 Greg Lehey <grog@FreeBSD.org>

Correct #includes to work with fixed sys/mount.h.


# 759cb263 18-Apr-2001 Seigo Tanimura <tanimura@FreeBSD.org>

Reclaim directory vnodes held in namecache if few free vnodes are
available.

Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.

The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).

Suggested by: tegge
Approved by: phk


# 9d10eb0c 10-Apr-2001 Peter Wemm <peter@FreeBSD.org>

Create debug.hashstat.[raw]nchash and debug.hashstat.[raw]nfsnode to
enable easy access to the hash chain stats. The raw prefixed versions
dump an integer array to userland with the chain lengths. This cheats
and calls it an array of 'struct int' rather than 'int'; otherwise sysctl -a
would faithfully dump out the 128K array on an average machine. The non-raw
versions return 4 integers: count, number of chains used, maximum chain
length, and percentage utilization (fixed point, multiplied by 100).
The raw forms are more useful for analyzing the hash distribution, while
the other form can be read easily by humans and stats loggers.
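
As an illustration only, the four summary integers described above could be
derived from an array of chain lengths roughly as follows. This is a hedged
sketch, not the committed sysctl handler: struct hash_summary,
summarize_chains() and the exact scaling of the percentage are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary record; the field names are illustrative only. */
struct hash_summary {
	int count;	/* total number of chains (buckets) */
	int used;	/* chains holding at least one entry */
	int maxlength;	/* length of the longest chain */
	int pct;	/* used/count as a percentage, multiplied by 100 */
};

static struct hash_summary
summarize_chains(const int *chain_len, size_t n_chains)
{
	struct hash_summary s = { 0, 0, 0, 0 };
	size_t i;

	assert(chain_len != NULL || n_chains == 0);
	s.count = (int)n_chains;
	for (i = 0; i < n_chains; i++) {
		if (chain_len[i] > 0)
			s.used++;
		if (chain_len[i] > s.maxlength)
			s.maxlength = chain_len[i];
	}
	/* Fixed point: 37.50% is reported as 3750. */
	if (n_chains > 0)
		s.pct = (int)((long)s.used * 100 * 100 / (long)n_chains);
	return (s);
}
```

The raw form corresponds to the chain_len array itself, while the summary is
what a human or a stats logger would normally read.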


# 439fea92 19-Mar-2001 Peter Wemm <peter@FreeBSD.org>

Use the same API as the example code.
Allow the initial hash value to be passed in, as the examples do.
Incrementally hash in the dvp->v_id (using the official api) rather than
add it. This seems to help power-of-two predictable filename trees
where the filenames repeat on a power-of-two cycle and the directory trees
have power-of-two components in them. The simple add then mask was causing
things like 12000+ entry collision chains while most other entries have
between 0 and 3 entries each. This way seems to improve things.
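
A minimal sketch of the difference between adding the directory id and
hashing it in incrementally, assuming a 32-bit FNV-1 style buffer hash that
accepts an initial value. The names fnv32_buf(), bucket_add() and
bucket_chain(), the constants, and the use of a plain integer in place of
dvp->v_id are all assumptions for illustration; this is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV32_PRIME	0x01000193u	/* commonly published FNV prime */
#define FNV32_INIT	0x811c9dc5u	/* commonly published offset basis */

/* FNV-1 over a buffer, continuing from the caller-supplied hash value. */
static uint32_t
fnv32_buf(const void *buf, size_t len, uint32_t hval)
{
	const unsigned char *p = buf;

	assert(buf != NULL || len == 0);
	while (len-- > 0) {
		hval *= FNV32_PRIME;
		hval ^= *p++;
	}
	return (hval);
}

/* Add then mask: ids that differ by a multiple of the table size collide. */
static uint32_t
bucket_add(const char *name, size_t len, uint32_t dir_id, uint32_t mask)
{
	return ((fnv32_buf(name, len, FNV32_INIT) + dir_id) & mask);
}

/* Incremental: feed the id bytes through the same hash before masking. */
static uint32_t
bucket_chain(const char *name, size_t len, uint32_t dir_id, uint32_t mask)
{
	uint32_t h;

	h = fnv32_buf(name, len, FNV32_INIT);
	h = fnv32_buf(&dir_id, sizeof(dir_id), h);
	return (h & mask);
}
```

With a 256-entry table (mask 0xff), the same name under directory ids 0 and
0x100 always lands in one bucket with bucket_add(), whereas bucket_chain()
mixes the id into every bit of the result before the mask is applied.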


# 6eb39ac8 17-Mar-2001 Peter Wemm <peter@FreeBSD.org>

Use a generic implementation of the Fowler/Noll/Vo hash (FNV hash).
Make the name cache hash as well as the nfsnode hash use it.

As a special tweak, create an unsigned version of register_t. This allows
us to use a special tweak for the 64 bit versions that significantly
speeds up the i386 version (ie: int64 XOR int64 is slower than int64
XOR int32).

The code layout is a little strange for the string function, but I was
able to get between 5 to 10% improvement over the original version I
started with. The layout affects gcc code generation choices and this way
was fastest on x86 and alpha.

Note that 'CPUTYPE=p3' etc makes a fair difference to this. It is
around 45% faster with -march=pentiumpro on a p6 cpu.
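
For readers unfamiliar with the hash, a minimal 32-bit FNV-1 string hash
looks roughly like the sketch below. The function name fnv32_str() and the
constants (the commonly published 32-bit FNV parameters) are assumptions
chosen for the example and are not taken from this commit; the register_t
tweak and the layout tuning described above are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FNV32_PRIME	0x01000193u	/* commonly published FNV prime */
#define FNV32_INIT	0x811c9dc5u	/* commonly published offset basis */

/*
 * FNV-1 over a NUL-terminated string: multiply by the prime, then XOR in
 * the next byte. Passing the previous result as hval continues a hash.
 */
static uint32_t
fnv32_str(const char *str, uint32_t hval)
{
	const unsigned char *s = (const unsigned char *)str;

	assert(str != NULL);
	while (*s != '\0') {
		hval *= FNV32_PRIME;
		hval ^= *s++;
	}
	return (hval);
}
```

Hashing the empty string returns the initial value unchanged, which is what
makes chaining several inputs through one running hash straightforward.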


# 959b7375 08-Dec-2000 Poul-Henning Kamp <phk@FreeBSD.org>

Staticize some malloc M_ instances.


# 138e514c 06-Dec-2000 Peter Wemm <peter@FreeBSD.org>

Untangle vfsinit() a bit. Use separate sysinit functions rather than
having a super-function calling bits all over the place.


# aa542997 19-Nov-2000 Robert Watson <rwatson@FreeBSD.org>

o Export nchstats ("VFS cache effectiveness statistics") using
SYSCTL_OPAQUE. This removes a reason that systat requires
setgid kmem. More to come.


# 3ff1a2f4 17-Sep-2000 Boris Popov <bp@FreeBSD.org>

Add new flag PDIRUNLOCK to the component.cn_flags which should be set by
filesystem lookup() routine if it unlocks parent directory. This flag should
be carefully tracked by filesystems if they want to work properly with nullfs
and other stacked filesystems.

VFS takes advantage of this flag to perform semantically correct usage
of vrele() instead of vput() if the parent directory is already unlocked.

If a filesystem fails to track this flag then the previous codepath in VFS
is left unchanged.

Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystems will
be changed after some period of testing.

Reviewed in general by: mckusick, dillon, adrian
Obtained from: NetBSD


# 67b23794 09-Sep-2000 Boris Popov <bp@FreeBSD.org>

Change variable naming to be consistent with the rest of VFS code.
Reduce number of indirections by using already fetched values.


# 9701cd40 05-Jul-2000 John Baldwin <jhb@FreeBSD.org>

Support for unsigned integer and long sysctl variables. Update the
SYSCTL_LONG macro to be consistent with other integer sysctl variables
and require an initial value instead of assuming 0. Update several
sysctl variables to use the unsigned types.

PR: 15251
Submitted by: Kelly Yancey <kbyanc@posi.net>


# e3975643 25-May-2000 Jake Burkholder <jake@FreeBSD.org>

Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others


# 740a1973 23-May-2000 Jake Burkholder <jake@FreeBSD.org>

Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd


# b7db1901 26-Apr-2000 Brian Feldman <green@FreeBSD.org>

Move procfs_fullpath() to vfs_cache.c, with a rename to textvp_fullpath().
There's no excuse to have code in synthetic filestores that allows direct
references to the textvp anymore.

Feature requested by: msmith
Feature agreed to by: warner
Move requested by: phk
Move agreed to by: bde


# 8a2852b1 21-Apr-2000 Brian Feldman <green@FreeBSD.org>

Move the declaration of "struct namecache" to vnode.h, as it can be useful
elsewhere. Note, of course, that in an ideal world nothing should need
to see our VFS implementation :-/


# 194a0b6c 13-Feb-2000 Peter Wemm <peter@FreeBSD.org>

Avoid a panic in __getcwd(2) when combined with umount -f.


# 3b6fb885 02-Oct-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Before we start to mess with the VFS name-cache clean things up a little bit:
Isolate the namecache in its own file, and give it a dedicated malloc type.


# c3aac50f 27-Aug-1999 Peter Wemm <peter@FreeBSD.org>

$Id$ -> $FreeBSD$


# 22f054e2 24-Apr-1999 Poul-Henning Kamp <phk@FreeBSD.org>

Fix a braino in the v_id wraparound code. Give more (current) details
in comment.

PR: 11307
Spotted by: Ville-Pertti Keinonen <will@iki.fi>


# 355a2610 09-Sep-1998 Bruce Evans <bde@FreeBSD.org>

Don't use CTL_VFS at the wrong level. This caused loops in the sysctl
tree if CTL_VFS happened to get assigned as a type number to a vfs that
has some vfs sysctls.


# 1aa9ea7c 19-Dec-1997 Bruce Evans <bde@FreeBSD.org>

Removed some bogus casts.


# 4a11ca4e 07-Nov-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Remove a bunch of variables which were unused both in GENERIC and LINT.

Found by: -Wunused


# cec0f20c 16-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

VFS mega cleanup commit (x/N)

1. Add new file "sys/kern/vfs_default.c" where default actions for
VOPs go. Implement proper defaults for ABORTOP, BWRITE, LEASE,
POLL, REVOKE and STRATEGY. Various stuff spread over the entire
tree belongs here.

2. Change VOP_BLKATOFF to a normal function in cd9660.

3. Kill VOP_BLKATOFF, VOP_TRUNCATE, VOP_VFREE, VOP_VALLOC. These
are private interface functions between UFS and the underlying
storage manager layer (FFS/LFS/MFS/EXT2FS). The functions now
live in struct ufsmount instead.

4. Remove a kludge of VOP_ functions in all filesystems, that did
nothing but obscure the simplicity and break the expandability.
If a filesystem doesn't implement VOP_FOO, it shouldn't have an
entry for it in its vnops table. The system will try to DTRT
if it is not implemented. There is still some cruft left, but
the bulk of it is done.

5. Fix another VCALL in vfs_cache.c (thanks Bruce!)


# 138ec1f7 15-Oct-1997 Poul-Henning Kamp <phk@FreeBSD.org>

vnops megacommit

1. Use the default function to access all the specfs operations.
2. Use the default function to access all the fifofs operations.
3. Use the default function to access all the ufs operations.
4. Fix VCALL usage in vfs_cache.c
5. Use VOCALL to access specfs functions in devfs_vnops.c
6. Staticize most of the spec and fifofs vnops functions.
7. Make UFS panic if it lacks bits of the underlying storage handling.


# 46c320ba 24-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Add one more counter so we can truly find out how good our name cache
is. If we don't find something and don't want to have found something,
it's actually a success.


# 00544193 24-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

A couple of handles to tweak, more statistics.


# 4d1122bd 04-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Revert to the previous hashing, double the hashtable size instead.


# 119b6f4c 03-Sep-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Use 2^N hash sizes rather than prime sizes; this replaces a division
with an AND. (Submitted by davidg)
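
A small sketch of why a power-of-two table size allows the division to be
dropped. The helper names bucket_mod() and bucket_mask() are hypothetical
and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* General case: any table size, at the cost of a division. */
static uint32_t
bucket_mod(uint32_t hash, uint32_t nbuckets)
{
	assert(nbuckets != 0);
	return (hash % nbuckets);
}

/* Power-of-two case: the remainder is just the low bits of the hash. */
static uint32_t
bucket_mask(uint32_t hash, uint32_t nbuckets)
{
	/* Only valid when nbuckets is a power of two. */
	assert(nbuckets != 0 && (nbuckets & (nbuckets - 1)) == 0);
	return (hash & (nbuckets - 1));
}
```

For a power-of-two nbuckets both helpers return the same index; the masked
form simply avoids the divide instruction.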

Preemptively record ".." values.

Reviewed by: phk


# e4ba6a82 02-Sep-1997 Bruce Evans <bde@FreeBSD.org>

Removed unused #includes.


# a051452a 31-Aug-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Change the 0xdeadb hack to a flag called VDOOMED.
Introduce VFREE which indicates that vnode is on freelist.
Rename vholdrele() to vdrop().
Create vfree() and vbusy() to add/delete vnode from freelist.
Add vfree()/vbusy() to keep (v_holdcnt != 0 || v_usecount != 0)
vnodes off the freelist.
Generalize vhold()/v_holdcnt to mean "do not recycle".
Fix reassignbuf()s lack of use of vhold().
Use vhold() instead of checking v_cache_src list.
Remove vtouch(), the vnodes are always vget'ed soon enough
after for it to have any measurable effect.
Add sysctl debug.freevnodes to keep track of things.
Move cache_purge() up in getnewvnodes to avoid race.
Decrement v_usecount after VOP_INACTIVE(), put a vhold() on
it during VOP_INACTIVE()
Unmacroize vhold()/vdrop()
Print out VDOOMED and VFREE flags (XXX: should use %b)

Reviewed by: dyson


# 0fa2443f 26-Aug-1997 Poul-Henning Kamp <phk@FreeBSD.org>

Uncut&paste cache_lookup().

This unifies several copies of what are, in theory, identical 50 lines of code.

The filesystems have a new method: vop_cachedlookup, which is the
meat of the lookup, and use vfs_cache_lookup() for their vop_lookup
method. vfs_cache_lookup() will check the namecache and pass on
to the vop_cachedlookup method in case of a miss.

It's still the task of the individual filesystems to populate the
namecache with cache_enter().

Filesystems that do not use the namecache will just provide the
vop_lookup method as usual.
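
The division of labour described above can be sketched in userland C as
follows. Everything here is a toy stand-in under stated assumptions: struct
toy_fs, the single-slot-per-hash cache and all toy_* functions are
hypothetical and do not mirror the real vnode operation vectors or the
cache_lookup()/cache_enter() signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS	8
#define NAME_MAX_LEN	32

struct toy_cache_entry {
	char name[NAME_MAX_LEN];
	int node;
	int valid;
};

struct toy_fs {
	/* Filesystem-specific "meat" of the lookup (the slow path). */
	int (*cachedlookup)(struct toy_fs *fs, const char *name, int *nodep);
	struct toy_cache_entry cache[CACHE_SLOTS];
	int slow_calls;
};

static unsigned
toy_slot(const char *name)
{
	unsigned h = 0;

	while (*name != '\0')
		h = h * 31 + (unsigned char)*name++;
	return (h % CACHE_SLOTS);
}

/* Returns 1 on a cache hit and fills *nodep, 0 on a miss. */
static int
toy_cache_lookup(struct toy_fs *fs, const char *name, int *nodep)
{
	struct toy_cache_entry *e = &fs->cache[toy_slot(name)];

	if (e->valid && strcmp(e->name, name) == 0) {
		*nodep = e->node;
		return (1);
	}
	return (0);
}

static void
toy_cache_enter(struct toy_fs *fs, const char *name, int node)
{
	struct toy_cache_entry *e = &fs->cache[toy_slot(name)];

	if (strlen(name) >= sizeof(e->name))
		return;		/* too long for this toy cache */
	strcpy(e->name, name);
	e->node = node;
	e->valid = 1;
}

/* Generic wrapper: consult the cache, fall through to the fs on a miss. */
static int
toy_vfs_cache_lookup(struct toy_fs *fs, const char *name, int *nodep)
{
	assert(fs != NULL && fs->cachedlookup != NULL);
	if (toy_cache_lookup(fs, name, nodep))
		return (0);
	return (fs->cachedlookup(fs, name, nodep));
}

/* Example slow path; populating the cache remains the filesystem's job. */
static int
toy_slow_lookup(struct toy_fs *fs, const char *name, int *nodep)
{
	fs->slow_calls++;
	*nodep = (int)strlen(name);	/* stand-in for a directory scan */
	toy_cache_enter(fs, name, *nodep);
	return (0);
}
```

In this sketch a second lookup of the same name is served by the wrapper
without invoking the filesystem's slow path again.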


# 2401f27c 04-Aug-1997 Poul-Henning Kamp <phk@FreeBSD.org>

remove unused MAXVNODEUSE macro.


# b15a966e 04-May-1997 Poul-Henning Kamp <phk@FreeBSD.org>

1. Add a {pointer, v_id} pair to the vnode to store the reference to the
".." vnode. This is cheaper storagewise than keeping it in the
namecache, and it makes more sense since it's a 1:1 mapping.

2. Also handle the case of "." more intelligently rather than stuff
the namecache with pointless entries.

3. Add two lists to the vnode and hang namecache entries which go from
or to this vnode. When cleaning a vnode, delete all namecache
entries it invalidates.

4. Never reuse namecache entries, malloc new ones when we need it, free
old ones when they die. No longer a hard limit on how many we can
have.

5. Remove the upper limit on namelength of namecache entries.

6. Make a global list for negative namecache entries, limit their number
to a sysctl'able (debug.ncnegfactor) fraction of the total namecache.
Currently the default fraction is 1/16th. (Suggestions for better
default wanted!)

7. Assign v_id correctly in the face of 32bit rollover.

8. Remove the LRU list for namecache entries, not needed. Remove the
#ifdef NCH_STATISTICS stuff, it's not needed either.

9. Use the vnode freelist as a true LRU list, also for namecache accesses.

10. Reuse vnodes more aggressively but also more selectively, if we can't
reuse, malloc a new one. There is no longer a hard limit on their
number, they grow to the point where we don't reuse potentially
usable vnodes. A vnode will not get recycled if it still has pages in
core or if it is the source of namecache entries (Yes, this does
indeed work :-) "." and ".." are not namecache entries any longer...)

11. Do not overload the v_id field in namecache entries with whiteout
information, use a char sized flags field instead, so we can get
rid of the vpid and v_id fields from the namecache struct. Since
we're linked to the vnodes and purged when they're cleaned, we don't
have to check the v_id any more.

12. NFS knew about the limitation on name length in the namecache, it
shouldn't and doesn't now.
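
The limit on negative entries from item 6 can be pictured with the check
below. It is a sketch under assumptions: the helper name
too_many_negatives() and the exact form of the comparison are illustrative,
with ncnegfactor standing for the debug.ncnegfactor value (16 for the 1/16th
default mentioned above).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * True when negative entries exceed 1/ncnegfactor of the whole cache.
 * Written as a multiplication to avoid an integer division.
 */
static bool
too_many_negatives(unsigned long numneg, unsigned long numcache,
    unsigned long ncnegfactor)
{
	assert(ncnegfactor > 0);
	return (numneg * ncnegfactor > numcache);
}
```

With a factor of 16 and 100 cached entries in total, six negative entries
are tolerated and a seventh would trigger eviction of an old negative entry.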

Bugs:
The namecache statistics no longer include the hits for ".."
and ".".

Performance impact:
Generally in the +/- 0.5% for "normal" workstations, but
I hope this will allow the system to be selftuning over a
bigger range of "special" applications. The case where
RAM is available but unused for cache because we don't have
any vnodes should be gone.

Future work:
Straighten out the namecache statistics.

"desiredvnodes" is still used to (bogusly ?) size hash
tables in the filesystems.

I have still to find a way to safely free unused vnodes
back so their number can shrink when not needed.

There are a few uses of the v_id field left in the filesystems,
scheduled for demolition at a later time.

Maybe a one slot cache for unused namecache entries should
be implemented to decrease the malloc/free frequency.


# d8d6519c 08-Mar-1997 Bruce Evans <bde@FreeBSD.org>

Fixed the hash formula. Lite2 doesn't have phashinit(), so Lite2's hash
formula uses `& nchash'. This is very broken when nchash is a prime
number instead of 1 less than a power of 2, but the Lite2 formula was
merged in.
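
To see why `& nchash` misbehaves for a prime nchash, the sketch below counts
how many distinct bucket indices each formula can produce. The function
names, the MAX_TABLE bound and the sample sizes are assumptions made for the
illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TABLE	64

/* Distinct values of (h & nchash) for h in [0, limit). */
static unsigned
count_buckets_and(uint32_t nchash, uint32_t limit)
{
	unsigned char seen[MAX_TABLE + 1] = { 0 };
	unsigned n = 0;
	uint32_t h, b;

	assert(nchash <= MAX_TABLE);
	for (h = 0; h < limit; h++) {
		b = h & nchash;		/* may equal nchash itself */
		if (!seen[b]) {
			seen[b] = 1;
			n++;
		}
	}
	return (n);
}

/* Distinct values of (h % nchash) for h in [0, limit). */
static unsigned
count_buckets_mod(uint32_t nchash, uint32_t limit)
{
	unsigned char seen[MAX_TABLE + 1] = { 0 };
	unsigned n = 0;
	uint32_t h, b;

	assert(nchash > 0 && nchash <= MAX_TABLE);
	for (h = 0; h < limit; h++) {
		b = h % nchash;
		if (!seen[b]) {
			seen[b] = 1;
			n++;
		}
	}
	return (n);
}
```

For nchash = 11 (binary 1011) the AND form reaches only the eight indices
whose bits are a subset of 1011, including 11 itself, which lies outside an
11-entry table, while the remainder form reaches all eleven. For nchash = 15
(one less than a power of two) the AND form reaches all sixteen indices.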

Merged some cosmetic changes from Lite2, rev.1.21 and Lite1. The merge
was difficult because the Lite2 code is essentially ours (phk's) except
where Lite2 improved or broke it.

Summary of the Lite2 changes:
- in the copyright, phk's rights have been transferred to the Regents.
This change should be reviewed.
- nchENOENT went away; the "no" vnode is now simply 0.
- comments were improved.
- style was "improved".
- goto instead of Fanatism (sic) was considered bad :-).
- there are some small changes to support whiteouts.
- new cache entries are added in more cases. More work is required
near here to change the hash table size if kern.desiredvnodes is
changed using sysctl.
- rescanning of the hash bucket in cache_purgevfs() was removed. This
change should be reviewed.


# 6875d254 22-Feb-1997 Peter Wemm <peter@FreeBSD.org>

Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.


# 996c772f 09-Feb-1997 John Dyson <dyson@FreeBSD.org>

This is the kernel Lite/2 commit. There are some requisite userland
changes, so don't expect to be able to run the kernel as-is (very well)
without the appropriate Lite/2 userland changes.

The system boots and can mount UFS filesystems.

Untested: ext2fs, msdosfs, NFS
Known problems: Incorrect Berkeley ID strings in some files.
Mount_std mounts will not work until the getfsent
library routine is changed.

Reviewed by: various people
Submitted by: Jeffery Hsu <hsu@freebsd.org>


# 1130b656 14-Jan-1997 Jordan K. Hubbard <jkh@FreeBSD.org>

Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.


# edbfedac 11-Mar-1996 Peter Wemm <peter@FreeBSD.org>

Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]


# bd7e5f99 18-Jan-1996 John Dyson <dyson@FreeBSD.org>

Eliminated many redundant vm_map_lookup operations for vm_mmap.
Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish
overhead for merged cache.
Efficiency improvement for vfs_cluster. It used to do a lot of redundant
calls to cluster_rbuild.
Correct the ordering for vrele of .text and release of credentials.
Use the selective tlb update for 486/586/P6.
Numerous fixes to the size of objects allocated for files. Additionally,
fixes in the various pagers.
Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs.
Fixes in the swap pager for exhausted resources. The pageout code
will not as readily thrash.
Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into
page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE),
thereby improving efficiency of several routines.
Eliminate even more unnecessary vm_page_protect operations.
Significantly speed up process forks.
Make vm_object_page_clean more efficient, thereby eliminating the pause
that happens every 30 seconds.
Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the
case of filesystems mounted async.
Fix a panic with busy pages when write clustering is done for non-VMIO
buffers.


# 79c0c4b7 22-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

kern_conf.c: remove a now unused variable.
vfs_cache.c: Fix a very rare problem in the vnode-cache.
Submitted by: Terry Lambert <terry@lambert.org>


# f708ef1b 14-Dec-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another mega commit to staticize things.


# a98ca469 29-Oct-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Second batch of cleanup changes.
This time mostly making a lot of things static and some unused
variables here and there.


# 28f8db14 29-Jul-1995 Bruce Evans <bde@FreeBSD.org>

Eliminate sloppy common-style declarations. There should be none left for
the LINT configuration.


# 9b2e5354 30-May-1995 Rodney W. Grimes <rgrimes@FreeBSD.org>

Remove trailing whitespace.


# cf8ad510 14-Apr-1995 David Greenman <dg@FreeBSD.org>

Fixed serious off by one bug I introduced that will likely cause the
machine to panic whenever the name cache fills up.

Submitted by: John Dyson


# 22e53424 03-Apr-1995 David Greenman <dg@FreeBSD.org>

kern_subr.c:
Added a new type to uiomove - "UIO_NOCOPY" which causes it to update
pointers and counts, but doesn't do any data copying. This is needed
for upcoming changes to the way that the vnode pager does its page
outs.
Added a new hash init function call "phashinit" that allocates and
initializes a prime number sized hash table.

vfs_cache.c:
Changed hashing algorithm to use the remainder of dividing by a prime
number to improve the distribution characteristics. Uses new phashinit
function in kern_subr.c.


# d7e3d98a 19-Mar-1995 David Greenman <dg@FreeBSD.org>

Patch from Kirk McKusick to fix a bug introduced in Poul's vfs_cache
rewrite.


# 47f19694 11-Mar-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Update a couple of counters.


# 914e6eb7 10-Mar-1995 David Greenman <dg@FreeBSD.org>

Whoops, back out that last change - I misread what Poul had done there.


# dbd90d41 10-Mar-1995 David Greenman <dg@FreeBSD.org>

Don't thrash the name cache while trying to fill up the object cache.
(Make a new cache entry until desiredvnodes is reached).


# b2e10d6d 09-Mar-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Clean up and improve the namecache.

1. We always keep one 16th of the vnodes on the freelist, so that the
namecache doesn't get trashed. It used to be that it wasn't a problem, but
the only vnodes getting released these days are directories and things which
get forced out of the VM/cache. The latter are not numerous enough to keep
the pool of vnodes needed for the namecache sufficiently big.

2. Purge invalid entries in the namecache as soon as we notice them. This
avoids a stale entry pushing out a valid entry on the LRU list.

3. Speed up the lookup in the namecache by avoiding a special case branch.

4. Make the cache purge routines do the thing they're supposed to, and in
a decently efficient manner.

5. Make the size of the namecache follow the number of vnodes, so that we
can always point to all the vnodes we have in core.

6. Readability has gone way up.

7. Added a "options NCH_STATISTICS" feature that will gather more
detailed statistics on the performance of the namecache.

Reviewed by: davidg


# a0e8a1e2 07-Mar-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Another little optimization to the namei-cache.
If an entry is stale, ditch it.


# 2425396b 07-Mar-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Improve the quality of the hash used in the namei-cache.


# 30f467d8 05-Mar-1995 Poul-Henning Kamp <phk@FreeBSD.org>

Update vfs_cache.c to use the <sys/queue.h> macros. This makes it easier
to read, but doesn't change the speed.

Reviewed by: phk
Obtained from: via NetBSD


# 797f2d22 02-Oct-1994 Poul-Henning Kamp <phk@FreeBSD.org>

All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.


# 3c4dd356 02-Aug-1994 David Greenman <dg@FreeBSD.org>

Added $Id$


# 26f9a767 25-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman


# df8bae1d 24-May-1994 Rodney W. Grimes <rgrimes@FreeBSD.org>

BSD 4.4 Lite Kernel Sources